Self-Hosting AI the Right Way

Chapter 2

Prerequisites & Architecture

In this chapter
  • The Big Picture
  • Hardware Sizing
      • By VRAM Tier
      • VM Sizing
  • Sidebar: vLLM vs Ollama
  • Network Requirements
  • Software Prerequisites
  • Pre-Flight Checklist
  • What to Have Ready

What you’ll accomplish: Understand the full deployment architecture, size your hardware correctly, and confirm all prerequisites are in place before touching a terminal.

The Big Picture

Here’s what we’re building. Every component runs on a single Rocky Linux 9 host:

                        +--------------------+
                        |   Your Browser     |
                        +--------+-----------+
                                 |
                           HTTPS (443)
                                 |
                        +--------+-----------+
                        |   nginx            |
                        |   (SSL termination,|
                        |    WebSocket proxy,|
                        |   security headers)|
                        +--------+-----------+
                          /               \
                    /  (path)         /grafana/  \
                   /                               \
          HTTP (3000)                        HTTP (3001)
          loopback                           loopback
                   |                               |
        +----------+---------+         +-----------+---------+
        |   Open WebUI       |         | Grafana             |
        |   (Podman Quadlet) |         | (Podman Quadlet)    |
        |                    |         | - Pre-built         |
        |   - Chat UI        |         |   dashboard         |
        |   - User accounts  |         | - Provisioned       |
        |   - Chat history   |         |   datasource        |
        +--------+-----------+         +-----------+---------+
                 |                               |
          HTTP (11434, loopback)                 |
                 |                               |
         +--------+-----------+         +---------+--------+
         |   Ollama           |         | Prometheus       |
         |   (systemd service)|         | (Podman Quadlet) |
         |                    |         | Port 9090        |
         |   - Model loading  |         | - Ollama metrics |
         |   - Inference      |         | - GPU metrics    |
         |   - Model storage  |         | - Self metrics   |
         +----+----------+----+         +---------+--------+
              |          |                        ^
          GPU (if present)                        |
              |                          +--------+-------+
              |                          | nvidia_gpu     |
              |                          | _exporter      |
              |                          | (systemd, 9400)|
              |                          |(GPU hosts only)|
              |                          +----------------+

The key design decisions:

  • nginx sits in front of everything. Users hit port 443 (HTTPS). nginx terminates SSL, handles WebSocket upgrades for streaming responses, and proxies to Open WebUI at / and Grafana at /grafana/. Neither service handles SSL directly.
  • Ollama listens on localhost only. Port 11434 is bound to 127.0.0.1. Open WebUI connects over localhost. No network exposure for the inference API — Ollama has no authentication, so exposing it to the network means anyone can run models on your GPU.
  • Monitoring stays localhost-only. Prometheus, Grafana, and the GPU exporter all listen on loopback. Grafana is the only monitoring service that needs browser access, so nginx proxies it at /grafana/. Prometheus and the GPU exporter are only accessed by other services on the same host.
  • Everything is optional. GPU support, monitoring, and model selection are all optional. Every chapter tells you what to skip if it doesn’t apply. The CPU-only path is a first-class citizen.
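The nginx piece of this layout can be sketched in a few directives. This is an illustration of the routing only, not the hardened config built in later chapters; the server name and certificate paths are placeholders:

```nginx
server {
    listen 443 ssl;
    server_name ai.example.com;                              # placeholder
    ssl_certificate     /etc/pki/tls/certs/ai.example.com.crt;
    ssl_certificate_key /etc/pki/tls/private/ai.example.com.key;

    # Open WebUI at the root path
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;                  # required for WebSockets
        proxy_set_header Upgrade $http_upgrade;  # streaming responses
        proxy_set_header Connection "upgrade";
    }

    # Grafana under a sub-path
    location /grafana/ {
        proxy_pass http://127.0.0.1:3001;
        proxy_set_header Host $host;
    }
}
```

Note that both upstreams are loopback addresses; nothing behind nginx is reachable from the network directly.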

Hardware Sizing

By VRAM Tier

This is the table you actually need. Official model pages list theoretical requirements — here’s what works in practice on Rocky Linux with Ollama:

| VRAM | Models You Can Run | System RAM | Notes |
|------|--------------------|------------|-------|
| 0 GB (CPU only) | 7B quantized (slow) | 16 GB | Usable for testing and light use. A 7B model generates ~5-10 tokens/sec on a modern CPU. Painful for anything conversational, but fine for “ask a question, wait 30 seconds.” |
| 8 GB | 7B-13B quantized | 16 GB | GTX 1070/1080, RTX 3060. The sweet spot for home use. llama3.1:8b runs well, mistral:7b is snappy. |
| 16 GB | 13B-30B quantized | 32 GB | RTX 4060 Ti 16GB, A4000. You can run codellama:34b-instruct-q4 for a solid coding assistant. |
| 24 GB | 30B-70B quantized | 64 GB | RTX 3090/4090, A5000. This is where it gets interesting — llama3.1:70b-q4 fits and runs well. |

Important: These are quantized (compressed) model sizes. Full-precision models need roughly 2x the VRAM. Ollama uses quantized models by default, which is the right choice for inference.
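You can sanity-check the tiers above with back-of-the-envelope math: a Q4 model stores roughly half a byte per parameter, plus runtime overhead for the KV cache and buffers. The 20% overhead figure below is a rough assumption, not an Ollama-published number:

```shell
# Rough VRAM estimate for a quantized model:
#   VRAM ~= params x bytes/weight + ~20% overhead (KV cache, runtime buffers)
# Q4 quantization stores roughly 0.5 bytes per weight.
params_b=8                               # e.g. llama3.1:8b
weights_mb=$(( params_b * 1000 / 2 ))    # 0.5 bytes/weight, in MB
vram_mb=$(( weights_mb * 12 / 10 ))      # + 20% overhead
echo "~${vram_mb} MB VRAM for ${params_b}B params at Q4"   # → ~4800 MB
```

That lands an 8B model comfortably inside the 8 GB tier, which matches the table.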

VM Sizing

| Resource | Minimum | Recommended | Notes |
|----------|---------|-------------|-------|
| CPU | 2 cores | 4 cores | CPU inference is heavily multi-threaded. More cores = faster generation on CPU-only hosts. For GPU hosts, 2 cores is enough since the GPU does the heavy lifting. |
| RAM | 4 GB | 8-16 GB | Ollama needs system RAM even with GPU inference (model loading, KV cache). Open WebUI and monitoring add ~1 GB. Budget generously. |
| Disk | 20 GB + model storage | 50 GB+ | The stack itself needs ~10 GB. Each model needs 4-40 GB. A 7B quantized model is ~4 GB; a 70B quantized model is ~40 GB. Plan accordingly. |

Tip: If you’re running this on Proxmox, start with a VM at 4 cores, 8 GB RAM, and 50 GB disk. You can always increase later. Pass the GPU through to the VM (Chapter 3 covers how) or skip it for CPU-only.
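The disk math is worth doing explicitly before you size the VM. A quick sketch, using the approximate model sizes from the table (the mix of models is just an example):

```shell
# Disk budget: ~10 GB for the stack plus each model you plan to keep pulled.
total=10                      # base stack
for model_gb in 4 20 40; do   # e.g. a 7B, a 34B, and a 70B quantized model
  total=$(( total + model_gb ))
done
echo "Budget at least ${total} GB of disk"   # → Budget at least 74 GB of disk
```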

Sidebar: vLLM vs Ollama

You’ll see vLLM mentioned in every “production LLM” discussion. Here’s the short version:

Ollama is the right choice when you want simplicity, easy model management, and a chat-oriented workflow. It handles model downloading, loading, unloading, and serving behind a clean API. It’s what this guide uses.

vLLM is the right choice when you need high-concurrency API serving, structured output (JSON mode), or you’re building an application that hammers the inference endpoint with dozens of parallel requests. It’s more complex to operate but significantly faster under load.

For a home lab running 1-3 concurrent users, Ollama is the clear winner. If you outgrow it — if you’re building a RAG pipeline that fires 50 requests per minute — vLLM is the natural next step. That’s a separate guide.

Network Requirements

Only one port needs to be exposed to your network. Everything else stays on localhost.

| Port | Protocol | Service | Exposure |
|------|----------|---------|----------|
| 443 | TCP | nginx (HTTPS) | Network — the only port users access |
| 80 | TCP | nginx (HTTP redirect) | Network — redirects to 443 |
| 11434 | TCP | Ollama API | Localhost only — no auth, never expose |
| 3000 | TCP | Open WebUI | Localhost only — proxied via nginx at / |
| 9090 | TCP | Prometheus | Localhost only — accessed by Grafana |
| 3001 | TCP | Grafana | Localhost only — proxied via nginx at /grafana/ |
| 9400 | TCP | nvidia_gpu_exporter | Localhost only — scraped by Prometheus (GPU only) |

Firewall strategy: Open ports 443, 80 (for HTTPS redirect), and SSH inbound. That’s it. Every other service — Ollama, Open WebUI, Prometheus, Grafana, and the GPU exporter — stays on localhost. Grafana is accessible through the nginx reverse proxy at /grafana/, so there’s no reason to punch holes in the firewall for monitoring. Ollama has no authentication — exposing port 11434 to the network means anyone can run inference on your hardware. Don’t do it.
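Rocky Linux 9 ships with firewalld enabled, so the strategy above maps to a handful of firewall-cmd calls. A sketch to run on the AI host (ssh is normally already allowed in the default zone):

```shell
# Open only what the table allows; everything else stays closed by default.
sudo firewall-cmd --permanent --add-service=https   # 443
sudo firewall-cmd --permanent --add-service=http    # 80, redirect only
sudo firewall-cmd --reload
sudo firewall-cmd --list-services                   # verify: ssh http https
```

If --list-services shows anything beyond ssh, http, and https, remove it before going further.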

Software Prerequisites

On the AI host (the Rocky Linux 9 VM you’re building):

| Requirement | Version | How to Get It |
|-------------|---------|---------------|
| Rocky Linux | 9.x (minimal install) | Fresh VM or bare metal |
| SELinux | Enforcing (default) | Don’t disable it. We’ll configure the booleans and contexts it needs. |
| Internet access | n/a | Required for package and container image downloads |

You’ll also need SSH access to the AI host with a sudo-capable user, so you can run commands remotely or directly on the host.

Pre-Flight Checklist

Run these on your AI host before starting Chapter 3. Every check should pass.

Confirm the OS:

grep PRETTY_NAME /etc/os-release

Expected:

PRETTY_NAME="Rocky Linux 9.x (Blue Onyx)"

Confirm SELinux is enforcing:

getenforce

Expected:

Enforcing

If this says Permissive or Disabled, fix it now. Edit /etc/selinux/config, set SELINUX=enforcing, and reboot. Every tutorial that tells you to disable SELinux is creating a debt you’ll pay later — when you re-enable it and everything breaks at once.
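The edit itself is a one-line sed. It’s sketched here against a scratch copy so you can see the effect; on the real host the target is /etc/selinux/config (with sudo), followed by a reboot:

```shell
cfg=$(mktemp)                                   # stand-in for /etc/selinux/config
printf 'SELINUX=permissive\nSELINUXTYPE=targeted\n' > "$cfg"
sed -i 's/^SELINUX=.*/SELINUX=enforcing/' "$cfg"
grep '^SELINUX=' "$cfg"                         # → SELINUX=enforcing
```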

Confirm available RAM:

free -h

You need at least 4 GB total. If you’re under that, resize the VM before continuing.

Confirm available disk:

df -h /

You need at least 20 GB free, plus whatever you plan to allocate for models. A single 7B model is about 4 GB.

Confirm internet access:

curl -sL -o /dev/null -w "%{http_code}" https://ollama.com

Expected:

200
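If you’d rather run all of these in one pass, here’s a small wrapper script. The thresholds follow the text above; treat it as a convenience sketch, not part of the stack:

```shell
#!/usr/bin/env bash
# Pre-flight in one pass: RAM, disk, SELinux, internet.
set -u
ram_mb=$(awk '/^MemTotal/ {print int($2/1024)}' /proc/meminfo)
disk_gb=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')
selinux=$(getenforce 2>/dev/null || echo unavailable)
net=$(curl -sL -o /dev/null -w '%{http_code}' --max-time 10 \
      https://ollama.com 2>/dev/null) || net=unreachable
printf 'RAM:      %s MB (want >= 4096)\n'         "$ram_mb"
printf 'Disk:     %s GB free on / (want >= 20)\n' "$disk_gb"
printf 'SELinux:  %s (want Enforcing)\n'          "$selinux"
printf 'Internet: %s (want 200)\n'                "$net"
```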

What to Have Ready

Before you start Chapter 3, gather these:

  1. The IP address or hostname of your AI host. We’ll use ai.example.com / 192.168.1.50 as examples throughout this guide.
  2. A DNS record (optional but recommended). Point ai.example.com at your host’s IP. If you don’t have internal DNS, the IP works — you’ll just need to adjust the ai_domain variable.
  3. A decision on GPU vs CPU. If you have an NVIDIA GPU to pass through, continue to Chapter 3. If not, skip straight to Chapter 4 — everything works on CPU, just slower.
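For item 2, if you have no internal DNS at all, a hosts-file entry on your workstation is enough for testing. Sketched against a scratch file; the real target is /etc/hosts (edited with sudo), and the IP/name pair is the running example from item 1:

```shell
hosts=$(mktemp)   # stand-in for /etc/hosts on your workstation
echo '192.168.1.50 ai.example.com' >> "$hosts"
grep 'ai.example.com' "$hosts"   # → 192.168.1.50 ai.example.com
```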

GPU or not, the next step is getting the foundation in place. If you have a GPU, let’s pass it through.

Want the automation code? Get the Ansible playbooks that deploy this entire stack in minutes.

Get Guide + Playbooks — $14