Self-Hosting AI the Right Way

Chapter 2

Prerequisites & Architecture

In this chapter
  • The Big Picture
  • Hardware Sizing
      • By VRAM Tier
      • VM Sizing
  • Sidebar: vLLM vs Ollama
  • Network Requirements
  • Software Prerequisites
  • Pre-Flight Checklist
  • What to Have Ready

What you’ll accomplish: Understand the full deployment architecture, size your hardware correctly, and confirm all prerequisites are in place before touching a terminal.

The Big Picture

Here’s what we’re building. Every component runs on a single Rocky Linux 9 host:

                        +--------------------+
                        |   Your Browser     |
                        +--------+-----------+
                                 |
                           HTTPS (443)
                                 |
                        +--------+-----------+
                        |   nginx            |
                        |   (SSL termination,|
                        |    WebSocket proxy,|
                        |   security headers)|
                        +--------+-----------+
                          /               \
                    /  (path)         /grafana/  \
                   /                               \
          HTTP (3000)                        HTTP (3001)
          loopback                           loopback
                   |                               |
        +----------+---------+         +-----------+---------+
        |   Open WebUI       |         | Grafana             |
        |   (Podman Quadlet) |         | (Podman Quadlet)    |
        |                    |         | - Pre-built         |
        |   - Chat UI        |         |   dashboard         |
        |   - User accounts  |         | - Provisioned       |
        |   - Chat history   |         |   datasource        |
        +--------+-----------+         +-----------+---------+
                 |                               |
          HTTP (11434, loopback)                 |
                 |                               |
         +--------+-----------+         +---------+--------+
         |   Ollama           |         | Prometheus       |
         |   (systemd service)|         | (Podman Quadlet) |
         |                    |         | Port 9090        |
         |   - Model loading  |         | - Ollama metrics |
         |   - Inference      |         | - GPU metrics    |
         |   - Model storage  |         | - Self metrics   |
         +----+----------+----+         +---------+--------+
              |          |                        ^
          GPU (if present)                        |
              |                          +--------+-------+
              |                          | nvidia_gpu     |
              |                          | _exporter      |
              |                          | (systemd, 9400)|
              |                          |(GPU hosts only)|
              |                          +----------------+

The key design decisions:

  • nginx sits in front of everything. Users hit port 443 (HTTPS). nginx terminates SSL, handles WebSocket upgrades for streaming responses, and proxies to Open WebUI at / and Grafana at /grafana/. Neither service handles SSL directly.
  • Ollama listens on localhost only. Port 11434 is bound to 127.0.0.1. Open WebUI connects over localhost. No network exposure for the inference API — Ollama has no authentication, so exposing it to the network means anyone can run models on your GPU.
  • Monitoring stays localhost-only. Prometheus, Grafana, and the GPU exporter all listen on loopback. Grafana is the only monitoring service that needs browser access, so nginx proxies it at /grafana/. Prometheus and the GPU exporter are only accessed by other services on the same host.
  • Everything is optional. GPU support, monitoring, and model selection are all optional. Every chapter tells you what to skip if it doesn’t apply. The CPU-only path is a first-class citizen.
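The nginx piece of this layout can be sketched in a few directives. This is an illustration of the routing only, not the hardened config built in later chapters; the server name and certificate paths are placeholders:

```nginx
server {
    listen 443 ssl;
    server_name ai.example.com;                              # placeholder
    ssl_certificate     /etc/pki/tls/certs/ai.example.com.crt;
    ssl_certificate_key /etc/pki/tls/private/ai.example.com.key;

    # Open WebUI at the root path
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;                  # required for WebSockets
        proxy_set_header Upgrade $http_upgrade;  # streaming responses
        proxy_set_header Connection "upgrade";
    }

    # Grafana under a sub-path
    location /grafana/ {
        proxy_pass http://127.0.0.1:3001;
        proxy_set_header Host $host;
    }
}
```

Note that both upstreams are loopback addresses; nothing behind nginx is reachable from the network directly.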

Hardware Sizing

By VRAM Tier

This is the table you actually need. Official model pages list theoretical requirements — here’s what works in practice on Rocky Linux with Ollama:

| VRAM | Models You Can Run | System RAM | Notes |
|------|--------------------|------------|-------|
| 0 GB (CPU only) | 7B quantized (slow) | 16 GB | Usable for testing and light use. A 7B model generates ~5-10 tokens/sec on a modern CPU. Painful for anything conversational, but fine for “ask a question, wait 30 seconds.” |
| 8 GB | 7B-13B quantized | 16 GB | GTX 1070/1080, RTX 3060. The sweet spot for home use. llama3.1:8b runs well, mistral:7b is snappy. |
| 16 GB | 13B-30B quantized | 32 GB | RTX 4060 Ti 16GB, A4000. You can run codellama:34b-instruct-q4 for a solid coding assistant. |
| 24 GB | 30B-70B quantized | 64 GB | RTX 3090/4090, A5000. This is where it gets interesting — llama3.1:70b-q4 fits and runs well. |

Important: These are quantized (compressed) model sizes. Full-precision models need roughly 2x the VRAM. Ollama uses quantized models by default, which is the right choice for inference.
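You can sanity-check the tiers above with back-of-the-envelope math: a Q4 model stores roughly half a byte per parameter, plus runtime overhead for the KV cache and buffers. The 20% overhead figure below is a rough assumption, not an Ollama-published number:

```shell
# Rough VRAM estimate for a quantized model:
#   VRAM ~= params x bytes/weight + ~20% overhead (KV cache, runtime buffers)
# Q4 quantization stores roughly 0.5 bytes per weight.
params_b=8                               # e.g. llama3.1:8b
weights_mb=$(( params_b * 1000 / 2 ))    # 0.5 bytes/weight, in MB
vram_mb=$(( weights_mb * 12 / 10 ))      # + 20% overhead
echo "~${vram_mb} MB VRAM for ${params_b}B params at Q4"   # → ~4800 MB
```

That lands an 8B model comfortably inside the 8 GB tier, which matches the table.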

VM Sizing

| Resource | Minimum | Recommended | Notes |
|----------|---------|-------------|-------|
| CPU | 2 cores | 4 cores | CPU inference is heavily multi-threaded. More cores = faster generation on CPU-only hosts. For GPU hosts, 2 cores is enough since the GPU does the heavy lifting. |
| RAM | 4 GB | 8-16 GB | Ollama needs system RAM even with GPU inference (model loading, KV cache). Open WebUI and monitoring add ~1 GB. Budget generously. |
| Disk | 20 GB + model storage | 50 GB+ | The stack itself needs ~10 GB. Each model needs 4-40 GB. A 7B quantized model is ~4 GB; a 70B quantized model is ~40 GB. Plan accordingly. |

Tip: If you’re running this on Proxmox, start with a VM at 4 cores, 8 GB RAM, and 50 GB disk. You can always increase later. Pass the GPU through to the VM (Chapter 3 covers how) or skip it for CPU-only.
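The disk math is worth doing explicitly before you size the VM. A quick sketch, using the approximate model sizes from the table (the mix of models is just an example):

```shell
# Disk budget: ~10 GB for the stack plus each model you plan to keep pulled.
total=10                      # base stack
for model_gb in 4 20 40; do   # e.g. a 7B, a 34B, and a 70B quantized model
  total=$(( total + model_gb ))
done
echo "Budget at least ${total} GB of disk"   # → Budget at least 74 GB of disk
```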

Sidebar: vLLM vs Ollama

You’ll see vLLM mentioned in every “production LLM” discussion. Here’s the short version:

Ollama is the right choice when you want simplicity, easy model management, and a chat-oriented workflow. It handles model downloading, loading, unloading, and serving behind a clean API. It’s what this guide uses.

vLLM is the right choice when you need high-concurrency API serving, structured output (JSON mode), or you’re building an application that hammers the inference endpoint with dozens of parallel requests. It’s more complex to operate but significantly faster under load.

For a home lab running 1-3 concurrent users, Ollama is the clear winner. If you outgrow it — if you’re building a RAG pipeline that fires 50 requests per minute — vLLM is the natural next step. That’s a separate guide.

Network Requirements

Only one port needs to be exposed to your network. Everything else stays on localhost.

| Port | Protocol | Service | Exposure |
|------|----------|---------|----------|
| 443 | TCP | nginx (HTTPS) | Network — the only port users access |
| 80 | TCP | nginx (HTTP redirect) | Network — redirects to 443 |
| 11434 | TCP | Ollama API | Localhost only — no auth, never expose |
| 3000 | TCP | Open WebUI | Localhost only — proxied via nginx at / |
| 9090 | TCP | Prometheus | Localhost only — accessed by Grafana |
| 3001 | TCP | Grafana | Localhost only — proxied via nginx at /grafana/ |
| 9400 | TCP | nvidia_gpu_exporter | Localhost only — scraped by Prometheus (GPU only) |

Firewall strategy: Open ports 443, 80 (for HTTPS redirect), and SSH inbound. That’s it. Every other service — Ollama, Open WebUI, Prometheus, Grafana, and the GPU exporter — stays on localhost. Grafana is accessible through the nginx reverse proxy at /grafana/, so there’s no reason to punch holes in the firewall for monitoring. Ollama has no authentication — exposing port 11434 to the network means anyone can run inference on your hardware. Don’t do it.
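Rocky Linux 9 ships with firewalld enabled, so the strategy above maps to a handful of firewall-cmd calls. A sketch to run on the AI host (ssh is normally already allowed in the default zone):

```shell
# Open only what the table allows; everything else stays closed by default.
sudo firewall-cmd --permanent --add-service=https   # 443
sudo firewall-cmd --permanent --add-service=http    # 80, redirect only
sudo firewall-cmd --reload
sudo firewall-cmd --list-services                   # verify: ssh http https
```

If --list-services shows anything beyond ssh, http, and https, remove it before going further.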

Software Prerequisites

On the AI host (the Rocky Linux 9 VM you’re building):

| Requirement | Version | How to Get It |
|-------------|---------|---------------|
| Rocky Linux | 9.x (minimal install) | Fresh VM or bare metal |
| SELinux | Enforcing (default) | Don’t disable it. We’ll configure the booleans and contexts it needs. |
| Internet access | n/a | Required for package and container image downloads |

You’ll also need SSH access to the AI host with a sudo-capable user, so you can run commands remotely or directly on the host.

Pre-Flight Checklist

Run these on your AI host before starting Chapter 3. Every check should pass.

Confirm the OS:

grep PRETTY_NAME /etc/os-release

Expected:

PRETTY_NAME="Rocky Linux 9.x (Blue Onyx)"

Confirm SELinux is enforcing:

getenforce

Expected:

Enforcing

If this says Permissive or Disabled, fix it now. Edit /etc/selinux/config, set SELINUX=enforcing, and reboot. Every tutorial that tells you to disable SELinux is creating a debt you’ll pay later — when you re-enable it and everything breaks at once.
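The edit itself is a one-line sed. It’s sketched here against a scratch copy so you can see the effect; on the real host the target is /etc/selinux/config (with sudo), followed by a reboot:

```shell
cfg=$(mktemp)                                   # stand-in for /etc/selinux/config
printf 'SELINUX=permissive\nSELINUXTYPE=targeted\n' > "$cfg"
sed -i 's/^SELINUX=.*/SELINUX=enforcing/' "$cfg"
grep '^SELINUX=' "$cfg"                         # → SELINUX=enforcing
```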

Confirm available RAM:

free -h

You need at least 4 GB total. If you’re under that, resize the VM before continuing.

Confirm available disk:

df -h /

You need at least 20 GB free, plus whatever you plan to allocate for models. A single 7B model is about 4 GB.

Confirm internet access:

curl -sL -o /dev/null -w "%{http_code}" https://ollama.com

Expected:

200
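If you’d rather run all of these in one pass, here’s a small wrapper script. The thresholds follow the text above; treat it as a convenience sketch, not part of the stack:

```shell
#!/usr/bin/env bash
# Pre-flight in one pass: RAM, disk, SELinux, internet.
set -u
ram_mb=$(awk '/^MemTotal/ {print int($2/1024)}' /proc/meminfo)
disk_gb=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')
selinux=$(getenforce 2>/dev/null || echo unavailable)
net=$(curl -sL -o /dev/null -w '%{http_code}' --max-time 10 \
      https://ollama.com 2>/dev/null) || net=unreachable
printf 'RAM:      %s MB (want >= 4096)\n'         "$ram_mb"
printf 'Disk:     %s GB free on / (want >= 20)\n' "$disk_gb"
printf 'SELinux:  %s (want Enforcing)\n'          "$selinux"
printf 'Internet: %s (want 200)\n'                "$net"
```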

What to Have Ready

Before you start Chapter 3, gather these:

  1. The IP address or hostname of your AI host. We’ll use ai.example.com / 192.168.1.50 as examples throughout this guide.
  2. A DNS record (optional but recommended). Point ai.example.com at your host’s IP. If you don’t have internal DNS, the IP works — you’ll just need to adjust the ai_domain variable.
  3. A decision on GPU vs CPU. If you have an NVIDIA GPU to pass through, continue to Chapter 3. If not, skip straight to Chapter 4 — everything works on CPU, just slower.
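For item 2, if you have no internal DNS at all, a hosts-file entry on your workstation is enough for testing. Sketched against a scratch file; the real target is /etc/hosts (edited with sudo), and the IP/name pair is the running example from item 1:

```shell
hosts=$(mktemp)   # stand-in for /etc/hosts on your workstation
echo '192.168.1.50 ai.example.com' >> "$hosts"
grep 'ai.example.com' "$hosts"   # → 192.168.1.50 ai.example.com
```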

GPU or not, the next step is getting the foundation in place. If you have a GPU, let’s pass it through.

Want the automation code? Get the Ansible playbooks that deploy this entire stack in minutes.

Get Guide + Playbooks — $14