What you’ll accomplish: Understand the full deployment architecture, size your hardware correctly, and confirm all prerequisites are in place before touching a terminal.
The Big Picture
Here’s what we’re building. Every component runs on a single Rocky Linux 9 host:
            +--------------------+
            |    Your Browser    |
            +---------+----------+
                      |
                 HTTPS (443)
                      |
            +---------+----------+
            |       nginx        |
            | (SSL termination,  |
            |  WebSocket proxy,  |
            |  security headers) |
            +---------+----------+
                    /   \
         (path) /           \ /grafana/
             /                 \
     HTTP (3000)             HTTP (3001)
      loopback                loopback
          |                       |
+---------+----------+  +---------+----------+
|     Open WebUI     |  |      Grafana       |
|  (Podman Quadlet)  |  |  (Podman Quadlet)  |
|                    |  |  - Pre-built       |
|  - Chat UI         |  |    dashboard       |
|  - User accounts   |  |  - Provisioned     |
|  - Chat history    |  |    datasource      |
+---------+----------+  +---------+----------+
          |                       |
HTTP (11434, loopback)            |
          |                       |
+---------+----------+  +---------+----------+
|       Ollama       |  |     Prometheus     |
|  (systemd service) |  |  (Podman Quadlet)  |
|                    |  |     Port 9090      |
|  - Model loading   |  |  - Ollama metrics  |
|  - Inference       |  |  - GPU metrics     |
|  - Model storage   |  |  - Self metrics    |
+---------+----------+  +---------+----------+
          |                       ^
   GPU (if present)               |
                        +---------+----------+
                        |    nvidia_gpu      |
                        |    _exporter       |
                        |  (systemd, 9400)   |
                        |  (GPU hosts only)  |
                        +--------------------+
The key design decisions:
- nginx sits in front of everything. Users hit port 443 (HTTPS). nginx terminates SSL, handles WebSocket upgrades for streaming responses, and proxies to Open WebUI at / and Grafana at /grafana/. Neither service handles SSL directly.
- Ollama listens on localhost only. Port 11434 is bound to 127.0.0.1. Open WebUI connects over localhost. No network exposure for the inference API — Ollama has no authentication, so exposing it to the network means anyone can run models on your GPU.
- Monitoring stays localhost-only. Prometheus, Grafana, and the GPU exporter all listen on loopback. Grafana is the only monitoring service that needs browser access, so nginx proxies it at /grafana/. Prometheus and the GPU exporter are only accessed by other services on the same host (a quick way to verify these bindings follows this list).
- Everything is optional. GPU support, monitoring, and model selection are all optional. Every chapter tells you what to skip if it doesn’t apply. The CPU-only path is a first-class citizen.
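Once the stack is up and running in later chapters, you can spot-check these decisions from a shell. The port list below assumes the defaults used throughout this guide:
sudo ss -tlnp | grep -E '11434|3000|3001|9090|9400'
Every matching line should show a local address of 127.0.0.1 (or ::1). Only nginx, on ports 80 and 443, should be listening on a routable address.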
Hardware Sizing
By VRAM Tier
This is the table you actually need. Official model pages list theoretical requirements — here’s what works in practice on Rocky Linux with Ollama:
| VRAM | Models You Can Run | System RAM | Notes |
|---|---|---|---|
| 0 (CPU only) | 7B quantized (slow) | 16 GB | Usable for testing and light use. A 7B model generates ~5-10 tokens/sec on a modern CPU. Painful for anything conversational, but fine for “ask a question, wait 30 seconds.” |
| 8 GB | 7B-13B quantized | 16 GB | GTX 1070/1080, RTX 3060. The sweet spot for home use. llama3.1:8b runs well, mistral:7b is snappy. |
| 16 GB | 13B-30B quantized | 32 GB | RTX 4060 Ti 16GB, A4000. You can run codellama:34b-instruct-q4 for a solid coding assistant. |
| 24 GB | 30B-70B quantized | 64 GB | RTX 3090/4090, A5000. This is where it gets interesting — llama3.1:70b-q4 fits and runs well. |
Important: These are quantized (compressed) model sizes. Full-precision models need roughly 2x the VRAM. Ollama uses quantized models by default, which is the right choice for inference.
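If you’re not sure which tier your card lands in, nvidia-smi reports the total VRAM. This assumes the NVIDIA driver is installed wherever you run it (your desktop or an existing host is fine; a fresh AI host won’t have the driver yet):
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
The output is the GPU name followed by its memory in MiB; divide by 1024 and match it against the table above.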
VM Sizing
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU | 2 cores | 4 cores | CPU inference is heavily multi-threaded. More cores = faster generation on CPU-only hosts. For GPU hosts, 2 cores is enough since the GPU does the heavy lifting. |
| RAM | 4 GB | 8-16 GB | Ollama needs system RAM even with GPU inference (model loading, KV cache). Open WebUI and monitoring add ~1 GB. Budget generously. |
| Disk | 20 GB + model storage | 50 GB+ | The stack itself needs ~10 GB. Each model needs 4-40 GB. A 7B quantized model is ~4 GB; a 70B quantized model is ~40 GB. Plan accordingly. |
Tip: If you’re running this on Proxmox, start with a VM at 4 cores, 8 GB RAM, and 50 GB disk. You can always increase later. Pass the GPU through to the VM (Chapter 3 covers how) or skip it for CPU-only.
Sidebar: vLLM vs Ollama
You’ll see vLLM mentioned in every “production LLM” discussion. Here’s the short version:
Ollama is the right choice when you want simplicity, easy model management, and a chat-oriented workflow. It handles model downloading, loading, unloading, and serving behind a clean API. It’s what this guide uses.
vLLM is the right choice when you need high-concurrency API serving, structured output (JSON mode), or you’re building an application that hammers the inference endpoint with dozens of parallel requests. It’s more complex to operate but significantly faster under load.
For a home lab running 1-3 concurrent users, Ollama is the clear winner. If you outgrow it — if you’re building a RAG pipeline that fires 50 requests per minute — vLLM is the natural next step. That’s a separate guide.
Network Requirements
Only one port needs to be exposed to your network. Everything else stays on localhost.
| Port | Protocol | Service | Exposure |
|---|---|---|---|
| 443 | TCP | nginx (HTTPS) | Network — the only port users access |
| 80 | TCP | nginx (HTTP redirect) | Network — redirects to 443 |
| 11434 | TCP | Ollama API | Localhost only — no auth, never expose |
| 3000 | TCP | Open WebUI | Localhost only — proxied via nginx at / |
| 9090 | TCP | Prometheus | Localhost only — accessed by Grafana |
| 3001 | TCP | Grafana | Localhost only — proxied via nginx at /grafana/ |
| 9400 | TCP | nvidia_gpu_exporter | Localhost only — scraped by Prometheus (GPU only) |
Firewall strategy: Open ports 443, 80 (for HTTPS redirect), and SSH inbound. That’s it. Every other service — Ollama, Open WebUI, Prometheus, Grafana, and the GPU exporter — stays on localhost. Grafana is accessible through the nginx reverse proxy at /grafana/, so there’s no reason to punch holes in the firewall for monitoring. Ollama has no authentication — exposing port 11434 to the network means anyone can run inference on your hardware. Don’t do it.
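On Rocky Linux 9 the default firewalld zone typically allows SSH already, so the whole firewall setup is a couple of commands. A minimal sketch (adjust the zone if you’ve customized yours):
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --reload
Nothing else gets a rule; the localhost-only services never need one.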
Software Prerequisites
On the AI host (the Rocky Linux 9 VM you’re building):
| Requirement | Version | How to Get It |
|---|---|---|
| Rocky Linux | 9.x (minimal install) | Fresh VM or bare metal |
| SELinux | Enforcing (default) | Don’t disable it. We’ll configure the booleans and contexts it needs. |
| Internet access | — | Required for package and container image downloads |
You’ll also need SSH access to the AI host with a sudo-capable user, so you can run commands remotely or directly on the host.
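A quick way to confirm both at once, where admin and the address are placeholders for your own user and host:
ssh -t admin@192.168.1.50 'sudo -v && echo sudo access confirmed'
If you see the confirmation after entering your password, remote administration is ready.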
Pre-Flight Checklist
Run these on your AI host before starting Chapter 3. Every check should pass.
Confirm the OS:
cat /etc/os-release | grep PRETTY_NAME
Expected:
PRETTY_NAME="Rocky Linux 9.x (Blue Onyx)"
Confirm SELinux is enforcing:
getenforce
Expected:
Enforcing
If this says Permissive or Disabled, fix it now. Edit /etc/selinux/config, set SELINUX=enforcing, and reboot. Every tutorial that tells you to disable SELinux is creating a debt you’ll pay later — when you re-enable it and everything breaks at once.
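One way to make that change without opening an editor, assuming the stock config file layout:
sudo sed -i 's/^SELINUX=.*/SELINUX=enforcing/' /etc/selinux/config
sudo touch /.autorelabel   # only needed if SELinux was Disabled, so the filesystem gets relabeled on boot
sudo reboot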
Confirm available RAM:
free -h
You need at least 4 GB total. If you’re under that, resize the VM before continuing.
Confirm available disk:
df -h /
You need at least 20 GB free, plus whatever you plan to allocate for models. A single 7B model is about 4 GB.
Confirm internet access:
curl -sL -o /dev/null -w "%{http_code}" https://ollama.com
Expected:
200
What to Have Ready
Before you start Chapter 3, gather these:
- The IP address or hostname of your AI host. We’ll use ai.example.com / 192.168.1.50 as examples throughout this guide.
- A DNS record (optional but recommended). Point ai.example.com at your host’s IP. If you don’t have internal DNS, the IP works — you’ll just need to adjust the ai_domain variable.
- A decision on GPU vs CPU. If you have an NVIDIA GPU to pass through, continue to Chapter 3. If not, skip straight to Chapter 4 — everything works on CPU, just slower. (If you’re not sure what the host can see, a quick check follows this list.)
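Not sure whether the host or VM even sees an NVIDIA card? This check needs no driver (on a minimal install you may need to dnf install pciutils first):
lspci | grep -i nvidia
A line mentioning a VGA or 3D controller means there’s a GPU to pass through; no output means take the CPU-only path.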
GPU or not, the next step is getting the foundation in place. If you have a GPU, let’s pass it through.