What you’ll accomplish: Understand what gap this guide fills, why every “install Ollama” tutorial leaves you stranded, and exactly what you’ll have when you finish.
The “Install Ollama” Problem
Search for “self-host AI” and you’ll find a hundred tutorials that all end the same way: ollama run llama3. Congratulations, you have a chatbot in your terminal. The tutorial declares victory and moves on.
Here’s what they don’t cover:
- How do you put a web UI in front of it so people can actually use it?
- How do you handle SSL so your credentials aren’t flying across the network in plaintext?
- How do you monitor VRAM usage so you know when your GPU is about to fall over?
- How do you make it survive a reboot?
- How do you make SELinux happy instead of just disabling it?
- How do you back up chat history without backing up 40 GB of model files?
- How do you deploy the whole thing again when you rebuild the VM?
The answer to most of these, in most tutorials, is “left as an exercise for the reader.” The real answer is that the tutorial author didn’t bother, because getting from “Ollama runs in my terminal” to “production-ready AI stack” is where the actual work starts.
What’s Actually Missing
The gap isn’t “how to install Ollama.” That’s one command. The gap is everything around it:
A real web interface. Open WebUI gives your users a ChatGPT-like experience. But deploying it properly means Podman containers, Quadlet systemd integration, persistent storage for chat history, and connecting it to Ollama across the right network boundary.
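To make that concrete before we get there, here's roughly the shape of the Quadlet unit this guide builds later. Treat it as a sketch: the image tag, port mapping, and volume name are illustrative, and the real unit we write in a later chapter adds more detail.

```ini
# Sketch of /etc/containers/systemd/open-webui.container (values illustrative)
[Unit]
Description=Open WebUI (frontend for Ollama)
After=network-online.target

[Container]
Image=ghcr.io/open-webui/open-webui:main
# Bind to localhost only; nginx terminates SSL and proxies in
PublishPort=127.0.0.1:3000:8080
# Chat history and user accounts persist in this named volume
Volume=open-webui-data:/app/backend/data
Environment=OLLAMA_BASE_URL=http://127.0.0.1:11434

[Service]
Restart=always

[Install]
WantedBy=multi-user.target
```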
A reverse proxy. You need SSL termination. You need WebSocket support for streaming responses — and this is where every generic nginx tutorial fails, because LLM inference responses can take minutes on CPU, and default proxy timeouts kill the connection at 60 seconds.
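For a preview of what "done right" looks like, the proxy block we eventually write resembles the sketch below. The upstream port assumes Open WebUI is published on localhost:3000, and the timeout values are placeholders we tune later.

```nginx
# Sketch of the Open WebUI location block (port and timeouts are illustrative)
location / {
    proxy_pass http://127.0.0.1:3000;
    proxy_http_version 1.1;
    # WebSocket upgrade so streamed tokens actually reach the browser
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    # Long-running CPU inference outlives the 60s default
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
}
```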
Monitoring. GPU utilization, VRAM pressure, inference latency, model load/unload events. Without visibility, you’re flying blind. You won’t know your GPU is memory-starved until a user complains that responses stopped.
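Even before Prometheus and Grafana are wired up, you can spot-check the numbers that matter from a shell. This assumes an NVIDIA GPU with the driver's nvidia-smi tool installed; on CPU-only hosts, skip the first line.

```bash
# One-off check of GPU load and VRAM (the monitoring stack automates this later)
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
# Ollama reports which models are loaded and how much memory they hold
ollama ps
```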
Hardening. SELinux enforcing, not disabled. Firewall rules that expose only HTTPS, not every service port. fail2ban on the nginx frontend. Log rotation so journald doesn’t eat your disk.
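The commands below give the flavor of that hardening. The exact services, booleans, and fail2ban jails are worked through later in the guide, so read this as a sketch, not a checklist.

```bash
# Expose only HTTPS; everything else stays behind nginx
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --reload
# Let nginx open outbound connections to the Open WebUI container
sudo setsebool -P httpd_can_network_connect 1
# Brute-force protection on the nginx frontend
sudo systemctl enable --now fail2ban
getenforce   # should print "Enforcing", and it stays that way
```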
Backup. Chat history and user accounts need backing up. Model files don’t — they’re 4-40 GB blobs you can re-pull in minutes. Knowing what to back up and what to skip saves you from a 50 GB backup job that should be 50 MB.
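A nightly job that captures the right data can be as small as the sketch below. The volume name matches the Quadlet example above and the model path is the default for a script-installed Ollama; both may differ on your system.

```bash
#!/usr/bin/env bash
# Minimal nightly backup sketch: keep the small, irreplaceable data only
set -euo pipefail
backup_dir=/srv/backups/ai-stack
mkdir -p "$backup_dir"
# Chat history and user accounts live in the Open WebUI volume (tens of MB)
podman volume export open-webui-data | gzip > "$backup_dir/open-webui-$(date +%F).tar.gz"
# Deliberately NOT backed up: the Ollama model store (tens of GB, re-pullable)
#   e.g. /usr/share/ollama/.ollama/models
```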
Reproducibility. A way to rebuild the whole thing when you nuke the VM or migrate to new hardware. Manual wikis drift out of date the moment you finish writing them. Automation doesn’t.
What You’ll Have at the End
By the time you finish this guide, you’ll have:
- Ollama serving LLM inference as a systemd service, with GPU acceleration if available, CPU fallback if not
- Open WebUI providing a ChatGPT-like web interface via Podman with Quadlet systemd integration
- nginx as a reverse proxy with SSL termination, WebSocket support for streaming, and security headers
- Prometheus + Grafana monitoring GPU utilization, VRAM, inference metrics, and system resources — with a pre-built dashboard
- Automated backups via a daily cron job that knows what to back up and what to skip
- fail2ban protecting the nginx frontend from brute-force attacks
- Firewall rules that keep everything behind nginx — only HTTPS exposed to the network
- SELinux enforcing throughout: no `setenforce 0`, no permissive mode, just proper contexts and booleans
Everything runs on a single Rocky Linux 9 host. Every credential is stored securely, every service survives a reboot, and every step includes verification so you know it worked.
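Verification throughout the guide looks like the quick checks below. The service names and hostname follow this chapter's examples and are assumptions until you've built the stack yourself.

```bash
systemctl is-active ollama open-webui nginx              # all three should say "active"
curl -s http://127.0.0.1:11434/api/tags | head -c 200    # Ollama's API answers locally
curl -skI https://ai.example.com | head -n 1             # nginx serves HTTPS in front
```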
About the Playbook Bundle: This guide teaches you every step manually. If you’d rather automate the whole thing, the companion Ansible playbook bundle deploys everything in about 15 minutes — and it’s idempotent, so re-running it is safe. Available at RavenForge Press or directly on Payhip.
Who This Guide Is For
You’re an intermediate sysadmin or homelab builder. You know your way around SSH, dnf, and config files. You don’t need to know Ansible — this guide walks through every step manually. You don’t need hand-holding on Linux basics, but you do want someone to say “use a Podman Quadlet, not podman generate systemd” and “set this SELinux boolean or you’ll waste two hours staring at a 502.”
You might want to run a private ChatGPT for your household. You might want a coding assistant that stays on your network. You might just want to understand how local LLM infrastructure actually works, beyond the “just run it in Docker” handwave.
Who This Guide Is Not For
If all you want is a one-click Docker Compose file that runs on your laptop, plenty of those exist, and they're free. This guide is for people who want something they'd trust to run on a server, not something they'd demo once and forget about.
This guide does not cover model fine-tuning, RAG pipelines, multi-node inference clusters, or vLLM. Those are real topics that deserve their own guides. We’ll note where they’d plug in, but the scope here is a single-node, production-quality deployment.
Prerequisites
Before you start Chapter 2, make sure you have:
- A Rocky Linux 9 host (VM or bare metal) with at least 4 GB RAM and 2 CPU cores. A fresh minimal install is ideal. If you have an NVIDIA GPU, we’ll cover passthrough in Chapter 3.
- SSH access to that host with a user that has sudo privileges.
- A DNS record or IP address for your AI host. Point `ai.example.com` at it, or just use the IP; the configuration works either way.
- Disk space proportional to the models you plan to run. Budget 10 GB for the stack itself, plus 5-50 GB per model depending on size. We'll cover sizing in Chapter 2.
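If you want to confirm the hardware and OS side of that list before starting, a few one-liners cover it:

```bash
cat /etc/rocky-release              # expect a Rocky Linux 9.x release string
nproc                               # 2 or more cores
grep -m1 MemTotal /proc/meminfo     # roughly 4,000,000 kB or more
df -h /                             # room for the stack plus your models
lspci | grep -i nvidia || echo "no NVIDIA GPU detected (CPU mode is fine)"
```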
That’s it. Let’s look at what we’re building.