What you’ll accomplish: Deploy Prometheus and Grafana for real-time monitoring, set up GPU metrics collection, configure automated backups, add fail2ban for nginx protection, and establish a capacity planning baseline.
This chapter is what separates a weekend project from a system you can trust. Free tutorials stop at “it works.” This covers what happens on day 2, day 30, and day 365.
Prometheus — Metrics Collection
Prometheus scrapes metrics from your services at regular intervals and stores them in a time-series database. We’ll scrape three targets:
- Ollama — exposes inference metrics at http://127.0.0.1:11434/metrics
- nvidia_gpu_exporter — exposes GPU utilization, VRAM, temperature, and power draw (GPU hosts only)
- Prometheus itself — for internal health metrics
Deploy Prometheus
Create the data directory. Prometheus runs as UID 65534 (nobody) inside the container:
sudo mkdir -p /opt/prometheus/data
sudo chown 65534:65534 /opt/prometheus/data
Deploy the scrape configuration:
sudo tee /opt/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']

  - job_name: 'ollama'
    static_configs:
      - targets: ['127.0.0.1:11434']
    metrics_path: /metrics

  # Include this block only if you have a GPU
  - job_name: 'nvidia_gpu'
    static_configs:
      - targets: ['127.0.0.1:9400']
EOF
Note: On CPU-only hosts, remove the nvidia_gpu job block from this file — there’s nothing to scrape without a GPU.
Prometheus Quadlet
sudo tee /etc/containers/systemd/prometheus.container > /dev/null << 'EOF'
[Unit]
Description=Prometheus — metrics collection and alerting
After=ollama.service
[Container]
ContainerName=prometheus
Image=docker.io/prom/prometheus:latest
Network=host
Volume=/opt/prometheus/data:/prometheus:Z
Volume=/opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro,Z
[Service]
Restart=always
TimeoutStartSec=120
[Install]
WantedBy=multi-user.target
EOF
sudo podman pull docker.io/prom/prometheus:latest
sudo systemctl daemon-reload
sudo systemctl start prometheus
Verify Prometheus
# Check the ready endpoint
curl -s http://127.0.0.1:9090/-/ready
Expected: Prometheus Server is Ready.
Browse to http://127.0.0.1:9090/targets (or through your nginx proxy if you’ve added a location block for it) to confirm all scrape targets are UP.
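You can also query Prometheus from the shell without opening the UI. This sketch hits the query API and asks for the built-in up metric, which reads 1 for every target whose last scrape succeeded:

# Query the up metric for all scrape targets (1 = last scrape OK)
curl -s 'http://127.0.0.1:9090/api/v1/query?query=up'

If jq is installed, piping the output through jq '.data.result' gives a more readable view.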
Grafana — Visualization
Grafana provides the dashboards. We’ll deploy it with auto-provisioned Prometheus datasource and a pre-built AI stack dashboard.
Directory Structure
# Grafana runs as UID 472 inside the container
sudo mkdir -p /opt/grafana/{data,provisioning/datasources,provisioning/dashboards,dashboards}
sudo chown -R 472:472 /opt/grafana
Auto-Provision the Prometheus Datasource
sudo tee /opt/grafana/provisioning/datasources/prometheus.yml > /dev/null << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://127.0.0.1:9090
    isDefault: true
    editable: false
EOF
This means Grafana connects to Prometheus automatically on first boot — no manual datasource configuration in the UI.
Dashboard Provider
sudo tee /opt/grafana/provisioning/dashboards/ai-stack.yml > /dev/null << 'EOF'
apiVersion: 1
providers:
  - name: 'AI Stack'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
EOF
You can build your own dashboard or import one. To import a dashboard JSON file, either place it in /opt/grafana/dashboards/ (Grafana picks it up automatically via the provider above) or use the Grafana UI: Dashboards > Import > Upload JSON file. The companion playbook bundle includes a pre-built AI stack dashboard that covers GPU, Ollama, and system metrics out of the box.
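If you go the file route, dropping the JSON into the provisioned directory is enough; Grafana picks it up within the updateIntervalSeconds window. A minimal sketch, assuming the bundled file is named ai-stack-dashboard.json and sits in your current directory:

# Copy the dashboard JSON where the provider looks for it
sudo cp ai-stack-dashboard.json /opt/grafana/dashboards/
sudo chown 472:472 /opt/grafana/dashboards/ai-stack-dashboard.json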
Grafana Quadlet
Generate a Grafana admin password and save it somewhere you won’t lose it — you’ll need it every time you log in:
openssl rand -hex 24
Copy the output, then create the Quadlet file:
sudo tee /etc/containers/systemd/grafana.container > /dev/null << 'EOF'
[Unit]
Description=Grafana — metrics visualization and dashboards
After=prometheus.service
[Container]
ContainerName=grafana
Image=docker.io/grafana/grafana-oss:latest
Network=host
Volume=/opt/grafana/data:/var/lib/grafana:Z
Volume=/opt/grafana/provisioning:/etc/grafana/provisioning:ro,Z
Volume=/opt/grafana/dashboards:/var/lib/grafana/dashboards:ro,Z
Environment=GF_SERVER_HTTP_PORT=3001
Environment=GF_SERVER_ROOT_URL=https://ai.example.com/grafana/
Environment=GF_SERVER_SERVE_FROM_SUB_PATH=true
Environment=GF_SECURITY_ADMIN_USER=admin
Environment=GF_SECURITY_ADMIN_PASSWORD=<from vault>
Environment=GF_USERS_ALLOW_SIGN_UP=false
[Service]
Restart=always
TimeoutStartSec=120
[Install]
WantedBy=multi-user.target
EOF
Now edit the file and replace <from vault> on the GF_SECURITY_ADMIN_PASSWORD line with the password you generated above. The default username is admin.
GF_SERVER_ROOT_URL and GF_SERVER_SERVE_FROM_SUB_PATH — These tell Grafana it’s being served at /grafana/ behind nginx, not at the root. Without these, Grafana generates asset URLs and API links that point to / instead of /grafana/, breaking the UI when accessed through the proxy.
sudo podman pull docker.io/grafana/grafana-oss:latest
sudo systemctl daemon-reload
sudo systemctl start grafana
Verify Grafana through nginx
The nginx config from Chapter 5 already includes the /grafana/ location block — no changes needed. Grafana should be accessible through the proxy immediately.
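For reference, here is a minimal sketch of what such a location block typically looks like; your Chapter 5 config is authoritative, and header details may differ:

# Sketch only; pairs with GF_SERVER_SERVE_FROM_SUB_PATH=true
location /grafana/ {
    proxy_pass http://127.0.0.1:3001;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    # WebSocket support for Grafana Live features
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

Because proxy_pass carries no trailing path, nginx forwards /grafana/... to Grafana unchanged, which is exactly what the sub-path settings expect.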
Verify Grafana
# Direct localhost check (confirms the container is healthy)
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:3001/grafana/api/health
# Through nginx (confirms the full proxy chain works)
curl -sk -o /dev/null -w "%{http_code}" https://localhost/grafana/api/health
Expected: 200 for both. Log in at https://<your-host-ip>/grafana/ with the admin credentials. The Prometheus datasource and AI Stack dashboard should already be available.
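To confirm the datasource provisioning worked without clicking through the UI, you can query the Grafana HTTP API with the admin credentials. A sketch, assuming basic auth with the admin user (substitute the password you generated):

# List provisioned datasources; expect a JSON entry named "Prometheus"
curl -s -u admin:<your-admin-password> http://127.0.0.1:3001/grafana/api/datasources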
GPU Monitoring with nvidia_gpu_exporter
If you have a GPU, the nvidia_gpu_exporter exposes metrics that Prometheus scrapes. It runs as a native systemd service (not a container) because it needs direct access to nvidia-smi.
Install the Exporter
# Download the binary
curl -Lo /tmp/nvidia_gpu_exporter.tar.gz \
https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v1.4.1/nvidia_gpu_exporter_1.4.1_linux_x86_64.tar.gz
# Extract
sudo tar -xzf /tmp/nvidia_gpu_exporter.tar.gz -C /usr/local/bin/ nvidia_gpu_exporter
sudo chmod 755 /usr/local/bin/nvidia_gpu_exporter
systemd Service
sudo tee /etc/systemd/system/nvidia_gpu_exporter.service > /dev/null << 'EOF'
[Unit]
Description=NVIDIA GPU Prometheus Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/nvidia_gpu_exporter --web.listen-address=:9400
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia_gpu_exporter
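Before wiring it into dashboards, confirm the exporter is actually serving metrics. A quick check, assuming the service started cleanly:

# The exporter should answer on port 9400 with Prometheus-format metrics
curl -s http://127.0.0.1:9400/metrics | grep -i nvidia | head -n 5
systemctl is-active nvidia_gpu_exporter

If the curl output is empty, check journalctl -u nvidia_gpu_exporter; a common culprit is nvidia-smi not being available to the service.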
Metrics Exposed
The exporter calls nvidia-smi and converts the output to Prometheus metrics:
| Metric | What It Tells You |
|---|---|
| nvidia_gpu_utilization | GPU compute utilization (%) |
| nvidia_gpu_memory_used_bytes | VRAM currently in use |
| nvidia_gpu_memory_total_bytes | Total VRAM available |
| nvidia_gpu_temperature_celsius | GPU temperature |
| nvidia_gpu_power_draw_watts | Current power consumption |
These metrics feed the GPU panels in the pre-built Grafana dashboard.
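If you want to sanity-check the numbers before opening Grafana, you can run a query through the Prometheus API. This sketch uses the metric names from the table above; confirm the exact names against http://127.0.0.1:9400/metrics, since they can vary between exporter versions:

# VRAM in use as a percentage of total VRAM
curl -s 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=100 * nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes'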
The Pre-Built Dashboard
The bundled Grafana dashboard (ai-stack-dashboard.json) includes panels for:
- GPU utilization over time — shows compute load during inference. Spikes correlate with active requests.
- VRAM usage — current vs total. When this hits 100%, the next model load will evict the previous one or Ollama will fall back to CPU for the overflow.
- GPU temperature — thermal throttling starts at 83-90C depending on your card. If you’re consistently above 80C, improve your case airflow.
- Ollama request metrics — if Ollama exposes request counters (version-dependent), these show request rate and latency.
- System resources — CPU, RAM, and disk usage from Prometheus node exporter (if you add one later — the dashboard gracefully shows “No Data” for missing metrics).
Tip: The dashboard is a starting point. Grafana dashboards are fully editable — add panels, change time ranges, set up alerts. The JSON file in the bundle is yours to customize.
Seeing It All Work
This is the payoff. Open two browser tabs: Open WebUI (https://ai.example.com/) in one, Grafana’s AI Stack dashboard (https://ai.example.com/grafana/) in the other. Arrange them side by side.
In Open WebUI, pick a model and send it a multi-paragraph prompt — something meaty, like “Explain the trade-offs between microservices and monoliths for a team of five.” Now watch the Grafana tab.
On a GPU host, you’ll see GPU utilization spike from idle to 80-100% within seconds. The VRAM usage panel climbs as the model’s KV cache fills during generation. Inference duration ticks up proportional to the response length. When the response finishes, utilization drops back to near-zero. That’s your GPU doing exactly what you bought it for.
On a CPU-only host, the picture is different but equally informative. System CPU pegs at 100% across all cores. The response streams in slowly — a few words per second instead of a flood. RAM usage climbs. You can feel the difference, and now you can see it in graphs.
Try one more thing: load a second model (pick a different one from the model selector and send a prompt). Watch the VRAM panel — if both models fit, VRAM climbs higher. If they don’t, you’ll see Ollama evict the first model to make room. The eviction shows as a brief VRAM drop followed by the new model loading in. This is OLLAMA_MAX_LOADED_MODELS in action.
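To confirm what the VRAM panel is showing you, ollama ps lists which models are currently loaded, how much memory each holds, and whether they sit on the GPU or CPU:

# Show currently loaded models and where they reside
ollama ps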
This is your operational baseline. You now know what “normal” looks like for your hardware. When something feels slow next week, you’ll open Grafana and immediately see whether it’s a GPU memory issue, a CPU bottleneck, or something else entirely.
Backup Strategy
What to Back Up
| Data | Location | Size | Back Up? | Why |
|---|---|---|---|---|
| Open WebUI data | /opt/open-webui/data | 10-100 MB | Yes | Chat history, user accounts, settings. Irreplaceable. |
| Model inventory | ollama list output | < 1 KB | Yes | Just the list of model names. Re-pull is easy when you know what you had. |
| Grafana dashboards | /opt/grafana/dashboards/ | < 1 MB | Yes | Custom dashboards and modifications. |
| nginx config | /etc/nginx/conf.d/ | < 1 KB | Yes | Easy to recreate, but nice to have. |
| Model files | /var/lib/ollama/models/ | 4-40 GB per model | No | Re-pull with ollama pull. Backing up 40 GB blobs is wasteful when the download takes 10 minutes. |
| Prometheus TSDB | /opt/prometheus/data/ | Grows over time | No | Historical metrics. Nice to have but not critical — re-scrape rebuilds it. |
The Backup Script
Create a backup script at /usr/local/bin/ai-backup.sh:
sudo tee /usr/local/bin/ai-backup.sh > /dev/null << 'EOF'
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/opt/ai-backup"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/ai-stack-backup-${DATE}.tar.gz"
logger -t ai-backup "Starting AI stack backup"
# Back up Open WebUI data (SQLite DB, chat history, users)
tar czf "${BACKUP_FILE}" \
-C / \
"opt/open-webui/data" \
2>/dev/null || true
# Save model inventory (names only — NOT the multi-GB model files)
ollama list > "${BACKUP_DIR}/model-inventory-${DATE}.txt" 2>/dev/null || true
# Prune backups older than retention period
find "${BACKUP_DIR}" -name "ai-stack-backup-*.tar.gz" -mtime +7 -delete
find "${BACKUP_DIR}" -name "model-inventory-*.txt" -mtime +7 -delete
logger -t ai-backup "AI stack backup completed: ${BACKUP_FILE}"
EOF
Make it executable, create the backup directory, and schedule it via cron:
sudo chmod +x /usr/local/bin/ai-backup.sh
sudo mkdir -p /opt/ai-backup
Add the cron job to root’s crontab (sudo crontab -e):
0 2 * * * /usr/local/bin/ai-backup.sh
Backups older than 7 days are automatically pruned by the script. Adjust the -mtime +7 value in the script if you want longer retention.
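It’s worth running the script once by hand before trusting the cron schedule. A quick smoke test:

# Run the backup manually, then confirm the archive and the log entry exist
sudo /usr/local/bin/ai-backup.sh
ls -lh /opt/ai-backup/
journalctl -t ai-backup --since "10 minutes ago"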
Restoring from Backup
Don’t run these now — this is the procedure for when you need to recover from a backup later.
| Step | Command | What It Does |
|---|---|---|
| 1 | sudo systemctl stop open-webui | Stop Open WebUI before overwriting its data |
| 2 | sudo tar xzf /opt/ai-backup/ai-stack-backup-<DATE>.tar.gz -C / | Extract the backup over the existing data directory |
| 3 | tail -n +2 /opt/ai-backup/model-inventory-<DATE>.txt | while read -r model _; do ollama pull "$model"; done | Re-pull models from the saved inventory list |
| 4 | sudo systemctl start open-webui | Start Open WebUI with the restored data |
fail2ban — nginx Protection
Open WebUI is on the internet (or at least your home network). fail2ban watches nginx logs and bans IPs that show brute-force behavior.
Install fail2ban
# fail2ban is in EPEL on Rocky 9
sudo dnf install -y epel-release
sudo dnf install -y fail2ban fail2ban-firewalld
Configure Jails
Create a jail configuration file with two jails:
sudo tee /etc/fail2ban/jail.d/nginx-ai.conf > /dev/null << 'EOF'
[nginx-http-auth]
enabled = true
port = https
logpath = /var/log/nginx/error.log
maxretry = 5
bantime = 3600
findtime = 600
[nginx-botsearch]
enabled = true
port = https
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 600
EOF
nginx-http-auth — Bans IPs after 5 failed authentication attempts in 10 minutes. The ban lasts 1 hour.
nginx-botsearch — Bans IPs that hit 10 non-existent URLs in 10 minutes (bot scanners probing for vulnerable endpoints). Ban lasts 1 hour.
Both use built-in fail2ban filters — no custom filter files needed.
sudo systemctl enable --now fail2ban
Verify fail2ban
sudo fail2ban-client status
sudo fail2ban-client status nginx-http-auth
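The first command lists the active jails; the second shows counters and any currently banned IPs for one jail. If you ever ban yourself while testing, fail2ban-client can lift the ban manually (203.0.113.7 below is a placeholder address):

# Remove a specific IP from a jail's ban list
sudo fail2ban-client set nginx-http-auth unbanip 203.0.113.7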
Log Rotation
Ollama writes logs through systemd journal, which handles its own rotation. But if you configure file-based logging later, create a logrotate config:
sudo tee /etc/logrotate.d/ollama > /dev/null << 'EOF'
/var/log/ollama/*.log {
weekly
rotate 4
compress
delaycompress
missingok
notifempty
create 0640 root root
}
EOF
For the containers (Open WebUI, Prometheus, Grafana), Podman manages log rotation through its own configuration. The default settings are reasonable for home lab use.
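If you’d rather set an explicit cap, Podman reads a log_size_max setting from containers.conf. A sketch, assuming a Podman version that honors containers.conf.d drop-ins; running containers pick up the change on their next restart:

# Cap each container's log file at roughly 10 MB (value is in bytes)
sudo tee /etc/containers/containers.conf.d/log-limits.conf > /dev/null << 'EOF'
[containers]
log_size_max = 10485760
EOF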
Capacity Planning
VRAM Pressure
Watch the VRAM usage panel in Grafana. When you see it consistently above 90%, you’re running out of room:
- Reduce OLLAMA_MAX_LOADED_MODELS to 1 (if it’s higher). Each loaded model reserves its full VRAM footprint.
- Use smaller quantized models. A q4 quantization uses roughly half the VRAM of q8.
- Upgrade your GPU. If you’re on 8 GB and hitting the ceiling, a 24 GB card (RTX 3090 used, or RTX 4090) is the next step.
Disk Growth
Models are the biggest disk consumer. Monitor /var/lib/ollama/models/:
du -sh /var/lib/ollama/models/
Open WebUI data grows slowly — a few MB per week for typical home use. The backup script keeps this under control.
Prometheus data grows at roughly 1-2 MB per day with three scrape targets. At default retention (15 days), you’re looking at 30-50 MB total.
When to Scale
If you find yourself wanting:
- More concurrent users — increase OLLAMA_NUM_PARALLEL (needs more VRAM per additional parallel slot)
- More models loaded simultaneously — increase OLLAMA_MAX_LOADED_MODELS (needs a bigger GPU)
- Faster inference — upgrade to a card with more VRAM and higher memory bandwidth
- Multiple inference endpoints — that’s multi-node deployment, which is beyond this guide’s scope
What Automation Looks Like
This chapter covered Prometheus, Grafana, GPU monitoring, backups, fail2ban, and log rotation — easily the most configuration-dense chapter in this guide. Here’s what the playbook bundle automates:
Monitoring:
- Creates data directories with correct ownership for Prometheus and Grafana
- Deploys scrape configuration (conditionally includes GPU exporter for GPU hosts)
- Deploys Grafana datasource, dashboard provider, and a pre-built dashboard JSON
- Pulls container images, deploys Quadlet files, enables services
- Waits for health endpoints before proceeding
- Downloads and deploys the GPU exporter binary and systemd unit (GPU hosts only)
Hardening:
- Deploys the logrotate config for Ollama
- Creates the backup directory, deploys the backup script, schedules daily cron
- Installs EPEL, installs fail2ban with firewalld integration, deploys jail config, enables service
That’s six distinct concerns — monitoring, dashboards, GPU metrics, backups, intrusion prevention, and log management — deployed and configured in a single playbook run. The companion playbook bundle is available at RavenForge Press.
Verification Checkpoint
Before moving to Chapter 7, confirm:
- curl -s http://127.0.0.1:9090/-/ready returns ready
- Prometheus targets page shows all targets as UP
- curl -sk https://localhost/grafana/api/health returns 200
- Grafana login works at https://<your-host-ip>/grafana/ with admin credentials
- The AI Stack dashboard shows live data (at least Ollama and Prometheus panels)
- If GPU: GPU panels show utilization, VRAM, and temperature
- sudo /usr/local/bin/ai-backup.sh runs without error
- ls /opt/ai-backup/ shows a backup file
- sudo fail2ban-client status shows both jails active
- sudo systemctl status fail2ban shows active
Your AI stack is deployed, monitored, and hardened. The next chapter covers what to do when things go wrong.