Self-Hosting AI the Right Way

Chapter 6

Monitoring & Operations

In this chapter
  • Prometheus — Metrics Collection
      • Deploy Prometheus
      • Prometheus Quadlet
      • Verify Prometheus
  • Grafana — Visualization
      • Directory Structure
      • Auto-Provision the Prometheus Datasource
      • Dashboard Provider
      • Grafana Quadlet
      • Verify Grafana through nginx
      • Verify Grafana
  • GPU Monitoring with nvidia_gpu_exporter
      • Install the Exporter
      • systemd Service
      • Metrics Exposed
  • The Pre-Built Dashboard
  • Seeing It All Work
  • Backup Strategy
      • What to Back Up
      • The Backup Script
      • Restoring from Backup
  • fail2ban — nginx Protection
      • Install fail2ban
      • Configure Jails
      • Verify fail2ban
  • Log Rotation
  • Capacity Planning
      • VRAM Pressure
      • Disk Growth
      • When to Scale
  • What Automation Looks Like
  • Verification Checkpoint

What you’ll accomplish: Deploy Prometheus and Grafana for real-time monitoring, set up GPU metrics collection, configure automated backups, add fail2ban for nginx protection, and establish a capacity planning baseline.

This chapter is what separates a weekend project from a system you can trust. Free tutorials stop at “it works.” This covers what happens on day 2, day 30, and day 365.

Prometheus — Metrics Collection

Prometheus scrapes metrics from your services at regular intervals and stores them in a time-series database. We’ll scrape three targets:

  • Ollama — exposes inference metrics at http://127.0.0.1:11434/metrics
  • nvidia_gpu_exporter — exposes GPU utilization, VRAM, temperature, and power draw (GPU hosts only)
  • Prometheus itself — for internal health metrics
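
Before wiring these into Prometheus, it's worth probing each endpoint from the host. A minimal check, assuming the ports above; the GPU exporter won't answer until you install it later in this chapter, and if your Ollama version doesn't expose /metrics you'll see a 404 here (the same version caveat noted in the dashboard section later):

# Each target should return 200 once its service is up
curl -s -o /dev/null -w "ollama:   %{http_code}\n" http://127.0.0.1:11434/metrics
curl -s -o /dev/null -w "exporter: %{http_code}\n" http://127.0.0.1:9400/metrics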

Deploy Prometheus

Create the data directory. Prometheus runs as UID 65534 (nobody) inside the container:

sudo mkdir -p /opt/prometheus/data
sudo chown 65534:65534 /opt/prometheus/data

Deploy the scrape configuration:

sudo tee /opt/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']

  - job_name: 'ollama'
    static_configs:
      - targets: ['127.0.0.1:11434']
    metrics_path: /metrics

  # Include this block only if you have a GPU
  - job_name: 'nvidia_gpu'
    static_configs:
      - targets: ['127.0.0.1:9400']
EOF

Note: On CPU-only hosts, remove the nvidia_gpu job block from this file — there’s nothing to scrape without a GPU.

Prometheus Quadlet

sudo tee /etc/containers/systemd/prometheus.container > /dev/null << 'EOF'
[Unit]
Description=Prometheus — metrics collection and alerting
After=ollama.service

[Container]
ContainerName=prometheus
Image=docker.io/prom/prometheus:latest
Network=host
Volume=/opt/prometheus/data:/prometheus:Z
Volume=/opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro,Z

[Service]
Restart=always
TimeoutStartSec=120

[Install]
WantedBy=multi-user.target
EOF
sudo podman pull docker.io/prom/prometheus:latest
sudo systemctl daemon-reload
sudo systemctl start prometheus

Verify Prometheus

# Check the ready endpoint
curl -s http://127.0.0.1:9090/-/ready

Expected: Prometheus Server is Ready.

Browse to http://127.0.0.1:9090/targets (or through your nginx proxy if you’ve added a location block for it) to confirm all scrape targets are UP.
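
If you prefer the terminal, the same target health is available from the Prometheus HTTP API. A quick sketch, assuming jq is installed:

# List each scrape job and its current health ("up" or "down")
curl -s http://127.0.0.1:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'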

Grafana — Visualization

Grafana provides the dashboards. We’ll deploy it with an auto-provisioned Prometheus datasource and a pre-built AI stack dashboard.

Directory Structure

# Grafana runs as UID 472 inside the container
sudo mkdir -p /opt/grafana/{data,provisioning/datasources,provisioning/dashboards,dashboards}
sudo chown -R 472:472 /opt/grafana

Auto-Provision the Prometheus Datasource

sudo tee /opt/grafana/provisioning/datasources/prometheus.yml > /dev/null << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://127.0.0.1:9090
    isDefault: true
    editable: false
EOF

This means Grafana connects to Prometheus automatically on first boot — no manual datasource configuration in the UI.

Dashboard Provider

sudo tee /opt/grafana/provisioning/dashboards/ai-stack.yml > /dev/null << 'EOF'
apiVersion: 1
providers:
  - name: 'AI Stack'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
EOF

You can build your own dashboard or import one. To import a dashboard JSON file, either place it in /opt/grafana/dashboards/ (Grafana picks it up automatically via the provider above) or use the Grafana UI: Dashboards > Import > Upload JSON file. The companion playbook bundle includes a pre-built AI stack dashboard that covers GPU, Ollama, and system metrics out of the box.
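
For the file-based route, a sketch of what dropping the bundled dashboard into place looks like (the filename comes from the bundle; adjust the source path to wherever you unpacked it):

# Copy the dashboard JSON into the provider path and make sure Grafana (UID 472) can read it
sudo cp ai-stack-dashboard.json /opt/grafana/dashboards/
sudo chown 472:472 /opt/grafana/dashboards/ai-stack-dashboard.json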

Grafana Quadlet

Generate a Grafana admin password and save it somewhere you won’t lose it — you’ll need it every time you log in:

openssl rand -hex 24

Copy the output, then create the Quadlet file:

sudo tee /etc/containers/systemd/grafana.container > /dev/null << 'EOF'
[Unit]
Description=Grafana — metrics visualization and dashboards
After=prometheus.service

[Container]
ContainerName=grafana
Image=docker.io/grafana/grafana-oss:latest
Network=host
Volume=/opt/grafana/data:/var/lib/grafana:Z
Volume=/opt/grafana/provisioning:/etc/grafana/provisioning:ro,Z
Volume=/opt/grafana/dashboards:/var/lib/grafana/dashboards:ro,Z

Environment=GF_SERVER_HTTP_PORT=3001
Environment=GF_SERVER_ROOT_URL=https://ai.example.com/grafana/
Environment=GF_SERVER_SERVE_FROM_SUB_PATH=true
Environment=GF_SECURITY_ADMIN_USER=admin
Environment=GF_SECURITY_ADMIN_PASSWORD=<from vault>
Environment=GF_USERS_ALLOW_SIGN_UP=false

[Service]
Restart=always
TimeoutStartSec=120

[Install]
WantedBy=multi-user.target
EOF

Now edit the file and replace <from vault> on the GF_SECURITY_ADMIN_PASSWORD line with the password you generated above. The default username is admin.
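
If you'd rather not open an editor, a one-liner can splice the value in. A sketch, assuming you paste the generated password into the variable below (it will land in your shell history, so clear that afterwards if it matters to you):

GRAFANA_ADMIN_PASSWORD='paste-the-openssl-output-here'
sudo sed -i "s|<from vault>|${GRAFANA_ADMIN_PASSWORD}|" /etc/containers/systemd/grafana.container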

GF_SERVER_ROOT_URL and GF_SERVER_SERVE_FROM_SUB_PATH — These tell Grafana it’s being served at /grafana/ behind nginx, not at the root. Without these, Grafana generates asset URLs and API links that point to / instead of /grafana/, breaking the UI when accessed through the proxy.

sudo podman pull docker.io/grafana/grafana-oss:latest
sudo systemctl daemon-reload
sudo systemctl start grafana

Verify Grafana through nginx

The nginx config from Chapter 5 already includes the /grafana/ location block — no changes needed. Grafana should be accessible through the proxy immediately.

Verify Grafana

# Direct localhost check (confirms the container is healthy)
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:3001/grafana/api/health

# Through nginx (confirms the full proxy chain works)
curl -sk -o /dev/null -w "%{http_code}" https://localhost/grafana/api/health

Expected: 200 for both. Log in at https://<your-host-ip>/grafana/ with the admin credentials. The Prometheus datasource and AI Stack dashboard should already be available.
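
To confirm the datasource was provisioned, and not just that Grafana is up, you can also ask the Grafana API directly. A sketch, assuming jq and the admin credentials from above:

# Should print "Prometheus"
curl -s -u admin:<your-password> http://127.0.0.1:3001/grafana/api/datasources | jq -r '.[].name'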

GPU Monitoring with nvidia_gpu_exporter

If you have a GPU, the nvidia_gpu_exporter exposes metrics that Prometheus scrapes. It runs as a native systemd service (not a container) because it needs direct access to nvidia-smi.

Install the Exporter

# Download the binary
curl -Lo /tmp/nvidia_gpu_exporter.tar.gz \
  https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v1.4.1/nvidia_gpu_exporter_1.4.1_linux_x86_64.tar.gz

# Extract
sudo tar -xzf /tmp/nvidia_gpu_exporter.tar.gz -C /usr/local/bin/ nvidia_gpu_exporter
sudo chmod 755 /usr/local/bin/nvidia_gpu_exporter

systemd Service

sudo tee /etc/systemd/system/nvidia_gpu_exporter.service > /dev/null << 'EOF'
[Unit]
Description=NVIDIA GPU Prometheus Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/nvidia_gpu_exporter --web.listen-address=:9400
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia_gpu_exporter
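
Give it a moment to start, then confirm the exporter answers on the port Prometheus expects:

# A healthy exporter returns a long list of metrics; show the first few
curl -s http://127.0.0.1:9400/metrics | grep -m 5 '^nvidia'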

Metrics Exposed

The exporter calls nvidia-smi and converts the output to Prometheus metrics:

  • nvidia_gpu_utilization — GPU compute utilization (%)
  • nvidia_gpu_memory_used_bytes — VRAM currently in use
  • nvidia_gpu_memory_total_bytes — total VRAM available
  • nvidia_gpu_temperature_celsius — GPU temperature
  • nvidia_gpu_power_draw_watts — current power consumption

These metrics feed the GPU panels in the pre-built Grafana dashboard.
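
You can also query these ad hoc through the Prometheus API rather than waiting on a dashboard. A sketch using the metric names listed above (adjust if your exporter version names them differently):

# Percentage of VRAM currently in use
curl -s 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100'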

The Pre-Built Dashboard

The bundled Grafana dashboard (ai-stack-dashboard.json) includes panels for:

  • GPU utilization over time — shows compute load during inference. Spikes correlate with active requests.
  • VRAM usage — current vs total. When this hits 100%, the next model load will evict the previous one or Ollama will fall back to CPU for the overflow.
  • GPU temperature — thermal throttling starts at 83-90C depending on your card. If you’re consistently above 80C, improve your case airflow.
  • Ollama request metrics — if Ollama exposes request counters (version-dependent), these show request rate and latency.
  • System resources — CPU, RAM, and disk usage from Prometheus node exporter (if you add one later — the dashboard gracefully shows “No Data” for missing metrics).

Tip: The dashboard is a starting point. Grafana dashboards are fully editable — add panels, change time ranges, set up alerts. The JSON file in the bundle is yours to customize.

Seeing It All Work

This is the payoff. Open two browser tabs: Open WebUI (https://ai.example.com/) in one, Grafana’s AI Stack dashboard (https://ai.example.com/grafana/) in the other. Arrange them side by side.

In Open WebUI, pick a model and send it a multi-paragraph prompt — something meaty, like “Explain the trade-offs between microservices and monoliths for a team of five.” Now watch the Grafana tab.

On a GPU host, you’ll see GPU utilization spike from idle to 80-100% within seconds. The VRAM usage panel climbs as the model’s KV cache fills during generation. Inference duration ticks up proportional to the response length. When the response finishes, utilization drops back to near-zero. That’s your GPU doing exactly what you bought it for.

On a CPU-only host, the picture is different but equally informative. System CPU pegs at 100% across all cores. The response streams in slowly — a few words per second instead of a flood. RAM usage climbs. You can feel the difference, and now you can see it in graphs.

Try one more thing: load a second model (pick a different one from the model selector and send a prompt). Watch the VRAM panel — if both models fit, VRAM climbs higher. If they don’t, you’ll see Ollama evict the first model to make room. The eviction shows as a brief VRAM drop followed by the new model loading in. This is OLLAMA_MAX_LOADED_MODELS in action.
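
If you want a third view during this experiment on a GPU host, the raw numbers behind those panels are one command away:

# Refresh nvidia-smi every second; watch VRAM and utilization move as you send prompts
watch -n 1 nvidia-smi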

This is your operational baseline. You now know what “normal” looks like for your hardware. When something feels slow next week, you’ll open Grafana and immediately see whether it’s a GPU memory issue, a CPU bottleneck, or something else entirely.

Backup Strategy

What to Back Up

  • Open WebUI data — /opt/open-webui/data (10-100 MB). Back up: yes. Chat history, user accounts, and settings; irreplaceable.
  • Model inventory — ollama list output (< 1 KB). Back up: yes. Just the list of model names; re-pulling is easy when you know what you had.
  • Grafana dashboards — /opt/grafana/dashboards/ (< 1 MB). Back up: yes. Custom dashboards and modifications.
  • nginx config — /etc/nginx/conf.d/ (< 1 KB). Back up: yes. Easy to recreate, but nice to have.
  • Model files — /var/lib/ollama/models/ (4-40 GB per model). Back up: no. Re-pull with ollama pull; backing up 40 GB blobs is wasteful when the download takes 10 minutes.
  • Prometheus TSDB — /opt/prometheus/data/ (grows over time). Back up: no. Historical metrics are nice to have but not critical; fresh data starts accumulating again as soon as scraping resumes.

The Backup Script

Create a backup script at /usr/local/bin/ai-backup.sh:

sudo tee /usr/local/bin/ai-backup.sh > /dev/null << 'EOF'
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/opt/ai-backup"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/ai-stack-backup-${DATE}.tar.gz"

logger -t ai-backup "Starting AI stack backup"

# Back up Open WebUI data (SQLite DB, chat history, users)
tar czf "${BACKUP_FILE}" \
    -C / \
    "opt/open-webui/data" \
    2>/dev/null || true

# Save model inventory (names only — NOT the multi-GB model files)
ollama list > "${BACKUP_DIR}/model-inventory-${DATE}.txt" 2>/dev/null || true

# Prune backups older than retention period
find "${BACKUP_DIR}" -name "ai-stack-backup-*.tar.gz" -mtime +7 -delete
find "${BACKUP_DIR}" -name "model-inventory-*.txt" -mtime +7 -delete

logger -t ai-backup "AI stack backup completed: ${BACKUP_FILE}"
EOF

Make it executable, create the backup directory, and schedule it via cron:

sudo chmod +x /usr/local/bin/ai-backup.sh
sudo mkdir -p /opt/ai-backup

Add the cron job to root’s crontab (sudo crontab -e):

0 2 * * * /usr/local/bin/ai-backup.sh

Backups older than 7 days are automatically pruned by the script. Adjust the -mtime +7 value in the script if you want longer retention.
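
A quick way to confirm the job actually registered:

# Should echo the schedule line you added
sudo crontab -l | grep ai-backup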

Restoring from Backup

Don’t run these now — this is the procedure for when you need to recover from a backup later.

  1. sudo systemctl stop open-webui — stop Open WebUI before overwriting its data.
  2. sudo tar xzf /opt/ai-backup/ai-stack-backup-<DATE>.tar.gz -C / — extract the backup over the existing data directory.
  3. tail -n +2 /opt/ai-backup/model-inventory-<DATE>.txt | while read -r model _; do ollama pull "$model"; done — re-pull models from the saved inventory list.
  4. sudo systemctl start open-webui — start Open WebUI with the restored data.

fail2ban — nginx Protection

Open WebUI is on the internet (or at least your home network). fail2ban watches nginx logs and bans IPs that show brute-force behavior.

Install fail2ban

# fail2ban is in EPEL on Rocky 9
sudo dnf install -y epel-release
sudo dnf install -y fail2ban fail2ban-firewalld

Configure Jails

Create a jail configuration file with two jails:

sudo tee /etc/fail2ban/jail.d/nginx-ai.conf > /dev/null << 'EOF'
[nginx-http-auth]
enabled = true
port = https
logpath = /var/log/nginx/error.log
maxretry = 5
bantime = 3600
findtime = 600

[nginx-botsearch]
enabled = true
port = https
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 600
EOF

nginx-http-auth — Bans IPs after 5 failed authentication attempts in 10 minutes. The ban lasts 1 hour.

nginx-botsearch — Bans IPs that hit 10 non-existent URLs in 10 minutes (bot scanners probing for vulnerable endpoints). Ban lasts 1 hour.

Both use built-in fail2ban filters — no custom filter files needed.

sudo systemctl enable --now fail2ban

Verify fail2ban

sudo fail2ban-client status
sudo fail2ban-client status nginx-http-auth
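
One command worth knowing before you need it: if you ban yourself while testing (easy to do by fumbling a password a few times), you can lift the ban manually. Example with a placeholder address:

# Unban a specific IP from the nginx-http-auth jail
sudo fail2ban-client set nginx-http-auth unbanip 203.0.113.10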

Log Rotation

Ollama logs to the systemd journal, which handles its own rotation. But if you configure file-based logging later, create a logrotate config:

sudo tee /etc/logrotate.d/ollama > /dev/null << 'EOF'
/var/log/ollama/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
    create 0640 root root
}
EOF

For the containers (Open WebUI, Prometheus, Grafana), Podman manages log rotation through its own configuration. The default settings are reasonable for home lab use.
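
If you want an explicit cap rather than the defaults, Podman reads a log_size_max setting from containers.conf. A sketch using a drop-in file, assuming your Podman version supports containers.conf.d (check containers.conf(5) on your host):

sudo tee /etc/containers/containers.conf.d/log-size.conf > /dev/null << 'EOF'
[containers]
# Cap each container's log at roughly 50 MB (value is in bytes)
log_size_max = 52428800
EOF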

Capacity Planning

VRAM Pressure

Watch the VRAM usage panel in Grafana. When you see it consistently above 90%, you’re running out of room:

  • Reduce OLLAMA_MAX_LOADED_MODELS to 1 (if it’s higher). Each loaded model reserves its full VRAM footprint.
  • Use smaller quantized models. A q4 quantization uses roughly half the VRAM of q8.
  • Upgrade your GPU. If you’re on 8 GB and hitting the ceiling, a 24 GB card (RTX 3090 used, or RTX 4090) is the next step.
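
Outside Grafana, you can spot-check VRAM headroom directly on the host at any time:

# Current VRAM used vs. total, straight from the driver
nvidia-smi --query-gpu=memory.used,memory.total --format=csv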

Disk Growth

Models are the biggest disk consumer. Monitor /var/lib/ollama/models/:

du -sh /var/lib/ollama/models/

Open WebUI data grows slowly — a few MB per week for typical home use. The backup script keeps this under control.

Prometheus data grows at roughly 1-2 MB per day with three scrape targets. At the default retention of 15 days, that works out to a few tens of megabytes total.
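
If you want history beyond the default 15 days, Prometheus accepts a retention flag. A sketch of the extra line for the [Container] section of prometheus.container; note that a Quadlet Exec= replaces the image's default arguments, so the config and storage paths have to be restated (verify the flags against your Prometheus version):

Exec=--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --storage.tsdb.retention.time=30d

After editing, run sudo systemctl daemon-reload and restart prometheus for the change to take effect.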

When to Scale

If you find yourself wanting:

  • More concurrent users — increase OLLAMA_NUM_PARALLEL (needs more VRAM per additional parallel slot)
  • More models loaded simultaneously — increase OLLAMA_MAX_LOADED_MODELS (needs a bigger GPU)
  • Faster inference — upgrade to a card with more VRAM and higher memory bandwidth
  • Multiple inference endpoints — that’s multi-node deployment, which is beyond this guide’s scope

What Automation Looks Like

This chapter covered Prometheus, Grafana, GPU monitoring, backups, fail2ban, and log rotation — easily the most configuration-dense chapter in this guide. Here’s what the playbook bundle automates:

Monitoring:

  • Creates data directories with correct ownership for Prometheus and Grafana
  • Deploys scrape configuration (conditionally includes GPU exporter for GPU hosts)
  • Deploys Grafana datasource, dashboard provider, and a pre-built dashboard JSON
  • Pulls container images, deploys Quadlet files, enables services
  • Waits for health endpoints before proceeding
  • Downloads and deploys the GPU exporter binary and systemd unit (GPU hosts only)

Hardening:

  • Deploys the logrotate config for Ollama
  • Creates the backup directory, deploys the backup script, schedules daily cron
  • Installs EPEL, installs fail2ban with firewalld integration, deploys jail config, enables service

That’s six distinct concerns — monitoring, dashboards, GPU metrics, backups, intrusion prevention, and log management — deployed and configured in a single playbook run. The companion playbook bundle is available at RavenForge Press.

Verification Checkpoint

Before moving to Chapter 7, confirm:

  • curl -s http://127.0.0.1:9090/-/ready returns ready
  • Prometheus targets page shows all targets as UP
  • curl -sk https://localhost/grafana/api/health returns 200
  • Grafana login works at https://<your-host-ip>/grafana/ with admin credentials
  • The AI Stack dashboard shows live data (at least Ollama and Prometheus panels)
  • If GPU: GPU panels show utilization, VRAM, and temperature
  • sudo /usr/local/bin/ai-backup.sh runs without error
  • ls /opt/ai-backup/ shows a backup file
  • sudo fail2ban-client status shows both jails active
  • sudo systemctl status fail2ban shows active

Your AI stack is deployed, monitored, and hardened. The next chapter covers what to do when things go wrong.
