Self-Hosting AI the Right Way

Chapter 6

Monitoring & Operations

In this chapter
  • Prometheus — Metrics Collection
      • Deploy Prometheus
      • Prometheus Quadlet
      • Verify Prometheus
  • Grafana — Visualization
      • Directory Structure
      • Auto-Provision the Prometheus Datasource
      • Dashboard Provider
      • Grafana Quadlet
      • Verify Grafana through nginx
      • Verify Grafana
  • GPU Monitoring with nvidia_gpu_exporter
      • Install the Exporter
      • systemd Service
      • Metrics Exposed
  • The Pre-Built Dashboard
  • Seeing It All Work
  • Backup Strategy
      • What to Back Up
      • The Backup Script
      • Restoring from Backup
  • fail2ban — nginx Protection
      • Install fail2ban
      • Configure Jails
      • Verify fail2ban
  • Log Rotation
  • Capacity Planning
      • VRAM Pressure
      • Disk Growth
      • When to Scale
  • What Automation Looks Like
  • Verification Checkpoint

What you’ll accomplish: Deploy Prometheus and Grafana for real-time monitoring, set up GPU metrics collection, configure automated backups, add fail2ban for nginx protection, and establish a capacity planning baseline.

This chapter is what separates a weekend project from a system you can trust. Free tutorials stop at “it works.” This covers what happens on day 2, day 30, and day 365.

Prometheus — Metrics Collection

Prometheus scrapes metrics from your services at regular intervals and stores them in a time-series database. We’ll scrape three targets:

  • Ollama — exposes inference metrics at http://127.0.0.1:11434/metrics
  • nvidia_gpu_exporter — exposes GPU utilization, VRAM, temperature, and power draw (GPU hosts only)
  • Prometheus itself — for internal health metrics
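
Before wiring these into Prometheus, it's worth probing each endpoint from the host. A minimal check, assuming the ports above; the GPU exporter won't answer until you install it later in this chapter, and if your Ollama version doesn't expose /metrics you'll see a 404 here (the same version caveat noted in the dashboard section later):

# Each target should return 200 once its service is up
curl -s -o /dev/null -w "ollama:   %{http_code}\n" http://127.0.0.1:11434/metrics
curl -s -o /dev/null -w "exporter: %{http_code}\n" http://127.0.0.1:9400/metrics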

Deploy Prometheus

Create the data directory. Prometheus runs as UID 65534 (nobody) inside the container:

sudo mkdir -p /opt/prometheus/data
sudo chown 65534:65534 /opt/prometheus/data

Deploy the scrape configuration:

sudo tee /opt/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']

  - job_name: 'ollama'
    static_configs:
      - targets: ['127.0.0.1:11434']
    metrics_path: /metrics

  # Include this block only if you have a GPU
  - job_name: 'nvidia_gpu'
    static_configs:
      - targets: ['127.0.0.1:9400']
EOF

Note: On CPU-only hosts, remove the nvidia_gpu job block from this file — there’s nothing to scrape without a GPU.

Prometheus Quadlet

sudo tee /etc/containers/systemd/prometheus.container > /dev/null << 'EOF'
[Unit]
Description=Prometheus — metrics collection and alerting
After=ollama.service

[Container]
ContainerName=prometheus
Image=docker.io/prom/prometheus:latest
Network=host
Volume=/opt/prometheus/data:/prometheus:Z
Volume=/opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro,Z

[Service]
Restart=always
TimeoutStartSec=120

[Install]
WantedBy=multi-user.target
EOF
sudo podman pull docker.io/prom/prometheus:latest
sudo systemctl daemon-reload
sudo systemctl start prometheus

Verify Prometheus

# Check the ready endpoint
curl -s http://127.0.0.1:9090/-/ready

Expected: Prometheus Server is Ready.

Browse to http://127.0.0.1:9090/targets (or through your nginx proxy if you’ve added a location block for it) to confirm all scrape targets are UP.
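
If you prefer the terminal, the same target health is available from the Prometheus HTTP API. A quick sketch, assuming jq is installed:

# List each scrape job and its current health ("up" or "down")
curl -s http://127.0.0.1:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'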

Grafana — Visualization

Grafana provides the dashboards. We’ll deploy it with an auto-provisioned Prometheus datasource and a pre-built AI stack dashboard.

Directory Structure

# Grafana runs as UID 472 inside the container
sudo mkdir -p /opt/grafana/{data,provisioning/datasources,provisioning/dashboards,dashboards}
sudo chown -R 472:472 /opt/grafana

Auto-Provision the Prometheus Datasource

sudo tee /opt/grafana/provisioning/datasources/prometheus.yml > /dev/null << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://127.0.0.1:9090
    isDefault: true
    editable: false
EOF

This means Grafana connects to Prometheus automatically on first boot — no manual datasource configuration in the UI.

Dashboard Provider

sudo tee /opt/grafana/provisioning/dashboards/ai-stack.yml > /dev/null << 'EOF'
apiVersion: 1
providers:
  - name: 'AI Stack'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
EOF

You can build your own dashboard or import one. To import a dashboard JSON file, either place it in /opt/grafana/dashboards/ (Grafana picks it up automatically via the provider above) or use the Grafana UI: Dashboards > Import > Upload JSON file. The companion playbook bundle includes a pre-built AI stack dashboard that covers GPU, Ollama, and system metrics out of the box.
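
For the file-based route, a sketch of what dropping the bundled dashboard into place looks like (the filename comes from the bundle; adjust the source path to wherever you unpacked it):

# Copy the dashboard JSON into the provider path and make sure Grafana (UID 472) can read it
sudo cp ai-stack-dashboard.json /opt/grafana/dashboards/
sudo chown 472:472 /opt/grafana/dashboards/ai-stack-dashboard.json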

Grafana Quadlet

Generate a Grafana admin password and save it somewhere you won’t lose it — you’ll need it every time you log in:

openssl rand -hex 24

Copy the output, then create the Quadlet file:

sudo tee /etc/containers/systemd/grafana.container > /dev/null << 'EOF'
[Unit]
Description=Grafana — metrics visualization and dashboards
After=prometheus.service

[Container]
ContainerName=grafana
Image=docker.io/grafana/grafana-oss:latest
Network=host
Volume=/opt/grafana/data:/var/lib/grafana:Z
Volume=/opt/grafana/provisioning:/etc/grafana/provisioning:ro,Z
Volume=/opt/grafana/dashboards:/var/lib/grafana/dashboards:ro,Z

Environment=GF_SERVER_HTTP_PORT=3001
Environment=GF_SERVER_ROOT_URL=https://ai.example.com/grafana/
Environment=GF_SERVER_SERVE_FROM_SUB_PATH=true
Environment=GF_SECURITY_ADMIN_USER=admin
Environment=GF_SECURITY_ADMIN_PASSWORD=<from vault>
Environment=GF_USERS_ALLOW_SIGN_UP=false

[Service]
Restart=always
TimeoutStartSec=120

[Install]
WantedBy=multi-user.target
EOF

Now edit the file and replace <from vault> on the GF_SECURITY_ADMIN_PASSWORD line with the password you generated above. The default username is admin.
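
If you'd rather not open an editor, a one-liner can splice the value in. A sketch, assuming you paste the generated password into the variable below (it will land in your shell history, so clear that afterwards if it matters to you):

GRAFANA_ADMIN_PASSWORD='paste-the-openssl-output-here'
sudo sed -i "s|<from vault>|${GRAFANA_ADMIN_PASSWORD}|" /etc/containers/systemd/grafana.container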

GF_SERVER_ROOT_URL and GF_SERVER_SERVE_FROM_SUB_PATH — These tell Grafana it’s being served at /grafana/ behind nginx, not at the root. Without these, Grafana generates asset URLs and API links that point to / instead of /grafana/, breaking the UI when accessed through the proxy.

sudo podman pull docker.io/grafana/grafana-oss:latest
sudo systemctl daemon-reload
sudo systemctl start grafana

Verify Grafana through nginx

The nginx config from Chapter 5 already includes the /grafana/ location block — no changes needed. Grafana should be accessible through the proxy immediately.

Verify Grafana

# Direct localhost check (confirms the container is healthy)
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:3001/grafana/api/health

# Through nginx (confirms the full proxy chain works)
curl -sk -o /dev/null -w "%{http_code}" https://localhost/grafana/api/health

Expected: 200 for both. Log in at https://<your-host-ip>/grafana/ with the admin credentials. The Prometheus datasource and AI Stack dashboard should already be available.
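
To confirm the datasource was provisioned, and not just that Grafana is up, you can also ask the Grafana API directly. A sketch, assuming jq and the admin credentials from above:

# Should print "Prometheus"
curl -s -u admin:<your-password> http://127.0.0.1:3001/grafana/api/datasources | jq -r '.[].name'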

GPU Monitoring with nvidia_gpu_exporter

If you have a GPU, the nvidia_gpu_exporter exposes metrics that Prometheus scrapes. It runs as a native systemd service (not a container) because it needs direct access to nvidia-smi.

Install the Exporter

# Download the binary
curl -Lo /tmp/nvidia_gpu_exporter.tar.gz \
  https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v1.4.1/nvidia_gpu_exporter_1.4.1_linux_x86_64.tar.gz

# Extract
sudo tar -xzf /tmp/nvidia_gpu_exporter.tar.gz -C /usr/local/bin/ nvidia_gpu_exporter
sudo chmod 755 /usr/local/bin/nvidia_gpu_exporter

systemd Service

sudo tee /etc/systemd/system/nvidia_gpu_exporter.service > /dev/null << 'EOF'
[Unit]
Description=NVIDIA GPU Prometheus Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/nvidia_gpu_exporter --web.listen-address=:9400
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia_gpu_exporter
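
Give it a moment to start, then confirm the exporter answers on the port Prometheus expects:

# A healthy exporter returns a long list of metrics; show the first few
curl -s http://127.0.0.1:9400/metrics | grep -m 5 '^nvidia'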

Metrics Exposed

The exporter calls nvidia-smi and converts the output to Prometheus metrics:

  • nvidia_gpu_utilization — GPU compute utilization (%)
  • nvidia_gpu_memory_used_bytes — VRAM currently in use
  • nvidia_gpu_memory_total_bytes — total VRAM available
  • nvidia_gpu_temperature_celsius — GPU temperature
  • nvidia_gpu_power_draw_watts — current power consumption

These metrics feed the GPU panels in the pre-built Grafana dashboard.
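
You can also query these ad hoc through the Prometheus API rather than waiting on a dashboard. A sketch using the metric names listed above (adjust if your exporter version names them differently):

# Percentage of VRAM currently in use
curl -s 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100'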

The Pre-Built Dashboard

The bundled Grafana dashboard (ai-stack-dashboard.json) includes panels for:

  • GPU utilization over time — shows compute load during inference. Spikes correlate with active requests.
  • VRAM usage — current vs total. When this hits 100%, the next model load will evict the previous one or Ollama will fall back to CPU for the overflow.
  • GPU temperature — thermal throttling starts at 83-90C depending on your card. If you’re consistently above 80C, improve your case airflow.
  • Ollama request metrics — if Ollama exposes request counters (version-dependent), these show request rate and latency.
  • System resources — CPU, RAM, and disk usage from Prometheus node exporter (if you add one later — the dashboard gracefully shows “No Data” for missing metrics).

Tip: The dashboard is a starting point. Grafana dashboards are fully editable — add panels, change time ranges, set up alerts. The JSON file in the bundle is yours to customize.

Seeing It All Work

This is the payoff. Open two browser tabs: Open WebUI (https://ai.example.com/) in one, Grafana’s AI Stack dashboard (https://ai.example.com/grafana/) in the other. Arrange them side by side.

In Open WebUI, pick a model and send it a multi-paragraph prompt — something meaty, like “Explain the trade-offs between microservices and monoliths for a team of five.” Now watch the Grafana tab.

On a GPU host, you’ll see GPU utilization spike from idle to 80-100% within seconds. The VRAM usage panel climbs as the model’s KV cache fills during generation. Inference duration ticks up proportional to the response length. When the response finishes, utilization drops back to near-zero. That’s your GPU doing exactly what you bought it for.

On a CPU-only host, the picture is different but equally informative. System CPU pegs at 100% across all cores. The response streams in slowly — a few words per second instead of a flood. RAM usage climbs. You can feel the difference, and now you can see it in graphs.

Try one more thing: load a second model (pick a different one from the model selector and send a prompt). Watch the VRAM panel — if both models fit, VRAM climbs higher. If they don’t, you’ll see Ollama evict the first model to make room. The eviction shows as a brief VRAM drop followed by the new model loading in. This is OLLAMA_MAX_LOADED_MODELS in action.
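
If you want a third view during this experiment on a GPU host, the raw numbers behind those panels are one command away:

# Refresh nvidia-smi every second; watch VRAM and utilization move as you send prompts
watch -n 1 nvidia-smi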

This is your operational baseline. You now know what “normal” looks like for your hardware. When something feels slow next week, you’ll open Grafana and immediately see whether it’s a GPU memory issue, a CPU bottleneck, or something else entirely.

Backup Strategy

What to Back Up

  • Open WebUI data — /opt/open-webui/data (10-100 MB). Back up: yes. Chat history, user accounts, and settings; irreplaceable.
  • Model inventory — ollama list output (< 1 KB). Back up: yes. Just the list of model names; re-pulling is easy when you know what you had.
  • Grafana dashboards — /opt/grafana/dashboards/ (< 1 MB). Back up: yes. Custom dashboards and modifications.
  • nginx config — /etc/nginx/conf.d/ (< 1 KB). Back up: yes. Easy to recreate, but nice to have.
  • Model files — /var/lib/ollama/models/ (4-40 GB per model). Back up: no. Re-pull with ollama pull; backing up 40 GB blobs is wasteful when the download takes 10 minutes.
  • Prometheus TSDB — /opt/prometheus/data/ (grows over time). Back up: no. Historical metrics are nice to have but not critical; fresh data starts accumulating again as soon as scraping resumes.

The Backup Script

Create a backup script at /usr/local/bin/ai-backup.sh:

sudo tee /usr/local/bin/ai-backup.sh > /dev/null << 'EOF'
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/opt/ai-backup"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/ai-stack-backup-${DATE}.tar.gz"

logger -t ai-backup "Starting AI stack backup"

# Back up Open WebUI data (SQLite DB, chat history, users)
tar czf "${BACKUP_FILE}" \
    -C / \
    "opt/open-webui/data" \
    2>/dev/null || true

# Save model inventory (names only — NOT the multi-GB model files)
ollama list > "${BACKUP_DIR}/model-inventory-${DATE}.txt" 2>/dev/null || true

# Prune backups older than retention period
find "${BACKUP_DIR}" -name "ai-stack-backup-*.tar.gz" -mtime +7 -delete
find "${BACKUP_DIR}" -name "model-inventory-*.txt" -mtime +7 -delete

logger -t ai-backup "AI stack backup completed: ${BACKUP_FILE}"
EOF

Make it executable, create the backup directory, and schedule it via cron:

sudo chmod +x /usr/local/bin/ai-backup.sh
sudo mkdir -p /opt/ai-backup

Add the cron job to root’s crontab (sudo crontab -e):

0 2 * * * /usr/local/bin/ai-backup.sh

Backups older than 7 days are automatically pruned by the script. Adjust the -mtime +7 value in the script if you want longer retention.
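
A quick way to confirm the job actually registered:

# Should echo the schedule line you added
sudo crontab -l | grep ai-backup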

Restoring from Backup

Don’t run these now — this is the procedure for when you need to recover from a backup later.

  1. sudo systemctl stop open-webui — stop Open WebUI before overwriting its data.
  2. sudo tar xzf /opt/ai-backup/ai-stack-backup-<DATE>.tar.gz -C / — extract the backup over the existing data directory.
  3. tail -n +2 /opt/ai-backup/model-inventory-<DATE>.txt | while read -r model _; do ollama pull "$model"; done — re-pull models from the saved inventory list.
  4. sudo systemctl start open-webui — start Open WebUI with the restored data.

fail2ban — nginx Protection

Open WebUI is on the internet (or at least your home network). fail2ban watches nginx logs and bans IPs that show brute-force behavior.

Install fail2ban

# fail2ban is in EPEL on Rocky 9
sudo dnf install -y epel-release
sudo dnf install -y fail2ban fail2ban-firewalld

Configure Jails

Create a jail configuration file with two jails:

sudo tee /etc/fail2ban/jail.d/nginx-ai.conf > /dev/null << 'EOF'
[nginx-http-auth]
enabled = true
port = https
logpath = /var/log/nginx/error.log
maxretry = 5
bantime = 3600
findtime = 600

[nginx-botsearch]
enabled = true
port = https
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 600
EOF

nginx-http-auth — Bans IPs after 5 failed authentication attempts in 10 minutes. The ban lasts 1 hour.

nginx-botsearch — Bans IPs that hit 10 non-existent URLs in 10 minutes (bot scanners probing for vulnerable endpoints). Ban lasts 1 hour.

Both use built-in fail2ban filters — no custom filter files needed.

sudo systemctl enable --now fail2ban

Verify fail2ban

sudo fail2ban-client status
sudo fail2ban-client status nginx-http-auth
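
One command worth knowing before you need it: if you ban yourself while testing (easy to do by fumbling a password a few times), you can lift the ban manually. Example with a placeholder address:

# Unban a specific IP from the nginx-http-auth jail
sudo fail2ban-client set nginx-http-auth unbanip 203.0.113.10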

Log Rotation

Ollama logs to the systemd journal, which handles its own rotation. But if you configure file-based logging later, create a logrotate config:

sudo tee /etc/logrotate.d/ollama > /dev/null << 'EOF'
/var/log/ollama/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
    create 0640 root root
}
EOF

For the containers (Open WebUI, Prometheus, Grafana), Podman manages log rotation through its own configuration. The default settings are reasonable for home lab use.
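
If you want an explicit cap rather than the defaults, Podman reads a log_size_max setting from containers.conf. A sketch using a drop-in file, assuming your Podman version supports containers.conf.d (check containers.conf(5) on your host):

sudo tee /etc/containers/containers.conf.d/log-size.conf > /dev/null << 'EOF'
[containers]
# Cap each container's log at roughly 50 MB (value is in bytes)
log_size_max = 52428800
EOF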

Capacity Planning

VRAM Pressure

Watch the VRAM usage panel in Grafana. When you see it consistently above 90%, you’re running out of room:

  • Reduce OLLAMA_MAX_LOADED_MODELS to 1 (if it’s higher). Each loaded model reserves its full VRAM footprint.
  • Use smaller quantized models. A q4 quantization uses roughly half the VRAM of q8.
  • Upgrade your GPU. If you’re on 8 GB and hitting the ceiling, a 24 GB card (RTX 3090 used, or RTX 4090) is the next step.
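
Outside Grafana, you can spot-check VRAM headroom directly on the host at any time:

# Current VRAM used vs. total, straight from the driver
nvidia-smi --query-gpu=memory.used,memory.total --format=csv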

Disk Growth

Models are the biggest disk consumer. Monitor /var/lib/ollama/models/:

du -sh /var/lib/ollama/models/

Open WebUI data grows slowly — a few MB per week for typical home use. The backup script keeps this under control.

Prometheus data grows at roughly 1-2 MB per day with three scrape targets. At the default retention of 15 days, that works out to a few tens of megabytes total.
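
If you want history beyond the default 15 days, Prometheus accepts a retention flag. A sketch of the extra line for the [Container] section of prometheus.container; note that a Quadlet Exec= replaces the image's default arguments, so the config and storage paths have to be restated (verify the flags against your Prometheus version):

Exec=--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --storage.tsdb.retention.time=30d

After editing, run sudo systemctl daemon-reload and restart prometheus for the change to take effect.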

When to Scale

If you find yourself wanting:

  • More concurrent users — increase OLLAMA_NUM_PARALLEL (needs more VRAM per additional parallel slot)
  • More models loaded simultaneously — increase OLLAMA_MAX_LOADED_MODELS (needs a bigger GPU)
  • Faster inference — upgrade to a card with more VRAM and higher memory bandwidth
  • Multiple inference endpoints — that’s multi-node deployment, which is beyond this guide’s scope

What Automation Looks Like

This chapter covered Prometheus, Grafana, GPU monitoring, backups, fail2ban, and log rotation — easily the most configuration-dense chapter in this guide. Here’s what the playbook bundle automates:

Monitoring:

  • Creates data directories with correct ownership for Prometheus and Grafana
  • Deploys scrape configuration (conditionally includes GPU exporter for GPU hosts)
  • Deploys Grafana datasource, dashboard provider, and a pre-built dashboard JSON
  • Pulls container images, deploys Quadlet files, enables services
  • Waits for health endpoints before proceeding
  • Downloads and deploys the GPU exporter binary and systemd unit (GPU hosts only)

Hardening:

  • Deploys the logrotate config for Ollama
  • Creates the backup directory, deploys the backup script, schedules daily cron
  • Installs EPEL, installs fail2ban with firewalld integration, deploys jail config, enables service

That’s six distinct concerns — monitoring, dashboards, GPU metrics, backups, intrusion prevention, and log management — deployed and configured in a single playbook run. The companion playbook bundle is available at RavenForge Press.

Verification Checkpoint

Before moving to Chapter 7, confirm:

  • curl -s http://127.0.0.1:9090/-/ready returns ready
  • Prometheus targets page shows all targets as UP
  • curl -sk https://localhost/grafana/api/health returns 200
  • Grafana login works at https://<your-host-ip>/grafana/ with admin credentials
  • The AI Stack dashboard shows live data (at least Ollama and Prometheus panels)
  • If GPU: GPU panels show utilization, VRAM, and temperature
  • sudo /usr/local/bin/ai-backup.sh runs without error
  • ls /opt/ai-backup/ shows a backup file
  • sudo fail2ban-client status shows both jails active
  • sudo systemctl status fail2ban shows active

Your AI stack is deployed, monitored, and hardened. The next chapter covers what to do when things go wrong.
