What you’ll accomplish: Know where to look when things break, understand the 8 most common failure modes, and have a diagnostic sequence for each one.
The Quick Diagnostic Sequence
When something isn’t working, run this first:
# Service status
systemctl status ollama
sudo systemctl status open-webui
sudo systemctl status nginx
sudo systemctl status prometheus
sudo systemctl status grafana
# Recent logs
journalctl -u ollama --since "10 minutes ago" --no-pager
sudo podman logs open-webui 2>&1 | tail -30
sudo tail -30 /var/log/nginx/error.log
# SELinux denials (catches silent failures)
sudo ausearch -m avc -ts recent
# Disk and memory
df -h / /var/lib/ollama/models
free -h
# GPU (if applicable)
nvidia-smi
Nine times out of ten, the answer is in one of those outputs.
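If you run this sequence often, it is worth wrapping in a script. A minimal sketch, assuming the service names and model path used in this chapter; adjust them to your layout:
#!/usr/bin/env bash
# quick-diag.sh -- the standard first-look diagnostic in one pass. Run with sudo.
set -u

for unit in ollama open-webui nginx prometheus grafana; do
    echo "=== ${unit} ==="
    systemctl status "$unit" --no-pager --lines=0
done

echo "=== ollama: last 10 minutes of logs ==="
journalctl -u ollama --since "10 minutes ago" --no-pager | tail -30

echo "=== SELinux denials ==="
ausearch -m avc -ts recent

echo "=== disk and memory ==="
df -h / /var/lib/ollama/models
free -h

echo "=== GPU (skipped on CPU-only hosts) ==="
command -v nvidia-smi >/dev/null && nvidia-smi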
The 8 Problems That Cost You Hours
Problem 1: CUDA Out of Memory — But nvidia-smi Shows Free VRAM
Symptom: Ollama logs show CUDA out of memory or the model silently falls back to CPU. But nvidia-smi shows hundreds of MB or even gigabytes of “free” memory.
Cause: VRAM fragmentation. The model needs a contiguous block of memory, and the free VRAM is scattered across small fragments. This happens when multiple models have been loaded and unloaded, or when OLLAMA_MAX_LOADED_MODELS is set higher than 1.
Fix:
# Check what's loaded
ollama ps
# Unload all models (frees VRAM)
# Simply restart Ollama — it starts clean
sudo systemctl restart ollama
# Prevent recurrence: keep only one model loaded
# In /etc/systemd/system/ollama.service.d/override.conf:
# Environment="OLLAMA_MAX_LOADED_MODELS=1"
If you consistently hit OOM with a model that should fit, the model’s actual VRAM requirement is higher than the advertised size. Try a more aggressively quantized version (e.g., q4_0 instead of q4_K_M).
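For example, a sketch of switching to a smaller quantization. The model tags here are illustrative only; check which quantizations are actually published for the model you run:
# Illustrative tags; substitute the model you actually use
ollama pull llama3.1:8b-instruct-q4_0
ollama run llama3.1:8b-instruct-q4_0 "hello"
ollama ps    # the PROCESSOR column should now read "100% GPU"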
Problem 2: SELinux Denials After Moving Model Storage
Symptom: Ollama can’t read or write models after you moved OLLAMA_MODELS to a new path. ollama pull fails with permission errors. journalctl -u ollama shows access denials.
Cause: The new directory doesn’t have the correct SELinux file context. Files inherit context from their parent directory, and your new mount point likely has default_t or mnt_t context instead of what Ollama expects.
Fix:
# Check current context on the new directory
ls -Zd /new/path/to/models
# Apply a suitable context (var_lib_t works for /var paths)
sudo semanage fcontext -a -t var_lib_t "/new/path/to/models(/.*)?"
sudo restorecon -Rv /new/path/to/models
# Verify
ls -Z /new/path/to/models
Also check that the ollama user has filesystem permissions:
sudo chown -R ollama:ollama /new/path/to/models
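To confirm both layers (SELinux context and classic permissions) are right, a quick check run as the ollama user; /new/path/to/models is the same placeholder as above:
# Can the ollama user traverse and list the new path?
sudo -u ollama ls -l /new/path/to/models
# Every parent directory also needs execute permission for the ollama user
namei -l /new/path/to/models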
Problem 3: Podman Network Namespace — Open WebUI Can’t Reach Ollama
Symptom: Open WebUI shows “Ollama not reachable” or “Could not connect to Ollama.” Ollama is running fine — curl http://127.0.0.1:11434/api/tags works from the host.
Cause: If the Open WebUI container is using Podman’s default bridge network instead of Network=host, 127.0.0.1 inside the container points to the container’s own loopback, not the host’s. Ollama is listening on the host’s loopback and the container can’t reach it.
Fix: Verify the Quadlet file uses host networking:
# /etc/containers/systemd/open-webui.container
[Container]
Network=host
If you’re already using Network=host and it still doesn’t work:
# Verify Ollama is actually listening
sudo ss -tlnp | grep 11434
# Verify the OLLAMA_HOST binding
grep OLLAMA_HOST /etc/systemd/system/ollama.service.d/override.conf
If OLLAMA_HOST is set to 127.0.0.1:11434, that’s correct for host networking. If it’s set to 0.0.0.0:11434, that also works but exposes the API to the network (which we don’t want).
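You can also test from inside the container's network namespace to see which side is failing. This assumes the image ships python3 (Open WebUI is a Python application); if curl happens to be present in the image, a plain curl to the same URL works too:
# With Network=host this prints 200; on the default bridge network it fails to connect
sudo podman exec open-webui python3 -c "import urllib.request; print(urllib.request.urlopen('http://127.0.0.1:11434/api/tags', timeout=5).status)"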
Problem 4: nginx Timeout on Streaming Responses
Symptom: Long inference responses get cut off mid-stream. The browser shows a partial response and then stops. nginx error log shows upstream timed out (110: Connection timed out).
Cause: The default proxy_read_timeout is 60 seconds. A large model on CPU can easily take longer than that for a single response, especially with long prompts.
Fix: Verify these lines in /etc/nginx/conf.d/ai-stack.conf:
proxy_read_timeout 86400s;
proxy_send_timeout 86400s;
If you changed the config, reload nginx:
sudo nginx -t && sudo systemctl reload nginx
86400 seconds is 24 hours. That’s not a typo — if the connection is actually dead, TCP keepalive detects it. The timeout is just a safety net against nginx killing connections that are legitimately streaming.
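For reference, a sketch of what the streaming-relevant part of the proxy block can look like. The timeouts are the values from this chapter; the upstream port and the remaining directives are a common streaming-friendly baseline, not a verbatim copy of the chapter's config:
# /etc/nginx/conf.d/ai-stack.conf (excerpt, illustrative)
location / {
    proxy_pass          http://127.0.0.1:8080;   # Open WebUI; adjust to your port
    proxy_http_version  1.1;
    proxy_set_header    Upgrade $http_upgrade;    # keep WebSockets working
    proxy_set_header    Connection "upgrade";
    proxy_buffering     off;                      # stream tokens as they arrive
    proxy_read_timeout  86400s;
    proxy_send_timeout  86400s;
}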
Problem 5: GPU Passthrough “Code 43” in Proxmox
Symptom: nvidia-smi returns an error after installing drivers. dmesg | grep nvidia shows NVRM: GPU does not have the correct resources assigned, or Windows Device Manager shows "Code 43."
Cause: NVIDIA drivers detect they’re running inside a hypervisor and refuse to initialize. This was historically an anti-VM measure in consumer GPU drivers (GeForce series). Quadro and Tesla cards don’t have this issue.
Note: NVIDIA drivers from 2021+ (version 465+) have removed the hypervisor detection that caused Code 43 on Linux KVM guests. If you’re running a recent driver on a Linux VM, you’re unlikely to hit this. Code 43 is primarily a Windows VM problem now.
Fix (try in order):
First: Ensure the CPU type is host — this is required for GPU passthrough regardless of Code 43:
# /etc/pve/qemu-server/<VMID>.conf
cpu: host
Second: If you’re getting Code 43 (especially on Windows VMs), add hidden=1:
cpu: host,hidden=1
Or via the Proxmox UI: VM > Hardware > Processor > Type: host, then add hidden=1 to the args.
Last resort: For stubborn cards or older drivers, add the full hypervisor-hiding args:
# Additional args in the VM config
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=proxmox,kvm=off'
Reboot the VM after making these changes.
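After the reboot, you can confirm from inside the VM that the card is visible and the driver claimed it. These are generic Linux checks, not Proxmox-specific:
# Is the GPU on the VM's PCI bus, and which kernel driver is bound to it?
lspci -nnk | grep -iA3 nvidia
# Any initialization errors from the NVIDIA kernel module?
sudo dmesg | grep -i nvrm | tail -20
# Final confirmation
nvidia-smi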
Problem 6: Ollama Loads Model to CPU Despite GPU Being Available
Symptom: ollama ps shows the model running on CPU. nvidia-smi shows low or zero GPU utilization. Inference is painfully slow.
Cause: There are several possible causes. Check them in order.
Check 1: CUDA not detected. Look for CUDA initialization messages in Ollama's startup logs:
journalctl -u ollama | grep -i cuda
If there's no CUDA mention, the drivers aren't installed correctly (back to Chapter 3).
Check 2: Wrong CUDA_VISIBLE_DEVICES. If you've set this in the override config and it points to a nonexistent device, Ollama won't see the GPU:
grep CUDA /etc/systemd/system/ollama.service.d/override.conf
Check 3: SELinux blocking GPU access. The most subtle cause. Check for denials:
sudo ausearch -m avc -ts recent | grep nvidia
If you see denials, the SELinux device context from Chapter 3 wasn't applied correctly. Re-run:
sudo semanage fcontext -a -t xserver_misc_device_t "/dev/nvidia.*"
sudo restorecon -Rv /dev/nvidia*
Check 4: Model too large for VRAM. Ollama silently falls back to CPU when a model doesn't fit in available VRAM. Compare the free VRAM reported by nvidia-smi against the model's size, and try a smaller or more quantized model.
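To confirm where the layers actually landed, check the load-time log line and the live process view. The exact log wording varies between Ollama versions, so grep loosely:
# The runner logs how many layers were offloaded to the GPU at load time
journalctl -u ollama --since "10 minutes ago" --no-pager | grep -i offload
# Live view: the PROCESSOR column should say "100% GPU", not "100% CPU" or a split
ollama ps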
Problem 7: Open WebUI Shows “Ollama Not Reachable”
Symptom: The Open WebUI interface loads, but when you try to chat, you get “Ollama not reachable” or “Could not connect to Ollama server.”
Cause: This is a connectivity issue between the Open WebUI container and Ollama. Work through these checks in order:
Check 1: Is Ollama running?
systemctl status ollama
curl -s http://127.0.0.1:11434/api/tags
Check 2: Is the container using host networking?
sudo podman inspect open-webui | grep -i networkmode
Should show host. If not, fix the Quadlet file (see Problem 3 above).
Check 3: Is the environment variable correct?
sudo podman exec open-webui env | grep OLLAMA
Should show OLLAMA_BASE_URL=http://127.0.0.1:11434. If the port or host is wrong, update the Quadlet file and restart.
Check 4: Is there a firewall issue?
# Ollama should be on loopback — firewall shouldn't matter
# But check anyway
sudo ss -tlnp | grep 11434
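If any of these checks leads you to edit the Quadlet file, the change only takes effect after the unit is regenerated and the container restarted. A minimal sketch, assuming the unit name used in this chapter:
sudo systemctl daemon-reload               # regenerate the unit from the Quadlet file
sudo systemctl restart open-webui
sudo podman ps --filter name=open-webui    # confirm the container came back up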
Problem 8: Grafana “No Data” Panels
Symptom: Grafana dashboard loads but panels show “No Data” or “No data points.”
Cause: Prometheus isn’t scraping the targets, the targets aren’t exposing metrics, or the nginx proxy isn’t reaching Grafana.
Fix — check each layer:
# Is Prometheus running?
curl -s http://127.0.0.1:9090/-/ready
# Are targets up? Check the targets page
curl -s http://127.0.0.1:9090/api/v1/targets | python3 -m json.tool | grep -A5 health
# Is Ollama exposing metrics?
curl -s http://127.0.0.1:11434/metrics | head -20
# Is the GPU exporter running? (GPU hosts only)
curl -s http://127.0.0.1:9400/metrics | head -20
# Is Grafana healthy on localhost? (bypasses nginx)
curl -s http://127.0.0.1:3001/grafana/api/health
# Is the nginx proxy to Grafana working?
curl -sk https://localhost/grafana/api/health
If Grafana works on localhost but not through nginx:
- Check that the /grafana/ location block exists in /etc/nginx/conf.d/ai-stack.conf
- Reload nginx: sudo nginx -t && sudo systemctl reload nginx
- Check the SELinux boolean: getsebool httpd_can_network_connect. It must be on for nginx to proxy to Grafana.
If targets show as DOWN:
- Check that the ports in prometheus.yml match the actual service ports
- On GPU hosts, verify nvidia_gpu_exporter is running: systemctl status nvidia_gpu_exporter
If targets are UP but Grafana still shows no data:
- Check that the Grafana datasource is configured to point to http://127.0.0.1:9090
- The dashboard may use metric names that differ between Ollama versions. Check the Explore tab in Grafana, or query Prometheus directly (see the sketch below), to see what metrics are actually available.
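To see exactly which metric names Prometheus has, you can ask it directly instead of clicking through Explore. A small sketch using the standard Prometheus HTTP API on the port from this chapter:
# Every metric name Prometheus currently knows about
curl -s 'http://127.0.0.1:9090/api/v1/label/__name__/values' | python3 -m json.tool | head -40
# Spot-check that each scrape target reports up == 1
curl -s 'http://127.0.0.1:9090/api/v1/query?query=up' | python3 -m json.tool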
Log Locations Quick Reference
| Service | Primary Log | How to View |
|---|---|---|
| Ollama | systemd journal | journalctl -u ollama -f |
| Open WebUI | Podman container log | sudo podman logs -f open-webui |
| nginx | File-based | sudo tail -f /var/log/nginx/error.log |
| nginx access | File-based | sudo tail -f /var/log/nginx/access.log |
| Prometheus | Podman container log | sudo podman logs -f prometheus |
| Grafana | Podman container log | sudo podman logs -f grafana |
| nvidia_gpu_exporter | systemd journal | journalctl -u nvidia_gpu_exporter -f |
| fail2ban | File-based | sudo tail -f /var/log/fail2ban.log |
| SELinux denials | audit log | sudo ausearch -m avc -ts recent |
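When a problem spans several services, the journald-backed logs can be followed in a single stream. A small example; whether the container units also appear in the journal depends on their log driver, so treat the second command as an optional shortcut rather than a replacement for podman logs:
# Follow Ollama and the GPU exporter together
journalctl -u ollama -u nvidia_gpu_exporter -f
# Quadlet-managed containers often log to the journal under their unit name as well
journalctl -u open-webui -f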
The “It Worked Yesterday” Checklist
When something that was working stops, check these in order:
- Disk space. df -h /. Full disks cause bizarre failures everywhere.
- Memory. free -h. The OOM killer may have stopped a service.
- Service status. systemctl status ollama open-webui nginx. Did something crash and not restart?
- SELinux. sudo ausearch -m avc -ts recent. A system update may have changed policy.
- Container images. sudo podman ps -a. Did an auto-update pull a broken image?
- DNS/network. ping ai.example.com. Did your DNS change?
- Certificate expiry. openssl s_client -connect ai.example.com:443 </dev/null 2>/dev/null | openssl x509 -noout -dates. Self-signed certs expire after 365 days by default.
- Package updates. dnf history list. Did a kernel update break the NVIDIA drivers? (Reboot + DKMS rebuild needed; see the sketch after this list.)
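If the last item is the culprit, here is a sketch of the usual recovery on a DKMS-based driver install; it assumes the driver was set up the way Chapter 3 describes:
# After rebooting into the new kernel: did DKMS build the module for it?
dkms status
lsmod | grep nvidia
# If the module is missing, rebuild it for the running kernel and restart Ollama
sudo dkms autoinstall
sudo systemctl restart ollama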
What’s Next
You have a working, monitored, hardened AI stack on Rocky Linux. Ollama serves models, Open WebUI gives your users a clean interface, nginx handles SSL, Grafana shows you what’s happening, and backups run on autopilot. That’s a solid foundation — and it’s designed to be extended.
Natural next steps that build on this base:
- RAG pipeline — connect Open WebUI to a document store so your models can answer questions about your own files, wikis, or knowledge base. This plugs into the existing Open WebUI + Ollama setup without replacing anything.
- vLLM backend — swap Ollama for vLLM when you need higher concurrency or continuous batching. The nginx proxy and monitoring layers stay the same; only the inference backend changes.
- Multi-model load balancing — distribute requests across multiple GPU hosts when one card isn’t enough.
- OAuth2 / SSO access control — replace Open WebUI’s built-in auth with your existing identity provider for team deployments.
Each of these extends the ai-stack-deploy/ playbook bundle rather than replacing it. Check RavenForge Press for upcoming add-on guides as they become available.