What you’ll accomplish: Install Ollama as a systemd service, configure it with production-ready settings, pull your first model, and verify the API is serving.
What Ollama Actually Does
Ollama is an inference server. It downloads, loads, and serves language models behind a REST API. When Open WebUI (Chapter 5) sends a chat message, it hits Ollama’s API at http://127.0.0.1:11434, Ollama loads the requested model into GPU memory (or system RAM for CPU), runs inference, and streams the response back.
The key things Ollama manages:
- Model downloading and storage — ollama pull fetches models and stores them in a structured directory
- Model loading/unloading — loads models into VRAM on demand, unloads when memory is needed
- Inference serving — REST API that handles chat completions, embeddings, and raw generation
- Concurrent requests — configurable parallelism for multiple users
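To make that flow concrete, here’s a minimal chat request against the API (a hand-written example assuming the tinyllama model pulled later in this chapter):
# Minimal chat request; "stream": false returns a single JSON object
curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "tinyllama",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}' | python3 -m json.tool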
Installation
Create the Ollama User
Ollama runs as a dedicated system user. Don’t run it as root or your personal account:
# Create a system group and user for Ollama
sudo groupadd --system ollama
sudo useradd --system --gid ollama --shell /usr/sbin/nologin \
--home /usr/share/ollama ollama
# Create the home directory
sudo mkdir -p /usr/share/ollama
sudo chown ollama:ollama /usr/share/ollama
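A quick sanity check confirms the account exists with its nologin shell:
# Verify the ollama account and its shell
getent passwd ollama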
Create the Models Directory
Models need their own directory with enough disk space. Don’t use the ollama user’s home directory — /var/lib/ollama/models is the right place because it follows Linux filesystem conventions, has predictable SELinux contexts, and makes disk planning straightforward.
sudo mkdir -p /var/lib/ollama/models
sudo chown ollama:ollama /var/lib/ollama/models
Download and Install the Binary
Ollama is distributed as a .tar.zst archive containing the binary and supporting libraries:
# Install zstd for archive extraction (Rocky 9 may not have it by default)
sudo dnf install -y zstd
# Download and extract Ollama to /usr/local
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tar.zst \
| sudo zstd -d --stdout \
| sudo tar -xf - -C /usr/local
# Verify the binary is in place
ls -la /usr/local/bin/ollama
Note: If you need a specific version instead of latest, the URL format is
https://github.com/ollama/ollama/releases/download/v<VERSION>/ollama-linux-amd64.tar.zst.
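A pinned install is the same pipeline with the versioned URL; the version number below is only an example:
# Pin a specific release (version shown is illustrative)
OLLAMA_VERSION=0.5.7
curl -fsSL "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tar.zst" \
| sudo zstd -d --stdout \
| sudo tar -xf - -C /usr/local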
Create the systemd Service
sudo tee /etc/systemd/system/ollama.service > /dev/null << 'EOF'
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="HOME=/usr/share/ollama"
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable ollama
Don’t start it yet — we’ll configure the override first.
Configuration
systemd Override
The override file is where the real configuration lives. Instead of editing the service unit directly, we use a drop-in override:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
Let’s break down each variable:
OLLAMA_HOST=127.0.0.1:11434 — Bind to localhost only. This is critical for security: Ollama has no authentication. If you bind to 0.0.0.0, anyone on your network can run inference on your GPU. nginx (Chapter 5) handles external access with SSL and authentication.
OLLAMA_MODELS=/var/lib/ollama/models — Where models are stored on disk. Change this if you have a separate data drive or want models on a specific mount point.
OLLAMA_NUM_PARALLEL=1 — How many inference requests Ollama processes simultaneously. Each parallel request needs additional VRAM for its KV cache. Start with 1 and increase if you have VRAM headroom and multiple users.
OLLAMA_MAX_LOADED_MODELS=1 — How many models stay loaded in VRAM simultaneously. Each loaded model reserves its full VRAM allocation. For 8 GB GPUs, keep this at 1. For 24 GB GPUs with smaller models, you might increase to 2.
Tip: If you have multiple GPUs, add Environment="CUDA_VISIBLE_DEVICES=0" to target a specific GPU, or "0,1" to use both.
Start Ollama
sudo systemctl daemon-reload
sudo systemctl start ollama
Watch the startup:
sudo journalctl -u ollama -f
Look for the line Listening on 127.0.0.1:11434 — that means Ollama is ready. On GPU hosts, you’ll also see lines about CUDA and your GPU model. On CPU-only hosts, you’ll see inference compute ... library=cpu — that’s normal. Press Ctrl+C to exit the log viewer once you’ve confirmed it’s listening.
Verify the API
# Check the API is responding
curl -s http://127.0.0.1:11434/api/tags | python3 -m json.tool
This should return a JSON object with an empty models array (we haven’t pulled anything yet). If you get connection refused, check systemctl status ollama and the journal logs.
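If you’re scripting this check, a retry loop is more forgiving of slow startups; this sketch uses the same retry budget the automation section mentions later (12 tries, 5 seconds apart):
# Wait up to 60 seconds for the API before giving up
for i in $(seq 1 12); do
  curl -sf http://127.0.0.1:11434/api/tags > /dev/null && break
  sleep 5
done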
Model Management
Pulling Models
Pick one model to start with. You don’t need both — tinyllama is a quick download to verify the stack works end-to-end, while llama3.1:8b is what you’d actually use day-to-day. You can always pull more later.
# Option A: Small model for quick testing (~637 MB)
ollama pull tinyllama
# Option B: Production-quality 8B model (~4.7 GB)
ollama pull llama3.1:8b
The first pull takes a while depending on your bandwidth. Subsequent pulls of the same model are instant (cached layers).
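Once a pull finishes, a one-shot prompt is the quickest way to confirm inference works (swap in whichever model you pulled):
# Load the model, run one generation, and exit
ollama run tinyllama "Say hello in one short sentence."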
Model Recommendations by VRAM
| VRAM | Recommended Models | Notes |
|---|---|---|
| CPU only | tinyllama, phi3:mini | Small models only. Expect 5-10 tokens/sec. |
| 8 GB | llama3.1:8b, mistral:7b, gemma2:9b | The sweet spot. Fast inference, good quality. |
| 16 GB | llama3.1:8b (full precision), codellama:34b-instruct-q4 | Room for larger or higher-precision models. |
| 24 GB | llama3.1:70b-q4, deepseek-coder-v2:33b, mixtral:8x7b | The big models. Excellent quality; a 70B q4 (~40 GB) exceeds 24 GB of VRAM, so expect partial CPU offload. |
Tip: Start with tinyllama (637 MB) to verify the stack works end-to-end before pulling larger models. Don’t start with a 40 GB download on your first deployment — verify first, upgrade later.
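On GPU hosts, it’s worth checking actual free VRAM before choosing a model size:
# Report per-GPU total and used memory
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv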
Managing Models
| Command | What It Does |
|---|---|
| ollama list | List installed models |
| ollama show llama3.1:8b | Show a model’s details (size, quantization, parameters) |
| ollama ps | Show currently loaded models and VRAM usage |
| ollama rm <model> | Remove a model (frees disk space) |
Storage Considerations
Models are stored as deduplicated layers in /var/lib/ollama/models (or wherever you set OLLAMA_MODELS in the override). A 7B quantized model is about 4 GB. A 70B quantized model is about 40 GB.
If you move the models directory after initial setup, you’ll need to update OLLAMA_MODELS in the override config and fix SELinux contexts on the new path (see Chapter 7 — this is a common gotcha).
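A move might look like the following sketch; /data/ollama/models is an example destination, and the restorecon step assumes SELinux is enforcing as discussed in Chapter 7:
# Sketch: relocate models to an example path, /data/ollama/models
sudo systemctl stop ollama
sudo rsync -a /var/lib/ollama/models/ /data/ollama/models/
sudo chown -R ollama:ollama /data/ollama/models
sudo restorecon -Rv /data/ollama/models
# Update OLLAMA_MODELS in override.conf to the new path, then:
sudo systemctl daemon-reload
sudo systemctl restart ollama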
What Automation Looks Like
You just created a user, built directory structures, downloaded and extracted a binary, wrote a systemd unit, configured an override file, started the service, and pulled models. Here’s what that looks like automated:
Installation:
- Creates the ollama system group and user
- Creates the home directory and models directory
- Installs zstd for archive extraction
- Downloads and extracts the Ollama binary (idempotent — skips if binary exists)
- Deploys the systemd service unit
Configuration:
- Creates the systemd override directory
- Templates the override config with your chosen settings
- Enables and starts the service with daemon reload
- Waits for the API to respond (retries 12 times at 5-second intervals)
- Pulls any models you’ve specified that aren’t already present
The model pull is idempotent — re-running the playbook doesn’t re-download models you already have. The entire Ollama setup takes about 2 minutes with the playbook versus 15-20 minutes manually. The companion playbook bundle is available at RavenForge Press.
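The pull logic amounts to something like this sketch (a simplification, not the playbook’s actual code; the model list is an example):
# Pull each model only if ollama list doesn't already show it
for model in tinyllama llama3.1:8b; do
  ollama list | grep -q "^${model}" || ollama pull "${model}"
done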
Verification Checkpoint
Before moving to Chapter 5, confirm:
- systemctl status ollama shows active (running)
- curl -s http://127.0.0.1:11434/api/tags returns JSON
- ollama list shows your pulled model(s)
- ollama ps shows the model loads correctly (run ollama run tinyllama "hello" first to trigger a load)
- journalctl -u ollama shows no errors
- If GPU: nvidia-smi shows VRAM usage when a model is loaded
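If you prefer a scripted checkpoint, this condensed version (assuming tinyllama is the model you pulled) covers the main items:
# Condensed checkpoint: service active, API up, inference works
systemctl is-active --quiet ollama \
  && curl -sf http://127.0.0.1:11434/api/tags > /dev/null \
  && ollama run tinyllama "hello" > /dev/null \
  && echo "Checkpoint: PASS"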
Ollama is serving. Now let’s give it a proper web interface.