
Chapter 4

Ollama Installation & Configuration

In this chapter
  • What Ollama Actually Does
  • Installation
      • Create the Ollama User
      • Create the Models Directory
      • Download and Install the Binary
      • Create the systemd Service
  • Configuration
      • systemd Override
      • Start Ollama
      • Verify the API
  • Model Management
      • Pulling Models
      • Model Recommendations by VRAM
      • Managing Models
      • Storage Considerations
  • What Automation Looks Like
  • Verification Checkpoint

What you’ll accomplish: Install Ollama as a systemd service, configure it with production-ready settings, pull your first model, and verify the API is serving.

What Ollama Actually Does

Ollama is an inference server. It downloads, loads, and serves language models behind a REST API. When Open WebUI (Chapter 5) sends a chat message, it hits Ollama’s API at http://127.0.0.1:11434, Ollama loads the requested model into GPU memory (or system RAM for CPU), runs inference, and streams the response back.

The key things Ollama manages:

  • Model downloading and storage — ollama pull fetches models and stores them in a structured directory
  • Model loading/unloading — loads models into VRAM on demand, unloads when memory is needed
  • Inference serving — REST API that handles chat completions, embeddings, and raw generation
  • Concurrent requests — configurable parallelism for multiple users
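To make that concrete, here's what a minimal request against the API looks like once a model has been pulled. tinyllama is used only as an example (we pull it later in this chapter); any installed model works:

# Ask Ollama for a single, non-streamed completion
curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "tinyllama",
    "prompt": "Why is the sky blue? Answer in one sentence.",
    "stream": false
  }' | python3 -m json.tool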

Installation

Create the Ollama User

Ollama runs as a dedicated system user. Don’t run it as root or your personal account:

# Create a system group and user for Ollama
sudo groupadd --system ollama
sudo useradd --system --gid ollama --shell /usr/sbin/nologin \
  --home /usr/share/ollama ollama

# Create the home directory
sudo mkdir -p /usr/share/ollama
sudo chown ollama:ollama /usr/share/ollama
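A quick sanity check that the account exists and can't be used for interactive logins:

# Confirm the ollama account exists and has a nologin shell
id ollama
getent passwd ollama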

Create the Models Directory

Models need their own directory with enough disk space. Don’t use the ollama user’s home directory — /var/lib/ollama/models is the right place because it follows Linux filesystem conventions, has predictable SELinux contexts, and makes disk planning straightforward.

sudo mkdir -p /var/lib/ollama/models
sudo chown ollama:ollama /var/lib/ollama/models

Download and Install the Binary

Ollama is distributed as a .tar.zst archive containing the binary and supporting libraries:

# Install zstd for archive extraction (Rocky 9 may not have it by default)
sudo dnf install -y zstd

# Download and extract Ollama to /usr/local
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tar.zst \
  | sudo zstd -d --stdout \
  | sudo tar -xf - -C /usr/local

# Verify the binary is in place
ls -la /usr/local/bin/ollama

Note: If you need a specific version instead of latest, the URL format is https://github.com/ollama/ollama/releases/download/v<VERSION>/ollama-linux-amd64.tar.zst.
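A pinned install might look like the following. The version number is only a placeholder; check the releases page for the tag you actually want:

# Pin a specific release instead of latest (placeholder version — substitute your own)
OLLAMA_VERSION="0.5.7"
curl -fsSL "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tar.zst" \
  | sudo zstd -d --stdout \
  | sudo tar -xf - -C /usr/local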

Create the systemd Service

sudo tee /etc/systemd/system/ollama.service > /dev/null << 'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="HOME=/usr/share/ollama"

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable ollama

Don’t start it yet — we’ll configure the override first.

Configuration

systemd Override

The override file is where the real configuration lives. Instead of editing the service unit directly, we use a drop-in override:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

Let’s break down each variable:

OLLAMA_HOST=127.0.0.1:11434 — Bind to localhost only. This is critical for security: Ollama has no authentication. If you bind to 0.0.0.0, anyone on your network can run inference on your GPU. nginx (Chapter 5) handles external access with SSL and authentication.

OLLAMA_MODELS=/var/lib/ollama/models — Where models are stored on disk. Change this if you have a separate data drive or want models on a specific mount point.

OLLAMA_NUM_PARALLEL=1 — How many inference requests Ollama processes simultaneously. Each parallel request needs additional VRAM for its KV cache. Start with 1 and increase if you have VRAM headroom and multiple users.

OLLAMA_MAX_LOADED_MODELS=1 — How many models stay loaded in VRAM simultaneously. Each loaded model reserves its full VRAM allocation. For 8 GB GPUs, keep this at 1. For 24 GB GPUs with smaller models, you might increase to 2.

Tip: If you have multiple GPUs, add Environment="CUDA_VISIBLE_DEVICES=0" to target a specific GPU, or "0,1" to use both.
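As an illustration, an override for a hypothetical 24 GB dual-GPU host serving a handful of users might look like the sketch below, written with the same tee pattern as above. Treat the values as starting points to tune, not prescriptions:

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
# Two concurrent requests; each needs extra VRAM for its own KV cache
Environment="OLLAMA_NUM_PARALLEL=2"
# Keep two smaller models resident to avoid load/unload churn between users
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Expose both GPUs to Ollama
Environment="CUDA_VISIBLE_DEVICES=0,1"
EOF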

Start Ollama

sudo systemctl daemon-reload
sudo systemctl start ollama

Watch the startup:

sudo journalctl -u ollama -f

Look for the line Listening on 127.0.0.1:11434 — that means Ollama is ready. On GPU hosts, you’ll also see lines about CUDA and your GPU model. On CPU-only hosts, you’ll see inference compute ... library=cpu — that’s normal. Press Ctrl+C to exit the log viewer once you’ve confirmed it’s listening.
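If you'd rather not tail the journal, you can also confirm the port is bound directly:

# Confirm something is listening on the Ollama port
sudo ss -ltnp | grep 11434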

Verify the API

# Check the API is responding
curl -s http://127.0.0.1:11434/api/tags | python3 -m json.tool

This should return a JSON object with an empty models array (we haven’t pulled anything yet). If you get connection refused, check systemctl status ollama and the journal logs.

Model Management

Pulling Models

Pick one model to start with. You don’t need both — tinyllama is a quick download to verify the stack works end-to-end, while llama3.1:8b is what you’d actually use day-to-day. You can always pull more later.

# Option A: Small model for quick testing (~637 MB)
ollama pull tinyllama

# Option B: Production-quality 8B model (~4.7 GB)
ollama pull llama3.1:8b

The first pull takes a while depending on your bandwidth. Subsequent pulls of the same model are instant (cached layers).

Model Recommendations by VRAM

VRAM     | Recommended Models                                       | Notes
CPU only | tinyllama, phi3:mini                                     | Small models only. Expect 5-10 tokens/sec.
8 GB     | llama3.1:8b, mistral:7b, gemma2:9b                       | The sweet spot. Fast inference, good quality.
16 GB    | llama3.1:8b (full precision), codellama:34b-instruct-q4 | Room for larger or higher-precision models.
24 GB    | llama3.1:70b-q4, deepseek-coder-v2:33b, mixtral:8x7b    | The big models. Excellent quality.

Tip: Start with tinyllama (637 MB) to verify the stack works end-to-end before pulling larger models. Don’t start with a 40 GB download on your first deployment — verify first, upgrade later.
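Once the pull finishes, a one-line smoke test is enough to confirm inference works end-to-end (swap in whichever model you pulled):

# Quick inference smoke test; the first run also loads the model into memory
ollama run tinyllama "Reply with one short sentence confirming you are working."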

Managing Models

Command                 | What It Does
ollama list             | List installed models
ollama show llama3.1:8b | Show a model's details (size, quantization, parameters)
ollama ps               | Show currently loaded models and VRAM usage
ollama rm <model>       | Remove a model (frees disk space)

Storage Considerations

Models are stored as deduplicated layers in /var/lib/ollama/models (or wherever you set OLLAMA_MODELS in the override). A 7B quantized model is about 4 GB. A 70B quantized model is about 40 GB.

If you move the models directory after initial setup, you’ll need to update OLLAMA_MODELS in the override config and fix SELinux contexts on the new path (see Chapter 7 — this is a common gotcha).
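As a rough sketch, a move might look like this, assuming a hypothetical larger mount at /data/ollama/models. Chapter 7 covers the SELinux details; restorecon alone may not be enough if the new path first needs a file-context rule:

# Sketch: relocate the models directory to a bigger mount (path is hypothetical)
sudo du -sh /var/lib/ollama/models            # see how much you're about to move
sudo systemctl stop ollama
sudo mkdir -p /data/ollama/models
sudo rsync -a /var/lib/ollama/models/ /data/ollama/models/
sudo chown -R ollama:ollama /data/ollama
# Point OLLAMA_MODELS at the new path in the override, then:
sudo restorecon -Rv /data/ollama
sudo systemctl daemon-reload
sudo systemctl start ollama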

What Automation Looks Like

You just created a user, built directory structures, downloaded and extracted a binary, wrote a systemd unit, configured an override file, started the service, and pulled models. Here’s what that looks like automated:

Installation:

  1. Creates the ollama system group and user
  2. Creates the home directory and models directory
  3. Installs zstd for archive extraction
  4. Downloads and extracts the Ollama binary (idempotent — skips if binary exists)
  5. Deploys the systemd service unit

Configuration:

  1. Creates the systemd override directory
  2. Templates the override config with your chosen settings
  3. Enables and starts the service with daemon reload
  4. Waits for the API to respond (retries 12 times at 5-second intervals; sketched in shell form after this list)
  5. Pulls any models you’ve specified that aren’t already present
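If you ever need to reproduce that wait step by hand, the shell equivalent is only a few lines (same 12 x 5-second schedule):

# Wait up to 60 seconds (12 retries x 5 seconds) for the Ollama API to come up
for i in $(seq 1 12); do
  if curl -sf http://127.0.0.1:11434/api/tags > /dev/null; then
    echo "Ollama API is responding."
    break
  fi
  echo "Attempt $i/12: API not up yet, retrying in 5 seconds..."
  sleep 5
done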

The model pull is idempotent — re-running the playbook doesn’t re-download models you already have. The entire Ollama setup takes about 2 minutes with the playbook versus 15-20 minutes manually. The companion playbook bundle is available at RavenForge Press.

Verification Checkpoint

Before moving to Chapter 5, confirm:

  • systemctl status ollama shows active (running)
  • curl -s http://127.0.0.1:11434/api/tags returns JSON
  • ollama list shows your pulled model(s)
  • ollama ps shows the model loads correctly (run ollama run tinyllama "hello" first to trigger a load)
  • journalctl -u ollama shows no errors
  • If GPU: nvidia-smi shows VRAM usage when a model is loaded
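If you'd rather run those checks in one pass, a rough script like this covers most of them. It assumes tinyllama is the model you pulled; adjust the name as needed:

#!/usr/bin/env bash
# Rough one-shot verification; assumes tinyllama was pulled
echo "== Service ==" && systemctl is-active ollama
echo "== API ==" && curl -sf http://127.0.0.1:11434/api/tags | python3 -m json.tool
echo "== Installed models ==" && ollama list
echo "== Trigger a load ==" && ollama run tinyllama "hello" > /dev/null
echo "== Loaded models ==" && ollama ps
echo "== Recent errors ==" && sudo journalctl -u ollama --no-pager -n 100 | grep -i error || echo "none"
# On GPU hosts, confirm VRAM is actually in use while the model is loaded
command -v nvidia-smi > /dev/null && nvidia-smi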

Ollama is serving. Now let’s give it a proper web interface.
