Deploying the ELK Stack the Right Way

Chapter 7

Making It Real — ILM & Index Management

In this chapter
  • Why ILM Matters From Day One
  • The Three Retention Tiers
      • General Logs — 120 Days
      • Metrics — 26 Days
      • Monitoring — 3 Days
  • Warm Tier and Zero-Replica Policy
  • Index Templates
  • Applying ILM to Existing Indices
  • Sending Your First Logs
  • Customizing Retention
  • What Automation Looks Like
  • Verification Checkpoint

What you’ll accomplish: Configure index lifecycle policies that keep your disk from filling up, understand how index templates work, send your first logs through the pipeline, and see them in Kibana.

This is the chapter that separates “I installed the ELK stack” from “I have a logging system I can trust.” Without ILM, your cluster quietly fills its disks and locks itself. With it, you never think about retention again.

Why ILM Matters From Day One

Here’s what happens without Index Lifecycle Management: every day, Logstash creates new indices in Elasticsearch — app-logs-2026.03.16, app-logs-2026.03.17, and so on. Each index consumes disk space. Nothing ever gets deleted. After 3-4 weeks on a 20 GB disk, Elasticsearch hits the flood-stage watermark (95% full by default), marks its indices read-only, and refuses to index new data. Your logging pipeline stops, and recovery requires manually deleting indices and clearing the read-only flag.
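As a sanity check on that timeline, here is a back-of-envelope sketch. The 4 GB already used and the 0.7 GB/day ingest rate are illustrative assumptions; measure your own with GET _cat/indices.

```python
# Back-of-envelope: days until the flood-stage watermark (95%) trips.
# The used_gb and daily_ingest_gb values below are assumptions for
# illustration, not measurements.

def days_until_read_only(disk_gb, used_gb, daily_ingest_gb, flood_stage=0.95):
    """Days of headroom before indices get marked read-only."""
    headroom_gb = disk_gb * flood_stage - used_gb
    return max(headroom_gb, 0) / daily_ingest_gb

print(days_until_read_only(disk_gb=20, used_gb=4, daily_ingest_gb=0.7))  # ~21 days
```

At roughly 0.7 GB of logs per day, a 20 GB disk gives you about three weeks, which matches the "3-4 weeks" failure mode above.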

ILM prevents this by automatically managing index lifecycle — from creation through deletion. You define policies like “delete indices older than 120 days” and Elasticsearch enforces them without intervention.

The pro_elasticsearch role configures ILM policies and index templates as part of the deployment. This chapter explains what it sets up and why.

The Three Retention Tiers

Not all logs have the same value over time. We define three retention policies based on how long data is useful. Run these commands on any one of your ES nodes — ILM policies are cluster-wide, so you only need to create them once.

General Logs — 120 Days

curl -u elastic:YOUR_PASSWORD -X PUT "http://localhost:9200/_ilm/policy/delete-after-120d" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": { "min_age": "0ms", "actions": {} },
        "delete": { "min_age": "120d", "actions": { "delete": {} } }
      }
    }
  }'

This covers application logs, system logs, and anything that isn’t high-volume metrics. 120 days is enough to investigate most incidents (you rarely need logs from 6 months ago for a home lab), while keeping disk usage bounded.

Metrics — 26 Days

curl -u elastic:YOUR_PASSWORD -X PUT "http://localhost:9200/_ilm/policy/metrics-delete-26d" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": { "min_age": "0ms", "actions": {} },
        "delete": { "min_age": "26d", "actions": { "delete": {} } }
      }
    }
  }'

Metrics indices (CPU, memory, disk usage over time) generate high volume but lose value quickly. 26 days gives you about a month of trending data — enough to spot patterns and diagnose recent performance issues.

Monitoring — 3 Days

curl -u elastic:YOUR_PASSWORD -X PUT "http://localhost:9200/_ilm/policy/monitoring-delete-3d" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": { "min_age": "0ms", "actions": {} },
        "delete": { "min_age": "3d", "actions": { "delete": {} } }
      }
    }
  }'

Elasticsearch’s internal monitoring indices (.monitoring-es-*) are extremely high volume and only useful for diagnosing active problems. Three days is plenty.
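All three policies share the same shape: a no-op hot phase plus a delete phase, with only the retention age changing. A sketch of that common structure (the delete_policy helper is my own name for illustration, not something the playbook defines):

```python
import json

def delete_policy(retention):
    """Build the ILM policy body shared by the three delete-only
    policies: an empty hot phase plus a delete phase at the given age."""
    return {
        "policy": {
            "phases": {
                "hot": {"min_age": "0ms", "actions": {}},
                "delete": {"min_age": retention, "actions": {"delete": {}}},
            }
        }
    }

# The bodies PUT in the curl commands above are exactly these,
# with different min_age values:
for name, age in [("delete-after-120d", "120d"),
                  ("metrics-delete-26d", "26d"),
                  ("monitoring-delete-3d", "3d")]:
    print(name, json.dumps(delete_policy(age)))
```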

Warm Tier and Zero-Replica Policy

For metrics data, there’s an intermediate optimization. After 10 days, metrics indices are rarely queried but still consume replica storage. The metrics-delete-26d-zero-replicas policy moves them to a warm phase that drops replicas to zero:

curl -u elastic:YOUR_PASSWORD -X PUT "http://localhost:9200/_ilm/policy/metrics-delete-26d-zero-replicas" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "warm": {
          "min_age": "10d",
          "actions": {
            "set_priority": { "priority": 50 },
            "allocate": { "number_of_replicas": 0 }
          }
        },
        "delete": { "min_age": "120d", "actions": { "delete": {} } }
      }
    }
  }'

In a 3-node cluster, each index with 1 replica stores data twice. Dropping to 0 replicas after 10 days halves the disk usage for aging metrics. The trade-off: if a node fails, you lose that data. For 10+ day old metrics in a home lab, that’s acceptable.

Note the delete phase uses 120d (the general retention), not 26d (the standard metrics retention). Because this policy halves the storage cost of anything older than 10 days, we can afford to keep the data longer — 120 days under this policy costs roughly half of what 120 days with replicas throughout would, so you get extended metrics history at a manageable storage cost.
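To see why dropping replicas makes the longer retention affordable, a quick storage sketch. It assumes a constant daily index size, which real metrics indices won't have exactly:

```python
# Storage comparison for the zero-replica warm policy, in units of one
# day's index size. Replicated days count twice (1 primary + 1 replica).
# Assumes constant daily index size -- a simplification.

def storage_units(retention_days, replicated_days):
    """Total storage held at steady state under a given policy."""
    replicated = min(replicated_days, retention_days)
    return replicated * 2 + (retention_days - replicated) * 1

print(storage_units(120, replicated_days=120))  # 240: replicas throughout
print(storage_units(120, replicated_days=10))   # 130: replicas dropped at day 10
print(storage_units(26, replicated_days=26))    # 52: the plain 26d metrics policy
```

Dropping replicas at day 10 lets the cluster hold 120 days of metrics for 130 storage units instead of 240, a saving of roughly 46%.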

Index Templates

ILM policies don’t apply to indices directly — they’re attached via index templates. When Elasticsearch creates a new index that matches a template’s pattern, it automatically applies the template’s settings, including the ILM policy.

The playbook creates three templates:

  Template                       Pattern            ILM Policy            Priority
  metrics-delete-26d-template    app-*-metrics-*    metrics-delete-26d    200
  monitoring-delete-3d-template  .monitoring-es-*   monitoring-delete-3d  200
  delete-after-120d-template     app-logs-*         delete-after-120d     150

Priority matters when patterns overlap. A specific template (priority 200) wins over a general one (priority 150). The app-*-metrics-* pattern matches metrics indices specifically, while app-logs-* is the catch-all for everything else.
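You can simulate the selection rule — highest-priority matching pattern wins — with shell-style globbing. This is a rough model of Elasticsearch's template matching for intuition, not its exact implementation:

```python
from fnmatch import fnmatch

# The three templates from the table above: (name, pattern, priority).
templates = [
    ("metrics-delete-26d-template", "app-*-metrics-*", 200),
    ("monitoring-delete-3d-template", ".monitoring-es-*", 200),
    ("delete-after-120d-template", "app-logs-*", 150),
]

def winning_template(index_name):
    """Among templates whose pattern matches, the highest priority wins."""
    matches = [t for t in templates if fnmatch(index_name, t[1])]
    return max(matches, key=lambda t: t[2])[0] if matches else None

print(winning_template("app-logs-2026.03.16"))           # delete-after-120d-template
print(winning_template("app-host1-metrics-2026.03.16"))  # metrics-delete-26d-template
```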

The metrics and monitoring templates are forward-looking — they’ll apply automatically when you add Metricbeat or enable Elasticsearch’s internal monitoring. The default Logstash pipeline creates app-logs-* indices, which match the general template.
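For reference, here is a sketch of what the general template's body looks like when PUT to _index_template/delete-after-120d-template. The playbook's exact body may include more settings; the essential parts are the pattern, the priority, and the ILM policy wired in via index settings:

```python
import json

# Sketch of a composable index template body (the playbook's actual
# body may differ): pattern, priority, and the ILM policy name.
template = {
    "index_patterns": ["app-logs-*"],
    "priority": 150,
    "template": {
        "settings": {
            "index.lifecycle.name": "delete-after-120d",
        }
    },
}
print(json.dumps(template, indent=2))
```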

Applying ILM to Existing Indices

Here’s the gotcha that catches everyone: index templates only apply to new indices. If you had indices before you created the template, they’re unmanaged. The playbook handles this by explicitly applying ILM policies to existing indices:

# Apply metrics policy to existing metrics indices
curl -u elastic:YOUR_PASSWORD -X PUT "http://localhost:9200/app-*-metrics-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"lifecycle": {"name": "metrics-delete-26d"}}}'

# Apply general policy to existing log indices
curl -u elastic:YOUR_PASSWORD -X PUT "http://localhost:9200/app-logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"lifecycle": {"name": "delete-after-120d"}}}'

The playbook uses ansible.builtin.uri with status_code: [200, 404] — a 404 just means no matching indices exist yet, which is fine on a fresh deployment.

Sending Your First Logs

Let’s prove the pipeline works end-to-end. The simplest test: forward syslog from any Linux host to Logstash.

On a VM or host in your lab, configure rsyslog to forward to Logstash:

# /etc/rsyslog.d/50-logstash.conf
*.* @@192.168.1.62:5514

Restart rsyslog:

systemctl restart rsyslog

The @@ prefix means TCP (single @ is UDP). Port 5514 matches Logstash’s syslog input.
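If you'd rather exercise Logstash's TCP input without touching rsyslog at all, here is a minimal sketch that ships one RFC 3164-style line. The 192.168.1.62:5514 endpoint is the one used throughout this chapter; whether Logstash parses this exact format depends on your pipeline's syslog input configuration:

```python
import socket
from datetime import datetime

def syslog_line(tag, message, hostname="testhost"):
    """Build a minimal RFC 3164-style syslog line.
    <14> = facility user (1) * 8 + severity info (6)."""
    ts = datetime.now().strftime("%b %d %H:%M:%S")
    return f"<14>{ts} {hostname} {tag}: {message}\n"

def send_tcp(line, host="192.168.1.62", port=5514):
    """Ship one line over TCP -- same semantics as rsyslog's @@ target."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(line.encode())

if __name__ == "__main__":
    send_tcp(syslog_line("test-elk", "Hello from the ELK pipeline"))
```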

Generate a test log entry:

logger -t test-elk "Hello from the ELK pipeline"

Wait 10-15 seconds for Logstash to process and index the event, then check Elasticsearch:

curl -u elastic:YOUR_PASSWORD -s "http://192.168.1.61:9200/app-*/_search?q=test-elk&pretty" | head -20

You should see your log entry in the search results. Open Kibana, navigate to Discover, create an index pattern (called a data view in recent Kibana versions) for app-*, and search for “test-elk” — there it is.

This is the payoff. Open Kibana in your browser, go to Discover, and watch your logs arrive. Generate some traffic — restart a service, trigger a cron job, SSH into a few hosts. Within seconds, those events show up in Kibana with parsed fields, timestamps, and the environment tag your pipeline adds. Try the time range picker, filter by hostname, search for an error message. This is what centralized logging feels like — instead of SSHing into five boxes and grepping five different log formats, you searched once and found it.

Now imagine it’s 11 PM and something is broken. You open Kibana, set the time range to the last 30 minutes, and search for ERROR. Every host, every service, one query. That’s what you just built.

Customizing Retention

To change retention periods, edit group_vars/all.yml:

elk_ilm_general_retention: "90d"    # Was 120d
elk_ilm_metrics_retention: "14d"    # Was 26d
elk_ilm_monitoring_retention: "1d"  # Was 3d

Re-run the playbook:

ansible-playbook site.yml -i inventory/hosts.yml --vault-password-file .credentials/vault.txt --tags ilm

The pro_elasticsearch role will update the ILM policies via the Elasticsearch API. Existing indices keep their current policy assignment — the new retention only applies to policy checks going forward.

What Automation Looks Like

The pro_elasticsearch role (ILM section):

  1. Creates 4 ILM policies via REST API — general, metrics, monitoring, zero-replicas
  2. Creates 3 index templates that auto-assign policies to new indices by pattern
  3. Applies policies to existing indices matching known patterns (handles the “created before template existed” gap)
  4. All API calls use changed_when: false — they’re idempotent PUTs that report ok regardless of whether the policy changed
  5. Runs run_once: true — only one node needs to call the cluster-wide API

Everything in this chapter — ILM policies, index templates, applying policies to existing indices — happens automatically as part of the deployment. One ansible-playbook run, and your cluster has lifecycle management from the first index.

Verification Checkpoint

Before moving to Chapter 8, confirm:

  • curl -u elastic:YOUR_PASSWORD -s http://localhost:9200/_ilm/policy?pretty includes the four ILM policies created above (Elasticsearch’s built-in policies will appear alongside them)
  • curl -u elastic:YOUR_PASSWORD -s http://localhost:9200/_index_template?pretty shows the three index templates
  • logger -t test-elk "verification test" on a forwarding host produces a searchable event in Kibana (requires rsyslog forwarding from “Sending Your First Logs” above, or Filebeat configured in Chapter 6)
  • curl -u elastic:YOUR_PASSWORD -s "http://localhost:9200/app-*/_search?q=test-elk&pretty" returns your test log entry
  • The Kibana Discover tab shows incoming logs with parsed fields

Your ELK stack is deployed, ingesting logs, and managing its own retention. The next chapter covers what to do when things go wrong.

Want the automation code? Get the production-ready Ansible playbooks that deploy this entire ELK stack in ~20 minutes.

Get Playbooks — $29