Deploying the ELK Stack the Right Way

Chapter 8

Gotchas & Troubleshooting

In this chapter
  • Where the Logs Live
  • A Note About Authentication
  • Quick Diagnostic Sequence
  • The Problems That Cost You Hours
      • Split Brain — Two Masters, One Cluster
      • JVM Out of Memory on Logstash
      • SELinux Blocking Apache Proxy (503 Error)
      • Elasticsearch Discovery Timeout — Single-Node Cluster
      • ILM Policies Not Applying to Existing Indices
      • Logstash Pipeline Won’t Start
      • 401 Unauthorized on All Elasticsearch API Calls
      • Transport TLS Certificate Expired
      • Logstash Restart Hangs Forever
  • “It Worked Yesterday” Checklist
  • Common Error Messages

What you’ll accomplish: Know where to look when things break, recognize the most common failure modes, and fix them without spending hours on Stack Overflow.

Where the Logs Live

Before debugging anything, know where to look:

| Component | Log Path | Live Command | What to Look For |
|---|---|---|---|
| Elasticsearch | /var/log/elasticsearch/ | journalctl -u elasticsearch | ClusterBlockException, OutOfMemoryError, discovery failures |
| Kibana | /var/log/kibana/ | journalctl -u kibana | FATAL, connection refused to ES |
| Logstash | /var/log/logstash/ | journalctl -u logstash | Pipeline errors, ConnectionRefused, codec failures |
| Apache | /var/log/httpd/ | journalctl -u httpd | 503, 502, proxy errors |
| Filebeat | /var/log/filebeat/ | journalctl -u filebeat | Connection refused to Logstash, fileset errors, harvester issues |
| SELinux | /var/log/audit/audit.log | ausearch -m AVC -ts recent | denied entries for httpd, java |

Tip: For Elasticsearch, also check the cluster-specific log at /var/log/elasticsearch/<cluster-name>.log — it’s named after your cluster.name setting.

A Note About Authentication

With security enabled (the default), all curl commands against the Elasticsearch REST API require authentication. Add -u elastic:YOUR_PASSWORD to every curl command in this chapter. If you get 401 Unauthorized, that’s the first thing to check.
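Rather than pasting the password into every command (and into your shell history), curl can read credentials from a config file. A minimal sketch — the file path and name are my choice, not something the deployment creates:

```shell
# Create a credentials file readable only by you (path is arbitrary).
install -m 600 /dev/null "$HOME/.es-curlrc"
echo 'user = "elastic:YOUR_PASSWORD"' > "$HOME/.es-curlrc"

# Every curl in this chapter then becomes:
#   curl -K "$HOME/.es-curlrc" -s "http://localhost:9200/_cluster/health?pretty"
```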

Quick Diagnostic Sequence

When something isn’t working, run through this 6-step sequence before diving into specifics:

1. Are the services running?

systemctl status elasticsearch kibana httpd logstash

If any show inactive or failed, check the journal: journalctl -u <service> -n 50 --no-pager.

2. Are the ports listening?

ss -tlnp | grep -E '9200|9300|5601|443|5044|9600'

Missing ports mean the service either isn’t running or is bound to the wrong interface.

3. Is the cluster healthy?

curl -u elastic:YOUR_PASSWORD -s "http://localhost:9200/_cluster/health?pretty"

Red = at least one primary shard is unassigned, so some data is unavailable (and may be lost). Yellow = replicas unassigned (often normal on fresh clusters). Green = healthy.
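If you poll health frequently, a small helper that extracts just the status field saves reading JSON by eye. A sketch — the function name is mine, and it assumes the compact single-line response you get without ?pretty:

```shell
# Hypothetical helper: pull the "status" field out of _cluster/health JSON.
health_status() {
  echo "$1" | grep -o '"status":"[a-z]*"' | cut -d'"' -f4
}

# Against a live cluster:
#   health_status "$(curl -su elastic:YOUR_PASSWORD http://localhost:9200/_cluster/health)"
health_status '{"cluster_name":"elk","status":"yellow","number_of_nodes":3}'   # prints: yellow
```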

4. Are the firewall rules correct?

firewall-cmd --list-all

Look for the rich rules allowing 9200, 9300, 5044 from the correct source IPs.

5. Is SELinux blocking something?

ausearch -m AVC -ts recent 2>/dev/null | tail -20

Any denied entries? The message tells you exactly which process and action was blocked.

6. Is disk space available?

df -h /opt /var

Elasticsearch goes read-only at 95% disk usage. Logstash stops writing if its data directory fills.
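That 95% figure is the last of three default disk watermarks: 85% (low, no new shards allocated), 90% (high, shards relocated away), 95% (flood stage, indices forced read-only). A sketch that maps a usage percentage to the watermark it crosses — the function name is mine:

```shell
# Map a disk-usage percentage to the default Elasticsearch watermark it crosses.
watermark_stage() {
  local pct=$1
  if   [ "$pct" -ge 95 ]; then echo "flood (indices forced read-only)"
  elif [ "$pct" -ge 90 ]; then echo "high (shards relocated away)"
  elif [ "$pct" -ge 85 ]; then echo "low (no new shards allocated)"
  else echo "ok"
  fi
}

# Feed it the current usage of the data partition, e.g.:
#   watermark_stage "$(df --output=pcent /opt | tail -1 | tr -dc '0-9')"
```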

The Problems That Cost You Hours

Split Brain — Two Masters, One Cluster

Symptom: _cluster/health shows fewer nodes than expected. Two nodes each think they’re the master. Data written to one master isn’t visible on the other.

Cause: Network partition between nodes, or an even number of nodes (2 nodes can’t achieve majority). With 2 nodes, if one can’t reach the other, both promote themselves to master.

Fix: Always use an odd number of nodes (3 minimum). Verify cluster.initial_master_nodes lists all nodes and discovery.seed_hosts is correct on every node. Check port 9300 connectivity between all pairs:

# From es01, test transport to es02 and es03
curl -v telnet://192.168.1.62:9300 2>&1 | head -5
curl -v telnet://192.168.1.63:9300 2>&1 | head -5

If one node is isolated, check its firewall rules — port 9300 rich rules may be missing.
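To walk all node pairs quickly even on hosts without telnet or nc installed, bash's built-in /dev/tcp redirection is enough. A sketch using the chapter's example IPs — run it from each node in turn:

```shell
# Probe the transport port on every node; bash-only, no extra tools needed.
for ip in 192.168.1.61 192.168.1.62 192.168.1.63; do
  if timeout 3 bash -c "echo > /dev/tcp/$ip/9300" 2>/dev/null; then
    echo "$ip:9300 reachable"
  else
    echo "$ip:9300 BLOCKED"
  fi
done
```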

JVM Out of Memory on Logstash

Symptom: Logstash process disappears. dmesg | grep -i oom shows Out of memory: Killed process ... (java).

Cause: JVM heap is larger than available RAM. If Logstash’s heap is 5 GB on a 4 GB host, the OOM killer intervenes.

Fix: Check the heap setting: grep -E 'Xms|Xmx' /etc/logstash/jvm.options. It should be 62.5% of total RAM, not 100%. On a 4 GB host, that’s 2560m. Adjust elk_logstash_jvm_heap_pct in group_vars/all.yml and re-run the playbook.
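The arithmetic is easy to sanity-check by hand. A sketch — the helper name is mine, and 625/1000 encodes the 62.5% default in integer math:

```shell
# JVM heap size in MB as 62.5% of total RAM (integer arithmetic).
heap_mb() {
  echo $(( $1 * 625 / 1000 ))
}

heap_mb 4096    # 4 GB host -> prints 2560

# Against the live machine:
#   heap_mb $(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
```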

SELinux Blocking Apache Proxy (503 Error)

Symptom: Browser shows “503 Service Unavailable.” Apache is running, Kibana is running, but Apache can’t reach Kibana.

Cause: SELinux httpd_can_network_connect boolean is off. Apache isn’t allowed to make outbound TCP connections.

Fix:

# Check the boolean
getsebool httpd_can_network_connect

# If it shows 'off', set it:
setsebool -P httpd_can_network_connect on

The playbook sets this automatically, but manual Apache reconfiguration or an OS update can reset it.

Elasticsearch Discovery Timeout — Single-Node Cluster

Symptom: _cluster/health shows number_of_nodes: 1 on every node. Each node formed its own cluster instead of joining the others.

Cause: Nodes can’t reach each other on port 9300. Common reasons:

  • Firewall blocking transport port
  • discovery.seed_hosts has wrong hostnames/IPs
  • network.host bound to the wrong interface (127.0.0.1 instead of the node’s IP)

Fix: On each node, verify the config:

grep -E 'discovery.seed_hosts|network.host|cluster.name' /etc/elasticsearch/elasticsearch.yml

Test transport connectivity from each node to the others. If the cluster already bootstrapped incorrectly, you may need to stop all nodes, delete the data in /opt/lib/elasticsearch/nodes/, and restart them simultaneously.
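For reference, the discovery settings must agree across nodes. A sketch of what a correct node config might look like, using the chapter's example IPs and hostnames — your cluster.name and interface will differ:

```yaml
# /etc/elasticsearch/elasticsearch.yml on es01 (values are illustrative)
cluster.name: elk-cluster                 # must be identical on every node
network.host: 192.168.1.61                # this node's own IP, not 127.0.0.1
discovery.seed_hosts:
  - 192.168.1.61
  - 192.168.1.62
  - 192.168.1.63
cluster.initial_master_nodes:             # consulted only at first bootstrap
  - es01
  - es02
  - es03
```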

ILM Policies Not Applying to Existing Indices

Symptom: You created ILM policies and index templates, but old indices (created before the template) aren’t being managed. Disk usage keeps growing.

Cause: Index templates only apply to new indices. Existing indices need the policy applied manually.

Fix:

# Apply to existing app-logs indices
curl -u elastic:YOUR_PASSWORD -X PUT "http://es01.example.com:9200/app-logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"lifecycle": {"name": "delete-after-120d"}}}'

# Verify
curl -u elastic:YOUR_PASSWORD -s "http://es01.example.com:9200/app-logs-*/_ilm/explain?pretty" | head -20

The pro_elasticsearch role does this for known index patterns, but any custom indices you create need manual policy assignment.
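If the delete-after-120d policy itself is missing (a GET to _ilm/policy/delete-after-120d returns 404), it can be created with a PUT to that same path. The body would look roughly like this — the 120-day delete phase follows from the policy's name; everything else is an assumption:

```json
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "120d",
        "actions": { "delete": {} }
      }
    }
  }
}
```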

Logstash Pipeline Won’t Start

Symptom: Port 5044 isn’t listening. journalctl -u logstash shows pipeline configuration errors.

Cause: Syntax error in /etc/logstash/conf.d/logstash.conf. A missing brace, bad filter syntax, or invalid plugin configuration.

Fix: Test the pipeline config without starting the service:

/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/logstash.conf

This parses the config and reports syntax errors without affecting the running service.

401 Unauthorized on All Elasticsearch API Calls

Symptom: Every curl command against port 9200 returns {"error":"security_exception","reason":"missing authentication credentials"} or 401 Unauthorized.

Cause: Security is enabled but your password is wrong, or the elastic superuser password was never reset during deployment.

Fix: Verify your password works:

curl -u elastic:YOUR_PASSWORD http://localhost:9200/

If this returns 401, the password in your vault doesn’t match what Elasticsearch has. Re-run the password reset:

sudo /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic -b -i

Enter the password from your vault file. Then re-run the playbook to ensure kibana_system and logstash_system passwords are also set correctly.

Transport TLS Certificate Expired

Symptom: Cluster suddenly can’t form — nodes report transport-layer handshake failures. Worked fine for years, then stopped.

Cause: The transport TLS certificates generated during initial deployment have a 10-year expiry (--days 3650). When they expire, inter-node communication silently fails.

Fix: Regenerate certificates on the first ES node and redistribute:

# Stop all ES nodes first
# Then on es01:
sudo rm /etc/elasticsearch/certs/elastic-*.p12
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil ca --out /etc/elasticsearch/certs/elastic-stack-ca.p12 --pass "" --days 3650
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil cert --ca /etc/elasticsearch/certs/elastic-stack-ca.p12 --ca-pass "" --days 3650 --out /etc/elasticsearch/certs/elastic-certificates.p12 --pass ""
# Copy to es02 and es03, then start all nodes

Or re-run the playbook — the svc_elasticsearch role regenerates certs when they’re missing.

Logstash Restart Hangs Forever

Symptom: systemctl restart logstash never returns. The process is in deactivating state. You can’t stop it, and starting a new instance fails because the old one is still “running.”

Cause: The Logstash RPM ships with TimeoutStopSec=infinity. If the pipeline stalls during shutdown (common with Beats input when a client disconnects uncleanly), systemd waits forever for the process to exit.

Fix: Create a systemd override:

sudo mkdir -p /etc/systemd/system/logstash.service.d
sudo tee /etc/systemd/system/logstash.service.d/timeout.conf > /dev/null << 'EOF'
[Service]
TimeoutStopSec=90
EOF
sudo systemctl daemon-reload
sudo systemctl kill -s SIGKILL logstash    # force-stop the hung process
sudo systemctl start logstash

The playbook applies this override automatically. If you’re stuck right now with a hung process, systemctl kill -s SIGKILL logstash bypasses the stop timeout and kills it immediately (plain systemctl kill sends SIGTERM, which a stalled pipeline may ignore).

“It Worked Yesterday” Checklist

When something that was working suddenly isn’t, check these first — they cover 90% of post-deployment breakage:

  • Did a dnf update change a config file? Check for .rpmnew or .rpmsave in /etc/elasticsearch/, /etc/kibana/, /etc/logstash/, /etc/httpd/
  • Did the service actually restart after your change? systemctl status <service>
  • Is SELinux blocking something new? ausearch -m AVC -ts recent
  • Did a firewall rule get lost? firewall-cmd --list-all — compare to what the playbook configures
  • Is the disk full? df -h /opt /var — check both data and log partitions
  • Did a certificate expire? openssl x509 -in /etc/pki/tls/certs/<hostname>.crt -noout -dates
  • Is the JVM crashing? dmesg | grep -i oom and check /var/log/elasticsearch/*.log for OutOfMemoryError
  • Did the cluster lose quorum? curl -u elastic:YOUR_PASSWORD -s http://localhost:9200/_cluster/health?pretty — if number_of_nodes < 2, a node is unreachable
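The certificate-expiry check in the list reduces nicely to a single number you can alert on. A sketch — the function name is mine; it assumes openssl and GNU date are available:

```shell
# Days until a PEM certificate expires; negative means it already has.
cert_days_left() {
  local end
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
}

# Example: cert_days_left /etc/pki/tls/certs/<hostname>.crt
```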

Common Error Messages

ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]

Elasticsearch disk watermark exceeded. Free up disk space, then clear the block:

curl -u elastic:YOUR_PASSWORD -X PUT "http://localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'

ConnectionRefusedError: Connection refused - connect(2) for "192.168.1.61" port 9200

Elasticsearch isn’t running or isn’t bound to that IP. Check systemctl status elasticsearch and verify network.host in elasticsearch.yml.

Logstash could not be started because there is already another instance of Logstash running

A previous Logstash process didn’t shut down cleanly. Find and kill it:

ps aux | grep [l]ogstash         # bracket trick keeps grep out of its own results
kill <pid>
rm -f /var/lib/logstash/.lock    # the instance lock lives in Logstash's data directory
systemctl start logstash

Want the automation code? Get the production-ready Ansible playbooks that deploy this entire ELK stack in ~20 minutes.
