What you’ll accomplish: Know where to look when things break, recognize the most common failure modes, and fix them without spending hours on Stack Overflow.
Where the Logs Live
Before debugging anything, know where to look:
| Component | Log Path | journalctl | What to Look For |
|---|---|---|---|
| Elasticsearch | /var/log/elasticsearch/ | journalctl -u elasticsearch | ClusterBlockException, OutOfMemoryError, discovery failures |
| Kibana | /var/log/kibana/ | journalctl -u kibana | FATAL, connection refused to ES |
| Logstash | /var/log/logstash/ | journalctl -u logstash | Pipeline errors, ConnectionRefused, codec failures |
| Apache | /var/log/httpd/ | journalctl -u httpd | 503, 502, proxy errors |
| Filebeat | /var/log/filebeat/ | journalctl -u filebeat | Connection refused to Logstash, fileset errors, harvester issues |
| SELinux | /var/log/audit/audit.log | ausearch -m AVC -ts recent | denied entries for httpd, java |
Tip: For Elasticsearch, also check the cluster-specific log at /var/log/elasticsearch/<cluster-name>.log — it’s named after your cluster.name setting.
A Note About Authentication
With security enabled (the default), all curl commands against the Elasticsearch REST API require authentication. Add -u elastic:YOUR_PASSWORD to every curl command in this chapter. If you get 401 Unauthorized, that’s the first thing to check.
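To avoid retyping credentials on every call, a small wrapper can help. This is a convenience sketch only — the `es` function and the `ES_PASS`/`ES_HOST` variables are illustrative names, not anything the playbook defines:

```shell
# Hypothetical convenience wrapper: assumes ES_PASS holds the elastic
# password (e.g. exported from your vault before a debugging session).
ES_HOST="${ES_HOST:-http://localhost:9200}"

es() {
    # -s silences progress output, -u supplies basic auth, "$1" is the API path
    curl -su "elastic:${ES_PASS}" "${ES_HOST}$1"
}

# Usage:
#   es /_cluster/health?pretty
#   es /_cat/indices?v
```

With this sourced into your shell, the rest of the chapter's curl commands shorten to `es <path>`.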
Quick Diagnostic Sequence
When something isn’t working, run through this 6-step sequence before diving into specifics:
1. Are the services running?
systemctl status elasticsearch kibana httpd logstash
If any show inactive or failed, check the journal: journalctl -u <service> -n 50 --no-pager.
2. Are the ports listening?
ss -tlnp | grep -E '9200|9300|5601|443|5044|9600'
Missing ports mean the service either isn’t running or is bound to the wrong interface.
3. Is the cluster healthy?
curl -u elastic:YOUR_PASSWORD -s http://localhost:9200/_cluster/health?pretty
Red = at least one primary shard is unassigned, so some data is unavailable and possibly lost. Yellow = replicas unassigned (often normal on fresh clusters). Green = healthy.
4. Are the firewall rules correct?
firewall-cmd --list-all
Look for the rich rules allowing 9200, 9300, 5044 from the correct source IPs.
5. Is SELinux blocking something?
ausearch -m AVC -ts recent 2>/dev/null | tail -20
Any denied entries? The message tells you exactly which process and action was blocked.
6. Is disk space available?
df -h /opt /var
Elasticsearch goes read-only at 95% disk usage. Logstash stops writing if its data directory fills.
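Step 3 lends itself to scripting. Here is a minimal sketch that turns the health check into a pass/fail test — `es_status` is an illustrative helper, and `ES_PASS` is assumed to hold the elastic password:

```shell
# Sketch: turn the cluster-health check (step 3) into a pass/fail test.
es_status() {
    # read _cluster/health JSON on stdin, print the "status" field
    grep -o '"status" *: *"[a-z]*"' | head -n1 | cut -d'"' -f4
}

status=$(curl -su "elastic:${ES_PASS}" http://localhost:9200/_cluster/health 2>/dev/null | es_status)
case "$status" in
    green)  echo "cluster healthy" ;;
    yellow) echo "replicas unassigned (often fine on fresh clusters)" ;;
    red)    echo "PRIMARY SHARDS UNASSIGNED - investigate now" ;;
    *)      echo "no response from Elasticsearch on :9200" ;;
esac
```

Drop it into a cron job or monitoring hook and you get an early warning instead of a surprise.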
The Problems That Cost You Hours
Split Brain — Two Masters, One Cluster
Symptom: _cluster/health shows fewer nodes than expected. Two nodes each think they’re the master. Data written to one master isn’t visible on the other.
Cause: Network partition between nodes, or an even number of nodes (2 nodes can’t achieve majority). With 2 nodes, if one can’t reach the other, both promote themselves to master.
Fix: Always use an odd number of nodes (3 minimum). Verify cluster.initial_master_nodes lists all nodes and discovery.seed_hosts is correct on every node. Check port 9300 connectivity between all pairs:
# From es01, test transport to es02 and es03
curl -v telnet://192.168.1.62:9300 2>&1 | head -5
curl -v telnet://192.168.1.63:9300 2>&1 | head -5
If one node is isolated, check its firewall rules — port 9300 rich rules may be missing.
JVM Out of Memory on Logstash
Symptom: Logstash process disappears. dmesg | grep -i oom shows Out of memory: Killed process ... (java).
Cause: JVM heap is larger than available RAM. If Logstash’s heap is 5 GB on a 4 GB host, the OOM killer intervenes.
Fix: Check the heap setting: grep -E 'Xms|Xmx' /etc/logstash/jvm.options. It should be 62.5% of total RAM, not 100%. On a 4 GB host, that’s 2560m. Adjust elk_logstash_jvm_heap_pct in group_vars/all.yml and re-run the playbook.
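For reference, the percentage-to-megabytes arithmetic behind that setting can be sketched like this — `heap_mb_for` is an illustrative helper, not a playbook function:

```shell
# Sketch of the heap-sizing arithmetic: heap = total RAM * pct / 100,
# in integer math so the result can go straight into jvm.options.
heap_mb_for() {
    local ram_mb=$1 pct_x10=$2   # pct_x10=625 means 62.5%
    echo $(( ram_mb * pct_x10 / 1000 ))
}

ram_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
echo "RAM: ${ram_mb}m -> heap: $(heap_mb_for "$ram_mb" 625)m"
```

On a 4096 MB host this yields 2560m, matching the example above.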
SELinux Blocking Apache Proxy (503 Error)
Symptom: Browser shows “503 Service Unavailable.” Apache is running, Kibana is running, but Apache can’t reach Kibana.
Cause: SELinux httpd_can_network_connect boolean is off. Apache isn’t allowed to make outbound TCP connections.
Fix:
# Check the boolean
getsebool httpd_can_network_connect
# If it shows 'off', set it:
setsebool -P httpd_can_network_connect on
The playbook sets this automatically, but manual Apache reconfiguration or an OS update can reset it.
Elasticsearch Discovery Timeout — Single-Node Cluster
Symptom: _cluster/health shows number_of_nodes: 1 on every node. Each node formed its own cluster instead of joining the others.
Cause: Nodes can’t reach each other on port 9300. Common reasons:
- Firewall blocking the transport port
- discovery.seed_hosts has wrong hostnames/IPs
- network.host bound to the wrong interface (127.0.0.1 instead of the node’s IP)
Fix: On each node, verify the config:
grep -E 'discovery.seed_hosts|network.host|cluster.name' /etc/elasticsearch/elasticsearch.yml
Test transport connectivity from each node to the others. If the cluster already bootstrapped incorrectly, you may need to stop all nodes, delete the data in /opt/lib/elasticsearch/nodes/, and restart them simultaneously.
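A quick way to test transport connectivity without installing anything is bash’s /dev/tcp. The `port_open` helper below is a sketch; the peer IPs match the earlier examples:

```shell
# Sketch: probe the transport port on each peer using bash's /dev/tcp.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

for peer in 192.168.1.62 192.168.1.63; do
    if port_open "$peer" 9300; then
        echo "$peer:9300 reachable"
    else
        echo "$peer:9300 UNREACHABLE - check firewall rules and network.host"
    fi
done
```

Run it from every node in turn; an asymmetric result (A reaches B, B can’t reach A) points at a one-sided firewall rule.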
ILM Policies Not Applying to Existing Indices
Symptom: You created ILM policies and index templates, but old indices (created before the template) aren’t being managed. Disk usage keeps growing.
Cause: Index templates only apply to new indices. Existing indices need the policy applied manually.
Fix:
# Apply to existing app-logs indices
curl -u elastic:YOUR_PASSWORD -X PUT "http://es01.example.com:9200/app-logs-*/_settings" \
-H 'Content-Type: application/json' \
-d '{"index": {"lifecycle": {"name": "delete-after-120d"}}}'
# Verify
curl -u elastic:YOUR_PASSWORD -s "http://es01.example.com:9200/app-logs-*/_ilm/explain?pretty" | head -20
The pro_elasticsearch role does this for known index patterns, but any custom indices you create need manual policy assignment.
Logstash Pipeline Won’t Start
Symptom: Port 5044 isn’t listening. journalctl -u logstash shows pipeline configuration errors.
Cause: Syntax error in /etc/logstash/conf.d/logstash.conf. A missing brace, bad filter syntax, or invalid plugin configuration.
Fix: Test the pipeline config without starting the service:
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/logstash.conf
This parses the config and reports syntax errors without affecting the running service.
401 Unauthorized on All Elasticsearch API Calls
Symptom: Every curl command against port 9200 returns {"error":"security_exception","reason":"missing authentication credentials"} or 401 Unauthorized.
Cause: Security is enabled but your password is wrong, or the elastic superuser password was never reset during deployment.
Fix: Verify your password works:
curl -u elastic:YOUR_PASSWORD http://localhost:9200/
If this returns 401, the password in your vault doesn’t match what Elasticsearch has. Re-run the password reset:
sudo /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic -b -i
Enter the password from your vault file. Then re-run the playbook to ensure kibana_system and logstash_system passwords are also set correctly.
Transport TLS Certificate Expired
Symptom: Cluster suddenly can’t form — nodes report transport-layer handshake failures. Worked fine for years, then stopped.
Cause: The transport TLS certificates generated during initial deployment have a 10-year expiry (--days 3650). When they expire, inter-node communication silently fails.
Fix: Regenerate certificates on the first ES node and redistribute:
# Stop all ES nodes first
# Then on es01:
sudo rm /etc/elasticsearch/certs/elastic-*.p12
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil ca --out /etc/elasticsearch/certs/elastic-stack-ca.p12 --pass "" --days 3650
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil cert --ca /etc/elasticsearch/certs/elastic-stack-ca.p12 --ca-pass "" --days 3650 --out /etc/elasticsearch/certs/elastic-certificates.p12 --pass ""
# Copy to es02 and es03, then start all nodes
Or re-run the playbook — the svc_elasticsearch role regenerates certs when they’re missing.
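Before regenerating, you can check how much time the current certificates actually have left — openssl reads the p12 directly. `days_until_expiry` is an illustrative helper; the empty password matches how the playbook generates the certs:

```shell
# Sketch: report how many days remain on a PKCS#12 certificate.
days_until_expiry() {
    local p12=$1 pass=$2 end
    end=$(openssl pkcs12 -in "$p12" -nokeys -passin "pass:$pass" 2>/dev/null \
          | openssl x509 -noout -enddate | cut -d= -f2)
    echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Usage (empty password, as generated by the playbook):
#   days_until_expiry /etc/elasticsearch/certs/elastic-certificates.p12 ""
```

A number in the low hundreds is your cue to schedule the regeneration above before it becomes an outage.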
Logstash Restart Hangs Forever
Symptom: systemctl restart logstash never returns. The process is in deactivating state. You can’t stop it, and starting a new instance fails because the old one is still “running.”
Cause: The Logstash RPM ships with TimeoutStopSec=infinity. If the pipeline stalls during shutdown (common with Beats input when a client disconnects uncleanly), systemd waits forever for the process to exit.
Fix: Create a systemd override:
sudo mkdir -p /etc/systemd/system/logstash.service.d
sudo tee /etc/systemd/system/logstash.service.d/timeout.conf > /dev/null << 'EOF'
[Service]
TimeoutStopSec=90
EOF
sudo systemctl daemon-reload
sudo systemctl kill logstash # force-stop the hung process
sudo systemctl start logstash
The playbook applies this override automatically. If you’re stuck right now with a hung process, systemctl kill logstash sends SIGKILL immediately.
“It Worked Yesterday” Checklist
When something that was working suddenly isn’t, check these first — they cover 90% of post-deployment breakage:
- Did a dnf update change a config file? Check for .rpmnew or .rpmsave in /etc/elasticsearch/, /etc/kibana/, /etc/logstash/, /etc/httpd/
- Did the service actually restart after your change? systemctl status <service>
- Is SELinux blocking something new? ausearch -m AVC -ts recent
- Did a firewall rule get lost? firewall-cmd --list-all — compare to what the playbook configures
- Is the disk full? df -h /opt /var — check both data and log partitions
- Did a certificate expire? openssl x509 -in /etc/pki/tls/certs/<hostname>.crt -noout -dates
- Is the JVM crashing? dmesg | grep -i oom and check /var/log/elasticsearch/*.log for OutOfMemoryError
- Did the cluster lose quorum? curl -u elastic:YOUR_PASSWORD -s http://localhost:9200/_cluster/health?pretty — if number_of_nodes < 2, a node is unreachable
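The first checklist item is easy to automate. `find_pkg_leftovers` below is a hypothetical helper that lists any .rpmnew/.rpmsave files a package update left behind:

```shell
# Sketch: list config files a package update left behind.
find_pkg_leftovers() {
    find "$@" \( -name '*.rpmnew' -o -name '*.rpmsave' \) 2>/dev/null
}

find_pkg_leftovers /etc/elasticsearch /etc/kibana /etc/logstash /etc/httpd
```

Any output means dnf either staged a new default config (.rpmnew) or shoved your edited one aside (.rpmsave) — diff it against the live file before deciding which to keep.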
Common Error Messages
ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Elasticsearch disk watermark exceeded. Free up disk space, then clear the block:
curl -u elastic:YOUR_PASSWORD -X PUT "http://localhost:9200/_all/_settings" \
-H 'Content-Type: application/json' \
-d '{"index.blocks.read_only_allow_delete": null}'
ConnectionRefusedError: Connection refused - connect(2) for "192.168.1.61" port 9200
Elasticsearch isn’t running or isn’t bound to that IP. Check systemctl status elasticsearch and verify network.host in elasticsearch.yml.
Logstash could not be started because there is already another instance of Logstash running
A previous Logstash process didn’t shut down cleanly. Find and kill it:
ps aux | grep logstash
kill <pid>
rm -f /var/lib/logstash/.lock  # the stale .lock file in Logstash's data directory
systemctl start logstash