What you’ll accomplish: Install Logstash, tune its JVM for your available RAM, configure a pipeline that accepts Beats and syslog input, and verify data is flowing into Elasticsearch.
Why Logstash
Logstash is the ingestion pipeline — it receives logs from your hosts, transforms them (parsing, filtering, enriching), and ships them into Elasticsearch. The obvious question: Beats agents (Filebeat, Metricbeat) can ship directly to Elasticsearch without Logstash. So when do you actually need Logstash in the middle?
Use Logstash when you need to:
- Parse unstructured logs into structured fields (grok patterns, JSON parsing)
- Route different log types to different indices based on content (our default pipeline sends everything to one index series — you can add routing rules later)
- Enrich logs with metadata (environment tags, geo-IP lookup)
- Accept syslog input from network devices that can’t run Beats agents
Skip Logstash when:
- Your logs are already structured JSON (Filebeat → ES directly works fine)
- You have fewer than 5 hosts and simple log formats
For a home lab, Logstash is worth it. You’ll have a mix of structured application logs and unstructured syslog from network devices, VMs, and services that don’t support Beats. Logstash handles both.
We co-locate Logstash on es02 — the second Elasticsearch node. Logstash is JVM-based and uses significant heap memory, so giving it a node with 8 GB RAM is ideal.
Installation
Logstash installs from the same Elastic repository we set up in Chapter 3. If your Logstash host is also an ES node, the repo file is already there. If it’s a separate machine, create the repo and import the GPG key first:
# Only needed if this host doesn't already have the Elastic repo from Chapter 3
sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
sudo tee /etc/yum.repos.d/elastic-9.x.repo > /dev/null << 'EOF'
[elastic-9.x]
name=Elastic repository for 9.x packages
baseurl=https://artifacts.elastic.co/packages/9.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=0
autorefresh=1
type=rpm-md
EOF
Install Logstash:
sudo dnf install logstash -y --enablerepo=elastic-9.x
After install, fix ownership on the Logstash directories — the RPM sometimes leaves them owned by root:
sudo chown -R logstash:logstash /usr/share/logstash/data
sudo chown -R logstash:logstash /etc/logstash
Create the Systemd Unit File
Do not skip this step. If you skip it, `systemctl start logstash` fails with "Unit logstash.service not found" and you'll spend 20 minutes wondering why the service doesn't exist.
The Logstash RPM doesn’t always create a systemd service file. The system-install script reads /etc/logstash/startup.options and generates the unit file:
sudo /usr/share/logstash/bin/system-install /etc/logstash/startup.options
sudo systemctl daemon-reload
Data Path Relocation
Same pattern as Elasticsearch — relocate to /opt to protect the root partition:
sudo mkdir -p /opt/lib/logstash
sudo chown logstash:logstash /opt/lib/logstash
sudo chmod 755 /opt/lib/logstash
The logstash.yml configuration below points path.data at this new location.
JVM Heap Tuning
Logstash runs on the JVM, and heap sizing matters. The conventional wisdom for Elasticsearch is “half your RAM for heap, half for filesystem cache.” Logstash is different.
Logstash should get 62.5% of total RAM for heap. Here’s why: Logstash doesn’t benefit from filesystem cache the way Elasticsearch does. ES uses the remaining RAM for Lucene’s memory-mapped file I/O, which dramatically speeds up search. Logstash’s pipeline is entirely in-memory — filter plugins, codec processing, and output buffering all live on the heap. More heap means larger pipeline buffers and less backpressure when Elasticsearch is slow to acknowledge writes.
On a 4 GB host: 4096 * 0.625 = 2560 MB heap. On an 8 GB host: 8192 * 0.625 = 5120 MB.
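The arithmetic can be sketched in shell (a minimal sketch; integer math, so the result rounds down):

```shell
# Compute the 62.5% heap size for a given amount of RAM, in MB.
# 62.5% = 625/1000, kept as integers so it works in plain POSIX shell.
total_ram_mb=8192
heap_mb=$(( total_ram_mb * 625 / 1000 ))
echo "-Xms${heap_mb}m"   # prints -Xms5120m
echo "-Xmx${heap_mb}m"   # prints -Xmx5120m
```

On a real host you could derive `total_ram_mb` from `free -m` or `/proc/meminfo` instead of hard-coding it.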
The playbook sets this in /etc/logstash/jvm.options:
-Xms2560m
-Xmx2560m
To change the heap size, edit the JVM options file:
sudo vi /etc/logstash/jvm.options
-Xms and -Xmx should always be equal — this prevents the JVM from wasting time resizing the heap during operation.
The playbook uses `lineinfile` with `regexp: '^-Xms'` to replace the heap line idempotently. If the computed value hasn't changed, the task reports ok (not changed).
Logstash Configuration
The stock logstash.yml has these settings either commented out or with default values.
File modifications reference
| Line to find | Replace with |
|---|---|
| `path.data: /var/lib/logstash` | `path.data: /opt/lib/logstash` |
| `# node.name: test` | `node.name: es02` |
| `# api.http.host: 127.0.0.1` | `api.http.host: 0.0.0.0` |
| `# api.http.port: 9600-9700` | `api.http.port: 9600` |
Change es02 to your Logstash host’s short hostname if it differs.
Apply all settings (copy-paste)
sudo sed -i 's|^path.data:.*|path.data: /opt/lib/logstash|' /etc/logstash/logstash.yml
sudo sed -i 's/^# node.name:.*/node.name: es02/' /etc/logstash/logstash.yml
sudo sed -i 's/^# api.http.host:.*/api.http.host: 0.0.0.0/' /etc/logstash/logstash.yml
sudo sed -i 's/^# api.http.port:.*/api.http.port: 9600/' /etc/logstash/logstash.yml
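If you want to confirm the sed patterns match before touching the real file, you can dry-run them against a throwaway copy (a sketch; the sample lines reproduce the stock defaults listed in the table above):

```shell
# Dry-run the four sed edits against a temp file with the stock defaults.
tmp=$(mktemp)
cat > "$tmp" << 'EOF'
path.data: /var/lib/logstash
# node.name: test
# api.http.host: 127.0.0.1
# api.http.port: 9600-9700
EOF
sed -i 's|^path.data:.*|path.data: /opt/lib/logstash|' "$tmp"
sed -i 's/^# node.name:.*/node.name: es02/' "$tmp"
sed -i 's/^# api.http.host:.*/api.http.host: 0.0.0.0/' "$tmp"
sed -i 's/^# api.http.port:.*/api.http.port: 9600/' "$tmp"
cat "$tmp"    # all four lines should now show the replacement values
rm -f "$tmp"
```

Running the sed commands a second time changes nothing, since the patterns no longer match the commented-out forms.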
The playbook uses `ansible_hostname` (the short hostname, not the FQDN).
The monitoring API on port 9600 provides pipeline stats, JVM metrics, and health information. We bind it to 0.0.0.0 (all interfaces) for flexibility — the firewall controls which hosts can reach it.
Pipeline Configuration
The heart of Logstash is its pipeline — the input → filter → output chain defined in /etc/logstash/conf.d/logstash.conf. The playbook deploys a generic pipeline that handles two common input types:
Here’s the complete pipeline configuration. Copy-paste this to create the file, then read the breakdown below to understand each section:
sudo tee /etc/logstash/conf.d/logstash.conf > /dev/null << 'EOF'
input {
beats {
port => 5044
client_inactivity_timeout => 180
tags => ["prod"]
}
syslog {
port => 5514
tags => ["syslog", "prod"]
}
}
filter {
if "beats_input_codec_json_applied" not in [tags] and [message] =~ /^\s*\{/ {
json {
source => "message"
target => "parsed"
skip_on_invalid_json => true
}
}
if "syslog" in [tags] {
grok {
match => {
"message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"
}
overwrite => ["message"]
}
date {
match => ["syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss"]
}
}
mutate {
add_field => { "environment" => "prod" }
}
}
output {
elasticsearch {
hosts => ["http://192.168.1.61:9200", "http://192.168.1.62:9200", "http://192.168.1.63:9200"]
index => "app-logs-%{+YYYY.MM.dd}"
http_compression => true
user => "elastic"
password => "YOUR_ELASTIC_PASSWORD"
}
}
EOF
Replace YOUR_ELASTIC_PASSWORD with the value you set for vault_elk_elastic_password. The user and password fields are required when Elasticsearch security is enabled. Without them, Logstash gets 401 Unauthorized responses and can’t index any data.
Note: Using the `elastic` superuser here works but is overpowered for just writing indices. In a production environment, you'd create a dedicated Logstash writer role with only the `create_index` and `write` permissions. For a home lab, the superuser is fine and avoids extra role management.
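Before starting the service, you can ask Logstash to validate the pipeline syntax. This is a sketch assuming the default RPM layout; `--config.test_and_exit` is Logstash's standard syntax-check flag:

```shell
# Validate pipeline syntax without starting the service.
# --path.settings points at the directory holding logstash.yml;
# -f points at the pipeline file to check.
sudo -u logstash /usr/share/logstash/bin/logstash \
  --path.settings /etc/logstash \
  --config.test_and_exit \
  -f /etc/logstash/conf.d/logstash.conf
```

A clean run should report that the configuration is OK; a typo in the pipeline (an unbalanced brace, say) fails here in seconds instead of surfacing as a crash loop after `systemctl start`.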
The following breakdown explains what you just pasted — you don’t need to copy anything else. This is all already in the file you created above.
Understanding the Input Section
The input block defines two listeners:
- `beats` on port 5044 — the standard shipping method. Install Filebeat on your hosts, point it at Logstash, and it'll ship logs with metadata (hostname, file path, timestamps). The `client_inactivity_timeout` of 180 seconds gives Beats agents time to reconnect after network blips without Logstash closing their connection.
- `syslog` on port 5514 — for network devices, legacy systems, and anything that speaks syslog. We use 5514 (not 514) because binding to ports below 1024 requires root, and Logstash runs as the `logstash` user.

Both inputs tag events with `"prod"` — change this in `group_vars/all.yml` via the `elk_environment` variable if you're running a different environment (e.g., staging, dev).
Understanding the Filter Section
The filter block does four things:
- JSON parsing — if the message looks like JSON (starts with `{`) and wasn't already parsed by Beats, it attempts to parse it into structured fields under `parsed`. Skips gracefully on invalid JSON.
- Syslog parsing — for events tagged `"syslog"`, a grok pattern extracts timestamp, hostname, program name, PID, and message into named fields. The `overwrite` directive replaces the raw `message` with the parsed `syslog_message`.
- Timestamp correction — parses the syslog timestamp and sets it as the event's `@timestamp`. Without this, events would use the ingest time instead of when the log was actually generated.
- Environment tagging — adds an `environment` field to every event (defaults to `"prod"`).
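The JSON-detection condition can be illustrated outside Logstash. The pipeline conditional `[message] =~ /^\s*\{/` is just "first non-whitespace character is `{`", which you can mimic with grep (a sketch using made-up sample messages):

```shell
# Same test the pipeline conditional performs, expressed as a grep ERE.
for msg in '{"level":"info"}' '   {"k":1}' 'plain text line'; do
  if printf '%s\n' "$msg" | grep -qE '^[[:space:]]*\{'; then
    echo "JSON-like: $msg"
  else
    echo "not JSON:  $msg"
  fi
done
```

The first two messages take the `json` filter path; the third passes through untouched, which is exactly what `skip_on_invalid_json` would also guarantee if the heuristic ever misfires.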
Understanding the Output Section
The output block sends processed events to Elasticsearch:
- All three ES node IPs are listed for client-side load balancing. If one node is down, Logstash routes to the others. Replace these IPs with your actual ES node addresses.
- `http_compression` reduces network overhead between Logstash and ES.
- `user` and `password` authenticate against Elasticsearch. Replace `YOUR_ELASTIC_PASSWORD` with the `elastic` password you set in Chapter 3.
- `index => "app-logs-%{+YYYY.MM.dd}"` creates daily indices — `app-logs-2026.03.16`, for example. This pattern matches the `app-logs-*` ILM template configured in Chapter 7, so every index gets automatic lifecycle management from day one.
Important: If you don’t update the `hosts` list with your actual ES node IPs, Logstash will keep trying to write to the example addresses above and fail — no data will reach your cluster.
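The `%{+YYYY.MM.dd}` sprintf in the index name resolves against the event's `@timestamp` in UTC, so "today's index" from the shell is (a sketch):

```shell
# The index name Logstash would generate for an event timestamped right now.
date -u +"app-logs-%Y.%m.%d"
```

An event carrying an older syslog timestamp lands in the index for that date rather than the ingest date; that's one more reason the `date` filter in the pipeline matters.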
The companion playbook templates all three files (logstash.yml, jvm.options, logstash.conf) from your variables and calculates JVM heap automatically — no manual math required.
Firewall Rules
Logstash needs three ports open:
# Beats input
sudo firewall-cmd --permanent --add-port=5044/tcp
# Monitoring API
sudo firewall-cmd --permanent --add-port=9600/tcp
# Syslog input
sudo firewall-cmd --permanent --add-port=5514/tcp
sudo firewall-cmd --reload
Logstash also needs outbound access to port 9200 on all three Elasticsearch nodes to write index data. Outbound connections aren’t blocked by firewalld’s default policy, but the ES nodes’ firewalls must allow your Logstash host’s IP on port 9200 — if you followed the firewall rules in Chapter 3, this is already handled.
Start the Service
If you’re following the manual path, start Logstash:
sudo systemctl enable --now logstash
Logstash takes 30-60 seconds to start, especially on first boot when it compiles the pipeline. Don’t panic if it doesn’t respond immediately.
Systemd Timeout Fix
The Logstash RPM ships with TimeoutStopSec=infinity in its systemd unit file. This means if the pipeline stalls during shutdown (common with the Beats input when a client disconnects uncleanly), systemctl restart logstash waits forever. The process never stops, the restart never completes, and you’re stuck SSH’d into a machine with a hung service.
The fix is a systemd override that sets a reasonable timeout:
sudo mkdir -p /etc/systemd/system/logstash.service.d
sudo tee /etc/systemd/system/logstash.service.d/timeout.conf > /dev/null << 'EOF'
[Service]
TimeoutStopSec=90
EOF
sudo systemctl daemon-reload
90 seconds gives Logstash plenty of time to flush its pipeline and shut down gracefully. If it takes longer than that, something is genuinely hung and SIGKILL is appropriate.
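After the daemon-reload you can confirm the override took effect (systemd stores the value in a property named `TimeoutStopUSec` and prints timespans in human-readable form):

```shell
# Show the effective stop timeout; 90 seconds displays as "1min 30s",
# while the unpatched unit reports "infinity".
systemctl show logstash -p TimeoutStopUSec
```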
The playbook applies this override automatically.
Verification
After Logstash finishes starting, check the monitoring API. Replace the IP below with your Logstash host’s IP:
curl http://192.168.1.62:9600/_node/stats/pipelines?pretty
Expected: a JSON response showing pipeline stats. If events.in is 0, that’s normal — no data has been sent yet. The important thing is that the API responds and the pipeline is loaded.
Check that the input ports are listening:
ss -tlnp | grep -E '5044|9600|5514'
Expected: three LISTEN entries for the Logstash process.
Important: Logstash takes 30-60 seconds to start, especially on first boot when it compiles the pipeline. If the API doesn’t respond immediately after `systemctl start logstash`, wait and try again.
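For an end-to-end smoke test before any Beats agents exist, you can hand-feed one event into the syslog input with util-linux's `logger` and then search for it. This is a sketch — the IPs and password are the example values from this chapter; substitute your own:

```shell
# Send a single test event to the syslog input over TCP.
logger --server 192.168.1.62 --port 5514 --tcp "logstash smoke test"

# Give the pipeline a few seconds to flush, then search today's index.
sleep 5
curl -s -u elastic:YOUR_ELASTIC_PASSWORD \
  "http://192.168.1.61:9200/app-logs-*/_search?q=smoke&pretty"
```

A hit in the response proves the whole chain: syslog input, grok filter, and Elasticsearch output with authentication.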
What Automation Looks Like
The svc_logstash8 role:
- Opens firewall ports 5044, 9600, and the syslog port
- Installs Logstash (GPG key, repo, dnf)
- Fixes directory ownership on `/usr/share/logstash/data` and `/etc/logstash`
- Creates data directory on `/opt` (first install only)
- Deploys `logstash.yml` from template — node name, data path, API binding
- Configures JVM heap via `lineinfile` — 62.5% of available RAM
- Runs system-install script (creates systemd unit file)
- Applies systemd TimeoutStopSec override (90s instead of infinity)
- Starts and enables the service
The pro_logstash8 role then:
- Deploys `logstash.conf` pipeline configuration from template (includes conditional `user`/`password` for ES auth when security is enabled)
- Notifies the Restart Logstash handler (only if the pipeline config changed)
- Updates the MOTD with Logstash service information
Every step is idempotent — re-running the playbook on a configured host changes nothing.
Verification Checkpoint
Before moving to Chapter 6, confirm:
- `curl http://<logstash-ip>:9600/_node/stats/pipelines?pretty` returns pipeline stats
- `ss -tlnp | grep -E '5044|9600|5514'` shows three `LISTEN` entries
- `systemctl status logstash` shows active
- `firewall-cmd --list-all` shows ports 5044, 9600, and 5514
Your ingestion pipeline is running. Now let’s set up Filebeat to ship logs into it.