What you’ll accomplish: Install Logstash, tune its JVM for your available RAM, configure a pipeline that accepts Beats and syslog input, and verify data is flowing into Elasticsearch.
Why Logstash
Logstash is the ingestion pipeline — it receives logs from your hosts, transforms them (parsing, filtering, enriching), and ships them into Elasticsearch. The obvious question: Beats agents (Filebeat, Metricbeat) can ship directly to Elasticsearch without Logstash. So when do you actually need Logstash in the middle?
Use Logstash when you need to:
- Parse unstructured logs into structured fields (grok patterns, JSON parsing)
- Route different log types to different indices based on content (our default pipeline sends everything to one index series — you can add routing rules later)
- Enrich logs with metadata (environment tags, geo-IP lookup)
- Accept syslog input from network devices that can’t run Beats agents
Skip Logstash when:
- Your logs are already structured JSON (Filebeat → ES directly works fine)
- You have fewer than 5 hosts and simple log formats
For a home lab, Logstash is worth it. You’ll have a mix of structured application logs and unstructured syslog from network devices, VMs, and services that don’t support Beats. Logstash handles both.
We co-locate Logstash on es02 — the second Elasticsearch node. Logstash is JVM-based and uses significant heap memory, so giving it a node with 8 GB RAM is ideal.
Installation
Logstash installs from the same Elastic repository we set up in Chapter 3. If your Logstash host is also an ES node, the repo file is already there. If it’s a separate machine, create the repo and import the GPG key first:
# Only needed if this host doesn't already have the Elastic repo from Chapter 3
sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
sudo tee /etc/yum.repos.d/elastic-9.x.repo > /dev/null << 'EOF'
[elastic-9.x]
name=Elastic repository for 9.x packages
baseurl=https://artifacts.elastic.co/packages/9.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=0
autorefresh=1
type=rpm-md
EOF
Install Logstash:
sudo dnf install logstash -y --enablerepo=elastic-9.x
After install, fix ownership on the Logstash directories — the RPM sometimes leaves them owned by root:
sudo chown -R logstash:logstash /usr/share/logstash/data
sudo chown -R logstash:logstash /etc/logstash
Create the Systemd Unit File
Do not skip this step. If you skip it, `systemctl start logstash` fails with "Unit logstash.service not found" and you'll spend 20 minutes wondering why the service doesn't exist.
The Logstash RPM doesn’t always create a systemd service file. The system-install script reads /etc/logstash/startup.options and generates the unit file:
sudo /usr/share/logstash/bin/system-install /etc/logstash/startup.options
sudo systemctl daemon-reload
Data Path Relocation
Same pattern as Elasticsearch — relocate to /opt to protect the root partition:
sudo mkdir -p /opt/lib/logstash
sudo chown logstash:logstash /opt/lib/logstash
sudo chmod 755 /opt/lib/logstash
The logstash.yml configuration below points path.data at this new location.
JVM Heap Tuning
Logstash runs on the JVM, and heap sizing matters. The conventional wisdom for Elasticsearch is “half your RAM for heap, half for filesystem cache.” Logstash is different.
Logstash should get 62.5% of total RAM for heap. Here’s why: Logstash doesn’t benefit from filesystem cache the way Elasticsearch does. ES uses the remaining RAM for Lucene’s memory-mapped file I/O, which dramatically speeds up search. Logstash’s pipeline is entirely in-memory — filter plugins, codec processing, and output buffering all live on the heap. More heap means larger pipeline buffers and less backpressure when Elasticsearch is slow to acknowledge writes.
On a 4 GB host: 4096 * 0.625 = 2560 MB heap. On an 8 GB host: 8192 * 0.625 = 5120 MB.
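The arithmetic can be sketched in shell (a minimal sketch; integer math, so the result rounds down):

```shell
# Compute the 62.5% heap size for a given amount of RAM, in MB.
# 62.5% = 625/1000, kept as integers so it works in plain POSIX shell.
total_ram_mb=8192
heap_mb=$(( total_ram_mb * 625 / 1000 ))
echo "-Xms${heap_mb}m"   # prints -Xms5120m
echo "-Xmx${heap_mb}m"   # prints -Xmx5120m
```

On a real host you could derive `total_ram_mb` from `free -m` or `/proc/meminfo` instead of hard-coding it.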
The playbook sets this in /etc/logstash/jvm.options:
-Xms2560m
-Xmx2560m
To change the heap size, edit the JVM options file:
sudo vi /etc/logstash/jvm.options
-Xms and -Xmx should always be equal — this prevents the JVM from wasting time resizing the heap during operation.
The playbook uses `lineinfile` with `regexp: '^-Xms'` to replace the heap line idempotently. If the computed value hasn't changed, the task reports ok (not changed).
Logstash Configuration
The stock logstash.yml has these settings either commented out or with default values.
File modifications reference
| Line to find | Replace with |
|---|---|
| `path.data: /var/lib/logstash` | `path.data: /opt/lib/logstash` |
| `# node.name: test` | `node.name: es02` |
| `# api.http.host: 127.0.0.1` | `api.http.host: 0.0.0.0` |
| `# api.http.port: 9600-9700` | `api.http.port: 9600` |
Change es02 to your Logstash host’s short hostname if it differs.
Apply all settings (copy-paste)
sudo sed -i 's|^path.data:.*|path.data: /opt/lib/logstash|' /etc/logstash/logstash.yml
sudo sed -i 's/^# node.name:.*/node.name: es02/' /etc/logstash/logstash.yml
sudo sed -i 's/^# api.http.host:.*/api.http.host: 0.0.0.0/' /etc/logstash/logstash.yml
sudo sed -i 's/^# api.http.port:.*/api.http.port: 9600/' /etc/logstash/logstash.yml
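If you want to confirm the sed patterns match before touching the real file, you can dry-run them against a throwaway copy (a sketch; the sample lines reproduce the stock defaults listed in the table above):

```shell
# Dry-run the four sed edits against a temp file with the stock defaults.
tmp=$(mktemp)
cat > "$tmp" << 'EOF'
path.data: /var/lib/logstash
# node.name: test
# api.http.host: 127.0.0.1
# api.http.port: 9600-9700
EOF
sed -i 's|^path.data:.*|path.data: /opt/lib/logstash|' "$tmp"
sed -i 's/^# node.name:.*/node.name: es02/' "$tmp"
sed -i 's/^# api.http.host:.*/api.http.host: 0.0.0.0/' "$tmp"
sed -i 's/^# api.http.port:.*/api.http.port: 9600/' "$tmp"
cat "$tmp"    # all four lines should now show the replacement values
rm -f "$tmp"
```

Running the sed commands a second time changes nothing, since the patterns no longer match the commented-out forms.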
The playbook uses `ansible_hostname` (the short hostname, not the FQDN).
The monitoring API on port 9600 provides pipeline stats, JVM metrics, and health information. We bind it to 0.0.0.0 (all interfaces) for flexibility — the firewall controls which hosts can reach it.
Pipeline Configuration
The heart of Logstash is its pipeline — the input → filter → output chain defined in /etc/logstash/conf.d/logstash.conf. The playbook deploys a generic pipeline that handles two common input types:
Here’s the complete pipeline configuration. Copy-paste this to create the file, then read the breakdown below to understand each section:
sudo tee /etc/logstash/conf.d/logstash.conf > /dev/null << 'EOF'
input {
beats {
port => 5044
client_inactivity_timeout => 180
tags => ["prod"]
}
syslog {
port => 5514
tags => ["syslog", "prod"]
}
}
filter {
if "beats_input_codec_json_applied" not in [tags] and [message] =~ /^\s*\{/ {
json {
source => "message"
target => "parsed"
skip_on_invalid_json => true
}
}
if "syslog" in [tags] {
grok {
match => {
"message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"
}
overwrite => ["message"]
}
date {
match => ["syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss"]
}
}
mutate {
add_field => { "environment" => "prod" }
}
}
output {
elasticsearch {
hosts => ["http://192.168.1.61:9200", "http://192.168.1.62:9200", "http://192.168.1.63:9200"]
index => "app-logs-%{+YYYY.MM.dd}"
http_compression => true
user => "elastic"
password => "YOUR_ELASTIC_PASSWORD"
}
}
EOF
Replace YOUR_ELASTIC_PASSWORD with the value you set for vault_elk_elastic_password. The user and password fields are required when Elasticsearch security is enabled. Without them, Logstash gets 401 Unauthorized responses and can’t index any data.
Note: Using the `elastic` superuser here works but is overpowered for just writing indices. In a production environment, you'd create a dedicated Logstash writer role with only the `create_index` and `write` permissions. For a home lab, the superuser is fine and avoids extra role management.
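Before starting the service, you can ask Logstash to validate the pipeline syntax. This is a sketch assuming the default RPM layout; `--config.test_and_exit` is Logstash's standard syntax-check flag:

```shell
# Validate pipeline syntax without starting the service.
# --path.settings points at the directory holding logstash.yml;
# -f points at the pipeline file to check.
sudo -u logstash /usr/share/logstash/bin/logstash \
  --path.settings /etc/logstash \
  --config.test_and_exit \
  -f /etc/logstash/conf.d/logstash.conf
```

A clean run should report that the configuration is OK; a typo in the pipeline (an unbalanced brace, say) fails here in seconds instead of surfacing as a crash loop after `systemctl start`.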
The following breakdown explains what you just pasted — you don’t need to copy anything else. This is all already in the file you created above.
Understanding the Input Section
The input block defines two listeners:
- `beats` on port 5044 — the standard shipping method. Install Filebeat on your hosts, point it at Logstash, and it'll ship logs with metadata (hostname, file path, timestamps). The `client_inactivity_timeout` of 180 seconds gives Beats agents time to reconnect after network blips without Logstash closing their connection.
- `syslog` on port 5514 — for network devices, legacy systems, and anything that speaks syslog. We use 5514 (not 514) because binding to ports below 1024 requires root, and Logstash runs as the `logstash` user.

Both inputs tag events with `"prod"` — change this in `group_vars/all.yml` via the `elk_environment` variable if you're running a different environment (e.g., staging, dev).
Understanding the Filter Section
The filter block does four things:
- JSON parsing — if the message looks like JSON (starts with `{`) and wasn't already parsed by Beats, it attempts to parse it into structured fields under `parsed`. Skips gracefully on invalid JSON.
- Syslog parsing — for events tagged `"syslog"`, a grok pattern extracts timestamp, hostname, program name, PID, and message into named fields. The `overwrite` directive replaces the raw `message` with the parsed `syslog_message`.
- Timestamp correction — parses the syslog timestamp and sets it as the event's `@timestamp`. Without this, events would use the ingest time instead of when the log was actually generated.
- Environment tagging — adds an `environment` field to every event (defaults to `"prod"`).
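The JSON-detection condition can be illustrated outside Logstash. The pipeline conditional `[message] =~ /^\s*\{/` is just "first non-whitespace character is `{`", which you can mimic with grep (a sketch using made-up sample messages):

```shell
# Same test the pipeline conditional performs, expressed as a grep ERE.
for msg in '{"level":"info"}' '   {"k":1}' 'plain text line'; do
  if printf '%s\n' "$msg" | grep -qE '^[[:space:]]*\{'; then
    echo "JSON-like: $msg"
  else
    echo "not JSON:  $msg"
  fi
done
```

The first two messages take the `json` filter path; the third passes through untouched, which is exactly what `skip_on_invalid_json` would also guarantee if the heuristic ever misfires.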
Understanding the Output Section
The output block sends processed events to Elasticsearch:
- All three ES node IPs are listed for client-side load balancing. If one node is down, Logstash routes to the others. Replace these IPs with your actual ES node addresses.
- `http_compression` reduces network overhead between Logstash and ES.
- `user` and `password` authenticate against Elasticsearch. Replace `YOUR_ELASTIC_PASSWORD` with the `elastic` password you set in Chapter 3.
- `index => "app-logs-%{+YYYY.MM.dd}"` creates daily indices — `app-logs-2026.03.16`, for example. This pattern matches the `app-logs-*` ILM template configured in Chapter 7, so every index gets automatic lifecycle management from day one.
Important: If you don’t update the `hosts` list with your actual ES node IPs, Logstash will keep trying to write to the example addresses above and fail — no data will reach your cluster.
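The `%{+YYYY.MM.dd}` sprintf in the index name resolves against the event's `@timestamp` in UTC, so "today's index" from the shell is (a sketch):

```shell
# The index name Logstash would generate for an event timestamped right now.
date -u +"app-logs-%Y.%m.%d"
```

An event carrying an older syslog timestamp lands in the index for that date rather than the ingest date; that's one more reason the `date` filter in the pipeline matters.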
The companion playbook templates all three files (logstash.yml, jvm.options, logstash.conf) from your variables and calculates JVM heap automatically — no manual math required.
Firewall Rules
Logstash needs three ports open:
# Beats input
sudo firewall-cmd --permanent --add-port=5044/tcp
# Monitoring API
sudo firewall-cmd --permanent --add-port=9600/tcp
# Syslog input
sudo firewall-cmd --permanent --add-port=5514/tcp
sudo firewall-cmd --reload
Logstash also needs outbound access to port 9200 on all three Elasticsearch nodes to write index data. Outbound connections aren’t blocked by firewalld’s default policy, but the ES nodes’ firewalls must allow your Logstash host’s IP on port 9200 — if you followed the firewall rules in Chapter 3, this is already handled.
Start the Service
If you’re following the manual path, start Logstash:
sudo systemctl enable --now logstash
Logstash takes 30-60 seconds to start, especially on first boot when it compiles the pipeline. Don’t panic if it doesn’t respond immediately.
Systemd Timeout Fix
The Logstash RPM ships with TimeoutStopSec=infinity in its systemd unit file. This means if the pipeline stalls during shutdown (common with the Beats input when a client disconnects uncleanly), systemctl restart logstash waits forever. The process never stops, the restart never completes, and you’re stuck SSH’d into a machine with a hung service.
The fix is a systemd override that sets a reasonable timeout:
sudo mkdir -p /etc/systemd/system/logstash.service.d
sudo tee /etc/systemd/system/logstash.service.d/timeout.conf > /dev/null << 'EOF'
[Service]
TimeoutStopSec=90
EOF
sudo systemctl daemon-reload
90 seconds gives Logstash plenty of time to flush its pipeline and shut down gracefully. If it takes longer than that, something is genuinely hung and SIGKILL is appropriate.
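After the daemon-reload you can confirm the override took effect (systemd stores the value in a property named `TimeoutStopUSec` and prints timespans in human-readable form):

```shell
# Show the effective stop timeout; 90 seconds displays as "1min 30s",
# while the unpatched unit reports "infinity".
systemctl show logstash -p TimeoutStopUSec
```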
The playbook applies this override automatically.
Verification
After Logstash finishes starting, check the monitoring API. Replace the IP below with your Logstash host’s IP:
curl http://192.168.1.62:9600/_node/stats/pipelines?pretty
Expected: a JSON response showing pipeline stats. If events.in is 0, that’s normal — no data has been sent yet. The important thing is that the API responds and the pipeline is loaded.
Check that the input ports are listening:
ss -tlnp | grep -E '5044|9600|5514'
Expected: three LISTEN entries for the Logstash process.
Important: Logstash takes 30-60 seconds to start, especially on first boot when it compiles the pipeline. If the API doesn’t respond immediately after `systemctl start logstash`, wait and try again.
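For an end-to-end smoke test before any Beats agents exist, you can hand-feed one event into the syslog input with util-linux's `logger` and then search for it. This is a sketch — the IPs and password are the example values from this chapter; substitute your own:

```shell
# Send a single test event to the syslog input over TCP.
logger --server 192.168.1.62 --port 5514 --tcp "logstash smoke test"

# Give the pipeline a few seconds to flush, then search today's index.
sleep 5
curl -s -u elastic:YOUR_ELASTIC_PASSWORD \
  "http://192.168.1.61:9200/app-logs-*/_search?q=smoke&pretty"
```

A hit in the response proves the whole chain: syslog input, grok filter, and Elasticsearch output with authentication.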
What Automation Looks Like
The svc_logstash8 role:
- Opens firewall ports 5044, 9600, and the syslog port
- Installs Logstash (GPG key, repo, dnf)
- Fixes directory ownership on `/usr/share/logstash/data` and `/etc/logstash`
- Creates data directory on `/opt` (first install only)
- Deploys `logstash.yml` from template — node name, data path, API binding
- Configures JVM heap via `lineinfile` — 62.5% of available RAM
- Runs system-install script (creates systemd unit file)
- Applies systemd TimeoutStopSec override (90s instead of infinity)
- Starts and enables the service
The pro_logstash8 role then:
- Deploys `logstash.conf` pipeline configuration from template (includes conditional `user`/`password` for ES auth when security is enabled)
- Notifies the Restart Logstash handler (only if the pipeline config changed)
- Updates the MOTD with Logstash service information
Every step is idempotent — re-running the playbook on a configured host changes nothing.
Verification Checkpoint
Before moving to Chapter 6, confirm:
- `curl http://<logstash-ip>:9600/_node/stats/pipelines?pretty` returns pipeline stats
- `ss -tlnp | grep -E '5044|9600|5514'` shows three `LISTEN` entries
- `systemctl status logstash` shows active
- `firewall-cmd --list-all` shows ports 5044, 9600, and 5514
Your ingestion pipeline is running. Now let’s set up Filebeat to ship logs into it.