
ELK Stack Log Collection and Analysis Pipeline: Elasticsearch, Fluentd, and Kibana Production Deployment and Optimization


Introduction

In production environments, logs are critical data for failure detection, debugging, security auditing, and performance analysis. As system scale grows, a centralized system for collecting and analyzing logs from tens to hundreds of distributed servers becomes essential. The ELK stack (Elasticsearch + Logstash + Kibana) has established itself as the de facto standard for such log pipelines, and the EFK stack, which replaces Logstash with Fluentd, is also widely adopted in Kubernetes environments.

This article covers the architecture of each ELK/EFK stack component, Elasticsearch cluster design and shard strategy, ILM (Index Lifecycle Management) configuration, Fluentd and Fluent Bit comparison and configuration, Kibana dashboard setup, performance tuning, and common failure scenarios with recovery procedures in production environments.

ELK vs EFK Stack Comparison

The key difference between the ELK and EFK stacks lies in the log collector.

| Item | ELK (Logstash) | EFK (Fluentd) |
| --- | --- | --- |
| Language | Java (JRuby) | Ruby + C |
| Memory Usage | ~500MB-1GB | ~40-100MB |
| Plugin Count | 200+ | 1,000+ |
| Configuration Format | Custom DSL | Tag-based routing |
| Kubernetes Affinity | Moderate | Very high (CNCF graduated) |
| Buffering | Memory/Disk | Memory/File |
| Data Parsing | Grok patterns | Regex + parser plugins |
| Best For | Complex transformation logic | Cloud-native, K8s |

Fluentd vs Fluent Bit Comparison

Fluentd and Fluent Bit belong to the same ecosystem but serve different purposes.

| Item | Fluentd | Fluent Bit |
| --- | --- | --- |
| Language | Ruby + C | C |
| Memory Usage | ~40MB+ | ~450KB |
| Plugin Ecosystem | 1,000+ | Core plugins only |
| Role | Central aggregation/transformation | Edge collection/forwarding |
| Best For | Server, aggregator | IoT, sidecar, DaemonSet |
| Processing Performance | Moderate | Very high (10-40x) |

The recommended production architecture is the Fluent Bit (DaemonSet) + Fluentd (Aggregator) + Elasticsearch combination. Fluent Bit collects logs in a lightweight manner on each node, while Fluentd handles transformation and routing centrally.

Elasticsearch Cluster Architecture

Node Role Separation

Separating node roles is the key to a production Elasticsearch cluster.

# elasticsearch-master.yml
cluster.name: prod-logs
node.name: master-01
node.roles: [ master ]
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03
cluster.initial_master_nodes:
  - master-01
  - master-02
  - master-03

# JVM heap settings (jvm.options)
-Xms4g
-Xmx4g
# elasticsearch-data-hot.yml
cluster.name: prod-logs
node.name: data-hot-01
node.roles: [ data_hot, data_content ]
node.attr.data: hot
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03

# JVM heap settings
-Xms16g
-Xmx16g
# elasticsearch-data-warm.yml
cluster.name: prod-logs
node.name: data-warm-01
node.roles: [ data_warm ]
node.attr.data: warm
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03

# JVM heap settings
-Xms8g
-Xmx8g

Recommended specifications for each node role are as follows.

| Node Role | CPU | Memory | Storage | Count |
| --- | --- | --- | --- | --- |
| Master | 4 vCPU | 8GB | 50GB SSD | 3 (odd number) |
| Data Hot | 8-16 vCPU | 32-64GB | NVMe SSD | 3+ |
| Data Warm | 4-8 vCPU | 16-32GB | HDD/SSD | 2+ |
| Data Cold | 2-4 vCPU | 8-16GB | HDD | 1+ |
| Coordinating | 4-8 vCPU | 16GB | 50GB SSD | 2 |
| Ingest | 4-8 vCPU | 16GB | 50GB SSD | 2 |

Shard and Replica Strategy

# Set shard configuration via index template
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "codec": "best_compression",
      "routing.allocation.require.data": "hot"
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "message": { "type": "text" },
        "trace_id": { "type": "keyword" },
        "host": { "type": "keyword" }
      }
    }
  }
}'

The key principles of shard design are as follows.

  • Shard size: Target 30-50GB per shard. Small shards under 10GB incur overhead, and shards over 50GB increase recovery time
  • Shard count: Keep fewer than 20 shards per GB of JVM heap on each data node
  • Replicas: Set at least 1 replica in production to ensure availability
  • refresh_interval: Increase to 30-60 seconds for log data where real-time visibility is less critical
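As a rough sanity check on these numbers, here is a small sizing sketch (the 40GB target and the 20-shards-per-GB-heap ceiling come from the guidelines above; the traffic figures are illustrative):

```python
import math

def plan_shards(daily_gb, retention_days, target_shard_gb=40, replicas=1):
    """Estimate primary shards per daily index and total shards kept in the cluster.

    daily_gb: primary data indexed per day (before replication).
    target_shard_gb: aim for 30-50GB per shard; 40GB is the midpoint.
    """
    primaries = max(1, math.ceil(daily_gb / target_shard_gb))
    total = primaries * (1 + replicas) * retention_days
    return primaries, total

def max_shards_for_heap(heap_gb, shards_per_gb=20):
    """Upper bound on shards a data node should hold (~20 per GB of heap)."""
    return heap_gb * shards_per_gb

# Example: 120GB/day, 90-day retention, 1 replica
primaries, total = plan_shards(120, 90)
print(primaries, total)         # 3 540
print(max_shards_for_heap(16))  # 320
```

With 120GB/day this yields 3 primaries per daily index and 540 shards held over the retention window — comfortably within what three 16GB-heap hot nodes can carry.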

Index Lifecycle Management (ILM) Configuration

ILM is the core feature that automates index management from creation to deletion.

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
# Create ILM policy
curl -X PUT "localhost:9200/_ilm/policy/logs-lifecycle" \
  -H 'Content-Type: application/json' \
  -d @ilm-policy.json

# Attach ILM policy to index template
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-lifecycle",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}'

# Create bootstrap index
curl -X PUT "localhost:9200/logs-000001" \
  -H 'Content-Type: application/json' \
  -d '{
  "aliases": {
    "logs": {
      "is_write_index": true
    }
  }
}'

Key Actions per ILM Phase

| Phase | Trigger Condition | Key Actions | Purpose |
| --- | --- | --- | --- |
| Hot | On index creation | rollover, set_priority | Active writes/reads |
| Warm | After 3 days | shrink, forcemerge, allocate | Read-heavy, save storage |
| Cold | After 30 days | allocate, set_priority | Infrequent reads, minimize cost |
| Delete | After 90 days | delete | Reclaim storage |
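The phase timings above can be sanity-checked with a small sketch that maps an index's age to its expected phase (thresholds copied from the policy JSON; note that `min_age` is measured from rollover, not from index creation):

```python
def ilm_phase(age_days, phases=((90, "delete"), (30, "cold"), (3, "warm"))):
    """Return the ILM phase for an index `age_days` after rollover.

    Thresholds mirror the policy above: warm at 3d, cold at 30d, delete at 90d.
    Checked from oldest to newest so the highest matching threshold wins.
    """
    for min_age, name in phases:
        if age_days >= min_age:
            return name
    return "hot"

print([ilm_phase(d) for d in (0, 2, 3, 29, 30, 89, 90)])
# ['hot', 'hot', 'warm', 'warm', 'cold', 'cold', 'delete']
```

In the real cluster, `GET logs-*/_ilm/explain` reports the actual phase and the reason for any stalled transition.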

Fluentd Configuration and Pipeline Setup

Basic Fluentd Configuration

# /etc/fluentd/fluent.conf
<system>
  log_level info
  workers 4
</system>

# Input: Application log collection
<source>
  @type tail
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  read_from_head true
  <parse>
    @type json
    time_key timestamp
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

# Input: Kubernetes log collection
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd/containers.log.pos
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>

# Filter: Add Kubernetes metadata
<filter kubernetes.**>
  @type kubernetes_metadata
  @id filter_kube_metadata
  skip_labels false
  skip_container_metadata false
</filter>

# Filter: Remove unnecessary logs
<filter **>
  @type grep
  <exclude>
    key log
    pattern /healthcheck|readiness|liveness/
  </exclude>
</filter>

# Filter: Add common fields (hostname, environment)
<filter app.logs>
  @type record_transformer
  enable_ruby true
  <record>
    hostname "#{Socket.gethostname}"
    environment "production"
  </record>
</filter>

# Output: Send to Elasticsearch
<match **>
  @type elasticsearch
  host elasticsearch-coordinating
  port 9200
  logstash_format true
  logstash_prefix fluentd-logs
  logstash_dateformat %Y.%m.%d
  include_tag_key true
  tag_key @fluentd_tag

  <buffer tag, time>
    @type file
    path /var/log/fluentd/buffer
    timekey 1h
    timekey_wait 10m
    chunk_limit_size 64MB
    total_limit_size 8GB
    flush_mode interval
    flush_interval 30s
    flush_thread_count 4
    retry_max_interval 30
    retry_forever true
    overflow_action block
  </buffer>
</match>

Fluent Bit DaemonSet Configuration (Kubernetes)

# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        storage.path  /var/log/flb-storage/
        storage.sync  normal
        storage.checksum off
        storage.backlog.mem_limit 5M

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            cri
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        storage.type      filesystem

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [FILTER]
        Name    grep
        Match   *
        Exclude log healthcheck

    [OUTPUT]
        Name            forward
        Match           *
        Host            fluentd-aggregator.logging.svc.cluster.local
        Port            24224
        Retry_Limit     False

  parsers.conf: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.2
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluent-bit/etc/
            - name: storage
              mountPath: /var/log/flb-storage/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config
        - name: storage
          emptyDir: {}
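The `cri` parser regex in the ConfigMap can be verified offline before rolling out the DaemonSet. This sketch applies the same pattern to a sample containerd log line (the sample line is illustrative; Python uses `(?P<name>...)` named groups where Fluent Bit's Onigmo engine accepts `(?<name>...)`):

```python
import re

# Same pattern as the cri parser in parsers.conf, in Python named-group syntax
CRI_RE = re.compile(
    r'^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
)

line = '2026-03-10T12:34:56.789012345Z stdout F {"level":"info","msg":"started"}'
m = CRI_RE.match(line)
print(m.group('stream'), m.group('logtag'))  # stdout F
print(m.group('log'))                        # the raw JSON payload
```

The `logtag` field is `F` for full lines and `P` for partial lines that the runtime split; the kubernetes filter's `Merge_Log On` then lifts the JSON payload into top-level fields.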

Kibana Dashboard Setup

Index Pattern Configuration

You must first create an index pattern in Kibana.

# Create index pattern via Kibana API
curl -X POST "localhost:5601/api/saved_objects/index-pattern" \
  -H 'kbn-xsrf: true' \
  -H 'Content-Type: application/json' \
  -d '{
  "attributes": {
    "title": "fluentd-logs-*",
    "timeFieldName": "@timestamp"
  }
}'

Effective Dashboard Design Principles

Kibana dashboards should be designed according to their purpose.

| Dashboard Type | Included Visualizations | Target Users |
| --- | --- | --- |
| Operations Overview | Log volume trends, error rate graphs, service distribution | SRE/DevOps |
| Error Analysis | Error type classification, top error messages, stack traces | Developers |
| Security Audit | Authentication failure events, anomalous access patterns | Security team |
| Infrastructure Monitoring | Per-node log volume, indexing rate, latency | Platform team |

Key visualization components include the following.

  • Lens charts: Time-series log volume, error rate by service
  • TSVB (Time Series Visual Builder): Detailed time-series analysis
  • Data Table: Top N error messages, per-service statistics
  • Markdown widgets: Dashboard descriptions, runbook links

Performance Tuning and Optimization

Elasticsearch Indexing Performance Optimization

# Bulk indexing optimization settings
curl -X PUT "localhost:9200/logs-000001/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "index": {
    "refresh_interval": "30s",
    "translog.durability": "async",
    "translog.sync_interval": "30s",
    "translog.flush_threshold_size": "1gb"
  }
}'

Key Performance Tuning Parameters

| Parameter | Default | Recommended | Description |
| --- | --- | --- | --- |
| refresh_interval | 1s | 30s-60s | Increase for log data where real-time visibility is less critical |
| translog.durability | request | async | Async translog for improved write performance |
| number_of_replicas | 1 | 0 (during initial load) | Disable replicas during bulk loading |
| bulk size | - | 5-15MB | Too large causes memory pressure, too small adds overhead |
| flush_thread_count | 1 | 4-8 | Fluentd output thread count |
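Keeping bulk requests in the 5-15MB range means chunking documents by payload size rather than by document count. A minimal sketch, independent of any Elasticsearch client library (index name and sizes are illustrative):

```python
import json

def bulk_bodies(docs, index="logs", max_bytes=10 * 1024 * 1024):
    """Chunk documents into _bulk NDJSON bodies of at most max_bytes each.

    Each document contributes an action line plus a source line, so the
    body stays within the 5-15MB sweet spot from the table above.
    """
    action = json.dumps({"index": {"_index": index}}) + "\n"
    body, size = [], 0
    for doc in docs:
        entry = action + json.dumps(doc) + "\n"
        if body and size + len(entry) > max_bytes:
            yield "".join(body)
            body, size = [], 0
        body.append(entry)
        size += len(entry)
    if body:
        yield "".join(body)

docs = [{"message": "x" * 1024} for _ in range(30)]
bodies = list(bulk_bodies(docs, max_bytes=10_000))
print(len(bodies))  # several bodies, each under 10,000 bytes
```

Each yielded string is a complete `POST /_bulk` body; official clients expose the same idea via their bulk helpers.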

JVM Heap Memory Configuration

# jvm.options settings
# Max 50% of total RAM, never more than 31GB (Compressed OOPs limit)
-Xms16g
-Xmx16g

# G1GC settings (default in Elasticsearch 8.x)
-XX:+UseG1GC
-XX:G1HeapRegionSize=16m
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled

# Enable GC logging
-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
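The two heap rules in the comment above (half of RAM, but always below the Compressed OOPs threshold) reduce to a one-liner; a quick sketch:

```python
def recommended_heap_gb(ram_gb, compressed_oops_cap=31):
    """Half of physical RAM, capped below the ~32GB Compressed OOPs threshold.

    Above the cap the JVM switches to uncompressed 64-bit object pointers,
    so a 40GB heap can hold fewer objects than a 31GB one.
    """
    return min(ram_gb // 2, compressed_oops_cap)

print([recommended_heap_gb(r) for r in (8, 32, 64, 128)])
# [4, 16, 31, 31]
```

Note that a 64GB node gets a 31GB heap, not 32GB: the remaining RAM is not wasted, since Elasticsearch relies heavily on the OS page cache for Lucene segment files.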

Storage Optimization

# Force merge segments (on read-only indices)
curl -X POST "localhost:9200/logs-2026.03.01/_forcemerge?max_num_segments=1"

# Check index compression
curl -X GET "localhost:9200/_cat/indices/logs-*?v&h=index,store.size,pri.store.size,docs.count&s=index"

# Disk watermark settings
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'

Operational Notes

1. Preventing Mapping Explosion

When dynamic mapping is enabled and you index JSON logs with freely defined key names, the field count grows rapidly, destabilizing the cluster.

# Set mapping field count limit
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.mapping.total_fields.limit": 1000
    },
    "mappings": {
      "dynamic": "strict"
    }
  }
}'

2. Avoiding Shard Overallocation

  • More than 1,000 small shards can cause severe load on master nodes
  • Minimize the number of shards per index and use ILM rollover for size-based splitting
  • Periodically monitor using the _cat/shards API
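With `?h=index,shard,prirep,state,node`, the `_cat/shards` output can be tallied per node offline when auditing shard balance. A small sketch over an illustrative sample:

```python
from collections import Counter

# Illustrative _cat/shards output with h=index,shard,prirep,state,node;
# the unassigned shard has no node column
cat_shards = """\
logs-000001 0 p STARTED data-hot-01
logs-000001 0 r STARTED data-hot-02
logs-000001 1 p STARTED data-hot-02
logs-000002 0 p UNASSIGNED
"""

def shards_per_node(text):
    """Count assigned shards per node from _cat/shards plain-text output."""
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 5:  # only assigned shards carry a node name
            counts[parts[-1]] += 1
    return counts

print(shards_per_node(cat_shards))
```

Comparing these counts against the 20-shards-per-GB-heap ceiling from the shard strategy section quickly flags overloaded nodes.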

3. Managing GC Pressure

  • Keep heap under 31GB to leverage Compressed OOPs
  • If Old GC is frequent, limit the fielddata cache size
  • Configure circuit breakers to prevent OOM
# Circuit breaker settings
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.request.limit": "40%"
  }
}'

4. Security Configuration

# elasticsearch.yml - Security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/certs/elastic-certificates.p12

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: /etc/elasticsearch/certs/http.p12

Failure Cases and Recovery Procedures

Failure Case 1: Cluster Status RED

Symptom: Primary shards are unassigned, risking data loss

# Check unassigned shards
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"

# Diagnose unassignment cause
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

# Manual shard allocation (when disk space is insufficient)
curl -X POST "localhost:9200/_cluster/reroute" \
  -H 'Content-Type: application/json' \
  -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logs-2026.03.10",
        "shard": 0,
        "node": "data-hot-02",
        "accept_data_loss": true
      }
    }
  ]
}'

Failure Case 2: Indexing Delay (Bulk Rejection)

Symptom: Large volume of 429 Too Many Requests errors in Fluentd logs

# Check thread pool status
curl -X GET "localhost:9200/_cat/thread_pool/write?v&h=node_name,active,rejected,queue,completed"

# Adjust bulk queue size - thread pool settings are static node settings,
# so they cannot be changed via the cluster settings API; add the line below
# to elasticsearch.yml on each data node and perform a rolling restart
thread_pool.write.queue_size: 1000

Recovery procedure:

  1. Check Fluentd buffer status (remaining disk buffer)
  2. Add Elasticsearch nodes or increase bulk queue size
  3. Temporarily increase refresh_interval to 60s
  4. If necessary, reduce replicas to 0 to decrease indexing load
  5. Restore replicas to original value after stabilization
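On the client side, retries after 429 responses should back off exponentially rather than hammer the cluster. A sketch of such a schedule, capped at the `retry_max_interval` (30s) used in the Fluentd buffer config earlier:

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=6, jitter=False):
    """Exponential backoff schedule for retrying rejected bulk requests.

    Doubles the wait each attempt, capped at `cap` seconds (mirroring
    Fluentd's retry_max_interval). Optional jitter spreads retries so
    many shippers don't retry in lockstep after an outage.
    """
    delays = []
    for n in range(attempts):
        d = min(cap, base * (2 ** n))
        if jitter:
            d = random.uniform(0, d)
        delays.append(d)
    return delays

print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Fluentd's `retry_forever true` plus `overflow_action block` gives the same behavior end to end: the buffer absorbs the backlog instead of dropping logs while Elasticsearch recovers.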

Failure Case 3: Disk Watermark Exceeded

Symptom: Indices switch to read-only mode

# Release read-only (after securing disk space)
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "index.blocks.read_only_allow_delete": null
}'

# Manually delete old indices
curl -X DELETE "localhost:9200/logs-2026.01.*"

# Check disk usage
curl -X GET "localhost:9200/_cat/allocation?v"

Monitoring Setup

It is recommended to set up a separate monitoring cluster to monitor the ELK stack itself.

# metricbeat.yml - Elasticsearch monitoring
metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - 'https://es-node-01:9200'
      - 'https://es-node-02:9200'
    username: 'monitoring_user'
    password: 'secure_password'
    ssl.certificate_authorities:
      - /etc/metricbeat/certs/ca.crt

  - module: kibana
    xpack.enabled: true
    period: 10s
    hosts:
      - 'https://kibana:5601'

output.elasticsearch:
  hosts:
    - 'https://monitoring-es:9200'
  username: 'metricbeat_writer'
  password: 'secure_password'

Key monitoring metrics are as follows.

| Metric | Threshold | Action |
| --- | --- | --- |
| Cluster status | RED | Immediate response required |
| JVM heap usage | Above 85% | Add nodes or adjust heap |
| Indexing rate | Sudden fluctuation | Check source |
| Search latency | Over 5 seconds | Optimize shards/queries |
| Disk usage | Above 85% | Review ILM policy |
| Unassigned shard count | Above 0 | Diagnose allocation cause |
| Fluentd buffer size | Exceeds threshold | Check output bottleneck |
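The thresholds above are straightforward to encode in an alerting rule; a sketch of the evaluation logic (the metric field names and messages are illustrative, not a Metricbeat schema):

```python
def evaluate(metrics, heap_pct_max=85, disk_pct_max=85, latency_s_max=5):
    """Map the monitoring thresholds from the table above to alert messages."""
    alerts = []
    if metrics.get("cluster_status") == "RED":
        alerts.append("cluster RED: immediate response required")
    if metrics.get("jvm_heap_pct", 0) > heap_pct_max:
        alerts.append("JVM heap above 85%: add nodes or adjust heap")
    if metrics.get("disk_pct", 0) > disk_pct_max:
        alerts.append("disk above 85%: review ILM policy")
    if metrics.get("search_latency_s", 0) > latency_s_max:
        alerts.append("search latency over 5s: optimize shards/queries")
    if metrics.get("unassigned_shards", 0) > 0:
        alerts.append("unassigned shards: diagnose allocation cause")
    return alerts

sample = {"cluster_status": "GREEN", "jvm_heap_pct": 91, "unassigned_shards": 2}
print(evaluate(sample))
```

In practice the same rules would live in Kibana alerting or Watcher, reading from the monitoring cluster's metrics indices.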

Conclusion

The ELK/EFK stack is a mature log pipeline solution with a rich ecosystem and comprehensive features. However, stable production operation requires holistic consideration of Elasticsearch cluster architecture design, shard strategy, ILM policy, Fluentd buffering strategy, and monitoring infrastructure.

Here is a summary of the key takeaways.

  • Node role separation: Separate Master, Data Hot/Warm/Cold, and Coordinating nodes for fault isolation and performance optimization
  • ILM utilization: Optimize storage costs with automated Hot-Warm-Cold-Delete phase transitions
  • Fluent Bit + Fluentd combination: Separate lightweight collection at the edge from transformation and routing at the center
  • Mapping management: Prevent mapping explosion with strict mapping and field count limits
  • Monitoring: Monitor the ELK stack itself with a separate monitoring cluster

For smaller environments, start with a single node or small cluster, and gradually introduce Hot-Warm-Cold architecture and ILM as data volume increases.
