ELK Stack Log Collection and Analysis Pipeline: Elasticsearch, Fluentd, and Kibana Production Deployment and Optimization
- Introduction
- ELK vs EFK Stack Comparison
- Elasticsearch Cluster Architecture
- Index Lifecycle Management (ILM) Configuration
- Fluentd Configuration and Pipeline Setup
- Kibana Dashboard Setup
- Performance Tuning and Optimization
- Operational Notes
- Failure Cases and Recovery Procedures
- Monitoring Setup
- Conclusion
- References

Introduction
In production environments, logs are critical data for failure detection, debugging, security auditing, and performance analysis. As system scale grows, a centralized system for collecting and analyzing logs from tens to hundreds of distributed servers becomes essential. The ELK stack (Elasticsearch + Logstash + Kibana) has established itself as the de facto standard for such log pipelines, and the EFK stack, which replaces Logstash with Fluentd, is also widely adopted in Kubernetes environments.
This article covers the architecture of each ELK/EFK stack component, Elasticsearch cluster design and shard strategy, ILM (Index Lifecycle Management) configuration, Fluentd and Fluent Bit comparison and configuration, Kibana dashboard setup, performance tuning, and common failure scenarios with recovery procedures in production environments.
ELK vs EFK Stack Comparison
The key difference between the ELK and EFK stacks lies in the log collector.
| Item | ELK (Logstash) | EFK (Fluentd) |
|---|---|---|
| Language | Java (JRuby) | Ruby + C |
| Memory Usage | ~500MB-1GB | ~40-100MB |
| Plugin Count | 200+ | 1,000+ |
| Configuration Format | Custom DSL | Tag-based routing |
| Kubernetes Affinity | Moderate | Very high (CNCF graduated) |
| Buffering | Memory/Disk | Memory/File |
| Data Parsing | Grok patterns | Regex + parser plugins |
| Best For | Complex transformation logic | Cloud-native, K8s |
Fluentd vs Fluent Bit Comparison
Fluentd and Fluent Bit belong to the same ecosystem but serve different purposes.
| Item | Fluentd | Fluent Bit |
|---|---|---|
| Language | Ruby + C | C |
| Memory Usage | ~40MB+ | ~450KB |
| Plugin Ecosystem | 1,000+ | Core plugins only |
| Role | Central aggregation/transformation | Edge collection/forwarding |
| Best For | Server, aggregator | IoT, sidecar, DaemonSet |
| Processing Performance | Moderate | Very high (10-40x) |
The recommended production architecture is the Fluent Bit (DaemonSet) + Fluentd (Aggregator) + Elasticsearch combination. Fluent Bit collects logs in a lightweight manner on each node, while Fluentd handles transformation and routing centrally.
Elasticsearch Cluster Architecture
Node Role Separation
Separating node roles is the key to a production Elasticsearch cluster.
# elasticsearch-master.yml
cluster.name: prod-logs
node.name: master-01
node.roles: [ master ]
network.host: 0.0.0.0
discovery.seed_hosts:
- master-01
- master-02
- master-03
cluster.initial_master_nodes:
- master-01
- master-02
- master-03
# JVM heap settings (jvm.options)
-Xms4g
-Xmx4g
# elasticsearch-data-hot.yml
cluster.name: prod-logs
node.name: data-hot-01
node.roles: [ data_hot, data_content ]
node.attr.data: hot
network.host: 0.0.0.0
discovery.seed_hosts:
- master-01
- master-02
- master-03
# JVM heap settings
-Xms16g
-Xmx16g
# elasticsearch-data-warm.yml
cluster.name: prod-logs
node.name: data-warm-01
node.roles: [ data_warm ]
node.attr.data: warm
network.host: 0.0.0.0
discovery.seed_hosts:
- master-01
- master-02
- master-03
# JVM heap settings
-Xms8g
-Xmx8g
Recommended specifications for each node role are as follows.
| Node Role | CPU | Memory | Storage | Count |
|---|---|---|---|---|
| Master | 4 vCPU | 8GB | 50GB SSD | 3 (odd number) |
| Data Hot | 8-16 vCPU | 32-64GB | NVMe SSD | 3+ |
| Data Warm | 4-8 vCPU | 16-32GB | HDD/SSD | 2+ |
| Data Cold | 2-4 vCPU | 8-16GB | HDD | 1+ |
| Coordinating | 4-8 vCPU | 16GB | 50GB SSD | 2 |
| Ingest | 4-8 vCPU | 16GB | 50GB SSD | 2 |
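The "3 (odd number)" guidance for master nodes follows directly from majority-quorum arithmetic. A quick sketch (plain Python, no Elasticsearch API involved):

```python
def master_quorum(master_nodes: int) -> int:
    """Minimum master-eligible nodes needed to elect a master (simple majority)."""
    return master_nodes // 2 + 1

def tolerated_failures(master_nodes: int) -> int:
    """Master-eligible nodes that can be lost while the cluster stays electable."""
    return master_nodes - master_quorum(master_nodes)

# 3 masters tolerate 1 failure; a 4th adds cost but no extra fault tolerance.
for n in (3, 4, 5):
    print(f"{n} masters -> quorum {master_quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

This is why even counts of master nodes are wasteful: going from 3 to 4 raises the quorum from 2 to 3 without letting you survive any additional failures.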
Shard and Replica Strategy
# Set shard configuration via index template
curl -X PUT "localhost:9200/_index_template/logs-template" \
-H 'Content-Type: application/json' \
-d '{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"refresh_interval": "30s",
"codec": "best_compression",
"routing.allocation.require.data": "hot"
},
"mappings": {
"dynamic": "strict",
"properties": {
"@timestamp": { "type": "date" },
"level": { "type": "keyword" },
"service": { "type": "keyword" },
"message": { "type": "text" },
"trace_id": { "type": "keyword" },
"host": { "type": "keyword" }
}
}
}
}'
The key principles of shard design are as follows.
- Shard size: Target 30-50GB per shard. Small shards under 10GB incur overhead, and shards over 50GB increase recovery time
- Shard count: keep the total number of shards on each node below roughly 20 per GB of JVM heap
- Replicas: Set at least 1 replica in production to ensure availability
- refresh_interval: Increase to 30-60 seconds for log data where real-time visibility is less critical
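These rules reduce to simple back-of-the-envelope arithmetic. A sketch with illustrative numbers (the 40GB target and the 20-shards-per-GB rule come from the guidance above; this is not an Elasticsearch API call):

```python
import math

def primary_shards_for(index_size_gb: float, target_shard_gb: float = 40.0) -> int:
    """Primary shards needed to keep each shard in the ~30-50GB sweet spot."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

def max_shards_per_node(heap_gb: float, shards_per_heap_gb: int = 20) -> int:
    """Rule of thumb: stay under ~20 shards per GB of JVM heap."""
    return int(heap_gb * shards_per_heap_gb)

# A 100GB daily index -> 3 primaries of ~33GB each
print(primary_shards_for(100))
# A hot node with a 16GB heap should hold at most ~320 shards
print(max_shards_per_node(16))
```

Running this kind of estimate before choosing number_of_shards avoids the most common sizing mistake: many tiny daily indices with the default shard count.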
Index Lifecycle Management (ILM) Configuration
ILM is the core feature that automates index management from creation to deletion.
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_primary_shard_size": "50gb",
"max_age": "1d"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "3d",
"actions": {
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
},
"allocate": {
"require": {
"data": "warm"
}
},
"set_priority": {
"priority": 50
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"require": {
"data": "cold"
}
},
"set_priority": {
"priority": 0
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
# Create ILM policy
curl -X PUT "localhost:9200/_ilm/policy/logs-lifecycle" \
-H 'Content-Type: application/json' \
-d @ilm-policy.json
# Attach the ILM policy to the index template (note: PUT replaces the entire
# template, so in practice merge these settings into the template defined above)
curl -X PUT "localhost:9200/_index_template/logs-template" \
-H 'Content-Type: application/json' \
-d '{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"index.lifecycle.name": "logs-lifecycle",
"index.lifecycle.rollover_alias": "logs"
}
}
}'
# Create bootstrap index
curl -X PUT "localhost:9200/logs-000001" \
-H 'Content-Type: application/json' \
-d '{
"aliases": {
"logs": {
"is_write_index": true
}
}
}'
Key Actions per ILM Phase
| Phase | Trigger Condition | Key Actions | Purpose |
|---|---|---|---|
| Hot | On index creation | rollover, set_priority | Active writes/reads |
| Warm | After 3 days | shrink, forcemerge, allocate | Read-heavy, save storage |
| Cold | After 30 days | allocate, set_priority | Infrequent reads, minimize cost |
| Delete | After 90 days | delete | Reclaim storage |
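The phase transitions above can be modeled as a small age lookup. This mirrors the day thresholds of the example policy (min_age is measured from rollover); it is an illustration, not the actual ILM engine:

```python
def ilm_phase(age_days: float) -> str:
    """Phase of an index a given number of days after rollover, mirroring
    the example policy: hot -> warm@3d -> cold@30d -> delete@90d."""
    # Evaluate the oldest qualifying phase first
    for phase, min_age in (("delete", 90), ("cold", 30), ("warm", 3)):
        if age_days >= min_age:
            return phase
    return "hot"

for age in (1, 10, 45, 120):
    print(f"day {age}: {ilm_phase(age)}")
```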
Fluentd Configuration and Pipeline Setup
Basic Fluentd Configuration
<!-- /etc/fluentd/fluent.conf -->
<system>
log_level info
workers 4
</system>
<!-- Input: Application log collection -->
<source>
@type tail
path /var/log/app/*.log
pos_file /var/log/fluentd/app.log.pos
tag app.logs
read_from_head true
<parse>
@type json
time_key timestamp
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<!-- Input: Kubernetes log collection -->
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd/containers.log.pos
tag kubernetes.*
<parse>
@type cri
</parse>
</source>
<!-- Filter: Add Kubernetes metadata -->
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
skip_labels false
skip_container_metadata false
</filter>
<!-- Filter: Remove unnecessary logs -->
<filter **>
@type grep
<exclude>
key log
pattern /healthcheck|readiness|liveness/
</exclude>
</filter>
<!-- Filter: Tag based on log level -->
<filter app.logs>
@type record_transformer
enable_ruby true
<record>
hostname "#{Socket.gethostname}"
environment "production"
</record>
</filter>
<!-- Output: Send to Elasticsearch -->
<match **>
@type elasticsearch
host elasticsearch-coordinating
port 9200
logstash_format true
logstash_prefix fluentd-logs
logstash_dateformat %Y.%m.%d
include_tag_key true
tag_key @fluentd_tag
<buffer tag, time>
@type file
path /var/log/fluentd/buffer
timekey 1h
timekey_wait 10m
chunk_limit_size 64MB
total_limit_size 8GB
flush_mode interval
flush_interval 30s
flush_thread_count 4
retry_max_interval 30
retry_forever true
overflow_action block
</buffer>
</match>
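With the buffer settings above you can estimate how long Fluentd can absorb an Elasticsearch outage before the file buffer fills and overflow_action block starts backpressuring inputs. Simple arithmetic with illustrative throughput numbers:

```python
def buffer_survival_minutes(total_limit_gb: float, ingest_mb_per_s: float) -> float:
    """Minutes until the file buffer fills while the output is completely down."""
    return total_limit_gb * 1024 / ingest_mb_per_s / 60

# The 8GB total_limit_size above absorbs a ~27-minute outage at 5MB/s of logs
print(round(buffer_survival_minutes(8, 5), 1))
```

If this window is shorter than your realistic Elasticsearch recovery time, increase total_limit_size or add aggregator capacity.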
Fluent Bit DaemonSet Configuration (Kubernetes)
# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[SERVICE]
Flush 5
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 5M
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser cri
DB /var/log/flb_kube.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
storage.type filesystem
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
[FILTER]
Name grep
Match *
Exclude log healthcheck
[OUTPUT]
Name forward
Match *
Host fluentd-aggregator.logging.svc.cluster.local
Port 24224
Retry_Limit False
parsers.conf: |
[PARSER]
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: logging
spec:
selector:
matchLabels:
app: fluent-bit
template:
metadata:
labels:
app: fluent-bit
spec:
serviceAccountName: fluent-bit
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluent-bit
image: fluent/fluent-bit:3.2
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
volumeMounts:
- name: varlog
mountPath: /var/log
- name: config
mountPath: /fluent-bit/etc/
- name: storage
mountPath: /var/log/flb-storage/
volumes:
- name: varlog
hostPath:
path: /var/log
- name: config
configMap:
name: fluent-bit-config
- name: storage
emptyDir: {}
Kibana Dashboard Setup
Index Pattern Configuration
You must first create an index pattern (renamed "data view" in Kibana 8.x) that matches your log indices.
# Create index pattern via Kibana API
curl -X POST "localhost:5601/api/saved_objects/index-pattern" \
-H 'kbn-xsrf: true' \
-H 'Content-Type: application/json' \
-d '{
"attributes": {
"title": "fluentd-logs-*",
"timeFieldName": "@timestamp"
}
}'
Effective Dashboard Design Principles
Kibana dashboards should be designed according to their purpose.
| Dashboard Type | Included Visualizations | Target Users |
|---|---|---|
| Operations Overview | Log volume trends, error rate graphs, service distribution | SRE/DevOps |
| Error Analysis | Error type classification, top error messages, stack traces | Developers |
| Security Audit | Authentication failure events, anomalous access patterns | Security team |
| Infrastructure Monitoring | Per-node log volume, indexing rate, latency | Platform team |
Key visualization components include the following.
- Lens charts: Time-series log volume, error rate by service
- TSVB (Time Series Visual Builder): Detailed time-series analysis
- Data Table: Top N error messages, per-service statistics
- Markdown widgets: Dashboard descriptions, runbook links
Performance Tuning and Optimization
Elasticsearch Indexing Performance Optimization
# Bulk indexing optimization settings
curl -X PUT "localhost:9200/logs-000001/_settings" \
-H 'Content-Type: application/json' \
-d '{
"index": {
"refresh_interval": "30s",
"translog.durability": "async",
"translog.sync_interval": "30s",
"translog.flush_threshold_size": "1gb"
}
}'
Key Performance Tuning Parameters
| Parameter | Default | Recommended | Description |
|---|---|---|---|
| refresh_interval | 1s | 30s-60s | Increase for log data where real-time visibility is less critical |
| translog.durability | request | async | Async translog for improved write performance |
| number_of_replicas | 1 | 0 (during initial load) | Disable replicas during bulk loading |
| bulk size | - | 5-15MB | Too large causes memory pressure, too small adds overhead |
| flush_thread_count | 1 | 4-8 | Fluentd output thread count |
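The 5-15MB bulk-size guidance can be enforced client-side by chunking documents by serialized size before sending. A sketch (official clients such as the elasticsearch-py bulk helpers handle this for you; the 10KB cap in the demo is artificially small):

```python
import json

def bulk_chunks(docs, max_bytes=10 * 1024 * 1024):
    """Yield batches of docs whose serialized size stays under max_bytes."""
    batch, size = [], 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode()) + 1  # +1 for the NDJSON newline
        if batch and size + doc_bytes > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch

docs = [{"message": "x" * 1024}] * 100                  # ~1KB documents
batches = list(bulk_chunks(docs, max_bytes=10 * 1024))  # 10KB cap for the demo
print(len(batches), "batches")
```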
JVM Heap Memory Configuration
# jvm.options settings
# Max 50% of total RAM, never more than 31GB (Compressed OOPs limit)
-Xms16g
-Xmx16g
# G1GC settings (default in Elasticsearch 8.x)
-XX:+UseG1GC
-XX:G1HeapRegionSize=16m
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled
# Enable GC logging
-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
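The two heap rules in the comments (at most 50% of RAM, never above the 31GB Compressed OOPs ceiling) reduce to a one-liner:

```python
def recommended_heap_gb(ram_gb: float) -> float:
    """Half of RAM, capped at the ~31GB Compressed OOPs ceiling."""
    return min(ram_gb / 2, 31.0)

print(recommended_heap_gb(32))   # matches the -Xms16g/-Xmx16g above for a 32GB node
print(recommended_heap_gb(128))  # capped at 31.0; more heap would disable Compressed OOPs
```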
Storage Optimization
# Force merge segments (on read-only indices)
curl -X POST "localhost:9200/logs-2026.03.01/_forcemerge?max_num_segments=1"
# Check index compression
curl -X GET "localhost:9200/_cat/indices/logs-*?v&h=index,store.size,pri.store.size,docs.count&s=index"
# Disk watermark settings
curl -X PUT "localhost:9200/_cluster/settings" \
-H 'Content-Type: application/json' \
-d '{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "85%",
"cluster.routing.allocation.disk.watermark.high": "90%",
"cluster.routing.allocation.disk.watermark.flood_stage": "95%"
}
}'
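A sketch of how allocation reacts at each of the watermark tiers configured above (the behavior strings summarize the documented effects; this is a mental model, not cluster logic):

```python
def watermark_state(disk_used_pct: float,
                    low: float = 85.0, high: float = 90.0, flood: float = 95.0) -> str:
    """Allocation behavior at each disk watermark tier."""
    if disk_used_pct >= flood:
        return "flood_stage: indices on the node forced read-only"
    if disk_used_pct >= high:
        return "high: shards actively relocated off the node"
    if disk_used_pct >= low:
        return "low: no new shards allocated to the node"
    return "ok"

print(watermark_state(87))
print(watermark_state(96))
```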
Operational Notes
1. Preventing Mapping Explosion
When dynamic mapping is enabled and you index JSON logs with freely defined key names, the field count grows rapidly, destabilizing the cluster.
# Set mapping field count limit
curl -X PUT "localhost:9200/_index_template/logs-template" \
-H 'Content-Type: application/json' \
-d '{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"index.mapping.total_fields.limit": 1000
},
"mappings": {
"dynamic": "strict"
}
}
}'
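To see how close an index is to the limit, you can count the fields in a mapping recursively (object and nested types contribute their sub-fields). A sketch with a hypothetical mapping:

```python
def count_fields(properties: dict) -> int:
    """Recursively count mapped fields - the number that is checked
    against index.mapping.total_fields.limit."""
    total = 0
    for field in properties.values():
        total += 1
        if "properties" in field:  # object/nested types nest their own fields
            total += count_fields(field["properties"])
    return total

mapping = {
    "message": {"type": "text"},
    "user": {"properties": {"id": {"type": "keyword"},
                            "name": {"type": "text"}}},
}
print(count_fields(mapping))  # 4: message, user, user.id, user.name
```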
2. Avoiding Shard Overallocation
- More than 1,000 small shards can cause severe load on master nodes
- Minimize the number of shards per index and use ILM rollover for size-based splitting
- Periodically monitor using the _cat/shards API
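That per-node check can be scripted against the plain-text output of the _cat API. A sketch (the sample output below is fabricated for illustration):

```python
from collections import Counter

def shards_per_node(cat_shards_text: str) -> Counter:
    """Count STARTED shards per node from _cat/shards text output
    (default columns: index shard prirep state docs store ip node)."""
    counts = Counter()
    for line in cat_shards_text.strip().splitlines():
        cols = line.split()
        if len(cols) >= 8 and cols[3] == "STARTED":  # skip UNASSIGNED rows
            counts[cols[-1]] += 1
    return counts

sample = """\
logs-000001 0 p STARTED 1000 45gb 10.0.0.1 data-hot-01
logs-000001 0 r STARTED 1000 45gb 10.0.0.2 data-hot-02
logs-000002 0 p UNASSIGNED
"""
print(shards_per_node(sample))
```

Compare the resulting counts against the 20-shards-per-GB-of-heap limit for each data node.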
3. Managing GC Pressure
- Keep heap under 31GB to leverage Compressed OOPs
- If Old GC is frequent, limit the fielddata cache size
- Configure circuit breakers to prevent OOM
# Circuit breaker settings
curl -X PUT "localhost:9200/_cluster/settings" \
-H 'Content-Type: application/json' \
-d '{
"persistent": {
"indices.breaker.total.limit": "70%",
"indices.breaker.fielddata.limit": "40%",
"indices.breaker.request.limit": "40%"
}
}'
4. Security Configuration
# elasticsearch.yml - Security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: /etc/elasticsearch/certs/http.p12
Failure Cases and Recovery Procedures
Failure Case 1: Cluster Status RED
Symptom: Primary shards are unassigned, risking data loss
# Check unassigned shards
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"
# Diagnose unassignment cause
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
# Last resort: allocate a stale primary copy (accept_data_loss discards newer
# writes - use only when no intact shard copy remains)
curl -X POST "localhost:9200/_cluster/reroute" \
-H 'Content-Type: application/json' \
-d '{
"commands": [
{
"allocate_stale_primary": {
"index": "logs-2026.03.10",
"shard": 0,
"node": "data-hot-02",
"accept_data_loss": true
}
}
]
}'
Failure Case 2: Indexing Delay (Bulk Rejection)
Symptom: Large volume of 429 Too Many Requests errors in Fluentd logs
# Check thread pool status
curl -X GET "localhost:9200/_cat/thread_pool/write?v&h=node_name,active,rejected,queue,completed"
# Adjust bulk queue size
curl -X PUT "localhost:9200/_cluster/settings" \
-H 'Content-Type: application/json' \
-d '{
"persistent": {
"thread_pool.write.queue_size": 1000
}
}'
Recovery procedure:
- Check Fluentd buffer status (remaining disk buffer)
- Add Elasticsearch nodes or increase bulk queue size
- Temporarily increase refresh_interval to 60s
- If necessary, reduce replicas to 0 to decrease indexing load
- Restore replicas to original value after stabilization
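On the Fluentd side, retry_max_interval and the file buffer already handle 429s, but custom log producers should back off as well. A sketch of capped exponential backoff with jitter, using a hypothetical send callable that returns an HTTP status code:

```python
import random
import time

def send_with_backoff(send, payload, max_retries=6, base_delay=0.5, cap=30.0):
    """Retry `send` on HTTP 429 with capped exponential backoff and jitter.
    `send` is a hypothetical callable returning a status code."""
    for attempt in range(max_retries):
        status = send(payload)
        if status != 429:
            return status
        # Grow the delay 2x per attempt, cap it, and add jitter to avoid
        # synchronized retry storms from many producers
        delay = min(cap, base_delay * 2 ** attempt) * random.uniform(0.5, 1.0)
        time.sleep(delay)
    raise RuntimeError("bulk rejected after retries: cluster is saturated")

# Demo: an endpoint that rejects twice, then accepts
responses = iter([429, 429, 200])
print(send_with_backoff(lambda p: next(responses), {"docs": []}, base_delay=0.01))
```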
Failure Case 3: Disk Watermark Exceeded
Symptom: Indices switch to read-only mode
# Release read-only (after securing disk space)
curl -X PUT "localhost:9200/_all/_settings" \
-H 'Content-Type: application/json' \
-d '{
"index.blocks.read_only_allow_delete": null
}'
# Manually delete old indices (in 8.x, wildcard deletes require
# action.destructive_requires_name: false)
curl -X DELETE "localhost:9200/logs-2026.01.*"
# Check disk usage
curl -X GET "localhost:9200/_cat/allocation?v"
Monitoring Setup
It is recommended to set up a separate monitoring cluster to monitor the ELK stack itself.
# metricbeat.yml - Elasticsearch monitoring
metricbeat.modules:
- module: elasticsearch
xpack.enabled: true
period: 10s
hosts:
- 'https://es-node-01:9200'
- 'https://es-node-02:9200'
username: 'monitoring_user'
password: 'secure_password'
ssl.certificate_authorities:
- /etc/metricbeat/certs/ca.crt
- module: kibana
xpack.enabled: true
period: 10s
hosts:
- 'https://kibana:5601'
output.elasticsearch:
hosts:
- 'https://monitoring-es:9200'
username: 'metricbeat_writer'
password: 'secure_password'
Key monitoring metrics are as follows.
| Metric | Threshold | Action |
|---|---|---|
| Cluster status | RED | Immediate response required |
| JVM heap usage | Above 85% | Add nodes or adjust heap |
| Indexing rate | Sudden fluctuation | Check source |
| Search latency | Over 5 seconds | Optimize shards/queries |
| Disk usage | Above 85% | Review ILM policy |
| Unassigned shard count | Above 0 | Diagnose allocation cause |
| Fluentd buffer size | Exceeds threshold | Check output bottleneck |
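The threshold table can be wired into a simple alert check. A sketch (the metric field names are illustrative, not a Metricbeat schema):

```python
def check_metrics(metrics: dict) -> list:
    """Evaluate a metrics snapshot against the thresholds in the table above."""
    alerts = []
    if metrics.get("cluster_status") == "red":
        alerts.append("cluster RED: immediate response required")
    if metrics.get("heap_pct", 0) > 85:
        alerts.append("JVM heap above 85%: add nodes or adjust heap")
    if metrics.get("disk_pct", 0) > 85:
        alerts.append("disk above 85%: review ILM policy")
    if metrics.get("unassigned_shards", 0) > 0:
        alerts.append("unassigned shards: run _cluster/allocation/explain")
    return alerts

print(check_metrics({"cluster_status": "green", "heap_pct": 92,
                     "unassigned_shards": 2}))
```

In practice the same thresholds would live in Kibana alerting rules or Watcher rather than a script, but the logic is identical.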
Conclusion
The ELK/EFK stack is a mature log pipeline solution with a rich ecosystem and comprehensive features. However, stable production operation requires holistic consideration of Elasticsearch cluster architecture design, shard strategy, ILM policy, Fluentd buffering strategy, and monitoring infrastructure.
Here is a summary of the key takeaways.
- Node role separation: Separate Master, Data Hot/Warm/Cold, and Coordinating nodes for fault isolation and performance optimization
- ILM utilization: Optimize storage costs with automated Hot-Warm-Cold-Delete phase transitions
- Fluent Bit + Fluentd combination: Separate lightweight collection at the edge from transformation and routing at the center
- Mapping management: Prevent mapping explosion with strict mapping and field count limits
- Monitoring: Monitor the ELK stack itself with a separate monitoring cluster
For smaller environments, start with a single node or small cluster, and gradually introduce Hot-Warm-Cold architecture and ILM as data volume increases.
References
- Elasticsearch Architecture Best Practices - Elastic
- Index Lifecycle Management (ILM) - Elastic Docs
- Fluentd vs Fluent Bit: How to Choose in 2026 - Better Stack
- Production Guidance - Elastic Docs
- 7 Ways to Optimize Your Elastic (ELK) Stack in Production - Better Stack
- The Complete Guide to the ELK Stack - Logz.io
- Best Practices for Building Kibana Dashboards - Elastic