ELK Stack Log Collection and Analysis Pipeline: Elasticsearch, Fluentd, and Kibana Production Deployment and Optimization

ELK Stack Log Pipeline

Introduction

In production environments, logs are critical data for failure detection, debugging, security auditing, and performance analysis. As systems scale, a centralized pipeline for collecting and analyzing logs from tens to hundreds of distributed servers becomes essential. The ELK stack (Elasticsearch + Logstash + Kibana) has established itself as the de facto standard for such log pipelines, and the EFK stack, which replaces Logstash with Fluentd, is widely adopted in Kubernetes environments.

This article covers the architecture of each ELK/EFK component, Elasticsearch cluster design and shard strategy, ILM (Index Lifecycle Management) configuration, a comparison and configuration of Fluentd and Fluent Bit, Kibana dashboard setup, performance tuning, and common production failure scenarios with recovery procedures.

ELK vs EFK Stack Comparison

The key difference between the ELK and EFK stacks lies in the log collector.

| Item | ELK (Logstash) | EFK (Fluentd) |
| --- | --- | --- |
| Implementation language | Java (JRuby) | Ruby + C |
| Memory usage | ~500MB-1GB | ~40-100MB |
| Plugin count | 200+ | 1,000+ |
| Configuration format | Custom DSL | Tag-based routing |
| Kubernetes affinity | Moderate | Very high (CNCF graduated) |
| Buffering | Memory/disk | Memory/file |
| Data parsing | Grok patterns | Regex + parser plugins |
| Best for | Complex transformation logic | Cloud-native, K8s |

Fluentd vs Fluent Bit Comparison

Fluentd and Fluent Bit belong to the same ecosystem but serve different purposes.

| Item | Fluentd | Fluent Bit |
| --- | --- | --- |
| Implementation language | Ruby + C | C |
| Memory usage | ~40MB+ | ~450KB |
| Plugin ecosystem | 1,000+ | Core plugins only |
| Role | Central aggregation/transformation | Edge collection/forwarding |
| Best for | Servers, aggregators | IoT, sidecars, DaemonSets |
| Throughput | Moderate | Very high (10-40x) |

The recommended production architecture is the Fluent Bit (DaemonSet) + Fluentd (aggregator) + Elasticsearch combination. Fluent Bit collects logs in a lightweight manner on each node, while Fluentd handles transformation and routing centrally.
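A minimal sketch of the aggregator side of this topology, assuming the same `fluentd-aggregator` and `elasticsearch-coordinating` service names used elsewhere in this article (buffer options trimmed for brevity; the full Fluentd configuration section covers them in detail):

```
# fluentd-aggregator.conf (illustrative sketch)
<source>
  @type forward            # receives records forwarded by Fluent Bit DaemonSets
  port 24224
  bind 0.0.0.0
</source>

<match **>
  @type elasticsearch      # relay everything to the Elasticsearch coordinating tier
  host elasticsearch-coordinating
  port 9200
  logstash_format true
  <buffer>
    @type file             # file buffer survives aggregator restarts
    path /var/log/fluentd/agg-buffer
    flush_interval 30s
  </buffer>
</match>
```

Keeping the edge collectors dumb and the aggregator configurable means parsing and routing changes only require redeploying the aggregator, not every node.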

Elasticsearch Cluster Architecture

Node Role Separation

Separating node roles is the key to a production Elasticsearch cluster.

# elasticsearch-master.yml
cluster.name: prod-logs
node.name: master-01
node.roles: [ master ]
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03
cluster.initial_master_nodes:
  - master-01
  - master-02
  - master-03

# JVM heap settings (jvm.options)
-Xms4g
-Xmx4g
# elasticsearch-data-hot.yml
cluster.name: prod-logs
node.name: data-hot-01
node.roles: [ data_hot, data_content ]
node.attr.data: hot
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03

# JVM heap settings
-Xms16g
-Xmx16g
# elasticsearch-data-warm.yml
cluster.name: prod-logs
node.name: data-warm-01
node.roles: [ data_warm ]
node.attr.data: warm
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03

# JVM heap settings
-Xms8g
-Xmx8g

Recommended specifications for each node role are as follows.

| Node role | CPU | Memory | Storage | Count |
| --- | --- | --- | --- | --- |
| Master | 4 vCPU | 8GB | 50GB SSD | 3 (odd number) |
| Data Hot | 8-16 vCPU | 32-64GB | NVMe SSD | 3+ |
| Data Warm | 4-8 vCPU | 16-32GB | HDD/SSD | 2+ |
| Data Cold | 2-4 vCPU | 8-16GB | HDD | 1+ |
| Coordinating | 4-8 vCPU | 16GB | 50GB SSD | 2 |
| Ingest | 4-8 vCPU | 16GB | 50GB SSD | 2 |

Shard and Replica Strategy

# Configure shards via an index template
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "codec": "best_compression",
      "routing.allocation.require.data": "hot"
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "message": { "type": "text" },
        "trace_id": { "type": "keyword" },
        "host": { "type": "keyword" }
      }
    }
  }
}'

The key principles of shard design are as follows.

  • Shard size: target 30-50GB per shard. Shards under 10GB carry disproportionate overhead, while shards over 50GB lengthen recovery time
  • Shard count: keep the number of shards per node under 20 per GB of heap
  • Replicas: configure at least 1 replica in production to guarantee availability
  • refresh_interval: raise to 30-60 seconds for log data, where real-time visibility matters less, to improve indexing throughput
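These rules translate into a quick capacity estimate. A back-of-the-envelope sketch, where the ingest volume and retention figures are hypothetical:

```shell
# Hypothetical workload: 200GB/day raw ingest, 90-day retention,
# 1 replica (doubles stored data), 40GB target shard size
daily_gb=200
retention_days=90
replicas=1
target_shard_gb=40

total_gb=$(( daily_gb * retention_days * (1 + replicas) ))
# ceiling division so a partial shard still counts as one
shards=$(( (total_gb + target_shard_gb - 1) / target_shard_gb ))
echo "total data: ${total_gb}GB -> ~${shards} shards cluster-wide"
```

Cross-check the result against the heap rule above: 900 shards spread over 3 hot nodes is 300 shards per node, which at 20 shards per GB of heap requires at least 15GB of heap each.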

Index Lifecycle Management (ILM) Configuration

ILM is the core feature that automates index management from creation through deletion.

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
# Create the ILM policy
curl -X PUT "localhost:9200/_ilm/policy/logs-lifecycle" \
  -H 'Content-Type: application/json' \
  -d @ilm-policy.json

# Attach the ILM policy to the index template
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-lifecycle",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}'

# Create the bootstrap index
curl -X PUT "localhost:9200/logs-000001" \
  -H 'Content-Type: application/json' \
  -d '{
  "aliases": {
    "logs": {
      "is_write_index": true
    }
  }
}'

Key Actions per ILM Phase

| Phase | Trigger | Key actions | Purpose |
| --- | --- | --- | --- |
| Hot | On index creation | rollover, set_priority | Active writes/reads |
| Warm | After 3 days | shrink, forcemerge, allocate | Read-heavy, storage savings |
| Cold | After 30 days | allocate, set_priority | Infrequent reads, minimal cost |
| Delete | After 90 days | delete | Reclaim storage |

Fluentd Configuration and Pipeline Setup

Basic Fluentd Configuration

# /etc/fluentd/fluent.conf
<system>
  log_level info
  workers 4
</system>

# Input: collect application logs
<source>
  @type tail
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  read_from_head true
  <parse>
    @type json
    time_key timestamp
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

# Input: collect Kubernetes container logs
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd/containers.log.pos
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>

# Filter: enrich records with Kubernetes metadata
<filter kubernetes.**>
  @type kubernetes_metadata
  @id filter_kube_metadata
  skip_labels false
  skip_container_metadata false
</filter>

# Filter: drop health-check noise
<filter **>
  @type grep
  <exclude>
    key log
    pattern /healthcheck|readiness|liveness/
  </exclude>
</filter>

# Filter: add static record fields (hostname, environment)
<filter app.logs>
  @type record_transformer
  enable_ruby true
  <record>
    hostname "#{Socket.gethostname}"
    environment "production"
  </record>
</filter>

# Output: ship to Elasticsearch
<match **>
  @type elasticsearch
  host elasticsearch-coordinating
  port 9200
  logstash_format true
  logstash_prefix fluentd-logs
  logstash_dateformat %Y.%m.%d
  include_tag_key true
  tag_key @fluentd_tag

  <buffer tag, time>
    @type file
    path /var/log/fluentd/buffer
    timekey 1h
    timekey_wait 10m
    chunk_limit_size 64MB
    total_limit_size 8GB
    flush_mode interval
    flush_interval 30s
    flush_thread_count 4
    retry_max_interval 30
    retry_forever true
    overflow_action block
  </buffer>
</match>

Fluent Bit DaemonSet Configuration (Kubernetes)

# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        storage.path  /var/log/flb-storage/
        storage.sync  normal
        storage.checksum off
        storage.backlog.mem_limit 5M

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            cri
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        storage.type      filesystem

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [FILTER]
        Name    grep
        Match   *
        Exclude log healthcheck

    [OUTPUT]
        Name            forward
        Match           *
        Host            fluentd-aggregator.logging.svc.cluster.local
        Port            24224
        Retry_Limit     False

  parsers.conf: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.2
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluent-bit/etc/
            - name: storage
              mountPath: /var/log/flb-storage/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config
        - name: storage
          emptyDir: {}

Kibana Dashboard Setup

Index Pattern Configuration

First, create an index pattern in Kibana.

# Create an index pattern via the Kibana API
curl -X POST "localhost:5601/api/saved_objects/index-pattern" \
  -H 'kbn-xsrf: true' \
  -H 'Content-Type: application/json' \
  -d '{
  "attributes": {
    "title": "fluentd-logs-*",
    "timeFieldName": "@timestamp"
  }
}'

Effective Dashboard Design Principles

Design Kibana dashboards around their intended purpose.

| Dashboard type | Visualizations | Audience |
| --- | --- | --- |
| Operations overview | Log volume trends, error-rate graphs, per-service distribution | SRE/DevOps |
| Error analysis | Error-type breakdown, top error messages, stack traces | Developers |
| Security audit | Authentication failures, anomalous access patterns | Security team |
| Infrastructure monitoring | Per-node log volume, indexing rate, latency | Platform team |

Key visualization building blocks include the following.

  • Lens charts: time-series log volume, per-service error rate
  • TSVB (Time Series Visual Builder): detailed time-series analysis
  • Data tables: top-N error messages, per-service statistics
  • Markdown widgets: dashboard descriptions, runbook links
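As a concrete example of the data behind an error-analysis panel, the following search body can be POSTed to `fluentd-logs-*/_search` to count ERROR documents per service over the last hour. It assumes the `level` and `service` keyword fields from the index template shown earlier; adjust the names to your own mapping:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service", "size": 10 }
    }
  }
}
```

Prototyping the aggregation this way before building the Lens or TSVB panel confirms that the fields are mapped as keywords and that the query is cheap enough to run on a dashboard refresh interval.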

Performance Tuning and Optimization

Elasticsearch Indexing Performance Optimization

# Bulk indexing optimization settings
curl -X PUT "localhost:9200/logs-000001/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "index": {
    "refresh_interval": "30s",
    "translog.durability": "async",
    "translog.sync_interval": "30s",
    "translog.flush_threshold_size": "1gb"
  }
}'

Key Performance Tuning Parameters

| Parameter | Default | Recommended | Notes |
| --- | --- | --- | --- |
| refresh_interval | 1s | 30s-60s | Log data rarely needs real-time visibility |
| translog.durability | request | async | Async translog improves write throughput |
| number_of_replicas | 1 | 0 (during initial load) | Disable replicas for bulk loading |
| bulk size | - | 5-15MB | Too large strains memory; too small adds overhead |
| flush_thread_count | 1 | 4-8 | Fluentd output flush threads |
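The "bulk size" row refers to the `_bulk` API payload: newline-delimited JSON where each action line is followed by a document line. A minimal sketch of the format, with illustrative field values matching the template mapping (note the mandatory trailing newline):

```
POST /logs-000001/_bulk        Content-Type: application/x-ndjson

{ "index": {} }
{ "@timestamp": "2026-03-10T12:00:00Z", "level": "INFO", "service": "api", "message": "request handled", "trace_id": "abc123", "host": "web-01" }
{ "index": {} }
{ "@timestamp": "2026-03-10T12:00:01Z", "level": "ERROR", "service": "api", "message": "upstream timeout", "trace_id": "def456", "host": "web-01" }
```

When Fluentd is the writer, the effective bulk payload size is driven by the buffer's chunk settings, so tuning happens on the Fluentd side rather than by hand-crafting requests.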

JVM Heap Memory Configuration

# jvm.options
# At most 50% of total RAM, capped at 31GB (Compressed OOPs limit)
-Xms16g
-Xmx16g

# G1GC settings (default in Elasticsearch 8.x)
-XX:+UseG1GC
-XX:G1HeapRegionSize=16m
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled

# Enable GC logging
-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
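The sizing rule in the comments above can be expressed as a small helper; the 64GB RAM figure here is a hypothetical example:

```shell
# Heap rule of thumb: half of physical RAM, hard-capped at 31GB
# so the JVM keeps using Compressed OOPs
ram_gb=64
heap_gb=$(( ram_gb / 2 ))
if [ "$heap_gb" -gt 31 ]; then
  heap_gb=31
fi
printf -- "-Xms%sg\n-Xmx%sg\n" "$heap_gb" "$heap_gb"
```

On a 64GB machine this caps the heap at 31GB rather than 32GB, leaving the remainder for the OS page cache, which Elasticsearch relies on heavily for segment reads.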

Storage Optimization

# Merge segments with forcemerge (run only on read-only indices)
curl -X POST "localhost:9200/logs-2026.03.01/_forcemerge?max_num_segments=1"

# Check index sizes and compression
curl -X GET "localhost:9200/_cat/indices/logs-*?v&h=index,store.size,pri.store.size,docs.count&s=index"

# Disk watermark settings
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'

Operational Considerations

1. Preventing Mapping Explosion

Indexing JSON logs with arbitrary key names while dynamic mapping is enabled makes the field count balloon and destabilizes the cluster.

# Limit the number of mapped fields
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.mapping.total_fields.limit": 1000
    },
    "mappings": {
      "dynamic": "strict"
    }
  }
}'

2. Avoiding Shard Over-allocation

  • More than 1,000 small shards place severe load on the master nodes
  • Minimize shards per index and rely on ILM rollover for size-based splitting
  • Monitor regularly with the _cat/shards API

3. Managing GC Pressure

  • Keep the heap at or below 31GB to benefit from Compressed OOPs
  • If old-generation GC runs frequently, cap the fielddata cache size
  • Configure circuit breakers to prevent OOM
# Circuit breaker settings
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.request.limit": "40%"
  }
}'

4. Security Configuration

# elasticsearch.yml - security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/certs/elastic-certificates.p12

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: /etc/elasticsearch/certs/http.p12

Failure Cases and Recovery Procedures

Failure Case 1: Cluster Status RED

Symptom: primary shards are unassigned, risking data loss

# List unassigned shards
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"

# Diagnose why allocation failed
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

# Manually allocate a stale primary (last resort; accepts data loss)
curl -X POST "localhost:9200/_cluster/reroute" \
  -H 'Content-Type: application/json' \
  -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logs-2026.03.10",
        "shard": 0,
        "node": "data-hot-02",
        "accept_data_loss": true
      }
    }
  ]
}'

Failure Case 2: Indexing Delays (Bulk Rejections)

Symptom: Fluentd logs show a flood of 429 Too Many Requests errors

# Check write thread pool status
curl -X GET "localhost:9200/_cat/thread_pool/write?v&h=node_name,active,rejected,queue,completed"

# Increase the bulk queue size
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "persistent": {
    "thread_pool.write.queue_size": 1000
  }
}'

Recovery procedure:

  1. Check the Fluentd buffer status (remaining disk buffer)
  2. Add Elasticsearch nodes or increase the bulk queue size
  3. Temporarily raise refresh_interval to 60s
  4. If needed, drop replicas to 0 to reduce indexing load
  5. Restore replicas to their original value once stabilized
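Steps 3 through 5 correspond to index settings calls along these lines; the `logs-*` pattern and the original values of 30s / 1 replica follow the template defined earlier, so substitute your own:

```shell
# Steps 3-4: relax refresh and drop replicas while the incident lasts
curl -X PUT "localhost:9200/logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "refresh_interval": "60s", "number_of_replicas": 0 } }'

# Step 5: restore the original values once the write thread pool stops rejecting
curl -X PUT "localhost:9200/logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "refresh_interval": "30s", "number_of_replicas": 1 } }'
```

Remember that with 0 replicas a single node failure loses data, so restore replicas as soon as the rejection rate normalizes.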

Failure Case 3: Disk Watermark Exceeded

Symptom: indices switch to read-only mode

# Clear the read-only block (after freeing disk space)
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "index.blocks.read_only_allow_delete": null
}'

# Manually delete old indices
curl -X DELETE "localhost:9200/logs-2026.01.*"

# Check disk usage per node
curl -X GET "localhost:9200/_cat/allocation?v"

Monitoring Setup

Running a separate monitoring cluster to watch the ELK stack itself is recommended.

# metricbeat.yml - Elasticsearch monitoring
metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - 'https://es-node-01:9200'
      - 'https://es-node-02:9200'
    username: 'monitoring_user'
    password: 'secure_password'
    ssl.certificate_authorities:
      - /etc/metricbeat/certs/ca.crt

  - module: kibana
    xpack.enabled: true
    period: 10s
    hosts:
      - 'https://kibana:5601'

output.elasticsearch:
  hosts:
    - 'https://monitoring-es:9200'
  username: 'metricbeat_writer'
  password: 'secure_password'

Key monitoring metrics are as follows.

| Metric | Threshold | Action |
| --- | --- | --- |
| Cluster status | RED | Respond immediately |
| JVM heap usage | >= 85% | Add nodes or adjust heap |
| Indexing rate | Sudden swings | Investigate the source |
| Search latency | >= 5s | Optimize shards/queries |
| Disk usage | >= 85% | Review ILM policy |
| Unassigned shards | > 0 | Diagnose allocation |
| Fluentd buffer size | Over threshold | Check for output bottleneck |

Conclusion

The ELK/EFK stack is a mature, feature-rich log pipeline solution. Running it reliably in production, however, requires holistic attention to Elasticsearch cluster architecture, shard strategy, ILM policy, Fluentd buffering strategy, and monitoring.

The key takeaways:

  • Separate node roles: split Master, Data Hot/Warm/Cold, and Coordinating nodes for failure isolation and performance
  • Use ILM: automate Hot-Warm-Cold-Delete transitions to optimize storage cost
  • Combine Fluent Bit + Fluentd: lightweight collection at the edge, transformation and routing at the center
  • Manage mappings: prevent mapping explosion with strict mappings and field-count limits
  • Monitor: watch the ELK stack itself from a separate monitoring cluster

Small environments can start with a single node or a small cluster, then adopt the Hot-Warm-Cold architecture and ILM incrementally as data volume grows.


ELK Stack Log Collection and Analysis Pipeline: Elasticsearch, Fluentd, and Kibana Production Deployment and Optimization

ELK Stack Log Pipeline

Introduction

In production environments, logs are critical data for failure detection, debugging, security auditing, and performance analysis. As system scale grows, a centralized system for collecting and analyzing logs from tens to hundreds of distributed servers becomes essential. The ELK stack (Elasticsearch + Logstash + Kibana) has established itself as the de facto standard for such log pipelines, and the EFK stack, which replaces Logstash with Fluentd, is also widely adopted in Kubernetes environments.

This article covers the architecture of each ELK/EFK stack component, Elasticsearch cluster design and shard strategy, ILM (Index Lifecycle Management) configuration, Fluentd and Fluent Bit comparison and configuration, Kibana dashboard setup, performance tuning, and common failure scenarios with recovery procedures in production environments.

ELK vs EFK Stack Comparison

The key difference between the ELK and EFK stacks lies in the log collector.

ItemELK (Logstash)EFK (Fluentd)
LanguageJava (JRuby)Ruby + C
Memory Usage~500MB-1GB~40-100MB
Plugin Count200+1,000+
Configuration FormatCustom DSLTag-based routing
Kubernetes AffinityModerateVery high (CNCF graduated)
BufferingMemory/DiskMemory/File
Data ParsingGrok patternsRegex + parser plugins
Best ForComplex transformation logicCloud-native, K8s

Fluentd vs Fluent Bit Comparison

Fluentd and Fluent Bit belong to the same ecosystem but serve different purposes.

ItemFluentdFluent Bit
LanguageRuby + CC
Memory Usage~40MB+~450KB
Plugin Ecosystem1,000+Core plugins only
RoleCentral aggregation/transformationEdge collection/forwarding
Best ForServer, aggregatorIoT, sidecar, DaemonSet
Processing PerformanceModerateVery high (10-40x)

The recommended production architecture is the Fluent Bit (DaemonSet) + Fluentd (Aggregator) + Elasticsearch combination. Fluent Bit collects logs in a lightweight manner on each node, while Fluentd handles transformation and routing centrally.

Elasticsearch Cluster Architecture

Node Role Separation

Separating node roles is the key to a production Elasticsearch cluster.

# elasticsearch-master.yml
cluster.name: prod-logs
node.name: master-01
node.roles: [ master ]
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03
cluster.initial_master_nodes:
  - master-01
  - master-02
  - master-03

# JVM heap settings (jvm.options)
-Xms4g
-Xmx4g
# elasticsearch-data-hot.yml
cluster.name: prod-logs
node.name: data-hot-01
node.roles: [ data_hot, data_content ]
node.attr.data: hot
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03

# JVM heap settings
-Xms16g
-Xmx16g
# elasticsearch-data-warm.yml
cluster.name: prod-logs
node.name: data-warm-01
node.roles: [ data_warm ]
node.attr.data: warm
network.host: 0.0.0.0
discovery.seed_hosts:
  - master-01
  - master-02
  - master-03

# JVM heap settings
-Xms8g
-Xmx8g

Recommended specifications for each node role are as follows.

Node RoleCPUMemoryStorageCount
Master4 vCPU8GB50GB SSD3 (odd number)
Data Hot8-16 vCPU32-64GBNVMe SSD3+
Data Warm4-8 vCPU16-32GBHDD/SSD2+
Data Cold2-4 vCPU8-16GBHDD1+
Coordinating4-8 vCPU16GB50GB SSD2
Ingest4-8 vCPU16GB50GB SSD2

Shard and Replica Strategy

# Set shard configuration via index template
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "codec": "best_compression",
      "routing.allocation.require.data": "hot"
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "message": { "type": "text" },
        "trace_id": { "type": "keyword" },
        "host": { "type": "keyword" }
      }
    }
  }
}'

The key principles of shard design are as follows.

  • Shard size: Target 30-50GB per shard. Small shards under 10GB incur overhead, and shards over 50GB increase recovery time
  • Shard count: Keep the number of shards per node under 20 per GB of heap
  • Replicas: Set at least 1 replica in production to ensure availability
  • refresh_interval: Increase to 30-60 seconds for log data where real-time visibility is less critical

Index Lifecycle Management (ILM) Configuration

ILM is the core feature that automates index management from creation to deletion.

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
# Create ILM policy
curl -X PUT "localhost:9200/_ilm/policy/logs-lifecycle" \
  -H 'Content-Type: application/json' \
  -d @ilm-policy.json

# Attach ILM policy to index template
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-lifecycle",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}'

# Create bootstrap index
curl -X PUT "localhost:9200/logs-000001" \
  -H 'Content-Type: application/json' \
  -d '{
  "aliases": {
    "logs": {
      "is_write_index": true
    }
  }
}'

Key Actions per ILM Phase

PhaseTrigger ConditionKey ActionsPurpose
HotOn index creationrollover, set_priorityActive writes/reads
WarmAfter 3 daysshrink, forcemerge, allocateRead-heavy, save storage
ColdAfter 30 daysallocate, freezeInfrequent reads, minimize cost
DeleteAfter 90 daysdeleteReclaim storage

Fluentd Configuration and Pipeline Setup

Basic Fluentd Configuration

<!-- /etc/fluentd/fluent.conf -->
<system>
  log_level info
  workers 4
</system>

<!-- Input: Application log collection -->
<source>
  @type tail
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  read_from_head true
  <parse>
    @type json
    time_key timestamp
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<!-- Input: Kubernetes log collection -->
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd/containers.log.pos
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>

<!-- Filter: Add Kubernetes metadata -->
<filter kubernetes.**>
  @type kubernetes_metadata
  @id filter_kube_metadata
  skip_labels false
  skip_container_metadata false
</filter>

<!-- Filter: Remove unnecessary logs -->
<filter **>
  @type grep
  <exclude>
    key log
    pattern /healthcheck|readiness|liveness/
  </exclude>
</filter>

<!-- Filter: Tag based on log level -->
<filter app.logs>
  @type record_transformer
  enable_ruby true
  <record>
    hostname "#{Socket.gethostname}"
    environment "production"
  </record>
</filter>

<!-- Output: Send to Elasticsearch -->
<match **>
  @type elasticsearch
  host elasticsearch-coordinating
  port 9200
  logstash_format true
  logstash_prefix fluentd-logs
  logstash_dateformat %Y.%m.%d
  include_tag_key true
  tag_key @fluentd_tag

  <buffer tag, time>
    @type file
    path /var/log/fluentd/buffer
    timekey 1h
    timekey_wait 10m
    chunk_limit_size 64MB
    total_limit_size 8GB
    flush_mode interval
    flush_interval 30s
    flush_thread_count 4
    retry_max_interval 30
    retry_forever true
    overflow_action block
  </buffer>
</match>

Fluent Bit DaemonSet Configuration (Kubernetes)

# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        storage.path  /var/log/flb-storage/
        storage.sync  normal
        storage.checksum off
        storage.backlog.mem_limit 5M

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            cri
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        storage.type      filesystem

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [FILTER]
        Name    grep
        Match   *
        Exclude log healthcheck

    [OUTPUT]
        Name            forward
        Match           *
        Host            fluentd-aggregator.logging.svc.cluster.local
        Port            24224
        Retry_Limit     False

  parsers.conf: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.2
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluent-bit/etc/
            - name: storage
              mountPath: /var/log/flb-storage/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config
        - name: storage
          emptyDir: {}

Kibana Dashboard Setup

Index Pattern Configuration

You must first create an index pattern in Kibana.

# Create index pattern via Kibana API
curl -X POST "localhost:5601/api/saved_objects/index-pattern" \
  -H 'kbn-xsrf: true' \
  -H 'Content-Type: application/json' \
  -d '{
  "attributes": {
    "title": "fluentd-logs-*",
    "timeFieldName": "@timestamp"
  }
}'

Effective Dashboard Design Principles

Kibana dashboards should be designed according to their purpose.

Dashboard TypeIncluded VisualizationsTarget Users
Operations OverviewLog volume trends, error rate graphs, service distributionSRE/DevOps
Error AnalysisError type classification, top error messages, stack tracesDevelopers
Security AuditAuthentication failure events, anomalous access patternsSecurity team
Infrastructure MonitoringPer-node log volume, indexing rate, latencyPlatform team

Key visualization components include the following.

  • Lens charts: Time-series log volume, error rate by service
  • TSVB (Time Series Visual Builder): Detailed time-series analysis
  • Data Table: Top N error messages, per-service statistics
  • Markdown widgets: Dashboard descriptions, runbook links

Performance Tuning and Optimization

Elasticsearch Indexing Performance Optimization

# Bulk indexing optimization settings
curl -X PUT "localhost:9200/logs-000001/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "index": {
    "refresh_interval": "30s",
    "translog.durability": "async",
    "translog.sync_interval": "30s",
    "translog.flush_threshold_size": "1gb"
  }
}'

Key Performance Tuning Parameters

ParameterDefaultRecommendedDescription
refresh_interval1s30s-60sIncrease for log data where real-time visibility is less critical
translog.durabilityrequestasyncAsync translog for improved write performance
number_of_replicas10 (during initial load)Disable replicas during bulk loading
bulk size-5-15MBToo large causes memory pressure, too small adds overhead
flush_thread_count14-8Fluentd output thread count

JVM Heap Memory Configuration

# jvm.options settings
# Max 50% of total RAM, never more than 31GB (Compressed OOPs limit)
-Xms16g
-Xmx16g

# G1GC settings (default in Elasticsearch 8.x)
-XX:+UseG1GC
-XX:G1HeapRegionSize=16m
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled

# Enable GC logging
-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m

Storage Optimization

# Force merge segments (on read-only indices)
curl -X POST "localhost:9200/logs-2026.03.01/_forcemerge?max_num_segments=1"

# Check index compression
curl -X GET "localhost:9200/_cat/indices/logs-*?v&h=index,store.size,pri.store.size,docs.count&s=index"

# Disk watermark settings
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'

Operational Notes

1. Preventing Mapping Explosion

When dynamic mapping is enabled and you index JSON logs with freely defined key names, the field count grows rapidly, destabilizing the cluster.

# Set mapping field count limit
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.mapping.total_fields.limit": 1000
    },
    "mappings": {
      "dynamic": "strict"
    }
  }
}'

2. Avoiding Shard Overallocation

  • More than 1,000 small shards can cause severe load on master nodes
  • Minimize the number of shards per index and use ILM rollover for size-based splitting
  • Periodically monitor using the _cat/shards API

3. Managing GC Pressure

  • Keep heap under 31GB to leverage Compressed OOPs
  • If Old GC is frequent, limit the fielddata cache size
  • Configure circuit breakers to prevent OOM
# Circuit breaker settings
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.request.limit": "40%"
  }
}'
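The 31 GB compressed-OOPs ceiling combines with the usual half-of-RAM rule (the other half is left to the OS filesystem cache) into a small sizing helper. A sketch; the 50% rule is general Elasticsearch guidance rather than a value from this article's configs:

```python
COMPRESSED_OOPS_LIMIT_GB = 31  # above ~32 GB the JVM loses compressed object pointers

def recommended_heap_gb(machine_ram_gb: int) -> int:
    """Half of RAM for the heap, capped below the compressed-OOPs threshold."""
    return min(machine_ram_gb // 2, COMPRESSED_OOPS_LIMIT_GB)

for ram in (16, 32, 64, 128):
    print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap")
```

This matches the earlier node configs (16 GB heap on a 32 GB data node) and shows why even large machines should stay at 31 GB.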

4. Security Configuration

# elasticsearch.yml - Security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/certs/elastic-certificates.p12

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: /etc/elasticsearch/certs/http.p12

Failure Cases and Recovery Procedures

Failure Case 1: Cluster Status RED

Symptom: Primary shards are unassigned, risking data loss

# Check unassigned shards
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"

# Diagnose unassignment cause
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

# Force-allocate a stale primary (last resort when no in-sync copy remains — accepts data loss)
curl -X POST "localhost:9200/_cluster/reroute" \
  -H 'Content-Type: application/json' \
  -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logs-2026.03.10",
        "shard": 0,
        "node": "data-hot-02",
        "accept_data_loss": true
      }
    }
  ]
}'

Failure Case 2: Indexing Delay (Bulk Rejection)

Symptom: Large volume of 429 Too Many Requests errors in Fluentd logs

# Check thread pool status
curl -X GET "localhost:9200/_cat/thread_pool/write?v&h=node_name,active,rejected,queue,completed"

# Adjust bulk queue size — thread_pool.write.queue_size is a static node
# setting (not updatable via the cluster settings API), so set it in
# elasticsearch.yml on each node and restart:
thread_pool.write.queue_size: 1000

Recovery procedure:

  1. Check Fluentd buffer status (remaining disk buffer)
  2. Add Elasticsearch nodes or increase bulk queue size
  3. Temporarily increase refresh_interval to 60s
  4. If necessary, reduce replicas to 0 to decrease indexing load
  5. Restore replicas to original value after stabilization
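Fluentd's Elasticsearch output plugin already retries rejected chunks with backoff, but the same idea in any bulk client looks like the sketch below (plain Python with a fake `send_bulk`; not Fluentd's actual implementation):

```python
import time

def send_with_backoff(send_bulk, payload, max_retries=5, base_delay=0.5,
                      sleep=time.sleep):
    """Retry a bulk request on 429 with exponential backoff.
    `send_bulk` returns an HTTP status code; `sleep` is injectable for tests."""
    for attempt in range(max_retries + 1):
        status = send_bulk(payload)
        if status != 429:
            return status
        sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("bulk queue still rejecting after retries")

# Simulated backend that rejects twice, then accepts:
responses = iter([429, 429, 200])
status = send_with_backoff(lambda p: next(responses), b"{}", sleep=lambda s: None)
print(status)  # 200
```

Backing off instead of retrying immediately gives the write thread pool queue time to drain, which is the same effect the recovery steps above aim for.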

Failure Case 3: Disk Watermark Exceeded

Symptom: Indices switch to read-only mode

# Release read-only (after securing disk space)
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "index.blocks.read_only_allow_delete": null
}'

# Manually delete old indices (wildcard deletes require
# action.destructive_requires_name: false)
curl -X DELETE "localhost:9200/logs-2026.01.*"

# Check disk usage
curl -X GET "localhost:9200/_cat/allocation?v"

Monitoring Setup

The ELK stack itself should be watched from a separate, dedicated monitoring cluster; otherwise an outage in the production cluster also takes down the very monitoring needed to diagnose it.

# metricbeat.yml - Elasticsearch monitoring
metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - 'https://es-node-01:9200'
      - 'https://es-node-02:9200'
    username: 'monitoring_user'
    password: 'secure_password'
    ssl.certificate_authorities:
      - /etc/metricbeat/certs/ca.crt

  - module: kibana
    xpack.enabled: true
    period: 10s
    hosts:
      - 'https://kibana:5601'

output.elasticsearch:
  hosts:
    - 'https://monitoring-es:9200'
  username: 'metricbeat_writer'
  password: 'secure_password'

Key monitoring metrics are as follows.

| Metric | Threshold | Action |
|---|---|---|
| Cluster status | RED | Immediate response required |
| JVM heap usage | Above 85% | Add nodes or adjust heap |
| Indexing rate | Sudden fluctuation | Check source |
| Search latency | Over 5 seconds | Optimize shards/queries |
| Disk usage | Above 85% | Review ILM policy |
| Unassigned shard count | Above 0 | Diagnose allocation cause |
| Fluentd buffer size | Exceeds threshold | Check output bottleneck |
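These thresholds can be wired into a simple alert check. A sketch with made-up sample metrics (this is illustrative glue code, not a Metricbeat feature):

```python
def check_alerts(metrics: dict) -> list:
    """Return alert messages for metrics breaching the table's thresholds."""
    alerts = []
    if metrics.get("cluster_status") == "red":
        alerts.append("cluster RED: immediate response required")
    if metrics.get("jvm_heap_pct", 0) > 85:
        alerts.append("JVM heap above 85%: add nodes or adjust heap")
    if metrics.get("search_latency_s", 0) > 5:
        alerts.append("search latency over 5s: optimize shards/queries")
    if metrics.get("disk_pct", 0) > 85:
        alerts.append("disk above 85%: review ILM policy")
    if metrics.get("unassigned_shards", 0) > 0:
        alerts.append("unassigned shards: diagnose allocation cause")
    return alerts

sample = {"cluster_status": "yellow", "jvm_heap_pct": 91,
          "search_latency_s": 1.2, "disk_pct": 88, "unassigned_shards": 2}
for msg in check_alerts(sample):
    print(msg)
```

In practice the same conditions would live in Kibana alerting rules or Watcher rather than a script, but the logic is identical.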

Conclusion

The ELK/EFK stack is a mature log pipeline solution with a rich ecosystem and comprehensive features. However, stable production operation requires holistic consideration of Elasticsearch cluster architecture design, shard strategy, ILM policy, Fluentd buffering strategy, and monitoring infrastructure.

Here is a summary of the key takeaways.

  • Node role separation: Separate Master, Data Hot/Warm/Cold, and Coordinating nodes for fault isolation and performance optimization
  • ILM utilization: Optimize storage costs with automated Hot-Warm-Cold-Delete phase transitions
  • Fluent Bit + Fluentd combination: Separate lightweight collection at the edge from transformation and routing at the center
  • Mapping management: Prevent mapping explosion with strict mapping and field count limits
  • Monitoring: Monitor the ELK stack itself with a separate monitoring cluster

For smaller environments, start with a single node or small cluster, and gradually introduce Hot-Warm-Cold architecture and ILM as data volume increases.

References