Elasticsearch & Lucene 내부 완전 가이드 2025: Segment, Inverted Index, Refresh/Flush/Merge, Shard Routing 심층 분석

들어가며: 로그 속에서 1초 안에 답을 찾는 기술

상상해 보자

당신의 회사는 매일 수십 TB의 로그를 생성한다. 어느 날 밤 11시, 사용자 1명이 특정 상품을 카트에 담았다가 주문 전 오류를 만났다. 고객 지원 담당자는 당신에게 묻는다:

"이 사용자의 오늘 오후 8시 23분경 모든 요청 로그를 보여줘. 응답 시간 2초 넘는 것만."

GET /logs/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "user_id": "u_12345" } },
        { "range": { "response_time_ms": { "gt": 2000 } } },
        { "range": { "@timestamp": { "gte": "2025-04-15T20:20:00Z", "lte": "2025-04-15T20:30:00Z" } } }
      ]
    }
  }
}

Elasticsearch는 수십억 건의 로그에서 이 답을 200ms 안에 돌려준다. 어떻게?

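The answer lies inside Lucene, a Java search library that has been under development since 1999. Elasticsearch is essentially a distributed layer built on top of Lucene, so understanding Lucene goes a long way toward understanding Elasticsearch.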

이 글에서 다룰 것

Lucene의 근본: Inverted index, term dictionary, posting list.
Segment 구조: 불변 파일들의 집합.
Refresh / Flush / Merge: NRT의 비밀.
BM25와 스코어링.
Analyzer와 텍스트 처리.
Elasticsearch의 분산화: Shard, replica, routing.
Aggregation 실행.
실전 튜닝.

왜 지금 배워야 하는가?

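Elasticsearch remains one of the most widely used search engines.
It is the core of the Elastic Stack (alongside Kibana and Logstash), and OpenSearch is a fork of it.
Systems such as Grafana Loki and SigNoz make different indexing trade-offs, which are easier to evaluate once you know what a full inverted index costs.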
Lucene 내부를 모르면 "왜 느린가", "왜 메모리가 많이 드는가", "왜 색인이 커졌나"에 답할 수 없다.

1. Inverted Index: 모든 것의 시작

문제

다음 문서들이 있다:

Doc 1: "Elasticsearch is a distributed search engine"
Doc 2: "Lucene is the library behind Elasticsearch"
Doc 3: "A search engine finds relevant documents"

질문: "search"가 포함된 문서는?

순진한 방법: 모든 문서를 스캔하며 "search" 찾기. O(총 단어 수). 수백만 문서면 불가능.

Inverted Index의 구조

Inverted index는 단어 → 문서 매핑을 미리 만들어 놓는다:

Term Dictionary:
  "a"            → [1, 3]
  "behind"       → [2]
  "distributed"  → [1]
  "documents"    → [3]
  "elasticsearch"→ [1, 2]
  "engine"       → [1, 3]
  "finds"        → [3]
  "is"           → [1, 2]
  "library"      → [2]
  "lucene"       → [2]
  "relevant"     → [3]
  "search"       → [1, 3]
  "the"          → [2]

이제 "search"를 찾으려면 term dictionary에서 즉시 [1, 3]을 얻는다. O(log 단어종류 수) 로.

Posting List

각 term의 문서 목록을 posting list라고 한다. 실제로는 문서 ID뿐 아니라 더 많은 정보를 저장:

"search":
  [
    (docId=1, freq=1, positions=[4]),
    (docId=3, freq=1, positions=[1])
  ]

docId: 문서 번호.
freq: 해당 문서에서의 term 등장 횟수.
positions: 문서 내 위치 (phrase 검색용).

Term Dictionary: FST로 구현

수백만 개의 term을 어떻게 효율적으로 저장할까? Lucene은 FST (Finite State Transducer) 를 사용한다.

FST는 문자열 → 값 매핑의 극도로 압축된 표현이다:

elastic  → 1
elected  → 2
election → 3
electric → 4

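All four terms share the prefix "el", and three of them share "elect". An FST stores shared prefixes (and shared suffixes) only once, so a lookup costs time proportional to the length of the term rather than the size of the dictionary, and millions of terms can fit in tens of megabytes.

Every system built on Lucene, including Elasticsearch, Solr, and OpenSearch, inherits this structure.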

Posting List의 압축

Posting list는 수백만 문서 ID를 담을 수 있다. 압축이 필수:

1. Delta Encoding:

원본:  [1, 5, 8, 12, 15, 17]
Delta: [1, 4, 3,  4,  3,  2]

작은 숫자가 연속 → 압축하기 좋음.

2. Variable Byte Encoding: 작은 숫자는 1바이트, 큰 숫자는 여러 바이트. 대부분 숫자가 작으니 평균 ~1바이트.

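3. FOR (Frame of Reference) + PFOR: values are bit-packed in fixed-size blocks, using just enough bits per value to hold the largest value in the block.

Lucene combines these techniques, so a posting list typically occupies only a small fraction of the space the raw integers would need.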

2. Lucene Segment: 불변 파일의 우아함

Segment란?

Lucene에서 segment는 독립적인 작은 inverted index다. 하나의 완전한 검색 단위.

Index/
├── segments_12                # current segment list (commit point)
├── _0.cfs                     # segment 0 (compound file)
├── _1.cfs                     # segment 1
├── _2.cfs                     # segment 2
└── _3.cfs                     # segment 3

각 segment는:

Term dictionary (FST)
Posting lists
Stored fields (원본 문서)
Doc values (정렬/집계용)
Norms (스코어링용)
Term vectors (하이라이트용)

불변성(Immutability)

Segment의 핵심 특성: 일단 쓰여지면 절대 수정되지 않는다.

이것이 엄청난 장점을 만든다:

락 없음: 읽기만 하므로 동시성 걱정 없음.
캐시 효율: OS page cache에서 안전히 캐싱.
간단한 replication: 파일 복사만.
Lock-free 검색: 수천 쿼리 동시 처리.

문서 추가 = 새 segment

문서를 추가하면 기존 segment를 수정하지 않는다. 대신 새 segment가 만들어진다.

Before: [segment_1][segment_2][segment_3]
Add 10 documents → create segment_4
After:  [segment_1][segment_2][segment_3][segment_4]

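At search time every segment is searched and the per-segment results are merged. Within a shard this has traditionally happened one segment after another; parallelism comes mainly from running many shards and many queries concurrently.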

삭제 = Tombstone

문서 삭제도 실제로 지우지 않는다. tombstone(삭제 표식)을 기록한다:

.liv 파일: [1, 0, 1, 1, 0, 1, ...]  # 0 = 삭제됨

검색 시 이 bitmap을 확인해 삭제된 것을 건너뛴다. 실제 회수는 merge 시에 일어난다.

업데이트 = 삭제 + 추가

업데이트는 "이전 버전을 삭제 표시 + 새 버전을 새 segment에 삽입". 이 때문에:

업데이트가 많으면 tombstone이 쌓임 → 검색 속도 저하.
주기적 merge가 필수.

3. Refresh / Flush / Merge 사이클

Lucene/Elasticsearch의 write path는 세 단계로 이루어진다. 각각 다른 목적.

In-Memory Buffer

새 문서는 먼저 메모리 버퍼에 쌓인다:

Index Buffer (RAM)
[doc1, doc2, doc3, doc4, ...]

아직 검색되지 않는다 (!). 디스크에도 없다.

Refresh: 검색 가능하게 만들기

Refresh는 메모리 버퍼를 in-memory segment로 변환한다:

Index Buffer → [new segment (in memory)]

이 in-memory segment는 OS page cache에만 존재 (아직 fsync 안 됨). 그러나 검색 가능하다.

기본 주기: 1초 (index.refresh_interval = 1s).

이것이 Elasticsearch의 Near Real-Time (NRT) 검색의 비밀이다. 1초 후면 삽입된 문서가 검색 결과에 나타난다.

주의: Refresh는 비싸다. 매번 새 segment 생성 → 작은 segment들이 쌓임 → 검색 느려짐.

Refresh 튜닝

대량 색인이면 refresh를 끄고 작업:

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}

// 색인 완료 후
PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}

색인 속도가 몇 배 빨라질 수 있다.

Translog: 내구성 보장

Refresh는 durability를 보장하지 않는다 (fsync 없음). 그럼 서버가 죽으면 데이터 손실?

해결: Translog (transaction log).

모든 색인 작업은:

메모리 버퍼 + translog에 동시에 기록.
Translog는 fsync로 디스크 저장 (기본 request 단위).

Write flow:
Document → Memory Buffer → Translog (fsync)
                ↓ (1초 후 refresh)
            In-memory segment (검색 가능)
                ↓ (주기적 flush)
            Disk segment (영구)

Flush: 영구 저장

Flush는 in-memory segment를 디스크에 fsync 하고 translog를 비운다:

Before flush:
  Memory: [seg_new (in cache)]
  Translog: [full, 수백 MB]

After flush:
  Disk: [seg_new (fsynced)]
  Translog: [empty]

기본 조건:

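The translog reaches 512MB (index.translog.flush_threshold_size).
Note that index.translog.sync_interval (default 5s) is not a flush trigger: it only controls how often the translog itself is fsynced when index.translog.durability is set to async.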

Flush는 진짜 디스크 I/O라 훨씬 비싸다.

Merge: segment 합치기

시간이 지나면 segment가 많아진다:

검색이 모든 segment를 순회 → 느림.
Tombstone이 쌓여 공간 낭비.

Merge는 여러 segment를 하나의 더 큰 segment로 합친다:

Before: [seg_1, seg_2, seg_3, seg_4]  (각 10MB)
Merge 시작
Concurrent: [seg_1, seg_2, seg_3, seg_4, seg_merged_in_progress]
After:  [seg_merged]                  (40MB, tombstone 제거됨)

Merge 중에도 기존 segment는 검색 가능. 완료 시 atomic swap.

TieredMergePolicy

Lucene의 기본 merge 정책. "tier"라는 개념:

비슷한 크기의 segment들을 묶음.
한 tier가 너무 크면 merge 트리거.
큰 segment (5GB+)는 merge 대상에서 제외.

파라미터:

{
  "index.merge.policy.max_merged_segment": "5gb",
  "index.merge.policy.segments_per_tier": 10
}

비유

Refresh/Flush/Merge를 비유로:

Refresh: 매일 책상 정리 (작은 segment 만들기). 자주, 빠름.
Flush: 주말에 서류를 파일 캐비닛에 (디스크 영구 저장).
Merge: 월말에 서류 재분류 (segment 병합).

파라미터 튜닝

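Typical baseline settings (a max_thread_count of 1 suits spinning disks; on SSDs the default, which is derived from the CPU count, is usually fine):

{
  "index.refresh_interval": "1s",
  "index.translog.flush_threshold_size": "512mb",
  "index.merge.scheduler.max_thread_count": 1
}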

Bulk load (refresh_interval can also be -1; restore the replica count once indexing finishes):

{
  "index.refresh_interval": "60s",
  "index.number_of_replicas": 0,
  "index.translog.durability": "async"
}

4. BM25: 스코어링의 수학

문제

"가장 관련성 높은 10개 문서"를 어떻게 고를까? 단순히 term이 있냐 없냐가 아니라 relevance score가 필요하다.

TF-IDF (고전)

TF-IDF는 두 요소의 곱:

TF (Term Frequency): term이 문서에 얼마나 자주 나타나는가.
IDF (Inverse Document Frequency): 그 term이 전체 문서에서 얼마나 희귀한가.

score(q, d) = Σ_t (TF(t, d) × IDF(t))

직관: 흔한 단어("the")보다 희귀한 단어("quantum")가 매칭되면 더 의미 있다.

TF-IDF의 약점

Linear TF: a term that appears 100 times scores 100 times higher than one that appears once, which overstates its relevance.
문서 길이 무시: 100단어 문서에서 "search" 2번과 1000단어 문서에서 "search" 2번은 같은 의미가 아님.

BM25: 개선판

BM25 (Best Match 25) 는 1990년대 Stephen Robertson이 제안. Lucene 6부터 기본 스코어링.

BM25(q, d) = Σ_t IDF(t) × (f(t,d) × (k1+1)) / (f(t,d) + k1 × (1 - b + b × |d|/avgdl))

복잡해 보이지만:

f(t, d): term frequency.
k1 (기본 1.2): TF saturation — 많이 나타나도 점수 상한.
b (기본 0.75): 문서 길이 정규화 강도.
|d|: 문서 길이.
avgdl: 평균 문서 길이.

BM25의 개선점

Saturation: TF가 많아도 점수가 포화 → 스팸 방지.
문서 길이 정규화: 긴 문서의 TF 이득을 상쇄.
튜닝 가능: k1, b로 조정.

실전 튜닝

대부분은 기본값으로 충분하다. 하지만:

짧은 문서 위주 (예: 트윗): b를 낮춤 (0.3~0.5).
긴 문서 + TF 중요 (예: 긴 기사): k1을 높임 (1.5~2.0).

{
  "settings": {
    "similarity": {
      "my_bm25": {
        "type": "BM25",
        "k1": 1.5,
        "b": 0.5
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "my_bm25"
      }
    }
  }
}

Lucene의 스코어링 구현

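Lucene does not compute BM25 from scratch for every hit. It combines a few values that are cheap to obtain at query time:

Norm: the field length, encoded compactly at index time.
IDF: computed once per query term from shard-level statistics (document count and document frequency).
Query-time boost: a weight applied to a clause or field in the query.

Only the term frequency and the norm have to be read per matching document, so scoring reduces to a handful of arithmetic operations.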

5. Analyzer: 텍스트 처리 파이프라인

왜 Analyzer가 필요한가

"Search Engine"과 "search engine"이 같은 term으로 취급되려면? "running"과 "run"이 같은 의미로 검색되려면?

답: Analyzer가 색인과 쿼리 시점에 텍스트를 처리한다.

Analyzer의 구조

Input Text
    ↓
Character Filter (0 or more)
    ↓
Tokenizer (정확히 1개)
    ↓
Token Filter (0개 이상)
    ↓
Output Tokens

Character Filter

문자 수준 전처리:

HTML strip: HTML 태그 제거.
Mapping: 문자 치환 (& → and).
Pattern replace: 정규식 치환.

Tokenizer

텍스트를 token(보통 단어)으로 분리:

Standard: 단어 경계 기반, Unicode 인식. 대부분 언어에 무난.
Whitespace: 공백 기준.
N-gram: 모든 n-gram 생성 (예: "search" → ["sea", "ear", "arc", ...]).
Keyword: 분리 안 함 (정확 매칭용).
Language-specific: 중국어(IK, Smart Chinese), 일본어(Kuromoji), 한국어(Nori).

Token Filter

Token 후처리:

Lowercase: 소문자화.
Stop: 불용어 제거 ("the", "a", "is").
Stemmer: 어간 추출 ("running" → "run").
Synonym: 동의어 확장.
ASCII folding: "café" → "cafe".
Word delimiter: 복합어 분리.

Standard Analyzer 예시

기본 standard analyzer:

입력: "The Quick Brown Foxes!"
  ↓ Standard Tokenizer
[The, Quick, Brown, Foxes]
  ↓ Lowercase Filter
[the, quick, brown, foxes]
  ↓ Stop Filter (기본 X)
[the, quick, brown, foxes]
출력: [the, quick, brown, foxes]
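
The pipeline above can be imitated in a few lines of Python. This is only a rough sketch: the function names are invented for illustration, and the regular expression is a crude stand-in for the Unicode-aware standard tokenizer.

```python
import re

def standard_tokenizer(text):
    """Crude stand-in for the standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset()):
    """Empty stop list by default, mirroring the disabled stop filter."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, stopwords=frozenset()):
    # tokenizer first, then token filters in order
    return stop_filter(lowercase_filter(standard_tokenizer(text)), stopwords)

print(analyze("The Quick Brown Foxes!"))           # ['the', 'quick', 'brown', 'foxes']
print(analyze("The Quick Brown Foxes!", {"the"}))  # ['quick', 'brown', 'foxes']
```

The same function must run at index time and at query time; otherwise a query for "Quick" would never match the indexed term "quick".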

한국어 처리: Nori

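Korean attaches particles and verb endings directly to stems, so splitting on whitespace is not enough. Nori performs Korean morphological analysis. The nori_part_of_speech filter is what drops particles and endings; nori_readingform only converts Hanja to Hangul:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_nori": {
          "type": "custom",
          "tokenizer": "nori_tokenizer",
          "filter": ["nori_part_of_speech", "nori_readingform", "lowercase"]
        }
      }
    }
  }
}

Input: "고양이를 좋아합니다" → split into morphemes (noun 고양이, particle 를, verb stem, ending) → particles and endings removed → roughly [고양이, 좋아하]. The exact tokens depend on the dictionary and the configured stop tags.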

Edge N-gram: 자동완성

자동완성(autocomplete)을 위해 edge n-gram이 자주 쓰인다:

입력: "apple"
→ [a, ap, app, appl, apple]

사용자가 "app"을 입력하면 이미 색인된 "app"과 매치. "apple"을 즉시 찾음.

주의: 저장 공간이 매우 늘어난다. 단어 수 만큼 저장하는 게 아니라 n-gram 수만큼.
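
A small sketch makes both the matching behavior and the storage cost concrete. The parameter names min_gram and max_gram are borrowed from the edge_ngram tokenizer; the default values and the tiny prefix index are assumptions made for this example.

```python
from collections import defaultdict

def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return every prefix of token whose length is in [min_gram, max_gram]."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

words = ["apple", "application", "apply"]

# Index time: every prefix becomes its own term pointing back at the word.
prefix_index = defaultdict(set)
for word in words:
    for gram in edge_ngrams(word):
        prefix_index[gram].add(word)

# Query time: the user's partial input is looked up as an exact term.
print(sorted(prefix_index["app"]))   # ['apple', 'application', 'apply']

# Storage cost: 3 words turn into 20 indexed grams (5 + 10 + 5).
print(sum(len(edge_ngrams(w)) for w in words))
```

Raising min_gram or lowering max_gram is the usual lever for trading recall on very short inputs against index size.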

6. Elasticsearch의 분산화

Lucene은 단일 머신의 검색 엔진이다. Elasticsearch는 이를 분산 클러스터로 확장한다.

Index, Shard, Replica

Index: 문서 집합 (DB의 테이블 개념).
Shard: Index를 여러 조각으로 분할한 것. 각 shard는 하나의 Lucene index.
Primary shard: 원본.
Replica shard: 복제본.

Index "logs" (5 primary, 1 replica)
├── shard 0 (primary, node A) ← 원본
│   └── replica 0 (node B)   ← 복제
├── shard 1 (primary, node B)
│   └── replica 1 (node C)
├── shard 2 (primary, node C)
│   └── replica 2 (node A)
├── shard 3 (primary, node A)
│   └── replica 3 (node B)
└── shard 4 (primary, node B)
    └── replica 4 (node C)

Shard Routing

문서를 어떤 shard에 넣을지 결정:

shard = hash(_routing) % num_primary_shards

기본적으로 _routing = document_id. 같은 ID의 문서는 항상 같은 shard로.

Primary Shard 수의 제약

문제: Primary shard 수는 인덱스 생성 후 변경 불가. 왜?

초기: 5 shards
hash(user_1) % 5 = 3  → shard 3에 저장

만약 shard 수를 10으로 늘리면:

hash(user_1) % 10 = 1  → shard 1에서 찾음 → 없음!

기존 데이터가 엉뚱한 shard에서 검색된다. 이를 해결하려면 reindex가 필요하다.

해결책: Split API (일부 경우) 또는 애초에 충분히 많은 shard로 시작.
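
The routing rule and its consequence can be simulated directly. In this sketch zlib.crc32 is only a stand-in hash chosen because it ships with Python (Elasticsearch itself uses Murmur3), and shard_for is a hypothetical helper, not an Elasticsearch API.

```python
import zlib

def shard_for(routing, num_primary_shards):
    """Simplified routing: hash the routing value, take it modulo the shard count."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

ids = [f"user_{i}" for i in range(1000)]

# Naively changing 5 shards to 10 sends lookups for many documents to the wrong shard.
moved = sum(shard_for(i, 5) != shard_for(i, 10) for i in ids)
print(f"{moved} of {len(ids)} documents would be looked up on a different shard")

# When the shard count is multiplied by a whole factor, the old shard number is
# still recoverable from the new one, so each old shard can be split locally.
recoverable = all(shard_for(i, 10) % 5 == shard_for(i, 5) for i in ids)
print(recoverable)  # True
```

That second property is the reason splitting is restricted to multiples of the original shard count, while arbitrary changes require a reindex.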

Replica Shard

Replica는 언제든 추가/제거 가능하다:

PUT /my_index/_settings
{
  "index.number_of_replicas": 2
}

Replica는:

High availability: primary 장애 시 replica가 승격.
Read scalability: 쿼리를 여러 replica로 분산.

Cluster State와 Master

클러스터에는 master node가 있다:

Cluster state 관리 (shard 할당, mapping 등).
Master election: Zen Discovery in older versions; a newer quorum-based cluster coordination layer (similar in spirit to Raft) in Elasticsearch 7+.
2개 이상의 master 동시 존재 방지 (split brain).

주의: master node를 홀수로 (3, 5, 7) 유지. Quorum 형성 위해.

쿼리 실행: Scatter-Gather

검색 쿼리가 도착하면:

1. Coordinator node가 요청 받음.
2. 필요한 모든 shard (primary or replica)에 쿼리 브로드캐스트.
3. 각 shard가 로컬에서 top-K 계산.
4. Coordinator가 모든 결과를 모아 글로벌 top-K 선택.
5. 선택된 문서들의 전체 데이터 fetch.
6. 클라이언트에 반환.

이를 scatter-gather 또는 two-phase query라 한다.
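
The query phase of the steps above can be sketched as follows. Shards are modeled as plain lists of (score, doc_id) pairs and the function names are made up for this example; a real fetch phase would then load the stored documents for the winning IDs.

```python
import heapq

def shard_top_k(hits, k):
    """Query phase on one shard: return only the local top k (score, doc_id) pairs."""
    return heapq.nlargest(k, hits)

def coordinate(shards, k):
    """Gather every shard's local top k, then select the global top k."""
    candidates = [hit for hits in shards for hit in shard_top_k(hits, k)]
    return heapq.nlargest(k, candidates)

shards = [
    [(0.9, "a1"), (0.2, "a2"), (0.5, "a3")],
    [(0.8, "b1"), (0.7, "b2")],
    [(0.1, "c1")],
]
top = coordinate(shards, k=2)
print(top)  # [(0.9, 'a1'), (0.8, 'b1')]
```

Only scores and IDs travel in the first phase, which keeps the gather step cheap even when documents are large.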

쿼리 실행의 함정: Deep Pagination

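GET /logs/_search?from=9990&size=10

To return results 9,991 through 10,000:

Each shard computes its own top 10,000 (from + size).
The coordinator collects number_of_shards × 10,000 entries and sorts them.
It returns the last 10 and discards the rest.

→ The cost grows with from on every shard. This is why index.max_result_window caps from + size at 10,000 by default: a request with from=1,000,000 is rejected unless that limit is raised, and raising it means every shard has to sort a million entries.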

해결: search_after API를 사용. 전체 정렬 대신 이전 결과 기준으로 이어받기.
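
A minimal sketch of the difference, assuming hits are sorted by a unique (timestamp, id) key. The helper names are hypothetical and the sorted Python list merely stands in for a shard's sorted results.

```python
import bisect

def coordinator_entries(num_shards, from_, size):
    """Entries the coordinator must merge when paging with from/size."""
    return num_shards * (from_ + size)

def next_page(sorted_hits, size, search_after=None):
    """Cursor-style paging: resume right after the last sort key already seen."""
    start = 0 if search_after is None else bisect.bisect_right(sorted_hits, search_after)
    return sorted_hits[start:start + size]

print(coordinator_entries(5, 0, 10))     # 50
print(coordinator_entries(5, 9990, 10))  # 50000

hits = [(t, f"id{t}") for t in range(25)]
page1 = next_page(hits, 10)
page2 = next_page(hits, 10, search_after=page1[-1])
print(page2[0])  # (10, 'id10')
```

With a cursor, each page costs the same no matter how deep it is, because nothing before the cursor has to be collected again.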

Aggregation 실행

집계 쿼리(예: GROUP BY)도 분산 실행:

GET /logs/_search
{
  "size": 0,
  "aggs": {
    "by_user": {
      "terms": { "field": "user_id", "size": 10 }
    }
  }
}

각 shard가 로컬 top 10 계산 후 coordinator가 병합. 문제: 로컬 top 10이 글로벌 top 10이 아닐 수 있다. Coordinator는 더 많이(예: top 100)을 요청해서 정확도를 높인다.
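
A tiny example shows how a local top-N can miss the global winner and how asking each shard for more candidates fixes it. The counts are invented, and terms_agg is a hypothetical helper that only mimics the merge step.

```python
from collections import Counter

def terms_agg(shards, size, shard_size):
    """Merge each shard's local top shard_size terms, then keep the global top size."""
    merged = Counter()
    for counts in shards:
        merged.update(dict(counts.most_common(shard_size)))
    return merged.most_common(size)

shards = [
    Counter({"x": 10, "y": 9, "z": 8}),
    Counter({"z": 8, "w": 7, "x": 1}),
]

print(terms_agg(shards, size=2, shard_size=2))  # [('x', 10), ('y', 9)]  misses z
print(terms_agg(shards, size=2, shard_size=3))  # [('z', 16), ('x', 11)] correct
```

Term z is the true global leader with 16 occurrences, but it is third on the first shard, so a shard-local cutoff of 2 never reports its 8 hits there.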

shard_size 파라미터로 조정:

"terms": { "field": "user_id", "size": 10, "shard_size": 100 }

7. Doc Values: 정렬과 집계

왜 필요한가

Inverted index는 term → docs 방향이다. "이 문서의 field 값은?"을 알려면 역방향 필요.

예: ORDER BY timestamp 또는 GROUP BY country. 각 문서의 timestamp, country 값을 알아야 한다.

Doc Values

Doc values는 컬럼 저장이다:

timestamp column:
  doc 0 → 2025-04-15 10:00:00
  doc 1 → 2025-04-15 10:00:01
  doc 2 → 2025-04-15 10:00:02
  ...

각 문서의 필드 값을 연속된 배열로 저장. 정렬, 집계, 스크립팅에 최적.

메모리 사용

Doc values는 기본적으로 디스크 기반이다:

mmap으로 매핑 → OS page cache 활용.
자주 쓰이면 RAM에 캐시, 드물면 디스크.

Elasticsearch의 field data는 구버전의 인메모리 버전이었다. Doc values가 훨씬 효율적이라 이제 기본.

text 필드의 예외

text 필드는 doc values를 기본으로 저장하지 않는다:

분석된 token들만 저장 → 원본 재구성 어려움.
집계하려면 keyword 서브필드 사용.

{
  "properties": {
    "name": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }  // 정확한 원본 저장
      }
    }
  }
}

검색: name. 집계/정렬: name.keyword.

Sparse Doc Values

누락된 필드가 많은 경우 (예: optional 필드)에 대비해 Lucene은 sparse 표현 지원. 공간 효율적.

8. 매핑과 동적 매핑

Mapping이란

Mapping은 각 field의 타입과 속성 정의:

{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "user_id": { "type": "keyword" },
      "timestamp": { "type": "date" },
      "price": { "type": "double" },
      "location": { "type": "geo_point" }
    }
  }
}

text vs keyword

가장 중요한 구분:

text: 분석됨 (tokenize). 전문 검색용.
keyword: 분석 안 됨. 정확 매칭, 집계, 정렬용.

이메일, IP, 태그 등은 거의 항상 keyword. 본문, 설명, 제목은 text (옵션으로 keyword도 추가).

Dynamic Mapping

문서를 처음 색인할 때 field 타입 자동 추론:

POST /my_index/_doc
{
  "name": "Alice",      // text (+ keyword subfield)
  "age": 30,            // long
  "active": true        // boolean
}

장점: 빠른 시작. 단점: 타입이 예상과 다를 수 있음. 한 번 결정된 타입은 변경 불가.

권장: 프로덕션에선 명시적 mapping. Dynamic mapping은 개발/테스트용.

Mapping Explosion

각 field는 메모리 오버헤드가 있다. 예를 들어:

// 로그 문서
{
  "user_data": {
    "user_1": { "action": "login" },
    "user_2": { "action": "logout" },
    ...
  }
}

매 user_id마다 새 field가 동적으로 생성된다. 수백만 field로 mapping이 폭발. Elasticsearch가 몇 분 만에 다운될 수 있다.

해결:

구조 변경: { "user_id": "user_1", "action": "login" }.
또는 flattened 필드 타입 사용.
index.mapping.total_fields.limit으로 제한 (기본 1000).
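
The effect of the structural fix can be measured with a short sketch. Both helpers are hypothetical, and count_leaf_fields only approximates what a dynamic mapping would create, since it counts leaf paths and ignores the intermediate object fields.

```python
def flatten_user_data(doc):
    """Turn one field per user into one small document per user."""
    return [{"user_id": user_id, **fields}
            for user_id, fields in doc["user_data"].items()]

def count_leaf_fields(docs):
    """Number of distinct dotted leaf paths across the given documents."""
    def paths(value, prefix=""):
        if isinstance(value, dict):
            for key, child in value.items():
                yield from paths(child, f"{prefix}{key}.")
        else:
            yield prefix.rstrip(".")
    return len({p for doc in docs for p in paths(doc)})

log_doc = {"user_data": {f"user_{i}": {"action": "login"} for i in range(1000)}}

print(count_leaf_fields([log_doc]))                     # 1000
print(count_leaf_fields(flatten_user_data(log_doc)))    # 2
```

The restructured documents need only user_id and action in the mapping, however many users appear.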

9. 실전 튜닝

대량 색인

목표: 최대한 빠르게 데이터 로드.

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "-1",      // refresh 끄기
    "number_of_replicas": 0,        // replica 0
    "translog": {
      "durability": "async",
      "sync_interval": "30s"
    }
  }
}

Bulk API로 배치 색인:

POST /_bulk
{ "index": { "_index": "my_index" } }
{ "field1": "value1" }
{ "index": { "_index": "my_index" } }
{ "field2": "value2" }
...

한 bulk에 5~15MB가 스위트스팟. 너무 작으면 오버헤드, 너무 크면 메모리 부담.
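
One way to stay inside that size range is to build request bodies on the client and cut a new one whenever the byte budget would be exceeded. This is a minimal sketch: bulk_bodies and its 10MB default are assumptions for illustration, and actually sending each body to POST /_bulk is left out.

```python
import json

def bulk_bodies(index_name, docs, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bodies for the _bulk API, each at most roughly max_bytes."""
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index_name}})
        source = json.dumps(doc)
        pair_size = len(action.encode()) + len(source.encode()) + 2  # two newlines
        if lines and size + pair_size > max_bytes:
            yield "\n".join(lines) + "\n"   # the bulk body must end with a newline
            lines, size = [], 0
        lines += [action, source]
        size += pair_size
    if lines:
        yield "\n".join(lines) + "\n"

docs = [{"n": i} for i in range(5)]
bodies = list(bulk_bodies("my_index", docs, max_bytes=100))
print(len(bodies), [b.count("\n") for b in bodies])
```

A single document larger than max_bytes still goes out in a body of its own, which is why the limit is described as approximate.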

색인 완료 후:

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}

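POST /my_index/_forcemerge?max_num_segments=1

_forcemerge can improve search performance, but run it only on indices that will no longer receive writes: it can produce segments larger than the normal merge limit, and later automatic merges will not pick those up until most of their documents are deleted.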

JVM Heap

중요한 규칙: Heap은 32GB 이하로.

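Reason: above a threshold that sits slightly below 32GB (the exact value depends on the JVM), compressed oops (compressed object pointers) are disabled, object references double in size, and memory efficiency drops sharply.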

# jvm.options
-Xms16g
-Xmx16g

나머지 메모리는 OS page cache용. Lucene은 page cache를 적극 활용한다. 실제로 Elasticsearch가 가장 잘 돌아가는 상황:

서버 메모리 64GB
JVM heap 16~31GB
OS cache 33~48GB

Shard 개수

나쁜 예: 일별 인덱스 × 1000개 × 5 primary shards = 5000개 shard. Master overhead 폭발.

규칙:

Shard당 20~40GB 정도가 적절.
JVM heap 1GB당 shard 20개 이하.
작은 인덱스는 하나의 shard로 충분.

Hot-Warm Architecture

시계열 데이터에 유용한 패턴:

Hot nodes: 최근 데이터, 빠른 SSD, 활발한 색인/검색.
Warm nodes: 오래된 데이터, HDD, 읽기 전용.
Cold nodes: 아주 오래된 데이터, 가끔 조회.

Index Lifecycle Management (ILM) 로 자동 전환:

{
  "policy": {
    "phases": {
      "hot":  { "actions": {} },
      "warm": { "min_age": "7d", "actions": { "allocate": { "require": { "data": "warm" } } } },
      "cold": { "min_age": "30d", "actions": { "allocate": { "require": { "data": "cold" } } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

데이터 스트림

Elasticsearch 7.9+의 data stream은 시계열 데이터를 위한 고수준 추상화:

자동으로 backing indices 생성 (.ds-logs-2025.04.15-000001 등).
자동 rollover (크기/나이 기준).
ILM과 통합.

PUT /_data_stream/logs-app

This request requires a matching index template with data streams enabled. Kibana logs, APM, and monitoring all use data streams.

10. 흔한 함정과 디버깅

함정 1: Too Many Shards

증상: Cluster state 업데이트 느림, master node CPU 높음, 쿼리 느림.

원인: 수천 개의 작은 shard.

해결:

오래된 인덱스를 merge/shrink.
ILM으로 자동 관리.
Shrink API로 shard 수 감소.

함정 2: Mapping 폭발

증상: 메모리 부족, 색인 실패, cluster 불안정.

원인: Dynamic field가 무제한 생성.

해결:

명시적 mapping.
"dynamic": "strict"로 새 field 거부.
애플리케이션 레벨에서 구조 교정.

함정 3: Deep Pagination

증상: 큰 from 값 쿼리에서 메모리/시간 폭발.

해결:

search_after 사용.
또는 scroll API (대량 추출용).
UI에서 deep pagination 막기.

함정 4: Fielddata on Text

증상: text 필드에 집계 시도 → 에러 또는 엄청난 메모리.

해결:

text.keyword 사용.
또는 fielddata: true (위험, 권장 안 함).

함정 5: Refresh 남용

증상: 색인 후 즉시 검색 가능하기 위해 ?refresh=true 매 요청마다.

원인: 매 색인마다 새 segment 생성 → 과도한 merge 부담.

해결:

필요한 경우만 refresh.
?refresh=wait_for (다음 refresh까지 대기)로 대체.

함정 6: 복잡한 nested 쿼리

증상: Nested 필드의 쿼리가 느림.

원인: Nested는 "숨겨진 하위 문서"로 저장되어 별도 검색.

해결:

Denormalize (flatten). 중복 저장 감수.
또는 관계가 많으면 join 타입 (성능 떨어짐).

11. Lucene과 경쟁 기술

OpenSearch

2021년 Elasticsearch가 라이선스를 SSPL로 변경하자 AWS가 OpenSearch를 fork. Lucene 기반은 동일. 이제 기능이 조금씩 갈라지는 중.

Apache Solr

Elasticsearch보다 먼저 등장한 Solr도 Lucene 기반. 원래 기업용으로 시작. 현재는 Elasticsearch가 시장 점유율에서 크게 앞선다.

Meilisearch

Rust로 작성된 경량 검색 엔진. Lucene을 쓰지 않고 자체 구현. 작은 규모에서 매우 빠름. Typo 내성(fuzzy matching)이 기본.

Typesense

A similar lightweight engine, written in C++ rather than Rust. Positioned as an Algolia alternative.

ClickHouse의 역할

"검색"이 전통적 검색이 아닌 분석 쿼리라면 ClickHouse가 훨씬 빠르다. 로그 분석을 Elasticsearch에서 ClickHouse로 전환하는 사례 증가.

차이:

Elasticsearch: 자유 텍스트 검색, 다양한 쿼리, 복잡한 점수.
ClickHouse: 집계, SQL, 대규모 분석.

Vector Search (앞 글 참고)

Elasticsearch 8+도 벡터 검색을 지원 (HNSW). 앞서 ANN 알고리즘 글 참조.

퀴즈로 복습하기

Q1. Lucene segment가 불변(immutable)인 이유와 그 이점은?

A. 문서가 추가/삭제/수정되어도 기존 segment는 절대 수정되지 않는다. 대신 새 segment가 생성되고, 삭제는 tombstone으로 표시된다.

이점:

Lock-free 검색: 읽기 전용이므로 수천 개의 동시 검색 쿼리가 락 없이 실행 가능.
캐시 안정성: OS page cache가 불변 파일을 안전하게 캐싱. 변경 없으므로 cache invalidation 걱정 없음.
복제 단순: 파일을 그대로 복사하면 됨. 상태 동기화 불필요.
Merge 안전성: 백그라운드 merge가 현재 검색을 방해하지 않음. Merge 완료 후 atomic swap.
Write 경로 최적화: 쓰기는 항상 append. 순차 I/O가 빠름.

대가:

삭제는 tombstone만 남기고 실제 회수는 merge 때까지 지연.
업데이트는 "삭제 + 추가" → tombstone 누적.
Merge 비용 (I/O, CPU).

이는 LSM-Tree의 설계 철학과 같다. "쓰기는 append, 정리는 나중에 배치" 는 현대 스토리지 시스템의 근본 패턴이다.

Q2. Refresh, Flush, Merge의 차이와 각각의 역할은?

Refresh (~1초 주기):

메모리 버퍼를 in-memory segment로 변환.
문서가 검색 가능해진다 (Near Real-Time의 원천).
아직 fsync 안 됨 → 내구성 없음.
비용: 중간. 매번 새 small segment 생성.

Flush (수분 주기 또는 512MB):

In-memory segment를 디스크에 fsync.
Translog를 비움.
데이터가 영구 저장됨.
비용: 크다 (진짜 디스크 I/O).

Merge (백그라운드):

여러 small segment를 하나의 큰 segment로 병합.
Tombstone된 문서를 실제로 제거.
검색 성능 유지 (segment 수가 적어야 빠름).
비용: 매우 크다 (디스크 read + write).

비유:

Refresh = 노트에 받아쓰기 (생각만 메모).
Flush = 노트를 파일함에 보관 (영구 저장).
Merge = 분기별 파일 정리 (중복 제거, 효율화).

왜 분리되어 있는가: 각자 다른 속도로 진행되어야 효율적이기 때문이다. 검색을 위해선 refresh가 자주 필요하지만, fsync는 비싸서 자주 하면 안 된다. Merge는 백그라운드에서 천천히 해도 된다. 이 세 단계의 분리가 Elasticsearch의 성능과 내구성을 동시에 달성하게 한다.

Q3. text 필드와 keyword 필드의 차이는 무엇이고, 왜 둘 다 필요한가?

text 필드:

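An analyzer is applied.
Tokenization and lowercasing, plus optional stemming and stop-word removal.
"The Quick Brown Fox" → ["the", "quick", "brown", "fox"] with the standard analyzer (stop words are removed only if a stop filter is configured).
Best suited to full-text search.
Cannot be used for sorting or aggregation by default (the original value is lost after analysis).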

keyword 필드:

분석 안 됨. 문자열 그대로.
"The Quick Brown Fox" → ["The Quick Brown Fox"] (하나의 값).
정확한 매칭, 정렬, 집계에 사용.
카디널리티 높은 값에 유용.

왜 둘 다 필요한가: 같은 필드라도 서로 다른 쿼리 패턴이 있다. 예를 들어 상품 이름:

"iPhone"으로 전문 검색 → text.
"iPhone 15 Pro"라는 정확한 이름으로 집계 → keyword.

이를 위해 multi-field mapping:

{
  "product_name": {
    "type": "text",
    "fields": {
      "keyword": { "type": "keyword" }
    }
  }
}

이제 검색은 product_name, 집계는 product_name.keyword로. 같은 데이터, 두 가지 색인 방식.

실전 원칙:

IP, 이메일, 태그, ID, URL → keyword.
본문, 설명, 제목, 댓글 → text.
상품명, 사용자명 등 둘 다 필요 → text + keyword 서브필드.

이 구분을 이해하지 못하면 "집계가 안 돼요", "fielddata 에러가 나요" 같은 혼란이 자주 생긴다. Mapping 설계의 가장 기본 원칙이다.

Q4. BM25가 TF-IDF를 어떻게 개선했는가?

A. TF-IDF의 두 가지 약점을 해결한다:

약점 1: TF의 선형성 TF-IDF는 term frequency를 선형으로 사용: score ∝ TF.

"search" 1번 vs 100번이면 100배 차이.
현실: 1번 → 10번은 큰 차이지만, 90번 → 100번은 거의 의미 없음.

BM25의 해결: Saturation curve.

f(tf) = tf × (k1+1) / (tf + k1)

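The increment shrinks as TF grows, and f approaches k1 + 1. With k1 = 1.2: f(1) = 1.0, f(10) ≈ 1.96, f(100) ≈ 2.17, with an upper bound of 2.2. This also blunts keyword stuffing.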

약점 2: 문서 길이 무시 TF-IDF는 "100단어 문서에서 'search' 2번"과 "10000단어 문서에서 'search' 2번"을 같게 취급. 하지만 짧은 문서에서의 매치가 더 "관련성 높다".

BM25의 해결: Length normalization.

length_norm = 1 - b + b × (|d| / avgdl)

|d|: 이 문서의 길이.
avgdl: 평균 문서 길이.
b: 정규화 강도 (0~1, 기본 0.75).

긴 문서의 TF를 할인. b=1이면 완전 정규화, b=0이면 무시.

결합된 BM25 공식:

BM25 = IDF × (tf × (k1+1)) / (tf + k1 × (1 - b + b × |d|/avgdl))

복잡해 보이지만 의미는 명확하다: "TF는 포화되고, 문서 길이로 할인된, IDF 가중 점수".

실전에서는 기본값 (k1=1.2, b=0.75)로 대부분 만족. 트윗처럼 짧은 문서는 b를 낮추고, 긴 기사는 k1을 높이는 식으로 튜닝. Lucene 6부터 BM25가 기본이 된 것은 우연이 아니다. 20년 넘게 검증된 식이다.

Q5. Elasticsearch에서 primary shard 수를 "인덱스 생성 후 변경 불가"하게 만든 이유는?

A. Shard routing 때문이다. 문서가 어느 shard에 저장될지는 다음 공식으로 결정된다:

shard_id = hash(routing_key) % num_primary_shards

기본 routing_key = document_id. 그러면:

색인 시: hash(doc_1) % 5 = 3 → shard 3에 저장.
검색 시: hash(doc_1) % 5 = 3 → shard 3 조회.

문제가 생기는 경우: shard 수를 5→10으로 변경하면:

새 계산: hash(doc_1) % 10 = 1 → shard 1에서 조회.
그러나 실제 저장은 shard 3 → 데이터를 못 찾음.

기존 모든 문서가 잘못된 shard에서 검색된다. 이는 데이터 손실과 동일하다.

이론적 해결책들:

Consistent hashing: 일부 변화만으로 대부분 문서 위치 유지. 하지만 여전히 일부 이동 필요.
Reindex: 새 인덱스를 만들고 모든 문서를 복사. 시간/자원 소비.
Dual-write 기간: 새 인덱스에도 병렬 색인. 점진적 전환.

Elasticsearch의 실용적 해결책:

Split API: N → N*k (2배, 3배 등)로만 분할. 각 shard 내부적으로 분리.
Shrink API: N → N/k로 축소.
Reindex API: 일반적 경우. 새 인덱스 생성 후 복사.
Data streams + ILM: 시계열 데이터는 자동으로 새 인덱스 생성하므로 문제 없음.

설계 권장:

처음부터 충분한 shard. 나중에 늘리는 것보다 시작이 쉽다.
Rule of thumb: 20–40GB of data per shard, sized for 1.5–2 times the expected growth.
시계열 데이터: data stream + 일/주 단위 인덱스.

이 제약은 "왜 Elasticsearch 운영이 까다로운가"의 주된 이유 중 하나다. 초기 설계가 잘못되면 대규모 reindex 작업으로 수정해야 한다. 그래서 프로덕션 전에 shard 수를 신중히 결정해야 한다.

마치며: 20년의 엔지니어링

핵심 정리

Inverted index + FST: 검색의 근본.
Segment는 불변: Lock-free 검색의 비밀.
Refresh/Flush/Merge: 3단계로 성능과 내구성 균형.
BM25: 현대 스코어링의 표준.
Analyzer: 텍스트 처리 파이프라인.
Shard + Replica: 분산화의 기본.
Doc Values: 정렬/집계를 위한 컬럼 저장.
Hot-Warm + ILM: 시계열 데이터 관리.

Elasticsearch 운영의 교훈

기본값을 신뢰하되 이해하라. 대부분 기본값은 좋지만 상황에 맞지 않을 수 있다.
Heap 크기 32GB 이하. 이것이 가장 중요한 단일 규칙.
Shard 설계는 초기에 잘. 나중에 바꾸기 어렵다.
Mapping을 명시적으로. Dynamic mapping은 재앙의 시작.
Refresh는 필요한 만큼만. 대량 색인 시 끄기.
측정하라. _cat/segments, _cat/shards, _cluster/health.

Lucene이라는 보물

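Lucene is a Java library that has been under development since 1999. For more than two decades many engineers have hand-optimized its code: compression algorithms, data structures, and file formats are all finely tuned.

Elasticsearch, Solr, OpenSearch, and even some SaaS search services are built on Lucene. After Google Search, it may well be the search engine you use most often.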

마지막 교훈

Search looks easy, but it gets hard once you dig in. To answer "why is this query slow", "why are we running out of memory", or "why is the cluster unstable", you need to know the internals.

이 글을 읽은 당신은 이제:

Lucene의 segment와 file 구조를 안다.
Refresh vs Flush의 차이를 안다.
BM25가 왜 좋은지 안다.
Shard 수의 중요성을 안다.
Hot-Warm 패턴의 이유를 안다.

다음에 Elasticsearch 클러스터를 다룰 때, 이 지식이 당신의 결정을 더 나은 방향으로 이끌 것이다. 그리고 문제가 생겼을 때, "왜?"에 답할 수 있을 것이다. 그것이 진짜 엔지니어의 힘이다.

Elasticsearch & Lucene Internals Complete Guide 2025: Deep Dive into Segments, Inverted Index, Refresh/Flush/Merge, and Shard Routing

Introduction: Finding Answers Inside Logs in Under a Second

Picture This

Your company generates tens of terabytes of logs every day. One night at 11 PM, a user puts a specific product in their cart and hits an error before placing the order. The customer support rep asks you:

"Show me all request logs for this user around 8:23 PM today. Only ones with response time over 2 seconds."

GET /logs/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "user_id": "u_12345" } },
        { "range": { "response_time_ms": { "gt": 2000 } } },
        { "range": { "@timestamp": { "gte": "2025-04-15T20:20:00Z", "lte": "2025-04-15T20:30:00Z" } } }
      ]
    }
  }
}

Elasticsearch returns the answer from billions of log entries within 200ms. How?

The answer lies inside Lucene, a Java search library that has been under development since 1999. Elasticsearch is essentially a distributed layer built on top of Lucene, so understanding Lucene goes a long way toward understanding Elasticsearch.

What This Article Covers

Lucene fundamentals: Inverted index, term dictionary, posting list.
Segment structure: A collection of immutable files.
Refresh / Flush / Merge: The secret of NRT.
BM25 and scoring.
Analyzers and text processing.
Elasticsearch distribution: Shard, replica, routing.
Aggregation execution.
Production tuning.

Why Learn This Now?

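Elasticsearch remains one of the most widely used search engines.
It is the core of the Elastic Stack (alongside Kibana and Logstash), and OpenSearch is a fork of it.
Systems such as Grafana Loki and SigNoz make different indexing trade-offs, which are easier to evaluate once you know what a full inverted index costs.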
Without understanding Lucene internals, you can't answer "why is it slow", "why does it use so much memory", or "why did the index grow so large".

1. Inverted Index: Where It All Begins

The Problem

Given these documents:

Doc 1: "Elasticsearch is a distributed search engine"
Doc 2: "Lucene is the library behind Elasticsearch"
Doc 3: "A search engine finds relevant documents"

Question: Which documents contain "search"?

Naive approach: scan every document looking for "search". O(total word count). Impossible with millions of documents.

Structure of an Inverted Index

An inverted index pre-builds a word-to-document mapping:

Term Dictionary:
  "a"            → [1, 3]
  "behind"       → [2]
  "distributed"  → [1]
  "documents"    → [3]
  "elasticsearch"→ [1, 2]
  "engine"       → [1, 3]
  "finds"        → [3]
  "is"           → [1, 2]
  "library"      → [2]
  "lucene"       → [2]
  "relevant"     → [3]
  "search"       → [1, 3]
  "the"          → [2]

Now, to find "search", a single term dictionary lookup returns [1, 3]. With a sorted dictionary that lookup takes O(log n) in the number of unique terms, regardless of how many documents exist.

Posting List

The document list for each term is called a posting list. In practice it stores more than just document IDs:

"search":
  [
    (docId=1, freq=1, positions=[4]),
    (docId=3, freq=1, positions=[1])
  ]

docId: document number.
freq: how many times the term appears in that document.
positions: position within the document (for phrase search).
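
The structure described so far can be built in a few lines. This is a toy sketch with naive whitespace tokenization and 0-based token positions; build_index and postings are names invented for the example and have nothing to do with Lucene's on-disk format.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]} using whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: "Elasticsearch is a distributed search engine",
    2: "Lucene is the library behind Elasticsearch",
    3: "A search engine finds relevant documents",
}
index = build_index(docs)

def postings(term):
    """Return (doc_id, freq, positions) tuples sorted by doc_id."""
    return [(d, len(p), p) for d, p in sorted(index.get(term, {}).items())]

print(postings("search"))  # [(1, 1, [4]), (3, 1, [1])]
```

A phrase query such as "search engine" can then be answered by checking that "engine" occurs at position + 1 in the same document.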

Term Dictionary: Implemented with FST

How do you efficiently store millions of terms? Lucene uses an FST (Finite State Transducer).

An FST is an extremely compressed representation of string-to-value mappings:

elastic  → 1
elected  → 2
election → 3
electric → 4

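All four terms share the prefix "el", and three of them share "elect". An FST stores shared prefixes (and shared suffixes) only once, so a lookup costs time proportional to the length of the term rather than the size of the dictionary, and millions of terms can fit in tens of megabytes.

Every system built on Lucene, including Elasticsearch, Solr, and OpenSearch, inherits this structure.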
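
Prefix sharing is easy to see with a plain trie. Note the assumption: this is not an FST (a real FST also shares suffixes and is stored as a compact byte array), just the simplest structure that shows why common prefixes cost nothing extra.

```python
def build_trie(terms):
    """Character trie as nested dicts; the '$' key marks end-of-term and holds the value."""
    root = {}
    for value, term in enumerate(terms, start=1):
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = value
    return root

def count_nodes(node):
    """Count character nodes, ignoring the '$' value markers."""
    return sum(1 + count_nodes(child) for ch, child in node.items() if ch != "$")

def lookup(root, term):
    """Walk one node per character, so the cost depends only on len(term)."""
    node = root
    for ch in term:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

terms = ["elastic", "elected", "election", "electric"]
trie = build_trie(terms)
print(sum(len(t) for t in terms), count_nodes(trie))  # 30 characters stored as 18 nodes
print(lookup(trie, "election"))                       # 3
```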

Posting List Compression

A posting list may contain millions of document IDs. Compression is essential.

1. Delta Encoding:

Original: [1, 5, 8, 12, 15, 17]
Delta:    [1, 4, 3,  4,  3,  2]

Small consecutive numbers compress well.

2. Variable Byte Encoding: Small numbers take 1 byte, larger ones take multiple bytes. Since most numbers are small, averages around 1 byte.

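3. FOR (Frame of Reference) + PFOR: values are bit-packed in fixed-size blocks, using just enough bits per value to hold the largest value in the block.

Lucene combines these techniques, so a posting list typically occupies only a small fraction of the space the raw integers would need.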
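
The three techniques can be tried out in a few lines. This is a minimal sketch and not Lucene's actual codec: the function names are invented for illustration, and a Python integer stands in for the packed bit buffer.

```python
def delta_encode(doc_ids):
    """Replace each sorted doc ID with the gap to its predecessor."""
    prev, gaps = 0, []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    total, doc_ids = 0, []
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """7 payload bits per byte; a set high bit means another byte follows."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

def for_pack(block):
    """Bit-pack a block using the bit width of its largest value."""
    width = max(block).bit_length() or 1
    packed = 0
    for i, value in enumerate(block):
        packed |= value << (i * width)
    return width, packed

def for_unpack(width, packed, count):
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

doc_ids = [1, 5, 8, 12, 15, 17]
gaps = delta_encode(doc_ids)       # [1, 4, 3, 4, 3, 2]
encoded = vbyte_encode(gaps)       # 6 bytes, one per gap
width, packed = for_pack(gaps)     # 3 bits per gap, 18 bits for the block
print(gaps, len(encoded), width)
```

Stored as plain 32-bit integers the six IDs would take 192 bits; after delta encoding, variable-byte encoding needs 48 bits and bit-packing needs 18 plus a small header for the width.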

2. Lucene Segments: The Elegance of Immutable Files

What Is a Segment?

In Lucene, a segment is an independent small inverted index. A complete unit of search.

Index/
├── segments_12                # current segment list (commit point)
├── _0.cfs                     # segment 0 (compound file)
├── _1.cfs                     # segment 1
├── _2.cfs                     # segment 2
└── _3.cfs                     # segment 3

Each segment contains:

Term dictionary (FST)
Posting lists
Stored fields (original documents)
Doc values (for sorting/aggregation)
Norms (for scoring)
Term vectors (for highlighting)

Immutability

The key property of segments: once written, they are never modified.

This yields huge advantages:

Lock-free: read-only, so no concurrency concerns.
Cache efficiency: safely cached in OS page cache.
Simple replication: just copy files.
Lock-free search: thousands of concurrent queries.

Adding Documents = New Segment

Adding documents does not modify existing segments. Instead, a new segment is created.

Before: [segment_1][segment_2][segment_3]
Add 10 documents → create segment_4
After:  [segment_1][segment_2][segment_3][segment_4]

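At search time every segment is searched and the per-segment results are merged. Within a shard this has traditionally happened one segment after another; parallelism comes mainly from running many shards and many queries concurrently.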

Deletion = Tombstone

Deletion also does not actually remove the document. A tombstone (deletion marker) is recorded:

.liv file: [1, 0, 1, 1, 0, 1, ...]  # 0 = deleted

Search checks this bitmap and skips deleted entries. Actual reclamation happens during merge.

Update = Delete + Add

An update is "mark the previous version as deleted + insert the new version into a new segment". Because of this:

Many updates → tombstones accumulate → search slows down.
Periodic merges are essential.
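
The add, delete, and update rules fit into a small toy model. Everything here is hypothetical scaffolding (class names, in-memory tuples instead of files); the point is only that document data is never rewritten and that deletes flip a bit.

```python
class Segment:
    """An immutable batch of documents plus a live-docs bitmap."""
    def __init__(self, docs):
        self.docs = tuple(docs)              # (doc_id, body) pairs, never modified
        self.live = [True] * len(self.docs)  # False = tombstoned

class ToyIndex:
    def __init__(self):
        self.segments = []

    def add(self, docs):
        self.segments.append(Segment(docs))  # new documents always go to a new segment

    def delete(self, doc_id):
        for seg in self.segments:
            for i, (d, _) in enumerate(seg.docs):
                if d == doc_id:
                    seg.live[i] = False      # mark only; the data stays in place

    def update(self, doc_id, body):
        self.delete(doc_id)                  # update = delete + add
        self.add([(doc_id, body)])

    def search(self, word):
        return [(d, b) for seg in self.segments
                for (d, b), alive in zip(seg.docs, seg.live)
                if alive and word in b.split()]

    def deleted_count(self):
        return sum(seg.live.count(False) for seg in self.segments)

idx = ToyIndex()
idx.add([(1, "red apple"), (2, "green apple")])
idx.update(1, "red cherry")
print(idx.search("apple"))   # [(2, 'green apple')]
print(idx.deleted_count())   # 1 tombstone waiting for a merge
```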

3. The Refresh / Flush / Merge Cycle

The Lucene/Elasticsearch write path consists of three stages, each with a different purpose.

In-Memory Buffer

New documents first accumulate in the memory buffer:

Index Buffer (RAM)
[doc1, doc2, doc3, doc4, ...]

Not yet searchable (!). Not on disk either.

Refresh: Make It Searchable

Refresh converts the memory buffer into an in-memory segment:

Index Buffer → [new segment (in memory)]

This in-memory segment exists only in OS page cache (not yet fsynced). But it is searchable.

Default interval: 1 second (index.refresh_interval = 1s).

This is the secret of Elasticsearch's Near Real-Time (NRT) search. Inserted documents become visible in search results within about 1 second.

Caution: Refresh is expensive. Each creates a new segment → small segments pile up → search slows down.

Tuning Refresh

For bulk indexing, disable refresh:

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}

// After indexing is done
PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}

Indexing can become several times faster.

Translog: Ensuring Durability

Refresh does not guarantee durability (no fsync). So if the server crashes, do we lose data?

Solution: the translog (transaction log).

Every indexing operation:

Written simultaneously to the memory buffer AND translog.
Translog is persisted to disk via fsync (default: per request).

Write flow:
Document → Memory Buffer → Translog (fsync)
                ↓ (after 1 second, refresh)
            In-memory segment (searchable)
                ↓ (periodic flush)
            Disk segment (persistent)
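
The write flow can be modeled as a small state machine. This is a deliberately simplified sketch: ToyShard and its methods are invented for illustration, Python lists stand in for the page cache and the disk, and the translog is assumed to be fsynced on every request.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []      # in-memory indexing buffer, not searchable
        self.translog = []    # stands in for the fsynced transaction log
        self.searchable = []  # segments opened for search
        self.committed = []   # segments that have survived a flush

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        if self.buffer:
            self.searchable.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        self.refresh()
        self.committed = list(self.searchable)
        self.translog = []    # safe to discard: the segments are now durable

    def search(self):
        return [doc for seg in self.searchable for doc in seg]

    def crash_and_recover(self):
        # Uncommitted segments and the buffer are lost; the translog is replayed.
        replay = self.translog
        self.buffer, self.searchable, self.translog = [], list(self.committed), []
        for doc in replay:
            self.index(doc)
        self.refresh()

shard = ToyShard()
shard.index("a")
print(shard.search())   # [] because the document is not searchable before a refresh
shard.refresh()
print(shard.search())   # ['a']
shard.flush()
shard.index("b")
shard.crash_and_recover()
print(shard.search())   # ['a', 'b'] because 'b' was recovered from the translog
```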

Flush: Persistent Storage

Flush fsyncs in-memory segments to disk and empties the translog:

Before flush:
  Memory: [seg_new (in cache)]
  Translog: [full, several hundred MB]

After flush:
  Disk: [seg_new (fsynced)]
  Translog: [empty]

Default triggers:

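The translog reaches 512MB (index.translog.flush_threshold_size).
Note that index.translog.sync_interval (default 5s) is not a flush trigger: it only controls how often the translog itself is fsynced when index.translog.durability is set to async.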

Flush is real disk I/O, so it is far more expensive.

Merge: Combining Segments

Over time, the number of segments grows:

Search traverses all segments → slow.
Tombstones accumulate, wasting space.

Merge combines multiple segments into one larger segment:

Before: [seg_1, seg_2, seg_3, seg_4]  (10MB each)
Merge starts
Concurrent: [seg_1, seg_2, seg_3, seg_4, seg_merged_in_progress]
After:  [seg_merged]                  (40MB, tombstones removed)

Existing segments remain searchable during merge. An atomic swap happens on completion.
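
A merge boils down to copying live documents into a new segment and publishing it in one step. In this sketch a segment is just a (docs, live) pair and merge_segments is a hypothetical helper; real merges also rebuild the inverted index and doc values.

```python
def merge_segments(segments):
    """Merge (docs, live) pairs into one segment, keeping only live documents."""
    merged_docs = [doc for docs, live in segments
                   for doc, alive in zip(docs, live) if alive]
    return merged_docs, [True] * len(merged_docs)

segments = [
    (["d1", "d2"], [True, False]),   # d2 was deleted
    (["d3"],       [True]),
    (["d4", "d5"], [False, True]),   # d4 was superseded by an update
]

merged = merge_segments(segments)
# Readers keep using the old list until this single reassignment publishes the result.
segments = [merged]
print(segments)  # [(['d1', 'd3', 'd5'], [True, True, True])]
```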

TieredMergePolicy

Lucene's default merge policy. The concept of "tier":

Groups segments of similar size.
If a tier is too large, triggers a merge.
Large segments (5GB+) are excluded from further merging.

Parameters:

{
  "index.merge.policy.max_merged_segment": "5gb",
  "index.merge.policy.segments_per_tier": 10
}

Analogy

An analogy for refresh/flush/merge:

Refresh: tidying your desk every day (making small segments). Frequent, fast.
Flush: filing paperwork into the cabinet on weekends (permanent disk storage).
Merge: reorganizing the filing cabinet at month end (segment merging).

Parameter Tuning

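Common defaults (note that index.merge.scheduler.max_thread_count actually defaults to max(1, min(4, cores / 2)); 1 is the usual choice for spinning disks):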

{
  "index.refresh_interval": "1s",
  "index.translog.flush_threshold_size": "512mb",
  "index.merge.scheduler.max_thread_count": 1
}

Bulk indexing:

{
  "index.refresh_interval": "60s",
  "index.number_of_replicas": 0,
  "index.translog.durability": "async"
}

4. BM25: The Math of Scoring

The Problem

How do you pick "the 10 most relevant documents"? You need a relevance score, not just a yes/no on whether the term exists.

TF-IDF (Classic)

TF-IDF is the product of two factors:

TF (Term Frequency): how often the term appears in the document.
IDF (Inverse Document Frequency): how rare the term is across all documents.

score(q, d) = Σ_t (TF(t, d) × IDF(t))

Intuition: matching a rare word ("quantum") is more meaningful than matching a common word ("the").

Weaknesses of TF-IDF

Linear TF: 100 occurrences get 100x the score of 1 occurrence. Unrealistic.
Ignores document length: "search" appearing twice in a 100-word document is not the same as twice in a 1000-word document.

BM25: The Improved Version

BM25 (Best Match 25) was proposed by Stephen Robertson and colleagues in the 1990s. It has been the default similarity since Lucene 6 (Elasticsearch 5).

BM25(q, d) = Σ_t IDF(t) × (f(t,d) × (k1+1)) / (f(t,d) + k1 × (1 - b + b × |d|/avgdl))

Looks complex, but:

f(t, d): term frequency.
k1 (default 1.2): controls TF saturation. The TF factor approaches a ceiling of k1 + 1 no matter how often the term occurs.
b (default 0.75): document length normalization strength.
|d|: document length.
avgdl: average document length.

BM25 Improvements

Saturation: even with high TF, the score saturates → prevents spam.
Document length normalization: offsets the TF advantage of longer documents.
Tunable: adjustable via k1 and b.
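The formula above can be written out directly. This is a small sketch, not Lucene's implementation: the IDF variant used here, ln(1 + (N - n + 0.5) / (n + 0.5)), is an assumption chosen because it is the commonly cited Lucene-style form, and the function names are made up for the example.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    # Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf(doc_count, doc_freq) * tf * (k1 + 1) / (tf + k1 * length_norm)


def bm25(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Sum the per-term scores; corpus is a list of token lists."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)
        score += bm25_term(tf, len(doc_tokens), avg_len, len(corpus), df, k1, b)
    return score
```

Two properties are easy to check with it: raising tf from 10 to 100 barely changes the score (saturation), and the same tf in a longer document scores lower (length normalization).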

Practical Tuning

Defaults work for most cases. But:

Primarily short documents (e.g. tweets): lower b (0.3–0.5).
Long documents where TF matters (e.g. long articles): raise k1 (1.5–2.0).

{
  "settings": {
    "similarity": {
      "my_bm25": {
        "type": "BM25",
        "k1": 1.5,
        "b": 0.5
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "my_bm25"
      }
    }
  }
}

Lucene's Scoring Implementation

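Lucene does not recompute every part of BM25 from scratch for each document. It relies on values that are cheap to look up:

Norm: encoded field length, computed at index time and stored per document.
IDF: computed once per query term from the shard-level index statistics (document count and document frequency), not per segment.
Boost: query-time weight for the term or field.

Because the norm is stored at index time and the IDF is computed once per query, per-document scoring reduces to a few arithmetic operations.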

5. Analyzer: The Text Processing Pipeline

Why an Analyzer?

How do "Search Engine" and "search engine" end up being treated as the same term? How do "running" and "run" match as the same concept?

Answer: the analyzer processes text at index and query time.

Analyzer Structure

Input Text
    ↓
Character Filter (zero or more)
    ↓
Tokenizer (exactly one)
    ↓
Token Filter (zero or more)
    ↓
Output Tokens

Character Filter

Character-level preprocessing:

HTML strip: remove HTML tags.
Mapping: character replacement (& → and).
Pattern replace: regex replacement.

Tokenizer

Splits text into tokens (usually words):

Standard: word boundary based, Unicode aware. Works for most languages.
Whitespace: split by whitespace.
N-gram: generate all n-grams (e.g. "search" → ["sea", "ear", "arc", ...]).
Keyword: no splitting (for exact match).
Language-specific: Chinese (IK, Smart Chinese), Japanese (Kuromoji), Korean (Nori).

Token Filter

Post-processing tokens:

Lowercase: lowercase conversion.
Stop: remove stop words ("the", "a", "is").
Stemmer: stemming ("running" → "run").
Synonym: synonym expansion.
ASCII folding: "café" → "cafe".
Word delimiter: split compound words.

Standard Analyzer Example

The default standard analyzer:

Input: "The Quick Brown Foxes!"
  ↓ Standard Tokenizer
[The, Quick, Brown, Foxes]
  ↓ Lowercase Filter
[the, quick, brown, foxes]
  ↓ Stop Filter (off by default)
[the, quick, brown, foxes]
Output: [the, quick, brown, foxes]
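The three-stage pipeline can be imitated in a few lines. This is a rough sketch only: the regex tokenizer is a stand-in for the Unicode-aware standard tokenizer, and all function names here are invented for illustration.

```python
import re


def html_strip(text):
    # Character filter: replace HTML tags with a space.
    return re.sub(r"<[^>]+>", " ", text)


def simple_tokenizer(text):
    # Tokenizer: split on non-word characters (stand-in for the standard tokenizer).
    return re.findall(r"\w+", text)


def lowercase(tokens):
    # Token filter: lowercase every token.
    return [t.lower() for t in tokens]


def make_stop_filter(stop_words):
    # Token filter factory: drop tokens that appear in stop_words.
    def stop_filter(tokens):
        return [t for t in tokens if t not in stop_words]
    return stop_filter


def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through char filters, then one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<b>The Quick</b> Brown Foxes!", [html_strip], simple_tokenizer, [lowercase])
# tokens == ["the", "quick", "brown", "foxes"]
```

Order matters: a stop filter with lowercase stop words only works if it runs after the lowercase filter.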

Korean Processing: Nori

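Korean is hard to tokenize because of particles and endings. Nori is a Korean morphological analyzer, and its nori_part_of_speech filter drops particles and endings:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_nori": {
          "tokenizer": "nori_tokenizer",
          "filter": ["nori_part_of_speech", "nori_readingform", "lowercase"]
        }
      }
    }
  }
}

Input: "고양이를 좋아합니다" → roughly [고양이, 를, 좋아, 합니다] → particles and endings removed by nori_part_of_speech → [고양이, 좋아] (illustrative; the exact tokens depend on Nori's dictionary)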

Edge N-gram: Autocomplete

For autocomplete, edge n-gram is commonly used:

Input: "apple"
→ [a, ap, app, appl, apple]

When a user types "app", it matches the already-indexed "app". "apple" is found instantly.

Caution: storage grows significantly. Not by word count but by n-gram count.
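A minimal sketch of edge n-gram generation, to make the storage cost concrete. The min_gram / max_gram parameter names mirror the usual analyzer settings, but the function itself is only an illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def indexed_term_count(tokens, min_gram=1, max_gram=10):
    """How many terms get indexed once every token is expanded into prefixes."""
    return sum(len(edge_ngrams(t, min_gram, max_gram)) for t in tokens)


print(edge_ngrams("apple"))  # ['a', 'ap', 'app', 'appl', 'apple']
```

One 5-letter word becomes 5 indexed terms, which is where the storage growth comes from. Raising min_gram to 2 or 3 is a common way to trim it.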

6. Elasticsearch Distribution

Lucene is a search engine on a single machine. Elasticsearch extends it to a distributed cluster.

Index, Shard, Replica

Index: a collection of documents (analogous to a DB table).
Shard: a partition of the index. Each shard is a single Lucene index.
Primary shard: the original.
Replica shard: the copy.

Index "logs" (5 primary, 1 replica)
├── shard 0 (primary, node A) ← original
│   └── replica 0 (node B)   ← copy
├── shard 1 (primary, node B)
│   └── replica 1 (node C)
├── shard 2 (primary, node C)
│   └── replica 2 (node A)
├── shard 3 (primary, node A)
│   └── replica 3 (node B)
└── shard 4 (primary, node B)
    └── replica 4 (node C)

Shard Routing

Deciding which shard to put a document in:

shard = hash(_routing) % num_primary_shards

By default, _routing = document_id. Documents with the same ID always go to the same shard.

The Primary Shard Count Constraint

Problem: the primary shard count cannot be changed after index creation. Why?

Initial: 5 shards
hash(user_1) % 5 = 3  → stored on shard 3

If you change it to 10 shards:

hash(user_1) % 10 = 8  → look on shard 8 → not there!

Existing data would be searched on the wrong shard. Reindexing is required to resolve this.

Solutions: the Split API (in some cases), or start with enough shards from the beginning.
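The routing problem is easy to reproduce. In this sketch zlib.crc32 is only a stand-in for the hash (Elasticsearch uses a Murmur3 hash, and its real formula also involves routing shards), and the helper names are made up.

```python
import zlib


def route(routing_key, num_primary_shards):
    # zlib.crc32 is a stand-in; it is deterministic, which is all we need here.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def moved_fraction(keys, old_shards, new_shards):
    """Fraction of keys whose computed shard changes when the count changes."""
    moved = sum(1 for k in keys if route(k, old_shards) != route(k, new_shards))
    return moved / len(keys)


keys = [f"user_{i}" for i in range(10_000)]
print(moved_fraction(keys, 5, 10))  # roughly half of the keys
print(moved_fraction(keys, 5, 7))   # most of the keys
```

When the new count is a multiple of the old one, a key on old shard s can only land on s or s + 5, so each old shard's documents go to a small, predictable set of new shards. With an unrelated count such as 7, documents scatter everywhere.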

Replica Shards

Replicas can be added/removed anytime:

PUT /my_index/_settings
{
  "index.number_of_replicas": 2
}

Replicas provide:

High availability: if primary fails, a replica is promoted.
Read scalability: distribute queries across replicas.

Cluster State and Master

A cluster has master nodes:

Manage cluster state (shard assignment, mappings, etc.).
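Master election: Zen Discovery (before 7.0) or the newer cluster coordination layer, which is similar in spirit to Raft (Elasticsearch 7+).
Prevent two masters existing simultaneously (split brain).

Tip: run an odd number of master-eligible nodes (3 is typical) so that a majority quorum can always be formed.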

Query Execution: Scatter-Gather

When a search query arrives:

1. Coordinator node receives the request.
2. Broadcasts the query to every needed shard (primary or replica).
3. Each shard computes local top-K.
4. Coordinator gathers all results and picks the global top-K.
5. Fetches the full data for the selected documents.
6. Returns to the client.

This is called scatter-gather. In Elasticsearch it runs as two phases: query (steps 2–4), then fetch (step 5).
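The query phase can be sketched with heapq. Shards are modeled here as plain lists of (score, doc_id) pairs, which is an assumption made purely for illustration; the fetch phase is omitted.

```python
import heapq


def shard_top_k(shard_hits, k):
    """Query phase on one shard: return the local top-k (score, doc_id) pairs."""
    return heapq.nlargest(k, shard_hits)


def coordinator_top_k(shards, k):
    """Gather each shard's local top-k and reduce them to the global top-k."""
    candidates = []
    for hits in shards:
        candidates.extend(shard_top_k(hits, k))
    return heapq.nlargest(k, candidates)


shards = [
    [(0.9, "a"), (0.1, "b")],
    [(0.8, "c"), (0.7, "d")],
]
print(coordinator_top_k(shards, 2))  # [(0.9, 'a'), (0.8, 'c')]
```

For top-K by score this is exact: any document in the global top-K must also be in its own shard's top-K, so the coordinator never needs more than K entries per shard.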

Query Pitfall: Deep Pagination

GET /logs/_search?from=9990&size=10

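To get "results 9,991 to 10,000" (from=9990, size=10):

Each shard computes its top 10,000 (from + size).
Coordinator gathers up to 10,000 entries from every shard and sorts them all.
Returns only the last 10.

→ With from=1,000,000, each shard would sort a million entries. Memory explosion. This is why index.max_result_window rejects from + size above 10,000 by default.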

Solution: use the search_after parameter. Instead of re-collecting everything before the offset, each page continues from the sort values of the previous page's last hit.
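A sketch of how a client might build successive search_after request bodies. It relies on each hit in a sorted search response carrying a "sort" array; the request_id tiebreaker field is hypothetical, and sending the body to the cluster is left out.

```python
def next_page_body(query, sort, size, last_hit=None):
    """Build a _search body; pass the previous page's last hit to continue."""
    body = {"query": query, "sort": sort, "size": size}
    if last_hit is not None:
        # Resume right after the sort values of the last hit already seen.
        body["search_after"] = last_hit["sort"]
    return body


query = {"term": {"user_id": "u_12345"}}
# "request_id" is a hypothetical unique field used as a tiebreaker.
sort = [{"@timestamp": "asc"}, {"request_id": "asc"}]

first_page = next_page_body(query, sort, size=10)
last_hit = {"_id": "abc", "sort": [1744748400000, "req-0010"]}  # example shape
second_page = next_page_body(query, sort, size=10, last_hit=last_hit)
```

The sort needs a unique tiebreaker so that no document is skipped or repeated when several hits share the same timestamp.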

Aggregation Execution

Aggregation queries (e.g. GROUP BY) also execute distributed:

GET /logs/_search
{
  "size": 0,
  "aggs": {
    "by_user": {
      "terms": { "field": "user_id", "size": 10 }
    }
  }
}

Each shard computes a local top list, then the coordinator merges them. Problem: the local top 10 may not be the global top 10. To improve accuracy, each shard returns more candidates than size (by default shard_size = size × 1.5 + 10, so 25 here).
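Why a local top-N can miss the global winner is easiest to see with a tiny example. The per-shard counts below are invented, and terms_agg is a toy stand-in for the real aggregation, with shard_size playing the same role as the request parameter of that name.

```python
from collections import Counter


def terms_agg(shard_counts, size, shard_size):
    """Merge each shard's top shard_size terms, then keep the global top size."""
    merged = Counter()
    for counts in shard_counts:
        for term, n in Counter(counts).most_common(shard_size):
            merged[term] += n
    return merged.most_common(size)


shard_counts = [
    {"x": 10, "y": 9},   # shard 0
    {"z": 10, "y": 9},   # shard 1
]

too_small = terms_agg(shard_counts, size=1, shard_size=1)
big_enough = terms_agg(shard_counts, size=1, shard_size=2)
# big_enough == [("y", 18)]; too_small never even sees "y".
```

"y" is second on both shards but first overall, so it only surfaces when each shard returns more than one candidate.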

Use the shard_size parameter to tune:

"terms": { "field": "user_id", "size": 10, "shard_size": 100 }

7. Doc Values: Sorting and Aggregation

Why They Matter

The inverted index goes from term → docs. To answer "what is this document's field value?", you need the reverse direction.

Example: ORDER BY timestamp or GROUP BY country. You need each document's timestamp and country values.

Doc Values

Doc values are columnar storage:

timestamp column:
  doc 0 → 2025-04-15 10:00:00
  doc 1 → 2025-04-15 10:00:01
  doc 2 → 2025-04-15 10:00:02
  ...

Each document's field value stored as a contiguous array. Optimal for sorting, aggregation, and scripting.
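The two directions can be contrasted in a short sketch. Plain Python lists and dicts stand in for Lucene's on-disk structures here, and the function names are invented.

```python
from collections import Counter, defaultdict


def build_inverted(values):
    """term -> list of doc ids (the direction search uses)."""
    index = defaultdict(list)
    for doc_id, value in enumerate(values):
        index[value].append(doc_id)
    return dict(index)


def group_count(doc_values, matching_doc_ids):
    """GROUP BY over matching docs using the doc id -> value column."""
    return Counter(doc_values[d] for d in matching_doc_ids)


country = ["kr", "us", "kr", "jp"]   # doc values: position = doc id
inverted = build_inverted(country)   # {"kr": [0, 2], "us": [1], "jp": [3]}

# A query matched docs 0, 1, 2: the column answers "which country?" per doc.
print(group_count(country, [0, 1, 2]))
```

Looking up a document's value is a direct array access in the column, whereas the inverted index would have to be scanned term by term to find which term contains that doc id.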

Memory Use

Doc values are disk-based by default:

mapped via mmap → leverages OS page cache.
Cached in RAM if frequently used, on disk otherwise.

Elasticsearch's field data was the old in-memory version. Doc values are much more efficient, so they are now the default.

Exception: text Fields

text fields do not support doc values:

Only analyzed tokens are stored → original reconstruction is hard.
To aggregate, use a keyword subfield.

{
  "properties": {
    "name": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    }
  }
}

Search on name. Aggregate/sort on name.keyword.

Sparse Doc Values

Lucene supports sparse representation for cases with many missing fields (e.g. optional fields). Space-efficient.

8. Mapping and Dynamic Mapping

What Is Mapping?

Mapping defines each field's type and properties:

{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "user_id": { "type": "keyword" },
      "timestamp": { "type": "date" },
      "price": { "type": "double" },
      "location": { "type": "geo_point" }
    }
  }
}

text vs keyword

The most important distinction:

text: analyzed (tokenized). For full-text search.
keyword: not analyzed. For exact match, aggregation, and sorting.

Emails, IPs, tags are almost always keyword. Body, description, title are text (optionally with keyword as well).

Dynamic Mapping

Field types are auto-inferred on first indexing:

POST /my_index/_doc
{
  "name": "Alice",
  "age": 30,
  "active": true
}

Pro: quick start. Con: inferred types may differ from what you expected, and an existing field's type cannot be changed without reindexing.

Recommendation: use explicit mapping in production. Dynamic mapping is for dev/test.

Mapping Explosion

Each field carries memory overhead. For example:

{
  "user_data": {
    "user_1": { "action": "login" },
    "user_2": { "action": "logout" }
  }
}

Each user_id dynamically creates a new field. Millions of fields → mapping explosion. Elasticsearch can go down in minutes.

Fixes:

Restructure: { "user_id": "user_1", "action": "login" }.
Or use the flattened field type.
Limit via index.mapping.total_fields.limit (default 1000).
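The restructuring fix can be done in the application before indexing. A small sketch, assuming the input has the same shape as the example above; the function name is made up.

```python
def flatten_user_data(doc):
    """Turn {"user_data": {user_id: {...}}} into one document per user.

    The user id moves from a field NAME to a field VALUE, so the mapping
    keeps a fixed set of fields no matter how many users exist.
    """
    return [
        {"user_id": user_id, **fields}
        for user_id, fields in doc["user_data"].items()
    ]


doc = {
    "user_data": {
        "user_1": {"action": "login"},
        "user_2": {"action": "logout"},
    }
}
print(flatten_user_data(doc))
# [{'user_id': 'user_1', 'action': 'login'}, {'user_id': 'user_2', 'action': 'logout'}]
```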

9. Production Tuning

Bulk Indexing

Goal: load data as fast as possible.

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0,
    "translog": {
      "durability": "async",
      "sync_interval": "30s"
    }
  }
}

Index in batches with the Bulk API:

POST /_bulk
{ "index": { "_index": "my_index" } }
{ "field1": "value1" }
{ "index": { "_index": "my_index" } }
{ "field2": "value2" }

5–15MB per bulk is the sweet spot. Too small → overhead; too large → memory pressure.
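One way to stay inside that size range is to build the NDJSON bodies with a byte budget. This sketch only builds the request bodies; sending them to /_bulk is left out, and the function name and 10MB default are choices made for this example.

```python
import json


def bulk_bodies(index_name, docs, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bulk bodies, each at most about max_bytes in size."""
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index_name}})
        source = json.dumps(doc)
        # +2 for the newline that terminates each of the two lines.
        pair_size = len(action.encode("utf-8")) + len(source.encode("utf-8")) + 2
        if lines and size + pair_size > max_bytes:
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines.extend([action, source])
        size += pair_size
    if lines:
        yield "\n".join(lines) + "\n"


for body in bulk_bodies("my_index", [{"field1": "value1"}, {"field2": "value2"}]):
    print(body, end="")
```

Each body ends with a newline, which the Bulk API requires. A single document larger than max_bytes still goes out in its own request rather than being dropped.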

After indexing:

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}

POST /my_index/_forcemerge?max_num_segments=1

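_forcemerge down to a single segment maximizes search performance, but run it only on indices that will no longer be written to: it is I/O-heavy and produces very large segments.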

JVM Heap

Critical rule: keep heap under 32GB.

Reason: above 32GB, the JVM's compressed oops (compressed object pointers) is disabled, and memory efficiency drops sharply.

# jvm.options
-Xms16g
-Xmx16g

The rest of memory goes to OS page cache. Lucene heavily uses page cache. Best setup:

64GB server memory
JVM heap 16–31GB
OS cache 33–48GB

Shard Count

Bad example: 1,000 daily indices × 5 primary shards = 5,000 shards. Master overhead explodes.

Rules:

20–40GB per shard is ideal.
No more than 20 shards per GB of JVM heap.
Small indices are fine with a single shard.
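These rules of thumb translate into simple arithmetic. The helpers below are a back-of-the-envelope sketch using the numbers from the rules above (a 30GB target sits in the middle of the 20–40GB range); they are not an official sizing formula.

```python
import math


def primary_shards_for(index_size_gb, target_shard_gb=30):
    """Primary shard count that keeps each shard near target_shard_gb."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def max_shards_for_heap(heap_gb, shards_per_heap_gb=20):
    """Upper bound on shards per node from the shards-per-GB-of-heap rule."""
    return heap_gb * shards_per_heap_gb


print(primary_shards_for(5))     # 1   -> a small index needs a single shard
print(primary_shards_for(300))   # 10
print(max_shards_for_heap(16))   # 320 shards on a node with a 16GB heap
```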

Hot-Warm Architecture

Useful pattern for time-series data:

Hot nodes: recent data, fast SSDs, active indexing/search.
Warm nodes: older data, HDDs, read-only.
Cold nodes: very old data, occasionally queried.

Automate with Index Lifecycle Management (ILM):

{
  "policy": {
    "phases": {
      "hot":  { "actions": {} },
      "warm": { "min_age": "7d", "actions": { "allocate": { "require": { "data": "warm" } } } },
      "cold": { "min_age": "30d", "actions": { "allocate": { "require": { "data": "cold" } } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

Data Streams

Elasticsearch 7.9+ data streams are a high-level abstraction for time-series data:

Auto-creates backing indices (e.g. .ds-logs-2025.04.15-000001).
Auto-rollover (by size/age).
Integrates with ILM.

PUT /_data_stream/logs-app

Creating a data stream requires a matching index template with data streams enabled. Kibana logs, APM, and monitoring all use data streams.

10. Common Pitfalls and Debugging

Pitfall 1: Too Many Shards

Symptoms: slow cluster state updates, high master node CPU, slow queries.

Cause: thousands of small shards.

Fixes:

Merge/shrink older indices.
Automate via ILM.
Reduce shard count via the Shrink API.

Pitfall 2: Mapping Explosion

Symptoms: OOM, indexing failures, cluster instability.

Cause: unbounded dynamic field creation.

Fixes:

Explicit mapping.
Use "dynamic": "strict" to reject new fields.
Fix data structure at the application level.

Pitfall 3: Deep Pagination

Symptoms: memory/time explosion on queries with large from values.

Fixes:

Use search_after.
Or the scroll API (for bulk extraction).
Block deep pagination in the UI.

Pitfall 4: Fielddata on Text

Symptoms: aggregation on a text field → error or massive memory use.

Fixes:

Use a keyword subfield (e.g. name.keyword).
Or fielddata: true (risky, not recommended).

Pitfall 5: Refresh Abuse

Symptoms: ?refresh=true on every request to make docs searchable immediately.

Cause: new segment per indexing call → excessive merge load.

Fixes:

Refresh only when needed.
Replace with ?refresh=wait_for (waits for the next refresh).

Pitfall 6: Complex Nested Queries

Symptoms: queries on nested fields are slow.

Cause: nested fields are stored as "hidden sub-documents" and searched separately.

Fixes:

Denormalize (flatten). Accept duplication.
Or use join type for many relations (with reduced performance).

11. Lucene and Competing Technologies

OpenSearch

When Elasticsearch moved to the SSPL license in 2021, AWS forked OpenSearch. Still Lucene-based. Features are starting to diverge.

Apache Solr

Solr, predating Elasticsearch, is also Lucene-based. Originally enterprise-focused. Elasticsearch now leads in market share.

Meilisearch

A lightweight search engine written in Rust. Does not use Lucene; implements its own. Very fast at smaller scale. Built-in typo tolerance (fuzzy matching).

Typesense

Written in C++. An Algolia alternative.

The Role of ClickHouse

If "search" really means analytical queries rather than traditional search, ClickHouse is far faster. Log analytics are increasingly moving from Elasticsearch to ClickHouse.

Differences:

Elasticsearch: free-text search, diverse queries, complex scoring.
ClickHouse: aggregation, SQL, large-scale analytics.

Vector Search (see earlier post)

Elasticsearch 8+ also supports vector search (HNSW). See the earlier post on ANN algorithms.

Quiz Review

Q1. Why are Lucene segments immutable, and what are the benefits?

A. Even when documents are added/deleted/updated, existing segments are never modified. Instead, new segments are created, and deletions are marked via tombstones.

Benefits:

Lock-free search: read-only, so thousands of concurrent queries run without locks.
Cache stability: OS page cache safely caches immutable files. No cache invalidation concerns.
Simple replication: just copy files. No state synchronization.
Merge safety: background merges don't interfere with live searches. Atomic swap on completion.
Optimized write path: writes are always append. Sequential I/O is fast.

Costs:

Deletion only leaves a tombstone; actual reclamation is deferred to merge.
Update = delete + add → tombstone accumulation.
Merge cost (I/O, CPU).

This is the same philosophy as LSM-Trees. "Writes are append, cleanup is batched later" is a fundamental pattern of modern storage systems.

Q2. What are the differences and roles of refresh, flush, and merge?

Refresh (~1 second interval):

Converts memory buffer into an in-memory segment.
Documents become searchable (the source of Near Real-Time).
Not yet fsynced → no durability.
Cost: moderate. Creates a new small segment each time.

Flush (minutes or 512MB):

Fsyncs in-memory segments to disk.
Empties the translog.
Data becomes persistent.
Cost: high (actual disk I/O).

Merge (background):

Merges multiple small segments into a larger one.
Tombstoned documents are actually removed.
Maintains search performance (fewer segments = faster).
Cost: very high (disk read + write).

Analogy:

Refresh = jotting into your notebook (just a memo).
Flush = filing the notebook into the cabinet (permanent storage).
Merge = quarterly file reorganization (dedup, efficiency).

Why separate: each must run at a different cadence to be efficient. Search needs frequent refresh, but fsync is expensive so it cannot be frequent. Merge can run slowly in the background. Separating these three stages is how Elasticsearch achieves both performance and durability.

Q3. What is the difference between text and keyword fields, and why do you need both?

text field:

An analyzer is applied.
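Tokenization, lowercase, stemming, stop-word removal, etc.
"The Quick Brown Fox" → ["quick", "brown", "fox"] (with lowercase and stop filters enabled; the default standard analyzer keeps "the").
Optimal for full-text search.
Cannot sort/aggregate by default (no doc values; the original string is lost after analysis).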

keyword field:

Not analyzed. String is kept as-is.
"The Quick Brown Fox" → ["The Quick Brown Fox"] (a single value).
Used for exact match, sorting, aggregation.
Useful for high-cardinality values.

Why both: the same field often has different query patterns. For example, a product name:

Full-text search for "iPhone" → text.
Aggregation by exact name "iPhone 15 Pro" → keyword.

Hence multi-field mapping:

{
  "product_name": {
    "type": "text",
    "fields": {
      "keyword": { "type": "keyword" }
    }
  }
}

Now search on product_name, aggregate on product_name.keyword. Same data, two indexing styles.

Rules of thumb:

IP, email, tag, ID, URL → keyword.
Body, description, title, comments → text.
Product name, user name where both needed → text + keyword subfield.

Without this distinction, confusion like "aggregation doesn't work" or "fielddata error" arises. It's the most basic principle of mapping design.

Q4. How does BM25 improve on TF-IDF?

A. It addresses two weaknesses of TF-IDF:

Weakness 1: Linearity of TF. TF-IDF uses term frequency linearly: score ∝ TF.

"search" 1 time vs 100 times → 100x difference.
Reality: 1 → 10 is a big jump, but 90 → 100 is almost meaningless.

BM25 fix: a saturation curve.

f(tf) = tf × (k1+1) / (tf + k1)

Growth rate decreases as TF grows. With k1=1.2, tf=1 gives f=1.0, tf=10 gives f≈1.96, and tf=100 gives f≈2.17; the ceiling is k1+1 = 2.2. Also effective against spam.

Weakness 2: Ignoring document length. TF-IDF treats "'search' 2 times in a 100-word document" and "'search' 2 times in a 10,000-word document" equally. But matches in short documents are more "relevant".

BM25 fix: length normalization.

length_norm = 1 - b + b × (|d| / avgdl)

|d|: this document's length.
avgdl: average document length.
b: normalization strength (0-1, default 0.75).

Discounts TF for long documents. b=1 is full normalization, b=0 ignores length.

Combined BM25 formula:

BM25 = IDF × (tf × (k1+1)) / (tf + k1 × (1 - b + b × |d|/avgdl))

Complex-looking, but the meaning is clear: "IDF-weighted score with saturated TF discounted by document length".

In practice defaults (k1=1.2, b=0.75) satisfy most cases. For short documents like tweets lower b; for long articles raise k1. BM25 becoming Lucene's default in Lucene 6 was no accident — it's a formula proven over 20 years.

Q5. Why did Elasticsearch make the primary shard count unchangeable after index creation?

A. Because of shard routing. Which shard a document lands on is determined by:

shard_id = hash(routing_key) % num_primary_shards

By default routing_key = document_id. Then:

At indexing: hash(doc_1) % 5 = 3 → stored on shard 3.
At search: hash(doc_1) % 5 = 3 → queried on shard 3.

The problem: if you change from 5 to 10 shards:

New calculation: hash(doc_1) % 10 = 8 → queried on shard 8.
But actual storage was shard 3 → data not found.

All existing documents would be searched on the wrong shard. Equivalent to data loss.

Theoretical solutions:

Consistent hashing: most documents stay put with partial changes. Still some migration needed.
Reindex: create a new index and copy all documents. Time/resource intensive.
Dual-write period: index into the new one in parallel. Gradual transition.

Elasticsearch's practical solutions:

Split API: only N → N×k (2x, 3x, etc.), bounded by index.number_of_routing_shards. Each shard splits internally.
Shrink API: shrink from N → N/k, where the target count must be a factor of N.
Reindex API: general case. Create new index and copy.
Data streams + ILM: time-series data auto-creates new indices, so no problem.

Design recommendations:

Enough shards from the start. Easier than growing later.
Rule: 20–40GB per shard, sized for 1.5–2x the projected data volume to leave room for growth.
Time-series: data streams + daily/weekly indices.

This constraint is a main reason "why Elasticsearch operations are tricky". Bad initial design means costly reindexing. Hence decide the shard count carefully before production.

Conclusion: 20 Years of Engineering

Key Takeaways

Inverted index + FST: the foundation of search.
Segments are immutable: the secret of lock-free search.
Refresh/Flush/Merge: three-stage performance/durability balance.
BM25: the modern scoring standard.
Analyzer: the text processing pipeline.
Shard + Replica: the basics of distribution.
Doc Values: columnar storage for sorting/aggregation.
Hot-Warm + ILM: time-series data management.

Lessons in Operating Elasticsearch

Trust defaults, but understand them. They are usually good but may not fit your situation.
Heap under 32GB. The single most important rule.
Design shards well upfront. Hard to change later.
Explicit mapping. Dynamic mapping is where disasters start.
Refresh only as needed. Turn it off for bulk indexing.
Measure. _cat/segments, _cat/shards, _cluster/health.

Lucene, a Treasure

Lucene has been developed since the early 2000s as a Java library. Twenty years of careful, hand-tuned optimization by many engineers. Compression algorithms, data structures, file formats — everything is finely tuned.

Elasticsearch, Solr, OpenSearch, Kibana, and even some SaaS search services sit on top of Lucene. After Google Search, it's likely the search engine you use most often.

A Final Lesson

Search looks easy but gets hard once you dig deep. To answer "why is this query slow", "why is memory running low", or "why is the cluster unstable", you have to know the internals.

Having read this post, you now:

Know Lucene's segment and file structure.
Understand refresh vs flush.
Know why BM25 is good.
Understand the importance of shard count.
Know why Hot-Warm exists.

Next time you work with an Elasticsearch cluster, this knowledge will steer your decisions for the better. And when issues arise, you'll be able to answer "why?". That is the power of a real engineer.

References

Lucene: The Internal Workings
Elasticsearch: The Definitive Guide - older but great conceptual explanations
Elastic Blog: Anatomy of an Elasticsearch Cluster
Lucene Index File Formats
Okapi BM25 (Robertson & Zaragoza, 2009) - mathematical review of BM25
Elasticsearch: Designing for Scale
OpenSearch Documentation - Elasticsearch fork
Introduction to Information Retrieval (Manning, Raghavan, Schütze) - classic IR textbook
Finite State Transducers for Fast Text Processing - FST explanation
Why do I need replicas? Everything you need to know about Elasticsearch Shards