Elasticsearch, OpenSearch, and Lucene Internals: Inverted Index, BM25, Sharding, Vector Search, Hybrid RAG (2025)

"Search is not a feature. It's a philosophy of how humans interact with information." — Doug Cutting (creator of Lucene, 1999)

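Remember the days before Google? When Doug Cutting wrote Lucene in Java in 1999, he believed that "anyone should be able to put Google-class search on their own data." Twenty-six years later, Elasticsearch, OpenSearch, and Solr all stand on Lucene. It powers log analytics, product search, autocomplete and, since 2024, much of the core infrastructure for RAG (Retrieval-Augmented Generation).

But "using Elasticsearch" and "understanding Lucene" are worlds apart. This article is a map that runs from the fundamentals of search to the hybrid search of 2025.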

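1. Why `LIKE '%keyword%'` in a Relational DB Doesn't Work

The Linear Scan Wall

SELECT * FROM articles WHERE content LIKE '%postgresql%';

A B-tree index cannot help, because the pattern starts with a wildcard (%)
Every row's text field gets a full scan
At 100 million documents, that can mean tens of minutes

Postgres GIN indexes solve part of this, but:

Tokenization and language analysis are limited
Relevance scoring is hard to get right
The ecosystem for autocomplete, typo correction, and synonyms is thin
Distributed search support is weak

This is why a dedicated search engine is needed.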

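2. Inverted Index: The Mathematical Heart of Search

The Basic Idea

Document 1: "The quick brown fox"
Document 2: "The lazy brown dog"
Document 3: "Foxes and dogs"

Flip this into a term → document list mapping (after lowercasing):

and    → [3]
brown  → [1, 2]
dog    → [2]
dogs   → [3]
fox    → [1]
foxes  → [3]
lazy   → [2]
quick  → [1]
the    → [1, 2]

The query "brown dog" becomes:

brown → [1,2]
dog → [2]
AND (intersection) → [2]

This is the essence of the inverted index. A term lookup does not scan documents at all; query cost depends mainly on the length of the matching postings lists, so it stays fast even across billions of documents.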

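Lucene's On-Disk Units

The units Lucene stores on disk:

Term Dictionary: sorted dictionary of all terms, reached through an FST (Finite State Transducer) term index
Postings List: the documents each term appears in, plus frequencies and positions
Stored Fields: the original documents (compressed)
Doc Values: columnar storage for aggregations and sorting
Norms: field-length normalization values

FST: Saving Memory by Sharing Prefixes

Terms such as "fox, foxes, foxy" share a common prefix, which a finite state transducer stores only once. This lets the index over tens of millions of terms fit in a few MB. The same structure is at the core of autocomplete.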

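3. Segment: Lucene's Immutability Principle

Immutable Segments

In Lucene, a segment file is never modified once it has been written. This is what makes Lucene fast and safe.

Append-only: new documents go into new segments
A delete only leaves a deletion marker (tombstone)
An update is a delete plus an insert
Many small segments are merged into larger ones

Segment Structure

A segment consists of several files:

_0.cfs     : compound file (packs the other files into one)
_0.cfe     : compound file entry table
_0.si      : segment info
_0.fdt/fdx : stored fields (data and index)
_0.tim/tip : term dictionary and term index
_0.doc/pos : postings (doc IDs/frequencies and positions)
_0.dvm/dvd : doc values
_0.liv     : live docs (deletion bitmap)

Merge Policy

Many small segments make search slow (each one is visited)
Building large segments makes merges expensive (I/O spikes)
The default is TieredMergePolicy, which groups segments into size tiers and merges within them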

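Refresh / Flush / Commit

Term	Meaning	When
Refresh	Writes the in-memory buffer to a new segment and opens it for search (not yet fsynced)	Every 1 second by default
Flush	Performs a Lucene commit and starts a new translog generation	Automatic (when the translog passes a size threshold)
Commit	The Lucene-level operation that fsyncs segment files and writes a new commit point	As part of each flush

The secret of **"Near Real-Time Search"**: refresh every second, flush later. That is why a "1-second delay" is Elasticsearch's trademark.

Translog: The Log That Guards Durability

Every write is recorded in the translog before it is acknowledged
On node restart, operations since the last commit are replayed from the translog
index.translog.durability:
- request: fsync on every request (the default; slower, but no acknowledged write is lost)
- async: background fsync every index.translog.sync_interval (default 5s); a crash can lose the last few seconds of writes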

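4. BM25: The Score That Replaced TF-IDF

The Limits of TF-IDF

$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$

tf: frequency of the term within the document
df: number of documents containing the term
Problem: raw tf keeps growing with document length, so long documents get inflated scores

The BM25 Formula

$\text{score}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$

Key changes:

Saturation: each additional occurrence of a term adds less to the score ( $k_1 \approx 1.2$ )
Length normalization: the document length $|d|$ is compared with the average $\text{avgdl}$ to apply a penalty or bonus ( $b \approx 0.75$ )

Why BM25 Is the Default

Over 20 years of strong empirical results
Easy to tune with just two parameters
The default similarity in Lucene since 2016

Parameter Tuning

k_1: 1.2 by default; raising it makes term frequency matter more
b: 0.75 by default; lowering it makes length matter less (helps short fields)

For product names or query-log analysis, lowering b to around 0.3 is a common pattern.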

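5. Analyzer: The Art of Tokenization

The Three-Stage Pipeline

Character Filter: strips HTML, replaces characters
Tokenizer: splits text into tokens
Token Filter: lowercasing, stemming, synonyms, stopwords

The Korean Hell

English: tokenizing on whitespace is easy.

Korean: agglutinative, with conjugated endings and particles.

"검색했다", "검색한다", "검색은", "검색을"
All of these must match "검색" ("search")
Use a language-specific analyzer: nori (Korean), kuromoji (Japanese), ik (Chinese)

Example: nori analysis

POST _analyze
{
  "tokenizer": "nori_tokenizer",
  "filter": ["lowercase"],
  "text": "Elasticsearch는 검색엔진입니다"
}

-> roughly "elasticsearch", "는", "검색", "엔진", plus the tokens for the ending "입니다"

To drop particles and endings such as 는 and 입니다, add the part-of-speech filter (the built-in nori analyzer already includes it):

"filter": ["nori_part_of_speech"]

Synonym Expansion

"shoe, sneaker, runner"
-> a query for "shoe" also matches documents containing sneaker or runner

Half of search quality depends on the synonym dictionary. Building it is tedious, but the return on investment is hard to beat.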

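6. Shard & Replica: The Basics of Distribution

Primary Shard

An index is split into several primary shards
The shard is chosen by hashing the routing value: shard = hash(_routing) % number_of_primary_shards
_routing defaults to the document's _id

Replica Shard

A copy of a primary
Scales search throughput and protects against failures
number_of_replicas = 1 means one replica for each primary

Limits and Rules

The primary count cannot be changed in place after index creation (routing would break)
Workaround: reindex into a new index (the _split and _shrink APIs also write to a new index) and switch an alias
Rule of thumb for shard size: 10-50 GB per shard
Too many shards bloat the cluster state and add overhead

Cluster State & Split Brain

The master node manages cluster metadata
If several nodes act as master at the same time, you get split brain
Before 7.0: discovery.zen.minimum_master_nodes = (N/2) + 1, where N is the number of master-eligible nodes, set by hand; from 7.0 on the quorum is managed automatically
7.0 (2019) rewrote cluster coordination entirely (Raft-like)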

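7. Query DSL: The JSON Maze

Main Query Types

Query	Use
`match`	Full-text matching that goes through the analyzer
`term`	Exact match without analysis (keyword fields)
`match_phrase`	Phrase match that preserves word order
`multi_match`	Search several fields at once
`bool`	AND/OR/NOT composition (the workhorse)
`range`	Range
`function_score`	Custom scoring
`rank_feature`	Score boost from a numeric feature field

The Four Clauses of a bool Query

{
  "bool": {
    "must": [...],      // AND, contributes to the score
    "should": [...],    // OR, contributes to the score
    "filter": [...],    // AND, no score contribution, cacheable
    "must_not": [...]   // NOT, no score contribution
  }
}

Use the filter clause as much as possible: it skips scoring and its results can be cached. Score with match, and put fixed conditions in filter.

Term vs Match: The Most Common Mistake

// a text field holds the value "User Name"
{"term": {"name": "User Name"}}   // no match: the indexed tokens were lowercased
{"term": {"name": "user name"}}   // still no match: the value was split into "user" and "name"
{"match": {"name": "User Name"}}  // OK

Use match on text fields and term on keyword fields.

Aggregation: ES as an Analytics DB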

{
  "aggs": {
    "by_category": {
      "terms": {"field": "category"},
      "aggs": {
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}

The equivalent of SQL's GROUP BY. Essential for log analysis and dashboards.

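8. Vector Search: The Revolution Since 2022

Why Vectors

BM25 matches words, so it treats "강아지" (puppy) and "개" (dog) as unrelated
Embeddings capture meaning, so similar meanings are linked automatically

kNN in ES/OpenSearch

Elasticsearch has had native kNN since 8.0 (2022), built on the HNSW implementation in Lucene 9.0.

// index mapping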
{
  "mappings": {
    "properties": {
      "title_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

// query
{
  "knn": {
    "field": "title_vector",
    "query_vector": [0.1, 0.2, ...],
    "k": 10,
    "num_candidates": 100
  }
}

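HNSW Parameters

m: number of neighbors per node (usually 16)
ef_construction: search width at index time (100-200)
ef_search: search width at query time (higher recall, lower speed); in Elasticsearch this role is played by num_candidates

Storage Efficiency: Quantization

int8 quantization: float32 → int8, 4x smaller, typically with only a small loss in accuracy (around 1%)
BBQ (Better Binary Quantization): introduced in Lucene 10 in late 2024, about 32x smaller
Increasingly common in RAG deployments in 2025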

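9. Hybrid Search: The Answer for the RAG Era

BM25 vs Vector Search

Aspect	BM25	Vector
Exact match (product codes)	Strong	Weak
Semantic similarity	Weak	Strong
Rare terms (jargon)	Strong	Weak
Typos	Weak	Medium
Multilingual	Weak	Strong

You need both.

RRF: Reciprocal Rank Fusion

Combine the results of the two rankings by rank:

$\text{RRF}(d) = \sum_i \frac{1}{k + \text{rank}_i(d)}$

Only the rank of each result is used, where $\text{rank}_i(d)$ is the position of document $d$ in ranking $i$
Scale differences do not matter (BM25 scores and cosine similarity can differ freely)
k = 60 is the value used in the original RRF paper and a common default

The ES `rrf` Retriever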

{
  "retriever": {
    "rrf": {
      "retrievers": [
        {"standard": {"query": {"match": {"content": "search terms"}}}},
        {"knn": {"field": "vec", "query_vector": [...], "k": 50, "num_candidates": 100}}
      ],
      "rank_window_size": 50,
      "rank_constant": 60
    }
  }
}

Cross-Encoder Reranker

Fetch the top 100 candidates with BM25/kNN
Re-rank them with a cross-encoder model (BERT-based)
Natively supported in ES/OpenSearch since 2024 (for example with Cohere or BGE reranker models)

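10. The 2021 License War: The Birth of OpenSearch

Background

AWS sold Elasticsearch as a managed service (Amazon Elasticsearch Service)
Elastic was furious, saying AWS was profiting without contributing upstream
In January 2021, Elastic announced that Elasticsearch 7.11 would move to a dual SSPL/Elastic License
AWS responded with a fork of the last Apache 2.0 release: OpenSearch

Aftermath

Elastic: the license let it compete with AWS and defend revenue
AWS: transferred OpenSearch to the Linux Foundation (September 2024)
Community: split

The 2024-2025 Picture

Product	License	Led by
Elasticsearch 8	Elastic License 2 / SSPL	Elastic
Elasticsearch (from Aug 2024)	AGPL added as an option	Elastic (an attempt to win back trust)
OpenSearch 2.x	Apache 2.0	AWS → Linux Foundation

Selection Guide

AWS-centric, managed → OpenSearch
Need the latest ML/vector/ES|QL features → Elasticsearch
Open-source purist → OpenSearch
Elastic Agent/Fleet ecosystem → Elasticsearch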

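11. Operational Hell: Common Failure Patterns

JVM Heap Management

Baseline: keep the heap under 32 GB (the Compressed OOPs limit)
Heap usage above 50% is a warning sign, above 75% is dangerous
Old-generation GC pauses searches (stop-the-world)

Circuit Breaker

circuit_breaking_exception: Data too large

A request is rejected when its estimated memory use would exceed a set share of the heap
Defaults: 60% for the request breaker; 95% for the parent breaker with real-memory tracking (70% without)
Do not blindly retry rejected queries, because that makes things worse

Hot Shard

Queries concentrate on one shard
Cause: a skewed routing key
Fix: adjust _routing, increase the shard count

Shard Explosion

More than 1000 shards per node overloads the cluster state
Use index lifecycle management (ILM): rollover, shrink, delete

Snapshot & Restore

S3/GCS/Azure Blob repositories
Incremental backups
On large clusters a restore can take hours, so plan for it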

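12. Ingest Pipelines: The Art of Getting Data In

Logstash: Traditional ETL

Input → Filter → Output structure
Rich filters: Grok, Mutate, GeoIP, User Agent, and more
JVM-based and heavy

Beats: Lightweight Agents

Filebeat (logs), Metricbeat (metrics), Packetbeat (network)
Go-based and lightweight
Ship directly to ES or through Logstash

Elastic Agent + Fleet (2021+)

One agent collects all data
Policies are managed from a central UI (Fleet)

OpenTelemetry Integration

OTel → ES has had first-class support since 2024
The OTel Collector can replace Logstash

Ingest Node Pipeline

A simple pipeline can be defined in ES itself: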

{
  "processors": [
    {"set": {"field": "indexed_at", "value": "{{_ingest.timestamp}}"}},
    {"grok": {"field": "message", "patterns": ["%{COMBINEDAPACHELOG}"]}}
  ]
}

13. ES|QL: The Return of SQL (2024)

Elastic finally shipped the SQL-like query language it had long wanted.

FROM logs-*
| WHERE status >= 500
| STATS count = COUNT(*) BY host
| SORT count DESC
| LIMIT 10

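Pipe-based (inspired by Splunk SPL and Kusto KQL)
Removes the complexity of the JSON Query DSL
Far more convenient for analytical queries

OpenSearch offers a comparable pipe-based language, **PPL (Piped Processing Language)**, which it inherited from Open Distro and which predates ES|QL.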

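14. Search Quality Evaluation

nDCG (Normalized Discounted Cumulative Gain)

Sums the relevance of the top results with a logarithmic position discount
Ranges from 0 to 1, where 1 is a perfect ranking
A primary KPI for search teams

Precision / Recall

Precision: share of retrieved results that are relevant
Recall: share of relevant results that were retrieved
There is a tradeoff between them

Learning from Click Logs (Learning to Rank)

User clicks → relevance labels
Train a LambdaMART model
Elasticsearch LTR plugin

A/B Test

Serve different rankings to treatment and control groups
Measure CTR, conversion rate, and dwell time
Offline metrics (nDCG) ≠ online success, so always validate with an A/B test

15. Top 10 Anti-Patterns

Tens of TB in an index with the default 1 shard/1 replica: keep each shard at 10-50 GB
Custom _routing on a skewed key: uneven shards and hot spots
Too many indexes/shards: cluster state bloat
A 64 GB JVM heap: past the Compressed OOPs limit (stay under 32 GB)
Deep pagination (from=10000): use search_after (with a point in time) or scroll
term queries on analyzed text fields: use a keyword field or match
A cluster health check on every query: needless load
Frequent bulk deletes: tombstones pile up in segments, _forcemerge may be needed
No backups: snapshots are mandatory
Vectors only, BM25 thrown away: hybrid almost always wins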

16. A Checklist for Using Elasticsearch/OpenSearch Wisely

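Closing: Giants Standing on Lucene

Elasticsearch, OpenSearch, Solr: the names differ, but all stand on the shoulders of a giant called Lucene. And Lucene began in 1999 as a single Java library that Doug Cutting built to "democratize search."

In 2025, search:

finds logs (OpenTelemetry → ES)
finds products (e-commerce search)
recommends content (Learning to Rank)
becomes context for LLMs (RAG)
provides autocomplete and typo correction

It is everywhere, but not everyone truly understands it. Inverted Index and Segment, BM25 and vector search, Shard and Routing: the moment these terms click together, you go from someone who uses search to someone who designs it.

Next Article Preview: Consensus in Distributed Systems, with Paxos, Raft, and ZAB Taken Apart

ES cluster state management, etcd, ZooKeeper, PostgreSQL Patroni HA: all of them stand on consensus algorithms. The next article covers:

Defining the distributed consensus problem: the shock of the FLP impossibility result
Paxos: Lamport's "I have never implemented Multi-Paxos, but..."
Raft: the revolution built to be "easier to understand than Paxos"
ZAB: ZooKeeper Atomic Broadcast
Viewstamped Replication: less famous, but an elegant design
etcd vs ZooKeeper vs Consul: when to use which
Raft pitfalls: leader election flapping, split votes
PBFT and the BFT family: Byzantine fault tolerance (the basis of blockchains)
Inside Kafka KRaft: why it dropped ZooKeeper
CRDTs: the magic of converging without consensus

A journey into the real heart of distributed systems.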

"A good search engine doesn't just find what you typed. It finds what you meant." — Peter Norvig (Google Research Director)

Elasticsearch, OpenSearch, and Lucene Internals — Inverted Index, BM25, Sharding, Vector Search, Hybrid RAG (2025)

"Search is not a feature. It's a philosophy of how humans interact with information." — Doug Cutting (creator of Lucene, 1999)

When Doug Cutting built Lucene in Java in 1999, he believed "anyone should be able to add Google-class search to their own data." 26 years later, Elasticsearch, OpenSearch, and Solr all stand on Lucene. Log analytics, product search, autocomplete — and since 2024, the core infrastructure of RAG (Retrieval-Augmented Generation).

But "using Elasticsearch" and "understanding Lucene" are worlds apart. This article is a map from the fundamentals of search to the hybrid search of 2025.

1. Why `LIKE '%keyword%'` in an RDB Doesn't Work

The Linear Scan Wall

SELECT * FROM articles WHERE content LIKE '%postgresql%';

Cannot use a B-tree index (the pattern starts with a wildcard %)
Full scan of text field on every row
100M documents means tens of minutes

Postgres GIN helps partially, but:

Limited tokenization/language analysis
Hard to compute relevance scores
Weak ecosystem for autocomplete, typo correction, synonyms
Weak distributed search support

This is why dedicated search engines exist.

2. Inverted Index — The Mathematical Heart of Search

The Basic Idea

Document 1: "The quick brown fox"
Document 2: "The lazy brown dog"
Document 3: "Foxes and dogs"

Flipped into a term -> document list mapping (after lowercasing):

and    -> [3]
brown  -> [1, 2]
dog    -> [2]
dogs   -> [3]
fox    -> [1]
foxes  -> [3]
lazy   -> [2]
quick  -> [1]
the    -> [1, 2]

Query "brown dog":

brown -> [1,2]
dog -> [2]
AND (intersection) -> [2]

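This is the essence of the inverted index. A term lookup does not scan documents at all; query cost depends mainly on the length of the matching postings lists, so it stays fast even across billions of documents.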
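The example above can be reproduced in a few lines of Python. This is a minimal sketch, not Lucene's implementation: the `tokenize` helper is a hypothetical stand-in for a real analyzer, and postings are plain sorted lists without the compression or skip data a real engine uses.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-letters; a stand-in for a real analyzer."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(a, b):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search_and(index, query):
    """AND query: intersect the postings of every query term."""
    lists = [index.get(term, []) for term in tokenize(query)]
    if not lists:
        return []
    lists.sort(key=len)  # start with the rarest term to keep intermediate results small
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

DOCS = {1: "The quick brown fox", 2: "The lazy brown dog", 3: "Foxes and dogs"}
INDEX = build_inverted_index(DOCS)
print(search_and(INDEX, "brown dog"))  # [2]
```

Note that "dog" and "dogs" stay separate terms here; merging them is the job of the analyzer covered in section 5.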

Lucene's On-Disk Units

Term Dictionary: sorted dictionary of all terms, reached through an FST (Finite State Transducer) term index
Postings List — document list per term + frequency/position
Stored Fields — original documents (compressed)
Doc Values — columnar storage for aggregations/sorting
Norms — field length normalization values

Terms such as "fox, foxes, foxy" share a common prefix, which a finite state transducer stores only once. This lets the index over tens of millions of terms fit in a few MB. The same structure is at the core of autocomplete.

3. Segment — Lucene's Immutability Principle

Immutable Segments

In Lucene, once a file is written it never changes. That's what makes Lucene fast and safe.

Append-only — new documents create new segments
Delete leaves only a tombstone
Update is delete + insert
Small segments merge into larger ones
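The four rules above can be modeled with a toy index. This is a sketch under simplifying assumptions: the `Segment` and `ToyIndex` classes are hypothetical, each added document becomes its own segment, and search is a naive word scan rather than a postings lookup.

```python
class Segment:
    """Documents are fixed at creation; only the set of tombstones can grow."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> text, never edited afterwards
        self.deleted = set()     # tombstones, playing the role of Lucene's .liv bitmap

    def live_docs(self):
        return {d: t for d, t in self.docs.items() if d not in self.deleted}


class ToyIndex:
    def __init__(self):
        self.segments = []

    def delete(self, doc_id):
        # A delete never rewrites a segment; it only marks the document.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def add(self, doc_id, text):
        # An update is a delete of the old version plus an insert into a new segment.
        self.delete(doc_id)
        self.segments.append(Segment({doc_id: text}))

    def merge(self):
        # Merging copies live documents into one new segment and drops tombstones.
        merged = {}
        for seg in self.segments:
            merged.update(seg.live_docs())
        self.segments = [Segment(merged)] if merged else []

    def search(self, word):
        # Every segment is visited, which is why many small segments slow search down.
        return sorted(
            doc_id
            for seg in self.segments
            for doc_id, text in seg.live_docs().items()
            if word in text.lower().split()
        )


idx = ToyIndex()
idx.add(1, "quick fox")
idx.add(2, "lazy dog")
idx.add(1, "quick cat")          # update of document 1
print(len(idx.segments), idx.search("fox"), idx.search("cat"))
idx.merge()
print(len(idx.segments))
```

After the update, the old version of document 1 still occupies space in its segment until `merge` reclaims it, which is the same reason heavy deletes bloat a real index.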

Segment Files

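_0.cfs     : compound file (packs the other files into one)
_0.cfe     : compound file entry table
_0.si      : segment info
_0.fdt/fdx : stored fields (data and index)
_0.tim/tip : term dictionary and term index
_0.doc/pos : postings (doc IDs/frequencies and positions)
_0.dvm/dvd : doc values
_0.liv     : live docs (delete bitmap)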

Merge Policy

Too many small segments -> slow search
Building huge segments -> expensive merges (I/O storms)
Default: TieredMergePolicy — tiered by size

Refresh vs Flush vs Commit

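Term	Meaning	Timing
Refresh	Write the in-memory buffer to a new segment and open it for search (not yet fsynced)	every 1s by default
Flush	Perform a Lucene commit and start a new translog generation	automatic (translog size threshold)
Commit	Lucene-level operation: fsync segment files and write a new commit point	as part of each flush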

The secret of "Near Real-Time Search": Refresh every 1s, Flush later. The "1-second lag" is Elasticsearch's trademark.

Translog

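Every write is recorded in the translog before it is acknowledged
Operations since the last commit are replayed on node restart
index.translog.durability: request (the default, fsync per request) vs async (background fsync every sync_interval, default 5s)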

4. BM25 — Replacing TF-IDF

Limits of TF-IDF

$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$

Longer documents inflate TF, biasing scores.

BM25 Formula

$\text{score}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$

Key changes:

Saturation — diminishing returns for repeated terms ( $k_1 \approx 1.2$ )
Length normalization — penalize/bonus by doc length vs average ( $b \approx 0.75$ )
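To make the formula concrete, here is a small scoring sketch. The source formula leaves $\text{IDF}(t)$ undefined, so the smoothed form $\log(1 + \frac{N - df + 0.5}{df + 0.5})$ used below is an assumption, and the function name `bm25_score` is illustrative only.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query using the BM25 formula above.

    corpus is a list of tokenized documents and supplies N, df(t), and avgdl.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # a term absent from the corpus contributes nothing
        # Assumed IDF variant; the text does not specify one.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        f = doc_terms.count(term)
        length_norm = 1 - b + b * len(doc_terms) / avgdl
        score += idf * f * (k1 + 1) / (f + k1 * length_norm)
    return score

corpus = [["dog"], ["dog"] + ["filler"] * 9, ["cat"]]
short_doc, long_doc = corpus[0], corpus[1]
print(bm25_score(["dog"], short_doc, corpus))
print(bm25_score(["dog"], long_doc, corpus))
```

Both documents contain "dog" once, yet the short one scores higher because of length normalization. Passing `b=0` switches that effect off, which is the idea behind lowering b for short fields.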

Why BM25 Is the Default

20+ years of empirical wins
Just two tunable params
Lucene default since 2016

Lower b (around 0.3) is common for product names or query-log fields.

5. Analyzer — The Art of Tokenization

Three-Stage Pipeline

Character Filter — strip HTML, char replacement
Tokenizer — split into words
Token Filter — lowercase, stemming, synonyms, stopwords
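The three stages compose like ordinary functions. The sketch below is a toy pipeline, not the Elasticsearch analysis chain: the stopword list, the synonym map, and the plural-stripping rule are all made-up assumptions chosen to keep the example short.

```python
import html
import re

STOPWORDS = {"the", "a", "an", "and", "of"}          # assumed, tiny stopword list
SYNONYMS = {"sneaker": "shoe", "runner": "shoe"}     # assumed synonym map

def char_filter(text):
    """Stage 1: strip HTML tags and decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def tokenizer(text):
    """Stage 2: split the cleaned text into word tokens."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """Stage 3: lowercase, drop stopwords, crude plural stemming, map synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOPWORDS:
            continue
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]  # naive stemming; real stemmers are far more careful
        out.append(SYNONYMS.get(tok, tok))
    return out

def analyze(text):
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<p>The Runners &amp; the Sneakers</p>"))  # ['shoe', 'shoe']
```

The same `analyze` function has to run at index time and at query time; otherwise the terms stored in the index and the terms looked up will not line up.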

The Korean Hell

English: whitespace tokenization is easy.

Korean: agglutinative, conjugated endings, particles. "검색했다/검색한다/검색은/검색을" must all match "검색". Use nori (Korean), kuromoji (Japanese), ik (Chinese).

Example: nori analysis

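POST _analyze
{
  "tokenizer": "nori_tokenizer",
  "filter": ["lowercase"],
  "text": "Elasticsearch는 검색엔진입니다"
}

-> roughly "elasticsearch", "는", "검색", "엔진", plus the tokens for the ending "입니다"

Remove particles/endings with "filter": ["nori_part_of_speech"]. The built-in nori analyzer already applies this filter.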

Synonym Expansion

"shoe, sneaker, runner"
-> query "shoe" also matches sneaker/runner docs

Half of search quality lives in the synonym dictionary. Tedious but highest ROI.

6. Shard & Replica — The Basics of Distribution

Primary Shard

Indexes split across multiple primary shards
shard = hash(_routing) % number_of_primary_shards
_routing defaults to _id

Replica Shard

Copy of the primary
Scales read throughput and survives failures
number_of_replicas = 1 means one replica per primary

Limits and Rules

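Primary count cannot be changed in place after index creation (routing would break)
Fix: reindex into a new index (the _split and _shrink APIs also write to a new index) and switch an alias
Rule of thumb: each shard 10-50 GB
Too many shards -> cluster state bloat and per-shard overhead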
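A short experiment shows why the primary count cannot simply be changed. This sketch uses `zlib.crc32` as a stand-in hash (Elasticsearch uses a different hash function internally), and the `route` helper and document IDs are hypothetical.

```python
import zlib
from collections import Counter

def route(routing_key, num_primary_shards):
    """shard = hash(_routing) % number_of_primary_shards, with crc32 as the assumed hash."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

doc_ids = [f"doc-{i}" for i in range(1000)]

# How documents spread over 5 primaries.
print(Counter(route(d, 5) for d in doc_ids))

# How many documents would be looked up on a different shard if the count became 6.
moved = sum(1 for d in doc_ids if route(d, 5) != route(d, 6))
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because the modulus changes, most documents would be searched for on a shard that does not hold them, which is why changing the count means writing the data into a new index.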

Cluster State and Split Brain

Master node manages cluster metadata
Multiple masters at once = split brain
7.0 (2019) rewrote cluster coordination (Raft-like) and removed the need to set minimum_master_nodes by hand

7. Query DSL — The JSON Maze

Main Query Types

Query	Use
`match`	Analyzed full-text matching
`term`	Exact match without analysis (keyword fields)
`match_phrase`	Order-preserving phrase
`multi_match`	Search across multiple fields
`bool`	AND/OR/NOT composition
`range`	Range
`function_score`	Custom scoring
`rank_feature`	Boosting field

Four Clauses of `bool`

{
  "bool": {
    "must": [],
    "should": [],
    "filter": [],
    "must_not": []
  }
}

must and should contribute to the score; filter and must_not do not. Use filter aggressively: it skips scoring and its results can be cached. Score with match; put fixed conditions in filter.
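As an illustration of that split, the following sketch builds a request body as a plain Python dict. The field names (`title`, `category`, `price`, `status`) and the `build_search` helper are hypothetical; only the Query DSL clause names come from the text above.

```python
import json

def build_search(text, category, max_price):
    """Put the scored full-text part in must and the fixed conditions in filter."""
    return {
        "query": {
            "bool": {
                # Scored: how well does the title match the user's words?
                "must": [{"match": {"title": text}}],
                # Not scored and cacheable: yes/no conditions.
                "filter": [
                    {"term": {"category": category}},
                    {"range": {"price": {"lte": max_price}}},
                ],
                # Not scored: exclusions.
                "must_not": [{"term": {"status": "discontinued"}}],
            }
        }
    }

body = build_search("running shoes", "footwear", 100)
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the body of a `_search` request; moving the category and price conditions into `must` would return the same documents but add scoring work with no ranking benefit.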

Term vs Match — The Most Common Mistake

{"term": {"name": "User Name"}}
{"match": {"name": "User Name"}}

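If name is a text field holding "User Name", the term query finds nothing, because the index contains the lowercased tokens user and name rather than the original string. The match query analyzes its input the same way and succeeds. Use match for text fields and term for keyword fields.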

Aggregations

{
  "aggs": {
    "by_category": {
      "terms": {"field": "category"},
      "aggs": {
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}

The SQL GROUP BY equivalent — essential for log analysis and dashboards.

8. Vector Search — The Revolution After 2022

Why Vectors

BM25 is word-matching — "puppy" and "dog" are unrelated
Embeddings are meaning-based — auto-links similar meanings

ES/OpenSearch kNN

Native kNN since Elasticsearch 8.0 (2022), built on Lucene 9.0 HNSW.

{
  "mappings": {
    "properties": {
      "title_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

{
  "knn": {
    "field": "title_vector",
    "query_vector": [0.1, 0.2],
    "k": 10,
    "num_candidates": 100
  }
}

HNSW Params

m: neighbors per node (usually 16)
ef_construction: indexing width (100-200)
ef_search: query width (recall vs speed); in Elasticsearch this role is played by num_candidates
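HNSW is an approximation of exact nearest-neighbor search, so it helps to see the exact version first. The sketch below is a brute-force baseline, not HNSW itself; the two-dimensional vectors and document names are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_exact(query, vectors, k):
    """Compare the query with every stored vector and keep the k most similar."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
    return ranked[:k]

VECTORS = {
    "dog": [1.0, 0.1],
    "puppy": [0.9, 0.2],
    "car": [0.0, 1.0],
}
print(knn_exact([1.0, 0.0], VECTORS, k=2))  # ['dog', 'puppy']
```

This costs one similarity computation per stored vector. HNSW trades a little recall for visiting only a small part of a neighbor graph, and parameters such as m and ef_construction control how that graph is built.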

Quantization

int8: 4x smaller, typically with only a small accuracy loss (around 1%)
BBQ (Better Binary Quantization): Lucene 10, late 2024, about 32x smaller
Increasingly common for RAG in 2025
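The 4x figure for int8 follows directly from storing one byte per dimension instead of four. The sketch below shows one simple scheme, symmetric scaling by the largest absolute value; that choice is an assumption for illustration and not necessarily what Lucene does.

```python
from array import array

def quantize_int8(vec):
    """Map floats to integers in [-127, 127] using a single per-vector scale."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # fall back to 1.0 for an all-zero vector
    return [round(x / scale) for x in vec], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]

vec = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))

print(codes)
print(f"max reconstruction error: {max_error:.5f}")
# Bytes per dimension: float32 versus int8.
print(array("f", vec).itemsize, array("b", codes).itemsize)
```

The reconstruction error per dimension is bounded by half the scale, which is why the ranking of nearest neighbors usually changes only slightly.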

9. Hybrid Search — The Answer for the RAG Era

BM25 vs Vector

Aspect	BM25	Vector
Exact match (SKUs)	Strong	Weak
Semantic similarity	Weak	Strong
Rare terms	Strong	Weak
Typos	Weak	Medium
Multilingual	Weak	Strong

You need both.

RRF — Reciprocal Rank Fusion

$\text{RRF}(d) = \sum_i \frac{1}{k + \text{rank}_i(d)}$

Combine via rank only
Scale-agnostic (BM25 vs cosine doesn't matter)
k = 60 is the value used in the original RRF paper and a common default
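The formula is short enough to implement directly. In this sketch the two input rankings and their document IDs are invented, and ties are broken alphabetically, which is an arbitrary choice.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; alphabetical order breaks ties.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["a", "b", "c"]     # hypothetical keyword results
vector_ranking = ["c", "a", "d"]   # hypothetical kNN results
print(rrf([bm25_ranking, vector_ranking]))  # ['a', 'c', 'b', 'd']
```

Document "a" wins because it is near the top of both lists, even though it is first in only one. No BM25 score or cosine value enters the calculation, only positions.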

ES `rrf` retriever

{
  "retriever": {
    "rrf": {
      "retrievers": [
        {"standard": {"query": {"match": {"content": "query"}}}},
        {"knn": {"field": "vec", "query_vector": [], "k": 50, "num_candidates": 100}}
      ],
      "rank_window_size": 50,
      "rank_constant": 60
    }
  }
}

Cross-Encoder Reranker

Get top 100 candidates via BM25/kNN
Rerank with Cross-Encoder (BERT-based)
Native support in ES/OpenSearch since 2024 (for example with Cohere or BGE reranker models)

10. The 2021 License War — OpenSearch Is Born

Background

AWS sold Elasticsearch as a managed service
Elastic accused AWS of free-riding without upstream contributions
January 2021: Elastic announced that Elasticsearch 7.11 would move to a dual SSPL/Elastic License
AWS responded with a fork of the last Apache 2.0 release: OpenSearch

Aftermath

Elastic: defended revenue against AWS
AWS: donated OpenSearch to the Linux Foundation (Sep 2024)
Community: fragmented

2024-2025 Status

Product	License	Lead
Elasticsearch 8	Elastic License 2 / SSPL	Elastic
Elasticsearch (from Aug 2024)	AGPL added as an option	Elastic (recovery attempt)
OpenSearch 2.x	Apache 2.0	AWS -> Linux Foundation

Choosing

AWS-centric managed workloads -> OpenSearch
Latest ML/vector/ES|QL features -> Elasticsearch
Pure open source preference -> OpenSearch
Elastic Agent/Fleet ecosystem -> Elasticsearch

11. Operational Hell — Common Failure Patterns

JVM Heap

Don't exceed 32 GB (Compressed OOPs limit)
Warning at 50% heap, danger at 75%
Old Gen GC stops searches (stop-the-world)

Circuit Breaker

circuit_breaking_exception: Data too large

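Defaults: 60% of heap for the request breaker; 95% for the parent breaker with real-memory tracking (70% without)
Do not auto-retry rejected queries, because that makes it worse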

Hot Shard

Query concentrates on one shard
Cause: biased routing key
Fix: adjust _routing, increase shard count

Shard Explosion

Over 1000 shards per node overloads cluster state
Use ILM: rollover, shrink, delete

Snapshot & Restore

S3/GCS/Azure Blob repositories
Incremental backup
Restore can take hours on large clusters

12. Ingest Pipelines

Logstash

Input -> Filter -> Output
Grok, Mutate, GeoIP, User Agent filters
JVM-based, heavy

Beats

Filebeat, Metricbeat, Packetbeat
Go-based, lightweight
Direct to ES or via Logstash

Elastic Agent + Fleet

One agent for all data
Central policy UI (Fleet)

OpenTelemetry

First-class OTel -> ES since 2024
OTel Collector can replace Logstash

Ingest Node Pipeline

{
  "processors": [
    {"set": {"field": "indexed_at", "value": "{{_ingest.timestamp}}"}},
    {"grok": {"field": "message", "patterns": ["%{COMBINEDAPACHELOG}"]}}
  ]
}

13. ES|QL — The Return of SQL (2024)

Elastic's long-awaited SQL-like query language.

FROM logs-*
| WHERE status >= 500
| STATS count = COUNT(*) BY host
| SORT count DESC
| LIMIT 10

Pipe-based (inspired by Splunk SPL, Kusto KQL)
Strips out JSON Query DSL complexity
Massively more convenient for analytics
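Reading the query stage by stage is easier with an equivalent in ordinary code. The sketch below mirrors the meaning of each pipe on an in-memory list; it says nothing about how Elasticsearch executes ES|QL, and the sample log rows are invented.

```python
from collections import Counter

def top_error_hosts(logs, limit=10):
    """Python equivalent of: WHERE status >= 500 | STATS COUNT(*) BY host | SORT DESC | LIMIT."""
    errors = (row for row in logs if row["status"] >= 500)   # WHERE status >= 500
    counts = Counter(row["host"] for row in errors)          # STATS count = COUNT(*) BY host
    return counts.most_common(limit)                         # SORT count DESC | LIMIT 10

LOGS = [
    {"host": "web-1", "status": 500},
    {"host": "web-2", "status": 503},
    {"host": "web-2", "status": 500},
    {"host": "web-1", "status": 200},
]
print(top_error_hosts(LOGS))  # [('web-2', 2), ('web-1', 1)]
```

Each pipe consumes the table produced by the previous stage, which is what makes the query read top to bottom.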

OpenSearch offers a comparable pipe-based language, PPL (Piped Processing Language), which it inherited from Open Distro and which predates ES|QL.

14. Search Quality Evaluation

nDCG — relevance of top results, log-weighted, 0-1
Precision / Recall — tradeoff
Learning to Rank — train from click logs (LambdaMART)
A/B Testing — offline metrics are not online success
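The offline metrics in this list are simple to compute once relevance judgments exist. This sketch assumes graded relevance labels (higher is better) and uses the linear-gain form of DCG with a log2 position discount; other gain functions are also in use.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance at position i is divided by log2(i + 1)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    """DCG of the first k results divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def precision_recall(retrieved, relevant):
    """Precision: share of retrieved that is relevant. Recall: share of relevant retrieved."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(ndcg_at_k([3, 2, 1], k=3))   # ideal order
print(ndcg_at_k([1, 2, 3], k=3))   # best result ranked last
print(precision_recall([1, 2, 3, 4], {2, 4, 5}))
```

The first call returns 1.0 and the second a lower value, because the same three results are shown with the most relevant one at the bottom.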

15. Top 10 Anti-Patterns

Default 1-shard/1-replica for tens of TB indexes
Custom _routing on a skewed key, causing uneven shards
Too many indexes/shards
JVM heap 64 GB (exceeds Compressed OOPs)
Deep pagination (from=10000): use search_after (with a point in time) or scroll
term on analyzed text fields
Cluster health check per query
Frequent bulk deletes without _forcemerge
No snapshots
Vector-only, abandoning BM25 — hybrid almost always wins

16. Checklist for Running ES/OpenSearch Wisely

Closing — Giants Standing on Lucene

Elasticsearch, OpenSearch, Solr — different names, all on the shoulders of the giant named Lucene. And Lucene itself began as Doug Cutting's single Java library to democratize search in 1999.

In 2025, search finds logs and products, recommends content, supplies LLM context, and powers autocomplete and typo correction. It is everywhere, but not everyone understands it. When Inverted Index, Segment, BM25, vector search, Shard, and Routing click into place, you move from being a search user to a search designer.

"A good search engine doesn't just find what you typed. It finds what you meant." — Peter Norvig
