Elasticsearch, OpenSearch, and Lucene Internals — Inverted Index, BM25, Sharding, Vector Search, Hybrid RAG (2025)
Author: Youngju Kim (@fjvbn20031)
"Search is not a feature. It's a philosophy of how humans interact with information." — Doug Cutting (creator of Lucene, 1999)
When Doug Cutting built Lucene in Java in 1999, he believed "anyone should be able to add Google-class search to their own data." 26 years later, Elasticsearch, OpenSearch, and Solr all stand on Lucene. Log analytics, product search, autocomplete — and since 2024, the core infrastructure of RAG (Retrieval-Augmented Generation).
But "using Elasticsearch" and "understanding Lucene" are worlds apart. This article is a map from the fundamentals of search to the hybrid search of 2025.
1. Why LIKE '%keyword%' in an RDB Doesn't Work
The Linear Scan Wall
SELECT * FROM articles WHERE content LIKE '%postgresql%';
- Cannot use an index (the leading % wildcard defeats B-tree prefix matching)
- Full scan of the text field on every row
- At 100M documents, a single query takes tens of minutes
Postgres GIN helps partially, but:
- Limited tokenization/language analysis
- Hard to compute relevance scores
- Weak ecosystem for autocomplete, typo correction, synonyms
- Weak distributed search support
This is why dedicated search engines exist.
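For contrast, the equivalent full-text query in Elasticsearch runs against an inverted index, so its cost tracks the postings list for the term rather than the total row count. A minimal sketch (the articles index and content field are hypothetical):
GET articles/_search
{
  "query": {
    "match": { "content": "postgresql" }
  }
}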
2. Inverted Index — The Mathematical Heart of Search
The Basic Idea
Document 1: "The quick brown fox"
Document 2: "The lazy brown dog"
Document 3: "Foxes and dogs"
Flipped into a word-to-document-list mapping:
brown -> [1, 2]
dog -> [2]
dogs -> [3]
fox -> [1]
foxes -> [3]
lazy -> [2]
quick -> [1]
the -> [1, 2]
Query "brown dog":
brown -> [1, 2]
dog -> [2]
AND (intersection) -> [2]
This is the essence of the Inverted Index: lookups stay fast even across billions of documents, because the cost scales with the length of the postings lists touched, not with the corpus size.
Lucene's On-Disk Units
- Term Dictionary — sorted dictionary of all terms (FST, Finite State Transducer)
- Postings List — document list per term + frequency/position
- Stored Fields — original documents (compressed)
- Doc Values — columnar storage for aggregations/sorting
- Norms — field length normalization values
FST — Prefix Sharing for Memory Savings
Common prefixes of terms like "fox, foxes, foxy" are compressed with a finite state transducer. Tens of millions of terms fit in a few MB. Also the core structure behind autocomplete.
3. Segment — Lucene's Immutability Principle
Immutable Segments
In Lucene, once a file is written it never changes. That's what makes Lucene fast and safe.
- Append-only — new documents create new segments
- Delete leaves only a tombstone
- Update is delete + insert
- Small segments merge into larger ones
Segment Files
_0.cfs — composite file
_0.cfe — entry point
_0.si — segment info
_0.fdt/fdx — field data
_0.tim/tip — term dictionary
_0.doc/pos — postings
_0.dvm/dvd — doc values
_0.liv — live docs (delete bitmap)
Merge Policy
- Too many small segments -> slow search
- Building huge segments -> expensive merges (I/O storms)
- Default: TieredMergePolicy — tiered by size
Refresh vs Flush vs Commit
| Term | Meaning | Timing |
|---|---|---|
| Refresh | Turn in-memory buffer into a searchable Segment | every 1s by default |
| Flush | fsync the segment to disk | automatic (memory threshold) |
| Commit | Full durability including translog | less frequent |
The secret of "Near Real-Time Search": Refresh every 1s, Flush later. The "1-second lag" is Elasticsearch's trademark.
Translog
- Every write goes to translog first
- Replayed on node restart
- index.translog.durability: request (fsync on every request) vs async (periodic fsync, default every 5s)
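A minimal settings sketch for tuning these knobs (index name and values are illustrative): a longer refresh_interval trades freshness for indexing throughput, and async durability trades a few seconds of durability for write speed.
PUT logs-example/_settings
{
  "index.refresh_interval": "5s",
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}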
4. BM25 — Replacing TF-IDF
Limits of TF-IDF
- Raw term frequency has no saturation: the 100th occurrence of a term counts as much as the 2nd
- Longer documents inflate TF, biasing scores toward them
BM25 Formula
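The standard BM25 score of document D for query Q (Lucene defaults: k1 = 1.2, b = 0.75):

$$\text{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where f(t, D) is the frequency of term t in D, |D| is the document length, and avgdl is the average document length in the index.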
Key changes:
- Saturation — diminishing returns for repeated terms (controlled by k1)
- Length normalization — penalize or reward by document length relative to the average (controlled by b)
Why BM25 Is the Default
- 20+ years of empirical wins
- Just two tunable params
- Lucene default since 2016
Lower b (around 0.3) is common for product names or query-log fields.
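A sketch of overriding k1 and b for a specific field via a custom similarity (index and field names are hypothetical):
PUT products
{
  "settings": {
    "index": {
      "similarity": {
        "product_bm25": { "type": "BM25", "k1": 1.2, "b": 0.3 }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "similarity": "product_bm25" }
    }
  }
}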
5. Analyzer — The Art of Tokenization
Three-Stage Pipeline
- Character Filter — strip HTML, char replacement
- Tokenizer — split into words
- Token Filter — lowercase, stemming, synonyms, stopwords
The Korean Hell
English: whitespace tokenization is easy.
Korean: agglutinative, conjugated endings, particles. "검색했다/검색한다/검색은/검색을" must all match "검색". Use nori (Korean), kuromoji (Japanese), ik (Chinese).
Example: nori analysis
POST _analyze
{
"analyzer": "nori",
"text": "Elasticsearch는 검색엔진입니다"
}
-> "elasticsearch", "는", "검색", "엔진", "입니다"
Remove particles/endings with "filter": ["nori_part_of_speech"].
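A sketch of a custom analyzer wiring that filter in (index and analyzer names are hypothetical; nori_tokenizer and nori_part_of_speech come from the analysis-nori plugin):
PUT ko-articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "korean": {
          "type": "custom",
          "tokenizer": "nori_tokenizer",
          "filter": ["nori_part_of_speech", "lowercase"]
        }
      }
    }
  }
}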
Synonym Expansion
"shoe, sneaker, runner"
-> query "shoe" also matches sneaker/runner docs
Half of search quality lives in the synonym dictionary. Tedious but highest ROI.
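A minimal synonym filter sketch (names are illustrative; large dictionaries usually live in a synonyms file or the synonyms API rather than inline rules):
PUT catalog
{
  "settings": {
    "analysis": {
      "filter": {
        "footwear_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["shoe, sneaker, runner"]
        }
      },
      "analyzer": {
        "product_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "footwear_synonyms"]
        }
      }
    }
  }
}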
6. Shard & Replica — The Basics of Distribution
Primary Shard
- Indexes split across multiple primary shards
- shard = hash(_routing) % number_of_primary_shards
- _routing defaults to _id
Replica Shard
- Copy of the primary
- Scales read throughput and survives failures
- number_of_replicas = 1 means one replica per primary
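A minimal sketch of fixing both at index creation (values are illustrative):
PUT articles-2025
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}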
Limits and Rules
- Primary count is fixed after index creation (routing would break)
- Fix: reindex or use alias + new index
- Rule of thumb: each shard 10-50 GB
- Too many shards -> cluster state explosion
Cluster State and Split Brain
- Master node manages cluster metadata
- Multiple masters at once = split brain
- 7.0 (2019) rewrote the consensus algorithm (Raft-like)
7. Query DSL — The JSON Maze
Main Query Types
| Query | Use |
|---|---|
| match | Analyzed full-text matching |
| term | Exact match without analysis (keyword fields) |
| match_phrase | Order-preserving phrase match |
| multi_match | Search across multiple fields |
| bool | AND/OR/NOT composition |
| range | Range filtering |
| function_score | Custom scoring |
| rank_feature | Boosting by a numeric feature field |
Four Clauses of bool
{
"bool": {
"must": [],
"should": [],
"filter": [],
"must_not": []
}
}
Use filter aggressively — it's cached and fast. Score with match; make fixed conditions filter.
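A sketch of that split (field names hypothetical): the match clause scores, the filter clauses only include or exclude and are cacheable.
GET articles/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "lucene internals" } }
      ],
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "published_at": { "gte": "now-1y" } } }
      ]
    }
  }
}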
Term vs Match — The Most Common Mistake
{"term": {"name": "User Name"}}
{"match": {"name": "User Name"}}
A term query is not analyzed, so "User Name" never matches the lowercased tokens stored in a text field. Use match for text fields, term for keyword fields.
Aggregations
{
"aggs": {
"by_category": {
"terms": {"field": "category"},
"aggs": {
"avg_price": {"avg": {"field": "price"}}
}
}
}
}
The SQL GROUP BY equivalent — essential for log analysis and dashboards.
8. Vector Search — The Revolution After 2022
Why Vectors
- BM25 is word-matching — "puppy" and "dog" are unrelated
- Embeddings are meaning-based — auto-links similar meanings
ES/OpenSearch kNN
Native kNN since Elasticsearch 8.0 (2022), built on Lucene 9.0 HNSW.
{
"mappings": {
"properties": {
"title_vector": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine"
}
}
}
}
{
"knn": {
"field": "title_vector",
"query_vector": [0.1, 0.2],
"k": 10,
"num_candidates": 100
}
}
HNSW Params
- m: neighbors per node (usually 16)
- ef_construction: candidate list width at index time (100-200)
- ef_search: candidate list width at query time (recall vs speed)
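A sketch of where these live in a dense_vector mapping (values illustrative); in Elasticsearch the query-time width is effectively the num_candidates parameter on the knn query rather than a stored ef_search setting.
{
  "mappings": {
    "properties": {
      "title_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}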
Quantization
- int8 — 4x savings, about 1% accuracy loss
- BBQ (Better Binary Quantization) — Lucene 10, late 2024, 32x savings
- Effectively standard for RAG in 2025
9. Hybrid Search — The Answer for the RAG Era
BM25 vs Vector
| Aspect | BM25 | Vector |
|---|---|---|
| Exact match (SKUs) | Strong | Weak |
| Semantic similarity | Weak | Strong |
| Rare terms | Strong | Weak |
| Typos | Weak | Medium |
| Multilingual | Weak | Strong |
You need both.
RRF — Reciprocal Rank Fusion
- Combine via rank only
- Scale-agnostic (BM25 vs cosine doesn't matter)
- k = 60 works well empirically
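The fused score of a document d is the sum, over the participating result lists, of the reciprocal of its rank plus the constant k:

$$\text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)}$$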
ES rrf retriever
{
"retriever": {
"rrf": {
"retrievers": [
{"standard": {"query": {"match": {"content": "query"}}}},
{"knn": {"field": "vec", "query_vector": [], "k": 50}}
],
"rank_window_size": 50,
"rank_constant": 60
}
}
}
Cross-Encoder Reranker
- Get top 100 candidates via BM25/kNN
- Rerank with Cross-Encoder (BERT-based)
- Native support in ES/OpenSearch since 2024 (Cohere, E5, BGE)
10. The 2021 License War — OpenSearch Is Born
Background
- AWS sold Elasticsearch as a managed service
- Elastic accused AWS of free-riding without upstream contributions
- January 2021: Elasticsearch 7.11 moved to dual SSPL/Elastic License
- AWS forked immediately: OpenSearch
Aftermath
- Elastic: defended revenue against AWS
- AWS: donated OpenSearch to the Linux Foundation (Sep 2024)
- Community: fragmented
2024-2025 Status
| Product | License | Lead |
|---|---|---|
| Elasticsearch 8 | Elastic License 2 / SSPL | Elastic |
| Elasticsearch 8.16+ | AGPL v3 added (announced Aug 2024) | Elastic (bid to win back open source) |
| OpenSearch 2.x | Apache 2.0 | AWS -> Linux Foundation |
Choosing
- AWS-centric managed workloads -> OpenSearch
- Latest ML/vector/ESQL features -> Elasticsearch
- Pure open source preference -> OpenSearch
- Elastic Agent/Fleet ecosystem -> Elasticsearch
11. Operational Hell — Common Failure Patterns
JVM Heap
- Don't exceed 32 GB (Compressed OOPs limit)
- Warning at 50% heap, danger at 75%
- Old Gen GC stops searches (stop-the-world)
Circuit Breaker
circuit_breaking_exception: Data too large
- Default 60-70% of heap
- Do not auto-retry rejected queries — makes it worse
Hot Shard
- Query concentrates on one shard
- Cause: biased routing key
- Fix: adjust _routing, or increase the shard count
Shard Explosion
- Over 1000 shards per node overloads cluster state
- Use ILM: rollover, shrink, delete
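A minimal ILM policy sketch (policy name and thresholds are illustrative): roll over hot indices by size or age, delete after 30 days.
PUT _ilm/policy/logs-default
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}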
Snapshot & Restore
- S3/GCS/Azure Blob repositories
- Incremental backup
- Restore can take hours on large clusters
12. Ingest Pipelines
Logstash
- Input -> Filter -> Output
- Grok, Mutate, GeoIP, User Agent filters
- JVM-based, heavy
Beats
- Filebeat, Metricbeat, Packetbeat
- Go-based, lightweight
- Direct to ES or via Logstash
Elastic Agent + Fleet
- One agent for all data
- Central policy UI (Fleet)
OpenTelemetry
- First-class OTel -> ES since 2024
- OTel Collector can replace Logstash
Ingest Node Pipeline
{
"processors": [
{"set": {"field": "indexed_at", "value": "{{_ingest.timestamp}}"}},
{"grok": {"field": "message", "patterns": ["%{COMBINEDAPACHELOG}"]}}
]
}
13. ES|QL — The Return of SQL (2024)
Elastic's long-awaited SQL-like query language.
FROM logs-*
| WHERE status >= 500
| STATS count = COUNT(*) BY host
| SORT count DESC
| LIMIT 10
- Pipe-based (inspired by Splunk SPL, Kusto KQL)
- Strips out JSON Query DSL complexity
- Massively more convenient for analytics
OpenSearch offers similar pipe-based querying with PPL (Piped Processing Language).
14. Search Quality Evaluation
- nDCG — relevance of top results, log-discounted, normalized to 0-1 (formula below)
- Precision / Recall — tradeoff
- Learning to Rank — train from click logs (LambdaMART)
- A/B Testing — offline metrics are not online success
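For reference, the standard definition, where rel_i is the graded relevance of the result at rank i and IDCG@k is the DCG of the ideal ordering:

$$\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}$$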
15. Top 10 Anti-Patterns
- Default 1-shard/1-replica settings for indexes of tens of TB
- Manual _id values causing routing skew
- Too many indexes/shards
- 64 GB JVM heap (exceeds the Compressed OOPs limit)
- Deep pagination (from=10000) — use scroll/search_after instead (see the sketch after this list)
- term queries on analyzed text fields
- A cluster health check on every query
- Frequent bulk deletes without _forcemerge
- No snapshots
- Vector-only search, abandoning BM25 — hybrid almost always wins
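A search_after sketch for the deep-pagination item above (index and sort fields are hypothetical): page by sort key instead of from/size, ideally under a point-in-time (PIT) so the view stays consistent between pages.
GET articles/_search
{
  "size": 100,
  "sort": [
    { "published_at": "desc" },
    { "doc_id": "asc" }
  ],
  "search_after": [1735689600000, "a41"]
}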
16. Checklist for Running ES/OpenSearch Wisely
- Separate use cases (logs/search/vector/analytics)
- Shard size 10-50 GB
- ILM policy (rollover, hot/warm/cold)
- Regular snapshots
- Heap under 32 GB, off-heap to file cache
- Circuit breaker alerts
- Analyzer choice (nori for Korean)
- Synonym dictionary curated
- Consider Hybrid Search (BM25 + vector + RRF)
- Monitor CTR, nDCG, 0-result rate
- Use bool filter clauses for caching
- Try ES|QL/PPL for analytics
Closing — Giants Standing on Lucene
Elasticsearch, OpenSearch, Solr — different names, all on the shoulders of the giant named Lucene. And Lucene itself began as Doug Cutting's single Java library to democratize search in 1999.
In 2025, search finds logs, products, content recommendations, LLM context, autocomplete, and typo correction. It is everywhere, but not everyone understands it. When Inverted Index, Segment, BM25, vector search, Shard, and Routing click into place, you move from being a search user to a search designer.
"A good search engine doesn't just find what you typed. It finds what you meant." — Peter Norvig