Elasticsearch, OpenSearch, and Lucene Internals — Inverted Index, BM25, Sharding, Vector Search, Hybrid RAG (2025)

"Search is not a feature. It's a philosophy of how humans interact with information." — Doug Cutting (creator of Lucene, 1999)

When Doug Cutting built Lucene in Java in 1999, he believed "anyone should be able to add Google-class search to their own data." 26 years later, Elasticsearch, OpenSearch, and Solr all stand on Lucene. Log analytics, product search, autocomplete — and since 2024, the core infrastructure of RAG (Retrieval-Augmented Generation).

But "using Elasticsearch" and "understanding Lucene" are worlds apart. This article is a map from the fundamentals of search to the hybrid search of 2025.


1. Why LIKE '%keyword%' in an RDB Doesn't Work

The Linear Scan Wall

SELECT * FROM articles WHERE content LIKE '%postgresql%';
  • A leading % wildcard defeats the B-tree index
  • Full scan of text field on every row
  • 100M documents means tens of minutes

Postgres GIN helps partially, but:

  • Limited tokenization/language analysis
  • Hard to compute relevance scores
  • Weak ecosystem for autocomplete, typo correction, synonyms
  • Weak distributed search support

This is why dedicated search engines exist.


2. Inverted Index — The Basic Idea

Document 1: "The quick brown fox"
Document 2: "The lazy brown dog"
Document 3: "Foxes and dogs"

Flip this into a mapping from word to document list:

brown  -> [1, 2]
dog    -> [2]
dogs   -> [3]
fox    -> [1]
foxes  -> [3]
lazy   -> [2]
quick  -> [1]
the    -> [1, 2]

Query "brown dog":

  • brown -> [1,2]
  • dog -> [2]
  • AND (intersection) -> [2]

This is the essence of the Inverted Index. Lookups stay close to O(log N) even across billions of documents.
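The whole flow fits in a few lines. A minimal sketch (toy whitespace tokenization, nothing like Lucene's real analysis pipeline):

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "Foxes and dogs",
}

# Build: token -> set of doc ids (lowercased whitespace tokenization)
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search_and(query: str) -> set:
    """AND search: intersect the postings list of every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search_and("brown dog"))  # {2}
```

Note that "dog" and "dogs" are still distinct terms here — collapsing them is the analyzer's job, covered below.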

Lucene's On-Disk Units

  1. Term Dictionary — sorted dictionary of all terms (FST, Finite State Transducer)
  2. Postings List — document list per term + frequency/position
  3. Stored Fields — original documents (compressed)
  4. Doc Values — columnar storage for aggregations/sorting
  5. Norms — field length normalization values

FST — Prefix Sharing for Memory Savings

Common prefixes of terms like "fox, foxes, foxy" are compressed with a finite state transducer. Tens of millions of terms fit in a few MB. Also the core structure behind autocomplete.
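An FST is more compact than a plain trie (it shares suffixes as well as prefixes), but the prefix-sharing half of the idea can be sketched with a nested-dict trie:

```python
def build_trie(terms):
    """Insert terms into a nested-dict trie; shared prefixes share nodes."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-term marker
    return root

def count_nodes(node):
    """Count character nodes, skipping end-of-term markers."""
    return sum(1 + count_nodes(child) for k, child in node.items() if k != "$")

# "fox", "foxes", "foxy" hold 12 characters naively,
# but the trie stores the shared prefix "fox" only once.
trie = build_trie(["fox", "foxes", "foxy"])
print(count_nodes(trie))  # 6 nodes: f, o, x, e, s, y
```

Walking such a structure character by character is also why prefix queries and autocomplete are cheap.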


3. Segment — Lucene's Immutability Principle

Immutable Segments

In Lucene, once a file is written it never changes. That's what makes Lucene fast and safe.

  • Append-only — new documents create new segments
  • Delete leaves only a tombstone
  • Update is delete + insert
  • Small segments merge into larger ones

Segment Files

_0.cfs     — composite file
_0.cfe     — entry point
_0.si      — segment info
_0.fdt/fdx — field data
_0.tim/tip — term dictionary
_0.doc/pos — postings
_0.dvm/dvd — doc values
_0.liv     — live docs (delete bitmap)

Merge Policy

  • Too many small segments -> slow search
  • Building huge segments -> expensive merges (I/O storms)
  • Default: TieredMergePolicy — tiered by size

Refresh vs Flush vs Commit

| Term | Meaning | Timing |
|---|---|---|
| Refresh | Turn the in-memory buffer into a searchable segment | every 1s by default |
| Flush | fsync segments to disk | automatic (memory threshold) |
| Commit | Full durability including the translog | less frequent |

The secret of "Near Real-Time Search": Refresh every 1s, Flush later. The "1-second lag" is Elasticsearch's trademark.

Translog

  • Every write goes to translog first
  • Replayed on node restart
  • index.translog.durability: request (fsync per request) vs async (periodic, default 5s)
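These behaviors are tunable per index. A hypothetical settings request — the setting names are real, but the values are illustrative for a write-heavy logging index, not universal recommendations:

```json
PUT my-logs/_settings
{
  "index": {
    "refresh_interval": "30s",
    "translog.durability": "async",
    "translog.sync_interval": "5s"
  }
}
```

Relaxing refresh_interval trades search freshness for indexing throughput; async durability trades a few seconds of crash-loss risk for lower write latency.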

4. BM25 — Replacing TF-IDF

Limits of TF-IDF

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

  • TF grows linearly with no upper bound — a term repeated 100 times counts 100× as much
  • Longer documents inflate TF, biasing scores toward them

BM25 Formula

$$\text{score}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

Key changes:

  • Saturation — diminishing returns for repeated terms ($k_1 \approx 1.2$)
  • Length normalization — penalize documents longer than average, reward shorter ones ($b \approx 0.75$)
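Both fixes are visible in a direct translation of the formula. A sketch using Lucene's BM25 IDF variant — the real implementation adds per-segment statistics and norm quantization:

```python
import math

def bm25_score(tf, doc_len, avgdl, df, N, k1=1.2, b=0.75):
    """BM25 contribution of a single term (Lucene-style smoothed IDF)."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * tf * (k1 + 1) / (tf + norm)

# Saturation: the 10th occurrence adds far less than the 1st.
s1 = bm25_score(tf=1, doc_len=100, avgdl=100, df=10, N=1000)
s10 = bm25_score(tf=10, doc_len=100, avgdl=100, df=10, N=1000)
print(s10 / s1)  # well below 10 -- diminishing returns

# Length normalization: same tf in a doc twice the average length scores lower.
print(bm25_score(tf=1, doc_len=200, avgdl=100, df=10, N=1000) < s1)
```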

Why BM25 Is the Default

  • 20+ years of empirical wins
  • Just two tunable params
  • Lucene default since 2016

Lower b (around 0.3) is common for product names or query-log fields.


5. Analyzer — The Art of Tokenization

Three-Stage Pipeline

  1. Character Filter — strip HTML, char replacement
  2. Tokenizer — split into words
  3. Token Filter — lowercase, stemming, synonyms, stopwords

The Korean Hell

English: whitespace tokenization is easy.

Korean is agglutinative: conjugated endings and particles attach to stems, so "검색했다/검색한다/검색은/검색을" (inflected and particled forms of "검색", "search") must all match the stem "검색". Use a morphological analyzer: nori (Korean), kuromoji (Japanese), ik (Chinese).

Example: nori analysis

POST _analyze
{
  "analyzer": "nori",
  "text": "Elasticsearch는 검색엔진입니다"
}

-> "elasticsearch", "는", "검색", "엔진", "입니다"

Remove particles/endings with "filter": ["nori_part_of_speech"].

Synonym Expansion

"shoe, sneaker, runner"
-> query "shoe" also matches sneaker/runner docs

Half of search quality lives in the synonym dictionary. Tedious but highest ROI.
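In Elasticsearch/OpenSearch, the synonym dictionary plugs in as a token filter inside a custom analyzer. A hypothetical index definition (index, filter, and analyzer names are invented):

```json
PUT products
{
  "settings": {
    "analysis": {
      "filter": {
        "shoe_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["shoe, sneaker, runner"]
        }
      },
      "analyzer": {
        "product_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "shoe_synonyms"]
        }
      }
    }
  }
}
```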


6. Shard & Replica — The Basics of Distribution

Primary Shard

  • Indexes split across multiple primary shards
  • shard = hash(_routing) % number_of_primary_shards
  • _routing defaults to _id
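The routing formula can be sketched in a few lines — crc32 stands in here for the murmur3 hash Elasticsearch actually uses:

```python
from zlib import crc32

def shard_for(routing: str, num_primaries: int) -> int:
    """shard = hash(_routing) % number_of_primary_shards
    (crc32 as a stand-in for Elasticsearch's murmur3)."""
    return crc32(routing.encode()) % num_primaries

doc_id = "order-12345"
print(shard_for(doc_id, 5))
# Changing the modulus remaps most documents to different shards --
# which is why the primary count is fixed at index creation time.
print(shard_for(doc_id, 6))
```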

Replica Shard

  • Copy of the primary
  • Scales read throughput and survives failures
  • number_of_replicas = 1 means one replica per primary

Limits and Rules

  • Primary count is fixed after index creation (routing would break)
  • Fix: reindex or use alias + new index
  • Rule of thumb: each shard 10-50 GB
  • Too many shards -> cluster state explosion

Cluster State and Split Brain

  • Master node manages cluster metadata
  • Multiple masters at once = split brain
  • 7.0 (2019) rewrote the consensus algorithm (Raft-like)

7. Query DSL — The JSON Maze

Main Query Types

| Query | Use |
|---|---|
| match | Analyzed full-text matching |
| term | Exact match without analysis (keyword fields) |
| match_phrase | Order-preserving phrase match |
| multi_match | Search across multiple fields |
| bool | AND/OR/NOT composition |
| range | Range conditions |
| function_score | Custom scoring |
| rank_feature | Boosting by a numeric feature field |

Four Clauses of bool

{
  "bool": {
    "must": [],
    "should": [],
    "filter": [],
    "must_not": []
  }
}

Use filter aggressively — it's cached and fast. Score with match; make fixed conditions filter.
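For example, a hypothetical product search where only the text match contributes to the score, and the fixed conditions are cacheable filters (index and field names are invented):

```json
GET products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "wireless headphones"}}
      ],
      "filter": [
        {"term": {"status": "active"}},
        {"range": {"price": {"lte": 200}}}
      ]
    }
  }
}
```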

Term vs Match — The Most Common Mistake

{"term": {"name": "User Name"}}
{"match": {"name": "User Name"}}

On a text field, the term query matches nothing: the field was analyzed at index time (lowercased, tokenized), so the exact string "User Name" was never indexed as a single term. Use match for text fields, term for keyword fields.

Aggregations

{
  "aggs": {
    "by_category": {
      "terms": {"field": "category"},
      "aggs": {
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}

The SQL GROUP BY equivalent — essential for log analysis and dashboards.


8. Vector Search — The Revolution After 2022

Why Vectors

  • BM25 is word-matching — "puppy" and "dog" are unrelated
  • Embeddings are meaning-based — auto-links similar meanings
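The "meaning-based" part is mechanical: an embedding model places related words near each other, and closeness is measured with cosine similarity. A sketch with toy 3-dimensional vectors — real embeddings have hundreds of dimensions, and these numbers are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mag

# Toy vectors: "dog" and "puppy" point in similar directions; "car" does not.
dog   = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
car   = [0.1, 0.2, 0.9]

print(cosine(dog, puppy) > cosine(dog, car))  # True
```

BM25 can never make this connection because "dog" and "puppy" share no tokens.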

ES/OpenSearch kNN

Native kNN since Elasticsearch 8.0 (2022), built on Lucene 9.0 HNSW.

{
  "mappings": {
    "properties": {
      "title_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
{
  "knn": {
    "field": "title_vector",
    "query_vector": [0.1, 0.2],
    "k": 10,
    "num_candidates": 100
  }
}

HNSW Params

  • m: neighbors per node (usually 16)
  • ef_construction: indexing width (100-200)
  • ef_search: query width (recall vs speed)

Quantization

  • int8 — 4x savings, about 1% accuracy loss
  • BBQ (Better Binary Quantization) — Lucene 10, late 2024, 32x savings
  • Effectively standard for RAG in 2025
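Lucene's scalar quantization is more sophisticated (per-segment ranges, confidence intervals), but the core int8 idea is min-max scaling. A minimal sketch:

```python
def quantize_int8(vec):
    """Min-max scalar quantization: float32 -> int8, 4x smaller."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # guard against constant vectors
    q = [round((x - lo) / scale) - 128 for x in vec]  # map into [-128, 127]
    return q, lo, scale

def dequantize(q, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [(v + 128) * scale + lo for v in q]

vec = [0.12, -0.53, 0.98, 0.04]
q, lo, scale = quantize_int8(vec)
restored = dequantize(q, lo, scale)
# One byte per dimension instead of four; error bounded by the step size.
print(max(abs(a - b) for a, b in zip(vec, restored)))
```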

9. Hybrid Search — The Answer for the RAG Era

BM25 vs Vector

| Aspect | BM25 | Vector |
|---|---|---|
| Exact match (SKUs) | Strong | Weak |
| Semantic similarity | Weak | Strong |
| Rare terms | Strong | Weak |
| Typos | Weak | Medium |
| Multilingual | Weak | Strong |

You need both.

RRF — Reciprocal Rank Fusion

$$\text{RRF}(d) = \sum_i \frac{1}{k + \text{rank}_i(d)}$$

  • Combine via rank only
  • Scale-agnostic (BM25 vs cosine doesn't matter)
  • k = 60 works best empirically
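The formula is a few lines of code. A minimal sketch:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over each result list.
    `rankings` is a list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]
knn_hits  = ["d3", "d1", "d4"]
print(rrf_fuse([bm25_hits, knn_hits]))  # ['d1', 'd3', 'd2', 'd4']
```

d1 wins because it ranks well in both lists — no score normalization needed, exactly the point of RRF.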

ES rrf retriever

{
  "retriever": {
    "rrf": {
      "retrievers": [
        {"standard": {"query": {"match": {"content": "query"}}}},
        {"knn": {"field": "vec", "query_vector": [], "k": 50}}
      ],
      "rank_window_size": 50,
      "rank_constant": 60
    }
  }
}

Cross-Encoder Reranker

  • Get top 100 candidates via BM25/kNN
  • Rerank with Cross-Encoder (BERT-based)
  • Native support in ES/OpenSearch since 2024 (Cohere, E5, BGE)

10. The 2021 License War — OpenSearch Is Born

Background

  • AWS sold Elasticsearch as a managed service
  • Elastic accused AWS of free-riding without upstream contributions
  • January 2021: Elasticsearch 7.11 moved to dual SSPL/Elastic License
  • AWS forked immediately: OpenSearch

Aftermath

  • Elastic: defended revenue against AWS
  • AWS: donated OpenSearch to the Linux Foundation (Sep 2024)
  • Community: fragmented

2024-2025 Status

| Product | License | Lead |
|---|---|---|
| Elasticsearch 8.x | Elastic License 2.0 / SSPL | Elastic |
| Elasticsearch 8.16+ | AGPLv3 option added (announced Aug 2024) | Elastic (return toward open source) |
| OpenSearch 2.x | Apache 2.0 | AWS -> Linux Foundation |

Choosing

  • AWS-centric managed workloads -> OpenSearch
  • Latest ML/vector/ESQL features -> Elasticsearch
  • Pure open source preference -> OpenSearch
  • Elastic Agent/Fleet ecosystem -> Elasticsearch

11. Operational Hell — Common Failure Patterns

JVM Heap

  • Don't exceed 32 GB (Compressed OOPs limit)
  • Warning at 50% heap, danger at 75%
  • Old Gen GC stops searches (stop-the-world)

Circuit Breaker

circuit_breaking_exception: Data too large

  • Trips at roughly 60-70% of heap by default
  • Do not auto-retry rejected queries — it makes the overload worse

Hot Shard

  • Query concentrates on one shard
  • Cause: biased routing key
  • Fix: adjust _routing, increase shard count

Shard Explosion

  • Over 1000 shards per node overloads cluster state
  • Use ILM: rollover, shrink, delete

Snapshot & Restore

  • S3/GCS/Azure Blob repositories
  • Incremental backup
  • Restore can take hours on large clusters

12. Ingest Pipelines

Logstash

  • Input -> Filter -> Output
  • Grok, Mutate, GeoIP, User Agent filters
  • JVM-based, heavy

Beats

  • Filebeat, Metricbeat, Packetbeat
  • Go-based, lightweight
  • Direct to ES or via Logstash

Elastic Agent + Fleet

  • One agent for all data
  • Central policy UI (Fleet)

OpenTelemetry

  • First-class OTel -> ES since 2024
  • OTel Collector can replace Logstash

Ingest Node Pipeline

{
  "processors": [
    {"set": {"field": "indexed_at", "value": "{{_ingest.timestamp}}"}},
    {"grok": {"field": "message", "patterns": ["%{COMBINEDAPACHELOG}"]}}
  ]
}

13. ES|QL — The Return of SQL (2024)

Elastic's long-awaited SQL-like query language.

FROM logs-*
| WHERE status >= 500
| STATS count = COUNT(*) BY host
| SORT count DESC
| LIMIT 10

  • Pipe-based (inspired by Splunk SPL and Kusto KQL)
  • Strips out JSON Query DSL complexity
  • Massively more convenient for analytics

OpenSearch offers similar functionality with PPL (Piped Processing Language).


14. Search Quality Evaluation

  • nDCG — relevance of top results, log-weighted, 0-1
  • Precision / Recall — tradeoff
  • Learning to Rank — train from click logs (LambdaMART)
  • A/B Testing — offline metrics are not online success
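nDCG in particular is worth internalizing. A minimal sketch of the standard formula:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: log2 discount by position (rank 1 -> log2(2))."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (sorted) ordering -> range 0..1."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Graded relevance of the top-4 results as returned (3 = perfect, 0 = bad)
print(ndcg([3, 2, 3, 0]))  # below 1.0: a highly relevant doc sits at rank 3
print(ndcg([3, 3, 2, 0]))  # 1.0: already the ideal ordering
```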

15. Top 10 Anti-Patterns

  1. Default 1-shard/1-replica for tens of TB indexes
  2. Manual _id causing routing skew
  3. Too many indexes/shards
  4. JVM heap 64 GB (exceeds Compressed OOPs)
  5. Deep pagination (from=10000) — use scroll/search_after
  6. term on analyzed text fields
  7. Cluster health check per query
  8. Frequent bulk deletes without _forcemerge
  9. No snapshots
  10. Vector-only, abandoning BM25 — hybrid almost always wins

16. Checklist for Running ES/OpenSearch Wisely

  • Separate use cases (logs/search/vector/analytics)
  • Shard size 10-50 GB
  • ILM policy (rollover, hot/warm/cold)
  • Regular snapshots
  • Heap under 32 GB, off-heap to file cache
  • Circuit breaker alerts
  • Analyzer choice (nori for Korean)
  • Synonym dictionary curated
  • Consider Hybrid Search (BM25 + vector + RRF)
  • Monitor CTR, nDCG, 0-result rate
  • Use bool filter for cache
  • Try ES|QL/PPL for analytics

Closing — Giants Standing on Lucene

Elasticsearch, OpenSearch, Solr — different names, all on the shoulders of the giant named Lucene. And Lucene itself began as Doug Cutting's single Java library to democratize search in 1999.

In 2025, search finds logs, products, content recommendations, LLM context, autocomplete, and typo correction. It is everywhere, but not everyone understands it. When Inverted Index, Segment, BM25, vector search, Shard, and Routing click into place, you move from being a search user to a search designer.


"A good search engine doesn't just find what you typed. It finds what you meant." — Peter Norvig