Elasticsearch, OpenSearch, and Lucene Internals — Inverted Index, BM25, Sharding, Vector Search, Hybrid RAG (2025)

"Search is not a feature. It's a philosophy of how humans interact with information." — Doug Cutting (creator of Lucene, 1999)

When Doug Cutting built Lucene in Java in 1999, he believed "anyone should be able to add Google-class search to their own data." 26 years later, Elasticsearch, OpenSearch, and Solr all stand on Lucene. Log analytics, product search, autocomplete — and since 2024, the core infrastructure of RAG (Retrieval-Augmented Generation).

But "using Elasticsearch" and "understanding Lucene" are worlds apart. This article is a map from the fundamentals of search to the hybrid search of 2025.


1. Why LIKE '%keyword%' in an RDB Doesn't Work

The Linear Scan Wall

SELECT * FROM articles WHERE content LIKE '%postgresql%';
  • A leading % wildcard defeats the B-tree index
  • Full scan of text field on every row
  • 100M documents means tens of minutes

Postgres GIN helps partially, but:

  • Limited tokenization/language analysis
  • Hard to compute relevance scores
  • Weak ecosystem for autocomplete, typo correction, synonyms
  • Weak distributed search support

This is why dedicated search engines exist.


2. Inverted Index — The Basic Idea

Document 1: "The quick brown fox"
Document 2: "The lazy brown dog"
Document 3: "Foxes and dogs"

Flip this into a mapping from word to document list:

brown  -> [1, 2]
dog    -> [2]
dogs   -> [3]
fox    -> [1]
foxes  -> [3]
lazy   -> [2]
quick  -> [1]
the    -> [1, 2]

Query "brown dog":

  • brown -> [1,2]
  • dog -> [2]
  • AND (intersection) -> [2]

This is the essence of the Inverted Index. Lookups stay close to O(log N) even across billions of documents.
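The whole flow fits in a few lines. A minimal sketch (toy whitespace tokenization, nothing like Lucene's real analysis pipeline):

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "Foxes and dogs",
}

# Build: token -> set of doc ids (lowercased whitespace tokenization)
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search_and(query: str) -> set:
    """AND search: intersect the postings list of every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search_and("brown dog"))  # {2}
```

Note that "dog" and "dogs" are still distinct terms here — collapsing them is the analyzer's job, covered below.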

Lucene's On-Disk Units

  1. Term Dictionary — sorted dictionary of all terms (FST, Finite State Transducer)
  2. Postings List — document list per term + frequency/position
  3. Stored Fields — original documents (compressed)
  4. Doc Values — columnar storage for aggregations/sorting
  5. Norms — field length normalization values

FST — Prefix Sharing for Memory Savings

Common prefixes of terms like "fox, foxes, foxy" are compressed with a finite state transducer. Tens of millions of terms fit in a few MB. Also the core structure behind autocomplete.
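An FST is more compact than a plain trie (it shares suffixes as well as prefixes), but the prefix-sharing half of the idea can be sketched with a nested-dict trie:

```python
def build_trie(terms):
    """Insert terms into a nested-dict trie; shared prefixes share nodes."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-term marker
    return root

def count_nodes(node):
    """Count character nodes, skipping end-of-term markers."""
    return sum(1 + count_nodes(child) for k, child in node.items() if k != "$")

# "fox", "foxes", "foxy" hold 12 characters naively,
# but the trie stores the shared prefix "fox" only once.
trie = build_trie(["fox", "foxes", "foxy"])
print(count_nodes(trie))  # 6 nodes: f, o, x, e, s, y
```

Walking such a structure character by character is also why prefix queries and autocomplete are cheap.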


3. Segment — Lucene's Immutability Principle

Immutable Segments

In Lucene, once a file is written it never changes. That's what makes Lucene fast and safe.

  • Append-only — new documents create new segments
  • Delete leaves only a tombstone
  • Update is delete + insert
  • Small segments merge into larger ones

Segment Files

_0.cfs     — composite file
_0.cfe     — entry point
_0.si      — segment info
_0.fdt/fdx — field data
_0.tim/tip — term dictionary
_0.doc/pos — postings
_0.dvm/dvd — doc values
_0.liv     — live docs (delete bitmap)

Merge Policy

  • Too many small segments -> slow search
  • Building huge segments -> expensive merges (I/O storms)
  • Default: TieredMergePolicy — tiered by size

Refresh vs Flush vs Commit

| Term | Meaning | Timing |
|---|---|---|
| Refresh | Turn the in-memory buffer into a searchable segment | every 1s by default |
| Flush | fsync segments to disk | automatic (memory threshold) |
| Commit | Full durability including the translog | less frequent |

The secret of "Near Real-Time Search": Refresh every 1s, Flush later. The "1-second lag" is Elasticsearch's trademark.

Translog

  • Every write goes to translog first
  • Replayed on node restart
  • index.translog.durability: request (fsync per request) vs async (periodic, default 5s)
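These behaviors are tunable per index. A hypothetical settings request — the setting names are real, but the values are illustrative for a write-heavy logging index, not universal recommendations:

```json
PUT my-logs/_settings
{
  "index": {
    "refresh_interval": "30s",
    "translog.durability": "async",
    "translog.sync_interval": "5s"
  }
}
```

Relaxing refresh_interval trades search freshness for indexing throughput; async durability trades a few seconds of crash-loss risk for lower write latency.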

4. BM25 — Replacing TF-IDF

Limits of TF-IDF

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

  • TF grows linearly with no upper bound — a term repeated 100 times counts 100× as much
  • Longer documents inflate TF, biasing scores toward them

BM25 Formula

$$\text{score}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

Key changes:

  • Saturation — diminishing returns for repeated terms ($k_1 \approx 1.2$)
  • Length normalization — penalize documents longer than average, reward shorter ones ($b \approx 0.75$)
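Both fixes are visible in a direct translation of the formula. A sketch using Lucene's BM25 IDF variant — the real implementation adds per-segment statistics and norm quantization:

```python
import math

def bm25_score(tf, doc_len, avgdl, df, N, k1=1.2, b=0.75):
    """BM25 contribution of a single term (Lucene-style smoothed IDF)."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * tf * (k1 + 1) / (tf + norm)

# Saturation: the 10th occurrence adds far less than the 1st.
s1 = bm25_score(tf=1, doc_len=100, avgdl=100, df=10, N=1000)
s10 = bm25_score(tf=10, doc_len=100, avgdl=100, df=10, N=1000)
print(s10 / s1)  # well below 10 -- diminishing returns

# Length normalization: same tf in a doc twice the average length scores lower.
print(bm25_score(tf=1, doc_len=200, avgdl=100, df=10, N=1000) < s1)
```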

Why BM25 Is the Default

  • 20+ years of empirical wins
  • Just two tunable params
  • Lucene default since 2016

Lower b (around 0.3) is common for product names or query-log fields.


5. Analyzer — The Art of Tokenization

Three-Stage Pipeline

  1. Character Filter — strip HTML, char replacement
  2. Tokenizer — split into words
  3. Token Filter — lowercase, stemming, synonyms, stopwords

The Korean Hell

English: whitespace tokenization is easy.

Korean is agglutinative: conjugated endings and particles attach to stems, so "검색했다/검색한다/검색은/검색을" (inflected and particled forms of "검색", "search") must all match the stem "검색". Use a morphological analyzer: nori (Korean), kuromoji (Japanese), ik (Chinese).

Example: nori analysis

POST _analyze
{
  "analyzer": "nori",
  "text": "Elasticsearch는 검색엔진입니다"
}

-> "elasticsearch", "는", "검색", "엔진", "입니다"

Remove particles/endings with "filter": ["nori_part_of_speech"].

Synonym Expansion

"shoe, sneaker, runner"
-> query "shoe" also matches sneaker/runner docs

Half of search quality lives in the synonym dictionary. Tedious but highest ROI.
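In Elasticsearch/OpenSearch, the synonym dictionary plugs in as a token filter inside a custom analyzer. A hypothetical index definition (index, filter, and analyzer names are invented):

```json
PUT products
{
  "settings": {
    "analysis": {
      "filter": {
        "shoe_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["shoe, sneaker, runner"]
        }
      },
      "analyzer": {
        "product_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "shoe_synonyms"]
        }
      }
    }
  }
}
```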


6. Shard & Replica — The Basics of Distribution

Primary Shard

  • Indexes split across multiple primary shards
  • shard = hash(_routing) % number_of_primary_shards
  • _routing defaults to _id
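The routing formula can be sketched in a few lines — crc32 stands in here for the murmur3 hash Elasticsearch actually uses:

```python
from zlib import crc32

def shard_for(routing: str, num_primaries: int) -> int:
    """shard = hash(_routing) % number_of_primary_shards
    (crc32 as a stand-in for Elasticsearch's murmur3)."""
    return crc32(routing.encode()) % num_primaries

doc_id = "order-12345"
print(shard_for(doc_id, 5))
# Changing the modulus remaps most documents to different shards --
# which is why the primary count is fixed at index creation time.
print(shard_for(doc_id, 6))
```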

Replica Shard

  • Copy of the primary
  • Scales read throughput and survives failures
  • number_of_replicas = 1 means one replica per primary

Limits and Rules

  • Primary count is fixed after index creation (routing would break)
  • Fix: reindex or use alias + new index
  • Rule of thumb: each shard 10-50 GB
  • Too many shards -> cluster state explosion

Cluster State and Split Brain

  • Master node manages cluster metadata
  • Multiple masters at once = split brain
  • 7.0 (2019) rewrote the consensus algorithm (Raft-like)

7. Query DSL — The JSON Maze

Main Query Types

| Query | Use |
|---|---|
| match | Analyzed full-text matching |
| term | Exact match without analysis (keyword fields) |
| match_phrase | Order-preserving phrase match |
| multi_match | Search across multiple fields |
| bool | AND/OR/NOT composition |
| range | Range conditions |
| function_score | Custom scoring |
| rank_feature | Boosting by a numeric feature field |

Four Clauses of bool

{
  "bool": {
    "must": [],
    "should": [],
    "filter": [],
    "must_not": []
  }
}

Use filter aggressively — it's cached and fast. Score with match; make fixed conditions filter.
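For example, a hypothetical product search where only the text match contributes to the score, and the fixed conditions are cacheable filters (index and field names are invented):

```json
GET products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "wireless headphones"}}
      ],
      "filter": [
        {"term": {"status": "active"}},
        {"range": {"price": {"lte": 200}}}
      ]
    }
  }
}
```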

Term vs Match — The Most Common Mistake

{"term": {"name": "User Name"}}
{"match": {"name": "User Name"}}

On a text field, the term query matches nothing: the field was analyzed at index time (lowercased, tokenized), so the exact string "User Name" was never indexed as a single term. Use match for text fields, term for keyword fields.

Aggregations

{
  "aggs": {
    "by_category": {
      "terms": {"field": "category"},
      "aggs": {
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}

The SQL GROUP BY equivalent — essential for log analysis and dashboards.


8. Vector Search — The Revolution After 2022

Why Vectors

  • BM25 is word-matching — "puppy" and "dog" are unrelated
  • Embeddings are meaning-based — auto-links similar meanings
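The "meaning-based" part is mechanical: an embedding model places related words near each other, and closeness is measured with cosine similarity. A sketch with toy 3-dimensional vectors — real embeddings have hundreds of dimensions, and these numbers are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mag

# Toy vectors: "dog" and "puppy" point in similar directions; "car" does not.
dog   = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
car   = [0.1, 0.2, 0.9]

print(cosine(dog, puppy) > cosine(dog, car))  # True
```

BM25 can never make this connection because "dog" and "puppy" share no tokens.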

ES/OpenSearch kNN

Native kNN since Elasticsearch 8.0 (2022), built on Lucene 9.0 HNSW.

{
  "mappings": {
    "properties": {
      "title_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
{
  "knn": {
    "field": "title_vector",
    "query_vector": [0.1, 0.2],
    "k": 10,
    "num_candidates": 100
  }
}

HNSW Params

  • m: neighbors per node (usually 16)
  • ef_construction: indexing width (100-200)
  • ef_search: query width (recall vs speed)

Quantization

  • int8 — 4x savings, about 1% accuracy loss
  • BBQ (Better Binary Quantization) — Lucene 10, late 2024, 32x savings
  • Effectively standard for RAG in 2025
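Lucene's scalar quantization is more sophisticated (per-segment ranges, confidence intervals), but the core int8 idea is min-max scaling. A minimal sketch:

```python
def quantize_int8(vec):
    """Min-max scalar quantization: float32 -> int8, 4x smaller."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # guard against constant vectors
    q = [round((x - lo) / scale) - 128 for x in vec]  # map into [-128, 127]
    return q, lo, scale

def dequantize(q, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [(v + 128) * scale + lo for v in q]

vec = [0.12, -0.53, 0.98, 0.04]
q, lo, scale = quantize_int8(vec)
restored = dequantize(q, lo, scale)
# One byte per dimension instead of four; error bounded by the step size.
print(max(abs(a - b) for a, b in zip(vec, restored)))
```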

9. Hybrid Search — The Answer for the RAG Era

BM25 vs Vector

| Aspect | BM25 | Vector |
|---|---|---|
| Exact match (SKUs) | Strong | Weak |
| Semantic similarity | Weak | Strong |
| Rare terms | Strong | Weak |
| Typos | Weak | Medium |
| Multilingual | Weak | Strong |

You need both.

RRF — Reciprocal Rank Fusion

$$\text{RRF}(d) = \sum_i \frac{1}{k + \text{rank}_i(d)}$$

  • Combine via rank only
  • Scale-agnostic (BM25 vs cosine doesn't matter)
  • k = 60 works best empirically
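The formula is a few lines of code. A minimal sketch:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over each result list.
    `rankings` is a list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]
knn_hits  = ["d3", "d1", "d4"]
print(rrf_fuse([bm25_hits, knn_hits]))  # ['d1', 'd3', 'd2', 'd4']
```

d1 wins because it ranks well in both lists — no score normalization needed, exactly the point of RRF.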

ES rrf retriever

{
  "retriever": {
    "rrf": {
      "retrievers": [
        {"standard": {"query": {"match": {"content": "query"}}}},
        {"knn": {"field": "vec", "query_vector": [], "k": 50}}
      ],
      "rank_window_size": 50,
      "rank_constant": 60
    }
  }
}

Cross-Encoder Reranker

  • Get top 100 candidates via BM25/kNN
  • Rerank with Cross-Encoder (BERT-based)
  • Native support in ES/OpenSearch since 2024 (Cohere, E5, BGE)

10. The 2021 License War — OpenSearch Is Born

Background

  • AWS sold Elasticsearch as a managed service
  • Elastic accused AWS of free-riding without upstream contributions
  • January 2021: Elasticsearch 7.11 moved to dual SSPL/Elastic License
  • AWS forked immediately: OpenSearch

Aftermath

  • Elastic: defended revenue against AWS
  • AWS: donated OpenSearch to the Linux Foundation (Sep 2024)
  • Community: fragmented

2024-2025 Status

| Product | License | Lead |
|---|---|---|
| Elasticsearch 8.x | Elastic License 2.0 / SSPL | Elastic |
| Elasticsearch 8.16+ | AGPLv3 option added (announced Aug 2024) | Elastic (return toward open source) |
| OpenSearch 2.x | Apache 2.0 | AWS -> Linux Foundation |

Choosing

  • AWS-centric managed workloads -> OpenSearch
  • Latest ML/vector/ESQL features -> Elasticsearch
  • Pure open source preference -> OpenSearch
  • Elastic Agent/Fleet ecosystem -> Elasticsearch

11. Operational Hell — Common Failure Patterns

JVM Heap

  • Don't exceed 32 GB (Compressed OOPs limit)
  • Warning at 50% heap, danger at 75%
  • Old Gen GC stops searches (stop-the-world)

Circuit Breaker

circuit_breaking_exception: Data too large

  • Trips at roughly 60-70% of heap by default
  • Do not auto-retry rejected queries — it makes the overload worse

Hot Shard

  • Query concentrates on one shard
  • Cause: biased routing key
  • Fix: adjust _routing, increase shard count

Shard Explosion

  • Over 1000 shards per node overloads cluster state
  • Use ILM: rollover, shrink, delete

Snapshot & Restore

  • S3/GCS/Azure Blob repositories
  • Incremental backup
  • Restore can take hours on large clusters

12. Ingest Pipelines

Logstash

  • Input -> Filter -> Output
  • Grok, Mutate, GeoIP, User Agent filters
  • JVM-based, heavy

Beats

  • Filebeat, Metricbeat, Packetbeat
  • Go-based, lightweight
  • Direct to ES or via Logstash

Elastic Agent + Fleet

  • One agent for all data
  • Central policy UI (Fleet)

OpenTelemetry

  • First-class OTel -> ES since 2024
  • OTel Collector can replace Logstash

Ingest Node Pipeline

{
  "processors": [
    {"set": {"field": "indexed_at", "value": "{{_ingest.timestamp}}"}},
    {"grok": {"field": "message", "patterns": ["%{COMBINEDAPACHELOG}"]}}
  ]
}

13. ES|QL — The Return of SQL (2024)

Elastic's long-awaited SQL-like query language.

FROM logs-*
| WHERE status >= 500
| STATS count = COUNT(*) BY host
| SORT count DESC
| LIMIT 10

  • Pipe-based (inspired by Splunk SPL and Kusto KQL)
  • Strips out JSON Query DSL complexity
  • Massively more convenient for analytics

OpenSearch offers similar functionality with PPL (Piped Processing Language).


14. Search Quality Evaluation

  • nDCG — relevance of top results, log-weighted, 0-1
  • Precision / Recall — tradeoff
  • Learning to Rank — train from click logs (LambdaMART)
  • A/B Testing — offline metrics are not online success
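nDCG in particular is worth internalizing. A minimal sketch of the standard formula:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: log2 discount by position (rank 1 -> log2(2))."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (sorted) ordering -> range 0..1."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Graded relevance of the top-4 results as returned (3 = perfect, 0 = bad)
print(ndcg([3, 2, 3, 0]))  # below 1.0: a highly relevant doc sits at rank 3
print(ndcg([3, 3, 2, 0]))  # 1.0: already the ideal ordering
```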

15. Top 10 Anti-Patterns

  1. Default 1-shard/1-replica for tens of TB indexes
  2. Manual _id causing routing skew
  3. Too many indexes/shards
  4. JVM heap 64 GB (exceeds Compressed OOPs)
  5. Deep pagination (from=10000) — use scroll/search_after
  6. term on analyzed text fields
  7. Cluster health check per query
  8. Frequent bulk deletes without _forcemerge
  9. No snapshots
  10. Vector-only, abandoning BM25 — hybrid almost always wins

16. Checklist for Running ES/OpenSearch Wisely

  • Separate use cases (logs/search/vector/analytics)
  • Shard size 10-50 GB
  • ILM policy (rollover, hot/warm/cold)
  • Regular snapshots
  • Heap under 32 GB, off-heap to file cache
  • Circuit breaker alerts
  • Analyzer choice (nori for Korean)
  • Synonym dictionary curated
  • Consider Hybrid Search (BM25 + vector + RRF)
  • Monitor CTR, nDCG, 0-result rate
  • Use bool filter for cache
  • Try ES|QL/PPL for analytics

Closing — Giants Standing on Lucene

Elasticsearch, OpenSearch, Solr — different names, all on the shoulders of the giant named Lucene. And Lucene itself began as Doug Cutting's single Java library to democratize search in 1999.

In 2025, search finds logs, products, content recommendations, LLM context, autocomplete, and typo correction. It is everywhere, but not everyone understands it. When Inverted Index, Segment, BM25, vector search, Shard, and Routing click into place, you move from being a search user to a search designer.


"A good search engine doesn't just find what you typed. It finds what you meant." — Peter Norvig