Elasticsearch & Lucene Internals Complete Guide 2025: Deep Dive into Segments, Inverted Index, Refresh/Flush/Merge, and Shard Routing
Introduction: Finding Answers Inside Logs in Under a Second
Picture This
Your company generates tens of terabytes of logs every day. One night at 11 PM, a user puts a specific product in their cart and hits an error before placing the order. The customer support rep asks you:
"Show me all request logs for this user around 8:23 PM today. Only ones with response time over 2 seconds."
GET /logs/_search
{
"query": {
"bool": {
"must": [
{ "term": { "user_id": "u_12345" } },
{ "range": { "response_time_ms": { "gt": 2000 } } },
{ "range": { "@timestamp": { "gte": "2025-04-15T20:20:00Z", "lte": "2025-04-15T20:30:00Z" } } }
]
}
}
}
Elasticsearch returns the answer from billions of log entries within 200ms. How?
The answer lies inside Lucene, a 20-year-old Java search engine library. Elasticsearch is essentially a distributed wrapper around Lucene. Understanding Lucene means understanding Elasticsearch.
What This Article Covers
- Lucene fundamentals: Inverted index, term dictionary, posting list.
- Segment structure: A collection of immutable files.
- Refresh / Flush / Merge: The secret of NRT.
- BM25 and scoring.
- Analyzers and text processing.
- Elasticsearch distribution: Shard, replica, routing.
- Aggregation execution.
- Production tuning.
Why Learn This Now?
- Elasticsearch is still the most widely used search engine.
- The foundation of OpenSearch, Kibana, and Logstash.
- Grafana Loki and SigNoz share similar design principles.
- Without understanding Lucene internals, you can't answer "why is it slow", "why does it use so much memory", or "why did the index grow so large".
1. Inverted Index: Where It All Begins
The Problem
Given these documents:
Doc 1: "Elasticsearch is a distributed search engine"
Doc 2: "Lucene is the library behind Elasticsearch"
Doc 3: "A search engine finds relevant documents"
Question: Which documents contain "search"?
Naive approach: scan every document looking for "search". That is O(total word count) per query, which is impractical with millions of documents.
Structure of an Inverted Index
An inverted index pre-builds a word-to-document mapping:
Term Dictionary:
"a" → [1, 2, 3]
"behind" → [2]
"distributed" → [1]
"documents" → [3]
"elasticsearch"→ [1, 2]
"engine" → [1, 3]
"finds" → [3]
"is" → [1, 2]
"library" → [2]
"lucene" → [2]
"relevant" → [3]
"search" → [1, 3]
"the" → [2]
Now to find "search", you immediately get [1, 3] from the term dictionary. In O(log unique-term-count).
Posting List
The document list for each term is called a posting list. In practice it stores more than just document IDs:
"search":
[
(docId=1, freq=1, positions=[3]),
(docId=3, freq=1, positions=[1])
]
- docId: document number.
- freq: how many times the term appears in that document.
- positions: position within the document (for phrase search).
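As a mental model only (not Lucene's actual on-disk format), a tiny inverted index with posting lists can be sketched in a few lines of Python. Term lookup becomes a dictionary access instead of a document scan; positions here are simply 0-based token offsets:
from collections import defaultdict

docs = {
    1: "Elasticsearch is a distributed search engine",
    2: "Lucene is the library behind Elasticsearch",
    3: "A search engine finds relevant documents",
}

# term -> {doc_id: {"freq": ..., "positions": [...]}}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):
        entry = index[token].setdefault(doc_id, {"freq": 0, "positions": []})
        entry["freq"] += 1
        entry["positions"].append(pos)

print(sorted(index["search"]))  # [1, 3] -- no document scan needed
print(index["search"][3])       # {'freq': 1, 'positions': [1]}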
Term Dictionary: Implemented with FST
How do you efficiently store millions of terms? Lucene uses an FST (Finite State Transducer).
An FST is an extremely compressed representation of string-to-value mappings:
elastic → 1
elected → 2
election → 3
electric → 4
These share prefixes: all four share "el", and the last three share "elect". By sharing prefixes (and suffixes), an FST keeps lookups proportional to the key length while being extremely memory efficient. Millions of terms fit in tens of megabytes.
FSTs are used beyond the term dictionary as well (in analysis dictionaries such as ICU-based tokenization, for example), and every Apache Lucene-based system inherits them.
Posting List Compression
A posting list may contain millions of document IDs. Compression is essential.
1. Delta Encoding:
Original: [1, 5, 8, 12, 15, 17]
Delta: [1, 4, 3, 4, 3, 2]
Small consecutive numbers compress well.
2. Variable Byte Encoding: Small numbers take 1 byte, larger ones take multiple bytes. Since most numbers are small, averages around 1 byte.
3. FOR (Frame of Reference) + PFOR: Block-level bit-packing based on max value. Tens-of-times compression.
Lucene combines these to compress posting lists to 5–10% of the original size.
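A minimal Python sketch of the first two techniques (delta encoding plus variable-byte encoding). The FOR/PFOR bit-packing Lucene actually uses is more involved, so treat this purely as an illustration of why sorted doc IDs compress so well:
def delta_encode(doc_ids):
    # store gaps between sorted doc IDs instead of absolute values
    prev, deltas = 0, []
    for d in doc_ids:
        deltas.append(d - prev)
        prev = d
    return deltas

def vbyte_encode(numbers):
    # 7 data bits per byte; the high bit marks the final byte of each number
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

doc_ids = [1, 5, 8, 12, 15, 17]
deltas = delta_encode(doc_ids)   # [1, 4, 3, 4, 3, 2]
packed = vbyte_encode(deltas)    # 6 bytes, versus 24 bytes as 32-bit ints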
2. Lucene Segments: The Elegance of Immutable Files
What Is a Segment?
In Lucene, a segment is an independent small inverted index. A complete unit of search.
Index/
├── segments_12 # commit point: list of current segments
├── _0.cfs # segment 0 (compound file)
├── _1.cfs # segment 1
├── _2.cfs # segment 2
└── _3.cfs # segment 3
Each segment contains:
- Term dictionary (FST)
- Posting lists
- Stored fields (original documents)
- Doc values (for sorting/aggregation)
- Norms (for scoring)
- Term vectors (for highlighting)
Immutability
The key property of segments: once written, they are never modified.
This yields huge advantages:
- Lock-free search: segments are read-only, so thousands of concurrent queries run without locks.
- Cache efficiency: immutable files can be safely cached in the OS page cache.
- Simple replication: just copy files.
Adding Documents = New Segment
Adding documents does not modify existing segments. Instead, a new segment is created.
Before: [segment_1][segment_2][segment_3]
Add 10 documents → create segment_4
After: [segment_1][segment_2][segment_3][segment_4]
On search, all segments are searched in parallel and results are merged.
Deletion = Tombstone
Deletion also does not actually remove the document. A tombstone (deletion marker) is recorded:
.liv file: [1, 0, 1, 1, 0, 1, ...] # 0 = deleted
Search checks this bitmap and skips deleted entries. Actual reclamation happens during merge.
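Conceptually, filtering against the live-docs bitmap looks like the sketch below (the real .liv format is a compressed bitset, so this is only an illustration):
live_docs = [1, 0, 1, 1, 0, 1]  # per-segment bitmap: 1 = live, 0 = tombstoned

def visible_hits(matching_doc_ids):
    # drop tombstoned docs at search time; space is reclaimed only at merge
    return [d for d in matching_doc_ids if live_docs[d]]

print(visible_hits([0, 1, 2, 4, 5]))  # [0, 2, 5]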
Update = Delete + Add
An update is "mark the previous version as deleted + insert the new version into a new segment". Because of this:
- Many updates → tombstones accumulate → search slows down.
- Periodic merges are essential.
3. The Refresh / Flush / Merge Cycle
The Lucene/Elasticsearch write path consists of three stages, each with a different purpose.
In-Memory Buffer
New documents first accumulate in the memory buffer:
Index Buffer (RAM)
[doc1, doc2, doc3, doc4, ...]
Not yet searchable (!). Not on disk either.
Refresh: Make It Searchable
Refresh converts the memory buffer into an in-memory segment:
Index Buffer → [new segment (in memory)]
This in-memory segment exists only in OS page cache (not yet fsynced). But it is searchable.
Default interval: 1 second (index.refresh_interval = 1s).
This is the secret of Elasticsearch's Near Real-Time (NRT) search. Inserted documents appear in search results after 1 second.
Caution: refresh is not free. Each refresh creates a new segment, so small segments pile up and search slows down.
Tuning Refresh
For bulk indexing, disable refresh:
PUT /my_index/_settings
{
"index": {
"refresh_interval": "-1"
}
}
// After indexing is done
PUT /my_index/_settings
{
"index": {
"refresh_interval": "1s"
}
}
Indexing can become several times faster.
Translog: Ensuring Durability
Refresh does not guarantee durability (no fsync). So if the server crashes, do we lose data?
Solution: the translog (transaction log).
Every indexing operation:
- Written simultaneously to the memory buffer AND translog.
- Translog is persisted to disk via fsync (default: once per request).
Write flow:
Document → Memory Buffer → Translog (fsync)
↓ (after 1 second, refresh)
In-memory segment (searchable)
↓ (periodic flush)
Disk segment (persistent)
Flush: Persistent Storage
Flush fsyncs in-memory segments to disk and empties the translog:
Before flush:
Memory: [seg_new (in cache)]
Translog: [full, several hundred MB]
After flush:
Disk: [seg_new (fsynced)]
Translog: [empty]
Default triggers:
- The translog reaches 512MB (index.translog.flush_threshold_size).
- Note: index.translog.sync_interval (default 5s) controls how often the translog itself is fsynced when durability is async; it is separate from flush.
Flush is real disk I/O, so it is far more expensive.
Merge: Combining Segments
Over time, the number of segments grows:
- Search traverses all segments → slow.
- Tombstones accumulate, wasting space.
Merge combines multiple segments into one larger segment:
Before: [seg_1, seg_2, seg_3, seg_4] (10MB each)
Merge starts
Concurrent: [seg_1, seg_2, seg_3, seg_4, seg_merged_in_progress]
After: [seg_merged] (40MB, tombstones removed)
Existing segments remain searchable during merge. An atomic swap happens on completion.
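As a toy illustration of what a merge does logically (real merges rewrite every data structure, remap doc IDs, and stream from disk), combine the per-term posting lists of two segments while dropping tombstoned documents:
def merge_segments(seg_a, seg_b, deleted):
    # seg_x: {term: sorted list of doc ids}; deleted: set of tombstoned doc ids
    merged = {}
    for term in set(seg_a) | set(seg_b):
        postings = sorted(set(seg_a.get(term, [])) | set(seg_b.get(term, [])))
        merged[term] = [d for d in postings if d not in deleted]
    return merged

seg_1 = {"search": [1, 3], "engine": [1]}
seg_2 = {"search": [7], "lucene": [5]}
print(merge_segments(seg_1, seg_2, deleted={3}))
# e.g. {'search': [1, 7], 'engine': [1], 'lucene': [5]} (key order may vary)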
TieredMergePolicy
Lucene's default merge policy. The concept of "tier":
- Groups segments of similar size.
- If a tier is too large, triggers a merge.
- Large segments (5GB+) are excluded from further merging.
Parameters:
{
"index.merge.policy.max_merged_segment": "5gb",
"index.merge.policy.segments_per_tier": 10
}
Analogy
An analogy for refresh/flush/merge:
- Refresh: tidying your desk every day (making small segments). Frequent, fast.
- Flush: filing paperwork into the cabinet on weekends (permanent disk storage).
- Merge: reorganizing the filing cabinet at month end (segment merging).
Parameter Tuning
Common defaults:
{
"index.refresh_interval": "1s",
"index.translog.flush_threshold_size": "512mb",
"index.merge.scheduler.max_thread_count": 1
}
Bulk indexing:
{
"index.refresh_interval": "60s",
"index.number_of_replicas": 0,
"index.translog.durability": "async"
}
4. BM25: The Math of Scoring
The Problem
How do you pick "the 10 most relevant documents"? You need a relevance score, not just a yes/no on whether the term exists.
TF-IDF (Classic)
TF-IDF is the product of two factors:
- TF (Term Frequency): how often the term appears in the document.
- IDF (Inverse Document Frequency): how rare the term is across all documents.
score(q, d) = Σ_t (TF(t, d) × IDF(t))
Intuition: matching a rare word ("quantum") is more meaningful than matching a common word ("the").
Weaknesses of TF-IDF
- Linear TF: 100 occurrences get 100x the score of 1 occurrence. Unrealistic.
- Ignores document length: "search" appearing twice in a 100-word document is not the same as twice in a 1000-word document.
BM25: The Improved Version
BM25 (Best Match 25) was proposed by Stephen Robertson in the 1990s. It has been the default scoring since Lucene 6.
BM25(q, d) = Σ_t IDF(t) × (f(t,d) × (k1+1)) / (f(t,d) + k1 × (1 - b + b × |d|/avgdl))
Looks complex, but:
- f(t, d): term frequency.
- k1 (default 1.2): TF saturation — even many occurrences cap the score.
- b (default 0.75): document length normalization strength.
- |d|: document length.
- avgdl: average document length.
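A small Python sketch of the formula above, using the IDF variant Lucene's BM25 similarity documents, log(1 + (N - df + 0.5)/(df + 0.5)). The two calls show how length normalization rewards the shorter document for the same term frequency:
import math

def bm25_term_score(tf, doc_len, avgdl, n_docs, df, k1=1.2, b=0.75):
    # tf: term frequency in this doc, doc_len: |d|, df: docs containing the term
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# same tf=2, different document lengths: the 100-word doc scores higher
print(bm25_term_score(tf=2, doc_len=100, avgdl=500, n_docs=1000, df=50))   # ~5.3
print(bm25_term_score(tf=2, doc_len=1000, avgdl=500, n_docs=1000, df=50))  # ~3.2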
BM25 Improvements
- Saturation: even with high TF, the score saturates → prevents spam.
- Document length normalization: offsets the TF advantage of longer documents.
- Tunable: adjustable via k1 and b.
Practical Tuning
Defaults work for most cases. But:
- Primarily short documents (e.g. tweets): lower b (0.3–0.5).
- Long documents where TF matters (e.g. long articles): raise k1 (1.5–2.0).
{
"settings": {
"similarity": {
"my_bm25": {
"type": "BM25",
"k1": 1.5,
"b": 0.5
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"similarity": "my_bm25"
}
}
}
}
Lucene's Scoring Implementation
Lucene does not compute BM25 at scoring time. It uses several pre-computed values:
- Norm: document length normalization value (computed at index time).
- IDF: computed from the segment's statistics.
- Field boost: field weight.
Index-time values are stored, and queries compute fast via simple multiplication.
5. Analyzer: The Text Processing Pipeline
Why an Analyzer?
For "Search Engine" and "search engine" to be treated as the same term? For "running" and "run" to match as the same concept?
Answer: the analyzer processes text at index and query time.
Analyzer Structure
Input Text
↓
Character Filter (one or more)
↓
Tokenizer (exactly one)
↓
Token Filter (zero or more)
↓
Output Tokens
Character Filter
Character-level preprocessing:
- HTML strip: remove HTML tags.
- Mapping: character replacement (e.g. & → and).
- Pattern replace: regex replacement.
Tokenizer
Splits text into tokens (usually words):
- Standard: word boundary based, Unicode aware. Works for most languages.
- Whitespace: split by whitespace.
- N-gram: generate all n-grams (e.g. "search" → ["sea", "ear", "arc", ...]).
- Keyword: no splitting (for exact match).
- Language-specific: Chinese (IK, Smart Chinese), Japanese (Kuromoji), Korean (Nori).
Token Filter
Post-processing tokens:
- Lowercase: lowercase conversion.
- Stop: remove stop words ("the", "a", "is").
- Stemmer: stemming ("running" → "run").
- Synonym: synonym expansion.
- ASCII folding: "café" → "cafe".
- Word delimiter: split compound words.
Standard Analyzer Example
The default standard analyzer:
Input: "The Quick Brown Foxes!"
↓ Standard Tokenizer
[The, Quick, Brown, Foxes]
↓ Lowercase Filter
[the, quick, brown, foxes]
↓ Stop Filter (off by default)
[the, quick, brown, foxes]
Output: [the, quick, brown, foxes]
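The pipeline is easy to imitate. The rough Python sketch below mimics the character filter → tokenizer → token filter chain; the regexes are stand-ins, not Lucene's actual Unicode rules:
import re

STOPWORDS = {"the", "a", "an", "is", "of"}

def analyze(text, strip_html=True, remove_stopwords=False):
    if strip_html:
        text = re.sub(r"<[^>]+>", " ", text)           # character filter: html_strip
    tokens = [t for t in re.split(r"\W+", text) if t]  # tokenizer: crude "standard"
    tokens = [t.lower() for t in tokens]               # token filter: lowercase
    if remove_stopwords:                               # token filter: stop (off by default)
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(analyze("The Quick Brown Foxes!"))
# ['the', 'quick', 'brown', 'foxes']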
Korean Processing: Nori
Korean is hard to tokenize because of particles and endings. Nori is a Korean morphological analyzer:
{
"settings": {
"analysis": {
"analyzer": {
"my_nori": {
"tokenizer": "nori_tokenizer",
"filter": ["nori_readingform", "lowercase"]
}
}
}
}
}
Input: "고양이를 좋아합니다" → [고양이, 를, 좋아, 합니다] → particles removed → [고양이, 좋아]
Edge N-gram: Autocomplete
For autocomplete, edge n-gram is commonly used:
Input: "apple"
→ [a, ap, app, appl, apple]
When a user types "app", it matches the already-indexed "app". "apple" is found instantly.
Caution: index size grows significantly, since every n-gram is indexed rather than one token per word.
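Generating edge n-grams is trivial; this sketch mirrors what an edge_ngram token filter produces and makes the storage blow-up easy to reason about:
def edge_ngrams(term, min_gram=1, max_gram=20):
    # every prefix from min_gram up to max_gram (or the term length)
    return [term[:i] for i in range(min_gram, min(len(term), max_gram) + 1)]

print(edge_ngrams("apple"))
# ['a', 'ap', 'app', 'appl', 'apple'] -- 5 indexed tokens for 1 word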
6. Elasticsearch Distribution
Lucene is a search engine on a single machine. Elasticsearch extends it to a distributed cluster.
Index, Shard, Replica
- Index: a collection of documents (analogous to a DB table).
- Shard: a partition of the index. Each shard is a single Lucene index.
- Primary shard: the original.
- Replica shard: the copy.
Index "logs" (5 primary, 1 replica)
├── shard 0 (primary, node A) ← original
│ └── replica 0 (node B) ← copy
├── shard 1 (primary, node B)
│ └── replica 1 (node C)
├── shard 2 (primary, node C)
│ └── replica 2 (node A)
├── shard 3 (primary, node A)
│ └── replica 3 (node B)
└── shard 4 (primary, node B)
└── replica 4 (node C)
Shard Routing
Deciding which shard to put a document in:
shard = hash(_routing) % num_primary_shards
By default, _routing = document_id. Documents with the same ID always go to the same shard.
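The routing rule fits in a few lines. Elasticsearch actually hashes the routing key with Murmur3; the sketch below uses MD5 only because it is in the Python standard library, but the consequence is the same: the answer depends on the shard count.
import hashlib

def route(routing_key: str, num_primary_shards: int) -> int:
    # stable hash of the routing key, reduced modulo the primary shard count
    h = int.from_bytes(hashlib.md5(routing_key.encode()).digest()[:4], "big")
    return h % num_primary_shards

print(route("u_12345", 5))   # same key, same shard -- every time
print(route("u_12345", 10))  # change the shard count and the answer moves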
The Primary Shard Count Constraint
Problem: the primary shard count cannot be changed after index creation. Why?
Initial: 5 shards
hash(user_1) % 5 = 3 → stored on shard 3
If you change it to 10 shards:
hash(user_1) % 10 = 1 → look on shard 1 → not there!
Existing data would be searched on the wrong shard. Reindexing is required to resolve this.
Solutions: the Split API (in some cases), or start with enough shards from the beginning.
Replica Shards
Replicas can be added/removed anytime:
PUT /my_index/_settings
{
"index.number_of_replicas": 2
}
Replicas provide:
- High availability: if primary fails, a replica is promoted.
- Read scalability: distribute queries across replicas.
Cluster State and Master
A cluster has master nodes:
- Manage cluster state (shard assignment, mappings, etc.).
- Master election: Zen Discovery (before 7.0) or the newer Raft-like coordination algorithm (Elasticsearch 7+).
- Prevent two masters existing simultaneously (split brain).
Tip: run an odd number of master-eligible nodes (3, 5, 7) so a majority quorum can always be formed.
Query Execution: Scatter-Gather
When a search query arrives:
1. Coordinator node receives the request.
2. Broadcasts the query to every needed shard (primary or replica).
3. Each shard computes local top-K.
4. Coordinator gathers all results and picks the global top-K.
5. Fetches the full data for the selected documents.
6. Returns to the client.
This is called scatter-gather or two-phase query.
Query Pitfall: Deep Pagination
GET /logs/_search?from=9990&size=10
To get "results 10,000 to 10,010":
- Each shard computes top 10,010 (!).
- Coordinator gathers and sorts 10,010 items.
- Returns the range 9,990 to 10,010.
→ With from=1,000,000, each shard sorts a million entries. Memory explosion.
Solution: use the search_after API. Instead of global sort, continue from the previous result.
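A sketch of the search_after loop using the official Python client (elasticsearch-py); exact parameter names vary by client version, and process() is a hypothetical handler for each page:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

sort = [{"@timestamp": "asc"}, {"_id": "asc"}]   # tie-breaker keeps the cursor stable
cursor = None
while True:
    body = {"query": {"term": {"user_id": "u_12345"}}, "sort": sort, "size": 1000}
    if cursor:
        body["search_after"] = cursor
    hits = es.search(index="logs", body=body)["hits"]["hits"]
    if not hits:
        break
    process(hits)                 # hypothetical: handle this page of results
    cursor = hits[-1]["sort"]     # continue from the last hit's sort values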
Aggregation Execution
Aggregation queries (e.g. GROUP BY) also execute distributed:
GET /logs/_search
{
"size": 0,
"aggs": {
"by_user": {
"terms": { "field": "user_id", "size": 10 }
}
}
}
Each shard computes local top 10, then the coordinator merges them. Problem: the local top 10 may not be the global top 10. The coordinator requests more (e.g. top 100) to improve accuracy.
Use the shard_size parameter to tune:
"terms": { "field": "user_id", "size": 10, "shard_size": 100 }
7. Doc Values: Sorting and Aggregation
Why They Matter
The inverted index goes from term → docs. To answer "what is this document's field value?", you need the reverse direction.
Example: ORDER BY timestamp or GROUP BY country. You need each document's timestamp and country values.
Doc Values
Doc values are columnar storage:
timestamp column:
doc 0 → 2025-04-15 10:00:00
doc 1 → 2025-04-15 10:00:01
doc 2 → 2025-04-15 10:00:02
...
Each document's field value is stored in a contiguous array, which is ideal for sorting, aggregation, and scripting.
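The difference from the inverted index is easiest to see side by side: doc values are just per-field arrays indexed by doc ID, which is exactly what sorting and bucketing need (a conceptual sketch, not Lucene's on-disk format):
from collections import defaultdict

timestamps = [1713175200, 1713175201, 1713175202, 1713175199]  # one value per doc id
countries  = ["KR", "US", "KR", "DE"]

# ORDER BY timestamp: sort doc ids by their column value
order = sorted(range(len(timestamps)), key=lambda doc_id: timestamps[doc_id])

# GROUP BY country: bucket doc ids by their column value
buckets = defaultdict(list)
for doc_id, country in enumerate(countries):
    buckets[country].append(doc_id)

print(order)          # [3, 0, 1, 2]
print(dict(buckets))  # {'KR': [0, 2], 'US': [1], 'DE': [3]}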
Memory Use
Doc values are disk-based by default:
- mapped via mmap → leverages OS page cache.
- Cached in RAM if frequently used, on disk otherwise.
Elasticsearch's field data was the old in-memory version. Doc values are much more efficient, so they are now the default.
Exception: text Fields
text fields do not store doc values by default:
- Only analyzed tokens are stored → original reconstruction is hard.
- To aggregate, use a keyword subfield.
{
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": { "type": "keyword" }
}
}
}
}
Search on name. Aggregate/sort on name.keyword.
Sparse Doc Values
Lucene supports sparse representation for cases with many missing fields (e.g. optional fields). Space-efficient.
8. Mapping and Dynamic Mapping
What Is Mapping?
Mapping defines each field's type and properties:
{
"mappings": {
"properties": {
"title": { "type": "text" },
"user_id": { "type": "keyword" },
"timestamp": { "type": "date" },
"price": { "type": "double" },
"location": { "type": "geo_point" }
}
}
}
text vs keyword
The most important distinction:
- text: analyzed (tokenized). For full-text search.
- keyword: not analyzed. For exact match, aggregation, and sorting.
Emails, IPs, tags are almost always keyword. Body, description, title are text (optionally with keyword as well).
Dynamic Mapping
Field types are auto-inferred on first indexing:
POST /my_index/_doc
{
"name": "Alice",
"age": 30,
"active": true
}
Pro: quick start. Con: the inferred type may not be what you expect, and once set, a field's type cannot be changed without reindexing.
Recommendation: use explicit mapping in production. Dynamic mapping is for dev/test.
Mapping Explosion
Each field carries memory overhead. For example:
{
"user_data": {
"user_1": { "action": "login" },
"user_2": { "action": "logout" }
}
}
Each user_id dynamically creates a new field. Millions of fields → mapping explosion. Elasticsearch can go down in minutes.
Fixes:
- Restructure the document: { "user_id": "user_1", "action": "login" }.
- Or use the flattened field type.
- Limit field count via index.mapping.total_fields.limit (default 1000).
9. Production Tuning
Bulk Indexing
Goal: load data as fast as possible.
PUT /my_index/_settings
{
"index": {
"refresh_interval": "-1",
"number_of_replicas": 0,
"translog": {
"durability": "async",
"sync_interval": "30s"
}
}
}
Index in batches with the Bulk API:
POST /_bulk
{ "index": { "_index": "my_index" } }
{ "field1": "value1" }
{ "index": { "_index": "my_index" } }
{ "field2": "value2" }
5–15MB per bulk is the sweet spot. Too small → overhead; too large → memory pressure.
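If you index from Python, the client's bulk helper batches documents into _bulk requests for you. A rough sketch (parameter names and defaults depend on the elasticsearch-py version you run):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(rows):
    for row in rows:
        # one action per document; the helper groups them into _bulk requests
        yield {"_index": "my_index", "_source": row}

rows = ({"field1": f"value{i}"} for i in range(100_000))
helpers.bulk(es, actions(rows), chunk_size=5_000)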
After indexing:
PUT /my_index/_settings
{
"index": {
"refresh_interval": "1s",
"number_of_replicas": 1
}
}
POST /my_index/_forcemerge?max_num_segments=1
_forcemerge down to a single segment maximizes search performance for indices that are no longer written to.
JVM Heap
Critical rule: keep heap under 32GB.
Reason: above roughly 32GB, the JVM's compressed oops (compressed object pointers) are disabled and memory efficiency drops sharply.
# jvm.options
-Xms16g
-Xmx16g
The rest of memory goes to OS page cache. Lucene heavily uses page cache. Best setup:
- 64GB server memory
- JVM heap 16–31GB
- OS cache 33–48GB
Shard Count
Bad example: 1,000 daily indices × 5 primary shards = 5,000 shards. Master overhead explodes.
Rules:
- 20–40GB per shard is ideal.
- No more than 20 shards per GB of JVM heap.
- Small indices are fine with a single shard.
Hot-Warm Architecture
Useful pattern for time-series data:
- Hot nodes: recent data, fast SSDs, active indexing/search.
- Warm nodes: older data, HDDs, read-only.
- Cold nodes: very old data, occasionally queried.
Automate with Index Lifecycle Management (ILM):
{
"policy": {
"phases": {
"hot": { "actions": {} },
"warm": { "min_age": "7d", "actions": { "allocate": { "require": { "data": "warm" } } } },
"cold": { "min_age": "30d", "actions": { "allocate": { "require": { "data": "cold" } } } },
"delete": { "min_age": "90d", "actions": { "delete": {} } }
}
}
}
Data Streams
Elasticsearch 7.9+ data streams are a high-level abstraction for time-series data:
- Auto-creates backing indices (e.g. .ds-logs-2025.04.15-000001).
- Auto-rollover (by size/age).
- Integrates with ILM.
PUT /_data_stream/logs-app
Kibana logs, APM, and monitoring all use data streams.
10. Common Pitfalls and Debugging
Pitfall 1: Too Many Shards
Symptoms: slow cluster state updates, high master node CPU, slow queries.
Cause: thousands of small shards.
Fixes:
- Merge/shrink older indices.
- Automate via ILM.
- Reduce shard count via the Shrink API.
Pitfall 2: Mapping Explosion
Symptoms: OOM, indexing failures, cluster instability.
Cause: unbounded dynamic field creation.
Fixes:
- Explicit mapping.
- Use "dynamic": "strict" to reject unexpected new fields.
- Fix the data structure at the application level.
Pitfall 3: Deep Pagination
Symptoms: memory/time explosion on queries with large from values.
Fixes:
- Use search_after.
- Or the scroll API (for bulk extraction).
- Block deep pagination in the UI.
Pitfall 4: Fielddata on Text
Symptoms: aggregation on a text field → error or massive memory use.
Fixes:
- Aggregate on a keyword subfield (e.g. name.keyword) instead.
- Or enable fielddata: true (memory-hungry, not recommended).
Pitfall 5: Refresh Abuse
Symptoms: ?refresh=true on every request to make docs searchable immediately.
Cause: new segment per indexing call → excessive merge load.
Fixes:
- Refresh only when needed.
- Replace with ?refresh=wait_for (waits for the next scheduled refresh).
Pitfall 6: Complex Nested Queries
Symptoms: queries on nested fields are slow.
Cause: nested fields are stored as "hidden sub-documents" and searched separately.
Fixes:
- Denormalize (flatten). Accept duplication.
- Or use the join type for one-to-many relations (at a performance cost).
11. Lucene and Competing Technologies
OpenSearch
When Elasticsearch moved from Apache 2.0 to the SSPL/Elastic License in 2021, AWS forked it as OpenSearch. Still Lucene-based, though the feature sets are starting to diverge.
Apache Solr
Solr, predating Elasticsearch, is also Lucene-based. Originally enterprise-focused. Elasticsearch now leads in market share.
Meilisearch
A lightweight search engine written in Rust. Does not use Lucene; implements its own. Very fast at smaller scale. Built-in typo tolerance (fuzzy matching).
Typesense
A lightweight engine written in C++ (not Lucene-based), positioned as an Algolia alternative.
The Role of ClickHouse
If "search" really means analytical queries rather than traditional search, ClickHouse is far faster. Log analytics are increasingly moving from Elasticsearch to ClickHouse.
Differences:
- Elasticsearch: free-text search, diverse queries, complex scoring.
- ClickHouse: aggregation, SQL, large-scale analytics.
Vector Search (see earlier post)
Elasticsearch 8+ also supports vector search (HNSW). See the earlier post on ANN algorithms.
Quiz Review
Q1. Why are Lucene segments immutable, and what are the benefits?
A. Even when documents are added/deleted/updated, existing segments are never modified. Instead, new segments are created, and deletions are marked via tombstones.
Benefits:
- Lock-free search: read-only, so thousands of concurrent queries run without locks.
- Cache stability: OS page cache safely caches immutable files. No cache invalidation concerns.
- Simple replication: just copy files. No state synchronization.
- Merge safety: background merges don't interfere with live searches. Atomic swap on completion.
- Optimized write path: writes are always append. Sequential I/O is fast.
Costs:
- Deletion only leaves a tombstone; actual reclamation is deferred to merge.
- Update = delete + add → tombstone accumulation.
- Merge cost (I/O, CPU).
This is the same philosophy as LSM-Trees. "Writes are append, cleanup is batched later" is a fundamental pattern of modern storage systems.
Q2. What are the differences and roles of refresh, flush, and merge?
A.
Refresh (~1 second interval):
- Converts memory buffer into an in-memory segment.
- Documents become searchable (the source of Near Real-Time).
- Not yet fsynced → no durability.
- Cost: moderate. Creates a new small segment each time.
Flush (minutes or 512MB):
- Fsyncs in-memory segments to disk.
- Empties the translog.
- Data becomes persistent.
- Cost: high (actual disk I/O).
Merge (background):
- Merges multiple small segments into a larger one.
- Tombstoned documents are actually removed.
- Maintains search performance (fewer segments = faster).
- Cost: very high (disk read + write).
Analogy:
- Refresh = jotting into your notebook (just a memo).
- Flush = filing the notebook into the cabinet (permanent storage).
- Merge = quarterly file reorganization (dedup, efficiency).
Why separate: each must run at a different cadence to be efficient. Search needs frequent refresh, but fsync is expensive so it cannot be frequent. Merge can run slowly in the background. Separating these three stages is how Elasticsearch achieves both performance and durability.
Q3. What is the difference between text and keyword fields, and why do you need both?
A.
text field:
- An analyzer is applied.
- Tokenization, lowercase, stemming, stop-word removal, etc.
- "The Quick Brown Fox" → ["quick", "brown", "fox"].
- Optimal for full-text search.
- Cannot sort/aggregate (original lost after analysis).
keyword field:
- Not analyzed. String is kept as-is.
- "The Quick Brown Fox" → ["The Quick Brown Fox"] (a single value).
- Used for exact match, sorting, aggregation.
- Useful for high-cardinality values.
Why both: the same field often has different query patterns. For example, a product name:
- Full-text search for "iPhone" → text.
- Aggregation by exact name "iPhone 15 Pro" → keyword.
Hence multi-field mapping:
{
"product_name": {
"type": "text",
"fields": {
"keyword": { "type": "keyword" }
}
}
}
Now search on product_name, aggregate on product_name.keyword. Same data, two indexing styles.
Rules of thumb:
- IP, email, tag, ID, URL → keyword.
- Body, description, title, comments → text.
- Product name, user name where both needed → text + keyword subfield.
Without this distinction, confusion like "aggregation doesn't work" or "fielddata error" arises. It's the most basic principle of mapping design.
Q4. How does BM25 improve on TF-IDF?
A. It addresses two weaknesses of TF-IDF:
Weakness 1: Linearity of TF
TF-IDF uses term frequency linearly: score ∝ TF.
- "search" 1 time vs 100 times → 100x difference.
- Reality: 1 → 10 is a big jump, but 90 → 100 is almost meaningless.
BM25 fix: a saturation curve.
f(tf) = tf × (k1+1) / (tf + k1)
Growth slows as TF rises: with k1 = 1.2, tf = 1 gives f = 1.0, tf = 10 gives f ≈ 1.96, and tf = 100 gives f ≈ 2.17, approaching the cap of k1 + 1 = 2.2. This also blunts keyword-stuffing spam.
Weakness 2: Ignoring document length
TF-IDF treats "'search' 2 times in a 100-word document" and "'search' 2 times in a 10,000-word document" equally. But matches in short documents are more "relevant".
BM25 fix: length normalization.
length_norm = 1 - b + b × (|d| / avgdl)
- |d|: this document's length.
- avgdl: average document length.
- b: normalization strength (0–1, default 0.75).
Discounts TF for long documents. b=1 is full normalization, b=0 ignores length.
Combined BM25 formula:
BM25 = IDF × (tf × (k1+1)) / (tf + k1 × (1 - b + b × |d|/avgdl))
Complex-looking, but the meaning is clear: "IDF-weighted score with saturated TF discounted by document length".
In practice defaults (k1=1.2, b=0.75) satisfy most cases. For short documents like tweets lower b; for long articles raise k1. BM25 becoming Lucene's default in Lucene 6 was no accident — it's a formula proven over 20 years.
Q5. Why did Elasticsearch make the primary shard count unchangeable after index creation?
A. Because of shard routing. Which shard a document lands on is determined by:
shard_id = hash(routing_key) % num_primary_shards
By default routing_key = document_id. Then:
- At indexing: hash(doc_1) % 5 = 3 → stored on shard 3.
- At search: hash(doc_1) % 5 = 3 → queried on shard 3.
The problem: if you change from 5 to 10 shards:
- New calculation: hash(doc_1) % 10 = 1 → queried on shard 1.
- But the document was actually stored on shard 3 → not found.
All existing documents would be searched on the wrong shard. Equivalent to data loss.
Theoretical solutions:
- Consistent hashing: most documents stay put with partial changes. Still some migration needed.
- Reindex: create a new index and copy all documents. Time/resource intensive.
- Dual-write period: index into the new one in parallel. Gradual transition.
Elasticsearch's practical solutions:
- Split API: only N → N*k (2x, 3x, etc.). Each shard splits internally.
- Shrink API: shrink from N → N/k.
- Reindex API: the general case. Create a new index and copy.
- Data streams + ILM: time-series data auto-creates new indices, so no problem.
Design recommendations:
- Enough shards from the start. Easier than growing later.
- Rule: 20–40GB per shard. 1.5–2x projected future growth.
- Time-series: data streams + daily/weekly indices.
This constraint is a main reason "why Elasticsearch operations are tricky". Bad initial design means costly reindexing. Hence decide the shard count carefully before production.
Conclusion: 20 Years of Engineering
Key Takeaways
- Inverted index + FST: the foundation of search.
- Segments are immutable: the secret of lock-free search.
- Refresh/Flush/Merge: three-stage performance/durability balance.
- BM25: the modern scoring standard.
- Analyzer: the text processing pipeline.
- Shard + Replica: the basics of distribution.
- Doc Values: columnar storage for sorting/aggregation.
- Hot-Warm + ILM: time-series data management.
Lessons in Operating Elasticsearch
- Trust defaults, but understand them. They are usually good but may not fit your situation.
- Heap under 32GB. The single most important rule.
- Design shards well upfront. Hard to change later.
- Explicit mapping. Dynamic mapping is an invitation to disaster.
- Refresh only as needed. Turn it off for bulk indexing.
- Measure: _cat/segments, _cat/shards, _cluster/health.
Lucene, a Treasure
Lucene has been developed since the early 2000s as a Java library. Twenty years of careful, hand-tuned optimization by many engineers. Compression algorithms, data structures, file formats — everything is finely tuned.
Elasticsearch, Solr, OpenSearch, Kibana, and even some SaaS search services sit on top of Lucene. After Google Search, it's likely the search engine you use most often.
A Final Lesson
Search looks easy but gets hard once you dig deep. To answer "why is this query slow", "why is memory running out", or "why is the cluster unstable", you have to know the internals.
Having read this post, you now:
- Know Lucene's segment and file structure.
- Understand refresh vs flush.
- Know why BM25 is good.
- Understand the importance of shard count.
- Know why Hot-Warm exists.
Next time you work with an Elasticsearch cluster, this knowledge will steer your decisions for the better. And when issues arise, you'll be able to answer "why?". That is the power of a real engineer.
References
- Lucene: The Internal Workings
- Elasticsearch: The Definitive Guide - older but great conceptual explanations
- Elastic Blog: Anatomy of an Elasticsearch Cluster
- Lucene Index File Formats
- Okapi BM25 (Robertson & Zaragoza, 2009) - mathematical review of BM25
- Elasticsearch: Designing for Scale
- OpenSearch Documentation - Elasticsearch fork
- Introduction to Information Retrieval (Manning, Raghavan, Schütze) - classic IR textbook
- Finite State Transducers for Fast Text Processing - FST explanation
- Why do I need replicas? Everything you need to know about Elasticsearch Shards