Elasticsearch & Lucene Internals Complete Guide 2025: Deep Dive into Segments, Inverted Index, Refresh/Flush/Merge, and Shard Routing
Introduction: Finding Answers Inside Logs in Under a Second
Picture This
Your company generates tens of terabytes of logs every day. One night at 11 PM, a user puts a specific product in their cart and hits an error before placing the order. The customer support rep asks you:
"Show me all request logs for this user around 8:23 PM today. Only ones with response time over 2 seconds."
GET /logs/_search
{
"query": {
"bool": {
"must": [
{ "term": { "user_id": "u_12345" } },
{ "range": { "response_time_ms": { "gt": 2000 } } },
{ "range": { "@timestamp": { "gte": "2025-04-15T20:20:00Z", "lte": "2025-04-15T20:30:00Z" } } }
]
}
}
}
Elasticsearch returns the answer from billions of log entries within 200ms. How?
The answer lies inside Lucene, a 20-year-old Java search engine library. Elasticsearch is essentially a distributed wrapper around Lucene. Understanding Lucene means understanding Elasticsearch.
What This Article Covers
- Lucene fundamentals: Inverted index, term dictionary, posting list.
- Segment structure: A collection of immutable files.
- Refresh / Flush / Merge: The secret of NRT.
- BM25 and scoring.
- Analyzers and text processing.
- Elasticsearch distribution: Shard, replica, routing.
- Aggregation execution.
- Production tuning.
Why Learn This Now?
- Elasticsearch is still the most widely used search engine.
- The foundation of OpenSearch, Kibana, and Logstash.
- Grafana Loki and SigNoz share similar design principles.
- Without understanding Lucene internals, you can't answer "why is it slow", "why does it use so much memory", or "why did the index grow so large".
1. Inverted Index: Where It All Begins
The Problem
Given these documents:
Doc 1: "Elasticsearch is a distributed search engine"
Doc 2: "Lucene is the library behind Elasticsearch"
Doc 3: "A search engine finds relevant documents"
Question: Which documents contain "search"?
Naive approach: scan every document looking for "search". That is O(total word count) per query, which is impractical with millions of documents.
Structure of an Inverted Index
An inverted index pre-builds a word-to-document mapping:
Term Dictionary:
"a" → [1, 2, 3]
"behind" → [2]
"distributed" → [1]
"documents" → [3]
"elasticsearch"→ [1, 2]
"engine" → [1, 3]
"finds" → [3]
"is" → [1, 2]
"library" → [2]
"lucene" → [2]
"relevant" → [3]
"search" → [1, 3]
"the" → [2]
Now to find "search", you immediately get [1, 3] from the term dictionary. In O(log unique-term-count).
Posting List
The document list for each term is called a posting list. In practice it stores more than just document IDs:
"search":
[
(docId=1, freq=1, positions=[3]),
(docId=3, freq=1, positions=[1])
]
- docId: document number.
- freq: how many times the term appears in that document.
- positions: position within the document (for phrase search).
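As a mental model only (not Lucene's actual on-disk format), a tiny inverted index with posting lists can be sketched in a few lines of Python. Term lookup becomes a dictionary access instead of a document scan; positions here are simply 0-based token offsets:
from collections import defaultdict

docs = {
    1: "Elasticsearch is a distributed search engine",
    2: "Lucene is the library behind Elasticsearch",
    3: "A search engine finds relevant documents",
}

# term -> {doc_id: {"freq": ..., "positions": [...]}}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):
        entry = index[token].setdefault(doc_id, {"freq": 0, "positions": []})
        entry["freq"] += 1
        entry["positions"].append(pos)

print(sorted(index["search"]))  # [1, 3] -- no document scan needed
print(index["search"][3])       # {'freq': 1, 'positions': [1]}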
Term Dictionary: Implemented with FST
How do you efficiently store millions of terms? Lucene uses an FST (Finite State Transducer).
An FST is an extremely compressed representation of string-to-value mappings:
elastic → 1
elected → 2
election → 3
electric → 4
These share prefixes: all four share "el", and the last three share "elect". By sharing prefixes (and suffixes), an FST keeps lookups proportional to the key length while being extremely memory efficient. Millions of terms fit in tens of megabytes.
FSTs are used beyond the term dictionary as well (in analysis dictionaries such as ICU-based tokenization, for example), and every Apache Lucene-based system inherits them.
Posting List Compression
A posting list may contain millions of document IDs. Compression is essential.
1. Delta Encoding:
Original: [1, 5, 8, 12, 15, 17]
Delta: [1, 4, 3, 4, 3, 2]
Small consecutive numbers compress well.
2. Variable Byte Encoding: Small numbers take 1 byte, larger ones take multiple bytes. Since most numbers are small, averages around 1 byte.
3. FOR (Frame of Reference) + PFOR: Block-level bit-packing based on max value. Tens-of-times compression.
Lucene combines these to compress posting lists to 5–10% of the original size.
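A minimal Python sketch of the first two techniques (delta encoding plus variable-byte encoding). The FOR/PFOR bit-packing Lucene actually uses is more involved, so treat this purely as an illustration of why sorted doc IDs compress so well:
def delta_encode(doc_ids):
    # store gaps between sorted doc IDs instead of absolute values
    prev, deltas = 0, []
    for d in doc_ids:
        deltas.append(d - prev)
        prev = d
    return deltas

def vbyte_encode(numbers):
    # 7 data bits per byte; the high bit marks the final byte of each number
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

doc_ids = [1, 5, 8, 12, 15, 17]
deltas = delta_encode(doc_ids)   # [1, 4, 3, 4, 3, 2]
packed = vbyte_encode(deltas)    # 6 bytes, versus 24 bytes as 32-bit ints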
2. Lucene Segments: The Elegance of Immutable Files
What Is a Segment?
In Lucene, a segment is an independent small inverted index. A complete unit of search.
Index/
├── segments_12 # commit point: list of current segments
├── _0.cfs # segment 0 (compound file)
├── _1.cfs # segment 1
├── _2.cfs # segment 2
└── _3.cfs # segment 3
Each segment contains:
- Term dictionary (FST)
- Posting lists
- Stored fields (original documents)
- Doc values (for sorting/aggregation)
- Norms (for scoring)
- Term vectors (for highlighting)
Immutability
The key property of segments: once written, they are never modified.
This yields huge advantages:
- Lock-free search: segments are read-only, so thousands of concurrent queries run without locks.
- Cache efficiency: immutable files can be safely cached in the OS page cache.
- Simple replication: just copy files.
Adding Documents = New Segment
Adding documents does not modify existing segments. Instead, a new segment is created.
Before: [segment_1][segment_2][segment_3]
Add 10 documents → create segment_4
After: [segment_1][segment_2][segment_3][segment_4]
On search, all segments are searched in parallel and results are merged.
Deletion = Tombstone
Deletion also does not actually remove the document. A tombstone (deletion marker) is recorded:
.liv file: [1, 0, 1, 1, 0, 1, ...] # 0 = deleted
Search checks this bitmap and skips deleted entries. Actual reclamation happens during merge.
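Conceptually, filtering against the live-docs bitmap looks like the sketch below (the real .liv format is a compressed bitset, so this is only an illustration):
live_docs = [1, 0, 1, 1, 0, 1]  # per-segment bitmap: 1 = live, 0 = tombstoned

def visible_hits(matching_doc_ids):
    # drop tombstoned docs at search time; space is reclaimed only at merge
    return [d for d in matching_doc_ids if live_docs[d]]

print(visible_hits([0, 1, 2, 4, 5]))  # [0, 2, 5]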
Update = Delete + Add
An update is "mark the previous version as deleted + insert the new version into a new segment". Because of this:
- Many updates → tombstones accumulate → search slows down.
- Periodic merges are essential.
3. The Refresh / Flush / Merge Cycle
The Lucene/Elasticsearch write path consists of three stages, each with a different purpose.
In-Memory Buffer
New documents first accumulate in the memory buffer:
Index Buffer (RAM)
[doc1, doc2, doc3, doc4, ...]
Not yet searchable (!). Not on disk either.
Refresh: Make It Searchable
Refresh converts the memory buffer into an in-memory segment:
Index Buffer → [new segment (in memory)]
This in-memory segment exists only in OS page cache (not yet fsynced). But it is searchable.
Default interval: 1 second (index.refresh_interval = 1s).
This is the secret of Elasticsearch's Near Real-Time (NRT) search. Inserted documents appear in search results after 1 second.
Caution: refresh is not free. Each refresh creates a new segment, so small segments pile up and search slows down.
Tuning Refresh
For bulk indexing, disable refresh:
PUT /my_index/_settings
{
"index": {
"refresh_interval": "-1"
}
}
// After indexing is done
PUT /my_index/_settings
{
"index": {
"refresh_interval": "1s"
}
}
Indexing can become several times faster.
Translog: Ensuring Durability
Refresh does not guarantee durability (no fsync). So if the server crashes, do we lose data?
Solution: the translog (transaction log).
Every indexing operation:
- Written simultaneously to the memory buffer AND translog.
- Translog is persisted to disk via fsync (default: once per request).
Write flow:
Document → Memory Buffer → Translog (fsync)
↓ (after 1 second, refresh)
In-memory segment (searchable)
↓ (periodic flush)
Disk segment (persistent)
Flush: Persistent Storage
Flush fsyncs in-memory segments to disk and empties the translog:
Before flush:
Memory: [seg_new (in cache)]
Translog: [full, several hundred MB]
After flush:
Disk: [seg_new (fsynced)]
Translog: [empty]
Default triggers:
- The translog reaches 512MB (index.translog.flush_threshold_size).
- Note: index.translog.sync_interval (default 5s) controls how often the translog itself is fsynced when durability is async; it is separate from flush.
Flush is real disk I/O, so it is far more expensive.
Merge: Combining Segments
Over time, the number of segments grows:
- Search traverses all segments → slow.
- Tombstones accumulate, wasting space.
Merge combines multiple segments into one larger segment:
Before: [seg_1, seg_2, seg_3, seg_4] (10MB each)
Merge starts
Concurrent: [seg_1, seg_2, seg_3, seg_4, seg_merged_in_progress]
After: [seg_merged] (40MB, tombstones removed)
Existing segments remain searchable during merge. An atomic swap happens on completion.
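As a toy illustration of what a merge does logically (real merges rewrite every data structure, remap doc IDs, and stream from disk), combine the per-term posting lists of two segments while dropping tombstoned documents:
def merge_segments(seg_a, seg_b, deleted):
    # seg_x: {term: sorted list of doc ids}; deleted: set of tombstoned doc ids
    merged = {}
    for term in set(seg_a) | set(seg_b):
        postings = sorted(set(seg_a.get(term, [])) | set(seg_b.get(term, [])))
        merged[term] = [d for d in postings if d not in deleted]
    return merged

seg_1 = {"search": [1, 3], "engine": [1]}
seg_2 = {"search": [7], "lucene": [5]}
print(merge_segments(seg_1, seg_2, deleted={3}))
# e.g. {'search': [1, 7], 'engine': [1], 'lucene': [5]} (key order may vary)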
TieredMergePolicy
Lucene's default merge policy. The concept of "tier":
- Groups segments of similar size.
- If a tier is too large, triggers a merge.
- Large segments (5GB+) are excluded from further merging.
Parameters:
{
"index.merge.policy.max_merged_segment": "5gb",
"index.merge.policy.segments_per_tier": 10
}
Analogy
An analogy for refresh/flush/merge:
- Refresh: tidying your desk every day (making small segments). Frequent, fast.
- Flush: filing paperwork into the cabinet on weekends (permanent disk storage).
- Merge: reorganizing the filing cabinet at month end (segment merging).
Parameter Tuning
Common defaults:
{
"index.refresh_interval": "1s",
"index.translog.flush_threshold_size": "512mb",
"index.merge.scheduler.max_thread_count": 1
}
Bulk indexing:
{
"index.refresh_interval": "60s",
"index.number_of_replicas": 0,
"index.translog.durability": "async"
}
4. BM25: The Math of Scoring
The Problem
How do you pick "the 10 most relevant documents"? You need a relevance score, not just a yes/no on whether the term exists.
TF-IDF (Classic)
TF-IDF is the product of two factors:
- TF (Term Frequency): how often the term appears in the document.
- IDF (Inverse Document Frequency): how rare the term is across all documents.
score(q, d) = Σ_t (TF(t, d) × IDF(t))
Intuition: matching a rare word ("quantum") is more meaningful than matching a common word ("the").
Weaknesses of TF-IDF
- Linear TF: 100 occurrences get 100x the score of 1 occurrence. Unrealistic.
- Ignores document length: "search" appearing twice in a 100-word document is not the same as twice in a 1000-word document.
BM25: The Improved Version
BM25 (Best Match 25) was proposed by Stephen Robertson in the 1990s. It has been the default scoring since Lucene 6.
BM25(q, d) = Σ_t IDF(t) × (f(t,d) × (k1+1)) / (f(t,d) + k1 × (1 - b + b × |d|/avgdl))
Looks complex, but:
- f(t, d): term frequency.
- k1 (default 1.2): TF saturation — even many occurrences cap the score.
- b (default 0.75): document length normalization strength.
- |d|: document length.
- avgdl: average document length.
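A small Python sketch of the formula above, using the IDF variant Lucene's BM25 similarity documents, log(1 + (N - df + 0.5)/(df + 0.5)). The two calls show how length normalization rewards the shorter document for the same term frequency:
import math

def bm25_term_score(tf, doc_len, avgdl, n_docs, df, k1=1.2, b=0.75):
    # tf: term frequency in this doc, doc_len: |d|, df: docs containing the term
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# same tf=2, different document lengths: the 100-word doc scores higher
print(bm25_term_score(tf=2, doc_len=100, avgdl=500, n_docs=1000, df=50))   # ~5.3
print(bm25_term_score(tf=2, doc_len=1000, avgdl=500, n_docs=1000, df=50))  # ~3.2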
BM25 Improvements
- Saturation: even with high TF, the score saturates → prevents spam.
- Document length normalization: offsets the TF advantage of longer documents.
- Tunable: adjustable via k1 and b.
Practical Tuning
Defaults work for most cases. But:
- Primarily short documents (e.g. tweets): lower b (0.3–0.5).
- Long documents where TF matters (e.g. long articles): raise k1 (1.5–2.0).
{
"settings": {
"similarity": {
"my_bm25": {
"type": "BM25",
"k1": 1.5,
"b": 0.5
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"similarity": "my_bm25"
}
}
}
}
Lucene's Scoring Implementation
Lucene does not compute BM25 at scoring time. It uses several pre-computed values:
- Norm: document length normalization value (computed at index time).
- IDF: computed from the segment's statistics.
- Field boost: field weight.
Index-time values are stored, and queries compute fast via simple multiplication.
5. Analyzer: The Text Processing Pipeline
Why an Analyzer?
For "Search Engine" and "search engine" to be treated as the same term? For "running" and "run" to match as the same concept?
Answer: the analyzer processes text at index and query time.
Analyzer Structure
Input Text
↓
Character Filter (one or more)
↓
Tokenizer (exactly one)
↓
Token Filter (zero or more)
↓
Output Tokens
Character Filter
Character-level preprocessing:
- HTML strip: remove HTML tags.
- Mapping: character replacement (e.g. & → and).
- Pattern replace: regex replacement.
Tokenizer
Splits text into tokens (usually words):
- Standard: word boundary based, Unicode aware. Works for most languages.
- Whitespace: split by whitespace.
- N-gram: generate all n-grams (e.g. "search" → ["sea", "ear", "arc", ...]).
- Keyword: no splitting (for exact match).
- Language-specific: Chinese (IK, Smart Chinese), Japanese (Kuromoji), Korean (Nori).
Token Filter
Post-processing tokens:
- Lowercase: lowercase conversion.
- Stop: remove stop words ("the", "a", "is").
- Stemmer: stemming ("running" → "run").
- Synonym: synonym expansion.
- ASCII folding: "café" → "cafe".
- Word delimiter: split compound words.
Standard Analyzer Example
The default standard analyzer:
Input: "The Quick Brown Foxes!"
↓ Standard Tokenizer
[The, Quick, Brown, Foxes]
↓ Lowercase Filter
[the, quick, brown, foxes]
↓ Stop Filter (off by default)
[the, quick, brown, foxes]
Output: [the, quick, brown, foxes]
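The pipeline is easy to imitate. The rough Python sketch below mimics the character filter → tokenizer → token filter chain; the regexes are stand-ins, not Lucene's actual Unicode rules:
import re

STOPWORDS = {"the", "a", "an", "is", "of"}

def analyze(text, strip_html=True, remove_stopwords=False):
    if strip_html:
        text = re.sub(r"<[^>]+>", " ", text)           # character filter: html_strip
    tokens = [t for t in re.split(r"\W+", text) if t]  # tokenizer: crude "standard"
    tokens = [t.lower() for t in tokens]               # token filter: lowercase
    if remove_stopwords:                               # token filter: stop (off by default)
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(analyze("The Quick Brown Foxes!"))
# ['the', 'quick', 'brown', 'foxes']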
Korean Processing: Nori
Korean is hard to tokenize because of particles and endings. Nori is a Korean morphological analyzer:
{
"settings": {
"analysis": {
"analyzer": {
"my_nori": {
"tokenizer": "nori_tokenizer",
"filter": ["nori_readingform", "lowercase"]
}
}
}
}
}
Input: "고양이를 좋아합니다" → [고양이, 를, 좋아, 합니다] → particles removed → [고양이, 좋아]
Edge N-gram: Autocomplete
For autocomplete, edge n-gram is commonly used:
Input: "apple"
→ [a, ap, app, appl, apple]
When a user types "app", it matches the already-indexed "app". "apple" is found instantly.
Caution: index size grows significantly, since every n-gram is indexed rather than one token per word.
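Generating edge n-grams is trivial; this sketch mirrors what an edge_ngram token filter produces and makes the storage blow-up easy to reason about:
def edge_ngrams(term, min_gram=1, max_gram=20):
    # every prefix from min_gram up to max_gram (or the term length)
    return [term[:i] for i in range(min_gram, min(len(term), max_gram) + 1)]

print(edge_ngrams("apple"))
# ['a', 'ap', 'app', 'appl', 'apple'] -- 5 indexed tokens for 1 word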
6. Elasticsearch Distribution
Lucene is a search engine on a single machine. Elasticsearch extends it to a distributed cluster.
Index, Shard, Replica
- Index: a collection of documents (analogous to a DB table).
- Shard: a partition of the index. Each shard is a single Lucene index.
- Primary shard: the original.
- Replica shard: the copy.
Index "logs" (5 primary, 1 replica)
├── shard 0 (primary, node A) ← original
│ └── replica 0 (node B) ← copy
├── shard 1 (primary, node B)
│ └── replica 1 (node C)
├── shard 2 (primary, node C)
│ └── replica 2 (node A)
├── shard 3 (primary, node A)
│ └── replica 3 (node B)
└── shard 4 (primary, node B)
└── replica 4 (node C)
Shard Routing
Deciding which shard to put a document in:
shard = hash(_routing) % num_primary_shards
By default, _routing = document_id. Documents with the same ID always go to the same shard.
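The routing rule fits in a few lines. Elasticsearch actually hashes the routing key with Murmur3; the sketch below uses MD5 only because it is in the Python standard library, but the consequence is the same: the answer depends on the shard count.
import hashlib

def route(routing_key: str, num_primary_shards: int) -> int:
    # stable hash of the routing key, reduced modulo the primary shard count
    h = int.from_bytes(hashlib.md5(routing_key.encode()).digest()[:4], "big")
    return h % num_primary_shards

print(route("u_12345", 5))   # same key, same shard -- every time
print(route("u_12345", 10))  # change the shard count and the answer moves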
The Primary Shard Count Constraint
Problem: the primary shard count cannot be changed after index creation. Why?
Initial: 5 shards
hash(user_1) % 5 = 3 → stored on shard 3
If you change it to 10 shards:
hash(user_1) % 10 = 1 → look on shard 1 → not there!
Existing data would be searched on the wrong shard. Reindexing is required to resolve this.
Solutions: the Split API (in some cases), or start with enough shards from the beginning.
Replica Shards
Replicas can be added/removed anytime:
PUT /my_index/_settings
{
"index.number_of_replicas": 2
}
Replicas provide:
- High availability: if primary fails, a replica is promoted.
- Read scalability: distribute queries across replicas.
Cluster State and Master
A cluster has master nodes:
- Manage cluster state (shard assignment, mappings, etc.).
- Master election: Zen Discovery (before 7.0) or the newer Raft-like coordination algorithm (Elasticsearch 7+).
- Prevent two masters existing simultaneously (split brain).
Tip: run an odd number of master-eligible nodes (3, 5, 7) so a majority quorum can always be formed.
Query Execution: Scatter-Gather
When a search query arrives:
1. Coordinator node receives the request.
2. Broadcasts the query to every needed shard (primary or replica).
3. Each shard computes local top-K.
4. Coordinator gathers all results and picks the global top-K.
5. Fetches the full data for the selected documents.
6. Returns to the client.
This is called scatter-gather or two-phase query.
Query Pitfall: Deep Pagination
GET /logs/_search?from=9990&size=10
To get "results 10,000 to 10,010":
- Each shard computes top 10,010 (!).
- Coordinator gathers and sorts 10,010 items.
- Returns the range 9,990 to 10,010.
→ With from=1,000,000, each shard sorts a million entries. Memory explosion.
Solution: use the search_after API. Instead of global sort, continue from the previous result.
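A sketch of the search_after loop using the official Python client (elasticsearch-py); exact parameter names vary by client version, and process() is a hypothetical handler for each page:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

sort = [{"@timestamp": "asc"}, {"_id": "asc"}]   # tie-breaker keeps the cursor stable
cursor = None
while True:
    body = {"query": {"term": {"user_id": "u_12345"}}, "sort": sort, "size": 1000}
    if cursor:
        body["search_after"] = cursor
    hits = es.search(index="logs", body=body)["hits"]["hits"]
    if not hits:
        break
    process(hits)                 # hypothetical: handle this page of results
    cursor = hits[-1]["sort"]     # continue from the last hit's sort values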
Aggregation Execution
Aggregation queries (e.g. GROUP BY) also execute distributed:
GET /logs/_search
{
"size": 0,
"aggs": {
"by_user": {
"terms": { "field": "user_id", "size": 10 }
}
}
}
Each shard computes local top 10, then the coordinator merges them. Problem: the local top 10 may not be the global top 10. The coordinator requests more (e.g. top 100) to improve accuracy.
Use the shard_size parameter to tune:
"terms": { "field": "user_id", "size": 10, "shard_size": 100 }
7. Doc Values: Sorting and Aggregation
Why They Matter
The inverted index goes from term → docs. To answer "what is this document's field value?", you need the reverse direction.
Example: ORDER BY timestamp or GROUP BY country. You need each document's timestamp and country values.
Doc Values
Doc values are columnar storage:
timestamp column:
doc 0 → 2025-04-15 10:00:00
doc 1 → 2025-04-15 10:00:01
doc 2 → 2025-04-15 10:00:02
...
Each document's field value is stored in a contiguous array, which is ideal for sorting, aggregation, and scripting.
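The difference from the inverted index is easiest to see side by side: doc values are just per-field arrays indexed by doc ID, which is exactly what sorting and bucketing need (a conceptual sketch, not Lucene's on-disk format):
from collections import defaultdict

timestamps = [1713175200, 1713175201, 1713175202, 1713175199]  # one value per doc id
countries  = ["KR", "US", "KR", "DE"]

# ORDER BY timestamp: sort doc ids by their column value
order = sorted(range(len(timestamps)), key=lambda doc_id: timestamps[doc_id])

# GROUP BY country: bucket doc ids by their column value
buckets = defaultdict(list)
for doc_id, country in enumerate(countries):
    buckets[country].append(doc_id)

print(order)          # [3, 0, 1, 2]
print(dict(buckets))  # {'KR': [0, 2], 'US': [1], 'DE': [3]}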
Memory Use
Doc values are disk-based by default:
- mapped via mmap → leverages OS page cache.
- Cached in RAM if frequently used, on disk otherwise.
Elasticsearch's field data was the old in-memory version. Doc values are much more efficient, so they are now the default.
Exception: text Fields
text fields do not store doc values by default:
- Only analyzed tokens are stored → original reconstruction is hard.
- To aggregate, use a keyword subfield.
{
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": { "type": "keyword" }
}
}
}
}
Search on name. Aggregate/sort on name.keyword.
Sparse Doc Values
Lucene supports sparse representation for cases with many missing fields (e.g. optional fields). Space-efficient.
8. Mapping and Dynamic Mapping
What Is Mapping?
Mapping defines each field's type and properties:
{
"mappings": {
"properties": {
"title": { "type": "text" },
"user_id": { "type": "keyword" },
"timestamp": { "type": "date" },
"price": { "type": "double" },
"location": { "type": "geo_point" }
}
}
}
text vs keyword
The most important distinction:
- text: analyzed (tokenized). For full-text search.
- keyword: not analyzed. For exact match, aggregation, and sorting.
Emails, IPs, tags are almost always keyword. Body, description, title are text (optionally with keyword as well).
Dynamic Mapping
Field types are auto-inferred on first indexing:
POST /my_index/_doc
{
"name": "Alice",
"age": 30,
"active": true
}
Pro: quick start. Con: the inferred type may not be what you expect, and once set, a field's type cannot be changed without reindexing.
Recommendation: use explicit mapping in production. Dynamic mapping is for dev/test.
Mapping Explosion
Each field carries memory overhead. For example:
{
"user_data": {
"user_1": { "action": "login" },
"user_2": { "action": "logout" }
}
}
Each user_id dynamically creates a new field. Millions of fields → mapping explosion. Elasticsearch can go down in minutes.
Fixes:
- Restructure the document: { "user_id": "user_1", "action": "login" }.
- Or use the flattened field type.
- Limit field count via index.mapping.total_fields.limit (default 1000).
9. Production Tuning
Bulk Indexing
Goal: load data as fast as possible.
PUT /my_index/_settings
{
"index": {
"refresh_interval": "-1",
"number_of_replicas": 0,
"translog": {
"durability": "async",
"sync_interval": "30s"
}
}
}
Index in batches with the Bulk API:
POST /_bulk
{ "index": { "_index": "my_index" } }
{ "field1": "value1" }
{ "index": { "_index": "my_index" } }
{ "field2": "value2" }
5–15MB per bulk is the sweet spot. Too small → overhead; too large → memory pressure.
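If you index from Python, the client's bulk helper batches documents into _bulk requests for you. A rough sketch (parameter names and defaults depend on the elasticsearch-py version you run):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(rows):
    for row in rows:
        # one action per document; the helper groups them into _bulk requests
        yield {"_index": "my_index", "_source": row}

rows = ({"field1": f"value{i}"} for i in range(100_000))
helpers.bulk(es, actions(rows), chunk_size=5_000)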
After indexing:
PUT /my_index/_settings
{
"index": {
"refresh_interval": "1s",
"number_of_replicas": 1
}
}
POST /my_index/_forcemerge?max_num_segments=1
_forcemerge down to a single segment maximizes search performance for indices that are no longer written to.
JVM Heap
Critical rule: keep heap under 32GB.
Reason: above roughly 32GB, the JVM's compressed oops (compressed object pointers) are disabled and memory efficiency drops sharply.
# jvm.options
-Xms16g
-Xmx16g
The rest of memory goes to OS page cache. Lucene heavily uses page cache. Best setup:
- 64GB server memory
- JVM heap 16–31GB
- OS cache 33–48GB
Shard Count
Bad example: 1,000 daily indices × 5 primary shards = 5,000 shards. Master overhead explodes.
Rules:
- 20–40GB per shard is ideal.
- No more than 20 shards per GB of JVM heap.
- Small indices are fine with a single shard.
Hot-Warm Architecture
Useful pattern for time-series data:
- Hot nodes: recent data, fast SSDs, active indexing/search.
- Warm nodes: older data, HDDs, read-only.
- Cold nodes: very old data, occasionally queried.
Automate with Index Lifecycle Management (ILM):
{
"policy": {
"phases": {
"hot": { "actions": {} },
"warm": { "min_age": "7d", "actions": { "allocate": { "require": { "data": "warm" } } } },
"cold": { "min_age": "30d", "actions": { "allocate": { "require": { "data": "cold" } } } },
"delete": { "min_age": "90d", "actions": { "delete": {} } }
}
}
}
Data Streams
Elasticsearch 7.9+ data streams are a high-level abstraction for time-series data:
- Auto-creates backing indices (e.g. .ds-logs-2025.04.15-000001).
- Auto-rollover (by size/age).
- Integrates with ILM.
PUT /_data_stream/logs-app
Kibana logs, APM, and monitoring all use data streams.
10. Common Pitfalls and Debugging
Pitfall 1: Too Many Shards
Symptoms: slow cluster state updates, high master node CPU, slow queries.
Cause: thousands of small shards.
Fixes:
- Merge/shrink older indices.
- Automate via ILM.
- Reduce shard count via the Shrink API.
Pitfall 2: Mapping Explosion
Symptoms: OOM, indexing failures, cluster instability.
Cause: unbounded dynamic field creation.
Fixes:
- Explicit mapping.
- Use "dynamic": "strict" to reject unexpected new fields.
- Fix the data structure at the application level.
Pitfall 3: Deep Pagination
Symptoms: memory/time explosion on queries with large from values.
Fixes:
- Use search_after.
- Or the scroll API (for bulk extraction).
- Block deep pagination in the UI.
Pitfall 4: Fielddata on Text
Symptoms: aggregation on a text field → error or massive memory use.
Fixes:
- Aggregate on a keyword subfield (e.g. name.keyword) instead.
- Or enable fielddata: true (memory-hungry, not recommended).
Pitfall 5: Refresh Abuse
Symptoms: ?refresh=true on every request to make docs searchable immediately.
Cause: new segment per indexing call → excessive merge load.
Fixes:
- Refresh only when needed.
- Replace with ?refresh=wait_for (waits for the next scheduled refresh).
Pitfall 6: Complex Nested Queries
Symptoms: queries on nested fields are slow.
Cause: nested fields are stored as "hidden sub-documents" and searched separately.
Fixes:
- Denormalize (flatten). Accept duplication.
- Or use the join type for one-to-many relations (at a performance cost).
11. Lucene and Competing Technologies
OpenSearch
When Elasticsearch moved from Apache 2.0 to the SSPL/Elastic License in 2021, AWS forked it as OpenSearch. Still Lucene-based, though the feature sets are starting to diverge.
Apache Solr
Solr, predating Elasticsearch, is also Lucene-based. Originally enterprise-focused. Elasticsearch now leads in market share.
Meilisearch
A lightweight search engine written in Rust. Does not use Lucene; implements its own. Very fast at smaller scale. Built-in typo tolerance (fuzzy matching).
Typesense
A lightweight engine written in C++ (not Lucene-based), positioned as an Algolia alternative.
The Role of ClickHouse
If "search" really means analytical queries rather than traditional search, ClickHouse is far faster. Log analytics are increasingly moving from Elasticsearch to ClickHouse.
Differences:
- Elasticsearch: free-text search, diverse queries, complex scoring.
- ClickHouse: aggregation, SQL, large-scale analytics.
Vector Search (see earlier post)
Elasticsearch 8+ also supports vector search (HNSW). See the earlier post on ANN algorithms.
Quiz Review
Q1. Why are Lucene segments immutable, and what are the benefits?
A. Even when documents are added/deleted/updated, existing segments are never modified. Instead, new segments are created, and deletions are marked via tombstones.
Benefits:
- Lock-free search: read-only, so thousands of concurrent queries run without locks.
- Cache stability: OS page cache safely caches immutable files. No cache invalidation concerns.
- Simple replication: just copy files. No state synchronization.
- Merge safety: background merges don't interfere with live searches. Atomic swap on completion.
- Optimized write path: writes are always append. Sequential I/O is fast.
Costs:
- Deletion only leaves a tombstone; actual reclamation is deferred to merge.
- Update = delete + add → tombstone accumulation.
- Merge cost (I/O, CPU).
This is the same philosophy as LSM-Trees. "Writes are append, cleanup is batched later" is a fundamental pattern of modern storage systems.
Q2. What are the differences and roles of refresh, flush, and merge?
A.
Refresh (~1 second interval):
- Converts memory buffer into an in-memory segment.
- Documents become searchable (the source of Near Real-Time).
- Not yet fsynced → no durability.
- Cost: moderate. Creates a new small segment each time.
Flush (minutes or 512MB):
- Fsyncs in-memory segments to disk.
- Empties the translog.
- Data becomes persistent.
- Cost: high (actual disk I/O).
Merge (background):
- Merges multiple small segments into a larger one.
- Tombstoned documents are actually removed.
- Maintains search performance (fewer segments = faster).
- Cost: very high (disk read + write).
Analogy:
- Refresh = jotting into your notebook (just a memo).
- Flush = filing the notebook into the cabinet (permanent storage).
- Merge = quarterly file reorganization (dedup, efficiency).
Why separate: each must run at a different cadence to be efficient. Search needs frequent refresh, but fsync is expensive so it cannot be frequent. Merge can run slowly in the background. Separating these three stages is how Elasticsearch achieves both performance and durability.
Q3. What is the difference between text and keyword fields, and why do you need both?
A.
text field:
- An analyzer is applied.
- Tokenization, lowercase, stemming, stop-word removal, etc.
- "The Quick Brown Fox" → ["quick", "brown", "fox"].
- Optimal for full-text search.
- Cannot sort/aggregate (original lost after analysis).
keyword field:
- Not analyzed. String is kept as-is.
- "The Quick Brown Fox" → ["The Quick Brown Fox"] (a single value).
- Used for exact match, sorting, aggregation.
- Useful for high-cardinality values.
Why both: the same field often has different query patterns. For example, a product name:
- Full-text search for "iPhone" → text.
- Aggregation by exact name "iPhone 15 Pro" → keyword.
Hence multi-field mapping:
{
"product_name": {
"type": "text",
"fields": {
"keyword": { "type": "keyword" }
}
}
}
Now search on product_name, aggregate on product_name.keyword. Same data, two indexing styles.
Rules of thumb:
- IP, email, tag, ID, URL → keyword.
- Body, description, title, comments → text.
- Product name, user name where both needed → text + keyword subfield.
Without this distinction, confusion like "aggregation doesn't work" or "fielddata error" arises. It's the most basic principle of mapping design.
Q4. How does BM25 improve on TF-IDF?
A. It addresses two weaknesses of TF-IDF:
Weakness 1: Linearity of TF
TF-IDF uses term frequency linearly: score ∝ TF.
- "search" 1 time vs 100 times → 100x difference.
- Reality: 1 → 10 is a big jump, but 90 → 100 is almost meaningless.
BM25 fix: a saturation curve.
f(tf) = tf × (k1+1) / (tf + k1)
Growth slows as TF rises: with k1 = 1.2, tf = 1 gives f = 1.0, tf = 10 gives f ≈ 1.96, and tf = 100 gives f ≈ 2.17, approaching the cap of k1 + 1 = 2.2. This also blunts keyword-stuffing spam.
Weakness 2: Ignoring document length
TF-IDF treats "'search' 2 times in a 100-word document" and "'search' 2 times in a 10,000-word document" equally. But matches in short documents are more "relevant".
BM25 fix: length normalization.
length_norm = 1 - b + b × (|d| / avgdl)
- |d|: this document's length.
- avgdl: average document length.
- b: normalization strength (0–1, default 0.75).
Discounts TF for long documents. b=1 is full normalization, b=0 ignores length.
Combined BM25 formula:
BM25 = IDF × (tf × (k1+1)) / (tf + k1 × (1 - b + b × |d|/avgdl))
Complex-looking, but the meaning is clear: "IDF-weighted score with saturated TF discounted by document length".
In practice defaults (k1=1.2, b=0.75) satisfy most cases. For short documents like tweets lower b; for long articles raise k1. BM25 becoming Lucene's default in Lucene 6 was no accident — it's a formula proven over 20 years.
Q5. Why did Elasticsearch make the primary shard count unchangeable after index creation?
A. Because of shard routing. Which shard a document lands on is determined by:
shard_id = hash(routing_key) % num_primary_shards
By default routing_key = document_id. Then:
- At indexing: hash(doc_1) % 5 = 3 → stored on shard 3.
- At search: hash(doc_1) % 5 = 3 → queried on shard 3.
The problem: if you change from 5 to 10 shards:
- New calculation: hash(doc_1) % 10 = 1 → queried on shard 1.
- But the document was actually stored on shard 3 → not found.
All existing documents would be searched on the wrong shard. Equivalent to data loss.
Theoretical solutions:
- Consistent hashing: most documents stay put with partial changes. Still some migration needed.
- Reindex: create a new index and copy all documents. Time/resource intensive.
- Dual-write period: index into the new one in parallel. Gradual transition.
Elasticsearch's practical solutions:
- Split API: only N → N*k (2x, 3x, etc.). Each shard splits internally.
- Shrink API: shrink from N → N/k.
- Reindex API: the general case. Create a new index and copy.
- Data streams + ILM: time-series data auto-creates new indices, so no problem.
Design recommendations:
- Enough shards from the start. Easier than growing later.
- Rule: 20–40GB per shard. 1.5–2x projected future growth.
- Time-series: data streams + daily/weekly indices.
This constraint is a main reason "why Elasticsearch operations are tricky". Bad initial design means costly reindexing. Hence decide the shard count carefully before production.
Conclusion: 20 Years of Engineering
Key Takeaways
- Inverted index + FST: the foundation of search.
- Segments are immutable: the secret of lock-free search.
- Refresh/Flush/Merge: three-stage performance/durability balance.
- BM25: the modern scoring standard.
- Analyzer: the text processing pipeline.
- Shard + Replica: the basics of distribution.
- Doc Values: columnar storage for sorting/aggregation.
- Hot-Warm + ILM: time-series data management.
Lessons in Operating Elasticsearch
- Trust defaults, but understand them. They are usually good but may not fit your situation.
- Heap under 32GB. The single most important rule.
- Design shards well upfront. Hard to change later.
- Explicit mapping. Dynamic mapping is an invitation to disaster.
- Refresh only as needed. Turn it off for bulk indexing.
- Measure: _cat/segments, _cat/shards, _cluster/health.
Lucene, a Treasure
Lucene has been developed since the early 2000s as a Java library. Twenty years of careful, hand-tuned optimization by many engineers. Compression algorithms, data structures, file formats — everything is finely tuned.
Elasticsearch, Solr, OpenSearch, Kibana, and even some SaaS search services sit on top of Lucene. After Google Search, it's likely the search engine you use most often.
A Final Lesson
Search looks easy but gets hard once you dig deep. To answer "why is this query slow", "why is memory running out", or "why is the cluster unstable", you have to know the internals.
Having read this post, you now:
- Know Lucene's segment and file structure.
- Understand refresh vs flush.
- Know why BM25 is good.
- Understand the importance of shard count.
- Know why Hot-Warm exists.
Next time you work with an Elasticsearch cluster, this knowledge will steer your decisions for the better. And when issues arise, you'll be able to answer "why?". That is the power of a real engineer.
References
- Lucene: The Internal Workings
- Elasticsearch: The Definitive Guide - older but great conceptual explanations
- Elastic Blog: Anatomy of an Elasticsearch Cluster
- Lucene Index File Formats
- Okapi BM25 (Robertson & Zaragoza, 2009) - mathematical review of BM25
- Elasticsearch: Designing for Scale
- OpenSearch Documentation - Elasticsearch fork
- Introduction to Information Retrieval (Manning, Raghavan, Schütze) - classic IR textbook
- Finite State Transducers for Fast Text Processing - FST explanation
- Why do I need replicas? Everything you need to know about Elasticsearch Shards