Elasticsearch Complete Guide 2025: From Search Engine to Log Analytics & Vector Search

Author: Youngju Kim (@fjvbn20031)
1. What is Elasticsearch
Elasticsearch is a distributed search and analytics engine built on Apache Lucene. Since its first release by Shay Banon in 2010, it has expanded from full-text search to log analytics, metrics monitoring, security analytics, and most recently, vector search.
1.1 Why Elasticsearch
Running LIKE '%keyword%' in a traditional RDBMS requires a full table scan. With hundreds of millions of rows, response times can reach tens of seconds. Elasticsearch uses an inverted index structure to deliver millisecond-level search responses.
Key Features:
- Distributed Architecture: Horizontal scaling (scale-out) handling tens of TB of data
- Near Real-Time Search: Documents become searchable within ~1 second of indexing
- Schemaless: Index JSON documents directly with dynamic mapping support
- RESTful API: All operations via HTTP/JSON
- Rich Ecosystem: Integration with Kibana, Logstash, Beats, APM
1.2 Elasticsearch vs RDBMS Comparison
| Aspect | RDBMS | Elasticsearch |
|---|---|---|
| Data Unit | Row | Document (JSON) |
| Collection | Table | Index |
| Column | Column | Field |
| Schema | Fixed Schema | Dynamic Mapping |
| Search Method | B-Tree Index | Inverted Index |
| Scaling | Scale-up oriented | Scale-out (Sharding) |
| Transactions | ACID support | Not supported |
| Primary Use | OLTP | Search, Analytics, Logging |
1.3 Version History
ES 1.x (2014) - Initial stabilization
ES 2.x (2015) - Pipeline Aggregation
ES 5.x (2016) - Lucene 6, Painless scripting
ES 6.x (2017) - Single type per index
ES 7.x (2019) - Type removal, adaptive replica selection
ES 8.x (2022) - Security by default, kNN vector search, NLP integration
2. Inverted Index Internals
The secret behind Elasticsearch's search speed is the inverted index. While a forward index maps each document to the words it contains, an inverted index maps each word to the documents that contain it.
2.1 Inverted Index Structure
Given three documents:
Doc 1: "The quick brown fox"
Doc 2: "The quick brown dog"
Doc 3: "The lazy brown fox"
The inverted index is built as follows:
| Term | Document IDs |
|---|---|
| the | [1, 2, 3] |
| quick | [1, 2] |
| brown | [1, 2, 3] |
| fox | [1, 3] |
| dog | [2] |
| lazy | [3] |
Searching for "fox" instantly returns Doc 1 and Doc 3. Lookup cost depends on the size of the term dictionary, not on the number of documents, so a term lookup stays close to O(1) even as the corpus grows.
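The table above can be reproduced in a few lines of Python. This is a conceptual sketch of index construction and term lookup, not Lucene's actual FST-backed term dictionary:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The quick brown fox",
    2: "The quick brown dog",
    3: "The lazy brown fox",
}
index = build_inverted_index(docs)

# A term query is a single dictionary lookup, independent of corpus size.
print(index["fox"])    # [1, 3]
print(index["brown"])  # [1, 2, 3]
```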
2.2 Lucene Segment Architecture
Lucene inside Elasticsearch stores data in segments:
Index
└── Shard (Lucene Index)
    ├── Segment 0 (immutable)
    │   ├── Inverted Index
    │   ├── Stored Fields
    │   ├── Doc Values
    │   └── Term Vectors
    ├── Segment 1 (immutable)
    ├── Segment 2 (immutable)
    └── Commit Point (segments_N)
Key Points:
- Segments are immutable once created
- Document deletion marks entries in a `.del` file rather than physically removing them
- Background segment merging occurs periodically
- A new segment is created on each refresh (default 1 second), making new documents searchable
2.3 Doc Values and Fielddata
Inverted indices are optimized for text search but not for sorting or aggregations:
Inverted Index (search): Term → Document IDs
Doc Values (agg/sort): Document ID → Values
// Doc Values example
{
"doc_1": { "price": 100, "category": "electronics" },
"doc_2": { "price": 200, "category": "books" },
"doc_3": { "price": 150, "category": "electronics" }
}
- Doc Values: automatically enabled for `keyword`, numeric, `date`, `ip`, and `geo_point` types. Disk-based.
- Fielddata: used for sorting/aggregations on `text` fields. Heap memory-based; use with caution.
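The split can be illustrated with the example documents above: the same data pivoted into per-field columns makes aggregation a pure column scan. A conceptual sketch (not Lucene's on-disk encoding):

```python
from collections import defaultdict

docs = {
    "doc_1": {"price": 100, "category": "electronics"},
    "doc_2": {"price": 200, "category": "books"},
    "doc_3": {"price": 150, "category": "electronics"},
}

# Doc Values: column-oriented, doc ID -> value, ideal for sorting/aggregating.
doc_values = {
    field: {doc_id: fields[field] for doc_id, fields in docs.items()}
    for field in ("price", "category")
}

# Average price per category reads only the two columns it needs.
totals, counts = defaultdict(int), defaultdict(int)
for doc_id, cat in doc_values["category"].items():
    totals[cat] += doc_values["price"][doc_id]
    counts[cat] += 1
avg = {cat: totals[cat] / counts[cat] for cat in totals}
print(avg)  # {'electronics': 125.0, 'books': 200.0}
```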
3. Mapping and Analyzers
3.1 Mapping Definition
Mapping defines the schema of an index, specifying field types and analysis methods.
PUT /products
{
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"price": {
"type": "float"
},
"category": {
"type": "keyword"
},
"description": {
"type": "text"
},
"created_at": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
},
"location": {
"type": "geo_point"
},
"tags": {
"type": "keyword"
},
"metadata": {
"type": "object"
},
"reviews": {
"type": "nested",
"properties": {
"author": { "type": "keyword" },
"rating": { "type": "integer" },
"comment": { "type": "text" }
}
}
}
}
}
3.2 Core Field Types
| Type | Description | Inverted Index | Doc Values |
|---|---|---|---|
| text | Full-text search (analyzer applied) | Yes | No (Fielddata) |
| keyword | Exact match, sorting, aggregation | Yes | Yes |
| long/integer/float | Numeric | Yes (BKD Tree) | Yes |
| date | Date/time | Yes | Yes |
| boolean | true/false | Yes | Yes |
| geo_point | Latitude/longitude | Yes | Yes |
| nested | Nested objects (indexed as separate hidden documents) | Yes | Yes |
| object | JSON objects (flattened into dotted fields) | Yes | Yes |
| dense_vector | Vectors (kNN search) | No | No (separate storage) |
3.3 Analyzer Pipeline
An analyzer consists of three stages:
Text Input
│
▼
Character Filter
│ - html_strip: Remove HTML tags
│ - mapping: Character substitution
│ - pattern_replace: Regex replacement
▼
Tokenizer
│ - standard: Unicode word boundary
│ - whitespace: Whitespace-based split
│ - ngram: N-gram token generation
│ - edge_ngram: Prefix-based tokens
│ - language-specific (e.g., nori for Korean, kuromoji for Japanese)
▼
Token Filter
│ - lowercase: Convert to lowercase
│ - stop: Remove stop words
│ - synonym: Synonym expansion
│ - stemmer: Stem extraction
▼
Token Stream (Terms)
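The pipeline can be mimicked with a toy analyzer: an html_strip-style character filter, a whitespace tokenizer, and lowercase/stop token filters. The stop list and regex here are simplified stand-ins for the real implementations:

```python
import re

STOPWORDS = {"is", "a", "the"}  # tiny stand-in for the _english_ stop list

def char_filter(text):
    """Strip HTML tags, like the html_strip character filter."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Whitespace tokenizer (the standard tokenizer is Unicode-aware)."""
    return text.split()

def token_filters(tokens):
    """Lowercase, then drop stop words."""
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

print(analyze("Elasticsearch is a <b>powerful</b> search engine"))
# ['elasticsearch', 'powerful', 'search', 'engine']
```

The output matches the _analyze response shown in section 3.5.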
3.4 Custom Analyzer Example
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": {
"html_cleaner": {
"type": "html_strip",
"escaped_tags": ["b", "i"]
}
},
"tokenizer": {
"custom_ngram": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": ["letter", "digit"]
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_synonyms": {
"type": "synonym",
"synonyms": [
"quick,fast,speedy",
"big,large,huge"
]
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": ["html_cleaner"],
"tokenizer": "standard",
"filter": ["lowercase", "english_stop", "custom_synonyms"]
},
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "custom_ngram",
"filter": ["lowercase"]
}
}
}
}
}
3.5 Analyze API
POST /my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Elasticsearch is a <b>powerful</b> search engine"
}
// Response
{
"tokens": [
{ "token": "elasticsearch", "position": 0 },
{ "token": "powerful", "position": 3 },
{ "token": "search", "position": 4 },
{ "token": "engine", "position": 5 }
]
}
4. Mastering Query DSL
Elasticsearch Query DSL is a JSON-based query language. It is divided into Query Context (relevance scoring) and Filter Context (yes/no matching with caching).
4.1 Match Queries (Full-Text Search)
// Basic match - tokenized through analyzer
GET /products/_search
{
"query": {
"match": {
"description": "powerful search engine"
}
}
}
// match_phrase - word order matters
GET /products/_search
{
"query": {
"match_phrase": {
"description": {
"query": "search engine",
"slop": 1
}
}
}
}
// multi_match - search across multiple fields
GET /products/_search
{
"query": {
"multi_match": {
"query": "elasticsearch guide",
"fields": ["title^3", "description", "tags^2"],
"type": "best_fields"
}
}
}
4.2 Term Queries (Exact Match)
// term - exact match on keyword fields
GET /products/_search
{
"query": {
"term": {
"category": "electronics"
}
}
}
// terms - match any of multiple values
GET /products/_search
{
"query": {
"terms": {
"status": ["published", "pending"]
}
}
}
// range - range queries
GET /products/_search
{
"query": {
"range": {
"price": {
"gte": 100,
"lte": 500
}
}
}
}
// exists - field existence check
GET /products/_search
{
"query": {
"exists": {
"field": "discount"
}
}
}
4.3 Bool Query (Compound Query)
GET /products/_search
{
"query": {
"bool": {
"must": [
{ "match": { "description": "search engine" } }
],
"must_not": [
{ "term": { "status": "deleted" } }
],
"should": [
{ "term": { "featured": true } },
{ "range": { "rating": { "gte": 4.5 } } }
],
"filter": [
{ "term": { "category": "software" } },
{ "range": { "price": { "lte": 1000 } } }
],
"minimum_should_match": 1
}
}
}
Bool Query Clauses:
| Clause | Scoring | Caching | Purpose |
|---|---|---|---|
| must | Yes | No | Must match + score contribution |
| must_not | No | Yes | Must not match |
| should | Yes | No | Bonus score if matched |
| filter | No | Yes | Must match (no scoring) |
4.4 Nested Query
GET /products/_search
{
"query": {
"nested": {
"path": "reviews",
"query": {
"bool": {
"must": [
{ "term": { "reviews.author": "john" } },
{ "range": { "reviews.rating": { "gte": 4 } } }
]
}
},
"inner_hits": {
"size": 3,
"highlight": {
"fields": {
"reviews.comment": {}
}
}
}
}
}
}
4.5 Function Score Query
GET /products/_search
{
"query": {
"function_score": {
"query": { "match": { "title": "elasticsearch" } },
"functions": [
{
"field_value_factor": {
"field": "popularity",
"modifier": "log1p",
"factor": 2
}
},
{
"gauss": {
"created_at": {
"origin": "now",
"scale": "30d",
"decay": 0.5
}
}
},
{
"filter": { "term": { "featured": true } },
"weight": 5
}
],
"score_mode": "sum",
"boost_mode": "multiply"
}
}
}
4.6 Highlighting
GET /products/_search
{
"query": {
"match": { "description": "search engine" }
},
"highlight": {
"pre_tags": ["<strong>"],
"post_tags": ["</strong>"],
"fields": {
"description": {
"fragment_size": 150,
"number_of_fragments": 3
}
}
}
}
5. Aggregations Deep Dive
Elasticsearch aggregations are analogous to SQL GROUP BY but far more powerful. They are classified into bucket, metric, and pipeline aggregations.
5.1 Bucket Aggregations
// terms aggregation - document count by category
GET /products/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": {
"field": "category",
"size": 20,
"order": { "_count": "desc" }
}
}
}
}
// date_histogram - time-based aggregation
GET /logs/_search
{
"size": 0,
"aggs": {
"logs_per_hour": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "1h",
"time_zone": "UTC",
"min_doc_count": 0
}
}
}
}
// range aggregation - price brackets
GET /products/_search
{
"size": 0,
"aggs": {
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{ "to": 100 },
{ "from": 100, "to": 500 },
{ "from": 500 }
]
}
}
}
}
// nested aggregation - average price by category
GET /products/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": { "field": "category" },
"aggs": {
"avg_price": {
"avg": { "field": "price" }
},
"price_stats": {
"stats": { "field": "price" }
}
}
}
}
}
5.2 Metric Aggregations
GET /products/_search
{
"size": 0,
"aggs": {
"avg_price": { "avg": { "field": "price" } },
"max_price": { "max": { "field": "price" } },
"min_price": { "min": { "field": "price" } },
"total_sales": { "sum": { "field": "sales_count" } },
"unique_brands": { "cardinality": { "field": "brand" } },
"price_percentiles": {
"percentiles": {
"field": "price",
"percents": [25, 50, 75, 90, 99]
}
}
}
}
5.3 Pipeline Aggregations
GET /sales/_search
{
"size": 0,
"aggs": {
"monthly_sales": {
"date_histogram": {
"field": "date",
"calendar_interval": "month"
},
"aggs": {
"total_revenue": {
"sum": { "field": "revenue" }
}
}
},
"avg_monthly_revenue": {
"avg_bucket": {
"buckets_path": "monthly_sales>total_revenue"
}
},
"max_monthly_revenue": {
"max_bucket": {
"buckets_path": "monthly_sales>total_revenue"
}
},
"revenue_derivative": {
"derivative": {
"buckets_path": "monthly_sales>total_revenue"
}
},
"moving_avg_revenue": {
"moving_fn": {
"buckets_path": "monthly_sales>total_revenue",
"window": 3,
"script": "MovingFunctions.unweightedAvg(values)"
}
}
}
}
6. ELK Stack (Logstash, Kibana, Beats)
6.1 ELK Stack Architecture
Data Sources Collection Processing Storage Visualization
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ App Log │──────────│ Filebeat │──────│ Logstash │──────│ Elastic │──────│ Kibana │
│ Server │ │ Metricbt │ │ Pipeline │ │ search │ │Dashboard │
│ Docker │ │ Heartbt │ │ (Filter) │ │ Cluster │ │ Lens/Map │
│ K8s │ │ Packetbt │ │ │ │ │ │ Alerting │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
6.2 Logstash Pipeline
# /etc/logstash/conf.d/main.conf
input {
beats {
port => 5044
}
kafka {
bootstrap_servers => "kafka:9092"
topics => ["app-logs"]
codec => json
}
}
filter {
# Grok pattern for unstructured log parsing
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
}
}
# Date parsing
date {
match => ["timestamp", "ISO8601"]
target => "@timestamp"
}
# GeoIP enrichment
geoip {
source => "client_ip"
target => "geo"
}
# Conditional processing
if [level] == "ERROR" {
mutate {
add_tag => ["alert"]
add_field => { "severity" => "high" }
}
}
# Remove unnecessary fields
mutate {
remove_field => ["agent", "ecs", "host"]
}
}
output {
elasticsearch {
hosts => ["http://elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
user => "elastic"
password => "changeme"
}
# Error logs to a separate index
if "alert" in [tags] {
elasticsearch {
hosts => ["http://elasticsearch:9200"]
index => "alerts-%{+YYYY.MM.dd}"
}
}
}
6.3 Filebeat Configuration
# /etc/filebeat/filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/app/*.log
multiline:
pattern: '^\d{4}-\d{2}-\d{2}'
negate: true
match: after
- type: container
paths:
- /var/lib/docker/containers/*/*.log
processors:
- add_docker_metadata: ~
output.logstash:
hosts: ["logstash:5044"]
6.4 Kibana Features
- Discover: Log search, filtering, time-based distribution
- Dashboard: Combine multiple visualizations
- Lens: Drag-and-drop visualization builder
- Maps: Geographic data visualization
- Alerting: Condition-based alerts (Slack, Email, PagerDuty)
- APM: Application Performance Monitoring
- Security: SIEM capabilities, security event analysis
- Dev Tools: Direct API calls from the console
7. Vector Search and kNN
Native vector search in Elasticsearch 8.x enables semantic search capabilities.
7.1 What is Vector Search
Traditional keyword search relies on exact word matching. Searching for "car" will not find "vehicle" or "automobile." Vector search converts text into high-dimensional vectors and calculates semantic similarity.
"car" → [0.12, -0.34, 0.56, ..., 0.89] (768 dimensions)
"vehicle" → [0.13, -0.32, 0.55, ..., 0.88] (similar vector)
"fruit" → [-0.45, 0.67, -0.12, ..., 0.23] (different vector)
7.2 kNN Search Index Setup
PUT /semantic-search
{
"mappings": {
"properties": {
"title": { "type": "text" },
"content": { "type": "text" },
"content_vector": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine",
"index_options": {
"type": "hnsw",
"m": 16,
"ef_construction": 200
}
}
}
}
}
7.3 Performing kNN Search
// Pure kNN search
GET /semantic-search/_search
{
"knn": {
"field": "content_vector",
"query_vector": [0.12, -0.34, 0.56],
"k": 10,
"num_candidates": 100
}
}
// Hybrid search (keyword + kNN)
GET /semantic-search/_search
{
"query": {
"match": {
"content": "car recommendation"
}
},
"knn": {
"field": "content_vector",
"query_vector": [0.12, -0.34, 0.56],
"k": 10,
"num_candidates": 100,
"boost": 0.5
},
"size": 10
}
7.4 HNSW Algorithm
HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor (ANN) algorithm used for kNN search:
Layer 2 (sparse):  A ────────── B
                   │            │
Layer 1 (medium):  A ── C ───── B ── D
                   │    │       │    │
Layer 0 (dense):   A─E──C──F────B─G──D──H
| Parameter | Description | Default | Trade-off |
|---|---|---|---|
| m | Connections per node | 16 | Higher = more accurate but more memory |
| ef_construction | Candidate list size during build | 200 | Higher = more accurate but slower indexing |
| ef | Candidate list size during query (exposed as num_candidates in ES) | 100 | Higher = more accurate but slower search |
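The core of HNSW's descent can be sketched as a greedy best-first walk on a single layer: hop to whichever neighbor is closer to the query, and stop at a local minimum. The graph and points below are hand-built toys; the real algorithm repeats this across layers and keeps ef candidates instead of one:

```python
import math

def greedy_search(graph, points, query, entry):
    """Walk the neighbor graph toward the query, one best hop at a time,
    and return the node where no neighbor is any closer (local minimum)."""
    current = entry
    while True:
        best = min(graph[current],
                   key=lambda n: math.dist(points[n], query),
                   default=current)
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

# Tiny hand-built neighbor graph over 1-D points.
points = {0: (0.0,), 1: (2.0,), 2: (4.0,), 3: (6.0,), 4: (8.0,)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(greedy_search(graph, points, query=(5.1,), entry=0))  # 3
```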
7.5 NLP Model Integration (Elastic ELSER)
// Download the ELSER v2 model
PUT /_ml/trained_models/.elser_model_2
{
"input": {
"field_names": ["text_field"]
}
}
// Start a deployment so the model can serve inference
POST /_ml/trained_models/.elser_model_2/deployment/_start
// Auto-vectorize with Ingest Pipeline
PUT /_ingest/pipeline/elser-pipeline
{
"processors": [
{
"inference": {
"model_id": ".elser_model_2",
"input_output": [
{
"input_field": "content",
"output_field": "content_embedding"
}
]
}
}
]
}
8. Cluster Operations and Management
8.1 Cluster Architecture
┌─────────────────────────────────────────────────┐
│ Elasticsearch Cluster │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌────────────┐│
│ │ Master Node │ │ Master Node │ │Master Node ││
│ │ (Elected) │ │ (Eligible) │ │(Eligible) ││
│ └─────────────┘ └─────────────┘ └────────────┘│
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌────────────┐│
│ │ Data Node │ │ Data Node │ │ Data Node ││
│ │ (Hot Tier) │ │ (Hot Tier) │ │(Warm Tier) ││
│ │ SSD 1TB │ │ SSD 1TB │ │ HDD 4TB ││
│ └─────────────┘ └─────────────┘ └────────────┘│
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Ingest Node │ │ Coord Node │ │
│ │ (Pipeline) │ │ (Routing) │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────┘
8.2 Shard Strategy
PUT /logs-2025.03
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"routing.allocation.require.data": "hot"
}
}
Shard Sizing Guidelines:
- Recommended 10-50GB per shard
- Shards per node: no more than 20 shards per 1GB of JVM heap
- Total cluster shards: every additional thousand shards adds cluster-state overhead, so verify master node resources as the count grows
- Index pattern: use date-based indices for time-series data
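The guidelines above translate into simple arithmetic; the data volume and heap figures here are hypothetical:

```python
def shards_for(total_gb, target_shard_gb=30, replicas=1):
    """Primary shard count from a target shard size (10-50 GB guideline),
    plus the total shard count once replicas are included."""
    primaries = max(1, -(-total_gb // target_shard_gb))  # ceiling division
    return primaries, primaries * (1 + replicas)

def max_shards_per_node(heap_gb, shards_per_gb=20):
    """Rule of thumb from the guidelines: at most 20 shards per GB of heap."""
    return heap_gb * shards_per_gb

# Hypothetical: 600 GB of logs, one replica, 16 GB heap per data node.
print(shards_for(600))          # (20, 40): 20 primaries + 20 replicas
print(max_shards_per_node(16))  # 320
```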
8.3 Index Lifecycle Management (ILM)
PUT /_ilm/policy/logs-policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_primary_shard_size": "50gb",
"max_age": "1d"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
},
"allocate": {
"require": {
"data": "warm"
}
},
"set_priority": {
"priority": 50
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"require": {
"data": "cold"
}
},
"freeze": {}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
8.4 Cluster Monitoring APIs
# Cluster health
GET /_cluster/health
# Node stats
GET /_nodes/stats
# Index status
GET /_cat/indices?v&s=store.size:desc
# Shard allocation
GET /_cat/shards?v&s=store:desc
# Allocation failure explanation
GET /_cluster/allocation/explain
# Hot Threads (CPU analysis)
GET /_nodes/hot_threads
# Pending Tasks
GET /_cluster/pending_tasks
9. Performance Optimization
9.1 Indexing Performance
// Bulk API (10x+ faster than individual requests)
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Product 1", "price": 100}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Product 2", "price": 200}
{"update": {"_index": "products", "_id": "1"}}
{"doc": {"price": 150}}
{"delete": {"_index": "products", "_id": "3"}}
Indexing Optimization Checklist:
| Item | Setting | Effect |
|---|---|---|
| Bulk size | 5-15MB per request | Reduced network overhead |
| Refresh Interval | "30s" or "-1" | Fewer segment creations |
| Replica count | 0 during initial load | No replication overhead |
| Translog | "async" flush | Reduced disk I/O |
| Mapping | "enabled": false for unused fields | Less indexing load |
| ID generation | Auto-generated IDs | Skip ID duplicate check |
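The _bulk body is newline-delimited JSON (NDJSON): one action line followed by one source line per document, ending with a trailing newline. A minimal builder sketch (the official clients' bulk helpers add chunking and retries on top of this):

```python
import json

def build_bulk_body(index, docs):
    """Serialize documents into the action/source line pairs _bulk expects.
    The body must end with a trailing newline."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index}}
        if "_id" in doc:
            action["index"]["_id"] = doc.pop("_id")
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_bulk_body("products", [
    {"_id": "1", "name": "Product 1", "price": 100},
    {"name": "Product 2", "price": 200},  # ID left to auto-generation
])
print(body)
```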
9.2 Search Performance
// 1. Use Filter Context (cached)
GET /products/_search
{
"query": {
"bool": {
"filter": [
{ "term": { "category": "electronics" } },
{ "range": { "price": { "gte": 100, "lte": 500 } } }
],
"must": [
{ "match": { "description": "wireless" } }
]
}
}
}
// 2. Source Filtering (only needed fields)
GET /products/_search
{
"_source": ["name", "price", "category"],
"query": { "match_all": {} }
}
// 3. Search After (Deep Pagination alternative)
GET /products/_search
{
"size": 20,
"sort": [
{ "created_at": "desc" },
{ "_id": "asc" }
],
"search_after": ["2025-03-01T00:00:00", "abc123"]
}
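The search_after contract can be simulated against a local list: sort by (created_at desc, _id asc) and resume strictly after the previous page's last sort values. A sketch with synthetic documents:

```python
def search_page(docs, size, search_after=None):
    """Emulate search_after over a local dataset: sort by
    (created_at desc, _id asc) and return the first `size` docs whose
    sort values come after the cursor from the previous page."""
    def sort_key(d):
        return (-d["created_at"], d["_id"])
    ordered = sorted(docs, key=sort_key)
    if search_after is not None:
        cursor = (-search_after[0], search_after[1])
        ordered = [d for d in ordered if sort_key(d) > cursor]
    page = ordered[:size]
    next_cursor = [page[-1]["created_at"], page[-1]["_id"]] if page else None
    return page, next_cursor

docs = [{"created_at": t, "_id": f"doc{t}"} for t in range(1, 8)]
page1, cursor = search_page(docs, size=3)
page2, cursor = search_page(docs, size=3, search_after=cursor)
print([d["_id"] for d in page1])  # ['doc7', 'doc6', 'doc5']
print([d["_id"] for d in page2])  # ['doc4', 'doc3', 'doc2']
```

Unlike from/size, each page only has to scan past the cursor, never buffer all preceding pages.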
9.3 Caching Strategy
Elasticsearch Cache Layers:
1. Node Query Cache (Filter Cache)
- Caches filter context results
- Node-level, 10% of heap (default)
- LRU eviction
2. Shard Request Cache
- Caches aggregation results
- Shard-level, 1% of heap (default)
- Invalidated on index refresh
3. Field Data Cache
- Sort/aggregation data for text fields
- Uses heap memory (caution needed)
4. OS Page Cache
- Caches Lucene segment files
- Uses off-heap memory
- Most important cache!
9.4 Segment Merge Optimization
// Force Merge (for read-only indices)
POST /logs-2025.01/_forcemerge?max_num_segments=1
// Merge policy settings
PUT /products/_settings
{
"index": {
"merge": {
"policy": {
"max_merged_segment": "5gb",
"segments_per_tier": 10,
"floor_segment": "2mb"
}
}
}
}
9.5 JVM and OS Level Optimization
# jvm.options
-Xms16g
-Xmx16g
# Heap should be 50% or less of physical memory, max 30.5GB (Compressed OOPs)
# OS-level settings
# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1
# /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
10. Production Operational Patterns
10.1 Index Templates and Component Templates
// Component Template (reusable blocks)
PUT /_component_template/base-settings
{
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "logs-policy"
}
}
}
PUT /_component_template/log-mappings
{
"template": {
"mappings": {
"properties": {
"@timestamp": { "type": "date" },
"message": { "type": "text" },
"level": { "type": "keyword" },
"service": { "type": "keyword" },
"trace_id": { "type": "keyword" }
}
}
}
}
// Index Template (composing Component Templates)
PUT /_index_template/logs
{
"index_patterns": ["logs-*"],
"composed_of": ["base-settings", "log-mappings"],
"priority": 200
}
10.2 Aliases and Reindex
// Alias for zero-downtime index swap
POST /_aliases
{
"actions": [
{ "remove": { "index": "products-v1", "alias": "products" } },
{ "add": { "index": "products-v2", "alias": "products" } }
]
}
// Reindex (index migration)
POST /_reindex
{
"source": {
"index": "old-index",
"query": {
"range": {
"@timestamp": { "gte": "2025-01-01" }
}
}
},
"dest": {
"index": "new-index",
"pipeline": "enrichment-pipeline"
}
}
10.3 Snapshot and Restore
// Register repository (S3)
PUT /_snapshot/s3-backup
{
"type": "s3",
"settings": {
"bucket": "my-es-backups",
"region": "us-east-1",
"base_path": "elasticsearch"
}
}
// Create snapshot
PUT /_snapshot/s3-backup/snapshot-2025-03-24
{
"indices": "logs-*,products",
"ignore_unavailable": true,
"include_global_state": false
}
// Restore
POST /_snapshot/s3-backup/snapshot-2025-03-24/_restore
{
"indices": "products",
"rename_pattern": "(.+)",
"rename_replacement": "restored-$1"
}
11. Interview Quiz
Q1. Why can't you search for a document immediately after indexing it in Elasticsearch?
Elasticsearch is a Near Real-Time (NRT) search engine. When a document is indexed, it is first written to an in-memory buffer and the transaction log (translog). For the document to become searchable, a refresh operation must occur, which creates a new Lucene segment from the in-memory buffer.
The default refresh interval is 1 second, which is why it is called "Near Real-Time."
- Configurable via `index.refresh_interval`
- For immediate visibility, call `POST /index/_refresh`
- During bulk indexing, set the interval to `-1` to disable refresh, then refresh manually after completion
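The buffer-and-refresh behavior can be modeled in a few lines. This toy ignores the translog and flush and only shows the visibility rule:

```python
class NRTIndex:
    """Toy model: writes land in an in-memory buffer; only refresh()
    turns the buffer into a searchable 'segment'."""
    def __init__(self):
        self.buffer = []     # indexed but not yet searchable
        self.segments = []   # searchable segments

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        if self.buffer:
            self.segments.append(list(self.buffer))  # new immutable segment
            self.buffer.clear()

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d["text"]]

idx = NRTIndex()
idx.index({"id": 1, "text": "hello world"})
print(idx.search("hello"))  # [] - not visible until refresh
idx.refresh()
print(idx.search("hello"))  # [{'id': 1, 'text': 'hello world'}]
```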
Q2. What is the difference between text and keyword field types?
text type:
- Tokenized through an analyzer
- Individual tokens stored in the inverted index
- Used for full-text search (match query)
- No Doc Values support (Fielddata used for sorting/aggregation, memory-intensive)
keyword type:
- Stored as-is without analysis
- Used for exact match searches (term query)
- Optimized for sorting, aggregation, and filtering
- Doc Values supported (disk-based)
Practical tip: Fields like name are commonly configured as multi-fields with both types.
"name": {
"type": "text",
"fields": {
"keyword": { "type": "keyword" }
}
}
Q3. What is the difference between must and filter in a Bool Query?
Both require matching, but there is a key difference:
must:
- Calculates relevance score (_score)
- Results are not cached
- Use when "how well does it match" matters
filter:
- Does not calculate score (always 0)
- Results are cached (Node Query Cache)
- Use when only "does it match or not" matters
- Performance benefit when the same filter is reused
Optimization principle: Move all conditions that do not need scoring to filter. Especially term, range, and exists conditions are well-suited for filter.
Q4. What are the roles of Primary and Replica Shards?
Primary Shard:
- Stores the original index data
- Cannot be changed after index creation (except via shrink/split)
- All write (index) requests are processed by the Primary Shard first
- Unit of data distribution
Replica Shard:
- A copy of the Primary Shard
- Can be dynamically adjusted
- Two purposes:
- High Availability: Replica promotes to Primary on failure
- Search Performance: Distributes read load across replicas
- Never placed on the same node as its Primary
Shard allocation formula: Maximum simultaneous node failures tolerated = number of replicas. Search throughput scales proportionally to (Primary + Replica) count.
Q5. Why is deep pagination dangerous in Elasticsearch and what are the alternatives?
The problem with deep pagination:
A request with from: 10000, size: 10 requires each shard to sort and return 10,010 documents to the coordinating node. With 3 shards, that means 30,030 documents must be sorted in memory. Resource consumption grows linearly with page depth, so deep pages quickly become prohibitively expensive.
The default maximum for from + size is 10,000 (index.max_result_window).
Alternatives:
- Search After: Uses the last sort value from the previous page as a cursor for the next page. Real-time cursor-based.
- Scroll API: Snapshot-based traversal of large datasets. Suitable for batch processing. (Being deprecated; PIT + search_after recommended)
- Point in Time (PIT): A replacement for Scroll. Used with search_after for consistent views.
// PIT + Search After example
POST /products/_pit?keep_alive=5m
GET /_search
{
"pit": { "id": "PIT_ID", "keep_alive": "5m" },
"size": 20,
"sort": [{ "created_at": "desc" }, { "_shard_doc": "asc" }],
"search_after": [1679616000000, 42]
}