Elasticsearch Complete Guide 2025: From Search Engine to Log Analytics & Vector Search


1. What is Elasticsearch

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. Since its first release by Shay Banon in 2010, it has expanded from full-text search to log analytics, metrics monitoring, security analytics, and most recently, vector search.

1.1 Why Elasticsearch

Running LIKE '%keyword%' in a traditional RDBMS requires a full table scan. With hundreds of millions of rows, response times can reach tens of seconds. Elasticsearch uses an inverted index structure to deliver millisecond-level search responses.

Key Features:

  • Distributed Architecture: Horizontal scaling (scale-out) handling tens of TB of data
  • Near Real-Time Search: Documents become searchable within ~1 second of indexing
  • Schemaless: Index JSON documents directly with dynamic mapping support
  • RESTful API: All operations via HTTP/JSON
  • Rich Ecosystem: Integration with Kibana, Logstash, Beats, APM

1.2 Elasticsearch vs RDBMS Comparison

Aspect        | RDBMS              | Elasticsearch
--------------|--------------------|----------------------------
Data Unit     | Row                | Document (JSON)
Collection    | Table              | Index
Column        | Column             | Field
Schema        | Fixed Schema       | Dynamic Mapping
Search Method | B-Tree Index       | Inverted Index
Scaling       | Scale-up oriented  | Scale-out (Sharding)
Transactions  | ACID support       | Not supported
Primary Use   | OLTP               | Search, Analytics, Logging

1.3 Version History

ES 1.x (2014) - Initial stabilization
ES 2.x (2015) - Pipeline Aggregation
ES 5.x (2016) - Lucene 6, Painless scripting
ES 6.x (2017) - Single type per index
ES 7.x (2019) - Type removal, adaptive replica selection
ES 8.x (2022) - Security by default, kNN vector search, NLP integration

2. Inverted Index Internals

The secret behind Elasticsearch's search speed is the inverted index. While a regular database finds words within documents (forward index), an inverted index finds documents from words.

2.1 Inverted Index Structure

Given three documents:

Doc 1: "The quick brown fox"
Doc 2: "The quick brown dog"
Doc 3: "The lazy brown fox"

The inverted index is built as follows:

Term      | Document IDs
----------|-------------
the       | [1, 2, 3]
quick     | [1, 2]
brown     | [1, 2, 3]
fox       | [1, 3]
dog       | [2]
lazy      | [3]

Searching for "fox" instantly returns Doc 1 and Doc 3. Because the query walks the term dictionary rather than scanning documents, lookup cost is nearly independent of the total document count.
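The table above can be reproduced in a few lines of Python. This is a toy sketch of the idea only; Lucene actually stores terms in finite-state transducers with compressed postings lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The quick brown fox",
    2: "The quick brown dog",
    3: "The lazy brown fox",
}
index = build_inverted_index(docs)
print(index["fox"])    # [1, 3]
print(index["brown"])  # [1, 2, 3]
```

A search is then a dictionary lookup followed by a postings-list read, which is why the document count barely matters.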

2.2 Lucene Segment Architecture

Lucene inside Elasticsearch stores data in segments:

Index
  └── Shard (Lucene Index)
        ├── Segment 0 (immutable)
        │     ├── Inverted Index
        │     ├── Stored Fields
        │     ├── Doc Values
        │     └── Term Vectors
        ├── Segment 1 (immutable)
        ├── Segment 2 (immutable)
        └── Commit Point (segments_N)

Key Points:

  • Segments are immutable once created
  • Document deletion marks entries in a .del file rather than removing them
  • Background segment merging occurs periodically
  • A new segment is created on each refresh (default 1 second), making new documents searchable
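The buffer-then-refresh behavior can be illustrated with a toy model (class and method names are illustrative, not Elasticsearch internals): a document is invisible to search until a refresh turns the in-memory buffer into a segment.

```python
class TinyShard:
    """Toy model of the NRT flow: index() buffers, refresh() opens a segment."""

    def __init__(self):
        self.buffer = []    # in-memory indexing buffer (not yet searchable)
        self.segments = []  # immutable, searchable segments

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # segments are immutable
            self.buffer = []

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d.split()]

shard = TinyShard()
shard.index("quick brown fox")
print(shard.search("fox"))  # [] - indexed but not yet refreshed
shard.refresh()
print(shard.search("fox"))  # ['quick brown fox']
```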

2.3 Doc Values and Fielddata

Inverted indices are optimized for text search but not for sorting or aggregations:

Inverted Index (search):     Term → Document IDs
Doc Values (agg/sort):       Document ID → Values

// Doc Values example
{
  "doc_1": { "price": 100, "category": "electronics" },
  "doc_2": { "price": 200, "category": "books" },
  "doc_3": { "price": 150, "category": "electronics" }
}

  • Doc Values: Automatically enabled for keyword, numeric, date, ip, geo_point types. Disk-based.
  • Fielddata: Used for sorting/aggregations on text fields. Heap memory-based, use with caution.
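The difference is column orientation. A sketch of why aggregations prefer the doc-values layout: a sum or sort touches one contiguous array per field instead of re-parsing every document.

```python
# Row-oriented source documents (what _source stores)
docs = {
    "doc_1": {"price": 100, "category": "electronics"},
    "doc_2": {"price": 200, "category": "books"},
    "doc_3": {"price": 150, "category": "electronics"},
}

# Doc-values-style columnar layout: one array per field, indexed by doc ID
doc_ids = sorted(docs)
price_column = [docs[d]["price"] for d in doc_ids]  # [100, 200, 150]

# Aggregations and sorts scan a single column, not whole documents
avg_price = sum(price_column) / len(price_column)
print(avg_price)  # 150.0

by_price = sorted(doc_ids, key=lambda d: docs[d]["price"])
print(by_price)   # ['doc_1', 'doc_3', 'doc_2']
```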

3. Mapping and Analyzers

3.1 Mapping Definition

Mapping defines the schema of an index, specifying field types and analysis methods.

PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "description": {
        "type": "text"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      },
      "location": {
        "type": "geo_point"
      },
      "tags": {
        "type": "keyword"
      },
      "metadata": {
        "type": "object"
      },
      "reviews": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "rating": { "type": "integer" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}

3.2 Core Field Types

Type               | Description                         | Inverted Index | Doc Values
-------------------|-------------------------------------|----------------|----------------------
text               | Full-text search (analyzer applied) | Yes            | No (Fielddata)
keyword            | Exact match, sorting, aggregation   | Yes            | Yes
long/integer/float | Numeric                             | Yes (BKD Tree) | Yes
date               | Date/time                           | Yes            | Yes
boolean            | true/false                          | Yes            | Yes
geo_point          | Latitude/longitude                  | Yes            | Yes
nested             | Nested objects (independent docs)   | Yes            | Yes
object             | JSON objects (flattened)            | Yes            | Yes
dense_vector       | Vectors (kNN search)                | No             | No (separate storage)

3.3 Analyzer Pipeline

An analyzer consists of three stages:

Text Input
    ↓
Character Filter
  - html_strip: Remove HTML tags
  - mapping: Character substitution
  - pattern_replace: Regex replacement
    ↓
Tokenizer
  - standard: Unicode word boundary
  - whitespace: Whitespace-based split
  - ngram: N-gram token generation
  - edge_ngram: Prefix-based tokens
  - language-specific (e.g., nori for Korean, kuromoji for Japanese)
    ↓
Token Filter
  - lowercase: Convert to lowercase
  - stop: Remove stop words
  - synonym: Synonym expansion
  - stemmer: Stem extraction
    ↓
Token Stream (Terms)
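The three stages compose like plain functions. A minimal Python sketch (the regexes and stopword set are rough stand-ins for the real standard tokenizer and stop filter):

```python
import re

def html_strip(text):
    """Character filter: drop HTML tags."""
    return re.sub(r"<[^>]+>", "", text)

def standard_tokenize(text):
    """Tokenizer: rough stand-in for Unicode word-boundary splitting."""
    return re.findall(r"\w+", text)

STOPWORDS = {"is", "a", "the"}

def analyze(text):
    """Char filter -> tokenizer -> lowercase + stop token filters."""
    tokens = standard_tokenize(html_strip(text))
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

print(analyze("Elasticsearch is a <b>powerful</b> search engine"))
# ['elasticsearch', 'powerful', 'search', 'engine']
```

Note the output matches what the Analyze API returns for the same input later in this guide.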

3.4 Custom Analyzer Example

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_cleaner": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "tokenizer": {
        "custom_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_synonyms": {
          "type": "synonym",
          "synonyms": [
            "quick,fast,speedy",
            "big,large,huge"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_cleaner"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop", "custom_synonyms"]
        },
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "custom_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

3.5 Analyze API

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch is a <b>powerful</b> search engine"
}

// Response
{
  "tokens": [
    { "token": "elasticsearch", "position": 0 },
    { "token": "powerful", "position": 3 },
    { "token": "search", "position": 4 },
    { "token": "engine", "position": 5 }
  ]
}

4. Mastering Query DSL

Elasticsearch Query DSL is a JSON-based query language. It is divided into Query Context (relevance scoring) and Filter Context (yes/no matching with caching).

4.1 Full-Text Queries

// Basic match - tokenized through analyzer
GET /products/_search
{
  "query": {
    "match": {
      "description": "powerful search engine"
    }
  }
}

// match_phrase - word order matters
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "search engine",
        "slop": 1
      }
    }
  }
}

// multi_match - search across multiple fields
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch guide",
      "fields": ["title^3", "description", "tags^2"],
      "type": "best_fields"
    }
  }
}

4.2 Term Queries (Exact Match)

// term - exact match on keyword fields
GET /products/_search
{
  "query": {
    "term": {
      "category": "electronics"
    }
  }
}

// terms - match any of multiple values
GET /products/_search
{
  "query": {
    "terms": {
      "status": ["published", "pending"]
    }
  }
}

// range - range queries
GET /products/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 100,
        "lte": 500
      }
    }
  }
}

// exists - field existence check
GET /products/_search
{
  "query": {
    "exists": {
      "field": "discount"
    }
  }
}

4.3 Bool Query (Compound Query)

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "search engine" } }
      ],
      "must_not": [
        { "term": { "status": "deleted" } }
      ],
      "should": [
        { "term": { "featured": true } },
        { "range": { "rating": { "gte": 4.5 } } }
      ],
      "filter": [
        { "term": { "category": "software" } },
        { "range": { "price": { "lte": 1000 } } }
      ],
      "minimum_should_match": 1
    }
  }
}

Bool Query Clauses:

Clause   | Scoring | Caching | Purpose
---------|---------|---------|--------------------------------
must     | Yes     | No      | Must match + score contribution
must_not | No      | Yes     | Must not match
should   | Yes     | No      | Bonus if matched
filter   | No      | Yes     | Must match (no scoring)

4.4 Nested Query

GET /products/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "term": { "reviews.author": "john" } },
            { "range": { "reviews.rating": { "gte": 4 } } }
          ]
        }
      },
      "inner_hits": {
        "size": 3,
        "highlight": {
          "fields": {
            "reviews.comment": {}
          }
        }
      }
    }
  }
}

4.5 Function Score Query

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "elasticsearch" } },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "modifier": "log1p",
            "factor": 2
          }
        },
        {
          "gauss": {
            "created_at": {
              "origin": "now",
              "scale": "30d",
              "decay": 0.5
            }
          }
        },
        {
          "filter": { "term": { "featured": true } },
          "weight": 5
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}

4.6 Highlighting

GET /products/_search
{
  "query": {
    "match": { "description": "search engine" }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "description": {
        "fragment_size": 150,
        "number_of_fragments": 3
      }
    }
  }
}

5. Aggregations Deep Dive

Elasticsearch aggregations are analogous to SQL GROUP BY but far more powerful. They are classified into bucket, metric, and pipeline aggregations.

5.1 Bucket Aggregations

// terms aggregation - document count by category
GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "size": 20,
        "order": { "_count": "desc" }
      }
    }
  }
}

// date_histogram - time-based aggregation
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "logs_per_hour": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h",
        "time_zone": "UTC",
        "min_doc_count": 0
      }
    }
  }
}

// range aggregation - price brackets
GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 500 },
          { "from": 500 }
        ]
      }
    }
  }
}

// sub-aggregations - average price and stats per category
GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        },
        "price_stats": {
          "stats": { "field": "price" }
        }
      }
    }
  }
}

5.2 Metric Aggregations

GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price": { "avg": { "field": "price" } },
    "max_price": { "max": { "field": "price" } },
    "min_price": { "min": { "field": "price" } },
    "total_sales": { "sum": { "field": "sales_count" } },
    "unique_brands": { "cardinality": { "field": "brand" } },
    "price_percentiles": {
      "percentiles": {
        "field": "price",
        "percents": [25, 50, 75, 90, 99]
      }
    }
  }
}

5.3 Pipeline Aggregations

GET /sales/_search
{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_revenue": {
          "sum": { "field": "revenue" }
        }
      }
    },
    "avg_monthly_revenue": {
      "avg_bucket": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    },
    "max_monthly_revenue": {
      "max_bucket": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    },
    "revenue_derivative": {
      "derivative": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    },
    "moving_avg_revenue": {
      "moving_fn": {
        "buckets_path": "monthly_sales>total_revenue",
        "window": 3,
        "script": "MovingFunctions.unweightedAvg(values)"
      }
    }
  }
}

6. ELK Stack (Logstash, Kibana, Beats)

6.1 ELK Stack Architecture

Data Sources         Collection          Processing         Storage            Visualization
┌────────────┐      ┌────────────┐      ┌────────────┐     ┌────────────┐     ┌────────────┐
│ App Logs   │─────▶│ Filebeat   │─────▶│ Logstash   │────▶│ Elastic-   │────▶│ Kibana     │
│ Servers    │      │ Metricbeat │      │ Pipeline   │     │ search     │     │ Dashboards │
│ Docker     │      │ Heartbeat  │      │ (Filter)   │     │ Cluster    │     │ Lens/Maps  │
│ K8s        │      │ Packetbeat │      │            │     │            │     │ Alerting   │
└────────────┘      └────────────┘      └────────────┘     └────────────┘     └────────────┘

6.2 Logstash Pipeline

# /etc/logstash/conf.d/main.conf
input {
  beats {
    port => 5044
  }
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["app-logs"]
    codec => json
  }
}

filter {
  # Grok pattern for unstructured log parsing
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
    }
  }

  # Date parsing
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }

  # GeoIP enrichment
  geoip {
    source => "client_ip"
    target => "geo"
  }

  # Conditional processing
  if [level] == "ERROR" {
    mutate {
      add_tag => ["alert"]
      add_field => { "severity" => "high" }
    }
  }

  # Remove unnecessary fields
  mutate {
    remove_field => ["agent", "ecs", "host"]
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "changeme"
  }

  # Error logs to a separate index
  if "alert" in [tags] {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      index => "alerts-%{+YYYY.MM.dd}"
    }
  }
}

6.3 Filebeat Configuration

# /etc/filebeat/filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    multiline:
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after

  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]

6.4 Kibana Features

  • Discover: Log search, filtering, time-based distribution
  • Dashboard: Combine multiple visualizations
  • Lens: Drag-and-drop visualization builder
  • Maps: Geographic data visualization
  • Alerting: Condition-based alerts (Slack, Email, PagerDuty)
  • APM: Application Performance Monitoring
  • Security: SIEM capabilities, security event analysis
  • Dev Tools: Direct API calls from the console

7. Vector Search and kNN

Native vector search in Elasticsearch 8.x enables semantic search capabilities.

7.1 Keyword Search vs Semantic Search

Traditional keyword search relies on exact word matching: searching for "car" will not find "vehicle" or "automobile." Vector search converts text into high-dimensional vectors and measures semantic similarity between them.

"car"     → [0.12, -0.34, 0.56, ..., 0.89]   (768 dimensions)
"vehicle" → [0.13, -0.32, 0.55, ..., 0.88]   (similar vector)
"fruit"   → [-0.45, 0.67, -0.12, ..., 0.23]  (different vector)
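"Similarity" here is typically cosine similarity, the same metric used in the index mapping below. A small Python sketch with the toy 3-dimensional vectors above standing in for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

car     = [0.12, -0.34, 0.56]
vehicle = [0.13, -0.32, 0.55]
fruit   = [-0.45, 0.67, -0.12]

# Semantically close texts produce nearby vectors
print(cosine_similarity(car, vehicle) > cosine_similarity(car, fruit))  # True
```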

7.2 kNN Search Index Setup

PUT /semantic-search
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 200
        }
      }
    }
  }
}

7.3 kNN Queries

// Pure kNN search
GET /semantic-search/_search
{
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56],
    "k": 10,
    "num_candidates": 100
  }
}

// Hybrid search (keyword + kNN)
GET /semantic-search/_search
{
  "query": {
    "match": {
      "content": "car recommendation"
    }
  },
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56],
    "k": 10,
    "num_candidates": 100,
    "boost": 0.5
  },
  "size": 10
}

7.4 HNSW Algorithm

HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor (ANN) algorithm used for kNN search:

Layer 2 (sparse):  A ──── B
Layer 1 (medium):  A ── C ── B ── D
                   │    │    │    │
Layer 0 (dense):   A-E-C-F-B-G-D-H

Parameter       | Description               | Default | Trade-off
----------------|---------------------------|---------|--------------------------------------------
m               | Connections per node      | 16      | Higher = more accurate but more memory
ef_construction | Search range during build | 200     | Higher = more accurate but slower indexing
ef              | Search range during query | 100     | Higher = more accurate but slower search

7.5 NLP Model Integration (Elastic ELSER)

// Deploy ELSER v2 model
PUT /_ml/trained_models/.elser_model_2
{
  "input": {
    "field_names": ["text_field"]
  }
}

// Auto-vectorize with Ingest Pipeline
PUT /_ingest/pipeline/elser-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ]
      }
    }
  ]
}

8. Cluster Operations and Management

8.1 Cluster Architecture

┌───────────────────────────────────────────────────┐
│               Elasticsearch Cluster               │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────┐ │
│  │ Master Node │  │ Master Node │  │Master Node │ │
│  │  (Elected)  │  │ (Eligible)  │  │ (Eligible) │ │
│  └─────────────┘  └─────────────┘  └────────────┘ │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────┐ │
│  │  Data Node  │  │  Data Node  │  │ Data Node  │ │
│  │ (Hot Tier)  │  │ (Hot Tier)  │  │(Warm Tier) │ │
│  │  SSD 1TB    │  │  SSD 1TB    │  │ HDD 4TB    │ │
│  └─────────────┘  └─────────────┘  └────────────┘ │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐                 │
│  │ Ingest Node │  │ Coord Node  │                 │
│  │ (Pipeline)  │  │ (Routing)   │                 │
│  └─────────────┘  └─────────────┘                 │
└───────────────────────────────────────────────────┘

8.2 Shard Strategy

PUT /logs-2025.03
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "routing.allocation.require.data": "hot"
  }
}

Shard Sizing Guidelines:

  • Recommended 10-50GB per shard
  • Shards per node: no more than 20 per 1GB of heap
  • Total cluster shards: each additional ~1,000 shards adds cluster-state overhead on the master nodes, so keep the total bounded
  • Index pattern: use date-based indices for time-series data
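The guidelines above lend themselves to a back-of-the-envelope calculator. A sketch using the guideline numbers as defaults (function names are our own, not an Elasticsearch API):

```python
def plan_shards(total_data_gb, target_shard_gb=30, replicas=1):
    """Rough primary-shard count from expected data volume,
    aiming inside the 10-50GB-per-shard guideline."""
    primaries = max(1, -(-total_data_gb // target_shard_gb))  # ceiling division
    return {"primaries": primaries, "total_shards": primaries * (1 + replicas)}

def max_shards_for_heap(heap_gb, shards_per_gb=20):
    """Per-node shard budget from the <=20 shards per GB of heap guideline."""
    return heap_gb * shards_per_gb

print(plan_shards(300))         # {'primaries': 10, 'total_shards': 20}
print(max_shards_for_heap(16))  # 320
```

So 300GB of time-series data fits comfortably in 10 primaries plus replicas, well under the budget of a single 16GB-heap data node.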

8.3 Index Lifecycle Management (ILM)

PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

8.4 Cluster Monitoring APIs

# Cluster health
GET /_cluster/health

# Node stats
GET /_nodes/stats

# Index status
GET /_cat/indices?v&s=store.size:desc

# Shard allocation
GET /_cat/shards?v&s=store:desc

# Allocation failure explanation
GET /_cluster/allocation/explain

# Hot Threads (CPU analysis)
GET /_nodes/hot_threads

# Pending Tasks
GET /_cluster/pending_tasks

9. Performance Optimization

9.1 Indexing Performance

// Bulk API (10x+ faster than individual requests)
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Product 1", "price": 100}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Product 2", "price": 200}
{"update": {"_index": "products", "_id": "1"}}
{"doc": {"price": 150}}
{"delete": {"_index": "products", "_id": "3"}}
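The _bulk body is newline-delimited JSON: an action line, then an optional source line. A minimal Python helper to assemble it (the helper name and sample documents are illustrative):

```python
import json

def build_bulk_body(actions):
    """Serialize (action, source) pairs into the NDJSON body _bulk expects."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:          # delete actions have no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"      # body must end with a newline

body = build_bulk_body([
    ({"index": {"_index": "products", "_id": "1"}},
     {"name": "Product 1", "price": 100}),
    ({"delete": {"_index": "products", "_id": "3"}}, None),
])

# Send with Content-Type: application/x-ndjson, e.g.:
# requests.post(es_url + "/_bulk", data=body,
#               headers={"Content-Type": "application/x-ndjson"})
```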

Indexing Optimization Checklist:

ItemSettingEffect
Bulk size5-15MB per requestReduced network overhead
Refresh Interval"30s" or "-1"Fewer segment creations
Replica count0 during initial loadNo replication overhead
Translog"async" flushReduced disk I/O
Mapping"enabled": false for unused fieldsLess indexing load
ID generationAuto-generated IDsSkip ID duplicate check

9.2 Search Performance

// 1. Use Filter Context (cached)
GET /products/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "category": "electronics" } },
        { "range": { "price": { "gte": 100, "lte": 500 } } }
      ],
      "must": [
        { "match": { "description": "wireless" } }
      ]
    }
  }
}

// 2. Source Filtering (only needed fields)
GET /products/_search
{
  "_source": ["name", "price", "category"],
  "query": { "match_all": {} }
}

// 3. Search After (Deep Pagination alternative)
GET /products/_search
{
  "size": 20,
  "sort": [
    { "created_at": "desc" },
    { "_id": "asc" }
  ],
  "search_after": ["2025-03-01T00:00:00", "abc123"]
}

9.3 Caching Strategy

Elasticsearch Cache Layers:

1. Node Query Cache (Filter Cache)
   - Caches filter context results
   - Node-level, 10% of heap (default)
   - LRU eviction

2. Shard Request Cache
   - Caches aggregation results
   - Shard-level, 1% of heap (default)
   - Invalidated on index refresh

3. Field Data Cache
   - Sort/aggregation data for text fields
   - Uses heap memory (caution needed)

4. OS Page Cache
   - Caches Lucene segment files
   - Uses off-heap memory
   - Most important cache!

9.4 Segment Merge Optimization

// Force Merge (for read-only indices)
POST /logs-2025.01/_forcemerge?max_num_segments=1

// Merge policy settings
PUT /products/_settings
{
  "index": {
    "merge": {
      "policy": {
        "max_merged_segment": "5gb",
        "segments_per_tier": 10,
        "floor_segment": "2mb"
      }
    }
  }
}

9.5 JVM and OS Level Optimization

# jvm.options
-Xms16g
-Xmx16g
# Heap should be 50% or less of physical memory, max ~30.5GB (Compressed OOPs)

# OS-level settings
# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1

# /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096

10. Production Operational Patterns

10.1 Index Templates and Component Templates

// Component Template (reusable blocks)
PUT /_component_template/base-settings
{
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    }
  }
}

PUT /_component_template/log-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "trace_id": { "type": "keyword" }
      }
    }
  }
}

// Index Template (composing Component Templates)
PUT /_index_template/logs
{
  "index_patterns": ["logs-*"],
  "composed_of": ["base-settings", "log-mappings"],
  "priority": 200
}

10.2 Aliases and Reindex

// Alias for zero-downtime index swap
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

// Reindex (index migration)
POST /_reindex
{
  "source": {
    "index": "old-index",
    "query": {
      "range": {
        "@timestamp": { "gte": "2025-01-01" }
      }
    }
  },
  "dest": {
    "index": "new-index",
    "pipeline": "enrichment-pipeline"
  }
}

10.3 Snapshot and Restore

// Register repository (S3)
PUT /_snapshot/s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups",
    "region": "us-east-1",
    "base_path": "elasticsearch"
  }
}

// Create snapshot
PUT /_snapshot/s3-backup/snapshot-2025-03-24
{
  "indices": "logs-*,products",
  "ignore_unavailable": true,
  "include_global_state": false
}

// Restore
POST /_snapshot/s3-backup/snapshot-2025-03-24/_restore
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

11. Interview Quiz

Q1. Why can't you search for a document immediately after indexing it in Elasticsearch?

Elasticsearch is a Near Real-Time (NRT) search engine. When a document is indexed, it is first written to an in-memory buffer and the transaction log (translog). For the document to become searchable, a refresh operation must occur, which creates a new Lucene segment from the in-memory buffer.

The default refresh interval is 1 second, which is why it is called "Near Real-Time."

  • Configurable via index.refresh_interval
  • For immediate search, call POST /index/_refresh
  • During bulk indexing, set to -1 to disable, then manually refresh after completion

Q2. What is the difference between text and keyword field types?

text type:

  • Tokenized through an analyzer
  • Individual tokens stored in the inverted index
  • Used for full-text search (match query)
  • No Doc Values support (Fielddata used for sorting/aggregation, memory-intensive)

keyword type:

  • Stored as-is without analysis
  • Used for exact match searches (term query)
  • Optimized for sorting, aggregation, and filtering
  • Doc Values supported (disk-based)

Practical tip: Fields like name are commonly configured as multi-fields with both types.

"name": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword" }
  }
}

Q3. What is the difference between must and filter in a Bool Query?

Both require matching, but there is a key difference:

must:

  • Calculates relevance score (_score)
  • Results are not cached
  • Use when "how well does it match" matters

filter:

  • Does not calculate score (always 0)
  • Results are cached (Node Query Cache)
  • Use when only "does it match or not" matters
  • Performance benefit when the same filter is reused

Optimization principle: Move all conditions that do not need scoring to filter. Especially term, range, and exists conditions are well-suited for filter.

Q4. What are the roles of Primary and Replica Shards?

Primary Shard:

  • Stores the original index data
  • Cannot be changed after index creation (except via shrink/split)
  • All write (index) requests are processed by the Primary Shard first
  • Unit of data distribution

Replica Shard:

  • A copy of the Primary Shard
  • Can be dynamically adjusted
  • Two purposes:
    1. High Availability: Replica promotes to Primary on failure
    2. Search Performance: Distributes read load across replicas
  • Never placed on the same node as its Primary

Shard allocation formula: Maximum simultaneous node failures tolerated = number of replicas. Search throughput scales proportionally to (Primary + Replica) count.

Q5. Why is deep pagination dangerous in Elasticsearch and what are the alternatives?

The problem with deep pagination: A request with from: 10000, size: 10 requires each shard to sort and return 10,010 documents to the coordinating node. With 3 shards, that means 30,030 documents must be sorted in memory. Resource consumption grows linearly with page depth and quickly becomes prohibitive.

The default maximum for from + size is 10,000 (index.max_result_window).
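The arithmetic above generalizes to one line: every shard materializes from + size sorted entries, and the coordinator merges all of them.

```python
def deep_page_cost(from_, size, num_shards):
    """Documents each shard returns, and the total the coordinator must merge."""
    per_shard = from_ + size
    return per_shard, per_shard * num_shards

per_shard, total = deep_page_cost(10_000, 10, 3)
print(per_shard, total)  # 10010 30030
```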

Alternatives:

  1. Search After: Uses the last sort value from the previous page as a cursor for the next page. Real-time cursor-based.
  2. Scroll API: Snapshot-based traversal of large datasets. Suitable for batch processing. (Being deprecated; PIT + search_after recommended)
  3. Point in Time (PIT): A replacement for Scroll. Used with search_after for consistent views.

// PIT + Search After example
POST /products/_pit?keep_alive=5m

GET /_search
{
  "pit": { "id": "PIT_ID", "keep_alive": "5m" },
  "size": 20,
  "sort": [{ "created_at": "desc" }, { "_shard_doc": "asc" }],
  "search_after": [1679616000000, 42]
}
