Elasticsearch Complete Guide 2025: From Search Engine to Log Analytics & Vector Search


1. What is Elasticsearch

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. Since its first release by Shay Banon in 2010, it has expanded from full-text search to log analytics, metrics monitoring, security analytics, and most recently, vector search.

1.1 Why Elasticsearch

Running LIKE '%keyword%' in a traditional RDBMS requires a full table scan. With hundreds of millions of rows, response times can reach tens of seconds. Elasticsearch uses an inverted index structure to deliver millisecond-level search responses.

Key Features:

  • Distributed Architecture: Horizontal scaling (scale-out) handling tens of TB of data
  • Near Real-Time Search: Documents become searchable within ~1 second of indexing
  • Schemaless: Index JSON documents directly with dynamic mapping support
  • RESTful API: All operations via HTTP/JSON
  • Rich Ecosystem: Integration with Kibana, Logstash, Beats, APM

1.2 Elasticsearch vs RDBMS Comparison

Aspect        | RDBMS              | Elasticsearch
--------------|--------------------|----------------------------
Data Unit     | Row                | Document (JSON)
Collection    | Table              | Index
Column        | Column             | Field
Schema        | Fixed Schema       | Dynamic Mapping
Search Method | B-Tree Index       | Inverted Index
Scaling       | Scale-up oriented  | Scale-out (Sharding)
Transactions  | ACID support       | Not supported
Primary Use   | OLTP               | Search, Analytics, Logging

1.3 Version History

ES 1.x (2014) - Initial stabilization
ES 2.x (2015) - Pipeline Aggregation
ES 5.x (2016) - Lucene 6, Painless scripting
ES 6.x (2017) - Single type per index
ES 7.x (2019) - Type removal, adaptive replica selection
ES 8.x (2022) - Security by default, kNN vector search, NLP integration

2. Inverted Index Internals

The secret behind Elasticsearch's search speed is the inverted index. While a regular database finds words within documents (forward index), an inverted index finds documents from words.

2.1 Inverted Index Structure

Given three documents:

Doc 1: "The quick brown fox"
Doc 2: "The quick brown dog"
Doc 3: "The lazy brown fox"

The inverted index is built as follows:

Term      | Document IDs
----------|-------------
the       | [1, 2, 3]
quick     | [1, 2]
brown     | [1, 2, 3]
fox       | [1, 3]
dog       | [2]
lazy      | [3]

Searching for "fox" instantly returns Doc 1 and Doc 3. Because the query walks the term dictionary rather than scanning documents, lookup cost is nearly independent of the total document count.
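The table above can be reproduced in a few lines of Python. This is a toy sketch of the idea only; Lucene actually stores terms in finite-state transducers with compressed postings lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The quick brown fox",
    2: "The quick brown dog",
    3: "The lazy brown fox",
}
index = build_inverted_index(docs)
print(index["fox"])    # [1, 3]
print(index["brown"])  # [1, 2, 3]
```

A search is then a dictionary lookup followed by a postings-list read, which is why the document count barely matters.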

2.2 Lucene Segment Architecture

Lucene inside Elasticsearch stores data in segments:

Index
  └── Shard (Lucene Index)
        ├── Segment 0 (immutable)
        │     ├── Inverted Index
        │     ├── Stored Fields
        │     ├── Doc Values
        │     └── Term Vectors
        ├── Segment 1 (immutable)
        ├── Segment 2 (immutable)
        └── Commit Point (segments_N)

Key Points:

  • Segments are immutable once created
  • Document deletion marks entries in a .del file rather than removing them
  • Background segment merging occurs periodically
  • A new segment is created on each refresh (default 1 second), making new documents searchable
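The buffer-then-refresh behavior can be illustrated with a toy model (class and method names are illustrative, not Elasticsearch internals): a document is invisible to search until a refresh turns the in-memory buffer into a segment.

```python
class TinyShard:
    """Toy model of the NRT flow: index() buffers, refresh() opens a segment."""

    def __init__(self):
        self.buffer = []    # in-memory indexing buffer (not yet searchable)
        self.segments = []  # immutable, searchable segments

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # segments are immutable
            self.buffer = []

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d.split()]

shard = TinyShard()
shard.index("quick brown fox")
print(shard.search("fox"))  # [] - indexed but not yet refreshed
shard.refresh()
print(shard.search("fox"))  # ['quick brown fox']
```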

2.3 Doc Values and Fielddata

Inverted indices are optimized for text search but not for sorting or aggregations:

Inverted Index (search):     Term → Document IDs
Doc Values (agg/sort):       Document ID → Values

// Doc Values example
{
  "doc_1": { "price": 100, "category": "electronics" },
  "doc_2": { "price": 200, "category": "books" },
  "doc_3": { "price": 150, "category": "electronics" }
}

  • Doc Values: Automatically enabled for keyword, numeric, date, ip, geo_point types. Disk-based.
  • Fielddata: Used for sorting/aggregations on text fields. Heap memory-based, use with caution.
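The difference is column orientation. A sketch of why aggregations prefer the doc-values layout: a sum or sort touches one contiguous array per field instead of re-parsing every document.

```python
# Row-oriented source documents (what _source stores)
docs = {
    "doc_1": {"price": 100, "category": "electronics"},
    "doc_2": {"price": 200, "category": "books"},
    "doc_3": {"price": 150, "category": "electronics"},
}

# Doc-values-style columnar layout: one array per field, indexed by doc ID
doc_ids = sorted(docs)
price_column = [docs[d]["price"] for d in doc_ids]  # [100, 200, 150]

# Aggregations and sorts scan a single column, not whole documents
avg_price = sum(price_column) / len(price_column)
print(avg_price)  # 150.0

by_price = sorted(doc_ids, key=lambda d: docs[d]["price"])
print(by_price)   # ['doc_1', 'doc_3', 'doc_2']
```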

3. Mapping and Analyzers

3.1 Mapping Definition

Mapping defines the schema of an index, specifying field types and analysis methods.

PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "description": {
        "type": "text"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      },
      "location": {
        "type": "geo_point"
      },
      "tags": {
        "type": "keyword"
      },
      "metadata": {
        "type": "object"
      },
      "reviews": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "rating": { "type": "integer" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}

3.2 Core Field Types

Type               | Description                         | Inverted Index | Doc Values
-------------------|-------------------------------------|----------------|----------------------
text               | Full-text search (analyzer applied) | Yes            | No (Fielddata)
keyword            | Exact match, sorting, aggregation   | Yes            | Yes
long/integer/float | Numeric                             | Yes (BKD Tree) | Yes
date               | Date/time                           | Yes            | Yes
boolean            | true/false                          | Yes            | Yes
geo_point          | Latitude/longitude                  | Yes            | Yes
nested             | Nested objects (independent docs)   | Yes            | Yes
object             | JSON objects (flattened)            | Yes            | Yes
dense_vector       | Vectors (kNN search)                | No             | No (separate storage)

3.3 Analyzer Pipeline

An analyzer consists of three stages:

Text Input
    ↓
Character Filter
  - html_strip: Remove HTML tags
  - mapping: Character substitution
  - pattern_replace: Regex replacement
    ↓
Tokenizer
  - standard: Unicode word boundary
  - whitespace: Whitespace-based split
  - ngram: N-gram token generation
  - edge_ngram: Prefix-based tokens
  - language-specific (e.g., nori for Korean, kuromoji for Japanese)
    ↓
Token Filter
  - lowercase: Convert to lowercase
  - stop: Remove stop words
  - synonym: Synonym expansion
  - stemmer: Stem extraction
    ↓
Token Stream (Terms)
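The three stages compose like plain functions. A minimal Python sketch (the regexes and stopword set are rough stand-ins for the real standard tokenizer and stop filter):

```python
import re

def html_strip(text):
    """Character filter: drop HTML tags."""
    return re.sub(r"<[^>]+>", "", text)

def standard_tokenize(text):
    """Tokenizer: rough stand-in for Unicode word-boundary splitting."""
    return re.findall(r"\w+", text)

STOPWORDS = {"is", "a", "the"}

def analyze(text):
    """Char filter -> tokenizer -> lowercase + stop token filters."""
    tokens = standard_tokenize(html_strip(text))
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

print(analyze("Elasticsearch is a <b>powerful</b> search engine"))
# ['elasticsearch', 'powerful', 'search', 'engine']
```

Note the output matches what the Analyze API returns for the same input later in this guide.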

3.4 Custom Analyzer Example

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_cleaner": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "tokenizer": {
        "custom_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_synonyms": {
          "type": "synonym",
          "synonyms": [
            "quick,fast,speedy",
            "big,large,huge"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_cleaner"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop", "custom_synonyms"]
        },
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "custom_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

3.5 Analyze API

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch is a <b>powerful</b> search engine"
}

// Response
{
  "tokens": [
    { "token": "elasticsearch", "position": 0 },
    { "token": "powerful", "position": 3 },
    { "token": "search", "position": 4 },
    { "token": "engine", "position": 5 }
  ]
}

4. Mastering Query DSL

Elasticsearch Query DSL is a JSON-based query language. It is divided into Query Context (relevance scoring) and Filter Context (yes/no matching with caching).

4.1 Full-Text Queries

// Basic match - tokenized through analyzer
GET /products/_search
{
  "query": {
    "match": {
      "description": "powerful search engine"
    }
  }
}

// match_phrase - word order matters
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "search engine",
        "slop": 1
      }
    }
  }
}

// multi_match - search across multiple fields
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch guide",
      "fields": ["title^3", "description", "tags^2"],
      "type": "best_fields"
    }
  }
}

4.2 Term Queries (Exact Match)

// term - exact match on keyword fields
GET /products/_search
{
  "query": {
    "term": {
      "category": "electronics"
    }
  }
}

// terms - match any of multiple values
GET /products/_search
{
  "query": {
    "terms": {
      "status": ["published", "pending"]
    }
  }
}

// range - range queries
GET /products/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 100,
        "lte": 500
      }
    }
  }
}

// exists - field existence check
GET /products/_search
{
  "query": {
    "exists": {
      "field": "discount"
    }
  }
}

4.3 Bool Query (Compound Query)

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "search engine" } }
      ],
      "must_not": [
        { "term": { "status": "deleted" } }
      ],
      "should": [
        { "term": { "featured": true } },
        { "range": { "rating": { "gte": 4.5 } } }
      ],
      "filter": [
        { "term": { "category": "software" } },
        { "range": { "price": { "lte": 1000 } } }
      ],
      "minimum_should_match": 1
    }
  }
}

Bool Query Clauses:

Clause   | Scoring | Caching | Purpose
---------|---------|---------|--------------------------------
must     | Yes     | No      | Must match + score contribution
must_not | No      | Yes     | Must not match
should   | Yes     | No      | Bonus if matched
filter   | No      | Yes     | Must match (no scoring)

4.4 Nested Query

GET /products/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "term": { "reviews.author": "john" } },
            { "range": { "reviews.rating": { "gte": 4 } } }
          ]
        }
      },
      "inner_hits": {
        "size": 3,
        "highlight": {
          "fields": {
            "reviews.comment": {}
          }
        }
      }
    }
  }
}

4.5 Function Score Query

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "elasticsearch" } },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "modifier": "log1p",
            "factor": 2
          }
        },
        {
          "gauss": {
            "created_at": {
              "origin": "now",
              "scale": "30d",
              "decay": 0.5
            }
          }
        },
        {
          "filter": { "term": { "featured": true } },
          "weight": 5
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}

4.6 Highlighting

GET /products/_search
{
  "query": {
    "match": { "description": "search engine" }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "description": {
        "fragment_size": 150,
        "number_of_fragments": 3
      }
    }
  }
}

5. Aggregations Deep Dive

Elasticsearch aggregations are analogous to SQL GROUP BY but far more powerful. They are classified into bucket, metric, and pipeline aggregations.

5.1 Bucket Aggregations

// terms aggregation - document count by category
GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "size": 20,
        "order": { "_count": "desc" }
      }
    }
  }
}

// date_histogram - time-based aggregation
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "logs_per_hour": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h",
        "time_zone": "UTC",
        "min_doc_count": 0
      }
    }
  }
}

// range aggregation - price brackets
GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 500 },
          { "from": 500 }
        ]
      }
    }
  }
}

// sub-aggregations - average price and stats per category
GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        },
        "price_stats": {
          "stats": { "field": "price" }
        }
      }
    }
  }
}

5.2 Metric Aggregations

GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price": { "avg": { "field": "price" } },
    "max_price": { "max": { "field": "price" } },
    "min_price": { "min": { "field": "price" } },
    "total_sales": { "sum": { "field": "sales_count" } },
    "unique_brands": { "cardinality": { "field": "brand" } },
    "price_percentiles": {
      "percentiles": {
        "field": "price",
        "percents": [25, 50, 75, 90, 99]
      }
    }
  }
}

5.3 Pipeline Aggregations

GET /sales/_search
{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_revenue": {
          "sum": { "field": "revenue" }
        }
      }
    },
    "avg_monthly_revenue": {
      "avg_bucket": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    },
    "max_monthly_revenue": {
      "max_bucket": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    },
    "revenue_derivative": {
      "derivative": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    },
    "moving_avg_revenue": {
      "moving_fn": {
        "buckets_path": "monthly_sales>total_revenue",
        "window": 3,
        "script": "MovingFunctions.unweightedAvg(values)"
      }
    }
  }
}

6. ELK Stack (Logstash, Kibana, Beats)

6.1 ELK Stack Architecture

Data Sources         Collection          Processing         Storage            Visualization
┌────────────┐      ┌────────────┐      ┌────────────┐     ┌────────────┐     ┌────────────┐
│ App Logs   │─────▶│ Filebeat   │─────▶│ Logstash   │────▶│ Elastic-   │────▶│ Kibana     │
│ Servers    │      │ Metricbeat │      │ Pipeline   │     │ search     │     │ Dashboards │
│ Docker     │      │ Heartbeat  │      │ (Filter)   │     │ Cluster    │     │ Lens/Maps  │
│ K8s        │      │ Packetbeat │      │            │     │            │     │ Alerting   │
└────────────┘      └────────────┘      └────────────┘     └────────────┘     └────────────┘

6.2 Logstash Pipeline

# /etc/logstash/conf.d/main.conf
input {
  beats {
    port => 5044
  }
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["app-logs"]
    codec => json
  }
}

filter {
  # Grok pattern for unstructured log parsing
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
    }
  }

  # Date parsing
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }

  # GeoIP enrichment
  geoip {
    source => "client_ip"
    target => "geo"
  }

  # Conditional processing
  if [level] == "ERROR" {
    mutate {
      add_tag => ["alert"]
      add_field => { "severity" => "high" }
    }
  }

  # Remove unnecessary fields
  mutate {
    remove_field => ["agent", "ecs", "host"]
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "changeme"
  }

  # Error logs to a separate index
  if "alert" in [tags] {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      index => "alerts-%{+YYYY.MM.dd}"
    }
  }
}

6.3 Filebeat Configuration

# /etc/filebeat/filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    multiline:
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after

  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]

6.4 Kibana Features

  • Discover: Log search, filtering, time-based distribution
  • Dashboard: Combine multiple visualizations
  • Lens: Drag-and-drop visualization builder
  • Maps: Geographic data visualization
  • Alerting: Condition-based alerts (Slack, Email, PagerDuty)
  • APM: Application Performance Monitoring
  • Security: SIEM capabilities, security event analysis
  • Dev Tools: Direct API calls from the console

7. Vector Search and kNN

Native vector search in Elasticsearch 8.x enables semantic search capabilities.

7.1 Keyword Search vs Semantic Search

Traditional keyword search relies on exact word matching: searching for "car" will not find "vehicle" or "automobile." Vector search converts text into high-dimensional vectors and measures semantic similarity between them.

"car"     → [0.12, -0.34, 0.56, ..., 0.89]   (768 dimensions)
"vehicle" → [0.13, -0.32, 0.55, ..., 0.88]   (similar vector)
"fruit"   → [-0.45, 0.67, -0.12, ..., 0.23]  (different vector)
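"Similarity" here is typically cosine similarity, the same metric used in the index mapping below. A small Python sketch with the toy 3-dimensional vectors above standing in for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

car     = [0.12, -0.34, 0.56]
vehicle = [0.13, -0.32, 0.55]
fruit   = [-0.45, 0.67, -0.12]

# Semantically close texts produce nearby vectors
print(cosine_similarity(car, vehicle) > cosine_similarity(car, fruit))  # True
```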

7.2 kNN Search Index Setup

PUT /semantic-search
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 200
        }
      }
    }
  }
}

7.3 kNN Queries

// Pure kNN search
GET /semantic-search/_search
{
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56],
    "k": 10,
    "num_candidates": 100
  }
}

// Hybrid search (keyword + kNN)
GET /semantic-search/_search
{
  "query": {
    "match": {
      "content": "car recommendation"
    }
  },
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56],
    "k": 10,
    "num_candidates": 100,
    "boost": 0.5
  },
  "size": 10
}

7.4 HNSW Algorithm

HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor (ANN) algorithm used for kNN search:

Layer 2 (sparse):  A ──── B
Layer 1 (medium):  A ── C ── B ── D
                   │    │    │    │
Layer 0 (dense):   A-E-C-F-B-G-D-H

Parameter       | Description               | Default | Trade-off
----------------|---------------------------|---------|--------------------------------------------
m               | Connections per node      | 16      | Higher = more accurate but more memory
ef_construction | Search range during build | 200     | Higher = more accurate but slower indexing
ef              | Search range during query | 100     | Higher = more accurate but slower search

7.5 NLP Model Integration (Elastic ELSER)

// Deploy ELSER v2 model
PUT /_ml/trained_models/.elser_model_2
{
  "input": {
    "field_names": ["text_field"]
  }
}

// Auto-vectorize with Ingest Pipeline
PUT /_ingest/pipeline/elser-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ]
      }
    }
  ]
}

8. Cluster Operations and Management

8.1 Cluster Architecture

┌───────────────────────────────────────────────────┐
│               Elasticsearch Cluster               │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────┐ │
│  │ Master Node │  │ Master Node │  │Master Node │ │
│  │  (Elected)  │  │ (Eligible)  │  │ (Eligible) │ │
│  └─────────────┘  └─────────────┘  └────────────┘ │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────┐ │
│  │  Data Node  │  │  Data Node  │  │ Data Node  │ │
│  │ (Hot Tier)  │  │ (Hot Tier)  │  │(Warm Tier) │ │
│  │  SSD 1TB    │  │  SSD 1TB    │  │ HDD 4TB    │ │
│  └─────────────┘  └─────────────┘  └────────────┘ │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐                 │
│  │ Ingest Node │  │ Coord Node  │                 │
│  │ (Pipeline)  │  │ (Routing)   │                 │
│  └─────────────┘  └─────────────┘                 │
└───────────────────────────────────────────────────┘

8.2 Shard Strategy

PUT /logs-2025.03
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "routing.allocation.require.data": "hot"
  }
}

Shard Sizing Guidelines:

  • Recommended 10-50GB per shard
  • Shards per node: no more than 20 per 1GB of heap
  • Total cluster shards: each additional ~1,000 shards adds cluster-state overhead on the master nodes, so keep the total bounded
  • Index pattern: use date-based indices for time-series data
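The guidelines above lend themselves to a back-of-the-envelope calculator. A sketch using the guideline numbers as defaults (function names are our own, not an Elasticsearch API):

```python
def plan_shards(total_data_gb, target_shard_gb=30, replicas=1):
    """Rough primary-shard count from expected data volume,
    aiming inside the 10-50GB-per-shard guideline."""
    primaries = max(1, -(-total_data_gb // target_shard_gb))  # ceiling division
    return {"primaries": primaries, "total_shards": primaries * (1 + replicas)}

def max_shards_for_heap(heap_gb, shards_per_gb=20):
    """Per-node shard budget from the <=20 shards per GB of heap guideline."""
    return heap_gb * shards_per_gb

print(plan_shards(300))         # {'primaries': 10, 'total_shards': 20}
print(max_shards_for_heap(16))  # 320
```

So 300GB of time-series data fits comfortably in 10 primaries plus replicas, well under the budget of a single 16GB-heap data node.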

8.3 Index Lifecycle Management (ILM)

PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

8.4 Cluster Monitoring APIs

# Cluster health
GET /_cluster/health

# Node stats
GET /_nodes/stats

# Index status
GET /_cat/indices?v&s=store.size:desc

# Shard allocation
GET /_cat/shards?v&s=store:desc

# Allocation failure explanation
GET /_cluster/allocation/explain

# Hot Threads (CPU analysis)
GET /_nodes/hot_threads

# Pending Tasks
GET /_cluster/pending_tasks

9. Performance Optimization

9.1 Indexing Performance

// Bulk API (10x+ faster than individual requests)
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Product 1", "price": 100}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Product 2", "price": 200}
{"update": {"_index": "products", "_id": "1"}}
{"doc": {"price": 150}}
{"delete": {"_index": "products", "_id": "3"}}
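The _bulk body is newline-delimited JSON: an action line, then an optional source line. A minimal Python helper to assemble it (the helper name and sample documents are illustrative):

```python
import json

def build_bulk_body(actions):
    """Serialize (action, source) pairs into the NDJSON body _bulk expects."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:          # delete actions have no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"      # body must end with a newline

body = build_bulk_body([
    ({"index": {"_index": "products", "_id": "1"}},
     {"name": "Product 1", "price": 100}),
    ({"delete": {"_index": "products", "_id": "3"}}, None),
])

# Send with Content-Type: application/x-ndjson, e.g.:
# requests.post(es_url + "/_bulk", data=body,
#               headers={"Content-Type": "application/x-ndjson"})
```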

Indexing Optimization Checklist:

ItemSettingEffect
Bulk size5-15MB per requestReduced network overhead
Refresh Interval"30s" or "-1"Fewer segment creations
Replica count0 during initial loadNo replication overhead
Translog"async" flushReduced disk I/O
Mapping"enabled": false for unused fieldsLess indexing load
ID generationAuto-generated IDsSkip ID duplicate check

9.2 Search Performance

// 1. Use Filter Context (cached)
GET /products/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "category": "electronics" } },
        { "range": { "price": { "gte": 100, "lte": 500 } } }
      ],
      "must": [
        { "match": { "description": "wireless" } }
      ]
    }
  }
}

// 2. Source Filtering (only needed fields)
GET /products/_search
{
  "_source": ["name", "price", "category"],
  "query": { "match_all": {} }
}

// 3. Search After (Deep Pagination alternative)
GET /products/_search
{
  "size": 20,
  "sort": [
    { "created_at": "desc" },
    { "_id": "asc" }
  ],
  "search_after": ["2025-03-01T00:00:00", "abc123"]
}

9.3 Caching Strategy

Elasticsearch Cache Layers:

1. Node Query Cache (Filter Cache)
   - Caches filter context results
   - Node-level, 10% of heap (default)
   - LRU eviction

2. Shard Request Cache
   - Caches aggregation results
   - Shard-level, 1% of heap (default)
   - Invalidated on index refresh

3. Field Data Cache
   - Sort/aggregation data for text fields
   - Uses heap memory (caution needed)

4. OS Page Cache
   - Caches Lucene segment files
   - Uses off-heap memory
   - Most important cache!

9.4 Segment Merge Optimization

// Force Merge (for read-only indices)
POST /logs-2025.01/_forcemerge?max_num_segments=1

// Merge policy settings
PUT /products/_settings
{
  "index": {
    "merge": {
      "policy": {
        "max_merged_segment": "5gb",
        "segments_per_tier": 10,
        "floor_segment": "2mb"
      }
    }
  }
}

9.5 JVM and OS Level Optimization

# jvm.options
-Xms16g
-Xmx16g
# Heap should be 50% or less of physical memory, max ~30.5GB (Compressed OOPs)

# OS-level settings
# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1

# /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096

10. Production Operational Patterns

10.1 Index Templates and Component Templates

// Component Template (reusable blocks)
PUT /_component_template/base-settings
{
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    }
  }
}

PUT /_component_template/log-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "trace_id": { "type": "keyword" }
      }
    }
  }
}

// Index Template (composing Component Templates)
PUT /_index_template/logs
{
  "index_patterns": ["logs-*"],
  "composed_of": ["base-settings", "log-mappings"],
  "priority": 200
}

10.2 Aliases and Reindex

// Alias for zero-downtime index swap
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

// Reindex (index migration)
POST /_reindex
{
  "source": {
    "index": "old-index",
    "query": {
      "range": {
        "@timestamp": { "gte": "2025-01-01" }
      }
    }
  },
  "dest": {
    "index": "new-index",
    "pipeline": "enrichment-pipeline"
  }
}

10.3 Snapshot and Restore

// Register repository (S3)
PUT /_snapshot/s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups",
    "region": "us-east-1",
    "base_path": "elasticsearch"
  }
}

// Create snapshot
PUT /_snapshot/s3-backup/snapshot-2025-03-24
{
  "indices": "logs-*,products",
  "ignore_unavailable": true,
  "include_global_state": false
}

// Restore
POST /_snapshot/s3-backup/snapshot-2025-03-24/_restore
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

11. Interview Quiz

Q1. Why can't you search for a document immediately after indexing it in Elasticsearch?

Elasticsearch is a Near Real-Time (NRT) search engine. When a document is indexed, it is first written to an in-memory buffer and the transaction log (translog). For the document to become searchable, a refresh operation must occur, which creates a new Lucene segment from the in-memory buffer.

The default refresh interval is 1 second, which is why it is called "Near Real-Time."

  • Configurable via index.refresh_interval
  • For immediate search, call POST /index/_refresh
  • During bulk indexing, set to -1 to disable, then manually refresh after completion

Q2. What is the difference between text and keyword field types?

text type:

  • Tokenized through an analyzer
  • Individual tokens stored in the inverted index
  • Used for full-text search (match query)
  • No Doc Values support (Fielddata used for sorting/aggregation, memory-intensive)

keyword type:

  • Stored as-is without analysis
  • Used for exact match searches (term query)
  • Optimized for sorting, aggregation, and filtering
  • Doc Values supported (disk-based)

Practical tip: Fields like name are commonly configured as multi-fields with both types.

"name": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword" }
  }
}

Q3. What is the difference between must and filter in a Bool Query?

Both require matching, but there is a key difference:

must:

  • Calculates relevance score (_score)
  • Results are not cached
  • Use when "how well does it match" matters

filter:

  • Does not calculate score (always 0)
  • Results are cached (Node Query Cache)
  • Use when only "does it match or not" matters
  • Performance benefit when the same filter is reused

Optimization principle: Move all conditions that do not need scoring to filter. Especially term, range, and exists conditions are well-suited for filter.

Q4. What are the roles of Primary and Replica Shards?

Primary Shard:

  • Stores the original index data
  • Cannot be changed after index creation (except via shrink/split)
  • All write (index) requests are processed by the Primary Shard first
  • Unit of data distribution

Replica Shard:

  • A copy of the Primary Shard
  • Can be dynamically adjusted
  • Two purposes:
    1. High Availability: Replica promotes to Primary on failure
    2. Search Performance: Distributes read load across replicas
  • Never placed on the same node as its Primary

Shard allocation formula: Maximum simultaneous node failures tolerated = number of replicas. Search throughput scales proportionally to (Primary + Replica) count.

Q5. Why is deep pagination dangerous in Elasticsearch and what are the alternatives?

The problem with deep pagination: A request with from: 10000, size: 10 requires each shard to sort and return 10,010 documents to the coordinating node. With 3 shards, that means 30,030 documents must be sorted in memory. Resource consumption grows linearly with page depth and quickly becomes prohibitive.

The default maximum for from + size is 10,000 (index.max_result_window).
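The arithmetic above generalizes to one line: every shard materializes from + size sorted entries, and the coordinator merges all of them.

```python
def deep_page_cost(from_, size, num_shards):
    """Documents each shard returns, and the total the coordinator must merge."""
    per_shard = from_ + size
    return per_shard, per_shard * num_shards

per_shard, total = deep_page_cost(10_000, 10, 3)
print(per_shard, total)  # 10010 30030
```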

Alternatives:

  1. Search After: Uses the last sort value from the previous page as a cursor for the next page. Real-time cursor-based.
  2. Scroll API: Snapshot-based traversal of large datasets. Suitable for batch processing. (Being deprecated; PIT + search_after recommended)
  3. Point in Time (PIT): A replacement for Scroll. Used with search_after for consistent views.

// PIT + Search After example
POST /products/_pit?keep_alive=5m

GET /_search
{
  "pit": { "id": "PIT_ID", "keep_alive": "5m" },
  "size": 20,
  "sort": [{ "created_at": "desc" }, { "_shard_doc": "asc" }],
  "search_after": [1679616000000, 42]
}
