Elasticsearch 완전 가이드 2025: 검색 엔진부터 로그 분석, 벡터 검색까지

1. Elasticsearch란 무엇인가

Elasticsearch는 Apache Lucene 기반의 분산 검색 및 분석 엔진입니다. 2010년 Shay Banon이 처음 릴리스한 이후, 전문 검색(Full-Text Search), 로그 분석, 메트릭 모니터링, 보안 분석, 그리고 최근에는 벡터 검색(Vector Search)까지 활용 영역이 확대되고 있습니다.

1.1 왜 Elasticsearch인가

전통적인 RDBMS에서 LIKE '%keyword%' 쿼리를 실행하면 전체 테이블을 풀 스캔해야 합니다. 데이터가 수억 건이 되면 응답 시간이 수십 초에 달할 수 있습니다. Elasticsearch는 역색인(Inverted Index) 구조를 사용하여 밀리초 단위의 검색 응답을 제공합니다.

핵심 특징:

분산 아키텍처: 수평 확장(Scale-out)이 가능하며 수십 TB 데이터 처리
실시간에 가까운 검색: 문서 색인 후 약 1초 이내에 검색 가능(Near Real-Time)
스키마리스(Schemaless): JSON 문서를 바로 색인 가능하며 동적 매핑 지원
RESTful API: 모든 작업을 HTTP/JSON으로 수행
풍부한 에코시스템: Kibana, Logstash, Beats, APM 등과 통합

1.2 Elasticsearch vs RDBMS 비교

항목	RDBMS	Elasticsearch
데이터 단위	Row	Document (JSON)
테이블	Table	Index
컬럼	Column	Field
스키마	고정 스키마	동적 매핑 가능
검색 방식	B-Tree 인덱스	역색인(Inverted Index)
확장 방식	Scale-up 위주	Scale-out (Sharding)
트랜잭션	ACID 지원	미지원
주 용도	OLTP	검색, 분석, 로깅

1.3 Elasticsearch 버전 히스토리

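ES 1.x (2014) - Initial stabilization
ES 2.x (2015) - Pipeline aggregations introduced
ES 5.x (2016) - Lucene 6, Painless scripting
ES 6.x (2017) - Single mapping type per index
ES 7.x (2019) - Mapping types deprecated (removed in 8.0), adaptive replica selection on by default
ES 8.x (2022) - Security enabled by default, approximate kNN vector search, NLP integration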

2. 역색인(Inverted Index)의 원리

Elasticsearch 검색 속도의 비밀은 역색인에 있습니다. 일반적인 데이터베이스가 문서에서 단어를 찾는(Forward Index) 반면, 역색인은 단어에서 문서를 찾습니다.

2.1 역색인 구조

세 개의 문서가 있다고 가정합니다:

Doc 1: "The quick brown fox"
Doc 2: "The quick brown dog"
Doc 3: "The lazy brown fox"

역색인은 다음과 같이 구성됩니다:

Term      | Document IDs
----------|-------------
the       | [1, 2, 3]
quick     | [1, 2]
brown     | [1, 2, 3]
fox       | [1, 3]
dog       | [2]
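lazy      | [3]

Searching for "fox" jumps straight to its postings list and returns Doc 1 and Doc 3. The cost of locating a term depends mainly on the term's length rather than on the number of documents, so lookups stay close to constant time as the index grows; the remaining work scales with the number of matching documents.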

2.2 Lucene 세그먼트 구조

Elasticsearch 내부의 Lucene은 데이터를 세그먼트(Segment) 단위로 저장합니다:

Index
  └── Shard (Lucene Index)
        ├── Segment 0 (immutable)
        │     ├── Inverted Index
        │     ├── Stored Fields
        │     ├── Doc Values
        │     └── Term Vectors
        ├── Segment 1 (immutable)
        ├── Segment 2 (immutable)
        └── Commit Point (segments_N)

핵심 포인트:

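Segments are immutable once written
Deleting a document does not remove it; the document is only marked as deleted in the segment's live-docs file (.liv; .del in old Lucene versions)
Segments are merged in the background, which is when deleted documents are physically dropped
Each refresh (every 1 second by default) writes a new segment and makes its documents searchable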

2.3 Doc Values와 Fielddata

역색인은 텍스트 검색에 최적화되어 있지만, 정렬이나 집계에는 적합하지 않습니다:

역색인 (검색용): Term → Document IDs
Doc Values (집계/정렬용): Document ID → Values

// Doc Values 예시
{
  "doc_1": { "price": 100, "category": "electronics" },
  "doc_2": { "price": 200, "category": "books" },
  "doc_3": { "price": 150, "category": "electronics" }
}

Doc Values: keyword, numeric, date, ip, geo_point 타입에 자동 활성화. 디스크 기반.
Fielddata: text 타입의 정렬/집계 시 사용. 힙 메모리 사용으로 주의 필요.

3. 매핑(Mapping)과 분석기(Analyzer)

3.1 매핑 정의

매핑은 인덱스의 스키마를 정의합니다. 각 필드의 타입과 분석 방법을 지정합니다.

PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "description": {
        "type": "text",
        "analyzer": "nori"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      },
      "location": {
        "type": "geo_point"
      },
      "tags": {
        "type": "keyword"
      },
      "metadata": {
        "type": "object"
      },
      "reviews": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "rating": { "type": "integer" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}

3.2 주요 필드 타입

타입	설명	역색인	Doc Values
`text`	전문 검색용 (분석기 적용)	O	X (Fielddata)
`keyword`	정확한 일치, 정렬, 집계	O	O
`long/integer/float`	숫자	O (BKD Tree)	O
`date`	날짜/시간	O	O
`boolean`	true/false	O	O
`geo_point`	위도/경도	O	O
`nested`	중첩 객체 (독립 문서)	O	O
`object`	JSON 객체 (평탄화)	O	O
`dense_vector`	벡터 (kNN 검색)	X	X (별도 저장)

3.3 분석기(Analyzer) 파이프라인

분석기는 세 단계로 구성됩니다:

텍스트 입력
  │
  ▼
Character Filter (문자 필터)
  │  - html_strip: HTML 태그 제거
  │  - mapping: 문자 치환
  │  - pattern_replace: 정규식 치환
  ▼
Tokenizer (토크나이저)
  │  - standard: 유니코드 단어 경계 기반
  │  - whitespace: 공백 기준 분리
  │  - ngram: N-gram 토큰 생성
  │  - edge_ngram: 접두사 기반 토큰
  │  - nori_tokenizer: 한국어 형태소 분석
  ▼
Token Filter (토큰 필터)
  │  - lowercase: 소문자 변환
  │  - stop: 불용어 제거
  │  - synonym: 동의어 처리
  │  - stemmer: 어간 추출
  │  - nori_part_of_speech: 한국어 품사 필터
  ▼
토큰 스트림 (Term)

3.4 한국어 분석기 설정(Nori)

PUT /korean_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "nori_custom": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed",
          "discard_punctuation": true,
          "user_dictionary_rules": [
            "삼성전자",
            "엘라스틱서치"
          ]
        }
      },
      "filter": {
        "nori_pos_filter": {
          "type": "nori_part_of_speech",
          "stoptags": [
            "E", "IC", "J", "MAG", "MAJ",
            "MM", "SP", "SSC", "SSO", "SC",
            "SE", "XPN", "XSA", "XSN", "XSV",
            "UNA", "NA", "VSV"
          ]
        }
      },
      "analyzer": {
        "korean_analyzer": {
          "type": "custom",
          "tokenizer": "nori_custom",
          "filter": [
            "nori_readingform",
            "nori_pos_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "korean_analyzer"
      }
    }
  }
}

3.5 Analyze API로 분석 결과 확인

POST /korean_index/_analyze
{
  "analyzer": "korean_analyzer",
  "text": "엘라스틱서치는 강력한 검색 엔진입니다"
}

// 결과
{
  "tokens": [
    { "token": "엘라스틱서치", "position": 0 },
    { "token": "강력", "position": 2 },
    { "token": "검색", "position": 4 },
    { "token": "엔진", "position": 5 }
  ]
}

4. Query DSL 완전 정복

Elasticsearch의 쿼리 DSL은 JSON 기반의 강력한 쿼리 언어입니다. 크게 Query Context(관련도 점수 계산)와 Filter Context(Yes/No 판별, 캐싱)로 나뉩니다.

4.1 Match Query (전문 검색)

// 기본 match - 분석기를 통해 토큰화 후 검색
GET /products/_search
{
  "query": {
    "match": {
      "description": "강력한 검색 엔진"
    }
  }
}

// match_phrase - 단어 순서까지 일치
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "검색 엔진",
        "slop": 1
      }
    }
  }
}

// multi_match - 여러 필드에서 검색
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "엘라스틱서치 가이드",
      "fields": ["title^3", "description", "tags^2"],
      "type": "best_fields"
    }
  }
}

4.2 Term Query (정확한 일치)

// term - keyword 필드의 정확한 일치
GET /products/_search
{
  "query": {
    "term": {
      "category": "electronics"
    }
  }
}

// terms - 여러 값 중 하나와 일치
GET /products/_search
{
  "query": {
    "terms": {
      "status": ["published", "pending"]
    }
  }
}

// range - 범위 검색
GET /products/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 100,
        "lte": 500
      }
    }
  }
}

// exists - 필드 존재 여부
GET /products/_search
{
  "query": {
    "exists": {
      "field": "discount"
    }
  }
}

4.3 Bool Query (복합 쿼리)

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "검색 엔진" } }
      ],
      "must_not": [
        { "term": { "status": "deleted" } }
      ],
      "should": [
        { "term": { "featured": true } },
        { "range": { "rating": { "gte": 4.5 } } }
      ],
      "filter": [
        { "term": { "category": "software" } },
        { "range": { "price": { "lte": 1000 } } }
      ],
      "minimum_should_match": 1
    }
  }
}

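Clauses of the bool query:

Clause	Affects score	Cacheable	Purpose
`must`	Yes	No	Must match, contributes to the score
`must_not`	No	Yes	Must not match
`should`	Yes	No	Optional match that raises the score
`filter`	No	Yes	Must match (score ignored)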
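
The clause semantics in the table can be sketched in a few lines of Python. This is a simplified model, not how Lucene evaluates queries: each clause is a plain predicate, and every matching `must` or `should` clause adds a flat 1.0 to the score in place of real relevance scoring. The function name `bool_match` and the sample documents are made up for illustration.

```python
def bool_match(doc, must=(), filters=(), must_not=(), should=(), minimum_should_match=0):
    """Return (matches, score) for one document.

    must and should clauses add 1.0 per match as a stand-in for relevance
    scoring; filters and must_not only include or exclude the document.
    """
    if not all(clause(doc) for clause in must):
        return False, 0.0
    if not all(clause(doc) for clause in filters):
        return False, 0.0
    if any(clause(doc) for clause in must_not):
        return False, 0.0
    should_hits = sum(1 for clause in should if clause(doc))
    if should_hits < minimum_should_match:
        return False, 0.0
    return True, float(len(must) + should_hits)


# Predicates mirroring the request in section 4.3
must = [lambda d: "search" in d["description"]]
must_not = [lambda d: d["status"] == "deleted"]
should = [lambda d: d["featured"], lambda d: d["rating"] >= 4.5]
filters = [lambda d: d["category"] == "software", lambda d: d["price"] <= 1000]

doc_a = {"description": "fast search engine", "status": "published",
         "category": "software", "price": 300, "featured": True, "rating": 4.0}
doc_b = dict(doc_a, featured=False, rating=3.0)

print(bool_match(doc_a, must, filters, must_not, should, minimum_should_match=1))  # (True, 2.0)
print(bool_match(doc_b, must, filters, must_not, should, minimum_should_match=1))  # (False, 0.0)
```

`doc_b` satisfies every `must` and `filter` clause but is still rejected, because `minimum_should_match: 1` turns the otherwise optional `should` clauses into a requirement.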

4.4 Nested Query

// 중첩 객체 검색 - 독립적인 문서로 저장됨
GET /products/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "term": { "reviews.author": "john" } },
            { "range": { "reviews.rating": { "gte": 4 } } }
          ]
        }
      },
      "inner_hits": {
        "size": 3,
        "highlight": {
          "fields": {
            "reviews.comment": {}
          }
        }
      }
    }
  }
}

4.5 Function Score Query

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "엘라스틱서치" } },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "modifier": "log1p",
            "factor": 2
          }
        },
        {
          "gauss": {
            "created_at": {
              "origin": "now",
              "scale": "30d",
              "decay": 0.5
            }
          }
        },
        {
          "filter": { "term": { "featured": true } },
          "weight": 5
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}
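
To see how the three functions in this request combine, the arithmetic can be reproduced in Python. This is a sketch under stated assumptions: the Gaussian decay is written so that the score is exactly `decay` at a distance of `scale` from the origin, `log1p` is taken as the common logarithm of `1 + factor * value`, and the function names are illustrative rather than part of any client library.

```python
import math


def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Gaussian decay: 1.0 at the origin, exactly `decay` at distance offset + scale."""
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


def field_value_factor_log1p(value, factor=1.0):
    """log1p modifier: common logarithm of (1 + factor * value)."""
    return math.log10(1.0 + factor * value)


def function_score(query_score, popularity, age_days, featured):
    """Combine the functions with score_mode=sum and boost_mode=multiply."""
    functions = [
        field_value_factor_log1p(popularity, factor=2),
        gauss_decay(age_days, scale=30),
    ]
    if featured:
        functions.append(5.0)  # weight-only function behind the featured filter
    return query_score * sum(functions)


print(gauss_decay(30, scale=30))   # half the score for a 30-day-old document
print(function_score(2.0, popularity=4.5, age_days=0, featured=True))
```

With `score_mode: sum` a featured document gets a flat +5 inside the sum, which can easily dominate the other two functions; lowering the weight or switching `score_mode` is the usual tuning knob.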

4.6 하이라이팅(Highlighting)

GET /products/_search
{
  "query": {
    "match": { "description": "검색 엔진" }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "description": {
        "fragment_size": 150,
        "number_of_fragments": 3
      }
    }
  }
}

5. Mastering Aggregations

Aggregations play the role of SQL's GROUP BY, but they can also be nested inside each other and chained. They fall into three families: bucket, metric, and pipeline aggregations.

5.1 Bucket Aggregation (버킷 집계)

// terms 집계 - 카테고리별 문서 수
GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "size": 20,
        "order": { "_count": "desc" }
      }
    }
  }
}

// date_histogram - 시간대별 집계
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "logs_per_hour": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h",
        "time_zone": "Asia/Seoul",
        "min_doc_count": 0
      }
    }
  }
}

// range 집계 - 가격대별
GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 500 },
          { "from": 500 }
        ]
      }
    }
  }
}

// 중첩 집계 - 카테고리별 평균 가격
GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        },
        "price_stats": {
          "stats": { "field": "price" }
        }
      }
    }
  }
}
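
What the nested request computes can be mirrored in plain Python: group documents by a field (the bucket aggregation), then compute a metric per bucket (the sub-aggregation). The sample `products` list and the function name `terms_with_avg` are assumptions for illustration; ties in document count are broken by key here.

```python
from collections import defaultdict

products = [
    {"category": "electronics", "price": 100.0},
    {"category": "books", "price": 20.0},
    {"category": "electronics", "price": 300.0},
    {"category": "books", "price": 40.0},
    {"category": "toys", "price": 15.0},
]


def terms_with_avg(docs, field, metric_field, size=10):
    """Bucket docs by `field`, then attach doc_count and the average of `metric_field`."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc[metric_field])
    rows = [
        {"key": key, "doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    ]
    # Largest buckets first; equal counts are ordered by key
    rows.sort(key=lambda row: (-row["doc_count"], row["key"]))
    return rows[:size]


for row in terms_with_avg(products, "category", "price"):
    print(row)
```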

5.2 Metric Aggregation (메트릭 집계)

GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price": { "avg": { "field": "price" } },
    "max_price": { "max": { "field": "price" } },
    "min_price": { "min": { "field": "price" } },
    "total_sales": { "sum": { "field": "sales_count" } },
    "unique_brands": { "cardinality": { "field": "brand" } },
    "price_percentiles": {
      "percentiles": {
        "field": "price",
        "percents": [25, 50, 75, 90, 99]
      }
    }
  }
}

5.3 Pipeline Aggregation (파이프라인 집계)

// derivative and moving_fn are parent pipeline aggregations, so they sit inside the date_histogram;
// avg_bucket and max_bucket are sibling pipeline aggregations and stay next to it
GET /sales/_search
{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_revenue": {
          "sum": { "field": "revenue" }
        },
        "revenue_derivative": {
          "derivative": {
            "buckets_path": "total_revenue"
          }
        },
        "moving_avg_revenue": {
          "moving_fn": {
            "buckets_path": "total_revenue",
            "window": 3,
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        }
      }
    },
    "avg_monthly_revenue": {
      "avg_bucket": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    },
    "max_monthly_revenue": {
      "max_bucket": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    }
  }
}
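
Pipeline aggregations operate on the output of other aggregations rather than on documents. The sketch below reproduces a derivative and a moving average over a list of monthly totals. It assumes the window covers the preceding buckets and excludes the current one, and it uses `None` where there is nothing to compute; the sample figures are invented.

```python
def derivative(values):
    """First difference between consecutive buckets; the first bucket has no value."""
    return [None] + [curr - prev for prev, curr in zip(values, values[1:])]


def moving_unweighted_avg(values, window):
    """Average of up to `window` preceding buckets, excluding the current bucket."""
    result = []
    for i in range(len(values)):
        previous = values[max(0, i - window):i]
        result.append(sum(previous) / len(previous) if previous else None)
    return result


monthly_revenue = [100.0, 150.0, 120.0, 180.0, 210.0]

print(derivative(monthly_revenue))                        # [None, 50.0, -30.0, 60.0, 30.0]
print(moving_unweighted_avg(monthly_revenue, window=3))
print(sum(monthly_revenue) / len(monthly_revenue))        # avg_bucket equivalent: 152.0
print(max(monthly_revenue))                               # max_bucket equivalent: 210.0
```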

6. ELK 스택 (Logstash, Kibana, Beats)

6.1 ELK 스택 아키텍처

데이터 소스                  수집               처리/변환            저장              시각화
┌──────────┐          ┌──────────┐      ┌──────────┐      ┌──────────┐      ┌──────────┐
│  App Log │──────────│ Filebeat │──────│ Logstash │──────│  Elastic │──────│  Kibana  │
│  Server  │          │ Metricbt │      │ Pipeline │      │  search  │      │Dashboard │
│  Docker  │          │ Heartbt  │      │ (Filter) │      │ Cluster  │      │ Lens/Map │
│  K8s     │          │ Packetbt │      │          │      │          │      │ Alerting │
└──────────┘          └──────────┘      └──────────┘      └──────────┘      └──────────┘

6.2 Logstash 파이프라인

# /etc/logstash/conf.d/main.conf
input {
  beats {
    port => 5044
  }
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["app-logs"]
    codec => json
  }
}

filter {
  # Grok 패턴으로 비정형 로그 파싱
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
    }
  }

  # 날짜 파싱
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }

  # GeoIP 변환
  geoip {
    source => "client_ip"
    target => "geo"
  }

  # 조건부 처리
  if [level] == "ERROR" {
    mutate {
      add_tag => ["alert"]
      add_field => { "severity" => "high" }
    }
  }

  # 불필요한 필드 제거
  mutate {
    remove_field => ["agent", "ecs", "host"]
  }
}

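output {
  # All events go to the daily log index.
  # ES_PASSWORD is an assumed environment variable (or Logstash keystore entry).
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ES_PASSWORD}"
  }

  # Error events are additionally copied to a dedicated alerts index
  if "alert" in [tags] {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      index => "alerts-%{+YYYY.MM.dd}"
      user => "elastic"
      password => "${ES_PASSWORD}"
    }
  }
}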
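
The filter stage above is easier to reason about with a concrete line in hand. The following Python sketch approximates the grok pattern with a regular expression and applies the same conditional tagging. The regex is a simplified stand-in for `TIMESTAMP_ISO8601` and `LOGLEVEL`, not their exact definitions, and `parse_line` is a hypothetical helper.

```python
import re

# Simplified stand-in for "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?)\s+"
    r"(?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+"
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Parse one log line into an event dict, tagging failures the way grok does."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return {"message": line, "tags": ["_grokparsefailure"]}
    event = match.groupdict()
    if event["level"] == "ERROR":
        # Same effect as the mutate block inside the conditional
        event["tags"] = ["alert"]
        event["severity"] = "high"
    return event


print(parse_line("2025-03-24T10:15:30Z ERROR Connection refused"))
print(parse_line("not a structured line"))
```

Testing a pattern against a handful of real lines before deploying the pipeline catches most `_grokparsefailure` surprises early.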

6.3 Filebeat 설정

# /etc/filebeat/filebeat.yml
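filebeat.inputs:
  # filestream replaces the deprecated log input (Filebeat 7.16+)
  - type: filestream
    id: app-logs
    enabled: true
    paths:
      - /var/log/app/*.log
    parsers:
      # Lines that do not start with a date are appended to the previous event
      - multiline:
          type: pattern
          pattern: '^\d{4}-\d{2}-\d{2}'
          negate: true
          match: after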

  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]

# 또는 직접 Elasticsearch로
# output.elasticsearch:
#   hosts: ["http://elasticsearch:9200"]
#   index: "filebeat-%{+yyyy.MM.dd}"

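6.4 Key Kibana Features

Discover: search and filter logs, view their distribution over time
Dashboard: combine multiple visualizations into one dashboard
Lens: drag-and-drop visualization builder
Maps: geospatial data visualization
Alerting: condition-based alerts (Slack, Email, PagerDuty)
APM: application performance monitoring
Security: SIEM features, security event analysis
Dev Tools: call the REST API directly from the Console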

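7. Vector Search and kNN

Approximate kNN search over indexed dense_vector fields has been built into Elasticsearch since 8.0, and it is what makes semantic search possible.

7.1 What Is Vector Search

Traditional keyword search depends on exact term matches: a search for "자동차" (car) does not find "차량" or "vehicle". Vector search converts text into high-dimensional vectors and ranks results by the similarity between those vectors.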

"자동차" → [0.12, -0.34, 0.56, ..., 0.89]  (768차원)
"차량"   → [0.13, -0.32, 0.55, ..., 0.88]  (유사한 벡터)
"과일"   → [-0.45, 0.67, -0.12, ..., 0.23] (다른 벡터)

7.2 kNN 검색 인덱스 설정

PUT /semantic-search
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 200
        }
      }
    }
  }
}

7.3 kNN 검색 수행

// 순수 kNN 검색
GET /semantic-search/_search
{
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56, ...],
    "k": 10,
    "num_candidates": 100
  }
}

// 하이브리드 검색 (키워드 + kNN)
GET /semantic-search/_search
{
  "query": {
    "match": {
      "content": "자동차 추천"
    }
  },
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56, ...],
    "k": 10,
    "num_candidates": 100,
    "boost": 0.5
  },
  "size": 10
}
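
The `knn` clause approximates an exact nearest-neighbor ranking. That exact ranking is simple to write down as a brute-force scan with cosine similarity. The 4-dimensional vectors below are toy values built from the visible components of the 768-dimensional illustration in section 7.1, and the English labels are stand-ins for the Korean terms.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Exact k nearest neighbors by cosine similarity (brute force over every vector)."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in vectors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


vectors = {
    "car":     [0.12, -0.34, 0.56, 0.89],
    "vehicle": [0.13, -0.32, 0.55, 0.88],
    "fruit":   [-0.45, 0.67, -0.12, 0.23],
}

print(knn(vectors["car"], vectors, k=2))  # ['car', 'vehicle']
```

A brute-force scan costs one similarity computation per stored vector, which is why large indices rely on an approximate structure and expose `num_candidates` to trade recall for speed.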

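7.4 Understanding the HNSW Algorithm

HNSW (Hierarchical Navigable Small World) is the approximate nearest neighbor (ANN) algorithm behind kNN search. It builds a layered graph: the sparse upper layers allow long jumps across the data set, and the dense bottom layer refines the result.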

Layer 2 (sparse):  A ──── B
                   │
Layer 1 (medium):  A ── C ── B ── D
                   │    │    │    │
Layer 0 (dense):   A-E-C-F-B-G-D-H

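Parameter	Description	Default	Trade-off
`m`	Connections per node in the graph	16	Higher values improve recall but use more memory
`ef_construction`	Candidates tracked while building the graph	100	Higher values improve recall but slow down indexing
`num_candidates`	Candidates examined per shard at search time (the role `ef` plays in HNSW)	Set per request	Higher values improve recall but slow down search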

7.5 NLP 모델 통합 (Elastic ELSER)

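// Create the ELSER v2 model configuration (this triggers the model download)
PUT /_ml/trained_models/.elser_model_2
{
  "input": {
    "field_names": ["text_field"]
  }
}

// Start the deployment once the download has finished
POST /_ml/trained_models/.elser_model_2/deployment/_start

// Ingest pipeline that runs ELSER at index time. The output is a sparse token-weight vector, so the target field needs a sparse vector mapping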
PUT /_ingest/pipeline/elser-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ]
      }
    }
  ]
}

8. 클러스터 운영 및 관리

8.1 클러스터 아키텍처

┌─────────────────────────────────────────────────┐
│                  Elasticsearch Cluster            │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────┐│
│  │ Master Node │  │ Master Node │  │Master Node ││
│  │  (Elected)  │  │ (Eligible)  │  │(Eligible)  ││
│  └─────────────┘  └─────────────┘  └────────────┘│
│                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────┐│
│  │  Data Node  │  │  Data Node  │  │ Data Node  ││
│  │ (Hot Tier)  │  │ (Hot Tier)  │  │(Warm Tier) ││
│  │  SSD 1TB    │  │  SSD 1TB    │  │ HDD 4TB    ││
│  └─────────────┘  └─────────────┘  └────────────┘│
│                                                   │
│  ┌─────────────┐  ┌─────────────┐                │
│  │ Ingest Node │  │ Coord Node  │                │
│  │ (Pipeline)  │  │ (Routing)   │                │
│  └─────────────┘  └─────────────┘                │
└─────────────────────────────────────────────────┘

8.2 샤드(Shard) 전략

// 인덱스 생성 시 샤드 수 지정
PUT /logs-2025.03
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "routing.allocation.require.data": "hot"
  }
}

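Shard sizing guidelines:

Shard size: aim for roughly 10-50GB per shard
Shards per node: a long-standing rule of thumb is at most 20 shards per GB of heap
Cluster-wide shard count: every shard adds cluster-state overhead on the master node; by default the cluster also refuses to exceed 1,000 open shards per data node (cluster.max_shards_per_node)
Index pattern: use date-based indices for time-series data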
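
The sizing guidelines translate into simple arithmetic. The helper names and the 30GB target below are illustrative assumptions picked from the middle of the recommended range, not official formulas.

```python
import math


def primary_shard_count(total_data_gb, target_shard_gb=30):
    """Primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(total_data_gb / target_shard_gb))


def max_shards_for_heap(heap_gb, shards_per_heap_gb=20):
    """Upper bound on shards per node under the 20-shards-per-GB-of-heap rule of thumb."""
    return heap_gb * shards_per_heap_gb


print(primary_shard_count(900))   # 30 primaries for 900GB at ~30GB each
print(primary_shard_count(5))     # 1: a small index does not need to be split
print(max_shards_for_heap(16))    # 320 shards on a node with a 16GB heap
```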

8.3 Index Lifecycle Management (ILM)

PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

// ILM 정책을 인덱스 템플릿에 적용
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}
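
The phase transitions in this policy are driven purely by `min_age`. A minimal sketch of that decision, assuming ages are given in whole days and ignoring the time each phase's actions take to run:

```python
# (phase, min_age in days) taken from the policy above
PHASE_MIN_AGE_DAYS = [("hot", 0), ("warm", 7), ("cold", 30), ("delete", 90)]


def current_phase(age_days):
    """Return the last phase whose min_age the index has reached."""
    phase = PHASE_MIN_AGE_DAYS[0][0]
    for name, min_age in PHASE_MIN_AGE_DAYS:
        if age_days >= min_age:
            phase = name
    return phase


for age in (0, 7, 29, 30, 120):
    print(age, current_phase(age))
```

When the hot phase uses rollover, the age is measured from the rollover time rather than from index creation, so an index that rolls over after one day enters the warm phase about eight days after it was created.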

8.4 클러스터 모니터링 API

# 클러스터 상태
GET /_cluster/health

# 노드 정보
GET /_nodes/stats

# 인덱스 상태
GET /_cat/indices?v&s=store.size:desc

# 샤드 할당 상태
GET /_cat/shards?v&s=store:desc

# 할당 실패 원인
GET /_cluster/allocation/explain

# Hot Threads (CPU 사용 분석)
GET /_nodes/hot_threads

# Pending Tasks
GET /_cluster/pending_tasks

9. 성능 최적화

9.1 인덱싱 성능 최적화

// Use the Bulk API: batching many operations into one request is typically far faster than sending one request per document
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Product 1", "price": 100}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Product 2", "price": 200}
{"update": {"_index": "products", "_id": "1"}}
{"doc": {"price": 150}}
{"delete": {"_index": "products", "_id": "3"}}

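Indexing optimization checklist:

Item	Setting	Effect
Bulk size	Roughly 5-15MB per request	Less per-request network overhead
Refresh interval	`"30s"` or `"-1"` during bulk loads	Fewer small segments created
Replica count	`0` during the initial load, restored afterwards	No replication overhead while loading
Translog	`index.translog.durability: "async"`	Fewer fsyncs, at the risk of losing the last few seconds of writes on a crash
Mapping	`"enabled": false` on fields that are never searched	Less indexing work
ID generation	Let Elasticsearch auto-generate IDs	Skips the lookup for an existing document with the same ID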
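
The bulk body is newline-delimited JSON, and the details (one action line, an optional source line, a trailing newline) are easy to get wrong by hand. A small builder using only the standard library might look like this; `build_bulk_body` is a hypothetical helper, and sending the body over HTTP is left out.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source_or_None) tuples into a Bulk API body."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:          # delete has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"      # the body must end with a newline


actions = [
    ("index", {"_index": "products", "_id": "1"}, {"name": "Product 1", "price": 100}),
    ("update", {"_index": "products", "_id": "1"}, {"doc": {"price": 150}}),
    ("delete", {"_index": "products", "_id": "3"}, None),
]

body = build_bulk_body(actions)
print(body, end="")
```

The request is sent with the `Content-Type: application/x-ndjson` header. Checking `len(body.encode("utf-8"))` against the 5-15MB guideline is a simple way to decide when to flush a batch.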

9.2 검색 성능 최적화

// 1. Filter Context 활용 (캐싱됨)
GET /products/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "category": "electronics" } },
        { "range": { "price": { "gte": 100, "lte": 500 } } }
      ],
      "must": [
        { "match": { "description": "wireless" } }
      ]
    }
  }
}

// 2. Source Filtering (필요한 필드만)
GET /products/_search
{
  "_source": ["name", "price", "category"],
  "query": { "match_all": {} }
}

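// 3. Search After (alternative to deep pagination)
// The sort needs a unique tiebreaker backed by doc values; "sku" is a hypothetical keyword field
// (sorting on _id requires fielddata on _id, which is disabled by default in 8.x).
// search_after takes the sort values of the last hit on the previous page.
GET /products/_search
{
  "size": 20,
  "sort": [
    { "created_at": "desc" },
    { "sku": "asc" }
  ],
  "search_after": [1740787200000, "abc123"]
}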

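9.3 Caching Strategy

Elasticsearch cache layers:

1. Node Query Cache (Filter Cache)
   - Caches the results of filter-context queries
   - Node-level, 10% of the heap by default
   - LRU eviction

2. Shard Request Cache
   - Caches aggregation results (by default only requests with size 0)
   - Shard-level, 1% of the heap by default
   - Invalidated when a refresh changes the shard's data

3. Field Data Cache
   - Sort and aggregation data for text fields
   - Lives on the heap (use with care)

4. OS Page Cache
   - Caches Lucene segment files
   - Uses memory outside the heap
   - Usually the cache with the biggest impact, which is why about half of RAM is left to the OS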

9.4 세그먼트 병합(Merge) 최적화

// Force Merge (읽기 전용 인덱스에 적용)
POST /logs-2025.01/_forcemerge?max_num_segments=1

// 병합 정책 설정
PUT /products/_settings
{
  "index": {
    "merge": {
      "policy": {
        "max_merged_segment": "5gb",
        "segments_per_tier": 10,
        "floor_segment": "2mb"
      }
    }
  }
}

9.5 JVM 및 OS 레벨 최적화

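# jvm.options (or a file under jvm.options.d/)
-Xms16g
-Xmx16g
# Keep the heap at or below 50% of physical memory and under the compressed-oops threshold (about 30.5GB)

# elasticsearch.yml
bootstrap.memory_lock: true  # lock the process memory so the heap is never swapped out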

# OS 레벨 설정
# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1

# /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096

10. 실전 운영 패턴

10.1 Index Template과 Component Template

// Component Template (재사용 가능한 블록)
PUT /_component_template/base-settings
{
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    }
  }
}

PUT /_component_template/log-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "trace_id": { "type": "keyword" }
      }
    }
  }
}

// Index Template (Component Template 조합)
PUT /_index_template/logs
{
  "index_patterns": ["logs-*"],
  "composed_of": ["base-settings", "log-mappings"],
  "priority": 200
}

10.2 Alias와 Reindex

// Alias 설정 (무중단 인덱스 교체)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

// Reindex (인덱스 마이그레이션)
POST /_reindex
{
  "source": {
    "index": "old-index",
    "query": {
      "range": {
        "@timestamp": { "gte": "2025-01-01" }
      }
    }
  },
  "dest": {
    "index": "new-index",
    "pipeline": "enrichment-pipeline"
  }
}

10.3 Snapshot과 Restore

// 리포지토리 등록 (S3)
PUT /_snapshot/s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups",
    "region": "ap-northeast-2",
    "base_path": "elasticsearch"
  }
}

// 스냅샷 생성
PUT /_snapshot/s3-backup/snapshot-2025-03-24
{
  "indices": "logs-*,products",
  "ignore_unavailable": true,
  "include_global_state": false
}

// 복원
POST /_snapshot/s3-backup/snapshot-2025-03-24/_restore
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

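11. Interview Prep Quiz

Q1. Why is a document not searchable immediately after it is indexed?

Elasticsearch is a Near Real-Time (NRT) search engine. An indexed document is first written to an in-memory buffer and to the transaction log (translog). It becomes searchable only after a refresh turns the contents of the buffer into a new Lucene segment.

The default refresh interval is 1 second, which is why the engine is described as "near" real-time.

Adjustable through the index.refresh_interval setting
Call POST /index/_refresh when a document must be searchable immediately
For bulk loads, set it to -1 to disable refresh, then refresh manually once loading is complete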
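
The buffer-then-refresh behavior can be modeled with a toy shard. This is a deliberately simplified sketch: the class and method names are invented, segments are plain tuples, and flushing or translog replay is not modeled.

```python
class ToyShard:
    """Minimal model of near real-time search: writes are buffered until refresh."""

    def __init__(self):
        self.buffer = []     # indexed but not yet searchable
        self.translog = []   # durability log, replayed after a crash in a real shard
        self.segments = []   # immutable, searchable segments

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Turn the buffer into a new immutable segment, making its docs searchable."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, predicate):
        return [doc for segment in self.segments for doc in segment if predicate(doc)]


shard = ToyShard()
shard.index({"id": 1, "title": "elasticsearch guide"})
print(shard.search(lambda d: "guide" in d["title"]))   # [] because no refresh has run yet
shard.refresh()
print(shard.search(lambda d: "guide" in d["title"]))   # the document is now visible
```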

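Q2. What is the difference between the text and keyword types?

text type:

Tokenized by an analyzer
Stored in the inverted index as individual tokens
Used for full-text search (match query)
No doc values (sorting and aggregations need fielddata, which lives on the heap)

keyword type:

Stored as-is, without analysis
Used for exact-match search (term query)
Optimized for sorting, aggregations, and filtering
Doc values supported (disk-based)

Practical tip: a field such as name is commonly mapped as a multi-field so that both types are available.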

"name": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword" }
  }
}

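Q3. What is the difference between must and filter in a bool query?

Both are conditions that a document has to satisfy, but they differ in one key way:

must:

Computes a relevance score (_score)
Is not cached
Use it when how well a document matches is what counts

filter:

Does not compute a score (contributes 0)
Results can be cached (node query cache)
Use it when only match or no match is what counts
Pays off most when the same filter is reused across requests

Optimization rule: move every condition that does not need scoring into filter. Exact conditions such as term, range, and exists are natural fits.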

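Q4. What are the roles of primary and replica shards?

Primary shard:

Holds the original copy of the index data
The count cannot be changed after index creation (except by shrinking or splitting into a new index)
Every write (indexing) request is processed on the primary shard first
The unit of data distribution

Replica shard:

A copy of a primary shard
The count can be changed dynamically
Serves two purposes:
1. High availability: a replica is promoted when its primary fails
2. Search throughput: search requests are spread across copies (read load balancing)
Never allocated to the same node as its primary

Shard allocation rules of thumb:
Node failures that can be tolerated at once without data loss = number of replicas
Search throughput scales with the number of shard copies (primaries + replicas)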
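
The fixed primary count follows from how documents are routed: the target shard is derived from a hash of the routing value (the document ID by default) modulo the number of primary shards. The sketch below shows the idea in simplified form. `zlib.crc32` is only a stand-in for the hash Elasticsearch actually uses, and the real formula has extra factors to support splitting.

```python
import zlib


def route_to_shard(routing_key, number_of_primary_shards):
    """Pick a shard from the routing key (crc32 is a stand-in hash, not the real one)."""
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(100)]

with_3 = {doc_id: route_to_shard(doc_id, 3) for doc_id in doc_ids}
with_4 = {doc_id: route_to_shard(doc_id, 4) for doc_id in doc_ids}

moved = [doc_id for doc_id in doc_ids if with_3[doc_id] != with_4[doc_id]]
print(f"{len(moved)} of {len(doc_ids)} documents would land on a different shard")
```

Because most documents would map to a different shard after the count changes, lookups by ID would go to the wrong place. That is why changing the primary count means writing the data into a new index (reindex, shrink, or split).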

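Q5. Why is deep pagination dangerous in Elasticsearch, and what are the alternatives?

The problem: for a request with from: 10000, size: 10, every shard has to sort its top 10,010 documents and send them to the coordinating node. With 3 shards, the coordinating node merges 30,030 entries just to return 10 hits. The cost grows linearly with page depth and with the number of shards, so memory and CPU are soon spent on results that are thrown away.

By default from + size may not exceed 10,000 (index.max_result_window), so the request above is rejected unless that limit is raised.

Alternatives:

Search After: fetches the next page from the sort values of the last hit on the previous page. A live, cursor-style approach.
Scroll API: iterates over a large result set against a snapshot. Suited to batch processing, but no longer recommended for deep pagination; PIT + search_after is preferred.
Point in Time (PIT): the replacement for scroll. Used together with search_after, it provides a consistent view across pages.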

// PIT + Search After 예시
POST /products/_pit?keep_alive=5m

GET /_search
{
  "pit": { "id": "PIT_ID", "keep_alive": "5m" },
  "size": 20,
  "sort": [{ "created_at": "desc" }, { "_shard_doc": "asc" }],
  "search_after": [1679616000000, 42]
}
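
Both the cost of from/size paging and the search_after alternative can be modeled in a few lines. The sketch assumes the hits are already sorted ascending by a unique sort key (a value plus a tiebreaker, like the `_shard_doc` tiebreaker above); the function names are illustrative.

```python
import bisect


def from_size_cost(from_, size, shards):
    """Sorted entries the coordinating node must merge for one from/size page."""
    return shards * (from_ + size)


def search_after_page(sort_keys, size, after=None):
    """Return the next `size` keys strictly after `after` from an ascending list."""
    start = 0 if after is None else bisect.bisect_right(sort_keys, after)
    return sort_keys[start:start + size]


print(from_size_cost(10000, 10, shards=3))   # 30030

# (sort value, tiebreaker) pairs, already in sort order
keys = [(1, 0), (1, 1), (2, 0), (3, 0), (3, 1)]

page = search_after_page(keys, size=2)
while page:
    print(page)
    page = search_after_page(keys, size=2, after=page[-1])
```

Each page costs the same regardless of depth, because the cursor skips directly to the position after the last sort key instead of re-sorting everything before it.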

Elasticsearch Complete Guide 2025: From Search Engine to Log Analytics & Vector Search

1. What is Elasticsearch

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. Since its first release by Shay Banon in 2010, it has expanded from full-text search to log analytics, metrics monitoring, security analytics, and most recently, vector search.

1.1 Why Elasticsearch

Running LIKE '%keyword%' in a traditional RDBMS requires a full table scan. With hundreds of millions of rows, response times can reach tens of seconds. Elasticsearch uses an inverted index structure to deliver millisecond-level search responses.

Key Features:

Distributed Architecture: Horizontal scaling (scale-out) handling tens of TB of data
Near Real-Time Search: Documents become searchable within ~1 second of indexing
Schemaless: Index JSON documents directly with dynamic mapping support
RESTful API: All operations via HTTP/JSON
Rich Ecosystem: Integration with Kibana, Logstash, Beats, APM

1.2 Elasticsearch vs RDBMS Comparison

Aspect	RDBMS	Elasticsearch
Data Unit	Row	Document (JSON)
Collection	Table	Index
Column	Column	Field
Schema	Fixed Schema	Dynamic Mapping
Search Method	B-Tree Index	Inverted Index
Scaling	Scale-up oriented	Scale-out (Sharding)
Transactions	ACID support	Not supported
Primary Use	OLTP	Search, Analytics, Logging

1.3 Version History

ES 1.x (2014) - Initial stabilization
ES 2.x (2015) - Pipeline Aggregation
ES 5.x (2016) - Lucene 6, Painless scripting
ES 6.x (2017) - Single type per index
ES 7.x (2019) - Mapping types deprecated (removed in 8.0), adaptive replica selection on by default
ES 8.x (2022) - Security by default, kNN vector search, NLP integration

2. Inverted Index Internals

The secret behind Elasticsearch's search speed is the inverted index. While a regular database finds words within documents (forward index), an inverted index finds documents from words.

2.1 Inverted Index Structure

Given three documents:

Doc 1: "The quick brown fox"
Doc 2: "The quick brown dog"
Doc 3: "The lazy brown fox"

The inverted index is built as follows:

Term      | Document IDs
----------|-------------
the       | [1, 2, 3]
quick     | [1, 2]
brown     | [1, 2, 3]
fox       | [1, 3]
dog       | [2]
lazy      | [3]

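Searching for "fox" jumps straight to its postings list and returns Doc 1 and Doc 3. The cost of locating a term depends mainly on the term's length rather than on the number of documents, so lookups stay close to constant time as the index grows; the remaining work scales with the number of matching documents.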
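
The table above can be reproduced with a short Python sketch. It is a toy model: tokenization is a lowercase split on whitespace, and a plain dictionary stands in for Lucene's term dictionary and postings lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "The quick brown fox",
    2: "The quick brown dog",
    3: "The lazy brown fox",
}

index = build_inverted_index(docs)
print(index["fox"])     # [1, 3]
print(index["quick"])   # [1, 2]

# An AND query is an intersection of postings lists
print(sorted(set(index["quick"]) & set(index["fox"])))   # [1]
```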

2.2 Lucene Segment Architecture

Lucene inside Elasticsearch stores data in segments:

Index
  └── Shard (Lucene Index)
        ├── Segment 0 (immutable)
        │     ├── Inverted Index
        │     ├── Stored Fields
        │     ├── Doc Values
        │     └── Term Vectors
        ├── Segment 1 (immutable)
        ├── Segment 2 (immutable)
        └── Commit Point (segments_N)

Key Points:

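Segments are immutable once written
Deleting a document does not remove it; the document is only marked as deleted in the segment's live-docs file (.liv; .del in old Lucene versions)
Segments are merged in the background, which is when deleted documents are physically dropped
Each refresh (every 1 second by default) writes a new segment and makes its documents searchable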

2.3 Doc Values and Fielddata

Inverted indices are optimized for text search but not for sorting or aggregations:

Inverted Index (search):     Term → Document IDs
Doc Values (agg/sort):       Document ID → Values

// Doc Values example
{
  "doc_1": { "price": 100, "category": "electronics" },
  "doc_2": { "price": 200, "category": "books" },
  "doc_3": { "price": 150, "category": "electronics" }
}

Doc Values: Automatically enabled for keyword, numeric, date, ip, geo_point types. Disk-based.
Fielddata: Used for sorting/aggregations on text fields. Heap memory-based, use with caution.
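
The difference in access pattern is easier to see with the same three documents laid out column by column. In this sketch each field is one array indexed by document ordinal, which is the general idea behind doc values; the helper names are invented, and the on-disk encoding is not modeled.

```python
# Column-oriented view: one array per field, indexed by document ordinal
doc_values = {
    "price": [100, 200, 150],
    "category": ["electronics", "books", "electronics"],
}


def sort_by(field, descending=False):
    """Return document ordinals ordered by the given field's values."""
    column = doc_values[field]
    return sorted(range(len(column)), key=lambda doc: column[doc], reverse=descending)


def sum_by(group_field, value_field):
    """Sum value_field per distinct value of group_field in one pass over two columns."""
    totals = {}
    for group, value in zip(doc_values[group_field], doc_values[value_field]):
        totals[group] = totals.get(group, 0) + value
    return totals


print(sort_by("price", descending=True))   # [1, 2, 0]
print(sum_by("category", "price"))         # {'electronics': 250, 'books': 200}
```

Sorting and aggregating only touch the columns they need, whereas answering the same questions from an inverted index would mean walking every term to find out which value each document holds.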

3. Mapping and Analyzers

3.1 Mapping Definition

Mapping defines the schema of an index, specifying field types and analysis methods.

PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "description": {
        "type": "text"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      },
      "location": {
        "type": "geo_point"
      },
      "tags": {
        "type": "keyword"
      },
      "metadata": {
        "type": "object"
      },
      "reviews": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "rating": { "type": "integer" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}

3.2 Core Field Types

Type	Description	Inverted Index	Doc Values
`text`	Full-text search (analyzer applied)	Yes	No (Fielddata)
`keyword`	Exact match, sorting, aggregation	Yes	Yes
`long/integer/float`	Numeric	Yes (BKD Tree)	Yes
`date`	Date/time	Yes	Yes
`boolean`	true/false	Yes	Yes
`geo_point`	Latitude/longitude	Yes	Yes
`nested`	Nested objects (independent docs)	Yes	Yes
`object`	JSON objects (flattened)	Yes	Yes
`dense_vector`	Vectors (kNN search)	No	No (separate storage)

3.3 Analyzer Pipeline

An analyzer consists of three stages:

Text Input
  │
  ▼
Character Filter
  │  - html_strip: Remove HTML tags
  │  - mapping: Character substitution
  │  - pattern_replace: Regex replacement
  ▼
Tokenizer
  │  - standard: Unicode word boundary
  │  - whitespace: Whitespace-based split
  │  - ngram: N-gram token generation
  │  - edge_ngram: Prefix-based tokens
  │  - language-specific (e.g., nori for Korean, kuromoji for Japanese)
  ▼
Token Filter
  │  - lowercase: Convert to lowercase
  │  - stop: Remove stop words
  │  - synonym: Synonym expansion
  │  - stemmer: Stem extraction
  ▼
Token Stream (Terms)
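
The three stages can be mimicked with a few functions to show how data flows between them. This is a rough sketch: the regular expressions are much cruder than the real `html_strip` filter and `standard` tokenizer, and the stop word set is a small illustrative subset.

```python
import re

STOP_WORDS = {"a", "an", "and", "is", "the", "of", "to"}  # small illustrative subset


def char_filter(text):
    """Character filter: strip HTML tags, similar in spirit to html_strip."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenizer: split into word tokens and record each token's position."""
    return [(tok, pos) for pos, tok in enumerate(re.findall(r"\w+", text))]


def token_filters(tokens):
    """Token filters: lowercase, then drop stop words while keeping original positions."""
    lowered = [(tok.lower(), pos) for tok, pos in tokens]
    return [(tok, pos) for tok, pos in lowered if tok not in STOP_WORDS]


def analyze(text):
    return token_filters(tokenize(char_filter(text)))


print(analyze("Elasticsearch is a <b>powerful</b> search engine"))
# [('elasticsearch', 0), ('powerful', 3), ('search', 4), ('engine', 5)]
```

Positions 1 and 2 are missing from the output because the stop words were removed after positions were assigned; phrase queries rely on those gaps being preserved.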

3.4 Custom Analyzer Example

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_cleaner": {
          "type": "html_strip"
        }
      },
      "tokenizer": {
        "custom_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_synonyms": {
          "type": "synonym",
          "synonyms": [
            "quick,fast,speedy",
            "big,large,huge"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_cleaner"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop", "custom_synonyms"]
        },
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "custom_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
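
To see what `autocomplete_analyzer` puts into the index, the prefix expansion of an `edge_ngram` tokenizer can be sketched in Python. The helper name `edge_ngrams` is hypothetical, and the sketch handles a single token only (the real tokenizer also splits the input on characters outside `token_chars`).

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 10) -> list[str]:
    """Return the prefixes of `token` with lengths from min_gram to max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# The lowercase token filter runs after the tokenizer
index_terms = [gram.lower() for gram in edge_ngrams("Search")]
print(index_terms)  # ['se', 'sea', 'sear', 'searc', 'search']
```

Because every prefix becomes its own term, a user typing "sea" matches with a cheap exact term lookup. The cost is a larger index, which is why such an analyzer is usually applied at index time only, with a plain analyzer on the query side.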

3.5 Analyze API

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
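  "text": "Elasticsearch is a <em>powerful</em> search engine"
}

// Response (abridged: start_offset, end_offset, and type omitted; "is" and "a" are removed as stop words, which leaves gaps in the positions)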
{
  "tokens": [
    { "token": "elasticsearch", "position": 0 },
    { "token": "powerful", "position": 3 },
    { "token": "search", "position": 4 },
    { "token": "engine", "position": 5 }
  ]
}

4. Mastering Query DSL

Elasticsearch Query DSL is a JSON-based query language. It is divided into Query Context (relevance scoring) and Filter Context (yes/no matching with caching).

4.1 Match Queries (Full-Text Search)

// Basic match - tokenized through analyzer
GET /products/_search
{
  "query": {
    "match": {
      "description": "powerful search engine"
    }
  }
}

// match_phrase - word order matters
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "search engine",
        "slop": 1
      }
    }
  }
}

// multi_match - search across multiple fields
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch guide",
      "fields": ["title^3", "description", "tags^2"],
      "type": "best_fields"
    }
  }
}

4.2 Term Queries (Exact Match)

// term - exact match on keyword fields
GET /products/_search
{
  "query": {
    "term": {
      "category": "electronics"
    }
  }
}

// terms - match any of multiple values
GET /products/_search
{
  "query": {
    "terms": {
      "status": ["published", "pending"]
    }
  }
}

// range - range queries
GET /products/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 100,
        "lte": 500
      }
    }
  }
}

// exists - field existence check
GET /products/_search
{
  "query": {
    "exists": {
      "field": "discount"
    }
  }
}

4.3 Bool Query (Compound Query)

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "search engine" } }
      ],
      "must_not": [
        { "term": { "status": "deleted" } }
      ],
      "should": [
        { "term": { "featured": true } },
        { "range": { "rating": { "gte": 4.5 } } }
      ],
      "filter": [
        { "term": { "category": "software" } },
        { "range": { "price": { "lte": 1000 } } }
      ],
      "minimum_should_match": 1
    }
  }
}

Bool Query Clauses:

Clause	Scoring	Caching	Purpose
`must`	Yes	No	Must match + score contribution
`must_not`	No	Yes	Must not match
`should`	Yes	No	Bonus if matched
`filter`	No	Yes	Must match (no scoring)
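
The clause semantics in the table can be checked against a toy in-memory model. This is a sketch under simplifying assumptions: the function `bool_query` and its flat scoring (1.0 per matching scoring clause) are invented for illustration and are unrelated to BM25.

```python
def bool_query(docs, must=(), filters=(), must_not=(), should=(), minimum_should_match=0):
    """Return (doc, score) pairs for the docs accepted by the bool clauses.

    Toy scoring: each matching must/should clause adds 1.0. The filter and
    must_not clauses only accept or reject a doc and never change its score.
    """
    hits = []
    for doc in docs:
        if not all(clause(doc) for clause in must):
            continue
        if not all(clause(doc) for clause in filters):
            continue
        if any(clause(doc) for clause in must_not):
            continue
        should_hits = sum(1 for clause in should if clause(doc))
        if should_hits < minimum_should_match:
            continue
        hits.append((doc, float(len(must) + should_hits)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


docs = [
    {"id": 1, "category": "software", "status": "active", "featured": True, "rating": 4.8},
    {"id": 2, "category": "software", "status": "deleted", "featured": True, "rating": 4.9},
    {"id": 3, "category": "software", "status": "active", "featured": False, "rating": 4.6},
    {"id": 4, "category": "hardware", "status": "active", "featured": True, "rating": 5.0},
]

hits = bool_query(
    docs,
    filters=[lambda d: d["category"] == "software"],
    must_not=[lambda d: d["status"] == "deleted"],
    should=[lambda d: d["featured"], lambda d: d["rating"] >= 4.5],
    minimum_should_match=1,
)
print([(doc["id"], score) for doc, score in hits])  # [(1, 2.0), (3, 1.0)]
```

Documents 2 and 4 are rejected by `must_not` and `filters` respectively without affecting anyone's score, while the two `should` clauses only reorder the survivors.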

4.4 Nested Query

GET /products/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "term": { "reviews.author": "john" } },
            { "range": { "reviews.rating": { "gte": 4 } } }
          ]
        }
      },
      "inner_hits": {
        "size": 3,
        "highlight": {
          "fields": {
            "reviews.comment": {}
          }
        }
      }
    }
  }
}

4.5 Function Score Query

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "elasticsearch" } },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "modifier": "log1p",
            "factor": 2
          }
        },
        {
          "gauss": {
            "created_at": {
              "origin": "now",
              "scale": "30d",
              "decay": 0.5
            }
          }
        },
        {
          "filter": { "term": { "featured": true } },
          "weight": 5
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}
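
How the three functions combine is easier to follow as arithmetic. The sketch below assumes the documented shapes of the `gauss` decay and the `log1p` modifier; the function names and the sample inputs are hypothetical.

```python
import math


def gauss_decay(distance: float, scale: float, decay: float = 0.5, offset: float = 0.0) -> float:
    """Gaussian decay: 1.0 at the origin, exactly `decay` at a distance of `scale`."""
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    d = max(0.0, abs(distance) - offset)
    return math.exp(-(d ** 2) / (2.0 * sigma_sq))


def field_value_factor_log1p(value: float, factor: float = 1.0) -> float:
    """log1p modifier: common logarithm of (1 + factor * value)."""
    return math.log10(1.0 + factor * value)


def function_score(query_score: float, popularity: float, age_days: float, featured: bool) -> float:
    """score_mode=sum over the functions, then boost_mode=multiply with the query score."""
    functions = [
        field_value_factor_log1p(popularity, factor=2),
        gauss_decay(age_days, scale=30, decay=0.5),
        5.0 if featured else 0.0,  # the weight counts only when its filter matches
    ]
    return query_score * sum(functions)


print(gauss_decay(30, scale=30))  # a 30-day-old document keeps half of the recency score
print(function_score(query_score=2.0, popularity=50, age_days=10, featured=True))
```

With `score_mode: sum` a featured document gains a flat +5 before multiplication, so the `weight` can easily dominate the other two functions. That is worth keeping in mind when tuning.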

4.6 Highlighting

GET /products/_search
{
  "query": {
    "match": { "description": "search engine" }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "description": {
        "fragment_size": 150,
        "number_of_fragments": 3
      }
    }
  }
}

5. Aggregations Deep Dive

Elasticsearch aggregations are analogous to SQL GROUP BY but far more powerful. They are classified into bucket, metric, and pipeline aggregations.

5.1 Bucket Aggregations

// terms aggregation - document count by category
GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "size": 20,
        "order": { "_count": "desc" }
      }
    }
  }
}

// date_histogram - time-based aggregation
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "logs_per_hour": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h",
        "time_zone": "UTC",
        "min_doc_count": 0
      }
    }
  }
}

// range aggregation - price brackets
GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 500 },
          { "from": 500 }
        ]
      }
    }
  }
}

// nested aggregation - average price by category
GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        },
        "price_stats": {
          "stats": { "field": "price" }
        }
      }
    }
  }
}

5.2 Metric Aggregations

GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price": { "avg": { "field": "price" } },
    "max_price": { "max": { "field": "price" } },
    "min_price": { "min": { "field": "price" } },
    "total_sales": { "sum": { "field": "sales_count" } },
    "unique_brands": { "cardinality": { "field": "brand" } },
    "price_percentiles": {
      "percentiles": {
        "field": "price",
        "percents": [25, 50, 75, 90, 99]
      }
    }
  }
}

5.3 Pipeline Aggregations

// Parent pipelines (derivative, moving_fn) are nested inside the histogram;
// sibling pipelines (avg_bucket, max_bucket) sit next to it
GET /sales/_search
{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_revenue": {
          "sum": { "field": "revenue" }
        },
        "revenue_derivative": {
          "derivative": {
            "buckets_path": "total_revenue"
          }
        },
        "moving_avg_revenue": {
          "moving_fn": {
            "buckets_path": "total_revenue",
            "window": 3,
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        }
      }
    },
    "avg_monthly_revenue": {
      "avg_bucket": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    },
    "max_monthly_revenue": {
      "max_bucket": {
        "buckets_path": "monthly_sales>total_revenue"
      }
    }
  }
}
}
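
What the pipeline aggregations compute can be reproduced on a plain list of monthly totals. A rough sketch with made-up revenue figures: it assumes, as I understand the `moving_fn` default, that the window covers the preceding buckets and excludes the current one.

```python
def derivative(values: list[float]) -> list:
    """Difference between each bucket and the previous one (None for the first bucket)."""
    return [None] + [cur - prev for prev, cur in zip(values, values[1:])]


def moving_unweighted_avg(values: list[float], window: int) -> list:
    """Mean of up to `window` preceding buckets, excluding the current bucket."""
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        out.append(sum(prior) / len(prior) if prior else None)
    return out


monthly_revenue = [100.0, 120.0, 90.0, 150.0]  # hypothetical total_revenue per month

print(derivative(monthly_revenue))                      # [None, 20.0, -30.0, 60.0]
print(moving_unweighted_avg(monthly_revenue, window=3))
print(sum(monthly_revenue) / len(monthly_revenue))      # avg_bucket equivalent: 115.0
print(max(monthly_revenue))                             # max_bucket equivalent: 150.0
```

The derivative and moving average yield one value per bucket, while the average and maximum collapse the whole series into a single number. That is the difference between parent and sibling pipeline aggregations.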

6. ELK Stack (Logstash, Kibana, Beats)

6.1 ELK Stack Architecture

Data Sources              Collection          Processing           Storage             Visualization
┌──────────┐          ┌──────────┐      ┌──────────┐      ┌──────────┐      ┌──────────┐
│  App Log │──────────│ Filebeat │──────│ Logstash │──────│  Elastic │──────│  Kibana  │
│  Server  │          │ Metricbt │      │ Pipeline │      │  search  │      │Dashboard │
│  Docker  │          │ Heartbt  │      │ (Filter) │      │ Cluster  │      │ Lens/Map │
│  K8s     │          │ Packetbt │      │          │      │          │      │ Alerting │
└──────────┘          └──────────┘      └──────────┘      └──────────┘      └──────────┘

6.2 Logstash Pipeline

# /etc/logstash/conf.d/main.conf
input {
  beats {
    port => 5044
  }
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["app-logs"]
    codec => json
  }
}

filter {
  # Grok pattern for unstructured log parsing
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
    }
  }

  # Date parsing
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }

  # GeoIP enrichment
  geoip {
    source => "client_ip"
    target => "geo"
  }

  # Conditional processing
  if [level] == "ERROR" {
    mutate {
      add_tag => ["alert"]
      add_field => { "severity" => "high" }
    }
  }

  # Remove unnecessary fields
  mutate {
    remove_field => ["agent", "ecs", "host"]
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "changeme"
  }

  # Error logs are also copied to a separate index (same credentials as above)
  if "alert" in [tags] {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      index => "alerts-%{+YYYY.MM.dd}"
      user => "elastic"
      password => "changeme"
    }
  }
}
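
The grok and conditional steps of this pipeline can be approximated with a regular expression. The regex below is a simplified stand-in for `TIMESTAMP_ISO8601`, `LOGLEVEL`, and `GREEDYDATA` (the real patterns accept more variants), and `parse_line` is a hypothetical helper, not part of Logstash.

```python
import re

# Simplified stand-in for the three grok patterns used in the filter above
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?)"
    r" (?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)"
    r" (?P<msg>.*)"
)


def parse_line(message: str) -> dict:
    """Return the extracted fields, or a failure tag when the line does not match."""
    match = LOG_LINE.match(message)
    if match is None:
        return {"message": message, "tags": ["_grokparsefailure"]}
    event = {"message": message, **match.groupdict()}
    if event["level"] == "ERROR":  # mirrors the conditional mutate block
        event["tags"] = ["alert"]
        event["severity"] = "high"
    return event


print(parse_line("2025-03-24T10:15:30Z ERROR Connection refused"))
print(parse_line("not a structured line"))
```

Lines that fail to match keep their raw `message` and gain a failure tag. It helps to route such events to a separate index so that a changed log format is noticed quickly.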

6.3 Filebeat Configuration

# /etc/filebeat/filebeat.yml
filebeat.inputs:
  # The log input is deprecated since 7.16; new setups should prefer the filestream input
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    multiline:
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after

  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]
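
The `multiline` settings above (`negate: true`, `match: after`) mean that any line which does not start with a date is appended to the previous event. A minimal Python sketch of that grouping rule, with a made-up stack trace as input:

```python
import re

START_OF_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2}")  # same pattern as in filebeat.yml


def group_multiline(lines: list[str]) -> list[str]:
    """Join continuation lines onto the preceding line that starts with a date."""
    events: list[str] = []
    for line in lines:
        if START_OF_EVENT.match(line) or not events:
            events.append(line)          # a new event begins
        else:
            events[-1] += "\n" + line    # continuation of the previous event
    return events


raw = [
    "2025-03-24 10:15:30 ERROR Unhandled exception",
    "java.lang.IllegalStateException: boom",
    "    at com.example.App.main(App.java:42)",
    "2025-03-24 10:15:31 INFO Recovered",
]
for event in group_multiline(raw):
    print(repr(event))
```

Without this grouping each stack-trace line would become its own document, and the grok pattern in Logstash would fail on every continuation line.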

6.4 Kibana Features

Discover: Log search, filtering, time-based distribution
Dashboard: Combine multiple visualizations
Lens: Drag-and-drop visualization builder
Maps: Geographic data visualization
Alerting: Condition-based alerts (Slack, Email, PagerDuty)
APM: Application Performance Monitoring
Security: SIEM capabilities, security event analysis
Dev Tools: Direct API calls from the console

7. Vector Search and kNN

Native vector search in Elasticsearch 8.x enables semantic search capabilities.

7.1 What is Vector Search

Traditional keyword search relies on exact word matching. Searching for "car" will not find "vehicle" or "automobile." Vector search converts text into high-dimensional vectors and calculates semantic similarity.

"car"       → [0.12, -0.34, 0.56, ..., 0.89]  (768 dimensions)
"vehicle"   → [0.13, -0.32, 0.55, ..., 0.88]  (similar vector)
"fruit"     → [-0.45, 0.67, -0.12, ..., 0.23] (different vector)
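
Semantic similarity here usually means cosine similarity between the vectors. A small sketch using only the four components visible above (real embeddings would have all 768 dimensions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Truncated to the components shown in the text, purely for illustration
car = [0.12, -0.34, 0.56, 0.89]
vehicle = [0.13, -0.32, 0.55, 0.88]
fruit = [-0.45, 0.67, -0.12, 0.23]

print(cosine_similarity(car, vehicle))  # close to 1: similar meaning
print(cosine_similarity(car, fruit))    # negative: unrelated meaning
```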

7.2 kNN Search Index Setup

PUT /semantic-search
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 200
        }
      }
    }
  }
}

7.3 Performing kNN Search

// Pure kNN search (query_vector is abbreviated here; it must contain 768 values to match dims)
GET /semantic-search/_search
{
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56],
    "k": 10,
    "num_candidates": 100
  }
}

// Hybrid search (keyword + kNN): a hit's score is the sum of its keyword score and its boosted kNN score
GET /semantic-search/_search
{
  "query": {
    "match": {
      "content": "car recommendation"
    }
  },
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56],
    "k": 10,
    "num_candidates": 100,
    "boost": 0.5
  },
  "size": 10
}
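
The way the hybrid request combines both sides can be written out as arithmetic. This sketch assumes the documented behaviour for `cosine` similarity, where the kNN score is `(1 + cosine) / 2` and the final score is the boosted sum of both parts; the function names are hypothetical.

```python
def knn_cosine_score(cosine: float) -> float:
    """Map a cosine similarity in [-1, 1] to a non-negative score in [0, 1]."""
    return (1.0 + cosine) / 2.0


def hybrid_score(match_score, cosine, query_boost: float = 1.0, knn_boost: float = 0.5) -> float:
    """Sum the boosted keyword score and the boosted kNN score.

    Pass None for a side that did not match the document.
    """
    total = 0.0
    if match_score is not None:
        total += query_boost * match_score
    if cosine is not None:
        total += knn_boost * knn_cosine_score(cosine)
    return total


print(hybrid_score(match_score=3.0, cosine=1.0))   # 3.5: matched by both sides
print(hybrid_score(match_score=None, cosine=0.0))  # 0.25: found by kNN only
```

BM25 scores are unbounded while the cosine-based score stays within [0, 1], so the boosts usually need tuning against real queries. Otherwise the keyword side tends to dominate.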

7.4 HNSW Algorithm

HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor (ANN) algorithm used for kNN search:

Layer 2 (sparse):  A ──── B
                   │
Layer 1 (medium):  A ── C ── B ── D
                   │    │    │    │
Layer 0 (dense):   A-E-C-F-B-G-D-H

Parameter	Description	Default	Trade-off
`m`	Connections per node	16	Higher = more accurate but more memory
`ef_construction`	Candidate list size during build	100	Higher = more accurate but slower indexing
`num_candidates` (HNSW `ef`)	Candidates examined per shard at query time	Set per query	Higher = more accurate but slower search

7.5 NLP Model Integration (Elastic ELSER)

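// Create the ELSER v2 model configuration (triggers the model download)
PUT /_ml/trained_models/.elser_model_2
{
  "input": {
    "field_names": ["text_field"]
  }
}

// Start the deployment once the download has finished
POST /_ml/trained_models/.elser_model_2/deployment/_start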

// Auto-vectorize with Ingest Pipeline
PUT /_ingest/pipeline/elser-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ]
      }
    }
  ]
}

8. Cluster Operations and Management

8.1 Cluster Architecture

┌─────────────────────────────────────────────────┐
│                  Elasticsearch Cluster            │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────┐│
│  │ Master Node │  │ Master Node │  │Master Node ││
│  │  (Elected)  │  │ (Eligible)  │  │(Eligible)  ││
│  └─────────────┘  └─────────────┘  └────────────┘│
│                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────┐│
│  │  Data Node  │  │  Data Node  │  │ Data Node  ││
│  │ (Hot Tier)  │  │ (Hot Tier)  │  │(Warm Tier) ││
│  │  SSD 1TB    │  │  SSD 1TB    │  │ HDD 4TB    ││
│  └─────────────┘  └─────────────┘  └────────────┘│
│                                                   │
│  ┌─────────────┐  ┌─────────────┐                │
│  │ Ingest Node │  │ Coord Node  │                │
│  │ (Pipeline)  │  │ (Routing)   │                │
│  └─────────────┘  └─────────────┘                │
└─────────────────────────────────────────────────┘

8.2 Shard Strategy

PUT /logs-2025.03
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "routing.allocation.require.data": "hot"
  }
}

Shard Sizing Guidelines:

Shard size: aim for 10-50GB per shard
Shards per node: no more than 20 per 1GB of heap
Total cluster shards: every shard adds cluster-state overhead on the master node, so re-check master resources with each additional 1,000 shards
Index pattern: use date-based indices for time-series data
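
These guidelines translate into simple arithmetic. A small sketch, where the 30GB target is just a midpoint of the 10-50GB range picked for illustration and both helper names are hypothetical:

```python
import math


def primary_shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Smallest number of primaries that keeps each shard at or under the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def max_shards_for_heap(heap_gb: float, shards_per_heap_gb: int = 20) -> int:
    """Upper bound on shards per node from the 20-per-1GB-of-heap guideline."""
    return int(heap_gb * shards_per_heap_gb)


print(primary_shard_count(120))  # 4 primaries for a 120GB index
print(max_shards_for_heap(16))   # 320 shards at most on a node with a 16GB heap
```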

8.3 Index Lifecycle Management (ILM)

PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

8.4 Cluster Monitoring APIs

# Cluster health
GET /_cluster/health

# Node stats
GET /_nodes/stats

# Index status
GET /_cat/indices?v&s=store.size:desc

# Shard allocation
GET /_cat/shards?v&s=store:desc

# Allocation failure explanation
GET /_cluster/allocation/explain

# Hot Threads (CPU analysis)
GET /_nodes/hot_threads

# Pending Tasks
GET /_cluster/pending_tasks

9. Performance Optimization

9.1 Indexing Performance

// Bulk API (10x+ faster than individual requests)
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Product 1", "price": 100}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Product 2", "price": 200}
{"update": {"_index": "products", "_id": "1"}}
{"doc": {"price": 150}}
{"delete": {"_index": "products", "_id": "3"}}
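
The bulk body is newline-delimited JSON: one action line, optionally followed by one source line, with a trailing newline at the end. A minimal sketch of building it with the standard library only (`build_bulk_body` is a hypothetical helper; sending the body over HTTP is left out):

```python
import json


def build_bulk_body(actions) -> str:
    """Serialize (action, metadata, source) tuples into an NDJSON bulk body."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:  # delete actions carry no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


actions = [
    ("index", {"_index": "products", "_id": "1"}, {"name": "Product 1", "price": 100}),
    ("index", {"_index": "products", "_id": "2"}, {"name": "Product 2", "price": 200}),
    ("update", {"_index": "products", "_id": "1"}, {"doc": {"price": 150}}),
    ("delete", {"_index": "products", "_id": "3"}, None),
]

print(build_bulk_body(actions), end="")
```

Each line is serialized separately because the bulk endpoint parses the body line by line. Pretty-printed JSON that spans several lines would be rejected.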

Indexing Optimization Checklist:

Item	Setting	Effect
Bulk size	5-15MB per request	Reduced network overhead
Refresh Interval	`"30s"` or `"-1"`	Fewer segment creations
Replica count	`0` during initial load	No replication overhead
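Translog	`index.translog.durability: "async"`	Reduced disk I/O (recent writes can be lost on a crash)
Mapping	`"enabled": false` on unused object fields, `"index": false` on other unused fields	Less indexing load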
ID generation	Auto-generated IDs	Skip ID duplicate check

9.2 Search Performance

// 1. Use Filter Context (cached)
GET /products/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "category": "electronics" } },
        { "range": { "price": { "gte": 100, "lte": 500 } } }
      ],
      "must": [
        { "match": { "description": "wireless" } }
      ]
    }
  }
}

// 2. Source Filtering (only needed fields)
GET /products/_search
{
  "_source": ["name", "price", "category"],
  "query": { "match_all": {} }
}

// 3. Search After (Deep Pagination alternative)
GET /products/_search
{
  "size": 20,
  "sort": [
    { "created_at": "desc" },
    { "_id": "asc" }
  ],
  "search_after": ["2025-03-01T00:00:00", "abc123"]
}
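
The cursor logic behind `search_after` can be simulated on an in-memory list. This is a sketch of the idea only: integer timestamps and the helper `search_page` are assumptions, and a real cluster applies the cursor per shard instead of sorting everything in one place.

```python
def search_page(docs, size, search_after=None):
    """Return one page sorted by (created_at desc, _id asc), starting after the cursor."""

    def sort_key(doc):
        # Negating the timestamp lets one ascending sort produce created_at desc, _id asc
        return (-doc["created_at"], doc["_id"])

    ordered = sorted(docs, key=sort_key)
    if search_after is not None:
        created_at, doc_id = search_after
        cursor = (-created_at, doc_id)
        ordered = [d for d in ordered if sort_key(d) > cursor]
    page = ordered[:size]
    next_cursor = (page[-1]["created_at"], page[-1]["_id"]) if page else None
    return page, next_cursor


docs = [
    {"_id": "a", "created_at": 3},
    {"_id": "b", "created_at": 3},
    {"_id": "c", "created_at": 2},
    {"_id": "d", "created_at": 1},
]

cursor = None
while True:
    page, cursor = search_page(docs, size=2, search_after=cursor)
    if not page:
        break
    print([d["_id"] for d in page])
```

The `_id` tiebreaker is what keeps documents with identical `created_at` values from being skipped or repeated across pages.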

9.3 Caching Strategy

Elasticsearch Cache Layers:

1. Node Query Cache (Filter Cache)
   - Caches filter context results
   - Node-level, 10% of heap (default)
   - LRU eviction

2. Shard Request Cache
   - Caches aggregation results
   - Shard-level, 1% of heap (default)
   - Invalidated on index refresh

3. Field Data Cache
   - Sort/aggregation data for text fields
   - Uses heap memory (caution needed)

4. OS Page Cache
   - Caches Lucene segment files
   - Uses off-heap memory
   - Most important cache!

9.4 Segment Merge Optimization

// Force Merge (for read-only indices)
POST /logs-2025.01/_forcemerge?max_num_segments=1

// Merge policy settings
PUT /products/_settings
{
  "index": {
    "merge": {
      "policy": {
        "max_merged_segment": "5gb",
        "segments_per_tier": 10,
        "floor_segment": "2mb"
      }
    }
  }
}

9.5 JVM and OS Level Optimization

# jvm.options
-Xms16g
-Xmx16g
# Heap should be 50% or less of physical memory, max 30.5GB (Compressed OOPs)

# OS-level settings
# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1

# /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096

10. Production Operational Patterns

10.1 Index Templates and Component Templates

// Component Template (reusable blocks)
PUT /_component_template/base-settings
{
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    }
  }
}

PUT /_component_template/log-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "trace_id": { "type": "keyword" }
      }
    }
  }
}

// Index Template (composing Component Templates)
PUT /_index_template/logs
{
  "index_patterns": ["logs-*"],
  "composed_of": ["base-settings", "log-mappings"],
  "priority": 200
}

10.2 Aliases and Reindex

// Alias for zero-downtime index swap
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

// Reindex (index migration)
POST /_reindex
{
  "source": {
    "index": "old-index",
    "query": {
      "range": {
        "@timestamp": { "gte": "2025-01-01" }
      }
    }
  },
  "dest": {
    "index": "new-index",
    "pipeline": "enrichment-pipeline"
  }
}

10.3 Snapshot and Restore

// Register repository (S3)
PUT /_snapshot/s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups",
    "region": "us-east-1",
    "base_path": "elasticsearch"
  }
}

// Create snapshot
PUT /_snapshot/s3-backup/snapshot-2025-03-24
{
  "indices": "logs-*,products",
  "ignore_unavailable": true,
  "include_global_state": false
}

// Restore
POST /_snapshot/s3-backup/snapshot-2025-03-24/_restore
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

11. Interview Quiz

Q1. Why can't you search for a document immediately after indexing it in Elasticsearch?

Elasticsearch is a Near Real-Time (NRT) search engine. When a document is indexed, it is first written to an in-memory buffer and the transaction log (translog). For the document to become searchable, a refresh operation must occur, which creates a new Lucene segment from the in-memory buffer.

The default refresh interval is 1 second, which is why it is called "Near Real-Time."

Configurable via index.refresh_interval
For immediate search, call POST /index/_refresh
During bulk indexing, set to -1 to disable, then manually refresh after completion
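
The buffer-then-refresh behaviour can be pictured with a toy model. `ToyShard` is invented for illustration and ignores the translog, flushes, and merges entirely.

```python
class ToyShard:
    """Toy model of near-real-time search: writes are buffered until a refresh."""

    def __init__(self):
        self.buffer = []    # indexed but not yet searchable
        self.segments = []  # each refresh turns the buffer into one immutable segment

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, predicate):
        return [doc for segment in self.segments for doc in segment if predicate(doc)]


shard = ToyShard()
shard.index({"id": 1, "title": "hello"})
print(shard.search(lambda d: d["id"] == 1))  # [] because no refresh has happened yet
shard.refresh()
print(shard.search(lambda d: d["id"] == 1))  # the document is now visible
```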

Q2. What is the difference between text and keyword field types?

text type:

Tokenized through an analyzer
Individual tokens stored in the inverted index
Used for full-text search (match query)
No Doc Values support (Fielddata used for sorting/aggregation, memory-intensive)

keyword type:

Stored as-is without analysis
Used for exact match searches (term query)
Optimized for sorting, aggregation, and filtering
Doc Values supported (disk-based)

Practical tip: Fields like name are commonly configured as multi-fields with both types.

"name": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword" }
  }
}

Q3. What is the difference between must and filter in a Bool Query?

Both require matching, but there is a key difference:

must:

Calculates relevance score (_score)
Results are not cached
Use when "how well does it match" matters

filter:

Does not calculate score (always 0)
Results are cached (Node Query Cache)
Use when only "does it match or not" matters
Performance benefit when the same filter is reused

Optimization principle: Move all conditions that do not need scoring to filter. Especially term, range, and exists conditions are well-suited for filter.

Q4. What are the roles of Primary and Replica Shards?

Primary Shard:

Stores the original index data
The number of primary shards cannot be changed after index creation (except via shrink/split)
All write (index) requests are processed by the Primary Shard first
Unit of data distribution

Replica Shard:

A copy of the Primary Shard
Can be dynamically adjusted
Two purposes:
1. High Availability: Replica promotes to Primary on failure
2. Search Performance: Distributes read load across replicas
Never placed on the same node as its Primary

Rule of thumb: the cluster can lose as many nodes at once as there are replicas per primary without losing data. Search throughput grows with the total number of shard copies (primary + replica), provided there are enough nodes to host them.

Q5. Why is deep pagination dangerous in Elasticsearch and what are the alternatives?

The problem with deep pagination: A request with from: 10000, size: 10 requires each shard to sort and return its top 10,010 documents to the coordinating node. With 3 shards, the coordinating node must merge 30,030 entries in memory just to return 10 hits. Resource consumption grows linearly with page depth, multiplied by the number of shards.

The default maximum for from + size is 10,000 (index.max_result_window).
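
The cost described above is a one-line formula. A small sketch with hypothetical helper names:

```python
def docs_merged_by_coordinator(from_: int, size: int, shards: int) -> int:
    """Each shard returns its top (from + size) hits; the coordinator merges all of them."""
    return (from_ + size) * shards


def within_result_window(from_: int, size: int, max_result_window: int = 10_000) -> bool:
    """True when the request respects index.max_result_window."""
    return from_ + size <= max_result_window


print(docs_merged_by_coordinator(from_=10_000, size=10, shards=3))  # 30030
print(within_result_window(from_=10_000, size=10))                  # False: rejected by default
```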

Alternatives:

Search After: Uses the last sort value from the previous page as a cursor for the next page. Real-time cursor-based.
Scroll API: Snapshot-based traversal of large datasets. Suitable for batch processing. (No longer recommended for deep pagination; PIT + search_after is preferred)
Point in Time (PIT): A replacement for Scroll. Used with search_after for consistent views.

// PIT + Search After example
POST /products/_pit?keep_alive=5m

GET /_search
{
  "pit": { "id": "PIT_ID", "keep_alive": "5m" },
  "size": 20,
  "sort": [{ "created_at": "desc" }, { "_shard_doc": "asc" }],
  "search_after": [1679616000000, 42]
}
