Prometheus Operations Guide: Metric Collection, PromQL, Alerting, Dashboards, and Best Practices

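Introduction
1. Core Concepts
2. Architecture
3. Installation and Configuration
- Full Stack Setup with Docker Compose
- prometheus.yml Configuration
4. PromQL in Practice
5. Alertmanager Alerting Setup
- Defining Alerting Rules
- Alertmanager Configuration
6. Grafana Dashboard Integration
7. Best Practices for Large-Scale Operations
8. Operational Checklist
9. Common Mistakes
10. Summary
Quiz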

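Introduction

Modern infrastructure is spread across microservices, containers, and serverless functions, with hundreds of components running at once. Answering "Is the system healthy right now?" in such an environment requires systematic metric collection and monitoring. Prometheus is a CNCF graduated project and the de facto standard monitoring system in the Kubernetes ecosystem. It provides pull-based metric collection, the PromQL query language, and a flexible alerting system.

This article covers Prometheus core concepts, architecture, installation and configuration, practical PromQL queries, Alertmanager alerting, Grafana dashboard integration, best practices for large-scale operations, an operational checklist, and common mistakes, all in one place. It is a hands-on guide aimed at production use.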

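1. Core Concepts

Pull-Based Model

Prometheus does not rely on targets pushing their metrics. Instead, the Prometheus server periodically scrapes each target's /metrics endpoint. This pull-based model has the following advantages.

Monitoring targets do not need to know about Prometheus
Combined with service discovery, it adapts naturally to dynamic environments
When a target goes down, the failed scrape surfaces it on the next scrape cycle (up == 0)

Time Series Data

Prometheus stores all data as time series. Each time series is uniquely identified by its metric name and a set of key-value label pairs.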

http_requests_total{method="GET", handler="/api/users", status="200"}

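Metric Types

Type	Description	Use Case	PromQL Usage
Counter	Monotonically increasing cumulative value	Request count, error count	`rate()`, `increase()`
Gauge	Current value that can go up and down	CPU usage, memory, temperature	Direct use, `avg_over_time()`
Histogram	Measures the distribution of values across buckets	Response time, request size	`histogram_quantile()`
Summary	Calculates quantiles on the client side	Response time (non-aggregatable)	Direct use (not recommended)

Histogram vs Summary: A Histogram lets the server compute quantiles from bucket counts, so data from multiple instances can be aggregated. A Summary computes quantiles on the client, so its quantiles cannot be aggregated across instances. In most cases, Histogram is recommended.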

2. Architecture

The Prometheus ecosystem is made up of several cooperating components.

graph TB
    subgraph Targets
        A[Application /metrics]
        B[Node Exporter]
        C[cAdvisor]
        D[Custom Exporter]
    end

    subgraph "Prometheus Server"
        E[Retrieval<br/>Scrape Engine]
        F[TSDB<br/>Time Series Database]
        G[HTTP Server<br/>PromQL API]
    end

    H[Service Discovery<br/>Kubernetes / Consul / DNS]
    I[Pushgateway<br/>For Short-lived Jobs]
    J[Alertmanager<br/>Alert Routing/Grouping]
    K[Grafana<br/>Dashboards]

    H --> E
    A --> E
    B --> E
    C --> E
    D --> E
    I --> E
    E --> F
    F --> G
    G --> K
    G --> J
    J --> L[Slack / PagerDuty / Email]

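Component	Role
Prometheus Server	Scrapes metrics, stores them in the TSDB, and runs the PromQL query engine
Exporters	Expose a target system's metrics in Prometheus format (node_exporter, mysqld_exporter, etc.)
Pushgateway	Relays metrics from targets that are hard to scrape, such as short-lived batch jobs
Alertmanager	Routes, groups, deduplicates, and inhibits alerts fired by Prometheus alerting rules
Service Discovery	Dynamically discovers scrape targets from Kubernetes, Consul, DNS, etc.

3. Installation and Configuration

Full Stack Setup with Docker Compose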

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    ports:
      - '9090:9090'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules/:/etc/prometheus/rules/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - '9093:9093'
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node-exporter
    ports:
      - '9100:9100'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

prometheus.yml Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s # Default scrape interval for all jobs
  evaluation_interval: 15s # Rule evaluation interval
  scrape_timeout: 10s # Scrape timeout

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Application (static target with extra labels)
  - job_name: 'app'
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ['app:8080']
        labels:
          env: production
          team: backend

  # Kubernetes service discovery (for reference)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

4. PromQL in Practice

rate() and irate()

# Per-second request rate averaged over 5 minutes (suitable for alerts)
sum(rate(http_requests_total[5m])) by (service)

# Instantaneous rate of change (suitable for dashboard visualization)
sum(irate(http_requests_total[$__rate_interval])) by (service)

Use rate() in alerting rules and irate() in dashboards. The range window for rate() should be at least 4 times the scrape_interval.

histogram_quantile()

# 95th percentile response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# 50th percentile (median)
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Aggregation Operators

# Top 5 services by error rate
topk(5,
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
  sum(rate(http_requests_total[5m])) by (service)
)

# Memory usage by namespace
sum(container_memory_usage_bytes{container!=""}) by (namespace)

# Nodes with CPU usage above 80% (idle fraction below 20%)
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2

# Predict available disk space 4 hours ahead with predict_linear
predict_linear(
  node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600
) < 0

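Useful Functions Reference

Function	Purpose	Example
`rate()`	Average per-second rate over the time range	`rate(http_requests_total[5m])`
`irate()`	Instantaneous rate between the last two points	`irate(http_requests_total[5m])`
`increase()`	Total increase over the time range	`increase(http_requests_total[1h])`
`histogram_quantile()`	Calculates a quantile from a histogram	`histogram_quantile(0.99, ...)`
`predict_linear()`	Predicts a future value via linear regression	`predict_linear(disk_free[6h], 3600*4)`
`absent()`	Returns 1 if the time series does not exist	`absent(up{job="app"})`
`changes()`	Number of value changes within the time range	`changes(process_start_time_seconds[1h])`

5. Alertmanager Alerting Setup

Defining Alerting Rules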

# prometheus/rules/alerts.yml
groups:
  - name: instance-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: 'Instance {{ $labels.instance }} is down'
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 3 minutes.'

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.service }} error rate is {{ $value | humanizePercentage }}'
          description: 'Service {{ $labels.service }} has a 5xx error rate above 5%.'

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Node memory usage exceeds 90%'

      - alert: DiskSpaceRunningOut
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Disk space predicted to run out within 24 hours'

      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.service }} P95 latency exceeds 1 second'

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']

6. Grafana Dashboard Integration

Data Source Provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: '15s'
      httpMethod: POST

Example Dashboard Panel

{
  "title": "Service Error Rate",
  "type": "timeseries",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
      "legendFormat": "{{ service }}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 0.01, "color": "yellow" },
          { "value": 0.05, "color": "red" }
        ]
      }
    }
  }
}

Recommended Dashboard Panels

Panel	PromQL	Purpose
Error Rate	`sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)`	Error rate trend per service
QPS	`sum(rate(http_requests_total[5m])) by (service)`	Requests per second
P95 Latency	`histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))`	95th percentile response time
CPU Usage	`1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)`	CPU usage per node
Memory Usage	`1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`	Memory usage ratio per node
Disk Free	`node_filesystem_avail_bytes{mountpoint="/"}`	Remaining capacity on the root filesystem

7. Best Practices for Large-Scale Operations

Federation

Federation is a setup in which a higher-level Prometheus scrapes metrics from multiple Prometheus servers. Each team or cluster runs its own local Prometheus, and only the metrics needed for a global view are sent up to the federating server.

# scrape_configs on the global Prometheus
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # '{job=~".+"}' would pull every series from each server; avoid it on a global server
        - '{__name__=~"job:.*"}' # Recording rule results only
    static_configs:
      - targets:
          - 'prometheus-team-a:9090'
          - 'prometheus-team-b:9090'

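Long-Term Storage: Thanos / Cortex / Mimir

Prometheus keeps 15 days of data by default, and the local TSDB is typically run with 15 to 30 days of retention. For longer retention and global querying, consider the following solutions.

Solution	Characteristics	Best For
Thanos	Sidecar pattern, object storage, global querying	Adding sidecars to existing Prometheus servers
Cortex	Multi-tenant, horizontally scalable, microservices architecture	Large-scale SaaS environments
Grafana Mimir	Cortex fork, improved performance, Grafana ecosystem integration	When using the Grafana stack
VictoriaMetrics	High performance, low resource usage, PromQL compatible	When cost optimization matters most

Cardinality Management

High cardinality (explosive growth in label value combinations) is the most common cause of Prometheus performance degradation.

# Drop unnecessary labels with relabel_configs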
relabel_configs:
  - action: labeldrop
    regex: '(pod_template_hash|controller_revision_hash)'

# Drop metrics you do not use with metric_relabel_configs (Go runtime internals here)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: '(go_gc_.*|go_memstats_.*)'
    action: drop

Cardinality inspection PromQL:

# Top 10 metrics by time series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count for a specific metric, per label combination
count(http_requests_total) by (service, method, status)

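8. Operational Checklist

Item	Recommended Setting	Notes
Retention Period	15-30 days (local TSDB)	Long-term: Thanos/Mimir with object storage
Backup	Use the TSDB snapshot API	`POST /api/v1/admin/tsdb/snapshot` (requires `--web.enable-admin-api`)
HA Setup	2 identically configured Prometheus instances + Alertmanager clustering	Alertmanager uses a gossip protocol to prevent duplicate notifications
Security	TLS + Basic Auth or OAuth2 Proxy	Configure TLS/Auth via `--web.config.file`
Resources	Roughly 2GB RAM per 1 million time series	Account for WAL and head chunks memory
Scrape Interval	15s (as configured in this guide; the built-in default is 1m), 10s for critical metrics	Too short increases load, too long reduces resolution
Alert Testing	`amtool check-config`, `promtool check rules`	Integrating into CI/CD is recommended
Recording Rules	Pre-compute frequently used queries	Improves dashboard performance; follow the `level:metric:operations` naming convention
Cardinality Monitoring	Watch the `prometheus_tsdb_head_series` trend	On a sudden increase, track down the responsible metric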

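9. Common Mistakes

Using too short a range window with rate() -- Use a window of at least 4x the scrape_interval ([1m] for a 15s interval, [2m] for 30s). With a shorter window, a single missed scrape can leave fewer than two samples in the range, and rate() then returns no result.
Using irate() in alerting rules -- irate() is an instantaneous rate and is sensitive to noise. Use rate() in alerting rules.
Defining alerts without a for clause -- With for: 0s, a single spike fires the alert. Set a for period of at least 3 to 5 minutes.
Using high-cardinality labels -- Putting unique values such as user_id, request_id, or trace_id into labels causes unbounded time series growth. Store such values in logs or traces instead.
Overusing the Summary type -- Summary quantiles cannot be aggregated across instances. Use Histogram when running multiple instances.
Overusing the Pushgateway -- The Pushgateway is meant for short-lived batch jobs only. Sending metrics from long-running services through it makes it impossible to detect when a target is down.
Ignoring Alertmanager route ordering -- Child routes are evaluated from top to bottom and the first match wins unless continue: true is set. Place the most specific matches first; alerts that match no child route fall back to the root route's receiver.
Not following the recording rule naming convention -- Follow the level:metric:operations format (e.g., job:http_requests_total:rate5m) so that rules are easy to tell apart in dashboards and alerts.
Mismatch between TSDB retention and disk capacity -- If retention is set to 90 days but the disk can only hold 30 days of data, the disk fills up and Prometheus can no longer write data or restart cleanly. Also set --storage.tsdb.retention.size to cap retention by disk usage.
Not monitoring the monitoring system itself -- Always watch Prometheus's own up, prometheus_tsdb_head_series, and prometheus_engine_query_duration_seconds.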
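
The retention mismatch described in the list above can be guarded against by setting both a time limit and a size limit. As a minimal sketch based on the Docker Compose file from section 3, the fragment below adds `--storage.tsdb.retention.size`; the 50GB figure is only an assumed example and should be set below the real volume size with some headroom.

```yaml
# docker-compose.yml fragment (sketch); only the prometheus command list is shown.
services:
  prometheus:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'   # time-based limit from the original setup
      - '--storage.tsdb.retention.size=50GB'  # assumed size cap; pick a value below the disk size
```

When both flags are set, old blocks are removed as soon as either limit is reached, so the size cap acts as a safety net if ingestion grows faster than planned.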

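10. Summary

Prometheus is the standard monitoring system for cloud-native environments. The key points are as follows.

The pull-based model, combined with service discovery, adapts naturally to dynamic environments
Of the four metric types (Counter, Gauge, Histogram, Summary), prefer Histogram for distributions
PromQL provides powerful functions such as rate(), histogram_quantile(), and predict_linear()
Alertmanager reduces alert fatigue through routing, grouping, inhibition, and deduplication
Grafana integration provides dashboards for error rate, QPS, latency, and resource usage
At scale, use Thanos/Mimir for long-term storage, Federation for a global view, and cardinality management to maintain performance
Review the operational checklist regularly and avoid the common mistakes

Adapting the configurations and queries in this article to your own environment gives you a solid base for a stable, scalable monitoring system.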

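Quiz

Q1: What are the main reasons Prometheus uses a pull-based model rather than a push-based one?

In a pull-based model, monitoring targets do not need to know about Prometheus, and the model combines naturally with service discovery. A target that goes down also shows up as a failed scrape on the next scrape cycle. This suits cloud-native environments that scale dynamically.

Q2: Why apply rate() to a Counter instead of using its raw value?

A Counter is a monotonically increasing cumulative value, so its raw value only grows over time and says little on its own. rate() computes the per-second rate of change, which gives the current throughput (for example, requests per second). rate() also handles counter resets (such as restarts) automatically.

Q3: What is the biggest difference between Histogram and Summary, and why is Histogram recommended?

With a Histogram, quantiles are computed on the server side (in Prometheus) by histogram_quantile(), so data from multiple instances can be aggregated. A Summary pre-computes quantiles on the client side, so aggregating them across instances is mathematically invalid. Because running multiple instances is the norm in microservice environments, Histogram is recommended.

Q4: Why is it a problem to use irate() instead of rate() in alerting rules?

irate() computes the instantaneous rate between the two most recent data points, so it reacts strongly to short spikes. In an alert, this makes transient noise fire the alert, which increases false positives and causes alert fatigue. rate() averages over the whole specified range, so it is stable against momentary fluctuations.

Q5: What do Alertmanager's group_by, group_wait, and group_interval each do?

group_by specifies the labels used to group alerts (for example, alertname and service). group_wait is the initial wait (for example, 30 seconds) that lets alerts in the same group accumulate before the first notification. group_interval is how long to wait before sending an updated notification when new alerts join a group that has already been notified. Tuning these settings together helps prevent alert storms.

Q6: Why is high cardinality dangerous for Prometheus, and how can it be handled?

When label value combinations multiply, the number of time series surges, and Prometheus memory (head chunks) and disk usage grow with it. Countermeasures include keeping unique values such as user_id out of labels, dropping unneeded metrics with metric_relabel_configs, and monitoring the series count with the prometheus_tsdb_head_series metric.

Q7: What are the main differences between Thanos and Grafana Mimir?

Thanos adds a Sidecar to existing Prometheus servers and uploads blocks to object storage to provide long-term storage and global querying. Grafana Mimir (a Cortex fork) is a horizontally scalable microservices architecture that receives data via remote_write; its strengths are multi-tenancy and integration with the Grafana ecosystem. Thanos fits when you want minimal changes to existing Prometheus servers, and Mimir fits when you already use the Grafana stack.

Q8: In what situations is predict_linear() useful, and how is it used in alerts?

predict_linear() performs a linear regression over past data to predict the value at a future point in time. It is useful for catching gradual problems, such as running out of disk space, before they happen. For example, predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0 can send an early alert meaning "based on the trend of the last 6 hours, the disk will be full in 24 hours."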

Prometheus Complete Guide — Metrics, PromQL, Alerting, Dashboards, and Best Practices

Introduction
1. Core Concepts
2. Architecture
3. Installation and Configuration
- Full Stack Setup with Docker Compose
- prometheus.yml Configuration
4. PromQL in Practice
5. Alertmanager Alerting Setup
- Defining Alerting Rules
- Alertmanager Configuration
6. Grafana Dashboard Integration
7. Best Practices for Large-Scale Operations
8. Operational Checklist
9. Common Mistakes
10. Summary
Quiz

Introduction

Modern infrastructure is distributed across microservices, containers, and serverless functions, with hundreds of components operating simultaneously. To answer the question "Is the system healthy right now?" in such an environment, systematic metric collection and monitoring are essential. Prometheus is a CNCF graduated project and the de facto standard monitoring system for the Kubernetes ecosystem, providing pull-based metric collection, a powerful query language called PromQL, and a flexible alerting system.

This article covers everything from Prometheus core concepts to architecture, installation and configuration, practical PromQL queries, Alertmanager alerting setup, Grafana dashboard integration, best practices for large-scale operations, an operational checklist, and common mistakes — all in a single comprehensive guide. This is a production-oriented, hands-on resource that you can apply immediately.

1. Core Concepts

Pull-Based Model

Unlike push-based monitoring systems where targets send metrics to a central server, Prometheus uses a pull-based model where the Prometheus server periodically scrapes each target's /metrics endpoint. The advantages of this approach include:

Monitoring targets do not need to know about Prometheus
It naturally adapts to dynamic environments when combined with service discovery
When a target goes down, the failed scrape surfaces it on the next scrape cycle (up == 0)

Time Series Data

Prometheus stores all data as time series. Each time series is uniquely identified by a combination of a metric name and key-value label pairs.

http_requests_total{method="GET", handler="/api/users", status="200"}

Metric Types

Type	Description	Use Case	PromQL Usage
Counter	Monotonically increasing cumulative value	Request count, error count	`rate()`, `increase()`
Gauge	Value that can go up and down	CPU usage, memory, temperature	Direct use, `avg_over_time()`
Histogram	Measures distribution of values across buckets	Response time, request size	`histogram_quantile()`
Summary	Calculates quantiles on the client side	Response time (non-aggregatable)	Direct use (not recommended)

Histogram vs Summary: Histograms allow server-side quantile calculation, making aggregation across multiple instances possible. Summaries calculate quantiles on the client side, making aggregation mathematically impossible. In most cases, Histogram is recommended.

2. Architecture

The Prometheus ecosystem is made up of several cooperating components.

graph TB
    subgraph Targets
        A[Application /metrics]
        B[Node Exporter]
        C[cAdvisor]
        D[Custom Exporter]
    end

    subgraph "Prometheus Server"
        E[Retrieval<br/>Scrape Engine]
        F[TSDB<br/>Time Series Database]
        G[HTTP Server<br/>PromQL API]
    end

    H[Service Discovery<br/>Kubernetes / Consul / DNS]
    I[Pushgateway<br/>For Short-lived Jobs]
    J[Alertmanager<br/>Alert Routing/Grouping]
    K[Grafana<br/>Dashboards]

    H --> E
    A --> E
    B --> E
    C --> E
    D --> E
    I --> E
    E --> F
    F --> G
    G --> K
    G --> J
    J --> L[Slack / PagerDuty / Email]

Component	Role
Prometheus Server	Scrapes metrics, stores in TSDB, runs PromQL query engine
Exporters	Expose target system metrics in Prometheus format (node_exporter, mysqld_exporter, etc.)
Pushgateway	Relays metrics from short-lived batch jobs that are difficult to scrape
Alertmanager	Routes, groups, deduplicates, and inhibits alerts fired by Prometheus alerting rules
Service Discovery	Dynamically discovers scrape targets from Kubernetes, Consul, DNS, etc.

3. Installation and Configuration

Full Stack Setup with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    ports:
      - '9090:9090'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules/:/etc/prometheus/rules/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - '9093:9093'
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node-exporter
    ports:
      - '9100:9100'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

prometheus.yml Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s # Default scrape interval
  evaluation_interval: 15s # Alert rule evaluation interval
  scrape_timeout: 10s # Scrape timeout

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Application (static target with extra labels)
  - job_name: 'app'
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ['app:8080']
        labels:
          env: production
          team: backend

  # Kubernetes Service Discovery (reference)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
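
The `kubernetes-pods` job above keeps only pods annotated with `prometheus.io/scrape: "true"` and takes the metrics path from the `prometheus.io/path` annotation. As a minimal sketch, a pod that opts in could look like the following. The pod name, image, and port are hypothetical. Note that the relabel rules above do not read a `prometheus.io/port` annotation, so targets come from the container ports the pod declares.

```yaml
# Hypothetical pod manifest; names, image, and port are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # matched by the `keep` rule (regex: true)
    prometheus.io/path: "/metrics"  # copied into __metrics_path__
spec:
  containers:
    - name: demo-app
      image: example.com/demo-app:1.0.0
      ports:
        - containerPort: 8080       # one scrape target is generated per declared port
```

Annotation values must be quoted strings, since Kubernetes rejects a bare boolean `true` as an annotation value.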

4. PromQL in Practice

rate() and irate()

# Per-second request rate over 5 minutes (suitable for alerts)
sum(rate(http_requests_total[5m])) by (service)

# Instantaneous rate of change (suitable for dashboard visualization)
sum(irate(http_requests_total[$__rate_interval])) by (service)

Use rate() for alerting rules and irate() for dashboards. The range window for rate() should be at least 4 times the scrape_interval.

histogram_quantile()

# 95th percentile response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# 50th percentile (median)
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Aggregation Operators

# Top 5 services by error rate
topk(5,
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
  sum(rate(http_requests_total[5m])) by (service)
)

# Memory usage by namespace
sum(container_memory_usage_bytes{container!=""}) by (namespace)

# Nodes with CPU usage above 80% (idle fraction below 20%)
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2

# Predict disk space 4 hours from now using predict_linear
predict_linear(
  node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600
) < 0

Useful Functions Reference

Function	Purpose	Example
`rate()`	Average per-second rate over time range	`rate(http_requests_total[5m])`
`irate()`	Instantaneous rate between last two points	`irate(http_requests_total[5m])`
`increase()`	Total increase over time range	`increase(http_requests_total[1h])`
`histogram_quantile()`	Calculate quantile from histogram	`histogram_quantile(0.99, ...)`
`predict_linear()`	Predict future value via linear regression	`predict_linear(disk_free[6h], 3600*4)`
`absent()`	Returns 1 if time series is missing	`absent(up{job="app"})`
`changes()`	Number of value changes in time range	`changes(process_start_time_seconds[1h])`

5. Alertmanager Alerting Setup

Defining Alerting Rules

# prometheus/rules/alerts.yml
groups:
  - name: instance-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: 'Instance {{ $labels.instance }} is down'
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 3 minutes.'

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.service }} error rate is {{ $value | humanizePercentage }}'
          description: 'Service {{ $labels.service }} has a 5xx error rate exceeding 5%.'

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Node memory usage exceeds 90%'

      - alert: DiskSpaceRunningOut
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Disk space predicted to run out within 24 hours'

      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.service }} P95 latency exceeds 1 second'
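
Alerting rules like these can be exercised before deployment with `promtool test rules`. The sketch below is a unit test for the `InstanceDown` rule only. It assumes the test file (hypothetically named `alerts_test.yml`) sits next to `alerts.yml`, and the `app:8080` target is just sample input.

```yaml
# prometheus/rules/alerts_test.yml (hypothetical file name)
# Run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: up for 2 minutes, then down from t=2m onward.
      - series: 'up{job="app", instance="app:8080"}'
        values: '1 1 0 0 0 0 0'
    alert_rule_test:
      # Down since t=2m and `for: 3m`, so the alert is firing by t=6m.
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: app
              instance: app:8080
            exp_annotations:
              summary: 'Instance app:8080 is down'
              description: 'app:8080 of job app has been unreachable for more than 3 minutes.'
```

The expected annotations must match the rendered templates exactly, so they need updating whenever the rule's annotation text changes.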

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
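
In the routing tree above, child routes are evaluated in order and an alert stops at the first route it matches. If critical alerts should page and also be posted to Slack, the first route needs `continue: true`. The fragment below is a sketch of that variation, not part of the original setup. It reuses the receiver names defined above and uses the `matchers` syntax, which newer Alertmanager releases provide as the replacement for `match`.

```yaml
# alertmanager.yml fragment (sketch): page on critical AND post to Slack.
route:
  receiver: 'default-slack'          # fallback when no child route matches
  group_by: ['alertname', 'service']
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'pagerduty-critical'
      continue: true                 # keep evaluating the sibling routes below
    - matchers:
        - severity =~ "critical|warning"
      receiver: 'slack-warnings'
```

Without `continue: true`, a critical alert would stop at the first route and never reach the Slack receiver.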

6. Grafana Dashboard Integration

Data Source Provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: '15s'
      httpMethod: POST

Example Dashboard Panel

{
  "title": "Service Error Rate",
  "type": "timeseries",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
      "legendFormat": "{{ service }}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 0.01, "color": "yellow" },
          { "value": 0.05, "color": "red" }
        ]
      }
    }
  }
}
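
A panel like this lives inside the `panels` array of a dashboard JSON file, and Grafana can load such files at startup through a dashboard provider. The following is a minimal sketch of a provider definition. The file name and the `json` subdirectory are assumptions, chosen so that they fall under the `./grafana/provisioning/` mount from the Compose file.

```yaml
# grafana/provisioning/dashboards/dashboards.yml (hypothetical file name)
apiVersion: 1

providers:
  - name: default
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30   # how often Grafana rescans the directory
    options:
      # Assumed location inside the container; put exported dashboard JSON files here.
      path: /etc/grafana/provisioning/dashboards/json
```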

Recommended Dashboard Panels

Panel	PromQL	Purpose
Error Rate	`sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)`	Error rate trend per service
QPS	`sum(rate(http_requests_total[5m])) by (service)`	Queries per second
P95 Latency	`histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))`	95th percentile response time
CPU Usage	`1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)`	CPU usage per node
Memory Usage	`1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`	Memory usage per node
Disk Free	`node_filesystem_avail_bytes{mountpoint="/"}`	Root filesystem remaining capacity

7. Best Practices for Large-Scale Operations

Federation

Federation is a pattern where a global Prometheus server scrapes selected metrics from multiple local Prometheus servers. Each team or cluster runs its own local Prometheus, and only metrics needed for a global view are federated upward.

# Global Prometheus scrape_configs
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # '{job=~".+"}' would pull every series from each server; avoid it on a global server
        - '{__name__=~"job:.*"}' # Recording rule results only
    static_configs:
      - targets:
          - 'prometheus-team-a:9090'
          - 'prometheus-team-b:9090'
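
The `job:.*` selector only returns data if the local Prometheus servers define recording rules whose names start with `job:`. A minimal sketch of such a rule file is shown below. The file name and group name are hypothetical, and the expressions reuse the `http_requests_total` metric from earlier sections.

```yaml
# prometheus/rules/recording.yml (hypothetical file on each local Prometheus)
groups:
  - name: http-aggregations
    interval: 30s
    rules:
      # level:metric:operations naming, aggregated to the job level
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```

Because the rules aggregate away instance-level labels, the global server receives a handful of series per job instead of every raw series.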

Long-Term Storage: Thanos / Cortex / Mimir

Prometheus keeps 15 days of data by default, and the local TSDB is typically run with 15 to 30 days of retention. For longer retention and global querying, consider the following solutions.

Solution	Characteristics	Best For
Thanos	Sidecar pattern, object storage, global querying	Adding sidecars to existing Prometheus
Cortex	Multi-tenant, horizontally scalable, microservices architecture	Large-scale SaaS environments
Grafana Mimir	Cortex fork, improved performance, Grafana ecosystem integration	When using the Grafana stack
VictoriaMetrics	High performance, low resource usage, PromQL compatible	When cost optimization is critical
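
Cortex, Mimir, and VictoriaMetrics typically receive data through Prometheus `remote_write`, while a Thanos Sidecar reads TSDB blocks directly and needs no such setting. As a hedged sketch, a `remote_write` section for a Mimir-style backend might look like this. The URL and tenant ID are placeholders, not values from this guide.

```yaml
# prometheus.yml fragment (sketch); endpoint and tenant are placeholders.
remote_write:
  - url: http://mimir.example.internal/api/v1/push
    headers:
      X-Scope-OrgID: team-a   # tenant header, only relevant when multi-tenancy is enabled
```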

Cardinality Management

High cardinality (explosive growth of label value combinations) is the most common cause of Prometheus performance degradation.

# Drop unnecessary labels with relabel_configs
relabel_configs:
  - action: labeldrop
    regex: '(pod_template_hash|controller_revision_hash)'

# Drop metrics you do not use with metric_relabel_configs (Go runtime internals here)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: '(go_gc_.*|go_memstats_.*)'
    action: drop
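
Relabeling removes series that are already known to be a problem. As an additional guard against unexpected growth, a scrape job can also set per-scrape limits. The values in this sketch are assumptions and should be sized from the target's normal output.

```yaml
# prometheus.yml fragment (sketch); limit values are illustrative.
scrape_configs:
  - job_name: 'app'
    sample_limit: 10000   # the whole scrape fails if more samples remain after metric relabeling
    label_limit: 30       # the whole scrape fails if any sample carries more labels than this
    static_configs:
      - targets: ['app:8080']
```

A scrape that exceeds a limit is treated as failed, so the target reports `up == 0` instead of silently flooding the TSDB.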

Cardinality inspection PromQL:

# Top 10 metrics by time series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count for a specific metric, per label combination
count(http_requests_total) by (service, method, status)
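
These queries are useful for ad hoc inspection. To catch growth without someone looking, the head series count can also be alerted on. The rule below is a sketch: the alert name, the 50% threshold, and the one-day comparison window are all assumptions to tune.

```yaml
# prometheus/rules/self-monitoring.yml (hypothetical file name)
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusSeriesGrowth
        # Compare the current head series count with the value one day earlier.
        expr: |
          prometheus_tsdb_head_series
            > 1.5 * (prometheus_tsdb_head_series offset 1d)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Head series on {{ $labels.instance }} grew by more than 50% in a day'
```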

8. Operational Checklist

Item	Recommended Setting	Notes
Retention Period	15-30 days (local TSDB)	Long-term: use Thanos/Mimir with object storage
Backup	Use TSDB snapshot API	`POST /api/v1/admin/tsdb/snapshot` (requires `--web.enable-admin-api`)
HA Setup	2 identical Prometheus instances + Alertmanager clustering	Alertmanager uses gossip protocol to prevent duplicate alerts
Security	TLS + Basic Auth or OAuth2 Proxy	Configure via `--web.config.file` for TLS/Auth
Resources	~2GB RAM per 1 million time series	Account for WAL and head chunks memory
Scrape Interval	15s (common choice; the built-in default is 1m), 10s for critical metrics	Too short increases load, too long reduces resolution
Alert Testing	`amtool check-config`, `promtool check rules`	Integrate into CI/CD pipeline
Recording Rules	Pre-compute frequently used queries	Improves dashboard performance; name rules as `level:metric:operations`
Cardinality Monitoring	Watch `prometheus_tsdb_head_series` trend	Investigate root cause metric on sudden increase
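
As an illustration of the Security row, a file passed via `--web.config.file` might look like the following. The paths, user name, and hash are placeholders; the password value must be a bcrypt hash you generate yourself.

```yaml
# web-config.yml, passed as --web.config.file=web-config.yml
tls_server_config:
  cert_file: /etc/prometheus/tls/server.crt   # placeholder path
  key_file: /etc/prometheus/tls/server.key    # placeholder path

basic_auth_users:
  # user name: bcrypt hash of the password (never the plain-text password)
  admin: '$2y$10$REPLACE_WITH_YOUR_BCRYPT_HASH'
```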

9. Common Mistakes

Using too short a range window with rate() -- Use at least 4x the scrape_interval (e.g., [1m] for a 15s interval, [2m] for 30s). rate() needs at least two samples inside the window, so a window that is too short returns no data whenever a scrape is late or missed.
Using irate() in alerting rules -- irate() calculates the instantaneous rate and is sensitive to noise. Always use rate() for alerting rules.
Defining alerts without a for clause -- Without for (the default is 0s), a single spike fires the alert immediately. Set for to at least 3-5 minutes for most alerts.
Using high-cardinality labels -- Putting unique values like user_id, request_id, or trace_id in labels causes unbounded time series growth. Store such values in logs or traces instead.
Overusing the Summary metric type -- Summary quantiles cannot be aggregated across instances. Use Histogram when running multiple instances.
Overusing the Pushgateway -- Pushgateway is designed for short-lived batch jobs only. Routing long-running service metrics through Pushgateway makes it impossible to detect target downtime.
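Ignoring Alertmanager routing priority -- Child routes are evaluated from top to bottom, and the first match wins unless continue: true is set. Place more specific matchers first; alerts that match no child route fall back to the receiver on the root route.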
Not following Recording Rules naming convention -- Follow the level:metric:operations format (e.g., job:http_requests_total:rate5m) for easy identification in dashboards and alerts.
Mismatch between TSDB retention period and disk capacity -- Setting retention to 90 days when the disk can only hold 30 days of data will fill the disk, at which point Prometheus can no longer write new data and may crash or lose samples. Set --storage.tsdb.retention.size alongside the time-based retention to enforce a size-based limit.
Not monitoring the monitoring system itself -- Always monitor Prometheus's own up, prometheus_tsdb_head_series, and prometheus_engine_query_duration_seconds metrics.
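
Several of the points above can be combined in one rule file. The following is a sketch, assuming a 15s scrape interval; the alert names, thresholds, and the `status` label matcher are illustrative choices rather than recommendations.

```yaml
groups:
  - name: example_alerts   # hypothetical group name
    rules:
      # rate() over 5m (well above 4x a 15s scrape interval), plus a for clause
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical

      # Any scrape target failing continuously for 3 minutes
      - alert: TargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: warning

      # Self-monitoring: head series above an assumed capacity threshold
      - alert: PrometheusHighSeriesCount
        expr: prometheus_tsdb_head_series > 2000000
        for: 15m
        labels:
          severity: warning
```

A Prometheus server cannot alert on its own outage, so the `up` check for Prometheus itself is best evaluated by a second instance.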

10. Summary

Prometheus is the de facto standard monitoring system for cloud-native environments. Here are the key takeaways:

The pull-based model naturally adapts to dynamic environments and integrates with service discovery
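Among the 4 metric types (Counter, Gauge, Histogram, Summary), prefer Histogram over Summary for distributions such as latency, because Histogram buckets can be aggregated across instances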
PromQL provides powerful functions such as rate(), histogram_quantile(), and predict_linear()
Alertmanager reduces alert fatigue through routing, grouping, inhibition, and deduplication
Grafana integration enables dashboards for error rates, QPS, latency, and resource usage
For large-scale environments, use Thanos/Mimir for long-term storage, Federation for global views, and cardinality management for sustained performance
Regularly review the operational checklist and avoid common mistakes

By adapting the configurations and queries covered in this guide to your specific environment, you can build a stable and scalable monitoring system.

Quiz

Q1: What is the main reason Prometheus uses a pull-based model instead of a push-based model?

In the pull-based model, monitoring targets do not need to know about Prometheus, and it integrates naturally with service discovery. Additionally, when a target goes down, the next scrape fails and the target's up metric drops to 0, so the outage is visible without any extra instrumentation. This makes it particularly well-suited for dynamically scaling cloud-native environments.

Q2: Why should you apply rate() to Counter-type metrics instead of using the raw value?

A Counter is a monotonically increasing cumulative value, so looking at the raw value just shows an ever-growing number that provides little meaningful insight. Applying rate() calculates the per-second rate of change, allowing you to understand current throughput (e.g., requests per second). Additionally, rate() automatically handles counter resets caused by restarts.

Q3: What is the key difference between Histogram and Summary, and why is Histogram recommended?

Histogram allows server-side quantile calculation using the histogram_quantile() function, making it possible to aggregate data across multiple instances. Summary calculates quantiles on the client side, making cross-instance aggregation mathematically impossible. Since running multiple instances is standard in microservices environments, Histogram is recommended.

Q4: Why is using irate() instead of rate() problematic in alerting rules?

irate() calculates the instantaneous rate between the two most recent data points, making it highly sensitive to short-lived spikes. When used in alerting, it triggers on transient noise, causing a surge in false positives and alert fatigue. rate() averages over the entire specified time range, making it resilient to temporary fluctuations.

Q5: What are the roles of group_by, group_wait, and group_interval in Alertmanager?

group_by specifies the labels used to group alerts together (e.g., alertname, service). group_wait is the initial waiting period for alerts in the same group to accumulate (e.g., 30 seconds). group_interval is the interval for resending when new alerts join an already-notified group. Properly combining these settings prevents alert storms.

Q6: Why is high cardinality dangerous for Prometheus, and how can it be mitigated?

When label value combinations grow explosively, the number of time series surges, because the series count is the product of the distinct values of each label. Prometheus memory (head chunks) and disk usage grow with the series count, so a single unbounded label can exhaust a server. Mitigation strategies include avoiding unique values like user_id in labels, using metric_relabel_configs to drop unnecessary metrics, and monitoring the prometheus_tsdb_head_series metric to track time series count.

Q7: What are the main differences between Thanos and Grafana Mimir?

Thanos uses a sidecar pattern added to existing Prometheus instances, uploading blocks to object storage for long-term retention and global querying. Grafana Mimir (a Cortex fork) is a self-contained, horizontally scalable microservices architecture that receives data via remote_write, with strengths in multi-tenancy and Grafana ecosystem integration. If you want minimal changes to existing Prometheus, choose Thanos; if you use the Grafana stack, Mimir is the better fit.

Q8: When is the predict_linear() function useful, and how is it applied in alerting?

predict_linear() performs linear regression based on historical data to predict values at a future point in time. It is useful for proactively detecting gradual issues like disk space exhaustion. For example, predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0 means "based on the trend over the last 6 hours, the disk will be full within 24 hours," allowing a preemptive alert to be sent.
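
Wrapped in an alerting rule, that expression might look like the sketch below. The group name, alert name, `for` duration, and the `fstype` filter are assumptions added for illustration; `node_filesystem_avail_bytes` is exposed by Node Exporter.

```yaml
groups:
  - name: capacity_alerts   # hypothetical group name
    rules:
      - alert: DiskWillFillIn24Hours
        # Extrapolate the last 6h of free space 24h ahead; fire if the line crosses zero
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24 hours"
```

The `for: 30m` clause keeps a short burst of writes from triggering the alert on a trend that does not persist.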