## Introduction
Prometheus is a CNCF graduated project and the de facto standard metrics collection system in the Kubernetes ecosystem. Running a few rate() queries on a small cluster is not hard, but in a production environment handling tens of millions of time series across hundreds of microservices, PromQL query design and optimization determine system stability. Dashboards that take 30 seconds to load, delayed alert evaluation that slows down fault detection, and Prometheus servers whose CPU and memory are saturated by query load are problems every organization beyond a certain size eventually faces.
This article covers the correct use of advanced PromQL query patterns (rate, histogram_quantile, predict_linear, subqueries), query performance optimization with Recording Rules, naming convention design, SLI definition and calculation, an SLO-based Multi-Window Multi-Burn-Rate alerting system, Alertmanager routing and grouping strategies, handling high cardinality, real troubleshooting cases, and an operational checklist.
## PromQL advanced query patterns
Beyond the basic rate() and sum() by (...), these are the advanced query patterns you actually need in a production environment.
### Correct choice between rate() and irate()
rate() calculates the average per-second rate of change over the specified time range; irate() calculates the instantaneous rate of change between the two most recent data points. Alerting rules must use rate(). irate() responds sharply to spikes, which makes it useful for dashboard visualization, but in alerting rules it fires on a brief moment of noise and rapidly increases false positives.

```promql
# Alerting rules - always use rate()
# rate() averages over the whole range, so it is stable against momentary spikes
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# Dashboard visualization - irate() for real-time responsiveness
# $__rate_interval is computed automatically by Grafana from the scrape interval
sum(irate(http_requests_total{status=~"5.."}[$__rate_interval])) by (service)
/
sum(irate(http_requests_total[$__rate_interval])) by (service)

# Choosing the range window for rate()
# - At least 4x the scrape_interval (with a 15s interval, at least [1m])
# - For alerting: [5m] or wider recommended (tolerant of missing data points)
# - Too wide delays detection; too narrow increases noise
```
### In-depth use of histogram_quantile()
Histogram-based percentile calculation is the core of SLI definition. Note that histogram_quantile() performs linear interpolation between bucket boundaries, so bucket design directly affects query accuracy.

```promql
# P99 response time per service
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# P95 response time per endpoint (one more grouping label)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, endpoint)
)

# Apdex score (satisfied: <= 0.5s, tolerated: <= 2s)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (service)
  +
  sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m])) by (service)
)
/ 2
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)

# Native histograms (Prometheus 2.53+)
# Bucket boundaries are managed automatically, improving both accuracy and efficiency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds[5m])) by (le, service))
```

If the `by (le)` clause is omitted inside histogram_quantile(), the le label is aggregated away and the result is meaningless. This is one of the most common PromQL mistakes, so watch out for it.
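To make the interpolation concrete, here is a minimal Python sketch of the idea (not Prometheus's actual implementation; the bucket bounds and counts are invented for illustration):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets, the way
    histogram_quantile() does: find the bucket containing the target rank,
    then interpolate linearly within it.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with the +Inf bucket. (The real implementation treats the
    lowest bucket and edge cases slightly differently.)
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into +Inf
            # linear interpolation inside the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical buckets for http_request_duration_seconds
buckets = [(0.1, 500), (0.3, 800), (1.0, 950), (float("inf"), 1000)]
p90 = histogram_quantile(0.90, buckets)  # interpolated inside the (0.3, 1.0] bucket
p99 = histogram_quantile(0.99, buckets)  # falls in +Inf, clamped to 1.0
```

Because the P99 rank here lands in the +Inf bucket, the result is clamped to the highest finite bound, which is exactly why bucket boundaries must bracket the latencies you care about.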
### Capacity prediction using predict_linear()
predict_linear() applies simple linear regression to a time series to predict its value at a future point in time. It is the key function for detecting resource exhaustion in advance: disk space, memory, certificate expiry.

```promql
# Alert if the disk is predicted to run out within 24 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0

# PVC capacity prediction (Kubernetes)
predict_linear(
  kubelet_volume_stats_available_bytes[12h], 7 * 24 * 3600
) < 0

# Certificate expiry (cert-manager)
# Alert if less than 30 days (2592000 seconds) remain
(x509_cert_not_after - time()) < 2592000

# Prometheus TSDB storage growth prediction
predict_linear(prometheus_tsdb_storage_blocks_bytes[7d], 30 * 24 * 3600)
  > prometheus_tsdb_retention_limit_bytes * 0.9
```

If the input range window of predict_linear() is too short, the prediction is sensitive to noise; if it is too long, it fails to reflect recent changes. A common rule of thumb is to use a range window of at least half the forecast horizon: to predict 24 hours ahead, look at no less than 12 hours of data.
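As a rough illustration of what predict_linear() computes, here is a Python sketch of a least-squares linear fit extrapolated to a future offset (the disk samples are invented):

```python
def predict_linear(samples, seconds_ahead):
    """samples: list of (timestamp_seconds, value). Fit y = a*t + b by least
    squares and extrapolate to last timestamp + seconds_ahead, mirroring
    what PromQL's predict_linear() does over a range vector."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    t_future = samples[-1][0] + seconds_ahead
    return slope * t_future + intercept

# Free disk bytes shrinking by ~1 GB per hour (hypothetical 6h of samples)
samples = [(h * 3600, 20e9 - h * 1e9) for h in range(7)]
exhausted = predict_linear(samples, 24 * 3600) < 0  # True: runs out within 24h
```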
### Subquery and advanced time manipulation
A subquery applies a time range to the result of a range vector expression. Subqueries are powerful for complex time series analysis, but their query cost is high, so pre-calculating with Recording Rules is preferable.

```promql
# Subquery: maximum 5-minute error rate over the last hour
max_over_time(
  (
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service)
  )[1h:1m]
)

# Week-over-week traffic comparison using offset
sum(rate(http_requests_total[5m])) by (service)
/
sum(rate(http_requests_total[5m] offset 7d)) by (service)

# Label rewriting with label_replace()
label_replace(
  up{job="prometheus"},
  "cluster", "$1", "instance", "(.+)\\.example\\.com:.*"
)

# Detect a scrape outage with absent()
absent(up{job="payment-service"} == 1)
```

## Recording Rules Design Principles
Recording Rules are a mechanism that pre-calculates frequently used PromQL expressions and stores the results as new time series. They provide three benefits at once: faster dashboard loading, stable alert evaluation, and lower Prometheus server load.
### When do you need Recording Rules?
Clear criteria are needed for when Recording Rules should be introduced.
1. **The same query is used in three or more places.** Evaluating the same rate() expression repeatedly across dashboard panels, alerting rules, and other recording rules is a waste of resources.
2. **Query execution time exceeds 2 seconds.** You can check query execution time on the Prometheus `/api/v1/query` endpoint. Anything over 2 seconds causes perceptible delay in dashboard loading and alert evaluation.
3. **High cardinality metrics are being aggregated.** Instead of aggregating hundreds of thousands of time series at query time, pre-aggregating them with Recording Rules dramatically reduces the query-time load.
4. **Complex expressions are used in alerting rules.** Running heavy queries every alert evaluation cycle (default 1 minute) is a direct threat to Prometheus server stability.
### Recording Rules vs Raw Queries performance comparison
| Criterion | Raw query (computed at query time) | Recording Rule (pre-computed) |
| ------------------- | --------------------------- | ------------------------------- |
| Dashboard load time | 2-30 s (proportional to series count) | 50-200 ms (single series lookup) |
| Prometheus CPU load | Scans all matching series per query | Computed once per evaluation cycle (30s-1m) |
| Alert evaluation stability | Evaluation can lag behind slow queries | Stable; reads pre-computed values |
| Storage cost | No additional cost | Small increase for the new series |
| Flexibility | Queries can be changed on the fly | Rule changes take time to propagate |
| High cardinality handling | Recomputes every series each time | Efficient; stores only aggregates |
| Best suited for | Exploratory queries, ad-hoc analysis | Dashboards, alerts, SLI/SLO calculation |
### Basic Recording Rules configuration

```yaml
# prometheus/recording_rules.yaml
groups:
  # Group names are logical units (1-3 groups per file is reasonable)
  - name: http_request_rates
    # interval: evaluation interval (defaults to global.evaluation_interval)
    # 30s or less is recommended for SLI Recording Rules
    interval: 30s
    rules:
      # Total requests per second, per service
      - record: service:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (service)
      # Error requests per second, per service
      - record: service:http_requests_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      # Error rate per service
      - record: service:http_error_rate:ratio_rate5m
        expr: |
          service:http_requests_errors:rate5m
          /
          service:http_requests:rate5m
      # P99 response time per service
      - record: service:http_request_duration_seconds:p99_rate5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
      # P95 response time per service
      - record: service:http_request_duration_seconds:p95_rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
```

Recording Rules can reference one another hierarchically. In the example above, `service:http_error_rate:ratio_rate5m` references two lower-level Recording Rules. This way, the base rate() calculation runs only once and is reused by multiple higher-level rules.
## Naming convention: level:metric:operations

The recommended naming for Recording Rules follows the `level:metric:operations` pattern from the official Prometheus documentation. This convention lets you understand what a rule means from its name alone.
### Naming rule structure

```
level:metric:operations

level      - aggregation level (which label dimensions remain)
metric     - the original metric name
operations - the operations applied (rate, ratio, p99, ...)
```

### Practical naming examples

```yaml
# level = "job" (job 레이블 기준 집계)
# metric = "http_requests" (원본: http_requests_total)
# operation = "rate5m" (5분 rate 적용)
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
# level = "cluster" (클러스터 전체 집계, 레이블 없음)
# metric = "http_requests"
# operation = "rate5m"
- record: cluster:http_requests:rate5m
expr: sum(rate(http_requests_total[5m]))
# level = "service" + "endpoint"
# metric = "http_request_duration_seconds"
# operation = "p99_rate5m"
- record: service_endpoint:http_request_duration_seconds:p99_rate5m
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, endpoint)
)
# SLI용 Recording Rule
# level = "service"
# metric = "sli_availability"
# operation = "ratio_rate5m"
- record: service:sli_availability:ratio_rate5m
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# 복합 연산: ratio(비율)를 나타내는 Recording Rule
# "ratio"가 operation에 포함되면 0-1 범위의 비율 값임을 명시
- record: instance:node_cpu:utilization_ratio
expr: |
1 - avg without(cpu, mode)(
rate(node_cpu_seconds_total{mode="idle"}[5m])
)
```

**Notes on naming:**

- The level indicates which label dimensions remain: an aggregation `by (service)` gets the prefix `service:`, and `by (job, instance)` gets `job_instance:`.
- Keep type suffixes such as `_total`, `_bytes`, `_seconds` in the metric part, but a Counter's `_total` may be dropped once rate() has been applied.
- The operations part spells out the operations applied: `rate5m`, `ratio`, `p99`, and so on.
- The colon (`:`) is reserved for Recording Rules and never appears in original metric names.
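The convention is easy to lint in CI; here is a small Python check, with a regex that is my own approximation of the convention rather than an official definition:

```python
import re

# level:metric:operations - exactly two colons as separators,
# each part in snake_case (an approximation of the convention)
RULE_NAME = re.compile(r"^[a-z][a-z0-9_]*:[a-z][a-z0-9_]*:[a-z][a-z0-9_]*$")

def check_rule_name(name):
    """Return True if a Recording Rule name matches level:metric:operations."""
    return bool(RULE_NAME.fullmatch(name))

print(check_rule_name("job:http_requests:rate5m"))               # True
print(check_rule_name("service:sli_availability:ratio_rate5m"))  # True
print(check_rule_name("http_requests_total"))                    # False (no colons)
```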
## SLI definition and calculation
An SLI (Service Level Indicator) quantitatively measures service quality. It is expressed as the proportion of good events and takes a value between 0 and 1. If the SLI definition is inaccurate, the entire SLO and alerting system built on top of it is meaningless, so this is the step worth investing the most time in.
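The arithmetic behind "proportion of good events" and the error budget derived from it can be sketched in a few lines (the request counts are hypothetical):

```python
def sli(good_events, total_events):
    """SLI = proportion of good events, a value in [0, 1]."""
    return good_events / total_events

def error_budget_remaining(sli_value, slo=0.999):
    """Fraction of the error budget left: 1 - actual_error_rate / allowed_error_rate.
    Mirrors the service:error_budget_remaining:ratio expression shown below."""
    return 1 - (1 - sli_value) / (1 - slo)

# 10 million requests over 30 days, 5,000 of them bad (hypothetical numbers)
s = sli(10_000_000 - 5_000, 10_000_000)   # SLI = 0.9995
print(error_budget_remaining(s))          # ~0.5: half the monthly budget consumed
```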
### PromQL definitions by SLI type

```yaml
# prometheus/sli_recording_rules.yaml
groups:
  - name: sli_definitions
    interval: 30s
    rules:
      # === Availability SLI ===
      # "Good request" = any response that is not a 5xx
      # 4xx are client errors, so they are not counted against server availability
      - record: service:sli_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      # === Latency SLI ===
      # "Good request" = answered within 300ms
      - record: service:sli_latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)
      # === Combined SLI ===
      # "Good request" = not a 5xx AND answered within 300ms
      # The intersection of the availability and latency SLIs
      - record: service:sli_combined:ratio_rate5m
        expr: |
          min without() (
            service:sli_availability:ratio_rate5m,
            service:sli_latency:ratio_rate5m
          )
      # === 30-day rolling window SLI ===
      # Long-term SLI used as the basis for error budget calculation
      - record: service:sli_availability:ratio_rate30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d])) by (service)
          /
          sum(rate(http_requests_total[30d])) by (service)
      # === Remaining error budget (fraction) ===
      # Fraction of the error budget left, against a 99.9% SLO
      - record: service:error_budget_remaining:ratio
        expr: |
          1 - (
            (1 - service:sli_availability:ratio_rate30d)
            /
            (1 - 0.999)
          )
```

### Common SLI definition mistakes and fixes
1. **Including health check traffic in the SLI.** Kubernetes liveness/readiness probe requests are not user traffic and must be excluded. Apply a filter such as `http_requests_total{handler!="/healthz", handler!="/readyz"}`.
2. **Counting internal retries as success.** What the user experiences is the outcome of the original request. A success after three server-side retries is a "slow success" from the user's perspective.
3. **Classifying all 4xx as errors.** 404 Not Found is a response from a server that is working correctly, and the same goes for 422 validation errors. The status that may point to a server-side availability problem is 429 Too Many Requests, and even that is normal when rate limiting is intentional.
4. **Using average response time as the SLI.** The average hides tail latency. A service with a P50 of 50ms may have a P99 of 5 seconds. Define the SLI as "the proportion of responses within the threshold."
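A tiny synthetic example of how the average hides tail latency: 98 fast requests and 2 very slow ones produce a healthy-looking mean while the P99 is catastrophic.

```python
latencies = [0.05] * 98 + [5.0] * 2   # 98 requests at 50ms, 2 at 5s

mean = sum(latencies) / len(latencies)
# crude rank-based P99: the 99th value out of 100, after sorting
p99 = sorted(latencies)[int(0.99 * len(latencies)) - 1]

print(f"mean={mean:.3f}s p99={p99}s")  # mean ~0.149s looks fine, P99 is 5s
```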
## SLO-based error budget alerts

An SLO (Service Level Objective) is the target level for an SLI. "The availability SLI must stay above 99.9% over 30 days" is a typical SLO. SLO-based alerts are more precise than traditional threshold alerts and tie directly to business impact.
### Multi-Window Multi-Burn-Rate alerts

This alerting method, recommended by the Google SRE Workbook, decides whether to alert based on how quickly the error budget would be exhausted if the current error rate continued. The burn rate is the multiple of the sustainable error budget consumption rate.
| Burn Rate | Budget exhausted in (30-day budget) | Long Window | Short Window | Severity | Meaning |
| --------- | ----------------------------------- | ----------- | ------------ | -------- | ------- |
| 14.4 | 2.08 days | 1h | 5m | critical | Acute outage; respond immediately |
| 6 | 5 days | 6h | 30m | warning | Serious degradation |
| 2 | 15 days | 3d | 6h | info | Slow quality decline; weekly review |
| 1 | 30 days | 30d | 3d | ticket | Nominal burn rate; keep monitoring |
The point of multiple windows is false-positive prevention. With only the long window, you get paged for problems that have already resolved. The short window, added as an AND condition, confirms that the problem is still happening.
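The thresholds and exhaustion times in the table follow directly from the SLO arithmetic; a quick sketch in Python (30-day window, 99.9% SLO, as in the table):

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO          # 0.001 -> 43.2 minutes of full outage per 30 days
WINDOW_DAYS = 30

def alert_threshold(burn_rate):
    """Error-rate threshold for a burn-rate alert: burn_rate * (1 - SLO)."""
    return burn_rate * ERROR_BUDGET

def exhaustion_days(burn_rate):
    """Days until the 30-day budget is gone at this constant burn rate."""
    return WINDOW_DAYS / burn_rate

budget_minutes = WINDOW_DAYS * 24 * 60 * ERROR_BUDGET  # 43.2 minutes

print(alert_threshold(14.4))   # ~0.0144 -> the "> (14.4 * 0.001)" in the rules below
print(exhaustion_days(14.4))   # ~2.08 days
print(exhaustion_days(6))      # 5 days
```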
### Implementing the alerting rules

```yaml
# prometheus/slo_alerting_rules.yaml
groups:
  - name: slo_burn_rate_alerts
    rules:
      # ============================================================
      # SLO: availability 99.9% (30-day window)
      # Error budget = 0.1% = 43.2 minutes per 30 days
      # ============================================================
      # --- Critical: burn rate 14.4, 1h/5m windows ---
      # At this rate the monthly error budget is gone in about 2 days
      - alert: SLOAvailabilityBurnRateCritical
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
                 / sum(rate(http_requests_total[1h])) by (service))
          ) > (14.4 * 0.001)
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
                 / sum(rate(http_requests_total[5m])) by (service))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo_type: availability
          burn_rate: '14.4'
          alert_window: '1h/5m'
        annotations:
          summary: >-
            {{ $labels.service }}: availability SLO burn rate critical (14.4x)
          description: >-
            Service {{ $labels.service }} is burning error budget at 14.4x
            the rate allowed by the 99.9% SLO. The entire monthly budget
            will be exhausted in about 2 days. Investigate immediately.
          runbook_url: https://wiki.internal/runbook/slo-critical
          dashboard_url: >-
            https://grafana.internal/d/slo-overview?var-service={{ $labels.service }}
      # --- Warning: burn rate 6, 6h/30m windows ---
      - alert: SLOAvailabilityBurnRateHigh
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
                 / sum(rate(http_requests_total[6h])) by (service))
          ) > (6 * 0.001)
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
                 / sum(rate(http_requests_total[30m])) by (service))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: warning
          slo_type: availability
          burn_rate: '6'
          alert_window: '6h/30m'
        annotations:
          summary: >-
            {{ $labels.service }}: availability SLO burn rate high (6x)
          description: >-
            Service {{ $labels.service }} is burning error budget at 6x
            the rate allowed by the 99.9% SLO. The entire budget will be
            exhausted in about 5 days.
      # --- Info: burn rate 2, 3d/6h windows ---
      - alert: SLOAvailabilityBurnRateSlow
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[3d])) by (service)
                 / sum(rate(http_requests_total[3d])) by (service))
          ) > (2 * 0.001)
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
                 / sum(rate(http_requests_total[6h])) by (service))
          ) > (2 * 0.001)
        for: 30m
        labels:
          severity: info
          slo_type: availability
          burn_rate: '2'
          alert_window: '3d/6h'
        annotations:
          summary: >-
            {{ $labels.service }}: availability SLO burn rate elevated (2x)
          description: >-
            Service {{ $labels.service }} is burning error budget at 2x
            the rate allowed by the SLO. The budget will be exhausted in
            about 15 days; check in the weekly review.
      # ============================================================
      # SLO: latency (P99 < 300ms) 99.9% (30-day window)
      # ============================================================
      - alert: SLOLatencyBurnRateCritical
        expr: |
          (
            1 - (sum(rate(http_request_duration_seconds_bucket{le="0.3"}[1h])) by (service)
                 / sum(rate(http_request_duration_seconds_count[1h])) by (service))
          ) > (14.4 * 0.001)
          and
          (
            1 - (sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
                 / sum(rate(http_request_duration_seconds_count[5m])) by (service))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo_type: latency
          burn_rate: '14.4'
        annotations:
          summary: >-
            {{ $labels.service }}: latency SLO burn rate critical (14.4x)
          description: >-
            The share of requests from service {{ $labels.service }}
            violating the 300ms SLO is burning error budget at 14.4x
            the allowed rate.
```

### Optimizing alerting rules with Recording Rules

The alerting rules above repeat the same rate() calculations many times. Pre-computing the intermediate results with Recording Rules significantly reduces Prometheus load.

```yaml
# prometheus/slo_recording_rules.yaml
groups:
  - name: slo_error_rates
    interval: 30s
    rules:
      # Pre-compute the error rate for each window as a Recording Rule
      - record: service:http_error_rate:ratio_rate5m
        expr: |
          1 - (sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
               / sum(rate(http_requests_total[5m])) by (service))
      - record: service:http_error_rate:ratio_rate30m
        expr: |
          1 - (sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
               / sum(rate(http_requests_total[30m])) by (service))
      - record: service:http_error_rate:ratio_rate1h
        expr: |
          1 - (sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
               / sum(rate(http_requests_total[1h])) by (service))
      - record: service:http_error_rate:ratio_rate6h
        expr: |
          1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
               / sum(rate(http_requests_total[6h])) by (service))
      - record: service:http_error_rate:ratio_rate3d
        expr: |
          1 - (sum(rate(http_requests_total{status!~"5.."}[3d])) by (service)
               / sum(rate(http_requests_total[3d])) by (service))

  - name: slo_alerts_optimized
    rules:
      # Reference the Recording Rules to keep the alerting rules concise and efficient
      - alert: SLOAvailabilityBurnRateCritical
        expr: |
          service:http_error_rate:ratio_rate1h > (14.4 * 0.001)
          and
          service:http_error_rate:ratio_rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo_type: availability
          burn_rate: '14.4'
      - alert: SLOAvailabilityBurnRateHigh
        expr: |
          service:http_error_rate:ratio_rate6h > (6 * 0.001)
          and
          service:http_error_rate:ratio_rate30m > (6 * 0.001)
        for: 5m
        labels:
          severity: warning
          slo_type: availability
          burn_rate: '6'
```

## Alertmanager routing and grouping
Alertmanager receives alerts from Prometheus, performs deduplication, grouping, routing, and silencing, then delivers them to the appropriate receivers. In an SLO-based alerting system, precise routing by severity and service ownership is key.
### Alertmanager configuration

```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/XXXX'

# Incoming alerts traverse the routing tree and go to the matching receiver
route:
  # Default grouping: bundle by service and SLO type
  group_by: ['service', 'slo_type', 'alertname']
  # Wait before the first notification (collects alerts of the same group)
  group_wait: 30s
  # Interval before re-sending when new alerts join an existing group
  group_interval: 5m
  # Interval for repeating an unchanged alert
  repeat_interval: 4h
  # Default receiver
  receiver: 'slack-default'
  routes:
    # Critical: PagerDuty + Slack (immediate response required)
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h
      continue: true  # keep evaluating the following routes as well
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 10s
    # Warning: dedicated Slack channel (handle during working hours)
    - match:
        severity: warning
      receiver: 'slack-warning'
      repeat_interval: 8h
    # Info: collected into a weekly digest (no immediate response needed)
    - match:
        severity: info
      receiver: 'slack-info'
      repeat_interval: 24h
    # Route specific services to the owning team's channel
    - match_re:
        service: 'payment-.*'
      receiver: 'slack-payment-team'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-payment'

# Inhibition: suppress lower-severity alerts while a higher-severity alert is active
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['service', 'slo_type']
  - source_match:
      severity: critical
    target_match:
      severity: info
    equal: ['service', 'slo_type']

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        severity: critical
        description: '{{ .GroupLabels.service }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          dashboard: '{{ (index .Alerts 0).Annotations.dashboard_url }}'
          runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .GroupLabels.service }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
  - name: 'slack-warning'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true
  - name: 'slack-info'
    slack_configs:
      - channel: '#alerts-info'
        send_resolved: false
  - name: 'slack-default'
    slack_configs:
      - channel: '#alerts-default'
  - name: 'slack-payment-team'
    slack_configs:
      - channel: '#team-payment-alerts'
  - name: 'pagerduty-payment'
    pagerduty_configs:
      - service_key: '<PAYMENT_PAGERDUTY_KEY>'
```

### The core of the grouping strategy

If `group_by` includes too many labels, alerts are scattered across too many messages and the overall picture is hard to see; with too few, unrelated alerts are lumped into one message and readability suffers. The combination `['service', 'slo_type', 'alertname']` is appropriate for most environments. `continue: true` is used when one alert must reach multiple receivers; the typical example is a critical alert that must go to both PagerDuty and Slack.
## High cardinality response
High cardinality is the most frequently encountered cause of performance problems in Prometheus operations. When the number of label combinations for one metric exceeds tens of thousands, the memory and CPU of the Prometheus server increase rapidly.
### Cardinality diagnostics

```promql
# TSDB state - top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))

# Per-label cardinality of a specific metric
count(http_requests_total) by (method)
count(http_requests_total) by (status)
count(http_requests_total) by (handler)

# Trend of the total series count (a sudden surge is a warning sign)
prometheus_tsdb_head_series

# Memory usage trend
process_resident_memory_bytes{job="prometheus"}
```

### Typical high-cardinality labels

Labels to avoid: `user_id`, `request_id`, `trace_id`, `session_id`, `ip_address`, `url_path` (full, un-normalized paths). Their sets of distinct values are effectively unbounded, so they are unsuitable as metric labels. Such data belongs in a tracing system (Jaeger, Tempo) or a logging system (Loki) instead.
### Reducing cardinality with Recording Rules

```yaml
# Aggregate high-cardinality metrics with Recording Rules to cut cardinality
groups:
  - name: cardinality_reduction
    interval: 1m
    rules:
      # Aggregate thousands of per-handler series down to the service level
      # Dropping the handler label cuts cardinality to roughly 1/100
      - record: service:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (service, method, status)
      # Aggregate per-endpoint histograms down to the service level
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
```

The role of Recording Rules here is not to reduce Prometheus's own memory usage (the original series are still scraped) but to reduce the computational load at query time. To reduce the cardinality of the original metric, drop unnecessary labels with relabel_configs at scrape time, or clean up the labels in application code.
## Performance optimization
### Prometheus server tuning

```yaml
# prometheus.yml - performance-related settings
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  # Log slow queries with query_log_file to identify optimization targets
  query_log_file: /prometheus/query.log

# Load the Recording Rules files
rule_files:
  - /etc/prometheus/recording_rules/*.yaml
  - /etc/prometheus/alerting_rules/*.yaml

# Query engine settings (Prometheus 2.x command-line flags)
# --query.max-concurrency=20          limit on concurrent queries
# --query.timeout=2m                  per-query timeout
# --query.max-samples=50000000        max samples per query
# --storage.tsdb.retention.time=30d   data retention period
# --storage.tsdb.retention.size=100GB data retention size limit
```

### Optimizing Grafana dashboards with Recording Rules

The gap between dashboard panels that use Recording Rules and those that run raw queries widens dramatically as the number of panels grows.

```json
{
  "dashboard": {
    "title": "SLO Overview Dashboard",
    "panels": [
      {
        "title": "Service Availability (SLI)",
        "type": "gauge",
        "targets": [
          {
            "expr": "service:sli_availability:ratio_rate30d",
            "legendFormat": "{{ service }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 0.995, "color": "yellow" },
                { "value": 0.999, "color": "green" }
              ]
            },
            "min": 0.99,
            "max": 1,
            "unit": "percentunit"
          }
        }
      },
      {
        "title": "Error Budget Remaining",
        "type": "timeseries",
        "targets": [
          {
            "expr": "service:error_budget_remaining:ratio * 100",
            "legendFormat": "{{ service }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 20, "color": "orange" },
                { "value": 50, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency by Service",
        "type": "timeseries",
        "targets": [
          {
            "expr": "service:http_request_duration_seconds:p99_rate5m",
            "legendFormat": "{{ service }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s"
          }
        }
      }
    ]
  }
}
```

## Troubleshooting
### Symptom 1: Recording Rules evaluation is delayed or fails

The Prometheus log repeatedly prints `rule group took longer than evaluation interval`.

**Cause:** The Recording Rules group's evaluation takes longer than its evaluation_interval (15 seconds with the global setting above) because the group holds too many rules or individual rules run heavy queries.
**Diagnosis:**

```promql
# Check Recording Rules evaluation latency
prometheus_rule_group_duration_seconds{rule_group="slo_error_rates"}

# Identify groups whose evaluation time exceeds the interval
prometheus_rule_group_duration_seconds > 15

# Evaluation failures per rule
prometheus_rule_evaluation_failures_total

# Last evaluation duration
prometheus_rule_group_last_duration_seconds
```

**Solution:**

1. Move heavy rules into a separate group and increase that group's interval (30s or 1m).
2. If a single rule touches too many time series, create lower-level Recording Rules first and compute hierarchically.
3. Optimize the rule's query; for example, replace a generic selector like `{__name__=~".+"}` with a specific metric name.
### Symptom 2: An alert does not fire

The SLI has fallen below the SLO, but no burn-rate alert fires.

**Cause checklist:**

1. **Check the `for` duration.** `for: 5m` means the alert moves to FIRING only after the condition has held for 5 consecutive minutes. A short spike can resolve while still PENDING.
2. **Check the burn-rate threshold.** For a 99.9% SLO, the burn rate 14.4 threshold is `14.4 * 0.001 = 0.0144`. An error rate close to but below this value will not fire.
3. **Check that the time series exists.** For a service with no traffic at all, the error-rate expression evaluates to NaN (0/0), so the comparison never matches.
4. **Check the Alertmanager connection.** Verify that the `alerting.alertmanagers` section of the Prometheus configuration is correct and that Alertmanager itself is healthy.

```promql
# Alerts currently in PENDING state
ALERTS{alertstate="pending"}

# Alerts currently in FIRING state
ALERTS{alertstate="firing"}

# Failed deliveries to Alertmanager
prometheus_notifications_errors_total
prometheus_notifications_dropped_total
```

### Symptom 3: Prometheus memory usage spikes

**Cause:** The number of time series jumps after new Recording Rules are added. Recording Rules create new time series, so memory grows with the number of rules and the cardinality of their results.
**Diagnosis:**

```promql
# Trend of the total series count
prometheus_tsdb_head_series

# Number of series created by Recording Rules
count({__name__=~".+:.+"})

# Memory usage in GiB
process_resident_memory_bytes{job="prometheus"} / 1024 / 1024 / 1024
```

**Solution:** Estimate the number of result series of each Recording Rule in advance. A rule that does `sum() by (service)` over 50 services creates 50 new series; rules for each of 5 windows make 250. Predict the scale and drop unnecessary breakdowns.
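That estimate is simple multiplication; a small hypothetical Python helper makes the rule of thumb explicit:

```python
def recording_rule_series(label_cardinalities, windows=1):
    """Estimate the number of series a Recording Rule family creates:
    the product of the cardinalities of the labels kept in `by (...)`,
    times one rule per range window."""
    total = windows
    for n in label_cardinalities.values():
        total *= n
    return total

# sum() by (service) over 50 services, one rule for each of 5 windows
print(recording_rule_series({"service": 50}, windows=5))  # 250

# keeping method and status as well multiplies the result
print(recording_rule_series({"service": 50, "method": 5, "status": 8}))  # 2000
```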
## Operational precautions

### Recording Rules deployment procedure
Because changes to Recording Rules have a direct impact on the Prometheus server, they must go through the same level of review process as code changes.
1. **Validate PromQL syntax.** Catch syntax errors early with `promtool check rules recording_rules.yaml`.
2. **Run unit tests.** promtool supports unit tests for Recording Rules and alerting rules.
3. **Estimate the result cardinality.** Calculate in advance how many time series the new rule will create and check the Prometheus server's available memory.
4. **Canary deployment.** If possible, apply the change to a replica Prometheus instance first to gauge evaluation time and resource impact.

```bash
# Validate rule file syntax
promtool check rules /etc/prometheus/recording_rules/slo_rules.yaml

# Run unit tests
promtool test rules /etc/prometheus/tests/slo_rules_test.yaml

# Validate the whole configuration
promtool check config /etc/prometheus/prometheus.yml
```

### promtool unit test example

```yaml
# tests/slo_rules_test.yaml
rule_files:
  - ../recording_rules/slo_rules.yaml
  - ../alerting_rules/slo_alerts.yaml

evaluation_interval: 1m

tests:
  # Verify the Recording Rule's output value
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+100x10'  # increases by 100 per minute for 10 minutes
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+1x10'    # increases by 1 per minute for 10 minutes
    promql_expr_test:
      - expr: service:sli_availability:ratio_rate5m{service="api"}
        eval_time: 10m
        exp_samples:
          - labels: 'service:sli_availability:ratio_rate5m{service="api"}'
            value: 0.9901  # approximately 100/(100+1)
    alert_rule_test:
      - eval_time: 10m
        alertname: SLOAvailabilityBurnRateCritical
        exp_alerts: []  # this error rate does not fire the critical alert
```

## Failure cases and recovery
### Case 1: Prometheus crash from a Recording Rule circular reference
**Situation:** Recording Rule A referred to Rule B, and Rule B referred back to Rule A. Prometheus 2.x did not explicitly reject this circular reference; evaluation fell into an infinite loop and the server crashed with an OOM.
**Lesson:** Dependencies between Recording Rules must form a DAG (Directed Acyclic Graph). `promtool check rules` does not detect circular references, so you must check dependencies manually during code review or verify the DAG with an automated script.
**Recovery Procedure:**
1. Restart Prometheus in safe mode (without the rule files).
2. Identify and remove the circular reference.
3. Re-verify the syntax with `promtool check rules` and restore the rule files.
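The automated DAG check mentioned in the lesson above can be sketched in a few lines. This assumes the rule expressions have already been reduced to a map from each record name to the record names it references; extracting those references from PromQL is the harder part and is omitted here:

```python
# Detect circular references among Recording Rules via depth-first search.
# Input: {record_name: set of record names its expression references}.
# Returns the cycle as a list (e.g. ["a", "b", "a"]) or None if acyclic.
def find_cycle(deps):
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {name: WHITE for name in deps}
    path = []

    def visit(name):
        color[name] = GRAY
        path.append(name)
        for ref in deps.get(name, ()):
            if ref not in color:
                continue  # reference to a raw metric, not a rule
            if color[ref] == GRAY:  # back edge: we closed a loop
                return path[path.index(ref):] + [ref]
            if color[ref] == WHITE:
                cycle = visit(ref)
                if cycle:
                    return cycle
        color[name] = BLACK
        path.pop()
        return None

    for name in deps:
        if color[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None
```

Running a script like this in CI before `promtool check rules` closes the gap left by promtool's missing cycle detection.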
### Case 2: TSDB WAL explosion due to excessive Recording Rules
**Situation:** After 2,000 Recording Rules were added at once, each rule created an average of 500 time series, for a total of 1 million new series. TSDB's WAL (Write-Ahead Log) grew rapidly, compaction could not keep up, and the disk filled up.
**Lesson:** When adding Recording Rules in bulk, apply them incrementally in batches. Limit each batch to 100 rules or fewer, and after applying each batch, check `prometheus_tsdb_head_series` and disk usage before proceeding.
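The staged rollout described in the lesson can be sketched as follows. `apply_batch` and `head_series` are placeholders: the former would regenerate and reload the rule files, the latter would query `prometheus_tsdb_head_series` via the Prometheus HTTP API. The growth threshold is illustrative:

```python
# Roll out Recording Rules in batches, gating each batch on the growth
# of the TSDB head-series count.
def staged_rollout(rules, apply_batch, head_series,
                   batch_size=100, max_growth=1.2):
    baseline = head_series()
    for i in range(0, len(rules), batch_size):
        apply_batch(rules[i:i + batch_size])
        current = head_series()
        # Halt if head series grew more than max_growth since the last batch.
        if current > baseline * max_growth:
            raise RuntimeError(
                f"head series grew from {baseline} to {current}; halt rollout")
        baseline = current
```

In practice you would also wait for at least one evaluation interval between batches so the new series are actually created before the check runs.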
### Case 3: Alert fatigue from SLO alert false positives
**Situation:** Because the burn rate alert had no `for` duration set (the default is 0), an alert fired on every momentary error spike. With more than 50 alerts per day, on-call engineers began ignoring them all, delaying the response when a real failure occurred.
**Lesson:** Set at least `for: 1m` on critical alerts and `for: 5m` on warnings. The short window in Multi-Window alerting already helps suppress false positives, but a `for` duration must still be set as an additional safeguard.
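A corrected rule might look like the sketch below. It assumes a 99.9% availability SLO (error budget 0.001) and a companion 1h recording rule alongside the 5m rule shown earlier; the 14.4 factor is the fast-burn threshold from the Google SRE Workbook, and the runbook URL is a placeholder:

```yaml
- alert: SLOAvailabilityBurnRateCritical
  expr: |
    (1 - service:sli_availability:ratio_rate5m) > (14.4 * 0.001)
    and
    (1 - service:sli_availability:ratio_rate1h) > (14.4 * 0.001)
  for: 1m            # safeguard against momentary spikes
  labels:
    severity: critical
  annotations:
    runbook_url: https://runbooks.example.com/slo-burn-rate
```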
## Checklist
### Recording Rules Design Checklist
- [ ] Does it comply with the naming convention (level:metric:operations)?
- [ ] If the same rate() calculation is repeated in three or more places, was it extracted as a Recording Rule?
- [ ] Does the dependency between Recording Rules form a DAG without circular references?
- [ ] Has the number of time series to be generated by the new rule been estimated in advance?
- [ ] Are all rule evaluations completed within the evaluation_interval?
- [ ] Does it pass grammar verification with promtool check rules?
- [ ] Did you write a Unit Test?
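For the naming-convention item above, a rule following the `level:metric:operations` pattern might look like this (the group name and expression are illustrative):

```yaml
groups:
  - name: sli_rules
    interval: 1m
    rules:
      # level: service aggregation, metric: http_requests, operation: rate5m
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
```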
### SLI/SLO Checklist
- [ ] Is SLI defined from the user perspective (rather than the server perspective)?
- [ ] Is health check traffic excluded from SLI calculation?
- [ ] Is the SLO goal realistic compared to the current service level?
- [ ] Is there a recording rule that calculates the remaining Error Budget?
- [ ] Is Multi-Window Multi-Burn-Rate alerting configured (at least 2 tiers)?
- [ ] Do alerts include runbook_url and dashboard_url annotations?
- [ ] Is the Error Budget Policy documented and agreed upon by the team?
### Alertmanager Checklist
- [ ] Is routing configured by severity (critical, warning, info)?
- [ ] Are critical alerts routed to immediate paging channels such as PagerDuty?
- [ ] Are lower-severity duplicate alerts suppressed by inhibition rules?
- [ ] Is the group_by setting appropriate (neither too fine-grained nor too coarse)?
- [ ] Is repeat_interval differentiated by severity?
- [ ] Is routing configured per service-owning team?
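A skeleton Alertmanager configuration covering these items might look like the following; the receiver names and intervals are illustrative, not recommendations:

```yaml
route:
  group_by: ['alertname', 'service']
  receiver: default
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty
      repeat_interval: 30m
    - matchers: ['severity="warning"']
      receiver: slack
      repeat_interval: 4h
inhibit_rules:
  # Suppress warning duplicates while the matching critical alert fires
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['alertname', 'service']
receivers:
  - name: default
  - name: pagerduty
  - name: slack
```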
### Operations Checklist
- [ ] Is Prometheus' query log enabled to monitor slow queries?
- [ ] Is prometheus_rule_group_duration_seconds monitored?
- [ ] Are you monitoring for spikes in prometheus_tsdb_head_series?
- [ ] Does a code review process go through when changing Recording Rules?
- [ ] Is the promtool unit test included in the CI pipeline?
- [ ] Are quarterly SLO goal reviews and Error Budget Policy reviews scheduled?
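For the query-log item in the operations checklist, the log is enabled via `query_log_file` in prometheus.yml (available since Prometheus 2.16; the path is illustrative):

```yaml
global:
  scrape_interval: 15s
  # Logs every query with its duration; watch disk usage on busy servers.
  query_log_file: /var/log/prometheus/query.log
```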
## References

- [Prometheus Recording Rules - Official Documentation](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
- [Prometheus Best Practices: Recording Rules](https://prometheus.io/docs/practices/rules/)
- [Google SRE Workbook - Alerting on SLOs](https://sre.google/workbook/alerting-on-slos/)
- [Prometheus Query Functions - Official Documentation](https://prometheus.io/docs/prometheus/latest/querying/functions/)
- [Awesome Prometheus Alerts - Collection of community notification rules](https://samber.github.io/awesome-prometheus-alerts/)
- [Prometheus Alertmanager Configuration - Official Documentation](https://prometheus.io/docs/alerting/latest/configuration/)
- [Google SRE Workbook - Error Budget Policy](https://sre.google/workbook/error-budget-policy/)
- [Prometheus Naming Best Practices](https://prometheus.io/docs/practices/naming/)