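[Golden Kubestronaut] PCA Practice Exam 80 Questions - Prometheus Certified Associate

1. PCA Exam Overview
2. Golden Kubestronaut Introduction
3. Domain Breakdown
4. Key Concepts Summary
5. Practice Questions (80 Questions)
6. Conclusion

1. PCA Exam Overview

**PCA (Prometheus Certified Associate)** is a certification administered by the CNCF that covers the Prometheus monitoring system.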

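Item	Details
Duration	90 minutes
Questions	60 questions (multiple choice)
Passing Score	75% (45 or more correct)
Format	Online proctored
Validity	3 years (2 years for certifications earned on or after April 1, 2024)
Cost	USD 250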

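2. Golden Kubestronaut Introduction

Golden Kubestronaut is the top-tier title awarded to those who hold the five original Kubestronaut certifications (CKA, CKAD, CKS, KCNA, KCSA) plus the remaining CNCF certifications, including Prometheus (PCA), Istio (ICA), Argo (CAPA), Backstage (CBA), and Cilium (CCA). The official program also lists further requirements such as LFCS, so the required set is larger than the ten named here; check the CNCF program page for the current list.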

3. Domain Breakdown

Domain	Weight
Observability Concepts	18%
Prometheus Fundamentals	20%
PromQL	28%
Instrumentation and Exporters	16%
Alerting and Dashboarding	18%

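4. Key Concepts Summary

Prometheus Architecture

Prometheus Server: Metric collection (scraping), TSDB storage, PromQL query engine
Alertmanager: Alert routing, grouping, deduplication, silencing
Pushgateway: Intermediate gateway for metrics from short-lived batch jobs
Exporters: Metric translators such as Node Exporter and Blackbox Exporter
Service Discovery: Automatic target discovery via Kubernetes SD, Consul SD, File SD, and others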

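Metric Types

Counter: Monotonically increasing cumulative value (e.g., total requests)
Gauge: Current value that can go up and down (e.g., memory usage)
Histogram: Observations counted into buckets (e.g., response time distribution)
Summary: Quantiles calculated on the client side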

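PromQL Essentials

Instant Vector: Set of time series, each with one sample at a single timestamp
Range Vector: Set of time series, each with samples over a time range
Scalar: Single numeric value
rate(): Per-second average rate of increase of a counter
histogram_quantile(): Estimates quantiles from a histogram

5. Practice Questions (80 Questions)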

Domain 1: Observability Concepts (Q1-Q14)

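Q1. Which of the following is NOT one of the Three Pillars of observability?

A) Metrics B) Logs C) Traces D) Alerts

Answer: D

Explanation: The three pillars of observability are Metrics, Logs, and Traces. Alerts are an output of monitoring, not a core signal. Metrics are numeric data, Logs are event records, and Traces follow request paths through a distributed system.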

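Q2. Which statement correctly describes how Prometheus collects metrics?

A) Agents push metrics to a central server B) A central server pulls metrics from targets C) Asynchronous collection via message queues D) Streaming-based real-time collection

Answer: B

Explanation: Prometheus uses a pull-based architecture. The Prometheus server fetches metrics from each target's HTTP endpoint at the configured interval (scrape_interval). This approach verifies target health as a side effect and lets the server control the collection frequency.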

Q3. Which combination correctly expands the letters of the USE methodology?

A) Utilization, Saturation, Errors B) Uptime, Scalability, Efficiency C) Usage, Speed, Execution D) Utilization, Speed, Errors

Answer: A

Explanation: The USE methodology is a system performance analysis method proposed by Brendan Gregg. For every resource (CPU, memory, disk, network) it checks Utilization, Saturation, and Errors.

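Q4. Which three indicators does the RED methodology measure?

A) Rate, Errors, Duration B) Requests, Endpoints, Delays C) Resources, Events, Data D) Reads, Executions, Drops

Answer: A

Explanation: The RED methodology is a microservice monitoring method proposed by Tom Wilkie. It measures Rate (requests per second), Errors (ratio of failed requests), and Duration (request processing time). USE focuses on infrastructure resources, whereas RED focuses on service-level performance.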

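Q5. Which statement correctly describes an SLI (Service Level Indicator)?

A) A contract in which a service provider makes guarantees to customers B) A quantitative measure of service performance C) The maximum allowed recovery time after an outage D) A target value for service availability

Answer: B

Explanation: An SLI is an indicator that measures the service level quantitatively, for example request latency, error rate, or throughput. An SLO (Service Level Objective) is the target value for an SLI, and an SLA (Service Level Agreement) is the legal contract.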

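Q6. Which statement about OpenTelemetry is NOT correct?

A) It is a CNCF incubating project B) It is a framework for generating, collecting, and managing telemetry data C) It was created to completely replace Prometheus D) It handles Metrics, Logs, and Traces in a unified way

Answer: C

Explanation: OpenTelemetry is a framework for standardized telemetry collection, not a replacement for Prometheus. The two are complementary, and OpenTelemetry can deliver metrics to Prometheus over the OTLP protocol. Option A reflects the CNCF maturity level this question assumes; check the CNCF project page for the current status.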

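Q7. What is an advantage of push-based metric collection over pull-based collection?

A) Target health can be checked automatically B) It is easier to collect metrics from short-lived batch jobs behind a firewall C) Collection frequency is easy to control centrally D) Target configuration is simpler

Answer: B

Explanation: Push-based collection helps with targets behind a firewall and with batch jobs that run only for a very short time. Prometheus provides the Pushgateway for these cases. The advantages of pull are automatic target health checking and centralized control of collection frequency.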

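Q8. Which statement correctly describes the difference between observability and monitoring?

A) Observability only checks predefined metrics B) Monitoring is exploratory analysis of unknown problems C) Observability is the ability to understand a system's internal state from its external outputs D) Monitoring is a superset of observability

Answer: C

Explanation: Observability is a property of a system: its internal state can be understood from its external outputs (metrics, logs, traces). Monitoring watches predefined indicators, while observability also makes it possible to explore and analyze problems nobody anticipated.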

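Q9. Which description of the Prometheus exposition format is correct?

A) JSON key-value pairs B) A text format with metric name, labels, and value on a single line C) An XML-based structured format D) A binary-only Protocol Buffers format

Answer: B

Explanation: The default Prometheus exposition format is human-readable text. Each line contains a metric name, labels (in curly braces), and a value, separated by whitespace. TYPE and HELP comment lines are included as well. A Protocol Buffers format is also supported, but text is the default.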
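The line layout described above can be illustrated with a minimal sketch. The helper names `render_sample` and `render_metric` are hypothetical (they are not part of any Prometheus library), and the sketch assumes label values need no escaping.

```python
def render_sample(name, labels, value):
    """Render one sample line: name, optional {labels}, then the value."""
    if labels:
        # Labels appear as key="value" pairs inside curly braces.
        body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


def render_metric(name, help_text, metric_type, samples):
    """Render HELP and TYPE comment lines followed by all samples."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        lines.append(render_sample(name, labels, value))
    return "\n".join(lines)


text = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "code": "200"}, 1027),
     ({"method": "POST", "code": "500"}, 3)],
)
print(text)
```

Real client libraries also escape special characters in label values and help text, which this sketch leaves out.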

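Q10. What is the main purpose of an exemplar?

A) Compressing metric data for storage B) Providing a link from metrics to traces C) Storing example queries for alerting rules D) Storing example values for histogram buckets

Answer: B

Explanation: An exemplar attaches extra labels such as a trace ID to a specific metric sample, which creates a direct link from metrics to traces. From a histogram bucket that shows high latency, you can jump straight to the distributed trace of a request that landed in it.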

Q11. Which of the following is NOT one of the Four Golden Signals?

A) Latency B) Traffic C) Throughput D) Saturation

Answer: C

Explanation: The Four Golden Signals defined by Google SRE are Latency, Traffic, Errors, and Saturation. Throughput is related to Traffic, but it is not one of the named golden signals.

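Q12. In the multi-dimensional data model, what uniquely identifies a time series?

A) The metric name alone B) The combination of metric name and label set C) The combination of timestamp and value D) The metric name and timestamp

Answer: B

Explanation: In the Prometheus multi-dimensional data model, a time series is identified by the unique combination of its metric name and its label key-value pairs. Two series with the same metric name but different labels are stored as separate time series. This is a core property of Prometheus.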

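Q13. What is the most likely cause of a cardinality explosion?

A) The scrape interval is too short B) The set of label values grows without bound C) There are too many alerting rules D) The retention period is too long

Answer: B

Explanation: A cardinality explosion occurs when labels with a very large number of unique values are used, such as user_id, request_id, or IP address. Every distinct label value creates a separate time series, so TSDB memory and disk usage grow rapidly. To prevent this, keep the cardinality of label values bounded.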
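The arithmetic behind a cardinality explosion can be sketched as follows. The function name `worst_case_series` and the example label sets are assumptions made for illustration; the product is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Bounded labels: a handful of methods and status codes.
bounded = {
    "method": ["GET", "POST", "PUT"],
    "status_code": ["200", "404", "500"],
}

# Adding an unbounded label such as user_id multiplies the series count.
unbounded = dict(bounded, user_id=[f"u{i}" for i in range(10_000)])

print(worst_case_series(bounded))    # 9
print(worst_case_series(unbounded))  # 90000
```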

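Q14. Which statement about the OpenMetrics standard is correct?

A) It is an independent metric standard unrelated to Prometheus B) It is a CNCF project that standardizes the Prometheus exposition format C) It is a JSON-only metric format D) It is a binary-only protocol

Answer: B

Explanation: OpenMetrics is a standardized metric transmission format based on the Prometheus exposition format. It is a CNCF project and supports both a text and a Protocol Buffers representation. It adds features on top of the Prometheus format, such as exemplar support and created timestamps.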

Domain 2: Prometheus Fundamentals (Q15-Q30)

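Q15. What is the main purpose of the WAL (Write-Ahead Log) in the Prometheus TSDB?

A) Improving query performance B) Preventing data loss during crash recovery C) Compressing stored metrics D) Synchronizing with remote storage

Answer: B

Explanation: Incoming data is first written sequentially to the WAL on disk before it is recorded in the in-memory Head Block. If Prometheus terminates abnormally, it replays the WAL to restore the Head Block data. WAL segment files are 128MB by default.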

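Q16. Which statement about the Prometheus Head Block is NOT correct?

A) It keeps the most recent data in memory B) By default it contains roughly the most recent 2 hours of data C) It is stored permanently on disk D) It is where newly scraped samples are written first

Answer: C

Explanation: The Head Block is an in-memory block. It holds the most recent data (2 hours by default), and new samples are appended to it after being written to the WAL. Head Block data is periodically compacted into persistent blocks on disk.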

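Q17. Which of the following is NOT a way to reload the Prometheus configuration?

A) Sending a SIGHUP signal B) Calling the /-/reload HTTP endpoint C) Editing prometheus.yml and relying on automatic detection D) Enabling the --web.enable-lifecycle flag and calling the API

Answer: C

Explanation: By default, Prometheus does not detect configuration file changes automatically. You must send a SIGHUP signal or, with the --web.enable-lifecycle flag enabled, send a POST request to the /-/reload endpoint. In Prometheus Operator environments, the config-reloader sidecar automates this.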

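Q18. Which statement about TSDB block compaction is correct?

A) It is the process of deleting old blocks B) It is the process of merging several small blocks into one larger block C) It is the process of sending block data to remote storage D) It is the process of cleaning up WAL files

Answer: B

Explanation: Compaction merges several small blocks into one larger block to make queries more efficient. It uses a level-based scheme, and data marked for deletion (tombstones) is physically removed during the merge. Vertical compaction merges blocks whose time ranges overlap.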

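Q19. Which is the correct way to configure data retention in Prometheus?

A) A retention setting in prometheus.yml B) The --storage.tsdb.retention.time command-line flag C) Dynamic configuration through the TSDB API D) The PROMETHEUS_RETENTION environment variable

Answer: B

Explanation: Prometheus retention is configured with command-line flags. --storage.tsdb.retention.time sets time-based retention (15 days by default), and --storage.tsdb.retention.size sets size-based retention. If both are set, whichever limit is reached first applies.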

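Q20. What is the difference between scrape_interval and evaluation_interval in Prometheus?

A) Both are metric collection intervals B) scrape_interval is the metric collection interval, and evaluation_interval is the rule evaluation interval C) scrape_interval is a global setting, and evaluation_interval is a per-job setting D) The two values must always be equal

Answer: B

Explanation: scrape_interval is how often Prometheus collects metrics from targets (1 minute by default), and evaluation_interval is how often recording rules and alerting rules are evaluated (1 minute by default). The two can be set independently, although setting them to the same value is a common recommendation.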

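Q21. How are the samples of a time series encoded in Prometheus storage?

A) Both timestamp and value are stored raw B) Timestamps use delta-of-delta encoding, and values use XOR encoding C) Both are gzip-compressed D) LZ4 block compression

Answer: B

Explanation: The Prometheus TSDB uses a compression scheme inspired by Facebook's Gorilla paper. Timestamps are delta-of-delta encoded, so most of them fit in very few bits, and values (float64) are XOR encoded against the previous value so that only the differing bits are stored. The Gorilla paper reports about 1.37 bytes per sample; Prometheus typically averages around 1-2 bytes per sample.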
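The two ideas in this explanation can be sketched in a few lines. This is only an illustration of why regular scrapes and slowly changing values compress well; the function names are hypothetical, and the real TSDB additionally packs these numbers into variable-length bit fields, which is omitted here.

```python
import struct


def delta_of_delta(timestamps):
    """Return (first timestamp, first delta, list of delta-of-delta values)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def xor_values(values):
    """XOR each float64 bit pattern with the previous sample's bit pattern."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


# Scrapes every 15 s with one sample arriving 1 s late.
first, first_delta, dods = delta_of_delta([1000, 1015, 1030, 1045, 1061])
print(first, first_delta, dods)  # 1000 15 [0, 0, 1]

# An unchanged value XORs to 0, which can be stored in a single bit.
print(xor_values([42.0, 42.0, 42.5]))
```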

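Q22. Which statement about Prometheus federation is correct?

A) It automatically replicates data between Prometheus servers B) A higher-level Prometheus scrapes selected metrics from lower-level Prometheus servers C) All Prometheus instances share the same TSDB D) It propagates metrics through Alertmanager

Answer: B

Explanation: Federation is a hierarchical setup in which a higher-level (global) Prometheus server scrapes selected time series from the /federate endpoint of lower-level (local) Prometheus servers. The match[] parameter selects only the metrics that are needed. It is used for cross-service aggregation and global views.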

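Q23. Which statement about Prometheus remote_write is NOT correct?

A) It sends collected samples to a remote endpoint B) It uses Protocol Buffers compressed with snappy C) Writing to remote storage skips the local TSDB D) It is queue-based and includes retry logic

Answer: C

Explanation: remote_write runs in parallel with local TSDB storage. Collected samples are stored in the local TSDB and are also sent to the remote endpoint. It uses snappy-compressed protobuf and handles temporary failures with internal queues and a retry mechanism.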

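Q24. Which statement about staleness handling in Prometheus is correct?

A) When a target disappears, its time series data is deleted immediately B) When a scrape fails, a stale marker is added and the series becomes stale C) Time series are kept forever and never become stale D) A series automatically becomes stale after 5 minutes without new samples

Answer: B

Explanation: Staleness handling was improved in Prometheus 2.x. When a target or a series disappears from a scrape, a stale marker (a special NaN value) is appended to the affected time series. At query time, a series whose most recent sample within the lookback delta (5 minutes by default) is a stale marker is excluded from the result.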

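Q25. What is the role of a ServiceMonitor in the Prometheus Operator?

A) It automatically creates Kubernetes Services B) It declaratively defines the targets Prometheus should scrape C) It defines Alertmanager routing rules D) It automatically generates Grafana dashboards

Answer: B

Explanation: ServiceMonitor is a CRD provided by the Prometheus Operator that declaratively defines Prometheus scrape targets based on Kubernetes Services. namespaceSelector and selector choose the Services, and the endpoints field sets the port, path, interval, and so on. The Operator watches these objects and reflects them in the Prometheus configuration automatically.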

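Q26. What is the common purpose of Thanos and Cortex?

A) Completely replacing Prometheus B) Providing long-term storage and horizontal scaling for Prometheus C) Providing a new query language that replaces PromQL D) Replacing Alertmanager

Answer: B

Explanation: Thanos and Cortex (and Grafana Mimir, which started as a fork of Cortex) provide long-term storage, a global view, and high availability for Prometheus. Thanos uses a sidecar pattern with object storage, while Cortex and Mimir use a fully distributed architecture. All of them support PromQL-compatible queries.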

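Q27. What is the inverted index used for in the Prometheus TSDB?

A) Sorting time series data chronologically B) Fast label-based lookup of time series C) Compressing metric values D) Tracking the location of WAL files

Answer: B

Explanation: The TSDB inverted index maps each label name-value pair to the list of series IDs that carry it (the posting list). When a PromQL query supplies label matchers, the inverted index finds the matching series quickly. Complex label selectors are resolved with intersection and union operations on posting lists.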
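A toy version of this lookup might look like the sketch below. It handles equality matchers only, and `build_index` and `select` are hypothetical names; the real index stores sorted posting lists on disk rather than Python sets.

```python
from collections import defaultdict


def build_index(series):
    """Map each (label name, label value) pair to the set of series IDs carrying it."""
    index = defaultdict(set)
    for series_id, labels in series.items():
        for pair in labels.items():
            index[pair].add(series_id)
    return index


def select(index, **matchers):
    """Intersect the posting lists of all equality matchers."""
    postings = [index.get(pair, set()) for pair in matchers.items()]
    return sorted(set.intersection(*postings)) if postings else []


series = {
    1: {"__name__": "http_requests_total", "job": "api", "method": "GET"},
    2: {"__name__": "http_requests_total", "job": "api", "method": "POST"},
    3: {"__name__": "http_requests_total", "job": "web", "method": "GET"},
}
index = build_index(series)
print(select(index, job="api", method="GET"))  # [1]
```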

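Q28. Which statement about native histograms is correct?

A) They use the same storage layout as classic histograms B) They generate exponentially spaced buckets automatically, without predefined bucket boundaries C) They were introduced to replace the Summary type D) They can be used in PromQL without any dedicated functions

Answer: B

Explanation: Native histograms (exponential histograms) are a new histogram format introduced as an experimental feature in Prometheus 2.40. Bucket boundaries do not have to be defined in advance; buckets are created automatically on an exponential scale. This greatly reduces the number of series per histogram while still allowing accurate quantile estimation.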

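Q29. What does the honor_labels setting do in Prometheus?

A) Labels in the scraped metrics take precedence over labels that Prometheus attaches automatically B) Label names are normalized automatically C) All conflicting labels are dropped D) Only external labels are kept

Answer: A

Explanation: With honor_labels set to true, when a label already present in the scraped metrics conflicts with a label that Prometheus assigns on the server side (job, instance, and so on), the original label is kept. This is used when original labels must be preserved, for example with federation or the Pushgateway.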

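Q30. Which statement about scrape_timeout in Prometheus is correct?

A) It is the maximum time spent on target discovery B) It is the timeout of an individual scrape request C) It is the timeout for sending alerts D) It is the timeout for PromQL query execution

Answer: B

Explanation: scrape_timeout is the timeout for an individual scrape HTTP request. The default is 10 seconds, and it must be less than or equal to scrape_interval. If the target does not respond within this time, the scrape is recorded as failed and the up metric becomes 0.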

Domain 3: PromQL (Q31-Q53)

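Q31. What is the result type of the following PromQL query? rate(http_requests_total[5m])

A) Scalar B) Instant Vector C) Range Vector D) String

Answer: B

Explanation: rate() takes a range vector as input and returns an instant vector. For each time series it uses the samples within the 5-minute range to compute the per-second average rate of increase, and the result is one value per series at a single timestamp (an instant vector).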

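Q32. Which statement correctly describes the difference between rate() and irate()?

A) rate() is for gauges, and irate() is for counters B) rate() is the average rate over the whole range, and irate() is the instantaneous rate between the last two samples C) irate() is always more accurate than rate() D) The two functions return the same result

Answer: B

Explanation: rate() computes the average per-second rate of increase between the first and last samples in the range. irate() uses only the last two samples in the range to compute an instantaneous rate. rate() suits alerts and recording rules, while irate() suits graphs of volatile, fast-moving counters. rate() is the generally recommended choice.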
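The difference can be made concrete with a simplified model. The functions `simple_rate` and `simple_irate` are hypothetical and deliberately skip the extrapolation to the window boundaries that the real rate() performs; they only show the averaging window and the counter reset correction.

```python
def simple_rate(samples):
    """Average per-second increase across the window, correcting counter resets.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count `cur` itself.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t1, v1), (t2, v2) = samples[-2:]
    delta = v2 - v1 if v2 >= v1 else v2
    return delta / (t2 - t1)


# A steady 2 req/s followed by a burst in the final 15 seconds.
window = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 490)]
print(simple_rate(window))   # 6.5
print(simple_irate(window))  # 20.0
```

The burst dominates irate() but is smoothed by rate(), which is why rate() tends to be the safer input for alerting rules.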

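Q33. Which statement about the query histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) is correct?

A) It computes the exact 95th percentile B) It estimates the 95th percentile by linear interpolation between bucket boundaries C) It reads a quantile that was computed on the client side D) It returns the maximum value over the last 5 minutes

Answer: B

Explanation: histogram_quantile() estimates quantiles from the cumulative counts of the histogram buckets. Because it uses linear interpolation between bucket boundaries, the result can differ from the true value. It is especially inaccurate when the bucket boundaries do not fit the actual distribution.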
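The interpolation step can be sketched for classic histograms as follows. This is a simplified illustration, not the Prometheus implementation: it assumes sorted cumulative buckets ending in +Inf, a lowest bucket starting at 0, and 0 < q <= 1.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted, ending with +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls into +Inf: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.9, buckets))
```

With these buckets the 90th percentile lands in the (0.25, 0.5] bucket, so the estimate can be anywhere in that interval regardless of where the real observations sat.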

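Q34. What is the offset modifier used for in PromQL?

A) Shifting the query result into the future B) Reading data from a past point in time instead of the current one C) Adjusting the scrape interval D) Transforming label values

Answer: B

Explanation: The offset modifier shifts the evaluation time of a selector into the past. For example, http_requests_total offset 1h returns the data from 1 hour ago. This makes it possible to compare current values with past values. The @ modifier, by contrast, pins evaluation to a specific epoch timestamp.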

Q35. Which of the following is NOT an aggregation operator?

A) sum B) avg C) rate D) topk

Answer: C

Explanation: rate() is a function, not an aggregation operator. The Prometheus aggregation operators include sum, avg, min, max, count, stddev, stdvar, topk, bottomk, quantile, count_values, and group.

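Q36. What is the difference between the by and without clauses in PromQL?

A) by keeps only the listed labels, and without removes the listed labels B) by is for filtering, and without is for aggregation C) They do the same thing D) by is for instant queries, and without is for range queries

Answer: A

Explanation: The by clause aggregates while keeping only the listed labels and dropping the rest. The without clause aggregates while dropping the listed labels and keeping the rest. For example, sum by (job)(metric) sums per job label, and sum without (instance)(metric) sums across instances while preserving all other labels.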
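The grouping rule can be sketched with a small model of an instant vector. `aggregate_sum` is a hypothetical helper, and the vector is represented as a list of (labels, value) pairs without the metric name, which aggregation drops anyway.

```python
from collections import defaultdict


def aggregate_sum(vector, by=None, without=None):
    """Sum values per group; pass either `by` or `without`, not both."""
    groups = defaultdict(float)
    for labels, value in vector:
        if by is not None:
            # by: keep only the listed labels as the grouping key.
            key = {k: v for k, v in labels.items() if k in by}
        else:
            # without: drop the listed labels and group by everything else.
            key = {k: v for k, v in labels.items() if k not in (without or ())}
        groups[tuple(sorted(key.items()))] += value
    return dict(groups)


vector = [
    ({"job": "api", "instance": "a"}, 1.0),
    ({"job": "api", "instance": "b"}, 2.0),
    ({"job": "web", "instance": "c"}, 4.0),
]
print(aggregate_sum(vector, by=["job"]))
print(aggregate_sum(vector, without=["instance"]))
```

Both calls give the same groups here because job and instance are the only labels; with more labels, without (instance) would keep them while by (job) would discard them.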

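Q37. What is the label_replace function used for?

A) Permanently changing labels stored in the TSDB B) Creating a new label in the query result by transforming a label value with a regular expression C) The same thing as relabel_configs D) Deleting labels

Answer: B

Explanation: label_replace works at query time. It uses a regular expression to capture part of an existing label value and either writes it to a new label or overwrites an existing label value. It affects only the query result and does not change the stored data.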

Q38. Which is the correct syntax for a PromQL subquery?

A) rate(http_requests_total[5m])[30m:1m] B) rate(http_requests_total[5m]) subquery 30m C) subquery(rate(http_requests_total[5m]), 30m, 1m) D) rate(http_requests_total[5m]).range(30m, 1m)

Answer: A

Explanation: A subquery is written as [range:resolution] after an expression that returns an instant vector. In this example, the rate() result is evaluated over a 30-minute range at 1-minute steps. If the resolution is omitted, the global evaluation_interval is used. Subqueries serve as input to range vector functions such as max_over_time.

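Q39. What does the following query mean? http_requests_total unless http_errors_total

A) It subtracts the value of http_errors_total from http_requests_total B) It returns only the http_requests_total series that have no match in http_errors_total C) It returns the intersection of the two metrics D) It returns http_requests_total conditionally

Answer: B

Explanation: unless is a set operator that removes from the left vector every series whose labels match a series in the right vector. In other words, it returns only those http_requests_total series for which no http_errors_total series with the same label set exists. and (intersection) and or (union) are the other set operators.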

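Q40. Which statement about the increase() function is correct?

A) It computes the increase of a gauge B) It returns the total increase of a counter over the given period C) It uses a completely different calculation from rate() D) Its result is always an integer

Answer: B

Explanation: increase() returns the total increase of a counter within the given time range. Internally it is equivalent to rate() multiplied by the range in seconds. Because it extrapolates to the start and end of the range, the result may not be an integer.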

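Q41. What do the on and ignoring keywords do in vector matching?

A) on matches using only the listed labels, and ignoring matches while disregarding the listed labels B) on is for filtering, and ignoring is for sorting C) Both are used only in aggregations D) on applies to the left vector, and ignoring applies to the right vector

Answer: A

Explanation: In a binary operation, the on keyword matches the two vectors using only the listed labels. The ignoring keyword matches on all remaining labels while disregarding the listed ones. This resembles by/without, but it applies to matching between vectors rather than to aggregation.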

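Q42. What are group_left and group_right used for?

A) Displaying time series in groups B) Allowing many-to-one or one-to-many vector matching C) Moving labels to the left or right D) Sorting query results

Answer: B

Explanation: Vector matching is one-to-one by default. group_left allows one element of the right vector to match several elements of the left vector (many-to-one). group_right is the reverse. This enables operations between metrics of different cardinality.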
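A many-to-one join of this kind can be sketched as below, loosely modeling an expression of the form left * on(pod) group_left(team) right. The function name, the label names (pod, path, team), and the sample data are all assumptions chosen for illustration.

```python
def multiply_group_left(left, right, on, include=()):
    """Join each left sample to the single right sample sharing the `on` labels.

    Labels named in `include` are copied from the right side to the result.
    """
    right_by_key = {}
    for labels, value in right:
        key = tuple(labels.get(name) for name in on)
        if key in right_by_key:
            # The "one" side must be unique per match key.
            raise ValueError("duplicate series on the right side")
        right_by_key[key] = (labels, value)

    result = []
    for labels, value in left:
        match = right_by_key.get(tuple(labels.get(name) for name in on))
        if match is None:
            continue  # left samples without a partner are dropped
        right_labels, right_value = match
        out = dict(labels)
        out.update({n: right_labels[n] for n in include if n in right_labels})
        result.append((out, value * right_value))
    return result


left = [
    ({"pod": "a", "path": "/"}, 2.0),
    ({"pod": "a", "path": "/api"}, 3.0),
    ({"pod": "b", "path": "/"}, 5.0),
    ({"pod": "c", "path": "/"}, 7.0),
]
right = [({"pod": "a", "team": "payments"}, 1.0),
         ({"pod": "b", "team": "search"}, 1.0)]
joined = multiply_group_left(left, right, on=["pod"], include=["team"])
print(joined)
```

Multiplying by an info-style series whose value is 1 is a common way to copy labels onto another metric without changing its values.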

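Q43. Which statement about the predict_linear() function is correct?

A) It can only be used with counters B) It predicts a future value of a gauge based on linear regression C) It uses a machine learning algorithm D) It predicts the maximum value within the range

Answer: B

Explanation: predict_linear() uses simple linear regression to predict the future value of a gauge time series. It is mainly used for capacity-planning alerts such as disk space running out. Example: predict_linear(node_filesystem_avail_bytes[6h], 24*3600) predicts the value 24 hours from now based on the 6-hour trend.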
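The regression behind this function can be sketched with an ordinary least-squares fit. This simplified version extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the query evaluation time; the sample data is invented for the example.

```python
def predict_linear(samples, seconds_ahead):
    """Fit value = intercept + slope * t by least squares and extrapolate.

    samples: list of (timestamp_seconds, value), oldest first.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Free disk space (GiB) sampled hourly for 6 hours, shrinking 1 GiB per hour.
history = [(hour * 3600, 100.0 - hour) for hour in range(7)]
predicted = predict_linear(history, 24 * 3600)
print(predicted)
```

An alerting rule would typically compare such a prediction against zero to warn before the disk actually fills up.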

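Q44. What is the absent() function used for?

A) Finding time series whose value is 0 B) Returning the value 1 for a time series that does not exist C) Deleting time series D) Filtering out NaN values

Answer: B

Explanation: absent() returns a single-element vector with the value 1 when the input vector is empty (the time series does not exist). If the series exists, it returns an empty vector. It is mainly used in alerts that detect a metric disappearing. absent_over_time() is the range version.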

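Q45. What does the resets() function measure?

A) The number of times a gauge reached 0 B) The number of counter resets (decreases) C) The number of scrape failures D) The number of resolved alerts

Answer: B

Explanation: resets() returns the number of times the counter value decreased (was reset) within the range. Counter resets happen, for example, when an application restarts. rate() and increase() compensate for counter resets internally, whereas resets() is useful for monitoring how often the resets themselves occur.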

Q46. Which of the following is NOT a range vector function?

A) rate() B) avg_over_time() C) abs() D) delta()

Answer: C

Explanation: abs() takes an instant vector and returns the absolute value of each sample. rate(), avg_over_time(), and delta() all take a range vector as input. Range vector functions need an argument whose range is given in square brackets.

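Q47. What does the bool modifier do in PromQL?

A) It makes a comparison return 0 or 1 instead of filtering B) It enables logical operations B) C) It queries boolean-typed metrics D) It generates true/false alerts

Answer: A

Explanation: By default, comparison operators filter out the time series that do not satisfy the condition. With the bool modifier, the comparison instead returns 1 when the condition holds and 0 when it does not. For example, http_requests_total > bool 100 returns 1 for values above 100 and 0 otherwise.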

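Q48. What does the changes() function measure?

A) The number of counter increments B) The number of times a time series changed its value C) The number of label changes D) The number of configuration changes

Answer: B

Explanation: changes() returns the number of times the value of a time series changed within the given time range. It is mainly used with gauges to see how often a value fluctuates. For example, it is useful for detecting how many times a configuration value or version number changed.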

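Q49. Which statement about the deriv() function is correct?

A) It computes the derivative of a counter B) It computes the per-second rate of change of a gauge using linear regression C) It performs the same calculation as rate() D) It computes a discrete difference

Answer: B

Explanation: deriv() uses simple linear regression to compute the per-second rate of change (derivative) of a gauge time series. rate() is meant for counters, whereas deriv() is meant for gauges. It is useful for seeing the overall trend in noisy data.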

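Q50. What could go wrong with the following query? sum(rate(http_requests_total[5m])) by (status_code)

A) rate() cannot be combined with sum() B) Nothing. The query is correct C) The by clause must come before sum D) An error occurs if the status_code label does not exist

Answer: B

Explanation: The query is correct. rate() computes the per-second rate of increase of the counter, and sum by (status_code) adds the results up per status code. The by clause may come either before or after the aggregated expression. If the status_code label is missing, there is no error; everything is summed into a single group.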

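Q51. What does the le label mean in histogram_quantile?

A) less than or equal - the upper bound of the bucket B) label expression - a label filtering expression C) level - the depth level of the histogram D) length - the length of the observation

Answer: A

Explanation: le stands for "less than or equal to" and gives the upper bound of a histogram bucket. For example, the bucket with le="0.5" holds the cumulative count of observations that are 0.5 or less. The topmost bucket is le="+Inf", and histogram_quantile() uses the le label to interpolate quantiles.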

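Q52. What are the clamp_min() and clamp_max() functions used for?

A) Limiting the time range of a series B) Setting a lower and upper limit on sample values C) Limiting the number of labels D) Limiting the number of series in the query result

Answer: B

Explanation: clamp_min(v, min) raises every sample value in the vector to at least min, and clamp_max(v, max) caps every sample value at max. clamp(v, min, max) applies both at once. They are useful for cutting off abnormal spikes in graphs or preventing negative values.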

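Q53. What does the @ modifier do in the following PromQL expression? http_requests_total @ 1609459200

A) It sets the metric value to that number B) It reads the data at that Unix timestamp C) It compares the requests per second with that value D) It adds that value to the labels

Answer: B

Explanation: The @ modifier evaluates the selector at a specific Unix epoch timestamp. offset is a shift relative to the current evaluation time, whereas @ pins an absolute point in time. 1609459200 corresponds to January 1, 2021, 00:00:00 UTC.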

Domain 4: Instrumentation and Exporters (Q54-Q67)

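Q54. What must you keep in mind when using a Counter in a Prometheus client library?

A) Its value can be decreased B) It can be set to a negative value C) Only Inc() and Add() are available, and the value cannot be decreased D) An initial value must always be set

Answer: C

Explanation: A Counter is a monotonically increasing metric type that allows only Inc() (increase by 1) and Add(positive_value). Client libraries reject attempts to decrease it; the Go client, for example, panics when Add() receives a negative value. Counter resets happen only when the process restarts, and rate() and increase() compensate for them automatically.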

Q55. Which of the following metrics is NOT provided by Node Exporter?

A) node_cpu_seconds_total B) node_memory_MemTotal_bytes C) node_disk_io_time_seconds_total D) node_container_cpu_usage_seconds_total

Answer: D

Explanation: node_container_cpu_usage_seconds_total is not a Node Exporter metric. Container-level CPU usage comes from cAdvisor, which exposes it as container_cpu_usage_seconds_total. Node Exporter provides host-level hardware and OS metrics (CPU, memory, disk, network, and so on). cAdvisor is built into the kubelet and provides container metrics.

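Q56. What is the main purpose of the Blackbox Exporter?

A) Collecting internal metrics from black-box servers B) Monitoring endpoints with probes over HTTP, TCP, ICMP, DNS, and similar protocols C) Black-box testing of file systems D) Decrypting encrypted metrics

Answer: B

Explanation: The Blackbox Exporter monitors the availability and response time of services from the outside using HTTP(S), TCP, ICMP, DNS, and gRPC probes. It is a black-box monitoring tool that can verify whether a service works without any internal instrumentation. It provides metrics such as probe_success and probe_duration_seconds.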

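Q57. In which scenario is using the Pushgateway appropriate?

A) Collecting metrics from long-running services B) Collecting metrics from short-lived batch jobs C) As a general-purpose path for all metric collection D) As a replacement for service discovery

Answer: B

Explanation: The Pushgateway is an intermediate gateway for collecting metrics from short-lived batch jobs that exit before Prometheus can scrape them. The job pushes its result metrics to the Pushgateway on completion, and Prometheus scrapes them from there. For long-running services, direct scraping is recommended.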

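Q58. Which methods of the Collector interface are required when writing a custom exporter?

A) Collect() only B) Describe() only C) Describe() and Collect() D) Init() and Collect()

Answer: C

Explanation: The Collector interface of the Prometheus Go client requires two methods, Describe() and Collect(). Describe() sends metric descriptors to a channel, and Collect() sends the current metric values to a channel. Implementing this interface lets you build an exporter with custom collection logic.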

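Q59. Which statement about metric naming conventions is correct?

A) Words are separated with dashes (-) B) CamelCase is used C) snake_case is used, and the unit is included as a suffix D) Short names without a prefix are used

Answer: C

Explanation: The Prometheus metric naming convention uses snake_case and includes the unit as a suffix. Examples: http_request_duration_seconds, node_memory_MemTotal_bytes. The prefix indicates the namespace (e.g., prometheus_, node_), _total marks a Counter, and suffixes such as _bytes and _seconds indicate the unit.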

Q60. Which of the following is a correct metric name?

A) http-request-duration B) HttpRequestDuration C) http_request_duration_seconds D) http.request.duration

Answer: C

Explanation: Prometheus metric names follow the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*. Dashes (-) and dots (.) are not allowed. By convention, names use snake_case, Counters take the _total suffix, and units are given as base units (seconds, bytes, and so on) in the suffix.
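The pattern quoted above can be checked against the four answer options with a short sketch. The helper name `is_valid_metric_name` is hypothetical. Note that option B passes the syntax check and is wrong only because it breaks the snake_case and unit-suffix convention.

```python
import re

# The metric name pattern quoted in the explanation.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name):
    """Return True if the whole name matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


for candidate in [
    "http-request-duration",
    "HttpRequestDuration",
    "http_request_duration_seconds",
    "http.request.duration",
]:
    print(candidate, is_valid_metric_name(candidate))
```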

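Q61. What is the key difference between a Summary and a Histogram?

A) A Summary computes quantiles on the server side, and a Histogram on the client side B) A Summary computes quantiles on the client side, and a Histogram has its quantiles estimated on the server side C) They are exactly the same D) Only a Summary supports labels

Answer: B

Explanation: A Summary computes quantiles directly in the client application and exposes them. The values are accurate, but quantiles from several instances cannot be aggregated. A Histogram exposes bucket counts, and the server estimates quantiles with histogram_quantile(). Histograms are more flexible and aggregatable, so they are generally recommended.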

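Q62. Which of the following is NOT a best practice for labels in instrumentation?

A) Using label values with low cardinality B) Adding the user ID as a label C) Using the HTTP method (GET, POST, etc.) as a label D) Using the status code as a label

Answer: B

Explanation: User IDs have a very large number of unique values and cause high cardinality, so they should not be used as labels. The higher the label cardinality, the more the number of time series explodes, with a serious impact on memory and performance. HTTP methods and status codes have a limited set of values, so they are appropriate.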

Q63. What is the default path of an exporter's metrics endpoint?

A) /api/v1/metrics B) /metrics C) /prometheus D) /export

Answer: B

Explanation: The standard metrics endpoint path in the Prometheus ecosystem is /metrics. Exporters and instrumented applications expose metrics in the Prometheus exposition format at this path. If a different path is used, metrics_path must be set in the scrape config.

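Q64. What is the recommendation for defining Histogram buckets in a Prometheus client?

A) Define as many buckets as possible B) Define bucket boundaries that fit the service's SLO C) Use the same buckets for every service D) Define bucket boundaries on a log scale only

Answer: B

Explanation: Histogram buckets should be defined to fit the service's SLO (Service Level Objective) and its expected distribution. For example, if the SLO is a response within 500ms, set buckets such as 0.1, 0.25, 0.5, and 1.0. Too many buckets increase cardinality, and too few reduce quantile accuracy.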

Q65. Where does the process_cpu_seconds_total metric come from?

A) Node Exporter B) The default process metrics of the Prometheus client libraries C) cAdvisor D) kube-state-metrics

Answer: B

Explanation: process_cpu_seconds_total is a default process metric that Prometheus client libraries collect automatically. Most client libraries, including Go, Python, and Java, automatically expose the process's CPU time, memory, number of open file descriptors, and so on.

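Q66. What kind of metrics does kube-state-metrics provide?

A) Hardware resource usage of nodes B) State information about Kubernetes API objects C) CPU and memory usage of containers D) Network traffic metrics

Answer: B

Explanation: kube-state-metrics watches the Kubernetes API server and converts the state of Kubernetes objects such as Deployments, Pods, Nodes, and Jobs into metrics. Examples: kube_deployment_spec_replicas, kube_pod_status_phase. Resource usage comes from cAdvisor/kubelet, and hardware metrics come from Node Exporter.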

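Q67. Which of the following is a correct use case for a Counter?

A) Current memory usage B) Current number of active connections C) Total number of HTTP requests processed D) CPU temperature

Answer: C

Explanation: A Counter is used for monotonically increasing cumulative values. The total number of HTTP requests processed only ever grows, so a Counter fits. Memory usage, active connections, and CPU temperature go up and down, so they call for a Gauge.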

Domain 5: Alerting and Dashboarding (Q68-Q80)

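Q68. What is the main purpose of grouping in Alertmanager?

A) Sorting alerts chronologically B) Bundling similar alerts into one notification to reduce alert fatigue C) Classifying alerts by severity D) Compressing alert data

Answer: B

Explanation: Alertmanager grouping bundles similar alerts into one group based on the group_by labels. For example, alerts that fire simultaneously on hundreds of instances are combined into a single group and sent to the receiver as one notification. This greatly reduces alert fatigue.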

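Q69. Which statement about inhibition in Alertmanager is correct?

A) It temporarily stops all alerts B) When a particular alert fires, related lower-level alerts are suppressed automatically C) It limits how often alerts fire D) It merges duplicate alerts

Answer: B

Explanation: Inhibition is a rule that suppresses other related alerts (targets) while a particular alert (source) is active. For example, when a cluster-wide outage alert fires, the individual service outage alerts can be suppressed. The relationship is defined with source_matchers, target_matchers, and the equal labels.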

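Q70. What does the for field of an alerting rule do?

A) It sets the interval at which the alert is re-sent B) It sets how long the condition must hold before the alert switches to firing C) It sets the waiting time until the alert is resolved D) It sets the alert evaluation interval

Answer: B

Explanation: The for field is the waiting time (the pending period) between the alert condition first being met and the alert actually firing. The condition must hold continuously during this period for the alert to switch to firing. It is used to prevent false positive alerts caused by short-lived spikes.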
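The pending-to-firing transition can be sketched as a small state machine. The function `alert_states` and its parameters are hypothetical and only model the timing rule described above, one boolean per rule evaluation.

```python
def alert_states(condition_by_eval, eval_interval, for_duration):
    """Return the alert state after each evaluation: inactive, pending, or firing."""
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval
        if not condition:
            # Any evaluation where the condition fails resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_duration else "pending")
    return states


# Evaluated every 60 s with for: 2m. A one-off spike never reaches firing.
print(alert_states([True, False, True, True, True, True], 60, 120))
# ['pending', 'inactive', 'pending', 'pending', 'firing', 'firing']
```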

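Q71. What is the main purpose of a recording rule?

A) Recording metric data externally B) Precomputing frequently used, complex queries to improve performance C) Recording alert history D) Logging scrape results

Answer: B

Explanation: A recording rule periodically precomputes a frequently used PromQL expression and stores the result as a new time series. This shortens dashboard loading times and cuts the cost of running complex queries repeatedly. The naming convention is level:metric:operations (e.g., job:http_requests_total:rate5m).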

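Q72. What is the difference between a silence and inhibition in Alertmanager?

A) They are the same feature B) A silence manually mutes specific alerts for a while, and inhibition is automatic, rule-based suppression C) A silence is permanent, and inhibition is temporary D) A silence is set in the configuration file, and inhibition only in the UI

Answer: B

Explanation: A silence lets an administrator manually mute alerts that match certain label conditions for a fixed period (for example, during maintenance). Inhibition suppresses alerts automatically according to rules defined in the configuration file. Silences are managed through the Alertmanager UI or API.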

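Q73. Which statement correctly describes group_wait, group_interval, and repeat_interval in Alertmanager?

A) All of them control notification frequency B) group_wait is the wait before the first notification, group_interval is the interval for group updates, and repeat_interval is the re-send interval C) The three values must always be equal D) Only group_wait is mandatory

Answer: B

Explanation: group_wait is how long to wait and collect further alerts before sending the first notification for a new alert group (30 seconds by default). group_interval is how long to wait before sending a notification about new alerts added to a group that has already been notified (5 minutes by default). repeat_interval is how long to wait before re-sending a notification for the same alerts (4 hours by default).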

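Q74. Which query language is used when Prometheus is configured as a data source in Grafana?

A) SQL B) PromQL C) LogQL D) InfluxQL

Answer: B

Explanation: With a Prometheus data source, queries in Grafana are written in PromQL. You can type PromQL directly in the Grafana query editor or build a query through the GUI in builder mode. LogQL is the query language for Loki, and InfluxQL is the one for InfluxDB.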

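Q75. Which statement about the Alertmanager routing tree is correct?

A) Every alert is sent to every receiver B) Alerts are routed to the appropriate receiver by label-based matching C) Routing works on a time basis only D) The routing tree is limited to a depth of 2 levels

Answer: B

Explanation: The Alertmanager routing tree is hierarchical. Starting at the root route, an alert's labels are compared against each child route's matchers (the older match/match_re fields are deprecated in favor of matchers), and the alert descends into the child routes that match. Without the continue option, evaluation stops at the first matching route; with continue: true, the following sibling routes are checked as well.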
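The traversal, including the effect of continue, can be sketched as a recursive walk. The dictionary layout below is an assumption that loosely mirrors the route configuration; it supports equality matching only, and the receiver names are invented.

```python
def route_alert(route, labels):
    """Return the receivers an alert reaches in a simplified routing tree."""

    def matches(node):
        return all(labels.get(k) == v for k, v in node.get("match", {}).items())

    def walk(node):
        receivers = []
        for child in node.get("routes", []):
            if matches(child):
                receivers.extend(walk(child))
                if not child.get("continue", False):
                    break  # stop at the first match unless continue is set
        # If no child matched, the current node handles the alert itself.
        return receivers or [node["receiver"]]

    return walk(route)


tree = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pager", "continue": True},
        {"match": {"team": "db"}, "receiver": "db-team"},
        {"match": {"team": "web"}, "receiver": "web-team"},
    ],
}
print(route_alert(tree, {"severity": "critical", "team": "db"}))  # ['pager', 'db-team']
print(route_alert(tree, {"team": "web"}))                         # ['web-team']
print(route_alert(tree, {"team": "ops"}))                         # ['default']
```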

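Q76. Which of the following is NOT a principle of writing good alerting rules?

A) Writing symptom-based alerts B) Creating an alert for every metric C) Creating only actionable alerts D) Using the for clause to filter out short-lived spikes

Answer: B

Explanation: Creating alerts for every metric drives alert fatigue to an extreme. Good alerts are symptom-based (user impact rather than cause), actionable (the recipient can do something about them), and have sensible thresholds and a for clause. Symptom-based alerts are recommended over cause-based alerts.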

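Q77. How does an Alertmanager high-availability (HA) cluster work?

A) Leader election keeps only one instance active B) A gossip protocol synchronizes notification state to prevent duplicate notifications C) State is shared through an external database D) A load balancer distributes the requests

Answer: B

Explanation: An Alertmanager HA cluster is formed with a gossip protocol based on HashiCorp's Memberlist library. The instances synchronize notification state (the notification log) and silences. Every instance receives the alerts, and the synchronization keeps the same notification from being sent more than once in normal operation; duplicates remain possible, for example during a network partition.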

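Q78. Which format follows the recording rule naming convention?

A) record_metric_operation B) level:metric:operations C) metric.level.operation D) METRIC_LEVEL_OPERATION

Answer: B

Explanation: The recommended naming convention for recording rules is level:metric:operations. level is the aggregation level (job, instance, and so on), metric is the original metric name, and operations lists the functions and aggregations applied. Example: job:http_requests_total:rate5m is the 5-minute rate of HTTP requests aggregated at the job level. Colons (:) are reserved for recording rules.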

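Q79. What is the difference between Prometheus alerting and Grafana alerting?

A) They are exactly the same B) Prometheus alerts are evaluated on the Prometheus server, and Grafana alerts are evaluated on the Grafana server C) Only Grafana alerts use Alertmanager D) Prometheus alerts are for visualization only

Answer: B

Explanation: Prometheus alerting rules are evaluated periodically as PromQL by the rule manager of the Prometheus server, which sends alerts to Alertmanager when the condition holds. Grafana alerts are evaluated by the Grafana server, which sends queries to the data source. Because Prometheus alerts are evaluated close to the data, they are generally considered more robust and are the recommended choice.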

Q80. Which of the following is NOT a data field available in Alertmanager templates?

A) .Status (firing/resolved) B) .Labels (alert labels) C) .Annotations (alert annotations) D) .Query (full text of the original PromQL query)

Answer: D

Explanation: Alertmanager templates can use fields such as .Status, .Labels, .Annotations, .StartsAt, .EndsAt, and .GeneratorURL. The full text of the original PromQL query is not provided directly. .GeneratorURL contains a link to the query in Prometheus, so the query can be looked up indirectly.

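6. Conclusion

PromQL carries the largest weight in the PCA exam at 28%. Make sure you have a firm grasp of rate(), histogram_quantile(), vector matching, and aggregation operators. Understanding the Prometheus architecture and TSDB internals also matters, and you should know Alertmanager's routing, grouping, and inhibition mechanisms well.

Exam preparation tips:

Read the official documentation carefully and write PromQL queries yourself in a practice environment
Test a variety of queries on the Prometheus demo site
Write an Alertmanager configuration file yourself
Make sure you clearly understand the difference between recording rules and alerting rules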

[Golden Kubestronaut] PCA Practice Exam 80 Questions - Prometheus Certified Associate

1. PCA Exam Overview
2. Golden Kubestronaut Introduction
3. Domain Breakdown
4. Key Concepts Summary
5. Practice Questions (80 Questions)
6. Conclusion

1. PCA Exam Overview

PCA (Prometheus Certified Associate) is a certification administered by the CNCF for the Prometheus monitoring system.

Item	Details
Duration	90 minutes
Questions	60 questions (multiple choice)
Passing Score	75% (45 or more correct)
Format	Online proctored
Validity	3 years (2 years for certifications earned on or after April 1, 2024)
Cost	USD 250

2. Golden Kubestronaut Introduction

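Golden Kubestronaut is the top-tier title awarded to those who hold the five original Kubestronaut certifications (CKA, CKAD, CKS, KCNA, KCSA) plus the remaining CNCF certifications, including Prometheus (PCA), Istio (ICA), Argo (CAPA), Backstage (CBA), and Cilium (CCA). The official program also lists further requirements such as LFCS, so the required set is larger than the ten named here; check the CNCF program page for the current list.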

3. Domain Breakdown

Domain	Weight
Observability Concepts	18%
Prometheus Fundamentals	20%
PromQL	28%
Instrumentation and Exporters	16%
Alerting and Dashboarding	18%

4. Key Concepts Summary

Prometheus Architecture

Prometheus Server: Metric scraping, TSDB storage, PromQL query engine
Alertmanager: Alert routing, grouping, deduplication, silencing
Pushgateway: Intermediate gateway for short-lived batch job metrics
Exporters: Node Exporter, Blackbox Exporter - metric translation
Service Discovery: Kubernetes SD, Consul SD, File SD - automatic target discovery

Metric Types

Counter: Monotonically increasing cumulative value (e.g., total requests)
Gauge: Value that can go up and down (e.g., memory usage)
Histogram: Observations bucketed by value ranges (e.g., response time distribution)
Summary: Client-side calculated quantiles

PromQL Essentials

Instant Vector: Set of time series at a single timestamp
Range Vector: Set of time series over a time range
Scalar: Single numeric value
rate(): Per-second average rate of increase for counters
histogram_quantile(): Calculate quantiles from histograms

5. Practice Questions (80 Questions)

Domain 1: Observability Concepts (Q1-Q14)

Q1. Which of the following is NOT one of the Three Pillars of Observability?

A) Metrics B) Logs C) Traces D) Alerts

Answer: D

Explanation: The three pillars of observability are Metrics, Logs, and Traces. Alerts are an output of monitoring, not a core signal. Metrics are numeric data, Logs are event records, and Traces track request paths through distributed systems.

Q2. What is the metric collection approach used by Prometheus?

A) Agents push metrics to the central server B) Central server pulls metrics from targets C) Asynchronous collection via message queues D) Streaming-based real-time collection

Answer: B

Explanation: Prometheus uses a pull-based architecture. The Prometheus server fetches metrics from each target's HTTP endpoint at configured intervals (scrape_interval). This approach naturally verifies target health and gives the server control over collection frequency.

Q3. What does each letter in the USE methodology stand for?

A) Utilization, Saturation, Errors B) Uptime, Scalability, Efficiency C) Usage, Speed, Execution D) Utilization, Speed, Errors

Answer: A

Explanation: The USE methodology is a system performance analysis method proposed by Brendan Gregg. For every resource (CPU, memory, disk, network) it checks Utilization, Saturation, and Errors.

Q4. What three metrics does the RED methodology measure?

A) Rate, Errors, Duration B) Requests, Endpoints, Delays C) Resources, Events, Data D) Reads, Executions, Drops

Answer: A

Explanation: The RED methodology, proposed by Tom Wilkie, measures Rate (requests per second), Errors (failed request ratio), and Duration (request processing time). While USE focuses on infrastructure resources, RED focuses on service-level performance.

Q5. What is a Service Level Indicator (SLI)?

A) A contract that service providers guarantee to customers B) A quantitative measure of service performance C) The maximum allowed recovery time during an outage D) A target value for service availability

Answer: B

Explanation: An SLI is a quantitative metric measuring service performance - for example, request latency, error rate, or throughput. SLO (Service Level Objective) is a target value for an SLI, and SLA (Service Level Agreement) is the legal contract.

Q6. Which statement about OpenTelemetry is NOT correct?

A) It is a CNCF incubating project B) It is a framework for generating, collecting, and managing telemetry data C) It was created to completely replace Prometheus D) It provides unified management of Metrics, Logs, and Traces

Answer: C

Explanation: OpenTelemetry is a standardized telemetry collection framework, not a replacement for Prometheus. It complements Prometheus and can deliver metrics via the OTLP protocol. It is a CNCF Incubating project that unifies the handling of Metrics, Logs, and Traces, so options A, B, and D are all accurate statements.

Q7. What is an advantage of push-based metric collection over pull-based?

A) Automatic target health checking B) Easier to collect metrics from short-lived batch jobs behind firewalls C) Centralized control of collection frequency D) Simpler target configuration

Answer: B

Explanation: Push-based collection is advantageous for targets behind firewalls or very short-lived batch jobs. Prometheus provides the Pushgateway for these cases. Pull-based advantages include automatic health checking and centralized collection frequency control.

Q8. What correctly describes the difference between observability and monitoring?

A) Observability only checks predefined metrics B) Monitoring is about exploratory analysis of unknown problems C) Observability is the ability to understand internal system state from external outputs D) Monitoring is a superset of observability

Answer: C

Explanation: Observability is a property of a system that allows understanding its internal state through external outputs (metrics, logs, traces). Monitoring watches predefined indicators, while observability enables exploratory analysis of unexpected problems.

Q9. What is the correct format for Prometheus exposition format?

A) JSON key-value pairs B) Text format with metric name, labels, and value on a single line C) XML-based structured format D) Protocol Buffers binary-only format

Answer: B

Explanation: The default Prometheus Exposition Format is a human-readable text format. Each line contains a metric name, labels (in curly braces), and value separated by whitespace. TYPE and HELP comment lines are also included. Protocol Buffers format is also supported but text is the default.
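To make the line layout concrete, here is a minimal Python sketch that renders samples in the text format described above. The helper names `format_sample` and `render_metric` are hypothetical, and the sketch skips label-value escaping and timestamps, which a real client library handles.

```python
def format_sample(name, labels, value):
    """Render one sample line: name{label="value",...} value."""
    if labels:
        # Labels are sorted only to make the output deterministic.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def render_metric(name, metric_type, help_text, samples):
    """Render the HELP and TYPE comment lines followed by the sample lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    lines += [format_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [
        ({"method": "GET", "code": "200"}, 1027),
        ({"method": "POST", "code": "500"}, 3),
    ],
)
print(text)
```

Each printed sample line carries the metric name, the label set in curly braces, and the value, which is the shape a scrape of /metrics returns.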

Q10. What is the primary purpose of Exemplars?

A) Compressing metric data for storage B) Providing a link from metrics to traces C) Storing example queries for alerting rules D) Storing example values of histogram buckets

Answer: B

Explanation: Exemplars attach additional labels such as trace IDs to specific metric samples, enabling direct linking from metrics to traces. This allows jumping from a high-latency histogram bucket directly to the distributed trace of that request.

Q11. Which is NOT one of the Four Golden Signals?

A) Latency B) Traffic C) Throughput D) Saturation

Answer: C

Explanation: The Four Golden Signals defined by Google SRE are Latency, Traffic, Errors, and Saturation. Throughput is related to Traffic but is not the exact golden signal term.

Q12. What uniquely identifies a time series in the multi-dimensional data model?

A) Metric name only B) Combination of metric name and label set C) Combination of timestamp and value D) Metric name and timestamp

Answer: B

Explanation: In Prometheus's multi-dimensional data model, a time series is uniquely identified by the combination of its metric name and key-value label pairs. The same metric name with different labels produces separate time series.

Q13. What is the most common cause of cardinality explosion?

A) Scrape interval too short B) Label values with unbounded growth C) Too many alerting rules D) Retention period too long

Answer: B

Explanation: Cardinality explosion occurs when labels have extremely high-cardinality values such as user IDs, request IDs, or IP addresses. Each unique combination of label values creates a separate time series, so the series count multiplies with every high-cardinality label and drives up TSDB memory and disk usage.
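The arithmetic behind this is simple enough to sketch. The snippet below computes the worst-case number of series for one metric as the product of the distinct values per label. The label names and counts are made up for illustration, and real workloads rarely hit every combination.

```python
from math import prod


def series_count(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "method": ["GET", "POST", "PUT"],
    "status_code": ["200", "404", "500"],
    "instance": [f"host-{i}" for i in range(10)],
}
print(series_count(bounded))  # 3 * 3 * 10

# Adding one unbounded label multiplies the total by its number of values.
with_user_id = dict(bounded, user_id=[str(i) for i in range(100_000)])
print(series_count(with_user_id))
```

With bounded labels the metric stays at 90 series. Adding a user_id label with 100,000 values raises the upper bound to 9,000,000.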

Q14. What correctly describes the OpenMetrics standard?

A) An independent metric standard unrelated to Prometheus B) A CNCF project that standardized the Prometheus Exposition Format C) A JSON-only metric format D) A binary-only protocol

Answer: B

Explanation: OpenMetrics is a standardized metric format based on the Prometheus Exposition Format. As a CNCF project, it supports both text and Protocol Buffers formats. It adds features like Exemplar support and Created timestamps on top of the Prometheus format.

Domain 2: Prometheus Fundamentals (Q15-Q30)

Q15. What is the primary purpose of the WAL (Write-Ahead Log) in Prometheus TSDB?

A) Query performance improvement B) Preventing data loss during crash recovery C) Compressed metric storage D) Remote storage synchronization

Answer: B

Explanation: The WAL writes data sequentially to disk before it is committed to the in-memory Head Block. If Prometheus crashes, it can replay the WAL to recover Head Block data. WAL segment files are 128MB by default.

Q16. Which statement about the Head Block is NOT correct?

A) It keeps the most recent data in memory B) It contains approximately the last 2 hours of data by default C) It is permanently stored on disk D) It is where newly scraped samples are first written

Answer: C

Explanation: The Head Block is an in-memory block that holds the most recent data (default 2 hours). New samples are first written to the WAL then added to the Head Block. Head Block data is periodically compacted into persistent on-disk blocks.

Q17. Which is NOT a method to reload Prometheus configuration?

A) Sending SIGHUP signal B) Calling the /-/reload HTTP endpoint C) Automatic detection after editing prometheus.yml D) Calling the API with --web.enable-lifecycle flag enabled

Answer: C

Explanation: Prometheus does not automatically detect configuration file changes. You must either send a SIGHUP signal or call the /-/reload POST endpoint with the --web.enable-lifecycle flag enabled. In Prometheus Operator environments, a config-reloader sidecar automates this.

Q18. What correctly describes TSDB block compaction?

A) A process that deletes old blocks B) A process that merges multiple small blocks into a larger block C) A process that sends block data to remote storage D) A process that cleans up WAL files

Answer: B

Explanation: Compaction merges multiple smaller blocks into larger blocks to improve query efficiency. It uses level-based compaction, and tombstone-marked deletions are physically removed during merging. Vertical compaction merges blocks with overlapping time ranges.

Q19. How is data retention configured in Prometheus?

A) In the prometheus.yml configuration file B) Via --storage.tsdb.retention.time command-line flag C) Through dynamic TSDB API configuration D) Via PROMETHEUS_RETENTION environment variable

Answer: B

Explanation: Prometheus retention is configured via command-line flags. --storage.tsdb.retention.time sets time-based retention (default 15 days), and --storage.tsdb.retention.size sets size-based retention. When both are set, whichever limit is reached first applies.

Q20. What is the difference between scrape_interval and evaluation_interval?

A) Both are metric collection intervals B) scrape_interval is for metric collection, evaluation_interval is for rule evaluation C) scrape_interval is global, evaluation_interval is per-job D) Both values must always be identical

Answer: B

Explanation: scrape_interval is how often Prometheus collects metrics from targets (default 1 minute), while evaluation_interval is how often recording and alerting rules are evaluated (default 1 minute). They can be set independently but are typically configured to the same value.

Q21. How are time series samples encoded in Prometheus storage?

A) Both timestamps and values stored as raw values B) Timestamps use delta-of-delta, values use XOR encoding C) Both compressed with gzip D) LZ4 block compression

Answer: B

Explanation: Prometheus TSDB uses compression inspired by Facebook's Gorilla paper. Timestamps use delta-of-delta encoding (most require very few bits), and values (float64) are XORed with the previous value so that only the differing bits need to be stored. The Gorilla paper reports an average of about 1.37 bytes per sample; Prometheus typically needs about 1 to 2 bytes per sample.
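The two ideas can be illustrated without the bit-packing. The following sketch only computes the intermediate numbers (delta-of-delta for timestamps, XOR of float64 bit patterns for values). It is not the actual TSDB chunk encoder, and the function names are hypothetical.

```python
import struct


def delta_of_delta(timestamps):
    """Second-order differences of timestamps; near zero for regular scrapes."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(value):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_with_previous(values):
    """XOR each value's bit pattern with its predecessor; 0 means unchanged."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


timestamps = [1000, 1015, 1030, 1045, 1061]  # 15s interval with 1s of jitter
print(delta_of_delta(timestamps))

values = [12.0, 12.0, 12.0, 24.0]
print([hex(x) for x in xor_with_previous(values)])
```

Regular scrape intervals give delta-of-delta values of zero, and unchanged sample values give an XOR of zero. Both cases can then be stored in very few bits.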

Q22. What correctly describes Prometheus Federation?

A) It automatically replicates data between Prometheus servers B) A higher-level Prometheus scrapes specific metrics from lower-level instances C) All Prometheus instances share the same TSDB D) Metrics are propagated through Alertmanager

Answer: B

Explanation: Federation is a hierarchical structure where a global Prometheus server scrapes selected time series from local Prometheus servers via the /federate endpoint. The match[] URL parameters select only the needed series. It is used for cross-service aggregation and global views.

Q23. Which statement about remote_write is NOT correct?

A) It sends collected samples to a remote endpoint B) It uses snappy-compressed Protocol Buffers format C) Writing to remote storage skips local TSDB storage D) It operates with queues and includes retry logic

Answer: C

Explanation: remote_write operates in parallel with local TSDB storage. Collected samples are stored in both the local TSDB and sent to the remote endpoint simultaneously. It uses snappy-compressed protobuf format with internal queues and retry mechanisms for transient failures.

Q24. What correctly describes staleness handling in Prometheus?

A) When a target disappears, its time series data is immediately deleted B) When a scrape fails, a stale marker is added marking the series as stale C) Time series are maintained permanently and never become stale D) Series are automatically marked stale after 5 minutes without new samples

Answer: B

Explanation: Prometheus 2.x introduced explicit staleness handling. When a scrape fails, or a series is no longer exposed by its target, a stale marker (a special NaN value) is written to the affected time series. At query time, a series whose most recent sample is a stale marker is excluded from results immediately. Without a marker, a series drops out only after no sample has been seen within the lookback delta (default 5 minutes).

Q25. What is the role of ServiceMonitor in Prometheus Operator?

A) Automatically creating Kubernetes Services B) Declaratively defining scrape targets for Prometheus C) Defining Alertmanager routing rules D) Auto-generating Grafana dashboards

Answer: B

Explanation: ServiceMonitor is a CRD provided by Prometheus Operator that declaratively defines Prometheus scrape targets based on Kubernetes Services. It uses namespaceSelector and selector to choose target Services, and the endpoints field for port, path, and interval configuration. The Operator watches these and auto-updates Prometheus config.

Q26. What is the common purpose of Thanos and Cortex?

A) Completely replacing Prometheus B) Providing long-term storage and horizontal scaling for Prometheus C) Providing a new query language to replace PromQL D) Replacing Alertmanager

Answer: B

Explanation: Both Thanos and Cortex (and Grafana Mimir, which was forked from Cortex) provide long-term storage, a global view, and high availability for Prometheus. Thanos uses a sidecar pattern with object storage, while Cortex and Mimir use a fully distributed architecture. All of them support PromQL-compatible queries.

Q27. What is the purpose of the inverted index in Prometheus TSDB?

A) Sorting time series data chronologically B) Enabling fast label-based time series lookup C) Compressing metric values D) Tracking WAL file locations

Answer: B

Explanation: The TSDB inverted index maps label name-value pairs to lists of series IDs (posting lists) that contain those labels. When a PromQL query specifies label matching conditions, the inverted index enables fast lookups. Intersection and union operations handle complex label selectors.
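A toy version of this lookup fits in a few lines. The sketch below maps each label pair to a set of series IDs and intersects the sets for an equality selector. The data layout and function names are assumptions for illustration; the real index uses sorted posting lists on disk rather than Python sets.

```python
from collections import defaultdict


def build_index(series):
    """Map each (label name, label value) pair to the IDs of series carrying it."""
    index = defaultdict(set)
    for series_id, labels in series.items():
        for pair in labels.items():
            index[pair].add(series_id)
    return index


def select(index, matchers):
    """Resolve equality matchers by intersecting their posting sets."""
    postings = [index.get(pair, set()) for pair in matchers.items()]
    return set.intersection(*postings) if postings else set()


series = {
    1: {"__name__": "http_requests_total", "job": "api", "method": "GET"},
    2: {"__name__": "http_requests_total", "job": "api", "method": "POST"},
    3: {"__name__": "http_requests_total", "job": "web", "method": "GET"},
}
index = build_index(series)
print(select(index, {"job": "api", "method": "GET"}))
```

A selector such as http_requests_total{job="api", method="GET"} corresponds to intersecting the posting sets of its label pairs, which here leaves only series 1.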

Q28. What correctly describes Native Histograms?

A) They use the same storage format as classic histograms B) They automatically create exponential distribution buckets without predefined boundaries C) They were introduced to replace the Summary type D) They can be used in PromQL without any special functions

Answer: B

Explanation: Native Histograms (exponential-bucket histograms) were introduced as an experimental feature in Prometheus 2.40. Bucket boundaries are generated automatically on an exponential scale, so no predefined boundaries are needed. This significantly reduces cardinality while enabling more accurate quantile calculations.

Q29. What does the honor_labels setting do in Prometheus?

A) Scraped metric labels take precedence over Prometheus-added labels B) Automatically normalizes label names C) Deletes all conflicting labels D) Keeps only external labels

Answer: A

Explanation: When honor_labels is true, labels already present in scraped metrics take precedence when they conflict with Prometheus server-side labels (job, instance, etc.). This is used with Federation or Pushgateway to preserve original labels.

Q30. What does scrape_timeout configure?

A) Maximum time for target discovery B) Timeout for individual scrape requests C) Timeout for alert delivery D) Timeout for PromQL query execution

Answer: B

Explanation: scrape_timeout is the timeout for individual scrape HTTP requests. The default is 10 seconds and must be less than or equal to scrape_interval. If a target does not respond within this time, the scrape is recorded as failed and the up metric becomes 0.

Domain 3: PromQL (Q31-Q53)

Q31. What is the result type of: rate(http_requests_total[5m])?

A) Scalar B) Instant Vector C) Range Vector D) String

Answer: B

Explanation: The rate() function takes a Range Vector as input and returns an Instant Vector. For each time series, it calculates the per-second average rate of increase using samples within the 5-minute range, returning the result as a single-timestamp value (Instant Vector).

Q32. What is the difference between rate() and irate()?

A) rate() is for Gauges, irate() is for Counters B) rate() calculates average rate over the range, irate() calculates instant rate from the last two samples C) irate() is always more accurate than rate() D) Both functions return identical results

Answer: B

Explanation: rate() calculates the average per-second increase rate between the first and last samples in the range. irate() uses only the last two samples for instant rate of change. rate() is recommended for alerting and recording rules, while irate() suits volatile graphs.
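The difference shows up clearly on a series that spikes at the end of the window. This is a simplified sketch that ignores counter resets and the boundary extrapolation Prometheus applies, and the function names are hypothetical.

```python
def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# (timestamp in seconds, counter value); traffic spikes in the final minute.
samples = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 580)]
print(simple_rate(samples))
print(simple_irate(samples))
```

The averaged rate smooths the spike to 2 requests per second, while the instant rate reports the 5 requests per second seen between the last two samples.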

Q33. What describes this query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))?

A) It calculates the exact 95th percentile B) It estimates the 95th percentile using linear interpolation between bucket boundaries C) It retrieves client-side calculated quantiles D) It returns the maximum value over the last 5 minutes

Answer: B

Explanation: histogram_quantile() estimates quantiles based on cumulative bucket counts from histograms. It uses linear interpolation between bucket boundaries, so results may differ from actual values. Accuracy depends on how well bucket boundaries match the actual distribution.
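The interpolation step can be sketched as follows. This is a simplified illustration for classic histograms that assumes 0 < q <= 1, buckets sorted by upper bound and ending in +Inf, and a lowest bucket starting at 0. It operates on plain cumulative counts rather than per-second rates and omits the edge cases the real function handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.9, buckets))
```

With these buckets, the 90th percentile has rank 90 and falls in the bucket between 0.25 and 0.5. The estimate is 0.25 + 0.25 × 10/15, about 0.417, regardless of where the observations actually lie inside that bucket.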

Q34. What is the purpose of the offset modifier in PromQL?

A) Shifting query results into the future B) Querying data at a past point in time instead of the current moment C) Adjusting the scrape interval D) Transforming label values

Answer: B

Explanation: The offset modifier shifts the evaluation time of a query into the past. For example, http_requests_total offset 1h queries data from 1 hour ago. The @ modifier specifies an absolute epoch timestamp instead.

Q35. Which of the following is NOT an Aggregation Operator?

A) sum B) avg C) rate D) topk

Answer: C

Explanation: rate() is a function, not an aggregation operator. Prometheus aggregation operators include sum, avg, min, max, count, stddev, stdvar, topk, bottomk, quantile, count_values, and group.

Q36. What is the difference between the by and without clauses?

A) by keeps only specified labels, without removes specified labels B) by is for filtering, without is for aggregation C) They are identical in function D) by is for instant queries, without is for range queries

Answer: A

Explanation: The by clause keeps only the specified labels and removes the rest during aggregation. The without clause removes the specified labels and keeps the rest. For example, sum by (job)(metric) sums per job label, while sum without (instance)(metric) sums excluding the instance label.
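The two clauses can be mimicked over an in-memory vector. In the sketch below a vector is a list of (labels, value) pairs, and `sum_by` and `sum_without` are hypothetical helpers that mirror how the grouping key is built in each case.

```python
from collections import defaultdict


def sum_by(vector, labels):
    """Group on the listed labels only; every other label is dropped."""
    groups = defaultdict(float)
    for series_labels, value in vector:
        key = tuple((name, series_labels.get(name, "")) for name in labels)
        groups[key] += value
    return dict(groups)


def sum_without(vector, labels):
    """Group on all labels except the listed ones (and the metric name)."""
    dropped = set(labels) | {"__name__"}
    groups = defaultdict(float)
    for series_labels, value in vector:
        key = tuple(sorted((k, v) for k, v in series_labels.items() if k not in dropped))
        groups[key] += value
    return dict(groups)


vector = [
    ({"__name__": "http_requests_total", "job": "api", "instance": "a:9090"}, 10.0),
    ({"__name__": "http_requests_total", "job": "api", "instance": "b:9090"}, 5.0),
    ({"__name__": "http_requests_total", "job": "web", "instance": "c:9090"}, 2.0),
]
print(sum_by(vector, ["job"]))
print(sum_without(vector, ["instance"]))
```

Because job and instance are the only labels here, both calls produce the same groups. They diverge as soon as the series carry additional labels, which by drops and without keeps.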

Q37. What does the label_replace function do?

A) Permanently modifies labels stored in TSDB B) Transforms label values using regex to create new labels in query results C) Functions identically to relabel_configs D) Deletes labels

Answer: B

Explanation: label_replace captures parts of existing label values using regex at query time and creates new labels or modifies existing label values. This only affects query results and does not change stored data.

Q38. What is the correct syntax for a PromQL subquery?

A) rate(http_requests_total[5m])[30m:1m] B) rate(http_requests_total[5m]) subquery 30m C) subquery(rate(http_requests_total[5m]), 30m, 1m) D) rate(http_requests_total[5m]).range(30m, 1m)

Answer: A

Explanation: Subqueries use [range:resolution] after an instant vector expression. This example evaluates the rate() result over a 30-minute range at 1-minute resolution. If resolution is omitted, the global evaluation_interval is used. Subqueries serve as input to range vector functions like max_over_time.

Q39. What does this query mean: http_requests_total unless http_errors_total?

A) Subtracts http_errors_total values from http_requests_total B) Returns only http_requests_total series that do not match http_errors_total C) Returns the intersection of both metrics D) Conditionally returns http_requests_total

Answer: B

Explanation: "unless" is a set operator that removes series from the left vector that have matching labels in the right vector. So it returns http_requests_total series that have no matching label set in http_errors_total. "and" (intersection) and "or" (union) are also set operators.
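A small sketch shows how the set operators compare label sets rather than values. Vectors are modeled as lists of (labels, value) pairs, the metric name is ignored during matching, and the helper names are hypothetical.

```python
def labels_key(labels):
    """Label set used for matching; the metric name is not part of it."""
    return frozenset((k, v) for k, v in labels.items() if k != "__name__")


def vector_unless(left, right):
    """Keep left-hand series that have no matching label set on the right."""
    right_keys = {labels_key(labels) for labels, _ in right}
    return [(labels, value) for labels, value in left if labels_key(labels) not in right_keys]


def vector_and(left, right):
    """Keep left-hand series that do have a matching label set on the right."""
    right_keys = {labels_key(labels) for labels, _ in right}
    return [(labels, value) for labels, value in left if labels_key(labels) in right_keys]


requests = [
    ({"__name__": "http_requests_total", "job": "api"}, 100),
    ({"__name__": "http_requests_total", "job": "web"}, 50),
]
errors = [({"__name__": "http_errors_total", "job": "api"}, 3)]

print(vector_unless(requests, errors))
print(vector_and(requests, errors))
```

Only the job="web" series survives unless, because it has no counterpart in the errors vector. The values on the right-hand side never enter the result.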

Q40. What correctly describes the increase() function?

A) It calculates the increase of Gauge values B) It returns the total increase of a Counter over the specified period C) It uses a completely different calculation method from rate() D) Its result is always an integer

Answer: B

Explanation: increase() returns the total increase of a Counter within the specified time range. Internally, it performs the same calculation as rate() multiplied by the time range in seconds. Results may not be integers due to extrapolation at the range boundaries.

Q41. What are the roles of the on and ignoring keywords in vector matching?

A) on matches only on specified labels, ignoring matches while excluding specified labels B) on is for filtering, ignoring is for sorting C) Both are only used in aggregation operations D) on applies to the left vector, ignoring to the right vector

Answer: A

Explanation: In binary operations, the "on" keyword matches vectors using only the specified labels. The "ignoring" keyword matches vectors while excluding the specified labels. This is similar to by/without but used for vector matching rather than aggregation.

Q42. What is the purpose of group_left and group_right?

A) Grouping time series for display B) Allowing many-to-one or one-to-many vector matching C) Moving labels left or right D) Sorting query results

Answer: B

Explanation: Default vector matching is one-to-one. group_left allows a single element from the right vector to match multiple elements from the left vector (many-to-one). group_right is the reverse. This enables operations between metrics with different cardinalities.

Q43. What correctly describes the predict_linear() function?

A) It can only be used with Counter types B) It predicts future values based on linear regression of Gauge data C) It uses machine learning algorithms D) It predicts the maximum value in a range

Answer: B

Explanation: predict_linear() uses simple linear regression to predict future values of a Gauge time series. It is commonly used for capacity planning alerts like disk space or certificate expiration. For example, predict_linear(node_filesystem_avail_bytes[6h], 24*3600) predicts the value 24 hours from now based on a 6-hour trend.
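The regression itself is ordinary least squares over (timestamp, value) pairs. This sketch predicts relative to the last sample's timestamp, which is a simplifying assumption; Prometheus predicts relative to the query evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Fit value = intercept + slope * t and extrapolate past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical disk that loses 1 GB of free space per hour, sampled hourly for 6h.
samples = [(hour * 3600, 100e9 - hour * 1e9) for hour in range(7)]
print(predict_linear(samples, 24 * 3600))
```

Six hours into the trend, the fit extrapolates another 24 hours and predicts about 70 GB free. An alert expression would typically compare such a prediction against 0.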

Q44. What is the purpose of the absent() function?

A) Finding time series with a value of 0 B) Returning a value of 1 for non-existent time series C) Deleting time series D) Filtering NaN values

Answer: B

Explanation: absent() returns a single-element vector with value 1 when the input vector is empty (the time series does not exist). If the series exists, it returns an empty vector. It is primarily used in alerts to detect when metrics disappear. absent_over_time() is the range version.

Q45. What does the resets() function measure?

A) Number of times a Gauge reaches 0 B) Number of Counter resets (decreases) C) Number of scrape failures D) Number of alert resolutions

Answer: B

Explanation: resets() returns the number of times a Counter value decreased (reset) within the range. Counter resets occur during application restarts. While rate() and increase() auto-compensate for resets, resets() is useful for monitoring the frequency of resets themselves.
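Both behaviors can be sketched over a plain list of counter values. The reset compensation below treats any decrease as a restart from zero, which matches the idea described above but leaves out the range-boundary extrapolation that rate() and increase() perform. The function names are hypothetical.

```python
def count_resets(values):
    """Number of times the counter value dropped between consecutive samples."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)


def increase_with_resets(values):
    """Total increase, assuming each drop means the counter restarted from 0."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        # After a reset the whole current value counts as new increase.
        total += cur if cur < prev else cur - prev
    return total


values = [100, 120, 150, 10, 40]  # process restarted between 150 and 10
print(count_resets(values))
print(increase_with_resets(values))
```

The series contains one reset, and the compensated increase is 20 + 30 + 10 + 30 = 90 rather than the naive last-minus-first result of -60.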

Q46. Which of the following is NOT a range vector function?

A) rate() B) avg_over_time() C) abs() D) delta()

Answer: C

Explanation: abs() takes an instant vector and returns the absolute value of each sample. rate(), avg_over_time(), and delta() are all range vector functions that require a range selector argument in brackets.

Q47. What does the bool modifier do in PromQL?

A) Returns 0 or 1 instead of filtering for comparison operations B) Enables logical operations C) Queries boolean-type metrics D) Creates true/false alerts

Answer: A

Explanation: By default, comparison operators filter out non-matching series. The bool modifier instead returns 1 for matches and 0 for non-matches. For example, http_requests_total > bool 100 returns 1 if the value is above 100 and 0 otherwise.

Q48. What does the changes() function measure?

A) Number of Counter increases B) Number of times the time series value changed C) Number of label changes D) Number of configuration changes

Answer: B

Explanation: changes() returns the number of value changes in a time series within the specified range. It is primarily used with Gauge-type metrics to track the frequency of value fluctuations - for example, detecting how often a configuration value or version number has changed.

Q49. What correctly describes the deriv() function?

A) It calculates the derivative of a Counter B) It calculates per-second rate of change for Gauges using linear regression C) It performs the same calculation as rate() D) It calculates discrete derivatives

Answer: B

Explanation: deriv() uses simple linear regression to calculate the per-second rate of change (derivative) of a Gauge time series. While rate() is Counter-specific, deriv() is used for Gauges. It is useful for understanding overall trends in noisy data.

Q50. What potential issue exists with this query: sum(rate(http_requests_total[5m])) by (status_code)?

A) rate() cannot be used with sum() B) None - this is a correct query C) The by clause must come before sum D) An error occurs if the status_code label does not exist

Answer: B

Explanation: This query is correct. rate() calculates the per-second rate, and sum by (status_code) aggregates by status code. The by clause can appear before or after sum(). If the status_code label does not exist, it simply aggregates into a single group rather than producing an error.

Q51. What does the le label mean in histogram_quantile?

A) less than or equal - the upper boundary value of the bucket B) label expression - a label filtering expression C) level - the depth level of the histogram D) length - the length of observed values

Answer: A

Explanation: "le" stands for "less than or equal to" and represents the upper boundary of a histogram bucket. For example, a bucket with le="0.5" contains the cumulative count of observations at or below 0.5. The top bucket is le="+Inf", and histogram_quantile() uses these le labels for interpolation.

Q52. What do clamp_min() and clamp_max() do?

A) Limit the time range of series B) Set lower and upper bounds for sample values C) Limit the number of labels D) Limit the number of time series in results

Answer: B

Explanation: clamp_min(v, min) clamps all sample values to a minimum of min, and clamp_max(v, max) clamps to a maximum of max. clamp(v, min, max) applies both simultaneously. Useful for limiting abnormal spikes in graphs or preventing negative values.

Q53. What does the @ modifier do in: http_requests_total @ 1609459200?

A) Sets the metric value to that number B) Queries data at that Unix timestamp C) Compares requests per second to that value D) Adds that value as a label

Answer: B

Explanation: The @ modifier evaluates the query at a specific Unix epoch timestamp. While offset shifts time relative to the current moment, @ specifies an absolute point in time. 1609459200 corresponds to January 1, 2021 00:00:00 UTC.
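Converting between epoch seconds and calendar time is easy to check with the Python standard library. The helper names below are hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def utc_to_epoch(year, month, day):
    """Epoch seconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


print(epoch_to_utc(1609459200).isoformat())
print(utc_to_epoch(2021, 1, 1))
```

The second helper is handy when building a query with @ for a known calendar date.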

Domain 4: Instrumentation and Exporters (Q54-Q67)

Q54. What must you be careful about when using Counter metrics in client libraries?

A) Values can be decreased B) Negative values can be set C) Only Inc() and Add() are available - values cannot be decreased D) An initial value must always be set

Answer: C

Explanation: Counter is a monotonically increasing metric type that only allows Inc() (increment by 1) and Add() with a non-negative value. Passing a negative value is rejected: the Go client panics, and other client libraries raise an error. Counter resets only occur on process restart, and rate()/increase() auto-compensate for them.

Q55. Which metric is NOT provided by Node Exporter?

A) node_cpu_seconds_total B) node_memory_MemTotal_bytes C) node_disk_io_time_seconds_total D) node_container_cpu_usage_seconds_total

Answer: D

Explanation: node_container_cpu_usage_seconds_total is not a Node Exporter metric. Container-level CPU usage is exposed by cAdvisor as container_cpu_usage_seconds_total, without the node_ prefix. Node Exporter provides host-level hardware and OS metrics (CPU, memory, disk, network, etc.). cAdvisor is embedded in kubelet and provides container metrics.

Q56. What is the primary purpose of Blackbox Exporter?

A) Collecting internal metrics from blackbox servers B) Monitoring endpoints via HTTP, TCP, ICMP, and DNS probes C) Blackbox testing of file systems D) Decrypting encrypted metrics

Answer: B

Explanation: Blackbox Exporter monitors service availability and response times externally through HTTP(S), TCP, ICMP, DNS, and gRPC probes. It enables blackbox monitoring without internal instrumentation. It provides metrics like probe_success and probe_duration_seconds.

Q57. When is Pushgateway appropriate to use?

A) Collecting metrics from long-running services B) Collecting metrics from short-lived batch jobs C) General-purpose metric collection D) Replacing service discovery

Answer: B

Explanation: Pushgateway is an intermediate gateway for collecting metrics from short-lived batch jobs that terminate before Prometheus can scrape them. Jobs push their result metrics to the Pushgateway, which Prometheus then scrapes. Direct scraping is recommended for long-running services.

Q58. What are the required methods for the Collector interface when writing a custom Exporter?

A) Collect() only B) Describe() only C) Describe() and Collect() D) Init() and Collect()

Answer: C

Explanation: The Prometheus Go client's Collector interface requires implementing both Describe() and Collect(). Describe() sends metric descriptors to a channel, and Collect() sends current metric values to a channel. Implementing this interface enables custom collection logic in Exporters.

Q59. Which metric naming convention is correct?

A) Use dashes (-) to separate words B) Use CamelCase C) Use snake_case with units as suffixes D) Use short names without prefixes

Answer: C

Explanation: Prometheus metric naming convention uses snake_case with units as suffixes. Examples: http_request_duration_seconds, node_memory_MemTotal_bytes. Prefixes indicate the namespace (e.g., prometheus_, node_), _total is for Counters, and _bytes/_seconds indicate units.

Q60. Which is a valid metric name?

A) http-request-duration B) HttpRequestDuration C) http_request_duration_seconds D) http.request.duration

Answer: C

Explanation: Prometheus metric names must match the regex pattern [a-zA-Z_:][a-zA-Z0-9_:]*. Dashes (-) and dots (.) are not allowed. Convention uses snake_case, _total suffix for Counters, and base units (seconds, bytes, etc.) as suffixes.
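The four answer options can be checked directly against that pattern. This is a minimal sketch using Python's re module, and `is_valid_metric_name` is a hypothetical helper. Note that HttpRequestDuration passes the regex, so option B is syntactically valid and only violates the snake_case and unit-suffix conventions.

```python
import re

# Pattern quoted in the explanation above.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name):
    """True if the whole name matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


for candidate in [
    "http-request-duration",
    "HttpRequestDuration",
    "http_request_duration_seconds",
    "http.request.duration",
]:
    print(candidate, is_valid_metric_name(candidate))
```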

Q61. What is the key difference between Summary and Histogram?

A) Summary calculates quantiles server-side, Histogram calculates client-side B) Summary calculates quantiles client-side, Histogram estimates quantiles server-side C) They are completely identical D) Only Summary supports labels

Answer: B

Explanation: Summary calculates quantiles directly in the client application and exposes them. This is accurate but quantiles cannot be aggregated across instances. Histogram exposes bucket counts and quantiles are estimated server-side with histogram_quantile(). Histogram is more flexible and aggregatable, making it generally recommended.

Q62. Which is NOT a best practice for label usage in instrumentation?

A) Using low-cardinality label values B) Adding user IDs as labels C) Using HTTP methods (GET, POST, etc.) as labels D) Using status codes as labels

Answer: B

Explanation: User IDs have effectively unbounded cardinality and should not be used as labels. Every distinct label value creates a new time series, and the total multiplies across labels, severely impacting memory and performance. HTTP methods and status codes have a small, fixed set of values and are appropriate labels.

Q63. What is the default metrics endpoint path for Exporters?

A) /api/v1/metrics B) /metrics C) /prometheus D) /export

Answer: B

Explanation: The standard metrics endpoint path in the Prometheus ecosystem is /metrics. Exporters and instrumented applications expose metrics in Prometheus Exposition Format at this path. If a different path is used, the metrics_path must be specified in the scrape config.

Q64. What is the recommendation when defining Histogram buckets?

A) Define as many buckets as possible B) Define bucket boundaries aligned with service SLOs C) Use identical buckets for all services D) Define boundaries only on a log scale

Answer: B

Explanation: Histogram buckets should be defined based on the service's SLOs and expected distribution. For example, if the SLO is sub-500ms response, set buckets at 0.1, 0.25, 0.5, 1.0. Too many buckets increase cardinality, while too few reduce quantile accuracy.

Q65. Where does the process_cpu_seconds_total metric come from?

A) Node Exporter B) Default process metrics from Prometheus client libraries C) cAdvisor D) Kube-state-metrics

Answer: B

Explanation: process_cpu_seconds_total is a default process metric automatically collected by Prometheus client libraries. Most client libraries (Go, Python, Java, etc.) automatically expose process CPU time, memory, open file descriptor count, and more.

Q66. What characterizes the metrics provided by kube-state-metrics?

A) Node hardware resource usage B) Kubernetes API object state information C) Container CPU/memory usage D) Network traffic metrics

Answer: B

Explanation: kube-state-metrics watches the Kubernetes API server and converts Kubernetes object states into metrics for Deployments, Pods, Nodes, Jobs, etc. Examples: kube_deployment_spec_replicas, kube_pod_status_phase. Resource usage comes from cAdvisor/kubelet, and hardware metrics from Node Exporter.

Q67. Which is a correct Counter use case?

A) Current memory usage B) Current active connections C) Total HTTP requests processed D) CPU temperature

Answer: C

Explanation: Counter is used for monotonically increasing cumulative values. Total HTTP requests processed continuously increases, making Counter appropriate. Memory usage, active connections, and CPU temperature fluctuate up and down, requiring Gauge instead.

Domain 5: Alerting and Dashboarding (Q68-Q80)

Q68. What is the primary purpose of Alertmanager's grouping feature?

A) Sorting alerts chronologically B) Bundling similar alerts into one notification to reduce alert fatigue C) Classifying alerts by severity D) Compressing alert data

Answer: B

Explanation: Alertmanager grouping bundles similar alerts into a single group based on group_by labels. For example, alerts from hundreds of instances firing simultaneously are consolidated into one alert group notification. This significantly reduces alert fatigue.
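Grouping by `group_by` labels can be pictured as bucketing alerts by a tuple of label values. The sketch below is a simplified model with hypothetical alerts, not Alertmanager's implementation.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical alerts: three instances down at once, plus one latency alert.
alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "prod", "instance": f"node-{i}"}}
    for i in range(3)
] + [{"labels": {"alertname": "HighLatency", "cluster": "prod", "instance": "api-1"}}]

groups = group_alerts(alerts, ["alertname", "cluster"])
for key, members in groups.items():
    print(key, len(members))
```

Four alerts collapse into two groups here, so a receiver would get two notifications instead of four.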

Q69. What correctly describes Alertmanager inhibition?

A) Temporarily stopping all alerts B) Automatically suppressing related lower-level alerts when specific alerts fire C) Limiting alert frequency D) Merging duplicate alerts

Answer: B

Explanation: Inhibition is a rule that suppresses target alerts when a source alert is active. For example, a cluster-wide failure alert can inhibit individual service failure alerts. Relationships are defined with source_matchers, target_matchers, and equal labels.
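The interplay of source matchers, target matchers, and `equal` labels can be sketched as below. Matchers are modeled as plain equality dictionaries, which is a simplifying assumption; the rule and alerts are hypothetical.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the given value."""
    return all(labels.get(k) == v for k, v in matchers.items())

def is_inhibited(target, active_alerts, rule):
    """True if some other active alert acts as a source that suppresses target."""
    if not matches(target, rule["target_matchers"]):
        return False
    for source in active_alerts:
        if source is target:
            continue  # an alert does not inhibit itself in this sketch
        same_scope = all(source.get(l) == target.get(l) for l in rule["equal"])
        if matches(source, rule["source_matchers"]) and same_scope:
            return True
    return False

rule = {
    "source_matchers": {"alertname": "ClusterDown"},
    "target_matchers": {"severity": "warning"},
    "equal": ["cluster"],
}
cluster_down = {"alertname": "ClusterDown", "severity": "critical", "cluster": "prod"}
svc_prod = {"alertname": "ServiceDown", "severity": "warning", "cluster": "prod"}
svc_staging = {"alertname": "ServiceDown", "severity": "warning", "cluster": "staging"}
active = [cluster_down, svc_prod, svc_staging]

print(is_inhibited(svc_prod, active, rule))
print(is_inhibited(svc_staging, active, rule))
```

The `equal` list is what keeps a failure in one cluster from muting warnings in another.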

Q70. What does the "for" field in alerting rules do?

A) The interval for repeating alert notifications B) The wait time before transitioning to the firing state after the condition is met C) The wait time before resolving alerts D) The alert evaluation interval

Answer: B

Explanation: The "for" field is the wait time (pending period) between when the alert condition is met and when the alert transitions to firing state. The condition must remain true throughout this period. It is used to prevent false positive alerts from transient spikes.
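The pending period can be modeled as a small state machine. The sketch assumes one boolean result per rule evaluation and a fixed evaluation interval; the function and its parameters are hypothetical names chosen for illustration.

```python
def alert_states(condition_by_eval, for_seconds, eval_interval):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation."""
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval
        if not condition:
            active_since = None  # any false evaluation resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# Condition flaps once, then stays true; for: 30s with a 15s evaluation interval.
results = alert_states([True, True, False, True, True, True, True], 30, 15)
print(results)
```

The single false evaluation in the middle sends the alert back to inactive, so the pending period starts over.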

Q71. What is the primary purpose of Recording Rules?

A) Recording metric data externally B) Pre-computing frequently used complex queries for performance improvement C) Recording alert history D) Logging scraping results

Answer: B

Explanation: Recording Rules periodically pre-compute frequently used PromQL expressions and store the results as new time series. This reduces dashboard loading times and avoids re-running expensive queries on every request. The naming convention is level:metric:operations (e.g., job:http_requests:rate5m, where the _total suffix of the source counter is conventionally dropped once rate() is applied).

Q72. What is the difference between Alertmanager silence and inhibition?

A) They are the same feature B) Silence manually pauses specific alerts, inhibition is rule-based automatic suppression C) Silence is permanent, inhibition is temporary D) Silence is in config files, inhibition is UI-only

Answer: B

Explanation: Silence is manually created by administrators to mute alerts matching specific label conditions for a defined period (e.g., during maintenance). Inhibition is configured in the config file and automatically suppresses alerts based on rules. Silences are managed via the Alertmanager UI or API.
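A silence combines label matchers with a time window. The sketch below models that with equality matchers only, which is an assumption for brevity; the silence and alert are hypothetical.

```python
from datetime import datetime, timezone

def is_silenced(labels, silence, now):
    """True if now falls inside the silence window and all matchers match."""
    in_window = silence["starts_at"] <= now < silence["ends_at"]
    label_match = all(labels.get(k) == v for k, v in silence["matchers"].items())
    return in_window and label_match

# Hypothetical two-hour maintenance window for one database host.
maintenance = {
    "matchers": {"instance": "db-1"},
    "starts_at": datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    "ends_at": datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc),
}
alert = {"alertname": "InstanceDown", "instance": "db-1"}

during = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
after_end = datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)
print(is_silenced(alert, maintenance, during))
print(is_silenced(alert, maintenance, after_end))
```

Unlike inhibition, nothing here depends on another alert being active; only the labels and the clock matter.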

Q73. What do group_wait, group_interval, and repeat_interval control?

A) All control alert sending frequency B) group_wait is initial wait, group_interval is group update interval, repeat_interval is resend interval C) All three values must always be identical D) Only group_wait is required

Answer: B

Explanation: group_wait is the wait time to collect additional alerts before sending the first notification for a new group (default 30s). group_interval is the interval for sending updates when new alerts join an existing group (default 5m). repeat_interval is the interval for resending unchanged alerts (default 4h).
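How the three timers interact is easier to see in a small simulation. The model below is deliberately simplified: it tracks one group, ignores resolved alerts, and checks the group every group_interval after the first flush. All names are hypothetical, and times are in seconds.

```python
def notification_times(alert_arrivals, horizon,
                       group_wait=30, group_interval=300, repeat_interval=14400):
    """Return the times at which one alert group would send a notification."""
    arrivals = sorted(alert_arrivals)
    sent = []
    notified_count = 0
    t = arrivals[0] + group_wait  # first flush waits for more alerts to arrive
    while t <= horizon:
        current_count = sum(1 for a in arrivals if a <= t)
        changed = current_count != notified_count
        repeat_due = bool(sent) and t - sent[-1] >= repeat_interval
        if changed or repeat_due:
            sent.append(t)
            notified_count = current_count
        t += group_interval  # the group is re-checked every group_interval
    return sent

# Two alerts arrive close together, a third joins the group later.
times = notification_times([0, 10, 400], horizon=16000)
print(times)
```

With the default values, the first two alerts share one notification at 30 s, the late arrival triggers an update at 630 s, and the unchanged group is resent once repeat_interval has elapsed.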

Q74. What query language is used by default when configuring Prometheus as a Grafana data source?

A) SQL B) PromQL C) LogQL D) InfluxQL

Answer: B

Explanation: When using a Prometheus data source in Grafana, queries are written in PromQL. You can type PromQL directly in Grafana's query editor or use the builder mode to construct queries via GUI. LogQL is for Loki, and InfluxQL is for InfluxDB.

Q75. What correctly describes the Alertmanager routing tree?

A) All alerts are sent to all receivers B) Alerts are routed to appropriate receivers based on label matching C) Routing operates on time-based rules only D) The routing tree depth is limited to 2 levels

Answer: B

Explanation: The Alertmanager routing tree is a hierarchical structure that starts at the root route and branches into child routes based on label matching conditions (matchers; the older match/match_re fields are deprecated). Without the continue option, evaluation stops at the first matching child route; with continue: true, it goes on to check the following sibling routes.
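The effect of `continue` can be sketched with a recursive walk over a route tree. The model assumes equality matchers stored in a dict and an explicit receiver on every route; the tree and receiver names are hypothetical.

```python
def route_alert(labels, route):
    """Return the receivers for an alert, assuming `route` itself already matched."""
    receivers = []
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("matchers", {}).items()):
            receivers.extend(route_alert(labels, child))
            if not child.get("continue", False):
                break  # stop at the first match unless continue is set
    # If no child matched, the current route handles the alert.
    return receivers or [route["receiver"]]

tree = {
    "receiver": "default",
    "routes": [
        {"matchers": {"severity": "critical"}, "receiver": "pager", "continue": True},
        {"matchers": {"team": "db"}, "receiver": "db-team"},
        {"matchers": {"severity": "critical"}, "receiver": "critical-log"},
    ],
}

print(route_alert({"severity": "critical", "team": "db"}, tree))
print(route_alert({"severity": "critical", "team": "web"}, tree))
print(route_alert({"severity": "warning", "team": "web"}, tree))
```

The first alert reaches both "pager" and "db-team" because of `continue`, but never "critical-log", since the "db-team" route matched without `continue`.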

Q76. Which is NOT a principle of good alerting rules?

A) Writing symptom-based alerts B) Creating alerts for every metric C) Creating only actionable alerts D) Using "for" clause to filter transient spikes

Answer: B

Explanation: Creating alerts for every metric leads to extreme alert fatigue. Good alerts should be symptom-based (user impact, not causes), actionable (receivers can take action), and have appropriate thresholds and "for" clauses. Symptom-based alerting is preferred over cause-based alerting.

Q77. How does Alertmanager High Availability (HA) clustering work?

A) Leader election with only one active instance B) Gossip protocol synchronizes alert state to prevent duplicate notifications C) State is shared via external database D) Load balancer distributes requests

Answer: B

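Explanation: Alertmanager HA clusters use a gossip protocol implemented with the HashiCorp memberlist library. Instances synchronize their notification logs and silences with each other. Prometheus sends every alert to all instances, and the shared notification log lets each instance see that a notification has already gone out, so the same alert is normally sent only once.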

Q78. What is the correct naming convention format for Recording Rules?

A) record_metric_operation B) level:metric:operations C) metric.level.operation D) METRIC_LEVEL_OPERATION

Answer: B

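Explanation: The recommended naming convention is level:metric:operations. Level is the aggregation level (the labels kept in the output, such as job or instance), metric is the source metric name, and operations lists the functions and aggregations applied, newest first. Example: job:http_requests:rate5m (the _total suffix is conventionally dropped once rate() is applied). Colons (:) in metric names are reserved for recording rules and should not be used by exporters or direct instrumentation.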
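A quick way to internalize the convention is to split a rule name into its three parts. The helper below is a hypothetical checker written for this article, not a Prometheus tool.

```python
def parse_rule_name(name):
    """Split a recording rule name into level, metric, and operations."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected level:metric:operations, got {name!r}")
    level, metric, operations = parts
    return {"level": level, "metric": metric, "operations": operations}

print(parse_rule_name("job:http_requests:rate5m"))

try:
    parse_rule_name("http_requests_total")
except ValueError as err:
    print(err)
```

A plain metric name without colons fails the check, which mirrors the idea that colons mark a series as the output of a recording rule.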

Q79. What is the difference between Prometheus alerts and Grafana alerts?

A) They are completely identical B) Prometheus alerts are evaluated by the Prometheus server, Grafana alerts by the Grafana server C) Only Grafana alerts use Alertmanager D) Prometheus alerts are for visualization only

Answer: B

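Explanation: Prometheus alerting rules are evaluated periodically by the Prometheus server's rule manager, which sends firing alerts to Alertmanager. Grafana alerts are evaluated by the Grafana server, which queries its configured data sources. Because Prometheus evaluates its rules directly against local data, without depending on a separate system reaching it over the network, Prometheus alerts are generally considered the more reliable choice for Prometheus metrics.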

Q80. Which is NOT a data field available in Alertmanager templates?

A) .Status (firing/resolved) B) .Labels (alert labels) C) .Annotations (alert annotations) D) .Query (full original PromQL query)

Answer: D

Explanation: Alertmanager templates can use fields such as .Status, .Labels, .Annotations, .StartsAt, .EndsAt, and .GeneratorURL. The full original PromQL query is not provided as a field of its own. However, .GeneratorURL links back to the alert's expression in Prometheus, so the query can be reached indirectly.

6. Conclusion

PromQL carries the highest weight at 28% on the PCA exam. Focus especially on rate(), histogram_quantile(), vector matching, and aggregation operators. Understanding the Prometheus architecture and TSDB internals is also important, along with Alertmanager routing, grouping, and inhibition mechanisms.

Exam Preparation Tips:

Thoroughly read the official documentation and practice PromQL queries in a lab environment
Test various queries on the Prometheus Demo site
Practice writing Alertmanager configuration files
Clearly understand the differences between Recording Rules and Alerting Rules