[Golden Kubestronaut] PCA Practice Exam 80 Questions - Prometheus Certified Associate

1. PCA Exam Overview

PCA (Prometheus Certified Associate) is a certification administered by the CNCF for the Prometheus monitoring system.

| Item | Details |
| --- | --- |
| Duration | 90 minutes |
| Questions | 60 questions (multiple choice) |
| Passing Score | 75% (45 or more correct) |
| Format | Online proctored |
| Validity | 3 years |
| Cost | USD 250 |

2. Golden Kubestronaut Introduction

Golden Kubestronaut is the top-tier title awarded for holding every CNCF certification plus the Linux Foundation's LFCS: the five Kubestronaut certs (CKA, CKAD, CKS, KCNA, KCSA) along with the associate-level certs such as Prometheus (PCA), Istio (ICA), Argo (CAPA), Backstage (CBA), and Cilium (CCA).

3. Domain Breakdown

| Domain | Weight |
| --- | --- |
| Observability Concepts | 18% |
| Prometheus Fundamentals | 20% |
| PromQL | 28% |
| Instrumentation and Exporters | 16% |
| Alerting and Dashboarding | 18% |

4. Key Concepts Summary

Prometheus Architecture

  • Prometheus Server: Metric scraping, TSDB storage, PromQL query engine
  • Alertmanager: Alert routing, grouping, deduplication, silencing
  • Pushgateway: Intermediate gateway for short-lived batch job metrics
  • Exporters: Node Exporter, Blackbox Exporter, etc. - translate third-party system metrics into Prometheus format
  • Service Discovery: Kubernetes SD, Consul SD, File SD - automatic target discovery

Metric Types

  • Counter: Monotonically increasing cumulative value (e.g., total requests)
  • Gauge: Value that can go up and down (e.g., memory usage)
  • Histogram: Observations bucketed by value ranges (e.g., response time distribution)
  • Summary: Client-side calculated quantiles

PromQL Essentials

  • Instant Vector: Set of time series at a single timestamp
  • Range Vector: Set of time series over a time range
  • Scalar: Single numeric value
  • rate(): Per-second average rate of increase for counters
  • histogram_quantile(): Calculate quantiles from histograms
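
The essentials above combine in queries like the following (metric names are illustrative):

```promql
# Instant vector: all series for this metric at the query timestamp
http_requests_total{job="api"}

# Range vector: the last 5 minutes of samples for each series
http_requests_total{job="api"}[5m]

# Per-second request rate averaged over the last 5 minutes
rate(http_requests_total{job="api"}[5m])

# Estimated 95th-percentile latency from histogram buckets
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```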

5. Practice Questions (80 Questions)

Domain 1: Observability Concepts (Q1-Q14)

Q1. Which of the following is NOT one of the Three Pillars of Observability?

A) Metrics B) Logs C) Traces D) Alerts

Answer: D

Explanation: The three pillars of observability are Metrics, Logs, and Traces. Alerts are an output of monitoring, not a core signal. Metrics are numeric data, Logs are event records, and Traces track request paths through distributed systems.

Q2. What is the metric collection approach used by Prometheus?

A) Agents push metrics to the central server B) Central server pulls metrics from targets C) Asynchronous collection via message queues D) Streaming-based real-time collection

Answer: B

Explanation: Prometheus uses a pull-based architecture. The Prometheus server fetches metrics from each target's HTTP endpoint at configured intervals (scrape_interval). This approach naturally verifies target health and gives the server control over collection frequency.
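
A minimal prometheus.yml sketch of the pull model (the job name and target address are placeholders):

```yaml
global:
  scrape_interval: 15s   # how often targets are pulled

scrape_configs:
  - job_name: "api"
    static_configs:
      # Prometheus issues GET http://app.example.com:8080/metrics on each scrape
      - targets: ["app.example.com:8080"]
```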

Q3. What does each letter in the USE methodology stand for?

A) Utilization, Saturation, Errors B) Uptime, Scalability, Efficiency C) Usage, Speed, Execution D) Utilization, Speed, Errors

Answer: A

Explanation: The USE methodology, proposed by Brendan Gregg, checks Utilization, Saturation, and Errors for every resource (CPU, Memory, Disk, Network) in a system performance analysis.

Q4. What three metrics does the RED methodology measure?

A) Rate, Errors, Duration B) Requests, Endpoints, Delays C) Resources, Events, Data D) Reads, Executions, Drops

Answer: A

Explanation: The RED methodology, proposed by Tom Wilkie, measures Rate (requests per second), Errors (failed request ratio), and Duration (request processing time). While USE focuses on infrastructure resources, RED focuses on service-level performance.

Q5. What is a Service Level Indicator (SLI)?

A) A contract that service providers guarantee to customers B) A quantitative measure of service performance C) The maximum allowed recovery time during an outage D) A target value for service availability

Answer: B

Explanation: An SLI is a quantitative metric measuring service performance - for example, request latency, error rate, or throughput. SLO (Service Level Objective) is a target value for an SLI, and SLA (Service Level Agreement) is the legal contract.

Q6. Which statement about OpenTelemetry is NOT correct?

A) It is a CNCF incubating project B) It is a framework for generating, collecting, and managing telemetry data C) It was created to completely replace Prometheus D) It provides unified management of Metrics, Logs, and Traces

Answer: C

Explanation: OpenTelemetry is a standardized telemetry collection framework, not a replacement for Prometheus. It complements Prometheus and can deliver metrics to it via the OTLP protocol.

Q7. What is an advantage of push-based metric collection over pull-based?

A) Automatic target health checking B) Easier to collect metrics from short-lived batch jobs behind firewalls C) Centralized control of collection frequency D) Simpler target configuration

Answer: B

Explanation: Push-based collection is advantageous for targets behind firewalls or very short-lived batch jobs. Prometheus provides the Pushgateway for these cases. Pull-based advantages include automatic health checking and centralized collection frequency control.

Q8. What correctly describes the difference between observability and monitoring?

A) Observability only checks predefined metrics B) Monitoring is about exploratory analysis of unknown problems C) Observability is the ability to understand internal system state from external outputs D) Monitoring is a superset of observability

Answer: C

Explanation: Observability is a property of a system that allows understanding its internal state through external outputs (metrics, logs, traces). Monitoring watches predefined indicators, while observability enables exploratory analysis of unexpected problems.

Q9. What is the correct format for Prometheus exposition format?

A) JSON key-value pairs B) Text format with metric name, labels, and value on a single line C) XML-based structured format D) Protocol Buffers binary-only format

Answer: B

Explanation: The default Prometheus Exposition Format is a human-readable text format. Each line contains a metric name, labels (in curly braces), and value separated by whitespace. TYPE and HELP comment lines are also included. Protocol Buffers format is also supported but text is the default.
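
A hypothetical /metrics response in the text exposition format might look like this (names and counts invented):

```text
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="200"} 3
```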

Q10. What is the primary purpose of Exemplars?

A) Compressing metric data for storage B) Providing a link from metrics to traces C) Storing example queries for alerting rules D) Storing example values of histogram buckets

Answer: B

Explanation: Exemplars attach additional labels such as trace IDs to specific metric samples, enabling direct linking from metrics to traces. This allows jumping from a high-latency histogram bucket directly to the distributed trace of that request.

Q11. Which is NOT one of the Four Golden Signals?

A) Latency B) Traffic C) Throughput D) Saturation

Answer: C

Explanation: The Four Golden Signals defined by Google SRE are Latency, Traffic, Errors, and Saturation. Throughput is related to Traffic, but it is not one of the four signals as Google defines them.

Q12. What uniquely identifies a time series in the multi-dimensional data model?

A) Metric name only B) Combination of metric name and label set C) Combination of timestamp and value D) Metric name and timestamp

Answer: B

Explanation: In Prometheus's multi-dimensional data model, a time series is uniquely identified by the combination of its metric name and key-value label pairs. The same metric name with different labels produces separate time series.

Q13. What is the most common cause of cardinality explosion?

A) Scrape interval too short B) Label values with unbounded growth C) Too many alerting rules D) Retention period too long

Answer: B

Explanation: Cardinality explosion occurs when labels have extremely high-cardinality values such as user IDs, request IDs, or IP addresses. Each unique label value creates a separate time series, causing exponential growth in TSDB memory and disk usage.
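
The growth is multiplicative, which a quick back-of-the-envelope calculation makes concrete (the label counts below are invented for illustration):

```python
# Series count is the product of the distinct values per label.
methods = 5        # GET, POST, ...
statuses = 10      # 200, 404, 500, ...
endpoints = 50     # bounded set of route templates

bounded = methods * statuses * endpoints
print(bounded)     # 2500 series: manageable

# Adding an unbounded label such as user_id multiplies everything.
users = 1_000_000
exploded = bounded * users
print(exploded)    # 2500000000 series: cardinality explosion
```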

Q14. What correctly describes the OpenMetrics standard?

A) An independent metric standard unrelated to Prometheus B) A CNCF project that standardized the Prometheus Exposition Format C) A JSON-only metric format D) A binary-only protocol

Answer: B

Explanation: OpenMetrics is a standardized metric format based on the Prometheus Exposition Format. As a CNCF project, it supports both text and Protocol Buffers formats. It adds features like Exemplar support and Created timestamps on top of the Prometheus format.

Domain 2: Prometheus Fundamentals (Q15-Q30)

Q15. What is the primary purpose of the WAL (Write-Ahead Log) in Prometheus TSDB?

A) Query performance improvement B) Preventing data loss during crash recovery C) Compressed metric storage D) Remote storage synchronization

Answer: B

Explanation: The WAL writes data sequentially to disk before it is committed to the in-memory Head Block. If Prometheus crashes, it can replay the WAL to recover Head Block data. WAL segment files are 128MB by default.

Q16. Which statement about the Head Block is NOT correct?

A) It keeps the most recent data in memory B) It contains approximately the last 2 hours of data by default C) It is permanently stored on disk D) It is where newly scraped samples are first written

Answer: C

Explanation: The Head Block is an in-memory block that holds the most recent data (default 2 hours). New samples are first written to the WAL then added to the Head Block. Head Block data is periodically compacted into persistent on-disk blocks.

Q17. Which is NOT a method to reload Prometheus configuration?

A) Sending SIGHUP signal B) Calling the /-/reload HTTP endpoint C) Automatic detection after editing prometheus.yml D) Calling the API with --web.enable-lifecycle flag enabled

Answer: C

Explanation: Prometheus does not automatically detect configuration file changes. You must either send a SIGHUP signal or call the /-/reload POST endpoint with the --web.enable-lifecycle flag enabled. In Prometheus Operator environments, a config-reloader sidecar automates this.
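
The two supported reload mechanisms look like this in practice (assuming Prometheus runs locally on the default port 9090):

```shell
# Option 1: signal the process
kill -HUP $(pidof prometheus)

# Option 2: HTTP endpoint (only works with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```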

Q18. What correctly describes TSDB block compaction?

A) A process that deletes old blocks B) A process that merges multiple small blocks into a larger block C) A process that sends block data to remote storage D) A process that cleans up WAL files

Answer: B

Explanation: Compaction merges multiple smaller blocks into larger blocks to improve query efficiency. It uses level-based compaction, and tombstone-marked deletions are physically removed during merging. Vertical compaction merges blocks with overlapping time ranges.

Q19. How is data retention configured in Prometheus?

A) In the prometheus.yml configuration file B) Via --storage.tsdb.retention.time command-line flag C) Through dynamic TSDB API configuration D) Via PROMETHEUS_RETENTION environment variable

Answer: B

Explanation: Prometheus retention is configured via command-line flags. --storage.tsdb.retention.time sets time-based retention (default 15 days), and --storage.tsdb.retention.size sets size-based retention. When both are set, whichever limit is reached first applies.
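
For example, both limits can be combined on the command line (flag values are illustrative):

```shell
prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB
```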

Q20. What is the difference between scrape_interval and evaluation_interval?

A) Both are metric collection intervals B) scrape_interval is for metric collection, evaluation_interval is for rule evaluation C) scrape_interval is global, evaluation_interval is per-job D) Both values must always be identical

Answer: B

Explanation: scrape_interval is how often Prometheus collects metrics from targets (default 1 minute), while evaluation_interval is how often recording and alerting rules are evaluated (default 1 minute). They can be set independently but are typically configured to the same value.
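
The two intervals sit side by side in the global config block:

```yaml
global:
  scrape_interval: 30s       # pull metrics from targets every 30s
  evaluation_interval: 30s   # evaluate recording/alerting rules every 30s
```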

Q21. How are time series samples encoded in Prometheus storage?

A) Both timestamps and values stored as raw values B) Timestamps use delta-of-delta, values use XOR encoding C) Both compressed with gzip D) LZ4 block compression

Answer: B

Explanation: Prometheus TSDB uses compression inspired by Facebook's Gorilla paper. Timestamps use delta-of-delta encoding (most require very few bits), and values (float64) use XOR encoding (storing only differences from previous values). This achieves approximately 1.37 bytes per sample.
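
The XOR idea can be sketched in a few lines: reinterpreting two nearby float64 samples as raw bits and XORing them leaves long runs of zeros, which compress cheaply (the sample values are arbitrary):

```python
import struct

def float_bits(f: float) -> int:
    """Reinterpret a float64 as its raw 64-bit integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", f))[0]

prev, curr = 100.0, 101.0                 # two consecutive samples
xored = float_bits(prev) ^ float_bits(curr)

# Close values share sign, exponent, and leading mantissa bits,
# so the XOR starts with many zero bits to encode compactly.
print(f"{xored:064b}")
print(64 - xored.bit_length())            # leading zero bits -> 17
```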

Q22. What correctly describes Prometheus Federation?

A) It automatically replicates data between Prometheus servers B) A higher-level Prometheus scrapes specific metrics from lower-level instances C) All Prometheus instances share the same TSDB D) Metrics are propagated through Alertmanager

Answer: B

Explanation: Federation is a hierarchical structure where a global Prometheus server scrapes selected time series from local Prometheus servers via the /federate endpoint. Match parameters select only needed metrics. It is used for cross-service aggregation and global views.

Q23. Which statement about remote_write is NOT correct?

A) It sends collected samples to a remote endpoint B) It uses snappy-compressed Protocol Buffers format C) Writing to remote storage skips local TSDB storage D) It operates with queues and includes retry logic

Answer: C

Explanation: remote_write operates in parallel with local TSDB storage. Collected samples are stored in both the local TSDB and sent to the remote endpoint simultaneously. It uses snappy-compressed protobuf format with internal queues and retry mechanisms for transient failures.

Q24. What correctly describes staleness handling in Prometheus?

A) When a target disappears, its time series data is immediately deleted B) When a scrape fails, a stale marker is added marking the series as stale C) Time series are maintained permanently and never become stale D) Series are automatically marked stale after 5 minutes without new samples

Answer: B

Explanation: Since Prometheus 2.x, staleness handling was improved. When a target disappears from a scrape, a stale marker (special NaN value) is added to its time series. During queries, if a stale marker exists within the lookback delta (default 5 minutes), that series is excluded from results.

Q25. What is the role of ServiceMonitor in Prometheus Operator?

A) Automatically creating Kubernetes Services B) Declaratively defining scrape targets for Prometheus C) Defining Alertmanager routing rules D) Auto-generating Grafana dashboards

Answer: B

Explanation: ServiceMonitor is a CRD provided by Prometheus Operator that declaratively defines Prometheus scrape targets based on Kubernetes Services. It uses namespaceSelector and selector to choose target Services, and the endpoints field for port, path, and interval configuration. The Operator watches these and auto-updates Prometheus config.
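
A minimal ServiceMonitor sketch (names, labels, and the port are placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-monitor
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: api            # selects Services carrying this label
  endpoints:
    - port: http-metrics  # named port on the Service
      path: /metrics
      interval: 30s
```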

Q26. What is the common purpose of Thanos and Cortex?

A) Completely replacing Prometheus B) Providing long-term storage and horizontal scaling for Prometheus C) Providing a new query language to replace PromQL D) Replacing Alertmanager

Answer: B

Explanation: Both Thanos and Cortex (now Mimir) provide long-term storage, global view, and high availability for Prometheus. Thanos uses a sidecar pattern with object storage, while Cortex/Mimir uses a fully distributed architecture. Both support PromQL-compatible queries.

Q27. What is the purpose of the inverted index in Prometheus TSDB?

A) Sorting time series data chronologically B) Enabling fast label-based time series lookup C) Compressing metric values D) Tracking WAL file locations

Answer: B

Explanation: The TSDB inverted index maps label name-value pairs to lists of series IDs (posting lists) that contain those labels. When a PromQL query specifies label matching conditions, the inverted index enables fast lookups. Intersection and union operations handle complex label selectors.

Q28. What correctly describes Native Histograms?

A) They use the same storage format as classic histograms B) They automatically create exponential distribution buckets without predefined boundaries C) They were introduced to replace the Summary type D) They can be used in PromQL without any special functions

Answer: B

Explanation: Native Histograms (Exponential Histograms) were introduced in Prometheus 2.40. They automatically generate exponential-distribution buckets without needing predefined boundaries. This significantly reduces cardinality while enabling more accurate quantile calculations.

Q29. What does the honor_labels setting do in Prometheus?

A) Scraped metric labels take precedence over Prometheus-added labels B) Automatically normalizes label names C) Deletes all conflicting labels D) Keeps only external labels

Answer: A

Explanation: When honor_labels is true, labels already present in scraped metrics take precedence when they conflict with Prometheus server-side labels (job, instance, etc.). This is used with Federation or Pushgateway to preserve original labels.

Q30. What does scrape_timeout configure?

A) Maximum time for target discovery B) Timeout for individual scrape requests C) Timeout for alert delivery D) Timeout for PromQL query execution

Answer: B

Explanation: scrape_timeout is the timeout for individual scrape HTTP requests. The default is 10 seconds and must be less than or equal to scrape_interval. If a target does not respond within this time, the scrape is recorded as failed and the up metric becomes 0.

Domain 3: PromQL (Q31-Q53)

Q31. What is the result type of: rate(http_requests_total[5m])?

A) Scalar B) Instant Vector C) Range Vector D) String

Answer: B

Explanation: The rate() function takes a Range Vector as input and returns an Instant Vector. For each time series, it calculates the per-second average rate of increase using samples within the 5-minute range, returning the result as a single-timestamp value (Instant Vector).

Q32. What is the difference between rate() and irate()?

A) rate() is for Gauges, irate() is for Counters B) rate() calculates average rate over the range, irate() calculates instant rate from the last two samples C) irate() is always more accurate than rate() D) Both functions return identical results

Answer: B

Explanation: rate() calculates the average per-second increase rate between the first and last samples in the range. irate() uses only the last two samples for instant rate of change. rate() is recommended for alerting and recording rules, while irate() suits volatile graphs.

Q33. What describes this query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))?

A) It calculates the exact 95th percentile B) It estimates the 95th percentile using linear interpolation between bucket boundaries C) It retrieves client-side calculated quantiles D) It returns the maximum value over the last 5 minutes

Answer: B

Explanation: histogram_quantile() estimates quantiles based on cumulative bucket counts from histograms. It uses linear interpolation between bucket boundaries, so results may differ from actual values. Accuracy depends on how well bucket boundaries match the actual distribution.

Q34. What is the purpose of the offset modifier in PromQL?

A) Shifting query results into the future B) Querying data at a past point in time instead of the current moment C) Adjusting the scrape interval D) Transforming label values

Answer: B

Explanation: The offset modifier shifts the evaluation time of a query into the past. For example, http_requests_total offset 1h queries data from 1 hour ago. The @ modifier specifies an absolute epoch timestamp instead.
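
Both modifiers side by side (metric name illustrative):

```promql
# Value one hour in the past, relative to the evaluation time
http_requests_total offset 1h

# Value at an absolute Unix epoch timestamp
http_requests_total @ 1609459200

# Modifiers can also be applied to range selectors
rate(http_requests_total[5m] offset 1h)
```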

Q35. Which of the following is NOT an Aggregation Operator?

A) sum B) avg C) rate D) topk

Answer: C

Explanation: rate() is a function, not an aggregation operator. Prometheus aggregation operators include sum, avg, min, max, count, stddev, stdvar, topk, bottomk, quantile, count_values, and group.

Q36. What is the difference between the by and without clauses?

A) by keeps only specified labels, without removes specified labels B) by is for filtering, without is for aggregation C) They are identical in function D) by is for instant queries, without is for range queries

Answer: A

Explanation: The by clause keeps only the specified labels and removes the rest during aggregation. The without clause removes the specified labels and keeps the rest. For example, sum by (job)(metric) sums per job label, while sum without (instance)(metric) sums excluding the instance label.
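
The two clauses applied to the same expression (metric name illustrative):

```promql
# Keep only the job label; drop all others
sum by (job) (rate(http_requests_total[5m]))

# Drop only the instance label; keep the rest
sum without (instance) (rate(http_requests_total[5m]))
```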

Q37. What does the label_replace function do?

A) Permanently modifies labels stored in TSDB B) Transforms label values using regex to create new labels in query results C) Functions identically to relabel_configs D) Deletes labels

Answer: B

Explanation: label_replace captures parts of existing label values using regex at query time and creates new labels or modifies existing label values. This only affects query results and does not change stored data.

Q38. What is the correct syntax for a PromQL subquery?

A) rate(http_requests_total[5m])[30m:1m] B) rate(http_requests_total[5m]) subquery 30m C) subquery(rate(http_requests_total[5m]), 30m, 1m) D) rate(http_requests_total[5m]).range(30m, 1m)

Answer: A

Explanation: Subqueries use [range:resolution] after an instant vector expression. This example evaluates the rate() result over a 30-minute range at 1-minute resolution. If resolution is omitted, the global evaluation_interval is used. Subqueries serve as input to range vector functions like max_over_time.

Q39. What does this query mean: http_requests_total unless http_errors_total?

A) Subtracts http_errors_total values from http_requests_total B) Returns only http_requests_total series that do not match http_errors_total C) Returns the intersection of both metrics D) Conditionally returns http_requests_total

Answer: B

Explanation: "unless" is a set operator that removes series from the left vector that have matching labels in the right vector. So it returns http_requests_total series that have no matching label set in http_errors_total. "and" (intersection) and "or" (union) are also set operators.

Q40. What correctly describes the increase() function?

A) It calculates the increase of Gauge values B) It returns the total increase of a Counter over the specified period C) It uses a completely different calculation method from rate() D) Its result is always an integer

Answer: B

Explanation: increase() returns the total increase of a Counter within the specified time range. Internally, it performs the same calculation as rate() multiplied by the time range in seconds. Results may not be integers due to extrapolation at the range boundaries.

Q41. What are the roles of the on and ignoring keywords in vector matching?

A) on matches only on specified labels, ignoring matches while excluding specified labels B) on is for filtering, ignoring is for sorting C) Both are only used in aggregation operations D) on applies to the left vector, ignoring to the right vector

Answer: A

Explanation: In binary operations, the "on" keyword matches vectors using only the specified labels. The "ignoring" keyword matches vectors while excluding the specified labels. This is similar to by/without but used for vector matching rather than aggregation.

Q42. What is the purpose of group_left and group_right?

A) Grouping time series for display B) Allowing many-to-one or one-to-many vector matching C) Moving labels left or right D) Sorting query results

Answer: B

Explanation: Default vector matching is one-to-one. group_left allows a single element from the right vector to match multiple elements from the left vector (many-to-one). group_right is the reverse. This enables operations between metrics with different cardinalities.

Q43. What correctly describes the predict_linear() function?

A) It can only be used with Counter types B) It predicts future values based on linear regression of Gauge data C) It uses machine learning algorithms D) It predicts the maximum value in a range

Answer: B

Explanation: predict_linear() uses simple linear regression to predict future values of a Gauge time series. It is commonly used for capacity planning alerts like disk space or certificate expiration. For example, predict_linear(node_filesystem_avail_bytes[6h], 24*3600) predicts the value 24 hours from now based on a 6-hour trend.
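
As a sketch, a disk-fill alert built on predict_linear might look like this (thresholds, names, and the fstype filter are illustrative):

```yaml
groups:
  - name: capacity
    rules:
      - alert: DiskWillFillIn24Hours
        # Linear extrapolation of the 6h trend crosses zero within 24h
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem predicted to fill within 24 hours"
```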

Q44. What is the purpose of the absent() function?

A) Finding time series with a value of 0 B) Returning a value of 1 for non-existent time series C) Deleting time series D) Filtering NaN values

Answer: B

Explanation: absent() returns a single-element vector with value 1 when the input vector is empty (the time series does not exist). If the series exists, it returns an empty vector. It is primarily used in alerts to detect when metrics disappear. absent_over_time() is the range version.

Q45. What does the resets() function measure?

A) Number of times a Gauge reaches 0 B) Number of Counter resets (decreases) C) Number of scrape failures D) Number of alert resolutions

Answer: B

Explanation: resets() returns the number of times a Counter value decreased (reset) within the range. Counter resets occur during application restarts. While rate() and increase() auto-compensate for resets, resets() is useful for monitoring the frequency of resets themselves.

Q46. Which of the following is NOT a range vector function?

A) rate() B) avg_over_time() C) abs() D) delta()

Answer: C

Explanation: abs() takes an instant vector and returns the absolute value of each sample. rate(), avg_over_time(), and delta() are all range vector functions that require a range selector argument in brackets.

Q47. What does the bool modifier do in PromQL?

A) Returns 0 or 1 instead of filtering for comparison operations B) Enables logical operations C) Queries boolean-type metrics D) Creates true/false alerts

Answer: A

Explanation: By default, comparison operators filter out non-matching series. The bool modifier instead returns 1 for matches and 0 for non-matches. For example, http_requests_total > bool 100 returns 1 if the value is above 100 and 0 otherwise.

Q48. What does the changes() function measure?

A) Number of Counter increases B) Number of times the time series value changed C) Number of label changes D) Number of configuration changes

Answer: B

Explanation: changes() returns the number of value changes in a time series within the specified range. It is primarily used with Gauge-type metrics to track the frequency of value fluctuations - for example, detecting how often a configuration value or version number has changed.

Q49. What correctly describes the deriv() function?

A) It calculates the derivative of a Counter B) It calculates per-second rate of change for Gauges using linear regression C) It performs the same calculation as rate() D) It calculates discrete derivatives

Answer: B

Explanation: deriv() uses simple linear regression to calculate the per-second rate of change (derivative) of a Gauge time series. While rate() is Counter-specific, deriv() is used for Gauges. It is useful for understanding overall trends in noisy data.

Q50. What potential issue exists with this query: sum(rate(http_requests_total[5m])) by (status_code)?

A) rate() cannot be used with sum() B) None - this is a correct query C) The by clause must come before sum D) An error occurs if the status_code label does not exist

Answer: B

Explanation: This query is correct. rate() calculates the per-second rate, and sum by (status_code) aggregates by status code. The by clause can appear before or after sum(). If the status_code label does not exist, it simply aggregates into a single group rather than producing an error.

Q51. What does the le label mean in histogram_quantile?

A) less than or equal - the upper boundary value of the bucket B) label expression - a label filtering expression C) level - the depth level of the histogram D) length - the length of observed values

Answer: A

Explanation: "le" stands for "less than or equal to" and represents the upper boundary of a histogram bucket. For example, a bucket with le="0.5" contains the cumulative count of observations at or below 0.5. The top bucket is le="+Inf", and histogram_quantile() uses these le labels for interpolation.
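
Bucket series for one histogram might be exposed as follows (counts invented); note that each count is cumulative up to its le boundary:

```text
http_request_duration_seconds_bucket{le="0.1"}  240
http_request_duration_seconds_bucket{le="0.5"}  310
http_request_duration_seconds_bucket{le="1"}    326
http_request_duration_seconds_bucket{le="+Inf"} 330
http_request_duration_seconds_count 330
http_request_duration_seconds_sum 52.4
```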

Q52. What do clamp_min() and clamp_max() do?

A) Limit the time range of series B) Set lower and upper bounds for sample values C) Limit the number of labels D) Limit the number of time series in results

Answer: B

Explanation: clamp_min(v, min) clamps all sample values to a minimum of min, and clamp_max(v, max) clamps to a maximum of max. clamp(v, min, max) applies both simultaneously. Useful for limiting abnormal spikes in graphs or preventing negative values.

Q53. What does the @ modifier do in: http_requests_total @ 1609459200?

A) Sets the metric value to that number B) Queries data at that Unix timestamp C) Compares requests per second to that value D) Adds that value as a label

Answer: B

Explanation: The @ modifier evaluates the query at a specific Unix epoch timestamp. While offset shifts time relative to the current moment, @ specifies an absolute point in time. 1609459200 corresponds to January 1, 2021 00:00:00 UTC.

Domain 4: Instrumentation and Exporters (Q54-Q67)

Q54. What must you be careful about when using Counter metrics in client libraries?

A) Values can be decreased B) Negative values can be set C) Only Inc() and Add() are available - values cannot be decreased D) An initial value must always be set

Answer: C

Explanation: Counter is a monotonically increasing metric type that only allows Inc() (increment by 1) and Add(positive_value). Attempting to decrease the value causes a panic. Counter resets only occur on process restart, and rate()/increase() auto-compensate for them.
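
The contract can be sketched with a toy counter in plain Python (an illustration of the rule, not the real client library):

```python
class Counter:
    """Toy counter enforcing the monotonic-increase contract."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            # Real client libraries likewise reject decreases (e.g. panic in Go).
            raise ValueError("counters can only increase")
        self.value += amount

c = Counter()
c.inc()          # +1
c.inc(4.5)       # add an arbitrary positive amount
print(c.value)   # 5.5
try:
    c.inc(-1)
except ValueError:
    print("decrease rejected")
```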

Q55. Which metric is NOT provided by Node Exporter?

A) node_cpu_seconds_total B) node_memory_MemTotal_bytes C) node_disk_io_time_seconds_total D) container_cpu_usage_seconds_total

Answer: D

Explanation: container_cpu_usage_seconds_total is a container-level metric provided by cAdvisor, not Node Exporter. Node Exporter provides host-level hardware and OS metrics (CPU, memory, disk, network, etc.). cAdvisor is embedded in kubelet and provides container metrics.

Q56. What is the primary purpose of Blackbox Exporter?

A) Collecting internal metrics from blackbox servers B) Monitoring endpoints via HTTP, TCP, ICMP, and DNS probes C) Blackbox testing of file systems D) Decrypting encrypted metrics

Answer: B

Explanation: Blackbox Exporter monitors service availability and response times externally through HTTP(S), TCP, ICMP, DNS, and gRPC probes. It enables blackbox monitoring without internal instrumentation. It provides metrics like probe_success and probe_duration_seconds.
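
A typical Blackbox Exporter scrape config uses relabeling to pass the probed endpoint as a parameter while scraping the exporter itself (addresses and the module name are placeholders):

```yaml
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]            # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com     # the endpoint to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the target as ?target=
      - source_labels: [__param_target]
        target_label: instance           # keep the probed URL as instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # actually scrape the exporter
```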

Q57. When is Pushgateway appropriate to use?

A) Collecting metrics from long-running services B) Collecting metrics from short-lived batch jobs C) General-purpose metric collection D) Replacing service discovery

Answer: B

Explanation: Pushgateway is an intermediate gateway for collecting metrics from short-lived batch jobs that terminate before Prometheus can scrape them. Jobs push their result metrics to the Pushgateway, which Prometheus then scrapes. Direct scraping is recommended for long-running services.

Q58. What are the required methods for the Collector interface when writing a custom Exporter?

A) Collect() only B) Describe() only C) Describe() and Collect() D) Init() and Collect()

Answer: C

Explanation: The Prometheus Go client's Collector interface requires implementing both Describe() and Collect(). Describe() sends metric descriptors to a channel, and Collect() sends current metric values to a channel. Implementing this interface enables custom collection logic in Exporters.

Q59. Which metric naming convention is correct?

A) Use dashes (-) to separate words B) Use CamelCase C) Use snake_case with units as suffixes D) Use short names without prefixes

Answer: C

Explanation: Prometheus metric naming convention uses snake_case with units as suffixes. Examples: http_request_duration_seconds, node_memory_MemTotal_bytes. Prefixes indicate the namespace (e.g., prometheus_, node_), _total is for Counters, and _bytes/_seconds indicate units.

Q60. Which is a valid metric name?

A) http-request-duration B) HttpRequestDuration C) http_request_duration_seconds D) http.request.duration

Answer: C

Explanation: Prometheus metric names must match the regex pattern [a-zA-Z_:][a-zA-Z0-9_:]*. Dashes (-) and dots (.) are not allowed. Convention uses snake_case, _total suffix for Counters, and base units (seconds, bytes, etc.) as suffixes.

Q61. What is the key difference between Summary and Histogram?

A) Summary calculates quantiles server-side, Histogram calculates client-side B) Summary calculates quantiles client-side, Histogram estimates quantiles server-side C) They are completely identical D) Only Summary supports labels

Answer: B

Explanation: Summary calculates quantiles directly in the client application and exposes them. This is accurate but quantiles cannot be aggregated across instances. Histogram exposes bucket counts and quantiles are estimated server-side with histogram_quantile(). Histogram is more flexible and aggregatable, making it generally recommended.
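
To make the server-side estimation concrete, a hedged PromQL sketch (assuming a histogram metric named http_request_duration_seconds) that estimates p95 latency from the bucket counters:

```promql
histogram_quantile(
  0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Aggregating the bucket rates by le before applying histogram_quantile() is what makes Histograms aggregatable across instances, which a Summary's pre-computed quantiles are not.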

Q62. Which is NOT a best practice for label usage in instrumentation?

A) Using low-cardinality label values B) Adding user IDs as labels C) Using HTTP methods (GET, POST, etc.) as labels D) Using status codes as labels

Answer: B

Explanation: User IDs have extremely high cardinality and should never be used as labels. High label cardinality causes exponential time series growth, severely impacting memory and performance. HTTP methods and status codes have limited values and are appropriate labels.

Q63. What is the default metrics endpoint path for Exporters?

A) /api/v1/metrics B) /metrics C) /prometheus D) /export

Answer: B

Explanation: The standard metrics endpoint path in the Prometheus ecosystem is /metrics. Exporters and instrumented applications expose metrics in Prometheus Exposition Format at this path. If a different path is used, the metrics_path must be specified in the scrape config.

Q64. What is the recommendation when defining Histogram buckets?

A) Define as many buckets as possible B) Define bucket boundaries aligned with service SLOs C) Use identical buckets for all services D) Define boundaries only on a log scale

Answer: B

Explanation: Histogram buckets should be defined based on the service's SLOs and expected distribution. For example, if the SLO is sub-500ms response, set buckets at 0.1, 0.25, 0.5, 1.0. Too many buckets increase cardinality, while too few reduce quantile accuracy.

Q65. Where does the process_cpu_seconds_total metric come from?

A) Node Exporter B) Default process metrics from Prometheus client libraries C) cAdvisor D) Kube-state-metrics

Answer: B

Explanation: process_cpu_seconds_total is a default process metric automatically collected by Prometheus client libraries. Most client libraries (Go, Python, Java, etc.) automatically expose process CPU time, memory, open file descriptor count, and more.

Q66. What characterizes the metrics provided by kube-state-metrics?

A) Node hardware resource usage B) Kubernetes API object state information C) Container CPU/memory usage D) Network traffic metrics

Answer: B

Explanation: kube-state-metrics watches the Kubernetes API server and converts Kubernetes object states into metrics for Deployments, Pods, Nodes, Jobs, etc. Examples: kube_deployment_spec_replicas, kube_pod_status_phase. Resource usage comes from cAdvisor/kubelet, and hardware metrics from Node Exporter.

Q67. Which is a correct Counter use case?

A) Current memory usage B) Current active connections C) Total HTTP requests processed D) CPU temperature

Answer: C

Explanation: Counter is used for monotonically increasing cumulative values. Total HTTP requests processed continuously increases, making Counter appropriate. Memory usage, active connections, and CPU temperature fluctuate up and down, requiring Gauge instead.

Domain 5: Alerting and Dashboarding (Q68-Q80)

Q68. What is the primary purpose of Alertmanager's grouping feature?

A) Sorting alerts chronologically B) Bundling similar alerts into one notification to reduce alert fatigue C) Classifying alerts by severity D) Compressing alert data

Answer: B

Explanation: Alertmanager grouping bundles similar alerts into a single group based on group_by labels. For example, alerts from hundreds of instances firing simultaneously are consolidated into one alert group notification. This significantly reduces alert fatigue.

Q69. What correctly describes Alertmanager inhibition?

A) Temporarily stopping all alerts B) Automatically suppressing related lower-level alerts when specific alerts fire C) Limiting alert frequency D) Merging duplicate alerts

Answer: B

Explanation: Inhibition is a rule that suppresses target alerts when a source alert is active. For example, a cluster-wide failure alert can inhibit individual service failure alerts. Relationships are defined with source_matchers, target_matchers, and equal labels.
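
A hedged sketch of an inhibit rule (the severity values and the cluster/service label names are assumptions): a firing critical alert suppresses warning alerts that carry the same cluster and service labels:

```yaml
inhibit_rules:
  - source_matchers:
      - severity="critical"      # when a critical alert is firing...
    target_matchers:
      - severity="warning"       # ...suppress matching warning alerts
    equal: [cluster, service]    # but only if these labels match between the two
```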

Q70. What does the "for" field in alerting rules do?

A) The interval for repeating alert notifications B) The wait time before transitioning to firing state after condition is met C) The wait time before resolving alerts D) The alert evaluation interval

Answer: B

Explanation: The "for" field is the wait time (pending period) between when the alert condition is met and when the alert transitions to firing state. The condition must remain true throughout this period. It is used to prevent false positive alerts from transient spikes.
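
A minimal alerting-rule sketch showing the for field in context (the metric name, threshold, and label values are assumptions):

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                 # stay pending for 10m before transitioning to firing
        labels:
          severity: critical
        annotations:
          summary: "Error ratio above 5% for 10 minutes"
```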

Q71. What is the primary purpose of Recording Rules?

A) Recording metric data externally B) Pre-computing frequently used complex queries for performance improvement C) Recording alert history D) Logging scraping results

Answer: B

Explanation: Recording Rules periodically pre-compute frequently used PromQL expressions and store them as new time series. This reduces dashboard loading times and eliminates repeated execution costs of complex queries. The naming convention is level:metric:operations (e.g., job:http_requests_total:rate5m).
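
A hedged recording-rule sketch using the naming convention from the explanation above (the evaluation interval is an assumption):

```yaml
groups:
  - name: precomputed
    interval: 60s                # evaluation interval for this group (assumed)
    rules:
      - record: job:http_requests_total:rate5m   # level:metric:operations
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards can then query job:http_requests_total:rate5m directly instead of re-evaluating the rate() expression on every panel refresh.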

Q72. What is the difference between Alertmanager silence and inhibition?

A) They are the same feature B) Silence manually pauses specific alerts, inhibition is rule-based automatic suppression C) Silence is permanent, inhibition is temporary D) Silence is in config files, inhibition is UI-only

Answer: B

Explanation: Silence is manually created by administrators to mute alerts matching specific label conditions for a defined period (e.g., during maintenance). Inhibition is configured in the config file and automatically suppresses alerts based on rules. Silences are managed via the Alertmanager UI or API.

Q73. What do group_wait, group_interval, and repeat_interval control?

A) All control alert sending frequency B) group_wait is initial wait, group_interval is group update interval, repeat_interval is resend interval C) All three values must always be identical D) Only group_wait is required

Answer: B

Explanation: group_wait is the wait time to collect additional alerts before sending the first notification for a new group (default 30s). group_interval is the interval for sending updates when new alerts join an existing group (default 5m). repeat_interval is the interval for resending unchanged alerts (default 4h).
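
These three timers sit on a route; a hedged sketch with the defaults written out explicitly (the receiver name and group_by labels are assumptions):

```yaml
route:
  receiver: default-receiver     # assumed receiver name
  group_by: [alertname, cluster]
  group_wait: 30s                # wait before the first notification of a new group
  group_interval: 5m             # wait before notifying about new alerts in an existing group
  repeat_interval: 4h            # wait before re-sending an unchanged notification
```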

Q74. What query language is used by default when configuring Prometheus as a Grafana data source?

A) SQL B) PromQL C) LogQL D) InfluxQL

Answer: B

Explanation: When using a Prometheus data source in Grafana, queries are written in PromQL. You can type PromQL directly in Grafana's query editor or use the builder mode to construct queries via GUI. LogQL is for Loki, and InfluxQL is for InfluxDB.

Q75. What correctly describes the Alertmanager routing tree?

A) All alerts are sent to all receivers B) Alerts are routed to appropriate receivers based on label matching C) Routing operates on time-based rules only D) The routing tree depth is limited to 2 levels

Answer: B

Explanation: The Alertmanager routing tree is a hierarchical structure starting from the root route and branching into child routes based on alert label matchers (matchers, or the older match/match_re form). Without the continue option, evaluation stops at the first matching route; with continue: true, sibling routes are also checked.
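
A hedged sketch of a small routing tree (the team label and receiver names are assumptions) showing matchers and continue:

```yaml
route:
  receiver: default-team         # root route: fallback receiver
  routes:
    - matchers:
        - team="db"
      receiver: db-pager
      continue: true             # also evaluate the sibling routes below
    - matchers:
        - severity="critical"
      receiver: oncall-pager
```

With continue: true on the first child, a critical database alert would be delivered to both db-pager and oncall-pager.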

Q76. Which is NOT a principle of good alerting rules?

A) Writing symptom-based alerts B) Creating alerts for every metric C) Creating only actionable alerts D) Using "for" clause to filter transient spikes

Answer: B

Explanation: Creating alerts for every metric leads to extreme alert fatigue. Good alerts should be symptom-based (user impact, not causes), actionable (receivers can take action), and have appropriate thresholds and "for" clauses. Symptom-based alerting is preferred over cause-based alerting.

Q77. How does Alertmanager High Availability (HA) clustering work?

A) Leader election with only one active instance B) Gossip protocol synchronizes alert state to prevent duplicate notifications C) State is shared via external database D) Load balancer distributes requests

Answer: B

Explanation: Alertmanager HA clusters use a gossip protocol based on HashiCorp's memberlist library. Each instance synchronizes notification logs and silences with its peers. All instances receive every alert, but the synchronized notification log ensures the same alert is sent only once.

Q78. What is the correct naming convention format for Recording Rules?

A) record_metric_operation B) level:metric:operations C) metric.level.operation D) METRIC_LEVEL_OPERATION

Answer: B

Explanation: The recommended naming convention is level:metric:operations. Level indicates aggregation level (job, instance, etc.), metric is the source metric name, and operations are the applied functions and aggregations. Example: job:http_requests_total:rate5m. Colons (:) are reserved for recording rules.

Q79. What is the difference between Prometheus alerts and Grafana alerts?

A) They are completely identical B) Prometheus alerts are evaluated by the Prometheus server, Grafana alerts by the Grafana server C) Only Grafana alerts use Alertmanager D) Prometheus alerts are for visualization only

Answer: B

Explanation: Prometheus alerting rules are evaluated periodically by the Prometheus server's rule manager, sending to Alertmanager when conditions are met. Grafana alerts are evaluated by the Grafana server, querying data sources. Prometheus alerts are closer to the data and thus more reliable and recommended.

Q80. Which is NOT a data field available in Alertmanager templates?

A) .Status (firing/resolved) B) .Labels (alert labels) C) .Annotations (alert annotations) D) .Query (full original PromQL query)

Answer: D

Explanation: Alertmanager templates can use fields like .Status, .Labels, .Annotations, .StartsAt, .EndsAt, and .GeneratorURL. The full original PromQL query is not directly provided. The .GeneratorURL contains a link to the Prometheus query for indirect access.

6. Conclusion

PromQL carries the highest weight at 28% on the PCA exam. Focus especially on rate(), histogram_quantile(), vector matching, and aggregation operators. Understanding the Prometheus architecture and TSDB internals is also important, along with Alertmanager routing, grouping, and inhibition mechanisms.

Exam Preparation Tips:

  • Thoroughly read the official documentation and practice PromQL queries in a lab environment
  • Test various queries on the Prometheus Demo site
  • Practice writing Alertmanager configuration files
  • Clearly understand the differences between Recording Rules and Alerting Rules