Prometheus in Production: TSDB, Cardinality, Recording Rules, Federation, and Remote Write

Introduction

Prometheus is easy to install and much harder to operate well. Many teams begin with a few exporters and dashboards, then run into the same production problems:

  • disk usage grows rapidly as the TSDB expands
  • label design creates high-cardinality explosions
  • dashboards and alerts repeatedly execute the same expensive queries
  • cross-cluster and cross-region aggregation becomes necessary
  • long-term retention forces an architecture choice between federation and remote write

Prometheus operations are not about collecting as many metrics as possible. They are about deciding which metrics deserve retention, which labels are safe, which queries should be precomputed, and which storage model fits the real operating need.

TSDB operations: retention is a cost policy

Prometheus’s built-in TSDB is operationally simple, but it makes retention design critical. If retention is vague, the system eventually pays through disk pressure, longer restart times, and harder incident recovery.

The first two questions should always be:

  • how many days of local data does this Prometheus need
  • which data, if any, must move to long-term external storage

In many production environments, local Prometheus is best treated as a short-retention, high-performance operational store, while long-term history moves elsewhere.

Note that retention is configured with command-line flags rather than in prometheus.yml:

--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=100GB

Operational checks should include:

  • whether retention time or retention size is the actual limiting factor
  • whether WAL and block compaction fit the available disk budget
  • how long replay takes after restart
  • whether disk IOPS and compression match the current ingest profile
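These checks map onto metrics Prometheus exposes about itself. A minimal self-monitoring alert group might look like the following sketch; the metric names come from the server's own /metrics endpoint, while the alert names and thresholds are illustrative and should be tuned to the local disk budget:

```yaml
groups:
  - name: tsdb-health
    rules:
      # In-memory (head) series count; a sustained climb usually means
      # a label explosion rather than legitimate growth.
      - alert: TSDBHeadSeriesHigh
        expr: prometheus_tsdb_head_series > 2e6
        for: 15m
      # On-disk block size approaching the disk budget
      # (the 200 GiB threshold here is a placeholder).
      - alert: TSDBBlocksLarge
        expr: prometheus_tsdb_storage_blocks_bytes > 200 * 1024 * 1024 * 1024
        for: 30m
```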

TSDB disk pressure is often not just a storage problem. It is a signal that the metric set or label model is already too expensive.

Cardinality is a design quality issue

Why high cardinality breaks systems

In Prometheus, the cost of a metric is driven by all label combinations. Labels such as user_id, session_id, and request_id create unbounded series growth and are rarely appropriate for general-purpose metrics.

As cardinality rises, several things degrade at once:

  • memory usage
  • query latency
  • disk usage
  • compaction time
  • rule evaluation cost

Practical operating rules

  • never add unbounded identifiers as labels
  • keep only labels that are actually used for alerting, routing, or dashboard filtering
  • do not blindly ingest every exporter metric
  • require label review during service onboarding

One of the highest-leverage cost optimizations in Prometheus is not scaling hardware. It is removing labels that do not create operational value.
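These operating rules can be enforced mechanically at scrape time with metric_relabel_configs. The sketch below assumes a hypothetical job that exposes an unbounded user_id label and some exporter defaults nobody uses:

```yaml
scrape_configs:
  - job_name: api                  # hypothetical job name
    static_configs:
      - targets: ['api:9090']      # placeholder target
    metric_relabel_configs:
      # Drop the unbounded identifier label from every scraped series.
      - action: labeldrop
        regex: user_id
      # Drop exporter metrics that no alert or dashboard consumes.
      - source_labels: [__name__]
        regex: 'go_gc_.*'
        action: drop
```

One caution: labeldrop merges the series the label used to distinguish, which can produce duplicate-sample errors if two series collide after relabeling. Fixing the label at the exporter is the cleaner long-term solution; relabeling is the stopgap.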

Recording rules and alerting rules should be designed separately

Recording rules reduce repeated query cost

If dashboards and alerts repeatedly run the same heavy query, the cost multiplies across users and evaluation loops. Recording rules let you precompute the expensive part once.

groups:
  - name: service-latency
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:rate5m
        expr: sum by (job) (rate(http_request_duration_seconds_count[5m]))

Recording rules are especially useful when:

  • many dashboards reuse the same expensive expression
  • SLI calculations need standardization
  • downstream systems or federation layers should consume a simpler derived series
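For SLI standardization, the same pattern yields a shared error-ratio series that every dashboard and alert can reuse instead of re-evaluating the raw expression. The rule name below follows the level:metric:operations convention but is otherwise illustrative:

```yaml
groups:
  - name: service-sli
    interval: 30s
    rules:
      # Fraction of requests returning 5xx over the last 5 minutes, per job.
      - record: job:http_requests:error_ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
```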

Alerting rules should reflect operating intent

Good alerts are not defined by complex PromQL. They are defined by clear meaning, stable duration, and a response path.

groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          runbook_url: https://runbooks.example.com/HighErrorRate  # placeholder

Good alert rules typically include:

  • a stable, understandable signal
  • a for duration that filters transient noise
  • annotation text with response context
  • a runbook link
  • bounded label fanout

Without these guardrails, teams often create alerts that are technically correct but operationally noisy.

Federation and remote write solve different problems

When federation fits

Federation is useful when an upper-level Prometheus only needs selected aggregated metrics from lower-level Prometheus servers.

It fits well when:

  • multiple clusters need to feed summary metrics upward
  • regional or cluster-local Prometheus instances should stay independent
  • the central tier needs rollups rather than the full raw series set
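In practice, the upper tier scrapes the lower tier's /federate endpoint and uses match[] selectors to pull only the aggregated series. A sketch, with placeholder target addresses:

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull only aggregated series produced by recording rules.
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - prometheus-cluster-a:9090   # placeholder lower-tier servers
          - prometheus-cluster-b:9090
```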

When remote write fits

Remote write is the better fit when the goal is external retention or centralized queryability.

It fits well when:

  • long-term retention is required
  • multi-tenant storage is needed
  • global analytics and cross-environment queries matter
  • local Prometheus should keep short retention for operational performance

These patterns are often mentioned together, but they are not interchangeable. Federation is mainly about hierarchical scraping of selected data. Remote write is mainly about external storage and scale.
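A remote_write block pairs naturally with short local retention, and write_relabel_configs can limit what leaves the box. The endpoint URL below is a placeholder:

```yaml
remote_write:
  - url: https://metrics.example.com/api/v1/write
    # Forward only aggregated series; raw series stay local.
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'job:.*'
        action: keep
    queue_config:
      max_samples_per_send: 5000
```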

What teams should document

Prometheus is not just a binary with a config file. It is an operating system for metric economics. Teams should explicitly document:

1. Metric onboarding policy

  • which metric types and labels are allowed
  • who reviews label additions
  • which exporter defaults are kept or dropped

2. Rule ownership

  • who owns recording rules and alert rules
  • what evaluation intervals are acceptable
  • whether all alerts require runbook references

3. Storage policy

  • local retention target
  • remote write destination and purpose
  • acceptable metric loss during failure events

4. Prometheus SLOs

  • query latency expectations
  • scrape success rate
  • rule evaluation success rate
  • disk usage ceilings

A practical operations checklist

The most useful recurring checks are:

  1. Is the scrape failure rate rising?
  2. Did the time-series count jump after a recent deployment?
  3. Which dashboards and alerts run the most expensive queries?
  4. Can those queries be replaced by recording rules?
  5. Is local retention increasing restart time or disk pressure?
  6. Are federation and remote write being used for clearly different purposes?
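Several of these checks reduce to self-monitoring expressions that can run as alerts. The thresholds below are illustrative starting points, not recommendations:

```yaml
groups:
  - name: prometheus-self-checks
    rules:
      # Check 1: scrape failure rate across all targets.
      - alert: ScrapeFailureRateHigh
        expr: 1 - avg(up) > 0.05
        for: 10m
      # Check 2: series count jumped sharply, e.g. after a deployment.
      - alert: SeriesCountJump
        expr: |
          prometheus_tsdb_head_series
            > 1.5 * (prometheus_tsdb_head_series offset 1h)
        for: 30m
```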

Prometheus becomes easier to scale when teams treat it as a controlled metric economy instead of an unlimited telemetry bucket.

Closing thoughts

Good Prometheus operations are built on four habits:

  • prohibit high-cardinality labels unless there is a very strong reason
  • prefer recording rules for repeated expensive logic
  • attach for durations and runbooks to alerts
  • separate the role of federation from the role of remote write

Well-operated Prometheus does not mean collecting more metrics. It means keeping better metrics at lower cost with clearer operating intent.
