Prometheus in Production: TSDB, Cardinality, Recording Rules, Federation, and Remote Write

Introduction

Prometheus is easy to install and much harder to operate well. Many teams begin with a few exporters and dashboards, then run into the same production problems:

  • disk usage grows rapidly as the TSDB expands
  • label design creates high-cardinality explosions
  • dashboards and alerts repeatedly execute the same expensive queries
  • cross-cluster and cross-region aggregation becomes necessary
  • long-term retention forces an architecture choice between federation and remote write

Prometheus operations are not about collecting as many metrics as possible. They are about deciding which metrics deserve retention, which labels are safe, which queries should be precomputed, and which storage model fits the real operating need.

TSDB operations: retention is a cost policy

Prometheus’s built-in TSDB is operationally simple, but it makes retention design critical. If retention is vague, the system eventually pays through disk pressure, longer restart times, and harder incident recovery.

The first two questions should always be:

  • how many days of local data does this Prometheus need
  • which data, if any, must move to long-term external storage

In many production environments, local Prometheus is best treated as a short-retention, high-performance operational store, while long-term history moves elsewhere.

Note that retention is configured with command-line flags rather than in prometheus.yml:

--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=100GB

Operational checks should include:

  • whether retention time or retention size is the actual limiting factor
  • whether WAL and block compaction fit the available disk budget
  • how long replay takes after restart
  • whether disk IOPS and compression match the current ingest profile
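These checks map onto metrics Prometheus exposes about itself. A minimal self-monitoring alert group might look like the following sketch; the metric names come from the server's own /metrics endpoint, while the alert names and thresholds are illustrative and should be tuned to the local disk budget:

```yaml
groups:
  - name: tsdb-health
    rules:
      # In-memory (head) series count; a sustained climb usually means
      # a label explosion rather than legitimate growth.
      - alert: TSDBHeadSeriesHigh
        expr: prometheus_tsdb_head_series > 2e6
        for: 15m
      # On-disk block size approaching the disk budget
      # (the 200 GiB threshold here is a placeholder).
      - alert: TSDBBlocksLarge
        expr: prometheus_tsdb_storage_blocks_bytes > 200 * 1024 * 1024 * 1024
        for: 30m
```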

TSDB disk pressure is often not just a storage problem. It is a signal that the metric set or label model is already too expensive.

Cardinality is a design quality issue

Why high cardinality breaks systems

In Prometheus, the cost of a metric is driven by all label combinations. Labels such as user_id, session_id, and request_id create unbounded series growth and are rarely appropriate for general-purpose metrics.

As cardinality rises, several things degrade at once:

  • memory usage
  • query latency
  • disk usage
  • compaction time
  • rule evaluation cost

Practical operating rules

  • never add unbounded identifiers as labels
  • keep only labels that are actually used for alerting, routing, or dashboard filtering
  • do not blindly ingest every exporter metric
  • require label review during service onboarding

One of the highest-leverage cost optimizations in Prometheus is not scaling hardware. It is removing labels that do not create operational value.
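These operating rules can be enforced mechanically at scrape time with metric_relabel_configs. The sketch below assumes a hypothetical job that exposes an unbounded user_id label and some exporter defaults nobody uses:

```yaml
scrape_configs:
  - job_name: api                  # hypothetical job name
    static_configs:
      - targets: ['api:9090']      # placeholder target
    metric_relabel_configs:
      # Drop the unbounded identifier label from every scraped series.
      - action: labeldrop
        regex: user_id
      # Drop exporter metrics that no alert or dashboard consumes.
      - source_labels: [__name__]
        regex: 'go_gc_.*'
        action: drop
```

One caution: labeldrop merges the series the label used to distinguish, which can produce duplicate-sample errors if two series collide after relabeling. Fixing the label at the exporter is the cleaner long-term solution; relabeling is the stopgap.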

Recording rules and alerting rules should be designed separately

Recording rules reduce repeated query cost

If dashboards and alerts repeatedly run the same heavy query, the cost multiplies across users and evaluation loops. Recording rules let you precompute the expensive part once.

groups:
  - name: service-latency
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:rate5m
        expr: sum by (job) (rate(http_request_duration_seconds_count[5m]))

Recording rules are especially useful when:

  • many dashboards reuse the same expensive expression
  • SLI calculations need standardization
  • downstream systems or federation layers should consume a simpler derived series
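For SLI standardization, the same pattern yields a shared error-ratio series that every dashboard and alert can reuse instead of re-evaluating the raw expression. The rule name below follows the level:metric:operations convention but is otherwise illustrative:

```yaml
groups:
  - name: service-sli
    interval: 30s
    rules:
      # Fraction of requests returning 5xx over the last 5 minutes, per job.
      - record: job:http_requests:error_ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
```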

Alerting rules should reflect operating intent

Good alerts are not defined by complex PromQL. They are defined by clear meaning, stable duration, and a response path.

groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          runbook_url: https://runbooks.example.com/HighErrorRate  # placeholder

Good alert rules typically include:

  • a stable, understandable signal
  • a for duration that filters transient noise
  • annotation text with response context
  • a runbook link
  • bounded label fanout

Without these guardrails, teams often create alerts that are technically correct but operationally noisy.

Federation and remote write solve different problems

When federation fits

Federation is useful when an upper-level Prometheus only needs selected aggregated metrics from lower-level Prometheus servers.

It fits well when:

  • multiple clusters need to feed summary metrics upward
  • regional or cluster-local Prometheus instances should stay independent
  • the central tier needs rollups rather than the full raw series set
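In practice, the upper tier scrapes the lower tier's /federate endpoint and uses match[] selectors to pull only the aggregated series. A sketch, with placeholder target addresses:

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull only aggregated series produced by recording rules.
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - prometheus-cluster-a:9090   # placeholder lower-tier servers
          - prometheus-cluster-b:9090
```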

When remote write fits

Remote write is the better fit when the goal is external retention or centralized queryability.

It fits well when:

  • long-term retention is required
  • multi-tenant storage is needed
  • global analytics and cross-environment queries matter
  • local Prometheus should keep short retention for operational performance

These patterns are often mentioned together, but they are not interchangeable. Federation is mainly about hierarchical scraping of selected data. Remote write is mainly about external storage and scale.
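A remote_write block pairs naturally with short local retention, and write_relabel_configs can limit what leaves the box. The endpoint URL below is a placeholder:

```yaml
remote_write:
  - url: https://metrics.example.com/api/v1/write
    # Forward only aggregated series; raw series stay local.
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'job:.*'
        action: keep
    queue_config:
      max_samples_per_send: 5000
```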

What teams should document

Prometheus is not just a binary with a config file. It is an operating system for metric economics. Teams should explicitly document:

1. Metric onboarding policy

  • which metric types and labels are allowed
  • who reviews label additions
  • which exporter defaults are kept or dropped

2. Rule ownership

  • who owns recording rules and alert rules
  • what evaluation intervals are acceptable
  • whether all alerts require runbook references

3. Storage policy

  • local retention target
  • remote write destination and purpose
  • acceptable metric loss during failure events

4. Prometheus SLOs

  • query latency expectations
  • scrape success rate
  • rule evaluation success rate
  • disk usage ceilings

A practical operations checklist

The most useful recurring checks are:

  1. Is the scrape failure rate rising?
  2. Did the time-series count jump after a recent deployment?
  3. Which dashboards and alerts run the most expensive queries?
  4. Can those queries be replaced by recording rules?
  5. Is local retention increasing restart time or disk pressure?
  6. Are federation and remote write being used for clearly different purposes?
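Several of these checks reduce to self-monitoring expressions that can run as alerts. The thresholds below are illustrative starting points, not recommendations:

```yaml
groups:
  - name: prometheus-self-checks
    rules:
      # Check 1: scrape failure rate across all targets.
      - alert: ScrapeFailureRateHigh
        expr: 1 - avg(up) > 0.05
        for: 10m
      # Check 2: series count jumped sharply, e.g. after a deployment.
      - alert: SeriesCountJump
        expr: |
          prometheus_tsdb_head_series
            > 1.5 * (prometheus_tsdb_head_series offset 1h)
        for: 30m
```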

Prometheus becomes easier to scale when teams treat it as a controlled metric economy instead of an unlimited telemetry bucket.

Closing thoughts

Good Prometheus operations are built on four habits:

  • prohibit high-cardinality labels unless there is a very strong reason
  • prefer recording rules for repeated expensive logic
  • attach for durations and runbooks to alerts
  • separate the role of federation from the role of remote write

Well-operated Prometheus does not mean collecting more metrics. It means keeping better metrics at lower cost with clearer operating intent.
