Prometheus in Production: TSDB, Cardinality, Recording Rules, Federation, and Remote Write

Author: Youngju Kim (@fjvbn20031)

Introduction
Prometheus is easy to install and much harder to operate well. Many teams begin with a few exporters and dashboards, then run into the same production problems:
- disk usage grows rapidly as the TSDB expands
- label design creates high-cardinality explosions
- dashboards and alerts repeatedly execute the same expensive queries
- cross-cluster and cross-region aggregation becomes necessary
- long-term retention forces an architecture choice between federation and remote write
Prometheus operations are not about collecting as many metrics as possible. They are about deciding which metrics deserve retention, which labels are safe, which queries should be precomputed, and which storage model fits the real operating need.
TSDB operations: retention is a cost policy
Prometheus’s built-in TSDB is operationally simple, but it makes retention design critical. If retention is vague, the system eventually pays through disk pressure, longer restart times, and harder incident recovery.
The first two questions should always be:
- how many days of local data does this Prometheus need?
- which data, if any, must move to long-term external storage?
In many production environments, local Prometheus is best treated as a short-retention, high-performance operational store, while long-term history moves elsewhere.
```shell
# Retention is set via a startup flag (it is not a prometheus.yml field):
prometheus --storage.tsdb.retention.time=15d
```
Operational checks should include:
- whether retention time or retention size is the actual limiting factor
- whether WAL and block compaction fit the available disk budget
- how long replay takes after restart
- whether disk IOPS and compression match the current ingest profile
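These checks can be grounded in Prometheus's own self-monitoring metrics. As an illustrative sketch, a few queries against the built-in TSDB instrumentation:

```promql
# Active series in the head block (a rough memory-pressure proxy)
prometheus_tsdb_head_series

# On-disk size of persisted blocks, in bytes
prometheus_tsdb_storage_blocks_bytes

# WAL corruptions should stay at zero
rate(prometheus_tsdb_wal_corruptions_total[1h])
```

Trending these over time makes it clear whether disk growth is driven by retention length or by series-count growth.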
TSDB disk pressure is often not just a storage problem. It is a signal that the metric set or label model is already too expensive.
Cardinality is a design quality issue
Why high cardinality breaks systems
In Prometheus, the cost of a metric is driven by all label combinations. Labels such as user_id, session_id, and request_id create unbounded series growth and are rarely appropriate for general-purpose metrics.
As cardinality rises, several things degrade at once:
- memory usage
- query latency
- disk usage
- compaction time
- rule evaluation cost
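A common way to see where cardinality is going is to count series per metric name. This instant query is itself expensive, so run it ad hoc rather than on a dashboard refresh loop:

```promql
# Top 10 metric names by active series count
topk(10, count by (__name__) ({__name__=~".+"}))
```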
Practical operating rules
- never add unbounded identifiers as labels
- keep only labels that are actually used for alerting, routing, or dashboard filtering
- do not blindly ingest every exporter metric
- require label review during service onboarding
One of the highest-leverage cost optimizations in Prometheus is not scaling hardware. It is removing labels that do not create operational value.
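As a sketch, unsafe labels and unused exporter series can be dropped at scrape time with `metric_relabel_configs`; the job name, label name, and metric pattern below are hypothetical placeholders:

```yaml
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api:9090"]
    metric_relabel_configs:
      # Drop an unbounded identifier label before ingestion.
      - action: labeldrop
        regex: request_id
      # Drop exporter metrics nobody alerts or dashboards on.
      - source_labels: [__name__]
        action: drop
        regex: "go_gc_duration_seconds.*"
```

Because relabeling runs before samples reach the TSDB, dropped labels and metrics never incur storage or query cost.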
Recording rules and alerting rules should be designed separately
Recording rules reduce repeated query cost
If dashboards and alerts repeatedly run the same heavy query, the cost multiplies across users and evaluation loops. Recording rules let you precompute the expensive part once.
```yaml
groups:
  - name: service-latency
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:rate5m
        expr: sum by (job) (rate(http_request_duration_seconds_count[5m]))
```
Recording rules are especially useful when:
- many dashboards reuse the same expensive expression
- SLI calculations need standardization
- downstream systems or federation layers should consume a simpler derived series
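Dashboards and alerts can then query the cheap derived series instead of re-evaluating the rate over raw data; the `job` value here is a hypothetical example:

```promql
# Reads the precomputed series produced by the recording rule
job:http_request_duration_seconds:rate5m{job="checkout"}
```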
Alerting rules should reflect operating intent
Good alerts are not defined by complex PromQL. They are defined by clear meaning, stable duration, and a response path.
```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job) > 0.05
        for: 10m
```
Good alert rules typically include:
- a stable, understandable signal
- a `for` duration that filters transient noise
- annotation text with response context
- a runbook link
- bounded label fanout
Without these guardrails, teams often create alerts that are technically correct but operationally noisy.
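Applied to the HighErrorRate rule, these guardrails look roughly like the following sketch; the severity value, summary wording, and runbook URL are placeholders:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} 5xx ratio above 5% for 10 minutes"
          runbook_url: "https://runbooks.example.com/high-error-rate"
```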
Federation and remote write solve different problems
When federation fits
Federation is useful when an upper-level Prometheus only needs selected aggregated metrics from lower-level Prometheus servers.
It fits well when:
- multiple clusters need to feed summary metrics upward
- regional or cluster-local Prometheus instances should stay independent
- the central tier needs rollups rather than the full raw series set
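A minimal federation scrape, assuming the lower-level server exposes recording-rule rollups with a `job:` prefix (the target hostname is hypothetical):

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"job:.*"}'   # pull only precomputed rollups upward
    static_configs:
      - targets: ["prometheus-cluster-a:9090"]
```

The `match[]` selector is the key lever: federating `{__name__=~".+"}` recreates the full raw series set centrally and defeats the purpose of the hierarchy.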
When remote write fits
Remote write is the better fit when the goal is external retention or centralized queryability.
It fits well when:
- long-term retention is required
- multi-tenant storage is needed
- global analytics and cross-environment queries matter
- local Prometheus should keep short retention for operational performance
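A remote write sketch, assuming a hypothetical external endpoint and forwarding only the series worth keeping long term:

```yaml
remote_write:
  - url: "https://metrics.example.com/api/v1/write"
    queue_config:
      max_samples_per_send: 2000
    write_relabel_configs:
      # Keep only derived rollups and target health (illustrative selection).
      - source_labels: [__name__]
        action: keep
        regex: "job:.*|up"
```

`write_relabel_configs` runs before samples leave the server, so the same cardinality discipline applied at scrape time can also be applied at the egress boundary.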
These patterns are often mentioned together, but they are not interchangeable. Federation is mainly about hierarchical scraping of selected data. Remote write is mainly about external storage and scale.
What teams should document
Prometheus is not just a binary with a config file. It is an operating system for metric economics. Teams should explicitly document:
1. Metric onboarding policy
- which metric types and labels are allowed
- who reviews label additions
- which exporter defaults are kept or dropped
2. Rule ownership
- who owns recording rules and alert rules
- what evaluation intervals are acceptable
- whether all alerts require runbook references
3. Storage policy
- local retention target
- remote write destination and purpose
- acceptable metric loss during failure events
4. Prometheus SLOs
- query latency expectations
- scrape success rate
- rule evaluation success rate
- disk usage ceilings
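These SLOs map onto Prometheus's own instrumentation; a few illustrative queries using its built-in metrics:

```promql
# Scrape success rate across all targets
avg(up)

# Rule evaluation failures (should stay at zero)
sum(rate(prometheus_rule_evaluation_failures_total[5m]))

# 99th percentile of internal query engine latency, per evaluation stage
prometheus_engine_query_duration_seconds{quantile="0.99"}
```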
A practical operations checklist
The most useful recurring checks are:
- Is the scrape failure rate rising?
- Did the time-series count jump after a recent deployment?
- Which dashboards and alerts run the most expensive queries?
- Can those queries be replaced by recording rules?
- Is local retention increasing restart time or disk pressure?
- Are federation and remote write being used for clearly different purposes?
Prometheus becomes easier to scale when teams treat it as a controlled metric economy instead of an unlimited telemetry bucket.
Closing thoughts
Good Prometheus operations are built on four habits:
- prohibit high-cardinality labels unless there is a very strong reason
- prefer recording rules for repeated expensive logic
- attach `for` durations and runbooks to alerts
- separate the role of federation from the role of remote write
Well-operated Prometheus does not mean collecting more metrics. It means keeping better metrics at lower cost with clearer operating intent.