Ingress Observability — Metrics, Access Logs, and Tracing

Introduction

The Ingress controller sits at the edge of your cluster, the single point through which all external traffic flows. If you cannot see what happens there, it is dangerously easy to blame backend applications for latency or 5xx errors that actually originate at the edge. In real incident response, the first question is almost always: "Is this stuck at the Ingress, or is the backend slow?"

Answering that within seconds requires observability at the Ingress layer. Observability is not merely "we attached Prometheus." It means the three pillars (metrics, logs, traces) are connected through consistent labels, so you can spot an anomaly on a dashboard and drill straight down to the logs and traces for the offending requests.

This article uses ingress-nginx as the reference. We walk through defining golden signals, exposing Prometheus metrics, building Grafana dashboards with key PromQL, emitting structured JSON access logs and shipping them to Loki, wiring up OpenTelemetry distributed tracing, writing alerting rules, doing per-ingress and per-path analysis, capacity planning, and finally a practical troubleshooting workflow. As of 2026 the Ingress API itself is frozen (no new features) and Gateway API has become the successor standard, but observability principles apply equally to both, so we close with the differences under Gateway API.

Golden Signals: What to Measure

The four golden signals from the Google SRE book map almost directly onto the Ingress layer.

| Golden signal | Meaning at the Ingress layer | Representative metric |
|---|---|---|
| Traffic | Incoming HTTP requests per second | requests per second (RPS) |
| Errors | Ratio of 5xx/4xx responses | 5xx ratio, 4xx ratio |
| Latency | Distribution of request handling time | p50/p90/p99 request duration |
| Saturation | Controller resource and connection limits | active connections, CPU, reload frequency |

We add two Ingress-specific signals. First is bandwidth — request/response byte counts that catch large downloads or abnormal traffic. Second is the gap between upstream (backend) latency and total latency — subtracting backend response time from total handling time reveals time spent inside the controller itself; when that grows, the controller is the bottleneck.

The key point is not "error rate is high" but being able to decompose it down to "which ingress, which path, routing to which backend, has the elevated error rate." That is why every metric needs ingress/service/path labels.

ingress-nginx Prometheus Metrics

The ingress-nginx controller has a built-in Prometheus metrics endpoint. With the Helm chart, enabling metrics and the ServiceMonitor in the values is the starting point.

controller:
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"
    serviceMonitor:
      enabled: true
      namespace: monitoring
      additionalLabels:
        release: kube-prometheus-stack
      scrapeInterval: 30s

The main time series exposed on the metrics port (default 10254) are:

| Metric name | Type | Meaning |
|---|---|---|
| nginx_ingress_controller_requests | counter | Requests handled (labels: status, method, host, ingress, service, path) |
| nginx_ingress_controller_request_duration_seconds | histogram | Total request handling time |
| nginx_ingress_controller_response_duration_seconds | histogram | Time spent receiving the response from the upstream (backend) |
| nginx_ingress_controller_request_size | histogram | Request bytes |
| nginx_ingress_controller_response_size | histogram | Response bytes |
| nginx_ingress_controller_nginx_process_connections | gauge | Connections by state (active/reading/writing/waiting) |
| nginx_ingress_controller_config_last_reload_successful | gauge | Whether the last reload succeeded |
| nginx_ingress_controller_config_last_reload_success_timestamp_seconds | gauge | Timestamp of the last successful reload |
| nginx_ingress_controller_ssl_expire_time_seconds | gauge | Certificate expiry time |

Of these, the requests counter and the request_duration histogram cover most of the golden signals. For histograms, bucket boundaries matter. If the default buckets do not match your application's latency distribution, p99 becomes inaccurate, so a low-latency API may need finer buckets.
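To see why bucket boundaries matter, here is a minimal Python sketch of the linear interpolation idea behind histogram_quantile. It is an illustration, not Prometheus's actual implementation; the function name and the sample bucket layouts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Assumes the last bucket has upper bound +Inf, as Prometheus histograms do.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread uniformly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# 1000 requests that all took about 12 ms, recorded with two hypothetical bucket layouts.
coarse = [(0.005, 0), (0.1, 1000), (math.inf, 1000)]
fine = [(0.005, 0), (0.01, 0), (0.025, 1000), (0.1, 1000), (math.inf, 1000)]

p99_coarse = histogram_quantile(0.99, coarse)
p99_fine = histogram_quantile(0.99, fine)
```

With the coarse layout the estimated p99 is about 99 ms; with the finer layout it is about 25 ms. Neither is exact, but the finer buckets land much closer to the real 12 ms.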

Grafana Dashboards and Key PromQL

The first screen of the dashboard should always be the four golden signals. Below is PromQL you can drop straight into panels.

Request rate (RPS), broken down per ingress.

sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)

5xx error ratio (relative to total). The "> 0" filter on the denominator drops ingresses with no traffic, which guards against dividing by zero.

sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress)
/
(sum(rate(nginx_ingress_controller_requests[5m])) by (ingress) > 0)

p99 latency (histogram quantile). You must preserve the le label for the quantile calculation.

histogram_quantile(
  0.99,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le, ingress)
)

Overlaying p50/p90/p99 in one panel makes rising tail latency obvious. If p50 is stable but p99 spikes, suspect a few slow backends, GC pauses, or connection-pool exhaustion.

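Upstream latency versus controller overhead. Quantiles are not additive, so read the difference between the two p99 values as a rough indicator, not as the exact p99 of the overhead.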

histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le))
-
histogram_quantile(0.99, sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])) by (le))

Active connections (saturation).

sum(nginx_ingress_controller_nginx_process_connections) by (state)

Reload frequency — frequent reloads cause connection drops and latency spikes.

changes(nginx_ingress_controller_config_last_reload_success_timestamp_seconds[15m])

Bandwidth (response bytes per second).

sum(rate(nginx_ingress_controller_response_size_sum[5m])) by (ingress)

Dashboard layout tip: put cluster-wide golden signals at the top, a per-ingress table in the middle (RPS, error ratio, p99 on one row), and controller health metrics like reloads, connections, and certificate expiry at the bottom. Make ingress a template variable so per-ingress drill-down becomes natural.

Structured Access Logs and Shipping to Loki

Metrics tell you "what went wrong," but logs tell you "which request went wrong." The default nginx log format is space-separated text that is awkward to parse. Switching to structured JSON logs enables field-based queries in Loki or Elasticsearch.

Change the ingress-nginx log format in the ConfigMap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  log-format-escape-json: "true"
  # All continuation lines share one indentation level so the folded scalar
  # collapses into a single-line log format (more-indented lines keep their newlines).
  log-format-upstream: >-
    {"time": "$time_iso8601",
    "remote_addr": "$remote_addr",
    "x_forwarded_for": "$proxy_add_x_forwarded_for",
    "request_method": "$request_method",
    "host": "$host",
    "uri": "$uri",
    "status": $status,
    "request_time": $request_time,
    "upstream_addr": "$upstream_addr",
    "upstream_response_time": "$upstream_response_time",
    "upstream_status": "$upstream_status",
    "request_length": $request_length,
    "bytes_sent": $bytes_sent,
    "namespace": "$namespace",
    "ingress_name": "$ingress_name",
    "service_name": "$service_name",
    "trace_id": "$opentelemetry_trace_id"}

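The key here is recording both request_time (total time) and upstream_response_time (backend time). Their difference approximates the time spent outside the backend: controller processing plus the time to read the request from, and write the response to, the client. Embedding trace_id in the log also enables jumping from a log line to its trace (log-to-trace).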
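As a rough illustration of that subtraction, the Python sketch below computes the non-backend time for one JSON access-log line. The helper name controller_overhead is hypothetical; the field names match the log format above. It assumes nginx's convention of logging "-" when no upstream was contacted and separating multiple upstream attempts with commas or colons.

```python
import json


def controller_overhead(log_line):
    """Return request_time minus total upstream time for one JSON access-log line.

    Returns None when no upstream time was recorded (nginx logs "-").
    """
    record = json.loads(log_line)
    raw = str(record.get("upstream_response_time", "-"))
    # Retries and internal redirects produce several values, e.g. "0.100, 0.120".
    parts = [p.strip() for p in raw.replace(":", ",").split(",")]
    times = [float(p) for p in parts if p and p != "-"]
    if not times:
        return None
    return float(record["request_time"]) - sum(times)


sample = '{"request_time": 0.250, "upstream_response_time": "0.100, 0.120"}'
overhead = controller_overhead(sample)  # roughly 0.03 seconds
```

A consistently large value here points at the controller or the client connection rather than the backend.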

When shipping to Loki via Promtail or Grafana Alloy, parse the JSON and promote status, namespace, and ingress_name to labels. To avoid label-cardinality explosions, keep high-cardinality fields like trace_id or remote_addr inside the log line rather than as labels.

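scrape_configs:
  - job_name: ingress-nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: ingress-nginx
          # Assumed log location for the controller container; adjust to your nodes.
          __path__: /var/log/pods/ingress-nginx_*/controller/*.log
    pipeline_stages:
      # Unwrap the container runtime (CRI) line format before parsing the JSON body.
      - cri: {}
      # Extract the fields we want to work with from the JSON log line.
      - json:
          expressions:
            status: status
            namespace: namespace
            ingress_name: ingress_name
            request_time: request_time
      # Promote only low-cardinality fields to Loki labels.
      - labels:
          status:
          namespace:
          ingress_name: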

A LogQL query that filters for 5xx responses that took longer than one second:

{job="ingress-nginx"} | json | status >= 500
  | request_time > 1.0
  | line_format "{{.host}}{{.uri}} {{.status}} {{.request_time}}s"
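The query filters but does not rank. For offline analysis of exported log lines, a small Python sketch like the following applies the same filter and sorts the matches by request_time. The function name slowest_5xx and the sample lines are hypothetical; the field names come from the log format above.

```python
import json


def slowest_5xx(lines, min_seconds=1.0, limit=10):
    """Return up to `limit` 5xx records slower than min_seconds, slowest first."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as controller startup messages
        if not isinstance(record, dict):
            continue
        if record.get("status", 0) >= 500 and record.get("request_time", 0) > min_seconds:
            matches.append(record)
    matches.sort(key=lambda r: r["request_time"], reverse=True)
    return matches[:limit]


sample_lines = [
    '{"host": "shop.example", "uri": "/checkout", "status": 502, "request_time": 3.2}',
    '{"host": "shop.example", "uri": "/cart", "status": 200, "request_time": 4.0}',
    '{"host": "shop.example", "uri": "/pay", "status": 504, "request_time": 60.0}',
    'plain text line that is not JSON',
]
worst = slowest_5xx(sample_lines)
```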

Wiring Up OpenTelemetry Distributed Tracing

Once metrics and logs have narrowed things down to a "slow request," tracing shows "in which service and which span that request spent its time." ingress-nginx ships built-in OpenTelemetry support and can make the controller the first span (the root or edge span) of a trace.

Enable OpenTelemetry in the ConfigMap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  enable-opentelemetry: "true"
  otel-sampler: "TraceIdRatioBased"
  otel-sampler-ratio: "0.1"
  otlp-collector-host: "otel-collector.observability.svc"
  otlp-collector-port: "4317"
  otel-service-name: "ingress-nginx"

The sampling ratio (sampler-ratio) directly drives cost. Tracing every request is expensive, so most teams start at 1 to 10 percent and use tail-based sampling at the collector to retain 100 percent of error or slow requests.
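The retention policy described here can be sketched in a few lines of Python. This is a simplified illustration of the decision logic only, not the OpenTelemetry Collector's tail-sampling implementation; the function name, the thresholds, and the modulo-based ratio check are assumptions made for the example.

```python
def keep_trace(trace_id_hex, had_error, duration_s, ratio=0.1, slow_threshold_s=1.0):
    """Decide whether to retain a finished trace.

    Errors and slow requests are always kept; everything else is kept at
    roughly `ratio`, decided deterministically from the trace ID so every
    component reaches the same verdict for the same trace.
    """
    if had_error or duration_s >= slow_threshold_s:
        return True
    return int(trace_id_hex, 16) % 10_000 < ratio * 10_000


# Hypothetical trace IDs (32 hex characters each).
fast_ok_kept = keep_trace("0" * 31 + "5", had_error=False, duration_s=0.05)
fast_ok_dropped = keep_trace("0" * 28 + "1388", had_error=False, duration_s=0.05)
error_kept = keep_trace("0" * 28 + "1388", had_error=True, duration_s=0.05)
```

Deciding after the trace has finished is what makes it possible to keep every error and every slow request while paying full price for only a fraction of the healthy traffic.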

The crucial piece is context propagation. The controller must forward the W3C traceparent header to the backend so backend spans attach to the same trace. When OpenTelemetry is enabled, ingress-nginx injects and propagates traceparent automatically, so as long as the backend application uses the same standard, you get an end-to-end trace.
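For reference, a traceparent value has four dash-separated hex fields: version, trace ID, parent span ID, and flags. The Python sketch below parses the version 00 layout so you can check what a backend actually receives. The function name parse_traceparent is hypothetical, and the strict four-field regex is a simplification that rejects anything not shaped like version 00.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Parse a W3C traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The trace_id returned here is the same value that appears in the access log's trace_id field, which is what makes the log-to-trace jump possible.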

Unifying the labeling convention that links traces, metrics, and logs is the final piece of the puzzle. If the log's trace_id, the trace's service.name, and the metric's ingress label are wired together via cross-datasource correlation in Grafana, you can move from an error-rate spike on the dashboard, to the logs for that window, to the trace of the failing request, in three clicks.

Alerting Rule Examples

Dashboards require a human to look at them; alerts wake you when nobody is looking. Here are example Prometheus alerting rules.

groups:
  - name: ingress-nginx.rules
    rules:
      - alert: IngressHigh5xxRate
        expr: |
          sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress)
          /
          sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Ingress 5xx ratio exceeds 5 percent"
          description: "Per-ingress 5xx ratio has been above 5 percent for over 5 minutes."

      - alert: IngressHighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le, ingress)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Ingress p99 latency exceeds 1 second"

      - alert: IngressConfigReloadFailed
        expr: nginx_ingress_controller_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Ingress configuration reload failed"

      - alert: IngressCertExpiringSoon
        expr: |
          (nginx_ingress_controller_ssl_expire_time_seconds - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expires within 14 days"

The design principle is symptom-based alerting. "CPU is high" is less page-worthy than "users are getting 5xx." Reload failures and certificate expiry are worth catching separately, because even if users are not affected yet, they soon will be.

Per-ingress / Per-path Analysis

As operations scale, "overall error rate" loses meaning. When a single controller serves dozens of ingresses, a single broken ingress or path can be buried in the average.

ingress-nginx supports metrics-per-host and a path label. The path label can be high-cardinality, so enable it mainly for static paths and normalize paths that contain dynamic IDs.
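One way to normalize dynamic paths, for example when grouping access-log lines by path, is to replace ID-like segments with placeholders. The Python sketch below is a minimal version under that assumption; the function name normalize_path and the ":id" and ":uuid" placeholders are hypothetical choices, and real routes may need additional rules.

```python
import re

_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path):
    """Collapse numeric and UUID path segments so paths group into a bounded set."""
    segments = []
    # Drop the query string, then inspect each path segment.
    for segment in path.split("?")[0].split("/"):
        if segment.isdigit():
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)


orders = normalize_path("/orders/12345/items")
user = normalize_path("/users/123e4567-e89b-12d3-a456-426614174000?tab=profile")
```

After normalization, thousands of distinct URIs collapse into a handful of route shapes, which keeps both label cardinality and per-path tables manageable.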

PromQL to decompose errors by path:

topk(10,
  sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, path)
)

This immediately surfaces diagnoses like "only the /checkout path on the payment ingress is throwing 5xx." Filtering the same path in LogQL to see the actual error messages then connects the metric's "what" to the log's "why."

Capacity Planning

Observability data drives proactive planning, not just reactive firefighting. The inputs to controller capacity planning are:

  • Peak RPS and its growth trend (weekly/monthly regression)
  • CPU cost per request — controller CPU usage divided by RPS
  • Concurrent active connections and the worker connection limit
  • TLS handshake cost (especially when keep-alive is short)
  • Reload frequency and the instantaneous load per reload

For example, if peak RPS is growing 30 percent per quarter and the CPU cost per request stays constant, growth compounds to roughly 1.69 times today's peak two quarters out (1.3 squared), and the required replica count scales by the same factor. However, reload load and TLS cost are nonlinear, so verify the real limits periodically with load testing (for example k6 or vegeta). A conservative rule is to treat saturation metrics (active connections, CPU) crossing 70 percent as the scale-out trigger.
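The arithmetic behind that estimate fits in a short Python sketch. The function name replicas_needed and all the input numbers (10,000 peak RPS, 3,000 RPS per replica) are hypothetical; the 70 percent target mirrors the scale-out rule above, and the model assumes a constant CPU cost per request.

```python
import math


def replicas_needed(peak_rps, growth_per_quarter, quarters, rps_per_replica,
                    target_utilization=0.7):
    """Estimate the replica count that keeps each replica under the target utilization."""
    # Growth compounds quarter over quarter.
    projected_rps = peak_rps * (1 + growth_per_quarter) ** quarters
    usable_rps_per_replica = rps_per_replica * target_utilization
    return math.ceil(projected_rps / usable_rps_per_replica)


today = replicas_needed(10_000, 0.30, quarters=0, rps_per_replica=3_000)
in_two_quarters = replicas_needed(10_000, 0.30, quarters=2, rps_per_replica=3_000)
```

With these assumed inputs the estimate moves from 5 replicas today to 9 in two quarters. Treat the per-replica figure as something to measure with load tests, not a constant.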

Troubleshooting Workflow

With observability in place, incident response follows a consistent flow.

1. Alert fires (e.g., IngressHigh5xxRate, ingress=payment)
2. Check dashboard — which golden signal broke?
   ├─ Error rate up, latency normal  → suspect backend 5xx
   ├─ Latency up, error rate normal  → backend slow or controller saturated
   └─ Reload count spiking           → frequent deploys/config changes
3. Decompose per-ingress/path — which path is the cause?
4. Pull 5xx logs for that path via LogQL — check upstream_status, request_time
   ├─ upstream_status 5xx     → backend application problem
   ├─ upstream_addr empty     → no endpoints (503), check service/selector
   └─ request_time large      → move to traces
5. Look up the distributed trace by trace_id — where was time spent?
6. Confirm root cause → fix → verify recovery on the dashboard
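
Step 4 of the workflow can be expressed as a small decision function over one parsed access-log record. The Python sketch below is illustrative only: the function name triage, the one-second threshold, and the returned strings are hypothetical, and the field names follow the JSON log format defined earlier.

```python
def triage(record, slow_threshold_s=1.0):
    """Classify one 5xx access-log record according to the workflow's step 4."""
    upstream_addr = str(record.get("upstream_addr", "")).strip()
    if upstream_addr in ("", "-"):
        return "no endpoints: check Service and selector"
    # Several upstream attempts are logged as a comma- or colon-separated list.
    raw_status = str(record.get("upstream_status", ""))
    statuses = [s.strip() for s in raw_status.replace(":", ",").split(",")]
    if any(s.startswith("5") for s in statuses):
        return "backend application problem"
    if float(record.get("request_time", 0)) > slow_threshold_s:
        return "slow request: inspect the trace"
    return "no obvious cause in this record"


no_endpoints = triage({"status": 503, "upstream_addr": "-", "upstream_status": "-"})
backend_error = triage({"status": 502, "upstream_addr": "10.0.0.7:8080",
                        "upstream_status": "502", "request_time": 0.2})
slow = triage({"status": 504, "upstream_addr": "10.0.0.7:8080",
               "upstream_status": "-", "request_time": 60.0})
```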

The value of this workflow is replacing guesswork with data. Instead of "it's probably the backend," you point at the backend because upstream_response_time accounts for 95 percent of request_time.

Observability in the Gateway API Era

As of 2026 the Ingress API is frozen and Gateway API is the successor standard. The good news from an observability standpoint is that the principles carry over directly. Gateway implementations (Envoy-based ones such as Contour, Istio, and Cilium, as well as NGINX Gateway Fabric and others) expose the same golden signals; the Envoy-based ones draw on Envoy's rich statistics and native OpenTelemetry support.

The difference is richer label dimensions. Because Gateway API is a three-tier model (GatewayClass to Gateway to HTTPRoute), metrics naturally carry gateway, route, and backend labels, making per-route analysis more precise than ingress-nginx. Envoy-based data planes also provide extra signals like circuit breaking and outlier detection. If you build proper golden signals, structured logs, and tracing on ingress-nginx today, you can reuse nearly all of your dashboard and alerting concepts when you migrate to Gateway API.

Conclusion

The heart of Ingress observability is not the tools but the connections. Only when the flow is seamless — detect an anomaly via metrics, narrow scope per ingress/path, find the cause in structured logs, and pinpoint the span via traces — can you answer "Ingress or backend?" within seconds.

Start small. First stand up metrics and a golden-signal dashboard, then ship JSON access logs to Loki, and finally embed trace_id in logs to connect tracing. The moment all three pillars are tied together by the same labels, the Ingress stops being a black box and becomes your most trustworthy first point of diagnosis.
