Istio Observability Internals: Metrics, Tracing, Logging

Introduction

Observability is one of the core values of a service mesh. Istio automatically generates metrics, distributed tracing, and access logging without any application code changes.

This post analyzes how Istio generates telemetry data, what processing occurs in the Envoy filter chain, and how it integrates with external systems.

Envoy Statistics System

Statistics Types

Envoy generates three types of statistics:

| Type      | Description                      | Example                              |
|-----------|----------------------------------|--------------------------------------|
| Counter   | Monotonically increasing value   | Total requests, total errors         |
| Gauge     | Value that can increase/decrease | Active connections, pending requests |
| Histogram | Distribution of values           | Request latency, response size       |
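The semantics of the three stat types can be sketched as minimal Python classes. This is an illustrative model of the invariants (counters never decrease, histograms bucket observations), not Envoy's actual implementation:

```python
class Counter:
    """Monotonically increasing value (e.g. total requests)."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        assert n >= 0, "counters never decrease"
        self.value += n

class Gauge:
    """Value that can go up and down (e.g. active connections)."""
    def __init__(self):
        self.value = 0
    def add(self, n):
        self.value += n

class Histogram:
    """Distribution of observed values (e.g. request latency)."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)             # upper bucket bounds
        self.buckets = [0] * (len(bounds) + 1)   # last slot is +Inf
    def observe(self, v):
        for i, b in enumerate(self.bounds):
            if v <= b:
                self.buckets[i] += 1
                return
        self.buckets[-1] += 1

requests = Counter(); requests.inc()
conns = Gauge(); conns.add(1); conns.add(-1)
latency = Histogram([1, 5, 10]); latency.observe(3)
print(requests.value, conns.value, latency.buckets)  # 1 0 [0, 1, 0, 0]
```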

Statistics Generation in the Filter Chain

Request flow and statistics generation points:

Listener (connection stats)
HTTP Connection Manager (request stats)
    ├── JWT Authn Filter → auth success/failure counters
    ├── RBAC Filter → authz allow/deny counters
    ├── Fault Filter → injected delay/abort counters
    ├── Stats Filter (istio.stats) → Istio standard metrics generation
    └── Router Filter → upstream request statistics
Cluster (upstream stats)
Endpoint (connection/request stats)

Istio Standard Metrics

Core HTTP Metrics

istio_requests_total (Counter)

Core metric tracking request counts:

istio_requests_total{
  reporter="source",                    # or "destination"
  source_workload="frontend",
  source_workload_namespace="prod",
  source_principal="spiffe://cluster.local/ns/prod/sa/frontend",
  destination_workload="reviews",
  destination_workload_namespace="prod",
  destination_principal="spiffe://cluster.local/ns/prod/sa/reviews",
  destination_service="reviews.prod.svc.cluster.local",
  destination_service_name="reviews",
  destination_service_namespace="prod",
  request_protocol="http",
  response_code="200",
  response_flags="-",
  connection_security_policy="mutual_tls"
}

istio_request_duration_milliseconds (Histogram)

Request processing time distribution:

istio_request_duration_milliseconds_bucket{
  ...,  # same labels as above
  le="1"
} 100
istio_request_duration_milliseconds_bucket{le="5"} 250
istio_request_duration_milliseconds_bucket{le="10"} 380
istio_request_duration_milliseconds_bucket{le="25"} 450
istio_request_duration_milliseconds_bucket{le="50"} 490
istio_request_duration_milliseconds_bucket{le="100"} 498
istio_request_duration_milliseconds_bucket{le="+Inf"} 500
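Prometheus's `histogram_quantile()` estimates percentiles from these cumulative buckets by linear interpolation inside the bucket containing the target rank. A hand calculation over the sample data above, as a sketch of the algorithm (it skips the edge cases PromQL handles, such as the lowest-bucket lower bound):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs,
    interpolating linearly inside the bucket, like PromQL's function."""
    rank = q * buckets[-1][1]        # target cumulative count
    lower, prev = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return lower         # cannot interpolate into +Inf
            return lower + (le - lower) * (rank - prev) / (count - prev)
        lower, prev = le, count

buckets = [(1, 100), (5, 250), (10, 380), (25, 450),
           (50, 490), (100, 498), (float("inf"), 500)]
print(histogram_quantile(0.50, buckets))  # 5.0
print(histogram_quantile(0.99, buckets))  # 81.25
```

For p99: the target rank is 495, which falls in the (50, 100] bucket, so the estimate is 50 + 50 × (495 − 490) / (498 − 490) = 81.25 ms.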

istio_request_bytes / istio_response_bytes (Histogram)

Tracks request/response size distribution.

TCP Metrics

istio_tcp_sent_bytes_total         # Total bytes sent
istio_tcp_received_bytes_total     # Total bytes received
istio_tcp_connections_opened_total # Connections opened
istio_tcp_connections_closed_total # Connections closed

Metric Generation Location: Source vs Destination

Frontend Pod                    Reviews Pod
[App][Envoy]  ──────→  [Envoy][App]
         │                    │
    reporter="source"    reporter="destination"
    (outbound side)      (inbound side)

Both Envoys generate metrics for the same request, distinguished by the reporter label: "source" reflects the client-side (outbound) perspective, while "destination" reflects the server-side (inbound) perspective.

Telemetry API v2

Architecture Evolution

Istio 1.x (Mixer-based):
App → Envoy → Mixer → Prometheus/Zipkin
              (separate service, high latency)

Istio 1.12+ (Telemetry API v2):
App → Envoy (built-in Stats/Trace filters) → Prometheus/Zipkin
      (in-proxy processing, low latency)

After Mixer removal, metric generation is performed directly inside the Envoy proxy.

Telemetry Resource

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system # mesh-wide application
spec:
  # Metrics configuration
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
            mode: CLIENT_AND_SERVER
          tagOverrides:
            request_host:
              operation: UPSERT
              value: 'request.host'

  # Tracing configuration
  tracing:
    - providers:
        - name: zipkin
      randomSamplingPercentage: 1.0
      customTags:
        environment:
          literal:
            value: 'production'

  # Access logging configuration
  accessLogging:
    - providers:
        - name: envoy
      filter:
        expression: 'response.code >= 400'

Metric Customization

# Disable a specific metric
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_BYTES
          disabled: true

# Add a custom tag to a metric
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            custom_tag:
              operation: UPSERT
              value: "request.headers['x-custom-tag']"

Distributed Tracing

Trace Propagation Mechanism

What Envoy does automatically:
├── Extract trace headers from inbound requests
├── Create spans and record timing
├── Send spans to trace collector
└── Add trace headers to outbound requests

What the application must do:
└── Copy trace headers from inbound to outbound requests
    (without this, traces are broken)
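In application code, the propagation step is simply copying a fixed set of headers from the inbound request to every outbound call. A minimal sketch (the header names are the real B3/W3C/Envoy ones; the dict-based request representation is hypothetical):

```python
# Headers Envoy expects the application to forward unchanged.
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",   # W3C TraceContext
    "tracestate",
]

def extract_trace_headers(inbound_headers):
    """Pick out the trace headers to copy onto outbound requests."""
    return {k: v for k, v in inbound_headers.items()
            if k.lower() in TRACE_HEADERS}

inbound = {
    "x-b3-traceid": "463ac35c9f6413ad48485a3953bb6124",
    "x-b3-spanid": "a2fb4a1d1a96d312",
    "x-b3-sampled": "1",
    "user-agent": "curl/7.68.0",   # not a trace header: dropped
}
print(extract_trace_headers(inbound))
```

Frameworks with OpenTelemetry or B3 middleware do this automatically; without such middleware, every outbound HTTP/gRPC call must attach these headers by hand.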

Supported Trace Headers

B3 Headers (Zipkin):

x-b3-traceid:      128-bit trace ID
x-b3-spanid:       64-bit span ID
x-b3-parentspanid: 64-bit parent span ID
x-b3-sampled:      sampling decision (0 or 1)
x-b3-flags:        debug flag

W3C TraceContext:

traceparent: 00-TRACE_ID-SPAN_ID-FLAGS
tracestate:  vendor-specific key=value pairs
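The traceparent value is a fixed four-field, dash-separated format (version 00). A small parser, as a sketch:

```python
def parse_traceparent(value):
    """Split a W3C traceparent header into its four dash-separated fields."""
    version, trace_id, span_id, flags = value.split("-")
    if version != "00" or len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # bit 0 = sampled flag
    }

tp = parse_traceparent("00-463ac35c9f6413ad48485a3953bb6124-a2fb4a1d1a96d312-01")
print(tp["sampled"])  # True
```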

Envoy Internal Header:

x-request-id: UUID generated by Envoy (linked to trace)

Span Generation Detail

When Frontend → Reviews → Ratings:

Frontend Envoy (outbound):
  Span: "reviews.prod.svc.cluster.local:9080/*"
  ├── Start: request send start
  ├── End: response receive complete
  ├── Tags: upstream_cluster, http.method, http.status_code
  └── Parent: inbound span

Reviews Envoy (inbound):
  Span: "reviews.prod.svc.cluster.local:9080/*"
  ├── Start: request received
  ├── End: response sent
  └── Tags: downstream_cluster, peer.address

Reviews Envoy (outbound):
  Span: "ratings.prod.svc.cluster.local:9080/*"
  ├── Start: request send start
  ├── End: response receive complete
  └── Parent: inbound span

Trace Sampling

# Set in MeshConfig
meshConfig:
  defaultConfig:
    tracing:
      sampling: 1.0 # 1% (default)
      # sampling: 100.0  # 100% (for debugging)

# Or set via Telemetry API
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing
spec:
  tracing:
    - randomSamplingPercentage: 1.0

The sampling decision is made at the first Envoy and propagated via the x-b3-sampled header. Subsequent Envoys follow this decision.
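The propagation rule means only the first hop rolls the dice; every downstream proxy honors the header. The decision logic can be sketched as (the function name is hypothetical; this mirrors the behavior, not Envoy's source):

```python
import random

def sampling_decision(inbound_headers, sampling_percentage):
    """Decide whether to trace this request, honoring an upstream decision."""
    existing = inbound_headers.get("x-b3-sampled")
    if existing is not None:
        return existing == "1"            # follow the first Envoy's choice
    return random.random() * 100 < sampling_percentage

# Downstream hops: the header is already set, so the local percentage is ignored.
print(sampling_decision({"x-b3-sampled": "1"}, 0.0))    # True
print(sampling_decision({"x-b3-sampled": "0"}, 100.0))  # False
```

This is why changing the sampling rate on a mid-chain workload has no effect: the entry point's configuration determines the trace volume for the whole call chain.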

Trace Collector Integration

# Zipkin integration
meshConfig:
  defaultConfig:
    tracing:
      zipkin:
        address: zipkin.istio-system:9411

# Jaeger integration (using Zipkin-compatible endpoint)
meshConfig:
  defaultConfig:
    tracing:
      zipkin:
        address: jaeger-collector.observability:9411

# OpenTelemetry Collector integration
meshConfig:
  extensionProviders:
  - name: otel
    opentelemetry:
      service: otel-collector.observability.svc.cluster.local
      port: 4317

Access Logging

Envoy Access Log Format

Default log format:

[2026-03-20T10:30:00.000Z] "GET /api/reviews HTTP/1.1" 200 - via_upstream
  - "-" 0 1234 5 3
  "-" "curl/7.68.0" "abc-123-def"
  "reviews.prod.svc.cluster.local:9080"
  inbound|9080||reviews.prod.svc.cluster.local
  10.244.1.5:9080 10.244.0.3:48292
  outbound_.9080_.v1_.reviews.prod.svc.cluster.local default
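Fields in the default format are positional, so the interesting leading fields (timestamp, request line, status) can be pulled out with a regular expression. A sketch that parses only those leading fields, using the hypothetical name `parse_access_log`:

```python
import re

# Leading fields of Envoy's default access log format:
# [TIMESTAMP] "METHOD PATH PROTOCOL" STATUS ...
LOG_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<proto>\S+)" (?P<status>\d{3})'
)

def parse_access_log(line):
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = '[2026-03-20T10:30:00.000Z] "GET /api/reviews HTTP/1.1" 200 - via_upstream'
print(parse_access_log(line))
```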

Response Flags

| Flag | Meaning |
|------|---------|
| -    | Normal response |
| UH   | Upstream unhealthy (all hosts ejected) |
| UF   | Upstream connection failure |
| UO   | Upstream overflow (circuit breaker) |
| NR   | No route |
| URX  | Retry limit exceeded |
| DC   | Downstream connection termination |
| LH   | Local health check failure |
| UT   | Upstream timeout |
| RL   | Rate limited |
| UAEX | External authorization denied |
| RLSE | Rate limit service error |

Conditional Logging

Conditional access logging using the Telemetry API:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: access-log-errors
  namespace: production
spec:
  accessLogging:
    - providers:
        - name: envoy
      filter:
        expression: 'response.code >= 400 || connection.mtls == false'

CEL (Common Expression Language) expressions allow fine-grained control over logging conditions.
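To sanity-check a filter before applying it, the CEL expression above can be emulated in plain Python (an approximation for reasoning about the predicate; the real evaluation happens inside Envoy):

```python
def should_log(response_code, mtls):
    """Python equivalent of: response.code >= 400 || connection.mtls == false"""
    return response_code >= 400 or not mtls

print(should_log(200, True))   # False: healthy mTLS request is skipped
print(should_log(503, True))   # True: error responses are logged
print(should_log(200, False))  # True: plaintext connections are logged
```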

Kiali: Service Mesh Visualization

Kiali Architecture

Kiali data sources:
├── Prometheus → metric-based service graph
├── Kubernetes API → workload, service information
├── Istio Config API → VirtualService, DestinationRule, etc.
└── Jaeger/Tempo → distributed traces (optional)

Information Provided by Kiali

1. Service Graph (Topology)
   ├── Traffic flow between services
   ├── Request success/failure rates
   ├── Requests per second
   └── Response times

2. Workload Health
   ├── Health score based on error rate
   ├── Inbound/outbound metrics
   └── Pod status

3. Istio Configuration Validation
   ├── VirtualService validity
   ├── DestinationRule conflict detection
   ├── Reference integrity (non-existent hosts, etc.)
   └── Best practice violations

4. Traffic Analysis
   ├── Traffic trends over time
   ├── Error pattern identification
   └── Latency distribution

Prometheus Integration

Metric Collection Configuration

Istio leverages Prometheus service discovery:

# Prometheus scrape configuration
scrape_configs:
  - job_name: 'envoy-stats'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+)
        replacement: '$1:15090'

Each Envoy proxy exposes Prometheus metrics on port 15090.

Useful PromQL Queries

# Per-service request success rate (last 5 minutes)
sum(rate(istio_requests_total{
  response_code!~"5.*",
  reporter="destination"
}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total{
  reporter="destination"
}[5m])) by (destination_service_name)

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{
    reporter="source"
  }[5m])) by (le, destination_service_name)
)

# Requests per second by service
sum(rate(istio_requests_total{
  reporter="destination"
}[5m])) by (destination_service_name)
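`rate()` is simply the per-second increase of a counter over the window. The success-rate query above can be reproduced by hand from two scrapes of `istio_requests_total` (the counter values are illustrative):

```python
def rate(prev, curr, interval_seconds):
    """Per-second increase of a monotonic counter between two scrapes."""
    return (curr - prev) / interval_seconds

# istio_requests_total for one service, scraped 300s apart (5m window):
total_prev, total_curr = 10_000, 13_000
err_prev, err_curr = 100, 130          # the response_code=~"5.*" subset

total_rps = rate(total_prev, total_curr, 300)
err_rps = rate(err_prev, err_curr, 300)
success_rate = (total_rps - err_rps) / total_rps
print(f"{total_rps:.1f} req/s, success {success_rate:.2%}")  # 10.0 req/s, success 99.00%
```

(PromQL also handles counter resets, which this sketch ignores.)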

Grafana Dashboards

Istio Standard Dashboards

Istio provides the following Grafana dashboards:

1. Mesh Dashboard
   └── Mesh-wide summary (service count, error rate, traffic)

2. Service Dashboard
   └── Per-service detail (inbound/outbound, error rate, latency)

3. Workload Dashboard
   └── Per-workload detail (pod-level metrics)

4. Control Plane Dashboard
   └── istiod performance (xDS pushes, response time, errors)

5. Performance Dashboard
   └── Envoy resource usage (memory, CPU, connections)

Jaeger/Zipkin/Tempo Integration

Distributed Tracing Backends

Supported tracing backends:

Zipkin
├── Lightweight, simple installation
├── In-memory or Cassandra/Elasticsearch storage
└── Default Istio support

Jaeger
├── Zipkin-compatible API
├── Various storage backend support
├── Spark-based analytics
└── Recommended for production

Tempo (Grafana)
├── Object storage based (S3, GCS)
├── High scalability
├── Native Grafana integration
└── Cost-effective

Trace Data Flow

[1] Request enters the mesh
[2] First Envoy generates trace ID
    (if no existing headers)
[3] Each Envoy creates spans and sends to collector
    ├── Zipkin: HTTP POST /api/v2/spans
    ├── Jaeger: UDP/gRPC
    └── OTLP: gRPC (OpenTelemetry)
[4] Collector assembles spans into traces
[5] Query traces in UI
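Step [4] — assembling spans into a trace — is a parent-pointer tree build: every span carries its parent's span ID, and roots have none. A minimal sketch with hypothetical span data:

```python
def build_trace(spans):
    """Group spans by parent_id into a tree; roots have parent_id None."""
    children = {}
    for s in spans:
        children.setdefault(s["parent_id"], []).append(s)

    def render(parent_id, depth=0):
        lines = []
        for s in children.get(parent_id, []):
            lines.append("  " * depth + s["name"])
            lines.extend(render(s["span_id"], depth + 1))
        return lines

    return render(None)

spans = [
    {"span_id": "a1", "parent_id": None, "name": "frontend inbound"},
    {"span_id": "b2", "parent_id": "a1", "name": "reviews outbound"},
    {"span_id": "c3", "parent_id": "b2", "name": "reviews inbound"},
]
print("\n".join(build_trace(spans)))
```

A broken trace (missing propagated headers) shows up here as multiple roots instead of one connected tree.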

Debugging Tips

Checking Metrics

# Check Envoy metrics directly for a specific pod
kubectl exec PODNAME -c istio-proxy -- \
  curl -s localhost:15090/stats/prometheus | grep istio_requests

# Check statistics via Envoy admin API
kubectl exec PODNAME -c istio-proxy -- \
  curl -s localhost:15000/stats | grep -E "^cluster\."

# Envoy server info
kubectl exec PODNAME -c istio-proxy -- \
  curl -s localhost:15000/server_info

Checking Tracing

# Verify trace header propagation
kubectl exec PODNAME -c istio-proxy -- \
  curl -s localhost:15000/config_dump | python3 -c "
import json, sys
config = json.load(sys.stdin)
for c in config.get('configs', []):
    if 'tracing' in str(c):
        print(json.dumps(c, indent=2))
"

Checking Access Logs

# Real-time access log viewing
kubectl logs PODNAME -c istio-proxy -f | grep -v healthz

# Filter error responses only
kubectl logs PODNAME -c istio-proxy | grep -E '" [45][0-9]{2} '

Conclusion

Istio observability is built on Envoy proxy's rich telemetry capabilities. Key takeaways:

  1. Metrics: Envoy Stats filter generates standard metrics like istio_requests_total, collected by Prometheus
  2. Tracing: Envoy automatically creates spans, but applications must propagate trace headers for end-to-end tracing
  3. Logging: Envoy access logs record all requests/responses, with conditional logging via Telemetry API
  4. Visualization: Kiali generates service graphs based on Prometheus metrics

Through the Istio Internals series, we have explored the internals of control plane, traffic management, security, Ambient Mesh, and observability. We hope this understanding helps you effectively operate service meshes in practice.