- Authors
  - Youngju Kim (@fjvbn20031)
- Introduction
- Envoy Statistics System
- Istio Standard Metrics
- Telemetry API v2
- Distributed Tracing
- Access Logging
- Kiali: Service Mesh Visualization
- Prometheus Integration
- Grafana Dashboards
- Jaeger/Zipkin/Tempo Integration
- Debugging Tips
- Conclusion
Introduction
Observability is one of the core values of a service mesh. Istio automatically generates metrics, distributed traces, and access logs without any application code changes.
This post analyzes how Istio generates telemetry data, what processing occurs in the Envoy filter chain, and how it integrates with external systems.
Envoy Statistics System
Statistics Types
Envoy generates three types of statistics:
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value | Total requests, total errors |
| Gauge | Value that can increase/decrease | Active connections, pending requests |
| Histogram | Distribution of values | Request latency, response size |
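The three stat kinds in the table above can be sketched in a few lines of Python. This is purely illustrative; Envoy's real implementation is in C++, and histograms are exported as cumulative `le` buckets rather than the per-bucket counts shown here.

```python
import bisect

class Counter:
    """Monotonically increasing value (e.g. total requests)."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n  # can only grow

class Gauge:
    """Value that can rise and fall (e.g. active connections)."""
    def __init__(self):
        self.value = 0
    def adjust(self, delta):
        self.value += delta

class Histogram:
    """Distribution of values, bucketed by upper bound (le)."""
    def __init__(self, bounds):
        self.bounds = bounds                    # e.g. [1, 5, 10, 25, 50, 100]
        self.buckets = [0] * (len(bounds) + 1)  # last slot is +Inf
    def record(self, value):
        # bisect_left finds the first bound >= value, matching 'le' semantics
        self.buckets[bisect.bisect_left(self.bounds, value)] += 1

requests = Counter()
latency = Histogram([1, 5, 10, 25, 50, 100])
for ms in [0.8, 3, 7, 30, 250]:
    requests.inc()
    latency.record(ms)
print(requests.value)   # 5
print(latency.buckets)  # [1, 1, 1, 0, 1, 0, 1]
```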
Statistics Generation in the Filter Chain
Request flow and statistics generation points:
Listener (connection stats)
│
▼
HTTP Connection Manager (request stats)
│
├── JWT Authn Filter → auth success/failure counters
├── RBAC Filter → authz allow/deny counters
├── Fault Filter → injected delay/abort counters
├── Stats Filter (istio.stats) → Istio standard metrics generation
└── Router Filter → upstream request statistics
│
▼
Cluster (upstream stats)
│
▼
Endpoint (connection/request stats)
Istio Standard Metrics
Core HTTP Metrics
istio_requests_total (Counter)
Core metric tracking request counts:
istio_requests_total{
  reporter="source",  # or "destination"
  source_workload="frontend",
  source_workload_namespace="prod",
  source_principal="spiffe://cluster.local/ns/prod/sa/frontend",
  destination_workload="reviews",
  destination_workload_namespace="prod",
  destination_principal="spiffe://cluster.local/ns/prod/sa/reviews",
  destination_service="reviews.prod.svc.cluster.local",
  destination_service_name="reviews",
  destination_service_namespace="prod",
  request_protocol="http",
  response_code="200",
  response_flags="-",
  connection_security_policy="mutual_tls"
}
istio_request_duration_milliseconds (Histogram)
Request processing time distribution:
istio_request_duration_milliseconds_bucket{
..., # same labels as above
le="1"
} 100
istio_request_duration_milliseconds_bucket{le="5"} 250
istio_request_duration_milliseconds_bucket{le="10"} 380
istio_request_duration_milliseconds_bucket{le="25"} 450
istio_request_duration_milliseconds_bucket{le="50"} 490
istio_request_duration_milliseconds_bucket{le="100"} 498
istio_request_duration_milliseconds_bucket{le="+Inf"} 500
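PromQL's `histogram_quantile()` estimates percentiles from these cumulative buckets by linear interpolation within the bucket that contains the target rank. A minimal sketch of that calculation, run against the bucket values above:

```python
def histogram_quantile(q, buckets):
    """Approximate the q-quantile from cumulative (le, count) buckets,
    mirroring PromQL's histogram_quantile() interpolation."""
    total = buckets[-1][1]        # count in the +Inf bucket = total observations
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le    # cannot interpolate into +Inf
            # linear interpolation inside this bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# Cumulative buckets from the sample output above
buckets = [(1, 100), (5, 250), (10, 380), (25, 450),
           (50, 490), (100, 498), (float("inf"), 500)]
print(histogram_quantile(0.50, buckets))  # 5.0
print(histogram_quantile(0.99, buckets))  # 81.25
```

This is why bucket boundaries matter: any percentile falling inside a wide bucket (here, 50–100 ms) is only a linear estimate.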
istio_request_bytes / istio_response_bytes (Histogram)
Tracks request/response size distribution.
TCP Metrics
istio_tcp_sent_bytes_total # Total bytes sent
istio_tcp_received_bytes_total # Total bytes received
istio_tcp_connections_opened_total # Connections opened
istio_tcp_connections_closed_total # Connections closed
Metric Generation Location: Source vs Destination
Frontend Pod Reviews Pod
[App] → [Envoy] ──────→ [Envoy] → [App]
│ │
reporter="source" reporter="destination"
(outbound side) (inbound side)
Both Envoys generate metrics, distinguished by the reporter label: "source" captures the client-side (outbound) perspective, while "destination" captures the server-side (inbound) perspective.
Telemetry API v2
Architecture Evolution
Istio ≤ 1.4 (Mixer-based):
App → Envoy → Mixer → Prometheus/Zipkin
(separate service, extra hop and latency)
Istio 1.5+ (Telemetry v2; Telemetry API from 1.12):
App → Envoy (built-in Stats/Trace filters) → Prometheus/Zipkin
(in-proxy processing, low latency)
Since Mixer's removal (completed in Istio 1.8), metric generation is performed directly inside the Envoy proxy.
Telemetry Resource
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system  # mesh-wide application
spec:
  # Metrics configuration
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
            mode: CLIENT_AND_SERVER
          tagOverrides:
            request_host:
              operation: UPSERT
              value: 'request.host'
  # Tracing configuration
  tracing:
    - providers:
        - name: zipkin
      randomSamplingPercentage: 1.0
      customTags:
        environment:
          literal:
            value: 'production'
  # Access logging configuration
  accessLogging:
    - providers:
        - name: envoy
      filter:
        expression: 'response.code >= 400'
Metric Customization
# Disable a specific metric
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_BYTES
          disabled: true

# Add a custom tag
overrides:
  - match:
      metric: REQUEST_COUNT
    tagOverrides:
      custom_tag:
        operation: UPSERT
        value: "request.headers['x-custom-tag']"
Distributed Tracing
Trace Propagation Mechanism
What Envoy does automatically:
├── Extract trace headers from inbound requests
├── Create spans and record timing
├── Send spans to trace collector
└── Add trace headers to outbound requests
What the application must do:
└── Copy trace headers from inbound to outbound requests
(without this, traces are broken)
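The application-side responsibility boils down to copying a fixed set of headers from each inbound request to every outbound request it triggers. A minimal sketch, using plain dicts as a stand-in for whatever HTTP framework the service uses (the header list follows Istio's propagation documentation):

```python
# Headers the application must forward so Envoy can link spans
# across the inbound and outbound sides of the same request.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags",
    "traceparent", "tracestate",
    "x-ot-span-context",  # OpenTracing-based setups
)

def propagation_headers(inbound):
    """Pick only the trace-propagation headers out of an inbound header dict."""
    lowered = {k.lower(): v for k, v in inbound.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}

inbound = {
    "X-Request-Id": "abc-123-def",
    "X-B3-TraceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "X-B3-SpanId": "00f067aa0ba902b7",
    "X-B3-Sampled": "1",
    "Content-Type": "application/json",  # not propagated
}
print(propagation_headers(inbound))
```

Many tracing client libraries (OpenTelemetry SDKs, for example) do this copying automatically, but something in the process must do it.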
Supported Trace Headers
B3 Headers (Zipkin):
x-b3-traceid: 128-bit trace ID
x-b3-spanid: 64-bit span ID
x-b3-parentspanid: 64-bit parent span ID
x-b3-sampled: sampling decision (0 or 1)
x-b3-flags: debug flag
W3C TraceContext:
traceparent: 00-TRACE_ID-SPAN_ID-FLAGS
tracestate: vendor-specific key=value pairs
Envoy Internal Header:
x-request-id: UUID generated by Envoy (linked to trace)
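The W3C traceparent layout can be unpacked mechanically; a small sketch using the sample value from the Trace Context specification:

```python
def parse_traceparent(value):
    """Unpack a W3C traceparent header: version-trace_id-parent_id-flags."""
    version, trace_id, parent_id, flags = value.split("-")
    return {
        "version": version,
        "trace_id": trace_id,     # 128-bit, hex-encoded
        "parent_id": parent_id,   # 64-bit, hex-encoded (the caller's span)
        "sampled": int(flags, 16) & 0x01 == 0x01,  # lowest flag bit
    }

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(ctx["sampled"])  # True
```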
Span Generation Detail
When Frontend → Reviews → Ratings:
Frontend Envoy (outbound):
Span: "reviews.prod.svc.cluster.local:9080/*"
├── Start: request send start
├── End: response receive complete
├── Tags: upstream_cluster, http.method, http.status_code
└── Parent: inbound span
Reviews Envoy (inbound):
Span: "reviews.prod.svc.cluster.local:9080/*"
├── Start: request received
├── End: response sent
└── Tags: downstream_cluster, peer.address
Reviews Envoy (outbound):
Span: "ratings.prod.svc.cluster.local:9080/*"
├── Start: request send start
├── End: response receive complete
└── Parent: inbound span
Trace Sampling
# Set in MeshConfig
meshConfig:
  defaultConfig:
    tracing:
      sampling: 1.0      # 1% (default)
      # sampling: 100.0  # 100% (for debugging)

# Or set via the Telemetry API
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing
spec:
  tracing:
    - randomSamplingPercentage: 1.0
The sampling decision is made at the first Envoy and propagated via the x-b3-sampled header. Subsequent Envoys follow this decision.
Trace Collector Integration
# Zipkin integration
meshConfig:
  defaultConfig:
    tracing:
      zipkin:
        address: zipkin.istio-system:9411

# Jaeger integration (using the Zipkin-compatible endpoint)
meshConfig:
  defaultConfig:
    tracing:
      zipkin:
        address: jaeger-collector.observability:9411

# OpenTelemetry Collector integration
meshConfig:
  extensionProviders:
    - name: otel
      opentelemetry:
        service: otel-collector.observability.svc.cluster.local
        port: 4317
Access Logging
Envoy Access Log Format
Default log format:
[2026-03-20T10:30:00.000Z] "GET /api/reviews HTTP/1.1" 200 - via_upstream
- "-" 0 1234 5 3
"-" "curl/7.68.0" "abc-123-def"
"reviews.prod.svc.cluster.local:9080"
inbound|9080||reviews.prod.svc.cluster.local
10.244.1.5:9080 10.244.0.3:48292
outbound_.9080_.v1_.reviews.prod.svc.cluster.local default
Response Flags
| Flag | Meaning |
|---|---|
| - | Normal response |
| UH | Upstream unhealthy (all ejected) |
| UF | Upstream connection failure |
| UO | Upstream overflow (circuit breaker) |
| NR | No route |
| URX | Retry limit exceeded |
| DC | Downstream connection termination |
| LH | Local health check failure |
| UT | Upstream timeout |
| RL | Rate limited |
| UAEX | External authorization denied |
| RLSE | Rate limit service error |
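Envoy joins multiple flags with commas in a single log field, which makes a small lookup helper handy for triage. The table above, as code:

```python
# Response-flag meanings from the table above.
RESPONSE_FLAGS = {
    "UH": "upstream unhealthy (all ejected)",
    "UF": "upstream connection failure",
    "UO": "upstream overflow (circuit breaker)",
    "NR": "no route",
    "URX": "retry limit exceeded",
    "DC": "downstream connection termination",
    "LH": "local health check failure",
    "UT": "upstream timeout",
    "RL": "rate limited",
    "UAEX": "external authorization denied",
    "RLSE": "rate limit service error",
}

def explain_flags(field):
    """Decode the response_flags field of an Envoy access log line.
    '-' means a normal response; multiple flags are comma-joined."""
    if field == "-":
        return []
    return [RESPONSE_FLAGS.get(f, f"unknown ({f})") for f in field.split(",")]

print(explain_flags("UF,URX"))
# ['upstream connection failure', 'retry limit exceeded']
```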
Conditional Logging
Conditional access logging using the Telemetry API:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: access-log-errors
  namespace: production
spec:
  accessLogging:
    - providers:
        - name: envoy
      filter:
        expression: 'response.code >= 400 || connection.mtls == false'
CEL (Common Expression Language) expressions allow fine-grained control over logging conditions.
Kiali: Service Mesh Visualization
Kiali Architecture
Kiali data sources:
├── Prometheus → Metric-based service graph
├── Kubernetes API → Workload, service information
├── Istio Config API → VirtualService, DestinationRule, etc.
└── Jaeger/Tempo → Distributed traces (optional)
Information Provided by Kiali
1. Service Graph (Topology)
├── Traffic flow between services
├── Request success/failure rates
├── Requests per second
└── Response times
2. Workload Health
├── Health score based on error rate
├── Inbound/outbound metrics
└── Pod status
3. Istio Configuration Validation
├── VirtualService validity
├── DestinationRule conflict detection
├── Reference integrity (non-existent hosts, etc.)
└── Best practice violations
4. Traffic Analysis
├── Traffic trends over time
├── Error pattern identification
└── Latency distribution
Prometheus Integration
Metric Collection Configuration
Istio leverages Prometheus service discovery:
# Prometheus scrape configuration
scrape_configs:
  - job_name: 'envoy-stats'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only the istio-proxy sidecar containers
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      # point the scrape target at the pod IP on Envoy's stats port
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+)
        replacement: '$1:15090'
Each Envoy proxy exposes Prometheus metrics on port 15090.
Useful PromQL Queries
# Per-service request success rate (last 5 minutes)
sum(rate(istio_requests_total{
response_code!~"5.*",
reporter="destination"
}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total{
reporter="destination"
}[5m])) by (destination_service_name)
# P99 latency
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="source"
}[5m])) by (le, destination_service_name)
)
# Requests per second by service
sum(rate(istio_requests_total{
reporter="destination"
}[5m])) by (destination_service_name)
Grafana Dashboards
Istio Standard Dashboards
Istio provides the following Grafana dashboards:
1. Mesh Dashboard
└── Mesh-wide summary (service count, error rate, traffic)
2. Service Dashboard
└── Per-service detail (inbound/outbound, error rate, latency)
3. Workload Dashboard
└── Per-workload detail (pod-level metrics)
4. Control Plane Dashboard
└── istiod performance (xDS pushes, response time, errors)
5. Performance Dashboard
└── Envoy resource usage (memory, CPU, connections)
Jaeger/Zipkin/Tempo Integration
Distributed Tracing Backends
Supported tracing backends:
Zipkin
├── Lightweight, simple installation
├── In-memory or Cassandra/Elasticsearch storage
└── Default Istio support
Jaeger
├── Zipkin-compatible API
├── Various storage backend support
├── Spark-based analytics
└── Recommended for production
Tempo (Grafana)
├── Object storage based (S3, GCS)
├── High scalability
├── Native Grafana integration
└── Cost-effective
Trace Data Flow
[1] Request enters the mesh
│
[2] First Envoy generates trace ID
(if no existing headers)
│
[3] Each Envoy creates spans and sends to collector
├── Zipkin: HTTP POST /api/v2/spans
├── Jaeger: UDP/gRPC
└── OTLP: gRPC (OpenTelemetry)
│
[4] Collector assembles spans into traces
│
[5] Query traces in UI
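Step [3] can be reproduced by hand for the Zipkin case. A sketch that builds one span in Zipkin's v2 JSON format and POSTs it the way Envoy reports (field names follow the Zipkin v2 API; the collector address assumes the in-cluster zipkin service from the earlier config):

```python
import json
import time
import urllib.request

# One hand-built Zipkin v2 span. The API accepts a JSON array of spans.
span = {
    "traceId": "0af7651916cd43dd8448eb211c80319c",
    "id": "b7ad6b7169203331",
    "kind": "CLIENT",
    "name": "reviews.prod.svc.cluster.local:9080/*",
    "timestamp": int(time.time() * 1_000_000),  # microseconds since epoch
    "duration": 5_000,                          # 5 ms, in microseconds
    "localEndpoint": {"serviceName": "frontend"},
    "tags": {"http.method": "GET", "http.status_code": "200"},
}
body = json.dumps([span]).encode()
req = urllib.request.Request(
    "http://zipkin.istio-system:9411/api/v2/spans",
    data=body,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment when a collector is reachable
```

Seeing a hand-posted span appear in the UI is a quick way to confirm the collector path works independently of the mesh.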
Debugging Tips
Checking Metrics
# Check Envoy metrics directly for a specific pod
kubectl exec PODNAME -c istio-proxy -- \
curl -s localhost:15090/stats/prometheus | grep istio_requests
# Check statistics via Envoy admin API
kubectl exec PODNAME -c istio-proxy -- \
curl -s localhost:15000/stats | grep -E "^cluster\."
# Envoy server info
kubectl exec PODNAME -c istio-proxy -- \
curl -s localhost:15000/server_info
Checking Tracing
# Verify trace header propagation
kubectl exec PODNAME -c istio-proxy -- \
curl -s localhost:15000/config_dump | python3 -c "
import json, sys
config = json.load(sys.stdin)
for c in config.get('configs', []):
if 'tracing' in str(c):
print(json.dumps(c, indent=2))
"
Checking Access Logs
# Real-time access log viewing
kubectl logs PODNAME -c istio-proxy -f | grep -v healthz
# Filter error responses only (the status code follows the quoted request line)
kubectl logs PODNAME -c istio-proxy | grep -E '" [45][0-9]{2} '
Conclusion
Istio observability is built on Envoy proxy's rich telemetry capabilities. Key takeaways:
- Metrics: Envoy Stats filter generates standard metrics like istio_requests_total, collected by Prometheus
- Tracing: Envoy automatically creates spans, but applications must propagate trace headers for end-to-end tracing
- Logging: Envoy access logs record all requests/responses, with conditional logging via Telemetry API
- Visualization: Kiali generates service graphs based on Prometheus metrics
Through the Istio Internals series, we have explored the internals of control plane, traffic management, security, Ambient Mesh, and observability. We hope this understanding helps you effectively operate service meshes in practice.