eBPF-Based Zero Instrumentation Kubernetes Observability: Cilium Hubble and Grafana Beyla Practical Guide


Introduction: Why eBPF-Based Zero Instrumentation

There are two traditional approaches to achieving observability in Kubernetes environments. First, explicit instrumentation, which inserts SDKs into application code to generate metrics and traces. Second, the service mesh approach, which injects Envoy-based sidecar proxies into each Pod to observe L7 traffic. Both approaches are proven in production but have fundamental limitations.

SDK-based instrumentation requires adding libraries and writing instrumentation code for every service. Maintaining this consistently across large-scale clusters with hundreds of microservices involves enormous engineering costs. The sidecar proxy approach can achieve L7 visibility without code changes, but deploying an additional container per Pod increases per-node memory usage by hundreds of MB, and adding a network hop typically raises P99 latency by 2-5ms.

eBPF (extended Berkeley Packet Filter) addresses the limitations of both approaches simultaneously. By running sandboxed programs inside the Linux kernel, it can observe network flows, HTTP/gRPC requests, DNS queries, and TCP connection states at the kernel level without any application code changes. No sidecar containers are needed, so resource overhead is extremely low, and since data is collected directly from the kernel, it has virtually no impact on application performance.

This article focuses on Cilium Hubble and Grafana Beyla, the core tools for eBPF-based observability, and provides a detailed practical operations guide for building network visibility and application performance monitoring (APM) in Kubernetes clusters with zero code changes.

eBPF Technology Overview: Principles of Kernel-Level Instrumentation

eBPF Program Execution Flow

eBPF is a technology that matured over the Linux 4.x kernel series and enables running programs in kernel space without writing kernel modules or modifying the kernel itself. eBPF programs are written in user space, pass through the kernel's Verifier, and are then converted to native code by the JIT (Just-In-Time) compiler. The Verifier protects kernel stability by rejecting unbounded loops, blocking out-of-bounds memory access, and allowing only permitted helper function calls.

The hook points where eBPF programs can be attached are highly diverse. For networking, there are TC (Traffic Control), XDP (eXpress Data Path), and socket-level hooks. For system-wide observation, kprobes (kernel function entry/exit), tracepoints (predefined observation points), and uprobes (user-space function tracing) are used. Cilium Hubble primarily uses TC and socket-level hooks to observe network flows, while Grafana Beyla leverages uprobes and kprobes to automatically extract RED (Rate, Error, Duration) metrics from HTTP/gRPC requests.

eBPF Feature Support by Kernel Version

eBPF feature support varies by kernel version. Full Cilium Hubble functionality requires at least kernel 4.19, and Grafana Beyla's auto-instrumentation features operate stably on kernel 5.8 and above. Kernel 5.15 LTS or higher is recommended for production environments. Specifically, kernels with BTF (BPF Type Format) support enabled allow eBPF programs to be deployed in CO-RE (Compile Once, Run Everywhere) mode, ensuring compatibility across diverse node environments.

# Check current node's kernel version and BTF support
uname -r
# Example output: 5.15.0-91-generic

# Check BTF support
ls /sys/kernel/btf/vmlinux
# If the file exists, BTF support is enabled

# eBPF feature probe (using bpftool)
bpftool feature probe kernel | grep -E "map_type|program_type|helper"

# Bulk check kernel versions across Kubernetes nodes
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'
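
The per-node checks above can be folded into a gating script. Below is a minimal sketch (the helper name `kernel_at_least` is an invention for illustration) that compares a kernel release string against the 5.8 minimum using version sort:

```shell
#!/usr/bin/env sh
# kernel_at_least: succeed if release $1 (e.g. "5.15.0-91-generic") is >= minimum $2 (e.g. "5.8").
# Hypothetical helper, not part of any tool's CLI.
kernel_at_least() {
  want="$2"
  # Strip the distro suffix, keeping the numeric release (5.15.0-91-generic -> 5.15.0)
  have=$(printf '%s' "$1" | cut -d- -f1)
  # sort -V orders version strings; if the minimum sorts first (or equal), the node passes
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

kernel_at_least "5.15.0-91-generic" "5.8" && echo "beyla-ok" || echo "too-old"
kernel_at_least "4.19.0-25-amd64"   "5.8" && echo "beyla-ok" || echo "too-old"
```

In CI, this check could consume the output of the kubectl jsonpath loop above and fail the pipeline on any non-compliant node.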

Cilium Hubble Architecture: The Core of Network Observability

Hubble Internal Structure

Cilium Hubble is the observability layer of Cilium CNI, composed of three core components. First, Hubble Server runs on each node, embedded in the Cilium Agent, collecting network events from the eBPF dataplane. Second, Hubble Relay aggregates events from Hubble Servers across the entire cluster and provides them through a single API endpoint. Third, Hubble UI visually represents service maps and network flows.

The scope of network events Hubble can observe is as follows. At the L3/L4 level, it provides source/destination IP, port, and protocol information for TCP/UDP/ICMP packets, along with packet drop reasons. At the L7 level, it parses HTTP request/response methods, paths, and status codes; DNS query and response domains, types, and response codes; and Kafka message topics and API keys. All of this information is automatically mapped to Kubernetes identities (namespaces, Pods, service labels), enabling clear understanding of inter-service communication patterns.

Service Maps and Flow Visibility

One of Hubble's most powerful features is automatically generated service maps. In traditional service meshes, Envoy sidecars proxy each Pod's inbound/outbound traffic to construct service topology, but Hubble automatically builds service dependency graphs from network flow data collected by kernel eBPF programs. This allows you to see in real-time which services communicate with which other services, and what the communication success rates and latencies are, all without deploying a service mesh.

Hubble Installation and Configuration: Helm Chart-Based Deployment

Cilium + Hubble Integrated Installation

Hubble is deployed as part of Cilium CNI. If you are already using Cilium, you simply need to enable the Hubble feature. For new clusters, enable Hubble during Cilium installation. Below is a Helm values configuration suitable for production environments.

# cilium-hubble-values.yaml
# Helm values for Cilium + Hubble production deployment
hubble:
  enabled: true
  relay:
    enabled: true
    replicas: 3 # Relay replicas for high availability
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
    retryTimeout: 30s
    sortBufferLenMax: 5000
    sortBufferDrainTimeout: 1s
  ui:
    enabled: true
    replicas: 2
    ingress:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx
        cert-manager.io/cluster-issuer: letsencrypt-prod
      hosts:
        - hubble.internal.example.com
      tls:
        - secretName: hubble-ui-tls
          hosts:
            - hubble.internal.example.com
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - port-distribution
      - icmp
      - httpV2:exemplars=true;labelsContext=source_ip,source_namespace,source_workload,destination_ip,destination_namespace,destination_workload,traffic_direction
    serviceMonitor:
      enabled: true # Auto-create Prometheus Operator ServiceMonitor
      interval: 15s
  tls:
    enabled: true # Enable mTLS between Hubble Server and Relay
    auto:
      enabled: true
      method: certmanager
      certManagerIssuerRef:
        group: cert-manager.io
        kind: ClusterIssuer
        name: hubble-issuer

# Cilium Agent optimization settings
operator:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 128Mi

# Monitor buffer size (per-node event cache)
bpf:
  monitorAggregation: medium # none/low/medium/high - aggregation level
  monitorInterval: 5s
  monitorFlags: all
  mapDynamicSizeRatio: 0.0025

# Install Cilium + Hubble (Helm)
helm repo add cilium https://helm.cilium.io
helm repo update

helm upgrade --install cilium cilium/cilium \
  --version 1.16.5 \
  --namespace kube-system \
  --values cilium-hubble-values.yaml \
  --wait

# Verify installation
cilium status --wait
cilium hubble port-forward &

# Check cluster-wide flows with Hubble CLI
hubble observe --since 1m --output compact

# Filter only HTTP traffic in a specific namespace
hubble observe \
  --namespace production \
  --protocol http \
  --http-status 500-599 \
  --output json | jq '{
    source: .flow.source.labels,
    destination: .flow.destination.labels,
    http: .flow.l7.http
  }'

# Analyze dropped packets between services
hubble observe \
  --verdict DROPPED \
  --namespace production \
  --output table

Hubble Metrics-Based PromQL Queries

When Hubble-generated metrics are collected by Prometheus, you can define various network-level SLIs (Service Level Indicators). Below is a collection of PromQL queries commonly used in production.

# Per-service HTTP request success rate (SLI)
(
  sum(rate(hubble_http_requests_total{http_status_code=~"2.."}[5m])) by (destination_workload, destination_namespace)
  /
  sum(rate(hubble_http_requests_total[5m])) by (destination_workload, destination_namespace)
) * 100

# Packet drop rate due to network policy
sum(rate(hubble_drop_total{reason="POLICY_DENIED"}[5m])) by (source_workload, destination_workload)
/ on(source_workload) group_left
sum(rate(hubble_flows_processed_total[5m])) by (source_workload)

# TCP connection setup P99 latency (per service)
histogram_quantile(0.99,
  sum(rate(hubble_tcp_connect_duration_seconds_bucket[5m])) by (le, destination_workload, destination_namespace)
)

# Top 10 workloads by DNS query failure rate
topk(10,
  sum(rate(hubble_dns_responses_total{rcode!="No Error"}[5m])) by (source_workload, source_namespace, qtypes, rcode)
)

# Inter-namespace traffic volume matrix
sum(rate(hubble_flows_processed_total[5m])) by (source_namespace, destination_namespace)
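
Where the Prometheus Operator is in use, these SLIs can feed alerting. Below is a hedged sketch of a PrometheusRule built on the success-rate query above; the rule name, namespace, and the 99% threshold are illustrative assumptions, not recommendations:

```yaml
# Hypothetical PrometheusRule wrapping the Hubble HTTP success-rate SLI.
# Names and the 99% threshold are examples only.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hubble-slo
  namespace: monitoring
spec:
  groups:
    - name: hubble-http-slo
      rules:
        - alert: HTTPSuccessRateBelowSLO
          expr: |
            (
              sum(rate(hubble_http_requests_total{http_status_code=~"2.."}[5m])) by (destination_workload, destination_namespace)
              /
              sum(rate(hubble_http_requests_total[5m])) by (destination_workload, destination_namespace)
            ) * 100 < 99
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "HTTP success rate for {{ $labels.destination_workload }} is below 99%"
```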

Grafana Beyla Auto-Instrumentation: Zero-Code APM

How Beyla Works

Grafana Beyla is an auto-instrumentation tool using eBPF that automatically generates RED (Rate, Error, Duration) metrics and distributed traces for HTTP/HTTPS, gRPC, and SQL requests without any application code changes. Beyla is deployed as a DaemonSet on each node, and through uprobes, it automatically detects and hooks HTTP handlers and gRPC server functions across various runtimes including Go, Java, Python, Node.js, Rust, and .NET.

The core metrics Beyla generates are as follows. http.server.request.duration provides HTTP server request processing time as a histogram, and http.client.request.duration tracks HTTP client external call times. rpc.server.duration and rpc.client.duration measure gRPC server and client call times respectively. These metrics follow OpenTelemetry semantic conventions and can be exported directly via OTLP protocol or sent through Prometheus Remote Write.

While Hubble provides visibility at the network layer (L3/L4/L7), Beyla complements it with application-layer performance metrics and traces. Using both tools together creates a complete observability stack that can observe everything from infrastructure networking to application performance without any code changes.
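
As one concrete example of consuming these histograms, a P95 latency query in Prometheus form (the exact metric name depends on how the OTLP name is translated by your exporter, so treat the name below as an assumption):

```promql
# P95 server-side request latency per service from Beyla's duration histogram
histogram_quantile(0.95,
  sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, service_name)
)
```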

Beyla's Automatic Service Discovery

Beyla automatically detects processes to identify instrumentation targets. Detection works in two ways. First, it analyzes the symbol table of executable binaries to check if they contain HTTP server or gRPC server-related functions. For Go binaries, it detects symbols like net/http.(*Server).Serve and google.golang.org/grpc.(*Server).Serve. Second, it monitors listening sockets to identify processes accepting TCP connections on specific ports. By combining these two methods, services written in any language can be automatically included as instrumentation targets.
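
The symbol-table side of this detection can be inspected by hand. The sketch below greps a saved symbol listing so the pipeline is reproducible; on a live binary you would instead run `go tool nm ./myserver` (where `myserver` is a placeholder binary name):

```shell
# Sketch of Beyla-style symbol detection against a captured symbol table.
# The addresses and the /tmp path are illustrative.
cat <<'EOF' > /tmp/beyla_symbols.txt
00000000004a2b00 T net/http.(*Server).Serve
00000000004b1c20 T google.golang.org/grpc.(*Server).Serve
EOF

# Count matches for the HTTP server symbol; a stripped binary would yield 0,
# and detection would fall back to the listening-port heuristic.
grep -c 'net/http.(\*Server).Serve' /tmp/beyla_symbols.txt
```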

Beyla Deployment and Configuration: Kubernetes DaemonSet Setup

DaemonSet-Based Deployment

The most common way to deploy Beyla on Kubernetes is as a DaemonSet. It accesses the host PID namespace on each node to detect and instrument all processes on the node.

# beyla-daemonset.yaml
# Grafana Beyla DaemonSet deployment manifest
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: beyla
  namespace: monitoring
  labels:
    app.kubernetes.io/name: beyla
    app.kubernetes.io/component: auto-instrumentation
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: beyla
  template:
    metadata:
      labels:
        app.kubernetes.io/name: beyla
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9090'
        prometheus.io/path: '/metrics'
    spec:
      serviceAccountName: beyla
      hostPID: true # Host PID namespace access required
      tolerations:
        - operator: Exists # Deploy to all nodes (including master)
      containers:
        - name: beyla
          image: grafana/beyla:1.8.2
          securityContext:
            privileged: true # Required for loading eBPF programs
            runAsUser: 0
          ports:
            - containerPort: 9090
              name: metrics
              protocol: TCP
          env:
            - name: BEYLA_OPEN_PORT
              value: '80,443,8080,8443,3000,5000,9090'
            - name: BEYLA_SERVICE_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: BEYLA_KUBE_METADATA_ENABLE
              value: 'autodetect'
          volumeMounts:
            - name: beyla-config
              mountPath: /config
            - name: sys-kernel-security
              mountPath: /sys/kernel/security
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: beyla-config
          configMap:
            name: beyla-config
        - name: sys-kernel-security
          hostPath:
            path: /sys/kernel/security
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: beyla-config
  namespace: monitoring
data:
  beyla-config.yml: |
    # Beyla main configuration file
    log_level: info

    # Service auto-discovery settings
    discovery:
      services:
        - k8s_namespace: "production|staging"
          k8s_pod_labels:
            app.kubernetes.io/part-of: ".*"
        - k8s_namespace: ".*"
          k8s_deployment_name: ".*"

    # Metrics export settings
    otel_metrics_export:
      endpoint: http://mimir-distributor.monitoring:4317
      protocol: grpc
      interval: 15s
      features:
        - application
        - application_process
        - application_service_graph
      histograms:
        - explicit:
            boundaries:
              - 0.005
              - 0.01
              - 0.025
              - 0.05
              - 0.1
              - 0.25
              - 0.5
              - 1
              - 2.5
              - 5
              - 10

    # Trace export settings
    otel_traces_export:
      endpoint: http://tempo-distributor.monitoring:4317
      protocol: grpc
      sampler:
        name: parentbased_traceidratio
        arg: "0.1"                       # 10% sampling

    # Prometheus endpoint settings
    prometheus_export:
      port: 9090
      path: /metrics
      features:
        - application
        - application_process
        - application_service_graph

    # Network metrics (experimental feature)
    network:
      enable: true
      cidrs:
        - 10.0.0.0/8
        - 172.16.0.0/12

    # Attribute settings
    attributes:
      kubernetes:
        enable: true
      host_id:
        fetch_timeout: 5s
      select:
        beyla_network_flow_bytes:
          include:
            - k8s.src.namespace
            - k8s.dst.namespace
            - direction
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: beyla
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: beyla
rules:
  - apiGroups: ['']
    resources: ['pods', 'nodes', 'services', 'replicationcontrollers']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['apps']
    resources: ['deployments', 'replicasets', 'statefulsets', 'daemonsets']
    verbs: ['get', 'list', 'watch']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: beyla
subjects:
  - kind: ServiceAccount
    name: beyla
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: beyla
  apiGroup: rbac.authorization.k8s.io

# Deploy Beyla DaemonSet
kubectl apply -f beyla-daemonset.yaml

# Check deployment status
kubectl rollout status daemonset/beyla -n monitoring

# Check auto-detected services in Beyla logs
kubectl logs -n monitoring -l app.kubernetes.io/name=beyla --tail=50 | grep "instrumenting"
# Example output:
# msg="instrumenting process" pid=12345 service=frontend-deployment
# msg="instrumenting process" pid=12346 service=api-server-deployment
# msg="instrumenting process" pid=12347 service=payment-service-deployment

# Check metrics generated by Beyla
kubectl exec -n monitoring $(kubectl get pod -n monitoring -l app.kubernetes.io/name=beyla -o jsonpath='{.items[0].metadata.name}') \
  -- wget -qO- http://localhost:9090/metrics | head -40

# Check eBPF program load status
kubectl exec -n monitoring $(kubectl get pod -n monitoring -l app.kubernetes.io/name=beyla -o jsonpath='{.items[0].metadata.name}') \
  -- bpftool prog list | grep beyla

Grafana Stack Integration: Tempo, Mimir, Loki

Integrated Architecture

Integrating telemetry data generated by Cilium Hubble and Grafana Beyla with the Grafana LGTM (Loki, Grafana, Tempo, Mimir) stack creates a complete observability platform based on zero instrumentation. The data flow is as follows.

Hubble metrics are collected via Prometheus ServiceMonitor and stored long-term in Mimir. Hubble's network flow logs are sent to Loki via Fluentd or Vector. RED metrics generated by Beyla are sent directly to Mimir Distributor via OTLP protocol or stored through Prometheus Remote Write. Beyla's distributed traces are sent to Tempo Distributor via OTLP.

In Grafana dashboards, you can correlate Mimir metrics (Hubble network + Beyla APM), Tempo traces (Beyla auto-generated), and Loki logs (Hubble flow logs) in a single view. The ability to freely navigate from traces to metrics and from metrics to logs significantly reduces root cause analysis (RCA) time during incidents.

Grafana Datasource and Dashboard Configuration

# grafana-datasources.yaml
# Grafana datasource ConfigMap (Hubble + Beyla integration)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: '1'
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      # Mimir - Unified storage for Hubble network metrics + Beyla APM metrics
      - name: Mimir
        type: prometheus
        url: http://mimir-query-frontend.monitoring:8080/prometheus
        access: proxy
        isDefault: true
        jsonData:
          timeInterval: 15s
          exemplarTraceIdDestinations:
            - name: traceID
              datasourceUid: tempo
              urlDisplayLabel: View Trace
          httpMethod: POST
          prometheusType: Mimir

      # Tempo - Storage for Beyla auto-generated traces
      - name: Tempo
        type: tempo
        uid: tempo
        url: http://tempo-query-frontend.monitoring:3200
        access: proxy
        jsonData:
          tracesToMetrics:
            datasourceUid: mimir
            spanStartTimeShift: "-1h"
            spanEndTimeShift: "1h"
            tags:
              - key: service.name
                value: service
              - key: http.method
                value: method
            queries:
              - name: Request Rate
                query: "sum(rate(http_server_request_duration_seconds_count{service=\"${__span.tags.service.name}\"}[5m]))"
              - name: Error Rate
                query: "sum(rate(http_server_request_duration_seconds_count{service=\"${__span.tags.service.name}\",http_status_code=~\"5..\"}[5m]))"
          tracesToLogs:
            datasourceUid: loki
            spanStartTimeShift: "-5m"
            spanEndTimeShift: "5m"
            filterByTraceID: true
            filterBySpanID: false
            tags:
              - key: k8s.namespace.name
                value: namespace
              - key: k8s.pod.name
                value: pod
          serviceMap:
            datasourceUid: mimir
          nodeGraph:
            enabled: true
          search:
            hide: false
          lokiSearch:
            datasourceUid: loki

      # Loki - Hubble flow logs + application logs
      - name: Loki
        type: loki
        uid: loki
        url: http://loki-read.monitoring:3100
        access: proxy
        jsonData:
          derivedFields:
            - datasourceUid: tempo
              matcherRegex: "traceID=(\\w+)"
              name: TraceID
              url: "$${__value.raw}"

eBPF vs Sidecar vs Agent Comparison

A comparison of eBPF-based observability, sidecar proxy-based service mesh, and traditional agent-based monitoring from an operational perspective.

| Comparison Item | eBPF (Hubble + Beyla) | Sidecar (Istio/Envoy) | Agent (SDK Instrumentation) |
| --- | --- | --- | --- |
| Code change required | None | None (auto-injection) | SDK integration required |
| Additional container per Pod | None | 1 (Envoy) | None |
| Memory overhead per node | ~300MB (Cilium + Beyla) | ~100MB x Pod count | ~50MB (agent) |
| P99 latency impact | Under 0.1ms | 2-5ms | 0.5-1ms |
| L3/L4 network visibility | Very detailed | Limited | None |
| L7 HTTP/gRPC visibility | RED metrics via Beyla | Very detailed | SDK dependent |
| Distributed tracing | Beyla auto (context propagation limited) | Auto (full propagation) | Full control |
| DNS observation | Hubble built-in | Envoy DNS proxy | Separate configuration needed |
| Kernel version dependency | 5.8+ recommended | None | None |
| Security permission requirements | privileged or CAP_BPF | NET_ADMIN | Regular user |
| Service mesh features (mTLS, traffic control) | Partially supported via Cilium | Fully supported | None |
| Operational complexity | Medium | High | Low (but SDK management cost is high) |
| Multi-language support | Language agnostic | Language agnostic | Language-specific SDK needed |

Selection Guide

eBPF-based approaches are suitable in the following cases:

  • You need consistent monitoring across hundreds of microservices but lack the engineering resources to integrate SDKs into each service
  • High-density node clusters where sidecar proxy resource overhead is difficult to absorb
  • Migration periods where monitoring must be achieved without code changes, including for legacy services
  • Infrastructure-centric operations where fine-grained network-level visibility (DNS, TCP connection state, packet drops) is important

On the other hand, sidecar proxies may be more suitable when:

  • You need service mesh traffic control features such as mTLS, traffic splitting, and circuit breaking
  • Complete distributed trace context propagation is essential
  • Nodes must run kernels older than 5.8
  • Cluster security policies cannot allow the privileged container permissions eBPF requires

Operational Considerations

Kernel Version and Distribution Compatibility

The most common failure cause in eBPF-based tool operations is kernel version mismatch. Cilium Hubble's basic features work on kernel 4.19 and above, but L7 protocol parsing and advanced metrics require kernel 5.4 or higher. Grafana Beyla's uprobes-based auto-instrumentation operates stably only on kernel 5.8 and above, and features utilizing BPF ring buffer specifically require 5.8 as the minimum.

For Amazon EKS, the AL2023 AMI provides kernel 6.1 by default, so there are no issues. However, the AL2 AMI uses kernel 5.10, which supports most features but may have limitations for some advanced capabilities. GKE's Container-Optimized OS (COS) provides kernel 5.15 or higher, and Ubuntu node images are also 5.15 or above. AKS uses kernel 5.15 with Ubuntu 22.04-based nodes.

Security Considerations

Beyla requires CAP_SYS_ADMIN or CAP_BPF + CAP_PERFMON privileges to load eBPF programs into the kernel. The simplest method is running as a privileged container, but in production environments, it is recommended to grant only the necessary capabilities following the principle of least privilege.

# Minimum privilege security context (instead of privileged)
securityContext:
  privileged: false
  runAsUser: 0
  capabilities:
    add:
      - BPF # eBPF program loading
      - PERFMON # perf event access
      - NET_RAW # raw socket access (network observation)
      - SYS_PTRACE # process tracing (uprobes)
      - DAC_READ_SEARCH # filesystem traversal
    drop:
      - ALL

However, some Kubernetes security policies (Pod Security Standards Restricted profile) may deny adding these capabilities altogether. In this case, you need to configure a separate exception policy for the Beyla DaemonSet. Also, since eBPF programs execute in kernel space, the source and integrity of Beyla images must be verified. Use signed images and set the image pull policy to Always to prevent tampered images from being deployed.
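
The image-integrity point can be sketched as a container spec fragment; the digest value below is a placeholder, not a real Beyla image digest:

```yaml
# Container spec fragment: pin a verified digest and always re-pull.
# The sha256 value is a placeholder for illustration.
containers:
  - name: beyla
    image: grafana/beyla:1.8.2@sha256:<digest-of-verified-image>
    imagePullPolicy: Always
```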

Resource Management and Scaling

When operating Hubble and Beyla on large-scale clusters (500+ nodes), you must closely monitor resource usage. Hubble Relay aggregates events from all nodes in the cluster, so its memory and CPU usage increases linearly with the number of nodes. Using one Relay instance per 100 nodes as a baseline, clusters with 500 or more nodes should horizontally scale Relay to 5 or more instances with load balancing configured.
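
The sizing heuristic above reduces to simple arithmetic. A sketch (the 1-per-100-nodes ratio is this article's baseline, and the floor of 3 mirrors the HA replica count in the Helm values earlier):

```shell
# Relay replica sizing: ceil(nodes / 100), floored at 3 for high availability.
# In practice NODES would come from: kubectl get nodes --no-headers | wc -l
NODES=620
REPLICAS=$(( (NODES + 99) / 100 ))   # integer ceiling division
if [ "$REPLICAS" -lt 3 ]; then REPLICAS=3; fi
echo "$REPLICAS"   # 620 nodes -> 7 replicas
```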

Beyla's memory usage varies based on the number of instrumented processes per node. Approximately 2-5MB of additional memory is needed per process, so on nodes with high Pod density (50+ Pods), Beyla's memory limit may need to be increased to 1Gi or more. Additionally, it is advisable to limit instrumentation targets to specific namespaces or labels in the discovery settings to prevent unnecessary resource consumption.

Failure Cases and Recovery Procedures

Case 1: eBPF Program Load Failure

The most common failure type is eBPF program load failure due to kernel version mismatch. It typically surfaces during rolling node upgrades, when the fleet is temporarily running a mix of old and new kernels.

Symptom: The Beyla Pod falls into CrashLoopBackOff state, with errors such as "failed to load eBPF program: invalid argument" or "kernel doesn't support bpf_ringbuf" in the logs.

Diagnostic procedure: First, check the kernel version of the affected node. Query the kernel version with kubectl get node <node-name> -o jsonpath='{.status.nodeInfo.kernelVersion}' and compare it to Beyla's minimum required version (5.8+). BTF support should also be verified. Access the node and check for the existence of /sys/kernel/btf/vmlinux.

Recovery method: Use nodeSelector or nodeAffinity to prevent the Beyla DaemonSet from being scheduled on nodes with older kernels. In the long term, upgrade the OS image on those nodes.
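
A sketch of the nodeSelector guard (the `ebpf-ready` label key is an assumed naming convention you would apply to compliant nodes, e.g. with `kubectl label node <node-name> ebpf-ready=true`):

```yaml
# DaemonSet spec fragment: schedule Beyla only on nodes labeled as kernel-ready.
# The label key "ebpf-ready" is an example, not a Beyla or Kubernetes convention.
spec:
  template:
    spec:
      nodeSelector:
        ebpf-ready: "true"
```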

Case 2: Hubble Relay Out of Memory (OOMKilled)

This occurs when Hubble Relay exceeds its memory limit while processing a large volume of network flows.

Symptom: Hubble Relay Pod restarts repeatedly, and OOMKilled termination reason is confirmed in kubectl describe pod. Connections intermittently drop in Hubble CLI and UI.

Diagnostic procedure: Check current memory usage with kubectl top pod -n kube-system -l k8s-app=hubble-relay. Also use the hubble_relay_received_flows_total metric to understand the per-second volume of events being processed by Relay.

Recovery method: In the short term, increase Relay's memory limit (from 512Mi to 1Gi or more). In the medium term, adjust Cilium's bpf.monitorAggregation setting to medium or high to reduce event volume. In the long term, horizontally scale Relay instances.

Case 3: Beyla Metrics Missing

Beyla is running normally but metrics for a specific service are not being collected.

Symptom: RED metrics for a specific service do not appear in Grafana dashboards. No instrumentation messages for that service appear in Beyla logs.

Diagnostic procedure: Check the binary format of the service. If a Go service was built with -ldflags="-s -w" option to strip symbols, Beyla may not be able to detect function symbols. Auto-detection can also fail if the service uses non-standard ports or a custom HTTP framework.

Recovery method: For Go binaries, retain symbols instead of using -ldflags="-s -w", or explicitly add the service's port to the BEYLA_OPEN_PORT environment variable in Beyla's configuration. For non-standard frameworks, enable Beyla's generic HTTP tracing mode to detect HTTP patterns at the socket level.

Case 4: Metric Discrepancy Between Hubble and Beyla

A case where Hubble's network-level HTTP metrics and Beyla's application-level HTTP metrics show different numbers.

Symptom: The number of HTTP requests reported by Hubble is higher or lower than what Beyla reports.

Root cause analysis: Hubble parses HTTP at the network packet level, so it captures all HTTP traffic including health checks, Kubernetes probes, and sidecar-to-sidecar communication. Beyla, on the other hand, traces function calls of application processes, so it only instruments requests actually processed by the application. This discrepancy is normal behavior, and analyzing both metrics together is useful for distinguishing infrastructure traffic from application traffic.

Operations Checklist

A list of items to verify when introducing eBPF-based observability into production.

Pre-deployment verification:

  • Confirm all node kernel versions are 5.8 or above
  • Verify BTF (BPF Type Format) support is enabled (/sys/kernel/btf/vmlinux exists)
  • Confirm node security policies (Pod Security Standards) allow privileged or CAP_BPF
  • For managed Kubernetes (EKS, GKE, AKS), verify CNI plugin replacement feasibility
  • Validate Cilium does not conflict with existing CNI (including kube-proxy replacement mode)

Post-deployment verification:

  • Confirm Cilium Agent is Ready on all nodes (cilium status)
  • Confirm Hubble Relay is connected to all node Hubble Servers
  • Check logs to verify Beyla has auto-detected expected services
  • Confirm Hubble and Beyla metrics are being collected normally in Mimir/Prometheus
  • Confirm Beyla traces are being sent normally to Tempo

Monitoring alerts:

  • Alert when Hubble Relay memory utilization exceeds 80%
  • Verify Beyla DaemonSet Ready Pod count matches node count
  • Detect eBPF program load failure events
  • Monitor Hubble metric collection delay (scrape duration)
  • Monitor Beyla's OTLP export failure rate
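
The first alert above can be expressed in PromQL using standard cAdvisor and kube-state-metrics series (assuming both are scraped; the pod name pattern is an assumption based on the default Helm naming):

```promql
# Hubble Relay memory working set vs. its configured limit; fires above 80%
max by (pod) (
  container_memory_working_set_bytes{namespace="kube-system", pod=~"hubble-relay-.*"}
)
/
max by (pod) (
  kube_pod_container_resource_limits{namespace="kube-system", pod=~"hubble-relay-.*", resource="memory"}
)
> 0.8
```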

Regular maintenance:

  • Monthly: Verify eBPF program compatibility after kernel security patches
  • Quarterly: Review and test Cilium/Beyla version upgrades
  • Semi-annually: Plan kernel version upgrades (track LTS kernels)
