Kyverno Production Operations: HA, Monitoring, Troubleshooting

1. HA (High Availability) Deployment

1.1 Replica Configuration

helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno -n kyverno --create-namespace \
  --set admissionController.replicas=3 \
  --set backgroundController.replicas=2 \
  --set reportsController.replicas=2

1.2 Resource Allocation and Topology

admissionController:
  replicas: 3
  container:
    resources:
      limits:
        cpu: '2'
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 512Mi
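  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      # A constraint without a labelSelector selects no pods and spreads nothing
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: admission-controller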

1.3 failurePolicy Strategy

  • Fail: Reject requests when webhook is down (security priority)
  • Ignore: Allow requests when webhook is down (availability priority)
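
In Kyverno the failure policy is declared on the policy itself. The following is a minimal sketch, assuming the `spec.failurePolicy` field of the `kyverno.io/v1` ClusterPolicy; newer releases may prefer a different location for this setting, so check the CRD reference for your version. The policy name `audit-app-label` and the file name are hypothetical.

```shell
# Write an example policy that opts into failurePolicy: Ignore.
# Policy and file names are hypothetical, chosen for illustration only.
cat > audit-app-label.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: audit-app-label
spec:
  failurePolicy: Ignore          # Fail is the stricter alternative
  validationFailureAction: Audit
  rules:
    - name: check-app-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label app.kubernetes.io/name is required."
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: '?*'
EOF

# With cluster access, apply it with:
#   kubectl apply -f audit-app-label.yaml
```

An audit-only policy like this one is a natural candidate for Ignore, since blocking requests during a webhook outage gains little when violations are only reported.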

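With failurePolicy: Fail, always configure a PodDisruptionBudget (PDB) so that node drains and upgrades cannot evict every admission controller replica at once: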

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kyverno-admission-controller
  namespace: kyverno
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: admission-controller
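
To see why three replicas pair well with minAvailable: 2, the eviction headroom can be worked out by hand. This is a rough sketch that assumes every replica is currently healthy (the real disruption controller counts healthy pods, not desired replicas); the function name `allowed_disruptions` is hypothetical.

```shell
# allowed_disruptions REPLICAS MIN_AVAILABLE
# Prints how many pods may be evicted at once while the PDB still holds,
# assuming every replica is currently healthy.
allowed_disruptions() {
  replicas=$1
  min_available=$2
  allowed=$((replicas - min_available))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 2   # prints 1: drains proceed one pod at a time
allowed_disruptions 2 2   # prints 0: every eviction is refused, drains stall
```

A result of 0 is worth avoiding: the webhook stays up, but node maintenance blocks until someone intervenes.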

2. Policy Reports

kubectl get policyreport -n production
kubectl get clusterpolicyreport

Each report carries a summary with pass, fail, warn, error, and skip counts, alongside the individual per-policy, per-rule results. Install Policy Reporter UI for visualization:

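# Value names differ between Policy Reporter chart versions; check the values.yaml of the chart you install
helm install policy-reporter policy-reporter/policy-reporter -n kyverno \
  --set ui.enabled=true --set kyvernoPlugin.enabled=true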
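
For a quick command-line view without the UI, the fail counts can be totalled per namespace. This is a sketch that assumes the `summary.fail` field of PolicyReport objects; the helper name `fail_counts_by_namespace` is hypothetical, and it reads tab-separated `namespace<TAB>fail` lines on stdin.

```shell
# Sum the second column (fail count) per namespace (first column).
fail_counts_by_namespace() {
  awk -F'\t' '{ fails[$1] += $2 } END { for (ns in fails) printf "%s\t%d\n", ns, fails[ns] }' | sort
}

# With cluster access, feed it from kubectl:
#   kubectl get policyreport -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.summary.fail}{"\n"}{end}' \
#     | fail_counts_by_namespace

# Offline demonstration with made-up numbers:
printf 'production\t2\nproduction\t3\nstaging\t0\n' | fail_counts_by_namespace
```

The made-up input above yields one line per namespace, with production totalling 5 and staging 0.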

3. Monitoring

3.1 Prometheus Metrics

Key metrics: kyverno_admission_requests_total, kyverno_admission_review_duration_seconds, kyverno_policy_results_total, kyverno_policy_execution_duration_seconds.
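
When Prometheus is not yet scraping Kyverno, the raw metrics text can still be summarised locally. This is a sketch that assumes `kyverno_policy_results_total` carries a `rule_result` label; the helper name `policy_results_by_outcome` is hypothetical and the sample lines are made up.

```shell
# Read Prometheus exposition text on stdin and total
# kyverno_policy_results_total per rule_result value.
policy_results_by_outcome() {
  awk '
    /^kyverno_policy_results_total[{]/ {
      if (match($0, /rule_result="[^"]*"/)) {
        # Skip the 13 characters of rule_result=" and drop the closing quote
        outcome = substr($0, RSTART + 13, RLENGTH - 14)
        total[outcome] += $NF
      }
    }
    END { for (o in total) printf "%s %d\n", o, total[o] }
  ' | sort
}

# Offline demonstration with made-up sample lines:
policy_results_by_outcome <<'EOF'
kyverno_policy_results_total{policy_name="require-labels",rule_result="pass"} 40
kyverno_policy_results_total{policy_name="require-labels",rule_result="fail"} 3
kyverno_policy_results_total{policy_name="disallow-privileged",rule_result="fail"} 2
EOF
```

For the sample input this prints `fail 5` followed by `pass 40`. Against a live cluster, the same function can be fed from a scrape of the controller's metrics endpoint.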

3.2 Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kyverno-alerts
spec:
  groups:
    - name: kyverno
      rules:
        - alert: KyvernoWebhookHighLatency
          expr: histogram_quantile(0.99, sum by (le) (rate(kyverno_admission_review_duration_seconds_bucket[5m]))) > 10
          for: 5m
        - alert: KyvernoHighFailureRate
          expr: sum(rate(kyverno_admission_requests_total{request_allowed="false"}[5m])) > 0.1
          for: 5m

4. Background Scanning

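apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Audit
  background: true # Enable background scanning (default: true)
  rules:
    - name: check-labels
      match: # every rule needs a match block; Pod is an assumed target kind
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: 'The label app.kubernetes.io/name is required.'
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: '?*'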
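
How often existing resources are re-evaluated is a controller setting, not a policy field. This is a sketch of a Helm values override, assuming the chart exposes `features.backgroundScan.backgroundScanInterval` (verify against the values.yaml of your chart version); the file name and the 30m interval are hypothetical.

```shell
# Write a values override that shortens the background scan interval.
# File name and interval are illustrative assumptions.
cat > kyverno-background-scan-values.yaml <<'EOF'
features:
  backgroundScan:
    enabled: true
    backgroundScanInterval: 30m
EOF

# With cluster access, roll it out with:
#   helm upgrade kyverno kyverno/kyverno -n kyverno --reuse-values \
#     -f kyverno-background-scan-values.yaml
```

A shorter interval surfaces drift sooner at the cost of more load on the reports controller and the API server.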

5. PolicyException

apiVersion: kyverno.io/v2beta1 # use kyverno.io/v2 where your Kyverno release serves it
kind: PolicyException
metadata:
  name: allow-privileged-monitoring
  namespace: monitoring
spec:
  exceptions:
    - policyName: disallow-privileged
      ruleNames:
        - check-privileged
  match:
    any:
      - resources:
          kinds: ['Pod']
          namespaces: ['monitoring']
          names: ['prometheus-*']
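
An exception like the one above only takes effect if the feature is switched on in the Kyverno installation, and recent releases ship with it disabled. This is a sketch of a Helm values override, assuming the chart exposes `features.policyExceptions.enabled` and `features.policyExceptions.namespace` (verify against the values.yaml of your chart version); the file name is hypothetical.

```shell
# Write a values override that enables PolicyException and limits
# where exceptions are honored. File name is an illustrative assumption.
cat > kyverno-policy-exceptions-values.yaml <<'EOF'
features:
  policyExceptions:
    enabled: true
    namespace: monitoring   # only exceptions created in this namespace count
EOF

# With cluster access, roll it out with:
#   helm upgrade kyverno kyverno/kyverno -n kyverno --reuse-values \
#     -f kyverno-policy-exceptions-values.yaml
```

Restricting the namespace keeps the ability to grant exemptions in the hands of whoever controls that namespace through RBAC.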

6. Troubleshooting

# Check Kyverno logs
kubectl logs -n kyverno -l app.kubernetes.io/component=admission-controller --tail=100

# Check webhook status
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations

# Check policy status
kubectl get clusterpolicy -o wide
kubectl describe clusterpolicy require-labels

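# Check admission latency (port-forward, since the Kyverno image may not include wget or a shell)
kubectl port-forward -n kyverno deploy/kyverno-admission-controller 8000:8000 &
sleep 2 && curl -s localhost:8000/metrics | grep admission_review_duration
kill %1 # stop the port-forward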

7. Summary

  1. HA deployment: 3+ replicas, PDB, TopologySpreadConstraints
  2. Policy reports: PolicyReport CRD for compliance visibility
  3. Monitoring: Prometheus metrics + Grafana dashboards + alert rules
  4. Background scanning: Periodic policy evaluation of existing resources
  5. Exception mechanism: PolicyException for legitimate exemptions
  6. Troubleshooting: Diagnose via webhook logs, policy status, and reports