CNPE (Certified Cloud Native Platform Engineer) Complete Guide — From Exam Scope to Production Tech Stack

1. What is CNPE?

CNPE (Certified Cloud Native Platform Engineer) is the highest-level certification officially announced by CNCF in November 2025. It validates advanced hands-on skills in designing and operating enterprise-scale Internal Developer Platforms (IDPs).

According to CNCF CTO Chris Aniszczyk, this certification verifies "production-level cloud native system capabilities spanning platform architecture, GitOps, Observability, security, and developer experience."

1.1 Exam Format

Item         Details
Duration     120 minutes
Format       100% performance-based (hands-on), online proctored
Environment  Linux-based remote desktop (terminal + web interface)
Open Book    kubernetes.io/docs and per-task Quick Reference documents allowed
Fee          $445 USD
Retake       1 free retake included
Validity     2 years
Simulator    2 Killer.sh sessions included

1.2 Prerequisites

There are no official mandatory prerequisites, but CNPA (Certified Cloud Native Platform Associate) or CKA-level Kubernetes management experience is strongly recommended.

1.3 Target Audience

  • Experienced Platform Engineers
  • Senior DevOps / SRE
  • Platform Architects
  • Infrastructure Engineers

2. Exam Domain Overview

The CNPE exam consists of 5 core domains.

Domain                                       Weight
GitOps and Continuous Delivery               25%
Platform APIs and Self-Service Capabilities  25%
Observability and Operations                 20%
Platform Architecture and Infrastructure     15%
Security and Policy Enforcement              15%

A frequently referenced architecture in the industry is the BACK stack: Backstage + Argo CD + Crossplane + Kyverno. However, since the exam is vendor-neutral, Argo CD can be substituted with Flux, and Kyverno with OPA/Gatekeeper.


3. Domain 1: GitOps and Continuous Delivery (25%)

3.1 Core GitOps Principles

These are the 4 core principles defined by OpenGitOps (opengitops.dev).

Principle                Description
Declarative              The desired state of the system is expressed declaratively
Versioned and Immutable  The desired state is stored in Git with full change-history tracking
Pulled Automatically     Agents automatically pull the desired state from the source (pull, not push)
Continuously Reconciled  Differences between actual and desired state are continuously detected and corrected

3.2 Argo CD Architecture

Argo CD is a declarative GitOps CD tool for Kubernetes, composed of 3 core components.

  • API Server (argocd-server): A gRPC/REST server providing Web UI and CLI APIs. Handles application management, RBAC enforcement, and Git webhook reception.
  • Repository Server (argocd-repo-server): Maintains a local cache of Git repositories and generates Kubernetes manifests based on revisions and paths.
  • Application Controller (argocd-application-controller): Continuously monitors running applications, comparing current state against the target state in Git to detect OutOfSync conditions.

Application CRD Example:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    targetRevision: HEAD
    path: guestbook
  destination:
    server: https://kubernetes.default.svc
    namespace: guestbook
  syncPolicy:
    automated:
      prune: true # Delete resources removed from Git
      selfHeal: true # Automatically revert manual changes

3.3 Argo CD Sync Policies

Policy      Default   Description
automated   disabled  Sync automatically when OutOfSync is detected
prune       false     Remove resources deleted from Git from the cluster
selfHeal    false     Automatically revert manual cluster changes (drift) to the Git state
allowEmpty  false     Whether to allow sync when zero manifests are generated
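
Taken together, a fully automated sync policy might look like the sketch below. The retry block and the CreateNamespace=true sync option are common additions, not requirements:

syncPolicy:
  automated:
    prune: true
    selfHeal: true
    allowEmpty: false
  syncOptions:
    - CreateNamespace=true # create the destination namespace if it does not exist
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m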

3.4 Multi-cluster Deployment with ApplicationSet

ApplicationSet is a CRD that creates multiple Application resources from a single template. It dynamically generates parameters through Generators.

Key Generators:

Generator      Description
Cluster        Auto-discovers clusters registered in Argo CD
Git Directory  Generates apps from the repository directory structure
Matrix         Combines parameters from two generators (Cartesian product)
Pull Request   Creates preview environments for each open PR

Cluster Generator Example:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-apps
  namespace: argocd
spec:
  goTemplate: true # enables the {{.name}}-style Go template parameters used below
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: '{{.name}}-my-app'
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/apps.git
        targetRevision: HEAD
        path: deploy/production
      destination:
        server: '{{.server}}'
        namespace: my-app

3.5 Flux CD

Flux is a CD tool built on the CNCF GitOps Toolkit, composed of 5 specialized controllers.

Controller               CRDs                                          Role
Source Controller        GitRepository, HelmRepository, OCIRepository  Fetches artifacts from Git/Helm/OCI
Kustomize Controller     Kustomization                                 Applies Kustomize overlays or plain YAML
Helm Controller          HelmRelease                                   Manages the Helm chart lifecycle
Notification Controller  Provider, Alert, Receiver                     Sends notifications and processes inbound webhooks
Image Automation         ImageRepository, ImagePolicy                  Scans container registries and auto-updates image references
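
The Kustomization shown next references a GitRepository source by name. A minimal definition for it could look like this (the repository URL and branch are illustrative):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 1m # how often to poll the repository
  url: https://github.com/myorg/my-app # assumed example repository
  ref:
    branch: main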

Flux Kustomization Example:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: my-app
  path: ./deploy/production
  prune: true
  wait: true
  dependsOn:
    - name: cert-manager
    - name: ingress-nginx
  postBuild:
    substitute:
      CLUSTER_NAME: production
      DOMAIN: example.com

3.6 Progressive Delivery: Argo Rollouts

Argo Rollouts automates progressive deployment strategies such as Canary and Blue-Green.

Canary Deployment Example:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 30
        - pause: { duration: 5m }
        - setWeight: 60
        - pause: { duration: 5m }
        - setWeight: 100
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: myapp:v2

Blue-Green Deployment Key Settings:

Setting                Description
autoPromotionEnabled   Whether to auto-promote after preview (default: true)
autoPromotionSeconds   Wait time before automatic promotion
scaleDownDelaySeconds  Wait time before terminating previous-version Pods (default: 30s)
prePromotionAnalysis   Metric validation before the traffic switch
postPromotionAnalysis  Post-switch validation; rollback on failure
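
A minimal Blue-Green strategy combining these settings might look like the sketch below (service and template names are illustrative):

spec:
  strategy:
    blueGreen:
      activeService: my-app-active # Service receiving production traffic
      previewService: my-app-preview # Service pointing at the new version
      autoPromotionEnabled: false # require manual promotion
      scaleDownDelaySeconds: 60
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests # hypothetical AnalysisTemplate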

4. Domain 2: Platform APIs and Self-Service (25%)

4.1 Building Self-Service Infrastructure with Crossplane

Crossplane extends Kubernetes into a universal Control Plane, enabling cloud infrastructure provisioning through the Kubernetes API.

Architecture: Providers expose cloud APIs as Managed Resources; Composite Resource Definitions (XRDs) define the platform's API schemas; Compositions map each Composite Resource (XR) to the Managed Resources that implement it.

Composite Resource Definition (XRD):

apiVersion: apiextensions.crossplane.io/v2
kind: CompositeResourceDefinition
metadata:
  name: mydatabases.example.org
spec:
  scope: Namespaced
  group: example.org
  names:
    kind: XMyDatabase
    plural: mydatabases
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                region:
                  type: string
                size:
                  type: string
              required:
                - region
                - size

Composition (Implementation Template):

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: example-database
spec:
  compositeTypeRef:
    apiVersion: example.org/v1alpha1
    kind: XMyDatabase
  mode: Pipeline
  pipeline:
    - step: patch-and-transform
      functionRef:
        name: function-patch-and-transform
      input:
        apiVersion: pt.fn.crossplane.io/v1beta1
        kind: Resources
        resources:
          - name: rds-instance
            base:
              apiVersion: rds.aws.m.upbound.io/v1beta1
              kind: Instance
              spec:
                forProvider:
                  region: us-east-2
                  engine: postgres
                  instanceClass: db.t3.micro
            patches:
              - type: FromCompositeFieldPath
                fromFieldPath: spec.region
                toFieldPath: spec.forProvider.region

The self-service pattern works as follows:

  1. The Platform Team defines XRDs (API schemas) and Compositions (implementations).
  2. Developers create instances of those APIs (Composite Resources, XRs) in their own namespaces.
  3. Crossplane automatically provisions and manages cloud resources.
  4. Developers never interact directly with cloud Provider APIs.
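
Concretely, a developer's request against the XRD defined above might look like this (namespace and field values are illustrative):

apiVersion: example.org/v1alpha1
kind: XMyDatabase
metadata:
  name: team-a-db
  namespace: team-a # namespaced XR, per the XRD's scope
spec:
  region: us-east-2
  size: small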

4.2 Backstage: Internal Developer Portal

Backstage (CNCF Incubating) is an open-source IDP framework developed by Spotify.

Core Features:

Feature              Description
Software Catalog     Central registry of all software assets
Software Templates   Standardized, automated project creation
TechDocs             Docs-like-code technical documentation
Plugin Architecture  Extensible plugin system

Software Catalog Entity Definition:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Payment processing microservice
  tags:
    - java
    - spring-boot
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: payment-platform
  dependsOn:
    - resource:default/payments-db
  providesApis:
    - payment-api
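
The providesApis entry above refers to a separate API entity, which could be registered as follows (a sketch; the OpenAPI file path is illustrative):

apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: payment-api
  description: REST API for payment processing
spec:
  type: openapi
  lifecycle: production
  owner: payments-team
  system: payment-platform
  definition:
    $text: ./openapi.yaml # assumed location of the API spec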

Software Template Example:

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: new-microservice
  title: Create New Microservice
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service Details
      required:
        - name
        - owner
      properties:
        name:
          title: Service Name
          type: string
        owner:
          title: Owner Team
          type: string
          ui:field: OwnerPicker
  steps:
    - id: fetchBase
      name: Fetch Template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{parameters.name}}
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=myorg&repo=${{parameters.name}}
    - id: register
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{steps.publish.output.repoContentsUrl}}
        catalogInfoPath: /catalog-info.yaml

4.3 CRDs and the Operator Pattern

The foundation of Platform APIs is Kubernetes Custom Resource Definitions (CRDs) and the Operator pattern.

Operators encode domain-specific operational knowledge into Kubernetes controllers.

Control Loop:
  1. Observe: Watch for Custom Resource changes
  2. Analyze: Compare current state with desired state
  3. Act: Create/update/delete dependent resources to reconcile state

Major Operator Frameworks:

Framework       Language
Kubebuilder     Go
Operator SDK    Go, Ansible, Helm
Kopf            Python
kube-rs         Rust
Metacontroller  Any language (webhook-based)
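
Whatever the framework, every Operator starts from a CRD that registers its custom API type. A minimal example (hypothetical group and kind):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.org # must be <plural>.<group>
spec:
  group: example.org
  scope: Namespaced
  names:
    kind: Widget
    plural: widgets
    singular: widget
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer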

5. Domain 3: Observability and Operations (20%)

5.1 OpenTelemetry

OpenTelemetry (OTel) is the CNCF Observability framework that provides unified collection of 3 core signals.

Signal   Description                                          Use Case
Traces   Request-path tracking across distributed systems     Understanding request flow between microservices
Metrics  Runtime measurements (counters, gauges, histograms)  Monitoring performance trends and resource utilization
Logs     Timestamped event records                            Debugging context at specific points in time

OTel Collector Configuration:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:
    timeout: 10s

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Kubernetes Auto-Instrumentation:

With the OTel Operator, automatic instrumentation is possible without code changes.

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: demo-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4318
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: '1'

Enable auto-instrumentation by adding annotations to Deployments:

metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: 'true' # Java
    instrumentation.opentelemetry.io/inject-python: 'true' # Python
    instrumentation.opentelemetry.io/inject-nodejs: 'true' # Node.js

5.2 Prometheus and Grafana Stack

Prometheus Architecture:

Component          Role
Prometheus Server  Time-series data collection and storage (pull-based)
Alertmanager       Alert routing, deduplication, and dispatch
Exporters          Metric adapters for third-party systems
Service Discovery  Automatic target discovery via Kubernetes, Consul, DNS, etc.

ServiceMonitor CRD (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Essential PromQL Queries:

# Request rate per second (5-minute window)
rate(http_requests_total[5m])

# Sum by job
sum by (job) (rate(http_requests_total[5m]))

# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
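
The error-rate expression above can drive alerting. With the Prometheus Operator, a PrometheusRule might look like this (the 1% threshold and label values are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  labels:
    release: prometheus # must match the Operator's rule selector
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) > 0.01
          for: 10m # sustained for 10 minutes before firing
          labels:
            severity: critical
          annotations:
            summary: Error rate above 1% for 10 minutes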

Grafana Observability Stack:

                  +-----------+
                  |  Grafana  |  (Dashboards, Alerts)
                  +-----+-----+
                        |
           +------------+------------+
           |            |            |
      +----+----+  +---+---+  +-----+-----+
      |  Mimir  |  |  Loki |  |   Tempo   |
      | Metrics |  |  Logs |  |  Traces   |
      +----+----+  +---+---+  +-----+-----+
           |            |            |
           +------------+------------+
                        |
               +--------+--------+
               | OTel Collector  |
               +-----------------+
                        |
                [Applications]

  • Mimir: Horizontally scalable long-term metrics storage
  • Loki: Lightweight log aggregation system with label-based indexing
  • Tempo: Large-scale distributed tracing backend

5.3 SLI/SLO and Error Budgets

Concept  Definition                                    Example
SLI      Quantitative measure of service performance   Request success rate, P99 latency
SLO      Target range for an SLI                       "99.9% of requests complete within 200ms"
SLA      Contractual obligation when an SLO is missed  "Service credits provided if availability falls below 99.95%"

Error Budget Calculation:

Error Budget = 1 - SLO

SLO 99.9%  -> Error Budget 0.1%  -> ~43.2 minutes of downtime allowed per 30-day month
SLO 99.99% -> Error Budget 0.01% -> ~4.32 minutes of downtime allowed per 30-day month

Error Budget Policy:

Burn Rate        Action
0-50% (Green)    Proceed with normal feature development
50-80% (Yellow)  Increase focus on stability work and code reviews
80-100% (Red)    Feature freeze; full focus on stability work

5.4 DORA Metrics

Core metrics for measuring platform efficiency.

Metric                 Description
Deployment Frequency   How often deployments occur
Lead Time for Changes  Time from code commit to production deployment
Change Failure Rate    Percentage of deployments that cause failures
Mean Time to Recovery  Average time from failure occurrence to resolution

6. Domain 4: Platform Architecture and Infrastructure (15%)

6.1 Multi-tenancy Patterns

Pattern          Isolation Level     Suitable Scenario
Namespace-based  Logical isolation   Trusted internal teams
Cluster-based    Physical isolation  Strong security requirements, regulated environments
Hybrid           Mixed               Differentiated isolation per environment

Namespace-based isolation tools:

# Resource limits with ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: '10'
    requests.memory: 20Gi
    limits.cpu: '20'
    limits.memory: 40Gi
    pods: '50'
---
# Network isolation with NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
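
A deny-all policy is usually paired with explicit allow rules. For example, a common companion policy permitting traffic only between Pods in the same namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-a
spec:
  podSelector: {} # applies to all Pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {} # any Pod in the same namespace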

6.2 Cost Management: OpenCost

OpenCost (CNCF Sandbox) is an open-source tool that provides Kubernetes cost visibility and allocation. It tracks costs by namespace, team, and service, and supports resource right-sizing.

6.3 Autoscaling Strategies

Scaler              Target                           Criteria
HPA                 Horizontal Pod scaling           CPU/memory/custom metrics
VPA                 Pod resource request adjustment  Actual usage analysis
Cluster Autoscaler  Horizontal node scaling          Pending Pod detection
KEDA                Event-driven scaling             Queue length, HTTP requests, etc.
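
For reference, a minimal HPA targeting average CPU utilization via the autoscaling/v2 API (target Deployment name and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out above 70% average CPU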

7. Domain 5: Security and Policy Enforcement (15%)

7.1 OPA/Gatekeeper

OPA Gatekeeper operates with two resources: ConstraintTemplate (policy logic defined in Rego) and Constraint (specifying policy targets).

ConstraintTemplate Example (Required Label Validation):

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }

Constraint Application:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-team
spec:
  match:
    kinds:
      - apiGroups: ['']
        kinds: ['Namespace']
  parameters:
    labels: ['team', 'environment']

7.2 Kyverno

Kyverno is a Kubernetes-native policy engine that uses YAML + CEL, eliminating the need to learn a separate policy language. It supports three types of rules: validate, mutate, and generate.

ClusterPolicy Example (Required Resource Limits):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: 'CPU and memory resource limits are required.'
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: '?*'
                    memory: '?*'
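
A generate rule can, for instance, create a default-deny NetworkPolicy in every new namespace (a sketch; the policy name is illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-networkpolicy
spec:
  rules:
    - name: default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: '{{request.object.metadata.name}}' # the new namespace
        synchronize: true # keep the generated resource in sync
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress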

7.3 OPA/Gatekeeper vs Kyverno Comparison

Aspect           OPA/Gatekeeper                      Kyverno
Policy Language  Rego (dedicated language)           YAML + CEL
Learning Curve   High                                Low
Validate         Supported                           Supported
Mutate           Limited                             Fully supported
Generate         Limited                             Fully supported (cross-namespace sync)
Versatility      Multi-platform (usable beyond K8s)  Kubernetes only
CNCF Status      Graduated (OPA)                     Incubating

7.4 Supply Chain Security

  • SBOM (Software Bill of Materials): Generate and manage software component inventories
  • Container Image Scanning: Integrate Shift Left security into CI/CD pipelines
  • Falco: Runtime security threat detection (CNCF Graduated)
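
Policy engines tie into supply chain security as well. A Kyverno verifyImages rule requiring Cosign signatures might look like the sketch below (the registry path is a placeholder, and the public key is elided):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: check-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - 'ghcr.io/myorg/*' # placeholder registry path
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----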

8. 12-Week Study Plan

Week   Content
1-3    Kubernetes fundamentals review + GitOps principles + Argo CD/Flux hands-on
4-5    Crossplane XRD/Composition design + CRD/Operator development
6-7    Backstage setup + Software Template authoring
8-9    OpenTelemetry + Prometheus + Grafana stack configuration
10     OPA/Kyverno policy enforcement + security pipeline setup
11-12  Integrated platform lab + CNCF official resource review + Killer.sh mock exam

9. Essential Study Resources

Resource                             URL / Access
CNPE Official Page                   training.linuxfoundation.org/certification/cnpe
CNCF Curriculum (Open Source)        github.com/cncf/curriculum
CNCF Platforms White Paper           tag-app-delivery.cncf.io/whitepapers/platforms
Platform Engineering Maturity Model  tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model
Killer.sh Simulator                  2 sessions included with exam registration

Related Supplementary Certifications: CKA, CNPA, CGOA (Certified GitOps Associate), OTCA (OpenTelemetry Certified Associate), CBA (Certified Backstage Associate)

