- 1. What is CNPE?
- 2. Exam Domain Overview
- 3. Domain 1: GitOps and Continuous Delivery (25%)
- 4. Domain 2: Platform APIs and Self-Service (25%)
- 5. Domain 3: Observability and Operations (20%)
- 6. Domain 4: Platform Architecture and Infrastructure (15%)
- 7. Domain 5: Security and Policy Enforcement (15%)
- 8. Recommended Study Plan (12 Weeks)
- 9. Essential Study Resources
- References
1. What is CNPE?
CNPE (Certified Cloud Native Platform Engineer) is the highest-level certification officially announced by CNCF in November 2025. It validates advanced hands-on skills in designing and operating enterprise-scale Internal Developer Platforms (IDPs).
According to CNCF CTO Chris Aniszczyk, this certification verifies "production-level cloud native system capabilities spanning platform architecture, GitOps, Observability, security, and developer experience."
1.1 Exam Format
| Item | Details |
|---|---|
| Duration | 120 minutes |
| Format | 100% Performance-based (hands-on), online proctored |
| Environment | Linux-based remote desktop (terminal + web interface) |
| Open Book | kubernetes.io/docs and per-task Quick Reference documents allowed |
| Fee | $445 USD |
| Retake | 1 free retake included |
| Validity | 2 years |
| Simulator | Killer.sh 2 sessions included |
1.2 Prerequisites
There are no official mandatory prerequisites, but CNPA (Certified Cloud Native Platform Associate) or CKA-level Kubernetes management experience is strongly recommended.
1.3 Target Audience
- Experienced Platform Engineers
- Senior DevOps / SRE
- Platform Architects
- Infrastructure Engineers
2. Exam Domain Overview
The CNPE exam consists of 5 core domains.
| Domain | Weight |
|---|---|
| GitOps and Continuous Delivery | 25% |
| Platform APIs and Self-Service Capabilities | 25% |
| Observability and Operations | 20% |
| Platform Architecture and Infrastructure | 15% |
| Security and Policy Enforcement | 15% |
A frequently referenced architecture in the industry is the BACK stack: Backstage + Argo CD + Crossplane + Kyverno. However, since the exam is vendor-neutral, Argo CD can be substituted with Flux, and Kyverno with OPA/Gatekeeper.
3. Domain 1: GitOps and Continuous Delivery (25%)
3.1 Core GitOps Principles
These are the 4 core principles defined by OpenGitOps (opengitops.dev).
| Principle | Description |
|---|---|
| Declarative | The desired state of the system is expressed declaratively |
| Versioned and Immutable | The desired state is stored in Git with full change history tracking |
| Pulled Automatically | Agents automatically pull the desired state from the source (Pull, not Push) |
| Continuously Reconciled | Differences between actual and desired state are continuously detected and restored |
3.2 Argo CD Architecture
Argo CD is a declarative GitOps CD tool for Kubernetes, composed of 3 core components.
- API Server (argocd-server): A gRPC/REST server providing the Web UI and CLI APIs. Handles application management, RBAC enforcement, and Git webhook reception.
- Repository Server (argocd-repo-server): Maintains a local cache of Git repositories and generates Kubernetes manifests for a given revision and path.
- Application Controller (argocd-application-controller): Continuously monitors running applications, comparing live state against the target state in Git to detect OutOfSync conditions.
Application CRD Example:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: guestbook
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default
source:
repoURL: https://github.com/argoproj/argocd-example-apps.git
targetRevision: HEAD
path: guestbook
destination:
server: https://kubernetes.default.svc
namespace: guestbook
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Automatically revert manual changes
3.3 Argo CD Sync Policies
| Policy | Default | Description |
|---|---|---|
| automated | disabled | Automatic sync when OutOfSync is detected |
| prune | false | Remove resources deleted from Git from the cluster |
| selfHeal | false | Automatically revert manual cluster changes (drift) to Git state |
| allowEmpty | false | Whether to allow sync when there are 0 manifests |
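Beyond the automated block, sync behavior can be tuned with syncOptions and retry. A sketch combining common settings (field names follow the Argo CD docs; the values are illustrative):

syncPolicy:
  automated:
    prune: true
    selfHeal: true
    allowEmpty: false
  syncOptions:
    - CreateNamespace=true      # create the destination namespace if missing
    - ApplyOutOfSyncOnly=true   # only apply resources that are OutOfSync
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m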
3.4 Multi-cluster Deployment with ApplicationSet
ApplicationSet is a CRD that creates multiple Application resources from a single template. It dynamically generates parameters through Generators.
Key Generators:
| Generator | Description |
|---|---|
| Cluster | Auto-discover clusters registered in Argo CD |
| Git Directory | Generate apps from repository directory structure |
| Matrix | Combine parameters from two Generators (Cartesian Product) |
| Pull Request | Create preview environments for each open PR |
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: cluster-apps
namespace: argocd
spec:
generators:
- clusters:
selector:
matchLabels:
env: production
template:
metadata:
name: '{{.name}}-my-app'
spec:
project: default
source:
repoURL: https://github.com/myorg/apps.git
targetRevision: HEAD
path: deploy/production
destination:
server: '{{.server}}'
namespace: my-app
3.5 Flux CD
Flux is a CD tool built on the CNCF GitOps Toolkit, composed of 5 specialized controllers.
| Controller | CRD | Role |
|---|---|---|
| Source Controller | GitRepository, HelmRepository, OCIRepository | Fetch artifacts from Git/Helm/OCI |
| Kustomize Controller | Kustomization | Apply Kustomize overlays or plain YAML |
| Helm Controller | HelmRelease | Manage Helm chart lifecycle |
| Notification Controller | Provider, Alert, Receiver | Send notifications and process inbound webhooks |
| Image Automation | ImageRepository, ImagePolicy | Scan container registries and auto-update |
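The Kustomize Controller consumes artifacts fetched by the Source Controller. A typical source definition looks like this (URL, branch, and interval are illustrative):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 1m                              # how often to poll the repository
  url: https://github.com/myorg/apps.git
  ref:
    branch: main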
Flux Kustomization Example:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: my-app
namespace: flux-system
spec:
interval: 10m
sourceRef:
kind: GitRepository
name: my-app
path: ./deploy/production
prune: true
wait: true
dependsOn:
- name: cert-manager
- name: ingress-nginx
postBuild:
substitute:
CLUSTER_NAME: production
DOMAIN: example.com
3.6 Progressive Delivery: Argo Rollouts
Argo Rollouts automates progressive deployment strategies such as Canary and Blue-Green.
Canary Deployment Example:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: my-app
spec:
replicas: 10
strategy:
canary:
canaryService: my-app-canary
stableService: my-app-stable
steps:
- setWeight: 10
- pause: { duration: 5m }
- analysis:
templates:
- templateName: success-rate
- setWeight: 30
- pause: { duration: 5m }
- setWeight: 60
- pause: { duration: 5m }
- setWeight: 100
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: my-app
image: myapp:v2
Blue-Green Deployment Key Settings:
| Setting | Description |
|---|---|
| autoPromotionEnabled | Whether to auto-promote after preview (default: true) |
| autoPromotionSeconds | Wait time before automatic promotion |
| scaleDownDelaySeconds | Wait time before terminating previous-version Pods (default: 30s) |
| prePromotionAnalysis | Metric validation before traffic switch |
| postPromotionAnalysis | Post-switch validation; rollback on failure |
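These settings fit together in a Rollout like the following sketch (service and analysis template names are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false   # require manual promotion after preview
      scaleDownDelaySeconds: 60     # keep the old ReplicaSet briefly for fast rollback
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: myapp:v2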
4. Domain 2: Platform APIs and Self-Service (25%)
4.1 Building Self-Service Infrastructure with Crossplane
Crossplane extends Kubernetes into a universal Control Plane, enabling cloud infrastructure provisioning through the Kubernetes API.
Architecture: Providers expose Managed Resources; XRDs (CompositeResourceDefinitions) define the API schema for Composite Resources (XRs), and Compositions map XRs onto Managed Resources.
Composite Resource Definition (XRD):
apiVersion: apiextensions.crossplane.io/v2
kind: CompositeResourceDefinition
metadata:
name: mydatabases.example.org
spec:
scope: Namespaced
group: example.org
names:
kind: XMyDatabase
plural: mydatabases
versions:
- name: v1alpha1
served: true
referenceable: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
region:
type: string
size:
type: string
required:
- region
- size
Composition (Implementation Template):
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
name: example-database
spec:
compositeTypeRef:
apiVersion: example.org/v1alpha1
kind: XMyDatabase
mode: Pipeline
pipeline:
- step: patch-and-transform
functionRef:
name: function-patch-and-transform
input:
apiVersion: pt.fn.crossplane.io/v1beta1
kind: Resources
resources:
- name: rds-instance
base:
apiVersion: rds.aws.m.upbound.io/v1beta1
kind: Instance
spec:
forProvider:
region: us-east-2
engine: postgres
instanceClass: db.t3.micro
patches:
- type: FromCompositeFieldPath
fromFieldPath: spec.region
toFieldPath: spec.forProvider.region
The self-service pattern works as follows:
- The Platform Team defines XRDs (API schemas) and Compositions (implementations).
- Developers create Composite Resource (XR) instances in their own namespaces.
- Crossplane automatically provisions and manages the underlying cloud resources.
- Developers never interact directly with cloud provider APIs.
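With the XRD above in place, a developer request can be as small as a single namespaced XR (the name, namespace, and size value are illustrative):

apiVersion: example.org/v1alpha1
kind: XMyDatabase
metadata:
  name: orders-db
  namespace: team-a
spec:
  region: us-east-2
  size: small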
4.2 Backstage: Internal Developer Portal
Backstage (CNCF Incubating) is an open-source IDP framework developed by Spotify.
Core Features:
| Feature | Description |
|---|---|
| Software Catalog | Central registry of all software assets |
| Software Templates | Standardized project creation automation |
| TechDocs | Docs-like-code technical documentation |
| Plugin Architecture | Extensible plugin system |
Software Catalog Entity Definition:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
name: payment-service
description: Payment processing microservice
tags:
- java
- spring-boot
spec:
type: service
lifecycle: production
owner: payments-team
system: payment-platform
dependsOn:
- resource:default/payments-db
providesApis:
- payment-api
Software Template Example:
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
name: new-microservice
title: Create New Microservice
spec:
owner: platform-team
type: service
parameters:
- title: Service Details
required:
- name
- owner
properties:
name:
title: Service Name
type: string
owner:
title: Owner Team
type: string
ui:field: OwnerPicker
steps:
- id: fetchBase
name: Fetch Template
action: fetch:template
input:
url: ./skeleton
values:
name: ${{parameters.name}}
- id: publish
name: Publish to GitHub
action: publish:github
input:
repoUrl: github.com?owner=myorg&repo=${{parameters.name}}
- id: register
name: Register in Catalog
action: catalog:register
input:
repoContentsUrl: ${{steps.publish.output.repoContentsUrl}}
catalogInfoPath: /catalog-info.yaml
4.3 CRDs and the Operator Pattern
The foundation of Platform APIs is Kubernetes Custom Resource Definitions (CRDs) and the Operator pattern.
Operators encode domain-specific operational knowledge into Kubernetes controllers.
Control Loop:
1. Observe: Watch for Custom Resource changes
2. Analyze: Compare current state with desired state
3. Act: Create/update/delete dependent resources to reconcile state
Major Operator Frameworks:
| Framework | Language |
|---|---|
| Kubebuilder | Go |
| Operator SDK | Go, Ansible, Helm |
| Kopf | Python |
| kube-rs | Rust |
| Metacontroller | Any language (webhook-based) |
5. Domain 3: Observability and Operations (20%)
5.1 OpenTelemetry
OpenTelemetry (OTel) is a CNCF observability framework that provides unified collection of three core telemetry signals.
| Signal | Description | Use Case |
|---|---|---|
| Traces | Request path tracking across distributed systems | Understanding request flow between microservices |
| Metrics | Runtime measurements (Counter, Gauge, Histogram) | Performance trends and resource utilization monitoring |
| Logs | Timestamped event records | Debugging context at specific points in time |
OTel Collector Configuration:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
memory_limiter:
check_interval: 1s
limit_mib: 2000
batch:
timeout: 10s
exporters:
otlp/jaeger:
endpoint: jaeger-collector:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/jaeger]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
Kubernetes Auto-Instrumentation:
With the OTel Operator, automatic instrumentation is possible without code changes.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
name: demo-instrumentation
spec:
exporter:
endpoint: http://otel-collector:4318
propagators:
- tracecontext
- baggage
sampler:
type: parentbased_traceidratio
argument: '1'
Enable auto-instrumentation by adding annotations to Deployments:
metadata:
annotations:
instrumentation.opentelemetry.io/inject-java: 'true' # Java
instrumentation.opentelemetry.io/inject-python: 'true' # Python
instrumentation.opentelemetry.io/inject-nodejs: 'true' # Node.js
5.2 Prometheus and Grafana Stack
Prometheus Architecture:
| Component | Role |
|---|---|
| Prometheus Server | Time-series data collection and storage (Pull-based) |
| Alertmanager | Alert routing, deduplication, and dispatch |
| Exporters | Third-party system metric adapters |
| Service Discovery | Automatic target discovery via Kubernetes, Consul, DNS, etc. |
ServiceMonitor CRD (Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: my-app-monitor
labels:
release: prometheus
spec:
selector:
matchLabels:
app: my-app
endpoints:
- port: metrics
path: /metrics
interval: 30s
Essential PromQL Queries:
# Request rate per second (5-minute window)
rate(http_requests_total[5m])
# Sum by job
sum by (job) (rate(http_requests_total[5m]))
# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Grafana Observability Stack:
+-----------+
| Grafana | (Dashboards, Alerts)
+-----+-----+
|
+------------+------------+
| | |
+----+----+ +---+---+ +-----+-----+
| Mimir | | Loki | | Tempo |
| Metrics | | Logs | | Traces |
+----+----+ +---+---+ +-----+-----+
| | |
+------------+------------+
|
+--------+--------+
| OTel Collector |
+-----------------+
|
[Applications]
- Mimir: Horizontally scalable long-term metrics storage
- Loki: Lightweight log aggregation system with label-based indexing
- Tempo: Large-scale distributed tracing backend
5.3 SLI/SLO and Error Budgets
| Concept | Definition | Example |
|---|---|---|
| SLI | Quantitative measure of service performance | Request success rate, P99 latency |
| SLO | Target range for an SLI | "99.9% of requests complete within 200ms" |
| SLA | Contractual obligation when SLO is missed | "Service credits provided if availability falls below 99.95%" |
Error Budget Calculation:
Error Budget = 1 - SLO
SLO 99.9% -> Error Budget 0.1% -> about 43.2 minutes of allowed downtime per 30-day month (0.001 x 43,200 min)
SLO 99.99% -> Error Budget 0.01% -> about 4.32 minutes per 30-day month
Error Budget Policy:
| Burn Rate | Action |
|---|---|
| 0-50% (Green) | Proceed with normal feature development |
| 50-80% (Yellow) | Increase focus on stability work and code reviews |
| 80-100% (Red) | Feature freeze, full focus on stability work |
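In practice, a policy like this is driven by burn-rate alerts. A sketch as a Prometheus Operator rule (the metric names are illustrative; the 14.4 multiplier follows the multiwindow burn-rate approach in the Google SRE Workbook, under which a 14.4x burn over 1h exhausts a 30-day 99.9% budget in about two days):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate
  labels:
    release: prometheus
spec:
  groups:
    - name: slo-alerts
      rules:
        - alert: ErrorBudgetBurnFast
          # error ratio over the last hour exceeds 14.4x the 0.1% budget
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
              > 14.4 * 0.001
          for: 5m
          labels:
            severity: critical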
5.4 DORA Metrics
Core metrics for measuring platform efficiency.
| Metric | Description |
|---|---|
| Deployment Frequency | How often deployments occur |
| Lead Time for Changes | Time from code commit to production deployment |
| Change Failure Rate | Percentage of deployments causing failures |
| Mean Time to Recovery | Average time from failure occurrence to resolution |
6. Domain 4: Platform Architecture and Infrastructure (15%)
6.1 Multi-tenancy Patterns
| Pattern | Isolation Level | Suitable Scenario |
|---|---|---|
| Namespace-based | Logical isolation | Trusted internal teams |
| Cluster-based | Physical isolation | Strong security requirements, regulated environments |
| Hybrid | Mixed | Differentiated isolation per environment |
Namespace-based isolation tools:
# Resource limits with ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-a-quota
namespace: team-a
spec:
hard:
requests.cpu: '10'
requests.memory: 20Gi
limits.cpu: '20'
limits.memory: 40Gi
pods: '50'
---
# Network isolation with NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-all
namespace: team-a
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
6.2 Cost Management: OpenCost
OpenCost (CNCF Sandbox) is an open-source tool that provides Kubernetes cost visibility and allocation. It tracks costs by namespace, team, and service, and supports resource right-sizing.
6.3 Autoscaling Strategies
| Scaler | Target | Criteria |
|---|---|---|
| HPA | Pod horizontal scaling | CPU/memory/custom metrics |
| VPA | Pod resource request adjustment | Actual usage analysis |
| Cluster Autoscaler | Node horizontal scaling | Pending Pod detection |
| KEDA | Event-driven scaling | Queue length, HTTP requests, etc. |
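As a concrete example, an HPA targeting 70% average CPU utilization might look like this (the Deployment name and replica bounds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests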
7. Domain 5: Security and Policy Enforcement (15%)
7.1 OPA/Gatekeeper
OPA Gatekeeper operates with two resources: ConstraintTemplate (policy logic defined in Rego) and Constraint (specifying policy targets).
ConstraintTemplate Example (Required Label Validation):
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8srequiredlabels
spec:
crd:
spec:
names:
kind: K8sRequiredLabels
validation:
openAPIV3Schema:
type: object
properties:
labels:
type: array
items:
type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredlabels
violation[{"msg": msg, "details": {"missing_labels": missing}}] {
provided := {label | input.review.object.metadata.labels[label]}
required := {label | label := input.parameters.labels[_]}
missing := required - provided
count(missing) > 0
msg := sprintf("you must provide labels: %v", [missing])
}
Constraint Application:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
name: ns-must-have-team
spec:
match:
kinds:
- apiGroups: ['']
kinds: ['Namespace']
parameters:
labels: ['team', 'environment']
7.2 Kyverno
Kyverno is a Kubernetes-native policy engine that uses YAML + CEL, eliminating the need to learn a separate policy language. It supports three types of rules: validate, mutate, and generate.
ClusterPolicy Example (Required Resource Limits):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-resource-limits
spec:
validationFailureAction: Enforce
rules:
- name: check-limits
match:
any:
- resources:
kinds:
- Pod
validate:
message: 'CPU and memory resource limits are required.'
pattern:
spec:
containers:
- resources:
limits:
cpu: '?*'
memory: '?*'
7.3 OPA/Gatekeeper vs Kyverno Comparison
| Aspect | OPA/Gatekeeper | Kyverno |
|---|---|---|
| Policy Language | Rego (dedicated language) | YAML + CEL |
| Learning Curve | High | Low |
| Validate | Supported | Supported |
| Mutate | Limited | Fully supported |
| Generate | Limited | Fully supported (cross-namespace sync) |
| Versatility | Multi-platform (usable beyond K8s) | Kubernetes only |
| CNCF Status | Graduated (OPA) | Incubating |
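The generate capability noted in the table can, for example, clone a registry credential into every new namespace. A sketch (secret and namespace names are illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: sync-registry-secret
spec:
  rules:
    - name: clone-registry-secret
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: v1
        kind: Secret
        name: registry-creds
        namespace: '{{request.object.metadata.name}}'  # the newly created namespace
        synchronize: true   # keep the copy in sync with the source secret
        clone:
          namespace: default
          name: registry-creds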
7.4 Supply Chain Security
- SBOM (Software Bill of Materials): Generate and manage software component inventories
- Container Image Scanning: Integrate Shift Left security into CI/CD pipelines
- Falco: Runtime security threat detection (CNCF Graduated)
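In Kubernetes, image signature checks are commonly enforced at admission time. A sketch using Kyverno's verifyImages rule (the registry pattern is a placeholder, and the public key block must be replaced with your own cosign key):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-signatures
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - 'registry.example.com/*'
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <cosign public key here>
                      -----END PUBLIC KEY-----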
8. Recommended Study Plan (12 Weeks)
| Week | Content |
|---|---|
| 1-3 | Kubernetes fundamentals review + GitOps principles + ArgoCD/Flux hands-on |
| 4-5 | Crossplane XRD/Composition design + CRD/Operator development |
| 6-7 | Backstage setup + Software Template authoring |
| 8-9 | OpenTelemetry + Prometheus + Grafana stack configuration |
| 10 | OPA/Kyverno policy enforcement + security pipeline setup |
| 11-12 | Integrated platform lab + CNCF official resource review + Killer.sh mock exam |
9. Essential Study Resources
| Resource | URL |
|---|---|
| CNPE Official Page | training.linuxfoundation.org/certification/cnpe |
| CNCF Curriculum (Open Source) | github.com/cncf/curriculum |
| CNCF Platforms White Paper | tag-app-delivery.cncf.io/whitepapers/platforms |
| Platform Engineering Maturity Model | tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model |
| Killer.sh Simulator | 2 sessions included with exam registration |
Related Supplementary Certifications: CKA, CNPA, CGOA (Certified GitOps Associate), OTCA (OpenTelemetry Certified Associate), CBA (Certified Backstage Associate)
References
- CNCF CNPE Certification. https://www.cncf.io/training/certification/cnpe/
- Linux Foundation CNPE Page. https://training.linuxfoundation.org/certification/certified-cloud-native-platform-engineer-cnpe/
- CNCF Curriculum Repository. https://github.com/cncf/curriculum
- CNCF Platforms White Paper. https://tag-app-delivery.cncf.io/whitepapers/platforms/
- Argo CD Documentation. https://argo-cd.readthedocs.io/en/stable/
- Flux CD Documentation. https://fluxcd.io/flux/
- Argo Rollouts Documentation. https://argo-rollouts.readthedocs.io/en/stable/
- OpenGitOps Principles. https://opengitops.dev/
- Crossplane Documentation. https://docs.crossplane.io/latest/
- Backstage Documentation. https://backstage.io/docs/
- OpenTelemetry Documentation. https://opentelemetry.io/docs/
- Prometheus Documentation. https://prometheus.io/docs/
- Grafana Loki Documentation. https://grafana.com/docs/loki/latest/
- OPA Gatekeeper Documentation. https://open-policy-agent.github.io/gatekeeper/website/docs/
- Kyverno Documentation. https://kyverno.io/docs/
- Google SRE Book - Service Level Objectives. https://sre.google/sre-book/service-level-objectives/
- Google SRE Workbook - Error Budget Policy. https://sre.google/workbook/error-budget-policy/