- 1. What is CNPE?
- 2. Exam Domain Overview
- 3. Domain 1: GitOps and Continuous Delivery (25%)
- 4. Domain 2: Platform APIs and Self-Service (25%)
- 5. Domain 3: Observability and Operations (20%)
- 6. Domain 4: Platform Architecture and Infrastructure (15%)
- 7. Domain 5: Security and Policy Enforcement (15%)
- 8. Recommended Study Plan (12 Weeks)
- 9. Essential Study Resources
- References
1. What is CNPE?
CNPE (Certified Cloud Native Platform Engineer) is the highest-level certification officially announced by CNCF in November 2025. It validates advanced hands-on skills in designing and operating enterprise-scale Internal Developer Platforms (IDPs).
According to CNCF CTO Chris Aniszczyk, this certification verifies "production-level cloud native system capabilities spanning platform architecture, GitOps, Observability, security, and developer experience."
1.1 Exam Format
| Item | Details |
|---|---|
| Duration | 120 minutes |
| Format | 100% Performance-based (hands-on), online proctored |
| Environment | Linux-based remote desktop (terminal + web interface) |
| Open Book | kubernetes.io/docs and per-task Quick Reference documents allowed |
| Fee | $445 USD |
| Retake | 1 free retake included |
| Validity | 2 years |
| Simulator | Killer.sh 2 sessions included |
1.2 Prerequisites
There are no official mandatory prerequisites, but CNPA (Certified Cloud Native Platform Associate) or CKA-level Kubernetes management experience is strongly recommended.
1.3 Target Audience
- Experienced Platform Engineers
- Senior DevOps / SRE
- Platform Architects
- Infrastructure Engineers
2. Exam Domain Overview
The CNPE exam consists of 5 core domains.
| Domain | Weight |
|---|---|
| GitOps and Continuous Delivery | 25% |
| Platform APIs and Self-Service Capabilities | 25% |
| Observability and Operations | 20% |
| Platform Architecture and Infrastructure | 15% |
| Security and Policy Enforcement | 15% |
A frequently referenced architecture in the industry is the BACK stack: Backstage + Argo CD + Crossplane + Kyverno. However, since the exam is vendor-neutral, Argo CD can be substituted with Flux, and Kyverno with OPA/Gatekeeper.
3. Domain 1: GitOps and Continuous Delivery (25%)
3.1 Core GitOps Principles
These are the 4 core principles defined by OpenGitOps (opengitops.dev).
| Principle | Description |
|---|---|
| Declarative | The desired state of the system is expressed declaratively |
| Versioned and Immutable | The desired state is stored in Git with full change history tracking |
| Pulled Automatically | Agents automatically pull the desired state from the source (Pull, not Push) |
| Continuously Reconciled | Differences between actual and desired state are continuously detected and restored |
3.2 Argo CD Architecture
Argo CD is a declarative GitOps CD tool for Kubernetes, composed of 3 core components.
- API Server (argocd-server): A gRPC/REST server providing the Web UI and CLI APIs. Handles application management, RBAC enforcement, and Git webhook reception.
- Repository Server (argocd-repo-server): Maintains a local cache of Git repositories and generates Kubernetes manifests for a given revision and path.
- Application Controller (argocd-application-controller): Continuously monitors running applications, comparing live state against the target state in Git to detect OutOfSync conditions.
Application CRD Example:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: guestbook
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default
source:
repoURL: https://github.com/argoproj/argocd-example-apps.git
targetRevision: HEAD
path: guestbook
destination:
server: https://kubernetes.default.svc
namespace: guestbook
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Automatically revert manual changes
3.3 Argo CD Sync Policies
| Policy | Default | Description |
|---|---|---|
| automated | disabled | Automatic sync when OutOfSync is detected |
| prune | false | Remove resources deleted from Git from the cluster |
| selfHeal | false | Automatically revert manual cluster changes (drift) to Git state |
| allowEmpty | false | Whether to allow sync when there are 0 manifests |
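Beyond the automated block, sync behavior can be tuned with syncOptions and retry. A sketch combining common settings (field names follow the Argo CD docs; the values are illustrative):

syncPolicy:
  automated:
    prune: true
    selfHeal: true
    allowEmpty: false
  syncOptions:
    - CreateNamespace=true      # create the destination namespace if missing
    - ApplyOutOfSyncOnly=true   # only apply resources that are OutOfSync
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m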
3.4 Multi-cluster Deployment with ApplicationSet
ApplicationSet is a CRD that creates multiple Application resources from a single template. It dynamically generates parameters through Generators.
Key Generators:
| Generator | Description |
|---|---|
| Cluster | Auto-discover clusters registered in Argo CD |
| Git Directory | Generate apps from repository directory structure |
| Matrix | Combine parameters from two Generators (Cartesian Product) |
| Pull Request | Create preview environments for each open PR |
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: cluster-apps
namespace: argocd
spec:
generators:
- clusters:
selector:
matchLabels:
env: production
template:
metadata:
name: '{{.name}}-my-app'
spec:
project: default
source:
repoURL: https://github.com/myorg/apps.git
targetRevision: HEAD
path: deploy/production
destination:
server: '{{.server}}'
namespace: my-app
3.5 Flux CD
Flux is a CD tool built on the CNCF GitOps Toolkit, composed of 5 specialized controllers.
| Controller | CRD | Role |
|---|---|---|
| Source Controller | GitRepository, HelmRepository, OCIRepository | Fetch artifacts from Git/Helm/OCI |
| Kustomize Controller | Kustomization | Apply Kustomize overlays or plain YAML |
| Helm Controller | HelmRelease | Manage Helm chart lifecycle |
| Notification Controller | Provider, Alert, Receiver | Send notifications and process inbound webhooks |
| Image Automation | ImageRepository, ImagePolicy | Scan container registries and auto-update |
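The Kustomize Controller consumes artifacts fetched by the Source Controller. A typical source definition looks like this (URL, branch, and interval are illustrative):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 1m                              # how often to poll the repository
  url: https://github.com/myorg/apps.git
  ref:
    branch: main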
Flux Kustomization Example:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: my-app
namespace: flux-system
spec:
interval: 10m
sourceRef:
kind: GitRepository
name: my-app
path: ./deploy/production
prune: true
wait: true
dependsOn:
- name: cert-manager
- name: ingress-nginx
postBuild:
substitute:
CLUSTER_NAME: production
DOMAIN: example.com
3.6 Progressive Delivery: Argo Rollouts
Argo Rollouts automates progressive deployment strategies such as Canary and Blue-Green.
Canary Deployment Example:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: my-app
spec:
replicas: 10
strategy:
canary:
canaryService: my-app-canary
stableService: my-app-stable
steps:
- setWeight: 10
- pause: { duration: 5m }
- analysis:
templates:
- templateName: success-rate
- setWeight: 30
- pause: { duration: 5m }
- setWeight: 60
- pause: { duration: 5m }
- setWeight: 100
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: my-app
image: myapp:v2
Blue-Green Deployment Key Settings:
| Setting | Description |
|---|---|
| autoPromotionEnabled | Whether to auto-promote after preview (default: true) |
| autoPromotionSeconds | Wait time before automatic promotion |
| scaleDownDelaySeconds | Wait time before terminating previous-version Pods (default: 30s) |
| prePromotionAnalysis | Metric validation before traffic switch |
| postPromotionAnalysis | Post-switch validation; rollback on failure |
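These settings fit together in a Rollout like the following sketch (service and analysis template names are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false   # require manual promotion after preview
      scaleDownDelaySeconds: 60     # keep the old ReplicaSet briefly for fast rollback
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: myapp:v2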
4. Domain 2: Platform APIs and Self-Service (25%)
4.1 Building Self-Service Infrastructure with Crossplane
Crossplane extends Kubernetes into a universal Control Plane, enabling cloud infrastructure provisioning through the Kubernetes API.
Architecture: Providers expose Managed Resources; XRDs (CompositeResourceDefinitions) define the API schema for Composite Resources (XRs), and Compositions map XRs onto Managed Resources.
Composite Resource Definition (XRD):
apiVersion: apiextensions.crossplane.io/v2
kind: CompositeResourceDefinition
metadata:
name: mydatabases.example.org
spec:
scope: Namespaced
group: example.org
names:
kind: XMyDatabase
plural: mydatabases
versions:
- name: v1alpha1
served: true
referenceable: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
region:
type: string
size:
type: string
required:
- region
- size
Composition (Implementation Template):
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
name: example-database
spec:
compositeTypeRef:
apiVersion: example.org/v1alpha1
kind: XMyDatabase
mode: Pipeline
pipeline:
- step: patch-and-transform
functionRef:
name: function-patch-and-transform
input:
apiVersion: pt.fn.crossplane.io/v1beta1
kind: Resources
resources:
- name: rds-instance
base:
apiVersion: rds.aws.m.upbound.io/v1beta1
kind: Instance
spec:
forProvider:
region: us-east-2
engine: postgres
instanceClass: db.t3.micro
patches:
- type: FromCompositeFieldPath
fromFieldPath: spec.region
toFieldPath: spec.forProvider.region
The self-service pattern works as follows:
- The Platform Team defines XRDs (API schemas) and Compositions (implementations).
- Developers create Composite Resource (XR) instances in their own namespaces.
- Crossplane automatically provisions and manages the underlying cloud resources.
- Developers never interact directly with cloud provider APIs.
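With the XRD above in place, a developer request can be as small as a single namespaced XR (the name, namespace, and size value are illustrative):

apiVersion: example.org/v1alpha1
kind: XMyDatabase
metadata:
  name: orders-db
  namespace: team-a
spec:
  region: us-east-2
  size: small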
4.2 Backstage: Internal Developer Portal
Backstage (CNCF Incubating) is an open-source IDP framework developed by Spotify.
Core Features:
| Feature | Description |
|---|---|
| Software Catalog | Central registry of all software assets |
| Software Templates | Standardized project creation automation |
| TechDocs | Docs-like-code technical documentation |
| Plugin Architecture | Extensible plugin system |
Software Catalog Entity Definition:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
name: payment-service
description: Payment processing microservice
tags:
- java
- spring-boot
spec:
type: service
lifecycle: production
owner: payments-team
system: payment-platform
dependsOn:
- resource:default/payments-db
providesApis:
- payment-api
Software Template Example:
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
name: new-microservice
title: Create New Microservice
spec:
owner: platform-team
type: service
parameters:
- title: Service Details
required:
- name
- owner
properties:
name:
title: Service Name
type: string
owner:
title: Owner Team
type: string
ui:field: OwnerPicker
steps:
- id: fetchBase
name: Fetch Template
action: fetch:template
input:
url: ./skeleton
values:
name: ${{parameters.name}}
- id: publish
name: Publish to GitHub
action: publish:github
input:
repoUrl: github.com?owner=myorg&repo=${{parameters.name}}
- id: register
name: Register in Catalog
action: catalog:register
input:
repoContentsUrl: ${{steps.publish.output.repoContentsUrl}}
catalogInfoPath: /catalog-info.yaml
4.3 CRDs and the Operator Pattern
The foundation of Platform APIs is Kubernetes Custom Resource Definitions (CRDs) and the Operator pattern.
Operators encode domain-specific operational knowledge into Kubernetes controllers.
Control Loop:
1. Observe: Watch for Custom Resource changes
2. Analyze: Compare current state with desired state
3. Act: Create/update/delete dependent resources to reconcile state
Major Operator Frameworks:
| Framework | Language |
|---|---|
| Kubebuilder | Go |
| Operator SDK | Go, Ansible, Helm |
| Kopf | Python |
| kube-rs | Rust |
| Metacontroller | Any language (webhook-based) |
5. Domain 3: Observability and Operations (20%)
5.1 OpenTelemetry
OpenTelemetry (OTel) is a CNCF observability framework that provides unified collection of three core telemetry signals.
| Signal | Description | Use Case |
|---|---|---|
| Traces | Request path tracking across distributed systems | Understanding request flow between microservices |
| Metrics | Runtime measurements (Counter, Gauge, Histogram) | Performance trends and resource utilization monitoring |
| Logs | Timestamped event records | Debugging context at specific points in time |
OTel Collector Configuration:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
memory_limiter:
check_interval: 1s
limit_mib: 2000
batch:
timeout: 10s
exporters:
otlp/jaeger:
endpoint: jaeger-collector:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/jaeger]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
Kubernetes Auto-Instrumentation:
With the OTel Operator, automatic instrumentation is possible without code changes.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
name: demo-instrumentation
spec:
exporter:
endpoint: http://otel-collector:4318
propagators:
- tracecontext
- baggage
sampler:
type: parentbased_traceidratio
argument: '1'
Enable auto-instrumentation by adding annotations to Deployments:
metadata:
annotations:
instrumentation.opentelemetry.io/inject-java: 'true' # Java
instrumentation.opentelemetry.io/inject-python: 'true' # Python
instrumentation.opentelemetry.io/inject-nodejs: 'true' # Node.js
5.2 Prometheus and Grafana Stack
Prometheus Architecture:
| Component | Role |
|---|---|
| Prometheus Server | Time-series data collection and storage (Pull-based) |
| Alertmanager | Alert routing, deduplication, and dispatch |
| Exporters | Third-party system metric adapters |
| Service Discovery | Automatic target discovery via Kubernetes, Consul, DNS, etc. |
ServiceMonitor CRD (Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: my-app-monitor
labels:
release: prometheus
spec:
selector:
matchLabels:
app: my-app
endpoints:
- port: metrics
path: /metrics
interval: 30s
Essential PromQL Queries:
# Request rate per second (5-minute window)
rate(http_requests_total[5m])
# Sum by job
sum by (job) (rate(http_requests_total[5m]))
# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Grafana Observability Stack:
+-----------+
| Grafana | (Dashboards, Alerts)
+-----+-----+
|
+------------+------------+
| | |
+----+----+ +---+---+ +-----+-----+
| Mimir | | Loki | | Tempo |
| Metrics | | Logs | | Traces |
+----+----+ +---+---+ +-----+-----+
| | |
+------------+------------+
|
+--------+--------+
| OTel Collector |
+-----------------+
|
[Applications]
- Mimir: Horizontally scalable long-term metrics storage
- Loki: Lightweight log aggregation system with label-based indexing
- Tempo: Large-scale distributed tracing backend
5.3 SLI/SLO and Error Budgets
| Concept | Definition | Example |
|---|---|---|
| SLI | Quantitative measure of service performance | Request success rate, P99 latency |
| SLO | Target range for an SLI | "99.9% of requests complete within 200ms" |
| SLA | Contractual obligation when SLO is missed | "Service credits provided if availability falls below 99.95%" |
Error Budget Calculation:
Error Budget = 1 - SLO
SLO 99.9% -> Error Budget 0.1% -> about 43.2 minutes of allowed downtime per 30-day month (0.001 x 43,200 min)
SLO 99.99% -> Error Budget 0.01% -> about 4.32 minutes per 30-day month
Error Budget Policy:
| Burn Rate | Action |
|---|---|
| 0-50% (Green) | Proceed with normal feature development |
| 50-80% (Yellow) | Increase focus on stability work and code reviews |
| 80-100% (Red) | Feature freeze, full focus on stability work |
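In practice, a policy like this is driven by burn-rate alerts. A sketch as a Prometheus Operator rule (the metric names are illustrative; the 14.4 multiplier follows the multiwindow burn-rate approach in the Google SRE Workbook, under which a 14.4x burn over 1h exhausts a 30-day 99.9% budget in about two days):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate
  labels:
    release: prometheus
spec:
  groups:
    - name: slo-alerts
      rules:
        - alert: ErrorBudgetBurnFast
          # error ratio over the last hour exceeds 14.4x the 0.1% budget
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
              > 14.4 * 0.001
          for: 5m
          labels:
            severity: critical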
5.4 DORA Metrics
Core metrics for measuring platform efficiency.
| Metric | Description |
|---|---|
| Deployment Frequency | How often deployments occur |
| Lead Time for Changes | Time from code commit to production deployment |
| Change Failure Rate | Percentage of deployments causing failures |
| Mean Time to Recovery | Average time from failure occurrence to resolution |
6. Domain 4: Platform Architecture and Infrastructure (15%)
6.1 Multi-tenancy Patterns
| Pattern | Isolation Level | Suitable Scenario |
|---|---|---|
| Namespace-based | Logical isolation | Trusted internal teams |
| Cluster-based | Physical isolation | Strong security requirements, regulated environments |
| Hybrid | Mixed | Differentiated isolation per environment |
Namespace-based isolation tools:
# Resource limits with ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-a-quota
namespace: team-a
spec:
hard:
requests.cpu: '10'
requests.memory: 20Gi
limits.cpu: '20'
limits.memory: 40Gi
pods: '50'
---
# Network isolation with NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-all
namespace: team-a
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
6.2 Cost Management: OpenCost
OpenCost (CNCF Sandbox) is an open-source tool that provides Kubernetes cost visibility and allocation. It tracks costs by namespace, team, and service, and supports resource right-sizing.
6.3 Autoscaling Strategies
| Scaler | Target | Criteria |
|---|---|---|
| HPA | Pod horizontal scaling | CPU/memory/custom metrics |
| VPA | Pod resource request adjustment | Actual usage analysis |
| Cluster Autoscaler | Node horizontal scaling | Pending Pod detection |
| KEDA | Event-driven scaling | Queue length, HTTP requests, etc. |
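As a concrete example, an HPA targeting 70% average CPU utilization might look like this (the Deployment name and replica bounds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests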
7. Domain 5: Security and Policy Enforcement (15%)
7.1 OPA/Gatekeeper
OPA Gatekeeper operates with two resources: ConstraintTemplate (policy logic defined in Rego) and Constraint (specifying policy targets).
ConstraintTemplate Example (Required Label Validation):
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8srequiredlabels
spec:
crd:
spec:
names:
kind: K8sRequiredLabels
validation:
openAPIV3Schema:
type: object
properties:
labels:
type: array
items:
type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredlabels
violation[{"msg": msg, "details": {"missing_labels": missing}}] {
provided := {label | input.review.object.metadata.labels[label]}
required := {label | label := input.parameters.labels[_]}
missing := required - provided
count(missing) > 0
msg := sprintf("you must provide labels: %v", [missing])
}
Constraint Application:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
name: ns-must-have-team
spec:
match:
kinds:
- apiGroups: ['']
kinds: ['Namespace']
parameters:
labels: ['team', 'environment']
7.2 Kyverno
Kyverno is a Kubernetes-native policy engine that uses YAML + CEL, eliminating the need to learn a separate policy language. It supports three types of rules: validate, mutate, and generate.
ClusterPolicy Example (Required Resource Limits):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-resource-limits
spec:
validationFailureAction: Enforce
rules:
- name: check-limits
match:
any:
- resources:
kinds:
- Pod
validate:
message: 'CPU and memory resource limits are required.'
pattern:
spec:
containers:
- resources:
limits:
cpu: '?*'
memory: '?*'
7.3 OPA/Gatekeeper vs Kyverno Comparison
| Aspect | OPA/Gatekeeper | Kyverno |
|---|---|---|
| Policy Language | Rego (dedicated language) | YAML + CEL |
| Learning Curve | High | Low |
| Validate | Supported | Supported |
| Mutate | Limited | Fully supported |
| Generate | Limited | Fully supported (cross-namespace sync) |
| Versatility | Multi-platform (usable beyond K8s) | Kubernetes only |
| CNCF Status | Graduated (OPA) | Incubating |
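The generate capability noted in the table can, for example, clone a registry credential into every new namespace. A sketch (secret and namespace names are illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: sync-registry-secret
spec:
  rules:
    - name: clone-registry-secret
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: v1
        kind: Secret
        name: registry-creds
        namespace: '{{request.object.metadata.name}}'  # the newly created namespace
        synchronize: true   # keep the copy in sync with the source secret
        clone:
          namespace: default
          name: registry-creds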
7.4 Supply Chain Security
- SBOM (Software Bill of Materials): Generate and manage software component inventories
- Container Image Scanning: Integrate Shift Left security into CI/CD pipelines
- Falco: Runtime security threat detection (CNCF Graduated)
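In Kubernetes, image signature checks are commonly enforced at admission time. A sketch using Kyverno's verifyImages rule (the registry pattern is a placeholder, and the public key block must be replaced with your own cosign key):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-signatures
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - 'registry.example.com/*'
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <cosign public key here>
                      -----END PUBLIC KEY-----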
8. Recommended Study Plan (12 Weeks)
| Week | Content |
|---|---|
| 1-3 | Kubernetes fundamentals review + GitOps principles + ArgoCD/Flux hands-on |
| 4-5 | Crossplane XRD/Composition design + CRD/Operator development |
| 6-7 | Backstage setup + Software Template authoring |
| 8-9 | OpenTelemetry + Prometheus + Grafana stack configuration |
| 10 | OPA/Kyverno policy enforcement + security pipeline setup |
| 11-12 | Integrated platform lab + CNCF official resource review + Killer.sh mock exam |
9. Essential Study Resources
| Resource | URL |
|---|---|
| CNPE Official Page | training.linuxfoundation.org/certification/cnpe |
| CNCF Curriculum (Open Source) | github.com/cncf/curriculum |
| CNCF Platforms White Paper | tag-app-delivery.cncf.io/whitepapers/platforms |
| Platform Engineering Maturity Model | tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model |
| Killer.sh Simulator | 2 sessions included with exam registration |
Related Supplementary Certifications: CKA, CNPA, CGOA (Certified GitOps Associate), OTCA (OpenTelemetry Certified Associate), CBA (Certified Backstage Associate)
References
- CNCF CNPE Certification. https://www.cncf.io/training/certification/cnpe/
- Linux Foundation CNPE Page. https://training.linuxfoundation.org/certification/certified-cloud-native-platform-engineer-cnpe/
- CNCF Curriculum Repository. https://github.com/cncf/curriculum
- CNCF Platforms White Paper. https://tag-app-delivery.cncf.io/whitepapers/platforms/
- Argo CD Documentation. https://argo-cd.readthedocs.io/en/stable/
- Flux CD Documentation. https://fluxcd.io/flux/
- Argo Rollouts Documentation. https://argo-rollouts.readthedocs.io/en/stable/
- OpenGitOps Principles. https://opengitops.dev/
- Crossplane Documentation. https://docs.crossplane.io/latest/
- Backstage Documentation. https://backstage.io/docs/
- OpenTelemetry Documentation. https://opentelemetry.io/docs/
- Prometheus Documentation. https://prometheus.io/docs/
- Grafana Loki Documentation. https://grafana.com/docs/loki/latest/
- OPA Gatekeeper Documentation. https://open-policy-agent.github.io/gatekeeper/website/docs/
- Kyverno Documentation. https://kyverno.io/docs/
- Google SRE Book - Service Level Objectives. https://sre.google/sre-book/service-level-objectives/
- Google SRE Workbook - Error Budget Policy. https://sre.google/workbook/error-budget-policy/