Complete Guide to Kubernetes StatefulSet and Persistent Volume: CSI Driver, StorageClass, and Dynamic Provisioning in Production
- Introduction
- StatefulSet Core Concepts
- PV/PVC/StorageClass Architecture
- CSI Driver Deep Dive
- Dynamic Provisioning in Practice
- Volume Expansion and Snapshots
- Troubleshooting Scenarios
- Operations Checklist
- Conclusion
- References

Introduction
While stateless workloads in Kubernetes can be managed relatively easily with Deployments and ReplicaSets, stateful workloads like databases, message queues, and distributed caches introduce an entirely different level of complexity. Data must survive Pod restarts and node migrations, each Pod needs a unique network identity with stable storage, and scaling operations must respect ordering guarantees.
Kubernetes provides a rich storage ecosystem to address these requirements: StatefulSet, PersistentVolume (PV), PersistentVolumeClaim (PVC), StorageClass, and the Container Storage Interface (CSI) Driver. However, without a deep understanding of how these components interact, production environments are prone to data loss, stuck volumes, and provisioning failures.
This guide begins with the core mechanics of StatefulSet, then dives deep into the PV/PVC/StorageClass architecture, examines CSI Driver internals and compares major implementations (EBS CSI, EFS CSI, Ceph CSI, Longhorn), walks through dynamic provisioning, volume expansion and snapshots, real-world troubleshooting scenarios, and concludes with a production operations checklist.
StatefulSet Core Concepts
How StatefulSet Differs from Deployment
A Deployment treats Pods as interchangeable, identical units. Pod names include random hashes, scheduling order is not guaranteed, and any deleted Pod is immediately replaced by a functionally identical one. StatefulSet, in contrast, assigns each Pod an ordinal index, binding network identity and storage to the Pod's identity.
StatefulSet provides these core guarantees:
- Stable Network Identity: Pod names follow the pattern `statefulset-name-0`, `statefulset-name-1`, and a Headless Service creates unique DNS records for each Pod.
- Stable Storage: volumeClaimTemplates automatically create a dedicated PVC for each Pod, and the same PVC is reattached even after rescheduling.
- Ordered Deployment: Pods are created sequentially from index 0, and deleted in reverse order.
- Ordered Rolling Updates: Updates proceed in reverse ordinal order, starting from the highest index.
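For a StatefulSet, each replica's DNS record follows the pattern `<pod>.<service>.<namespace>.svc.cluster.local`. A quick way to verify this from inside the cluster, sketched here against the PostgreSQL example in this guide (the throwaway `dns-check` Pod name is illustrative):

```bash
# Resolve a replica's stable DNS record from a temporary in-cluster Pod
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup postgresql-0.postgresql-headless.database.svc.cluster.local
```

Each replica keeps this name across restarts and rescheduling, which is what allows replication topologies (primary/replica, quorum peers) to address members directly.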
StatefulSet Specification
Here is a production-ready example of a PostgreSQL StatefulSet with 3 replicas.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: database
spec:
  serviceName: postgresql-headless
  replicas: 3
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      terminationGracePeriodSeconds: 120
      securityContext:
        fsGroup: 999
        runAsUser: 999
      containers:
        - name: postgresql
          image: postgres:16.2-alpine
          ports:
            - containerPort: 5432
              name: postgresql
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgresql-secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 4Gi
          livenessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 5
            periodSeconds: 5
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql-headless
  namespace: database
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: postgresql
  ports:
    - port: 5432
      targetPort: postgresql
```
Key points in this manifest:
- volumeClaimTemplates: The StatefulSet automatically creates PVCs named `data-postgresql-0`, `data-postgresql-1`, `data-postgresql-2` for each Pod.
- podManagementPolicy: OrderedReady: The default policy ensures Pod 0 reaches Ready state before Pod 1 is created. Set to `Parallel` for concurrent startup when ordering is not required.
- terminationGracePeriodSeconds: 120: Databases need adequate time for clean shutdown procedures.
- fsGroup: 999: Sets filesystem group ownership on the PV mount to match the PostgreSQL user.
PVC Retention Policy
Since Kubernetes 1.27 (where the StatefulSetAutoDeletePVC feature reached beta; it went GA in 1.32), StatefulSet PVC retention behavior can be fine-tuned. By default, PVCs are not deleted when a StatefulSet is deleted or scaled down (to protect data). To change this behavior, use persistentVolumeClaimRetentionPolicy.
```yaml
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Retain
```
- whenDeleted: Delete: PVCs are deleted when the StatefulSet itself is deleted.
- whenScaled: Retain: PVCs are preserved during scale-down. When scaling back up, existing PVCs are reused.
PV/PVC/StorageClass Architecture
Three-Layer Storage Model
Kubernetes storage is organized into three abstraction layers:
- PersistentVolume (PV): A cluster-level storage resource representing an actual physical or cloud storage volume. It can be created manually by an administrator or dynamically provisioned through a StorageClass.
- PersistentVolumeClaim (PVC): A user's request for storage. By specifying capacity, access modes, and StorageClass, it binds to a matching PV.
- StorageClass: The blueprint for dynamic provisioning. It defines which provisioner (CSI Driver) to use and what parameters (disk type, IOPS, encryption, etc.) to apply when creating volumes.
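Putting the three layers together, a minimal PVC looks like this. It is a sketch: the claim name is illustrative, and `gp3-standard` assumes a StorageClass of that name exists (one is defined later in this guide).

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data              # illustrative name
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-standard   # must match an existing StorageClass
  resources:
    requests:
      storage: 20Gi
```

When this PVC is created, the StorageClass's provisioner creates a matching PV and binds it; the user never touches the PV object directly.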
Access Mode Comparison
| Access Mode | Abbreviation | Description | Typical Use Case |
|---|---|---|---|
| ReadWriteOnce | RWO | Read/write mount on a single node | Databases, single-instance apps |
| ReadOnlyMany | ROX | Read-only mount across multiple nodes | Static assets, shared configs |
| ReadWriteMany | RWX | Read/write mount across multiple nodes | Shared filesystems, CMS uploads |
| ReadWriteOncePod | RWOP | Read/write by a single Pod only (GA in 1.29) | DBs requiring exclusive locking |
Reclaim Policy
The reclaim policy determines what happens to a PV after its PVC is deleted:
- Retain: Preserves the PV and its data. Requires manual cleanup or rebinding. Recommended for production databases.
- Delete: Automatically deletes the PV and its backing storage (e.g., EBS volume). Suitable for ephemeral data and development environments.
- Recycle (Deprecated): Do not use. Replaced by CSI-based dynamic provisioning.
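The reclaim policy of an already-provisioned PV can be changed in place, which is useful before risky operations on a volume that was created with `Delete`. A sketch (the PV name is illustrative; find yours with `kubectl get pv`):

```bash
# Switch an existing PV to Retain so deleting its PVC cannot destroy data
kubectl patch pv pvc-0a1b2c3d-example \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```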
CSI Driver Deep Dive
CSI Architecture Overview
The Container Storage Interface (CSI) is a standardized interface between Kubernetes and storage systems. Before CSI, storage plugins were embedded in the Kubernetes core (in-tree plugins), requiring Kubernetes itself to be modified for new storage backends. CSI decouples this relationship, allowing storage vendors to independently develop and deploy drivers.
A CSI Driver consists of two core components:
- Controller Plugin (Deployment): Handles cluster-level operations such as volume creation, deletion, expansion, and snapshots. Runs alongside sidecar containers like External Provisioner, External Attacher, and External Snapshotter.
- Node Plugin (DaemonSet): Runs on every worker node, managing volume mount/unmount, formatting, and device path operations. Communicates directly with kubelet.
CSI Driver Comparison
| Feature | AWS EBS CSI | AWS EFS CSI | Ceph CSI (Rook) | Longhorn |
|---|---|---|---|---|
| Storage Type | Block | File (NFS) | Block + File + Object | Block (Replicated) |
| Access Modes | RWO, RWOP | RWX, ROX, RWO | RWO, RWX, ROX | RWO, RWX |
| Dynamic Provisioning | Yes | Yes (Access Points) | Yes | Yes |
| Volume Expansion | Yes (Online) | N/A (Elastic) | Yes | Yes |
| Snapshots | Yes | No | Yes | Yes |
| Encryption | KMS Support | In-transit | LUKS Support | Yes |
| Topology Awareness | AZ-level | Region-level | CRUSH Map | Node-level |
| Best For | AWS EKS | AWS EKS (Shared) | On-premises, Large Scale | Edge, Small/Medium |
| Operational Complexity | Low | Low | High (CRUSH Map) | Medium (Web UI) |
AWS EBS CSI Driver Installation
Here is the production installation procedure for the EBS CSI Driver on EKS.
```bash
# 1. Verify IAM OIDC Provider
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster \
  --approve

# 2. Create IAM ServiceAccount for EBS CSI Driver
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster my-cluster \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

# 3. Install EBS CSI Driver as EKS Addon
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/AmazonEKS_EBS_CSI_DriverRole

# 4. Verify installation
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
kubectl get csidriver
```
Dynamic Provisioning in Practice
StorageClass Design
Production environments require multiple StorageClasses tailored to different workload characteristics.
```yaml
# General-purpose SSD - Standard workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-standard
  annotations:
    storageclass.kubernetes.io/is-default-class: 'true'
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
  encrypted: 'true'
  kmsKeyId: 'arn:aws:kms:ap-northeast-2:ACCOUNT_ID:key/KEY_ID'
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# High-performance SSD - Database workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io2-database
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: '50'
  fsType: ext4
  encrypted: 'true'
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Shared filesystem - RWX access
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-shared
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: '700'
  basePath: '/dynamic_provisioning'
reclaimPolicy: Delete
```
Key design principles for StorageClasses:
- volumeBindingMode: WaitForFirstConsumer: Essential for multi-AZ environments. The volume is provisioned in the same AZ as the scheduled Pod, preventing AZ mismatch issues.
- allowVolumeExpansion: true: Always enable this to allow future disk capacity increases. PVCs created with StorageClasses where this is false cannot be expanded later.
- reclaimPolicy: Use `Retain` for volumes storing critical data (databases) and `Delete` for ephemeral data.
- encrypted: "true": Apply encryption to all volumes for security compliance.
Dynamic Provisioning Flow
The complete dynamic provisioning workflow:
- When a user creates a PVC, the CSI Driver specified in the StorageClass provisioner field is invoked.
- The CSI Controller Plugin executes the CreateVolume RPC, creating the actual volume through cloud APIs.
- A PV object is automatically created and bound to the PVC.
- When a Pod referencing the PVC is scheduled, the CSI Controller Plugin calls ControllerPublishVolume RPC to attach the volume to the node.
- The CSI Node Plugin calls NodeStageVolume and NodePublishVolume RPCs to format the volume and mount it at the specified path inside the Pod.
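The steps above can be observed through events on the PVC. A sketch, assuming the `data-postgresql-0` claim from the StatefulSet example (exact event messages vary by driver and version):

```bash
# Watch provisioning progress via events
kubectl describe pvc data-postgresql-0 -n database
kubectl get events -n database \
  --field-selector involvedObject.name=data-postgresql-0

# Once CreateVolume succeeds, the bound PV appears
kubectl get pv | grep data-postgresql-0
```

With `WaitForFirstConsumer`, expect a `WaitForFirstConsumer` event until a consuming Pod is scheduled, followed by `Provisioning` and `ProvisioningSucceeded` events from the external-provisioner sidecar.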
Volume Expansion and Snapshots
Online Volume Expansion
The EBS CSI Driver supports online volume expansion, allowing disk capacity increases without Pod restarts.
```bash
# Modify PVC requested capacity
kubectl patch pvc data-postgresql-0 -n database \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Check expansion status
kubectl get pvc data-postgresql-0 -n database -o jsonpath='{.status.conditions}'

# Verify actual filesystem size (from inside the Pod)
kubectl exec -it postgresql-0 -n database -- df -h /var/lib/postgresql/data
```
The expansion occurs in two phases. First, the backend volume (EBS) size increases, then the filesystem expands. The PVC condition shows FileSystemResizePending until filesystem expansion completes.
Volume Snapshots
Data backup and restoration using CSI snapshots.
```yaml
# VolumeSnapshotClass definition
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Retain
parameters:
  tagSpecification_1: 'Environment=production'
  tagSpecification_2: 'BackupType=scheduled'
---
# Create a snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgresql-snapshot-20260313
  namespace: database
spec:
  volumeSnapshotClassName: ebs-snapshot-class
  source:
    persistentVolumeClaimName: data-postgresql-0
---
# Restore PVC from snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql-restored
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-standard
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgresql-snapshot-20260313
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```
Important considerations for snapshot operations:
- Snapshots guarantee only crash consistency. For application-level consistency, execute commands like `CHECKPOINT` before taking snapshots.
- Setting `deletionPolicy: Retain` on VolumeSnapshotClass preserves the actual snapshot data even when the VolumeSnapshot object is deleted.
- Snapshot CRDs (VolumeSnapshot, VolumeSnapshotContent, VolumeSnapshotClass) must be installed in the cluster. The snapshot-controller must also be deployed separately.
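Scheduled snapshots can be automated in-cluster. Below is a minimal sketch using a CronJob that applies a VolumeSnapshot manifest; the `snapshot-creator` ServiceAccount is an assumption and needs RBAC permissions to create VolumeSnapshots (not shown):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgresql-snapshot
  namespace: database
spec:
  schedule: '0 2 * * *'        # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator   # assumed to exist with RBAC
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: bitnami/kubectl:1.29
              command:
                - /bin/sh
                - -c
                - |
                  cat <<EOF | kubectl apply -f -
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: postgresql-snapshot-$(date +%Y%m%d)
                    namespace: database
                  spec:
                    volumeSnapshotClassName: ebs-snapshot-class
                    source:
                      persistentVolumeClaimName: data-postgresql-0
                  EOF
```

In production you would typically pair this with a retention job (or a purpose-built tool such as Velero) so old snapshots do not accumulate indefinitely.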
Troubleshooting Scenarios
Scenario 1: PVC Stuck in Pending State
Symptoms: A PVC remains in Pending state indefinitely after creation.
```bash
# Check PVC status
kubectl describe pvc data-postgresql-0 -n database

# Common event messages:
#   waiting for first consumer to be created before binding
#   no persistent volumes available for this claim
#   storageclass "gp3-standard" not found
```
Causes and Solutions:
- StorageClass does not exist: The storageClassName specified in the PVC is not available in the cluster. Verify with `kubectl get storageclass`.
- WaitForFirstConsumer mode: When volumeBindingMode is WaitForFirstConsumer, Pending status is normal until a Pod consuming the PVC is created.
- CSI Driver not installed: The CSI Driver specified as the provisioner is not installed. Verify with `kubectl get csidriver`.
- Resource quota exceeded: A ResourceQuota in the namespace has reached its storage limit.
- AZ constraints: The Pod is scheduled to an AZ where volume creation is not possible.
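The quota and topology causes can be checked directly; a short sketch:

```bash
# Check storage quota usage in the namespace
kubectl describe resourcequota -n database

# Check which AZ each node is in (for topology/AZ-mismatch issues)
kubectl get nodes -L topology.kubernetes.io/zone
```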
Scenario 2: PV/PVC Stuck in Terminating State
Symptoms: A deleted PVC remains in Terminating state indefinitely.
```bash
# Check finalizers
kubectl get pvc data-postgresql-0 -n database -o jsonpath='{.metadata.finalizers}'
# Example output: ["kubernetes.io/pvc-protection"]

# Find Pods using this PVC
kubectl get pods -n database -o json | \
  jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "data-postgresql-0") | .metadata.name'
```
Causes and Solutions:
The kubernetes.io/pvc-protection finalizer blocks deletion while any Pod is using the PVC. Delete all Pods referencing the PVC first, then retry PVC deletion. Only remove finalizers manually if the PVC remains stuck after all referencing Pods have been removed.
```bash
# WARNING: Risk of data loss - always backup before executing
kubectl patch pvc data-postgresql-0 -n database \
  -p '{"metadata":{"finalizers":null}}'
```
PVs stuck in Terminating follow a similar pattern. This occurs when VolumeAttachments remain or the CSI Driver cannot properly delete the volume.
```bash
# Check VolumeAttachment
kubectl get volumeattachment | grep "pv-name"

# Check CSI Driver Pod logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver \
  -c ebs-plugin --tail=100
```
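If a VolumeAttachment is orphaned, for example because its node no longer exists and the CSI Driver can never complete the detach, it can be removed manually. This is a last-resort sketch with the same data-safety caveats as finalizer removal on a PVC (the VolumeAttachment name is illustrative):

```bash
# Last resort: delete the stuck VolumeAttachment
kubectl delete volumeattachment csi-0123example

# If deletion hangs on a finalizer, clear it manually
kubectl patch volumeattachment csi-0123example --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```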
Scenario 3: CSI Driver Crash and Data Recovery
Symptoms: CSI Driver Pods are in CrashLoopBackOff and newly created Pods fail to mount volumes.
Diagnostic Procedure:
```bash
# Check CSI Driver Pod status
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

# Check CSI Driver events
kubectl describe pod -n kube-system ebs-csi-controller-0

# Check CSI Node Plugin logs (per node)
kubectl logs -n kube-system -l app=ebs-csi-node -c ebs-plugin --tail=50

# Check VolumeAttachment state
kubectl get volumeattachment -o wide
```
Recovery Procedure:
- Redeploy the CSI Driver via Helm or EKS Addon.
- Clean up any stuck VolumeAttachments manually.
- Delete affected Pods to trigger kubelet volume mount retries.
- In the worst case, drain and replace the node itself.
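The node-replacement step in the recovery procedure above can be sketched as follows (the node name is illustrative):

```bash
# Worst case: cordon and drain the affected node, then replace it
kubectl cordon ip-10-0-1-23.example.internal
kubectl drain ip-10-0-1-23.example.internal \
  --ignore-daemonsets --delete-emptydir-data
```

After the drain, the rescheduled Pods trigger fresh attach/mount cycles on healthy nodes, which often clears mounts that were wedged by the failed driver.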
Scenario 4: Data Loss Concerns After StatefulSet Scale-Down
When scaling a StatefulSet from 3 to 2, the postgresql-2 Pod is deleted but the data-postgresql-2 PVC is preserved by default. Scaling back to 3 reuses the existing PVC.
However, if persistentVolumeClaimRetentionPolicy.whenScaled is set to Delete, or the PVC is manually deleted, data is lost. Always perform data migration or backup before scaling down.
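A snapshot of the highest-ordinal replica's PVC taken just before scaling down provides a cheap safety net. A sketch reusing the `ebs-snapshot-class` defined earlier (the snapshot name is illustrative):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pre-scaledown-postgresql-2   # illustrative name
  namespace: database
spec:
  volumeSnapshotClassName: ebs-snapshot-class
  source:
    persistentVolumeClaimName: data-postgresql-2
```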
Operations Checklist
Essential items to verify when operating StatefulSets and Persistent Volumes in production.
Pre-Deployment Checklist:
- Is `volumeBindingMode: WaitForFirstConsumer` set on the StorageClass
- Is `allowVolumeExpansion: true` enabled on the StorageClass
- Is the reclaimPolicy set to `Retain` for database volumes
- Is the CSI Driver properly installed and registered as a csidriver object
- Are VolumeSnapshot CRDs and snapshot-controller deployed
- Is a PodDisruptionBudget configured
- Is terminationGracePeriodSeconds set to an adequate value
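For the PodDisruptionBudget item, a minimal sketch matching the 3-replica PostgreSQL example in this guide: `minAvailable: 2` keeps a majority of replicas up during voluntary disruptions such as node drains (tune to your replication topology).

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgresql-pdb
  namespace: database
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: postgresql
```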
Day-to-Day Operations Checklist:
- Are PVC capacity utilization metrics being monitored (kubelet_volume_stats_used_bytes)
- Are volume snapshots being created on a regular schedule
- Are snapshot restoration tests being performed periodically
- Is the CSI Driver version compatible with the Kubernetes version
- Do StorageClass parameters meet security requirements (encryption, etc.)
- Are there any PVCs/PVs stuck in Pending or Terminating state
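For the capacity-monitoring item, a sketch of a Prometheus alerting rule on `kubelet_volume_stats_used_bytes`. This assumes the Prometheus Operator's PrometheusRule CRD is installed; the rule and namespace names are illustrative, and the 85% threshold is a starting point, not a recommendation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-capacity-alerts
  namespace: monitoring
spec:
  groups:
    - name: pvc-capacity
      rules:
        - alert: PVCAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes
              / kubelet_volume_stats_capacity_bytes > 0.85
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: 'PVC {{ $labels.persistentvolumeclaim }} is over 85% full'
```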
Incident Response Checklist:
- PVC Pending: Verify StorageClass existence, CSI Driver status, ResourceQuota, AZ constraints
- PVC Terminating: Check for referencing Pods, inspect finalizers
- Volume mount failure: Check VolumeAttachment state, CSI Node Plugin logs
- Data recovery: Locate latest snapshot, execute PVC restoration from snapshot
- CSI Driver failure: Redeploy Driver Pods, clean up stuck VolumeAttachments
Conclusion
Reliably operating stateful workloads in Kubernetes requires a deep understanding of how StatefulSet, PV, PVC, StorageClass, and CSI Driver interact. Rather than simply copying and applying manifests, the key lies in understanding each component's operating principles and preparing for failure scenarios in advance.
In production environments, adhere to these principles:
- Use WaitForFirstConsumer by default to eliminate AZ mismatch issues at the source.
- Apply volume encryption across all StorageClasses.
- Perform snapshot-based backups on a regular schedule, and always include restoration testing.
- Implement monitoring to track disk utilization, IOPS, and latency in real time.
- Set reclaimPolicy according to data criticality to prevent accidental data deletion.
The CSI Driver ecosystem continues to evolve. New features such as the ReadWriteOncePod access mode (GA in Kubernetes 1.29), Volume Group Snapshots, and SELinux mount options are being added continuously. Stay up to date with release notes and evolve your storage operations strategy accordingly.
References
- Kubernetes Official Documentation - StatefulSets
- Kubernetes Official Documentation - Persistent Volumes
- Kubernetes Official Documentation - Storage Classes
- Kubernetes Official Documentation - Volume Snapshots
- Kubernetes Official Documentation - Dynamic Volume Provisioning
- AWS EBS CSI Driver GitHub
- AWS EFS CSI Driver GitHub
- Kubernetes CSI Developer Documentation - Drivers
- Longhorn CSI Snapshot Documentation
- Kubernetes Storage Layers: Ceph vs. Longhorn