Complete Guide to Kubernetes StatefulSet and Persistent Volume: CSI Driver, StorageClass, and Dynamic Provisioning in Production

Introduction

While stateless workloads in Kubernetes can be managed relatively easily with Deployments and ReplicaSets, stateful workloads like databases, message queues, and distributed caches introduce an entirely different level of complexity. Data must survive Pod restarts and node migrations, each Pod needs a unique network identity with stable storage, and scaling operations must respect ordering guarantees.

Kubernetes provides a rich storage ecosystem to address these requirements: StatefulSet, PersistentVolume (PV), PersistentVolumeClaim (PVC), StorageClass, and the Container Storage Interface (CSI) Driver. However, without a deep understanding of how these components interact, production environments are prone to data loss, stuck volumes, and provisioning failures.

This guide begins with the core mechanics of StatefulSet, then dives deep into the PV/PVC/StorageClass architecture, examines CSI Driver internals while comparing major implementations (EBS CSI, EFS CSI, Ceph CSI, Longhorn), walks through dynamic provisioning, volume expansion, and snapshots, covers real-world troubleshooting scenarios, and concludes with a production operations checklist.

StatefulSet Core Concepts

How StatefulSet Differs from Deployment

A Deployment treats Pods as interchangeable, identical units. Pod names include random hashes, scheduling order is not guaranteed, and any deleted Pod is immediately replaced by a functionally identical one. StatefulSet, in contrast, assigns each Pod an ordinal index, binding network identity and storage to the Pod's identity.

StatefulSet provides these core guarantees:

  1. Stable Network Identity: Pod names follow the pattern statefulset-name-0, statefulset-name-1, and a Headless Service creates unique DNS records for each Pod.
  2. Stable Storage: volumeClaimTemplates automatically create a dedicated PVC for each Pod, and the same PVC is reattached even after rescheduling.
  3. Ordered Deployment: Pods are created sequentially from index 0, and deleted in reverse order.
  4. Ordered Rolling Updates: Updates proceed in reverse ordinal order, starting from the highest index.
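
The stable-identity rule above can be sketched as a tiny shell helper. The names used here (postgresql, postgresql-headless, database) match the manifest in the next section, and the cluster.local DNS suffix is an assumption (it is the default cluster domain, but clusters can override it).

```shell
# Stable DNS name for a StatefulSet Pod:
#   <statefulset-name>-<ordinal>.<headless-service>.<namespace>.svc.<cluster-domain>
# cluster.local is assumed as the cluster domain.
pod_dns() {
  local sts="$1" ordinal="$2" svc="$3" ns="$4"
  echo "${sts}-${ordinal}.${svc}.${ns}.svc.cluster.local"
}

pod_dns postgresql 0 postgresql-headless database
# → postgresql-0.postgresql-headless.database.svc.cluster.local
```

Because this name is derived purely from the ordinal, it stays constant across Pod restarts and rescheduling, which is what lets database replicas address each other reliably.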

StatefulSet Specification

Here is a production-ready example of a PostgreSQL StatefulSet with 3 replicas.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: database
spec:
  serviceName: postgresql-headless
  replicas: 3
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      terminationGracePeriodSeconds: 120
      securityContext:
        fsGroup: 999
        runAsUser: 999
      containers:
        - name: postgresql
          image: postgres:16.2-alpine
          ports:
            - containerPort: 5432
              name: postgresql
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgresql-secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 4Gi
          livenessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 5
            periodSeconds: 5
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql-headless
  namespace: database
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: postgresql
  ports:
    - port: 5432
      targetPort: postgresql

Key points in this manifest:

  • volumeClaimTemplates: The StatefulSet automatically creates PVCs named data-postgresql-0, data-postgresql-1, data-postgresql-2 for each Pod.
  • podManagementPolicy: OrderedReady: The default policy ensures Pod 0 reaches Ready state before Pod 1 is created. Set to Parallel for concurrent startup when ordering is not required.
  • terminationGracePeriodSeconds: 120: Databases need adequate time for clean shutdown procedures.
  • fsGroup: 999: Sets filesystem group ownership on the PV mount to match the PostgreSQL user.
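
The PVC naming rule from the first bullet (<template-name>-<statefulset-name>-<ordinal>) is mechanical enough to sketch as a trivial helper:

```shell
# PVCs created from volumeClaimTemplates are named:
#   <template-name>-<statefulset-name>-<ordinal>
pvc_name() {
  echo "${1}-${2}-${3}"
}

for i in 0 1 2; do
  pvc_name data postgresql "$i"
done
# → data-postgresql-0
# → data-postgresql-1
# → data-postgresql-2
```

This determinism is why scaling back up after a scale-down reattaches the same data: the new Pod at ordinal 2 claims exactly data-postgresql-2 again.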

PVC Retention Policy

Since Kubernetes 1.27, where the StatefulSetAutoDeletePVC feature reached beta and became enabled by default, StatefulSet PVC retention behavior can be fine-tuned. By default, PVCs are not deleted when a StatefulSet is deleted or scaled down (to protect data). To change this behavior, use persistentVolumeClaimRetentionPolicy.

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Retain

  • whenDeleted: Delete: PVCs are deleted when the StatefulSet itself is deleted.
  • whenScaled: Retain: PVCs are preserved during scale-down. When scaling back up, existing PVCs are reused.

PV/PVC/StorageClass Architecture

Three-Layer Storage Model

Kubernetes storage is organized into three abstraction layers:

  1. PersistentVolume (PV): A cluster-level storage resource representing an actual physical or cloud storage volume. It can be created manually by an administrator or dynamically provisioned through a StorageClass.
  2. PersistentVolumeClaim (PVC): A user's request for storage. By specifying capacity, access modes, and StorageClass, it binds to a matching PV.
  3. StorageClass: The blueprint for dynamic provisioning. It defines which provisioner (CSI Driver) to use and what parameters (disk type, IOPS, encryption, etc.) to apply when creating volumes.
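
Tying the three layers together, a minimal standalone PVC looks like the sketch below. It assumes the gp3-standard StorageClass defined later in this guide; the claim name app-data is hypothetical.

```yaml
# A user-side claim: capacity + access mode + StorageClass.
# The CSI provisioner referenced by gp3-standard creates the PV on demand
# and binds it to this claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce          # single-node read/write
  storageClassName: gp3-standard
  resources:
    requests:
      storage: 20Gi
```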

Access Mode Comparison

| Access Mode | Abbreviation | Description | Typical Use Case |
| --- | --- | --- | --- |
| ReadWriteOnce | RWO | Read/write mount on a single node | Databases, single-instance apps |
| ReadOnlyMany | ROX | Read-only mount across multiple nodes | Static assets, shared configs |
| ReadWriteMany | RWX | Read/write mount across multiple nodes | Shared filesystems, CMS uploads |
| ReadWriteOncePod | RWOP | Read/write by a single Pod only (GA in 1.29) | DBs requiring exclusive locking |

Reclaim Policy

The reclaim policy determines what happens to a PV after its PVC is deleted:

  • Retain: Preserves the PV and its data. Requires manual cleanup or rebinding. Recommended for production databases.
  • Delete: Automatically deletes the PV and its backing storage (e.g., EBS volume). Suitable for ephemeral data and development environments.
  • Recycle (Deprecated): Do not use. Replaced by CSI-based dynamic provisioning.
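
A common operational task is flipping an existing PV to Retain before deleting its claim, so the backing storage survives. A hedged sketch (the PV name pvc-0123abcd is hypothetical):

```shell
# Build the JSON patch that changes a PV's reclaim policy.
reclaim_patch() {
  printf '{"spec":{"persistentVolumeReclaimPolicy":"%s"}}' "$1"
}

# Against a live cluster this would be applied as (PV name hypothetical):
#   kubectl patch pv pvc-0123abcd -p "$(reclaim_patch Retain)"
reclaim_patch Retain
# → {"spec":{"persistentVolumeReclaimPolicy":"Retain"}}
```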

CSI Driver Deep Dive

CSI Architecture Overview

The Container Storage Interface (CSI) is a standardized interface between Kubernetes and storage systems. Before CSI, storage plugins were embedded in the Kubernetes core (in-tree plugins), requiring Kubernetes itself to be modified for new storage backends. CSI decouples this relationship, allowing storage vendors to independently develop and deploy drivers.

A CSI Driver consists of two core components:

  1. Controller Plugin (Deployment): Handles cluster-level operations such as volume creation, deletion, expansion, and snapshots. Runs alongside sidecar containers like External Provisioner, External Attacher, and External Snapshotter.
  2. Node Plugin (DaemonSet): Runs on every worker node, managing volume mount/unmount, formatting, and device path operations. Communicates directly with kubelet.

CSI Driver Comparison

| Feature | AWS EBS CSI | AWS EFS CSI | Ceph CSI (Rook) | Longhorn |
| --- | --- | --- | --- | --- |
| Storage Type | Block | File (NFS) | Block + File + Object | Block (Replicated) |
| Access Modes | RWO, RWOP | RWX, ROX, RWO | RWO, RWX, ROX | RWO, RWX |
| Dynamic Provisioning | Yes | Yes (Access Points) | Yes | Yes |
| Volume Expansion | Yes (Online) | N/A (Elastic) | Yes | Yes |
| Snapshots | Yes | No | Yes | Yes |
| Encryption | KMS Support | In-transit | LUKS Support | Yes |
| Topology Awareness | AZ-level | Region-level | CRUSH Map | Node-level |
| Best For | AWS EKS | AWS EKS (Shared) | On-premises, Large Scale | Edge, Small/Medium |
| Operational Complexity | Low | Low | High (CRUSH Map) | Medium (Web UI) |

AWS EBS CSI Driver Installation

Here is the production installation procedure for the EBS CSI Driver on EKS.

# 1. Verify IAM OIDC Provider
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster \
  --approve

# 2. Create IAM ServiceAccount for EBS CSI Driver
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster my-cluster \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

# 3. Install EBS CSI Driver as EKS Addon
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/AmazonEKS_EBS_CSI_DriverRole

# 4. Verify installation
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
kubectl get csidriver

Dynamic Provisioning in Practice

StorageClass Design

Production environments require multiple StorageClasses tailored to different workload characteristics.

# General-purpose SSD - Standard workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-standard
  annotations:
    storageclass.kubernetes.io/is-default-class: 'true'
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
  encrypted: 'true'
  kmsKeyId: 'arn:aws:kms:ap-northeast-2:ACCOUNT_ID:key/KEY_ID'
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# High-performance SSD - Database workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io2-database
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: '50'
  fsType: ext4
  encrypted: 'true'
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Shared filesystem - RWX access
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-shared
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: '700'
  basePath: '/dynamic_provisioning'
reclaimPolicy: Delete

Key design principles for StorageClasses:

  • volumeBindingMode: WaitForFirstConsumer: Essential for multi-AZ environments. The volume is provisioned in the same AZ as the scheduled Pod, preventing AZ mismatch issues.
  • allowVolumeExpansion: true: Always enable this to allow future disk capacity increases. PVCs bound to a StorageClass where this is false cannot be expanded until the flag is enabled on that StorageClass.
  • reclaimPolicy: Use Retain for volumes storing critical data (databases) and Delete for ephemeral data.
  • encrypted: "true": Apply encryption to all volumes for security compliance.

Dynamic Provisioning Flow

The complete dynamic provisioning workflow:

  1. When a user creates a PVC, the CSI Driver specified in the StorageClass provisioner field is invoked.
  2. The CSI Controller Plugin executes the CreateVolume RPC, creating the actual volume through cloud APIs.
  3. A PV object is automatically created and bound to the PVC.
  4. When a Pod referencing the PVC is scheduled, the CSI Controller Plugin calls ControllerPublishVolume RPC to attach the volume to the node.
  5. The CSI Node Plugin calls NodeStageVolume and NodePublishVolume RPCs to format the volume and mount it at the specified path inside the Pod.

Volume Expansion and Snapshots

Online Volume Expansion

The EBS CSI Driver supports online volume expansion, allowing disk capacity increases without Pod restarts.

# Modify PVC requested capacity
kubectl patch pvc data-postgresql-0 -n database \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Check expansion status
kubectl get pvc data-postgresql-0 -n database -o jsonpath='{.status.conditions}'

# Verify actual filesystem size (from inside the Pod)
kubectl exec -it postgresql-0 -n database -- df -h /var/lib/postgresql/data

The expansion occurs in two phases. First, the backend volume (EBS) size increases, then the filesystem expands. The PVC condition shows FileSystemResizePending until filesystem expansion completes.
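
A pure-shell sketch of checking for that condition; the sample conditions JSON below is a hypothetical copy of what the jsonpath query above returns while the resize is still in flight:

```shell
# Return "pending" while the filesystem resize has not completed yet.
resize_state() {
  case "$1" in
    *FileSystemResizePending*) echo pending ;;
    *)                         echo done ;;
  esac
}

# Input shaped like `kubectl get pvc ... -o jsonpath='{.status.conditions}'`
conditions='[{"type":"FileSystemResizePending","status":"True"}]'
resize_state "$conditions"   # → pending
resize_state '[]'            # → done
```

In practice this kind of check would be run in a loop until the condition clears, at which point df -h inside the Pod reflects the new capacity.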

Volume Snapshots

Data backup and restoration using CSI snapshots.

# VolumeSnapshotClass definition
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Retain
parameters:
  tagSpecification_1: 'Environment=production'
  tagSpecification_2: 'BackupType=scheduled'
---
# Create a snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgresql-snapshot-20260313
  namespace: database
spec:
  volumeSnapshotClassName: ebs-snapshot-class
  source:
    persistentVolumeClaimName: data-postgresql-0
---
# Restore PVC from snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql-restored
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-standard
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgresql-snapshot-20260313
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

Important considerations for snapshot operations:

  • Snapshots guarantee only crash consistency. For application-level consistency, execute commands like CHECKPOINT before taking snapshots.
  • Setting deletionPolicy: Retain on VolumeSnapshotClass preserves the actual snapshot data even when the VolumeSnapshot object is deleted.
  • Snapshot CRDs (VolumeSnapshot, VolumeSnapshotContent, VolumeSnapshotClass) must be installed in the cluster. The snapshot-controller must also be deployed separately.

Troubleshooting Scenarios

Scenario 1: PVC Stuck in Pending State

Symptoms: A PVC remains in Pending state indefinitely after creation.

# Check PVC status
kubectl describe pvc data-postgresql-0 -n database

# Common event messages:
# waiting for first consumer to be created before binding
# no persistent volumes available for this claim
# storageclass "gp3-standard" not found

Causes and Solutions:

  1. StorageClass does not exist: The storageClassName specified in the PVC is not available in the cluster. Verify with kubectl get storageclass.
  2. WaitForFirstConsumer mode: When volumeBindingMode is WaitForFirstConsumer, Pending status is normal until a Pod consuming the PVC is created.
  3. CSI Driver not installed: The CSI Driver specified as the provisioner is not installed. Verify with kubectl get csidriver.
  4. Resource quota exceeded: A ResourceQuota in the namespace has reached its storage limit.
  5. AZ constraints: The Pod is scheduled to an AZ where volume creation is not possible.

Scenario 2: PV/PVC Stuck in Terminating State

Symptoms: A deleted PVC remains in Terminating state indefinitely.

# Check finalizers
kubectl get pvc data-postgresql-0 -n database -o jsonpath='{.metadata.finalizers}'
# Example output: ["kubernetes.io/pvc-protection"]

# Find Pods using this PVC
kubectl get pods -n database -o json | \
  jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "data-postgresql-0") | .metadata.name'

Causes and Solutions:

The kubernetes.io/pvc-protection finalizer blocks deletion while any Pod is using the PVC. Delete all Pods referencing the PVC first, then retry PVC deletion. Only remove finalizers manually if the PVC remains stuck after all referencing Pods have been removed.

# WARNING: Risk of data loss - always backup before executing
kubectl patch pvc data-postgresql-0 -n database \
  -p '{"metadata":{"finalizers":null}}'

PVs stuck in Terminating follow a similar pattern. This occurs when VolumeAttachments remain or the CSI Driver cannot properly delete the volume.

# Check VolumeAttachment
kubectl get volumeattachment | grep "pv-name"

# Check CSI Driver Pod logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver \
  -c ebs-plugin --tail=100

Scenario 3: CSI Driver Crash and Data Recovery

Symptoms: CSI Driver Pods are in CrashLoopBackOff and newly created Pods fail to mount volumes.

Diagnostic Procedure:

# Check CSI Driver Pod status
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

# Check CSI Driver events
kubectl describe pod -n kube-system -l app=ebs-csi-controller

# Check CSI Node Plugin logs (per node)
kubectl logs -n kube-system -l app=ebs-csi-node -c ebs-plugin --tail=50

# Check VolumeAttachment state
kubectl get volumeattachment -o wide

Recovery Procedure:

  1. Redeploy the CSI Driver via Helm or EKS Addon.
  2. Clean up any stuck VolumeAttachments manually.
  3. Delete affected Pods to trigger kubelet volume mount retries.
  4. In the worst case, drain and replace the node itself.

Scenario 4: Data Loss Concerns After StatefulSet Scale-Down

When scaling a StatefulSet from 3 to 2, the postgresql-2 Pod is deleted but the data-postgresql-2 PVC is preserved by default. Scaling back to 3 reuses the existing PVC.

However, if persistentVolumeClaimRetentionPolicy.whenScaled is set to Delete, or the PVC is manually deleted, data is lost. Always perform data migration or backup before scaling down.

Operations Checklist

Essential items to verify when operating StatefulSets and Persistent Volumes in production.

Pre-Deployment Checklist:

  • Is volumeBindingMode: WaitForFirstConsumer set on the StorageClass
  • Is allowVolumeExpansion: true enabled on the StorageClass
  • Is the reclaimPolicy set to Retain for database volumes
  • Is the CSI Driver properly installed and registered as a csidriver object
  • Are VolumeSnapshot CRDs and snapshot-controller deployed
  • Is a PodDisruptionBudget configured
  • Is terminationGracePeriodSeconds set to an adequate value
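
For the PodDisruptionBudget item above, a minimal sketch for the PostgreSQL StatefulSet from earlier; maxUnavailable: 1 is a common choice for a three-replica database, not a universal rule.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgresql-pdb
  namespace: database
spec:
  maxUnavailable: 1        # at most one Pod down during voluntary disruptions
  selector:
    matchLabels:
      app: postgresql
```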

Day-to-Day Operations Checklist:

  • Are PVC capacity utilization metrics being monitored (kubelet_volume_stats_used_bytes)
  • Are volume snapshots being created on a regular schedule
  • Are snapshot restoration tests being performed periodically
  • Is the CSI Driver version compatible with the Kubernetes version
  • Do StorageClass parameters meet security requirements (encryption, etc.)
  • Are there any PVCs/PVs stuck in Pending or Terminating state

Incident Response Checklist:

  • PVC Pending: Verify StorageClass existence, CSI Driver status, ResourceQuota, AZ constraints
  • PVC Terminating: Check for referencing Pods, inspect finalizers
  • Volume mount failure: Check VolumeAttachment state, CSI Node Plugin logs
  • Data recovery: Locate latest snapshot, execute PVC restoration from snapshot
  • CSI Driver failure: Redeploy Driver Pods, clean up stuck VolumeAttachments

Conclusion

Reliably operating stateful workloads in Kubernetes requires a deep understanding of how StatefulSet, PV, PVC, StorageClass, and CSI Driver interact. Rather than simply copying and applying manifests, the key lies in understanding each component's operating principles and preparing for failure scenarios in advance.

In production environments, adhere to these principles:

  1. Use WaitForFirstConsumer by default to eliminate AZ mismatch issues at the source.
  2. Apply volume encryption across all StorageClasses.
  3. Perform snapshot-based backups on a regular schedule, and always include restoration testing.
  4. Implement monitoring to track disk utilization, IOPS, and latency in real time.
  5. Set reclaimPolicy according to data criticality to prevent accidental data deletion.

The CSI Driver ecosystem continues to evolve. New features such as the ReadWriteOncePod access mode (GA in Kubernetes 1.29), Volume Group Snapshots, and SELinux mount options are being added continuously. Stay up to date with release notes and evolve your storage operations strategy accordingly.
