Kubernetes Database Deep Dive 2025: Operator Patterns, HA Configuration, Backup/Recovery, Performance Tuning


1. Should You Run Databases on K8s?

1.1 Pros and Cons Analysis (2025 Perspective)

Advantages of K8s DB Operations

  • Consistent deployment: Manage DB with the same GitOps workflow as applications
  • Resource efficiency: Share node resources with other workloads (cost savings vs dedicated VMs)
  • Automatic recovery: Operator detects failures and performs automatic Failover (RTO under 30s)
  • Portability: Same configuration anywhere without cloud vendor lock-in
  • Dev environment parity: Identical DB stack across dev/staging/production

Disadvantages of K8s DB Operations

  • Operational complexity: Additional management points like StorageClass, PV, Operator upgrades
  • Storage performance: Network storage (EBS/PD) has higher latency than local disks
  • Expertise required: Deep understanding of both DBA and K8s operations needed
  • Backup complexity: K8s-specific volume snapshots, object storage integration required

1.2 Decision Framework

Criteria | K8s DB Suitable | Managed DB (RDS etc.) Suitable
---------|-----------------|-------------------------------
Team capability | K8s + DBA experts available | No DBA
Cost sensitivity | High (need infra optimization) | Medium (management cost OK)
Compliance | Data location control needed | Cloud region sufficient
Multi-cloud | Required | Single cloud
Workload size | Small-medium (hundreds of GB) | Large (multiple TB+)
SLA requirement | 99.9% achievable in-house | 99.99% vendor SLA needed

2. StatefulSet Deep Dive

2.1 Pod Identity (Ordinal Index and Stable Hostname)

StatefulSet assigns each Pod a stable ordinal index, which becomes part of its name and DNS hostname.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres-headless
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi

Pods created by this StatefulSet:

postgres-0  ->  postgres-0.postgres-headless.database.svc.cluster.local
postgres-1  ->  postgres-1.postgres-headless.database.svc.cluster.local
postgres-2  ->  postgres-2.postgres-headless.database.svc.cluster.local

Each Pod keeps the same stable hostname and reattaches to the same PVC across restarts and rescheduling.
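The naming scheme above is mechanical, which is what makes it scriptable; a small sketch reproducing it (plain Python, values taken from the manifest above):

```python
# Reconstruct the stable per-Pod DNS names a StatefulSet guarantees.
def pod_dns_names(sts_name: str, replicas: int,
                  headless_svc: str, namespace: str) -> list[str]:
    return [
        f"{sts_name}-{i}.{headless_svc}.{namespace}.svc.cluster.local"
        for i in range(replicas)
    ]

names = pod_dns_names("postgres", 3, "postgres-headless", "database")
# names[0] == "postgres-0.postgres-headless.database.svc.cluster.local"
```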

2.2 PodManagementPolicy

spec:
  podManagementPolicy: OrderedReady  # default
  # OrderedReady: Create in order 0->1->2, delete in order 2->1->0
  # Parallel: All Pods created/deleted simultaneously (use with caution for DBs)

  • OrderedReady: Primary (index 0) starts first and becomes Ready before Replicas start
  • Parallel: Only use for initial cluster bootstrap (when no existing data)

2.3 volumeClaimTemplates

volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        type: database-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3-encrypted
      resources:
        requests:
          storage: 100Gi

This template creates individual PVCs for each Pod:

data-postgres-0  ->  100Gi PV (gp3-encrypted)
data-postgres-1  ->  100Gi PV (gp3-encrypted)
data-postgres-2  ->  100Gi PV (gp3-encrypted)

PVCs are not deleted by default when the StatefulSet is deleted (data protection); on newer Kubernetes versions, the spec.persistentVolumeClaimRetentionPolicy field can opt into automatic cleanup.

2.4 Update Strategies

spec:
  updateStrategy:
    type: RollingUpdate       # or OnDelete
    rollingUpdate:
      partition: 0            # only update indices >= partition
      maxUnavailable: 1       # alpha since K8s 1.24 (MaxUnavailableStatefulSet feature gate)

  • RollingUpdate: Updates in reverse order from highest index (2 -> 1 -> 0). Replicas first, Primary last
  • OnDelete: Pods must be manually deleted to update. Useful for canary updates
  • partition: Setting partition: 2 only updates index 2 and above (canary)

2.5 Headless Service

apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: database
spec:
  type: ClusterIP
  clusterIP: None  # Headless Service
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432

Headless Service returns each Pod's IP directly as DNS A records:

# Query all Pod IPs
nslookup postgres-headless.database.svc.cluster.local

# Direct access to specific Pod
psql -h postgres-0.postgres-headless.database.svc.cluster.local -U postgres

3. CloudNativePG (PostgreSQL) Deep Dive

3.1 Cluster CRD

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
  namespace: database
spec:
  instances: 3

  postgresql:
    parameters:
      shared_buffers: "2GB"
      effective_cache_size: "6GB"
      work_mem: "64MB"
      maintenance_work_mem: "512MB"
      max_connections: "200"
      max_wal_size: "2GB"
      min_wal_size: "1GB"
      wal_buffers: "64MB"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      max_worker_processes: "8"
      max_parallel_workers_per_gather: "4"
      max_parallel_workers: "8"

  storage:
    size: 100Gi
    storageClass: gp3-encrypted
    pvcTemplate:
      accessModes:
        - ReadWriteOnce

  walStorage:
    size: 20Gi
    storageClass: gp3-encrypted

  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

  monitoring:
    enablePodMonitor: true

  bootstrap:
    initdb:
      database: myapp
      owner: myapp
      secret:
        name: myapp-db-credentials

3.2 Automatic Failover

CloudNativePG automatically promotes a Replica when the Primary fails.

Normal state:
  my-postgres-1 (Primary, RW)
  my-postgres-2 (Replica, RO) - streaming replication
  my-postgres-3 (Replica, RO) - streaming replication

Primary failure occurs:
  1. CloudNativePG detects Primary failure (health check failure)
  2. Selects Replica with most recent WAL position
  3. Executes pg_promote() -> promotes new Primary
  4. Remaining Replicas follow new Primary
  5. pg_rewind rejoins former Primary as Replica

After failover:
  my-postgres-2 (Primary, RW) <- auto-promoted
  my-postgres-3 (Replica, RO) - follows new Primary
  my-postgres-1 (Replica, RO) - rejoined after pg_rewind

Failover duration: typically 10-30 seconds.
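Because failover takes 10-30 seconds, applications should treat connection errors during that window as transient and retry with backoff rather than crash. A minimal client-side sketch (the `connect` callable is a hypothetical stand-in for your driver's connect function):

```python
import time

# Retry a DB connection with exponential backoff, long enough to
# cover a 10-30 second primary promotion.
def connect_with_retry(connect, attempts: int = 8,
                       base_delay: float = 0.5, max_delay: float = 8.0):
    last_exc = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_exc = exc
            time.sleep(min(base_delay * 2 ** attempt, max_delay))
    raise last_exc
```

With the defaults, the worst-case cumulative wait is 0.5+1+2+4+8+8+8+8 ≈ 39.5 seconds, comfortably longer than a typical failover.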

3.3 Backup Configuration (Barman + Object Storage)

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
spec:
  instances: 3

  backup:
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket/postgres/
      endpointURL: https://s3.us-east-1.amazonaws.com
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
        maxParallel: 4
      data:
        compression: gzip
        immediateCheckpoint: true
    retentionPolicy: "30d"

Scheduled backup:

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily-backup
  namespace: database
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM
  backupOwnerReference: self
  cluster:
    name: my-postgres
  immediate: false
  suspend: false

3.4 WAL Archiving and PITR

WAL (Write-Ahead Log) archiving provides continuous data protection.

PITR (Point-in-Time Recovery):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres-restored
spec:
  instances: 3

  bootstrap:
    recovery:
      source: my-postgres
      recoveryTarget:
        targetTime: "2024-09-01 14:30:00.00000+00"

  externalClusters:
    - name: my-postgres
      barmanObjectStore:
        destinationPath: s3://my-backup-bucket/postgres/
        endpointURL: https://s3.us-east-1.amazonaws.com
        s3Credentials:
          accessKeyId:
            name: aws-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: aws-creds
            key: ACCESS_SECRET_KEY

3.5 Connection Pooling (Built-in PgBouncer)

apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: my-postgres-pooler-rw
  namespace: database
spec:
  cluster:
    name: my-postgres
  instances: 2
  type: rw  # rw or ro
  pgbouncer:
    poolMode: transaction
    parameters:
      max_client_conn: "1000"
      default_pool_size: "25"
      min_pool_size: "5"
      reserve_pool_size: "5"
      reserve_pool_timeout: "5"
      server_idle_timeout: "300"
---
# Read-only pooler (targets Replicas)
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: my-postgres-pooler-ro
spec:
  cluster:
    name: my-postgres
  instances: 2
  type: ro
  pgbouncer:
    poolMode: transaction
    parameters:
      max_client_conn: "2000"
      default_pool_size: "50"

Application connections:

# Read/Write (Primary)
my-postgres-pooler-rw.database.svc.cluster.local:5432

# Read-Only (Replica)
my-postgres-pooler-ro.database.svc.cluster.local:5432

3.6 Monitoring (Prometheus + Grafana)

spec:
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: pg-custom-queries
        key: queries

Custom queries ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-custom-queries
  namespace: database
data:
  queries: |
    pg_database_size:
      query: "SELECT datname, pg_database_size(datname) as size_bytes FROM pg_database WHERE datname NOT IN ('template0', 'template1')"
      master: true
      metrics:
        - datname:
            usage: "LABEL"
            description: "Database name"
        - size_bytes:
            usage: "GAUGE"
            description: "Database size in bytes"

    pg_replication_lag:
      query: "SELECT CASE WHEN pg_is_in_recovery() THEN EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int ELSE 0 END as lag_seconds"
      master: false
      metrics:
        - lag_seconds:
            usage: "GAUGE"
            description: "Replication lag in seconds"

3.7 Rolling Update (Zero-Downtime)

# Image update
spec:
  imageName: ghcr.io/cloudnative-pg/postgresql:16.3

# CloudNativePG automatically:
# 1. Restarts Replicas one by one
# 2. Confirms each Replica is Ready before proceeding
# 3. Finally performs a switchover: promotes a Replica to Primary, then restarts the old Primary
# -> Update completes with zero downtime

4. Percona Operator for MySQL

4.1 XtraDB Cluster (Synchronous Replication)

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: my-mysql
  namespace: database
spec:
  crVersion: "1.14.0"
  secretsName: my-mysql-secrets

  pxc:
    size: 3
    image: percona/percona-xtradb-cluster:8.0.35
    resources:
      requests:
        cpu: "2"
        memory: "8Gi"
      limits:
        cpu: "4"
        memory: "8Gi"
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: gp3-encrypted
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
    configuration: |
      [mysqld]
      innodb_buffer_pool_size=4G
      innodb_log_file_size=1G
      innodb_flush_method=O_DIRECT
      max_connections=500
      wsrep_sync_wait=3
      wsrep_slave_threads=4

  haproxy:
    enabled: true
    size: 2
    image: percona/haproxy:2.8.5
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"

  backup:
    image: percona/percona-xtradb-cluster-operator:1.14.0-pxc8.0-backup-pxb8.0.35
    storages:
      s3-backup:
        type: s3
        s3:
          bucket: my-mysql-backups
          region: us-east-1
          credentialsSecret: aws-s3-secret
    schedule:
      - name: daily-full
        schedule: "0 3 * * *"
        keep: 7
        storageName: s3-backup

4.2 ProxySQL Connection Routing

spec:
  proxysql:
    enabled: true
    size: 2
    image: percona/proxysql2:2.5.5
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"

4.3 Percona XtraBackup

# Manual backup
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: manual-backup-20240901
  namespace: database
spec:
  pxcCluster: my-mysql
  storageName: s3-backup

# Check backup status
kubectl get pxc-backup -n database

# Restore from backup
kubectl apply -f - <<EOF
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: restore-20240901
  namespace: database
spec:
  pxcCluster: my-mysql
  backupName: manual-backup-20240901
EOF

5. MongoDB Community Operator

5.1 ReplicaSet Configuration

apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: my-mongodb
  namespace: database
spec:
  members: 3
  type: ReplicaSet
  version: "7.0.12"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: admin
      db: admin
      passwordSecretRef:
        name: mongodb-admin-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: admin-scram
    - name: myapp
      db: myapp
      passwordSecretRef:
        name: mongodb-myapp-password
      roles:
        - name: readWrite
          db: myapp
      scramCredentialsSecretName: myapp-scram

  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: mongod
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                limits:
                  cpu: "4"
                  memory: "4Gi"
      volumeClaimTemplates:
        - metadata:
            name: data-volume
          spec:
            storageClassName: gp3-encrypted
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 50Gi

  additionalMongodConfig:
    storage.wiredTiger.engineConfig.cacheSizeGB: 2
    net.maxIncomingConnections: 500

5.2 Backup Strategies

# mongodump-based logical backup (CronJob)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mongodb-backup
  namespace: database
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              # NOTE: the official mongo image does not include the AWS CLI;
              # build a custom image (or add a second step) for the S3 upload
              image: mongo:7.0
              command:
                - /bin/sh
                - -c
                - |
                  mongodump \
                    --uri="mongodb://myapp:password@my-mongodb-0.my-mongodb-svc:27017/myapp?replicaSet=my-mongodb" \
                    --gzip \
                    --archive=/backup/myapp-$(date +%Y%m%d).gz
                  aws s3 cp /backup/myapp-$(date +%Y%m%d).gz \
                    s3://my-backup-bucket/mongodb/
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}
          restartPolicy: OnFailure

6. Storage Deep Dive

6.1 StorageClass Configuration

# AWS EBS gp3 StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  iops: "6000"           # gp3 baseline 3000, max 16000
  throughput: "250"       # gp3 baseline 125MB/s, max 1000MB/s
  fsType: ext4
reclaimPolicy: Retain     # DB data must use Retain
volumeBindingMode: WaitForFirstConsumer  # Topology-aware binding
allowVolumeExpansion: true
---
# GCP PD SSD StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-ssd-encrypted
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

6.2 Local PV vs Cloud EBS/PD Performance Comparison

Metric | Local NVMe | EBS gp3 (6000 IOPS) | EBS io2 (16000 IOPS) | GCP PD-SSD
-------|------------|---------------------|----------------------|-----------
Random Read IOPS | 500K+ | 6,000 | 16,000 | 30,000
Random Write IOPS | 200K+ | 6,000 | 16,000 | 30,000
Sequential Read MB/s | 3,000+ | 250 | 1,000 | 1,200
Sequential Write MB/s | 2,000+ | 250 | 1,000 | 1,200
P99 Latency | Under 0.1ms | 1-3ms | 0.5-1ms | 1-2ms
Data Durability | Lost on node failure | 99.999% | 99.999% | 99.999%
Resize | Not possible | Dynamic expansion | Dynamic expansion | Dynamic expansion

6.3 Storage Benchmarking

# fio benchmark Pod
kubectl run fio-bench --image=nixery.dev/fio \
  --overrides='{
    "apiVersion": "v1",
    "spec": {
      "containers": [{
        "name": "fio-bench",
        "image": "nixery.dev/fio",
        "command": ["sleep", "3600"],
        "volumeMounts": [{
          "name": "test-vol",
          "mountPath": "/data"
        }]
      }],
      "volumes": [{
        "name": "test-vol",
        "persistentVolumeClaim": {
          "claimName": "bench-pvc"
        }
      }]
    }
  }'

# Random read benchmark
kubectl exec fio-bench -- fio \
  --name=randread \
  --ioengine=libaio \
  --iodepth=32 \
  --rw=randread \
  --bs=4k \
  --direct=1 \
  --size=1G \
  --numjobs=4 \
  --runtime=60 \
  --directory=/data \
  --group_reporting

# pgbench (PostgreSQL performance test)
kubectl exec -it my-postgres-1 -n database -- pgbench \
  -i -s 100 myapp  # Initialize (scale factor 100)

kubectl exec -it my-postgres-1 -n database -- pgbench \
  -c 50 -j 4 -T 300 -P 10 myapp  # 50 clients, 4 threads, 300 seconds

7. High Availability Patterns

7.1 Synchronous vs Asynchronous Replication

# CloudNativePG: Synchronous replication
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres-sync
spec:
  instances: 3

  postgresql:
    parameters:
      synchronous_standby_names: "ANY 1 (*)"
      # ANY 1: Guarantees sync replication to at least 1 Replica
      # FIRST 1: Sync replication to the first Replica

  minSyncReplicas: 1
  maxSyncReplicas: 1

Mode | Data Consistency | Write Latency | RPO | Use Case
-----|------------------|---------------|-----|---------
Asynchronous | Eventual consistency | Low | Possible seconds of data loss | General workloads
Synchronous (ANY 1) | Strong consistency | High (2x) | 0 (no data loss) | Financial, payments
Semi-synchronous | Medium | Medium | Very small | Most production
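The "High (2x)" write-latency figure for synchronous mode has a mechanical explanation: a commit must wait for the local WAL flush plus one network round trip to a standby that itself flushes before acknowledging. A rough model (illustrative numbers, not a benchmark):

```python
# Rough commit-latency model for synchronous replication: local WAL
# flush, plus one round trip while the sync standby flushes and acks.
def sync_commit_latency_ms(local_fsync_ms: float, network_rtt_ms: float,
                           replica_fsync_ms: float) -> float:
    return local_fsync_ms + network_rtt_ms + replica_fsync_ms

same_az = sync_commit_latency_ms(1.0, 0.5, 1.0)   # ~2.5 ms vs ~1 ms async
cross_az = sync_commit_latency_ms(1.0, 2.0, 1.0)  # ~4 ms; RTT dominates
```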

7.2 Pod Disruption Budget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
  namespace: database
spec:
  maxUnavailable: 1  # Allow max 1 Pod disrupted at a time
  selector:
    matchLabels:
      app: postgres

CloudNativePG automatically creates a PDB. Default is maxUnavailable: 1.

7.3 Topology Spread Constraints

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
spec:
  instances: 3

  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          cnpg.io/cluster: my-postgres
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          cnpg.io/cluster: my-postgres

This spreads DB Pods across different availability zones (AZs) and different nodes.

7.4 Anti-Affinity Rules

spec:
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname
    podAntiAffinityType: required
    # required: Must spread (Pod stays Pending if not possible)
    # preferred: Spread if possible (allows same node if needed)

    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role
                operator: In
                values:
                  - database
    tolerations:
      - key: database-only
        operator: Equal
        value: "true"
        effect: NoSchedule

8. Backup and Recovery Strategies

8.1 Logical vs Physical Backup

Aspect | Logical (pg_dump) | Physical (pgBackRest)
-------|-------------------|----------------------
Speed | Slow (hours for large DBs) | Fast (incremental backup supported)
Size | SQL text (compressible) | Binary (smaller)
Restore flexibility | Table-level restore possible | Full cluster restore
Version compatibility | Cross PG versions | Same PG version only
PITR | Not possible | Possible
Use case | Migration, partial restore | Production disaster recovery

8.2 Continuous WAL Archiving

WAL archiving flow:
PG Primary -> WAL segments generated (16MB each) -> S3/GCS archive
                                                        |
                                                Continuous data protection
                                                        |
                                             PITR: Recovery to any point
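How much data this flow can lose is easy to bound: at worst, the WAL sitting in the current, not-yet-archived segment. A back-of-the-envelope sketch (assumes archiving on segment completion, with archive_timeout as the safety net during quiet periods):

```python
# Worst-case seconds of WAL lost if the primary dies before the
# current segment is archived: the time to fill one segment, capped
# by archive_timeout for low-traffic clusters.
def worst_case_rpo_seconds(wal_write_mb_per_s: float,
                           segment_mb: float = 16.0,
                           archive_timeout_s: float = 300.0) -> float:
    if wal_write_mb_per_s <= 0:
        return archive_timeout_s
    time_to_fill = segment_mb / wal_write_mb_per_s
    return min(time_to_fill, archive_timeout_s)

# Busy cluster writing 8 MB/s of WAL: 16 MB segment fills in 2s -> RPO ~2s
# Quiet cluster: bounded by archive_timeout (300s here)
```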

8.3 PITR (Point-in-Time Recovery) Walkthrough

# 1. Check current backup list
kubectl get backup -n database

# 2. Determine target recovery time (e.g., just before accidental DELETE)
# Time: 2024-09-01 14:29:00 (deletion executed at 14:30)

# 3. PITR restore to new cluster
kubectl apply -f pitr-restore.yaml

# 4. Monitor restore progress
kubectl get cluster my-postgres-restored -n database -w

# 5. Verify data after restore
kubectl exec -it my-postgres-restored-1 -- psql -U myapp -d myapp \
  -c "SELECT count(*) FROM important_table;"

# 6. Switch application connection
kubectl patch service myapp-db -n database \
  -p '{"spec":{"selector":{"cnpg.io/cluster":"my-postgres-restored"}}}'

8.4 Velero Cluster-Level Backup

# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-bucket \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Full backup of DB namespace (including PVs)
velero backup create db-full-backup \
  --include-namespaces database \
  --include-resources '*' \
  --snapshot-volumes=true \
  --volume-snapshot-locations default

# Scheduled backup
velero schedule create daily-db-backup \
  --schedule="0 4 * * *" \
  --include-namespaces database \
  --snapshot-volumes=true \
  --ttl 720h  # 30-day retention

8.5 Cross-Region Disaster Recovery

# Source region cluster
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres-primary
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket-primary/
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: ACCESS_SECRET_KEY

---
# DR region replica cluster
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres-dr
spec:
  instances: 2

  replica:
    enabled: true
    source: my-postgres-primary

  externalClusters:
    - name: my-postgres-primary
      barmanObjectStore:
        destinationPath: s3://my-backup-bucket-primary/
        s3Credentials:
          accessKeyId:
            name: aws-creds-dr
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: aws-creds-dr
            key: ACCESS_SECRET_KEY

9. Performance Tuning

9.1 Resource Requests/Limits

# DB Pods must use QoS Guaranteed class
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
  limits:
    cpu: "4"        # requests == limits -> Guaranteed
    memory: "16Gi"  # Prevents OOM Killer

Memory allocation guide:

Based on 16Gi total memory:
- shared_buffers: 4GB (25%)
- effective_cache_size: 12GB (75%)
- work_mem: 64MB (per session)
- maintenance_work_mem: 1GB
- OS/other: ~2GB
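The guide above reduces to a couple of ratios; a helper sketch (rules of thumb only, not an official formula — note that effective_cache_size is a planner hint, not an allocation, which is why the lines above deliberately sum past 16Gi):

```python
# Rule-of-thumb PostgreSQL memory settings derived from Pod memory.
def pg_memory_settings(total_gib: int) -> dict[str, str]:
    return {
        "shared_buffers": f"{total_gib * 25 // 100}GB",        # ~25% of RAM
        "effective_cache_size": f"{total_gib * 75 // 100}GB",  # ~75%, planner hint only
        "maintenance_work_mem": f"{max(total_gib // 16, 1)}GB",
        "work_mem": "64MB",  # per sort/hash node per session -> keep modest
    }

print(pg_memory_settings(16))
```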

9.2 PostgreSQL Performance Parameter Tuning

postgresql:
  parameters:
    # Memory
    shared_buffers: "4GB"
    effective_cache_size: "12GB"
    work_mem: "64MB"
    maintenance_work_mem: "1GB"
    wal_buffers: "64MB"

    # WAL
    max_wal_size: "4GB"
    min_wal_size: "1GB"
    checkpoint_completion_target: "0.9"
    wal_compression: "zstd"

    # Query planner
    random_page_cost: "1.1"       # For SSD (HDD is 4.0)
    effective_io_concurrency: "200"  # For SSD (HDD is 2)

    # Parallelism
    max_worker_processes: "8"
    max_parallel_workers_per_gather: "4"
    max_parallel_workers: "8"

    # Connections
    max_connections: "200"
    idle_in_transaction_session_timeout: "30000"  # 30 seconds

    # Logging
    log_min_duration_statement: "1000"   # Log queries over 1 second
    log_checkpoints: "on"
    log_lock_waits: "on"

    # Auto VACUUM
    autovacuum_max_workers: "4"
    autovacuum_naptime: "30"
    autovacuum_vacuum_cost_limit: "1000"

9.3 Connection Pooling Configuration

PgBouncer recommended settings:

pool_mode = transaction
- Allocates/returns connections per transaction
- Essential for hundreds of Pods connecting simultaneously in K8s
- Server-side prepared statements require session mode (PgBouncer 1.21+ can also track them in transaction mode via max_prepared_statements)

default_pool_size = 25
- Backend DB connection count (12.5% of max_connections)
- Too high overloads DB, too low increases wait times

max_client_conn = 1000
- Maximum frontend connections
- Calculate: number of Pods x connections per Pod

reserve_pool_size = 5
- Reserve connections for peak traffic

server_idle_timeout = 300
- Clean up idle connections after 5 minutes
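These rules of thumb are worth checking mechanically before rollout; a sketch (the thresholds mirror the guidance above, not any official PgBouncer formula):

```python
# Sanity-check PgBouncer sizing against the rules of thumb above.
def check_pool_sizing(max_connections: int, default_pool_size: int,
                      app_pods: int, conns_per_pod: int,
                      max_client_conn: int) -> list[str]:
    warnings = []
    if default_pool_size > max_connections // 4:
        warnings.append("default_pool_size risks exhausting backend max_connections")
    if app_pods * conns_per_pod > max_client_conn:
        warnings.append("max_client_conn is below expected client connections")
    return warnings

# 200 backend slots, pool of 25, 100 Pods x 8 conns, cap 1000 -> no warnings
assert check_pool_sizing(200, 25, 100, 8, 1000) == []
```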

10. Security

10.1 Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-network-policy
  namespace: database
spec:
  podSelector:
    matchLabels:
      cnpg.io/cluster: my-postgres
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow access only from application namespace
    - from:
        - namespaceSelector:
            matchLabels:
              app.kubernetes.io/part-of: myapp
      ports:
        - protocol: TCP
          port: 5432
    # Allow replication traffic within cluster
    - from:
        - podSelector:
            matchLabels:
              cnpg.io/cluster: my-postgres
      ports:
        - protocol: TCP
          port: 5432
  egress:
    # Outbound for S3 backup (HTTPS)
    - to: []
      ports:
        - protocol: TCP
          port: 443
    # Intra-cluster communication
    - to:
        - podSelector:
            matchLabels:
              cnpg.io/cluster: my-postgres
      ports:
        - protocol: TCP
          port: 5432
    # DNS
    - to: []
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

10.2 Secrets Management (External Secrets Operator)

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: postgres-credentials
  namespace: database
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-db-credentials
    creationPolicy: Owner
  data:
    - secretKey: username
      remoteRef:
        key: production/database/postgres
        property: username
    - secretKey: password
      remoteRef:
        key: production/database/postgres
        property: password

10.3 TLS Configuration

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
spec:
  certificates:
    serverTLSSecret: my-postgres-server-tls
    serverCASecret: my-postgres-ca

  postgresql:
    parameters:
      ssl: "on"
      ssl_min_protocol_version: "TLSv1.3"

11. Monitoring and Alerting

11.1 Key Metrics and Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-alerts
  namespace: monitoring
spec:
  groups:
    - name: postgres.rules
      rules:
        - alert: PostgresReplicationLagHigh
          expr: cnpg_pg_replication_lag > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PostgreSQL replication lag exceeds 10 seconds"

        - alert: PostgresConnectionsNearLimit
          expr: >
            cnpg_pg_stat_activity_count /
            cnpg_pg_settings_setting{name="max_connections"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Connection count exceeds 80% of maximum"

        - alert: PostgresCacheHitRatioLow
          expr: >
            cnpg_pg_stat_database_blks_hit /
            (cnpg_pg_stat_database_blks_hit + cnpg_pg_stat_database_blks_read) < 0.95
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cache hit ratio below 95%. Consider increasing shared_buffers"

        - alert: PostgresStorageNearFull
          expr: >
            kubelet_volume_stats_used_bytes{namespace="database"} /
            kubelet_volume_stats_capacity_bytes{namespace="database"} > 0.85
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "DB storage usage exceeds 85%"

        - alert: PostgresFailoverDetected
          expr: changes(cnpg_pg_replication_is_primary[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL Failover detected"

11.2 Grafana Dashboard Key Panels

Key monitoring panels:
1. Replication lag (seconds) - cnpg_pg_replication_lag
2. Active connection count - cnpg_pg_stat_activity_count
3. Cache hit ratio - blks_hit / (blks_hit + blks_read)
4. TPS (transactions/second) - rate(cnpg_pg_stat_database_xact_commit[5m])
5. Query latency P99
6. Disk IOPS / throughput
7. WAL generation rate - rate(cnpg_pg_stat_archiver_archived_count[5m])
8. Storage usage - kubelet_volume_stats_used_bytes
9. CPU / Memory utilization
10. Vacuum / Analyze execution status
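Panel 3 is a plain derivation from the pg_stat_database counters; as a quick sketch:

```python
# Cache hit ratio (panel 3 above): fraction of block requests served
# from shared_buffers rather than read from disk.
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0

assert cache_hit_ratio(990_000, 10_000) == 0.99  # healthy (>= 0.95)
```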

12. Migration Strategies

12.1 VM to K8s Migration

# Method 1: pg_dump/pg_restore (small DBs)
pg_dump -Fc -d myapp -h vm-db.internal -U postgres > myapp.dump

kubectl cp myapp.dump database/my-postgres-1:/tmp/myapp.dump
kubectl exec -it my-postgres-1 -n database -- \
  pg_restore -d myapp -U postgres /tmp/myapp.dump

# Method 2: Logical replication (large DBs, zero-downtime)
# 1. Create empty cluster in K8s
# 2. Create publication on VM DB
#    CREATE PUBLICATION my_pub FOR ALL TABLES;
# 3. Create subscription on K8s DB
#    CREATE SUBSCRIPTION my_sub
#      CONNECTION 'host=vm-db.internal dbname=myapp user=replicator'
#      PUBLICATION my_pub;
# 4. Wait for initial data sync completion
# 5. Switch application connections to K8s DB
# 6. Drop subscription
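Before step 6, verify the two sides agree. A minimal sketch comparing per-table row counts collected separately from source and target (the input dicts are assumed to come from your own count(*) queries on each side):

```python
# Report tables whose row counts differ between source and target,
# including tables present on only one side.
def diff_row_counts(source: dict[str, int],
                    target: dict[str, int]) -> dict[str, tuple]:
    mismatches = {}
    for table in source.keys() | target.keys():
        s, t = source.get(table), target.get(table)
        if s != t:
            mismatches[table] = (s, t)
    return mismatches

assert diff_row_counts({"users": 100, "orders": 5},
                       {"users": 100, "orders": 4}) == {"orders": (5, 4)}
```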

12.2 Zero-Downtime Migration Checklist

Before migration:
[ ] Assess source DB size and table count
[ ] Verify K8s cluster resource availability
[ ] Size StorageClass / PV (2x headroom over source)
[ ] Verify network connectivity (source DB -> K8s cluster)
[ ] Create logical replication user with appropriate permissions

During migration:
[ ] Confirm initial data copy completed
[ ] Verify replication lag converging to 0
[ ] Check sequence value synchronization
[ ] Switch application read traffic to K8s (test)
[ ] Switch write traffic (brief downtime or dual-write)

After migration:
[ ] Verify data integrity (row counts, checksums)
[ ] Confirm application performance
[ ] Stop source DB replication
[ ] Confirm backup policy applied
[ ] Verify monitoring/alerting operational

13. Production Checklist

Infrastructure and Storage

  • StorageClass reclaimPolicy set to Retain
  • volumeBindingMode set to WaitForFirstConsumer
  • EBS/PD IOPS appropriate for workload (minimum 6000 IOPS)
  • Volume auto-expansion (allowVolumeExpansion: true) enabled
  • Nodes have DB-dedicated labels/taints

HA and Replication

  • Minimum 3 instances (1 Primary + 2 Replicas)
  • Pod Anti-Affinity spreads across different nodes
  • Topology Spread across different AZs
  • PodDisruptionBudget configured (maxUnavailable: 1)
  • Sync/async replication mode decided and configured

Backup and Recovery

  • Automated backup schedule set (minimum daily)
  • WAL archiving enabled (PITR capable)
  • Backup retention policy set (minimum 30 days)
  • Backup restore test completed (minimum quarterly)
  • Cross-region backup replication (if DR required)

Performance

  • Resource requests == limits (QoS Guaranteed)
  • shared_buffers = 25% of total memory
  • effective_cache_size = 75% of total memory
  • Connection Pooling enabled (PgBouncer transaction mode)
  • Query logging enabled (log_min_duration_statement)

Security

  • NetworkPolicy restricts DB access
  • TLS enabled (minimum TLSv1.2)
  • Secrets managed by External Secrets Operator or Vault
  • RBAC: Operator service account with minimal privileges
  • Data encryption (storage level + transport level)

Monitoring

  • Prometheus metrics collection configured
  • Grafana dashboard set up
  • Replication lag alert (over 10 seconds)
  • Storage capacity alert (over 85%)
  • Connection count alert (over 80%)
  • Failover occurrence alert

Operations

  • Operator upgrade procedure documented
  • DB minor/major version upgrade procedure documented
  • Incident response runbook created
  • Regular restore test schedule established

14. Quiz

Q1: Why is Pod Identity important in StatefulSet?

A: StatefulSet Pod Identity (ordinal index + stable hostname + dedicated PVC) is essential for database operations. Primary/Replica roles are bound to specific Pods, the same storage is connected even after restarts, and DNS-based access allows other Pods to directly reach specific instances. Regular Deployments cannot provide these guarantees.

Q2: How does CloudNativePG automatic Failover work?

A: When CloudNativePG detects a Primary Pod health check failure, it selects the Replica with the most recent WAL position and promotes it to the new Primary using pg_promote(). Remaining Replicas are reconfigured to follow the new Primary, and the former Primary rejoins as a Replica via pg_rewind. The entire process typically completes within 10-30 seconds.

Q3: Why should DB Pod QoS class be set to Guaranteed?

A: With QoS Guaranteed (requests == limits), K8s guarantees the Pod's resources and it is the last to be OOM Killed during node memory pressure. Since databases require stable in-memory cache (shared_buffers) and consistent CPU, setting Burstable or BestEffort can cause unexpected performance degradation or failures.

Q4: Why set volumeBindingMode to WaitForFirstConsumer?

A: WaitForFirstConsumer provisions the PV only after the node where the Pod will be scheduled is determined. This ensures Pod and PV are in the same availability zone (AZ), preventing cross-AZ network latency. With Immediate mode, the PV can be created in a different AZ, causing Pod scheduling failures or performance degradation.

Q5: What is the difference between PgBouncer transaction mode and session mode?

A: Transaction mode returns backend connections to the pool when transactions end, allowing hundreds of clients to share a small number of DB connections. This is optimal for K8s environments with many Pods connecting simultaneously. Session mode holds connections until client sessions end, used when PREPARE statements or session variables are needed. Transaction mode is generally recommended for K8s environments.


15. References

  1. CloudNativePG Official Documentation - https://cloudnative-pg.io/documentation/
  2. Percona Operator for MySQL - https://docs.percona.com/percona-operator-for-mysql/pxc/
  3. MongoDB Community Operator - https://github.com/mongodb/mongodb-kubernetes-operator
  4. Kubernetes StatefulSet Documentation - https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
  5. PgBouncer Documentation - https://www.pgbouncer.org/config.html
  6. Velero Official Documentation - https://velero.io/docs/
  7. External Secrets Operator - https://external-secrets.io/
  8. AWS EBS CSI Driver - https://github.com/kubernetes-sigs/aws-ebs-csi-driver
  9. PostgreSQL Performance Tuning Guide - https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
  10. Kubernetes Network Policies - https://kubernetes.io/docs/concepts/services-networking/network-policies/
  11. pgBackRest Documentation - https://pgbackrest.org/
  12. Prometheus PostgreSQL Exporter - https://github.com/prometheus-community/postgres_exporter
  13. Barman (Backup and Recovery Manager) - https://pgbarman.org/