Running Databases on K8s -- Complete Guide to CNPG, Percona, Vitess, and Helm Charts

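Introduction

Running databases on Kubernetes was a contested topic just a few years ago. A common question was, "Why run a DB on K8s, which is optimized for stateless workloads?" As of 2026, however, StatefulSets and the Operator ecosystem have matured enough that running databases on K8s has become a standard option.

This article compares the major database Operators and Helm charts, and sums up which tool fits which situation in practice.

1. Why Run Databases on K8s

Advantages

Consistent deployment pipeline: Manage applications and databases with the same GitOps workflow
Resource efficiency: Applications and databases share node resources, and automatic scheduling raises utilization
Auto-recovery: Automatic Pod restart on failure, automatic rescheduling on node failure
Environment consistency: Declaratively deploy identical DB configurations across dev/staging/production
Cost reduction: Potential license and infrastructure savings compared to managed DB services (RDS, Cloud SQL)

Disadvantages

Operational complexity: Storage, networking, backups, and more must be managed directly
Performance overhead: Extra latency from the container networking and storage layers
Expertise required: Deep understanding of both K8s and databases is needed
Data loss risk: Incorrect PV/PVC settings or upgrade mistakes can destroy data

StatefulSet Basics

StatefulSet is the K8s controller for workloads that need to keep state. Unlike a regular Deployment, it guarantees the following.

Stable network IDs: Each Pod keeps a unique hostname (e.g., postgres-0, postgres-1)
Ordered deployment/scaling: Pods are created in order starting from 0 and terminated in reverse order
Persistent storage: Each Pod gets its own PVC bound through volumeClaimTemplates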

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
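      containers:
        - name: postgres
          image: postgres:16
          env:
            # The official image refuses to initialize without a superuser password.
            # "postgres-credentials" is a placeholder Secret name; create it beforehand.
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            # Use a subdirectory so initdb does not trip over lost+found on a fresh volume.
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data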
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ['ReadWriteOnce']
        storageClassName: gp3
        resources:
          requests:
            storage: 50Gi

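However, a StatefulSet alone does not give you HA, automatic failover, backup/recovery, or monitoring. The Operator pattern fills that gap.

2. CloudNativePG (CNPG) -- PostgreSQL-Specific Operator

Overview

CloudNativePG (CNPG) is a PostgreSQL-specific Kubernetes Operator that was initially developed by EDB (EnterpriseDB) and is now managed as a CNCF Sandbox project. As of April 2026, the latest version is 1.29, and it has reworked PostgreSQL extension management around Image Catalogs and an artifacts ecosystem.

Key Features

Native K8s design: Relies on the Kubernetes API server for cluster state and failover coordination instead of an external HA tool such as Patroni
Declarative DB management: Manages the PostgreSQL database lifecycle via the Database CRD
Logical replication: Supports online migration and major version upgrades via Publication / Subscription CRDs
CNPG-I plugin framework: Extensible through external plugins
PITR support: Point-In-Time Recovery based on WAL archiving
Parallel reconciler: Parallel processing that makes cluster management more efficient

Installation

Installing with Helm is the simplest approach.

# Add Helm repository
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update

# Install Operator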
helm upgrade --install cnpg \
  --namespace cnpg-system \
  --create-namespace \
  cnpg/cloudnative-pg

Alternatively, you can install directly from the manifest.

kubectl apply --server-side -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.29/releases/cnpg-1.29.0.yaml

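HA Architecture

CNPG uses one Primary plus N Standbys. The Primary handles writes, and the Standbys stay in sync through streaming replication.

When the Primary fails, one of the Standbys is promoted automatically
Supports both Switchover (planned transition) and Failover (transition on failure)
Optional data durability guarantee through synchronous replication (dataDurability)

Backup and Recovery

CNPG provides continuous backup based on Barman Cloud. The in-tree barmanObjectStore configuration shown below has been deprecated since CNPG 1.26 in favor of the Barman Cloud Plugin (CNPG-I), so check the documentation for your operator version before relying on it.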

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: prod-pg
spec:
  instances: 3
  storage:
    size: 100Gi
    storageClass: gp3
  backup:
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket/prod-pg/
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
    retentionPolicy: '30d'

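You can schedule regular backups with ScheduledBackup. Note that its schedule field takes a six-field cron expression whose first field is seconds, unlike the five-field Unix crontab format.

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: prod-pg-daily
spec:
  schedule: '0 0 2 * * *' # six fields (seconds first): every day at 02:00:00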
  cluster:
    name: prod-pg
  backupOwnerReference: self

3. Percona Operator -- Multi-DB Support

Overview

Percona provides a dedicated Kubernetes Operator for each of three databases: MySQL, MongoDB, and PostgreSQL. They are fully open source under the Apache 2.0 license, so the enterprise-grade features are available for free.

Features by Supported DB

Percona Operator for MySQL (PXC)

Multi-Primary architecture based on Percona XtraDB Cluster
Galera synchronous replication allows reads and writes on every node
Automatic routing through ProxySQL or HAProxy
Group Replication option added in the 2026 GA release

Percona Operator for MongoDB (PSMDB)

ReplicaSet and Sharded Cluster support
PVC snapshot-based backup support (added in 2025)
Official MongoDB 8.0 support
Cloud storage authentication via IAM Role for Service Account

Percona Operator for PostgreSQL (PPG)

Patroni-based HA configuration
Native pg_tde (Transparent Data Encryption) support (2026)
Zero-downtime major version upgrades are on the roadmap

Integrated Monitoring: PMM

Percona Monitoring and Management (PMM) monitors MySQL, MongoDB, and PostgreSQL together from a single dashboard.

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: prod-mysql
spec:
  crVersion: '1.15.0'
  pxc:
    size: 3
    image: percona/percona-xtradb-cluster:8.0
    resources:
      requests:
        memory: 2Gi
        cpu: '1'
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
  haproxy:
    enabled: true
    size: 2
  pmm:
    enabled: true
    serverHost: monitoring-service
  backup:
    schedule:
      - name: daily-backup
        schedule: '0 3 * * *'
        keep: 7
        storageName: s3-backup
    storages:
      s3-backup:
        type: s3
        s3:
          bucket: my-backup-bucket
          credentialsSecret: aws-creds
          region: ap-northeast-2

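Multi-Cluster Support

Percona Operator supports cross-region replication, so data can be synchronized across multiple K8s clusters. This makes disaster recovery (DR) setups possible.

4. Vitess -- MySQL Horizontal Sharding

Overview

Vitess was developed at YouTube to handle large-scale MySQL workloads and is now a CNCF Graduated project. It exposes a MySQL-compatible interface while supporting transparent sharding, connection pooling, and online resharding.

PlanetScale is the best-known DBaaS that commercialized Vitess.

Architecture Components

Component	Role
VTGate	Query router, the endpoint applications connect to
VTTablet	Proxy wrapping each MySQL instance
Topology Service	Cluster metadata storage (etcd, etc.)
VTOrc	Orchestrator, handles automatic failover
VTAdmin	Web-based management UI

When to Choose Vitess

Write traffic too large for a single MySQL instance
Large tables with billions of rows or more that need sharding
Environments with frequent online schema changes (Online DDL)
Horizontal scaling is needed while keeping MySQL compatibility

Vitess on K8s Installation

Use the Vitess Operator provided by PlanetScale.

# Install Vitess Operator
kubectl apply -f https://github.com/planetscale/vitess-operator/releases/latest/download/operator.yaml

Keyspaces (logical databases) and Shards are managed declaratively.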

apiVersion: planetscale.com/v2
kind: VitessCluster
metadata:
  name: prod-vitess
spec:
  images:
    vtgate: vitess/lite:v19
    vttablet: vitess/lite:v19
    vtbackup: vitess/lite:v19
    vtctld: vitess/lite:v19
    vtorc: vitess/lite:v19
  cells:
    - name: zone1
      gateway:
        replicas: 2
        resources:
          requests:
            cpu: '1'
            memory: 1Gi
  keyspaces:
    - name: commerce
      turndownPolicy: Immediate
      partitionings:
        - equal:
            parts: 2
            shardTemplate:
              databaseInitScriptSecret:
                name: commerce-schema
                key: init.sql
              tabletPools:
                - cell: zone1
                  type: replica
                  replicas: 3
                  dataVolumeClaimTemplate:
                    storageClassName: gp3
                    resources:
                      requests:
                        storage: 50Gi

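Caveats

Vitess is very powerful but has a steep learning curve. It can be overkill for simple CRUD applications, and the sharding strategy (VSchema) needs careful design review.

5. Major Helm Chart Comparison

You can also deploy databases on K8s with Helm charts alone, without an Operator. Bitnami used to be the most widely used Helm chart provider, but there have been important changes since 2025.

Bitnami License Changes (2025)

Since September 2025, most Bitnami Helm chart OCI packages have moved behind a Broadcom paid subscription. The public docker.io/bitnami images were moved to the "Bitnami Legacy" repository and no longer receive updates, fixes, or security patches.

As an alternative, Chainguard provides over 40 security-hardened Helm charts forked from Bitnami, and community-based alternatives are also growing.

Helm Chart Comparison Table

Chart	DB	Default Config	HA Support	Built-in Backup	Notes
bitnami/postgresql	PostgreSQL	Primary + Read Replica	Streaming replication, no automatic failover	X	Legacy warning
bitnami/postgresql-ha	PostgreSQL	Primary + Standby	repmgr + Pgpool-II	X	HA-dedicated chart
bitnami/mysql	MySQL	Primary + Secondary	Semi-sync replication	X	InnoDB Cluster option
bitnami/redis	Redis	Master + Replica	Sentinel-based	X	Separate Cluster mode
bitnami/mongodb	MongoDB	ReplicaSet	Built-in	X	Separate Sharded chart
bitnami/mariadb	MariaDB	Primary + Secondary	Galera option	X	MySQL compatible

When Helm Charts Are Appropriate

Dev/test environments: When you want to spin up a DB quickly
Simple configurations: Small-scale services where HA is not mandatory
Learning purposes: Picking up the basics of running a DB on K8s
Many custom settings: When fine-grained tuning through values.yaml is needed

Example: PostgreSQL HA Helm Chart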

helm install prod-pg bitnami/postgresql-ha \
  --set postgresql.replicaCount=3 \
  --set postgresql.resources.requests.memory=2Gi \
  --set postgresql.resources.requests.cpu=1 \
  --set persistence.size=100Gi \
  --set persistence.storageClass=gp3 \
  --set pgpool.replicaCount=2 \
  --set metrics.enabled=true

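6. CNPG vs Percona vs Vitess vs Helm Charts -- Comprehensive Comparison

Feature Comparison Table

Item	CNPG	Percona	Vitess	Helm Charts
Supported DBs	PostgreSQL	MySQL, MongoDB, PG	MySQL (sharding)	Various
License	Apache 2.0	Apache 2.0	Apache 2.0	Varies by chart
CNCF Status	Sandbox	-	Graduated	-
HA Auto-Failover	O	O	O	Depends on chart
Auto Backup	O (Barman)	O (multi-storage)	O	X (separate config)
PITR	O	O	Partial	X
Horizontal Sharding	X	MongoDB only	O (core feature)	X
Monitoring Integration	Prometheus	PMM + Prometheus	VTAdmin	Per-chart metrics
Connection Pooling	PgBouncer built-in	ProxySQL/HAProxy	VTGate built-in	Separate config
Operational Difficulty	Medium	Medium	High	Low
Production Readiness	High	High	High (large-scale)	Medium
CRD-based Management	O	O	O	X

Selection Guide

PostgreSQL on K8s: CNPG is the first choice. K8s-native design with an active community
MySQL/MongoDB operations: Percona Operator. Integrated monitoring (PMM) is a strength
Large-scale MySQL sharding: Vitess. Built for data sets of billions of rows or more
Dev/test environments: Helm charts. Quick deployment and simple configuration
Multi-DB unified management: Percona. MySQL + MongoDB + PostgreSQL under one operational model

7. Operational Considerations

Storage (PV/PVC)

Storage is the most important factor in K8s DB operations.

# Recommended StorageClass example (AWS EBS gp3)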
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-db
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: '5000'
  throughput: '250'
  encrypted: 'true'
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain

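Key principles:

volumeBindingMode: WaitForFirstConsumer -- Creates the volume in the AZ where the Pod is scheduled
reclaimPolicy: Retain -- Preserves data even when the PVC is deleted
allowVolumeExpansion: true -- Allows online volume expansion
Separate WAL and Data volumes -- Splitting sequential writes (WAL) from random access (Data) improves performance

Performance Tuning

# Guaranteed QoS example: requests == limits with integer CPUs
# (a prerequisite for CPU pinning under the kubelet's static CPU Manager policy)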
spec:
  containers:
    - name: postgres
      resources:
        requests:
          cpu: '4'
          memory: 8Gi
        limits:
          cpu: '4'
          memory: 8Gi
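      # Ensure Guaranteed QoS class
      # Set requests == limits

Additional tips:

Guaranteed QoS: Set requests equal to limits so the Pod is last in line for eviction under node pressure; note that a CPU limit can itself cause CFS throttling when the database bursts above it
Topology-aware scheduling: Use nodeAffinity to place DB Pods on high-performance nodes
Anti-affinity: Spread DB Pods across different nodes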

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - postgres
          topologyKey: kubernetes.io/hostname

Resource Limits

Always set resources.requests and limits on DB Pods
Exceeding the memory limit triggers an OOMKill -- the DB process is forcibly terminated
Set PostgreSQL shared_buffers to around 25% of container memory
Set MySQL innodb_buffer_pool_size to 50-70% of container memory

Monitoring

# Enable PodMonitor in CNPG
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: prod-pg
spec:
  instances: 3
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: pg-custom-queries
        key: queries
  storage:
    size: 100Gi

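Essential monitoring metrics:

Replication Lag: How far a Standby is behind the Primary
Connection count: Usage relative to the maximum number of connections
Transaction throughput (TPS): Transactions per second
Disk usage: Set alerts before PV capacity is exhausted
WAL archiving status: Whether the backup system is working normally

Backup Strategy

Apply the 3-2-1 backup rule to the K8s environment.

3 copies: Primary data + WAL archive + physical backup
2 different media: Local PV + Object Storage (S3/GCS)
1 offsite: Replicate to a bucket in another region

# CNPG recovery cluster example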
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: recovery-cluster
spec:
  instances: 2
  storage:
    size: 100Gi
  bootstrap:
    recovery:
      source: prod-pg
      recoveryTarget:
        targetTime: '2026-04-10T08:00:00Z'
  externalClusters:
    - name: prod-pg
      barmanObjectStore:
        destinationPath: s3://my-backup-bucket/prod-pg/
        s3Credentials:
          accessKeyId:
            name: aws-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: aws-creds
            key: SECRET_ACCESS_KEY

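8. Practical Example: Building a PostgreSQL HA Cluster with CNPG

This is a complete set of CNPG cluster YAML that can serve as a starting point for a real production environment.

Step 1: Create Namespace and Secrets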

kubectl create namespace database

kubectl create secret generic pg-superuser \
  --namespace database \
  --from-literal=username=postgres \
  --from-literal=password=CHANGE_ME_TO_STRONG_PASSWORD

kubectl create secret generic aws-creds \
  --namespace database \
  --from-literal=ACCESS_KEY_ID=your-access-key \
  --from-literal=SECRET_ACCESS_KEY=your-secret-key

Step 2: Define the PostgreSQL Cluster

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: prod-pg
  namespace: database
spec:
  description: 'Production PostgreSQL HA Cluster'
  imageName: ghcr.io/cloudnative-pg/postgresql:16.4

  instances: 3

  startDelay: 30
  stopDelay: 30
  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      shared_buffers: '2GB'
      effective_cache_size: '6GB'
      work_mem: '64MB'
      maintenance_work_mem: '512MB'
      max_connections: '200'
      max_wal_size: '2GB'
      min_wal_size: '1GB'
      wal_buffers: '64MB'
      random_page_cost: '1.1'
      effective_io_concurrency: '200'
      log_statement: 'ddl'
      log_min_duration_statement: '1000'
    pg_hba:
      - host all all 10.0.0.0/8 scram-sha-256

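  # pg-superuser holds the postgres superuser credentials, so it belongs in
  # superuserSecret; bootstrap.initdb.secret is meant for the application owner.
  enableSuperuserAccess: true
  superuserSecret:
    name: pg-superuser

  bootstrap:
    initdb:
      database: appdb
      owner: appuser
      # With no secret given here, the operator generates one for appuser (prod-pg-app).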

  storage:
    size: 100Gi
    storageClass: gp3-db

  walStorage:
    size: 30Gi
    storageClass: gp3-db

  resources:
    requests:
      memory: 8Gi
      cpu: '4'
    limits:
      memory: 8Gi
      cpu: '4'

  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname

  monitoring:
    enablePodMonitor: true

  backup:
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket/prod-pg/
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
        maxParallel: 4
      data:
        compression: gzip
        jobs: 4
    retentionPolicy: '30d'

  nodeMaintenanceWindow:
    inProgress: false
    reusePVC: true
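
The parameters in this manifest interact: shared_buffers is allocated once, while work_mem can be claimed by every connection (and by more than one sort or hash step per query). The following is a rough sketch, not a sizing tool, that compares a pessimistic estimate against the 8Gi limit above. The function name and the "one work_mem per connection" simplification are assumptions made for illustration.

```python
def worst_case_memory_mib(shared_buffers_mib: int,
                          work_mem_mib: int,
                          max_connections: int) -> int:
    """Pessimistic estimate: shared_buffers plus one work_mem per connection.

    Real usage can be lower (idle connections) or higher (several sort/hash
    nodes per query), so treat the result as a sanity check only.
    """
    return shared_buffers_mib + work_mem_mib * max_connections


# Values taken from the Cluster manifest above.
limit_mib = 8 * 1024
estimate = worst_case_memory_mib(shared_buffers_mib=2048,
                                 work_mem_mib=64,
                                 max_connections=200)
print(f"estimate={estimate} MiB, limit={limit_mib} MiB, "
      f"over limit: {estimate > limit_mib}")
```

With these values the estimate exceeds the container limit, which is one reason to put a connection pooler in front of the cluster or to lower work_mem before raising max_connections.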

Step 3: Scheduled Backups

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: prod-pg-daily
  namespace: database
spec:
  schedule: '0 0 2 * * *' # six fields (seconds first): every day at 02:00:00
  backupOwnerReference: self
  cluster:
    name: prod-pg
  target: prefer-standby

Step 4: Configure the Pooler (PgBouncer)

apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: prod-pg-pooler-rw
  namespace: database
spec:
  cluster:
    name: prod-pg
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      max_client_conn: '1000'
      default_pool_size: '50'

Step 5: Deploy and Verify

kubectl apply -f cluster.yaml
kubectl apply -f scheduled-backup.yaml
kubectl apply -f pooler.yaml

# Check cluster status
kubectl get cluster -n database

# Check Pod status
kubectl get pods -n database

# Detailed cluster information
kubectl describe cluster prod-pg -n database

# Test a connection (prod-pg-1 is the initial primary; after a failover, check `kubectl get cluster` for the current one)
kubectl exec -it prod-pg-1 -n database -- psql -U postgres -d appdb

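9. Anti-Patterns -- Mistakes to Avoid When Running DBs on K8s

1) Deploying a DB with a Deployment

Deployments are for stateless workloads. Using one for a DB leads to data loss when a Pod restarts, or to several Pods touching the same data directory. Always use a StatefulSet or an Operator CRD.

2) Using emptyDir instead of a PVC

An emptyDir disappears together with its Pod. Even in test environments, make a habit of using PVCs for DB data.

3) Running without backups

Even if the Operator provides HA, backups must be configured separately. HA protects against infrastructure failures; backups protect against logical errors (such as a mistaken DELETE statement).

4) Not setting resource limits

A DB Pod without limits can eat into other Pods' resources, or be killed at an unpredictable moment when the node runs out of memory. Set requests equal to limits to obtain Guaranteed QoS.

5) Using reclaimPolicy: Delete

With a StorageClass whose reclaimPolicy is Delete (the default), deleting a PVC also deletes the PV, that is, the actual data. Always set Retain on StorageClasses used for databases.

6) Placing every DB Pod in a single AZ

Without Pod Anti-Affinity, all DB Pods may land on the same node or AZ. If that node or AZ fails, the whole DB goes down.

# Correct Anti-Affinity configuration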
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: cnpg.io/cluster
              operator: In
              values:
                - prod-pg
        topologyKey: topology.kubernetes.io/zone

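7) Running without monitoring or alerts

Replication lag can grow or a disk can fill up without anyone noticing. At a minimum, configure the following alerts.

Disk usage above 80%
Replication lag of 10 seconds or more
Rising Pod restart count
Backup failure detection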
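
The four alert conditions above can be written down as a small rule check. This is a minimal sketch with hypothetical field names; in a real cluster the same thresholds would normally live in your alerting system rather than in application code.

```python
def firing_alerts(disk_used_pct: float,
                  replication_lag_s: float,
                  restarts_last_hour: int,
                  last_backup_ok: bool) -> list[str]:
    """Return the names of the alert rules that the given sample violates."""
    alerts = []
    if disk_used_pct > 80:
        alerts.append("disk-usage-above-80-percent")
    if replication_lag_s >= 10:
        alerts.append("replication-lag-10s-or-more")
    if restarts_last_hour > 0:
        alerts.append("pod-restarts-increasing")
    if not last_backup_ok:
        alerts.append("backup-failed")
    return alerts


print(firing_alerts(disk_used_pct=85.0, replication_lag_s=2.5,
                    restarts_last_hour=0, last_backup_ok=False))
```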

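8) Upgrading a DB without validating the rolling update

Always test major version upgrades of PostgreSQL, MySQL, and similar databases on a separate cluster first. Even when the Operator supports automatic upgrades, application compatibility testing is still a human job.

9) Managing Secrets in plaintext

Do not hardcode DB passwords in YAML. Use External Secrets Operator or Sealed Secrets to manage Secrets safely.

# External Secrets example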
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: pg-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: SecretStore
  target:
    name: pg-superuser
  data:
    - secretKey: username
      remoteRef:
        key: prod/database/credentials
        property: username
    - secretKey: password
      remoteRef:
        key: prod/database/credentials
        property: password

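Conclusion

Running databases on K8s is no longer an experimental choice. Mature Operators such as CNPG, Percona, and Vitess automate complex operational work and fit naturally into K8s's declarative management model.

Key takeaways:

If you run PostgreSQL on K8s, CloudNativePG is the best choice
If you manage MySQL/MongoDB/PostgreSQL together, use Percona Operator
If you need large-scale MySQL sharding, use Vitess
Helm charts are still convenient for dev/test environments
Whichever tool you choose, take care of storage, backups, and monitoring

If you adopt an Operator, test it thoroughly in a development environment first, and simulate failure scenarios (Pod deletion, node down, AZ failure) before rolling it out to production.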

Running Databases on K8s -- Complete Guide to CNPG, Percona, Vitess, and Helm Charts

Introduction

Running databases on Kubernetes was a controversial topic just a few years ago. Many asked, "Why bother running a DB on K8s, which is optimized for stateless workloads?" However, as of 2026, with StatefulSets and the Operator ecosystem having matured sufficiently, running databases on K8s has become a standard choice.

This article compares major database Operators and Helm charts, and explains which tools to use in which situations in practice.

1. Why Run Databases on K8s

Advantages

Consistent deployment pipeline: Manage applications and databases with the same GitOps workflow
Resource efficiency: Applications and databases share node resources, maximizing utilization through automatic scheduling
Auto-recovery: Automatic Pod restart on failure, automatic rescheduling on node failure
Environment consistency: Declaratively deploy identical DB configurations across dev/staging/production
Cost reduction: Potential license and infrastructure cost savings compared to managed DB services (RDS, Cloud SQL)

Disadvantages

Operational complexity: Many areas to manage directly, including storage, networking, and backups
Performance overhead: Additional latency from container networking and storage layers
Expertise required: Deep understanding of both K8s and databases needed
Data loss risk: Potential data loss from incorrect PV/PVC settings or upgrade mistakes

StatefulSet Basics

StatefulSet is a K8s controller for workloads that need to maintain state. Unlike regular Deployments, it guarantees:

Stable network IDs: Each Pod maintains a unique hostname (e.g., postgres-0, postgres-1)
Ordered deployment/scaling: Pods are created from 0 in order and terminated in reverse
Persistent storage: Each Pod gets a unique PVC bound through volumeClaimTemplates
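
To make the "stable network ID" guarantee concrete, the sketch below derives the per-Pod DNS names a StatefulSet gets from its governing headless Service. The function name is illustrative, and the `default` namespace and `cluster.local` cluster domain are assumptions; adjust them for your cluster.

```python
def statefulset_pod_dns(name: str, service: str, replicas: int,
                        namespace: str = "default",
                        cluster_domain: str = "cluster.local") -> list[str]:
    """Return the stable DNS name of each Pod: <name>-<ordinal>.<service>.<ns>.svc.<domain>."""
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


for fqdn in statefulset_pod_dns("postgres", "postgres", replicas=3):
    print(fqdn)
```

These names only resolve when the Service named in serviceName is headless (clusterIP: None); they stay the same across Pod restarts, which is what lets replicas find a specific peer.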

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
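      containers:
        - name: postgres
          image: postgres:16
          env:
            # The official image refuses to initialize without a superuser password.
            # "postgres-credentials" is a placeholder Secret name; create it beforehand.
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            # Use a subdirectory so initdb does not trip over lost+found on a fresh volume.
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data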
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ['ReadWriteOnce']
        storageClassName: gp3
        resources:
          requests:
            storage: 50Gi

However, StatefulSet alone makes it difficult to implement HA, automatic failover, backup/recovery, and monitoring. This is where the Operator pattern comes in.

2. CloudNativePG (CNPG) -- PostgreSQL-Specific Operator

Overview

CloudNativePG (CNPG) is a PostgreSQL-specific Kubernetes Operator that was initially developed by EDB (EnterpriseDB) and is now managed as a CNCF Sandbox project. As of April 2026, the latest version is 1.29, featuring innovative PostgreSQL extension management through Image Catalog and artifacts ecosystems.

Key Features

Native K8s design: Relies on the Kubernetes API server for cluster state and failover coordination instead of an external HA tool such as Patroni
Declarative DB management: Manages PostgreSQL database lifecycle via Database CRD
Logical replication: Supports online migration and major version upgrades via Publication / Subscription CRDs
CNPG-I plugin framework: Extensible through external plugins
PITR support: Point-In-Time Recovery based on WAL archiving
Parallel reconciler: Improves cluster management efficiency through parallel processing

Installation

Helm-based installation is the easiest approach.

# Add Helm repository
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update

# Install Operator
helm upgrade --install cnpg \
  --namespace cnpg-system \
  --create-namespace \
  cnpg/cloudnative-pg

Or you can install directly using manifests.

kubectl apply --server-side -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.29/releases/cnpg-1.29.0.yaml

HA Architecture

CNPG uses a Primary 1 + Standby N architecture. The Primary handles writes, and Standbys synchronize data through streaming replication.

Automatic promotion of a Standby when the Primary fails
Supports both Switchover (planned transition) and Failover (failure transition)
Data durability guarantee option through synchronous replication (dataDurability)
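
As a rough intuition for automatic promotion, a failover generally wants the healthy Standby that has received the most WAL, since it loses the least data. The sketch below illustrates that idea only; it is not CNPG's actual election logic, and the `Standby` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Standby:
    name: str
    received_bytes: int  # WAL position received from the primary, in bytes
    healthy: bool


def pick_promotion_candidate(standbys: list[Standby]) -> Optional[Standby]:
    """Pick the healthy standby that has received the most WAL, if any."""
    healthy = [s for s in standbys if s.healthy]
    if not healthy:
        return None
    return max(healthy, key=lambda s: s.received_bytes)


candidates = [
    Standby("prod-pg-2", received_bytes=100, healthy=True),
    Standby("prod-pg-3", received_bytes=250, healthy=False),
    Standby("prod-pg-4", received_bytes=180, healthy=True),
]
print(pick_promotion_candidate(candidates))
```

Note that the most advanced replica (prod-pg-3) is skipped because it is unhealthy, which is exactly the case where asynchronous replication can lose recent transactions and synchronous replication helps.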

Backup and Recovery

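CNPG provides continuous backup based on Barman Cloud. The in-tree barmanObjectStore configuration shown below has been deprecated since CNPG 1.26 in favor of the Barman Cloud Plugin (CNPG-I), so check the documentation for your operator version before relying on it.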

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: prod-pg
spec:
  instances: 3
  storage:
    size: 100Gi
    storageClass: gp3
  backup:
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket/prod-pg/
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
    retentionPolicy: '30d'

You can schedule regular backups with ScheduledBackup.

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: prod-pg-daily
spec:
  schedule: '0 0 2 * * *' # six fields (seconds first): every day at 02:00:00
  cluster:
    name: prod-pg
  backupOwnerReference: self
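
CloudNativePG's ScheduledBackup reads its schedule as a six-field cron expression with a leading seconds field, which differs from the five-field Unix crontab format used by Kubernetes CronJobs. The helper below is a small illustrative sketch (the function name is made up) for converting a familiar five-field expression.

```python
def to_six_field_cron(expr: str, second: int = 0) -> str:
    """Prepend a seconds field to a five-field cron expression.

    Six-field expressions are returned unchanged (whitespace-normalized).
    """
    if not 0 <= second <= 59:
        raise ValueError("second must be between 0 and 59")
    fields = expr.split()
    if len(fields) == 6:
        return " ".join(fields)
    if len(fields) != 5:
        raise ValueError(f"expected 5 or 6 cron fields, got {len(fields)}")
    return " ".join([str(second), *fields])


print(to_six_field_cron("0 2 * * *"))  # 0 0 2 * * *
```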

3. Percona Operator -- Multi-DB Support

Overview

Percona provides a dedicated Kubernetes Operator for each of three databases: MySQL, MongoDB, and PostgreSQL. They are fully open source under the Apache 2.0 license, so the enterprise-grade features are available for free.

Features by Supported DB

Percona Operator for MySQL (PXC)

Multi-Primary architecture based on Percona XtraDB Cluster
Galera synchronous replication enables read/write on all nodes
Automatic routing through ProxySQL or HAProxy
Group Replication option added in 2026 GA release

Percona Operator for MongoDB (PSMDB)

ReplicaSet and Sharded Cluster support
PVC snapshot-based backup support (added in 2025)
Official MongoDB 8.0 support
Cloud storage authentication via IAM Role for Service Account

Percona Operator for PostgreSQL (PPG)

Patroni-based HA configuration
Native pg_tde (Transparent Data Encryption) support (2026)
Zero-downtime major version upgrade roadmap in progress

Integrated Monitoring: PMM

Percona Monitoring and Management (PMM) is a tool that monitors MySQL, MongoDB, and PostgreSQL all from a single dashboard.

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: prod-mysql
spec:
  crVersion: '1.15.0'
  pxc:
    size: 3
    image: percona/percona-xtradb-cluster:8.0
    resources:
      requests:
        memory: 2Gi
        cpu: '1'
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
  haproxy:
    enabled: true
    size: 2
  pmm:
    enabled: true
    serverHost: monitoring-service
  backup:
    schedule:
      - name: daily-backup
        schedule: '0 3 * * *'
        keep: 7
        storageName: s3-backup
    storages:
      s3-backup:
        type: s3
        s3:
          bucket: my-backup-bucket
          credentialsSecret: aws-creds
          region: ap-northeast-2

Multi-Cluster Support

Percona Operator supports cross-region replication to synchronize data across multiple K8s clusters. This enables disaster recovery (DR) configurations.

4. Vitess -- MySQL Horizontal Sharding

Overview

Vitess was developed at YouTube to handle large-scale MySQL workloads and is now a CNCF Graduated project. It provides a MySQL-compatible interface while supporting transparent sharding, connection pooling, and online resharding.

PlanetScale is a representative DBaaS service that commercialized Vitess.

Architecture Components

Component	Role
VTGate	Query router, the endpoint applications connect to
VTTablet	Proxy wrapping each MySQL instance
Topology Service	Cluster metadata storage (etcd, etc.)
VTOrc	Orchestrator, handles automatic failover
VTAdmin	Web-based management UI

When to Choose Vitess

Massive write traffic that a single MySQL instance cannot handle
Sharding of large tables with billions of rows or more
Environments with frequent online schema changes (Online DDL)
Need for horizontal scaling while maintaining MySQL compatibility

Vitess on K8s Installation

Use the Vitess Operator provided by PlanetScale.

# Install Vitess Operator
kubectl apply -f https://github.com/planetscale/vitess-operator/releases/latest/download/operator.yaml

Manage Keyspaces (logical databases) and Shards declaratively.

apiVersion: planetscale.com/v2
kind: VitessCluster
metadata:
  name: prod-vitess
spec:
  images:
    vtgate: vitess/lite:v19
    vttablet: vitess/lite:v19
    vtbackup: vitess/lite:v19
    vtctld: vitess/lite:v19
    vtorc: vitess/lite:v19
  cells:
    - name: zone1
      gateway:
        replicas: 2
        resources:
          requests:
            cpu: '1'
            memory: 1Gi
  keyspaces:
    - name: commerce
      turndownPolicy: Immediate
      partitionings:
        - equal:
            parts: 2
            shardTemplate:
              databaseInitScriptSecret:
                name: commerce-schema
                key: init.sql
              tabletPools:
                - cell: zone1
                  type: replica
                  replicas: 3
                  dataVolumeClaimTemplate:
                    storageClassName: gp3
                    resources:
                      requests:
                        storage: 50Gi
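
The `equal` partitioning with `parts: 2` in the manifest above splits the keyspace ID range into two shards, conventionally named by hex key range (`-80` and `80-`). The sketch below shows how such range names arise and how a row could be routed to one. It is a simplified illustration: SHA-256 stands in for the hashing step and is not the function Vitess vindexes use, and both function names are hypothetical.

```python
import hashlib


def equal_shard_names(parts: int) -> list[str]:
    """Name equal key-range shards by their hex bounds, e.g. 2 parts -> ['-80', '80-']."""
    if parts < 2 or 256 % parts != 0:
        raise ValueError("this sketch supports 2, 4, 8, ... up to 256 parts")
    step = 256 // parts
    bounds = [f"{i * step:02x}" for i in range(1, parts)]
    names = [f"-{bounds[0]}"]
    names += [f"{lo}-{hi}" for lo, hi in zip(bounds, bounds[1:])]
    names.append(f"{bounds[-1]}-")
    return names


def route(sharding_key: str, parts: int) -> str:
    """Map a key to a shard using the first byte of a stand-in hash."""
    first_byte = hashlib.sha256(sharding_key.encode()).digest()[0]
    return equal_shard_names(parts)[first_byte // (256 // parts)]


print(equal_shard_names(2))
print(equal_shard_names(4))
print(route("customer-42", parts=2))
```

Because routing depends only on the key's hash, choosing the sharding column is the critical design decision: queries that filter on it hit one shard, while the rest fan out to all of them.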

Caveats

Vitess is very powerful but has a steep learning curve. It can be overkill for simple CRUD applications, and the sharding strategy (VSchema) needs careful design review.

5. Major Helm Chart Comparison

You can also deploy databases on K8s using just Helm charts without Operators. Bitnami was the most widely used Helm chart provider, but important changes have occurred since 2025.

Bitnami License Changes (2025)

Since September 2025, most Bitnami Helm chart OCI packages have moved behind a Broadcom paid subscription. Public docker.io/bitnami images have been moved to the "Bitnami Legacy" repository and no longer receive updates, fixes, or security patches.

As an alternative, Chainguard provides over 40 security-hardened Helm charts forked from Bitnami, and community-based alternatives are also growing.

Helm Chart Comparison Table

Chart	DB	Default Config	HA Support	Built-in Backup	Notes
bitnami/postgresql	PostgreSQL	Primary + Read Replica	Streaming replication, no automatic failover	X	Legacy warning
bitnami/postgresql-ha	PostgreSQL	Primary + Standby	repmgr + Pgpool-II	X	HA-dedicated chart
bitnami/mysql	MySQL	Primary + Secondary	Semi-sync replication	X	InnoDB Cluster option
bitnami/redis	Redis	Master + Replica	Sentinel-based	X	Separate Cluster mode
bitnami/mongodb	MongoDB	ReplicaSet	Built-in	X	Separate Sharded chart
bitnami/mariadb	MariaDB	Primary + Secondary	Galera option	X	MySQL compatible

When Helm Charts Are Appropriate

Dev/test environments: When you want to spin up a DB quickly
Simple configurations: Small-scale services where HA is not mandatory
Learning purposes: Building foundational K8s DB operations knowledge
Many custom settings: When fine-tuning through values.yaml is needed

Example: PostgreSQL HA Helm Chart

helm install prod-pg bitnami/postgresql-ha \
  --set postgresql.replicaCount=3 \
  --set postgresql.resources.requests.memory=2Gi \
  --set postgresql.resources.requests.cpu=1 \
  --set persistence.size=100Gi \
  --set persistence.storageClass=gp3 \
  --set pgpool.replicaCount=2 \
  --set metrics.enabled=true

6. CNPG vs Percona vs Vitess vs Helm Charts -- Comprehensive Comparison

Feature Comparison Table

Item	CNPG	Percona	Vitess	Helm Charts
Supported DBs	PostgreSQL	MySQL, MongoDB, PG	MySQL (sharding)	Various
License	Apache 2.0	Apache 2.0	Apache 2.0	Varies by chart
CNCF Status	Sandbox	-	Graduated	-
HA Auto-Failover	O	O	O	Depends on chart
Auto Backup	O (Barman)	O (multi-storage)	O	X (separate config)
PITR	O	O	Partial	X
Horizontal Sharding	X	MongoDB only	O (core feature)	X
Monitoring Integration	Prometheus	PMM + Prometheus	VTAdmin	Per-chart metrics
Connection Pooling	PgBouncer built-in	ProxySQL/HAProxy	VTGate built-in	Separate config
Operational Difficulty	Medium	Medium	High	Low
Production Readiness	High	High	High (large-scale)	Medium
CRD-based Management	O	O	O	X

Selection Guide

PostgreSQL on K8s: CNPG is the top choice. K8s-native design with an active community
MySQL/MongoDB operations: Percona Operator. Integrated monitoring (PMM) is a strength
Large-scale MySQL sharding: Vitess. For processing billions of rows of data
Dev/test environments: Helm charts. Quick deployment and simple configuration
Multi-DB unified management: Percona. MySQL + MongoDB + PostgreSQL under one operational model

7. Operational Considerations

Storage (PV/PVC)

Storage is the most important factor in K8s DB operations.

# Recommended StorageClass example (AWS EBS gp3)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-db
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: '5000'
  throughput: '250'
  encrypted: 'true'
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain

Key principles:

volumeBindingMode: WaitForFirstConsumer -- Creates volume in the AZ where the Pod is scheduled
reclaimPolicy: Retain -- Preserves data even when PVC is deleted
allowVolumeExpansion: true -- Allows online volume expansion
Separate WAL and Data volumes -- Separate sequential writes (WAL) and random access (Data) for performance

Performance Tuning

# Guaranteed QoS example: requests == limits with integer CPUs
# (a prerequisite for CPU pinning under the kubelet's static CPU Manager policy)
spec:
  containers:
    - name: postgres
      resources:
        requests:
          cpu: '4'
          memory: 8Gi
        limits:
          cpu: '4'
          memory: 8Gi
      # Ensure Guaranteed QoS class
      # Set requests == limits

Additional tips:

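Guaranteed QoS: Set requests equal to limits so the Pod is last in line for eviction under node pressure; note that a CPU limit can itself cause CFS throttling when the database bursts above it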
Topology-aware scheduling: Use nodeAffinity to place DB Pods on high-performance nodes
Anti-affinity: Ensure DB Pods are distributed across different nodes

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - postgres
          topologyKey: kubernetes.io/hostname

Resource Limits

Always set resources.requests and limits for DB Pods
Exceeding memory limits triggers OOMKill -- the DB process is forcefully terminated
Set PostgreSQL shared_buffers to around 25% of container memory
Set MySQL innodb_buffer_pool_size to 50-70% of container memory
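
As a rough illustration of the MySQL rule of thumb, the ConfigMap below sizes the buffer pool for a container with an 8Gi memory limit. It is a sketch under assumptions: the ConfigMap name and file name are hypothetical, and it presumes the MySQL image loads extra files from a conf.d-style directory where this ConfigMap is mounted. Operators usually accept the same setting through their own custom resource instead.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-tuning            # hypothetical name
data:
  tuning.cnf: |
    [mysqld]
    # Container memory limit: 8Gi
    # 50-70% of 8Gi is 4Gi to 5.6Gi, so 5G sits inside the range
    innodb_buffer_pool_size = 5G
```

The remaining memory is left for per-connection buffers, temporary tables, and the OS page cache, which is why the buffer pool should not be set close to the container limit.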

Monitoring

# Enable PodMonitor in CNPG
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: prod-pg
spec:
  instances: 3
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: pg-custom-queries
        key: queries
  storage:
    size: 100Gi

Essential monitoring metrics:

Replication lag: How far a standby is behind the primary
Connection count: Usage rate against maximum connections
Transaction throughput (TPS): Transactions per second
Disk usage: Set alerts before PV capacity is exhausted
WAL archiving status: Verify backup system is functioning normally
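
The Cluster above references a ConfigMap named pg-custom-queries with a key called queries. A minimal sketch of what it could contain is shown below, here tracking connection usage against max_connections. The query name pg_connection_usage and the metric column ratio are illustrative choices, and the ConfigMap must live in the same namespace as the Cluster.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-custom-queries
  labels:
    cnpg.io/reload: ""          # lets the operator pick up changes
data:
  queries: |
    pg_connection_usage:
      query: "SELECT count(*)::float / current_setting('max_connections')::float AS ratio FROM pg_stat_activity WHERE backend_type = 'client backend'"
      metrics:
        - ratio:
            usage: "GAUGE"
            description: "Client connections as a fraction of max_connections"
```

Each top-level key is one query, and every entry under metrics maps a result column to a Prometheus metric type. Replication lag and WAL archiving status are typically covered by the operator's default metrics, so custom queries are most useful for application-specific checks.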

Backup Strategy

Apply the 3-2-1 backup rule in K8s environments.

3 copies: Primary data + WAL archive + physical backup
2 different media: Local PV + Object Storage (S3/GCS)
1 offsite: Replicate to a bucket in another region

# CNPG recovery cluster example: point-in-time recovery from the prod-pg object store
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: recovery-cluster
spec:
  instances: 2
  storage:
    size: 100Gi
  bootstrap:
    recovery:
      source: prod-pg
      recoveryTarget:
        targetTime: '2026-04-10T08:00:00Z'
  externalClusters:
    - name: prod-pg
      barmanObjectStore:
        destinationPath: s3://my-backup-bucket/prod-pg/
        s3Credentials:
          accessKeyId:
            name: aws-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: aws-creds
            key: SECRET_ACCESS_KEY

8. Practical Example: Building a PostgreSQL HA Cluster with CNPG

A CNPG cluster definition intended as a production starting point. Adjust sizes, PostgreSQL parameters, and bucket names for your environment.

Step 1: Create Namespace and Secrets

kubectl create namespace database

kubectl create secret generic pg-superuser \
  --namespace database \
  --from-literal=username=postgres \
  --from-literal=password=CHANGE_ME_TO_STRONG_PASSWORD

kubectl create secret generic aws-creds \
  --namespace database \
  --from-literal=ACCESS_KEY_ID=your-access-key \
  --from-literal=SECRET_ACCESS_KEY=your-secret-key

Step 2: Define the PostgreSQL Cluster

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: prod-pg
  namespace: database
spec:
  description: 'Production PostgreSQL HA Cluster'
  imageName: ghcr.io/cloudnative-pg/postgresql:16.4
  instances: 3
  startDelay: 30
  stopDelay: 30
  primaryUpdateStrategy: unsupervised
  postgresql:
    parameters:
      shared_buffers: '2GB'
      effective_cache_size: '6GB'
      work_mem: '64MB'
      maintenance_work_mem: '512MB'
      max_connections: '200'
      max_wal_size: '2GB'
      min_wal_size: '1GB'
      wal_buffers: '64MB'
      random_page_cost: '1.1'
      effective_io_concurrency: '200'
      log_statement: 'ddl'
      log_min_duration_statement: '1000'
    pg_hba:
      - host all all 10.0.0.0/8 scram-sha-256
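  # pg-superuser holds the postgres superuser credentials, so it is wired in as
  # superuserSecret. initdb.secret is meant for the application owner (appuser);
  # leaving it out lets the operator generate the prod-pg-app secret.
  enableSuperuserAccess: true
  superuserSecret:
    name: pg-superuser
  bootstrap:
    initdb:
      database: appdb
      owner: appuser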
  storage:
    size: 100Gi
    storageClass: gp3-db
  walStorage:
    size: 30Gi
    storageClass: gp3-db
  resources:
    requests:
      memory: 8Gi
      cpu: '4'
    limits:
      memory: 8Gi
      cpu: '4'
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname
  monitoring:
    enablePodMonitor: true
  backup:
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket/prod-pg/
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
        maxParallel: 4
      data:
        compression: gzip
        jobs: 4
    retentionPolicy: '30d'
  nodeMaintenanceWindow:
    inProgress: false
    reusePVC: true

Step 3: Scheduled Backup

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: prod-pg-daily
  namespace: database
spec:
  # CNPG uses a six-field cron expression with seconds first: daily at 02:00:00
  schedule: '0 0 2 * * *'
  backupOwnerReference: self
  cluster:
    name: prod-pg
  target: prefer-standby

Step 4: Pooler (PgBouncer) Configuration

apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: prod-pg-pooler-rw
  namespace: database
spec:
  cluster:
    name: prod-pg
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      max_client_conn: '1000'
      default_pool_size: '50'

Step 5: Deploy and Verify

kubectl apply -f cluster.yaml
kubectl apply -f scheduled-backup.yaml
kubectl apply -f pooler.yaml

# Check cluster status
kubectl get cluster -n database

# Check Pod status
kubectl get pods -n database

# Cluster detail info
kubectl describe cluster prod-pg -n database

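# Open a psql session on prod-pg-1 (the initial primary; after a failover,
# check the PRIMARY column of `kubectl get cluster` for the current one)
kubectl exec -it prod-pg-1 -n database -- psql -U postgres -d appdb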

9. Anti-Patterns -- Mistakes to Avoid in K8s DB Operations

1) Deploying DB with Deployment

Deployment is for stateless workloads. Using Deployment for a DB can cause data loss on Pod restart, or multiple Pods accessing the same data directory. Always use StatefulSet or Operator CRDs.

2) Using emptyDir without PVC

emptyDir data is deleted when the Pod is removed. Build the habit of using PVC for DB data even in test environments.

3) Operating without Backups

Even if the Operator provides HA, backups must be configured separately. HA protects against infrastructure failures, while backups protect against logical errors (like a wrong DELETE statement).
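
One way to guard against logical errors is to take an on-demand backup right before a risky change, in addition to the scheduled ones. The sketch below assumes the prod-pg cluster from section 8 with its object store backup already configured; the resource name is hypothetical.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: prod-pg-before-migration   # hypothetical name
  namespace: database
spec:
  cluster:
    name: prod-pg
```

If the change goes wrong, a new cluster can be bootstrapped with bootstrap.recovery and a recoveryTarget.targetTime set just before the mistake, as in the recovery cluster example in section 7.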

4) Not Setting Resource Limits

Without limits on DB Pods, they can starve other Pods' resources, or get terminated at unpredictable times during OOM. Set requests equal to limits to secure Guaranteed QoS.
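
As a safety net, a namespace-level LimitRange can supply defaults for any container that is deployed without explicit resources. This is a minimal sketch, assuming the database namespace from section 8; the object name and the default values are placeholders, and explicit per-Pod settings should still be the norm.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: db-default-resources    # hypothetical name
  namespace: database
spec:
  limits:
    - type: Container
      # Applied only when a container omits its own values
      defaultRequest:
        cpu: "2"
        memory: 4Gi
      default:
        cpu: "2"
        memory: 4Gi
```

Because defaultRequest and default are identical, containers that fall back to these values still end up in the Guaranteed QoS class.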

5) Using reclaimPolicy: Delete

The default reclaimPolicy for dynamically provisioned volumes is Delete, which removes the PV (the actual data) as soon as the PVC is deleted. Always set the StorageClass used for databases to Retain.

6) Placing All DB Pods in a Single AZ

Without Pod Anti-Affinity, all DB Pods may land on the same node or AZ. A failure in that node/AZ would take down the entire DB.

# Zone-level Anti-Affinity configuration
# (a required rule needs at least as many zones as DB Pods, or Pods stay Pending)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: cnpg.io/cluster
              operator: In
              values:
                - prod-pg
        topologyKey: topology.kubernetes.io/zone
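
With CNPG, the same intent is expressed on the Cluster resource rather than on a raw Pod spec. The fragment below is a sketch that reuses the prod-pg names from section 8 and assumes the nodes are spread over at least three zones.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: prod-pg
  namespace: database
spec:
  instances: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone
    # "required" refuses to co-locate instances in one zone;
    # "preferred" only tries to avoid it
    podAntiAffinityType: required
  storage:
    size: 100Gi
```

Choosing required trades schedulability for safety: if a zone is unavailable, a replacement instance waits instead of landing next to an existing one.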

7) Operating without Monitoring/Alerts

Without monitoring, replication lag can grow or a disk can fill up without anyone noticing. Configure at least the following alerts:

Disk usage exceeding 80%
Replication lag over 10 seconds
Increasing Pod restart count
Backup failure detection
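
These four alerts could be expressed as a Prometheus Operator rule along the following lines. Treat it as a sketch under assumptions: it presumes the Prometheus Operator, kubelet volume metrics, and kube-state-metrics are available, that CNPG's default metrics are scraped, and that everything runs in the database namespace. The alert names and thresholds are illustrative, and a stale last-backup timestamp is used as a proxy for backup failure.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prod-pg-alerts          # hypothetical name
  namespace: database
spec:
  groups:
    - name: database.rules
      rules:
        - alert: DatabaseDiskUsageHigh
          expr: |
            kubelet_volume_stats_used_bytes{namespace="database"}
              / kubelet_volume_stats_capacity_bytes{namespace="database"} > 0.80
          for: 10m
          labels:
            severity: warning
        - alert: DatabaseReplicationLagHigh
          # Lag in seconds, from CNPG's default metrics
          expr: cnpg_pg_replication_lag{namespace="database"} > 10
          for: 5m
          labels:
            severity: warning
        - alert: DatabasePodRestarting
          expr: increase(kube_pod_container_status_restarts_total{namespace="database"}[1h]) > 2
          labels:
            severity: warning
        - alert: DatabaseBackupStale
          # No successful backup for 36 hours (129600 seconds)
          expr: time() - cnpg_collector_last_available_backup_timestamp{namespace="database"} > 129600
          for: 15m
          labels:
            severity: critical
```

Metric names differ between exporters and operator versions, so verify each expression against what your Prometheus actually scrapes before relying on it.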

8) Not Testing Rolling Updates During DB Upgrades

Major version upgrades for PostgreSQL, MySQL, etc., should always be tested on a separate cluster first. Even if the Operator supports automatic upgrades, application compatibility testing must be done by humans.

9) Managing Secrets in Plaintext

Don't hardcode DB passwords in YAML. Use External Secrets Operator or Sealed Secrets to manage Secrets securely.

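# External Secrets example
# (check which apiVersion your installed External Secrets Operator serves;
# newer releases use external-secrets.io/v1)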
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: pg-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: SecretStore
  target:
    name: pg-superuser
  data:
    - secretKey: username
      remoteRef:
        key: prod/database/credentials
        property: username
    - secretKey: password
      remoteRef:
        key: prod/database/credentials
        property: password

Conclusion

Running databases on K8s is no longer an experimental choice. Mature Operators like CNPG, Percona, and Vitess automate complex operational tasks and integrate naturally with K8s's declarative management model.

Key Summary:

For running PostgreSQL on K8s, CloudNativePG is the best choice
For unified MySQL/MongoDB/PostgreSQL management, Percona Operator
For large-scale MySQL sharding, Vitess
For dev/test environments, Helm charts remain convenient
Regardless of the tool you choose, storage, backup, and monitoring are essential

If adopting an Operator, test thoroughly in a development environment first, simulate failure scenarios (Pod deletion, node down, AZ failure), and then apply to production.