ArgoCD High Availability and Scalability: Operating Large-Scale Clusters

1. Challenges of Operating ArgoCD at Scale
- Main Bottlenecks
2. HA Architecture
- Basic HA Configuration
- API Server HA
3. Application Controller Sharding
4. Repository Server Scaling
5. Redis HA
6. Performance Tuning
7. Monitoring
8. Disaster Recovery
9. Large-Scale Operations Checklist
10. Summary

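1. Challenges of Operating ArgoCD at Scale

In large environments that manage hundreds or thousands of Applications, ArgoCD's default configuration can reach its limits.

Main Bottlenecks

Component	Bottleneck	Symptoms
Application Controller	A single instance processes every Application	Delayed reconciliation, high memory usage
Repository Server	Git clone and manifest generation load	Slow syncs, high CPU usage
Redis	Growing cache data	Out-of-memory conditions, connection latency
API Server	Many concurrent users	Slow UI responses, API timeouts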

2. HA Architecture

Basic HA Configuration

                   Load Balancer
                        |
            +-----------+-----------+
            |           |           |
        API Server  API Server  API Server
        (replica 1) (replica 2) (replica 3)
            |           |           |
            +-----------+-----------+
                        |
                  Redis (Sentinel HA)
                        |
            +-----------+-----------+
            |           |           |
        Repo Server Repo Server Repo Server
        (replica 1) (replica 2) (replica 3)
            |
    App Controller (sharded)
    Shard 0 | Shard 1 | Shard 2

API Server HA

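The API Server is stateless, so it scales horizontally by simply increasing the replica count. The anti-affinity rule below asks the scheduler to spread the replicas across different nodes: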

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: argocd-server
                topologyKey: kubernetes.io/hostname

3. Application Controller Sharding

Sharding Concept

By default, a single Application Controller instance processes every Application. In large environments, apply sharding so that multiple controller instances divide the workload.

Sharding Configuration

# argocd-cmd-params-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.sharding.algorithm: round-robin # round-robin or legacy

StatefulSet-Based Sharding

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3 # 3 shards
  template:
    spec:
      containers:
        - name: application-controller
          env:
            # Must match spec.replicas so each pod knows the total shard count
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: '3'

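Sharding Algorithms

In both modes the unit of assignment is the cluster: every Application that targets a given cluster is processed by the shard that owns that cluster.

Round-Robin (recommended):

Clusters are assigned to shards in turn, which spreads them evenly
  Cluster 1 -> Shard 0
  Cluster 2 -> Shard 1
  Cluster 3 -> Shard 2
  Cluster 4 -> Shard 0

Legacy:

Each cluster is mapped to a shard by hashing its ID, so the distribution can be uneven (illustrative values)
  Cluster 1 -> hash(Cluster 1) % 3 = Shard 0
  Cluster 2 -> hash(Cluster 2) % 3 = Shard 0
  Cluster 3 -> hash(Cluster 3) % 3 = Shard 2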
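
The difference between the two placement styles can be illustrated with a small simulation. This is a minimal sketch, assuming that placement happens per cluster; the function names are hypothetical and zlib.crc32 is only a deterministic stand-in, not the hash ArgoCD uses internally.

```python
import zlib
from collections import Counter


def hash_mod_shard(cluster_id: str, replicas: int) -> int:
    """Hash-based placement: hash the cluster ID, then take it modulo the shard count."""
    # crc32 is a stand-in chosen for determinism (assumption, not ArgoCD's hash).
    return zlib.crc32(cluster_id.encode("utf-8")) % replicas


def round_robin_shards(cluster_ids: list[str], replicas: int) -> dict[str, int]:
    """Round-robin placement: sort the clusters, then deal them out in order."""
    return {cid: i % replicas for i, cid in enumerate(sorted(cluster_ids))}


def shard_sizes(assignment: dict[str, int], replicas: int) -> list[int]:
    """Count how many clusters each shard owns."""
    counts = Counter(assignment.values())
    return [counts.get(shard, 0) for shard in range(replicas)]


clusters = [f"cluster-{n:02d}" for n in range(10)]
hashed = {cid: hash_mod_shard(cid, 3) for cid in clusters}
dealt = round_robin_shards(clusters, 3)

print("hash-mod sizes:   ", shard_sizes(hashed, 3))
print("round-robin sizes:", shard_sizes(dealt, 3))
```

Round-robin placement never lets two shards differ by more than one cluster, while hash-based placement gives no such guarantee. Neither style accounts for how many Applications each cluster carries, so one very large cluster can still dominate a shard.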

Shard Ownership

In the standard StatefulSet deployment, each controller pod owns exactly one shard, inferred from the ordinal in its pod name:

Shard 0: argocd-application-controller-0
Shard 1: argocd-application-controller-1
Shard 2: argocd-application-controller-2

There are no standby pods within a shard. If a controller pod fails, its shard is not reconciled until the StatefulSet recreates the pod.

4. Repository Server Scaling

Horizontal Scaling

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: repo-server
          resources:
            requests:
              cpu: '1'
              memory: '1Gi'
            limits:
              cpu: '2'
              memory: '2Gi'

Cache Optimization

# Configure caching through environment variables
env:
  - name: ARGOCD_REPO_CACHE_EXPIRATION
    value: '24h' # Cache expiration (default: 24 hours)
  - name: ARGOCD_EXEC_TIMEOUT
    value: '180s' # Timeout for manifest generation commands

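Git Clone Optimization

env:
  # Shallow clone depth
  # Confirm that your ArgoCD version supports this variable before relying on it
  - name: ARGOCD_GIT_SHALLOW_CLONE_DEPTH
    value: '1' # Minimum depth for faster clones

  # Git request timeout
  - name: ARGOCD_GIT_REQUEST_TIMEOUT
    value: '60s'

  # Maximum number of concurrent manifest generations
  - name: ARGOCD_REPO_SERVER_PARALLELISM_LIMIT
    value: '10'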
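
To show what a parallelism limit does, here is a minimal sketch of semaphore-bounded concurrency. It does not reproduce the Repo Server's implementation: generate_manifests, render, and the 10 ms sleep that stands in for manifest rendering are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def generate_manifests(requests: list[str], parallelism_limit: int) -> tuple[list[str], int]:
    """Render every request while allowing at most `parallelism_limit` at once.

    Returns the rendered results and the peak number of concurrent renders.
    """
    gate = threading.Semaphore(parallelism_limit)
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def render(request: str) -> str:
        nonlocal in_flight, peak
        with gate:  # blocks while `parallelism_limit` renders are already running
            with lock:
                in_flight += 1
                peak = max(peak, in_flight)
            time.sleep(0.01)  # stand-in for the actual rendering work
            with lock:
                in_flight -= 1
            return f"rendered:{request}"

    # The pool is deliberately larger than the limit so the semaphore is what bounds the work.
    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(render, requests))
    return results, peak


results, peak = generate_manifests([f"app-{i}" for i in range(40)], parallelism_limit=10)
print(len(results), "manifests rendered, peak concurrency", peak)
```

A lower limit protects the pod from CPU and memory spikes at the cost of longer queues, so the pending request metric is the one to watch when tuning this value.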

Dedicated Volume

Configure a temporary volume to handle large repositories:

spec:
  template:
    spec:
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 10Gi # Scratch space for Git clones
      containers:
        - name: repo-server
          volumeMounts:
            - name: tmp
              mountPath: /tmp

5. Redis HA

Redis Sentinel Configuration

Configure Redis HA in production:

# Redis HA Helm values
redis-ha:
  enabled: true
  exporter:
    enabled: true
  haproxy:
    enabled: true
    replicas: 3
  redis:
    replicas: 3
  sentinel:
    enabled: true
    replicas: 3

Redis Sentinel Architecture

             HAProxy (Load Balancer)
                    |
        +-----------+-----------+
        |           |           |
    Sentinel 1  Sentinel 2  Sentinel 3
        |           |           |
        +-----------+-----------+
                    |
        +-----------+-----------+
        |           |           |
    Redis Master  Redis Slave  Redis Slave

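Redis Memory Optimization

The settings below cap Redis memory, evict the least recently used keys once the cap is reached, and disable persistence, which is acceptable because ArgoCD uses Redis only as a disposable cache: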

# Redis ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-redis-config
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    save ""
    appendonly no
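
The eviction behavior requested by maxmemory-policy allkeys-lru can be illustrated with a tiny cache. This is a simplified sketch under stated assumptions: the LruCache class is hypothetical, it caps the number of entries rather than bytes, and it tracks exact recency, whereas Redis approximates LRU by sampling keys.

```python
from collections import OrderedDict


class LruCache:
    """Evicts the least recently used key once `max_entries` is exceeded."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._data: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # a read marks the key as recently used
        return self._data[key]

    def set(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def keys(self) -> list[str]:
        return list(self._data)


cache = LruCache(max_entries=2)
cache.set("app-a", "manifest-a")
cache.set("app-b", "manifest-b")
cache.get("app-a")                 # app-a becomes the most recently used key
cache.set("app-c", "manifest-c")   # app-b is evicted
print(cache.keys())
```

Because any key may be evicted under this policy, a cache miss only costs a regeneration of the cached value, not correctness.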

Using External Redis

You can also use a managed Redis service (ElastiCache, Cloud Memorystore, and so on):

# argocd-cmd-params-cm ConfigMap
data:
  redis.server: 'my-redis.xxxx.cache.amazonaws.com:6379'
  redis.tls: 'true'

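6. Performance Tuning

Application Controller Tuning

env:
  # Reconciliation interval (default: 180 seconds); despite the name,
  # this controls how often Applications are refreshed against Git
  - name: ARGOCD_RECONCILIATION_TIMEOUT
    value: '300s'

  # Application resync period (default: 180 seconds)
  - name: ARGOCD_APP_RESYNC_PERIOD
    value: '180s'

  # Interval between self-heal attempts, in seconds
  - name: ARGOCD_SELF_HEAL_TIMEOUT_SECONDS
    value: '5'

  # App state cache expiration
  - name: ARGOCD_APP_STATE_CACHE_EXPIRATION
    value: '1h'

  # Kubernetes API client rate limits (queries per second and burst)
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50'
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100'

  # Resource inclusion/exclusion filters
  # (these are normally configured in the argocd-cm ConfigMap)
  - name: ARGOCD_RESOURCE_INCLUSIONS
    value: |
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
        clusters: ["*"]
  - name: ARGOCD_RESOURCE_EXCLUSIONS
    value: |
      - apiGroups: [""]
        kinds: ["Event"]
        clusters: ["*"]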

Resource Exclusions

Exclude resources that ArgoCD does not need to watch to improve performance:

# argocd-cm ConfigMap
data:
  resource.exclusions: |
    - apiGroups:
        - ""
      kinds:
        - "Event"
      clusters:
        - "*"
    - apiGroups:
        - "metrics.k8s.io"
      kinds:
        - "*"
      clusters:
        - "*"
    - apiGroups:
        - "coordination.k8s.io"
      kinds:
        - "Lease"
      clusters:
        - "*"
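
The way such rules select resources can be sketched in a few lines. This is an illustration only, assuming glob-style matching on all three fields and that a resource is excluded when any single rule matches; is_excluded and EXCLUSIONS are hypothetical names, not ArgoCD APIs.

```python
from fnmatch import fnmatchcase

# Mirrors the three exclusion rules from the ConfigMap example.
EXCLUSIONS = [
    {"apiGroups": [""], "kinds": ["Event"], "clusters": ["*"]},
    {"apiGroups": ["metrics.k8s.io"], "kinds": ["*"], "clusters": ["*"]},
    {"apiGroups": ["coordination.k8s.io"], "kinds": ["Lease"], "clusters": ["*"]},
]


def _matches_any(patterns: list[str], value: str) -> bool:
    """True when the value matches at least one glob pattern."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def is_excluded(api_group: str, kind: str, cluster: str, rules=EXCLUSIONS) -> bool:
    """A resource is excluded when one rule matches its group, kind, and cluster."""
    return any(
        _matches_any(rule["apiGroups"], api_group)
        and _matches_any(rule["kinds"], kind)
        and _matches_any(rule["clusters"], cluster)
        for rule in rules
    )


in_cluster = "https://kubernetes.default.svc"
print(is_excluded("", "Event", in_cluster))            # core-group Events are excluded
print(is_excluded("apps", "Deployment", in_cluster))   # Deployments are still watched
```

Note that the empty string denotes the core API group, which is why the first rule matches Event but not Deployment.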

API Rate Limiting

Control the load placed on the Kubernetes API server:

env:
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50' # Sustained queries per second
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100' # Burst allowance
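
QPS and burst settings of this kind are usually enforced with a token bucket: the bucket holds up to "burst" tokens and refills at "qps" tokens per second. The sketch below is a simplified model rather than the Kubernetes client's code; the TokenBucket class is hypothetical and takes the current time as an argument so the result is deterministic.

```python
class TokenBucket:
    """Allows short bursts up to `burst` while sustaining `qps` requests per second."""

    def __init__(self, qps: float, burst: int, now: float = 0.0):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = now

    def try_acquire(self, now: float) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(qps=50, burst=100)
immediate = sum(bucket.try_acquire(now=0.0) for _ in range(150))
after_one_second = sum(bucket.try_acquire(now=1.0) for _ in range(150))
print(immediate, after_one_second)  # prints "100 50"
```

With the values from the example, 100 requests pass immediately and throughput then settles at 50 per second. Raising both values speeds up reconciliation of large clusters but shifts load onto the target API servers.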

7. Monitoring

Prometheus Metrics

ArgoCD exposes a variety of Prometheus metrics:

Application Controller metrics:

Metric	Description
argocd_app_info	Application state information
argocd_app_sync_total	Total number of syncs
argocd_app_reconcile_count	Number of reconciliations
argocd_app_reconcile_bucket	Reconciliation duration histogram buckets
argocd_cluster_api_resources_count	Number of API resources per cluster
argocd_kubectl_exec_total	Number of kubectl executions

API Server metrics:

Metric	Description
argocd_api_server_request_total	Total number of API requests
argocd_api_server_request_duration_seconds	API request duration

Repo Server metrics:

Metric	Description
argocd_git_request_total	Total number of Git requests
argocd_git_request_duration_seconds	Git request duration
argocd_repo_pending_request_total	Number of pending requests

ServiceMonitor Configuration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: argocd
  endpoints:
    - port: metrics
      interval: 30s

Grafana Dashboard

Key items to monitor on the official ArgoCD Grafana dashboard:

1. Application status distribution (Healthy/Degraded/Progressing)
2. Sync success/failure ratio
3. Reconciliation latency
4. Repository Server response time
5. Redis memory usage
6. API Server request throughput
7. Controller memory/CPU usage

Alert Rules

# PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
  namespace: argocd
spec:
  groups:
    - name: argocd
      rules:
        - alert: ArgoCDAppDegraded
          expr: |
            argocd_app_info{health_status="Degraded"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'ArgoCD Application is Degraded'

        - alert: ArgoCDAppSyncFailed
          expr: |
            increase(argocd_app_sync_total{phase="Error"}[10m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: 'ArgoCD Application sync failed'

        - alert: ArgoCDReconciliationSlow
          expr: |
            histogram_quantile(0.99, sum by (le) (rate(argocd_app_reconcile_bucket[5m]))) > 60
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'ArgoCD Reconciliation is slow (p99 > 60s)'

        - alert: ArgoCDRepoServerHighLatency
          expr: |
            histogram_quantile(0.95, sum by (le) (rate(argocd_git_request_duration_seconds_bucket[5m]))) > 30
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'ArgoCD Repo Server high latency (p95 > 30s)'
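
The latency alerts rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for illustration, with made-up bucket values. In PromQL the buckets are typically passed through rate() and summed by le first, so that the estimate reflects recent observations rather than counts accumulated since process start.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls into +Inf: report the highest finite bound.
                return lower_bound
            # Linear interpolation within the matching bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical reconciliation durations: 100 observations across six buckets (seconds).
reconcile = [(5, 60), (15, 90), (30, 97), (60, 99), (120, 100), (math.inf, 100)]
print("p50:", round(histogram_quantile(0.50, reconcile), 2))
print("p99:", round(histogram_quantile(0.99, reconcile), 2))
```

Because the result is interpolated, its accuracy depends on the bucket boundaries: a p99 threshold of 60 seconds is only meaningful if the histogram has a bucket edge near 60.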

8. Disaster Recovery

Backup Strategy

Key resources to back up in order to preserve ArgoCD's state:

# Back up the full ArgoCD configuration
argocd admin export > argocd-backup.yaml

# Back up individual resources
kubectl get applications -n argocd -o yaml > applications-backup.yaml
kubectl get appprojects -n argocd -o yaml > projects-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository -o yaml > repos-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster -o yaml > clusters-backup.yaml
kubectl get configmap -n argocd argocd-cm -o yaml > argocd-cm-backup.yaml
kubectl get configmap -n argocd argocd-rbac-cm -o yaml > argocd-rbac-cm-backup.yaml
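
For scheduled backups, the same commands can be driven from a script. The following is a minimal sketch: build_backup_plan and run_backup are hypothetical helpers, the output directory is an assumption, and run_backup presumes that kubectl and argocd are on PATH with access to the cluster. Only the plan is printed here; nothing is executed.

```python
import subprocess
from pathlib import Path


def build_backup_plan(out_dir: str, namespace: str = "argocd") -> list[tuple[list[str], Path]]:
    """Return (command, output file) pairs mirroring the manual backup commands."""
    out = Path(out_dir)
    get = ["kubectl", "get", "-n", namespace, "-o", "yaml"]
    secret_label = "argocd.argoproj.io/secret-type"
    return [
        (["argocd", "admin", "export"], out / "argocd-backup.yaml"),
        (get + ["applications"], out / "applications-backup.yaml"),
        (get + ["appprojects"], out / "projects-backup.yaml"),
        (get + ["secrets", "-l", f"{secret_label}=repository"], out / "repos-backup.yaml"),
        (get + ["secrets", "-l", f"{secret_label}=cluster"], out / "clusters-backup.yaml"),
        (get + ["configmap", "argocd-cm"], out / "argocd-cm-backup.yaml"),
        (get + ["configmap", "argocd-rbac-cm"], out / "argocd-rbac-cm-backup.yaml"),
    ]


def run_backup(out_dir: str) -> None:
    """Execute every command in the plan, writing its stdout to the target file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command, target in build_backup_plan(out_dir):
        with open(target, "w", encoding="utf-8") as handle:
            subprocess.run(command, stdout=handle, check=True)


# Dry run: show what would be executed without touching the cluster.
plan = build_backup_plan("backups/example")
for command, target in plan:
    print(" ".join(command), ">", target)
```

The repository and cluster backups contain credentials, so the output directory should be encrypted or otherwise access-controlled.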

Restore Procedure

# 1. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

# 2. Restore configuration
kubectl apply -f argocd-cm-backup.yaml
kubectl apply -f argocd-rbac-cm-backup.yaml

# 3. Restore credentials
kubectl apply -f repos-backup.yaml
kubectl apply -f clusters-backup.yaml

# 4. Restore projects
kubectl apply -f projects-backup.yaml

# 5. Restore Applications
kubectl apply -f applications-backup.yaml

# Or import everything at once
argocd admin import - < argocd-backup.yaml

Managing ArgoCD with GitOps

Managing ArgoCD itself through GitOps simplifies disaster recovery:

# Application that manages ArgoCD itself
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-self
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/argocd-config.git
    targetRevision: HEAD
    path: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true

DR Strategy Types

Strategy	RPO	RTO	Cost
Scheduled backups	Hours	30 minutes to 1 hour	Low
GitOps self-management	0 (Git is the SSOT)	10 to 20 minutes	Medium
Active-Standby	0	Under 5 minutes	High
Active-Active	0	Immediate	Very high

9. Large-Scale Operations Checklist

Environments with 100+ Applications

Application Controller replicas: 2-3 (sharded)
Repo Server replicas: 2-3
Redis HA: Sentinel configuration
API Server replicas: 2-3
Apply resource exclusions
Tune cache expiration times

Environments with 500+ Applications

Application Controller replicas: 3-5 (sharded)
Repo Server replicas: 3-5
Redis HA: Sentinel plus memory optimization
API Server replicas: 3-5
Apply per-cluster API rate limiting
Configure monitoring alerts
Automate scheduled backups

Environments with 1000+ Applications

Application Controller replicas: 5+ (sharded)
Repo Server replicas: 5+ (dedicated volumes)
Use an external managed Redis
API Server replicas: 5+
Use ApplicationSet to reduce management complexity
Consider multiple ArgoCD instances
Build a dedicated monitoring dashboard

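10. Summary

Key elements of ArgoCD high availability and scalability:

Controller sharding: distribute Application processing across StatefulSet replicas
Repo Server scaling: horizontal scaling, cache optimization, and Git clone optimization
Redis HA: Sentinel configuration or a managed Redis service
Performance tuning: resource exclusions, API rate limiting, and reconciliation interval adjustments
Monitoring: Prometheus metrics, Grafana dashboards, and alert rules
Disaster recovery: scheduled backups, GitOps self-management, and a DR strategy

Applying these elements step by step lets you build an ArgoCD environment that reliably manages thousands of Applications.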