ArgoCD High Availability and Scalability: Large-Scale Cluster Operations

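1. Challenges of Large-Scale ArgoCD Operations

In large environments that manage hundreds or thousands of Applications, ArgoCD's default configuration can reach its limits.

Key Bottlenecks

Component | Bottleneck | Symptoms
Application Controller | A single instance processes every Application | Reconciliation delays, high memory usage
Repository Server | Git clone and manifest generation load | Slow syncs, high CPU usage
Redis | Growing cache data | Memory exhaustion, connection delays
API Server | Many concurrent users | Slow UI responses, API timeouts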

2. HA Architecture

Basic HA Setup

                   Load Balancer
                        |
            +-----------+-----------+
            |           |           |
        API Server  API Server  API Server
        (replica 1) (replica 2) (replica 3)
            |           |           |
            +-----------+-----------+
                        |
                  Redis (Sentinel HA)
                        |
            +-----------+-----------+
            |           |           |
        Repo Server Repo Server Repo Server
        (replica 1) (replica 2) (replica 3)
            |
    App Controller (sharded)
    Shard 0 | Shard 1 | Shard 2

API Server HA

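The API Server is stateless, so it scales horizontally by raising the replica count (the upstream HA guidance also suggests repeating that count in the ARGOCD_API_SERVER_REPLICAS environment variable):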

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: argocd-server
                topologyKey: kubernetes.io/hostname
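
Anti-affinity spreads the replicas across nodes, but a node drain or cluster upgrade can still evict several of them at once. A PodDisruptionBudget is one way to guard against that. The sketch below assumes the `app.kubernetes.io/name: argocd-server` label used above; the object name `argocd-server-pdb` and the `minAvailable` value are illustrative choices, not ArgoCD defaults.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: argocd-server-pdb # hypothetical name
  namespace: argocd
spec:
  minAvailable: 2 # assumption: keep 2 of the 3 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-server
```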

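3. Application Controller Sharding

Sharding Concept

By default a single Application Controller instance processes every Application. In large environments, sharding splits the managed clusters across several Controller instances, so each instance reconciles only the Applications that target its own clusters.

Sharding Configuration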

# argocd-cmd-params-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.sharding.algorithm: round-robin # legacy (default) or round-robin

StatefulSet-Based Sharding

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3 # three shards; must match ARGOCD_CONTROLLER_REPLICAS below
  template:
    spec:
      containers:
        - name: application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: '3'

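Sharding Algorithms

Both algorithms assign whole clusters to shards, so every Application that targets a given cluster is handled by the same shard.

Round-Robin:

Clusters are ordered and dealt out to the shards in turn, which keeps the cluster count per shard even
  Cluster 1 -> Shard 0
  Cluster 2 -> Shard 1
  Cluster 3 -> Shard 2
  Cluster 4 -> Shard 0

Legacy (default):

A hash of the cluster ID modulo the shard count picks the shard, so the distribution can be uneven (illustrative values)
  hash(Cluster 1) % 3 -> Shard 0
  hash(Cluster 2) % 3 -> Shard 0
  hash(Cluster 3) % 3 -> Shard 2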
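
A cluster can also be pinned to a particular shard by hand, which helps when one cluster is much larger than the others. The sketch below sets the `shard` field on a declarative cluster Secret. The Secret name, cluster name, and server URL are placeholders, and the connection `config` is left out.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-east # hypothetical name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: prod-east # placeholder cluster name
  server: https://prod-east.example.com # placeholder API server URL
  shard: '1' # handle this cluster on shard 1 instead of the computed shard
  # config: connection credentials omitted in this sketch
```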

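Shard Ownership

In the StatefulSet deployment each shard is owned by exactly one Controller pod, and the pod derives its shard number from its StatefulSet ordinal:

Shard 0: argocd-application-controller-0
Shard 1: argocd-application-controller-1
Shard 2: argocd-application-controller-2

There are no standby pods per shard. If a pod fails, the StatefulSet recreates it, and the clusters of that shard are not reconciled until the replacement is ready.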

4. Repository Server Scaling

Horizontal Scaling

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: repo-server
          resources:
            requests:
              cpu: '1'
              memory: '1Gi'
            limits:
              cpu: '2'
              memory: '2Gi'
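
When repository load varies through the day, the replica count can follow CPU usage instead of being fixed. The following is a minimal sketch using a standard HorizontalPodAutoscaler. The object name, replica bounds, and 70% target are assumptions to tune, and it relies on the CPU request set above and on a metrics server being installed.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: argocd-repo-server-hpa # hypothetical name
  namespace: argocd
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: argocd-repo-server
  minReplicas: 3 # assumption: never drop below the static replica count
  maxReplicas: 6 # assumption
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # percent of the CPU request
```

If an autoscaler manages the Deployment, the fixed `replicas` field is usually dropped from the manifest so the two do not fight over the value.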

Cache Optimization

# Cache settings via environment variables
env:
  - name: ARGOCD_REPO_SERVER_CACHE_EXPIRATION
    value: '24h' # cache expiration (default 24 hours)
  - name: ARGOCD_EXEC_TIMEOUT
    value: '180s' # timeout for manifest generation commands

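Git Clone Optimization

env:
  # Shallow clone depth (check that your ArgoCD version supports this variable)
  - name: ARGOCD_GIT_SHALLOW_CLONE_DEPTH
    value: '1' # minimum depth for faster clones

  # Git request timeout
  - name: ARGOCD_GIT_REQUEST_TIMEOUT
    value: '60s'

  # Maximum number of concurrent manifest generations
  - name: ARGOCD_REPO_SERVER_PARALLELISM_LIMIT
    value: '10'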

Dedicated Volume

Configure a temporary volume so that large repositories have enough scratch space:

spec:
  template:
    spec:
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 10Gi # scratch space for Git clones
      containers:
        - name: repo-server
          volumeMounts:
            - name: tmp
              mountPath: /tmp

5. Redis HA

Redis Sentinel Configuration

Configure Redis HA in production:

# Redis HA Helm values
redis-ha:
  enabled: true
  exporter:
    enabled: true
  haproxy:
    enabled: true
    replicas: 3
  redis:
    replicas: 3
  sentinel:
    enabled: true
    replicas: 3

Redis Sentinel Architecture

             HAProxy (Load Balancer)
                    |
        +-----------+-----------+
        |           |           |
    Sentinel 1  Sentinel 2  Sentinel 3
        |           |           |
        +-----------+-----------+
                    |
        +-----------+-----------+
        |           |           |
    Redis Master  Redis Slave  Redis Slave

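Redis Memory Optimization

ArgoCD uses Redis only as a disposable cache, so persistence can be disabled and the least recently used keys evicted when memory fills up: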

# Redis ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-redis-config
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    save ""
    appendonly no

Using External Redis

A managed Redis service (ElastiCache, Cloud Memorystore, and so on) can be used instead:

# argocd-cmd-params-cm ConfigMap
data:
  redis.server: 'my-redis.xxxx.cache.amazonaws.com:6379'
  redis.tls: 'true'

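6. Performance Tuning

Application Controller Tuning

env:
  # Reconciliation timeout (default 180 seconds)
  - name: ARGOCD_RECONCILIATION_TIMEOUT
    value: '300s'

  # Application resync period (default 180 seconds)
  - name: ARGOCD_APP_RESYNC_PERIOD
    value: '180s'

  # Self-heal retry interval in seconds
  - name: ARGOCD_SELF_HEAL_TIMEOUT_SECONDS
    value: '5'

  # State cache expiration
  - name: ARGOCD_APP_STATE_CACHE_EXPIRATION
    value: '1h'

  # Kubernetes client rate limits (queries per second and burst)
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50'
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100'

  # Resource inclusion/exclusion filters. ArgoCD normally reads these from
  # resource.inclusions / resource.exclusions in argocd-cm, so verify that
  # your version honors these variables before relying on them.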
  - name: ARGOCD_RESOURCE_INCLUSIONS
    value: |
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
        clusters: ["*"]
  - name: ARGOCD_RESOURCE_EXCLUSIONS
    value: |
      - apiGroups: [""]
        kinds: ["Event"]
        clusters: ["*"]
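
Besides the environment variables above, the number of controller workers is a common tuning knob as the Application count grows. This sketch sets them through `argocd-cmd-params-cm`. The values follow the commonly cited starting point for roughly 1000 Applications and should be treated as assumptions to validate against your own reconciliation metrics.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: '50' # workers that refresh Application state (default 20)
  controller.operation.processors: '25' # workers that run sync operations (default 10)
```

The Controller pods need a restart to pick up changes to this ConfigMap.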

Resource Exclusions

Exclude resources that do not need to be watched to improve performance:

# argocd-cm ConfigMap
data:
  resource.exclusions: |
    - apiGroups:
        - ""
      kinds:
        - "Event"
      clusters:
        - "*"
    - apiGroups:
        - "metrics.k8s.io"
      kinds:
        - "*"
      clusters:
        - "*"
    - apiGroups:
        - "coordination.k8s.io"
      kinds:
        - "Lease"
      clusters:
        - "*"

API Rate Limiting

Control the load placed on the Kubernetes API server:

env:
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50' # queries per second
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100' # allowed burst

7. Monitoring

Prometheus Metrics

ArgoCD exposes a range of Prometheus metrics:

Application Controller metrics:

Metric | Description
argocd_app_info | Application state information
argocd_app_sync_total | Total number of syncs
argocd_app_reconcile_count | Number of reconciliations
argocd_app_reconcile_bucket | Distribution of reconciliation duration
argocd_cluster_api_resources_count | Number of API resources per cluster
argocd_kubectl_exec_total | Number of kubectl executions

API Server metrics:

Metric | Description
argocd_api_server_request_total | Total number of API requests
argocd_api_server_request_duration_seconds | API request duration

Repo Server metrics:

Metric | Description
argocd_git_request_total | Total number of Git requests
argocd_git_request_duration_seconds | Git request duration
argocd_repo_pending_request_total | Number of pending requests

ServiceMonitor Configuration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: argocd
  endpoints:
    - port: metrics
      interval: 30s

Grafana Dashboard

Key items monitored on the official ArgoCD Grafana dashboard:

1. Application status distribution (Healthy/Degraded/Progressing)
2. Sync success/failure ratio
3. Reconciliation latency
4. Repository Server response time
5. Redis memory usage
6. API Server request throughput
7. Controller memory/CPU usage

Alert Rules

# PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
  namespace: argocd
spec:
  groups:
    - name: argocd
      rules:
        - alert: ArgoCDAppDegraded
          expr: |
            argocd_app_info{health_status="Degraded"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'ArgoCD Application is Degraded'

        - alert: ArgoCDAppSyncFailed
          expr: |
            increase(argocd_app_sync_total{phase="Error"}[10m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: 'ArgoCD Application sync failed'

        - alert: ArgoCDReconciliationSlow
          expr: |
            histogram_quantile(0.99, sum by (le) (rate(argocd_app_reconcile_bucket[5m]))) > 60
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'ArgoCD Reconciliation is slow (p99 > 60s)'

        - alert: ArgoCDRepoServerHighLatency
          expr: |
            histogram_quantile(0.95, sum by (le) (rate(argocd_git_request_duration_seconds_bucket[5m]))) > 30
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'ArgoCD Repo Server high latency (p95 > 30s)'
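
The `argocd_repo_pending_request_total` metric can feed a similar rule for a Repo Server backlog. The entry below is a sketch meant to be appended under the same `rules:` list. The alert name and the threshold of 10 are hypothetical and need calibrating against normal load.

```yaml
        - alert: ArgoCDRepoPendingRequests # hypothetical alert name
          expr: |
            sum(argocd_repo_pending_request_total) > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'ArgoCD Repo Server has a sustained request backlog'
```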

8. Disaster Recovery

Backup Strategy

Core resources to back up in order to preserve ArgoCD state:

# Back up the full ArgoCD configuration
argocd admin export -n argocd > argocd-backup.yaml

# Back up individual resources
kubectl get applications -n argocd -o yaml > applications-backup.yaml
kubectl get appprojects -n argocd -o yaml > projects-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository -o yaml > repos-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster -o yaml > clusters-backup.yaml
kubectl get configmap -n argocd argocd-cm -o yaml > argocd-cm-backup.yaml
kubectl get configmap -n argocd argocd-rbac-cm -o yaml > argocd-rbac-cm-backup.yaml
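
The full export can be scheduled from inside the cluster. The CronJob below is a rough sketch: the ServiceAccount `argocd-backup`, the PVC `argocd-backup-pvc`, the schedule, and the image tag placeholder are all assumptions, and the ServiceAccount needs RBAC that allows it to read the ArgoCD resources in the namespace.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argocd-backup # hypothetical name
  namespace: argocd
spec:
  schedule: '0 * * * *' # hourly; assumption matching an RPO measured in hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: argocd-backup # hypothetical; needs read access to ArgoCD resources
          restartPolicy: OnFailure
          containers:
            - name: export
              image: quay.io/argoproj/argocd:<version> # pin to the version you run
              command: ['sh', '-c']
              args:
                - argocd admin export -n argocd > /backup/argocd-$(date +%Y%m%d-%H%M).yaml
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: argocd-backup-pvc # hypothetical PVC
```

The export includes repository and cluster credentials, so the backup volume should be access-controlled or the files encrypted.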

Restore Procedure

# 1. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

# 2. Restore configuration
kubectl apply -f argocd-cm-backup.yaml
kubectl apply -f argocd-rbac-cm-backup.yaml

# 3. Restore credentials
kubectl apply -f repos-backup.yaml
kubectl apply -f clusters-backup.yaml

# 4. Restore projects
kubectl apply -f projects-backup.yaml

# 5. Restore Applications
kubectl apply -f applications-backup.yaml

# Alternatively, import the full export in one step
argocd admin import -n argocd - < argocd-backup.yaml

Managing ArgoCD with GitOps

Managing ArgoCD itself through GitOps simplifies disaster recovery:

# Application that manages ArgoCD itself
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-self
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/argocd-config.git
    targetRevision: HEAD
    path: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true

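DR Strategy Types

Strategy | RPO | RTO | Cost
Periodic backup | Hours | 30 minutes to 1 hour | Low
GitOps self-management | 0 (Git is the SSOT) | 10 to 20 minutes | Medium
Active-Standby | 0 | Under 5 minutes | High
Active-Active | 0 | Immediate | Very high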

9. Large-Scale Operations Checklist

100+ Applications

  • Application Controller replicas: 2-3 (sharded)
  • Repo Server replicas: 2-3
  • Redis HA: Sentinel setup
  • API Server replicas: 2-3
  • Resource exclusions applied
  • Cache expiration tuned

500+ Applications

  • Application Controller replicas: 3-5 (sharded)
  • Repo Server replicas: 3-5
  • Redis HA: Sentinel plus memory optimization
  • API Server replicas: 3-5
  • Per-cluster API rate limiting applied
  • Monitoring alerts configured
  • Periodic backups automated

1000+ Applications

  • Application Controller replicas: 5+ (sharded)
  • Repo Server replicas: 5+ (dedicated volumes)
  • External managed Redis
  • API Server replicas: 5+
  • ApplicationSet to reduce management complexity
  • Consider multiple ArgoCD instances
  • Dedicated monitoring dashboards
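
As an illustration of the ApplicationSet item, the sketch below generates one Application per cluster registered in ArgoCD by using the cluster generator. The ApplicationSet name, repository URL, path, and target namespace are placeholders chosen for the example.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: guestbook # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters: {} # one set of parameters per registered cluster
  template:
    metadata:
      name: '{{name}}-guestbook' # {{name}} is the cluster name
    spec:
      project: default
      source:
        repoURL: https://github.com/org/apps.git # placeholder repository
        targetRevision: HEAD
        path: guestbook # placeholder path
      destination:
        server: '{{server}}' # {{server}} is the cluster API URL
        namespace: guestbook
```

One template then replaces a hand-written Application per cluster, so adding a cluster to ArgoCD is enough to roll the workload out to it.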

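10. Summary

Key elements of ArgoCD high availability and scalability:

  1. Controller sharding: StatefulSet-based distribution of Application processing
  2. Repo Server scaling: horizontal scaling, cache optimization, and Git clone optimization
  3. Redis HA: Sentinel configuration or managed Redis
  4. Performance tuning: resource exclusions, API rate limiting, and reconciliation interval adjustment
  5. Monitoring: Prometheus metrics, Grafana dashboards, and alert rules
  6. Disaster recovery: periodic backups, GitOps self-management, and a DR strategy

Applying these elements step by step makes it possible to build an ArgoCD environment that reliably manages thousands of Applications.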
