[ArgoCD] High Availability and Scalability: Large-Scale Cluster Operations

1. Challenges of Large-Scale ArgoCD Operations

In large environments managing hundreds or thousands of Applications, ArgoCD's default configuration can reach its limits.

Key Bottlenecks

| Component | Bottleneck | Symptoms |
| --- | --- | --- |
| Application Controller | Single instance handles all Apps | Reconciliation delays, high memory |
| Repository Server | Git clone and manifest generation load | Slow syncs, high CPU |
| Redis | Growing cache data | Memory exhaustion, connection delays |
| API Server | Many concurrent users | UI lag, API timeouts |

2. HA Architecture

Basic HA Setup

                   Load Balancer
                        |
            +-----------+-----------+
            |           |           |
        API Server  API Server  API Server
        (replica 1) (replica 2) (replica 3)
            |           |           |
            +-----------+-----------+
                        |
                  Redis (Sentinel HA)
                        |
            +-----------+-----------+
            |           |           |
        Repo Server Repo Server Repo Server
        (replica 1) (replica 2) (replica 3)
            |
    App Controller (sharded)
    Shard 0 | Shard 1 | Shard 2

API Server HA

The API Server is stateless, so it scales horizontally by increasing replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: argocd-server
                topologyKey: kubernetes.io/hostname

3. Application Controller Sharding

Sharding Concept

By default, a single Application Controller instance processes every Application. In large environments, sharding distributes the managed clusters (and the Applications deployed to them) across multiple controller instances.

Sharding Configuration

# argocd-cmd-params-cm ConfigMap
data:
  controller.sharding.algorithm: round-robin

StatefulSet-Based Sharding

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: '3'

Sharding Algorithms

Round-Robin (recommended):

Distributes managed clusters evenly across shards in round-robin order
  Cluster A -> Shard 0
  Cluster B -> Shard 1
  Cluster C -> Shard 2

Legacy (default):

Hash-based cluster sharding: hash(cluster ID) % shard count, which can leave shards unevenly loaded
  Cluster 1 -> hash % 3 = Shard 0
  Cluster 2 -> hash % 3 = Shard 1
  Cluster 3 -> hash % 3 = Shard 2
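
Both assignment strategies can be sketched in a few lines. This is an illustrative model only: the sharding unit is the cluster, and `crc32` stands in for ArgoCD's actual hash function, which differs in detail.

```python
from zlib import crc32

def legacy_shard(cluster_id: str, replicas: int) -> int:
    # Legacy algorithm: hash the cluster ID and take it modulo the
    # shard count. Simple, but the distribution can be uneven.
    return crc32(cluster_id.encode()) % replicas

def round_robin_shards(cluster_ids: list[str], replicas: int) -> dict[str, int]:
    # Round-robin algorithm: walk the clusters in a stable order and
    # deal them out evenly, so no shard gets more than ceil(n/replicas).
    return {c: i % replicas for i, c in enumerate(sorted(cluster_ids))}

clusters = ["in-cluster", "prod-eu", "prod-us", "staging"]
print(round_robin_shards(clusters, 3))
```

The practical difference: with the legacy hash, two large clusters can land on the same shard by chance; round-robin guarantees an even spread by count (though not by per-cluster load).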

Shard Ownership

With StatefulSet-based sharding, each replica owns exactly one shard: the shard number is derived from the pod's ordinal (or set explicitly via the ARGOCD_CONTROLLER_SHARD environment variable), so each shard is processed by exactly one controller at a time:

Shard 0: argocd-application-controller-0
Shard 1: argocd-application-controller-1
Shard 2: argocd-application-controller-2

4. Repository Server Scaling

Horizontal Scaling

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: repo-server
          resources:
            requests:
              cpu: '1'
              memory: '1Gi'
            limits:
              cpu: '2'
              memory: '2Gi'

Cache Optimization

env:
  - name: ARGOCD_REPO_SERVER_CACHE_EXPIRATION
    value: '24h'
  - name: ARGOCD_EXEC_TIMEOUT
    value: '180s'

Git Clone Optimization

env:
  - name: ARGOCD_GIT_SHALLOW_CLONE_DEPTH
    value: '1'
  - name: ARGOCD_GIT_REQUEST_TIMEOUT
    value: '60s'
  - name: ARGOCD_REPO_SERVER_PARALLELISM_LIMIT
    value: '10'

Dedicated Volumes

spec:
  template:
    spec:
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 10Gi
      containers:
        - name: repo-server
          volumeMounts:
            - name: tmp
              mountPath: /tmp

5. Redis HA

Redis Sentinel Configuration

redis-ha:
  enabled: true
  haproxy:
    enabled: true
    replicas: 3
  redis:
    replicas: 3
  sentinel:
    enabled: true
    replicas: 3

Redis Sentinel Architecture

             HAProxy (Load Balancer)
                    |
        +-----------+-----------+
        |           |           |
    Sentinel 1  Sentinel 2  Sentinel 3
        |           |           |
        +-----------+-----------+
                    |
        +-----------+-----------+
        |           |           |
    Redis Master  Redis Slave  Redis Slave

Redis Memory Optimization

ArgoCD treats Redis as a disposable cache, so persistence (RDB snapshots and AOF) can be disabled and an LRU eviction policy applied:

data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    save ""
    appendonly no

External Managed Redis

Use managed Redis services (ElastiCache, Cloud Memorystore):

# argocd-cmd-params-cm ConfigMap
data:
  redis.server: 'my-redis.xxxx.cache.amazonaws.com:6379'
  redis.tls: 'true'

6. Performance Tuning

Application Controller Tuning

env:
  - name: ARGOCD_RECONCILIATION_TIMEOUT
    value: '300s'
  - name: ARGOCD_APP_RESYNC_PERIOD
    value: '180s'
  - name: ARGOCD_SELF_HEAL_TIMEOUT_SECONDS
    value: '5'
  - name: ARGOCD_APP_STATE_CACHE_EXPIRATION
    value: '1h'
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50'
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100'
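
When picking a resync period, a back-of-envelope check helps: every Application is re-reconciled at least once per ARGOCD_APP_RESYNC_PERIOD, so the steady-state reconcile rate per shard is roughly apps-per-shard divided by the period. A rough sketch (the numbers are illustrative assumptions, not ArgoCD defaults or limits):

```python
def reconciles_per_second(total_apps: int, shards: int, resync_period_s: int) -> float:
    # Steady-state lower bound: each app reconciles once per resync
    # period, spread across shards. Change-driven (webhook) reconciles
    # come on top. Assumes apps are evenly spread across shards, which
    # holds only if clusters are evenly sized.
    apps_per_shard = total_apps / shards
    return apps_per_shard / resync_period_s

# e.g. 900 apps across 3 shards with a 180s resync period
print(f"{reconciles_per_second(900, 3, 180):.2f} reconciles/s per shard")
```

If that rate, multiplied by your typical reconcile duration, approaches the controller's processor count, lengthen the resync period or add shards.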

Resource Exclusions

Exclude resource types ArgoCD does not need to track (high-churn Events, metrics, Leases) to reduce watch and cache load:

# argocd-cm ConfigMap
data:
  resource.exclusions: |
    - apiGroups:
        - ""
      kinds:
        - "Event"
      clusters:
        - "*"
    - apiGroups:
        - "metrics.k8s.io"
      kinds:
        - "*"
      clusters:
        - "*"
    - apiGroups:
        - "coordination.k8s.io"
      kinds:
        - "Lease"
      clusters:
        - "*"
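
Conceptually, each exclusion rule is a (group, kind, cluster) glob match, and a resource is ignored if any rule matches on all three fields. A sketch of that matching logic, not ArgoCD's actual implementation:

```python
from fnmatch import fnmatch

exclusions = [
    {"apiGroups": [""], "kinds": ["Event"], "clusters": ["*"]},
    {"apiGroups": ["metrics.k8s.io"], "kinds": ["*"], "clusters": ["*"]},
]

def is_excluded(group: str, kind: str, cluster: str) -> bool:
    # A resource is skipped if ANY rule matches its group, kind,
    # and cluster simultaneously (globs allowed in each field).
    return any(
        any(fnmatch(group, g) for g in rule["apiGroups"])
        and any(fnmatch(kind, k) for k in rule["kinds"])
        and any(fnmatch(cluster, c) for c in rule["clusters"])
        for rule in exclusions
    )

print(is_excluded("", "Event", "https://prod"))          # core-group Events: excluded
print(is_excluded("apps", "Deployment", "https://prod")) # Deployments: still tracked
```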

API Rate Limiting

Control load on the Kubernetes API server:

env:
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50'
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100'

7. Monitoring

Prometheus Metrics

Application Controller metrics:

| Metric | Description |
| --- | --- |
| argocd_app_info | Application state information |
| argocd_app_sync_total | Total sync count |
| argocd_app_reconcile_count | Reconciliation count |
| argocd_app_reconcile_bucket | Reconciliation duration distribution |

API Server metrics:

| Metric | Description |
| --- | --- |
| argocd_api_server_request_total | Total API requests |
| argocd_api_server_request_duration_seconds | API request duration |

Repo Server metrics:

| Metric | Description |
| --- | --- |
| argocd_git_request_total | Total Git requests |
| argocd_git_request_duration_seconds | Git request duration |

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: argocd
  endpoints:
    - port: metrics
      interval: 30s

Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
spec:
  groups:
    - name: argocd
      rules:
        - alert: ArgoCDAppDegraded
          expr: argocd_app_info{health_status="Degraded"} > 0
          for: 5m
          labels:
            severity: warning

        - alert: ArgoCDAppSyncFailed
          expr: increase(argocd_app_sync_total{phase="Error"}[10m]) > 0
          for: 1m
          labels:
            severity: critical

        - alert: ArgoCDReconciliationSlow
          expr: histogram_quantile(0.99, sum(rate(argocd_app_reconcile_bucket[5m])) by (le)) > 60
          for: 10m
          labels:
            severity: warning
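
The ArgoCDReconciliationSlow alert relies on histogram_quantile, which estimates a quantile from Prometheus' cumulative histogram buckets by locating the bucket the target rank falls into and interpolating linearly inside it. A simplified Python sketch of that estimation (Prometheus itself operates on rate()-d bucket values, and assumes observations are uniform within a bucket):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    # buckets: (upper_bound, cumulative_count) pairs, sorted ascending
    # and ending with the +Inf bucket, as *_bucket metrics expose them.
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            in_bucket = count - prev_count
            frac = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# 100 reconciles: 50 finished under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.99, buckets))  # p99 falls in the 0.5-1.0s bucket
```

A practical consequence: choose bucket boundaries close to your alert threshold (60s here), since the estimate can be off by up to a bucket's width.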

8. Disaster Recovery

Backup Strategy

# Full ArgoCD config backup
argocd admin export -n argocd > argocd-backup.yaml

# Individual resource backups
kubectl get applications -n argocd -o yaml > applications-backup.yaml
kubectl get appprojects -n argocd -o yaml > projects-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository -o yaml > repos-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster -o yaml > clusters-backup.yaml

Restore Procedure

# 1. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f install.yaml

# 2. Restore configs
kubectl apply -f argocd-cm-backup.yaml
kubectl apply -f argocd-rbac-cm-backup.yaml

# 3. Restore credentials
kubectl apply -f repos-backup.yaml
kubectl apply -f clusters-backup.yaml

# 4. Restore projects and applications
kubectl apply -f projects-backup.yaml
kubectl apply -f applications-backup.yaml

GitOps Self-Management

Managing ArgoCD itself via GitOps simplifies disaster recovery:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-self
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/argocd-config.git
    targetRevision: HEAD
    path: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true

DR Strategy Types

(RPO: how much state you can afford to lose; RTO: how quickly you must be running again.)

| Strategy | RPO | RTO | Cost |
| --- | --- | --- | --- |
| Periodic Backup | Hourly | 30 min - 1 hr | Low |
| GitOps Self-Management | 0 (Git is the source of truth) | 10 - 20 min | Medium |
| Active-Standby | 0 | Under 5 min | High |
| Active-Active | 0 | Immediate | Very High |

9. Large-Scale Operations Checklist

100+ Applications

  • Controller replicas: 2-3 (sharded)
  • Repo Server replicas: 2-3
  • Redis HA: Sentinel setup
  • API Server replicas: 2-3
  • Resource exclusions configured
  • Cache expiry optimized

500+ Applications

  • Controller replicas: 3-5 (sharded)
  • Repo Server replicas: 3-5
  • Redis HA: Sentinel + memory optimization
  • API Rate Limiting per cluster
  • Monitoring alerts configured
  • Automated periodic backups

1000+ Applications

  • Controller replicas: 5+ (sharded)
  • Repo Server replicas: 5+ (dedicated volumes)
  • External managed Redis
  • ApplicationSet for reduced management complexity
  • Consider multi-ArgoCD instances
  • Dedicated monitoring dashboards

10. Summary

Key elements of ArgoCD HA and scalability:

  1. Controller Sharding: StatefulSet-based distributed Application processing
  2. Repo Server Scaling: Horizontal scaling + cache optimization + Git clone optimization
  3. Redis HA: Sentinel configuration or managed Redis
  4. Performance Tuning: Resource exclusions, API Rate Limiting, Reconciliation interval adjustment
  5. Monitoring: Prometheus metrics + Grafana dashboards + alert rules
  6. Disaster Recovery: Periodic backups + GitOps self-management + DR strategy

Applying these elements progressively enables building an ArgoCD environment that reliably manages thousands of Applications.