[ArgoCD] High Availability and Scalability: Large-Scale Cluster Operations

1. Challenges of Large-Scale ArgoCD Operations

In large environments managing hundreds or thousands of Applications, ArgoCD's default configuration can reach its limits.

Key Bottlenecks

| Component | Bottleneck | Symptoms |
| --- | --- | --- |
| Application Controller | Single instance handles all Apps | Reconciliation delays, high memory |
| Repository Server | Git clone and manifest generation load | Slow syncs, high CPU |
| Redis | Growing cache data | Memory exhaustion, connection delays |
| API Server | Many concurrent users | UI lag, API timeouts |

2. HA Architecture

Basic HA Setup

                   Load Balancer
                        |
            +-----------+-----------+
            |           |           |
        API Server  API Server  API Server
        (replica 1) (replica 2) (replica 3)
            |           |           |
            +-----------+-----------+
                        |
                  Redis (Sentinel HA)
                        |
            +-----------+-----------+
            |           |           |
        Repo Server Repo Server Repo Server
        (replica 1) (replica 2) (replica 3)
            |
    App Controller (sharded)
    Shard 0 | Shard 1 | Shard 2

API Server HA

The API Server is stateless, so it scales horizontally by increasing replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: argocd-server
                topologyKey: kubernetes.io/hostname

3. Application Controller Sharding

Sharding Concept

By default, a single Application Controller instance processes every Application. In large environments, sharding distributes the managed clusters (and the Applications deployed to them) across multiple controller instances.

Sharding Configuration

# argocd-cmd-params-cm ConfigMap
data:
  controller.sharding.algorithm: round-robin

StatefulSet-Based Sharding

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: '3'

Sharding Algorithms

Round-Robin (recommended):

Distributes managed clusters evenly across shards in round-robin order
  Cluster A -> Shard 0
  Cluster B -> Shard 1
  Cluster C -> Shard 2

Legacy (default):

Hash-based cluster sharding: hash(cluster ID) % shard count, which can leave shards unevenly loaded
  Cluster 1 -> hash % 3 = Shard 0
  Cluster 2 -> hash % 3 = Shard 1
  Cluster 3 -> hash % 3 = Shard 2
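
Both assignment strategies can be sketched in a few lines. This is an illustrative model only: the sharding unit is the cluster, and `crc32` stands in for ArgoCD's actual hash function, which differs in detail.

```python
from zlib import crc32

def legacy_shard(cluster_id: str, replicas: int) -> int:
    # Legacy algorithm: hash the cluster ID and take it modulo the
    # shard count. Simple, but the distribution can be uneven.
    return crc32(cluster_id.encode()) % replicas

def round_robin_shards(cluster_ids: list[str], replicas: int) -> dict[str, int]:
    # Round-robin algorithm: walk the clusters in a stable order and
    # deal them out evenly, so no shard gets more than ceil(n/replicas).
    return {c: i % replicas for i, c in enumerate(sorted(cluster_ids))}

clusters = ["in-cluster", "prod-eu", "prod-us", "staging"]
print(round_robin_shards(clusters, 3))
```

The practical difference: with the legacy hash, two large clusters can land on the same shard by chance; round-robin guarantees an even spread by count (though not by per-cluster load).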

Shard Ownership

With StatefulSet-based sharding, each replica owns exactly one shard: the shard number is derived from the pod's ordinal (or set explicitly via the ARGOCD_CONTROLLER_SHARD environment variable), so each shard is processed by exactly one controller at a time:

Shard 0: argocd-application-controller-0
Shard 1: argocd-application-controller-1
Shard 2: argocd-application-controller-2

4. Repository Server Scaling

Horizontal Scaling

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: repo-server
          resources:
            requests:
              cpu: '1'
              memory: '1Gi'
            limits:
              cpu: '2'
              memory: '2Gi'

Cache Optimization

env:
  - name: ARGOCD_REPO_SERVER_CACHE_EXPIRATION
    value: '24h'
  - name: ARGOCD_EXEC_TIMEOUT
    value: '180s'

Git Clone Optimization

env:
  - name: ARGOCD_GIT_SHALLOW_CLONE_DEPTH
    value: '1'
  - name: ARGOCD_GIT_REQUEST_TIMEOUT
    value: '60s'
  - name: ARGOCD_REPO_SERVER_PARALLELISM_LIMIT
    value: '10'

Dedicated Volumes

spec:
  template:
    spec:
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 10Gi
      containers:
        - name: repo-server
          volumeMounts:
            - name: tmp
              mountPath: /tmp

5. Redis HA

Redis Sentinel Configuration

redis-ha:
  enabled: true
  haproxy:
    enabled: true
    replicas: 3
  redis:
    replicas: 3
  sentinel:
    enabled: true
    replicas: 3

Redis Sentinel Architecture

             HAProxy (Load Balancer)
                    |
        +-----------+-----------+
        |           |           |
    Sentinel 1  Sentinel 2  Sentinel 3
        |           |           |
        +-----------+-----------+
                    |
        +-----------+-----------+
        |           |           |
    Redis Master  Redis Slave  Redis Slave

Redis Memory Optimization

ArgoCD treats Redis as a disposable cache, so persistence (RDB snapshots and AOF) can be disabled and an LRU eviction policy applied:

data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    save ""
    appendonly no

External Managed Redis

Use managed Redis services (ElastiCache, Cloud Memorystore):

# argocd-cmd-params-cm ConfigMap
data:
  redis.server: 'my-redis.xxxx.cache.amazonaws.com:6379'
  redis.tls: 'true'

6. Performance Tuning

Application Controller Tuning

env:
  - name: ARGOCD_RECONCILIATION_TIMEOUT
    value: '300s'
  - name: ARGOCD_APP_RESYNC_PERIOD
    value: '180s'
  - name: ARGOCD_SELF_HEAL_TIMEOUT_SECONDS
    value: '5'
  - name: ARGOCD_APP_STATE_CACHE_EXPIRATION
    value: '1h'
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50'
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100'
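
When picking a resync period, a back-of-envelope check helps: every Application is re-reconciled at least once per ARGOCD_APP_RESYNC_PERIOD, so the steady-state reconcile rate per shard is roughly apps-per-shard divided by the period. A rough sketch (the numbers are illustrative assumptions, not ArgoCD defaults or limits):

```python
def reconciles_per_second(total_apps: int, shards: int, resync_period_s: int) -> float:
    # Steady-state lower bound: each app reconciles once per resync
    # period, spread across shards. Change-driven (webhook) reconciles
    # come on top. Assumes apps are evenly spread across shards, which
    # holds only if clusters are evenly sized.
    apps_per_shard = total_apps / shards
    return apps_per_shard / resync_period_s

# e.g. 900 apps across 3 shards with a 180s resync period
print(f"{reconciles_per_second(900, 3, 180):.2f} reconciles/s per shard")
```

If that rate, multiplied by your typical reconcile duration, approaches the controller's processor count, lengthen the resync period or add shards.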

Resource Exclusions

Exclude resource types ArgoCD does not need to track (high-churn Events, metrics, Leases) to reduce watch and cache load:

# argocd-cm ConfigMap
data:
  resource.exclusions: |
    - apiGroups:
        - ""
      kinds:
        - "Event"
      clusters:
        - "*"
    - apiGroups:
        - "metrics.k8s.io"
      kinds:
        - "*"
      clusters:
        - "*"
    - apiGroups:
        - "coordination.k8s.io"
      kinds:
        - "Lease"
      clusters:
        - "*"
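
Conceptually, each exclusion rule is a (group, kind, cluster) glob match, and a resource is ignored if any rule matches on all three fields. A sketch of that matching logic, not ArgoCD's actual implementation:

```python
from fnmatch import fnmatch

exclusions = [
    {"apiGroups": [""], "kinds": ["Event"], "clusters": ["*"]},
    {"apiGroups": ["metrics.k8s.io"], "kinds": ["*"], "clusters": ["*"]},
]

def is_excluded(group: str, kind: str, cluster: str) -> bool:
    # A resource is skipped if ANY rule matches its group, kind,
    # and cluster simultaneously (globs allowed in each field).
    return any(
        any(fnmatch(group, g) for g in rule["apiGroups"])
        and any(fnmatch(kind, k) for k in rule["kinds"])
        and any(fnmatch(cluster, c) for c in rule["clusters"])
        for rule in exclusions
    )

print(is_excluded("", "Event", "https://prod"))          # core-group Events: excluded
print(is_excluded("apps", "Deployment", "https://prod")) # Deployments: still tracked
```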

API Rate Limiting

Control load on the Kubernetes API server:

env:
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50'
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100'

7. Monitoring

Prometheus Metrics

Application Controller metrics:

| Metric | Description |
| --- | --- |
| argocd_app_info | Application state information |
| argocd_app_sync_total | Total sync count |
| argocd_app_reconcile_count | Reconciliation count |
| argocd_app_reconcile_bucket | Reconciliation duration distribution |

API Server metrics:

| Metric | Description |
| --- | --- |
| argocd_api_server_request_total | Total API requests |
| argocd_api_server_request_duration_seconds | API request duration |

Repo Server metrics:

| Metric | Description |
| --- | --- |
| argocd_git_request_total | Total Git requests |
| argocd_git_request_duration_seconds | Git request duration |

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: argocd
  endpoints:
    - port: metrics
      interval: 30s

Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
spec:
  groups:
    - name: argocd
      rules:
        - alert: ArgoCDAppDegraded
          expr: argocd_app_info{health_status="Degraded"} > 0
          for: 5m
          labels:
            severity: warning

        - alert: ArgoCDAppSyncFailed
          expr: increase(argocd_app_sync_total{phase="Error"}[10m]) > 0
          for: 1m
          labels:
            severity: critical

        - alert: ArgoCDReconciliationSlow
          expr: histogram_quantile(0.99, sum(rate(argocd_app_reconcile_bucket[5m])) by (le)) > 60
          for: 10m
          labels:
            severity: warning
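
The ArgoCDReconciliationSlow alert relies on histogram_quantile, which estimates a quantile from Prometheus' cumulative histogram buckets by locating the bucket the target rank falls into and interpolating linearly inside it. A simplified Python sketch of that estimation (Prometheus itself operates on rate()-d bucket values, and assumes observations are uniform within a bucket):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    # buckets: (upper_bound, cumulative_count) pairs, sorted ascending
    # and ending with the +Inf bucket, as *_bucket metrics expose them.
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            in_bucket = count - prev_count
            frac = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# 100 reconciles: 50 finished under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.99, buckets))  # p99 falls in the 0.5-1.0s bucket
```

A practical consequence: choose bucket boundaries close to your alert threshold (60s here), since the estimate can be off by up to a bucket's width.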

8. Disaster Recovery

Backup Strategy

# Full ArgoCD config backup
argocd admin export -n argocd > argocd-backup.yaml

# Individual resource backups
kubectl get applications -n argocd -o yaml > applications-backup.yaml
kubectl get appprojects -n argocd -o yaml > projects-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository -o yaml > repos-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster -o yaml > clusters-backup.yaml

Restore Procedure

# 1. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f install.yaml

# 2. Restore configs
kubectl apply -f argocd-cm-backup.yaml
kubectl apply -f argocd-rbac-cm-backup.yaml

# 3. Restore credentials
kubectl apply -f repos-backup.yaml
kubectl apply -f clusters-backup.yaml

# 4. Restore projects and applications
kubectl apply -f projects-backup.yaml
kubectl apply -f applications-backup.yaml

GitOps Self-Management

Managing ArgoCD itself via GitOps simplifies disaster recovery:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-self
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/argocd-config.git
    targetRevision: HEAD
    path: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true

DR Strategy Types

(RPO: how much state you can afford to lose; RTO: how quickly you must be running again.)

| Strategy | RPO | RTO | Cost |
| --- | --- | --- | --- |
| Periodic Backup | Hourly | 30 min - 1 hr | Low |
| GitOps Self-Management | 0 (Git is the source of truth) | 10 - 20 min | Medium |
| Active-Standby | 0 | Under 5 min | High |
| Active-Active | 0 | Immediate | Very High |

9. Large-Scale Operations Checklist

100+ Applications

  • Controller replicas: 2-3 (sharded)
  • Repo Server replicas: 2-3
  • Redis HA: Sentinel setup
  • API Server replicas: 2-3
  • Resource exclusions configured
  • Cache expiry optimized

500+ Applications

  • Controller replicas: 3-5 (sharded)
  • Repo Server replicas: 3-5
  • Redis HA: Sentinel + memory optimization
  • API Rate Limiting per cluster
  • Monitoring alerts configured
  • Automated periodic backups

1000+ Applications

  • Controller replicas: 5+ (sharded)
  • Repo Server replicas: 5+ (dedicated volumes)
  • External managed Redis
  • ApplicationSet for reduced management complexity
  • Consider multi-ArgoCD instances
  • Dedicated monitoring dashboards

10. Summary

Key elements of ArgoCD HA and scalability:

  1. Controller Sharding: StatefulSet-based distributed Application processing
  2. Repo Server Scaling: Horizontal scaling + cache optimization + Git clone optimization
  3. Redis HA: Sentinel configuration or managed Redis
  4. Performance Tuning: Resource exclusions, API Rate Limiting, Reconciliation interval adjustment
  5. Monitoring: Prometheus metrics + Grafana dashboards + alert rules
  6. Disaster Recovery: Periodic backups + GitOps self-management + DR strategy

Applying these elements progressively enables building an ArgoCD environment that reliably manages thousands of Applications.