Author: Youngju Kim (@fjvbn20031)
## Table of Contents

1. Challenges of Large-Scale ArgoCD Operations
2. HA Architecture
3. Application Controller Sharding
4. Repository Server Scaling
5. Redis HA
6. Performance Tuning
7. Monitoring
8. Disaster Recovery
9. Large-Scale Operations Checklist
10. Summary
## 1. Challenges of Large-Scale ArgoCD Operations

In large environments managing hundreds or thousands of Applications, ArgoCD's default configuration can reach its limits.

### Key Bottlenecks
| Component | Bottleneck | Symptoms |
|---|---|---|
| Application Controller | Single instance handles all Apps | Reconciliation delays, high memory |
| Repository Server | Git clone and manifest generation load | Slow syncs, high CPU |
| Redis | Growing cache data | Memory exhaustion, connection delays |
| API Server | Many concurrent users | UI lag, API timeouts |
## 2. HA Architecture

### Basic HA Setup

```
                 Load Balancer
                      |
          +-----------+-----------+
          |           |           |
     API Server  API Server  API Server
     (replica 1) (replica 2) (replica 3)
          |           |           |
          +-----------+-----------+
                      |
              Redis (Sentinel HA)
                      |
          +-----------+-----------+
          |           |           |
     Repo Server Repo Server Repo Server
     (replica 1) (replica 2) (replica 3)
                      |
           App Controller (sharded)
          Shard 0 | Shard 1 | Shard 2
```
### API Server HA

The API Server is stateless, so it scales horizontally by increasing replicas:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: argocd-server
                topologyKey: kubernetes.io/hostname
```
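When running multiple replicas, ArgoCD's HA guidance also recommends telling the API Server how many replicas exist (it uses this internally, e.g. for the failed-login rate limiter). A sketch of the extra env entry on the `argocd-server` container:

```yaml
# Set on the argocd-server container when replicas > 1
env:
  - name: ARGOCD_API_SERVER_REPLICAS
    value: '3'
```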
## 3. Application Controller Sharding

### Sharding Concept

By default, a single Application Controller instance processes all Applications. In large environments, sharding distributes the load across multiple Controller instances. Note that shards are assigned per managed cluster, so all Applications targeting the same cluster land on the same shard.
### Sharding Configuration

```yaml
# argocd-cmd-params-cm ConfigMap
data:
  controller.sharding.algorithm: round-robin
```
### StatefulSet-Based Sharding

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: '3'
```
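Besides the automatic algorithms, a cluster can be pinned to a specific shard via the optional `shard` field of its cluster Secret. A sketch (the Secret name and server URL are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
stringData:
  name: prod
  server: https://prod.example.com:6443
  shard: '1'   # pin this cluster (and its Applications) to shard 1
```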
### Sharding Algorithms

Round-robin (recommended): distributes clusters evenly across shards, assigning them in turn:

```
Cluster 1 -> Shard 0
Cluster 2 -> Shard 1
Cluster 3 -> Shard 2
```

Legacy (default): assigns each cluster by hashing its ID modulo the shard count, which can leave shards unevenly loaded:

```
Cluster A -> hash(A) % 3 = Shard 0
Cluster B -> hash(B) % 3 = Shard 1
Cluster C -> hash(C) % 3 = Shard 2
```
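The legacy hash-modulo assignment can be sketched in plain shell. Here `cksum` is only a stand-in for the controller's real hash function, and the cluster names are made up:

```shell
# Sketch of legacy-style hash-modulo shard assignment.
# cksum stands in for ArgoCD's actual hash function.
REPLICAS=3
for CLUSTER in prod-us prod-eu staging; do
  HASH=$(printf '%s' "$CLUSTER" | cksum | cut -d' ' -f1)
  echo "$CLUSTER -> shard $((HASH % REPLICAS))"
done
```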
### Leader Election

Only one Controller per shard is active at a time, coordinated through Kubernetes Lease resources:

```
Shard 0: Controller Pod A (Leader) + Pod D (Standby)
Shard 1: Controller Pod B (Leader) + Pod E (Standby)
Shard 2: Controller Pod C (Leader) + Pod F (Standby)
```
## 4. Repository Server Scaling

### Horizontal Scaling

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: repo-server
          resources:
            requests:
              cpu: '1'
              memory: '1Gi'
            limits:
              cpu: '2'
              memory: '2Gi'
```
### Cache Optimization

```yaml
env:
  - name: ARGOCD_REPO_SERVER_CACHE_EXPIRATION
    value: '24h'
  - name: ARGOCD_EXEC_TIMEOUT
    value: '180s'
```
### Git Clone Optimization

```yaml
env:
  - name: ARGOCD_GIT_SHALLOW_CLONE_DEPTH
    value: '1'
  - name: ARGOCD_GIT_REQUEST_TIMEOUT
    value: '60s'
  - name: ARGOCD_REPO_SERVER_PARALLELISM_LIMIT
    value: '10'
```
### Dedicated Volumes

```yaml
spec:
  template:
    spec:
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 10Gi
      containers:
        - name: repo-server
          volumeMounts:
            - name: tmp
              mountPath: /tmp
```
## 5. Redis HA

### Redis Sentinel Configuration

```yaml
# Helm values (argo-cd chart)
redis-ha:
  enabled: true
  haproxy:
    enabled: true
    replicas: 3
  redis:
    replicas: 3
  sentinel:
    enabled: true
    replicas: 3
```
### Redis Sentinel Architecture

```
        HAProxy (Load Balancer)
                  |
      +-----------+-----------+
      |           |           |
  Sentinel 1  Sentinel 2  Sentinel 3
      |           |           |
      +-----------+-----------+
                  |
      +-----------+-----------+
      |           |           |
 Redis Master  Redis Replica  Redis Replica
```
### Redis Memory Optimization

ArgoCD uses Redis purely as a cache, so persistence (`save`, `appendonly`) can safely be disabled:

```yaml
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    save ""
    appendonly no
```
### External Managed Redis

Use a managed Redis service (ElastiCache, Cloud Memorystore) instead of in-cluster Redis:

```yaml
data:
  redis.server: 'my-redis.xxxx.cache.amazonaws.com:6379'
  redis.tls: 'true'
```
## 6. Performance Tuning

### Application Controller Tuning

```yaml
env:
  - name: ARGOCD_RECONCILIATION_TIMEOUT
    value: '300s'
  - name: ARGOCD_APP_RESYNC_PERIOD
    value: '180s'
  - name: ARGOCD_SELF_HEAL_TIMEOUT_SECONDS
    value: '5'
  - name: ARGOCD_APP_STATE_CACHE_EXPIRATION
    value: '1h'
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50'
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100'
```
### Resource Exclusions

Exclude resources that do not need monitoring (Events, metrics, Leases) to reduce Controller load:

```yaml
# argocd-cm ConfigMap
data:
  resource.exclusions: |
    - apiGroups:
        - ""
      kinds:
        - "Event"
      clusters:
        - "*"
    - apiGroups:
        - "metrics.k8s.io"
      kinds:
        - "*"
      clusters:
        - "*"
    - apiGroups:
        - "coordination.k8s.io"
      kinds:
        - "Lease"
      clusters:
        - "*"
```
### API Rate Limiting

Bound the load ArgoCD's clients place on the Kubernetes API server:

```yaml
env:
  - name: ARGOCD_K8S_CLIENT_QPS
    value: '50'
  - name: ARGOCD_K8S_CLIENT_BURST
    value: '100'
```
## 7. Monitoring

### Prometheus Metrics

Application Controller metrics:
| Metric | Description |
|---|---|
| argocd_app_info | Application state information |
| argocd_app_sync_total | Total sync count |
| argocd_app_reconcile_count | Reconciliation count |
| argocd_app_reconcile_bucket | Reconciliation duration distribution |
API Server metrics:
| Metric | Description |
|---|---|
| argocd_api_server_request_total | Total API requests |
| argocd_api_server_request_duration_seconds | API request duration |
Repo Server metrics:
| Metric | Description |
|---|---|
| argocd_git_request_total | Total Git requests |
| argocd_git_request_duration_seconds | Git request duration |
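These metrics can be queried directly for dashboards. Two example PromQL sketches (the `phase` and `le` labels follow Prometheus counter/histogram conventions; adjust to your label set):

```promql
# Sync failures per Application over the last 10 minutes
sum by (name) (increase(argocd_app_sync_total{phase="Error"}[10m]))

# p99 reconciliation duration across the controller fleet
histogram_quantile(0.99, sum by (le) (rate(argocd_app_reconcile_bucket[5m])))
```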
### ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: argocd
  endpoints:
    - port: metrics
      interval: 30s
```
### Alert Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
spec:
  groups:
    - name: argocd
      rules:
        - alert: ArgoCDAppDegraded
          expr: argocd_app_info{health_status="Degraded"} > 0
          for: 5m
          labels:
            severity: warning
        - alert: ArgoCDAppSyncFailed
          expr: increase(argocd_app_sync_total{phase="Error"}[10m]) > 0
          for: 1m
          labels:
            severity: critical
        - alert: ArgoCDReconciliationSlow
          expr: histogram_quantile(0.99, sum by (le) (rate(argocd_app_reconcile_bucket[5m]))) > 60
          for: 10m
          labels:
            severity: warning
```
## 8. Disaster Recovery

### Backup Strategy

```shell
# Full ArgoCD config backup
argocd admin export > argocd-backup.yaml

# Individual resource backups
kubectl get applications -n argocd -o yaml > applications-backup.yaml
kubectl get appprojects -n argocd -o yaml > projects-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository -o yaml > repos-backup.yaml
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster -o yaml > clusters-backup.yaml
kubectl get cm argocd-cm -n argocd -o yaml > argocd-cm-backup.yaml
kubectl get cm argocd-rbac-cm -n argocd -o yaml > argocd-rbac-cm-backup.yaml
```
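Manual backups are easy to forget, so they are worth scheduling. A sketch of a CronJob running `argocd admin export` hourly; the image tag, ServiceAccount, and storage (an `emptyDir` here, which you would replace with a PVC or an object-store upload) are assumptions to adapt:

```yaml
# Hypothetical hourly backup CronJob; adapt image, RBAC, and storage.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argocd-backup
  namespace: argocd
spec:
  schedule: '0 * * * *'
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: argocd-server   # needs read access to ArgoCD resources
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: quay.io/argoproj/argocd:v2.11.0   # assumed tag
              command:
                - sh
                - -c
                - argocd admin export -n argocd > /backup/argocd-$(date +%F-%H%M).yaml
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}
```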
### Restore Procedure

```shell
# 1. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f install.yaml

# 2. Restore configs
kubectl apply -f argocd-cm-backup.yaml
kubectl apply -f argocd-rbac-cm-backup.yaml

# 3. Restore credentials
kubectl apply -f repos-backup.yaml
kubectl apply -f clusters-backup.yaml

# 4. Restore projects and applications
kubectl apply -f projects-backup.yaml
kubectl apply -f applications-backup.yaml
```
### GitOps Self-Management

Managing ArgoCD itself via GitOps simplifies disaster recovery:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-self
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/argocd-config.git
    targetRevision: HEAD
    path: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true
```
### DR Strategy Types
| Strategy | RPO | RTO | Cost |
|---|---|---|---|
| Periodic Backup | Hourly | 30min - 1hr | Low |
| GitOps Self-Management | 0 (Git is SSOT) | 10 - 20min | Medium |
| Active-Standby | 0 | Under 5min | High |
| Active-Active | 0 | Immediate | Very High |
## 9. Large-Scale Operations Checklist

### 100+ Applications
- Controller replicas: 2-3 (sharded)
- Repo Server replicas: 2-3
- Redis HA: Sentinel setup
- API Server replicas: 2-3
- Resource exclusions configured
- Cache expiry optimized
### 500+ Applications
- Controller replicas: 3-5 (sharded)
- Repo Server replicas: 3-5
- Redis HA: Sentinel + memory optimization
- API Rate Limiting per cluster
- Monitoring alerts configured
- Automated periodic backups
### 1000+ Applications
- Controller replicas: 5+ (sharded)
- Repo Server replicas: 5+ (dedicated volumes)
- External managed Redis
- ApplicationSet for reduced management complexity
- Consider multi-ArgoCD instances
- Dedicated monitoring dashboards
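As a quick rule of thumb, the tier can be picked from the current Application count. The thresholds below are taken from the checklist above; the commented `kubectl` line for obtaining the count is illustrative:

```shell
# Pick a scaling tier from the Application count.
# In a live cluster: APPS=$(kubectl get applications -n argocd --no-headers | wc -l)
APPS=750   # example value
if   [ "$APPS" -ge 1000 ]; then echo "tier: 1000+ (5+ shards, external Redis)"
elif [ "$APPS" -ge 500 ];  then echo "tier: 500+ (3-5 shards, Sentinel + tuning)"
elif [ "$APPS" -ge 100 ];  then echo "tier: 100+ (2-3 shards, Sentinel HA)"
else echo "tier: default configuration is usually sufficient"
fi
```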
## 10. Summary

Key elements of ArgoCD HA and scalability:
- Controller Sharding: StatefulSet-based distributed Application processing
- Repo Server Scaling: Horizontal scaling + cache optimization + Git clone optimization
- Redis HA: Sentinel configuration or managed Redis
- Performance Tuning: Resource exclusions, API Rate Limiting, Reconciliation interval adjustment
- Monitoring: Prometheus metrics + Grafana dashboards + alert rules
- Disaster Recovery: Periodic backups + GitOps self-management + DR strategy
Applying these elements progressively enables building an ArgoCD environment that reliably manages thousands of Applications.