[ArgoCD] High Availability and Scalability: Large-Scale Cluster Operations

1. Challenges of Large-Scale ArgoCD Operations

In large environments managing hundreds or thousands of Applications, ArgoCD's default configuration can reach its limits.

Key Bottlenecks

| Component              | Bottleneck                             | Symptoms                             |
| ---------------------- | -------------------------------------- | ------------------------------------ |
| Application Controller | Single instance handles all Apps       | Reconciliation delays, high memory   |
| Repository Server      | Git clone and manifest generation load | Slow syncs, high CPU                 |
| Redis                  | Growing cache data                     | Memory exhaustion, connection delays |
| API Server             | Many concurrent users                  | UI lag, API timeouts                 |

2. HA Architecture

Basic HA Setup

                       Load Balancer
                             |
               +-------------+-------------+
               |             |             |
          API Server    API Server    API Server
          (replica 1)   (replica 2)   (replica 3)
               |             |             |
               +-------------+-------------+
                             |
                    Redis (Sentinel HA)
                             |
               +-------------+-------------+
               |             |             |
          Repo Server   Repo Server   Repo Server
          (replica 1)   (replica 2)   (replica 3)
                             |
                 App Controller (sharded)
                Shard 0 | Shard 1 | Shard 2

API Server HA

The API Server is stateless, so it scales horizontally by increasing replicas:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: argocd-server
      namespace: argocd
    spec:
      replicas: 3
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchLabels:
                        app.kubernetes.io/name: argocd-server
                    topologyKey: kubernetes.io/hostname

3. Application Controller Sharding

Sharding Concept

The Application Controller processes all Applications in a single instance by default. In large environments, sharding splits the managed clusters, and with them the Applications that target those clusters, across multiple Controller replicas.

Sharding Configuration

    # argocd-cmd-params-cm ConfigMap
    data:
      controller.sharding.algorithm: round-robin

StatefulSet-Based Sharding

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: argocd-application-controller
      namespace: argocd
    spec:
      replicas: 3
      template:
        spec:
          containers:
            - name: application-controller
              env:
                # Must match spec.replicas so every shard has an owner
                - name: ARGOCD_CONTROLLER_REPLICAS
                  value: '3'
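With a StatefulSet, upstream Argo CD lets each controller Pod infer its shard number from the ordinal at the end of its hostname, which is why `ARGOCD_CONTROLLER_REPLICAS` has to agree with `spec.replicas`. The following is a minimal Python sketch of that inference. The function name `infer_shard` and its error handling are illustrative assumptions, not the controller's actual code.

```python
def infer_shard(hostname: str, replicas: int) -> int:
    """Derive a shard number from a StatefulSet pod hostname.

    StatefulSet pods are named <statefulset-name>-<ordinal>, so the
    ordinal after the last dash can double as the shard number.
    """
    _, _, ordinal_text = hostname.rpartition("-")
    if not ordinal_text.isdigit():
        raise ValueError(f"hostname {hostname!r} has no numeric ordinal")
    shard = int(ordinal_text)
    if shard >= replicas:
        # An ordinal outside the configured replica count means the
        # env var and spec.replicas disagree (hypothetical strict check).
        raise ValueError(f"shard {shard} is out of range for {replicas} replicas")
    return shard


if __name__ == "__main__":
    for pod in ("argocd-application-controller-0", "argocd-application-controller-2"):
        print(pod, "->", infer_shard(pod, replicas=3))
```

The strict range check is a design choice for the sketch: failing loudly makes a mismatch between the two replica settings visible instead of leaving some clusters unprocessed.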

Sharding Algorithms

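Both algorithms assign whole clusters to shards, so every Application that targets a given cluster is handled by the same Controller replica.

**Round-Robin (recommended):**

    # Spreads clusters evenly across shards, in order
    Cluster 1 -> Shard 0
    Cluster 2 -> Shard 1
    Cluster 3 -> Shard 2
    Cluster 4 -> Shard 0

**Legacy (default):**

    # Hashes each cluster ID modulo the shard count; the spread can be uneven
    Cluster A -> hash(A) % 3 = Shard 0
    Cluster B -> hash(B) % 3 = Shard 1
    Cluster C -> hash(C) % 3 = Shard 1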
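The difference between the two strategies is easier to see side by side. The sketch below assumes, as in upstream Argo CD, that whole clusters are the unit assigned to shards. It uses a 32-bit FNV-1a hash as a stand-in for the legacy hash, and the function names and cluster IDs are hypothetical.

```python
from collections import Counter


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def legacy_shard(cluster_id: str, replicas: int) -> int:
    """Hash-modulo assignment: stable per cluster, but not guaranteed even."""
    return fnv1a_32(cluster_id.encode("utf-8")) % replicas


def round_robin_shards(cluster_ids, replicas):
    """Sort clusters, then deal them out to shards one at a time."""
    return {cid: i % replicas for i, cid in enumerate(sorted(cluster_ids))}


if __name__ == "__main__":
    clusters = [f"cluster-{i}" for i in range(9)]
    rr = round_robin_shards(clusters, 3)
    legacy = {cid: legacy_shard(cid, 3) for cid in clusters}
    print("round-robin load per shard:", dict(Counter(rr.values())))
    print("legacy load per shard:     ", dict(Counter(legacy.values())))
```

Round-robin always yields shard loads that differ by at most one cluster, while the hash-based spread depends on the IDs. Neither accounts for how many Applications each cluster carries.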

Leader Election

The StatefulSet-based Controller does not elect a leader within a shard. Each shard is owned by exactly one Pod, identified by its StatefulSet ordinal, and Kubernetes restarts that Pod if it fails:

    Shard 0: argocd-application-controller-0
    Shard 1: argocd-application-controller-1
    Shard 2: argocd-application-controller-2

4. Repository Server Scaling

Horizontal Scaling

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: argocd-repo-server
      namespace: argocd
    spec:
      replicas: 3
      template:
        spec:
          containers:
            - name: repo-server
              resources:
                requests:
                  cpu: '1'
                  memory: '1Gi'
                limits:
                  cpu: '2'
                  memory: '2Gi'

Cache Optimization

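    env:
      # How long generated manifests stay cached
      - name: ARGOCD_REPO_CACHE_EXPIRATION
        value: '24h'
      # Timeout for config management tools such as helm and kustomize
      - name: ARGOCD_EXEC_TIMEOUT
        value: '180s'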

Git Clone Optimization

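    env:
      # Verify that your Argo CD version supports this variable before relying on it
      - name: ARGOCD_GIT_SHALLOW_CLONE_DEPTH
        value: '1'
      - name: ARGOCD_GIT_REQUEST_TIMEOUT
        value: '60s'
      # Maximum number of concurrent manifest generation requests
      - name: ARGOCD_REPO_SERVER_PARALLELISM_LIMIT
        value: '10'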
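A parallelism limit trades throughput for predictable memory use: requests beyond the limit wait instead of running at once. The sketch below models that behavior with a semaphore. It is an illustration only, not the Repo Server's implementation, and `run_with_parallelism_limit` is a hypothetical name.

```python
import threading
import time


def run_with_parallelism_limit(jobs, limit):
    """Run callables on threads, letting at most `limit` execute at once.

    Returns the peak number of jobs observed running concurrently.
    """
    gate = threading.Semaphore(limit)
    lock = threading.Lock()
    running = 0
    peak = 0

    def worker(job):
        nonlocal running, peak
        with gate:  # blocks while `limit` jobs are already running
            with lock:
                running += 1
                peak = max(peak, running)
            try:
                job()
            finally:
                with lock:
                    running -= 1

    threads = [threading.Thread(target=worker, args=(job,)) for job in jobs]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak


if __name__ == "__main__":
    def fake_manifest_render():
        time.sleep(0.01)  # stand-in for a helm or kustomize run

    peak = run_with_parallelism_limit([fake_manifest_render] * 20, limit=10)
    print("peak concurrent renders:", peak)
```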

Dedicated Volumes

    spec:
      template:
        spec:
          volumes:
            - name: tmp
              emptyDir:
                sizeLimit: 10Gi
          containers:
            - name: repo-server
              volumeMounts:
                - name: tmp
                  mountPath: /tmp

5. Redis HA

Redis Sentinel Configuration

    redis-ha:
      enabled: true
      haproxy:
        enabled: true
        replicas: 3
      redis:
        replicas: 3
      sentinel:
        enabled: true
        replicas: 3

Redis Sentinel Architecture

                  HAProxy (Load Balancer)
                            |
              +-------------+-------------+
              |             |             |
         Sentinel 1    Sentinel 2    Sentinel 3
              |             |             |
              +-------------+-------------+
                            |
              +-------------+-------------+
              |             |             |
        Redis Master   Redis Slave   Redis Slave

Redis Memory Optimization

    data:
      redis.conf: |
        maxmemory 2gb
        maxmemory-policy allkeys-lru
        save ""
        appendonly no
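With `maxmemory-policy allkeys-lru`, Redis evicts the least recently used keys once the memory cap is reached. This suits ArgoCD because its Redis data is a cache that can be rebuilt. The sketch below shows exact LRU eviction with an entry-count cap standing in for `maxmemory`; Redis itself approximates LRU by sampling keys. The class name and keys are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Exact LRU cache capped by entry count (a stand-in for a byte budget)."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used key

    def keys(self):
        return list(self._data)


if __name__ == "__main__":
    cache = LRUCache(max_entries=2)
    cache.set("manifest:app-a", "...")
    cache.set("manifest:app-b", "...")
    cache.get("manifest:app-a")          # app-a is now the most recent
    cache.set("manifest:app-c", "...")   # evicts app-b
    print(cache.keys())
```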

External Managed Redis

Use managed Redis services (ElastiCache, Cloud Memorystore):

    data:
      redis.server: 'my-redis.xxxx.cache.amazonaws.com:6379'
      redis.tls: 'true'

6. Performance Tuning

Application Controller Tuning

    env:
      - name: ARGOCD_RECONCILIATION_TIMEOUT
        value: '300s'
      # Verify against your Argo CD version: the upstream manifests set the
      # resync period through ARGOCD_RECONCILIATION_TIMEOUT
      - name: ARGOCD_APP_RESYNC_PERIOD
        value: '180s'
      - name: ARGOCD_APPLICATION_CONTROLLER_SELF_HEAL_TIMEOUT_SECONDS
        value: '5'
      - name: ARGOCD_APP_STATE_CACHE_EXPIRATION
        value: '1h'
      - name: ARGOCD_K8S_CLIENT_QPS
        value: '50'
      - name: ARGOCD_K8S_CLIENT_BURST
        value: '100'

Resource Exclusions

Exclude unnecessary resources from monitoring to improve performance:

    # argocd-cm ConfigMap
    data:
      resource.exclusions: |
        - apiGroups:
            - ""
          kinds:
            - "Event"
          clusters:
            - "*"
        - apiGroups:
            - "metrics.k8s.io"
          kinds:
            - "*"
          clusters:
            - "*"
        - apiGroups:
            - "coordination.k8s.io"
          kinds:
            - "Lease"
          clusters:
            - "*"
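To reason about which resources a rule set drops, it can help to evaluate the rules by hand. The sketch below assumes glob-style matching on each field and requires all three fields of a rule to match. The helper names are hypothetical, and the rule list mirrors the configuration above.

```python
from fnmatch import fnmatchcase

# Same three rules as the resource.exclusions example
EXCLUSIONS = [
    {"apiGroups": [""], "kinds": ["Event"], "clusters": ["*"]},
    {"apiGroups": ["metrics.k8s.io"], "kinds": ["*"], "clusters": ["*"]},
    {"apiGroups": ["coordination.k8s.io"], "kinds": ["Lease"], "clusters": ["*"]},
]


def _matches_any(patterns, value):
    """True if the value matches at least one glob pattern."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def is_excluded(api_group, kind, cluster, rules=EXCLUSIONS):
    """A resource is excluded when some rule matches group, kind, and cluster."""
    return any(
        _matches_any(rule["apiGroups"], api_group)
        and _matches_any(rule["kinds"], kind)
        and _matches_any(rule["clusters"], cluster)
        for rule in rules
    )


if __name__ == "__main__":
    server = "https://kubernetes.default.svc"
    for group, kind in [("", "Event"), ("metrics.k8s.io", "PodMetrics"), ("apps", "Deployment")]:
        print(f"{group or 'core'}/{kind}: excluded={is_excluded(group, kind, server)}")
```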

API Rate Limiting

Control load on the Kubernetes API server:

    env:
      - name: ARGOCD_K8S_CLIENT_QPS
        value: '50'
      - name: ARGOCD_K8S_CLIENT_BURST
        value: '100'
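QPS and burst are the two parameters of a token-bucket limiter, the model used by the Kubernetes Go client: burst is the bucket size and QPS is the refill rate. The sketch below is a simplified Python model with an explicit clock so its behavior is deterministic. The class is illustrative and not part of any ArgoCD or Kubernetes API.

```python
class TokenBucket:
    """Token-bucket rate limiter driven by caller-supplied timestamps."""

    def __init__(self, qps: float, burst: int, now: float = 0.0):
        self.qps = float(qps)        # tokens added per second
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.tokens = float(burst)   # start full, so an initial burst is allowed
        self.last = now

    def try_acquire(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(qps=50, burst=100)
    at_start = sum(bucket.try_acquire(now=0.0) for _ in range(150))
    one_second_later = sum(bucket.try_acquire(now=1.0) for _ in range(80))
    print("allowed immediately:", at_start)
    print("allowed after 1s:   ", one_second_later)
```

With these values a Controller can issue up to 100 requests at once, then settles at a sustained 50 requests per second.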

7. Monitoring

Prometheus Metrics

**Application Controller metrics:**

| Metric                      | Description                          |
| --------------------------- | ------------------------------------ |
| argocd_app_info             | Application state information        |
| argocd_app_sync_total       | Total sync count                     |
| argocd_app_reconcile_count  | Reconciliation count                 |
| argocd_app_reconcile_bucket | Reconciliation duration distribution |

**API Server metrics:**

| Metric                                     | Description          |
| ------------------------------------------ | -------------------- |
| argocd_api_server_request_total            | Total API requests   |
| argocd_api_server_request_duration_seconds | API request duration |

**Repo Server metrics:**

| Metric                              | Description          |
| ----------------------------------- | -------------------- |
| argocd_git_request_total            | Total Git requests   |
| argocd_git_request_duration_seconds | Git request duration |

ServiceMonitor

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: argocd-metrics
      namespace: argocd
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/part-of: argocd
      endpoints:
        - port: metrics
          interval: 30s

Alert Rules

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: argocd-alerts
    spec:
      groups:
        - name: argocd
          rules:
            - alert: ArgoCDAppDegraded
              expr: argocd_app_info{health_status="Degraded"} > 0
              for: 5m
              labels:
                severity: warning
            - alert: ArgoCDAppSyncFailed
              expr: increase(argocd_app_sync_total{phase="Error"}[10m]) > 0
              for: 1m
              labels:
                severity: critical
            - alert: ArgoCDReconciliationSlow
              # Take the quantile over the recent per-bucket rate, aggregated by le
              expr: histogram_quantile(0.99, sum(rate(argocd_app_reconcile_bucket[5m])) by (le)) > 60
              for: 10m
              labels:
                severity: warning
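`histogram_quantile` works on cumulative bucket counts keyed by the `le` label, normally taken from `rate(...)` over a window so the result reflects recent reconciliations and not the whole process lifetime. The sketch below reproduces the linear interpolation Prometheus applies to classic histograms. The bucket boundaries and counts are made-up illustration data, not real ArgoCD output.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket. Values are assumed to be spread
    uniformly inside each bucket, and the lowest bucket starts at 0.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


if __name__ == "__main__":
    # Hypothetical reconcile durations in seconds: (le, cumulative count)
    sample = [(0.25, 10), (0.5, 40), (1, 80), (2, 95), (4, 100), (math.inf, 100)]
    print("p50:", histogram_quantile(0.5, sample))
    print("p99:", histogram_quantile(0.99, sample))
```

Because the estimate is interpolated inside a bucket, its precision is bounded by the bucket widths, so an alert threshold is most reliable when it sits near a bucket boundary.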

8. Disaster Recovery

Backup Strategy

    # Full ArgoCD config backup
    argocd admin export -n argocd > argocd-backup.yaml

    # Individual resource backups
    kubectl get applications -n argocd -o yaml > applications-backup.yaml
    kubectl get appprojects -n argocd -o yaml > projects-backup.yaml
    kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository -o yaml > repos-backup.yaml
    kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster -o yaml > clusters-backup.yaml

    # ConfigMaps used by the restore procedure
    kubectl get configmap argocd-cm -n argocd -o yaml > argocd-cm-backup.yaml
    kubectl get configmap argocd-rbac-cm -n argocd -o yaml > argocd-rbac-cm-backup.yaml

Restore Procedure

    # 1. Install ArgoCD
    kubectl create namespace argocd
    kubectl apply -n argocd -f install.yaml

    # 2. Restore configs
    kubectl apply -f argocd-cm-backup.yaml
    kubectl apply -f argocd-rbac-cm-backup.yaml

    # 3. Restore credentials
    kubectl apply -f repos-backup.yaml
    kubectl apply -f clusters-backup.yaml

    # 4. Restore projects and applications
    kubectl apply -f projects-backup.yaml
    kubectl apply -f applications-backup.yaml

GitOps Self-Management

Managing ArgoCD itself via GitOps simplifies disaster recovery:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: argocd-self
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/org/argocd-config.git
        targetRevision: HEAD
        path: argocd
      destination:
        server: https://kubernetes.default.svc
        namespace: argocd
      syncPolicy:
        automated:
          selfHeal: true

DR Strategy Types

| Strategy               | RPO             | RTO         | Cost      |
| ---------------------- | --------------- | ----------- | --------- |
| Periodic Backup        | Hourly          | 30min - 1hr | Low       |
| GitOps Self-Management | 0 (Git is SSOT) | 10 - 20min  | Medium    |
| Active-Standby         | 0               | Under 5min  | High      |
| Active-Active          | 0               | Immediate   | Very High |

9. Large-Scale Operations Checklist

100+ Applications

- Controller replicas: 2-3 (sharded)
- Repo Server replicas: 2-3
- Redis HA: Sentinel setup
- API Server replicas: 2-3
- Resource exclusions configured
- Cache expiry optimized

500+ Applications

- Controller replicas: 3-5 (sharded)
- Repo Server replicas: 3-5
- Redis HA: Sentinel + memory optimization
- API Rate Limiting per cluster
- Monitoring alerts configured
- Automated periodic backups

1000+ Applications

- Controller replicas: 5+ (sharded)
- Repo Server replicas: 5+ (dedicated volumes)
- External managed Redis
- ApplicationSet for reduced management complexity
- Consider multi-ArgoCD instances
- Dedicated monitoring dashboards
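The three tiers above can be encoded as a small lookup, for example to flag an under-provisioned installation in a capacity review script. This is only a sketch: the thresholds and replica ranges are copied from the checklist, while the `Sizing` type and `recommend` function are hypothetical names.

```python
from typing import NamedTuple, Optional


class Sizing(NamedTuple):
    controller_replicas: str
    repo_server_replicas: str
    redis: str


# (minimum Application count, recommendation), largest tier first
TIERS = [
    (1000, Sizing("5+", "5+", "external managed Redis")),
    (500, Sizing("3-5", "3-5", "Sentinel HA with memory optimization")),
    (100, Sizing("2-3", "2-3", "Sentinel HA")),
]


def recommend(app_count: int) -> Optional[Sizing]:
    """Return the checklist tier for an Application count, or None below 100."""
    for threshold, sizing in TIERS:
        if app_count >= threshold:
            return sizing
    return None


if __name__ == "__main__":
    for count in (50, 120, 750, 2400):
        print(count, "->", recommend(count))
```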

10. Summary

Key elements of ArgoCD HA and scalability:

1. **Controller Sharding**: StatefulSet-based distributed Application processing
2. **Repo Server Scaling**: Horizontal scaling + cache optimization + Git clone optimization
3. **Redis HA**: Sentinel configuration or managed Redis
4. **Performance Tuning**: Resource exclusions, API Rate Limiting, Reconciliation interval adjustment
5. **Monitoring**: Prometheus metrics + Grafana dashboards + alert rules
6. **Disaster Recovery**: Periodic backups + GitOps self-management + DR strategy

Applying these elements progressively enables building an ArgoCD environment that reliably manages thousands of Applications.
