- Introduction
- ArgoCD Architecture Overview
- ArgoCD Application Configuration
- GitOps Tool Comparison: ArgoCD vs FluxCD vs Jenkins X
- Progressive Delivery with Argo Rollouts
- Rollback Strategies in Production
- Multi-Cluster Management with ApplicationSets
- Production Troubleshooting Guide
- Notification and Alerting Integration
- Best Practices Checklist
- Conclusion
- References

Introduction
GitOps has become the de facto standard for Kubernetes continuous delivery. At its core, GitOps treats Git as the single source of truth for declarative infrastructure and application definitions. ArgoCD, originally created by Intuit and now a CNCF graduated project, has emerged as the dominant GitOps tool with approximately 60% market share and an NPS of 79 among adopters. Intuit itself runs ArgoCD across nearly 400 clusters managing thousands of applications in production.
However, simply adopting ArgoCD does not automatically guarantee safe deployments. In production environments, you need progressive delivery strategies that gradually expose new versions to traffic, automated rollback mechanisms that detect failures before they reach all users, and well-tested recovery procedures for when things go wrong. This guide covers the complete lifecycle: from ArgoCD architecture and sync policies to Argo Rollouts integration with canary and blue-green deployments, automated analysis, rollback strategies, and real-world troubleshooting.
ArgoCD Architecture Overview
Before diving into progressive delivery, it is essential to understand how ArgoCD operates internally.
┌─────────────────────────────────────────────────────────────────┐
│ ArgoCD Server │
│ ┌───────────┐ ┌──────────────┐ ┌───────────────────────┐ │
│ │ API │ │ Web UI │ │ Dex / OIDC │ │
│ │ Server │ │ Dashboard │ │ Authentication │ │
│ └─────┬─────┘ └──────┬───────┘ └───────────────────────┘ │
│ │ │ │
│ ┌─────▼───────────────▼─────────────────────────────────────┐ │
│ │ Application Controller │ │
│ │ - Watches Application CRDs │ │
│ │ - Compares desired state (Git) vs live state (cluster) │ │
│ │ - Executes sync operations │ │
│ └─────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────▼─────────────────────────────────────┐ │
│ │ Repo Server │ │
│ │ - Clones Git repositories │ │
│ │ - Renders Helm / Kustomize / Jsonnet templates │ │
│ │ - Caches manifests for performance │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Redis Cache │ │
│ │ - Application state cache │ │
│ │ - Git repository cache │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ Git Repository │ │ Kubernetes │
│ (Source of │ │ Cluster(s) │
│ Truth) │ │ (Target) │
└─────────────────┘ └──────────────────┘
Key components:
- API Server exposes the gRPC/REST API consumed by the CLI and Web UI
- Application Controller is the core reconciliation engine that continuously compares Git state against live cluster state
- Repo Server handles Git operations and manifest rendering (Helm, Kustomize, Jsonnet, plain YAML)
- Redis provides caching for application state and repository data
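These components are visible in a typical installation. A quick sketch for inspecting them, assuming ArgoCD was installed into the conventional argocd namespace:

```shell
# List all ArgoCD components by their standard part-of label
kubectl get pods -n argocd -l app.kubernetes.io/part-of=argocd

# The application controller runs as a StatefulSet in recent versions
kubectl get statefulset argocd-application-controller -n argocd
```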
ArgoCD Application Configuration
A basic ArgoCD Application manifest defines the source repository and target cluster.
```yaml
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-production
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: main
    path: apps/my-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app-prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
```
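The Application manifest is applied like any other Kubernetes resource, or created imperatively with the CLI. A sketch of both paths, reusing the names and repo values from the example above:

```shell
# Declarative: apply the Application CR into the argocd namespace
kubectl apply -f argocd-application.yaml

# Imperative equivalent; flags mirror the fields in the manifest
argocd app create my-app-production \
  --project production \
  --repo https://github.com/myorg/k8s-manifests.git \
  --revision main \
  --path apps/my-app/overlays/production \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace my-app-prod
```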
Critical sync policy options explained:
| Option | Description | Production Recommendation |
|---|---|---|
| automated.prune | Deletes resources removed from Git | Enable with caution |
| automated.selfHeal | Reverts manual cluster changes | Strongly recommended |
| allowEmpty | Allows sync with zero resources | Always set to false |
| PruneLast | Prunes after all other syncs | Enable for safety |
| retry.backoff | Exponential backoff on failure | Use 5s base, factor 2 |
Warning: Enabling automated.prune without PruneLast=true can cause cascading deletions if your repository structure changes. Always test sync policies in a staging environment first.
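Before trusting automated pruning, it is worth previewing what a sync would actually change. A minimal sketch using the app name from the example above:

```shell
# Show the full diff between Git and the live cluster state
argocd app diff my-app-production

# Dry-run a sync to see what would be created, updated, or pruned
argocd app sync my-app-production --dry-run --prune
```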
GitOps Tool Comparison: ArgoCD vs FluxCD vs Jenkins X
Before committing to a tool, teams should understand the trade-offs.
| Feature | ArgoCD | FluxCD | Jenkins X |
|---|---|---|---|
| Market Share (2026) | ~60% | ~30% | Less than 5% |
| CNCF Status | Graduated | Graduated | Sandbox (archived) |
| Web UI | Built-in, feature-rich | Third-party (Weave GitOps) | Basic |
| Multi-Cluster | Native support | Via Kustomization | Limited |
| Helm Support | Native | Native | Native |
| Progressive Delivery | Via Argo Rollouts | Via Flagger | Built-in (Tekton) |
| SSO/RBAC | Built-in (Dex, OIDC) | Kubernetes RBAC | Kubernetes RBAC |
| Resource Usage | Medium (~512MB) | Low (~128MB) | High (~2GB) |
| Learning Curve | Moderate | Moderate | Steep |
| Manifest Rendering | Helm, Kustomize, Jsonnet | Helm, Kustomize | Helm |
| Notification System | Argo Notifications | Flux Notification Controller | Webhooks |
| Application Sets | Yes (generators) | Yes (Kustomization) | No |
| Diff Preview | UI and CLI | CLI only | CLI only |
When to choose ArgoCD: You need a rich web UI, multi-cluster visibility from a single pane of glass, built-in SSO, or your team includes members with varying Kubernetes experience levels.
When to choose FluxCD: You prefer a lightweight, Kubernetes-native approach, operate primarily through CLI/automation, manage 500+ applications, or need minimal resource overhead.
When to choose Jenkins X: You need tightly integrated CI/CD with Tekton pipelines (note: adoption is declining and the project has limited community momentum).
Progressive Delivery with Argo Rollouts
Argo Rollouts is a Kubernetes controller that provides advanced deployment strategies including canary, blue-green, and experimentation capabilities. It replaces the standard Kubernetes Deployment with a Rollout CRD.
Canary Deployment Strategy
A canary deployment gradually shifts traffic to the new version while monitoring key metrics.
```yaml
# canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  namespace: my-app-prod
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: my-app
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: my-app-vsvc
              routes:
                - primary
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: error-rate-check
            args:
              - name: service-name
                value: my-app-canary
        - setWeight: 20
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: latency-check
            args:
              - name: service-name
                value: my-app-canary
        - setWeight: 50
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: comprehensive-check
            args:
              - name: service-name
                value: my-app-canary
        - setWeight: 80
        - pause: {duration: 10m}
        - setWeight: 100
  rollbackWindow:
    revisions: 3
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: myregistry/my-app:v2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
```
This rollout follows a carefully staged approach:
- Route 5% of traffic to canary, wait 5 minutes
- Run error rate analysis
- Increase to 20%, wait 5 minutes
- Run latency analysis
- Increase to 50%, wait 10 minutes
- Run comprehensive checks (combined error-rate and latency analysis with tighter thresholds)
- Increase to 80%, wait 10 minutes
- Promote to 100%
If any analysis step fails, the rollout automatically aborts and traffic reverts to the stable version.
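An operator can also intervene in a running canary. The following sketch assumes the kubectl-argo-rollouts plugin is installed:

```shell
# Skip the current pause step and continue to the next canary step
kubectl argo rollouts promote my-app -n my-app-prod

# Jump straight to 100%, skipping all remaining steps
kubectl argo rollouts promote my-app -n my-app-prod --full

# Abandon the canary and shift all traffic back to the stable version
kubectl argo rollouts abort my-app -n my-app-prod
```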
AnalysisTemplate for Automated Metric Checks
AnalysisTemplates define what metrics to evaluate and what thresholds trigger rollback.
```yaml
# analysis-templates.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.05
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  args:
    - name: service-name
  metrics:
    - name: p99-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 500
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}"
              }[5m])) by (le)
            ) * 1000
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: comprehensive-check
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 10
      successCondition: result[0] < 0.02
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
    - name: p99-latency
      interval: 60s
      count: 10
      successCondition: result[0] < 300
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}"
              }[5m])) by (le)
            ) * 1000
```
Warning: Ensure your Prometheus instance has sufficient retention and query performance. Slow Prometheus queries can cause AnalysisRun timeouts, which Argo Rollouts treats as failures, triggering an unintended rollback.
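Queries are easiest to debug outside the AnalysisTemplate. A sanity-check sketch against the Prometheus HTTP API, using the address and service label from the examples above:

```shell
# Run the error-rate query by hand and inspect the raw JSON result
curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{service="my-app-canary",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="my-app-canary"}[5m]))'
```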
Blue-Green Deployment Strategy
Blue-green deployments maintain two full environments and switch traffic atomically.
```yaml
# blue-green-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app-bluegreen
  namespace: my-app-prod
spec:
  replicas: 5
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      autoPromotionSeconds: 300  # only takes effect if autoPromotionEnabled is true
      scaleDownDelaySeconds: 600
      scaleDownDelayRevisionLimit: 2
      prePromotionAnalysis:
        templates:
          - templateName: comprehensive-check
        args:
          - name: service-name
            value: my-app-preview
      postPromotionAnalysis:
        templates:
          - templateName: error-rate-check
        args:
          - name: service-name
            value: my-app-active
      antiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 100
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: myregistry/my-app:v2.1.0
          ports:
            - containerPort: 8080
```
Blue-Green flow:
┌─────────────────────────────────────────────────────────────┐
│ Blue-Green Deployment Flow │
│ │
│ Step 1: Deploy Preview (Green) │
│ ┌──────────┐ ┌──────────┐ │
│ │ Active │ 100%│ Stable │ │
│ │ Service │────▶│ (Blue) │ │
│ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Preview │ 0% │ New │ │
│ │ Service │────▶│ (Green) │ Pre-Promotion Analysis │
│ └──────────┘ └──────────┘ │
│ │
│ Step 2: Promote (Switch Traffic) │
│ ┌──────────┐ ┌──────────┐ │
│ │ Active │ 100%│ New │ │
│ │ Service │────▶│ (Green) │ Post-Promotion Analysis │
│ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Preview │ │ Old │ Scaled down after delay │
│ │ Service │ │ (Blue) │ │
│ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────┘
Key settings:
- autoPromotionEnabled: false requires manual approval (recommended for production)
- scaleDownDelaySeconds: 600 keeps the old version running for 10 minutes after promotion, allowing quick rollback
- prePromotionAnalysis validates the preview before promoting
- postPromotionAnalysis confirms health after switching traffic
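With automatic promotion disabled, the switch from Blue to Green is an explicit operator action. A sketch, assuming the kubectl-argo-rollouts plugin:

```shell
# Promote the preview (Green) ReplicaSet to the active service
kubectl argo rollouts promote my-app-bluegreen -n my-app-prod

# If post-promotion analysis or user reports reveal problems, revert
kubectl argo rollouts undo my-app-bluegreen -n my-app-prod
```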
Rollback Strategies in Production
Automated Rollback via Analysis Failure
When an AnalysisRun detects metric violations, Argo Rollouts automatically aborts the rollout and reverts traffic to the stable version. No manual intervention is needed.
```shell
# Monitor rollout status
kubectl argo rollouts get rollout my-app -n my-app-prod --watch

# Check analysis run results (AnalysisRuns are ordinary CRs, so use kubectl get)
kubectl get analysisruns -n my-app-prod

# View detailed failure reasons for the most recent analysis run
kubectl describe -n my-app-prod \
  $(kubectl get analysisrun -n my-app-prod --sort-by=.metadata.creationTimestamp -o name | tail -1)
```
Manual Rollback Procedures
When automated systems are not sufficient, manual rollback may be required.
```shell
# Option 1: ArgoCD CLI rollback to previous sync
argocd app rollback my-app-production

# Option 2: Argo Rollouts abort and rollback
kubectl argo rollouts abort my-app -n my-app-prod
kubectl argo rollouts undo my-app -n my-app-prod

# Option 3: Rollback to a specific revision
kubectl argo rollouts undo my-app -n my-app-prod --to-revision=3

# Option 4: Git-based rollback (GitOps-native approach)
# Revert the commit in Git - ArgoCD will sync automatically
git revert HEAD
git push origin main

# Option 5: Emergency - force sync to a known-good commit
argocd app sync my-app-production --revision abc1234
```
Critical Warning: ArgoCD rollback cannot be performed on applications with automated sync enabled. If you need to perform a manual rollback, either disable auto-sync first or use a Git revert approach.
```shell
# Disable auto-sync before manual rollback
argocd app set my-app-production --sync-policy none

# Perform rollback
argocd app rollback my-app-production

# Re-enable auto-sync after confirming stability
argocd app set my-app-production --sync-policy automated \
  --self-heal --auto-prune
```
Recovery from Partial Rollback Failures
When a rollback gets stuck partway through:
```shell
# Step 1: Check application sync status
argocd app get my-app-production

# Step 2: Identify stuck resources
argocd app resources my-app-production --orphaned

# Step 3: Force sync to resolve conflicts
argocd app sync my-app-production --force --replace

# Step 4: If resources are in a bad state, terminate sync and retry
argocd app terminate-op my-app-production
argocd app sync my-app-production --prune
```
Warning: Using --force --replace will delete and recreate resources, causing brief downtime. Only use this as a last resort.
Multi-Cluster Management with ApplicationSets
For organizations managing multiple clusters, ApplicationSets automate application deployment across environments.
```yaml
# multi-cluster-appset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-multi-cluster
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
        values:
          revision: main
    - clusters:
        selector:
          matchLabels:
            env: staging
        values:
          revision: develop
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: env
              operator: In
              values:
                - staging
        - matchExpressions:
            - key: env
              operator: In
              values:
                - production
          maxUpdate: 25%
  template:
    metadata:
      name: 'my-app-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/k8s-manifests.git
        targetRevision: '{{values.revision}}'
        path: apps/my-app/overlays/{{metadata.labels.env}}
      destination:
        server: '{{server}}'
        namespace: my-app
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```
This ApplicationSet uses a RollingSync strategy that deploys to staging clusters first, then progressively rolls out to production clusters at 25% increments.
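The cluster generator selects targets by the labels on ArgoCD's cluster secrets, so each cluster must carry the env label. A sketch with a hypothetical cluster name:

```shell
# Register the cluster with ArgoCD (creates a cluster secret in the argocd namespace)
argocd cluster add prod-us-east

# Label the generated cluster secret so the generator matches it
# (substitute the actual secret name created for your cluster)
kubectl label secret cluster-prod-us-east -n argocd env=production
```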
Production Troubleshooting Guide
Common Failure Scenarios and Solutions
1. Sync stuck in "Progressing" state
```shell
# Check which resources are not healthy
argocd app get my-app-production --show-operation

# Common cause: Pod stuck in CrashLoopBackOff or ImagePullBackOff
kubectl get pods -n my-app-prod -o wide
kubectl describe pod <pod-name> -n my-app-prod
kubectl logs <pod-name> -n my-app-prod --previous

# Fix: Update image tag in Git or fix the container issue
# Then force refresh
argocd app get my-app-production --hard-refresh
```
2. Out-of-Sync but cannot determine diff
```shell
# Generate a detailed diff
argocd app diff my-app-production --local ./path/to/manifests

# Check for ignored differences that might be masking issues
argocd app get my-app-production -o yaml | grep -A 20 ignoreDifferences

# Common cause: Mutating webhooks or controllers modifying resources
# Fix: Add ignoreDifferences for fields modified by controllers
```
3. Repo Server OOM or slow manifest generation
```shell
# Check repo server logs
kubectl logs -n argocd -l app.kubernetes.io/component=repo-server --tail=100

# Common with large Helm charts or many applications
# Fix: Increase repo server resources
kubectl patch deployment argocd-repo-server -n argocd --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}]'
```
4. Application Controller high CPU usage
```shell
# Check controller metrics
kubectl top pods -n argocd -l app.kubernetes.io/component=application-controller

# Fix: Adjust reconciliation interval
# In argocd-cm ConfigMap:
# timeout.reconciliation: 180s (the default; increase for large deployments)

# Or shard the controller for large-scale deployments
# (the ARGOCD_CONTROLLER_REPLICAS env var must also be set to match
# the replica count for sharding to take effect)
kubectl patch statefulset argocd-application-controller -n argocd \
  --type=json \
  -p='[{"op":"replace","path":"/spec/replicas","value":3}]'
```
5. Webhook delivery failures causing delayed syncs
```shell
# Verify webhook configuration
argocd admin settings validate -n argocd

# Check ArgoCD server logs for webhook events
kubectl logs -n argocd -l app.kubernetes.io/component=server \
  --tail=50 | grep webhook

# Fallback: Force a manual refresh
argocd app get my-app-production --refresh
```
Health Check Monitoring
Set up proper health checks to catch issues before they impact users.
```yaml
# argocd-cm ConfigMap - custom health checks
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.argoproj.io_Rollout: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for _, condition in ipairs(obj.status.conditions) do
          if condition.type == "Paused" and condition.status == "True" then
            hs.status = "Suspended"
            hs.message = condition.message
            return hs
          end
          if condition.type == "InvalidSpec" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
        end
      end
      if obj.status.phase == "Healthy" then
        hs.status = "Healthy"
        hs.message = "Rollout is healthy"
      elseif obj.status.phase == "Degraded" then
        hs.status = "Degraded"
        hs.message = "Rollout is degraded"
      elseif obj.status.phase == "Progressing" then
        hs.status = "Progressing"
        hs.message = "Rollout is progressing"
      end
    end
    return hs
```
Notification and Alerting Integration
Configure Argo Notifications to alert your team on deployment events.
```yaml
# argocd-notifications-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
  trigger.on-health-degraded: |
    - when: app.status.health.status == 'Degraded'
      send: [app-health-degraded]
  template.app-sync-failed: |
    message: |
      Application {{.app.metadata.name}} sync has {{.app.status.operationState.phase}}.
      Revision: {{.app.status.operationState.syncResult.revision}}
      Details: {{.app.status.operationState.message}}
    slack:
      attachments: |
        [{
          "color": "#E96D76",
          "title": "{{.app.metadata.name}} Sync Failed",
          "fields": [
            {"title": "Application", "value": "{{.app.metadata.name}}", "short": true},
            {"title": "Cluster", "value": "{{.app.spec.destination.server}}", "short": true}
          ]
        }]
  template.app-health-degraded: |
    message: |
      Application {{.app.metadata.name}} health is Degraded.
    slack:
      attachments: |
        [{
          "color": "#F4C030",
          "title": "{{.app.metadata.name}} Health Degraded",
          "fields": [
            {"title": "Health Status", "value": "{{.app.status.health.status}}", "short": true}
          ]
        }]
```
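Triggers only fire for applications that subscribe to them. Subscription is expressed as annotations on the Application itself; a sketch with an illustrative channel name:

```yaml
# Subscribe an Application to the triggers defined above
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-production
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: deploy-alerts
    notifications.argoproj.io/subscribe.on-health-degraded.slack: deploy-alerts
```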
Best Practices Checklist
- Always use Git revert for rollbacks when auto-sync is enabled, rather than disabling auto-sync and performing a manual rollback. This maintains GitOps principles.
- Set resource limits on ArgoCD components. The repo server is especially prone to OOM when rendering large Helm charts.
- Use ApplicationSets with RollingSync for multi-cluster deployments. Deploy to staging first, validate, then roll to production incrementally.
- Configure AnalysisTemplates with sensible thresholds. Start with lenient thresholds and tighten them as you gain confidence in your metrics.
- Keep scaleDownDelaySeconds at 600 or higher for blue-green deployments. This gives you a window to manually abort a promotion if post-promotion analysis misses an issue.
- Never skip readiness and liveness probes in your Rollout templates. These are the first line of defense against deploying broken containers.
- Monitor ArgoCD itself. Set up alerts for controller CPU, repo server memory, and sync queue length. ArgoCD becoming unhealthy can mask application issues.
- Test your rollback procedures regularly. Run rollback drills in staging to ensure the team is familiar with the process before an incident occurs in production.
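A rollback drill can be scripted and timed against a staging application. A minimal sketch, assuming a staging app named my-app-staging, jq installed, and a deliberately broken Git tag created for the exercise:

```shell
#!/usr/bin/env bash
# Rollback drill: deploy a known-bad revision, then practice recovery
set -euo pipefail

APP=my-app-staging

# 1. Record the currently deployed healthy revision
GOOD_REV=$(argocd app get "$APP" -o json | jq -r '.status.sync.revision')

# 2. Sync to a deliberately broken revision (tag name is hypothetical)
argocd app sync "$APP" --revision drill-broken-build || true

# 3. Practice the recovery path and measure how long it takes
time argocd app sync "$APP" --revision "$GOOD_REV"
```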
Conclusion
ArgoCD combined with Argo Rollouts provides a production-grade GitOps platform for progressive delivery on Kubernetes. The key to success lies not just in the tooling but in the operational practices around it: well-defined canary steps with automated metric analysis, tested rollback procedures, proper notification and alerting, and regular drills to validate recovery processes.
The GitOps model, where Git serves as the single source of truth, provides an auditable, reproducible, and reversible deployment process. When paired with progressive delivery strategies, it enables teams to deploy with confidence, catch issues early through automated analysis, and recover quickly when problems occur.
References
- Argo CD Official Documentation
- Argo Rollouts Documentation
- OpenGitOps - GitOps Principles v1.0
- CNCF GitOps Working Group
- Codefresh - Why You Should Choose Argo CD for GitOps
- Intuit - CNCF Accepts Argo as Incubator Project
- CNCF End User Survey: Argo CD Adoption
- Red Hat - Blue-Green and Canary with Argo Rollouts