Advanced CI/CD Pipeline Guide — GitHub Actions, ArgoCD, Tekton, and Security Pipelines
Introduction
Have you ever moved past understanding CI/CD as simply "code gets deployed automatically when pushed" and considered what a production-grade pipeline actually requires? In real production environments, dozens of stages must work together seamlessly: security scanning, static analysis, container image signing, multi-cluster deployment, and automated rollback.
This guide covers the full landscape, from CI/CD maturity models through advanced GitHub Actions patterns, ArgoCD GitOps, Tekton cloud-native pipelines, DevSecOps security pipelines, and advanced deployment strategies.
1. CI/CD Maturity Model
A five-level maturity model for objectively assessing an organization's CI/CD capabilities. Each level builds on the capabilities of the previous one.
Level 1 - Manual
Developers perform builds and deployments by hand: building locally and deploying directly to servers via FTP or SCP. "It works on my machine" is a daily occurrence, and deployment cycles are often monthly.
Characteristics: Documented procedures may exist, but there is almost no automation. Human error during deployment is frequent, and rollback means manually redeploying the previous version.
Level 2 - Basic Automation
A CI server is introduced, and builds and tests run automatically on every push. Tools like Jenkins, GitHub Actions, or GitLab CI are in place, but the pipeline is a simple linear build-test-deploy sequence.
# Level 2: Basic pipeline example
name: Basic CI/CD
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm test
      - run: npm run build
      - run: ./deploy.sh
Level 3 - Standardized
Pipelines are standardized across the organization, with reusable pipeline templates. Test coverage gates, code quality checks, and security scans are included by default. Deployment pipelines are separated per environment (dev, staging, prod).
Level 4 - Advanced Automation
GitOps-based declarative deployment, canary/blue-green deployment strategies, automated security pipelines (SAST, DAST, container scanning), and SBOM generation with artifact signing are integrated into the pipeline. Deployment cycles shrink to daily or faster.
Level 5 - Self-Healing
Metric-based automatic rollback, automatic recovery on SLO violations, ML-based anomaly detection, and production feedback flowing automatically back into the pipeline. Deployment is so fully automated that developers no longer think about it.
| Level | Deploy Cycle | Rollback | Security | Testing |
|---|---|---|---|---|
| L1 Manual | Monthly | Manual redeploy | None | Manual |
| L2 Basic | Weekly | Manual trigger | None | Automated unit |
| L3 Standard | Daily | One-click | Basic scans | Unit + integration |
| L4 Advanced | Hourly | Automated canary | Full pipeline | Unit + integration + E2E |
| L5 Healing | Continuous | SLO-based automatic | Continuous validation | Includes chaos |
2. Advanced GitHub Actions Patterns
Patterns that go beyond the GitHub Actions basics.
2-1. Reusable Workflows
To share the same pipeline logic across multiple repositories, use reusable workflows. The caller references the external workflow with the uses keyword.
# .github/workflows/reusable-build.yml (shared repository)
name: Reusable Build
on:
  workflow_call:
    inputs:
      node-version:
        required: false
        type: string
        default: '20'
      registry-url:
        required: true
        type: string
    secrets:
      npm-token:
        required: true
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          registry-url: ${{ inputs.registry-url }}
      - run: npm ci
        env:
          NODE_AUTH_TOKEN: ${{ secrets.npm-token }}
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: dist/

# Calling workflow
name: App CI
on: [push]
jobs:
  call-build:
    uses: my-org/shared-workflows/.github/workflows/reusable-build.yml@v2
    with:
      node-version: '20'
      registry-url: 'https://npm.pkg.github.com'
    secrets:
      npm-token: ${{ secrets.NPM_TOKEN }}
2-2. Composite Actions
A pattern that bundles multiple steps into a single action for reuse. Unlike a reusable workflow, a composite action is inserted as steps within a single job.
# .github/actions/setup-and-test/action.yml
name: 'Setup and Test'
description: 'Set up Node.js environment and run tests'
inputs:
  node-version:
    description: 'Node.js version'
    required: false
    default: '20'
runs:
  using: 'composite'
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}
        cache: 'npm'
    - run: npm ci
      shell: bash
    - run: npm test -- --coverage
      shell: bash
    - uses: actions/upload-artifact@v4
      with:
        name: coverage
        path: coverage/
2-3. Matrix Build Strategy
An advanced matrix configuration that tests multiple environment combinations in parallel while excluding unnecessary ones.
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        node: [18, 20, 22]
        exclude:
          - os: macos-latest
            node: 18
        include:
          - os: ubuntu-latest
            node: 22
            experimental: true
    runs-on: ${{ matrix.os }}
    continue-on-error: ${{ matrix.experimental || false }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test
2-4. Cloud Authentication with OIDC
Instead of long-lived secrets (access keys), OIDC (OpenID Connect) tokens are used to authenticate to the cloud securely. AWS, GCP, and Azure all support this.
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-role
          aws-region: ap-northeast-2
      - run: aws s3 sync ./dist s3://my-bucket
      - run: aws cloudfront create-invalidation --distribution-id E1234 --paths "/*"
The key advantage of OIDC is that there are no secrets to manage. GitHub issues a short-lived token, which AWS STS validates before issuing temporary credentials. The token is valid only for the duration of the workflow run, so even a leaked token is hard to abuse.
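On the AWS side, this setup relies on an IAM role whose trust policy accepts GitHub's OIDC provider. A minimal sketch follows; the account ID matches the example above, and the repo filter `repo:my-org/my-app:ref:refs/heads/main` is a hypothetical placeholder you would tighten to your own org, repo, and branch:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:my-org/my-app:ref:refs/heads/main"
        }
      }
    }
  ]
}
```

The `sub` condition is what scopes the role to a specific repository and branch; without it, any GitHub workflow could assume the role.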
3. GitOps and ArgoCD
3-1. GitOps Principles
GitOps is an operational model that uses Git as the single source of truth for managing the declarative state of infrastructure and applications. It rests on four core principles.
Declarative configuration: Describe the desired state of the system declaratively. Define "what" you want, not "how" to achieve it.
Git as the source of truth: All changes go through Git, and the Git history is the audit log.
Automatic application: When the declarative state in Git changes, it is applied to the system automatically.
Continuous reconciliation: An agent continuously compares actual state against desired state and corrects any drift.
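The continuous reconciliation principle can be sketched in a few lines: compute the drift between the desired state (from Git) and the actual cluster state. This is a toy model with plain dicts standing in for Kubernetes objects, not ArgoCD's actual implementation:

```python
# Toy sketch of a GitOps reconciliation step: compare the desired state
# (from Git) with the actual cluster state and compute the drift to correct.
# Plain dicts stand in for real Kubernetes API objects.

def diff_state(desired: dict, actual: dict) -> dict:
    """Return the changes needed to make `actual` match `desired`."""
    changes = {}
    for key, want in desired.items():
        if actual.get(key) != want:
            changes[key] = {"from": actual.get(key), "to": want}
    for key in actual.keys() - desired.keys():
        # Resources present in the cluster but absent from Git get pruned.
        changes[key] = {"from": actual[key], "to": None}
    return changes

desired = {"replicas": 5, "image": "my-app:v2.0.0"}
actual = {"replicas": 3, "image": "my-app:v2.0.0", "debug-pod": "manual"}
drift = diff_state(desired, actual)
```

Here a manually scaled-down replica count and a hand-created debug Pod both show up as drift; an agent with selfHeal enabled would apply exactly these corrections.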
3-2. ArgoCD Architecture
ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes.
# ArgoCD Application resource
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests.git
    targetRevision: main
    path: apps/my-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
With syncPolicy.automated configured, Git changes are applied to the cluster automatically. selfHeal: true means that even if someone modifies cluster state by hand, it is automatically restored to the state in Git.
3-3. App of Apps Pattern
Large environments manage dozens to hundreds of Applications. In the App of Apps pattern, a single root Application manages all the others.
# root-app.yaml - Root Application that manages the others
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/argocd-apps.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd

# apps/frontend.yaml - Child Application managed by the root
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests.git
    path: apps/frontend/overlays/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: frontend
3-4. Multi-Cluster Deployment with ApplicationSet
ApplicationSet is an ArgoCD feature that generates multiple Applications from a single template, driven by cluster lists, Git directories, PR events, and more.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-set
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: 'my-app-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/k8s-manifests.git
        targetRevision: main
        path: 'apps/my-app/overlays/{{metadata.labels.region}}'
      destination:
        server: '{{server}}'
        namespace: my-app
4. Tekton Pipelines
4-1. What is Tekton
Tekton is a Kubernetes-native CI/CD framework. Unlike GitHub Actions or Jenkins, every pipeline component is defined as a Kubernetes custom resource (CRD). Each task runs as its own Pod, giving full isolation and scalability.
Core components:
- Task: An execution unit composed of one or more Steps. Maps to a single Pod.
- Pipeline: A workflow connecting multiple Tasks into a DAG (directed acyclic graph).
- TaskRun / PipelineRun: Execution instances of a Task or Pipeline, trackable as Kubernetes resources.
- Workspace: A volume for sharing data between Tasks.
4-2. Task Definition
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-and-push
spec:
  params:
    - name: image-url
      type: string
    - name: image-tag
      type: string
      default: latest
  workspaces:
    - name: source
  steps:
    - name: build
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --dockerfile=Dockerfile
        - --context=$(workspaces.source.path)
        - --destination=$(params.image-url):$(params.image-tag)
        - --cache=true
        - --cache-ttl=24h
4-3. Pipeline Definition
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  params:
    - name: repo-url
      type: string
    - name: image-url
      type: string
  workspaces:
    - name: shared-workspace
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.repo-url)
    - name: run-tests
      taskRef:
        name: npm-test
      runAfter:
        - fetch-source
      workspaces:
        - name: source
          workspace: shared-workspace
    - name: build-image
      taskRef:
        name: build-and-push
      runAfter:
        - run-tests
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: image-url
          value: $(params.image-url)
    - name: security-scan
      taskRef:
        name: trivy-scan
      runAfter:
        - build-image
      params:
        - name: image-url
          value: $(params.image-url)
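The runAfter edges above form a DAG, and the engine runs tasks in dependency order. A sketch of that ordering logic (Kahn's topological sort) in Python, reusing the same task names; this is an illustration of the concept, not Tekton's actual scheduler:

```python
from collections import deque

# Tekton-style runAfter edges: each task lists the tasks it depends on.
run_after = {
    "fetch-source": [],
    "run-tests": ["fetch-source"],
    "build-image": ["run-tests"],
    "security-scan": ["build-image"],
}

def execution_order(run_after: dict) -> list:
    """Return a valid execution order via Kahn's topological sort."""
    indegree = {t: len(deps) for t, deps in run_after.items()}
    dependents = {t: [] for t in run_after}
    for task, deps in run_after.items():
        for dep in deps:
            dependents[dep].append(task)
    ready = deque(sorted(t for t, n in indegree.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in dependents[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(run_after):
        raise ValueError("runAfter edges contain a cycle")
    return order
```

In this linear pipeline the order is fixed, but with a branching DAG (e.g. lint and test both running after fetch-source) any task whose dependencies are satisfied becomes ready, which is what lets independent tasks run as Pods in parallel.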
4-4. Executing with PipelineRun
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: ci-pipeline-run-
spec:
  pipelineRef:
    name: ci-pipeline
  params:
    - name: repo-url
      value: https://github.com/my-org/my-app.git
    - name: image-url
      value: ghcr.io/my-org/my-app
  workspaces:
    - name: shared-workspace
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
5. Security Pipeline (DevSecOps)
5-1. Security Pipeline Components
A production-grade security pipeline performs the following stages automatically.
| Stage | Tools | Purpose |
|---|---|---|
| SAST (static analysis) | SonarQube, Semgrep, CodeQL | Detect vulnerabilities in source code |
| SCA (dependency analysis) | Snyk, Dependabot, OWASP DC | Detect vulnerabilities in open-source dependencies |
| Secret scanning | GitLeaks, TruffleHog | Detect secrets committed to the code |
| Container scanning | Trivy, Grype | Detect container image vulnerabilities |
| DAST (dynamic analysis) | OWASP ZAP, Nuclei | Detect vulnerabilities in the running application |
| SBOM generation | Syft, CycloneDX | Produce an inventory of software components |
| Artifact signing | Cosign, Notation | Guarantee build artifact integrity |
5-2. SAST - SonarQube Integration
# SonarQube analysis in GitHub Actions
name: Security Pipeline
on: [push, pull_request]
jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: SonarSource/sonarqube-scan-action@v3
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
        with:
          args: >
            -Dsonar.projectKey=my-project
            -Dsonar.sources=src/
            -Dsonar.tests=tests/
            -Dsonar.coverage.exclusions=**/*.test.ts
      - uses: SonarSource/sonarqube-quality-gate-check@v1
        timeout-minutes: 5
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
5-3. Container Image Scanning - Trivy
container-scan:
  runs-on: ubuntu-latest
  needs: [build]
  steps:
    - uses: aquasecurity/trivy-action@master
      with:
        image-ref: 'ghcr.io/my-org/my-app:latest'
        format: 'sarif'
        output: 'trivy-results.sarif'
        severity: 'CRITICAL,HIGH'
        exit-code: '1'
    - uses: github/codeql-action/upload-sarif@v3
      if: always()
      with:
        sarif_file: 'trivy-results.sarif'
Trivy scans both OS packages and language-specific dependencies. With exit-code: '1', the pipeline fails whenever a CRITICAL or HIGH vulnerability is found.
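That severity-gate behavior comes down to a few lines of logic: fail the build when any finding at or above the configured severities exists. A Python sketch; the findings list is synthetic example data, not real Trivy output:

```python
# Sketch of a severity gate like Trivy's exit-code option: return a non-zero
# exit code when any finding matches the blocking severities.

FAIL_ON = {"CRITICAL", "HIGH"}

def gate(findings: list, fail_on: set = FAIL_ON) -> int:
    """Return a process exit code: 1 if any blocking finding, else 0."""
    blocking = [f for f in findings if f["severity"] in fail_on]
    return 1 if blocking else 0

# Synthetic findings for illustration.
findings = [
    {"id": "CVE-2024-0001", "severity": "LOW"},
    {"id": "CVE-2024-0002", "severity": "HIGH"},
]
```

Because the CI runner treats a non-zero exit code as a failed step, this is all it takes to turn scan results into a hard pipeline gate.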
5-4. SBOM Generation and Cosign Signing
An SBOM (Software Bill of Materials) is an inventory of every component included in a piece of software. Since US Executive Order 14028, it has become an essential element of supply chain security.
sbom-and-sign:
  runs-on: ubuntu-latest
  needs: [container-scan]
  permissions:
    id-token: write
    packages: write
  steps:
    - name: Generate SBOM
      uses: anchore/sbom-action@v0
      with:
        image: ghcr.io/my-org/my-app:latest
        format: spdx-json
        output-file: sbom.spdx.json
    - name: Install Cosign
      uses: sigstore/cosign-installer@v3
    - name: Sign Container Image
      run: |
        cosign sign --yes \
          ghcr.io/my-org/my-app:latest
    - name: Attach SBOM to Image
      run: |
        cosign attach sbom \
          --sbom sbom.spdx.json \
          ghcr.io/my-org/my-app:latest
Cosign, part of the Sigstore project, supports keyless signing: using an OIDC token, images can be signed without managing a separate signing key.
5-5. Secret Scanning - GitLeaks
secret-scan:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - uses: gitleaks/gitleaks-action@v2
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
6. Test Automation Strategy
6-1. The Test Pyramid
In production CI/CD, the mix and proportion of test types matters.
Unit tests (70%): The fastest, and there should be the most of them. They test individual functions, methods, and components in isolation.
Integration tests (20%): Verify that multiple modules work together, testing interactions with external dependencies such as databases, APIs, and message queues.
E2E tests (10%): Verify user scenarios end to end. They are the slowest and most brittle, so cover only the critical flows.
6-2. Parallel Testing and Test Sharding
Strategies for cutting the execution time of large test suites.
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run Tests (Shard ${{ matrix.shard }}/4)
        run: |
          npx jest --shard=${{ matrix.shard }}/4 \
            --ci --coverage --forceExit
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/
  merge-coverage:
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: coverage-*
          merge-multiple: true
      - name: Merge Coverage Reports
        run: npx istanbul-merge --out merged-coverage.json coverage-*/coverage-final.json
6-3. Playwright E2E Tests
e2e:
  runs-on: ubuntu-latest
  needs: [deploy-staging]
  steps:
    - uses: actions/checkout@v4
    - run: npm ci
    - name: Install Playwright Browsers
      run: npx playwright install --with-deps
    - name: Run E2E Tests
      run: npx playwright test --reporter=html
      env:
        BASE_URL: https://staging.my-app.com
    - uses: actions/upload-artifact@v4
      if: failure()
      with:
        name: playwright-report
        path: playwright-report/
7. Advanced Deployment Strategies
7-1. Canary Deployments with Argo Rollouts
Argo Rollouts is a controller that implements advanced deployment strategies on Kubernetes. It provides a Rollout resource that replaces the standard Deployment.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: my-app-vsvc
              routes:
                - primary
      steps:
        - setWeight: 5
        - pause:
            duration: 5m
        - setWeight: 20
        - pause:
            duration: 5m
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 80
        - pause:
            duration: 5m
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: my-app-canary
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0
          ports:
            - containerPort: 8080
7-2. AnalysisTemplate - Metric-Driven Decisions
During a canary deployment, Prometheus metrics are queried to decide automatically whether to promote or abort (roll back).
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
Once the number of measurements falling below the 95% success-rate threshold exceeds the failureLimit of 3, the rollout is automatically aborted and rolled back. This is the core of a Level 5 self-healing pipeline.
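The decision logic can be sketched as follows. This is a simplified model of the analysis semantics, in which measurements below the threshold count as failures and exceeding the failure limit aborts the rollout; it is not Argo Rollouts' actual code:

```python
# Simplified sketch of metric-driven canary analysis: each measurement is a
# success rate sampled at the analysis interval; the rollout is aborted once
# failed measurements exceed the failure limit.

def analyze(measurements: list, success_threshold: float = 0.95,
            failure_limit: int = 3) -> str:
    """Return 'rollback' once failures exceed failure_limit, else 'promote'."""
    failed = 0
    for value in measurements:
        if value < success_threshold:
            failed += 1
            if failed > failure_limit:
                return "rollback"
    return "promote"
```

With a failure limit of 0 a single bad measurement would abort immediately; raising the limit trades rollback speed for tolerance of transient metric noise.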
7-3. Blue-Green Deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app-bluegreen
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
      scaleDownDelaySeconds: 300
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0
In a blue-green deployment, traffic switches over all at once after the new version (green) is fully ready. prePromotionAnalysis runs smoke tests before the switch, and scaleDownDelaySeconds keeps the old version alive for a grace period so rollback stays fast.
7-4. Traffic Mirroring (Shadow Traffic)
Real production traffic is replicated to the new version so it is exercised under real load, while the new version's responses are never returned to users.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app.example.com
  http:
    - route:
        - destination:
            host: my-app-stable
            port:
              number: 80
      mirror:
        host: my-app-canary
        port:
          number: 80
      mirrorPercentage:
        value: 100.0
8. Multi-Environment Management
8-1. Environment Structure Design
A production pipeline runs at least three environments.
- dev: Deploys developers' feature branches. Instability is acceptable.
- staging: Deploys the main branch. Its configuration must mirror production.
- production: Serves real user traffic.
8-2. Kustomize Overlays
Kustomize is a tool for customizing Kubernetes manifests per environment. Environment-specific overlays are applied on top of a base configuration.
k8s/
  base/
    deployment.yaml
    service.yaml
    kustomization.yaml
  overlays/
    dev/
      kustomization.yaml
      replica-patch.yaml
    staging/
      kustomization.yaml
      replica-patch.yaml
    production/
      kustomization.yaml
      replica-patch.yaml
      hpa.yaml
# k8s/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml

# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - hpa.yaml
patches:
  - path: replica-patch.yaml
namePrefix: prod-
commonLabels:
  env: production

# k8s/overlays/production/replica-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: my-app
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
8-3. Managing Helm Values per Environment
# values-dev.yaml
replicaCount: 1
image:
  tag: latest
resources:
  requests:
    cpu: 100m
    memory: 128Mi
ingress:
  host: dev.my-app.internal
autoscaling:
  enabled: false

# values-production.yaml
replicaCount: 5
image:
  tag: v2.0.0
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
ingress:
  host: my-app.example.com
  tls: true
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPU: 70

# Per-environment deployment
helm upgrade --install my-app ./chart \
  -f values-production.yaml \
  --namespace production \
  --wait --timeout 5m
9. Monitoring and Rollback
9-1. Post-Deployment Monitoring Checklist
Key metrics to check right after a deployment.
Immediately (0-5 min):
- Pod status: Are all Pods Running/Ready?
- Error logs: Are exceptions spiking in the new version?
- Health checks: Are readiness/liveness probes passing?
Short term (5-30 min):
- Response time: Are p50, p95, and p99 latencies comparable to the previous version?
- Error rate: Is the 5xx ratio within the threshold?
- Throughput: Is request throughput within the expected range?
Medium term (30 min to several hours):
- Memory usage: Any signs of a memory leak?
- CPU usage: Is CPU usage stable?
- Business metrics: Are orders, conversion rates, and similar metrics normal?
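The latency comparison in the short-term window can be sketched in Python: compute the percentiles from raw request durations and flag a regression against the previous release's baseline. The 20% tolerance is an illustrative threshold, not a standard:

```python
import statistics

def percentiles(samples: list) -> dict:
    """Compute p50/p95/p99 from raw request durations (seconds)."""
    qs = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def regressed(new: dict, baseline: dict, tolerance: float = 1.2) -> bool:
    """Flag a regression if any percentile exceeds baseline by more than 20%."""
    return any(new[k] > baseline[k] * tolerance for k in baseline)
```

Comparing all three percentiles matters: a new version can leave p50 untouched while p99 blows up, which is exactly the tail-latency regression a median-only check would miss.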
9-2. Prometheus-Based Automatic Rollback
# PrometheusRule - automatic rollback trigger
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-rollback-rules
spec:
  groups:
    - name: deployment-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
            > 0.05
          for: 2m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "Error rate above 5 percent for 2 minutes"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
            ) > 2
          for: 3m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "p99 latency above 2 seconds for 3 minutes"
9-3. SLO-Based Deployment Gates
Service level objectives (SLOs) drive deployment decisions: once the error budget is exhausted, deployments stop.
# Example SLO definition (Sloth)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-app-slo
spec:
  service: my-app
  labels:
    team: platform
  slos:
    - name: requests-availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total[5m]))
      alerting:
        name: MyAppAvailability
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning
The error budget works like this: with a 99.9% SLO, roughly 43 minutes of downtime per month is allowed. If 30 minutes are already consumed, the remaining 13 minutes of budget are not enough to justify a risky deployment.
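That arithmetic, written out as code (using a 30-day window):

```python
# Error budget for an availability SLO: the fraction of the window in which
# the service is allowed to be unavailable.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime in minutes for the given SLO and window."""
    return (1 - slo) * window_days * 24 * 60

def remaining_budget(slo: float, consumed_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after subtracting downtime already consumed."""
    return error_budget_minutes(slo, window_days) - consumed_minutes
```

A 99.9% SLO over 30 days yields 0.001 × 43,200 = 43.2 minutes; note how one extra nine (99.99%) shrinks the budget tenfold, to about 4.3 minutes.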
10. In Practice: A Production Pipeline Architecture
10-1. Overall Architecture
A complete production-grade CI/CD pipeline is structured like this.
Developer pushes code
|
v
[CI stage - GitHub Actions]
|-- Check out code
|-- Install dependencies (with caching)
|-- Lint + format checks
|-- Unit tests (4 parallel shards)
|-- Integration tests
|-- SAST (SonarQube)
|-- Secret scan (GitLeaks)
|-- Dependency vulnerability scan (Snyk)
|-- Container build (Kaniko)
|-- Container scan (Trivy)
|-- SBOM generation (Syft)
|-- Image signing (Cosign)
|-- Push to image registry
|
v
[CD stage - ArgoCD / GitOps]
|-- Manifest repository auto-updated
|-- ArgoCD sync
|-- Auto-deploy to dev
|-- Auto-deploy to staging
|-- E2E tests (Playwright)
|-- DAST (OWASP ZAP)
|
v
[Production deploy - Argo Rollouts]
|-- Canary at 5%
|-- Metric analysis (AnalysisTemplate)
|-- Canary up to 20%
|-- Metrics re-analyzed
|-- Canary up to 50%
|-- Final analysis
|-- 100% promotion or automatic rollback
|
v
[Monitoring - Prometheus / Grafana]
|-- SLO dashboards
|-- Error budget tracking
|-- Automatic rollback alerts
10-2. Pipeline Optimization Tips
Caching: Caching dependency installs, Docker layers, and test results can cut pipeline time by 50% or more.
- uses: actions/cache@v4
  with:
    path: |
      ~/.npm
      node_modules
    key: deps-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      deps-
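The idea behind the hashFiles-based key is a content hash of the lockfile, so the cache invalidates exactly when dependencies change. A Python sketch of the same idea (GitHub computes its hash differently; this is only an analogue, and the "deps" prefix mirrors the example above):

```python
import hashlib

def cache_key(lockfile_bytes: bytes, prefix: str = "deps") -> str:
    """Derive a cache key from a content hash of the lockfile."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"
```

An unchanged lockfile always yields the same key (a cache hit), any edit yields a new key (a rebuild), and the restore-keys prefix lets a miss fall back to the most recent cache with the same prefix.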
Conditional execution: Run only the jobs affected by the changed files, avoiding unnecessary builds.
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      backend:
        - 'src/api/**'
        - 'src/models/**'
      frontend:
        - 'src/components/**'
        - 'src/pages/**'
      infra:
        - 'terraform/**'
        - 'k8s/**'
Parallelization: Run independent jobs concurrently. Lint, tests, and security scans do not depend on each other, so they can run at the same time.
10-3. Pipeline Metrics
The pipeline's own performance should be measured too.
| Metric | Description | Target |
|---|---|---|
| Lead Time | Time from commit to production deploy | Under 1 hour |
| Deployment Frequency | Production deploys per day | 10+ per day |
| Change Failure Rate | Ratio of deploys rolled back | Under 5% |
| MTTR | Time to recover from an incident | Under 30 minutes |
| Pipeline Duration | Total CI runtime | Under 15 minutes |
| Test Coverage | Code coverage ratio | 80% or higher |
The first four rows are the DORA metrics (lead time, deployment frequency, change failure rate, MTTR). Elite teams achieve the top rating on all four.
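Two of these metrics can be computed directly from a deployment log; the records below are synthetic example data:

```python
# Sketch: computing change failure rate and deployment frequency from a
# simple deployment log. Each record is a deploy with a rollback flag.
from datetime import datetime

deploys = [
    {"at": datetime(2024, 5, 1, 9), "rolled_back": False},
    {"at": datetime(2024, 5, 1, 14), "rolled_back": True},
    {"at": datetime(2024, 5, 2, 10), "rolled_back": False},
    {"at": datetime(2024, 5, 2, 16), "rolled_back": False},
]

def change_failure_rate(deploys: list) -> float:
    """Fraction of deploys that were rolled back."""
    return sum(d["rolled_back"] for d in deploys) / len(deploys)

def deploys_per_day(deploys: list) -> float:
    """Average number of deploys per active day."""
    days = {d["at"].date() for d in deploys}
    return len(deploys) / len(days)
```

Lead time and MTTR need richer data (commit timestamps and incident open/close times), but the same log-driven approach applies.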
Closing Thoughts
A CI/CD pipeline is not merely an automation tool; it is core infrastructure that determines software quality and development velocity. To summarize this guide:
- Use the maturity model to locate your current level. Don't aim for Level 5 in one leap; build capabilities step by step.
- Design reusable pipelines. Standardize pipelines across the organization with GitHub Actions reusable workflows and composite actions.
- Adopt GitOps. Declarative deployment with a tool like ArgoCD gives you audit trails, automatic recovery, and consistency all at once.
- Build security into the pipeline. DevSecOps is not a separate stage; security is woven into every stage of the pipeline.
- Make decisions from metrics. Use SLOs and error budgets to manage deployment risk quantitatively.
- Track the DORA metrics. Continuously improve the performance of the pipeline itself.
Production-grade CI/CD is not built overnight. But understand each component and adopt it incrementally, and your team's deployment capability will be transformed.
Advanced CI/CD Pipeline Guide — GitHub Actions, ArgoCD, Tekton, and Security Pipelines
Introduction
Have you ever moved past understanding CI/CD as simply "code gets deployed automatically when pushed" and considered what a production-grade pipeline truly requires? In real production environments, dozens of stages must be orchestrated seamlessly: security scanning, static analysis, container image signing, multi-cluster deployments, and automated rollbacks.
This guide covers everything from CI/CD maturity models to advanced GitHub Actions patterns, ArgoCD GitOps, Tekton cloud-native pipelines, DevSecOps security pipelines, and advanced deployment strategies.
1. CI/CD Maturity Model
A five-level maturity model for objectively assessing your organization's CI/CD capabilities. Each level builds upon the previous one.
Level 1 - Manual
Developers perform builds and deployments manually. Building locally and deploying to servers via FTP or SCP. "It works on my machine" is a daily occurrence, and deployment cycles are often monthly.
Characteristics: Documented procedures may exist, but almost no automation. Human errors during deployment are frequent, and rollbacks involve manually redeploying previous versions.
Level 2 - Basic Automation
A CI server is introduced, and automatic builds and tests run on code pushes. Tools like Jenkins, GitHub Actions, or GitLab CI are used, but the pipeline is a simple linear build-test-deploy structure.
# Level 2: Basic pipeline example
name: Basic CI/CD
on:
push:
branches: [main]
jobs:
build-and-deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm install
- run: npm test
- run: npm run build
- run: ./deploy.sh
Level 3 - Standardized
Pipelines are standardized across the organization with reusable templates. Test coverage gates, code quality checks, and security scans are included by default. Environment-specific (dev, staging, prod) deployment pipelines are separated.
Level 4 - Advanced Automation
GitOps-based declarative deployments, canary/blue-green deployment strategies, automated security pipelines (SAST, DAST, container scanning), and SBOM generation with artifact signing are integrated into the pipeline. Deployment cycles shrink to sub-daily.
Level 5 - Self-Healing
Metric-based automatic rollbacks, SLO violation auto-recovery, ML-based anomaly detection, and production feedback automatically fed back into the pipeline. Deployments are fully automated so developers never think about deploying.
| Level | Deploy Cycle | Rollback | Security | Testing |
|---|---|---|---|---|
| L1 Manual | Monthly | Manual redeploy | None | Manual |
| L2 Basic | Weekly | Manual trigger | None | Auto unit |
| L3 Standard | Daily | One-click | Basic scans | Unit + Integration |
| L4 Advanced | Hourly | Auto canary | Full pipeline | Unit + Integration + E2E |
| L5 Healing | Continuous | SLO-based auto | Continuous validation | Includes chaos |
2. Advanced GitHub Actions Patterns
Advanced patterns that go beyond GitHub Actions basics.
2-1. Reusable Workflows
To share identical pipeline logic across multiple repositories, use reusable workflows. The calling side references external workflows with the uses keyword.
# .github/workflows/reusable-build.yml (shared repository)
name: Reusable Build
on:
workflow_call:
inputs:
node-version:
required: false
type: string
default: '20'
registry-url:
required: true
type: string
secrets:
npm-token:
required: true
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ inputs.node-version }}
registry-url: ${{ inputs.registry-url }}
- run: npm ci
env:
NODE_AUTH_TOKEN: ${{ secrets.npm-token }}
- run: npm run build
- uses: actions/upload-artifact@v4
with:
name: build-output
path: dist/
# Calling workflow
name: App CI
on: [push]
jobs:
call-build:
uses: my-org/shared-workflows/.github/workflows/reusable-build.yml@v2
with:
node-version: '20'
registry-url: 'https://npm.pkg.github.com'
secrets:
npm-token: ${{ secrets.NPM_TOKEN }}
2-2. Composite Actions
A pattern for bundling multiple steps within a single action for reuse. Unlike reusable workflows, these are inserted as steps within a single job.
# .github/actions/setup-and-test/action.yml
name: 'Setup and Test'
description: 'Set up Node.js environment and run tests'
inputs:
node-version:
description: 'Node.js version'
required: false
default: '20'
runs:
using: 'composite'
steps:
- uses: actions/setup-node@v4
with:
node-version: ${{ inputs.node-version }}
cache: 'npm'
- run: npm ci
shell: bash
- run: npm test -- --coverage
shell: bash
- uses: actions/upload-artifact@v4
with:
name: coverage
path: coverage/
2-3. Matrix Build Strategy
An advanced matrix configuration that tests multiple environment combinations in parallel while excluding unnecessary ones.
jobs:
test:
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
node: [18, 20, 22]
exclude:
- os: macos-latest
node: 18
include:
- os: ubuntu-latest
node: 22
experimental: true
runs-on: ${{ matrix.os }}
continue-on-error: ${{ matrix.experimental || false }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node }}
- run: npm ci
- run: npm test
2-4. Cloud Authentication with OIDC
Instead of long-lived secrets (Access Keys), use OIDC (OpenID Connect) tokens to securely authenticate with cloud providers. AWS, GCP, and Azure all support this.
jobs:
deploy:
runs-on: ubuntu-latest
permissions:
id-token: write
contents: read
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-role
aws-region: us-east-1
- run: aws s3 sync ./dist s3://my-bucket
- run: aws cloudfront create-invalidation --distribution-id E1234 --paths "/*"
The key advantage of OIDC is eliminating secret management. GitHub issues a short-lived token that AWS STS validates to issue temporary credentials. The token is only valid during workflow execution, making leaked tokens difficult to exploit.
3. GitOps and ArgoCD
3-1. GitOps Principles
GitOps is an operational model that uses Git as the single source of truth for managing the declarative state of infrastructure and applications. Four core principles define it.
Declarative configuration: Describe the desired system state declaratively. Define "what" you want, not "how" to achieve it.
Git as the source of truth: All changes go through Git, and Git history serves as the audit log.
Automatic application: When the declarative state in Git changes, it is automatically applied to the system.
Continuous reconciliation: An agent continuously compares actual state to desired state and corrects any drift.
3-2. ArgoCD Architecture
ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes.
# ArgoCD Application resource
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: my-app
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/my-org/k8s-manifests.git
targetRevision: main
path: apps/my-app/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: my-app
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
When syncPolicy.automated is configured, Git changes are automatically reflected in the cluster. selfHeal: true means that if someone manually modifies the cluster state, it automatically reverts to the Git state.
3-3. App of Apps Pattern
In large environments with dozens to hundreds of Applications to manage, the App of Apps pattern uses a single root Application that manages all the others.
# root-app.yaml - Root that manages other Applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: root-app
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/my-org/argocd-apps.git
targetRevision: main
path: apps
destination:
server: https://kubernetes.default.svc
namespace: argocd
# apps/frontend.yaml - Child Application managed by root
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: frontend
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/my-org/k8s-manifests.git
path: apps/frontend/overlays/production
targetRevision: main
destination:
server: https://kubernetes.default.svc
namespace: frontend
3-4. Multi-Cluster Deployment with ApplicationSet
ApplicationSet is an ArgoCD feature that automatically generates multiple Applications from a single template, based on cluster lists, Git directories, or PR events.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: my-app-set
namespace: argocd
spec:
generators:
- clusters:
selector:
matchLabels:
env: production
template:
metadata:
name: 'my-app-{{name}}'
spec:
project: default
source:
repoURL: https://github.com/my-org/k8s-manifests.git
targetRevision: main
path: 'apps/my-app/overlays/{{metadata.labels.region}}'
destination:
server: '{{server}}'
namespace: my-app
4. Tekton Pipelines
4-1. What is Tekton
Tekton is a Kubernetes-native CI/CD framework. Unlike GitHub Actions or Jenkins, all pipeline components are defined as Kubernetes Custom Resources (CRDs). Each task runs as a separate Pod, providing complete isolation and scalability.
Core components:
- Task: An execution unit composed of one or more Steps. Maps to a single Pod.
- Pipeline: A DAG (Directed Acyclic Graph) workflow connecting multiple Tasks.
- TaskRun / PipelineRun: Execution instances of a Task or Pipeline. Trackable as Kubernetes resources.
- Workspace: Volumes for sharing data between Tasks.
4-2. Task Definition
apiVersion: tekton.dev/v1
kind: Task
metadata:
name: build-and-push
spec:
params:
- name: image-url
type: string
- name: image-tag
type: string
default: latest
workspaces:
- name: source
steps:
- name: build
image: gcr.io/kaniko-project/executor:latest
args:
- --dockerfile=Dockerfile
- --context=$(workspaces.source.path)
- --destination=$(params.image-url):$(params.image-tag)
- --cache=true
- --cache-ttl=24h
4-3. Pipeline Definition
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
name: ci-pipeline
spec:
params:
- name: repo-url
type: string
- name: image-url
type: string
workspaces:
- name: shared-workspace
tasks:
- name: fetch-source
taskRef:
name: git-clone
workspaces:
- name: output
workspace: shared-workspace
params:
- name: url
value: $(params.repo-url)
- name: run-tests
taskRef:
name: npm-test
runAfter:
- fetch-source
workspaces:
- name: source
workspace: shared-workspace
- name: build-image
taskRef:
name: build-and-push
runAfter:
- run-tests
workspaces:
- name: source
workspace: shared-workspace
params:
- name: image-url
value: $(params.image-url)
- name: security-scan
taskRef:
name: trivy-scan
runAfter:
- build-image
params:
- name: image-url
value: $(params.image-url)
4-4. Executing with PipelineRun
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
generateName: ci-pipeline-run-
spec:
pipelineRef:
name: ci-pipeline
params:
- name: repo-url
value: https://github.com/my-org/my-app.git
- name: image-url
value: ghcr.io/my-org/my-app
workspaces:
- name: shared-workspace
volumeClaimTemplate:
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
5. Security Pipeline (DevSecOps)
5-1. Security Pipeline Components
A production-grade security pipeline automatically performs the following stages.
| Stage | Tools | Purpose |
|---|---|---|
| SAST (Static Analysis) | SonarQube, Semgrep, CodeQL | Detect source code vulnerabilities |
| SCA (Dependency Analysis) | Snyk, Dependabot, OWASP DC | Detect open-source dependency vulnerabilities |
| Secret Scanning | GitLeaks, TruffleHog | Detect secrets embedded in code |
| Container Scanning | Trivy, Grype | Detect container image vulnerabilities |
| DAST (Dynamic Analysis) | OWASP ZAP, Nuclei | Detect running application vulnerabilities |
| SBOM Generation | Syft, CycloneDX | Generate software component inventory |
| Artifact Signing | Cosign, Notation | Ensure build artifact integrity |
5-2. SAST - SonarQube Integration
# SonarQube analysis in GitHub Actions
name: Security Pipeline
on: [push, pull_request]
jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: SonarSource/sonarqube-scan-action@v3
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
        with:
          args: >
            -Dsonar.projectKey=my-project
            -Dsonar.sources=src/
            -Dsonar.tests=tests/
            -Dsonar.coverage.exclusions=**/*.test.ts
      - uses: SonarSource/sonarqube-quality-gate-action@v1
        timeout-minutes: 5
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
5-3. Container Image Scanning - Trivy
container-scan:
  runs-on: ubuntu-latest
  needs: [build]
  steps:
    - uses: aquasecurity/trivy-action@master
      with:
        image-ref: 'ghcr.io/my-org/my-app:latest'
        format: 'sarif'
        output: 'trivy-results.sarif'
        severity: 'CRITICAL,HIGH'
        exit-code: '1'
    - uses: github/codeql-action/upload-sarif@v3
      if: always()
      with:
        sarif_file: 'trivy-results.sarif'
Trivy scans both OS packages and language-specific dependencies. Setting exit-code: '1' causes the pipeline to fail when CRITICAL or HIGH vulnerabilities are found.
5-4. SBOM Generation and Cosign Signing
SBOM (Software Bill of Materials) is a complete inventory of all components in the software. Following US Executive Order 14028, it has become an essential element of supply chain security.
sbom-and-sign:
  runs-on: ubuntu-latest
  needs: [container-scan]
  permissions:
    id-token: write
    packages: write
  steps:
    - name: Generate SBOM
      uses: anchore/sbom-action@v0
      with:
        image: ghcr.io/my-org/my-app:latest
        format: spdx-json
        output-file: sbom.spdx.json
    - name: Install Cosign
      uses: sigstore/cosign-installer@v3
    - name: Sign Container Image
      run: |
        cosign sign --yes \
          ghcr.io/my-org/my-app:latest
    - name: Attach SBOM to Image
      run: |
        cosign attach sbom \
          --sbom sbom.spdx.json \
          ghcr.io/my-org/my-app:latest
Cosign is part of the Sigstore project and supports keyless signing. It uses OIDC tokens to sign images without requiring separate signing key management.
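Signing only pays off if something verifies the signature before the image runs. One way to enforce this, sketched here assuming the Sigstore policy-controller admission webhook is installed in the cluster and that builds are signed with the GitHub Actions OIDC identity, is a ClusterImagePolicy that rejects unsigned images at admission time:

```yaml
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: require-signed-images
spec:
  images:
    - glob: "ghcr.io/my-org/**"   # policy applies to this registry path
  authorities:
    - keyless:
        url: https://fulcio.sigstore.dev
        identities:
          # Only accept signatures produced by this org's GitHub Actions workflows.
          - issuer: https://token.actions.githubusercontent.com
            subjectRegExp: "https://github.com/my-org/.*"
```

With this in place, a Pod referencing an unsigned `ghcr.io/my-org/*` image is denied, closing the loop between CI-side signing and cluster-side trust.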
5-5. Secret Scanning - GitLeaks
secret-scan:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - uses: gitleaks/gitleaks-action@v2
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
6. Test Automation Strategy
6-1. The Test Pyramid
In production CI/CD, the types and proportions of tests matter.
Unit Tests (70%): Should be the fastest and most numerous. Test individual functions, methods, and components in isolation.
Integration Tests (20%): Verify that multiple modules work together correctly. Test interactions with external dependencies like databases, APIs, and message queues.
E2E Tests (10%): Validate user scenarios from start to finish. Slowest and most brittle, so only test core flows.
6-2. Parallel Testing and Test Splitting
Strategies for reducing execution time of large test suites.
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run Tests (Shard ${{ matrix.shard }}/4)
        run: |
          npx jest --shard=${{ matrix.shard }}/4 \
            --ci --coverage --forceExit
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/
  merge-coverage:
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      # Download each shard's artifact into its own coverage-N/ directory
      # (merging them into one directory would overwrite coverage-final.json).
      - uses: actions/download-artifact@v4
        with:
          pattern: coverage-*
      - name: Merge Coverage Reports
        run: npx istanbul-merge --out merged-coverage.json coverage-*/coverage-final.json
6-3. Playwright E2E Testing
e2e:
  runs-on: ubuntu-latest
  needs: [deploy-staging]
  steps:
    - uses: actions/checkout@v4
    - run: npm ci
    - name: Install Playwright Browsers
      run: npx playwright install --with-deps
    - name: Run E2E Tests
      run: npx playwright test --reporter=html
      env:
        BASE_URL: https://staging.my-app.com
    - uses: actions/upload-artifact@v4
      if: failure()
      with:
        name: playwright-report
        path: playwright-report/
7. Advanced Deployment Strategies
7-1. Canary Deployments with Argo Rollouts
Argo Rollouts is a controller that implements advanced deployment strategies in Kubernetes. It provides a Rollout resource that replaces the standard Deployment.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: my-app-vsvc
              routes:
                - primary
      steps:
        - setWeight: 5
        - pause:
            duration: 5m
        - setWeight: 20
        - pause:
            duration: 5m
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 80
        - pause:
            duration: 5m
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: my-app-canary
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0
          ports:
            - containerPort: 8080
7-2. AnalysisTemplate - Metric-Based Automated Decisions
During canary deployment, query Prometheus metrics to automatically decide whether to promote or abort (rollback).
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
Once three measurements show a success rate below 95% (failureLimit: 3), the analysis fails and an automatic rollback is triggered. This is the core of Level 5 self-healing pipelines.
7-3. Blue-Green Deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app-bluegreen
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
      scaleDownDelaySeconds: 300
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0
In blue-green deployment, after the new version (green) is fully ready, all traffic switches at once. prePromotionAnalysis runs smoke tests before the switch, and scaleDownDelaySeconds keeps the old version for a period to enable quick rollback.
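The smoke-test template referenced above is not defined anywhere; a minimal sketch using the Argo Rollouts job provider, which passes when the Kubernetes Job exits successfully (the preview Service URL, health path, and curl image are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
spec:
  metrics:
    - name: smoke-test
      count: 1
      provider:
        job:
          spec:
            backoffLimit: 0   # one attempt; any failure fails the analysis
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: curlimages/curl:latest
                    # -f makes curl exit non-zero on HTTP errors,
                    # failing the Job and blocking promotion.
                    args: ["-fsS", "http://my-app-preview/healthz"]
```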
7-4. Traffic Mirroring (Shadow Traffic)
A technique that replicates real production traffic to the new version for testing under actual load, but does not deliver the new version's responses to users.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app.example.com
  http:
    - route:
        - destination:
            host: my-app-stable
            port:
              number: 80
      mirror:
        host: my-app-canary
        port:
          number: 80
      mirrorPercentage:
        value: 100.0
8. Multi-Environment Management
8-1. Environment Structure Design
Production pipelines operate a minimum of three environments.
- dev: Feature branch deployments for developers. Instability is acceptable.
- staging: Main branch deployments. Must have identical configuration to production.
- production: The environment serving real user traffic.
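With ArgoCD, the three environments can be generated from a single definition instead of three hand-maintained Applications. A sketch using an ApplicationSet list generator (the manifest repo URL and the Kustomize-style `k8s/overlays/` paths are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-envs
spec:
  generators:
    - list:
        elements:
          - env: dev
          - env: staging
          - env: production
  template:
    metadata:
      name: my-app-{{env}}
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-app-manifests.git
        targetRevision: main
        path: k8s/overlays/{{env}}
      destination:
        server: https://kubernetes.default.svc
        namespace: my-app-{{env}}
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding a fourth environment then becomes a one-line change to the generator list.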
8-2. Kustomize Overlays
Kustomize is a tool for customizing Kubernetes manifests per environment. It applies environment-specific overlays on top of a base configuration.
k8s/
  base/
    deployment.yaml
    service.yaml
    kustomization.yaml
  overlays/
    dev/
      kustomization.yaml
      replica-patch.yaml
    staging/
      kustomization.yaml
      replica-patch.yaml
    production/
      kustomization.yaml
      replica-patch.yaml
      hpa.yaml
# k8s/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - hpa.yaml
patches:
  - path: replica-patch.yaml
namePrefix: prod-
commonLabels:
  env: production
# k8s/overlays/production/replica-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: my-app
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
8-3. Helm Values Per Environment
# values-dev.yaml
replicaCount: 1
image:
  tag: latest
resources:
  requests:
    cpu: 100m
    memory: 128Mi
ingress:
  host: dev.my-app.internal
autoscaling:
  enabled: false
# values-production.yaml
replicaCount: 5
image:
  tag: v2.0.0
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
ingress:
  host: my-app.example.com
  tls: true
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPU: 70
# Deploy per environment
helm upgrade --install my-app ./chart \
  -f values-production.yaml \
  --namespace production \
  --wait --timeout 5m
9. Monitoring and Rollback
9-1. Post-Deployment Monitoring Checklist
Core metrics to verify immediately after deployment.
Immediate check (0-5 minutes):
- Pod status: Are all Pods in Running/Ready state
- Error logs: Have exceptions surged in the new version
- Health checks: Are readiness/liveness probes healthy
Short-term check (5-30 minutes):
- Response time: Are p50, p95, p99 latencies similar to the previous version
- Error rate: Is the 5xx rate within threshold
- Throughput: Is request throughput in the expected range
Medium-term check (30 minutes to hours):
- Memory usage: Any signs of memory leaks
- CPU usage: Is CPU utilization stable
- Business metrics: Are orders, conversion rates, etc. normal
9-2. Prometheus-Based Automatic Rollback
# PrometheusRule - Automatic rollback trigger
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-rollback-rules
spec:
  groups:
    - name: deployment-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
            > 0.05
          for: 2m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "Error rate above 5 percent for 2 minutes"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
            ) > 2
          for: 3m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "p99 latency above 2 seconds for 3 minutes"
9-3. SLO-Based Deployment Gates
Use Service Level Objectives (SLOs) as the basis for deployment decisions. When the error budget is exhausted, halt deployments.
# SLO definition example
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-app-slo
spec:
  service: my-app
  labels:
    team: platform
  slos:
    - name: requests-availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total[5m]))
      alerting:
        name: MyAppAvailability
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning
The error budget is simply the allowed unreliability: with a 99.9% SLO, roughly 43 minutes of downtime are permitted per 30-day month. If 30 minutes have already been consumed, only about 13 minutes of budget remain, so risky deployments should be deferred until the budget recovers.
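The arithmetic behind those numbers, for a 30-day month:

```latex
\text{error budget} = (1 - 0.999) \times 30 \times 24 \times 60\,\text{min} = 43.2\,\text{min}
\qquad
\text{remaining budget} = 43.2 - 30 = 13.2\,\text{min}
```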
10. Production Pipeline Architecture in Practice
10-1. Overall Architecture
A complete production-grade CI/CD pipeline follows this structure.
Developer Code Push
        |
        v
[CI Stage - GitHub Actions]
  |-- Code checkout
  |-- Dependency install (with caching)
  |-- Lint + format check
  |-- Unit tests (parallel 4 shards)
  |-- Integration tests
  |-- SAST (SonarQube)
  |-- Secret scanning (GitLeaks)
  |-- Dependency vulnerability scan (Snyk)
  |-- Container build (Kaniko)
  |-- Container scan (Trivy)
  |-- SBOM generation (Syft)
  |-- Image signing (Cosign)
  |-- Image registry push
        |
        v
[CD Stage - ArgoCD / GitOps]
  |-- Manifest repo auto-update
  |-- ArgoCD sync
  |-- dev auto-deploy
  |-- staging auto-deploy
  |-- E2E tests (Playwright)
  |-- DAST (OWASP ZAP)
        |
        v
[Production Deploy - Argo Rollouts]
  |-- Canary 5% deploy
  |-- Metric analysis (AnalysisTemplate)
  |-- Canary increase to 20%
  |-- Re-analysis
  |-- Canary increase to 50%
  |-- Final analysis
  |-- 100% promotion or automatic rollback
        |
        v
[Monitoring - Prometheus / Grafana]
  |-- SLO dashboard
  |-- Error budget tracking
  |-- Automatic rollback alerts
10-2. Pipeline Optimization Tips
Caching strategy: Cache dependency installation, Docker layers, and test results; in many pipelines this cuts total execution time by half or more.
- uses: actions/cache@v4
  with:
    path: |
      ~/.npm
      node_modules
    key: deps-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      deps-
Conditional execution: Run only relevant tasks based on changed files to prevent unnecessary builds.
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      backend:
        - 'src/api/**'
        - 'src/models/**'
      frontend:
        - 'src/components/**'
        - 'src/pages/**'
      infra:
        - 'terraform/**'
        - 'k8s/**'
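The filter step only records what changed; subsequent steps gate on its outputs. A sketch, assuming the filter step above carries id: changes (the build and plan commands are hypothetical):

```yaml
- name: Build backend
  if: steps.changes.outputs.backend == 'true'
  run: npm run build:backend      # hypothetical script
- name: Terraform plan check
  if: steps.changes.outputs.infra == 'true'
  run: terraform -chdir=terraform plan
```

At the job level the same outputs can be exposed via `needs`, so entire jobs are skipped when their paths are untouched.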
Parallelization: Run independent tasks in parallel. Linting, testing, and security scans do not depend on each other and can run concurrently.
10-3. Pipeline Metrics
The pipeline itself must also be measured.
| Metric | Description | Target |
|---|---|---|
| Lead Time | Time from commit to production deploy | Under 1 hour |
| Deploy Frequency | Production deploys per day | 10+ per day |
| Change Failure Rate | Percentage of deploys requiring rollback | Under 5% |
| MTTR | Time to recover from incidents | Under 30 minutes |
| Pipeline Execution Time | Total CI time | Under 15 minutes |
| Test Coverage | Code coverage percentage | Over 80% |
The first four rows are the DORA metrics (Lead Time, Deploy Frequency, Change Failure Rate, MTTR). DORA's research finds that elite teams score highly on all four simultaneously: speed and stability reinforce each other rather than trading off.
Conclusion
A CI/CD pipeline is not just an automation tool but critical infrastructure that determines software quality and development velocity. Here is a summary of everything covered.
- Assess your current level against the maturity model. Do not aim for Level 5 all at once; build capabilities incrementally.
- Design reusable pipelines. Use GitHub Actions reusable workflows and Composite Actions to standardize pipelines across your organization.
- Adopt GitOps. Implementing declarative deployments with tools like ArgoCD gives you audit trails, automatic recovery, and consistency.
- Embed security in the pipeline. DevSecOps is not a separate stage but security woven into every stage of the pipeline.
- Make metric-based decisions. Use SLOs and error budgets to quantitatively manage deployment risk.
- Track DORA metrics. Continuously improve the performance of the pipeline itself.
Production-grade CI/CD is not built overnight. But by understanding each component and introducing them incrementally, your team's deployment capabilities will improve dramatically.