Advanced CI/CD Pipeline Guide — GitHub Actions, ArgoCD, Tekton, and Security Pipelines

Introduction

Have you ever moved past understanding CI/CD as simply "code gets deployed automatically when pushed" and asked what a production-grade pipeline actually requires? In real production environments, dozens of stages must be orchestrated together: security scanning, static analysis, container image signing, multi-cluster deployment, and automated rollback.

This guide covers the full spectrum: CI/CD maturity models, advanced GitHub Actions patterns, ArgoCD GitOps, Tekton cloud-native pipelines, DevSecOps security pipelines, and advanced deployment strategies.


1. CI/CD Maturity Model

A five-level maturity model for objectively assessing an organization's CI/CD capabilities. Each level builds on the capabilities of the previous one.

Level 1 - Manual

Developers build and deploy by hand: building locally and pushing to servers over FTP or SCP. "It works on my machine" is a daily occurrence, and deployment cycles are often monthly.

Characteristics: Documented procedures may exist, but there is almost no automation. Human error during deployment is frequent, and a rollback means manually redeploying the previous version.

Level 2 - Basic Automation

A CI server is introduced, and builds and tests run automatically on every push. Tools like Jenkins, GitHub Actions, or GitLab CI are in place, but the pipeline is a simple linear build-test-deploy sequence.

# Level 2: basic pipeline example
name: Basic CI/CD
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm test
      - run: npm run build
      - run: ./deploy.sh

Level 3 - Standardized

Pipelines are standardized across the organization, with reusable pipeline templates. Test coverage gates, code quality checks, and security scans are included by default. Deployment pipelines are separated per environment (dev, staging, prod).

Level 4 - Advanced Automation

GitOps-based declarative deployment, canary/blue-green deployment strategies, automated security pipelines (SAST, DAST, container scanning), and SBOM generation with artifact signing are integrated into the pipeline. Deployment cycles shrink to daily or faster.

Level 5 - Self-Healing

Metric-based automatic rollback, automatic recovery on SLO violations, ML-based anomaly detection, and production feedback flowing back into the pipeline automatically. Deployment is so fully automated that developers no longer think about deploying.

Level | Deploy cycle | Rollback | Security | Testing
L1 Manual | Monthly | Manual redeploy | None | Manual
L2 Basic | Weekly | Manual trigger | None | Automated unit
L3 Standardized | Daily | One-click rollback | Basic scans | Unit + integration
L4 Advanced | Hourly | Automated canary | Full pipeline | Unit + integration + E2E
L5 Self-healing | Continuous | SLO-based automatic | Continuous validation | Includes chaos

2. Advanced GitHub Actions Patterns

Advanced patterns that go beyond the GitHub Actions basics.

2-1. Reusable Workflows

To share the same pipeline logic across multiple repositories, use reusable workflows. The caller references the external workflow with the uses keyword.

# .github/workflows/reusable-build.yml (shared repository)
name: Reusable Build
on:
  workflow_call:
    inputs:
      node-version:
        required: false
        type: string
        default: '20'
      registry-url:
        required: true
        type: string
    secrets:
      npm-token:
        required: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          registry-url: ${{ inputs.registry-url }}
      - run: npm ci
        env:
          NODE_AUTH_TOKEN: ${{ secrets.npm-token }}
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: dist/
# Calling workflow
name: App CI
on: [push]
jobs:
  call-build:
    uses: my-org/shared-workflows/.github/workflows/reusable-build.yml@v2
    with:
      node-version: '20'
      registry-url: 'https://npm.pkg.github.com'
    secrets:
      npm-token: ${{ secrets.NPM_TOKEN }}

2-2. Composite Actions

A pattern that bundles several steps into a single action for reuse. Unlike a reusable workflow, a composite action is inserted as steps within a single job.

# .github/actions/setup-and-test/action.yml
name: 'Setup and Test'
description: 'Set up a Node.js environment and run tests'
inputs:
  node-version:
    description: 'Node.js version'
    required: false
    default: '20'
runs:
  using: 'composite'
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}
        cache: 'npm'
    - run: npm ci
      shell: bash
    - run: npm test -- --coverage
      shell: bash
    - uses: actions/upload-artifact@v4
      with:
        name: coverage
        path: coverage/

2-3. Matrix Build Strategy

An advanced matrix configuration that tests multiple environment combinations in parallel while excluding unnecessary ones.

jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        node: [18, 20, 22]
        exclude:
          - os: macos-latest
            node: 18
        include:
          - os: ubuntu-latest
            node: 22
            experimental: true
    runs-on: ${{ matrix.os }}
    continue-on-error: ${{ matrix.experimental || false }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test

2-4. Cloud Authentication with OIDC

Instead of long-lived secrets (access keys), use OIDC (OpenID Connect) tokens to authenticate securely with cloud providers. AWS, GCP, and Azure all support it.

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-role
          aws-region: ap-northeast-2
      - run: aws s3 sync ./dist s3://my-bucket
      - run: aws cloudfront create-invalidation --distribution-id E1234 --paths "/*"

The key advantage of OIDC is that there are no long-lived secrets to manage. AWS STS validates the short-lived token issued by GitHub and hands back temporary credentials. The token is only valid while the workflow runs, so even a leaked token is hard to abuse.
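The trust side of this exchange hinges on the claims inside the GitHub-issued token: a typical IAM trust policy pins the audience to sts.amazonaws.com and restricts the subject claim to specific repos or branches. The sketch below illustrates that claim check in Python; the claim names ("aud", "sub") are real GitHub OIDC claims, but the function and matching logic are an assumption for illustration, not AWS code.

```python
# Illustrative sketch of the subject-claim check an IAM trust policy performs
# on a GitHub OIDC token. IAM StringLike conditions support * wildcards,
# which fnmatch approximates here.
from fnmatch import fnmatch

def trust_policy_allows(claims: dict, allowed_sub: str,
                        expected_aud: str = "sts.amazonaws.com") -> bool:
    """Return True if the token's audience and subject satisfy the policy."""
    if claims.get("aud") != expected_aud:
        return False
    return fnmatch(claims.get("sub", ""), allowed_sub)

# A token minted for a push to main on my-org/my-app:
claims = {"aud": "sts.amazonaws.com",
          "sub": "repo:my-org/my-app:ref:refs/heads/main"}
print(trust_policy_allows(claims, "repo:my-org/my-app:ref:refs/heads/*"))  # True
print(trust_policy_allows(claims, "repo:my-org/other-repo:*"))             # False
```

Scoping the subject pattern to a single branch (rather than `repo:my-org/my-app:*`) is what prevents a compromised feature branch from assuming the production deployment role.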


3. GitOps and ArgoCD

3-1. GitOps Principles

GitOps is an operational model that uses Git as the single source of truth for managing the declarative state of infrastructure and applications. It rests on four core principles.

Declarative configuration: Describe the desired state of the system declaratively. Define "what" you want, not "how" to achieve it.

Git as the source of truth: Every change goes through Git, and the Git history doubles as the audit log.

Automatic application: When the declared state in Git changes, it is applied to the system automatically.

Continuous reconciliation: An agent continuously compares the actual state with the desired state and corrects any drift.
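The reconciliation principle can be sketched as a loop that diffs desired state against actual state and emits corrective actions. This is a minimal illustration of the idea, not ArgoCD internals; all names here are hypothetical.

```python
# Minimal sketch of a GitOps reconciliation pass. desired stands in for the
# manifests in Git; actual stands in for the live cluster state.

def reconcile(desired: dict, actual: dict) -> list[str]:
    """Return the corrective actions needed to converge actual onto desired."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actions.append(f"update {name}")   # drift: changed out of band
    for name in actual:
        if name not in desired:
            actions.append(f"prune {name}")    # resource no longer in Git
    return actions

desired = {"deployment/my-app": {"replicas": 3}, "service/my-app": {"port": 80}}
actual  = {"deployment/my-app": {"replicas": 5}}   # someone scaled it manually
print(reconcile(desired, actual))
# ['update deployment/my-app', 'create service/my-app']
```

In a real controller this loop runs continuously; the "update" branch is exactly what ArgoCD's selfHeal option automates.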

3-2. ArgoCD Architecture

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes.

# ArgoCD Application resource
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests.git
    targetRevision: main
    path: apps/my-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

With syncPolicy.automated configured, Git changes are reflected in the cluster automatically. selfHeal: true means that even if someone modifies the cluster state by hand, it is automatically restored to the state in Git.

3-3. App of Apps Pattern

Large environments have dozens to hundreds of Applications to manage. The App of Apps pattern uses a single root Application that manages all the others.

# root-app.yaml - root that manages the other Applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/argocd-apps.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
# apps/frontend.yaml - child Application managed by the root
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests.git
    path: apps/frontend/overlays/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: frontend

3-4. Multi-Cluster Deployment with ApplicationSet

ApplicationSet is an ArgoCD feature that generates multiple Applications from a single template, driven by cluster lists, Git directories, PR events, and more.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-set
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: 'my-app-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/k8s-manifests.git
        targetRevision: main
        path: 'apps/my-app/overlays/{{metadata.labels.region}}'
      destination:
        server: '{{server}}'
        namespace: my-app

4. Tekton Pipelines

4-1. What Is Tekton

Tekton is a Kubernetes-native CI/CD framework. Unlike GitHub Actions or Jenkins, every pipeline building block is defined as a Kubernetes custom resource (CRD). Each task runs as its own Pod, giving full isolation and scalability.

Core building blocks:

  • Task: A unit of execution made up of one or more Steps; it maps to a single Pod.
  • Pipeline: A workflow that connects multiple Tasks into a DAG (directed acyclic graph).
  • TaskRun / PipelineRun: A running instance of a Task or Pipeline, trackable as a Kubernetes resource.
  • Workspace: A volume for sharing data between Tasks.
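The DAG structure comes from each task's runAfter list: a task becomes runnable only once all of its predecessors finish. As a sketch of what the scheduler resolves, here is Kahn's topological sort applied to the task names used in the Pipeline of section 4-3 (the sorting code is illustrative, not Tekton source).

```python
# Hedged sketch: deriving a valid execution order from runAfter edges.
from collections import deque

def execution_order(run_after: dict[str, list[str]]) -> list[str]:
    """Topologically sort tasks given their runAfter dependencies (Kahn's algorithm)."""
    indegree = {t: len(deps) for t, deps in run_after.items()}
    dependents = {t: [] for t in run_after}
    for task, deps in run_after.items():
        for dep in deps:
            dependents[dep].append(task)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in dependents[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(run_after):
        raise ValueError("cycle detected: not a valid DAG")
    return order

pipeline = {
    "fetch-source": [],
    "run-tests": ["fetch-source"],
    "build-image": ["run-tests"],
    "security-scan": ["build-image"],
}
print(execution_order(pipeline))
# ['fetch-source', 'run-tests', 'build-image', 'security-scan']
```

Tasks with equal depth (empty ready queue contention) run in parallel Pods, which is where Tekton's per-task isolation pays off.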

4-2. Defining a Task

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-and-push
spec:
  params:
    - name: image-url
      type: string
    - name: image-tag
      type: string
      default: latest
  workspaces:
    - name: source
  steps:
    - name: build
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --dockerfile=Dockerfile
        - --context=$(workspaces.source.path)
        - --destination=$(params.image-url):$(params.image-tag)
        - --cache=true
        - --cache-ttl=24h

4-3. Defining a Pipeline

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  params:
    - name: repo-url
      type: string
    - name: image-url
      type: string
  workspaces:
    - name: shared-workspace
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.repo-url)

    - name: run-tests
      taskRef:
        name: npm-test
      runAfter:
        - fetch-source
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: build-image
      taskRef:
        name: build-and-push
      runAfter:
        - run-tests
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: image-url
          value: $(params.image-url)

    - name: security-scan
      taskRef:
        name: trivy-scan
      runAfter:
        - build-image
      params:
        - name: image-url
          value: $(params.image-url)

4-4. Running with a PipelineRun

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: ci-pipeline-run-
spec:
  pipelineRef:
    name: ci-pipeline
  params:
    - name: repo-url
      value: https://github.com/my-org/my-app.git
    - name: image-url
      value: ghcr.io/my-org/my-app
  workspaces:
    - name: shared-workspace
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi

5. Security Pipelines (DevSecOps)

5-1. Components of a Security Pipeline

A production-grade security pipeline performs the following stages automatically.

Stage | Tools | Purpose
SAST (static analysis) | SonarQube, Semgrep, CodeQL | Detect security vulnerabilities in source code
SCA (dependency analysis) | Snyk, Dependabot, OWASP DC | Detect vulnerabilities in open-source dependencies
Secret scanning | GitLeaks, TruffleHog | Detect secrets committed to the codebase
Container scanning | Trivy, Grype | Detect vulnerabilities in container images
DAST (dynamic analysis) | OWASP ZAP, Nuclei | Detect vulnerabilities in the running application
SBOM generation | Syft, CycloneDX | Produce an inventory of software components
Artifact signing | Cosign, Notation | Guarantee the integrity of build artifacts

5-2. SAST - SonarQube Integration

# SonarQube analysis in GitHub Actions
name: Security Pipeline
on: [push, pull_request]

jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: SonarSource/sonarqube-scan-action@v3
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
        with:
          args: >
            -Dsonar.projectKey=my-project
            -Dsonar.sources=src/
            -Dsonar.tests=tests/
            -Dsonar.coverage.exclusions=**/*.test.ts
      - uses: SonarSource/sonarqube-quality-gate-action@v1
        timeout-minutes: 5
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}

5-3. Container Image Scanning - Trivy

  container-scan:
    runs-on: ubuntu-latest
    needs: [build]
    steps:
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/my-org/my-app:latest'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'

Trivy scans both OS packages and language-specific dependencies. With exit-code: '1' set, the pipeline fails whenever a CRITICAL or HIGH vulnerability is found.

5-4. SBOM Generation and Cosign Signing

An SBOM (Software Bill of Materials) is an inventory of every component contained in a piece of software. Since US Executive Order 14028, it has become a mandatory element of supply chain security.

  sbom-and-sign:
    runs-on: ubuntu-latest
    needs: [container-scan]
    permissions:
      id-token: write
      packages: write
    steps:
      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/my-org/my-app:latest
          format: spdx-json
          output-file: sbom.spdx.json

      - name: Install Cosign
        uses: sigstore/cosign-installer@v3

      - name: Sign Container Image
        run: |
          cosign sign --yes \
            ghcr.io/my-org/my-app:latest

      - name: Attach SBOM to Image
        run: |
          cosign attach sbom \
            --sbom sbom.spdx.json \
            ghcr.io/my-org/my-app:latest

Cosign, part of the Sigstore project, supports keyless signing: using an OIDC token, it can sign images without managing any dedicated signing keys.

5-5. Secret Scanning - GitLeaks

  secret-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

6. Test Automation Strategy

6-1. The Test Pyramid

In a production CI/CD pipeline, the mix and ratio of test types matter.

Unit tests (70%): The fastest, and there should be the most of them. They test individual functions, methods, and components in isolation.

Integration tests (20%): Verify that multiple modules work together, covering interactions with external dependencies such as databases, APIs, and message queues.

E2E tests (10%): Verify user scenarios end to end. They are the slowest and flakiest, so cover only the critical flows.

6-2. Parallel Testing and Test Sharding

Strategies for cutting the execution time of large test suites.

jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run Tests (Shard ${{ matrix.shard }}/4)
        run: |
          npx jest --shard=${{ matrix.shard }}/4 \
            --ci --coverage --forceExit
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/

  merge-coverage:
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: coverage-*
          merge-multiple: true
      - name: Merge Coverage Reports
        run: npx istanbul-merge --out merged-coverage.json coverage-*/coverage-final.json

6-3. Playwright E2E Tests

  e2e:
    runs-on: ubuntu-latest
    needs: [deploy-staging]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Install Playwright Browsers
        run: npx playwright install --with-deps
      - name: Run E2E Tests
        run: npx playwright test --reporter=html
        env:
          BASE_URL: https://staging.my-app.com
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: playwright-report
          path: playwright-report/

7. Deployment Strategies in Depth

7-1. Canary Deployment with Argo Rollouts

Argo Rollouts is a controller that implements advanced deployment strategies on Kubernetes. It provides a Rollout resource that replaces the standard Deployment.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: my-app-vsvc
              routes:
                - primary
      steps:
        - setWeight: 5
        - pause:
            duration: 5m
        - setWeight: 20
        - pause:
            duration: 5m
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 80
        - pause:
            duration: 5m
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: my-app-canary
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0
          ports:
            - containerPort: 8080

7-2. AnalysisTemplate - Metric-Driven Automated Decisions

During a canary deployment, Prometheus metrics are queried to automatically decide whether to promote or abort the rollout.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))

If the measured success rate drops below 95% on more than three measurements, the rollout is automatically aborted and rolled back. This is the heart of a Level 5 self-healing pipeline.
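The failureLimit logic above can be sketched in a few lines: each interval's success rate is checked against the successCondition, and the analysis fails once the number of failed measurements exceeds the limit. This mirrors the semantics loosely for illustration; it is not Argo Rollouts source code.

```python
# Illustrative sketch of the AnalysisTemplate verdict: successCondition is
# result >= 0.95 and failureLimit is 3, matching the manifest above.

def analysis_verdict(measurements: list[float], threshold: float = 0.95,
                     failure_limit: int = 3) -> str:
    """Return 'rollback' once failed measurements exceed the limit, else 'promote'."""
    failures = sum(1 for rate in measurements if rate < threshold)
    return "rollback" if failures > failure_limit else "promote"

print(analysis_verdict([0.99, 0.98, 0.97, 0.99]))        # promote
print(analysis_verdict([0.99, 0.91, 0.90, 0.89, 0.88]))  # rollback
```

Because the verdict is computed during the pause steps, a bad canary is aborted while it still serves only a small traffic slice.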

7-3. Blue-Green Deployment

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app-bluegreen
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
      scaleDownDelaySeconds: 300
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0

In a blue-green deployment, traffic is switched over all at once after the new version (green) is fully ready. prePromotionAnalysis runs smoke tests before the switch, and scaleDownDelaySeconds keeps the old version around for a while to allow fast rollback.

7-4. Traffic Mirroring (Shadow Traffic)

Real production traffic is duplicated to the new version so it can be tested under genuine load, while the new version's responses are never returned to users.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app.example.com
  http:
    - route:
        - destination:
            host: my-app-stable
            port:
              number: 80
      mirror:
        host: my-app-canary
        port:
          number: 80
      mirrorPercentage:
        value: 100.0

8. Multi-Environment Management

8-1. Designing the Environment Structure

A production pipeline runs at least three environments.

  • dev: Deployments from developers' feature branches. Instability is acceptable.
  • staging: Deployments from the main branch. Must be configured identically to production.
  • production: The environment serving real user traffic.

8-2. Kustomize Overlays

Kustomize is a tool for customizing Kubernetes manifests per environment. Environment-specific overlays are applied on top of a common base configuration.

k8s/
  base/
    deployment.yaml
    service.yaml
    kustomization.yaml
  overlays/
    dev/
      kustomization.yaml
      replica-patch.yaml
    staging/
      kustomization.yaml
      replica-patch.yaml
    production/
      kustomization.yaml
      replica-patch.yaml
      hpa.yaml
# k8s/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - hpa.yaml
patches:
  - path: replica-patch.yaml
namePrefix: prod-
commonLabels:
  env: production
# k8s/overlays/production/replica-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: my-app
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

8-3. Managing Helm Values per Environment

# values-dev.yaml
replicaCount: 1
image:
  tag: latest
resources:
  requests:
    cpu: 100m
    memory: 128Mi
ingress:
  host: dev.my-app.internal
autoscaling:
  enabled: false
# values-production.yaml
replicaCount: 5
image:
  tag: v2.0.0
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
ingress:
  host: my-app.example.com
  tls: true
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPU: 70
# Per-environment deployment
helm upgrade --install my-app ./chart \
  -f values-production.yaml \
  --namespace production \
  --wait --timeout 5m

9. Monitoring and Rollback

9-1. Post-Deployment Monitoring Checklist

Key metrics to verify right after a deployment.

Immediate checks (0-5 minutes):

  • Pod status: Are all Pods Running and Ready?
  • Error logs: Have exceptions spiked in the new version?
  • Health checks: Are readiness/liveness probes passing?

Short-term checks (5-30 minutes):

  • Response time: Are p50, p95, and p99 latencies comparable to the previous version?
  • Error rate: Is the 5xx ratio within the threshold?
  • Throughput: Is request throughput within the expected range?

Medium-term checks (30 minutes to several hours):

  • Memory usage: Any signs of a memory leak?
  • CPU usage: Is CPU consumption stable?
  • Business metrics: Are orders, conversion rates, and similar figures normal?
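The immediate checks above can be condensed into a single gate function. The metric names and the thresholds (5% error rate, 2-second p99, matching the PrometheusRule in section 9-2) are assumptions for this sketch, not a standard API.

```python
# Illustrative post-deploy gate: returns the violated checks, empty if healthy.

def deploy_gate(m: dict) -> list[str]:
    """Evaluate post-deployment health; an empty result means the deploy looks safe."""
    problems = []
    if m["pods_ready"] < m["pods_desired"]:
        problems.append("not all pods ready")
    if m["error_rate_5xx"] > 0.05:          # more than 5% server errors
        problems.append("5xx error rate above threshold")
    if m["p99_latency_ms"] > 2000:          # p99 above 2 seconds
        problems.append("p99 latency regression")
    return problems

metrics = {"pods_ready": 10, "pods_desired": 10,
           "error_rate_5xx": 0.012, "p99_latency_ms": 850}
print(deploy_gate(metrics))  # [] -> keep the rollout
```

Wiring such a gate into the pipeline (rather than a human watching dashboards) is the step from a Level 3 to a Level 4 pipeline.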

9-2. Prometheus-Based Automatic Rollback

# PrometheusRule - automatic rollback trigger
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-rollback-rules
spec:
  groups:
    - name: deployment-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
            > 0.05
          for: 2m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "Error rate above 5 percent for 2 minutes"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
            ) > 2
          for: 3m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "p99 latency above 2 seconds for 3 minutes"

9-3. SLO-Based Deployment Gates

Service level objectives (SLOs) become the criteria for deployment decisions: once the error budget is exhausted, deployments stop.

# Example SLO definition
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-app-slo
spec:
  service: my-app
  labels:
    team: platform
  slos:
    - name: requests-availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total[5m]))
      alerting:
        name: MyAppAvailability
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning

The error budget arithmetic works like this: with a 99.9% SLO, roughly 43 minutes of downtime are allowed per month (0.1% of 30 days ≈ 43.2 minutes). If 30 minutes have already been burned, the remaining ~13 minutes of budget are not worth risking on a dangerous deployment.
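That arithmetic is simple enough to express as a small helper, shown here purely for illustration:

```python
# Monthly error budget for an SLO, and what remains after downtime is burned.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Total allowed downtime per window, in minutes."""
    return (1 - slo) * days * 24 * 60

budget = error_budget_minutes(0.999)          # 99.9% over 30 days
remaining = budget - 30                       # after 30 minutes burned
print(round(budget, 1), round(remaining, 1))  # 43.2 13.2
```

The same helper shows why each extra nine is expensive: 99.99% leaves only ~4.3 minutes of budget per month, which effectively mandates automated rollback.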


10. In Practice: A Production Pipeline Architecture

10-1. The Full Architecture

A complete, production-grade CI/CD pipeline has the following structure.

Developer pushes code
    |
    v
[CI stage - GitHub Actions]
    |-- Check out code
    |-- Install dependencies (with caching)
    |-- Lint + format check
    |-- Unit tests (4 parallel shards)
    |-- Integration tests
    |-- SAST (SonarQube)
    |-- Secret scan (GitLeaks)
    |-- Dependency vulnerability scan (Snyk)
    |-- Container build (Kaniko)
    |-- Container scan (Trivy)
    |-- SBOM generation (Syft)
    |-- Image signing (Cosign)
    |-- Push to image registry
    |
    v
[CD stage - ArgoCD / GitOps]
    |-- Manifest repository auto-updated
    |-- ArgoCD sync
    |-- Auto-deploy to dev
    |-- Auto-deploy to staging
    |-- E2E tests (Playwright)
    |-- DAST (OWASP ZAP)
    |
    v
[Production deployment - Argo Rollouts]
    |-- Canary at 5%
    |-- Metric analysis (AnalysisTemplate)
    |-- Canary up to 20%
    |-- Metrics re-analyzed
    |-- Canary up to 50%
    |-- Final analysis
    |-- 100% promotion or automatic rollback
    |
    v
[Monitoring - Prometheus / Grafana]
    |-- SLO dashboards
    |-- Error budget tracking
    |-- Automatic rollback alerts

10-2. Pipeline Optimization Tips

Cache strategy: Caching dependency installs, Docker layers, and test results can cut pipeline time by 50% or more.

- uses: actions/cache@v4
  with:
    path: |
      ~/.npm
      node_modules
    key: deps-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      deps-

Conditional execution: Run only the jobs relevant to the changed files to avoid unnecessary builds.

- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      backend:
        - 'src/api/**'
        - 'src/models/**'
      frontend:
        - 'src/components/**'
        - 'src/pages/**'
      infra:
        - 'terraform/**'
        - 'k8s/**'

Parallelization: Run independent jobs in parallel. Lint, tests, and security scans do not depend on each other, so they can run concurrently.

10-3. Pipeline Metrics

The pipeline's own performance should be measured too.

Metric | Description | Target
Lead time | Time from commit to production deployment | Under 1 hour
Deployment frequency | Production deployments per day | 10+ per day
Change failure rate | Share of deployments rolled back | Under 5%
MTTR | Time to recover after an incident | Under 30 minutes
Pipeline duration | Total CI wall-clock time | Under 15 minutes
Test coverage | Code coverage ratio | 80% or higher

The first four are the DORA metrics (lead time, deployment frequency, change failure rate, MTTR). Elite-performing teams achieve top marks on all four.
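Two of the DORA metrics fall out directly from deployment records: lead time is the commit-to-deploy delta, and change failure rate is the share of deployments that were rolled back. The field names and sample data below are assumptions for the sketch, not a standard schema.

```python
# Illustrative DORA tracking from hypothetical deployment records.
from datetime import datetime

deployments = [
    {"commit_at": "2024-05-01T09:00", "deployed_at": "2024-05-01T09:40", "rolled_back": False},
    {"commit_at": "2024-05-01T11:00", "deployed_at": "2024-05-01T12:10", "rolled_back": True},
    {"commit_at": "2024-05-01T14:00", "deployed_at": "2024-05-01T14:30", "rolled_back": False},
]

def lead_time_minutes(d: dict) -> float:
    """Commit-to-production delta for one deployment, in minutes."""
    fmt = "%Y-%m-%dT%H:%M"
    delta = datetime.strptime(d["deployed_at"], fmt) - datetime.strptime(d["commit_at"], fmt)
    return delta.total_seconds() / 60

avg_lead = sum(lead_time_minutes(d) for d in deployments) / len(deployments)
change_failure_rate = sum(d["rolled_back"] for d in deployments) / len(deployments)
print(f"avg lead time: {avg_lead:.0f} min")               # avg lead time: 47 min
print(f"change failure rate: {change_failure_rate:.0%}")  # change failure rate: 33%
```

In practice these records come from the CI system's API or deployment events, so the metrics can be charted continuously rather than computed by hand.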


Wrapping Up

A CI/CD pipeline is not a mere automation tool; it is core infrastructure that determines software quality and development velocity. To summarize what this guide covered:

  1. Use the maturity model to locate your current level. Don't aim for Level 5 in one leap; build capability step by step.
  2. Design reusable pipelines. Standardize pipelines across the organization with GitHub Actions reusable workflows and composite actions.
  3. Adopt GitOps. Declarative deployment with a tool like ArgoCD gives you audit trails, automatic recovery, and consistency all at once.
  4. Build security into the pipeline. DevSecOps is not a separate stage; it is security woven into every stage of the pipeline.
  5. Make decisions based on metrics. Use SLOs and error budgets to manage deployment risk quantitatively.
  6. Track the DORA metrics. Keep improving the performance of the pipeline itself.

Production-grade CI/CD is not built overnight. But if you understand each component and adopt them incrementally, your team's delivery capability will be transformed.

Advanced CI/CD Pipeline Guide — GitHub Actions, ArgoCD, Tekton, and Security Pipelines

Introduction

Have you ever moved past understanding CI/CD as simply "code gets deployed automatically when pushed" and considered what a production-grade pipeline truly requires? In real production environments, dozens of stages must be orchestrated seamlessly: security scanning, static analysis, container image signing, multi-cluster deployments, and automated rollbacks.

This guide covers everything from CI/CD maturity models to advanced GitHub Actions patterns, ArgoCD GitOps, Tekton cloud-native pipelines, DevSecOps security pipelines, and advanced deployment strategies.


1. CI/CD Maturity Model

A five-level maturity model for objectively assessing your organization's CI/CD capabilities. Each level builds upon the previous one.

Level 1 - Manual

Developers perform builds and deployments manually. Building locally and deploying to servers via FTP or SCP. "It works on my machine" is a daily occurrence, and deployment cycles are often monthly.

Characteristics: Documented procedures may exist, but almost no automation. Human errors during deployment are frequent, and rollbacks involve manually redeploying previous versions.

Level 2 - Basic Automation

A CI server is introduced, and automatic builds and tests run on code pushes. Tools like Jenkins, GitHub Actions, or GitLab CI are used, but the pipeline is a simple linear build-test-deploy structure.

# Level 2: Basic pipeline example
name: Basic CI/CD
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm test
      - run: npm run build
      - run: ./deploy.sh

Level 3 - Standardized

Pipelines are standardized across the organization with reusable templates. Test coverage gates, code quality checks, and security scans are included by default. Environment-specific (dev, staging, prod) deployment pipelines are separated.

Level 4 - Advanced Automation

GitOps-based declarative deployments, canary/blue-green deployment strategies, automated security pipelines (SAST, DAST, container scanning), and SBOM generation with artifact signing are integrated into the pipeline. Deployment cycles shrink to sub-daily.

Level 5 - Self-Healing

Metric-based automatic rollbacks, SLO violation auto-recovery, ML-based anomaly detection, and production feedback automatically fed back into the pipeline. Deployments are fully automated so developers never think about deploying.

LevelDeploy CycleRollbackSecurityTesting
L1 ManualMonthlyManual redeployNoneManual
L2 BasicWeeklyManual triggerNoneAuto unit
L3 StandardDailyOne-clickBasic scansUnit + Integration
L4 AdvancedHourlyAuto canaryFull pipelineUnit + Integration + E2E
L5 HealingContinuousSLO-based autoContinuous validationIncludes chaos

2. Advanced GitHub Actions Patterns

Advanced patterns that go beyond GitHub Actions basics.

2-1. Reusable Workflows

To share identical pipeline logic across multiple repositories, use reusable workflows. The calling side references external workflows with the uses keyword.

# .github/workflows/reusable-build.yml (shared repository)
name: Reusable Build
on:
  workflow_call:
    inputs:
      node-version:
        required: false
        type: string
        default: '20'
      registry-url:
        required: true
        type: string
    secrets:
      npm-token:
        required: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          registry-url: ${{ inputs.registry-url }}
      - run: npm ci
        env:
          NODE_AUTH_TOKEN: ${{ secrets.npm-token }}
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: dist/
# Calling workflow
name: App CI
on: [push]
jobs:
  call-build:
    uses: my-org/shared-workflows/.github/workflows/reusable-build.yml@v2
    with:
      node-version: '20'
      registry-url: 'https://npm.pkg.github.com'
    secrets:
      npm-token: ${{ secrets.NPM_TOKEN }}

2-2. Composite Actions

A pattern for bundling multiple steps within a single action for reuse. Unlike reusable workflows, these are inserted as steps within a single job.

# .github/actions/setup-and-test/action.yml
name: 'Setup and Test'
description: 'Set up Node.js environment and run tests'
inputs:
  node-version:
    description: 'Node.js version'
    required: false
    default: '20'
runs:
  using: 'composite'
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}
        cache: 'npm'
    - run: npm ci
      shell: bash
    - run: npm test -- --coverage
      shell: bash
    - uses: actions/upload-artifact@v4
      with:
        name: coverage
        path: coverage/

2-3. Matrix Build Strategy

An advanced matrix configuration that tests multiple environment combinations in parallel while excluding unnecessary ones.

jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        node: [18, 20, 22]
        exclude:
          - os: macos-latest
            node: 18
        include:
          - os: ubuntu-latest
            node: 22
            experimental: true
    runs-on: ${{ matrix.os }}
    continue-on-error: ${{ matrix.experimental || false }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test

2-4. Cloud Authentication with OIDC

Instead of long-lived secrets (Access Keys), use OIDC (OpenID Connect) tokens to securely authenticate with cloud providers. AWS, GCP, and Azure all support this.

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-role
          aws-region: us-east-1
      - run: aws s3 sync ./dist s3://my-bucket
      - run: aws cloudfront create-invalidation --distribution-id E1234 --paths "/*"

The key advantage of OIDC is eliminating secret management. GitHub issues a short-lived token that AWS STS validates to issue temporary credentials. The token is only valid during workflow execution, making leaked tokens difficult to exploit.


3. GitOps and ArgoCD

3-1. GitOps Principles

GitOps is an operational model that uses Git as the single source of truth for managing the declarative state of infrastructure and applications. Four core principles define it.

Declarative configuration: Describe the desired system state declaratively. Define "what" you want, not "how" to achieve it.

Git as the source of truth: All changes go through Git, and Git history serves as the audit log.

Automatic application: When the declarative state in Git changes, it is automatically applied to the system.

Continuous reconciliation: An agent continuously compares actual state to desired state and corrects any drift.

3-2. ArgoCD Architecture

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes.

# ArgoCD Application resource
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests.git
    targetRevision: main
    path: apps/my-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

When syncPolicy.automated is configured, Git changes are automatically reflected in the cluster. selfHeal: true means that if someone manually modifies the cluster state, it automatically reverts to the Git state.

3-3. App of Apps Pattern

In large environments with dozens to hundreds of Applications to manage, the App of Apps pattern uses a single root Application that manages all the others.

# root-app.yaml - Root that manages other Applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/argocd-apps.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd

# apps/frontend.yaml - Child Application managed by root
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests.git
    path: apps/frontend/overlays/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: frontend
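
A typical repository layout for this pattern might look as follows (child file names besides `frontend.yaml` are illustrative):

```
argocd-apps/
  root-app.yaml        # applied once to bootstrap ArgoCD
  apps/                # every file here becomes a child Application
    frontend.yaml
    backend.yaml
    monitoring.yaml
```

Adding a new service then becomes a single file commit under `apps/`, which the root Application picks up automatically.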

3-4. Multi-Cluster Deployment with ApplicationSet

ApplicationSet is an ArgoCD feature that automatically generates multiple Applications from a single template, based on cluster lists, Git directories, or PR events.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-set
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: 'my-app-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/k8s-manifests.git
        targetRevision: main
        path: 'apps/my-app/overlays/{{metadata.labels.region}}'
      destination:
        server: '{{server}}'
        namespace: my-app

4. Tekton Pipelines

4-1. What is Tekton

Tekton is a Kubernetes-native CI/CD framework. Unlike GitHub Actions or Jenkins, all pipeline components are defined as Kubernetes Custom Resources (CRDs). Each task runs as a separate Pod, providing complete isolation and scalability.

Core components:

  • Task: An execution unit composed of one or more Steps. Maps to a single Pod.
  • Pipeline: A DAG (Directed Acyclic Graph) workflow connecting multiple Tasks.
  • TaskRun / PipelineRun: Execution instances of a Task or Pipeline. Trackable as Kubernetes resources.
  • Workspace: Volumes for sharing data between Tasks.
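
To make the DAG idea concrete, here is a minimal sketch (an illustration, not Tekton's scheduler) of how `runAfter` edges determine the order in which Tasks can start:

```python
# Kahn's algorithm over runAfter edges: a Task becomes runnable
# once all of the Tasks it runs after have completed.
from collections import deque

def execution_order(run_after):
    indegree = {task: len(deps) for task, deps in run_after.items()}
    dependents = {task: [] for task in run_after}
    for task, task_deps in run_after.items():
        for dep in task_deps:
            dependents[dep].append(task)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in dependents[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

# runAfter relations mirroring a fetch -> test -> build -> scan chain
deps = {"fetch-source": [], "run-tests": ["fetch-source"],
        "build-image": ["run-tests"], "security-scan": ["build-image"]}
print(execution_order(deps))
```

Tasks with no pending dependencies run in parallel, each in its own Pod.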

4-2. Task Definition

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-and-push
spec:
  params:
    - name: image-url
      type: string
    - name: image-tag
      type: string
      default: latest
  workspaces:
    - name: source
  steps:
    - name: build
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --dockerfile=Dockerfile
        - --context=$(workspaces.source.path)
        - --destination=$(params.image-url):$(params.image-tag)
        - --cache=true
        - --cache-ttl=24h

4-3. Pipeline Definition

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  params:
    - name: repo-url
      type: string
    - name: image-url
      type: string
  workspaces:
    - name: shared-workspace
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.repo-url)

    - name: run-tests
      taskRef:
        name: npm-test
      runAfter:
        - fetch-source
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: build-image
      taskRef:
        name: build-and-push
      runAfter:
        - run-tests
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: image-url
          value: $(params.image-url)

    - name: security-scan
      taskRef:
        name: trivy-scan
      runAfter:
        - build-image
      params:
        - name: image-url
          value: $(params.image-url)

4-4. Executing with PipelineRun

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: ci-pipeline-run-
spec:
  pipelineRef:
    name: ci-pipeline
  params:
    - name: repo-url
      value: https://github.com/my-org/my-app.git
    - name: image-url
      value: ghcr.io/my-org/my-app
  workspaces:
    - name: shared-workspace
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi

5. Security Pipeline (DevSecOps)

5-1. Security Pipeline Components

A production-grade security pipeline automatically performs the following stages.

| Stage | Tools | Purpose |
| --- | --- | --- |
| SAST (Static Analysis) | SonarQube, Semgrep, CodeQL | Detect source code vulnerabilities |
| SCA (Dependency Analysis) | Snyk, Dependabot, OWASP Dependency-Check | Detect open-source dependency vulnerabilities |
| Secret Scanning | GitLeaks, TruffleHog | Detect secrets embedded in code |
| Container Scanning | Trivy, Grype | Detect container image vulnerabilities |
| DAST (Dynamic Analysis) | OWASP ZAP, Nuclei | Detect running application vulnerabilities |
| SBOM Generation | Syft, CycloneDX | Generate software component inventory |
| Artifact Signing | Cosign, Notation | Ensure build artifact integrity |

5-2. SAST - SonarQube Integration

# SonarQube analysis in GitHub Actions
name: Security Pipeline
on: [push, pull_request]

jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: SonarSource/sonarqube-scan-action@v3
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
        with:
          args: >
            -Dsonar.projectKey=my-project
            -Dsonar.sources=src/
            -Dsonar.tests=tests/
            -Dsonar.coverage.exclusions=**/*.test.ts
      - uses: SonarSource/sonarqube-quality-gate-action@v1
        timeout-minutes: 5
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}

5-3. Container Image Scanning - Trivy

  container-scan:
    runs-on: ubuntu-latest
    needs: [build]
    steps:
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/my-org/my-app:latest'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'

Trivy scans both OS packages and language-specific dependencies. Setting exit-code: '1' causes the pipeline to fail when CRITICAL or HIGH vulnerabilities are found.
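
When a finding has been reviewed and accepted, Trivy can be told to skip it via a `.trivyignore` file in the scan context; the CVE IDs below are placeholders:

```
# .trivyignore - one accepted CVE ID per line
CVE-2023-12345
# No fixed version released yet; tracked separately
CVE-2024-0001
```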

5-4. SBOM Generation and Cosign Signing

SBOM (Software Bill of Materials) is a complete inventory of all components in the software. Following US Executive Order 14028, it has become an essential element of supply chain security.

  sbom-and-sign:
    runs-on: ubuntu-latest
    needs: [container-scan]
    permissions:
      id-token: write
      packages: write
    steps:
      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/my-org/my-app:latest
          format: spdx-json
          output-file: sbom.spdx.json

      - name: Install Cosign
        uses: sigstore/cosign-installer@v3

      - name: Sign Container Image
        run: |
          cosign sign --yes \
            ghcr.io/my-org/my-app:latest

      - name: Attach SBOM to Image
        run: |
          cosign attach sbom \
            --sbom sbom.spdx.json \
            ghcr.io/my-org/my-app:latest

Cosign is part of the Sigstore project and supports keyless signing. It uses OIDC tokens to sign images without requiring separate signing key management.

5-5. Secret Scanning - GitLeaks

  secret-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

6. Test Automation Strategy

6-1. The Test Pyramid

In production CI/CD, the types and proportions of tests matter.

Unit Tests (70%): Should be the fastest and most numerous. Test individual functions, methods, and components in isolation.

Integration Tests (20%): Verify that multiple modules work together correctly. Test interactions with external dependencies like databases, APIs, and message queues.

E2E Tests (10%): Validate user scenarios from start to finish. Slowest and most brittle, so only test core flows.

6-2. Parallel Testing and Test Splitting

Strategies for reducing execution time of large test suites.

jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run Tests (Shard ${{ matrix.shard }}/4)
        run: |
          npx jest --shard=${{ matrix.shard }}/4 \
            --ci --coverage --forceExit
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/

  merge-coverage:
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: coverage-*
          merge-multiple: true
      - name: Merge Coverage Reports
        run: npx istanbul-merge --out merged-coverage.json coverage-*/coverage-final.json

6-3. Playwright E2E Testing

  e2e:
    runs-on: ubuntu-latest
    needs: [deploy-staging]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Install Playwright Browsers
        run: npx playwright install --with-deps
      - name: Run E2E Tests
        run: npx playwright test --reporter=html
        env:
          BASE_URL: https://staging.my-app.com
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: playwright-report
          path: playwright-report/

7. Advanced Deployment Strategies

7-1. Canary Deployments with Argo Rollouts

Argo Rollouts is a controller that implements advanced deployment strategies in Kubernetes. It provides a Rollout resource that replaces the standard Deployment.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: my-app-vsvc
              routes:
                - primary
      steps:
        - setWeight: 5
        - pause:
            duration: 5m
        - setWeight: 20
        - pause:
            duration: 5m
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 80
        - pause:
            duration: 5m
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: my-app-canary
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0
          ports:
            - containerPort: 8080

7-2. AnalysisTemplate - Metric-Based Automated Decisions

During canary deployment, query Prometheus metrics to automatically decide whether to promote or abort (rollback).

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))

If three analysis measurements show a success rate below 95%, the rollout is automatically aborted and rolled back. This metric-driven decision-making is the core of Level 5 self-healing pipelines.
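
The gating logic can be sketched as follows (a conceptual illustration of `successCondition` and `failureLimit`, not Argo Rollouts' actual code):

```python
# One success-rate measurement per interval; accumulating failureLimit
# failed measurements aborts the canary, otherwise it is promoted.
def analysis_verdict(success_rates, threshold=0.95, failure_limit=3):
    failures = 0
    for rate in success_rates:
        if rate < threshold:        # successCondition not met
            failures += 1
            if failures >= failure_limit:
                return "rollback"
    return "promote"

print(analysis_verdict([0.99, 0.97, 0.96]))        # healthy canary
print(analysis_verdict([0.99, 0.90, 0.91, 0.89]))  # degraded canary
```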

7-3. Blue-Green Deployment

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app-bluegreen
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
      scaleDownDelaySeconds: 300
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0

In blue-green deployment, after the new version (green) is fully ready, all traffic switches at once. prePromotionAnalysis runs smoke tests before the switch, and scaleDownDelaySeconds keeps the old version for a period to enable quick rollback.
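
The `activeService` and `previewService` referenced above are ordinary Services; a minimal sketch (ports are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app   # Rollouts adds a rollouts-pod-template-hash selector at runtime
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```

The preview Service always points at the new ReplicaSet, so smoke tests can hit it before any user traffic switches.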

7-4. Traffic Mirroring (Shadow Traffic)

Traffic mirroring copies real production requests to the new version so it can be exercised under actual load; the mirrored responses are discarded and never reach users.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app.example.com
  http:
    - route:
        - destination:
            host: my-app-stable
            port:
              number: 80
      mirror:
        host: my-app-canary
        port:
          number: 80
      mirrorPercentage:
        value: 100.0

8. Multi-Environment Management

8-1. Environment Structure Design

Production pipelines operate a minimum of three environments.

  • dev: Feature branch deployments for developers. Instability is acceptable.
  • staging: Main branch deployments. Must have identical configuration to production.
  • production: The environment serving real user traffic.

8-2. Kustomize Overlays

Kustomize is a tool for customizing Kubernetes manifests per environment. It applies environment-specific overlays on top of a base configuration.

k8s/
  base/
    deployment.yaml
    service.yaml
    kustomization.yaml
  overlays/
    dev/
      kustomization.yaml
      replica-patch.yaml
    staging/
      kustomization.yaml
      replica-patch.yaml
    production/
      kustomization.yaml
      replica-patch.yaml
      hpa.yaml

# k8s/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml

# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - hpa.yaml
patches:
  - path: replica-patch.yaml
namePrefix: prod-
commonLabels:
  env: production

# k8s/overlays/production/replica-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: my-app
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

8-3. Helm Values Per Environment

# values-dev.yaml
replicaCount: 1
image:
  tag: latest
resources:
  requests:
    cpu: 100m
    memory: 128Mi
ingress:
  host: dev.my-app.internal
autoscaling:
  enabled: false

# values-production.yaml
replicaCount: 5
image:
  tag: v2.0.0
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
ingress:
  host: my-app.example.com
  tls: true
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPU: 70

# Deploy per environment
helm upgrade --install my-app ./chart \
  -f values-production.yaml \
  --namespace production \
  --wait --timeout 5m

9. Monitoring and Rollback

9-1. Post-Deployment Monitoring Checklist

Core metrics to verify immediately after deployment.

Immediate check (0-5 minutes):

  • Pod status: are all Pods in the Running/Ready state?
  • Error logs: have exceptions surged in the new version?
  • Health checks: are readiness/liveness probes passing?

Short-term check (5-30 minutes):

  • Response time: are p50, p95, and p99 latencies comparable to the previous version?
  • Error rate: is the 5xx rate within threshold?
  • Throughput: is request throughput in the expected range?

Medium-term check (30 minutes to hours):

  • Memory usage: any signs of memory leaks?
  • CPU usage: is CPU utilization stable?
  • Business metrics: are orders, conversion rates, and other key indicators normal?

9-2. Prometheus-Based Automatic Rollback

# PrometheusRule - Automatic rollback trigger
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-rollback-rules
spec:
  groups:
    - name: deployment-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
            > 0.05
          for: 2m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "Error rate above 5 percent for 2 minutes"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
            ) > 2
          for: 3m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "p99 latency above 2 seconds for 3 minutes"

9-3. SLO-Based Deployment Gates

Use Service Level Objectives (SLOs) as the basis for deployment decisions. When the error budget is exhausted, halt deployments.

# SLO definition example
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-app-slo
spec:
  service: my-app
  labels:
    team: platform
  slos:
    - name: requests-availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total[5m]))
      alerting:
        name: MyAppAvailability
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning

Error budget calculation works as follows: a 99.9% SLO allows roughly 43 minutes of downtime per 30-day month. If 30 minutes have already been consumed, only about 13 minutes of budget remain, so risky deployments should be deferred until the budget recovers.
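
The arithmetic can be spelled out as a quick sketch:

```python
# Error budget for a given SLO over a 30-day window.
def error_budget_minutes(slo_percent, window_days=30):
    window_minutes = window_days * 24 * 60          # 43,200 minutes in 30 days
    return (1 - slo_percent / 100) * window_minutes

budget = error_budget_minutes(99.9)   # ~43.2 minutes allowed per month
remaining = budget - 30               # 30 minutes already consumed
print(round(budget, 1), round(remaining, 1))
```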


10. Production Pipeline Architecture in Practice

10-1. Overall Architecture

A complete production-grade CI/CD pipeline follows this structure.

Developer Code Push
    |
    v
[CI Stage - GitHub Actions]
    |-- Code checkout
    |-- Dependency install (with caching)
    |-- Lint + format check
    |-- Unit tests (parallel 4 shards)
    |-- Integration tests
    |-- SAST (SonarQube)
    |-- Secret scanning (GitLeaks)
    |-- Dependency vulnerability scan (Snyk)
    |-- Container build (Kaniko)
    |-- Container scan (Trivy)
    |-- SBOM generation (Syft)
    |-- Image signing (Cosign)
    |-- Image registry push
    |
    v
[CD Stage - ArgoCD / GitOps]
    |-- Manifest repo auto-update
    |-- ArgoCD sync
    |-- dev auto-deploy
    |-- staging auto-deploy
    |-- E2E tests (Playwright)
    |-- DAST (OWASP ZAP)
    |
    v
[Production Deploy - Argo Rollouts]
    |-- Canary 5% deploy
    |-- Metric analysis (AnalysisTemplate)
    |-- Canary increase to 20%
    |-- Re-analysis
    |-- Canary increase to 50%
    |-- Final analysis
    |-- 100% promotion or automatic rollback
    |
    v
[Monitoring - Prometheus / Grafana]
    |-- SLO dashboard
    |-- Error budget tracking
    |-- Automatic rollback alerts

10-2. Pipeline Optimization Tips

Caching strategy: cache dependency installs, Docker layers, and test results; on dependency-heavy projects this alone can cut pipeline time roughly in half.

- uses: actions/cache@v4
  with:
    path: |
      ~/.npm
      node_modules
    key: deps-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      deps-

Conditional execution: Run only relevant tasks based on changed files to prevent unnecessary builds.

- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      backend:
        - 'src/api/**'
        - 'src/models/**'
      frontend:
        - 'src/components/**'
        - 'src/pages/**'
      infra:
        - 'terraform/**'
        - 'k8s/**'

Parallelization: Run independent tasks in parallel. Linting, testing, and security scans do not depend on each other and can run concurrently.
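
A sketch of this fan-out/fan-in shape in GitHub Actions (job and script names are illustrative):

```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
  build:
    needs: [lint, test, security]   # fan-in: waits for all three parallel jobs
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run build
```

Jobs without a `needs` relationship start concurrently by default, so total wall-clock time is bounded by the slowest branch, not the sum.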

10-3. Pipeline Metrics

The pipeline itself must also be measured.

| Metric | Description | Target |
| --- | --- | --- |
| Lead Time | Time from commit to production deploy | Under 1 hour |
| Deploy Frequency | Production deploys per day | 10+ per day |
| Change Failure Rate | Percentage of deploys requiring rollback | Under 5% |
| MTTR | Time to recover from incidents | Under 30 minutes |
| Pipeline Execution Time | Total CI time | Under 15 minutes |
| Test Coverage | Code coverage percentage | Over 80% |

The first four rows are the DORA metrics (Lead Time, Deploy Frequency, Change Failure Rate, MTTR). Elite-performing teams in the DORA research score at the top level on all four.
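
As a sketch, two of these metrics derived from a hypothetical deploy log:

```python
# Hypothetical 3-day deploy log: each entry is one production deploy.
deploys = [
    {"day": 1, "rolled_back": False},
    {"day": 1, "rolled_back": False},
    {"day": 2, "rolled_back": True},
    {"day": 2, "rolled_back": False},
    {"day": 3, "rolled_back": False},
]

deploy_frequency = len(deploys) / 3   # deploys per day over the 3-day window
change_failure_rate = sum(d["rolled_back"] for d in deploys) / len(deploys)
print(f"{deploy_frequency:.2f} deploys/day, change failure rate {change_failure_rate:.0%}")
```

In practice these numbers come from your deployment tooling's audit data rather than a hand-written list.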


Conclusion

A CI/CD pipeline is not just an automation tool but critical infrastructure that determines software quality and development velocity. Here is a summary of everything covered.

  1. Assess your current level against the maturity model. Do not aim for Level 5 all at once; build capabilities incrementally.
  2. Design reusable pipelines. Use GitHub Actions reusable workflows and Composite Actions to standardize pipelines across your organization.
  3. Adopt GitOps. Implementing declarative deployments with tools like ArgoCD gives you audit trails, automatic recovery, and consistency.
  4. Embed security in the pipeline. DevSecOps is not a separate stage but security woven into every stage of the pipeline.
  5. Make metric-based decisions. Use SLOs and error budgets to quantitatively manage deployment risk.
  6. Track DORA metrics. Continuously improve the performance of the pipeline itself.

Production-grade CI/CD is not built overnight. But by understanding each component and introducing them incrementally, your team's deployment capabilities will improve dramatically.