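CI/CD Best Practices 2025: Pipeline Design, Automation, and Security for Teams

Introduction
1. CI/CD in 2025
2. CI/CD Platform Comparison
3. Pipeline Design Principles
4. Testing Strategy in CI
5. Docker Build Optimization
6. GitOps and ArgoCD
7. CI/CD Security
8. Deployment Strategies Compared
9. Rollback Strategies
10. Pipeline Health Monitoring
11. Putting It Together: A Pipeline Example
- 11.1 Full-Stack CI/CD Pipeline
12. Interview Questions
- Basic Concepts
- Advanced Questions
13. Quiz
14. References

Introduction

In 2025, CI/CD is no longer optional; it is essential. According to Google's DORA (DevOps Research and Assessment) report, Elite-level teams deploy multiple times per day while keeping their change failure rate below 5%. Low-level teams, by contrast, deploy less than once every six months and see change failure rates of 46~60%.

The core of this gap lies in pipeline design. Adopting a CI/CD tool is not enough; teams need a comprehensive strategy that covers test automation, security integration, progressive delivery, and observability.

This article covers what you need to design and operate CI/CD pipelines, with a practical focus: platform comparison, pipeline design principles, testing strategy, Docker build optimization, GitOps, security, deployment strategies, rollback, and monitoring.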

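1. CI/CD in 2025

1.1 Team Performance Through DORA Metrics

The DORA metrics are four key indicators of software delivery performance.

Metric	Elite	High	Medium	Low
Deployment frequency	Multiple times per day	Weekly to monthly	Monthly to every 6 months	Less than once per 6 months
Lead time (commit to deploy)	Under 1 hour	1 day~1 week	1 week~1 month	1~6 months
Change failure rate	0~5%	5~10%	10~15%	46~60%
Time to restore (MTTR)	Under 1 hour	Under 1 day	1 day~1 week	Over 6 months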
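
The four metrics can be derived from plain deployment records. The following is a minimal sketch under stated assumptions: the `Deployment` record and its field names are hypothetical and not part of any DORA tooling, and the sample data is made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # first commit in the change
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # caused an incident or rollback
    restored_at: Optional[datetime] = None   # when service was restored


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics over a window of deployments."""
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures if d.restored_at is not None
    ]
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours) if lead_hours else None,
        "change_failure_rate": len(failures) / len(deploys) if deploys else None,
        "mean_time_to_restore_hours": (
            sum(restore_hours) / len(restore_hours) if restore_hours else None
        ),
    }


t0 = datetime(2025, 3, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(minutes=30)),
    Deployment(t0, t0 + timedelta(minutes=90), failed=True,
               restored_at=t0 + timedelta(minutes=120)),
]
metrics = dora_metrics(sample, window_days=1)
print(metrics)
```

In practice the records would come from the deployment tool and the incident tracker; the point here is only that each metric is a simple aggregate once those two sources are joined.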

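1.2 Shift Left Strategy

Shift Left means moving testing and security checks to earlier stages of development.

Traditional approach:
Code → Build → Test → Security → Deploy → Monitor
                              ↑ Problems found here

Shift Left:
Code + Test + Security → Build → Deploy → Monitor
↑ Problems found here (roughly 10x cheaper to fix)

Key principles:

- Pre-commit validation: lint, format, and secret scanning via pre-commit hooks
- PR-stage testing: unit tests, integration tests, and SAST run automatically
- Build-time security: container image scanning and dependency vulnerability checks
- Pre-deploy verification: smoke tests and canary analysis

1.3 Key Trends in 2025

- Platform engineering: standardizing CI/CD through developer self-service platforms
- AI-assisted CI/CD: test failure prediction, automated rollback decisions, flaky test detection
- eBPF-based observability: a new approach to monitoring pipeline performance
- Supply chain security: software supply chain protection built on SBOM, SLSA, and Sigstore

2. CI/CD Platform Comparison

2.1 Major Platforms at a Glance

Feature	GitHub Actions	Jenkins	GitLab CI	CircleCI
Hosting	SaaS/Self-hosted	Self-hosted	SaaS/Self-hosted	SaaS
Config	YAML	Groovy/YAML	YAML	YAML
Ecosystem	Marketplace 15,000+	Plugins 1,800+	Built-in integrations	Orbs 3,000+
Container support	Native	Plugin	Native	Native
Self-hosted runners	Supported	Default	Supported	Supported
Pricing	2,000 min free/mo	Free (OSS)	400 min free/mo	Up to 6,000 build min free/mo (30,000 credits)
Learning curve	Low	High	Medium	Low
Caching	10GB/repo	Plugin	Native	Native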

2.2 GitHub Actions

# .github/workflows/ci.yml
name: CI Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4

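  build:
    runs-on: ubuntu-latest
    needs: test
    permissions:
      contents: read
      packages: write  # needed to push to GHCR with GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # Pushing needs a registry login and a registry-qualified image name.
      # ghcr.io/myorg/myapp is a placeholder for your own registry path.
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name == 'push' }}  # build only on PRs, push on main
          tags: ghcr.io/myorg/myapp:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max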

2.3 Jenkins Pipeline

// Jenkinsfile (Declarative Pipeline)
pipeline {
    agent any

    environment {
        DOCKER_REGISTRY = 'registry.example.com'
        IMAGE_NAME = 'myapp'
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
                // Install dependencies once so the lint and test stages can run
                sh 'npm ci'
            }
        }

        stage('Lint & Test') {
            parallel {
                stage('Lint') {
                    steps {
                        sh 'npm run lint'
                    }
                }
                stage('Unit Test') {
                    steps {
                        sh 'npm test -- --coverage'
                    }
                    post {
                        always {
                            junit 'reports/junit.xml'
                            publishHTML([
                                reportDir: 'coverage',
                                reportFiles: 'index.html',
                                reportName: 'Coverage Report'
                            ])
                        }
                    }
                }
            }
        }

        stage('Build & Push') {
            steps {
                script {
                    def image = docker.build("${DOCKER_REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}")
                    docker.withRegistry("https://${DOCKER_REGISTRY}", 'registry-credentials') {
                        image.push()
                        image.push('latest')
                    }
                }
            }
        }
    }

    post {
        failure {
            slackSend(
                channel: '#ci-alerts',
                color: 'danger',
                message: "Build FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
            )
        }
    }
}

2.4 GitLab CI

# .gitlab-ci.yml
stages:
  - lint
  - test
  - build
  - deploy

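variables:
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_CERTDIR: "/certs"  # port 2376 is the TLS port of the dind service

lint:
  stage: lint
  image: node:20-alpine
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm/
  script:
    # npm ci deletes node_modules, so cache npm's download cache instead
    - npm ci --cache .npm --prefer-offline
    - npm run lint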

test:
  stage: test
  image: node:20-alpine
  parallel: 4
  script:
    - npm ci
    - npm test -- --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL
  coverage: '/Statements\s*:\s*(\d+\.?\d*)%/'
  artifacts:
    reports:
      junit: reports/junit.xml
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml

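build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    # Log in to the project's GitLab Container Registry before pushing
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
  rules:
    - if: $CI_COMMIT_BRANCH == "main"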

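3. Pipeline Design Principles

3.1 Fast Feedback Loops

The time a developer spends waiting for results after opening a PR directly affects productivity.

Target times:
├── Lint + format check: under 30 seconds
├── Unit tests: under 2 minutes
├── Integration tests: under 5 minutes
├── Build: under 3 minutes
└── Whole pipeline: under 10 minutes

Typical (before optimization): 30+ minutes
Typical (after optimization): 8~10 minutes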

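3.2 Parallel Execution

# Example of a parallel pipeline
jobs:
  # Stage 1: lint and security scan are independent, so they run in parallel
  lint:
    runs-on: ubuntu-latest
    # ...
  security-scan:
    runs-on: ubuntu-latest
    # ...

  # Stage 2: tests are split across shards and run in parallel
  test:
    needs: [lint]
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    # ...

  # Stage 3: build only after tests and the security scan pass
  build:
    needs: [test, security-scan]
    # ...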
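
The gain from parallelism is bounded by the longest chain of dependent jobs. The sketch below computes that critical path for the job layout above; the durations are assumed numbers chosen for illustration, not measurements.

```python
from functools import lru_cache

# Hypothetical job durations in minutes and their `needs` edges,
# mirroring the lint / security-scan / test / build layout above.
DURATION = {"lint": 0.5, "security-scan": 2.0, "test": 3.0, "build": 3.0}
NEEDS = {
    "lint": [],
    "security-scan": [],
    "test": ["lint"],
    "build": ["test", "security-scan"],
}


@lru_cache(maxsize=None)
def finish_time(job: str) -> float:
    """Earliest finish time of a job: its duration plus its slowest dependency."""
    start = max((finish_time(dep) for dep in NEEDS[job]), default=0.0)
    return start + DURATION[job]


serial_total = sum(DURATION.values())
parallel_total = max(finish_time(job) for job in DURATION)
print(f"serial: {serial_total} min, parallel: {parallel_total} min")
```

With these numbers the security scan finishes before the tests do, so it adds nothing to the wall-clock time; shortening the lint → test → build chain is what would speed the pipeline up.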

3.3 Caching Strategy

# npm caching (GitHub Actions)
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-

# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

# Gradle caching
- uses: actions/cache@v4
  with:
    path: |
      ~/.gradle/caches
      ~/.gradle/wrapper
    key: gradle-${{ hashFiles('**/*.gradle*') }}
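
The idea behind these keys is that the cache is reused only while the dependency manifest is byte-for-byte unchanged. The sketch below illustrates that with a content hash; it is not the algorithm GitHub's `hashFiles` uses internally, and `cache_key` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(prefix: str, root: Path, pattern: str) -> str:
    """Build a cache key from the contents of every file matching pattern."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob(pattern)):  # sorted for a stable order
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    lock = root / "package-lock.json"
    lock.write_text('{"lockfileVersion": 3}')
    key_before = cache_key("npm", root, "package-lock.json")
    key_same = cache_key("npm", root, "package-lock.json")
    # Changing the lockfile must produce a different key (a cache miss).
    lock.write_text('{"lockfileVersion": 3, "packages": {}}')
    key_after = cache_key("npm", root, "package-lock.json")

print(key_before == key_same, key_before == key_after)
```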

3.4 Idempotency

A pipeline should always produce the same result for the same input.

# Bad: timestamp-based tag (a re-run produces a different result)
# IMAGE_TAG: my-app:build-20250323-142000

# Good: commit-SHA-based tag (always identical)
# IMAGE_TAG: my-app:abc1234

# Good: semantic version (deterministic)
# IMAGE_TAG: my-app:v1.2.3
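
A small helper makes the rule concrete: the tag is a pure function of the commit, so re-running the pipeline for the same commit yields the same image reference. `image_tag` is a hypothetical helper written for this illustration.

```python
import re


def image_tag(repository: str, commit_sha: str, length: int = 7) -> str:
    """Derive a deterministic image reference from a full Git commit SHA."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError(f"expected a 40-character hex SHA, got {commit_sha!r}")
    return f"{repository}:{commit_sha[:length]}"


sha = "abc1234" + "0" * 33      # stand-in for a real 40-character commit SHA
first_run = image_tag("my-app", sha)
rerun = image_tag("my-app", sha)  # same input, same tag
print(first_run)
```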

4. Testing Strategy in CI

4.1 The Test Pyramid

          /   E2E   \          Slow, but high confidence
         /  (5~10%)  \
        / Integration  \       Medium speed, medium confidence
       /   (15~25%)     \
      /    Unit Tests     \    Fast and numerous
     /     (65~80%)        \
    /________________________\

4.2 Test Splitting

# Jest test sharding
test:
  strategy:
    matrix:
      shard: [1, 2, 3, 4]
  steps:
    - run: npx jest --shard=${{ matrix.shard }}/4

# Cypress parallel execution
# (record + parallel need a Cypress Cloud record key; the secret name is an assumption)
e2e:
  strategy:
    matrix:
      container: [1, 2, 3]
  steps:
    - uses: cypress-io/github-action@v6
      env:
        CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
      with:
        record: true
        parallel: true
        group: 'e2e-tests'
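
Sharding works because every runner computes the same partition independently and then runs only its own slice. The sketch below shows one simple way to do that (round-robin over sorted paths); it illustrates the idea and is not the algorithm Jest or Cypress uses.

```python
def shard_files(test_files: list[str], shard_index: int, shard_count: int) -> list[str]:
    """Return the files for one shard (1-based index), round-robin over sorted paths."""
    if not 1 <= shard_index <= shard_count:
        raise ValueError("shard_index must be between 1 and shard_count")
    ordered = sorted(test_files)  # every runner must see the same order
    return ordered[shard_index - 1::shard_count]


files = [f"tests/test_{name}.py" for name in "abcdefghij"]
shards = [shard_files(files, i, 4) for i in range(1, 5)]
for i, shard in enumerate(shards, start=1):
    print(i, shard)
```

Splitting by file count balances poorly when a few files dominate the runtime; splitting by recorded test timings is the usual refinement.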

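4.3 Managing Flaky Tests

A flaky test is a test that sometimes passes and sometimes fails on the same code.

// Flaky test retry and reporting setup
// jest.config.js
module.exports = {
  // Retries are not a config key; they are enabled with jest.retryTimes() in a setup file
  setupFilesAfterEnv: ['<rootDir>/jest.setup.js'],

  // Flaky test reporter (package name kept from the original example;
  // verify it exists or substitute the reporter your team uses)
  reporters: [
    'default',
    ['jest-flaky-reporter', {
      outputFile: 'flaky-tests.json',
      threshold: 3  // report tests that flake 3 or more times
    }]
  ]
};

// jest.setup.js
// Retry each failing test up to 2 more times (requires the default jest-circus runner)
jest.retryTimes(2);

# Isolating flaky tests in CI
test-stable:
  runs-on: ubuntu-latest
  steps:
    - run: npx jest --testPathIgnorePatterns="flaky"

test-flaky:
  runs-on: ubuntu-latest
  continue-on-error: true  # the pipeline continues even if this job fails
  steps:
    # Jest has no retry CLI flag; retries come from jest.retryTimes() above
    - run: npx jest --testPathPattern="flaky"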
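
Detection can be automated from CI history: a test that both passed and failed on the same commit is flaky by definition, whereas a test whose result changed between commits may simply have been fixed or broken. The record format below is an assumption made for this sketch.

```python
from collections import defaultdict


def find_flaky(runs: list[tuple[str, str, bool]]) -> set[str]:
    """Return tests that both passed and failed on the same commit.

    Each run is (test_name, commit_sha, passed).
    """
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    return {test for (test, _), seen in outcomes.items() if seen == {True, False}}


history = [
    ("checkout.spec", "abc1234", True),
    ("checkout.spec", "abc1234", False),   # same commit, different result: flaky
    ("login.spec", "abc1234", False),
    ("login.spec", "def5678", True),       # different commits: a real change, not flaky
    ("cart.spec", "abc1234", True),
]
flaky = find_flaky(history)
print(flaky)
```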

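4.4 Test Coverage Gates

# Enforce a coverage threshold
test:
  steps:
    # The json-summary reporter is what writes coverage/coverage-summary.json
    - run: npx jest --coverage --coverageReporters=json-summary --coverageReporters=text
    - name: Check coverage threshold
      run: |
        COVERAGE=$(jq '.total.statements.pct' coverage/coverage-summary.json)
        if (( $(echo "$COVERAGE < 80" | bc -l) )); then
          echo "Coverage $COVERAGE% is below 80% threshold"
          exit 1
        fi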

5. Docker Build Optimization

5.1 Multi-Stage Builds

# Stage 1: install all dependencies (the build step needs devDependencies too)
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

# Stage 2: build
FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
# Build, then drop devDependencies so only runtime packages reach the final image
RUN npm run build && npm prune --omit=dev

# Stage 3: production image
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

# Security: run as a non-root user
RUN addgroup --system --gid 1001 nodejs && \
    adduser --system --uid 1001 nextjs

COPY --from=builder --chown=nextjs:nodejs /app/.next ./.next
COPY --from=builder --chown=nextjs:nodejs /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER nextjs
EXPOSE 3000
CMD ["npm", "start"]

5.2 Layer Cache Optimization

# Bad: any source change re-runs npm ci
COPY . .
RUN npm ci
RUN npm run build

# Good: copy only the dependency manifests first
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

5.3 BuildKit and Buildx

# Using BuildKit in GitHub Actions
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
    platforms: linux/amd64,linux/arm64

# Using BuildKit locally
# (BuildKit is the default builder since Docker Engine 23.0; on older versions enable it explicitly)
# DOCKER_BUILDKIT=1 docker build .

5.4 Kaniko (Building Without a Docker Daemon)

# Build an image with Kaniko on Kubernetes
# (pushing also needs registry credentials mounted at /kaniko/.docker/config.json)
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  restartPolicy: Never  # a build should run once, not restart after it exits
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - "--dockerfile=Dockerfile"
        - "--context=git://github.com/myorg/myapp"
        - "--destination=registry.example.com/myapp:latest"
        - "--cache=true"
        - "--cache-repo=registry.example.com/myapp/cache"

5.5 Image Size Optimization

Approximate image sizes:
├── node:20          → 1.1GB
├── node:20-slim     → 220MB
├── node:20-alpine   → 140MB
├── distroless/nodejs → 120MB
└── Optimized multi-stage build → 80~100MB

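6. GitOps and ArgoCD

6.1 GitOps Principles

GitOps is an operating model that uses a Git repository as the single source of truth.

GitOps workflow:
1. A developer pushes a change to Git
2. CI builds and tests the image
3. CI updates the image tag in the deployment manifests
4. ArgoCD compares Git with the cluster state
5. If they differ, it syncs automatically (or waits for manual approval)
6. The cluster matches the state declared in Git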

┌────────┐    Push     ┌────────┐   Detect   ┌────────┐
│  Dev   │ ──────────> │  Git   │ <────────> │ ArgoCD │
└────────┘             └────────┘            └───┬────┘
                                                 │ Sync
                                            ┌────▼────┐
                                            │  K8s    │
                                            │ Cluster │
                                            └─────────┘

6.2 The ArgoCD App of Apps Pattern

# apps/root-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops-config
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

# apps/api-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops-config
    path: services/api
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

6.3 Argo Rollouts (Progressive Delivery)

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api-vsvc
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 30
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 60
        - pause:
            duration: 5m
        - setWeight: 100

# AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"2.*",app="api-service",version="canary"}[5m]))
            /
            sum(rate(http_requests_total{app="api-service",version="canary"}[5m]))
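
The control flow that the Rollout and AnalysisTemplate describe can be sketched in a few lines: raise the weight step by step, and abort the first time the measured success rate misses the threshold. This is a simplified model for illustration, not how the Argo Rollouts controller is implemented; the request counts are made up, and treating "no traffic" as a failed measurement is an assumption of the sketch.

```python
def success_rate(ok_requests: float, total_requests: float) -> float:
    """Share of successful requests; no traffic counts as a failed measurement."""
    return ok_requests / total_requests if total_requests > 0 else 0.0


def run_canary(weights: list[int], measurements: list[tuple[float, float]],
               threshold: float = 0.95) -> tuple[str, int]:
    """Walk the weight steps; abort at the first measurement below the threshold."""
    for weight, (ok, total) in zip(weights, measurements):
        if success_rate(ok, total) < threshold:
            return ("aborted", weight)   # traffic goes back to the stable version
    return ("promoted", 100)


steps = [10, 30]  # the two weight steps that are followed by an analysis above
healthy = run_canary(steps, [(990, 1000), (2950, 3000)])
degraded = run_canary(steps, [(990, 1000), (2700, 3000)])
print(healthy, degraded)
```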

7. CI/CD Security

7.1 Integrating Security Scans

CI/CD security layers:
┌─────────────────────────────────────────────┐
│ Layer 1: Pre-commit                         │
│  - Secret scanning (gitleaks, detect-secrets)│
│  - Lint (security rules)                    │
├─────────────────────────────────────────────┤
│ Layer 2: PR / Build                         │
│  - SAST (Semgrep, CodeQL, SonarQube)        │
│  - SCA (Dependabot, Snyk, Trivy)            │
│  - License compliance                       │
├─────────────────────────────────────────────┤
│ Layer 3: Container Build                    │
│  - Image scanning (Trivy, Grype)            │
│  - Base image policy (distroless, alpine)   │
│  - SBOM generation (Syft)                   │
├─────────────────────────────────────────────┤
│ Layer 4: Deploy                             │
│  - Policy enforcement (OPA/Kyverno)         │
│  - Signing (cosign, Sigstore)               │
│  - Runtime security (Falco)                 │
└─────────────────────────────────────────────┘

7.2 Secret Management

# Using secrets in GitHub Actions
deploy:
  steps:
    - name: Deploy
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        # --force-new-deployment restarts the tasks; without a changed parameter
        # update-service would otherwise leave the running tasks untouched
        aws ecs update-service --cluster prod --service api --force-new-deployment

# OIDC-based authentication (no long-lived secrets; recommended)
permissions:
  id-token: write
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions
      aws-region: ap-northeast-2

7.3 SBOM and Supply Chain Security

# Generate an SBOM with Syft
- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: myapp:latest
    format: spdx-json
    output-file: sbom.spdx.json

# Sign the image with cosign
- name: Sign image
  run: |
    cosign sign --key env://COSIGN_PRIVATE_KEY myapp:latest

# Verify the signature with cosign
- name: Verify signature
  run: |
    cosign verify --key cosign.pub myapp:latest

7.4 Automating Secret Scanning

# pre-commit configuration
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks

# Running gitleaks in CI
security:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0  # full history, so older commits are scanned too
    - uses: gitleaks/gitleaks-action@v2
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

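8. Deployment Strategies Compared

8.1 Strategy Comparison

Strategy	Downtime	Risk	Resource cost	Rollback speed	Complexity
Recreate	Yes	High	1x	Slow	Low
Rolling Update	None	Medium	1x~1.25x	Medium	Low
Blue-Green	None	Low	2x	Immediate	Medium
Canary	None	Very low	1.1x	Immediate	High
A/B Testing	None	Very low	1.1x	Immediate	Very high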

8.2 Blue-Green Deployment

# Kubernetes Blue-Green deployment
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
    version: green  # switched from blue to green
  ports:
    - port: 80
      targetPort: 8080

Blue-Green cutover:
1. Blue (v1) is serving traffic → deploy Green (v2)
2. Run health checks and smoke tests against Green
3. Switch the Service selector to Green
4. If problems appear, roll back to Blue immediately
5. Once stable, clean up the Blue resources

[Users] → [LB] → [Blue v1] ✓ Active
                  [Green v2] ← Being prepared

[Users] → [LB] → [Blue v1] ← Standby
                  [Green v2] ✓ Active

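8.3 Canary Deployment

# Canary deployment with an Istio VirtualService
# (the stable and canary subsets must be defined in a matching DestinationRule)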
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
spec:
  hosts:
    - api-service
  http:
    - route:
        - destination:
            host: api-service
            subset: stable
          weight: 90
        - destination:
            host: api-service
            subset: canary
          weight: 10

8.4 Feature Flags

// LaunchDarkly or an in-house feature flag system
import { featureFlags } from './feature-flags';

async function handleRequest(req: Request) {
  const userId = req.user.id;

  if (await featureFlags.isEnabled('new-checkout-flow', userId)) {
    return newCheckoutFlow(req);
  }

  return legacyCheckoutFlow(req);
}

Feature-flag-based release:
1. Ship the new feature wrapped in a flag
2. Enable it for internal users only
3. Gradually widen the rollout (1% → 5% → 25% → 100%)
4. If problems appear, turning the flag off disables the feature immediately
5. Deployment and release are decoupled
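
The gradual rollout in step 3 is usually implemented with stable hashing, so that a given user stays in the same bucket as the percentage grows. The following is a minimal sketch of that technique; `is_enabled` is a hypothetical helper and does not reflect LaunchDarkly's actual bucketing algorithm.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0..99 and compare with the rollout percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


users = [f"user-{i}" for i in range(1000)]
enabled_at_5 = {u for u in users if is_enabled("new-checkout-flow", u, 5)}
enabled_at_25 = {u for u in users if is_enabled("new-checkout-flow", u, 25)}
print(len(enabled_at_5), len(enabled_at_25))
```

Because the bucket depends only on the flag name and the user ID, everyone enabled at 5% is still enabled at 25%, and hashing the flag name into the key keeps different flags from always selecting the same users.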

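9. Rollback Strategies

9.1 Automated Rollback

# Argo Rollouts automated rollback
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: error-rate-check
      # A failed analysis aborts the rollout and shifts traffic back to stable.
      # This field only sets how long the canary pods stay up after the abort.
      abortScaleDownDelaySeconds: 30

# Kubernetes Deployment: detecting a stalled rollout
# Note: a Deployment does not roll back by itself. When the deadline passes it is
# only marked as failed (ProgressDeadlineExceeded); the pipeline must then run
# `kubectl rollout undo`, for example after `kubectl rollout status` exits non-zero.
apiVersion: apps/v1
kind: Deployment
spec:
  progressDeadlineSeconds: 300  # marked as failed if not complete within 5 minutes
  minReadySeconds: 30
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0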

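9.2 The Circuit Breaker Pattern

// Deployment circuit breaker
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

interface DeploymentMetrics {
  errors: number;
  totalRequests: number;
  p99LatencyMs: number;
}

// Placeholder for your own alerting helper (Slack, PagerDuty, ...)
declare function notify(alert: { channel: string; message: string; severity: string }): Promise<void>;

class DeploymentCircuitBreaker {
  private errorThreshold = 0.05;      // 5% error rate
  private latencyThresholdMs = 3000;  // 3 seconds

  // Metrics are expected to cover the most recent 5-minute window
  async shouldRollback(metrics: DeploymentMetrics): Promise<boolean> {
    if (metrics.totalRequests === 0) return false; // no traffic yet, nothing to judge
    const errorRate = metrics.errors / metrics.totalRequests;

    return (
      errorRate > this.errorThreshold ||
      metrics.p99LatencyMs > this.latencyThresholdMs
    );
  }

  async executeRollback(deployment: string): Promise<void> {
    console.log(`Rolling back ${deployment}`);
    // Equivalent to: kubectl rollout undo deployment/api-service
    // execFile passes arguments without a shell, so the name cannot inject commands
    await execFileAsync('kubectl', ['rollout', 'undo', `deployment/${deployment}`]);

    // Send an alert
    await notify({
      channel: '#deployments',
      message: `Auto-rollback triggered for ${deployment}`,
      severity: 'critical'
    });
  }
}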

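9.3 Rolling Back Database Migrations

Safe DB migration strategy:
1. The Expand-Contract pattern
   Phase 1 (Expand): add the new column and write to both old and new
   Phase 2 (Migrate): migrate the existing data
   Phase 3 (Contract): remove the old column

2. Apply only migrations that can be rolled back
   - Adding a column (reversible)
   - Adding an index (reversible)
   - Dropping a column (irreversible → use Expand-Contract)
   - Changing a type (irreversible → add a new column, then switch over)

-- Example of a safe migration
-- Step 1: add the new column (reversible)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;

-- Step 2: migrate the data (in the background; batch it on large tables)
UPDATE users SET email_verified = TRUE
WHERE verified_at IS NOT NULL;

-- Step 3: switch the application code to the new column
-- Step 4: drop the old column (in a separate migration)
-- ALTER TABLE users DROP COLUMN verified_at;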
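
The Expand and Migrate phases can be tried end to end against an in-memory SQLite database. This is a sketch under stated assumptions: the table contents are made up, `backfill` is a hypothetical helper, and SQLite stands in for whatever database the service actually uses (the column is declared as `INTEGER` with 0/1 values for that reason).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, verified_at TEXT)")
conn.executemany(
    "INSERT INTO users (id, verified_at) VALUES (?, ?)",
    [(1, "2025-01-10"), (2, None), (3, "2025-02-01")],
)

# Expand: add the new column with a safe default; old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN email_verified INTEGER NOT NULL DEFAULT 0")


def backfill(db: sqlite3.Connection, batch_size: int = 2) -> int:
    """Migrate: fill the new column in small batches so each transaction stays short."""
    total = 0
    while True:
        cur = db.execute(
            "UPDATE users SET email_verified = 1 "
            "WHERE id IN (SELECT id FROM users "
            "WHERE verified_at IS NOT NULL AND email_verified = 0 LIMIT ?)",
            (batch_size,),
        )
        db.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


migrated = backfill(conn)
rows = conn.execute("SELECT id, email_verified FROM users ORDER BY id").fetchall()
print(migrated, rows)
# Contract (dropping verified_at) is deliberately left out: it ships later,
# as a separate migration, once nothing reads the old column.
```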

10. Pipeline Health Monitoring

10.1 Key Metrics

Pipeline health dashboard:
┌─────────────────────────────────────────┐
│  Build Time Trend                       │
│  ██████████████ 8m (avg)                │
│  Target: < 10m                          │
├─────────────────────────────────────────┤
│  Success Rate                           │
│  ████████████████████ 94%               │
│  Target: > 95%                          │
├─────────────────────────────────────────┤
│  Flaky Test Rate                        │
│  ██ 3%                                  │
│  Target: < 2%                           │
├─────────────────────────────────────────┤
│  Mean Time to Recovery (MTTR)           │
│  ████ 25min                             │
│  Target: < 30min                        │
└─────────────────────────────────────────┘

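10.2 Tracking Build Time

# Record the start time in the first step of the job
- name: Record start time
  run: echo "START_TIME=$(date +%s)" >> "$GITHUB_ENV"

# Report the build duration to Datadog
# (the DD_API_KEY secret name is an assumption; use whatever your repository defines)
- name: Report build metrics
  if: always()
  env:
    DD_API_KEY: ${{ secrets.DD_API_KEY }}
    JOB_STATUS: ${{ job.status }}
  run: |
    END_TIME=$(date +%s)
    DURATION=$((END_TIME - START_TIME))
    curl -X POST "https://api.datadoghq.com/api/v1/series" \
      -H "Content-Type: application/json" \
      -H "DD-API-KEY: $DD_API_KEY" \
      -d "{
        \"series\": [{
          \"metric\": \"ci.build.duration\",
          \"points\": [[$END_TIME, $DURATION]],
          \"tags\": [
            \"repo:myapp\",
            \"branch:$GITHUB_REF_NAME\",
            \"status:$JOB_STATUS\"
          ]
        }]
      }"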

10.3 Automating Failure Analysis

# Script that classifies build failures automatically
import re
from enum import Enum

class FailureCategory(Enum):
    FLAKY_TEST = "flaky_test"
    DEPENDENCY = "dependency"
    COMPILATION = "compilation"
    INFRASTRUCTURE = "infrastructure"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"

def categorize_failure(log: str) -> FailureCategory:
    patterns = {
        FailureCategory.FLAKY_TEST: [
            r"retry.*failed",
            r"intermittent",
            r"flaky"
        ],
        FailureCategory.DEPENDENCY: [
            r"npm ERR!.*404",
            r"Could not resolve dependencies",
            r"ECONNRESET"
        ],
        FailureCategory.COMPILATION: [
            r"error TS\d+",
            r"SyntaxError",
            r"TypeError"
        ],
        FailureCategory.INFRASTRUCTURE: [
            r"runner.*offline",
            r"disk space",
            r"out of memory"
        ],
        FailureCategory.TIMEOUT: [
            r"timed out",
            r"deadline exceeded"
        ]
    }

    for category, regexes in patterns.items():
        for pattern in regexes:
            if re.search(pattern, log, re.IGNORECASE):
                return category

    return FailureCategory.UNKNOWN

11. Putting It Together: A Pipeline Example

11.1 Full-Stack CI/CD Pipeline

# .github/workflows/production.yml
name: Production Pipeline

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read
  packages: write

jobs:
  # Phase 1: code quality
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check

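  # Phase 2: security scanning
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Note: this action appears to be deprecated upstream; check Semgrep's
      # documentation for the currently recommended CI setup.
      - name: Run Semgrep
        uses: returntocorp/semgrep-action@v1
      # Prefer pinning third-party actions to a release tag or commit SHA over @master.
      - name: Run Trivy (filesystem)
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'HIGH,CRITICAL'
          exit-code: '1'  # fail the job when findings at these severities exist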

  # Phase 3: tests
  test:
    needs: quality
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: testpass
        ports:
          - 5432:5432
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost:5432/testdb

  # Phase 4: build and push
  build:
    needs: [test, security]
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/myorg/myapp
          # Tag by commit SHA: the default branch tag ("main") never changes,
          # so the GitOps manifest update would have nothing to commit.
          tags: |
            type=sha
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # Phase 5: update the deployment manifest (GitOps)
  update-manifest:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: myorg/gitops-config
          token: ${{ secrets.GITOPS_TOKEN }}
      - name: Update image tag
        run: |
          cd services/api
          kustomize edit set image myapp=${{ needs.build.outputs.image-tag }}
      - name: Commit and push
        run: |
          git config user.name "CI Bot"
          git config user.email "ci@example.com"
          git add .
          git commit -m "chore: update api-service image"
          git push

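12. Interview Questions

Basic Concepts

Q1. Explain the difference between CI and CD.

CI (Continuous Integration) is the practice of developers merging code changes into the main branch frequently. Each integration is verified by an automated build and tests.

CD has two meanings:

- Continuous Delivery: the code is always kept in a deployable state. Production deployment requires manual approval.
- Continuous Deployment: every change is deployed to production automatically, with no manual intervention.

Key difference: CI focuses on integration, CD on delivery and deployment. CD is not possible without CI, but a team can practice CI without CD.

Q2. Explain the four DORA metrics.

- Deployment Frequency: how often the team deploys to production
- Lead Time for Changes: the time from commit to production deployment
- Change Failure Rate: the share of deployments that fail or require a rollback
- Time to Restore (MTTR, Mean Time to Recovery): the time from an incident to recovery

Elite teams: multiple deployments per day, lead time under 1 hour, failure rate under 5%, recovery in under 1 hour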

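Q3. Explain the core principles of GitOps.

- Declarative: the system state is defined declaratively
- Versioned: Git is the single source of truth
- Automated: approved changes are applied to the system automatically
- Self-healing: when the actual state drifts from the declared state, it is restored automatically

Benefits: audit trail, easy rollback, PR-based change management, reproducible environments

Q4. Explain the difference between Blue-Green and canary deployments.

Blue-Green: two identical environments (Blue and Green) are maintained. The new version is deployed to Green and traffic is switched over all at once. Rollback is an immediate switch back to Blue.

- Pros: immediate rollback, simple to implement
- Cons: needs twice the resources, database synchronization is complex

Canary: the new version is first released to a small share of users (1~10%), then expanded gradually after metric analysis.

- Pros: minimal risk, validated with real traffic
- Cons: complex to implement, monitoring is mandatory

Q5. What is Shift Left?

It is the strategy of moving testing and security to the left (the early stages) of the development lifecycle.

Examples:

- Pre-commit hooks for linting, formatting, and secret scanning
- SAST, SCA, and unit tests at the PR stage
- Container image scanning at build time
- IDE plugins for real-time feedback during development

Effect: the earlier a defect is found, the cheaper it is to fix; commonly cited estimates put the saving at 10~100x compared with fixing it in production.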

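Advanced Questions

Q6. How do you manage flaky tests?

- Detect: identify tests whose results differ across repeated runs on the same code
- Isolate: move flaky tests into a separate test suite and apply continue-on-error
- Retry: retry automatically with Jest's jest.retryTimes(), pytest-rerunfailures, and similar tools
- Track: analyze frequency and patterns on a flaky test dashboard
- Fix the root cause: remove timing issues, shared state, external dependencies, and so on
- Policy: disable or delete tests that are not fixed within a set period

Q7. How do you optimize Docker image builds?

- Multi-stage builds: keep build tools out of the final image
- Layer cache optimization: COPY frequently changing files later
- Lightweight base images: use alpine or distroless
- .dockerignore: exclude unnecessary files
- BuildKit: parallel builds and cache mounts
- Dependency separation: copy package.json first so that npm ci is cached
- Kaniko: build without a Docker daemon (improves CI/CD security)

Q8. Describe best practices for secret management.

- Never commit secrets to Git: check at pre-commit with gitleaks or detect-secrets
- OIDC-based authentication: use short-lived tokens instead of long-lived secrets
- Use a secret manager: AWS Secrets Manager, HashiCorp Vault, Doppler
- Least privilege: grant only the minimum permissions required
- Secret rotation: automate regular rotation
- Audit logs: track access to secrets
- Environment separation: keep dev, staging, and prod secrets apart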

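Q9. How do you run database migrations so that they are safe to roll back?

Use the Expand-Contract pattern:

Phase 1 (Expand):

- Add the new column or table
- Change the application so it is compatible with both the old and the new schema
- Start writing data to the new schema as well

Phase 2 (Migrate):

- Migrate existing data to the new schema (in the background)
- Switch the application to use only the new schema

Phase 3 (Contract):

- Remove the old column or table (in a separate deployment)
- This is the only phase that cannot be rolled back

Key point: every phase before Contract must be independently reversible.

Q10. How do you harden the security of a CI/CD pipeline?

- Supply chain security: SBOM generation, image signing (cosign), SLSA compliance
- Secret management: OIDC, Vault, minimal use of environment variables
- SAST/DAST/SCA: integrate scanners such as Semgrep (SAST), Trivy, and Dependabot (SCA)
- Container security: non-root user, distroless base image, image scanning
- Policy compliance: enforce deployment policies with OPA/Kyverno
- Access control: least privilege, branch protection rules
- Auditing: record every deployment and keep the change history

Q11. Compare the pros and cons of GitHub Actions and Jenkins.

GitHub Actions:

- Pros: native GitHub integration, no maintenance as a SaaS, Marketplace ecosystem, simple YAML-based configuration
- Cons: tied to GitHub, limited customization, complex workflows are hard to manage

Jenkins:

- Pros: full flexibility, 1,800+ plugins, control over self-hosting, Groovy scripting
- Cons: high maintenance cost, complex configuration, security patch management, hard to scale

How to choose: Actions for small or GitHub-centric projects; Jenkins for complex enterprise or multi-SCM setups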

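Q12. Explain Progressive Delivery with Argo Rollouts.

Progressive Delivery rolls out a new version gradually while automated analysis verifies that it is safe.

Argo Rollouts workflow:

1. Send 10% of traffic to the canary
2. Verify success rate and latency with an AnalysisTemplate (5 minutes)
3. If it passes, expand to 30% and analyze again
4. Expand gradually to 60%, then 100%
5. Roll back automatically if an analysis fails

Key components:

- Rollout: defines the deployment strategy
- AnalysisTemplate: defines the verification conditions (Prometheus, Datadog, and others)
- TrafficRouting: integrates with Istio, Nginx, ALB, and others

Q13. How do you optimize pipeline performance?

- Parallelism: run independent jobs at the same time
- Test splitting: distribute tests across runners with sharding
- Caching: cache dependencies, Docker layers, and build outputs
- Selective execution: run only the jobs that the changed files require
- Incremental builds: build only what changed instead of everything
- Resource tuning: adjust runner size and concurrency limits
- Shorter feedback loops: run fast checks first and slow checks later

Q14. What are the pros and cons of feature-flag-based releases?

Pros:

- Decoupled deployment and release: ship the code now, enable the feature later
- Fast rollback: turn the flag off without rolling back code
- Gradual rollout: widen the share of users step by step
- A/B testing: experiments per feature

Cons:

- Technical debt: stale flags must be cleaned up
- Complexity: flag combinations multiply the number of test cases
- Readability: more conditionals make the code harder to follow
- Consistency: users see different behavior, which makes bugs harder to reproduce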

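Q15. Describe a CI/CD strategy for a monorepo.

- Impact analysis: build and test only the packages affected by the changed files
- Tooling: use Turborepo, Nx, Bazel, or similar for builds driven by the dependency graph
- Caching: share build outputs through a remote cache (Turborepo Remote Cache)
- Selective deployment: deploy only the services that changed
- Parallelism: build and test independent packages at the same time

# Turborepo example (quote the filter so the shell does not expand the brackets)
turbo run build --filter='...[HEAD~1]'
# Builds only the packages changed since HEAD~1 and the packages that depend on them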
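
The impact analysis behind such a filter is a graph traversal: start from the changed packages and follow the dependency edges in reverse. The sketch below shows the idea on a made-up workspace; it is not Turborepo's implementation, and the package names are hypothetical.

```python
from collections import deque

# Hypothetical workspace: package -> packages it depends on.
DEPENDS_ON = {
    "ui": [],
    "utils": [],
    "api": ["utils"],
    "web": ["ui", "utils"],
    "admin": ["ui"],
}


def affected(changed: set[str], depends_on: dict[str, list[str]]) -> set[str]:
    """Changed packages plus everything that depends on them, directly or transitively."""
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(pkg)
    result, queue = set(changed), deque(changed)
    while queue:
        for parent in dependents[queue.popleft()]:
            if parent not in result:
                result.add(parent)
                queue.append(parent)
    return result


print(sorted(affected({"utils"}, DEPENDS_ON)))
```

A change to `utils` rebuilds `api` and `web` but leaves `ui` and `admin` alone, which is where the time saving in a large monorepo comes from.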

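13. Quiz

Q1. What is the deployment frequency of Elite teams in the DORA metrics?

Answer: Multiple times per day (on-demand, multiple deploys per day)

Elite teams deploy multiple times per day while keeping their change failure rate below 5% and their time to restore under 1 hour.

Q2. Which phase of the Expand-Contract pattern cannot be rolled back?

Answer: The Contract phase (dropping the old column or table)

Expand (adding) and Migrate (moving data) can be rolled back, but Contract (dropping) destroys data and therefore cannot. For that reason Contract is carried out separately, after a sufficient stabilization period.

Q3. What is the "single source of truth" in GitOps?

Answer: The Git repository

In GitOps, the Git repository is the only source that defines the desired state of the system. The actual state of the cluster should always match the state declared in Git, and tools such as ArgoCD detect drift and sync it automatically.

Q4. What triggers an automatic rollback in a canary deployment?

Answer: The metric criteria defined in the AnalysisTemplate (success rate, latency, and so on)

An Argo Rollouts AnalysisTemplate queries metrics from Prometheus, Datadog, or similar sources. If the success rate falls below the threshold (for example 95%) or P99 latency exceeds its limit, a rollback is triggered automatically.

Q5. What is the purpose of an SBOM (Software Bill of Materials)?

Answer: To strengthen supply chain security by listing every component (libraries, dependencies) included in the software

An SBOM is the software's "list of ingredients". It helps teams scope the impact of a newly found vulnerability quickly (Log4Shell is a well-known example), check license compliance, and respond to supply chain attacks. Tools such as Syft and Trivy can generate one automatically.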

CI/CD Best Practices 2025: Pipeline Design, Automation, and Security for Teams

Introduction
1. CI/CD in 2025
2. CI/CD Platform Comparison
3. Pipeline Design Principles
4. Testing in CI
5. Docker Build Optimization
6. GitOps with ArgoCD
7. CI/CD Security
8. Deployment Strategies Compared
9. Rollback Strategies
10. Pipeline Health Monitoring
11. Full-Stack CI/CD Pipeline Example
12. Interview Questions
- Basic Concepts
- Advanced Questions
13. Quiz
14. References

Introduction

In 2025, CI/CD is no longer optional; it is essential. According to Google's DORA (DevOps Research and Assessment) report, Elite-level teams deploy multiple times per day while keeping their change failure rate below 5%. Low-level teams, by contrast, deploy less than once every six months and see change failure rates of 46~60%.

The core of this gap lies in pipeline design. It is not just about adopting CI/CD tools, but about a comprehensive strategy that includes test automation, security integration, progressive delivery, and observability.

This article covers everything you need to design and operate CI/CD pipelines: platform comparisons, pipeline design principles, testing strategies, Docker build optimization, GitOps, security, deployment strategies, rollback approaches, and monitoring.

1. CI/CD in 2025

1.1 Team Performance Through DORA Metrics

DORA Metrics are four key indicators that measure software delivery performance.

Metric	Elite	High	Medium	Low
Deployment Frequency	Multiple/day	Weekly~Monthly	Monthly~6 months	6+ months
Lead Time (commit to deploy)	Under 1 hour	1 day~1 week	1 week~1 month	1~6 months
Change Failure Rate	0~5%	5~10%	10~15%	46~60%
MTTR (Recovery Time)	Under 1 hour	Under 1 day	1 day~1 week	6+ months

1.2 Shift Left Strategy

Shift Left is the practice of moving testing and security to earlier stages in the development lifecycle.

Traditional approach:
Code → Build → Test → Security → Deploy → Monitor
                              ↑ Problems found here

Shift Left:
Code + Test + Security → Build → Deploy → Monitor
↑ Problems found here (10x cost reduction)

Key principles:

Pre-commit validation: Lint, format, and secret scanning via pre-commit hooks
PR-level testing: Unit tests + integration tests + SAST run automatically
Build-time security: Container image scanning, dependency vulnerability checks
Pre-deploy verification: Smoke tests, canary analysis

1.3 Key Trends in 2025

Platform Engineering: Standardizing CI/CD through developer self-service platforms
AI-Powered CI/CD: Test failure prediction, auto-rollback decisions, flaky test detection
eBPF-Based Observability: A new paradigm for pipeline performance monitoring
Supply Chain Security: SBOM, SLSA, and Sigstore-based software supply chain security

2. CI/CD Platform Comparison

2.1 Major Platforms at a Glance

Feature	GitHub Actions	Jenkins	GitLab CI	CircleCI
Hosting	SaaS/Self-hosted	Self-hosted	SaaS/Self-hosted	SaaS
Config	YAML	Groovy/YAML	YAML	YAML
Ecosystem	Marketplace 15,000+	Plugins 1,800+	Built-in integrations	Orbs 3,000+
Container Support	Native	Plugin	Native	Native
Self-hosted Runner	Supported	Default	Supported	Supported
Pricing	2,000 min free/mo	Free (OSS)	400 min free/mo	6,000 credits free/mo
Learning Curve	Low	High	Medium	Low
Caching	10GB/repo	Plugin	Native	Native

2.2 GitHub Actions

# .github/workflows/ci.yml
name: CI Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4

  build:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: myapp:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

2.3 Jenkins Pipeline

// Jenkinsfile (Declarative Pipeline)
pipeline {
    agent any

    environment {
        DOCKER_REGISTRY = 'registry.example.com'
        IMAGE_NAME = 'myapp'
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }

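        stage('Install') {
            steps {
                // Install dependencies once so the parallel stages below can use them
                sh 'npm ci'
            }
        }

        stage('Lint & Test') {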
            parallel {
                stage('Lint') {
                    steps {
                        sh 'npm run lint'
                    }
                }
                stage('Unit Test') {
                    steps {
                        sh 'npm test -- --coverage'
                    }
                    post {
                        always {
                            junit 'reports/junit.xml'
                        }
                    }
                }
            }
        }

        stage('Build & Push') {
            steps {
                script {
                    def image = docker.build("${DOCKER_REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}")
                    docker.withRegistry("https://${DOCKER_REGISTRY}", 'registry-credentials') {
                        image.push()
                        image.push('latest')
                    }
                }
            }
        }
    }

    post {
        failure {
            slackSend(
                channel: '#ci-alerts',
                color: 'danger',
                message: "Build FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
            )
        }
    }
}

2.4 GitLab CI

# .gitlab-ci.yml
stages:
  - lint
  - test
  - build
  - deploy

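variables:
  # Port 2376 is the TLS port of Docker-in-Docker, so certificates must be enabled
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_CERTDIR: "/certs"

lint:
  stage: lint
  image: node:20-alpine
  cache:
    # npm ci deletes node_modules, so cache npm's download cache instead
    key:
      files:
        - package-lock.json
    paths:
      - .npm/
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run lint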

test:
  stage: test
  image: node:20-alpine
  parallel: 4
  script:
    - npm ci
    - npm test -- --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL
  coverage: '/Statements\s*:\s*(\d+\.?\d*)%/'
  artifacts:
    reports:
      junit: reports/junit.xml

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
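  script:
    # Log in to the project's container registry and push an immutable, commit-based tag
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
  rules:
    - if: $CI_COMMIT_BRANCH == "main"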

3. Pipeline Design Principles

3.1 Fast Feedback Loops

The time developers wait for results after opening a PR directly impacts productivity.

Target times:
├── Lint + format check: under 30s
├── Unit tests: under 2min
├── Integration tests: under 5min
├── Build: under 3min
└── Full pipeline: under 10min

Reality (before optimization): 30min+
Reality (after optimization): 8-10min

3.2 Parallelism

# Parallel pipeline example
jobs:
  # Phase 1: Lint/security run independently in parallel
  lint:
    runs-on: ubuntu-latest
    # ...
  security-scan:
    runs-on: ubuntu-latest
    # ...

  # Phase 2: Tests split into shards
  test:
    needs: [lint]
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    # ...

  # Phase 3: Build after tests pass
  build:
    needs: [test, security-scan]
    # ...
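
The benefit of the `needs` graph can be estimated before touching the pipeline. The sketch below, with hypothetical job durations in minutes, compares running every job serially against the critical path of the dependency graph. It assumes enough runners for all ready jobs to start at once and ignores the matrix shards.

```python
from functools import lru_cache

# Hypothetical durations (minutes) for the jobs in the example above.
DURATION = {"lint": 1, "security-scan": 4, "test": 3, "build": 3}

# The `needs` relationships from the example above.
NEEDS = {
    "lint": [],
    "security-scan": [],
    "test": ["lint"],
    "build": ["test", "security-scan"],
}


@lru_cache(maxsize=None)
def finish_time(job: str) -> int:
    """Earliest finish time: a job starts once all of its dependencies finish."""
    start = max((finish_time(dep) for dep in NEEDS[job]), default=0)
    return start + DURATION[job]


serial_total = sum(DURATION.values())                    # one job after another
parallel_total = max(finish_time(j) for j in DURATION)   # length of the critical path

print(serial_total, parallel_total)  # 11 7
```

The critical path is the lower bound for wall-clock time, so the jobs on it (here `security-scan` and `build`) are the ones worth optimizing first.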

3.3 Caching Strategy

# npm caching (GitHub Actions)
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-

# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

# Gradle caching
- uses: actions/cache@v4
  with:
    path: |
      ~/.gradle/caches
      ~/.gradle/wrapper
    key: gradle-${{ hashFiles('**/*.gradle*') }}
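
The idea behind a content-hashed cache key is that the key changes only when the dependency definition changes. The sketch below illustrates that idea with a hypothetical `cache_key` helper; it is not the algorithm that the `hashFiles` expression uses internally, only the same principle.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, root: Path, pattern: str) -> str:
    """Build a key that changes only when the content of a matching file changes."""
    digest = hashlib.sha256()
    # Sort the paths so the key does not depend on filesystem listing order.
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"


if __name__ == "__main__":
    # Same lockfile, same key; edit the lockfile and the key (and cache) rolls over.
    print(cache_key("npm", Path("."), "**/package-lock.json"))
```

A `restore-keys` prefix such as `npm-` then acts as a fallback: when the exact key misses, the most recent cache with that prefix is restored and only the changed packages are downloaded.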

3.4 Idempotency

Pipelines should always produce the same result for the same input.

# Bad: Timestamp-based tag (different result on re-run)
# IMAGE_TAG: my-app:build-20250323-142000

# Good: Commit SHA-based tag (always the same)
# IMAGE_TAG: my-app:abc1234

# Good: Semantic version (deterministic)
# IMAGE_TAG: my-app:v1.2.3
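
As a small illustration, a tag derived only from the commit SHA is a pure function of its input, so re-running the pipeline for the same commit yields the same tag. The `image_tag` helper below is hypothetical.

```python
import re


def image_tag(repo: str, commit_sha: str, length: int = 7) -> str:
    """Derive an image reference from the commit SHA alone (no clock, no counters)."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_sha):
        raise ValueError(f"not a git commit SHA: {commit_sha!r}")
    return f"{repo}:{commit_sha[:length]}"


print(image_tag("my-app", "abc1234def5678"))  # my-app:abc1234
```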

4. Testing in CI

4.1 Test Pyramid

          /    E2E    \          Slow but high confidence
         /  (5-10%)    \
        / Integration   \       Medium speed, medium confidence
       /   (15-25%)      \
      /    Unit Tests      \    Fast and numerous
     /     (65-80%)         \
    /_________________________\

4.2 Test Splitting

# Jest test sharding
test:
  strategy:
    matrix:
      shard: [1, 2, 3, 4]
  steps:
    - run: npx jest --shard=${{ matrix.shard }}/4

# Cypress parallel execution
e2e:
  strategy:
    matrix:
      container: [1, 2, 3]
  steps:
    - uses: cypress-io/github-action@v6
      with:
        record: true
        parallel: true
        group: 'e2e-tests'
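
Jest's `--shard` flag decides the split on its own. When timing data from earlier runs is available, a duration-aware split can balance the shards more evenly. The sketch below shows one common approach, greedy longest-first assignment; the file names and durations are made up for illustration.

```python
import heapq


def split_into_shards(durations: dict[str, float], shard_count: int) -> list[list[str]]:
    """Assign each test file, longest first, to the shard with the least work so far."""
    shards: list[list[str]] = [[] for _ in range(shard_count)]
    heap = [(0.0, i) for i in range(shard_count)]  # (total seconds, shard index)
    heapq.heapify(heap)
    # Sort by duration descending, then by name so the result is deterministic.
    for name, secs in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + secs, idx))
    return shards


# Hypothetical timings in seconds from a previous run.
timings = {"a.test.js": 60, "b.test.js": 45, "c.test.js": 30, "d.test.js": 30, "e.test.js": 15}
shards = split_into_shards(timings, 2)
print(shards)  # [['a.test.js', 'd.test.js'], ['b.test.js', 'c.test.js', 'e.test.js']]
```

Both shards end up with 90 seconds of work here, whereas a naive alternating split of the same files would give 105 and 75 seconds.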

4.3 Flaky Test Management

Flaky tests are those that sometimes pass and sometimes fail on the same code.

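// Flaky test detection and isolation strategy
// jest.config.js
module.exports = {
  // Retries are not a config key; they are enabled in a setup file (see jest.setup.js)
  setupFilesAfterEnv: ['<rootDir>/jest.setup.js'],

  // Flaky test reporter (package name and options are illustrative)
  reporters: [
    'default',
    ['jest-flaky-reporter', {
      outputFile: 'flaky-tests.json',
      threshold: 3  // Report if flaky 3+ times
    }]
  ]
};

// jest.setup.js
// Auto-retry failed tests up to 2 times (requires the default jest-circus runner)
jest.retryTimes(2);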

# Isolate flaky tests in CI
test-stable:
  runs-on: ubuntu-latest
  steps:
    - run: npx jest --testPathIgnorePatterns="flaky"

test-flaky:
  runs-on: ubuntu-latest
  continue-on-error: true  # Pipeline continues on failure
  steps:
    # Jest has no --retries flag; retries are configured with jest.retryTimes() in a setup file
    - run: npx jest --testPathPattern="flaky"
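
Detection can be automated from CI history using the definition above: a test is flaky if it both passed and failed on the same commit. The following is a minimal sketch; the `(test, commit, passed)` tuple format is an assumption about how results are exported.

```python
from collections import defaultdict


def find_flaky(runs: list[tuple[str, str, bool]]) -> set[str]:
    """runs holds (test_name, commit_sha, passed); flaky means both outcomes on one commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    return {test for (test, _sha), seen in outcomes.items() if len(seen) == 2}


history = [
    ("login", "abc", True),
    ("login", "abc", False),   # same commit, different result: flaky
    ("cart", "abc", False),
    ("cart", "abc", False),    # fails consistently: a real failure, not flaky
    ("search", "abc", True),
    ("search", "def", False),  # different commits: could be a real regression
]
flaky = find_flaky(history)
print(flaky)  # {'login'}
```

Keying on the commit is what separates flakiness from regressions: a test that starts failing after a code change is not flaky by this definition.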

4.4 Coverage Gates

# Set coverage thresholds
test:
  steps:
    # coverage-summary.json is only written by the json-summary reporter
    - run: npx jest --coverage --coverageReporters=json-summary
    - name: Check coverage threshold
      run: |
        COVERAGE=$(jq '.total.statements.pct' coverage/coverage-summary.json)
        if (( $(echo "$COVERAGE < 80" | bc -l) )); then
          echo "Coverage $COVERAGE% is below 80% threshold"
          exit 1
        fi

5. Docker Build Optimization

5.1 Multi-Stage Builds

# Stage 1: Install production dependencies only (reused by the final image)
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Stage 2: Build (the build needs devDependencies, so install the full tree here)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: Production image
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

# Security: non-root user
RUN addgroup --system --gid 1001 nodejs && \
    adduser --system --uid 1001 nextjs

COPY --from=builder --chown=nextjs:nodejs /app/.next ./.next
COPY --from=deps --chown=nextjs:nodejs /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER nextjs
EXPOSE 3000
CMD ["npm", "start"]

5.2 Layer Caching Optimization

# Bad: Source changes trigger npm ci re-run
COPY . .
RUN npm ci
RUN npm run build

# Good: Copy dependency files first
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

5.3 BuildKit and Buildx

# Using BuildKit in GitHub Actions
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
    platforms: linux/amd64,linux/arm64

5.4 Kaniko (Daemonless Build)

# Build images with Kaniko in Kubernetes
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - "--dockerfile=Dockerfile"
        - "--context=git://github.com/myorg/myapp"
        - "--destination=registry.example.com/myapp:latest"
        - "--cache=true"
        - "--cache-repo=registry.example.com/myapp/cache"

5.5 Image Size Optimization

Image size comparison (approximate; varies by version and architecture):
├── node:20          → 1.1GB
├── node:20-slim     → 220MB
├── node:20-alpine   → 140MB
├── distroless/nodejs → 120MB
└── Multi-stage optimized → 80-100MB

6. GitOps with ArgoCD

6.1 GitOps Principles

GitOps uses a Git repository as the Single Source of Truth for system operations.

GitOps workflow:
1. Developer pushes changes to Git
2. CI builds and tests the image
3. CI updates image tag in deployment manifests
4. ArgoCD compares Git vs cluster state
5. Auto-sync on drift (or manual approval)
6. Cluster matches Git state

┌────────┐    Push     ┌────────┐   Detect   ┌────────┐
│  Dev   │ ──────────> │  Git   │ <────────> │ ArgoCD │
└────────┘             └────────┘            └───┬────┘
                                                 │ Sync
                                            ┌────▼────┐
                                            │  K8s    │
                                            │ Cluster │
                                            └─────────┘
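
Steps 4 to 6 amount to a reconciliation loop: compare the desired state in Git with the live state and compute the actions that close the gap. The sketch below illustrates that comparison with plain dictionaries. It is not ArgoCD's implementation; `plan_sync` and the manifest strings are hypothetical.

```python
def plan_sync(desired: dict[str, str], live: dict[str, str], prune: bool = True) -> list[tuple[str, str]]:
    """Return the actions needed to make the live state match the state declared in Git."""
    actions: list[tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))  # drift: live differs from Git
    if prune:
        # Resources that exist in the cluster but are no longer declared in Git.
        actions += [("delete", name) for name in sorted(live) if name not in desired]
    return actions


git_state = {"api": "image: api:abc1234", "web": "image: web:def5678"}
cluster_state = {"api": "image: api:0000000", "worker": "image: worker:1111111"}
plan = plan_sync(git_state, cluster_state)
print(plan)  # [('update', 'api'), ('create', 'web'), ('delete', 'worker')]
```

Running the same function again after the actions are applied returns an empty plan, which is the property that makes the loop safe to run continuously.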

6.2 ArgoCD App of Apps Pattern

# apps/root-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops-config
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

# apps/api-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops-config
    path: services/api
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

6.3 Argo Rollouts (Progressive Delivery)

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api-vsvc
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 30
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 60
        - pause:
            duration: 5m
        - setWeight: 100

# AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"2.*",app="api-service",version="canary"}[5m]))
            /
            sum(rate(http_requests_total{app="api-service",version="canary"}[5m]))
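
The control flow of the Rollout above can be sketched as a simple loop: raise the weight, run the analysis where one is defined, and abort on the first failed check. This is a simulation of the logic only, with pauses omitted; `run_canary` and the `success_rate_at` callback are hypothetical stand-ins for the controller and the Prometheus query.

```python
from typing import Callable, Optional

# The steps from the Rollout above, with pauses left out.
STEPS: list[tuple[str, Optional[int]]] = [
    ("weight", 10), ("analysis", None),
    ("weight", 30), ("analysis", None),
    ("weight", 60),
    ("weight", 100),
]


def run_canary(success_rate_at: Callable[[int], float], threshold: float = 0.95) -> tuple[str, int]:
    """Walk the steps; return ('aborted', weight) at the first failed analysis."""
    weight = 0
    for kind, value in STEPS:
        if kind == "weight":
            weight = value
        elif success_rate_at(weight) < threshold:
            return ("aborted", weight)  # traffic goes back to the stable version
    return ("promoted", weight)


print(run_canary(lambda w: 0.99))                        # ('promoted', 100)
print(run_canary(lambda w: 0.99 if w < 30 else 0.90))    # ('aborted', 30)
```

Note that in this configuration the 60% step has no analysis of its own, so a regression that only shows up under higher load would pass unchecked; adding an analysis step after 60% closes that gap.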

7. CI/CD Security

7.1 Security Scanning Integration

CI/CD Security Layers:
┌─────────────────────────────────────────────┐
│ Layer 1: Pre-commit                         │
│  - Secret scanning (gitleaks, detect-secrets)│
│  - Lint (security rules)                    │
├─────────────────────────────────────────────┤
│ Layer 2: PR / Build                         │
│  - SAST (Semgrep, CodeQL, SonarQube)        │
│  - SCA (Dependabot, Snyk, Trivy)            │
│  - License compliance                       │
├─────────────────────────────────────────────┤
│ Layer 3: Container Build                    │
│  - Image scanning (Trivy, Grype)            │
│  - Base image policy (distroless, alpine)   │
│  - SBOM generation (Syft)                   │
├─────────────────────────────────────────────┤
│ Layer 4: Deploy                             │
│  - Policy enforcement (OPA/Kyverno)         │
│  - Signing (cosign, Sigstore)               │
│  - Runtime security (Falco)                 │
└─────────────────────────────────────────────┘

7.2 Secret Management

# GitHub Actions secret usage
deploy:
  steps:
    - name: Deploy
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        # Without a changed parameter, update-service does nothing; force a new deployment
        aws ecs update-service --cluster prod --service api --force-new-deployment

# OIDC-based authentication (secretless - recommended)
permissions:
  id-token: write
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions
      aws-region: ap-northeast-2

7.3 SBOM and Supply Chain Security

# Generate SBOM with Syft
- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: myapp:latest
    format: spdx-json
    output-file: sbom.spdx.json

# Sign image with cosign
- name: Sign image
  run: |
    cosign sign --key env://COSIGN_PRIVATE_KEY myapp:latest

# Verify signature with cosign
- name: Verify signature
  run: |
    cosign verify --key cosign.pub myapp:latest

7.4 Automated Secret Scanning

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks

# Run gitleaks in CI
security:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - uses: gitleaks/gitleaks-action@v2
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

8. Deployment Strategies Compared

8.1 Strategy Comparison Table

Strategy	Downtime	Risk	Resource Cost	Rollback Speed	Complexity
Recreate	Yes	High	1x	Slow	Low
Rolling Update	No	Medium	1x~1.25x	Medium	Low
Blue-Green	No	Low	2x	Instant	Medium
Canary	No	Very Low	1.1x	Instant	High
A/B Testing	No	Very Low	1.1x	Instant	Very High

8.2 Blue-Green Deployment

# Kubernetes Blue-Green deployment
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
    version: green  # Switch from blue to green
  ports:
    - port: 80
      targetPort: 8080

Blue-Green switchover process:
1. Blue(v1) running → Deploy Green(v2)
2. Green health check and smoke test
3. Switch service selector to Green
4. Roll back to Blue immediately on issues
5. Clean up Blue resources after stabilization

[Users] → [LB] → [Blue v1] Active
                  [Green v2] ← Preparing

[Users] → [LB] → [Blue v1] ← Standby
                  [Green v2] Active

8.3 Canary Deployment

# Canary deployment with Istio VirtualService
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
spec:
  hosts:
    - api-service
  http:
    - route:
        - destination:
            host: api-service
            subset: stable
          weight: 90
        - destination:
            host: api-service
            subset: canary
          weight: 10

8.4 Feature Flags

// LaunchDarkly or custom feature flag system
import { featureFlags } from './feature-flags';

async function handleRequest(req: Request) {
  const userId = req.user.id;

  if (await featureFlags.isEnabled('new-checkout-flow', userId)) {
    return newCheckoutFlow(req);
  }

  return legacyCheckoutFlow(req);
}

Feature flag-based deployment:
1. Deploy code with new feature wrapped in a flag
2. Enable for internal users only
3. Gradually increase rollout (1% → 5% → 25% → 100%)
4. Disable flag immediately if issues arise
5. Decouple deployment from release
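
The gradual rollout in step 3 is usually implemented with stable hashing, so that a given user stays in the same bucket as the percentage grows. The sketch below shows the general technique; it is not how LaunchDarkly or any specific product implements it, and `is_enabled` is a hypothetical helper.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Place each user in one of 100 buckets per flag, deterministically."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


users = [f"user-{i}" for i in range(10_000)]
share = sum(is_enabled("new-checkout-flow", u, 25) for u in users) / len(users)
print(f"{share:.0%} of users see the new flow")
```

Including the flag name in the hash input means different flags select different subsets of users, and because the bucket never changes, anyone enabled at 5% is still enabled at 25%.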

9. Rollback Strategies

9.1 Automatic Rollback

# Argo Rollouts automatic rollback
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: error-rate-check
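      # A failed analysis aborts the rollout and shifts traffic back to stable automatically;
      # this setting only controls how long the canary pods are kept after the abort
      abortScaleDownDelaySeconds: 30

# Kubernetes Deployment: detect a stalled rollout
# (a plain Deployment does not roll back on its own; on failure, run
# `kubectl rollout undo` or let a controller do it)
apiVersion: apps/v1
kind: Deployment
spec:
  progressDeadlineSeconds: 300  # Marked as failed if there is no progress for 5min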
  minReadySeconds: 30
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0

9.2 Circuit Breaker Pattern

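// Deployment circuit breaker
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Metrics aggregated over the evaluation window
interface DeploymentMetrics {
  errors: number;
  totalRequests: number;
  p99LatencyMs: number;
}

// Hypothetical notifier; replace with your Slack or PagerDuty client
async function notify(alert: { channel: string; message: string; severity: string }) {
  console.log(`[${alert.severity}] ${alert.channel}: ${alert.message}`);
}

class DeploymentCircuitBreaker {
  private errorThreshold = 0.05;      // 5% error rate
  private latencyThresholdMs = 3000;  // 3 seconds at p99
  readonly windowSeconds = 300;       // Callers aggregate metrics over this 5 minute window

  async shouldRollback(metrics: DeploymentMetrics): Promise<boolean> {
    // No traffic yet: nothing to judge, and this avoids dividing by zero
    if (metrics.totalRequests === 0) return false;
    const errorRate = metrics.errors / metrics.totalRequests;

    return (
      errorRate > this.errorThreshold ||
      metrics.p99LatencyMs > this.latencyThresholdMs
    );
  }

  async executeRollback(deployment: string) {
    console.log(`Rolling back ${deployment}`);
    // execFile passes arguments without a shell, so the name cannot inject commands
    await execFileAsync('kubectl', ['rollout', 'undo', `deployment/${deployment}`]);

    await notify({
      channel: '#deployments',
      message: `Auto-rollback triggered for ${deployment}`,
      severity: 'critical'
    });
  }
}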

9.3 Database Migration Rollback

Safe DB migration strategy:
1. Expand-Contract Pattern
   Phase 1 (Expand): Add new columns, write to both
   Phase 2 (Migrate): Migrate existing data
   Phase 3 (Contract): Remove old columns

2. Only apply rollback-safe migrations
   - Add column (rollback-safe)
   - Add index (rollback-safe)
   - Drop column (NOT safe → use Expand-Contract)
   - Change type (NOT safe → add new column, then switch)

-- Safe migration example
-- Step 1: Add new column (rollback-safe)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;

-- Step 2: Data migration (background)
UPDATE users SET email_verified = TRUE
WHERE verified_at IS NOT NULL;

-- Step 3: Switch app code to use new column
-- Step 4: Drop old column (separate migration)
-- ALTER TABLE users DROP COLUMN verified_at;

10. Pipeline Health Monitoring

10.1 Key Metrics

Pipeline Health Dashboard:
┌─────────────────────────────────────────┐
│  Build Time Trend                       │
│  ██████████████ 8m (avg)                │
│  Target: < 10m                          │
├─────────────────────────────────────────┤
│  Success Rate                           │
│  ████████████████████ 94%               │
│  Target: > 95%                          │
├─────────────────────────────────────────┤
│  Flaky Test Rate                        │
│  ██ 3%                                  │
│  Target: < 2%                           │
├─────────────────────────────────────────┤
│  Mean Time to Recovery (MTTR)           │
│  ████ 25min                             │
│  Target: < 30min                        │
└─────────────────────────────────────────┘
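
The first two dashboard rows can be derived directly from build records. The sketch below computes them and checks each against the targets shown above; the `Build` record and `health_report` helper are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Build:
    # Hypothetical record; most CI systems expose these fields through their API.
    duration_min: float
    succeeded: bool


def health_report(builds: list[Build]) -> dict[str, tuple[float, bool]]:
    """Return each metric as (value, meets_target) using the dashboard's targets."""
    if not builds:
        raise ValueError("no builds to report on")
    avg = sum(b.duration_min for b in builds) / len(builds)
    rate = sum(b.succeeded for b in builds) / len(builds)
    return {
        "avg_build_min": (avg, avg < 10),     # Target: < 10m
        "success_rate": (rate, rate > 0.95),  # Target: > 95%
    }


# 100 builds of 8 minutes each, 94 of them green, as in the dashboard above.
builds = [Build(8.0, i < 94) for i in range(100)]
report = health_report(builds)
print(report)  # {'avg_build_min': (8.0, True), 'success_rate': (0.94, False)}
```

Returning the pass or fail flag next to the value makes it straightforward to alert only on the metrics that miss their target, here the success rate.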

10.2 Build Time Tracking

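# Report build metrics to Datadog
# Assumes an earlier step recorded the start time:
#   echo "START_TIME=$(date +%s)" >> "$GITHUB_ENV"
- name: Report build metrics
  if: always()
  env:
    DD_API_KEY: ${{ secrets.DD_API_KEY }}
    JOB_STATUS: ${{ job.status }}
  run: |
    END_TIME=$(date +%s)
    DURATION=$((END_TIME - START_TIME))
    curl -X POST "https://api.datadoghq.com/api/v1/series" \
      -H "Content-Type: application/json" \
      -H "DD-API-KEY: $DD_API_KEY" \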
      -d "{
        \"series\": [{
          \"metric\": \"ci.build.duration\",
          \"points\": [[$END_TIME, $DURATION]],
          \"tags\": [
            \"repo:myapp\",
            \"branch:$GITHUB_REF_NAME\",
            \"status:$JOB_STATUS\"
          ]
        }]
      }"

10.3 Failure Analysis Automation

# Automated build failure classification script
import re
from enum import Enum

class FailureCategory(Enum):
    FLAKY_TEST = "flaky_test"
    DEPENDENCY = "dependency"
    COMPILATION = "compilation"
    INFRASTRUCTURE = "infrastructure"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"

def categorize_failure(log: str) -> FailureCategory:
    patterns = {
        FailureCategory.FLAKY_TEST: [
            r"retry.*failed",
            r"intermittent",
            r"flaky"
        ],
        FailureCategory.DEPENDENCY: [
            r"npm ERR!.*404",
            r"Could not resolve dependencies",
            r"ECONNRESET"
        ],
        FailureCategory.COMPILATION: [
            r"error TS\d+",
            r"SyntaxError",
            r"TypeError"
        ],
        FailureCategory.INFRASTRUCTURE: [
            r"runner.*offline",
            r"disk space",
            r"out of memory"
        ],
        FailureCategory.TIMEOUT: [
            r"timed out",
            r"deadline exceeded"
        ]
    }

    for category, regexes in patterns.items():
        for pattern in regexes:
            if re.search(pattern, log, re.IGNORECASE):
                return category

    return FailureCategory.UNKNOWN

11. Full-Stack CI/CD Pipeline Example

# .github/workflows/production.yml
name: Production Pipeline

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read
  packages: write

jobs:
  # Phase 1: Code quality
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check

  # Phase 2: Security scanning
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Semgrep
        uses: returntocorp/semgrep-action@v1
      - name: Run Trivy (filesystem)
        # @master is a moving ref; pin to a release tag or commit SHA in real pipelines
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'HIGH,CRITICAL'

  # Phase 3: Tests
  test:
    needs: quality
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: testpass
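        ports:
          - 5432:5432
        # Wait until Postgres accepts connections before the steps start
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5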
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost:5432/testdb

  # Phase 4: Build and push
  build:
    needs: [test, security]
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/myorg/myapp
          # Emit one immutable tag (sha-<short commit>) so the manifest update gets a single value
          tags: type=sha
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # Phase 5: Update deployment manifests (GitOps)
  update-manifest:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: myorg/gitops-config
          token: ${{ secrets.GITOPS_TOKEN }}
      - name: Update image tag
        run: |
          cd services/api
          kustomize edit set image myapp=${{ needs.build.outputs.image-tag }}
      - name: Commit and push
        run: |
          git config user.name "CI Bot"
          git config user.email "ci@example.com"
          git add .
          git commit -m "chore: update api-service image"
          git push

12. Interview Questions

Basic Concepts

Q1. Explain the difference between CI and CD.

CI (Continuous Integration) is the practice of frequently integrating code changes into the main branch. Each integration is validated through automated builds and tests.

CD has two meanings:

Continuous Delivery: Code is always in a deployable state. Production deployment requires manual approval.
Continuous Deployment: Every change is automatically deployed to production with no manual intervention.

Key difference: CI focuses on "integration" while CD focuses on "delivery/deployment." CD is impossible without CI, but you can do CI without CD.

Q2. Explain the four DORA metrics.

Deployment Frequency: How often you deploy to production
Lead Time for Changes: Time from commit to production deployment
Change Failure Rate: Percentage of deployments that fail or require rollback
MTTR (Mean Time to Recovery): Time from incident to recovery

Elite teams: Multiple deploys per day, under 1 hour lead time, under 5% failure rate, under 1 hour recovery.

Q3. Explain the core principles of GitOps.

Declarative: System state is defined declaratively
Versioned: Git serves as the single source of truth
Automated: Approved changes are automatically applied to the system
Self-Healing: Actual state is automatically reconciled with declared state

Benefits: Audit trail, easy rollback, PR-based change management, reproducible environments.

Q4. Compare Blue-Green and Canary deployments.

Blue-Green: Two identical environments (Blue/Green). Deploy new version to Green, then switch all traffic at once. Rollback by switching back to Blue.

Pros: Instant rollback, simple implementation
Cons: 2x resource cost, complex database synchronization

Canary: Deploy new version to a small percentage (1-10%) first. Analyze metrics then gradually increase.

Pros: Minimal risk, verification with real traffic
Cons: Complex implementation, monitoring required

Q5. What is Shift Left?

A strategy that moves testing and security to earlier stages (left side) of the development lifecycle.

Examples:

Pre-commit hooks for code lint, format, and secret scanning
SAST, SCA, and unit tests at the PR stage
Container image scanning during build
IDE plugins for real-time feedback during development

Impact: The earlier a defect is found, the cheaper it is to fix. Commonly cited estimates put the cost at 10-100x lower than fixing the same defect in production.

Advanced Questions

Q6. How do you manage flaky tests?

Detection: Identify tests that produce different results on the same code
Isolation: Separate flaky tests into a dedicated suite with continue-on-error
Retry: Auto-retry with jest retryTimes, pytest-rerunfailures
Tracking: Dashboard for frequency and pattern analysis
Root cause resolution: Fix timing issues, shared state, external dependencies
Policy: Disable or delete if not fixed within a set period

Q7. How do you optimize Docker image builds?

Multi-stage builds: Exclude build tools from the final image
Layer caching optimization: COPY frequently-changed files last
Lightweight base images: Use alpine, distroless
.dockerignore: Exclude unnecessary files
Use BuildKit: Parallel builds, cache mounts
Dependency separation: Copy package.json first for npm ci caching
Kaniko: Build without Docker daemon (improved CI/CD security)

Q8. Describe secret management best practices.

Never commit to Git: Pre-commit scanning with gitleaks, detect-secrets
OIDC-based auth: Use temporary tokens instead of long-lived secrets
Secret managers: AWS Secrets Manager, HashiCorp Vault, Doppler
Least privilege: Grant only minimum necessary permissions
Secret rotation: Automate regular secret renewal
Audit logs: Track secret access
Environment separation: Separate dev/staging/prod secrets

Q9. How do you safely perform database migration rollbacks?

Expand-Contract Pattern:

Phase 1 (Expand):

Add new columns/tables
Modify app code to be compatible with both old and new schemas
Start writing to new schema

Phase 2 (Migrate):

Migrate existing data to new schema (background)
Switch app to use new schema only

Phase 3 (Contract):

Remove old columns/tables (separate deployment)
Only this phase is non-reversible

Key: Each phase must be independently rollback-safe.

Q10. How do you secure a CI/CD pipeline?

Supply Chain Security: Generate SBOMs, sign images (cosign), comply with SLSA
Secret management: OIDC, Vault, minimize environment variables
SAST/DAST/SCA: Integrate Semgrep (SAST), Trivy and Dependabot (SCA), and a DAST scanner such as OWASP ZAP
Container security: Non-root users, distroless base, image scanning
Policy compliance: Enforce deployment policies with OPA/Kyverno
Access control: Least privilege, branch protection rules
Auditing: Track all deployments, maintain change history

Q11. Compare the pros and cons of GitHub Actions and Jenkins.

GitHub Actions:

Pros: Native GitHub integration, SaaS (no maintenance), Marketplace ecosystem, simple YAML config
Cons: GitHub lock-in, customization limits, complex workflow management

Jenkins:

Pros: Full flexibility, 1800+ plugins, self-hosted control, Groovy scripting
Cons: High maintenance cost, complex setup, security patch management, scaling challenges

Criteria: Small/GitHub-centric projects favor Actions; complex enterprise/multi-SCM environments favor Jenkins.

Q12. Explain Progressive Delivery with Argo Rollouts.

Progressive Delivery deploys new versions incrementally while verifying safety through automated analysis.

Argo Rollouts workflow:

Allocate 10% traffic to canary
Verify success rate and latency via AnalysisTemplate (5 min)
Increase to 30% on pass, analyze again
Gradually scale to 60%, then 100%
Auto-rollback on analysis failure

Key components:

Rollout: Defines deployment strategy
AnalysisTemplate: Defines verification conditions (Prometheus, Datadog, etc.)
TrafficRouting: Integrates with Istio, Nginx, ALB

Q13. How do you optimize pipeline performance?

Parallelism: Run independent jobs concurrently
Test splitting: Shard tests across multiple runners
Caching: Cache dependencies, Docker layers, build artifacts
Selective execution: Run only jobs relevant to changed files
Incremental builds: Build only changed parts
Resource optimization: Tune runner size and concurrency limits
Shorten feedback loop: Fast checks first, slow checks later

Q14. What are the pros and cons of feature flag-based deployments?

Pros:

Decouple deployment from release: deploy code but activate features later
Fast rollback: simply disable the flag with no code rollback
Progressive rollout: gradually expand user percentage
A/B testing: experiment per feature

Cons:

Technical debt: old flags need cleanup
Complexity: flag combinations increase test permutations
Code readability: increased conditional logic
Consistency: different user experiences make bug reproduction harder

Q15. Describe CI/CD strategies for monorepos.

Impact analysis: Build/test only packages affected by changes
Tool usage: Turborepo, Nx, Bazel for dependency-graph-based builds
Caching: Remote cache (Turborepo Remote Cache) for shared build results
Selective deployment: Deploy only changed services
Parallelism: Build/test independent packages concurrently

# Turborepo example
turbo run build --filter='...[HEAD~1]'
# Builds only packages changed since HEAD~1 (the previous commit) and their dependents
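
The impact analysis behind such filters is a graph traversal: start from the changed packages and walk the dependency graph in reverse to collect everything that depends on them. The sketch below illustrates the idea with a hypothetical workspace; it is not how Turborepo, Nx, or Bazel implement it internally.

```python
from collections import deque

# Hypothetical workspace: package -> packages it depends on.
DEPENDS_ON = {
    "ui": [],
    "utils": [],
    "api": ["utils"],
    "web": ["ui", "api"],
    "docs": [],
}


def affected(changed: set[str]) -> set[str]:
    """Changed packages plus everything that transitively depends on them."""
    # Invert the graph: package -> packages that depend on it.
    dependents: dict[str, set[str]] = {pkg: set() for pkg in DEPENDS_ON}
    for pkg, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].add(pkg)

    result, queue = set(changed), deque(changed)
    while queue:
        for parent in dependents[queue.popleft()]:
            if parent not in result:
                result.add(parent)
                queue.append(parent)
    return result


print(sorted(affected({"utils"})))  # ['api', 'utils', 'web']
```

A change to `utils` rebuilds `api` and `web` but leaves `ui` and `docs` untouched, which is where the time savings in a large monorepo come from.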

13. Quiz

Q1. What is the deployment frequency of an Elite team according to DORA metrics?

Answer: Multiple times per day (On-demand, multiple deploys per day)

Elite teams deploy multiple times per day while maintaining a change failure rate below 5% and recovery time under 1 hour.

Q2. Which phase in the Expand-Contract pattern is non-reversible?

Answer: The Contract phase (dropping old columns/tables)

Expand (addition) and Migrate (data migration) are reversible, but Contract (deletion) is irreversible since data is removed. Therefore, the Contract phase is performed separately after sufficient stabilization.

Q3. What is the "single source of truth" in GitOps?

Answer: The Git repository

In GitOps, the Git repository is the sole source defining the desired state of the system. The actual cluster state must always match the state declared in Git, and tools like ArgoCD automatically detect and synchronize any drift.

Q4. What triggers an automatic rollback in canary deployment?

Answer: Metric criteria defined in the AnalysisTemplate (success rate, latency, etc.)

Argo Rollouts queries metrics from Prometheus, Datadog, etc. via the AnalysisTemplate. If the success rate falls below the threshold (e.g., 95%) or P99 latency exceeds the limit, an automatic rollback is triggered.

Q5. What is the purpose of an SBOM (Software Bill of Materials)?

Answer: To provide a list of all components (libraries, dependencies) included in software, strengthening supply chain security

An SBOM is an "ingredient list" for software. It helps quickly identify impact scope when vulnerabilities are discovered, verify license compliance, and respond to supply chain attacks (e.g., Log4Shell). It can be automatically generated using tools like Syft and Trivy.