CI/CD Best Practices 2025: Pipeline Design, Automation, and Security for Teams

Introduction

In 2025, CI/CD is no longer optional; it is essential. According to Google's DORA (DevOps Research and Assessment) report, Elite-level teams deploy multiple times per day while maintaining a change failure rate below 5%. In contrast, Low-level teams deploy once every several months, with failure rates reaching 46-60%.

The core of this gap lies in pipeline design. It is not just about adopting CI/CD tools, but about a comprehensive strategy that includes test automation, security integration, progressive delivery, and observability.

This article covers everything you need to design and operate CI/CD pipelines: platform comparisons, pipeline design principles, testing strategies, Docker build optimization, GitOps, security, deployment strategies, rollback approaches, and monitoring.


1. CI/CD in 2025

1.1 Team Performance Through DORA Metrics

DORA Metrics are four key indicators that measure software delivery performance.

| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | Multiple/day | Weekly~Monthly | Monthly~6 months | 6+ months |
| Lead Time (commit to deploy) | Under 1 hour | 1 day~1 week | 1 week~1 month | 1~6 months |
| Change Failure Rate | 0~5% | 5~10% | 10~15% | 46~60% |
| MTTR (Recovery Time) | Under 1 hour | Under 1 day | 1 day~1 week | 6+ months |
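
Two of these metrics can be computed directly from deployment records. Below is a minimal Python sketch; the record shape (`finished_at`, `failed`) and the `dora_summary` helper are assumptions for illustration, not the output format of any specific tool:

```python
from datetime import datetime, timedelta

def dora_summary(deployments):
    """Compute deployment frequency (deploys per day) and change
    failure rate from a list of deployment records."""
    if not deployments:
        return {"deploys_per_day": 0.0, "change_failure_rate": 0.0}
    times = sorted(d["finished_at"] for d in deployments)
    # Floor the window at one day so a single burst doesn't divide by ~0.
    span_days = max((times[-1] - times[0]).total_seconds() / 86400, 1.0)
    failures = sum(1 for d in deployments if d["failed"])
    return {
        "deploys_per_day": len(deployments) / span_days,
        "change_failure_rate": failures / len(deployments),
    }

# Example: 10 deployments spread over ~2 days, 1 of them failed.
now = datetime(2025, 3, 23)
deploys = [
    {"finished_at": now + timedelta(hours=5 * i), "failed": (i == 3)}
    for i in range(10)
]
summary = dora_summary(deploys)
```

With this history the team lands in the Elite band for frequency (more than 5 deploys per day) while its 10% change failure rate would place it in the High band.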

1.2 Shift Left Strategy

Shift Left is the practice of moving testing and security to earlier stages in the development lifecycle.

Traditional approach:
Code → Build → Test → Security → Deploy → Monitor
                                    ↑ Problems found here

Shift Left:
Code + Test + Security → Build → Deploy → Monitor
          ↑ Problems found here (10x cost reduction)

Key principles:

  • Pre-commit validation: Lint, format, and secret scanning via pre-commit hooks
  • PR-level testing: Unit tests + integration tests + SAST run automatically
  • Build-time security: Container image scanning, dependency vulnerability checks
  • Pre-deploy verification: Smoke tests, canary analysis

Trends shaping CI/CD in 2025:

  • Platform Engineering: Standardizing CI/CD through developer self-service platforms
  • AI-Powered CI/CD: Test failure prediction, auto-rollback decisions, flaky test detection
  • eBPF-Based Observability: A new paradigm for pipeline performance monitoring
  • Supply Chain Security: SBOM, SLSA, and Sigstore-based software supply chain security

2. CI/CD Platform Comparison

2.1 Major Platforms at a Glance

| Feature | GitHub Actions | Jenkins | GitLab CI | CircleCI |
|---|---|---|---|---|
| Hosting | SaaS/Self-hosted | Self-hosted | SaaS/Self-hosted | SaaS |
| Config | YAML | Groovy/YAML | YAML | YAML |
| Ecosystem | Marketplace 15,000+ | Plugins 1,800+ | Built-in integrations | Orbs 3,000+ |
| Container Support | Native | Plugin | Native | Native |
| Self-hosted Runner | Supported | Default | Supported | Supported |
| Pricing | 2,000 min free/mo | Free (OSS) | 400 min free/mo | 6,000 credits free/mo |
| Learning Curve | Low | High | Medium | Low |
| Caching | 10GB/repo | Plugin | Native | Native |

2.2 GitHub Actions

# .github/workflows/ci.yml
name: CI Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4

  build:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: myapp:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

2.3 Jenkins Pipeline

// Jenkinsfile (Declarative Pipeline)
pipeline {
    agent any

    environment {
        DOCKER_REGISTRY = 'registry.example.com'
        IMAGE_NAME = 'myapp'
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }

        stage('Lint & Test') {
            parallel {
                stage('Lint') {
                    steps {
                        sh 'npm run lint'
                    }
                }
                stage('Unit Test') {
                    steps {
                        sh 'npm test -- --coverage'
                    }
                    post {
                        always {
                            junit 'reports/junit.xml'
                        }
                    }
                }
            }
        }

        stage('Build & Push') {
            steps {
                script {
                    def image = docker.build("${DOCKER_REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}")
                    docker.withRegistry("https://${DOCKER_REGISTRY}", 'registry-credentials') {
                        image.push()
                        image.push('latest')
                    }
                }
            }
        }
    }

    post {
        failure {
            slackSend(
                channel: '#ci-alerts',
                color: 'danger',
                message: "Build FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
            )
        }
    }
}

2.4 GitLab CI

# .gitlab-ci.yml
stages:
  - lint
  - test
  - build
  - deploy

variables:
  DOCKER_HOST: tcp://docker:2376

lint:
  stage: lint
  image: node:20-alpine
  cache:
    key: npm-cache
    paths:
      - node_modules/
  script:
    - npm ci
    - npm run lint

test:
  stage: test
  image: node:20-alpine
  parallel: 4
  script:
    - npm ci
    - npm test -- --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL
  coverage: '/Statements\s*:\s*(\d+\.?\d*)%/'
  artifacts:
    reports:
      junit: reports/junit.xml

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker build -t myapp:latest .
    - docker push myapp:latest
  only:
    - main

3. Pipeline Design Principles

3.1 Fast Feedback Loops

The time developers wait for results after opening a PR directly impacts productivity.

Target times:
├── Lint + format check: under 30s
├── Unit tests: under 2min
├── Integration tests: under 5min
├── Build: under 3min
└── Full pipeline: under 10min

Reality (before optimization): 30min+
Reality (after optimization): 8-10min

3.2 Parallelism

# Parallel pipeline example
jobs:
  # Phase 1: Lint/security run independently in parallel
  lint:
    runs-on: ubuntu-latest
    # ...
  security-scan:
    runs-on: ubuntu-latest
    # ...

  # Phase 2: Tests split into shards
  test:
    needs: [lint]
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    # ...

  # Phase 3: Build after tests pass
  build:
    needs: [test, security-scan]
    # ...

3.3 Caching Strategy

# npm caching (GitHub Actions)
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-

# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

# Gradle caching
- uses: actions/cache@v4
  with:
    path: |
      ~/.gradle/caches
      ~/.gradle/wrapper
    key: gradle-${{ hashFiles('**/*.gradle*') }}

3.4 Idempotency

Pipelines should always produce the same result for the same input.

# Bad: Timestamp-based tag (different result on re-run)
# IMAGE_TAG: my-app:build-20250323-142000

# Good: Commit SHA-based tag (always the same)
# IMAGE_TAG: my-app:abc1234

# Good: Semantic version (deterministic)
# IMAGE_TAG: my-app:v1.2.3
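
The commit-SHA tagging rule can be captured in a tiny helper. This is a sketch: `image_tag` is a hypothetical function name, and in a real pipeline the SHA would come from the CI environment (e.g. `GITHUB_SHA`) rather than a literal:

```python
def image_tag(repo: str, sha: str) -> str:
    """Derive a deterministic image tag from the commit SHA.
    Re-running the pipeline on the same commit yields the same tag,
    which is exactly the idempotency property described above."""
    return f"{repo}:{sha[:7]}"  # short SHA, as git itself abbreviates

# In CI the SHA is provided by the runner; here we pass it explicitly.
tag = image_tag("my-app", "abc1234def5678")
```

Because the tag is a pure function of the commit, a re-run overwrites the same image instead of producing an untraceable duplicate.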

4. Testing in CI

4.1 Test Pyramid

            /   E2E    \          Slow but high confidence
           /  (5-10%)   \
          / Integration  \        Medium speed, medium confidence
         /   (15-25%)     \
        /   Unit Tests     \      Fast and numerous
       /     (65-80%)       \
      /______________________\

4.2 Test Splitting

# Jest test sharding
test:
  strategy:
    matrix:
      shard: [1, 2, 3, 4]
  steps:
    - run: npx jest --shard=${{ matrix.shard }}/4

# Cypress parallel execution
e2e:
  strategy:
    matrix:
      container: [1, 2, 3]
  steps:
    - uses: cypress-io/github-action@v6
      with:
        record: true
        parallel: true
        group: 'e2e-tests'

4.3 Flaky Test Management

Flaky tests are those that sometimes pass and sometimes fail on the same code.

// Flaky test detection and isolation strategy
// jest.config.js
module.exports = {
  // Auto-retry on failure
  retryTimes: 2,

  // Flaky test reporter
  reporters: [
    'default',
    ['jest-flaky-reporter', {
      outputFile: 'flaky-tests.json',
      threshold: 3  // Report if flaky 3+ times
    }]
  ]
};

# Isolate flaky tests in CI
test-stable:
  runs-on: ubuntu-latest
  steps:
    - run: npx jest --testPathIgnorePatterns="flaky"

test-flaky:
  runs-on: ubuntu-latest
  continue-on-error: true  # Pipeline continues on failure
  steps:
    - run: npx jest --testPathPattern="flaky" --retries=3

4.4 Coverage Gates

# Set coverage thresholds
test:
  steps:
    - run: npx jest --coverage
    - name: Check coverage threshold
      run: |
        COVERAGE=$(cat coverage/coverage-summary.json | jq '.total.statements.pct')
        if (( $(echo "$COVERAGE < 80" | bc -l) )); then
          echo "Coverage $COVERAGE% is below 80% threshold"
          exit 1
        fi

5. Docker Build Optimization

5.1 Multi-Stage Builds

# Stage 1: Install dependencies
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production

# Stage 2: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

# Stage 3: Production image
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

# Security: non-root user
RUN addgroup --system --gid 1001 nodejs && \
    adduser --system --uid 1001 nextjs

COPY --from=builder --chown=nextjs:nodejs /app/.next ./.next
COPY --from=deps --chown=nextjs:nodejs /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER nextjs
EXPOSE 3000
CMD ["npm", "start"]

5.2 Layer Caching Optimization

# Bad: Source changes trigger npm ci re-run
COPY . .
RUN npm ci
RUN npm run build

# Good: Copy dependency files first
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

5.3 BuildKit and Buildx

# Using BuildKit in GitHub Actions
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
    platforms: linux/amd64,linux/arm64

5.4 Kaniko (Daemonless Build)

# Build images with Kaniko in Kubernetes
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - "--dockerfile=Dockerfile"
        - "--context=git://github.com/myorg/myapp"
        - "--destination=registry.example.com/myapp:latest"
        - "--cache=true"
        - "--cache-repo=registry.example.com/myapp/cache"

5.5 Image Size Optimization

Image size comparison:
├── node:20               → 1.1GB
├── node:20-slim          → 220MB
├── node:20-alpine        → 140MB
├── distroless/nodejs     → 120MB
└── Multi-stage optimized → 80-100MB

6. GitOps with ArgoCD

6.1 GitOps Principles

GitOps uses a Git repository as the Single Source of Truth for system operations.

GitOps workflow:
1. Developer pushes changes to Git
2. CI builds and tests the image
3. CI updates image tag in deployment manifests
4. ArgoCD compares Git vs cluster state
5. Auto-sync on drift (or manual approval)
6. Cluster matches Git state

┌────────┐    Push     ┌────────┐   Detect   ┌────────┐
│  Dev   │ ──────────> │  Git   │ <────────> │ ArgoCD │
└────────┘             └────────┘            └───┬────┘
                                                 │ Sync
                                            ┌────▼────┐
                                            │   K8s   │
                                            │ Cluster │
                                            └─────────┘
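
The drift-detection step (4) is the heart of GitOps and can be illustrated with a minimal sketch. `detect_drift` is a hypothetical helper, not ArgoCD's API; it only shows the desired-vs-actual comparison a reconciler performs:

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Compare the Git-declared state with the live cluster state and
    return the fields that need to be reconciled."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"desired": want, "actual": have}
    return drift

# Git says 3 replicas; someone scaled the live Deployment down to 2.
desired_state = {"image": "myapp:abc1234", "replicas": 3}
live_state = {"image": "myapp:abc1234", "replicas": 2}
drift = detect_drift(desired_state, live_state)
# A self-healing controller would now patch `replicas` back to 3.
```

With `selfHeal: true`, ArgoCD performs this comparison continuously and applies the Git side of every diff automatically.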

6.2 ArgoCD App of Apps Pattern

# apps/root-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops-config
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

# apps/api-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops-config
    path: services/api
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

6.3 Argo Rollouts (Progressive Delivery)

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api-vsvc
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 30
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 60
        - pause:
            duration: 5m
        - setWeight: 100

# AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"2.*",app="api-service",version="canary"}[5m]))
            /
            sum(rate(http_requests_total{app="api-service",version="canary"}[5m]))
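
The `successCondition` above boils down to a ratio check over the canary's traffic. A sketch of the same logic in Python (`canary_passes` is a hypothetical helper, assuming raw 2xx and total request counts in place of the Prometheus rates):

```python
def canary_passes(success_2xx: float, total: float,
                  threshold: float = 0.95) -> bool:
    """Mirror the AnalysisTemplate condition: the share of 2xx
    responses on the canary must stay at or above the threshold."""
    if total == 0:
        return False  # no traffic yet: treat as inconclusive, do not promote
    return success_2xx / total >= threshold

passes = canary_passes(success_2xx=970, total=1000)  # 97% >= 95%
fails = canary_passes(success_2xx=900, total=1000)   # 90% < 95%
```

When the condition evaluates false, Argo Rollouts aborts the rollout and shifts traffic back to the stable subset.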

7. CI/CD Security

7.1 Security Scanning Integration

CI/CD Security Layers:
┌──────────────────────────────────────────────┐
│ Layer 1: Pre-commit                          │
│  - Secret scanning (gitleaks, detect-secrets)│
│  - Lint (security rules)                     │
├──────────────────────────────────────────────┤
│ Layer 2: PR / Build                          │
│  - SAST (Semgrep, CodeQL, SonarQube)         │
│  - SCA (Dependabot, Snyk, Trivy)             │
│  - License compliance                        │
├──────────────────────────────────────────────┤
│ Layer 3: Container Build                     │
│  - Image scanning (Trivy, Grype)             │
│  - Base image policy (distroless, alpine)    │
│  - SBOM generation (Syft)                    │
├──────────────────────────────────────────────┤
│ Layer 4: Deploy                              │
│  - Policy enforcement (OPA/Kyverno)          │
│  - Signing (cosign, Sigstore)                │
│  - Runtime security (Falco)                  │
└──────────────────────────────────────────────┘

7.2 Secret Management

# GitHub Actions secret usage
deploy:
  steps:
    - name: Deploy
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        aws ecs update-service --cluster prod --service api

# OIDC-based authentication (secretless - recommended)
permissions:
  id-token: write
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions
      aws-region: ap-northeast-2

7.3 SBOM and Supply Chain Security

# Generate SBOM with Syft
- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: myapp:latest
    format: spdx-json
    output-file: sbom.spdx.json

# Sign image with cosign
- name: Sign image
  run: |
    cosign sign --key env://COSIGN_PRIVATE_KEY myapp:latest

# Verify signature with cosign
- name: Verify signature
  run: |
    cosign verify --key cosign.pub myapp:latest

7.4 Automated Secret Scanning

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks

# Run gitleaks in CI
security:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - uses: gitleaks/gitleaks-action@v2
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

8. Deployment Strategies Compared

8.1 Strategy Comparison Table

| Strategy | Downtime | Risk | Resource Cost | Rollback Speed | Complexity |
|---|---|---|---|---|---|
| Recreate | Yes | High | 1x | Slow | Low |
| Rolling Update | No | Medium | 1x~1.25x | Medium | Low |
| Blue-Green | No | Low | 2x | Instant | Medium |
| Canary | No | Very Low | 1.1x | Instant | High |
| A/B Testing | No | Very Low | 1.1x | Instant | Very High |

8.2 Blue-Green Deployment

# Kubernetes Blue-Green deployment
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
    version: green  # Switch from blue to green
  ports:
    - port: 80
      targetPort: 8080

Blue-Green switchover process:
1. Blue(v1) running → Deploy Green(v2)
2. Green health check and smoke test
3. Switch service selector to Green
4. Roll back to Blue immediately on issues
5. Clean up Blue resources after stabilization

Before switch:
[Users] → [LB] → [Blue v1]  (Active)
                 [Green v2] (Preparing)

After switch:
[Users] → [LB] → [Green v2] (Active)
                 [Blue v1]  (Standby)
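
The selector switch in step 3 can be modeled as a pure function over the Service spec. This is a sketch of the idea, not the kubectl command you would actually run; the dict shape loosely mirrors the Service manifest above:

```python
def switch_traffic(service: dict, target: str) -> dict:
    """Return an updated Service spec with the selector pointed at the
    target color. The other color keeps running untouched, which is
    what makes rollback instant."""
    return {**service, "selector": {**service["selector"], "version": target}}

svc = {"name": "api-service", "selector": {"app": "api", "version": "blue"}}
svc_green = switch_traffic(svc, "green")          # cut over to Green
svc_rollback = switch_traffic(svc_green, "blue")  # instant rollback to Blue
```

Because the function returns a new spec rather than mutating the old one, the original Blue configuration is always available to roll back to, mirroring step 4.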

8.3 Canary Deployment

# Canary deployment with Istio VirtualService
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
spec:
  hosts:
    - api-service
  http:
    - route:
        - destination:
            host: api-service
            subset: stable
          weight: 90
        - destination:
            host: api-service
            subset: canary
          weight: 10

8.4 Feature Flags

// LaunchDarkly or custom feature flag system
import { featureFlags } from './feature-flags';

async function handleRequest(req: Request) {
  const userId = req.user.id;

  if (await featureFlags.isEnabled('new-checkout-flow', userId)) {
    return newCheckoutFlow(req);
  }

  return legacyCheckoutFlow(req);
}

Feature flag-based deployment:
1. Deploy code with new feature wrapped in a flag
2. Enable for internal users only
3. Gradually increase rollout (1% → 5% → 25% → 100%)
4. Disable flag immediately if issues arise
5. Decouple deployment from release
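
The gradual rollout in step 3 is typically implemented with stable hash bucketing, so a given user's answer never flips back and forth as the percentage grows. A minimal sketch (`is_enabled` here is a hypothetical helper, not the LaunchDarkly API):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Hash the user into a stable 0-99 bucket. The same user always
    lands in the same bucket, so the set of enabled users only grows
    as the rollout percentage increases."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct

# At 100% everyone is in; at 0% nobody is.
always = is_enabled("new-checkout-flow", "user-42", 100)
never = is_enabled("new-checkout-flow", "user-42", 0)
```

Hashing `flag:user_id` together (rather than the user alone) keeps rollout populations independent across different flags.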

9. Rollback Strategies

9.1 Automatic Rollback

# Argo Rollouts automatic rollback
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: error-rate-check
      # Auto-rollback on analysis failure
      abortScaleDownDelaySeconds: 30

# Kubernetes Deployment automatic rollback
apiVersion: apps/v1
kind: Deployment
spec:
  progressDeadlineSeconds: 300  # Fails if not done in 5min
  minReadySeconds: 30
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0

9.2 Circuit Breaker Pattern

// Deployment circuit breaker
class DeploymentCircuitBreaker {
  private errorThreshold = 0.05; // 5% error rate
  private windowSize = 300;       // 5 minute window

  async shouldRollback(metrics: DeploymentMetrics): Promise<boolean> {
    const errorRate = metrics.errors / metrics.totalRequests;
    const p99Latency = metrics.p99LatencyMs;

    return (
      errorRate > this.errorThreshold ||
      p99Latency > 3000 // Over 3 seconds
    );
  }

  async executeRollback(deployment: string) {
    console.log(`Rolling back ${deployment}`);
    await exec(`kubectl rollout undo deployment/${deployment}`);

    await notify({
      channel: '#deployments',
      message: `Auto-rollback triggered for ${deployment}`,
      severity: 'critical'
    });
  }
}

9.3 Database Migration Rollback

Safe DB migration strategy:
1. Expand-Contract Pattern
   Phase 1 (Expand): Add new columns, write to both
   Phase 2 (Migrate): Migrate existing data
   Phase 3 (Contract): Remove old columns

2. Only apply rollback-safe migrations
   - Add column (rollback-safe)
   - Add index (rollback-safe)
   - Drop column (NOT safe → use Expand-Contract)
   - Change type (NOT safe → add new column, then switch)

-- Safe migration example
-- Step 1: Add new column (rollback-safe)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;

-- Step 2: Data migration (background)
UPDATE users SET email_verified = TRUE
WHERE verified_at IS NOT NULL;

-- Step 3: Switch app code to use new column
-- Step 4: Drop old column (separate migration)
-- ALTER TABLE users DROP COLUMN verified_at;

10. Pipeline Health Monitoring

10.1 Key Metrics

Pipeline Health Dashboard:
┌─────────────────────────────────────────┐
│ Build Time Trend                        │
│  ██████████████ 8m (avg)                │
│  Target: < 10m                          │
├─────────────────────────────────────────┤
│ Success Rate                            │
│  ████████████████████ 94%               │
│  Target: > 95%                          │
├─────────────────────────────────────────┤
│ Flaky Test Rate                         │
│  ██ 3%                                  │
│  Target: < 2%                           │
├─────────────────────────────────────────┤
│ Mean Time to Recovery (MTTR)            │
│  ████ 25min                             │
│  Target: < 30min                        │
└─────────────────────────────────────────┘
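
The success-rate and MTTR panels can be computed from a plain build history. A minimal sketch, assuming a record format (`minute`, `ok`) invented for illustration and a chronologically ordered list:

```python
def pipeline_health(builds: list) -> dict:
    """Compute success rate and mean time to recovery (minutes from the
    first failing build to the next passing one) over a build history."""
    total = len(builds)
    successes = sum(1 for b in builds if b["ok"])
    recoveries = []
    failed_at = None
    for b in builds:  # builds assumed in chronological order
        if not b["ok"] and failed_at is None:
            failed_at = b["minute"]          # pipeline just went red
        elif b["ok"] and failed_at is not None:
            recoveries.append(b["minute"] - failed_at)  # back to green
            failed_at = None
    mttr = sum(recoveries) / len(recoveries) if recoveries else 0.0
    return {"success_rate": successes / total, "mttr_minutes": mttr}

history = [
    {"minute": 0, "ok": True},
    {"minute": 30, "ok": False},
    {"minute": 55, "ok": True},   # recovered 25 minutes after the break
    {"minute": 90, "ok": True},
]
health = pipeline_health(history)
```

Emitting these two numbers per branch per day is usually enough to spot a pipeline drifting past its targets before developers start complaining.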

10.2 Build Time Tracking

# Report build metrics to Datadog
- name: Report build metrics
  if: always()
  run: |
    END_TIME=$(date +%s)
    DURATION=$((END_TIME - START_TIME))
    curl -X POST "https://api.datadoghq.com/api/v1/series" \
      -H "DD-API-KEY: $DD_API_KEY" \
      -d "{
        \"series\": [{
          \"metric\": \"ci.build.duration\",
          \"points\": [[$END_TIME, $DURATION]],
          \"tags\": [
            \"repo:myapp\",
            \"branch:$GITHUB_REF_NAME\",
            \"status:$JOB_STATUS\"
          ]
        }]
      }"

10.3 Failure Analysis Automation

# Automated build failure classification script
import re
from enum import Enum

class FailureCategory(Enum):
    FLAKY_TEST = "flaky_test"
    DEPENDENCY = "dependency"
    COMPILATION = "compilation"
    INFRASTRUCTURE = "infrastructure"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"

def categorize_failure(log: str) -> FailureCategory:
    patterns = {
        FailureCategory.FLAKY_TEST: [
            r"retry.*failed",
            r"intermittent",
            r"flaky"
        ],
        FailureCategory.DEPENDENCY: [
            r"npm ERR!.*404",
            r"Could not resolve dependencies",
            r"ECONNRESET"
        ],
        FailureCategory.COMPILATION: [
            r"error TS\d+",
            r"SyntaxError",
            r"TypeError"
        ],
        FailureCategory.INFRASTRUCTURE: [
            r"runner.*offline",
            r"disk space",
            r"out of memory"
        ],
        FailureCategory.TIMEOUT: [
            r"timed out",
            r"deadline exceeded"
        ]
    }

    for category, regexes in patterns.items():
        for pattern in regexes:
            if re.search(pattern, log, re.IGNORECASE):
                return category

    return FailureCategory.UNKNOWN

11. Full-Stack CI/CD Pipeline Example

# .github/workflows/production.yml
name: Production Pipeline

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read
  packages: write

jobs:
  # Phase 1: Code quality
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check

  # Phase 2: Security scanning
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Semgrep
        uses: returntocorp/semgrep-action@v1
      - name: Run Trivy (filesystem)
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'HIGH,CRITICAL'

  # Phase 3: Tests
  test:
    needs: quality
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: testpass
        ports:
          - 5432:5432
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost:5432/testdb

  # Phase 4: Build and push
  build:
    needs: [test, security]
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/myorg/myapp
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # Phase 5: Update deployment manifests (GitOps)
  update-manifest:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: myorg/gitops-config
          token: ${{ secrets.GITOPS_TOKEN }}
      - name: Update image tag
        run: |
          cd services/api
          kustomize edit set image myapp=${{ needs.build.outputs.image-tag }}
      - name: Commit and push
        run: |
          git config user.name "CI Bot"
          git config user.email "ci@example.com"
          git add .
          git commit -m "chore: update api-service image"
          git push

12. Interview Questions

Basic Concepts

Q1. Explain the difference between CI and CD.

CI (Continuous Integration) is the practice of frequently integrating code changes into the main branch. Each integration is validated through automated builds and tests.

CD has two meanings:

  • Continuous Delivery: Code is always in a deployable state. Production deployment requires manual approval.
  • Continuous Deployment: Every change is automatically deployed to production with no manual intervention.

Key difference: CI focuses on "integration" while CD focuses on "delivery/deployment." CD is impossible without CI, but you can do CI without CD.

Q2. Explain the four DORA metrics.
  1. Deployment Frequency: How often you deploy to production
  2. Lead Time for Changes: Time from commit to production deployment
  3. Change Failure Rate: Percentage of deployments that fail or require rollback
  4. MTTR (Mean Time to Recovery): Time from incident to recovery

Elite teams: Multiple deploys per day, under 1 hour lead time, under 5% failure rate, under 1 hour recovery.

Q3. Explain the core principles of GitOps.
  1. Declarative: System state is defined declaratively
  2. Versioned: Git serves as the single source of truth
  3. Automated: Approved changes are automatically applied to the system
  4. Self-Healing: Actual state is automatically reconciled with declared state

Benefits: Audit trail, easy rollback, PR-based change management, reproducible environments.

Q4. Compare Blue-Green and Canary deployments.

Blue-Green: Two identical environments (Blue/Green). Deploy new version to Green, then switch all traffic at once. Rollback by switching back to Blue.

  • Pros: Instant rollback, simple implementation
  • Cons: 2x resource cost, complex database synchronization

Canary: Deploy new version to a small percentage (1-10%) first. Analyze metrics then gradually increase.

  • Pros: Minimal risk, verification with real traffic
  • Cons: Complex implementation, monitoring required

Q5. What is Shift Left?

A strategy that moves testing and security to earlier stages (left side) of the development lifecycle.

Examples:

  • Pre-commit hooks for code lint, format, and secret scanning
  • SAST, SCA, and unit tests at the PR stage
  • Container image scanning during build
  • IDE plugins for real-time feedback during development

Impact: The earlier defects are found, the cost of fixing decreases exponentially (10-100x less than production).

Advanced Questions

Q6. How do you manage flaky tests?
  1. Detection: Identify tests that produce different results on the same code
  2. Isolation: Separate flaky tests into a dedicated suite with continue-on-error
  3. Retry: Auto-retry with jest retryTimes, pytest-rerunfailures
  4. Tracking: Dashboard for frequency and pattern analysis
  5. Root cause resolution: Fix timing issues, shared state, external dependencies
  6. Policy: Disable or delete if not fixed within a set period
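
Detection (step 1) can be as simple as counting pass/fail flips across repeated runs of the same commit. A minimal sketch with an assumed history format (test name mapped to an ordered list of pass/fail results):

```python
def find_flaky(history, threshold=3):
    """Flag a test as flaky once it has flipped between pass and fail
    at least `threshold` times on the same code. A test that always
    fails is broken, not flaky, and is deliberately not flagged."""
    flaky = []
    for name, results in history.items():
        flips = sum(1 for a, b in zip(results, results[1:]) if a != b)
        if flips >= threshold:
            flaky.append(name)
    return flaky

runs = {
    "checkout.spec": [True, False, True, False, True],   # flips 4 times
    "login.spec": [True, True, True, True, True],        # stable
    "search.spec": [False, False, False, False, False],  # consistently failing
}
flaky_tests = find_flaky(runs)
```

Feeding the flagged names into the isolation job shown earlier (step 2) closes the loop between detection and quarantine.
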
Q7. How do you optimize Docker image builds?
  1. Multi-stage builds: Exclude build tools from the final image
  2. Layer caching optimization: COPY frequently-changed files last
  3. Lightweight base images: Use alpine, distroless
  4. .dockerignore: Exclude unnecessary files
  5. Use BuildKit: Parallel builds, cache mounts
  6. Dependency separation: Copy package.json first for npm ci caching
  7. Kaniko: Build without Docker daemon (improved CI/CD security)

Q8. Describe secret management best practices.
  1. Never commit to Git: Pre-commit scanning with gitleaks, detect-secrets
  2. OIDC-based auth: Use temporary tokens instead of long-lived secrets
  3. Secret managers: AWS Secrets Manager, HashiCorp Vault, Doppler
  4. Least privilege: Grant only minimum necessary permissions
  5. Secret rotation: Automate regular secret renewal
  6. Audit logs: Track secret access
  7. Environment separation: Separate dev/staging/prod secrets

Q9. How do you safely perform database migration rollbacks?

Expand-Contract Pattern:

Phase 1 (Expand):

  • Add new columns/tables
  • Modify app code to be compatible with both old and new schemas
  • Start writing to new schema

Phase 2 (Migrate):

  • Migrate existing data to new schema (background)
  • Switch app to use new schema only

Phase 3 (Contract):

  • Remove old columns/tables (separate deployment)
  • Only this phase is non-reversible

Key: Each phase must be independently rollback-safe.

Q10. How do you secure a CI/CD pipeline?
  1. Supply Chain Security: Generate SBOMs, sign images (cosign), comply with SLSA
  2. Secret management: OIDC, Vault, minimize environment variables
  3. SAST/DAST/SCA: Integrate Semgrep, Trivy, Dependabot
  4. Container security: Non-root users, distroless base, image scanning
  5. Policy compliance: Enforce deployment policies with OPA/Kyverno
  6. Access control: Least privilege, branch protection rules
  7. Auditing: Track all deployments, maintain change history

Q11. Compare the pros and cons of GitHub Actions and Jenkins.

GitHub Actions:

  • Pros: Native GitHub integration, SaaS (no maintenance), Marketplace ecosystem, simple YAML config
  • Cons: GitHub lock-in, customization limits, complex workflow management

Jenkins:

  • Pros: Full flexibility, 1800+ plugins, self-hosted control, Groovy scripting
  • Cons: High maintenance cost, complex setup, security patch management, scaling challenges

Criteria: Small/GitHub-centric projects favor Actions; complex enterprise/multi-SCM environments favor Jenkins.

Q12. Explain Progressive Delivery with Argo Rollouts.

Progressive Delivery deploys new versions incrementally while verifying safety through automated analysis.

Argo Rollouts workflow:

  1. Allocate 10% traffic to canary
  2. Verify success rate and latency via AnalysisTemplate (5 min)
  3. Increase to 30% on pass, analyze again
  4. Gradually scale to 60%, then 100%
  5. Auto-rollback on analysis failure

Key components:

  • Rollout: Defines deployment strategy
  • AnalysisTemplate: Defines verification conditions (Prometheus, Datadog, etc.)
  • TrafficRouting: Integrates with Istio, Nginx, ALB

Q13. How do you optimize pipeline performance?
  1. Parallelism: Run independent jobs concurrently
  2. Test splitting: Shard tests across multiple runners
  3. Caching: Cache dependencies, Docker layers, build artifacts
  4. Selective execution: Run only jobs relevant to changed files
  5. Incremental builds: Build only changed parts
  6. Resource optimization: Tune runner size and concurrency limits
  7. Shorten feedback loop: Fast checks first, slow checks later
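
Selective execution (point 4) maps changed paths to the jobs they affect. A sketch with a hypothetical `JOB_RULES` mapping; real platforms express the same idea declaratively (e.g. `paths:` filters in GitHub Actions):

```python
import fnmatch

# Hypothetical mapping from path patterns to the CI jobs they affect.
JOB_RULES = {
    "docs/**": {"docs-build"},
    "src/**": {"lint", "test", "build"},
    "Dockerfile": {"build"},
}

def jobs_to_run(changed_files):
    """Select only the jobs whose path rules match the changed files,
    so a docs-only change skips the expensive test and build jobs."""
    selected = set()
    for path in changed_files:
        for pattern, jobs in JOB_RULES.items():
            if fnmatch.fnmatch(path, pattern):
                selected |= jobs
    return selected

docs_only = jobs_to_run(["docs/intro.md"])
code_change = jobs_to_run(["src/api/user.ts", "Dockerfile"])
```

The payoff is largest in monorepos, where most changes touch only a small slice of the path space.
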
Q14. What are the pros and cons of feature flag-based deployments?

Pros:

  • Decouple deployment from release: deploy code but activate features later
  • Fast rollback: simply disable the flag with no code rollback
  • Progressive rollout: gradually expand user percentage
  • A/B testing: experiment per feature

Cons:

  • Technical debt: old flags need cleanup
  • Complexity: flag combinations increase test permutations
  • Code readability: increased conditional logic
  • Consistency: different user experiences make bug reproduction harder

Q15. Describe CI/CD strategies for monorepos.
  1. Impact analysis: Build/test only packages affected by changes
  2. Tool usage: Turborepo, Nx, Bazel for dependency-graph-based builds
  3. Caching: Remote cache (Turborepo Remote Cache) for shared build results
  4. Selective deployment: Deploy only changed services
  5. Parallelism: Build/test independent packages concurrently

# Turborepo example
turbo run build --filter=...[HEAD~1]
# Builds only packages changed since HEAD and their dependents
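
Under the hood, that `--filter` selection is a reverse-dependency-graph walk. A minimal sketch with a hypothetical package graph (this illustrates the idea, not Turborepo's actual implementation):

```python
# Hypothetical reverse-dependency graph: package -> packages that depend on it.
DEPENDENTS = {
    "ui-kit": {"web-app", "admin-app"},
    "api-client": {"web-app"},
    "web-app": set(),
    "admin-app": set(),
}

def affected(changed):
    """Walk the reverse-dependency graph so a change to a shared
    package also rebuilds everything that depends on it."""
    result = set(changed)
    stack = list(changed)
    while stack:
        pkg = stack.pop()
        for dep in DEPENDENTS.get(pkg, set()):
            if dep not in result:
                result.add(dep)
                stack.append(dep)
    return result

# A change to the shared ui-kit pulls in both apps that consume it.
to_build = affected({"ui-kit"})
```

A leaf-package change, by contrast, rebuilds only itself, which is where monorepo CI time savings come from.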

13. Quiz

Q1. What is the deployment frequency of an Elite team according to DORA metrics?

Answer: Multiple times per day (on demand)

Elite teams deploy multiple times per day while maintaining a change failure rate below 5% and recovery time under 1 hour.

Q2. Which phase in the Expand-Contract pattern is non-reversible?

Answer: The Contract phase (dropping old columns/tables)

Expand (addition) and Migrate (data migration) are reversible, but Contract (deletion) is irreversible since data is removed. Therefore, the Contract phase is performed separately after sufficient stabilization.

Q3. What is the "single source of truth" in GitOps?

Answer: The Git repository

In GitOps, the Git repository is the sole source defining the desired state of the system. The actual cluster state must always match the state declared in Git, and tools like ArgoCD automatically detect and synchronize any drift.

Q4. What triggers an automatic rollback in canary deployment?

Answer: Metric criteria defined in the AnalysisTemplate (success rate, latency, etc.)

Argo Rollouts queries metrics from Prometheus, Datadog, etc. via the AnalysisTemplate. If the success rate falls below the threshold (e.g., 95%) or P99 latency exceeds the limit, an automatic rollback is triggered.

Q5. What is the purpose of an SBOM (Software Bill of Materials)?

Answer: To provide a list of all components (libraries, dependencies) included in software, strengthening supply chain security

An SBOM is an "ingredient list" for software. It helps quickly identify impact scope when vulnerabilities are discovered, verify license compliance, and respond to supply chain attacks (e.g., Log4Shell). It can be automatically generated using tools like Syft and Trivy.

