Advanced CI/CD Pipeline Guide — GitHub Actions, ArgoCD, Tekton, and Security Pipelines

Introduction

Have you ever moved past understanding CI/CD as simply "code gets deployed automatically when pushed" and considered what a production-grade pipeline truly requires? In real production environments, dozens of stages must be orchestrated seamlessly: security scanning, static analysis, container image signing, multi-cluster deployments, and automated rollbacks.

This guide covers everything from CI/CD maturity models to advanced GitHub Actions patterns, ArgoCD GitOps, Tekton cloud-native pipelines, DevSecOps security pipelines, and advanced deployment strategies.


1. CI/CD Maturity Model

A five-level maturity model lets you objectively assess your organization's CI/CD capabilities. Each level builds upon the previous one.

Level 1 - Manual

Developers perform builds and deployments manually: they build locally and deploy to servers via FTP or SCP. "It works on my machine" is a daily occurrence, and deployment cycles are often monthly.

Characteristics: Documented procedures may exist, but almost no automation. Human errors during deployment are frequent, and rollbacks involve manually redeploying previous versions.

Level 2 - Basic Automation

A CI server is introduced, and automatic builds and tests run on code pushes. Tools like Jenkins, GitHub Actions, or GitLab CI are used, but the pipeline is a simple linear build-test-deploy structure.

# Level 2: Basic pipeline example
name: Basic CI/CD
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm test
      - run: npm run build
      - run: ./deploy.sh

Level 3 - Standardized

Pipelines are standardized across the organization with reusable templates. Test coverage gates, code quality checks, and security scans are included by default. Environment-specific (dev, staging, prod) deployment pipelines are separated.
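As a minimal sketch of what this looks like in practice (the shared template repository and the coverage threshold below are hypothetical), a Level 3 repository often just delegates to an organization-wide reusable pipeline:

```yaml
# Level 3: every repository calls one standardized, reusable pipeline
# ("my-org/pipeline-templates" and its inputs are hypothetical examples)
name: Standardized CI
on: [push]
jobs:
  ci:
    uses: my-org/pipeline-templates/.github/workflows/ci.yml@v1
    with:
      coverage-threshold: 80  # quality gate: fail the run below 80% coverage
    secrets: inherit
```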

Level 4 - Advanced Automation

GitOps-based declarative deployments, canary/blue-green deployment strategies, automated security pipelines (SAST, DAST, container scanning), and SBOM generation with artifact signing are integrated into the pipeline. Deployment cycles shrink to sub-daily.

Level 5 - Self-Healing

Metric-based automatic rollbacks, SLO-violation auto-recovery, and ML-based anomaly detection are in place, and production feedback flows automatically back into the pipeline. Deployments are fully automated, so developers never think about deploying.

| Level | Deploy Cycle | Rollback | Security | Testing |
|-------|--------------|----------|----------|---------|
| L1 Manual | Monthly | Manual redeploy | None | Manual |
| L2 Basic | Weekly | Manual trigger | None | Auto unit |
| L3 Standard | Daily | One-click | Basic scans | Unit + Integration |
| L4 Advanced | Hourly | Auto canary | Full pipeline | Unit + Integration + E2E |
| L5 Healing | Continuous | SLO-based auto | Continuous validation | Includes chaos |

2. Advanced GitHub Actions Patterns

Advanced patterns that go beyond GitHub Actions basics.

2-1. Reusable Workflows

To share identical pipeline logic across multiple repositories, use reusable workflows. The calling side references external workflows with the uses keyword.

# .github/workflows/reusable-build.yml (shared repository)
name: Reusable Build
on:
  workflow_call:
    inputs:
      node-version:
        required: false
        type: string
        default: '20'
      registry-url:
        required: true
        type: string
    secrets:
      npm-token:
        required: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          registry-url: ${{ inputs.registry-url }}
      - run: npm ci
        env:
          NODE_AUTH_TOKEN: ${{ secrets.npm-token }}
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: dist/
# Calling workflow
name: App CI
on: [push]
jobs:
  call-build:
    uses: my-org/shared-workflows/.github/workflows/reusable-build.yml@v2
    with:
      node-version: '20'
      registry-url: 'https://npm.pkg.github.com'
    secrets:
      npm-token: ${{ secrets.NPM_TOKEN }}

2-2. Composite Actions

Composite actions bundle multiple steps into a single action for reuse. Unlike reusable workflows, they are inserted as steps within a single job.

# .github/actions/setup-and-test/action.yml
name: 'Setup and Test'
description: 'Set up Node.js environment and run tests'
inputs:
  node-version:
    description: 'Node.js version'
    required: false
    default: '20'
runs:
  using: 'composite'
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}
        cache: 'npm'
    - run: npm ci
      shell: bash
    - run: npm test -- --coverage
      shell: bash
    - uses: actions/upload-artifact@v4
      with:
        name: coverage
        path: coverage/

2-3. Matrix Build Strategy

An advanced matrix configuration that tests multiple environment combinations in parallel while excluding unnecessary ones.

jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        node: [18, 20, 22]
        exclude:
          - os: macos-latest
            node: 18
        include:
          - os: ubuntu-latest
            node: 22
            experimental: true
    runs-on: ${{ matrix.os }}
    continue-on-error: ${{ matrix.experimental || false }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test

2-4. Cloud Authentication with OIDC

Instead of long-lived secrets (Access Keys), use OIDC (OpenID Connect) tokens to securely authenticate with cloud providers. AWS, GCP, and Azure all support this.

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-role
          aws-region: us-east-1
      - run: aws s3 sync ./dist s3://my-bucket
      - run: aws cloudfront create-invalidation --distribution-id E1234 --paths "/*"

The key advantage of OIDC is eliminating secret management. GitHub issues a short-lived token that AWS STS validates to issue temporary credentials. The token is only valid during workflow execution, making leaked tokens difficult to exploit.


3. GitOps and ArgoCD

3-1. GitOps Principles

GitOps is an operational model that uses Git as the single source of truth for managing the declarative state of infrastructure and applications. Four core principles define it.

Declarative configuration: Describe the desired system state declaratively. Define "what" you want, not "how" to achieve it.

Git as the source of truth: All changes go through Git, and Git history serves as the audit log.

Automatic application: When the declarative state in Git changes, it is automatically applied to the system.

Continuous reconciliation: An agent continuously compares actual state to desired state and corrects any drift.
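The reconciliation principle can be sketched as a simple compare-and-correct loop. This minimal Python illustration (real controllers such as ArgoCD watch Kubernetes resources, not dicts) shows how create, update, and prune decisions fall out of a desired-vs-actual diff:

```python
# Minimal sketch of the GitOps reconciliation loop (illustrative only)
def reconcile(desired: dict, actual: dict) -> list:
    """Return corrective actions that bring actual state to the desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actions.append(f"update {name}")  # drift detected -> self-heal
    for name in actual:
        if name not in desired:
            actions.append(f"prune {name}")   # not in Git -> remove
    return actions

# Example: replicas were changed by hand, and an orphan resource exists
desired = {"deploy/my-app": {"replicas": 3}, "svc/my-app": {"port": 80}}
actual = {"deploy/my-app": {"replicas": 5}, "svc/old": {"port": 80}}
print(reconcile(desired, actual))
# -> ['update deploy/my-app', 'create svc/my-app', 'prune svc/old']
```

An agent simply runs this diff continuously; Git changes and manual cluster drift are corrected by the same mechanism.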

3-2. ArgoCD Architecture

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes.

# ArgoCD Application resource
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests.git
    targetRevision: main
    path: apps/my-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

When syncPolicy.automated is configured, Git changes are automatically reflected in the cluster. selfHeal: true means that if someone manually modifies the cluster state, it automatically reverts to the Git state.

3-3. App of Apps Pattern

In large environments with dozens to hundreds of Applications to manage, the App of Apps pattern uses a single root Application that manages all the others.

# root-app.yaml - Root that manages other Applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/argocd-apps.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
# apps/frontend.yaml - Child Application managed by root
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests.git
    path: apps/frontend/overlays/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: frontend

3-4. Multi-Cluster Deployment with ApplicationSet

ApplicationSet is an ArgoCD feature that automatically generates multiple Applications from a single template, based on cluster lists, Git directories, or PR events.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-set
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: 'my-app-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/k8s-manifests.git
        targetRevision: main
        path: 'apps/my-app/overlays/{{metadata.labels.region}}'
      destination:
        server: '{{server}}'
        namespace: my-app

4. Tekton Pipelines

4-1. What is Tekton

Tekton is a Kubernetes-native CI/CD framework. Unlike GitHub Actions or Jenkins, all pipeline components are defined as Kubernetes Custom Resources (CRDs). Each task runs as a separate Pod, providing complete isolation and scalability.

Core components:

  • Task: An execution unit composed of one or more Steps. Maps to a single Pod.
  • Pipeline: A DAG (Directed Acyclic Graph) workflow connecting multiple Tasks.
  • TaskRun / PipelineRun: Execution instances of a Task or Pipeline. Trackable as Kubernetes resources.
  • Workspace: Volumes for sharing data between Tasks.

4-2. Task Definition

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-and-push
spec:
  params:
    - name: image-url
      type: string
    - name: image-tag
      type: string
      default: latest
  workspaces:
    - name: source
  steps:
    - name: build
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --dockerfile=Dockerfile
        - --context=$(workspaces.source.path)
        - --destination=$(params.image-url):$(params.image-tag)
        - --cache=true
        - --cache-ttl=24h

4-3. Pipeline Definition

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  params:
    - name: repo-url
      type: string
    - name: image-url
      type: string
  workspaces:
    - name: shared-workspace
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.repo-url)

    - name: run-tests
      taskRef:
        name: npm-test
      runAfter:
        - fetch-source
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: build-image
      taskRef:
        name: build-and-push
      runAfter:
        - run-tests
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: image-url
          value: $(params.image-url)

    - name: security-scan
      taskRef:
        name: trivy-scan
      runAfter:
        - build-image
      params:
        - name: image-url
          value: $(params.image-url)

4-4. Executing with PipelineRun

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: ci-pipeline-run-
spec:
  pipelineRef:
    name: ci-pipeline
  params:
    - name: repo-url
      value: https://github.com/my-org/my-app.git
    - name: image-url
      value: ghcr.io/my-org/my-app
  workspaces:
    - name: shared-workspace
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi

5. Security Pipeline (DevSecOps)

5-1. Security Pipeline Components

A production-grade security pipeline automatically performs the following stages.

| Stage | Tools | Purpose |
|-------|-------|---------|
| SAST (Static Analysis) | SonarQube, Semgrep, CodeQL | Detect source code vulnerabilities |
| SCA (Dependency Analysis) | Snyk, Dependabot, OWASP DC | Detect open-source dependency vulnerabilities |
| Secret Scanning | GitLeaks, TruffleHog | Detect secrets embedded in code |
| Container Scanning | Trivy, Grype | Detect container image vulnerabilities |
| DAST (Dynamic Analysis) | OWASP ZAP, Nuclei | Detect running application vulnerabilities |
| SBOM Generation | Syft, CycloneDX | Generate software component inventory |
| Artifact Signing | Cosign, Notation | Ensure build artifact integrity |

5-2. SAST - SonarQube Integration

# SonarQube analysis in GitHub Actions
name: Security Pipeline
on: [push, pull_request]

jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: SonarSource/sonarqube-scan-action@v3
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
        with:
          args: >
            -Dsonar.projectKey=my-project
            -Dsonar.sources=src/
            -Dsonar.tests=tests/
            -Dsonar.coverage.exclusions=**/*.test.ts
      - uses: SonarSource/sonarqube-quality-gate-action@v1
        timeout-minutes: 5
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}

5-3. Container Image Scanning - Trivy

  container-scan:
    runs-on: ubuntu-latest
    needs: [build]
    steps:
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/my-org/my-app:latest'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'

Trivy scans both OS packages and language-specific dependencies. Setting exit-code: '1' causes the pipeline to fail when CRITICAL or HIGH vulnerabilities are found.

5-4. SBOM Generation and Cosign Signing

An SBOM (Software Bill of Materials) is a complete inventory of every component in a piece of software. Following US Executive Order 14028, it has become an essential element of supply chain security.

  sbom-and-sign:
    runs-on: ubuntu-latest
    needs: [container-scan]
    permissions:
      id-token: write
      packages: write
    steps:
      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/my-org/my-app:latest
          format: spdx-json
          output-file: sbom.spdx.json

      - name: Install Cosign
        uses: sigstore/cosign-installer@v3

      - name: Sign Container Image
        run: |
          cosign sign --yes \
            ghcr.io/my-org/my-app:latest

      - name: Attach SBOM to Image
        run: |
          cosign attach sbom \
            --sbom sbom.spdx.json \
            ghcr.io/my-org/my-app:latest

Cosign is part of the Sigstore project and supports keyless signing. It uses OIDC tokens to sign images without requiring separate signing key management.
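On the consuming side, deployments can verify the keyless signature before rollout. A hedged sketch (the image reference and identity pattern are placeholders):

```yaml
# Verify the keyless signature before deploying (image/identity are examples)
- name: Verify Image Signature
  run: |
    cosign verify \
      --certificate-identity-regexp 'https://github.com/my-org/.*' \
      --certificate-oidc-issuer https://token.actions.githubusercontent.com \
      ghcr.io/my-org/my-app:latest
```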

5-5. Secret Scanning - GitLeaks

  secret-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

6. Test Automation Strategy

6-1. The Test Pyramid

In production CI/CD, the types and proportions of tests matter.

Unit Tests (70%): Should be the fastest and most numerous. Test individual functions, methods, and components in isolation.

Integration Tests (20%): Verify that multiple modules work together correctly. Test interactions with external dependencies like databases, APIs, and message queues.

E2E Tests (10%): Validate user scenarios from start to finish. Slowest and most brittle, so only test core flows.

6-2. Parallel Testing and Test Splitting

Strategies for reducing execution time of large test suites.

jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run Tests (Shard ${{ matrix.shard }}/4)
        run: |
          npx jest --shard=${{ matrix.shard }}/4 \
            --ci --coverage --forceExit
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/

  merge-coverage:
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: coverage-*
          merge-multiple: true
      - name: Merge Coverage Reports
        run: npx istanbul-merge --out merged-coverage.json coverage-*/coverage-final.json

6-3. Playwright E2E Testing

  e2e:
    runs-on: ubuntu-latest
    needs: [deploy-staging]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Install Playwright Browsers
        run: npx playwright install --with-deps
      - name: Run E2E Tests
        run: npx playwright test --reporter=html
        env:
          BASE_URL: https://staging.my-app.com
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: playwright-report
          path: playwright-report/

7. Advanced Deployment Strategies

7-1. Canary Deployments with Argo Rollouts

Argo Rollouts is a controller that implements advanced deployment strategies in Kubernetes. It provides a Rollout resource that replaces the standard Deployment.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: my-app-vsvc
              routes:
                - primary
      steps:
        - setWeight: 5
        - pause:
            duration: 5m
        - setWeight: 20
        - pause:
            duration: 5m
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 80
        - pause:
            duration: 5m
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: my-app-canary
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0
          ports:
            - containerPort: 8080

7-2. AnalysisTemplate - Metric-Based Automated Decisions

During canary deployment, query Prometheus metrics to automatically decide whether to promote or abort (rollback).

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))

If three consecutive analyses show a success rate below 95%, an automatic rollback is triggered. This is the core of Level 5 self-healing pipelines.

7-3. Blue-Green Deployment

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app-bluegreen
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
      scaleDownDelaySeconds: 300
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/my-org/my-app:v2.0.0

In blue-green deployment, after the new version (green) is fully ready, all traffic switches at once. prePromotionAnalysis runs smoke tests before the switch, and scaleDownDelaySeconds keeps the old version for a period to enable quick rollback.

7-4. Traffic Mirroring (Shadow Traffic)

Traffic mirroring replicates real production traffic to the new version so it can be tested under actual load; the mirrored responses are discarded and never delivered to users.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app.example.com
  http:
    - route:
        - destination:
            host: my-app-stable
            port:
              number: 80
      mirror:
        host: my-app-canary
        port:
          number: 80
      mirrorPercentage:
        value: 100.0

8. Multi-Environment Management

8-1. Environment Structure Design

Production pipelines operate a minimum of three environments.

  • dev: Feature branch deployments for developers. Instability is acceptable.
  • staging: Main branch deployments. Must have identical configuration to production.
  • production: The environment serving real user traffic.
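In GitHub Actions, these tiers map naturally onto environments, which scope secrets per environment and can require manual approval before a deploy proceeds. A minimal sketch (names and URL are illustrative):

```yaml
# Deploy job gated by a GitHub "production" environment; protection rules
# (required reviewers, wait timers) are configured in repository settings
jobs:
  deploy-production:
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://my-app.example.com
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh production  # hypothetical deploy script
```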

8-2. Kustomize Overlays

Kustomize is a tool for customizing Kubernetes manifests per environment. It applies environment-specific overlays on top of a base configuration.

k8s/
  base/
    deployment.yaml
    service.yaml
    kustomization.yaml
  overlays/
    dev/
      kustomization.yaml
      replica-patch.yaml
    staging/
      kustomization.yaml
      replica-patch.yaml
    production/
      kustomization.yaml
      replica-patch.yaml
      hpa.yaml
# k8s/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - hpa.yaml
patches:
  - path: replica-patch.yaml
namePrefix: prod-
commonLabels:
  env: production
# k8s/overlays/production/replica-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: my-app
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

8-3. Helm Values Per Environment

# values-dev.yaml
replicaCount: 1
image:
  tag: latest
resources:
  requests:
    cpu: 100m
    memory: 128Mi
ingress:
  host: dev.my-app.internal
autoscaling:
  enabled: false
# values-production.yaml
replicaCount: 5
image:
  tag: v2.0.0
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
ingress:
  host: my-app.example.com
  tls: true
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPU: 70
# Deploy per environment
helm upgrade --install my-app ./chart \
  -f values-production.yaml \
  --namespace production \
  --wait --timeout 5m

9. Monitoring and Rollback

9-1. Post-Deployment Monitoring Checklist

Core metrics to verify immediately after deployment.

Immediate check (0-5 minutes):

  • Pod status: are all Pods in Running/Ready state?
  • Error logs: have exceptions surged in the new version?
  • Health checks: are readiness/liveness probes passing?

Short-term check (5-30 minutes):

  • Response time: are p50, p95, p99 latencies similar to the previous version?
  • Error rate: is the 5xx rate within threshold?
  • Throughput: is request throughput in the expected range?

Medium-term check (30 minutes to hours):

  • Memory usage: any signs of memory leaks?
  • CPU usage: is CPU utilization stable?
  • Business metrics: are orders, conversion rates, etc. normal?

9-2. Prometheus-Based Automatic Rollback

# PrometheusRule - Automatic rollback trigger
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-rollback-rules
spec:
  groups:
    - name: deployment-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
            > 0.05
          for: 2m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "Error rate above 5 percent for 2 minutes"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
            ) > 2
          for: 3m
          labels:
            severity: critical
            action: rollback
          annotations:
            summary: "p99 latency above 2 seconds for 3 minutes"

9-3. SLO-Based Deployment Gates

Use Service Level Objectives (SLOs) as the basis for deployment decisions. When the error budget is exhausted, halt deployments.

# SLO definition example
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-app-slo
spec:
  service: my-app
  labels:
    team: platform
  slos:
    - name: requests-availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total[5m]))
      alerting:
        name: MyAppAvailability
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning

Here is how error budget calculation works. With a 99.9% SLO, roughly 43 minutes of downtime are allowed per month. If 30 minutes have already been consumed, the remaining 13 minutes of error budget mean risky deployments should not proceed.
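The arithmetic above can be sketched in a few lines (the 30-day month is an assumption):

```python
# Error budget for an availability SLO (sketch; assumes a 30-day month)
def error_budget_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Total allowed downtime over the period, in minutes."""
    return (1 - slo) * period_minutes

budget = error_budget_minutes(0.999)  # 43.2 minutes per 30-day month
consumed = 30.0                       # downtime already spent this month
remaining = budget - consumed         # 13.2 minutes left: deploy cautiously
print(f"budget={budget:.1f}m consumed={consumed:.1f}m remaining={remaining:.1f}m")
```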


10. Production Pipeline Architecture in Practice

10-1. Overall Architecture

A complete production-grade CI/CD pipeline follows this structure.

Developer Code Push
    |
    v
[CI Stage - GitHub Actions]
    |-- Code checkout
    |-- Dependency install (with caching)
    |-- Lint + format check
    |-- Unit tests (parallel 4 shards)
    |-- Integration tests
    |-- SAST (SonarQube)
    |-- Secret scanning (GitLeaks)
    |-- Dependency vulnerability scan (Snyk)
    |-- Container build (Kaniko)
    |-- Container scan (Trivy)
    |-- SBOM generation (Syft)
    |-- Image signing (Cosign)
    |-- Image registry push
    |
    v
[CD Stage - ArgoCD / GitOps]
    |-- Manifest repo auto-update
    |-- ArgoCD sync
    |-- dev auto-deploy
    |-- staging auto-deploy
    |-- E2E tests (Playwright)
    |-- DAST (OWASP ZAP)
    |
    v
[Production Deploy - Argo Rollouts]
    |-- Canary 5% deploy
    |-- Metric analysis (AnalysisTemplate)
    |-- Canary increase to 20%
    |-- Re-analysis
    |-- Canary increase to 50%
    |-- Final analysis
    |-- 100% promotion or automatic rollback
    |
    v
[Monitoring - Prometheus / Grafana]
    |-- SLO dashboard
    |-- Error budget tracking
    |-- Automatic rollback alerts

10-2. Pipeline Optimization Tips

Caching strategy: Cache dependency installation, Docker layers, and test results to reduce pipeline time by over 50%.

- uses: actions/cache@v4
  with:
    path: |
      ~/.npm
      node_modules
    key: deps-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      deps-

Conditional execution: Run only relevant tasks based on changed files to prevent unnecessary builds.

- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      backend:
        - 'src/api/**'
        - 'src/models/**'
      frontend:
        - 'src/components/**'
        - 'src/pages/**'
      infra:
        - 'terraform/**'
        - 'k8s/**'

# Subsequent steps key off the filter outputs
- name: Build backend
  if: steps.changes.outputs.backend == 'true'
  run: npm run build:backend  # hypothetical build script

Parallelization: Run independent tasks in parallel. Linting, testing, and security scans do not depend on each other and can run concurrently.
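Concretely, jobs with no `needs` relationship between them run concurrently by default; a sketch (scripts are illustrative):

```yaml
# lint, test, and security run in parallel; only deploy waits for all three
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
  deploy:
    needs: [lint, test, security]
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh  # hypothetical deploy script
```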

10-3. Pipeline Metrics

The pipeline itself must also be measured.

| Metric | Description | Target |
|--------|-------------|--------|
| Lead Time | Time from commit to production deploy | Under 1 hour |
| Deploy Frequency | Production deploys per day | 10+ per day |
| Change Failure Rate | Percentage of deploys requiring rollback | Under 5% |
| MTTR | Time to recover from incidents | Under 30 minutes |
| Pipeline Execution Time | Total CI time | Under 15 minutes |
| Test Coverage | Code coverage percentage | Over 80% |

The first four correspond to the DORA metrics: Lead Time, Deploy Frequency, Change Failure Rate, and MTTR. Elite-performing teams excel across all four.


Conclusion

A CI/CD pipeline is not just an automation tool but critical infrastructure that determines software quality and development velocity. Here is a summary of everything covered.

  1. Assess your current level against the maturity model. Do not aim for Level 5 all at once; build capabilities incrementally.
  2. Design reusable pipelines. Use GitHub Actions reusable workflows and Composite Actions to standardize pipelines across your organization.
  3. Adopt GitOps. Implementing declarative deployments with tools like ArgoCD gives you audit trails, automatic recovery, and consistency.
  4. Embed security in the pipeline. DevSecOps is not a separate stage but security woven into every stage of the pipeline.
  5. Make metric-based decisions. Use SLOs and error budgets to quantitatively manage deployment risk.
  6. Track DORA metrics. Continuously improve the performance of the pipeline itself.

Production-grade CI/CD is not built overnight. But by understanding each component and introducing them incrementally, your team's deployment capabilities will improve dramatically.
