CI/CD Best Practices 2025: Pipeline Design, Automation, and Security for Teams
- Author: Youngju Kim (@fjvbn20031)
- Introduction
- 1. CI/CD in 2025
- 2. CI/CD Platform Comparison
- 3. Pipeline Design Principles
- 4. Testing in CI
- 5. Docker Build Optimization
- 6. GitOps with ArgoCD
- 7. CI/CD Security
- 8. Deployment Strategies Compared
- 9. Rollback Strategies
- 10. Pipeline Health Monitoring
- 11. Full-Stack CI/CD Pipeline Example
- 12. Interview Questions
- 13. Quiz
- 14. References
Introduction
In 2025, CI/CD is no longer optional; it is essential. According to Google's DORA (DevOps Research and Assessment) report, elite teams deploy multiple times per day while keeping their change failure rate below 5%. In contrast, low-performing teams may go months between deployments, with failure rates approaching 46%.
The core of this gap lies in pipeline design. It is not just about adopting CI/CD tools, but about a comprehensive strategy that includes test automation, security integration, progressive delivery, and observability.
This article covers everything you need to design and operate CI/CD pipelines: platform comparisons, pipeline design principles, testing strategies, Docker build optimization, GitOps, security, deployment strategies, rollback approaches, and monitoring.
1. CI/CD in 2025
1.1 Team Performance Through DORA Metrics
DORA Metrics are four key indicators that measure software delivery performance.
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | Multiple/day | Weekly~Monthly | Monthly~6 months | 6+ months |
| Lead Time (commit to deploy) | Under 1 hour | 1 day~1 week | 1 week~1 month | 1~6 months |
| Change Failure Rate | 0~5% | 5~10% | 10~15% | 46~60% |
| MTTR (Recovery Time) | Under 1 hour | Under 1 day | 1 day~1 week | 6+ months |
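As a rough illustration, the table above can be encoded as a small classifier. This is a sketch: the function name and the exact numeric cutoffs are ours, simplified from the table, not official DORA definitions.

```python
# Rough DORA tier classifier based on the table above.
# Thresholds are illustrative simplifications, not official DORA cutoffs.
def dora_tier(deploys_per_day: float, lead_time_hours: float,
              change_failure_rate: float, mttr_hours: float) -> str:
    if (deploys_per_day >= 1 and lead_time_hours <= 1
            and change_failure_rate <= 0.05 and mttr_hours <= 1):
        return "Elite"
    if (lead_time_hours <= 24 * 7 and change_failure_rate <= 0.10
            and mttr_hours <= 24):
        return "High"
    if (lead_time_hours <= 24 * 30 and change_failure_rate <= 0.15
            and mttr_hours <= 24 * 7):
        return "Medium"
    return "Low"

print(dora_tier(3, 0.5, 0.03, 0.5))             # an Elite-profile team
print(dora_tier(0.01, 24 * 60, 0.5, 24 * 200))  # a Low-profile team
```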
1.2 Shift Left Strategy
Shift Left is the practice of moving testing and security to earlier stages in the development lifecycle.
Traditional approach:
  Code → Build → Test → Security → Deploy → Monitor
                                       ↑ Problems found here

Shift Left:
  Code + Test + Security → Build → Deploy → Monitor
        ↑ Problems found here (10x cost reduction)
Key principles:
- Pre-commit validation: Lint, format, and secret scanning via pre-commit hooks
- PR-level testing: Unit tests + integration tests + SAST run automatically
- Build-time security: Container image scanning, dependency vulnerability checks
- Pre-deploy verification: Smoke tests, canary analysis
1.3 Key Trends in 2025
- Platform Engineering: Standardizing CI/CD through developer self-service platforms
- AI-Powered CI/CD: Test failure prediction, auto-rollback decisions, flaky test detection
- eBPF-Based Observability: A new paradigm for pipeline performance monitoring
- Supply Chain Security: SBOM, SLSA, and Sigstore-based software supply chain security
2. CI/CD Platform Comparison
2.1 Major Platforms at a Glance
| Feature | GitHub Actions | Jenkins | GitLab CI | CircleCI |
|---|---|---|---|---|
| Hosting | SaaS/Self-hosted | Self-hosted | SaaS/Self-hosted | SaaS |
| Config | YAML | Groovy/YAML | YAML | YAML |
| Ecosystem | Marketplace 15,000+ | Plugins 1,800+ | Built-in integrations | Orbs 3,000+ |
| Container Support | Native | Plugin | Native | Native |
| Self-hosted Runner | Supported | Default | Supported | Supported |
| Pricing | 2,000 min free/mo | Free (OSS) | 400 min free/mo | 6,000 credits free/mo |
| Learning Curve | Low | High | Medium | Low |
| Caching | 10GB/repo | Plugin | Native | Native |
2.2 GitHub Actions
# .github/workflows/ci.yml
name: CI Pipeline
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4
  build:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: myapp:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
2.3 Jenkins Pipeline
// Jenkinsfile (Declarative Pipeline)
pipeline {
    agent any
    environment {
        DOCKER_REGISTRY = 'registry.example.com'
        IMAGE_NAME = 'myapp'
    }
    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        stage('Lint & Test') {
            parallel {
                stage('Lint') {
                    steps {
                        sh 'npm run lint'
                    }
                }
                stage('Unit Test') {
                    steps {
                        sh 'npm test -- --coverage'
                    }
                    post {
                        always {
                            junit 'reports/junit.xml'
                        }
                    }
                }
            }
        }
        stage('Build & Push') {
            steps {
                script {
                    def image = docker.build("${DOCKER_REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}")
                    docker.withRegistry("https://${DOCKER_REGISTRY}", 'registry-credentials') {
                        image.push()
                        image.push('latest')
                    }
                }
            }
        }
    }
    post {
        failure {
            slackSend(
                channel: '#ci-alerts',
                color: 'danger',
                message: "Build FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
            )
        }
    }
}
2.4 GitLab CI
# .gitlab-ci.yml
stages:
  - lint
  - test
  - build
  - deploy

variables:
  DOCKER_HOST: tcp://docker:2376

lint:
  stage: lint
  image: node:20-alpine
  cache:
    key: npm-cache
    paths:
      - node_modules/
  script:
    - npm ci
    - npm run lint

test:
  stage: test
  image: node:20-alpine
  parallel: 4
  script:
    - npm ci
    - npm test -- --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL
  coverage: '/Statements\s*:\s*(\d+\.?\d*)%/'
  artifacts:
    reports:
      junit: reports/junit.xml

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker build -t myapp:latest .
    - docker push myapp:latest
  only:
    - main
3. Pipeline Design Principles
3.1 Fast Feedback Loops
The time developers wait for results after opening a PR directly impacts productivity.
Target times:
├── Lint + format check: under 30s
├── Unit tests: under 2min
├── Integration tests: under 5min
├── Build: under 3min
└── Full pipeline: under 10min
Reality (before optimization): 30min+
Reality (after optimization): 8-10min
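One way to keep these targets honest is to check measured stage times against the budgets and flag violations in the pipeline itself. A minimal sketch, using the stage names and limits listed above (the function and field names are ours):

```python
# Compare measured CI stage durations (in seconds) against the
# target budgets listed above; "total" is derived from the stages.
BUDGETS = {"lint": 30, "unit": 120, "integration": 300, "build": 180, "total": 600}

def over_budget(measured: dict) -> list:
    """Return the names of stages (or 'total') that exceeded their budget."""
    measured = dict(measured)
    measured["total"] = sum(measured.values())
    return [stage for stage, limit in BUDGETS.items()
            if measured.get(stage, 0) > limit]

print(over_budget({"lint": 25, "unit": 90, "integration": 200, "build": 150}))  # []
print(over_budget({"lint": 45, "unit": 90, "integration": 500, "build": 150}))
```

A CI job could run this and `exit 1` when the returned list is non-empty.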
3.2 Parallelism
# Parallel pipeline example
jobs:
  # Phase 1: Lint/security run independently in parallel
  lint:
    runs-on: ubuntu-latest
    # ...
  security-scan:
    runs-on: ubuntu-latest
    # ...
  # Phase 2: Tests split into shards
  test:
    needs: [lint]
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    # ...
  # Phase 3: Build after tests pass
  build:
    needs: [test, security-scan]
    # ...
3.3 Caching Strategy
# npm caching (GitHub Actions)
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-

# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

# Gradle caching
- uses: actions/cache@v4
  with:
    path: |
      ~/.gradle/caches
      ~/.gradle/wrapper
    key: gradle-${{ hashFiles('**/*.gradle*') }}
3.4 Idempotency
Pipelines should always produce the same result for the same input.
# Bad: Timestamp-based tag (different result on re-run)
# IMAGE_TAG: my-app:build-20250323-142000
# Good: Commit SHA-based tag (always the same)
# IMAGE_TAG: my-app:abc1234
# Good: Semantic version (deterministic)
# IMAGE_TAG: my-app:v1.2.3
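A deterministic tag can be derived purely from the commit, so re-running the pipeline for the same commit always produces the same artifact name. A sketch (the 7-character prefix mirrors git's default short-SHA abbreviation):

```python
# Derive a reproducible image tag from the commit SHA:
# the same commit always yields the same tag, unlike timestamp-based tags.
def image_tag(app: str, commit_sha: str) -> str:
    short_sha = commit_sha[:7]  # git's default abbreviation length
    return f"{app}:{short_sha}"

sha = "abc1234def5678900000000000000000000000ab"
print(image_tag("my-app", sha))  # my-app:abc1234
# Idempotent: a re-run with the same input gives the same result.
assert image_tag("my-app", sha) == image_tag("my-app", sha)
```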
4. Testing in CI
4.1 Test Pyramid
              /\
             /  \
            / E2E \         Slow but high confidence
           /(5-10%)\
          /----------\
         /Integration \     Medium speed, medium confidence
        /  (15-25%)    \
       /----------------\
      /   Unit Tests     \  Fast and numerous
     /     (65-80%)       \
    /______________________\
4.2 Test Splitting
# Jest test sharding
test:
  strategy:
    matrix:
      shard: [1, 2, 3, 4]
  steps:
    - run: npx jest --shard=${{ matrix.shard }}/4

# Cypress parallel execution
e2e:
  strategy:
    matrix:
      container: [1, 2, 3]
  steps:
    - uses: cypress-io/github-action@v6
      with:
        record: true
        parallel: true
        group: 'e2e-tests'
4.3 Flaky Test Management
Flaky tests are those that sometimes pass and sometimes fail on the same code.
// Flaky test detection and isolation strategy
// jest.setup.js (auto-retry requires the default jest-circus runner;
// retryTimes is a runtime API, not a jest.config.js option)
jest.retryTimes(2); // Auto-retry failed tests up to 2 times

// jest.config.js
// Note: jest-flaky-reporter is a third-party reporter; the exact
// package and options depend on your tooling.
module.exports = {
  reporters: [
    'default',
    ['jest-flaky-reporter', {
      outputFile: 'flaky-tests.json',
      threshold: 3 // Report if flaky 3+ times
    }]
  ]
};
# Isolate flaky tests in CI
test-stable:
  runs-on: ubuntu-latest
  steps:
    - run: npx jest --testPathIgnorePatterns="flaky"
test-flaky:
  runs-on: ubuntu-latest
  continue-on-error: true # Pipeline continues on failure
  steps:
    # Retries come from jest.retryTimes in the test setup;
    # Jest has no --retries CLI flag.
    - run: npx jest --testPathPattern="flaky"
4.4 Coverage Gates
# Set coverage thresholds
test:
  steps:
    - run: npx jest --coverage --coverageReporters=json-summary
    - name: Check coverage threshold
      run: |
        COVERAGE=$(jq '.total.statements.pct' coverage/coverage-summary.json)
        if (( $(echo "$COVERAGE < 80" | bc -l) )); then
          echo "Coverage $COVERAGE% is below 80% threshold"
          exit 1
        fi
5. Docker Build Optimization
5.1 Multi-Stage Builds
# Stage 1: Install production dependencies
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Stage 2: Build (needs devDependencies, so install fully here)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: Production image
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
# Security: non-root user
RUN addgroup --system --gid 1001 nodejs && \
    adduser --system --uid 1001 nextjs
COPY --from=builder /app/.next ./.next
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER nextjs
EXPOSE 3000
CMD ["npm", "start"]
5.2 Layer Caching Optimization
# Bad: Source changes trigger npm ci re-run
COPY . .
RUN npm ci
RUN npm run build
# Good: Copy dependency files first
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
5.3 BuildKit and Buildx
# Using BuildKit in GitHub Actions
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
    platforms: linux/amd64,linux/arm64
5.4 Kaniko (Daemonless Build)
# Build images with Kaniko in Kubernetes
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - "--dockerfile=Dockerfile"
        - "--context=git://github.com/myorg/myapp"
        - "--destination=registry.example.com/myapp:latest"
        - "--cache=true"
        - "--cache-repo=registry.example.com/myapp/cache"
5.5 Image Size Optimization
Image size comparison:
├── node:20 → 1.1GB
├── node:20-slim → 220MB
├── node:20-alpine → 140MB
├── distroless/nodejs → 120MB
└── Multi-stage optimized → 80-100MB
6. GitOps with ArgoCD
6.1 GitOps Principles
GitOps uses a Git repository as the Single Source of Truth for system operations.
GitOps workflow:
1. Developer pushes changes to Git
2. CI builds and tests the image
3. CI updates image tag in deployment manifests
4. ArgoCD compares Git vs cluster state
5. Auto-sync on drift (or manual approval)
6. Cluster matches Git state
┌────────┐   Push    ┌────────┐   Detect   ┌────────┐
│  Dev   │ ────────> │  Git   │ <────────> │ ArgoCD │
└────────┘           └────────┘            └───┬────┘
                                               │ Sync
                                          ┌────▼────┐
                                          │   K8s   │
                                          │ Cluster │
                                          └─────────┘
6.2 ArgoCD App of Apps Pattern
# apps/root-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops-config
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

# apps/api-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops-config
    path: services/api
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
6.3 Argo Rollouts (Progressive Delivery)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api-vsvc
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 30
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 60
        - pause:
            duration: 5m
        - setWeight: 100

# AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"2.*",app="api-service",version="canary"}[5m]))
            /
            sum(rate(http_requests_total{app="api-service",version="canary"}[5m]))
7. CI/CD Security
7.1 Security Scanning Integration
CI/CD Security Layers:
┌─────────────────────────────────────────────┐
│ Layer 1: Pre-commit │
│ - Secret scanning (gitleaks, detect-secrets)│
│ - Lint (security rules) │
├─────────────────────────────────────────────┤
│ Layer 2: PR / Build │
│ - SAST (Semgrep, CodeQL, SonarQube) │
│ - SCA (Dependabot, Snyk, Trivy) │
│ - License compliance │
├─────────────────────────────────────────────┤
│ Layer 3: Container Build │
│ - Image scanning (Trivy, Grype) │
│ - Base image policy (distroless, alpine) │
│ - SBOM generation (Syft) │
├─────────────────────────────────────────────┤
│ Layer 4: Deploy │
│ - Policy enforcement (OPA/Kyverno) │
│ - Signing (cosign, Sigstore) │
│ - Runtime security (Falco) │
└─────────────────────────────────────────────┘
7.2 Secret Management
# GitHub Actions secret usage
deploy:
  steps:
    - name: Deploy
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        aws ecs update-service --cluster prod --service api

# OIDC-based authentication (secretless - recommended)
permissions:
  id-token: write
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions
      aws-region: ap-northeast-2
7.3 SBOM and Supply Chain Security
# Generate SBOM with Syft
- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: myapp:latest
    format: spdx-json
    output-file: sbom.spdx.json

# Sign image with cosign
- name: Sign image
  run: |
    cosign sign --key env://COSIGN_PRIVATE_KEY myapp:latest

# Verify signature with cosign
- name: Verify signature
  run: |
    cosign verify --key cosign.pub myapp:latest
7.4 Automated Secret Scanning
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks

# Run gitleaks in CI
security:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - uses: gitleaks/gitleaks-action@v2
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
8. Deployment Strategies Compared
8.1 Strategy Comparison Table
| Strategy | Downtime | Risk | Resource Cost | Rollback Speed | Complexity |
|---|---|---|---|---|---|
| Recreate | Yes | High | 1x | Slow | Low |
| Rolling Update | No | Medium | 1x~1.25x | Medium | Low |
| Blue-Green | No | Low | 2x | Instant | Medium |
| Canary | No | Very Low | 1.1x | Instant | High |
| A/B Testing | No | Very Low | 1.1x | Instant | Very High |
8.2 Blue-Green Deployment
# Kubernetes Blue-Green deployment
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
    version: green # Switch from blue to green
  ports:
    - port: 80
      targetPort: 8080
Blue-Green switchover process:
1. Blue(v1) running → Deploy Green(v2)
2. Green health check and smoke test
3. Switch service selector to Green
4. Roll back to Blue immediately on issues
5. Clean up Blue resources after stabilization
Before switch:
[Users] → [LB] → [Blue v1]   ← Active
                 [Green v2]  ← Preparing

After switch:
[Users] → [LB] → [Green v2]  ← Active
                 [Blue v1]   ← Standby
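The selector switch in step 3 is a single patch to the Service. A minimal sketch that only builds the patch payload (the `api-service` name follows the manifest above; actually applying it would go through `kubectl patch` or the Kubernetes API):

```python
import json

# Build the JSON merge patch that flips the Service selector
# from the blue to the green version (step 3 of the switchover).
def selector_patch(version: str) -> str:
    return json.dumps({"spec": {"selector": {"app": "api", "version": version}}})

patch = selector_patch("green")
print(patch)
# Applying it would look like:
#   kubectl patch service api-service -p "$PATCH"
# Rollback (step 4) is the same patch with version "blue".
```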
8.3 Canary Deployment
# Canary deployment with Istio VirtualService
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
spec:
  hosts:
    - api-service
  http:
    - route:
        - destination:
            host: api-service
            subset: stable
          weight: 90
        - destination:
            host: api-service
            subset: canary
          weight: 10
8.4 Feature Flags
// LaunchDarkly or custom feature flag system
import { featureFlags } from './feature-flags';

async function handleRequest(req: Request) {
  const userId = req.user.id;
  if (await featureFlags.isEnabled('new-checkout-flow', userId)) {
    return newCheckoutFlow(req);
  }
  return legacyCheckoutFlow(req);
}
Feature flag-based deployment:
1. Deploy code with new feature wrapped in a flag
2. Enable for internal users only
3. Gradually increase rollout (1% → 5% → 25% → 100%)
4. Disable flag immediately if issues arise
5. Decouple deployment from release
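The gradual rollout in step 3 is typically implemented with deterministic hashing, so a given user stays in the same bucket as the percentage grows. A sketch of that idea (not LaunchDarkly's actual algorithm; function names are ours):

```python
import hashlib

# Deterministically map a user to a 0-99 bucket. A user is "in" the
# rollout once the percentage passes their bucket, and stays in as the
# percentage grows (1% → 5% → 25% → 100%).
def bucket(flag: str, user_id: str) -> int:
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    return bucket(flag, user_id) < rollout_pct

assert is_enabled("new-checkout-flow", "user-42", 100)    # everyone at 100%
assert not is_enabled("new-checkout-flow", "user-42", 0)  # nobody at 0%
```

Because the bucket depends only on the flag and user id, expanding from 5% to 25% keeps every user who was already enabled.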
9. Rollback Strategies
9.1 Automatic Rollback
# Argo Rollouts automatic rollback
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: error-rate-check
      # Auto-rollback on analysis failure
      abortScaleDownDelaySeconds: 30

# Kubernetes Deployment automatic rollback
apiVersion: apps/v1
kind: Deployment
spec:
  progressDeadlineSeconds: 300 # Marked as failed if not done in 5min
  minReadySeconds: 30
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
9.2 Circuit Breaker Pattern
// Deployment circuit breaker
class DeploymentCircuitBreaker {
  private errorThreshold = 0.05; // 5% error rate
  private windowSize = 300; // 5 minute window

  async shouldRollback(metrics: DeploymentMetrics): Promise<boolean> {
    const errorRate = metrics.errors / metrics.totalRequests;
    const p99Latency = metrics.p99LatencyMs;
    return (
      errorRate > this.errorThreshold ||
      p99Latency > 3000 // Over 3 seconds
    );
  }

  async executeRollback(deployment: string) {
    console.log(`Rolling back ${deployment}`);
    await exec(`kubectl rollout undo deployment/${deployment}`);
    await notify({
      channel: '#deployments',
      message: `Auto-rollback triggered for ${deployment}`,
      severity: 'critical'
    });
  }
}
9.3 Database Migration Rollback
Safe DB migration strategy:
1. Expand-Contract Pattern
Phase 1 (Expand): Add new columns, write to both
Phase 2 (Migrate): Migrate existing data
Phase 3 (Contract): Remove old columns
2. Only apply rollback-safe migrations
- Add column (rollback-safe)
- Add index (rollback-safe)
- Drop column (NOT safe → use Expand-Contract)
- Change type (NOT safe → add new column, then switch)
-- Safe migration example
-- Step 1: Add new column (rollback-safe)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;
-- Step 2: Data migration (background)
UPDATE users SET email_verified = TRUE
WHERE verified_at IS NOT NULL;
-- Step 3: Switch app code to use new column
-- Step 4: Drop old column (separate migration)
-- ALTER TABLE users DROP COLUMN verified_at;
10. Pipeline Health Monitoring
10.1 Key Metrics
Pipeline Health Dashboard:
┌─────────────────────────────────────────┐
│ Build Time Trend │
│ ██████████████ 8m (avg) │
│ Target: < 10m │
├─────────────────────────────────────────┤
│ Success Rate │
│ ████████████████████ 94% │
│ Target: > 95% │
├─────────────────────────────────────────┤
│ Flaky Test Rate │
│ ██ 3% │
│ Target: < 2% │
├─────────────────────────────────────────┤
│ Mean Time to Recovery (MTTR) │
│ ████ 25min │
│ Target: < 30min │
└─────────────────────────────────────────┘
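The dashboard numbers above can be computed from raw build records; a minimal sketch with fabricated sample data (the record shape is hypothetical):

```python
from statistics import mean

# Each record: (duration_minutes, succeeded, failure_was_flaky)
builds = [
    (8, True, False), (9, True, False), (12, False, True),
    (7, True, False), (10, False, False),
]

avg_duration = mean(b[0] for b in builds)
success_rate = sum(b[1] for b in builds) / len(builds)
flaky_rate = sum(b[2] for b in builds) / len(builds)

print(f"avg build time: {avg_duration:.1f}m")  # 9.2m
print(f"success rate:   {success_rate:.0%}")   # 60%
print(f"flaky rate:     {flaky_rate:.0%}")     # 20%
```

In practice the records would come from the CI provider's API rather than a literal list.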
10.2 Build Time Tracking
# Report build metrics to Datadog
- name: Report build metrics
  if: always()
  run: |
    END_TIME=$(date +%s)
    DURATION=$((END_TIME - START_TIME))
    curl -X POST "https://api.datadoghq.com/api/v1/series" \
      -H "DD-API-KEY: $DD_API_KEY" \
      -d "{
        \"series\": [{
          \"metric\": \"ci.build.duration\",
          \"points\": [[$END_TIME, $DURATION]],
          \"tags\": [
            \"repo:myapp\",
            \"branch:$GITHUB_REF_NAME\",
            \"status:$JOB_STATUS\"
          ]
        }]
      }"
10.3 Failure Analysis Automation
# Automated build failure classification script
import re
from enum import Enum

class FailureCategory(Enum):
    FLAKY_TEST = "flaky_test"
    DEPENDENCY = "dependency"
    COMPILATION = "compilation"
    INFRASTRUCTURE = "infrastructure"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"

def categorize_failure(log: str) -> FailureCategory:
    patterns = {
        FailureCategory.FLAKY_TEST: [
            r"retry.*failed",
            r"intermittent",
            r"flaky"
        ],
        FailureCategory.DEPENDENCY: [
            r"npm ERR!.*404",
            r"Could not resolve dependencies",
            r"ECONNRESET"
        ],
        FailureCategory.COMPILATION: [
            r"error TS\d+",
            r"SyntaxError",
            r"TypeError"
        ],
        FailureCategory.INFRASTRUCTURE: [
            r"runner.*offline",
            r"disk space",
            r"out of memory"
        ],
        FailureCategory.TIMEOUT: [
            r"timed out",
            r"deadline exceeded"
        ]
    }
    for category, regexes in patterns.items():
        for pattern in regexes:
            if re.search(pattern, log, re.IGNORECASE):
                return category
    return FailureCategory.UNKNOWN
11. Full-Stack CI/CD Pipeline Example
# .github/workflows/production.yml
name: Production Pipeline
on:
  push:
    branches: [main]
permissions:
  id-token: write
  contents: read
  packages: write
jobs:
  # Phase 1: Code quality
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check
  # Phase 2: Security scanning
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Semgrep
        uses: returntocorp/semgrep-action@v1
      - name: Run Trivy (filesystem)
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'HIGH,CRITICAL'
  # Phase 3: Tests
  test:
    needs: quality
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: testpass
        ports:
          - 5432:5432
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost:5432/testdb
  # Phase 4: Build and push
  build:
    needs: [test, security]
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/myorg/myapp
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  # Phase 5: Update deployment manifests (GitOps)
  update-manifest:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: myorg/gitops-config
          token: ${{ secrets.GITOPS_TOKEN }}
      - name: Update image tag
        run: |
          cd services/api
          kustomize edit set image myapp=${{ needs.build.outputs.image-tag }}
      - name: Commit and push
        run: |
          git config user.name "CI Bot"
          git config user.email "ci@example.com"
          git add .
          git commit -m "chore: update api-service image"
          git push
12. Interview Questions
Basic Concepts
Q1. Explain the difference between CI and CD.
CI (Continuous Integration) is the practice of frequently integrating code changes into the main branch. Each integration is validated through automated builds and tests.
CD has two meanings:
- Continuous Delivery: Code is always in a deployable state. Production deployment requires manual approval.
- Continuous Deployment: Every change is automatically deployed to production with no manual intervention.
Key difference: CI focuses on "integration" while CD focuses on "delivery/deployment." CD is impossible without CI, but you can do CI without CD.
Q2. Explain the four DORA metrics.
- Deployment Frequency: How often you deploy to production
- Lead Time for Changes: Time from commit to production deployment
- Change Failure Rate: Percentage of deployments that fail or require rollback
- MTTR (Mean Time to Recovery): Time from incident to recovery
Elite teams: Multiple deploys per day, under 1 hour lead time, under 5% failure rate, under 1 hour recovery.
Q3. Explain the core principles of GitOps.
- Declarative: System state is defined declaratively
- Versioned: Git serves as the single source of truth
- Automated: Approved changes are automatically applied to the system
- Self-Healing: Actual state is automatically reconciled with declared state
Benefits: Audit trail, easy rollback, PR-based change management, reproducible environments.
Q4. Compare Blue-Green and Canary deployments.
Blue-Green: Two identical environments (Blue/Green). Deploy new version to Green, then switch all traffic at once. Rollback by switching back to Blue.
- Pros: Instant rollback, simple implementation
- Cons: 2x resource cost, complex database synchronization
Canary: Deploy new version to a small percentage (1-10%) first. Analyze metrics then gradually increase.
- Pros: Minimal risk, verification with real traffic
- Cons: Complex implementation, monitoring required
Q5. What is Shift Left?
A strategy that moves testing and security to earlier stages (left side) of the development lifecycle.
Examples:
- Pre-commit hooks for code lint, format, and secret scanning
- SAST, SCA, and unit tests at the PR stage
- Container image scanning during build
- IDE plugins for real-time feedback during development
Impact: The earlier defects are found, the cost of fixing decreases exponentially (10-100x less than production).
Advanced Questions
Q6. How do you manage flaky tests?
- Detection: Identify tests that produce different results on the same code
- Isolation: Separate flaky tests into a dedicated suite with continue-on-error
- Retry: Auto-retry with jest retryTimes, pytest-rerunfailures
- Tracking: Dashboard for frequency and pattern analysis
- Root cause resolution: Fix timing issues, shared state, external dependencies
- Policy: Disable or delete if not fixed within a set period
Q7. How do you optimize Docker image builds?
- Multi-stage builds: Exclude build tools from the final image
- Layer caching optimization: COPY frequently-changed files last
- Lightweight base images: Use alpine, distroless
- .dockerignore: Exclude unnecessary files
- Use BuildKit: Parallel builds, cache mounts
- Dependency separation: Copy package.json first for npm ci caching
- Kaniko: Build without Docker daemon (improved CI/CD security)
Q8. Describe secret management best practices.
- Never commit to Git: Pre-commit scanning with gitleaks, detect-secrets
- OIDC-based auth: Use temporary tokens instead of long-lived secrets
- Secret managers: AWS Secrets Manager, HashiCorp Vault, Doppler
- Least privilege: Grant only minimum necessary permissions
- Secret rotation: Automate regular secret renewal
- Audit logs: Track secret access
- Environment separation: Separate dev/staging/prod secrets
Q9. How do you safely perform database migration rollbacks?
Expand-Contract Pattern:
Phase 1 (Expand):
- Add new columns/tables
- Modify app code to be compatible with both old and new schemas
- Start writing to new schema
Phase 2 (Migrate):
- Migrate existing data to new schema (background)
- Switch app to use new schema only
Phase 3 (Contract):
- Remove old columns/tables (separate deployment)
- Only this phase is non-reversible
Key: Each phase must be independently rollback-safe.
Q10. How do you secure a CI/CD pipeline?
- Supply Chain Security: Generate SBOMs, sign images (cosign), comply with SLSA
- Secret management: OIDC, Vault, minimize environment variables
- SAST/DAST/SCA: Integrate Semgrep, Trivy, Dependabot
- Container security: Non-root users, distroless base, image scanning
- Policy compliance: Enforce deployment policies with OPA/Kyverno
- Access control: Least privilege, branch protection rules
- Auditing: Track all deployments, maintain change history
Q11. Compare the pros and cons of GitHub Actions and Jenkins.
GitHub Actions:
- Pros: Native GitHub integration, SaaS (no maintenance), Marketplace ecosystem, simple YAML config
- Cons: GitHub lock-in, customization limits, complex workflow management
Jenkins:
- Pros: Full flexibility, 1800+ plugins, self-hosted control, Groovy scripting
- Cons: High maintenance cost, complex setup, security patch management, scaling challenges
Criteria: Small/GitHub-centric projects favor Actions; complex enterprise/multi-SCM environments favor Jenkins.
Q12. Explain Progressive Delivery with Argo Rollouts.
Progressive Delivery deploys new versions incrementally while verifying safety through automated analysis.
Argo Rollouts workflow:
- Allocate 10% traffic to canary
- Verify success rate and latency via AnalysisTemplate (5 min)
- Increase to 30% on pass, analyze again
- Gradually scale to 60%, then 100%
- Auto-rollback on analysis failure
Key components:
- Rollout: Defines deployment strategy
- AnalysisTemplate: Defines verification conditions (Prometheus, Datadog, etc.)
- TrafficRouting: Integrates with Istio, Nginx, ALB
Q13. How do you optimize pipeline performance?
- Parallelism: Run independent jobs concurrently
- Test splitting: Shard tests across multiple runners
- Caching: Cache dependencies, Docker layers, build artifacts
- Selective execution: Run only jobs relevant to changed files
- Incremental builds: Build only changed parts
- Resource optimization: Tune runner size and concurrency limits
- Shorten feedback loop: Fast checks first, slow checks later
Q14. What are the pros and cons of feature flag-based deployments?
Pros:
- Decouple deployment from release: deploy code but activate features later
- Fast rollback: simply disable the flag with no code rollback
- Progressive rollout: gradually expand user percentage
- A/B testing: experiment per feature
Cons:
- Technical debt: old flags need cleanup
- Complexity: flag combinations increase test permutations
- Code readability: increased conditional logic
- Consistency: different user experiences make bug reproduction harder
Q15. Describe CI/CD strategies for monorepos.
- Impact analysis: Build/test only packages affected by changes
- Tool usage: Turborepo, Nx, Bazel for dependency-graph-based builds
- Caching: Remote cache (Turborepo Remote Cache) for shared build results
- Selective deployment: Deploy only changed services
- Parallelism: Build/test independent packages concurrently
# Turborepo example
turbo run build --filter=...[HEAD~1]
# Builds only packages changed since HEAD and their dependents
13. Quiz
Q1. What is the deployment frequency of an Elite team according to DORA metrics?
Answer: Multiple times per day (On-demand, multiple deploys per day)
Elite teams deploy multiple times per day while maintaining a change failure rate below 5% and recovery time under 1 hour.
Q2. Which phase in the Expand-Contract pattern is non-reversible?
Answer: The Contract phase (dropping old columns/tables)
Expand (addition) and Migrate (data migration) are reversible, but Contract (deletion) is irreversible since data is removed. Therefore, the Contract phase is performed separately after sufficient stabilization.
Q3. What is the "single source of truth" in GitOps?
Answer: The Git repository
In GitOps, the Git repository is the sole source defining the desired state of the system. The actual cluster state must always match the state declared in Git, and tools like ArgoCD automatically detect and synchronize any drift.
Q4. What triggers an automatic rollback in canary deployment?
Answer: Metric criteria defined in the AnalysisTemplate (success rate, latency, etc.)
Argo Rollouts queries metrics from Prometheus, Datadog, etc. via the AnalysisTemplate. If the success rate falls below the threshold (e.g., 95%) or P99 latency exceeds the limit, an automatic rollback is triggered.
Q5. What is the purpose of an SBOM (Software Bill of Materials)?
Answer: To provide a list of all components (libraries, dependencies) included in software, strengthening supply chain security
An SBOM is an "ingredient list" for software. It helps quickly identify impact scope when vulnerabilities are discovered, verify license compliance, and respond to supply chain attacks (e.g., Log4Shell). It can be automatically generated using tools like Syft and Trivy.
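To make the "ingredient list" idea concrete, here is a sketch that extracts package names and versions from an SPDX-JSON SBOM of the kind Syft emits. The two-package document is fabricated sample data, not real Syft output:

```python
import json

# Minimal fabricated SPDX-JSON fragment, shaped like Syft's output.
sbom_json = """{
  "spdxVersion": "SPDX-2.3",
  "packages": [
    {"name": "express", "versionInfo": "4.18.2"},
    {"name": "lodash", "versionInfo": "4.17.21"}
  ]
}"""

sbom = json.loads(sbom_json)
components = {p["name"]: p.get("versionInfo", "?") for p in sbom["packages"]}
print(components)  # {'express': '4.18.2', 'lodash': '4.17.21'}

# e.g. checking exposure to a known-vulnerable dependency version:
assert "lodash" in components
```

This is how a vulnerability response team can answer "do we ship the affected package?" without rebuilding anything.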
14. References
Official Documentation
- GitHub Actions Documentation
- Jenkins Pipeline Documentation
- GitLab CI/CD Documentation
- ArgoCD Official Documentation
- Argo Rollouts Documentation