From Issue to Deploy — Building an Automation Pipeline Where AI Reviews and CI/CD Verifies and Ships (2025)
Prologue — a loop, not a line

In the previous post we covered the parts of AI development automation — which agents exist, how they integrate with GitHub, and how you hand them tickets.

This post assembles those parts into a single pipeline. The goal is this:

A human files one Issue. They step away. A short while later, a verified deployment is live in production. The only thing the human did was press the merge button — the decision.

And more importantly: this is a loop, not a line. Deployment is not the end. When monitoring detects an anomaly, it rolls back automatically, and that incident becomes a new Issue that re-enters the pipeline at its entrance. The system fixes itself.

This post splits that loop into 7 stages, designs the gate for each stage, and implements all of it with GitHub Actions. There is one core principle:

"Automate the work, gate the decisions."

Let AI write, review, and deploy code — but for irreversible decisions (production merge, schema changes, security logic), always put a human or a strong automated check in the way.


Chapter 0 · Bird's-eye view of the whole pipeline

The big picture first. The 7 stages and who owns each one.

 ┌────────────────────────────────────────────────────────────────┐
 │                                                                │
 │  [S1] Issue ──ai-ready label──▶ AI agent ──▶ Draft PR          │
 │         ▲                                         │            │
 │         │                                         ▼            │
 │         │                              [S2] AI code review     │
 │         │                                (generate≠review)     │
 │         │                                         │            │
 │         │                                         ▼            │
 │         │                              [S3] CI verification    │
 │         │                          lint·type·test·build·E2E    │
 │         │                                         │            │
 │         │                          ┌── fail ──────┤            │
 │         │                          ▼              ▼ pass       │
 │         │                   AI self-heal   [S4] Human Gate     │
 │         │                  (read logs, repush)  merge decision │
 │         │                                         │            │
 │         │                                         ▼            │
 │         │                              [S5] CD: Preview        │
 │         │                              → Canary → Production    │
 │         │                                         │            │
 │         │                                         ▼            │
 │         │                              [S6] Auto-verification  │
 │         │                          smoke·E2E·SLO gate          │
 │         │                                         │            │
 │         │                          ┌── fail ──────┤            │
 │         │                          ▼              ▼ pass       │
 │         └──── auto Issue creation ◀ [S7] Monitoring & feedback  │
 │             (auto rollback + incident) post-deploy observe,    │
 │                                        rollback trigger        │
 │                                                                │
 └────────────────────────────────────────────────────────────────┘
Stage | Name | Owner | Output
S1 | Issue → PR | AI agent | Draft PR
S2 | AI code review | AI reviewer (model different from generator) | Inline comments + APPROVE/REQUEST_CHANGES
S3 | CI verification gate | CI system | Green light / red light
S4 | Human Gate | Human | Merge decision
S5 | CD deployment | CD system | Preview → Canary → Production
S6 | Auto-verification | Verification system | Promote or roll back
S7 | Monitoring & feedback | Observability system | Healthy or auto Issue

The rest of this post digs deep into each stage, one at a time.


Chapter 1 · The 5 principles of pipeline design

Nail down the principles before implementing. If these wobble, the pipeline becomes dangerous.

Principle 1 — Every stage is fail-closed

When a stage fails or its judgment is uncertain, it stops. It does not let things through. AI review is confused → REQUEST_CHANGES. CI is flaky → red light. Verification is ambiguous → roll back. When in doubt, do not proceed.

Principle 2 — Every stage must be reversible

A PR can be closed, a merge can be reverted, a deployment can be rolled back. If a stage has no "cancel button," that stage must not be automated.

Principle 3 — Every stage must be observable

Logs record who did what, when, and why. Especially what AI did. Automation you cannot debug is a time bomb.

Principle 4 — Idempotency

If the same Issue is triggered twice, you must not get two PRs. If the same commit is deployed twice, it must be safe. Retries must be safe for automation to be safe.

Principle 5 — The trust ladder

Do not automate everything from the start. Automate low-risk, high-frequency work first, and as success-rate data accumulates, widen the automation scope. Auto-merge for doc fixes → auto-merge for dependency bumps → … one rung at a time.


Chapter 2 · S1 — Issue → PR (the generation stage)

This was covered in the previous post, so just the essentials. The ai-ready label is the trigger; the AI agent creates a branch, implements, and opens a Draft PR.

The real goal of this stage is not "writing code" but "producing a reviewable PR." Every stage downstream depends on PR quality.

The conditions for a PR to be "born reviewable"

  • Created as a Draft — not mergeable until a human promotes it to "Ready for review."
  • Issue link — Closes #123 in the PR body. The Issue closes automatically on merge.
  • Work summary — the agent records "what and why" in the body. The reviewer's starting point.
  • Consistent PR title — like [AI] fix: .... Downstream stages branch on the title.
  • ai-generated label — for tracking.
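
Put together, the agent's final step ends up being roughly this gh CLI call; the title, body text, and issue number below are placeholders for illustration, not values from the pipeline:

# a minimal sketch; title, body, and issue number are placeholders
gh pr create \
  --draft \
  --title "[AI] fix: handle expired session tokens" \
  --label ai-generated \
  --body "Closes #123

What: re-checks token expiry before refresh.
Why: expired tokens currently surface as a 500."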

Idempotency — preventing duplicate PRs

# If a PR already exists for the same issue, don't create a new one
- name: Check existing PR
  id: check
  run: |
    EXISTING=$(gh pr list --search "in:title #${{ github.event.issue.number }}" --json number --jq 'length')
    echo "exists=$EXISTING" >> "$GITHUB_OUTPUT"
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}   # gh needs a token when run inside Actions

- name: Run agent
  if: steps.check.outputs.exists == '0'
  uses: anthropics/claude-code-action@v1
  # ...

Chapter 3 · S2 — AI code review automation (the core stage)

This is the heart of this pipeline. Insert a layer of AI review before human review.

Why AI review before human review

  • Noise filter — catch obvious mistakes (missing error handling, type mismatches, convention violations) before a human sees them.
  • Saves human review time — humans focus on judgment like "is this the right approach." AI handles the touch-ups.
  • 24/7 — even if a PR lands at 3 a.m., it gets reviewed immediately.

Iron rule: generate ≠ review (different model, different agent)

Do not let the agent that created the PR review its own PR. It cannot see the blind spots in its own code. Where possible, review with a different model. If Claude wrote it, have a different tool review it, or at minimum a separate session with a separate prompt.

Generation agent (Claude/Copilot) ──▶ PR
Review agent (different model/tool) ──▶ review comments
Human ─────────────────────────────▶ final decision

Tool landscape

Tool | Form | Characteristics
CodeRabbit | GitHub App | Auto-reviews every PR, configured via .coderabbit.yaml, summary + inline
Greptile | GitHub App | Review based on full-codebase context
GitHub Copilot review | Built-in | Copilot can be designated as a PR reviewer
Claude Code Action | Actions | Review via @claude or a workflow, strongly customizable

What AI review catches well / poorly

Catches well | Catches poorly
Missing null / error handling | "Is this the right architecture"
Convention / naming violations | Whether business requirements are met
Obvious security patterns (SQLi, hardcoded secrets) | Subtle domain-specific bugs
Test coverage gaps | Subtle performance trade-offs
Common bug patterns | The team's implicit context

This is exactly why the human gate (S4) is needed. AI review does not replace humans — it reduces the human's burden.

Implementation — review with Claude Code Action

name: AI Review
on:
  pull_request:
    types: [opened, synchronize, ready_for_review]

jobs:
  ai-review:
    # only when not a draft — don't review work-in-progress PRs
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            Review this PR. Focus on:
            - Correctness: logic bugs, edge cases, missing error handling
            - Security: input validation, secrets, permissions
            - Tests: is coverage sufficient for the change
            - Conventions: does it follow CLAUDE.md

            Leave issues as inline comments on the relevant lines.
            At the end, state APPROVE or REQUEST_CHANGES with a summary.
            Mark trivial style points with nit: so a human can ignore them.

Making AI review a "required check"

If review only leaves comments, it gets ignored. Make it a status check and put it in the branch protection rules — then a merge is blocked outright when it is REQUEST_CHANGES. Configure the review action to report pass/fail through the GitHub Checks API.
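
One way to wire that up, sketched below: append a step to the ai-review job that pushes a commit status named ai-review via the REST API. It assumes the review step has id: review and exposes a verdict output (claude-code-action does not promise such an output, so that mapping is glue you own), and the job needs statuses: write in its permissions.

      - name: Report ai-review as a commit status
        if: always()
        run: |
          # map the (assumed) verdict output to a status that branch protection can require
          STATE=failure
          if [ "${{ steps.review.outputs.verdict }}" = "APPROVE" ]; then STATE=success; fi
          gh api "repos/${{ github.repository }}/statuses/${{ github.event.pull_request.head.sha }}" \
            -f state="$STATE" \
            -f context="ai-review" \
            -f description="AI review verdict"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}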


Chapter 4 · S3 — The CI verification gate

The stage where a machine verifies the code AI wrote. If CI is weak, this whole pipeline is weak.

Layered checks

fast ──────────────────────────────────▶ slow
lint → typecheck → unit test → build → integration → E2E
  │        │           │          │         │          │
seconds  seconds   sec~min      min       min       min~tens of min

Put the fast checks up front so it fails fast. There's no need to go all the way to E2E to learn something will break at lint.

Making CI "readable by AI"

This is the key part, and it's often missed. If CI failure messages are friendly, AI fixes them itself.

  • Error: test failed (bad) → Error: expected status 200, got 401 at auth.test.ts:42 (good)
  • Print the failed command, the file and line, the expected and actual values.
  • The AI agent reads this log and self-heals in the next commit.
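
A cheap way to get there regardless of test runner, sketched below: tee the test output to a file and dump its tail in a dedicated step when the job fails, so the self-heal prompt always finds the expected/actual values in one predictable place.

      - name: Test
        run: |
          set -o pipefail          # keep the step red if the tests fail inside the pipe
          pnpm test 2>&1 | tee test-output.log

      - name: Surface failure context for the agent
        if: failure()
        run: |
          echo "== last 100 lines of test output =="
          tail -n 100 test-output.log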

The self-heal loop

PR push ──▶ CI runs ──▶ fail
            agent reads the CI log
            identify cause → push a fix commit
                     CI re-runs ──▶ pass

Put a ceiling on this loop — if it's still red after 3 attempts, call a human. Infinite self-healing is a cost bomb.

# On CI failure, give the agent the logs and let it attempt a fix —
# but cap the number of attempts with a label
- name: Self-heal on CI failure
  if: failure() && !contains(github.event.pull_request.labels.*.name, 'ai-heal-exhausted')
  uses: anthropics/claude-code-action@v1
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    prompt: |
      CI failed. Read the logs below, fix the cause, and commit.
      Don't guess — only fix what the logs point to.
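
The ceiling itself needs a mechanism. One sketch, assuming the convention that every self-heal commit message carries an [ai-heal] prefix (an assumption of this example, not something the action enforces): count those commits and attach the ai-heal-exhausted label after the third attempt. It needs a checkout with fetch-depth: 0 so the base branch is available for git log.

- name: Cap self-heal attempts
  if: failure()
  run: |
    # count self-heal commits on this branch (assumes an "[ai-heal]" commit-message convention)
    ATTEMPTS=$(git log "origin/${{ github.event.pull_request.base.ref }}..HEAD" --oneline | grep -c '\[ai-heal\]' || true)
    if [ "$ATTEMPTS" -ge 3 ]; then
      gh pr edit "${{ github.event.pull_request.number }}" --add-label ai-heal-exhausted
      gh pr comment "${{ github.event.pull_request.number }}" \
        --body "Self-heal budget exhausted after $ATTEMPTS attempts. A human needs to look at this CI failure."
    fi
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}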

Branch protection — enforce the gate

Settings → Branches → main protection rules:
  ✅ Require status checks: ci/lint, ci/test, ci/build, ai-review
  ✅ Require branches to be up to date before merging
  ✅ Require a pull request before merging
  ✅ Require approvals: 1+ (S4 human gate)
  ✅ Dismiss stale approvals when new commits pushed
  ✅ Do not allow bypassing the above
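
The same rules can live in code instead of the settings UI; a sketch using the branch protection REST API (run once by a repo admin, OWNER/REPO is a placeholder):

# protection.json
{
  "required_status_checks": { "strict": true, "contexts": ["ci/lint", "ci/test", "ci/build", "ai-review"] },
  "enforce_admins": true,
  "required_pull_request_reviews": { "required_approving_review_count": 1, "dismiss_stale_reviews": true },
  "restrictions": null
}

# apply it
gh api -X PUT "repos/OWNER/REPO/branches/main/protection" --input protection.json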

Chapter 5 · S4 — The Human Gate (the gate a human guards)

The core of automation is, paradoxically, clearly defining "where the human stands guard."

What the human must decide

  • The merge decision itself — the responsibility for "this goes into main" is the human's.
  • Architecture direction — "is this approach right" is something AI cannot judge.
  • Whether business requirements are met — did this genuinely satisfy the Issue's intent.
  • Security and data-sensitive changes — areas where the cost of a mistake is high.

Things that make human review fast

For S4 not to become a bottleneck, when a human receives a PR, it should already have:

  • ✅ AI review done (all touch-ups handled)
  • ✅ CI green (machine verification passed)
  • ✅ The PR is small (a reviewable size)
  • ✅ A good description (what and why)
  • ✅ A Preview deployment URL attached (you can actually click through it)

Then human review only needs to do "judgment." It's done in 5 minutes.

Auto-merge — when is it OK

GitHub's auto-merge means "merge automatically once all required checks pass and required approvals are complete." It's risky, but for certain change types it's safe.

Auto-merge OK | Auto-merge forbidden
Doc / comment fixes | Application logic
Dependency patch bumps (when CI passes) | Schema migrations
Lint / format auto-fixes | Security / auth logic
Generated type / SDK updates | Infrastructure / IaC
# If it's a dependency PR and CI passes + AI review APPROVE, enable auto-merge
- name: Enable auto-merge for safe deps
  if: |
    contains(github.event.pull_request.labels.*.name, 'dependencies') &&
    github.event.pull_request.user.login == 'dependabot[bot]'
  run: gh pr merge --auto --squash "${{ github.event.pull_request.number }}"
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Escalation path

When AI (whether generation or review) is uncertain, it tags a human. It automatically leaves a comment like "this change touches auth logic — needs review by a security owner" and attaches the needs-human label.
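
A sketch of that escalation as a workflow step; the path pattern standing in for "auth logic" is an assumption you would adjust to your repo layout:

- name: Escalate security-sensitive changes
  run: |
    PR=${{ github.event.pull_request.number }}
    # hypothetical path filter for auth-related code
    if gh pr diff "$PR" --name-only | grep -qE '(^|/)(auth|security)/'; then
      gh pr edit "$PR" --add-label needs-human
      gh pr comment "$PR" --body "This change touches auth logic. It needs review by a security owner."
    fi
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}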


Chapter 6 · S5 — CD: Preview → Canary → Production

It's merged. Now, deployment. The key is not deploying everything at once.

Preview deployment — per PR

When a PR opens, deploy to an isolated real environment and give it a unique URL. Vercel, Netlify, and Cloudflare provide this out of the box; for your own infrastructure, implement it with a per-PR namespace.

→ The human reviewer in S4 sees a working artifact, not code. The AI reviewer can also run E2E against the Preview URL.
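
On your own infrastructure the glue is small. A sketch of a per-PR preview job; that deploy.sh prints the URL of the environment it just created is an assumption of this example:

  preview:
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - name: Deploy per-PR preview
        id: preview
        run: |
          URL=$(./scripts/deploy.sh "preview-pr-${{ github.event.pull_request.number }}")
          echo "url=$URL" >> "$GITHUB_OUTPUT"
      - name: Post the preview URL on the PR
        run: gh pr comment "${{ github.event.pull_request.number }}" --body "Preview: ${{ steps.preview.outputs.url }}"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}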

After merge — progressive delivery

merge to main
   ▼
Canary deploy (5% of traffic)
   ▼  [S6 auto-verification]
smoke test + SLO observation (5~15 min)
   ├── fail ──▶ auto rollback (remove Canary)
   ▼ pass
Production promotion (100% of traffic)
  • Canary — only a slice of traffic (5~10%) goes to the new version. Even if a problem hits, the blast radius is small.
  • Feature Flag — deploy the code but keep the feature off. Turning it on is a separate decision.
  • Environment promotion — dev → staging → prod. Each step is a gate.

Implementing the gate with GitHub Environments

name: Deploy
on:
  pull_request:
    types: [closed]
    branches: [main]

jobs:
  deploy-canary:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    environment: canary          # protection rules: none (automatic)
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh canary

  promote-production:
    needs: [deploy-canary, verify-canary]
    runs-on: ubuntu-latest
    environment: production      # protection rules: required reviewers = human gate
    steps:
      - uses: actions/checkout@v4   # scripts live in the repo
      - run: ./scripts/deploy.sh production

Put required reviewers on environment: production and human approval is enforced for production promotion — another Human Gate baked right into the workflow.


Chapter 7 · S6 — Auto-verification

Deploying isn't the end. A machine confirms that what was deployed actually works.

Post-deploy verification layers

VerificationWhatWhen
Smoke testA few core paths (login, health check)Immediately after deploy
E2EMajor user scenariosCanary stage
Synthetic monitoringPeriodic pings from outside (Checkly, Datadog)Continuous
SLO observationError rate, latency, availabilityDuring the Canary window
Visual regressionUI pixel diffsPreview/Canary

The verification gate — promote the Canary or roll back

  verify-canary:
    needs: deploy-canary
    runs-on: ubuntu-latest
    outputs:
      verdict: ${{ steps.gate.outputs.verdict }}
    steps:
      - uses: actions/checkout@v4   # scripts live in the repo
      - name: Smoke test
        run: ./scripts/smoke-test.sh https://canary.example.com

      - name: Observe SLO for 10 minutes
        id: gate
        run: |
          sleep 600
          ERROR_RATE=$(curl -s "$METRICS_API/error-rate?env=canary&window=10m")
          P99=$(curl -s "$METRICS_API/latency-p99?env=canary&window=10m")
          # fail if error rate exceeds 1% or p99 exceeds 500ms
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )) || (( $(echo "$P99 > 500" | bc -l) )); then
            echo "verdict=rollback" >> "$GITHUB_OUTPUT"; exit 1
          fi
          echo "verdict=promote" >> "$GITHUB_OUTPUT"

  rollback-canary:
    needs: verify-canary
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # scripts live in the repo
      - run: ./scripts/rollback.sh canary

If verification fails → the Canary is removed automatically. Production traffic was only ever 5% affected, and even that is reclaimed immediately.
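
scripts/smoke-test.sh is referenced but never shown. A minimal sketch; the paths it hits are placeholders for your app's core routes:

#!/usr/bin/env bash
# smoke-test.sh <base-url>: hit a few core paths, fail on server errors or no response
set -euo pipefail
BASE_URL="$1"

for path in /healthz /login /api/me; do
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$BASE_URL$path" || echo 000)
  if [ "$STATUS" -ge 500 ] || [ "$STATUS" -lt 100 ]; then
    echo "smoke FAIL: $path returned $STATUS" >&2
    exit 1
  fi
  echo "smoke OK: $path -> $STATUS"
done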


Chapter 8 · S7 — Monitoring & the feedback loop (closing the loop)

This is where the pipeline goes from a line to a loop. Even after deployment, the system keeps watching.

Auto-rollback triggers

Observation continues even after production promotion. Break the SLO and it rolls back automatically.

  • Error budget burn rate — if it burns down fast, roll back (a worked example follows this list).
  • Latency spike — p99 exceeds the threshold.
  • Core business metric drop — payment success rate, signup conversion rate, etc.
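
A quick worked example of the burn-rate trigger (the SLO numbers are assumptions for illustration): with a 99.9% availability SLO, the error budget is 0.1% of requests. If the observed error rate over the last hour is 0.3%, the burn rate is 0.3 / 0.1 = 3.0, above the 2.0 threshold used in the monitor workflow below, so the rollback-and-file-an-issue step fires.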

The self-healing loop — an incident becomes an Issue again

This is the core of the loop. When an auto-rollback happens, the incident is turned into a new Issue and sent back to the pipeline's entrance (S1).

name: Monitor & Auto-Rollback
on:
  schedule:
    - cron: '*/5 * * * *'      # check SLO every 5 minutes
  workflow_run:
    workflows: ["Deploy"]
    types: [completed]

jobs:
  slo-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # needed for ./scripts/rollback.sh
      - name: Check production SLO
        id: slo
        run: |
          BURN=$(curl -s "$METRICS_API/error-budget-burn?env=prod&window=1h")
          echo "burn=$BURN" >> "$GITHUB_OUTPUT"

      - name: Rollback + file issue if breaching
        if: ${{ steps.slo.outputs.burn > 2.0 }}
        run: |
          ./scripts/rollback.sh production
          gh issue create \
            --title "Auto-rollback: error budget burn rate ${{ steps.slo.outputs.burn }}" \
            --label "ai-ready,incident,priority-high" \
            --body "Auto-rolled back due to a production SLO violation.
            Last deploy: ${{ github.sha }}
            burn rate: ${{ steps.slo.outputs.burn }}
            The agent should analyze the diff of the rolled-back commit, diagnose the cause,
            and submit a fix PR."
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

This Issue has the ai-ready label attached, so → S1 triggers again. AI analyzes the rolled-back commit, creates a fix PR, AI review → CI → human gate → redeploy. The loop is closed.

What a closed loop means

   Issue ──▶ PR ──▶ review ──▶ CI ──▶ merge ──▶ deploy ──▶ verify ──▶ monitor
     ▲                                                                  │
     └──────────── incident becomes a new Issue ◀───────────────────────┘

The system takes the problem it created as its own input and fixes it. The human still guards the merge gate, but is freed from the repetitive labor of detection, diagnosis, kicking off the fix, and redeploying.


Chapter 9 · The full implementation — wiring it together with GitHub Actions

The stages so far, as actual files. 5 workflows connected by events.

Workflow map

File | Trigger | Role
ai-resolve.yml | issues: labeled | S1: Issue → PR
ai-review.yml | pull_request: opened/synchronize | S2: AI review
ci.yml | pull_request: opened/synchronize | S3: verification gate
deploy.yml | pull_request: closed (merged) | S5: Canary → Production
monitor.yml | schedule + workflow_run | S6/S7: verify, rollback, feedback

How they connect via events

GitHub Actions has no central orchestrator. Events are the connecting wires.

issue.labeled('ai-ready')  ──▶ ai-resolve.yml ──▶ (PR created)
pull_request.opened ◀──────────────────────────────────┘
   ├──▶ ai-review.yml   (AI review → report check)
   └──▶ ci.yml          (verification → report check)
   [branch protection: all checks + human approval awaited]
pull_request.closed(merged) ──▶ deploy.yml ──▶ (Canary → verify → Production)
workflow_run('Deploy' completed) ──▶ monitor.yml       │
schedule('*/5')                  ──▶ monitor.yml ◀─────┘
   └──▶ on SLO violation: rollback + issue.create('ai-ready')
                                          └──▶ (back to ai-resolve.yml — loop complete)

ai-resolve.yml — S1

name: AI Resolve
on:
  issues:
    types: [labeled]

jobs:
  resolve:
    if: github.event.label.name == 'ai-ready'
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
      issues: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Skip if PR already exists
        id: dedup
        run: |
          N=$(gh pr list --search "#${{ github.event.issue.number }} in:body" --json number --jq 'length')
          echo "exists=$N" >> "$GITHUB_OUTPUT"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Run agent
        if: steps.dedup.outputs.exists == '0'
        uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            Issue #${{ github.event.issue.number }}: ${{ github.event.issue.title }}

            ${{ github.event.issue.body }}

            Implement this issue. Create a branch, make the changes, get the tests passing,
            and open a Draft PR that includes "Closes #${{ github.event.issue.number }}".
            Follow the CLAUDE.md conventions, and summarize what and why in the PR body.

ci.yml — S3

name: CI
on:
  pull_request:
    types: [opened, synchronize, ready_for_review]

concurrency:
  group: ci-${{ github.event.pull_request.number }}
  cancel-in-progress: true     # cancel the previous run on a new push — saves cost

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4   # install pnpm first so setup-node's pnpm cache works (version read from packageManager)
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint        # fast things first
      - run: pnpm typecheck
      - run: pnpm test
      - run: pnpm build

deploy.yml — S5 (see Chapter 6, essentials only)

name: Deploy
on:
  pull_request:
    types: [closed]
    branches: [main]

jobs:
  canary:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    environment: canary
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh canary

  verify:
    needs: canary
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # scripts live in the repo
      - run: ./scripts/smoke-test.sh https://canary.example.com
      - run: ./scripts/observe-slo.sh canary 600   # observe for 10 min

  production:
    needs: verify
    runs-on: ubuntu-latest
    environment: production       # required reviewers = Human Gate
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh production

  rollback:
    needs: [canary, verify]
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # scripts live in the repo
      - run: ./scripts/rollback.sh canary

Repo setup checklist

  • Secrets: ANTHROPIC_API_KEY, deployment credentials.
  • Environments: canary (automatic), production (required reviewers designated).
  • Branch protection (main): include ci/verify and ai-review in required checks. Required approvals 1+.
  • Labels: ai-ready, ai-generated, needs-human, incident, dependencies.

Chapter 10 · The decision-gate matrix — what to automate and what the human handles

The real design of a pipeline is "how far do you automate." You set different gates per change type.

Change type | AI generation | AI review | CI | Human gate | Auto-merge | Deploy
Docs / comments | optional | ✅ possible | automatic
Dependency patch bump | ✅ required | optional | ✅ conditional | Canary automatic
Bug fix (narrow scope) | ✅ 1 person | Canary → auto-promote
Feature addition | ✅ 1+ person | Canary → human promote
Refactoring | ✅ + coverage | ✅ 1 person | Canary → auto-promote
DB migration | ⚠️ AI assists | ✅ required | human-triggered
Security / auth logic | ⚠️ AI assists | ✅ 2 people | human-triggered
Infrastructure / IaC | ⚠️ AI assists | ✅ plan diff | ✅ required | human-triggered

Legend: ✅ automatic / ⚠️ AI assists only, human leads / ❌ not done

How to climb the trust ladder

Start with every cell conservative — all auto-merge off, all deployments human-triggered. Then raise one cell at a time with data:

  1. Turn on auto-merge for doc PRs for a month → 0 incidents → keep it.
  2. Turn on auto-merge for dependency bumps → watch the CI pass rate and rollback rate → if stable, keep it.
  3. Turn on Canary auto-promotion for bug fixes → watch the escape rate (missed bugs).

Do not widen the automation scope without success-rate data. Trust is accrued, not declared.


Chapter 11 · Safety mechanisms & failure modes

The faster the automation, the faster the accidents. Put a defense on each failure mode.

The "AI approves AI" problem

The most dangerous anti-pattern. If the generation agent reviews and approves its own PR, verification is zero. Defenses:

  • Generation and review are different tools/models.
  • An AI review's APPROVE cannot replace human approval — the "required approval" in branch protection must be a human account.
  • Configure bot account approvals to be excluded from the required approval count.

Prompt injection through the pipeline

Issue bodies, PR comments, code, CI logs — all of it is input to the agent. An attacker can plant commands there.

  • No auto-triggering from external users' Issues/PRs — only the ai-ready label applied by a member triggers.
  • Minimal-privilege agent tokens — no access to production or secrets.
  • Detect workflow file changes — if a PR touches .github/workflows/, it always requires human review (see the CODEOWNERS sketch below).
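
The last rule needs no custom scripting if you lean on CODEOWNERS together with branch protection's "Require review from Code Owners" option; the team handle is a placeholder:

# .github/CODEOWNERS
# any PR touching workflow files requires a review from the platform team
/.github/workflows/ @your-org/platform-team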

Cost runaway

  • Step ceilings, a self-heal attempt ceiling (3 times).
  • Eliminate duplicate runs with concurrency + cancel-in-progress.
  • Limit the number of concurrently running agents.
  • Daily/weekly cost alarms.

Every stage is fail-closed + rollback

Stage | On failure | Rollback means
S1 generation | No PR created, comment on the issue
S2 review | REQUEST_CHANGES, merge blocked
S3 CI | Red light, merge blocked
S4 human | Not merged
S5 deploy | Halt at the Canary stage | rollback.sh canary
S6 verification | Not promoted | Remove the Canary
S7 monitor | Auto rollback + filed as an issue | rollback.sh production

Auditing — who, when, what

Every stage's actions must be recorded as logs. What AI did must be traceable via the commit trailer (Co-Authored-By:), the PR label (ai-generated), and the workflow run history. If you can't answer "why did this get deployed," it's better to turn that automation off.


Chapter 12 · Operations — metrics and gradual trust

Turning the pipeline on isn't the end. You measure and tune.

Metrics to track

Metric | Meaning | Healthy direction
Lead time (Issue→deploy) | Time for one lap | ↓
% auto-merged | Share merged without a human | ↑ (but watch the escape rate)
AI review precision | Share of AI review flags that are actually valid | ↑
Escape rate | Share of bugs that got through the pipeline | ↓ (most important)
Rollback rate | Share of deployments rolled back | low and stable
Self-heal success rate | Share of CI failures AI fixed | ↑
Human review wait time | S4 queue wait | ↓

Applying DORA metrics to this pipeline

The traditional 4 DORA metrics apply directly — Deployment Frequency, Lead Time, Change Failure Rate, MTTR. The goal of an AI pipeline is to raise Throughput without breaking Stability. If the Change Failure Rate rises, shrink the automation scope.

The bottleneck is almost always S4

Even if 10 agents create 10 PRs in 5 minutes, if human review handles 5 a day, throughput is 5 a day. The real ceiling is review capacity, not generation speed. So the focus of tuning is:

  • Strengthen AI review (S2) to reduce the human review burden.
  • Split PRs smaller to make review faster.
  • Increase auto-merge for low-risk changes (while watching the data) to route around S4.

Gradual trust — climb the ladder with data

Look at the metrics every week. If the escape rate is low and stable → climb one rung of the trust ladder (Chapter 10). If the escape rate rises or the rollback rate spikes → step down one rung. The automation scope is not a fixed value but a dial you adjust with data.


Epilogue — the pipeline is the shape of the team

Once you implement this post's pipeline fully, the way the team works changes.

  • Developers type less code and write Issues well and review PRs well.
  • Touch-up reviews go to AI, judgment reviews go to humans.
  • The repetitive labor of detection, diagnosis, kicking off the fix, and redeploying goes to the system.
  • Humans guard the gate of irreversible decisions.

Summed up in three core insights.

  1. It's a loop, not a line. Deployment is not the end. When monitoring catches an incident and turns it back into an Issue, the system takes its own problem as input and fixes it.

  2. "Automate the work, gate the decisions." Let AI write, review, and deploy code — but for irreversible decisions like merge, schema, and security, put a human or a strong automated check in the way. Fail-closed is the default.

  3. Trust is accrued. Don't automate everything from the start; start with low-risk work and climb one rung at a time, watching the metrics (especially the escape rate). The automation scope is a dial you adjust with data.

Paradoxically, the ultimate purpose of all this automation is not to take humans out of the work, but to focus humans on the most important place — judgment and decisions. When the pipeline absorbs the repetitive labor, the team's thinking rises from "how do we build it" to "what and why do we build."

A 14-item checklist

  1. Have you identified all 7 stages and assigned an owner to each?
  2. Is every stage fail-closed?
  3. Does every stage have a rollback/cancel means?
  4. Are the generation agent and review agent separated (different models)?
  5. Is AI review wired in as a required status check?
  6. Are CI failure messages friendly enough for AI to read?
  7. Does the self-heal loop have an attempt ceiling?
  8. Does branch protection enforce required checks + human approval?
  9. Is human approval counted only from human accounts, not bots?
  10. Are Preview, Canary, and Production separated in stages?
  11. Does the production environment have required reviewers?
  12. Does auto-verification (smoke, SLO) gate the Canary promotion?
  13. On an SLO violation, do auto-rollback + auto Issue creation happen (the loop)?
  14. Do you track the escape rate and adjust the automation scope with that data?

10 anti-patterns

  1. The generation agent reviews and approves its own PR.
  2. An AI review APPROVE counted as a human approval.
  3. Auto-merging application logic.
  4. 100% production deploy the moment a merge happens (no Canary).
  5. No post-deploy verification (no smoke / SLO gate).
  6. Auto-rollback exists but the incident doesn't come back as an Issue (the loop isn't closed).
  7. CI failure messages are unfriendly, so AI self-healing is impossible.
  8. No ceiling on the self-heal loop → cost bomb.
  9. External users' Issues auto-trigger the pipeline.
  10. Widening the automation scope without measuring the escape rate.

Next post preview

Candidates for the next post: Building your own AI code reviewer — a custom review bot that learns the team's conventions, Progressive Delivery deep dive — SLO-based auto-promotion with Argo Rollouts and Flagger, Agent orchestration — turning a multi-stage pipeline into a state machine with LangGraph.

"The best pipeline doesn't replace humans. It absorbs all the rest of the repetition so that humans only make decisions."

— From Issue to Deploy, building an automation pipeline, done.
