Skip to content

필사 모드: Automating Database Migrations — Treating Schema as Code with CI/CD and GitOps

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction — The Friday Night ALTER TABLE

Trace many outages back to their root cause and you find a single line: "someone ran SQL against the production database by hand." Application code goes through PR review, passes tests, and deploys automatically, yet the database schema that code depends on is changed by someone SSHing in and running an ALTER TABLE directly. Friday night, a tired engineer types one line of SQL that grabs a lock and freezes a table, and the next morning brings an incident review.

The thesis of this article is simple: treat schema changes exactly like application code. Version them, review them in PRs, validate them in CI, apply them automatically in CD, and roll them back when something goes wrong. This is the GitOps approach of treating migrations as code.

We will work through how migration tools operate, pipeline integration, per-environment approval gates, applying on Kubernetes, drift detection, secret management, rollback automation, and observability, all from a practitioner's point of view.

How Migration Tools Work

Every migration tool tracks "which change has been applied, and up to where." The heart of it is a metadata table.

- Flyway: records the version, checksum, and execution time of each applied migration in the `flyway_schema_history` table.

- Liquibase: records per-changeset in the `DATABASECHANGELOG` table and prevents concurrent runs with `DATABASECHANGELOGLOCK`.

- Atlas: you declare the desired final schema, and it compares it against the current state to generate the diff as SQL automatically.

There are two broad styles.

+---------------------------+-------------------------------------------+

| Imperative (versioned) | Declarative |

+---------------------------+-------------------------------------------+

| "run these steps in order"| "the final state is this" |

| V1__init.sql | schema.hcl / schema.sql |

| V2__add_email.sql | tool computes the diff and emits SQL |

| Flyway, golang-migrate | Atlas, (parts of Liquibase) |

| clear change history | clear intent, easy drift correction |

+---------------------------+-------------------------------------------+

The imperative style gives clear ordering and history, making debugging easy; the declarative style makes the "desired state" the code itself, making drift correction easy. In practice, a hybrid is common: use imperative as the baseline, but detect drift with a declarative tool like Atlas.

Flyway Migration Example

-- V2__add_user_status.sql

ALTER TABLE users

ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'active';

CREATE INDEX CONCURRENTLY idx_users_status ON users (status);

The key here is `CREATE INDEX CONCURRENTLY`. A plain `CREATE INDEX` takes a write lock, blocking INSERT/UPDATE while the index builds. On a large table this can freeze the service for minutes. CONCURRENTLY builds the index without that lock, but with a constraint: it cannot run inside a transaction, so you must configure your migration tool to disable the transaction for that step.

Validating in the PR — Lint and Dry Run

Treating migrations as code means validating them automatically at the PR stage. Machines catch the risks before humans review.

Migration Linting

Linters catch dangerous patterns statically. Notable ones are Atlas's `migrate lint` and squawk (for PostgreSQL). The risks they catch include:

- adding a `NOT NULL` column without a default (triggers a full table rewrite)

- a plain `CREATE INDEX` (write lock)

- changing a column type (table rewrite)

- dropping a column or table (can break older app versions)

- `ALTER TABLE ... ADD COLUMN ... DEFAULT` (full rewrite on older databases)

.github/workflows/migration-lint.yml

name: migration-lint

on:

pull_request:

paths:

- 'migrations/**'

jobs:

lint:

runs-on: ubuntu-latest

services:

postgres:

image: postgres:17

env:

POSTGRES_PASSWORD: postgres

ports:

- 5432:5432

options: >-

--health-cmd pg_isready

--health-interval 10s

--health-timeout 5s

--health-retries 5

steps:

- uses: actions/checkout@v4

with:

fetch-depth: 0

- uses: ariga/setup-atlas@v0

- name: Lint new migrations against main

run: |

atlas migrate lint \

--dir "file://migrations" \

--dev-url "postgres://postgres:postgres@localhost:5432/dev?sslmode=disable" \

--git-base "origin/main"

This workflow inspects only the migrations newly added relative to main. If a dangerous pattern shows up, the PR turns red and is blocked.

Dry Run — Actually Apply It

If linting is static analysis, the dry run is dynamic verification by actually applying. In CI you spin up a throwaway database with a copy of production's schema and run the migration for real.

In CI, dump the production schema into a throwaway Postgres, then migrate

pg_dump --schema-only "$PROD_RO_URL" > schema.sql

psql "$CI_DB_URL" -f schema.sql

flyway -url="$CI_DB_URL" -locations="filesystem:./migrations" migrate

If the migration breaks, takes longer than expected, or causes lock contention, the dry run catches it at the PR stage. It is the last safety net before production.

Detecting Migration Conflicts

When several developers add migrations at once, version numbers collide. If two PRs each create `V5__...sql`, after both merge it is ambiguous which one runs first.

main: V1 V2 V3 V4

|

PR-A ----------+--- V5__add_phone.sql

PR-B ----------+--- V5__add_avatar.sql <- same version! conflict

Three strategies help.

- Timestamp versions: writing `V20260616103000__...` down to the millisecond makes collisions nearly impossible.

- CI conflict check: fail the build if any version number appears more than once.

- Order-independent design: tracking by ID, as Liquibase changesets do, reduces ordering dependencies.

Detect duplicate version numbers (CI step)

dupes=$(ls migrations/ | grep -oE '^V[0-9]+' | sort | uniq -d)

if [ -n "$dupes" ]; then

echo "Duplicate migration versions found: $dupes"

exit 1

fi

Per-Environment Application and Approval Gates

Migrations should pass through environments in stages: automatically to dev, automatically to staging, and to production only after a human approves.

[PR merged]

|

v

+--------+ +-----------+ +--------------+

| dev | ---> | staging | ---> | production |

| auto | | auto | | manual gate |

+--------+ +-----------+ +--------------+

|

(Slack notice + 2 approvers)

In GitHub Actions you build the approval gate with an environment protection rule.

.github/workflows/migrate-deploy.yml

name: migrate-deploy

on:

push:

branches: [main]

paths:

- 'migrations/**'

jobs:

migrate-staging:

runs-on: ubuntu-latest

environment: staging

steps:

- uses: actions/checkout@v4

- uses: ariga/setup-atlas@v0

- name: Apply to staging

env:

DB_URL: ${{ secrets.STAGING_DB_URL }}

run: atlas migrate apply --dir "file://migrations" --url "$DB_URL"

migrate-production:

needs: migrate-staging

runs-on: ubuntu-latest

environment: production # an environment with an approval gate

steps:

- uses: actions/checkout@v4

- uses: ariga/setup-atlas@v0

- name: Apply to production

env:

DB_URL: ${{ secrets.PROD_DB_URL }}

run: atlas migrate apply --dir "file://migrations" --url "$DB_URL"

A job pinned to `environment: production`, with required reviewers configured in GitHub, sits paused until approval. An approver must click the button before it applies to production.

Applying on Kubernetes — Jobs and Init Containers

In a Kubernetes environment, how to run the migration becomes its own problem. Two patterns dominate.

Pattern 1: A Pre-Deploy Job

Before deploying the application, run a Job that does only the migration, and proceed only on success.

apiVersion: batch/v1

kind: Job

metadata:

name: db-migrate

spec:

backoffLimit: 0 # no retry on failure (risk of double-run)

template:

spec:

restartPolicy: Never

containers:

- name: migrate

image: myorg/migrations:1.4.2

command: ["flyway", "migrate"]

env:

- name: FLYWAY_URL

valueFrom:

secretKeyRef:

name: db-credentials

key: url

Wire this Job to run just before deployment with a Helm hook (`helm.sh/hook: pre-install,pre-upgrade`) or an Argo CD PreSync hook.

Pattern 2: Init Container

Here each pod runs the migration as an init container when it starts. It is simple, but if there are many pods they all try to migrate at once. The migration tool's lock (both Flyway and Liquibase have locking mechanisms) prevents corruption, but the Job pattern is still recommended.

Recommended: single run via Job (or hook) -> deploy app

Avoid: every pod's init container at once -> lock contention

Drift Detection

Drift is the state where "the schema the code (migrations) expects" diverges from "the schema actually in the production database." It happens when someone adds a column by hand, or creates an index in a hotfix without recording it in a migration.

Use Atlas to diff production against the migrations directory (scheduled cron)

atlas migrate diff --dir "file://migrations" \

--to "postgres://...prod..." \

--dev-url "postgres://...dev..." \

--format '{{ sql . }}'

A non-empty output means drift exists -> alert

When drift is detected you have two responses: bring production in line with the code (write an additional migration), or bring the code in line with production (absorb the unexpected change into a migration). Either way, leaving drift unattended makes the next migration break unpredictably. Run a scheduled cron check and post to Slack.

Secrets and Credentials

Because migrations handle database credentials, secret management is the heart of security.

- CI/CD secret stores: store DB URLs in GitHub Actions Secrets or GitLab CI variables, separated per environment.

- Dynamic credentials: with HashiCorp Vault's database secrets engine, issuing short-lived credentials at migration time means a leaked credential expires soon anyway.

- Least privilege: grant the migration account only DDL rights and keep it separate from the application account. Some teams keep the production migration account disabled by default and enable it only within the pipeline.

[anti-pattern] hardcoding the password in a migration SQL file

[anti-pattern] all environments sharing one DB account

[recommended] per-environment secrets + least privilege + (ideally) dynamic credentials

Rollback Automation

How to undo a migration that went wrong is the hardest topic. The core principle is: design migrations to be reversible.

Undo Scripts

Flyway Teams and Liquibase let you write an undo/rollback script paired with each migration.

-- V5__add_status.sql (forward)

ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';

-- U5__add_status.sql (undo)

ALTER TABLE users DROP COLUMN status;

But undo is not always safe. Dropping a column erases the data accumulated in the meantime. So the safer approach is the expand-contract pattern.

Expand-Contract Pattern

Break schema changes into small, reversible steps.

1. Expand: add the new column/table (compatible with old app, easy to revert)

2. Migrate: app writes to both new and old structures (dual write)

3. Backfill: fill the new structure with existing data

4. Switch: app reads only from the new structure

5. Contract: drop the old column/table (after it is well settled)

Each step is small and reversible, so if something breaks you revert only that step. Even something that looks like "rename a column" can be done with zero downtime by splitting it into (add new column → dual write → backfill → switch reads → drop old column).

Observability and Alerting

Migrations tend to happen in the dark, so observe them explicitly.

- Execution logs: record which migration was applied when, and how long it took, as structured logs.

- Duration metrics: export migration runtime as a metric and warn when it takes longer than usual.

- Alerts: notify Slack of migration success/failure. Always alert for production applications in particular.

Step that posts migration results to Slack

- name: Notify Slack

if: always()

run: |

status="${{ job.status }}"

curl -X POST "$SLACK_WEBHOOK" \

-H 'Content-Type: application/json' \

-d "{\"text\": \"DB migration [production]: $status\"}"

env:

SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

A Full Pipeline Example — GitHub Actions

Combine all the pieces so far and you get this flow.

Developer: add new SQL to migrations/ -> PR

|

v

[CI on PR] lint (squawk/atlas) + throwaway-DB dry run + duplicate-version check

|

v (merge)

[CD] auto-apply dev -> auto-apply staging -> production approval gate

|

v

Kubernetes: pre-sync Job runs the migration -> app rollout

|

v

Scheduled cron: drift detection -> Slack alert on anomaly

The heart of this flow is that there is no path for a human to touch the production database directly. Every change passes through a PR, is validated automatically, and is tracked.

Testing Migrations — Test Them Like Code

If you treat migrations like code, you should test them like code. There are ways to verify migration correctness at the level of unit tests.

Forward and Backward Round-Trip Test

Apply a migration (forward), then revert it (backward), and check the schema returns to its original state. This round-trip test guarantees the undo script's correctness.

Migration round-trip test (CI)

flyway -url="$CI_DB_URL" migrate # apply up to latest

flyway -url="$CI_DB_URL" undo # revert the last migration

flyway -url="$CI_DB_URL" migrate # apply again

Compare schema dumps to confirm idempotency

pg_dump --schema-only "$CI_DB_URL" > after.sql

diff expected.sql after.sql

Data-Preservation Test

Check that a schema-changing migration does not corrupt existing data. Insert seed data, apply the migration, and verify the data was transformed correctly.

-- Test scenario: seed data before migration

INSERT INTO users (id, name, email) VALUES

(1, 'alice', 'alice@example.com'),

(2, 'bob', 'bob@example.com');

-- After applying the migration, assert the transformation result

-- (e.g., is the status column all filled with 'active'?)

SELECT count(*) FROM users WHERE status IS NULL; -- should be 0

Put such tests in the CI pipeline and you catch migration regressions automatically. Migrations that include data-transformation logic in particular must be tested.

Adopting Automation on a Legacy DB — Baseline

To introduce automation on a database already in production that has never used a migration tool, you need a baseline. Declare the current production schema as "starting point 0," and manage only the changes after that as migrations.

1. Dump the current production schema -> V1__baseline.sql (existing state)

2. Register the baseline with the migration tool (flyway baseline)

- the metadata table records "applied through V1"

3. All later changes start from V2__... as migrations

4. Without touching existing data/schema, gradually expand the automated area

The key to a baseline is "do not try to reproduce the existing state." Trying to recreate a schema that already exists in production via a migration causes conflicts. Instead, accept the current state and manage only future changes on top of it. This lets you transition to a GitOps workflow gradually and without risk.

A DB Shared by Several Services

In a microservices environment, when several services share one DB, who owns the schema becomes ambiguous. The recommendation is "place schema ownership clearly with one service." Keep the migrations directory in that service's repository too, and have other services read only. Long term, it is best to separate schemas (or DBs) per service to make ownership clear.

Common Pitfalls

- Migrations that lock big tables: an `ALTER TABLE` that grabs an ACCESS EXCLUSIVE lock freezes the service. Avoid it with CONCURRENTLY, batched backfills, and expand-contract.

- Ordering of migration and app deploy: drop a column first and then deploy the app, and in between the old app version references a missing column and breaks. Always expand first, contract later.

- Assuming DDL runs in a transaction: some DDL (like CONCURRENTLY) must run outside a transaction. Check your tool's configuration.

- Mistaking rollback for data recovery: the data of a dropped column does not come back from an undo. Backups and PITR (point-in-time recovery) are separately required.

- Changing checksums: editing an already-applied migration file later makes the checksum mismatch and the tool rejects it. Never edit applied migrations.

Checklist

Before deploying, confirm the following.

- [ ] Is the migration version-controlled and PR-reviewed?

- [ ] Did it pass lint and dry run in CI?

- [ ] Are there no duplicate version-number conflicts?

- [ ] Did large-table changes use CONCURRENTLY / batched backfill?

- [ ] Is it designed reversibly with expand-contract?

- [ ] Is there an approval gate on production application?

- [ ] Are secrets separated per environment and least-privilege?

- [ ] Does the drift-detection cron run?

- [ ] Do migration success/failure events go to alerting?

- [ ] Are backups/PITR current and the recovery procedure verified?

Closing

The essence of migration automation is not the tool but the principle: treat schema as code, break changes into small reversible steps, and replace direct human intervention with a pipeline. Flyway, Liquibase, or Atlas — whichever tool you use, the principle is the same. On the day the Friday-night, hand-typed ALTER TABLE disappears, your database finally becomes as trustworthy as your code.

References

- Flyway documentation: https://flywaydb.org/documentation/

- Liquibase documentation: https://docs.liquibase.com/

- Atlas documentation: https://atlasgo.io/getting-started

- golang-migrate: https://github.com/golang-migrate/migrate

- squawk (PostgreSQL migration linter): https://squawkhq.com/

- PostgreSQL ALTER TABLE docs: https://www.postgresql.org/docs/current/sql-altertable.html

- AWS DMS docs: https://docs.aws.amazon.com/dms/

- Argo CD Sync Phases and Hooks: https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/

현재 단락 (1/250)

Trace many outages back to their root cause and you find a single line: "someone ran SQL against the...

작성 글자: 0원문 글자: 15,341작성 단락: 0/250