The phrase “zero-downtime migration” gets thrown around like it is a database feature you enable. It is not. It is an engineering practice, and it is almost always a sequence of boring steps executed in the right order.
Schema changes on a small table are free. Schema changes on a billion-row table are a multi-day operation with rollback plans, checkpoints, and a deploy plan wired into the sequence.
My thesis: every zero-downtime migration follows the same pattern. Expand the schema so old and new code both work. Migrate the data. Contract the schema once the old path is dead. Dual writes and dual reads bridge the gap. Everything else is detail.
The detail is where teams get hurt.
1. Expand, migrate, contract
This is the only pattern that actually works on live systems.
- expand: add the new thing without removing the old
- migrate: backfill and start writing to the new thing
- contract: remove the old thing once every reader and writer is gone

Each phase ships as its own deploy. Not one deploy. Not two. Usually three to five, across days, sometimes weeks.
The core rule is simple: the old schema has to keep working during every intermediate step. If a pod running the old binary wakes up in the middle of a backfill and writes to the old column, the system must still be correct.
That means no destructive changes until the contract phase. No dropping columns. No renaming tables in place. No tightening constraints on a live column without proving the invariant first.
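The no-destructive-changes rule can be enforced mechanically rather than by review vigilance. A minimal sketch of a CI-style lint, assuming migration files are tagged with their phase; the names and the naive regex matching are illustrative, not a real SQL parser:

```typescript
// Reject destructive DDL unless the migration is explicitly tagged as a
// contract-phase migration. Naive pattern matching, good enough for a
// pre-merge check; a real implementation would parse the statement.
type Phase = "expand" | "migrate" | "contract";

const DESTRUCTIVE = [
  /\bDROP\s+COLUMN\b/i,
  /\bDROP\s+TABLE\b/i,
  /\bRENAME\s+(COLUMN|TO)\b/i,
];

function assertSafeForPhase(ddl: string, phase: Phase): void {
  const destructive = DESTRUCTIVE.some((re) => re.test(ddl));
  if (destructive && phase !== "contract") {
    throw new Error(`destructive DDL outside contract phase: ${ddl}`);
  }
}
```

Wired into CI, this turns "we agreed not to drop columns early" from a convention into a failing build.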
2. The shape of a safe schema change
Most large-table migrations fall into a handful of shapes:
- adding a `NOT NULL` column
- renaming a column or table
- changing a column type
- switching a primary key
- splitting one table into many
- merging many tables into one
They all look different on the surface. Underneath, they are all expand/contract with variations on how the dual-write and backfill work.
The reason this matters is that the risk is almost never the ALTER statement. The risk is the deploy ordering and the runtime behavior during transition.
3. Adding a NOT NULL column without locking the world
The naive version is a disaster:
```sql
ALTER TABLE orders ADD COLUMN region TEXT NOT NULL DEFAULT 'IN';
```

On modern Postgres this looks innocent because the default is stored as metadata and does not rewrite the table. It still takes an ACCESS EXCLUSIVE lock, and on a hot table that lock can stall every query behind it while Postgres waits to acquire it. On MySQL/InnoDB the picture is worse: older versions rewrite the whole table.
The safe sequence spreads the work across a handful of distinct deploys:
```sql
-- Step 1: expand. Nullable, no default, instant.
ALTER TABLE orders ADD COLUMN region TEXT;

-- Step 2: start writing. Application code sets region on INSERT/UPDATE.
-- Old rows remain NULL.

-- Step 3: backfill in chunks (see section 7).

-- Step 4: add a CHECK constraint NOT VALID so new writes are enforced
-- without scanning the table.
ALTER TABLE orders
  ADD CONSTRAINT orders_region_not_null
  CHECK (region IS NOT NULL) NOT VALID;

-- Step 5: validate in the background. This takes a SHARE UPDATE EXCLUSIVE
-- lock — readers and writers continue.
ALTER TABLE orders VALIDATE CONSTRAINT orders_region_not_null;

-- Step 6: promote to a real NOT NULL. On Postgres 12+, if a validated
-- CHECK constraint exists, this is O(1).
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;

-- Step 7: contract. Drop the helper CHECK.
ALTER TABLE orders DROP CONSTRAINT orders_region_not_null;
```

The trick is in steps 4 and 5. `NOT VALID` enforces the constraint for new writes but does not scan existing rows. `VALIDATE CONSTRAINT` does the scan under a weaker lock. By the time step 6 runs, Postgres already knows the column has no nulls and skips the full-table scan.
Skipping the NOT VALID / VALIDATE dance is the single most common mistake in this migration. It looks like an optimization. It is actually the whole point.
4. Renaming a column is never a rename
A column rename is two schemas that must coexist.
```sql
-- Expand: add the new column.
ALTER TABLE users ADD COLUMN email_address TEXT;

-- Backfill: copy email -> email_address.

-- Dual-write: application writes both on every change.
-- Or use a trigger if the app cannot be changed in one step:
CREATE OR REPLACE FUNCTION users_sync_email()
RETURNS TRIGGER AS $$
BEGIN
  NEW.email_address := COALESCE(NEW.email_address, NEW.email);
  NEW.email := COALESCE(NEW.email, NEW.email_address);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_email_trg
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION users_sync_email();
```

Order of operations:
1. Add the new column.
2. Install the dual-write trigger, or dual-write in the app.
3. Backfill old values.
4. Switch readers to the new column.
5. Wait long enough that no writer uses the old column.
6. Drop the trigger.
7. Drop the old column.
Step 5 is the step every team rushes. A pod that has been idling for three days and wakes up with the old binary will happily write to the old column. You need either a version gate or enough confidence in rollout completion to move on.
Use the trigger approach when the application is not in your control or deploys are slow. Use app-level dual write when you own the write path and can version it cleanly. Mixing both is how you get infinite recursion loops in the trigger.
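If you take the app-level route, the same COALESCE semantics as the trigger can live in a single write-path helper. A sketch; the `UserPatch` shape is hypothetical, and the helper is applied once, immediately before the row is written:

```typescript
// Mirror the trigger's COALESCE logic at the application layer: whichever
// column the caller set, copy it into the other before the row is written.
// Use this OR the trigger, never both, to avoid double-application.
interface UserPatch {
  email?: string | null;
  email_address?: string | null;
}

function dualWriteEmail(patch: UserPatch): UserPatch {
  return {
    ...patch,
    // Same precedence as the trigger: each column falls back to the other.
    email_address: patch.email_address ?? patch.email ?? null,
    email: patch.email ?? patch.email_address ?? null,
  };
}
```

The point of centralizing it in one helper is that the contract phase later becomes a one-line change: delete the fallback, write only `email_address`.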
5. Changing a primary key
Switching from a sequence-generated integer key (SERIAL or BIGSERIAL) to UUID or ULID is one of the hardest migrations because foreign keys depend on it.
Do not try to do this with an ALTER COLUMN TYPE. You cannot. The plan is:
- Add a new `id_uuid` column, nullable. Set `gen_random_uuid()` as its default in a separate statement, so the add itself is instant.
- Backfill existing rows with stable UUIDs.
- Add the new foreign-key columns on every child table.
- Dual-write both ids in application code.
- Backfill child tables using the parent's `id -> id_uuid` map.
- Add a unique index on `id_uuid`.
- Switch reads to use `id_uuid`.
- Drop the old FKs, drop the old PK, promote `id_uuid` to PK.
- Drop the old `id` columns across the tree.
Two things bite here:
- A `gen_random_uuid()` default set on the column after it exists applies only to new rows. Old rows stay NULL until you explicitly update them; nothing backfills for free. (Adding the column and the volatile default in a single `ADD COLUMN` statement is worse: it forces a full table rewrite.)
- Creating a unique index on a billion-row table needs `CREATE UNIQUE INDEX CONCURRENTLY`. Without `CONCURRENTLY`, you will hold a write lock for the duration of the build and the incident channel will not be empty.
For something this invasive, it is worth asking whether you actually need a PK swap or whether an external id mapping table will do. A mapping table is cheaper and reversible. A PK swap is not.
6. Splitting one table into many
Splitting orders into orders + order_line_items + order_addresses is the same pattern, shifted up one level.
The stages:
- Create the new tables empty.
- Shadow-write: on every insert/update of `orders`, also write to the new tables inside the same transaction. The write is idempotent on `(order_id, line_id)`, so replays do not double-insert.
- Backfill historical rows in chunks. Use an advisory lock per chunk to make the backfill idempotent and re-runnable.
- Add dual-read behind a feature flag. Read from the new tables, fall back to the old tables on mismatch, and emit a metric when the two disagree. Let it run for a week.
- Cut reads over fully.
- Stop shadow writes.
- Drop the legacy columns.
This is where idempotency stops being a nice-to-have. A backfill that is not safely replayable will eat your weekend the first time it crashes at 73% through.
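A minimal model of that replay-safe write: key it on `(order_id, line_id)`, so applying the same event twice is a no-op. In SQL this is `INSERT ... ON CONFLICT (order_id, line_id) DO UPDATE`; the in-memory sketch below (with a hypothetical `LineItem` shape) shows the same property:

```typescript
// Idempotent shadow write: the natural key (order_id, line_id) decides
// identity, so a replayed event upserts instead of duplicating.
interface LineItem {
  order_id: number;
  line_id: number;
  sku: string;
  qty: number;
}

class LineItemStore {
  private rows = new Map<string, LineItem>();

  // Equivalent of INSERT ... ON CONFLICT (order_id, line_id) DO UPDATE.
  upsert(item: LineItem): void {
    this.rows.set(`${item.order_id}:${item.line_id}`, item);
  }

  size(): number {
    return this.rows.size;
  }
}
```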
The disagreement metric between dual reads is the signal that actually matters. If you are seeing nonzero mismatches a week in, do not cut over. Something is wrong with the write path, not the backfill.
7. Backfill is the hardest part, not the schema change
Most ALTER statements on modern Postgres are fast. The backfill is where the time and risk actually live.
The rules:
- Chunk it. Never run one `UPDATE ... WHERE condition` across a billion rows. Ranges of 1,000–10,000 rows per batch, committed per batch.
- Throttle it. Sleep between batches. Dial the sleep against replication lag or autovacuum pressure, not wall-clock time.
- Checkpoint it. Write the last processed id to a side table. Crashes should resume, not restart.
- Make it idempotent. The same batch re-run must be a no-op.
- Keep it out of the hot path. Backfill workers run on a replica or a dedicated connection pool with a lower priority.
A reasonable Node/TypeScript-shaped sketch:
```typescript
// Backfill orders.region from the orders_region_lookup map.
// Resumable, chunked, throttled, idempotent.
import { sleep } from "./util";
import { db, metrics } from "./infra";

const CHUNK = 5_000;
const MAX_LAG_MS = 2_000;

async function backfillRegions(jobId: string) {
  // Resume from checkpoint or start from zero. oneOrNone, because the
  // first run has no checkpoint row yet.
  const checkpoint = await db.oneOrNone(
    `SELECT last_id FROM backfill_checkpoint WHERE job_id = $1`,
    [jobId],
  );
  let cursor: bigint = checkpoint?.last_id ?? 0n;

  while (true) {
    // Pull the next slice, skipping rows already populated.
    const rows = await db.any(
      `SELECT id FROM orders
        WHERE id > $1 AND region IS NULL
        ORDER BY id
        LIMIT $2`,
      [cursor, CHUNK],
    );
    if (rows.length === 0) break;
    const ids = rows.map((r) => r.id);

    // Single UPDATE per batch, joined against the lookup table.
    // Idempotent because of the IS NULL guard.
    const res = await db.result(
      `UPDATE orders o
          SET region = m.region
         FROM orders_region_lookup m
        WHERE o.id = ANY($1::bigint[])
          AND o.region IS NULL
          AND m.order_id = o.id`,
      [ids],
    );

    cursor = ids[ids.length - 1];
    await db.none(
      `INSERT INTO backfill_checkpoint(job_id, last_id, updated_at)
       VALUES ($1, $2, now())
       ON CONFLICT (job_id)
       DO UPDATE SET last_id = EXCLUDED.last_id, updated_at = now()`,
      [jobId, cursor],
    );
    metrics.increment("backfill.rows", res.rowCount);

    // Throttle against replica lag. The exact knob depends on the setup,
    // but "wall-clock sleep" is the laziest and most dangerous choice.
    const lag = await db.one(
      `SELECT COALESCE(
         EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) * 1000,
         0
       ) AS lag_ms`,
    );
    if (lag.lag_ms > MAX_LAG_MS) {
      await sleep(Math.min(5_000, lag.lag_ms));
    } else {
      await sleep(50);
    }
  }
}
```

What this sketch gets right, and why each piece matters:
- The `IS NULL` guard in both the `SELECT` and the `UPDATE` — if the app has already written the value, the backfill leaves it alone.
- Checkpoint after every batch. A crash loses at most one chunk.
- Throttle against replica lag, not a fixed sleep. Fixed sleeps either starve the backfill or melt the replicas, depending on traffic.
- `ANY($1::bigint[])` instead of a range scan. Range scans look clean but hit dead tuples and skewed distributions unpredictably.
What this sketch does not do, deliberately: it does not use a single long-running transaction. Long transactions block autovacuum. Blocked autovacuum is how you get a slow-burning production incident two weeks after the backfill “succeeded.”
8. Online schema change tools
For cases where ALTER itself is the problem, not the data migration, the tooling landscape is narrow and opinionated.
MySQL world:
- `pt-online-schema-change` (Percona) — creates a shadow table, copies rows in chunks, maintains a trigger for live writes, renames at the end.
- `gh-ost` (GitHub) — same idea, but reads the binlog instead of using triggers. Less load on the primary, more operational complexity. Better for very hot tables.
Postgres world:
- `pg_repack` — rebuilds a table without holding an exclusive lock. Good for bloat reclaim and some schema changes, but limited in what it can do.
- `pgroll` (Xata) — expand/contract migrations as a first-class concept. You declare both versions of the schema; it generates views so old and new code see the shape they expect.
What these tools buy you:
- Avoiding a full-table lock on `ALTER`.
- A structured rollback path.
- Built-in throttling.
What they cost:
- Trigger overhead on every write during the migration.
- Disk usage roughly 2x the table for the duration.
- Long-running operations that break on schema drift if something else runs `ALTER` mid-flight.
- A learning curve that nobody wants to climb during an incident.
These tools are a sharp instrument, not a default. I reach for them when the schema change itself is the blocker. For everything else, expand/contract with application-level dual writes is usually simpler and more observable.
9. Deploy ordering is where most incidents live
Even if the schema work is flawless, the deploy order can create downtime.
Wrong order:
```
1. deploy app v2 that reads the new column
2. run the migration that adds the new column
```

Result: app v2 crashes for the duration of the migration.
Right order:
```
1. run the migration (expand)
2. deploy app v2 that dual-writes
3. run the backfill
4. deploy app v3 that reads the new column
5. wait, verify, let old pods drain
6. run the migration (contract)
7. deploy app v4 that no longer references the old column
```

Three rules that save you:
- Schema changes must land before the code that depends on them. Every time.
- Code that removes old references must land after the schema stops having them. Every time.
- Roll forward only. If a step fails, you do not roll back schema. You fix forward. Rollback of a contract phase is usually impossible without downtime.
The corollary: long-running deploys and mixed-version fleets are the enemy. Kubernetes rolling updates with slow readiness probes mean version N and version N+1 coexist for minutes. The migration plan must assume that.
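One way to make "assume a mixed-version fleet" concrete is a gate in front of the contract migration. A sketch, assuming you can list the running app versions from your orchestrator (the function name and inputs are hypothetical; with Kubernetes, the pod list would come from the API server):

```typescript
// Gate before a contract-phase migration: every running instance must be
// at or past the first version that no longer touches the old column.
// An empty fleet reads as unsafe — "no data" is not "all clear".
function contractIsSafe(fleetVersions: number[], minSafeVersion: number): boolean {
  return (
    fleetVersions.length > 0 &&
    fleetVersions.every((v) => v >= minSafeVersion)
  );
}
```

Running this check in the migration runner, rather than in a human's head, is what catches the three-day-old pod that never drained.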
10. Constraint enforcement gotchas
A few subtle ones that have embarrassed me personally:
- `ADD FOREIGN KEY ... NOT VALID` + `VALIDATE CONSTRAINT` is the only safe way to add an FK on a large table. Otherwise you hold a lock while scanning the whole child table.
- `CREATE INDEX CONCURRENTLY` does not run inside a transaction block. Run it outside your migration runner's wrapper, or your migration framework will refuse to execute it.
- A `CONCURRENTLY` build can fail partway and leave an invalid index behind. Drop the dead index with `DROP INDEX CONCURRENTLY` and retry, otherwise the next `CREATE INDEX CONCURRENTLY` will fail because the name is taken.
- Adding a column with a volatile default (`DEFAULT now()`, `DEFAULT gen_random_uuid()`) on older Postgres versions rewrites the table. Modern versions handle non-volatile defaults in metadata, but volatile defaults still force a rewrite.
- Unique indexes and `NOT NULL` on the same column in one step: you need both to be validated in the background, not enforced synchronously.
The common thread: any DDL that appears to “just work” on a small table is hiding a full-table scan or a table rewrite that will be catastrophic on a large one. Read the lock-level column in the docs before every DDL change on a hot table.
11. Monitoring and rollback
A migration plan that does not say what “healthy” looks like is not a plan.
For every phase, define:
- the expected lock levels
- the expected replica lag envelope
- the expected backfill throughput (rows/sec)
- the expected error rate delta
- the disagreement metric during dual read
- the rollback step for this phase
Rollback matters because most phases are reversible, but not all. Expand is reversible (drop the new column). Contract is not. Once the old column is gone, the only way back is a restore.
That is why the contract phase comes last, and only after the disagreement metric has been zero for long enough that you trust it. “Long enough” is a judgment call, but as a rule: at least one full weekly traffic cycle, and never less than 48 hours after the last code deploy that could have touched the old path.
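That judgment call can still be encoded so nobody shortcuts it under deadline pressure. A sketch of a cutover gate using the thresholds above; the function and its inputs are hypothetical, fed from your metrics and deploy history:

```typescript
// Encode the "long enough" rule: zero dual-read mismatches, at least one
// full weekly traffic cycle of clean dual reads, and at least 48 hours
// since the last deploy that could have touched the old path.
const WEEK_MS = 7 * 24 * 3600 * 1000;
const MIN_QUIET_MS = 48 * 3600 * 1000;

function contractAllowed(
  now: number,            // current time, ms since epoch
  cleanSince: number,     // when the mismatch metric last left zero
  lastRiskyDeployAt: number, // last deploy that could touch the old path
  mismatches: number,     // dual-read disagreements since cleanSince
): boolean {
  return (
    mismatches === 0 &&
    now - cleanSince >= WEEK_MS &&
    now - lastRiskyDeployAt >= MIN_QUIET_MS
  );
}
```

Teams can loosen or tighten the constants, but the shape of the check — a hard gate on the disagreement metric plus two time windows — is the part worth keeping.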
12. The practical rule
There are two invariants I repeat to every team doing a migration like this.
The first:
The old schema has to keep working until the last caller is gone.
The second:
Every step must be reversible until the last.
If a phase violates either, the plan is wrong and the phase needs to be split further.
The schema tool is not the interesting part. The brokers, the ORMs, the migration framework — those are details. The interesting part is the sequence: expand the world, dual-write, backfill with checkpoints, dual-read to verify, cut over behind a flag, and only then contract.
Done this way, “zero-downtime” stops being a boast. It becomes the boring, repeatable outcome of a plan that respects the fact that large systems do not change all at once.