10 min read · databases · postgres · sre

Zero-Downtime Database Migrations on Large Tables

Expand/contract, dual-write, backfill, swap. The mechanics of changing schema on a billion-row table without stopping the world — plus the traps that still bite seasoned teams.

Rahul Gupta
Senior Software Engineer

The phrase “zero-downtime migration” gets thrown around like it is a database feature you enable. It is not. It is an engineering practice, and it is almost always a sequence of boring steps executed in the right order.

Schema changes on a small table are free. Schema changes on a billion-row table are a multi-day operation with rollback plans, checkpoints, and a deploy plan wired into the sequence.

My thesis: every zero-downtime migration follows the same pattern. Expand the schema so old and new code both work. Migrate the data. Contract the schema once the old path is dead. Dual writes and dual reads bridge the gap. Everything else is detail.

The detail is where teams get hurt.

1. Expand, migrate, contract

This is the only pattern that actually works on live systems.

Text
expand   : add the new thing without removing the old
migrate  : backfill and start writing to the new thing
contract : remove the old thing once every reader and writer is gone

Each phase ships as its own deploy. Not one deploy. Not two. Usually three to five, across days, sometimes weeks.

The core rule is simple: the old schema has to keep working during every intermediate step. If a pod running the old binary wakes up in the middle of a backfill and writes to the old column, the system must still be correct.

That means no destructive changes until the contract phase. No dropping columns. No renaming tables in place. No tightening constraints on a live column without proving the invariant first.

2. The shape of a safe schema change

Most large-table migrations fall into a handful of shapes:

  • adding a NOT NULL column
  • renaming a column or table
  • changing a column type
  • switching a primary key
  • splitting one table into many
  • merging many tables into one

They all look different on the surface. Underneath, they are all expand/contract with variations on how the dual-write and backfill work.

The reason this matters is that the risk is almost never the ALTER statement. The risk is the deploy ordering and the runtime behavior during transition.

3. Adding a NOT NULL column without locking the world

The naive version is a disaster:

SQL
ALTER TABLE orders ADD COLUMN region TEXT NOT NULL DEFAULT 'IN';

On modern Postgres this looks innocent: since version 11 a constant default is stored as metadata and does not rewrite the table. It still takes an ACCESS EXCLUSIVE lock, and on a hot table every query queues behind that lock while Postgres waits to acquire it. On MySQL/InnoDB the picture varies: 8.0 can add a column with ALGORITHM=INSTANT, but older versions rewrite the whole table.

The safe sequence is seven steps, shipped across several distinct deploys:

SQL
-- Step 1: expand. Nullable, no default, instant.
ALTER TABLE orders ADD COLUMN region TEXT;
 
-- Step 2: start writing. Application code sets region on INSERT/UPDATE.
--        Old rows remain NULL.
 
-- Step 3: backfill in chunks (see section 7).
 
-- Step 4: add a CHECK constraint NOT VALID so new writes are enforced
--        without scanning the table.
ALTER TABLE orders
  ADD CONSTRAINT orders_region_not_null
  CHECK (region IS NOT NULL) NOT VALID;
 
-- Step 5: validate in the background. This takes a SHARE UPDATE EXCLUSIVE
--        lock — readers and writers continue.
ALTER TABLE orders VALIDATE CONSTRAINT orders_region_not_null;
 
-- Step 6: promote to a real NOT NULL. On Postgres 12+, if a validated
--        CHECK constraint exists, this is O(1).
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;
 
-- Step 7: contract. Drop the helper CHECK.
ALTER TABLE orders DROP CONSTRAINT orders_region_not_null;

The trick is in steps 4 and 5. NOT VALID enforces the constraint for new writes but does not scan existing rows. VALIDATE CONSTRAINT does the scan with a weaker lock. By the time step 6 runs, Postgres already knows the column has no nulls and skips the full-table scan.

Skipping the NOT VALID / VALIDATE dance is the single most common mistake in this migration. It looks like an optimization. It is actually the whole point.

4. Renaming a column is never a rename

A column rename is two schemas that must coexist.

SQL
-- Expand: add the new column
ALTER TABLE users ADD COLUMN email_address TEXT;
 
-- Backfill: copy email -> email_address
-- Dual-write: application writes both on every change.
-- Or use a trigger if the app cannot be changed in one step:
CREATE OR REPLACE FUNCTION users_sync_email()
RETURNS TRIGGER AS $$
BEGIN
  -- On UPDATE, COALESCE alone would keep a stale copy (both columns are
  -- already non-null), so propagate whichever side actually changed.
  IF TG_OP = 'UPDATE' AND NEW.email IS DISTINCT FROM OLD.email THEN
    NEW.email_address := NEW.email;
  ELSIF TG_OP = 'UPDATE' AND NEW.email_address IS DISTINCT FROM OLD.email_address THEN
    NEW.email := NEW.email_address;
  ELSE
    NEW.email_address := COALESCE(NEW.email_address, NEW.email);
    NEW.email         := COALESCE(NEW.email, NEW.email_address);
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER users_sync_email_trg
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION users_sync_email();

Order of operations:

  1. Add the new column.
  2. Install the dual-write trigger or dual-write in the app.
  3. Backfill old values.
  4. Switch readers to the new column.
  5. Wait long enough that no writer uses the old column.
  6. Drop the trigger.
  7. Drop the old column.

Step 5 is the step every team rushes. A pod that has been idling for three days and wakes up with the old binary will happily write to the old column. You need either a version gate or enough confidence in rollout completion to move on.

Use the trigger approach when the application is not in your control or deploys are slow. Use app-level dual write when you own the write path and can version it cleanly. Mixing both is how you get infinite recursion loops in the trigger.
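If you own the write path, the app-level dual write can be a single normalization step before every INSERT/UPDATE. A minimal sketch — `UserWrite` and `normalizeUserWrite` are illustrative names, not from any real codebase:

```typescript
// Sketch of app-level dual write during an email -> email_address rename.
type UserWrite = { email?: string | null; email_address?: string | null };

// Called once in the write path, before the statement is issued.
// Same idea as the trigger's COALESCE logic, but versioned with the app.
// Use one mechanism or the other, never both.
function normalizeUserWrite(w: UserWrite): UserWrite {
  const email = w.email ?? w.email_address ?? null;
  const email_address = w.email_address ?? w.email ?? null;
  return { email, email_address };
}
```

Because the function lives in the application, retiring it is an ordinary code deploy rather than a DDL step — which is exactly what makes the app-level variant easier to reason about.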

5. Changing a primary key

Switching from a sequence-generated integer key (SERIAL or BIGSERIAL) to UUID or ULID is one of the hardest migrations because foreign keys depend on it.

Do not reach for ALTER COLUMN ... TYPE here: there is no meaningful conversion from a sequence value to a UUID, and every foreign key in the tree would break. The plan is:

  • Add a new id_uuid column, nullable and with no default; then set DEFAULT gen_random_uuid() in a separate statement so only new rows get one. (Adding the column with a volatile default in a single step forces a full-table rewrite.)
  • Backfill existing rows with stable UUIDs.
  • Add the new foreign-key columns on every child table.
  • Dual-write both ids in application code.
  • Backfill child tables using the parent’s id -> id_uuid map.
  • Add a unique index on id_uuid.
  • Switch reads to use id_uuid.
  • Drop the old FKs, drop the old PK, promote id_uuid to PK.
  • Drop the old id columns across the tree.
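The expand phase of that list, sketched in SQL. Table and column names here are illustrative; note that the default is set separately from the ADD COLUMN to avoid a volatile-default table rewrite:

```sql
-- Expand: new id column, no default yet (a volatile default in the same
-- ADD COLUMN would force a full-table rewrite on Postgres 11+).
ALTER TABLE accounts ADD COLUMN id_uuid UUID;

-- Only new inserts get a generated value; old rows stay NULL.
ALTER TABLE accounts ALTER COLUMN id_uuid SET DEFAULT gen_random_uuid();

-- Child tables get their new FK column the same way.
ALTER TABLE invoices ADD COLUMN account_uuid UUID;

-- After the backfill: the unique index, built online.
CREATE UNIQUE INDEX CONCURRENTLY accounts_id_uuid_key ON accounts (id_uuid);
```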

Two things bite here:

  • A default added with ALTER COLUMN ... SET DEFAULT applies only to new inserts; Postgres never goes back and fills existing rows. Old rows stay NULL until you explicitly update them.
  • Creating a unique index on a billion-row table needs CREATE UNIQUE INDEX CONCURRENTLY. Without CONCURRENTLY, you will hold a write lock for the duration of the build and the incident channel will not be empty.

For something this invasive, it is worth asking whether you actually need a PK swap or whether an external id mapping table will do. A mapping table is cheaper and reversible. A PK swap is not.

6. Splitting one table into many

Splitting orders into orders + order_line_items + order_addresses is the same pattern, shifted up one level.

The stages:

  1. Create the new tables empty.
  2. Shadow-write: on every insert/update of orders, also write to the new tables inside the same transaction. The write is idempotent on (order_id, line_id) so replays do not double-insert.
  3. Backfill historical rows in chunks. Use an advisory lock per chunk to make the backfill idempotent and re-runnable.
  4. Add dual-read behind a feature flag. Read from the new tables, fall back to old tables on mismatch, and emit a metric when the two disagree. Let it run for a week.
  5. Cut reads over fully.
  6. Stop shadow writes.
  7. Drop the legacy columns.
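An idempotent shadow write for step 2 can lean on the natural key directly. A sketch — the table and column names are illustrative, and the ON CONFLICT target is what makes replays safe:

```sql
-- Runs inside the same transaction as the write to the legacy orders table.
-- Re-running it with the same values is a no-op apart from the upsert.
INSERT INTO order_line_items (order_id, line_id, sku, qty)
VALUES ($1, $2, $3, $4)
ON CONFLICT (order_id, line_id)
DO UPDATE SET sku = EXCLUDED.sku, qty = EXCLUDED.qty;
```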

This is where idempotency stops being a nice-to-have. A backfill that is not safely replayable will eat your weekend the first time it crashes at 73% through.

The disagreement metric between dual reads is the signal that actually matters. If you are seeing nonzero mismatches a week in, do not cut over. Something is wrong with the write path, not the backfill.
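A dual-read wrapper along the lines of step 4 might look like this. This is a sketch; `readOld`, `readNew`, and the comparison-by-serialization are assumptions, not anything from a specific codebase:

```typescript
// Dual read: compare the old and new read paths, count disagreements,
// and keep the old result authoritative until the mismatch metric has
// been flat at zero for long enough to trust.
type Reader<T> = (id: string) => Promise<T>;

function dualRead<T>(
  readOld: Reader<T>,
  readNew: Reader<T>,
  onMismatch: (id: string) => void,
): Reader<T> {
  return async (id: string) => {
    const [oldVal, newVal] = await Promise.all([readOld(id), readNew(id)]);
    if (JSON.stringify(oldVal) !== JSON.stringify(newVal)) {
      onMismatch(id); // emit the disagreement metric here
    }
    return oldVal; // old path stays the source of truth until cutover
  };
}
```

Cutover then becomes swapping the return value from oldVal to newVal behind the flag, with the comparison still running.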

7. Backfill is the hardest part, not the schema change

Most ALTER statements on modern Postgres are fast. The backfill is where the time and risk actually live.

The rules:

  • Chunk it. Never UPDATE ... WHERE condition on a billion rows. Ranges of 1000–10000 rows per batch, committed per batch.
  • Throttle it. Sleep between batches. Dial it against replication lag or autovacuum pressure, not wall-clock time.
  • Checkpoint it. Write the last processed id to a side table. Crashes should resume, not restart.
  • Make it idempotent. The same batch re-run must be a no-op.
  • Keep it out of the hot path. Backfill workers run on a replica or a dedicated connection pool with a lower priority.
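The checkpoint side table can be tiny. Illustrative DDL, shaped to match the backfill_checkpoint table the sketch in this section relies on:

```sql
CREATE TABLE IF NOT EXISTS backfill_checkpoint (
  job_id     TEXT PRIMARY KEY,
  last_id    BIGINT NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```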

A reasonable sketch — TypeScript here, but the same shape works in Go or anything else:

TypeScript
// Backfill orders.region from the orders_region_lookup map.
// Resumable, chunked, throttled, idempotent.
 
import { sleep } from "./util";
import { db, metrics } from "./infra";
 
const CHUNK = 5_000;
const MAX_LAG_MS = 2_000;
 
async function backfillRegions(jobId: string) {
  // Resume from checkpoint, or start from zero if this is the first run.
  // oneOrNone, not one: on the first run there is no checkpoint row yet.
  const checkpoint = await db.oneOrNone(
    `SELECT last_id FROM backfill_checkpoint WHERE job_id = $1`,
    [jobId],
  );
  let cursor: bigint = checkpoint?.last_id ?? 0n;
 
  while (true) {
    // Pull the next slice, skip rows already populated.
    const rows = await db.any(
      `SELECT id FROM orders
        WHERE id > $1 AND region IS NULL
        ORDER BY id
        LIMIT $2`,
      [cursor, CHUNK],
    );
    if (rows.length === 0) break;
 
    const ids = rows.map((r) => r.id);
 
    // Single UPDATE per batch, joined against the lookup table.
    // Idempotent because of the IS NULL guard.
    const res = await db.result(
      `UPDATE orders o
          SET region = m.region
         FROM orders_region_lookup m
        WHERE o.id = ANY($1::bigint[])
          AND o.region IS NULL
          AND m.order_id = o.id`,
      [ids],
    );
 
    cursor = ids[ids.length - 1];
 
    await db.none(
      `INSERT INTO backfill_checkpoint(job_id, last_id, updated_at)
       VALUES ($1, $2, now())
       ON CONFLICT (job_id)
       DO UPDATE SET last_id = EXCLUDED.last_id, updated_at = now()`,
      [jobId, cursor],
    );
 
    metrics.increment("backfill.rows", res.rowCount);
 
    // Throttle against replica lag. The exact knob depends on the setup,
    // but "wall-clock sleep" is the laziest and most dangerous choice.
    const { lag_ms } = await db.one(
      `SELECT COALESCE(
         EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) * 1000,
         0
       )::float8 AS lag_ms`,
    );
    if (lag_ms > MAX_LAG_MS) {
      await sleep(Math.min(5_000, lag_ms));
    } else {
      await sleep(50);
    }
  }
}

What this sketch gets right, and why each piece matters:

  • IS NULL guard in both the SELECT and the UPDATE — if the app has already written the value, the backfill leaves it alone.
  • Checkpoint after every batch. A crash loses at most one chunk.
  • Throttle against replica lag, not a fixed sleep. Fixed sleeps either starve the backfill or melt the replicas depending on traffic.
  • ANY($1::bigint[]) instead of a range scan. Range scans look clean but hit dead tuples and skewed distributions unpredictably.

What this sketch does not do, deliberately: it does not use a single long-running transaction. Long transactions block autovacuum. Blocked autovacuum is how you get a slow-burning production incident two weeks after the backfill “succeeded.”

8. Online schema change tools

For cases where ALTER itself is the problem, not the data migration, the tooling landscape is narrow and opinionated.

MySQL world:

  • pt-online-schema-change (Percona) — creates a shadow table, copies rows in chunks, maintains a trigger for live writes, renames at the end.
  • gh-ost (GitHub) — same idea, but reads the binlog instead of using triggers. Less load on the primary, more operational complexity. Better for very hot tables.

Postgres world:

  • pg_repack — rebuilds a table without holding an exclusive lock. Good for bloat reclaim and some schema changes, but limited in what it can do.
  • pgroll (Xata) — expand/contract migrations as a first-class concept. You declare both versions of the schema; it generates views so old and new code see the shape they expect.

What these tools buy you:

  • Avoiding a full-table lock on ALTER.
  • A structured rollback path.
  • Built-in throttling.

What they cost:

  • Trigger overhead on every write during the migration.
  • Disk usage roughly 2x the table for the duration.
  • Long-running operations that break on schema drift if something else runs ALTER mid-flight.
  • A learning curve that nobody wants to climb during an incident.

These tools are a sharp instrument, not a default. I reach for them when the schema change itself is the blocker. For everything else, expand/contract with application-level dual writes is usually simpler and more observable.

9. Deploy ordering is where most incidents live

Even if the schema work is flawless, the deploy order can create downtime.

Wrong order:

Text
1. deploy app v2 that reads the new column
2. run the migration that adds the new column

Result: app v2 errors on every read of the missing column until the migration lands.

Right order:

Text
1. run the migration (expand)
2. deploy app v2 that dual-writes
3. run the backfill
4. deploy app v3 that reads the new column
5. wait, verify, let old pods drain
6. run the migration (contract)
7. deploy app v4 that no longer references the old column

Three rules that save you:

  • Schema changes must land before the code that depends on them. Every time.
  • Code that removes old references must land after the schema stops having them. Every time.
  • Roll forward only. If a step fails, you do not roll back schema. You fix forward. Rollback of a contract phase is usually impossible without downtime.

The corollary: long-running deploys and mixed-version fleets are the enemy. Kubernetes rolling updates with slow readiness probes mean version N and version N+1 coexist for minutes. The migration plan must assume that.

10. Constraint enforcement gotchas

A few subtle ones that have embarrassed me personally:

  • ADD FOREIGN KEY ... NOT VALID + VALIDATE CONSTRAINT is the only safe way to add an FK on a large table. Otherwise you hold a lock while scanning the whole child table.
  • CREATE INDEX CONCURRENTLY does not run inside a transaction block. Run it outside your migration runner’s wrapper, or your migration framework will refuse to execute it.
  • CREATE INDEX CONCURRENTLY can fail partway and leave an INVALID index behind. Drop the dead index with DROP INDEX CONCURRENTLY and retry; otherwise the next attempt fails because the name is taken.
  • Adding a column with a volatile default (DEFAULT now(), DEFAULT gen_random_uuid()) on older Postgres versions rewrites the table. Modern versions handle non-volatile defaults in metadata, but volatile defaults still force a rewrite.
  • Unique plus NOT NULL on the same column: build the index with CREATE UNIQUE INDEX CONCURRENTLY and prove the absence of nulls with a NOT VALID check constraint, so neither is enforced with a synchronous full-table scan.
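The FK pattern from the first bullet, spelled out with illustrative table names:

```sql
-- Takes only a brief lock; does not scan line_items.
ALTER TABLE line_items
  ADD CONSTRAINT line_items_order_fk
  FOREIGN KEY (order_id) REFERENCES orders (id) NOT VALID;

-- Does the scan in the background under a weaker lock;
-- readers and writers continue.
ALTER TABLE line_items VALIDATE CONSTRAINT line_items_order_fk;
```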

The common thread: any DDL that appears to “just work” on a small table is hiding a full-table scan or a table rewrite that will be catastrophic on a large one. Read the lock-level column in the docs before every DDL change on a hot table.

11. Monitoring and rollback

A migration plan that does not say what “healthy” looks like is not a plan.

For every phase, define:

  • the expected lock levels
  • the expected replica lag envelope
  • the expected backfill throughput (rows/sec)
  • the expected error rate delta
  • the disagreement metric during dual read
  • the rollback step for this phase
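One concrete way to watch lock health during any phase is a query kept open in a second terminal. This uses standard catalog views; pg_blocking_pids exists on Postgres 9.6+:

```sql
-- Who is waiting, and on whom. Anything stuck here during a DDL step
-- means the migration is holding up live traffic.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       now() - query_start   AS waiting_for,
       left(query, 80)       AS query
  FROM pg_stat_activity
 WHERE cardinality(pg_blocking_pids(pid)) > 0
 ORDER BY query_start;
```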

Rollback matters because most phases are reversible, but not all. Expand is reversible (drop the new column). Contract is not. Once the old column is gone, the only way back is a restore.

That is why the contract phase comes last, and only after the disagreement metric has been zero for long enough that you trust it. “Long enough” is a judgment call, but as a rule: at least one full weekly traffic cycle, and never less than 48 hours after the last code deploy that could have touched the old path.

12. The practical rule

There are two invariants I repeat to every team doing a migration like this.

The first:

The old schema has to keep working until the last caller is gone.

The second:

Every step must be reversible until the last.

If a phase violates either, the plan is wrong and the phase needs to be split further.

The schema tool is not the interesting part. The drivers, the ORMs, the migration framework — those are details. The interesting part is the sequence: expand the world, dual-write, backfill with checkpoints, dual-read to verify, cut over behind a flag, and only then contract.

Done this way, “zero-downtime” stops being a boast. It becomes the boring, repeatable outcome of a plan that respects the fact that large systems do not change all at once.
