# Zero-downtime database migrations
Schema changes do not have to ship at 2am with a rollback plan written on a napkin. The patterns are well known and have been for twenty years; applying them turns most migrations into something you ship on a Tuesday afternoon between coffees.
## Why downtime is a choice
"We need a maintenance window for the migration" almost always means "we are going to do the migration in one step instead of three". One step is faster to write. Three steps are faster to run, safer to deploy, and straightforward to roll back. The trade is a little intermediate complexity in exchange for never needing a rollback plan that involves restoring from backup.
The two ideas that make zero-downtime migration mostly mechanical are expand-contract and dual writes. Get those right and you can ship anything.
Maintenance windows are a smell. They mean someone shipped a one-step plan when a three-step plan was available.
## Expand and contract
The pattern, in three movements:
- Expand. Add the new shape alongside the old one. Both work. Application code reads from old, writes to both.
- Migrate. Backfill the new shape from the old one. Switch reads to new. Old shape is now redundant but still maintained.
- Contract. Stop writing to the old shape. Once you are sure nothing reads from it, drop it.
Each step is independently deployable and independently reversible. Each is a normal, daytime release. The price is that the schema is messy for a few days; the prize is that the database is online for those days.
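As a sketch, the read/write routing across the three movements looks like this in application code. The `Phase` type and function names here are illustrative, not from any particular framework:

```typescript
// Sketch: how reads and writes route through the three phases.
// `Phase`, `readUser`, and `writeUser` are illustrative names.
type Phase = 'expand' | 'migrate' | 'contract'

function readUser(
  phase: Phase,
  readOld: () => string,
  readNew: () => string,
): string {
  // Reads switch from the old shape to the new one at the migrate step.
  return phase === 'expand' ? readOld() : readNew()
}

function writeUser(
  phase: Phase,
  writeOld: (v: string) => void,
  writeNew: (v: string) => void,
  v: string,
): void {
  // Writes go to both shapes until contract, then only to the new one.
  if (phase !== 'contract') writeOld(v)
  writeNew(v)
}
```

The point of the sketch is that each phase change is a one-line routing change, which is exactly what makes it a small, reversible deploy.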
### Renaming a column
"Rename `username` to `handle`" is the canonical example. The naive approach, `ALTER TABLE users RENAME COLUMN username TO handle`, is fast in Postgres, but it requires every running app version to know about the new name simultaneously. In a real deployment, old and new code coexist for minutes or hours. So:
```sql
-- Step 1 (expand): add the new column, copy on write
alter table users add column handle text;

create or replace function sync_username_to_handle() returns trigger as $$
begin
  if NEW.handle is null then
    NEW.handle := NEW.username;
  end if;
  if NEW.username is null then
    NEW.username := NEW.handle;
  end if;
  return NEW;
end;
$$ language plpgsql;

create trigger users_sync
  before insert or update on users
  for each row execute function sync_username_to_handle();
```
```sql
-- Step 2 (backfill): copy existing rows
update users set handle = username where handle is null;
-- Application reads switched to 'handle' here, in a separate deploy.
```
```sql
-- Step 3 (contract): once nothing reads username
drop trigger users_sync on users;
drop function sync_username_to_handle();
alter table users drop column username;
```
Each step is a tiny, independent migration. If step 2 reveals a problem, you roll forward by fixing the data, not back by reverting code.
### Splitting a table
A bigger version of the same idea. You have an `orders` table that has accreted twelve columns about shipping; you want to extract a `shipments` table.
- Expand. Create `shipments`. Add a `shipment_id` on `orders`. Application code, when creating an order, also creates a shipment row and links it.
- Backfill. Copy existing shipping data from `orders` into `shipments` in batches.
- Switch reads. Update queries to read shipping data from `shipments`. Deploy.
- Stop writing the old columns. Deploy.
- Contract. Drop the old columns from `orders`.
Five deploys for a table split. Yes, it is more deploys than dropping the columns and praying. The five-deploy version never wakes anyone up.
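The expand step's dual write might look like this in application code. The `Tx` interface and every field name here are hypothetical stand-ins for whatever data layer you use:

```typescript
// Sketch of the expand step for the orders -> shipments split.
// The Tx interface and field names are illustrative assumptions.
interface Tx {
  insertShipment(data: { address: string; carrier: string }): Promise<{ id: number }>
  insertOrder(data: {
    total: number
    shipmentId: number
    address: string
    carrier: string
  }): Promise<{ id: number }>
}

async function createOrder(
  tx: Tx,
  input: { total: number; address: string; carrier: string },
) {
  // New shape: create the shipment row first so the order can link to it.
  const shipment = await tx.insertShipment({
    address: input.address,
    carrier: input.carrier,
  })
  // Old shape: keep writing the shipping columns on orders, so any
  // code still reading the old columns keeps working until the switch.
  return tx.insertOrder({
    total: input.total,
    shipmentId: shipment.id,
    address: input.address,
    carrier: input.carrier,
  })
}
```

Both writes should happen in one transaction so the link never dangles; that is the main design constraint of the dual-write step.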
## Backfills that scale
A naive `UPDATE users SET handle = username` on a table with a hundred million rows takes a row-level lock on every row, generates an enormous burst of write-ahead log in a single transaction, and can leave replicas hopelessly behind. Backfills must be batched, idempotent, and observable.
```sql
-- pg-friendly batched backfill, run in a loop
do $$
declare
  batch_size int := 5000;
  rows_updated int;
begin
  loop
    with cte as (
      select id from users
      where handle is null
      order by id
      limit batch_size
      for update skip locked
    )
    update users u
    set handle = u.username
    from cte
    where u.id = cte.id;

    get diagnostics rows_updated = row_count;
    exit when rows_updated = 0;
    perform pg_sleep(0.05); -- breathing room for replication
  end loop;
end $$;
```
For tables in the hundreds of millions, run the backfill from an application worker rather than a single transaction. You want checkpointing, rate-limiting, and the ability to pause it from a feature flag without killing a long-running statement.
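A minimal sketch of such a worker, assuming hypothetical `loadCheckpoint`/`saveCheckpoint`/`copyBatch` helpers and a flag-backed `isPaused` check:

```typescript
// Sketch of a checkpointed, pausable backfill worker.
// Every name in BackfillDeps is a hypothetical helper, not a real API.
interface BackfillDeps {
  loadCheckpoint(): Promise<number>  // last processed id, 0 on first run
  saveCheckpoint(id: number): Promise<void>
  // Copies one batch after the cursor; returns the last id copied,
  // or null when there is nothing left to do.
  copyBatch(afterId: number, limit: number): Promise<number | null>
  isPaused(): Promise<boolean>       // backed by a feature flag
  sleep(ms: number): Promise<void>
}

async function runBackfill(deps: BackfillDeps, batchSize = 5000): Promise<void> {
  let cursor = await deps.loadCheckpoint()
  for (;;) {
    if (await deps.isPaused()) {
      // A flag flip pauses the worker without killing a statement.
      await deps.sleep(1000)
      continue
    }
    const lastId = await deps.copyBatch(cursor, batchSize)
    if (lastId === null) return      // nothing left: done
    cursor = lastId
    await deps.saveCheckpoint(cursor) // resumable after a crash
    await deps.sleep(50)              // rate limit for replication's sake
  }
}
```

Because the cursor is persisted after every batch, a crashed or paused run resumes where it left off instead of starting over.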
## Feature flags as migration tooling
The most underused tool in zero-downtime migrations is the feature flag. Every "switch reads from new" step is conceptually a flag flip. Wire it up that way and you can roll the change to 1% of traffic, watch your dashboards, and increase from there.
```typescript
// In application code
const useNewShape = await flags.boolean(
  'orders.read-shipments-table',
  { userId },
)
const order = useNewShape
  ? await db.orders.withShipment(orderId)
  : await db.orders.legacy(orderId)
```
When the flag has been at 100% for a week with no anomalies, you remove the flag and delete the legacy branch. The flag itself is documentation: anyone reading the code knows there is a migration in flight.
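Under the hood, a percentage rollout is just a stable hash of the user into a bucket. A sketch; the FNV-1a hash and flag key here are illustrative, and a real flag service does this for you:

```typescript
// Sketch: deterministic percentage rollout for a read-switch flag.
// FNV-1a is used only as a simple, stable illustrative hash.
function bucket(userId: string, flagKey: string): number {
  let h = 0x811c9dc5
  for (const ch of flagKey + ':' + userId) {
    h ^= ch.charCodeAt(0)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h % 100 // stable bucket 0..99 per (flag, user)
}

function readsNewShape(userId: string, rolloutPercent: number): boolean {
  // The same user always lands in the same bucket, so their reads do
  // not flap between old and new shapes as the percentage grows.
  return bucket(userId, 'orders.read-shipments-table') < rolloutPercent
}
```

The stable bucket is what makes a 1% → 10% → 100% ramp safe: raising the percentage only ever adds users to the new path, never bounces anyone back and forth.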
## Locks: the silent killers
Most "Postgres is fast at this" advice is correct in isolation and a trap in production. The trap is the lock that the operation acquires and how long it holds it. A few facts worth memorising:
- `ALTER TABLE ... ADD COLUMN` with a default rewrites the table on Postgres < 11. On modern Postgres, a constant default is metadata-only, but a volatile default still rewrites. Always add the column nullable, backfill, then add the constraint.
- Setting `NOT NULL` via `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL` scans the whole table under an `ACCESS EXCLUSIVE` lock. Add a `CHECK (col IS NOT NULL) NOT VALID` constraint first, then `VALIDATE CONSTRAINT` separately; validation takes only a `SHARE UPDATE EXCLUSIVE` lock. On Postgres 12+ the validated constraint then lets `SET NOT NULL` skip the scan.
- `CREATE INDEX` blocks writes. Always use `CREATE INDEX CONCURRENTLY` in production. Yes, it is slower; you do not care.
- Long transactions hold locks until they commit. Wrapping a multi-step migration in `BEGIN; ... COMMIT;` turns three small problems into one outage.
Set `lock_timeout` on every migration. If you cannot get the lock in 5 seconds, fail loudly and retry. This is dramatically safer than a migration script that silently queues behind a long-running query and then takes the table offline when it finally acquires the lock.
```sql
set lock_timeout = '5s';
set statement_timeout = '60s';
alter table users add column handle text;
```
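The retry half of that advice can live in the migration runner. A sketch, assuming a hypothetical `runSql` executor; Postgres reports a `lock_timeout` failure as SQLSTATE `55P03` (`lock_not_available`):

```typescript
// Sketch: retry a DDL statement that may fail on lock_timeout.
// `runSql` is a hypothetical executor provided by your migration tool.
async function withLockRetry(
  runSql: (sql: string) => Promise<void>,
  ddl: string,
  maxAttempts = 10,
  backoffMs = 1000,
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      await runSql("set lock_timeout = '5s'")
      await runSql(ddl)
      return
    } catch (err: any) {
      // Fail loudly on anything that is not a lock timeout.
      if (err?.code !== '55P03' || attempt >= maxAttempts) throw err
      // Back off so queries queued behind us can drain first.
      await new Promise(r => setTimeout(r, attempt * backoffMs))
    }
  }
}
```

The key property: a failed attempt releases its place in the lock queue, so readers and writers behind it proceed while the migration backs off.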
## Rollback strategy
Each step in expand-contract has a clear rollback because each step is small and additive. The rollback for "added a column" is "drop the column". The rollback for "switched reads" is "flip the flag". The rollback for "dropped the column" is — and this is the trick — do not drop the column until you are absolutely sure. Rollback for a destructive step is "restore from backup", which is not a real rollback.
My rule: leave a dropped column lying around for two release cycles after the deploy that stopped writing to it. Rename it to `old_username_DEPRECATED_2026_05` if it irritates you. Disk is cheap; outages are not.
## Pitfalls
Migration tools (Flyway, sqlx, drizzle-kit) typically wrap each migration in a transaction. A long backfill inside a transaction holds locks for the duration. Backfills should be runnable as standalone scripts, idempotent, and resumable.
A migration that runs in 30 seconds on the primary may take 30 minutes to apply on a replica, during which the replica is unavailable for reads. Test on a real replica or pause replicas during the change.
A `CASCADE` on `DROP` can silently delete data several joins away. Read every `CASCADE` twice. In production, prefer to drop dependents explicitly.
A casual `CREATE INDEX` on a busy 100M-row table will block writes for the duration. Always `CONCURRENTLY` in production. CI tests will not catch this because the test database is empty.
## Wrap-up
Most production schema changes can be ship-on-a-Tuesday work with the patterns above. The price is that you write three deploys instead of one. The reward is that "we need to take the site down for a migration" stops being a sentence anyone says.
Postgres in particular has gotten very good at online operations over the last decade. The footguns are well-documented, the patterns are stable, and the tooling for feature flags and migrations is everywhere. There is no excuse left.
## Want this done for you?
If you would rather skip the yak shave and have someone who has done this fifty times set it up properly, that is what I do for a living.
Start a project