# Zero-downtime database migrations
Schema changes do not have to ship at 2am with a rollback plan written on a napkin. The patterns are well known and have been for twenty years; applying them turns most migrations into something you ship on a Tuesday afternoon between coffees.
## Why downtime is a choice
"We need a maintenance window for the migration" almost always means "we are going to do the migration in one step instead of three". One step is faster to write. Three steps are faster to run, safer to deploy, and straightforward to roll back. The trade is a little intermediate complexity in exchange for never needing a rollback plan that involves restoring from backup.
The two ideas that make zero-downtime migration mostly mechanical are expand-contract and dual writes. Get those right and you can ship anything.
Maintenance windows are a smell. They mean someone shipped a one-step plan when a three-step plan was available.
## Expand and contract
The pattern, in three movements:
- Expand. Add the new shape alongside the old one. Both work. Application code reads from old, writes to both.
- Migrate. Backfill the new shape from the old one. Switch reads to new. Old shape is now redundant but still maintained.
- Contract. Stop writing to the old shape. Once you are sure nothing reads from it, drop it.
Each step is independently deployable and independently reversible. Each is a normal, daytime release. The price is that the schema is messy for a few days; the prize is that the database is online for those days.
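As a sketch, the read/write routing across the three movements looks like this in application code. The `Phase` type and function names here are illustrative, not from any particular framework:

```typescript
// Sketch: how reads and writes route through the three phases.
// `Phase`, `readUser`, and `writeUser` are illustrative names.
type Phase = 'expand' | 'migrate' | 'contract'

function readUser(
  phase: Phase,
  readOld: () => string,
  readNew: () => string,
): string {
  // Reads switch from the old shape to the new one at the migrate step.
  return phase === 'expand' ? readOld() : readNew()
}

function writeUser(
  phase: Phase,
  writeOld: (v: string) => void,
  writeNew: (v: string) => void,
  v: string,
): void {
  // Writes go to both shapes until contract, then only to the new one.
  if (phase !== 'contract') writeOld(v)
  writeNew(v)
}
```

The point of the sketch is that each phase change is a one-line routing change, which is exactly what makes it a small, reversible deploy.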
### Renaming a column
"Rename `username` to `handle`" is the canonical example. The naive approach, `ALTER TABLE users RENAME COLUMN username TO handle`, is fast in Postgres, but it requires every running app version to know about the new name simultaneously. In a real deployment, old and new code coexist for minutes or hours. So:
```sql
-- Step 1 (expand): add the new column, copy on write
alter table users add column handle text;

create or replace function sync_username_to_handle() returns trigger as $$
begin
  if NEW.handle is null then
    NEW.handle := NEW.username;
  end if;
  if NEW.username is null then
    NEW.username := NEW.handle;
  end if;
  return NEW;
end;
$$ language plpgsql;

create trigger users_sync
  before insert or update on users
  for each row execute function sync_username_to_handle();
```
```sql
-- Step 2 (backfill): copy existing rows
update users set handle = username where handle is null;
-- Application reads switched to 'handle' here, in a separate deploy.
```
```sql
-- Step 3 (contract): once nothing reads username
drop trigger users_sync on users;
drop function sync_username_to_handle();
alter table users drop column username;
```
Each step is a tiny, independent migration. If step 2 reveals a problem, you roll forward by fixing the data, not back by reverting code.
### Splitting a table
A bigger version of the same idea. You have an `orders` table that has accreted twelve columns about shipping; you want to extract a `shipments` table.
- Expand. Create `shipments`. Add a `shipment_id` on `orders`. Application code, when creating an order, also creates a shipment row and links it.
- Backfill. Copy existing shipping data from `orders` into `shipments` in batches.
- Switch reads. Update queries to read shipping data from `shipments`. Deploy.
- Stop writing the old columns. Deploy.
- Contract. Drop the old columns from `orders`.
Five deploys for a table split. Yes, it is more deploys than dropping the columns and praying. The five-deploy version never wakes anyone up.
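The expand step's dual write might look like this in application code. The `Tx` interface and every field name here are hypothetical stand-ins for whatever data layer you use:

```typescript
// Sketch of the expand step for the orders -> shipments split.
// The Tx interface and field names are illustrative assumptions.
interface Tx {
  insertShipment(data: { address: string; carrier: string }): Promise<{ id: number }>
  insertOrder(data: {
    total: number
    shipmentId: number
    address: string
    carrier: string
  }): Promise<{ id: number }>
}

async function createOrder(
  tx: Tx,
  input: { total: number; address: string; carrier: string },
) {
  // New shape: create the shipment row first so the order can link to it.
  const shipment = await tx.insertShipment({
    address: input.address,
    carrier: input.carrier,
  })
  // Old shape: keep writing the shipping columns on orders, so any
  // code still reading the old columns keeps working until the switch.
  return tx.insertOrder({
    total: input.total,
    shipmentId: shipment.id,
    address: input.address,
    carrier: input.carrier,
  })
}
```

Both writes should happen in one transaction so the link never dangles; that is the main design constraint of the dual-write step.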
## Backfills that scale
A naive `UPDATE users SET handle = username` on a table with a hundred million rows takes a row-level lock on every row, generates an enormous burst of write-ahead log in a single transaction, and can leave replicas hopelessly behind. Backfills must be batched, idempotent, and observable.
```sql
-- pg-friendly batched backfill, run in a loop
do $$
declare
  batch_size int := 5000;
  rows_updated int;
begin
  loop
    with cte as (
      select id from users
      where handle is null
      order by id
      limit batch_size
      for update skip locked
    )
    update users u
    set handle = u.username
    from cte
    where u.id = cte.id;

    get diagnostics rows_updated = row_count;
    exit when rows_updated = 0;
    perform pg_sleep(0.05); -- breathing room for replication
  end loop;
end $$;
```
For tables in the hundreds of millions, run the backfill from an application worker rather than a single transaction. You want checkpointing, rate-limiting, and the ability to pause it from a feature flag without killing a long-running statement.
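A minimal sketch of such a worker, assuming hypothetical `loadCheckpoint`/`saveCheckpoint`/`copyBatch` helpers and a flag-backed `isPaused` check:

```typescript
// Sketch of a checkpointed, pausable backfill worker.
// Every name in BackfillDeps is a hypothetical helper, not a real API.
interface BackfillDeps {
  loadCheckpoint(): Promise<number>  // last processed id, 0 on first run
  saveCheckpoint(id: number): Promise<void>
  // Copies one batch after the cursor; returns the last id copied,
  // or null when there is nothing left to do.
  copyBatch(afterId: number, limit: number): Promise<number | null>
  isPaused(): Promise<boolean>       // backed by a feature flag
  sleep(ms: number): Promise<void>
}

async function runBackfill(deps: BackfillDeps, batchSize = 5000): Promise<void> {
  let cursor = await deps.loadCheckpoint()
  for (;;) {
    if (await deps.isPaused()) {
      // A flag flip pauses the worker without killing a statement.
      await deps.sleep(1000)
      continue
    }
    const lastId = await deps.copyBatch(cursor, batchSize)
    if (lastId === null) return      // nothing left: done
    cursor = lastId
    await deps.saveCheckpoint(cursor) // resumable after a crash
    await deps.sleep(50)              // rate limit for replication's sake
  }
}
```

Because the cursor is persisted after every batch, a crashed or paused run resumes where it left off instead of starting over.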
## Feature flags as migration tooling
The most underused tool in zero-downtime migrations is the feature flag. Every "switch reads from new" step is conceptually a flag flip. Wire it up that way and you can roll the change to 1% of traffic, watch your dashboards, and increase from there.
```typescript
// In application code
const useNewShape = await flags.boolean(
  'orders.read-shipments-table',
  { userId },
)
const order = useNewShape
  ? await db.orders.withShipment(orderId)
  : await db.orders.legacy(orderId)
```
When the flag has been at 100% for a week with no anomalies, you remove the flag and delete the legacy branch. The flag itself is documentation: anyone reading the code knows there is a migration in flight.
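Under the hood, a percentage rollout is just a stable hash of the user into a bucket. A sketch; the FNV-1a hash and flag key here are illustrative, and a real flag service does this for you:

```typescript
// Sketch: deterministic percentage rollout for a read-switch flag.
// FNV-1a is used only as a simple, stable illustrative hash.
function bucket(userId: string, flagKey: string): number {
  let h = 0x811c9dc5
  for (const ch of flagKey + ':' + userId) {
    h ^= ch.charCodeAt(0)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h % 100 // stable bucket 0..99 per (flag, user)
}

function readsNewShape(userId: string, rolloutPercent: number): boolean {
  // The same user always lands in the same bucket, so their reads do
  // not flap between old and new shapes as the percentage grows.
  return bucket(userId, 'orders.read-shipments-table') < rolloutPercent
}
```

The stable bucket is what makes a 1% → 10% → 100% ramp safe: raising the percentage only ever adds users to the new path, never bounces anyone back and forth.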
## Locks: the silent killers
Most "Postgres is fast at this" advice is correct in isolation and a trap in production. The trap is the lock that the operation acquires and how long it holds it. A few facts worth memorising:
- `ALTER TABLE ... ADD COLUMN` with a default rewrites the table on Postgres < 11. On modern Postgres, a constant default is metadata-only, but a volatile default still rewrites. Always add the column nullable, backfill, then add the constraint.
- Setting `NOT NULL` via `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL` scans the whole table under an `ACCESS EXCLUSIVE` lock. Add a `CHECK (col IS NOT NULL) NOT VALID` constraint first, then `VALIDATE CONSTRAINT` separately; validation takes only a `SHARE UPDATE EXCLUSIVE` lock. On Postgres 12+ the validated constraint then lets `SET NOT NULL` skip the scan.
- `CREATE INDEX` blocks writes. Always use `CREATE INDEX CONCURRENTLY` in production. Yes, it is slower; you do not care.
- Long transactions hold locks until they commit. Wrapping a multi-step migration in `BEGIN; ... COMMIT;` turns three small problems into one outage.
Set `lock_timeout` on every migration. If you cannot get the lock in 5 seconds, fail loudly and retry. This is dramatically safer than a migration script that silently queues behind a long-running query and then takes the table offline when it finally acquires the lock.
```sql
set lock_timeout = '5s';
set statement_timeout = '60s';
alter table users add column handle text;
```
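The retry half of that advice can live in the migration runner. A sketch, assuming a hypothetical `runSql` executor; Postgres reports a `lock_timeout` failure as SQLSTATE `55P03` (`lock_not_available`):

```typescript
// Sketch: retry a DDL statement that may fail on lock_timeout.
// `runSql` is a hypothetical executor provided by your migration tool.
async function withLockRetry(
  runSql: (sql: string) => Promise<void>,
  ddl: string,
  maxAttempts = 10,
  backoffMs = 1000,
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      await runSql("set lock_timeout = '5s'")
      await runSql(ddl)
      return
    } catch (err: any) {
      // Fail loudly on anything that is not a lock timeout.
      if (err?.code !== '55P03' || attempt >= maxAttempts) throw err
      // Back off so queries queued behind us can drain first.
      await new Promise(r => setTimeout(r, attempt * backoffMs))
    }
  }
}
```

The key property: a failed attempt releases its place in the lock queue, so readers and writers behind it proceed while the migration backs off.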
## Rollback strategy
Each step in expand-contract has a clear rollback because each step is small and additive. The rollback for "added a column" is "drop the column". The rollback for "switched reads" is "flip the flag". The rollback for "dropped the column" is — and this is the trick — do not drop the column until you are absolutely sure. Rollback for a destructive step is "restore from backup", which is not a real rollback.
My rule: leave a dropped column lying around for two release cycles after the deploy that stopped writing to it. Rename it to `old_username_DEPRECATED_2026_05` if it irritates you. Disk is cheap; outages are not.
## Pitfalls
Migration tools (Flyway, sqlx, drizzle-kit) typically wrap each migration in a transaction. A long backfill inside a transaction holds locks for the duration. Backfills should be runnable as standalone scripts, idempotent, and resumable.
A migration that runs in 30 seconds on the primary may take 30 minutes to apply on a replica, during which the replica is unavailable for reads. Test on a real replica or pause replicas during the change.
A `CASCADE` on `DROP` can silently delete data several joins away. Read every `CASCADE` twice. In production, prefer to drop dependents explicitly.
A casual `CREATE INDEX` on a busy 100M-row table will block writes for the duration. Always `CONCURRENTLY` in production. CI tests will not catch this because the test database is empty.
## Wrap-up
Most production schema changes can be ship-on-a-Tuesday work with the patterns above. The price is that you write three deploys instead of one. The reward is that "we need to take the site down for a migration" stops being a sentence anyone says.
Postgres in particular has gotten very good at online operations over the last decade. The footguns are well-documented, the patterns are stable, and the tooling for feature flags and migrations is everywhere. There is no excuse left.
## Want this done for you?
If you would rather skip the yak shave and have someone who has done this fifty times set it up properly, that is what I do for a living.
Start a project