How do you run database migrations without downtime?

Zero-downtime migrations use an expand-contract pattern: first expand the schema to support both old and new formats simultaneously, deploy application code that writes to both, backfill existing data, switch reads to the new format, then contract by removing the old schema. This separates the deployment from the migration, eliminating the need for maintenance windows.

Why do database migrations cause downtime at scale?

Most migration tools execute ALTER TABLE statements that acquire exclusive locks on the entire table. On small tables this completes in milliseconds, but on tables with millions of rows, operations like adding a column with a default value, creating an index, or changing a column type can lock the table for minutes or hours — blocking all reads and writes during that period.

What is the expand-contract pattern for database migrations?

The expand-contract pattern splits a breaking schema change into non-breaking steps. The expand phase adds new columns or tables alongside existing ones. Application code is updated to write to both old and new locations. A backfill process migrates existing data. Once complete, reads switch to the new location, and the contract phase removes the old schema. Each step is independently deployable and reversible.

How do you migrate large tables without locking?

For large tables, use online DDL tools like pt-online-schema-change or gh-ost that create a shadow copy of the table, apply changes to the copy, replicate ongoing writes via triggers or binlog parsing, then perform an atomic rename. For PostgreSQL, most ALTER TABLE operations are non-blocking, but adding indexes should use CREATE INDEX CONCURRENTLY to avoid locking.

Zero-Downtime Database Migrations at Scale

The database migration that takes 200 milliseconds on your development machine takes 45 minutes on a production table with 80 million rows. During those 45 minutes, the table is locked. Every query that touches it queues. The application stalls. Users see errors. Revenue stops.

This is not a theoretical failure mode. It is the most common cause of unplanned downtime on growth-stage platforms — a migration that worked perfectly in staging, tested against a database with 50,000 rows, deployed to production where the same table holds tens of millions.

Why Standard Migrations Break at Scale

Most migration frameworks — Rails migrations, Django migrations, Alembic, Flyway — generate DDL statements designed for correctness, not for scale. They assume that schema changes are fast, that locks are brief, and that the migration will complete before any timeout fires.

These assumptions hold until they don’t:

Adding a column with a default value rewrites the entire table in MySQL 5.7 and earlier — a 100 million row table can take 30+ minutes
Creating an index on a large table acquires a lock that blocks writes for the duration of the build
Changing a column type requires rewriting every row — and the lock prevents concurrent access
Adding a foreign key constraint validates every existing row before the constraint takes effect, holding locks throughout

The failure is predictable: the migration that ran in 2 seconds on a 10,000-row staging table runs for 20 minutes on a 50 million-row production table, and every request that touches that table during those 20 minutes either blocks or fails.

The Expand-Contract Pattern

The fundamental technique for zero-downtime migrations is the expand-contract pattern. Instead of making a breaking change in a single step, you split it into a sequence of non-breaking steps:

Phase 1: Expand

Add the new schema alongside the existing schema. Both coexist without conflict:

Adding a new column? Add it as nullable with no default (instant in most databases)
Renaming a column? Add a new column with the new name
Changing a column type? Add a new column with the new type
Splitting a table? Create the new table alongside the old one

Phase 2: Dual-Write

Deploy application code that writes to both the old and new locations. Reads continue from the old location. This ensures that all new data is captured in the new format while maintaining backward compatibility.

Phase 3: Backfill

Migrate existing data from the old format to the new format. This is a data migration, not a schema migration — it operates on rows, not on table structure. It must be:

Batched — process rows in chunks of 1,000-10,000 to avoid long-running transactions
Throttled — add delays between batches to avoid overwhelming replication or I/O
Resumable — track progress so a failed backfill can restart from where it stopped, not from the beginning
Idempotent — running the same batch twice produces the same result, so retries are safe

Phase 4: Switch Reads

Once backfill is complete and verified, switch reads to the new location. The old location continues to receive writes as a safety net.

Phase 5: Contract

Remove the old schema. This is the only step that is not backward-compatible, so it should only happen after the new schema has been in production for a sufficient validation period.

Each phase is an independent deployment. Each can be rolled back without affecting the others. The migration that would have required a 45-minute maintenance window becomes five small, safe deployments spread over days.

Online Schema Change Tools

For operations that inherently require table rewrites — changing column types, adding indexes on large tables — online schema change tools provide the mechanism:

MySQL: gh-ost and pt-online-schema-change

gh-ost (GitHub Online Schema Change) creates a shadow table with the desired schema, copies data in batches, captures ongoing changes via binlog streaming, and performs an atomic table rename when complete. It operates without triggers, which makes it safer for high-write-volume tables.

pt-online-schema-change from Percona Toolkit uses a similar approach but captures changes via triggers on the original table. It is well-tested but adds write amplification proportional to the trigger overhead.

Both tools allow schema changes on tables with billions of rows without locking or downtime.

PostgreSQL: Mostly Non-Blocking by Default

PostgreSQL handles most DDL operations without exclusive locks:

Adding a nullable column is instant (no table rewrite)
Adding a column with a default value is instant in PostgreSQL 11+ (stored in catalog, not in rows)
Creating indexes should always use CREATE INDEX CONCURRENTLY to avoid blocking writes
Changing column types still requires a full table rewrite — use the expand-contract pattern for these

Data Migration Patterns for Large Tables

The backfill phase — migrating existing data — is where most zero-downtime migrations fail. The common failure modes:

Single-Transaction Migrations

Wrapping a data migration in a single transaction seems safe — either all rows migrate or none do. But on a 50 million row table, that transaction holds locks for the entire duration, accumulates undo/redo log entries proportional to the data volume, and will likely be killed by a timeout or OOM condition.

Solution: Batch the migration. Process 5,000 rows per transaction. Track the last processed ID. Resume from that point if interrupted.

Unbounded Queries

A migration query like UPDATE users SET new_column = old_column WHERE new_column IS NULL scans the entire table to find rows that need migration. On the first run, this is efficient — most rows match. On subsequent runs after a failure, it still scans the entire table to find the remaining unmigrated rows.

Solution: Use cursor-based batching with the primary key: WHERE id > last_processed_id AND id <= last_processed_id + batch_size. This produces index-only scans regardless of migration progress.

Replication Lag

On replicated databases, large batch updates on the primary generate replication events that secondaries must process. If migration batches are too large or too fast, replication lag grows until secondaries fall behind — causing stale reads or failover delays.

Solution: Monitor replication lag between batches. Pause migration when lag exceeds a threshold (e.g., 5 seconds). Resume when lag recovers. This self-throttling ensures the migration completes as fast as replication allows without destabilizing the cluster.

Index Operations at Scale

Index creation is the most common zero-downtime violation. A standard CREATE INDEX on a 100 million row table can take 10-30 minutes, during which writes to the table are blocked.

The patterns for safe index operations:

PostgreSQL: Always use CREATE INDEX CONCURRENTLY. It takes longer but never blocks writes. Be aware that it can fail and leave an invalid index that must be dropped and retried
MySQL: InnoDB supports online index creation natively in 5.6+ for most index types. Monitor for metadata lock waits during the operation
Large indexes: Build during low-traffic periods even with concurrent operations — the I/O impact of building a multi-gigabyte index can degrade query performance

Migration Verification

Every migration — schema or data — requires verification before the contract phase:

Row count verification — does the new location contain the same number of relevant rows as the old?
Data integrity checks — do aggregates (sums, counts, checksums) match between old and new?
Application-level validation — do reads from the new location return correct results for known test cases?
Performance validation — are queries against the new schema performing within acceptable latency bounds?

Automated verification should gate the contract phase. If verification fails, the system stays in dual-write mode until the discrepancy is resolved.

Key Takeaways

Zero-downtime database migrations are not about finding the right tool — they are about choosing the right strategy. The expand-contract pattern works for any database engine, any scale, and any type of schema change. The investment is in discipline: breaking a single dangerous change into multiple safe steps, and verifying each step before proceeding to the next.

The platforms that grow without maintenance windows are those that treat every migration as a multi-step deployment, not a single DDL statement.

If your platform is experiencing migration-related downtime or you’re planning a schema change that could affect availability, a Platform Intelligence Audit can assess your migration strategy and identify patterns that eliminate downtime risk.