How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.

On this page

Database Migrations Without Downtime: Patterns From Three Real Cutovers

We've shipped three non-trivial schema migrations in the last quarter against a Postgres database serving ~12k req/s at peak. None of them caused customer-visible downtime. Here's the playbook we settled on, with the gotchas we hit along the way.

The "Expand → Migrate → Contract" Pattern #

Every migration follows the same three-phase shape:

Expand: add the new schema additively. New columns nullable, new tables empty, new indexes built CONCURRENTLY. App still reads/writes the old shape.
Migrate: dual-write to both old and new shapes; backfill old rows in batches; switch reads to the new shape behind a feature flag.
Contract: once new path is stable for ≥ 1 week, drop the old columns/tables.

The trap most teams fall into is collapsing 1 → 3 into a single deploy because "it's a small change." It's never that small once it lands on prod traffic.

Cutover #1: Splitting a 180GB Wide Table #

Original schema had orders carrying 47 columns including a fat JSONB blob of line items. We split out order_items as a proper relation.

What we did #

sql.sql

-- Phase 1: Expand
CREATE TABLE order_items (
  id BIGSERIAL PRIMARY KEY,
  order_id BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
  sku TEXT NOT NULL,
  qty INT NOT NULL,
  price_cents INT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX CONCURRENTLY order_items_order_id_idx ON order_items(order_id);

App code began dual-writing on every insert. Backfill was done in 50k-row batches with a 200ms sleep between batches:

python.python

while True:
    rows = db.execute("""
      SELECT id, items FROM orders
      WHERE migrated_at IS NULL
      ORDER BY id LIMIT 50000
    """).fetchall()
    if not rows: break
    insert_items(rows)
    db.execute("UPDATE orders SET migrated_at = now() WHERE id = ANY(%s)", [[r.id for r in rows]])
    time.sleep(0.2)  # don't starve OLTP

Backfill took 11 hours. CPU on the primary stayed below 60% the whole time because of the throttle.

What bit us #

Index build time on order_items.order_id was 2× longer than predicted because we underestimated bloat. Lesson: always build indexes CONCURRENTLY, and run pg_stat_progress_create_index to monitor progress instead of guessing.

Cutover #2: Renaming a Column That Half the Codebase Read #

We had a user.is_active boolean. Product wanted three states: active, paused, disabled. Renaming an enum-like column with 200+ call sites is a minefield.

Pattern: View-Backed Compatibility Shim #

sql.sql

-- Phase 1: add new column, keep old
ALTER TABLE users ADD COLUMN status TEXT NOT NULL DEFAULT 'active'
  CHECK (status IN ('active','paused','disabled'));

-- Backfill from old column
UPDATE users SET status = CASE WHEN is_active THEN 'active' ELSE 'disabled' END;

-- Phase 2: triggers keep them in sync until all writers migrate
CREATE TRIGGER sync_user_status
  BEFORE UPDATE OF is_active, status ON users
  FOR EACH ROW EXECUTE FUNCTION sync_user_status_fn();

The trigger let old code keep writing is_active while new code wrote status. We removed the trigger and dropped is_active only after grep showed zero references in any deployed branch.

Cutover #3: Foreign-Key Type Change (`int` → `bigint`)#

A counter table hit 1.9 billion rows; int4 was about to overflow. Changing the type with ALTER COLUMN rewrites the whole table — totally unacceptable here.

Online type change #

Add new id_big BIGINT column.
Backfill in batches.
Add a unique index CONCURRENTLY on id_big.
Atomically swap primary key in a single transaction (this is fast — it's a metadata change once the index exists).
Drop the old column.

The cutover transaction took 47ms. Total project: 6 weeks of incremental backfill.

Best Practices We Now Enforce #

Every migration PR includes a rollback plan, even if "rollback = forward-fix." If the rollback is non-obvious, the PR doesn't merge.
Backfills run as a separate job, never inline with the schema migration. Schema migration must finish in < 10s.
Read replicas are warmed before cutover by sending shadow traffic for ≥ 1 hour.
Feature flags gate the read path, never the write path. Reads flip back instantly; writes don't.
One-week soak between phases. We've never regretted waiting; we've regretted rushing.

What We Test Before Every Migration #

Check	Tool	Pass criteria
Migration runs in < 10s on prod-sized clone	`pg_dump` + restore in CI	hard fail above 10s
Backfill ETA	Run on 1% sample, extrapolate	within 24h or split further
Replication lag during backfill	`pg_stat_replication`	stays < 30s
Rollback in a sandbox	Restore from snapshot, replay forward, then back	clean state

When You Can Skip This #

You can skip the dance only if all of the following are true:

Table has < 100k rows
Migration is purely additive (new nullable column, new table, new index built CONCURRENTLY)
Application can tolerate a 30-second window of "old code reads new schema"

If any one of those isn't true, do the full dance. The hour of planning saves the day of incident response.

Database Migrations Without Downtime: Patterns From Three Real Cutovers

Database Migrations Without Downtime: Patterns From Three Real Cutovers

The "Expand → Migrate → Contract" Pattern #

Cutover #1: Splitting a 180GB Wide Table #

What we did #

What bit us #

Cutover #2: Renaming a Column That Half the Codebase Read #

Pattern: View-Backed Compatibility Shim #

Cutover #3: Foreign-Key Type Change (`int` → `bigint`)#

Online type change #

Best Practices We Now Enforce #

What We Test Before Every Migration #

When You Can Skip This #

Stay Updated

Monitoring That Actually Helps On-Call: Alerts, Dashboards, and Runbooks

Embedding Quality in RAG: How We Cut Hallucinations by 60%

More from Infrastructure

Backstage Software Catalog: Getting Adoption Past the Demo

Terraform Import at Scale: Bringing Legacy Infra Under Code

Zero-Downtime Postgres Migrations: Expand-Contract in Practice

Backstage Software Catalog: Getting Adoption Past the Demo

Terraform Import at Scale: Bringing Legacy Infra Under Code

Zero-Downtime Postgres Migrations: Expand-Contract in Practice

Postgres Read Replicas: Routing Reads Without Stale-Data Bugs

Observability for Edge Functions — Logs, Traces, and Metrics

OIDC Federation Beyond GitHub — GitLab, Buildkite, and Generic Providers

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Prompt Engineering Best Practices: Maximizing LLM Performance

AI Agents in DevOps: From Copilots to Autonomous Automation in 2025