How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.
We've shipped three non-trivial schema migrations in the last quarter against a Postgres database serving ~12k req/s at peak. None of them caused customer-visible downtime. Here's the playbook we settled on, with the gotchas we hit along the way.
Every migration follows the same three-phase shape:

1. **Expand.** Add the new tables, columns, and indexes (built CONCURRENTLY) alongside the old ones. The app still reads and writes the old shape.
2. **Migrate.** Dual-write from the app, backfill existing rows in batches, and verify the two shapes agree.
3. **Contract.** Switch reads to the new shape, stop the dual-writes, and drop the old structures.

The trap most teams fall into is collapsing 1 → 3 into a single deploy because "it's a small change." It's never that small once it lands on prod traffic.
## Migration 1: splitting line items out of a JSONB blob

The original schema had orders carrying 47 columns, including a fat JSONB blob of line items. We split order_items out as a proper relation.
```sql
-- Phase 1: Expand
CREATE TABLE order_items (
    id          BIGSERIAL PRIMARY KEY,
    order_id    BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
    sku         TEXT NOT NULL,
    qty         INT NOT NULL,
    price_cents INT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Run outside a transaction: CREATE INDEX CONCURRENTLY can't run inside one.
CREATE INDEX CONCURRENTLY order_items_order_id_idx ON order_items (order_id);
```
App code began dual-writing on every insert. Backfill was done in 50k-row batches with a 200ms sleep between batches:
```python
import time

while True:
    rows = db.execute("""
        SELECT id, items FROM orders
        WHERE migrated_at IS NULL
        ORDER BY id
        LIMIT 50000
    """).fetchall()
    if not rows:
        break
    insert_items(rows)
    db.execute(
        "UPDATE orders SET migrated_at = now() WHERE id = ANY(%s)",
        [[r.id for r in rows]],
    )
    time.sleep(0.2)  # throttle so the backfill doesn't starve OLTP
```
Backfill took 11 hours. CPU on the primary stayed below 60% the whole time because of the throttle.
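The dual-write half of phase 2 isn't shown above. Here's a minimal sketch of the shape such a path takes; the function and column names are assumptions for illustration, and `execute` stands in for a DB cursor:

```python
# Sketch of a dual-write path: every order write keeps updating the legacy
# JSONB blob on orders AND inserts normalized rows into order_items.
# Helper names here are illustrative, not from the article's codebase.

def dual_write_order(execute, order_id, items):
    # Old shape: legacy readers still consume orders.items.
    execute("UPDATE orders SET items = %s WHERE id = %s", (items, order_id))
    # New shape: one row per line item.
    for item in items:
        execute(
            "INSERT INTO order_items (order_id, sku, qty, price_cents) "
            "VALUES (%s, %s, %s, %s)",
            (order_id, item["sku"], item["qty"], item["price_cents"]),
        )

# Exercise it against a recorder instead of a real database.
log = []
dual_write_order(
    lambda query, params: log.append((query, params)),
    42,
    [{"sku": "A-1", "qty": 2, "price_cents": 999}],
)
```

Running both statements in one transaction keeps the two shapes from diverging when a write partially fails.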
Index build time on order_items.order_id was 2× longer than predicted because we underestimated bloat. Lesson: always build indexes CONCURRENTLY, and run pg_stat_progress_create_index to monitor progress instead of guessing.
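That monitoring advice is easy to make concrete. A small sketch, assuming a psycopg-style cursor would run the query; only the percentage arithmetic executes here:

```python
# pg_stat_progress_create_index (Postgres 12+) exposes block counts for an
# in-flight CREATE INDEX, so you can compute progress instead of guessing.
PROGRESS_SQL = """
    SELECT phase, blocks_done, blocks_total
    FROM pg_stat_progress_create_index
"""

def pct_done(blocks_done, blocks_total):
    # blocks_total can be 0 early in the build; avoid a division by zero.
    if not blocks_total:
        return 0.0
    return round(100.0 * blocks_done / blocks_total, 1)

print(pct_done(1234, 10000))  # prints 12.3
```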
## Migration 2: boolean to three-state status

We had a users.is_active boolean. Product wanted three states: active, paused, disabled. Renaming an enum-like column with 200+ call sites is a minefield.
```sql
-- Phase 1: add new column, keep old
ALTER TABLE users ADD COLUMN status TEXT NOT NULL DEFAULT 'active'
    CHECK (status IN ('active', 'paused', 'disabled'));

-- Backfill from old column
UPDATE users SET status = CASE WHEN is_active THEN 'active' ELSE 'disabled' END;

-- Phase 2: triggers keep them in sync until all writers migrate
CREATE TRIGGER sync_user_status
    BEFORE UPDATE OF is_active, status ON users
    FOR EACH ROW EXECUTE FUNCTION sync_user_status_fn();
```
The trigger let old code keep writing is_active while new code wrote status. We removed the trigger and dropped is_active only after grep showed zero references in any deployed branch.
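The article doesn't show sync_user_status_fn itself. A hedged sketch of what such a function could look like; the "whichever column changed wins" mapping is an assumption, not the authors' exact logic:

```sql
-- Hypothetical body for sync_user_status_fn: the column the writer touched
-- wins, and the other column is derived from it.
CREATE OR REPLACE FUNCTION sync_user_status_fn() RETURNS trigger AS $$
BEGIN
    IF NEW.status IS DISTINCT FROM OLD.status THEN
        -- New code wrote status; derive the legacy boolean.
        NEW.is_active := (NEW.status = 'active');
    ELSIF NEW.is_active IS DISTINCT FROM OLD.is_active THEN
        -- Old code wrote is_active; derive the new column.
        NEW.status := CASE WHEN NEW.is_active THEN 'active' ELSE 'disabled' END;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```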
## Migration 3: widening a primary key (int → bigint)

A counter table hit 1.9 billion rows; int4 was about to overflow. Changing the type in place with ALTER COLUMN rewrites the whole table, which was totally unacceptable here. The shape instead:

1. Add an id_big BIGINT column.
2. Dual-write and backfill it in batches.
3. Build a unique index CONCURRENTLY on id_big.
4. Swap the columns in one short cutover transaction.

The cutover transaction took 47ms. Total project: 6 weeks of incremental backfill.
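A sketch of what such a cutover transaction can look like. The table and index names are illustrative (the article doesn't name them), and it assumes the backfill is complete, id_big is already NOT NULL, and a unique index on it already exists, so nothing here scans the table:

```sql
-- Hypothetical cutover; all names are assumptions.
BEGIN;
LOCK TABLE counters IN ACCESS EXCLUSIVE MODE;
ALTER TABLE counters DROP CONSTRAINT counters_pkey;
ALTER TABLE counters RENAME COLUMN id TO id_old;
ALTER TABLE counters RENAME COLUMN id_big TO id;
-- Promote the pre-built unique index to the new primary key.
ALTER TABLE counters ADD CONSTRAINT counters_pkey
    PRIMARY KEY USING INDEX counters_id_big_idx;
COMMIT;
```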
## The pre-flight checklist

| Check | Tool | Pass criteria |
|---|---|---|
| Migration runs in < 10s on prod-sized clone | pg_dump + restore in CI | hard fail above 10s |
| Backfill ETA | Run on 1% sample, extrapolate | within 24h or split further |
| Replication lag during backfill | pg_stat_replication | stays < 30s |
| Rollback in a sandbox | Restore from snapshot, replay forward, then back | clean state |
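The "backfill ETA" row is simple linear extrapolation; a sketch with illustrative numbers:

```python
def extrapolate_eta_hours(sample_seconds, sample_fraction):
    # If sample_fraction of the rows took sample_seconds, scale linearly.
    return (sample_seconds / sample_fraction) / 3600

# A 1% sample that takes 400 seconds projects to ~11.1 hours for the full
# run, in line with the 11-hour backfill above.
print(round(extrapolate_eta_hours(400, 0.01), 1))  # prints 11.1
```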
You can skip the dance only if all of the following are true:

- The table takes no production traffic yet.
- The change is metadata-only in Postgres (no table rewrite, no long-held lock).
- Rolling back is a single, instant statement.

If any one of those isn't true, do the full dance. The hour of planning saves the day of incident response.