Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.
Most teams say they have a CI/CD pipeline; fewer can explain what happens when a deploy half-fails on a Friday night.
We simulated a bad deploy by merging a PR that intentionally broke a health check.
Observed:
Fixes:
```yaml jobs: deploy_prod: steps: - run: ./scripts/deploy.sh rollback_prod: if: failure() steps: - run: ./scripts/rollback.sh ```
In another exercise, we revoked a service account permission in staging.
Changes:
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
Wikis rot. We moved every operational doc into the repo it describes. Six months in, the docs are mostly correct because the only people who can update them are the ones who change the system.
Explore more articles in this category
Helm gives you a lot of rope. The patterns we used that backfired, the ones we replaced them with, and what to skip if you're starting today.
We run three different job queue systems across our services. The patterns that work across all of them, the differences that matter, and the operational gotchas.
We adopted Backstage for service catalogs and templates. What works, what was over-engineered for our size, and what we'd do differently.