Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.
Most teams say they have a CI/CD pipeline; fewer can explain what happens when a deploy half-fails on a Friday night.
We simulated a bad deploy by merging a PR that intentionally broke a health check.
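As a minimal sketch of that gate, assuming the deploy job polls an HTTP health endpoint after rollout (the host, the `/healthz` path, and the retry budget here are placeholders, not our exact setup):

```yaml
# Sketch only: substitute your real host, path, and timing budget.
- name: Post-deploy health gate
  run: |
    for i in $(seq 1 10); do
      # -f makes curl exit non-zero on HTTP errors, e.g. the broken check's 500
      if curl -fsS "https://staging.example.com/healthz" > /dev/null; then
        exit 0
      fi
      sleep 6
    done
    echo "Health check never went green; failing the deploy job."
    exit 1
```

Breaking the handler behind that endpoint in a PR is enough to turn this step red and exercise everything downstream of it.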
Observed:
Fixes:
```yaml
jobs:
  deploy_prod:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh

  rollback_prod:
    runs-on: ubuntu-latest
    # Runs only when deploy_prod fails; skipped on a clean deploy.
    needs: deploy_prod
    if: failure()
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/rollback.sh
```
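The `needs: deploy_prod` plus `if: failure()` pair is what makes the rollback automatic: `rollback_prod` only fires when the deploy job fails, so nobody has to dig out a runbook on a Friday night.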
In another exercise, we revoked a service account permission in staging.
Changes:
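One change in this spirit, sketched for a GitHub Actions job deploying with an AWS service account (the CLI calls and the bucket name are placeholders, not our actual setup): a preflight step that fails fast with a readable error instead of a cryptic 403 halfway through deploy.sh.

```yaml
# Sketch only: check whatever identities and resources your deploy actually touches.
- name: Preflight permission check
  run: |
    # Confirm the service account credentials resolve at all.
    aws sts get-caller-identity
    # Touch the resources the deploy script will need, before anything is mutated.
    aws s3 ls "s3://my-release-artifacts" > /dev/null
```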