Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.
Most teams say they have a CI/CD pipeline; fewer can explain what happens when a deploy half-fails on a Friday night.
We simulated a bad deploy by merging a PR that intentionally broke a health check.
Observed:
Fixes:
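```yaml
jobs:
  deploy_prod:
    steps:
      - run: ./scripts/deploy.sh

  # Runs only when an upstream job has failed.
  rollback_prod:
    # Assumed addition (GitHub Actions-style syntax): without `needs`,
    # failure() has no upstream job result to react to.
    needs: deploy_prod
    if: failure()
    steps:
      - run: ./scripts/rollback.sh
```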
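For the rollback job to fire, the deploy step has to exit non-zero when the new release is unhealthy. The following is a minimal sketch of one way a deploy script could gate on a health check. It is not taken from our actual `deploy.sh`. The function names (`http_probe`, `wait_until_healthy`), the thresholds, and the example URL are all hypothetical.

```python
import sys
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and 4xx/5xx responses all count as unhealthy.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 10,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it passes `required_successes` times in a row.

    Returns False if that never happens within `max_attempts` probes.
    """
    streak = 0
    for attempt in range(max_attempts):
        if probe():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a single failure resets the streak
        if attempt < max_attempts - 1:
            sleep(delay_s)  # no need to wait after the final attempt
    return False


# Hypothetical usage: python healthgate.py https://staging.example.com/healthz
if __name__ == "__main__" and len(sys.argv) == 2:
    healthy = wait_until_healthy(lambda: http_probe(sys.argv[1]))
    sys.exit(0 if healthy else 1)  # non-zero exit marks the deploy job as failed
```

Requiring several consecutive successes, rather than a single 200, is a deliberate choice in this sketch: a service that flaps between healthy and unhealthy should still fail the gate and trigger the rollback path.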
In another exercise, we revoked a service account permission in staging.
Changes: