Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops, and how we fixed them.
Most teams say they have a CI/CD pipeline; fewer can explain what happens when a deploy half-fails on a Friday night.
We simulated a bad deploy by merging a PR that intentionally broke a health check.
Observed:
Fixes:
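```yaml
jobs:
  deploy_prod:
    steps:
      - run: ./scripts/deploy.sh
  rollback_prod:
    # Assumes GitHub Actions semantics: without `needs`, this job has no
    # upstream job whose failure `failure()` could detect.
    needs: deploy_prod
    if: failure()
    steps:
      - run: ./scripts/rollback.sh
```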
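A rollback job conditioned on `failure()` only runs if the deploy job itself fails, so a broken health check has to surface as a failed step. The fragment below is a minimal sketch of one way to do that, assuming GitHub Actions syntax and a runner with a recent `curl`. The step name, the `HEALTH_URL` value, and the retry settings are hypothetical placeholders, not taken from the original pipeline.

```yaml
# Hypothetical sketch: GitHub Actions syntax assumed, not the original pipeline.
jobs:
  deploy_prod:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh
      # Fail the job (and so trigger any job gated on failure()) if the
      # service does not report healthy after the deploy.
      - name: Verify health check
        env:
          HEALTH_URL: https://example.com/healthz  # placeholder endpoint
        run: |
          curl --fail --silent --show-error \
            --retry 5 --retry-delay 10 --retry-all-errors \
            --max-time 5 "$HEALTH_URL"
```

With `--fail`, `curl` exits non-zero on HTTP error responses, and `--retry-all-errors` makes it retry those as well as refused connections. If the endpoint never turns healthy, the step fails after the retries are used up, and the failure marks the job as failed.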
In another exercise, we revoked a service account permission in staging.
Changes: