Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.
Most teams say they have a CI/CD pipeline; fewer can explain what happens when a deploy half-fails on a Friday night.
We simulated a bad deploy by merging a PR that intentionally broke a health check.
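As a minimal sketch of that gate, assuming the deploy job polls an HTTP health endpoint after rollout (the host, the `/healthz` path, and the retry budget here are placeholders, not our exact setup):

```yaml
# Sketch only: substitute your real host, path, and timing budget.
- name: Post-deploy health gate
  run: |
    for i in $(seq 1 10); do
      # -f makes curl exit non-zero on HTTP errors, e.g. the broken check's 500
      if curl -fsS "https://staging.example.com/healthz" > /dev/null; then
        exit 0
      fi
      sleep 6
    done
    echo "Health check never went green; failing the deploy job."
    exit 1
```

Breaking the handler behind that endpoint in a PR is enough to turn this step red and exercise everything downstream of it.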
Observed:
Fixes:
```yaml
jobs:
  deploy_prod:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh

  rollback_prod:
    runs-on: ubuntu-latest
    # Runs only when deploy_prod fails; skipped on a clean deploy.
    needs: deploy_prod
    if: failure()
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/rollback.sh
```
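The `needs: deploy_prod` plus `if: failure()` pair is what makes the rollback automatic: `rollback_prod` only fires when the deploy job fails, so nobody has to dig out a runbook on a Friday night.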
In another exercise, we revoked a service account permission in staging.
Changes:
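One change in this spirit, sketched for a GitHub Actions job deploying with an AWS service account (the CLI calls and the bucket name are placeholders, not our actual setup): a preflight step that fails fast with a readable error instead of a cryptic 403 halfway through deploy.sh.

```yaml
# Sketch only: check whatever identities and resources your deploy actually touches.
- name: Preflight permission check
  run: |
    # Confirm the service account credentials resolve at all.
    aws sts get-caller-identity
    # Touch the resources the deploy script will need, before anything is mutated.
    aws s3 ls "s3://my-release-artifacts" > /dev/null
```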