Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

Most teams say they have a CI/CD pipeline; fewer can explain what happens when a deploy half-fails on a Friday night.

Game Day Scenario: Rollback That Never Rolls Back

We simulated a bad deploy by merging a PR that intentionally broke a health check.

Observed:

The canary failed and alerts fired, but the pipeline halted and waited for manual intervention instead of rolling back.
On-call engineers had different mental models of how rollback should work.

Fixes:

We made rollback a first-class job in the pipeline:

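```yaml
jobs:
  deploy_prod:
    steps:
      - run: ./scripts/deploy.sh
  rollback_prod:
    # Assuming GitHub Actions syntax: failure() only considers jobs listed
    # in needs, so without this line the rollback job would never run.
    needs: deploy_prod
    if: failure()
    steps:
      - run: ./scripts/rollback.sh
```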

We documented one canonical rollback path per service.
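The same idea can be expressed outside of pipeline YAML. The following is a minimal sketch, not our actual scripts: `deploy_with_rollback`, `DeployResult`, and the three callables are hypothetical names chosen for illustration. It shows the control flow we wanted, where a failed deploy or failed health check triggers rollback by default rather than waiting on a human.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DeployResult:
    succeeded: bool
    rolled_back: bool
    events: List[str]


def deploy_with_rollback(
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    rollback: Callable[[], None],
) -> DeployResult:
    """Run deploy, then roll back automatically if deploy or the health check fails."""
    events: List[str] = []
    try:
        deploy()
        events.append("deploy: ok")
    except Exception as exc:
        # A deploy that raises is treated the same as an unhealthy release.
        events.append(f"deploy: error ({exc})")
        healthy = False
    else:
        healthy = health_check()
        events.append(f"health_check: {'ok' if healthy else 'failed'}")

    if healthy:
        return DeployResult(succeeded=True, rolled_back=False, events=events)

    # Rollback is the default response. If rollback itself raises, the
    # exception propagates so the failure stays loud.
    rollback()
    events.append("rollback: ok")
    return DeployResult(succeeded=False, rolled_back=True, events=events)
```

In practice the three callables would wrap whatever the service's single documented rollback path is, for example thin wrappers around `./scripts/deploy.sh` and `./scripts/rollback.sh`. Passing them in as arguments keeps the control flow testable during a game day without touching real infrastructure.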

Game Day Scenario: Missing Permissions

In another exercise, we revoked a service account permission in staging.

The deploy failed halfway through, leaving stale pods.
Logs showed a generic “Forbidden” error; the pipeline reported only “step failed”.

Changes:

We added structured logging around each infra call.
We taught the pipeline to surface the exact failing command and principal.
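As a rough illustration of those two changes, here is a minimal sketch of a wrapper that runs one infra command and emits a single structured record naming the command and the principal. The function name `run_infra_call`, the record fields, and the principal string in the usage note are assumptions made for this example, not our production code.

```python
import json
import subprocess
import sys
from typing import List


def run_infra_call(cmd: List[str], principal: str) -> dict:
    """Run one infra command and emit a single structured log record for it."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    record = {
        "event": "infra_call",
        "command": cmd,            # the exact command that ran
        "principal": principal,    # the identity the call was made as
        "exit_code": proc.returncode,
        "stderr": proc.stderr.strip(),
        "ok": proc.returncode == 0,
    }
    # One JSON object per line on stderr, so the pipeline can parse it and
    # surface the failing command and principal instead of "step failed".
    print(json.dumps(record), file=sys.stderr)
    return record
```

A deploy step could then call something like `run_infra_call(["kubectl", "apply", "-f", "deploy.yaml"], "svc-deployer@staging")` (a hypothetical command and principal) and fail the job when `record["ok"]` is false. A generic "Forbidden" then arrives together with the command and identity that produced it.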

Takeaways

Run game days regularly; don’t wait for production to teach you.
Practice rollback as much as you practice deploy.
Make pipeline failures boring and obvious, not puzzles.

About Kiril Urbonas
