We run a chaos game day each quarter. The scenarios that surfaced real problems, the ones that didn't, and the operational discipline that makes the practice pay back.
We've been running quarterly chaos game days for about two years. The pitch — deliberately break things in production to find weaknesses before they bite — is well-known. The actual operational reality is less exciting than the marketing and more useful than I expected. This post is what we run, what broke, and the discipline that makes it pay back.
A few hours, in business hours, in production (with safety rails). Five to ten people in a room (or video call). One scenario picked in advance, with an explicit blast-radius limit. Someone introduces a controlled failure; the team responds as if it were a real incident. After the recovery, we debrief: what worked, what didn't, what to change.
It's neither continuous chaos (like Netflix's Simian Army) nor a full DR drill. Just one focused scenario per quarter, picked from a backlog of "things we should test."
The five that mattered most:
Killing the primary RDS instance. Triggered a failover. Took ~12 minutes to fully recover — way longer than the 60s the docs suggest. Investigation: app-side connection pools weren't recycling stale connections quickly enough. Fix: shortened the connection lifetime to 5 minutes; reconnect logic now triggers within 15s of a dead pool member. Next drill: ~90 seconds total recovery.
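The pool-lifetime fix is a couple of settings in most clients. A minimal sketch assuming SQLAlchemy (the library choice, DSN, and exact numbers are illustrative, not necessarily what this team runs):

```python
# Illustrative SQLAlchemy pool settings: recycle connections every
# 5 minutes and validate each one before use, so a failover leaves
# only a short window of stale pool members.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.example.internal/app",  # hypothetical DSN
    pool_size=10,
    max_overflow=5,
    pool_recycle=300,    # drop connections older than 5 minutes
    pool_pre_ping=True,  # ping each connection on checkout; reconnect if dead
    pool_timeout=15,     # fail fast instead of waiting on a dead pool
)
```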
Killing a single AZ's worth of EKS nodes (using AWS Fault Injection Simulator). Multi-AZ deployments handled it fine. But the in-cluster Postgres (Bitnami chart) we'd been using for a non-critical service didn't have HA configured — went down for 20 minutes. Decision: that service moved to RDS at the next sprint. We'd been planning to do it "eventually"; the drill made "eventually" concrete.
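Driving a pre-built FIS experiment takes very little code. A sketch with boto3 (the template ID and tag values are hypothetical):

```python
# Kick off a pre-defined FIS experiment and poll until it finishes.
import time

import boto3

fis = boto3.client("fis")

resp = fis.start_experiment(
    experimentTemplateId="EXT1a2b3c4d5e",  # hypothetical template ID
    tags={"drill": "az-failure-game-day"},
)
experiment_id = resp["experiment"]["id"]

# FIS drives the fault injection; we just watch the terminal state.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    if state in ("completed", "stopped", "failed"):
        break
    time.sleep(10)

print(f"experiment finished with status: {state}")
```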
Saturating outbound network from one pod. A pod ate all the NAT gateway bandwidth, and other services started seeing failed outbound calls. Investigation: our NAT was a single gateway in one AZ (a cost optimization), so everything routing through it lost internet egress during the saturation. Added redundant NAT gateways, accepted the higher cost.
Slowing down database queries by 500ms (using tc to add latency via Chaos Mesh). Most services degraded gracefully. One legacy service had a 30-second timeout on a query that should never take more than 200ms — when the query was slowed to 700ms, its request handlers piled up, eventually crashing the pod. Fix: tighter timeouts, plus a circuit breaker.
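The injection itself is a Chaos Mesh NetworkChaos resource. A sketch of creating one from Python with the kubernetes client (the namespace, labels, and duration are hypothetical):

```python
# Create a Chaos Mesh NetworkChaos experiment that adds 500ms of
# latency (tc netem under the hood) to traffic from the selected pods.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

network_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {"name": "db-latency-drill", "namespace": "chaos-testing"},
    "spec": {
        "action": "delay",
        "mode": "all",
        "selector": {
            "namespaces": ["app"],                    # hypothetical target namespace
            "labelSelectors": {"app": "api-server"},  # hypothetical label
        },
        "delay": {"latency": "500ms", "jitter": "50ms"},
        "duration": "10m",  # the experiment auto-recovers when this elapses
    },
}

api.create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="chaos-testing",
    plural="networkchaos",
    body=network_chaos,
)
```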
Killing the OpenTelemetry collector pods. Discovered we had no fallback: traces stopped flowing for ~3 minutes during recovery. Not customer-impacting, but it blinded debugging. Fix: the collector now runs as a DaemonSet with PodDisruptionBudgets and a small node-local buffer.
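The PDB part of that fix looks roughly like this with the Python Kubernetes client (the names, namespace, and maxUnavailable value are hypothetical):

```python
# Add a PodDisruptionBudget so voluntary disruptions (node drains,
# upgrades) can't evict all collector pods at once.
from kubernetes import client, config

config.load_kube_config()

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(
        name="otel-collector-pdb", namespace="observability"
    ),
    spec=client.V1PodDisruptionBudgetSpec(
        max_unavailable=1,  # evict at most one collector pod at a time
        selector=client.V1LabelSelector(
            match_labels={"app": "otel-collector"}  # hypothetical label
        ),
    ),
)

client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="observability", body=pdb
)
```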
Each of these surfaced something we wouldn't have found in dev or staging. The cost: a few hours per quarter, plus the actual fixes.
Several drills produced no actionable finding. Worth listing because the negative results are informative too:
Killing one app pod. Kubernetes brought up a replacement; nobody noticed. Useful as a baseline confirmation, not a finding.
Slowing one downstream service's responses by 100ms. Within normal jitter. Apps tolerated it. We tried 1 second and got the timeout finding above — 100ms was just too small to expose anything.
Region failover simulated by DNS changes. Worked cleanly. We've done this drill twice; both times the failover went as expected. Still valuable practice, but no new findings.
Killing the Redis primary. Sentinel did its job. ~5 seconds of failed reads, then recovered. Acceptable; left it alone.
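This is the behavior you get when clients are Sentinel-aware: they ask Sentinel for the current primary instead of pinning a fixed address, so a failover is just a brief reconnect. A sketch with redis-py (hosts and service name are hypothetical, and the actual client library in play may differ):

```python
# redis-py's Sentinel support: the client discovers the current
# primary through Sentinel, so a failover just moves the connection.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel-0.redis", 26379), ("sentinel-1.redis", 26379)],  # hypothetical hosts
    socket_timeout=0.5,
)

primary = sentinel.master_for("mymaster", socket_timeout=0.5)
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)

primary.set("drill-key", "ok")   # routed to whichever node is primary right now
print(replica.get("drill-key"))  # reads can go to a replica
```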
The lesson: not every drill produces a fix. Negative results are reassurance that something works.
The kit is small: AWS Fault Injection Simulator for AWS-level faults (instance and node termination), and Chaos Mesh for in-cluster faults (pod kills, tc-based latency injection).
For destructive scenarios we set explicit stop conditions. Example: "kill nodes in AZ-a until cluster reaches 50% capacity, then stop." FIS supports this natively; for Chaos Mesh we use manual controls.
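A sketch of what a capped, alarm-guarded experiment template looks like with boto3 (the role ARN, tag, and alarm are hypothetical):

```python
# FIS experiment template: terminate up to 50% of the matching EC2
# instances, and stop automatically if the CloudWatch alarm fires.
import boto3

fis = boto3.client("fis")

fis.create_experiment_template(
    description="AZ-a node kill, capped at 50% of matching instances",
    roleArn="arn:aws:iam::123456789012:role/fis-drill-role",  # hypothetical role
    targets={
        "az-a-nodes": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"eks:nodegroup-name": "workers-az-a"},  # hypothetical tag
            "selectionMode": "PERCENT(50)",  # the blast-radius cap
        }
    },
    actions={
        "kill-nodes": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "az-a-nodes"},
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            # hypothetical alarm ARN; firing it halts the experiment
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:api-error-rate",
        }
    ],
)
```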
Operational discipline is the thing that took the longest to get right. The patterns that earned their place:
Pre-drill alignment. A doc per drill: what we'll do, when, expected impact, abort conditions, communications plan, who's running which role. Filled out before the drill, signed off by the platform lead. The doc is short (one page); the value is in the conversation it forces.
Comms in advance. Internal Slack post 24h before. We tag the channels of the teams whose services might be affected — they don't have to do anything, just be aware. Reduces panic when they see weird signals.
Blast radius limits. Every drill has a configurable blast-radius cap. "Kill at most 30% of pods." "Latency injection at most 10% of traffic." The cap exists to bound the worst case if the drill goes wrong.
Abort criteria, written in advance. Specific metrics that, if crossed, trigger abort. Example: "if API error rate exceeds 1% for more than 60s, abort." Without explicit criteria, abort becomes a judgment call in the moment.
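Criteria like that are easy to automate. A sketch of a watcher, assuming metrics live in Prometheus and the drill is a Chaos Mesh experiment that can be paused via its pause annotation (the query, addresses, and resource names are all hypothetical):

```python
# Poll the error rate every 10s; if it stays above 1% for 60s,
# pause the chaos experiment via Chaos Mesh's pause annotation.
import time

import requests
from kubernetes import client, config

PROM_URL = "http://prometheus.monitoring:9090"  # hypothetical address
QUERY = (
    'sum(rate(http_requests_total{code=~"5.."}[1m]))'
    " / sum(rate(http_requests_total[1m]))"
)

def error_rate() -> float:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=5
    )
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def abort_drill() -> None:
    config.load_kube_config()
    client.CustomObjectsApi().patch_namespaced_custom_object(
        group="chaos-mesh.org",
        version="v1alpha1",
        namespace="chaos-testing",
        plural="networkchaos",
        name="db-latency-drill",  # hypothetical experiment name
        body={"metadata": {"annotations": {"experiment.chaos-mesh.org/pause": "true"}}},
    )

breached_since = None
while True:
    if error_rate() > 0.01:
        breached_since = breached_since or time.monotonic()
        if time.monotonic() - breached_since > 60:
            abort_drill()
            break
    else:
        breached_since = None  # breach must be sustained, not a blip
    time.sleep(10)
```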
Drill, debrief, follow up. Each drill is logged in a quarterly summary doc with findings and assigned owners. We track follow-up actions to closure — half-finished follow-ups defeat the purpose.
A few patterns we considered and rejected:
Continuous chaos in production. Some orgs run a "chaos monkey" that randomly kills production pods 24/7. We don't. The team would treat the alerts as drill artifacts and miss real issues. Discrete drills with calendar visibility work better for our size.
Chaos in dev/staging only. The whole point is to test production-shaped behavior. Dev/staging is for confidence; prod drills are for finding real issues.
Drills without business-hour visibility. We considered running drills at 3 AM "so nobody is affected." Then realized: nobody can respond at 3 AM either, which defeats the purpose. Drills happen during peak engineer availability, on a Tuesday or Wednesday, between 10am and 2pm local time.
Drills as performance theater. Some teams do showy chaos drills as a recruiting talking point. The actual value comes from the fixes, not the drill itself. We don't tweet about ours.
A few prerequisites we didn't have at first: monitoring you trust, alerts that actually fire, runbooks worth opening, and an on-call rotation that can respond. Skip chaos engineering until you have these. Otherwise it's theater.
One drill per quarter is enough. Not weekly. Time to plan, time to fix the findings.
Start with the safest scenarios. Kill a single pod, observe. Then a single node. Then more elaborate scenarios. Build the team's calibration.
Specific abort criteria in writing. Vague "we'll stop if it gets bad" leads to drift.
Pick scenarios based on what scares you. Not what's easy to instrument. The valuable drill is the one that targets the failure mode you're worried about.
Drill the response, not just the technical detail. The drill surfaced a technical fix; great. It also exposed that on-call didn't know where the runbook was; that's a finding too.
Chaos engineering is one of those practices that sounds bigger than it is. A quarterly few-hour drill with structured planning, real failures, and follow-up discipline does most of the work. The teams I've seen do it badly either run it as theater or skip the follow-ups. Done well, it's just a focused way to find the issues that lurk in production until something else triggers them.