We run a chaos game day each quarter. The scenarios that surfaced real problems, the ones that didn't, and the operational discipline that makes the practice pay back.
We've been running quarterly chaos game days for about two years. The pitch — deliberately break things in production to find weaknesses before they bite — is well-known. The actual operational reality is less exciting than the marketing and more useful than I expected. This post is what we run, what broke, and the discipline that makes it pay back.
A few hours, in business hours, in production (with safety rails). Five to ten people in a room (or video call). One scenario picked in advance, with an explicit blast-radius limit. Someone introduces a controlled failure; the team responds as if it were a real incident. After the recovery, we debrief: what worked, what didn't, what to change.
It's neither continuous chaos (like Netflix's Simian Army) nor a full DR drill. Just one focused scenario per quarter, picked from a backlog of "things we should test."
The five that mattered most:
Killing the primary RDS instance. Triggered a failover. Took ~12 minutes to fully recover — way longer than the 60s the docs suggest. Investigation: app-side connection pools weren't recycling stale connections quickly enough. Fix: shortened the connection lifetime to 5 minutes; reconnect logic now triggers within 15s of a dead pool member. Next drill: ~90 seconds total recovery.
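The pool-lifetime fix is a couple of settings in most clients. A minimal sketch assuming SQLAlchemy (the library choice, DSN, and exact numbers are illustrative, not necessarily what this team runs):

```python
# Illustrative SQLAlchemy pool settings: recycle connections every
# 5 minutes and validate each one before use, so a failover leaves
# only a short window of stale pool members.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.example.internal/app",  # hypothetical DSN
    pool_size=10,
    max_overflow=5,
    pool_recycle=300,    # drop connections older than 5 minutes
    pool_pre_ping=True,  # ping each connection on checkout; reconnect if dead
    pool_timeout=15,     # fail fast instead of waiting on a dead pool
)
```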
Killing a single AZ's worth of EKS nodes (using AWS Fault Injection Simulator). Multi-AZ deployments handled it fine. But the in-cluster Postgres (Bitnami chart) we'd been using for a non-critical service didn't have HA configured — went down for 20 minutes. Decision: that service moved to RDS at the next sprint. We'd been planning to do it "eventually"; the drill made "eventually" concrete.
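Driving a pre-built FIS experiment takes very little code. A sketch with boto3 (the template ID and tag values are hypothetical):

```python
# Kick off a pre-defined FIS experiment and poll until it finishes.
import time

import boto3

fis = boto3.client("fis")

resp = fis.start_experiment(
    experimentTemplateId="EXT1a2b3c4d5e",  # hypothetical template ID
    tags={"drill": "az-failure-game-day"},
)
experiment_id = resp["experiment"]["id"]

# FIS drives the fault injection; we just watch the terminal state.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    if state in ("completed", "stopped", "failed"):
        break
    time.sleep(10)

print(f"experiment finished with status: {state}")
```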
Saturating outbound network from one pod. A pod ate all the NAT gateway bandwidth, and other services started seeing failed outbound calls. Investigation: our NAT was a single gateway in one AZ (a cost optimization), so everything routing through it lost internet egress during the saturation. Added redundant NAT gateways, accepted the higher cost.
Slowing down database queries by 500ms (using tc to add latency via Chaos Mesh). Most services degraded gracefully. One legacy service had a 30-second timeout on a query that should never take more than 200ms — when the query was slowed to 700ms, its request handlers piled up, eventually crashing the pod. Fix: tighter timeouts, plus a circuit breaker.
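The injection itself is a Chaos Mesh NetworkChaos resource. A sketch of creating one from Python with the kubernetes client (the namespace, labels, and duration are hypothetical):

```python
# Create a Chaos Mesh NetworkChaos experiment that adds 500ms of
# latency (tc netem under the hood) to traffic from the selected pods.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

network_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {"name": "db-latency-drill", "namespace": "chaos-testing"},
    "spec": {
        "action": "delay",
        "mode": "all",
        "selector": {
            "namespaces": ["app"],                    # hypothetical target namespace
            "labelSelectors": {"app": "api-server"},  # hypothetical label
        },
        "delay": {"latency": "500ms", "jitter": "50ms"},
        "duration": "10m",  # the experiment auto-recovers when this elapses
    },
}

api.create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="chaos-testing",
    plural="networkchaos",
    body=network_chaos,
)
```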
Killing the OpenTelemetry collector pods. Discovered we had no fallback: traces stopped flowing for ~3 minutes during recovery. Not customer-impacting, but it blinded debugging. Fix: the collector now runs as a DaemonSet with PodDisruptionBudgets and a small node-local buffer.
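The PDB part of that fix looks roughly like this with the Python Kubernetes client (the names, namespace, and maxUnavailable value are hypothetical):

```python
# Add a PodDisruptionBudget so voluntary disruptions (node drains,
# upgrades) can't evict all collector pods at once.
from kubernetes import client, config

config.load_kube_config()

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(
        name="otel-collector-pdb", namespace="observability"
    ),
    spec=client.V1PodDisruptionBudgetSpec(
        max_unavailable=1,  # evict at most one collector pod at a time
        selector=client.V1LabelSelector(
            match_labels={"app": "otel-collector"}  # hypothetical label
        ),
    ),
)

client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="observability", body=pdb
)
```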
Each of these surfaced something we wouldn't have found in dev or staging. The cost: a few hours per quarter, plus the actual fixes.
Several drills produced no actionable finding. Worth listing because the negative results are informative too:
Killing one app pod. Kubernetes brought up a replacement; nobody noticed. Useful as a baseline confirmation, not a finding.
Slowing one downstream service's responses by 100ms. Within normal jitter. Apps tolerated it. We tried 1 second and got the timeout finding above — 100ms was just too small to expose anything.
Region failover simulated by DNS changes. Worked cleanly. We've done this drill twice; both times the failover went as expected. Still valuable practice, but no new findings.
Killing the Redis primary. Sentinel did its job. ~5 seconds of failed reads, then recovered. Acceptable; left it alone.
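This is the behavior you get when clients are Sentinel-aware: they ask Sentinel for the current primary instead of pinning a fixed address, so a failover is just a brief reconnect. A sketch with redis-py (hosts and service name are hypothetical, and the actual client library in play may differ):

```python
# redis-py's Sentinel support: the client discovers the current
# primary through Sentinel, so a failover just moves the connection.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel-0.redis", 26379), ("sentinel-1.redis", 26379)],  # hypothetical hosts
    socket_timeout=0.5,
)

primary = sentinel.master_for("mymaster", socket_timeout=0.5)
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)

primary.set("drill-key", "ok")   # routed to whichever node is primary right now
print(replica.get("drill-key"))  # reads can go to a replica
```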
The lesson: not every drill produces a fix. Negative results are reassurance that something works.
The kit is small: AWS Fault Injection Simulator for AWS-level faults (instance and node termination), and Chaos Mesh for in-cluster faults (pod kills, tc-based latency injection).
For destructive scenarios we set explicit stop conditions. Example: "kill nodes in AZ-a until cluster reaches 50% capacity, then stop." FIS supports this natively; for Chaos Mesh we use manual controls.
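A sketch of what a capped, alarm-guarded experiment template looks like with boto3 (the role ARN, tag, and alarm are hypothetical):

```python
# FIS experiment template: terminate up to 50% of the matching EC2
# instances, and stop automatically if the CloudWatch alarm fires.
import boto3

fis = boto3.client("fis")

fis.create_experiment_template(
    description="AZ-a node kill, capped at 50% of matching instances",
    roleArn="arn:aws:iam::123456789012:role/fis-drill-role",  # hypothetical role
    targets={
        "az-a-nodes": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"eks:nodegroup-name": "workers-az-a"},  # hypothetical tag
            "selectionMode": "PERCENT(50)",  # the blast-radius cap
        }
    },
    actions={
        "kill-nodes": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "az-a-nodes"},
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            # hypothetical alarm ARN; firing it halts the experiment
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:api-error-rate",
        }
    ],
)
```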
Operational discipline is the thing that took the longest to get right. The patterns that earned their place:
Pre-drill alignment. A doc per drill: what we'll do, when, expected impact, abort conditions, communications plan, who's running which role. Filled out before the drill, signed off by the platform lead. The doc is short (one page); the value is in the conversation it forces.
Comms in advance. Internal Slack post 24h before. We tag the channels of the teams whose services might be affected — they don't have to do anything, just be aware. Reduces panic when they see weird signals.
Blast radius limits. Every drill has a configurable blast-radius cap. "Kill at most 30% of pods." "Latency injection at most 10% of traffic." The cap exists to bound the worst case if the drill goes wrong.
Abort criteria, written in advance. Specific metrics that, if crossed, trigger abort. Example: "if API error rate exceeds 1% for more than 60s, abort." Without explicit criteria, abort becomes a judgment call in the moment.
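Criteria like that are easy to automate. A sketch of a watcher, assuming metrics live in Prometheus and the drill is a Chaos Mesh experiment that can be paused via its pause annotation (the query, addresses, and resource names are all hypothetical):

```python
# Poll the error rate every 10s; if it stays above 1% for 60s,
# pause the chaos experiment via Chaos Mesh's pause annotation.
import time

import requests
from kubernetes import client, config

PROM_URL = "http://prometheus.monitoring:9090"  # hypothetical address
QUERY = (
    'sum(rate(http_requests_total{code=~"5.."}[1m]))'
    " / sum(rate(http_requests_total[1m]))"
)

def error_rate() -> float:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=5
    )
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def abort_drill() -> None:
    config.load_kube_config()
    client.CustomObjectsApi().patch_namespaced_custom_object(
        group="chaos-mesh.org",
        version="v1alpha1",
        namespace="chaos-testing",
        plural="networkchaos",
        name="db-latency-drill",  # hypothetical experiment name
        body={"metadata": {"annotations": {"experiment.chaos-mesh.org/pause": "true"}}},
    )

breached_since = None
while True:
    if error_rate() > 0.01:
        breached_since = breached_since or time.monotonic()
        if time.monotonic() - breached_since > 60:
            abort_drill()
            break
    else:
        breached_since = None  # breach must be sustained, not a blip
    time.sleep(10)
```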
Drill, debrief, follow up. Each drill is logged in a quarterly summary doc with findings and assigned owners. We track follow-up actions to closure — half-finished follow-ups defeat the purpose.
A few patterns we considered and rejected:
Continuous chaos in production. Some orgs run a "chaos monkey" that randomly kills production pods 24/7. We don't. The team would treat the alerts as drill artifacts and miss real issues. Discrete drills with calendar visibility work better for our size.
Chaos in dev/staging only. The whole point is to test production-shaped behavior. Dev/staging is for confidence; prod drills are for finding real issues.
Drills without business-hour visibility. We considered running drills at 3 AM "so nobody is affected." Then realized: nobody can respond at 3 AM either, which defeats the purpose. Drills happen during peak engineer availability, on a Tuesday or Wednesday, between 10am and 2pm local time.
Drills as performance theater. Some teams do showy chaos drills as a recruiting talking point. The actual value comes from the fixes, not the drill itself. We don't tweet about ours.
A few prerequisites we didn't have at first: monitoring you trust, alerts that actually fire, runbooks worth opening, and an on-call rotation that can respond. Skip chaos engineering until you have these. Otherwise it's theater.
One drill per quarter is enough. Not weekly. Time to plan, time to fix the findings.
Start with the safest scenarios. Kill a single pod, observe. Then a single node. Then more elaborate scenarios. Build the team's calibration.
Specific abort criteria in writing. Vague "we'll stop if it gets bad" leads to drift.
Pick scenarios based on what scares you. Not what's easy to instrument. The valuable drill is the one that targets the failure mode you're worried about.
Drill the response, not just the technical detail. The drill surfaced a technical fix; great. It also exposed that on-call didn't know where the runbook was; that's a finding too.
Chaos engineering is one of those practices that sounds bigger than it is. A quarterly few-hour drill with structured planning, real failures, and follow-up discipline does most of the work. The teams I've seen do it badly either run it as theater or skip the follow-ups. Done well, it's just a focused way to find the issues that lurk in production until something else triggers them.