The cache-control header most teams under-use. How stale-while-revalidate and stale-if-error turned our CDN from a freshness liability into a latency and resilience win — with the gotchas.
Edge caching is a tradeoff between latency and freshness, and most teams pick one extreme: either no-cache everything (slow, origin-bound) or cache aggressively and fight stale-content bugs. stale-while-revalidate lets you stop choosing. It's been the single highest-leverage header change we've made to our CDN config.
A normal max-age=60 cache works like this: for 60 seconds the edge serves cached content fast. At second 61, the next user eats a full origin round-trip while the cache refills. That unlucky user pays for everyone else's freshness. Under bursty traffic this shows up as periodic latency spikes exactly at TTL boundaries.
Cache-Control: public, max-age=60, stale-while-revalidate=600
This says: content is fresh for 60s. After that, for up to 600s more, serve the stale copy immediately and revalidate with origin in the background. The user who hits the expired cache gets the old content instantly — no waiting — and the next user gets the refreshed copy.
The latency spike at the TTL boundary disappears. No user ever waits on origin as long as the content is within the stale window. We saw p99 on cached HTML routes drop from ~340ms (TTL-boundary tail) to a flat ~25ms.
The companion header is the one that's saved us during incidents:
Cache-Control: public, max-age=60, stale-while-revalidate=600, stale-if-error=86400
stale-if-error=86400 says: if origin returns a 5xx (or is unreachable) on revalidation, keep serving the stale copy for up to a day rather than propagating the error. During a 20-minute origin outage, our marketing and docs pages stayed up entirely from the edge. Users never saw the incident. This is a CDN-level circuit breaker you get for one header directive.
None of this is safe if your cache key is wrong. The key must include every input that changes the response:
Vary: Accept-Encoding at minimum. If you serve different content by Accept-Language or auth state, that must be in the key — or you'll serve one user's content to another.utm_*, fbclid) shouldn't fragment the cache. Normalize them out of the cache key or you get a near-zero hit rate.private or a key that includes the user. The classic CDN incident is caching a logged-in page and serving it to the world.stale-while-revalidate widens the window where stale content can be served, so for content that must update now (a published price change, a corrected article), TTL isn't enough — you need active purge. We pair long stale windows with event-driven purge:
on publish/update:
→ POST /purge { url: "https://site/article/123" }
The mental model: TTL handles routine freshness; purge handles urgent freshness. Long stale-while-revalidate is safe precisely because purge is the lever for the cases that can't wait.
Good fits:
Bad fits:
private or not at all)# Static-ish content
Cache-Control: public, max-age=60, stale-while-revalidate=600, stale-if-error=86400
# Truly static assets (hashed filenames)
Cache-Control: public, max-age=31536000, immutable
# User-specific
Cache-Control: private, no-cache
The win is having a default that's fast and resilient and recoverable, instead of picking two. Stale-while-revalidate removes the TTL-boundary latency tax; stale-if-error gives you outage survival; purge keeps you honest on freshness. Get the cache key right first — everything above is only safe once the key reflects exactly what varies the response.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
Parsing model output with a regex and a prayer doesn't survive contact with traffic. The validation layers that keep structured LLM output reliable — constrained decoding, schema validation, and the repair loop.
Cause-based alerts page you for things that don't matter and miss things that do. How we rebuilt alerting around SLO burn rates — multi-window, multi-burn-rate — and cut pages while catching more real pain.
Explore more articles in this category
Least privilege fails when it's a one-time audit that locks things down until something breaks, then gets reverted. The iterative, log-driven approach that tightens permissions safely — and the policies we stopped writing by hand.
The architectural choice is presented as binary; the practical answer is "depends on the workload." The patterns that earn their place and the failure modes we've hit.
Three discounting mechanisms, three different commitments. The rules of thumb we use to pick, and the mistakes we made before settling on them.
Evergreen posts worth revisiting.