The cache-control header most teams under-use. How stale-while-revalidate and stale-if-error turned our CDN from a freshness liability into a latency and resilience win — with the gotchas.

Edge Caching with Stale-While-Revalidate — Fast and Fresh at the CDN

Edge caching is a tradeoff between latency and freshness, and most teams pick one extreme: either no-cache everything (slow, origin-bound) or cache aggressively and fight stale-content bugs. stale-while-revalidate lets you stop choosing. It's been the single highest-leverage header change we've made to our CDN config.

The problem with binary caching

A normal max-age=60 cache works like this: for 60 seconds the edge serves cached content fast. At second 61, the next user eats a full origin round-trip while the cache refills. That unlucky user pays for everyone else's freshness. Under bursty traffic this shows up as periodic latency spikes exactly at TTL boundaries.

stale-while-revalidate: serve stale, refresh in background

code

Cache-Control: public, max-age=60, stale-while-revalidate=600

This says: content is fresh for 60s. After that, for up to 600s more, the edge serves the stale copy immediately and revalidates with origin in the background. The user who hits the expired entry gets the old content instantly, with no wait, and requests that arrive after the revalidation completes get the refreshed copy.

The latency spike at the TTL boundary disappears. As long as a cached copy exists and is within the stale window, no user waits on origin; only a cold cache or an entry older than max-age plus the stale window triggers a blocking fetch. We saw p99 on cached HTML routes drop from ~340ms (TTL-boundary tail) to a flat ~25ms.
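To make the difference concrete, here is a minimal simulation sketch of which requests block on origin under plain max-age versus max-age plus stale-while-revalidate. It assumes a single edge node, a background refresh that completes instantly, and no request coalescing; the `Policy` class and `blocking_requests` function are hypothetical names for illustration, not any CDN's API.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    max_age: int                      # seconds the response counts as fresh
    stale_while_revalidate: int = 0   # extra seconds a stale copy may be served


def blocking_requests(request_times, policy):
    """Return the request times that had to wait on origin.

    Simplified model (assumption): one edge node, background refresh
    completes instantly, no request coalescing.
    """
    blocked = []
    stored_at = None  # when the cached copy was last fetched from origin
    for t in request_times:
        if stored_at is None:
            blocked.append(t)      # cold cache: synchronous origin fetch
            stored_at = t
            continue
        age = t - stored_at
        if age <= policy.max_age:
            continue               # fresh hit, origin not involved
        if age <= policy.max_age + policy.stale_while_revalidate:
            stored_at = t          # stale hit; refresh runs in the background
            continue
        blocked.append(t)          # past the stale window: synchronous fetch
        stored_at = t
    return blocked


times = [0, 30, 61, 90, 125, 900]
plain = blocking_requests(times, Policy(max_age=60))
swr = blocking_requests(times, Policy(max_age=60, stale_while_revalidate=600))
print(plain)  # [0, 61, 125, 900]
print(swr)    # [0, 900]
```

In this model the plain policy blocks at every TTL boundary, while the stale-while-revalidate policy blocks only on the cold start and on the request that arrives after the 660s combined window has passed.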

stale-if-error: free resilience

The companion header is the one that's saved us during incidents:

code

Cache-Control: public, max-age=60, stale-while-revalidate=600, stale-if-error=86400

stale-if-error=86400 says: if origin returns a 5xx (or is unreachable) on revalidation, keep serving the stale copy for up to a day after it goes stale rather than propagating the error. During a 20-minute origin outage, our marketing and docs pages stayed up entirely from the edge. Users never saw the incident. This is a CDN-level circuit breaker you get for one header directive. Support for the directive varies between CDNs, so confirm that yours honors it before relying on it.
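The decision a shared cache makes with both directives set can be sketched roughly as below. This is a simplified model, not a description of any particular CDN: `edge_decision` and its return labels are hypothetical, and the error set (500, 502, 503, 504) follows the definition in RFC 5861.

```python
ORIGIN_ERRORS = {500, 502, 503, 504}  # statuses RFC 5861 treats as errors


def edge_decision(age, origin_status, max_age=60, swr=600, sie=86400):
    """Decide how to answer a request for a cached entry of the given age.

    origin_status is the status a revalidation would receive, or None
    if the origin is unreachable. Both stale windows are counted from
    the moment the entry stops being fresh.
    """
    if age <= max_age:
        return "fresh"
    if age <= max_age + swr:
        # Served immediately; a failed background refresh leaves the copy in place.
        return "stale-revalidate"
    failed = origin_status is None or origin_status in ORIGIN_ERRORS
    if failed and age <= max_age + sie:
        return "stale-error"
    if failed:
        return "error"
    return "origin"


print(edge_decision(30, 200))     # fresh
print(edge_decision(300, 503))    # stale-revalidate
print(edge_decision(3600, None))  # stale-error
print(edge_decision(90000, 503))  # error
```

The last call shows the limit of the safety net: once the entry is older than max-age plus the stale-if-error window, the origin's failure reaches the user.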

The cache key determines correctness

None of this is safe if your cache key is wrong. The key must include every input that changes the response:

Vary on what matters: Vary: Accept-Encoding at minimum. If you serve different content by Accept-Language or auth state, that must be in the key — or you'll serve one user's content to another.
Strip what doesn't: marketing query params (utm_*, fbclid) shouldn't fragment the cache. Normalize them out of the cache key or you get a near-zero hit rate.
Never cache authenticated responses on a shared key. Cookie-bearing responses need either private or a key that includes the user. The classic CDN incident is caching a logged-in page and serving it to the world.
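As an illustration of the first two rules, here is a small sketch of cache-key normalization: it drops tracking parameters, sorts what remains, and optionally adds a language component. The function name, the `|lang=` suffix format, and the `site.example` host are assumptions for this example; real CDNs expose key normalization through their own configuration rather than code like this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

TRACKING_PREFIXES = ("utm_",)   # marketing parameters matched by prefix
TRACKING_PARAMS = {"fbclid"}    # marketing parameters matched by exact name


def cache_key(url, language=None):
    """Build a normalized cache key for a URL.

    language is an already-normalized language tag (assumption), included
    only when the response actually varies by language.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_PARAMS and not name.startswith(TRACKING_PREFIXES)
    ]
    kept.sort()  # parameter order should not fragment the cache
    key = f"{parts.scheme}://{parts.netloc.lower()}{parts.path}"
    if kept:
        key += "?" + urlencode(kept)
    if language:
        key += "|lang=" + language.lower()
    return key


print(cache_key("https://Site.example/article/123?utm_source=x&fbclid=abc&page=2"))
# https://site.example/article/123?page=2
```

Two URLs that differ only in tracking parameters or parameter order collapse to the same key, while a language-varying page gets one key per language.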

Purge is your freshness escape hatch

stale-while-revalidate widens the window where stale content can be served, so for content that must update now (a published price change, a corrected article), TTL isn't enough — you need active purge. We pair long stale windows with event-driven purge:

code

on publish/update:
  → POST /purge { url: "https://site/article/123" }

The mental model: TTL handles routine freshness; purge handles urgent freshness. Long stale-while-revalidate is safe precisely because purge is the lever for the cases that can't wait.
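A publish hook along these lines could issue the purge. This is a sketch under stated assumptions: the endpoint URL, the bearer-token authentication, and the function names are hypothetical, and only the JSON body shape is taken from the pseudocode above. Each CDN has its own purge API, so check yours for the real endpoint and payload.

```python
import json
import urllib.request

PURGE_ENDPOINT = "https://cdn.example/purge"  # hypothetical endpoint


def build_purge_request(url, token):
    """Build (but do not send) a purge request for a single URL."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        PURGE_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


def on_publish(url, token, send=urllib.request.urlopen):
    """Purge the edge copy of url after a publish or update event.

    send is injectable so the hook can be exercised without a network call.
    """
    request = build_purge_request(url, token)
    with send(request, timeout=5) as response:
        return response.status
```

Calling `on_publish("https://site/article/123", token)` from the CMS publish handler would then evict the article as soon as it changes, independent of the TTL and stale windows.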

What to put behind it — and what not to

Good fits:

Marketing pages, docs, blog content
Product listings that tolerate seconds of staleness
API responses for slow-changing reference data

Bad fits:

Per-user dashboards (cache private or not at all)
Anything where stale data is a correctness or compliance problem (account balances, inventory at checkout)
Responses that set cookies (caching them leaks session state)

The config we standardized on

code

# Static-ish content
Cache-Control: public, max-age=60, stale-while-revalidate=600, stale-if-error=86400

# Truly static assets (hashed filenames)
Cache-Control: public, max-age=31536000, immutable

# User-specific
Cache-Control: private, no-cache
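One way to apply these three policies consistently is a single lookup in the origin application, sketched below. The class names and the `cache_control_for` helper are hypothetical; the header values are the ones listed above. The sketch falls back to the user-specific policy for unknown classes and for any response that sets a cookie, on the assumption that failing closed is the safer default.

```python
CACHE_POLICIES = {
    "static-ish": "public, max-age=60, stale-while-revalidate=600, stale-if-error=86400",
    "hashed-asset": "public, max-age=31536000, immutable",
    "user-specific": "private, no-cache",
}


def cache_control_for(content_class, sets_cookie=False):
    """Return the Cache-Control value for a response.

    Responses that set cookies, and unknown content classes, get the
    user-specific policy so they are never stored in a shared cache.
    """
    if sets_cookie:
        return CACHE_POLICIES["user-specific"]
    return CACHE_POLICIES.get(content_class, CACHE_POLICIES["user-specific"])


print(cache_control_for("static-ish"))
# public, max-age=60, stale-while-revalidate=600, stale-if-error=86400
print(cache_control_for("static-ish", sets_cookie=True))
# private, no-cache
```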

The win is having a default that's fast and resilient and recoverable, instead of picking two. Stale-while-revalidate removes the TTL-boundary latency tax; stale-if-error gives you outage survival; purge keeps you honest on freshness. Get the cache key right first — everything above is only safe once the key reflects exactly what varies the response.

About Kiril Urbonas
