Three caching patterns, three failure modes. The one we use most, the one that bit us, and the rule that decides which pattern fits which workload.
Most teams use caching wrong because the differences between caching patterns aren't usually explained in a way that maps to actual decisions. The textbook descriptions say "read-through caches read from the cache; if missing, fetch from DB and populate" — accurate but useless. What matters is the failure modes: what happens when the cache is stale, when the cache is down, when two writes race, when the cache fills up.
After running each of the three main patterns in production, this is the working version.
Cache-aside (lazy loading). The application reads from the cache directly. On miss, it reads from the DB, writes the result to the cache, returns it. On write, the application writes to the DB and either invalidates the cache or writes a fresh value.
Read-through. The application reads from the cache via an abstraction; the cache layer is responsible for fetching from the DB on miss. The application doesn't know about misses.
Write-through. Writes go to the cache first; the cache propagates the write to the DB. How closely the cache and DB stay in sync depends on the cache library: it either writes to both synchronously, or writes to the cache and updates the DB asynchronously (which is effectively write-behind).
There's also write-behind (writes go to cache; DB writes are deferred and batched) and write-around (writes skip the cache, go straight to DB; cache populates only on read miss). I'll mention these briefly.
Cache-aside is what most teams should use, and what we use for ~80% of our caching. The flow:
def get_user(user_id):
    cache_key = f"user:{user_id}"
    cached = cache.get(cache_key)
    if cached is not None:
        return cached
    user = db.query(User).filter_by(id=user_id).one_or_none()
    if user is not None:
        cache.set(cache_key, user, ttl=300)  # 5 min TTL
    return user

def update_user(user_id, data):
    db.update(User, user_id, data)
    cache.delete(f"user:{user_id}")  # invalidate
Advantages:
Disadvantages:
The thundering herd is the failure mode that bites teams. The fix is request coalescing (single-flight): when one request misses, others wait for it instead of all hitting the DB.
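A minimal sketch of single-flight for a threaded Python service, assuming an in-process coordinator is enough (it does not coalesce across processes or hosts). The `SingleFlight` class and its `do` method are hypothetical names for illustration, not an API from any particular cache library.

```python
import threading


class _Call:
    """Shared state for one in-flight load of a key."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Call

    def do(self, key, fn):
        # Decide under the lock whether this caller leads or follows.
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call

        if leader:
            try:
                call.result = fn()  # the only call that reaches the DB
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()  # wake every follower
        else:
            call.done.wait()  # follower: wait for the leader's result

        if call.error is not None:
            raise call.error
        return call.result
```

In the cache-aside read path, the miss branch would become something like `flights.do(cache_key, load_and_populate)`, where `load_and_populate` queries the DB and writes the cache. Followers share the leader's exception as well as its result, so a failing DB query is not retried once per waiting request.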
Read-through wraps the cache + DB behind a single read interface. The application code looks like:
def get_user(user_id):
    return read_through_cache.get(f"user:{user_id}")  # may fetch from DB internally
The cache library handles miss → fetch → populate.
Advantages:
Disadvantages:
We use read-through in one specific service where the cache and DB are managed by the same library (Hibernate's second-level cache). For most services it's overkill.
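To make the "cache owns the miss path" idea concrete, here is a rough in-memory sketch of what a read-through layer does internally. `ReadThroughCache`, its `loader` callback, and the `clock` parameter are assumptions for illustration; this is not how Hibernate's second-level cache is implemented.

```python
import time


class ReadThroughCache:
    """In-memory read-through cache: callers only ever call get()."""

    def __init__(self, loader, ttl=300, clock=time.monotonic):
        self._loader = loader  # hypothetical callback: key -> value or None
        self._ttl = ttl
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit
        # Miss or expired: the cache layer, not the caller, fetches.
        value = self._loader(key)
        if value is not None:
            self._entries[key] = (value, now + self._ttl)
        else:
            self._entries.pop(key, None)  # do not cache missing rows
        return value

    def invalidate(self, key):
        self._entries.pop(key, None)
```

The application would construct it once, for example `read_through_cache = ReadThroughCache(load_user_by_key)`, and never branch on hit versus miss. The cost is that the loader has to be expressible as a single key-to-value function, which is part of why this fits best when one library manages both sides.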
With write-through, writes go to the cache first; the cache writes to the DB:
def update_user(user_id, data):
    write_through_cache.set(f"user:{user_id}", data)  # cache propagates to DB
Advantages:
Disadvantages:
Write-through fits when reads vastly outnumber writes AND consistency on reads is critical. For typical apps, cache-aside is better.
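A small sketch of the synchronous flavor of write-through, under the assumption that the backing store exposes `save(key, value)` and `load(key)`. Both `WriteThroughCache` and `DictStore` are hypothetical stand-ins, not a real library's API.

```python
class DictStore:
    """Hypothetical stand-in for the database."""

    def __init__(self):
        self.rows = {}

    def save(self, key, value):
        self.rows[key] = value

    def load(self, key):
        return self.rows.get(key)


class WriteThroughCache:
    """Every write passes through the cache layer and is persisted synchronously."""

    def __init__(self, store):
        self._store = store
        self._entries = {}

    def set(self, key, value):
        # Persist first: if the store raises, the cached copy is left untouched,
        # so readers never see a value the DB does not have.
        self._store.save(key, value)
        self._entries[key] = value

    def get(self, key):
        if key in self._entries:
            return self._entries[key]
        value = self._store.load(key)
        if value is not None:
            self._entries[key] = value
        return value
```

Persisting before updating the cached copy is one possible ordering; it trades a slower write for the guarantee that a failed DB write does not leave the cache ahead of the DB.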
Write-behind queues DB writes and applies them asynchronously. Writes are fast, but anything still queued is lost if the cache crashes before flushing.
We don't use this for anything where data loss matters. It's appropriate for high-throughput append-mostly counters where you've decided you can lose 30 seconds of writes if the cache crashes.
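For the counter case, a write-behind buffer might look roughly like the sketch below. `WriteBehindCounter` and its `flush_fn` callback are assumed names; a real deployment would call `flush()` from a timer or background worker at whatever interval matches the loss you have accepted.

```python
import threading
from collections import Counter


class WriteBehindCounter:
    """Buffer increments in memory and write them to the DB in batches."""

    def __init__(self, flush_fn):
        self._flush_fn = flush_fn  # hypothetical: takes {key: delta}, applies it to the DB
        self._pending = Counter()
        self._lock = threading.Lock()

    def incr(self, key, amount=1):
        # Fast path: no DB round trip on the write.
        with self._lock:
            self._pending[key] += amount

    def flush(self):
        # Swap the buffer out under the lock so writers are not blocked by the DB call.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        if not batch:
            return 0
        try:
            self._flush_fn(dict(batch))
        except Exception:
            # Put the deltas back so a later flush can retry them.
            with self._lock:
                self._pending.update(batch)
            raise
        return len(batch)
```

Everything sitting in `_pending` at the moment the process dies is gone, which is exactly the window of loss described above.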
All caching patterns have a stale window: a time during which the cache has old data while the DB has new data. The window's size depends on the pattern:
For most apps, "up to TTL" stale is fine for individual reads but bad for things like "the user just updated their profile and reloaded, sees old data." That's the case where explicit invalidation on write matters.
When a hot cache entry expires, the first request after expiration goes to the DB to repopulate it. While that query is running, every other request for the same key also misses and tries to repopulate. Result: a stampede on the DB.
Fixes:
Single-flight (request coalescing). Only one request triggers the DB fetch; others wait for it. Some cache client libraries support this out of the box; if yours doesn't, build it with a per-key mutex.
Stale-while-revalidate. Serve stale content while fetching fresh in the background. The first request after expiration serves the stale value and triggers an async refresh; requests that arrive after the refresh completes see the fresh value.
Probabilistic early refresh. Add jitter: refresh keys before they expire with some probability, which spreads the refresh load across time.
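One common way to decide "refresh early with some probability" is an exponentially distributed head start, often referred to as XFetch. The sketch below is that formulation, not necessarily what any given library does; `should_refresh_early`, `recompute_seconds`, and `beta` are names chosen for illustration.

```python
import math
import random
import time


def should_refresh_early(expires_at, recompute_seconds, beta=1.0,
                         now=None, rng=random.random):
    """Return True if this request should refresh the key before it expires.

    recompute_seconds: roughly how long the DB fetch takes.
    beta: values above 1.0 refresh earlier, below 1.0 later.
    """
    now = time.time() if now is None else now
    # rng() is in [0, 1), so 1 - rng() is in (0, 1] and log() is defined and <= 0.
    head_start = -recompute_seconds * beta * math.log(1.0 - rng())
    # The closer we are to expiry, the more likely the random head start crosses it.
    return now + head_start >= expires_at
```

A read path would call this on every hit and, when it returns `True`, recompute and rewrite the entry as if it had missed. Far from expiry the check almost never fires; close to expiry a few requests fire at different times instead of all at once.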
We use stale-while-revalidate for the things that matter; single-flight for everything else.
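A simplified in-process sketch of stale-while-revalidate, assuming a threaded service and a `loader` callback that fetches from the DB. `SWRCache` is a hypothetical class for illustration; it keeps stale values indefinitely and does not coalesce cold misses, both of which a production version would want to bound.

```python
import threading
import time


class SWRCache:
    """Serve the cached value immediately; refresh expired entries in the background."""

    def __init__(self, loader, ttl=300, clock=time.monotonic):
        self._loader = loader  # hypothetical callback: key -> value
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}         # key -> (value, fresh_until)
        self._refreshing = set()   # keys with a background refresh running

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, fresh_until = entry
                expired = self._clock() >= fresh_until
                if expired and key not in self._refreshing:
                    # Only one background refresh per key at a time.
                    self._refreshing.add(key)
                    threading.Thread(
                        target=self._refresh, args=(key,), daemon=True
                    ).start()
                return value  # fresh or stale, never blocks on the DB
        # Cold miss: nothing to serve, so load synchronously.
        return self._refresh(key)

    def _refresh(self, key):
        try:
            value = self._loader(key)
            with self._lock:
                self._entries[key] = (value, self._clock() + self._ttl)
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)
```

If the background load fails, the stale entry stays in place and the next request after expiry tries again, which is usually the behavior you want for the keys that matter.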
Invalidating by exact key is fine until you have related caches. "User updated their profile" should invalidate:
Maintaining this list per write gets unmaintainable. Tag-based invalidation: cache writes specify tags (user:42, team:5); invalidation clears all keys with a tag.
Not all cache libraries support this. We use it where available; manual key listing elsewhere.
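Where the library lacks tags, the bookkeeping can be approximated with a reverse index from tag to keys. The following is an in-memory sketch only; `TaggedCache` and `invalidate_tag` are hypothetical names, and a shared cache would need the index stored alongside the values (for example as one set per tag).

```python
from collections import defaultdict


class TaggedCache:
    """In-memory cache with a tag -> keys index for group invalidation."""

    def __init__(self):
        self._values = {}
        self._keys_by_tag = defaultdict(set)
        self._tags_by_key = {}

    def set(self, key, value, tags=()):
        tags = set(tags)
        self._untrack(key)  # drop tags from any previous write of this key
        self._values[key] = value
        self._tags_by_key[key] = tags
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key):
        return self._values.get(key)

    def invalidate_tag(self, tag):
        """Delete every key written with this tag; return how many were removed."""
        keys = self._keys_by_tag.pop(tag, set())
        for key in list(keys):
            self._values.pop(key, None)
            self._untrack(key)
        return len(keys)

    def _untrack(self, key):
        for tag in self._tags_by_key.pop(key, ()):
            self._keys_by_tag.get(tag, set()).discard(key)
```

A profile update then becomes a single `invalidate_tag("user:42")` instead of a hand-maintained list of keys, as long as every write that depends on the user remembered to pass that tag.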
A few patterns where caching is wrong:
Caching now() or simple computations costs more than it saves.
Start without caching. Add it when measurements show it's worth it.
The pattern doesn't matter as much as the discipline around stale windows, single-flight, and explicit invalidation. Cache-aside with a good invalidation story is enough for most teams. Reach for the more exotic patterns only when measurements point at a specific bottleneck.