Design for region failure. Active/passive and active/active, data replication, and failover testing.
Single-region risk is high. Multi-region design improves availability and disaster recovery.
Multi-region adds cost and complexity; start with critical paths and expand.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
We track the four DORA metrics plus a handful of others. The trade-off between what's measurable and what's meaningful, and how we use the numbers.
Explore more articles in this category
There are two hard problems in computer science." We've worked on the cache-invalidation one for a while. The patterns that hold up at scale and the ones that look clean and aren't.
We use Step Functions for batch processing, document ingestion, and a few agentic workflows. The patterns that work, the limits we hit, and where we'd reach for something else.
After two years of running Karpenter on production EKS clusters, the NodePool patterns that survived, the ones we replaced, and the tuning that matters.