Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
Three layers of pooling, three different jobs. We learned the hard way which to use when. Real numbers from an 8k-connection workload.
We deployed the same edge function on both platforms and measured for a quarter. Where each wins, where each loses, and the surprises along the way.
We started using eBPF tooling for ad-hoc production debugging six months ago. Three real incidents where it cut investigation time from hours to minutes.
A two-line config change to an Argo Rollouts analysis template caught a regression that would have cost ~$40k in API spend before we noticed. Here's the pattern.
Three production OOM incidents that taught us how kubelet, containerd, and the kernel actually decide which process dies. With debugging commands you'll wish you had earlier.
We've been running the OTel Collector at the edge of every cluster for 18 months. The config patterns that lasted, the ones we ripped out, and a few processors that quietly saved us money.
Blue/green is easy for stateless services. We did it for our primary Postgres cluster with 3.2TB of data and ~8k connections. Here's exactly how — and what almost went wrong.
We migrated 47 cron jobs to systemd timers across our fleet. The mechanical conversion was easy. The interesting parts were the bugs we found that cron had been hiding.
How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.
We were drowning in 200 alerts a week. Most got ignored. After a quarter of triage and rework, we're at about 15 — and on-call actually responds to them.
We wrote pretty postmortems for two years and kept hitting the same incidents. Here's what changed when we started writing ugly ones.
Step-by-step debugging of a production Linux server hitting 100% CPU. From top to perf to the actual fix.