Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
We track the four DORA metrics plus a handful of others. The trade-off between what's measurable and what's meaningful, and how we use the numbers.
We've run canary deploys on most services for two years. The mechanics are easy; the metrics that decide "promote or roll back" are where the design is.
We use blue-green for stateful services where canary doesn't fit. The actual mechanics, the data-layer subtleties, and when blue-green isn't the right answer.
We collect ~800 GB of logs per day across our fleet. The shape of our logging stack, what we keep, what we drop, and what we'd build differently.
Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops, and how we fixed them.
A working Prometheus stack for a 40-node cluster: what we deploy, what we tune, and what we wish we'd known about cardinality two years ago.
We've had to restore a Kubernetes cluster from backup twice. Once it worked. Once it took 14 hours. Here's the strategy we run now.
Build MLOps pipelines for training, evaluation, and deployment, with reproducibility and monitoring built in.
We ran Istio for a year, then switched to Linkerd. Both can do the job. The decision came down to operational fit, not features.
We started with a single Celery worker handling everything. Eight months and three architecture changes later, here's what scaled and what we learned about queue design.