Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
We upgraded a 60-node EKS cluster from 1.27 to 1.31 over six months. Four minor versions, one bad surprise, zero customer impact. Here's the playbook.
I spent 3 weeks chasing an answer-quality regression that turned out to be a tokenizer mismatch in a library upgrade. Here's what I learned about evaluating RAG.
We changed a system prompt to get what we thought was a tone improvement and broke a customer-critical extraction overnight. Here are the version control and regression tests we built next.
A DR runbook nobody reads is worse than no runbook. The shape that finally got ours executed correctly under pressure.
We replaced 47 percentile-threshold alerts with 3 SLO burn-rate alerts. The on-call rotation gets paged less and catches more.
We mapped every byte that ends up in our production containers. The map showed three places trust was implicit. Each became a control.
We expanded from one Kubernetes cluster to four across two regions. The traffic-routing layer was the hardest piece. Here's what we tried, what worked, and what we'd do again.
We had Datadog for app metrics, Loki for logs, and zero useful insight into what our LLM service was actually doing. Here's the observability stack we built specifically for model serving.
Platform teams own the systems that every service depends on. Here's our incident response playbook for when the foundation cracks.
We had three months of slow drift between our Terraform code and AWS reality. Here's the daily-cron + Slack workflow that closed the gap.
Learn how to fine-tune LLMs such as Llama 2, Mistral, and GPT models for your specific use case. Covers LoRA, QLoRA, and full fine-tuning techniques.
Compare Terraform, Pulumi, and Ansible for Infrastructure as Code. Learn when to use each tool and how they complement each other in modern DevOps workflows.