Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
Backups are easy. Restores are hard. The quarterly drill we run, what's failed during it, and the discipline that makes "we have backups" actually mean something.
Replication is the foundation of database HA. What we monitor, how we practice failover, and the gotchas that show up only when you actually fail over.
Why Postgres connection limits bite at unexpected times, the pooling layer we put in front, and the pool-mode tradeoffs we learned the hard way.
Provision real cloud resources with Terraform — a VPC, an S3 bucket, and an EC2 instance — using the standard init/plan/apply workflow.
GitOps in plain words — what it actually is, the workflow it enables, and a hands-on demo using Argo CD on a local Kubernetes cluster.
Install Ansible, write your first playbook, and configure a remote server (nginx + a deploy user) without touching it manually. The basics that scale up.
A 2 AM incident, the autovacuum settings that caused it, and the parameter changes that prevented the next one. The discipline that took our biggest Postgres host from periodic stalls to steady.
We launched Backstage in October. Six months in, 80% of services are catalogued, on-boarding takes a third of the time, and we mostly know what owns what.
We ran Pulumi in TypeScript and Terraform in HCL side by side across 60+ services. Each won different categories of work. Here's the breakdown.
How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.
We were drowning in 200 alerts a week. Most got ignored. After a quarter of triage and rework, we're at about 15 — and on-call actually responds to them.
Practical patterns for Terraform modules at scale: versioning, composition, testing, and avoiding the monolith trap.