Blog

Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.

Linux Memory Management: When OOM Killer Strikes Your K8s Pods

Three production OOM incidents that taught us how kubelet, containerd, and the kernel actually decide which process dies. With debugging commands you'll wish you had earlier.

Kiril Urbonas·9

3 months ago

OpenTelemetry Collector Pipelines: Real Configs That Survived Production

We've been running the OTel Collector at the edge of every cluster for 18 months. The config patterns that lasted, the ones we ripped out, and a few processors that quietly saved us money.

Kiril Urbonas·4

3 months ago

Blue/Green Deploys for Stateful Services: A Postgres Cutover Story

Blue/green is easy for stateless services. We did it for our primary Postgres cluster with 3.2TB of data and ~8k connections. Here's exactly how — and what almost went wrong.

Kiril Urbonas·9

3 months ago

systemd Timers vs Cron: When We Switched and What We Learned

We migrated 47 cron jobs to systemd timers across our fleet. The mechanical conversion was easy. The interesting parts were the bugs we found that cron had been hiding.

Kiril Urbonas·7

3 months ago

Database Migrations Without Downtime: Patterns From Three Real Cutovers

How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.

Kiril Urbonas·9

3 months ago

Monitoring That Actually Helps On-Call: Alerts, Dashboards, and Runbooks

We were drowning in 200 alerts a week. Most got ignored. After a quarter of triage and rework, we're at about 15 — and on-call actually responds to them.

Kiril Urbonas·12

3 months ago

Incident Postmortems That Actually Prevent Repeat Failures

We wrote pretty postmortems for two years and kept hitting the same incidents. Here's what changed when we started writing ugly ones.

Kiril Urbonas·5

3 months ago

Linux Performance Troubleshooting: A Real Incident Walkthrough

Step-by-step debugging of a production Linux server hitting 100% CPU. From top to perf to the actual fix.

Kiril Urbonas·7

3 months ago

AWS Cost Audit: 7 Things We Found Wasting Money Every Month

A real cost audit uncovered idle load balancers, oversized RDS instances, and forgotten snapshots. Here's what we found and how we fixed each one.

Kiril Urbonas·6

3 months ago

Model Fallback Policies for Customer-Facing AI: The Routing Rules That Kept SLA Intact

A real-world model fallback guide for customer-facing AI systems, covering how one team preserved response quality and support SLAs during a partial provider degradation.

Kiril Urbonas·19

3 months ago

RDS Restore Drills for Busy Teams: The Recovery Workflow That Surfaced Real Gaps

A hands-on RDS restore drill guide for small cloud teams that thought backups were covered until a timed restore test exposed missing steps, DNS confusion, and stale credentials.

Kiril Urbonas·9

3 months ago

Systemd Drop-In Overrides for Vendor Services: The Supportable Linux Ops Pattern

A practical systemd drop-in guide built from a real operations problem: vendor unit files kept changing, but the team still needed consistent restart, environment, and logging behavior.

Kiril Urbonas·7
