Blog

Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.

March 11, 2025

A Pragmatic Multi-Region Strategy for Small Teams

How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.

Kiril Urbonas·6

March 7, 2025

How We Stopped Terraform Drift from Surprising On-Call

A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.

Kiril Urbonas·7

March 4, 2025

A Pragmatic Multi-Region Strategy for Small Teams

How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.

Kiril Urbonas·9

December 10, 2024

Field Notes: RAG Retrieval Quality Evaluation

I spent 3 weeks chasing an answer-quality regression that turned out to be a tokenizer mismatch in a library upgrade. Here's what I learned about evaluating RAG.

Kiril Urbonas·2

December 6, 2024

Field Notes: Prompt Versioning and Regression Testing

We changed a system prompt for what we thought was a tone improvement and broke a customer-critical extraction overnight. Here are the version control and regression tests we built next.

Kiril Urbonas·7

August 10, 2024

Production Playbook: Cloud Disaster Recovery Runbook Design

A DR runbook nobody reads is worse than no runbook. This is the shape that finally got ours executed correctly under pressure.

Kiril Urbonas·5

July 11, 2024

Deep Dive: SLO-Based Monitoring for APIs

We replaced 47 percentile-threshold alerts with 3 SLO burn-rate alerts. The on-call rotation gets paged less and catches more.

Kiril Urbonas·5

July 7, 2024

Deep Dive: Secure Container Supply Chain Controls

We mapped every byte that ends up in our production containers. The map showed three places trust was implicit. Each became a control.

Kiril Urbonas·3

June 13, 2024

Deep Dive: Multi-Cluster Traffic Routing Strategies

We expanded from one Kubernetes cluster to four across two regions. The traffic-routing layer was the hardest piece. Here's what we tried, what worked, and what we'd do again.

Kiril Urbonas·12

June 2, 2024

Deep Dive: Model Serving Observability Stack

We had Datadog for app metrics, Loki for logs, and zero useful insight into what our LLM service was actually doing. Here's the observability stack we built specifically for model serving.

Kiril Urbonas·11

March 20, 2024

Practical Guide: Incident Response for Platform Teams

Platform teams own the systems that every service depends on. This is our incident response playbook for when the foundation cracks.

Kiril Urbonas·2

March 11, 2024

Practical Guide: Infrastructure Drift Detection Workflow

We had three months of slow drift between our Terraform code and AWS reality. Here's the daily-cron + Slack workflow that closed the gap.

Kiril Urbonas·4
