Blog

Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.

Four Signals That Matter: Choosing SLIs Users Actually Feel

Most SLI dashboards track things nobody notices. Here's how we picked the handful of signals that map to real user pain, and dropped the vanity metrics.

Kiril Urbonas·1

last week

Hunting Slow Queries with pg_stat_statements

The dashboard said the database was fine. It wasn't. Here's how pg_stat_statements found the query eating 40% of our Postgres CPU.

Kiril Urbonas·1

last week

Linux Memory Pressure — Reading PSI Before the OOM Killer Reads You

Free memory is a lie and load average doesn't see memory stalls. How Pressure Stall Information gives you a direct, early signal of memory contention — and how we wired it into alerts and autoscaling.

Kiril Urbonas·3

2 weeks ago

Alert on Symptoms, Not Causes — SLO Burn-Rate Alerting in Practice

Cause-based alerts page you for things that don't matter and miss things that do. How we rebuilt alerting around SLO burn rates — multi-window, multi-burn-rate — and cut pages while catching more real pain.

Kiril Urbonas·3

3 weeks ago

Observability — Correlating Logs, Metrics, and Traces in Anger

The "three pillars" framing misses the point — what matters is correlating across them. The patterns that earn their place and the tooling decisions that pay back.

Kiril Urbonas·5

last month

Pipeline Observability — Why CI Failures Don't Trigger Alerts (And Should)

Production monitoring catches user-facing issues. CI failures stay invisible until someone notices the merge queue is stuck. The metrics and alerts that make pipelines observable.

Kiril Urbonas·4

last month

Burn-Rate Alerting — The SLO Discipline That Prevents Alert Fatigue

Static thresholds on error rate produce noisy alerts. Burn-rate alerting flips the question to "are we burning the error budget faster than we can sustain?" — and pages only on real problems.

Kiril Urbonas·14

last month

SLI Design — Picking Metrics That Actually Correlate With User Experience

Wrong SLI metrics mean green dashboards while users churn. The discipline of picking signals that move with what users actually feel, and the ones that look reliable but lie.

Kiril Urbonas·5

2 months ago

Database Connection Pooling at Scale: PgBouncer, RDS Proxy, Application Pool

Three layers of pooling, three different jobs. We learned the hard way which to use when. Real numbers from an 8k-connection workload.

Kiril Urbonas·7

2 months ago

Cloudflare Workers vs Vercel Edge: A Latency-Cost Comparison

We deployed the same edge function on both platforms and measured for a quarter. Where each wins, where each loses, and the surprises along the way.

Kiril Urbonas·134

2 months ago

eBPF for SREs: Three Real Diagnoses That Saved Hours

We started using eBPF tooling for ad-hoc production debugging six months ago. Three real incidents where it cut investigation time from hours to minutes.

Kiril Urbonas·6

2 months ago

Argo Rollouts: Canary Deployments That Caught a $40k Bug

A two-line config change to an Argo Rollouts analysis template caught a regression that would have cost ~$40k in API spend before we noticed. Here's the pattern.

Kiril Urbonas·4

Page 1 of 15 · 179 posts