Blog

Pipeline Observability — Why CI Failures Don't Trigger Alerts (And Should)

Production monitoring catches user-facing issues. CI failures stay invisible until someone notices the merge queue is stuck. The metrics and alerts that make pipelines observable.

Kiril Urbonas·4

Burn-Rate Alerting — The SLO Discipline That Prevents Alert Fatigue

Static thresholds on error rate produce noisy alerts. Burn-rate alerting flips the question to "are we burning the error budget faster than we can sustain?" — and pages only on real problems.

Kiril Urbonas·14

Multi-Provider LLM Routing — Failover, Cost Routing, and Load Balancing

Single-provider LLM apps fail when the provider does. Multi-provider routing isn't just resilience — it's also a cost lever. The patterns we run.

Kiril Urbonas·1

SLI Design — Picking Metrics That Actually Correlate With User Experience

Wrong SLI metrics mean green dashboards while users churn. The discipline of picking signals that move with what users actually feel, and the ones that look reliable but lie.

Kiril Urbonas·4

2 months ago

Chaos Engineering — What We Actually Run as Game Days

We run a chaos game day each quarter. The scenarios that surfaced real problems, the ones that didn't, and the operational discipline that makes the practice pay back.

Kiril Urbonas·7

Model Fallback Policies for Customer-Facing AI: The Routing Rules That Kept SLA Intact

A real-world model fallback guide for customer-facing AI systems, covering how one team preserved response quality and support SLAs during a partial provider degradation.

Kiril Urbonas·19

Artifact Promotion Instead of Rebuilds: The Release Control Pattern That Stopped Drift

A practical artifact promotion guide for CI/CD teams that were tired of hearing 'it passed in staging' after production behaved differently because the release was rebuilt.

Kiril Urbonas·48

RDS Restore Drills for Busy Teams: The Recovery Workflow That Surfaced Real Gaps

A hands-on RDS restore drill guide for small cloud teams that thought backups were covered until a timed restore test exposed missing steps, DNS confusion, and stale credentials.

Kiril Urbonas·9

Systemd Drop-In Overrides for Vendor Services: The Supportable Linux Ops Pattern

A practical systemd drop-in guide built from a real operations problem: vendor unit files kept changing, but the team still needed consistent restart, environment, and logging behavior.

Kiril Urbonas·7

Embedding Model Upgrades Without Search Chaos: A Safer RAG Rollout Pattern

A practical embedding model upgrade guide for RAG systems, built from a real support-search migration that initially reduced answer quality instead of improving it.

Kiril Urbonas·50

Multi-Cluster Traffic Routing Strategies: A Pragmatic Rollout Pattern for Growing SaaS Teams

A real-world multi-cluster traffic routing guide for SaaS teams that have outgrown a single Kubernetes cluster and need safer rollout control without a service-mesh science project.

Kiril Urbonas·11