Blog

Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.

Real-World RAG Incidents: Lessons from a Production Rollout

8 months ago

A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.

Kiril Urbonas·2

Canary Releases: Gradual Rollout Strategy

8 months ago

We've run canary deploys on most services for two years. The mechanics are easy; the metrics that decide "promote or roll back" are where the design is.

Kiril Urbonas·14

How We Stopped Terraform Drift from Surprising On-Call

8 months ago

A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.

Kiril Urbonas·4

Systemd Tricks We Use to Keep Services Boring

8 months ago

Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.

Kiril Urbonas·5

Blue-Green Deployments: Zero-Downtime Releases

8 months ago

We use blue-green for stateful services where canary doesn't fit. The actual mechanics, the data-layer subtleties, and when blue-green isn't the right answer.

Kiril Urbonas·7

A Pragmatic Multi-Region Strategy for Small Teams

8 months ago

How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.

Kiril Urbonas·4

Log Aggregation Strategies: Centralizing Your Logs

8 months ago

We collect ~800GB of logs per day across our fleet. The shape of our logging stack, what we keep, what we drop, and what we'd build differently.

Kiril Urbonas·8

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

8 months ago

Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.

Kiril Urbonas·4

Real-World RAG Incidents: Lessons from a Production Rollout

8 months ago

A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.

Kiril Urbonas·4

Infrastructure Monitoring with Prometheus: Complete Setup Guide

8 months ago

A working Prometheus stack for a 40-node cluster: what we deploy, what we tune, and what we wish we'd known about cardinality two years ago.

Kiril Urbonas·12

Page 25 of 44 · 518 posts
