Featured Article

Real-World RAG Incidents: Lessons from a Production Rollout

A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.

AI LLM GPT Python

Kiril Urbonas, DevOps Engineer and AI Enthusiast

Mar 10, 2026

Topics

Monitoring (280) · Terraform (207) · AWS (166) · Kubernetes (124) · Python (111) · Security (107) · CI/CD (103) · LLM (97) · Ansible (95) · Linux (95)

Latest Articles

How We Stopped Terraform Drift from Surprising On-Call

last week

A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.

Kiril Urbonas · 1 min read

Systemd Tricks We Use to Keep Services Boring

last week

Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.

Kiril Urbonas · 1 min read

Page 1 of 44 · 518 posts

DevOpsNess

Practical AI, DevOps, Cloud, and Linux guidance for engineering teams

Weekly deep dives, implementation patterns, and reliability-focused playbooks.

A Pragmatic Multi-Region Strategy for Small Teams

last week

How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.

Kiril Urbonas · 2 min read

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

last week

Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.

Kiril Urbonas · 2 min read

Ansible and Infrastructure as Code: Idempotency and Best Practices

last week

Write Ansible playbooks that are idempotent, readable, and maintainable for config management.

Kiril Urbonas · 1 min read

Real-World RAG Incidents: Lessons from a Production Rollout

2 weeks ago

A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.

Kiril Urbonas · 2 min read

How We Stopped Terraform Drift from Surprising On-Call

2 weeks ago

A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.

Kiril Urbonas · 1 min read

Systemd Tricks We Use to Keep Services Boring

2 weeks ago

Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.

Kiril Urbonas · 1 min read

A Pragmatic Multi-Region Strategy for Small Teams

2 weeks ago

How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.

Kiril Urbonas · 2 min read

End-of-Week Engineering: Why Smart Tech Teams Don’t Ship Major Changes on Friday

2 weeks ago

A practical risk-management framework for release timing, Friday deployment policies, progressive delivery, and how elite teams protect reliability and people.

Kiril Urbonas · 5 min read

Kubernetes Cost Optimization for Teams: FinOps Tactics That Actually Work

2 weeks ago

Cut Kubernetes spend without hurting reliability using a practical FinOps playbook for rightsizing, autoscaling guardrails, showback, and weekly waste cleanup.

Kiril Urbonas · 5 min read

SRE Error Budgets in Practice: Shipping Fast Without Burning Reliability

3 weeks ago

A practical way to define SLOs and error budgets, connect them to release decisions, and avoid reliability debates without data.

Kiril Urbonas · 2 min read

© 2026 DevOpsNess. By Kiril Urbonas.
