devops/ness
Featured Article

Multi-Cluster Traffic Routing Strategies: A Pragmatic Rollout Pattern for Growing SaaS Teams

A real-world multi-cluster traffic routing guide for SaaS teams that have outgrown a single Kubernetes cluster and need safer rollout control without a service-mesh science project.

Cloud · Kubernetes · AWS · Monitoring
Kiril Urbonas, DevOps Engineer and AI Enthusiast | Mar 21, 2026
Topics

Monitoring (283) · Terraform (209) · AWS (170) · Kubernetes (126) · Python (113) · Security (109) · CI/CD (105) · LLM (99) · Ansible (97) · Linux (97)

Latest Articles

Terraform State Isolation by Environment: How We Stopped One Change from Hitting Prod
yesterday

A practical Terraform state isolation guide built from a real environment-mixing incident, with patterns for safer backends, clearer ownership, and lower blast radius.

Kiril Urbonas · 3 min read
Prompt Versioning and Regression Testing: How Teams Avoid Silent AI Regressions
2 days ago

A real-world guide to prompt versioning and regression testing for production AI features, focused on preventing the subtle changes that hurt quality long before anyone notices.

Kiril Urbonas · 3 min read
Page 1 of 45 · 529 posts

DevOpsNess

Practical AI, DevOps, Cloud, and Linux guidance for engineering teams

Weekly deep dives, implementation patterns, and reliability-focused playbooks.

Systemd Service Reliability Patterns: What We Changed After Repeated Restart Loops
3 days ago

A practical systemd reliability guide for Linux services, built around repeated restart-loop incidents and the unit-file patterns that finally made those services boring.

Kiril Urbonas · 3 min read
Blue-Green Deployment Guardrails in Kubernetes: Lessons from a Failed Friday Rollout
4 days ago

A Kubernetes blue-green deployment guide built around a real rollout failure, showing the guardrails that matter when traffic shifting, health checks, and rollback timing all interact.

Kiril Urbonas · 3 min read
Cloud Disaster Recovery Runbook Design: How Small Teams Rehearse Multi-Region Failover
5 days ago

A practical disaster recovery runbook guide for small cloud teams that need realistic failover steps, clear ownership, and repeatable rehearsals instead of shelfware documents.

Kiril Urbonas · 4 min read
RAG Retrieval Quality Evaluation: The Checks We Added After Bad Answers Reached Production
6 days ago

A search-friendly guide to RAG retrieval quality evaluation, based on the moment one production assistant started citing stale documents and the team had to prove what 'good retrieval' meant.

Kiril Urbonas · 3 min read
Infrastructure Documentation as Code: How One Platform Team Reduced Audit Fire Drills
last week

This infrastructure documentation as code guide shows how a platform team moved runbooks, ownership maps, and architecture decisions into versioned workflows that people actually trusted.

Kiril Urbonas · 4 min read
Linux Patch Management for Production Fleets: A Real-World Maintenance Workflow
last week

A production-tested Linux patch management workflow for teams that need security fixes without turning every maintenance window into a gamble.

Kiril Urbonas · 4 min read
AWS Cost Allocation Tags for Shared Platforms: What Finally Worked
last week

A hands-on guide to AWS cost allocation tags for shared environments, built from a real platform-team problem: everyone used the cluster, but nobody trusted the bill.

Kiril Urbonas · 4 min read
GitHub Actions Monorepo CI: How We Cut Build Times Without Breaking Main
last week

A practical GitHub Actions monorepo CI guide built around a real scaling problem: long queues, noisy failures, and developers waiting 40 minutes for feedback.

Kiril Urbonas · 4 min read
Real-World RAG Incidents: Lessons from a Production Rollout
last week

A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.

Kiril Urbonas · 2 min read
How We Stopped Terraform Drift from Surprising On-Call
last week

A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.

Kiril Urbonas · 1 min read
© 2026 DevOpsNess. By Kiril Urbonas.