devops/ness

Blog

Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.

Tag: #monitoring
Monitoring That Actually Helps On-Call: Alerts, Dashboards, and Runbooks
4 days ago

How we went from 200 alerts per week (most ignored) to 15 actionable alerts with clear runbooks and useful dashboards.

Kiril Urbonas
Incident Postmortems That Actually Prevent Repeat Failures
6 days ago

How to write postmortems that lead to real improvements, not just documentation theater. Includes a template and real examples.

Kiril Urbonas
Linux Performance Troubleshooting: A Real Incident Walkthrough
last week

Step-by-step debugging of a production Linux server hitting 100% CPU. From top to perf to the actual fix.

Kiril Urbonas
AWS Cost Audit: 7 Things We Found Wasting Money Every Month
last week

A real cost audit uncovered idle load balancers, oversized RDS instances, and forgotten snapshots. Here's what we found and how we fixed each one.

Kiril Urbonas
Model Fallback Policies for Customer-Facing AI: The Routing Rules That Kept SLA Intact
last week

A real-world model fallback guide for customer-facing AI systems, covering how one team preserved response quality and support SLAs during a partial provider degradation.

Kiril Urbonas
RDS Restore Drills for Busy Teams: The Recovery Workflow That Surfaced Real Gaps
2 weeks ago

A hands-on RDS restore drill guide for small cloud teams that thought backups were covered until a timed restore test exposed missing steps, DNS confusion, and stale credentials.

Kiril Urbonas
Systemd Drop-In Overrides for Vendor Services: The Supportable Linux Ops Pattern
2 weeks ago

A practical systemd drop-in guide built from a real operations problem: vendor unit files kept changing, but the team still needed consistent restart, environment, and logging behavior.

Kiril Urbonas
Embedding Model Upgrades Without Search Chaos: A Safer RAG Rollout Pattern
2 weeks ago

A practical embedding model upgrade guide for RAG systems, built from a real support-search migration that initially reduced answer quality instead of improving it.

Kiril Urbonas
Multi-Cluster Traffic Routing Strategies: A Pragmatic Rollout Pattern for Growing SaaS Teams
2 weeks ago

A real-world multi-cluster traffic routing guide for SaaS teams that have outgrown a single Kubernetes cluster and need safer rollout control without a service-mesh science project.

Kiril Urbonas
Systemd Service Reliability Patterns: What We Changed After Repeated Restart Loops
3 weeks ago

A practical systemd reliability guide for Linux services, built around repeated restart-loop incidents and the unit-file patterns that finally made those services boring.

Kiril Urbonas
Cloud Disaster Recovery Runbook Design: How Small Teams Rehearse Multi-Region Failover
3 weeks ago

A practical disaster recovery runbook guide for small cloud teams that need realistic failover steps, clear ownership, and repeatable rehearsals instead of shelfware documents.

Kiril Urbonas
How We Stopped Terraform Drift from Surprising On-Call
last month

A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.

Kiril Urbonas