{"name":"DevOpsNess","description":"Practical tutorials and articles on AI, DevOps, cloud, Linux, and infrastructure.","url":"https://www.devopsness.com","contentCount":200,"content":[{"title":"Database Backups — Testing Restores, Not Just Taking Them","url":"https://www.devopsness.com/blog/database-backup-restoration-testing-restores","description":"Backups are easy. Restores are hard. The quarterly drill we run, what's failed during it, and the discipline that makes \"we have backups\" actually mean something.","publishedAt":"2026-05-14T00:00:00.000Z","updatedAt":"2026-05-16T16:13:10.842Z","category":"Infrastructure"},{"title":"Helm Chart Anti-Patterns We've Stopped Using","url":"https://www.devopsness.com/blog/helm-chart-anti-patterns","description":"Helm gives you a lot of rope. The patterns we used that backfired, the ones we replaced them with, and what to skip if you're starting today.","publishedAt":"2026-05-13T00:00:00.000Z","updatedAt":"2026-05-14T08:16:44.770Z","category":"DevOps"},{"title":"CDN Cache Invalidation — Strategies That Don't Break in Production","url":"https://www.devopsness.com/blog/cdn-cache-invalidation-strategies","description":"\"There are two hard problems in computer science.\" We've worked on the cache-invalidation one for a while. The patterns that hold up at scale and the ones that look clean and aren't.","publishedAt":"2026-05-12T00:00:00.000Z","updatedAt":"2026-05-14T08:20:25.226Z","category":"Cloud"},{"title":"Embeddings Drift Detection — When \"Similar Enough\" Stops Being Similar","url":"https://www.devopsness.com/blog/embeddings-drift-detection-when-similar-stops","description":"Embedding indexes degrade silently. 
The signals that catch drift, how often to re-embed, and the operational patterns we built after one quiet quality regression.","publishedAt":"2026-05-11T00:00:00.000Z","updatedAt":"2026-05-14T08:20:32.764Z","category":"AI"},{"title":"Job Queues — Sidekiq, Celery, BullMQ Patterns That Hold Up","url":"https://www.devopsness.com/blog/job-queues-sidekiq-celery-bullmq-patterns","description":"We run three different job queue systems across our services. The patterns that work across all of them, the differences that matter, and the operational gotchas.","publishedAt":"2026-05-10T00:00:00.000Z","updatedAt":"2026-05-16T02:21:10.503Z","category":"DevOps"},{"title":"systemd Timers vs Cron — What We Learned Switching","url":"https://www.devopsness.com/blog/systemd-timers-vs-cron-what-we-learned","description":"We migrated most scheduled jobs from cron to systemd timers. The wins, the gotchas, and the cases we kept on cron anyway.","publishedAt":"2026-05-09T00:00:00.000Z","updatedAt":"2026-05-14T08:17:05.894Z","category":"Linux"},{"title":"AWS Step Functions for Workflow Orchestration","url":"https://www.devopsness.com/blog/aws-step-functions-workflow-orchestration","description":"We use Step Functions for batch processing, document ingestion, and a few agentic workflows. The patterns that work, the limits we hit, and where we'd reach for something else.","publishedAt":"2026-05-08T00:00:00.000Z","updatedAt":"2026-05-16T04:16:02.232Z","category":"Cloud"},{"title":"LLM Streaming UX — Backpressure, Cancellation, Partial Results","url":"https://www.devopsness.com/blog/llm-streaming-ux-backpressure-cancellation","description":"Streaming LLM responses is easy until the client disconnects, the model stalls, or the user cancels. 
The patterns that keep streaming responsive without leaking spend.","publishedAt":"2026-05-07T00:00:00.000Z","updatedAt":"2026-05-14T08:20:47.937Z","category":"AI"},{"title":"Internal Developer Platforms — Backstage in Practice","url":"https://www.devopsness.com/blog/internal-developer-platforms-backstage-in-practice","description":"We adopted Backstage for service catalogs and templates. What works, what was over-engineered for our size, and what we'd do differently.","publishedAt":"2026-05-06T00:00:00.000Z","updatedAt":"2026-05-14T08:16:47.078Z","category":"DevOps"},{"title":"Postgres Replication Lag — Monitoring and Failover Practice","url":"https://www.devopsness.com/blog/postgres-replication-lag-failover-practice","description":"Replication is the foundation of database HA. What we monitor, how we practice failover, and the gotchas that show up only when you actually fail over.","publishedAt":"2026-05-05T00:00:00.000Z","updatedAt":"2026-05-14T08:20:52.709Z","category":"Infrastructure"},{"title":"Bash One-Liners We Actually Use","url":"https://www.devopsness.com/blog/bash-one-liners-we-actually-use","description":"A curated list of shell one-liners that earn their place in real ops work — the ones I reach for weekly, not the trick-shot variety.","publishedAt":"2026-05-04T00:00:00.000Z","updatedAt":"2026-05-14T08:16:24.607Z","category":"Linux"},{"title":"Karpenter — Node Provisioning Patterns at Scale","url":"https://www.devopsness.com/blog/karpenter-node-provisioning-patterns-at-scale","description":"After two years of running Karpenter on production EKS clusters, the NodePool patterns that survived, the ones we replaced, and the tuning that matters.","publishedAt":"2026-05-03T00:00:00.000Z","updatedAt":"2026-05-14T08:16:51.810Z","category":"Cloud"},{"title":"AI Agent Tool Design — Boundaries and Confirmations","url":"https://www.devopsness.com/blog/ai-agent-tool-design-boundaries-confirmations","description":"When LLMs can call tools that change real state, 
the design decisions that matter most are about what's gated, what's automatic, and what triggers a human checkpoint.","publishedAt":"2026-05-02T00:00:00.000Z","updatedAt":"2026-05-14T08:20:10.632Z","category":"AI"},{"title":"Chaos Engineering — What We Actually Run as Game Days","url":"https://www.devopsness.com/blog/chaos-engineering-game-days-platform-teams","description":"We run a chaos game day each quarter. The scenarios that surfaced real problems, the ones that didn't, and the operational discipline that makes the practice pay back.","publishedAt":"2026-05-01T00:00:00.000Z","updatedAt":"2026-05-17T11:53:19.805Z","category":"DevOps"},{"title":"Postgres Connection Pooling — PgBouncer in Front of RDS","url":"https://www.devopsness.com/blog/postgres-connection-pooling-pgbouncer","description":"Why Postgres connection limits bite at unexpected times, the pooling layer we put in front, and the pool-mode tradeoffs we learned the hard way.","publishedAt":"2026-04-30T00:00:00.000Z","updatedAt":"2026-05-17T20:58:45.607Z","category":"Infrastructure"},{"title":"What Are Embeddings? A Beginner's Guide with Code","url":"https://www.devopsness.com/blog/what-are-embeddings-beginner-guide","description":"Embeddings turn text into numbers a computer can compare. 
Here's the working mental model, a runnable Python example, and where embeddings fit in real apps.","publishedAt":"2026-04-29T18:48:06.923Z","updatedAt":"2026-05-04T19:03:49.101Z","category":"AI"},{"title":"Terraform Tutorial — Your First Infrastructure-as-Code Project","url":"https://www.devopsness.com/blog/terraform-tutorial-first-iac-project","description":"Provision real cloud resources with Terraform — a VPC, an S3 bucket, and an EC2 instance — using the standard init/plan/apply workflow.","publishedAt":"2026-04-29T18:48:02.259Z","updatedAt":"2026-05-14T07:25:29.877Z","category":"Infrastructure"},{"title":"SSH Tutorial — Keys, Config, and Working Remotely","url":"https://www.devopsness.com/blog/ssh-tutorial-keys-config-remote-work","description":"Generate an SSH key, set up passwordless login, and configure aliases for the servers you use daily — all without copy-pasting yet another long command.","publishedAt":"2026-04-29T18:47:57.183Z","updatedAt":"2026-05-16T10:41:40.015Z","category":"Linux"},{"title":"Prompt Engineering Basics — From \"Help Me\" to Working Prompts","url":"https://www.devopsness.com/blog/prompt-engineering-basics-tutorial","description":"A hands-on intro to prompt engineering. Learn the four levers (role, format, examples, constraints) and watch a vague prompt turn into a reliable one.","publishedAt":"2026-04-29T18:47:52.577Z","updatedAt":"2026-05-14T07:25:26.566Z","category":"AI"},{"title":"Linux File Permissions — Read, Write, Execute Without Tears","url":"https://www.devopsness.com/blog/linux-file-permissions-explained","description":"A clear walkthrough of Linux file permissions. 
Read the funny rwx- letters, change them safely with chmod, fix \"permission denied\" errors with confidence.","publishedAt":"2026-04-29T18:47:47.952Z","updatedAt":"2026-05-16T12:18:52.673Z","category":"Linux"},{"title":"Kubernetes 101 — Pods, Deployments, and Services Explained","url":"https://www.devopsness.com/blog/kubernetes-101-pods-deployments-services","description":"Run your first three Kubernetes objects — Pod, Deployment, Service — on a local cluster, then understand why each one exists and how they fit together.","publishedAt":"2026-04-29T18:47:42.868Z","updatedAt":"2026-05-17T18:12:37.041Z","category":"DevOps"},{"title":"GitOps Explained — What It Is and Why Teams Adopt It","url":"https://www.devopsness.com/blog/gitops-explained-introduction","description":"GitOps in plain words — what it actually is, the workflow it enables, and a hands-on demo using Argo CD on a local Kubernetes cluster.","publishedAt":"2026-04-29T18:47:38.356Z","updatedAt":"2026-05-16T10:54:43.977Z","category":"Infrastructure"},{"title":"Your First CI/CD Pipeline with GitHub Actions","url":"https://www.devopsness.com/blog/first-cicd-pipeline-github-actions-tutorial","description":"Walk through a working GitHub Actions workflow — install, test, build, deploy — for a tiny Node app. Every line explained.","publishedAt":"2026-04-29T18:47:32.442Z","updatedAt":"2026-05-17T09:09:12.983Z","category":"DevOps"},{"title":"Docker for Beginners — Build, Run, and Ship Your First Container","url":"https://www.devopsness.com/blog/docker-beginners-tutorial-first-container","description":"Walk through your first Dockerfile, container run, and image push in 30 minutes. 
No theory dumps — just the commands and what each one is doing.","publishedAt":"2026-04-29T18:47:28.104Z","updatedAt":"2026-05-15T20:55:20.862Z","category":"DevOps"},{"title":"Build Your First RAG App in 100 Lines of Python","url":"https://www.devopsness.com/blog/build-first-rag-app-python-tutorial","description":"A working retrieval-augmented generation app you can run today. Markdown ingestion, embeddings, semantic search, and an LLM answer — start to finish in one afternoon.","publishedAt":"2026-04-29T18:47:22.404Z","updatedAt":"2026-05-11T07:51:12.075Z","category":"AI"},{"title":"Bash Scripting Tutorial — Write Your First Useful Script","url":"https://www.devopsness.com/blog/bash-scripting-tutorial-first-script","description":"Build a real disk-cleanup script step by step. Learn variables, conditionals, loops, error handling, and the safety preamble that prevents foot-guns.","publishedAt":"2026-04-29T18:47:15.710Z","updatedAt":"2026-05-15T20:41:21.539Z","category":"Linux"},{"title":"AWS VPC Explained — Subnets, Route Tables, and the Internet Gateway","url":"https://www.devopsness.com/blog/aws-vpc-explained-beginner-guide","description":"A working mental model for AWS VPCs — what each piece does, how they connect, and why \"VPC\" is the wrong mental model if you came from physical networks.","publishedAt":"2026-04-29T18:47:09.519Z","updatedAt":"2026-05-11T09:25:23.347Z","category":"Cloud"},{"title":"AWS S3 Tutorial — Buckets, Permissions, and Common Pitfalls","url":"https://www.devopsness.com/blog/aws-s3-tutorial-buckets-permissions","description":"Create your first S3 bucket, upload and download files, and set up the right access controls — without accidentally making everything public.","publishedAt":"2026-04-29T18:47:04.307Z","updatedAt":"2026-05-15T22:08:08.141Z","category":"Cloud"},{"title":"AWS Lambda — Deploy Your First Serverless Function","url":"https://www.devopsness.com/blog/aws-lambda-deploy-first-serverless-function","description":"Write, package, 
and deploy a Lambda function using only the AWS CLI. Trigger it via a public URL. Understand what serverless actually means.","publishedAt":"2026-04-29T18:46:58.792Z","updatedAt":"2026-05-18T01:08:50.123Z","category":"Cloud"},{"title":"Ansible Tutorial — Configure a Server in 30 Minutes","url":"https://www.devopsness.com/blog/ansible-tutorial-configure-server","description":"Install Ansible, write your first playbook, and configure a remote server (nginx + a deploy user) without touching it manually. The basics that scale up.","publishedAt":"2026-04-29T18:46:52.831Z","updatedAt":"2026-05-16T09:20:37.957Z","category":"Infrastructure"},{"title":"Feature Flags in Production — Provider Choice and Operational Reality","url":"https://www.devopsness.com/blog/feature-flags-in-production-operational-reality","description":"We use feature flags on roughly every customer-facing change. The provider tradeoff, the patterns that hold up, and the failure modes that show up only after a couple of years.","publishedAt":"2026-04-28T00:00:00.000Z","updatedAt":"2026-05-14T08:20:34.690Z","category":"DevOps"},{"title":"Distributed Tracing with OpenTelemetry — What We Ship, What We Skip","url":"https://www.devopsness.com/blog/distributed-tracing-opentelemetry-what-we-ship","description":"How we run OpenTelemetry across ~40 services. The instrumentation that earns its place, the patterns we abandoned, and what tracing actually catches that metrics don't.","publishedAt":"2026-04-27T00:00:00.000Z","updatedAt":"2026-05-14T07:27:13.672Z","category":"DevOps"},{"title":"Postgres Autovacuum — Tuning From Production Stalls","url":"https://www.devopsness.com/blog/postgres-autovacuum-tuning-from-production-stalls","description":"A 2 AM incident, the autovacuum settings that caused it, and the parameter changes that prevented the next one. 
The discipline that took our biggest Postgres host from periodic stalls to steady.","publishedAt":"2026-04-26T00:00:00.000Z","updatedAt":"2026-05-14T08:20:49.672Z","category":"Infrastructure"},{"title":"Fine-Tuning vs RAG vs Long-Context: A Decision Framework With Numbers","url":"https://www.devopsness.com/blog/fine-tuning-vs-rag-vs-long-context-a-decision-framework-with-numbers-2026-04-25","description":"We've shipped all three patterns to production. They're not interchangeable. Here's the framework we now use to decide which approach fits a given task.","publishedAt":"2026-04-25T12:00:00.000Z","updatedAt":"2026-05-18T01:42:17.540Z","category":"AI"},{"title":"Database Connection Pooling at Scale: PgBouncer, RDS Proxy, Application Pool","url":"https://www.devopsness.com/blog/database-connection-pooling-at-scale-pgbouncer-rds-proxy-application-pool-2026-04-24","description":"Three layers of pooling, three different jobs. We learned the hard way which to use when. Real numbers from an 8k-connection workload.","publishedAt":"2026-04-24T12:00:00.000Z","updatedAt":"2026-05-01T03:35:09.364Z","category":"DevOps"},{"title":"Backstage Adoption: From Demo to 80% Service Coverage in 6 Months","url":"https://www.devopsness.com/blog/backstage-adoption-from-demo-to-80-service-coverage-in-6-months-2026-04-23","description":"We launched Backstage in October. Six months in, 80% of services are catalogued, onboarding takes a third of the time, and we mostly know who owns what.","publishedAt":"2026-04-23T12:00:00.000Z","updatedAt":"2026-05-13T16:03:15.561Z","category":"Infrastructure"},{"title":"Cloudflare Workers vs Vercel Edge: A Latency-Cost Comparison","url":"https://www.devopsness.com/blog/cloudflare-workers-vs-vercel-edge-a-latency-cost-comparison-2026-04-22","description":"We deployed the same edge function on both platforms and measured for a quarter. 
Where each wins, where each loses, and the surprises along the way.","publishedAt":"2026-04-22T12:00:00.000Z","updatedAt":"2026-05-18T09:07:42.336Z","category":"Cloud"},{"title":"eBPF for SREs: Three Real Diagnoses That Saved Hours","url":"https://www.devopsness.com/blog/ebpf-for-sres-three-real-diagnoses-that-saved-hours-2026-04-21","description":"We started using eBPF tooling for ad-hoc production debugging six months ago. Three real incidents where it cut investigation time from hours to minutes.","publishedAt":"2026-04-21T12:00:00.000Z","updatedAt":"2026-05-13T16:15:54.816Z","category":"Linux"},{"title":"LLM Output Validation: Schema-First Prompt Engineering Patterns","url":"https://www.devopsness.com/blog/llm-output-validation-schema-first-prompt-engineering-patterns-2026-04-20","description":"We invalidate ~6% of LLM outputs before they reach a downstream system. Here's how we structure prompts and validators to catch malformed responses early.","publishedAt":"2026-04-20T12:00:00.000Z","updatedAt":"2026-04-30T11:50:13.861Z","category":"AI"},{"title":"Argo Rollouts: Canary Deployments That Caught a $40k Bug","url":"https://www.devopsness.com/blog/argo-rollouts-canary-deployments-that-caught-a-40k-bug-2026-04-19","description":"A two-line config change to an Argo Rollouts analysis template caught a regression that would have cost ~$40k in API spend before we noticed. Here's the pattern.","publishedAt":"2026-04-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.707Z","category":"DevOps"},{"title":"Pulumi vs Terraform: What 18 Months of Production Taught Us","url":"https://www.devopsness.com/blog/pulumi-vs-terraform-what-18-months-of-production-taught-us-2026-04-18","description":"We ran Pulumi in TypeScript and Terraform in HCL side by side across 60+ services. Each won different categories of work. 
Here's the breakdown.","publishedAt":"2026-04-18T12:00:00.000Z","updatedAt":"2026-05-16T19:12:14.145Z","category":"Infrastructure"},{"title":"GCP Workload Identity Federation: Replacing Service Account Keys","url":"https://www.devopsness.com/blog/gcp-workload-identity-federation-replacing-service-account-keys-2026-04-17","description":"We deleted every static GCP service account key in our org over six weeks. Here's the migration plan, the gotchas, and the policies we now enforce.","publishedAt":"2026-04-17T12:00:00.000Z","updatedAt":"2026-05-16T10:25:38.895Z","category":"Cloud"},{"title":"Linux Memory Management: When OOM Killer Strikes Your K8s Pods","url":"https://www.devopsness.com/blog/linux-memory-management-when-oom-killer-strikes-your-k8s-pods-2026-04-16","description":"Three production OOM incidents that taught us how kubelet, containerd, and the kernel actually decide which process dies. With debugging commands you'll wish you had earlier.","publishedAt":"2026-04-16T12:00:00.000Z","updatedAt":"2026-05-01T05:17:17.063Z","category":"Linux"},{"title":"GitHub Actions Self-Hosted Runners: Why We Switched and What Broke","url":"https://www.devopsness.com/blog/github-actions-self-hosted-runners-why-we-switched-and-what-broke-2026-04-15","description":"Bills hit $3,400/mo for runner minutes. We moved to self-hosted on EKS spot. The savings were real; the surprises were too.","publishedAt":"2026-04-15T12:00:00.000Z","updatedAt":"2026-05-16T11:27:32.270Z","category":"DevOps"},{"title":"Vector Database Selection: Pinecone, pgvector, Qdrant After 6 Months in Production","url":"https://www.devopsness.com/blog/vector-database-selection-pinecone-pgvector-qdrant-after-6-months-in-production-2026-04-14","description":"We ran the same RAG workload across three vector stores for a quarter each. 
Here's what we learned about latency, cost, and operational overhead.","publishedAt":"2026-04-14T12:00:00.000Z","updatedAt":"2026-05-14T15:28:09.223Z","category":"AI"},{"title":"Pre-Commit Hooks That Saved Our Repo: 7 Real Examples","url":"https://www.devopsness.com/blog/pre-commit-hooks-that-saved-our-repo-7-real-examples-2026-04-13","description":"Every hook on this list caught a bug or a security issue in the last twelve months. The configs are short. The savings have been considerable.","publishedAt":"2026-04-13T12:00:00.000Z","updatedAt":"2026-05-11T09:17:34.252Z","category":"DevOps"},{"title":"EKS Auto Mode: What Worked, What Broke in Our Migration","url":"https://www.devopsness.com/blog/eks-auto-mode-what-worked-what-broke-in-our-migration-2026-04-12","description":"We moved a 60-node production EKS cluster to Auto Mode. Some pain points evaporated, others got harder. The cost picture is more nuanced than the marketing suggests.","publishedAt":"2026-04-12T12:00:00.000Z","updatedAt":"2026-05-11T09:31:11.460Z","category":"Cloud"},{"title":"Self-Hosted LLMs vs OpenAI API: A Cost-vs-Latency Analysis After 6 Months","url":"https://www.devopsness.com/blog/self-hosted-llms-vs-openai-api-a-cost-vs-latency-analysis-after-6-months-2026-04-11","description":"We ran the same workload on both for half a year. The break-even point isn't where most blog posts say it is — and the latency story has more nuance than throughput-per-dollar charts admit.","publishedAt":"2026-04-11T12:00:00.000Z","updatedAt":"2026-05-09T13:42:04.163Z","category":"AI"},{"title":"OpenTelemetry Collector Pipelines: Real Configs That Survived Production","url":"https://www.devopsness.com/blog/opentelemetry-collector-pipelines-real-configs-that-survived-production-2026-04-10","description":"We've been running the OTel Collector at the edge of every cluster for 18 months. 
The config patterns that lasted, the ones we ripped out, and a few processors that quietly saved us money.","publishedAt":"2026-04-10T12:00:00.000Z","updatedAt":"2026-05-11T14:06:40.154Z","category":"DevOps"},{"title":"Blue/Green Deploys for Stateful Services: A Postgres Cutover Story","url":"https://www.devopsness.com/blog/blue-green-deploys-for-stateful-services-a-postgres-cutover-story-2026-04-09","description":"Blue/green is easy for stateless services. We did it for our primary Postgres cluster with 3.2TB of data and ~8k connections. Here's exactly how — and what almost went wrong.","publishedAt":"2026-04-09T12:00:00.000Z","updatedAt":"2026-05-09T14:48:39.161Z","category":"DevOps"},{"title":"systemd Timers vs Cron: When We Switched and What We Learned","url":"https://www.devopsness.com/blog/systemd-timers-vs-cron-when-we-switched-and-what-we-learned-2026-04-08","description":"We migrated 47 cron jobs to systemd timers across our fleet. The mechanical conversion was easy. The interesting parts were the bugs we found that cron had been hiding.","publishedAt":"2026-04-08T12:00:00.000Z","updatedAt":"2026-04-30T04:17:49.320Z","category":"Linux"},{"title":"Zero Trust on AWS: Lessons From Implementing IAM Identity Center","url":"https://www.devopsness.com/blog/zero-trust-on-aws-lessons-from-implementing-iam-identity-center-2026-04-07","description":"We replaced 14 long-lived IAM users with SSO + temporary credentials. The migration plan, the gotchas, and the policies we now enforce.","publishedAt":"2026-04-07T12:00:00.000Z","updatedAt":"2026-05-14T16:23:47.178Z","category":"Cloud"},{"title":"Embedding Quality in RAG: How We Cut Hallucinations by 60%","url":"https://www.devopsness.com/blog/embedding-quality-in-rag-how-we-cut-hallucinations-by-60-2026-04-06","description":"Six months running RAG in production taught us that the retrieval step matters far more than the model. 
Concrete techniques that moved the needle, with before/after numbers.","publishedAt":"2026-04-06T12:00:00.000Z","updatedAt":"2026-05-18T08:10:47.675Z","category":"AI"},{"title":"Database Migrations Without Downtime: Patterns From Three Real Cutovers","url":"https://www.devopsness.com/blog/database-migrations-without-downtime-patterns-from-three-real-cutovers-2026-04-05","description":"How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.","publishedAt":"2026-04-05T12:00:00.000Z","updatedAt":"2026-05-16T19:29:50.942Z","category":"Infrastructure"},{"title":"Monitoring That Actually Helps On-Call: Alerts, Dashboards, and Runbooks","url":"https://www.devopsness.com/blog/monitoring-that-actually-helps-on-call-alerts-dashboards-and-runbooks","description":"We were drowning in 200 alerts a week. Most got ignored. After a quarter of triage and rework, we're at about 15 — and on-call actually responds to them.","publishedAt":"2026-04-04T12:00:00.000Z","updatedAt":"2026-05-18T09:28:26.450Z","category":"Infrastructure"},{"title":"Secrets Management in Practice: From .env Files to Vault","url":"https://www.devopsness.com/blog/secrets-management-in-practice-from-env-files-to-vault","description":"We had .env files in three repos, AWS keys in Slack DMs, and a Postgres password etched into a Confluence page. Cleaning it up took a sprint and changed how we think about secrets.","publishedAt":"2026-04-03T12:00:00.000Z","updatedAt":"2026-05-14T10:40:08.304Z","category":"Cloud"},{"title":"Incident Postmortems That Actually Prevent Repeat Failures","url":"https://www.devopsness.com/blog/incident-postmortems-that-actually-prevent-repeat-failures","description":"We wrote pretty postmortems for two years and kept hitting the same incidents. 
Here's what changed when we started writing ugly ones.","publishedAt":"2026-04-02T12:00:00.000Z","updatedAt":"2026-04-26T18:12:49.387Z","category":"DevOps"},{"title":"Terraform Modules Done Right: Lessons from Managing 50+ Services","url":"https://www.devopsness.com/blog/terraform-modules-done-right-lessons-from-managing-50-services","description":"Practical patterns for Terraform modules at scale: versioning, composition, testing, and avoiding the monolith trap.","publishedAt":"2026-04-01T12:00:00.000Z","updatedAt":"2026-05-14T16:43:40.082Z","category":"Infrastructure"},{"title":"Linux Performance Troubleshooting: A Real Incident Walkthrough","url":"https://www.devopsness.com/blog/linux-performance-troubleshooting-a-real-incident-walkthrough","description":"Step-by-step debugging of a production Linux server hitting 100% CPU. From top to perf to the actual fix.","publishedAt":"2026-03-31T12:00:00.000Z","updatedAt":"2026-04-30T01:32:34.371Z","category":"Linux"},{"title":"Prompt Engineering Patterns That Actually Work in Production","url":"https://www.devopsness.com/blog/prompt-engineering-patterns-that-actually-work-in-production","description":"Battle-tested prompt patterns from running LLM features in production: structured output, chain-of-thought, and graceful failure handling.","publishedAt":"2026-03-30T12:00:00.000Z","updatedAt":"2026-05-14T16:10:26.172Z","category":"AI"},{"title":"AWS Cost Audit: 7 Things We Found Wasting Money Every Month","url":"https://www.devopsness.com/blog/aws-cost-audit-7-things-we-found-wasting-money-every-month","description":"A real cost audit uncovered idle load balancers, oversized RDS instances, and forgotten snapshots. 
Here's what we found and how we fixed each one.","publishedAt":"2026-03-29T12:00:00.000Z","updatedAt":"2026-04-30T07:25:59.937Z","category":"Cloud"},{"title":"How We Cut Our Docker Image Size by 80% and Why It Matters","url":"https://www.devopsness.com/blog/how-we-cut-our-docker-image-size-by-80-and-why-it-matters","description":"A real walkthrough of shrinking bloated Docker images from 1.2GB to 240MB using multi-stage builds, Alpine, and dependency auditing.","publishedAt":"2026-03-28T12:00:00.000Z","updatedAt":"2026-04-16T03:52:58.160Z","category":"DevOps"},{"title":"Model Fallback Policies for Customer-Facing AI: The Routing Rules That Kept SLA Intact","url":"https://www.devopsness.com/blog/model-fallback-policies-for-customer-facing-ai-the-routing-rules-that-kept-sla-intact-2026-03-27","description":"A real-world model fallback guide for customer-facing AI systems, covering how one team preserved response quality and support SLAs during a partial provider degradation.","publishedAt":"2026-03-27T12:00:00.000Z","updatedAt":"2026-05-16T08:55:35.248Z","category":"AI"},{"title":"Artifact Promotion Instead of Rebuilds: The Release Control Pattern That Stopped Drift","url":"https://www.devopsness.com/blog/artifact-promotion-instead-of-rebuilds-the-release-control-pattern-that-stopped-drift-2026-03-26","description":"A practical artifact promotion guide for CI/CD teams that were tired of hearing 'it passed in staging' after production behaved differently because the release was rebuilt.","publishedAt":"2026-03-26T12:00:00.000Z","updatedAt":"2026-05-18T08:27:50.668Z","category":"DevOps"},{"title":"RDS Restore Drills for Busy Teams: The Recovery Workflow That Surfaced Real Gaps","url":"https://www.devopsness.com/blog/rds-restore-drills-for-busy-teams-the-recovery-workflow-that-surfaced-real-gaps-2026-03-25","description":"A hands-on RDS restore drill guide for small cloud teams that thought backups were covered until a timed restore test exposed missing steps, DNS 
confusion, and stale credentials.","publishedAt":"2026-03-25T12:00:00.000Z","updatedAt":"2026-04-25T01:33:18.801Z","category":"Cloud"},{"title":"Systemd Drop-In Overrides for Vendor Services: The Supportable Linux Ops Pattern","url":"https://www.devopsness.com/blog/systemd-drop-in-overrides-for-vendor-services-the-supportable-linux-ops-pattern-2026-03-24","description":"A practical systemd drop-in guide built from a real operations problem: vendor unit files kept changing, but the team still needed consistent restart, environment, and logging behavior.","publishedAt":"2026-03-24T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.084Z","category":"Linux"},{"title":"Terraform Module Version Pinning: How One Platform Team Stopped Surprise Breakage","url":"https://www.devopsness.com/blog/terraform-module-version-pinning-how-one-platform-team-stopped-surprise-breakage-2026-03-23","description":"A real-world Terraform module version pinning guide for platform teams that want safer upgrades, clearer ownership, and fewer broken pipelines after shared module releases.","publishedAt":"2026-03-23T12:00:00.000Z","updatedAt":"2026-05-16T09:44:03.648Z","category":"Infrastructure"},{"title":"Embedding Model Upgrades Without Search Chaos: A Safer RAG Rollout Pattern","url":"https://www.devopsness.com/blog/embedding-model-upgrades-without-search-chaos-a-safer-rag-rollout-pattern-2026-03-22","description":"A practical embedding model upgrade guide for RAG systems, built from a real support-search migration that initially reduced answer quality instead of improving it.","publishedAt":"2026-03-22T12:00:00.000Z","updatedAt":"2026-05-17T05:33:36.888Z","category":"AI"},{"title":"Multi-Cluster Traffic Routing Strategies: A Pragmatic Rollout Pattern for Growing SaaS Teams","url":"https://www.devopsness.com/blog/multi-cluster-traffic-routing-strategies-a-pragmatic-rollout-pattern-for-growing-saas-teams-2026-03-21","description":"A real-world multi-cluster traffic routing guide for SaaS teams 
that have outgrown a single Kubernetes cluster and need safer rollout control without a service-mesh science project.","publishedAt":"2026-03-21T12:00:00.000Z","updatedAt":"2026-05-07T00:09:18.578Z","category":"Cloud"},{"title":"Terraform State Isolation by Environment: How We Stopped One Change from Hitting Prod","url":"https://www.devopsness.com/blog/terraform-state-isolation-by-environment-how-we-stopped-one-change-from-hitting-prod-2026-03-20","description":"A practical Terraform state isolation guide built from a real environment-mixing incident, with patterns for safer backends, clearer ownership, and lower blast radius.","publishedAt":"2026-03-20T12:00:00.000Z","updatedAt":"2026-05-07T09:42:42.685Z","category":"Infrastructure"},{"title":"Prompt Versioning and Regression Testing: How Teams Avoid Silent AI Regressions","url":"https://www.devopsness.com/blog/prompt-versioning-and-regression-testing-how-teams-avoid-silent-ai-regressions-2026-03-19","description":"A real-world guide to prompt versioning and regression testing for production AI features, focused on preventing the subtle changes that hurt quality long before anyone notices.","publishedAt":"2026-03-19T12:00:00.000Z","updatedAt":"2026-05-16T10:28:27.351Z","category":"AI"},{"title":"Systemd Service Reliability Patterns: What We Changed After Repeated Restart Loops","url":"https://www.devopsness.com/blog/systemd-service-reliability-patterns-what-we-changed-after-repeated-restart-loops-2026-03-18","description":"A practical systemd reliability guide for Linux services, built around repeated restart-loop incidents and the unit-file patterns that finally made those services boring.","publishedAt":"2026-03-18T12:00:00.000Z","updatedAt":"2026-05-07T16:03:53.859Z","category":"Linux"},{"title":"Blue-Green Deployment Guardrails in Kubernetes: Lessons from a Failed Friday 
Rollout","url":"https://www.devopsness.com/blog/blue-green-deployment-guardrails-in-kubernetes-lessons-from-a-failed-friday-rollout-2026-03-17","description":"A Kubernetes blue-green deployment guide built around a real rollout failure, showing the guardrails that matter when traffic shifting, health checks, and rollback timing all interact.","publishedAt":"2026-03-17T12:00:00.000Z","updatedAt":"2026-05-16T10:08:02.215Z","category":"DevOps"},{"title":"Cloud Disaster Recovery Runbook Design: How Small Teams Rehearse Multi-Region Failover","url":"https://www.devopsness.com/blog/cloud-disaster-recovery-runbook-design-how-small-teams-rehearse-multi-region-failover-2026-03-16","description":"A practical disaster recovery runbook guide for small cloud teams that need realistic failover steps, clear ownership, and repeatable rehearsals instead of shelfware documents.","publishedAt":"2026-03-16T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.243Z","category":"Cloud"},{"title":"RAG Retrieval Quality Evaluation: The Checks We Added After Bad Answers Reached Production","url":"https://www.devopsness.com/blog/rag-retrieval-quality-evaluation-the-checks-we-added-after-bad-answers-reached-production-2026-03-15","description":"A search-friendly guide to RAG retrieval quality evaluation, based on the moment one production assistant started citing stale documents and the team had to prove what 'good retrieval' meant.","publishedAt":"2026-03-15T12:00:00.000Z","updatedAt":"2026-05-11T11:40:36.838Z","category":"AI"},{"title":"Infrastructure Documentation as Code: How One Platform Team Reduced Audit Fire Drills","url":"https://www.devopsness.com/blog/infrastructure-documentation-as-code-how-one-platform-team-reduced-audit-fire-drills-2026-03-14","description":"This infrastructure documentation as code guide shows how a platform team moved runbooks, ownership maps, and architecture decisions into versioned workflows that people actually 
trusted.","publishedAt":"2026-03-14T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.256Z","category":"Infrastructure"},{"title":"Linux Patch Management for Production Fleets: A Real-World Maintenance Workflow","url":"https://www.devopsness.com/blog/linux-patch-management-for-production-fleets-a-real-world-maintenance-workflow-2026-03-13","description":"A production-tested Linux patch management workflow for teams that need security fixes without turning every maintenance window into a gamble.","publishedAt":"2026-03-13T12:00:00.000Z","updatedAt":"2026-05-10T11:58:40.868Z","category":"Linux"},{"title":"AWS Cost Allocation Tags for Shared Platforms: What Finally Worked","url":"https://www.devopsness.com/blog/aws-cost-allocation-tags-for-shared-platforms-what-finally-worked-2026-03-12","description":"A hands-on guide to AWS cost allocation tags for shared environments, built from a real platform-team problem: everyone used the cluster, but nobody trusted the bill.","publishedAt":"2026-03-12T12:00:00.000Z","updatedAt":"2026-05-06T07:37:26.441Z","category":"Cloud"},{"title":"GitHub Actions Monorepo CI: How We Cut Build Times Without Breaking Main","url":"https://www.devopsness.com/blog/github-actions-monorepo-ci-how-we-cut-build-times-without-breaking-main-2026-03-11","description":"A practical GitHub Actions monorepo CI guide built around a real scaling problem: long queues, noisy failures, and developers waiting 40 minutes for feedback.","publishedAt":"2026-03-11T12:00:00.000Z","updatedAt":"2026-05-14T20:50:33.076Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-46","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-03-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.352Z","category":"AI"},{"title":"How 
We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-45","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-03-09T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.078Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-45","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-03-08T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.353Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-45","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-03-07T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.087Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-45","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-03-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.832Z","category":"DevOps"},{"title":"Ansible and Infrastructure as Code: Idempotency and Best Practices","url":"https://www.devopsness.com/blog/ansible-and-infrastructure-as-code-idempotency-and-best-practices","description":"Write Ansible playbooks that are idempotent, readable, and maintainable for config management.","publishedAt":"2026-03-05T21:11:57.455Z","updatedAt":"2026-05-09T00:03:34.495Z","category":"Infrastructure"},{"title":"Real-World RAG 
Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-45","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-03-04T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.354Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-44","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-03-03T12:00:00.000Z","updatedAt":"2026-05-04T04:37:06.590Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-44","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-03-02T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.833Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-44","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-03-01T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.802Z","category":"Cloud"},{"title":"End-of-Week Engineering: Why Smart Tech Teams Don’t Ship Major Changes on Friday","url":"https://www.devopsness.com/blog/end-of-week-engineering-no-friday-deployments-2026-02-28","description":"A practical risk-management framework for release timing, Friday deployment policies, progressive delivery, and how elite teams protect reliability and 
people.","publishedAt":"2026-02-28T12:00:00.000Z","updatedAt":"2026-05-11T12:51:49.970Z","category":"DevOps"},{"title":"Kubernetes Cost Optimization for Teams: FinOps Tactics That Actually Work","url":"https://www.devopsness.com/blog/kubernetes-finops-cost-optimization-2026-02-27","description":"Cut Kubernetes spend without hurting reliability using a practical FinOps playbook for rightsizing, autoscaling guardrails, showback, and weekly waste cleanup.","publishedAt":"2026-02-27T10:00:00.000Z","updatedAt":"2026-04-24T06:59:14.717Z","category":"Cloud"},{"title":"SRE Error Budgets in Practice: Shipping Fast Without Burning Reliability","url":"https://www.devopsness.com/blog/sre-error-budgets-practical-guide-2026-02-26","description":"A practical way to define SLOs and error budgets, connect them to release decisions, and avoid reliability debates without data.","publishedAt":"2026-02-26T10:00:00.000Z","updatedAt":"2026-05-18T04:54:09.783Z","category":"DevOps"},{"title":"Platform Engineering with Backstage: Build a Useful Developer Portal","url":"https://www.devopsness.com/blog/platform-engineering-backstage-developer-portal-2026-02-25","description":"How to implement Backstage with real templates, scorecards, and golden paths so internal platform work reduces delivery friction.","publishedAt":"2026-02-25T10:00:00.000Z","updatedAt":"2026-05-09T01:16:39.261Z","category":"Infrastructure"},{"title":"GitHub Actions for Monorepos: Fast CI Without Pipeline Chaos","url":"https://www.devopsness.com/blog/github-actions-monorepo-fast-ci-2026-02-24","description":"A practical pattern for monorepo CI with path filters, matrix builds, caching, and deployment guards that keep feedback fast as teams scale.","publishedAt":"2026-02-24T10:00:00.000Z","updatedAt":"2026-05-11T11:00:21.155Z","category":"DevOps"},{"title":"Azure DevOps Best Practices in 2026: Build Pipelines You Can Trust","url":"https://www.devopsness.com/blog/azure-devops-best-practices-2026-02-23","description":"A 
production-focused guide to Azure DevOps: standardized YAML templates, secure service connections, rollout safety, and measurable delivery reliability.","publishedAt":"2026-02-23T10:00:00.000Z","updatedAt":"2026-05-17T14:20:17.030Z","category":"DevOps"},{"title":"AI Best Practices in 2026: Shipping Reliable Systems, Not Demo Magic","url":"https://www.devopsness.com/blog/ai-best-practices-2026-02-22-reliable-production-systems","description":"A practical production playbook for AI systems: evaluation gates, guardrails, observability, cost control, and reliable release management.","publishedAt":"2026-02-22T09:30:00.000Z","updatedAt":"2026-05-16T02:01:34.878Z","category":"AI"},{"title":"AI Best Practices for Engineering Teams: From Prompt Experiments to Platform Discipline","url":"https://www.devopsness.com/blog/ai-best-practices-2026-02-21-platform-discipline","description":"A practical field manual for engineering teams who want AI features that survive real users, incidents, and budgets — not just demo day.","publishedAt":"2026-02-21T09:30:00.000Z","updatedAt":"2026-05-09T20:05:03.734Z","category":"AI"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-44","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-02-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.805Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-44","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-02-18T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.583Z","category":"AI"},{"title":"How We Stopped 
Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-43","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-02-17T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.535Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-43","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-02-15T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.075Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-43","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-02-14T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.828Z","category":"Cloud"},{"title":"Kubernetes Networking: Services, Ingress, and Network Policies","url":"https://www.devopsness.com/blog/kubernetes-networking-services-ingress-and-network-policies","description":"Understand Kubernetes networking: ClusterIP, NodePort, LoadBalancer, Ingress, and policy.","publishedAt":"2026-02-13T07:21:17.596Z","updatedAt":"2026-05-13T16:13:18.138Z","category":"DevOps"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-43","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-02-11T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.806Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a 
Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-43","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-02-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.367Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-42","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-02-09T12:00:00.000Z","updatedAt":"2026-05-09T01:10:37.797Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-42","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-02-07T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.583Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-42","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-02-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.885Z","category":"Cloud"},{"title":"Infrastructure Cost Optimization: Reducing Cloud Spending","url":"https://www.devopsness.com/blog/infrastructure-cost-optimization-reducing-cloud-spending","description":"We cut our AWS bill by 38% in a quarter. 
The specific changes that moved the bill, ranked by impact, with what we'd do first.","publishedAt":"2026-02-05T16:17:55.440Z","updatedAt":"2026-05-17T01:15:30.630Z","category":"Infrastructure"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-42","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-02-03T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.815Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-42","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-02-02T12:00:00.000Z","updatedAt":"2026-04-30T12:44:21.032Z","category":"AI"},{"title":"Multi-Cloud Infrastructure: Managing Resources Across Providers","url":"https://www.devopsness.com/blog/multi-cloud-infrastructure-managing-resources-across-providers","description":"We run mostly on AWS but use GCP for specific workloads. 
The honest cost-benefit analysis of multi-cloud, plus the patterns that make it not awful.","publishedAt":"2026-02-01T16:17:55.440Z","updatedAt":"2026-04-27T07:48:11.589Z","category":"Infrastructure"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-41","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-01-31T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.520Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-41","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-01-30T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.596Z","category":"Linux"},{"title":"Disaster Recovery Planning: Building Resilient Infrastructure","url":"https://www.devopsness.com/blog/disaster-recovery-planning-building-resilient-infrastructure","description":"A different angle on DR: the planning process — RTO/RPO conversations, dependency mapping, and what we learned about prioritizing what to recover.","publishedAt":"2026-01-29T16:17:55.440Z","updatedAt":"2026-04-26T18:12:43.959Z","category":"Infrastructure"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-41","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-01-27T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.907Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD 
Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-41","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-01-26T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.818Z","category":"DevOps"},{"title":"Infrastructure Monitoring: Observability for IaC","url":"https://www.devopsness.com/blog/infrastructure-monitoring-observability-iac","description":"Defining monitoring as code: dashboards, alerts, and SLOs in Git. The patterns that survived the migration from clicked-together monitoring.","publishedAt":"2026-01-25T16:17:55.440Z","updatedAt":"2026-04-26T18:12:51.226Z","category":"Infrastructure"},{"title":"FinOps and Cloud Cost Management for Engineering Teams","url":"https://www.devopsness.com/blog/finops-and-cloud-cost-management-for-engineering-teams","description":"Embed cost ownership in engineering: tags, budgets, and showback.","publishedAt":"2026-01-23T17:30:37.737Z","updatedAt":"2026-05-10T11:48:02.190Z","category":"Cloud"},{"title":"Ansible Playbook Optimization: Writing Efficient Playbooks","url":"https://www.devopsness.com/blog/ansible-playbook-optimization-writing-efficient-playbooks","description":"We cut our largest playbook's runtime from 14 minutes to 4 minutes. 
The specific changes that mattered, plus the ones that didn't.","publishedAt":"2026-01-22T16:17:55.440Z","updatedAt":"2026-04-26T18:12:29.935Z","category":"Infrastructure"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-41","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-01-21T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.397Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-40","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-01-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.592Z","category":"Infrastructure"},{"title":"Pulumi vs Terraform Deep Dive: Choosing the Right IaC Tool","url":"https://www.devopsness.com/blog/pulumi-vs-terraform-deep-dive-choosing-right-iac-tool","description":"We tried Pulumi for a quarter and went back to Terraform. Both are real options. 
Why we picked one and what would change our mind.","publishedAt":"2026-01-18T16:17:55.440Z","updatedAt":"2026-04-26T18:13:04.361Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-40","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-01-17T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.954Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-40","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-01-16T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.897Z","category":"Cloud"},{"title":"Operational Checklist: Kubernetes Secrets and External Vault Integration","url":"https://www.devopsness.com/blog/operational-checklist-kubernetes-secrets-and-external-vault-integration","description":"K8s Secrets are only base64-encoded by default, not encrypted. We moved every secret to Vault with the Vault Agent injector and never went back. The setup checklist.","publishedAt":"2026-01-15T15:10:00.000Z","updatedAt":"2026-05-12T16:17:40.953Z","category":"DevOps"},{"title":"Infrastructure Testing Strategies: Validating Your IaC","url":"https://www.devopsness.com/blog/infrastructure-testing-strategies-validating-iac","description":"We test infrastructure code with three layers: validation, plan review, and integration tests. 
The setup that catches real bugs without slowing down PRs.","publishedAt":"2026-01-14T16:17:55.440Z","updatedAt":"2026-05-10T12:10:40.138Z","category":"Infrastructure"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-40","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-01-13T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.662Z","category":"DevOps"},{"title":"Terraform Modules Best Practices: Building Reusable Infrastructure","url":"https://www.devopsness.com/blog/terraform-modules-best-practices-building-reusable-infrastructure","description":"We have a private module registry with ~25 modules used across 12 accounts. Versioning, interface design, and the over-modularization mistake we keep making.","publishedAt":"2026-01-11T16:17:55.440Z","updatedAt":"2026-04-27T07:48:11.069Z","category":"Infrastructure"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-40","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-01-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.640Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-39","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-01-09T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.941Z","category":"Infrastructure"},{"title":"Linux Container Internals: Understanding How Containers 
Work","url":"https://www.devopsness.com/blog/linux-container-internals-understanding-how-containers-work","description":"A container is a process with extra kernel features applied. Walking through namespaces, cgroups, and the actual mechanics — the level of detail that makes \"container weirdness\" debuggable.","publishedAt":"2026-01-07T16:17:55.440Z","updatedAt":"2026-05-17T23:14:34.687Z","category":"Linux"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-39","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-01-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.426Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-39","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-01-05T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.798Z","category":"Cloud"},{"title":"Shell Scripting Best Practices: Writing Maintainable Scripts","url":"https://www.devopsness.com/blog/shell-scripting-best-practices-writing-maintainable-scripts","description":"We have a few hundred shell scripts in production. 
The patterns that make them survive contact with reality, and the ones we've stopped writing.","publishedAt":"2026-01-04T16:17:55.440Z","updatedAt":"2026-05-18T01:43:41.241Z","category":"Linux"},{"title":"Prompt Engineering for DevOps: Consistency and Safety","url":"https://www.devopsness.com/blog/prompt-engineering-for-devops-consistency-and-safety","description":"Use prompts to get reliable, safe outputs from LLMs for runbooks, code, and ops tasks.","publishedAt":"2026-01-03T03:39:57.879Z","updatedAt":"2026-04-27T07:48:11.597Z","category":"AI"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-39","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-01-02T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.797Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-39","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-01-01T12:00:00.000Z","updatedAt":"2026-05-14T12:22:34.844Z","category":"AI"},{"title":"File System Optimization: Improving Disk Performance","url":"https://www.devopsness.com/blog/file-system-optimization-improving-disk-performance","description":"Filesystem choice, mount options, IO schedulers — the per-host tweaks that actually moved disk performance for our database and storage workloads.","publishedAt":"2025-12-31T16:17:55.440Z","updatedAt":"2026-05-13T08:48:10.947Z","category":"Linux"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-38","description":"A 
real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-12-30T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.571Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-38","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-12-29T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.372Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-38","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-12-28T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.651Z","category":"Cloud"},{"title":"Process Management and Monitoring in Linux","url":"https://www.devopsness.com/blog/process-management-monitoring-linux","description":"How processes actually live and die on Linux, the tools that show what's happening, and the patterns we use for monitoring service health.","publishedAt":"2025-12-27T16:17:55.440Z","updatedAt":"2026-05-16T09:43:50.592Z","category":"Linux"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-38","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-12-26T12:00:00.000Z","updatedAt":"2026-05-10T13:46:27.433Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-38","description":"A field report from 
rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-12-25T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.370Z","category":"AI"},{"title":"Linux Security Hardening: Protecting Your System","url":"https://www.devopsness.com/blog/linux-security-hardening-protecting-system","description":"A practical Linux hardening checklist for production hosts. The settings that earn their place via real production reasons, not the cargo-cult version.","publishedAt":"2025-12-24T16:17:55.440Z","updatedAt":"2026-05-16T09:20:10.150Z","category":"Linux"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-37","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-12-23T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.458Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-37","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-12-22T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.354Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-37","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-12-21T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.631Z","category":"Cloud"},{"title":"Operational Checklist: Systemd Service Reliability Patterns","url":"https://www.devopsness.com/blog/operational-checklist-systemd-service-reliability-patterns","description":"A condensed checklist 
of the systemd unit-file patterns we now use everywhere, with the production reasons each one matters.","publishedAt":"2025-12-20T16:21:00.000Z","updatedAt":"2026-04-27T07:48:11.072Z","category":"Linux"},{"title":"Network Configuration and Troubleshooting in Linux","url":"https://www.devopsness.com/blog/network-configuration-troubleshooting-linux","description":"A systematic approach to debugging Linux network issues. The tools that earn their place and the order I use them in.","publishedAt":"2025-12-20T16:17:55.440Z","updatedAt":"2026-04-26T18:12:57.825Z","category":"Linux"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-37","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-12-19T12:00:00.000Z","updatedAt":"2026-05-13T15:09:32.023Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-37","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-12-18T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.091Z","category":"AI"},{"title":"Linux Performance Tuning: Optimizing System Performance","url":"https://www.devopsness.com/blog/linux-performance-tuning-optimizing-system-performance","description":"A practical Linux performance tuning playbook for production servers. 
The kernel parameters, disk and network tweaks that earn their place, and the ones that turned out to be folklore.","publishedAt":"2025-12-17T16:17:55.440Z","updatedAt":"2026-05-15T21:32:31.287Z","category":"Linux"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-36","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-12-15T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.406Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-36","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-12-14T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.353Z","category":"Linux"},{"title":"Systemd Service Management: Creating and Managing Services","url":"https://www.devopsness.com/blog/systemd-service-management-creating-managing-services","description":"A practical guide to writing and managing systemd services for production. 
The unit file features that earn their place, plus the operational workflows.","publishedAt":"2025-12-13T16:17:55.440Z","updatedAt":"2026-04-26T18:13:08.717Z","category":"Linux"},{"title":"Systemd and Modern Linux Service Management","url":"https://www.devopsness.com/blog/systemd-and-modern-linux-service-management","description":"Run services reliably with systemd: units, dependencies, and resource limits.","publishedAt":"2025-12-13T13:49:18.020Z","updatedAt":"2026-04-27T17:29:24.729Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-36","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-12-12T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.644Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-36","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-12-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.420Z","category":"DevOps"},{"title":"Edge Computing with AWS: CloudFront and Lambda@Edge","url":"https://www.devopsness.com/blog/edge-computing-aws-cloudfront-lambda-edge","description":"We use CloudFront + Lambda@Edge for specific patterns. 
The wins, the production gotchas, and where we hit Lambda@Edge's limits.","publishedAt":"2025-12-09T16:17:55.440Z","updatedAt":"2026-04-26T18:12:46.184Z","category":"Cloud"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-36","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-12-08T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.092Z","category":"AI"},{"title":"Cloud-Native Databases: Choosing the Right Database for Your Workload","url":"https://www.devopsness.com/blog/cloud-native-databases-choosing-right-database-workload","description":"Postgres, DynamoDB, Redis, Elasticsearch, Snowflake. We use all five for different workloads. The decision criteria, not the marketing comparison.","publishedAt":"2025-12-06T16:17:55.440Z","updatedAt":"2026-05-16T20:35:29.739Z","category":"Cloud"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-35","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-12-05T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.576Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-35","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-12-03T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.256Z","category":"Linux"},{"title":"Disaster Recovery in the Cloud: Backup and Recovery Strategies","url":"https://www.devopsness.com/blog/disaster-recovery-cloud-backup-recovery-strategies","description":"We've 
executed real disaster recoveries twice. The plan that survived contact with reality, and what was wrong about the plans we had before that.","publishedAt":"2025-12-02T16:17:55.440Z","updatedAt":"2026-05-18T07:45:51.893Z","category":"Cloud"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-35","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-12-01T12:00:00.000Z","updatedAt":"2026-05-04T21:41:28.576Z","category":"Cloud"},{"title":"Cloud Networking Fundamentals: VPCs, Subnets, and Routing","url":"https://www.devopsness.com/blog/cloud-networking-fundamentals-vpcs-subnets-routing","description":"VPCs, subnets, route tables, gateways. The mental model that finally made cloud networking click after I stopped trying to map it 1:1 to physical networks.","publishedAt":"2025-11-29T16:17:55.440Z","updatedAt":"2026-04-26T18:12:39.454Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-35","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-11-28T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.221Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-35","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-11-27T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.080Z","category":"AI"},{"title":"AWS ECS vs EKS: Choosing the Right Container 
Platform","url":"https://www.devopsness.com/blog/aws-ecs-vs-eks-choosing-right-container-platform","description":"We run both ECS and EKS in production. Which we use for what, and the actual decision criteria — not the marketing comparison.","publishedAt":"2025-11-25T16:17:55.440Z","updatedAt":"2026-04-26T18:12:31.554Z","category":"Cloud"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-34","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-11-24T12:00:00.000Z","updatedAt":"2026-05-04T13:50:45.440Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-34","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-11-23T12:00:00.000Z","updatedAt":"2026-05-04T08:58:49.758Z","category":"Linux"},{"title":"Container Image Scanning in CI and at Runtime","url":"https://www.devopsness.com/blog/container-image-scanning-in-ci-and-at-runtime","description":"Shift-left security with image scanning. 
Trivy, policy gates, and runtime integration.","publishedAt":"2025-11-22T23:58:38.161Z","updatedAt":"2026-05-16T11:19:09.372Z","category":"DevOps"},{"title":"Cloud Security Best Practices: Securing Your AWS Infrastructure","url":"https://www.devopsness.com/blog/cloud-security-best-practices-securing-aws-infrastructure","description":"A working AWS security baseline, derived from the actual incidents we've had and the audit findings we've cleared.","publishedAt":"2025-11-21T16:17:55.440Z","updatedAt":"2026-04-26T18:12:40.479Z","category":"Cloud"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-34","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-11-20T12:00:00.000Z","updatedAt":"2026-05-17T17:06:30.932Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-34","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-11-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.809Z","category":"DevOps"},{"title":"Serverless Architecture Patterns: Building Scalable Applications","url":"https://www.devopsness.com/blog/serverless-architecture-patterns-building-scalable-applications","description":"We use serverless for specific patterns, not as a default. 
The patterns where it shines, the ones it doesn't, and the gotchas at production scale.","publishedAt":"2025-11-18T16:17:55.440Z","updatedAt":"2026-04-27T07:48:11.375Z","category":"Cloud"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-34","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-11-17T12:00:00.000Z","updatedAt":"2026-05-02T00:54:39.002Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-33","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-11-16T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.416Z","category":"Infrastructure"},{"title":"Cloud Cost Monitoring: Tracking and Optimizing AWS Spending","url":"https://www.devopsness.com/blog/cloud-cost-monitoring-tracking-optimizing-aws-spending","description":"Building visibility into cloud costs that actually drives action. 
The dashboards we look at, the alerts that fire, and the queries we run.","publishedAt":"2025-11-14T16:17:55.440Z","updatedAt":"2026-05-06T09:22:03.280Z","category":"Cloud"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-33","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-11-13T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.084Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-33","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-11-12T12:00:00.000Z","updatedAt":"2026-05-11T07:22:44.814Z","category":"Cloud"},{"title":"Multi-Region Deployment: Building Resilient Cloud Applications","url":"https://www.devopsness.com/blog/multi-region-deployment-building-resilient-cloud-applications","description":"We run our app in two AWS regions for failover. 
The hard parts aren't the deployment — they're data consistency, traffic shifting, and the assumptions that break when \"primary\" is suddenly the wrong region.","publishedAt":"2025-11-11T16:17:55.440Z","updatedAt":"2026-05-16T03:57:37.193Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-33","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-11-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.428Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-33","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-11-09T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.264Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-32","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-11-08T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.421Z","category":"Infrastructure"},{"title":"AWS Lambda Optimization: Reducing Costs and Improving Performance","url":"https://www.devopsness.com/blog/aws-lambda-optimization-reducing-costs-improving-performance","description":"We run ~200 Lambda functions. 
Cold starts, memory tuning, and the cost-vs-latency trade-offs that actually move the bill.","publishedAt":"2025-11-07T16:17:55.440Z","updatedAt":"2026-04-27T07:48:12.639Z","category":"Cloud"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-32","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-11-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.654Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-32","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-11-05T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.827Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-32","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-11-04T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.799Z","category":"DevOps"},{"title":"DevOps Metrics and KPIs: Measuring Success","url":"https://www.devopsness.com/blog/devops-metrics-kpis-measuring-success","description":"We track the four DORA metrics plus a handful of others. The trade-off between what's measurable and what's meaningful, and how we use the numbers.","publishedAt":"2025-11-03T16:17:55.440Z","updatedAt":"2026-04-26T18:12:43.147Z","category":"DevOps"},{"title":"Multi-Region Resilience: Failover, Data, and DNS","url":"https://www.devopsness.com/blog/multi-region-resilience-failover-data-and-dns","description":"Design for region failure. 
Active/passive and active/active, data replication, and failover testing.","publishedAt":"2025-11-02T10:07:58.303Z","updatedAt":"2026-04-27T07:48:08.418Z","category":"Cloud"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-32","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-11-01T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.240Z","category":"AI"},{"title":"Canary Releases: Gradual Rollout Strategy","url":"https://www.devopsness.com/blog/canary-releases-gradual-rollout-strategy","description":"We've run canary deploys on most services for two years. The mechanics are easy; the metrics that decide \"promote or roll back\" are where the design is.","publishedAt":"2025-10-31T16:17:55.440Z","updatedAt":"2026-05-11T20:50:32.529Z","category":"DevOps"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-31","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-10-30T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.241Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-31","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-10-28T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.094Z","category":"Linux"},{"title":"Blue-Green Deployments: Zero-Downtime Releases","url":"https://www.devopsness.com/blog/blue-green-deployments-zero-downtime-releases","description":"We use blue-green for stateful services where canary doesn't 
fit. The actual mechanics, the data-layer subtleties, and when blue-green isn't the right answer.","publishedAt":"2025-10-27T16:17:55.440Z","updatedAt":"2026-05-14T16:08:08.625Z","category":"DevOps"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-31","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-10-25T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.838Z","category":"Cloud"},{"title":"Log Aggregation Strategies: Centralizing Your Logs","url":"https://www.devopsness.com/blog/log-aggregation-strategies-centralizing-logs","description":"We collect ~800GB of logs per day across our fleet. The shape of our logging stack, what we keep, what we drop, and what we'd build differently.","publishedAt":"2025-10-24T16:17:55.440Z","updatedAt":"2026-05-04T12:20:53.439Z","category":"DevOps"}]}