{"name":"DevOpsNess","description":"Practical tutorials and articles on AI, DevOps, cloud, Linux, and infrastructure.","url":"https://www.devopsness.com","contentCount":200,"content":[{"title":"Fine-Tuning vs RAG vs Long-Context: A Decision Framework With Numbers","url":"https://www.devopsness.com/blog/fine-tuning-vs-rag-vs-long-context-a-decision-framework-with-numbers-2026-04-25","description":"We've shipped all three patterns to production. They're not interchangeable. Here's the framework we now use to decide which approach fits a given task.","publishedAt":"2026-04-25T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.673Z","category":"AI"},{"title":"Database Connection Pooling at Scale: PgBouncer, RDS Proxy, Application Pool","url":"https://www.devopsness.com/blog/database-connection-pooling-at-scale-pgbouncer-rds-proxy-application-pool-2026-04-24","description":"Three layers of pooling, three different jobs. We learned the hard way which to use when. Real numbers from an 8k-connection workload.","publishedAt":"2026-04-24T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.667Z","category":"DevOps"},{"title":"Backstage Adoption: From Demo to 80% Service Coverage in 6 Months","url":"https://www.devopsness.com/blog/backstage-adoption-from-demo-to-80-service-coverage-in-6-months-2026-04-23","description":"We launched Backstage in October. Six months in, 80% of services are catalogued, on-boarding takes a third of the time, and we mostly know what owns what.","publishedAt":"2026-04-23T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.671Z","category":"Infrastructure"},{"title":"Cloudflare Workers vs Vercel Edge: A Latency-Cost Comparison","url":"https://www.devopsness.com/blog/cloudflare-workers-vs-vercel-edge-a-latency-cost-comparison-2026-04-22","description":"We deployed the same edge function on both platforms and measured for a quarter. 
Where each wins, where each loses, and the surprises along the way.","publishedAt":"2026-04-22T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.676Z","category":"Cloud"},{"title":"eBPF for SREs: Three Real Diagnoses That Saved Hours","url":"https://www.devopsness.com/blog/ebpf-for-sres-three-real-diagnoses-that-saved-hours-2026-04-21","description":"We started using eBPF tooling for ad-hoc production debugging six months ago. Three real incidents where it cut investigation time from hours to minutes.","publishedAt":"2026-04-21T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.691Z","category":"Linux"},{"title":"LLM Output Validation: Schema-First Prompt Engineering Patterns","url":"https://www.devopsness.com/blog/llm-output-validation-schema-first-prompt-engineering-patterns-2026-04-20","description":"We invalidate ~6% of LLM outputs before they reach a downstream system. Here's how we structure prompts and validators to catch malformed responses early.","publishedAt":"2026-04-20T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.664Z","category":"AI"},{"title":"Argo Rollouts: Canary Deployments That Caught a $40k Bug","url":"https://www.devopsness.com/blog/argo-rollouts-canary-deployments-that-caught-a-40k-bug-2026-04-19","description":"A two-line config change to an Argo Rollouts analysis template caught a regression that would have cost ~$40k in API spend before we noticed. Here's the pattern.","publishedAt":"2026-04-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.707Z","category":"DevOps"},{"title":"Pulumi vs Terraform: What 18 Months of Production Taught Us","url":"https://www.devopsness.com/blog/pulumi-vs-terraform-what-18-months-of-production-taught-us-2026-04-18","description":"We ran Pulumi in TypeScript and Terraform in HCL side by side across 60+ services. Each won different categories of work. 
Here's the breakdown.","publishedAt":"2026-04-18T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.656Z","category":"Infrastructure"},{"title":"GCP Workload Identity Federation: Replacing Service Account Keys","url":"https://www.devopsness.com/blog/gcp-workload-identity-federation-replacing-service-account-keys-2026-04-17","description":"We deleted every static GCP service account key in our org over six weeks. Here's the migration plan, the gotchas, and the policies we now enforce.","publishedAt":"2026-04-17T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.659Z","category":"Cloud"},{"title":"Linux Memory Management: When OOM Killer Strikes Your K8s Pods","url":"https://www.devopsness.com/blog/linux-memory-management-when-oom-killer-strikes-your-k8s-pods-2026-04-16","description":"Three production OOM incidents that taught us how kubelet, containerd, and the kernel actually decide which process dies. With debugging commands you'll wish you had earlier.","publishedAt":"2026-04-16T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.673Z","category":"Linux"},{"title":"GitHub Actions Self-Hosted Runners: Why We Switched and What Broke","url":"https://www.devopsness.com/blog/github-actions-self-hosted-runners-why-we-switched-and-what-broke-2026-04-15","description":"Bills hit $3,400/mo for runner minutes. We moved to self-hosted on EKS spot. The savings were real; the surprises were too.","publishedAt":"2026-04-15T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.678Z","category":"DevOps"},{"title":"Vector Database Selection: Pinecone, pgvector, Qdrant After 6 Months in Production","url":"https://www.devopsness.com/blog/vector-database-selection-pinecone-pgvector-qdrant-after-6-months-in-production-2026-04-14","description":"We ran the same RAG workload across three vector stores for a quarter each. 
Here's what we learned about latency, cost, and operational overhead.","publishedAt":"2026-04-14T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.682Z","category":"AI"},{"title":"Pre-Commit Hooks That Saved Our Repo: 7 Real Examples","url":"https://www.devopsness.com/blog/pre-commit-hooks-that-saved-our-repo-7-real-examples-2026-04-13","description":"Every hook on this list caught a bug or a security issue in the last twelve months. The configs are short. The savings have been considerable.","publishedAt":"2026-04-13T12:00:00.000Z","updatedAt":"2026-04-27T10:01:39.757Z","category":"DevOps"},{"title":"EKS Auto Mode: What Worked, What Broke in Our Migration","url":"https://www.devopsness.com/blog/eks-auto-mode-what-worked-what-broke-in-our-migration-2026-04-12","description":"We moved a 60-node production EKS cluster to Auto Mode. Some pain points evaporated, others got harder. The cost picture is more nuanced than the marketing suggests.","publishedAt":"2026-04-12T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.898Z","category":"Cloud"},{"title":"Self-Hosted LLMs vs OpenAI API: A Cost-vs-Latency Analysis After 6 Months","url":"https://www.devopsness.com/blog/self-hosted-llms-vs-openai-api-a-cost-vs-latency-analysis-after-6-months-2026-04-11","description":"We ran the same workload on both for half a year. The break-even point isn't where most blog posts say it is — and the latency story has more nuance than throughput-per-dollar charts admit.","publishedAt":"2026-04-11T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.893Z","category":"AI"},{"title":"OpenTelemetry Collector Pipelines: Real Configs That Survived Production","url":"https://www.devopsness.com/blog/opentelemetry-collector-pipelines-real-configs-that-survived-production-2026-04-10","description":"We've been running the OTel Collector at the edge of every cluster for 18 months. 
The config patterns that lasted, the ones we ripped out, and a few processors that quietly saved us money.","publishedAt":"2026-04-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.912Z","category":"DevOps"},{"title":"Blue/Green Deploys for Stateful Services: A Postgres Cutover Story","url":"https://www.devopsness.com/blog/blue-green-deploys-for-stateful-services-a-postgres-cutover-story-2026-04-09","description":"Blue/green is easy for stateless services. We did it for our primary Postgres cluster with 3.2TB of data and ~8k connections. Here's exactly how — and what almost went wrong.","publishedAt":"2026-04-09T12:00:00.000Z","updatedAt":"2026-04-27T09:36:07.795Z","category":"DevOps"},{"title":"systemd Timers vs Cron: When We Switched and What We Learned","url":"https://www.devopsness.com/blog/systemd-timers-vs-cron-when-we-switched-and-what-we-learned-2026-04-08","description":"We migrated 47 cron jobs to systemd timers across our fleet. The mechanical conversion was easy. The interesting parts were the bugs we found that cron had been hiding.","publishedAt":"2026-04-08T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.657Z","category":"Linux"},{"title":"Zero Trust on AWS: Lessons From Implementing IAM Identity Center","url":"https://www.devopsness.com/blog/zero-trust-on-aws-lessons-from-implementing-iam-identity-center-2026-04-07","description":"We replaced 14 long-lived IAM users with SSO + temporary credentials. The migration plan, the gotchas, and the policies we now enforce.","publishedAt":"2026-04-07T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.675Z","category":"Cloud"},{"title":"Embedding Quality in RAG: How We Cut Hallucinations by 60%","url":"https://www.devopsness.com/blog/embedding-quality-in-rag-how-we-cut-hallucinations-by-60-2026-04-06","description":"Six months running RAG in production taught us that the retrieval step matters far more than the model. 
Concrete techniques that moved the needle, with before/after numbers.","publishedAt":"2026-04-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.678Z","category":"AI"},{"title":"Database Migrations Without Downtime: Patterns From Three Real Cutovers","url":"https://www.devopsness.com/blog/database-migrations-without-downtime-patterns-from-three-real-cutovers-2026-04-05","description":"How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.","publishedAt":"2026-04-05T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.896Z","category":"Infrastructure"},{"title":"Monitoring That Actually Helps On-Call: Alerts, Dashboards, and Runbooks","url":"https://www.devopsness.com/blog/monitoring-that-actually-helps-on-call-alerts-dashboards-and-runbooks","description":"We were drowning in 200 alerts a week. Most got ignored. After a quarter of triage and rework, we're at about 15 — and on-call actually responds to them.","publishedAt":"2026-04-04T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.723Z","category":"Infrastructure"},{"title":"Secrets Management in Practice: From .env Files to Vault","url":"https://www.devopsness.com/blog/secrets-management-in-practice-from-env-files-to-vault","description":"We had .env files in three repos, AWS keys in Slack DMs, and a postgres password etched into a Confluence page. Cleaning it up took a sprint and changed how we think about secrets.","publishedAt":"2026-04-03T12:00:00.000Z","updatedAt":"2026-04-26T18:13:05.169Z","category":"Cloud"},{"title":"Incident Postmortems That Actually Prevent Repeat Failures","url":"https://www.devopsness.com/blog/incident-postmortems-that-actually-prevent-repeat-failures","description":"We wrote pretty postmortems for two years and kept hitting the same incidents. 
Here's what changed when we started writing ugly ones.","publishedAt":"2026-04-02T12:00:00.000Z","updatedAt":"2026-04-26T18:12:49.387Z","category":"DevOps"},{"title":"Terraform Modules Done Right: Lessons from Managing 50+ Services","url":"https://www.devopsness.com/blog/terraform-modules-done-right-lessons-from-managing-50-services","description":"Practical patterns for Terraform modules at scale: versioning, composition, testing, and avoiding the monolith trap.","publishedAt":"2026-04-01T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.358Z","category":"Infrastructure"},{"title":"Linux Performance Troubleshooting: A Real Incident Walkthrough","url":"https://www.devopsness.com/blog/linux-performance-troubleshooting-a-real-incident-walkthrough","description":"Step-by-step debugging of a production Linux server hitting 100% CPU. From top to perf to the actual fix.","publishedAt":"2026-03-31T12:00:00.000Z","updatedAt":"2026-04-27T07:48:05.974Z","category":"Linux"},{"title":"Prompt Engineering Patterns That Actually Work in Production","url":"https://www.devopsness.com/blog/prompt-engineering-patterns-that-actually-work-in-production","description":"Battle-tested prompt patterns from running LLM features in production: structured output, chain-of-thought, and graceful failure handling.","publishedAt":"2026-03-30T12:00:00.000Z","updatedAt":"2026-04-14T23:03:27.130Z","category":"AI"},{"title":"AWS Cost Audit: 7 Things We Found Wasting Money Every Month","url":"https://www.devopsness.com/blog/aws-cost-audit-7-things-we-found-wasting-money-every-month","description":"A real cost audit uncovered idle load balancers, oversized RDS instances, and forgotten snapshots. 
Here's what we found and how we fixed each one.","publishedAt":"2026-03-29T12:00:00.000Z","updatedAt":"2026-04-27T07:50:45.224Z","category":"Cloud"},{"title":"How We Cut Our Docker Image Size by 80% and Why It Matters","url":"https://www.devopsness.com/blog/how-we-cut-our-docker-image-size-by-80-and-why-it-matters","description":"A real walkthrough of shrinking bloated Docker images from 1.2GB to 240MB using multi-stage builds, Alpine, and dependency auditing.","publishedAt":"2026-03-28T12:00:00.000Z","updatedAt":"2026-04-16T03:52:58.160Z","category":"DevOps"},{"title":"Model Fallback Policies for Customer-Facing AI: The Routing Rules That Kept SLA Intact","url":"https://www.devopsness.com/blog/model-fallback-policies-for-customer-facing-ai-the-routing-rules-that-kept-sla-intact-2026-03-27","description":"A real-world model fallback guide for customer-facing AI systems, covering how one team preserved response quality and support SLAs during a partial provider degradation.","publishedAt":"2026-03-27T12:00:00.000Z","updatedAt":"2026-04-26T18:31:51.542Z","category":"AI"},{"title":"Artifact Promotion Instead of Rebuilds: The Release Control Pattern That Stopped Drift","url":"https://www.devopsness.com/blog/artifact-promotion-instead-of-rebuilds-the-release-control-pattern-that-stopped-drift-2026-03-26","description":"A practical artifact promotion guide for CI/CD teams that were tired of hearing 'it passed in staging' after production behaved differently because the release was rebuilt.","publishedAt":"2026-03-26T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.355Z","category":"DevOps"},{"title":"RDS Restore Drills for Busy Teams: The Recovery Workflow That Surfaced Real Gaps","url":"https://www.devopsness.com/blog/rds-restore-drills-for-busy-teams-the-recovery-workflow-that-surfaced-real-gaps-2026-03-25","description":"A hands-on RDS restore drill guide for small cloud teams that thought backups were covered until a timed restore test exposed missing steps, DNS 
confusion, and stale credentials.","publishedAt":"2026-03-25T12:00:00.000Z","updatedAt":"2026-04-25T01:33:18.801Z","category":"Cloud"},{"title":"Systemd Drop-In Overrides for Vendor Services: The Supportable Linux Ops Pattern","url":"https://www.devopsness.com/blog/systemd-drop-in-overrides-for-vendor-services-the-supportable-linux-ops-pattern-2026-03-24","description":"A practical systemd drop-in guide built from a real operations problem: vendor unit files kept changing, but the team still needed consistent restart, environment, and logging behavior.","publishedAt":"2026-03-24T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.084Z","category":"Linux"},{"title":"Terraform Module Version Pinning: How One Platform Team Stopped Surprise Breakage","url":"https://www.devopsness.com/blog/terraform-module-version-pinning-how-one-platform-team-stopped-surprise-breakage-2026-03-23","description":"A real-world Terraform module version pinning guide for platform teams that want safer upgrades, clearer ownership, and fewer broken pipelines after shared module releases.","publishedAt":"2026-03-23T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.662Z","category":"Infrastructure"},{"title":"Embedding Model Upgrades Without Search Chaos: A Safer RAG Rollout Pattern","url":"https://www.devopsness.com/blog/embedding-model-upgrades-without-search-chaos-a-safer-rag-rollout-pattern-2026-03-22","description":"A practical embedding model upgrade guide for RAG systems, built from a real support-search migration that initially reduced answer quality instead of improving it.","publishedAt":"2026-03-22T12:00:00.000Z","updatedAt":"2026-04-25T11:45:34.615Z","category":"AI"},{"title":"Multi-Cluster Traffic Routing Strategies: A Pragmatic Rollout Pattern for Growing SaaS Teams","url":"https://www.devopsness.com/blog/multi-cluster-traffic-routing-strategies-a-pragmatic-rollout-pattern-for-growing-saas-teams-2026-03-21","description":"A real-world multi-cluster traffic routing guide for SaaS teams 
that have outgrown a single Kubernetes cluster and need safer rollout control without a service-mesh science project.","publishedAt":"2026-03-21T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.262Z","category":"Cloud"},{"title":"Terraform State Isolation by Environment: How We Stopped One Change from Hitting Prod","url":"https://www.devopsness.com/blog/terraform-state-isolation-by-environment-how-we-stopped-one-change-from-hitting-prod-2026-03-20","description":"A practical Terraform state isolation guide built from a real environment-mixing incident, with patterns for safer backends, clearer ownership, and lower blast radius.","publishedAt":"2026-03-20T12:00:00.000Z","updatedAt":"2026-04-27T09:06:48.051Z","category":"Infrastructure"},{"title":"Prompt Versioning and Regression Testing: How Teams Avoid Silent AI Regressions","url":"https://www.devopsness.com/blog/prompt-versioning-and-regression-testing-how-teams-avoid-silent-ai-regressions-2026-03-19","description":"A real-world guide to prompt versioning and regression testing for production AI features, focused on preventing the subtle changes that hurt quality long before anyone notices.","publishedAt":"2026-03-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.090Z","category":"AI"},{"title":"Systemd Service Reliability Patterns: What We Changed After Repeated Restart Loops","url":"https://www.devopsness.com/blog/systemd-service-reliability-patterns-what-we-changed-after-repeated-restart-loops-2026-03-18","description":"A practical systemd reliability guide for Linux services, built around repeated restart-loop incidents and the unit-file patterns that finally made those services boring.","publishedAt":"2026-03-18T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.257Z","category":"Linux"},{"title":"Blue-Green Deployment Guardrails in Kubernetes: Lessons from a Failed Friday 
Rollout","url":"https://www.devopsness.com/blog/blue-green-deployment-guardrails-in-kubernetes-lessons-from-a-failed-friday-rollout-2026-03-17","description":"A Kubernetes blue-green deployment guide built around a real rollout failure, showing the guardrails that matter when traffic shifting, health checks, and rollback timing all interact.","publishedAt":"2026-03-17T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.599Z","category":"DevOps"},{"title":"Cloud Disaster Recovery Runbook Design: How Small Teams Rehearse Multi-Region Failover","url":"https://www.devopsness.com/blog/cloud-disaster-recovery-runbook-design-how-small-teams-rehearse-multi-region-failover-2026-03-16","description":"A practical disaster recovery runbook guide for small cloud teams that need realistic failover steps, clear ownership, and repeatable rehearsals instead of shelfware documents.","publishedAt":"2026-03-16T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.243Z","category":"Cloud"},{"title":"RAG Retrieval Quality Evaluation: The Checks We Added After Bad Answers Reached Production","url":"https://www.devopsness.com/blog/rag-retrieval-quality-evaluation-the-checks-we-added-after-bad-answers-reached-production-2026-03-15","description":"A search-friendly guide to RAG retrieval quality evaluation, based on the moment one production assistant started citing stale documents and the team had to prove what 'good retrieval' meant.","publishedAt":"2026-03-15T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.638Z","category":"AI"},{"title":"Infrastructure Documentation as Code: How One Platform Team Reduced Audit Fire Drills","url":"https://www.devopsness.com/blog/infrastructure-documentation-as-code-how-one-platform-team-reduced-audit-fire-drills-2026-03-14","description":"This infrastructure documentation as code guide shows how a platform team moved runbooks, ownership maps, and architecture decisions into versioned workflows that people actually 
trusted.","publishedAt":"2026-03-14T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.256Z","category":"Infrastructure"},{"title":"Linux Patch Management for Production Fleets: A Real-World Maintenance Workflow","url":"https://www.devopsness.com/blog/linux-patch-management-for-production-fleets-a-real-world-maintenance-workflow-2026-03-13","description":"A production-tested Linux patch management workflow for teams that need security fixes without turning every maintenance window into a gamble.","publishedAt":"2026-03-13T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.256Z","category":"Linux"},{"title":"AWS Cost Allocation Tags for Shared Platforms: What Finally Worked","url":"https://www.devopsness.com/blog/aws-cost-allocation-tags-for-shared-platforms-what-finally-worked-2026-03-12","description":"A hands-on guide to AWS cost allocation tags for shared environments, built from a real platform-team problem: everyone used the cluster, but nobody trusted the bill.","publishedAt":"2026-03-12T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.255Z","category":"Cloud"},{"title":"GitHub Actions Monorepo CI: How We Cut Build Times Without Breaking Main","url":"https://www.devopsness.com/blog/github-actions-monorepo-ci-how-we-cut-build-times-without-breaking-main-2026-03-11","description":"A practical GitHub Actions monorepo CI guide built around a real scaling problem: long queues, noisy failures, and developers waiting 40 minutes for feedback.","publishedAt":"2026-03-11T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.630Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-46","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-03-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.352Z","category":"AI"},{"title":"How 
We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-45","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-03-09T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.078Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-45","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-03-08T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.353Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-45","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-03-07T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.087Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-45","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-03-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.832Z","category":"DevOps"},{"title":"Ansible and Infrastructure as Code: Idempotency and Best Practices","url":"https://www.devopsness.com/blog/ansible-and-infrastructure-as-code-idempotency-and-best-practices","description":"Write Ansible playbooks that are idempotent, readable, and maintainable for config management.","publishedAt":"2026-03-05T21:11:57.455Z","updatedAt":"2026-04-27T07:48:06.100Z","category":"Infrastructure"},{"title":"Real-World RAG 
Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-45","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-03-04T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.354Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-44","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-03-03T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.811Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-44","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-03-02T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.833Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-44","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-03-01T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.802Z","category":"Cloud"},{"title":"End-of-Week Engineering: Why Smart Tech Teams Don’t Ship Major Changes on Friday","url":"https://www.devopsness.com/blog/end-of-week-engineering-no-friday-deployments-2026-02-28","description":"A practical risk-management framework for release timing, Friday deployment policies, progressive delivery, and how elite teams protect reliability and 
people.","publishedAt":"2026-02-28T12:00:00.000Z","updatedAt":"2026-04-04T22:09:28.761Z","category":"DevOps"},{"title":"Kubernetes Cost Optimization for Teams: FinOps Tactics That Actually Work","url":"https://www.devopsness.com/blog/kubernetes-finops-cost-optimization-2026-02-27","description":"Cut Kubernetes spend without hurting reliability using a practical FinOps playbook for rightsizing, autoscaling guardrails, showback, and weekly waste cleanup.","publishedAt":"2026-02-27T10:00:00.000Z","updatedAt":"2026-04-24T06:59:14.717Z","category":"Cloud"},{"title":"SRE Error Budgets in Practice: Shipping Fast Without Burning Reliability","url":"https://www.devopsness.com/blog/sre-error-budgets-practical-guide-2026-02-26","description":"A practical way to define SLOs and error budgets, connect them to release decisions, and avoid reliability debates without data.","publishedAt":"2026-02-26T10:00:00.000Z","updatedAt":"2026-04-22T02:56:58.744Z","category":"DevOps"},{"title":"Platform Engineering with Backstage: Build a Useful Developer Portal","url":"https://www.devopsness.com/blog/platform-engineering-backstage-developer-portal-2026-02-25","description":"How to implement Backstage with real templates, scorecards, and golden paths so internal platform work reduces delivery friction.","publishedAt":"2026-02-25T10:00:00.000Z","updatedAt":"2026-04-22T03:16:04.064Z","category":"Infrastructure"},{"title":"GitHub Actions for Monorepos: Fast CI Without Pipeline Chaos","url":"https://www.devopsness.com/blog/github-actions-monorepo-fast-ci-2026-02-24","description":"A practical pattern for monorepo CI with path filters, matrix builds, caching, and deployment guards that keep feedback fast as teams scale.","publishedAt":"2026-02-24T10:00:00.000Z","updatedAt":"2026-04-26T06:29:57.284Z","category":"DevOps"},{"title":"Azure DevOps Best Practices in 2026: Build Pipelines You Can Trust","url":"https://www.devopsness.com/blog/azure-devops-best-practices-2026-02-23","description":"A 
production-focused guide to Azure DevOps: standardized YAML templates, secure service connections, rollout safety, and measurable delivery reliability.","publishedAt":"2026-02-23T10:00:00.000Z","updatedAt":"2026-04-26T16:07:37.805Z","category":"DevOps"},{"title":"AI Best Practices in 2026: Shipping Reliable Systems, Not Demo Magic","url":"https://www.devopsness.com/blog/ai-best-practices-2026-02-22-reliable-production-systems","description":"A practical production playbook for AI systems: evaluation gates, guardrails, observability, cost control, and reliable release management.","publishedAt":"2026-02-22T09:30:00.000Z","updatedAt":"2026-04-23T14:49:02.535Z","category":"AI"},{"title":"AI Best Practices for Engineering Teams: From Prompt Experiments to Platform Discipline","url":"https://www.devopsness.com/blog/ai-best-practices-2026-02-21-platform-discipline","description":"A practical field manual for engineering teams who want AI features that survive real users, incidents, and budgets — not just demo day.","publishedAt":"2026-02-21T09:30:00.000Z","updatedAt":"2026-04-27T07:48:11.571Z","category":"AI"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-44","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-02-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.805Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-44","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-02-18T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.583Z","category":"AI"},{"title":"How We Stopped 
Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-43","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-02-17T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.535Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-43","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-02-15T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.075Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-43","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-02-14T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.828Z","category":"Cloud"},{"title":"Kubernetes Networking: Services, Ingress, and Network Policies","url":"https://www.devopsness.com/blog/kubernetes-networking-services-ingress-and-network-policies","description":"Understand Kubernetes networking: ClusterIP, NodePort, LoadBalancer, Ingress, and policy.","publishedAt":"2026-02-13T07:21:17.596Z","updatedAt":"2026-04-27T07:48:06.065Z","category":"DevOps"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-43","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-02-11T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.806Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a 
Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-43","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-02-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.367Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-42","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-02-09T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.420Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-42","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-02-07T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.583Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-42","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-02-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.885Z","category":"Cloud"},{"title":"Infrastructure Cost Optimization: Reducing Cloud Spending","url":"https://www.devopsness.com/blog/infrastructure-cost-optimization-reducing-cloud-spending","description":"We cut our AWS bill by 38% in a quarter. 
The specific changes that moved the bill, ranked by impact, with what we'd do first.","publishedAt":"2026-02-05T16:17:55.440Z","updatedAt":"2026-04-26T18:12:50.185Z","category":"Infrastructure"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-42","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-02-03T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.815Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-42","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-02-02T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.615Z","category":"AI"},{"title":"Multi-Cloud Infrastructure: Managing Resources Across Providers","url":"https://www.devopsness.com/blog/multi-cloud-infrastructure-managing-resources-across-providers","description":"We run mostly on AWS but use GCP for specific workloads. 
The honest cost-benefit analysis of multi-cloud, plus the patterns that make it not awful.","publishedAt":"2026-02-01T16:17:55.440Z","updatedAt":"2026-04-27T07:48:11.589Z","category":"Infrastructure"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-41","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-01-31T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.520Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-41","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-01-30T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.596Z","category":"Linux"},{"title":"Disaster Recovery Planning: Building Resilient Infrastructure","url":"https://www.devopsness.com/blog/disaster-recovery-planning-building-resilient-infrastructure","description":"A different angle on DR: the planning process — RTO/RPO conversations, dependency mapping, and what we learned about prioritizing what to recover.","publishedAt":"2026-01-29T16:17:55.440Z","updatedAt":"2026-04-26T18:12:43.959Z","category":"Infrastructure"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-41","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-01-27T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.907Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD 
Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-41","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-01-26T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.818Z","category":"DevOps"},{"title":"Infrastructure Monitoring: Observability for IaC","url":"https://www.devopsness.com/blog/infrastructure-monitoring-observability-iac","description":"Defining monitoring as code: dashboards, alerts, and SLOs in Git. The patterns that survived the migration from clicked-together monitoring.","publishedAt":"2026-01-25T16:17:55.440Z","updatedAt":"2026-04-26T18:12:51.226Z","category":"Infrastructure"},{"title":"FinOps and Cloud Cost Management for Engineering Teams","url":"https://www.devopsness.com/blog/finops-and-cloud-cost-management-for-engineering-teams","description":"Embed cost ownership in engineering: tags, budgets, and showback.","publishedAt":"2026-01-23T17:30:37.737Z","updatedAt":"2026-04-27T07:48:10.249Z","category":"Cloud"},{"title":"Ansible Playbook Optimization: Writing Efficient Playbooks","url":"https://www.devopsness.com/blog/ansible-playbook-optimization-writing-efficient-playbooks","description":"We cut our largest playbook's runtime from 14 minutes to 4 minutes. 
The specific changes that mattered, plus the ones that didn't.","publishedAt":"2026-01-22T16:17:55.440Z","updatedAt":"2026-04-26T18:12:29.935Z","category":"Infrastructure"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-41","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-01-21T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.397Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-40","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-01-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.592Z","category":"Infrastructure"},{"title":"Pulumi vs Terraform Deep Dive: Choosing the Right IaC Tool","url":"https://www.devopsness.com/blog/pulumi-vs-terraform-deep-dive-choosing-right-iac-tool","description":"We tried Pulumi for a quarter and went back to Terraform. Both are real options. 
Why we picked one and what would change our mind.","publishedAt":"2026-01-18T16:17:55.440Z","updatedAt":"2026-04-26T18:13:04.361Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-40","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-01-17T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.954Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-40","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-01-16T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.897Z","category":"Cloud"},{"title":"Operational Checklist: Kubernetes Secrets and External Vault Integration","url":"https://www.devopsness.com/blog/operational-checklist-kubernetes-secrets-and-external-vault-integration","description":"K8s Secrets are barely encrypted. We moved every secret to Vault with the Vault Agent injector and never went back. The setup checklist.","publishedAt":"2026-01-15T15:10:00.000Z","updatedAt":"2026-04-27T07:48:09.922Z","category":"DevOps"},{"title":"Infrastructure Testing Strategies: Validating Your IaC","url":"https://www.devopsness.com/blog/infrastructure-testing-strategies-validating-iac","description":"We test infrastructure code with three layers: validation, plan review, and integration tests. 
The setup that catches real bugs without slowing down PRs.","publishedAt":"2026-01-14T16:17:55.440Z","updatedAt":"2026-04-26T18:12:52.023Z","category":"Infrastructure"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-40","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-01-13T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.662Z","category":"DevOps"},{"title":"Terraform Modules Best Practices: Building Reusable Infrastructure","url":"https://www.devopsness.com/blog/terraform-modules-best-practices-building-reusable-infrastructure","description":"We have a private module registry with ~25 modules used across 12 accounts. Versioning, interface design, and the over-modularization mistake we keep making.","publishedAt":"2026-01-11T16:17:55.440Z","updatedAt":"2026-04-27T07:48:11.069Z","category":"Infrastructure"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-40","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-01-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.640Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-39","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2026-01-09T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.941Z","category":"Infrastructure"},{"title":"Linux Container Internals: Understanding How Containers 
Work","url":"https://www.devopsness.com/blog/linux-container-internals-understanding-how-containers-work","description":"A container is a process with extra kernel features applied. Walking through namespaces, cgroups, and the actual mechanics — the level of detail that makes \"container weirdness\" debuggable.","publishedAt":"2026-01-07T16:17:55.440Z","updatedAt":"2026-04-26T18:12:53.625Z","category":"Linux"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-39","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2026-01-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.426Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-39","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2026-01-05T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.798Z","category":"Cloud"},{"title":"Shell Scripting Best Practices: Writing Maintainable Scripts","url":"https://www.devopsness.com/blog/shell-scripting-best-practices-writing-maintainable-scripts","description":"We have a few hundred shell scripts in production. 
The patterns that make them survive contact with reality, and the ones we've stopped writing.","publishedAt":"2026-01-04T16:17:55.440Z","updatedAt":"2026-04-26T18:13:06.370Z","category":"Linux"},{"title":"Prompt Engineering for DevOps: Consistency and Safety","url":"https://www.devopsness.com/blog/prompt-engineering-for-devops-consistency-and-safety","description":"Use prompts to get reliable, safe outputs from LLMs for runbooks, code, and ops tasks.","publishedAt":"2026-01-03T03:39:57.879Z","updatedAt":"2026-04-27T07:48:11.597Z","category":"AI"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-39","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2026-01-02T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.797Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-39","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2026-01-01T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.362Z","category":"AI"},{"title":"File System Optimization: Improving Disk Performance","url":"https://www.devopsness.com/blog/file-system-optimization-improving-disk-performance","description":"Filesystem choice, mount options, IO schedulers — the per-host tweaks that actually moved disk performance for our database and storage workloads.","publishedAt":"2025-12-31T16:17:55.440Z","updatedAt":"2026-04-27T06:25:51.145Z","category":"Linux"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-38","description":"A 
real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-12-30T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.571Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-38","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-12-29T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.372Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-38","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-12-28T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.651Z","category":"Cloud"},{"title":"Process Management and Monitoring in Linux","url":"https://www.devopsness.com/blog/process-management-monitoring-linux","description":"How processes actually live and die on Linux, the tools that show what's happening, and the patterns we use for monitoring service health.","publishedAt":"2025-12-27T16:17:55.440Z","updatedAt":"2026-04-26T18:13:00.716Z","category":"Linux"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-38","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-12-26T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.953Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-38","description":"A field report from 
rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-12-25T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.370Z","category":"AI"},{"title":"Linux Security Hardening: Protecting Your System","url":"https://www.devopsness.com/blog/linux-security-hardening-protecting-system","description":"A practical Linux hardening checklist for production hosts. The settings that earn their place via real production reasons, not the cargo-cult version.","publishedAt":"2025-12-24T16:17:55.440Z","updatedAt":"2026-04-26T18:12:54.433Z","category":"Linux"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-37","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-12-23T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.458Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-37","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-12-22T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.354Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-37","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-12-21T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.631Z","category":"Cloud"},{"title":"Operational Checklist: Systemd Service Reliability Patterns","url":"https://www.devopsness.com/blog/operational-checklist-systemd-service-reliability-patterns","description":"A condensed checklist 
of the systemd unit-file patterns we now use everywhere, with the production reasons each one matters.","publishedAt":"2025-12-20T16:21:00.000Z","updatedAt":"2026-04-27T07:48:11.072Z","category":"Linux"},{"title":"Network Configuration and Troubleshooting in Linux","url":"https://www.devopsness.com/blog/network-configuration-troubleshooting-linux","description":"A systematic approach to debugging Linux network issues. The tools that earn their place and the order I use them in.","publishedAt":"2025-12-20T16:17:55.440Z","updatedAt":"2026-04-26T18:12:57.825Z","category":"Linux"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-37","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-12-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.586Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-37","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-12-18T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.091Z","category":"AI"},{"title":"Linux Performance Tuning: Optimizing System Performance","url":"https://www.devopsness.com/blog/linux-performance-tuning-optimizing-system-performance","description":"A practical Linux performance tuning playbook for production servers. 
The kernel parameters, disk and network tweaks that earn their place, and the ones that turned out to be folklore.","publishedAt":"2025-12-17T16:17:55.440Z","updatedAt":"2026-04-26T18:12:54.029Z","category":"Linux"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-36","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-12-15T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.406Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-36","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-12-14T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.353Z","category":"Linux"},{"title":"Systemd Service Management: Creating and Managing Services","url":"https://www.devopsness.com/blog/systemd-service-management-creating-managing-services","description":"A practical guide to writing and managing systemd services for production. 
The unit file features that earn their place, plus the operational workflows.","publishedAt":"2025-12-13T16:17:55.440Z","updatedAt":"2026-04-26T18:13:08.717Z","category":"Linux"},{"title":"Systemd and Modern Linux Service Management","url":"https://www.devopsness.com/blog/systemd-and-modern-linux-service-management","description":"Run services reliably with systemd: units, dependencies, and resource limits.","publishedAt":"2025-12-13T13:49:18.020Z","updatedAt":"2026-04-27T07:48:06.956Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-36","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-12-12T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.644Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-36","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-12-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.420Z","category":"DevOps"},{"title":"Edge Computing with AWS: CloudFront and Lambda@Edge","url":"https://www.devopsness.com/blog/edge-computing-aws-cloudfront-lambda-edge","description":"We use CloudFront + Lambda@Edge for specific patterns. 
The wins, the production gotchas, and where we hit Lambda@Edge's limits.","publishedAt":"2025-12-09T16:17:55.440Z","updatedAt":"2026-04-26T18:12:46.184Z","category":"Cloud"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-36","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-12-08T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.092Z","category":"AI"},{"title":"Cloud-Native Databases: Choosing the Right Database for Your Workload","url":"https://www.devopsness.com/blog/cloud-native-databases-choosing-right-database-workload","description":"Postgres, DynamoDB, Redis, Elasticsearch, Snowflake. We use all five for different workloads. The decision criteria, not the marketing comparison.","publishedAt":"2025-12-06T16:17:55.440Z","updatedAt":"2026-04-26T18:12:39.055Z","category":"Cloud"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-35","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-12-05T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.576Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-35","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-12-03T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.256Z","category":"Linux"},{"title":"Disaster Recovery in the Cloud: Backup and Recovery Strategies","url":"https://www.devopsness.com/blog/disaster-recovery-cloud-backup-recovery-strategies","description":"We've 
executed real disaster recoveries twice. The plan that survived contact with reality, and what was wrong about the plans we had before that.","publishedAt":"2025-12-02T16:17:55.440Z","updatedAt":"2026-04-27T07:48:10.274Z","category":"Cloud"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-35","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-12-01T12:00:00.000Z","updatedAt":"2026-04-27T07:48:12.651Z","category":"Cloud"},{"title":"Cloud Networking Fundamentals: VPCs, Subnets, and Routing","url":"https://www.devopsness.com/blog/cloud-networking-fundamentals-vpcs-subnets-routing","description":"VPCs, subnets, route tables, gateways. The mental model that finally made cloud networking click after I stopped trying to map it 1:1 to physical networks.","publishedAt":"2025-11-29T16:17:55.440Z","updatedAt":"2026-04-26T18:12:39.454Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-35","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-11-28T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.221Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-35","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-11-27T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.080Z","category":"AI"},{"title":"AWS ECS vs EKS: Choosing the Right Container 
Platform","url":"https://www.devopsness.com/blog/aws-ecs-vs-eks-choosing-right-container-platform","description":"We run both ECS and EKS in production. Which we use for what, and the actual decision criteria — not the marketing comparison.","publishedAt":"2025-11-25T16:17:55.440Z","updatedAt":"2026-04-26T18:12:31.554Z","category":"Cloud"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-34","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-11-24T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.576Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-34","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-11-23T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.407Z","category":"Linux"},{"title":"Container Image Scanning in CI and at Runtime","url":"https://www.devopsness.com/blog/container-image-scanning-in-ci-and-at-runtime","description":"Shift-left security with image scanning. 
Trivy, policy gates, and runtime integration.","publishedAt":"2025-11-22T23:58:38.161Z","updatedAt":"2026-04-27T07:48:08.425Z","category":"DevOps"},{"title":"Cloud Security Best Practices: Securing Your AWS Infrastructure","url":"https://www.devopsness.com/blog/cloud-security-best-practices-securing-aws-infrastructure","description":"A working AWS security baseline, derived from the actual incidents we've had and the audit findings we've cleared.","publishedAt":"2025-11-21T16:17:55.440Z","updatedAt":"2026-04-26T18:12:40.479Z","category":"Cloud"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-34","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-11-20T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.433Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-34","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-11-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.809Z","category":"DevOps"},{"title":"Serverless Architecture Patterns: Building Scalable Applications","url":"https://www.devopsness.com/blog/serverless-architecture-patterns-building-scalable-applications","description":"We use serverless for specific patterns, not as a default. 
The patterns where it shines, the ones it doesn't, and the gotchas at production scale.","publishedAt":"2025-11-18T16:17:55.440Z","updatedAt":"2026-04-27T07:48:11.375Z","category":"Cloud"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-34","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-11-17T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.070Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-33","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-11-16T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.416Z","category":"Infrastructure"},{"title":"Cloud Cost Monitoring: Tracking and Optimizing AWS Spending","url":"https://www.devopsness.com/blog/cloud-cost-monitoring-tracking-optimizing-aws-spending","description":"Building visibility into cloud costs that actually drives action. 
The dashboards we look at, the alerts that fire, and the queries we run.","publishedAt":"2025-11-14T16:17:55.440Z","updatedAt":"2026-04-26T18:12:38.656Z","category":"Cloud"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-33","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-11-13T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.084Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-33","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-11-12T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.799Z","category":"Cloud"},{"title":"Multi-Region Deployment: Building Resilient Cloud Applications","url":"https://www.devopsness.com/blog/multi-region-deployment-building-resilient-cloud-applications","description":"We run our app in two AWS regions for failover. 
The hard parts aren't the deployment — they're data consistency, traffic shifting, and the assumptions that break when \"primary\" is suddenly the wrong region.","publishedAt":"2025-11-11T16:17:55.440Z","updatedAt":"2026-04-26T18:12:57.422Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-33","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-11-10T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.428Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-33","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-11-09T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.264Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-32","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-11-08T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.421Z","category":"Infrastructure"},{"title":"AWS Lambda Optimization: Reducing Costs and Improving Performance","url":"https://www.devopsness.com/blog/aws-lambda-optimization-reducing-costs-improving-performance","description":"We run ~200 Lambda functions. 
Cold starts, memory tuning, and the cost-vs-latency trade-offs that actually move the bill.","publishedAt":"2025-11-07T16:17:55.440Z","updatedAt":"2026-04-27T07:48:12.639Z","category":"Cloud"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-32","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-11-06T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.654Z","category":"Linux"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-32","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-11-05T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.827Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-32","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-11-04T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.799Z","category":"DevOps"},{"title":"DevOps Metrics and KPIs: Measuring Success","url":"https://www.devopsness.com/blog/devops-metrics-kpis-measuring-success","description":"We track the four DORA metrics plus a handful of others. The trade-off between what's measurable and what's meaningful, and how we use the numbers.","publishedAt":"2025-11-03T16:17:55.440Z","updatedAt":"2026-04-26T18:12:43.147Z","category":"DevOps"},{"title":"Multi-Region Resilience: Failover, Data, and DNS","url":"https://www.devopsness.com/blog/multi-region-resilience-failover-data-and-dns","description":"Design for region failure. 
Active/passive and active/active, data replication, and failover testing.","publishedAt":"2025-11-02T10:07:58.303Z","updatedAt":"2026-04-27T07:48:08.418Z","category":"Cloud"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-32","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-11-01T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.240Z","category":"AI"},{"title":"Canary Releases: Gradual Rollout Strategy","url":"https://www.devopsness.com/blog/canary-releases-gradual-rollout-strategy","description":"We've run canary deploys on most services for two years. The mechanics are easy; the metrics that decide \"promote or roll back\" are where the design is.","publishedAt":"2025-10-31T16:17:55.440Z","updatedAt":"2026-04-26T18:12:37.791Z","category":"DevOps"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-31","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-10-30T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.241Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-31","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-10-28T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.094Z","category":"Linux"},{"title":"Blue-Green Deployments: Zero-Downtime Releases","url":"https://www.devopsness.com/blog/blue-green-deployments-zero-downtime-releases","description":"We use blue-green for stateful services where canary doesn't 
fit. The actual mechanics, the data-layer subtleties, and when blue-green isn't the right answer.","publishedAt":"2025-10-27T16:17:55.440Z","updatedAt":"2026-04-26T18:12:36.981Z","category":"DevOps"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-31","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-10-25T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.838Z","category":"Cloud"},{"title":"Log Aggregation Strategies: Centralizing Your Logs","url":"https://www.devopsness.com/blog/log-aggregation-strategies-centralizing-logs","description":"We collect ~800GB of logs per day across our fleet. The shape of our logging stack, what we keep, what we drop, and what we'd build differently.","publishedAt":"2025-10-24T16:17:55.440Z","updatedAt":"2026-04-26T18:12:54.826Z","category":"DevOps"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-31","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-10-23T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.853Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-31","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-10-21T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.249Z","category":"AI"},{"title":"Infrastructure Monitoring with Prometheus: Complete Setup 
Guide","url":"https://www.devopsness.com/blog/infrastructure-monitoring-prometheus-complete-setup-guide","description":"A working Prometheus stack for a 40-node cluster: what we deploy, what we tune, and what we wish we'd known about cardinality two years ago.","publishedAt":"2025-10-20T16:17:55.440Z","updatedAt":"2026-04-26T18:12:51.626Z","category":"DevOps"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-30","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-10-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.086Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-30","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-10-17T12:00:00.000Z","updatedAt":"2026-04-27T07:48:08.423Z","category":"Linux"},{"title":"Docker Multi-Stage Builds: Optimizing Image Size","url":"https://www.devopsness.com/blog/docker-multi-stage-builds-optimizing-image-size","description":"A focused look at the techniques that shrink container images: which actually pay off, which are folklore, and the discipline that keeps images small over time.","publishedAt":"2025-10-16T16:17:55.440Z","updatedAt":"2026-04-26T18:12:44.387Z","category":"DevOps"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-30","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-10-15T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.843Z","category":"Cloud"},{"title":"Kubernetes Backup Strategies: 
Protecting Your Cluster Data","url":"https://www.devopsness.com/blog/kubernetes-backup-strategies-protecting-cluster-data","description":"We've had to restore a Kubernetes cluster from backup twice. Once it worked. Once it took 14 hours. Here's the strategy we run now.","publishedAt":"2025-10-13T16:17:55.440Z","updatedAt":"2026-04-26T18:12:52.816Z","category":"DevOps"},{"title":"MLOps Pipelines: From Experiment to Production Models","url":"https://www.devopsness.com/blog/mlops-pipelines-from-experiment-to-production-models","description":"Build MLOps pipelines for training, evaluation, and deployment. Reproducibility and monitoring.","publishedAt":"2025-10-12T20:17:18.444Z","updatedAt":"2026-04-27T07:48:06.945Z","category":"AI"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-30","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-10-11T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.107Z","category":"DevOps"},{"title":"Service Mesh Implementation: Istio vs Linkerd","url":"https://www.devopsness.com/blog/service-mesh-implementation-istio-vs-linkerd","description":"We ran Istio for a year, then switched to Linkerd. Both can do the job. 
The decision came down to operational fit, not features.","publishedAt":"2025-10-09T16:17:55.440Z","updatedAt":"2026-04-26T18:13:05.974Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-30","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-10-08T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.245Z","category":"AI"},{"title":"Architecture Review: Python Worker Queue Scaling Patterns","url":"https://www.devopsness.com/blog/architecture-review-python-worker-queue-scaling-patterns","description":"We started with a single Celery worker handling everything. Eight months and three architecture changes later, here's what scaled and what we learned about queue design.","publishedAt":"2025-10-07T13:08:00.000Z","updatedAt":"2026-04-27T07:48:12.648Z","category":"AI"},{"title":"CI/CD Pipeline Optimization: Speeding Up Your Builds","url":"https://www.devopsness.com/blog/ci-cd-pipeline-optimization-speeding-up-builds","description":"We cut our average CI build time from 28 minutes to 6 minutes. 
The changes that mattered, ranked by impact.","publishedAt":"2025-10-06T16:17:55.440Z","updatedAt":"2026-04-26T18:12:38.254Z","category":"DevOps"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-29","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-10-05T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.084Z","category":"Infrastructure"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-29","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-10-04T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.095Z","category":"Linux"},{"title":"Container Security Scanning: Protecting Your Docker Images","url":"https://www.devopsness.com/blog/container-security-scanning-protecting-docker-images","description":"We scan every container image in CI and at runtime. Trivy + Cosign + admission controllers. 
The setup that earns its place and what we wish we'd known.","publishedAt":"2025-10-02T16:17:55.440Z","updatedAt":"2026-04-26T18:12:41.125Z","category":"DevOps"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-29","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-10-01T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.951Z","category":"Cloud"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-29","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-09-30T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.355Z","category":"DevOps"},{"title":"GitOps with ArgoCD: Automating Kubernetes Deployments","url":"https://www.devopsness.com/blog/gitops-argocd-automating-kubernetes-deployments","description":"We migrated 40+ services to GitOps with Argo CD. 
Two years in, here's what works and what required workarounds.","publishedAt":"2025-09-28T16:17:55.440Z","updatedAt":"2026-04-26T18:12:48.993Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-29","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-09-27T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.264Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-28","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-09-26T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.932Z","category":"Infrastructure"},{"title":"Kubernetes Networking Deep Dive: Understanding Pods, Services, and Ingress","url":"https://www.devopsness.com/blog/kubernetes-networking-deep-dive-pods-services-ingress","description":"How a packet actually gets from the internet to a pod, walked layer by layer. 
Plus the things that surprise people the first time they hit them.","publishedAt":"2025-09-25T16:17:55.440Z","updatedAt":"2026-04-26T18:12:53.224Z","category":"DevOps"},{"title":"Systemd Tricks We Use to Keep Services Boring","url":"https://www.devopsness.com/blog/systemd-tricks-we-use-to-keep-services-boring-28","description":"Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.","publishedAt":"2025-09-23T12:00:00.000Z","updatedAt":"2026-04-27T07:48:11.812Z","category":"Linux"},{"title":"AWS Lambda and Serverless Best Practices for Production","url":"https://www.devopsness.com/blog/aws-lambda-and-serverless-best-practices-for-production","description":"Design serverless apps for reliability, cold start, and cost. Event-driven patterns and observability.","publishedAt":"2025-09-22T06:26:38.586Z","updatedAt":"2026-04-27T07:48:07.827Z","category":"Cloud"},{"title":"Production AI Pipelines: Building End-to-End ML Systems","url":"https://www.devopsness.com/blog/production-ai-pipelines-building-end-to-end-ml-systems","description":"We've shipped three end-to-end ML systems. The pieces that look obvious in slides and turn out to be the actual work.","publishedAt":"2025-09-21T16:17:55.440Z","updatedAt":"2026-04-26T18:13:03.157Z","category":"AI"},{"title":"Architecture Review: LLM Gateway Design for Multi-Provider Inference","url":"https://www.devopsness.com/blog/architecture-review-llm-gateway-design-for-multi-provider-inference","description":"We started routing 90% of LLM traffic through a small internal gateway. The gateway wasn't planned — it emerged from solving the same problem in 5 places. 
Here's the shape it took.","publishedAt":"2025-09-20T09:40:00.000Z","updatedAt":"2026-04-27T07:48:12.652Z","category":"AI"},{"title":"A Pragmatic Multi-Region Strategy for Small Teams","url":"https://www.devopsness.com/blog/a-pragmatic-multi-region-strategy-for-small-teams-28","description":"How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.","publishedAt":"2025-09-19T12:00:00.000Z","updatedAt":"2026-04-27T07:48:07.851Z","category":"Cloud"},{"title":"AI Security and Safety: Protecting Your AI Applications","url":"https://www.devopsness.com/blog/ai-security-safety-protecting-ai-applications","description":"Prompt injection, data leakage, jailbreaks, and the boring controls that actually keep production AI features safe. The threat model that matters once you ship.","publishedAt":"2025-09-18T16:17:55.440Z","updatedAt":"2026-04-26T18:12:29.470Z","category":"AI"},{"title":"What We Learned Running Weekly Game Days on Our CI/CD Pipeline","url":"https://www.devopsness.com/blog/what-we-learned-running-weekly-game-days-on-our-ci-cd-pipeline-28","description":"Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.","publishedAt":"2025-09-16T12:00:00.000Z","updatedAt":"2026-04-27T07:48:06.951Z","category":"DevOps"},{"title":"Real-World RAG Incidents: Lessons from a Production Rollout","url":"https://www.devopsness.com/blog/real-world-rag-incidents-lessons-from-a-production-rollout-28","description":"A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.","publishedAt":"2025-09-15T12:00:00.000Z","updatedAt":"2026-04-27T07:48:10.241Z","category":"AI"},{"title":"Embedding Models Comparison: Choosing the Right Model for Your Use Case","url":"https://www.devopsness.com/blog/embedding-models-comparison-choosing-right-model-use-case","description":"We benchmarked 
six embedding models on the same retrieval task. The results that surprised us, and how we'd pick today.","publishedAt":"2025-09-14T16:17:55.440Z","updatedAt":"2026-04-27T11:01:12.830Z","category":"AI"},{"title":"How We Stopped Terraform Drift from Surprising On-Call","url":"https://www.devopsness.com/blog/how-we-stopped-terraform-drift-from-surprising-on-call-27","description":"A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.","publishedAt":"2025-09-12T12:00:00.000Z","updatedAt":"2026-04-27T07:48:09.927Z","category":"Infrastructure"}]}