devops/ness

Archive

Browse all 274 articles organized by date

2026

31 articles

January

  • 29Disaster Recovery Planning: Building Resilient Infrastructure
  • 28Operational Checklist: Blue-Green Deployment Guardrails
  • 25Infrastructure Monitoring: Observability for IaC
  • 24Operational Checklist: Infrastructure Drift Detection Workflow
  • 22Ansible Playbook Optimization: Writing Efficient Playbooks
  • 20Operational Checklist: Multi-Cluster Traffic Routing Strategies
  • 18Pulumi vs Terraform Deep Dive: Choosing the Right IaC Tool
  • 15Operational Checklist: Kubernetes Secrets and External Vault Integration
  • 14Infrastructure Testing Strategies: Validating Your IaC
  • 12Operational Checklist: Python Worker Queue Scaling Patterns
  • 11Terraform Modules Best Practices: Building Reusable Infrastructure
  • 8Operational Checklist: Model Serving Observability Stack
  • 7Linux Container Internals: Understanding How Containers Work
  • 4Shell Scripting Best Practices: Writing Maintainable Scripts
  • 4Operational Checklist: RAG Retrieval Quality Evaluation

February

  • 28End-of-Week Engineering: Why Smart Tech Teams Don’t Ship Major Changes on Friday
  • 27Kubernetes Cost Optimization for Teams: FinOps Tactics That Actually Work
  • 26SRE Error Budgets in Practice: Shipping Fast Without Burning Reliability
  • 25Platform Engineering with Backstage: Build a Useful Developer Portal
  • 24GitHub Actions for Monorepos: Fast CI Without Pipeline Chaos
  • 23Azure DevOps Best Practices in 2026: Build Pipelines You Can Trust
  • 22AI Best Practices in 2026: Shipping Reliable Systems, Not Demo Magic
  • 21AI Best Practices for Engineering Teams: From Prompt Experiments to Platform Discipline
  • 20Operational Checklist: AI Inference Cost Optimization
  • 16Operational Checklist: SLO-Based Monitoring for APIs
  • 12Operational Checklist: Secure Container Supply Chain Controls
  • 8Operational Checklist: Infrastructure Documentation as Code
  • 5Infrastructure Cost Optimization: Reducing Cloud Spending
  • 4Operational Checklist: Cloud Networking Segmentation Patterns
  • 1Multi-Cloud Infrastructure: Managing Resources Across Providers
  • 1Operational Checklist: Incident Response for Platform Teams

2025

134 articles

January

  • 29Troubleshooting: Kubernetes Cluster Upgrade Strategy
  • 26Field Notes: AI Inference Cost Optimization
  • 22Field Notes: SLO-Based Monitoring for APIs
  • 18Field Notes: Secure Container Supply Chain Controls
  • 14Field Notes: Infrastructure Documentation as Code
  • 9Field Notes: Cloud Networking Segmentation Patterns
  • 6Field Notes: Incident Response for Platform Teams
  • 2Field Notes: Blue-Green Deployment Guardrails

February

  • 26Troubleshooting: Linux Performance Baseline Methodology
  • 22Troubleshooting: Cloud Disaster Recovery Runbook Design
  • 18Troubleshooting: AWS Cost Control with Tagging and Budgets
  • 15Troubleshooting: Ansible Role Design for Large Teams
  • 10Troubleshooting: Terraform State Isolation by Environment
  • 6Troubleshooting: GitHub Actions Pipeline Reliability
  • 2Troubleshooting: Docker Image Hardening for Production

March

  • 29Troubleshooting: Kubernetes Secrets and External Vault Integration
  • 26Troubleshooting: Python Worker Queue Scaling Patterns
  • 21Troubleshooting: Model Serving Observability Stack
  • 17Troubleshooting: RAG Retrieval Quality Evaluation
  • 13Troubleshooting: Prompt Versioning and Regression Testing
  • 9Troubleshooting: LLM Gateway Design for Multi-Provider Inference
  • 6Troubleshooting: Kernel and Package Patch Management
  • 2Troubleshooting: Systemd Service Reliability Patterns

April

  • 29Troubleshooting: SLO-Based Monitoring for APIs
  • 25Troubleshooting: Secure Container Supply Chain Controls
  • 21Troubleshooting: Infrastructure Documentation as Code
  • 17Troubleshooting: Cloud Networking Segmentation Patterns
  • 14Troubleshooting: Incident Response for Platform Teams
  • 10Troubleshooting: Blue-Green Deployment Guardrails
  • 6Troubleshooting: Infrastructure Drift Detection Workflow
  • 2Troubleshooting: Multi-Cluster Traffic Routing Strategies

May

  • 30Best Practices: Cloud Disaster Recovery Runbook Design
  • 26Best Practices: AWS Cost Control with Tagging and Budgets
  • 23Best Practices: Ansible Role Design for Large Teams
  • 19Best Practices: Terraform State Isolation by Environment
  • 15Best Practices: GitHub Actions Pipeline Reliability
  • 11Best Practices: Docker Image Hardening for Production
  • 7Best Practices: Kubernetes Cluster Upgrade Strategy
  • 4Troubleshooting: AI Inference Cost Optimization

June

  • 27Best Practices: Model Serving Observability Stack
  • 23Best Practices: RAG Retrieval Quality Evaluation
  • 19Best Practices: Prompt Versioning and Regression Testing
  • 15Best Practices: LLM Gateway Design for Multi-Provider Inference
  • 12Best Practices: Kernel and Package Patch Management
  • 8Best Practices: Systemd Service Reliability Patterns
  • 3Best Practices: Linux Performance Baseline Methodology

July

  • 28Best Practices: Infrastructure Documentation as Code
  • 24Best Practices: Cloud Networking Segmentation Patterns
  • 21Best Practices: Incident Response for Platform Teams
  • 17Best Practices: Blue-Green Deployment Guardrails
  • 12Best Practices: Infrastructure Drift Detection Workflow
  • 8Best Practices: Multi-Cluster Traffic Routing Strategies
  • 4Best Practices: Kubernetes Secrets and External Vault Integration
  • 1Best Practices: Python Worker Queue Scaling Patterns

August

  • 31Multi-Agent AI Systems: Building Collaborative AI Applications
  • 29Architecture Review: Ansible Role Design for Large Teams
  • 27Prompt Engineering Best Practices: Maximizing LLM Performance
  • 25Architecture Review: Terraform State Isolation by Environment
  • 23AI Model Deployment Strategies: From Development to Production
  • 20Model Quantization Techniques: Reducing LLM Size and Cost
  • 20Architecture Review: GitHub Actions Pipeline Reliability
  • 16Architecture Review: Docker Image Hardening for Production
  • 16Vector Databases for AI: Comparing Pinecone, Weaviate, and ChromaDB
  • 13Building RAG Applications: A Complete Guide to Retrieval Augmented Generation
  • 12Architecture Review: Kubernetes Cluster Upgrade Strategy
  • 9Best Practices: AI Inference Cost Optimization
  • 5Best Practices: SLO-Based Monitoring for APIs
  • 1Best Practices: Secure Container Supply Chain Controls

September

  • 29Architecture Review: RAG Retrieval Quality Evaluation
  • 28GitOps with ArgoCD: Automating Kubernetes Deployments
  • 25Kubernetes Networking Deep Dive: Understanding Pods, Services, and Ingress
  • 24Architecture Review: Prompt Versioning and Regression Testing
  • 21Production AI Pipelines: Building End-to-End ML Systems
  • 20Architecture Review: LLM Gateway Design for Multi-Provider Inference
  • 18AI Security and Safety: Protecting Your AI Applications
  • 17Architecture Review: Kernel and Package Patch Management
  • 14Embedding Models Comparison: Choosing the Right Model for Your Use Case
  • 13Architecture Review: Systemd Service Reliability Patterns
  • 10AI Cost Optimization: Reducing LLM Inference Costs by 80%
  • 9Architecture Review: Linux Performance Baseline Methodology
  • 7Fine-tuning vs Few-Shot Learning: When to Use Each Approach
  • 5Architecture Review: Cloud Disaster Recovery Runbook Design
  • 3AI Observability and Monitoring: Tracking Model Performance in Production
  • 1Architecture Review: AWS Cost Control with Tagging and Budgets

October

  • 31Canary Releases: Gradual Rollout Strategy
  • 29Architecture Review: Cloud Networking Segmentation Patterns
  • 27Blue-Green Deployments: Zero-Downtime Releases
  • 26Architecture Review: Incident Response for Platform Teams
  • 24Log Aggregation Strategies: Centralizing Your Logs
  • 22Architecture Review: Blue-Green Deployment Guardrails
  • 20Infrastructure Monitoring with Prometheus: Complete Setup Guide
  • 18Architecture Review: Infrastructure Drift Detection Workflow
  • 16Docker Multi-Stage Builds: Optimizing Image Size
  • 14Architecture Review: Multi-Cluster Traffic Routing Strategies
  • 13Kubernetes Backup Strategies: Protecting Your Cluster Data
  • 10Architecture Review: Kubernetes Secrets and External Vault Integration
  • 9Service Mesh Implementation: Istio vs Linkerd
  • 7Architecture Review: Python Worker Queue Scaling Patterns
  • 6CI/CD Pipeline Optimization: Speeding Up Your Builds
  • 3Architecture Review: Model Serving Observability Stack
  • 2Container Security Scanning: Protecting Your Docker Images

November

  • 30Operational Checklist: Terraform State Isolation by Environment
  • 29Cloud Networking Fundamentals: VPCs, Subnets, and Routing
  • 26Operational Checklist: GitHub Actions Pipeline Reliability
  • 25AWS ECS vs EKS: Choosing the Right Container Platform
  • 22Operational Checklist: Docker Image Hardening for Production
  • 21Cloud Security Best Practices: Securing Your AWS Infrastructure
  • 18Operational Checklist: Kubernetes Cluster Upgrade Strategy
  • 18Serverless Architecture Patterns: Building Scalable Applications
  • 15Architecture Review: AI Inference Cost Optimization
  • 14Cloud Cost Monitoring: Tracking and Optimizing AWS Spending
  • 11Multi-Region Deployment: Building Resilient Cloud Applications
  • 11Architecture Review: SLO-Based Monitoring for APIs
  • 7AWS Lambda Optimization: Reducing Costs and Improving Performance
  • 7Architecture Review: Secure Container Supply Chain Controls
  • 3DevOps Metrics and KPIs: Measuring Success
  • 2Architecture Review: Infrastructure Documentation as Code

December

  • 31File System Optimization: Improving Disk Performance
  • 31Operational Checklist: Prompt Versioning and Regression Testing
  • 27Process Management and Monitoring in Linux
  • 27Operational Checklist: LLM Gateway Design for Multi-Provider Inference
  • 24Linux Security Hardening: Protecting Your System
  • 24Operational Checklist: Kernel and Package Patch Management
  • 20Operational Checklist: Systemd Service Reliability Patterns
  • 20Network Configuration and Troubleshooting in Linux
  • 17Linux Performance Tuning: Optimizing System Performance
  • 16Operational Checklist: Linux Performance Baseline Methodology
  • 13Systemd Service Management: Creating and Managing Services
  • 11Operational Checklist: Cloud Disaster Recovery Runbook Design
  • 9Edge Computing with AWS: CloudFront and Lambda@Edge
  • 7Operational Checklist: AWS Cost Control with Tagging and Budgets
  • 6Cloud-Native Databases: Choosing the Right Database for Your Workload
  • 4Operational Checklist: Ansible Role Design for Large Teams
  • 2Disaster Recovery in the Cloud: Backup and Recovery Strategies

2024

105 articles

January

  • 28Practical Guide: Cloud Disaster Recovery Runbook Design
  • 24Practical Guide: AWS Cost Control with Tagging and Budgets
  • 21Practical Guide: Ansible Role Design for Large Teams
  • 17Practical Guide: Terraform State Isolation by Environment
  • 15Orchestrating AI Agents on Kubernetes
  • 13Practical Guide: GitHub Actions Pipeline Reliability
  • 10eBPF: The Future of Kernel Observability
  • 9Practical Guide: Docker Image Hardening for Production
  • 8Zero Trust Architecture in Multi-Cloud
  • 5Practical Guide: Kubernetes Cluster Upgrade Strategy
  • 5Terraform State Management Strategies
  • 3Building Scalable CI/CD Pipelines with GitHub Actions
  • 1Fine-tuning Llama 3 on Consumer Hardware

February

  • 29Practical Guide: Python Worker Queue Scaling Patterns
  • 25Practical Guide: Model Serving Observability Stack
  • 21Practical Guide: RAG Retrieval Quality Evaluation
  • 17Practical Guide: Prompt Versioning and Regression Testing
  • 13Practical Guide: LLM Gateway Design for Multi-Provider Inference
  • 12Fine-tuning Large Language Models: A Practical Guide
  • 10Practical Guide: Kernel and Package Patch Management
  • 10Infrastructure as Code: Terraform vs Pulumi vs Ansible
  • 7Linux System Monitoring with Prometheus and Grafana
  • 5Practical Guide: Systemd Service Reliability Patterns
  • 5AWS Cost Optimization: 10 Strategies to Reduce Your Cloud Bill
  • 3Building Production-Ready AI Applications with LangChain and Docker
  • 1Practical Guide: Linux Performance Baseline Methodology
  • 1Kubernetes Autoscaling: HPA vs VPA vs Cluster Autoscaler

March

  • 31Practical Guide: Secure Container Supply Chain Controls
  • 27Practical Guide: Infrastructure Documentation as Code
  • 23Practical Guide: Cloud Networking Segmentation Patterns
  • 20Practical Guide: Incident Response for Platform Teams
  • 16Practical Guide: Blue-Green Deployment Guardrails
  • 11Practical Guide: Infrastructure Drift Detection Workflow
  • 7Practical Guide: Multi-Cluster Traffic Routing Strategies
  • 3Practical Guide: Kubernetes Secrets and External Vault Integration

April

  • 28Deep Dive: Ansible Role Design for Large Teams
  • 24Deep Dive: Terraform State Isolation by Environment
  • 19Deep Dive: GitHub Actions Pipeline Reliability
  • 15Deep Dive: Docker Image Hardening for Production
  • 11Deep Dive: Kubernetes Cluster Upgrade Strategy
  • 8Practical Guide: AI Inference Cost Optimization
  • 4Practical Guide: SLO-Based Monitoring for APIs

May

  • 28Deep Dive: RAG Retrieval Quality Evaluation
  • 24Deep Dive: Prompt Versioning and Regression Testing
  • 20Deep Dive: LLM Gateway Design for Multi-Provider Inference
  • 17Deep Dive: Kernel and Package Patch Management
  • 13Deep Dive: Systemd Service Reliability Patterns
  • 9Deep Dive: Linux Performance Baseline Methodology
  • 5Deep Dive: Cloud Disaster Recovery Runbook Design
  • 1Deep Dive: AWS Cost Control with Tagging and Budgets

June

  • 28Deep Dive: Cloud Networking Segmentation Patterns
  • 25Deep Dive: Incident Response for Platform Teams
  • 21Deep Dive: Blue-Green Deployment Guardrails
  • 17Deep Dive: Infrastructure Drift Detection Workflow
  • 13Deep Dive: Multi-Cluster Traffic Routing Strategies
  • 9Deep Dive: Kubernetes Secrets and External Vault Integration
  • 6Deep Dive: Python Worker Queue Scaling Patterns
  • 2Deep Dive: Model Serving Observability Stack

July

  • 30Production Playbook: Terraform State Isolation by Environment
  • 26Production Playbook: GitHub Actions Pipeline Reliability
  • 22Production Playbook: Docker Image Hardening for Production
  • 18Production Playbook: Kubernetes Cluster Upgrade Strategy
  • 15Deep Dive: AI Inference Cost Optimization
  • 11Deep Dive: SLO-Based Monitoring for APIs
  • 7Deep Dive: Secure Container Supply Chain Controls
  • 2Deep Dive: Infrastructure Documentation as Code

August

  • 30Production Playbook: Prompt Versioning and Regression Testing
  • 26Production Playbook: LLM Gateway Design for Multi-Provider Inference
  • 23Production Playbook: Kernel and Package Patch Management
  • 19Production Playbook: Systemd Service Reliability Patterns
  • 15Production Playbook: Linux Performance Baseline Methodology
  • 10Production Playbook: Cloud Disaster Recovery Runbook Design
  • 6Production Playbook: AWS Cost Control with Tagging and Budgets
  • 3Production Playbook: Ansible Role Design for Large Teams

September

  • 27Production Playbook: Blue-Green Deployment Guardrails
  • 23Production Playbook: Infrastructure Drift Detection Workflow
  • 18Production Playbook: Multi-Cluster Traffic Routing Strategies
  • 14Production Playbook: Kubernetes Secrets and External Vault Integration
  • 11Production Playbook: Python Worker Queue Scaling Patterns
  • 7Production Playbook: Model Serving Observability Stack
  • 3Production Playbook: RAG Retrieval Quality Evaluation

October

  • 28Field Notes: Docker Image Hardening for Production
  • 23Field Notes: Kubernetes Cluster Upgrade Strategy
  • 20Production Playbook: AI Inference Cost Optimization
  • 16Production Playbook: SLO-Based Monitoring for APIs
  • 12Production Playbook: Secure Container Supply Chain Controls
  • 8Production Playbook: Infrastructure Documentation as Code
  • 4Production Playbook: Cloud Networking Segmentation Patterns
  • 1Production Playbook: Incident Response for Platform Teams

November

  • 28Field Notes: Kernel and Package Patch Management
  • 24Field Notes: Systemd Service Reliability Patterns
  • 20Field Notes: Linux Performance Baseline Methodology
  • 16Field Notes: Cloud Disaster Recovery Runbook Design
  • 12Field Notes: AWS Cost Control with Tagging and Budgets
  • 9Field Notes: Ansible Role Design for Large Teams
  • 5Field Notes: Terraform State Isolation by Environment
  • 1Field Notes: GitHub Actions Pipeline Reliability

December

  • 29Field Notes: Infrastructure Drift Detection Workflow
  • 25Field Notes: Multi-Cluster Traffic Routing Strategies
  • 21Field Notes: Kubernetes Secrets and External Vault Integration
  • 18Field Notes: Python Worker Queue Scaling Patterns
  • 14Field Notes: Model Serving Observability Stack
  • 10Field Notes: RAG Retrieval Quality Evaluation
  • 6Field Notes: Prompt Versioning and Regression Testing
  • 1Field Notes: LLM Gateway Design for Multi-Provider Inference

2023

4 articles

December

  • 28AWS Cost Optimization Strategies
  • 25Advanced Bash Scripting Techniques
  • 20Docker Multi-Stage Builds for Production
  • 15Infrastructure as Code with Ansible