Node upgrades, autoscaler scale-downs, and spot reclaims all drain nodes. Without PDBs they can take all your replicas at once. The budgets, probes, and graceful-shutdown handling that keep voluntary disruptions invisible to users.
The first time a cluster upgrade took down a "highly available" service, we learned what PodDisruptionBudgets are for. Three replicas, spread across three nodes, looked redundant — until the node-pool upgrade drained all three nodes in quick succession and every replica went down at once. The deployment said 3/3; reality said 0 serving. Pod Disruption Budgets are how you tell Kubernetes "you may disrupt my pods, but not all of them at once."
PDBs protect against voluntary disruptions — the ones Kubernetes initiates and can be asked to slow down:
- kubectl drain during node upgrades
They do not protect against involuntary disruptions: a kernel panic, a hardware failure, an OOM kill. Those just happen. PDBs are a contract with the eviction API, and only voluntary disruptions go through it.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2 # never let fewer than 2 pods be available
  selector:
    matchLabels:
      app: web
When something tries to drain a node, the eviction API checks the PDB. If evicting a web pod would drop available replicas below 2, the eviction is refused and the drain blocks until a replacement pod is Ready elsewhere. The drain proceeds one pod at a time, waiting for recovery between each — exactly the rolling behavior you wanted.
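The request a drain makes on your behalf can be written out. A minimal sketch of the Eviction object that is POSTed to a pod's eviction subresource, assuming a pod named web-abc123 in the default namespace (both hypothetical placeholders):

```yaml
# POST /api/v1/namespaces/default/pods/web-abc123/eviction
# The pod name and namespace are hypothetical placeholders.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-abc123      # the pod to evict
  namespace: default
```

If the budget allows the eviction, the API server answers 200 OK and the pod is terminated as usual. If it would break the PDB, the answer is 429 Too Many Requests and the caller is expected to retry later, which is what a blocked drain is doing while it waits.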
Two ways to express the same budget; pick by what stays stable as you scale:
minAvailable: 2 # absolute floor — but means 50% at 4 replicas, 20% at 10
# vs
maxUnavailable: 1 # at most 1 down at a time, regardless of replica count
- maxUnavailable: 1 for most stateless services: it scales naturally and clearly says "drain one at a time."
- minAvailable as a percentage (minAvailable: 80%) when you need a capacity floor to handle load, not just availability.
Never set minAvailable equal to the replica count. minAvailable: 3 on a 3-replica deployment means no pod can ever be voluntarily evicted: the drain blocks forever and your node upgrade hangs. We did this once and wondered why a cluster upgrade stalled for an hour.
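Written out as full manifests, the two recommendations might look like the sketch below. The second budget's name and label (api-pdb, app: api) are hypothetical. Each budget selects a different set of pods, because the eviction API refuses to evict a pod that is covered by more than one PDB.

```yaml
# Stateless service: drain one pod at a time at any replica count.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web
---
# Capacity floor: keep at least 80% of the desired pods available.
# "api-pdb" and "app: api" are hypothetical names for illustration.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: api
```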
A PDB controls how many pods go down at once. It does nothing about whether each individual pod shutdown is graceful. You also need:
Readiness probes that mean it. The PDB counts a pod as "available" when it's Ready. If your readiness probe goes green before the app can actually serve, the PDB lets the next eviction proceed while the replacement still can't take traffic. The budget is only as honest as the probe.
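What an honest probe looks like depends on the app, but a sketch helps. The fragment below sits under a container in the pod template and assumes the app exposes a hypothetical /ready endpoint on port 8080 that succeeds only once it can serve real traffic (dependencies connected, caches warm). The path, port, and timings are all assumptions to adapt.

```yaml
# Container-level fragment; path, port, and timings are assumptions.
readinessProbe:
  httpGet:
    path: /ready        # hypothetical endpoint: succeeds only when able to serve
    port: 8080
  periodSeconds: 5      # probe every 5 seconds
  failureThreshold: 3   # mark the pod NotReady after 3 consecutive failures
  successThreshold: 1   # one success marks the pod Ready again
```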
Graceful shutdown. On eviction, Kubernetes sends SIGTERM, waits up to terminationGracePeriodSeconds, then sends SIGKILL to anything still running. The pod must use that window: stop accepting new connections, drain in-flight requests, then exit.
# container-level field
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"] # let endpoints propagate removal
# pod-level field (a sibling of containers, not of lifecycle)
terminationGracePeriodSeconds: 30
The preStop sleep matters more than it looks: pod termination and Service endpoint removal are concurrent, not ordered. Without the brief sleep, the pod can receive SIGTERM and start shutting down while the load balancer still routes new requests to it — connections refused, errors to users. The sleep holds the pod alive long enough for endpoint removal to propagate.
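The two settings live at different levels of the pod spec, which is easy to get wrong when copying fragments. A sketch of where each one goes inside a Deployment's pod template, with a hypothetical container name and image:

```yaml
# Pod template fragment (spec.template.spec of a Deployment).
# Container name and image are hypothetical.
spec:
  terminationGracePeriodSeconds: 30   # pod-level: total time allowed for shutdown
  containers:
    - name: web
      image: registry.example.com/web:1.0
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]   # runs before SIGTERM is sent
```

The preStop hook runs before the container receives SIGTERM, and its 5 seconds count against the 30-second grace period, so the app has roughly 25 seconds left to drain in-flight requests. The hook also assumes the image ships a shell and a sleep binary.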
A PDB that's too strict blocks the very operations it's meant to make safe. If a deployment is already degraded (one pod crashlooping) and the PDB requires minAvailable: 3 of 3, a node drain can't make progress, and you're stuck. Leave headroom: run enough replicas that the PDB permits at least one eviction even during a partial outage. Our rule: replicas ≥ minAvailable + 1, always, so a fully healthy deployment can give up one pod at a time. A pod that is already down uses up that spare, so evicting a healthy pod during a partial outage takes one more replica (replicas ≥ minAvailable + 2).
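The headroom arithmetic is easy to check on a concrete pair of manifests. A sketch, assuming a 4-replica Deployment (the image name is hypothetical) behind a minAvailable: 2 budget:

```yaml
# Deployment with spare replicas above the PDB floor.
# Image name is hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4                  # minAvailable (2) plus 2 spares
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web               # must match the PDB selector
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
# All 4 pods Ready:    4 - 2 = 2 evictions allowed
# 1 pod crashlooping:  3 - 2 = 1 eviction still allowed
```

The live number is in the PDB's status.disruptionsAllowed field, which kubectl get pdb shows in its ALLOWED DISRUPTIONS column.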
PDBs are cheap to add and easy to get subtly wrong. The pattern that works: maxUnavailable: 1 (or a percentage floor), honest readiness probes, a preStop drain delay, and always one more replica than the budget requires. Then voluntary disruptions — which happen constantly in a healthy cluster — stay invisible to the people using your service.