We had four different patch cadences across our fleet and routinely missed CVEs by weeks. This is the unified workflow that finally let us catch up.

Kernel and Package Patch Management Best Practices

For a long time our patch story was "everyone does their own thing." The bare-metal database fleet got patched on a quarterly cycle by the DBA team. EC2 instances got patched whenever the team that owned them remembered. Our Kubernetes nodes got patched as a side effect of node rotation, which happened on a different cadence. Every team had defensible reasons for their cadence; the result was that our average package was 6 weeks behind upstream, and CVEs published on a Wednesday could take a month to land in production.

We unified the patch story over a quarter. The goal wasn't faster patching — it was predictable patching with clear ownership.

What "good" looks like

Three properties we wanted:

Predictable. Patching happens on a known schedule. Nobody is surprised by a Tuesday-night kernel restart.
Bounded latency from upstream to production. High-severity CVEs land in production within 7 days (Critical ones within 24 hours through the override path). Non-critical updates land within 30 days.
Single ownership per host class. Each kind of host has one team responsible for its patch state.

Before unification we had none of these. After, all three.

The taxonomy that organized everything

We classified hosts into four buckets:

Kubernetes worker nodes (most of our fleet). Owned by platform team. Patched via rotation; our distribution (Bottlerocket) auto-patches and reboots.
EC2 instances running stateful services (databases, message queues). Owned by data-platform team. Patched manually with explicit failover.
Bastion / utility hosts (jump boxes, monitoring relays). Owned by SRE. Patched via rolling replacement weekly.
Edge / on-prem boxes (a few legacy hosts). Owned by infrastructure team. Patched manually with explicit downtime windows.

Each bucket has its own runbook, schedule, and dashboard. Different details, same shape.

Bucket 1: Kubernetes nodes (auto-patching with discipline)

Bottlerocket is our node OS. It has a read-only root filesystem, updates atomically, and ships security patches as part of normal AMI releases. The flow:

Bottlerocket auto-detects new versions and pulls them
It marks the node ready-to-update
Our Kubernetes update operator (we use the upstream Bottlerocket update operator) drains and reboots the node during a configured maintenance window

Our maintenance window is 03:00-04:00 UTC daily. Up to 10% of nodes can be in the update process simultaneously. Pod disruption budgets ensure no service drops below capacity during reboots.
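To make the 10% cap concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical, and rounding down with a floor of one node is an assumption; the update operator's own concurrency setting may round differently.

```bash
# Sketch: how many nodes may update at once under a 10% cap.
# Rounding down with a minimum of 1 node is an assumption.

# max_concurrent TOTAL_NODES
max_concurrent() {
    mc_nodes=$(( $1 * 10 / 100 ))
    if [ "$mc_nodes" -lt 1 ]; then mc_nodes=1; fi
    echo "$mc_nodes"
}

# batches_needed TOTAL_NODES
# Sequential batches needed to touch every node once (ceiling division).
batches_needed() {
    bn_size=$(max_concurrent "$1")
    echo $(( ($1 + bn_size - 1) / bn_size ))
}

echo "200 nodes: $(max_concurrent 200) at a time, $(batches_needed 200) batches"
```

How many batches fit into a single maintenance window depends on drain and reboot times, which this sketch does not model.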

What we monitor:

Number of nodes pending update (alerts if > 20% pending for > 7 days — indicates the rotation isn't keeping up)
Time since last reboot per node (alerts if > 30 days — node is stuck)
Failed update attempts (any failure pages the platform team)
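The three alert conditions above can be written down as small predicates. This is an illustration only: the function names and the way inputs are passed are hypothetical, and in practice these would more likely live as alert rules in the monitoring system than in a shell script.

```bash
# Sketch of the three node-update alert rules. All names are hypothetical.

# pending_alert TOTAL_NODES PENDING_NODES DAYS_PENDING
# Fires when more than 20% of nodes have been pending for more than 7 days.
pending_alert() {
    # Integer math: pending/total > 20%  <=>  pending*100 > total*20
    [ $(( $2 * 100 )) -gt $(( $1 * 20 )) ] && [ "$3" -gt 7 ]
}

# stuck_node_alert DAYS_SINCE_REBOOT
# Fires when a node has not rebooted for more than 30 days.
stuck_node_alert() {
    [ "$1" -gt 30 ]
}

# failed_update_alert FAILED_ATTEMPTS
# Any failed update attempt pages the platform team.
failed_update_alert() {
    [ "$1" -gt 0 ]
}

if pending_alert 200 50 8; then echo "ALERT: rotation is not keeping up"; fi
if stuck_node_alert 31; then echo "ALERT: node looks stuck"; fi
if failed_update_alert 1; then echo "PAGE: update attempt failed"; fi
```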

Time from upstream Bottlerocket release to fully-rotated fleet: typically 5-7 days.

Bucket 2: stateful EC2 instances (manual with care)

Our databases are on EC2 with EBS-backed storage. Auto-patching here is dangerous because the patch process involves a reboot, and an unplanned reboot of a database primary aborts in-flight transactions and can force an uncoordinated failover.

The workflow is manual but scripted:

Every Monday, an automated job lists all kernel/security updates available across the database fleet.
The DBA on duty triages: which ones are CVE-relevant, which can wait.
CVE-relevant ones get scheduled for patching that week, in this order: replicas first, then the primary (with an explicit failover).
Each patching run gets a 30-minute maintenance window, announced 48 hours in advance.
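The "replicas first, then primary" ordering is easy to get wrong by hand, so it is worth generating. A minimal sketch, assuming an inventory of `HOST ROLE` lines; the host names, the input format, and the `patch_order` function are all hypothetical:

```bash
# patch_order: reads "HOST ROLE" lines on stdin and prints host names
# with every replica first and every primary last.
patch_order() {
    awk '
        $2 == "replica" { print $1 }          # replicas are emitted immediately
        $2 == "primary" { held[++n] = $1 }    # primaries are held back
        END { for (i = 1; i <= n; i++) print held[i] }
    '
}

printf '%s\n' "db-1 primary" "db-2 replica" "db-3 replica" | patch_order
```

The failover itself is deliberately left out: it stays an explicit, human-triggered step before the primary is touched.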

The actual update is a wrapper around apt-get upgrade followed by a controlled reboot:

```bash
# Run on the database host
patch-host.sh --pre-check    # snapshot taken, replication confirmed healthy
patch-host.sh --upgrade      # apt-get upgrade
patch-host.sh --reboot       # graceful reboot, verify service back up
patch-host.sh --post-check   # replication caught up, queries succeed
```

Each step has hard gates. If pre-check fails, no upgrade. If the reboot doesn't recover within 5 minutes, page on-call.
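One way to express hard gates like these is a runner that stops at the first failing step. The sketch below is not the real patch-host.sh; the step functions are stubs, and every name in it is hypothetical.

```bash
# Sketch of a gated patch run. The step functions are stubs standing in for
# the real pre-check, upgrade, reboot, and post-check logic (hypothetical).

pre_check()   { echo "pre-check: snapshot taken, replication healthy"; }
upgrade()     { echo "upgrade: packages upgraded"; }
reboot_host() { echo "reboot: service back up"; }
post_check()  { echo "post-check: replication caught up"; }

# run_gated STEP...
# Runs the steps in order and stops at the first one that fails,
# so a failed pre-check means the upgrade never starts.
run_gated() {
    for rg_step in "$@"; do
        if ! "$rg_step"; then
            echo "gate failed at: $rg_step (stopping)" >&2
            return 1
        fi
    done
}

run_gated pre_check upgrade reboot_host post_check
```

The non-zero exit status from `run_gated` is what a caller would hook paging onto; the 5-minute recovery timeout is not modeled here.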

Time from CVE disclosure to fleet-patched: typically 5-10 days (limited by maintenance windows).

Bucket 3: bastion / utility hosts (rolling replacement)

These are stateless jump hosts and small utility instances. We don't patch them in place — we replace them.

Every Sunday at 02:00, an automation job:

Provisions a fresh instance from the latest AMI
Runs the Ansible playbook to configure it
Adds it to the load balancer (for jump hosts)
Removes the old one from the load balancer
Terminates the old one
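The important property of these steps is their order: the old host is only touched once its replacement is configured and serving. A minimal sketch of that ordering follows; every function is a hypothetical stub, and a real job would call the cloud API and `ansible-playbook` in their place.

```bash
# Sketch of one rolling replacement. Every function below is a hypothetical
# stub that only prints what a real implementation would do.

provision_instance() { echo "new-$1"; }               # prints the new host name
configure_instance() { echo "configured $1"; }
add_to_lb()          { echo "added $1 to LB"; }
remove_from_lb()     { echo "removed $1 from LB"; }
terminate_instance() { echo "terminated $1"; }

# replace_host OLD_HOST
# Any failure before the old host is removed leaves the old host in service.
replace_host() {
    rh_old=$1
    rh_new=$(provision_instance "$rh_old") || return 1
    configure_instance "$rh_new"           || return 1
    add_to_lb "$rh_new"                    || return 1
    remove_from_lb "$rh_old"               || return 1
    terminate_instance "$rh_old"
}

for host in bastion-a bastion-b; do
    replace_host "$host" || { echo "replacement of $host failed, stopping" >&2; break; }
done
```

A failed run can leave a half-built new instance behind; cleaning that up is out of scope for this sketch.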

Total fleet rotation takes about 2 hours and is fully automated. The "patch" is just whatever's in the latest AMI baseline. This is faster than per-package patching because every host is re-provisioned from one controlled baseline.

This works because the hosts are stateless. For stateful hosts (Bucket 2), in-place patching is required.

Bucket 4: edge / on-prem (manual with downtime windows)

A small fraction of our hosts are physical or in colos we control directly. These get monthly patch windows announced two weeks in advance, with explicit downtime communicated to dependent teams.

The volume is small (we have maybe 6 such hosts), so the manual overhead is acceptable. We've considered automation, but each box is enough of a snowflake that the automation would be as bespoke as the manual process.

CVE response (the override path)

The above schedules apply to routine patches. When a critical CVE drops, we have a separate path:

Severity: based on CVSS score and AWS/upstream guidance.
Critical (CVSS ≥ 9.0 with active exploitation): patch within 24 hours, all teams override their normal cadence.
High (CVSS 7.0-8.9): patch within 7 days.
Medium and below: included in normal cadence.
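The tiers above map onto a small lookup. In this sketch the function name and the yes/no flag are hypothetical, and one case is an assumption: the tiers do not say what happens to a CVSS score of 9.0 or higher without active exploitation, so it is treated here as High.

```bash
# patch_deadline_hours CVSS_SCORE ACTIVELY_EXPLOITED(yes|no)
# Prints the patch deadline in hours, or "cadence" for the routine schedule.
# ASSUMPTION: CVSS >= 9.0 without active exploitation is handled as High.
patch_deadline_hours() {
    awk -v score="$1" -v exploited="$2" 'BEGIN {
        if (score >= 9.0 && exploited == "yes") print 24         # Critical: 24 hours
        else if (score >= 7.0)                  print 7 * 24     # High: 7 days
        else                                    print "cadence"  # Medium and below
    }'
}

patch_deadline_hours 9.8 yes
patch_deadline_hours 7.5 no
patch_deadline_hours 5.3 no
```

Upstream or AWS guidance can still move a CVE between tiers; the score is only the starting point.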

A Critical CVE triggers a war-room channel. The platform team identifies which buckets are affected; each bucket owner takes the patch action. Status is updated every 2 hours until all hosts are patched.

We've had two Critical CVEs in the last 18 months. Both were fully patched within 24 hours. This is the system working: fast when it needs to be, predictable the rest of the time.

The unified dashboard

A single Grafana dashboard shows fleet patch status:

Total hosts by bucket
Hosts with patches available
Hosts with security patches available
Average days behind latest packages
Hosts patched in the last 7 / 30 / 90 days
Failed patches in the last 30 days

The dashboard is the team's accountability mechanism. If the data-platform team's "average days behind" creeps up, it's visible. The bucket owner is responsible for explaining it or fixing it.
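As an illustration of the "average days behind" metric, here is a minimal sketch that averages per bucket. The `BUCKET DAYS_BEHIND` input format, the bucket labels, and the `avg_days_behind` function are hypothetical; a real dashboard would compute this in its query layer.

```bash
# avg_days_behind: reads "BUCKET DAYS_BEHIND" lines on stdin and prints
# "BUCKET AVERAGE" per bucket, sorted by bucket name.
avg_days_behind() {
    LC_ALL=C awk '
        { sum[$1] += $2; count[$1]++ }
        END { for (b in sum) printf "%s %.1f\n", b, sum[b] / count[b] }
    ' | sort
}

printf '%s\n' "k8s 3" "k8s 5" "stateful 10" "stateful 14" | avg_days_behind
```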

Common mistakes we've seen

A few patterns we've eliminated or watch for:

"This server has been up for 412 days" treated as a badge of honor. It's a CVE liability. We celebrate predictable reboots, not uptime.

Skipping reboot to avoid disruption. If the kernel is patched on-disk but the running kernel is the old one, the patch isn't live. We track "pending reboot" separately from "patched."
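A minimal sketch of a "pending reboot" check, assuming a Debian or Ubuntu host where packages can create `/var/run/reboot-required`. The `reboot_pending` function and the kernel version strings are hypothetical; how the newest installed kernel version is discovered is left to the caller.

```bash
# reboot_pending RUNNING_KERNEL NEWEST_INSTALLED_KERNEL [FLAG_FILE]
# Succeeds (exit 0) when the host still needs a reboot for patches to be live.
reboot_pending() {
    # Newest installed kernel differs from the running one: patched on disk only
    if [ "$1" != "$2" ]; then return 0; fi
    # Debian/Ubuntu packages can create this flag file when a reboot is needed
    [ -e "${3:-/var/run/reboot-required}" ]
}

# Example with made-up version strings; on a real host the first argument
# would be "$(uname -r)".
if reboot_pending "5.15.0-100-generic" "5.15.0-105-generic"; then
    echo "patched on disk, reboot pending"
fi
```

Reporting this as its own state keeps "patched" on the dashboard honest: a host only counts once the new kernel is actually running.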

Different teams running different versions of similar software. We had three teams running PostgreSQL, each on a different patch level. Now: a single canonical patch level per package, owned by one team, applied across all hosts running that package.

Manual patch tracking in spreadsheets. It doesn't survive the first incident. Use the dashboard; when humans fall behind on the spreadsheet, the dashboard still shows the truth.

What we use to make this work

Tools, briefly:

AWS Systems Manager Patch Manager for EC2 fleet visibility (which hosts have patches available).
AMIs from the AWS team (Amazon Linux) and Bottlerocket maintainers, refreshed monthly.
Custom bash wrappers (patch-host.sh-style scripts) for the gated update flow on stateful hosts.
Grafana for the dashboard.
PagerDuty for the war-room flow on Critical CVEs.

None of this is exotic. The discipline is the work, not the tooling.

What I'd tell a team starting

Categorize your hosts. Even three categories are enough. You can't have a single patch policy for a database server and a stateless edge worker, because they have different risk profiles.

Pick a baseline cadence, even if it's slow at first. Monthly is fine to start. The point is predictability; speed comes after the cadence is reliable.

Build the dashboard before the policy. Without visibility, the policy is theoretical. With visibility, the policy becomes a forcing function.

Have a CVE override path. Routine cadence handles 99% of patches; the 1% that demand urgency need a different process. Defining it up front means you don't argue about it in the middle of an actual CVE.

A final note on uptime

Earlier in this post I called a server with 412 days of uptime a liability. That framing took us a while to land on. There's a generation of engineers who learned that long uptime was a sign of stability. On modern Linux, it's a sign of accumulated unpatched vulnerabilities. We retrained the team's instinct: "this server hasn't rebooted in months, so when's its next maintenance?" is the right question, not "look how stable it is."

About Kiril Urbonas