We had four different patch cadences across our fleet and routinely missed CVEs by weeks. This is the unified workflow that finally caught us up.
For a long time our patch story was "everyone does their own thing." The bare-metal database fleet got patched on a quarterly cycle by the DBA team. EC2 instances got patched whenever the team that owned them remembered. Our Kubernetes nodes got patched as a side effect of node rotation, which ran on its own cadence. Every team had defensible reasons for their cadence; the result was that our average package was 6 weeks behind upstream, and the fix for a CVE published on a Wednesday could take a month to reach production.
We unified the patch story over a quarter. The goal wasn't faster patching — it was predictable patching with clear ownership.
Three properties we wanted:
Before unification we had none of these. After, all three.
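We classified hosts into four buckets: Kubernetes nodes running Bottlerocket, stateful database hosts on EC2, stateless EC2 utility hosts, and physical or colo hosts.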
Each bucket has its own runbook, schedule, and dashboard. Different details, same shape.
Bottlerocket is our node OS. It's read-only, atomically updateable, and ships security patches as part of normal AMI releases. The flow:
The update operator drains and reboots the node during a configured maintenance window.
Our maintenance window is 03:00-04:00 UTC daily. Up to 10% of nodes can be in the update process simultaneously. Pod disruption budgets ensure no service drops below capacity during reboots.
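To make the 10% cap concrete, here is a minimal sketch of the arithmetic behind it. The function name and the rounding rule (round down, but never below one node) are illustrative assumptions; this is not how the operator itself is configured.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the name and the rounding rule are illustrative assumptions.

# max_concurrent_updates TOTAL_NODES PERCENT
# Prints floor(TOTAL_NODES * PERCENT / 100), but never less than 1,
# so that a small cluster can still make progress.
max_concurrent_updates() {
  local total percent n
  total=$1
  percent=$2
  n=$(( total * percent / 100 ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}
```

For a 200-node cluster, `max_concurrent_updates 200 10` prints 20; a 5-node cluster still gets 1.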
What we monitor:
Time from upstream Bottlerocket release to fully-rotated fleet: typically 5-7 days.
Our databases are on EC2 with EBS-backed storage. Auto-patching here is dangerous because the patch process involves a reboot, and an unplanned reboot of a database primary mid-transaction is bad.
The workflow is manual but scripted:
The actual update is a wrapper around apt-get upgrade followed by a controlled reboot:
# Run on the database host
patch-host.sh --pre-check # snapshot taken, replication confirmed healthy
patch-host.sh --upgrade # apt-get upgrade
patch-host.sh --reboot # graceful reboot, verify service back up
patch-host.sh --post-check # replication caught up, queries succeed
Each step has hard gates. If pre-check fails, no upgrade. If the reboot doesn't recover within 5 minutes, page on-call.
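The gating behaviour can be sketched as a small driver that runs the four steps in order and stops at the first failure. This is a sketch under assumptions, not the real wrapper: `PATCH_CMD` and `page_oncall` are hypothetical names, and it assumes the `--reboot` step itself exits non-zero when the service has not recovered within the 5-minute limit.

```shell
#!/usr/bin/env bash
# Sketch of the gate logic only. PATCH_CMD and page_oncall are hypothetical names.

PATCH_CMD=${PATCH_CMD:-patch-host.sh}

# Placeholder: replace with a call to your real paging system.
page_oncall() {
  echo "PAGE: $*" >&2
}

# Runs each step in order; the first non-zero exit stops the flow.
run_gated_patch() {
  local step
  for step in --pre-check --upgrade --reboot --post-check; do
    echo "running step: $step"
    if ! "$PATCH_CMD" "$step"; then
      echo "gate failed at $step, stopping" >&2
      # Assumption: --reboot exits non-zero if the service is not back within 5 minutes.
      if [ "$step" = "--reboot" ]; then
        page_oncall "host did not recover after reboot"
      fi
      return 1
    fi
  done
  echo "all steps passed"
}
```

Keeping the step order in one loop means a failed pre-check can never be followed by an upgrade, which is the property the gates exist to guarantee.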
Time from CVE disclosure to fleet-patched: typically 5-10 days (limited by maintenance windows).
These are stateless jump hosts and small utility instances. We don't patch them in place — we replace them.
Every Sunday at 02:00, an automation job:
Total fleet rotation: ~2 hours, automated. The "patch" is just whatever's in the latest AMI baseline. Faster than per-package patching because we re-provision from a controlled baseline.
This works because the hosts are stateless. For stateful hosts (Bucket 2), in-place patching is required.
A small fraction of our hosts are physical or in colos we control directly. These get monthly patch windows announced two weeks in advance, with explicit downtime communicated to dependent teams.
The volume is small (we have maybe 6 such hosts) so the manual overhead is acceptable. We've considered automation but the unique-snowflake characteristic of each box would make the automation as bespoke as the manual process.
The above schedules apply to routine patches. When a critical CVE drops, we have a separate path:
A Critical CVE triggers a war-room channel. The platform team identifies which buckets are affected; each bucket owner takes the patch action. Status is updated every 2 hours until all hosts are patched.
We've had two Critical CVEs in the last 18 months. Both were fully patched within 24 hours. This is the system working — fast when it needs to be, predictable when it doesn't.
A single Grafana dashboard shows fleet patch status:
The dashboard is the team's accountability mechanism. If the data-platform team's "average days behind" creeps up, it's visible. The bucket owner is responsible for explaining or fixing.
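As an illustration of the per-team "average days behind" figure, the sketch below averages a days-behind value per team. The input format (one `team days_behind` pair per line on stdin) and the function name are assumptions; how the real dashboard sources its data is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the input format and function name are assumptions.

# Reads "team days_behind" lines on stdin and prints "team average" per team,
# sorted by team name, with one decimal place.
avg_days_behind() {
  LC_ALL=C awk '
    NF == 2 { sum[$1] += $2; count[$1]++ }
    END { for (t in sum) printf "%s %.1f\n", t, sum[t] / count[t] }
  ' | LC_ALL=C sort
}
```

Feeding it the lines `data-platform 4`, `data-platform 10`, and `web 3` yields `data-platform 7.0` and `web 3.0`.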
A few patterns we've eliminated or watch for:
"This server has been up for 412 days" treated as a badge of honor. It's a CVE liability. We celebrate predictable reboots, not uptime.
Skipping reboot to avoid disruption. If the kernel is patched on-disk but the running kernel is the old one, the patch isn't live. We track "pending reboot" separately from "patched."
Different teams running different versions of similar software. We had three teams running PostgreSQL, each on a different patch level. Now: a single canonical patch level per package, owned by one team, applied across all hosts running that package.
Manual patch tracking in spreadsheets. Doesn't survive the first incident. Use the dashboard; when humans fall behind on the spreadsheet, the dashboard still shows truth.
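The "pending reboot" signal from the skipped-reboot pattern above can be approximated on Debian-family hosts (an assumption, based on the use of apt-get) with two checks: the `/var/run/reboot-required` flag file that Debian and Ubuntu packages create, and a comparison of the running kernel against the newest kernel image installed. The function name and the `vmlinuz-<version>` file layout are assumptions; treat this as a sketch, not the production check.

```shell
#!/usr/bin/env bash
# Hypothetical helper; assumes kernel images are named vmlinuz-<version>.

# reboot_pending RUNNING_KERNEL [BOOT_DIR] [FLAG_FILE]
# Returns 0 (true) if a reboot appears to be pending, 1 otherwise.
reboot_pending() {
  local running boot_dir flag newest f
  running=$1
  boot_dir=${2:-/boot}
  flag=${3:-/var/run/reboot-required}

  # Debian/Ubuntu packages create this flag file when a reboot is needed.
  if [ -e "$flag" ]; then
    return 0
  fi

  # Find the newest installed kernel version using version-aware sort.
  newest=$(
    for f in "$boot_dir"/vmlinuz-*; do
      [ -e "$f" ] || continue
      printf '%s\n' "${f##*/vmlinuz-}"
    done | sort -V | tail -n 1
  )

  # Pending if a kernel is installed that differs from the running one.
  [ -n "$newest" ] && [ "$newest" != "$running" ]
}
```

On a host, `reboot_pending "$(uname -r)" && echo "pending reboot"` reports the state. Note that the newest image on disk is not always the one the bootloader will pick, so this can over-report.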
Tools, briefly:
patch-host.sh-style scripts for the gated update flow on stateful hosts.
None of this is exotic. The discipline is the work, not the tooling.
Categorize your hosts. Even three categories is enough. You can't have a single patch policy for a database server and a stateless edge worker — they have different risk profiles.
Pick a baseline cadence, even if it's slow at first. Monthly is fine to start. The point is predictability; speed comes after the cadence is reliable.
Build the dashboard before the policy. Without visibility, the policy is theoretical. With visibility, the policy becomes a forcing function.
Have a CVE override path. Routine cadence handles 99% of patches; the 1% that demand urgency need a different process. Defining it up front means you don't argue about it in the middle of an actual CVE.
Earlier in this post I called a server with 412 days of uptime a liability. That framing took us a while to land on. There's a generation of engineers who learned that long uptime was a sign of stability. On modern Linux, it's a sign of accumulated unpatched packages. We retrained the team's instinct: "this server hasn't rebooted in months; when's its next maintenance?" is the right question, not "look how stable it is."