You always have known vulnerabilities. The question is how you triage, patch, and respond. This is the discipline we run after a few real incidents and a lot of routine work.
If you run any non-trivial software in production, you have known vulnerabilities right now. Your OS packages have a few open CVEs. Some npm or Python dependency has a high-severity advisory you haven't patched yet. A library you bundled six months ago has been deprecated by upstream. This is normal. The question isn't whether you have vulnerabilities — it's whether you have a system for handling them.
After a few real incidents and a lot of routine maintenance, this is what we run.
Scanners will tell you that you have hundreds of vulnerabilities. That's not a crisis; it's the baseline. Every CVE count is a starting point, not a verdict. The work is figuring out which ones actually matter.
The teams that struggle here are the ones who treat every red row in the scanner as urgent. They burn out fast, stop reading the scanner, and miss the one that actually mattered. The teams that handle this well have a triage process — most vulnerabilities go in the "patch on schedule" bucket, a few in "patch now", and you spend your energy where it counts.
CVSS scores are useful but not sufficient. A "critical 9.8" in a library you never call from any reachable code path is meaningfully less urgent than a "medium 6.4" in your auth path. We triage on three axes:
Severity. Standard CVSS score. Critical/high/medium/low. Easy to read, often misleading on its own.
Reachability. Is the vulnerable code actually called by our application? Many scanners now do reachability analysis — Snyk, GitHub's CodeQL, Semgrep Pro. A high-severity bug in a transitive dependency that's never imported is much less urgent than a medium-severity bug in the function our auth flow calls every request.
Blast radius. What's behind the vulnerable code? Our public-facing payments service has a much bigger blast radius than an internal cron job that runs nightly.
The triage isn't a formula — it's a conversation. But these three axes get you a much more useful priority than "everything red is urgent."
Each vulnerability lands in one of four buckets:
Fix now (within 24 hours). Active exploit in the wild, critical severity, reachable code, internet-exposed service. Pager-worthy. Examples in the last few years: Log4Shell, certain OpenSSL CVEs, the polyfill.io supply-chain attack.
Fix soon (within 30 days). High severity, reachable, no active exploitation evidence yet. The bulk of "patch this" work. Goes into the next sprint with a hard due date.
Fix later (within 90 days). Medium severity, or high severity but not reachable, or in a non-public component. Gets included in routine patching cycles.
Accept. Low severity, or no reachable path, or the upstream has no fix and we have a mitigation in place. Documented with an explicit owner and an expiry date for the acceptance.
The "accept" path is real and not a failure mode. Some vulnerabilities you genuinely shouldn't patch — the fix breaks a critical dependency, or the vuln is in a component that's about to be retired. The discipline is: document the decision, set a review date, don't let "accepted" turn into "forgotten."
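As a first pass before the human conversation, the bucket rules can be sketched in code. This is a minimal sketch, not our actual tooling: the `Finding` fields and the `suggest_bucket` name are hypothetical, `internet_exposed` stands in for blast radius, and two choices are assumptions on our part. Unreachable findings default to "fix later" rather than "accept", and an actively exploited finding on an internal service is bumped to "fix soon".

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    FIX_NOW = "fix now (within 24 hours)"
    FIX_SOON = "fix soon (within 30 days)"
    FIX_LATER = "fix later (within 90 days)"
    ACCEPT = "accept (needs owner and expiry)"


@dataclass
class Finding:
    # All field names are illustrative assumptions, not a scanner schema.
    cve_id: str
    severity: str              # "critical", "high", "medium", or "low"
    reachable: bool            # is the vulnerable code on a path we call?
    internet_exposed: bool     # crude proxy for blast radius
    actively_exploited: bool
    fix_available: bool = True
    mitigated: bool = False


def suggest_bucket(f: Finding) -> Bucket:
    """Suggest a starting bucket; a human still makes the final call."""
    serious = f.severity in ("critical", "high") and f.reachable
    if (f.severity == "critical" and f.actively_exploited
            and f.reachable and f.internet_exposed):
        return Bucket.FIX_NOW
    # Low severity, or no upstream fix but a mitigation is in place.
    if f.severity == "low" or (not f.fix_available and f.mitigated):
        return Bucket.ACCEPT
    if serious and (f.internet_exposed or f.actively_exploited):
        return Bucket.FIX_SOON
    # Medium severity, unreachable code, or a non-public component.
    return Bucket.FIX_LATER
```

The output is a suggestion to argue with, not a verdict. Anything that lands in `ACCEPT` still needs the documented owner and expiry date described above.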
A few times a year, a CVE drops that's serious enough to interrupt regular work. Log4Shell. Heartbleed. Shellshock. Polyfill. The pattern is similar each time:
1. Confirm reach. Do we use the affected library? Where? In what versions? Run grep -r across every repo and check the image inventory. This is where having a software bill of materials (SBOM) saves you hours.
2. Assess exposure. Is the vulnerable code on the public internet? Behind auth? In a sandboxed environment? Internet-exposed gets fixed first; internal can wait a few hours.
3. Apply the temporary mitigation. Before the real patch lands, there is often a config workaround: block specific input patterns at the WAF, disable a feature flag, restrict network access. Buy time.
4. Patch. Upgrade the dependency or apply the vendor fix. Run regression tests. Deploy.
5. Verify. Confirm the patched version is what's actually running. Scan again. Some images take more than one rebuild to pick up the fix because of transitive dependencies.
6. Communicate. Internal status updates, status page if customer-facing, post-incident doc once it's stable.
We've practiced this enough times now that the first 30 minutes go on autopilot. The hard part is always step 1 — knowing exactly what you have running where.
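To make step 1 concrete, here is a minimal sketch of searching a directory of SBOMs for an affected component. It assumes one CycloneDX JSON file per service or image, each with a top-level `components` array whose entries carry `name` and `version`. The directory layout and both function names are hypothetical, and nested components are not walked.

```python
import json
from pathlib import Path


def find_component(sbom: dict, name: str) -> list[tuple[str, str]]:
    """Return (name, version) for each top-level component matching `name`."""
    hits = []
    for comp in sbom.get("components", []):
        comp_name = comp.get("name", "")
        # Case-insensitive substring match, so "log4j" also finds "log4j-core".
        if name.lower() in comp_name.lower():
            hits.append((comp_name, comp.get("version", "unknown")))
    return hits


def search_inventory(sbom_dir: Path, name: str) -> dict[str, list[tuple[str, str]]]:
    """Map each SBOM file name to its matching components, skipping misses."""
    results = {}
    for path in sorted(sbom_dir.glob("*.json")):
        hits = find_component(json.loads(path.read_text()), name)
        if hits:
            results[path.name] = hits
    return results
```

Calling `search_inventory(Path("sboms"), "log4j")` would list every SBOM file that mentions the library, along with the versions found. The answer is only as good as the SBOMs are fresh, so it pays to regenerate them on every build.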
Most vulnerability work isn't emergency work. It's the steady rhythm that keeps the inventory clean. Specific practices:
Dependabot (or equivalent) on every repo. Auto-opens PRs for dependency updates. We have it set up to auto-merge patch versions for trusted packages and require review for minor/major bumps. Most updates land within a week of release with no human intervention beyond the merge button.
Nightly base image rebuilds. Every container we build has its base image patched nightly. New CVEs in node:20-alpine get picked up automatically without us doing anything specific.
Weekly Trivy scans in CI. Every PR includes a vulnerability scan. New high/critical CVEs block the merge until acknowledged.
Monthly OS patching cycle. Linux hosts get patched on the first Tuesday of each month, scheduled in advance, gradual rollout dev → staging → prod.
Quarterly dependency review. A human reads through the long-tail of medium/low CVEs and decides which to address. Catches the ones that have been "accepted" for too long.
None of this is exotic. It's just consistent. The teams that struggle with vulnerability management almost always have one or two of these missing. Usually it's the nightly base image rebuild (without it, OS-level CVEs accumulate and nobody notices) or the dependency review (without it, the accept bucket grows indefinitely).
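The auto-merge rule for dependency updates (patch bumps of trusted packages merge automatically, everything else waits for review) boils down to a small decision function. The sketch below is not Dependabot configuration; it only illustrates the rule. It assumes plain `MAJOR.MINOR.PATCH` version strings with no pre-release tags, and the `TRUSTED` allow-list is a hypothetical example.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string; anything else raises ValueError."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify an upgrade as 'major', 'minor', 'patch', or 'none'."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        return "none"  # same version or a downgrade: nothing to merge
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


# Hypothetical allow-list; in practice this would live in reviewed config.
TRUSTED = frozenset({"requests", "urllib3"})


def can_auto_merge(package: str, old: str, new: str,
                   trusted: frozenset = TRUSTED) -> bool:
    """Auto-merge only patch-level bumps of explicitly trusted packages."""
    return package in trusted and bump_kind(old, new) == "patch"
```

Failing closed is the point of the design: an unparseable version raises instead of merging, and an unknown package always goes to a human.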
We have explicit SLOs that turn the triage into deadlines:
| Severity | Internet-exposed | Internal |
|---|---|---|
| Critical + active exploit | 24h | 72h |
| Critical | 7 days | 14 days |
| High | 30 days | 60 days |
| Medium | 90 days | 180 days |
| Low | best effort | best effort |
The clock starts when the CVE is published (or when an active exploit is identified, whichever is first). The owner is the team that owns the affected component. If the SLO is at risk, it escalates.
We measure compliance monthly. We're at ~95% on the highs, ~88% on the mediums. Lows we don't track strictly — they tend to roll up into the routine patching anyway.
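The SLO table and the clock rule translate directly into a due-date calculation. This is a minimal sketch under stated assumptions: the `"critical+exploit"` key and the `slo_due` function are hypothetical names, and "best effort" is represented as no deadline at all.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# (severity, internet_exposed) -> allowed time to patch, mirroring the table.
SLO_WINDOWS = {
    ("critical+exploit", True): timedelta(hours=24),
    ("critical+exploit", False): timedelta(hours=72),
    ("critical", True): timedelta(days=7),
    ("critical", False): timedelta(days=14),
    ("high", True): timedelta(days=30),
    ("high", False): timedelta(days=60),
    ("medium", True): timedelta(days=90),
    ("medium", False): timedelta(days=180),
}


def slo_due(severity: str, internet_exposed: bool, published: datetime,
            exploit_seen: Optional[datetime] = None) -> Optional[datetime]:
    """Return the patch deadline, or None for best-effort (low) findings."""
    window = SLO_WINDOWS.get((severity, internet_exposed))
    if window is None:
        return None
    # The clock starts at publication or first known exploit, whichever is first.
    start = min(published, exploit_seen) if exploit_seen else published
    return start + window


published = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(slo_due("high", True, published))  # 2024-01-31 00:00:00+00:00
```

Keeping the windows in one table means the compliance report and the ticket due dates cannot drift apart.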
Sometimes you can't fix in the timeframe the SLO requires. Common reasons:
When this happens, we file an explicit exception. It includes:
Exceptions are tracked in a single file in the security repo. The security review meeting goes through every active exception monthly. Anything past expiry gets revisited.
Without explicit exceptions, the "we'll get to it" path silently extends forever. With them, the discipline is visible.
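One way to keep exceptions reviewable is to give each record a fixed shape and let a script flag anything past expiry before the monthly meeting. The sketch below is illustrative only. The owner and expiry date come from the process described here; the other fields (`component`, `reason`, `mitigation`) and all names are assumptions, not our actual file format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VulnException:
    # Field names are hypothetical; owner and expiry are the essential ones.
    cve_id: str
    component: str
    owner: str
    reason: str
    mitigation: str
    expires: date


def past_expiry(exceptions: list[VulnException], today: date) -> list[VulnException]:
    """Return exceptions whose expiry date has passed, oldest first."""
    overdue = [e for e in exceptions if e.expires < today]
    return sorted(overdue, key=lambda e: e.expires)
```

Running `past_expiry(all_exceptions, date.today())` ahead of the security review gives the meeting its agenda. An empty `owner` or a far-future `expires` is worth rejecting at review time.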
What we actually track:
MTTP — Mean Time To Patch. From CVE publication to patched in production. We watch this monthly per severity. Trending up is a problem; trending down is a win.
SLO compliance rate. Percentage of CVEs patched within their SLO window. Target 95%+ for high, 90%+ for medium.
Open exceptions count. Low number is fine; growing number is a debt signal.
Days-since-last-base-image-rebuild. Any image that hasn't been rebuilt in 30+ days is suspicious. Either it's deprecated and should be removed, or it has accumulated CVE drift.
Coverage. Percentage of repos with Dependabot enabled. Percentage of images with scanning in CI. The boring infrastructure-of-the-program metrics.
None of these are subtle. They're meant to be boring and trackable. The point isn't to optimize them — it's to notice when one drifts in the wrong direction.
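Two of these metrics are simple enough to compute from a list of timestamps. The sketch below assumes the input records already exist somewhere (a ticket export, a registry API); the tuple layout and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def mttp_days(patches: list[tuple[str, datetime, datetime]]) -> dict[str, float]:
    """Mean time to patch per severity, in days.

    Each record is (severity, cve_published, patched_in_production).
    """
    by_severity = defaultdict(list)
    for severity, published, patched in patches:
        by_severity[severity].append((patched - published).total_seconds() / 86400)
    return {severity: mean(days) for severity, days in by_severity.items()}


def stale_images(last_rebuilt: dict[str, datetime], now: datetime,
                 max_age_days: int = 30) -> list[str]:
    """Names of images whose last rebuild is at least `max_age_days` old."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, built in last_rebuilt.items() if built <= cutoff)
```

Neither function decides anything. They only surface the numbers so that drift in the wrong direction is visible month over month.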
A few patterns we keep seeing:
Reacting to scanner output without triage. Treating every high-severity row as urgent. You burn out, then you start ignoring the scanner entirely. Triage first, then act.
No SBOM, no inventory. When the next Log4Shell drops, "do we use it?" should be answerable in minutes, not days. Track what you have running where, in detail.
Accepting vulnerabilities without documentation. The team rotates, the original "we'll fix this next quarter" is forgotten, the CVE stays open for years. Every exception has an owner and an expiry.
Patching without testing. A panicked patch can break things worse than the vulnerability. The CI pipeline runs even during emergencies — if anything, especially then.
Skipping the routine because it's not interesting. Nightly base image rebuilds, weekly scans, monthly reviews — none of these are exciting. They catch most of the real risk before it becomes a crisis.
You will not patch every vulnerability instantly. That's not how this works. The realistic state is: most patches arrive within their SLO, a handful sit in accepted exceptions, and a small percentage are in flight at any given time. The teams that claim zero open vulnerabilities are either lying or building software that nobody uses.
What you can aim for: a system where every CVE has been seen, triaged, and assigned to a bucket. Nothing slips through unnoticed. Emergencies are rare because the routine catches things before they escalate. When something does escalate, you have a practiced playbook.
Vulnerability handling isn't glamorous and it never ends. The good news is that almost all of it is solvable with consistent process, clear ownership, and explicit deadlines. The exotic emergency stuff is the small minority. Most of the work is just keeping the pipes clean.