You always have known vulnerabilities. The question is how you triage, patch, and respond. This is the discipline we run after a few real incidents and a lot of routine work.

Handling Vulnerabilities in Production: What We Actually Do

If you run any non-trivial software in production, you have known vulnerabilities right now. Your OS packages have a few open CVEs. Some npm or Python dependency has a high-severity advisory you haven't patched yet. A library you bundled six months ago has been deprecated by upstream. This is normal. The question isn't whether you have vulnerabilities — it's whether you have a system for handling them.

After a few real incidents and a lot of routine maintenance, this is what we run.

The first uncomfortable truth #

Scanners will tell you that you have hundreds of vulnerabilities. That's not a crisis; that's the baseline. A CVE count is a starting point, not a verdict. The work is figuring out which ones actually matter.

The teams that struggle here are the ones who treat every red row in the scanner as urgent. They burn out fast, stop reading the scanner, and miss the one that actually mattered. The teams that handle this well have a triage process — most vulnerabilities go in the "patch on schedule" bucket, a few in "patch now", and you spend your energy where it counts.

The triage that matters #

CVSS scores are useful but not sufficient. A "critical 9.8" in a library you never call from any reachable code path is meaningfully less urgent than a "medium 6.4" in your auth path. We triage on three axes:

Severity. Standard CVSS score. Critical/high/medium/low. Easy to read, often misleading on its own.

Reachability. Is the vulnerable code actually called by our application? Some tooling can help answer this: Snyk and Semgrep Pro offer reachability analysis, and GitHub's CodeQL can trace data flow through your code, though coverage varies by language and ecosystem. A high-severity bug in a transitive dependency that's never imported is much less urgent than a medium-severity bug in the function our auth flow calls on every request.

Blast radius. What's behind the vulnerable code? Our public-facing payments service has a much bigger blast radius than an internal cron job that runs nightly.

The triage isn't a formula — it's a conversation. But these three axes get you a much more useful priority than "everything red is urgent."

The four buckets #

Each vulnerability lands in one of four buckets:

Fix now (within 24 hours). Active exploit in the wild, critical severity, reachable code, internet-exposed service. Pager-worthy. Examples from the last few years: Log4Shell, certain OpenSSL CVEs, the polyfill.io supply-chain attack.

Fix soon (within 30 days). High severity, reachable, no active exploitation evidence yet. The bulk of "patch this" work. Goes into the next sprint with a hard due date.

Fix later (within 90 days). Medium severity, or high severity but not reachable, or in a non-public component. Gets included in routine patching cycles.

Accept. Low severity, or no reachable path, or the upstream has no fix and we have a mitigation in place. Documented with an explicit owner and an expiry date for the acceptance.

The "accept" path is real and not a failure mode. Some vulnerabilities you genuinely shouldn't patch — the fix breaks a critical dependency, or the vuln is in a component that's about to be retired. The discipline is: document the decision, set a review date, don't let "accepted" turn into "forgotten."
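As a rough illustration, the bucket rules above can be written as a first-pass suggestion that a human then confirms. This is a minimal sketch, not our actual tooling: the `Finding` fields and the `suggest_bucket` name are hypothetical, and two choices are assumptions. A reachable critical without exploitation evidence is suggested as "fix soon", and anything unreachable is suggested as "fix later" rather than "accept", because accepting is a documented human decision and not something a script should hand out.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    FIX_NOW = "fix now (within 24 hours)"
    FIX_SOON = "fix soon (within 30 days)"
    FIX_LATER = "fix later (within 90 days)"
    ACCEPT = "accept (documented, with owner and expiry)"


@dataclass
class Finding:
    """One scanner finding plus the triage answers (hypothetical shape)."""
    identifier: str
    severity: str             # "critical", "high", "medium" or "low"
    reachable: bool           # is the vulnerable code on a path we execute?
    internet_exposed: bool    # rough proxy for blast radius
    actively_exploited: bool  # evidence of exploitation in the wild


def suggest_bucket(finding: Finding) -> Bucket:
    """Suggest a starting bucket; the triage conversation can override it."""
    if finding.severity == "low":
        return Bucket.ACCEPT
    if finding.severity in ("critical", "high") and finding.reachable:
        if (finding.severity == "critical"
                and finding.actively_exploited
                and finding.internet_exposed):
            return Bucket.FIX_NOW
        # Assumption: reachable critical/high without all the "fix now"
        # conditions lands in "fix soon".
        return Bucket.FIX_SOON
    # Medium severity, or critical/high that is not reachable.
    return Bucket.FIX_LATER


if __name__ == "__main__":
    example = Finding("EXAMPLE-1", "high", reachable=False,
                      internet_exposed=True, actively_exploited=False)
    print(suggest_bucket(example).value)
```

The output is only a default. Moving a finding up or down a bucket after discussion is expected, which is why the function returns a suggestion instead of setting a deadline.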

The emergency playbook (when something big hits) #

A few times a year, a CVE drops that's serious enough to interrupt regular work. Log4Shell. Heartbleed. Shellshock. The polyfill.io compromise. The pattern is similar each time:

1. Confirm reach. Do we use the affected library? Where? In which versions? Search every repo and the image inventory (a recursive grep is a fine start). This is where having a software bill of materials (SBOM) saves you hours.
2. Assess exposure. Is the vulnerable code on the public internet? Behind auth? In a sandboxed environment? Internet-exposed gets fixed first; internal can wait a few hours.
3. Apply the temporary mitigation. Before the real patch lands, there's often a config workaround: block specific input patterns at the WAF, disable a feature flag, restrict network access. Buy time.
4. Patch. Upgrade the dependency or apply the vendor fix. Run regression tests. Deploy.
5. Verify. Confirm the patched version is what's actually running. Scan again. Some images take more than one rebuild to pick up the fix (transitive deps).
6. Communicate. Internal status updates, a status page entry if it's customer-facing, and a post-incident doc once things are stable.

We've practiced this enough times now that the first 30 minutes go on autopilot. The hard part is always step 1 — knowing exactly what you have running where.
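The "confirm reach" step is easier to picture with a small example. The sketch below assumes one SBOM per service stored as CycloneDX JSON in a single directory; that layout, the directory name, and the `find_component` helper are all hypothetical. It relies only on the top-level `components` array with `name` and `version` fields and does not descend into nested components.

```python
import json
from pathlib import Path


def find_component(sbom_dir, package_name):
    """Return (sbom_file, component_name, version) for every SBOM entry
    whose name contains package_name (case-insensitive)."""
    hits = []
    for path in sorted(Path(sbom_dir).glob("*.json")):
        sbom = json.loads(path.read_text(encoding="utf-8"))
        # CycloneDX JSON lists dependencies under a top-level "components" key.
        for component in sbom.get("components", []):
            name = component.get("name", "")
            if package_name.lower() in name.lower():
                version = component.get("version", "unknown")
                hits.append((path.name, name, version))
    return hits


if __name__ == "__main__":
    # "sboms" is a placeholder directory name.
    for sbom_file, name, version in find_component("sboms", "log4j"):
        print(f"{sbom_file}: {name} {version}")
```

A substring match is deliberately loose: in the first minutes of an incident a false positive costs a quick look, while a missed service costs much more.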

The boring routine that catches 95% #

Most vulnerability work isn't emergency work. It's the steady rhythm that keeps the inventory clean. Specific practices:

Dependabot (or equivalent) on every repo. Auto-opens PRs for dependency updates. We have it set up to auto-merge patch versions for trusted packages and require review for minor/major bumps. Most updates land within a week of release with no human intervention beyond the merge button.

Nightly base image rebuilds. Every container image we build is rebuilt nightly on top of the latest base image. Fixes for new CVEs in node:20-alpine get picked up automatically without us doing anything specific.

Weekly Trivy scans in CI. Every PR includes a vulnerability scan. New high/critical CVEs block the merge until acknowledged.

Monthly OS patching cycle. Linux hosts get patched on the first Tuesday of each month, scheduled in advance, gradual rollout dev → staging → prod.

Quarterly dependency review. A human reads through the long-tail of medium/low CVEs and decides which to address. Catches the ones that have been "accepted" for too long.

None of this is exotic. It's just consistent. The teams that struggle with vulnerability management almost always have one or two of these missing — usually the nightly base image rebuild (gives you accumulating OS-level CVEs that nobody notices) or the dependency review (lets the accept bucket grow indefinitely).

Patching SLOs #

We have explicit SLOs that turn the triage into deadlines:

Severity	Internet-exposed	Internal
Critical + active exploit	24h	72h
Critical	7 days	14 days
High	30 days	60 days
Medium	90 days	180 days
Low	best effort	best effort

The clock starts when the CVE is published (or when an active exploit is identified, whichever is first). The owner is the team that owns the affected component. If the SLO is at risk, it escalates.
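To make the table concrete, here is a minimal sketch of turning it into a due date. The `patch_deadline` function and its parameters are hypothetical names, and the clock rule is taken literally from the text: it starts at the earlier of CVE publication and the first identified exploit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Patch windows in days: (internet-exposed, internal), copied from the table.
SLO_DAYS = {
    "critical_exploited": (1, 3),
    "critical": (7, 14),
    "high": (30, 60),
    "medium": (90, 180),
}


def patch_deadline(severity: str,
                   internet_exposed: bool,
                   published_at: datetime,
                   exploit_seen_at: Optional[datetime] = None) -> Optional[datetime]:
    """Return the SLO deadline, or None for best-effort (low) findings."""
    if severity == "low":
        return None
    row = severity
    if severity == "critical" and exploit_seen_at is not None:
        row = "critical_exploited"
    exposed_days, internal_days = SLO_DAYS[row]  # KeyError on unknown severity
    # The clock starts at whichever event came first.
    start = published_at
    if exploit_seen_at is not None:
        start = min(published_at, exploit_seen_at)
    days = exposed_days if internet_exposed else internal_days
    return start + timedelta(days=days)


if __name__ == "__main__":
    published = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(patch_deadline("high", True, published))
```

One consequence of reading the rule literally: if an exploit shows up well after publication, the shorter window is still measured from the publication date, so the deadline may already have passed. In this sketch that simply shows up as an overdue item that escalates.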

We measure compliance monthly. We're at ~95% on the highs, ~88% on the mediums. Lows we don't track strictly — they tend to roll up into the routine patching anyway.

Exception handling #

Sometimes you can't fix in the timeframe the SLO requires. Common reasons:

Upstream hasn't released a patch yet.
The patch breaks something critical (regression).
The vulnerable component is being retired in 2 weeks anyway.
The fix requires a major version bump that needs more testing time.

When this happens, we file an explicit exception. It includes:

The CVE and affected component
Why we can't fix within the SLO
The compensating control (WAF rule, network restriction, monitoring alert, etc.)
The expiry date (when we'll re-review)
The owner (a person, not a team)

Exceptions are tracked in a single file in the security repo. The security review meeting goes through every active exception monthly. Anything past expiry gets revisited.

Without explicit exceptions, the "we'll get to it" path silently extends forever. With them, the discipline is visible.
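For illustration, an exception entry and the monthly "what is past expiry" check might look like the sketch below. The field names mirror the list above, but the `ExceptionRecord` type, the `past_expiry` helper, and the sample values are assumptions, not the format of our actual file.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExceptionRecord:
    """One SLO exception; field names are illustrative."""
    cve_id: str
    component: str
    reason: str                # why we can't fix within the SLO
    compensating_control: str  # WAF rule, network restriction, alert, ...
    expires_on: date           # when the exception must be re-reviewed
    owner: str                 # a person, not a team


def past_expiry(exceptions, today):
    """Return exceptions whose expiry date has passed, oldest first."""
    expired = [e for e in exceptions if e.expires_on < today]
    return sorted(expired, key=lambda e: e.expires_on)


if __name__ == "__main__":
    # Placeholder data for demonstration only.
    records = [
        ExceptionRecord("EXAMPLE-1", "billing-api", "no upstream patch yet",
                        "WAF rule blocking the affected input", date(2024, 3, 1),
                        "alice"),
        ExceptionRecord("EXAMPLE-2", "legacy-cron", "component being retired",
                        "network access restricted", date(2024, 6, 1), "bob"),
    ]
    for record in past_expiry(records, today=date(2024, 4, 1)):
        print(f"{record.cve_id} ({record.owner}) expired {record.expires_on}")
```

Making `owner` and `expires_on` required fields means a record cannot be created without them, which is the same rule the review meeting enforces by hand.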

The metrics that matter #

What we actually track:

MTTP — Mean Time To Patch. From CVE publication to patched in production. We watch this monthly per severity. Trending up is a problem; trending down is a win.

SLO compliance rate. Percentage of CVEs patched within their SLO window. Target 95%+ for high, 90%+ for medium.

Open exceptions count. Low number is fine; growing number is a debt signal.

Days-since-last-base-image-rebuild. Any image that hasn't been rebuilt in 30+ days is suspicious. Either it's deprecated and should be removed, or it has accumulated CVE drift.

Coverage. Percentage of repos with Dependabot enabled. Percentage of images with scanning in CI. The boring infrastructure-of-the-program metrics.

None of these are subtle. They're meant to be boring and trackable. The point isn't to optimize them — it's to notice when one drifts in the wrong direction.
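The first two metrics are simple enough to compute from a list of patch records. The following is a sketch under assumptions: the `PatchRecord` shape and both function names are hypothetical, MTTP only counts CVEs that have actually been patched, and a CVE that is still open counts against compliance only once its deadline has passed.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class PatchRecord:
    severity: str
    published_at: datetime
    deadline: datetime                     # SLO deadline for this CVE
    patched_at: Optional[datetime] = None  # None while still open


def mean_time_to_patch_days(records, severity):
    """MTTP in days for one severity, or None if nothing is patched yet."""
    durations = [
        (r.patched_at - r.published_at).total_seconds() / 86400
        for r in records
        if r.severity == severity and r.patched_at is not None
    ]
    return mean(durations) if durations else None


def slo_compliance_rate(records, severity, as_of):
    """Fraction of CVEs patched within their SLO window, as of a given time."""
    met = missed = 0
    for r in records:
        if r.severity != severity:
            continue
        if r.patched_at is not None and r.patched_at <= r.deadline:
            met += 1
        elif r.patched_at is not None or as_of > r.deadline:
            missed += 1  # patched late, or still open past the deadline
        # Still open and inside the window: not counted either way yet.
    total = met + missed
    return met / total if total else None
```

Both functions return `None` when there is nothing to measure, so an empty month reads as "no data" and not as 0% or 100%.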

Common mistakes #

A few patterns we keep seeing:

Reacting to scanner output without triage. Treating every high-severity row as urgent. You burn out, then you start ignoring the scanner entirely. Triage first, then act.

No SBOM, no inventory. When the next Log4Shell drops, "do we use it?" should be answerable in minutes, not days. Track what you have running where, in detail.

Accepting vulnerabilities without documentation. The team rotates, the original "we'll fix this next quarter" is forgotten, the CVE stays open for years. Every exception has an owner and an expiry.

Patching without testing. A panicked patch can break things worse than the vulnerability. The CI pipeline runs even during emergencies — if anything, especially then.

Skipping the routine because it's not interesting. Nightly base image rebuilds, weekly scans, monthly reviews — none of these are exciting. They catch most of the real risk before it becomes a crisis.

The honest tradeoff #

You will not patch every vulnerability instantly. That's not how this works. The realistic state is: most patches arrive within their SLO, a handful sit in accepted exceptions, and a small percentage are in flight at any given time. The teams that aim for zero open vulnerabilities are either lying or building software that nobody uses.

What you can aim for: a system where every CVE has been seen, triaged, and assigned to a bucket. Nothing slips through unnoticed. Emergencies are rare because the routine catches things before they escalate. When something does escalate, you have a practiced playbook.

What to read next #

Container security scanning: protecting Docker images — the scanning + admission control setup we run in CI
Linux security hardening — the broader host-level hardening that reduces the CVE surface area to begin with
Best practices: kernel and package patch management — the OS-level patching cadence in detail
Cloud security best practices: securing AWS infrastructure — defense in depth so a single unpatched CVE isn't catastrophic

Vulnerability handling isn't glamorous and it never ends. The good news is that almost all of it is solvable with consistent process, clear ownership, and explicit deadlines. The exotic emergency stuff is the small minority. Most of the work is just keeping the pipes clean.

About Kiril Urbonas
