Least privilege fails when it's a one-time audit that locks things down until something breaks, then gets reverted. The iterative, log-driven approach that tightens permissions safely — and the policies we stopped writing by hand.
Everyone agrees on least privilege in principle. In practice it tends to go one of two ways: permissions sprawl to *:* because that's what makes the deploy work at 5pm on a Friday, or someone does a heroic lockdown that breaks a batch job nobody remembered, gets paged, and reverts the whole thing. Neither gets you to least privilege. What works is treating it as an iterative, evidence-driven process rather than a one-shot audit.
A from-scratch "minimal" policy is guesswork. You can't enumerate every action a service legitimately needs by reading the code — there are calls in error paths, in monthly jobs, in dependencies' SDKs you've never inspected. So a hand-written minimal policy is always missing something, and the thing it's missing surfaces as a production failure days later, often in a code path with no good error handling. After two of those, the team's lesson is "least privilege causes outages" and they stop trying.
Get the data instead of guessing. Every major cloud logs which identity called which API. Mine those logs to learn what each role actually uses, then write the policy to match observed behavior plus a margin.
AWS: IAM Access Analyzer can generate a policy directly from CloudTrail history:
aws accessanalyzer start-policy-generation \
  --policy-generation-details '{"principalArn":"arn:aws:iam::ACCT:role/my-service"}' \
  --cloud-trail-details '{...time range, trail ARN...}'
# the call returns a jobId; once the job completes, fetch the generated policy
aws accessanalyzer get-generated-policy --job-id <jobId>
# → a policy scoped to actions actually used
GCP: the IAM Recommender flags permissions that were granted but went unused over the trailing 90 days, and suggests a tighter role.
Azure: Entra's access reviews and PIM usage data serve the same purpose.
The shift is from "what might this need?" (unknowable, so you over-grant) to "what has this used in 90 days?" (measured, so you can scope precisely).
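To make "observed behavior plus a margin" concrete, here is a minimal Python sketch. It assumes the audit log has already been flattened into simple records; the field names `role`, `service`, and `action` are hypothetical, not a real CloudTrail schema, and `policy_from_usage` is an illustrative helper rather than anything the cloud tooling provides. Real logs need a mapping step, because the logged event name does not always match the IAM action name.

```python
import json


def policy_from_usage(records, role, margin=()):
    """Build an allow-list policy from the actions one role was observed calling.

    `records` are simplified usage records (hypothetical shape, see lead-in).
    `margin` is a hand-picked set of extra actions for rare paths the
    observation window may have missed (monthly jobs, error handling).
    """
    used = {f'{r["service"]}:{r["action"]}' for r in records if r["role"] == role}
    actions = sorted(used | set(margin))
    return {
        "Version": "2012-10-17",
        # Resource scoping is left out of this sketch; a real policy
        # would narrow "*" to the specific resources observed.
        "Statement": [{"Effect": "Allow", "Action": actions, "Resource": "*"}],
    }


records = [
    {"role": "my-service", "service": "s3", "action": "GetObject"},
    {"role": "my-service", "service": "s3", "action": "GetObject"},
    {"role": "my-service", "service": "sqs", "action": "ReceiveMessage"},
    {"role": "other-service", "service": "iam", "action": "CreateUser"},
]

policy = policy_from_usage(records, "my-service", margin=["sqs:DeleteMessage"])
print(json.dumps(policy, indent=2))
```

Keeping the margin as an explicit argument means every action that is granted without evidence is visible in review, instead of hiding inside a wildcard.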
Don't go from broad to minimal in one step. Tighten in stages, with an alert on every AccessDenied error in place before each stage.
The AccessDenied alert is the safety net that makes the whole thing tolerable: when you do scope too tightly, you find out immediately and specifically (role X denied s3:GetObject on bucket Y), and the fix is a one-line policy addition, not an archaeology project.
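As an illustration of what "immediately and specifically" can look like, the following Python sketch filters CloudTrail-style events for denials and formats one alert line per denial. The sample events are invented for the example. The error-code check is an assumption: services report denials under different codes, so treat the matching rule as a starting point. Deriving the service prefix from `eventSource` is also a heuristic that does not hold for every service.

```python
import json


def is_denied(event):
    """Heuristic: treat AccessDenied* and *UnauthorizedOperation codes as denials."""
    code = event.get("errorCode", "")
    return code.startswith("AccessDenied") or code.endswith("UnauthorizedOperation")


def format_alert(event):
    """Render 'principal denied service:Action on target' for one denied event."""
    service = event["eventSource"].split(".")[0]  # heuristic, see lead-in
    # requestParameters can be null in the log, so fall back to an empty dict.
    target = json.dumps(event.get("requestParameters") or {}, sort_keys=True)
    principal = event["userIdentity"]["arn"]
    return f'{principal} denied {service}:{event["eventName"]} on {target}'


# Invented sample events: one denied call, one successful call.
events = [
    {
        "eventSource": "s3.amazonaws.com",
        "eventName": "GetObject",
        "userIdentity": {"arn": "arn:aws:sts::ACCT:assumed-role/my-service/job"},
        "requestParameters": {"bucketName": "reports", "key": "2024/01.csv"},
        "errorCode": "AccessDenied",
    },
    {
        "eventSource": "sqs.amazonaws.com",
        "eventName": "ReceiveMessage",
        "userIdentity": {"arn": "arn:aws:sts::ACCT:assumed-role/my-service/job"},
        "requestParameters": None,
    },
]

alerts = [format_alert(e) for e in events if is_denied(e)]
for line in alerts:
    print(line)
```

Each alert names the principal, the action, and the target, which is what turns the fix into a one-line policy addition.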
Least privilege on individual roles is necessary; bounding what any role in a domain can do is the structural backstop. AWS permission boundaries (and SCPs at the org level) set a ceiling: even if someone mistakenly attaches an over-broad policy, the boundary caps the effective permissions.
SCP: deny iam:* except for a designated admin role
SCP: deny actions outside approved regions
Permission boundary: any role devs create can't exceed this set
This decouples "who can grant permissions" from "how bad a mistaken grant can be." Developers can self-serve roles inside the boundary without each grant being a potential org-wide privilege escalation.
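The ceiling behaves like an intersection: an action is effective only if both the attached policy and the boundary allow it. The Python sketch below models just that rule. It is a deliberate simplification that ignores explicit denies, resources, conditions, and SCPs, and it uses `fnmatch` for wildcards, which also accepts `[...]` patterns that IAM does not. The helper names and the example boundary are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(action, patterns):
    """True if the action matches any pattern; IAM action names are case-insensitive."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def effective(action, identity_allow, boundary_allow):
    """An action is effective only when the identity policy AND the boundary allow it."""
    return matches(action, identity_allow) and matches(action, boundary_allow)


# A mistakenly over-broad identity policy, capped by a boundary.
identity_allow = ["*"]
boundary_allow = ["s3:*", "sqs:*", "logs:*"]

for action in ["s3:GetObject", "iam:CreateUser"]:
    verdict = "allowed" if effective(action, identity_allow, boundary_allow) else "blocked"
    print(f"{action}: {verdict}")
```

Even with `*` attached by mistake, `iam:CreateUser` stays blocked because the boundary never lists it.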
The highest-leverage IAM improvement often isn't scoping a policy; it's eliminating long-lived keys entirely. A perfectly-scoped static access key that leaks is still a standing liability. Replace static credentials with short-lived, federated ones.
A credential that lives for 15 minutes and is scoped to one role is a fundamentally smaller attack surface than a perfectly-written policy attached to a key that lives forever in someone's .env.
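One way to get such a credential on AWS is to exchange an OIDC token for temporary keys through STS. The Python sketch below is an assumption about how that could be wired up, not a description of any particular setup: `web_identity_request` and `fetch_credentials` are hypothetical helpers, the token is a placeholder, and the second function needs the third-party `boto3` package plus a role that trusts your identity provider.

```python
MIN_SESSION_SECONDS = 900     # shortest session STS will issue (15 minutes)
MAX_SESSION_SECONDS = 43200   # STS upper limit; the role's own maximum may be lower


def web_identity_request(role_arn, session_name, token,
                         duration_seconds=MIN_SESSION_SECONDS):
    """Build the parameters for an STS AssumeRoleWithWebIdentity call."""
    if not MIN_SESSION_SECONDS <= duration_seconds <= MAX_SESSION_SECONDS:
        raise ValueError(
            f"duration_seconds must be between {MIN_SESSION_SECONDS} "
            f"and {MAX_SESSION_SECONDS}"
        )
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": token,
        "DurationSeconds": duration_seconds,
    }


def fetch_credentials(request):
    """Exchange the token for temporary credentials (requires boto3 and network access)."""
    import boto3  # third-party; imported here so the sketch loads without it
    response = boto3.client("sts").assume_role_with_web_identity(**request)
    return response["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration


request = web_identity_request(
    "arn:aws:iam::ACCT:role/my-service",  # placeholder account, as elsewhere in this post
    "deploy-session",
    token="<OIDC token from your identity provider>",
)
print(request["DurationSeconds"])
```

Defaulting to the minimum duration makes a longer session an explicit decision rather than something inherited from a library default.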
Least privilege is a direction you move continuously, not a state you reach once. Start from observed usage, not imagination. Tighten in stages with AccessDenied alerting so mistakes are cheap and specific. Cap the worst case with permission boundaries. And prefer short-lived federated credentials over any static key, however well-scoped — because the best permission is the one that expires before an attacker can use it.