Least privilege fails when it's a one-time audit that locks things down until something breaks, then gets reverted. The iterative, log-driven approach that tightens permissions safely — and the policies we stopped writing by hand.

Cloud IAM Least-Privilege Without Breaking Everything

Everyone agrees on least privilege in principle. In practice it tends to go one of two ways: permissions sprawl to *:* because that's what makes the deploy work at 5pm on a Friday, or someone does a heroic lockdown that breaks a batch job nobody remembered, gets paged, and reverts the whole thing. Neither gets you to least privilege. What works is treating it as an iterative, evidence-driven process rather than a one-shot audit.

Why the big-bang lockdown fails

A from-scratch "minimal" policy is guesswork. You can't enumerate every action a service legitimately needs by reading the code — there are calls in error paths, in monthly jobs, in dependencies' SDKs you've never inspected. So a hand-written minimal policy is always missing something, and the thing it's missing surfaces as a production failure days later, often in a code path with no good error handling. After two of those, the team's lesson is "least privilege causes outages" and they stop trying.

The approach: observe, then tighten

Get the data instead of guessing. Every major cloud logs which identity called which API. Mine those logs to learn what each role actually uses, then write the policy to match observed behavior plus a margin.

AWS: IAM Access Analyzer can generate a policy directly from CloudTrail history:

aws accessanalyzer start-policy-generation \
  --policy-generation-details '{"principalArn":"arn:aws:iam::ACCT:role/my-service"}' \
  --cloud-trail-details '{...time range, trail ARN...}'
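# Then fetch the result with: aws accessanalyzer get-generated-policy --job-id <jobId returned above>
# The generated policy is scoped to the actions the role actually used in that time range.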

GCP: the IAM Recommender flags role grants whose permissions went unused over the trailing 90 days and suggests a tighter role.

Azure: Entra's access reviews and PIM usage data play a similar role, though they work at the level of role assignments rather than individual API actions.

The shift is from "what might this need?" (unknowable, so you over-grant) to "what has this used in 90 days?" (measured, so you can scope precisely).
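To make the measured question concrete, here is a minimal sketch, assuming the logs have been exported as CloudTrail-style JSON records. The fields `eventSource`, `eventName`, and `userIdentity.sessionContext.sessionIssuer.arn` are CloudTrail record fields; the account ID, role, sample events, and `granted` set are made up for illustration, and deriving the action name from `eventSource` is a rough heuristic that does not hold for every service.

```python
from collections import defaultdict


def action_of(event: dict) -> str:
    """Approximate an IAM action name from a CloudTrail-style record.

    Heuristic only: the eventSource prefix usually, but not always,
    matches the IAM service prefix.
    """
    service = event["eventSource"].split(".")[0]
    return f"{service}:{event['eventName']}"


def used_actions_by_role(events: list[dict]) -> dict[str, set[str]]:
    """Collect the set of actions each assumed role was seen calling."""
    used: dict[str, set[str]] = defaultdict(set)
    for event in events:
        issuer = (
            event.get("userIdentity", {})
            .get("sessionContext", {})
            .get("sessionIssuer", {})
        )
        role_arn = issuer.get("arn")
        if role_arn:  # skip calls not made through an assumed role
            used[role_arn].add(action_of(event))
    return dict(used)


# Hypothetical sample data standing in for 90 days of exported logs.
ROLE = "arn:aws:iam::111122223333:role/my-service"
IDENTITY = {"sessionContext": {"sessionIssuer": {"arn": ROLE}}}
events = [
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": IDENTITY},
    {"eventSource": "dynamodb.amazonaws.com", "eventName": "DescribeTable",
     "userIdentity": IDENTITY},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": IDENTITY},
]
# Hypothetical set of actions the role's current policy grants.
granted = {"s3:GetObject", "s3:PutObject", "s3:DeleteObject",
           "dynamodb:DescribeTable"}

used = used_actions_by_role(events)[ROLE]
unused = sorted(granted - used)
print(unused)  # ['s3:DeleteObject', 's3:PutObject']
```

One caveat on the data itself: object-level S3 calls such as GetObject only show up in CloudTrail when data event logging is enabled, so "never seen in the logs" can mean "never logged" rather than "never used".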

Tighten in stages, with a safety net

Don't go from broad to minimal in one step. Stage it:

1. Audit mode first. Before removing a permission, confirm it is genuinely unused over a long enough window. 90 days catches monthly jobs and most quarterly ones; 30 days does not, because a batch job that runs on the 1st can fall just outside a 30-day window, and that is the one that will bite you.
2. Tighten in a non-prod account, run the full workload including the rare paths (DR drills, batch, backfills), and watch for AccessDenied.
3. Roll to prod with alerting on authorization failures, so a missing permission shows up as an alert you can fix in minutes rather than a silent broken code path discovered weeks later.
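The window argument in the first step is easy to check with a small sketch. Everything here is illustrative: `removable_actions`, the dates, and the action names are assumptions, and the last-used timestamps would in practice come from your log analysis.

```python
from datetime import datetime, timedelta, timezone


def removable_actions(
    granted: set[str],
    last_used: dict[str, datetime],
    now: datetime,
    window_days: int = 90,
) -> set[str]:
    """Return granted actions with no recorded use inside the window."""
    cutoff = now - timedelta(days=window_days)
    return {
        action
        for action in granted
        if action not in last_used or last_used[action] < cutoff
    }


# Hypothetical review date at the end of a 31-day month.
now = datetime(2025, 7, 31, 12, 0, tzinfo=timezone.utc)
granted = {"s3:GetObject", "s3:PutObject", "glue:StartJobRun"}
last_used = {
    "s3:GetObject": datetime(2025, 7, 30, 9, 0, tzinfo=timezone.utc),
    # A batch job that runs early on the 1st of each month.
    "glue:StartJobRun": datetime(2025, 7, 1, 2, 0, tzinfo=timezone.utc),
}

short = sorted(removable_actions(granted, last_used, now, window_days=30))
long = sorted(removable_actions(granted, last_used, now, window_days=90))
print(short)  # ['glue:StartJobRun', 's3:PutObject']
print(long)   # ['s3:PutObject']
```

With the 30-day window the monthly job's permission looks unused and would be removed; with 90 days only the permission that was never exercised is flagged.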

The AccessDenied alert is the safety net that makes the whole thing tolerable: when you do scope too tightly, you find out as soon as the affected code path runs, and you find out specifically (role X denied s3:GetObject on bucket Y). The fix is a one-line policy addition, not an archaeology project.
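A rough sketch of what produces that kind of alert, assuming CloudTrail-style records are already flowing into something you can filter. `errorCode`, `userIdentity.arn`, and `requestParameters.bucketName` are CloudTrail fields; the function name, the set of codes, and the sample events are assumptions for illustration, and real services report denials under a few more error codes than the two listed.

```python
# errorCode values commonly recorded for authorization failures (not exhaustive).
DENIED_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}


def denial_alerts(events: list[dict]) -> list[str]:
    """Turn denied calls into one-line alerts naming principal and action."""
    alerts = []
    for event in events:
        if event.get("errorCode") not in DENIED_CODES:
            continue  # successful call or an unrelated error
        service = event["eventSource"].split(".")[0]  # rough IAM prefix
        action = f"{service}:{event['eventName']}"
        principal = event.get("userIdentity", {}).get("arn", "unknown principal")
        params = event.get("requestParameters") or {}  # field may be null
        message = f"{principal} denied {action}"
        if "bucketName" in params:
            message += f" on bucket {params['bucketName']}"
        alerts.append(message)
    return alerts


# Hypothetical events: one denied read, one successful call.
SESSION = "arn:aws:sts::111122223333:assumed-role/my-service/worker-1"
events = [
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "errorCode": "AccessDenied",
     "userIdentity": {"arn": SESSION},
     "requestParameters": {"bucketName": "reports-archive"}},
    {"eventSource": "s3.amazonaws.com", "eventName": "ListBuckets",
     "userIdentity": {"arn": SESSION},
     "requestParameters": None},
]

alerts = denial_alerts(events)
for line in alerts:
    print(line)
```

The point of formatting the alert this way is that it names the role, the action, and the resource, which is everything needed to write the missing policy statement.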

Permission boundaries: cap the blast radius

Least privilege on individual roles is necessary; bounding what any role in a domain can do is the structural backstop. AWS permission boundaries (and SCPs at the org level) set a ceiling: even if someone mistakenly attaches an over-broad policy, the boundary caps the effective permissions.

SCP: deny iam:* except for a designated admin role
SCP: deny actions outside approved regions
Permission boundary: any role devs create can't exceed this set

This decouples "who can grant permissions" from "how bad a mistaken grant can be." Developers can self-serve roles inside the boundary without each grant being a potential org-wide privilege escalation.
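As a hedged illustration of the first two guardrails, the sketch below builds the policy documents as Python dicts and prints them as JSON. The condition keys `aws:RequestedRegion` and `aws:PrincipalArn` are real AWS global condition keys; the function names, region list, and admin role ARN are placeholders. A production region guardrail usually also exempts global services (IAM, Route 53, CloudFront, and similar) with a NotAction list, which this sketch leaves out.

```python
import json


def region_guardrail(approved_regions: list[str]) -> dict:
    """SCP sketch: deny any request made outside the approved regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": approved_regions}
            },
        }],
    }


def iam_guardrail(admin_role_arn: str) -> dict:
    """SCP sketch: deny iam:* to every principal except one admin role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyIamExceptAdmin",
            "Effect": "Deny",
            "Action": "iam:*",
            "Resource": "*",
            "Condition": {
                "ArnNotLike": {"aws:PrincipalArn": admin_role_arn}
            },
        }],
    }


# Placeholder values, not taken from any real organization.
region_scp = region_guardrail(["eu-west-1", "eu-central-1"])
iam_scp = iam_guardrail("arn:aws:iam::*:role/org-iam-admin")

print(json.dumps(region_scp, indent=2))
print(json.dumps(iam_scp, indent=2))
```

Both are explicit denies, so they hold no matter what an individual role's policy allows, which is what makes them a ceiling rather than one more grant to audit.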

Kill static credentials while you're here

The highest-leverage IAM improvement often isn't scoping a policy; it's eliminating long-lived keys entirely. A perfectly scoped static access key that leaks is still a standing liability. Replace static credentials with short-lived, federated ones:

Workloads: instance/pod identity (IAM Roles for Service Accounts, GCP Workload Identity) — no key to leak.
CI/CD: OIDC federation, so the pipeline assumes a role per run with no stored secret.
Humans: SSO + short-session role assumption, not personal long-lived keys.

A credential that lives for 15 minutes and is scoped to one role is a fundamentally smaller attack surface than a perfectly written policy attached to a key that lives forever in someone's .env.
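For the CI/CD case, the piece that replaces the stored secret is the role's trust policy. Here is a minimal sketch, assuming GitHub Actions as the pipeline (the article does not name a CI system, so that choice, the function name, the account ID, and the repository are all assumptions). The provider host, the `sts.amazonaws.com` audience, and the `repo:OWNER/REPO:ref:refs/heads/BRANCH` subject format follow GitHub's OIDC integration with AWS.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str) -> dict:
    """Trust policy sketch: only runs from one repo and branch may assume the role."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Token must be minted for AWS STS...
                    f"{provider}:aud": "sts.amazonaws.com",
                    # ...and come from this repository and branch only.
                    f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


# Placeholder account and repository.
trust = github_oidc_trust_policy("111122223333", "example-org/deploy-tools", "main")
print(json.dumps(trust, indent=2))
```

Pinning the `sub` claim matters as much as the federation itself: without it, any workflow that can obtain a token from the same provider could assume the role.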

The mindset

Least privilege is a direction you move continuously, not a state you reach once. Start from observed usage, not imagination. Tighten in stages with AccessDenied alerting so mistakes are cheap and specific. Cap the worst case with permission boundaries. And prefer short-lived federated credentials over any static key, however well-scoped — because the best permission is the one that expires before an attacker can use it.

About Kiril Urbonas
