A working AWS security baseline, derived from the actual incidents we've had and the audit findings we've cleared.
Most AWS security advice available online is correct and mostly useless: long lists of things you should do, with no signal on which ones matter. After running on AWS for several years, surviving one audit, and dealing with a handful of small security incidents, this is the working baseline I'd recommend. Each item earns its place because of something that bit us or came close.
The traditional security mental model is "keep the bad guys out." That works for a single-perimeter network. AWS isn't that. The right mental model is blast radius: when something does go wrong (and it will), how much damage can the compromise cause?
Blast radius is determined by:
Most security work in AWS is about reducing blast radius. The "perimeter" thinking — VPNs, bastion hosts, "deny all from the internet" — addresses a small slice and creates a false sense of security if you don't address the rest.
IAM is where the most damage happens. Specific things we enforce:
No long-lived access keys for humans. Humans authenticate via SSO (we use AWS SSO / IAM Identity Center). Their AWS sessions are short-lived (1-hour) and federated. We have zero long-lived access keys for human users; the audit script enforces this monthly.
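A minimal sketch of what the access-key part of such an audit could look like. It assumes the key records have the shape of the `AccessKeyMetadata` entries returned by IAM's `ListAccessKeys` call; the function name and the `human_users` set are hypothetical, and this is not the actual audit script.

```python
from datetime import datetime, timezone

def find_active_human_keys(key_metadata, human_users, now=None):
    """Return (user, key_id, age_days) for every active key owned by a human user.

    key_metadata: dicts with UserName, AccessKeyId, Status, CreateDate
                  (assumed shape: IAM ListAccessKeys metadata).
    human_users:  hypothetical set of user names that belong to people.
    """
    now = now or datetime.now(timezone.utc)
    findings = []
    for key in key_metadata:
        if key["UserName"] in human_users and key["Status"] == "Active":
            age_days = (now - key["CreateDate"]).days
            findings.append((key["UserName"], key["AccessKeyId"], age_days))
    return findings
```

With a target of zero long-lived human keys, any non-empty result is a finding, regardless of the key's age.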
Service-to-AWS auth via IAM Roles for Service Accounts (IRSA). EKS workloads assume IAM roles via a trust relationship with the cluster's OIDC provider. No static credentials in pods. Same story for Lambda (execution roles), ECS tasks (task roles), and EC2 (instance profiles).
Permission boundaries on roles created by services. When a service creates IAM roles dynamically (e.g., a CI system creating a role per build), we attach a permission boundary that caps the role's privileges. Even if the service is compromised and tries to create overpowered roles, the boundary prevents it.
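The reason a boundary caps a role is that an action has to be allowed by both the role's own policy and the boundary. The sketch below models only that intersection. It is a deliberately simplified model (allow patterns only, no explicit denies, resources, or conditions), and the function names are hypothetical.

```python
from fnmatch import fnmatchcase

def _allowed(action, patterns):
    """True if the action matches any allow pattern (IAM action names are case-insensitive)."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)

def effective_allow(action, identity_allows, boundary_allows):
    """An action takes effect only if the role's policy AND the boundary both allow it."""
    return _allowed(action, identity_allows) and _allowed(action, boundary_allows)

# A compromised CI system creates a role with "*" but the boundary only allows S3 and logs.
boundary = ["s3:*", "logs:*"]
print(effective_allow("s3:GetObject", ["*"], boundary))    # allowed by both
print(effective_allow("iam:CreateUser", ["*"], boundary))  # blocked by the boundary
```

The boundary never grants anything by itself: a role whose own policy lacks an action does not gain it because the boundary mentions it.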
Least privilege via real usage. We use IAM Access Analyzer's policy generation feature, which looks at what a role actually used over up to the last 90 days of CloudTrail activity and proposes a tighter policy. We do this quarterly for high-privilege roles. It's surprising how often a role has *:* permissions but actually uses a dozen specific actions.
MFA on the root account. The root account is the keys to the kingdom. MFA, hardware token, root credentials in a sealed envelope in a safe. We log in to root once per quarter to verify it still works; otherwise it's not used.
Two specific IAM incidents that shaped our practice:
A leaked CI secret with overly broad permissions. A developer accidentally pushed a CI environment variable to a public repo. The secret had S3 read/write across the org. We rotated within an hour of detection. Damage: someone hit our public bucket once with the credentials before we noticed; nothing valuable there. Lesson: CI credentials should be scoped per-pipeline (one credential, one job's needs), not shared org-wide.
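To make "one credential, one job's needs" concrete, here is a sketch of a policy generator that scopes a pipeline to its own prefix in an artifact bucket. The bucket layout (one prefix per pipeline), the function name, and the choice of actions are assumptions for illustration, not a description of our setup.

```python
def pipeline_s3_policy(bucket, pipeline):
    """Build an IAM policy document limited to one pipeline's prefix in one bucket.

    Assumes a hypothetical layout where each pipeline writes under <bucket>/<pipeline>/.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PipelineArtifactsOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{pipeline}/*",
            }
        ],
    }
```

A leak of a credential carrying this policy exposes one pipeline's artifacts rather than S3 across the org.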
A misconfigured trust policy. We had a role that any account could assume due to a misconfigured trust policy (Principal: "*" with no condition). Found by Access Analyzer's external access checker. No evidence anyone exploited it, but the window was about three weeks. We added a "trust policy must specify conditions" check to our IaC linter.
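A linter rule for this can be small. The sketch below flags any Allow statement in a trust policy whose principal is a wildcard and that carries no Condition block. It operates on the policy as a plain dict; the function names are hypothetical and this is not the rule from our linter.

```python
def _as_list(value):
    """IAM allows a single value or a list in most places; normalise to a list."""
    return value if isinstance(value, list) else [value]

def _is_wildcard_principal(principal):
    """True for Principal "*" or any principal type whose value includes "*"."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        return any("*" in _as_list(v) for v in principal.values())
    return False

def open_trust_statements(trust_policy):
    """Return Allow statements that trust everyone and have no Condition."""
    findings = []
    for stmt in _as_list(trust_policy.get("Statement", [])):
        if (stmt.get("Effect") == "Allow"
                and _is_wildcard_principal(stmt.get("Principal"))
                and not stmt.get("Condition")):
            findings.append(stmt)
    return findings
```

This only checks that a condition exists, not that it is a meaningful one, so it complements Access Analyzer rather than replacing it.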
The network controls we run:
No public IP for compute by default. EC2 instances and ECS tasks don't get public IPs. They run in private subnets and reach the internet (when needed) through NAT gateways. Public access is through ALBs in public subnets, which terminate TLS and forward to private targets.
Security groups, not NACLs, for filtering. Security groups are stateful and per-ENI. NACLs are stateless and per-subnet — easier to misconfigure. We use NACLs only for coarse defense-in-depth at a few perimeter subnets.
SG rules reference SGs, not CIDRs. "Allow port 5432 from sg-app" not "Allow port 5432 from 10.0.0.0/16". When the app scales, new instances get the SG and access is automatic. When the SG goes away, access vanishes.
VPC endpoints for AWS services. S3 gateway endpoint, DynamoDB gateway endpoint, plus interface endpoints for Secrets Manager, KMS, ECR. Two reasons: a lower NAT bill, and traffic to AWS services stays inside AWS's network rather than traversing the public internet.
No 0.0.0.0/0 inbound on anything internal. Public ALBs allow 443 from 0.0.0.0/0 (that's their job). Nothing else does. This is enforced by an SCP — see below.
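Independently of how the rule is enforced, it is easy to audit. The sketch below scans security groups for world-open ingress and exempts only port 443 on known public ALB groups. It assumes the input has the shape of EC2's `DescribeSecurityGroups` response entries; the function name and the `public_alb_group_ids` set are hypothetical.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}

def open_ingress(security_groups, public_alb_group_ids):
    """Return (group_id, from_port, to_port) for every world-open ingress rule
    that is not HTTPS on a known public ALB security group."""
    findings = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            cidrs = [r["CidrIp"] for r in perm.get("IpRanges", [])]
            cidrs += [r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])]
            if not OPEN_CIDRS.intersection(cidrs):
                continue  # rule is scoped to specific CIDRs or to other SGs
            is_alb_https = (sg["GroupId"] in public_alb_group_ids
                            and perm.get("FromPort") == 443
                            and perm.get("ToPort") == 443)
            if not is_alb_https:
                findings.append((sg["GroupId"], perm.get("FromPort"), perm.get("ToPort")))
    return findings
```

Rules that reference another security group show up under `UserIdGroupPairs` rather than `IpRanges`, so they pass this check by construction.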
SCPs (service control policies) live at the AWS Organizations level and apply to entire accounts. They're hard guardrails: even an account admin can't bypass them. Ours include:
SCPs are the single most powerful security control. They're also the hardest to write correctly, because a mistake locks things down across every account at once. We write them carefully, test in a sandbox account first, and roll out one SCP at a time.
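For readers who haven't written one, here is what an SCP document looks like. This is a hypothetical example, not one of ours: it denies the API calls that would switch off the audit logging described in this post. It is built as a Python dict so it can be printed as JSON and tested in a sandbox account first.

```python
import json

def deny_log_tampering_scp():
    """Build a hypothetical SCP that denies turning off audit logging."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLogTampering",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "guardduty:DeleteDetector",
                    "ec2:DeleteFlowLogs",
                ],
                "Resource": "*",
            }
        ],
    }

if __name__ == "__main__":
    # Print the JSON document that would be attached to an OU or account.
    print(json.dumps(deny_log_tampering_scp(), indent=2))
```

An SCP never grants permissions. It only limits what the IAM policies inside the account can allow, which is why a Deny statement like this one holds even against an account admin.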
Three logs we never turn off:
CloudTrail in every region, every account, with logs going to a central security account. Retention 1 year hot, longer cold. CloudTrail is what tells us "what API calls happened" — essential for any post-incident investigation.
VPC Flow Logs for production VPCs. Volume is high (we sample at 10% in some VPCs to control cost) but worth it. When a security incident asks "did this IP talk to this IP," flow logs answer.
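Answering "did this IP talk to this IP" from raw records is mostly a parsing job. The sketch below assumes the default (version 2) flow log record format with space-separated fields; custom formats would need a different field list, and the function names are hypothetical.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

def parse_record(line):
    """Split one space-separated flow log line into a dict keyed by field name."""
    return dict(zip(FIELDS, line.split()))

def talked(lines, ip_a, ip_b):
    """Return every record where ip_a and ip_b appear as the two endpoints, in either direction."""
    hits = []
    for line in lines:
        rec = parse_record(line)
        if {rec.get("srcaddr"), rec.get("dstaddr")} == {ip_a, ip_b}:
            hits.append(rec)
    return hits
```

In the VPCs where we sample, an empty result is weaker evidence: it means no matching traffic was captured, not that none occurred.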
GuardDuty in every account. It costs ~$30-100/account/month depending on volume. It catches a lot: cryptocurrency mining, command-and-control traffic, anomalous IAM activity. We've had ~5 high-severity GuardDuty findings in the last year. None turned out to be a real compromise (all were explainable on investigation), but the visibility is worth it.
The logs go to an account where only the security team has access. Even an account-admin in a workload account can't tamper with the logs of their own account.
Secrets do not live in code, environment variables in repos, or Terraform state. They live in Secrets Manager (we tried Parameter Store; Secrets Manager's rotation features won out).
The integration:
Direct access to Secrets Manager is restricted: humans don't read production secrets directly. If a human needs to debug, they get a temporary credential via a break-glass workflow that requires a second person to approve.
In transit:
At rest:
The KMS bill is real — about $1/key/month plus per-call charges. For our ~80 CMKs and the call volume, ~$200/month. Cheap insurance.
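The split between key charges and call charges is worth working out once. In the sketch below, the $1/key/month figure is from the text above; the per-request price is an assumption (check current KMS pricing for your region), and the request volume is back-derived from the ~$200 total rather than measured.

```python
KEY_MONTHLY_USD = 1.00        # from the text: about $1 per key per month
USD_PER_10K_REQUESTS = 0.03   # assumption: verify against current KMS pricing

def kms_monthly_cost(num_keys, requests_per_month):
    """Estimate the monthly KMS bill: flat per-key charge plus per-request charge."""
    key_cost = num_keys * KEY_MONTHLY_USD
    request_cost = requests_per_month / 10_000 * USD_PER_10K_REQUESTS
    return key_cost + request_cost

# 80 keys cost $80 flat, so a ~$200 bill implies ~$120 of request charges,
# which at the assumed price is about 40 million requests per month.
print(kms_monthly_cost(80, 40_000_000))
```

Under these assumptions the per-call charges, not the keys, are the larger share of the bill.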
Encryption doesn't help if the data is also gone. Backups:
The harder part is testing restores. We do this every quarter: pick a backup, restore it to a new RDS instance, and verify the data is queryable. Two of our quarterly restore tests have caught issues (one was a missing IAM permission on the backup KMS key).
The Linux side: we run Amazon Linux 2 / 2023 with auto-update via SSM. Patches go to dev → staging → prod over 7 days. Reboots happen in maintenance windows.
The Kubernetes side: Karpenter rotates nodes every 30 days, which picks up the latest AMI (which has the latest patches). Our average node age is ~14 days; nothing runs on a node that's older than 30 days.
Container images: rebuilt nightly to pick up base image patches. Image scanning (Trivy, in CI) flags vulnerable images before deploy. We have a policy: critical CVEs block deploys, high CVEs are tracked but don't block.
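The block/track policy can be expressed as a small gate over Trivy's JSON report. This sketch assumes the report was produced with `--format json` and loaded into a dict, with findings under `Results[].Vulnerabilities[]`; the function name and the way a CI job would act on the return value are hypothetical.

```python
def cve_gate(report):
    """Apply the policy to a Trivy JSON report: critical CVEs block, high CVEs are tracked.

    Returns (blocked, critical_ids, high_ids).
    """
    critical, high = [], []
    for result in report.get("Results") or []:
        # Targets with no findings may omit the Vulnerabilities key or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity")
            if severity == "CRITICAL":
                critical.append(vuln["VulnerabilityID"])
            elif severity == "HIGH":
                high.append(vuln["VulnerabilityID"])
    return bool(critical), critical, high
```

A CI step would fail the build when `blocked` is true and send `high_ids` to whatever tracker holds the backlog.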
A small script runs every quarter and reports:
*:* policies (target: zero outside a known whitelist)
Drift happens. The quarterly check catches it before it becomes an incident.
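The wildcard-policy check is the simplest of these to sketch. The version below takes policy documents as plain dicts keyed by policy name and reports every one that allows all actions on all resources and is not on the known list. The function names and the `allowlist` parameter are hypothetical; fetching the documents from IAM is left out.

```python
def _as_list(value):
    """IAM allows a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]

def has_full_wildcard(policy_doc):
    """True if any Allow statement grants every action on every resource."""
    for stmt in _as_list(policy_doc.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if ("*" in actions or "*:*" in actions) and "*" in resources:
            return True
    return False

def wildcard_policies(policies, allowlist):
    """Return sorted names of wildcard policies that are not on the known list."""
    return sorted(name for name, doc in policies.items()
                  if name not in allowlist and has_full_wildcard(doc))
```

The target for the returned list is empty; anything in it is either drift or a candidate for the known list.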
A few things that get recommended but don't pull their weight for our profile:
Bastion hosts. SSM Session Manager replaces them — no SSH keys, no public IPs, audit log built-in. We removed our last bastion two years ago.
WAF for everything. WAF on public-facing apps that handle untrusted input, yes. WAF on internal APIs, no — the cost and false positives don't justify it.
Manual KMS key rotation procedures. AWS does it automatically with a checkbox. The manual procedures from older guides aren't needed.
Start with the org structure. Multiple accounts, SCPs, central logging. This is the foundation; trying to retrofit it later is much harder.
SSO before access keys. Don't even create the IAM users. Set up SSO from day one and you avoid an entire class of problems.
GuardDuty is cheap; turn it on. ~$30-100/account/month for visibility you can't get any other way.
Backups you don't test aren't backups. Pick one backup per quarter and restore it to verify.
Quarterly audit script. Automate the checks. The first time you run it, you'll find drift; the second time, less; eventually it's a non-event.
Small things, consistently. No single AWS security control is bulletproof. The discipline is to do the small things consistently — least privilege, encryption defaults on, logs enabled, secrets in the right place — across hundreds of resources, every time.
Cloud security is mostly mechanical. The exotic threats (zero-days, supply-chain attacks) are real but rare. The everyday wins come from preventing the same five mistakes that 80% of incidents trace back to: leaked credentials, public buckets, overly broad IAM, misconfigured network access, and unpatched systems. Get those right and most of the threats disappear before they reach you.