We replaced 14 long-lived IAM users with SSO + temporary credentials. The migration plan, the gotchas, and the policies we now enforce.
Six months ago we had 14 long-lived IAM users, three of which had AdministratorAccess. Today we have zero IAM users with console access and zero static access keys for humans. Every action is traced back to an SSO identity with a session that expires in ≤ 8 hours.
We had three near-misses:
~/.aws/credentials containing a long-lived key.
Any of those could have been catastrophic. Static credentials had to go.
┌─────────────────┐    SAML/OIDC    ┌─────────────────────────┐
│ Google Workspace│ ──────────────► │ AWS IAM Identity Center │
└─────────────────┘                 └────────────┬────────────┘
                                                 │ AssumeRole
                              ┌──────────────────┼──────────────────┐
                              ▼                  ▼                  ▼
                        ┌────────────┐     ┌────────────┐     ┌────────────┐
                        │ prod acct  │     │ stage acct │     │ dev acct   │
                        │ (3 roles)  │     │ (4 roles)  │     │ (5 roles)  │
                        └────────────┘     └────────────┘     └────────────┘
ReadOnly, Developer, SREOnCall, BillingViewer, ProdBreakGlass)
aws sso login instead of static keys
aws iam list-users --query 'Users[*].[UserName,CreateDate,PasswordLastUsed]' --output table
aws iam list-access-keys --user-name <each user>
aws iam get-account-authorization-details > iam-snapshot.json
We mapped every IAM user → real human → permission set. Two "users" turned out to be service accounts that had been re-purposed for a human because someone needed admin quickly. Those got split.
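A minimal Python sketch of that mapping pass, assuming an unfiltered iam-snapshot.json from the command above. The function name summarize_users is illustrative; the keys it reads (UserDetailList, GroupDetailList, AttachedManagedPolicies, GroupList, UserPolicyList) are the ones get-account-authorization-details emits. It only looks at managed policy ARNs, so an admin grant hidden in an inline policy document would not be flagged.

```python
import json

ADMIN_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"


def summarize_users(snapshot: dict) -> list:
    """Return one row per IAM user with direct and group-inherited managed policies."""
    # Map group name -> managed policy ARNs attached to that group.
    group_policies = {
        g["GroupName"]: [p["PolicyArn"] for p in g.get("AttachedManagedPolicies", [])]
        for g in snapshot.get("GroupDetailList", [])
    }
    rows = []
    for user in snapshot.get("UserDetailList", []):
        arns = {p["PolicyArn"] for p in user.get("AttachedManagedPolicies", [])}
        for group in user.get("GroupList", []):
            arns.update(group_policies.get(group, []))
        rows.append({
            "user": user["UserName"],
            "managed_policies": sorted(arns),
            "inline_policies": [p["PolicyName"] for p in user.get("UserPolicyList", [])],
            "is_admin": ADMIN_ARN in arns,
        })
    return rows


def load_snapshot(path: str) -> dict:
    """Read the JSON file written by get-account-authorization-details."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Calling summarize_users(load_snapshot("iam-snapshot.json")) gives a list that can be dumped into a spreadsheet next to the "real human" and "target permission set" columns.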
resource "aws_ssoadmin_permission_set" "developer" {
  name             = "Developer"
  instance_arn     = local.sso_instance_arn
  session_duration = "PT8H"
  description      = "Read most things, write to dev/stage, no prod write"
}

resource "aws_ssoadmin_managed_policy_attachment" "developer_readonly" {
  instance_arn       = local.sso_instance_arn
  managed_policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
  permission_set_arn = aws_ssoadmin_permission_set.developer.arn
}

resource "aws_ssoadmin_permission_set_inline_policy" "developer_dev_write" {
  instance_arn       = local.sso_instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.developer.arn
  inline_policy      = data.aws_iam_policy_document.developer_dev_write.json
}
We code-reviewed every permission set. Three reviewers minimum for anything touching prod.
For one week, both old IAM users and new SSO access were live. Engineers used SSO for daily work; old creds were the fallback. We measured CloudTrail events per identity to see who was still using the old path.
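One way to get that per-identity count is to tally CloudTrail records by the userIdentity element. The sketch below is an assumption about how such a tally could look, not a description of the exact tooling used here. It relies on two documented facts: IAM user calls carry type "IAMUser" with a userName, and Identity Center sessions show up as assumed roles whose role name starts with AWSReservedSSO_. The names classify, events_per_identity, and load_records are illustrative.

```python
import gzip
import json
from collections import Counter


def classify(record: dict) -> tuple:
    """Return (path, identity) for a single CloudTrail record."""
    ident = record.get("userIdentity", {})
    kind = ident.get("type")
    if kind == "IAMUser":
        return "legacy-iam-user", ident.get("userName", "unknown")
    if kind == "AssumedRole":
        # arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
        parts = ident.get("arn", "").split("/")
        role = parts[1] if len(parts) > 1 else ""
        session = parts[2] if len(parts) > 2 else "unknown"
        if role.startswith("AWSReservedSSO_"):
            return "sso", session
        return "other-role", role or "unknown"
    return "other", kind or "unknown"


def events_per_identity(records) -> Counter:
    """Count events per (path, identity) pair."""
    counts = Counter()
    for record in records:
        counts[classify(record)] += 1
    return counts


def load_records(path: str) -> list:
    """Read one gzipped CloudTrail log file as delivered to S3."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh).get("Records", [])
```

Any identity still appearing under "legacy-iam-user" near the end of the overlap week is someone to talk to before the old credentials are revoked.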
You can't put a service account into IAM Identity Center. We were tempted to try; don't. Use IAM Roles for Service Accounts (IRSA) on EKS, EC2 instance profiles, or GitHub OIDC for CI. Static keys for services are a step backward.
# GitHub Actions assuming an AWS role via OIDC: no static creds
# The job must also grant `permissions: id-token: write` so it can request the OIDC token
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/github-deploy
    aws-region: us-east-1
Some older CLIs (and some Terraform providers in 2023) didn't refresh SSO sessions cleanly. The fix was to use aws-vault or granted as a wrapper that handles refresh:
$ granted sso login --sso-region us-east-1
$ assume Developer.dev
[Developer.dev] $ terraform plan
You will eventually need to do something the normal SSO roles don't allow. We have a ProdBreakGlass permission set that:
AdministratorAccess
#sec-emergency Slack channel
We've used it twice in 6 months. Both times the post-incident review found the SSO permission sets were missing a legitimate permission, and we added it.
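For illustration, break-glass use can be detected from CloudTrail records and turned into a Slack message. This is a hedged sketch, not the alerting pipeline described above: it assumes the Identity Center role for the permission set is named with the prefix AWSReservedSSO_ProdBreakGlass_, and that the message goes to a Slack incoming webhook, which accepts a JSON body with a "text" field. The function names are hypothetical, and the HTTP POST itself is left out.

```python
import json

# Assumed role-name prefix for the ProdBreakGlass permission set.
BREAK_GLASS_PREFIX = "AWSReservedSSO_ProdBreakGlass_"


def break_glass_events(records):
    """Yield (who, event_name, event_time) for calls made under the break-glass role."""
    for record in records:
        ident = record.get("userIdentity", {})
        if ident.get("type") != "AssumedRole":
            continue
        # arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
        parts = ident.get("arn", "").split("/")
        if len(parts) == 3 and parts[1].startswith(BREAK_GLASS_PREFIX):
            yield parts[2], record.get("eventName", "?"), record.get("eventTime", "?")


def slack_payload(records):
    """Build an incoming-webhook JSON body, or None if break-glass was not used."""
    events = list(break_glass_events(records))
    if not events:
        return None
    users = sorted({who for who, _, _ in events})
    text = (
        f"ProdBreakGlass used by {', '.join(users)}: "
        f"{len(events)} event(s), first seen at {events[0][2]}"
    )
    return json.dumps({"text": text})
```

Returning None when nothing matched keeps the caller simple: post only when there is a payload.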
Per-account CloudTrail with management + data events sent to S3 + CloudWatch added ~$180/month across our org. Worth it for the audit trail. We sample data events selectively (S3 + Lambda only) to keep cost in check.
# SCP on the org root
data "aws_iam_policy_document" "no_iam_users" {
  statement {
    sid    = "DenyIAMUserCreation"
    effect = "Deny"
    actions = [
      "iam:CreateUser",
      "iam:CreateLoginProfile",
      "iam:CreateAccessKey",
    ]
    resources = ["*"]

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalTag/AllowIAMUsers"
      values   = ["true"]
    }
  }
}
This SCP denies creating IAM users, login profiles, and access keys unless the calling principal carries the AllowIAMUsers=true tag. That tag is held by exactly one role, used only for emergency provisioning.
sso login.
aws iam list-users, aws iam list-access-keys, CloudTrail review. Catch regressions before they accumulate.
The migration was 4 weeks of focused work. The reduction in mental overhead since has been worth every hour.