OIDC federation between AWS, GCP, and CI providers let us delete every long-lived cloud credential we had. The setup, the gotchas, and the trust-relationship discipline.
About two years ago we had AWS access keys in twenty places: GitHub Actions secrets, GCP Secret Manager, a few engineer laptops, a couple of CI runners, one or two long-running EC2 instance profiles that should have been roles. Each one was a credential we'd have to find and rotate if anything leaked. Today we have zero long-lived cloud access keys outside of break-glass and one legacy integration we haven't migrated. This post covers the federation patterns that got us there.
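Long-lived credentials are the bottom rung of cloud security. They get copied into many places, they stay valid until someone rotates them, and every copy has to be found and rotated if anything leaks.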
The fix is to replace them with short-lived tokens issued at the moment of use. AWS, GCP, and Azure all support this via OIDC trust relationships: an external identity provider (GitHub Actions, GCP, another AWS account, etc.) presents a signed JWT; the target cloud verifies it; if the trust policy matches, it issues a credential good for ~1 hour.
No persistent secret on the calling side. The credential expires before you'd even notice it got logged.
Federation works because both sides know how to verify a JWT. The caller's identity provider signs the JWT with a key whose public part is published at a well-known URL (the JWKS endpoint). The target cloud fetches that public key, verifies the signature, and checks the claims in the JWT against a trust policy — rules like "this JWT must be issued by GitHub Actions, for the repo company/api, on a push to main." If all the claims match, the cloud hands back a short-lived access token (an AWS STS credential, a GCP service account token, etc.).
You configure the trust policy on the target side once. The caller side just needs to request its JWT (the CI platform or the cloud SDK does this for it) and exchange it.
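To make the claim check concrete, here is a minimal Python sketch of the matching step only. It assumes the signature has already been verified against the issuer's JWKS, which this sketch deliberately skips. The helper names (`decode_claims`, `like`, `claims_match`, `make_unsigned_token`) are ours for illustration and are not any cloud's API.

```python
import base64
import json
import re


def decode_claims(jwt: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    payload_b64 = jwt.split(".")[1]
    # JWT segments are unpadded base64url; restore padding before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def like(value: str, pattern: str) -> bool:
    """Wildcard match where * is any run of characters and ? is one character."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def claims_match(claims: dict, exact: dict, patterns: dict) -> bool:
    """Every exact rule and every wildcard rule must pass."""
    for name, expected in exact.items():
        if claims.get(name) != expected:
            return False
    for name, pattern in patterns.items():
        value = claims.get(name)
        if not isinstance(value, str) or not like(value, pattern):
            return False
    return True


def make_unsigned_token(claims: dict) -> str:
    """Build a demo token with a placeholder signature (illustration only)."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64(claims)}.sig"


rules = {
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "sts.amazonaws.com",
    "sub": "repo:company/api:ref:refs/heads/main",
}
main_token = make_unsigned_token(dict(rules))
branch_token = make_unsigned_token({**rules, "sub": "repo:company/api:ref:refs/heads/feature"})

print(claims_match(decode_claims(main_token), exact=rules, patterns={}))    # True
print(claims_match(decode_claims(branch_token), exact=rules, patterns={}))  # False
```

A real verifier checks the signature, issuer, and expiry before it ever looks at the other claims; the point here is only that the trust decision comes down to comparing a handful of strings.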
The most common path is GitHub Actions to AWS. The pattern is an IAM role whose trust policy accepts GitHub's OIDC tokens:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"},
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:company/api:ref:refs/heads/main"
      }
    }
  }]
}
The sub condition is the security boundary. It says "this role can only be assumed from the company/api repo on the main branch." A different repo or a feature branch can't assume it.
permissions:
  id-token: write # required for OIDC token issuance
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-deploy
      aws-region: us-east-1
After this step, the runner holds temporary AWS credentials that the rest of the job can use until the session expires. No AWS_ACCESS_KEY_ID secret.
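For the curious, the exchange behind that step looks roughly like the Python sketch below. It only builds the two HTTP requests and sends nothing. The environment variables `ACTIONS_ID_TOKEN_REQUEST_URL` and `ACTIONS_ID_TOKEN_REQUEST_TOKEN` are what GitHub exposes to jobs with `id-token: write`; the function names and the `github-deploy` session name are our own placeholders, and the real action goes through the AWS SDK rather than hand-built URLs.

```python
import os
import urllib.parse


def github_token_request(audience: str = "sts.amazonaws.com") -> tuple[str, dict]:
    """Return (url, headers) for asking GitHub for this job's OIDC token.

    The JSON response carries the signed JWT in its "value" field.
    """
    base = os.environ["ACTIONS_ID_TOKEN_REQUEST_URL"]
    bearer = os.environ["ACTIONS_ID_TOKEN_REQUEST_TOKEN"]
    separator = "&" if "?" in base else "?"
    url = f"{base}{separator}audience={urllib.parse.quote(audience, safe='')}"
    return url, {"Authorization": f"Bearer {bearer}"}


def sts_exchange_url(role_arn: str, jwt: str,
                     session_name: str = "github-deploy",
                     duration_seconds: int = 3600) -> str:
    """Build the AssumeRoleWithWebIdentity call that trades the JWT for credentials.

    This STS action is not signed with AWS credentials: the JWT is the proof.
    """
    query = urllib.parse.urlencode({
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": jwt,
        "DurationSeconds": str(duration_seconds),
    })
    return f"https://sts.amazonaws.com/?{query}"
```

Nothing in either request is a stored secret: the bearer token exists only for the life of the job, and the JWT is useless to any role whose trust policy doesn't match its claims.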
GitHub Actions to GCP has the same shape with different ceremony:
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: 'projects/123/locations/global/workloadIdentityPools/gh-pool/providers/gh-provider'
    service_account: 'deploy@my-project.iam.gserviceaccount.com'
The workflow gets a GCP access token for that service account. Same idea, different config surface.
AWS to GCP is less common but useful when a workload running on AWS needs to write to a GCP bucket. Pattern:
The workload signs an sts:GetCallerIdentity request with its AWS role credentials → presents that signed request to GCP as proof of identity → exchanges it for a service-account access token.
The reverse direction (GCP→AWS) exists too, with a similar shape. We use this for analytics: Lambda functions on AWS that push data into BigQuery without storing GCP credentials.
This is where most federation setups go wrong. The trust policy controls who can use this role. Get it wrong and you've replaced a leaked-key risk with an "anyone can assume your role" risk.
What we enforce in every trust policy:
Specific repo + branch. repo:company/api:ref:refs/heads/main. Not repo:company/* (any repo). Not repo:company/api:* (any branch). Specific.
Audience claim. For AWS with the official GitHub action, aud is sts.amazonaws.com. Always pin it explicitly.
Environment claim (when using GitHub Actions environments). environment:production — restricts which environment can assume the role. Especially useful for prod-only roles.
StringEquals over StringLike where you can. Pattern matching is for cases that genuinely need it; exact match is safer.
We had one setup early on where StringLike was used with a * that was a typo and accidentally matched too much. Caught it in a security review; would have been bad in a breach. Specificity over flexibility.
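Rules like these are easy to check mechanically. The Python sketch below is a hypothetical linter, not a tool we ship: `lint_trust_policy` and its messages are made up for illustration, and it only understands the GitHub Actions condition keys shown earlier in this post.

```python
def _as_list(value) -> list:
    """IAM allows a single string or a list of strings; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def lint_trust_policy(policy: dict,
                      provider: str = "token.actions.githubusercontent.com") -> list[str]:
    """Return human-readable problems found in an OIDC trust policy document."""
    problems = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    aud_key, sub_key = f"{provider}:aud", f"{provider}:sub"

    for index, statement in enumerate(statements):
        if "sts:AssumeRoleWithWebIdentity" not in _as_list(statement.get("Action")):
            continue  # not a federation statement
        condition = statement.get("Condition", {})
        equals = condition.get("StringEquals", {})
        likes = condition.get("StringLike", {})

        if aud_key not in equals:
            problems.append(f"statement {index}: aud is not pinned with StringEquals")
        if sub_key not in equals and sub_key not in likes:
            problems.append(f"statement {index}: no sub condition at all")
        for pattern in _as_list(likes.get(sub_key)):
            if "*" in pattern or "?" in pattern:
                problems.append(f"statement {index}: wildcard in sub pattern {pattern!r}")
    return problems


too_wide = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"},
            "StringLike": {"token.actions.githubusercontent.com:sub": "repo:company/*"},
        },
    }],
}
for problem in lint_trust_policy(too_wide):
    print(problem)
```

Running something like this in CI against every role's trust policy would turn a stray * into a failed check instead of a security-review finding. A wildcard that is genuinely needed can be allow-listed by hand, which at least forces someone to justify it.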
The migration took about a year, prioritized roughly by risk:
After step 6, all that remained was one legacy integration with a partner service that doesn't support OIDC. That one stays on a long-lived key, rotated quarterly, scoped narrowly.
Over-permissive trust policies. repo:company/* accepts any repo in the org. Use the specific repo.
Forgetting id-token: write. GitHub Actions workflows need this permission to request an OIDC token. Without it, the federation step fails with a confusing error.
Using the wrong audience. AWS and GCP expect different audience values; a token minted for one fails verification at the other. Read the docs for the target.
Long session durations. AWS roles default to 1-hour sessions; you can raise to 12 hours. We keep them at 1 hour for CI roles. Longer sessions = more blast radius if a runner is compromised mid-run.
Same role for too many things. A "general deploy" role with all the permissions for everything. Defeats the security benefit. We have separate roles per service or per environment.
Federation is one of those cleanup projects where the moment you finish it, you wonder how you ever lived without it. Long-lived access keys in CI secrets and env files just become a thing you stopped doing. The trust-policy discipline is the part that takes practice; the rest is mechanical.