We had .env files in three repos, AWS keys in Slack DMs, and a postgres password etched into a Confluence page. Cleaning it up took a sprint and changed how we think about secrets.
The audit started because of a near-miss. Someone pushed a feature branch with a .env.staging file that had real credentials in it. GitHub's secret scanner caught the AWS access key in about four minutes and emailed us. The key got rotated. Nothing bad happened.
What was bad was the other stuff we found while tracing what else might be exposed. A staging postgres password in a Confluence runbook. An OpenAI key one of the engineers had pasted in a Slack DM six months prior. A secrets.json file in a half-deleted repo that nobody had cleared from local clones. Three different copies of the production Stripe webhook secret in different .env.local files on three engineers' laptops.
This post is what we did about it.
Before any tool decisions, we made a list of every place a secret could live. The exercise is unflattering but valuable. Ours, condensed:
.env-style files: 14 files across 4 repos
~/.aws/credentials on engineer laptops: 11 of 14 engineers
The realisation that drove the rest of the work: the problem wasn't that we lacked a tool. We had 1Password. The problem was that we had no policy and no runtime-fetching infrastructure, so engineers chose convenience whenever it was offered.
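The file-based part of an inventory like this can be partly automated. The following is a minimal sketch, not the script used for the audit: the filename patterns and the skipped directories are assumptions, and it only finds files by name, so it will also flag harmless files such as .env.example.

```python
import os
from pathlib import Path

# Assumed list of directories that are not worth scanning; adjust as needed.
SKIP_DIRS = {".git", "node_modules", ".venv"}


def is_env_style(name: str) -> bool:
    """True for .env, .env.staging, .env.local and similar, plus secrets.json."""
    return name == ".env" or name.startswith(".env.") or name == "secrets.json"


def find_env_files(root: str) -> list[Path]:
    """Walk a directory tree and return env-style files, sorted by path."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if is_env_style(name):
                hits.append(Path(dirpath) / name)
    return sorted(hits)
```

Calling find_env_files on a directory of local clones gives a starting list to review by hand. It says nothing about secrets in Slack, Confluence, or git history, which need their own search.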
We considered three options. AWS Secrets Manager (cheap, integrated with our IAM, but AWS-only). HashiCorp Vault (powerful, hosted version available, generic). Doppler (developer-friendly, simpler model, paid).
We picked Vault, hosted via HCP (HashiCorp Cloud Platform). The reasons that mattered for our team specifically:
For a team running entirely on AWS with a small footprint, Secrets Manager would have been the right pick. We're not that team.
Three phases over about a sprint. Each phase had a hard "don't go to the next until..." gate.
We provisioned an HCP Vault Plus cluster, set up the Kubernetes auth method, and installed the Vault Agent injector on our staging cluster. Then we picked one service — a low-risk batch worker — and migrated its single environment variable (an API key for an internal data provider) from a Kubernetes Secret to Vault.
The deploy looked like this in the pod template:
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "batch-worker"
    vault.hashicorp.com/agent-inject-secret-config: "secret/data/staging/batch-worker"
    vault.hashicorp.com/agent-inject-template-config: |
      {{- with secret "secret/data/staging/batch-worker" -}}
      DATA_PROVIDER_KEY={{ .Data.data.api_key }}
      {{- end }}
The application read /vault/secrets/config instead of an env var. The change was 12 lines of YAML, no app code change. We let it bake for a week.
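For a service that does not already load its configuration from a file, the reading side might look like the sketch below. It assumes the rendered file contains plain KEY=VALUE lines, as the template above produces. The function names are hypothetical and not taken from the batch worker.

```python
from pathlib import Path


def parse_secrets(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    secrets = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            # Report the line number only, so a secret never lands in a log.
            raise ValueError(f"line {lineno}: expected KEY=VALUE")
        secrets[key.strip()] = value.strip()
    return secrets


def load_secrets(path: str = "/vault/secrets/config") -> dict[str, str]:
    """Read the file rendered by the Vault Agent and return its key/value pairs."""
    return parse_secrets(Path(path).read_text())
```

Splitting on the first = only means values that themselves contain = (base64 padding, for example) survive intact.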
With confidence in the pattern, we migrated all 14 staging services in two days. The work per service was mostly mechanical:
We set a CI guardrail: any helm chart that defines a kind: Secret fails the lint job. There are no exceptions in our codebase any more.
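A guardrail of this kind can be quite small. The sketch below is one possible shape, not the actual lint job: it assumes the chart has already been rendered to plain YAML (for example with helm template) and uses a regular expression rather than a YAML parser, so it only catches top-level kind: Secret declarations.

```python
import re

# Matches a top-level "kind: Secret" line; "kind: SecretStore" does not match.
KIND_SECRET = re.compile(r"^kind:\s*[\"']?Secret[\"']?\s*(#.*)?$", re.MULTILINE)


def find_secret_manifests(rendered: str) -> list[int]:
    """Return 0-based indexes of YAML documents that declare kind: Secret."""
    # Rendered chart output separates documents with lines containing only ---.
    docs = re.split(r"^---\s*$", rendered, flags=re.MULTILINE)
    return [i for i, doc in enumerate(docs) if KIND_SECRET.search(doc)]
```

A CI step could pass the rendered output to find_secret_manifests and exit non-zero when the returned list is not empty.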
The approach was the same as staging. Riskier because of blast radius, but mechanically identical. We cut over services in dependency order — leaf services first, then their callers — to limit the explosion if something went wrong.
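Leaf-first ordering is a topological sort of the call graph. A minimal sketch using the standard library follows. The service names and the graph itself are hypothetical, chosen only to illustrate the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "api-gateway": {"orders", "users"},
    "orders": {"billing"},
    "users": set(),
    "billing": set(),
}


def cutover_order(calls: dict[str, set[str]]) -> list[str]:
    """Order services so each one comes after everything it calls."""
    # TopologicalSorter treats the mapped values as predecessors, so
    # callees (the leaves) are emitted before their callers.
    return list(TopologicalSorter(calls).static_order())
```

If the call graph contains a cycle, static_order raises graphlib.CycleError, which is a useful signal in itself: those services have to be cut over together.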
One service had a problem: it was reading its DB password at boot, and the Vault Agent template was rendered too late on cold start. The pod started, tried to connect, failed, restarted. We fixed it by adding a small init-container that waited for the secrets file to exist before letting the main container start. Took an hour. It's now part of our standard pod template.
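The wait itself is a simple polling loop. The sketch below shows the idea in Python. The timeout and interval values are assumptions, and where the loop runs (an init container, or a wrapper around the entrypoint) depends on the pod spec.

```python
import os
import time


def wait_for_file(path: str, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until path exists and is non-empty; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        # Require a non-empty file so a half-created file does not count.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Returning False instead of blocking forever lets the caller exit with a clear error, which reads better in pod logs than a silent hang.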
Vault and the K8s injector solved the runtime side. Laptops were a separate problem.
We adopted granted (from Common Fate) for AWS access. Engineers run granted's assume <role> command and get short-lived AWS credentials in a session. The ~/.aws/credentials file disappeared from most laptops within a week. We turned off the IAM users that had been issuing those static keys.
For one-off secret access during debugging, engineers use the Vault CLI (vault read) directly, gated by their SSO identity. Logged. Auditable.
For everything else — third-party SaaS API keys, occasional shared credentials — we standardised on 1Password Business. Vaults are scoped per team. There are policies; people get audited; engineers leaving the company have all access revoked the same day.
Two things, both manageable.
The first: a service that had been silently relying on a wrong secret. It connected to the wrong staging DB for months because the env var was a leftover from a copy-paste during initial setup. Nobody noticed because the wrong DB had similar enough data that nothing crashed. Migration to Vault forced us to write down the actual intended secret. We found and fixed three of these.
The second: our CI/CD pipelines now needed to authenticate to Vault. We used GitHub OIDC for this — Vault was configured to trust GitHub Actions identity tokens, mapped to specific repos. No long-lived CI credentials. The setup was about a half-day of YAML and a Vault policy.
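The exchange behind this has two steps: the workflow asks the GitHub Actions runner for an OIDC token, then posts it to Vault's JWT auth login endpoint and receives a short-lived Vault token. The sketch below illustrates the flow with the standard library only. The mount path "jwt", the role name, the audience, and the "admin" namespace are assumptions rather than the configuration described above, and the workflow needs the id-token: write permission for the runner to issue a token.

```python
import json
import os
import urllib.parse
import urllib.request
from typing import Optional


def fetch_github_oidc_token(audience: str) -> str:
    """Ask the GitHub Actions runner for an OIDC identity token."""
    base = os.environ["ACTIONS_ID_TOKEN_REQUEST_URL"]
    bearer = os.environ["ACTIONS_ID_TOKEN_REQUEST_TOKEN"]
    url = f"{base}&audience={urllib.parse.quote(audience, safe='')}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {bearer}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["value"]


def build_vault_login(vault_addr: str, role: str, jwt: str,
                      mount: str = "jwt",
                      namespace: Optional[str] = None) -> urllib.request.Request:
    """Build the POST request for Vault's JWT auth login endpoint."""
    headers = {"Content-Type": "application/json"}
    if namespace:
        # Assumption: hosted clusters usually need a namespace such as "admin".
        headers["X-Vault-Namespace"] = namespace
    return urllib.request.Request(
        f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login",
        data=json.dumps({"role": role, "jwt": jwt}).encode(),
        headers=headers,
        method="POST",
    )


def vault_login(vault_addr: str, role: str, jwt: str, **kwargs) -> str:
    """Exchange the GitHub token for a short-lived Vault client token."""
    with urllib.request.urlopen(build_vault_login(vault_addr, role, jwt, **kwargs)) as resp:
        return json.load(resp)["auth"]["client_token"]
```

In a real pipeline an existing action or the Vault CLI would usually do this exchange. The point of the sketch is that no stored credential is involved at any step.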
Earlier sprint planning. We did the work over a single sprint, but realistically the prep — the inventory, the policy decisions, the team training on Vault — could have happened in the sprint before. We crashed those into the same two weeks and the last days of the sprint were stressful.
Also: pick a smaller PoC service. Our "low-risk batch worker" still ran a payment-related calculation that the product team kept asking about. We'd have been better off with something genuinely no-stakes.
A short list, written down, that everyone has seen:
granted for role assumption.
Sprint 1: planning and Vault PoC. Sprint 2: full migration. Sprint 3: laptop cleanup and policy. Three sprints, three engineers part-time on it.
The cost is real. The alternative — the next near-miss being an actual miss — was worse. The audit that started this work would have found a public AWS key sooner or later. We just got the easy version.