We had .env files in three repos, AWS keys in Slack DMs, and a postgres password etched into a Confluence page. Cleaning it up took a sprint and changed how we think about secrets.

Secrets Management in Practice: From .env Files to Vault

The audit started because of a near-miss. Someone pushed a feature branch with a .env.staging file that had real credentials in it. GitHub's secret scanner caught the AWS access key in about four minutes and emailed us. The key got rotated. Nothing bad happened.

What was bad was the other stuff we found while tracing what else might be exposed. A staging postgres password in a Confluence runbook. An OpenAI key one of the engineers had pasted in a Slack DM six months prior. A secrets.json file in a half-deleted repo that nobody had cleared from local clones. Three different copies of the production Stripe webhook secret in different .env.local files on three engineers' laptops.

This post is what we did about it.

The starting state, written down honestly #

Before any tool decisions, we made a list of every place a secret could live. The exercise is unflattering but valuable. Ours, condensed:

Application .env-style files: 14 files across 4 repos
~/.aws/credentials on engineer laptops: 11 of 14 engineers
CI secret stores (GitHub Actions): about 50 key/value pairs
"Just for testing" configs in random branches: unknown count
Confluence pages: at least 6 with embedded credentials
Slack DMs and channels: unknown count, presumed many
1Password (used informally by ~half the team): some
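The first row of an inventory like this can be counted mechanically. A minimal sketch, assuming the repos are cloned under one local directory; the function name and the `.env` / `.env.<suffix>` naming rule are illustrative assumptions, not the script we actually ran:

```python
import os
from pathlib import Path


def find_env_files(root: str) -> list[Path]:
    """List files named .env or .env.<suffix> under `root`, skipping .git directories."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if name == ".env" or name.startswith(".env."):
                matches.append(Path(dirpath) / name)
    return sorted(matches)
```

Calling `find_env_files` on a checkout directory returns a sorted list of paths to review by hand. It only covers files on disk; Confluence, Slack, and CI stores still need their own search.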

The realisation that drove the rest of the work: the problem wasn't that we lacked a tool. We had 1Password. The problem was that we had no policy and no runtime-fetching infrastructure, so engineers chose convenience whenever it was offered.

Picking a vault #

We considered three options: AWS Secrets Manager (cheap and integrated with our IAM, but AWS-only), HashiCorp Vault (powerful and generic, with a hosted version available), and Doppler (developer-friendly with a simpler model, paid).

We picked Vault, hosted via HCP (HashiCorp Cloud Platform). The reasons that mattered for our team specifically:

We have non-AWS workloads (Vercel, an on-prem Jenkins, a few external SaaS integrations)
Dynamic database credentials were on the roadmap, and Vault's database secrets engine handles that with no code change in the apps
The Kubernetes integration via Vault Agent injector was straightforward to PoC

For a team running entirely on AWS with a small footprint, Secrets Manager would have been the right pick. We're not that team.

The migration plan #

Three phases over about a sprint. Each phase had a hard "don't go to the next until..." gate.

Phase 1: stand up Vault with one customer #

We provisioned an HCP Vault Plus cluster, set up the Kubernetes auth method, and installed the Vault Agent injector on our staging cluster. Then we picked one service — a low-risk batch worker — and migrated its single environment variable (an API key for an internal data provider) from a Kubernetes Secret to Vault.

The deploy looked like this in the pod template:

```yaml
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "batch-worker"
    # Renders the secret to /vault/secrets/config inside the pod
    vault.hashicorp.com/agent-inject-secret-config: "secret/data/staging/batch-worker"
    vault.hashicorp.com/agent-inject-template-config: |
      {{- with secret "secret/data/staging/batch-worker" -}}
      DATA_PROVIDER_KEY={{ .Data.data.api_key }}
      {{- end }}
```

The application read /vault/secrets/config instead of an env var. The change was 12 lines of YAML, no app code change. We let it bake for a week.
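How a service consumes the rendered file depends on the service. As a rough sketch, assuming a Python worker and the `KEY=VALUE` format produced by the template above, a loader could look like this (the function name is hypothetical):

```python
from pathlib import Path


def load_rendered_secrets(path: str = "/vault/secrets/config") -> dict[str, str]:
    """Parse KEY=VALUE lines from a file rendered by the Vault Agent template."""
    secrets: dict[str, str] = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=".
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line in {path}: expected KEY=VALUE")
        secrets[key.strip()] = value.strip()
    return secrets
```

A worker would then read `load_rendered_secrets()["DATA_PROVIDER_KEY"]` at startup. The error message deliberately omits the offending line so a secret value never lands in logs.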

Phase 2: migrate every service in staging #

With confidence in the pattern, we migrated all 14 staging services in two days. The work per service was mostly mechanical:

Move secret values from K8s Secret to Vault
Remove the K8s Secret manifest from the helm chart
Add the Vault Agent annotations
Update CI to validate the chart didn't accidentally re-add a Secret

We set a CI guardrail: any helm chart that defines a kind: Secret fails the lint job. There are no exceptions in our codebase any more.
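A guardrail of this kind can be approximated with a small check over rendered chart output. The sketch below is an assumption about how such a lint step might work, not our actual job: it expects the text produced by `helm template` and flags any top-level `kind: Secret` line. The function name is hypothetical.

```python
import re

# Matches a top-level "kind: Secret" line (optionally quoted, optionally followed by a comment).
# It does not match other kinds such as ExternalSecret or SecretStore, or indented keys.
SECRET_KIND = re.compile(r"^kind:\s*[\"']?Secret[\"']?\s*(#.*)?$")


def secret_kind_lines(rendered_manifests: str) -> list[int]:
    """Return 1-based line numbers where a manifest declares kind: Secret."""
    return [
        number
        for number, line in enumerate(rendered_manifests.splitlines(), start=1)
        if SECRET_KIND.match(line)
    ]
```

In CI, the lint job would fail whenever the returned list is non-empty. A line-based check is deliberately crude; a stricter version would parse each YAML document, which needs a third-party parser.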

Phase 3: production cutover #

The approach was the same as staging: riskier because of blast radius, but mechanically identical. We cut over services in dependency order (leaf services first, then their callers) to limit the damage if something went wrong.
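Dependency order is a topological sort. A minimal sketch using Python's standard `graphlib`, assuming the call graph is available as a mapping from each service to the services it calls; the service names in the usage note are made up:

```python
from graphlib import CycleError, TopologicalSorter


def cutover_order(calls: dict[str, set[str]]) -> list[str]:
    """Order services so that each one comes after every service it calls.

    `calls` maps a service name to the set of services it depends on.
    Raises CycleError if the call graph contains a cycle.
    """
    # TopologicalSorter treats the mapped values as predecessors,
    # so dependencies (leaf services) are emitted before their callers.
    return list(TopologicalSorter(calls).static_order())
```

For example, `cutover_order({"api": {"billing"}, "billing": {"ledger"}, "ledger": set()})` places `ledger` before `billing` and `billing` before `api`. A `CycleError` is a useful signal in itself: two services that call each other have to be cut over together.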

One service had a problem: it was reading its DB password at boot, and the Vault Agent template was rendered too late on cold start. The pod started, tried to connect, failed, restarted. We fixed it by adding a small init-container that waited for the secrets file to exist before letting the main container start. Took an hour. It's now part of our standard pod template.
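The waiting logic itself is small. The sketch below shows the idea in Python under stated assumptions: the path, timeout, and function name are illustrative, and the real init container may well be a few lines of shell instead.

```python
import os
import time


def wait_for_file(path: str, timeout: float = 60.0, poll_interval: float = 0.5) -> bool:
    """Return True once `path` exists and is non-empty, False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        # Require a non-empty file so a half-created file does not count as ready.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A container entrypoint would call `wait_for_file("/vault/secrets/config")` and exit non-zero on `False`, so a missing secret shows up as a clear startup failure instead of a connection error.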

Engineer laptops #

Vault and K8s injector solved the runtime side. Laptops were a separate problem.

We adopted granted (from Common Fate) for AWS access. Engineers run assume <role> and get short-lived AWS credentials in a session. The ~/.aws/credentials file disappeared from most laptops within a week. We turned off the IAM users that had been issuing those static keys.

For one-off secret access during debugging, engineers use the Vault CLI (vault read) directly, gated by their SSO identity. Logged. Auditable.

For everything else — third-party SaaS API keys, occasional shared credentials — we standardised on 1Password Business. Vaults are scoped per team. There are policies; people get audited; engineers leaving the company have all access revoked the same day.

What broke during the migration #

Two things, both manageable.

The first: a service that had been silently relying on a wrong secret. It connected to the wrong staging DB for months because the env var was a leftover from a copy-paste during initial setup. Nobody noticed because the wrong DB had similar enough data that nothing crashed. Migration to Vault forced us to write down the actual intended secret. We found and fixed three of these.

The second: our CI/CD pipelines now needed to authenticate to Vault. We used GitHub OIDC for this — Vault was configured to trust GitHub Actions identity tokens, mapped to specific repos. No long-lived CI credentials. The setup was about a half-day of YAML and a Vault policy.
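The exchange behind this is a single HTTP call: the CI job presents its GitHub-issued identity token to Vault's JWT auth login endpoint and receives a short-lived Vault token. A rough sketch, assuming the auth method is mounted at `jwt` and a role named `ci-deploy` (both hypothetical; our actual mount and role names are not shown here). HCP clusters typically also need a namespace header, which is left optional.

```python
import json
import urllib.request


def build_vault_login_request(vault_addr, role, jwt, mount="jwt", namespace=None):
    """Build the POST request that exchanges a CI identity token for a Vault token."""
    url = f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login"
    body = json.dumps({"role": role, "jwt": jwt}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if namespace:
        headers["X-Vault-Namespace"] = namespace
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def login(request) -> str:
    """Send the login request and return the Vault client token from the response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return payload["auth"]["client_token"]
```

In a GitHub Actions job, the identity token comes from the workflow's OIDC token endpoint and requires the `id-token: write` permission. Most pipelines would use an existing action for this instead of hand-rolled HTTP; the sketch is only meant to show that no stored credential is involved.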

What we'd do differently #

Earlier sprint planning. We did the work over a single sprint, but realistically the prep — the inventory, the policy decisions, the team training on Vault — could have happened in the sprint before. We crashed those into the same two weeks and the last days of the sprint were stressful.

Also: pick a smaller PoC service. Our "low-risk batch worker" still ran a payment-related calculation that the product team kept asking about. We'd have been better off with something genuinely no-stakes.

What we now enforce #

A short list, written down, that everyone has seen:

No secret in any file checked into git, ever. CI scans every commit. Pre-commit hook scans local diffs.
No long-lived AWS access keys on engineer laptops. Use granted for role assumption.
No production credentials in DMs or channels. Share via 1Password.
No Confluence page contains a credential. Our linter includes a grep query for credentials that must come back empty.
Every credential in Vault has a documented owner and rotation cadence. Unknown-owner credentials are deleted on the next quarterly review.
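The scanning in the first rule is usually handled by an off-the-shelf tool, but the core idea fits in a few lines. A minimal sketch, assuming unified diff input (for example from `git diff --cached`) and checking only for AWS access key IDs; the function name is hypothetical, and a real scanner covers far more credential formats:

```python
import re

# AWS access key IDs: "AKIA" (long-lived) or "ASIA" (temporary) followed by 16 characters.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def scan_added_lines(diff_text: str) -> list[tuple[int, str]]:
    """Return (diff line number, matched text) for added lines that look like AWS key IDs."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect additions; skip the "+++ b/file" header line.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            findings.append((number, match.group(0)))
    return findings
```

A pre-commit hook would pipe the staged diff into `scan_added_lines` and block the commit on any finding. Removed lines are ignored on purpose: deleting a leaked key from a file should not fail the hook, though the key still needs rotating.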

How long it took, end to end #

Sprint 1: planning and Vault PoC. Sprint 2: full migration. Sprint 3: laptop cleanup and policy. Three sprints, three engineers part-time on it.

The cost is real. The alternative, the next near-miss being an actual miss, was worse. Sooner or later, an audit like the one that started this work would have turned up an AWS key that was already public. We just got the easy version.


About Kiril Urbonas