Vault + Kubernetes auth + Vault Agent Injector. The setup, the failure modes during pod startup, and the patterns that beat raw Kubernetes Secrets.
Kubernetes Secrets are base64-encoded data sitting in etcd. That's not encryption; it's encoding. For anything sensitive, you want a real secrets backend — encrypted at rest, encrypted in transit, audit-logged, with rotation. Vault is the most common answer for self-managed setups; we've run it as the Kubernetes secrets backend for ~two years. This post is what works, what bit us, and the patterns that earn their place.
Kubernetes Secrets are convenient but weak. Vault is strong but operationally demanding. The integration model:
The pod never has long-lived credentials. The secret material is in Vault, behind audit logging and access controls. Kubernetes Secrets aren't eliminated — they often still hold session-cache-style values that aren't secret enough to need Vault — but the high-value secrets all move.
Vault's Kubernetes auth method validates pod-issued JWTs against the cluster's TokenReview API. The flow:
Vault calls the TokenReview API to confirm the JWT is valid.
The configuration on Vault's side ties:
path "secret/data/payments/*" {
  capabilities = ["read"]
}
…to a policy, and the policy to a role like:
vault write auth/kubernetes/role/payments \
    bound_service_account_names=payments-sa \
    bound_service_account_namespaces=payments \
    policies=payments-policy \
    ttl=15m
Now any pod running as payments-sa in the payments namespace can log in and receive a Vault token, valid for 15 minutes, that can read secrets under secret/data/payments/*.
The TTL is the key knob. Shorter TTL = smaller blast radius if a token leaks; longer TTL = fewer renewals, less Vault load.
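The login exchange described above can be sketched in a few lines of Python. This is a minimal sketch, not what the Vault Agent does internally: it assumes the auth method is mounted at auth/kubernetes (as in the role path above), and the Vault address you pass in is whatever your cluster uses. The function names are made up for illustration.

```python
import json
import urllib.request

# Default path where Kubernetes mounts the pod's service account token.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_login_request(vault_addr: str, role: str, jwt: str) -> urllib.request.Request:
    """Build the POST that exchanges a service account JWT for a Vault token."""
    body = json.dumps({"role": role, "jwt": jwt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/auth/kubernetes/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_login_response(payload: dict) -> tuple[str, int]:
    """Return (client_token, lease_duration_seconds) from a login response."""
    auth = payload["auth"]
    return auth["client_token"], int(auth["lease_duration"])


def login(vault_addr: str, role: str, token_path: str = SA_TOKEN_PATH) -> tuple[str, int]:
    """Read the pod's JWT, log in to Vault, and return the token and its TTL."""
    with open(token_path, encoding="utf-8") as f:
        jwt = f.read().strip()
    request = build_login_request(vault_addr, role, jwt)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_login_response(json.load(response))
```

With the payments role above, the returned lease duration should come back as 900 seconds, which is the 15m TTL from the role definition.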
The Vault Agent Injector is the piece that makes this usable from applications. Without it, every app needs Vault client code, token renewal logic, and so on. With it, the injector watches for pods with specific annotations and injects a Vault Agent sidecar that handles all the Vault interaction.
Pod annotation:
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "payments"
    vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/payments/db"
What happens at pod startup:
The Vault Agent writes the fetched secrets to a shared emptyDir volume.
The application sees a file like /vault/secrets/db-creds that contains the secret. The app reads the file; there is nothing Vault-specific in app code.
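On the application side, consuming the file can be this small. The sketch below assumes the agent has been given a template that renders the secret as key=value lines; the actual layout depends entirely on the template you configure, so treat the format and the function name as illustrative.

```python
from pathlib import Path


def read_rendered_secret(path: str) -> dict[str, str]:
    """Parse a rendered secret file of key=value lines into a dict.

    Assumes a key=value layout; adjust to whatever your agent template emits.
    """
    values: dict[str, str] = {}
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_number, raw_line in enumerate(lines, start=1):
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, value = line.partition("=")
        if not separator:
            # Report the line number only, so secret material never hits logs.
            raise ValueError(f"line {line_number}: expected key=value")
        values[key.strip()] = value.strip()
    return values
```

An app would call read_rendered_secret("/vault/secrets/db-creds") at startup, and again whenever it wants to pick up values the sidecar has re-rendered.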
Some secrets really are static — API keys for third-party services that don't support rotation. For these, Vault's KV-v2 engine still adds value:
The TTL on the Vault token is short even for static secrets: the secret material doesn't rotate, but access to it does. If a pod is compromised, the attacker has at most one token TTL before they need to re-authenticate.
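For reference, reading a static secret from KV-v2 over Vault's HTTP API looks roughly like the sketch below. It assumes the engine is mounted at secret (matching the secret/data/payments/db path used earlier) and that you already hold a short-lived token; the helper names are hypothetical. The detail worth noticing is that KV-v2 nests the payload one level deeper than you might expect, under data.data, next to per-version metadata.

```python
import json
import urllib.request


def build_kv2_read_request(vault_addr, token, mount, secret_path, version=None):
    """Build a GET for a KV-v2 secret, optionally pinned to a specific version."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/data/{secret_path}"
    if version is not None:
        url += f"?version={version}"
    return urllib.request.Request(url, headers={"X-Vault-Token": token}, method="GET")


def extract_kv2_secret(payload: dict) -> tuple[dict, int]:
    """Return (secret key/value pairs, version number) from a KV-v2 read response."""
    inner = payload["data"]
    return inner["data"], int(inner["metadata"]["version"])


def read_kv2_secret(vault_addr, token, mount, secret_path, version=None):
    """Fetch a KV-v2 secret and unwrap it. Fails once the token's TTL has passed."""
    request = build_kv2_read_request(vault_addr, token, mount, secret_path, version)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_kv2_secret(json.load(response))
```

Passing version=N reads an older version of the same secret, which is handy when a rotated key turns out to be wrong and you need to see what was there before.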
The biggest Vault feature most teams underuse: dynamic database credentials.
Vault's database engine connects to your DB (Postgres, MySQL, etc.) and creates a per-request user with limited permissions, returning the credentials. The user expires after the TTL; Vault revokes it automatically.
Pod calls Vault → "give me read-only access to the analytics DB." Vault returns (user: vault_abc123, password: ...). Pod uses those for 1 hour. Then they're revoked.
Benefits:
We use dynamic credentials for our analytics DB (where every service gets a unique read-only user) and for some operator workloads that occasionally need admin access.
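To make the lifecycle concrete, here is a sketch of how a client might interpret the response from the database engine's creds endpoint. It assumes the standard response shape (lease_id, lease_duration, and a data block with username and password); the role name in the usage note and the five-minute safety margin are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DynamicCredentials:
    username: str
    password: str = field(repr=False)  # keep the password out of logs and reprs
    lease_id: str = ""
    expires_at: datetime = datetime.max.replace(tzinfo=timezone.utc)


def parse_database_creds(payload: dict, issued_at: datetime) -> DynamicCredentials:
    """Turn a database creds response into credentials with an expiry time."""
    return DynamicCredentials(
        username=payload["data"]["username"],
        password=payload["data"]["password"],
        lease_id=payload["lease_id"],
        expires_at=issued_at + timedelta(seconds=int(payload["lease_duration"])),
    )


def needs_refresh(
    creds: DynamicCredentials,
    now: datetime,
    safety_margin: timedelta = timedelta(minutes=5),
) -> bool:
    """True once the credentials are within the safety margin of expiring."""
    return now >= creds.expires_at - safety_margin
```

A service would request credentials from something like database/creds/analytics-readonly (a hypothetical role name), open its connection pool with them, and ask for a fresh pair once needs_refresh returns True rather than waiting for Vault to revoke the user mid-query.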
The biggest gotcha: pod startup ordering. The Vault Agent init container fetches secrets before the main container starts. If Vault is unavailable, the init container retries, then fails, then the pod fails to start.
Vault outage = no new pods start until Vault recovers. On a Vault outage during a deploy, you stop being able to roll out new versions.
Mitigations:
agent-pre-populate-only. An annotation (vault.hashicorp.com/agent-pre-populate-only) that runs the agent once at startup as an init container, then exits. Useful when secrets don't need rotation; the pod starts with whatever Vault returned at start.
The "Vault is critical infrastructure" reality means you treat its availability with the same seriousness as the cluster itself.
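One cheap way to act on that is to check Vault before starting a rollout, so a deploy fails fast instead of leaving pods stuck in init. The sketch below is an illustration rather than something from our pipeline: it relies on the default status codes of Vault's sys/health endpoint, and the state labels and function names are our own.

```python
import urllib.error
import urllib.request

# Default status codes returned by GET /v1/sys/health.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "not-initialized",
    503: "sealed",
}


def classify_health(status_code: int) -> str:
    """Map a sys/health status code to a readable state label."""
    return HEALTH_STATES.get(status_code, "unknown")


def is_unsealed(status_code: int) -> bool:
    """True when the node that answered is unsealed (active or some standby)."""
    return classify_health(status_code) in {"active", "standby", "performance-standby"}


def probe_vault(vault_addr: str, timeout: float = 3.0) -> str:
    """Return the health state of the Vault node behind vault_addr."""
    url = f"{vault_addr.rstrip('/')}/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_health(response.status)
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a meaningful health state.
        return classify_health(err.code)
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return "unreachable"
```

A deploy script could call probe_vault and refuse to proceed on "sealed" or "unreachable". It does not remove the dependency; it just surfaces the outage before the rollout rather than halfway through it.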
Vault doesn't automatically rotate the static secrets it stores. You have to:
We rotate static secrets quarterly. Database root passwords (used by Vault to create dynamic users) get rotated yearly with a careful plan because changing the root means re-bootstrapping the dynamic engine.
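A rotation schedule is easier to keep when something flags overdue secrets for you. The sketch below checks the created_time that KV-v2 returns in a secret's metadata against a maximum age. It assumes UTC timestamps with a trailing Z, and it approximates "quarterly" as 90 days; both the threshold and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def parse_vault_timestamp(value: str) -> datetime:
    """Parse a Vault RFC 3339 timestamp, dropping sub-second precision.

    Vault emits nanoseconds (e.g. 2018-03-22T02:24:06.945319214Z), which
    strptime's %f cannot parse, so only the whole-second part is kept.
    Assumes the timestamp is in UTC with a trailing "Z".
    """
    whole_seconds = value.rstrip("Z").split(".", 1)[0]
    parsed = datetime.strptime(whole_seconds, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def is_overdue(
    created_time: str,
    now: datetime,
    max_age: timedelta = timedelta(days=90),  # assumption: quarterly is about 90 days
) -> bool:
    """True when the current secret version is older than max_age."""
    return now - parse_vault_timestamp(created_time) > max_age
```

Run over the metadata of every static secret, this gives a list of what is due, which is most of the work of staying on schedule.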
Treating Vault like another optional component. It became a critical path; we didn't have HA for the first 6 months. One Vault restart took down deploys for an hour. Now Vault is treated with the same operational discipline as the cluster control plane.
Tokens with long TTLs to avoid renewals. Set TTLs to 24 hours to reduce Vault load; never bothered with renewals. When a pod's token expired, the pod broke. Now TTLs are short and renewals happen automatically via the sidecar.
One Vault role per service. Started with hundreds of fine-grained roles; became unmanageable. Refactored to fewer roles with more careful policies. ~30 roles across the whole cluster is what we ended up with.
No backup story. Vault has its own state (in Consul, integrated storage, or external). We didn't think about it until we needed to restore. Now Vault snapshots are part of the backup discipline.
Honest list:
Vault solves real problems and creates real operational responsibility. For teams that need it (regulated industries, large clusters, dynamic secrets) it's hard to beat. For teams that don't, simpler answers are usually right. The wrong choice in either direction creates work that doesn't pay back.