We've had to restore a Kubernetes cluster from backup twice. Once it worked. Once it took 14 hours. Here's the strategy we run now.
We've had to restore Kubernetes clusters from backup twice in production. Once was clean (~30 minutes from realization to recovery). Once was painful (~14 hours, lots of small mistakes compounding). The difference was in the strategy and the practice. This post is what we landed on after both experiences.
Kubernetes "backup" can mean several things, and they're all separate problems:
Each needs its own backup strategy. The mistake we made the second time was assuming "we have backups" without checking which of these were actually covered.
If you run GitOps (Argo CD, Flux), your cluster state is in Git. The Git repo is the backup.
This is the cleanest answer. To restore the cluster's logical state:
We did this in our first recovery. ~25 minutes from "the cluster is gone" to "the cluster is back," because the cluster state was just code we re-applied.
For teams without GitOps: use Velero (more on this below) to capture cluster state. Note that Velero backs up resources by querying the Kubernetes API, not by snapshotting etcd directly. But honestly, GitOps as a backup strategy is much better than Velero for state. Velero is for things GitOps doesn't cover.
Cluster state is the easy part. Application data — what's in your databases, what's on persistent volumes, what users care about — is the hard part.
Our setup per type:
Databases (RDS, Cloud SQL, etc.): managed service backups. Daily automated snapshots, point-in-time recovery up to 7 days, monthly snapshots retained for 1 year. Cross-region snapshot copies for prod. The cluster has no role here — backup is at the cloud DB layer.
Persistent volumes (EBS / GCE PD): Velero with cloud-snapshot integration. Velero creates EBS snapshots for any PV during a backup. A restore creates new EBS volumes from those snapshots and re-attaches them.
Object storage (S3 / GCS): bucket versioning + cross-region replication for important buckets. PV-attached object stores (in-cluster MinIO or Ceph) are backed up via Velero PV snapshots.
Stateful services in-cluster (e.g., Redis, in-cluster Postgres for non-critical use): Velero with hooks. The hook calls a service-specific backup command (e.g., pg_dump, redis-cli SAVE) before snapshot, ensuring the snapshot is consistent.
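As a rough illustration of the hook mechanism, the sketch below builds the pod annotations Velero reads for a pre-backup hook. The annotation keys are Velero's `pre.hook.backup.velero.io/*` keys; the container name, timeout, and error policy are assumptions for illustration, not our actual values.

```python
import json


def pre_backup_hook_annotations(container, command, timeout="60s", on_error="Fail"):
    """Build Velero pre-backup hook annotations for a pod template.

    Velero expects the command as a JSON array encoded in a string.
    """
    prefix = "pre.hook.backup.velero.io"
    return {
        f"{prefix}/container": container,
        f"{prefix}/command": json.dumps(command),
        f"{prefix}/timeout": timeout,
        f"{prefix}/on-error": on_error,
    }


# Hypothetical Redis pod: write the dataset to disk before the volume snapshot.
redis_hook = pre_backup_hook_annotations("redis", ["redis-cli", "SAVE"])
for key, value in redis_hook.items():
    print(f"{key}: {value}")
```

The resulting dictionary would go under the pod template's `metadata.annotations`. The hook runs inside the named container, so the backup command has to exist in that image.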
The trick is that "back up the cluster" doesn't mean one tool. It means database backup + cloud-volume snapshot + object-store versioning, all coordinated.
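One way to avoid the "we have backups" assumption is to write the coverage down as data and check it. A minimal sketch: the data kinds and mechanisms mirror the setup described above, while the mechanism labels and the inventory entries are hypothetical.

```python
# Mechanism expected for each kind of data, per the setup described above.
# The label strings are made up for this sketch.
EXPECTED_MECHANISM = {
    "managed-database": "cloud-db-snapshots",
    "persistent-volume": "velero-volume-snapshot",
    "object-storage": "bucket-versioning",
    "in-cluster-stateful": "velero-with-hooks",
}


def uncovered(inventory):
    """Return names of workloads whose backup mechanism is missing or unexpected."""
    gaps = []
    for item in inventory:
        expected = EXPECTED_MECHANISM.get(item["kind"])
        if expected is None or item.get("mechanism") != expected:
            gaps.append(item["name"])
    return gaps


# Hypothetical inventory; names are illustrative only.
inventory = [
    {"name": "orders-db", "kind": "managed-database", "mechanism": "cloud-db-snapshots"},
    {"name": "uploads", "kind": "object-storage", "mechanism": "bucket-versioning"},
    {"name": "session-cache", "kind": "in-cluster-stateful", "mechanism": None},
]

print(uncovered(inventory))  # ['session-cache']
```

Anything the check returns is a workload someone assumed was covered and was not.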
Velero is the open-source tool for Kubernetes backups. It backs up:
Our Velero configuration:
schedules:
  - name: daily-full
    schedule: "0 3 * * *"
    template:
      ttl: 720h # 30 days
      includedNamespaces:
        - "*"
      excludedNamespaces:
        - kube-system
        - kube-public
      snapshotVolumes: true
  - name: hourly-critical
    schedule: "0 * * * *"
    template:
      ttl: 168h # 7 days
      includedNamespaces:
        - production-checkout
        - production-payments
      snapshotVolumes: true
Two schedules: daily full backup, hourly backup of critical namespaces. Storage cost is real but manageable (~$80/month for our scale).
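The TTLs and cron intervals together determine roughly how many backups exist at any time. A small sketch of that arithmetic, assuming every scheduled run succeeds and expired backups are removed promptly (in practice garbage collection lags a little):

```python
def steady_state_backups(ttl_hours, interval_hours):
    """Approximate number of unexpired backups once the schedule has run longer than the TTL."""
    return ttl_hours // interval_hours


# "0 3 * * *" with ttl 720h: one backup per day, kept 30 days.
daily_full = steady_state_backups(ttl_hours=720, interval_hours=24)
# "0 * * * *" with ttl 168h: one backup per hour, kept 7 days.
hourly_critical = steady_state_backups(ttl_hours=168, interval_hours=1)

print(f"daily-full: {daily_full} backups retained")
print(f"hourly-critical: {hourly_critical} backups retained")
```

The hourly schedule holds far more backups than the daily one, but the count does not translate directly into storage cost, because the volume snapshots are incremental.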
Restore:
velero restore create --from-backup daily-full-20250410
It re-creates objects from the backup and re-creates PVs from snapshots.
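Choosing which backup to restore from is worth making mechanical. The sketch below picks the most recent backup that completed before the incident. The backup records are hypothetical stand-ins for what `velero backup get` reports; `Completed` and `PartiallyFailed` are real Velero backup phases, but the field names and backup names here are made up.

```python
from datetime import datetime, timezone


def latest_backup_before(backups, incident_time):
    """Pick the most recent fully completed backup that finished before the incident."""
    candidates = [
        b for b in backups
        if b["phase"] == "Completed" and b["completed_at"] < incident_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b["completed_at"])["name"]


utc = timezone.utc
# Hypothetical records; names and times are illustrative only.
backups = [
    {"name": "daily-full-a", "phase": "Completed",
     "completed_at": datetime(2025, 4, 9, 3, 20, tzinfo=utc)},
    {"name": "daily-full-b", "phase": "Completed",
     "completed_at": datetime(2025, 4, 10, 3, 20, tzinfo=utc)},
    {"name": "daily-full-c", "phase": "PartiallyFailed",
     "completed_at": datetime(2025, 4, 11, 3, 20, tzinfo=utc)},
]
incident = datetime(2025, 4, 11, 9, 0, tzinfo=utc)

print(latest_backup_before(backups, incident))  # daily-full-b
```

Skipping partially failed backups is a conservative choice for this sketch. In a real incident a partially failed backup may still be the best option for some namespaces.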
Things we've hit:
PV snapshots take time. A 200 GB EBS volume snapshot starts fast (incremental), but the full set of snapshots for a backup can take 10-30 minutes. The cluster keeps running during this, so the snapshots are crash-consistent: each volume is captured at a point in time, as if the power had been cut, and the volumes in one backup are not all captured at the same instant.
Cross-namespace dependencies. A backup of namespace X assumes resources in namespace Y exist. If you restore X without Y, things break. We back up all namespaces and restore in dependency order.
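Dependency order can be derived rather than remembered. A minimal sketch using Python's standard-library `graphlib`. The dependency map is hypothetical: `shared-config` is an invented namespace, and the dependency between the checkout and payments namespaces is assumed for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each namespace lists the namespaces it needs first.
depends_on = {
    "production-checkout": {"production-payments", "shared-config"},
    "production-payments": {"shared-config"},
    "shared-config": set(),
}

# static_order() yields dependencies before the namespaces that need them,
# and raises graphlib.CycleError if the map contains a cycle.
restore_order = list(TopologicalSorter(depends_on).static_order())
print(restore_order)
```

Keeping a map like this in the repo next to the runbook means the restore order is reviewed in pull requests instead of being worked out during an incident.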
CRDs need to be present before restore. If a custom resource type isn't installed, restore can't create instances of it. We restore CRDs first, then operators, then applications.
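The CRDs, then operators, then applications sequence can be written as three separate restores from the same backup. The sketch below only builds and prints the `velero restore create` commands; it does not run them. The `operators` namespace name is hypothetical, and the split into exactly three phases is a simplification.

```python
import shlex


def phased_restore_commands(backup, operator_namespaces, app_namespaces):
    """Return velero CLI invocations for a CRDs -> operators -> applications restore."""
    base = ["velero", "restore", "create", "--from-backup", backup, "--wait"]
    return [
        # Phase 1: custom resource definitions only (cluster-scoped).
        base + ["--include-cluster-resources=true",
                "--include-resources", "customresourcedefinitions.apiextensions.k8s.io"],
        # Phase 2: namespaces that hold the operators.
        base + ["--include-namespaces", ",".join(operator_namespaces)],
        # Phase 3: application namespaces, already in dependency order.
        base + ["--include-namespaces", ",".join(app_namespaces)],
    ]


commands = phased_restore_commands(
    backup="daily-full-20250410",
    operator_namespaces=["operators"],  # hypothetical namespace name
    app_namespaces=["production-payments", "production-checkout"],
)
for cmd in commands:
    print(shlex.join(cmd))
```

`--wait` makes each restore block until it finishes, so the phases cannot overlap. Cluster-scoped resources that operators depend on (ClusterRoles, webhook configurations) are not covered by the namespace filter and would need their own handling.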
Secrets restore as-is. If your secrets contain encrypted values (sealed-secrets, External Secrets), the encrypted form is restored — but the decryption key has to be available too. We back up sealed-secrets keys separately to a secure vault.
Restic is slow. For volumes that don't have native cloud snapshots, Velero can use file-system backup instead (Restic, or Kopia in newer releases), which copies data file by file. It's slow on large volumes. We avoid in-cluster persistent storage for things big enough that this matters.
The 14-hour recovery happened when:
Each issue added 30-60 minutes of debugging. None individually was a disaster; together they made for a long day.
Lessons baked into the playbook:
Every quarter, we do a DR drill:
Each drill finds something. Recent finds:
Without the drill, those issues would surface during a real incident, where we couldn't afford the friction.
A few things we deliberately don't back up:
Pods themselves. They're ephemeral; controllers re-create them. We back up the controllers (Deployments, etc.); the pods regenerate.
Cluster-management resources. Karpenter NodePools, Argo CD Applications. These are in the GitOps repo; restoring the GitOps state restores them.
Logs and metrics. They go to off-cluster storage (Datadog, S3). The cluster being gone doesn't affect them.
Container images. They're in ECR / GCR with their own retention. Same story — off-cluster.
This list matters because it bounds what Velero needs to capture. Less data = faster backups, smaller storage, faster restores.
Backups contain everything sensitive. Treat them with the same controls as production:
The cross-account piece is non-negotiable. A common ransomware pattern: attackers compromise an account, then delete backups before triggering the actual attack. Cross-account isolation prevents this.
Our backup setup:
Total: ~$350/month for prod cluster backups. Cheap relative to the cost of an unrecoverable failure.
GitOps is your cluster-state backup. If you're not on GitOps, that's the highest-leverage move you can make for both deployment hygiene and recoverability.
Velero is the workhorse for the rest. Set it up, but understand it's not magic — restore order, CRD timing, and dependency management need attention.
Database backups are at the database layer. Don't try to back up databases via Velero. Use the cloud DB's native backup features.
Cross-account isolation for backups. Backups in the same account as production are vulnerable to account-level compromise.
Quarterly drill, not just "we have backups." Every team I've talked to that practices drills has a much faster MTTR than teams that don't.
Document restore order. When you're in a real incident, you don't want to be working out the order from scratch.
Backup is one of those investments where you don't see the return until you need it. By that time it's too late to start. The teams that have practiced regularly recover in hours; the teams that haven't recover in days. The difference is one of the largest "cost of unprepared" gaps in production engineering.