We've had to restore a Kubernetes cluster from backup twice. Once it worked. Once it took 14 hours. Here's the strategy we run now.

Kubernetes Backup Strategies for Real Recovery

We've had to restore Kubernetes clusters from backup twice in production. Once was clean (~30 minutes from realization to recovery). Once was painful (~14 hours, lots of small mistakes compounding). The difference was in the strategy and the practice. This post is what we landed on after both experiences.

What you're actually backing up #

Kubernetes "backup" can mean several things, and they're all separate problems:

Cluster state: the desired-state in etcd (Deployments, Services, ConfigMaps, etc.)
Application data: what's in your databases, persistent volumes, object storage
Cluster configuration: cluster-level settings, addons, RBAC
Secrets: credentials, certificates, API keys

Each needs its own backup strategy. The mistake we made the second time was assuming "we have backups" without checking which of these were actually covered.
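One way to make that check explicit is a small coverage table that flags any category with nothing behind it. This is a minimal sketch: the category keys mirror the list above, while the `coverage` entries are illustrative placeholders rather than our actual inventory.

```python
# The four backup categories from the list above.
CATEGORIES = ("cluster_state", "application_data", "cluster_configuration", "secrets")


def uncovered(coverage: dict[str, list[str]]) -> list[str]:
    """Return the categories that have no backup mechanism recorded."""
    return [c for c in CATEGORIES if not coverage.get(c)]


# Hypothetical inventory: nobody ever assigned a mechanism to secrets.
coverage = {
    "cluster_state": ["GitOps repo"],
    "application_data": ["managed DB snapshots", "Velero PV snapshots"],
    "cluster_configuration": ["Terraform"],
    "secrets": [],
}

print(uncovered(coverage))
```

An empty result is the only state in which "we have backups" is a complete sentence.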

Cluster state: GitOps is the backup #

If you run GitOps (Argo CD, Flux), your cluster state is in Git. The Git repo is the backup.

This is the cleanest answer. To restore the cluster's logical state:

Bootstrap a new cluster with your standard provisioning (Terraform).
Install Argo CD.
Point it at your GitOps repo.
Argo applies all the Application CRs and the cluster reconciles to the desired state.

We did this in our first recovery. ~25 minutes from "the cluster is gone" to "the cluster is back," because the cluster state was just code we re-applied.
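The four steps above can be sketched as an ordered dry-run plan. The namespace, the upstream install manifest URL, and the `bootstrap/root-app.yaml` path are assumptions for illustration; your provisioning and "app of apps" layout will differ.

```python
import shlex

# Assumed names: adjust to your own repo layout and Argo CD install method.
ARGOCD_NAMESPACE = "argocd"
ARGOCD_MANIFEST = (
    "https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml"
)
ROOT_APP_FILE = "bootstrap/root-app.yaml"  # hypothetical root Application manifest


def bootstrap_plan() -> list[list[str]]:
    """Return the restore steps, in order, as argv lists."""
    return [
        ["terraform", "apply"],                                   # new cluster
        ["kubectl", "create", "namespace", ARGOCD_NAMESPACE],
        ["kubectl", "apply", "-n", ARGOCD_NAMESPACE, "-f", ARGOCD_MANIFEST],
        ["kubectl", "apply", "-n", ARGOCD_NAMESPACE, "-f", ROOT_APP_FILE],
    ]


for step in bootstrap_plan():
    print(shlex.join(step))  # dry run: print the commands instead of executing them
```

Keeping the plan as data makes it easy to print during a drill and to diff when the bootstrap changes.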

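For teams without GitOps: use Velero (more on this below) to capture cluster state. Velero exports objects through the Kubernetes API rather than snapshotting etcd itself, which is enough to re-create them. But honestly, GitOps as a backup strategy is much better than Velero for state. Velero is for things GitOps doesn't cover.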

Application data: where the real work is #

Cluster state is the easy part. Application data — what's in your databases, what's on persistent volumes, what users care about — is the hard part.

Our setup per type:

Databases (RDS, Cloud SQL, etc.): managed service backups. Daily automated snapshots, point-in-time recovery up to 7 days, monthly snapshots retained for 1 year. Cross-region snapshot copies for prod. The cluster has no role here — backup is at the cloud DB layer.

Persistent volumes (EBS / GCE PD): Velero with cloud-snapshot integration. During a backup, Velero creates an EBS snapshot for each PV in scope. On restore, it creates new EBS volumes from those snapshots and re-attaches them.

Object storage (S3 / GCS): bucket versioning + cross-region replication for important buckets. PV-attached object stores (in-cluster MinIO or Ceph) are backed up via Velero PV snapshots.

Stateful services in-cluster (e.g., Redis, in-cluster Postgres for non-critical use): Velero with hooks. The hook calls a service-specific backup command (e.g., pg_dump, redis-cli SAVE) before snapshot, ensuring the snapshot is consistent.
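Velero reads backup hooks from pod annotations, with the command encoded as a JSON array inside a string. Here is a minimal sketch of generating those annotations; the container name and the 3-minute timeout are assumptions, not values from our setup.

```python
import json


def pre_backup_hook(container: str, command: list[str], timeout: str = "3m") -> dict[str, str]:
    """Build pod annotations for a Velero pre-backup hook.

    Velero expects the command as a JSON array encoded in a string.
    """
    return {
        "pre.hook.backup.velero.io/container": container,
        "pre.hook.backup.velero.io/command": json.dumps(command),
        "pre.hook.backup.velero.io/timeout": timeout,
    }


# Hypothetical Redis pod: write the dataset to disk before the volume snapshot.
annotations = pre_backup_hook("redis", ["redis-cli", "SAVE"])
for key, value in annotations.items():
    print(f"{key}: {value}")
```

The annotations go on the pod template, so every replica that Velero backs up runs the hook in the named container.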

The trick is that "back up the cluster" doesn't mean one tool. It means database backup + cloud-volume snapshot + object-store versioning, all coordinated.

Velero: the workhorse for cluster-side backup #

Velero is the open-source tool for Kubernetes backups. It backs up:

All cluster objects (or a filtered subset)
PVs, via cloud-provider snapshots
Restic-based PV backup for non-cloud volumes

Our Velero configuration:

```yaml
schedules:
  - name: daily-full
    schedule: "0 3 * * *"
    template:
      ttl: 720h  # 30 days
      includedNamespaces:
        - "*"
      excludedNamespaces:
        - kube-system
        - kube-public
      snapshotVolumes: true
  - name: hourly-critical
    schedule: "0 * * * *"
    template:
      ttl: 168h  # 7 days
      includedNamespaces:
        - production-checkout
        - production-payments
      snapshotVolumes: true
```

Two schedules: daily full backup, hourly backup of critical namespaces. Storage cost is real but manageable (~$80/month for our scale).
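The TTL and the cron interval together determine how many backups exist at any one time, which is what drives the storage bill. A rough worked example follows; the exact count can be off by one or so, because expired backups are removed by a periodic garbage-collection pass rather than instantly.

```python
def retained_backups(ttl_hours: int, interval_hours: int) -> int:
    """Backups that exist at steady state: one per interval until the TTL expires."""
    return ttl_hours // interval_hours


daily_full = retained_backups(ttl_hours=720, interval_hours=24)
hourly_critical = retained_backups(ttl_hours=168, interval_hours=1)

print(daily_full, hourly_critical)
```

So the hourly schedule keeps far more restore points than the daily one, even though its window is much shorter.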

Restore:

```bash
velero restore create --from-backup daily-full-20250410
```

It re-creates objects from the backup and re-creates PVs from snapshots.

What goes wrong with Velero #

Things we've hit:

PV snapshots take time. A 200GB EBS volume snapshot starts fast (snapshots are incremental), but the full set of snapshots for a backup can take 10-30 minutes. The cluster keeps running during this, so each snapshot is crash-consistent for its own volume, but the set is not an atomic point in time across volumes.

Cross-namespace dependencies. A backup of namespace X assumes resources in namespace Y exist. If you restore X without Y, things break. We back up all namespaces and restore in dependency order.

CRDs need to be present before restore. If a custom resource type isn't installed, restore can't create instances of it. We restore CRDs first, then operators, then applications.

Secrets restore as-is. If your secrets contain encrypted values (sealed-secrets, External Secrets), the encrypted form is restored — but the decryption key has to be available too. We back up sealed-secrets keys separately to a secure vault.

Restic is slow. For volumes that don't have native cloud snapshots, Velero falls back to Restic-based backup (file-by-file). It's slow on large volumes. We avoid in-cluster persistent storage for things big enough that this matters.
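For the cross-namespace dependency problem, "restore in dependency order" is a topological sort. The sketch below uses Python's standard `graphlib`; the dependency map is hypothetical (`shared-config` is an invented namespace), and in practice you have to maintain that map by hand because Kubernetes does not record it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: namespace -> namespaces it depends on.
DEPENDS_ON = {
    "production-checkout": {"production-payments", "shared-config"},
    "production-payments": {"shared-config"},
    "shared-config": set(),
}


def restore_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order namespaces so each one is restored after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = restore_order(DEPENDS_ON)
print(order)
```

A circular dependency raises `graphlib.CycleError`, which is a useful thing to learn before an incident rather than during one.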

Our second recovery: the painful one #

The 14-hour recovery happened when:

A Terraform mistake destroyed the EKS cluster
We ran Velero restore on a fresh cluster
Several issues compounded:
- Velero's restore of CRDs hit timing issues; some CRs were created before their CRDs existed
- One PV snapshot was from before a schema migration; the app didn't start
- Sealed-secrets controller hadn't restored its decryption key first; secrets were unreadable
- DNS records pointing at the cluster needed manual recreation
- Some workloads' deployments failed because the new cluster's node pool was sized differently

Each issue added 30-60 minutes of debugging. None individually was a disaster; together they made for a long day.

Lessons baked into the playbook:

Restore order matters (CRDs → operators → secrets → workloads)
Test restore quarterly, not just "we have backups"
Document the restore steps; don't rely on "we'll figure it out"
Keep a non-cluster backup of crucial cluster-control resources (sealed-secrets keys, vault root tokens, root CA)
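The restore order in the first lesson can be sketched as a series of filtered `velero restore create` runs against the same backup. This is an illustration, not our exact script: the operator namespaces are hypothetical, and it relies on Velero's default behavior of skipping resources that already exist, so the final pass fills in what earlier passes left out.

```python
import shlex

CRD_RESOURCE = "customresourcedefinitions.apiextensions.k8s.io"


def phased_restore(backup: str, operator_namespaces: list[str]) -> list[list[str]]:
    """Build four restore commands: CRDs, operators, secrets, then workloads."""
    base = ["velero", "restore", "create", "--from-backup", backup, "--wait"]
    operators = ",".join(operator_namespaces)
    return [
        base + ["--include-resources", CRD_RESOURCE],
        base + ["--include-namespaces", operators],
        base + ["--exclude-namespaces", operators, "--include-resources", "secrets"],
        base + ["--exclude-namespaces", operators],
    ]


# Hypothetical operator namespaces; the backup name is the one used earlier.
for cmd in phased_restore("daily-full-20250410", ["sealed-secrets", "cert-manager"]):
    print(shlex.join(cmd))
```

Each phase waits for the previous one to finish (`--wait`), which is the part that was missing when custom resources raced ahead of their CRDs.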

Quarterly disaster recovery drills #

Every quarter, we do a DR drill:

Pick a week. Tell the team a drill is happening that week.
On a chosen day, randomly pick a non-prod cluster.
Pretend it's gone. Restore from backup to a fresh cluster.
Time how long it takes; record what went wrong.
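The random pick and the timing are easy to script so that nobody chooses the "easy" cluster. A minimal sketch, assuming cluster names carry a `prod` prefix for production (that naming convention and the cluster names are invented for the example):

```python
import random
import time


def pick_drill_cluster(clusters: list[str], rng: random.Random) -> str:
    """Pick one non-prod cluster at random so nobody can prepare a specific one."""
    candidates = [c for c in clusters if not c.startswith("prod")]
    if not candidates:
        raise ValueError("no non-prod clusters to drill against")
    return rng.choice(candidates)


def timed(step):
    """Run a restore step and return (result, elapsed seconds)."""
    start = time.monotonic()
    result = step()
    return result, time.monotonic() - start


# Hypothetical cluster names.
clusters = ["prod-us", "staging-us", "dev-eu"]
target = pick_drill_cluster(clusters, random.Random())
_, elapsed = timed(lambda: f"restore {target}")  # stand-in for the real restore
print(target, f"{elapsed:.3f}s")
```

Wrapping each restore step in `timed` gives you a per-step breakdown, which is more useful than a single total when you review what went wrong.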

Each drill finds something. Recent finds:

A new CRD didn't have its operator listed in the restore-order doc
A team had stopped backing up an important volume after refactoring (no one noticed for 3 months)
The restore script had a bug in the new-cluster bootstrap that nobody hit because nobody used it

Without the drill, those issues would surface during a real incident, where we couldn't afford the friction.
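The "stopped backing up a volume" case can also be caught between drills by diffing what exists against what the latest backup contains. A sketch under assumptions: the two sets below are hand-written stand-ins, where in practice they might be built from `kubectl get pvc -A` and `velero backup describe <name> --details`, and the claim names are invented.

```python
def missing_from_backup(cluster_pvcs: set[str], backed_up_pvcs: set[str]) -> list[str]:
    """PVCs that exist in the cluster but are absent from the latest backup."""
    return sorted(cluster_pvcs - backed_up_pvcs)


# Hypothetical "namespace/claim" names.
in_cluster = {"production-payments/ledger-data", "production-checkout/cart-cache"}
in_backup = {"production-checkout/cart-cache"}

gaps = missing_from_backup(in_cluster, in_backup)
print(gaps)
```

Anything in `gaps` is either an intentional exclusion that should be written down or a regression that should page someone.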

What's NOT in our backup strategy #

A few things we deliberately don't back up:

Pods themselves. They're ephemeral; controllers re-create them. We back up the controllers (Deployments, etc.); the pods regenerate.

Cluster-management resources. Karpenter NodePools, Argo CD Applications. These are in the GitOps repo; restoring the GitOps state restores them.

Logs and metrics. They go to off-cluster storage (Datadog, S3). The cluster being gone doesn't affect them.

Container images. They're in ECR / GCR with their own retention. Same story — off-cluster.

This list matters because it bounds what Velero needs to capture. Less data = faster backups, smaller storage, faster restores.

Backup security #

Backups contain everything sensitive. Treat them with the same controls as production:

Encrypted at rest (S3 with KMS keys we own)
Restricted access (only the platform team can read backup buckets)
Cross-account isolation (backups go to a separate AWS account; even compromise of the source account doesn't compromise backups)
Immutable retention for the most important snapshots (S3 Object Lock for critical recovery points)

The cross-account piece is non-negotiable. A common ransomware pattern: attackers compromise an account, then delete backups before triggering the actual attack. Cross-account isolation prevents this.
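For the immutable-retention item, S3 Object Lock takes a small JSON document describing the default retention. The sketch below builds that document in the shape accepted by `aws s3api put-object-lock-configuration`; the 30-day value and the choice of compliance mode are assumptions for illustration, not our settings.

```python
import json


def object_lock_config(days: int, mode: str = "COMPLIANCE") -> dict:
    """Build a default-retention Object Lock configuration for an S3 bucket."""
    if mode not in ("COMPLIANCE", "GOVERNANCE"):
        raise ValueError(f"unknown Object Lock mode: {mode}")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": days}},
    }


# 30 days is an assumed value chosen to match the daily schedule's TTL.
print(json.dumps(object_lock_config(30), indent=2))
```

Keep the retention no longer than the backup TTL for that bucket; otherwise expired backups cannot be cleaned up until the lock lapses.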

Cost #

Our backup setup:

Velero compute: minimal (~$30/month per cluster)
EBS snapshots: ~$200/month for our prod cluster's volumes
S3 storage for Velero objects: ~$40/month
RDS automated backups: included in RDS price
RDS manual snapshots / cross-region: ~$80/month

Total: ~$350/month for prod cluster backups. Cheap relative to the cost of an unrecoverable failure.
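The total is just the sum of the line items above; writing it out makes it easy to re-check when one of them changes. The figures are the approximate monthly numbers from the list, in USD.

```python
# Monthly figures from the list above, in USD. RDS automated backups are
# included in the RDS price, so they contribute 0 here.
MONTHLY_COSTS = {
    "velero_compute": 30,
    "ebs_snapshots": 200,
    "s3_velero_objects": 40,
    "rds_automated_backups": 0,
    "rds_manual_and_cross_region": 80,
}

total = sum(MONTHLY_COSTS.values())
print(f"${total}/month, ${total * 12}/year")
```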

What I'd tell a team starting #

GitOps is your cluster-state backup. If you're not on GitOps, that's the highest-leverage move you can make for both deployment hygiene and recoverability.

Velero is the workhorse for the rest. Set it up, but understand it's not magic — restore order, CRD timing, and dependency management need attention.

Database backups are at the database layer. Don't try to back up databases via Velero. Use the cloud DB's native backup features.

Cross-account isolation for backups. Backups in the same account as production are vulnerable to account-level compromise.

Quarterly drill, not just "we have backups." Every team I've talked to that practices drills has a much faster MTTR than teams that don't.

Document restore order. When you're in a real incident, you don't want to be working out the order from scratch.

Backup is one of those investments where you don't see the return until you need it, and by then it's too late to start. The teams that have practiced regularly recover in hours; the teams that haven't recover in days. That gap is one of the largest costs of being unprepared in production engineering.

About Kiril Urbonas