We migrated 40+ services to GitOps with Argo CD. Two years in, here's what works and what required workarounds.

GitOps with Argo CD: Automating Kubernetes Deployments

Two years ago we were deploying to Kubernetes with kubectl apply from CI. It mostly worked. It also had drift problems, audit gaps, and a habit of putting clusters in inconsistent states when CI jobs got killed mid-deploy. We migrated to Argo CD. The migration took six months; the system has been stable since. This post is what we'd tell someone considering the same move.

Why we moved #

The specific complaints that drove the decision:

Drift between Git and cluster. Engineers occasionally kubectl edit'd resources during incidents. The next CI deploy reverted their changes — sometimes during the next incident. We needed Git to be the source of truth, enforced.
No audit trail. "Who changed this deployment last and why?" required cross-referencing CI logs, kubectl audit logs, and Git history. None of them alone was enough.
Partial deploy failures. A kubectl apply of 30 resources where #15 fails leaves the cluster half-changed. Cleanup was manual.
No rollback. Rolling back required running an old version of CI, which sometimes didn't work because the CI tooling had moved on.

GitOps via Argo CD addresses all four: Argo CD continuously compares cluster state against Git and flags (or reverts) drift, it records every sync, each sync ends with an explicit result and health status rather than a silent half-applied state, and rollback is git revert + sync.

Architecture #

We run a single Argo CD instance per cluster (we have 3 prod clusters, 1 staging, 1 dev). Each Argo CD watches a single Git repo branch.

The repo layout:

clusters/
  prod-us-east/
    apps/
      payments/
        deployment.yaml
        service.yaml
        kustomization.yaml
      checkout/
        ...
    infrastructure/
      cert-manager/
      external-dns/
      prometheus/
  prod-eu-west/
    ...

Argo CD applications are defined as Application CRs that point to a path in the repo. We use the App-of-Apps pattern: a top-level "root" app contains Application CRs for everything else, so adding a new service is one PR that adds one file.
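A root Application in this pattern might look roughly like the sketch below. The name root, the argocd namespace, and the clusters/prod-us-east/root path are illustrative assumptions rather than our exact values; the repo URL is the same placeholder used in the ApplicationSet example later in this post.

```yaml
# Sketch of an App-of-Apps root Application (names and paths are illustrative).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd                    # assumes Argo CD is installed in "argocd"
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops
    targetRevision: main
    path: clusters/prod-us-east/root   # hypothetical directory of child Application CRs
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd                  # child Application CRs live in Argo CD's namespace
```

Syncing the root app creates or updates every Application CR found under that path, and each child Application then points at one service's directory.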

What runs in Argo, what doesn't #

In Argo CD: every Kubernetes resource (deployments, services, ingresses, configmaps, secrets via External Secrets, RBAC, network policies, monitoring config).

Not in Argo CD: cluster bootstrap (Karpenter, Argo CD itself, cluster-essential add-ons that need to exist before Argo can run). These are managed by Terraform.

The split: Terraform creates the cluster + Argo CD. Argo CD then manages everything inside the cluster. There's a small bootstrap dance but it only matters once per cluster.

How a deploy works now #

The flow:

Engineer makes a change in the app code, opens a PR.
CI builds the image, pushes to ECR with a SHA tag.
CI updates the deployment manifest in the GitOps repo (separate repo from app code) with the new image tag, opens a PR there.
The GitOps PR auto-merges if tests passed (we have a separate review path for infra changes).
Argo CD detects the change in the GitOps repo within ~3 minutes.
Argo CD applies the change to the cluster.
Argo CD reports the app as Healthy once the rollout completes and the new pods pass their readiness probes.

End-to-end: ~5-7 minutes for typical deploys. That is roughly twice as slow as kubectl-from-CI (which was 2-3 min). The trade-off has been worth it.
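The manifest update that CI makes can be as small as one field. One way to do it, sketched here as an assumption since the layout above only shows that each app has a kustomization.yaml, is a Kustomize image override. The image name, registry URL, and tag are placeholders.

```yaml
# clusters/prod-us-east/apps/payments/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
images:
  - name: payments        # image name as written in deployment.yaml (placeholder)
    newName: 123456789012.dkr.ecr.us-east-1.amazonaws.com/payments  # placeholder ECR repo
    newTag: 3f9c2ab       # placeholder SHA tag written by CI
```

With this shape, the CI-generated PR is a one-line diff on newTag (for example via kustomize edit set image), which keeps the GitOps PR easy to review and easy to revert.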

The "sync" semantics #

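Argo CD tracks two separate statuses for an Application: sync status and health status. The values we see most often:

OutOfSync (sync status): cluster state differs from Git
Synced (sync status): cluster matches Git
Healthy (health status): all resources report ready

Health can also be Progressing or Degraded, and the two statuses are independent of each other.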
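In the Application resource, sync and health are reported as separate fields under status, which is why an app can be OutOfSync and Healthy at the same time (Git has moved ahead, but what is running is fine). A trimmed illustration of that case:

```yaml
# Trimmed Application status: sync and health are reported independently.
status:
  sync:
    status: OutOfSync   # Git and the cluster differ
  health:
    status: Healthy     # the resources currently running report ready
```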

Auto-sync vs manual sync is a per-app setting. Our rules:

Production apps: manual sync. The PR auto-merges in Git, but a human (or our automated promoter) clicks sync. This is our "deploy gate" — gives us a chance to look at what's about to change.
Staging apps: auto-sync. PR merges → Argo applies. No human in the loop.
Infrastructure: manual sync, which also means no self-heal (selfHeal only takes effect under automated sync). Infra changes need human approval.

The "manual sync" for production turned out to be more useful than we expected. About once a month, an engineer notices something off in the diff before clicking sync (an unintended config change, a stale image tag) and we avoid an issue.
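In the Application spec, the difference between the two modes comes down to whether syncPolicy contains an automated block. A minimal sketch of both, with the rest of the spec omitted and the sync options chosen only for illustration:

```yaml
# Staging: auto-sync. Merges to the GitOps repo are applied without a human.
syncPolicy:
  automated: {}        # prune/selfHeal options omitted in this sketch
---
# Production: manual sync. There is no "automated" block, so Argo CD marks the
# app OutOfSync and waits for a person (or a promoter) to trigger the sync.
syncPolicy:
  syncOptions:
    - CreateNamespace=true
```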

Self-heal: enabled but bounded #

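selfHeal: true means Argo will revert any cluster-side change that diverges from Git. It only takes effect when automated sync is enabled, so in practice it covers our auto-synced apps; manually synced apps show drift as OutOfSync until someone syncs.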

syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - CreateNamespace=true
    - ServerSideApply=true

prune: true means resources removed from Git get deleted from the cluster. selfHeal: true means manual edits get reverted. Both are necessary for "Git is the source of truth."

The exception: pods. Argo doesn't try to manage pods directly; it manages their controllers (deployments, statefulsets). So a kubectl-deleted pod gets recreated by the deployment controller, which is fine.

Secrets: External Secrets Operator #

We don't store secret values in Git. We use External Secrets Operator (ESO) with AWS Secrets Manager:

Secret values live in AWS Secrets Manager
An ExternalSecret resource (in Git) references the Secrets Manager path
ESO syncs the value into a Kubernetes Secret in the cluster
Apps use the Kubernetes Secret normally

This way, the GitOps repo has zero secret material. Anyone with read access to the repo can see the structure but not the values.

Rotation: when we rotate a secret in AWS, ESO picks it up on its next refresh (within ~1 minute for us) and updates the Kubernetes Secret. The app pods don't auto-reload, so we use stakater/Reloader to detect Secret changes and roll the deployment.
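For readers who haven't used ESO, here is a sketch of what lives in Git. Every name, the store reference, and the Secrets Manager key are hypothetical, and the API version depends on the ESO release you run. The second document is only the metadata fragment that opts a Deployment in to Reloader.

```yaml
# Sketch of an ExternalSecret kept in Git: it holds a reference, never the value.
apiVersion: external-secrets.io/v1beta1   # newer ESO releases also serve v1
kind: ExternalSecret
metadata:
  name: payments-db                # hypothetical name
  namespace: payments
spec:
  refreshInterval: 1m              # how often ESO re-reads Secrets Manager
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # hypothetical store, configured separately
  target:
    name: payments-db              # Kubernetes Secret that ESO creates and updates
  data:
    - secretKey: password          # key inside the Kubernetes Secret
      remoteRef:
        key: prod/payments/db      # hypothetical Secrets Manager secret name
        property: password         # JSON field within that secret
---
# Deployment metadata fragment: lets Reloader restart the workload on Secret changes.
metadata:
  annotations:
    reloader.stakater.com/auto: "true"
```

The refreshInterval is what bounds how quickly a rotation reaches the cluster; a shorter interval means more Secrets Manager API calls.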

What broke during migration #

A few specific issues we hit:

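Resources Argo doesn't know how to track. Some operators create resources Argo doesn't recognize as "owned" by an Application, so they show as orphaned or as unexpected diffs. We added ignore rules for these specific resource types (Argo's ignoreDifferences covers field-level diffs on tracked resources; orphaned-resource warnings are silenced through the AppProject's orphanedResources ignore list). About 8 such cases across the cluster.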

Race conditions between operators. cert-manager creates secrets; ESO tries to manage them; Argo wants Git to own them. The deconfliction took a couple of iterations. Final answer: cert-manager-created secrets are excluded from Argo's view via annotations.

Helm chart drift. When we upgraded a Helm chart's version, the chart sometimes added new resources we didn't know about. Argo flagged them as out-of-sync, but our diff-review tooling missed them. We now run a "what new resources will this chart create" check in CI.

ServerSideApply caveats. Switching from client-side to server-side apply is the modern recommendation but it changed how some fields are handled (managed fields, defaulting). Two services had unintended diffs after the switch. Took half a day to track down.

ApplicationSets: managing many similar apps #

We have ~40 services. Without ApplicationSets, that's 40 hand-written Application CRs. With ApplicationSets, it's one CR that templates them:

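apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: prod-us-east-apps   # illustrative name
  namespace: argocd         # assumes Argo CD runs in the argocd namespace
spec: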
  generators:
    - git:
        repoURL: https://github.com/company/gitops
        revision: main
        directories:
          - path: clusters/prod-us-east/apps/*
  template:
    metadata:
      name: '{{path.basename}}-prod-us-east'
    spec:
      project: default
      source:
        repoURL: https://github.com/company/gitops
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'

Adding a new service: drop a directory in clusters/prod-us-east/apps/. The ApplicationSet generates the Application CR automatically.

We had to retrofit this; we wish we'd started with it.

Multi-cluster: one Argo per cluster vs hub-and-spoke #

Across 5 clusters, we considered:

One Argo per cluster (in-cluster): simpler, no cross-cluster auth issues, but 5 Argo instances to manage.
One central Argo, all clusters as targets (hub-and-spoke): one place to look, but cross-cluster auth, network exposure, blast radius concerns.

We went with one-Argo-per-cluster. The operational simplicity won out, and we use the Argo CD UI per cluster rather than centrally. The extra ops burden of 5 Argo instances has been negligible.

Notifications #

Argo's argocd-notifications controller sends Slack messages on sync events. Our subscriptions:

App enters degraded → ping the team channel
App sync fails → ping the team channel
App promoted to prod (sync) → posted in the deploys channel for visibility
Infrastructure app changes → posted in the platform channel

We tuned the noise carefully — too many notifications get tuned out. The current set is read by humans regularly.
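Subscriptions are plain annotations on each Application, so they live in Git alongside everything else. The sketch below assumes a Slack service named slack is configured in the notifications ConfigMap and that the triggers come from the standard Argo CD notifications catalog; the channel names are placeholders.

```yaml
# Application metadata fragment (channel names are placeholders).
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-health-degraded.slack: team-payments
    notifications.argoproj.io/subscribe.on-sync-failed.slack: team-payments
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: deploys
```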

What we still find hard #

Drift on ConfigMaps. Some operators (like kube-prometheus-stack) generate ConfigMaps that get updated by their own reconciliation, conflicting with Argo's view. We mostly suppress these with ignoreDifferences but it's a long tail of small fights.
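A typical suppression looks like the fragment below. The ConfigMap name is hypothetical, and ignoring all of /data is deliberately broad for illustration; in practice a narrower pointer is safer.

```yaml
# Application spec fragment: ignore operator-managed ConfigMap data (sketch).
spec:
  ignoreDifferences:
    - group: ""                        # core API group
      kind: ConfigMap
      name: example-generated-config   # hypothetical ConfigMap name
      jsonPointers:
        - /data
```

Note that ignoreDifferences only affects the diff. If a sync should also leave those fields alone, the RespectIgnoreDifferences=true sync option is worth looking at.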

Tracking what's actually deployed. Argo shows the synced commit, but tracing from "what's running in prod" back to "what app PR introduced this" is two hops (image tag → image build → app PR). We have a small tool that does this lookup but it's brittle.

Disaster recovery. If Argo CD itself dies, the cluster keeps running but no new deployments happen. We have runbooks for re-bootstrapping Argo from Terraform; we've practiced this. It works but it's not fast (~30 min to recover).

Secrets rotation visibility. ESO syncs secrets, Reloader rolls deployments — all this happens outside the GitOps loop. The only visibility is in the operators' own logs. We've discussed surfacing this in Argo somehow; haven't done it yet.

What I'd tell a team considering it #

Use Argo CD if you have more than ~5 services on Kubernetes. Below that, kubectl-from-CI is fine. Above, the drift and audit problems become real.

Separate the GitOps repo from the app code repo. Two repos, different review processes. App teams own their app's manifest paths but the repo as a whole has stricter controls.

Manual sync for production. Auto-sync is tempting but the "human notices the diff before deploy" benefit is real and the latency cost is small.

ApplicationSets from day one. Don't write hand-curated Application CRs; you'll regret it at scale.

Plan the secrets story before the migration. ESO + AWS Secrets Manager (or equivalents) is the standard answer. Decide on this before you start migrating apps.

Don't put bootstrap concerns in Argo. Cluster-essential things (CNI, Karpenter, Argo itself) live in Terraform. Argo manages the layer above.

GitOps with Argo CD has been a clear win for us. The migration was painful but the operational mode it enables — Git as source of truth, audit trail by default, rollback via revert — is what we'd build again. The rough edges are real but small compared to what we replaced.

About Kiril Urbonas