We migrated 40+ services to GitOps with Argo CD. Two years in, here's what works and what required workarounds.
Two years ago we were deploying to Kubernetes with kubectl apply from CI. It mostly worked. It also had drift problems, audit gaps, and a habit of putting clusters in inconsistent states when CI jobs got killed mid-deploy. We migrated to Argo CD. The migration took six months; the system has been stable since. This post is what we'd tell someone considering the same move.
The specific complaints that drove the decision:
Engineers kubectl edit'd resources during incidents. The next CI deploy reverted their changes, sometimes during the next incident. We needed Git to be the source of truth, enforced.
A kubectl apply of 30 resources where #15 fails leaves the cluster half-changed. Cleanup was manual.
GitOps via Argo CD addresses all four: cluster state always matches Git, Argo records every sync, each sync is a single tracked operation that ends in a healthy or unhealthy state, and rollback is git revert + sync.
We run a single Argo CD instance per cluster (we have 3 prod clusters, 1 staging, 1 dev). Each Argo CD watches a single Git repo branch.
The repo layout:
clusters/
  prod-us-east/
    apps/
      payments/
        deployment.yaml
        service.yaml
        kustomization.yaml
      checkout/
        ...
    infrastructure/
      cert-manager/
      external-dns/
      prometheus/
  prod-eu-west/
    ...
Argo CD applications are defined as Application CRs that point to a path in the repo. We use the App-of-Apps pattern: a top-level "root" app contains Application CRs for everything else, so adding a new service is one PR that adds one file.
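As a rough sketch of the App-of-Apps pattern, a root Application could look like the following. The name `root`, the `argocd` namespace, and the `clusters/prod-us-east/root` path are assumptions for illustration; the post does not show where the child Application CRs live in the repo.

```yaml
# Hypothetical root Application for one cluster (name and path are assumptions).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd              # assumes Argo CD is installed in "argocd"
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops
    targetRevision: main
    path: clusters/prod-us-east/root   # assumed directory of child Application CRs
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd            # child Application CRs are created here
  syncPolicy:
    automated:
      prune: true                # removing a child file removes that Application
      selfHeal: true
```

With a layout like this, adding a service means adding one Application manifest under the assumed `root` directory; the root app picks it up on its next sync.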
In Argo CD: every Kubernetes resource (deployments, services, ingresses, configmaps, secrets via External Secrets, RBAC, network policies, monitoring config).
Not in Argo CD: cluster bootstrap (Karpenter, Argo CD itself, cluster-essential add-ons that need to exist before Argo can run). These are managed by Terraform.
The split: Terraform creates the cluster + Argo CD. Argo CD then manages everything inside the cluster. There's a small bootstrap dance but it only matters once per cluster.
The flow:
End-to-end: ~5-7 minutes for typical deploys. That is roughly twice as slow as kubectl-from-CI (which was 2-3 min). For us, the trade is worth it.
Argo CD has three states for an Application:
Auto-sync vs manual sync is a per-app setting. Our rules:
The "manual sync" for production turned out to be more useful than we expected. About once a month, an engineer notices something off in the diff before clicking sync (an unintended config change, a stale image tag) and we avoid an issue.
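A minimal sketch of what a manually synced production Application might look like. The application name follows the naming used elsewhere in this post, but the exact manifest is an assumption. The key point is the absence of an `automated` block under `syncPolicy`: Argo CD then reports the app as OutOfSync and waits for someone to review the diff and trigger the sync.

```yaml
# Hypothetical production Application with manual sync.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod-us-east
  namespace: argocd              # assumes Argo CD is installed in "argocd"
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops
    targetRevision: main
    path: clusters/prod-us-east/apps/payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    # No "automated" key: changes in Git are detected but not applied
    # until a human triggers the sync from the UI or CLI.
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
```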
selfHeal: true means Argo will revert any cluster-side change that diverges from Git. We turn this on for most apps.
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - CreateNamespace=true
    - ServerSideApply=true
prune: true means resources removed from Git get deleted from the cluster. selfHeal: true means manual edits get reverted. Both are necessary for "Git is the source of truth."
The exception: pods. Argo doesn't try to manage pods directly; it manages their controllers (deployments, statefulsets). So a kubectl-deleted pod gets recreated by the deployment controller, which is fine.
We don't store secrets in Git. We use External Secrets Operator (ESO) with AWS Secrets Manager:
An ExternalSecret resource (in Git) references the Secrets Manager path.
This way, the GitOps repo has zero secret material. Anyone with read access to the repo can see the structure but not the values.
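For illustration, an ExternalSecret of this kind might look like the sketch below. The resource names, the `aws-secrets-manager` store, and the `prod/payments/db` path are all hypothetical; only the overall shape (a Git-tracked resource that points at a Secrets Manager path) comes from the setup described here.

```yaml
# Hypothetical ExternalSecret; names and the remote key are assumptions.
apiVersion: external-secrets.io/v1beta1   # newer ESO releases also serve external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: payments
spec:
  refreshInterval: 1m            # how often ESO re-reads the remote secret
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager    # assumed store configured for AWS Secrets Manager
  target:
    name: payments-db            # Kubernetes Secret that ESO creates and updates
  data:
    - secretKey: password        # key inside the Kubernetes Secret
      remoteRef:
        key: prod/payments/db    # Secrets Manager path (hypothetical)
        property: password       # JSON field within the remote secret
```

Nothing in this manifest is sensitive: it names where the value lives, not the value itself.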
Rotation: when we rotate a secret in AWS, ESO picks it up within ~1 minute and updates the Kubernetes Secret. The app pods don't auto-reload — we use stakater/Reloader to detect Secret changes and roll the deployment.
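A sketch of how a Deployment can opt in to Reloader, assuming the `reloader.stakater.com/auto` annotation is used (the post does not say which Reloader annotation is configured). The image, names, and the `payments-db` Secret are hypothetical.

```yaml
# Hypothetical Deployment that Reloader restarts when a referenced Secret changes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  namespace: payments
  annotations:
    # Reloader watches ConfigMaps and Secrets referenced by this workload
    # and triggers a rolling update when any of them change.
    reloader.stakater.com/auto: "true"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.0.0   # placeholder image
          envFrom:
            - secretRef:
                name: payments-db                      # Secret maintained by ESO
```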
A few specific issues we hit:
Resources Argo doesn't know how to track. Some operators create resources Argo doesn't recognize as "owned" by an Application, so they show as orphaned. We added Argo's ignoreDifferences for these specific resource types. About 8 such cases across the cluster.
Race conditions between operators. cert-manager creates secrets; ESO tries to manage them; Argo wants Git to own them. The deconfliction took a couple of iterations. Final answer: cert-manager-created secrets are excluded from Argo's view via annotations.
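The post does not say which annotations are used. One possible approach, shown here purely as an assumption, is to have cert-manager stamp Argo CD's `IgnoreExtraneous` compare option onto the Secret it generates via the Certificate's `secretTemplate`. The certificate name, DNS name, and issuer are hypothetical.

```yaml
# Hypothetical Certificate; the annotation choice is an assumption, not the author's stated setup.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payments-tls
  namespace: payments
spec:
  secretName: payments-tls       # Secret that cert-manager creates and renews
  dnsNames:
    - payments.example.com
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-prod       # assumed issuer name
  secretTemplate:
    annotations:
      # Tells Argo CD not to count this resource toward the app's sync status
      # even though it is not defined in Git.
      argocd.argoproj.io/compare-options: IgnoreExtraneous
```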
Helm chart drift. When we upgraded a Helm chart, the new version sometimes added resources we didn't know about. Argo flagged them as out-of-sync, but our diff-review tooling missed them. We now run a "what new resources will this chart create" check in CI.
ServerSideApply caveats. Switching from client-side to server-side apply is the modern recommendation but it changed how some fields are handled (managed fields, defaulting). Two services had unintended diffs after the switch. Took half a day to track down.
We have ~40 services. Without ApplicationSets, that's 40 hand-written Application CRs. With ApplicationSets, it's one CR that templates them:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: prod-us-east-apps        # hypothetical name; a metadata.name is required
  namespace: argocd              # assumes Argo CD is installed in "argocd"
spec:
  generators:
    - git:
        repoURL: https://github.com/company/gitops
        revision: main
        directories:
          # One Application is generated per directory matching this glob
          - path: clusters/prod-us-east/apps/*
  template:
    metadata:
      name: '{{path.basename}}-prod-us-east'
    spec:
      project: default
      source:
        repoURL: https://github.com/company/gitops
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
Adding a new service: drop a directory in clusters/prod-us-east/apps/. The ApplicationSet generates the Application CR automatically.
We had to retrofit this; we wish we'd started with it.
Across 5 clusters, we considered:
We went with one-Argo-per-cluster. The operational simplicity won out, and we use the Argo CD UI per cluster rather than centrally. The extra ops burden of 5 Argo instances has been negligible.
Argo's argocd-notifications controller sends Slack messages on sync events. Our subscriptions:
We tuned the noise carefully — too many notifications get tuned out. The current set is read by humans regularly.
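As a sketch of how a subscription is expressed, Argo CD notifications are typically attached to an Application through annotations. The triggers below (`on-sync-failed`, `on-health-degraded`) come from the standard notifications catalog and are assumed to be configured in `argocd-notifications-cm`; the `deploys-prod` channel is hypothetical, and this is not necessarily the set we subscribe to.

```yaml
# Hypothetical Application subscribing a Slack channel to two triggers.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod-us-east
  namespace: argocd
  annotations:
    # Format: notifications.argoproj.io/subscribe.<trigger>.<service>: <recipient>
    notifications.argoproj.io/subscribe.on-sync-failed.slack: deploys-prod
    notifications.argoproj.io/subscribe.on-health-degraded.slack: deploys-prod
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops
    targetRevision: main
    path: clusters/prod-us-east/apps/payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
```

Subscribing only to failure-type triggers is one way to keep the channel quiet enough that people keep reading it.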
Drift on ConfigMaps. Some operators (like kube-prometheus-stack) generate ConfigMaps that get updated by their own reconciliation, conflicting with Argo's view. We mostly suppress these with ignoreDifferences but it's a long tail of small fights.
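A minimal sketch of such a suppression, assuming an operator-managed ConfigMap whose `data` is rewritten at runtime. The ConfigMap name and the Application name are hypothetical.

```yaml
# Hypothetical Application that ignores drift in one operator-managed ConfigMap.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-prod-us-east
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops
    targetRevision: main
    path: clusters/prod-us-east/infrastructure/prometheus
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring        # assumed namespace
  ignoreDifferences:
    - group: ""                  # core API group
      kind: ConfigMap
      name: example-generated-config   # hypothetical ConfigMap name
      jsonPointers:
        - /data                  # ignore changes the operator makes to the data
  syncPolicy:
    syncOptions:
      # Also skip the ignored fields when applying, not only when diffing
      - RespectIgnoreDifferences=true
```

Scoping each rule to a named resource and a specific field keeps the suppression narrow, so real drift elsewhere in the same app still shows up.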
Tracking what's actually deployed. Argo shows the synced commit, but tracing from "what's running in prod" back to "what app PR introduced this" is two hops (image tag → image build → app PR). We have a small tool that does this lookup but it's brittle.
Disaster recovery. If Argo CD itself dies, the cluster keeps running but no new deployments happen. We have runbooks for re-bootstrapping Argo from Terraform; we've practiced this. It works but it's not fast (~30 min to recover).
Secrets rotation visibility. ESO syncs secrets, Reloader rolls deployments — all this happens outside the GitOps loop. The only visibility is in the operators' own logs. We've discussed surfacing this in Argo somehow; haven't done it yet.
Use Argo CD if you have more than ~5 services on Kubernetes. Below that, kubectl-from-CI is fine. Above, the drift and audit problems become real.
Separate the GitOps repo from the app code repo. Two repos, different review processes. App teams own their app's manifest paths but the repo as a whole has stricter controls.
Manual sync for production. Auto-sync is tempting but the "human notices the diff before deploy" benefit is real and the latency cost is small.
ApplicationSets from day one. Don't write hand-curated Application CRs; you'll regret it at scale.
Plan the secrets story before the migration. ESO + AWS Secrets Manager (or equivalents) is the standard answer. Decide on this before you start migrating apps.
Don't put bootstrap concerns in Argo. Cluster-essential things (Karpenter, Argo CD itself, any add-on that must exist before Argo can run) live in Terraform. Argo manages the layer above.
GitOps with Argo CD has been a clear win for us. The migration was painful but the operational mode it enables — Git as source of truth, audit trail by default, rollback via revert — is what we'd build again. The rough edges are real but small compared to what we replaced.