We use feature flags on roughly every customer-facing change. The pitch — "deploy code dark, then turn it on gradually" — is true and we've leaned on it heavily. The operational reality is messier than the pitch suggests: flag debt, evaluation latency, accidental fail-open states, and the question of "who's actually allowed to flip what." This post is what we've learned after a couple of years of running flags at scale.
Three primary use cases:
Gradual rollouts. New feature ships to 1% of users, then 10%, then 50%, then 100% over a few days. Catches issues that don't show up in staging — bad inputs, weird edge cases, scale problems. Roughly 80% of our flag usage.
Kill switches. A piece of code that we might need to disable fast. Payments retry logic, AI feature paths, expensive computations. If something starts misbehaving in prod, we flip the kill switch and dig in. ~10% of usage.
Customer-specific overrides. A specific customer needs a feature ahead of general availability (or after it's been deprecated). Targeting rules in the flag platform handle this. ~10% of usage.
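The three use cases above can be sketched with a hypothetical minimal flag client — the names (`FlagClient`, `is_enabled`) and the state shape are illustrative, not any real SDK's API:

```python
import zlib

class FlagClient:
    def __init__(self, state):
        # state: flag name -> {"enabled": bool, "percent": int, "allow": set}
        self.state = state

    def is_enabled(self, name, user_id="", customer=None):
        flag = self.state.get(name)
        if flag is None:
            return False  # unknown flag: fail closed
        if customer in flag.get("allow", set()):
            return True   # customer-specific override wins
        if not flag.get("enabled", False):
            return False  # kill switch: flipped off, nothing below runs
        percent = flag.get("percent", 100)
        # stable bucket so a user stays in (or out) as the rollout ramps
        return zlib.crc32(f"{name}:{user_id}".encode()) % 100 < percent
```

The ordering matters: overrides beat the kill switch, the kill switch beats the rollout percentage.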
We do NOT use flags for:
A/B experimentation. Measuring which variant performs better needs stats tooling and exposure tracking, not a rollout toggle.
Application configuration. Values that differ per environment belong in config management, not in the flag platform.
Billing and entitlements. Whether a customer has paid for a feature is a billing-system question, not a flag.
Each of these tools-vs-flags confusions has bitten us at least once. Keeping the categories separate matters.
We evaluated several providers and have run two of them in production. Brief comparison from actually using them:
LaunchDarkly — most mature feature set, broadest platform support, expensive at scale. Best fit for large teams that need fine-grained targeting, experimentation, and lots of language SDKs. We ran this for the first ~18 months; the bill grew faster than our team did.
GrowthBook — open source, can self-host or use their cloud, simpler model. Lacks some of LaunchDarkly's advanced targeting. Fine for "I need feature flags, not feature experiments." We moved here for the cost/simplicity trade.
Unleash — similar shape to GrowthBook, also OSS. We didn't run it but it's the alternative we'd consider.
Roll your own — for small teams with simple needs, a feature_flags table in your database with a small SDK is enough. The maintenance gets real once you cross ~30 flags or need percentage rollouts with consistent bucketing.
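The roll-your-own stage really can be this small — a sketch using an in-memory SQLite table, with illustrative schema and helper names (this is the "simplest thing that works" stage, not a product):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature_flags (
        name    TEXT PRIMARY KEY,
        enabled INTEGER NOT NULL DEFAULT 0
    )
""")
conn.execute("INSERT INTO feature_flags VALUES ('checkout.express-pay.enabled', 1)")

def is_enabled(name: str) -> bool:
    row = conn.execute(
        "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
    ).fetchone()
    return bool(row and row[0])  # unknown flags fail closed
```

What this deliberately lacks — percentage rollouts, consistent bucketing, targeting, audit history — is exactly the gap that eventually pushes you onto a platform.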
The general pattern: start with the simplest thing that works, switch when the gap to "real" flag platform features is causing pain. We probably should have started with GrowthBook instead of LaunchDarkly. The migration cost was real.
A few practices that survived the migration and a couple of incidents:
Flag-on-by-default for cleanup. New flags default to "on" in code, with the platform default being "off" for rollouts. Once a flag has been fully rolled out and stable, we can remove the platform configuration and the code keeps working.
Naming conventions. Every flag is <area>.<feature>.<purpose>, like checkout.express-pay.enabled or payments.retry-v2.kill-switch. Searchable, scannable, hard to mix up.
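The convention is cheap to enforce mechanically — a hypothetical lint check for the `<area>.<feature>.<purpose>` shape:

```python
import re

# three lowercase dot-separated segments, hyphens and digits allowed
FLAG_NAME = re.compile(r"^[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$")

def valid_flag_name(name: str) -> bool:
    return FLAG_NAME.fullmatch(name) is not None
```

Running this in CI against the flag platform's export catches drift before a badly named flag ships.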
Owner per flag. Each flag has a tagged owner — usually the team that introduced it. When flag debt accumulates, we know who to ask. We've been burned by orphan flags from departed engineers.
Expiry dates on rollouts. Every rollout flag has a target removal date. After full ramp + stability, the flag should be removed within a few weeks. Without expiry, flags accumulate forever.
SDK init in service template. All services init the flag SDK the same way, with the same fallback behavior, the same logging. Reduces the surface area for "this service handles flags differently."
What's bitten us:
Evaluation latency. Some SDKs do remote evaluation per check — a network round-trip every time you ask "is this flag on?" Latency adds up. We use SDKs that bulk-fetch flag state at startup and re-fetch every 30 seconds in the background. Each isEnabled() check becomes a local map lookup.
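The bulk-fetch pattern is roughly this — a sketch where `fetch_all_flags` stands in for whatever your platform's bulk endpoint returns:

```python
import threading
import time

class CachedFlags:
    def __init__(self, fetch_all_flags, refresh_seconds=30):
        self._fetch = fetch_all_flags
        self._state = fetch_all_flags()  # blocking fetch at startup
        self._lock = threading.Lock()
        t = threading.Thread(
            target=self._loop, args=(refresh_seconds,), daemon=True
        )
        t.start()

    def _loop(self, interval):
        while True:
            time.sleep(interval)
            try:
                fresh = self._fetch()
                with self._lock:
                    self._state = fresh
            except Exception:
                pass  # keep serving the last known state on fetch failure

    def is_enabled(self, name: str) -> bool:
        with self._lock:
            return self._state.get(name, False)  # local lookup, no network
```

The important property is that a platform outage degrades to "stale flags" rather than "slow or failing flag checks."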
Fail-open vs fail-closed. When the flag platform is unreachable, what's the default? Both options have failure modes: fail-open can silently enable a half-finished feature during a platform outage; fail-closed can silently disable a feature your users depend on.
We pick per-flag. Kill switches fail-open (we'd rather keep the feature working in a platform outage than be unable to disable broken code). New-feature flags fail-closed (better to not ship to users than to ship a half-tested feature without supervision).
The SDK has explicit fail_open / fail_closed defaults per flag, set at creation time.
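A sketch of what per-flag failure defaults look like — the client shape is hypothetical, but the point is that the fallback lives with the flag definition, not with each caller:

```python
FAIL_OPEN, FAIL_CLOSED = True, False

# set at creation time, alongside the flag itself
FLAG_DEFAULTS = {
    "payments.retry-v2.kill-switch": FAIL_OPEN,    # keep the feature running
    "checkout.express-pay.enabled":  FAIL_CLOSED,  # don't ship unsupervised
}

def is_enabled(name, platform_value=None):
    # platform_value is None when the flag platform is unreachable
    if platform_value is None:
        return FLAG_DEFAULTS.get(name, FAIL_CLOSED)
    return platform_value
```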
Bucketing consistency. Percentage rollouts should consistently hash the same user into the same bucket. Switching providers changed the hash function — a user who was at 50% with LaunchDarkly might be at 30% with GrowthBook. We migrated users by hand for the dozen flags that needed continuity.
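Consistent bucketing is a small amount of code — hash (flag, user) to a stable 0-99 bucket so a user's position never moves as the percentage ramps. A sketch (MD5 here is for distribution, not security; real SDKs use their own, mutually incompatible hashes, which is exactly why switching providers reshuffles users):

```python
import hashlib

def bucket(flag: str, user_id: str) -> int:
    digest = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    return bucket(flag, user_id) < percent
```

A user whose bucket is 7 is in at 10%, still in at 50%, still in at 100% — they never flap in and out as you ramp.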
Flag dependencies. Flag A is on only when Flag B is also on. The platforms support this but it gets messy fast. We avoid chained dependencies; if logic requires multiple flags, encode it in code with one flag as input rather than wiring dependencies in the platform.
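"Encode it in code with one flag as input" looks like this — a hypothetical express-pay gate where the platform holds exactly one flag and the extra conditions live in code where they can be unit-tested:

```python
def show_express_pay(express_pay_enabled: bool, region: str) -> bool:
    # one platform flag in; local conditions layered on top in code,
    # instead of chaining Flag A -> Flag B inside the platform
    return express_pay_enabled and region in {"us", "ca"}
```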
The flag that became permanent. A handful of our oldest flags have been "100% rolled out" for over a year but never got removed because the code paths under them are subtly different. They're now de facto configuration toggles. We're slowly cleaning them up; the lesson is that "remove this flag" is a real piece of work that has to be scheduled, not assumed.
A quarterly flag review: we walk the full list of active flags, owner by owner. Each item gets one of: keep (with reason), remove, or rewrite. The review takes about 30 minutes and catches 3-5 flags ripe for removal each quarter.
Without this, flag count grows monotonically and the platform turns into a graveyard of dead toggles. We've seen orgs with 1000+ flags and no idea which are live — every code search returns multiple flag checks per file.
A few gaps we work around:
Code-side cleanup. Removing a flag requires removing the SDK calls AND deleting the platform configuration. No platform we've used reliably finds dead flag references in your codebase. We grep periodically for flags that no longer exist in the platform.
Cross-environment coordination. Flag values in dev vs staging vs prod are managed independently by default. We've shipped code that worked because dev had the flag on and broke in prod where it was off. We now keep a synced flag spec per environment with explicit per-env overrides documented.
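The synced spec can be as simple as one source of truth with explicit per-env overrides — the structure below is illustrative, the point being that "dev had it on, prod didn't" shows up in code review rather than in an incident:

```python
FLAG_SPEC = {
    "checkout.express-pay.enabled": {
        "default": False,
        "overrides": {"dev": True, "staging": True},  # prod stays at default
    },
}

def flag_value(name: str, env: str) -> bool:
    spec = FLAG_SPEC[name]
    # any env not listed explicitly gets the default -- no silent divergence
    return spec["overrides"].get(env, spec["default"])
```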
Auditing who flipped what. Most platforms log changes but the UIs aren't great for "show me every flag change in the last month." We export the audit log weekly to S3 and grep it when we need to.
Use the platform, don't roll your own. Past ~30 flags, you want bucketing, targeting, audit logs, and a UI for non-engineers. Building this is real work; using a platform is cheaper.
Pick the simplest provider that fits your scale. GrowthBook or Unleash for most teams; LaunchDarkly when you actually need its advanced features.
Flag debt is real. Schedule removal. Don't assume someone will get to it.
Decide fail-open vs fail-closed per flag, at creation. Not at incident time.
Kill switches need to be tested. A kill switch that's never been flipped is theater. Run tabletop exercises that include actually flipping them.
Don't reach for flags for non-flag problems. A/B experimentation, configuration, billing — different tools, different shapes.
Feature flags are one of the most useful patterns in modern deployment. The operational discipline around them — naming, ownership, expiry, audits — is what determines whether they help long-term or turn into a slow swamp of dead toggles. The platforms do part of the work; the discipline is on you.