We adopted Backstage for service catalogs and templates. What works, what was over-engineered for our size, and what we'd do differently.
We adopted Backstage about 18 months ago to give engineers a single place to find services, scaffold new ones, and read internal docs. The promise — one portal for "everything about your service" — partially landed. Some pieces are genuinely useful; some were over-engineered for our size; one piece we abandoned. This post is the honest assessment.
The term is overloaded. In practice, an IDP is some combination of:
Backstage covers #1–4 well. #5 is where teams over-invest and where most "IDP project failed" stories come from.
Of the five categories, two earn their place daily:
Service catalog (1). Every service has a catalog-info.yaml in its repo. It declares ownership, dependencies, lifecycle (production / experimental / deprecated), and links (runbook, dashboard, on-call). The catalog UI lets an engineer find "who owns this service" in one click instead of grepping commit history. Used daily.
Software templates (2). When a new service is created, we use a Backstage template that scaffolds the repo with our standard Dockerfile, GitHub Actions workflow, Argo CD manifest, OpenTelemetry init, and so on. The template lives in a separate repo; we update it whenever our service standards change. Engineers run it from the Backstage UI; ~5 minutes from "I need a new service" to "PR open with all the boilerplate."
The rest:
The promise: "engineers can self-serve database creation, feature-flag toggling, secret rotation, etc., all from Backstage." Sounds great.
Reality: each plugin needed:
Each was a small project. We built two (DB provisioning, S3 bucket creation) and used them maybe ten times each before realizing the engineers preferred the CLI tools they already had. The Backstage UI was a thin layer; the actual work was the integrations behind it.
We retired both plugins. New self-service work goes into the CLI tools directly. Backstage links to the docs.
Backstage is a real Node.js app you have to run. We deployed it on EKS:
Upgrades are the biggest pain. Backstage releases monthly with frequent breaking changes in plugin APIs. We upgrade quarterly, treating it like a real maintenance task. Some quarters we skip a release because the changelog is heavy.
Operational time: ~6 hours per quarter on upgrades + occasional plugin issues. Manageable but real.
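For context, the deployment itself is unremarkable: Backstage's backend listens on port 7007 and, in production, is typically backed by Postgres. A minimal sketch of the kind of Deployment involved; the namespace, image, and secret name are placeholders, not our actual manifests:

```yaml
# Hypothetical sketch: names, image tag, and secret are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backstage
  namespace: backstage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backstage
  template:
    metadata:
      labels:
        app: backstage
    spec:
      containers:
        - name: backstage
          image: registry.example.com/backstage-app:2024-06   # image built from the Backstage app repo
          ports:
            - containerPort: 7007            # default Backstage backend port
          envFrom:
            - secretRef:
                name: backstage-postgres     # Postgres credentials referenced from app-config.yaml
```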
The pattern that worked: catalog entries live in the same repo as the code. Each service has catalog-info.yaml:
```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-api
  description: Customer checkout API
  annotations:
    backstage.io/source-location: url:https://github.com/company/checkout-api
    grafana/dashboard-selector: "tag:checkout-api"
    pagerduty.com/service-id: P12345
    github.com/project-slug: company/checkout-api
spec:
  type: service
  lifecycle: production
  owner: team-checkout
  system: checkout
  dependsOn:
    - component:default/checkout-db
    - component:default/payments-api
```
Backstage scans configured repos and ingests these. Updates take ~5 minutes to propagate. Engineers update catalog-info.yaml in PRs alongside code changes; ownership info stays current automatically.
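The scanning side is plain configuration: Backstage's GitHub discovery provider walks an org and registers any repo containing a catalog-info.yaml. A hedged sketch of what that looks like in app-config.yaml; the org name, filters, and schedule are illustrative rather than our exact values, and it assumes a GitHub integration token is configured separately:

```yaml
catalog:
  providers:
    github:
      companyOrg:                        # arbitrary provider id
        organization: company            # scan every repo in this org
        catalogPath: /catalog-info.yaml  # file to look for in each repo
        filters:
          branch: main
          repository: '.*'               # regex; narrow it if only some repos hold services
        schedule:
          frequency: { minutes: 5 }      # roughly the ~5 minute propagation mentioned above
          timeout: { minutes: 3 }
```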
What we don't do: manual entry of services through the UI. That drifts. Repo-based catalog declarations stay in sync with reality.
Our Backstage software template for a new web service generates:
Engineer answers ~5 questions (service name, team, language, etc.) in the Backstage UI. Template generates a PR against a new repo. Engineer reviews, merges, gets to feature work.
The template lives in version control. When we change our standards (new logging library, new image base), we update the template; existing services pick up the changes through their own PRs over time.
Time from "I want a new service" to "running in staging" used to be ~half a day of copying from another service. Now it's 30 minutes including code review of the generated PR.
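For readers who haven't written one: a scaffolder template is itself YAML, with parameters (the questions the engineer answers) and steps (the actions that run). Below is a simplified, illustrative sketch; the parameter set, skeleton path, and repo host are assumptions, and our real template opens a PR for review rather than publishing directly:

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: web-service
  title: New web service
spec:
  owner: team-platform
  type: service
  parameters:
    - title: Service basics
      required: [name, owner]
      properties:
        name:
          title: Service name
          type: string
        owner:
          title: Owning team
          type: string
          ui:field: OwnerPicker
  steps:
    - id: fetch
      name: Fill in the skeleton
      action: fetch:template
      input:
        url: ./skeleton                  # Dockerfile, CI workflow, Argo CD manifest, OTel init, catalog-info.yaml
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
    - id: publish
      name: Create the repo
      action: publish:github
      input:
        repoUrl: github.com?owner=company&repo=${{ parameters.name }}
    - id: register
      name: Register in the catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
```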
Two things in hindsight:
Start with just the catalog. Trying to do catalog + templates + tech docs + plugins simultaneously meant none of them got polished. Catalog alone, done deeply, then templates, then whatever else is needed. Phased adoption beats a wholesale launch.
Treat TechDocs as optional. We tried to migrate all docs into TechDocs and the migration stalled because the Notion → TechDocs friction was higher than the value. If we'd left docs where they were and just linked to them from Backstage, we'd have spent less effort and gotten the same outcome.
If your team is small (< 5 engineers), Backstage is overkill. A README and a services.md in a docs repo do the job.
For 20+ engineers, multiple services, ownership questions like "who do I page for X" — Backstage starts paying back. The catalog and templates are the highest-ROI features.
For 100+ engineers, true platform-engineering territory, Backstage is one of several tools you'd consider (Port, Cortex, custom). The operational cost is justified.
We're around 30 engineers across 5 teams. Backstage is net-positive for us but barely. If we were 15 engineers we'd probably skip it.
Worth knowing about:
services.md with a YAML table. Sounds primitive; works fine for many teams. We did this for 18 months before Backstage.
Validate the catalog use case first. Can your engineers answer "who owns this service" in 30 seconds today? If yes, Backstage's catalog adds little. If no, that's the value to test.
Start small. Catalog + one good template. Don't try to do plugins on day one.
Catalog-info.yaml in-repo, not UI-managed. Drift is the enemy.
Plan for upgrade cost. Quarterly maintenance, real ops time.
Skip self-service plugins unless they save real time. Most are thin UI on top of CLI tools engineers already use.
Backstage is one of those tools where the marketing oversells and the operational reality is more nuanced. Used carefully, for the things it's actually good at, it's a useful platform layer. Used to its full advertised potential, it becomes a small platform-engineering team's full-time job. Pick what serves you.