How we run OpenTelemetry across ~40 services. The instrumentation that earns its place, the patterns we abandoned, and what tracing actually catches that metrics don't.
We've been running OpenTelemetry tracing across our ~40 production services for about two years. The promise of tracing — see where time goes inside a request, end to end — is real and we use it daily. The pitch that "tracing replaces metrics" is mostly wrong. This post is the working version: what we instrument, what we skip, how we sample, and what tracing actually catches.
Metrics tell you that a request was slow. Traces tell you where it was slow.
A typical example: p99 latency on the checkout API jumped from 400 ms to 1.2 s overnight. Metrics show the spike. Logs show successful 200s. The Datadog tracing view shows that 700 ms of every slow request is sitting in a single downstream call to the address-validation service — which itself shows a clean span breakdown pointing at a slow DNS lookup inside that service. Twelve minutes of debugging, total. Without tracing, we'd have been bisecting code paths and reading log lines for an hour.
That's the value: traces collapse the "which service is slow, and which call inside that service" question into a flame graph you read in seconds. For services with deep call graphs (anything customer-facing in a microservices arch), it's the single most useful debugging tool we have.
What it does NOT do: long-term dashboards, alerting, capacity planning. Traces are too high-volume and high-cardinality for those use cases. Metrics + logs do that work; traces are for "why is this specific request slow."
The breakdown:
Automatic, framework-level instrumentation — covers ~80% of useful spans for free:
We use the OpenTelemetry SDK with the auto-instrumentation packages for each language. Drop them in via init code and the IO-level spans (inbound requests, outbound HTTP/RPC calls, database queries, queue operations) come for zero per-line effort.
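A minimal sketch of that init step in Python, assuming the Flask and requests instrumentation packages and an OTLP export to a collector; the service name and endpoint are placeholders, not a prescription:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Wire the SDK: a tracer provider that batches spans and exports them over OTLP.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))
trace.set_tracer_provider(provider)

# Auto-instrumentation: inbound HTTP (Flask) and outbound HTTP (requests)
# get spans with no per-call-site changes.
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()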
Manual instrumentation — the 20% that auto can't see:
Business-level operations (order.placement, payment.charge, agent.run) — named after the domain action, not the function.
Retry loops (retry.attempt spans showing each retry).
For these, the code looks like:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("order.placement") as span:
    span.set_attributes({
        "order.id": order_id,
        "order.total_usd": total,
        "order.item_count": len(items),
    })
    # ... business logic ...
The set_attributes call is more important than the span itself for debugging. When something is slow, "this slow span had item_count=4203" is what makes the issue obvious.
A few patterns we tried and dropped:
Per-function spans inside a service. Tempting — "trace every function call so I can see exactly where time goes." Result: spans dominate the cost, traces have hundreds of irrelevant spans, the actually-useful spans drown in noise. Manual spans only at meaningful boundaries; rely on auto for IO.
Logs as spans. Some teams emit a span per log line. The data shape is wrong — logs are events, spans are durations. Mixing them muddies both. Keep logs as logs (with trace_id as a field so you can correlate); keep spans as durations.
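A minimal sketch of that correlation in Python, assuming stdlib logging and a filter that stamps the active trace ID onto each record; the trace_id field name and format string are just conventions:

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    # Attach the current trace ID to every log record so logs can be
    # joined to traces; "-" when there is no active span.
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # trace_id is an int; render it as 32-char hex, the way backends display it
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

logger = logging.getLogger("app")
logger.addFilter(TraceIdFilter())
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s trace_id=%(trace_id)s %(message)s"))
logger.addHandler(handler)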
Auto-instrumentation of all third-party libs. Some auto-instrumentations are noisy (every internal SDK call shows up). We disable the ones with poor signal-to-noise per language.
Tracing every request is expensive. Most APMs charge per span; even self-hosted backends have ingest costs. We use tail-based sampling with these rules:
Keep 100% of traces containing an error span (status.code = ERROR).
Keep 100% of traces slower than our latency threshold.
Keep a 10% baseline sample of everything else.
The benefit: every interesting trace gets captured; routine traces are sampled enough to compute aggregate stats. Our cost dropped to ~15% of what 100% sampling would have been; trace value didn't drop perceptibly.
Tail-based sampling needs a collector that holds traces for a few seconds before deciding to send them. We use the OpenTelemetry Collector for this — sits between agents and the APM, does sampling + batching. Adds ~50 MB of memory per node and a few ms of latency to the trace-send path.
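Concretely, the rules map onto the Collector's tail_sampling processor; the policy names, latency threshold, and decision_wait below are illustrative values, not our exact settings:

processors:
  tail_sampling:
    decision_wait: 10s              # hold each trace this long before deciding
    policies:
      - name: keep-errors           # 100% of traces containing an error span
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow             # 100% of traces over the latency threshold
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline              # 10% of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 10}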
Head-based sampling (decide at request start) is simpler but worse — you commit to keeping or dropping a trace before knowing whether it errors. We started with head-based and switched to tail-based after seeing too many cases of "the trace was sampled out exactly when something interesting happened."
Concrete cases over the last year:
A retry storm in a third-party SDK. The SDK had an exponential-backoff retry loop with a bug — every retry happened immediately instead of waiting. Trace showed 7 retry spans of the same call back-to-back. Vendor bug; we worked around it. Without tracing, the symptom was just "this endpoint is slow."
An N+1 query in an ORM lazy-load. New code path triggered the lazy-loader for each item in a list. Trace showed 200+ identical query spans for a single request. Two-line fix in the application. Metrics would have shown elevated DB CPU; trace showed the specific call path.
A cross-AZ network issue. Two services in the same region were taking 80 ms per request to talk to each other. Trace showed the network span dominated. Investigation found they'd been placed in different AZs by accident and the cross-AZ links had a transient issue. Moved them to the same AZ, problem gone.
A misconfigured connection pool. Pool size of 5 against a workload that needed 50. Trace showed many requests waiting hundreds of ms on "pool acquire." Bumped pool size, problem gone.
In every case the issue had been visible in metrics but ambiguous. Traces made the cause obvious in minutes.
A few things we hit:
Trace context propagation across boundaries. Auto-instrumentation only propagates traceparent headers if it knows about the protocol. Custom protocols (an internal binary RPC, an old SOAP integration) drop the context. We had to manually carry trace IDs through these.
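A sketch of one way to carry the context by hand, using the OpenTelemetry propagation API; send_rpc, handle_rpc, and the _otel envelope field are hypothetical stand-ins for the custom protocol:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def send_rpc(payload: dict) -> None:
    headers: dict = {}
    inject(headers)                # writes traceparent/tracestate into the dict
    payload["_otel"] = headers     # carry them inside the custom envelope
    # ... send the payload over the custom protocol ...

def handle_rpc(payload: dict) -> None:
    ctx = extract(payload.get("_otel", {}))
    with tracer.start_as_current_span("rpc.handle", context=ctx):
        ...  # handler work now joins the caller's trace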
Cardinality in span attributes. Putting user_id or order_id as attributes is great for debugging but explodes cardinality. APMs charge for unique attribute combinations. We removed user/order IDs from attributes and only kept them in span events (lower cost, still searchable).
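Continuing the order example from the manual-instrumentation snippet above, the split looks roughly like this (attribute and event names are illustrative):

from opentelemetry import trace

span = trace.get_current_span()
# Low-cardinality facts stay as attributes (indexed, billed per unique combination).
span.set_attribute("order.item_count", len(items))
# High-cardinality identifiers go into a span event instead: cheaper, still searchable.
span.add_event("order.context", {"order.id": order_id, "user.id": user_id})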
Collector memory under burst. When traffic spiked, the OpenTelemetry Collector's in-memory queue would fill faster than it could send. We added a memory-limited queue with backpressure and dropped older traces under sustained load.
SDK initialization order. The tracer has to be set up before any instrumented library is imported. In Python this means SDK init at the top of the main entry file, before any other imports. Getting this wrong silently disables auto-instrumentation. Documented in our service template; new services rarely hit it now.
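In practice the entry file looks something like this; tracing_setup is a hypothetical module name for the init code sketched earlier:

# main.py
import tracing_setup  # configure the SDK and instrumentors first

from app import create_app  # only now import code that uses instrumented libraries

app = create_app()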
For reference:
We considered self-hosting Jaeger or Tempo. The operational cost was real and we'd have lost integration with our existing dashboards/alerts. The Datadog bill is significant but the engineer-time it saves is larger.
Auto-instrumentation first, manual for business boundaries. Don't try to instrument every function call.
Tail-based sampling. 10% baseline + 100% for errors and slow requests. Captures everything interesting at a fraction of the cost.
Correlate logs and traces. Get the trace_id into your structured logs from day one. The two views together are how you actually debug.
Don't put high-cardinality data in attributes. Use span events instead. Costs less, still searchable.
Treat the OpenTelemetry Collector as production infrastructure. It's in the hot path; it needs the same monitoring, scaling, and capacity planning as any other service.
Tracing isn't a replacement for metrics or logs — it's the third leg. For services with deep call graphs and per-request latency that matters, it's the most useful of the three.