How we run OpenTelemetry across ~40 services. The instrumentation that earns its place, the patterns we abandoned, and what tracing actually catches that metrics don't.
We've been running OpenTelemetry tracing across our ~40 production services for about two years. The promise of tracing — see where time goes inside a request, end to end — is real and we use it daily. The pitch that "tracing replaces metrics" is mostly wrong. This post is the working version: what we instrument, what we skip, how we sample, and what tracing actually catches.
Metrics tell you that a request was slow. Traces tell you where it was slow.
A typical example: p99 latency on the checkout API jumped from 400 ms to 1.2 s overnight. Metrics show the spike. Logs show successful 200s. The Datadog tracing view shows that 700 ms of every slow request is sitting in a single downstream call to the address-validation service — which itself shows a clean span breakdown pointing at a slow DNS lookup inside that service. Twelve minutes of debugging, total. Without tracing, we'd have been bisecting code paths and reading log lines for an hour.
That's the value: traces collapse the "which service is slow, and which call inside that service" question into a flame graph you read in seconds. For services with deep call graphs (anything customer-facing in a microservices arch), it's the single most useful debugging tool we have.
What it does NOT do: long-term dashboards, alerting, capacity planning. Traces are too high-volume and high-cardinality for those use cases. Metrics + logs do that work; traces are for "why is this specific request slow."
The breakdown:
Automatic, framework-level instrumentation — covers ~80% of useful spans for free:
We use the OpenTelemetry SDK with the auto-instrumentation packages for each language. Drop them in via init code and the IO-level spans (inbound requests, outbound HTTP/RPC calls, database queries, queue operations) come for zero per-line effort.
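A minimal sketch of that init step in Python, assuming the Flask and requests instrumentation packages and an OTLP export to a collector; the service name and endpoint are placeholders, not a prescription:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Wire the SDK: a tracer provider that batches spans and exports them over OTLP.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))
trace.set_tracer_provider(provider)

# Auto-instrumentation: inbound HTTP (Flask) and outbound HTTP (requests)
# get spans with no per-call-site changes.
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()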
Manual instrumentation — the 20% that auto can't see:
Business-level operations (order.placement, payment.charge, agent.run) — named after the domain action, not the function.
Retry loops (retry.attempt spans showing each retry).
For these, the code looks like:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("order.placement") as span:
    span.set_attributes({
        "order.id": order_id,
        "order.total_usd": total,
        "order.item_count": len(items),
    })
    # ... business logic ...
The set_attributes call is more important than the span itself for debugging. When something is slow, "this slow span had item_count=4203" is what makes the issue obvious.
A few patterns we tried and dropped:
Per-function spans inside a service. Tempting — "trace every function call so I can see exactly where time goes." Result: spans dominate the cost, traces have hundreds of irrelevant spans, the actually-useful spans drown in noise. Manual spans only at meaningful boundaries; rely on auto for IO.
Logs as spans. Some teams emit a span per log line. The data shape is wrong — logs are events, spans are durations. Mixing them muddies both. Keep logs as logs (with trace_id as a field so you can correlate); keep spans as durations.
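A minimal sketch of that correlation in Python, assuming stdlib logging and a filter that stamps the active trace ID onto each record; the trace_id field name and format string are just conventions:

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    # Attach the current trace ID to every log record so logs can be
    # joined to traces; "-" when there is no active span.
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # trace_id is an int; render it as 32-char hex, the way backends display it
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

logger = logging.getLogger("app")
logger.addFilter(TraceIdFilter())
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s trace_id=%(trace_id)s %(message)s"))
logger.addHandler(handler)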
Auto-instrumentation of all third-party libs. Some auto-instrumentations are noisy (every internal SDK call shows up). We disable the ones with poor signal-to-noise per language.
Tracing every request is expensive. Most APMs charge per span; even self-hosted backends have ingest costs. We use tail-based sampling with these rules:
Keep 100% of traces containing an error span (status.code = ERROR).
Keep 100% of traces slower than our latency threshold.
Keep a 10% baseline sample of everything else.
The benefit: every interesting trace gets captured; routine traces are sampled enough to compute aggregate stats. Our cost dropped to ~15% of what 100% sampling would have been; trace value didn't drop perceptibly.
Tail-based sampling needs a collector that holds traces for a few seconds before deciding to send them. We use the OpenTelemetry Collector for this — sits between agents and the APM, does sampling + batching. Adds ~50 MB of memory per node and a few ms of latency to the trace-send path.
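Concretely, the rules map onto the Collector's tail_sampling processor; the policy names, latency threshold, and decision_wait below are illustrative values, not our exact settings:

processors:
  tail_sampling:
    decision_wait: 10s              # hold each trace this long before deciding
    policies:
      - name: keep-errors           # 100% of traces containing an error span
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow             # 100% of traces over the latency threshold
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline              # 10% of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 10}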
Head-based sampling (decide at request start) is simpler but worse — you commit to keeping or dropping a trace before knowing whether it errors. We started with head-based and switched to tail-based after seeing too many cases of "the trace was sampled out exactly when something interesting happened."
Concrete cases over the last year:
A retry storm in a third-party SDK. The SDK had an exponential-backoff retry loop with a bug — every retry happened immediately instead of waiting. Trace showed 7 retry spans of the same call back-to-back. Vendor bug; we worked around it. Without tracing, the symptom was just "this endpoint is slow."
An N+1 query in an ORM lazy-load. New code path triggered the lazy-loader for each item in a list. Trace showed 200+ identical query spans for a single request. Two-line fix in the application. Metrics would have shown elevated DB CPU; trace showed the specific call path.
A cross-AZ network issue. Two services in the same region were taking 80 ms per request to talk to each other. Trace showed the network span dominated. Investigation found they'd been placed in different AZs by accident and the cross-AZ links had a transient issue. Moved them to the same AZ, problem gone.
A misconfigured connection pool. Pool size of 5 against a workload that needed 50. Trace showed many requests waiting hundreds of ms on "pool acquire." Bumped pool size, problem gone.
In every case the issue had been visible in metrics but ambiguous. Traces made the cause obvious in minutes.
A few things we hit:
Trace context propagation across boundaries. Auto-instrumentation only propagates traceparent headers if it knows about the protocol. Custom protocols (an internal binary RPC, an old SOAP integration) drop the context. We had to manually carry trace IDs through these.
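A sketch of one way to carry the context by hand, using the OpenTelemetry propagation API; send_rpc, handle_rpc, and the _otel envelope field are hypothetical stand-ins for the custom protocol:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def send_rpc(payload: dict) -> None:
    headers: dict = {}
    inject(headers)                # writes traceparent/tracestate into the dict
    payload["_otel"] = headers     # carry them inside the custom envelope
    # ... send the payload over the custom protocol ...

def handle_rpc(payload: dict) -> None:
    ctx = extract(payload.get("_otel", {}))
    with tracer.start_as_current_span("rpc.handle", context=ctx):
        ...  # handler work now joins the caller's trace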
Cardinality in span attributes. Putting user_id or order_id as attributes is great for debugging but explodes cardinality. APMs charge for unique attribute combinations. We removed user/order IDs from attributes and only kept them in span events (lower cost, still searchable).
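Continuing the order example from the manual-instrumentation snippet above, the split looks roughly like this (attribute and event names are illustrative):

from opentelemetry import trace

span = trace.get_current_span()
# Low-cardinality facts stay as attributes (indexed, billed per unique combination).
span.set_attribute("order.item_count", len(items))
# High-cardinality identifiers go into a span event instead: cheaper, still searchable.
span.add_event("order.context", {"order.id": order_id, "user.id": user_id})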
Collector memory under burst. When traffic spiked, the OpenTelemetry Collector's in-memory queue would fill faster than it could send. We added a memory-limited queue with backpressure and dropped older traces under sustained load.
SDK initialization order. The tracer has to be set up before any instrumented library is imported. In Python this means SDK init at the top of the main entry file, before any other imports. Getting this wrong silently disables auto-instrumentation. Documented in our service template; new services rarely hit it now.
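In practice the entry file looks something like this; tracing_setup is a hypothetical module name for the init code sketched earlier:

# main.py
import tracing_setup  # configure the SDK and instrumentors first

from app import create_app  # only now import code that uses instrumented libraries

app = create_app()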
For reference:
We considered self-hosting Jaeger or Tempo. The operational cost was real and we'd have lost integration with our existing dashboards/alerts. The Datadog bill is significant but the engineer-time it saves is larger.
Auto-instrumentation first, manual for business boundaries. Don't try to instrument every function call.
Tail-based sampling. 10% baseline + 100% for errors and slow requests. Captures everything interesting at a fraction of the cost.
Correlate logs and traces. Get the trace_id into your structured logs from day one. The two views together are how you actually debug.
Don't put high-cardinality data in attributes. Use span events instead. Costs less, still searchable.
Treat the OpenTelemetry Collector as production infrastructure. It's in the hot path; it needs the same monitoring, scaling, and capacity planning as any other service.
Tracing isn't a replacement for metrics or logs — it's the third leg. For services with deep call graphs and per-request latency that matters, it's the most useful of the three.