The "three pillars" framing misses the point — what matters is correlating across them. The patterns that earn their place and the tooling decisions that pay back.
The "three pillars of observability" framing — logs, metrics, traces — is everywhere. It's also misleading. The pillars aren't independent; the value is in correlating across them. A high-latency request in metrics points to a trace; the trace points to a span; the span points to a log line. Without that chain, each "pillar" is an island.
We've spent the last two years tightening the correlation across our three observability surfaces. This is what works and the operational discipline that makes it work.
The single most important practice: every signal carries the same trace ID. A request comes in, gets assigned a trace ID, and every log line, every metric exemplar, and every span carries that ID. Then you can move between them.
Specifically:
- Logs: trace_id (and span_id where relevant) as a structured field.
- Metrics: trace_id as an exemplar (more on this below).

This sounds trivial. Operationally, it's where most observability efforts fall down. The instrumentation must consistently propagate the trace ID across service boundaries, queue handoffs, batch jobs, and retries.
OpenTelemetry is the common thread — instrument once, route to multiple backends. That's the bet.
Every log line a service writes carries the trace ID. The logger context picks it up from the OpenTelemetry context automatically if the logging library is integrated.
{
  "ts": "2026-06-18T14:32:01.123Z",
  "level": "error",
  "msg": "payment failed",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "service": "payments",
  "user_id": "u_12345",
  "amount_cents": 4500
}
The trace_id is what makes this log line cross-referenceable. The other fields (user_id, amount_cents) are useful context for the specific event.
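One way to get a log line of that shape is a logging filter that stamps the current trace context onto every record. The sketch below uses only the standard library; the ContextVar names are stand-ins (an assumption) for the OpenTelemetry context that an integrated logging library would read from, and the "fields" key passed through `extra` is a hypothetical convention for event-specific data.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Stand-ins for the tracing context. In a real service these values would be
# read from the OpenTelemetry context, not from hand-rolled ContextVars.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")
current_span_id = contextvars.ContextVar("current_span_id", default="")


class TraceContextFilter(logging.Filter):
    """Attach the active trace and span IDs to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render one JSON object per line, with the trace fields always present."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
            "service": record.name,
        }
        # Event-specific context passed as extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: the call site never mentions the trace ID; the filter supplies it.
buffer = io.StringIO()
log = build_logger("payments", buffer)
current_trace_id.set("abc123def456")
current_span_id.set("789xyz")
log.error("payment failed", extra={"fields": {"user_id": "u_12345", "amount_cents": 4500}})
print(buffer.getvalue())
```

Keeping the trace fields in the filter rather than at each call site is the design choice that matters: nobody has to remember to pass the ID.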
Patterns that matter:
Prometheus exemplars are the bridge from metrics to traces. An exemplar is a single example trace attached to a metric bucket.
You alert on "p95 latency > 1s." The alert fires; you look at the metric. The metric chart shows the p95 spike — and exemplars on the chart show specific trace IDs that fell in the spike. Click an exemplar, jump to the trace, see what happened.
Without exemplars, you go from "metric says something is slow" to "search for slow traces" — a manual step that wastes minutes.
The Prometheus client libraries (in OpenTelemetry-instrumented apps) emit exemplars automatically when the relevant metric is updated within a trace span. Enable it in the collector config; Grafana renders exemplars on charts natively.
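To make the idea concrete, here is a toy histogram that remembers one example trace per bucket. It is a conceptual sketch only, not how the Prometheus client stores exemplars; the class, the bucket bounds, and the trace IDs are all made up for illustration.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Exemplar:
    trace_id: str
    value: float


@dataclass
class LatencyHistogram:
    # Bucket upper bounds in seconds (illustrative values). Observations above
    # the last bound land in an implicit +Inf bucket.
    bounds: tuple = (0.1, 0.5, 1.0, 2.5)
    counts: list = field(default_factory=list)
    exemplars: dict = field(default_factory=dict)

    def __post_init__(self):
        # Per-bucket counts; real Prometheus buckets are cumulative.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, seconds: float, trace_id: str = "") -> None:
        index = bisect.bisect_left(self.bounds, seconds)
        self.counts[index] += 1
        if trace_id:
            # Keep only the most recent example trace for this bucket.
            self.exemplars[index] = Exemplar(trace_id, seconds)

    def exemplars_above(self, threshold: float) -> list:
        """Example traces whose observed latency exceeded the threshold."""
        return [e for e in self.exemplars.values() if e.value > threshold]


histogram = LatencyHistogram()
histogram.observe(0.05, trace_id="trace-fast")
histogram.observe(1.8, trace_id="trace-slow")
histogram.observe(0.2)  # updated outside a trace: counted, no exemplar
print(histogram.exemplars_above(1.0))
```

The storage cost is small (one trace ID per bucket, not per observation), which is why exemplars avoid the cardinality problem that a trace ID label would cause.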
A trace is a tree of spans. Each span has a name, timing, and attributes. The attributes are where most of the debugging value lives — and they're easy to get wrong.
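OpenTelemetry has semantic conventions: standardized attribute names for common things. Use http.status_code, not httpStatus; db.statement, not query. (Recent releases of the conventions rename these two to http.response.status_code and db.query.text; use whichever names your SDK version emits, and use them consistently.)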
Why: tooling (Grafana, Tempo, etc.) can do useful things with conventional attributes — show error rates by http.status_code, group traces by service.name, surface slow db.system calls automatically. Custom attribute names work too but lose the automatic value.
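A small sketch of why the naming matters: a generic tool can only aggregate on the attribute names it knows. The function below is a hypothetical stand-in for such a tool, and the span data is invented.

```python
from collections import Counter

# The conventional attribute name a generic tool would look for.
STATUS_KEY = "http.status_code"


def status_breakdown(spans: list) -> Counter:
    """Count spans per HTTP status, the way a generic dashboard might."""
    counts = Counter()
    for attributes in spans:
        counts[attributes.get(STATUS_KEY, "unknown")] += 1
    return counts


spans = [
    {"service.name": "payments", "http.status_code": 200},
    {"service.name": "payments", "http.status_code": 500},
    # Custom attribute name: the data is there, but the tool cannot see it.
    {"service.name": "payments", "httpStatus": 500},
]
print(status_breakdown(spans))
```

The third span records the same failure as the second, yet it lands in "unknown" and never shows up in an error-rate panel.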
What we add beyond conventions:
- tenant_id (we're multi-tenant; almost every debug starts with "which tenant?")
- user_id (for user-specific bug reports)
- request_kind (a service-specific category)

What actually happens when an alert fires:
The chain — metric → exemplar → trace → log — is what makes this fast. Without correlation, each step is "search separately and hope."
Tracing every request is expensive at scale. Sampling strategies:
Head-based sampling. Decision at trace start: sample this trace at 1%, drop the rest. Simple; cheap. Problem: the decision is made before the outcome is known, so it cannot favor errors. Errors are already rare, and a 1% head sample drops about 99% of them along with the normal traffic.
Tail-based sampling. Decision at trace end: keep all errors, keep slow ones, sample the rest at 1%. Better quality; needs a collector that buffers traces (more memory) and supports tail sampling.
We use tail-based sampling at the OpenTelemetry collector. The rules:
This keeps the trace store manageable while preserving the traces we'd actually want to investigate.
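The general shape of a tail-based decision (keep errors, keep slow traces, sample the rest) can be sketched as a single function. This is an illustration of the technique, not the collector configuration described here: the 1000 ms threshold, the TraceSummary type, and the hashing scheme are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace: TraceSummary, slow_ms: float = 1000.0, baseline_rate: float = 0.01) -> bool:
    """Decide after the trace has finished whether to store it."""
    if trace.has_error:
        return True  # keep every error
    if trace.duration_ms >= slow_ms:
        return True  # keep every slow trace (threshold is an assumed value)
    # Hash the trace ID into [0, 1) so the decision is deterministic: the same
    # trace gets the same answer no matter which collector instance sees it.
    digest = hashlib.sha256(trace.trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < baseline_rate


print(keep_trace(TraceSummary("a1", has_error=True, duration_ms=12.0)))
print(keep_trace(TraceSummary("b2", has_error=False, duration_ms=2400.0)))
```

Because the function needs has_error and duration_ms, it can only run once all spans have arrived, which is exactly why the collector must buffer traces in memory.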
Observability infrastructure breaks like anything else. We monitor:
We've had two incidents where the observability stack itself was broken, and working that out took longer than fixing the actual issue. Now we have synthetic alerts that fire if no logs have arrived in N minutes.
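The core of such a check is a staleness comparison. A minimal sketch, assuming the timestamp of the most recent log is available from the log store; the five-minute default is a placeholder for N, not a value from this setup.

```python
from datetime import datetime, timedelta, timezone


def logs_are_stale(
    last_log_at: datetime,
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),  # placeholder for "N minutes"
) -> bool:
    """True when no log line has arrived within the allowed silence window."""
    return now - last_log_at > max_silence


now = datetime(2026, 6, 18, 14, 40, tzinfo=timezone.utc)
last_seen = datetime(2026, 6, 18, 14, 32, tzinfo=timezone.utc)
print(logs_are_stale(last_seen, now))
```

The alert should be evaluated and delivered outside the pipeline it watches; otherwise the same outage silences the alert.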
Logging too much. Initially we logged every request, every DB query, every state change. Storage blew up; useful signals drowned in noise. Now we log at INFO for "things that happened" and DEBUG for "diagnostic detail" — INFO goes to storage; DEBUG can be enabled per-service per-pod when investigating.
Metric cardinality from user_id. Adding user_id as a metric label was tempting. It exploded cardinality (millions of unique values × hundreds of metrics). Now: user_id stays out of metric labels; it goes in trace attributes and log fields, where high cardinality is fine.
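The arithmetic behind that explosion is simple multiplication. In this sketch the label names and counts are invented for illustration, and the result is a worst-case upper bound that assumes every label combination occurs.

```python
import math


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return math.prod(label_cardinalities.values())


base_labels = {"service": 20, "status_class": 5}
with_user_id = {**base_labels, "user_id": 1_000_000}

print(max_series(base_labels))
print(max_series(with_user_id))
```

One extra label multiplies the series count of every metric that carries it, which is why the same field is harmless on a span or log line and ruinous as a metric label.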
Forgetting to instrument batch jobs. Web requests had tracing; cron jobs and queue consumers didn't. Half our debugging puzzles were "what happened in the batch?" Now everything is instrumented, including batch boundaries.
Trusting auto-instrumentation entirely. OpenTelemetry auto-instrumentation is great but misses application-specific concerns. Manual spans around important operations (e.g., a feature-flag evaluation, a complex business calculation) add the context auto-instrumentation can't.
What we do continuously:
The three pillars framing is useful as a vocabulary. As a strategy, it misses the point — the value is in correlation. Spend less time worrying about which pillar matters most and more on making the trace ID flow consistently across all three. Once that's working, debugging is faster, alerts are more actionable, and on-call gets meaningfully better.