The "three pillars" framing misses the point — what matters is correlating across them. The patterns that earn their place and the tooling decisions that pay back.

Observability — Correlating Logs, Metrics, and Traces in Anger

The "three pillars of observability" framing — logs, metrics, traces — is everywhere. It's also misleading. The pillars aren't independent; the value is in correlating across them. A high-latency request in metrics points to a trace; the trace points to a span; the span points to a log line. Without that chain, each "pillar" is an island.

We've spent the last two years tightening the correlation across our three observability surfaces. This is what works and the operational discipline that makes it work.

The mental model: same identifier across all three #

The single most important practice: every signal carries the same trace ID. A request comes in, gets assigned a trace ID, and every log line and every span carries that ID, while metrics reference it through exemplars rather than labels. Then you can move between them.

Specifically:

Logs include trace_id (and span_id where relevant) as a structured field.
Metrics carry the trace_id as an exemplar (more on this below).
Traces have the trace_id as their primary identifier.

This sounds trivial. Operationally, it's where most observability efforts fall down. The instrumentation must consistently propagate the trace ID across service boundaries, queue handoffs, batch jobs, retries.
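As a rough illustration of what "propagating the trace ID" means on the wire, here is a minimal sketch of the W3C `traceparent` header that OpenTelemetry propagators use by default. The helper names (`parse_traceparent`, `start_or_continue`, `to_traceparent`) are made up for this example; in a real service the OpenTelemetry SDK does this for you, and the same string can travel as a queue message attribute or a batch job parameter.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass(frozen=True)
class TraceContext:
    trace_id: str  # 32 hex chars, shared by every span of the request
    span_id: str   # 16 hex chars, identifies the current span
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[TraceContext]:
    """Return the caller's context, or None if the header is missing or invalid."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the spec.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def start_or_continue(header: Optional[str]) -> TraceContext:
    """Continue the incoming trace if there is one, otherwise start a new one."""
    parent = parse_traceparent(header)
    if parent is None:
        return TraceContext(secrets.token_hex(16), secrets.token_hex(8), True)
    # Same trace ID, fresh span ID for the work done in this service.
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: TraceContext) -> str:
    """Serialize the context for an outgoing header or message attribute."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"
```

The property worth checking in code review is the one `start_or_continue` encodes: a downstream hop reuses the incoming trace ID and only mints a new span ID.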

Tooling we use #

Tracing: OpenTelemetry SDK + collector → Tempo / Jaeger.
Metrics: Prometheus (pull) for service metrics; OpenTelemetry → collector → Mimir for app/business metrics.
Logs: structured JSON to stdout → Loki / OpenSearch.
Grafana for the unified UI; the same dashboard can pivot from a metric to a trace to a log.

OpenTelemetry is the common thread — instrument once, route to multiple backends. That's the bet.

Step 1: structured logs with trace context #

Every log line a service writes carries the trace ID. The logger context picks it up from the OpenTelemetry context automatically if the logging library is integrated.

{
  "ts": "2026-06-18T14:32:01.123Z",
  "level": "error",
  "msg": "payment failed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service": "payments",
  "user_id": "u_12345",
  "amount_cents": 4500
}

The trace_id is what makes this log line cross-referenceable. The other fields (user_id, amount_cents) are useful context for the specific event.

Patterns that matter:

Use a structured logger. Pino, Zap, structlog, slog: anything that emits JSON natively. Free-form, string-formatted log lines can only be parsed with brittle regexes.
Inject trace context into the logger. Once per request, set the trace context on the logger; every subsequent log line includes it.
Don't log secrets. Redact at logger config; don't rely on devs to remember.
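The three patterns above can be sketched with nothing but the Python standard library. This is an illustration under assumptions, not our production logger: `current_trace` is a hypothetical stand-in for the OpenTelemetry context, `REDACTED_KEYS` is an example list, and `build_logger` is a helper invented for this sketch.

```python
import contextvars
import json
import logging
import time

# Hypothetical stand-in for the OpenTelemetry context: set once per request.
current_trace = contextvars.ContextVar("current_trace", default=None)

# Example redaction list, applied centrally so nobody has to remember it.
REDACTED_KEYS = {"password", "token", "card_number"}


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record):
        trace = current_trace.get() or {}
        record.trace_id = trace.get("trace_id")
        record.span_id = trace.get("span_id")
        return True


class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # emit UTC timestamps

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S") + "Z",
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "service": self.service,
        }
        # Event-specific fields arrive via extra={"fields": {...}}.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in REDACTED_KEYS else value
        return json.dumps(payload)


def build_logger(service, stream=None):
    """Create a JSON logger; stream defaults to stderr, pass sys.stdout in a service."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, a call such as `log.error("payment failed", extra={"fields": {"user_id": "u_12345"}})` picks up the trace ID without the caller passing it, which is the whole point: call sites cannot forget it.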

Step 2: metrics with exemplars #

Prometheus exemplars are the bridge from metrics to traces. An exemplar is a single sampled observation, tagged with the trace ID of the request that produced it, attached to a histogram bucket.

You alert on "p95 latency > 1s." The alert fires; you look at the metric. The metric chart shows the p95 spike — and exemplars on the chart show specific trace IDs that fell in the spike. Click an exemplar, jump to the trace, see what happened.

Without exemplars, you go from "metric says something is slow" to "search for slow traces" — a manual step that wastes minutes.

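OpenTelemetry metrics SDKs can attach exemplars automatically when a measurement is recorded inside a sampled span; with the native Prometheus client libraries you typically pass the exemplar explicitly. Either way, exemplars have to be enabled along the whole path: exposed by the app or collector, stored by the metrics backend (in Prometheus, exemplar storage sits behind a feature flag), and switched on for the Grafana data source, which then renders them on charts natively.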
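To make the mechanism concrete, here is a toy histogram that keeps one exemplar per bucket and renders it in the OpenMetrics text style (sample value, then `# {labels} value timestamp`). `ExemplarHistogram` is a made-up class for illustration; real client libraries and SDKs handle this internally and differ in which exemplar they retain.

```python
import bisect
import time


class ExemplarHistogram:
    """Bucketed histogram that remembers the latest traced observation per bucket."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)                 # upper bounds, e.g. [0.1, 0.5, 1.0]
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.exemplars = [None] * (len(self.bounds) + 1)

    def observe(self, value, trace_id=None):
        # First bucket whose upper bound is >= value ("le" is inclusive).
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        if trace_id is not None:
            self.exemplars[idx] = (trace_id, value, time.time())

    def expose(self):
        """Render cumulative bucket lines, appending an exemplar where one exists."""
        lines, running = [], 0
        for idx, bound in enumerate(self.bounds + [float("inf")]):
            running += self.counts[idx]
            le = "+Inf" if bound == float("inf") else repr(bound)
            line = f'{self.name}_bucket{{le="{le}"}} {running}'
            if self.exemplars[idx]:
                trace_id, value, ts = self.exemplars[idx]
                line += f' # {{trace_id="{trace_id}"}} {value} {ts:.3f}'
            lines.append(line)
        return lines
```

Note that the trace ID lives beside the sample, not in the label set, so it adds no series. That is why exemplars are safe where a `trace_id` label would not be.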

Step 3: traces with semantic conventions #

A trace is a tree of spans. Each span has a name, timing, and attributes. The attributes are where most of the debugging value lives — and they're easy to get wrong.

OpenTelemetry has semantic conventions: standardized attribute names for common things. http.response.status_code, not httpStatus. db.query.text, not query. (Older releases of the conventions called these http.status_code and db.statement; use whichever version your SDK and backends target, and use it everywhere.)

Why: tooling (Grafana, Tempo, etc.) can do useful things with conventional attributes, such as showing error rates by HTTP status code, grouping traces by service.name, and surfacing slow database calls per database system automatically. Custom attribute names work too but lose the automatic value.

What we add beyond conventions:

tenant_id (we're multi-tenant; almost every debug starts with "which tenant?")
user_id (for user-specific bug reports)
request_kind (a service-specific category)
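One way to keep attribute names from drifting is a small check at the instrumentation boundary. The sketch below is hypothetical: `ALIASES`, `KNOWN_PREFIXES`, and `normalize_attributes` are invented for illustration, and the conventional names shown are the current stable ones (older convention releases use http.status_code and db.statement).

```python
# Hypothetical alias table: ad-hoc names seen in review -> conventional names.
ALIASES = {
    "httpStatus": "http.response.status_code",
    "query": "db.query.text",
}

# The custom attributes this article adds on top of the conventions.
CUSTOM_ALLOWED = {"tenant_id", "user_id", "request_kind"}

# Illustrative subset of convention namespaces, not an exhaustive list.
KNOWN_PREFIXES = ("http.", "db.", "url.", "server.", "error.")


def normalize_attributes(attrs):
    """Return (clean_attrs, warnings): rename known aliases, flag unknown names."""
    clean, warnings = {}, []
    for key, value in attrs.items():
        if key in ALIASES:
            warnings.append(f"renamed {key} -> {ALIASES[key]}")
            key = ALIASES[key]
        elif key not in CUSTOM_ALLOWED and not key.startswith(KNOWN_PREFIXES):
            warnings.append(f"unrecognized attribute: {key}")
        clean[key] = value
    return clean, warnings
```

Run in a unit test or a lint step, a check like this turns "use the conventions" from a guideline into something a build can fail on.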

The debugging flow #

What actually happens when an alert fires:

Alert fires. "p95 latency > 1s for service X."
Look at the dashboard. What metrics are anomalous? CPU? Memory? Queue depth?
Find an exemplar. Click a trace ID associated with a slow request.
Open the trace. Which span is slow? A specific DB query? An external call?
Find the corresponding log lines. Filter logs by trace_id; read the error message, the request context.
Form a hypothesis. Slow query → check pg_stat_statements; external call timeout → check downstream service; cache miss → check cache hit rate.

The chain — metric → exemplar → trace → log — is what makes this fast. Without correlation, each step is "search separately and hope."

Sampling: the cost question #

Tracing every request is expensive at scale. Sampling strategies:

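Head-based sampling. The decision is made at trace start: keep 1% of traces, drop the rest. Simple and cheap. The problem: the decision happens before the outcome is known, so errors and slow requests are dropped at the same 99% rate as everything else, and rare failures mostly vanish from the trace store.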

Tail-based sampling. Decision at trace end: keep all errors, keep slow ones, sample the rest at 1%. Better quality; needs a collector that buffers traces (more memory) and supports tail sampling.

We use tail-based sampling at the OpenTelemetry collector. The rules:

100% of traces with errors.
100% of traces > 1s duration.
100% of traces from specific high-priority customers.
1% of everything else.

This keeps the trace store manageable while preserving the traces we'd actually want to investigate.
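In the collector these rules are configuration for its tail sampling processor, not application code. The decision logic itself is small enough to sketch in Python. `FinishedTrace`, `keep_trace`, and the tenant name are hypothetical; the thresholds mirror the rules listed above.

```python
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    trace_id: str        # 32 hex chars
    duration_s: float
    has_error: bool
    tenant_id: str = ""


PRIORITY_TENANTS = {"t_enterprise_1"}  # hypothetical high-priority customers
SLOW_THRESHOLD_S = 1.0
BASELINE_RATE = 0.01


def keep_trace(trace, priority_tenants=PRIORITY_TENANTS, baseline_rate=BASELINE_RATE):
    """Return the reason a completed trace is kept, or None if it is dropped."""
    if trace.has_error:
        return "error"
    if trace.duration_s > SLOW_THRESHOLD_S:
        return "slow"
    if trace.tenant_id in priority_tenants:
        return "priority"
    # Deterministic baseline: map the low 32 bits of the trace ID onto [0, 1)
    # so every collector instance reaches the same verdict for the same trace.
    bucket = int(trace.trace_id[-8:], 16) / 0x1_0000_0000
    return "baseline" if bucket < baseline_rate else None
```

Returning the reason rather than a bare boolean is a deliberate choice in this sketch: recording it as an attribute lets you check later how much of the trace store each rule accounts for.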

What we monitor about the monitoring #

Observability infrastructure breaks like anything else. We monitor:

Ingestion latency. Logs/metrics/traces should show up within seconds.
Cardinality. A metric with too many label values explodes storage. Alert on high-cardinality metrics.
Sampling rate. Is the share of traces we keep drifting above (or below) what the sampling rules should produce?
Storage costs. Logs are usually the biggest cost; trace storage second; metrics smallest.

We've had two incidents where the observability stack itself was broken, and working out that the tooling was at fault took longer than fixing the underlying issue. Now we have synthetic alerts that fire if no logs have arrived in N minutes.
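The staleness check behind such an alert is simple. A minimal sketch, assuming something outside the observability stack records the last arrival time per signal; `stale_signals` and its parameters are hypothetical names for this example.

```python
import time


def stale_signals(last_seen, max_age_s, now=None):
    """Return the names of signals with no data for more than max_age_s seconds.

    last_seen maps a signal name (e.g. "logs") to the Unix timestamp of the
    most recent record received for it.
    """
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age_s)
```

For example, `stale_signals({"logs": t_logs, "traces": t_traces}, max_age_s=300)` returns the signals that have been silent for more than five minutes. The check needs to run somewhere that does not depend on the pipeline it is watching, otherwise it fails silently along with it.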

Things we got wrong #

Logging too much. Initially we logged every request, every DB query, every state change. Storage blew up; useful signals drowned in noise. Now we log at INFO for "things that happened" and DEBUG for "diagnostic detail" — INFO goes to storage; DEBUG can be enabled per-service per-pod when investigating.

Metric cardinality from user_id. Adding user_id as a metric label was tempting. It exploded cardinality (millions of unique values × hundreds of metrics). Now: user_id stays out of metric labels; it goes in trace attributes and log fields, where high cardinality is fine.
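The arithmetic is worth doing before adding a label. A rough sketch follows; the label names and counts are illustrative, not our real numbers, and `worst_case_series` and `observed_series` are helper names made up for this example.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def observed_series(samples):
    """Count the distinct label sets actually seen, given an iterable of label dicts."""
    return len({tuple(sorted(labels.items())) for labels in samples})


# Illustrative numbers only.
without_user = worst_case_series({"route": 50, "status": 5})
with_user = worst_case_series({"route": 50, "status": 5, "user_id": 2_000_000})
```

Here `without_user` is 250 series and `with_user` is 500,000,000 for a single metric. Real series counts are lower because not every combination occurs, which is what `observed_series` measures, but the worst case shows why one unbounded label dominates everything else.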

Forgetting to instrument batch jobs. Web requests had tracing; cron jobs and queue consumers didn't. Half our debugging puzzles were "what happened in the batch?" Now everything is instrumented, including batch boundaries.

Trusting auto-instrumentation entirely. OpenTelemetry auto-instrumentation is great but misses application-specific concerns. Manual spans around important operations (e.g., a feature-flag evaluation, a complex business calculation) add the context auto-instrumentation can't.

The discipline #

What we do continuously:

Every new service starts with OpenTelemetry instrumentation. Not optional; not "we'll add it later." It's in the service scaffolding.
Trace ID propagation is part of code review. When adding a new service-to-service call, verify trace context flows through.
Quarterly correlation audit. Pick 5 recent incidents. For each, walk through the debugging flow. Did metric → trace → log work end-to-end? Where did it break?
Cost review monthly. Observability cost is real and grows with traffic. Trim retention, drop unused labels, audit high-cardinality metrics.

What to read next #

Burn-rate alerting and SLO discipline — what you alert on with these signals
Distributed tracing — OpenTelemetry, what we ship — the tracing layer expanded
eBPF tools for everyday ops — bpftrace patterns — kernel-level visibility complementing app-level
Pipeline observability — CI failures and alerts — CI-side of the same discipline

The three pillars framing is useful as a vocabulary. As a strategy, it misses the point — the value is in correlation. Spend less time worrying about which pillar matters most and more on making the trace ID flow consistently across all three. Once that's working, debugging is faster, alerts are more actionable, and on-call gets meaningfully better.

About Kiril Urbonas
