We had Datadog for app metrics, Loki for logs, and zero useful insight into what our LLM service was actually doing. Here's the observability stack we built specifically for model serving.

Model Serving Observability Stack: Deep Dive

When we put our first LLM-backed service in front of customers, our existing observability stack — Datadog for metrics, Loki for logs, Grafana dashboards built around traditional service signals — turned out to be inadequate. We could see the service was up, we could see request latency, we could see errors. We could not answer questions like:

Why are we suddenly burning 40% more tokens per request than yesterday?
Which customer is generating outsized cost?
Did our last prompt change degrade output quality?
When the model returned malformed JSON, what was the input that triggered it?

Building an observability stack that could answer those questions took about two months. This post describes what we ended up with.

What's different about model serving

Three properties make LLM serving different from regular API serving:

1. Cost is variable per request. Two requests with the same shape can differ by 100x in cost depending on input size and output length.
2. Quality is hard to measure in real time. A returned response is unambiguously "successful" from an HTTP perspective even when the content is wrong.
3. The model itself can change behaviour without us deploying anything. Provider-side updates, even within the "same" model snapshot, can shift behaviour.

Our existing stack handled (1) only crudely (we knew total spend monthly), couldn't address (2) at all, and was blind to (3).

Layer 1: Custom request-level metrics

Every LLM call our service makes emits a structured metric event:

python.python

{
  "timestamp": "2026-04-26T15:32:11Z",
  "request_id": "req_abc123",
  "service": "support-assistant",
  "route": "classify_intent",
  "user_id_hash": "h_deadbeef",  # hashed for privacy
  "model": "gpt-4o-mini-2024-07-18",
  "input_tokens": 1240,
  "output_tokens": 87,
  "latency_total_ms": 1180,
  "latency_to_first_token_ms": 320,
  "cached": False,
  "cache_similarity_score": None,  # only set on semantic-cache hits
  "downstream_status": "200",
  "schema_validated": True,
  "schema_validation_errors": None
}

These go to a high-cardinality metrics store — we use ClickHouse for this specifically. Datadog can technically handle some of this but the per-request granularity at our volume becomes prohibitively expensive there.

The dashboards built on this answer the cost-per-request and per-customer questions. ClickHouse queries take 1-2 seconds for "tokens per request, by route, last 24h" across our volume.
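A minimal sketch of how such an event could be assembled and serialized, one row per LLM call. The field names follow the event above; the `LLMCallEvent` class, the `hash_user_id` helper, and the keyed-hash scheme are illustrative assumptions, not taken from the production service.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


def hash_user_id(user_id: str, secret: bytes) -> str:
    """Keyed hash so raw user IDs never reach the metrics store (hypothetical helper)."""
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "h_" + digest[:16]


@dataclass
class LLMCallEvent:
    """One row per LLM call; fields mirror the event shown above."""
    request_id: str
    service: str
    route: str
    user_id_hash: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_total_ms: int
    latency_to_first_token_ms: int
    cached: bool = False
    cache_similarity_score: Optional[float] = None
    downstream_status: str = "200"
    schema_validated: Optional[bool] = None
    schema_validation_errors: Optional[List[str]] = None
    timestamp: Optional[str] = None

    def to_json(self) -> str:
        row = asdict(self)
        if row["timestamp"] is None:
            # Stamp at emit time, in UTC, in the same format as the example event.
            row["timestamp"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        return json.dumps(row)
```

Each serialized row can then be shipped over whatever ingest path feeds the store; the transport is deliberately left out of this sketch.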

Layer 2: Sampled prompt and output capture

We log the actual prompt sent to the model and the actual response received, for ~1% of production traffic. The capture is tied to the request_id from Layer 1.

python.python

{
  "request_id": "req_abc123",
  "prompt": "...",                # full prompt
  "response": "...",              # full response
  "captured_at": "2026-04-26T15:32:11Z",
  "captured_reason": "random_sample"  # or "explicit_flag", "anomaly_detected"
}

This data is sensitive: it contains customer content. Captures go to a separately encrypted S3 bucket with strict IAM policies (no engineer has direct access; access requires a Jira ticket stating the reason).

The 1% sample rate is enough to:

Spot regressions when prompt changes ship (5-10 examples in the first hour are enough)
Build new eval cases from real production interactions
Investigate post-incident "what was the model actually saying"

When something is genuinely going wrong (e.g., a downstream pipeline starts seeing malformed JSON), we can flip the sample rate to 100% temporarily for the affected route.
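One way the sampling decision could work is sketched below, assuming the decision is derived from a hash of the `request_id` so that it is stable across retries and processes. The `route_overrides` dict, `sample_bucket`, and `capture_reason` are hypothetical names; only the reason strings and the ~1% default come from the text above.

```python
import hashlib
from typing import Dict, Optional

DEFAULT_RATE = 0.01  # the ~1% baseline sample rate

# Hypothetical per-route overrides, e.g. set to 1.0 during an incident.
route_overrides: Dict[str, float] = {}


def sample_bucket(request_id: str) -> float:
    """Map a request_id to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # 4 bytes / 2**32 is exactly representable as a float and always < 1.0.
    return int.from_bytes(digest[:4], "big") / 2**32


def capture_reason(request_id: str, route: str,
                   explicit_flag: bool = False,
                   anomaly_detected: bool = False) -> Optional[str]:
    """Return the captured_reason to record, or None if the pair is not captured."""
    if explicit_flag:
        return "explicit_flag"
    if anomaly_detected:
        return "anomaly_detected"
    rate = route_overrides.get(route, DEFAULT_RATE)
    if sample_bucket(request_id) < rate:
        return "random_sample"
    return None
```

Setting `route_overrides["classify_intent"] = 1.0` captures every request on that route until the override is removed. Hashing the ID rather than calling a random number generator means the same request always gets the same decision, which keeps captures joinable with the Layer 1 rows.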

Layer 3: Cost attribution

Cost-per-request × request volume = service cost. But "service cost" alone isn't actionable. We attribute cost three ways:

By route (which feature is expensive)
By customer cohort (which users cost more)
By upstream caller (when one of our internal services calls the LLM service, which one?)

The attribution is computed nightly from the per-request metrics:

sql.sql

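-- Per-1K-token rates are hard-coded for a single model; they correspond to
-- $0.15 (input) and $0.60 (output) per 1M tokens. Routes that mix models
-- would need a per-model price lookup instead.
SELECT
  route,
  SUM(input_tokens * 0.00015 / 1000) AS input_cost_usd,
  SUM(output_tokens * 0.0006 / 1000) AS output_cost_usd,
  COUNT(*) AS request_count
FROM llm_requests
WHERE date = current_date - 1  -- yesterday's rows only
GROUP BY route
ORDER BY input_cost_usd + output_cost_usd DESC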

The numbers feed into our cost dashboard. Anomalies trigger a Slack alert: "the classify_intent route's daily cost jumped 30%, normally $X, today $Y, possibly an upstream caller spike."
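The alert check itself can be small. The sketch below compares a route's cost for the day against the mean of its trailing daily costs and formats a message in the style of the alert above. The function name, the choice of a plain mean as the baseline, and the 30% threshold are assumptions for illustration.

```python
from statistics import mean
from typing import Optional, Sequence

JUMP_THRESHOLD = 0.30  # assumption: alert when daily cost is 30% or more above baseline


def cost_alert(route: str, trailing_daily_costs: Sequence[float], today_cost: float,
               threshold: float = JUMP_THRESHOLD) -> Optional[str]:
    """Return an alert message if today's cost jumped past the threshold, else None."""
    if not trailing_daily_costs:
        return None  # no baseline yet, e.g. a brand-new route
    baseline = mean(trailing_daily_costs)
    if baseline <= 0:
        return None  # avoid dividing by zero for routes with no prior spend
    change = (today_cost - baseline) / baseline
    if change < threshold:
        return None
    return (f"the {route} route's daily cost jumped {change:.0%}, "
            f"normally ${baseline:.2f}, today ${today_cost:.2f}")
```

A median or a same-weekday baseline would be more robust to weekly traffic patterns; the mean is used here only to keep the sketch short.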

Layer 4: Quality signals

We've been refining these the longest. Three signals that have proven useful:

Schema validation rate. For routes that expect structured output, we record whether the response parsed correctly against the expected schema. A regression here is a strong signal.

Cache hit rate. For routes that use semantic caching, the hit rate should be stable. A sudden drop suggests the input distribution changed (legitimate traffic shift) or that the cache is misbehaving.

Length distribution. The token count distribution per route should be stable. A sudden shift toward longer responses (or shorter) suggests the model's behaviour changed — sometimes from a provider update, sometimes from a prompt change.

These aren't quality measures themselves — they're early warning signals that quality might have shifted. For actual quality measurement, we run our offline eval suite.
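As a rough illustration, all three signals can be derived from the Layer 1 rows for a single route and time window. The sketch assumes rows are dicts with the Layer 1 field names and that `schema_validated` is `None` for routes without structured output; `route_signals` and `percentile` are hypothetical helpers, and the nearest-rank percentile is one choice among several.

```python
from typing import Dict, List, Optional


def percentile(values: List[int], pct: int) -> int:
    """Nearest-rank percentile using integer arithmetic (pct is a whole number 1-100)."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]


def route_signals(rows: List[dict]) -> Dict[str, Optional[float]]:
    """Summarize one route's rows for one time window into the three signals."""
    if not rows:
        return {"schema_validation_rate": None, "cache_hit_rate": None,
                "output_tokens_p50": None, "output_tokens_p95": None}
    # Only rows that actually went through schema validation count toward the rate.
    structured = [r for r in rows if r.get("schema_validated") is not None]
    lengths = [r["output_tokens"] for r in rows]
    return {
        "schema_validation_rate": (
            sum(bool(r["schema_validated"]) for r in structured) / len(structured)
            if structured else None
        ),
        "cache_hit_rate": sum(bool(r.get("cached")) for r in rows) / len(rows),
        "output_tokens_p50": percentile(lengths, 50),
        "output_tokens_p95": percentile(lengths, 95),
    }
```

Comparing these summaries window over window (for example, this hour against the same hour yesterday) is what turns them into early warnings.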

Layer 5: Synthetic eval, on a schedule

Our offline eval is the ground truth. It runs:

On every PR that touches prompts, model selection, or retrieval
Twice a week as a scheduled cron, against the same fixed set of 200 questions

The cron's job is to catch quality regressions that don't come from our code. If the eval score drifts down without us shipping anything, the model's behaviour has shifted on the provider side. We've caught two such incidents — one a quiet model upgrade, one a degraded snapshot — that would otherwise have shown up only as customer complaints.
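The reasoning behind the cron can be written down as a small decision rule: a score drop with no deploy since the last run points at the provider, while a drop after a deploy points at our own change. The sketch below assumes eval scores are fractions between 0 and 1 and uses a hypothetical tolerance of 0.03; the function name and the returned labels are illustrative.

```python
from statistics import mean
from typing import Sequence


def classify_eval_run(score: float, recent_scores: Sequence[float],
                      deployed_since_last_run: bool, tolerance: float = 0.03) -> str:
    """Label a scheduled eval run by comparing it to the mean of recent runs."""
    if not recent_scores:
        return "no_baseline"
    baseline = mean(recent_scores)
    if baseline - score <= tolerance:
        return "ok"  # within normal run-to-run noise (or improved)
    if deployed_since_last_run:
        return "regression_after_deploy"
    # Score dropped and nothing shipped: the model's behaviour likely shifted upstream.
    return "provider_drift_suspected"
```

The tolerance should be set from the observed run-to-run variance on the fixed question set; with 200 questions, a few points of noise between identical runs is plausible.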

Layer 6: Real-time anomaly detection

A small ML model (yes, we use a model to monitor a model) watches the per-request metrics for anomalies:

Tokens-per-request distribution shift
Latency p95 shift
Schema validation rate drop
Per-customer cost outliers

When it detects something significant, it posts a structured Slack alert with a link to a pre-built investigation dashboard. The alert is intentionally low-noise — we've tuned the model to fire ~once a week, with most firings being legitimately interesting.

The model itself is simple — a moving-window comparison with sigma-based thresholds. Nothing exotic. The value is having someone (or something) watching all the time.
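A minimal sketch of a moving-window comparison with a sigma-based threshold, applied to one metric stream (for example, p95 latency per five-minute bucket). The class name, window size, warm-up length, and the 3-sigma default are assumptions; the production detector may differ in all of them.

```python
from collections import deque
from statistics import mean, stdev


class SigmaDetector:
    """Flags a value that sits more than `threshold_sigma` standard deviations
    away from the mean of the preceding window."""

    def __init__(self, window: int = 288, threshold_sigma: float = 3.0,
                 min_points: int = 30) -> None:
        self.values = deque(maxlen=window)  # oldest points fall off automatically
        self.threshold_sigma = threshold_sigma
        self.min_points = min_points

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        if len(self.values) >= self.min_points:
            mu = mean(self.values)
            sd = stdev(self.values)
            # With zero variance there is no meaningful sigma, so nothing is flagged.
            if sd > 0 and abs(value - mu) > self.threshold_sigma * sd:
                anomalous = True
        # Always record the value, so a sustained shift becomes the new baseline
        # instead of alerting forever.
        self.values.append(value)
        return anomalous
```

Recording anomalous points in the window is a trade-off: a lasting level shift alerts a few times and then goes quiet, at the cost of a single large spike briefly widening the threshold. One detector instance per metric and route keeps the baselines independent.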

What we don't have

Real-time quality scoring per request. The judge LLM approach (a smaller model evaluating the bigger model's output) is too slow and expensive at production volume. We only do it on the eval cron.
Token-level streaming metrics. We instrument at the request boundary, not per-token. Not enough payoff for the engineering cost.
Tokenizer-level cost prediction. We use the provider's reported token count after the fact. Pre-flight estimation would be nice but it's not load-bearing.

What we cut from earlier attempts

In our first month, we tried logging every request to Datadog as a custom event. It worked, but the bill was eye-watering at our volume. We migrated the high-cardinality data to ClickHouse and kept aggregated rollups in Datadog for dashboarding.

We also tried capturing 100% of prompts and responses. Storage cost was high, and access control turned into a recurring security review. Dropping to 1% sampling addressed both, and we rarely miss the lost fidelity.

How to read these dashboards during an incident

When something is going wrong, the order of inspection is roughly:

1. Cost-per-request dashboard: has cost shifted suddenly? If yes, what route or customer?
2. Schema validation rate dashboard: are we returning malformed output? If yes, recent prompt changes?
3. Latency distribution: is the model actually slower, or are we waiting on something else?
4. Sampled prompts/responses: pull the latest 50 captured pairs for the affected route. Look for the smoke.

The investigation takes ~10 minutes if the dashboards are doing their job. Without the dashboards, the same investigation took us hours of grep against logs and manual aggregation.

What I'd tell a team building this

Start with per-request metrics in a high-cardinality store. ClickHouse is great for this, but Snowflake / BigQuery / wherever your analytics team already runs is fine. The point is: every LLM call should be a row, and you should be able to slice it by route, customer, model, and any other dimension.

Add sampled prompt capture next. 1% is a good starting point. Storage cost is small; investigation value is huge.

Cost attribution is third — once you have per-request metrics, attribution is just SQL.

Don't bother with real-time quality scoring at first. The cost is high; the offline eval is more reliable. Real-time quality is a nice-to-have once everything else is mature.

The temptation is to build everything at once. Each layer can take weeks; do them sequentially, getting each operational and integrated into team workflows before adding the next. Otherwise you ship five layers, none of them really wired into how the team actually responds to issues.


About Kiril Urbonas