We had Datadog for app metrics, Loki for logs, and zero useful insight into what our LLM service was actually doing. Here's the observability stack we built specifically for model serving.
When we put our first LLM-backed service in front of customers, our existing observability stack — Datadog for metrics, Loki for logs, Grafana dashboards built around traditional service signals — turned out to be inadequate. We could see the service was up, we could see request latency, we could see errors. We could not answer questions like:
Building the observability stack to answer those questions took about two months. This post is what we ended up with.
Three properties make LLM serving different from regular API serving:
Our existing stack handled (1) only crudely (we knew total spend monthly), couldn't address (2) at all, and was blind to (3).
Every LLM call our service makes emits a structured metric event:
{
  "timestamp": "2026-04-26T15:32:11Z",
  "request_id": "req_abc123",
  "service": "support-assistant",
  "route": "classify_intent",
  "user_id_hash": "h_deadbeef",  # hashed for privacy
  "model": "gpt-4o-mini-2024-07-18",
  "input_tokens": 1240,
  "output_tokens": 87,
  "latency_total_ms": 1180,
  "latency_to_first_token_ms": 320,
  "cached": false,
  "cache_similarity_score": null,
  "downstream_status": "200",
  "schema_validated": true,
  "schema_validation_errors": null
}
These go to a high-cardinality metrics store — we use ClickHouse for this specifically. Datadog can technically handle some of this but the per-request granularity at our volume becomes prohibitively expensive there.
The dashboards built on this answer the cost-per-request and per-customer questions. ClickHouse queries take 1-2 seconds for "tokens per request, by route, last 24h" across our volume.
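As a rough illustration of how a service might assemble one of these rows, here is a minimal Python sketch. It covers a subset of the fields shown above. The helper names `hash_user_id` and `build_llm_event`, the salted SHA-256 hashing scheme, and the `example-salt` value are all assumptions for illustration; the post does not say how the hash is produced or how events are shipped.

```python
import hashlib
import json
from datetime import datetime, timezone


def hash_user_id(user_id: str, salt: str) -> str:
    """Salted SHA-256, truncated to 8 hex chars (an assumed scheme)."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return "h_" + digest[:8]


def build_llm_event(request_id, route, user_id, model, input_tokens,
                    output_tokens, latency_total_ms, latency_to_first_token_ms,
                    *, salt, service="support-assistant", cached=False,
                    schema_validated=None, now=None):
    """Assemble one per-request row using field names from the event above."""
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "request_id": request_id,
        "service": service,
        "route": route,
        "user_id_hash": hash_user_id(user_id, salt),  # raw ID never leaves here
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_total_ms": latency_total_ms,
        "latency_to_first_token_ms": latency_to_first_token_ms,
        "cached": cached,
        "schema_validated": schema_validated,
    }


event = build_llm_event(
    "req_abc123", "classify_intent", "user-42", "gpt-4o-mini-2024-07-18",
    1240, 87, 1180, 320, salt="example-salt", schema_validated=True,
    now=datetime(2026, 4, 26, 15, 32, 11, tzinfo=timezone.utc),
)
print(json.dumps(event))  # one JSON line per LLM call
```

Keeping the event a flat dictionary of scalars means it maps directly onto one table row, which is what makes slicing by route, customer, or model cheap later.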
We log the actual prompt sent to the model and the actual response received, for ~1% of production traffic. The capture is tied to the request_id from Layer 1.
{
  "request_id": "req_abc123",
  "prompt": "...",  # full prompt
  "response": "...",  # full response
  "captured_at": "2026-04-26T15:32:11Z",
  "captured_reason": "random_sample"  # or "explicit_flag", "anomaly_detected"
}
This is sensitive customer data. The capture goes to a separately encrypted S3 bucket with strict IAM: no engineer has direct access, and access requires a Jira ticket stating the reason.
The 1% sample rate is enough to:
When something is genuinely going wrong (e.g., a downstream pipeline starts seeing malformed JSON), we can flip the sample rate to 100% temporarily for the affected route.
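One way to implement a sample rate that can be raised per route is to hash the request_id into a stable bucket and compare it against the route's rate. The sketch below assumes that approach; `should_capture`, `route_overrides`, and `DEFAULT_SAMPLE_RATE` are hypothetical names, and the post does not describe how the actual sampling decision is made.

```python
import hashlib

DEFAULT_SAMPLE_RATE = 0.01  # the ~1% baseline described above

# Hypothetical per-route overrides, e.g. set to 1.0 during an incident.
route_overrides: dict[str, float] = {}


def should_capture(request_id: str, route: str) -> bool:
    """Decide deterministically whether to capture this request's prompt/response."""
    rate = route_overrides.get(route, DEFAULT_SAMPLE_RATE)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Temporarily capture everything on one affected route.
route_overrides["classify_intent"] = 1.0
print(should_capture("req_abc123", "classify_intent"))
```

Hashing the request_id, rather than calling a random number generator, means every component that sees the same request makes the same decision, so a captured prompt always has its matching Layer 1 row.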
Cost-per-request × request volume = service cost. But "service cost" alone isn't actionable. We attribute cost three ways:
The attribution is computed nightly from the per-request metrics:
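-- Per-1K-token prices are hard-coded (USD 0.00015 input, 0.0006 output)
-- and applied to every row regardless of the model column.
SELECT
    route,
    SUM(input_tokens * 0.00015 / 1000) AS input_cost_usd,
    SUM(output_tokens * 0.0006 / 1000) AS output_cost_usd,
    COUNT(*) AS request_count
FROM llm_requests
WHERE date = current_date - 1
GROUP BY route
ORDER BY input_cost_usd + output_cost_usd DESC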
The numbers feed into our cost dashboard. Anomalies trigger a Slack alert: "the classify_intent route's daily cost jumped 30%, normally $X, today $Y, possibly an upstream caller spike."
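The alerting step can be sketched in a few lines of Python. This is an illustration under assumptions: the per-token prices are copied from the query above, the 30% threshold comes from the example alert, and the `baseline` mapping (what "normally $X" means, for instance a trailing average) is not specified in the post. The function names are hypothetical.

```python
from collections import defaultdict

# Same hard-coded prices as the nightly query (USD per token).
INPUT_USD_PER_TOKEN = 0.00015 / 1000
OUTPUT_USD_PER_TOKEN = 0.0006 / 1000


def daily_cost_by_route(rows):
    """Sum input and output token cost per route from per-request rows."""
    costs = defaultdict(float)
    for row in rows:
        costs[row["route"]] += (
            row["input_tokens"] * INPUT_USD_PER_TOKEN
            + row["output_tokens"] * OUTPUT_USD_PER_TOKEN
        )
    return dict(costs)


def cost_alerts(today, baseline, threshold=0.30):
    """Return alert messages for routes whose cost rose by at least `threshold`."""
    alerts = []
    for route, cost in today.items():
        normal = baseline.get(route)
        if not normal:  # no history yet, nothing to compare against
            continue
        change = (cost - normal) / normal
        if change >= threshold:
            alerts.append(
                f"the {route} route's daily cost jumped {change:.0%}, "
                f"normally ${normal:.2f}, today ${cost:.2f}"
            )
    return alerts


print(cost_alerts({"classify_intent": 14.0}, {"classify_intent": 10.0}))
```

A relative threshold keeps small and large routes on the same footing; a route with a tiny baseline may still need a minimum absolute cost before alerting to avoid noise.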
We've been refining these the longest. Three signals that have proven useful:
Schema validation rate. For routes that expect structured output, we record whether the response parsed correctly against the expected schema. A regression here is a strong signal.
Cache hit rate. For routes that use semantic caching, the hit rate should be stable. A sudden drop suggests the input distribution changed (legitimate traffic shift) or that the cache is misbehaving.
Length distribution. The token count distribution per route should be stable. A sudden shift toward longer responses (or shorter) suggests the model's behaviour changed — sometimes from a provider update, sometimes from a prompt change.
These aren't quality measures themselves — they're early warning signals that quality might have shifted. For actual quality measurement, we run our offline eval suite.
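All three signals can be derived from the per-request rows. The sketch below assumes events shaped like the Layer 1 example (`route`, `schema_validated`, `cached`, `output_tokens`) and uses the median output token count as a simple stand-in for the length distribution; `route_signals` is a hypothetical helper, not the dashboards' actual query.

```python
from collections import defaultdict
from statistics import median


def route_signals(events):
    """Compute schema-validation rate, cache hit rate, and median output length per route."""
    by_route = defaultdict(list)
    for event in events:
        by_route[event["route"]].append(event)

    signals = {}
    for route, evs in by_route.items():
        # Only routes that expect structured output report a validation result.
        validated = [e["schema_validated"] for e in evs
                     if e["schema_validated"] is not None]
        signals[route] = {
            "schema_valid_rate": sum(validated) / len(validated) if validated else None,
            "cache_hit_rate": sum(e["cached"] for e in evs) / len(evs),
            "median_output_tokens": median(e["output_tokens"] for e in evs),
        }
    return signals


sample = [
    {"route": "classify_intent", "schema_validated": True, "cached": False, "output_tokens": 80},
    {"route": "classify_intent", "schema_validated": False, "cached": True, "output_tokens": 90},
    {"route": "summarize", "schema_validated": None, "cached": False, "output_tokens": 400},
]
print(route_signals(sample))
```

Comparing each value against the same route's recent history, rather than against a fixed target, is what turns these numbers into early warnings.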
Our offline eval is the ground truth. It runs:
The cron's job is to catch quality regressions that don't come from our code. If the eval score drifts down without us shipping anything, the model's behaviour has shifted on the provider side. We've caught two such incidents — one a quiet model upgrade, one a degraded snapshot — that would otherwise have shown up only as customer complaints.
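The core check, a score drop with no code change on our side, can be expressed as a small comparison between consecutive eval runs. This is a sketch under assumptions: the run record shape (`score`, `code_version`), the `tolerance` value, and the function name are hypothetical, and the real eval suite may compare against a longer baseline.

```python
def provider_side_regression(runs, tolerance=0.02):
    """Flag a drop between the last two eval runs when our code version did not change.

    `runs` is a list of dicts ordered oldest to newest, each with
    'score' (higher is better) and 'code_version'.
    """
    if len(runs) < 2:
        return False
    previous, latest = runs[-2], runs[-1]
    same_code = latest["code_version"] == previous["code_version"]
    dropped = previous["score"] - latest["score"] > tolerance
    return same_code and dropped


history = [
    {"score": 0.91, "code_version": "abc123"},
    {"score": 0.84, "code_version": "abc123"},  # nothing shipped, score fell
]
print(provider_side_regression(history))
```

When the code version did change, the same drop is still worth investigating, but the first suspect is the deploy rather than the provider.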
A small ML model (yes, we use a model to monitor a model) watches the per-request metrics for anomalies:
When it detects something significant, it posts a structured Slack alert with a link to a pre-built investigation dashboard. The alert is intentionally low-noise — we've tuned the model to fire ~once a week, with most firings being legitimately interesting.
The model itself is simple — a moving-window comparison with sigma-based thresholds. Nothing exotic. The value is having someone (or something) watching all the time.
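A moving-window comparison with sigma-based thresholds might look like the following. The window size, the three-sigma threshold, and the minimum number of points are illustrative assumptions, as is the `SigmaDetector` class itself; the post does not give the real parameters or which metrics are fed in.

```python
from collections import deque
from statistics import mean, stdev


class SigmaDetector:
    """Flag values that sit more than n_sigma standard deviations from a moving window."""

    def __init__(self, window=24, n_sigma=3.0, min_points=8):
        self.values = deque(maxlen=window)  # oldest points fall off automatically
        self.n_sigma = n_sigma
        self.min_points = min_points

    def observe(self, value: float) -> bool:
        """Record a new data point; return True if it looks anomalous."""
        anomalous = False
        history = list(self.values)
        if len(history) >= self.min_points:
            mu = mean(history)
            sd = stdev(history)
            # A perfectly flat window (sd == 0) is skipped rather than flagged.
            if sd > 0 and abs(value - mu) > self.n_sigma * sd:
                anomalous = True
        self.values.append(value)
        return anomalous


detector = SigmaDetector()
hourly_avg_output_tokens = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 150]
flags = [detector.observe(v) for v in hourly_avg_output_tokens]
print(flags)
```

Running one detector per route and per metric keeps each baseline local, so a noisy route does not mask a shift in a quiet one. Note that this version adds anomalous points to the window too, which widens the threshold after a spike; excluding them is a reasonable alternative.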
In our first month, we tried logging every request to Datadog as a custom event. It worked, but the bill was eye-watering at our volume. We migrated the high-cardinality data to ClickHouse and kept aggregated rollups in Datadog for dashboarding.
We also tried capturing 100% of prompts/responses. Storage cost was high, and access control required constant security review. Going to 1% sampling addressed both, and we rarely miss the lost fidelity.
When something is going wrong, the order of inspection is roughly:
The investigation takes ~10 minutes if the dashboards are doing their job. Without the dashboards, the same investigation took us hours of grep against logs and manual aggregation.
Start with per-request metrics in a high-cardinality store. ClickHouse is great for this, but Snowflake / BigQuery / wherever your analytics team already runs is fine. The point is: every LLM call should be a row, and you should be able to slice it by route, customer, model, and any other dimension.
Add sampled prompt capture next. 1% is a good starting point. Storage cost is small; investigation value is huge.
Cost attribution is third — once you have per-request metrics, attribution is just SQL.
Don't bother with real-time quality scoring at first. The cost is high; the offline eval is more reliable. Real-time quality is a nice-to-have once everything else is mature.
The temptation is to build everything at once. Each layer can take weeks; do them sequentially, getting each operational and integrated into team workflows before adding the next. Otherwise you ship five layers, none of them really wired into how the team actually responds to issues.