Standard APM doesn't tell you when your LLM-powered features are silently degrading. Here are the signals we track and the dashboards that catch the regressions standard tools miss.
Standard observability — request rate, error rate, latency — is necessary but not sufficient for LLM-powered features. The model might be returning 200 OK with a response, but the response could be wrong, hallucinated, or off-policy. None of those are visible in your APM. This is the observability stack we built for our LLM features over the past year.
Beyond standard metrics, four LLM-specific signals matter:
Each needs its own collection and analysis path.
The foundation is per-call tracing. Every LLM call gets a trace span with attributes:
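    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    async def traced_llm_call(user_id: str):
        with tracer.start_as_current_span("llm.call") as span:
            # Request-side attributes (token and cost values here are illustrative).
            span.set_attributes({
                "llm.provider": "openai",
                "llm.model": "gpt-4o-mini",
                "llm.input_tokens": 1240,
                "llm.output_tokens": 380,
                "llm.cost_usd": 0.000312,
                "llm.temperature": 0.0,
                "llm.feature": "support_assistant",
                "llm.user_id": user_id,
                "llm.task_type": "answer_question",
            })
            response = await openai_call(...)
            # Response-side attributes, known only after the call returns.
            span.set_attributes({
                "llm.finish_reason": response.choices[0].finish_reason,
                # content can be None (e.g. tool calls), so fall back to "".
                "llm.response_length": len(response.choices[0].message.content or ""),
            })
            return response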
This is the raw data. From it, we derive everything else.
We use OpenTelemetry. Spans go to Datadog APM. Custom metrics derived from spans go to Prometheus.
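As a rough illustration of deriving metrics from the raw spans, the sketch below rolls span attributes up into per-day, per-task-type, per-model totals. It assumes the spans have already been exported as plain dictionaries keyed by the attribute names above; the `day` field and the `aggregate_spans` name are hypothetical, not part of our pipeline.

```python
from collections import defaultdict


def aggregate_spans(spans):
    """Roll span attribute dicts up into per-(day, task_type, model) totals.

    Each span is assumed to be a dict carrying the llm.* attributes plus a
    hypothetical "day" field derived from the span's start time.
    """
    totals = defaultdict(
        lambda: {"calls": 0, "cost_usd": 0.0, "input_tokens": 0, "output_tokens": 0}
    )
    for s in spans:
        key = (s["day"], s["llm.task_type"], s["llm.model"])
        t = totals[key]
        t["calls"] += 1
        t["cost_usd"] += s["llm.cost_usd"]
        t["input_tokens"] += s["llm.input_tokens"]
        t["output_tokens"] += s["llm.output_tokens"]
    return dict(totals)
```

Grouping by task type and model at this stage is what lets the dashboards slice cost and token usage without re-reading raw traces.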
The 4 dashboards we check daily:
Cost dashboard: cost per task type per day, broken down by model. Spikes in cost without traffic spikes = something changed (prompt got longer, model upgraded silently, retries spiking).
Quality dashboard: per task type, % of responses that pass automated quality checks (more on this below). A drop here is the leading indicator of a regression.
Latency + tokens dashboard: p50/p95/p99 of LLM call latency and tokens consumed. Useful for catching prompt bloat (slow growth in input tokens means context is creeping up).
User signals dashboard: thumbs-up/down rates, abandonment rates, support-ticket-rate following AI-handled interactions. Lagging indicator but the most important.
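The "cost spike without a traffic spike" heuristic from the cost dashboard can be sketched as a simple day-over-day comparison. The function name and both thresholds below are illustrative assumptions, not the values we alert on.

```python
def cost_spike_without_traffic(prev, curr, cost_jump=0.30, traffic_jump=0.10):
    """Return True when cost rose sharply but call volume did not.

    prev and curr are dicts with "calls" and "cost_usd" for the same task
    type on two consecutive days. The default thresholds are placeholders.
    """
    if prev["calls"] == 0 or prev["cost_usd"] == 0:
        return False  # no baseline to compare against
    cost_change = curr["cost_usd"] / prev["cost_usd"] - 1
    traffic_change = curr["calls"] / prev["calls"] - 1
    return cost_change > cost_jump and traffic_change < traffic_jump
```

For example, a day where cost rises 50% on 2% more calls would be flagged, while the same cost increase on 50% more calls would not.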
The hardest part. How do you automatically tell if a response is good?
We use three layers of checks, in order of cost:
Layer 1: Programmatic checks (cheap, run on every response).
These are fast and catch ~30-40% of bad responses.
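A minimal sketch of what a Layer 1 check function might look like. The specific checks (empty response, length range, finish reason, JSON validity) and every threshold are assumptions chosen for illustration; the right set depends on the task.

```python
import json


def programmatic_checks(text, finish_reason, min_len=20, max_len=2000, expect_json=False):
    """Return the names of failed checks; an empty list means the response passed."""
    failures = []
    if not text or not text.strip():
        failures.append("empty")
    elif not (min_len <= len(text) <= max_len):
        failures.append("length_out_of_range")
    # Anything other than a normal stop (e.g. truncation at the token limit) is suspicious.
    if finish_reason != "stop":
        failures.append("unexpected_finish_reason")
    if expect_json:
        try:
            json.loads(text)
        except (TypeError, ValueError):
            failures.append("invalid_json")
    return failures
```

Returning the list of failed check names, rather than a single boolean, makes it possible to chart each failure type separately per task type.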
Layer 2: Embedding-based similarity (cheap, runs on sample).
We run these on a sample (~10% of traffic) because embedding adds latency. Useful for trend detection.
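One way to express an embedding-based check is the cosine similarity between the response embedding and the embeddings of the retrieved context chunks. The sketch below assumes the vectors were already computed by whatever embedding model you use; taking the maximum over chunks is one reasonable aggregation, not necessarily the one we run.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def grounding_similarity(response_vec, context_vecs):
    """Highest similarity between the response and any retrieved chunk."""
    if not context_vecs:
        return 0.0
    return max(cosine_similarity(response_vec, c) for c in context_vecs)
```

Individual scores are noisy, so the daily average per task type is usually more informative than any single value.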
Layer 3: Judge-LLM evaluation (expensive, runs on smaller sample).
A larger model (gpt-4o) scores the response on the dimensions we care about: accuracy, grounding, format adherence, and helpfulness. We run this on ~1% of traffic and on every regression test run.
The judge prompt:
Given the user's question and the assistant's response, score the response on:
- Accuracy (1-5): does it correctly answer the question?
- Grounding (1-5): are claims supported by the provided context?
- Format (1-5): does it follow the requested format?
- Helpfulness (1-5): is it useful to the user?
Respond with JSON: {"accuracy": N, "grounding": N, "format": N, "helpfulness": N, "issues": ["..."]}.
The judge LLM has its own biases, but its trends are reliable. The average accuracy score per task per day is one of our most important metrics.
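Judge output is model-generated text, so it is worth validating before it feeds a metric. The sketch below parses the JSON shape requested by the prompt above and discards malformed or out-of-range replies; the function names are hypothetical.

```python
import json

DIMENSIONS = ("accuracy", "grounding", "format", "helpfulness")


def parse_judge_output(raw):
    """Parse the judge's JSON reply; return None if it is malformed or out of range."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[dim] = value
    issues = data.get("issues", [])
    scores["issues"] = issues if isinstance(issues, list) else []
    return scores


def mean_accuracy(parsed):
    """Average accuracy over the judge replies that parsed cleanly."""
    valid = [p["accuracy"] for p in parsed if p is not None]
    return sum(valid) / len(valid) if valid else None
```

Tracking the share of replies that fail to parse is also useful: a jump there usually points at the judge rather than at the feature being judged.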
When a regression hits, the signals show up in this order:
The Layer 1 → user signal lag is what makes proactive observability valuable. Catching the regression before users feel it is the goal.
Specific regressions we've caught:
A prompt change that started returning longer responses. Layer 1 caught it (response length above expected range). Investigated → reverted → back to normal.
A retrieval change that started missing relevant chunks. Embedding similarity to retrieved context dropped. Investigated → bug in chunking pipeline → fixed.
An OpenAI model snapshot change. When OpenAI silently rolled out a new version of gpt-4o, our judge scores dropped 8% overnight. We pinned to a dated snapshot the next day.
For tasks with measurable business outcomes (e.g., "successful customer support resolution"), we track cost per outcome:
cost_per_resolution = sum(llm_cost) / count(resolutions)
This is the most useful single metric for product/finance conversations. "Each resolved ticket costs $0.04 in LLM calls" is concrete.
Outliers in this metric show up as cost spikes per outcome — usually a bug (loop, expensive prompt, retry storm).
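The formula above translates directly into code. This sketch assumes you already have the per-call LLM costs and the resolution count for the same period; the only addition is a guard for periods with no resolutions.

```python
def cost_per_resolution(llm_costs_usd, resolutions):
    """Total LLM spend divided by resolved outcomes; None when nothing was resolved."""
    if resolutions == 0:
        return None
    return sum(llm_costs_usd) / resolutions
```

For example, calls costing $0.01, $0.03, and $0.04 that together produced two resolutions work out to $0.04 per resolution.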
Three sources:
Explicit feedback: thumbs up/down on responses. It is a direct signal, but only ~5% of users react, and the sample skews negative: people click thumbs-down more readily than thumbs-up.
Implicit feedback: "did the user accept the response or escalate?" In the support assistant, did the conversation end (success) or get escalated to a human (signal of failure)?
Operational feedback: complaints, support tickets, social-media mentions. Lagging but highest signal.
We weight implicit feedback most heavily because it's the highest-volume and least biased.
Some customers' usage diverges from the average — they might use the feature in unusual ways, hit specific edge cases, or be the canary for issues.
We track quality and cost per customer for our top-50 customers. When one of them has worse-than-average quality, an account team is notified to check in.
This caught a case where one customer's data had specific characters that broke our chunking pipeline — quality was awful for that customer specifically, fine for everyone else. Without per-customer tracking, we wouldn't have found it.
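A per-customer comparison can be as simple as flagging customers whose quality pass rate sits well below the mean across tracked customers. The function name and the `margin` parameter are assumptions for illustration; "worse than average" could equally be defined against a median or a fixed floor.

```python
def customers_below_average(pass_rates, margin=0.10):
    """Return customer IDs whose pass rate is more than `margin` below the mean.

    pass_rates maps customer ID to the fraction (0 to 1) of that customer's
    responses that passed automated quality checks.
    """
    if not pass_rates:
        return []
    mean = sum(pass_rates.values()) / len(pass_rates)
    return sorted(c for c, rate in pass_rates.items() if rate < mean - margin)
```

With pass rates of 0.95, 0.94, and 0.50, only the third customer is flagged, which is the kind of isolated drop an aggregate dashboard averages away.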
When quality alerts fire, the runbook:
We've done this dance maybe a dozen times. About 70% of the time the cause is a recent change (prompt, retrieval, model version). The other 30% is environmental — input distribution shifted, dependent service changed, etc.
A few things we've tried and pruned:
Complex anomaly detection. Statistical anomaly detection on quality metrics had too many false positives. Simple thresholds work better for us.
Per-token cost analysis. Aggregating token costs at the per-token level didn't reveal anything that aggregate cost dashboards didn't.
LLM-generated explanations of failures. "Use the LLM to summarize what went wrong" sounds appealing; in practice, the LLM hallucinated explanations. We rely on humans reading examples.
Real-time judge LLM on every call. Cost prohibitive at our volume. Sampled is fine.
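Sampling can be made deterministic by hashing a stable identifier, so the same call always lands in or out of the sample regardless of which service asks. This is a sketch under the assumption that each call has a string trace ID; `in_sample` is a hypothetical helper.

```python
import hashlib


def in_sample(trace_id, rate):
    """Deterministically decide whether a call falls into a sample of the given rate.

    rate is a fraction between 0.0 and 1.0. The first 8 bytes of the SHA-256
    digest are treated as a uniform 64-bit integer and compared to a cutoff.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)
```

Because the decision is a cutoff on the same hash, any call in a 1% sample is also in a 10% sample, so responses scored by the expensive check also carry the cheaper sampled scores for comparison.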
Real numbers for our setup:
Total: $400/month + 4 hours/week. Weighed against our LLM bill ($2,300/month) and the cost of bad responses (real, but hard to quantify), this is a clear win.
Trace every LLM call from day one. The data is the foundation; collect it before you need it.
Layer 1 quality checks (programmatic) catch a surprising amount. Start there, add the more expensive layers later.
Dashboards by task type, not just aggregate. "Quality dropped 5%" is too coarse; "quality on the support assistant dropped 12%" is actionable.
Track cost per outcome, not just cost. It gives product and finance conversations a concrete number to work from.
Don't try to monitor "is this response good" in real time on every call. Run programmatic checks on every call and sample for the expensive checks.
Have a UI to pull recent samples by filter. The first time you have a quality issue, you'll be glad you can read 20 actual responses without writing a SQL query.
LLM observability is its own discipline, distinct from APM. The signals are different, the tools are different, the response patterns are different. The teams that get this right ship faster on AI features because they catch regressions early. The teams that don't end up with a slow erosion of quality that surfaces too late.