Standard APM doesn't tell you when your LLM-powered features are silently degrading. Here are the signals we track and the dashboards that catch the regressions standard tools miss.

AI Observability: Monitoring LLM Performance in Production

Standard observability — request rate, error rate, latency — is necessary but not sufficient for LLM-powered features. The model might be returning 200 OK with a response, but the response could be wrong, hallucinated, or off-policy. None of those are visible in your APM. This is the observability stack we built for our LLM features over the past year.

The four signals that aren't in your APM

Beyond standard metrics, four LLM-specific signals matter:

Response quality: is the answer good? Is it grounded in the provided context? Does it follow the format?
Drift: are quality / cost / latency trending differently from baseline?
Cost per output: tokens consumed vs business outcome.
User-perceived issues: which responses do users react badly to?

Each needs its own collection and analysis path.

Tracing every LLM call

The foundation is per-call tracing. Every LLM call gets a trace span with attributes:

python

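from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def traced_llm_call(user_id):
    with tracer.start_as_current_span("llm.call") as span:
        # Request-side attributes (token counts and cost are illustrative values).
        span.set_attributes({
            "llm.provider": "openai",
            "llm.model": "gpt-4o-mini",
            "llm.input_tokens": 1240,
            "llm.output_tokens": 380,
            "llm.cost_usd": 0.000312,
            "llm.temperature": 0.0,
            "llm.feature": "support_assistant",
            "llm.user_id": user_id,
            "llm.task_type": "answer_question",
        })
        response = await openai_call(...)  # wrapper around the provider client, not shown
        # Response-side attributes, available only once the call returns.
        content = response.choices[0].message.content or ""  # content can be None
        span.set_attributes({
            "llm.finish_reason": response.choices[0].finish_reason,
            "llm.response_length": len(content),
        })
        return response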

This is the raw data. From it, we derive everything else.

We use OpenTelemetry. Spans go to Datadog APM. Custom metrics derived from spans go to Prometheus.
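The `llm.cost_usd` attribute has to be computed from token counts somewhere. A minimal sketch of one way to do that, assuming a hypothetical per-model price table and a hypothetical helper named `estimate_cost_usd` (the model name and prices below are placeholders, not real provider rates):

```python
# Hypothetical per-million-token prices in USD; placeholders, not real rates.
PRICES_PER_MILLION = {
    "example-small-model": {"input": 1.00, "output": 2.00},
}

def estimate_cost_usd(model, input_tokens, output_tokens):
    """Estimate the cost of one call from its token counts."""
    price = PRICES_PER_MILLION[model]  # KeyError on unknown models is deliberate
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return round(cost, 6)

cost = estimate_cost_usd("example-small-model", input_tokens=1240, output_tokens=380)
```

Failing loudly on an unknown model is a deliberate choice in this sketch: a silent zero would hide exactly the kind of model change the cost dashboard is meant to surface.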

The dashboards

The four dashboards we check daily:

Cost dashboard: cost per task type per day, broken down by model. Spikes in cost without traffic spikes = something changed (prompt got longer, model upgraded silently, retries spiking).

Quality dashboard: per task type, % of responses that pass automated quality checks (more on this below). A drop here is the leading indicator of a regression.

Latency + tokens dashboard: p50/p95/p99 of LLM call latency and tokens consumed. Useful for catching prompt bloat (slow growth in input tokens means context is creeping up).

User signals dashboard: thumbs-up/down rates, abandonment rates, and the support-ticket rate following AI-handled interactions. A lagging indicator, but the most important one.
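The cost dashboard's "cost spike without a traffic spike" rule can be written as a small check. This is a sketch under assumptions: the function name, the dictionary shape, and the 25% tolerance are all hypothetical and would need tuning per task type.

```python
def cost_spike_without_traffic(today, baseline, tolerance=0.25):
    """Return True when cost rose by more than `tolerance` while call volume did not.

    `today` and `baseline` are dicts with "cost_usd" and "calls" for one task type.
    """
    if baseline["cost_usd"] <= 0 or baseline["calls"] <= 0:
        return False  # no usable baseline to compare against
    cost_growth = today["cost_usd"] / baseline["cost_usd"] - 1
    traffic_growth = today["calls"] / baseline["calls"] - 1
    return cost_growth > tolerance and traffic_growth <= tolerance

baseline = {"cost_usd": 40.0, "calls": 10_000}
same_traffic_more_cost = {"cost_usd": 60.0, "calls": 10_200}
more_traffic_more_cost = {"cost_usd": 60.0, "calls": 15_000}
```

With these sample numbers the first day is flagged (cost up 50%, traffic up 2%) and the second is not, because traffic grew along with cost.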

Automated quality checks

The hardest part. How do you automatically tell if a response is good?

We use three layers of checks, in order of cost:

Layer 1: Programmatic checks (cheap, run on every response).

Output is valid JSON if expected
Required fields present
Response length within expected range (catches truncations and bloat)
Categorical outputs are in the allowed set
No banned phrases (specific things the model shouldn't say)
Citation accuracy: claimed [N] citations exist in the provided context

These are fast and catch ~30-40% of bad responses.
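The Layer 1 checks listed above can be sketched as a single function that returns the names of the checks that failed. The field names, allowed categories, banned phrase, and length bounds here are hypothetical; substitute whatever your own response schema requires.

```python
import json
import re

# All of these values are illustrative assumptions.
REQUIRED_FIELDS = ("answer", "category")
ALLOWED_CATEGORIES = ("billing", "technical", "other")
BANNED_PHRASES = ("as an ai language model",)
MIN_LEN, MAX_LEN = 20, 2000

def layer1_checks(raw, num_context_chunks):
    """Return a list of failed check names; an empty list means the response passed."""
    failures = []
    if not MIN_LEN <= len(raw) <= MAX_LEN:
        failures.append("length")
    if any(phrase in raw.lower() for phrase in BANNED_PHRASES):
        failures.append("banned_phrase")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return failures + ["invalid_json"]
    if not isinstance(data, dict):
        return failures + ["invalid_json"]
    if any(field not in data for field in REQUIRED_FIELDS):
        failures.append("missing_fields")
    if data.get("category") not in ALLOWED_CATEGORIES:
        failures.append("category")
    # Citations look like [1], [2], ... and index into the provided context chunks.
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", str(data.get("answer", "")))}
    if any(n < 1 or n > num_context_chunks for n in cited):
        failures.append("citation")
    return failures

good = json.dumps({"answer": "Reset your password from the settings page [1].",
                   "category": "technical"})
bad = json.dumps({"answer": "See [3] for details on refunds.", "category": "sales"})
```

Returning the failed check names rather than a single boolean makes it possible to chart each failure type separately on the quality dashboard.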

Layer 2: Embedding-based similarity (cheap, runs on sample).

Embed the response. Embed the user's question. If similarity is too low, the response probably doesn't address the question.
Embed the response. Embed the retrieved context. If similarity is too low, the response is probably hallucinated.

We run these on a sample (~10% of traffic) because embedding adds latency. Useful for trend detection.
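A minimal sketch of the two similarity checks, assuming the embeddings have already been computed by whatever embedding model you use. The vectors below are toy values, and the 0.5 thresholds and flag names are hypothetical and would need calibrating against labeled examples.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def layer2_flags(response_vec, question_vec, context_vec,
                 min_question_sim=0.5, min_context_sim=0.5):
    """Flag responses that drift from the question or from the retrieved context."""
    flags = []
    if cosine_similarity(response_vec, question_vec) < min_question_sim:
        flags.append("off_topic")
    if cosine_similarity(response_vec, context_vec) < min_context_sim:
        flags.append("possibly_ungrounded")
    return flags

# Toy 2-dimensional vectors standing in for real embeddings.
flags = layer2_flags([1.0, 0.0], [1.0, 0.1], [0.0, 1.0])
```

In this toy case the response is close to the question but orthogonal to the context, so only the grounding flag is raised.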

Layer 3: Judge-LLM evaluation (expensive, runs on smaller sample).

A larger model (gpt-4o) scores the response on dimensions we care about: helpfulness, accuracy, format adherence, tone. We run this on ~1% of traffic and on every regression test run.

The judge prompt:

code

Given the user's question, the provided context, and the assistant's response, score the response on:
- Accuracy (1-5): does it correctly answer the question?
- Grounding (1-5): are claims supported by the provided context?
- Format (1-5): does it follow the requested format?
- Helpfulness (1-5): is it useful to the user?

Respond with JSON: {"accuracy": N, "grounding": N, "format": N, "helpfulness": N, "issues": ["..."]}.

The judge LLM has its own biases, but trends are reliable. Average accuracy score per task per day is one of our most important metrics.
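The judge's reply is itself model output, so it is worth validating before it feeds a metric. The sketch below parses the JSON format requested in the prompt and computes the average accuracy per task per day; the function names and the record shape are assumptions made for illustration.

```python
import json
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("accuracy", "grounding", "format", "helpfulness")

def parse_judge_output(raw):
    """Parse the judge's JSON reply; return None if it is malformed or out of range."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[dim] = value
    issues = data.get("issues", [])
    scores["issues"] = issues if isinstance(issues, list) else []
    return scores

def daily_accuracy(records):
    """Average accuracy per (task_type, day) from (task_type, day, raw_reply) tuples."""
    buckets = defaultdict(list)
    for task_type, day, raw in records:
        parsed = parse_judge_output(raw)
        if parsed is not None:  # skip replies the judge garbled
            buckets[(task_type, day)].append(parsed["accuracy"])
    return {key: mean(values) for key, values in buckets.items()}
```

Dropping malformed judge replies keeps the average clean, but the drop rate is worth tracking too: a judge that suddenly stops returning valid JSON is its own regression.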

Catching regressions

When a regression hits, the signals show up in this order:

Quality dashboard shows a drop in pass rate (Layer 1 checks).
Hours later: judge LLM scores show a drop.
Hours-to-days later: user signals (thumbs-down rate) tick up.

The Layer 1 → user signal lag is what makes proactive observability valuable. Catching the regression before users feel it is the goal.
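Alerting on the Layer 1 pass rate can be as plain as comparing it with a baseline. A sketch, assuming a hypothetical 5-point maximum drop and a minimum sample size so that low-traffic task types do not page anyone:

```python
def pass_rate_alert(passed, total, baseline_rate, max_drop=0.05, min_samples=200):
    """Return True when the pass rate fell more than `max_drop` below baseline."""
    if total < min_samples:
        return False  # too few responses to trust the rate
    return baseline_rate - passed / total > max_drop

# 85% pass rate against a 95% baseline: a 10-point drop.
alert = pass_rate_alert(passed=850, total=1000, baseline_rate=0.95)
```

The baseline here is passed in by the caller; how it is computed (for example, a trailing average for the same task type) is left open.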

Specific regressions we've caught:

A prompt change that started returning longer responses. Layer 1 caught it (response length above expected range). Investigated → reverted → back to normal.

A retrieval change that started missing relevant chunks. Embedding similarity to retrieved context dropped. Investigated → bug in chunking pipeline → fixed.

An OpenAI model snapshot change. When OpenAI silently rolled out a new version of gpt-4o, our judge scores dropped 8% overnight. We pinned to a dated snapshot the next day.

Cost per output

For tasks with measurable business outcomes (e.g., "successful customer support resolution"), we track cost per outcome:

code

cost_per_resolution = sum(llm_cost) / count(resolutions)

This is the most useful single metric for product/finance conversations. "Each resolved ticket costs $0.04 in LLM calls" is concrete.

A spike in this metric usually points to a bug: a loop, an expensive prompt, or a retry storm.
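A small sketch of the cost-per-resolution calculation, assuming each call record carries a `cost_usd` field like the span attribute above. The sample costs are made up for illustration.

```python
def cost_per_resolution(calls, resolutions):
    """Total LLM cost divided by resolved tickets; None when nothing was resolved."""
    total_cost = sum(call["cost_usd"] for call in calls)
    if resolutions == 0:
        return None  # avoid dividing by zero on days with no resolutions
    return total_cost / resolutions

calls = [{"cost_usd": 0.012}, {"cost_usd": 0.018}, {"cost_usd": 0.010}]
value = cost_per_resolution(calls, resolutions=1)
```

Note that the numerator includes every call, including those made in conversations that never resolved. That is what makes loops and retry storms visible in the metric.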

User feedback collection

Three sources:

Explicit feedback: thumbs up/down on responses. Captures explicit signal but only ~5% of users react. Bias: people thumbs-down more than thumbs-up.

Implicit feedback: "did the user accept the response or escalate?" In the support assistant, did the conversation end (success) or get escalated to a human (signal of failure)?

Operational feedback: complaints, support tickets, social-media mentions. Lagging but highest signal.

We weight implicit feedback most heavily because it's the highest-volume and least biased.

Per-customer metrics

Some customers' usage diverges from the average — they might use the feature in unusual ways, hit specific edge cases, or be the canary for issues.

We track quality and cost per customer for our top-50 customers. When one of them has worse-than-average quality, an account team is notified to check in.

This caught a case where one customer's data had specific characters that broke our chunking pipeline — quality was awful for that customer specifically, fine for everyone else. Without per-customer tracking, we wouldn't have found it.
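Flagging customers with worse-than-average quality might look like the sketch below. The customer names, the pass rates, and the 10-point margin are hypothetical.

```python
from statistics import mean

def customers_below_average(pass_rates, margin=0.10):
    """Return customers whose pass rate is more than `margin` below the mean, worst first."""
    if not pass_rates:
        return []
    average = mean(pass_rates.values())
    flagged = [c for c, rate in pass_rates.items() if average - rate > margin]
    return sorted(flagged, key=lambda c: pass_rates[c])

rates = {"acme": 0.95, "globex": 0.93, "initech": 0.60}
flagged = customers_below_average(rates)
```

One caveat with this approach: a single very bad customer drags the mean down, so with a small customer set a median baseline may be a better comparison point.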

Incident response: the LLM-specific runbook

When quality alerts fire, the runbook:

Has a deploy happened recently? Check the GitOps history. Most quality issues correlate with recent prompt or pipeline changes.
Is the issue specific to a task type, customer, or input pattern? The dashboards have filters; we narrow the scope quickly.
What does the judge LLM say? Run judge on a recent sample. If it's reporting specific issues ("the response doesn't address the question"), that's the lead.
Look at actual examples. No amount of metrics replaces reading 20 actual responses. We have a UI that lets us pull recent samples by filter.
Hypothesize and revert. If a recent change caused it, revert. Don't try to "fix forward" a quality issue without a clear theory.

We've done this dance maybe a dozen times. About 70% of the time the cause is a recent change (prompt, retrieval, model version). The other 30% is environmental — input distribution shifted, dependent service changed, etc.
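The "pull recent samples by filter" step does not need much machinery behind it. A sketch, assuming the records are dictionaries already limited to a recent time window; the field names are hypothetical.

```python
import random

def pull_samples(records, n=20, seed=None, **filters):
    """Return up to `n` random records whose fields match every filter."""
    matching = [r for r in records
                if all(r.get(key) == value for key, value in filters.items())]
    rng = random.Random(seed)  # a fixed seed makes a sample reproducible
    return rng.sample(matching, min(n, len(matching)))

records = [{"task_type": "answer_question", "customer": "acme", "response": f"r{i}"}
           for i in range(30)]
samples = pull_samples(records, task_type="answer_question", customer="acme")
```

Random sampling rather than "the latest 20" avoids reading twenty responses from the same burst of traffic.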

What we don't bother with

A few things we've tried and pruned:

Complex anomaly detection. Statistical anomaly detection on quality metrics had too many false positives. Simple thresholds work better for us.

Per-token cost analysis. Aggregating token costs at the per-token level didn't reveal anything that aggregate cost dashboards didn't.

LLM-generated explanations of failures. "Use the LLM to summarize what went wrong" sounds appealing; in practice, the LLM hallucinated explanations. We rely on humans reading examples.

Real-time judge LLM on every call. Cost prohibitive at our volume. Sampled is fine.

Cost of the observability itself

Real numbers for our setup:

OpenTelemetry tracing infra: ~$200/month (Datadog APM, partially shared with non-LLM services)
Embedding costs for similarity checks: ~$50/month
Judge LLM for sampled evaluation: ~$120/month
Engineer time to maintain dashboards and alerts: ~4 hours/week

Total: ~$400/month + 4 hours/week. For our LLM bill (~$2,300/month) and the cost of bad responses (real, hard to quantify), this is a clear win.
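As a quick check of the arithmetic, using only the monthly figures listed above (the dictionary keys are just labels):

```python
monthly_costs_usd = {"tracing": 200, "embeddings": 50, "judge_llm": 120}
llm_bill_usd = 2300

observability_total = sum(monthly_costs_usd.values())
overhead_ratio = observability_total / llm_bill_usd
print(f"${observability_total}/month, {overhead_ratio:.0%} of the LLM bill")
# prints "$370/month, 16% of the LLM bill"
```

The itemized lines sum to $370, which the total above rounds up to roughly $400. Engineer time is not included in the ratio.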

What I'd tell a team starting

Trace every LLM call from day one. The data is the foundation; collect it before you need it.

Layer 1 quality checks (programmatic) catch a surprising amount. Start there, add the more expensive layers later.

Dashboards by task type, not just aggregate. "Quality dropped 5%" is too coarse; "quality on the support assistant dropped 12%" is actionable.

Track cost per outcome, not just cost. It gives product and finance conversations a concrete number to work from.

Don't try to monitor "is this response good" in real time on every call. Sample for the expensive checks, and run the programmatic checks on everything.

Have a UI to pull recent samples by filter. The first time you have a quality issue, you'll be glad you can read 20 actual responses without writing a SQL query.

LLM observability is its own discipline, distinct from APM. The signals are different, the tools are different, the response patterns are different. The teams that get this right ship faster on AI features because they catch regressions early. The teams that don't end up with a slow erosion of quality that surfaces too late.

About Kiril Urbonas