#Eval

9 articles tagged with Eval.

Production RAG Reliability — Making LLM Answers Trustworthy

A demo RAG app is easy; one users trust is not. This is the map for reliable retrieval-augmented generation: grounding, evaluation, retrieval quality, guardrails, and safe rollout.

Kiril Urbonas·2

3 days ago

Shadow Testing and Canary Releases for LLM Changes

A prompt tweak or model bump can quietly wreck answers everywhere. Ship LLM changes the way you ship risky code: gate, shadow, canary, roll back.

Kiril Urbonas·1

Last week

RAG Evaluation Metrics — Faithfulness and Context Precision

A single answer-quality score hides where your RAG pipeline actually breaks. Split retrieval eval from generation eval and measure each one honestly.

Kiril Urbonas·1

Last week

Hallucination Detection — Grounding and Citations for RAG

A RAG system that invents facts erodes trust fast. Here is how we ground answers, force citations, and catch the fabrications before users do.

Kiril Urbonas·1

Last week

Guardrails for Production LLMs: Input and Output Filtering That Holds

A user got our support bot to recite its system prompt and then draft a refund it wasn't authorized to give. Two layers of guardrails, one on input, one on output, closed both holes.

Kiril Urbonas·1

2 weeks ago

LLM Evals in CI: Catching Prompt Regressions Before They Ship

A prompt tweak that helped one case quietly broke twenty others. Here's the CI eval harness we built so that never ships silently again.

Kiril Urbonas·1

3 weeks ago

RAG Retrieval Evaluation — Building an Offline Eval Harness Before You Ship

You can't improve retrieval you don't measure. The offline eval harness that lets us change embeddings, chunking, and rerankers with confidence instead of vibes — with the metrics that actually predict production quality.

Kiril Urbonas·5

3 weeks ago

LLM Output Validation — Schema-Constrained Generation in Production

Parsing model output with a regex and a prayer doesn't survive contact with traffic. The validation layers that keep structured LLM output reliable — constrained decoding, schema validation, and the repair loop.

Kiril Urbonas·3

Last month

LLM Evals That Actually Predict Production Quality

Most LLM eval suites correlate poorly with what real users experience. The eval patterns we run that move with prod metrics — and the ones that lied to us.

Kiril Urbonas·5
