Blog

Last week

Prompt Caching for Production LLM Apps — Cutting Cost and Latency at the Token Layer

A long, stable system prompt re-billed on every request is money on fire. How prompt caching works, where the cache boundary belongs, and the structuring discipline that got us a big cost and latency cut without changing behavior.

Kiril Urbonas·5

2 weeks ago

LLM Output Validation — Schema-Constrained Generation in Production

Parsing model output with a regex and a prayer doesn't survive contact with traffic. The validation layers that keep structured LLM output reliable — constrained decoding, schema validation, and the repair loop.

Kiril Urbonas·3

Fine-Tuning vs RAG vs Long-Context: A Decision Framework With Numbers

We've shipped all three patterns to production. They're not interchangeable. Here's the framework we now use to decide which approach fits a given task.

Kiril Urbonas·6

LLM Output Validation: Schema-First Prompt Engineering Patterns

We reject ~6% of LLM outputs before they reach a downstream system. Here's how we structure prompts and validators to catch malformed responses early.

Kiril Urbonas·15

Vector Database Selection: Pinecone, pgvector, Qdrant After 6 Months in Production

We ran the same RAG workload across three vector stores for a quarter each. Here's what we learned about latency, cost, and operational overhead.

Kiril Urbonas·10

Self-Hosted LLMs vs OpenAI API: A Cost-vs-Latency Analysis After 6 Months

We ran the same workload on both for half a year. The break-even point isn't where most blog posts say it is — and the latency story has more nuance than throughput-per-dollar charts admit.

Kiril Urbonas·19

Embedding Quality in RAG: How We Cut Hallucinations by 60%

Six months running RAG in production taught us that the retrieval step matters far more than the model. Concrete techniques that moved the needle, with before/after numbers.

Kiril Urbonas·9

Prompt Engineering Patterns That Actually Work in Production

Battle-tested prompt patterns from running LLM features in production: structured output, chain-of-thought, and graceful failure handling.

Kiril Urbonas·9

Model Fallback Policies for Customer-Facing AI: The Routing Rules That Kept SLA Intact

A real-world model fallback guide for customer-facing AI systems, covering how one team preserved response quality and support SLAs during a partial provider degradation.

Kiril Urbonas·19

Prompt Versioning and Regression Testing: How Teams Avoid Silent AI Regressions

A real-world guide to prompt versioning and regression testing for production AI features, focused on preventing the subtle changes that hurt quality long before anyone notices.

Kiril Urbonas·11

RAG Retrieval Quality Evaluation: The Checks We Added After Bad Answers Reached Production

A search-friendly guide to RAG retrieval quality evaluation, based on the moment one production assistant started citing stale documents and the team had to prove what 'good retrieval' meant.

Kiril Urbonas·7