Blog

Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.

Token Budgeting for Long-Context Prompts: What to Cut First

A 180k-token context window is not a license to stuff everything in. Here's how we cut prompt size by 60% without hurting answer quality, and what to trim first.

Kiril Urbonas

2 days ago

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

When our single LLM provider had a 40-minute outage, every AI feature went dark. A gateway with routing and fallback fixed that, and cut spend 30% as a bonus.

Kiril Urbonas

3 days ago

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Users hit stop, but our server kept paying for tokens for another 40 seconds. Here's how we wired real cancellation and backpressure into an SSE streaming endpoint.

Kiril Urbonas

3 days ago

Choosing an Embedding Model: Dimensions, Cost, and MTEB Reality

The top model on the MTEB leaderboard made our search worse and our bill bigger. Here's how we actually picked an embedding model for a real RAG system.

Kiril Urbonas

4 days ago

Agent Memory: Short-Term, Long-Term, and When You Need Neither

Most agents that "need memory" actually need a smaller context window and a database. Here's how we cut a support agent's token bill by 60% by deleting memory.

Kiril Urbonas

5 days ago

Guardrails for Production LLMs: Input and Output Filtering That Holds

A user got our support bot to recite its system prompt and then draft a refund it wasn't authorized to give. Two layers of guardrails, one on input, one on output, closed both holes.

Kiril Urbonas

6 days ago

Reranking in RAG: When a Cross-Encoder Earns Its Latency

Our RAG answers kept citing the wrong paragraph even when the right one was retrieved. A cross-encoder reranker fixed relevance but added 180ms. Here's when that trade pays off.

Kiril Urbonas

Last week

LLM Evals in CI: Catching Prompt Regressions Before They Ship

A prompt tweak that helped one case quietly broke twenty others. Here's the CI eval harness we built so that never ships silently again.

Kiril Urbonas

Last week

Semantic Caching for LLM Apps: Cutting Cost on Repeated Queries

Users kept asking the same questions in slightly different words, and we paid full price every time. Semantic caching cut our LLM bill by a third.

Kiril Urbonas

Last week

Hybrid Search for RAG: Combining BM25 and Vectors the Right Way

Pure vector search kept missing exact matches like error codes and CLI flags. Adding BM25 back and fusing the two lifted our retrieval recall by 11 points.

Kiril Urbonas

Last week

RAG Chunking Strategies: Fixed, Semantic, and Recursive Compared

Our support bot kept citing half a sentence and missing the answer that sat two lines below. The culprit wasn't the model; it was how we split the docs.

Kiril Urbonas

Last week

Prompt Caching for Production LLM Apps — Cutting Cost and Latency at the Token Layer

A long, stable system prompt re-billed on every request is money on fire. How prompt caching works, where the cache boundary belongs, and the structuring discipline that got us a big cost and latency cut without changing behavior.

Kiril Urbonas
