Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
I spent 3 weeks chasing an answer-quality regression that turned out to be a tokenizer mismatch in a library upgrade. Here's what I learned about evaluating RAG.
We changed a system prompt for what we thought was a tone improvement and broke a customer-critical extraction overnight. Here's the prompt version control and regression testing we built next.
We replaced 47 percentile-based threshold alerts with 3 SLO burn-rate alerts. The on-call rotation now gets paged less and catches more.
We had Datadog for app metrics, Loki for logs, and zero useful insight into what our LLM service was actually doing. Here's the observability stack we built specifically for model serving.
Learn how to fine-tune LLMs like Llama 2, Mistral, and GPT models for your specific use case. Includes LoRA, QLoRA, and full fine-tuning techniques.
We deploy LangChain apps in Docker on Kubernetes. The patterns that work, the LangChain-specific gotchas, and what we'd build differently next time.
I fine-tuned Llama 3 8B on a single RTX 4090 over a weekend for a side project. Here's what worked, what cost more than expected, and what I'd do differently.