We deploy LangChain apps in Docker on Kubernetes. The patterns that work, the LangChain-specific gotchas, and what we'd build differently next time.
LangChain is the framework most teams reach for when building LLM-powered apps. We have several LangChain-based services in production, all containerized and running on Kubernetes. The framework helps with prototyping; production introduces specific challenges. This post is the practical version of getting from "it works on my machine" to "it serves real customers reliably."
Containers solve the standard set of problems:
For Python apps with heavy dependencies (LangChain itself plus the various integrations: openai, anthropic, pinecone, qdrant, etc.), the dependency tree is messy. Pinning everything in requirements.txt and baking it into a container is the only sane way to ship.
A working Dockerfile for a LangChain service:
# Stage 1: build with toolchain for compiling C extensions
FROM python:3.12-slim AS build
WORKDIR /src
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc g++ libffi-dev && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
# Install into a virtualenv, not /root/.local: /root is not readable by the non-root runtime user
RUN python -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir -r requirements.txt
# Stage 2: runtime
FROM python:3.12-slim AS runtime
WORKDIR /app
COPY --from=build /opt/venv /opt/venv
COPY . .
ENV PATH=/opt/venv/bin:$PATH
ENV PYTHONUNBUFFERED=1
USER 1000:1000
CMD ["uvicorn", "myapp:app", "--host", "0.0.0.0", "--port", "8000"]
Multi-stage build keeps the runtime image small. With LangChain + OpenAI + a vector DB client + FastAPI, the image is ~450MB. Larger than I'd like but manageable.
PYTHONUNBUFFERED=1 ensures logs flush immediately. Without it, logs sometimes don't appear until the container exits — bad for debugging.
LangChain moves fast. Breaking changes happen between minor versions. Loose pins (langchain>=0.1.0) cause "it worked yesterday" surprises.
We pin everything:
langchain==0.3.7
langchain-openai==0.3.0
langchain-anthropic==0.3.1
openai==1.40.6
anthropic==0.32.0
fastapi==0.115.5
uvicorn==0.32.0
pydantic==2.10.0
qdrant-client==1.12.1
When we want to upgrade, it's an explicit PR with version bumps and any code changes needed. Not "oh we picked up a new version randomly."
We use pip-compile (from pip-tools) to generate locked files from a higher-level requirements.in. This handles transitive dependencies cleanly.
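A pinning policy is easier to keep when CI enforces it. The following is a minimal sketch of such a check, not part of pip-tools; the function name `unpinned_requirements` is hypothetical, and it assumes a plain requirements.txt where each dependency sits on its own line.

```python
import re

# Matches "name==version", optionally with extras such as "uvicorn[standard]==0.32.0".
_PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,.\-]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement specs that are not exact `==` pins."""
    offenders = []
    for raw in text.splitlines():
        # Drop trailing comments, whitespace, and line-continuation backslashes.
        line = raw.split("#", 1)[0].strip().rstrip("\\").strip()
        # Skip blank lines and pip options (-r, --hash, --index-url, ...).
        if not line or line.startswith("-"):
            continue
        # Ignore environment markers after ';'.
        spec = line.split(";", 1)[0].strip()
        if not _PINNED.match(spec):
            offenders.append(spec)
    return offenders
```

Run against the compiled requirements file in CI, a non-empty result fails the build before a loose pin reaches an image.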
LangChain apps need lots of config: API keys, model names, endpoints, vector DB credentials, etc. All of these come from environment variables, not hardcoded:
import os
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    model=os.environ.get("LLM_MODEL", "gpt-4o-mini"),
    temperature=float(os.environ.get("LLM_TEMPERATURE", "0")),
)
For development, a .env file (loaded via python-dotenv in dev only). For production, env vars come from Kubernetes Secrets (populated by ESO from AWS Secrets Manager).
The pattern: code reads env vars; deployment provides them. The same image runs in dev, staging, and prod with different env values.
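One way to make the "code reads env vars" half of that pattern safer is to read everything once at startup and fail fast on bad values. This is a sketch under assumptions: `LLMSettings` and `load_settings` are hypothetical names, and only the three variables from the snippet above are covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    openai_api_key: str
    model: str
    temperature: float


def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read LLM config from the environment, failing fast on missing or malformed values."""
    missing = [name for name in ("OPENAI_API_KEY",) if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required env vars: {', '.join(missing)}")
    raw_temp = env.get("LLM_TEMPERATURE", "0")
    try:
        temperature = float(raw_temp)
    except ValueError:
        raise RuntimeError(f"LLM_TEMPERATURE must be a number, got {raw_temp!r}") from None
    return LLMSettings(
        openai_api_key=env["OPENAI_API_KEY"],
        model=env.get("LLM_MODEL", "gpt-4o-mini"),
        temperature=temperature,
    )
```

With this, a pod with a missing secret crashes at startup with a clear message instead of failing on the first user request.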
LangChain apps need a health endpoint. Standard:
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/health")
async def health():
    # Light check: process is up and able to respond
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    # Deeper check: dependencies are reachable
    try:
        # cheap call to verify LLM provider is reachable
        await llm.ainvoke("ok")
        return {"status": "ready"}
    except Exception as e:
        return Response(status_code=503, content=str(e))
/health is fast and cheap; used by Kubernetes liveness probe.
/ready is more expensive; used by Kubernetes readiness probe (and called less often).
The split matters. Liveness probes that hit a slow /ready cause unnecessary pod restarts. We hit this once; a slow LLM call made the liveness probe time out, kubelet restarted the pod, repeat. Now liveness is cheap.
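A possible refinement, which is not something the setup above does: since the readiness check calls the LLM provider, caching its outcome for a short window keeps frequent probes from turning into frequent billed calls. A minimal sketch, where `CachedCheck` and its `ttl` parameter are hypothetical:

```python
import asyncio
import time
from typing import Awaitable, Callable


class CachedCheck:
    """Caches the outcome of an async dependency check for `ttl` seconds."""

    def __init__(self, check: Callable[[], Awaitable[None]], ttl: float = 60.0):
        self._check = check
        self._ttl = ttl
        self._expires_at = 0.0
        self._healthy = False
        self._lock = asyncio.Lock()

    async def is_healthy(self) -> bool:
        # The lock keeps concurrent probes from triggering duplicate checks.
        async with self._lock:
            now = time.monotonic()
            if now >= self._expires_at:
                try:
                    await self._check()
                    self._healthy = True
                except Exception:
                    self._healthy = False
                self._expires_at = now + self._ttl
            return self._healthy
```

The trade-off is detection latency: with a 60-second TTL, a provider outage can go unnoticed by readiness for up to a minute.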
LangChain calls can take a long time. Without timeouts, requests can hang indefinitely. We set timeouts at multiple layers:
import asyncio

llm = ChatOpenAI(timeout=30)  # 30s timeout per LLM call

# At the FastAPI route level
@app.post("/chat")
async def chat(req: ChatRequest):
    async with asyncio.timeout(60):  # 60s total request budget (Python 3.11+)
        return await chain.ainvoke(req.input)
Plus the timeouts outside the app: ingress, load balancer, and so on. The shortest timeout on the request path wins, so the outer ones need to be at least as long as the request budget.
A specific gotcha: the OpenAI Python SDK's default request timeout is 10 minutes. Always set explicit timeouts.
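When one request makes several LLM calls, a fixed per-call timeout can still add up to more than the request budget. One way to handle that is to give each call the smaller of its own timeout and whatever is left of the budget. A sketch under assumptions: `call_with_budget` and `make_call` are hypothetical names, not LangChain APIs.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_budget(
    make_call: Callable[[float], Awaitable[T]],
    deadline: float,
    per_call_timeout: float,
) -> T:
    """Run one call with min(per_call_timeout, time left until `deadline`).

    `deadline` is an absolute time on the event loop clock (loop.time()).
    `make_call` receives the effective timeout so it can pass it downstream.
    """
    loop = asyncio.get_running_loop()
    remaining = deadline - loop.time()
    if remaining <= 0:
        raise asyncio.TimeoutError("request budget exhausted before the call started")
    timeout = min(per_call_timeout, remaining)
    return await asyncio.wait_for(make_call(timeout), timeout=timeout)
```

A route would compute `deadline = loop.time() + 60` once and pass it to every call it makes, so later calls in a chain get progressively less time.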
For chat-style features, users want incremental responses (the LLM's output streams in). LangChain supports streaming; FastAPI does too.
import json
from fastapi.responses import StreamingResponse

@app.post("/chat")
async def chat(req: ChatRequest):
    async def generate():
        async for chunk in chain.astream(req.input):
            yield f"data: {json.dumps({'text': chunk.content})}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
Server-Sent Events (SSE) is the simplest streaming protocol. WebSockets work too but are heavier.
The connection lifecycle: keep streaming until the LLM is done; close cleanly. Cancel handling matters — if the client disconnects, we should stop the LLM call (saves cost).
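The cancel handling described above can be sketched as a wrapper around the chunk stream. This is an illustration, not the code we run: `sse_until_disconnect` is a hypothetical helper, and `is_disconnected` is assumed to be an async callable such as Starlette's `request.is_disconnected`. Whether closing the stream actually stops provider-side generation (and billing) depends on the provider and client library.

```python
import json
from typing import AsyncIterator, Awaitable, Callable


async def sse_until_disconnect(
    chunks: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Yield SSE frames from `chunks`, stopping early once the client is gone."""
    try:
        async for text in chunks:
            if await is_disconnected():
                break  # stop pulling tokens nobody will read
            yield f"data: {json.dumps({'text': text})}\n\n"
    finally:
        # Close the upstream generator so the in-flight LLM stream is released.
        aclose = getattr(chunks, "aclose", None)
        if aclose is not None:
            await aclose()
```

The check runs once per chunk, so a disconnect is noticed at the next token rather than immediately.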
Standard structured logging. Per-call:
logger.info("llm_call", extra={
    "model": "gpt-4o-mini",
    "input_tokens": response.usage.prompt_tokens,
    "output_tokens": response.usage.completion_tokens,
    "duration_ms": elapsed * 1000,
    "request_id": request_id,
    "user_id": user_id,
})
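One detail worth spelling out: fields passed through `extra=` only show up in the output if the formatter emits them, and the standard library's default formatter does not. A minimal sketch of a JSON formatter that does; `JsonFormatter` is a hypothetical class, not our exact logging setup.

```python
import json
import logging

# Attributes present on every LogRecord; anything else was passed via `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including `extra` fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-JSON-serializable values from raising.
        return json.dumps(payload, default=str)
```

Attaching it is one line on the handler: `handler.setFormatter(JsonFormatter())`.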
Plus OpenTelemetry tracing for distributed traces. We use the opentelemetry-instrumentation-langchain package which adds spans for every LangChain operation.
The tracing visibility is essential. When a user reports a slow response, we can see exactly which LLM call took how long, what was retrieved, where the latency went.
The Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatbot
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: chatbot
  template:
    metadata:
      labels:
        app: chatbot
    spec:
      containers:
        - name: app
          image: our-registry/chatbot:v1.42.0
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            periodSeconds: 10
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: openai-api-key
Standard pattern. Resources set conservatively at 250m / 512Mi — most LangChain apps don't need much CPU per replica because the work is I/O-bound (waiting for LLM responses).
Things that have bitten us:
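Async vs sync mixing. LangChain has both sync (invoke) and async (ainvoke) variants. Mixing them in one chain hurts in two ways: a sync invoke inside an async handler blocks the event loop, and sync-only components called through ainvoke get pushed onto a thread pool that saturates under load. Pick async; stay async throughout.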
Memory growth in long chains. Some LangChain memory implementations grow unbounded in long conversations. We cap conversation length explicitly and summarize older messages.
Default verbose logging. verbose=True on a chain logs the full prompt and response. Useful in development; disastrous in production (PII, secrets, tokens).
Output parsers that break on edge cases. LangChain's structured output parsers expect specific formats. When the LLM returns slightly off-format responses, parsers crash. We wrap parsers in try-except + fallback.
Tool/function calling cost. Each tool the model can call adds tokens to every prompt. With 20 tools defined, every call costs more than expected. We limit tools to ~5 per chain when possible.
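The "try-except + fallback" approach for output parsers mentioned above can be sketched without any LangChain dependency. This is an illustration under assumptions: the expected output is a JSON object, and `parse_json_with_fallback` is a hypothetical helper, not a LangChain parser.

```python
import json
import re
from typing import Any

# Built from pieces so the pattern does not contain a literal code fence.
_TICKS = "`" * 3
_FENCED = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL)


def parse_json_with_fallback(raw: str, fallback: dict[str, Any]) -> dict[str, Any]:
    """Try strict JSON, then a fenced block, then the outermost {...}; else return fallback."""
    candidates = [raw]
    fenced = _FENCED.search(raw)
    if fenced:
        candidates.append(fenced.group(1))
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        candidates.append(raw[start : end + 1])
    for text in candidates:
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return fallback
```

Returning a fallback keeps the request alive; logging each fallback hit is what tells you whether the prompt needs fixing.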
LangChain has cache abstractions (InMemoryCache, RedisCache, SQLiteCache). For production, Redis is the right one:
from langchain.globals import set_llm_cache
from langchain_community.cache import RedisCache
set_llm_cache(RedisCache(redis_=redis_client))
The cache is keyed on (model, prompt). Identical (model, prompt) → cached response. Useful for:
Hit rate for our chatbot service: ~8%. Saves ~$80/month at our scale.
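To make the keying and the hit-rate number concrete, here is a toy in-memory version of an exact-match cache. It is a sketch, not how LangChain's cache classes are implemented; `PromptCache` and `get_or_call` are hypothetical names.

```python
import hashlib
from typing import Callable


class PromptCache:
    """Exact-match response cache keyed on (model, prompt), with hit-rate tracking."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash so keys stay short; the NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call: Callable[[str, str], str]) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call(model, prompt)
        return self._store[key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because the match is exact, a single trailing space in the prompt is a miss, which is one reason hit rates for free-form chat stay in the single digits.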
LangChain itself is moving fast. Versioning matters:
Looking back, here is what we would have done differently:
Used LangChain LCEL (Expression Language) from the start, not the older Chain APIs. LCEL is the future direction; older patterns are being deprecated. Migrating mid-project is annoying.
Built our own narrow abstractions earlier. LangChain provides building blocks; for our specific use cases, a thin wrapper over the OpenAI/Anthropic SDKs would have meant less framework coupling. We use LangChain for some things and the plain SDKs for others.
Logged the raw prompts from day one. Debugging quality issues without seeing the actual prompt sent to the LLM is hard. We log prompts now; we should have from the start.
Set up evals before iterating. We added the regression test suite later. It should have been there on day one: iteration is much faster with an eval safety net.
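For the last point, a regression suite does not need to start big. A minimal sketch of the idea, assuming simple substring checks are good enough for a first pass; `EvalCase` and `run_evals` are hypothetical names, and `generate` stands in for whatever function calls your chain.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    must_contain: tuple[str, ...] = ()
    must_not_contain: tuple[str, ...] = ()


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> list[str]:
    """Return the names of failing cases (an empty list means the suite passed)."""
    failures = []
    for case in cases:
        output = generate(case.prompt).lower()
        ok = all(s.lower() in output for s in case.must_contain) and not any(
            s.lower() in output for s in case.must_not_contain
        )
        if not ok:
            failures.append(case.name)
    return failures
```

Running this on every prompt change, and failing the PR when the list is non-empty, is the safety net; richer scoring can come later.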
Pin every dependency. LangChain breaks between minor versions; upgrades should be explicit.
Health check splits matter. Cheap liveness, deeper readiness.
Timeouts everywhere. LLM calls can hang; without timeouts, your app does too.
Log the prompt + response. Not always, but for debugging investigations.
Async throughout. Don't mix sync and async in one chain.
Cache where it makes sense. Redis-backed cache for repeated queries.
Build evals before iterating. Each prompt change should run against a regression suite.
LangChain is one of those frameworks where the "getting started" experience is good and the "running it in production" experience is harder. The patterns above are the operational discipline that makes it work. With them, LangChain in production is mostly fine. Without them, you get the everyday LLM-app problems amplified by framework complexity.