Battle-tested prompt patterns from running LLM features in production: structured output, chain-of-thought, and graceful failure handling.
After running LLM-powered features for 8 months in production, these are the patterns that survived contact with real users and messy data.
Asking an LLM to "return JSON" works 90% of the time. The other 10% crashes your parser at 2 AM.
What we do:
import json

from openai import OpenAI
from pydantic import BaseModel

# Assumes the OpenAI Python SDK (v1+), which reads OPENAI_API_KEY from the environment.
client = OpenAI()


class ExtractedEntity(BaseModel):
    name: str
    category: str
    confidence: float


SYSTEM_PROMPT = """Extract entities from the text.
Return ONLY valid JSON matching this schema:
{"name": string, "category": string, "confidence": number 0-1}
Return an array. No explanation, no markdown fences."""


def extract_entities(text: str) -> list[ExtractedEntity]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.1,
    )
    # content can be None (for example on a refusal), so default to an empty string
    raw = (response.choices[0].message.content or "").strip()
    # Strip markdown fences if the model adds them anyway
    if raw.startswith("```"):
        # [-1] avoids an IndexError when the fenced reply has no newline
        raw = raw.split("\n", 1)[-1].rsplit("```", 1)[0]
    # Raises json.JSONDecodeError on malformed output; callers should handle it
    data = json.loads(raw)
    return [ExtractedEntity(**item) for item in data]
Why it works: Low temperature, explicit schema in the prompt, and a defensive parser that handles the most common failure mode (markdown fences).
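The defensive parsing step is easier to trust when it lives in its own function and can be unit-tested without calling the model. The sketch below is one way to do that with only the standard library. The names `parse_json_array` and `FENCE` are hypothetical and not from the code above, and it only covers the fence failure mode described here, not every malformed reply.

```python
import json

# Three backticks, built programmatically so the marker is easy to spot
FENCE = "`" * 3


def parse_json_array(raw: str) -> list:
    """Parse a JSON array from model output, tolerating markdown fences."""
    text = raw.strip()
    if text.startswith(FENCE):
        first_newline = text.find("\n")
        if first_newline != -1:
            # Drop the opening fence line, including any language tag on it
            text = text[first_newline + 1:]
        else:
            # Single-line reply: drop just the opening fence
            text = text[len(FENCE):]
        # Drop the closing fence if there is one
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    return data
```

Rejecting a non-array early keeps the error close to its cause, instead of surfacing later as a confusing failure during schema validation.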
For classification tasks with nuance, asking the model to think step-by-step improved accuracy from 78% to 91%.
Classify this support ticket. Think step by step:
1. What product area does this relate to?
2. Is this a bug report, feature request, or question?
3. What is the urgency (low/medium/high)?
Then return your answer as JSON: {"area": ..., "type": ..., "urgency": ...}
Key insight: The reasoning steps aren't just for the model. They also serve as an audit trail when a human reviews the classification.
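To keep the reasoning as an audit trail, the response has to be split into the free-text steps and the final JSON answer. The following is a minimal sketch of that split, assuming the model writes its reasoning first and ends with a single JSON object. `split_reasoning_and_answer` is a hypothetical helper name, not part of any library.

```python
import json


def split_reasoning_and_answer(response_text: str) -> tuple[str, dict]:
    """Return (reasoning, answer), where answer is the last top-level JSON object."""
    decoder = json.JSONDecoder()
    found = None
    pos = 0
    while True:
        start = response_text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value starting at `start` and
            # returns it with the index where that value ended
            obj, end = decoder.raw_decode(response_text, start)
        except json.JSONDecodeError:
            # A brace in the prose that is not JSON: keep scanning
            pos = start + 1
            continue
        found = (start, obj)
        # Jump past the parsed object so nested objects are not matched again
        pos = end
    if found is None:
        raise ValueError("no JSON object found in model response")
    start, answer = found
    return response_text[:start].strip(), answer
```

The reasoning string can then be stored next to the classification, so a reviewer sees why a ticket was marked high urgency and not only that it was.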
LLM calls fail. Rate limits hit. Latency spikes. Your feature needs a fallback.
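import json
import logging

from openai import RateLimitError  # assumes the OpenAI Python SDK (v1+)

logger = logging.getLogger(__name__)

async def summarize_with_fallback(text: str) -> str:
    try:
        # call_llm: the application's own async wrapper around the model call
        return await call_llm(text, timeout=5.0)
    except (TimeoutError, RateLimitError):  # asyncio timeouts match TimeoutError on Python 3.11+
        # Fallback: first 200 chars, cut at a word boundary, plus ellipsis
        if len(text) <= 200:
            return text
        return text[:200].rsplit(" ", 1)[0] + "..."
    except json.JSONDecodeError:
        logger.warning("LLM returned unparseable response")
        return "Summary unavailable"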
Best practice: Every LLM call should have a timeout, a retry budget, and a non-LLM fallback.
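The three pieces named here (timeout, retry budget, non-LLM fallback) can be combined in one small wrapper. This is a sketch under stated assumptions, not the implementation used in production: `call_with_budget` and `truncate_fallback` are hypothetical names, the model call is passed in as any async callable, and `retry_on` should be set to whatever exceptions your provider's SDK raises for rate limits.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_budget(
    llm_call: Callable[[str], Awaitable[str]],
    text: str,
    fallback: Callable[[str], str],
    timeout: float = 5.0,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[Exception], ...] = (),
) -> str:
    """Try llm_call up to max_attempts times, then return a non-LLM fallback."""
    retryable = (asyncio.TimeoutError,) + tuple(retry_on)
    for attempt in range(max_attempts):
        try:
            # Bound each attempt so one slow call cannot stall the request
            return await asyncio.wait_for(llm_call(text), timeout=timeout)
        except retryable:
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, then 2x, then 4x, ...
                await asyncio.sleep(base_delay * 2**attempt)
    # Retry budget exhausted: degrade instead of failing the feature
    return fallback(text)


def truncate_fallback(text: str, limit: int = 200) -> str:
    """Non-LLM fallback: the leading text, cut at a word boundary."""
    if len(text) <= limit:
        return text
    return text[:limit].rsplit(" ", 1)[0] + "..."
```

Exceptions outside `retry_on` propagate unchanged, which keeps programming errors visible instead of silently turning them into fallbacks.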
Instead of a 500-word system prompt explaining the format, give 2-3 examples:
Convert the user message to a database query.
Example: "orders from last week" -> SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days'
Example: "top customers by revenue" -> SELECT customer_id, SUM(amount) as revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 10
User: {user_message}
This is more reliable than describing the syntax rules in prose.
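One way to keep the examples maintainable is to store them as data and assemble the prompt in code. The sketch below rebuilds the prompt shown above from a list of pairs. `FEW_SHOT_EXAMPLES` and `build_few_shot_prompt` are hypothetical names chosen for illustration.

```python
# (user phrasing, expected query) pairs taken from the prompt above
FEW_SHOT_EXAMPLES = [
    (
        "orders from last week",
        "SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days'",
    ),
    (
        "top customers by revenue",
        "SELECT customer_id, SUM(amount) as revenue FROM orders "
        "GROUP BY customer_id ORDER BY revenue DESC LIMIT 10",
    ),
]


def build_few_shot_prompt(user_message: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Assemble the instruction, the examples, and the user message into one prompt."""
    lines = ["Convert the user message to a database query."]
    for question, query in examples:
        lines.append(f'Example: "{question}" -> {query}')
    lines.append(f"User: {user_message}")
    return "\n".join(lines)
```

With the examples in a list, adding or swapping one is a data change that can be reviewed and tested on its own.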
The models are impressive, but production reliability comes from everything around the model call.