Battle-tested prompt patterns from running LLM features in production: structured output, chain-of-thought, and graceful failure handling.
After running LLM-powered features for 8 months in production, these are the patterns that survived contact with real users and messy data.
Asking an LLM to "return JSON" works 90% of the time. The other 10% crashes your parser at 2 AM.
What we do:
```python
import json

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ExtractedEntity(BaseModel):
    name: str
    category: str
    confidence: float


SYSTEM_PROMPT = """Extract entities from the text.
Return ONLY valid JSON matching this schema:
{"name": string, "category": string, "confidence": number 0-1}
Return an array. No explanation, no markdown fences."""


def extract_entities(text: str) -> list[ExtractedEntity]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.1,
    )
    raw = response.choices[0].message.content.strip()
    # Strip markdown fences if the model adds them anyway
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(raw)
    return [ExtractedEntity(**item) for item in data]
```
Why it works: Low temperature, explicit schema in the prompt, and a defensive parser that handles the most common failure mode (markdown fences).
For classification tasks with nuance, asking the model to think step-by-step improved accuracy from 78% to 91%.
```
Classify this support ticket. Think step by step:
1. What product area does this relate to?
2. Is this a bug report, feature request, or question?
3. What is the urgency (low/medium/high)?

Then return your answer as JSON: {"area": ..., "type": ..., "urgency": ...}
```
Key insight: The reasoning steps aren't just for the model—they're also audit trails when a human reviews the classification.
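To use the reasoning as an audit trail, the reply has to be split from the final JSON answer. A minimal sketch of that split (the `parse_cot_response` helper is hypothetical, and it assumes the JSON object appears after the reasoning steps):

```python
import json


def parse_cot_response(raw: str) -> tuple[str, dict]:
    """Split a chain-of-thought reply into (reasoning, answer).

    Assumes the model emits its numbered reasoning first,
    followed by a single JSON object.
    """
    idx = raw.find("{")
    if idx == -1:
        raise ValueError("no JSON object found in response")
    reasoning = raw[:idx].strip()
    # raw_decode tolerates trailing text after the JSON object
    answer, _ = json.JSONDecoder().raw_decode(raw, idx)
    return reasoning, answer
```

The reasoning string can then be logged alongside the structured answer, so a reviewer sees why the model chose a label, not just the label.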
LLM calls fail. Rate limits hit. Latency spikes. Your feature needs a fallback.
```python
import json
import logging

from openai import RateLimitError

logger = logging.getLogger(__name__)


async def summarize_with_fallback(text: str) -> str:
    try:
        # call_llm is our own thin wrapper around the provider SDK
        return await call_llm(text, timeout=5.0)
    except (TimeoutError, RateLimitError):
        # Fallback: first 200 chars, cut at a word boundary
        return text[:200].rsplit(" ", 1)[0] + "..."
    except json.JSONDecodeError:
        logger.warning("LLM returned unparseable response")
        return "Summary unavailable"
```
Best practice: Every LLM call should have a timeout, a retry budget, and a non-LLM fallback.
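That rule can be wrapped once and reused across features. A sketch of the idea, assuming an async `fn` that takes the text and raises on failure (the helper name, retry counts, and backoff values are illustrative, not our production settings):

```python
import asyncio
import random


async def call_with_budget(fn, text, *, retries=2, timeout=5.0, fallback=None):
    """Run an LLM call with a timeout, a retry budget, and a non-LLM fallback."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(fn(text), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt < retries:
                # Exponential backoff with jitter before the next attempt
                await asyncio.sleep(2 ** attempt + random.random())
    # Retry budget exhausted: degrade to the non-LLM path
    return fallback(text) if fallback else "Summary unavailable"
```

The point of centralizing this is that no individual feature can forget the fallback: the degraded path is part of the call signature.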
Instead of a 500-word system prompt explaining the format, give 2-3 examples:
```
Convert the user message to a database query.

Example: "orders from last week" -> SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days'
Example: "top customers by revenue" -> SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 10

User: {user_message}
```
This is more reliable than describing the syntax rules in prose.
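It also helps to keep the examples as data rather than buried in a prompt string, so adding a case doesn't mean editing prose. A sketch with hypothetical names:

```python
# Few-shot examples as (user message, query) pairs
FEW_SHOT_EXAMPLES = [
    ("orders from last week",
     "SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days'"),
    ("top customers by revenue",
     "SELECT customer_id, SUM(amount) AS revenue FROM orders "
     "GROUP BY customer_id ORDER BY revenue DESC LIMIT 10"),
]


def build_query_prompt(user_message: str) -> str:
    """Assemble the few-shot prompt from the example pairs."""
    lines = ["Convert the user message to a database query.", ""]
    for message, sql in FEW_SHOT_EXAMPLES:
        lines.append(f'Example: "{message}" -> {sql}')
    lines.append("")
    lines.append(f"User: {user_message}")
    return "\n".join(lines)
```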
The models are impressive, but production reliability comes from everything around the model call.