A hands-on intro to prompt engineering. Learn the four levers (role, format, examples, constraints) and watch a vague prompt turn into a reliable one.
By the end of this post you'll know the four levers that turn vague LLM prompts into reliable ones, and you'll have run a side-by-side comparison showing how each lever changes output quality. The whole thing takes about twenty minutes and any LLM API will do — examples here use OpenAI, but the patterns transfer.
Less mysterious than it sounds. A "prompt" is the text you send to an LLM. "Engineering" the prompt means structuring that text so the model produces the answer you want, consistently, even on inputs you haven't tested.
The bad version of prompt engineering is rearranging adjectives and adding superlatives ("respond like an EXPERT", "you MUST be detailed"). Modern models ignore most of that.
The good version uses four levers, in order of impact:

1. Role: tell the model what job it's doing.
2. Format: show the exact shape you want back.
3. Examples: demonstrate the judgment calls that instructions can't capture.
4. Constraints: pin down behavior on inputs you haven't seen.
We'll walk through each by transforming a deliberately bad prompt into a good one.
pip install openai
export OPENAI_API_KEY="sk-..."
Save this as prompt_demo.py:
import openai

client = openai.OpenAI()

def ask(prompt: str, system: str | None = None) -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0,
    )
    return resp.choices[0].message.content
We use temperature=0 so output is as close to deterministic as the API gets: the same prompt almost always returns the same answer. That makes side-by-side comparisons meaningful.
Our running example: classify customer support tickets into one of four categories. Bad version first:
ticket = "My credit card was charged twice yesterday for the same order. Please refund the duplicate charge."
print(ask(f"What category is this ticket? {ticket}"))
You'll get something like:
This ticket is related to billing or payment issues. Specifically, it concerns
a duplicate charge on a credit card and a request for a refund...
That's an essay, not a category. Useless if you're routing tickets to teams.
SYSTEM = "You categorize customer support tickets to route them to the correct team."
print(ask(ticket, system=SYSTEM))
Better — but still too verbose. The model now knows what it's doing but doesn't know what shape you want back.
SYSTEM = """You categorize customer support tickets to route them to the correct team.
Respond with JSON like: {"category": "BILLING"}
Categories: BILLING, TECHNICAL, ACCOUNT, OTHER"""
print(ask(ticket, system=SYSTEM))
You should get:
{"category": "BILLING"}
That's parseable. The literal example in the system prompt anchors the format better than describing it ("respond with JSON containing a category field").
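Because the response is a bare JSON literal, you can feed it straight into routing code. A minimal sketch, assuming the ask() helper from above; the ALLOWED_CATEGORIES set and the fallback policy are our additions, not something the model enforces:

import json

ALLOWED_CATEGORIES = {"BILLING", "TECHNICAL", "ACCOUNT", "OTHER"}

raw = ask(ticket, system=SYSTEM)
category = json.loads(raw).get("category", "OTHER")
if category not in ALLOWED_CATEGORIES:
    category = "OTHER"  # never trust an invented label; fall back instead
print(category)  # BILLING

In production you would also catch json.JSONDecodeError, since a model can occasionally wrap its answer in Markdown fences.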
For straightforward classification, the system prompt above is enough. For tasks with subtle judgment calls — multi-label classification, structured extraction, tone matching — examples matter more than instructions.
Try this richer task: extract the customer's intent and the urgency level.
SYSTEM = """You extract intent and urgency from customer support tickets.
Examples:
Input: "My credit card was charged twice. Please refund."
Output: {"intent": "refund_request", "urgency": "high"}
Input: "How do I change my profile picture?"
Output: {"intent": "how_to_question", "urgency": "low"}
Input: "Production is down. Customers can't log in. URGENT."
Output: {"intent": "outage_report", "urgency": "critical"}
Now respond with the same JSON format for the new ticket."""
print(ask("My subscription auto-renewed but I cancelled last week.", system=SYSTEM))
You should get something like:
{"intent": "billing_dispute", "urgency": "high"}
Three examples is the sweet spot for most tasks — one easy, one ambiguous, one edge case. More examples eat tokens (cost + latency); fewer leave the model guessing about format.
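One pattern that makes iterating on examples painless: keep them as data and assemble the system prompt from a list. A minimal sketch; build_system and the EXAMPLES structure are our invention, not a library API:

import json

EXAMPLES = [
    ("My credit card was charged twice. Please refund.",
     {"intent": "refund_request", "urgency": "high"}),
    ("How do I change my profile picture?",
     {"intent": "how_to_question", "urgency": "low"}),
    ("Production is down. Customers can't log in. URGENT.",
     {"intent": "outage_report", "urgency": "critical"}),
]

def build_system(task: str, examples: list) -> str:
    lines = [task, "", "Examples:"]
    for text, label in examples:
        lines.append(f'Input: "{text}"')
        lines.append(f"Output: {json.dumps(label)}")
    lines.append("")
    lines.append("Now respond with the same JSON format for the new ticket.")
    return "\n".join(lines)

SYSTEM = build_system("You extract intent and urgency from customer support tickets.", EXAMPLES)

Swapping an example in or out becomes a one-line change, which matters once you start measuring which examples actually help.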
Constraints lock down behavior the examples don't cover. Three patterns that earn their place in real prompts:
Refusal phrasing. Force the model to say "I don't know" in a specific way you can detect:
SYSTEM += '\n\nIf the ticket is unclear or out of scope, respond exactly: {"intent": "unknown", "urgency": "unknown"}'
Now you can branch on the output: if intent == "unknown": route_to_human(). A full sketch follows this list.
Allowed value lists. For classification, list the only categories you accept. The model will pick from those instead of inventing new ones.
Length caps. "Respond in 50 words or fewer" in the prompt, or max_tokens=100 in the API call. Either one caps cost and forces concision.
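Here is the full version of the branch promised above, combining the refusal constraint with a token cap. A minimal sketch, assuming the client and the extended SYSTEM from earlier; route_to_human is a hypothetical stand-in for your escalation path:

import json

def route_to_human(ticket: str) -> None:
    # Placeholder: a real system would enqueue the ticket for an agent.
    print(f"Escalating to a human: {ticket!r}")

ticket = "asdf qwerty ????"  # deliberately unintelligible
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": ticket},
    ],
    temperature=0,
    max_tokens=100,  # hard cap: bounds cost and latency
)
result = json.loads(resp.choices[0].message.content)
if result["intent"] == "unknown":
    route_to_human(ticket)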
Stuffing the prompt with rules. "MUST NOT include greeting. MUST cite sources. MUST NOT use jargon. MUST..." — long rule lists conflict with each other and confuse the model. If you have more than ~5 rules, you probably need few-shot examples instead.
Capitalizing for emphasis. MUST is no more effective than must. Modern models don't weight capitalization. Save the keystrokes.
Putting the most important instruction in the middle. LLMs attend to the start and end of the prompt more than the middle. Put critical constraints (output format, refusal phrasing) at both ends.
Testing only on easy cases. A prompt that handles the obvious queries can fall apart on adversarial input or rare formats. Keep a small eval set of 20–50 tricky inputs and re-run it on every prompt change.
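A minimal eval harness is just a loop. A sketch, assuming the ask() helper and the intent-extraction SYSTEM; the cases and expected labels here are illustrative, and yours should come from real tickets that broke earlier prompt versions:

import json

EVAL_SET = [
    # (ticket, expected intent)
    ("Charged twice, refund please!!!", "refund_request"),
    ("how 2 reset my pw???", "how_to_question"),
    ("Nothing works. Fix everything.", "unknown"),
]

passed = 0
for text, expected in EVAL_SET:
    got = json.loads(ask(text, system=SYSTEM)).get("intent")
    ok = got == expected
    passed += ok
    print(f"{'ok  ' if ok else 'FAIL'} {text!r}: expected {expected}, got {got}")

print(f"{passed}/{len(EVAL_SET)} passed")

Run it after every prompt change; a regression on a case you already fixed is exactly the failure mode this catches.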
You now have the moves; everything beyond this is refinement on your own tasks.
Prompt engineering isn't magic. It's clear specification: tell the model what task it's doing, what shape you want back, and what to do when it isn't sure. The four levers above cover ~90% of real prompts. The rest is iteration.