When LLMs can call tools that change real state, the design decisions that matter most are about what's gated, what's automatic, and what triggers a human checkpoint.
We've been shipping LLM agents for about a year — workers that can call tools to read and write real state. The interesting design decisions aren't about the LLM at all; they're about what tools to expose, how to gate writes, and where to require a human checkpoint. This post is what we've landed on after a few near-misses and one real incident.
An LLM agent gets a task ("update the customer's mailing address"), decides which tool to call, and the tool changes something. The model isn't deterministic; the task isn't always well-specified; the input might be adversarial. The question is: what guardrails make this safe enough to ship?
Two failure modes we've seen:

- Calling the wrong tool: delete_customer instead of archive_customer.
- Calling the right tool with the wrong arguments: the correct operation pointed at the wrong record (a real example below).

Both are real. Both happen in a low percentage of cases, but at scale "low percentage" is real customer impact. The design problem is making the cost of these failures low enough to live with.
The single most important design decision: separate read tools from write tools and gate them differently.
Read tools (get_customer_profile, list_recent_orders, search_knowledge_base) — we expose freely. The model can call them as often as it wants. The blast radius is read-only; the worst case is some wasted tokens.
Write tools (update_address, cancel_subscription, issue_refund) — gated. Different model, different prompt, different invocation flow, different audit trail. Sometimes a different process entirely.
We had read+write tools in the same prompt early on. The model was happy to call writes for tasks where it should have just read. Splitting changed agent behavior dramatically — calling writes became deliberate, not incidental.
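A minimal sketch of the split, assuming a simple in-process dispatcher; the registries and the write_approved flag are hypothetical, not a real framework API:

```python
# Hypothetical tool registries; real implementations are stubbed out.
READ_TOOLS = {
    "get_customer_profile": lambda args: {"name": "stub"},
    "list_recent_orders": lambda args: [],
}

WRITE_TOOLS = {
    "update_address": lambda args: {"status": "updated"},
}

def dispatch(tool_name: str, args: dict, write_approved: bool = False):
    if tool_name in READ_TOOLS:
        # Reads are ungated: the worst case is wasted tokens.
        return READ_TOOLS[tool_name](args)
    if tool_name in WRITE_TOOLS:
        # Writes take a separate, gated path with its own audit trail.
        if not write_approved:
            raise PermissionError(f"{tool_name} requires the gated write flow")
        return WRITE_TOOLS[tool_name](args)
    raise ValueError(f"unknown tool: {tool_name}")
```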
Three patterns that earn their place:
Two-step: propose then confirm. The agent runs the write tool in "propose" mode, which returns a description of what would change. A human (or a second agent with a separate prompt) confirms before the actual write executes. For most consequential writes we use this. Adds latency; adds safety.
Dry-run by default. Some tools default to dry-run mode. The agent has to explicitly pass confirm=true to actually execute. The default makes the "I'm not sure" path safe; the explicit confirmation makes the "I'm sure" path deliberate.
Read-only impersonation in test mode. For some internal-only tools, we run the agent against a read-only mirror of production data, only swapping to live writes after we've validated the agent's behavior. The mirror is updated nightly; agents that read it can't affect production state.
The right pattern depends on the write's reversibility. Adding a tag to a record — fine, just do it. Issuing a $500 refund — propose, confirm, then execute. A sketch of the latter follows.
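Here's a minimal sketch of the first two patterns on one tool: propose-then-confirm, with dry-run as the default. issue_refund and its return shape are illustrative, not a real API:

```python
def issue_refund(order_id: str, amount: float, confirm: bool = False) -> dict:
    """Dry-run by default: without confirm=True, the tool only proposes."""
    description = f"Refund ${amount:.2f} on order {order_id}"
    if not confirm:
        # Step 1: propose. A human or a safety-review agent sees this first.
        return {"mode": "proposal", "would_do": description}
    # Step 2: execute, only after explicit confirmation upstream.
    # (The real payments call would go here.)
    return {"mode": "executed", "did": description}

print(issue_refund("ord_123", 500.00))
# {'mode': 'proposal', 'would_do': 'Refund $500.00 on order ord_123'}
print(issue_refund("ord_123", 500.00, confirm=True))
# {'mode': 'executed', 'did': 'Refund $500.00 on order ord_123'}
```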
Once you've decided writes need confirmation, the question is: who?
Two-step flow #1: agent → user → agent. The agent proposes a write to the user via the UI ("I'm about to cancel your subscription, confirm?"). User clicks confirm. Agent executes. Good for direct user-driven flows; not useful for automated backend agents.
Two-step flow #2: agent → second agent → execution. The second agent has a different prompt focused on safety review. Cheaper than human review at scale. Quality is OK for simple "does this look like a normal operation?" checks; not enough for high-stakes operations.
Two-step flow #3: agent → on-call human → execution. For genuinely high-stakes operations, a human in the loop. We use this for anything that touches money or destroys data.
We layer these. A typical agent run for a non-trivial task: agent proposes → safety-review agent checks against policy → if green, execute; if questionable, page on-call.
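In sketch form, with every function a hypothetical stub standing in for a real service:

```python
def safety_review_agent(proposal: dict) -> str:
    """Separate prompt, review-only. Returns 'green', 'questionable', or 'red'."""
    return "green"  # stub

def execute(proposal: dict) -> str:
    return f"executed: {proposal['would_do']}"  # stub

def page_on_call(proposal: dict) -> None:
    print(f"paging on-call: {proposal['would_do']}")  # stub

def run_consequential_write(proposal: dict) -> str:
    verdict = safety_review_agent(proposal)
    if verdict == "green":
        return execute(proposal)   # routine case: no human involved
    if verdict == "questionable":
        page_on_call(proposal)     # a human makes the final call
        return "escalated"
    return "rejected"              # hard no from the reviewer
```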
Vague tool descriptions give vague behavior. Specific schemas with examples in the description give predictable behavior.
Example of a bad schema:
```json
{
  "name": "update_user",
  "description": "Updates the user record.",
  "parameters": {"user_id": "string", "data": "object"}
}
```
Example of a good schema:
```json
{
  "name": "update_user_address",
  "description": "Updates only the mailing address on a user record. Use this for moving-address requests. Do NOT use this to update email, phone, or name — those have separate tools.",
  "parameters": {
    "user_id": {"type": "string", "description": "The internal user ID (not their email)"},
    "address": {
      "type": "object",
      "properties": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "postal_code": {"type": "string"},
        "country": {"type": "string", "description": "ISO 3166-1 alpha-2 country code"}
      },
      "required": ["street", "city", "country"]
    }
  }
}
```
The narrower the tool's scope, the easier it is for the model to use correctly. We've moved from a few broad tools to many narrow tools. "Update the user" became "update address", "update email", "update phone", "update name" — four tools instead of one. Each individually is easier for the model to call right.
Tool implementations validate every argument server-side. Don't trust the schema to constrain the model's output. We've seen:

- emails passed where internal user IDs were expected
- country names where ISO 3166 codes were required
- extra fields the schema never defined

For each, the validation rejects the call with a structured error. The agent often handles the error gracefully — it retries with a fix or asks the user for clarification. But the safety boundary is the server-side validation, not the model's compliance.
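A sketch of what that boundary can look like for the update_user_address tool above; the error shape and the email heuristic are assumptions, not our exact implementation:

```python
import re

class ToolValidationError(Exception):
    """Carries a structured payload the agent can read and recover from."""
    def __init__(self, field: str, message: str):
        super().__init__(message)
        self.payload = {"error": "validation_failed", "field": field, "message": message}

def update_user_address(user_id: str, address: dict) -> dict:
    # Re-check everything server-side; the schema is documentation, not a boundary.
    if "@" in user_id:
        raise ToolValidationError("user_id", "expected an internal user ID, not an email")
    if not re.fullmatch(r"[A-Z]{2}", address.get("country", "")):
        raise ToolValidationError("address.country", "expected an ISO 3166-1 alpha-2 code")
    for field in ("street", "city"):
        if not address.get(field):
            raise ToolValidationError(f"address.{field}", "required field is missing")
    # ... perform the actual write here ...
    return {"status": "updated"}
```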
Every tool invocation logs:

- the tool name and the full arguments
- the result, or the structured error
- a timestamp and the agent run that made the call

Why: when something goes wrong, the only way to debug is to see exactly what was called. We've reconstructed incidents weeks later from these logs. They live in S3 with one-year retention; expensive, but worth it.
We also log the LLM's reasoning trace (the <thinking> content or equivalent) for the most consequential operations. Reading "the model thought X about the user's intent" is sometimes the only way to understand why a wrong tool was called.
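A sketch of the record, one JSON object per invocation; the exact field names are illustrative:

```python
import json
import time
import uuid

def log_tool_call(run_id: str, tool: str, args: dict,
                  result=None, error=None, reasoning=None) -> None:
    record = {
        "call_id": str(uuid.uuid4()),
        "run_id": run_id,        # which agent run made the call
        "tool": tool,
        "arguments": args,       # the full arguments, exactly as received
        "result": result,
        "error": error,          # the structured error, if the call was rejected
        "reasoning": reasoning,  # <thinking> trace, consequential ops only
        "ts": time.time(),
    }
    print(json.dumps(record, default=str))  # stand-in for shipping to S3
```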
A real incident, anonymized: the agent had a complete_task tool that marked a customer's task as done. The agent decided a task was complete (correctly), but called the tool with the wrong task ID — one that referred to a different customer's task. The wrong task got marked done.
What we changed:

- complete_task now requires both task_id and a sanity-check field (the customer's email). Mismatch → rejected (sketched below).

None of these are clever. They're boring engineering. The boring engineering is what makes agents safe to ship.
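The cross-check itself is exactly that kind of boring; a sketch with a hypothetical in-memory task store:

```python
TASKS = {"task_42": {"customer_email": "jane@example.com", "done": False}}

def complete_task(task_id: str, customer_email: str) -> dict:
    task = TASKS.get(task_id)
    if task is None:
        return {"error": "not_found", "task_id": task_id}
    if task["customer_email"] != customer_email:
        # The two identifiers disagree: reject rather than guess.
        return {"error": "mismatch",
                "message": "task_id does not belong to this customer"}
    task["done"] = True
    return {"status": "completed", "task_id": task_id}
```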
A few patterns we've considered and skipped:
Letting agents create new tools. Some frameworks support agents writing their own code, then executing it. We don't. The blast radius is too large and we can't audit it.
Agents calling agents recursively. Same reason — unbounded recursion, unbounded cost, unbounded blast radius. We allow at most one level of agent-to-agent calls.
Tools that take freeform SQL or code as arguments. The model loves writing creative SQL. Some of it would be fine; some would not. We expose query builders with constrained parameters instead; a sketch follows this list.
Skipping the confirmation step "because the user is impatient." Speed isn't worth the failure modes. If users find confirmations annoying, that's a UX problem to fix differently — not by removing the safety layer.
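As promised, a sketch of a constrained query builder; the field and operator whitelists are illustrative, and values are always parameterized rather than interpolated:

```python
ALLOWED_FIELDS = {"status", "country", "created_at"}
ALLOWED_OPS = {"eq": "=", "gte": ">=", "lte": "<="}

def build_query(filters: list) -> tuple:
    clauses, params = [], []
    for f in filters:
        if f["field"] not in ALLOWED_FIELDS or f["op"] not in ALLOWED_OPS:
            raise ValueError(f"filter not allowed: {f}")
        clauses.append(f"{f['field']} {ALLOWED_OPS[f['op']]} %s")
        params.append(f["value"])
    sql = "SELECT id, status FROM orders WHERE " + " AND ".join(clauses)
    return sql, params  # the driver binds params; the model never writes SQL

sql, params = build_query([{"field": "status", "op": "eq", "value": "open"}])
# ('SELECT id, status FROM orders WHERE status = %s', ['open'])
```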
- Read tools wide open; write tools heavily gated. The most important design decision.
- Many narrow tools, not few broad ones. Easier for the model to pick correctly.
- Server-side argument validation, always. The schema is documentation, not a security boundary.
- Log every call. When something goes wrong (it will), the logs are how you debug.
- Two-step for consequential writes. Propose, confirm, execute. The latency cost is small; the safety improvement is large.
- Tabletop the failure modes. Walk through what happens if the agent calls each tool wrong. The exercise reveals which tools need more gating.
Agent tool design is mostly software engineering, not ML. Once you have those gates in place, the LLM at the core matters less than you'd think — the safety comes from the surrounding system. The teams that ship agents responsibly are the ones who treat tool-use as a security boundary, not a convenience.