We run a fleet of LLM agents on Kubernetes. They're stateful, bursty, and expensive — none of which K8s defaults are good at. Here's what we changed.
We run a small fleet of LLM-based agents on Kubernetes — autonomous workers that take jobs from a queue, call out to model providers, write back results, and sometimes spawn child tasks. The workload is wildly different from the stateless web services Kubernetes is designed for. This post is about the changes we made to the cluster and the deployment patterns to make it actually work.
Three properties that break standard Kubernetes assumptions:
Long-lived sessions. A single agent task can run for 5-30 minutes. During that time the agent has in-memory state (conversation history, working files in /tmp, partial results) that's expensive to recreate. Killing the pod mid-task = restart from scratch and re-pay the LLM cost.
Burstiness. Demand is queue-driven. Most of the time we run 5-10 agents. When a batch job kicks off we might need 200 for an hour. Standard HPA on CPU doesn't see this coming until pods are already saturated.
Cost dominated by external calls, not local compute. Each agent uses ~200m CPU and 512MB RAM. The expensive part is the tokens it sends to OpenAI/Anthropic. In the last billing period we spent $4,200 on API calls and ~$180 on the K8s nodes running the agents.
These three properties shape almost every architectural choice below.
Early on we tried running multiple agents in one pod (separate processes inside a single container). Memory accounting got weird, log streaming was hard, and an OOM from one agent killed all the others. We split to one-pod-per-agent and the operational overhead dropped dramatically.
Each pod runs:
The pod's resource requests:
resources:
  requests:
    cpu: 250m
    memory: 768Mi
  limits:
    cpu: 1000m
    memory: 1.5Gi
The CPU limit is generous because LLM streaming responses are bursty. Initial token generation is slow, then a flood, then idle while waiting for the next prompt. We set CPU limits well above the request to avoid throttling during the bursty phases.
Standard Kubernetes graceful shutdown sends SIGTERM, waits 30 seconds (default), then SIGKILL. That's fine for stateless web pods. For agents in the middle of a task that can run 5 to 30 minutes, it's wrong.
Our setup:
spec:
  terminationGracePeriodSeconds: 1800  # 30 minutes
  containers:
    - name: agent
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "/usr/local/bin/drain.sh"]
drain.sh does:
The 30-minute grace period is the upper bound of an agent task. We've never actually hit it; most tasks finish in under 10 minutes. But the safety margin matters.
The other piece: terminationGracePeriodSeconds is honored by kubectl delete pod, by Deployment rollouts, and by node drains. It is NOT honored by node failures or OOM kills; those are still hard kills. Spot reclamation can also cut the window short, since AWS gives only a two-minute interruption notice. So the queue-side retry logic has to assume any task can be lost.
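The agent-side half of this pattern can be sketched in Python: on SIGTERM, stop claiming new work and let the current task run to completion. This is a minimal sketch, not the actual drain logic. The class and method names are hypothetical, and a plain list stands in for the real queue.

```python
import signal
from typing import Callable, List


class DrainableWorker:
    """Processes tasks until asked to drain, then stops after the current one."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self.handler = handler
        self.draining = False
        self.completed: List[str] = []

    def request_drain(self, *_args) -> None:
        # Accepts (signum, frame) so it can double as a signal handler.
        self.draining = True

    def install_signal_handler(self) -> None:
        # Kubernetes sends SIGTERM at the start of the grace period.
        signal.signal(signal.SIGTERM, self.request_drain)

    def run(self, queue: List[str]) -> List[str]:
        """Consume from the front of `queue`; return the tasks left unclaimed."""
        while queue and not self.draining:
            task = queue.pop(0)
            self.handler(task)  # the running task is never interrupted
            self.completed.append(task)
        return queue
```

Tasks left unclaimed stay on the queue for another pod. A task that is hard-killed mid-run still has to be covered by the queue's own retry behavior.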
CPU-based HPA doesn't work for our agents — by the time CPU goes up, we already have a queue backlog. We use KEDA with a queue-depth scaler:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-scaler
spec:
  scaleTargetRef:
    name: agent-deployment
  minReplicaCount: 5
  maxReplicaCount: 200
  pollingInterval: 15
  cooldownPeriod: 600  # 10 min; match agent task length
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: <sqs-url>
        queueLength: "5"  # 5 messages per pod target
Two settings to highlight:
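cooldownPeriod: 600. KEDA waits 10 minutes after the trigger last reported activity before scaling down. The aim is to keep an agent from finishing one task and getting killed before it picks up the next one. One caveat: KEDA documents cooldownPeriod as applying to the final step down to zero replicas, so with minReplicaCount: 5 the scale-down from 200 toward 5 is paced by the underlying HPA's stabilization window, which likely needs the same tuning.
queueLength: 5. We want each pod to handle up to 5 queued tasks (one running, ~4 waiting). Lower = more aggressive scale-out, higher = more queue waiting before scaling.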
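The scaling arithmetic behind queueLength can be approximated in a few lines of Python. This is a simplified sketch of how an average-value target turns queue depth into a replica count; it ignores the HPA's tolerance band and stabilization behavior, and the function name is hypothetical.

```python
import math


def desired_replicas(queue_depth: int, queue_length: int = 5,
                     min_replicas: int = 5, max_replicas: int = 200) -> int:
    """Approximate replica count for a per-pod queue-depth target."""
    wanted = math.ceil(queue_depth / queue_length)
    # Clamp to the ScaledObject's minReplicaCount / maxReplicaCount.
    return max(min_replicas, min(max_replicas, wanted))
```

With the values above, a backlog of 100 messages asks for 20 pods, and anything past 1,000 messages is capped at 200.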
Agents are interruptible (queue handles retries), so they run on spot nodes. The cost savings are large — about 60% off on-demand for the instance types we use.
We use Karpenter for node provisioning. The agent NodePool:
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: workload
          value: agent
          effect: NoSchedule
The taint ensures only agent pods (with matching tolerations) land here. Web services land on a separate on-demand NodePool.
Without limits, a misbehaving agent (e.g., stuck in a loop) can spend $50 in tokens before anyone notices. We enforce two layers:
Per-task hard cap: each task is configured with max_tokens=4000 for the LLM call. Agents that need more context use retrieval or summarization, not bigger prompts.
Per-agent budget: each agent tracks cumulative tokens for the task. If it exceeds 50k tokens (across multiple LLM calls within one task), the agent self-terminates the task and reports failure.
We hit the per-agent limit twice in the first month — both were bugs (an infinite tool-use loop in one case, a recursive subagent spawn in another). Without the cap, both would have run for hours.
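A cumulative budget like this needs very little code. The sketch below assumes the budget counts input and output tokens together, which the setup above does not spell out; the class names are hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task's cumulative token use passes its budget."""


class TokenBudget:
    def __init__(self, limit: int = 50_000) -> None:
        self.limit = limit
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> int:
        """Add one LLM call's usage; raise once the total exceeds the limit."""
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")
        return self.limit - self.used  # tokens remaining
```

The agent would call record() after every LLM response and treat BudgetExceeded as a signal to fail the task and report it, so a runaway loop stops at the cap however many calls it makes.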
Each task gets a trace. We use OpenTelemetry; spans include agent.task, llm.call (with model name, input tokens, output tokens, latency), agent.tool.call (with tool name), and agent.task.result.
We also tag each span with the agent ID, task ID, and customer ID. The dashboard shows:
The "most expensive" view caught a regression: a prompt change that bloated context by 3x and tripled cost on one task type. We caught it within an hour.
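A view like that boils down to a group-by over llm.call spans. The sketch below works on plain dictionaries rather than a real trace backend, and the task_type field is an assumed attribute; it ranks task types by average tokens per task.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tokens_by_task_type(spans: Iterable[dict]) -> List[Tuple[str, float]]:
    """Average tokens per task for each task type, most expensive first."""
    per_task: Dict[Tuple[str, str], int] = defaultdict(int)
    for span in spans:
        if span["name"] != "llm.call":
            continue  # only LLM calls carry token counts
        key = (span["task_type"], span["task_id"])
        per_task[key] += span["input_tokens"] + span["output_tokens"]

    by_type: Dict[str, List[int]] = defaultdict(list)
    for (task_type, _task_id), tokens in per_task.items():
        by_type[task_type].append(tokens)

    averages = [(t, sum(v) / len(v)) for t, v in by_type.items()]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)
```

Comparing this ranking before and after a prompt change is one way a 3x jump on a single task type would stand out.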
Agents have working memory — files, intermediate results, in-progress reasoning. Where does it live?
We tried three approaches:
s3://agents/<task-id>/<file>, lifecycle policy deletes after 7 days.
We went with #3. The agent sidecar mounts an S3 path via goofys (FUSE) and treats it like local disk. It's slow for many small files but fine for our access patterns. Pod death no longer means lost state.
OpenAI and Anthropic both have rate limits per organization. With 200 agents in flight, we exceeded those limits during the first big batch run.
Solution: a small Redis-backed token bucket in front of each provider. Agents acquire a token before making an LLM call:
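import asyncio

async def with_rate_limit(provider: str) -> None:
    """Block until the shared bucket for `provider` grants one token."""
    while True:
        # redis.token_bucket.acquire is a helper wrapper, not a redis-py call.
        if await redis.token_bucket.acquire(provider, tokens=1):
            return
        await asyncio.sleep(0.1)  # back off briefly, then retry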
Bucket sizes match the per-provider limits with ~10% headroom. Agents queue locally during bursts; we don't hit the provider's hard 429.
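For readers unfamiliar with the mechanism, here is a minimal in-process token bucket. It is a sketch of the algorithm only: the setup above keeps this state in Redis so all agents share it, and the class name and injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Single-process token bucket with continuous refill."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity  # start full
        self.updated = clock()

    def acquire(self, tokens: float = 1) -> bool:
        """Take `tokens` if available; return False without blocking otherwise."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```

Setting capacity and refill rate about 10% under the provider's published limit gives the headroom described above. A shared version has to make the refill-and-take step atomic, which is typically done with a server-side script.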
Debugging a stuck agent. When an agent is in a tight loop or waiting on a slow tool call, kubectl logs shows the LLM responses but not the agent's internal state. We added periodic state dumps (every 30s, the agent writes its current state to the trace as an event) which made debugging much easier.
Reproducing failures. Agents are nondeterministic. The same task with the same inputs gives different outputs. When one fails, "rerun with the same inputs" doesn't always reproduce. We log the full LLM request/response chain for failed tasks; that's how we debug.
Cost surprise. A new task type can be 10x more expensive than expected because of how the agent decomposes it. We have a "shadow mode" for new task types: the agent runs but doesn't actually execute the actions, just reports what it would have done and the cost. We graduate from shadow to production after seeing per-task costs in the expected range.
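One way to picture shadow mode is a wrapper around the agent's tools that records intended calls instead of running them. This is a hypothetical sketch, not the production mechanism; the class name and the record format are assumptions.

```python
from typing import Any, Callable, Dict, List


class ShadowExecutor:
    """Wraps tool functions; in shadow mode, records calls instead of running them."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], shadow: bool) -> None:
        self.tools = tools
        self.shadow = shadow
        self.planned: List[dict] = []

    def call(self, name: str, **kwargs: Any) -> Any:
        if self.shadow:
            # Record the intended action; no side effects happen.
            self.planned.append({"tool": name, "args": kwargs})
            return None
        return self.tools[name](**kwargs)
```

The LLM calls still run and still cost money in shadow mode, which is the point: the planned list shows what the agent would have done, and the trace shows what it cost. A real agent would probably need stubbed tool results rather than None to keep its reasoning on track.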
Provider outages. When OpenAI has a regional issue, all our agents stall. Multi-provider failover would help; we haven't built it yet because most of our prompts are tuned for one provider and switching mid-stream would degrade quality.
Don't treat agents like web services. The standard K8s deployment patterns (rolling updates, fast restarts, CPU-based scaling) are wrong for stateful, long-running, bursty workloads. Tune accordingly.
KEDA on queue depth, not HPA on CPU. Queue depth is the leading indicator. CPU is a lagging one.
Long termination grace periods. Match the upper bound of task length, plus margin.
Per-task cost guardrails before scaling up. A bug at 5 agents costs $20. The same bug at 200 agents costs $800. Have hard caps before you scale.
One agent per pod. Easier to reason about, easier to debug, easier to scale.
The combination of "long-running" and "bursty" and "expensive external calls" makes agent workloads weirder than most. K8s can run them fine — once you stop using its defaults.