We run a fleet of LLM agents on Kubernetes. They're stateful, bursty, and expensive — none of which K8s defaults are good at. Here's what we changed.

Orchestrating AI Agents on Kubernetes

We run a small fleet of LLM-based agents on Kubernetes — autonomous workers that take jobs from a queue, call out to model providers, write back results, and sometimes spawn child tasks. The workload is wildly different from the stateless web services Kubernetes is designed for. This post is about the changes we made to the cluster and the deployment patterns to make it actually work.

What's different about agent workloads

Three properties that break standard Kubernetes assumptions:

Long-lived sessions. A single agent task can run for 5-30 minutes. During that time the agent has in-memory state (conversation history, working files in /tmp, partial results) that's expensive to recreate. Killing the pod mid-task = restart from scratch and re-pay the LLM cost.

Burstiness. Demand is queue-driven. Most of the time we run 5-10 agents. When a batch job kicks off we might need 200 for an hour. Standard HPA on CPU doesn't see this coming until pods are already saturated.

Cost dominated by external calls, not local compute. Each agent uses ~200m CPU and 512MB RAM. The expensive part is the tokens it exchanges with OpenAI/Anthropic. In the last billing period we spent $4,200 on API calls and ~$180 on the K8s nodes running the agents, roughly 23x more on tokens than on compute.

These three properties shape almost every architectural choice below.

Pod design: one agent per pod

Early on we tried running multiple agents in one pod (separate processes inside a single container). Memory accounting got weird, log streaming was hard, and an OOM in one agent killed all the others. We split to one-pod-per-agent and the operational overhead dropped dramatically.

Each pod runs:

The agent binary as the main process
A sidecar that handles graceful shutdown (more on this below)
A logging sidecar that ships structured logs to our aggregator

The pod's resource requests and limits:

yaml.yaml

resources:
  requests:
    cpu: 250m
    memory: 768Mi
  limits:
    cpu: 1000m
    memory: 1.5Gi

The CPU limit is 4x the request because handling streamed LLM responses is bursty: the agent waits for the first token, processes a flood of tokens, then idles until the next prompt. The headroom avoids CPU throttling during the flood.

Don't kill agents mid-task: pod lifecycle

Standard Kubernetes graceful shutdown sends SIGTERM, waits 30 seconds (default), then sends SIGKILL. That's fine for stateless web pods. For agents in the middle of a 5-30 minute task, it's wrong.

Our setup:

yaml.yaml

spec:
  terminationGracePeriodSeconds: 1800  # 30 minutes
  containers:
    - name: agent
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "/usr/local/bin/drain.sh"]

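The preStop hook runs before the container receives SIGTERM, and its runtime counts against the grace period. drain.sh does: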

Tells the agent's queue connection to stop accepting new tasks
Waits for the current task to finish (or times out)
Sends a "task interrupted, retry available" message to the queue if needed
Exits
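The drain steps above can be sketched in Python. This is a minimal sketch, not the contents of drain.sh: `DrainableWorker` and its methods are hypothetical stand-ins for the agent's queue client and task runner, and the "requeue" step only flips a flag where a real system would publish the "task interrupted, retry available" message.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class DrainableWorker:
    """Hypothetical stand-in for the agent's queue client and task runner."""
    accepting: bool = True
    task_done: threading.Event = field(default_factory=threading.Event)
    requeued: bool = False

    def stop_accepting(self) -> None:
        self.accepting = False

    def requeue_current_task(self) -> None:
        # A real implementation would publish "task interrupted, retry available".
        self.requeued = True


def drain(worker: DrainableWorker, timeout_s: float) -> bool:
    """Run the drain steps; return True if the current task finished in time."""
    worker.stop_accepting()                      # 1. take no new tasks
    finished = worker.task_done.wait(timeout_s)  # 2. wait for the current task
    if not finished:
        worker.requeue_current_task()            # 3. hand the task back for retry
    return finished                              # 4. the caller exits after this
```

The timeout passed to `drain` should stay below terminationGracePeriodSeconds so that the requeue step still runs before the kubelet escalates to SIGKILL.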

The 30-minute grace period is the upper bound of an agent task. We've never actually hit it; most tasks finish in under 10 minutes. But the safety margin matters.

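The other piece: terminationGracePeriodSeconds is honored by kubectl delete pod, by Deployment rollouts, and by node drains. It is NOT honored by node failures or OOM kills, where the kernel sends SIGKILL directly. A spot reclamation is not much better: it gives about two minutes of notice before the instance goes away. Those are still hard kills. So the queue-side retry logic has to assume any task can be lost.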

Scaling: KEDA against the queue depth

CPU-based HPA doesn't work for our agents — by the time CPU goes up, we already have a queue backlog. We use KEDA with a queue-depth scaler:

yaml.yaml

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-scaler
spec:
  scaleTargetRef:
    name: agent-deployment
  minReplicaCount: 5
  maxReplicaCount: 200
  pollingInterval: 15
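  cooldownPeriod: 600  # 10 min; KEDA applies this only when scaling to zero
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: <sqs-url>
        awsRegion: <aws-region>  # required by the SQS scaler
        queueLength: "5"  # target of 5 messages per pod

Two settings to highlight:

cooldownPeriod: 600 only controls how long KEDA waits after the last active trigger before scaling to zero. With minReplicaCount: 5 we never scale to zero, so on its own it does not delay scale-down. Scaling between 5 and 200 replicas is handled by the HPA that KEDA creates, whose scale-down stabilization window defaults to 300 seconds and can be raised through advanced.horizontalPodAutoscalerConfig.behavior.scaleDown.stabilizationWindowSeconds. Either way, the HPA does not know which pods are busy, so what actually protects an agent that is mid-task during scale-down is the preStop drain and the long grace period from the previous section.

queueLength: 5 is the target number of queued messages per pod. KEDA aims for roughly one pod per 5 messages, so each pod has one task running and ~4 waiting. Lower = more aggressive scale-out, higher = more queue waiting before scaling.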
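To make the queue-depth target concrete, here is a rough sketch of the arithmetic. It is an approximation under stated assumptions: the real HPA also applies a tolerance band and stabilization windows, and the function name and defaults below are ours, chosen to mirror the ScaledObject above.

```python
import math


def desired_replicas(queue_depth: int, queue_length: int = 5,
                     min_replicas: int = 5, max_replicas: int = 200) -> int:
    """Approximate the replica count a per-pod queue-depth target produces."""
    if queue_length <= 0:
        raise ValueError("queue_length must be positive")
    wanted = math.ceil(queue_depth / queue_length)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 40, 500, 5000):
    print(depth, desired_replicas(depth))
```

With these settings the ceiling of 200 pods is reached at a depth of 1,000 messages. Anything deeper just waits in the queue.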

Node strategy: spot for agents, on-demand for control plane

Agents are interruptible (queue handles retries), so they run on spot nodes. The cost savings are large — about 60% off on-demand for the instance types we use.

We use Karpenter for node provisioning. The agent NodePool:

yaml.yaml

spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: workload
          value: agent
          effect: NoSchedule

The taint ensures only agent pods (with matching tolerations) land here. Web services land on a separate on-demand NodePool.

Cost guardrails: per-agent token budgets

Without limits, a misbehaving agent (e.g., stuck in a loop) can spend $50 in tokens before anyone notices. We enforce two layers:

Per-call hard cap: each LLM call is configured with max_tokens=4000, which bounds the output of a single call. Prompt size is a separate concern: agents that need more context use retrieval or summarization, not bigger prompts.
Per-agent budget: each agent tracks cumulative tokens for the task. If it exceeds 50k tokens (across multiple LLM calls within one task), the agent self-terminates the task and reports failure.
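The cumulative budget can be sketched as a small counter that every LLM call reports into. This is an illustrative sketch, not our agent code: the class names are hypothetical, and counting input plus output tokens toward the 50k limit is an assumption.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a task spends more tokens than its budget allows."""


class TaskTokenBudget:
    def __init__(self, limit: int = 50_000) -> None:
        self.limit = limit
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> int:
        """Add one LLM call's usage; raise once the task is over budget."""
        # Assumption: both prompt and completion tokens count toward the limit.
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise TokenBudgetExceeded(
                f"task used {self.used} tokens, budget is {self.limit}"
            )
        return self.limit - self.used  # tokens remaining
```

The agent loop would catch `TokenBudgetExceeded`, stop the task, and report failure to the queue, which is what turns a runaway loop into a bounded cost.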

We hit the per-agent limit twice in the first month — both were bugs (an infinite tool-use loop in one case, a recursive subagent spawn in another). Without the cap, both would have run for hours.

Observability: tracing per agent task

Each task gets a trace. We use OpenTelemetry; spans include:

The top-level task (agent.task)
Each LLM call (llm.call with model name, input tokens, output tokens, latency)
Each tool use (agent.tool.call with tool name)
The final outcome (agent.task.result)

We also tag each span with the agent ID, task ID, and customer ID. The dashboard shows:

p50/p95 task duration by task type
Cost per task (tokens × model price)
Success rate per task type
Most expensive tasks in the last hour

The "most expensive" view surfaced a regression within an hour: a prompt change had bloated context by 3x and tripled cost on one task type.
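The "tokens × model price" figure on the dashboard can be computed from the token counts already recorded on each llm.call span. A minimal sketch, assuming input and output tokens are priced separately per million tokens. The `ModelPrice` type is hypothetical and any prices passed in are placeholders, not real provider rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per one million tokens; supply real rates from your provider."""
    input_per_mtok: float
    output_per_mtok: float


def call_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single LLM call."""
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


def task_cost(price: ModelPrice, calls: list[tuple[int, int]]) -> float:
    """Sum the cost of every (input_tokens, output_tokens) call in a task."""
    return sum(call_cost(price, i, o) for i, o in calls)
```

Because input tokens are counted on every call, a prompt change that triples context shows up directly as a jump in per-task cost.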

State management: working memory

Agents have working memory — files, intermediate results, in-progress reasoning. Where does it live?

We tried three approaches:

In-pod /tmp: simplest, but lost on pod death.
PersistentVolumeClaim per agent: too much overhead — PVCs take time to provision, and pod scheduling is constrained by which AZ the volume is in.
S3 with per-task prefix: state writes go to s3://agents/<task-id>/<file>, lifecycle policy deletes after 7 days.

We went with the third option. The agent sidecar mounts an S3 path via goofys (FUSE) and treats it like local disk. It's slow for many small files but fine for our access patterns. Pod death no longer means losing the files a task has already written.

Concurrency control: rate limiting per provider

OpenAI and Anthropic both have rate limits per organization. With 200 agents in flight, we exceeded those limits during the first big batch run.

Solution: a small Redis-backed token bucket in front of each provider. Agents acquire a token before making an LLM call:

python.python

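import asyncio

async def with_rate_limit(provider: str) -> None:
    # redis.token_bucket is our Redis-backed bucket helper, not part of a Redis client library.
    while True:
        # acquire() returns True once a token for this provider is granted.
        if await redis.token_bucket.acquire(provider, tokens=1):
            return
        await asyncio.sleep(0.1)  # wait briefly before polling again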

Bucket sizes are set about 10% below each provider's limit. During bursts agents wait locally for a token instead of hitting the provider's hard 429.
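For readers unfamiliar with the algorithm, here is a single-process token bucket. It is a sketch of the technique only: a limiter shared by 200 pods has to keep this state in Redis and update it atomically, which this in-memory version does not attempt. The class and parameter names are ours.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket: refills continuously up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock  # injectable so tests can control time
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_s)
        self.updated = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```

Capacity sets how large a burst is allowed, and the refill rate sets the sustained request rate, so the headroom below the provider limit belongs in the refill rate.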

What's hard about agent ops

Debugging a stuck agent. When an agent is in a tight loop or waiting on a slow tool call, kubectl logs shows the LLM responses but not the agent's internal state. We added periodic state dumps (every 30s, the agent writes its current state to the trace as an event) which made debugging much easier.

Reproducing failures. Agents are nondeterministic. The same task with the same inputs can give different outputs, so when one fails, "rerun with the same inputs" doesn't always reproduce it. We log the full LLM request/response chain for failed tasks; that's how we debug.

Cost surprise. A new task type can be 10x more expensive than expected because of how the agent decomposes it. We have a "shadow mode" for new task types: the agent runs but doesn't actually execute the actions, just reports what it would have done and the cost. We graduate from shadow to production after seeing per-task costs in the expected range.
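One way to structure a shadow mode is to route every side-effecting action through a single runner that can be switched to record-only. The sketch below is hypothetical and is not our implementation: `ActionRunner` and its fields are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ActionRunner:
    """Runs agent actions, or only records them when shadow is True."""
    shadow: bool
    planned: list[str] = field(default_factory=list)

    def run(self, name: str, action: Callable[[], Any]) -> Any:
        self.planned.append(name)  # always record what the agent wanted to do
        if self.shadow:
            return None            # shadow mode: report, do not execute
        return action()
```

LLM calls still happen in shadow mode, so token costs are measured for real while the actions themselves are only listed in `planned`.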

Provider outages. When OpenAI has a regional issue, all our agents stall. Multi-provider failover would help; we haven't built it yet because most of our prompts are tuned for one provider and switching mid-stream would degrade quality.

What we'd tell a team starting

Don't treat agents like web services. The standard K8s deployment patterns (rolling updates, fast restarts, CPU-based scaling) are wrong for stateful, long-running, bursty workloads. Tune accordingly.

KEDA on queue depth, not HPA on CPU. Queue depth is the leading indicator. CPU is a lagging one.

Long termination grace periods. Match the upper bound of task length, plus margin.

Per-task cost guardrails before scaling up. A bug at 5 agents costs $20. The same bug at 200 agents costs $800. Have hard caps before you scale.

One agent per pod. Easier to reason about, easier to debug, easier to scale.

The combination of "long-running" and "bursty" and "expensive external calls" makes agent workloads weirder than most. K8s can run them fine — once you stop using its defaults.


About Kiril Urbonas