We started routing 90% of LLM traffic through a small internal gateway. The gateway wasn't planned — it emerged from solving the same problem in 5 places. Here's the shape it took.
When we started using LLMs in production, every service called OpenAI directly with its own client library. Three months later we had five services doing roughly the same thing — adding retries, capturing metrics, falling back on overload, swapping models per request. None of them did it well. We extracted the common logic into a small gateway service and routed all traffic through it.
The gateway wasn't a planned project. It was the result of repeatedly fixing the same problem in different places. The shape it landed on is below.
A thin HTTP service that sits between application code and LLM providers (OpenAI, Anthropic, our self-hosted Llama). Responsibilities:
The gateway speaks an OpenAI-compatible API at the boundary, so most existing client code worked with a base-URL change.
[ application service ] → [ gateway (us-east-1) ] → [ OpenAI ]
↑ → [ Anthropic ]
│ → [ Self-hosted Llama (gpu cluster) ]
↓
[ Redis (cache) ]
About 200 LoC of routing logic, plus standard middleware (auth, metrics, retries). Stateless — caching is in Redis, secrets are in Vault, everything else is config.
Latency overhead: ~3-5ms p95. Worth it.
Different requests want different things. The gateway accepts policy hints in the request:
{
"model": "auto",
"messages": [...],
"policy": {
"priority": "cost", // or "latency", "quality"
"max_cost_usd": 0.005,
"fallback_allowed": true
}
}
The gateway maps model: auto + priority: cost to the cheapest provider that meets the route's quality requirements (defined per route in config). For priority: latency, it picks the fastest provider available right now, accounting for current load.
This let us decouple "which model" from application code. Rolling out a new provider for a route is a config change, not a code change.
Every provider call wraps a retry policy:
attempt 1: provider X (primary)
on retryable error (5xx, rate limit):
attempt 2: provider X with 1s backoff
attempt 3: provider X with 4s backoff
attempt 4: provider Y (failover) with 1s backoff
give up
Retryable errors are clearly defined: 429 (rate limit), 500/502/503/504, network timeouts. Non-retryable: 400 (bad request), 401 (auth), context-length errors, content-policy violations. We don't retry on those — they're caller errors.
Failover to provider Y happens only after primary retries are exhausted. The hop adds latency; we'd rather wait the few extra hundred ms on retries to provider X than fail over too eagerly.
Failover stickiness: once we fail over to provider Y for a route, subsequent requests for that route also go to Y for the next 60 seconds. If primary recovers, traffic shifts back. Without stickiness, we'd flap — half-recovered providers would alternate.
The gateway has a built-in semantic cache:
# Pseudocode
def handle_request(req):
embed = await embedding(req.messages_concat())
similar = await redis.zsearch(embed, top_k=5, threshold=0.92)
if similar:
match = similar[0]
# Verify the match is "close enough" — same model, same temperature, etc.
if match.metadata == req.metadata:
return match.cached_response
response = await call_provider(req)
await redis.zwrite(embed, response, ttl=24*3600)
return response
Hit rate stabilises around 18% for our chat workload. Higher for some specific routes (FAQ-shaped content can be 35%+); near-zero for highly personalised requests.
Cache cost: roughly $0.0001 per query in embedding cost, vs avoided LLM cost of $0.001-0.005. Net positive even at low hit rates.
Every request emits a structured event:
{
"request_id": "...",
"route": "...",
"caller_service": "...",
"provider": "openai",
"model": "gpt-4o-mini",
"input_tokens": 1450,
"output_tokens": 230,
"cost_usd": 0.000356,
"latency_ms": 1180,
"cached": false,
"retried": false,
"failover_used": false
}
These go to ClickHouse for analytics. Dashboards built on top:
Each calling service has a quota:
rate_limits:
support_assistant: { rps: 10, daily_budget_usd: 50 }
data_enrichment: { rps: 50, daily_budget_usd: 200 }
internal_bot: { rps: 2, daily_budget_usd: 10 }
The gateway tracks both rate (RPS) and cumulative cost. When either limit is hit, calls return 429 with Retry-After. The calling service is expected to back off.
This was added after one incident: a deployment of an internal tool with a config bug looped infinitely calling the LLM. In two hours it generated $400 of OpenAI cost. The gateway's quota would have caught it within 30 minutes at the configured budget.
We tune budgets per service, with finance review. Budgets force people to think about cost upfront when proposing new LLM-using features.
The gateway requires a short-lived JWT issued by our internal auth service:
Authorization: Bearer eyJhbGciOiJSUzI1NiIs...
The JWT contains the caller service identity, the routes it's allowed to use, and a 1-hour expiration. Provider API keys (OpenAI, Anthropic) are never exposed to calling services — they live only on the gateway.
This means: a compromised application service can use the LLM up to its quota, but cannot exfiltrate the OpenAI API key to use it elsewhere. The blast radius of a compromise is bounded.
The first version of the gateway tried to be clever about provider routing — predicting which provider would be fastest based on recent latency. The prediction was right ~70% of the time and wrong dramatically the rest. We replaced it with a simpler "primary, fallback on error" model. The simpler version is what actually works in practice.
We also initially routed everything through a single gateway instance. Latency was fine but the gateway became a single point of failure. We now run 3 instances behind a load balancer; any one can fail without impacting traffic.
What we watch:
Three concrete things become easier with the gateway:
Provider experiments. Want to try Claude for one route? Edit the gateway config; no app deploy needed. Roll back if it doesn't work; same.
Cost optimization. Per-route, per-customer cost is visible. Optimization is targeted, not vibes-based.
Provider outages. When OpenAI had a multi-hour incident last quarter, our services kept running with degraded latency on Anthropic, no code changes required. Without the gateway, every service would have had to handle the outage independently or stay down.
Build the gateway when you have ≥3 services using LLMs and they're starting to duplicate logic. Earlier than that and the abstraction is premature. Later than that and the migration cost is significant.
Make it OpenAI-API-compatible at the boundary. Existing client libraries work. Migration becomes a base-URL change.
Don't try to ship every feature in v1. Start with: provider routing, retries, basic metrics. Add caching, rate limiting, advanced policy in subsequent iterations. Each addition is small; the gateway becomes more capable over time.
Keep it stateless. State (cache, secrets, quotas) goes in Redis or Vault. The gateway itself is a few stateless instances behind a load balancer. Easy to deploy, easy to scale, easy to debug.
The biggest win, six months in, isn't any specific feature — it's having a single chokepoint for the LLM-call pattern. When something needs to change (a new provider, a new policy, a new metric), there's one place to change it. That's worth a lot.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.
We've shipped three end-to-end ML systems. The pieces that look obvious in slides and turn out to be the actual work.
Explore more articles in this category
Embedding indexes degrade silently. The signals that catch drift, how often to re-embed, and the operational patterns we built after one quiet quality regression.
Streaming LLM responses is easy until the client disconnects, the model stalls, or the user cancels. The patterns that keep streaming responsive without leaking spend.
When LLMs can call tools that change real state, the design decisions that matter most are about what's gated, what's automatic, and what triggers a human checkpoint.
Evergreen posts worth revisiting.