We started routing 90% of LLM traffic through a small internal gateway. The gateway wasn't planned — it emerged from solving the same problem in 5 places. Here's the shape it took.

On this page

LLM Gateway Design for Multi-Provider Inference: Architecture Review

When we started using LLMs in production, every service called OpenAI directly with its own client library. Three months later we had five services doing roughly the same thing — adding retries, capturing metrics, falling back on overload, swapping models per request. None of them did it well. We extracted the common logic into a small gateway service and routed all traffic through it.

The gateway wasn't a planned project. It was the result of repeatedly fixing the same problem in different places. The shape it landed on is below.

What the gateway does #

A thin HTTP service that sits between application code and LLM providers (OpenAI, Anthropic, our self-hosted Llama). Responsibilities:

Provider routing: pick the right provider based on policy (cost, latency, model availability)
Retries with backoff: handle transient failures uniformly
Failover: when a provider degrades, shift traffic to alternatives
Cost accounting: emit per-request metrics that include cost
Caching: handle the semantic cache layer
Rate limiting per caller: prevent one runaway service from exhausting our API quota
Auth: short-lived tokens for internal callers; provider keys live only on the gateway

The gateway speaks an OpenAI-compatible API at the boundary, so most existing client code worked with a base-URL change.

The shape #

code

[ application service ] → [ gateway (us-east-1) ] → [ OpenAI ]
                                ↑                  → [ Anthropic ]
                                │                  → [ Self-hosted Llama (gpu cluster) ]
                                ↓
                         [ Redis (cache) ]

About 200 LoC of routing logic, plus standard middleware (auth, metrics, retries). Stateless — caching is in Redis, secrets are in Vault, everything else is config.

Latency overhead: ~3-5ms p95. Worth it.

Provider routing policies #

Different requests want different things. The gateway accepts policy hints in the request:

json.json

{
  "model": "auto",
  "messages": [...],
  "policy": {
    "priority": "cost",      // or "latency", "quality"
    "max_cost_usd": 0.005,
    "fallback_allowed": true
  }
}

The gateway maps model: auto + priority: cost to the cheapest provider that meets the route's quality requirements (defined per route in config). For priority: latency, it picks the fastest provider available right now, accounting for current load.

This let us decouple "which model" from application code. Rolling out a new provider for a route is a config change, not a code change.

Retries and failover #

Every provider call wraps a retry policy:

code

attempt 1: provider X (primary)
on retryable error (5xx, rate limit):
  attempt 2: provider X with 1s backoff
  attempt 3: provider X with 4s backoff
  attempt 4: provider Y (failover) with 1s backoff
  give up

Retryable errors are clearly defined: 429 (rate limit), 500/502/503/504, network timeouts. Non-retryable: 400 (bad request), 401 (auth), context-length errors, content-policy violations. We don't retry on those — they're caller errors.

Failover to provider Y happens only after primary retries are exhausted. The hop adds latency; we'd rather wait the few extra hundred ms on retries to provider X than fail over too eagerly.

Failover stickiness: once we fail over to provider Y for a route, subsequent requests for that route also go to Y for the next 60 seconds. If primary recovers, traffic shifts back. Without stickiness, we'd flap — half-recovered providers would alternate.

Caching #

The gateway has a built-in semantic cache:

python.python

# Pseudocode
def handle_request(req):
    embed = await embedding(req.messages_concat())
    similar = await redis.zsearch(embed, top_k=5, threshold=0.92)

    if similar:
        match = similar[0]
        # Verify the match is "close enough" — same model, same temperature, etc.
        if match.metadata == req.metadata:
            return match.cached_response

    response = await call_provider(req)
    await redis.zwrite(embed, response, ttl=24*3600)
    return response

Hit rate stabilises around 18% for our chat workload. Higher for some specific routes (FAQ-shaped content can be 35%+); near-zero for highly personalised requests.

Cache cost: roughly $0.0001 per query in embedding cost, vs avoided LLM cost of $0.001-0.005. Net positive even at low hit rates.

Cost accounting #

Every request emits a structured event:

code

{
  "request_id": "...",
  "route": "...",
  "caller_service": "...",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "input_tokens": 1450,
  "output_tokens": 230,
  "cost_usd": 0.000356,
  "latency_ms": 1180,
  "cached": false,
  "retried": false,
  "failover_used": false
}

These go to ClickHouse for analytics. Dashboards built on top:

Cost by route, daily / weekly / monthly
Cost by caller service (which app generates the most LLM cost)
Cache hit rate over time
Failover rate (high failover rate = primary is unhealthy; investigate)
Tokens per request (drift detection — if a route's avg tokens jumps, the prompt changed unintentionally)

Rate limiting per caller #

Each calling service has a quota:

yaml.yaml

rate_limits:
  support_assistant: { rps: 10, daily_budget_usd: 50 }
  data_enrichment:   { rps: 50, daily_budget_usd: 200 }
  internal_bot:      { rps: 2, daily_budget_usd: 10 }

The gateway tracks both rate (RPS) and cumulative cost. When either limit is hit, calls return 429 with Retry-After. The calling service is expected to back off.

This was added after one incident: a deployment of an internal tool with a config bug looped infinitely calling the LLM. In two hours it generated $400 of OpenAI cost. The gateway's quota would have caught it within 30 minutes at the configured budget.

We tune budgets per service, with finance review. Budgets force people to think about cost upfront when proposing new LLM-using features.

Auth #

The gateway requires a short-lived JWT issued by our internal auth service:

code

Authorization: Bearer eyJhbGciOiJSUzI1NiIs...

The JWT contains the caller service identity, the routes it's allowed to use, and a 1-hour expiration. Provider API keys (OpenAI, Anthropic) are never exposed to calling services — they live only on the gateway.

This means: a compromised application service can use the LLM up to its quota, but cannot exfiltrate the OpenAI API key to use it elsewhere. The blast radius of a compromise is bounded.

What the gateway doesn't do #

Prompt management. Apps still own their prompts. The gateway is dumb pipe + policy.
Result post-processing. Schema validation, content filtering, tone adjustment — all happen in calling services.
Multi-step orchestration (ReAct, agentic loops). Apps drive these. The gateway does single-call inference.
Embedding for vector search. We have a separate, simpler embedding gateway. They share infrastructure but the request shapes differ enough to keep separate.

What we got wrong initially #

The first version of the gateway tried to be clever about provider routing — predicting which provider would be fastest based on recent latency. The prediction was right ~70% of the time and wrong dramatically the rest. We replaced it with a simpler "primary, fallback on error" model. The simpler version is what actually works in practice.

We also initially routed everything through a single gateway instance. Latency was fine but the gateway became a single point of failure. We now run 3 instances behind a load balancer; any one can fail without impacting traffic.

Operational metrics #

What we watch:

Gateway p95 latency (the gateway's own overhead, excluding provider time): should be < 10ms
Provider error rate by provider: rising errors on one provider triggers failover review
Cache hit rate by route: drift indicates input distribution change
Per-service budget consumption: services approaching their daily budget get a warning
Failover frequency: if we're failing over more than ~1% of the time, primary is having issues we should address

What this enables #

Three concrete things become easier with the gateway:

Provider experiments. Want to try Claude for one route? Edit the gateway config; no app deploy needed. Roll back if it doesn't work; same.

Cost optimization. Per-route, per-customer cost is visible. Optimization is targeted, not vibes-based.

Provider outages. When OpenAI had a multi-hour incident last quarter, our services kept running with degraded latency on Anthropic, no code changes required. Without the gateway, every service would have had to handle the outage independently or stay down.

What I'd tell a team starting #

Build the gateway when you have ≥3 services using LLMs and they're starting to duplicate logic. Earlier than that and the abstraction is premature. Later than that and the migration cost is significant.

Make it OpenAI-API-compatible at the boundary. Existing client libraries work. Migration becomes a base-URL change.

Don't try to ship every feature in v1. Start with: provider routing, retries, basic metrics. Add caching, rate limiting, advanced policy in subsequent iterations. Each addition is small; the gateway becomes more capable over time.

Keep it stateless. State (cache, secrets, quotas) goes in Redis or Vault. The gateway itself is a few stateless instances behind a load balancer. Easy to deploy, easy to scale, easy to debug.

The biggest win, six months in, isn't any specific feature — it's having a single chokepoint for the LLM-call pattern. When something needs to change (a new provider, a new policy, a new metric), there's one place to change it. That's worth a lot.

Architecture Review: LLM Gateway Design for Multi-Provider Inference

LLM Gateway Design for Multi-Provider Inference: Architecture Review

What the gateway does #

The shape #

Provider routing policies #

Retries and failover #

Caching #

Cost accounting #

Rate limiting per caller #

Auth #

What the gateway doesn't do #

What we got wrong initially #

Operational metrics #

What this enables #

What I'd tell a team starting #

Stay Updated

A Pragmatic Multi-Region Strategy for Small Teams

Production AI Pipelines: Building End-to-End ML Systems

More from AI

Token Budgeting for Long-Context Prompts: What to Cut First

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Token Budgeting for Long-Context Prompts: What to Cut First

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Choosing an Embedding Model: Dimensions, Cost, and MTEB Reality

Agent Memory: Short-Term, Long-Term, and When You Need Neither

Guardrails for Production LLMs: Input and Output Filtering That Holds

You might have missed

GitOps with Argo CD: Best Practices for 2025

Process Management and Monitoring in Linux

Linux Performance Tuning for Containers and Kubernetes Nodes

About Kiril Urbonas