Single-provider LLM apps fail when the provider does. Multi-provider routing isn't just resilience — it's also a cost lever. The patterns we run.
The first time OpenAI had a multi-hour outage, our customer-facing AI feature was down for the duration. That's when "multi-provider routing" stopped being a someday and became a project. Six months in, we route across three providers (OpenAI, Anthropic, plus a self-hosted Llama instance for some workloads). Failover handles outages. Per-task routing reduces our LLM bill by ~30%. This post is the patterns.
Three real reasons, in order of how often we feel them:
The patterns below address all three.
A small service (or library, depending on scale) that sits between the application and the LLM providers. Application calls our internal chat() function with a payload + task hint. The routing layer:
The interface to the application is provider-agnostic — chat({ messages, task }) regardless of who serves it.
Each provider has a rolling health score:
A provider whose score drops below a threshold is marked unhealthy. Routing skips it until the score recovers. Recovery happens via small fraction of requests routed there as probes; if they succeed, the score climbs back.
This is similar to a circuit breaker, but with a continuous score rather than open/closed states. The continuous version handles partial degradation — a provider that's slow but not failing gets fewer requests rather than zero.
We tag every LLM call with a task label: summarize, classify, extract-structured, chat-customer-facing, code-generation, etc. The routing layer maps task → preferred-provider-list:
classify → [openai/gpt-4o-mini, anthropic/claude-haiku, self-hosted/llama-3-8b]
extract-structured → [openai/gpt-4o-mini, anthropic/claude-haiku]
chat-customer-facing → [openai/gpt-4o, anthropic/claude-sonnet, openai/gpt-4o-mini]
code-generation → [anthropic/claude-sonnet, openai/gpt-4o]
summarize → [openai/gpt-4o-mini, anthropic/claude-haiku]
Each task has a primary and at least one fallback. The primary is picked for cost vs quality on that task; the fallbacks ensure we have somewhere to go when the primary is unhealthy.
The cost win comes from sending classification and extraction work to cheaper models. Roughly 70% of our LLM calls don't need GPT-4-class capability; routing them to mini/haiku models or self-hosted cuts the bill substantially.
Three states a call can fail in, each handled differently:
Rate limit (429). Retry on next provider in the list. Don't waste time retrying the same provider — the rate limit is real. Reduces overall error rate; no cost downside (we'd have failed anyway).
Transient error (5xx, timeout). Retry on same provider once. If still failing, fail over.
Permanent error (4xx other than 429, malformed request). Don't retry. Surface the error.
The retry policy matters: too aggressive and you DDoS yourself with extra cost when a provider is degraded; too conservative and you fail when you didn't need to.
Caps:
After exhausting these, return the last error to the caller. The caller decides whether to show a degraded experience, retry later, or surface the failure to the user.
Every call is logged with:
Daily aggregates per task surface where the spend is going. We use the dashboard to spot routing decisions that aren't paying off — sometimes a task we routed to a cheap model turned out to have quality regressions that pushed users to retry, costing more than just using the better model the first time.
The risk of multi-provider: the providers don't give identical output. A prompt that works perfectly on Claude can give weird results on GPT, and vice versa. Two patterns:
Eval harness covering all routed-to models. For every task, a small eval set. When we change which models are in the routing pool, we re-run the evals to confirm the new model meets the quality bar.
A/B production checks. For new model additions, we mirror a small percentage of traffic — call both the new model and the production one, score both, log the comparison. If the new one is at parity, promote it to a fallback (then possibly to primary). If it's not, leave it out.
We've had two cases where a provider that looked good on the eval set degraded on a specific real-world traffic pattern that wasn't in the eval. The A/B mirror caught it before we made it primary.
For the highest-volume cheap-task workloads (classification, simple extraction), we run a self-hosted Llama-3-8B-Instruct on an L4 GPU. Costs ~$200/month all-in. For tasks that fit its capability, this is dramatically cheaper than per-token API pricing.
The catch: self-hosted has different operational shape. The instance might be down for upgrades, capacity might be exceeded during burst. So self-hosted is always behind a fallback to an API provider — if the self-hosted instance returns an error or doesn't respond, the call falls over to OpenAI mini.
In aggregate: ~40% of our classification calls go to self-hosted; 60% to OpenAI mini (when self-hosted is at capacity or down). Average cost-per-call across both: ~30% of what pure OpenAI would have been.
Streaming responses across providers is the trickiest part. Each provider has its own stream format (SSE shape differs). Our routing layer normalizes them — every stream we expose internally has the same chunk shape. When a stream fails mid-response, we don't usually retry (the partial response has already been sent to the user); we close cleanly and let the caller decide whether to retry.
A few patterns we considered:
Provider price arbitrage on every request. Picking the cheapest provider per request based on live pricing. Adds complexity, marginal savings. We pick statically per task; revisit quarterly.
Cross-provider load balancing for cost smoothing. Trying to keep each provider at exactly N% of traffic to spread spend. Not worth the complexity vs simpler routing rules.
Letting the application pick the provider. Application code shouldn't know about providers; that's the routing layer's job. Otherwise migrations are spread across every caller.
Multi-provider routing is a few weeks of engineering for a feature that pays back the first time a provider has an outage. The cost-routing benefits compound over time. Once it's in place, the provider-failure scenario stops being a P0 and becomes "noted, our routing handled it."
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
EXPLAIN ANALYZE output is dense and intimidating. Once you can read it, most slow-query investigations finish in minutes. The patterns we keep seeing.
Edge compute is useless without an edge data layer. Three serverless databases that put data within ms of your edge functions, with the tradeoffs that aren't on the marketing pages.
Explore more articles in this category
AI agents for incident triage sound great in demos. We've tried it in production. The patterns that earn their keep, the ones that backfire, and where humans still beat agents.
Most LLM eval suites correlate poorly with what real users experience. The eval patterns we run that move with prod metrics — and the ones that lied to us.
Pure vector search misses exact-keyword queries. Pure BM25 misses semantic ones. Combining them with reciprocal rank fusion is the simplest large win in RAG retrieval.
Evergreen posts worth revisiting.