Single-provider LLM apps fail when the provider does. Multi-provider routing isn't just resilience — it's also a cost lever. The patterns we run.

On this page

Multi-Provider LLM Routing: Failover, Cost Routing, and Load Balancing

The first time OpenAI had a multi-hour outage, our customer-facing AI feature was down for the duration. That's when "multi-provider routing" stopped being a someday and became a project. Six months in, we route across three providers (OpenAI, Anthropic, plus a self-hosted Llama instance for some workloads). Failover handles outages. Per-task routing reduces our LLM bill by ~30%. This post is the patterns.

Why one provider isn't enough #

Three real reasons, in order of how often we feel them:

Provider outages. Every major LLM provider has had multi-hour incidents in the last 12 months. If your customer-facing feature depends on one, you're down when they're down.
Rate limits. OpenAI and Anthropic both have rate limits per organization. Hit them and requests get 429s until the window resets. Multi-provider lets you spread the load.
Cost optimization. Some tasks need GPT-4-class capability; many don't. Routing simpler tasks to cheaper models (or self-hosted) saves real money at scale.

The patterns below address all three.

The routing layer #

A small service (or library, depending on scale) that sits between the application and the LLM providers. Application calls our internal chat() function with a payload + task hint. The routing layer:

Picks a provider and model based on the task hint and current health.
Makes the call.
On error: classifies, decides whether to retry on the same provider or fail over.
Returns the response.

The interface to the application is provider-agnostic — chat({ messages, task }) regardless of who serves it.

Health tracking #

Each provider has a rolling health score:

Successful calls: increment score.
429 / 5xx / timeout: decrement score.
Latency above threshold: small decrement.

A provider whose score drops below a threshold is marked unhealthy. Routing skips it until the score recovers. Recovery happens via small fraction of requests routed there as probes; if they succeed, the score climbs back.

This is similar to a circuit breaker, but with a continuous score rather than open/closed states. The continuous version handles partial degradation — a provider that's slow but not failing gets fewer requests rather than zero.

Task-aware routing #

We tag every LLM call with a task label: summarize, classify, extract-structured, chat-customer-facing, code-generation, etc. The routing layer maps task → preferred-provider-list:

code

classify           → [openai/gpt-4o-mini, anthropic/claude-haiku, self-hosted/llama-3-8b]
extract-structured → [openai/gpt-4o-mini, anthropic/claude-haiku]
chat-customer-facing → [openai/gpt-4o, anthropic/claude-sonnet, openai/gpt-4o-mini]
code-generation    → [anthropic/claude-sonnet, openai/gpt-4o]
summarize          → [openai/gpt-4o-mini, anthropic/claude-haiku]

Each task has a primary and at least one fallback. The primary is picked for cost vs quality on that task; the fallbacks ensure we have somewhere to go when the primary is unhealthy.

The cost win comes from sending classification and extraction work to cheaper models. Roughly 70% of our LLM calls don't need GPT-4-class capability; routing them to mini/haiku models or self-hosted cuts the bill substantially.

Failover policy #

Three states a call can fail in, each handled differently:

Rate limit (429). Retry on next provider in the list. Don't waste time retrying the same provider — the rate limit is real. Reduces overall error rate; no cost downside (we'd have failed anyway).

Transient error (5xx, timeout). Retry on same provider once. If still failing, fail over.

Permanent error (4xx other than 429, malformed request). Don't retry. Surface the error.

The retry policy matters: too aggressive and you DDoS yourself with extra cost when a provider is degraded; too conservative and you fail when you didn't need to.

Caps:

Max 2 provider failovers per request.
Max 1 retry per provider.
Total time budget: 60 seconds (request-level timeout).

After exhausting these, return the last error to the caller. The caller decides whether to show a degraded experience, retry later, or surface the failure to the user.

Cost tracking #

Every call is logged with:

Task label
Provider + model
Input tokens, output tokens
Latency
Cost (computed from tokens × model price)

Daily aggregates per task surface where the spend is going. We use the dashboard to spot routing decisions that aren't paying off — sometimes a task we routed to a cheap model turned out to have quality regressions that pushed users to retry, costing more than just using the better model the first time.

Quality parity checks #

The risk of multi-provider: the providers don't give identical output. A prompt that works perfectly on Claude can give weird results on GPT, and vice versa. Two patterns:

Eval harness covering all routed-to models. For every task, a small eval set. When we change which models are in the routing pool, we re-run the evals to confirm the new model meets the quality bar.

A/B production checks. For new model additions, we mirror a small percentage of traffic — call both the new model and the production one, score both, log the comparison. If the new one is at parity, promote it to a fallback (then possibly to primary). If it's not, leave it out.

We've had two cases where a provider that looked good on the eval set degraded on a specific real-world traffic pattern that wasn't in the eval. The A/B mirror caught it before we made it primary.

Self-hosting for the cheap tail #

For the highest-volume cheap-task workloads (classification, simple extraction), we run a self-hosted Llama-3-8B-Instruct on an L4 GPU. Costs ~$200/month all-in. For tasks that fit its capability, this is dramatically cheaper than per-token API pricing.

The catch: self-hosted has different operational shape. The instance might be down for upgrades, capacity might be exceeded during burst. So self-hosted is always behind a fallback to an API provider — if the self-hosted instance returns an error or doesn't respond, the call falls over to OpenAI mini.

In aggregate: ~40% of our classification calls go to self-hosted; 60% to OpenAI mini (when self-hosted is at capacity or down). Average cost-per-call across both: ~30% of what pure OpenAI would have been.

Implementation note: streaming #

Streaming responses across providers is the trickiest part. Each provider has its own stream format (SSE shape differs). Our routing layer normalizes them — every stream we expose internally has the same chunk shape. When a stream fails mid-response, we don't usually retry (the partial response has already been sent to the user); we close cleanly and let the caller decide whether to retry.

What we monitor #

Per-provider success rate per task. Catches when one provider degrades on a specific task type.
Failover rate. What % of calls had to fail over? Spikes signal upstream issues.
Per-task cost trend. Catches when routing rules drift from what's optimal.
Per-task latency p95/p99. Quality of service.
Error budget burn rate per provider. Treat each provider like an SLO; alert if any are burning faster than sustainable.

What we don't bother with #

A few patterns we considered:

Provider price arbitrage on every request. Picking the cheapest provider per request based on live pricing. Adds complexity, marginal savings. We pick statically per task; revisit quarterly.

Cross-provider load balancing for cost smoothing. Trying to keep each provider at exactly N% of traffic to spread spend. Not worth the complexity vs simpler routing rules.

Letting the application pick the provider. Application code shouldn't know about providers; that's the routing layer's job. Otherwise migrations are spread across every caller.

What I'd tell a team starting #

Even one fallback provider is a huge resilience improvement over single-provider. Start there.
Task labels first, routing rules second. Without good task labels in the data, optimization is guessing.
A/B compare new providers before making them primary. Eval sets miss things real traffic surfaces.
Pay attention to the streaming shape if you do streaming. It's the part that requires the most provider-specific code.

What to read next #

Field notes: Model fallback policies for customer-facing AI — the prompt-side discipline that pairs with routing
AI cost optimization: reducing LLM inference costs — broader cost lever inventory
AI observability: monitoring LLM performance in production — what to track once you've got routing working
LLM streaming UX — backpressure, cancellation, partial results — the stream-handling side

Multi-provider routing is a few weeks of engineering for a feature that pays back the first time a provider has an outage. The cost-routing benefits compound over time. Once it's in place, the provider-failure scenario stops being a P0 and becomes "noted, our routing handled it."

Multi-Provider LLM Routing — Failover, Cost Routing, and Load Balancing

Multi-Provider LLM Routing: Failover, Cost Routing, and Load Balancing

Why one provider isn't enough #

The routing layer #

Health tracking #

Task-aware routing #

Failover policy #

Cost tracking #

Quality parity checks #

Self-hosting for the cheap tail #

Implementation note: streaming #

What we monitor #

What we don't bother with #

What I'd tell a team starting #

What to read next #

Stay Updated

Postgres Query Plans — Reading Them and the Indexes We Wish We'd Added Sooner

Edge Databases for Low-Latency Apps — D1, Turso, Neon Serverless

More from AI

Production RAG Reliability — Making LLM Answers Trustworthy

Shadow Testing and Canary Releases for LLM Changes

Debugging RAG Retrieval — Why It Returns Garbage

Production RAG Reliability — Making LLM Answers Trustworthy

Shadow Testing and Canary Releases for LLM Changes

Debugging RAG Retrieval — Why It Returns Garbage

Long Context vs RAG — When to Use Which

Prompt Injection Defense for LLM Apps

RAG Evaluation Metrics — Faithfulness and Context Precision

You might have missed

GitOps with Argo CD: Best Practices for 2025

Process Management and Monitoring in Linux

Linux Performance Tuning for Containers and Kubernetes Nodes

About Kiril Urbonas