Jun 7, 2026 · 6 min read

When Your AI Goes Down: Multi-Model Fallback Strategies for 2026

The June 2026 Claude outages proved AI is now infrastructure. Learn how to build resilient multi-model fallback systems using LiteLLM, OpenRouter, and circuit breaker patterns.

On June 2, 2026, Anthropic's Claude experienced a widespread global outage. Error rates spiked across Opus 4.6, the Claude API, and Claude Code CLI. Three days later, on June 5, another disruption hit — affecting claude.ai, the API, Claude Code, and Cowork. Notion's engineering team responded by immediately disabling all Anthropic models from their picker and rerouting every request to alternative providers. Users experienced a model switch. Not an outage.

That gap — between teams that scrambled and teams that rerouted seamlessly — comes down to one architectural decision made months earlier: whether or not to treat AI as infrastructure.

AI Is Infrastructure Now

Through 2024, an AI outage meant your chatbot was temporarily unavailable. In 2026, an AI outage means your development pipeline, your customer support triage, your document processing, and your onboarding flows all stop simultaneously.

Thoughtworks documented the cascading failures during the June incident: automated coding assistants went offline, semantic search degraded to keyword fallback, and LLM-powered data pipelines silently stopped processing. The teams hit hardest were those that had swapped human capability for AI capability rather than amplifying it.

The lesson is not "use AI less." The lesson is to build AI the same way you build any critical infrastructure: with redundancy, graceful degradation, and automatic failover.

The Single-Provider Trap

Most AI integrations start the same way: pick a provider, grab an API key, ship it. This works until it doesn't. The failure modes are predictable:

  • Provider outage — Claude, GPT-4o, and Gemini have all had significant downtime in 2026
  • Rate limiting — traffic spikes push you into 429 errors without warning
  • Model deprecation — providers retire models with 90-day notice windows
  • Regional failures — some outages are geographically scoped but still hit your users
  • Cost spikes — a provider changes pricing and your margins collapse overnight

Hardcoding a single API endpoint is a liability. The fix is routing your requests through a layer that treats model providers as interchangeable compute resources.

Building a Fallback Chain

A production-grade fallback chain has at least three levels, and ideally a fourth:

  1. Primary model — your preferred provider for quality and cost (e.g., Claude Opus 4.7)
  2. Same-provider fallback — a cheaper or lighter model from the same vendor (e.g., Claude Haiku 4.5)
  3. Cross-provider fallback — a different vendor entirely (e.g., GPT-4o or Gemini 2.5 Flash)
  4. Last-resort option — a self-hosted or locally-run model with no external dependency

The chain activates on error codes 429, 500, 502, 503, and 529. Each level gets a configurable retry budget before escalating.
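
The escalation logic described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production gateway: ProviderError, Level, and complete_with_fallback are hypothetical names, and each level's call is assumed to be your own client wrapper that raises ProviderError with the HTTP status it received.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Status codes that trigger escalation, taken from the text above.
RETRYABLE_STATUS = {429, 500, 502, 503, 529}


class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status a provider returned."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


@dataclass
class Level:
    name: str                   # label for logs, e.g. "primary"
    call: Callable[[str], str]  # hypothetical client wrapper: prompt -> text
    retries: int = 2            # retry budget before escalating


def complete_with_fallback(
    levels: Sequence[Level],
    prompt: str,
    backoff_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Return (level_name, text) from the first level that succeeds."""
    last_error: Optional[Exception] = None
    for level in levels:
        for attempt in range(level.retries + 1):
            try:
                return level.name, level.call(prompt)
            except ProviderError as err:
                if err.status not in RETRYABLE_STATUS:
                    raise  # e.g. 400 or 401: another model will not fix the request
                last_error = err
                if attempt < level.retries:
                    sleep(backoff_s * 2 ** attempt)  # exponential backoff
    raise RuntimeError("all fallback levels exhausted") from last_error
```

Returning the level name alongside the text lets the caller log which tier served the request, which matters later for both UX messaging and monitoring.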

Implementing Fallbacks with LiteLLM

LiteLLM is the most widely adopted open-source gateway for multi-model routing. Here is a minimal Python example using its Router:

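import asyncio
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "claude-primary",
            "litellm_params": {
                "model": "anthropic/claude-opus-4-7",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "claude-haiku-fallback",
            "litellm_params": {
                "model": "anthropic/claude-haiku-4-5",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "openai-fallback",
            "litellm_params": {
                "model": "openai/gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    # Escalation path: tried in the order listed once the primary group fails.
    fallbacks=[
        {"claude-primary": ["claude-haiku-fallback", "openai-fallback"]}
    ],
    num_retries=3,   # retries before falling back
    retry_after=60,  # minimum seconds to wait before retrying
)


async def main() -> None:
    # acompletion is a coroutine, so it must be awaited inside an event loop.
    response = await router.acompletion(
        model="claude-primary",
        messages=[{"role": "user", "content": "Summarize this document."}],
    )
    print(response.choices[0].message.content)


asyncio.run(main())

The fallbacks mapping controls escalation. When claude-primary exhausts its retries, the router tries claude-haiku-fallback, then openai-fallback, in the order listed. A per-deployment order setting is unnecessary here because each deployment has its own model_name; ordering of that kind only matters among deployments that share a model group.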

OpenRouter as a Managed Alternative

If you prefer not to manage your own gateway infrastructure, OpenRouter provides managed fallback routing through a single API endpoint. You list fallback models in the request body:

const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-opus-4-7",
    models: [
      "anthropic/claude-opus-4-7",
      "anthropic/claude-haiku-4-5",
      "openai/gpt-4o",
    ],
    route: "fallback",
    messages: [{ role: "user", content: "Summarize this document." }],
  }),
});

OpenRouter handles the fallback logic server-side. If Claude is experiencing issues, the request automatically routes to the next model in your models array. For under 1,000 requests per second, this managed approach removes significant operational overhead.

Circuit Breakers for AI APIs

Retries alone are not enough. Without a circuit breaker, your application keeps hammering a degraded provider, which slows its recovery and worsens your latency. A commonly cited starting configuration:

  • Failure threshold — 5 consecutive failures trips the circuit open
  • Cooldown period — 60 seconds before testing recovery with a single probe request
  • Half-open state — one successful probe closes the circuit; one failure keeps it open

LiteLLM's Router implements this automatically. If you are building a custom gateway, the pybreaker library for Python or cockatiel for TypeScript/Node.js are solid choices:

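// Uses the function-style API of cockatiel v3.
import {
  ExponentialBackoff,
  SamplingBreaker,
  circuitBreaker,
  handleAll,
  retry,
  wrap,
} from "cockatiel";

// Hypothetical provider call and input; replace with your own client wrapper.
declare function callClaude(prompt: string): Promise<string>;
declare const prompt: string;

// Open the circuit when at least 50% of calls fail within a 30-second window
// (ignored below 5 requests per second), then probe again after 60 seconds.
// Swap in `new ConsecutiveBreaker(5)` to trip on 5 consecutive failures instead.
const claudeCircuit = circuitBreaker(handleAll, {
  halfOpenAfter: 60_000,
  breaker: new SamplingBreaker({ threshold: 0.5, duration: 30_000, minimumRps: 5 }),
});

// Retry any rejected call up to 3 times with exponential backoff.
const retryPolicy = retry(handleAll, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff(),
});

// The breaker wraps the retries, so it records a failure only after retries run out.
const response = await wrap(claudeCircuit, retryPolicy).execute(() =>
  callClaude(prompt)
);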
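
For a custom Python gateway, the thresholds listed above (5 consecutive failures, a 60-second cooldown, one probe in the half-open state) can also be sketched without a library. The CircuitBreaker and CircuitOpenError names below are hypothetical, and the sketch assumes single-threaded use; a concurrent gateway would need a lock and a guard so that only one probe runs at a time.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"  # cooldown elapsed: the next call is a probe
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("provider is cooling down")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed probe reopens at once; otherwise trip at the threshold.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Injecting the clock keeps the state machine testable without real sleeps; in production the default time.monotonic is the safer choice because it is unaffected by wall-clock adjustments.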

UX Considerations for Model Switching

Model switching is invisible to users only if you plan for it. When a fallback activates, two things need to be true:

Output consistency. Claude Opus returning a 2,000-word structured analysis followed by GPT-4o returning 400 words of free prose will confuse users. Normalize your prompts with explicit format instructions (word count, structure, JSON schema) that any capable model will follow.

Graceful degradation messaging. For features that are explicitly model-dependent — such as deep document analysis or code synthesis — surface a soft signal: "Advanced analysis is temporarily running on an alternative provider. Results may vary slightly." Users tolerate brief quality fluctuation far better than complete unavailability.
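
Both requirements can be enforced in code rather than left to convention. The sketch below is one possible shape, not a prescribed one: it appends the same format contract to every prompt, validates the reply so that a drifting response is treated like a failure, and returns the soft notice only for model-dependent features. The function names, the JSON keys, and the feature identifiers are all assumptions made for illustration.

```python
import json
from typing import Optional

# Hypothetical format contract sent to every model in the chain.
FORMAT_RULES = (
    "Respond with a single JSON object and nothing else. "
    'Keys: "summary" (string, at most {max_words} words) and '
    '"key_points" (array of 3 to 5 short strings).'
)

NOTICE = (
    "Advanced analysis is temporarily running on an alternative provider. "
    "Results may vary slightly."
)

# Hypothetical identifiers for features whose quality depends on the model.
MODEL_DEPENDENT_FEATURES = {"document_analysis", "code_synthesis"}


def build_prompt(task: str, max_words: int = 150) -> str:
    """Append the same explicit format contract regardless of target model."""
    return f"{task}\n\n{FORMAT_RULES.format(max_words=max_words)}"


def parse_response(raw: str, max_words: int = 150) -> dict:
    """Validate a reply against the contract; raise ValueError if it drifts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError("reply is not valid JSON") from err
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    summary = data.get("summary")
    points = data.get("key_points")
    if not isinstance(summary, str) or len(summary.split()) > max_words:
        raise ValueError("summary missing or too long")
    if not isinstance(points, list) or not 3 <= len(points) <= 5:
        raise ValueError("key_points must hold 3 to 5 items")
    return {"summary": summary, "key_points": [str(p) for p in points]}


def degradation_notice(
    feature: str, primary_model: str, served_model: str
) -> Optional[str]:
    """Return the soft notice only when a model-dependent feature fell back."""
    if served_model != primary_model and feature in MODEL_DEPENDENT_FEATURES:
        return NOTICE
    return None
```

Raising ValueError on a malformed reply lets the routing layer treat format drift the same way it treats a 5xx: retry, then escalate.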

Observability: Knowing Before Your Users Do

The final piece is semantic monitoring — tracking not just whether the API responds, but whether responses meet your quality threshold. Three metrics to track per model call:

  • Token throughput — tokens per second, per provider
  • Error rate by code — separate 429s (rate limits) from 5xx (outages) from 529s (overloaded)
  • Latency P95 — not average, but 95th percentile to catch tail latency degradation

Tools like Langfuse, Helicone, and Portkey provide out-of-the-box dashboards for these. Alert at a 15% error rate and wire that alert to your routing layer, so failover starts before most users notice anything is wrong.
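
If you log calls yourself, the three metrics can be computed from raw records with the standard library. The following is a rough sketch under stated assumptions: CallRecord, classify, and summarize are hypothetical names, the records cover a single provider over one time window, and throughput is approximated as output tokens divided by total latency of successful calls.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CallRecord:
    status: int         # HTTP status code of the call
    latency_s: float    # wall-clock duration of the call
    output_tokens: int  # tokens generated (0 on failure)


def classify(status: int) -> str:
    """Separate rate limits, overload, and outages, as the list above suggests."""
    if status == 429:
        return "rate_limited"
    if status == 529:
        return "overloaded"
    if 500 <= status < 600:
        return "outage"
    return "ok" if status < 400 else "client_error"


def summarize(records: Iterable[CallRecord], error_threshold: float = 0.15) -> dict:
    rows = list(records)
    if not rows:
        raise ValueError("no records to summarize")
    by_class = Counter(classify(r.status) for r in rows)
    # Only provider-side failures count toward the failover threshold.
    provider_errors = (
        by_class["rate_limited"] + by_class["overloaded"] + by_class["outage"]
    )
    error_rate = provider_errors / len(rows)
    # Nearest-rank P95: smallest latency with at least 95% of samples at or below it.
    latencies = sorted(r.latency_s for r in rows)
    p95 = latencies[(95 * len(latencies) + 99) // 100 - 1]
    ok_rows = [r for r in rows if classify(r.status) == "ok"]
    ok_time = sum(r.latency_s for r in ok_rows)
    tokens_per_s = sum(r.output_tokens for r in ok_rows) / ok_time if ok_time else 0.0
    return {
        "errors_by_class": dict(by_class),
        "error_rate": error_rate,
        "latency_p95_s": p95,
        "tokens_per_s": tokens_per_s,
        "should_failover": error_rate >= error_threshold,
    }
```

Client errors such as 400 are deliberately excluded from the error rate here, since switching providers would not fix a malformed request.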

Key Takeaways for MENA Teams

For engineering teams in the MENA region, multi-model resilience has an additional dimension: regional latency. Several AI providers have uneven coverage across North Africa and the Gulf. Running benchmarks across providers — Claude, Gemini, GPT-4o, and Mistral — from your actual server region identifies your fastest primary and best fallback for local users, not just global averages.
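
A regional benchmark of this kind needs little more than a timing loop run from the server region that serves your users. The sketch below assumes each provider is represented by a zero-argument callable that sends one fixed test prompt through your own client wrapper; benchmark and rank are hypothetical helper names, and no real measurements are implied.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(
    providers: Dict[str, Callable[[], object]],
    runs: int = 10,
    timer: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Return the median round-trip latency in seconds for each provider."""
    results: Dict[str, float] = {}
    for name, send in providers.items():
        samples: List[float] = []
        for _ in range(runs):
            start = timer()
            try:
                send()
            except Exception:
                continue  # failed calls are skipped, not timed
            samples.append(timer() - start)
        # A provider with no successful call ranks last.
        results[name] = statistics.median(samples) if samples else float("inf")
    return results


def rank(results: Dict[str, float]) -> List[str]:
    """Order providers from fastest to slowest median latency."""
    return sorted(results, key=lambda name: results[name])
```

The first entry of rank(...) is a candidate primary for that region and the second a candidate fallback; the median is used instead of the mean so that a single slow request does not distort the ordering.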

The June 2026 outages were a stress test the industry collectively failed. The teams that passed built their AI integrations the same way they build their databases: with replication, automatic failover, and the assumption that any single node will eventually go down.

Your primary model is not your infrastructure. Your routing layer is.


Further reading: OpenRouter Fallback Routing Docs · LiteLLM Router Documentation · Thoughtworks Claude Outage Analysis