AI Checker Hub

How to Choose a Fallback Model

This production-oriented guide explains fallback triggers, model selection criteria, routing policies, and failure patterns to avoid. It is written for teams operating real-time AI workloads with reliability and cost constraints.

When to Trigger Fallback

  • Timeouts: request exceeds user-SLO budget (for example, 2.5s for interactive chat).
  • 5xx errors: provider-side errors exceed a threshold over a rolling window.
  • 429/rate-limit pressure: queue depth or throttle responses grow beyond the retry budget.
  • Degraded quality signal: output validation fails deterministic checks for safety or structure.
Use debounced thresholds (e.g., 3 failures in 60s) to avoid thrashing and unnecessary provider switches.
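The debounced threshold above can be sketched as a sliding-window failure counter; the class and parameter names here are illustrative, not from any particular library:

```python
import time
from collections import deque

class FailureDebouncer:
    """Trips fallback only after `threshold` failures inside `window_s` seconds."""

    def __init__(self, threshold=3, window_s=60.0):
        self.threshold = threshold
        self.window_s = window_s
        self.failures = deque()  # timestamps of recent failures

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures.append(now)

    def should_fall_back(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop failures that have aged out of the rolling window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        return len(self.failures) >= self.threshold
```

A single transient error never flips the route; only a cluster of failures inside the window does, which prevents thrashing between providers.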

Selection Criteria

  • Latency profile (p95 in your target region, measured against your latency budget).
  • Cost per 1M input/output tokens for expected prompt sizes.
  • Output quality fit for your task category.
  • Context length and tool/function calling support.
  • Operational maturity: rate limits, support SLAs, incident cadence.
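One way to apply these criteria is a simple weighted score per candidate. Every number below is a hypothetical placeholder, not a benchmark; the model names echo the routing example later in this guide:

```python
# Hypothetical scores per candidate, normalized to 0..1 (higher is better).
CANDIDATES = {
    "fast-general":   {"latency": 0.9, "cost": 0.9, "quality": 0.6, "context": 0.5, "ops": 0.8},
    "strong-general": {"latency": 0.5, "cost": 0.4, "quality": 0.9, "context": 0.9, "ops": 0.8},
    "balanced":       {"latency": 0.7, "cost": 0.7, "quality": 0.75, "context": 0.7, "ops": 0.7},
}

# Weights reflect one workload's priorities; tune per task category.
WEIGHTS = {"latency": 0.3, "cost": 0.2, "quality": 0.3, "context": 0.1, "ops": 0.1}

def rank_fallbacks(candidates, weights):
    # Weighted sum across criteria; the best-scoring model goes first.
    scored = {
        name: sum(weights[k] * feats[k] for k in weights)
        for name, feats in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

Re-rank whenever prices, latency profiles, or quality evaluations change; the ordering, not the absolute score, is what feeds your routing config.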

Fallback Strategies

1) Same Provider, Different Model

Use when the provider's control plane is stable but one model tier is saturated.

2) Cross-Provider Fallback

Use when a provider-wide incident is likely. Keep prompt adapters so schemas and tool calls stay compatible across providers.
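A prompt adapter layer can look like the sketch below. Both wire formats are illustrative stand-ins, not any real provider's API; the point is that routing code builds one neutral request and adapters handle per-provider shape:

```python
# Provider A (hypothetically) accepts structured messages plus "tools".
def to_provider_a(messages, tools):
    return {"messages": messages, "tools": tools}

# Provider B (hypothetically) wants a flat prompt plus "functions".
def to_provider_b(messages, tools):
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return {"prompt": prompt, "functions": tools}

ADAPTERS = {"providerA": to_provider_a, "providerB": to_provider_b}

def build_request(provider, messages, tools=()):
    # Routing code stays provider-neutral; only adapters know wire formats.
    return ADAPTERS[provider](messages, list(tools))
```

With this in place, cross-provider fallback is a one-line change of the `provider` key rather than a rewrite of request construction.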

3) Tiered Routing (Fast -> Strong)

Default to a fast model for first attempt, escalate to a stronger model only when needed.

Example Routing Policy (Pseudocode)

def route(request):
    primary = model("fast-general")
    backup_a = model("strong-general")
    backup_b = provider_b.model("balanced")

    result = call(primary, request, timeout_ms=2500)

    # Provider-side failure: escalate to the strong model.
    if result.timeout or 500 <= result.http <= 599:
        return call(backup_a, request, timeout_ms=3200)

    # Rate-limit pressure: one jittered retry, then cross-provider fallback.
    if result.http == 429:
        sleep(jitter_ms(80, 250))
        retry = call(primary, request, timeout_ms=2500)
        if retry.http == 429:
            return call(backup_b, request, timeout_ms=3200)
        result = retry  # retry succeeded; validate its output below

    # Deterministic quality gate before accepting the response.
    if not quality_checks_pass(result.output):
        return call(backup_a, request, timeout_ms=3200)

    return result

Common Pitfalls and How to Avoid Them

  • Retry storms: cap retries and use jittered backoff.
  • Silent quality regression: enforce output schema checks before accepting a response.
  • No observability per route: track metrics by model/provider path.
  • Hard-coded provider assumptions: normalize request/response adapters.
Do not shift all traffic to a fallback at once during incident response. Use a staged ramp-up (5% -> 25% -> 50% -> 100%), promoting each stage only after metrics look healthy.
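The staged ramp-up can be implemented by hashing a stable request or user id into a percentage bucket, so each caller routes deterministically during the ramp. The function names and stage list are illustrative:

```python
import hashlib

RAMP_STAGES = [5, 25, 50, 100]  # percent of traffic per stage

def in_ramp(key: str, percent: int) -> bool:
    """Return True if this key falls inside the first `percent` buckets."""
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # deterministic 0..99
    return bucket < percent
```

Because the bucket is derived from the key, a user routed to the fallback at 25% stays on the fallback at 50% and 100%, which keeps experience consistent while you widen the ramp.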