AI Checker Hub

How to Choose a Fallback Model

This production-oriented guide explains fallback triggers, model selection criteria, routing policies, and failure patterns to avoid. It is written for teams operating real-time AI workloads with reliability and cost constraints.

When to Trigger Fallback

  • Timeouts: request exceeds user-SLO budget (for example, 2.5s for interactive chat).
  • 5xx errors: provider-side errors exceed a threshold over a rolling window.
  • 429/rate-limit pressure: queue depth or throttle responses grow beyond the retry budget.
  • Degraded quality signal: output validation fails deterministic checks for safety or structure.
Use debounced thresholds (e.g., 3 failures in 60s) to avoid thrashing and unnecessary provider switches.
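The debounced threshold above can be sketched as a sliding-window failure counter; the class and parameter names here are illustrative, not from any particular library:

```python
import time
from collections import deque

class FailureDebouncer:
    """Trips fallback only after `threshold` failures inside `window_s` seconds."""

    def __init__(self, threshold=3, window_s=60.0):
        self.threshold = threshold
        self.window_s = window_s
        self.failures = deque()  # timestamps of recent failures

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures.append(now)

    def should_fall_back(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop failures that have aged out of the rolling window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        return len(self.failures) >= self.threshold
```

A single transient error never flips the route; only a cluster of failures inside the window does, which prevents thrashing between providers.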

Selection Criteria

  • Latency profile (p95 in your target region, measured against your latency budget).
  • Cost per 1M input/output tokens for expected prompt sizes.
  • Output quality fit for your task category.
  • Context length and tool/function calling support.
  • Operational maturity: rate limits, support SLAs, incident cadence.
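One way to apply these criteria is a simple weighted score per candidate. Every number below is a hypothetical placeholder, not a benchmark; the model names echo the routing example later in this guide:

```python
# Hypothetical scores per candidate, normalized to 0..1 (higher is better).
CANDIDATES = {
    "fast-general":   {"latency": 0.9, "cost": 0.9, "quality": 0.6, "context": 0.5, "ops": 0.8},
    "strong-general": {"latency": 0.5, "cost": 0.4, "quality": 0.9, "context": 0.9, "ops": 0.8},
    "balanced":       {"latency": 0.7, "cost": 0.7, "quality": 0.75, "context": 0.7, "ops": 0.7},
}

# Weights reflect one workload's priorities; tune per task category.
WEIGHTS = {"latency": 0.3, "cost": 0.2, "quality": 0.3, "context": 0.1, "ops": 0.1}

def rank_fallbacks(candidates, weights):
    # Weighted sum across criteria; the best-scoring model goes first.
    scored = {
        name: sum(weights[k] * feats[k] for k in weights)
        for name, feats in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

Re-rank whenever prices, latency profiles, or quality evaluations change; the ordering, not the absolute score, is what feeds your routing config.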

Fallback Strategies

1) Same Provider, Different Model

Use when the provider's control plane is stable but one model tier is saturated.

2) Cross-Provider Fallback

Use when a provider-wide incident is likely. Keep prompt adapters so schemas and tool calls stay compatible across providers.
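A prompt adapter layer can look like the sketch below. Both wire formats are illustrative stand-ins, not any real provider's API; the point is that routing code builds one neutral request and adapters handle per-provider shape:

```python
# Provider A (hypothetically) accepts structured messages plus "tools".
def to_provider_a(messages, tools):
    return {"messages": messages, "tools": tools}

# Provider B (hypothetically) wants a flat prompt plus "functions".
def to_provider_b(messages, tools):
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return {"prompt": prompt, "functions": tools}

ADAPTERS = {"providerA": to_provider_a, "providerB": to_provider_b}

def build_request(provider, messages, tools=()):
    # Routing code stays provider-neutral; only adapters know wire formats.
    return ADAPTERS[provider](messages, list(tools))
```

With this in place, cross-provider fallback is a one-line change of the `provider` key rather than a rewrite of request construction.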

3) Tiered Routing (Fast -> Strong)

Default to a fast model for first attempt, escalate to a stronger model only when needed.

Example Routing Policy (Pseudocode)

def route(request):
    primary = model("fast-general")
    backup_a = model("strong-general")
    backup_b = provider_b.model("balanced")

    result = call(primary, request, timeout_ms=2500)

    # Provider-side failure: escalate to the strong model.
    if result.timeout or 500 <= result.http <= 599:
        return call(backup_a, request, timeout_ms=3200)

    # Rate-limit pressure: one jittered retry, then cross-provider fallback.
    if result.http == 429:
        sleep(jitter_ms(80, 250))
        retry = call(primary, request, timeout_ms=2500)
        if retry.http == 429:
            return call(backup_b, request, timeout_ms=3200)
        result = retry  # retry succeeded; validate its output below

    # Deterministic quality gate before accepting the response.
    if not quality_checks_pass(result.output):
        return call(backup_a, request, timeout_ms=3200)

    return result

Common Pitfalls and How to Avoid Them

  • Retry storms: cap retries and use jittered backoff.
  • Silent quality regression: enforce output schema checks before accepting a response.
  • No observability per route: track metrics by model/provider path.
  • Hard-coded provider assumptions: normalize request/response adapters.
Do not shift all traffic to a fallback at once during incident response. Use a staged ramp-up (5% -> 25% -> 50% -> 100%), promoting each stage only after metrics look healthy.
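The staged ramp-up can be implemented by hashing a stable request or user id into a percentage bucket, so each caller routes deterministically during the ramp. The function names and stage list are illustrative:

```python
import hashlib

RAMP_STAGES = [5, 25, 50, 100]  # percent of traffic per stage

def in_ramp(key: str, percent: int) -> bool:
    """Return True if this key falls inside the first `percent` buckets."""
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # deterministic 0..99
    return bucket < percent
```

Because the bucket is derived from the key, a user routed to the fallback at 25% stays on the fallback at 50% and 100%, which keeps experience consistent while you widen the ramp.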