AI Checker Hub

AI API Error 429 Guide: Causes, Fixes, and Retry Strategy

HTTP 429 errors are one of the most common failure modes in production AI integrations. This guide explains what rate-limit responses mean, how to identify the root cause quickly, and which retry patterns are safe for user-facing workloads.

What 429 Actually Means

A 429 response usually means your request stream exceeded one of several policy limits: requests per minute, tokens per minute, concurrent in-flight calls, or account-tier quotas. Many teams only monitor request count, but token burst and concurrency are often the hidden bottlenecks in LLM workloads.
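Token burst and concurrency pressure can be made visible with a small sliding-window counter fed by each completed call. A minimal sketch, assuming each call reports its token usage; `ThroughputWindow` and its methods are illustrative names, not any provider's API:

```python
import time
from collections import deque

class ThroughputWindow:
    """Track request and token throughput over a sliding window of seconds."""

    def __init__(self, window_s=60):
        self.window_s = window_s
        self.events = deque()  # (timestamp, token_count) pairs, oldest first

    def record(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, tokens))
        self._evict(now)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def rates(self, now=None):
        """Return (requests, tokens) observed within the window."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        return len(self.events), sum(t for _, t in self.events)
```

Comparing these numbers against your plan's per-minute limits tells you which dimension (requests or tokens) is actually being throttled.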

The key operational takeaway is that a 429 is not necessarily an outage signal. It is often entirely local to your own traffic: a workload spike, aggressive request batching, or an account-level quota. Treat 429 as capacity pressure first, and escalate to incident mode only after checking broader error patterns and independent monitoring data.
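Before tuning anything locally, check whether the 429 response carries a Retry-After header; many providers include one, and honoring it beats guessing. A minimal sketch, assuming a dict-like header mapping and the delta-seconds form of the header (the HTTP-date form is not handled here):

```python
def retry_delay_seconds(headers, default_s=1.0):
    """Prefer the server's Retry-After hint over a local default delay."""
    value = headers.get("Retry-After") or headers.get("retry-after")
    if value is None:
        return default_s
    try:
        # Delta-seconds form, e.g. "Retry-After: 3".
        return max(float(value), 0.0)
    except ValueError:
        # Unparseable (e.g. HTTP-date form): fall back to the default.
        return default_s
```

Feeding this value into your backoff loop keeps client waits aligned with what the server is actually asking for.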

Fast Root-Cause Checklist

  1. Confirm whether 429 is isolated to one endpoint or all critical routes.
  2. Check per-minute request burst and token throughput for the last 15 minutes.
  3. Inspect in-flight concurrency caps and queue depth during spikes.
  4. Compare with independent status pages for simultaneous 5xx/timeouts.
  5. Validate key, project, and billing quota state before changing routing policy.
If 429 rises without broad 5xx growth, prioritize client-side rate shaping before cross-provider failover.
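Client-side rate shaping can be as simple as a token bucket placed in front of the API client: it admits requests at a steady rate and absorbs short bursts. A minimal sketch with illustrative names; the explicit `now` parameters exist only to make the logic testable:

```python
import time

class TokenBucket:
    """Admit up to `burst` requests at once, refilling at `rate_per_s`."""

    def __init__(self, rate_per_s, burst, now=None):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def acquire(self, now=None):
        """Return True if a request may be sent now, consuming one token."""
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Requests rejected by `acquire` can be queued and retried shortly after, which flattens bursts before they ever reach the provider's limiter.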

Safe Retry and Backoff Policy

import random
import time

max_retries = 2
base_delay_ms = 120

def call_with_backoff(request):
  for attempt in range(max_retries + 1):
    response = call_model(request)
    if response.ok:
      return response
    if response.status != 429:
      break  # non-retryable failure: stop retrying and fail over
    # Exponential backoff plus jitter to avoid synchronized retry storms.
    delay_ms = base_delay_ms * (2 ** attempt) + random.uniform(40, 220)
    time.sleep(delay_ms / 1000)
  return route_to_fallback(request)

Keep retries low and deterministic. Excess retries can amplify queue pressure and convert a local throttling event into a broader incident. In customer-facing systems, two retries with jitter is usually a better reliability/cost tradeoff than aggressive retry loops.

Rate-Limit Reduction Tactics

Tactic                    When To Use               Expected Effect
Token budgeting           Prompt size spikes        Lowers tokens/minute pressure
Client queue + smoothing  Traffic burst windows     Reduces request burst
Adaptive concurrency cap  High timeout coupling     Prevents overload amplification
Tiered model routing      Primary model saturation  Spreads traffic across pools
Per-tenant quotas         Multi-tenant abuse risk   Protects core traffic fairness
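The adaptive concurrency cap can follow a simple AIMD pattern: halve the in-flight limit when a call is throttled or times out, and grow it by one on success. A minimal thread-safe sketch under those assumptions; the class and method names are illustrative:

```python
import threading

class AdaptiveConcurrencyCap:
    """AIMD in-flight limiter: halve the cap on throttling, grow slowly on success."""

    def __init__(self, initial=8, floor=1, ceiling=64):
        self.cap = initial
        self.floor = floor
        self.ceiling = ceiling
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        """Return True if a new call may start under the current cap."""
        with self.lock:
            if self.in_flight < self.cap:
                self.in_flight += 1
                return True
            return False

    def release(self, throttled):
        """Call when a request finishes; `throttled` marks a 429 or timeout."""
        with self.lock:
            self.in_flight -= 1
            if throttled:
                self.cap = max(self.floor, self.cap // 2)  # multiplicative decrease
            elif self.cap < self.ceiling:
                self.cap += 1  # additive increase
```

Because the cap shrinks faster than it grows, a throttling episode quickly sheds load while recovery stays gradual.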

Production Rollout Checklist

  • Define SLO-linked threshold for 429 escalation.
  • Store retry count and backoff duration in structured logs.
  • Add alert when 429 and latency rise together for more than 10 minutes.
  • Implement fallback route and test at low traffic before incident day.
  • Run weekly game-day simulation for throttling scenarios.
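The joint 429-and-latency alert from the checklist can be expressed as a single predicate over recent request samples. A minimal sketch, assuming samples of `(timestamp, status, latency_ms)` and illustrative threshold values; the function name and parameters are hypothetical, not tied to any monitoring product:

```python
def should_escalate(samples, now, window_s=600,
                    rate_429_threshold=0.05, p95_latency_ms=1500):
    """Escalate only when the 429 share AND p95 latency both breach thresholds
    across the window (e.g. the last 10 minutes)."""
    recent = [(status, lat) for t, status, lat in samples if now - t <= window_s]
    if not recent:
        return False
    rate_429 = sum(1 for status, _ in recent if status == 429) / len(recent)
    latencies = sorted(lat for _, lat in recent)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return rate_429 >= rate_429_threshold and p95 >= p95_latency_ms
```

Requiring both signals together filters out benign throttling blips, matching the guide's advice to treat 429 as capacity pressure first.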