AI Checker Hub

Error Code Glossary and Troubleshooting

Use this reference to diagnose AI API failures quickly. Each code includes its type, retry safety, severity, and practical first-response actions.

Error Code Reference

| Code | Title | Type | Retry Safe | Severity | Fix Steps |
|---|---|---|---|---|---|
| 400 | Bad Request | Client input | No | low | Validate JSON schema and required fields before send. |
| 401 | Unauthorized | Auth | No | med | Verify API key scope and org/project mapping. See OpenAI status. |
| 403 | Forbidden | Auth | No | med | Check entitlement, policy constraints, and account permissions. |
| 404 | Endpoint or Model Not Found | Client input | No | low | Confirm endpoint path, API version, and model ID spelling. |
| 408 | Request Timeout | Network | Conditional | med | Increase timeout budget and review timeout guide. |
| 409 | Conflict | Client state | Conditional | low | Use idempotency keys and conflict-safe write logic. |
| 410 | Gone (Deprecated Resource) | Client input | No | low | Move to supported endpoint/model and remove deprecated routes. |
| 413 | Payload Too Large | Client input | No | low | Chunk large inputs and reduce context/token size. |
| 415 | Unsupported Media Type | Client input | No | low | Use accepted content type and serialization format. |
| 422 | Unprocessable Entity | Client input | No | med | Fix semantic validation errors and malformed parameters. |
| 425 | Too Early | Network/edge | Conditional | low | Retry with delay and ensure replay-safe requests. |
| 429 | Too Many Requests | Rate limit | Yes (bounded) | high | Use queue + jittered backoff. See 429 guide and fallback guide. |
| 431 | Request Header Fields Too Large | Client input | No | low | Reduce oversized headers, cookies, and metadata fields. |
| 499 | Client Closed Request | Client/network | Conditional | med | Check client timeout settings and cancellation behavior. |
| 500 | Internal Server Error | Provider | Yes (bounded) | high | Apply capped retries and watch provider comparison. |
| 501 | Not Implemented | Provider/API | No | med | Use supported feature set or alternate endpoint. |
| 502 | Bad Gateway | Provider/network | Yes (bounded) | high | Short backoff retry and monitor regional status pages. |
| 503 | Service Unavailable | Provider capacity | Yes (bounded) | high | Reduce non-critical traffic and prepare failover. |
| 504 | Gateway Timeout | Provider/network | Yes (bounded) | high | Review timeout handling and route critical paths to backup. |
| 507 | Insufficient Storage | Provider/storage | No | med | Reduce payload size and retry after capacity recovery. |
| 509 | Bandwidth Limit Exceeded | Rate limit | Conditional | med | Throttle traffic and verify account/network quotas. |
| 520 | Unknown Edge Error | Edge/provider | Conditional | high | Check provider region status and edge path anomalies. |
| 522 | Connection Timed Out | Network | Yes (bounded) | high | Inspect upstream latency and network path health. |
| 523 | Origin Unreachable | Network/DNS | Conditional | high | Validate DNS and routing. Compare with Anthropic or Gemini status. |
| 524 | A Timeout Occurred | Network/provider | Yes (bounded) | high | Tune timeout split and apply circuit breaker controls. |
| timeout | Transport Timeout | Network/provider | Yes (bounded) | high | Use timeout playbook and lower retry amplification. |
| connection reset | Connection Reset by Peer | Network | Conditional | med | Retry with jitter and inspect upstream connection stability. |
| DNS | DNS Resolution Failure | Network | Conditional | high | Validate resolver health, DNS TTL, and failover records. |
| TLS handshake | TLS Handshake Failure | Network/security | Conditional | med | Verify certificates, cipher support, and clock skew. |
| model overloaded | Model Capacity Overloaded | Provider capacity | Conditional | high | Queue low-priority traffic and shift critical load via fallback routing. |
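The retry-safety column above can be encoded as a small lookup for request middleware. This is an illustrative Python sketch, not a provider API: the `RETRY_POLICY` map and `is_retryable` helper are our names, derived from the table.

```python
# Illustrative retry-safety lookup derived from the table above.
# "bounded"     -> retry-safe only with capped attempts and backoff
# "conditional" -> retry only if the request is idempotent/replay-safe
# "no"          -> fix the request; retrying will not help
RETRY_POLICY = {
    400: "no", 401: "no", 403: "no", 404: "no",
    408: "conditional", 409: "conditional", 410: "no",
    413: "no", 415: "no", 422: "no", 425: "conditional",
    429: "bounded", 431: "no", 499: "conditional",
    500: "bounded", 501: "no", 502: "bounded", 503: "bounded",
    504: "bounded", 507: "no", 509: "conditional",
    520: "conditional", 522: "bounded", 523: "conditional", 524: "bounded",
}

def is_retryable(status: int, idempotent: bool = False) -> bool:
    """Decide whether a failed call may be retried at all."""
    policy = RETRY_POLICY.get(status, "no")
    if policy == "bounded":
        return True
    if policy == "conditional":
        return idempotent  # only replay-safe requests qualify
    return False
```

A lookup like this keeps retry decisions in one place, so a change to policy (say, treating 409 as never retryable) is a one-line edit rather than a hunt through call sites.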

Mini-Guides for Fast Incident Response

How to handle 429 safely (retry + queue + smoothing)

Treat 429 as pressure, not an immediate outage. Add queueing, apply jittered backoff, and smooth burst traffic before failing over. Use the 429 guide for concrete policy templates.
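The jittered-backoff part of this policy can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: `send` is a placeholder for your request function, and the delay uses the common "full jitter" variant of exponential backoff.

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: sleep a random amount in
    [0, min(cap, base * 2**attempt)] so retries never synchronize."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_backoff(send, max_attempts: int = 5):
    """send() returns (status, body); retry only on 429, with a bounded
    attempt budget so retries cannot amplify the rate-limit pressure."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        time.sleep(backoff_delay(attempt))
    return status, body  # budget exhausted: surface the last 429
```

In production this would sit behind the queue, so smoothed traffic hits the backoff path only under genuine limit pressure.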

503 vs 504 vs timeout: what's the difference?

503 usually reflects service saturation, 504 means the gateway's wait limit was exceeded, and a raw timeout typically comes from the client's own transport or read budget. Compare all three before declaring a provider-wide outage.

When auth errors mean "reachable"

A 401 or 403 usually proves the endpoint is reachable; the failure lies in credentials or permissions. Validate key scope and project mapping before retrying.

Circuit breaker + retry budget template

Use a finite retry budget per request class, trip a circuit breaker on sustained 5xx/timeout breaches, and recover gradually after consecutive healthy windows.
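The template above can be sketched as a small state object. This is an illustrative Python version; the threshold and cooldown values are placeholders to tune against your own SLOs, and the "healthy window" recovery is simplified to a single successful trial request.

```python
import time

class CircuitBreaker:
    """Trip after `threshold` consecutive failures, stay open for
    `cooldown` seconds, then allow a trial request (half-open)."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Is a request permitted right now?"""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: permit one trial request
        return False

    def record(self, ok: bool) -> None:
        """Feed back the outcome of each attempt."""
        if ok:
            self.failures = 0
            self.opened_at = None  # close the circuit again
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

Pairing this with the per-class retry budget means a single noisy dependency trips its own breaker instead of starving every other request path.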

When to fail over vs degrade gracefully

Fail over when broad, sustained errors breach the SLO across multiple windows. Degrade gracefully for localized or short-lived latency pressure, then monitor recovery. Use the OpenAI, Anthropic, and Gemini status pages together.

How To Diagnose AI API Errors Fast

Most production failures fall into four buckets: rate limiting, provider or gateway instability, authentication and permission failures, and internal 5xx errors. Classifying the bucket first avoids wasted incident time.

Triage Workflow (60 Seconds)

Classify the error bucket, check whether the code is retry-safe in the table above, compare regions and provider status pages, then apply the default action for that class.

Most Common Errors and Mini-Guides

429 Too Many Requests

Usually rate-limit pressure. Smooth burst traffic, use jittered backoff, and enforce token budgeting.

503 / 504 Service Unavailable / Gateway Timeout

Often queue saturation or upstream instability. Reduce retries and evaluate controlled failover.

401 / 403 Authentication and Permission Failures

Reachability may be healthy while auth context is wrong. Verify key scope, project mapping, and policy constraints.

5xx Internal Errors

Enter incident mode when failures are sustained. Protect critical paths first and cap retry amplification.

Related guides: 429 Error Guide, Timeout Guide, Fallback Routing Guide.

FAQ

Should I retry 429 and 503 errors?

Yes, but with bounded retries and jittered backoff. Aggressive immediate retries often worsen throttling and queue pressure.

When should I switch to a fallback provider?

Switch when 5xx/timeout rates breach your SLO threshold for consecutive windows, or when latency degrades persistently.

Can 401 errors still mean service is reachable?

Yes. Authentication errors usually indicate endpoint reachability but invalid/missing auth context for the specific request.

How do I separate a provider outage from local configuration issues?

Compare multiple regions, check provider status pages, and verify whether errors are limited to one key, project, or environment.

Should retries be the same for all error classes?

No. Retry policies should be class-specific; aggressive retries on 429 or persistent 5xx often make incidents worse.

What should be in a production incident report for API errors?

Include UTC timeline, error-class split, affected regions/endpoints, mitigations, and post-incident threshold updates.

Production Troubleshooting Playbook

This glossary is most effective when paired with a consistent response policy. Teams that map each error class to a default action recover faster and avoid retry storms that amplify outages.

Default Actions by Error Class

Derived from the table above: client-input errors (4xx) are fixed at the source, not retried; rate limits (429) get bounded, jittered retries behind a queue; provider 5xx errors get capped retries plus failover readiness; network errors get conditional retries only when the request is idempotent.

Escalation Triggers

Escalate when 5xx or timeout rates breach the SLO for consecutive windows, when errors span multiple regions, keys, or projects, or when latency degrades persistently despite backoff.

Once escalation triggers fire, switch from local debugging to incident mode and document actions in a shared timeline.

Avoid These Common Mistakes

Common mistakes include retrying aggressively on 429 or persistent 5xx, applying one retry policy to every error class, and declaring a provider-wide outage without comparing regions and status pages.

Keep this page linked from runbooks and on-call dashboards so responders can classify issues quickly under pressure.

Runbook Snippets You Can Reuse

Retry and Backoff Guardrail

Use capped retries with exponential backoff and jitter. Set a hard retry budget per request path so one noisy upstream dependency cannot consume all available capacity.
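One way to enforce a per-path retry budget is a token bucket. This is an illustrative Python sketch; the `rate` and `burst` values are placeholders, and in practice you would keep one bucket per request path.

```python
import time

class RetryBudget:
    """Token-bucket retry budget: a path may spend at most `rate`
    retries per second (with a small burst), so one noisy upstream
    dependency cannot consume all available capacity."""

    def __init__(self, rate: float = 1.0, burst: float = 5.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_spend(self) -> bool:
        """Return True if a retry is allowed; otherwise fail fast."""
        now = time.monotonic()
        # Refill tokens for elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When `try_spend()` returns False, the request should fail fast (or fall through to the failover path) rather than queue another retry.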

Failover Decision Rule

Trigger fallback only after consecutive threshold breaches in latency or error rate. This avoids routing oscillation during short transient spikes and keeps traffic movement predictable.
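The decision rule can be sketched as a consecutive-breach counter. This is an illustrative Python version; the SLO threshold and required window count are placeholders, and here a single healthy window resets the count to keep routing stable.

```python
class FailoverDecider:
    """Trigger fallback only after `required` consecutive evaluation
    windows breach the error-rate SLO; one healthy window resets the
    count, which prevents oscillation on short transient spikes."""

    def __init__(self, slo_error_rate: float = 0.05, required: int = 3):
        self.slo = slo_error_rate
        self.required = required
        self.breaches = 0

    def observe_window(self, errors: int, total: int) -> bool:
        """Feed one window's counts; returns True when failover should fire."""
        rate = errors / total if total else 0.0
        self.breaches = self.breaches + 1 if rate > self.slo else 0
        return self.breaches >= self.required
```

The same structure works for latency breaches: swap the error rate for a p99 latency measurement against its own threshold.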

Post-Incident Improvement Rule

After resolution, update one concrete control: threshold, timeout split, queue policy, or dashboard alert. Without this step, teams often repeat the same response pattern in the next incident.

Key Troubleshooting Links

429 Error Guide, Timeout Guide, Fallback Routing Guide, and the OpenAI, Anthropic, and Gemini status pages.