AI Checker Hub
Error Code Glossary and Troubleshooting
Use this reference to diagnose AI API failures quickly. Each entry lists the error type, whether it is safe to retry, a severity rating, and practical first-response fix steps.
Error Code Reference
| Code | Title | Type | Retry Safe | Severity | Fix Steps |
|---|---|---|---|---|---|
| 400 | Bad Request | Client input | No | low | Validate JSON schema and required fields before sending. |
| 401 | Unauthorized | Auth | No | med | Verify API key scope and org/project mapping. See OpenAI status. |
| 403 | Forbidden | Auth | No | med | Check entitlement, policy constraints, and account permissions. |
| 404 | Endpoint or Model Not Found | Client input | No | low | Confirm endpoint path, API version, and model ID spelling. |
| 408 | Request Timeout | Network | Conditional | med | Increase timeout budget and review timeout guide. |
| 409 | Conflict | Client state | Conditional | low | Use idempotency keys and conflict-safe write logic. |
| 410 | Gone (Deprecated Resource) | Client input | No | low | Move to supported endpoint/model and remove deprecated routes. |
| 413 | Payload Too Large | Client input | No | low | Chunk large inputs and reduce context/token size. |
| 415 | Unsupported Media Type | Client input | No | low | Use accepted content type and serialization format. |
| 422 | Unprocessable Entity | Client input | No | med | Fix semantic validation errors and malformed parameters. |
| 425 | Too Early | Network/edge | Conditional | low | Retry with delay and ensure replay-safe requests. |
| 429 | Too Many Requests | Rate limit | Yes (bounded) | high | Use queue + jittered backoff. See 429 guide and fallback guide. |
| 431 | Request Header Fields Too Large | Client input | No | low | Reduce oversized headers, cookies, and metadata fields. |
| 499 | Client Closed Request | Client/network | Conditional | med | Check client timeout settings and cancellation behavior. |
| 500 | Internal Server Error | Provider | Yes (bounded) | high | Apply capped retries and watch provider comparison. |
| 501 | Not Implemented | Provider/API | No | med | Use supported feature set or alternate endpoint. |
| 502 | Bad Gateway | Provider/network | Yes (bounded) | high | Short backoff retry and monitor regional status pages. |
| 503 | Service Unavailable | Provider capacity | Yes (bounded) | high | Reduce non-critical traffic and prepare failover. |
| 504 | Gateway Timeout | Provider/network | Yes (bounded) | high | Review timeout handling and route critical paths to backup. |
| 507 | Insufficient Storage | Provider/storage | No | med | Reduce payload size and retry after capacity recovery. |
| 509 | Bandwidth Limit Exceeded | Rate limit | Conditional | med | Throttle traffic and verify account/network quotas. |
| 520 | Unknown Edge Error | Edge/provider | Conditional | high | Check provider region status and edge path anomalies. |
| 522 | Connection Timed Out | Network | Yes (bounded) | high | Inspect upstream latency and network path health. |
| 523 | Origin Unreachable | Network/DNS | Conditional | high | Validate DNS and routing. Compare with Anthropic or Gemini status. |
| 524 | A Timeout Occurred | Network/provider | Yes (bounded) | high | Tune timeout split and apply circuit breaker controls. |
| timeout | Transport Timeout | Network/provider | Yes (bounded) | high | Use timeout playbook and lower retry amplification. |
| connection reset | Connection Reset by Peer | Network | Conditional | med | Retry with jitter and inspect upstream connection stability. |
| DNS | DNS Resolution Failure | Network | Conditional | high | Validate resolver health, DNS TTL, and failover records. |
| TLS handshake | TLS Handshake Failure | Network/security | Conditional | med | Verify certificates, cipher support, and clock skew. |
| model overloaded | Model Capacity Overloaded | Provider capacity | Conditional | high | Queue low-priority traffic and shift critical load via fallback routing. |
Mini-Guides for Fast Incident Response
How to handle 429 safely (retry + queue + smoothing)
Treat 429 as pressure, not immediate outage. Add queueing, apply jittered backoff, and smooth burst traffic before failover. Use the 429 guide for concrete policy templates.
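The backoff part of that policy can be sketched as below. The `send` callable, attempt counts, and delay values are illustrative assumptions, not a specific provider SDK or recommended limits.

```python
import random
import time

def call_with_backoff(send, max_attempts=3, base=1.0, cap=30.0):
    """Retry a request on 429 with capped exponential backoff plus full jitter.

    `send` is any zero-arg callable returning (status, body); it stands in
    for your real API client. Full jitter (a uniform draw up to the capped
    exponential) avoids synchronized retry storms across many clients.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    # Budget exhausted: surface the last 429 to the caller.
    return status, body
```

In production this would sit behind the queue, so smoothed traffic enters the retry path rather than raw bursts.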
503 vs 504 vs timeout: what's the difference?
503 usually reflects service saturation, 504 indicates gateway wait exceeded, and timeout often comes from transport/read budget limits. Compare all three before declaring provider-wide outage.
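Because client-side read budgets are one of those three sources, it helps to split a total request budget into explicit connect and read timeouts. This is a minimal sketch; the fractions and floors are assumptions, not recommended values.

```python
def timeout_split(total_budget, connect_fraction=0.1, floor=1.0):
    """Split a total request budget (seconds) into connect and read timeouts.

    Keeps a small, floored connect timeout so unreachable hosts fail fast,
    while most of the budget goes to waiting for the response body.
    """
    connect = max(floor, total_budget * connect_fraction)
    read = max(floor, total_budget - connect)
    return connect, read
```

Many HTTP clients accept such a pair directly, e.g. `requests` takes `timeout=(connect, read)`.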
When auth errors mean "reachable"
401/403 often prove endpoint reachability while credentials or permissions are invalid. Validate key scope and project mapping before retrying.
Circuit breaker + retry budget template
Use a finite retry budget per request class, trip a circuit breaker on sustained 5xx/timeout breaches, and recover gradually after consecutive healthy windows.
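A minimal breaker implementing that trip/recover behavior might look like this. The thresholds are illustrative, and a production breaker would add a half-open probe state rather than letting all traffic test recovery.

```python
class CircuitBreaker:
    """Trip after N consecutive failures; close after M consecutive successes."""

    def __init__(self, trip_after=5, recover_after=3):
        self.trip_after = trip_after
        self.recover_after = recover_after
        self.failures = 0
        self.successes = 0
        self.open = False  # open = requests short-circuited

    def allow(self):
        return not self.open

    def record(self, ok):
        if ok:
            self.failures = 0
            self.successes += 1
            if self.open and self.successes >= self.recover_after:
                self.open = False  # recover after a healthy window
        else:
            self.successes = 0
            self.failures += 1
            if self.failures >= self.trip_after:
                self.open = True  # trip on sustained 5xx/timeout breaches
```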
When to fail over vs degrade gracefully
Fail over when broad, sustained errors breach SLO in multiple windows. Degrade gracefully for localized or short-lived latency pressure, then monitor recovery. Use OpenAI, Anthropic, and Gemini status pages together.
How To Diagnose AI API Errors Fast
Most production failures fall into four buckets. Classifying the bucket first avoids wasted incident time.
- Auth/Permissions (401/403): key, org/project, or role mapping issues.
- Rate limiting (429): request, token, or concurrency pressure.
- Provider instability (5xx/timeouts): queue saturation or outage behavior.
- Client/network: DNS, TLS, proxy, or routing path issues.
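A bucket classifier is easy to automate. The sketch below assumes you can pass either an HTTP status or a short transport-error label; both the labels and the function name are illustrative.

```python
def classify(status=None, exc=None):
    """Map a raw failure to one of the four triage buckets.

    `status` is an HTTP status code; `exc` is a transport-error label
    such as "dns", "tls", "proxy", or "timeout".
    """
    if exc in ("dns", "tls", "proxy", "connection_reset"):
        return "client_network"
    if exc == "timeout" or (status is not None and status >= 500):
        return "provider_instability"
    if status in (401, 403):
        return "auth"
    if status == 429:
        return "rate_limit"
    return "client_input"  # 400/404/413/422 and similar
```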
Triage Workflow (60 Seconds)
- Check whether failures are global or region-specific.
- Compare 5xx + timeouts versus 429 to separate outage from pressure.
- Confirm billing/quota state and account limits.
- Apply bounded retries with jitter (max 2 attempts).
- Fail over only after consecutive SLO breach, not single spikes.
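The outage-versus-pressure comparison in the second step can be reduced to a simple ratio check over the last window. Labels, counts, and the 5% threshold below are assumptions for illustration.

```python
def triage_signal(counts, threshold=0.05):
    """Separate outage-like failures from rate-limit pressure.

    `counts` maps result labels to occurrences in the last window,
    e.g. {"429": 40, "503": 2, "timeout": 1, "ok": 957}.
    """
    total = sum(counts.values()) or 1
    outage = sum(v for k, v in counts.items()
                 if k == "timeout" or k.startswith("5"))
    throttle = counts.get("429", 0)
    if outage >= throttle and outage / total > threshold:
        return "suspect_outage"
    if throttle / total > threshold:
        return "rate_limit_pressure"
    return "healthy"
```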
Most Common Errors and Mini-Guides
429 Too Many Requests
Usually rate-limit pressure. Smooth burst traffic, use jittered backoff, and enforce token budgeting.
503 / 504 Service Unavailable / Gateway Timeout
Often queue saturation or upstream instability. Reduce retries and evaluate controlled failover.
401 / 403 Authentication and Permission Failures
Reachability may be healthy while auth context is wrong. Verify key scope, project mapping, and policy constraints.
5xx Internal Errors
Enter incident mode when failures are sustained. Protect critical paths first and cap retry amplification.
Related guides: 429 Error Guide, Timeout Guide, Fallback Routing Guide.
FAQ
Should I retry 429 and 503 errors?
Yes, but with bounded retries and jittered backoff. Aggressive immediate retries often worsen throttling and queue pressure.
When should I switch to fallback provider?
Switch when 5xx/timeout rates breach your SLO threshold for consecutive windows, or when latency degrades persistently.
Can 401 errors still mean service is reachable?
Yes. Authentication errors usually indicate endpoint reachability but invalid/missing auth context for the specific request.
How do I separate outage from local configuration issues?
Compare multiple regions, check provider status pages, and verify whether errors are limited to one key, project, or environment.
Should retries be the same for all error classes?
No. Retry policies should be class-specific; aggressive retries on 429 or persistent 5xx often make incidents worse.
What should be in a production incident report for API errors?
Include UTC timeline, error-class split, affected regions/endpoints, mitigations, and post-incident threshold updates.
Production Troubleshooting Playbook
This glossary is most effective when paired with a consistent response policy. Teams that map each error class to a default action recover faster and avoid retry storms that amplify outages.
Default Actions by Error Class
- 401/403: stop retries, verify auth scope, rotate credentials only when needed.
- 429: reduce burst load, queue non-critical requests, add jittered backoff.
- 5xx/timeouts: cap retries, enable partial fallback, protect critical endpoints first.
- Network failures: validate DNS/TLS/proxy path before blaming provider outage.
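That class-to-action mapping is worth encoding directly in the runbook automation, so responders never improvise under pressure. The action labels below are hypothetical names mirroring the list above.

```python
# Default first actions per error class; labels mirror the list above.
DEFAULT_ACTIONS = {
    "auth": "stop_retries_verify_scope",
    "rate_limit": "queue_and_backoff",
    "provider": "capped_retry_partial_fallback",
    "network": "check_dns_tls_proxy",
}

def first_action(error_class):
    """Look up the default response; unknown classes escalate to a human."""
    return DEFAULT_ACTIONS.get(error_class, "escalate_to_oncall")
```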
Escalation Triggers
- More than 3 consecutive windows above your 5xx or timeout threshold.
- Regional spread from one region to two or more regions.
- p95 latency growth without recovery after bounded retries.
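The first trigger, consecutive windows above threshold, can be checked mechanically. The 5% threshold below is an illustrative placeholder; use your own SLO value.

```python
def should_escalate(window_error_rates, threshold=0.05, consecutive=3):
    """True once MORE than `consecutive` recent windows breach `threshold`.

    `window_error_rates` is an ordered list of per-window 5xx/timeout
    error rates, most recent last. Any healthy window resets the streak.
    """
    streak = 0
    for rate in window_error_rates:
        streak = streak + 1 if rate > threshold else 0
    return streak > consecutive
```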
Once escalation triggers fire, switch from local debugging to incident mode and document actions in a shared timeline.
Avoid These Common Mistakes
- Treating all failures the same: 429 and 503 require different control strategies.
- Infinite retries: unbounded retries can increase queue pressure and error rates.
- Ignoring region context: local routing issues can mimic full provider outages.
- Late failover: waiting for complete outage often creates larger user impact.
- No postmortem updates: if thresholds are never tuned, the same incident repeats.
Keep this page linked from runbooks and on-call dashboards so responders can classify issues quickly under pressure.
Runbook Snippets You Can Reuse
Retry and Backoff Guardrail
Use capped retries with exponential backoff and jitter. Set a hard retry budget per request path so one noisy upstream dependency cannot consume all available capacity.
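A per-path retry budget can be a few lines of bookkeeping. This sketch leaves window rotation to the caller, and the default limit is an assumption, not a recommendation.

```python
from collections import defaultdict

class RetryBudget:
    """Hard cap on retries per request path per window, so one noisy
    upstream dependency cannot consume all available retry capacity."""

    def __init__(self, max_retries_per_path=50):
        self.limit = max_retries_per_path
        self.spent = defaultdict(int)

    def try_spend(self, path):
        if self.spent[path] >= self.limit:
            return False  # budget exhausted: fail fast, no retry
        self.spent[path] += 1
        return True

    def reset(self):
        self.spent.clear()  # call at each window boundary
```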
Failover Decision Rule
Trigger fallback only after consecutive threshold breaches in latency or error rate. This avoids routing oscillation during short transient spikes and keeps traffic movement predictable.
Post-Incident Improvement Rule
After resolution, update one concrete control: threshold, timeout split, queue policy, or dashboard alert. Without this step, teams often repeat the same response pattern in the next incident.