AI Checker Hub
Error Code Glossary and Troubleshooting
Use this reference to diagnose AI API failures quickly. Each entry lists the error type, whether it is safe to retry, a severity rating, and practical first-response fix steps.
Error Code Reference
| Code | Title | Type | Retry Safe | Severity | Fix Steps |
|---|---|---|---|---|---|
| 400 | Bad Request | Client input | No | low | Validate JSON schema and required fields before sending. |
| 401 | Unauthorized | Auth | No | med | Verify API key scope and org/project mapping. See OpenAI status. |
| 403 | Forbidden | Auth | No | med | Check entitlement, policy constraints, and account permissions. |
| 404 | Endpoint or Model Not Found | Client input | No | low | Confirm endpoint path, API version, and model ID spelling. |
| 408 | Request Timeout | Network | Conditional | med | Increase timeout budget and review timeout guide. |
| 409 | Conflict | Client state | Conditional | low | Use idempotency keys and conflict-safe write logic. |
| 410 | Gone (Deprecated Resource) | Client input | No | low | Move to supported endpoint/model and remove deprecated routes. |
| 413 | Payload Too Large | Client input | No | low | Chunk large inputs and reduce context/token size. |
| 415 | Unsupported Media Type | Client input | No | low | Use accepted content type and serialization format. |
| 422 | Unprocessable Entity | Client input | No | med | Fix semantic validation errors and malformed parameters. |
| 425 | Too Early | Network/edge | Conditional | low | Retry with delay and ensure replay-safe requests. |
| 429 | Too Many Requests | Rate limit | Yes (bounded) | high | Use queue + jittered backoff. See 429 guide and fallback guide. |
| 431 | Request Header Fields Too Large | Client input | No | low | Reduce oversized headers, cookies, and metadata fields. |
| 499 | Client Closed Request | Client/network | Conditional | med | Check client timeout settings and cancellation behavior. |
| 500 | Internal Server Error | Provider | Yes (bounded) | high | Apply capped retries and watch provider comparison. |
| 501 | Not Implemented | Provider/API | No | med | Use supported feature set or alternate endpoint. |
| 502 | Bad Gateway | Provider/network | Yes (bounded) | high | Short backoff retry and monitor regional status pages. |
| 503 | Service Unavailable | Provider capacity | Yes (bounded) | high | Reduce non-critical traffic and prepare failover. |
| 504 | Gateway Timeout | Provider/network | Yes (bounded) | high | Review timeout handling and route critical paths to backup. |
| 507 | Insufficient Storage | Provider/storage | No | med | Reduce payload size and retry after capacity recovery. |
| 509 | Bandwidth Limit Exceeded | Rate limit | Conditional | med | Throttle traffic and verify account/network quotas. |
| 520 | Unknown Edge Error | Edge/provider | Conditional | high | Check provider region status and edge path anomalies. |
| 522 | Connection Timed Out | Network | Yes (bounded) | high | Inspect upstream latency and network path health. |
| 523 | Origin Unreachable | Network/DNS | Conditional | high | Validate DNS and routing. Compare with Anthropic or Gemini status. |
| 524 | A Timeout Occurred | Network/provider | Yes (bounded) | high | Tune timeout split and apply circuit breaker controls. |
| timeout | Transport Timeout | Network/provider | Yes (bounded) | high | Use timeout playbook and lower retry amplification. |
| connection reset | Connection Reset by Peer | Network | Conditional | med | Retry with jitter and inspect upstream connection stability. |
| DNS | DNS Resolution Failure | Network | Conditional | high | Validate resolver health, DNS TTL, and failover records. |
| TLS handshake | TLS Handshake Failure | Network/security | Conditional | med | Verify certificates, cipher support, and clock skew. |
| model overloaded | Model Capacity Overloaded | Provider capacity | Conditional | high | Queue low-priority traffic and shift critical load via fallback routing. |
Mini-Guides for Fast Incident Response
How to handle 429 safely (retry + queue + smoothing)
Treat 429 as pressure, not immediate outage. Add queueing, apply jittered backoff, and smooth burst traffic before failover. Use the 429 guide for concrete policy templates.
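The backoff part of that policy can be sketched as below. The `send` callable, attempt counts, and delay values are illustrative assumptions, not a specific provider SDK or recommended limits.

```python
import random
import time

def call_with_backoff(send, max_attempts=3, base=1.0, cap=30.0):
    """Retry a request on 429 with capped exponential backoff plus full jitter.

    `send` is any zero-arg callable returning (status, body); it stands in
    for your real API client. Full jitter (a uniform draw up to the capped
    exponential) avoids synchronized retry storms across many clients.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    # Budget exhausted: surface the last 429 to the caller.
    return status, body
```

In production this would sit behind the queue, so smoothed traffic enters the retry path rather than raw bursts.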
503 vs 504 vs timeout: what's the difference?
503 usually reflects service saturation, 504 indicates gateway wait exceeded, and timeout often comes from transport/read budget limits. Compare all three before declaring provider-wide outage.
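Because client-side read budgets are one of those three sources, it helps to split a total request budget into explicit connect and read timeouts. This is a minimal sketch; the fractions and floors are assumptions, not recommended values.

```python
def timeout_split(total_budget, connect_fraction=0.1, floor=1.0):
    """Split a total request budget (seconds) into connect and read timeouts.

    Keeps a small, floored connect timeout so unreachable hosts fail fast,
    while most of the budget goes to waiting for the response body.
    """
    connect = max(floor, total_budget * connect_fraction)
    read = max(floor, total_budget - connect)
    return connect, read
```

Many HTTP clients accept such a pair directly, e.g. `requests` takes `timeout=(connect, read)`.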
When auth errors mean "reachable"
401/403 often prove endpoint reachability while credentials or permissions are invalid. Validate key scope and project mapping before retrying.
Circuit breaker + retry budget template
Use a finite retry budget per request class, trip a circuit breaker on sustained 5xx/timeout breaches, and recover gradually after consecutive healthy windows.
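A minimal breaker implementing that trip/recover behavior might look like this. The thresholds are illustrative, and a production breaker would add a half-open probe state rather than letting all traffic test recovery.

```python
class CircuitBreaker:
    """Trip after N consecutive failures; close after M consecutive successes."""

    def __init__(self, trip_after=5, recover_after=3):
        self.trip_after = trip_after
        self.recover_after = recover_after
        self.failures = 0
        self.successes = 0
        self.open = False  # open = requests short-circuited

    def allow(self):
        return not self.open

    def record(self, ok):
        if ok:
            self.failures = 0
            self.successes += 1
            if self.open and self.successes >= self.recover_after:
                self.open = False  # recover after a healthy window
        else:
            self.successes = 0
            self.failures += 1
            if self.failures >= self.trip_after:
                self.open = True  # trip on sustained 5xx/timeout breaches
```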
When to fail over vs degrade gracefully
Fail over when broad, sustained errors breach SLO in multiple windows. Degrade gracefully for localized or short-lived latency pressure, then monitor recovery. Use OpenAI, Anthropic, and Gemini status pages together.
How To Diagnose AI API Errors Fast
Most production failures fall into four buckets. Classifying the bucket first avoids wasted incident time.
- Auth/Permissions (401/403): key, org/project, or role mapping issues.
- Rate limiting (429): request, token, or concurrency pressure.
- Provider instability (5xx/timeouts): queue saturation or outage behavior.
- Client/network: DNS, TLS, proxy, or routing path issues.
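A bucket classifier is easy to automate. The sketch below assumes you can pass either an HTTP status or a short transport-error label; both the labels and the function name are illustrative.

```python
def classify(status=None, exc=None):
    """Map a raw failure to one of the four triage buckets.

    `status` is an HTTP status code; `exc` is a transport-error label
    such as "dns", "tls", "proxy", or "timeout".
    """
    if exc in ("dns", "tls", "proxy", "connection_reset"):
        return "client_network"
    if exc == "timeout" or (status is not None and status >= 500):
        return "provider_instability"
    if status in (401, 403):
        return "auth"
    if status == 429:
        return "rate_limit"
    return "client_input"  # 400/404/413/422 and similar
```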
Triage Workflow (60 Seconds)
- Check whether failures are global or region-specific.
- Compare 5xx + timeouts versus 429 to separate outage from pressure.
- Confirm billing/quota state and account limits.
- Apply bounded retries with jitter (max 2 attempts).
- Fail over only after consecutive SLO breach, not single spikes.
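The outage-versus-pressure comparison in the second step can be reduced to a simple ratio check over the last window. Labels, counts, and the 5% threshold below are assumptions for illustration.

```python
def triage_signal(counts, threshold=0.05):
    """Separate outage-like failures from rate-limit pressure.

    `counts` maps result labels to occurrences in the last window,
    e.g. {"429": 40, "503": 2, "timeout": 1, "ok": 957}.
    """
    total = sum(counts.values()) or 1
    outage = sum(v for k, v in counts.items()
                 if k == "timeout" or k.startswith("5"))
    throttle = counts.get("429", 0)
    if outage >= throttle and outage / total > threshold:
        return "suspect_outage"
    if throttle / total > threshold:
        return "rate_limit_pressure"
    return "healthy"
```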
Most Common Errors and Mini-Guides
429 Too Many Requests
Usually rate-limit pressure. Smooth burst traffic, use jittered backoff, and enforce token budgeting.
503 / 504 Service Unavailable / Gateway Timeout
Often queue saturation or upstream instability. Reduce retries and evaluate controlled failover.
401 / 403 Authentication and Permission Failures
Reachability may be healthy while auth context is wrong. Verify key scope, project mapping, and policy constraints.
5xx Internal Errors
Enter incident mode when failures are sustained. Protect critical paths first and cap retry amplification.
Related guides: 429 Error Guide, Timeout Guide, Fallback Routing Guide.
FAQ
Should I retry 429 and 503 errors?
Yes, but with bounded retries and jittered backoff. Aggressive immediate retries often worsen throttling and queue pressure.
When should I switch to fallback provider?
Switch when 5xx/timeout rates breach your SLO threshold for consecutive windows, or when latency degrades persistently.
Can 401 errors still mean service is reachable?
Yes. Authentication errors usually indicate endpoint reachability but invalid/missing auth context for the specific request.
How do I separate outage from local configuration issues?
Compare multiple regions, check provider status pages, and verify whether errors are limited to one key, project, or environment.
Should retries be the same for all error classes?
No. Retry policies should be class-specific; aggressive retries on 429 or persistent 5xx often make incidents worse.
What should be in a production incident report for API errors?
Include UTC timeline, error-class split, affected regions/endpoints, mitigations, and post-incident threshold updates.
Production Troubleshooting Playbook
This glossary is most effective when paired with a consistent response policy. Teams that map each error class to a default action recover faster and avoid retry storms that amplify outages.
Default Actions by Error Class
- 401/403: stop retries, verify auth scope, rotate credentials only when needed.
- 429: reduce burst load, queue non-critical requests, add jittered backoff.
- 5xx/timeouts: cap retries, enable partial fallback, protect critical endpoints first.
- Network failures: validate DNS/TLS/proxy path before blaming provider outage.
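That class-to-action mapping is worth encoding directly in the runbook automation, so responders never improvise under pressure. The action labels below are hypothetical names mirroring the list above.

```python
# Default first actions per error class; labels mirror the list above.
DEFAULT_ACTIONS = {
    "auth": "stop_retries_verify_scope",
    "rate_limit": "queue_and_backoff",
    "provider": "capped_retry_partial_fallback",
    "network": "check_dns_tls_proxy",
}

def first_action(error_class):
    """Look up the default response; unknown classes escalate to a human."""
    return DEFAULT_ACTIONS.get(error_class, "escalate_to_oncall")
```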
Escalation Triggers
- More than 3 consecutive windows above your 5xx or timeout threshold.
- Regional spread from one region to two or more regions.
- p95 latency growth without recovery after bounded retries.
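The first trigger, consecutive windows above threshold, can be checked mechanically. The 5% threshold below is an illustrative placeholder; use your own SLO value.

```python
def should_escalate(window_error_rates, threshold=0.05, consecutive=3):
    """True once MORE than `consecutive` recent windows breach `threshold`.

    `window_error_rates` is an ordered list of per-window 5xx/timeout
    error rates, most recent last. Any healthy window resets the streak.
    """
    streak = 0
    for rate in window_error_rates:
        streak = streak + 1 if rate > threshold else 0
    return streak > consecutive
```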
Once escalation triggers fire, switch from local debugging to incident mode and document actions in a shared timeline.
Avoid These Common Mistakes
- Treating all failures the same: 429 and 503 require different control strategies.
- Infinite retries: unbounded retries can increase queue pressure and error rates.
- Ignoring region context: local routing issues can mimic full provider outages.
- Late failover: waiting for complete outage often creates larger user impact.
- No postmortem updates: if thresholds are never tuned, the same incident repeats.
Keep this page linked from runbooks and on-call dashboards so responders can classify issues quickly under pressure.
Runbook Snippets You Can Reuse
Retry and Backoff Guardrail
Use capped retries with exponential backoff and jitter. Set a hard retry budget per request path so one noisy upstream dependency cannot consume all available capacity.
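A per-path retry budget can be a few lines of bookkeeping. This sketch leaves window rotation to the caller, and the default limit is an assumption, not a recommendation.

```python
from collections import defaultdict

class RetryBudget:
    """Hard cap on retries per request path per window, so one noisy
    upstream dependency cannot consume all available retry capacity."""

    def __init__(self, max_retries_per_path=50):
        self.limit = max_retries_per_path
        self.spent = defaultdict(int)

    def try_spend(self, path):
        if self.spent[path] >= self.limit:
            return False  # budget exhausted: fail fast, no retry
        self.spent[path] += 1
        return True

    def reset(self):
        self.spent.clear()  # call at each window boundary
```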
Failover Decision Rule
Trigger fallback only after consecutive threshold breaches in latency or error rate. This avoids routing oscillation during short transient spikes and keeps traffic movement predictable.
Post-Incident Improvement Rule
After resolution, update one concrete control: threshold, timeout split, queue policy, or dashboard alert. Without this step, teams often repeat the same response pattern in the next incident.