Status History and Uptime Archive
Archive of significant reliability events and rolling monthly uptime snapshots. Use this page for post-incident review, cross-provider trend comparison, and setting fallback-policy thresholds.
Notable Incident Archive
| Date | Provider Scope | Duration | Summary |
|------|----------------|----------|---------|
| 2026-02-18 | Multi-provider | 1h 55m | 5xx burst and timeout amplification during a high-load period. |
| 2026-01-28 | 2 providers | 48m | Regional rate-limit spikes in EU inference endpoints. |
| 2025-12-11 | Single provider | 2h 12m | Inference queue saturation and elevated latency. |
| 2025-11-03 | Single provider | 37m | Authentication edge instability for API key validation. |
Monthly Uptime Snapshot
| Provider | Last 30d | Last 90d | Trend |
|----------|----------|----------|-------|
| OpenAI | 99.86% | 99.71% | Stable |
| Anthropic | 99.79% | 99.63% | Stable |
| Google AI | 99.41% | 99.18% | Improving |
| Mistral | 99.74% | 99.52% | Stable |
| Cohere | 99.57% | 99.34% | Stable |
| Perplexity | 98.92% | 98.47% | Volatile |
How To Use Incident History To Tune Alerts
Historical incidents are most useful when they change operational thresholds. If recurring incident windows show p95
latency growth before 5xx spikes, shift alerting to earlier indicators, such as tail-latency and timeout growth, rather
than waiting for hard failures. This reduces user-visible impact and shortens mitigation time.
Archive data also helps calibrate escalation rules. Short, self-healing bursts should trigger local mitigation,
while multi-interval sustained errors should trigger incident mode and fallback routing with explicit traffic caps.
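The burst-versus-sustained distinction above can be sketched as a small escalation rule. This is a minimal illustration, not a production policy; the `ErrorWindow` record, the 5% error-rate threshold, and the three-interval cutoff are all assumptions to be tuned against your own archive.

```python
from dataclasses import dataclass

@dataclass
class ErrorWindow:
    """One monitoring interval's error summary (hypothetical schema)."""
    error_rate: float  # fraction of failed requests in the interval
    duration_s: int    # interval length in seconds

def escalation_level(windows, burst_threshold=0.05, sustained_intervals=3):
    """Return 'local' for short self-healing bursts, 'incident' for
    sustained multi-interval errors. Thresholds are illustrative."""
    breaching = [w.error_rate >= burst_threshold for w in windows]
    # Count consecutive breaching intervals ending at the most recent window.
    consecutive = 0
    for hit in reversed(breaching):
        if not hit:
            break
        consecutive += 1
    if consecutive >= sustained_intervals:
        return "incident"  # fallback routing with explicit traffic caps
    if consecutive >= 1:
        return "local"     # local mitigation only (retry tuning, shedding)
    return "normal"
```

The key design choice is counting consecutive breaching intervals rather than totals, so an isolated self-healing burst never escalates to incident mode.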
Common Incident Patterns
- Rate-limit pattern: 429 spikes with mostly healthy transport paths.
- Latency pattern: p95/p99 climb before failure rate visibly increases.
- Auth pattern: 401/403 rise while provider is otherwise reachable.
- Outage pattern: broad 5xx/timeouts with regional or global spread.
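The four patterns above can be expressed as a heuristic classifier over raw telemetry. A minimal sketch follows; every threshold is an illustrative assumption and should be tuned per provider, not a standard cutoff.

```python
def classify_incident(status_counts, p95_ms, p95_baseline_ms, timeout_rate):
    """Map a telemetry window onto one of the four archive patterns.
    status_counts: dict of HTTP status code -> request count.
    All numeric thresholds are illustrative assumptions."""
    total = sum(status_counts.values()) or 1
    rate = lambda *codes: sum(status_counts.get(c, 0) for c in codes) / total
    if rate(500, 502, 503, 504) > 0.05 or timeout_rate > 0.05:
        return "outage"      # broad 5xx/timeouts
    if rate(429) > 0.05:
        return "rate-limit"  # 429 spikes, transport otherwise healthy
    if rate(401, 403) > 0.02:
        return "auth"        # auth errors while provider is reachable
    if p95_ms > 1.5 * p95_baseline_ms:
        return "latency"     # tail latency climbing before failures
    return "none"
```

Outage is checked first so that a hard failure is never misread as a latency or auth pattern when both signals are present.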
How to Convert Archive Data Into Better Alert Policies
Archive review should directly improve your monitoring configuration. If the same event type appears repeatedly,
your alert thresholds or escalation logic likely need adjustment. The goal is not to collect history, but to reduce
future incident impact and response time.
Monthly Policy Update Checklist
- Compare last 30 days with prior 90-day baseline for drift.
- Identify top recurring pattern (429, latency, auth, or outage).
- Tune one threshold and one mitigation runbook item each month.
- Document whether the change reduced false alerts and response time.
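The first checklist item, comparing the last 30 days against the 90-day baseline, can be automated with a simple drift report. The 0.10 percentage-point threshold below is an assumption, not a recommended value.

```python
def drift_report(snapshot, threshold_pp=0.10):
    """Classify 30-day uptime drift against the 90-day baseline.
    snapshot maps provider -> (last_30d_pct, last_90d_pct).
    threshold_pp (percentage points) is an illustrative assumption."""
    report = {}
    for provider, (u30, u90) in snapshot.items():
        delta = u30 - u90  # positive means recent uptime beats the baseline
        if delta > threshold_pp:
            report[provider] = "improving"
        elif delta < -threshold_pp:
            report[provider] = "degrading"
        else:
            report[provider] = "stable"
    return report
```

Running this monthly against the snapshot table gives a mechanical first pass; the human-assigned Trend column can then override it where variance (e.g. volatility) matters more than the mean.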
Pattern-to-Action Mapping
- Frequent 429 spikes: adjust traffic smoothing and queue strategy before adding more retries.
- Latency-first incidents: trigger warnings on p95 drift before hard failure rates rise.
- Auth instability: increase key-health checks and rollout validation safeguards.
- Multi-region outages: practice controlled failover with traffic caps and rollback criteria.
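The mapping above is static enough to encode directly, so the runbook and the code cannot drift apart. A hypothetical lookup, with wording mirroring the list:

```python
# Hypothetical lookup from archived incident pattern to first mitigation step.
PATTERN_ACTIONS = {
    "rate-limit": "adjust traffic smoothing and queue strategy before adding retries",
    "latency": "warn on p95 drift before hard failure rates rise",
    "auth": "increase key-health checks and rollout validation safeguards",
    "outage": "practice controlled failover with traffic caps and rollback criteria",
}

def next_action(pattern: str) -> str:
    """Return the first mitigation step for a classified pattern."""
    return PATTERN_ACTIONS.get(pattern, "classify the incident manually first")
```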
FAQ
How far back should I keep incident history?
At least 90 days for routing policy decisions and 12 months for annual reliability planning.
Should historical uptime replace live monitoring?
No. Archive trends guide policy; live telemetry drives immediate response.
How often should I review this archive?
Weekly for active operations teams and monthly for threshold and runbook updates.
Can one incident justify changing provider strategy?
Usually no. Look for repeated patterns and cross-region impact before making major strategy changes.
What if official provider status disagrees with this archive?
Use both signals, then prioritize your own production impact and user-facing telemetry.
Archive Review Framework for Quarterly Planning
Quarterly reviews are where archive data creates long-term value. Instead of reading incidents one by one, group them by impact type,
affected region, and recovery speed. This reveals whether your platform risk is shifting toward latency, authentication, or hard outages.
Quarterly Questions to Answer
- Which incident type produced the highest customer-visible impact?
- Which provider or region required the most manual intervention?
- Did fallback routing reduce impact or create additional complexity?
- Which alert triggered too late and should move to earlier indicators?
Converting answers into explicit roadmap items helps prevent repeated incident patterns and improves operational maturity over time.
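Grouping incidents by impact type, region, and recovery speed, as described above, can be sketched as a small aggregation. The incident record fields here are assumptions about an archive schema, not a standard format.

```python
from collections import defaultdict

def group_incidents(incidents):
    """Aggregate archived incidents by (impact_type, region), totalling
    count and downtime minutes, to show where quarterly risk concentrates.
    Record fields (impact_type, region, duration_min) are hypothetical."""
    buckets = defaultdict(lambda: {"count": 0, "minutes": 0})
    for inc in incidents:
        key = (inc["impact_type"], inc["region"])
        buckets[key]["count"] += 1
        buckets[key]["minutes"] += inc["duration_min"]
    return dict(buckets)
```

Sorting the resulting buckets by total minutes gives a quick answer to the first quarterly question: which incident type produced the most downtime.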
How to Prioritize Reliability Investments
Historical status data can guide where engineering time produces the biggest reliability gains. Start by estimating which incident
pattern caused the highest user impact and operational cost. Then prioritize one investment at a time: better alerting, safer retries,
stronger fallback coverage, or improved auth controls.
- High latency incidents: invest in early p95 detection and timeout budget tuning.
- Recurring 429 pressure: invest in rate shaping, queueing, and workload prioritization.
- Auth-related interruptions: invest in key lifecycle automation and scope validation.
- Cross-region outages: invest in regional failover tests and rollback drills.
This method keeps roadmap choices evidence-based and reduces the chance of chasing low-impact reliability work.
Use the same scoring model each quarter so trend comparisons remain consistent and actionable.
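One way to keep that scoring model fixed is to encode it as a weighted sum over normalized inputs. The weights and field names below are illustrative assumptions; the point is that the same function runs unchanged every quarter.

```python
def priority_scores(pattern_stats, weights=None):
    """Score incident patterns for investment priority.
    pattern_stats maps pattern -> dict of factors normalized to 0-1
    by the caller. Weights are illustrative; keep them unchanged
    between quarters so trend comparisons stay consistent."""
    weights = weights or {"user_impact": 0.5, "ops_cost": 0.3, "frequency": 0.2}
    return {pattern: round(sum(weights[k] * stats[k] for k in weights), 3)
            for pattern, stats in pattern_stats.items()}
```

The highest-scoring pattern becomes the quarter's single reliability investment, matching the one-at-a-time prioritization described above.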