Status History and Uptime Archive
Archive of significant reliability events and rolling monthly uptime snapshots. Use this page for post-incident review, cross-provider trend comparison, and setting fallback-policy thresholds.
Notable Incident Archive
| Date | Provider Scope | Duration | Summary |
|------|----------------|----------|---------|
| 2026-02-18 | Multi-provider | 1h 55m | 5xx burst and timeout amplification during a high-load period. |
| 2026-01-28 | 2 providers | 48m | Regional rate-limit spikes in EU inference endpoints. |
| 2025-12-11 | Single provider | 2h 12m | Inference queue saturation and elevated latency. |
| 2025-11-03 | Single provider | 37m | Authentication edge instability for API key validation. |
Monthly Uptime Snapshot
| Provider | Last 30d | Last 90d | Trend |
|----------|----------|----------|-------|
| OpenAI | 99.86% | 99.71% | Stable |
| Anthropic | 99.79% | 99.63% | Stable |
| Google AI | 99.41% | 99.18% | Improving |
| Mistral | 99.74% | 99.52% | Stable |
| Cohere | 99.57% | 99.34% | Stable |
| Perplexity | 98.92% | 98.47% | Volatile |
How To Use Incident History To Tune Alerts
Historical incidents are most useful when they change operational thresholds. If recurring incident windows show p95
latency growth before 5xx spikes, shift alerting to earlier indicators, such as tail-latency and timeout growth, rather
than waiting for hard failures. This reduces user-visible impact and shortens mitigation time.
Archive data also helps calibrate escalation rules. Short, self-healing bursts should trigger local mitigation,
while multi-interval sustained errors should trigger incident mode and fallback routing with explicit traffic caps.
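The burst-versus-sustained distinction above can be sketched as a small escalation rule. This is a minimal illustration, not a production policy; the `ErrorWindow` record, the 5% error-rate threshold, and the three-interval cutoff are all assumptions to be tuned against your own archive.

```python
from dataclasses import dataclass

@dataclass
class ErrorWindow:
    """One monitoring interval's error summary (hypothetical schema)."""
    error_rate: float  # fraction of failed requests in the interval
    duration_s: int    # interval length in seconds

def escalation_level(windows, burst_threshold=0.05, sustained_intervals=3):
    """Return 'local' for short self-healing bursts, 'incident' for
    sustained multi-interval errors. Thresholds are illustrative."""
    breaching = [w.error_rate >= burst_threshold for w in windows]
    # Count consecutive breaching intervals ending at the most recent window.
    consecutive = 0
    for hit in reversed(breaching):
        if not hit:
            break
        consecutive += 1
    if consecutive >= sustained_intervals:
        return "incident"  # fallback routing with explicit traffic caps
    if consecutive >= 1:
        return "local"     # local mitigation only (retry tuning, shedding)
    return "normal"
```

The key design choice is counting consecutive breaching intervals rather than totals, so an isolated self-healing burst never escalates to incident mode.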
Common Incident Patterns
- Rate-limit pattern: 429 spikes with mostly healthy transport paths.
- Latency pattern: p95/p99 climb before failure rate visibly increases.
- Auth pattern: 401/403 rise while provider is otherwise reachable.
- Outage pattern: broad 5xx/timeouts with regional or global spread.
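The four patterns above can be expressed as a heuristic classifier over raw telemetry. A minimal sketch follows; every threshold is an illustrative assumption and should be tuned per provider, not a standard cutoff.

```python
def classify_incident(status_counts, p95_ms, p95_baseline_ms, timeout_rate):
    """Map a telemetry window onto one of the four archive patterns.
    status_counts: dict of HTTP status code -> request count.
    All numeric thresholds are illustrative assumptions."""
    total = sum(status_counts.values()) or 1
    rate = lambda *codes: sum(status_counts.get(c, 0) for c in codes) / total
    if rate(500, 502, 503, 504) > 0.05 or timeout_rate > 0.05:
        return "outage"      # broad 5xx/timeouts
    if rate(429) > 0.05:
        return "rate-limit"  # 429 spikes, transport otherwise healthy
    if rate(401, 403) > 0.02:
        return "auth"        # auth errors while provider is reachable
    if p95_ms > 1.5 * p95_baseline_ms:
        return "latency"     # tail latency climbing before failures
    return "none"
```

Outage is checked first so that a hard failure is never misread as a latency or auth pattern when both signals are present.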
How to Convert Archive Data Into Better Alert Policies
Archive review should directly improve your monitoring configuration. If the same event type appears repeatedly,
your alert thresholds or escalation logic likely need adjustment. The goal is not to collect history, but to reduce
future incident impact and response time.
Monthly Policy Update Checklist
- Compare last 30 days with prior 90-day baseline for drift.
- Identify top recurring pattern (429, latency, auth, or outage).
- Tune one threshold and one mitigation runbook item each month.
- Document whether the change reduced false alerts and response time.
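The first checklist item, comparing the last 30 days against the 90-day baseline, can be automated with a simple drift report. The 0.10 percentage-point threshold below is an assumption, not a recommended value.

```python
def drift_report(snapshot, threshold_pp=0.10):
    """Classify 30-day uptime drift against the 90-day baseline.
    snapshot maps provider -> (last_30d_pct, last_90d_pct).
    threshold_pp (percentage points) is an illustrative assumption."""
    report = {}
    for provider, (u30, u90) in snapshot.items():
        delta = u30 - u90  # positive means recent uptime beats the baseline
        if delta > threshold_pp:
            report[provider] = "improving"
        elif delta < -threshold_pp:
            report[provider] = "degrading"
        else:
            report[provider] = "stable"
    return report
```

Running this monthly against the snapshot table gives a mechanical first pass; the human-assigned Trend column can then override it where variance (e.g. volatility) matters more than the mean.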
Pattern-to-Action Mapping
- Frequent 429 spikes: adjust traffic smoothing and queue strategy before adding more retries.
- Latency-first incidents: trigger warnings on p95 drift before hard failure rates rise.
- Auth instability: increase key-health checks and rollout validation safeguards.
- Multi-region outages: practice controlled failover with traffic caps and rollback criteria.
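The mapping above is static enough to encode directly, so the runbook and the code cannot drift apart. A hypothetical lookup, with wording mirroring the list:

```python
# Hypothetical lookup from archived incident pattern to first mitigation step.
PATTERN_ACTIONS = {
    "rate-limit": "adjust traffic smoothing and queue strategy before adding retries",
    "latency": "warn on p95 drift before hard failure rates rise",
    "auth": "increase key-health checks and rollout validation safeguards",
    "outage": "practice controlled failover with traffic caps and rollback criteria",
}

def next_action(pattern: str) -> str:
    """Return the first mitigation step for a classified pattern."""
    return PATTERN_ACTIONS.get(pattern, "classify the incident manually first")
```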
FAQ
How far back should I keep incident history?
At least 90 days for routing policy decisions and 12 months for annual reliability planning.
Should historical uptime replace live monitoring?
No. Archive trends guide policy; live telemetry drives immediate response.
How often should I review this archive?
Weekly for active operations teams and monthly for threshold and runbook updates.
Can one incident justify changing provider strategy?
Usually no. Look for repeated patterns and cross-region impact before making major strategy changes.
What if official provider status disagrees with this archive?
Use both signals, then prioritize your own production impact and user-facing telemetry.
Archive Review Framework for Quarterly Planning
Quarterly reviews are where archive data creates long-term value. Instead of reading incidents one by one, group them by impact type,
affected region, and recovery speed. This reveals whether your platform risk is shifting toward latency, authentication, or hard outages.
Quarterly Questions to Answer
- Which incident type produced the highest customer-visible impact?
- Which provider or region required the most manual intervention?
- Did fallback routing reduce impact or create additional complexity?
- Which alert triggered too late and should move to earlier indicators?
Converting answers into explicit roadmap items helps prevent repeated incident patterns and improves operational maturity over time.
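Grouping incidents by impact type, region, and recovery speed, as described above, can be sketched as a small aggregation. The incident record fields here are assumptions about an archive schema, not a standard format.

```python
from collections import defaultdict

def group_incidents(incidents):
    """Aggregate archived incidents by (impact_type, region), totalling
    count and downtime minutes, to show where quarterly risk concentrates.
    Record fields (impact_type, region, duration_min) are hypothetical."""
    buckets = defaultdict(lambda: {"count": 0, "minutes": 0})
    for inc in incidents:
        key = (inc["impact_type"], inc["region"])
        buckets[key]["count"] += 1
        buckets[key]["minutes"] += inc["duration_min"]
    return dict(buckets)
```

Sorting the resulting buckets by total minutes gives a quick answer to the first quarterly question: which incident type produced the most downtime.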
How to Prioritize Reliability Investments
Historical status data can guide where engineering time produces the biggest reliability gains. Start by estimating which incident
pattern caused the highest user impact and operational cost. Then prioritize one investment at a time: better alerting, safer retries,
stronger fallback coverage, or improved auth controls.
- High latency incidents: invest in early p95 detection and timeout budget tuning.
- Recurring 429 pressure: invest in rate shaping, queueing, and workload prioritization.
- Auth-related interruptions: invest in key lifecycle automation and scope validation.
- Cross-region outages: invest in regional failover tests and rollback drills.
This method keeps roadmap choices evidence-based and reduces the chance of chasing low-impact reliability work.
Use the same scoring model each quarter so trend comparisons remain consistent and actionable.
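One way to keep that scoring model fixed is to encode it as a weighted sum over normalized inputs. The weights and field names below are illustrative assumptions; the point is that the same function runs unchanged every quarter.

```python
def priority_scores(pattern_stats, weights=None):
    """Score incident patterns for investment priority.
    pattern_stats maps pattern -> dict of factors normalized to 0-1
    by the caller. Weights are illustrative; keep them unchanged
    between quarters so trend comparisons stay consistent."""
    weights = weights or {"user_impact": 0.5, "ops_cost": 0.3, "frequency": 0.2}
    return {pattern: round(sum(weights[k] * stats[k] for k in weights), 3)
            for pattern, stats in pattern_stats.items()}
```

The highest-scoring pattern becomes the quarter's single reliability investment, matching the one-at-a-time prioritization described above.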