Monitoring Methodology
This page explains how status signals are produced so readers can judge fitness for their own workloads.
Check Cadence and Scope
- Scheduled checks run every 5 minutes.
- Each configured provider endpoint receives one probe per cycle.
- Measured values are persisted in a D1 database table named health_checks.
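To make the cadence and persistence model concrete, here is a minimal sketch of one check cycle: each configured endpoint gets a single probe, and the outcome is collected as a row destined for the health_checks table. The type names, field names, and the injected prober are illustrative assumptions, not the monitor's actual implementation.

```typescript
// One probe per configured endpoint per cycle; rows would then be written
// to the D1 health_checks table (persistence omitted in this sketch).
interface CheckRow {
  provider: string;
  ok: boolean;
  status: number | null; // null = transport failure, no HTTP status received
  latencyMs: number;
  checkedAt: string;
}

type Prober = (url: string) => Promise<{ status: number }>;

async function runCycle(
  endpoints: Record<string, string>,
  probe: Prober,
): Promise<CheckRow[]> {
  const rows: CheckRow[] = [];
  for (const [provider, url] of Object.entries(endpoints)) {
    const started = Date.now();
    let status: number | null = null;
    try {
      status = (await probe(url)).status;
    } catch {
      // Transport failure: record the attempt with no status code.
    }
    rows.push({
      provider,
      ok: status !== null && status < 500,
      status,
      latencyMs: Date.now() - started,
      checkedAt: new Date().toISOString(),
    });
  }
  return rows;
}
```

Injecting the prober keeps the cycle testable without real network traffic; in production it would wrap fetch against the provider endpoint.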
Status Classification
- Operational: Endpoint is reachable, including expected auth responses such as 401/403.
- Degraded: Endpoint responds, but with request throttling (for example HTTP 429) or similar friction indicators.
- Down: Transport failure, timeout, or 5xx server-side failure patterns.
- Maintenance: No recent check history available yet for that provider.
Some provider endpoints require authentication by design. For this public monitor, authentication responses are treated as a reachability signal rather than full workload health validation.
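The classification rules above can be sketched as a small pure function. The thresholds and field names here are assumptions chosen to match the published rules, not the monitor's real code.

```typescript
type Status = "operational" | "degraded" | "down";

// Map a single probe outcome to a status, per the rules above.
function classify(result: { timedOut: boolean; httpStatus?: number }): Status {
  if (result.timedOut || result.httpStatus === undefined) {
    return "down"; // transport failure or timeout
  }
  const s = result.httpStatus;
  if (s >= 500) return "down";      // server-side failure pattern
  if (s === 429) return "degraded"; // throttling pressure
  return "operational";             // includes expected 401/403 auth responses
}
```

Note that 401/403 deliberately land in operational: for a public reachability monitor, an expected auth rejection still proves the edge is up.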
Uptime Calculation
24-hour uptime is computed, per provider, as the ratio of operational samples to total samples over the last 24 hours.
This is a coarse public signal. It does not represent every region, model route, account tier, or quota state.
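As a sketch, the ratio reduces to a few lines; the "operational" string stands in for the stored classification, and the rounding is an illustrative choice.

```typescript
// 24h uptime as a percentage of operational samples; null when there is
// no check history yet (rendered as Maintenance in the status table).
function uptimePercent(samples: string[]): number | null {
  if (samples.length === 0) return null;
  const ok = samples.filter((s) => s === "operational").length;
  return Math.round((ok / samples.length) * 10000) / 100; // two decimals
}
```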
Known Limitations
- A single synthetic check stream cannot model all enterprise traffic patterns.
- Provider edge reachability is not equal to full model inference success.
- Short-lived incidents may be missed between check intervals.
How We Build Composite Status
Public status is not taken from a single raw probe. We combine provider-level check outcomes, endpoint-level telemetry, and recent error patterns to produce a state that users can act on quickly. The intent is to reduce both false calm ("everything is fine") and false panic ("global outage") during noisy periods.
- Operational: no broad evidence of elevated failures or sustained latency anomalies.
- Degraded: requests generally succeed, but tail latency or throttling pressure is elevated.
- Partial outage: one or more components/regions are unstable while others remain healthy.
- Major outage: broad failures or transport-level disruption across core paths.
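One possible aggregation of component states into the four composite levels is sketched below. The "half of components down" threshold for major outage is an illustrative assumption, not published tuning.

```typescript
type ComponentState = "operational" | "degraded" | "down";
type CompositeStatus = "operational" | "degraded" | "partial_outage" | "major_outage";

// Roll per-component states up into one actionable composite status.
function compositeStatus(components: ComponentState[]): CompositeStatus {
  const down = components.filter((c) => c === "down").length;
  const degraded = components.filter((c) => c === "degraded").length;
  if (down > 0 && down >= components.length / 2) return "major_outage"; // broad failure
  if (down > 0) return "partial_outage"; // some components unstable, others healthy
  if (degraded > 0) return "degraded";   // elevated friction, requests still pass
  return "operational";
}
```

Requiring breadth before declaring a major outage is what damps "false panic" from a single flapping component, while any degraded component is still surfaced rather than hidden behind "false calm".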
Latency Treatment and Why p95 Matters
We publish p50 and p95 to represent both typical and tail user experience. p50 can remain stable while p95 climbs sharply under congestion, which is often when users perceive an outage even if some requests still pass.
For incident interpretation, p95 movement is weighted more heavily than p50 drift. This mirrors production behavior in interactive applications where tail latency drives timeout rates, retry storms, and user drop-off.
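To make the p50/p95 definitions concrete, here is a minimal nearest-rank percentile; the monitor's actual estimator may differ.

```typescript
// Nearest-rank percentile: p in (0, 100], values need not be pre-sorted.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}
```

With 100 latency samples, p95 is simply the 95th smallest value, which is why a handful of slow outliers can move p95 sharply while p50 barely shifts.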
Incident Lifecycle Rules
Incident entries are derived from transitions in check state over time. An incident starts when monitored status leaves operational mode and ends when stability is restored for consecutive windows. Active incidents are shown before resolved incidents to prioritize operational response.
- Start condition: sustained non-operational transition in current check stream.
- Resolution condition: return to stable operational windows.
- Impact level: estimated from failure breadth and observed error severity.
- Post-incident record: retained for trend and recurrence analysis.
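The start and resolution rules above amount to a small state machine over the check stream: open an incident after enough consecutive non-operational checks, resolve it after enough consecutive operational ones. The thresholds here (2 to open, 3 to close) are placeholder values, not the monitor's real configuration.

```typescript
interface Incident { start: number; end: number | null } // indices into the stream

// Derive incidents from per-cycle outcomes (oldest first, true = operational).
function trackIncidents(
  okStream: boolean[],
  openAfter = 2,  // consecutive failures required to open an incident
  closeAfter = 3, // consecutive successes required to resolve it
): Incident[] {
  const incidents: Incident[] = [];
  let failRun = 0;
  let okRun = 0;
  let active: Incident | null = null;
  okStream.forEach((ok, i) => {
    if (ok) { okRun++; failRun = 0; } else { failRun++; okRun = 0; }
    if (!active && failRun >= openAfter) {
      active = { start: i - openAfter + 1, end: null };
      incidents.push(active);
    }
    if (active && okRun >= closeAfter) {
      active.end = i; // end === null would mean the incident is still active
      active = null;
    }
  });
  return incidents;
}
```

Requiring consecutive windows on both edges is what prevents a single flapped check from opening or prematurely closing an incident.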
Data Quality Safeguards
We use defensive checks to avoid showing broken or misleading output when live data is temporarily unavailable. Pages fall back to cached snapshots with clear labeling so users know they are not viewing the latest stream.
We also cap malformed or missing values, validate payload shapes at runtime, and keep client-side rendering lightweight to reduce visual instability and loading regressions.
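A sketch of the defensive parsing described above: validate the live payload's shape at runtime, cap out-of-range values, and fall back to a cached snapshot that is explicitly labeled stale. The field names are illustrative assumptions.

```typescript
interface Snapshot { provider: string; uptime: number; stale: boolean }

// Accept a live payload only if its shape checks out; otherwise serve the
// cached snapshot, flagged so the UI can label it as not the latest stream.
function parseSnapshot(payload: unknown, cached: Snapshot): Snapshot {
  if (
    typeof payload === "object" && payload !== null &&
    typeof (payload as { provider?: unknown }).provider === "string" &&
    typeof (payload as { uptime?: unknown }).uptime === "number" &&
    Number.isFinite((payload as { uptime: number }).uptime)
  ) {
    const p = payload as { provider: string; uptime: number };
    // Cap malformed ranges rather than rendering impossible values.
    return { provider: p.provider, uptime: Math.min(100, Math.max(0, p.uptime)), stale: false };
  }
  return { ...cached, stale: true };
}
```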