The proof page
We publish every number.
Most prediction APIs advertise an accuracy figure with nothing behind it. We do the opposite: here is exactly how reliable our probabilities are, tier by tier, across 67,667 out-of-sample backtested predictions — and the same numbers are served live from the API, so you can check every one yourself.
Out-of-sample backtest, simulated results — not live trading, and a measure of calibration, not profit.
Chart: predicted vs. observed across six confidence bands. Every tier lands within about a point of its claim (out-of-sample backtest).
A 60% pick should win about 60% of the time.
Calibration is the only property that makes a probability honest. It is not about being right more often — it is about a number meaning what it says. When we publish confidence: 0.60, that group of picks should land roughly six times in ten. Group every prediction by the probability we assigned, then check what actually happened. The two should line up — and they do.
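The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: the `reliability_table` name, the band edges, and the sample picks are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical band edges; the page does not list its exact six bands.
EDGES = (0.5, 0.6, 0.7, 0.8, 1.0)

def reliability_table(picks, edges=EDGES):
    """Group (confidence, hit) pairs into bands; compare mean confidence with hit rate."""
    bands = defaultdict(list)
    for confidence, hit in picks:
        for lo, hi in zip(edges, edges[1:]):
            # Bands are half-open [lo, hi); the top band also includes its upper edge.
            # Picks below the first edge fall in no band and are skipped.
            if lo <= confidence < hi or (hi == edges[-1] and confidence == hi):
                bands[(lo, hi)].append((confidence, hit))
                break
    rows = []
    for (lo, hi), items in sorted(bands.items()):
        n = len(items)
        rows.append({
            "bucket": f"{round(lo * 100)}-{round(hi * 100)}%",
            "predicted": round(sum(c for c, _ in items) / n, 3),  # mean stated confidence
            "actual": round(sum(1 for _, h in items if h) / n, 3),  # realised hit rate
            "n": n,
        })
    return rows

# Made-up sample: ten picks at 0.60 (six hits) and four picks at 0.75 (three hits).
sample = [(0.60, i < 6) for i in range(10)] + [(0.75, i < 3) for i in range(4)]
rows = reliability_table(sample)
```

In a calibrated set, `predicted` and `actual` sit close together in every row. The row shape mirrors the `reliability` entries returned by `/v1/accuracy`, so the same comparison can be run on your own graded outcomes.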
Every sport, the misses included.
The calibration above is measured across every confidence level, not a flattering slice — the weak tiers sit in the same chart as the strong ones. Here is the out-of-sample hit rate per sport, each over thousands of graded games, so you can see exactly where the model is sharp and where it is closer to a coin-flip.
Out-of-sample walk-forward backtest — each game predicted using only data available before kick-off. Football is a three-way 1X2 market, so its hit-rate ceiling is structurally lower than a two-way sport (the draw is rarely the single most likely result); we judge it on calibration and Brier and show the always-home baseline for context. Simulated results, not live trading, and not a betting or profit claim.
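For readers unfamiliar with Brier scoring on a three-way market, here is a small sketch. It uses the standard multi-class Brier score (the sum of squared errors over the three outcomes, averaged over games); the page does not say which variant the service computes, so treat that choice, the function names, and the two made-up games as assumptions.

```python
OUTCOMES = ("1", "X", "2")  # home win, draw, away win

def brier_1x2(forecasts, results):
    """Mean multi-class Brier score: 0.0 is perfect, 2.0 is the worst possible."""
    total = 0.0
    for probs, result in zip(forecasts, results):
        total += sum((probs[k] - (1.0 if k == result else 0.0)) ** 2 for k in OUTCOMES)
    return total / len(forecasts)

def always_home_hit_rate(results):
    """Hit rate of the naive baseline that picks the home side every time."""
    return sum(1 for r in results if r == "1") / len(results)

# Two made-up games, for illustration only.
forecasts = [
    {"1": 0.5, "X": 0.3, "2": 0.2},
    {"1": 0.4, "X": 0.3, "2": 0.3},
]
results = ["1", "X"]

score = brier_1x2(forecasts, results)
baseline = always_home_hit_rate(results)
```

Unlike hit rate, the Brier score rewards a forecast for putting sensible probability on the draw even when the draw is never the top pick, which is why it is a fairer yardstick for a 1X2 market.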
# The reliability table on this page — served as a first-class endpoint
GET /v1/accuracy?sport=all -H "X-API-Key: mp_live_…"
{
  "graded": 67667, // out-of-sample backtested predictions
  "reliability": [
    { "bucket": "60-70%", "predicted": 0.643, "actual": 0.651, "n": 13061 },
    { "bucket": "70-80%", "predicted": 0.743, "actual": 0.753, "n": 5382 },
    { "bucket": "80-100%", "predicted": 0.850, "actual": 0.863, "n": 2212 }
  ],
  "byLeague": { /* per-league graded + correct */ },
  "recent": [ /* every graded result, hits and misses alike */ ]
}
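One way to read that response is to compute the gap between claimed and realised probability for each bucket. The sketch below hard-codes the three rows shown above rather than calling the API, since the host is not given here; the helper names are hypothetical.

```python
# The three reliability rows copied from the sample response above.
reliability = [
    {"bucket": "60-70%", "predicted": 0.643, "actual": 0.651, "n": 13061},
    {"bucket": "70-80%", "predicted": 0.743, "actual": 0.753, "n": 5382},
    {"bucket": "80-100%", "predicted": 0.850, "actual": 0.863, "n": 2212},
]

def calibration_gaps(rows):
    """Per-bucket gap (actual minus predicted) in percentage points."""
    return [(r["bucket"], round((r["actual"] - r["predicted"]) * 100, 1)) for r in rows]

def weighted_abs_gap(rows):
    """Sample-weighted mean absolute gap in percentage points across the given buckets."""
    total_n = sum(r["n"] for r in rows)
    return sum(r["n"] * abs(r["actual"] - r["predicted"]) for r in rows) * 100 / total_n

gaps = calibration_gaps(reliability)
overall = weighted_abs_gap(reliability)
```

A positive gap means the tier won slightly more often than claimed. All three rows shown are positive and small, so these tiers are, if anything, mildly underconfident.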
The calls you'd act on are the ones that hold up.
Calibration is not just a flattering average — it holds at the top, where it matters most. When the football model is 70% confident or higher, it is wrong only 1.7% of the time in backtests; in baseball, 0.1%. The high-confidence calls are not where the model gets surprised — they are where it is most dependable.
Calibrated — the number means what it says
Across every tier, predicted probability and realised hit rate line up to within about one percentage point. A 75% from the API behaves like a 75%.
Sharp — confident when it should be
A model that always says 50% can be perfectly calibrated and still useless. Ours commits: it pushes into the 70s and 80s when the evidence is there, and those calls hold (the 80%+ tier realised 86.3%).
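The difference between calibrated and sharp shows up in a toy comparison. The sketch below uses the binary Brier score (mean squared error between forecast and 0/1 outcome) on two made-up sets of ten games; both forecasters are perfectly calibrated, but only one commits. The data and function name are illustrative assumptions, not service output.

```python
def brier(forecasts, outcomes):
    """Binary Brier score: mean squared gap between forecast and 0/1 outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Forecaster A always says 50%, and half of its picks win: calibrated, but uninformative.
flat = brier([0.5] * 10, [1, 0] * 5)

# Forecaster B says 80%, and eight of its ten picks win: calibrated and sharp.
sharp = brier([0.8] * 10, [1] * 8 + [0] * 2)

sharper_is_better = sharp < flat
```

Calibration alone cannot separate the two forecasters; a proper scoring rule such as Brier does, because it also rewards confidence that turns out to be justified.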
Auditable — you never take our word for it
Every figure here is recomputed from graded results and served at /v1/accuracy and /v1/backtest. Hold our outputs against real outcomes and check the calibration yourself.
We publish our numbers. The category publishes marketing.
The prediction space is full of listings claiming “85%+ accuracy” with nothing to back it up. We took the opposite bet. Every number on this page is recomputed from graded results, lives at a public endpoint, and includes the tiers where the model is barely better than a coin flip. Check our calibration against your own outcomes, and decide for yourself.
Calibrated probabilities, not predictions of profit. Past and backtested performance is simulated and not indicative of future results.