Methodology

How we measure our own model.

Most prediction APIs ask you to trust an accuracy figure. We'd rather show our working. This is exactly how MatchPrior grades itself — the same way statisticians grade weather forecasts and election models — and why every number is checked out-of-sample, across 67,667 graded predictions.


[Figure: Reliability diagram (n = 67,667). Predicted vs. observed, six confidence bands · out-of-sample backtest]

Step 1 — the standard

Calibration: a number meaning what it says.

Calibration is the property that makes a probability honest. Take every prediction where we said roughly 70%, and check how often that outcome actually happened. If the model is calibrated, it happens about 70% of the time — not 80%, not 60%. It is the same standard used to grade a weather forecaster: of the days they call a 70% chance of rain, it should rain on roughly seven in ten.
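As a minimal sketch of that check, the snippet below takes every prediction in one band and reports how often the outcome happened. The function name, the band edges and the ten data points are all made up for illustration; they are not MatchPrior's code or data.

```python
def observed_frequency(probs, outcomes, low, high):
    """Share of outcomes that happened among predictions with low <= p < high."""
    picked = [o for p, o in zip(probs, outcomes) if low <= p < high]
    if not picked:
        return None  # no predictions fell in this band
    return sum(picked) / len(picked)


# Illustrative data only: ten calls at roughly 70%, seven of which came true.
probs = [0.70, 0.72, 0.68, 0.71, 0.69, 0.70, 0.73, 0.67, 0.70, 0.71]
outcomes = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

print(observed_frequency(probs, outcomes, 0.65, 0.75))  # 0.7
```

A calibrated forecaster's output here should sit close to the band it was drawn from; a result near 0.6 or 0.8 for calls made at roughly 70% would point to over- or under-confidence.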

Calibrated is not the same as accurate. A forecaster who always quotes the long-run base rate (50% on every evenly matched contest, say) can be perfectly calibrated and tell you nothing about any individual match. Calibration earns trust in the number; sharpness, meaning being confident and still correct, is a separate axis. We report both.

Why it's the right test. A calibrated probability is one you can reason with: combine it, threshold it, feed it into your own model — and the maths stays honest.

Step 2 — the metrics

Brier score and calibration error.

Brier score is the mean squared difference between the predicted probability and what happened, with the outcome coded as 1 or 0 (0 = perfect, 1 = worst). It is a proper scoring rule: it can't be gamed by hedging, because the expected score is minimised only by reporting your true probability.
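The definition and the "can't be gamed" property can both be sketched in a few lines. This assumes binary outcomes coded as 1 (happened) or 0 (did not); `brier_score` and `expected_brier` are illustrative names, not part of any MatchPrior API.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_brier(reported, true_p):
    """Expected Brier score for one event that happens with probability true_p
    when the forecaster reports `reported` instead."""
    return true_p * (reported - 1) ** 2 + (1 - true_p) * reported ** 2


# If the true chance is 70%, reporting 70% beats both hedging and exaggerating.
for reported in (0.5, 0.7, 0.9):
    print(reported, round(expected_brier(reported, 0.7), 4))
```

With a true probability of 0.7, the expected score is lowest when 0.7 is reported; shading the number towards 0.5 or pushing it towards 0.9 both make the expected score worse.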

Calibration error is the average gap between predicted probability and observed frequency across confidence bands. On our backtest it is within about one percentage point per tier.
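One way to compute such a figure is sketched below. Two things are assumptions rather than MatchPrior's documented method: the band edges are hypothetical, and each non-empty band is given equal weight (a common alternative weights bands by how many predictions they contain).

```python
from bisect import bisect_right


def calibration_error(probs, outcomes, edges):
    """Mean |average predicted probability - observed frequency| over non-empty bands.

    `edges` are ascending band boundaries; the last band includes its upper edge.
    """
    n_bands = len(edges) - 1
    sum_p = [0.0] * n_bands
    sum_o = [0] * n_bands
    count = [0] * n_bands
    for p, o in zip(probs, outcomes):
        if not edges[0] <= p <= edges[-1]:
            continue  # prediction lies outside every band
        i = min(bisect_right(edges, p) - 1, n_bands - 1)
        sum_p[i] += p
        sum_o[i] += o
        count[i] += 1
    gaps = [abs(sum_p[i] / count[i] - sum_o[i] / count[i])
            for i in range(n_bands) if count[i]]
    if not gaps:
        raise ValueError("no predictions fell inside the bands")
    return sum(gaps) / len(gaps)


# Illustrative data and hypothetical band edges.
edges = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
probs = [0.55, 0.55, 0.75, 0.75, 0.75, 0.75]
outcomes = [1, 0, 1, 1, 1, 0]
print(round(calibration_error(probs, outcomes, edges), 3))  # 0.025
```

In this toy data the 50–60% band is off by five points and the 70–80% band is exact, so the average gap is 2.5 percentage points.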

Hit rate is simply correct ÷ graded. We show it per sport, but we judge the model on Brier and calibration: hit rate alone counts only which side was picked and ignores how much confidence was attached to each call.
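A small illustration of why hit rate is not enough on its own: two hypothetical forecasters pick the same side in four made-up matches, so their hit rates are identical, yet the Brier score separates them. Here each probability is the one assigned to the picked side, and each outcome is 1 if that side won.

```python
def hit_rate(outcomes):
    """Correct / graded, where each outcome is 1 if the picked side won."""
    return sum(outcomes) / len(outcomes)


def brier(probs, outcomes):
    """Mean squared difference between probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


outcomes = [1, 1, 1, 0]                  # same picks, three of four correct
measured = [0.75, 0.75, 0.75, 0.75]      # says 75% each time
overconfident = [0.99, 0.99, 0.99, 0.99] # says 99% each time

print(hit_rate(outcomes))                # 0.75 for both forecasters
print(brier(measured, outcomes))         # 0.1875
print(brier(overconfident, outcomes))    # higher, i.e. worse
```

Both forecasters are right three times in four, but only the first one's stated 75% matches what happened, and its Brier score is lower as a result.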

Said 70–80%: 75.3% won (backtested)

Said 80%+: 86.3% won (backtested)

Tennis: 63.0% hit · n = 20,756

Basketball: 66.6% hit · n = 3,960

Step 3 — the validation

Out-of-sample, walk-forward — never marking our own homework.

Predict using only the past

For every historical match we generate the prediction using only data available before kick-off — ratings, form and calibration fit on earlier games. The model never sees the result it is about to be graded on.
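The ordering that matters here (predict first, reveal the result afterwards) can be sketched as a walk-forward loop. The rating model below is a plain Elo update used purely as a stand-in; the source does not say which rating system MatchPrior uses, and the tuple layout, `k` and `base` values are assumptions.

```python
def walk_forward(matches, k=20.0, base=1500.0):
    """Return (probability, outcome) pairs, each predicted from earlier matches only.

    `matches` is an iterable of (timestamp, side_a, side_b, a_won) tuples.
    """
    ratings = {}
    graded = []
    for _, a, b, a_won in sorted(matches, key=lambda m: m[0]):
        ra = ratings.get(a, base)
        rb = ratings.get(b, base)
        # 1. Predict using only ratings built from earlier matches.
        p_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        outcome = 1 if a_won else 0
        graded.append((p_a, outcome))
        # 2. Only now let the result update the ratings.
        delta = k * (outcome - p_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return graded


# Illustrative fixtures: the same pairing played twice.
matches = [(2, "A", "B", True), (1, "A", "B", True)]
for p, o in walk_forward(matches):
    print(round(p, 3), o)
```

Sorting by timestamp before the loop means the order in which matches are supplied cannot leak future results into an earlier prediction.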

Grade against what actually happened

We then compare each prediction to the real outcome and accumulate Brier, calibration error, hit rate and the reliability diagram across 67,667 graded calls spanning multiple sports and seasons.

Serve it from the API, unedited

The same figures are returned live by /v1/accuracy and /v1/backtest — including the tiers where the model is barely better than a coin-flip. Nothing is hand-picked.
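A minimal sketch of pulling those two endpoints follows. Only the paths /v1/accuracy and /v1/backtest come from this page; the host below is a placeholder, and the response schema, authentication requirements and query parameters are not documented here, so the code simply returns whatever JSON comes back.

```python
import json
from urllib.request import urlopen

# Placeholder host: replace with the real API base URL, which this page does not state.
BASE_URL = "https://api.example.com"


def endpoint_url(path, base_url=BASE_URL):
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch_json(path, base_url=BASE_URL, opener=urlopen):
    """GET an endpoint and parse the body as JSON (assumes no auth is needed)."""
    with opener(endpoint_url(path, base_url)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for path in ("/v1/accuracy", "/v1/backtest"):
        print(endpoint_url(path))
        # data = fetch_json(path)  # uncomment once BASE_URL points at the real host
```

The `opener` parameter exists only so the fetch can be exercised without a network connection; in normal use the default `urlopen` is enough.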

Read this plainly. These figures are out-of-sample backtests — the model applied to past games it did not train on — and are simulated, not live trading. They describe calibration: how often an outcome happens when we assign it a probability. They are not a betting signal, an income, or a promise about future results. Past and simulated performance is not a reliable indicator of future results. For information and entertainment only — not betting, investment or financial advice.

Verify it yourself

Don't take our word for it.

Every number on this page is recomputed from graded results and served at a public endpoint. Pull our predictions, hold them against real outcomes, and check the calibration — or test your own forecasts with our free reliability tool.

