Glossary

The calibration glossary.

Plain-English definitions of the forecasting terms behind a calibrated probability API. If you've ever wondered what a reliability diagram or a Brier score actually is, start here — then see them in action on the calibration page.

Calibration

A forecast is calibrated when events it predicts at probability p actually occur about p of the time over many predictions. In plain terms: when a calibrated source says 70%, it happens about 70% of the time — not 80%, not 60%. It is the property that makes a probability honest, and the standard used to grade weather forecasts and election models. See our methodology.

Reliability diagram

A plot of predicted probability on the x-axis against observed frequency on the y-axis. Group your forecasts into bands, and for each band plot what you said versus what happened. A perfectly calibrated model's points lie on the 45-degree diagonal. Try it with your own data in the free reliability tool.
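
The banding step described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation: the function name `reliability_points`, the choice of ten equal-width bands, and the sample data are all assumptions made for the example.

```python
def reliability_points(probs, outcomes, n_bands=10):
    """Return (mean predicted, observed frequency, count) for each non-empty band.

    probs    -- forecast probabilities in [0, 1]
    outcomes -- matching 0/1 results
    n_bands  -- number of equal-width bands (an assumption; other schemes exist)
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")

    sum_pred = [0.0] * n_bands   # total of forecast probabilities per band
    sum_obs = [0] * n_bands      # number of events that happened per band
    counts = [0] * n_bands       # number of forecasts per band

    for p, y in zip(probs, outcomes):
        band = min(int(p * n_bands), n_bands - 1)  # p == 1.0 goes in the top band
        sum_pred[band] += p
        sum_obs[band] += y
        counts[band] += 1

    return [
        (sum_pred[b] / counts[b], sum_obs[b] / counts[b], counts[b])
        for b in range(n_bands)
        if counts[b] > 0
    ]


# Hypothetical data: four forecasts in the 70s band, two in the 10s band.
points = reliability_points(
    [0.72, 0.75, 0.78, 0.71, 0.12, 0.18],
    [1, 1, 1, 0, 0, 0],
)
for predicted, observed, n in points:
    print(f"said {predicted:.2f}, happened {observed:.2f} (n={n})")
```

Each returned pair is one point on the diagram: the first value is the x-coordinate, the second the y-coordinate. Bands with few forecasts give noisy points, which is why the count is returned alongside them.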

Brier score

The mean squared difference between a forecast probability and the 0/1 outcome. Lower is better: for a binary event, 0 means every forecast was certain and correct, 1 means every forecast was certain and wrong, and always answering 50% scores 0.25. Because it is a proper scoring rule, it can't be gamed by hedging: your expected score is minimised only by reporting your true probability.
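
For a binary event the calculation is short enough to write out. The sketch below assumes forecasts are single probabilities for the event and outcomes are 0 or 1; the function name `brier_score` is chosen for the example.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Two 70% forecasts, one of which came true:
# ((0.7 - 1)^2 + (0.7 - 0)^2) / 2 = (0.09 + 0.49) / 2
print(brier_score([0.7, 0.7], [1, 0]))
```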

Log loss

Another proper scoring rule: the average negative log-likelihood of the outcomes under your forecast. It punishes confident wrong predictions far more harshly than Brier, which makes it a stringent test of whether your high-confidence calls are trustworthy.
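
A minimal sketch for binary events follows. The clipping constant `eps` is an assumption added for the example: it is a common practical guard against taking the logarithm of zero, not part of the definition.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Average negative log-likelihood of 0/1 outcomes under the forecasts."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)           # keep log() finite (assumed guard)
        likelihood = p if y == 1 else 1 - p     # probability given to what happened
        total += -math.log(likelihood)
    return total / len(probs)


# One confident miss versus one mild miss.
print(log_loss([0.99], [0]))  # forecast 99%, event did not happen
print(log_loss([0.60], [0]))  # forecast 60%, event did not happen
```

The 99% miss costs about 4.6 here, while its squared error is about 0.98, close to the Brier ceiling of 1. Log loss has no ceiling, so a single overconfident miss can dominate the average.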

Calibration error

The average gap between predicted probability and observed frequency across confidence bands — in other words, how far the points on a reliability diagram sit from the diagonal. Lower is better. A small calibration error means the numbers mean what they say.
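
One common way to turn that description into a number is to weight each band's gap by how many forecasts fall in it. The sketch below makes that assumption, along with ten equal-width bands; the definition above does not fix either choice, and the function name is hypothetical.

```python
def calibration_error(probs, outcomes, n_bands=10):
    """Count-weighted average gap between predicted and observed frequency."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")

    sum_pred = [0.0] * n_bands
    sum_obs = [0] * n_bands
    counts = [0] * n_bands
    for p, y in zip(probs, outcomes):
        band = min(int(p * n_bands), n_bands - 1)  # p == 1.0 goes in the top band
        sum_pred[band] += p
        sum_obs[band] += y
        counts[band] += 1

    total = len(probs)
    error = 0.0
    for b in range(n_bands):
        if counts[b] == 0:
            continue  # empty bands contribute nothing
        gap = abs(sum_pred[b] / counts[b] - sum_obs[b] / counts[b])
        error += (counts[b] / total) * gap  # weight by share of forecasts (assumed)
    return error


print(calibration_error([0.72, 0.75, 0.78, 0.71, 0.12, 0.18], [1, 1, 1, 0, 0, 0]))
```

The result depends on the number of bands and on sample size, so figures are only comparable when computed the same way.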

Out-of-sample

Evaluating a model on data it was not fitted on. In-sample results flatter a model because it has effectively seen the answers; out-of-sample results reflect genuine forecasting skill. Every figure MatchPrior publishes is out-of-sample.

Walk-forward validation

A way to test forecasts honestly: train on the past, predict the next period, then roll forward and repeat. It mimics how a model is actually used in real time and avoids look-ahead leakage — using information that wouldn't have been available before the event.
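
The train, predict, roll-forward loop can be sketched as a generator of index splits. This version assumes an expanding training window (every period before the test period is used) and time-ordered data; the names `walk_forward_splits`, `min_train`, and `horizon` are hypothetical.

```python
def walk_forward_splits(n_periods, min_train=3, horizon=1):
    """Yield (train_indices, test_indices) pairs in time order.

    Training always ends strictly before testing begins, so the model
    never sees information from the period it is asked to predict.
    """
    start = min_train
    while start + horizon <= n_periods:
        train = list(range(0, start))               # everything in the past
        test = list(range(start, start + horizon))  # the next period(s)
        yield train, test
        start += horizon                            # roll forward


for train, test in walk_forward_splits(n_periods=6, min_train=3):
    print("fit on", train, "then predict", test)
```

A fixed-length rolling window is an equally valid variant: replace `range(0, start)` with a range covering only the most recent periods.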

Sharpness

How decisive a set of forecasts is. A model that always forecasts the long-run base rate (say, 50% for a fair coin) is perfectly calibrated but useless, because it never distinguishes one event from another. The goal, in the forecasting literature, is to maximise sharpness subject to calibration: be confident and still correct.

Proper scoring rule

A scoring function whose expected value is optimised (minimised or maximised, depending on the sign convention) only when you report your true probabilities, so it cannot be gamed by shading or hedging your forecasts. Brier score and log loss are the two best-known examples.
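
The "cannot be gamed" property can be checked numerically. The sketch below assumes an event whose true probability is 0.7 and searches a grid of possible reports for the one with the best expected score. Absolute error is included only as a contrast: it is a well-known example of a rule that is not proper. All names are hypothetical.

```python
def expected_brier(report, true_p):
    """Expected squared error when the event occurs with probability true_p."""
    return true_p * (report - 1) ** 2 + (1 - true_p) * report ** 2


def expected_abs_error(report, true_p):
    """Expected absolute error: an improper rule, shown for contrast."""
    return true_p * abs(report - 1) + (1 - true_p) * abs(report)


true_p = 0.7                            # assumed true probability
grid = [i / 100 for i in range(101)]    # candidate reports 0.00, 0.01, ..., 1.00

best_brier = min(grid, key=lambda q: expected_brier(q, true_p))
best_abs = min(grid, key=lambda q: expected_abs_error(q, true_p))

print("best report under Brier:", best_brier)
print("best report under absolute error:", best_abs)
```

Under Brier the best report is the true probability itself. Under absolute error the best report is the extreme value, which is exactly the kind of distortion a proper scoring rule rules out.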

Hit rate

The share of predictions whose single most likely outcome turned out correct (correct ÷ graded). It's intuitive, but it ignores how confident you were: a 51% call and a 99% call count the same when they land, so a timid model and a decisive one can post identical hit rates. That is why we judge the model on calibration and Brier, and show hit rate only as context.
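
A short sketch makes the blind spot concrete. It assumes each forecast is a mapping from outcome label to probability; the labels and the two example models are invented for illustration.

```python
def hit_rate(forecasts, outcomes):
    """Share of forecasts whose most likely outcome matched the result.

    forecasts -- list of dicts mapping outcome label -> probability
    outcomes  -- list of the labels that actually occurred
    Ties go to the first label listed (a simplification for this sketch).
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        1 for f, actual in zip(forecasts, outcomes)
        if max(f, key=f.get) == actual
    )
    return hits / len(forecasts)


results = ["home", "away"]

# Two hypothetical models that pick the same favourites with very
# different confidence.
cautious = [{"home": 0.51, "away": 0.49}, {"home": 0.52, "away": 0.48}]
decisive = [{"home": 0.95, "away": 0.05}, {"home": 0.90, "away": 0.10}]

print(hit_rate(cautious, results), hit_rate(decisive, results))
```

Both models get the same hit rate because they picked the same favourites, even though the decisive model was far more wrong about the second event. A Brier score or log loss would separate them.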

See the theory in practice. Read how we apply all of this in the methodology, or check the live record on the calibration page.