Evaluating Probabilistic Weather Forecasters

Statistics · Medium · Free problem

You have access to forecasts from several weather experts. Each day $t$, each expert provides a probability $p_t \in [0, 1]$ that it will rain. At the end of the day, you observe whether it actually rained ($o_t = 1$) or not ($o_t = 0$).

After collecting data over $T$ days, how do you evaluate these forecasts to determine who is the best forecaster? Specifically:

  1. What scoring metric would you use, and why?
  2. How do you assess whether a forecaster is well-calibrated?
  3. What are the limitations of your chosen metric?

Hints

  1. Think about what makes a probability forecast "good" -- it is not just about being right, but about the stated probabilities matching reality.
  2. Look into proper scoring rules -- metrics that incentivize honest probability reporting. The Brier score and log loss are the two main candidates.
  3. Consider decomposing your chosen metric into calibration (reliability) and informativeness (resolution) components. Also think about how to visualize calibration.

Worked Solution

How to Think About It: This is a classic problem in forecast evaluation, and it comes up all the time in trading (think about evaluating signal quality or model predictions). The key tension is: a good forecaster is not one who is always right -- that is impossible with probabilistic events. A good forecaster is one whose stated probabilities match observed frequencies. If someone says "30% chance of rain" on 100 days, it should rain on roughly 30 of those days. This is calibration. But calibration alone is not enough -- a forecaster who always predicts the base rate is perfectly calibrated but useless. You also need resolution: the forecasts should vary and convey information.

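Key Insight: You need a proper scoring rule -- a metric whose expected value is minimized when the forecaster reports their true belief. A strictly proper rule has the true belief as its unique minimizer. The Brier score and log loss are the two standard choices, and both are strictly proper.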
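To see why the Brier score is proper, here is a short derivation for a single day. The symbol $q$ is an assumption introduced for illustration: it is the forecaster's true belief that it will rain, while $p$ is the probability they report.

$\mathbb{E}[(p - o)^2] = q(1 - p)^2 + (1 - q)p^2$

$\frac{d}{dp}\,\mathbb{E}[(p - o)^2] = -2q(1 - p) + 2(1 - q)p = 2(p - q)$

The derivative is zero only at $p = q$, and the second derivative is $2 > 0$, so the expected penalty has a unique minimum at the honest report. Shading the forecast in either direction can only raise the expected score.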

The Method:

1. Brier Score (primary metric):

$BS = \frac{1}{T} \sum_{t=1}^{T} (p_t - o_t)^2$

where $p_t$ is the forecast probability and $o_t \in \{0, 1\}$ is the outcome. Lower is better. A perfect forecaster scores $0$; a naive forecaster who always says $p = 0.5$ scores $0.25$; a forecaster who always predicts the climatological base rate $\bar{o}$ scores $\bar{o}(1 - \bar{o})$.
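The three reference values above are easy to check numerically. The following is a minimal sketch in plain Python; the function name `brier_score` and the 10-day rain record are made up for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical 10-day record: rain on 3 days, so the base rate is 0.3.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

perfect = [float(o) for o in outcomes]        # always certain and always right
coin_flip = [0.5] * len(outcomes)             # always says 50%
base_rate = sum(outcomes) / len(outcomes)
climatology = [base_rate] * len(outcomes)     # always says the base rate

print(brier_score(perfect, outcomes))      # 0.0
print(brier_score(coin_flip, outcomes))    # 0.25
print(brier_score(climatology, outcomes))  # 0.3 * 0.7 = 0.21, up to float rounding
```

Note that the climatology forecaster beats the coin flip here only because the base rate is far from 0.5; when rain occurs on half the days, the two coincide.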

The Brier score decomposes into three components:

$BS = \underbrace{\text{Reliability}}_{\text{(calibration error)}} - \underbrace{\text{Resolution}}_{\text{(signal value)}} + \underbrace{\text{Uncertainty}}_{\text{(base rate variance)}}$

  • Reliability measures how far the forecast probabilities are from observed frequencies in each bin. Lower is better (0 = perfectly calibrated).
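  • Resolution measures how far the observed rain frequency in each forecast bin is from the overall base rate. Higher is better: the forecaster is separating rainy days from dry days.
  • Uncertainty equals $\bar{o}(1 - \bar{o})$. It is a property of the data, not the forecaster.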
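The decomposition can be sketched in a few lines. This version assumes the standard Murphy form of the three terms and groups days by distinct forecast value, which makes the identity exact; with wider bins that pool different forecast values it holds only approximately. The function name and the sample data are illustrative.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty) for 0/1 outcomes.

    Days are grouped by distinct forecast value (an assumption of this sketch),
    so that reliability - resolution + uncertainty equals the Brier score.
    """
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)

    total = len(probs)
    base_rate = sum(outcomes) / total

    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        n_k = len(obs)
        freq_k = sum(obs) / n_k                         # observed rain frequency in this group
        reliability += n_k * (p - freq_k) ** 2          # forecast vs. observed frequency
        resolution += n_k * (freq_k - base_rate) ** 2   # observed frequency vs. base rate

    uncertainty = base_rate * (1 - base_rate)
    return reliability / total, resolution / total, uncertainty


# Hypothetical forecaster who only ever says 20% or 80%.
probs = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

rel, res, unc = murphy_decomposition(probs, outcomes)
direct = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
print(rel, res, unc)           # reliability is 0 here: 20% days rained 1/5, 80% days 4/5
print(rel - res + unc, direct) # both approximately 0.16
```

In this example all of the forecaster's edge over climatology (0.25 down to 0.16) comes from resolution, since the calibration error is zero.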

2. Calibration Assessment:

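Bin the forecasts into groups (e.g., $[0, 0.1), [0.1, 0.2), \ldots, [0.9, 1.0]$). For each bin, compute the observed rain frequency. Plot observed frequency vs. predicted probability. A well-calibrated forecaster's points lie on the 45-degree diagonal. Systematic deviations reveal bias: points above the diagonal mean it rained more often than forecast (the forecaster under-predicts rain), and points below mean it rained less often than forecast (over-predicts). Overconfidence shows up as a curve flatter than the diagonal, above it at low probabilities and below it at high ones; underconfidence is the reverse, a curve steeper than the diagonal.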
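The binning step can be sketched as a small table builder whose rows are the points you would plot. The function name `reliability_table`, the row layout, and the sample forecasts are assumptions for illustration; equal-width bins follow the example above, with $p = 1.0$ placed in the last bin.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Return one row per non-empty bin:
    (bin_low, bin_high, count, mean_forecast, observed_frequency)."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[idx].append((p, o))

    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than report a 0/0 frequency
        n = len(members)
        mean_forecast = sum(p for p, _ in members) / n
        observed_freq = sum(o for _, o in members) / n
        rows.append((idx / n_bins, (idx + 1) / n_bins, n, mean_forecast, observed_freq))
    return rows


# Hypothetical forecasts clustered near 15% and 85%, plus one day at 100%.
probs = [0.12, 0.18, 0.15, 0.15, 0.82, 0.88, 0.85, 0.85, 1.0]
outcomes = [0, 0, 1, 0, 1, 1, 0, 1, 1]

for low, high, n, mean_f, obs_f in reliability_table(probs, outcomes):
    print(f"[{low:.1f}, {high:.1f})  n={n}  forecast={mean_f:.2f}  observed={obs_f:.2f}")
```

Reporting the count alongside each point matters: a bin with a handful of days can sit far from the diagonal purely by chance.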

3. Alternative metric -- Log Loss:

$LL = -\frac{1}{T} \sum_{t=1}^{T} \left[ o_t \log p_t + (1 - o_t) \log(1 - p_t) \right]$

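Log loss penalizes confident wrong predictions much more severely than the Brier score. If a forecaster says $p = 0.99$ and it does not rain, the Brier penalty is $(0.99)^2 \approx 0.98$ but the log loss penalty (using the natural log) is $-\log(0.01) \approx 4.6$. The Brier penalty for a single day is capped at $1$, while the log loss penalty grows without bound as the forecast approaches certainty in the wrong direction. This makes log loss better at catching overconfident forecasters.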
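The comparison above can be reproduced with a short sketch. The clipping constant `eps` is an assumption added here so that a forecast of exactly 0 or 1 yields a large finite penalty instead of a math error; it is a common practical workaround, not part of the definition.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Average negative log-likelihood of 0/1 outcomes under the forecasts."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip (assumption) so log(0) never occurs
        total += -(o * math.log(p) + (1 - o) * math.log(1 - p))
    return total / len(probs)


# One day: the forecaster says 99% rain and it stays dry.
brier_penalty = (0.99 - 0) ** 2
log_penalty = log_loss([0.99], [0])

print(round(brier_penalty, 4))  # 0.9801
print(round(log_penalty, 3))    # 4.605
```

Because a single extreme miss can dominate the average, it is worth checking whether a forecaster's log loss ranking is driven by one or two days.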

Practical Considerations:

  • Both Brier score and log loss are proper scoring rules, meaning a forecaster gets the best expected score (the lowest expected penalty) by reporting their true belief. This is critical -- improper metrics can incentivize dishonest reporting.
  • With small samples, calibration plots are noisy. You may need a Hosmer-Lemeshow test or similar to formally test calibration.
  • Neither metric accounts for the economic value of forecasts. A forecaster who is slightly worse on Brier score but excels at predicting rare events might be more valuable to a trader.
  • Skill scores (comparing to a reference forecaster like climatology) give context: $BSS = 1 - BS / BS_{\text{ref}}$.
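The skill score in the last bullet can be sketched as follows. This version assumes the reference forecaster is in-sample climatology (the base rate computed from the same $T$ days), whose Brier score is $\bar{o}(1 - \bar{o})$; a base rate taken from a separate historical period is another reasonable choice. The function name and data are illustrative.

```python
def brier_skill_score(probs, outcomes):
    """BSS = 1 - BS / BS_ref, with in-sample climatology as the reference (assumption)."""
    total = len(outcomes)
    base_rate = sum(outcomes) / total
    bs = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / total
    bs_ref = base_rate * (1 - base_rate)  # Brier score of always forecasting the base rate
    if bs_ref == 0:
        raise ValueError("reference score is zero: outcomes are all 0 or all 1")
    return 1 - bs / bs_ref


# Hypothetical record with a 50% base rate.
outcomes = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
informed = [0.2] * 5 + [0.8] * 5   # separates wet days from dry days
constant = [0.5] * 10              # just repeats the base rate

print(brier_skill_score(informed, outcomes))  # about 0.36
print(brier_skill_score(constant, outcomes))  # 0.0
```

A positive value means the forecaster beats climatology, zero means no improvement, and a negative value means they would have done better by quoting the base rate every day.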

Answer: Use the Brier score $\frac{1}{T}\sum(p_t - o_t)^2$ as the primary evaluation metric (lower is better). Supplement with calibration plots (observed frequency vs. predicted probability) and consider log loss for detecting overconfidence. The best forecaster minimizes the Brier score while maintaining good calibration across all probability bins.

Intuition

Evaluating probabilistic forecasts is fundamentally different from evaluating point predictions. You cannot just check "right or wrong" because the forecaster is giving you a distribution, not a single answer. The right framework is proper scoring rules -- metrics that reward honest probability assessments. The Brier score is the MSE of probability forecasts, and its decomposition into reliability, resolution, and uncertainty is loosely analogous to the bias-variance decomposition in regression.

In quant finance, this exact framework applies to evaluating trading signals, model probabilities (default models, fill rate models), and any system that outputs a probability. The subtle point people miss is that calibration and discrimination are separate qualities. A perfectly calibrated forecaster can still be useless (just predict the base rate every day), and a well-discriminating forecaster can be miscalibrated (consistently overconfident but still ranking days correctly). The best forecasters excel at both.
