Table 3: Metrics for forecast performance.
| | Terminology | Definition | Formula |
|---|---|---|---|
| General definitions | Forecast horizon | The future period of time for which a forecast is generated. | |
| | Uninformative forecasts | Forecasts that do not help decision-making. Trivial solutions, such as perpetually issuing a 0% probability for rare events, perform well but are uninformative (unskilled) and can be used as a reference. | |
| | Discrimination | Whether forecasts differ when their corresponding observations differ; for example, if forecasts for wet days indicate more rain than forecasts for dry days, the forecasts can discriminate wetter from drier days. | |
| Deterministic metrics | Accuracy | Measure of discrimination, or how well a forecast correctly identifies or excludes a certain outcome. | (TP + TN) / All |
| | Sensitivity (Se) | How often the forecast correctly identifies an event. | TP / (TP + FN) |
| | Specificity (Sp) | How often the forecast avoids misidentification. | TN / (TN + FP) |
| | Time in warning (Tiw) | Proportion of time a forecast indicates an event is likely. | (TP + FP) / All |
| | Area under the curve (AUC) | Typically assessed as the tradeoff between sensitivity and specificity (or time in warning) by systematically thresholding the algorithm output at all forecasted values. | Se vs. 1 − Sp, or Se vs. Tiw |
| | Relative risk | The ratio between the probability of an event in one category or state and the probability of the same event in another category. | |
| Probabilistic metrics | Observed probability | Frequency of events per unit of time observed in the data, i.e. their empirical probability. | |
| | Expected probability | Frequency (probability) of events expected over a long future duration, based on all previous observations. | |
| | Forecasted probability | Probability of an event forecasted for one time interval in the future. | |
| | Calibration (or reliability) | Agreement between forecasted probability and observed probability. Typically calculated by averaging n forecast datapoints in m ranked bins (e.g. all forecasts between 0 and 10%) and calculating the corresponding observed event probability in each bin. For a calibrated forecast, the binned forecasted and observed probabilities match and therefore align on the diagonal of a reliability diagram. Graphically: distance to the diagonal (Fig. S1). | |
| | Resolution | Ability of the forecast to separate observed probabilities from the average observed probability. Resolution is zero for a flat reliability curve intersecting the y-axis at the expected probability; this corresponds to alignment of the ROC curve with the diagonal. Graphically: separation of the reliability curve from the horizontal line of no resolution (Fig. S1). | |
| | Sharpness | Tendency to forecast probabilities, fi, near 0 or 1, as opposed to uniformly distributed forecasts. Sharpness is an attribute of the forecast alone and is not influenced by the observations. Graphically: variance of the distribution of the forecasts. | |
| | Uncertainty | Uncertainty depends only on the frequency of events and is not influenced by the forecast. Uncertainty tends to 0 for very rare (or very frequent) events (i.e. with increased imbalance) and is greatest (= 0.25) when an event is observed 50% of the time, making forecasts more difficult. | ō(1 − ō) |
| | Skill | Accuracy of a forecast relative to some reference forecast. The reference forecast is generally an unskilled forecast such as random chance, shuffled forecasts, or uninformative forecasts. A forecast may appear better simply because the forecasting problem is easier, which is taken into account when calculating skill. | |
| | Bias | Mismatch between the mean forecast value, f̄, and the mean observed probability, ō. | f̄ − ō, where f̄ = (1/n) Σ fi |
| | Brier score (BS) | Mean squared distance between the forecasted value, fi, and the observation, oi (set to 1 or 0), calculated at each ith timepoint for n forecasts. Lower Brier scores are better (i.e. tend to zero). | BS = (1/n) Σ (fi − oi)² |
| | Brier skill score (BSS) | Improvement of the Brier score over a reference forecast. Brier skill scores tend to 1 when better than the reference, are 0 when there is no improvement over the reference, and are negative when worse than the reference. | BSS = 1 − BS / BSref |
TP: true positive, TN: true negative, FP: false positive, FN: false negative, All: TP + TN + FP + FN. m, number of bins in the reliability diagram; n, number of datapoints (observed or forecasted); fi, forecasted probability for the ith forecast; oi, the ith observed probability; and ō, the average observed probability.
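The deterministic metrics above can be sketched in code. This is a minimal illustration, not from the paper: it thresholds probabilistic forecasts into binary warnings, computes Se, Sp, accuracy, and time in warning from the confusion counts, and traces the AUC by sweeping the threshold over all forecasted values (trapezoidal integration of Se vs. 1 − Sp). All function and variable names are illustrative.

```python
# Illustrative sketch (not from the paper) of the table's deterministic metrics.

def confusion_counts(probs, events, threshold):
    """Threshold forecasts into warnings, then count TP/TN/FP/FN."""
    tp = sum(1 for p, o in zip(probs, events) if p >= threshold and o == 1)
    fp = sum(1 for p, o in zip(probs, events) if p >= threshold and o == 0)
    fn = sum(1 for p, o in zip(probs, events) if p < threshold and o == 1)
    tn = sum(1 for p, o in zip(probs, events) if p < threshold and o == 0)
    return tp, tn, fp, fn

def deterministic_metrics(probs, events, threshold):
    tp, tn, fp, fn = confusion_counts(probs, events, threshold)
    total = tp + tn + fp + fn                # "All" in the table's footnote
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),       # Se = TP / (TP + FN)
        "specificity": tn / (tn + fp),       # Sp = TN / (TN + FP)
        "time_in_warning": (tp + fp) / total,  # fraction of intervals under warning
    }

def auc_se_vs_1_minus_sp(probs, events):
    """AUC by thresholding at every forecasted value (trapezoidal rule)."""
    points = []  # (1 - Sp, Se) pairs tracing the ROC curve
    for t in sorted(set(probs)) + [float("inf")]:
        tp, tn, fp, fn = confusion_counts(probs, events, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    points.sort()
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

With a perfectly separating forecast (e.g. probabilities 0.9 and 0.8 on event days, 0.1 and 0.2 on non-event days), the sweep yields an AUC of 1.0, while the time in warning at a 0.5 threshold equals the event rate.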
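Likewise, the probabilistic metrics can be sketched as follows. This is an illustrative example, not the paper's code: it implements the Brier score and Brier skill score exactly as defined in the table, and bins forecasts into m ranked bins as in the calibration row to produce the points of a reliability diagram. The choice of a constant reference forecast in the usage note is an assumption; names are illustrative.

```python
# Illustrative sketch (not from the paper) of the table's probabilistic metrics.

def brier_score(forecasts, observations):
    """BS = (1/n) * sum_i (fi - oi)^2; lower (toward zero) is better."""
    n = len(forecasts)
    return sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n

def brier_skill_score(forecasts, observations, reference_forecasts):
    """BSS = 1 - BS / BS_ref: 1 is perfect, 0 matches the reference, negative is worse."""
    bs = brier_score(forecasts, observations)
    bs_ref = brier_score(reference_forecasts, observations)
    return 1.0 - bs / bs_ref

def reliability_bins(forecasts, observations, m=10):
    """Bin forecasts into m ranked bins; return (mean forecast, observed frequency)
    per non-empty bin -- the points of a reliability diagram."""
    bins = [[] for _ in range(m)]
    for f, o in zip(forecasts, observations):
        idx = min(int(f * m), m - 1)  # a forecast of exactly 1.0 falls in the last bin
        bins[idx].append((f, o))
    curve = []
    for b in bins:
        if b:  # skip empty bins
            mean_f = sum(f for f, _ in b) / len(b)
            obs_freq = sum(o for _, o in b) / len(b)
            curve.append((mean_f, obs_freq))
    return curve
```

As a reference one could use an uninformative forecast that perpetually issues the expected probability ō; a BSS above 0 then indicates improvement over that unskilled baseline.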