Recalibrating probabilistic forecasts of epidemics

Aaron Rumack; Ryan J Tibshirani; Roni Rosenfeld

doi:10.1371/journal.pcbi.1010771

2022 Dec 15;18(12):e1010771. doi: 10.1371/journal.pcbi.1010771


Editor: Cecile Viboud

PMCID: PMC9799311 PMID: 36520949

Abstract

Distributional forecasts are important for a wide variety of applications, including forecasting epidemics. Often, forecasts are miscalibrated, or unreliable in assigning uncertainty to future events. We present a recalibration method that can be applied to a black-box forecaster given retrospective forecasts and observations, as well as an extension to make this method more effective in recalibrating epidemic forecasts. This method is guaranteed to improve calibration and log score performance when trained and measured in-sample. We also prove that the increase in expected log score of a recalibrated forecaster is equal to the negative entropy of the PIT distribution. We apply this recalibration method to the 27 influenza forecasters in the FluSight Network and show that recalibration reliably improves forecast accuracy and calibration. This method, available on GitHub, is effective, robust, and easy to use as a post-processing tool to improve epidemic forecasts.

Author summary

Epidemics of infectious disease cause millions of deaths worldwide each year, and reliable epidemic forecasts can allow public health officials to respond to mitigate the effects of epidemics. However, because epidemic forecasting is a difficult task, many epidemic forecasts are not calibrated. Calibration is a desired property of any forecast, and we provide a post-processing method that recalibrates forecasts. We demonstrate the effectiveness of this method in improving accuracy and calibration on a wide variety of influenza forecasters. We also show a quantitative relationship between calibration and a forecaster’s expected score. Our recalibration method is a tool that any forecaster can use, regardless of model choice, to improve forecast accuracy and reliability. This work provides a bridge between forecasting theory, which rarely deals with applications in domains that are new or have little data, and some recent applications of epidemic forecasting, where forecast calibration is rarely analyzed systematically.

1 Introduction

Epidemic forecasting is an important tool to inform the public health response to outbreaks of infectious diseases. Often, decision makers can take more effective action with an estimate of the uncertainty in a forecasted target. For this reason, distributional forecasts are more desirable than point forecasts. A distributional forecast is a probability distribution over the target variable and measures the uncertainty in the prediction, as opposed to a point forecast, which is just a scalar value for each target and has no measure of uncertainty. A desired property of distributional forecasts is calibration, or reliability between forecasts and the true distribution of the variable forecasted (a mathematical definition is given in Section 2). Along with uncertainty and resolution, calibration is one of three components of a forecaster’s accuracy as measured by any proper score [1], with better calibration resulting in a better score. It is therefore important for a forecaster to produce calibrated forecasts.

Previous work has described general forecasting theory and calibration and evaluated the calibration of certain forecasts [2–5]. Later work has gone beyond just describing calibration, presenting post-processing algorithms to recalibrate forecasts that were previously miscalibrated. Nonparametric techniques for recalibration of ensemble forecasts include rank histogram correction [6], Bayesian model averaging [7], linear pooling [8], and probability anomaly correction [9]. Brocklehurst et al. [10] provide a nonparametric approach using the empirical CDF, which can recalibrate any forecast of a scalar target. Parametric approaches include logistic regression [11], extended linear regression [12] and beta-transform linear pooling [8]. Wilks and Hamill [13] compare the performance of different recalibration techniques for different meteorological targets with different amounts of training data.

Much of the work in recalibration has been applied to weather forecasting, and thus many of the techniques are not applicable in other forecasting domains. The most popular weather forecasting models create a distribution from a series of point predictions, with each point being the result of a simulation under varying initial conditions. Many of the existing recalibration methods are defined only for this type of ensemble forecaster. For example, Bayesian model averaging assumes that an ensemble forecast is comprised of the same N forecasts in each observation. This method cannot be extended trivially to a domain where the forecaster itself outputs a distribution. Additionally, weather forecasts usually have a plethora of training data on which to train recalibration methods. For example, recalibration has been applied to a set of weather forecasts generated daily from 1979 to at least 2006, almost 10,000 days [14]. In settings like these, techniques need not be robust to small amounts of recalibration training data.

To be clear on nomenclature, throughout this paper, we use the term forecast to refer to the predicted probability distribution of a variable and the term forecaster to refer to an algorithm that produces a forecast for a variable given a context. Common examples of forecasters are an algorithm that forecasts the amount of precipitation two days in advance given current meteorological information, one that forecasts the price of a certain stock given the stock’s historical trend, or one that forecasts the statewide influenza incidence given historical incidence data. We also distinguish between calibration and recalibration; calibration refers to the property of a forecaster, and recalibration refers to a method whose goal is to make a forecaster more calibrated. Specifically, recalibration takes as input a set of a forecaster’s forecasts and corresponding observations (“training data”), and outputs a forecaster which should be more calibrated on a different set of forecasts and observations (“test data”).

In what follows, we present a generalized approach to forecast recalibration and show its performance when applied to forecasters in the FluSight Network. We demonstrate that across the diverse set of FluSight forecasters, recalibration consistently improves not just calibration but accuracy as well.

2 Methods

Consider the following setup. At each i = 1, 2, 3…, a forecaster M outputs a density forecast f_i given features x_i for a continuously distributed scalar random variable y_i whose true distribution is h_i. As a regularity condition, we assume that the corresponding cumulative distribution functions (CDFs) F_i and H_i are continuous and strictly increasing. The forecaster M is evaluated according to a proper scoring rule, such as the quadratic score [15] or the logarithmic score [16].

The goal of a forecaster is to produce ideal forecasts, i.e., to forecast f_i = h_i, the true distribution of y_i, for each i, though this is usually unattainable. We can inspect how close a forecaster is to being ideal with the distribution of the probability integral transform (PIT) values [17]. For each forecast f_i and observed value y_i, the PIT is defined as

$$\mathrm{PIT}(f_i, y_i) = F_i(y_i),$$

where F_i is the CDF of f_i. A necessary (but not sufficient) condition for a forecaster to be ideal is probabilistic calibration [3]:

$$\frac{1}{N} \sum_{i=1}^{N} H_i \circ F_i^{-1}(p) \to p \quad \text{as } N \to \infty, \text{ for all } p \in (0, 1).$$

(Here and throughout we interpret convergence in the almost sure sense.) An example of a probabilistically calibrated forecaster that is not ideal is the so-called climatological forecaster, which for each i outputs the marginal distribution of y_i over i = 1, 2, 3, …. To make this concrete, suppose each y_i is distributed as $N(\mu_i, 1)$, a normal distribution with mean μ_i and variance 1, and each μ_i itself follows $N(0, 1)$. Then the climatological forecaster simply outputs $N(0, 2)$ for each i.
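The climatological example can be checked by simulation. The following is a minimal sketch, not taken from the paper's code: the sample size, the random seed, and the helper `pit_cdf` are illustrative assumptions. It draws μ_i and y_i as described, computes PIT values for both the ideal forecaster N(μ_i, 1) and the climatological forecaster N(0, 2), and compares their empirical PIT CDFs and mean log scores.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(0)          # arbitrary seed, for reproducibility only
N = 20_000              # illustrative sample size
climatology = NormalDist(0.0, math.sqrt(2.0))  # N(0, 2): variance 2, sd sqrt(2)

pit_ideal, pit_clim = [], []
log_ideal, log_clim = [], []
for _ in range(N):
    mu = random.gauss(0.0, 1.0)   # mu_i ~ N(0, 1)
    y = random.gauss(mu, 1.0)     # y_i ~ N(mu_i, 1)
    ideal = NormalDist(mu, 1.0)   # ideal forecast: f_i = h_i
    pit_ideal.append(ideal.cdf(y))
    pit_clim.append(climatology.cdf(y))
    log_ideal.append(math.log(ideal.pdf(y)))
    log_clim.append(math.log(climatology.pdf(y)))

def pit_cdf(pits, p):
    """Empirical CDF of the PIT values, evaluated at p."""
    return sum(v <= p for v in pits) / len(pits)

# Both forecasters have a PIT CDF close to the diagonal G(p) = p ...
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(pit_cdf(pit_ideal, p), 3), round(pit_cdf(pit_clim, p), 3))

# ... but only the ideal forecaster attains the better mean log score.
print("mean log score (ideal, climatological):", fmean(log_ideal), fmean(log_clim))
```

Both PIT distributions should look uniform, while the mean log scores differ, which illustrates why probabilistic calibration is necessary but not sufficient for a forecaster to be ideal.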

Note that the PIT distribution of a probabilistically calibrated forecaster is close to uniform in large samples. The expected CDF of the PIT distribution is

$$G(p) = E[P[F_i(y_i) \leq p]] = E[P[y_i \leq F_i^{-1}(p)]] = E[H_i \circ F_i^{-1}(p)],$$

where here $E$ denotes the sample average operator over i = 1, …, N. This expression converges to p as N → ∞ when the forecaster is probabilistically calibrated. Thus an examination of the distribution of PIT values—looking for potential deviations from uniformity—serves as a good diagnostic tool to assess probabilistic calibration. Many use a PIT histogram to examine the PIT distribution because it is easy to read and understand [3]. For example, if the PIT distribution is bell-shaped, then the forecaster does not put enough weight in the middle of its distribution and is underconfident. In general, we can compare the PIT density to the horizontal line at 1, which corresponds to the uniform density. The greater the deviation from this line (which can be quantified via Kullback-Leibler divergence from the uniform distribution to the PIT distribution, or equivalently, negative entropy of the PIT distribution), the greater the miscalibration; see Fig 1 for examples.

Our recalibration method uses G as a CDF-CDF transform. The recalibrated forecaster, denoted M*, is defined by a recalibrated forecast CDF of $F_{i}^{*} (y) = G (F_{i} (y))$ , for each i. By the chain rule, the recalibrated forecast density is $f_{i}^{*} (y) = g (F_{i} (y)) \cdot f_{i} (y)$ , for each i. Thus the recalibrated forecast $f_{i}^{*}$ is the original forecast f_i weighted by the PIT density g. An illustration of this method is provided in Fig 2. In practice, of course, we do not have access to the true distributions H_i, so we need to estimate G from PIT values. A key assumption is that the PIT distribution of the training forecasts is the same as that of the test forecasts. Otherwise, applying G as a CDF-CDF transform will not produce probabilistically calibrated forecasts. The ultimate estimate of G that we propose in this paper will be an ensemble (weighted linear combination) of three estimates: a nonparametric method, a parametric method, and a null method. First, we will motivate calibration as a tool to increase forecast accuracy, and then, we explain the individual estimation methods.

Fig 2 — The original, underconfident forecast density is $f (y) = N (0, 2)$ while the true density is $h (y) = N (0, 1)$ . By calculating the PIT density g and producing a recalibrated forecast as the product g(F(y)) ⋅ f(y), we recover the true h(y).
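The example in Fig 2 can be reproduced numerically. The sketch below assumes that the true distribution is known, which is never the case in practice, so that G and g are available in closed form; the function names are illustrative. It applies the CDF-CDF transform to the forecast N(0, 2) and checks that the recalibrated density matches the true density N(0, 1).

```python
import math
from statistics import NormalDist

f = NormalDist(0.0, math.sqrt(2.0))  # forecast N(0, 2): variance 2
h = NormalDist(0.0, 1.0)             # true distribution N(0, 1)

def G(p):
    """PIT CDF: G(p) = H(F^{-1}(p))."""
    return h.cdf(f.inv_cdf(p))

def g(p):
    """PIT density: g(p) = h(F^{-1}(p)) / f(F^{-1}(p))."""
    y = f.inv_cdf(p)
    return h.pdf(y) / f.pdf(y)

def recalibrated_cdf(y):
    """F*(y) = G(F(y))."""
    return G(f.cdf(y))

def recalibrated_pdf(y):
    """f*(y) = g(F(y)) * f(y)."""
    return g(f.cdf(y)) * f.pdf(y)

for y in (-2.0, -0.5, 0.0, 1.0, 2.5):
    print(y, recalibrated_pdf(y), h.pdf(y), recalibrated_cdf(y), h.cdf(y))
```

The PIT density g exceeds 1 near p = 0.5 and falls below 1 in the tails, which is the bell shape associated with an underconfident forecaster.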

2.1 Calibration and log score

In order to quantify how well a forecaster is calibrated, we calculate the entropy of the distribution of PIT values. As above, G is the CDF of the PIT distribution of M. The entropy of the PIT density g is defined as

$$H(g) = -\int_{0}^{1} g(p) \log g(p) \, dp.$$

If M is probabilistically calibrated, then (asymptotically, as N → ∞) the PIT values are uniform and the entropy is zero because g(p) is 1 everywhere. When the PIT values are not uniform, the entropy is negative.

Entropy is also useful because it provides an understanding of how miscalibration penalizes the expected log score, as shown below. First observe that

$$g(p) = \frac{d}{dp} G(p) = \frac{d}{dp} E[H_i \circ F_i^{-1}(p)] = E\left[\frac{h_i(F_i^{-1}(p))}{f_i(F_i^{-1}(p))}\right],$$

where the last step assumes the smoothness and integrability conditions on h_i, f_i needed to exchange expectation and differentiation (the Leibniz rule). Next observe that

$$\begin{aligned}
E[\log f_i^*(y_i)] - E[\log f_i(y_i)] &= E[\log g(F_i(y_i))] \\
&= E\left[\int_{-\infty}^{\infty} \log g(F_i(y_i)) \, h_i(y_i) \, dy_i\right] \\
&= E\left[\int_{0}^{1} \log g(F_i(F_i^{-1}(p))) \, \frac{h_i(F_i^{-1}(p))}{f_i(F_i^{-1}(p))} \, dp\right] \\
&= \int_{0}^{1} E\left[\log g(p) \, \frac{h_i(F_i^{-1}(p))}{f_i(F_i^{-1}(p))}\right] dp \\
&= \int_{0}^{1} g(p) \log g(p) \, dp = -H(g),
\end{aligned} \tag{1}$$

where the third line is obtained by a variable substitution, and the fourth by applying the Leibniz rule again, assuming the needed regularity conditions.

For any forecaster, if the PIT distribution is the same for the training data and the test data, then the improvement of the recalibrated forecast’s log score can be estimated by estimating the negative entropy of g (note that the entropy of any distribution on [0, 1] is nonpositive). We can explain this intuitively as well: the more negative H(g) is, the more it indicates that there is information lying in the structure of g that can be extracted to improve forecasts.
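Relation (1) can be checked on the example of Fig 2, where the forecast is N(0, 2) and the truth is N(0, 1). In that case the recalibrated forecast equals the truth, so the log score gain is the Kullback-Leibler divergence between the two normal distributions, which has a closed form. The sketch below is illustrative: the midpoint-rule integrator `pit_entropy` and its grid size are assumptions, not part of the paper's method. It compares the numerically integrated −H(g) with that closed form.

```python
import math
from statistics import NormalDist

f = NormalDist(0.0, math.sqrt(2.0))  # forecast N(0, 2): variance 2
h = NormalDist(0.0, 1.0)             # true distribution N(0, 1)

def g(p):
    """PIT density g(p) = h(F^{-1}(p)) / f(F^{-1}(p))."""
    y = f.inv_cdf(p)
    return h.pdf(y) / f.pdf(y)

def pit_entropy(density, n_grid=100_000):
    """Midpoint-rule approximation of H(g) = -int_0^1 g(p) log g(p) dp."""
    total = 0.0
    for k in range(n_grid):
        gp = density((k + 0.5) / n_grid)
        if gp > 0.0:
            total -= gp * math.log(gp) / n_grid
    return total

entropy = pit_entropy(g)

# Closed-form gain E_h[log h(y) - log f(y)] = KL(N(0, 1) || N(0, 2))
#   = 0.5 * (log(2) + 1/2 - 1)
gain = 0.5 * (math.log(2.0) + 0.5 - 1.0)

print("negative entropy:", -entropy)
print("log score gain:  ", gain)
```

Both quantities should be close to 0.097, so in this example an underconfident forecast with twice the true variance loses about a tenth of a nat of expected log score.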

2.2 Nonparametric correction

Given an observed training set of PIT values for a forecaster, F_i(y_i), i = 1, …, N, the empirical PIT CDF is

$$\hat{G}(x) = \frac{1}{N} \sum_{i=1}^{N} I[F_i(y_i) \leq x].$$

As $\hat{G}$ is discrete, it does not admit a well-defined density. Hence, to use it for recalibration, we first smooth $\hat{G}$ using a monotone cubic spline interpolant; the smoothed CDF then has a bona fide density $\hat{g}$, which is itself smooth (twice continuously differentiable, to be precise). Using this for recalibration produces $f_i^*(y) = \hat{g}(F_i(y)) \cdot f_i(y)$.

In practice, with a large amount of training data, recalibration using the empirical CDF as described above can be effective. However, with little training data, or a lot of diversity within the training data among the distributions of y_i, it can be ineffective for assuring calibration on the test set. This is in line with the practical difficulties of using nonparametric, distribution-free methods in general.
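A simplified sketch of the nonparametric correction follows. To stay within the Python standard library it swaps in a piecewise-linear interpolant of the empirical CDF on an equally spaced grid of knots, in place of the monotone cubic spline described above, so the resulting density is a step function rather than a smooth curve. The function `fit_pit_cdf`, the number of knots, and the simulated PIT values are all illustrative assumptions.

```python
import bisect
import math
import random
from statistics import NormalDist

def fit_pit_cdf(pits, n_knots=21):
    """Return (G_hat, g_hat): a piecewise-linear PIT CDF and its step density.

    Simplification: linear interpolation of the empirical CDF between knots,
    not the monotone cubic spline used in the paper.
    """
    data = sorted(pits)
    n = len(data)
    knots = [k / (n_knots - 1) for k in range(n_knots)]
    # Empirical CDF at each knot, with the endpoints pinned to 0 and 1.
    values = [bisect.bisect_right(data, x) / n for x in knots]
    values[0], values[-1] = 0.0, 1.0
    width = knots[1] - knots[0]

    def locate(p):
        return min(int(p / width), n_knots - 2)

    def G_hat(p):
        k = locate(p)
        t = (p - knots[k]) / width
        return values[k] + t * (values[k + 1] - values[k])

    def g_hat(p):
        k = locate(p)
        return (values[k + 1] - values[k]) / width

    return G_hat, g_hat

# Simulated training PIT values from an underconfident forecaster:
# the forecast is N(0, 2) while the observations follow N(0, 1).
random.seed(1)
f = NormalDist(0.0, math.sqrt(2.0))
h = NormalDist(0.0, 1.0)
train_pits = [f.cdf(random.gauss(0.0, 1.0)) for _ in range(5000)]

G_hat, g_hat = fit_pit_cdf(train_pits)

def recalibrated_pdf(y):
    """f*(y) = g_hat(F(y)) * f(y)."""
    return g_hat(f.cdf(y)) * f.pdf(y)

print("original f(0):    ", f.pdf(0.0))
print("recalibrated f*(0):", recalibrated_pdf(0.0))
print("true h(0):        ", h.pdf(0.0))
```

With fewer training PIT values the bin heights become noisy, which is one way to see why this approach needs more data than a two-parameter fit.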

2.3 Parametric correction

Gneiting and Ranjan [8] present a recalibration method originally motivated by redistributing weights on the components of an ensemble forecast, but their method can be applied generally to recalibrate any black-box forecaster. Given an observed training set of PIT values, F_i(y_i), i = 1, …, N, we fit a beta density $\hat{g}$ via maximum likelihood estimation. This in fact corresponds to the beta transform that maximizes the log score of the recalibrated forecaster on the training data [8].

This parametric model is more resilient to minimal training data, and a beta distribution is usually an effective estimate of the PIT distribution: because a beta density can be either convex or concave, it is flexible enough to fit the PIT distribution of overconfident and underconfident forecasters; and because its mean can lie anywhere in the interval (0, 1), it can fit biased forecasters as well. However, problematic behaviors arise at the tails. Except in special cases (when one or both of its two shape parameters is exactly 1), the beta density is 0 or ∞ at the endpoints of its support, which can cause problems for recalibration (there can be a big gap between the true PIT density and $\hat{g}$ in the tails).
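A minimal sketch of the beta fit is given below, using only the Python standard library. The optimizer is an assumption made for illustration: it maximizes the beta log-likelihood by coordinate-wise ternary search, which relies on the log-likelihood being concave in the shape parameters, and the search bounds, iteration counts, and clipping constant are arbitrary choices. The paper does not specify which solver is used.

```python
import math
import random
from statistics import NormalDist, fmean

def beta_loglik(a, b, mean_log_p, mean_log_1mp):
    """Average beta log-likelihood, written via its sufficient statistics."""
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * mean_log_p + (b - 1.0) * mean_log_1mp - log_beta_fn

def argmax_1d(func, lo=1e-3, hi=200.0, iters=80):
    """Ternary search for the maximizer of a concave 1-D function."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def fit_beta_mle(pits, sweeps=100, eps=1e-9):
    """Fit beta shape parameters (a, b) to PIT values by coordinate ascent."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in pits]  # avoid log(0)
    s1 = fmean(math.log(p) for p in clipped)
    s2 = fmean(math.log1p(-p) for p in clipped)
    a, b = 1.0, 1.0  # start from the uniform density
    for _ in range(sweeps):
        a = argmax_1d(lambda x: beta_loglik(x, b, s1, s2))
        b = argmax_1d(lambda x: beta_loglik(a, x, s1, s2))
    return a, b

def beta_pdf(p, a, b):
    """Beta density g_hat(p), used as the recalibration weight."""
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1.0) * math.log(p) + (b - 1.0) * math.log1p(-p) - log_beta_fn)

# Simulated PIT values from an underconfident forecaster (forecast N(0, 2), truth N(0, 1)).
random.seed(3)
f = NormalDist(0.0, math.sqrt(2.0))
train_pits = [f.cdf(random.gauss(0.0, 1.0)) for _ in range(5000)]

a_hat, b_hat = fit_beta_mle(train_pits)
print("fitted shape parameters:", a_hat, b_hat)
print("recalibrated density at y = 0:", beta_pdf(f.cdf(0.0), a_hat, b_hat) * f.pdf(0.0))
```

Both fitted shape parameters should exceed 1 here, giving a concave density peaked near 0.5 that sharpens the underconfident forecast.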

2.4 Null correction

The final component of the recalibration ensemble is a null correction, in which there is no recalibration at all, i.e., we simply set $f_{i}^{*} (y) = f_{i} (y)$ . This prevents overfitting and decreases variance of the overall ensemble correction, to be described next.

2.5 Recalibration ensemble

The final recalibration system uses the three components described previously and weights them in an ensemble. The ensemble weights are calculated to maximize the overall log score. Letting $f_{i j}^{*}$ denote the forecast density for sample i and component j, the ensemble weights w are defined by solving the optimization problem:

$$\underset{w}{\text{maximize}} \;\; \frac{1}{N} \sum_{i=1}^{N} \log\left(\sum_{j=1}^{p} w_j f_{ij}^{*}(y_i)\right) \quad \text{subject to} \;\; w \geq 0, \;\; \sum_{j=1}^{p} w_j = 1, \tag{2}$$

where p is the number of ensemble components (for us, p = 3) and the constraint w ≥ 0 is to be interpreted componentwise.

A component’s weight in the ensemble is not necessarily proportional to that component’s performance. For example, if the two best components are very similar to each other, one may have a very small weight because that component’s information is effectively represented by the other component.
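One simple way to solve a problem of the form (2) is the expectation-maximization (EM) update for mixture weights, sketched below. This is an assumption made for illustration, since the solver is not specified here. The three component densities in the example are hypothetical stand-ins, not actual recalibrated FluSight forecasts.

```python
import math
import random
from statistics import NormalDist, fmean

def mean_log_score(densities, w):
    """Objective of (2): average log of the weighted mixture density."""
    return fmean(math.log(sum(wj * fj for wj, fj in zip(w, row))) for row in densities)

def fit_ensemble_weights(densities, iters=200):
    """EM updates for the weights; densities[i][j] = f*_ij(y_i), assumed positive."""
    n, p = len(densities), len(densities[0])
    w = [1.0 / p] * p  # start from equal weights
    for _ in range(iters):
        new_w = [0.0] * p
        for row in densities:
            mix = sum(wj * fj for wj, fj in zip(w, row))
            for j in range(p):
                new_w[j] += w[j] * row[j] / mix  # responsibility of component j
        w = [v / n for v in new_w]
    return w

# Hypothetical example: observations from N(0, 1) and three stand-in components.
random.seed(5)
ys = [random.gauss(0.0, 1.0) for _ in range(1000)]
components = [
    NormalDist(0.0, math.sqrt(2.0)),  # too wide (underconfident)
    NormalDist(0.0, 1.0),             # well calibrated
    NormalDist(0.0, 0.5),             # too narrow (overconfident)
]
densities = [[c.pdf(y) for c in components] for y in ys]

weights = fit_ensemble_weights(densities)
print("weights:", [round(v, 3) for v in weights])
print("objective, equal weights: ", mean_log_score(densities, [1 / 3] * 3))
print("objective, fitted weights:", mean_log_score(densities, weights))
```

Each EM update keeps the weights nonnegative and summing to one, and never decreases the objective, so the constraints in (2) are satisfied at every iteration.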

2.6 Recalibration under seasonality

Epidemic forecasting presents a new challenge for recalibration. The methodology discussed above assumes that the previous behavior of a forecaster is indicative of future behavior, or more concretely, that the PIT distribution on the training set will be similar to that on the test set. However, this is not necessarily the case in epidemic forecasting, due to the fact that a forecaster’s behavior generally changes across the different phases of an epidemic. For example, some forecasters do not predict enough of a change in disease incidence from one week to the next. For such a forecaster, the PIT values are usually too high between a season’s onset and peak, because incidence increases more quickly than forecasted. Conversely, after the season peaks, the PIT values are too low, because incidence decreases more quickly than forecasted.

In order to account for such nonstationarity in the PIT distribution, we would like to form and use a special training set based on forecasts made at similar points in the epidemic curve in different seasons. This is not a straightforward task to do in real-time, since one cannot always be sure whether the peak has passed yet or not. However, for seasonal epidemics, we can take advantage of seasonality and build this training set based on the calendar weeks in which the forecasts were made. For example, a forecast made in week 6 can be recalibrated based on forecasts made in other seasons in weeks 3 through 9 (within three weeks of week 6). This is what we do in our experiments in this paper, with more details given in the next section.
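The seasonal training set can be sketched as a simple filter over stored PIT values. The record layout `PitRecord` and the function name are hypothetical, and the `week` field is assumed to be an index within the season, so that a window never has to wrap around the turn of the calendar year.

```python
from typing import NamedTuple

class PitRecord(NamedTuple):
    """One retrospective forecast outcome (hypothetical layout)."""
    season: str
    week: int        # week index within the season (assumption)
    location: str
    pit: float

def seasonal_training_pits(records, target_season, target_week, k=3):
    """PIT values from other seasons made within k weeks of target_week."""
    return [
        r.pit
        for r in records
        if r.season != target_season and abs(r.week - target_week) <= k
    ]

# Toy data: 3 seasons, 12 forecast weeks, 2 locations, placeholder PIT values.
records = [
    PitRecord(season, week, location, 0.5)
    for season in ("2010-11", "2011-12", "2012-13")
    for week in range(1, 13)
    for location in ("US", "HHS Region 1")
]

window = seasonal_training_pits(records, target_season="2012-13", target_week=6)
print(len(window))  # 2 other seasons x 7 weeks (3..9) x 2 locations
```

Near the start or end of a season the window is truncated to the weeks that exist, so early and late forecasts are recalibrated with fewer PIT values.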

3 Results

We apply this ensemble recalibration method to data from influenza forecasting in the US. In an effort to better prepare for seasonal influenza, the US CDC has organized a seasonal influenza forecasting challenge every year since 2013, called the FluSight Challenge [18]. In 2017, a group of forecasters formed the FluSight Network [19] and began submitting an ensemble forecast of 27 component forecasters. As part of this collaboration, each of these forecasters produced and stored retrospective forecasts spanning 9 seasons, from 2010–11 to 2018–19. The retrospective forecasts were produced at the same time, with each forecaster using the same method for all seasons. Had the forecaster modified its algorithm from season to season, the previous forecast performance would not be predictive of future forecast performance, violating the assumptions behind this recalibration method. These forecasters include mechanistic and non-mechanistic forecasters, as well as baseline forecasters. They are diverse in behavior, accuracy, and calibration, and therefore provide an interesting challenge for our recalibration method, which treats the forecaster as a black box.

First, we summarize the retrospective forecasts in the FluSight data set. Each week, a forecast is produced for seven forecasting targets, all of which are based on weighted ILI (wILI), a population-weighted average of the percentage of outpatient visits with influenza-like illness derived from reports to the CDC from a network of healthcare providers called ILINet [20]. The forecasting targets are:

season onset (the first week where wILI is above a predefined baseline for three consecutive weeks);
season peak week (week of maximum wILI);
season peak percentage (maximum wILI value);
the wILI value at 1, 2, 3, and 4 weeks ahead of the current week.

The first three targets are referred to as seasonal targets and the last four targets are referred to as short-term targets. Each forecast is discretized over predetermined bins, forming a histogram distribution. For the season onset and season peak week targets, the width of each bin is one week, and for the other targets, the width of each bin is 0.1% wILI. Forecasts are produced for each of the 10 HHS Regions as well as the US as a whole, for a total of 9 seasons, from 2010–11 to 2018–19. Thus to be clear, the forecasts in this FluSight data set are indexed by forecaster, target, season, forecast week, and location.

Next, we describe the training setup we use for recalibrating the forecasts in this data set, which is a kind of nested leave-one-season-out cross-validation. This is laid out in the steps below for a given forecaster and forecasting target, and a particular season s.

1. Create recalibrated forecasts for all seasons r ≠ s, using each of the three methods: nonparametric, parametric, and null. For a forecast in season r at week i and at location ℓ, we build a training set using PIT values from all seasons other than r and s, all available forecast weeks in [i − 3, i + 3] (within three weeks of i), and all locations. These recalibrated forecasts are only used for training the ensemble weights in the following step.
2. Optimize the ensemble weights w by solving (2) using the recalibrated forecasts from Step 1.
3. Create recalibrated forecasts for season s, again using each of the three methods: nonparametric, parametric, and null. This is just as in Step 1, except we use one more season in the training set. Explicitly, for a forecast in season s at week i and at location ℓ, we build a training set using PIT values from all seasons other than s, all forecast weeks in [i − 3, i + 3] (within three weeks of i), and all locations.
4. Create ensemble recalibrated forecasts for season s, using the recalibration components from Step 3 and the weights from Step 2.
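The season bookkeeping in the steps above can be sketched as follows. This only lays out which seasons feed which step for a held-out season s; the function name and the dictionary keys are illustrative, and the week window and the recalibration fits themselves are omitted.

```python
def nested_loso_plan(seasons, s):
    """Season sets used when season `s` is held out for testing.

    Returns a dict with:
      - "step1": for each inner season r != s, the seasons whose PIT values
        train the components (all seasons except r and s);
      - "step2": the seasons whose Step 1 forecasts train the ensemble weights;
      - "step3": the seasons whose PIT values train the final components
        applied to s (all seasons except s).
    """
    others = [t for t in seasons if t != s]
    step1 = {r: [t for t in others if t != r] for r in others}
    return {"step1": step1, "step2": others, "step3": others}

# The nine retrospective seasons, 2010-11 through 2018-19.
seasons = [f"{y}-{(y + 1) % 100:02d}" for y in range(2010, 2019)]

plan = nested_loso_plan(seasons, s="2018-19")
print(plan["step1"]["2010-11"])  # seven seasons: everything except 2010-11 and 2018-19
print(plan["step3"])             # eight seasons: everything except 2018-19
```

The held-out season never appears in any training set, and each inner season r is likewise excluded from the data used to recalibrate its own forecasts.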

In what follows, we present and discuss the results. The code and data used to produce all of these results are publicly available online [21].

3.1 Effect of varying window size

The training procedure just presented assumes a window of k = 3 weeks on either side of a given week i in order to build the set of PIT values used for recalibration (using forecast data from other seasons). However, we could consider varying k, which would navigate something like a bias-variance tradeoff. We would expect the optimal window k to be larger for the nonparametric recalibration method versus the parametric one. It turns out that k = 3 is typically a reasonable choice for both, as displayed in Fig 3.

Fig 3 — A window size of k corresponds to training recalibration on forecasts within k weeks of the given forecast week, where available, inclusive. Log score is averaged over 9 seasons, 11 locations, and 29 weeks (higher log score is better). The largest window sizes slightly hurt the performance of the parametric model, and the smallest window sizes significantly hurt the nonparametric model. Averaged over all forecasters, the improvement in performance due to calibration is roughly equal to the improvement in performance by reducing the forecast horizon by a week.

3.2 Forecast accuracy and calibration

For the short-term targets, the ensemble recalibration method improves the mean log score for almost all forecasters. Both the nonparametric and parametric recalibration methods significantly improve the mean log score, and the ensemble improves it even further. For the seasonal targets, some component recalibration methods do not improve accuracy, although the ensemble method does improve accuracy, averaged over all forecasters. However, the ensemble improves accuracy for seasonal targets in only about three-quarters of forecasters. See Figs 4 and 5.

Fig 4 — Log score is averaged over all 27 forecasters in the FluSight, 9 seasons, 11 locations, and 29 weeks (higher log score is better). The ensemble recalibration method improves accuracy for every target.

Fig 5 — The ensemble method improves accuracy for the short-term targets for all forecasters, and most forecasters for the seasonal targets. It also improves calibration (as measured by entropy) for most forecasters and most targets. The ensemble method outperforms both the nonparametric and parametric methods.

Fig 6 gives a more direct comparison of improvements in accuracy versus calibration, i.e., in mean log score versus entropy, for the short-term forecasts. (Note that we estimate the entropy of the distribution of PIT values using a simple histogram estimator with 100 equal bins along the interval [0, 1].) We see a clear linear trend, with slope approximately 1, confirming our expectations from (1).
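A histogram estimator of the PIT entropy with 100 equal bins can be sketched as below. The function name and the simulated PIT samples are illustrative assumptions, and this plug-in estimator is slightly biased downward in finite samples, so even perfectly uniform PIT values yield a small negative estimate.

```python
import math
import random
from statistics import NormalDist

def pit_entropy_histogram(pits, n_bins=100):
    """Plug-in entropy of the PIT density from an equal-width histogram on [0, 1]."""
    counts = [0] * n_bins
    for p in pits:
        counts[min(int(p * n_bins), n_bins - 1)] += 1  # p = 1.0 goes in the last bin
    n = len(pits)
    entropy = 0.0
    for c in counts:
        if c > 0:  # empty bins contribute 0 (0 * log 0 = 0)
            density = c * n_bins / n  # estimated g on this bin
            entropy -= (c / n) * math.log(density)
    return entropy

random.seed(6)
uniform_pits = [random.random() for _ in range(10_000)]

# Underconfident forecaster: forecast N(0, 2), truth N(0, 1).
f = NormalDist(0.0, math.sqrt(2.0))
under_pits = [f.cdf(random.gauss(0.0, 1.0)) for _ in range(10_000)]

entropy_uniform = pit_entropy_histogram(uniform_pits)
entropy_under = pit_entropy_histogram(under_pits)
print("calibrated:    ", entropy_uniform)
print("underconfident:", entropy_under)
```

The estimate is never positive, consistent with the fact that the entropy of any distribution on [0, 1] is at most zero.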

Finally, in Fig 7, we show that our ensemble recalibration method increases the entropy of the PIT distribution to nearly zero for nearly every forecaster. The two exceptions, the line segments towards the bottom of Fig 7, correspond to particularly poor forecasters (so poor that they are outperformed by a baseline forecaster that outputs a uniform distribution).

3.3 Effect of number of training seasons

We chose to apply our recalibration to the FluSight Challenge because there are many forecasters available over many seasons for testing and training. When recalibrating forecasts of other epidemics, there may be significantly less training data available. Fortunately, these methods are robust to situations with little training data. The parametric recalibration method improves the mean log score, averaged over all 27 forecasters, with just two training seasons, and the nonparametric recalibration method improves average performance with four training seasons, as shown in Fig 8.

Fig 8 — We perform three runs for each of the nine available seasons and n ∈ {1, 2, 4, 8}, where a run consists of randomly sampling n other seasons to train recalibration for each of the 27 FluSight forecasters. Each point in the plot is averaged over 9 × 3 = 27 runs. As expected, the parametric method is more robust to limited training data than the nonparametric method.

Because we train selectively based on seasonality, as discussed in Section 2.6, each training season and location contributes only 7 PIT values to estimate G. We pool 11 locations together, so each training season contributes 7 × 11 = 77 PIT values. The parametric method can therefore improve performance with roughly 150 PIT values (two seasons), and the nonparametric method can improve performance with roughly 300 PIT values (four seasons).

3.4 Recalibrating the FluSight ensemble

As we just saw, recalibration improves the performance of the individual forecasters in the FluSight Network. A natural follow up is therefore to investigate whether it can improve the performance of the FluSight ensemble, a forecaster that combines 27 component forecasters (the individual FluSight forecasters), whose construction is described in [19].

As both recalibration and ensembling are post-processing methods (i.e., both operate on forecasts that have already been produced), we are left with two options to explore. We can recalibrate the component forecasters and then ensemble (C-E), or ensemble the components and then recalibrate (E-C). In the C-E model, we train ensemble weights in a leave-one-season-out format, on the recalibrated component forecasts. In the E-C model, we train ensemble weights in a leave-one-season-out format on the original component forecasts, and then recalibrate the ensemble forecasts.

Fig 9 reveals that the E-C model performs better than the C-E model. This is in line with established forecasting theory, which states that linear ensembles (which take a linear combination of component forecasters, such as the FluSight ensemble approach) are generally miscalibrated, even when the individual component forecasters are themselves calibrated [5, 8, 22].
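The miscalibration of linear pools can be illustrated with a toy simulation. The setup is a hypothetical construction, not the FluSight ensemble: the target is y = a1 + a2 + ε with independent standard normal terms, forecaster 1 observes only a1 and outputs N(a1, 2), and forecaster 2 observes only a2 and outputs N(a2, 2). Each component is probabilistically calibrated, yet the equally weighted pool of the two is not.

```python
import math
import random
from statistics import NormalDist, pvariance

random.seed(2)
N = 20_000
sd = math.sqrt(2.0)  # each component forecasts with variance 2

pit_component, pit_pool = [], []
for _ in range(N):
    a1, a2, eps = (random.gauss(0.0, 1.0) for _ in range(3))
    y = a1 + a2 + eps
    F1 = NormalDist(a1, sd).cdf(y)  # calibrated: y | a1 ~ N(a1, 2)
    F2 = NormalDist(a2, sd).cdf(y)  # calibrated: y | a2 ~ N(a2, 2)
    pit_component.append(F1)
    pit_pool.append(0.5 * F1 + 0.5 * F2)  # CDF of the equal-weight linear pool

# A uniform PIT distribution has variance 1/12 (about 0.0833).
print("component PIT variance:", pvariance(pit_component))
print("pooled PIT variance:   ", pvariance(pit_pool))
```

The pooled PIT values are concentrated around 0.5, the signature of an underconfident forecaster, which is consistent with recalibrating after ensembling (E-C) having something left to correct.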

4 Discussion

Even in a domain as complex as epidemic forecasting, relatively simple recalibration methods such as those described in this paper can significantly improve both calibration and accuracy. A forecaster’s performance for any proper score can be decomposed into three components: the inherent uncertainty of the target itself, the resolution of the forecaster (concentration of the forecasts), and the reliability of the forecaster to the target (calibration) [1]. In epidemic forecasting, without seasonality-aware recalibration training (such as that proposed and implemented in this paper), recalibration will not affect the resolution term, which is left to the individual forecasters, but it will improve the reliability term. However, using seasonality-aware recalibration, it can also improve the resolution term.

Over 9 seasons of forecast data from 27 forecasters in the FluSight Challenge, we found that recalibration was especially helpful for the short-term targets (1–4 week ahead forecasts). With the exception of two very similar forecasters that have poor performance, the ensemble recalibration method was able to raise the entropy of the PIT distribution to nearly zero (not, or only barely, statistically significantly different from that of a uniform distribution). The recalibrated forecasts are therefore more accurate and more reliable. This is true across a diverse set of forecasters, including mechanistic, statistical, baseline, and ensemble models; indeed, as our recalibration method treats the forecaster as a black box, it can be applied to any forecaster, given access to suitable training data (retrospective historical forecasts).

Recalibrating influenza forecasts avoids challenges present in other forecasting environments, such as nonseasonality, a lack of consistent forecasters spanning many seasons, and little training data. Although this makes recalibrating influenza forecasts a relatively easier task, we believe that this recalibration method can be applied to forecasting other diseases as well. For example, dengue fever is a seasonal disease with training data since 2014 available for forecasting [23]. Aedes mosquito counts are another seasonal target of interest to the CDC, which has released several years of training data for some counties for the purpose of forecasting [24]. This recalibration method, with its seasonal component, can be applied to these forecasts.

In application to nonseasonal diseases, such as COVID-19 (currently), this method can easily be modified to use all available PIT values, as opposed to the selective training used for influenza forecasts. Alternatively, selective training could be done not by calendar week but by some other feature(s) that differentiates a forecaster’s behavior (e.g., whether cases are increasing or decreasing). While this allows for a flexible approach to recalibrate a variety of seasonal and nonseasonal diseases, this may be difficult to implement effectively in practice. In other cases where the PIT distribution changes slowly over time, training could be done only on the most recent forecasts to improve the estimate of $\hat{G}$ . This selective training approach has been successful in recalibrating COVID-19 forecasts [25]. The ensemble approach allows for the incorporation of multiple models trained on different historical forecasts, or even different recalibration methods altogether.

Regarding a lack of consistent forecasters, even if a forecaster has been modified continuously over many years and previous performance is not indicative of current performance, recalibration can be trained on retrospective forecasts produced by the current forecaster.

A lack of training data is a more difficult problem to solve. An obvious problem of limited training data is the variance in estimating $\hat{G}$ , but an additional challenge is that it is difficult to confirm our assumption that the PIT distribution is stationary over time. If we cannot detect that the PIT distribution changes over time, we will make inappropriate “corrections” to the forecasts that could harm calibration and accuracy. In practice, recalibration improved performance of the FluSight Challenge forecasts with relatively little training data, as shown in Fig 8. In less well-behaved applications, however, performance could decrease. We have made these recalibration methods available online so that a user can experiment with his or her own forecasts and determine whether or not recalibration improves performance [21].

The performance of recalibration with respect to the seasonal targets (onset, peak week, and peak percentage) was less conclusive than for the short-term targets. Although the mean log score averaged over all of the forecasters was improved, recalibration only improved the performance of about three-quarters of the forecasters. Seasonal targets are inherently more difficult to recalibrate because at the end of the season, the true value has almost certainly been observed, and the forecasts are highly confident. For these forecasts, the correct bin has a mass of almost 1, and the observed PIT value then is approximately 0.5. At the end of the season, the PIT distribution is therefore very concentrated at 0.5, which indicates underconfidence and poor calibration. If these PIT values of 0.5 are used to train the recalibration of forecasts made earlier in the season, before the target is observed, then recalibration incorrectly makes those forecasts more confident. Because one is unsure whether the season peak has occurred or not for several weeks after the peak occurs, recalibration training is a nontrivial task. Reliably improving accuracy and calibration for seasonal targets requires more work and is a topic for future research.
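The end-of-season effect can be illustrated with a small calculation. The sketch assumes a midpoint convention for the PIT of a binned forecast (mass below the observed bin plus half the mass in it), which is consistent with the value of 0.5 described above; the convention used in the paper may differ in detail, and `midpoint_pit` is a hypothetical helper.

```python
def midpoint_pit(bin_probs, observed_bin):
    """Mass strictly below the observed bin plus half the mass inside it."""
    below = sum(bin_probs[:observed_bin])
    return below + 0.5 * bin_probs[observed_bin]


# End of season: nearly all mass sits on the (already observed) peak-week bin.
late_season = midpoint_pit([0.005, 0.99, 0.005], observed_bin=1)

# Early season: the forecast is still spread out, so the PIT can land anywhere.
early_season = midpoint_pit([0.2, 0.3, 0.5], observed_bin=2)
```

Every highly confident, correct end-of-season forecast yields a PIT value close to 0.5 under this convention, so pooling such values with early-season ones makes the forecaster look underconfident overall.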

Supporting information

S1 Appendix. The supporting information contains additional results.

(PDF)

Data Availability

The code and data are available at https://github.com/rumackaaron/recalibration.

Funding Statement

AR was supported by a fellowship from the Center for Machine Learning and Health at Carnegie Mellon University (https://www.cs.cmu.edu/cmlh-cfp). RR and AR were supported by McCune Foundation grant FP00004784 (https://www.mccune.org). RT and RR were supported by Centers for Disease Control and Prevention grant U01IP001121 (https://www.cdc.gov). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Bröcker J. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society. 2009;135(643):1512–1519. doi: 10.1002/qj.456
2. Dawid AP. Calibration-based empirical probability. Annals of Statistics. 1985;13(4):1251–1274. doi: 10.1214/aos/1176349736
3. Gneiting T, Balabdaoui F, Raftery AE. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society, Series B. 2007;69:243–268. doi: 10.1111/j.1467-9868.2007.00587.x
4. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 2007;102(477):359–378. doi: 10.1198/016214506000001437
5. Hora SC. Probability judgements for continuous quantities. Management Science. 2004;50(5):597–604. doi: 10.1287/mnsc.1040.0205
6. Hamill TM, Colucci SJ. Verification of Eta-RSM short-range ensemble forecasts. Monthly Weather Review. 1997;125:1312–1327. doi: 10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2
7. Raftery AE, Gneiting T, Balabdaoui F, Polakowski M. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review. 2005;133:1155–1174. doi: 10.1175/MWR2906.1
8. Gneiting T, Ranjan R. Combining predictive distributions. Electronic Journal of Statistics. 2013;7:1747–1782. doi: 10.1214/13-EJS823
9. van den Dool H, Becker E, Chen LC, Zhang Q. The probability anomaly correlation and calibration of probabilistic forecasts. Weather and Forecasting. 2017;32:199–206. doi: 10.1175/WAF-D-16-0115.1
10. Brocklehurst S, Chan PY, Littlewood B, Snell J. Recalibrating software reliability models. IEEE Transactions on Software Engineering. 1990;16(4):458–470. doi: 10.1109/32.54297
11. Hamill TM, Whitaker JS, Wei X. Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. Monthly Weather Review. 2004;132:1434–1447. doi: 10.1175/1520-0493(2004)132<1434:ERIMFS>2.0.CO;2
12. Gneiting T, Raftery AE, Westveld AH III, Goldman T. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review. 2005;133:1098–1118. doi: 10.1175/MWR2904.1
13. Wilks DS, Hamill TM. Comparison of ensemble-MOS methods using GFS reforecasts. Monthly Weather Review. 2007;135:2379–2390. doi: 10.1175/MWR3402.1
14. Hamill TM, Whitaker JS, Mullen SL. Reforecasts: An important dataset for improving weather predictions. Bulletin of the American Meteorological Society. 2006;87:33–46. doi: 10.1175/BAMS-87-1-33
15. de Finetti B. Methods for Discriminating Levels of Partial Knowledge Concerning a Test Item. The British Journal of Mathematical and Statistical Psychology. 1965;18:87–123. doi: 10.1111/j.2044-8317.1965.tb00695.x
16. Good IJ. Rational decisions. Journal of the Royal Statistical Society. 1952;14:107–114.
17. Dawid AP. Statistical theory: The prequential approach (with discussion). Journal of the Royal Statistical Society Series A. 1984;147:278–292.
18. Centers for Disease Control and Prevention. FluSight: Flu forecasting. https://www.cdc.gov/flu/weekly/flusight/index.html.
19. Reich NG, McGowan CJ, Yamana TK, Tushar A, Ray EL, Osthus D, et al. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S. PLOS Computational Biology. 2019;15(11). doi: 10.1371/journal.pcbi.1007486
20. Centers for Disease Control and Prevention. National, regional, and state level outpatient illness and viral surveillance. https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html.
21. Rumack A, Brooks LC, Hyun S, Tibshirani RJ, Rosenfeld R. Recalibration repository. https://github.com/rumackaaron/recalibration.
22. Ranjan R, Gneiting T. Combining probability forecasts. Journal of the Royal Statistical Society, Series B. 2010;72(1):71–91. doi: 10.1111/j.1467-9868.2009.00726.x
23. Pan American Health Organization. PAHO/WHO Data—Dengue Cases. https://www3.paho.org/data/index.php/en/mnu-topics/indicadores-dengue-en/dengue-nacional-en/252-dengue-pais-ano-en.html.
24. Centers for Disease Control and Prevention. Epidemic Prediction Initiative: Aedes Forecasting. https://predict.cdc.gov/post/5e8e21ebcd1fbb050eacaa1e.
25. Picard R, Osthus D. Forecast Intervals for Infectious Disease Models. medRxiv. 2022.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010771.r001

Decision Letter 0

Cecile Viboud, Thomas Leitner

16 May 2022

Dear Mr. Rumack,

Thank you very much for submitting your manuscript "Recalibrating probabilistic forecasts of epidemics" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. In particular, both reviewers have commented on how influenza forecasts may be a special case of infectious disease forecasts because of marked seasonality, multiple years worth of data, and lack of changes in the structure of individual models (so that adjusting from past performances can be particularly useful). Accordingly, they have provided suggestions for sensitivity analyses. We would also want this limitation to be highlighted in the discussion.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Cecile Viboud

Associate Editor

PLOS Computational Biology

Thomas Leitner

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Summary

The authors present a novel technique for recalibrating black-box forecasters that uses the distribution of PIT values from past forecasts to improve measures of calibration as well as log scores of future forecasts. They demonstrate the effectiveness of their approach by showing the improvement achieved for seasonal forecasts of influenza. Their proposed approach in a first step estimates PIT distributions (using both a parametric and a non-parametric method) and uses them to recalibrate forecasts. In a second step, they create a weighted ensemble of these two approaches and an unmodified forecast. To my knowledge this approach is novel and is one that I find genuinely exciting. I think the manuscript could be made even better by addressing a few points.

Major comments

First, the FluSight data set chosen to me feels like an idealised example, but this is not discussed in the paper. The authors use data that would not be available to researchers in many real-time settings, for example by using all other seasons, even those that lie in the future, to recalibrate forecasts for a given season. The data set is also very rich, including forecasts from the exact same models available across 9 seasons. In many settings, however, models evolve over time and it may not be possible to create retrospective forecasts. Diseases also may be non-seasonal or a disease may be seen for the first time in a given region (e.g. Ebola, Chikungunya in the Americas, COVID-19). I think it would be important to make these particular features of the example data clear and discuss in greater detail in which other settings the recalibration approach can be applied and where it may fail.

I suggest showing how well the recalibration method worked if only data from a single season were available. The data collated by the COVID-19 Forecast Hub could also make for an interesting example.

Secondly, I think Figure 3 and Figure 4 should be made more consistent and captions should be more detailed, as it is sometimes not clear what the figure shows. In particular:

- Figure 3: I'm not sure I exactly understand what was done. Why was the log score of the most accurate forecaster chosen, rather than say the average of all forecasters? What forecast horizon was used? Maybe it would be helpful to clearly separate k and 'window size' as at least to me they were confusing at first. Do I understand correctly that k = 10 means 10 weeks prior + 10 weeks past the week, so the window is actually 20 weeks? When k = 10, how do you deal with the forecasts in the first 10 weeks (or the last ones)? Is the window shortened for these, or are they omitted from scoring? In addition: maybe it would be interesting to extend the shown window size even further, until we actually see performance degrade.

- Figure 4: It is slightly confusing that the non-parametric method performs about equally well for 2-, 3-, and 4-week-ahead forecasts in this plot. Just looking at Figure 3, one would expect a negative effect for 4-week-ahead forecasts and a 3-week window size. I assume this is because Figure 4 uses all 27 forecasters and Figure 3 only the best one? To me this is confusing and it would be helpful if the Figures were more consistent.

Thirdly, the code, as of now, is not very accessible. It currently sits on a branch of a much greater repository and it is unclear a) whether this code is meant to stay there indefinitely, b) how the code ties in with the greater context of the repository (what is required for the recalibration and what is not) and c) how someone could use the code for their own work. I think it would greatly enhance accessibility of the code if the authors could a) move it to a separate repository (or maybe a subfolder of the main repository), rather than just a branch of an existing repository, b) document the code and its functions in greater detail and c) provide some more explanations on how to run the code.

Minor points

- Introduction: Maybe it would be a good idea to mention the forecasting paradigm to maximise sharpness subject to calibration (see Gneiting et al., Probabilistic forecasts, calibration and sharpness, 2007)?

- Line 10: If calibration is one aspect, which are the other two? This is explained in the discussion, but maybe could also be included here.

- Line 58: I asked myself what the added complexity was if CDFs were not strictly increasing. Maybe either remove "for simplicity" or add a short explanation of why this is necessary?

- Line 59: I'm not sure why the Brier Score which is usually associated with binary forecasts is mentioned here. Maybe this could be made a little more clear (or removed).

- A few lines below line 60 (there are no line numbers here unfortunately): The observation was previously called $x_i$, (and $y_i$ was the random variable). It is maybe slightly confusing that the observation is now called $y_i$, where before the observation was $x_i$.

- Line 70: Maybe mention that a uniform PIT is a necessary, but not a sufficient condition for probabilistic calibration. Gneiting et al. (Probabilistic forecasts, calibration and sharpness, 2007) and Hamill (Interpretation of Rank Histograms for Verifying Ensemble Forecasts, 2001) mention examples of forecasters who are mis-calibrated, but have uniform PIT histograms.

- Equation 2: Again $x_i$ and $y_i$ are used differently than before

- Line 144: "too conservative" is slightly ambiguous. What it means is that forecasters don't adapt to trends quickly enough, but it could also easily be understood as "forecasts are too wide and uncertain".

- I unfortunately did not really understand the sentence in lines 178-180. Maybe the exact format could be worded more clearly?

- Line 185: In an applied setting, wouldn't it make more sense to use leave-future-out CV? Currently this is using data that would not have been available to researchers making forecasts at the time. This ties into the point made in the summary above: I would really like to know what performance is in less idealised circumstances.

- I'm not entirely sure I understood the full algorithm explained from line 187 on. What is confusing to me is that seasons are denoted with r and s, and weeks with $i$, but then later on (e.g. line 193), seasons are also denoted with $i$. I am also not sure I understood the difference between step 1 and step 3.

- Maybe it would be helpful to point out whether the weeks $i$ in the seasons calendar correspond to calendar weeks, or are they relative to something?

- 218 - 19: This sentence implies that the amount of improvement is close between the two, but rather I understand the proportion of improved models is close.

Reviewer #2: Major Issues

The paper gives a nice discussion (Sec 2) of the too-often-overlooked recalibration issue for influenza forecasting.

Relative to the 4-step recipe for the recalibration approach in Section 3, there appear to be implicit assumptions that should hold for the approach to be reasonable. For example, Step 1 defines a set of relevant (essentially exchangeable?) forecast errors for a "current" forecast, and it relies strongly on a seasonal structure. Despite certain claims (as in Sec. 2.6 and "the seasonal nature of epidemic forecasting"), not all highly communicable diseases are intrinsically seasonal, let alone do they have ample seasonal data available for recalibration. Examples: SARS, Ebola, many animal diseases, and to this point in time, COVID-19. The seasonal assumption clearly applies to influenza, but may not generalize easily to other forecasting environments.

Next, it appears implicit that a forecaster's prediction errors are fundamentally static. Stated less charitably relative to your FluSight example, were all forecasters so oblivious to their forecasting performance over the multi-year forecasting time frame that they never learned anything that helped them improve their modeling? Do the results of your analysis have no value to them beyond modifying prediction intervals after the fact? At a minimum, shouldn't this static assumption (i.e., that forecasters can never learn nor modify their behavior) be somehow confirmed before the recalibration approach is routinely applied? As in the previous paragraph, the assumption appears to give good results for influenza, but may(?) not generalize to other dynamic forecasting environments.

A suggestion is that underlying assumptions such as exchangeability and static forecasts be made explicit. Further, if the authors truly feel that their approach works as well for other infectious diseases as it does for influenza, some justification of this claim together with concrete examples would be a welcome addition to the paper. Otherwise, the emphasis of the paper (as well as its title) should be modified to reflect the influenza-centric nature of the approach.

Minor Issues

I was struck by the comment on p.1 in the summary that "epidemic forecasting is a relatively new field". Different points of view are not hard to find, e.g., "failure in epidemic forecasting is an old problem" (Ioannidis, Cripps, and Tanner; International Journal of Forecasting, 2021, Sec. 1). Indeed, the long/undistinguished history of epidemic forecasting is actually good motivation for your work. Or maybe there's some confusion over what your use of the word "new" means?

I was disappointed at the lack of an overall discussion of the optimized weight values { w_i } for each ILI forecaster/target. What do the optimized weights imply about the three components of the ensemble (Sec 2.5) and/or about the individual forecasters? Could some detail be added on this subject?

I liked the comment "recalibration only improved the performance of about three-quarters of the forecasters" on seasonal targets as it provided some insight to the analysis. Unfortunately, the discussion of this issue in the last paragraph of Sec. 4 did little to distinguish which types of forecasters gain by using the approach and which types don't. Was it that one-quarter of FluSight forecasters were well calibrated to begin with, so that no major improvement was possible? Did the aforementioned one-quarter of forecasters modify their predictive model over time, invalidating the implicit assumptions? Or ...? Providing more intuition as to when the approach works well (and when it doesn't) would be a nice addition to the paper.

Most influenza forecasters have a reputation of being overconfident, with U-shaped PIT curves. Does residual overconfidence still exist for your recalibrated forecasts, though to a much lesser degree? Is the approach mainly a bias correction activity?

Lastly, could the paper note that the approach can be easily adapted to incorporate other recalibration methods beyond just those considered in the paper?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Nikos I. Bosse

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. 2022 Dec 15;18(12):e1010771. doi: 10.1371/journal.pcbi.1010771.r002

Author response to Decision Letter 0

27 Jun 2022

Attachment

Submitted filename: RumackResponse.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010771.r003

Decision Letter 1

Cecile Viboud, Thomas Leitner

12 Aug 2022

Dear Mr. Rumack,

Thank you very much for submitting your revised manuscript "Recalibrating probabilistic forecasts of epidemics" for consideration at PLOS Computational Biology. The revised manuscript went back to the initial reviewers. Both reviewers agree that the new version is substantially improved and clearer. Yet, the second reviewer has asked that you cast a more critical eye on the stationarity of PIT and be more specific about the type of non-seasonal diseases raised in the discussion. Based on the reviews, we will accept this manuscript for publication, but we would very much appreciate if you could address/caveat these points in the final version of the manuscript.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Cecile Viboud

Associate Editor

PLOS Computational Biology

Thomas Leitner

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Thank you very much for addressing our comments. I would be very happy to see this work published.

Reviewer #2: In my previous review, I asked you to give specific, concrete examples of seasonal diseases (besides influenza) for which currently available public health data allow for the use of your methodology. I was disappointed that you couldn’t/wouldn’t provide a single such example. The paper would be much improved if its sweeping generalizations about applicability (such as lines 296-297) were accompanied by some detailed supporting evidence.

In any case, the approach has potential benefit for influenza forecasting. That some of these predictive models have been used for as long as they have and still aren’t well calibrated is somewhat embarrassing to the epi community -- especially in light of recent high-profile problems with poor/no uncertainty estimates for COVID-19 forecasts (see the reference cited in my previous review) and the history of other epidemiological probabilistic assessments, such as questionable confidence intervals for mechanistic model parameters (see, e.g., general work by Ioannidis, Tong, and others). Recalibration of probabilistic statements is topical and important.

Also on a positive note, the basis for the approach is more clearly presented in the revision. Adding discussion in Section 3.3 on varying amounts of training data is a nice touch, although the text is misleadingly one-sided. The good news, as is well summarized, is that there is only modest loss of precision with limited training data so long as the modeling assumptions are valid. The bad news, however, needs attention: having limited training data means having limited ability to meaningfully check the validity of those assumptions in the first place, which undermines “robustness” (line 247) and could easily lead to poor results from the approach.

Specifically, the stationarity assumption (lines 148-149) that the PIT distribution is the same for the training and test sets should not be taken lightly, especially for nonseasonal diseases. Consider this assumption relative to the most prominent nonseasonal disease at the moment, COVID-19, where the PIT distribution evolved over time for a variety of reasons. This type of nonstationarity can be circumvented by using only the most current data for recalibration, similar to the rationale in your lines 156-164 (aside: such an approach has recently been detailed -- Picard and Osthus, medRxiv 2022; even then, daily data might be required for true success in practice). Although the COVID-19 paradigm is nearly antithetic to the static-seasonal-disease, static-forecast-model paradigm you focus on, it emphasizes the importance of having adequate training data to check modeling assumptions.

The added paragraph (lines 296-311) on nonseasonal diseases is very weak. Readers will immediately wonder: exactly which nonseasonal diseases are you talking about? What disease-specific data sets exist for forecasting, now or in the foreseeable future? What’s so special about an “entire year” (line 303) for a disease that doesn’t care about the calendar? Because these questions aren’t addressed, readers will find this paragraph to be exasperatingly devoid of any substantive details. There’s a case to be made for your assertions, but you’re not making it. A related point: the paper’s approach is somewhat ahead of its time, awaiting a world where timely, standardized, multi-year, and easily accessible training data are available for all seasonal and nonseasonal diseases of interest. Admittedly, public health and biosurveillance data are moving (very slowly) in this direction, but they’re nowhere near that state now.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Nikos I. Bosse

Reviewer #2: No

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. 2022 Dec 15;18(12):e1010771. doi: 10.1371/journal.pcbi.1010771.r004

Author response to Decision Letter 1

3 Oct 2022

Attachment

Submitted filename: Rumack_Response.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010771.r005

Decision Letter 2

Cecile Viboud, Thomas Leitner

28 Nov 2022

Dear Mr. Rumack,

We are pleased to inform you that your manuscript 'Recalibrating probabilistic forecasts of epidemics' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Cecile Viboud

Academic Editor

PLOS Computational Biology

Thomas Leitner

Section Editor

PLOS Computational Biology

Feilim Mac Gabhann

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: none

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010771.r006

Acceptance letter

Cecile Viboud, Thomas Leitner

12 Dec 2022

PCOMPBIOL-D-21-02239R2

Recalibrating probabilistic forecasts of epidemics

Dear Dr Rumack,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

Associated Data

Supplementary Materials

S1 Appendix. The supporting information contains additional results.

(PDF, 2.6 MB)

Attachment

Submitted filename: RumackResponse.pdf (PDF, 156.3 KB)

Data Availability Statement

The code and data are available at https://github.com/rumackaaron/recalibration.

1. Bröcker J. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society. 2009;135(643):1512–1519. doi: 10.1002/qj.456

2. Dawid AP. Calibration-based empirical probability. Annals of Statistics. 1985;13(4):1251–1274. doi: 10.1214/aos/1176349736

3. Gneiting T, Balabdaoui F, Raftery AE. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society, Series B. 2007;69:243–268. doi: 10.1111/j.1467-9868.2007.00587.x

4. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 2007;102(477):359–378. doi: 10.1198/016214506000001437

5. Hora SC. Probability judgements for continuous quantities. Management Science. 2004;50(5):597–604. doi: 10.1287/mnsc.1040.0205

6. Hamill TM, Colucci SJ. Verification of Eta-RSM short-range ensemble forecasts. Monthly Weather Review. 1997;125:1312–1327. doi: 10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2

7. Raftery AE, Gneiting T, Balabdaoui F, Polakowski M. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review. 2005;133:1155–1174. doi: 10.1175/MWR2906.1

8. Gneiting T, Ranjan R. Combining predictive distributions. Electronic Journal of Statistics. 2013;7:1747–1782. doi: 10.1214/13-EJS823

9. van den Dool H, Becker E, Chen LC, Zhang Q. The probability anomaly correlation and calibration of probabilistic forecasts. Weather and Forecasting. 2017;32:199–206. doi: 10.1175/WAF-D-16-0115.1

10. Brocklehurst S, Chan PY, Littlewood B, Snell J. Recalibrating software reliability models. IEEE Transactions on Software Engineering. 1990;16(4):458–470. doi: 10.1109/32.54297

11. Hamill TM, Whitaker JS, Wei X. Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. Monthly Weather Review. 2004;132:1434–1447. doi: 10.1175/1520-0493(2004)132<1434:ERIMFS>2.0.CO;2

12. Gneiting T, Raftery AE, Westveld AH III, Goldman T. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review. 2005;133:1098–1118. doi: 10.1175/MWR2904.1

13. Wilks DS, Hamill TM. Comparison of ensemble-MOS methods using GFS reforecasts. Monthly Weather Review. 2007;135:2379–2390. doi: 10.1175/MWR3402.1

14. Hamill TM, Whitaker JS, Mullen SL. Reforecasts: An important dataset for improving weather predictions. Bulletin of the American Meteorological Society. 2006;87:33–46. doi: 10.1175/BAMS-87-1-33

15. de Finetti B. Methods for Discriminating Levels of Partial Knowledge Concerning a Test Item. The British Journal of Mathematical and Statistical Psychology. 1965;18:87–123. doi: 10.1111/j.2044-8317.1965.tb00695.x

16. Good IJ. Rational decisions. Journal of the Royal Statistical Society. 1952;14:107–114.

17. Dawid AP. Statistical theory: The prequential approach (with discussion). Journal of the Royal Statistical Society Series A. 1984;147:278–292.

18. Centers for Disease Control and Prevention. FluSight: Flu forecasting. https://www.cdc.gov/flu/weekly/flusight/index.html

19. Reich NG, McGowan CJ, Yamana TK, Tushar A, Ray EL, Osthus D, et al. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S. PLOS Computational Biology. 2019;15(11). doi: 10.1371/journal.pcbi.1007486

20. Centers for Disease Control and Prevention. National, regional, and state level outpatient illness and viral surveillance. https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html

21. Rumack A, Brooks LC, Hyun S, Tibshirani RJ, Rosenfeld R. Recalibration repository. https://github.com/rumackaaron/recalibration

22. Ranjan R, Gneiting T. Combining probability forecasts. Journal of the Royal Statistical Society, Series B. 2010;72(1):71–91. doi: 10.1111/j.1467-9868.2009.00726.x

23. Pan American Health Organization. PAHO/WHO Data—Dengue Cases. https://www3.paho.org/data/index.php/en/mnu-topics/indicadores-dengue-en/dengue-nacional-en/252-dengue-pais-ano-en.html

24. Centers for Disease Control and Prevention. Epidemic Prediction Initiative: Aedes Forecasting. https://predict.cdc.gov/post/5e8e21ebcd1fbb050eacaa1e

25. Picard R, Osthus D. Forecast Intervals for Infectious Disease Models. medRxiv. 2022.
