PLOS ONE. 2022 Mar 29;17(3):e0266096. doi: 10.1371/journal.pone.0266096

Interval forecasts of weekly incident and cumulative COVID-19 mortality in the United States: A comparison of combining methods

Kathryn S Taylor 1,*, James W Taylor 2
Editor: Maurizio Naldi
PMCID: PMC8963571  PMID: 35349605

Abstract

Background

A combined forecast from multiple models is typically more accurate than an individual forecast, but there are few examples of studies of combining in infectious disease forecasting. We investigated the accuracy of different ways of combining interval forecasts of weekly incident and cumulative coronavirus disease-2019 (COVID-19) mortality.

Methods

We considered weekly interval forecasts, for 1- to 4-week prediction horizons, with out-of-sample periods of approximately 18 months ending on 8 January 2022, for multiple locations in the United States, using data from the COVID-19 Forecast Hub. Our comparison involved simple and more complex combining methods, including methods that involve trimming outliers or performance-based weights. Prediction accuracy was evaluated using interval scores, weighted interval scores, skill scores, ranks, and reliability diagrams.

Results

The weighted inverse score and median combining methods performed best for forecasts of incident deaths. Overall, the leading inverse score method was 12% better than the mean benchmark method in forecasting the 95% interval and, considering all interval forecasts, the median was 7% better than the mean. Overall, the median was the most accurate method for forecasts of cumulative deaths. Compared to the mean, the median’s accuracy was 65% better in forecasting the 95% interval, and 43% better considering all interval forecasts. For all combining methods except the median, combining forecasts from only compartmental models produced better forecasts than combining forecasts from all models.

Conclusions

Combining forecasts can improve the contribution of probabilistic forecasting to health policy decision making during epidemics. The relative performance of combining methods depends on the extent of outliers and the type of models in the combination. The median combination has the advantage of being robust to outlying forecasts. Our results support the Hub’s use of the median and we recommend further investigation into the use of weighted methods.

Introduction

The coronavirus disease-2019 (COVID-19) pandemic has overwhelmed health services and caused excess death rates, prompting governments to impose extreme restrictions in attempts to control the spread of the virus [1–3]. These interventions have resulted in multiple economic, health and societal problems [4, 5]. This has generated intense debate among experts about the best way forward [6]. Governments and their advisors have relied upon forecasts from models of the numbers of COVID-19 cases, hospitalisations and deaths to help decide what actions to take [7]. Using models to lead health policy has been controversial, but it is recognised that modelling is potentially valuable when used appropriately [1, 8–10]. Numerous models have been developed to forecast different COVID-19 data, e.g. [11–13].

Models should provide probabilistic forecasts, as point forecasts are inherently uncertain [9, 14]. A 95% interval forecast is a common and useful form of probabilistic forecast [15, 16]. Models may be constructed for prediction or scenario analysis. Prediction models forecast the most likely outcome in the current circumstances. Multiple models may reflect different approaches to answering the same question [11], and conflicting forecasts may arise. Rather than asking which is the best model [17], a forecast combination can be used, such as the mean, which is often used and hard to beat [18, 19]. Forecast combining harnesses the ‘wisdom of the crowd’ [20] by producing a collective forecast from multiple models that is typically more accurate than forecasts from individual models. Combining pragmatically synthesises the information underlying different prediction methods, diversifying the risk inherent in relying on an individual model, and it can offset statistical bias, potentially cancelling out overestimation and underestimation [21]. These advantages are well-established in many applications outside health care [22–25]. This has encouraged more recent applications of combining in infectious disease prediction [14, 26–29], including online platforms that present visualisations of combined probabilistic forecasts of COVID-19 data from the U.S., reported by the Centers for Disease Control and Prevention (CDC), and from Europe, reported by the European Centre for Disease Prevention and Control (ECDC). Other examples of combined probabilistic forecasts are in vaccine trial planning [30] and diagnosing disease [31]. These examples have mainly focused on simple mean and median ‘ensembles’ and, in the case of prediction of COVID-19 data, published studies have primarily involved short periods of data, which rules out the consideration of more sophisticated methods, such as those weighted by historical accuracy.

By comparing the accuracy of different combining methods over longer forecast evaluation periods than other studies, our broad aims were to: (a) investigate whether combining methods involving weights determined by prior forecast accuracy, or different ways of excluding outliers, are more accurate than simple methods of combining, and (b) establish the relative accuracy of the mean and median combining methods. Previously, we reported several new weighted methods in a comparison of combining methods applied to probabilistic predictions of weekly cumulative COVID-19 mortality in U.S. locations over the 40-week period up to 23 January 2021 [32]. We found that weighted methods were the most accurate overall and that the mean generally outperformed the median except in the first ten weeks. In this paper, we test further by comparing the combining methods on a dataset of interval forecasts of cumulative mortality and a dataset of incident mortality, both over periods of more than 80 weeks. We also include individual models in the comparison, and explore the impacts of reporting patterns of death counts and outlying forecasts on forecast accuracy.

Materials and methods

Data sources

Forecasts of weekly incident and cumulative COVID-19 mortality were downloaded from the COVID-19 Forecast Hub (https://covid19forecasthub.org/), an ongoing collaboration with the U.S. CDC that involves forecasts submitted by teams from academia, industry and government-affiliated groups [26]. Teams are invited to submit forecasts for 1- to 4-week horizons, in the form of a point forecast and estimates of quantiles corresponding to the following 23 probability points along the probability distribution: 1%, 2.5%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97.5% and 99%. From these, we produced interval forecasts, including the 95% interval, which is bounded by the 2.5% and 97.5% quantiles. The numbers of actual cumulative COVID-19 deaths each week were also provided by the Hub. Their reference data source is the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
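
To make the construction of interval forecasts from the submitted quantiles concrete, the following minimal Python sketch (our own illustration, not part of the study's code) pairs quantile forecasts into central intervals; the quantile values shown are hypothetical.

    # Pair submitted quantile forecasts into central interval forecasts.
    # 'quantiles' maps probability level -> forecast value (hypothetical numbers).
    quantiles = {0.025: 5200.0, 0.25: 5800.0, 0.5: 6100.0, 0.75: 6500.0, 0.975: 7400.0}

    def central_interval(quantiles, level):
        """Return (lower, upper) bounds of the central `level` interval, e.g. level=0.95."""
        alpha = 1.0 - level
        return quantiles[round(alpha / 2, 3)], quantiles[round(1 - alpha / 2, 3)]

    lower, upper = central_interval(quantiles, 0.95)   # uses the 2.5% and 97.5% quantiles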

Dataset

The Hub carries out screening tests for inclusion in their ‘ensemble’ forecast combinations. Screening excludes forecasts with an incomplete set of quantiles or prediction horizons, and improbable forecasts. The definition of improbable forecasts relates to cumulative deaths, and currently includes decreasing quantiles over the forecast horizons, decreasing cumulative deaths over time (except for an adjustment due to reporting revisions, which is permitted up to a maximum of 10% in the 1-week-ahead forecasts), and forecasts of cumulative deaths for a particular location exceeding the size of its population. Before the week of 28 July 2020, the Hub also excluded outlying forecasts, which were identified by a visual check against the actual number of deaths. We only included forecasts that passed the Hub’s screening tests.

Our dataset included forecasts projected from forecast origins at midnight on Saturdays between 9 May 2020 and 8 January 2022 for forecasts of cumulative COVID-19 deaths (88 weeks of data), and between 6 June 2020 and 8 January 2022 for forecasts of incident deaths (84 weeks of data). Forecasts of incident deaths were not screened by the Hub in the weeks ending 9 May 2020 to 30 May 2020. We included forecasts of cumulative deaths in this period because we wished to use all the available data, and because the sets of incident and cumulative forecasts differed in terms of the included models and locations. For the actual weekly COVID-19 mortality, for each location and week, we used the values made available on 15 January 2022. We studied forecasts of COVID-19 mortality for the U.S. as a whole and 51 U.S. jurisdictions, comprising the 50 states and the District of Columbia. For simplicity, we refer to these as 51 states.

Our analysis included forecasts from 60 forecasting models and the Hub’s ensemble model. In the early weeks of our dataset, the majority were susceptible-exposed-infected-removed (SEIR) compartmental models, but as the weeks passed, other model types became more common (Fig 1). These involved methods such as neural networks, agent-based modelling, time series analysis, and the use of curve fitting techniques. S1 Table provides a list of all the models.

Fig 1. Number and types of models at each forecast origin for each mortality dataset.


Fig 2 shows the extent of missing data for forecasts of incident deaths. The timeline of forecasts from each model (represented by a row) illustrates the extent of missing data across the 52 locations, including the frequent ‘entry and exit’ of forecasting teams. The corresponding figure for forecasts of cumulative deaths is given in S1 Fig. A higher proportion of forecasts was excluded for cumulative deaths than for incident deaths, mainly because of the screening tests rather than because forecasts were not assessed by the Hub. The extent of missing data was such that imputation was impractical.

Fig 2. Data availability for forecasts of incident COVID-19 deaths.


Several combining methods required parameter estimation, which we performed for each location and forecast origin. We defined the in-sample estimation period as being initially the first 10 weeks, and then expanding week by week. This resulted in out-of-sample forecasts produced from 78 weekly forecast origins for the cumulative deaths series, and 74 weekly forecast origins for incident deaths.

Evaluating the interval forecasts

We evaluated out-of-sample prediction accuracy and calibration, with reference to the reported death counts on 15 January 2022, thus producing a retrospective evaluation. Calibration was assessed by the percentage of actual deaths that fell below each bound of the interval forecasts. As each bound is a quantile, this amounted to assessing the calibration of the 23 quantiles for which the teams submitted forecasts. We present this using reliability diagrams. To evaluate prediction accuracy of an interval forecast, we used the interval score (IS) given by the following expression [33, 34]:

$$IS_{\alpha} = (u_t - l_t) + \frac{2}{\alpha} I\{y_t < l_t\}(l_t - y_t) + \frac{2}{\alpha} I\{y_t > u_t\}(y_t - u_t)$$

where $l_t$ is the interval’s lower bound, $u_t$ is its upper bound, $y_t$ is the observation in period t, I is the indicator function (1 if the condition is true and 0 otherwise), and α is the ideal probability of the observation falling outside the interval. We report the IS for the 95% interval forecasts (for which α = 5%). Lower values of the IS reflect greater interval forecast accuracy. The unit of the IS is deaths. As each forecasting team provides forecasts for 23 different quantiles, the following K = 11 symmetric interval forecasts can be considered: 98%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20% and 10%. To summarise prediction accuracy for all these intervals, we used the weighted IS (WIS) [16]:

$$WIS = \frac{1}{K + 1/2}\left(w_0 \times 2 \times |y_t - m| + \sum_{k=1}^{K} w_k \times IS_{\alpha_k}\right)$$

where $w_0 = \frac{1}{2}$, $w_k = \frac{\alpha_k}{2}$, and m is the forecast of the median. The IS and the WIS are useful for comparing methods, and although their units are deaths, the scores themselves are not readily interpretable. Averaging each of these two scores across weeks provided the mean IS (MIS) and the mean WIS (MWIS).
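
To make these definitions concrete, here is a minimal Python sketch (our own illustration; the study’s combining was implemented in GAUSS, as noted below) of the interval score and the WIS computed from a set of 23 quantile forecasts. The quantile values and observation in the usage example are hypothetical.

    def interval_score(lower, upper, y, alpha):
        """Interval score for a single (1 - alpha) central interval and observation y."""
        return ((upper - lower)
                + (2.0 / alpha) * max(lower - y, 0.0)
                + (2.0 / alpha) * max(y - upper, 0.0))

    def weighted_interval_score(quantiles, y):
        """WIS from a dict mapping probability level -> quantile forecast,
        assuming the 23 levels used by the Hub, so K = 11 central intervals."""
        alphas = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
        K = len(alphas)
        m = quantiles[0.5]                      # median forecast
        total = 0.5 * 2.0 * abs(y - m)          # w0 = 1/2 term
        for a in alphas:
            lower = quantiles[round(a / 2, 3)]
            upper = quantiles[round(1 - a / 2, 3)]
            total += (a / 2.0) * interval_score(lower, upper, y, a)   # w_k = alpha_k / 2
        return total / (K + 0.5)

    # Hypothetical quantile forecasts (level -> value) and an observed weekly count.
    q = {0.01: 40.0, 0.025: 45.0, 0.05: 50.0, 0.10: 55.0, 0.15: 58.0, 0.20: 60.0,
         0.25: 62.0, 0.30: 64.0, 0.35: 66.0, 0.40: 68.0, 0.45: 70.0, 0.50: 72.0,
         0.55: 74.0, 0.60: 76.0, 0.65: 78.0, 0.70: 80.0, 0.75: 82.0, 0.80: 85.0,
         0.85: 88.0, 0.90: 92.0, 0.95: 98.0, 0.975: 105.0, 0.99: 115.0}
    score = weighted_interval_score(q, y=90.0)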

We also averaged the scores across forecast horizons. We did this for conciseness, and because we had a relatively short analysis period, which is a particular problem when evaluating forecasts of extreme quantiles. To show the consistency across horizons, we present a set of results by horizon for interval forecasts for both the incident and cumulative deaths data. For this analysis, because we were looking at individual horizons, we were able to use the Diebold-Mariano statistical test [35], adapted to test across multiple series. This test statistic was originally designed to apply to the difference between the mean of an accuracy measure for two methods for a single time series. To compare the difference averaged across multiple time series, we calculated the variance of the sampling distribution by first summing each variance of the sampling distribution from the Diebold-Mariano test applied to each series, and then dividing by the square of the number of series. To summarise results averaged across the four horizons, we were unable to use the adapted Diebold-Mariano test, so we applied the statistical test proposed by Koning et al. [36]. This test compares the rank of each method, averaged across multiple series, with the corresponding average rank of the most accurate method. Statistical testing was based on a 5% significance level.
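
The adapted test statistic described above can be sketched as follows (our own illustration in Python). The per-series loss differentials are the weekly differences in scores between two methods, and the lag-window variance estimator for each series is our assumption of a standard Diebold-Mariano implementation; the paper does not specify the exact estimator.

    def dm_variance(d, horizon):
        """Sampling variance of the mean loss differential for one series,
        using autocovariances up to lag horizon - 1 (standard Diebold-Mariano)."""
        n = len(d)
        mean_d = sum(d) / n
        def autocov(lag):
            return sum((d[t] - mean_d) * (d[t - lag] - mean_d) for t in range(lag, n)) / n
        v = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
        return v / n

    def multi_series_dm_statistic(diffs_by_series, horizon):
        """Statistic for the loss differential averaged across multiple series:
        variance = sum of per-series DM variances divided by S squared.
        (As in any DM-type test, the variance estimate can occasionally be negative.)"""
        S = len(diffs_by_series)
        overall_mean = sum(sum(d) / len(d) for d in diffs_by_series) / S
        overall_var = sum(dm_variance(d, horizon) for d in diffs_by_series) / S**2
        return overall_mean / overall_var**0.5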

We present results of the forecast accuracy evaluation in terms of the 95% interval MIS, MWIS, ranks and skill scores, which are calculated as the percentage by which a given method is superior to the mean combination. The mean is a common choice of benchmark in combining studies. We report results for the series of total U.S. deaths, as well as results averaged across all 52 locations. In addition, to avoid scores for some locations dominating, we also present results averaged for three categories, each including 17 states: high, medium and low mortality states. This categorisation was based on the number of cumulative COVID-19 deaths on 15 January 2022. All results are for the out-of-sample period, and to provide some insight into the potential change in ranking of methods over time, we present MWIS results separately for the first and second halves of the out-of-sample period.
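
For clarity, the skill score used in the results tables can be written as follows (our paraphrase of the definition above; in the tables these scores are computed per location and then averaged):

$$\text{Skill score} = 100 \times \frac{\text{Score}_{\text{mean combination}} - \text{Score}_{\text{method}}}{\text{Score}_{\text{mean combination}}}$$

where the score is the MIS or MWIS, so a positive skill score indicates a method that is more accurate than the mean combination benchmark.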

We evaluated the effects of changes in reporting patterns on forecast accuracy. Changes in reporting patterns may involve reporting delays of death counts and changes in the definitions of COVID-19 deaths, both of which may lead to backdating of death counts and steep increases or decreases. Backdating of death counts would produce a problematic assessment in our retrospective evaluation of forecast accuracy, and sudden changes in death counts might cause some forecasting models to misestimate, particularly time series models. To obtain some insight, we compared reports of cumulative death counts for each location in files that were downloaded at multiple time points between 20 June 2020 and 15 January 2022. Locations for which there were notable effects of reporting patterns were excluded in sensitivity analysis. We also examined the effect of outlying forecasts on forecast accuracy by comparing the performance of the mean and median, and visually comparing plots of the MWIS of the mean and median forecasts by location.
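
The comparison of reported cumulative death counts across downloaded files can be illustrated with a short sketch, assuming two vintages of the counts are available as mappings from location to a weekly series (the data structure and threshold are hypothetical, chosen only for illustration).

    def flag_reporting_changes(old_counts, new_counts, threshold=0.05):
        """Flag locations whose previously reported cumulative counts were revised
        by more than `threshold` (as a fraction) in a later data vintage.
        old_counts / new_counts: dict location -> list of cumulative deaths by week,
        where new_counts covers at least the weeks in old_counts."""
        flagged = []
        for location, old_series in old_counts.items():
            new_series = new_counts[location][:len(old_series)]
            for old, new in zip(old_series, new_series):
                if old > 0 and abs(new - old) / old > threshold:
                    flagged.append(location)
                    break
        return flagged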

Data preparation and descriptive analysis were carried out using Stata version 16, and the forecasts were combined using version 19 of the GAUSS programming language.

Forecast combining methods

All the interval combining methods are applied to each interval bound separately, and for each mortality series, forecast origin and prediction horizon. The comparison included several interval combining methods that do not rely on the availability of records of past accuracy for individual models. These methods include the well-established mean and median combinations [3739], and the more novel symmetric, exterior, interior and envelope trimming methods, which exclude a particular percentage of forecasts, and then average the remaining forecasts of each bound [40]. Fig 3 provides a visual representation of these methods, which we describe in more detail below.

Fig 3. Illustration of interval forecast combining methods that do not rely on past historical accuracy.


Each pair of shapes represents an interval forecast produced by an individual model.

We also implemented two inverse score methods that do rely on the availability of a record of historical accuracy for each individual model. For any combining method that involved a parameter, such as a trimming parameter, we optimised its value for each location by minimising the MIS calculated over the in-sample period. The following is a list of the combining methods that we included in our study:

  1. Mean combination. We calculated the average of the forecasts of each bound. This combining method is also known as the simple average.

  2. Median combination. We calculated the median of the forecasts of each bound. This method is robust to outliers.

  3. Ensemble. This is the COVID-19 Hub ensemble forecast, which was originally the mean combination of the eligible forecasts until the week commencing 28 July 2020, when the ensemble forecast became the median combination and then, in the week commencing 27 September 2021, the Hub switched to using a weighted ensemble method. The use of eligibility screening implies that the ensemble is constructed with the benefit of a degree of trimming which initially involved some subjectivity and was then formalised more objectively. The results for the median and the Hub ensemble will be similar as the latter method was the median combination for around 90% of our out-of-sample period.

  4. Symmetric trimming. This method deals with outliers. For each bound, it involves trimming the N lowest-valued and N highest-valued forecasts, where N is the largest integer less than or equal to the product of β/2 and the total number of forecasts, where β is a trimming parameter. The median combination is an extreme form of symmetric trimming.

  5. Exterior trimming. This method targets overly wide intervals. It involves removing the N lowest-valued lower bound forecasts and the N highest-valued upper bound forecasts, where N is the largest integer less than or equal to the product of the trimming parameter β and the number of forecasts. When this resulted in a lower bound being above the upper bound, we replaced the two bounds by their average.

  6. Interior trimming. This method targets overly narrow intervals. It involves removing the N highest-valued lower bound forecasts and the N lowest-valued upper bound forecasts, where N is defined as for exterior trimming.

  7. Envelope method. The interval is constructed using the lowest-valued lower bound forecast and highest-valued upper bound forecast. This method is an extreme form of interior trimming.

  8. Inverse interval score method. In this method, the model forecasts are weighted by historical accuracy, with the weight for each model inversely proportional to that model’s historical MIS [32], calculated over the in-sample period (a brief sketch of the weight calculation is given after this list). With the shortest in-sample period being 10 weeks, we considered only forecasting teams for which we had forecasts for at least five past forecast origins; larger thresholds led to the elimination of many forecasters in the early weeks of our out-of-sample period. The following expression gives the weight on forecasting model i at forecast origin t:
    $$w_{it} = \frac{1/MIS_{i,t}}{\sum_{j=1}^{J} 1/MIS_{j,t}}$$
    where $MIS_{i,t}$ is the historical MIS computed at forecast origin t from model i, and J is the number of forecasting models included in the combination.
  9. Inverse interval score with tuning. This method has weights inversely proportional to the MIS and a tuning parameter, λ > 0, to control the influence of the score on the combining weights [32]. The following expression gives the weight on forecasting model i at forecast origin t:
    $$w_{it} = \frac{(1/MIS_{i,t})^{\lambda}}{\sum_{j=1}^{J} (1/MIS_{j,t})^{\lambda}}$$

    If λ is close to zero, the combination reduces to the mean combination, whereas a large value of λ leads to the selection of the model with the best historical accuracy. The parameter λ was optimised using the same expanding in-sample periods as for the trimming combining methods. Due to the extent of missing forecasts, we pragmatically computed $MIS_{i,t}$ using all available past forecasts, rather than limiting the computation to periods for which forecasts from all models were available. For models for which forecasts were not available for at least 5 past periods, we set $MIS_{i,t}$ equal to the mean of the $MIS_{j,t}$ values of all other models. An alternative approach, which we employed in our earlier study [32], is to omit from the combination any model for which there is only a very short or non-existent history of accuracy. The disadvantage of this is that it omits potentially useful forecast information, as our empirical results showed.
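
The following minimal Python sketch (our own illustration, not the authors’ GAUSS implementation) shows how several of the methods above can be applied to the forecasts of a single interval bound; the lower-bound forecasts and in-sample MIS values are hypothetical.

    import math

    def mean_combination(bounds):
        return sum(bounds) / len(bounds)

    def median_combination(bounds):
        s = sorted(bounds)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

    def symmetric_trim(bounds, beta):
        """Drop the N lowest and N highest forecasts of a bound, then average,
        with N = floor(beta / 2 * number of forecasts) (method 4 above)."""
        n_trim = math.floor(beta / 2 * len(bounds))
        s = sorted(bounds)
        kept = s[n_trim:len(s) - n_trim] or s   # fall back if everything would be trimmed
        return sum(kept) / len(kept)

    def inverse_score_weights(historical_mis, lam=1.0):
        """Weights proportional to (1 / historical MIS) ** lambda (methods 8 and 9);
        lam = 1 gives the plain inverse score method."""
        inv = [(1.0 / s) ** lam for s in historical_mis]
        total = sum(inv)
        return [w / total for w in inv]

    def weighted_combination(bounds, weights):
        return sum(w * b for w, b in zip(weights, bounds))

    # Hypothetical lower-bound forecasts from five models and their in-sample MIS values.
    lower_bounds = [410.0, 455.0, 470.0, 500.0, 900.0]
    historical_mis = [140.0, 90.0, 110.0, 150.0, 400.0]
    weights = inverse_score_weights(historical_mis, lam=1.0)
    combined_lower = weighted_combination(lower_bounds, weights)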

Comparison with individual models

A comparison of the results of the combining methods with those of individual models is complicated by the fact that no individual model provided forecasts that passed the Hub’s screening for all past periods and locations. We addressed this in two ways. Firstly, we included a ‘previous best’ method, which, at each forecast origin and location, selected the interval forecast of the individual model with the lowest in-sample MIS. The aim of this is, essentially, to obtain the interval forecasts of the best of the individual models. Secondly, in our results, we also summarise a comparison of the mean and median combinations with individual models for which we had forecasts for at least half the locations and at least half the out-of-sample period. Our inclusion criterion here is rather arbitrary, but the resulting analysis does help us compare the combining methods with the models of the more active individual teams. In this comparison, we excluded the COVID Hub baseline model, as it is only designed to be a comparator for the models submitted to the Hub and not a true forecast.
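
A minimal sketch of the ‘previous best’ selection at a single forecast origin and location, assuming hypothetical model names, interval forecasts and in-sample MIS values:

    def previous_best(interval_forecasts, in_sample_mis):
        """Return the interval forecast of the individual model with the lowest
        in-sample MIS (the 'previous best' method described above)."""
        best_model = min(in_sample_mis, key=in_sample_mis.get)
        return interval_forecasts[best_model]

    # Hypothetical example: interval forecasts (lower, upper) and in-sample MIS per model.
    forecasts = {"model_A": (450.0, 820.0), "model_B": (500.0, 760.0)}
    mis = {"model_A": 132.5, "model_B": 118.0}
    chosen = previous_best(forecasts, mis)   # selects model_B's interval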

Results

Forecasting incident deaths

Main results for incident deaths

Table 1 presents the MIS for 95% interval forecasts and the MWIS for the 74-week out-of-sample period for incident mortality. Table 2 presents the corresponding mean skill scores, and Table 3 provides the mean ranks and the results of the statistical test proposed by Koning et al. [36]. The weighted inverse score methods, the ensemble and the median combination were the best performing methods. Overall (for all 52 locations), Table 2 shows that the performance of the inverse score method was almost 12% better than the mean in forecasting the 95% interval and, considering all interval forecasts, the ensemble and median were around 7% better than the mean. Of the trimming methods, symmetric trimming performed best overall and was quite competitive with the leading methods. The ‘previous best’ method was not competitive against most of the combining methods. The worst results were produced by the envelope method. Tables 1 to 3 report results averaged across the four forecast horizons (1 to 4 weeks ahead). We found similar relative performances of the methods when looking at each forecast horizon separately (S2 Table).

Table 1. For incident mortality, 95% interval MIS and MWIS.
95% interval MIS MWIS
Method All U.S. High Med Low All U.S. High Med Low
Mean 779.6 9250.7 1249.1 472.4 119.0 55.5 897.3 80.1 28.5 8.5
Median 723.0 9623.7 1050.3 488.4 106.7 51.6 914.4 68.1 27.6 a 8.1 a
Ensemble 727.2 10303.7 1031.5 481.3 105.3 51.5 924.8 67.5 a 27.6 a 8.1 a
Sym trim 764.0 9464.8 1187.3 481.9 111.0 55.1 912.7 78.8 27.9 8.2
Exterior trim 824.1 10435.8 1301.6 490.2 115.0 67.2 924.7 114.0 28.8 8.4
Interior trim 767.0 9292.5 1228.6 456.3 114.7 57.6 907.9 86.2 28.2 8.5
Envelope 3838.1 55853.1 6752.0 1046.7 655.8 234.0 3289.7 408.8 84.8 28.7
Inv score 690.0 8964.2 1030.0 451.8 a 101.4 a 53.3 843.4 77.1 28.0 8.2
Inv score tuning 656.7 a 8631.1 a 923.7 a 470.2 107.1 50.0 a 833.2 a 66.8 28.9 8.4
Previous best 872.9 11428.8 1231.0 621.8 145.0 62.0 1061.2 81.2 35.6 10.5

Lower values are better.

a best method in each column.

Table 2. For incident mortality, skill scores for 95% interval MIS and MWIS.
95% interval MIS MWIS
Method All U.S. High Med Low All U.S. High Med Low
Mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Median 8.3 -4.0 14.5 -2.1 12.2 6.6 -1.9 10.5 3.2 6.4
Ensemble 9.6 -11.4 16.4 -0.5 13.1 7.0 a -3.1 11.4 a 3.5 a 6.5 a
Sym trim 7.0 -2.3 12.1 0.7 8.4 3.8 -1.7 3.9 2.4 5.6
Exterior trim -3.2 -12.8 -4.1 -4.8 -0.2 -1.9 -3.1 -6.5 -1.3 1.9
Interior trim 4.7 -0.5 4.5 4.6 5.2 0.0 -1.2 -1.4 1.2 0.2
Envelope -222.6 -503.8 -239.4 -161.2 -265.0 -252.4 -266.6 -302.2 -212.3 -247.7
Inv score 11.7 a 3.1 16.8 6.0 a 12.5 a 3.8 6.0 5.6 2.6 3.2
Inv score tuning 8.9 6.7 a 19.2 a 0.0 6.5 3.0 7.1 a 9.7 -1.4 0.0
Previous best -20.1 -23.5 -8.4 -34.3 -18.9 -19.7 -18.3 -8.4 -24.5 -27.1

Shows percentage change from the mean. Higher values are better.

a best method in each column.

Table 3. For incident mortality, average ranks of the 95% interval MIS and MWIS.
95% interval MIS MWIS
Method All U.S. High Med Low All U.S. High Med Low
Mean 4.7 b 3.0 5.0 4.4 4.7 5.1 b 3.0 5.4 4.8 5.2
Median 5.1 b 6.0 5.1 5.7 4.5 3.7 6.0 4.4 3.4 3.2
Ensemble 4.5 b 7.0 4.4 5.0 3.9 3.2 8.0 2.9 3.4 3.1 a
Sym trim 4.7 b 5.0 4.8 4.6 4.8 4.4 5.0 4.7 4.4 4.1
Exterior trim 6.8 b 8.0 7.2 b 6.5 b 6.7 b 6.5 b 7.0 7.1 b 6.6 b 5.8
Interior trim 4.1 4.0 4.9 3.4 3.9 5.4 b 4.0 6.1 b 4.5 5.7
Envelope 10.0 b 10.0 10.0 b 9.9 b 10.0 b 10.0 b 10.0 10.0 b 10.0 10.0
Inv score 2.6 a 2.0 2.5 a 2.5 a 2.9 a 2.9 a 2.0 2.5 a 3.1 a 3.3
Inv score tuning 4.5 b 1.0 a 3.7 4.7 5.4 5.2 1.0 a 4.2 6.1 5.6
Previous best 8.0 b 9.0 7.5 b 8.3 b 8.1 b 8.5 b 9.0 7.9 b 8.7 b 8.9 b

Lower values are better.

a best method in each column

b significantly worse than the best method, at the 5% significance level.

Changes over time in performance for incident deaths

In Table 4, the MWIS skill scores are shown separately for the first and second halves of the 74-week out-of-sample period. Recalling that the skill scores assess performance relative to the mean combining method, the table shows that the mean was notably more competitive in the second half of the out-of-sample period than in the first half. Comparing the other methods, we see that the methods that performed particularly well in the first half were also the best methods in the second half. An exception was the inverse score with tuning method, which performed worse in the second half; this is perhaps surprising, as one might expect the tuning parameter to be better estimated for the second half, when more data were available for estimation. The inverse score method without tuning appears to be more robust for this dataset. The consistently good performance of the median emphasises the importance of robustness.

Table 4. For incident mortality, skill scores for MWIS calculated separately for the first and second halves of the 74-week out-of-sample period.
1st Half 2nd Half
Method All U.S. High Med Low All U.S. High Med Low
Mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Median 8.3 2.6 14.9 5.6 a 4.5 2.2 -9.0 -1.3 1.0 7.4 a
Ensemble 8.6 a 3.3 15.0 5.6 a 5.1 a 2.9 a -13.0 1.3 1.2 6.8
Sym trim 4.1 1.3 5.4 4.1 3.1 2.8 -6.6 0.3 1.1 7.3
Exterior trim -4.2 -4.2 -8.4 -2.7 -1.8 2.4 -1.3 1.2 0.9 5.2
Interior trim 0.9 -0.5 -1.2 2.5 1.4 -1.1 -2.2 -0.7 -0.9 -1.7
Envelope -245.3 -238.4 -338.1 -212.5 -201.0 -233.1 -311.9 -200.2 -215.2 -285.9
Inv score 4.1 7.8 6.7 3.7 1.7 2.7 3.1 a 2.4 a 1.5 a 4.1
Inv score tuning 5.7 20.2 a 15.5 a 0.2 -0.5 -2.9 -13.2 -5.3 -2.9 -0.1
Previous best -13.4 -3.1 1.9 -21.5 -23.2 -30.1 -41.8 -36.0 -28.5 -25.4

Shows percentage change from the mean. Higher values are better.

a best method in each column.

Performance by model type for incident deaths

To evaluate performance by model type, for each category of mortality series (all, U.S., high, medium and low mortality), Table 5 tabulates MWIS skill scores for the combining methods applied separately to each of the following three sets of individual models: all models, compartmental models only, and non-compartmental models only. For each category of mortality series, to enable a comparison of the combining methods applied to the different sets of individual models, we computed the skill scores using the same benchmark, which we set as the mean combination of all models. Note that we have omitted the ensemble from Table 5 because the forecasts from this method were determined by the Hub, so we did not control which individual models it combines. The first point to note from Table 5 is that combining only non-compartmental models led to poorer results for almost all combining methods and categories of mortality series. A second point is that, for the all, high, medium and low categories of series, combining only compartmental models was preferable to combining all models, unless the combining method was the median. For the median, combining all available models was preferable. It is interesting to note that the two inverse score methods, when applied only to the compartmental models, become competitive with the median.

Table 5. For incident mortality, skill scores for MWIS for combining methods applied to forecasts of all models, compartmental models only, and non-compartmental models only.
All U.S. High Med Low
Method All Comp Non-comp All Comp Non-comp All Comp Non-comp All Comp Non-comp All Comp Non-comp
Mean 0.0 5.1 -8.0 0.0 -15.7 -3.2 0.0 10.4 -12.2 0.0 1.6 -6.3 0.0 4.1 -6.1
Median 6.6 4.2 3.7 -1.9 -13.7 -2.3 10.5 7.7 7.0 3.2 1.6 0.2 6.4 4.1 4.1
Sym trim 3.8 4.7 1.6 -1.7 -16.0 -2.2 3.9 8.9 1.9 2.4 2.2 -0.8 5.6 3.9 3.7
Exterior trim -1.9 3.5 -6.4 -3.1 -17.7 -3.4 -6.5 8.7 -11.0 -1.3 0.3 -6.6 1.9 2.4 -1.8
Interior trim 0.0 5.6 -6.8 -1.2 -15.2 -4.1 -1.4 10.4 -11.0 1.2 2.6 -4.3 0.2 4.7 -5.5
Envelope -252.4 -89.6 -240.0 -266.6 -144.8 -234.6 -302.2 -83.9 -288.4 -212.3 -101.3 -193.8 -247.7 -81.5 -244.7
Inv score 3.8 7.2 a -2.8 6.0 -3.7 2.8 5.6 12.6 a -4.1 2.6 3.9 a -3.3 3.2 5.4 -1.2
Inv score tun 3.0 5.1 -2.7 7.1 a 7.1 a 2.8 9.7 11.5 0.5 -1.4 1.1 -4.9 0.0 2.3 -4.3
Previous best -19.7 -13.0 -21.4 -18.3 -9.5 -7.1 -8.4 -9.1 -7.8 -24.5 -20.8 -29.1 -27.1 -9.7 -29.6

Shows percentage change from the mean. Higher values are better.

a best method in each of the five mortality categories.

Performance of individual models for incident deaths

Table 6 reports the performance of the 27 individual models for which we had forecasts of incident deaths for at least half the out-of-sample period and at least half of the 52 locations. The table summarises skill scores based on scores calculated for the individual model and the benchmark method using only those weeks for which forecasts were available for the individual model. Table 6 reports results for the skill score calculated using mean combining as the benchmark, as in our previous tables, and also for the skill score calculated using median combining as the benchmark method. The skill scores of these individual models were highly variable, and generally negative, implying that they were not competitive against the mean or median in any category. The only notable exception was an individual model that was almost 17% better than the mean for the 95% interval forecasts in the high mortality locations.

Table 6. For incident mortality, summary statistics of skill scores for individual models.
95% interval MIS MWIS
Method All U.S. High Med Low All U.S. High Med Low
Mean combining as skill score benchmark
    Count 27 26 27 27 27 27 26 27 27 27
    Mean -117.5 -251.5 -107.7 -121.2 -116.2 -96.1 -160.8 -88.3 -97.8 -105.1
    Median -78.1 -176.9 -56.3 -77.2 -84.1 -89.5 -122.7 -76.3 -90.0 -76.9
    Minimum -486.6 -937.6 -439.3 -426.8 -542.1 -263.4 -651.2 -353.5 -253.1 -506.2
    Maximum -0.8 -2.2 16.8 -12.3 0.6 -38.9 -13.6 -31.9 -43.7 -36.3
Number > 0 0 0 2 0 1 0 0 0 0 0
Median combining as skill score benchmark
    Count 27 26 27 27 27 27 26 27 27 27
    Mean -140.6 -240.6 -149.2 -120.6 -146.0 -112.1 -157.9 -114.3 -106.5 -119.3
    Median -94.2 -162.3 -102.0 -75.4 -104.0 -104.2 -123.2 -99.5 -100.9 -89.0
    Minimum -618.2 -944.4 -597.2 -452.7 -700.5 -291.5 -651.5 -420.9 -272.5 -538.5
    Maximum -11.6 3.0 0.3 -12.2 -13.5 -49.0 -11.8 -50.4 -48.5 -39.7
    Number > 0 0 1 1 0 0 0 0 0 0 0

Higher values of the skill score are better.

Calibration results for incident deaths

As we stated earlier, with each bound of the interval forecasts being a quantile, we assess calibration for each of the 23 quantiles for which the teams submitted forecasts. We do this in Fig 4, which presents reliability diagrams for each category of mortality series for the mean, median and inverse score with tuning combining methods. Reasonable calibration can be seen in the plot relating to all 52 locations, and there is good calibration at the extreme quantiles in each plot, except the one for low mortality locations. Most methods had calibration that was too low for the U.S. and high mortality locations, and most methods displayed calibration that was too high for the low mortality locations, particularly for the lower quantiles. For the medium mortality locations, the mean and inverse score with tuning performed better than the median, for which calibration was slightly too low. S3–S7 Tables provide the calibration of all methods for the five categories of locations: all 52 locations, U.S., high mortality, medium mortality and low mortality, respectively.
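
As an illustration of the coverage calculation behind these reliability diagrams, the following minimal Python sketch (our own, with a hypothetical data structure) computes, for each quantile level, the proportion of actual death counts falling below the corresponding quantile forecasts; a well calibrated method gives proportions close to the nominal levels.

    def empirical_coverage(quantile_forecasts, actuals):
        """quantile_forecasts: list of dicts, one per forecast, mapping level -> value.
        actuals: observed death counts aligned with the forecasts.
        Returns the proportion of actuals falling below each quantile forecast."""
        levels = sorted(quantile_forecasts[0].keys())
        n = len(actuals)
        return {level: sum(1 for q, y in zip(quantile_forecasts, actuals) if y < q[level]) / n
                for level in levels}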

Fig 4. For incident mortality, reliability diagrams showing calibration of the 23 quantiles for the mean, median and inverse score with tuning methods.


The 23 quantiles include all bounds on the interval forecasts and the median.

Forecasting cumulative deaths

In this section, for the cumulative deaths data, we report results tables and a figure analogous to those presented for the incident deaths data.

Main results for cumulative deaths

Table 7 presents the MIS for 95% interval forecasts and MWIS for the 78-week out-of-sample period for cumulative mortality. The corresponding skill scores are in Table 8. The median and ensemble approaches were the best performing methods in terms of both metrics. Overall, their performance for the 95% interval was about 65% better than the mean, and considering all interval forecasts, the ensemble and median were 43% better than the mean. The very poor performance of the mean suggests the presence of outlying forecasts. These would also undermine the weighted inverse score methods, as they involve weighted averages. The inverse score with tuning method was the best method for the U.S. series. Interior trimming performed better than the inverse scoring methods for the 95% interval, which suggests that there were large numbers of 95% intervals that were too narrow. Both metrics showed symmetric trimming performing almost as well as the median. Table 9 reports mean ranks. Using the statistical test proposed by Koning et al. [36], we identified that, in terms of the mean rank, most methods were statistically significantly worse than the median-based approaches. We found similar relative performances of the methods when looking at each forecast horizon separately (S8 Table).

Table 7. For cumulative mortality, 95% interval MIS and MWIS for the 78-week out-of-sample period.
95% interval MIS MWIS
Method All U.S. High Med Low All U.S. High Med Low
Mean 5540 87322 8822 2173 814 234 4156 346 92 33
Median 2784 a 30188 5514 921 a 306 a 143 a 2332 228 54 a 18 a
Ensemble 2821 33147 5445 a 930 306 a 143 a 2355 226 a 54 a 18 a
Sym trim 3044 34038 5882 1066 360 151 2415 242 57 19
Exterior trim 5388 83865 8632 2127 788 207 3442 319 82 29
Interior trim 3203 30333 6293 1221 501 226 2933 419 73 27
Envelope 7915 154350 10814 2931 1385 1027 21143 1373 402 122
Inv score 3429 29134 6560 1598 617 170 2496 277 70 27
Inv score tuning 3626 24624 a 6818 2237 589 161 2270 a 259 73 25
Previous best 4786 33090 8593 3632 467 180 2580 281 94 24

Lower values are better.

a best method in each column.

Table 8. For cumulative mortality, skill scores for 95% interval MIS and MWIS for the 78-week out-of-sample period.
95% interval MIS MWIS
Method All U.S. High Med Low All U.S. High Med Low
Mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Median 65.2 a 65.4 60.7 a 63.6 a 70.5 a 43.2 a 43.9 39.7 41.7 a 47.9 a
Ensemble 64.9 62.0 60.5 63.4 70.2 43.1 43.3 40.0 a 41.6 47.4
Sym trim 60.4 61.0 55.1 59.0 66.1 39.9 41.9 36.6 38.0 44.6
Exterior trim 4.7 4.0 4.3 2.8 7.0 11.2 17.2 10.6 10.1 12.5
Interior trim 46.3 65.3 45.6 47.0 45.0 17.2 29.4 15.6 19.6 15.4
Envelope -46.2 -76.8 -36.5 -47.7 -53.2 -318.1 -408.8 -314.9 -349.5 -287.3
Inv score 32.4 66.6 37.0 28.3 28.7 23.3 39.9 26.3 22.9 19.5
Inv score tuning 22.5 71.8 a 27.3 2.0 30.6 23.5 45.4 a 26.8 19.5 22.5
Previous best 25.2 62.1 18.4 -18.1 54.7 17.5 37.9 16.1 4.5 28.8

Shows percentage change from the mean. Higher values are better.

a best method in each column.

Table 9. For cumulative mortality, average ranks of the 95% interval MIS and MWIS.
95% interval MIS MWIS
Method All U.S. High Med Low All U.S. High Med Low
Mean 7.8 b 9.0 8.1 b 7.3 b 8.1 b 8.4 b 9.0 8.4 b 8.5 b 8.4 b
Median 2.4 a 3.0 2.8 a 2.0 a 2.5 a 1.9 2.0 2.1 1.6 1.9 a
Ensemble 2.6 6.0 2.8 b 2.1 2.7 1.8 a 3.0 1.8 a 1.4 a 2.1
Sym trim 3.8 7.0 3.7 3.8 3.8 3.2 b 4.0 3.1 3.1 3.2
Exterior trim 7.7 b 8.0 7.5 b 7.5 b 8.0 b 7.1 b 8.0 7.4 b 6.8 b 6.9 b
Interior trim 4.0 4.0 4.5 3.2 4.4 6.0 b 7.0 6.3 b 5.6 b 6.2 b
Envelope 8.9 b 10.0 9.1 b 9.1 b 8.5 b 10.0 b 10.0 10.0 b 10.0 b 10.0 b
Inv score 5.4 b 2.0 5.2 5.2 6.1 b 5.1 b 5.0 5.0 b 4.6 b 5.8 b
Inv score tuning 6.2 b 1.0 a 5.5 7.2 b 6.1 b 5.1 b 1.0 a 4.5 b 5.7 b 5.4 b
Previous best 6.0 b 5.0 5.7 7.6 b 4.8 6.4 b 6.0 6.5 b 7.6 b 5.2 b

Lower values are better.

a indicates best method in each column

b significantly worse than the best method, at the 5% significance level.

Changes over time in performance for cumulative deaths

The skill scores for the MWIS for the first and second halves of the 78-week out-of-sample period are shown in Table 10. For the first half of the out-of-sample period, the improvements over the mean were considerably smaller than for the second half. The sizeable skill scores in the second half for the ensemble, median and symmetric trimming strongly suggest the presence of outliers. We consider this issue further in a later section, where we investigate the impact of reporting patterns and outliers on forecast accuracy. We also note in Table 10 that the inverse score methods were more competitive against the ensemble and median in the first half of the out-of-sample period.

Table 10. For cumulative mortality, skill scores for MWIS calculated separately for the first and second halves of the 78-week out-of-sample period.
1st Half 2nd Half
Method All U.S. High Med Low All U.S. High Med Low
Mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Median 8.9 6.9 8.5 8.2 10.2 65.7 a 68.3 a 62.7 64.8 a 69.2 a
Ensemble 9.1 a 7.0 8.7 8.3 a 10.4 a 65.6 67.4 63.5 a 64.6 68.2
Sym trim 8.4 4.3 8.1 6.9 10.5 60.4 66.9 58.1 60.1 62.6
Exterior trim -0.1 -0.4 0.1 -1.2 0.8 18.4 29.1 18.4 18.1 18.2
Interior trim 0.7 5.3 -4.4 3.0 3.0 28.8 45.5 31.9 30.8 22.1
Envelope -240.0 -195.0 -246.1 -246.9 -230.2 -367.6 -555.0 -357.8 -419.8 -321.2
Inv score 7.6 10.8 9.3 5.9 7.2 33.9 59.2 37.5 35.6 26.2
Inv score tuning 8.4 17.0 a 11.2 a 5.1 8.4 35.0 64.0 38.5 32.4 31.4
Previous best -9.5 1.0 -6.5 -17.2 -5.8 38.8 62.2 34.2 28.2 50.1

Shows percentage change from the mean. Higher values are better.

a best method in each column.

Performance by model type for cumulative deaths

The MWIS results of the comparison by model type for cumulative forecasts are reported in Table 11. For all combining methods except the median, combining only compartmental models performed better than combining all models for all categories of the mortality series, except the category that is just the total U.S. deaths. As with forecasts of incident deaths, the inverse score methods were competitive with the median when combining forecasts only from compartmental models.

Table 11. For cumulative mortality, skill scores for MWIS for combining methods applied to forecasts of all models, compartmental models only, and non-compartmental models only.
All U.S. High Med Low
Method All Comp Non-comp All Comp Non-comp All Comp Non-comp All Comp Non-comp All Comp Non-comp
Mean 0.0 39.7 -27.3 0.0 33.1 -27.1 0.0 36.8 -29.0 0.0 37.3 -27.8 0.0 45.1 -25.2
Median 43.2 40.7 41.1 43.9 33.2 45.0 39.7 37.3 36.7 41.7 39.1 39.3 47.9 45.7 46.5
Sym trim 39.9 40.8 38.9 41.9 32.6 43.0 36.6 37.4 33.9 38.0 38.9 37.3 44.6 46.0 44.8
Exterior trim 11.2 38.7 -4.8 17.2 31.3 -0.2 10.6 35.9 -7.7 10.1 36.0 -5.6 12.5 44.4 -1.5
Interior trim 17.2 40.1 9.7 29.4 35.1 16.5 15.6 36.5 4.8 19.6 38.0 12.2 15.4 45.6 11.5
Envelope -318.1 -27.3 -337.3 -408.8 -14.5 -423.1 -314.9 -29.0 -324.0 -349.5 -43.8 -354.5 -287.3 -12.0 -329.5
Inv score 23.3 40.9 14.0 39.9 38.1 35.9 26.3 37.4 16.0 22.9 39.1 12.1 19.5 46.1 12.2
Inv score tun 23.5 40.1 15.7 45.4 44.2 39.2 26.8 37.1 22.0 19.5 37.9 8.2 22.5 44.6 14.8
Previous best 17.5 29.0 7.8 37.9 37.1 40.2 16.1 24.4 16.1 4.5 25.1 -1.1 28.8 36.5 5.3

Shows percentage change from the mean. Higher values are better.

Performance of individual models for cumulative deaths

For forecasts of cumulative mortality, 25 models provided forecasts for at least half the locations for at least half the weeks in the out-of-sample period. As can be seen in Table 12, the performance of these individual models was highly variable. The upper half of the table shows that, particularly for the 95% interval, a good number of the individual models were able to outperform the mean. However, the lower half of the table shows that the individual methods were not competitive with the median, except for the case of the 95% interval for the total U.S. mortality series.

Table 12. For cumulative mortality, summary statistics of skill scores for individual models, using mean and median combining as benchmark.
95% interval MIS MWIS
Method All U.S. High Med Low All U.S. High Med Low
Mean combining as skill score benchmark
    Count 25 24 25 25 25 25 24 25 25 25
    Mean -17.9 -9.4 -14.7 -19.2 -24.2 -43.0 -21.1 -34.6 -46.2 -56.4
    Median 2.7 19.6 15.4 8.0 -14.2 -36.3 -4.7 -23.1 -39.3 -37.0
    Minimum -255.2 -307.7 -224.5 -261.1 -245.6 -131.2 -166.8 -113.2 -129.0 -390.8
    Maximum 61.2 74.6 57.7 55.4 69.5 18.6 41.7 16.8 14.7 31.0
Number > 0 13 17 15 14 10 4 11 3 4 6
Median combining as skill score benchmark
    Count 25 24 25 25 25 25 24 25 25 25
    Mean -165.1 -136.4 -155.4 -168.0 -181.7 -126.8 -108.8 -112.3 -132.7 -145.3
    Median -142.6 -130.4 -108.6 -137.9 -148.1 -107.4 -87.3 -93.9 -115.6 -102.2
    Minimum -573.6 -480.9 -481.6 -511.7 -991.6 -310.6 -371.8 -228.2 -287.2 -616.1
    Maximum -14.6 20.3 -6.0 -26.2 -6.3 -43.0 -7.4 -41.7 -45.8 -41.6
    Number > 0 0 2 0 0 0 0 0 0 0 0

Higher values of the skill score are better.

Calibration results for cumulative deaths

Fig 5 presents reliability diagrams for each category of the mortality series to summarise the calibration of the 23 quantiles for the mean, median and inverse score with tuning combining methods. The figure shows that the mean produced quantile forecasts that tended to be too low for the U.S. series, and too high for the other four categories. The inverse score with tuning method was very well calibrated, except for the U.S. series, and the median method also performed reasonably well, although it tended to produce quantile forecasts that were generally a little low. S9–S13 Tables show the calibration of all methods for the five categories of locations: all 52 locations, U.S., high mortality, medium mortality and low mortality, respectively.

Fig 5. For cumulative mortality, reliability diagrams showing calibration of the 23 quantiles for the mean, median and inverse score with tuning methods.


The 23 quantiles include all bounds on the interval forecasts and the median.

Impact of reporting patterns and outliers on forecast accuracy

We observed changes in reporting patterns of historical death counts in 15 locations. Fig 6 shows examples of six locations where updates to death counts were particularly notable. We found evidence of backdating in Delaware, Ohio, Rhode Island and Indiana. Backdating of historical death counts is shown as dashed lines. We noted a sharp drop in death counts in West Virginia in May 2021, suggesting a redefinition of COVID-19 deaths. There were sharp increases in death counts in Oklahoma in early April 2021 and in Delaware in late July 2021. We also observed sharp increases in death counts of two other locations (Missouri and Nebraska).

Fig 6. Numbers of reported cumulative deaths in six states where there were noticeable changes in reporting patterns.


Based on reported death counts at multiple data points between 20 June 2020 and 15 January 2022.

For each of the 51 states, Figs 7 and 8 present the MWIS for the mean, median and inverse score with tuning method for incident and cumulative mortalities, respectively. The locations are ordered by the cumulative number of deaths on 15 January 2022. In both figures, all three methods performed noticeably poorly for Ohio, Oklahoma, Nebraska and West Virginia, for which we found notable changes in reporting patterns, as well as in Virginia and Oregon, where we did not observe such changes. We cannot rule out there having been changes in reporting patterns for these and other locations, as we did not have a complete set of files of reported death counts for each week.

Fig 7. For incident mortality, MWIS for high, medium and low mortality states for three selected combining methods.


Fig 8. For cumulative mortality, MWIS for high, medium and low mortality states for three selected combining methods.


As a sensitivity analysis, we excluded the eight named locations for which there were noticeable changes in reporting patterns. The resulting MWIS skill scores are given in S14 Table and S15 Table for incident and cumulative deaths respectively. Compared to the MWIS skill scores presented in Tables 2 and 8 (where no locations were excluded), there were improvements for all methods, slight changes in rankings, but no changes in the overall conclusions.

The differences between the performance of the mean and median forecasts described in previous sections and highlighted in Fig 8 suggested a problem with outliers, particularly for cumulative deaths. Fig 9 provides some insight into an outlying set of 23 quantile forecasts. Each line shows the probability distribution function mapped out by the 23 quantile forecasts of an individual model. For each of the two locations, the presence of an outlying set of quantile forecasts is evident from a line that differs notably from the other lines.

Fig 9. Two examples of an outlying set of quantile forecasts of cumulative deaths for one week-ahead from forecast origin for the week ending on 18 July 2020.


Discussion

The weighted inverse scores, ensemble and median performed best for forecasts of incident deaths. They produced moderate improvements in performance over the common benchmark mean combination. With the forecasts of cumulative deaths, improvements over the mean were much higher, and for the median and ensemble, they were substantial. For all combining methods except the median, combining forecasts from only compartmental models produced better forecasts than forecasts from combinations of all models. Furthermore, considering combinations of compartmental models only, inverse score combining was more competitive against the median for both mortalities. We found that the individual models were not competitive with the better combining methods. The presence of outlying forecasts had an adverse impact on the performance of the mean and the inverse score methods, which involved weighted averaging. The adverse effects of reporting patterns on performance were minor.

We presented the inverse score methods in an earlier study of forecasts of cumulative COVID-19 deaths [32]. The current paper considers both incident and cumulative forecasts, uses a far longer period of data than the earlier study, and involves a different set of forecasting models, as we now only include forecasts that passed the screening tests of the COVID-19 Hub. In our earlier study, the inverse score methods were the most accurate overall and the mean generally outperformed the median. The mean was also competitive against the inverse score methods for many locations. The results of the current study for forecasts of cumulative deaths were not consistent with those of the earlier study, although much better results were achieved for the mean by combining only compartmental models, and, when combining forecasts of incident deaths from all models, the relative performance of the inverse score methods was considerably better. In the current study of cumulative deaths, the leading methods were generally the ensemble, the median and symmetric trimming (of which the median is an extreme case). These methods are robust to outliers. The results of our two studies illustrate that, particularly for forecasts of cumulative deaths, the relative performance of combining methods depends on the extent of outlying forecasts, and that outlying forecasts were clearly more prevalent in the dataset for the current study.

Another relevant previous study is that by Bracher et al. [29], who compared forecasts produced by the mean combination, median combination and a weighted combination for COVID-19 deaths in Germany and Poland. They found that combined methods did not perform better than individual models. However, this study was limited by an evaluation period of only ten weeks. It is also worth noting that the study used just thirteen individual models in the combinations. In our previous work [32], we found accuracy improved notably as the number of individual models rose, plateauing at around twenty models.

Previous studies have found that data driven models can perform better than compartmental models in forecasting COVID-19 data [9, 41, 42]. For forecasts of both mortalities, in many cases we found that there was no benefit in including non-compartmental models in a combination with compartmental models for a number of combining methods, including the mean combination. Non-compartmental models include simple time series models, which would be particularly prone to underperform when there are sudden steep increases in cumulative death counts, and so the steep increases that we highlighted might partially explain our results. Furthermore, these cited studies were carried out during the early weeks of our study, and we would expect the compartmental models to have increased in sophistication over time and the model parameters and assumptions to have improved.

A major strength of our study is our source of data, which presented an opportunity to study the ‘wisdom of the crowd’, and provided the necessary conditions for the crowd being ‘wise’ [20] and without distortion, such as by social pressure [43] or restrictions against forecasting teams applying their own judgement [26]. These conditions include independent contributors, diversity of opinions, and a trustworthy central convener to collate the information provided [20]. Further strengths relating to the reliability of our findings arise from the high number of individual models. Our reported findings are limited to U.S. data and a particular set of models, and it is possible that different results may arise from other models, or for data from other locations, or other types of data, such as number of people infected. These are potential avenues for future research. Our ability to detect statistical differences was limited by the small sample sizes, with only 17 locations in each category, missing data, and a relatively short out-of-sample period.

It is suggested that relying on modelling alone leads to “missteps and blind spots”, and that the best approach to support public policy decision making would involve a triangulation of insights from modelling with other information, such as analyses of previous outbreaks and discussions with frontline staff [44]. It is essential that modelling offers the most accurate forecasts. Probabilistic forecasts reflect the inherent uncertainties in prediction. Although individual models can sometimes be more accurate than combined methods, relying on forecasts from combined methods provides a more risk-averse strategy, as the best individual model will not be clear until records of historical accuracy are available, and the best performing model will typically change over time. At the start of an epidemic, when it is not clear which model has the best performance, the statistical expectation is that the average method will score far better than a model chosen randomly, or chosen on the basis of no prior history. This was the case at the start of the COVID-19 pandemic.

The existence of outlying forecasts presents challenges for forecast combining. These can arise from model-based factors or from factors involving the actual number of deaths. The former include computational model errors, which can happen occasionally, and incorrect model assumptions, which will typically apply in the early stages of a pandemic. The latter include data updates and changes in definitions. Some models can be adapted to allow for data anomalies. The removal of outlying forecasts could be added to the pre-combining screening process, but screening criteria for outliers may be arbitrary and subjective. A more objective way to tackle outlying forecasts is to use the median combination, and that was the approach taken by the COVID-19 Hub in July 2020, having previously relied on a mean ensemble. Our earlier study suggested that factoring historical accuracy into forecast combinations may achieve greater accuracy than the median combination [32]. Both our studies have involved performance-weighted mean methods, and our current study has shown that they are not sufficiently robust to outliers. We recommend further research into weighted methods and the effect of model type on the relative performance of combined methods.

Supporting information

S1 Fig. Data availability for forecasts of cumulative COVID-19 deaths.

* Based on information recorded on the COVID-19 Hub with citations as recorded on 25/2/22; a Only provided forecasts of numbers of cumulative COVID-19 deaths; b Only provided forecasts of numbers of incident COVID-19 deaths.

(TIF)

S1 Table. Individual forecasting models.

(PDF)

S2 Table. For incident mortality, 95% interval MIS and MWIS for each prediction horizon.

Lower values are better. a best method for each horizon in each column; b score is significantly lower than the mean combination; c score is significantly lower than the median combination.

(PDF)

S3 Table. For incident mortality, calibration for all locations.

(PDF)

S4 Table. For incident mortality, calibration for U.S.

(PDF)

S5 Table. For incident mortality, calibration for high mortality locations.

(PDF)

S6 Table. For incident mortality, calibration for medium mortality locations.

(PDF)

S7 Table. For incident mortality, calibration for low mortality locations.

(PDF)

S8 Table. For cumulative mortality, 95% interval MIS and MWIS for each prediction horizon.

Lower values are better. a best method for each horizon in each column; b score is significantly lower than the mean combination; c score is significantly lower than the median combination.

(PDF)

S9 Table. For cumulative mortality, calibration for all locations.

(PDF)

S10 Table. For cumulative mortality, calibration for U.S.

(PDF)

S11 Table. For cumulative mortality, calibration for high mortality locations.

(PDF)

S12 Table. For cumulative mortality, calibration for medium mortality locations.

(PDF)

S13 Table. For cumulative mortality, calibration for low mortality locations.

(PDF)

S14 Table. Sensitivity analysis for incident mortality, skill scores of the 95% interval MIS and MWIS after excluding locations for which there were noticeable changes in reporting patterns.

Shows percentages. Higher values are better. a best method in each column.

(PDF)

S15 Table. Sensitivity analysis for cumulative mortality, skill scores of the 95% interval MIS and MWIS after excluding locations for which there were noticeable changes in reporting patterns.

Shows percentages. Higher values are better. a best method in each column.

(PDF)

Acknowledgments

We thank Nia Roberts for helping us understand the license terms for the forecast data from the COVID-19 Forecast Hub. We also thank the anonymous reviewers for their comments.

Data Availability

Data were downloaded from the public GitHub data repository of the COVID-19 Hub at https://github.com/reichlab/covid19-forecast-hub. The code used to generate the results is publicly available on Zenodo at https://doi.org/10.5281/zenodo.6300524.

Funding Statement

This research was partly supported by the National Institute for Health Research Applied Research Collaboration Oxford and Thames Valley at Oxford Health NHS Foundation Trust. The views expressed in this publication are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Jewell NP, Lewnard JA, Jewell BL. Predictive mathematical models of the COVID-19 pandemic: underlying principles and value of projections. JAMA. 2020;323(19):1893–4. doi: 10.1001/jama.2020.6585 [DOI] [PubMed] [Google Scholar]
  • 2.Phelan AL, Katz R, Gostin LO. The novel coronavirus originating in Wuhan, China: Challenges for global health governance. JAMA. 2020. doi: 10.1001/jama.2020.1097 [DOI] [PubMed] [Google Scholar]
  • 3.Looi M-K. Covid-19: Is a second wave hitting Europe? BMJ. 2020;371:m4113. doi: 10.1136/bmj.m4113 [DOI] [PubMed] [Google Scholar]
  • 4.Melnick ER, Ioannidis JPA. Should governments continue lockdown to slow the spread of covid-19? BMJ. 2020;369:m1924. doi: 10.1136/bmj.m1924 [DOI] [PubMed] [Google Scholar]
  • 5.Policy brief: Education during COVID-19 and beyond [press release]. 2020.
  • 6.Wise J. Covid-19: Experts divide into two camps of action—shielding versus blanket policies. BMJ. 2020;370:m3702. doi: 10.1136/bmj.m3702 [DOI] [PubMed] [Google Scholar]
  • 7.Adam D. Special report: The simulations driving the world’s response to COVID-19. Nature. 2020;580(7803):316–8. doi: 10.1038/d41586-020-01003-6 [DOI] [PubMed] [Google Scholar]
  • 8.Holmdahl I, Buckee C. Wrong but Useful—What Covid-19 Epidemiologic Models Can and Cannot Tell Us. N Engl J Med. 2020;383(4):303–5. doi: 10.1056/NEJMp2016822 [DOI] [PubMed] [Google Scholar]
  • 9.Ioannidis JPA, Cripps S, Tanner MA. Forecasting for COVID-19 has failed. Int J Forecast. 2020. doi: 10.1016/j.ijforecast.2020.08.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Buckee CO, Johansson MA. Individual model forecasts can be misleading, but together they are useful. Eur J Epidemiol. 2020;35(8):731–2. doi: 10.1007/s10654-020-00667-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Chen LP, Zhang Q, Yi GY, He W. Model-based forecasting for Canadian COVID-19 data. PLoS One. 2021;16(1):e0244536. doi: 10.1371/journal.pone.0244536 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Paiva HM, Afonso RJM, de Oliveira IL, Garcia GF. A data-driven model to describe and forecast the dynamics of COVID-19 transmission. PLoS One. 2020;15(7):e0236386. doi: 10.1371/journal.pone.0236386 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Petropoulos F, Makridakis S. Forecasting the novel coronavirus COVID-19. PLoS One. 2020;15(3):e0231236. doi: 10.1371/journal.pone.0231236 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Johansson MA, Apfeldorf KM, Dobson S, Devita J, Buczak AL, Baugher B, et al. An open challenge to advance probabilistic forecasting for dengue epidemics. PNAS. 2019;116(48):24268–74. doi: 10.1073/pnas.1909865116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Lipsitch M, Santillana M. Enhancing Situational Awareness to Prevent Infectious Disease Outbreaks from Becoming Catastrophic. Curr Top Microbiol Immunol. 2019;424:59–74. doi: 10.1007/82_2019_172 [DOI] [PubMed] [Google Scholar]
  • 16.Bracher J, Ray EL, Gneiting T, Reich NG. Evaluating epidemic forecasts in an interval format. PLoS Comput Biol. 2021;17(2):e1008618–e. doi: 10.1371/journal.pcbi.1008618 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Panovska-Griffiths J. Coronavirus: we’ve had ‘Imperial’, ‘Oxford’ and many more models–but none can have all the answers. 2020. Accessed 28 Dec 2020. Available from: https://theconversation.com/coronavirus-weve-had-imperial-oxford-and-many-more-models-but-none-can-have-all-the-answers-135137. [Google Scholar]
  • 18.Claeskens G, Magnus JR, Vasnev AL, Wang W. The forecast combination puzzle: A simple theoretical explanation. Int J Forecast. 2016;32(3):754–62. [Google Scholar]
  • 19.Smith J, Wallis KF. A Simple Explanation of the Forecast Combination Puzzle. Oxf Bull Econ Stat. 2009;71(3):331–55. [Google Scholar]
  • 20.Surowiecki J. The Wisdom of Crowds: Why the Many are Smarter Than the Few and how Collective Wisdom Shapes Politics, Business, Economies, Societies, and Nations: Doubleday & Co; 2004. [Google Scholar]
  • 21.Bates JM, Granger CWJ. The Combination of Forecasts. OR. 1969;20(4):451–68. [Google Scholar]
  • 22.Busetti F. Quantile Aggregation of Density Forecasts. Oxf Bull Econ Stat. 2017;79(4):495–512. [Google Scholar]
  • 23.Timmermann A. Chapter 4 Forecast Combinations. In: Elliott G, Granger CWJ, Timmermann A, editors. Handb Econ Forecast. 1: Elsevier; 2006. p. 135–96. [Google Scholar]
  • 24.Krishnamurti TN, Kishtawal CM, LaRow TE, Bachiochi DR, Zhang Z, Williford CE, et al. Improved Weather and Seasonal Climate Forecasts from Multimodel Superensemble. Science. 1999;285(5433):1548–50. doi: 10.1126/science.285.5433.1548 [DOI] [PubMed] [Google Scholar]
  • 25.Moran KR, Fairchild G, Generous N, Hickmann K, Osthus D, Priedhorsky R, et al. Epidemic Forecasting is Messier Than Weather Forecasting: The Role of Human Behavior and Internet Data Streams in Epidemic Forecast. J Infect Dis. 2016;214(suppl_4):S404–S8. doi: 10.1093/infdis/jiw375 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Ray EL, Wattanachit N, Niemi J, Kanji AH, House K, Cramer EY, et al. Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S. medRxiv. 2020:2020.08.19.20177493. [Google Scholar]
  • 27.Reich NG, McGowan CJ, Yamana TK, Tushar A, Ray EL, Osthus D, et al. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S. PLoS Comput Biol. 2019;15(11):e1007486–e. doi: 10.1371/journal.pcbi.1007486 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Yamana TK, Kandula S, Shaman J. Individual versus superensemble forecasts of seasonal influenza outbreaks in the United States. PLoS Comput Biol. 2017;13(11):e1005801. doi: 10.1371/journal.pcbi.1005801 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Bracher J, Wolffram D, Deuschel J, Görgen K, Ketterer JL, Ullrich A, et al. A pre-registered short-term forecasting study of COVID-19 in Germany and Poland during the second wave. Nat Commun. 2021;12(1):5173. doi: 10.1038/s41467-021-25207-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Dean NE, Pastore YPA, Madewell ZJ, Cummings DAT, Hitchings MDT, Joshi K, et al. Ensemble forecast modeling for the design of COVID-19 vaccine efficacy trials. Vaccine. 2020;38(46):7213–6. doi: 10.1016/j.vaccine.2020.09.031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Ehwerhemuepha L, Danioko S, Verma S, Marano R, Feaster W, Taraman S, et al. A super learner ensemble of 14 statistical learning models for predicting COVID-19 severity among patients with cardiovascular conditions. Intell Based Med. 2021;5:100030. doi: 10.1016/j.ibmed.2021.100030 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Taylor JW, Taylor KS. Combining probabilistic forecasts of COVID-19 mortality in the United States. Eur J Oper Res. 2021. doi: 10.1016/j.ejor.2021.06.044 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Gneiting T, Raftery AE. Strictly Proper Scoring Rules, Prediction, and Estimation. J Am Stat Assoc 2007;102(477):359–78. [Google Scholar]
  • 34.Winkler RL, Grushka-Cockayne Y, Lichtendahl KC Jr, Jose VRR. Probability Forecasts and Their Combination: A Research Perspective. Decis Anal. 2019;16(4):239–60. [Google Scholar]
  • 35.Diebold FX, Mariano RS. Comparing Predictive Accuracy. J Bus Econ Stat 2002;20(1):134–44. [Google Scholar]
  • 36.Koning AJ, Franses PH, Hibon M, Stekler HO. The M3 competition: Statistical tests of the results. Int J Forecast. 2005;21(3):397–409. [Google Scholar]
  • 37.Gaba A, Tsetlin I, Winkler RL. Combining Interval Forecasts. Decis Anal. 2017;14(1):1–20. [Google Scholar]
  • 38.Hora SC, Fransen BR, Hawkins N, Susel I. Median Aggregation of Distribution Functions. Decis Anal. 2013;10(4):279–91. [Google Scholar]
  • 39.Jose VRR, Grushka-Cockayne Y, Lichtendahl KC. Trimmed Opinion Pools and the Crowd’s Calibration Problem. Manage Sci. 2014;60(2):463–75. [Google Scholar]
  • 40.Park S, Budescu D. Aggregating multiple probability intervals to improve calibration. Judgm Decis. 2015;10(2):130–43. [Google Scholar]
  • 41.Girardi P, Greco L, Ventura L. Misspecified modeling of subsequent waves during COVID-19 outbreak: A change-point growth model. Biom J. 2021. doi: 10.1002/bimj.202100129 [DOI] [PubMed] [Google Scholar]
  • 42.Mingione M, Alaimo Di Loro P, Farcomeni A, Divino F, Lovison G, Maruotti A, et al. Spatio-temporal modelling of COVID-19 incident cases using Richards’ curve: An application to the Italian regions. Spatial Statistics. 2021:100544. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Yeh RW. Academic Cardiology and Social Media: Navigating the Wisdom and Madness of the Crowd. Circ Cardiovasc Qual Outcomes. 2018;11(4):e004736. doi: 10.1161/CIRCOUTCOMES.118.004736 [DOI] [PubMed] [Google Scholar]
  • 44.Sridhar D, Majumder MS. Modelling the pandemic. BMJ. 2020;369:m1567. doi: 10.1136/bmj.m1567 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Maurizio Naldi

17 Jan 2022

PONE-D-21-30265
Forecasts of weekly incident and cumulative COVID-19 mortality in the United States: A comparison of combining methods
PLOS ONE

Dear Dr. Taylor,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

All three reviewers have recognized the importance and timeliness of the topic. However, they have also highlighted several criticalities. Please refer to their detailed reviews for indications. Please be sure to answer all their comments in your revision.

Please submit your revised manuscript by Mar 03 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Maurizio Naldi

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript addresses an interesting and timely topic. The use of ensemble approaches is sound and comparisons across methods are very useful in practice. Some comments follow.

1. I would emphasize the importance of quantifying the uncertainty associated with forecasts. Therefore I kindly ask to report more predictive quantiles (1%, 2.5%, 5%, 10%, . . . , 90%, 95%, 97.5%, 99%) in addition to their point forecasts. This motivates considering both forecasts of cumulative and incident quantities, as predictive quantiles for these generally cannot be translated from one scale to the other.

2. The use of the interval score (Gneiting and Raftery, 2007) is sound. The three summands can be interpreted as a measure of sharpness and penalties for under- and overprediction, respectively. I strongly believe that the weighted interval score (Bracher, J., Ray, E. L., Gneiting, T., and Reich, N. G. (2020a). Evaluating epidemic forecasts in an interval format. PLOS Computational Biology) should be considered. It combines the absolute error of the predictive median and the interval scores achieved for the nominal levels. It is a well-known quantile-based approximation of the continuous ranked probability score.

3. Maybe I miss something, or there is something swept under the carpet. In any ensemble-based approach, the choice of the weights is crucial. Please, provide more details on this point.

4. It would also be interesting to see a discussion of the performance of the models belonging to different families. Ioannidis et al (Ioannidis JPA, Cripps S, Tanner MA. Forecasting for COVID-19 has failed. Int J Forecast. 2020. http://www.sciencedirect.com/science/article/pii/S016920%7020301199) discussed that some approaches, mainly those referring to the SIR family, have failed in providing reasonable forecasts. On the other hand, data-driven approaches show much better performances in forecasting the evolution of the epidemic (see e.g. Mingione, M., Di Loro, P. A., Farcomeni, A., Divino, F., Lovison, G., Maruotti, A., & Lasinio, G. J. (2021). Spatio-temporal modelling of COVID-19 incident cases using Richards’ curve: An application to the Italian regions. Spatial Statistics, 100544; Girardi, P., Greco, L., & Ventura, L. (2021). Misspecified modeling of subsequent waves during COVID‐19 outbreak: a change‐point growth model. Biometrical Journal).

Reviewer #2: Review of “Forecasts of weekly incident and cumulative COVID-19 mortality in the United States: A comparison of combining methods”

Thank you for the opportunity to review this interesting paper. The topic is important. This analysis builds on a previously published study of combining methods for COVID-19 forecasting models in the United States and extends it using more data. The methods are interesting and seem mostly technically sound to me, although I think a comparison with percent errors is essential. I have a number of clarifying questions about the text. My main comment is about data and code availability. Pointing to the input data sources is not sufficient, in my opinion, as the authors must transform and manipulate the data to conduct the analysis. The devil is in the details with forecasting and predictive validity. The authors should publish their code and data in a public repository to facilitate a complete review of the research. That is the standard in the field for this kind of work (all other papers I’ve read on the topic have done so) and also required by the journal for publication: https://journals.plos.org/plosone/s/materials-software-and-code-sharing.

Comments:

Abstract

1. Methods section is vague. How specifically was the analysis conducted? “comparing accuracy” is not specific. Which metrics were used? How were they calculated? How did you design your holdouts? The basic details should be clear in the abstract.

2. The results section is also vague. Specific values should be cited. By what margin did the best performing models overtake the others?

3. Conclusions: please indicate how this study did or did not concur with the previous study. What more can we learn from adding additional data?

Methods

4. Why was only 1 year of data used when there are nearly 2 years of data available now since the start of the pandemic?

5. I am not sure that evaluating absolute error makes the most sense, as it biases the results towards higher mortality moments and locations. I would like to see a comparison to relative error metrics, such as the median absolute percent error. This has also been done in prior analyses so would be standard.

6. I would recommend placing the trimming and other details of how forecasts are combined in the main text, given journal length limits, and how central those details are to the work at hand.

Results

7. How did the methods you describe here perform in comparison to the forecasts hub’s own internal ensemble model? Please highlight this important comparison.

Reviewer #3: Thank you for the opportunity to review Taylor and Taylor’s manuscript that compares different methods of combining COVID-19 mortality forecasts. Their research is an important contribution to the forecasting field. Analysis of ensembling approaches is sorely needed; and it is also timely as the COVID-19 epidemic continues to change and forecasts continue to be used for operational decision making. Moreover, as the field continues to move towards open and collaborative science, their work has direct application to future forecasting. Their clearly written manuscript has many strengths – it compares simple and more complex methods; it harnesses a year’s worth of data for evaluation; and focuses analyses on probabilistic distributions. In addition, there are also opportunities to improve the text and analysis (see comments below).

MAJOR COMMENTS

Abstract

Line 30: I suggest cutting the text about “extended and new datasets”. It’s not apparently clear to readers what the authors are referring to if they haven’t read their previous manuscript.

Line 35: The COVID-19 Forecast Hub collects probability distributions as well as point forecasts. The authors are referring here to the 50% quantile of probability distributions as ‘point forecasts’. Please refer to them throughout the manuscript as the 50% quantile, rather than a point forecast. [Also, see comments under Material and Methods, as I recommend excluding all 50% quantile analyses. These forecasts are 1) not useful for outbreak response, and 2) misleading in terms of communicating uncertainty.]

Lines 35-37: Please list the evaluation metrics (mean interval score) and the combination methods here.

Line 38 and 39: The first sentence of the Results is very confusing. Does the ‘average performance of these models’ refer to the Mean method, the Ensemble model, or something else?

Line 46: How do you define “sufficient”? Length of historical data was beyond the scope of the analysis, but the text implies this was considered within the analysis.

Introduction

Line 89: Recommend not referring to reported data as ‘delayed’. ‘Reporting patterns’ is more accurate to describe the nature of the changes to the CSSE datasets and more accurately describes the descriptive nature of this part of the analysis.

Materials and Methods

Lines 109-110: The number of the week’s is very confusing here. What is the first week of the epidemic, and are you referring to the first week in the US? Because forecasts were not collected at ‘week 1’ of outbreak, I suggest starting the start date and the end date, with the start date referred to as ‘week 1 of the analysis’.

Lines 113-116: I strongly recommend changing the inclusion criteria for this analysis. Models that were not included in the COVID-19 Forecast Hub should also be excluded here. This will help with comparability between the ensemble approaches. The COVID-19 Forecast Hub excludes forecasts that are improbable, such as if the number of cases or deaths exceeds the population size, not based on their predictions being ‘too large’. Details are provided here: https://covid19forecasthub.org/doc/ensemble/

Line 122: In the evaluation, did you use the CSSE reported counts at the end of the analysis period? OR was each date evaluated against the data available at the time. Please clarify in the methods.

Line 122: What is the rationale for focusing only on the 95% PIs? Information can be gained by examining all intervals (7 available) and weighting them (a method applied by Cramer et al, 2021: https://www.medrxiv.org/content/10.1101/2021.02.03.21250974v3)

Line 122: Because the interval score is a combination of calibration and sharpness, I wonder if the authors considered presenting all three metrics – calibration, sharpness, and IS? This might provide additional insights and if the authors are only presenting metrics at alpha of 0.05, they have the space to do so.

Line 130: The field is moving away from communication of single numbers and towards ranges. Point predictions are rarely used to communicate forecasts in the COVID-19 pandemic, and generally discouraged. Thus, I don’t find the point forecast analysis to be useful and suggest removing it.

Line 142-146: Changes in the reported death counts were not always due to reporting delays. For example, the large spike observed in the winter 2020/2021 in Ohio reflects a change in how the state defined a death. Even the everchanging landscape of the pandemic, I’d recommend referring to these anomalies as “reporting patterns”, and perhaps defining examples of backlogged deaths, reporting dumps, or changes in definitions.

Lines 157-159: Please define ‘overconfidence’ and ‘underconfidence’ and describe how they relate to the various trimming methods.

Line 161: Can you say more about the ‘previous best’? It’s not clear to me why you added this model or if reference 42 is the correct reference for it. What does this add to the analysis?

Line 171-172: Inclusion of individual models aids in the overall point that combining is better, however, the inclusion criteria is pretty strict here. Several models were consistent submitters to the COVID-19 Forecast Hub but missed a week here or there. Consider broadening this inclusion criteria.

Line 171-172: I’d like to confirm that COVIDhub Baseline model was not included in this set? It’s not designed to be a true forecast but is rather a comparator point for the submitted models.

Results

Line 190: What were the thresholds used for the categories? How many states were in each category? Please also add to the discussion the limitation of not including a time component to the analysis, as the US experienced spatial heterogeneity in the outbreak and even lower incidence states had peaks, when model performance was subpar.

Line 230: Please note which statistical test you are referring to here.

Line 248: I think OK and WV are missing the dashed lines? If not, then there are no differences in reporting patterns over time

Line 255: Please present these data in either points or bars. Lines imply that the data are longitudinal.

Line 266: Caution with describing Ohio and the individual models here. Because I don’t know which team model 33 is, I can’t speak to the accuracy of the text. It should be noted that several teams noted the spike in reported deaths in Ohio and assumed it to be an error, while other teams assumed it to be truth. Because this nuance is not available here, I recommend deleting mention of the individual models from the text.

Lines 278-281: Can you share that sensitivity analysis as a supplement?

Discussion

Line 298: The main limitation of the analysis is the lack of temporal analysis. The epidemic varied over time and space in the US, and consequently, so did the forecast performance. While I do not think that the authors need to include temporal analysis, they should include this as a limitation in the Discussion.

Line 302: As written, this implies that forecast type and timing was assessed in the analysis. Please provide a citation since this was beyond the scope of the manuscript.

Line 337: Same comments as line 302. Please reference.

MINOR COMMENTS

Line 61: Source 15 has been published; please update the reference.

Line 73: Center, in CDC, is spelled incorrectly.

Line 93 and 98: Reference 33 is incorrect. It should be reference 26 here, as reference 33 refers to the reported data, not the forecast data or the collaborations surrounding the forecast data.

Line 130 and 131: The sentence about point forecasts should be a new paragraph

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Mar 29;17(3):e0266096. doi: 10.1371/journal.pone.0266096.r002

Author response to Decision Letter 0


27 Feb 2022

Reply to comments of the Editor

The Editor writes: “Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

All three reviewers have recognized the importance and timeliness of the topic. However, they have also highlighted several criticalities. Please refer to their detailed reviews for indications. Please be sure to answer all their comments in your revision.”

Our response:

We are very grateful to the Editor for giving us the opportunity to revise our manuscript. We are also very appreciative of the reviewers for devoting time to read and comment on our manuscript. Extensive changes have been made to address the issues raised by the reviewers, involving the data, study design and reporting of results. We feel that the review process has very much improved our manuscript.

We would like to point out to the reviewers that we now report the results for incident deaths and cumulative deaths separately. Please also note that we have changed the title so that we now refer to “interval forecasts” instead of simply “forecasts”, as we feel this more clearly describes the focus of our revised manuscript.

Reply to Reviewer #1

The reviewer writes: “The manuscript addresses an interesting and timely topic. The use of ensemble approaches is sound and comparisons across methods are very useful in practice. Some comments follow.”

Our response: We thank the reviewer for these positive comments.

The reviewer writes: “I would emphasize the importance of quantifying the uncertainty associated with forecasts. Therefore I kindly ask to report more predictive quantiles (1%, 2.5%, 5%, 10%, . . . , 90%, 95%, 97.5%, 99%) in addition to their point forecasts. This motivates considering both forecasts of cumulative and incident quantities, as predictive quantiles for these generally cannot be translated from one scale to the other.”

Our response: We now report more predictive quantiles by considering all the (symmetric) interval forecasts that can be constructed from the 23 quantiles. In response to other reviewers’ comments, we have removed the analysis of point (50% quantile) forecasts. Given our focus now on interval forecasts, we have added “Interval” to the title, so it is now

“Interval forecasts of weekly incident and cumulative COVID-19 mortality in the United States: A comparison of combining methods”

The reviewer writes: “The use of the interval score (Gneiting and Raftery, 2007) is sound. The three summands can be interpreted as a measure of sharpness and penalties for under- and overprediction, respectively. I strongly believe that the weighted interval score (Bracher, J., Ray, E. L., Gneiting, T., and Reich, N. G. (2020a). Evaluating epidemic forecasts in an interval format. PLOS Computational Biology) should be considered. It combines the absolute error of the predictive median and the interval scores achieved for the nominal levels. It is a well-known quantile-based approximation of the continuous ranked probability score.”

Our response: In addition to the 95% interval score, we now also report the weighted interval score, as we now evaluate all (symmetric) intervals constructed from the 23 quantiles. Because we average the weighted interval score across the out-of-sample period and across forecast horizons, in the paper we refer to it as the “mean weighted interval score (MWIS)”.
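For reference, the standard definitions underlying these scores are as follows (following Gneiting and Raftery [33] for the interval score and Bracher et al. [16] for the weighted interval score); the paper’s methods section gives the exact averaging used to obtain the MIS and MWIS. For a central (1 − α) interval [l, u] and observed count y,

\[ \mathrm{IS}_{\alpha}(l,u;y) = (u-l) + \frac{2}{\alpha}(l-y)\,\mathbf{1}\{y<l\} + \frac{2}{\alpha}(y-u)\,\mathbf{1}\{y>u\}, \]

and, for a forecast F summarised by its predictive median m and K central intervals with levels α_1, …, α_K,

\[ \mathrm{WIS}_{\alpha_{0:K}}(F;y) = \frac{1}{K+\tfrac{1}{2}}\Big( w_0\,|y-m| + \sum_{k=1}^{K} w_k\,\mathrm{IS}_{\alpha_k}(F;y) \Big), \qquad w_0=\tfrac{1}{2},\ w_k=\tfrac{\alpha_k}{2}. \]

With the Hub’s 23 quantiles, K = 11. The MIS and MWIS are these scores averaged over the out-of-sample period and, where stated, over horizons and locations.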

The reviewer writes: “Maybe I miss something, or there is something swept under the carpet. In any ensemble-based approach, the choice of the weights is crucial. Please, provide more details on this point.”

Our response: The descriptions of the two weighted methods and all other combining methods were in a supplementary file in our previous submission. We appreciate that this important information should have been included in the main text, and so we have now done this. The combining methods are described in the section entitled “Forecast combining methods”.

The reviewer writes: “It would also be interesting to see a discussion of the performance of the models belonging to different families. Ioannidis et al (Ioannidis JPA, Cripps S, Tanner MA. Forecasting for COVID-19 has failed. Int J Forecast. 2020. http://www.sciencedirect.com/science/article/pii/S016920%7020301199) discussed that some approaches, mainly those referring to the SIR family, have failed in providing reasonable forecasts. On the other hand, data-driven approaches show much better performances in forecasting the evolution of the epidemic (see e.g. Mingione, M., Di Loro, P. A., Farcomeni, A., Divino, F., Lovison, G., Maruotti, A., & Lasinio, G. J. (2021). Spatio-temporal modelling of COVID-19 incident cases using Richards’ curve: An application to the Italian regions. Spatial Statistics, 100544; Girardi, P., Greco, L., & Ventura, L. (2021). Misspecified modeling of subsequent waves during COVID‐19 outbreak: a change‐point growth model. Biometrical Journal).”

Our response:

We now also consider the forecasting accuracy for compartmental/SEIR models versus non-compartmental models. We report the results in the sections entitled “Performance by model type for incident deaths” and “Performance by model type for cumulative deaths”.

We thank the reviewer for highlighting these references, which we now cite in the paper. In particular, the text in the discussion section now reads (4th para):

“Previous studies have found that data driven models can perform better than compartmental models in forecasting COVID-19 data [9, 41, 42]. For forecasts of both mortalities, in many cases we found that there was no benefit in including non-compartmental models in a combination with compartmental models for a number of combining methods, including the mean combination. Non-compartmental models include simple time series models, which would be particularly prone to underperform when there are sudden steep increases in cumulative death counts, and so the steep increases that we highlighted might partially explain our results. Furthermore, these cited studies were carried out during the early weeks of our study, and we would expect the compartmental models to have increased in sophistication over time, and the model parameters and assumptions to have improved.”

Reply to Reviewer #2

The reviewer writes: “Thank you for the opportunity to review this interesting paper. The topic is important. This analysis builds on a previously published study of combining methods for COVID-19 forecasting models in the United States and extends it using more data. The methods are interesting and seem mostly technically sound to me, although I think a comparison with percent errors is essential.”

Our response:

We thank the reviewer for their positive comments.

We now report percentage errors in the form of skill scores. In the section entitled “Evaluating the interval forecasts” (1st para), we define the mean interval score (MIS) and mean weighted interval score (MWIS), and then later in this section (3rd para), we define skill scores. We write:

“We present results of the forecast accuracy evaluation in terms of the 95% interval MIS, MWIS, ranks and skill scores, which are calculated as the percentage by which a given method is superior to the mean combination, which is a common choice of benchmark in combining studies.”
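A plausible formalisation of this skill score, consistent with the wording above (the paper’s methods section gives the exact definition), is

\[ \text{skill score} = 100\left(1 - \frac{S_{\text{method}}}{S_{\text{mean}}}\right), \]

where S denotes the MIS or MWIS of the method being evaluated and of the benchmark mean combination, so that positive values indicate the method outperforms the mean combination and negative values indicate the reverse.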

The reviewer writes: “I have a number of clarifying questions about the text. My main comment is about data and code availability. Pointing to the input data sources is not sufficient, in my opinion, as the authors must transform and manipulate the data to conduct the analysis. The devil is in the details with forecasting and predictive validity. The authors should publish their code and data in a public repository to facilitate a complete review of the research. That is the standard in the field for this kind of work (all other papers I’ve read on the topic have done so) and also required by the journal for publication: https://journals.plos.org/plosone/s/materials-software-and-code-sharing.”

Our response: We now provide access to all our code files at https://doi.org/10.5281/zenodo.6300524. The README document explains the process of importing the publicly available data from the COVID-19 Forecast Hub and the data preparation processes, which were carried out in Stata, before importing the data into GAUSS to generate the results of combining. These code files can be read using a text editor such as Notepad. The data are not ours to give. Our data availability statement now reads:

“Data were downloaded from the public GitHub data repository of the COVID-19 Hub at

https://github.com/reichlab/covid19-forecast-hub. The code used to generate the results is publicly available on Zenodo at https://doi.org/10.5281/zenodo.6300524.”

Abstract

The reviewer writes: “Methods section is vague. How specifically was the analysis conducted? “comparing accuracy” is not specific. Which metrics were used? How were they calculated? How did you design your holdouts? The basic details should be clear in the abstract.”

Our response: The methods section of our abstract has been extended. It now provides more detail about the forecasting methods and their evaluation. We also now state the length of the out-of-sample periods. The text now reads:

“We considered weekly interval forecasts, for 1- to 4-week prediction horizons, with out-of-sample periods of approximately 18 months ending on 8 January 2022, for multiple locations in the United States, using data from the COVID-19 Forecast Hub. Our comparison involved simple and more complex combining methods, including methods that involve trimming outliers or performance-based weights. Prediction accuracy was evaluated using interval scores, weighted interval scores, skill scores, ranks, and reliability diagrams.”

The reviewer writes: “The results section is also vague. Specific values should be cited. By what margin did the best performing models overtake the others?”

Our response: We thank the reviewer for highlighting this issue. The abstract now reads:

“The weighted inverse score and median combining methods performed best for forecasts of incident deaths. Overall, the leading inverse score method was 12% better than the mean benchmark method in forecasting the 95% interval and, considering all interval forecasts, the median was 7% better than the mean. Overall, the median was the most accurate method for forecasts of cumulative deaths. Compared to the mean, the median’s accuracy was 65% better in forecasting the 95% interval, and 43% better considering all interval forecasts.”

The reviewer writes: “Conclusions: please indicate how this study did or did not concur with the previous study. What more can we learn from adding additional data?”

Our response: We have had to be concise in our conclusions, given the word limit for the abstract. We now state:

“The relative performance of combining methods depends on the extent of outliers”.

This is explained more fully in the discussion section of the main text where the text reads (2nd para):

“In our earlier study, the inverse score methods were the most accurate overall and the mean generally outperformed the median. The mean was also competitive against the inverse score method for many locations. The results of the current study for forecasts of cumulative deaths were not consistent with those of the earlier study, although much better results were achieved for the mean by combining only compartmental models, and when combining forecasts of incident deaths from all models, the relative performance of the inverse mean methods was considerably better. In the current study of cumulative deaths, the leading methods were generally the median methods and symmetric mean (for which the median is an extreme case). These methods are robust to outliers. The results of our two studies illustrate that, particularly for forecasts of cumulative deaths, the relative performance of combining methods depends on the extent of outlying forecasts, and that outlying forecasts were clearly more prevalent in the dataset for the current study”

Methods

The reviewer writes: “Why was only 1 year of data used when there are nearly 2 years of data available now since the start of the pandemic?”

Our response: We have updated the two datasets. In response to another reviewer, we now only use forecasts that passed the Hub’s screening tests, and we use all available forecasts projected from forecast origins up to 8 January 2022.

The reviewer writes: “I am not sure that evaluating absolute error makes the most sense, as it biases the results towards higher mortality moments and locations. I would like to see a comparison to relative error metrics, such as the median absolute percent error. This has also been done in prior analyses so would be standard.”

Our response: In response to comments by the other reviewers, we have replaced the analysis of point (50% quantile) forecasts with the analysis of all the prediction intervals, evaluated using the weighted interval score. We now report percentage errors in the form of skill scores. In the section entitled “Evaluating the interval forecasts” (1st para), we define the mean interval score (MIS) and mean weighted interval score (MWIS), and then later in this section (3rd para), we define skill scores. We write:

“We present results of the forecast accuracy evaluation in terms of the 95% interval MIS, MWIS, ranks and skill scores, which are calculated as the percentage by which a given method is superior to the mean combination. The mean is a common choice of benchmark in combining studies.”

The reviewer writes: “I would recommend placing the trimming and other details of how forecasts are combined in the main text, given journal length limits, and how central those details are to the work at hand.”

Our response: We have followed the reviewer’s recommendation. The supplementary file on forecasting methods has been removed, and we describe each combining method in the section entitled “Forecast combining methods”.

Results

The reviewer writes: “How did the methods you describe here perform in comparison to the forecasts hub’s own internal ensemble model? Please highlight this important comparison.”

Our response: The analysis is now more focused on the Hub’s ensemble as, following the advice from another reviewer, our study now only includes forecasts that are included in the Hub’s ensemble. We were unable to reproduce perfectly the performance of the Hub’s ensemble (its performance should have been identical to that of the median in the post-sample period), which suggests that some forecasts were removed since they were submitted to and assessed by the Hub. We have reported the results based on the data that was available and we have referred to the Hub ensemble and the median collectively as the “median methods”.

Reply to Reviewer #3

The reviewer writes: “Thank you for the opportunity to review Taylor and Taylor’s manuscript that compares different methods of combining COVID-19 mortality forecasts. Their research is an important contribution to the forecasting field. Analysis of ensembling approaches is sorely needed; and it is also timely as the COVID-19 epidemic continues to change and forecasts continue to be used for operational decision making. Moreover, as the field continues to move towards open and collaborative science, their work has direct application to future forecasting. Their clearly written manuscript has many strengths – it compares simple and more complex methods; it harnesses a year’s worth of data for evaluation; and focuses analyses on probabilistic distributions. In addition, there are also opportunities to improve the text and analysis (see comments below).”

Our response: We thank the reviewer for these positive comments.

MAJOR COMMENTS

Abstract

The reviewer writes: “Line 30: I suggest cutting the text about “extended and new datasets”. It’s not apparently clear to readers what the authors are referring to if they haven’t read their previous manuscript”

Our response: We have removed this sentence from the abstract, as recommended.

The reviewer writes: “Line 35: The COVID-19 Forecast Hub collects probability distributions as well as point forecasts. The authors are referring here to the 50% quantile of probability distributions as ‘point forecasts’. Please refer to them throughout the manuscript as the 50% quantile, rather than a point forecast. [Also, see comments under Material and Methods, as I recommend excluding all 50% quantile analyses. These forecasts are 1) not useful for outbreak response, and 2) misleading in terms of communicating uncertainty.]”

Our response: Following the reviewer’s advice, we have now removed from the analysis the evaluation of 50% quantile forecasts.

The reviewer writes: “Lines 35-37: Please list the evaluation metrics (mean interval score) and the combination methods here.”

Our response: The methods section of our abstract now provides more detail about the forecasting methods and their evaluation. The text now reads:

“Our comparison involved simple and more complex combining methods, including methods that involve trimming outliers or performance-based weights. Prediction accuracy was evaluated using interval scores, weighted interval scores, skill scores, ranks, and reliability diagrams.”

The reviewer writes: “Line 38 and 39: The first sentence of the Results is very confusing. Does the ‘average performance of these models’ refer to the Mean method, the Ensemble model, or something else?”

Our response: In our previous submission, this sentence referred to the average performance of all the models, which was calculated by considering the performance of each model individually and then averaging. We showed that this average performance did not exceed the performance of the mean combination (the ‘performance of the average was better than the average performance’). The dataset provided the opportunity to illustrate this standard result, which provides a motivation for combining. We have removed this analysis, and instead, summarise the results of the individual models included in our comparison in different ways in Tables 6 and 12. We hope that this is a more informative approach to illustrate that combining is better.

The reviewer writes: “Line 46: How do you define “sufficient”? Length of historical data was beyond the scope of the analysis, but the text implies this was considered within the analysis.”

Our response: We appreciate the point made by the reviewer, and in revising the text we have removed that particular sentence.

Introduction

The reviewer writes: “Line 89: Recommend not referring to reported data as ‘delayed’. ‘Reporting patterns’ is more accurate to describe the nature of the changes to the CSSE datasets and more accurately describes the descriptive nature of this part of the analysis.”

Our response: We have followed the reviewer’s recommendation by now referring to reporting “patterns” instead of “delays”.

Materials and Methods

The reviewer writes: “Lines 109-110: The number of the week’s is very confusing here. What is the first week of the epidemic, and are you referring to the first week in the US? Because forecasts were not collected at ‘week 1’ of outbreak, I suggest starting the start date and the end date, with the start date referred to as ‘week 1 of the analysis’.”

Our response: We accept the reviewer’s point that referring to epidemic week was confusing. In our revised paper, we now avoid week numbering and simply refer to the actual dates of forecast origins. We do this in the text and in the labelling of figures. In the methods subsection “Dataset” (2nd para), we state the start and end dates of our data:

“Our dataset included forecasts projected from forecast origins at midnight on Saturdays between 9 May 2020 and 8 January 2022 for forecasts of cumulative COVID-19 deaths (88 weeks of data), and between 6 June 2020 and 8 January 2022 for forecasts of incident deaths (84 weeks of data).”

The reviewer writes: “Lines 113-116: I strongly recommend changing the inclusion criteria for this analysis. Models that were not included in the COVID-19 Forecast Hub should also be excluded here. This will help with comparability between the ensemble approaches. The COVID-19 Forecast Hub excludes forecasts that are improbable, such as if the number of cases or deaths exceeds the population size, not based on their predictions being ‘too large’. Details are provided here: https://covid19forecasthub.org/doc/ensemble/”

Our response: We thank the reviewer for sending the link to the Hub ensemble inclusion criteria. We accept that including only forecasts that were included by the Hub would provide a better comparison and greater consistency in our comparison. We have followed the reviewer’s recommendation. We were unable to reproduce perfectly the performance of the Hub ensemble (its performance should have been identical to that of the median in the post-sample period) which suggests that some forecasts were removed since they were submitted and assessed by the Hub. We have reported the results based on the data that was available and we have referred to the Hub ensemble and the median collectively as the ‘median methods’. The text in the methods subsection “Dataset” now reads (end of 1st para):

“We only included forecasts that passed the Hub’s screening tests”.

The reviewer writes: “Line 122: In the evaluation, did you use the CSSE reported counts at the end of the analysis period? OR was each date evaluated against the data available at the time. Please clarify in the methods.”

Our response: We have clarified this point in the methods. The first sentence of the methods subsection entitled “Evaluating the interval forecasts” now reads:

“We evaluated out-of-sample prediction accuracy and calibration, with reference to the reported death counts on 15 January 2022, thus producing a retrospective evaluation.”

The reviewer writes: “Line 122: What is the rationale for focusing only on the 95% PIs? Information can be gained by examining all intervals (7 available) and weighting them (a method applied by Cramer et al, 2021: https://www.medrxiv.org/content/10.1101/2021.02.03.21250974v3)”

Our response: We now include in our analysis all (symmetric) intervals constructed from the 23 quantiles.
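A minimal sketch of this construction (assuming the Hub’s 23-quantile format listed by Reviewer #1 earlier in this letter; illustrative Python rather than the authors’ Stata/GAUSS code):

# The 23 quantile levels; pairing level q with 1 - q gives 11 symmetric central
# intervals, with the 0.5 quantile serving as the predictive median.
levels = [0.01, 0.025] + [round(0.05 * i, 2) for i in range(1, 20)] + [0.975, 0.99]

lower = [q for q in levels if q < 0.5]
intervals = [(q, round(1 - q, 3)) for q in lower]    # e.g. (0.025, 0.975) is the 95% interval
coverages = [round(1 - 2 * q, 2) for q in lower]     # nominal central coverage, e.g. 0.95

print(len(levels), len(intervals))                   # 23 quantile levels, 11 intervals
print(list(zip(intervals, coverages)))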

The reviewer writes: “Line 122: Because the interval score is a combination of calibration and sharpness, I wonder if the authors considered presenting all three metrics – calibration, sharpness, and IS? This might provide additional insights and if the authors are only presenting metrics at alpha of 0.05, they have the space to do so.”

Our response: For the 95% interval forecasts, and for all intervals considered together, we are now presenting results for mean interval score (MIS), mean weighted interval score (MWIS), mean ranks (as the statistical tests are based on these), and skill scores (in response to comments by another reviewer). We chose to summarise the performance over time, performance by type of model and sensitivity analysis by using skill scores, as we believe they provide a useful way of measuring how much the performance of each method differs from the established benchmark mean method (which is often hard to beat), and how the methods compare with the leading methods.

The reviewer writes: “Line 130: The field is moving away from communication of single numbers and towards ranges. Point predictions are rarely used to communicate forecasts in the COVID-19 pandemic, and generally discouraged. Thus, I don’t find the point forecast analysis to be useful and suggest removing it.”

Our response: Following the reviewer’s recommendation, we have removed 50% quantile (point) forecasts from the paper.

The reviewer writes: “Line 142-146: Changes in the reported death counts were not always due to reporting delays. For example, the large spike observed in the winter 2020/2021 in Ohio reflects a change in how the state defined a death. Even the everchanging landscape of the pandemic, I’d recommend referring to these anomalies as “reporting patterns”, and perhaps defining examples of backlogged deaths, reporting dumps, or changes in definitions.”

Our response: We acknowledge that changes in reporting patterns were not solely attributed to reporting delays and we have followed the reviewer’s recommendation by referring to “reporting patterns”. Please also see our response to this reviewer’s comments regarding line 248 and the states OK and WV.

The reviewer writes: “Lines 157-159: Please define ‘overconfidence’ and ‘underconfidence’ and describe how they relate to the various trimming methods.”

Our response: We decided to remove these terms, as we can see they are likely to cause confusion, and we thank the reviewer for highlighting this issue. Instead, we now refer to intervals that are overly narrow or overly wide. We do this when describing the exterior and interior trimming methods in the section entitled “Forecast combining methods”

The reviewer writes: “Line 161: Can you say more about the ‘previous best’? It’s not clear to me why you added this model or if reference 42 is the correct reference for it. What does this add to the analysis?”

Our response: We include the ‘previous best’ forecast, as we feel it is natural to consider model selection as an alternative to model combination. The ‘previous best’ method simply selects the forecasts of the model that has the best historical accuracy (judged in terms of the interval score). The reference that we provide is the correct reference for this method. Capistrán and Timmermann refer to it on page 430 of their article Forecast combination with entry and exit. J Bus Stats 27(4): 428-440. We appreciate that the previous best is not a combining method, and so we have moved its introduction to the section entitled “Comparison with individual models”. To be consistent with this, we now position this method in the bottom row of the results tables, below the results of the combining methods.
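A sketch of the ‘previous best’ selection rule described above (illustrative Python with hypothetical arrays; not the authors’ implementation):

import numpy as np

def previous_best(past_scores, current_forecasts):
    """Return the current forecast of the model with the lowest mean historical
    interval score (lower scores are better)."""
    mean_scores = np.nanmean(past_scores, axis=1)   # average past score per model
    best = int(np.nanargmin(mean_scores))           # model with the best history
    return current_forecasts[best]

# Example: 3 hypothetical models, 4 past weeks of scores, 23 quantile forecasts each
rng = np.random.default_rng(1)
past_scores = rng.uniform(5, 20, (3, 4))
current_forecasts = rng.normal(1000, 50, (3, 23))
print(previous_best(past_scores, current_forecasts).shape)    # (23,)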

The reviewer writes: “Line 171-172: Inclusion of individual models aids in the overall point that combining is better, however, the inclusion criteria is pretty strict here. Several models were consistent submitters to the COVID-19 Forecast Hub but missed a week here or there. Consider broadening this inclusion criteria.”

Our response: We appreciate that our inclusion criterion may seem strict, but it would not be appropriate to include an individual model in our main results tables, alongside the combining methods, unless forecasts were available from that model for all weeks in the out-of-sample period; otherwise, the evaluation measures would not be comparable. However, we accept that it is useful to provide some comparison of the accuracy of the individual models with the combining methods. To enable this, we now present summaries of the results for the individual models for which we had forecasts for at least half the out-of-sample period and at least half of the 52 locations. Tables 6 and 12 summarise skill scores based on scores calculated for each individual model and the benchmark method using only those weeks for which forecasts were available from that model. These tables and discussions of them are provided in the sections entitled “Performance of individual models for incident deaths” and “Performance of individual models for cumulative deaths”.

The reviewer writes: “Line 171-172: I’d like to confirm that COVIDhub Baseline model was not included in this set? It’s not designed to be a true forecast but is rather a comparator point for the submitted models.”

Our response: We concede that our reference to the inclusion of individual models in the comparison, in our previous submission, was confusing. In the analysis described in our revised manuscript, the set of forecasts that passed the Hub’s screening tests did include forecasts from the COVID Hub baseline model. We believe that the reviewer was referring to the individual models that were included in our comparison. We can confirm that, although the COVID Hub baseline model provided forecasts for more than half the locations and for more than half the out-of-sample period (the revised inclusion criteria for this analysis), we did not include this model in our comparison. In the section entitled “Comparison with individual models” the text reads:

“….we also summarise a comparison of the mean and median combinations with individual models for which we had forecasts for at least half the locations and at least half the out-of-sample period. Our inclusion criteria here is rather arbitrary, but the resulting analysis does help us compare the combining methods with the models of the more active individual teams. In this comparison, we excluded the COVID Hub baseline model, as it is only designed to be a comparator point for the models submitted to the Hub and not a true forecast”.

Results

The reviewer writes: “Line 190: What were the thresholds used for the categories? How many states were in each category?”

Our response: We divided the 51 states into three equal groups of 17 states. We felt that setting thresholds would be arbitrary and splitting equally would be a pragmatic attempt to deal with the problem of scores for some locations dominating. We now report the number of states in each category. The text of the methods subsection “Evaluating the interval forecasts” now reads (3rd para):

“We report results for the series of total U.S. deaths, as well as results averaged across all 52 locations. In addition, to avoid scores for some locations dominating, we also present results averaged for three categories, each including 17 states: high, medium and low mortality states. This categorisation was based on the number of cumulative COVID-19 deaths on 15 January 2022.”
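To illustrate the categorisation for readers of this letter (a sketch only; the file name and column names are hypothetical and not taken from the analysis code):

```python
import pandas as pd

# Rank the 51 states by cumulative deaths on the reference date and split
# them into three equal groups of 17 (high, medium, low mortality).
deaths = pd.read_csv("state_cumulative_deaths_2022-01-15.csv")  # columns: location, cumulative_deaths
ranked = deaths.sort_values("cumulative_deaths", ascending=False).reset_index(drop=True)
ranked["category"] = pd.cut(ranked.index, bins=[-1, 16, 33, 50],
                            labels=["high", "medium", "low"])
```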

The reviewer writes: “Please also add to the discussion the limitation of not including a time component to the analysis, as the US experienced spatial heterogeneity in the outbreak and even lower incidence states had peaks, when model performance was subpar.”

Our response: To provide some insight into the potential change in ranking of methods over time, we now report the MWIS results separately for the first and second halves of the out-of-sample period. The text in “Evaluating the interval forecasts” (3rd para) reads:

“All results are for the out-of-sample period, and to provide some insight into the potential change in ranking of methods over time, we present MWIS results separately for the first and second halves of the out-of-sample period.”

These results are given in sections entitled “Changes over time in performance for incident deaths” and “Changes over time in performance for cumulative deaths”.

The reviewer writes: “Line 230: Please note which statistical test you are referring to here.”

Our response: We now refer to the statistical test when presenting results in the sections entitled “Main results for incident deaths” (1st para) and “Main results for cumulative deaths” (1st para).

The reviewer writes: “Line 248: I think OK and WV are missing the dashed lines? If not, then there are no differences in reporting patterns over time”

Our response: We can now see that our explanation was not clear. In the methods subsection “Evaluating the interval forecasts” (4th para) the text now reads:

“We evaluated the effects of changes in reporting patterns on forecast accuracy. Changes in reporting patterns may involve reporting delays of death counts and changes in the definitions of COVID-19 deaths, both of which may lead to backdating of death counts and steep increases or decreases. Backdating of death counts would produce a problematic assessment in our retrospective evaluation of forecast accuracy, and sudden changes in death counts might cause some forecasting models to misestimate, particularly time series models. To obtain some insight, we compared reports of cumulative death counts for each location in files that were downloaded at multiple time points between 20 June 2020 and 15 January 2022. Locations for which there were notable effects of reporting patterns were excluded in sensitivity analysis.”

In the results section entitled “Impact of changes in reporting patterns and outliers on forecast accuracy” (1st para) the text reads:

“We observed changes in reporting patterns of death counts at 15 locations. Fig 6 shows examples of six locations where the changes were particularly noticeable. We found evidence of backdating in Delaware, Ohio, Rhode Island and Indiana. Backdating of historical death counts is shown as dashed lines. We noted a sharp drop in death counts in West Virginia in May 2021, suggesting a redefinition of COVID-19 deaths. There were sharp increases in death counts in Oklahoma in early April 2021 and in Delaware in late July 2021. We also observed sharp increases in death counts of two other locations, Missouri and Nebraska.”
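As an aside for readers of this letter, the snapshot comparison described in the methods text above can be sketched as follows; the file layout, column names and function name are hypothetical, and this is not the code used for the paper.

```python
import pandas as pd

def flag_backdating(snapshot_early_csv, snapshot_late_csv, threshold=0.0):
    """Compare two snapshots of reported cumulative deaths, downloaded on
    different dates, and flag locations whose historical counts were revised.

    Both CSVs are assumed to have columns: location, date, cumulative_deaths
    (layout and column names are illustrative only).
    """
    early = pd.read_csv(snapshot_early_csv, parse_dates=["date"])
    late = pd.read_csv(snapshot_late_csv, parse_dates=["date"])
    merged = early.merge(late, on=["location", "date"],
                         suffixes=("_early", "_late"))
    # A revision to an already-reported date suggests backdating
    merged["revised"] = (merged["cumulative_deaths_late"]
                         - merged["cumulative_deaths_early"]).abs() > threshold
    return sorted(merged.loc[merged["revised"], "location"].unique())
```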

The reviewer writes: “Line 255: Please present these data in either points or bars. Lines imply that the data are longitudinal.”

Our response: Our reason for joining the lines was to aid visibility, but we accept that this was misleading. We have removed the lines and instead present just points in our revised figures (Figs 7 and 8).

The reviewer writes: “Line 266: Caution with describing Ohio and the individual models here. Because I don’t know which team model 33 is, I can’t speak to the accuracy of the text. It should be noted that several teams noted the spike in reported deaths in Ohio and assumed it to be an error, while other teams assumed it to be truth. Because this nuance is not available here, I recommend deleting mention of the individual models from the text.”

Our response: We have redrafted this section and we no longer refer to individual models.

The reviewer writes: “Lines 278-281: Can you share that sensitivity analysis as a supplement?”

Our response: The results of the sensitivity analyses are now presented as supplementary information (S14 Table and S15 Table).

Discussion

The reviewer writes: “Line 298: The main limitation of the analysis is the lack of temporal analysis. The epidemic varied over time and space in the US, and consequently, so did the forecast performance. While I do not think that the authors need to include temporal analysis, they should include this as a limitation in the Discussion.”

Our response: To provide some insight into the potential change in ranking of methods over time, we now report the MWIS results separately for the first and second halves of the out-of-sample period. The text in “Evaluating the interval forecasts” (3rd para) reads:

“All results are for the out-of-sample period, and to provide some insight into the potential change in ranking of methods over time, we present MWIS results separately for the first and second halves of the out-of-sample period.”

These results are given in sections entitled “Changes over time in performance for incident deaths” and “Changes over time in performance for cumulative deaths”.

The reviewer writes: “Line 302: As written, this implies that forecast type and timing was assessed in the analysis. Please provide a citation since this was beyond the scope of the manuscript.” and also “Line 337: Same comments as line 302. Please reference.”

Our response: We have removed the text as we can see it was unclear.

MINOR COMMENTS

The reviewer writes: “Line 61: Source 15 has been published; please update the reference.”

Our response: We thank the reviewer for highlighting this point. The reference has been updated.

The reviewer writes: “Line 73: Center, in CDC, is spelled incorrectly.”

Our response: We thank the reviewer for highlighting this error. We have corrected the spelling.

The reviewer writes: “Line 93 and 98: Reference 33 is incorrect. It should be reference 26 here, as reference 33 refers to the reported data, not the forecast data or the collaborations surrounding the forecast data.”

Our response: We thank the reviewer for highlighting this error. We have corrected the citations.

The reviewer writes: “Line 130 and 131: The sentence about point forecasts should be a new paragraph”

Our response: We now no longer discuss point forecasting.

Attachment

Submitted filename: Response to reviewers.docx

Decision Letter 1

Maurizio Naldi

15 Mar 2022

Interval forecasts of weekly incident and cumulative COVID-19 mortality in the United States: A comparison of combining methods

PONE-D-21-30265R1

Dear Dr. Taylor,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Maurizio Naldi

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: (No Response)

Reviewer #3: Thank you to the authors for addressing all comments. In particular, the adjustments to include all quantiles, and thus better address uncertainty, have improved the manuscript. I have one very minor comment: The authors note that the COVID-19 Hub Ensemble method was switched to a median in July 2020. While this is correct, there was one additional change to the methods in November 2021. As of November 2021, the ensemble used a weighted approach based on WIS in 12 prior weeks (https://covid19forecasthub.org/doc/ensemble/). This is worth noting in the text.
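For context, one simple way to turn recent WIS values into performance-based combination weights can be sketched as follows; this is an illustrative sketch only, the Hub's documented procedure at the link above may differ, and the function and variable names are hypothetical.

```python
import numpy as np

def inverse_wis_weights(recent_wis):
    """Illustrative performance-based weights (a sketch, not the Hub's code).

    recent_wis: dict mapping model name -> array of WIS values over the
    most recent 12 weeks (lower WIS indicates better accuracy).
    Returns weights proportional to the inverse of each model's mean WIS,
    normalised to sum to one.
    """
    mean_wis = {m: float(np.mean(w)) for m, w in recent_wis.items()}
    inverse = {m: 1.0 / s for m, s in mean_wis.items()}
    total = sum(inverse.values())
    return {m: v / total for m, v in inverse.items()}
```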

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: No

Acceptance letter

Maurizio Naldi

21 Mar 2022

PONE-D-21-30265R1

Interval forecasts of weekly incident and cumulative COVID-19 mortality in the United States: A comparison of combining methods

Dear Dr. Taylor:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Maurizio Naldi

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Data availability for forecasts of cumulative COVID-19 deaths.

    * Based on information recorded on the COVID-19 Hub with citations as recorded on 25/2/22; a Only provided forecasts of numbers of cumulative COVID-19 deaths; b Only provided forecasts of numbers of incident COVID-19 deaths.

    (TIF)

    S1 Table. Individual forecasting models.

    (PDF)

    S2 Table. For incident mortality, 95% interval MIS and MWIS for each prediction horizon.

    Lower values are better. a best method for each horizon in each column; b score is significantly lower than the mean combination; c score is significantly lower than the median combination.

    (PDF)

    S3 Table. For incident mortality, calibration for all locations.

    (PDF)

    S4 Table. For incident mortality, calibration for U.S.

    (PDF)

    S5 Table. For incident mortality, calibration for high mortality locations.

    (PDF)

    S6 Table. For incident mortality, calibration for medium mortality locations.

    (PDF)

    S7 Table. For incident mortality, calibration for low mortality locations.

    (PDF)

    S8 Table. For cumulative mortality, 95% interval MIS and MWIS for each prediction horizon.

    Lower values are better. a best method for each horizon in each column; b score is significantly lower than the mean combination; c score is significantly lower than the median combination.

    (PDF)

    S9 Table. For cumulative mortality, calibration for all locations.

    (PDF)

    S10 Table. For cumulative mortality, calibration for U.S.

    (PDF)

    S11 Table. For cumulative mortality, calibration for high mortality locations.

    (PDF)

    S12 Table. For cumulative mortality, calibration for medium mortality locations.

    (PDF)

    S13 Table. For cumulative mortality, calibration for low mortality locations.

    (PDF)

    S14 Table. Sensitivity analysis for incident mortality, skill scores of the 95% interval MIS and MWIS after excluding locations for which there were noticeable changes in reporting patterns.

    Shows percentages. Higher values are better. a best method in each column.

    (PDF)

    S15 Table. Sensitivity analysis for cumulative mortality, skill scores of the 95% interval MIS and MWIS after excluding locations for which there were noticeable changes in reporting patterns.

    Shows percentages. Higher values are better. a best method in each column.

    (PDF)

    Attachment

    Submitted filename: Response to reviewers.docx

    Data Availability Statement

    Data were downloaded from the public GitHub data repository of the COVID-19 Hub at https://github.com/reichlab/covid19-forecast-hub. The code used to generate the results is publicly available on Zenodo at https://doi.org/10.5281/zenodo.6300524.


    Articles from PLoS ONE are provided here courtesy of PLOS
