2024 Dec 18;188(4):1299–1325. doi: 10.1093/jrsssa/qnae136

Evaluating infectious disease forecasts with allocation scoring rules

Aaron Gerding 1,✉,2, Nicholas G Reich 2, Benjamin Rogers 3, Evan L Ray 4
PMCID: PMC12371526  NIHMSID: NIHMS2099256  PMID: 40881183

Abstract

Recent years have seen increasing efforts to forecast infectious disease burdens, with a primary goal being to help public health workers make informed policy decisions. However, there has been only limited discussion of how predominant forecast evaluation metrics might indicate the success of policies based in part on those forecasts. We explore one possible tether between forecasts and policy: the allocation of limited medical resources so as to minimize unmet need. We use probabilistic forecasts of disease burden in each of several regions to determine optimal resource allocations, and then we score forecasts according to how much unmet need their associated allocations would have allowed. We illustrate with forecasts of COVID-19 hospitalizations in the U.S., and we find that the forecast skill ranking given by this allocation scoring rule can vary substantially from the ranking given by the weighted interval score. We see this as evidence that the allocation scoring rule detects forecast value that is missed by traditional accuracy measures and that the general strategy of designing scoring rules that are directly linked to policy performance is a promising direction for epidemic forecast evaluation.

Keywords: epidemiology, forecast evaluation, proper scoring rules, public health, resource allocation

1. Introduction

Infectious disease forecasting models have emerged as important tools in public health outbreak response. The predictions they provide increasingly inform decisions regarding a wide variety of countermeasures intended to reduce transmission and mitigate the severity of disease outcomes. For example, estimates of the onset time of the flu season have been used in developing national vaccination strategies (Igboh et al., 2023), and forecasts of Ebola and diphtheria dynamics have been made with the clearly stated goal of helping local public health workers choose the timing and location of interventions in settings where resources are severely constrained (Camacho et al., 2015; Finger et al., 2019; Meltzer et al., 2014; Rainisch et al., 2015). More recently, in the context of the outbreak of COVID-19 across the U.S., Bertsimas et al. (2021) used forecasts as inputs to decision tools for the interstate reallocation of ventilators and ICU capacity, and to recommend vaccine trial sites to a major trial sponsor. Fox et al. (2022) similarly used predictive models to inform intrastate resource and care site planning, as well as local community guidelines for masking, travelling, dining, and shopping (University of Texas at Austin, 2022).

In the wake of the COVID-19 pandemic, this trend has been followed by calls for infectious disease forecasts to be not only designed, but also evaluated in ways that align specifically with how forecasts can be used to inform such outbreak control decisions (Bilinski et al., 2023; Marshall et al., 2024). This contrasts, however, with the historically standard practice of measuring the quality of disease forecasts using general purpose accuracy and skill scores, especially those that have implementations available in existing software when the relevant outbreak occurs. For point forecasts, the root mean square error (RMSE) (e.g. Papastefanopoulos et al., 2020) and the mean absolute error (MAE) (e.g. Johansson et al., 2016) are common choices. For probabilistic forecasts, which are the focus of this paper, researchers have often relied on the logarithmic score (LS). For example, the LS has been used to evaluate the skill of US seasonal influenza forecasts (McGowan et al., 2019; Reich et al., 2019) as well as forecasts targeting surveillance measures of dengue incidence in Peru and Puerto Rico (Johansson et al., 2019). More recently, the continuous ranked probability score (CRPS) and a discretized version adapted to multiquantile forecasts, the weighted interval score (WIS) (Bracher et al., 2021), have gained prominence. For example, the CRPS was used to assess probabilistic forecasts (based on random effect models) of dengue incidence at the district level in Vietnam (Colón-González et al., 2021). Likewise, the WIS has been used during the COVID-19 pandemic to evaluate forecasts of observed cases, hospitalizations and deaths in the U.S. and Europe, as reported by municipal, state, and federal surveillance systems (Cramer, Ray, et al., 2022; Fox et al., 2022; Sherratt et al., 2023).

While any of these scores can be interpreted through the lens of decision theory, they lack explicit connections between how a forecast is evaluated and how that forecast was used in practice. A key impetus for the present work was to develop a score that is interpretable to decision-makers by measuring the extent to which forecasts lead to improved decisions. In addition, the allocation score we propose is expressed in units that are intelligible to decision-makers.

A general phenomenon at play here—one that has been observed repeatedly over the past few decades in other fields, such as finance, supply chain management, and meteorology—is that while scores, such as RMSE, MAE, LS, CRPS, and WIS can describe the quality of a forecast in terms of how well it corresponds to the observed disease outcome, they will often fail to register the value of a forecast in the context of a specific decision. We refer the reader to Yardley and Petropoulos (2021) and the references collected therein (especially the foundational Murphy, 1993) for a general overview touching on a wide range of forecasting contexts and also to Pesaran and Skouras (2002) for a clear discussion from an econometric perspective.

Despite this now well-developed discussion of the quality-value distinction in the larger forecasting community, we are aware of only a limited literature attempting to connect the value of infectious disease forecasts to their impact on and through policy. Even within this body of work, we have found the discussion of such a connection to be at a formally and quantitatively imprecise stage. In Ioannidis et al. (2022), the possible negative consequences of inaccurate forecasts of infectious disease are discussed, but there is no attempt to quantify the utility or loss incurred as a result of those forecasts. Bilinski et al. (2023) explore ways in which predictive classifiers of local COVID-19 risk levels in the U.S. could be tuned to policymaker preferences for different costs associated with over and underreaction to disease dynamics, but they do not clearly identify the source of these costs or how they depend on quantifiable policy choices. A similar discussion related to dengue countermeasures in Vietnam appears in Colón-González et al. (2021). A novel version of the WIS informally motivated by utility considerations is developed by Marshall et al. (2024), but the score is not derived in a decision-theoretic manner. There is also a thread of literature that frames infectious disease modelling as a component of a larger system for understanding how policy goals, means, and choices interact and constrain one another. As an example, Probert et al. (2016) explore how policy recommendations ought to flow from a possibly incongruous set of simulation-based projection models of a hypothetical foot-and-mouth disease outbreak when there are ranges of plausible responses and stakeholder interests. Decision theory plays a prominent role here, but not explicitly as a way to evaluate the choices made in developing the models.

In this work, we begin to fill this gap between the ways that infectious disease forecasts have traditionally been evaluated, and the ways that they have been used to support public health policy. To do so, we consider a setting in which forecasts are used to help determine the allocation of a limited quantity of medical supplies across multiple regions. In Section 2, we define a new forecast scoring rule—the allocation score—that evaluates forecasts based on how beneficial resource allocations derived from them would turn out to be. In Section 3, we present an illustrative analysis using the allocation score (AS) to evaluate forecasts of hospital admissions in the U.S. that were made leading up to and during the Omicron wave that peaked in January of 2022. This analysis is ‘synthetic’ insofar as it is not intended to correspond to any specific historical record of allocation decisions that could have been supported by hospitalization forecasts during this period. However, we view the general allocation problem on which our framework is based as a versatile template for formalizing real-world decisions that must constantly be made in real-time by public health administrators around the globe, especially those related to hospital capacity, ventilator usage, doses needed for specific medications and other situations where an outbreak creates sudden and highly variable demand for potentially scarce resources.

2. The allocation score

We begin with an informal description of the AS and some examples illustrating its key characteristics in Section 2.1. In Section 2.2, we develop the AS more carefully, using a decision theoretic procedure for deriving proper scoring rules. (See Appendix A for a definition of a proper scoring rule and a more technical discussion of the procedure.) In Section 2.3, we note that another group of common scores including the quantile score, WIS, and CRPS, can also be derived from decision theoretic foundations—starting from a different decision-making context.

2.1. Overview of allocation scoring

Suppose that a decision-maker is tasked with determining how to allocate $K$ available units of a resource across $N$ locations. If the decision-maker is provided with a multivariate forecast $F$ where each marginal forecast distribution $F_i$ predicts resource need in a particular location, one option is to choose the resource allocation that minimizes the expected total unmet need according to the forecast. We will give a more precise mathematical statement in Section 2.2, but informally, the total expected unmet need according to the forecast is

$$\sum_{i=1}^{N} \mathbb{E}_{F_i}[\text{unmet need in location } i], \tag{1}$$

where the unmet need in a particular location is the amount by which resource need in that location exceeds the number of resource units allocated there (and zero if the allocation covers the need). This allocation problem has an intuitively appealing solution: allocate so that the probabilities of need exceeding allocation in various locations are as close to each other as possible. This will lead to the allocations provided by $F$ being quantiles of the marginal distributions $F_i$ for some single probability level $\tau$ that is shared by all locations.

After time passes and the actual level of resource need has been observed, the value of a selected allocation can be measured by comparing the actual need in each location to the amount of resources that were sent there. Specifically, we compute the total unmet need that resulted from the selected allocation:

$$\sum_{i=1}^{N} \text{unmet need in location } i. \tag{2}$$

We regard one allocation as better than another (with respect to the realized need) if it results in lower total unmet need, and accordingly regard one forecast as better than another (again with respect to the realized need) if the allocation derived from the former results in lower total unmet need than does the allocation derived from the latter.

The AS of the forecast F is the avoidable unmet need that results from using the allocation that minimizes the expected unmet need according to that forecast. By ‘avoidable unmet need’, we mean that the AS does not include the amount of unmet need that was inevitable simply because the amount of available resources K was less than the need for resources. Rather, the AS measures the unmet need that could have been avoided by an oracle that knows exactly how much need will occur in each location and divides the amount K so that nothing is wasted in one location while it could be put to use in another. The best and lowest possible AS is 0. A positive AS indicates that an oracle with perfect foreknowledge could have selected an allocation that met more need than the allocation suggested by F.

Example 1

Suppose we have a forecast $F$ for need in two locations such that the marginals $F_1$ and $F_2$ are exponential distributions with scale parameters $\sigma_1=1$ and $\sigma_2=4$. The means of these distributions are given by the scale parameters $\sigma_i$. When the marginal forecasts are exponential distributions, it can be shown that the optimal allocation divides the available resources among the locations proportionally to the scale parameters $\sigma_i$ (see Appendix B). If $K=5$ units of the resource are available, the optimal allocation according to $F$ would be 1 and 4 resource units in locations 1 and 2, respectively. Increasing $K$ to 10 changes this optimal allocation to 2 and 8. Figure 1 illustrates the situation.

Figure 1.

An illustration of the resource allocation problem in Example 1. There are $N=2$ locations, with predictive distributions $F_1=\mathrm{Exp}(1)$ and $F_2=\mathrm{Exp}(1/4)$ (rate parameterization, i.e. scale parameters 1 and 4). The cumulative distribution functions of these distributions are illustrated in the panels at bottom and right. In the center panel, the background shading corresponds to the expected loss according to these forecasts. Diagonal lines in the center panel indicate resource constraints at $K=5$ and $K=10$ units; any point along those lines corresponds to an allocation that meets the resource constraint. For these forecasts, the optimal allocations are (1,4) for $K=5$ and (2,8) for $K=10$. These allocations are at the point on the constraint line where the expected loss is smallest, which also corresponds to the point where a level set of the expected loss surface is tangent to the constraint.

Next suppose that we observe resource needs of 1 and 10 in locations 1 and 2, respectively. Based on these observed needs, we evaluate the allocation suggested by the forecast by calculating the amount of unmet need that resulted from that allocation over and above what was unavoidable given the resource constraint. With K=5 available resource units, the allocation based on the forecast exactly meets the observed need in location 1, but it leaves 6 units of need unmet in location 2. However, these 6 units of unmet need are unavoidable under the constraint K=5: allocating 0 units of resources to location 1 and 5 to location 2, for example, still results in a total unmet need of 6 across both locations. This gives an AS of 0 for the forecast when K=5. On the other hand, when K=10, the forecast’s allocation results in 10 − 8 = 2 units of unmet need in location 2 despite leaving no need unmet in location 1. In this case, the oracle would be able to prevent all but 1 of the total 11 units of need from going unmet, for example by allocating 1 unit of resources to location 1 and the remaining 9 units of resources to location 2. So for K=10, the forecast gets an allocation score of 1 (= 2 realized − 1 unavoidable) in units of avoidable unmet need.
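The arithmetic of Example 1 can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the proportional allocation rule is the exponential special case quoted above from Appendix B, not a general solver.

```python
def exponential_allocation(scales, K):
    """Split K proportionally to the exponential scale parameters
    (the special-case result quoted in Example 1)."""
    total = sum(scales)
    return [K * s / total for s in scales]


def total_unmet_need(allocation, need):
    """Sum over locations of the need left unmet by the allocation."""
    return sum(max(0.0, y - x) for x, y in zip(allocation, need))


def avoidable_unmet_need(allocation, need, K):
    """Realized unmet need minus the part no allocation of K could avoid."""
    unavoidable = max(0.0, sum(need) - K)
    return total_unmet_need(allocation, need) - unavoidable


observed_need = [1.0, 10.0]
for K in (5.0, 10.0):
    x = exponential_allocation([1.0, 4.0], K)
    print(K, x, avoidable_unmet_need(x, observed_need, K))
# 5.0 [1.0, 4.0] 0.0
# 10.0 [2.0, 8.0] 1.0
```

The two printed scores reproduce the values 0 and 1 derived in the text.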

These scores illustrate a general result: ASs for a forecast will tend to be larger when the resource constraint is close to the observed need, because this is when it matters most which locations are allocated more or less resources. If the amount of available resources is small relative to the observed need, any allocation of those limited resources will result in a large amount of unmet need. If the amount of available resources is comparatively large, it becomes less important which locations receive relatively more or fewer resources because all locations are likely to receive enough resources to meet their need. In either of these extremes of resource availability, the avoidable unmet need that arises from the allocation suggested by a forecast (i.e. the forecast’s AS) will tend to be small.

Example 2

Now consider a different forecast that also has exponential distributions for resource need in each location, but that has the scale parameters $\sigma_1=2$ and $\sigma_2=8$, twice as large as the scale parameters of the forecast in Example 1. Because the optimal allocation is proportional to the scale parameters, this forecast would lead to the same allocations as the forecast in Example 1, and would therefore be assigned the same AS.

Examination of these results leads to two observations. First, the reason that these forecasts had a positive (i.e. nonoptimal) AS at K=10 is that they did not get the relative magnitude of resource need across the two locations right: the realized need was 10 times as large in location 2 as in location 1, but the forecasts indicated that the resource allocation for location 2 should be only four times the allocation for location 1. At its core, the AS measures whether the forecast accurately captures the relative magnitudes of resource need across different locations, which is precisely the information that is needed to allocate resources to those locations subject to a fixed resource constraint.

A second observation is that the forecasts in Examples 1 and 2 predicted different mean levels of resource needs, but had the same AS. The AS does not directly measure whether the forecasts correctly capture the absolute magnitude of resource need in each individual location. This stands in contrast to other common scoring methods that aggregate scores such as log score, CRPS, or WIS for each location, where a poor forecast for one location is penalized regardless of alignments in other units.

2.2. A decision theoretic development of the AS

We give a high-level review of a general procedure for developing proper scoring rules that are tailored to specific decision-making tasks in Section 2.2.1, and then in Section 2.2.2, we apply that procedure to develop the AS based on the task of deciding on how to allocate a fixed supply of resources across multiple locations. In Section 2.2.3, we consider a small extension where the resource constraint is not known, or it is desired to consider the value of forecasts across a range of decision-making scenarios. This gives rise to the integrated allocation score (IAS).

2.2.1. The decision theoretic set-up for forecast evaluation

Decision theory investigates how decisions are made by formalizing a decision as the process of selecting an action x from a specified set X of potential actions while taking into consideration the possible consequences of x under future states of the world. Let Y be a quantifiable aspect of the future which will help to determine the consequences of x. Y is a random variable inasmuch as it is indeterminate when the decision is made. We refer to a realization y of Y as an outcome. A fundamental strategy in decision theory is to assume that the consequences of an action x under outcome y can be assigned a numeric value, or loss, measuring the (lack of) success of x.

Some examples of actions are levels of investment in measures designed to mitigate severe disease outcomes such as hospital beds, ventilators, medication, or medical staff, with X being the set of all feasible levels of investment. An outcome y determining the consequences of such an action might be the number of individuals who eventually become sick and would benefit from the mitigation measure. In our example, x will succeed to the extent that it meets the realized need y, so that greater unmet need will incur greater loss.

When a forecast F of Y is available, a decision-maker might choose that action x which yields the lowest expected loss according to F. The loss of the action x chosen according to F can then, in turn, be interpreted as the value of F as an input to the decision-making process under the realized outcome y of Y.

The strategy of minimizing the expected loss according to F might be used if the decision-maker trusted that F was accurate and did not have any additional information that was not captured in F. However, here we use it only as a device for evaluating forecasts by examining the consequences of hypothetical decisions that would be made when using only F as an input. We return to this point in the discussion.

We arrive at a three-step procedure for developing scoring rules for probabilistic forecasts:

  1. Specify a loss function $s(x,y)$ that measures the loss associated with taking action $x$ when outcome $y$ eventually occurs.

  2. Given a probabilistic forecast $F$, determine the Bayes act $x^F$ that minimizes the expected loss under the distribution $F$.

  3. The scoring rule for $F$ calculates the score as the loss incurred when the Bayes act was used: $S(F,y)=s(x^F,y)$.

This is a general procedure that may be applied in settings where it is possible to specify a quantitative loss function. We call a scoring rule obtained from this procedure a Bayes scoring rule (noting that to our knowledge, this terminology is not standard). In Appendix A, we review the result from the scoring rule literature (e.g. Dawid, 2007; Gneiting & Raftery, 2007) that Bayes scoring rules are proper by construction.
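The three steps above can be sketched generically in Python. The sketch makes two simplifying assumptions that are not part of the paper: the forecast is represented by a finite list of equally weighted samples, and the action set is a finite list of candidates, so the Bayes act can be found by exhaustive search. The function names and the squared-error loss are illustrative only.

```python
def bayes_act(loss, actions, forecast_samples):
    """Step 2: the candidate action with the lowest expected loss under the
    forecast, here approximated by an average over equally weighted samples."""
    def expected_loss(x):
        return sum(loss(x, y) for y in forecast_samples) / len(forecast_samples)
    return min(actions, key=expected_loss)


def bayes_score(loss, actions, forecast_samples, outcome):
    """Step 3: the loss actually incurred by the Bayes act once the outcome is known."""
    return loss(bayes_act(loss, actions, forecast_samples), outcome)


# Step 1: a loss function (squared error, chosen only for illustration).
def squared_error(x, y):
    return (x - y) ** 2


actions = list(range(0, 11))      # hypothetical candidate actions 0..10
samples = [2, 4, 6]               # hypothetical forecast samples
print(bayes_act(squared_error, actions, samples))       # 4
print(bayes_score(squared_error, actions, samples, 7))  # 9
```

Under squared-error loss the Bayes act is the forecast mean (4 here), so the score for an outcome of 7 is the squared error 9. Swapping in a different loss function changes the Bayes act and therefore the scoring rule.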

2.2.2. The AS for a fixed resource constraint

In our setting, an action is a vector $x=(x_1,\ldots,x_N)$ specifying the amount of resources allocated to each of $N$ locations. We require that each $x_i$ is nonnegative, and that the total allocation across all locations equals the amount of available resources, $K$: $\sum_{i=1}^{N}x_i=K$. The set $X$ consists of all possible allocations that satisfy these constraints. The eventually realized resource need in each location is denoted by $y=(y_1,\ldots,y_N)$; this is a realization (unknown at the time of decision-making) of the random vector $Y=(Y_1,\ldots,Y_N)$. A forecast of the random need $Y$ for each location is written as $F=(F_1,\ldots,F_N)$. We assume need is nonnegative and that the forecasts reflect that, i.e. the support of each $F_i$ is a subset of $\mathbb{R}_+$. Finally, we assume that each unmet unit of need in any location incurs a fixed loss of $L>0$, so that if the selected resource level $x_i$ in location $i$ is less than the realized need $y_i$, a loss of $L(y_i-x_i)$ results. A variety of extensions to this set-up are possible; for example, we might account for storage costs for resources that go unused, allow for a different loss per unit of unmet need in each location, or account for resource transportation costs. In this work, we choose to keep the loss function relatively simple to focus on the core ideas.

It is helpful to clearly distinguish between the time $t_d$ when a decision is made about a public health resource allocation and the time $t_r$ when resource needs that might be addressed by that allocation occur. Our set-up assumes that $t_d<t_r$ and that the resource in question can only meet the need realized at $t_r$, not prevent it. That is, we stipulate that the choice of allocation $x$ has no effect on the distribution of the random vector $Y$. In the context of infectious disease, this means that we do not consider resources, such as vaccines, that are intended to reduce the number of people who will become sick at some point in the future. Instead, our set-up addresses resources like hospital beds, oxygen supply, or ventilators which are intended to meet the medical needs of patients who are already sick. Additionally, our set-up addresses decision-making that is related to resource needs only at the time $t_r$; we do not consider sequences of multiple decisions that are made over time or account for the impact of decisions on resource needs at any time other than $t_r$. We outline some opportunities to extend our work to more complex decision-making settings in the discussion.

With this problem formulation in place, we can develop a proper scoring rule following the procedure in Section 2.2.1.

Step 1: Specify a loss function

The loss incurred by an allocation x under an outcome y is the sum of contributions from unmet need in each location:

$$s_A(x,y)=\sum_{i=1}^{N} L\,\max(0,\,y_i-x_i). \tag{3}$$

Here, $\max(0,y_i-x_i)$ is the unmet need in location $i$, which is given by $y_i-x_i$ if the realized need $y_i$ in location $i$ is greater than the amount $x_i$ allocated to that location, or 0 if the amount $x_i$ allocated to location $i$ is greater than or equal to the realized need. Also, $L$ is a constant scalar value, the same across all locations, specifying the ‘cost’ of one unit of unmet need.

Step 2: Given a probabilistic forecast F, identify the Bayes act

The Bayes act associated with the forecast, $x^{F,K}$, is the allocation that minimizes the expected loss, that is, the solution of the allocation problem associated with $K$:

$$\underset{0\le x}{\text{minimize}}\ \ \mathbb{E}_F[s_A(x,Y)]\quad\text{subject to}\quad\sum_{i=1}^{N}x_i=K, \tag{4}$$

where $\mathbb{E}_F[s_A(x,Y)]=\sum_{i=1}^{N} L\,\mathbb{E}_{F_i}[\max(0,\,Y_i-x_i)]$ sums the expected loss due to unmet need across all locations.

In Appendix B, we show that the components of the Bayes act are quantiles $x_i^{F,K}=F_i^{-1}(\tau^{F,K})$ at a probability level $\tau^{F,K}$ that depends on the forecast $F$ and the resource constraint $K$, but is shared across all locations. This probability level is the level at which the resource constraint is satisfied: $\sum_{i=1}^{N}F_i^{-1}(\tau^{F,K})=K$. This tells us that to allocate optimally (according to $F$), we must divide resources among the locations so that there is an equal forecasted probability in every location that the allocation is sufficient to meet resource need. This solution to the allocation problem is well known in inventory management and is often attributed to Hadley and Whitin (1963).
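One way to find the shared probability level numerically is bisection on the total of the marginal quantiles, which is nondecreasing in the probability level. The sketch below is not the procedure of Appendix B; it assumes that each marginal quantile function is continuous and strictly increasing and that $K$ lies strictly between the totals at levels 0 and 1. All function names are hypothetical.

```python
import math


def bayes_allocation(quantile_fns, K, tol=1e-12):
    """Find tau with sum_i F_i^{-1}(tau) = K by bisection, then return
    (tau, allocation), where the allocation is the vector of tau-quantiles."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(q(mid) for q in quantile_fns) < K:
            lo = mid      # allocation total too small: raise the level
        else:
            hi = mid      # allocation total large enough: lower the level
    tau = 0.5 * (lo + hi)
    return tau, [q(tau) for q in quantile_fns]


def exponential_quantile(scale):
    """Quantile function of an exponential distribution with the given scale."""
    return lambda tau: -scale * math.log1p(-tau)


# Marginals of Example 1 (scales 1 and 4) with K = 5.
tau, allocation = bayes_allocation(
    [exponential_quantile(1.0), exponential_quantile(4.0)], K=5.0
)
print(tau, allocation)
```

For the exponential marginals of Example 1 the result should agree, up to the bisection tolerance, with the proportional allocation (1, 4), at a shared level of $1-e^{-1}$.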

Step 3: Define the scoring rule

We can now use the Bayes act to define a proper scoring rule for the probabilistic forecast F. Consider first the raw score defined as

$$S_A^{\mathrm{raw}}(F,y;K)=s_A(x^{F,K},y)=\sum_{i=1}^{N} L\,\max(0,\,y_i-x_i^{F,K}). \tag{5}$$

This measures the total unmet need across all locations that results from using the Bayes allocation associated with the forecast $F$ when the actual level of need in each location is observed to be $y_i$.

To make this a more easily interpreted measure of forecast performance, we will adjust the raw score by subtracting the minimum loss achievable by an oracle allocator which has precise foreknowledge of the outcomes $y_i$. When the oracle has sufficient resources to meet the total need, i.e. when $\sum_{i=1}^{N}y_i\le K$, the oracle’s loss is zero and the allocation score coincides with the raw score. On the other hand, when the oracle cannot cover all need and incurs a loss of $L\left(\sum_{i=1}^{N}y_i-K\right)>0$, we adjust the raw score by this loss. The oracle-adjusted score can therefore be written as

$$S_A(F,y;K)=S_A^{\mathrm{raw}}(F,y;K)-L\,\max\Big(0,\,\sum_{i=1}^{N}y_i-K\Big) \tag{6}$$
$$\phantom{S_A(F,y;K)}=L\,\Big\{\sum_{i=1}^{N}\max(0,\,y_i-x_i^{F,K})-\max\Big(0,\,\sum_{i=1}^{N}y_i-K\Big)\Big\}. \tag{7}$$

The oracle adjustment aligns with a common theme in economic decision theory that opportunity loss (often known as regret or (negative) relative utility) is often a more important quantity than absolute loss (see e.g. Diecidue & Somasundaram, 2017).

2.2.3. Integrating the AS across resource constraint levels

The AS $S_A$ developed in the previous section evaluates the forecast distributions $F$ based on a single probability level $\tau^{F,K}$. This is appropriate if the resource constraint $K$ is a known constant. However, if $K$ is not precisely known at the time of decision-making or there is interest in measuring the value of forecasts across a range of decision-making scenarios with different resource constraints, we can use an IAS that integrates the AS across values of $K$, weighting by a distribution $p$:

$$S_{\mathrm{IAS}}(F,y)=\int S_A(F,y;K)\,p(K)\,dK. \tag{8}$$

We note that the device of considering a range of hypothetical decision-makers or decision-making problems with different problem parameters has been employed in the past (e.g. Murphy, 1993).

In practice, it may be challenging to compute the integral in (8). In the applied investigations below, we resort to a discretization, evaluating at a grid of values of the resource constraint K. A discrete treatment along these lines will often be natural, in settings where the resource in question is nondivisible (e.g. ventilators, masks).
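A discretized version of the integrated score might look like the following sketch. The uniform default weights, the three-point grid, and the function names are assumptions made for illustration, and the Bayes allocation is supplied through the closed-form proportional rule for the exponential forecast of Example 1 rather than a general solver.

```python
def allocation_score(x, y, K, L=1.0):
    """Oracle-adjusted allocation score for allocation x, realized need y
    and resource constraint K, with loss L per unit of unmet need."""
    raw = sum(L * max(0.0, yi - xi) for xi, yi in zip(x, y))
    return raw - L * max(0.0, sum(y) - K)


def integrated_allocation_score(allocate, y, K_grid, weights=None, L=1.0):
    """Weighted sum of allocation scores over a grid of constraints.
    allocate(K) must return the forecast's Bayes allocation for constraint K."""
    if weights is None:
        # Assumption: equal weight on every grid point.
        weights = [1.0 / len(K_grid)] * len(K_grid)
    return sum(w * allocation_score(allocate(K), y, K, L)
               for K, w in zip(K_grid, weights))


scales = [1.0, 4.0]                      # exponential forecast of Example 1


def allocate(K):
    """Proportional allocation, valid for exponential marginals only."""
    return [K * s / sum(scales) for s in scales]


y = [1.0, 10.0]                          # observed need from Example 1
K_grid = [5.0, 10.0, 15.0]               # hypothetical grid of constraints
print({K: allocation_score(allocate(K), y, K) for K in K_grid})
# {5.0: 0.0, 10.0: 1.0, 15.0: 0.0}
print(integrated_allocation_score(allocate, y, K_grid))
```

On this grid only the middle constraint, which is closest to the total observed need of 11, contributes a nonzero score, so the equally weighted integrated score is one third.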

2.3. Comparisons with other scores

The WIS was proposed in 2020 as a way to score forecasts that were being made in the early stages of the COVID-19 pandemic (Bracher et al., 2021); equivalent scores had also been used in previous forecast evaluation efforts (e.g. Hong et al., 2016). The WIS is a proper scoring rule for forecasts that use a set of quantiles to represent a probabilistic forecast distribution. Scores are calculated as a weighted sum of interval scores at different probability levels (e.g. 50% prediction intervals, 80% PIs, 95% PIs, etc.). Larger interval scores indicate less skilful forecasts. An interval score consists of (a) the width of the interval, with larger intervals receiving higher scores (higher scores indicate less accuracy), and (b) a penalty if the interval does not cover the eventual observation, which increases the further away the interval is from the observed value. Equivalently, the WIS can also be characterized as a weighted sum of quantile scores for each individual predictive quantile. The quantile score for a particular quantile level assigns an asymmetric penalty to predictions that are too high or too low, with the relative sizes of the penalties set so that in expectation the score is minimized by the given quantile of the distribution. The most commonly used version of WIS is one that uses an equal weighting of all quantile levels, in which case WIS approximates the CRPS, a commonly used score for probabilistic forecasts. It is important to note that this weighting was proposed because the resulting score approximates the CRPS, and not because it aligned with any particular public health decision-making rationale. For more mathematical detail on the WIS, we point readers to Bracher et al. (2021).
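The equivalence between the interval-score form and the quantile-score form of the WIS can be checked numerically. The sketch below follows the definitions of Bracher et al. (2021) as we understand them (weight 1/2 on the absolute error of the median and weight α/2 on the interval score at level α); the function names and the toy forecast are hypothetical, and the factor of 2 on the quantile score is the convention under which the two forms agree.

```python
def quantile_score(tau, q, y):
    """Pinball loss of predictive quantile q at level tau for observation y."""
    indicator = 1.0 if y < q else 0.0
    return (tau - indicator) * (y - q)


def interval_score(alpha, lower, upper, y):
    """Interval score of a central (1 - alpha) prediction interval:
    width plus penalties for observations outside the interval."""
    return ((upper - lower)
            + (2.0 / alpha) * max(0.0, lower - y)
            + (2.0 / alpha) * max(0.0, y - upper))


def wis_from_intervals(median, intervals, y):
    """WIS from a median and a dict {alpha: (lower, upper)}."""
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(alpha, lower, upper, y)
    return total / (len(intervals) + 0.5)


def wis_from_quantiles(quantiles, y):
    """WIS as the equally weighted mean of 2 x quantile score over levels."""
    return sum(2.0 * quantile_score(t, q, y) for t, q in quantiles.items()) / len(quantiles)


# Toy forecast: median 10, 50% interval (8, 12), 90% interval (5, 16).
y = 14.0
a = wis_from_intervals(10.0, {0.5: (8.0, 12.0), 0.1: (5.0, 16.0)}, y)
b = wis_from_quantiles({0.05: 5.0, 0.25: 8.0, 0.5: 10.0, 0.75: 12.0, 0.95: 16.0}, y)
print(a, b)  # both approximately 2.22
```

Both forms give the same value for the toy forecast, which illustrates why the WIS depends on each predictive quantile separately.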

Nonetheless, the quantile score and WIS can be derived using the same decision theoretic procedure that we outlined in Section 2.2. In fields such as meteorology and supply chain management, a great deal of attention has been given to the problem where a decision must be made about the quantity of a resource to purchase for a single location in the face of a fixed cost $C$ for each unit of the resource and a loss $L$ that will be incurred for each unit of unmet need. This leads to the quantile score for the probability level $\tau=1-C/L$. From this point, the WIS or CRPS can be obtained by averaging across a range of decision-making settings with different cost and loss parameters, using a motivation similar to the one we used to obtain the IAS from the AS in Section 2.2.3 (Gneiting & Ranjan, 2011).
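The link between the single-location purchasing problem and a forecast quantile can be illustrated with a brute-force search. In this sketch the forecast is a hypothetical discrete uniform distribution on 1 to 100, the cost and loss values are arbitrary, and the function names are illustrative. The purchase quantity that minimizes expected cost plus expected loss should coincide with the forecast quantile at level 1 − C/L.

```python
def purchase_loss(x, y, C, L):
    """Cost of buying x units plus loss for each unit of need left unmet."""
    return C * x + L * max(0.0, y - x)


def best_purchase(samples, candidates, C, L):
    """Candidate quantity with the lowest expected loss over the samples."""
    def expected_loss(x):
        return sum(purchase_loss(x, y, C, L) for y in samples) / len(samples)
    return min(candidates, key=expected_loss)


def empirical_quantile(samples, tau):
    """Smallest sample value whose empirical CDF is at least tau."""
    ordered = sorted(samples)
    for k, value in enumerate(ordered, start=1):
        if k / len(ordered) >= tau:
            return value
    return ordered[-1]


samples = list(range(1, 101))   # hypothetical forecast: uniform on 1..100
C, L = 1.0, 3.0                 # hypothetical unit cost and unit loss
x_star = best_purchase(samples, range(0, 101), C, L)
q_tau = empirical_quantile(samples, 1.0 - C / L)
print(x_star, q_tau)  # 67 67
```

With C = 1 and L = 3 the level is 2/3, and both the optimal purchase and the empirical 2/3-quantile equal 67.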

We also note that the AS, just as the (aggregate or mean) quantile score and WIS, depends only on the marginal distributions for each location of a multilocation forecast (see Equations (5) and (18) in Appendix B). It is, however, the case that when a forecaster modifies a single marginal forecast (holding those for other locations fixed), the aggregate WIS of their forecast will change only according to the WIS of that marginal forecast, whereas every term in the sum defining the AS (in (5)) will change as the Bayes probability level defined by (18) changes. In this sense, the AS depends ‘jointly’ on the marginal forecasts while the WIS does not.

3. Evaluating forecasts of COVID hospitalizations using the AS

We illustrate our new forecast evaluation framework with an application to hospital admissions in the U.S., considering a hypothetical problem of allocating a limited supply of medical resources to states. The results of this section can be reproduced using the resources linked to below under Data Availability.

3.1. Data

The US COVID-19 Forecast Hub collected short-term forecasts of daily new hospital admissions for individuals with COVID-19 starting in December 2020 (Cramer, Huang, et al., 2022). The target data for these forecasts were hospital admissions as reported by the US Department of Health and Human Services through the HealthData.gov website. Forecasts were probabilistic predictions of the number of new hospital admissions on a particular day in the future, in a specific jurisdiction of the U.S. (national level, state, or territory). Probability distributions were represented using a set of 23 quantiles for each individual prediction, including a median and the lower and upper limits of 11 central prediction intervals, from a 99% to a 10% prediction interval.

The analysis in this work focuses on forecasts made before and during the first wave of the Omicron SARS-CoV-2 variant in the U.S. As such, we analysed forecasts for the 15 weeks starting with Monday 22 November 2021 through Monday 28 February 2022.

Submission to the Forecast Hub followed a weekly cycle, and each Monday the Hub collected the most recent forecasts submitted by all teams that met certain inclusion criteria and created ensemble forecasts using quantile averaging (Ray et al., 2023). Our analysis includes these ensembles (COVIDhub-ensemble and COVIDhub-trained_ensemble) as well as one other ensemble of hub models created by another team (JHUAPL-SLPHospEns) and several other individual models. Models were eligible to be included in the analysis if they were designated as a ‘primary’ model from a team. For a model to have a complete, eligible submission in a given week, it had to have a 14 day-ahead forecast for all 50 states plus Washington DC. Models had to have a complete forecast for at least 4 of the 15 weeks in the analysis to be eligible for inclusion.

The hospitalization data used for scoring forecasts were downloaded on 26 July 2024.

3.2. Evaluation metrics

We evaluated forecasts using the AS and the WIS, computed at a horizon of 14 days. We chose to focus much of our analysis on the AS computed for a resource constraint of K=15,000. This level provides an anchor for interesting comparisons between phases of the Omicron wave by representing a national shortage at the peak and a national excess before and after the wave, assuming that the resource need corresponds directly to new hospital admissions. We additionally computed the mean WIS (MWIS) across all of the forecasted state-level locations.

For both scores, we also computed standardized ranks among all models that submitted forecasts each week. The standardized rank lies in the interval [0,1], where 0 corresponds to the worst rank and 1 to the best. In the case of a tie between two or more models, all tied models received the better rank.
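The ranking convention just described can be sketched in a few lines. The exact formula is not stated in the text, so the mapping below (one minus the share of competitors with a strictly better score) is an assumption chosen to satisfy the stated properties: values in [0,1], 1 for the best model, 0 for the worst, and the better rank for ties. The function name and the example scores are hypothetical.

```python
def standardized_ranks(scores):
    """Map each model's score to a standardized rank in [0, 1].

    Lower scores are better. The best model gets 1 and the worst gets 0.
    Tied models all receive the better rank.
    Assumed formula: 1 - (number of strictly better models) / (n - 1).
    """
    n = len(scores)
    if n == 1:
        return {name: 1.0 for name in scores}
    ranks = {}
    for name, score in scores.items():
        strictly_better = sum(1 for other in scores.values() if other < score)
        ranks[name] = 1.0 - strictly_better / (n - 1)
    return ranks


# Hypothetical scores for one week (not taken from the paper).
example_scores = {"model_a": 120.0, "model_b": 95.0, "model_c": 120.0, "model_d": 300.0}
example_ranks = standardized_ranks(example_scores)
print(example_ranks)
```

Here `model_a` and `model_c` are tied and share the rank that either would have received had the other been strictly worse.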

As described above, predictions were submitted to the Forecast Hub in the form of a set of 23 quantiles of the predictive distribution. The WIS can be directly calculated from these quantiles. However, our numerical method for calculating the AS (outlined in Appendix C) requires a full cumulative distribution function (CDF) for the forecast in each location. For the purpose of this analysis, we imputed CDFs based on the provided quantiles following a procedure detailed in Appendix D. Briefly, this involves interpolating the provided quantiles between the lowest (0.01) and highest (0.99) probability levels using monotonic cubic splines, and then extending outside these levels with normal distribution tails parametrized to match the two lowest and two highest quantiles. We show in Appendix E that if this evaluation procedure had been specified prospectively, the resulting score would be proper, but that a post hoc application of this procedure is improper. We use the procedure here to illustrate the properties of the AS, and note that a collaborative forecasting hub interested in using the AS for evaluation could circumvent propriety issues by communicating the definition of the AS to forecasters and then collecting allocations at specified resource levels as part of forecast submissions.
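The tail-matching step mentioned above can be illustrated with a short sketch. This is not the distfromq implementation and it omits the spline interpolation between the provided quantiles; it only shows how a normal distribution is pinned down by two quantiles. The helper name, the quantile values, and the use of 0.01 and 0.025 as the two lowest probability levels are assumptions made for the example.

```python
from statistics import NormalDist


def fit_normal_to_two_quantiles(p1, q1, p2, q2):
    """Return the normal distribution whose p1- and p2-quantiles are q1 and q2.

    Solves q = mu + sigma * z(p) for (mu, sigma) using the two given points.
    Requires p1 < p2 and q1 < q2 so that sigma is positive.
    """
    z1 = NormalDist().inv_cdf(p1)
    z2 = NormalDist().inv_cdf(p2)
    sigma = (q2 - q1) / (z2 - z1)
    mu = q1 - sigma * z1
    return NormalDist(mu, sigma)


# Hypothetical two lowest forecast quantiles for one location.
lower_tail = fit_normal_to_two_quantiles(0.01, 120.0, 0.025, 150.0)

# Below the lowest provided quantile, the imputed CDF would follow this normal tail.
print(lower_tail.cdf(100.0))
```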

3.3. Allocation benchmarks

To explore how model-derived allocations might compare with ‘standard operating procedures’ for allocation used by public health officials, we scored two simple benchmark methods. First, we evaluated forecasts from the COVIDhub-baseline model, which predicts a flat line from the most recent observation with uncertainty bounds based on a random walk (Cramer, Ray, et al., 2022). Second, we generated a proposed set of allocations where the quantity allocated to each state was proportional to that state’s population (using US Census data of vintage 2022 (United States Census Bureau, 2022)), referred to below as per-capita. These two approaches generally reflect common choices for ‘best practices’ of allocation: either allocating resources based on the most recent observed data, or in proportion to the population of each location.
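As a minimal sketch of the per-capita benchmark, the function below splits a total supply K across locations in proportion to population. The function name and the three populations are hypothetical; the analysis itself used 2022-vintage US Census figures for the states.

```python
def per_capita_allocation(populations, K):
    """Allocate K units across locations in proportion to population."""
    total_population = sum(populations.values())
    return {loc: K * pop / total_population for loc, pop in populations.items()}


# Hypothetical populations for three locations (not Census data).
populations = {"A": 10_000_000, "B": 5_000_000, "C": 1_000_000}
per_capita = per_capita_allocation(populations, K=15_000)
print(per_capita)
```

Because this allocation ignores forecasts entirely, it can be scored with the AS but has no WIS.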

3.4. Application results

3.4.1. Anatomy of forecast scores for one week

To illustrate the mechanics of allocation scoring, we start by focusing on how forecasts generated on or before 20 December 2021, with predictions for 3 January 2022, were scored by different metrics. This week was around the peak of the Omicron wave nationally, with individual states typically observing a peak at or after 3 January 2022.

Of the 10 models evaluated for this one week, the CU-select model had the most accurate forecasts according to the AS while the USC-SI_kJalpha model had the most accurate forecasts based on MWIS (Table 1). The JHUAPL-Bucky model had the second best MWIS but the third worst AS.

Table 1.

For one illustrative week, a comparison of allocation scores (AS), mean weighted interval scores (MWIS), and two varieties of Integrated Allocation Scores (IAS; see Section 3.4.5)

Model | AS | MWIS | IAS centred at 15 k | IAS uniform
CU-select | 669 | 133 | 774 | 326
per-capita | 865 | n/a | 1,029 | 366
COVIDhub-ensemble | 873 | 159 | 1,067 | 438
USC-SI_kJalpha | 995 | 91 | 1,216 | 1,097
JHUAPL-Gecko | 1,034 | 164 | 1,141 | 418
MUNI-ARIMA | 1,084 | 169 | 1,248 | 440
COVIDhub-trained_ensemble | 1,089 | 169 | 1,271 | 823
COVIDhub-baseline | 1,175 | 170 | 1,317 | 535
JHUAPL-Bucky | 1,358 | 102 | 1,566 | 1,214
JHUAPL-SLPHospEns | 1,540 | 129 | 1,604 | 1,102
UVA-Ensemble | 2,469 | 213 | 2,635 | 2,494

Note. All metrics are shown for 10 models that made forecasts of hospital admissions for 3 January 2022. Results are sorted by AS. For all metrics, lower scores indicate better accuracy.

A state-wise comparison of the JHUAPL-Bucky and CU-select models shows that while the JHUAPL-Bucky forecast distributions were more centred on the eventual observations in many states, the suggested allocations of the JHUAPL-Bucky model were less efficient than those of the CU-select model (Figures 2 and 3a). Thus, JHUAPL-Bucky achieved a worse AS than CU-select (on this particular week) by allocating excess resources to several states, such as Pennsylvania and Michigan, rather than to states such as Florida and California where these resources would have reduced unmet need. These allocation errors resulted from forecasts failing to consistently capture the relative resource needs across different states. The CU-select model made some similar errors—most prominently, overallocating resources to Ohio—but overall, it more successfully forecasted the resource demands across different locations in relative terms.

Figure 2.

Probabilistic forecasts for new COVID-19 hospital admissions on 3 January 2022 and their suggested resource allocations for the states with the 10 highest (eventually observed) hospitalization counts. For each state, the dark line shows the data observed when the forecast was made, the lighter line shows counts observed after the forecast was made, and the thin horizontal segment shows the count observed on the target date. The side-by-side shaded regions show the median (thicker horizontal segments) and 50%, 80%, and 95% prediction intervals for the two selected models. The forecasts were made for new hospitalizations on 3 January 2022 (vertical dashed line, with number of hospitalizations indicated by thin horizontal line segment at the intersection of the vertical dashed line and the line of data). In the online version, vertical bars with purple, blue, and red shading show the allocations. The purple bar goes from zero to the amount of need that was met by the allocation from that model to a specific location. A red bar indicates need that exceeded the resource allocation for a location, and a light blue bar shows the amount by which the resource allocation suggested by that model exceeded the need for that location.

Figure 3.

Component-wise breakdowns of the allocation score (a) and weighted interval score (WIS) (b), by location for forecasts of hospitalization admissions on 3 January 2022, for two selected models (JHUAPL-Bucky and CU-select). a) The observed resource need, in this case the observed number of hospitalizations, for each state, along with the hypothetical number of resources allocated to the given location based on the forecasts from each model. The number of available resources was fixed at 15,000 and forecasts from each model were used to determine an optimal allocation strategy before the resource need was known. For most locations, the resource need exceeded the resources allocated, indicated by some amount of ‘unmet resource need’ above the ‘need met by allocated resources’. b) The breakdown of the WIS into components of underprediction, overprediction and dispersion. Larger values of WIS indicate more error, and the full WIS score for each location can be decomposed into the three components shown here.

On the other hand, CU-select had worse performance as measured by MWIS. Its forecasts were biased downwards, and it consistently incurred a large penalty for underprediction (Figure 3b). The JHUAPL-Bucky model had wider predictive intervals which more often included the observed level of daily hospital admissions. Its MWIS was therefore less severely penalized by under- or overpredictions of the actual hospitalization counts.

3.4.2. Forecast scores showed differences in aggregate and over time

Allocation scores varied substantially by date and by model (Figure 4). For predictions made for the first three Mondays in December 2021 and the last three Mondays in February 2022, all models had ASs under 500 (and the mean across all models was less than 100), indicating that unnecessary unmet need was fairly low on those days. The allocation scores were on the whole highest when the observed number of new hospital admissions was closest to the resource threshold of 15,000. It was on these occasions that models were most likely to waste resources in one location that the oracle would have put to use in another.
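To make the notion of ‘unnecessary unmet need’ concrete, the sketch below assumes that the AS equals the total unmet need under a model's allocation minus the unavoidable unmet need under an oracle allocation that knows the outcomes in advance. The formal definition is Equation (5) in Section 2.2.2 and is not restated here, so this reading, the function names, and the three-location numbers are all assumptions for illustration.

```python
def unmet_need(allocation, observed):
    """Total shortage: sum over locations of max(0, need - allocated)."""
    return sum(max(0.0, y - x) for x, y in zip(allocation, observed))


def oracle_unmet_need(observed, K):
    """Shortage that no allocation of K units could avoid."""
    return max(0.0, sum(observed) - K)


def allocation_score(allocation, observed, K, loss_per_unit=1.0):
    """Assumed form of the AS: unmet need in excess of the oracle's."""
    excess = unmet_need(allocation, observed) - oracle_unmet_need(observed, K)
    return loss_per_unit * excess


observed = [600.0, 300.0, 100.0]  # hypothetical need in three locations

# National shortage (K = 800 < 1,000): some unmet need is unavoidable.
as_efficient = allocation_score([400.0, 300.0, 100.0], observed, K=800)
as_wasteful = allocation_score([300.0, 200.0, 300.0], observed, K=800)

# National excess (K = 1,200 > 1,000): any unmet need is avoidable.
as_surplus = allocation_score([500.0, 300.0, 400.0], observed, K=1200)
print(as_efficient, as_wasteful, as_surplus)
```

In the wasteful case, 200 units sit unused in the third location while the first two locations go short, which is the kind of misallocation the paragraph above describes.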

Figure 4.

Hospital admissions and evaluation metrics over time. a) The number of hospital admissions in the U.S. as a whole due to COVID-19 on a sequence of 15 Mondays from December 2021 through March 2022. These are the dates for which forecasts were made and evaluated. A horizontal dashed line at 15,000 shows the hypothetical resource constraint K. b) Allocation scores (AS) for each model’s 14 day-ahead forecast, across all US states. The x-axis corresponds to the date of the observation that a model’s prediction was targeting (i.e. the date the forecast was made plus the forecast horizon). AS typically are high when the observed value is near to the constraint, which occurs during the last Monday in December (on the way up) and the last Monday in January (on the way down). In (b), the per-capita allocation is drawn in a heavier solid line. c) The mean weighted interval score (MWIS) across weeks. MWIS tends to scale with the observed and predicted values, peaking just after the peak of the Omicron wave.

Averaging across all weeks, the ensemble forecast achieved a better mean allocation score (MAS) than both benchmark methods. The COVIDhub-ensemble had the best MAS across all weeks, the per-capita allocation had the second best MAS, and the COVIDhub-baseline model had the sixth best MAS (Table 2). No individual model had a better MAS than the per-capita allocation.

Table 2.

Time averaged mean weighted interval scores (taMWIS) and mean allocation scores (MAS) by model across 13 weeks

Model | MAS | taMWIS | taMWIS rank
COVIDhub-ensemble | 389 | 70 | 2
per-capita | 464 | n/a | n/a
COVIDhub-trained_ensemble | 483 | 87 | 4
CU-select | 502 | 81 | 3
JHUAPL-SLPHospEns | 526 | 67 | 1
COVIDhub-baseline | 594 | 93 | 5
JHUAPL-Bucky | 643 | 112 | 7
MUNI-ARIMA | 707 | 116 | 8
JHUAPL-Gecko | 929 | 110 | 6
USC-SI_kJalpha | 1,473 | 155 | 9

Note. These results show the average performance across time for the nine models that submitted forecasts for every week from 29 November 2021 through 21 February 2022, as well as for a per-capita allocation. MWIS is not defined for the per-capita allocation which does not use forecasts.

Model rankings on MAS and time-averaged mean weighted interval score (taMWIS) were broadly similar, with some places of disagreement. The four most accurate and the five least accurate models were the same according to both metrics, although not in exactly the same order (Table 2). (This comparison excludes the per-capita allocation which does not use forecasts and therefore cannot be assigned an MWIS.) Notably, the JHUAPL-SLPHospEns model ranked best on taMWIS but had only the fourth best MAS.

3.4.3. Metrics were not consistently correlated over time

We can gain some insight into the discordance between MAS and taMWIS by examining the correlation between time-specific AS and MWIS values of the various models in Figure 5. We find, for example, no clear association between the consistent and high MWIS rank of the JHUAPL-SLPHospEns model and its highly variable AS rank. This contrasts with a clearly positive correlation between MWIS and AS ranks in other models such as Karlen-pypm and USC-SI_kJalpha. We also can see in more detail the consistently high performance of the COVIDhub-ensemble on both scores. And while it did not submit forecasts for enough weeks to be included in Table 2, CMU-TimeSeries is an interesting example of a model that performed consistently well on AS but had only middling MWIS ranks.

Figure 5.

Association of standardized ranks for mean weighted interval score (MWIS) and AS by model and week. Each facet of the plot corresponds to one model. Within each facet, each point corresponds to a week. The x- and y-values correspond to the MWIS standardized rank and the AS standardized rank for that week. Points corresponding to earlier dates have darker shading. The size of the point corresponds to the observed value on the date for which the prediction was made. Models show different degrees of association between the two metrics.

3.4.4. Sensitivity of AS to values of K

AS was computed for a range of K from 200 to 60,000 at increments of 200 for forecasts made on 20 December 2021 predicting levels of hospitalizations on 3 January 2022 as well as for the per-capita allocation (Figure 6a). These calculations highlight that AS was highest at values of K near the observed nationwide total hospital admissions of 19,581 that day. Model ranks by AS were fairly stable across all K, and there appears to be a substantial interval around the observed value over which ranks are constant.

Figure 6.

Allocation scores (AS) across different resource constraints (K) for 10 models that made forecasts on 20 December 2021. a) For each model, the AS for values of K between 200 and 60,000 at increments of 200. The AS show a sharp peak just under 20,000, near the eventually observed number of hospitalizations. Two possible weighting functions for the IAS are shown. The first (peaked shaded area) computes weights proportional to a normal distribution centred at 15,000 with a standard deviation of 3,000, and truncated to be between 5,000 and 25,000. The second (flat shaded area) uses a uniform weight for all possible values of K. Note that the AS used in earlier sections of the application uses the single fixed value of K=15,000. b) How the two versions of the IAS (y-axis) correlate with the AS at K=15,000 (x-axis). Every point represents the AS at 15,000 and a version of the IAS for one model. As we would expect, the IAS centred at 15,000 (circles) is more closely correlated with the AS at 15,000 than is the uniform IAS (triangles).

3.4.5. IAS across values of K

The IAS summarizes ASs across a range of possible values of the constraint (K), allowing one to assign higher weights to more likely values of K (Section 2.2.3). IAS was computed for two weight distributions on the same grid of values of K used in the sensitivity analysis above, one uniform across the entire range and the other with weights proportional to a truncated normal distribution centred at 15,000 (the orange and blue shaded areas in Figure 6a, respectively). With these weight distributions, the IAS defined in Equation (8) can be computed as the sum

$$S_{IAS}(F,y) = \sum_{K} S_A(F,y;K)\,p(K).$$
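The weighted sum above can be sketched as follows, using the grid of K values from the sensitivity analysis and the two weight distributions described in the caption of Figure 6 (uniform, and normal with mean 15,000 and standard deviation 3,000 truncated to 5,000 to 25,000). Normalizing the weights to sum to one is an assumption, the function names are hypothetical, and the AS curve is a made-up stand-in rather than output from any model.

```python
import math


def k_grid(start=200, stop=60_000, step=200):
    """Grid of resource constraints K used in the sensitivity analysis."""
    return list(range(start, stop + 1, step))


def uniform_weights(grid):
    return [1.0 / len(grid)] * len(grid)


def truncated_normal_weights(grid, mean=15_000, sd=3_000, lower=5_000, upper=25_000):
    """Weights proportional to a normal density, zero outside [lower, upper]."""
    raw = [
        math.exp(-0.5 * ((k - mean) / sd) ** 2) if lower <= k <= upper else 0.0
        for k in grid
    ]
    total = sum(raw)
    return [w / total for w in raw]  # assumed normalization: weights sum to 1


def integrated_allocation_score(as_by_k, weights):
    """IAS as the weighted sum of AS values over the grid of K."""
    return sum(score * p for score, p in zip(as_by_k, weights))


grid = k_grid()
# Hypothetical AS curve with a peak near K = 19,600 (not real model output).
toy_as = [max(0.0, 1000.0 - abs(k - 19_600) / 20.0) for k in grid]

ias_uniform = integrated_allocation_score(toy_as, uniform_weights(grid))
ias_centred = integrated_allocation_score(toy_as, truncated_normal_weights(grid))
print(round(ias_uniform, 1), round(ias_centred, 1))
```

The centred weighting concentrates on values of K where this toy curve is high, so it yields the larger of the two summaries here; with a real AS curve the relationship depends on where the peak falls relative to the weights.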

Both versions of the IAS were correlated with the original AS for K=15,000, with the higher correlation coming from the weighting centred at K=15,000 (Figure 6b). Model rankings based on the AS and the centred IAS were roughly similar, with the top and bottom three approaches being the same for both scores (Table 1).

4. Discussion

Probabilistic forecasts of infectious disease outbreaks have typically been evaluated using well-known proper scoring rules such as the LS, CRPS, and WIS. Often models are ranked according to such a score, but without reference to any underlying decision problem from which the score might be derivable as a Bayes scoring rule or be otherwise motivated. We argue that the value of collaborative forecasting efforts, such as the COVID-19 Forecasting Hub, could be enhanced by including evaluations of forecasts with scoring rules that directly consider the success of forecasts in supporting specific decisions. The AS was developed for this purpose. The AS is expressed in units that are directly relevant to decisions, which helps not only with ranking forecasts in the context of a specific action, but also allows decision-makers to contextualize the magnitude of differences between forecast scores. For example, it may be the case that differences in AS are small relative to the amount of available resources or the accuracy with which planned resource allocations can be executed. The benefits of the approach to forecast evaluation discussed here generalize beyond public health, to any application of probability forecasts for decision-making, such as economics, where the monetary consequences of using a forecast can be clearly juxtaposed with standard accuracy measures of the forecast (Bannigidadmath & Narayan, 2016; Zhang et al., 2019), resource allocation in urban planning (Burnett, 2014; Liang & Wey, 2013), or sustainability initiatives such as water and energy allocation (Colett et al., 2016; Gebre et al., 2021; Syme et al., 1999).

We have demonstrated how tying forecast evaluation to a specific decision problem (e.g. via the AS) can yield model rankings that differ substantively from those based on a current standard measure of forecast accuracy, the WIS, which does not refer to any decision problem that has been clearly connected to outbreak response. This result aligns with findings in other fields (Cenesizoglu & Timmermann, 2012; Leitch & Tanner, 1991; Murphy, 1993). Additionally, we show that an existing ensemble forecast approach was the only method to outperform (in terms of this new evaluation method) two benchmark allocation approaches in a hypothetical application, suggesting that there is room for innovation of current epidemiological modelling techniques that might be spurred on by a reorientation of the field toward practical decision problems. For example, to surpass benchmark ASs, forecasters need to devote effort to capturing the relative magnitude of future resource need across different locations. Success at this task is not, however, directly rewarded by the WIS (see the discussion at the end of Section 2.3).

Our example application of the AS refers to a hypothetical and unspecified resource need that is assumed to pair with our forecasted outcome (new hospital admissions) in a manner that might appear overly simplistic and artificially direct. We are optimistic, however, that by augmenting this scheme with a probabilistic measurement model for resource need and more detailed domain knowledge, it could be meaningfully aligned with a more concrete decision problem such as the allocation of ventilators during a respiratory viral pandemic. This problem was explored from a stochastic optimization perspective (not from a forecast evaluation perspective) in Huang et al. (2017) and provided an initial motivation for this project. We see this as a promising focus for further development of the AS framework since it (a) has been identified as a setting in which allocation decisions are made during respiratory disease outbreaks, (b) involves logistical considerations depending on geographical and population scales, and (c) would likely imply a central role for hospitalization data. But again, additional modelling and data synthesis would be required since not everyone who is hospitalized needs a ventilator. Other possible real-world examples include the allocation of a limited stockpile of vaccinations (Araz et al., 2012; Persad et al., 2023) or diagnostic tests (Du et al., 2022; Pasco et al., 2023).

In practice, epidemiological forecasts are often directed at a diverse set of end-users. It may be easy for some of these users to summarize the consequences of how they use a forecast with a numerical loss function, but for others, such as officials deciding how best to update public ‘situational awareness’ of an outbreak, this may be essentially impossible. Even those forecast uses that are easily framed as expected loss minimization may differ enough that no single score would be appropriate for all users. Ideally, targeted forecasting tools could be developed through close collaboration between modellers and public health officials. However, this may only be possible in settings with sufficient staffing on both an analytics and a public health team. Increasingly, collaborative modelling hubs are being used to generate ‘one-size-fits-all’ forecasts for many locations at once. But it could still be beneficial in these settings to expand the set of scoring methods used by the hub to include scores, such as the AS, which are based on specific public health decision problems. This could help end users better understand the value of available forecasts as inputs to their particular decision making contexts. (We note that issues of propriety are raised when forecasts are evaluated with scoring rules for which they were not elicited or via parameters imputed by a hub after submission, both of which are done in our application example. We consider these issues briefly in Appendices D and E.)

There are three important limitations to the current work which we noted when introducing the AS in Section 2.2.2. We view these issues as promising avenues for future investigation.

First, the terms in our loss function (3) do not depend on geographic or demographic covariates. Policy makers may, however, wish to incorporate differences in the impact of unmet need across populations into a forecast evaluation methodology. For example, in the state-level forecasting context of Section 3, it might be appropriate to adjust the loss so that the costs of unmet need are larger in locations with greater population vulnerability. Along these lines, it should also be mentioned that the AS does not directly take into consideration the fairness or equity of allocations, or more broadly, individual and group preference relations that are difficult or even impossible to encode into a loss or utility function. Ambiguity aversion (in the sense of Ellsberg, 1961), for example, seems especially relevant to the use of forecasts in outbreak response, where there can be immense social and political pressure on public health officials to maintain a transparent base of evidence for their policy choices and recommendations.

Secondly, our formulation of the AS does not take into consideration the temporal inter-dependencies between sequential decisions that public health decision-makers are likely to face as an epidemic unfolds. Such dependencies could arise from time-varying inputs such as resource availability or the current effective allocations of resources. The sequential decision problem could also be combined with others allowing constraints for different resources to be modified by time-varying funding shifts between various disease mitigation measures. Incorporating these possibilities into an AS framework would likely require technical developments involving either dynamic programming (see e.g. Bertsekas, 2012) or a scenario-based approach (see e.g. Pflug & Pichler, 2014).

Thirdly, we emphasize that we opted at this stage of development to explicitly rule out situations where a successful epidemiological forecast could lead to policy decisions that change the distribution of the predicted outcome Y. For example, this excludes the important problem of allocation of doses of a preventative vaccine. We have not developed a formal approach to forecast evaluation using scores given by (7) when both disease burden and unmet need are considered as endogenous variables. Such an approach would require some form of accounting for the causal effect of the allocation decision on observed measures of disease burden, e.g. using the available observed data to estimate the (counterfactual) number of cases or deaths averted from one or another allocation strategy.

Another opportunity for further investigation is to more carefully evaluate whether forecasts add value to standard procedures already in place for making public health decisions. In the context of allocations, such a procedure might extrapolate need from current observed need and population levels (similarly to the two benchmark approaches presented above), with adjustments based on other political or real-world considerations. For example, in many settings public health stakeholders will make decisions after synthesizing information from a variety of quantitative and qualitative sources coupled with expert judgment. The AS presented in this work does not directly measure whether a given forecast adds useful information to such an existing decision-making process. While the scoring procedures as presented do not directly address this question, they could be modified (say, by comparison to a baseline model or expert-elicited allocations in the absence of forecast data) to quantify the benefit of using a forecast to inform a specific decision.

In conclusion, we argue that when possible, the way modellers and policymakers view and evaluate forecasts should more explicitly depend on the specific decision-making context. Defaulting to standard forecast evaluation metrics can mask the utility (or disutility) of certain forecasts, or lead to forecasts being used in decision-making contexts very different from those for which they are able to offer useful guidance. New collaborative work between decision-makers and modelling teams is needed to assess the value and relevance of the initial findings presented here, including real-time pilot studies or simulation exercises that could be used to inform further development of new or alternative scoring metrics. We see this work as an initial overture for what we hope will grow to be a large, collaborative body of work more closely coupling applied epidemiological forecasting with public health decision-making.

Acknowledgments

We wish to thank the following individuals who contributed valuable comments and feedback on early versions of this work: Matthew Biggerstaff, Rebecca Borchering, Sebastian Funk, Melissa Kerr, and Jeffrey Shaman.

Appendices

We address some technical and methodological points from the main text. We begin in Appendix A by defining proper scoring rules and showing that the AS is proper. Appendix B gives a justification for the result that the Bayes act for the allocation problem is given by a vector of quantiles of the forecast distributions in each location at a shared probability level, which was stated in Section 2.2.2. We describe the algorithm that we use to compute allocations given a forecast distribution in each location in Appendix C. In Appendix D, we describe the methods for approximating a full distribution from a set of quantiles, implemented in the R package distfromq, that we used to support computation of ASs from quantile forecasts that were submitted to the US COVID-19 Forecast Hub for the application in Section 3. Finally, in Appendix E, we examine implications for propriety of an analysis that uses summaries of forecasts (such as predictive quantiles) rather than the full forecast distributions for computation of ASs.

Appendix A: proper scoring rules

In decision theory, a loss function l is used to formalize a decision problem by assigning numerical value l(x,y) to the result of taking an action x in preparation for an outcome y. A scoring rule S is a loss function for a decision problem where the action is a probabilistic forecast F of the outcome y (or the statement of F by a forecaster). We refer to the realized loss S(F,y) as the score of F at y.

Probabilistic forecasts can be seen as a unique kind of action in that they can be used to generate their own (simulated) outcome data, against which they can be scored using S. A probabilistic forecast F is thus committed to the ‘self-assessment’ $E_F[S(F,Y)] := E[S(F,Y_F)]$, where $Y_F \sim F$ is the random variable defined by sampling from F, as well as to an assessment $E_F[S(G,Y)]$ of any alternative forecast G.

A natural consistency criterion for S is that, for observations assumed to be drawn from F, it will not assess any other forecast G as being better than F itself, that is, that

$$E_F[S(F,Y)] \le E_F[S(G,Y)] \tag{9}$$

for any F, G. A scoring rule meeting this criterion is called proper. If S were improper, then from the perspective of a forecaster focussed (solely) on expected loss minimization, the decision to state a forecast G other than the forecast F which they believe describes Y could be superior to the decision to state F. S is strictly proper when (9) holds with strict inequality for every G ≠ F, in which case the only optimal decision for a forecaster seeking to minimize their expected loss is to state the forecast they believe to be true.

A.1. The allocation score is proper

Our primary decision theoretical procedure, outlined in Section 2.2.1, uses a decision problem with loss function s(x,y) to define a scoring rule

$$S(F,y) := s(x^F, y) \tag{10}$$

where $x^F := \arg\min_x E_F[s(x,Y)]$ is the Bayes act for F with respect to s. Such scoring rules, which we call Bayes scoring rules, are proper by construction since

$$\begin{aligned}
E_F[S(F,Y)] = E_F[s(x^F,Y)] &= \min_x E_F[s(x,Y)] \quad (\text{by definition of } x^F) && (11)\\
&\le E_F[s(x^G,Y)] && (12)\\
&= E_F[S(G,Y)].
\end{aligned}$$

The allocation scoring rule is Bayes and therefore proper.
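The propriety argument can be checked numerically on a small example. The sketch below builds a Bayes scoring rule from a single-location loss with a per-unit cost and a per-unit shortage penalty, then compares the expected score of stating F with that of stating a different forecast G when outcomes are drawn from F. The two discrete distributions, the cost and loss values, the action grid, and all function names are hypothetical.

```python
def loss(x, y, C=1.0, L=4.0):
    """Cost of purchasing x units plus penalty for unmet need (hypothetical C, L)."""
    return C * x + L * max(0.0, y - x)


def expected_loss(x, dist):
    """Expected loss of action x when Y follows the discrete distribution dist."""
    return sum(p * loss(x, y) for y, p in dist.items())


def bayes_act(forecast, actions):
    """Action minimizing expected loss under the stated forecast."""
    return min(actions, key=lambda x: expected_loss(x, forecast))


def expected_score(stated, truth, actions):
    """E_truth[S(stated, Y)]: act on the stated forecast, pay under the truth."""
    return expected_loss(bayes_act(stated, actions), truth)


actions = range(0, 11)
F = {2: 0.5, 5: 0.3, 9: 0.2}  # forecaster's true belief (hypothetical)
G = {1: 0.6, 3: 0.4}          # an alternative forecast (hypothetical)

score_truthful = expected_score(F, F, actions)
score_alternative = expected_score(G, F, actions)
print(bayes_act(F, actions), bayes_act(G, actions), score_truthful, score_alternative)
```

Under F, stating F is never worse in expectation than stating G, which is inequality (9) for this pair.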

We note that in the probabilistic forecasting literature (see e.g. Gneiting, 2011a, Theorem 3) what we have termed Bayes scoring rules typically appear via (10) where $x^F$ is some given functional of F which can be shown to be elicitable, that is, to be the Bayes act for some loss function s. Such a loss function is said to be a consistent loss (or scoring) function for the functional $F \mapsto x^F$, and many important recent results in the literature (e.g. Fissler & Ziegel, 2016) address whether there exists any loss function that is consistent for $x^F$. Our orientation is different from this insofar as we begin by specifying a decision problem and a loss function of subject matter relevance and use the Bayes act only as a bridge to a proper scoring rule. Consistency is never in doubt.

Appendix B: allocation Bayes acts as vectors of marginal quantiles

Here we study the form of the Bayes act for the allocation problem (AP) (Equation (4) in Section 2.2.2 of the text):

$$\underset{0 \le x}{\text{minimize}} \;\; E_F[s_A(x,Y)] = \sum_{i=1}^{N} L \cdot E_{F_i}[\max(0, Y_i - x_i)] \quad \text{subject to} \quad \sum_{i=1}^{N} x_i = K, \tag{13}$$

where the marginal forecasts $F_i$ for $i = 1,\dots,N$ represent forecasts for N distinct locations. We show that the Bayes act $x^{F,K} = (x_1^{F,K},\dots,x_N^{F,K})$ for a forecast F and resource constraint level K is a vector of quantiles of the marginal forecast distributions $F_i$ at a single probability level $\tau^{F,K}$, that is, $x_i^{F,K} = q_{F_i,\tau^{F,K}}$. An immediate consequence used in the examples in Section 2.1 is that if $F_i = \mathrm{Exp}(1/\sigma_i)$ for all i, then the Bayes act is proportional to $(\sigma_1,\dots,\sigma_N)$, since

$$q_{\mathrm{Exp}(1/\sigma),\tau} = -\sigma \log(1-\tau). \tag{14}$$
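The shared-probability-level characterization suggests a simple way to compute an allocation: search for the level at which the marginal quantiles sum to K. The sketch below does this by bisection for exponential marginals and confirms that the result is proportional to the scale parameters. It is an illustration under stated assumptions, not the algorithm of Appendix C; the function names, the scale values, and K are hypothetical.

```python
import math
from functools import partial


def exp_quantile(sigma, tau):
    """Quantile of an exponential distribution with mean sigma, as in (14)."""
    return -sigma * math.log(1.0 - tau)


def allocate_by_shared_quantile(quantile_fns, K, iterations=100):
    """Find tau such that the marginal quantiles at level tau sum to K.

    Assumes each quantile function is increasing in tau and that K is
    attainable at some tau below the upper search bound.
    """
    lo, hi = 0.0, 1.0 - 1e-12
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(q(mid) for q in quantile_fns) < K:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    return tau, [q(tau) for q in quantile_fns]


sigmas = [100.0, 300.0, 600.0]  # hypothetical exponential means by location
K = 500.0                       # hypothetical resource constraint
tau, allocation = allocate_by_shared_quantile(
    [partial(exp_quantile, s) for s in sigmas], K
)
proportional = [K * s / sum(sigmas) for s in sigmas]
print(tau, allocation, proportional)
```

For exponential marginals the bisection is unnecessary, since the constraint gives the level in closed form; it is used here because the same search applies to any marginals with increasing quantile functions.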

We begin by noting that a key feature of each term of the loss function $s_A(x,Y)$ defining the AP is the presence of a shortage: an amount $\max\{0, y - x\}$ by which a resource demand $y$ exceeds a supply decision variable $x$, which, for convenience, we write as $(y-x)_+$. This is a feature shared with decision problems used to define quantiles and related scoring rules such as the CRPS and the WIS (see e.g. Gneiting, 2011b; Jose & Winkler, 2009; Royset & Wets, 2022, sections 1.C and 3.C). In particular, a quantile at probability level $\alpha$ of a distribution $F$ on $\mathbb{R}$ (which we assume to have a well-defined density $f(x)$) is a Bayes act for the loss function

$$l(x,y) = Cx + L(y-x)_+$$

where $\alpha = 1 - C/L$ and $C$ and $L$ can be interpreted as the cost per unit of a resource (such as medicine) and the loss incurred when a unit of demand (such as illness) cannot be met due to the shortage $(y-x)_+$. This follows because a Bayes act, as a minimizer of $E_F[l(x,Y)]$, must also be a vanishing point of the derivative

$$\frac{d}{dx} E_F[l(x,Y)] = E_F\left[\frac{d}{dx} l(x,Y)\right] = C + L \, E_F\left[\frac{d}{dx}(Y-x)_+\right] = C - L \, E_F[\mathbf{1}\{Y > x\}] = C + L(F(x) - 1), \tag{15}$$

so that $1 - C/L = F(x)$. The formula $\frac{d}{dx} E_F[(Y-x)_+] = F(x) - 1$ for the derivative of the shortage, used above in (15), can be obtained from an application of the ‘Leibniz Rule’:

$$\frac{d}{dx} E_F[(Y-x)_+] = \frac{d}{dx} \int_x^\infty (y-x) f_Y(y)\,dy = \int_x^\infty \frac{d}{dx}(y-x) f_Y(y)\,dy - (x-x) f_Y(x) = -\int_x^\infty f_Y(y)\,dy = F(x) - 1. \tag{16}$$

Note that more care is required when F does not have a density.
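
Both facts can be checked numerically in a case where the expected shortage has a closed form. The sketch below assumes an exponential demand with mean `sigma`, for which the expected shortage at supply level x is sigma times exp(-x / sigma); the values of `sigma`, `C`, and `L` are arbitrary illustrative choices, and a grid search stands in for an exact minimization.

```python
import math

sigma, C, L = 4.0, 1.0, 5.0  # hypothetical demand scale, unit cost, and unit shortage loss

def expected_shortage(x):
    # E[(Y - x)_+] for Y exponential with mean sigma
    return sigma * math.exp(-x / sigma)

def expected_loss(x):
    # E[l(x, Y)] = C x + L E[(Y - x)_+]
    return C * x + L * expected_shortage(x)

# The quantile at level alpha = 1 - C/L, from (14)
alpha = 1.0 - C / L
quantile = -sigma * math.log(1.0 - alpha)

# Grid search for the minimizer of the expected loss (grid step 0.001)
grid = [i / 1000 for i in range(30001)]
x_best = min(grid, key=expected_loss)

# Central finite-difference check of (16) at x0: d/dx E[(Y - x)_+] = F(x) - 1
x0, h = 2.0, 1e-5
numeric_derivative = (expected_shortage(x0 + h) - expected_shortage(x0 - h)) / (2 * h)
cdf_minus_one = (1.0 - math.exp(-x0 / sigma)) - 1.0

print(x_best, quantile, numeric_derivative, cdf_minus_one)
```

Up to the grid resolution, the minimizer of the expected loss agrees with the quantile at level 1 - C/L, and the finite-difference derivative of the expected shortage agrees with F(x) - 1.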

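Returning to the AP (13), notice that for $x \in \mathbb{R}_+^N$ to be a Bayes act it must be true that reallocating $\delta > 0$ units of the resource from location $i$ to location $j$ cannot lead to a net decrease in expected shortage; in other words, the reallocation increases the expected shortage in location $i$ at least as much as it decreases the expected shortage in location $j$:

$$\underbrace{E_{F_i}[(Y_i - (x_i - \delta))_+] - E_{F_i}[(Y_i - x_i)_+]}_{\text{detriment in } i} \ge \underbrace{E_{F_j}[(Y_j - x_j)_+] - E_{F_j}[(Y_j - (x_j + \delta))_+]}_{\text{benefit in } j}.$$

Dividing by $\delta$ and letting $\delta \to 0$, this implies from (16) that

\begin{align}
1 - F_i(x_i) &= -\frac{d}{dx_i} E_{F_i}[(Y_i - x_i)_+] \quad (\text{rate of detriment in } i) \notag \\
&= \lim_{\delta \to 0} \frac{1}{\delta} \left\{ E_{F_i}[(Y_i - (x_i - \delta))_+] - E_{F_i}[(Y_i - x_i)_+] \right\} \notag \\
&\ge \lim_{\delta \to 0} \frac{1}{\delta} \left\{ E_{F_j}[(Y_j - x_j)_+] - E_{F_j}[(Y_j - (x_j + \delta))_+] \right\} \notag \\
&= -\frac{d}{dx_j} E_{F_j}[(Y_j - x_j)_+] = 1 - F_j(x_j) \quad (\text{rate of benefit in } j). \tag{17}
\end{align}

(Negative derivatives appear here because our optimality condition addresses how a decrease in resources will increase the expected shortage in $i$ and vice versa in $j$.)

Since (17) also holds with $i$ and $j$ reversed, a number $\lambda$ (a Lagrange multiplier) exists such that $L(1 - F_k(x_k)) = \lambda$ for all $k \in \{1, \ldots, N\}$. (We scale by $L$ to facilitate possible future interpretations of $\lambda$ in terms of the partial derivatives of $E_F[s_A(x,Y)]$.) That is, $x_k$ is a quantile $q_{\tau, F_k}$ for $\tau = 1 - \lambda/L$. The value of $\tau$ is then determined by the constraint equation

$$\sum_{i=1}^N q_{\tau, F_i} = K. \tag{18}$$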

It is important to note that τ depends on F and K and is not a fixed parameter of the allocation scoring rule.
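
The optimality condition and the dependence of the probability level on the constraint can both be seen in the exponential example. The sketch below is illustrative only: it takes the unit loss to be 1, uses the closed-form expected shortage of an exponential distribution, and compares the equal-quantile allocation against a few hand-picked reallocations that keep the total fixed.

```python
import math

sigmas = [1.0, 4.0]  # exponential scale parameters, as in the examples of Section 2.1

def exp_quantile(sigma, tau):
    return -sigma * math.log(1.0 - tau)

def expected_shortage(x):
    # sum_i E[(Y_i - x_i)_+] for independent exponential marginals with means sigmas (L = 1)
    return sum(s * math.exp(-xi / s) for xi, s in zip(x, sigmas))

def bayes_act(K):
    # Common probability level solving (18), then the vector of marginal quantiles
    tau = 1.0 - math.exp(-K / sum(sigmas))
    return tau, [exp_quantile(s, tau) for s in sigmas]

tau_5, x_5 = bayes_act(5.0)
tau_10, x_10 = bayes_act(10.0)

# Reallocations that move d units between the two locations while keeping the total at K = 5
perturbed = [[x_5[0] + d, x_5[1] - d] for d in (-0.5, -0.1, 0.1, 0.5)]
excess = [expected_shortage(p) - expected_shortage(x_5) for p in perturbed]

print(tau_5, tau_10, excess)
```

Every reallocation in this small set increases the expected shortage, and the probability level changes when the constraint changes from 5 to 10 even though the forecast is the same.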

Appendix C: numerical computation of allocation Bayes acts

To compute an allocation score (AS) $S_A(F,y;K) := s_A(x^{F,K}, y)$, we require numerical values for a Bayes act solving the AP (13); that is, we must find the specific resource allocations for each location that are determined by the forecast $F$ under the resource constraint $K$. Assuming we have reliable means of calculating quantiles $q_{\alpha, F_i}$ of the marginal forecasts $F_i$, these allocations are given by $q_{\tau, F_i}$ where $\tau$ solves Equation (18). This equation is not analytically tractable in general, but because the left-hand side of (18) is nondecreasing in $\tau$, it is straightforward to find arbitrarily accurate approximations to a solution using an iterative bisection method.

We have implemented such a method along with the resulting score computations in the R package alloscore (Gerding & Ray, 2023) which provided all AS values used in the analysis of Section 3. The procedure begins with an initial search interval $[\tau_{L,1}, \tau_{U,1}]$ (such as $[0, \max_i F_i(K)]$) that clearly contains the solution $\tau$. At each step $j$ of the algorithm, we evaluate the total allocation $\sum_{i=1}^N q_{\tau_{M,j}, F_i}$ at the midpoint of the search interval, $\tau_{M,j} = \frac{1}{2}(\tau_{L,j} + \tau_{U,j})$, and continue the search on the narrowed sub-interval

$$[\tau_{L,j+1}, \tau_{U,j+1}] = \begin{cases} [\tau_{L,j}, \tau_{M,j}] & \text{if } \sum_{i=1}^N q_{\tau_{M,j}, F_i} \ge K \\ [\tau_{M,j}, \tau_{U,j}] & \text{if } \sum_{i=1}^N q_{\tau_{M,j}, F_i} < K. \end{cases}$$

This search continues until $\tau_{U,j+1} < (1 + \varepsilon)\,\tau_{L,j+1}$ for a suitably small $\varepsilon > 0$.

Figure C1 illustrates this bisection method in the context of the examples of Section 2.1, where it quickly finds close approximations to the analytic solutions $\tau = 1 - e^{-1},\ 1 - e^{-2}$ (for $K = 5, 10$ respectively) which, in this case, are available (cf. (14)).
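
A minimal sketch of the bisection search described above is given here for the two exponential forecasts of the Section 2.1 examples. It is not the alloscore implementation: the function name `bisect_tau`, its arguments, the iteration cap, and the tolerance value are assumptions made for illustration, and none of the conventions for point masses or vanishing densities are handled.

```python
import math

def bisect_tau(quantile_fns, K, lo=0.0, hi=1.0, eps=1e-10, max_iter=200):
    # Bisection for the probability level at which the marginal quantiles sum to K.
    # Relies on the total allocation being nondecreasing in tau.
    for _ in range(max_iter):
        if hi < (1.0 + eps) * lo:  # relative-width stopping rule
            break
        mid = 0.5 * (lo + hi)
        total = sum(q(mid) for q in quantile_fns)
        if total >= K:
            hi = mid  # solution lies in the lower half
        else:
            lo = mid  # solution lies in the upper half
    return 0.5 * (lo + hi)

# Exponential quantile functions with scale parameters 1 and 4, from (14)
quantile_fns = [lambda tau, s=s: -s * math.log(1.0 - tau) for s in (1.0, 4.0)]

tau_5 = bisect_tau(quantile_fns, K=5.0)
tau_10 = bisect_tau(quantile_fns, K=10.0)
allocation_5 = [q(tau_5) for q in quantile_fns]

print(tau_5, tau_10, allocation_5)
```

With the initial interval [0, 1], the search never evaluates a quantile at probability level 1, since every midpoint lies strictly inside the interval. The two results can be compared with the analytic levels given above.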

Figure C1.

Illustrations of bisection method for the examples of Section 2.1. In each plot, the thinner black curves are the quantile functions of the exponential forecasts for Location 1, with $\sigma_1 = 1$ (lower), and Location 2, with $\sigma_2 = 4$ (upper), while the thicker black curve gives the sum of these quantiles, which must meet the resource constraints of $K = 5$ on the left and $K = 10$ on the right. The red segments show the iterates $\tau_{M,j}$ and their associated total resource demands as given by the function allocate from the package alloscore applied to these examples and using an initial search interval $[0,1]$. (The $\tau_{M,j}$ are in this case just sums of terms $\pm 2^{-j}$.) The horizontal dashed lines show the resulting allocations (cf. Figure 1).

Subtleties can arise when the forecast densities $f_i$ vanish or are very small, in which case quantiles are nonunique or highly variable near a probability level, leading to ambiguity or numerical instabilities in the evaluation of $\sum_{i=1}^N q_{\tau, F_i}$. Additionally, if point masses are present in any of the $F_i$, (18) will not have a unique solution for some discrete set of constraint levels $K$. We have adopted conventions for detecting such levels and enforcing consistency in score calculations near them. Through extensive experimentation, we have found that these conventions appear to address these challenges for the forecasts we are working with, but we leave a more rigorous approximation error analysis for later work.

Appendix D: computing allocations from finite quantile forecast representations

In Section 3, we used the AS to evaluate forecasts of COVID-19 hospitalizations that have been submitted to the US COVID-19 Forecast Hub. These forecasts are submitted to the Hub using a set of 23 quantiles of the forecast distribution at the 23 probability levels in the set $\mathcal{T} = \{0.01, 0.025, 0.05, 0.1, 0.15, \ldots, 0.9, 0.95, 0.975, 0.99\}$, which specify a predictive median and the endpoints of central $(1 - \alpha) \times 100\%$ prediction intervals at levels $\alpha = 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9$. For a given week and target date, we use $q_{i,k}$ to denote the submitted quantiles for location $i$ and probability level $\tau_k \in \mathcal{T}$, $k = 1, \ldots, 23$.
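
The correspondence between the 23 probability levels and the 11 central prediction intervals plus the median can be made concrete with a short sketch. The variable names below are illustrative, and rounding is used only to avoid floating-point noise when building the grids.

```python
# The 23 probability levels: two extra levels in each tail plus a 0.05 grid
levels = [0.01, 0.025] + [round(0.05 * k, 2) for k in range(1, 20)] + [0.975, 0.99]

# Central (1 - alpha) x 100% prediction intervals have endpoints at levels alpha/2 and 1 - alpha/2
alphas = [0.02, 0.05] + [round(0.1 * k, 1) for k in range(1, 10)]
lower_endpoints = {round(a / 2, 3) for a in alphas}
upper_endpoints = {round(1 - a / 2, 3) for a in alphas}

# Interval endpoints together with the median (level 0.5)
interval_levels = sorted(lower_endpoints | upper_endpoints | {0.5})

print(len(levels), len(alphas), len(interval_levels))
```

The 11 interval levels contribute 22 endpoints, which together with the median reproduce the 23 probability levels.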

In the event that there is some $k \in \{1, \ldots, 23\}$ for which $\sum_i q_{i,k} = K$, i.e. the provided predictive quantiles at level $\tau_k$ sum across locations to the resource constraint $K$, the solution to the allocation problem is given by those quantiles. However, generally this will not be the case; the optimal allocation will typically be at some probability level $\tau \notin \mathcal{T}$.

To address this situation and support the numerical allocation algorithm outlined in Section C, we need a mechanism to approximate the full CDFs $F_i$, $i = 1, \ldots, N$, based on the provided quantiles. We have developed functionality for this purpose in the distfromq package for R (Ray & Gerding, 2023). This functionality represents a distribution as a mixture of discrete and continuous parts, and it works in two steps:

  1. Identify a discrete component of the distribution consisting of zero or more point masses, and create an adjusted set of predictive quantiles for the continuous part of the distribution by subtracting the point mass probabilities and rescaling.

  2. For the continuous part of the distribution, different approaches are used on the interior and exterior of the provided quantiles:

    1. On the interior, a monotonic cubic spline interpolates the adjusted quantiles representing the continuous part of the distribution.

    2. A location-scale parametric family is used to extrapolate beyond the provided quantiles. The location and scale parameters are estimated separately for the lower and upper tails so as to obtain a tail distribution that matches the two most extreme quantiles in each tail. In this work, we use normal distributions for the tails.

The resulting distributional estimate exactly matches all of the predictive quantiles provided by the forecaster. We use the cumulative distribution function resulting from this procedure as an input to the AS algorithm.

We refer the reader to the distfromq documentation for further detail (Ray & Gerding, 2023).
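
To give a rough sense of the second step, the sketch below builds a quantile function from a handful of quantiles. It is not the distfromq implementation and departs from it in stated ways: linear interpolation is swapped in for the monotonic cubic spline on the interior, point masses are not handled (the supplied quantiles are assumed strictly increasing), and the five levels and quantile values are hypothetical. The normal tails are fitted to the two most extreme quantiles on each side, as described above.

```python
from bisect import bisect_right
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_tail(p1, q1, p2, q2):
    # Normal distribution whose quantiles at levels p1 < p2 are q1 < q2
    z1, z2 = STD_NORMAL.inv_cdf(p1), STD_NORMAL.inv_cdf(p2)
    scale = (q2 - q1) / (z2 - z1)
    loc = q1 - scale * z1
    return NormalDist(loc, scale)

def make_quantile_fn(levels, quantiles):
    # Quantile function matching the given quantiles exactly.
    # Interior: linear interpolation (a simplification of the spline step).
    # Exterior: normal tails matched to the two most extreme quantiles on each side.
    lower = normal_tail(levels[0], quantiles[0], levels[1], quantiles[1])
    upper = normal_tail(levels[-2], quantiles[-2], levels[-1], quantiles[-1])

    def q(tau):
        if tau < levels[0]:
            return lower.inv_cdf(tau)
        if tau > levels[-1]:
            return upper.inv_cdf(tau)
        k = min(bisect_right(levels, tau), len(levels) - 1)
        t = (tau - levels[k - 1]) / (levels[k] - levels[k - 1])
        return quantiles[k - 1] + t * (quantiles[k] - quantiles[k - 1])

    return q

# Hypothetical submission with five quantiles
levels = [0.1, 0.25, 0.5, 0.75, 0.9]
quantiles = [2.0, 4.0, 7.0, 11.0, 16.0]
q = make_quantile_fn(levels, quantiles)

print(q(0.05), q(0.5), q(0.95))
```

A quantile function of this kind can be passed directly to a bisection search for the allocation. Note that a normal lower tail can extend below zero, which would need separate treatment for count outcomes.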

Appendix E: propriety of parametric approximation

In practice, open forecasting exercises are generally not able to collect a perfect description of the forecast distribution F other than in simple settings such as for a categorical variable with a relatively small number of categories. In settings where the outcome being forecasted is a continuous quantity (such as the proportion of outpatient doctor visits where the patient has influenza-like illness) or a count (such as influenza hospitalizations), forecasting exercises have therefore resorted to collecting summaries of a forecast distribution such as bin probabilities or predictive quantiles. In this section, we address two practical concerns raised by this. First, we discuss conditions under which it is possible to calculate the AS when only summaries of a forecast distribution are recorded in a submission to a forecast hub. Second, we show that a post hoc attempt to compute the AS based on submitted predictive quantiles may in fact compute an alternative score that is not proper.

E.1. Propriety when scoring methods are announced prospectively

We consider a setting where a forecasting exercise (such as a forecast hub) prespecifies that forecasts will be represented using a parametric family of forecast distributions $G_\theta(y)$, and the task of the forecaster is to select a particular parameter value $\theta$. We use $\mathcal{P}$ to denote the collection of all distributions $G_\theta$ in the given parametric family. Additionally, we note that the functionality in distfromq can be viewed as specifying a parametric family $\mathcal{P}_{\mathrm{dfq}}$ where the parameters $\theta$ of $G_\theta$ are its quantiles at prespecified probability levels, and where the shape of any $G_\theta \in \mathcal{P}_{\mathrm{dfq}}$ over the full range of its support is entirely controlled by these quantiles.

We find it helpful now to formally distinguish between two decision making problems. The first is the public health decision-maker’s allocation problem where the task is to select an allocation $x$, with the allocation loss $s_A(x,y) = \sum_{i=1}^N L \max(0, y_i - x_i)$ as described in Section 2.2.2. The second is the forecaster’s reporting problem where the task is to select parameter values $\theta$ to report. The forecaster’s loss is given by

$$s_R(\theta, y) = s_A(x^{G_\theta}, y), \tag{19}$$

where $x^{G_\theta}$ is the Bayes act for the allocation problem under the distribution $G_\theta$. In words, the loss associated with reporting $\theta$ is equal to the loss associated with taking the Bayes allocation corresponding to the distribution $G_\theta$.

Following our usual construction, the Bayes act for the forecast reporting problem is the parameter set that minimizes the forecaster’s expected loss. Breaking with our earlier notation for improved legibility, we use $\theta(F)$ to denote this Bayes act:

$$\theta(F) = \arg\min_\theta E_F[s_R(\theta, Y)] = \arg\min_\theta E_F[s_A(x^{G_\theta}, Y)].$$

We then arrive at the scoring rule

$$S_R(F,y) = s_R(\theta(F), y) = s_A(x^{G_{\theta(F)}}, y).$$

It follows from the discussion in Section A.1 that this is a proper scoring rule for $F$. Although the full forecast distribution $F$ is not available in the forecast submission, the score $S_R(F,y)$ can be calculated from the reported parameter values as long as the forecaster submits the optimal parameters $\theta(F)$.

We emphasize that the forecaster’s true predictive distribution $F$ does not need to be a member of the specified parametric family $\mathcal{P}$ for this construction to yield a proper score. It is, however, necessary to specify the parametric family to use and the foundational scoring rule $s_A$ (including any relevant problem parameters such as the resource constraint $K$) in advance, so that forecasters can identify the Bayes act parameter set $\theta(F)$ to report.

If the parametric family used to represent forecast distributions is flexible enough, the reporting scoring rule $S_R$ and the allocation score will coincide. Suppose that for a given resource constraint $K$, for any forecast distribution $F$ it is possible to find a member $G_{\theta^\star}$ of the specified parametric family $\mathcal{P}$ with the same allocation as $F$ (i.e. $x^F = x^{G_{\theta^\star}}$). Then $\theta^\star$ is a Bayes act for the reporting problem since for any other parameter value $\theta$,

\begin{align*}
E_F[s_R(\theta^\star, Y)] &= E_F[s_A(x^{G_{\theta^\star}}, Y)] \\
&= E_F[s_A(x^F, Y)] \quad (\text{since } x^F = x^{G_{\theta^\star}}) \\
&\le E_F[s_A(x^{G_\theta}, Y)] \quad (\text{by definition of } x^F) \\
&= E_F[s_R(\theta, Y)].
\end{align*}

Thus $S_R(F,y) = s_R(\theta^\star, y) = s_A(x^{G_{\theta^\star}}, y) = s_A(x^F, y) = S_A(F,y)$.

For the particular choice of the parametric family $\mathcal{P}_{\mathrm{dfq}}$ (i.e. using the distfromq package), this flexibility requirement is satisfied. For instance, the forecaster could pick one required quantile level (such as 0.5, for which the corresponding predictions are predictive medians), and set the submitted quantiles of their forecast distribution in each location at that level to be the desired allocations, which sum to $K$ across all locations. However, this representation of the forecast may be quite different from the actual forecast distribution $F$. For example, for the actual forecast distribution $F$ the allocations may occur at some quantile level other than 0.5.

As another alternative for practical forecasting exercises, a forecast hub could ask forecasters to directly provide the Bayes allocations associated with their forecasts for one or more specified resource constraints K. At the cost of increasing the number of quantities solicited by the forecast hub, this would have several advantages: it would prevent any artificial distortion of the forecast distributions, allow for direct calculation of scores, and narrow the gap between model outputs and public health end users. For this to be feasible, implementations of the allocation algorithm would have to be provided to participating forecasters in the computational languages being used for modelling.

E.2. Impropriety of post hoc allocation scoring with quantile forecasts

A post hoc evaluation of quantile forecasts that combines the parametric family specified by distfromq with the allocation score does not yield the AS of the forecast distribution $F$. Instead, it computes an alternative score that is improper. This is because the forecast distribution $F$ and the distribution $G_{q(F)} \in \mathcal{P}_{\mathrm{dfq}}$ with the same quantiles as $F$ at the prespecified probability levels may determine different resource allocations. In our investigations, these discrepancies appear to be relatively minor on the interior of the provided quantiles, but could be severe if the tail extrapolations performed by distfromq do not match the tail behaviour of $F$ and the allocations are in the tails of the predictive distribution.

We define

$$G(F) := \arg\min_{G \in \mathcal{P}_{\mathrm{dfq}}} E_F[S_A(G,Y)].$$

Since $S_R$ is defined as the Bayes scoring rule for the forecaster’s loss (19), $G(F)$ coincides with $G_{\theta(F)}$, the distribution in $\mathcal{P}_{\mathrm{dfq}}$ given by the optimal submission parameters $\theta(F)$ for the forecaster with predictive distribution $F$. In general, $G_{q(F)}$ and $G(F)$ will be different distributions: matching $F$ at specific quantiles does not require $G_{q(F)}$ to match $F$ at the quantiles for $\tau^{F,K}$ (cf. (18)), which would be necessary for it to share $x^F$ as an optimal allocation.

When an analyst attempts a post hoc computation of the AS using $G_{q(F)}$ (implicitly assuming that $G_{q(F)} = G(F)$), they in fact compute the alternative score

$$\tilde{S}(F,y) = S_A(G_{q(F)}, y) = s_A(x^{G_{q(F)}}, y).$$

This score is improper because $E_F[S_A(G,Y)]$ is minimized by $G(F)$, not $G_{q(F)}$. In general, we have

$$E_F[\tilde{S}(G(F), Y)] \le E_F[S_A(G_{q(F)}, Y)] = E_F[\tilde{S}(F,Y)]. \tag{20}$$

However, the inequality in (20) will typically be strict. For example, if $F$ has heavy upper tails (such as for a lognormal distribution), but normal distributions are used for tail extrapolations in distfromq, then the resource allocations based on the distribution $G_{q(F)}$ may be quite different from the optimal allocations under the distribution $F$, leading to a strict inequality. This demonstrates that $\tilde{S}$ is improper.
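
The lognormal example can be made concrete with a small numerical sketch. Everything in it is a hypothetical construction for illustration: location 1 has a lognormal(0, 1) forecast, location 2 an exponential forecast with mean 4, the constraint is 40 (large enough that the allocation lies above the 0.99 quantiles), the unit loss is 1, and the stand-in for the quantile-matched distribution keeps only a normal upper tail fitted to the 0.975 and 0.99 quantiles rather than calling distfromq.

```python
import math
from statistics import NormalDist

Z = NormalDist()
K = 40.0  # hypothetical constraint, above the sum of the 0.99 quantiles

def q_lognormal(tau):
    # Quantile function of lognormal(0, 1)
    return math.exp(Z.inv_cdf(tau))

def q_exponential(tau, mean=4.0):
    return -mean * math.log(1.0 - tau)

def shortage_lognormal(x):
    # E[(Y - x)_+] for Y ~ lognormal(0, 1), x > 0
    return math.exp(0.5) * Z.cdf(1.0 - math.log(x)) - x * Z.cdf(-math.log(x))

def shortage_exponential(x, mean=4.0):
    return mean * math.exp(-x / mean)

def normal_tail_quantile(q_fn, p1=0.975, p2=0.99):
    # Quantile function of the normal distribution matching q_fn at levels p1 and p2
    z1, z2 = Z.inv_cdf(p1), Z.inv_cdf(p2)
    scale = (q_fn(p2) - q_fn(p1)) / (z2 - z1)
    loc = q_fn(p1) - scale * z1
    return lambda tau: loc + scale * Z.inv_cdf(tau)

def tail_allocation(quantile_fns, K, lo=0.99, hi=1.0 - 1e-12, iters=200):
    # Bisection for the common probability level, searched in the upper tail only
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(q(mid) for q in quantile_fns) >= K:
            hi = mid
        else:
            lo = mid
    return [q(hi) for q in quantile_fns]

def expected_shortage_under_F(x):
    # Expected allocation loss under the true forecast F
    return shortage_lognormal(x[0]) + shortage_exponential(x[1])

x_F = tail_allocation([q_lognormal, q_exponential], K)
x_G = tail_allocation(
    [normal_tail_quantile(q_lognormal), normal_tail_quantile(q_exponential)], K
)

print(x_F, x_G)
print(expected_shortage_under_F(x_F), expected_shortage_under_F(x_G))
```

Under these assumptions the two allocations differ, and the allocation derived from the normal-tailed stand-in has a strictly larger expected shortage under the true forecast, which is the strict form of (20).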

Contributor Information

Aaron Gerding, Department of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts at Amherst, Amherst, Massachusetts, USA.

Nicholas G Reich, Department of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts at Amherst, Amherst, Massachusetts, USA.

Benjamin Rogers, Department of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts at Amherst, Amherst, Massachusetts, USA.

Evan L Ray, Department of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts at Amherst, Amherst, Massachusetts, USA.

Funding

This work has been supported by the National Institute of General Medical Sciences (NIGMS) (R35GM119582) and the Centers for Disease Control and Prevention (CDC) (1U01IP001122). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIGMS, or the CDC.

Data availability

All forecast data used in this evaluation are available through the COVID-19 Forecast Hub (Cramer, Huang, et al., 2022). An R package implementing the AS is available at https://github.com/aaronger/alloscore. All code and data for the analyses presented in this manuscript are available at https://github.com/aaronger/utility-eval-papers. The analyses were generated with a reproducible workflow built on R version 4.4.1 (2024-06-14) and the targets package (Landau, 2021; R Core Team, 2023).

References

  1. Araz O. M., Galvani A., & Meyers L. A. (2012). Geographic prioritization of distributing pandemic influenza vaccines. Health Care Management Science, 15(3), 175–187. 10.1007/s10729-012-9199-6
  2. Bannigidadmath D., & Narayan P. K. (2016). Stock return predictability and determinants of predictability and profits. Emerging Markets Review, 26, 153–173. 10.1016/j.ememar.2015.12.003
  3. Bertsekas D. (2012). Dynamic programming and optimal control: Volume I (Vol. 4). Athena Scientific.
  4. Bertsimas D., Boussioux L., Cory-Wright R., Delarue A., Digalakis V., Jacquillat A., Kitane D. L., Lukin G., Li M., Mingardi L., Nohadani O., Orfanoudaki A., Papalexopoulos T., Paskov I., Pauphilet J., Lami O. S., Stellato B., Bouardi H. T., Carballo K. V., … Zeng C. (2021). From predictions to prescriptions: A data-driven response to COVID-19. Health Care Management Science, 24(2), 253–272. 10.1007/s10729-020-09542-0
  5. Bilinski A. M., Salomon J. A., & Hatfield L. A. (2023). Adaptive metrics for an evolving pandemic: A dynamic approach to area-level COVID-19 risk designations. Proceedings of the National Academy of Sciences, 120(32), e2302528120. 10.1073/pnas.2302528120
  6. Bracher J., Ray E. L., Gneiting T., & Reich N. G. (2021). Evaluating epidemic forecasts in an interval format. PLoS Computational Biology, 17(2), e1008618. 10.1371/journal.pcbi.1008618
  7. Burnett A. (2014). Urban political processes and resource allocation. In Progress in political geography (Routledge revivals) (pp. 177–215). Routledge.
  8. Camacho A., Kucharski A., Aki-Sawyerr Y., White M. A., Flasche S., Baguelin M., Pollington T., Carney J. R., Glover R., Smout E., & Tiffany A. (2015). Temporal changes in Ebola transmission in Sierra Leone and implications for control requirements: A real-time modelling study. PLoS Currents, 7. 10.1371/currents.outbreaks.406ae55e83ec0b5193e30856b9235ed2
  9. Cenesizoglu T., & Timmermann A. (2012). Do return prediction models add economic value? Journal of Banking & Finance, 36(11), 2974–2987. 10.1016/j.jbankfin.2012.06.008
  10. Colett J. S., Kelly J. C., & Keoleian G. A. (2016). Using nested average electricity allocation protocols to characterize electrical grids in life cycle assessment. Journal of Industrial Ecology, 20(1), 29–41. 10.1111/jiec.2016.20.issue-1
  11. Colón-González F. J., Bastos L. S., Hofmann B., Hopkin A., Harpham Q., Crocker T., Amato R., Ferrario I., Moschini F., James S., Malde S., Ainscoe E., Nam V. S., Tan D. Q., Khoa N. D., Harrison M., Tsarouchi G., Lumbroso D., Brady O. J., … Cook A. R. (2021). Probabilistic seasonal dengue forecasting in Vietnam: A modelling study using superensembles. PLoS Medicine, 18(3), e1003542. 10.1371/journal.pmed.1003542
  12. Cramer E. Y., Huang Y., Wang Y., Ray E. L., Cornell M., Bracher J., Brennen A., Rivadeneira A. J. C., Gerding A., House K., Jayawardena D., Kanji A. H., Khandelwal A., Le K., Mody V., Mody V., Niemi J., Stark A., Shah A., … Reich N. G. (2022). The United States COVID-19 forecast hub dataset. Scientific Data, 9(1), 462. 10.1038/s41597-022-01517-w
  13. Cramer E. Y., Ray E. L., Lopez V. K., Bracher J., Brennen A., Rivadeneira A. J. C., Gerding A., Gneiting T., House K. H., Huang Y., & Jayawardena D. (2022). Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States. Proceedings of the National Academy of Sciences, 119, e2113561119. 10.1073/pnas.2113561119
  14. Dawid A. P. (2007). The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1), 77–93. 10.1007/s10463-006-0099-8
  15. Diecidue E., & Somasundaram J. (2017). Regret theory: A new foundation. Journal of Economic Theory, 172, 88–119. 10.1016/j.jet.2017.08.006
  16. Du J., Beesley L. J., Lee S., Zhou X., Dempsey W., & Mukherjee B. (2022). Optimal diagnostic test allocation strategy during the COVID-19 pandemic and beyond. Statistics in Medicine, 41(2), 310–327. 10.1002/sim.v41.2
  17. Ellsberg D. (1961). Risk, ambiguity, and the savage axioms. The Quarterly Journal of Economics, 75(4), 643–669. 10.2307/1884324
  18. Finger F., Funk S., White K., Siddiqui M. R., Edmunds W. J., & Kucharski A. J. (2019). Real-time analysis of the diphtheria outbreak in forcibly displaced Myanmar nationals in Bangladesh. BMC Medicine, 17(1), 58. 10.1186/s12916-019-1288-7
  19. Fissler T., & Ziegel J. F. (2016). Higher order elicitability and Osband’s principle. Annals of Statistics, 44(4), 1680–1707. 10.1214/16-AOS1439
  20. Fox S. J., Lachmann M., Tec M., Pasco R., Woody S., Du Z., Wang X., Ingle T. A., Javan E., Dahan M., Gaither K., Escott M. E., Adler S. I., Johnston S. C., Scott J. G., & Meyers L. A. (2022). Real-time pandemic surveillance using hospital admissions and mobility data. Proceedings of the National Academy of Sciences, 119(7), e2111870119. 10.1073/pnas.2111870119
  21. Gebre S. L., Cattrysse D., & Van Orshoven J. (2021). Multi-criteria decision-making methods to address water allocation problems: A systematic review. Water, 13(2), 125. 10.3390/w13020125
  22. Gerding A., & Ray E. (2023). alloscore: Tools for implementing allocation scoring rules. R package version 0.0.9001. https://github.com/aaronger/alloscore.
  23. Gneiting T. (2011a). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494), 746–762. 10.1198/jasa.2011.r10138
  24. Gneiting T. (2011b). Quantiles as optimal point forecasts. International Journal of Forecasting, 27(2), 197–207. 10.1016/j.ijforecast.2009.12.015
  25. Gneiting T., & Raftery A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. 10.1198/016214506000001437
  26. Gneiting T., & Ranjan R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29(3), 411–422. 10.1198/jbes.2010.08110
  27. Hadley G., & Whitin T. M. (1963). Analysis of inventory systems. Prentice-Hall international series in management. Prentice-Hall.
  28. Hong T., Pinson P., Fan S., Zareipour H., Troccoli A., & Hyndman R. J. (2016). Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. International Journal of Forecasting, 32(3), 896–913. 10.1016/j.ijforecast.2016.02.001
  29. Huang H.-C., Araz O. M., Morton D. P., Johnson G. P., Damien P., Clements B., & Meyers L. A. (2017). Stockpiling ventilators for influenza pandemics. Emerging Infectious Diseases, 23(6), 914–921. 10.3201/eid2306.161417
  30. Igboh L. S., Roguski K., Marcenac P., Emukule G. O., Charles M. D., Tempia S., Herring B., Vandemaele K., Moen A., Olsen S. J., Wentworth D. E., Kondor R., Mott J. A., Hirve S., Bresee J. S., Mangtani P., Nguipdop-Djomo P., & Azziz-Baumgartner E. (2023). Timing of seasonal influenza epidemics for 25 countries in Africa during 2010–19: A retrospective analysis. The Lancet: Global Health, 11(5), e729–e739. 10.1016/S2214-109X(23)00109-2
  31. Ioannidis J. P., Cripps S., & Tanner M. A. (2022). Forecasting for COVID-19 has failed. International Journal of Forecasting, 38(2), 423–438. 10.1016/j.ijforecast.2020.08.004
  32. Johansson M. A., Apfeldorf K. M., Dobson S., Devita J., Buczak A. L., Baugher B., Moniz L. J., Bagley T., Babin S. M., Guven E., & Yamana T. K. (2019). An open challenge to advance probabilistic forecasting for dengue epidemics. Proceedings of the National Academy of Sciences, 116, 24268–24274. 10.1073/pnas.1909865116
  33. Johansson M. A., Reich N. G., Hota A., Brownstein J. S., & Santillana M. (2016). Evaluating the performance of infectious disease forecasts: A comparison of climate-driven and seasonal dengue forecasts for Mexico. Scientific Reports, 6(1), 33707. 10.1038/srep33707
  34. Jose V. R. R., & Winkler R. L. (2009). Evaluating quantile assessments. Operations Research, 57(5), 1287–1297. 10.1287/opre.1080.0665
  35. Landau W. M. (2021). The targets R package: A dynamic make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 6(57), 2959. 10.21105/joss
  36. Leitch G., & Tanner J. E. (1991). Economic forecast evaluation: Profits versus the conventional error measures. The American Economic Review, 81, 580–590. https://www.jstor.org/stable/2006520
  37. Liang S., & Wey W.-M. (2013). Resource allocation and uncertainty in transportation infrastructure planning: A study of highway improvement program in Taiwan. Habitat International, 39, 128–136. 10.1016/j.habitatint.2012.11.004
  38. Marshall M., Parker F., & Gardner L. M. (2024). When are predictions useful? A new method for evaluating epidemic forecasts. BMC Global Public Health, 2(67). 10.1186/s44263-024-00098-7
  39. McGowan C. J., Biggerstaff M., Johansson M., Apfeldorf K. M., Ben-Nun M., Brooks L., Convertino M., Erraguntla M., Farrow D. C., Freeze J., Ghosh S., Hyun S., Kandula S., Lega J., Liu Y., Michaud N., Morita H., Niemi J., Ramakrishnan N., … Yang W. (2019). Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Scientific Reports, 9(1), 683. 10.1038/s41598-018-36361-9
  40. Meltzer M. I., Atkins C. Y., Santibanez S., Knust B., Petersen B. W., Ervin E. D., Nichol S. T., Damon I. K., & Washington M. L. (2014). Estimating the future number of cases in the Ebola epidemic–Liberia and Sierra Leone, 2014–2015. MMWR, 63, 1–14. https://www.cdc.gov/mmwr/preview/mmwrhtml/su63e0923a1.htm
  41. Murphy A. H. (1993). What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8, 281–293. 10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2
  42. Papastefanopoulos V., Linardatos P., & Kotsiantis S. (2020). COVID-19: A comparison of time series methods to forecast percentage of active cases per population. Applied Sciences, 10(11), 3880. 10.3390/app10113880
  43. Pasco R., Johnson K., Fox S. J., Pierce K. A., Johnson-León M., Lachmann M., Morton D. P., & Meyers L. A. (2023). COVID-19 test allocation strategy to mitigate SARS-CoV-2 infections across school districts. Emerging Infectious Diseases, 29(3), 501–510. 10.3201/eid2903.220761
  44. Persad G., Leland R. J., Ottersen T., Richardson H. S., Saenz C., Schaefer G. O., & Emanuel E. J. (2023). Fair domestic allocation of monkeypox virus countermeasures. The Lancet: Public Health, 8(5), e378–e382. 10.1016/S2468-2667(23)00061-0
  45. Pesaran M. H., & Skouras S. (2002). Decision-based methods for forecast evaluation. In M. P. Clements & D. F. Hendry (Eds.), A companion to economic forecasting (chap. 11, pp. 241–267). Blackwell.
  46. Pflug G. C., & Pichler A. (2014). Multistage stochastic optimization (Vol. 1104). Springer.
  47. Probert W. J. M., Shea K., Fonnesbeck C. J., Runge M. C., Carpenter T. E., Dürr S., Garner M. G., Harvey N., Stevenson M. A., Webb C. T., Werkman M., Tildesley M. J., & Ferrari M. J. (2016). Decision-making for foot-and-mouth disease control: Objectives matter. Epidemics, 15, 10–19. 10.1016/j.epidem.2015.11.002
  48. Rainisch G., Shankar M., Wellman M., Merlin T., & Meltzer M. I. (2015). Regional spread of Ebola virus, West Africa, 2014. Emerging Infectious Diseases, 21(3), 444–447. 10.3201/eid2103.141845
  49. Ray E. L., Brooks L. C., Bien J., Biggerstaff M., Bosse N. I., Bracher J., Cramer E. Y., Funk S., Gerding A., Johansson M. A., Rumack A., Wang Y., Zorn M., Tibshirani R. J., & Reich N. G. (2023). Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States. International Journal of Forecasting, 39(3), 1366–1383. 10.1016/j.ijforecast.2022.06.005
  50. Ray E. L., & Gerding A. (2023). distfromq: Reconstruct a distribution from a collection of quantiles. R package version 1.0.2. https://github.com/reichlab/distfromq.
  51. R Core Team (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
  52. Reich N. G., Brooks L. C., Fox S. J., Kandula S., McGowan C. J., Moore E., Osthus D., Ray E. L., Tushar A., Yamana T. K., Biggerstaff M., Johansson M. A., Rosenfeld R., & Shaman J. (2019). A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proceedings of the National Academy of Sciences, 116(8), 3146–3154. 10.1073/pnas.1812594116
  53. Royset J. O., & Wets R. J.-B. (2022). An optimization primer. Springer.
  54. Sherratt K., Gruson H., Grah R., Johnson H., Niehus R., Prasse B., Sandmann F., Deuschel J., Wolffram D., Abbott S., Ullrich A., Gibson G., Ray E. L., Reich N. G., Sheldon D., Wang Y., Wattanachit N., Wang L., Trnka J., … Meinke J. H. (2023). Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations. eLife, 12, e81916. 10.7554/eLife.81916
  55. Syme G. J., Nancarrow B. E., & McCreddin J. A. (1999). Defining the components of fairness in the allocation of water to environmental and human uses. Journal of Environmental Management, 57(1), 51–70. 10.1006/jema.1999.0282
  56. United States Census Bureau (2022). Annual population estimates, estimated components of resident population change, and rates of the components of resident population change for the United States, States, District of Columbia, and Puerto Rico: April 1, 2020 to July 1, 2022. US Census Bureau. https://www2.census.gov/programs-surveys/popest/datasets/2020-2022/state/totals/NST-EST2022-ALLDATA.csv. Accessed February 1, 2024.
  57. University of Texas at Austin (2022). COVID forecasting method using hospital and cellphone data proves it can reliably guide US cities through pandemic threats. https://news.utexas.edu/2022/02/02/covid-forecasting-method-using-hospital-and-cellphone-data-proves-it-can-reliably-guide-us-cities-through-pandemic-threats/. Accessed May 26, 2023.
  58. Yardley E., & Petropoulos F. (2021). Beyond error measures to the utility and cost of the forecasts. Foresight: The International Journal of Applied Forecasting, 63, 36–45. https://forecasters.org/wp-content/uploads/Foresight63_Beyond-Error-Measures-to-the-Utility-and-Cost-of-the-Forecasts.pdf
  59. Zhang Y., Zeng Q., Ma F., & Shi B. (2019). Forecasting stock returns: Do less powerful predictors help? Economic Modelling, 78, 32–39. 10.1016/j.econmod.2018.09.014

Data Availability Statement

All forecast data used in this evaluation are available through the COVID-19 Forecast Hub (Cramer, Huang, et al., 2022). An R package implementing the AS is available at https://github.com/aaronger/alloscore. All code and data for the analyses presented in this manuscript are available at https://github.com/aaronger/utility-eval-papers. The analyses were generated with a reproducible workflow built on R version 4.4.1 (2024-06-14) and the targets package (Landau, 2021; R Core Team, 2023).


Articles from Journal of the Royal Statistical Society. Series A (Statistics in Society) are provided here courtesy of Oxford University Press.
