An expert judgment model to predict early stages of the COVID-19 pandemic in the United States

Thomas McAndrew; Nicholas G Reich

doi:10.1371/journal.pcbi.1010485

. 2022 Sep 23;18(9):e1010485. doi: 10.1371/journal.pcbi.1010485

An expert judgment model to predict early stages of the COVID-19 pandemic in the United States

Thomas McAndrew ^1,^*, Nicholas G Reich ²

Editor: Tom Britton³

PMCID: PMC9534428 PMID: 36149916

Abstract

From February to May 2020, experts in the modeling of infectious disease provided quantitative predictions and estimates of trends in the emerging COVID-19 pandemic in a series of 13 surveys. Data on existing transmission patterns were sparse when the pandemic began, but experts synthesized information available to them to provide quantitative, judgment-based assessments of the current and future state of the pandemic. We aggregated expert predictions into a single “linear pool” by taking an equally weighted average of their probabilistic statements. At a time when few computational models made public estimates or predictions about the pandemic, expert judgment provided (a) falsifiable predictions of short- and long-term pandemic outcomes related to reported COVID-19 cases, hospitalizations, and deaths, (b) estimates of latent viral transmission, and (c) counterfactual assessments of pandemic trajectories under different scenarios. The linear pool approach of aggregating expert predictions provided more consistently accurate predictions than any individual expert, although the predictive accuracy of a linear pool rarely provided the most accurate prediction. This work highlights the importance that an expert linear pool could play in flexibly assessing a wide array of risks early in future emerging outbreaks, especially in settings where available data cannot yet support data-driven computational modeling.

Author summary

We asked experts in the modeling of infectious disease to submit probabilistic predictions of the spread and burden of SARS-CoV-2/COVID-19 from February to May, 2020 in an effort to support public health decision making. Expert predictions were aggregated into a linear pool. We found experts could produce short and long term predictions related to the pandemic that could be compared to ground truth such as the number of cases occurring by the end of the week and predictions of unmeasurable outcomes such as latent viral transmission. Experts were also able to make counter factual predictions—predictions of an outcome assuming an action will continue or not continue. In addition, predictions built by aggregating individual expert predictions were less variable when compared to predictions made by individuals. Our work highlights that an expert linear pool is a fast, flexible tool that can support situational awareness for public health officials during an emerging outbreak.

Introduction

The first COVID-19 cases globally were reported in December of 2019 [1]. The World Health Organization (WHO) declared the outbreak a Public Health Emergency of International Concern on January 30, 2020, and on March 11, 2020, after the virus began spreading to other continents [2, 3], the WHO designated the outbreak a pandemic [4]. The first COVID-19 case in the United States without known origin occurred in Washington state in late January [1].

As with previous outbreaks of other diseases [5–7], forecasts from computational models [8–11] assisted in planning and outbreak response near the beginning of the pandemic. However, given the initial limitations in testing capacity for SARS-CoV-2, these models were confronted with imperfect data with which to explain and predict viral transmission dynamics. Because of these challenges in the early phase of the pandemic, some models faced criticism for a lack of accuracy [12].

Starting in mid-February 2020, shortly after the first US case of COVID-19 was identified and before any large-scale computational modeling efforts were public in the US, we collected and aggregated probabilistic predictions and estimates—quantitative statements provided as a probability distribution over potential future values—in weekly surveys of experts in the modeling of infectious disease [13].

Human predictions have been able to accurately predict phenomena or support the development of a predictive model in many domains from ecology to economics [14–21]. Expert and non-expert crowds have made accurate predictions of sleep disturbances [16], geopolitical events [15, 17], and meteorologic events [20, 21], and experts have successfully chosen variables that improved the prediction of clinical phenomena [18, 22]. In the context of infectious disease, human judgment (from a knowledgeable but not exclusively “expert” panel) produced accurate forecasts of seasonal influenza outbreaks in recent seasons [5].

However, there is as much past work highlighting experts’ predictive skill as there is highlighting their difficulty in accurately assessing uncertainty [23–25]. Differences in predictive performance are thought to be a result of individual experts’ abilities and how they interact with their environment–the probabilistic relationships between cues (data and signals available to them to help form an informed judgment) and the target of interest [26–33]. Past work has shown expert performance decreases when tasks are complex, ambiguous, and when experts receive imprecise feedback [34–37]. Expert performance also decreases when there is an overabundance of information in the environment potentially related to the target of interest [36]. Experts are also not exempt from cognitive biases that impact human judgment [23, 38–41]. Gains in predictive accuracy have been found when individual human predictions are combined into a linear pool [42–44].

Recent work in human judgement and crowdsourcing applied techniques from expert elicitation and decision theory to computational models of COVID-19 [45], crowdsourcing antiviral drugs to inhibit SARS-CoV-2 [46], and have used social media to collect patient-level data on COVID-19 to track transmission [47]. Predictions of COVID-19 from experts and lay persons have been compared, finding experts can make more accurate, though often overconfident, predictions [48]. Particular to COVID-19, human judgment and crowdsourcing are methods for rapidly collecting and organizing data, and exploring many potential interventions in parallel.

In the early months of the pandemic there was a tremendous amount of information being generated. Between pre-print scientific research, media attention and amplification of both accurate and inaccurate findings, sparse past examples of an outbreak of this magnitude, and fast-changing government responses, assessing the current state and predicting the future trajectory of the pandemic were very difficult tasks in early 2020. This work assesses the performance of a panel of experts in providing quantitative estimates of a wide range of at-the-time unknown quantities relating to the scale and pace of the emerging COVID-19 pandemic in the US.

Materials and methods

Ethics statement

The proposed research was deemed not human subjects research by the University of Massachusetts-Amherst Institutional Review Board (IRB Determination Number 20–54).

Recruitment of experts

We defined an expert as a researcher who has spent a substantial amount of time in their professional career designing, building, and/or interpreting models to explain and understand infectious disease dynamics and/or the associated policy implications in human populations.

Experts were recruited by sending an email asking for their participation, and by soliciting participation through online forums for infectious disease modelers.

Experts could participate in our surveys after reading and agreeing to a consent document (S1 Fig). The consent states that after an expert completed two surveys their name and affiliation would be included in public-facing summaries. The consent form also said that public releases of the data would ensure individual expert responses would remain unidentifiable. A list of experts who participated in two or more surveys can be found in the data repository [13] and in S3 Table in the supplement.

Survey data were collected using the web-based Qualtrics platform (Qualtrics, Seattle, WA) through a link sent via email. A link to the survey was also placed on an online forum of modelers focused on COVID-19, asking a participant to self-identify as an expert and fill out the survey. If the participant was vetted to be an expert by the research team, according to the expert definition above, they were added to the list of those that receive weekly emails. Predictions were collected from experts starting on the Monday of each week and closing on Tuesday of that same week.

Survey methods

We asked experts questions that required them to submit predictions in one of four different formats: (i) categorical questions asked experts to pick one out of two (binary) or many (categorical) options, (ii) probabilistic questions asked experts to assign a probability to two (binary probabilistic) or many (categorical probabilistic) options, (iii) percentile questions asked experts to provide a lower (5th or 10th) percentile, median percentile, and upper (90th or 95th) percentile, (iv) triplet questions asked experts to report a smallest, most likely, and largest possible value for a forecasting target of interest. The format for how we asked experts to submit answers changed from smallest, most likely, and largest to a percentile format to more clearly communicate the predictions we meant to elicit from experts. A list of all questions and the type of answer required by survey can be found in the data repository and in the supplement in S6 Table.

Data on true outcomes predicted by experts were collected from several sources (S1 Table). For questions that have known, measurable answers, we created a database with the observed answers and the resolution criteria—the method used to define the true answer—that can be found in the data repository.

Survey of cases

For surveys administered from February 17 to April 6 we asked participants to provide the smallest, most likely, and largest possible number of cases that would occur by the end of the week, and for surveys administered from April 13 to May 11 participants were asked to assign probabilities to intervals where the number of cases could occur.

Survey of deaths

For surveys administered on March 16, 23, 30, we asked participants to provide the smallest, most likely, and largest possible number of deaths that would occur by the end of 2020, on April 20th we asked participants to provide a 5th, 50th, and 95th percentile, and on May 4 we asked participants to provide a 10th, 50th, and 90th percentile.

Surveys of latent viral transmission

For surveys administered from March 2nd to April 6 we asked experts to provide a smallest, most likely, and largest estimate. The last survey, on April 27, asked experts to provide a 10th, 50th, and 90th percentile. Experts were asked from March 2 to March 16 to predict the percent of confirmed cases, and from March 23 to April 27 experts were asked to predict the total number of infections.

Surveys of counterfactual predictions

For a survey administered on April 27 we asked experts to provide a smallest, most likely, and largest estimate of the 7-day moving average of reported COVID-19 cases in Georgia (i) if the state continued to loosen restrictions and (ii) if the state did not loosen restrictions. A survey administered on May 4 asked experts to provide a smallest, most likely, and largest estimate of the 7-day moving average of reported COVID-19 cases in Texas if the state loosed or did not loosen restrictions. For a survey on May 11 we asked experts to provide a 10th, 50th and 90th percentile estimate of the 7-day moving average of reported COVID-19 cases in Washington state is the state began, or did not begin, an accelerated restart.

Data repository

A publicly available repository with details about questions asked and data on all responses is available under a MIT license at https://github.com/tomcm39/COVID19_expert_survey.

Statistical Methods

Rounding

When reported in the text, expert predictions in the manuscript were taken to have two significant digits. For example, the number 12 was rounded to 12, 123 was rounded to 120, 1,234 was rounded to 1,200, 12,345 was rounded to 12,000, etc. We felt that rounding like this maintained the relevant precision in estimates for public health practice. Rounding was not performed when calculating scores.

Linear pooling

A linear pool (f) is a function that takes as input E expert predictions formatted as probability distributions f₁, f₂, ⋯, f_E over a single target value t [49]. We assume the true value t was generated from a random variable T whose distribution we do not know. The linear pool then outputs a probability distribution over t,

\begin{matrix} f (x) = \sum_{e = 1}^{E} π_{e} f_{e} (x) such that \sum_{e = 1}^{E} π_{e} = 1 \end{matrix}

where π_e is a weight between zero and one assigned to expert e. Because each f_e is a probability distribution the linear pool f is a probability distribution. For this work we chose to weight all experts equally, assigning a weight of π_e = 1/E to each expert (see S1 Appendix for additional weighting schemes and results).

Scoring predictions

Expert predictions were scored against true outcomes using the log score [50, 51]. The log score is a proper scoring rule [50–52] that assigns the log of the probability a forecast placed on the true outcome

LS (x) = ln [p (x)]

where p(x) is the forecasted probability assigned to the true value x. A log score of 0 indicates the forecast placed a probability of 1 on the truth and is the best score. A log score of negative infinity indicates the forecast placed a probability of 0 on the truth and is the worst score. Experts’ scores for measurable questions are stored in the data repository [13].

A common transformation of the log score is forecast skill [53], defined as the exponentiated log score.

Forecast skill (x) = exp {LS (x)} = p (x)

Linear pool performance was reported using the forecast skill of the linear pool divided by the forecast skill of an unskilled forecaster minus one, or the relative difference in forecast skill compared to an unskilled forecaster.

Relative skill = \frac{Forecast skill of model}{Forecast skill of unskilled forecaster} - 1

A positive difference indicates the linear pool or individual expert is better informed than an unskilled forecaster.

Given N predictions, forecast skill percentile is computed for each prediction by ranking all N predictions by their forecast skill, assigning a value of one to the smallest forecast skill and a value of N to the largest forecast skill, dividing by the number of predictions (N), and multiplying by 100. A forecast skill percentile of 0 is the least accurate prediction and a forecast skill percentile of 100 is the most accurate prediction.

We used the concept of an “unskilled forecaster” as a benchmark against which to measure predictive performance. For triplet questions, an unskilled forecaster is one who assigns a uniform probability mass to all values between the lowest and the highest predictions any expert proposed for a particular question. An unskilled percentile question assigned to its lower percentile the minimum of all lower percentiles, and to its upper percentile the maximum of all largest percentiles. The unskilled median was the median of all median percentiles. For probabilistic forecasts, an unskilled forecaster assigns 1/C, where C is the number of categories, to each category available as an answer.

From expert predictions to probabilistic forecasts

Experts made four types of probabilistic predictions: binary, categorical, percentile, and triplet. For binary probabilistic questions, we can define the expert’s prediction of the presence of an event as p. Then an expert’s predictive distribution is Bernoulli(p). A similar approach can be taken for categorical probabilistic questions, and we define an expert’s predictive distribution over C different choices as a Multinomial(N=1, p₁, p₂,⋯, p_C) distribution. Percentile questions ask experts to provide three percentiles, a low (5th or 10th) percentile, 50th percentile (median), and high (90th or 95th) percentile. The lowest (5th or 10th), middle (50th) and highest (90th or 95th) percentiles are called p_low, p_middle, p_high, and the corresponding percentile values are called q_low, q_middle, q_high. Specifying three percentiles creates four intervals with probabilities corresponding to the percentiles, for example asking for a 5th, 50th, and 95th percentile creates four intervals with the following probabilities: 0.05,0.45,0.45, and 0.05 probability, and the probability prescribed to an integer value x was

\begin{matrix} PCT (x | p_{low}, p_{middle}, p_{high}, q_{low}, q_{middle}, q_{high}) = {\begin{matrix} \frac{p_{low}}{q_{low}} & 0 \leq x < q_{low} \\ \frac{p_{middle}}{q_{middle} - q_{low}} & q_{low} \leq x < q_{middle} \\ \frac{p_{high}}{q_{high} - q_{middle}} & q_{middle} \leq x < q_{high} \\ \frac{p_{high}}{q_{high}} & q_{high} \leq x < 2 q_{high} \\ 0 & otherwise \end{matrix} \end{matrix}

(1)

Triangular probability densities [54] (See S2 Fig for an example) were generated from the smallest (s), most likely (m), and largest (l) answer provided by experts as follows

\begin{matrix} TPD (x | s, m, l) = {\begin{matrix} 0 & x < s \\ \frac{2 (x - s)}{(l - s) (m - s)} & s \leq x < m \\ \frac{2}{l - s} & x = m \\ \frac{2 (l - x)}{(l - s) (l - m)} & m \leq x < l \\ 0 & x > l \end{matrix} \end{matrix}

(2)

The above TPD specifies an expert-specific probability distribution over a continuous target. To specify a distribution over integers values (x₁, x₂, ⋯, x_V), we assigned to the value x_i the integral of the continuous TPD from x_i up to x_{i + 1}. Define the CDF of a TPD distribution as CDF_TPD(x). Then the discretized TPD was defined as

p (x_{i}) = {CDF}_{TPD} (x_{i + 1}) - {CDF}_{TPD} (x)

for i < V on the values (x₁, x₂, ⋯, x_V).

Imputing quantiles from a binned distribution

Given a set of N bins [x₀, x₁], [x₀, x₁], ⋯, [x_N−1, x_N] with corresponding probabilities p₁, p₂, ⋯, p_N over a target that takes nonnegative values, quantiles can be imputed by first generating a function f that is a linear interpolant of (x,y) pairs (x₀, 0), (x₁, P₁), (x₂, P₂), (x₃, P₃), ⋯, (x_N, 1), where i ranges from 0 to N and P_i is the cumulative sum of all probabilities with index i or smaller P_i = p₁ + p₂ + ⋯ + p_i. To compute a quantile q we can find the root of the function F(x) = f(x)−q using the Newton-Raphson algorithm (or any standard root finding procedure).

Relative absolute error

The relative absolute error is a function that takes as input a point prediction (p) and ground truth value (t) and outputs the absolute difference between the point prediction and ground truth value divided by the ground truth value

Rel.Abs.Err (p, t) = | \frac{p}{t} - 1 |

where the vertical bars indicate the absolute value. The domain of this function is defined only for positive values of t (i.e. t > 0).

Results

Overview

Thirteen surveys of experts in the modeling of infectious disease were conducted between February 18, 2020 and May 11, 2020 (see S1 Table for schedule, S3 Table for a list of experts who participated, and S6 Table for a list of all questions) [13]. We solicited the participation of 72 experts (see Methods for definition) via email and a Slack channel dedicated to communicating about COVID-19 data and models. A total of 41 experts contributed predictions, with an average of 18.6 expert participants each week (range: 15–22).

Across all surveys, we asked 73 questions (48 with measurable outcomes) with a median of 6 questions per survey (range: 4–7) focused on the outbreak in the United States. Experts responded to questions on a variety of topics including short- and long-term predictions of COVID-19 cases, hospitalizations, and deaths, and we combined predictions into linear pools. The survey results were released publicly every week and delivered directly to decision makers at state and federal health agencies (see [13] for copies of each summary report).

Here we present results from selected questions. We chose to report on questions that other computational models have tried to predict: the number of deaths due to COVID-19 in the US by the end of 2020, the total number of SARS-COV-2 infections in the US, and the number of confirmed cases one week ahead. Additionally, we include some results from counterfactual questions that asked experts to predict the 7-day rolling average number of cases for a state under the current policies in place and if the state increased restrictions on social contact. Expert linear pool estimates and predictions for all questions are stored in a public GitHub repository [13].

Predictions of US COVID-19 deaths reported in 2020

Across five surveys administered between March 16 and May 4, we asked experts to predict the number of reported COVID-19 deaths in the US by the end of 2020. The linear pool prediction ranged from 150,000 to more than 250,000 (Fig 1A), corresponding to between 4 and 7 times the average number of annual deaths in the US due to seasonal influenza [55]. There was considerable uncertainty around these predictions: the lower bounds of the five prediction intervals ranged from 6,000 (on March 16) to 118,000 (on May 4), and the upper bounds ranged from 517,000 (on April 20) to 1,700,000 (on March 30). The COVID Tracking Project reported that there were 336,802 cumulative deaths due to COVID-19 in the US at the end of Dec. 31 2020. This eventually observed value was included in the 90% prediction interval for each of the five surveys. While experts underestimated the actual number of observed deaths by a substantial margin, they also consistently saw this eventual number of deaths as a not unlikely outcome, assigning a probability of 0.45 on March 16 and 0.46 on March 30 to more than 250K deaths.

Fig 1 — (A.) Expert linear pool predictions of the total number of deaths by the end of 2020 from five surveys asked between March 16 and May 4, 2020. Points show the median estimate. Bars show 90% prediction intervals for the first four surveys and an 80% prediction interval for the fifth survey. The dotted line is the reported total number of deaths by The COVID Tracking Project as of December 31, 2020. (B.) Expert linear pool forecasts, made on Monday and Tuesdays, of the number of cases to be reported by the end of the week (Sunday, date shown on x-axis) from thirteen surveys administered between February 23 and May 17, 2020. The first eight surveys asked experts to provide smallest, most likely, and largest possible values for the number of confirmed cases, and the last five asked experts to assign probabilities to ranges of values for confirmed cases. Light blue points represent the median of the expert linear pool distribution. Dark points represent the eventually observed value reported by The COVID Tracking Project. Prediction intervals at the 90% level are shown in shaded blue bars. The 90% prediction intervals included the true number of cases in all thirteen forecasts.

Predictions of weekly COVID-19 reported cases

At the beginning of thirteen consecutive weeks from February 17 to May 11, experts predicted the number of confirmed cases at the end of the week (Fig 1B). In early surveys, experts tended to underestimate the number of reported cases in the following week. The relative difference between the median linear pool prediction of week-ahead cases and the reported cases was on average -51% for the first four surveys. In later surveys, the accuracy improved and for surveys 5 through 13, the relative difference was on average 2.5%. However, as with the forecasts of total deaths, the expert linear pool provided wide uncertainty: all thirteen expert linear pool 90% confidence intervals covered the reported number of confirmed cases.

Estimates of the fraction of infections reported as cases

Over the course of seven surveys from March 2 to April 27, 2020, experts were asked to estimate the fraction of all infections with SARS-CoV-2 (the virus that causes the COVID-19 illness) in the U.S. that had been confirmed and reported as a case. Because this outcome can never be fully observed, these questions posed a different type of estimation task for the experts.

In these surveys, the median of expert linear pool predictions for the percentage of infections detected was between 6% and 16% (Fig 2B). The median responses were consistent with estimates from contemporaneous computational models, perhaps reflecting the extent to which experts relied on these early model estimates [9–11, 56] (Fig 2A).

However, these estimates from computational models and expert judgment made in early 2020 were substantially smaller than 3 out of 4 retrospective estimates of the total number of SARS-CoV-2 infections that were made in early 2021 [57–60]. Some of these model-based estimates suggested that the detected fraction of cases was possibly as low as 0.1%, although substantial uncertainty was present even in retrospect [57–60].

Counterfactual predictions

Experts were asked to make counterfactual predictions of the 7-day moving average of confirmed cases for three states that had begun to relax social distancing restrictions (“re-open”). The questions asked experts to predict how many confirmed cases each state would see between 3 and 5 weeks’ time for each of two separate scenarios: (i) if the state continued its current phase of reopening and (ii) if the state did not begin to reopen (Fig 3).

Expert predictions showed a clear expectation that state-level policies restricting non-essential travel and business would result in lower COVID-19 transmission in coming weeks. The expert linear pool predictions for the scenarios that were more clearly aligned with the eventual reopening policies were more accurate than for the alternative scenarios.

For the state of Georgia (Fig 3A) the expert linear pool prediction on April 27, 2020 of the 7-day moving average of reported COVID-19 cases on May 16, 2020 was 1,044 (80% PI = [580, 2, 292]) assuming restrictions were loosened and 487 (80% PI = [273, 1, 156]) assuming restrictions were not loosened. On Apr 30, 2020 the state decided to extend shelter in place orders for at risk populations. The relative absolute error of the linear pool prediction was 18% for that scenario versus 75% for the scenario assuming restrictions were loosened. For the state of Texas (Fig 3B), the expert linear pool prediction on May 4, 2020 of the 7-day moving average of reported COVID-19 cases on June 13, 2020 was 825 (80% PI = [231, 1, 608]) assuming restrictions were not loosened and 1,358 (80% PI = [734, 2, 732]) assuming restrictions were loosened. On May 18, 2020 the state decided to reopen retail businesses. The relative absolute error of the linear pool prediction was 23% for that scenario versus 54% for the scenario assuming restrictions were not loosened. For the state of Washington (Fig 3C) the expert linear pool median prediction on May 11, 2020 of the 7-day moving average of reported COVID-19 cases was 332 (80% PI = [158, 644]) if WA did not begin an “accelerated restart” by relaxing restrictions in all counties and 554 (80% PI = [263, 1, 053]) if it did. On May 22, 2020 the state decided to reopen 25/39 counties. The relative absolute error of the linear pool prediction was 5% for a partial reopening versus 76% for the scenario assuming an accelerated restart in all counties.

Individual experts’ predictive performance varied substantially

The median forecast skill of individual expert predictions of the number of deaths by Dec 31, 2020 was above that of an unskilled forecaster for all five surveys where this question was asked (Fig 4A). The median forecast skill of individual expert predictions of new cases in the coming week was lower than an unskilled forecaster on the first five surveys (Fig 4B). Forecasts of cases from individual experts were more accurate in later surveys, and the median accuracy of individual expert predictions was higher than the accuracy of an unskilled forecaster in 7 of the last 8 surveys.

Fig 4 — Evaluation of forecast accuracy for forecasts of cumulative COVID-19 deaths (A) and cases (B). For both types of questions, the methods used to elicit probabilistic forecasts changed and this point is indicated by a vertical dashed line. Predictions are shown from each expert (light dots), the median expert (dark diamond), and the linear pool (dark square) compared to an “unskilled” forecaster (see Methods). Higher relative forecast skill indicates better performance than an unskilled forecaster and a zero relative forecast skill represents identical performance with an unskilled forecaster. (A). Relative forecast skill of the cumulative number of COVID-19 deaths by December 31, 2020 (see Fig 1A). Over 50% of experts made better predictions of year-end COVID-19 deaths than an unskilled forecaster on each of the five occasions this question was asked. Experts’ median relative forecast skill was higher than the linear pool forecast skill for the latest prediction of year-end deaths when asked to provide percentiles compared to the smallest, most likely, and largest number of deaths. (B.) Relative forecast skill of the number of cases to be reported by the end of the week from thirteen surveys administered between February 23 and May 17, 2020. Individual experts’ accuracy was mixed with some experts performing better than an unskilled forecaster and others scoring worse. In the first five surveys, the median expert made less skilled forecasts than the unskilled forecaster. Experts’ median relative forecast skill was smaller than the linear pool forecast skill when asked to provide a smallest, most likely, and largest number of cases and similar to a linear pool when asked to assign probabilities to a set of intervals where the true number of cases could fall.

Because expert performance was not consistent across surveys, performance-based weighting to build a linear pool did not significantly improve forecast accuracy compared to equal weighting (see supplemental section on aggregation, S4 Fig and S5 Fig, and S4 Table and S5 Table). The degree to which experts relied on available data, model outputs, and intuition varied by expert. The median proportion of each prediction that relied on analytic models versus intuition which was self-reported by experts was 75% and responses ranged from 20% to 100% (Fig 5).

Fig 5 — Experts’ self-assessed percentage of analytic vs. intuitive thinking when making predictions, reporting 0 when an expert uses only their intuition and 100 when they relied solely on models and experience. To make predictions over a wide variety of targets, experts reported a mixture of using models/experience and intuition, with the median expert claiming to rely 75% on experience.

Accuracy of the expert linear pool

Looking across all questions with measurable probabilistic outcomes from February 17 to May 11, the linear pool prediction was the most consistently accurate forecast. When ranked alongside all individual expert predictions, the linear pool was among the top 50% most accurate forecasts 36/44 (82%) times (Fig 6A). The linear pool mean forecast skill percentile of 73 was the highest when compared to the mean forecast skill percentile of experts who completed ten or more surveys (Fig 6B). Over all 13 surveys issued, the linear pool model ranked in the top half of all experts and was the most accurate for five surveys issued from May 9 to April 6 (Fig 6C). The linear pool performance depended on the structure of the questions, which changed over the course of the surveys (e.g., see Fig 4, Methods, and a list of all survey questions in the supplement).

Fig 6 — The forecast skill percentile of linear pool predictions compared to individual expert predictions across all 13 surveys (vertical axis) differentiated by the type of target. A linear pool of expert judgment often scores in the top 50th percentile independent of the type of question (B.) The mean, median, 25th, and 75th percentile for forecast skill percentile for all individual experts who completed 10 or more surveys (blue) and for the linear pool (red). Compared to individual experts, a linear pool has the highest mean and highest 25th percentile forecast skill percentile. (C.) The median forecast skill percentile across surveys for experts (blue) and the linear pool (red). Over time the linear pool median forecast skill percentile is above 0.50 for all but one survey and for five surveys the linear pool generated the most accurate predictions.

Discussion

Linear pool aggregations of expert judgment during early months of the COVID-19 pandemic provided important and early insights about the trajectory of the emerging pandemic. In mid-March, when there were less than 100 COVID-19 deaths in the US, the expert linear pool assigned a probability of 66% to over 100,000 deaths and a probability of 46% to over 250,000 deaths by the end of 2020. In contrast, early forecasts from a computational model used by the federal government in late March predicted 81,000 deaths and an outbreak that would end by early August, 2020 [61].

An expert linear pool is a nimble and flexible model that can answer questions about the public health impact of outbreaks before computational models have enough data and validation to be reliable. In particular, an expert linear pool model has two key advantages over computational models. First, an expert model has relatively low overhead to develop and can be deployed at the onset of an outbreak. The first expert predictions from these surveys were available starting in mid-February, 2020 before any computational models were publicly available. Second, a survey framework allowed expert predictions to be tailored on-the-fly to maximize value for public health decision makers. This is in contrast to computational models which require extensive development to answer a specific set of questions.

However, an expert linear pool model suffers from issues of scalability. Every individual forecast elicited from an expert requires minutes of human time. Because of experts’ limited time, surveys must focus on a short list of impactful questions. Another limitation of an expert model is the bias introduced by human judgment. Though the assumptions built into computational models are explicitly specified, experts’ predictive processes are more opaque. A structured forecasting platform, that allows experts to communicate with one another about the reasoning behind their forecasts and facilitates interactions between subject matter experts and trained forecasters, has been shown to lead to more accurate predictions [17]. Cultivating and training a pool of expert forecasters should be included as part of larger investments in preparing modeling infrastructure for future outbreaks and pandemics.

Experts reported using a combination of analytic thinking and intuition to form their responses when making predictions about targets related to the pandemic. The mix of analytic thinking and intuition could be because of the lack of structured data early during the pandemic or could be because different questions induced different modes of cognition [62–64]. Insight into the type of thinking experts prioritize when making a prediction and how these modes of cognition lead to more or less accurate predictions may allow us to develop elicitation protocols to either emphasize, or discourage specific approaches to forming a prediction. Expert predictions suggest they were able to synthesize diverse and disparate sources of information to make quantitative predictions and estimates about different facets of the pandemic, including short- and long-term predictions of observable data, short-term projections of counterfactual scenarios, and estimates of quantities that will never be fully observed.

Expert predictions of the US COVID-19 outbreak often outperformed an unskilled forecaster but in absolute terms were typically biased towards optimistic responses. Expert predictions of confirmed cases included the true number of cases in their 90% prediction interval for all thirteen predictions, a sign that the uncertainty reflected in the linear pool was perhaps too broad. Experts’ predictions in early surveys were smaller than the true number of confirmed cases, but after receiving weekly feedback on their previous predictions (starting on March 9), accuracy improved.

However, the extent to which strong conclusions about expert prediction accuracy can be drawn from these data is constrained due to several key limitations of the present study. First, the sample size of the study is small, with only 73 questions asked and 48 questions with ground truth available. To fully assess probabilistic accuracy of a system making repeated predictions, larger numbers of questions are needed. Second, the type of answer and the type of target experts were asked to provide varied across the surveys. This was the result of the study organizers adapting to new information and phases of the early outbreak, and trying out different strategies for answering questions. Results appear to suggest (see, e.g., Fig 4) that assessments of individual expert accuracy and the variability in individual scores may depend on the way in which probability distributions were elicited. There are several elicitation protocols that could have been used to extract less biased and more informative predictions from experts [65, 66]. Third, due to the operational challenges that accompanied standing up this project in real-time during the early months of 2020, the pool of experts may not represent a full spectrum of expert opinion from the modeling community.

Despite the limitations outlined above, the present study shows the potential for a more structured and larger-scale effort to use expert judgment to supplement output from computational models. An expert judgment model can act as an important component of rapid response and as a first-step forecast for global catastrophes like an outbreak, especially while domain-specific computational models are still being trained on sparse early data. Experts’ ability to synthesize diverse sources of information gives them a unique, complementary perspective to model-driven forecasts that are not able to assimilate information or data outside of the domain of a specific, prescribed computational framework.

During the evolving global catastrophe of the COVID-19 pandemic, an expert judgment model provided rapid and calibrated forecasts that were responsive to changing public health needs. If and when modeling needs are assessed for future outbreaks, the successes and limitations of this project could be used to design future expert judgment panels.

Supporting information

S1 Fig. Consent form.

The consent form each expert was presented with and had to agree to before taking part in the survey. This document was shown for every survey.

(PDF)

Click here for additional data file.^{(141KB, pdf)}

S2 Fig. Triangular probability distribution.

An example of transforming an expert’s triplet answer (smallest: 10, most likely:30, largest: 50) to a probabilistic distribution.

(PDF)

Click here for additional data file.^{(96.3KB, pdf)}

S3 Fig. Triangular probability distribution ensemble.

An example of 10 expert answers to a triplet question, their corresponding triangular probability distributions (TPDs), and an equally-weighted linear pool distribution (black) built from those TPDs.

(PDF)

Click here for additional data file.^{(69.8KB, pdf)}

S4 Fig. An analysis of individual expert forecast skill.

(PDF)

Click here for additional data file.^{(94.4KB, pdf)}

S5 Fig. Ensemble accuracy and ensemble weights assigned to experts.

(PDF)

Click here for additional data file.^{(74.1KB, pdf)}

S1 Table. Survey meta information.

A listing of survey numbers, the date they were issued, information on expert participation, and the database(s) used to collect ground truth.

(PDF)

Click here for additional data file.^{(54.8KB, pdf)}

S2 Table. Ensemble covariates.

Model and the covariates used to define the design matrix X to weight experts.

(PDF)

Click here for additional data file.^{(51.6KB, pdf)}

S3 Table. List of experts who participated in at least two surveys.

(PDF)

Click here for additional data file.^{(54KB, pdf)}

S4 Table. Linear regression that compares the weights assigned to each expert using the expert-specific performance weighting and assigning experts equal weights.

(PDF)

Click here for additional data file.^{(76.6KB, pdf)}

S5 Table. Linear regression that compares the weights assigned to each expert using the expert-specific plus relative entropy performance weighting and assigning experts equal weights.

(PDF)

Click here for additional data file.^{(76.8KB, pdf)}

S6 Table. Date each survey was conducted, the questions asked, and the format experts answered.

(PDF)

Click here for additional data file.^{(1.1MB, pdf)}

S1 Appendix. Methodology to aggregate expert probabilistic predictions.

(PDF)

Click here for additional data file.^{(112.5KB, pdf)}

Acknowledgments

We wish to thank all the experts who have participated, for offering their time and expertise to help us better understand the COVID-19 outbreak. We also thank Evan L. Ray for comments that improved this work.

Data Availability

A publicly available repository with details about questions asked and data on all responses is available under a MIT license at https://github.com/tomcm39/COVID19_expert_survey.

Funding Statement

This work has been supported by the National Institutes of General Medical Sciences (NIGMS, grant number R35GM119582) [NGR] and the Centers for Disease Control and Prevention (CDC, grant number 1U01IP001122) [NGR]. The content is solely the responsibility of the authors and does not necessarily represent the official views of CDC, NIGMS, or the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1.Coronavirus Disease 2019 (COVID-19);. https://www.cdc.gov/coronavirus/2019-ncov/index.html.
2. Wu Z, McGoogan JM. Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention. Jama. 2020;323(13):1239–1242. doi: 10.1001/jama.2020.2648 [DOI] [PubMed] [Google Scholar]
3. Grasselli G, Pesenti A, Cecconi M. Critical care utilization for the COVID-19 outbreak in Lombardy, Italy: early experience and forecast during an emergency response. Jama. 2020;323(16):1545–1546. doi: 10.1001/jama.2020.4031 [DOI] [PubMed] [Google Scholar]
4.Novel Coronavirus (2019-nCoV) situation reports;. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports.
5. Farrow DC, Brooks LC, Hyun S, Tibshirani RJ, Burke DS, Rosenfeld R. A human judgment approach to epidemiological forecasting. PLoS computational biology. 2017;13(3):e1005248. doi: 10.1371/journal.pcbi.1005248 [DOI] [PMC free article] [PubMed] [Google Scholar]
6. McGowan CJ, Biggerstaff M, Johansson M, Apfeldorf KM, Ben-Nun M, Brooks L, et al. Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Scientific reports. 2019;9(1):1–13. doi: 10.1038/s41598-018-36361-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Del Valle SY, McMahon BH, Asher J, Hatchett R, Lega JC, Brown HE, et al. Summary results of the 2014-2015 DARPA Chikungunya challenge. BMC infectious diseases. 2018;18(1):1–14. doi: 10.1186/s12879-018-3124-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Petropoulos F, Makridakis S. Forecasting the novel coronavirus COVID-19. PloS one. 2020;15(3):e0231236. doi: 10.1371/journal.pone.0231236 [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Lu FS, Nguyen AT, Link NB, Lipsitch M, Santillana M. Estimating the early outbreak cumulative incidence of COVID-19 in the United States: three complementary approaches. medRxiv. 2020;. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Lover AA, McAndrew T. Sentinel event surveillance to estimate total SARS-CoV-2 infections, United States. medRxiv. 2020;. [Google Scholar]
11. Perkins TA, Cavany SM, Moore SM, Oidtman RJ, Lerch A, Poterek M. Estimating unobserved SARS-CoV-2 infections in the United States. Proceedings of the National Academy of Sciences. 2020;117(36):22597–22602. doi: 10.1073/pnas.2005476117 [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Jewell NP, Lewnard JA, Jewell BL. Caution warranted: using the institute for health metrics and evaluation model for predicting the course of the COVID-19 pandemic; 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.A repository of the data and code used for aggregating expert predictions of COVID19;. https://github.com/tomcm39/COVID19_expert_survey.
14. McAndrew T, Wattanachit N, Gibson GC, Reich NG. Aggregating predictions from experts: A review of statistical methods, experiments, and applications. Wiley Interdisciplinary Reviews: Computational Statistics. 2021;13(2):e1514. doi: 10.1002/wics.1514 [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Mellers B, Ungar L, Baron J, Ramos J, Gurcay B, Fincher K, et al. Psychological strategies for winning a geopolitical forecasting tournament. Psychological science. 2014;25(5):1106–1115. doi: 10.1177/0956797614524255 [DOI] [PubMed] [Google Scholar]
16. Warby SC, Wendt SL, Welinder P, Munk EG, Carrillo O, Sorensen HB, et al. Sleep-spindle detection: crowdsourcing and evaluating performance of experts, non-experts and automated methods. Nature methods. 2014;11(4):385–392. doi: 10.1038/nmeth.2855 [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Ungar L, Mellers B, Satopää V, Tetlock P, Baron J. The good judgment project: A large scale test of different methods of combining expert predictions. In: 2012 AAAI Fall Symposium Series; 2012. [Google Scholar]
18. Cheng TH, Wei CP, Tseng VS. Feature selection for medical data mining: Comparisons of expert judgment and automatic approaches. In: 19th IEEE symposium on computer-based medical systems (CBMS’06). IEEE; 2006. p. 165–170. [Google Scholar]
19. Bouwman MJ. Expert vs novice decision making in accounting: A summary. Accounting, Organizations and Society. 1984;9(3-4):325–327. doi: 10.1016/0361-3682(84)90016-3 [DOI] [Google Scholar]
20. Murphy AH, Daan H. Impacts of feedback and experience on the quality of subjective probability forecasts. comparison of results from the first and second years of the zierikzee experiment. Monthly Weather Review. 1984;112(3):413–423. doi: [DOI] [Google Scholar]
21. Murphy AH, Winkler RL. Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1977;26(1):41–47. [Google Scholar]
22. Nahar J, Imam T, Tickle KS, Chen YPP. Computational intelligence for heart disease diagnosis: A medical knowledge driven approach. Expert Systems with Applications. 2013;40(1):96–104. doi: 10.1016/j.eswa.2012.07.032 [DOI] [Google Scholar]
23. Tversky A, Kahneman D. Judgment under uncertainty: Heuristics and biases. science. 1974;185(4157):1124–1131. doi: 10.1126/science.185.4157.1124 [DOI] [PubMed] [Google Scholar]
24. Meehl PE. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. University of Minnesota Press; 1954. [Google Scholar]
25. Slovic P, Fischhoff B, Lichtenstein S. Regulation of risk: a psychological perspective, tn Regulatory Poliql and the Social Sciences, Nol] R.(ed.), pp. 24I–27B, Berkley; 1985. [Google Scholar]
26. Brunswik E. The conceptual framework of psychology.(Int. Encycl. unified Sci., v. 1, no. 10.). Univ. Chicago Press; 1952. [Google Scholar]
27. Brunswik E. Perception and the representative design of psychological experiments. Univ of California Press; 1956. [Google Scholar]
28. Hammond KR, Stewart TR, Brehmer B, Steinmann DO. Social judgment theory. Cambridge University Press; 1986. [Google Scholar]
29. Doherty ME, Kurz EM. Social judgement theory. Thinking & Reasoning. 1996;2(2-3):109–140. doi: 10.1080/135467896394474 [DOI] [Google Scholar]
30. Rogers SD, Kadar EE, Costall A. Gaze patterns in the visual control of straight-road driving and braking as a function of speed and expertise. Ecological Psychology. 2005;17(1):19–38. doi: 10.1207/s15326969eco1701_2 [DOI] [Google Scholar]
31. Araujo D, Davids K, Passos P. Ecological validity, representative design, and correspondence between experimental task constraints and behavioral setting: Comment on. Ecological Psychology. 2007;19(1):69–78. doi: 10.1080/10407410709336951 [DOI] [Google Scholar]
32. Plessner H, Schweizer G, Brand R, O’Hare D. A multiple-cue learning approach as the basis for understanding and improving soccer referees’ decision making. Progress in brain research. 2009;174:151–158. doi: 10.1016/S0079-6123(09)01313-2 [DOI] [PubMed] [Google Scholar]
33. Wiggins MW. Cue Utilization as an Objective Metric in Naturalistic Decision-Making. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting. vol. 64. SAGE Publications Sage CA: Los Angeles, CA; 2020. p. 209–213. [Google Scholar]
34. Brehmer B. Inference behavior in a situation where the cues are not reliably perceived. Organizational Behavior and Human Performance. 1970;5(4):330–347. doi: 10.1016/0030-5073(70)90024-3 [DOI] [Google Scholar]
35. Brehmer B. Effects of task predictability and cue validity on interpersonal learning of inference tasks involving both linear and nonlinear relations. Organizational behavior and human performance. 1973;10(1):24–46. doi: 10.1016/0030-5073(73)90003-2 [DOI] [Google Scholar]
36. Brehmer B. Social judgment theory and the analysis of interpersonal conflict. Psychological bulletin. 1976;83(6):985. doi: 10.1037/0033-2909.83.6.985 [DOI] [Google Scholar]
37. Karelaia N, Hogarth RM. Determinants of linear judgment: A meta-analysis of lens model studies. Psychological bulletin. 2008;134(3):404. doi: 10.1037/0033-2909.134.3.404 [DOI] [PubMed] [Google Scholar]
38. Tversky A, Kahneman D. The framing of decisions and the psychology of choice. In: Behavioral decision making. Springer; 1985. p. 25–41. [Google Scholar]
39. Meehl PE, Rosen A. Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological bulletin. 1955;52(3):194. doi: 10.1037/h0048070 [DOI] [PubMed] [Google Scholar]
40. Harvey N. Confidence in judgment. Trends in cognitive sciences. 1997;1(2):78–82. doi: 10.1016/S1364-6613(97)01014-0 [DOI] [PubMed] [Google Scholar]
41. Dror IE, Kukucka J, Kassin SM, Zapf PA. When expert decision making goes wrong: Consensus, bias, the role of experts, and accuracy. 2018;. [Google Scholar]
42. Clemen RT, Winkler RL. Combining probability distributions from experts in risk analysis. Risk analysis. 1999;19(2):187–203. [DOI] [PubMed] [Google Scholar]
43. Budescu DV, Rantilla AK. Confidence in aggregation of expert opinions. Acta psychologica. 2000;104(3):371–398. doi: 10.1016/S0001-6918(00)00037-8 [DOI] [PubMed] [Google Scholar]
44. Cooke RM, Goossens LL. TU Delft expert judgment data base. Reliability Engineering & System Safety. 2008;93(5):657–674. doi: 10.1016/j.ress.2007.03.005 [DOI] [Google Scholar]
45. Shea K, Runge MC, Pannell D, Probert WJ, Li SL, Tildesley M, et al. Harnessing multiple models for outbreak management. Science. 2020;368(6491):577–579. doi: 10.1126/science.abb9934 [DOI] [PubMed] [Google Scholar]
46. Chodera J, Lee AA, London N, von Delft F. Crowdsourcing drug discovery for pandemics. Nature Chemistry. 2020;12(7):581–581. doi: 10.1038/s41557-020-0496-2 [DOI] [PubMed] [Google Scholar]
47. Sun K, Chen J, Viboud C. Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. The Lancet Digital Health. 2020;2(4):e201–e208. doi: 10.1016/S2589-7500(20)30026-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
48. Recchia G, Freeman AL, Spiegelhalter D. How well did experts and laypeople forecast the size of the COVID-19 pandemic? PloS one. 2021;16(5):e0250935. doi: 10.1371/journal.pone.0250935 [DOI] [PMC free article] [PubMed] [Google Scholar]
49. Genest C, McConway KJ. Allocating the weights in the linear opinion pool. Journal of Forecasting. 1990;9(1):53–73. doi: 10.1002/for.3980090106 [DOI] [Google Scholar]
50. Shuford EH, Albert A, Massengill HE. Admissible probability measurement procedures. Psychometrika. 1966;31(2):125–145. doi: 10.1007/BF02289503 [DOI] [PubMed] [Google Scholar]
51. McCarthy J. Measures of the value of information. Proceedings of the National Academy of Sciences of the United States of America. 1956;42(9):654. doi: 10.1073/pnas.42.9.654 [DOI] [PMC free article] [PubMed] [Google Scholar]
52. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association. 2007;102(477):359–378. doi: 10.1198/016214506000001437 [DOI] [Google Scholar]
53. Reich NG, Brooks LC, Fox SJ, Kandula S, McGowan CJ, Moore E, et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proceedings of the National Academy of Sciences. 2019;116(8):3146–3154. doi: 10.1073/pnas.1812594116 [DOI] [PMC free article] [PubMed] [Google Scholar]
54. Ayyangar A. The triangular distribution. Mathematics Student. 1941;9:85–87. [Google Scholar]
55.Novel Coronavirus (2019-nCoV) situation reports;. https://www.cdc.gov/flu/about/burden/index.html.
56.I could easily be off 2-fold in either direction, but my best guess is that we’re currently in the 10,000 to 40,000 range nationally. 11/13;. Twitter.
57.COVID-19 projections using machine learning;. https://covid19-projections.com.
58. Chitwood MH, Russi M, Gunasekera K, Havumaki J, Pitzer VE, Warren JL, et al. Bayesian nowcasting with adjustment for delayed and incomplete reporting to estimate COVID-19 infections in the United States. medRxiv. 2020;. [Google Scholar]
59. Lemaitre JC, Grantz KH, Kaminsky J, Meredith HR, Truelove SA, Lauer SA, et al. A scenario modeling pipeline for COVID-19 emergency planning. Scientific reports. 2021;11(1):1–13. doi: 10.1038/s41598-021-86811-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
60.CEID Covid-19 tracker;. https://www.covid19.uga.edu/.
61. COVID I, Murray CJ, et al. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months. MedRxiv. 2020;. [Google Scholar]
62. Hammond KR. Human judgment and social policy: Irreducible uncertainty, inevitable error, unavoidable injustice. Oxford University Press on Demand; 1996. [Google Scholar]
63. Hammond KR, et al. Judgments under stress. Oxford University Press on Demand; 2000. [Google Scholar]
64. Hammond KR. Coherence and correspondence theories in judgment and decision making. Cambridge University Press; 2000. [Google Scholar]
65. Mazzuco S, Keilman N. Developments in demographic forecasting. Springer Nature; 2020. [Google Scholar]
66. Dion P, Galbraith N, Sirag E. Using expert elicitation to build long-term projection assumptions. In: Developments in Demographic Forecasting. Springer, Cham; 2020. p. 43–62. doi: 10.1007/978-3-030-42472-5_3 [DOI] [Google Scholar]

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010485.r001

Decision Letter 0

Alison L Hill, Tom Britton

10 Jan 2022

Dear Dr. McAndrew,

Thank you very much for submitting your manuscript "An expert judgment model to predict early stages of the COVID-19 pandemic in the United States" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

In particular, we hope that the authors can consider Reviewer 1's questions about how the experts interpreted the questions and the implications this has for the results, as well as Reviewer 2's concerns about the quality of the literature review, the description of the methods in the supplement, and other requests for minor clarifications of figures and text.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alison L. Hill

Associate Editor

PLOS Computational Biology

Tom Britton

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This is brilliant work and was a real pleasure to read. Thorough efforts evaluating how aggregated expert forecasts perform in real time, and how they can be aggregated in ways that facilitates more accurate predictions, is crucially important work, and no one has done it for COVID-19 more thoroughly than the authors. The extensive longitudinal nature of this work, and the authors' willingness to update their protocol as they continued when weaknesses were identified, makes it especially valuable. Excellent points made in discussion as well.

A few comments:

1) How experts interpreted the phrase "smallest, most likely, and largest possible number of cases" is hard to know: did they literally interpret that as the smallest/largest possible number of cases, such that they would be infinitely flabbergasted if the true number were below/above these extrema? or (probably more likely) did they actually tend to respond with the smallest they thought plausible, or even the smallest they thought likely? In the way that the triangular probability densities are constructed, it is assumed that the expert meant that there is literally 0 probability of the true value falling below their "smallest" or above their "largest" estimate. This is fine, but of course if the experts weren't being quite so literal in their interpretation of "smallest/largest possible", it will result in inadequate allocation of probability mass to outcomes beyond these bounds. It seems possible that the early underestimates might not have seemed quite as extreme if an elicitation protocol were used that took this into account. This should perhaps be briefly discussed. (But I don't want to let the experts off the hook too much for those early predictions, as it's clear that they were massive underestimates!)

2) Was experts’ self-assessed percentage of analytic vs. intuitive thinking predictive of performance? If so, mentioning this fact might help inform guidelines for what kinds of thinking experts might consider prioritizing when asked to make probabilistic estimates.

3) It is mentioned that experts spent their careers either working with models of infectious disease dynamics (I'll call these people 'modelers') "and/or the associated policy implications" ('policy wonks'). Was status as modelers vs. policy wonks predictive of performance, or is there too little data to say?

4) Figure 6 would benefit from an increased resolution if possible.

5) Despite my critique of asking about the "smallest, most likely, and largest possible number of cases" above, asking participants to consider the extremes of the distribution could slot into an elicitation protocol in other ways. Given that you found that experts gave too-narrow confidence intervals on their early esimates, in future studies you might consider a method similar to that described in Ch. 3 of "Developments in Demographic Forecasting" (eds. Mazzuco & Keilman), "Using Expert Elicitation to Build Long-Term Projection Assumptions" (Dion, Galbraith, and Sirag):

"(a) Experts are first asked to provide the lower and higher bounds of a range covering nearly all plausible values... Beginning with the contemplation of the extremes of the distribution is an intentional practice used to minimize potential overconfidence (Speirs-Bridge et al. 2010; Sperber et al. 2013; Oakley and O’Hagan 2014; Grigore et al. 2017; Hanea et al. 2018). Indeed, asking experts to first provide a single central estimate such as a mean or a median tends to trigger anchoring to that value in subsequent responses.

(b) Experts are asked to report how confident they are that the true value will fall

within the range they just specified in step 2(a). Allowing experts to determine

their own level of confidence has been found to reduce overconfidence in

comparison with asking them to identify the low and high bounds of an interval

to some predetermined confidence level (Speirs-Bridge et al. 2010).

provided in step 2(a), so that they expect an equal (50-50) chance that the true

value lies above or below the median.

(d) The range of values between the lower bound and the median is split in two

segments of equal length and the same is done for values between the median

and the upper bound. The respondent is then asked to assign to each segment

the probability that the true value falls within each of these segments. Note that

each half below and above the median has by definition 50% probability of

occurrence, so it is a matter of redistributing that 50% to each segment.

Throughout... several “checks”, in the form of pop-up warning signs, were

built in... in order to prevent illogical inputs in various forms."

This is not something that needs to be changed, just a thought to consider for future research.

Congratulations to the authors once again on an excellent piece of research.

Reviewer #2: review is uploaded as attachment

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No: missing regression data

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Gabriel Recchia

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Attachment

Submitted filename: Rerview PCOMPBIOL.pdf

Click here for additional data file.^{(366.5KB, pdf)}

PLoS Comput Biol. 2022 Sep 23;18(9):e1010485. doi: 10.1371/journal.pcbi.1010485.r002

Author response to Decision Letter 0

5 Feb 2022

Attachment

Submitted filename: Cover_and_reviewer_responses__PCOMPBIOL-D-21-01928.pdf

Click here for additional data file.^{(249.3KB, pdf)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010485.r003

Decision Letter 1

Alison L Hill, Tom Britton

29 Mar 2022

Dear Dr. McAndrew,

While Reviewer 1's comments were adequately addressed, Reviewer 2 has brought up numerous oversights and inconsistencies in the methodological descriptions in the SI that were not adequately addressed by the previous revisions or were newly introduced. While Reviewer 2 felt these concerns were serious enough to recommend rejection of the paper, the Editors believe the paper will be publishable after the authors address these remaining issues.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Sincerely,

Alison L. Hill

Associate Editor

PLOS Computational Biology

Tom Britton

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Thank you! I'm satisfied with the way all of my comments were addressed. Congrats again on a nice piece of work.

Reviewer #2: Most of the cosmetic issues in the main text have been satisfactorily addressed. The SI is still unacceptable. The more I try to understand, the more confusing it becomes. “Consensus distribution” has been replaced by “linear pool” in the main text but not in the SI. There are 40 experts in tables S4 and S5, there are 41 experts on SI p. 7 and in the caption to Fig 5, there are 36 on github. Fig 5 caption says there are 40 “measurable questions “ in the right graph, where I count 44.

p.7 SI says “Experts who answered a survey for the first time or had no training data were assigned values equal to an unskilled forecaster , for example a relative entropy of one (i.e. the score of an unskilled forecaster). In later surveys, experts were not required to answer all questions, and in these cases, we assigned them the same value as an unskilled forecaster. These assignments enabled all responses to have observed data with which to calculate weights in a given week.” So some of the experts’ assessments are just invented? How many?”

SI tables 4 and 5 give 1207 as “Nr Observations”; presumably that means 1207 assessments of some question by some expert. Table 1 lists the number of experts answering each question. Summing the products (#experts * #questions) gives 1370.

The tables S4 and S5 are hard to interpret without some explanation of what the intercept and the beta’s represent. I don’t understand why only 4 of the expert betas are positive in each table. Half of the CI’s in S5 include zero (in S4, 13), meaning that the betas for “expert specific minus equal weights” are not significantly different from zero? Interpretation?

SI p.7: “Regression approaches in the past have had success making in sample and out of sample predictions on a diverse set of datasets [3, 4]” Neither of these references have anything to do with regression. Perhaps the authors confuse the linear pool with linear regression? In any case the differences here between performance weights and equal weights are extremely small. Since this is strongly at variance with a wealth of literature (google “performance weights expert judgment”) including recent pubs in PNAS, PLOS ONE (WHO) and Emerging Infection Diseases (CDC), the reader deserves some explanation. The factors mentioned on p 7 haven’t plagued other approaches. Perhaps the problem lies with the scoring variable. Indeed, rewarding honesty is not the same as rewarding goodness, as my little counter example pointed out. You don’t address counter examples by citing other people making the same mistakes.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: No: the data they provide is inconsistent

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Gabriel Recchia

Reviewer #2: No

Figure Files:

Data Requirements:

Reproducibility:

References:

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. 2022 Sep 23;18(9):e1010485. doi: 10.1371/journal.pcbi.1010485.r004

Author response to Decision Letter 1

5 Apr 2022

Attachment

Submitted filename: Response2Review_expertpaper_round2__PCOMPBIOL-D-21-01928.pdf

Click here for additional data file.^{(144.6KB, pdf)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010485.r005

Decision Letter 2

Tom Britton

28 Jul 2022

Dear Dr. McAndrew,

Thank you very much for submitting your manuscript "An expert judgment model to predict early stages of the COVID-19 pandemic in the United States" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account these comments.

We noticed that your Table S4 values have been changed in the most recent version of the manuscript. In your response in your revised paper please explain the reason for this. Also, the reviewer commented on some confusion with table and figure numbers; please check that these are correct.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Tom Britton

Deputy Editor

PLOS Computational Biology

Feilim Mac Gabhann

Editor-in-Chief

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I was asked to comment on the fact that "the authors have made unexplained changes to their tables which are pertinent to the critiques raised by the reviewer in the previous round ((in particular “SI_Redline.pdf”, Table 4, Table 5)". My comment is that the changes made seem to have been in response to the reviewer's comment that "The tables S4 and S5 are hard to interpret without some explanation of what the intercept and the beta’s represent", to which the authors responded that they had removed the intercept from tables S4 and S5. However on closer inspection it looks like the values on the initial lines (which had previously been described as "Intercept" but are now described as "Expert 0") are unchanged, while all other values have changed. I am guessing that the authors' code had previously erroneously used Expert 0 as the reference level, that they realized the error and updated the numbers with the correct analysis, but I might be wrong about this. I am happy to trust the authors' most recent analysis, although I would urge them to double-check that their latest numbers are indeed correct.

Something that may have added to the confusion is that there seem now to be two different Table S4s: one in Redline_Supporting_info.pdf ("Linear regression that compares the weights assigned to each expert using the expert-specific performance weighting and assigning experts equal weights"), and the "Table S4.pdf" uploaded separately - the latter seems to be the one referred to in the manuscript on p. 3 and p. 7. So I would also urge the authors to check over all their table, figure, and supplementary table/figure numbers. I don't think another round of review is required if changes of this sort are needed as this is the sort of thing that the authors can fix in proofing.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Figure Files:

Data Requirements:

Reproducibility:

PLoS Comput Biol. 2022 Sep 23;18(9):e1010485. doi: 10.1371/journal.pcbi.1010485.r006

Author response to Decision Letter 2

28 Jul 2022

Attachment

Submitted filename: ReplyToReview_expertpaper_round3__PCOMPBIOL-D-21-01928.pdf

Click here for additional data file.^{(91.1KB, pdf)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010485.r007

Decision Letter 3

Tom Britton

11 Aug 2022

Dear Dr. McAndrew,

We are pleased to inform you that your manuscript 'An expert judgment model to predict early stages of the COVID-19 pandemic in the United States' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Tom Britton

Deputy Editor

PLOS Computational Biology

Tom Britton

Deputy Editor

PLOS Computational Biology

***********************************************************

I am satisfied with the small modifications of the diagrams and support accepting the paper.

Kind regards, Tom Britton

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010485.r008

Acceptance letter

Tom Britton

15 Sep 2022

PCOMPBIOL-D-21-01928R3

An expert judgment model to predict early stages of the COVID-19 pandemic in the United States

Dear Dr McAndrew,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Fig. Consent form.

The consent form each expert was presented with and had to agree to before taking part in the survey. This document was shown for every survey.

(PDF)

Click here for additional data file.^{(141KB, pdf)}

S2 Fig. Triangular probability distribution.

An example of transforming an expert’s triplet answer (smallest: 10, most likely:30, largest: 50) to a probabilistic distribution.

(PDF)

Click here for additional data file.^{(96.3KB, pdf)}

S3 Fig. Triangular probability distribution ensemble.

An example of 10 expert answers to a triplet question, their corresponding triangular probability distributions (TPDs), and an equally-weighted linear pool distribution (black) built from those TPDs.

(PDF)

Click here for additional data file.^{(69.8KB, pdf)}

S4 Fig. An analysis of individual expert forecast skill.

(PDF)

Click here for additional data file.^{(94.4KB, pdf)}

S5 Fig. Ensemble accuracy and ensemble weights assigned to experts.

(PDF)

Click here for additional data file.^{(74.1KB, pdf)}

S1 Table. Survey meta information.

A listing of survey numbers, the date they were issued, information on expert participation, and the database(s) used to collect ground truth.

(PDF)

Click here for additional data file.^{(54.8KB, pdf)}

S2 Table. Ensemble covariates.

Model and the covariates used to define the design matrix X to weight experts.

(PDF)

Click here for additional data file.^{(51.6KB, pdf)}

S3 Table. List of experts who participated in at least two surveys.

(PDF)

Click here for additional data file.^{(54KB, pdf)}

S4 Table. Linear regression that compares the weights assigned to each expert using the expert-specific performance weighting and assigning experts equal weights.

(PDF)

Click here for additional data file.^{(76.6KB, pdf)}

S5 Table. Linear regression that compares the weights assigned to each expert using the expert-specific plus relative entropy performance weighting and assigning experts equal weights.

(PDF)

Click here for additional data file.^{(76.8KB, pdf)}

S6 Table. Date each survey was conducted, the questions asked, and the format experts answered.

(PDF)

Click here for additional data file.^{(1.1MB, pdf)}

S1 Appendix. Methodology to aggregate expert probabilistic predictions.

(PDF)

Click here for additional data file.^{(112.5KB, pdf)}

Attachment

Submitted filename: Rerview PCOMPBIOL.pdf

Click here for additional data file.^{(366.5KB, pdf)}

Attachment

Submitted filename: Cover_and_reviewer_responses__PCOMPBIOL-D-21-01928.pdf

Click here for additional data file.^{(249.3KB, pdf)}

Attachment

Submitted filename: Response2Review_expertpaper_round2__PCOMPBIOL-D-21-01928.pdf

Click here for additional data file.^{(144.6KB, pdf)}

Attachment

Submitted filename: ReplyToReview_expertpaper_round3__PCOMPBIOL-D-21-01928.pdf

Click here for additional data file.^{(91.1KB, pdf)}

Data Availability Statement

A publicly available repository with details about questions asked and data on all responses is available under a MIT license at https://github.com/tomcm39/COVID19_expert_survey.

[pcbi.1010485.ref001] 1.Coronavirus Disease 2019 (COVID-19);. https://www.cdc.gov/coronavirus/2019-ncov/index.html.

[pcbi.1010485.ref002] 2. Wu Z, McGoogan JM. Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention. Jama. 2020;323(13):1239–1242. doi: 10.1001/jama.2020.2648 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref003] 3. Grasselli G, Pesenti A, Cecconi M. Critical care utilization for the COVID-19 outbreak in Lombardy, Italy: early experience and forecast during an emergency response. Jama. 2020;323(16):1545–1546. doi: 10.1001/jama.2020.4031 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref004] 4.Novel Coronavirus (2019-nCoV) situation reports;. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports.

[pcbi.1010485.ref005] 5. Farrow DC, Brooks LC, Hyun S, Tibshirani RJ, Burke DS, Rosenfeld R. A human judgment approach to epidemiological forecasting. PLoS computational biology. 2017;13(3):e1005248. doi: 10.1371/journal.pcbi.1005248 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref006] 6. McGowan CJ, Biggerstaff M, Johansson M, Apfeldorf KM, Ben-Nun M, Brooks L, et al. Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Scientific reports. 2019;9(1):1–13. doi: 10.1038/s41598-018-36361-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref007] 7. Del Valle SY, McMahon BH, Asher J, Hatchett R, Lega JC, Brown HE, et al. Summary results of the 2014-2015 DARPA Chikungunya challenge. BMC infectious diseases. 2018;18(1):1–14. doi: 10.1186/s12879-018-3124-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref008] 8. Petropoulos F, Makridakis S. Forecasting the novel coronavirus COVID-19. PloS one. 2020;15(3):e0231236. doi: 10.1371/journal.pone.0231236 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref009] 9. Lu FS, Nguyen AT, Link NB, Lipsitch M, Santillana M. Estimating the early outbreak cumulative incidence of COVID-19 in the United States: three complementary approaches. medRxiv. 2020;. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref010] 10. Lover AA, McAndrew T. Sentinel event surveillance to estimate total SARS-CoV-2 infections, United States. medRxiv. 2020;. [Google Scholar]

[pcbi.1010485.ref011] 11. Perkins TA, Cavany SM, Moore SM, Oidtman RJ, Lerch A, Poterek M. Estimating unobserved SARS-CoV-2 infections in the United States. Proceedings of the National Academy of Sciences. 2020;117(36):22597–22602. doi: 10.1073/pnas.2005476117 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref012] 12. Jewell NP, Lewnard JA, Jewell BL. Caution warranted: using the institute for health metrics and evaluation model for predicting the course of the COVID-19 pandemic; 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref013] 13.A repository of the data and code used for aggregating expert predictions of COVID19;. https://github.com/tomcm39/COVID19_expert_survey.

[pcbi.1010485.ref014] 14. McAndrew T, Wattanachit N, Gibson GC, Reich NG. Aggregating predictions from experts: A review of statistical methods, experiments, and applications. Wiley Interdisciplinary Reviews: Computational Statistics. 2021;13(2):e1514. doi: 10.1002/wics.1514 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref015] 15. Mellers B, Ungar L, Baron J, Ramos J, Gurcay B, Fincher K, et al. Psychological strategies for winning a geopolitical forecasting tournament. Psychological science. 2014;25(5):1106–1115. doi: 10.1177/0956797614524255 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref016] 16. Warby SC, Wendt SL, Welinder P, Munk EG, Carrillo O, Sorensen HB, et al. Sleep-spindle detection: crowdsourcing and evaluating performance of experts, non-experts and automated methods. Nature methods. 2014;11(4):385–392. doi: 10.1038/nmeth.2855 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref017] 17. Ungar L, Mellers B, Satopää V, Tetlock P, Baron J. The good judgment project: A large scale test of different methods of combining expert predictions. In: 2012 AAAI Fall Symposium Series; 2012. [Google Scholar]

[pcbi.1010485.ref018] 18. Cheng TH, Wei CP, Tseng VS. Feature selection for medical data mining: Comparisons of expert judgment and automatic approaches. In: 19th IEEE symposium on computer-based medical systems (CBMS’06). IEEE; 2006. p. 165–170. [Google Scholar]

[pcbi.1010485.ref019] 19. Bouwman MJ. Expert vs novice decision making in accounting: A summary. Accounting, Organizations and Society. 1984;9(3-4):325–327. doi: 10.1016/0361-3682(84)90016-3 [DOI] [Google Scholar]

[pcbi.1010485.ref020] 20. Murphy AH, Daan H. Impacts of feedback and experience on the quality of subjective probability forecasts. comparison of results from the first and second years of the zierikzee experiment. Monthly Weather Review. 1984;112(3):413–423. doi: [DOI] [Google Scholar]

[pcbi.1010485.ref021] 21. Murphy AH, Winkler RL. Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1977;26(1):41–47. [Google Scholar]

[pcbi.1010485.ref022] 22. Nahar J, Imam T, Tickle KS, Chen YPP. Computational intelligence for heart disease diagnosis: A medical knowledge driven approach. Expert Systems with Applications. 2013;40(1):96–104. doi: 10.1016/j.eswa.2012.07.032 [DOI] [Google Scholar]

[pcbi.1010485.ref023] 23. Tversky A, Kahneman D. Judgment under uncertainty: Heuristics and biases. science. 1974;185(4157):1124–1131. doi: 10.1126/science.185.4157.1124 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref024] 24. Meehl PE. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. University of Minnesota Press; 1954. [Google Scholar]

[pcbi.1010485.ref025] 25. Slovic P, Fischhoff B, Lichtenstein S. Regulation of risk: a psychological perspective, tn Regulatory Poliql and the Social Sciences, Nol] R.(ed.), pp. 24I–27B, Berkley; 1985. [Google Scholar]

[pcbi.1010485.ref026] 26. Brunswik E. The conceptual framework of psychology.(Int. Encycl. unified Sci., v. 1, no. 10.). Univ. Chicago Press; 1952. [Google Scholar]

[pcbi.1010485.ref027] 27. Brunswik E. Perception and the representative design of psychological experiments. Univ of California Press; 1956. [Google Scholar]

[pcbi.1010485.ref028] 28. Hammond KR, Stewart TR, Brehmer B, Steinmann DO. Social judgment theory. Cambridge University Press; 1986. [Google Scholar]

[pcbi.1010485.ref029] 29. Doherty ME, Kurz EM. Social judgement theory. Thinking & Reasoning. 1996;2(2-3):109–140. doi: 10.1080/135467896394474 [DOI] [Google Scholar]

[pcbi.1010485.ref030] 30. Rogers SD, Kadar EE, Costall A. Gaze patterns in the visual control of straight-road driving and braking as a function of speed and expertise. Ecological Psychology. 2005;17(1):19–38. doi: 10.1207/s15326969eco1701_2 [DOI] [Google Scholar]

[pcbi.1010485.ref031] 31. Araujo D, Davids K, Passos P. Ecological validity, representative design, and correspondence between experimental task constraints and behavioral setting: Comment on. Ecological Psychology. 2007;19(1):69–78. doi: 10.1080/10407410709336951 [DOI] [Google Scholar]

[pcbi.1010485.ref032] 32. Plessner H, Schweizer G, Brand R, O’Hare D. A multiple-cue learning approach as the basis for understanding and improving soccer referees’ decision making. Progress in brain research. 2009;174:151–158. doi: 10.1016/S0079-6123(09)01313-2 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref033] 33. Wiggins MW. Cue Utilization as an Objective Metric in Naturalistic Decision-Making. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting. vol. 64. SAGE Publications Sage CA: Los Angeles, CA; 2020. p. 209–213. [Google Scholar]

[pcbi.1010485.ref034] 34. Brehmer B. Inference behavior in a situation where the cues are not reliably perceived. Organizational Behavior and Human Performance. 1970;5(4):330–347. doi: 10.1016/0030-5073(70)90024-3 [DOI] [Google Scholar]

[pcbi.1010485.ref035] 35. Brehmer B. Effects of task predictability and cue validity on interpersonal learning of inference tasks involving both linear and nonlinear relations. Organizational behavior and human performance. 1973;10(1):24–46. doi: 10.1016/0030-5073(73)90003-2 [DOI] [Google Scholar]

[pcbi.1010485.ref036] 36. Brehmer B. Social judgment theory and the analysis of interpersonal conflict. Psychological bulletin. 1976;83(6):985. doi: 10.1037/0033-2909.83.6.985 [DOI] [Google Scholar]

[pcbi.1010485.ref037] 37. Karelaia N, Hogarth RM. Determinants of linear judgment: A meta-analysis of lens model studies. Psychological bulletin. 2008;134(3):404. doi: 10.1037/0033-2909.134.3.404 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref038] 38. Tversky A, Kahneman D. The framing of decisions and the psychology of choice. In: Behavioral decision making. Springer; 1985. p. 25–41. [Google Scholar]

[pcbi.1010485.ref039] 39. Meehl PE, Rosen A. Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological bulletin. 1955;52(3):194. doi: 10.1037/h0048070 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref040] 40. Harvey N. Confidence in judgment. Trends in cognitive sciences. 1997;1(2):78–82. doi: 10.1016/S1364-6613(97)01014-0 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref041] 41. Dror IE, Kukucka J, Kassin SM, Zapf PA. When expert decision making goes wrong: Consensus, bias, the role of experts, and accuracy. 2018;. [Google Scholar]

[pcbi.1010485.ref042] 42. Clemen RT, Winkler RL. Combining probability distributions from experts in risk analysis. Risk analysis. 1999;19(2):187–203. [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref043] 43. Budescu DV, Rantilla AK. Confidence in aggregation of expert opinions. Acta psychologica. 2000;104(3):371–398. doi: 10.1016/S0001-6918(00)00037-8 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref044] 44. Cooke RM, Goossens LL. TU Delft expert judgment data base. Reliability Engineering & System Safety. 2008;93(5):657–674. doi: 10.1016/j.ress.2007.03.005 [DOI] [Google Scholar]

[pcbi.1010485.ref045] 45. Shea K, Runge MC, Pannell D, Probert WJ, Li SL, Tildesley M, et al. Harnessing multiple models for outbreak management. Science. 2020;368(6491):577–579. doi: 10.1126/science.abb9934 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref046] 46. Chodera J, Lee AA, London N, von Delft F. Crowdsourcing drug discovery for pandemics. Nature Chemistry. 2020;12(7):581–581. doi: 10.1038/s41557-020-0496-2 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref047] 47. Sun K, Chen J, Viboud C. Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. The Lancet Digital Health. 2020;2(4):e201–e208. doi: 10.1016/S2589-7500(20)30026-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref048] 48. Recchia G, Freeman AL, Spiegelhalter D. How well did experts and laypeople forecast the size of the COVID-19 pandemic? PloS one. 2021;16(5):e0250935. doi: 10.1371/journal.pone.0250935 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref049] 49. Genest C, McConway KJ. Allocating the weights in the linear opinion pool. Journal of Forecasting. 1990;9(1):53–73. doi: 10.1002/for.3980090106 [DOI] [Google Scholar]

[pcbi.1010485.ref050] 50. Shuford EH, Albert A, Massengill HE. Admissible probability measurement procedures. Psychometrika. 1966;31(2):125–145. doi: 10.1007/BF02289503 [DOI] [PubMed] [Google Scholar]

[pcbi.1010485.ref051] 51. McCarthy J. Measures of the value of information. Proceedings of the National Academy of Sciences of the United States of America. 1956;42(9):654. doi: 10.1073/pnas.42.9.654 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref052] 52. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association. 2007;102(477):359–378. doi: 10.1198/016214506000001437 [DOI] [Google Scholar]

[pcbi.1010485.ref053] 53. Reich NG, Brooks LC, Fox SJ, Kandula S, McGowan CJ, Moore E, et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proceedings of the National Academy of Sciences. 2019;116(8):3146–3154. doi: 10.1073/pnas.1812594116 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref054] 54. Ayyangar A. The triangular distribution. Mathematics Student. 1941;9:85–87. [Google Scholar]

[pcbi.1010485.ref055] 55.Novel Coronavirus (2019-nCoV) situation reports;. https://www.cdc.gov/flu/about/burden/index.html.

[pcbi.1010485.ref056] 56.I could easily be off 2-fold in either direction, but my best guess is that we’re currently in the 10,000 to 40,000 range nationally. 11/13;. Twitter.

[pcbi.1010485.ref057] 57.COVID-19 projections using machine learning;. https://covid19-projections.com.

[pcbi.1010485.ref058] 58. Chitwood MH, Russi M, Gunasekera K, Havumaki J, Pitzer VE, Warren JL, et al. Bayesian nowcasting with adjustment for delayed and incomplete reporting to estimate COVID-19 infections in the United States. medRxiv. 2020;. [Google Scholar]

[pcbi.1010485.ref059] 59. Lemaitre JC, Grantz KH, Kaminsky J, Meredith HR, Truelove SA, Lauer SA, et al. A scenario modeling pipeline for COVID-19 emergency planning. Scientific reports. 2021;11(1):1–13. doi: 10.1038/s41598-021-86811-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010485.ref060] 60.CEID Covid-19 tracker;. https://www.covid19.uga.edu/.

[pcbi.1010485.ref061] 61. COVID I, Murray CJ, et al. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months. MedRxiv. 2020;. [Google Scholar]

[pcbi.1010485.ref062] 62. Hammond KR. Human judgment and social policy: Irreducible uncertainty, inevitable error, unavoidable injustice. Oxford University Press on Demand; 1996. [Google Scholar]

[pcbi.1010485.ref063] 63. Hammond KR, et al. Judgments under stress. Oxford University Press on Demand; 2000. [Google Scholar]

[pcbi.1010485.ref064] 64. Hammond KR. Coherence and correspondence theories in judgment and decision making. Cambridge University Press; 2000. [Google Scholar]

[pcbi.1010485.ref065] 65. Mazzuco S, Keilman N. Developments in demographic forecasting. Springer Nature; 2020. [Google Scholar]

[pcbi.1010485.ref066] 66. Dion P, Galbraith N, Sirag E. Using expert elicitation to build long-term projection assumptions. In: Developments in Demographic Forecasting. Springer, Cham; 2020. p. 43–62. doi: 10.1007/978-3-030-42472-5_3 [DOI] [Google Scholar]

PERMALINK

An expert judgment model to predict early stages of the COVID-19 pandemic in the United States

Thomas McAndrew

Nicholas G Reich

Roles

Abstract

Author summary

Introduction

Materials and methods

Ethics statement

Recruitment of experts

Survey methods

Survey of cases

Survey of deaths

Surveys of latent viral transmission

Surveys of counterfactual predictions

Data repository

Statistical Methods

Rounding

Linear pooling

Scoring predictions

From expert predictions to probabilistic forecasts

Imputing quantiles from a binned distribution

Relative absolute error

Results

Overview

Predictions of US COVID-19 deaths reported in 2020

Fig 1. Expert predictions of confirmed COVID-19 cases and deaths.

Predictions of weekly COVID-19 reported cases

Estimates of the fraction of infections reported as cases

Fig 2. Expert predictions of total number of SARS-CoV-2 infections.

Counterfactual predictions

Fig 3. Counterfactual predictions of reported COVID-19 cases.

Individual experts’ predictive performance varied substantially

Fig 4. Forecast accuracy for expert predictions of cases and deaths.

Fig 5. Expert analytic versus intuitive thinking.

Accuracy of the expert linear pool

Fig 6. Expert linear pool forecast skill.

Discussion

Supporting information

Acknowledgments

Data Availability

Funding Statement

References

Decision Letter 0

Alison L Hill

Tom Britton

Roles

Author response to Decision Letter 0

Decision Letter 1

Alison L Hill

Tom Britton

Roles

Author response to Decision Letter 1

Decision Letter 2

Tom Britton

Roles

Author response to Decision Letter 2

Decision Letter 3

Tom Britton

Roles

Acceptance letter

Tom Britton

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases