Int J Forecast. 2020 Aug 25;38(2):423–438. doi: 10.1016/j.ijforecast.2020.08.004

Table 3.

Potential reasons for the failure of COVID-19 forecasting along with examples and extent of potential amendments.

Reason: Poor data input on key features of the pandemic that go into theory-based forecasting (e.g. SIR models)
Example: Early data providing estimates for case fatality rate, infection fatality rate, basic reproductive number, and other key numbers that are essential in modeling were inflated.
How to fix: May be unavoidable early in the course of the pandemic when limited data are available; should be possible to correct when additional evidence accrues about true spread of the infection, proportion of asymptomatic and non-detected cases, and risk-stratification. Investment should be made in the collection, cleaning, and curation of data.
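
As an illustration of how much these inputs matter, the following minimal deterministic SIR sketch (daily time step; all parameter values are hypothetical and not taken from any published model) shows how the projected death toll moves with the assumed infection fatality rate and basic reproductive number.

```python
# Minimal deterministic SIR sketch with a daily time step. All parameter values
# below (R0, IFR, infectious period, population) are hypothetical illustrations,
# not estimates from any published COVID-19 model.

def projected_deaths(r0, ifr, population=1_000_000, infectious_days=10, days=365):
    beta = r0 / infectious_days      # transmission rate per day
    gamma = 1.0 / infectious_days    # recovery rate per day
    s, i = population - 1.0, 1.0     # susceptible, infectious
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
    total_infected = population - s
    return ifr * total_infected

# Early, inflated inputs vs. later, lower inputs: the projection changes several-fold.
print(projected_deaths(r0=3.0, ifr=0.01))    # ~9,400 deaths per million
print(projected_deaths(r0=2.5, ifr=0.003))   # ~2,700 deaths per million
```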

Reason: Poor data input for data-based forecasting (e.g. time series)
Example: Lack of consensus as to what is the "ground truth", even for seemingly hard-core data such as the daily number of deaths. These counts may vary because of reporting delays, changing definitions, data errors, etc. Different models were trained on different and possibly highly inconsistent versions of the data.
How to fix: As above: investment should be made in the collection, cleaning, and curation of data.

Reason: Wrong assumptions in the modeling
Example: Many models assume homogeneity, i.e. all people having equal chances of mixing with each other and infecting each other. This is an untenable assumption and, in reality, tremendous heterogeneity of exposures and mixing is likely to be the norm. Unless this heterogeneity is recognized, estimates of the proportion of people eventually infected before reaching herd immunity can be markedly inflated.
How to fix: Need to build probabilistic models that allow for more realistic assumptions; quantify uncertainty and continuously re-adjust models based on accruing evidence.
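
A short numerical illustration of this point: under homogeneous mixing the classical herd-immunity threshold is 1 − 1/R0, whereas one approximation proposed in the literature for gamma-distributed individual susceptibility, 1 − (1/R0)^(1/(1+CV²)), gives much lower thresholds once heterogeneity is allowed for. The R0 and coefficient-of-variation (CV) values below are illustrative only.

```python
# Herd-immunity threshold under homogeneous mixing (1 - 1/R0) vs. an
# approximation from the literature for gamma-distributed individual
# susceptibility, 1 - (1/R0)**(1/(1 + cv**2)). R0 and CV values are illustrative.

def threshold_homogeneous(r0):
    return 1.0 - 1.0 / r0

def threshold_heterogeneous(r0, cv):
    return 1.0 - (1.0 / r0) ** (1.0 / (1.0 + cv ** 2))

r0 = 2.5
print(f"homogeneous mixing:    {threshold_homogeneous(r0):.0%}")        # ~60%
print(f"heterogeneous, CV = 1: {threshold_heterogeneous(r0, 1.0):.0%}")  # ~37%
print(f"heterogeneous, CV = 2: {threshold_heterogeneous(r0, 2.0):.0%}")  # ~17%
```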

Reason: High sensitivity of estimates
Example: For models that use exponentiated variables, small errors may result in major deviations from reality.
How to fix: Inherently impossible to fix; can only acknowledge that uncertainty in calculations may be much larger than it seems.
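
A one-line worked example of this sensitivity, with hypothetical numbers: under exponential growth, a modest misestimate of the growth rate compounds into a large error in the projection.

```python
# Exponential growth exp(r * t): a 15% overestimate of the daily growth rate r
# roughly doubles a 30-day case projection. All numbers are hypothetical.
import math

cases_now, horizon_days = 1_000, 30
for r in (0.20, 0.23):   # "true" rate vs. slightly overestimated rate
    print(f"r = {r:.2f}: {cases_now * math.exp(r * horizon_days):,.0f} projected cases")
# r = 0.20: 403,429 projected cases
# r = 0.23: 992,275 projected cases (about 2.5x higher)
```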

Reason: Lack of incorporation of epidemiological features
Example: Almost all COVID-19 mortality models focused on the number of deaths, without considering age structure and comorbidities. This can give very misleading inferences about the burden of disease in terms of quality-adjusted life-years lost, which is far more important than the simple death count. For example, the Spanish flu killed young people with an average age of 28, and its burden in terms of quality-adjusted person-years lost was about 1000-fold higher than that of COVID-19 (at least as of June 8, 2020).
How to fix: Incorporate the best epidemiological estimates of age structure and comorbidities in the modeling; focus on quality-adjusted life-years rather than deaths.
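
The difference that age structure makes can be seen with a toy calculation: two hypothetical epidemics with identical death counts but different age distributions of deaths yield very different burdens in life-years lost. The death counts and remaining-life-expectancy figures below are made-up placeholders, not estimates for the Spanish flu or for COVID-19.

```python
# Illustrative only: same death count, different age structure, very different
# burden in life-years lost. All inputs are hypothetical placeholders.

def life_years_lost(deaths_by_age, remaining_life_expectancy):
    return sum(n * remaining_life_expectancy[age] for age, n in deaths_by_age.items())

remaining_life_expectancy = {"young": 50.0, "old": 8.0}   # hypothetical values

epidemic_a = {"young": 90_000, "old": 10_000}   # deaths concentrated in the young
epidemic_b = {"young": 10_000, "old": 90_000}   # deaths concentrated in the old

print(life_years_lost(epidemic_a, remaining_life_expectancy))  # 4,580,000 life-years
print(life_years_lost(epidemic_b, remaining_life_expectancy))  # 1,220,000 life-years
```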

Reason: Poor past evidence on effects of available interventions
Example: The core evidence to support "flatten-the-curve" efforts was based on observational data from the 1918 Spanish flu pandemic in 43 US cities. These data are >100 years old, of questionable quality, unadjusted for confounders, based on ecological reasoning, and pertain to an entirely different (influenza) pathogen that had a 100-fold higher infection fatality rate than SARS-CoV-2. Even so, the impact on reduction of total deaths was of borderline significance and very small (10%–20% relative risk reduction); conversely, many models have assumed a 25-fold reduction in deaths (e.g. from 510,000 deaths to 20,000 deaths in the Imperial College model) with the adopted measures.
How to fix: While some interventions in the broader package of lockdown measures are likely to have beneficial effects, assuming huge benefits is incongruent with past (weak) evidence and should be avoided. Large benefits may be feasible from precise, focused measures (e.g. early, intensive testing with thorough contact tracing for the early detected cases, so as not to allow the epidemic wave to escalate [e.g. Taiwan or Singapore]; or draconian hygiene measures and thorough testing in nursing homes) rather than from blind lockdown of whole populations.
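
For context, the gap between the assumed and the historically observed effect sizes can be quantified directly from the figures cited above.

```python
# The cited model scenario (510,000 deaths without measures vs. 20,000 with)
# implies a ~96% reduction (about 25.5-fold), versus the 10%-20% relative risk
# reduction suggested by the 1918 observational data.
baseline, with_measures = 510_000, 20_000
print(f"implied reduction: {1 - with_measures / baseline:.0%} "
      f"({baseline / with_measures:.1f}-fold)")
for rrr in (0.10, 0.20):
    print(f"{rrr:.0%} relative risk reduction -> {baseline * (1 - rrr):,.0f} expected deaths")
```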

Reason: Lack of transparency
Example: The methods of many models used by policy makers were not disclosed; most models were never formally peer-reviewed, and the vast majority have not appeared in the peer-reviewed literature even many months after they shaped major policy actions.
How to fix: While formal peer review and publication may unavoidably take more time, full transparency about the methods and sharing of the code and data that inform these models is indispensable. Even with peer review, many papers may still be glaringly wrong, even in the best journals.

Reason: Errors
Example: Complex code can be error-prone, and errors can happen even with experienced modelers; using old-fashioned software or languages can make things worse; not sharing code and data (or sharing them late) prevents errors from being detected and corrected.
How to fix: Promote data and code sharing; use up-to-date and well-vetted tools and processes that minimize the potential for error through auditing loops in the software and code.

Reason: Lack of determinacy
Example: Many models are stochastic and need a large number of iterations to be run, perhaps also with appropriate burn-in periods; superficial use may lead to different estimates.
How to fix: Promote data and code sharing to allow checking the use of stochastic processes and their stability.
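
A small sketch of the iteration issue (the branching-process parameters are hypothetical): the same stochastic model, run with only a handful of realizations, gives visibly different answers under different random seeds, while a large number of realizations stabilizes the estimate.

```python
# Toy stochastic epidemic (Poisson branching process with hypothetical R_eff = 0.9):
# with only 10 realizations, the estimated mean outbreak size differs noticeably
# across random seeds; with 10,000 realizations it is stable.
import numpy as np

def outbreak_size(rng, r_eff=0.9, max_generations=20):
    infected, total = 1, 1
    for _ in range(max_generations):
        infected = int(rng.poisson(r_eff, size=infected).sum())
        total += infected
    return total

for n_runs in (10, 10_000):
    means = []
    for seed in range(3):                       # three independent seeds
        rng = np.random.default_rng(seed)
        means.append(np.mean([outbreak_size(rng) for _ in range(n_runs)]))
    print(n_runs, [round(m, 1) for m in means])
```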

Reason: Looking at only one or a few dimensions of the problem at hand
Example: Almost all models that had a prominent role in decision-making focused on COVID-19 outcomes, often just a single outcome or a few outcomes (e.g. deaths or hospital needs). Models prime for decision-making need to take into account the impact on multiple fronts (e.g. other aspects of health care, other diseases, dimensions of the economy, etc.).
How to fix: Interdisciplinarity is desperately needed; as it is unlikely that single scientists or even teams can cover all this space, it is important for modelers from diverse ways of life to sit at the same table. Major pandemics happen rarely, and what is needed are models that combine information from a variety of sources. Information from data, from experts in the field, and from past pandemics needs to be combined in a logically consistent fashion if we wish to get any sensible predictions.

Reason: Lack of expertise in crucial disciplines
Example: The credentials of modelers are sometimes undisclosed; when they have been disclosed, these teams are led by scientists who may have strengths in some quantitative fields, but these fields may be remote from infectious diseases and clinical epidemiology; modelers may operate in a subject matter vacuum.
How to fix: Make sure that the modelers’ team is diversified and solidly grounded in terms of subject matter expertise.

Reason: Groupthink and bandwagon effects
Example: Models can be tuned to get desirable results and predictions, e.g. by changing the input of what are deemed to be plausible values for key variables. This is especially true for models that depend on theory and speculation, but even data-driven forecasting can do the same, depending on how the modeling is performed. In the presence of strong groupthink and bandwagon effects, modelers may consciously fit their predictions to the dominant thinking and expectations, or they may be forced to do so.
How to fix: Maintain an open-minded approach; unfortunately, models are very difficult, if not impossible, to pre-register, so subjectivity is largely unavoidable and should be taken into account in deciding how much forecasting predictions can be trusted.

Reason: Selective reporting
Example: Forecasts may be more likely to be published or disseminated if they are more extreme.
How to fix: Very difficult to diminish, especially in charged environments; needs to be taken into account in appraising the credibility of extreme forecasts.