Letter
2019 Sep 26;116(42):20809–20810. doi: 10.1073/pnas.1912147116

On the multibin logarithmic score used in the FluSight competitions

Johannes Bracher a,1
PMCID: PMC6800377  PMID: 31558612

The FluSight challenges (1) represent an outstanding collaborative effort and have “pioneered infectious disease forecasting in a formal way” (ref. 2, p. 2803). However, I wish to initiate a discussion about the employed evaluation measure.

The competitions feature discrete or discretized targets related to the US influenza season. For example, for the peak timing $Y$, a forecast distribution $F$ consists of probabilities $p_1, \dots, p_T$ for the $T = 33$ wk of the season. Such forecasts can be evaluated using the log score (3, 4)

$$\text{logS}(F, y_{\text{obs}}) = \log(p_{y_{\text{obs}}}),$$

where $y_{\text{obs}}$ is the observed value. This score is strictly proper; i.e., its expectation is uniquely maximized by the true distribution of $Y$. In the FluSight competitions the logS is applied in a multibin version,

$$\text{MBlogS}(F, y_{\text{obs}}) = \log \sum_{i=-d}^{d} p_{y_{\text{obs}}+i},$$

to measure “accuracy of practical significance” (ref. 1, p. 3153). Depending on the target, $d$ is either 1 or 5. Following the competitions, this score has become widely used (5–10), even though, as also mentioned in ref. 1, it is improper. This may be problematic, as improper scores incentivize dishonest forecasts. Assume $T > 2d$ and

$$p_1 = \dots = p_d = p_{T-d+1} = \dots = p_T = 0, \qquad [1]$$

i.e., probability 0 for the $2d$ extreme categories. Now define a “blurred” distribution $\tilde{F}$ with

$$\tilde{p}_t = \frac{\sum_{i=-d}^{d} p_{t+i}}{2d+1}, \qquad t = 1, \dots, T, \qquad [2]$$

where $p_t = 0$ for $t < 1$ and $t > T$, and Eq. 1 ensures $\sum_{t=1}^{T} \tilde{p}_t = 1$. This implies

$$\text{MBlogS}(F, y_{\text{obs}}) = \text{logS}(\tilde{F}, y_{\text{obs}}) + \log(2d+1);$$

i.e., the MBlogS is essentially the logS applied to a blurred version of $F$. To optimize the expected MBlogS under their true belief $F$, forecasters should therefore not report $F$, but a sharper forecast $G$ so that the blurred version $\tilde{G}$ (with $\tilde{p}_{G,1}, \dots, \tilde{p}_{G,T}$ derived from $p_{G,1}, \dots, p_{G,T}$ as in Eq. 2) is close or equal to $F$. This follows from the propriety of the logS. An optimal $G$ is found by maximizing $\sum_{t=1}^{T} p_t \log(\tilde{p}_{G,t})$ with respect to $p_{G,1}, \dots, p_{G,T}$.
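
To make this concrete, here is a minimal numerical sketch in Python (not part of the original letter; the helper names log_score, mb_log_score, and blur are illustrative choices). It checks that, for a forecast satisfying Eq. 1, the multibin score equals the log score of the blurred distribution plus $\log(2d+1)$.

import numpy as np

def log_score(p, y):
    # logS(F, y_obs) = log(p_{y_obs}); y is a 0-based bin index
    return np.log(p[y])

def mb_log_score(p, y, d=1):
    # MBlogS: log of the probability mass within +/- d bins of y
    lo, hi = max(0, y - d), min(len(p), y + d + 1)
    return np.log(p[lo:hi].sum())

def blur(p, d=1):
    # Blurred distribution of Eq. 2: probability mass in a window of 2d+1 bins, divided by 2d+1
    padded = np.concatenate([np.zeros(d), p, np.zeros(d)])
    return np.array([padded[t:t + 2 * d + 1].sum() for t in range(len(p))]) / (2 * d + 1)

T, d = 33, 1
rng = np.random.default_rng(0)
p = rng.random(T)
p[:d] = 0.0          # Eq. 1: zero probability for the d lowest bins ...
p[T - d:] = 0.0      # ... and the d highest bins
p /= p.sum()

y_obs = 15           # an arbitrary week (0-based index)
lhs = mb_log_score(p, y_obs, d)
rhs = log_score(blur(p, d), y_obs) + np.log(2 * d + 1)
print(np.isclose(lhs, rhs))   # True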

This optimal $G$ can differ considerably from the original $F$, as Fig. 1 shows for forecasts of the 2016 to 2017 national-level peak timing by the Los Alamos National Laboratory (LANL) team (9) (downloaded from https://github.com/FluSightNetwork/cdc-flusight-ensemble/). The optimized $G$s (with $d = 1$) often have their mode shifted by 1 wk and tend to be multimodal, even for unimodal $F$. Averaged over the 2016 to 2017 season, they yield an improved MBlogS for the peak timing (−0.434 vs. −0.484). This illustrates that the MBlogS can be gamed, even though I strongly doubt that participants have tried to do so. The logS, like any other proper score, avoids this pitfall.

Fig. 1. Forecasts $F$ for the peak week, submitted by the LANL team in weeks 6 to 7, 2017, and optimized versions $G$. Diamonds mark the observed peak week. Expected scores are computed under $F$.
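
As a rough illustration of the optimization just described (again not part of the original letter, and using a synthetic unimodal belief $F$ rather than the LANL forecasts), the sketch below maximizes $\sum_{t=1}^{T} p_t \log(\tilde{p}_{G,t})$ over candidate reports $G$ via a softmax parameterization and SciPy's general-purpose optimizer. The optimized $G$ need not resemble $F$; only its blurred version does, and its expected MBlogS under $F$ is at least as high as that of reporting $F$ honestly.

import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def blur(p, d=1):
    # Blurred distribution of Eq. 2 (same helper as in the previous sketch)
    padded = np.concatenate([np.zeros(d), p, np.zeros(d)])
    return np.array([padded[t:t + 2 * d + 1].sum() for t in range(len(p))]) / (2 * d + 1)

T, d = 33, 1
weeks = np.arange(1, T + 1)
p_F = np.exp(-0.5 * ((weeks - 17) / 3.0) ** 2)   # synthetic unimodal belief F, not real data
p_F /= p_F.sum()

def neg_expected_mb_log_score(theta):
    # Expected MBlogS under F of reporting G = softmax(theta),
    # up to the constant log(2d+1): sum_t p_t * log(p_tilde_{G,t})
    p_G_blur = blur(softmax(theta), d)
    return -np.sum(p_F * np.log(p_G_blur + 1e-300))

res = minimize(neg_expected_mb_log_score, np.log(p_F), method="L-BFGS-B")
p_G = softmax(res.x)   # the "gamed" report G

print(-neg_expected_mb_log_score(np.log(p_F)))   # expected score of reporting F honestly
print(-res.fun)                                  # expected score of the optimized G (at least as high)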

Acknowledgments

I thank T. Gneiting for helpful discussions and the FluSight Collaboration for making its forecasts publicly available.

Footnotes

The author declares no conflict of interest.

References

1. Reich N. G., et al., A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proc. Natl. Acad. Sci. U.S.A. 116, 3146–3154 (2019).
2. Viboud C., Vespignani A., The future of influenza forecasts. Proc. Natl. Acad. Sci. U.S.A. 116, 2802–2804 (2019).
3. Gneiting T., Raftery A. E., Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 102, 359–378 (2007).
4. Held L., Meyer S., Bracher J., Probabilistic forecasting in infectious disease epidemiology: The 13th Armitage lecture. Stat. Med. 36, 3443–3460 (2017).
5. Brooks L. C., Farrow D. C., Hyun S., Tibshirani R. J., Rosenfeld R., Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions. PLoS Comput. Biol. 14, 1–29 (2018).
6. Kandula S., et al., Evaluation of mechanistic and statistical methods in forecasting influenza-like illness. J. R. Soc. Interface 15, 20180174 (2018).
7. Kandula S., Shaman J., Near-term forecasts of influenza-like illness: An evaluation of autoregressive time series approaches. Epidemics 27, 41–51 (2019).
8. McGowan C. J., et al., Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Sci. Rep. 9, 683 (2019).
9. Osthus D., Gattiker J., Priedhorsky R., Del Valle S. Y., Dynamic Bayesian influenza forecasting in the United States with hierarchical discrepancy (with discussion). Bayesian Anal. 14, 261–312 (2019).
10. Osthus D., Daughton A. R., Priedhorsky R., Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited. PLoS Comput. Biol. 15, 1–19 (2019).
