PLoS Negl Trop Dis. 2020 Sep 24;14(9):e0008056. doi: 10.1371/journal.pntd.0008056

Machine learning and dengue forecasting: Comparing random forests and artificial neural networks for predicting dengue burden at national and sub-national scales in Colombia

Naizhuo Zhao 1,2, Katia Charland 3, Mabel Carabali 4, Elaine O Nsoesie 5, Mathieu Maheu-Giroux 4,6, Erin Rees 7, Mengru Yuan 4, Cesar Garcia Balaguera 8, Gloria Jaramillo Ramirez 8, Kate Zinszer 3,6,9,*
Editor: Marc Choisy
PMCID: PMC7537891  PMID: 32970674

Abstract

The robust estimation and forecasting capability of random forests (RF) has been widely recognized; however, this ensemble machine learning method has not been widely used in mosquito-borne disease forecasting. In this study, two sets of RF models were developed at the national (pooled department-level data) and department level in Colombia to predict weekly dengue cases up to 12 weeks ahead. A pooled national model based on artificial neural networks (ANN) was also developed and used as a comparator to the RF models. The predictors included historical dengue cases, satellite-derived estimates for vegetation, precipitation, and air temperature, as well as population counts, income inequality, and education. Our RF model trained on the pooled national data estimated department-specific weekly dengue cases more accurately than a local model trained only on the department’s data. Additionally, the forecast errors of the national RF model were smaller than those of the national pooled ANN model and increased with the forecast horizon, from one week ahead (mean absolute error, MAE: 9.32) to 12 weeks ahead (MAE: 24.56). There was considerable variation in the relative importance of predictors depending on the forecast horizon: environmental and meteorological predictors were relatively important for short-term forecast horizons, while socio-demographic predictors were more relevant for longer-term horizons. This study demonstrates the potential of RF in dengue forecasting, with a feasible approach of using a national pooled model to forecast at finer spatial scales. Furthermore, including socio-demographic predictors is likely to be helpful in capturing longer-term dengue trends.

Author summary

Dengue virus has the highest disease burden of all mosquito-borne viral diseases, infecting 390 million people annually in 128 countries. Forecasting is an important warning mechanism that can help with proactive planning and response for clinical and public health services. In this study, we compare two different machine learning approaches to dengue forecasting: random forests (RF) and artificial neural networks (ANN). National (pooling across all departments) and local (department-specific) models were compared and used to predict future dengue cases in Colombia. In Colombia, the departments are administrative divisions formed by a grouping of municipalities. The results demonstrated that the counts of future dengue cases were more accurately estimated by RF than by ANN. It was also shown that environmental and meteorological predictors were more important for shorter-term forecast accuracy, while socio-demographic predictors were more important for longer-term forecasts. Finally, the national pooled model applied to local data was more accurate in dengue forecasting than the department-specific models. This research contributes to the field of disease forecasting and highlights different considerations for future forecasting studies.

Introduction

Dengue virus is the most prevalent of the mosquito-borne viral diseases, infecting 390 million people annually in 128 countries with four different virus serotypes [1]. Rising incidence and large-scale outbreaks are largely due to inadequate living conditions, naïve populations, global trade and population mobility, climate change, and the adaptive nature of the principal mosquito vectors Aedes aegypti and Aedes albopictus [2, 3]. The direct and indirect costs of dengue are substantial and impose enormous burdens on low- and middle-income tropical countries, with a global estimate of US$8.9 billion in costs per year [4].

Human and financial costs of dengue can be alleviated when response systems, such as intervention strategies, health care services, and supply chain management, receive timely warnings of future cases through forecasting models [5]. A number of dengue forecasting models have been developed, and these models can generally be classified into two methodological categories: time series and machine learning [6, 7]. The majority of existing dengue forecasting models use time series methods, typically Autoregressive Integrated Moving Average (ARIMA), in which lagged meteorological factors (e.g. temperature and precipitation) act as covariates in conjunction with historical dengue data for one- to 12-week-ahead forecasting [8–13]. Many studies have reported that conventional time series models such as ARIMA are insufficient to meet complex forecasting requirements [14–16], as multiple trends and outliers present in the time series reduce forecasting accuracy [17].

In the last two decades, machine learning (ML) methods have been used in many disciplines, such as geography, environmental science, and epidemiology, to yield meaningful findings from highly heterogeneous data. Unlike statistical modeling, which forms relationships between variables based on many assumptions (e.g. independence of predictor variables, homoscedasticity, and normally distributed errors), machine learning facilitates the inclusion of a large number of correlated variables, enables the modeling of complex interactions between variables, and can fit complex models without presupposing functional forms (e.g. linear, exponential, or logistic), providing a more flexible approach for disease forecasting [18, 19]. Decision trees, support vector machines, shallow neural networks, K-nearest neighbors, gradient boosting, and naive Bayes are frequently used ML approaches in dengue-forecasting studies [7, 20–23]. Compared to the above ML methods, random forests (RF), another common ML algorithm, has been shown to be more accurate in forecasting given its ability to overcome the common problem of over-fitting through the use of bootstrap aggregation [24–28].

Random forests have been used to forecast dengue risk in several countries including Costa Rica [29], the Philippines [30, 31], Pakistan [32], Peru, and Puerto Rico [33]. However, time or seasonal variables were not always included in the models, nor were sociodemographic predictors, which have been found to improve forecast accuracy in HIV [34] and Ebola [35] epidemic models. Furthermore, dengue models, regardless of whether they use time series or ML approaches, have typically been developed to predict dengue cases in individual administrative areas such as a city or a province [9–12, 20–23]. Universal dengue prediction models that are effective across different administrative regions remain scarce.

Historically, Colombia is one of the countries most affected by dengue, with Aedes mosquitoes widely distributed throughout all departments at elevations below 2,000 meters [36, 37]. The objective of this study was to evaluate the potential of RF forecasting models at the department and national levels in Colombia. We compared the accuracy of department-specific RF models to a nationally pooled RF model to understand the feasibility of using a pooled model to predict dengue cases for individual departments. Using ARIMA as the baseline, we also compared the errors of the nationally pooled RF model with those of an artificial neural network (ANN), another classic and widely used ML approach. Finally, we estimated how the importance of different predictors changed with the forecast horizon.

Methods

Ethics statement

Ethical approval was obtained from the Health Research Ethics Board from the University of Montreal (18-073-CERES-D).

Data

Various data were used to develop the forecasting models: dengue cases from surveillance data, environmental indicators from remote sensing data, and sociodemographic indicators such as population, income inequality, and education coverage (Table 1). The dengue case surveillance data were extracted from SIVIGILA, an electronic platform created by the Colombian national surveillance program, and were available at the department level. The national surveillance program receives weekly reports from all public health facilities that provide services to dengue cases. The dengue cases reported through SIVIGILA were a mixture of probable and laboratory-confirmed cases, without distinguishing between the two case definitions. Laboratory confirmation of dengue is based on a positive result from antigen, antibody, or virus detection and/or isolation [38]. Probable cases are based on clinical diagnosis plus at least one positive serological immunoglobulin M test or an epidemiological link to a confirmed case within 14 days prior to symptom onset. Cases are typically reported within a week, with severe cases usually being reported immediately.

Table 1. Summary of indicators and data sources.
| Indicator | Source | Temporal granularity | Format |
| --- | --- | --- | --- |
| Dengue cases | SIVIGILA (national surveillance program of Colombia) | Weekly | Tabular |
| Rainfall | CMORPH precipitation data from NOAA’s CPC | Daily | Gridded |
| EVI | MOD13C1 from NASA’s LP DAAC | 16-day | Gridded |
| Temperature | MOD11C2 from NASA’s LP DAAC | 8-day | Gridded |
| Population | Colombian National Administrative Department of Statistics | Yearly | Tabular |
| Gini Index | Colombian National Administrative Department of Statistics | Yearly | Tabular |
| Education coverage | Colombian National Administrative Department of Statistics | Yearly | Tabular |

CPC: Climate Prediction Center; LP DAAC: Land Processes Distributed Active Archive Center; NOAA: National Oceanic and Atmospheric Administration; EVI: enhanced vegetation index; CMORPH: Climate Prediction Center morphing method; NASA: National Aeronautics and Space Administration.

Precipitation, air temperature, and land cover type have been shown to be three important determinants of Aedes mosquito abundance and are often used as predictors in dengue forecasting [9, 11, 21, 39]. In this study, precipitation data were obtained from the CMORPH (Climate Prediction Center morphing method) daily estimated precipitation dataset [40]. Land surface temperatures were extracted from the MODIS Terra Land Surface Temperature 8-day image products (MOD11C2.006). Enhanced vegetation index (EVI) estimates were obtained from the MODIS Terra Vegetation Indices 16-day image products (MOD13C1.006). Several studies have shown that socio-demographic factors may influence dengue transmission and incidence as strongly as environmental factors [41–43]. Education influences people’s knowledge of and behaviours towards infectious diseases, as people with higher education are more likely to adopt behaviours that reduce infection risk compared to individuals with lower education [44]. Income also affects the risk of infectious diseases, with those from higher income brackets often being less exposed and, consequently, less at risk of infection compared to individuals with lower income [45]. Given this, we included population, education coverage, and the Gini Index (a measure of income inequality) as potential predictors, which were retrieved from the Colombian National Administrative Department of Statistics.

Random forests

Random forests (RF) is an ensemble decision tree approach [46]. A decision tree is a simple representation for classification in which each internal node corresponds to a test on an attribute, each branch represents an outcome of the test, and each leaf (i.e. terminal node) holds a class label. Decision trees can also be used for regression when the target or outcome variable is continuous. Bootstrap aggregation, commonly known as bagging, is the most distinctive technique used in RF: each decision tree is trained on a randomly selected subsample of the entire training dataset.

Data preprocessing

To ensure a consistent temporal granularity with the outcome variable, the daily precipitation data were aggregated to a weekly frequency. The 8-day land surface temperature and the 16-day EVI data were resampled to a weekly frequency using spline interpolation [47]. We assigned a given department the same population, Gini Index, and education coverage values for all weeks within the same calendar year.
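For illustration, a minimal R sketch of this harmonization step is shown below; the object and column names (e.g. daily_rain, lst_dates) are hypothetical, as the original preprocessing code is not published.

```r
# Illustrative sketch only; hypothetical object and column names.
library(dplyr)

# Daily CMORPH precipitation -> weekly totals per department
weekly_rain <- daily_rain %>%
  group_by(department, year, epi_week) %>%
  summarise(rainfall = sum(precip_mm, na.rm = TRUE), .groups = "drop")

# 8-day LST (or 16-day EVI) composites -> weekly values via spline interpolation
to_weekly <- function(obs_dates, obs_values, weekly_dates) {
  stats::spline(x = as.numeric(obs_dates), y = obs_values,
                xout = as.numeric(weekly_dates), method = "natural")$y
}
weekly_lst <- to_weekly(lst_dates, lst_values, week_midpoints)
```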

Colombia has 32 departments, and the archipelago of San Andrés, Providencia, and Santa Catalina (commonly known as San Andrés y Providencia) is a department consisting of two island groups located 775 km from mainland Colombia. Due to frequent cloud contamination over the small island areas, it was not possible to obtain high-quality MODIS image products for weekly temperature or EVI estimation. The Vaupés department had only 30 confirmed dengue cases from 2014 to 2018. Therefore, the departments of San Andrés y Providencia and Vaupés were excluded from this study, and data from the remaining 30 departments were used to train our models.

Weekly dengue data from 2014–2017 were used to train the RF models, and the data from 2018 were used to evaluate the models. To simulate ‘real life’ forecasting, we did not include the 2018 data for the socio-demographic variables, given that they are only produced annually, whereas the remote sensing data are more readily available. Based on historical (2010–2017) time series data, double exponential smoothing with an additive trend was used to estimate their values for 2018. The specific exponential smoothing functions were determined by the optimal decay option in the “forecast” package for R by minimizing the squared prediction errors.
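As an illustration of this extrapolation step, the sketch below applies Holt’s linear method (double exponential smoothing with an additive trend) from the “forecast” package to one hypothetical annual series (gini_2010_2017); the variable names are assumptions.

```r
library(forecast)

gini_ts   <- ts(gini_2010_2017, start = 2010, frequency = 1)  # hypothetical annual series
fit       <- holt(gini_ts, h = 1)    # smoothing parameters chosen by minimizing squared errors
gini_2018 <- as.numeric(fit$mean)    # extrapolated 2018 value used in the test-year predictors
```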

Development of RF, ANN, and ARIMA models

We first developed RF models for each department (hereafter referred to as the local level). Let the “current” week be k and the number of confirmed dengue cases be y. Following the RF streamflow forecasting model developed by Papacharalampous and Tyralis [48], we used the dengue case counts of the current and previous 11 weeks (i.e. y_k, y_{k-1}, …, y_{k-10}, y_{k-11}) of a department to predict the one-week-ahead dengue cases (i.e. y_{k+1}) for that department. The current and previous 11 weeks of rainfall, land surface temperature, and EVI were also included as predictors, together with population, Gini Index, and education coverage. These lags were selected because previous studies demonstrated that the optimal lags of meteorological variables used for dengue forecasting are usually not larger than 12 weeks [49–54]. In addition, the ordinal number of the forecast week (1–52 for 2015, 2016, 2017, and 2018, and 1–53 for 2014) and the year (2014–2018) were treated as two predictor variables to account for seasonality and the long-term trend of dengue occurrence [55, 56].
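A minimal sketch of how such a lagged design matrix can be assembled for one department is given below; `dept` and its columns are hypothetical names, and the 53-predictor count assumes 12 lags for each of the four weekly series plus the three socio-demographic variables, week, and year.

```r
# Base-R lag/lead helpers (lag 0 keeps week k itself)
lag_vec  <- function(x, l) if (l == 0) x else c(rep(NA, l), head(x, -l))
lead_vec <- function(x, n) c(tail(x, -n), rep(NA, n))

make_lags <- function(x, prefix, lags = 0:11) {
  out <- sapply(lags, function(l) lag_vec(x, l))
  colnames(out) <- paste0(prefix, "_lag", lags)
  as.data.frame(out)
}

n <- 1  # forecast horizon in weeks
design <- cbind(
  target = lead_vec(dept$dengue, n),            # y_{k+n}
  make_lags(dept$dengue,      "dengue"),
  make_lags(dept$rainfall,    "rain"),
  make_lags(dept$temperature, "temp"),
  make_lags(dept$evi,         "evi"),
  dept[, c("population", "gini", "education", "week", "year")]
)
design <- na.omit(design)  # drops the first 11 and last n weeks
```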

We then developed an RF model at the national scale by pooling the data across all departments. To train a national-scale RF model for forecasting n-week-ahead dengue cases (where n ≤ 12), we used the same predictor and target variables as those used in the local n-week-ahead forecasting models. The difference between the local and the national pooled models was that each local n-week-ahead model was trained using 209 − n samples (209 = 53 + 52 + 52 + 52), while the national model was trained using 6270 − 30n samples [i.e. (209 − n) × 30]. Through 10-fold cross-validation, we found that the common settings for the number of variables randomly sampled as candidates at each split (i.e. the number of features divided by three) and the minimum size of terminal nodes (i.e. five) were also optimal for avoiding over-fitting in our RF models [57]. The RF models were fitted with the “randomForest” package in the R statistical computing environment, using 1,000 trees per forest [58]. We found that further increasing the number of trees did not markedly decrease the out-of-bag mean squared errors of the RF models but only increased computation time.
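A sketch of the national pooled fit with these settings is shown below; `national_design` (row-binding the department-level design matrices) is a hypothetical object name.

```r
library(randomForest)

p  <- ncol(national_design) - 1           # number of predictor variables
rf <- randomForest(target ~ ., data = national_design,
                   ntree = 1000,          # 1,000 trees per forest
                   mtry = floor(p / 3),   # variables sampled at each split
                   nodesize = 5,          # minimum size of terminal nodes
                   importance = TRUE)     # permutation importance, used for %IncMSE
print(rf)                                 # reports the out-of-bag mean squared error
```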

Artificial neural networks (ANN) are considered a classic ML approach; to highlight the prediction accuracy of the RF models, we also developed an ANN model at the national scale. The ANN was composed of one input layer, three hidden layers, and one output layer. The ANN model used the ReLU activation function to address the problem of vanishing gradients and avoided over-fitting through dropout. Jointly considering prediction accuracy and computation time, we set the “epochs” and “batch size” of the ANN models to 100 and 32, respectively. The ANN models had the same 53 predictor variables as the RF models, resulting in 53 neurons in the input layer and one neuron in the output layer. The number of neurons in the hidden layers decreased layer by layer, in the shape of an inverted pyramid. The specific number of neurons and the dropout rate of each hidden layer were determined by iterative attempts until the mean absolute error (MAE) of the prediction could not be further reduced [59] (see Table 2).

Table 2. The numbers of neurons and values of dropouts in the hidden layers of the ANN models.
| Hidden layer | Number of neurons | Dropout |
| --- | --- | --- |
| First | 48 | 0.3 |
| Second | 32 | 0.2 |
| Third | 19 | 0.1 |
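A sketch of this architecture in the R “keras” package is given below; the optimizer (“adam”) and the input objects (x_train, y_train) are assumptions, as they are not specified in the text.

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 48, activation = "relu", input_shape = 53) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 19, activation = "relu") %>%
  layer_dropout(rate = 0.1) %>%
  layer_dense(units = 1)                                # one output neuron: predicted case count

model %>% compile(optimizer = "adam", loss = "mae")     # optimizer choice is an assumption
model %>% fit(x_train, y_train, epochs = 100, batch_size = 32)
```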

Standard univariate ARIMA models developed at the local scale were used as the baseline against which to compare the RF and ANN models. The Hyndman–Khandakar algorithm was used for automatic ARIMA modeling [60]. This algorithm first determines the number of non-seasonal differences needed for stationarity (i.e. d in ARIMA) using repeated Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests. It then selects the number of autoregressive terms and the number of lagged forecast errors (i.e. p and q in ARIMA, respectively) by minimizing Akaike’s Information Criterion (AIC).
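A sketch of the baseline fit using forecast::auto.arima(), which implements the Hyndman–Khandakar algorithm, is shown below; `dept_weekly` is a hypothetical weekly case series for one department, and restricting the search to a non-seasonal model is an assumption.

```r
library(forecast)

y   <- ts(dept_weekly$dengue, frequency = 52)
fit <- auto.arima(y, seasonal = FALSE, ic = "aic")  # KPSS tests choose d; AIC chooses p and q
fc  <- forecast(fit, h = 12)                        # 1- to 12-week-ahead baseline forecasts
```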

Model evaluation

The MAEs of the ARIMA, RF, and ANN models were calculated over the 52 weeks of 2018 from the actual and predicted numbers of dengue cases. The accuracy comparison was performed at both the local (department) and national (pooled) scales. When the comparison for an n-week-ahead prediction was conducted at the national scale, the numbers of dengue cases predicted by the 30 local RF models were summed and compared with the actual national values to calculate one MAE. When the comparison was conducted at the local scale, the national RF model was applied to each of the 30 departments and the predicted values were compared with the actual numbers of dengue cases to compute 30 individual MAEs. To improve intuitive interpretation and facilitate comparisons of a model’s predictive performance across departments and forecasting horizons, we used the relative MAE (RMAE) to evaluate model accuracy [61]. We defined the RMAE between an ML model (i.e. RF or ANN) and the baseline model at horizon h as:

RMAE_{A,B,h} = MAE_{A,h} / MAE_{B,h} (1)

where A denotes an ML model and B denotes the baseline ARIMA model.
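In code, Eq (1) reduces to a one-line helper, sketched here with hypothetical argument names:

```r
# MAE of a machine learning model divided by the MAE of the baseline ARIMA model
rmae <- function(pred_ml, pred_baseline, actual) {
  mean(abs(pred_ml - actual)) / mean(abs(pred_baseline - actual))
}
```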

Given that the dengue burden varies across years, we conducted leave-one-season-out cross-validation to improve the robustness of our evaluation. The accuracy of the national (pooled) and local RF models, as well as the national ANN model, was compared using the RMAE. In each validation, four years of data were used to train the models and the remaining year was used to validate them. This procedure was iterated five times so that each year’s data were selected once for validation. An ARIMA model requires a continuous time series and was therefore not suitable for the leave-one-season-out cross-validation. The ANN and ARIMA fitting processes were completed using the “keras” and “forecast” packages, respectively, in the R statistical computing environment.
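The cross-validation loop can be sketched as follows, with `national_design` and its `year` column as hypothetical names:

```r
cv_mae <- sapply(2014:2018, function(hold_out) {
  train <- subset(national_design, year != hold_out)   # four years for training
  test  <- subset(national_design, year == hold_out)   # one held-out year
  rf    <- randomForest::randomForest(target ~ ., data = train,
                                      ntree = 1000, nodesize = 5)
  mean(abs(predict(rf, test) - test$target))            # MAE for the held-out year
})
mean(cv_mae)  # average MAE, as reported in Table 4
```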

The percentage increase in mean squared error (%IncMSE) is a robust and informative indicator for quantitatively evaluating the importance of predictor variables in a random forests model [62]. It measures the increase in the mean squared error (MSE) of prediction when an independent variable is randomly shuffled while the other independent variables are held unchanged [46]. A larger %IncMSE for a predictor variable indicates greater importance of that variable for the model’s overall forecast accuracy; the %IncMSE was calculated for each predictor in each RF model.
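When the forests are grown with importance = TRUE, the %IncMSE values can be extracted directly from the fitted object, as sketched below:

```r
imp <- randomForest::importance(rf, type = 1, scale = TRUE)         # %IncMSE per predictor
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)   # top ten predictors (cf. Table 5)
```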

Results

An exceptionally large dengue outbreak occurred in Colombia during the study period. The counts of confirmed dengue cases reached more than 2,500 per week by the end of 2015, and the outbreak ended in mid-2016. Following this outbreak, the yearly dengue case peaks were drastically reduced in 2016 and 2017 but began increasing again in 2018 (Fig 1).

Fig 1. Weekly total counts of confirmed dengue cases over Colombia for 2014–2018 (A) and the predicted counts of dengue cases by the national one-, two-, four-, eight-, and twelve-week-ahead models for 2018 (B). See S1 Fig for the predicted counts of dengue cases by the remaining seven models.

For all n-week-ahead (n ≤ 12) forecasts, the national RF model predicted the counts of dengue cases more accurately than the ARIMA models, as shown by RMAEs smaller than one (Table 3). The performance of the national model was better than that of the local model, as shown by the smaller overall RMAE and MAE (Tables 3 and 4). Moreover, in most cases, a department’s dengue cases were more accurately predicted by the national model than by the local model (Fig 2). The errors of the national RF model were mainly derived from under-estimation of cases, which coincided with the dramatic increase in cases towards the end of 2018. As expected, the under-estimation was more pronounced for longer forecast horizons.

Table 3. Accuracy comparison among the ARIMA, RF, and ANN models for predictions of 2018.

| n-week ahead | ARIMA (MAE) | Local RF (RMAE) | National RF (RMAE) | National ANN (RMAE) |
| --- | --- | --- | --- | --- |
| 1 | 6.24 | 1.28 | 0.93 | 0.98 |
| 2 | 7.15 | 1.27 | 0.95 | 1.03 |
| 3 | 8.12 | 1.25 | 0.94 | 1.04 |
| 4 | 8.95 | 1.23 | 0.95 | 0.99 |
| 5 | 9.76 | 1.24 | 0.95 | 0.98 |
| 6 | 10.69 | 1.20 | 0.94 | 0.96 |
| 7 | 11.61 | 1.16 | 0.93 | 0.98 |
| 8 | 12.50 | 1.12 | 0.92 | 0.98 |
| 9 | 13.31 | 1.08 | 0.90 | 1.00 |
| 10 | 14.05 | 1.04 | 0.89 | 0.99 |
| 11 | 14.84 | 1.00 | 0.87 | 0.95 |
| 12 | 15.56 | 0.97 | 0.86 | 0.95 |

MAE: mean absolute error; RMAE: relative mean absolute error; ARIMA: Autoregressive Integrated Moving Average; RF: random forests; ANN: artificial neural network.

Table 4. Average MAEs of the leave-one-season-out cross-validations.

| n-week ahead | Local RF | National RF | National ANN |
| --- | --- | --- | --- |
| 1 | 13.86 | 9.32 | 10.20 |
| 2 | 15.90 | 11.05 | 12.40 |
| 3 | 17.70 | 12.50 | 13.89 |
| 4 | 19.45 | 14.19 | 16.04 |
| 5 | 20.88 | 15.81 | 16.61 |
| 6 | 22.00 | 17.36 | 18.55 |
| 7 | 23.14 | 18.88 | 20.46 |
| 8 | 24.10 | 20.29 | 22.14 |
| 9 | 25.08 | 21.55 | 22.57 |
| 10 | 25.69 | 22.63 | 23.86 |
| 11 | 26.16 | 23.82 | 24.28 |
| 12 | 26.76 | 24.56 | 25.25 |

MAE: mean absolute error; RF: random forests; ANN: artificial neural network.

Fig 2. Accuracy comparison between the local and the national random forests models at the department scale for the one-week-ahead, four-week-ahead, eight-week-ahead, and twelve-week-ahead predictions. See S2 Fig for the comparison between the two types of models for all week-ahead predictions.

The overall MAE of the national-scale ANN model obtained from the leave-one-season-out cross-validation was smaller than that of the local RF model at every forecasting horizon (Table 4), although the ANN model’s MAE grew more quickly with the forecasting horizon than that of the local RF model. The RMAE of the ANN model obtained from the validation for 2018 was consistently smaller than that of the local RF model at each forecasting horizon. The MAE and RMAE of the national RF model were always smaller than those of the national ANN model at every forecasting horizon.

The relative importance of the predictor variables in the national RF models varied with the forecast horizon (Table 5). First, “current” and “near-current” past dengue data were extremely important for predicting dengue occurrence in the near future (e.g. one to three weeks ahead). As the predicted week moved further from the “current” week, the importance of historical dengue data decreased, although the “current” week of dengue cases remained one of the three most important predictors. Second, the environmental (EVI) and meteorological predictors (rainfall and temperature) were more important than the socio-demographic predictors when dengue cases were predicted in the near future (one to three weeks ahead). As the forecast horizon lengthened, the importance of the three socio-demographic covariates (education, population, and Gini Index) became increasingly notable. Finally, the week predictor, which accounted for the seasonal pattern of dengue, was important across all forecasting horizons but relatively less so at shorter horizons (i.e. n ≤ 4).

Table 5. The top ten most important predictor variables for predicting dengue cases in the national models, ordered from the largest to the smallest %IncMSEs.

| Horizon | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 | Rank 9 | Rank 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1-week-ahead | Dengue_k (26.35%) | Dengue_k-1 (17.97%) | Dengue_k-2 (12.61%) | Dengue_k-3 (10.36%) | Week (8.78%) | Dengue_k-4 (7.83%) | EVI_k-11 (6.43%) | Temperature_k-11 (6.39%) | EVI_k-10 (6.07%) | EVI_k-8 (6.05%) |
| 2-week-ahead | Dengue_k (25.72%) | Dengue_k-1 (17.13%) | Week (12.33%) | Dengue_k-2 (12.30%) | Dengue_k-3 (9.73%) | Temperature_k-11 (8.87%) | Dengue_k-4 (8.82%) | EVI_k-7 (8.42%) | EVI_k-5 (8.06%) | EVI_k-8 (7.41%) |
| 3-week-ahead | Dengue_k (27.16%) | Dengue_k-1 (17.54%) | Week (14.57%) | Dengue_k-2 (12.91%) | EVI_k-8 (9.67%) | EVI_k-10 (8.52%) | Temperature_k-10 (8.49%) | Education (8.40%) | Dengue_k-3 (7.48%) | Dengue_k-4 (7.40%) |
| 4-week-ahead | Dengue_k (27.24%) | Week (17.94%) | Dengue_k-1 (15.10%) | Education (12.97%) | Dengue_k-2 (11.28%) | Temperature_k-9 (10.03%) | EVI_k-8 (9.68%) | Temperature_k-11 (8.67%) | EVI_k-7 (8.37%) | Dengue_k-3 (7.86%) |
| 5-week-ahead | Dengue_k (25.39%) | Week (18.86%) | Dengue_k-1 (18.73%) | Education (12.99%) | Dengue_k-2 (12.39%) | EVI_k-10 (11.42%) | Temperature_k-8 (11.15%) | Temperature_k (11.31%) | Gini (10.33%) | EVI_k-9 (9.82%) |
| 6-week-ahead | Dengue_k (24.88%) | Week (20.14%) | Dengue_k-1 (17.68%) | Education (17.13%) | Population (12.38%) | Year (11.83%) | Dengue_k-2 (11.54%) | EVI_k-8 (11.52%) | EVI_k-9 (11.24%) | EVI_k-1 (11.15%) |
| 7-week-ahead | Dengue_k (25.61%) | Week (19.71%) | Education (17.66%) | Dengue_k-1 (17.49%) | Year (15.64%) | Dengue_k-2 (14.45%) | Population (12.49%) | Gini (11.69%) | EVI_k-10 (11.55%) | EVI_k-9 (11.06%) |
| 8-week-ahead | Dengue_k (25.68%) | Week (21.49%) | Population (20.67%) | Education (19.16%) | Dengue_k-1 (16.84%) | Year (16.06%) | Temperature_k-11 (12.99%) | Temperature_k-5 (12.11%) | Dengue_k-2 (11.66%) | Gini (11.63%) |
| 9-week-ahead | Dengue_k (24.11%) | Week (22.15%) | Population (21.56%) | Education (20.47%) | Year (17.70%) | Dengue_k-1 (17.44%) | Temperature_k-11 (12.94%) | Dengue_k-11 (12.05%) | Gini (11.89%) | Temperature_k-3 (11.15%) |
| 10-week-ahead | Dengue_k (23.42%) | Week (23.03%) | Year (21.45%) | Education (20.38%) | Population (19.80%) | Dengue_k-1 (17.22%) | Gini (14.88%) | Dengue_k-11 (13.02%) | Temperature_k-4 (12.95%) | Dengue_k-2 (10.60%) |
| 11-week-ahead | Year (22.94%) | Week (21.73%) | Dengue_k (21.37%) | Population (18.61%) | Education (17.20%) | Gini (16.98%) | Dengue_k-1 (16.56%) | Temperature_k-11 (15.48%) | Dengue_k-10 (13.47%) | Temperature_k-4 (11.80%) |
| 12-week-ahead | Population (26.76%) | Year (24.86%) | Dengue_k (22.50%) | Week (22.45%) | Education (17.12%) | Gini (17.72%) | Dengue_k-11 (16.71%) | Dengue_k-1 (16.67%) | Dengue_k-10 (14.06%) | Temperature_k-10 (13.07%) |

Dengue_k indicates the number of dengue cases in the “current” week k, with subscripts k-1 to k-11 denoting weekly lags; EVI denotes the enhanced vegetation index. %IncMSE: percentage increase in mean squared error.

Discussion

In the current study, we developed a national pooled model to predict counts of dengue cases across the different departments of Colombia and found that, for the majority of departments, the national model forecasted future dengue cases at the department level more accurately than the local model. This result indicates that the importance of dengue drivers is similar across different administrative regions of Colombia. Random forests is a supervised, tree-based ensemble regression approach requiring a relatively large training sample for the repeated splitting of the dataset into separate branches. An RF regression model cannot yield predictions beyond the range of the training data. Pooling data from individual departments creates a training dataset with larger ranges of the variables, increasing the extrapolation capacity of the RF model. Therefore, the national pooled model, trained on a larger dataset, had higher prediction accuracy than the local models. Both the national and the local models performed poorly in the departments of Guainía and Vichada; their small populations and consequently low counts of dengue cases resulted in relatively large errors in these two departments.

We also found that the meteorological and environmental variables were more important for prediction accuracy at shorter forecasting horizons, while the socio-demographic variables became more important at longer horizons. This is likely due to the influence of meteorological and environmental conditions on Aedes mosquitoes, whose lag effects for temperature and precipitation are usually between 1 and 4 weeks [63–65]. Poor-quality housing and sanitation management combined with high population density are key risk factors for dengue transmission [66, 67] and are closely related to education and poverty [68, 69]. These results demonstrate the complementary nature of these groups of predictor variables and the importance of including them in dengue forecasting models.

We compared our pooled national RF models to pooled national ANN models using the same predictor variables. In theory, deeper ANNs (i.e. with more hidden layers) can discern more complex relationships between predictor and target variables [70]. However, traditional ANNs cannot handle the vanishing gradient problem, which prevents accuracy from improving as more hidden layers are added. In the current study, we used the ReLU activation function to overcome the vanishing gradient issue, mitigated over-fitting by adding dropout to each hidden layer, and predicted dengue cases with a three-hidden-layer network. Compared with the ARIMA and local RF models, the ANN model developed on the nationally pooled data showed a stronger capability for forecasting dengue cases in Colombia across forecasting horizons, but it performed slightly worse than the national RF model in this case study. Determining an optimal ANN structure usually requires several iterative attempts. By contrast, RF has conventional settings for its hyperparameters (e.g. the number of features divided by three for the number of variables at each split, and five for the minimum size of terminal nodes), and these defaults have been found to be optimal in different studies [57].

Despite the strengths of our study, our RF approach is likely to generate time lags when forecasting rapid changes in dengue, a common issue with other forecasting approaches as well. Including a predictor of mosquito abundance from an entomological surveillance program may reduce such time-lag errors [71]; however, this type of data was not available at the national level with sufficient temporal and spatial granularity. Additionally, RF, as a non-parametric black-box approach, cannot provide explicit equations quantifying the relationships between the count of dengue cases and the heterogeneous predictor variables, although it can flexibly and accurately capture complex non-linear and non-additive relationships among the variables. A more serious limitation of the RF model is that it cannot predict values beyond the range of the target variable in the training dataset. If an unprecedented dengue outbreak occurred in the future, under-estimation would be inevitable with the RF approach. Modeling changes in the count of dengue cases, rather than the counts themselves, may reduce such under-estimation errors.

Forecasting is an important warning mechanism that can help with proactive planning and response for clinical and public health services. This study highlights the potential of RF for dengue forecasting and demonstrates the benefits of including socio-demographic predictors. We also found that a national pooled model, on average, performed better than the local models. These findings have important implications for dengue forecasting in public health, both in terms of time savings (pooled versus locally specific models) and in terms of the predictors and approaches that could improve forecast accuracy. Future studies should consider including other arboviruses, such as chikungunya and Zika, as predictors, and should examine the importance of other socio-economic factors. In addition, other promising ML methods should be tested, including recurrent neural networks, which inherently account for time and are able to capture complicated non-linear and non-additive relationships between predictor and target variables [72].

Supporting information

S1 Fig. Weekly total counts of confirmed dengue cases over Colombia for 2014–2018 and the predicted counts of dengue cases by the national three-, five-, six-, seven-, nine-, and eleven-week-ahead models for 2018.

(TIFF)

S2 Fig. Accuracy comparison between the local and the national random forests models at the department scale for each week-ahead prediction using the relative mean absolute error (RMAE).

(PDF)

Data Availability

The epidemiological data are freely available through www.ins.gov.co, the sociodemographic data are freely available through www.dane.gov.co, and the environmental data are freely available through lpdaac.usgs.gov (MODIS products) and www.cpc.ncep.noaa.gov (CMORPH product).

Funding Statement

This work was supported by seed grant funding provided by the Quebec Population Health Research Network to KZ and MMG, and by a grant from the Canadian Institutes of Health Research (428107) to KZ. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Lambrechts L, Scott TW, Gubler DJ. Consequences of the expanding global distribution of Aedes albopictus for dengue virus transmission. PLoS Neglected Tropical Diseases 2010; 4(5): e646. doi: 10.1371/journal.pntd.0000646
2. Bhatt S, Gething PW, Brady OJ, Messina JP, Farlow AW, Moyes CL, et al. The global distribution and burden of dengue. Nature 2013; 496:504–507. doi: 10.1038/nature12060
3. Morin CW, Comrie AC, Ernst K. Climate and dengue transmission: evidence and implications. Environmental Health Perspectives 2013; 121(11–12): 1264. doi: 10.1289/ehp.1306556
4. Shepard DS, Undurraga EA, Hallasa YA, Stanaway JD. The global economic burden of dengue: a systematic analysis. Lancet Infectious Diseases 2016; 16:935–941. doi: 10.1016/S1473-3099(16)00146-8
5. Soyiri IN, Reidpath DD. An overview of health forecasting. Environmental Health and Preventive Medicine 2013; 18(1):1–9. doi: 10.1007/s12199-012-0294-6
6. Racloz V, Ramsey R, Tong S, Hu W. Surveillance of dengue fever virus: A review of epidemiological models and early warning systems. PLoS Neglected Tropical Diseases 2012; 6(5):e1648. doi: 10.1371/journal.pntd.0001648
7. Gambhir S, Malik SK, Kumar Y. The diagnosis of dengue disease: An evaluation of three machine learning approaches. International Journal of Healthcare Information Systems and Informatics 2018; 13:1–19. doi: 10.4018/ijhisi.2018040101
8. Naish S, Dale P, Mackenzie JS, McBride J, Mengersen K, Tong S. Climate change and dengue: a critical and systematic review of quantitative modelling approaches. BMC Infectious Diseases 2014; 14:167. doi: 10.1186/1471-2334-14-167
9. Gharbi M, Quenel P, Gustave J, Cassadou S, Ruche GL, Girdary L, et al. Time series analysis of dengue incidence in Guadeloupe, French West Indies: Forecasting models using climate variables as predictors. BMC Infectious Diseases 2011; 11:166. doi: 10.1186/1471-2334-11-166
10. Hu W, Clements A, Williams G, Tong S. Dengue fever and El Niño/Southern Oscillation in Queensland, Australia: a time series predictive model. Occupational & Environmental Medicine 2010; 67:307–311.
11. Dom NC, Hassan AA, Latif ZA, Ismail R. Generating temporal model using climate variables for the prediction of dengue cases in Subang Jaya, Malaysia. Asian Pacific Journal of Tropical Disease 2013; 3:352–361.
12. Cortes F, Turchi Martelli CM, Arraes de Alencar Ximenes R, Montarroyos UR, Siqueira Junior JB, Gonçalves Cruz O, et al. Time series analysis of dengue surveillance data in two Brazilian cities. Acta Tropica 2018; 182:190–7. doi: 10.1016/j.actatropica.2018.03.006
13. Johansson MA, Reich NG, Hota A, Brownstein JS, Santillana M. Evaluating the performance of infectious disease forecasts: A comparison of climate-driven and seasonal dengue forecasts for Mexico. Scientific Reports 2016; 6:33707. doi: 10.1038/srep33707
14. Niu M, Wang Y, Sun S, Li Y. A novel hybrid decomposition-and-ensemble model based on CEEMD and GWO for short-term PM2.5 concentration forecasting. Atmospheric Environment 2016; 134:168–180.
15. Chen M-Y, Chen B-T. A hybrid fuzzy time series model based on granular computing for stock price forecasting. Information Sciences 2015; 294:227–241.
16. Wang P, Zhang H, Qin Z, Zhang G. A novel hybrid-Garch model based on ARIMA and SVM for PM2.5 concentrations forecasting. Atmospheric Pollution Research 2017; 8: 850–860.
17. Zhao N, Liu Y, Vanos JK, Cao G. Day-of-week and seasonal patterns of PM2.5 concentrations over the United States: Time-series analyses using the Prophet procedure. Atmospheric Environment 2018; 192:116–127.
18. Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical Science 2001; 16(3): 199–231.
19. Murphy KP. Machine Learning: a probabilistic perspective. MIT Press, 2012.
20. Guo P, Liu T, Zhang Q, Wang L, Xiao J, Zhang Q, et al. Developing a dengue forecast model using machine learning: A case study in China. PLoS Neglected Tropical Diseases 2017; 11:e0005973. doi: 10.1371/journal.pntd.0005973
21. Scavuzzo JM, Trucco F, Espinosa M, Tauro CB, Abril M, Scavuzzo CM, et al. Modeling dengue vector population using remotely sensed data and machine learning. Acta Tropica 2018; 185:167–175. doi: 10.1016/j.actatropica.2018.05.003
22. Althouse BM, Ng YY, Cummings DAT. Prediction of dengue incidence using search query surveillance. PLoS Neglected Tropical Diseases 2011; 5:e1258. doi: 10.1371/journal.pntd.0001258
23. Laureano-Rosario AE, Duncvan AP, Mendez-Lazaro PA, Garcia-Rejon JE, Gomez-Carro S, Farfan-Ale J, et al. Application of artificial neural networks for dengue fever outbreak predictions in the northwest coast of Yucatan, Mexico and San Juan, Puerto Rico. Tropical Medicine and Infectious Disease 2018; 3:5.
24. Raczko E, Zagajewski B. Comparison of support vector machine, random forest and neural network classifiers for tree species classification on airborne hyperspectral APEX images. European Journal of Remote Sensing 2017; 50:144–154.
25. Meyer H, Kulhnlein M, Appelhans T, Nauss T. Comparison of four machine learning algorithms for their applicability in satellite-based optical rainfall retrievals. Atmospheric Research 2016; 169:424–433.
26. Rodriguez-Galiano V, Sanchez-Castillo M, Chica-Olmo M, Chica-Rivas M. Machine learning predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees and support vector machines. Ore Geology Reviews 2015; 71:804–818.
27. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008; 9:319. doi: 10.1186/1471-2105-9-319
28. Nsoesie EO, Beckman R, Marathe M, Lewis B. Prediction of an epidemic curve: A supervised classification approach. Statistical Communications in Infectious Diseases 2011; 3(1):5. doi: 10.2202/1948-4690.1038
29. Vasquez P, Loria A, Sanchez F, Barboza LA. Climate-driven statistical models as effective predictors of local dengue incidence in Costa Rica: A generalized additive model and random forest approach. arXiv 2019; 1907.13095.
30. Olmoguez ILG, Catindig MAC, Amongos MFL, Lazan AF. Developing a dengue forecasting model: A case study in Iligan city. International Journal of Advanced Computer Science and Applications 2019; 10(9):281–286.
31. Carvajal TM, Viacrusis KM, Hernandez LFT, Ho HT, Amalin DM, Watanabe K. Machine learning methods reveal the temporal pattern of dengue incidence using meteorological factors in metropolitan Manila, Philippines. BMC Infectious Diseases 2018; 18:183. doi: 10.1186/s12879-018-3066-0
32. Rehman NA, Kalyanaraman S, Ahmad T, Pervaiz F, Saif U, Subramanian L. Fine-grained dengue forecasting using telephone triage services. Science Advances 2016; 2(7): e1501215. doi: 10.1126/sciadv.1501215
33. Freeze J, Erraguntla M, Verma A. Data integration and predictive analysis system for disease prophylaxis: Incorporating dengue fever forecasts. Proceedings of the 51st Hawaii International Conference on System Science 2018; 913–922.
34. Dinh L, Chowell G, Rothenberg R. Growth scaling for the early dynamics of HIV/AIDS epidemics in Brazil and the influence of socio-demographic factors. Journal of Theoretical Biology 2018; 442:79–86. doi: 10.1016/j.jtbi.2017.12.030
35. Chretien J-P, Riley S, George DB. Mathematical modeling of the West Africa Ebola epidemic. eLife 2015; 4:e09186. doi: 10.7554/eLife.09186
36. Cardona-Ospina JA, Villamil-Gómez WE, Jimenez-Canizales CE, Castañeda-Hernández DM, Rodríguez-Morales AJ. Estimating the burden of disease and the economic cost attributable to chikungunya, Colombia, 2014. Transactions of the Royal Society of Tropical Medicine and Hygiene 2015; 109(12):793–802. doi: 10.1093/trstmh/trv094
37. Villar LA, Rojas DP, Besada-Lombana S, Sarti E. Epidemiological trends of dengue disease in Colombia (2000–2011): a systematic review. PLoS Neglected Tropical Diseases 2015; 9(3): e0003499. doi: 10.1371/journal.pntd.0003499
38. Ospina Martinez ML, Martinez Duran ME, Pacheco García OE, Bonilla HQ, Pérez NT. Protocolo de vigilancia en salud pública enfermedad por virus Zika. PRO-R02.056. Bogota (Colombia): Instituto Nacional de Salud, 2017. Available from: http://bvs.minsa.gob.pe/local/MINSA/3449.pdf (last accessed December 16, 2019).
39. Beketov MA, Yurchenko YA, Belevich OE, Liess M. What environmental factors are important determinants of structure, species richness, and abundance of mosquito assemblages? Journal of Medical Entomology 2010; 47:129–139. doi: 10.1603/me09150
40. Joyce RJ. CMORPH: A method that produces global precipitation estimates from passive microwave and infrared data at high spatial and temporal resolution. Journal of Hydrometeorology 2004; 5:487–503.
41. Koyadun S, Butraporn P, Kittayapong P. Ecologic and sociodemographic risk determinants for dengue transmission in urban areas in Thailand. Interdisciplinary Perspectives on Infectious Diseases 2012; 2012:907494. doi: 10.1155/2012/907494
42. Reiter P. Climate change and mosquito-borne disease. Environmental Health Perspectives 2001; 109(supplement 1):141–161. doi: 10.1289/ehp.01109s1141
43. Soghaier MA, Himatt S, Osman KE, Okoued SI, Seidahmed OE, Beatty ME, et al. Cross-sectional community-based study of the socio-demographic factors associated with the prevalence of dengue in the eastern part of Sudan in 2011. BMC Public Health 2015; 15:558. doi: 10.1186/s12889-015-1913-0
44. Kannan Maharajan M, Rajiah K, Singco Belotindos JA, Bases MS. Social determinants predicting the knowledge, attitudes, and practices of women toward Zika virus infection. Frontiers in Public Health 2020; 8:170. doi: 10.3389/fpubh.2020.00170
45. Couse Quinn S, Kumar S. Health inequalities and infectious disease epidemics: A challenge for global health security. Biosecurity and Bioterrorism: Biodefense Strategy, Practice, and Science 2014; 12(5):263–273.
46. Breiman L. Random forests. Machine Learning 2001; 45(1):5–32.
47. Hulme M, New M. Dependence of large-scale precipitation climatologies on temporal and spatial sampling. Journal of Climate 1997; 10:1099–1113.
48. Papacharalampous GA, Tyralis H. Evaluation of random forests and prophet for daily streamflow forecasting. Advances in Geosciences 2018; 45:201–208.
49. Lu L, Lin H, Tian L, Yang W, Sun J, Liu Q. Time series analysis of dengue fever and weather in Guangzhou, China. BMC Public Health 2009; 9:395. doi: 10.1186/1471-2458-9-395
50. Chen S-C, Liao C-M, Chio C-P, Chou H-H, You S-H, Cheng Y-H. Lagged temperature effect with mosquito transmission potential explains dengue variability in southern Taiwan: Insights from a statistical analysis. Science of The Total Environment 2010; 408(19):469–4075.
51. Cheong YL, Burkart K, Leitao PJ, Lakes T. Assessing weather effects on dengue disease in Malaysia. International Journal of Environmental Research and Public Health 2013; 10(12):6319–6334. doi: 10.3390/ijerph10126319
52. Chang K, Chen C-D, Shih C-M, Lee T-C, Wu M-T, Wu D-C, et al. Time-lagging interplay effect and excess risk of meteorological/mosquito parameters and petrochemical gas explosion on dengue incidence. Scientific Reports 2016; 6:35028. doi: 10.1038/srep35028
53. Chen Y, Ong JHY, Rajarethinam J, Yap G, Ng LC, Cook AR. Neighbourhood level real-time forecasting of dengue cases in tropical urban Singapore. BMC Medicine 2018; 16(1):129. doi: 10.1186/s12916-018-1108-5
54. Eastin MD, Delmelle E, Casas I, Wexler J, Self C. Intra- and interseasonal autoregressive prediction of dengue outbreaks using local weather and regional climate for a tropical environment in Colombia. The American Journal of Tropical Medicine and Hygiene 2014; 91(3):598–610. doi: 10.4269/ajtmh.13-0303
55. Bostan N, Javed S, Amen N, Eqani SAMAS, Tahir F, Bokhari H. Dengue fever virus in Pakistan: effects of seasonal pattern and temperature change on distribution of vector and virus. Reviews in Medical Virology 2017; 27(1):e1899.
56. Oidtman RJ, Lai S, Huang Z, Yang J, Siraj AS, Reiner RC, et al. Inter-annual variation in seasonal dengue epidemics driven by multiple interacting factors in Guangzhou, China. Nature Communications 2019; 10:1148. doi: 10.1038/s41467-019-09035-x
57. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer, Berlin, 2008.
58. Liaw A, Wiener M. Breiman and Cutler's random forests for classification and regression. 2018. Available from: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf (last accessed May 7, 2020).
59. Peng Z, Letu H, Wang T, Shi C, Zhao C, Tana G, Zhao N, Dai T, Tang R, Shang H, Shi J, Chen L. Estimation of shortwave solar radiation using the artificial neural network from Himawari-8 satellite imagery over China. Journal of Quantitative Spectroscopy and Radiative Transfer 2020; 240: 106672.
60. Hyndman RJ, Khandakar Y. Automatic time series forecasting: The forecast package for R. Journal of Statistical Software 2008; 27: 1–22.
61. Reich NG, Lessler J, Sakrejda K, Lauer SA, Iamsirithaworn S, Cummings DAT. Case study in evaluating time series prediction models using the relative mean absolute error. The American Statistician 2016; 70: 285–292. doi: 10.1080/00031305.2016.1148631
62. Liu Y, Cao G, Zhao N, Mulligan K, Ye X. Improve ground-level PM2.5 concentration mapping using a random forests-based geostatistical approach. Environmental Pollution 2018; 235: 272–282. doi: 10.1016/j.envpol.2017.12.070
63. Grziwotz F, Strauß JF, Hsieh C-h, Telschow A. Empirical dynamic modelling identifies different responses of Aedes polynesiensis subpopulations to natural environmental variables. Scientific Reports 2018; 8: 16768. doi: 10.1038/s41598-018-34972-w
64. da Cruz Ferreira DA, Degener CM, de Almeida Marques-Toledo C, Bendati MM, Fetzer LO, Teixeira CP, Eiras AE. Meteorological variables and mosquito monitoring are good predictors for infestation trends of Aedes aegypti, the vector of dengue, chikungunya and Zika. Parasites & Vectors 2017; 10: 78. doi: 10.1186/s13071-017-2025-8
65. Manica M, Filipponi F, D’Alessandro A, Screti A, Neteler M, Rosà R, et al. Spatial and Temporal Hot Spots of Aedes albopictus Abundance inside and outside a South European Metropolitan Area. PLoS Neglected Tropical Diseases 2016; 10(6): e0004758. doi: 10.1371/journal.pntd.0004758
66. Mulligan K, Dixon J, Sinn C-L J, Elliott SJ. Is dengue a disease of poverty? A systematic review. Pathogens and Global Health 2015; 109(1): 10–18. doi: 10.1179/2047773214Y.0000000168
67. Tapia-Conyer R, Méndez-Galván JF, Gallardo-Rincón H. The growing burden of dengue in Latin America. Journal of Clinical Virology 2009; 46: S3–S6. doi: 10.1016/S1386-6532(09)70286-0
68. Adams EA, Boateng GO, Amoyaw JA. Socioeconomic and demographic predictors of potable water and sanitation access in Ghana. Social Indicators Research 2016; 126(2): 673–687.
69. de Janvry A, Sadoulet E. Growth, poverty, and inequality in Latin America: A causal analysis, 1970–94. The Review of Income and Wealth 2000; 46(3): 267–287.
70. Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. Journal of Big Data 2015; 2:1.
71. Ong J, Liu X, Rajarethinam J, Kok SY, Liang S, Tang CS, et al. Mapping dengue risk in Singapore using random forest. PLoS Neglected Tropical Diseases 2018; 12(6):e0006587. doi: 10.1371/journal.pntd.0006587
72. Williams RJ, Zipser D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1989; 1(2):270–280.
PLoS Negl Trop Dis. doi: 10.1371/journal.pntd.0008056.r001

Decision Letter 0

Marc Choisy, Robert C Reiner

3 Apr 2020

Dear Dr Zinszer,

Thank you very much for submitting your manuscript "Machine learning and dengue forecasting: Comparing random forests and artificial neural networks for predicting dengue burdens at the national sub-national scale in Colombia" for consideration at PLOS Neglected Tropical Diseases. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Marc Choisy

Guest Editor

PLOS Neglected Tropical Diseases

Robert Reiner

Deputy Editor

PLOS Neglected Tropical Diseases

***********************

Reviewer's Responses to Questions

Key Review Criteria Required for Acceptance?

As you describe the new analyses required for acceptance, please consider the following:

Methods

-Are the objectives of the study clearly articulated with a clear testable hypothesis stated?

-Is the study design appropriate to address the stated objectives?

-Is the population clearly described and appropriate for the hypothesis being tested?

-Is the sample size sufficient to ensure adequate power to address the hypothesis being tested?

-Were correct statistical analysis used to support conclusions?

-Are there concerns about ethical or regulatory requirements being met?

Reviewer #1: The clearly articulated objective of the study is "to evaluate the potential of using random forest forecasting models at the department and national levels in Colombia."

The study design needs an appropriate baseline model to truly address the stated objective. In the introduction, the authors state that "conventional time-series models such as ARIMA are insufficient to meet complex forecasting requirements", however that is only known because they are compared to baseline models that are commonly used by public officials. In a way, the authors use the Local Model as the baseline, but we have no reason to trust this as a baseline model because we would like to evaluate this model as well. For predictions of Dengue_{k+1}, a good naive baseline model would be Dengue_k because public health officials often react to the most recent data point. For longer term predictions, it would be good to show how the RFs perform relative to ARIMA models. A paper that describes selecting a baseline model for dengue prediction in detail is Reich et al., 2015, Case Study in Evaluating Time Series Prediction Models Using the Relative Mean Absolute Error.

An appropriate baseline model would allow the authors to calculate the relative mean absolute error which is a better measure of model performance than absolute measures such as MAE and RMSE.

Are there time lags in the dengue case data? For instance, dengue cases in Thailand had a substantial reporting lag (Reich et al., 2016. Challenges in real-time prediction of infectious disease: A case study of dengue in Thailand) and influenza cases in the US have about a 4-week reporting lag. If so, how does that affect the potential of using the RF forecasting model in real time?

I assume that the dengue cases are the sum of probable and confirmed dengue cases from SIVIGILA. Do you know how many cases were probable and confirmed for each week? I'd like to see a supplemental figure of both plus dengue hemorrhagic fever (or severe dengue), if possible.

Reviewer #2: The methods descriptions are sound and generally clear, but RF and ANN have plenty of details, and it would be nice to have more comprehensive algorithm descriptions. Particularly important are the issues of (a) how hyperparameter tuning was performed, and (b) whether regularization was used.

Reviewer #3: The present study combines climate and socio-economic data as well as past dengue case count data to predict future dengue cases in Colombia. The objective of training on subnational data either separately or in combination for prediction is clearly stated. The choice of the random forest algorithm is well motivated to avoid overfitting.

--------------------

Results

-Does the analysis presented match the analysis plan?

-Are the results clearly and completely presented?

-Are the figures (Tables, Images) of sufficient quality for clarity?

Reviewer #1: The analysis presented matches the analysis plan and results are presented clearly.

Figure 1 is okay, but if a baseline model is chosen then I would prefer to see a relative MAE figure in place of Figure 2. RMSE or MAE are absolute measures that might not be comparable across locations. If one location is more variable than another, that figure may merely be displaying the difference in variability. A figure of relative MAE would show the forecasting skill by location. The sentence in the Results describing Figure 2 (on lines 245-246) is accurate; however, the assertion in the discussion (lines 309-310) may change if these locations are inherently more difficult to predict for a baseline model as well.

Is there a way to get the %IncMSEs into Table 3? The rank order is nice, but it doesn't tell us how much stronger the first variable is than the second, third, or tenth. If the lower impact variables have very low values, they could be left off in order to free up space for some numbers.

Reviewer #2: The results match the descriptions, and are clearly presented. However, there needs to be evaluation on more data, e.g., by using leave-one-season-out cross-validation. Some additional points of investigation noted in the comments might expand the contribution of this work and reinforce the points it makes.

Reviewer #3: The results are clearly presented, emphasizing that training on combined subnational data is superior to training on each department separately for achieving high accuracy. The way the predictive power of different features depends on the chosen time horizon is interesting, e.g., socio-economic factors seem to be more important for long-range predictions.

Figure 1 is hardly legible, please increase the resolution.

--------------------

Conclusions

-Are the conclusions supported by the data presented?

-Are the limitations of analysis clearly described?

-Do the authors discuss how these data can be helpful to advance our understanding of the topic under study?

-Is public health relevance addressed?

Reviewer #1: The conclusions made by the authors are correct in stating that the national model performed better than the local model. However, this analysis doesn't tell us whether RFs are better than ARIMA or a naive baseline.

The authors discuss several legitimate limitations of the study. Another limitation is that they only forecasted 2018 and that there is considerable variation between years: 2014-2016 each appear to have had higher incidence, while 2017 had less. Whether the results are unique to 2018 or generalizable to other years remains to be seen. Also, the model doesn't account for the changes in population susceptibility due to the complex immunological dynamics of dengue (long-term immunity to infecting serotypes and short-term immunity to non-infecting serotypes).

The public health relevance is not addressed.

Reviewer #2: Some statements seem to suggest causal interpretations that are not warranted. Again, as noted above, there probably needs to be evaluation on more data; I expect the current metrics on the current data to be fairly noisy.

Reviewer #3: Limitations such as the lack of entomological data are addressed. It would be helpful to explain why existing Aedes data were not used; was the required spatial resolution not available?

The public health implications are not sufficiently discussed.

--------------------

Editorial and Data Presentation Modifications?

Use this section for editorial suggestions as well as relatively minor modifications of existing data that would enhance clarity. If the only modifications needed are minor and/or editorial, you may wish to recommend “Minor Revision” or “Accept”.

Reviewer #1: (No Response)

Reviewer #2: (No Response)

Reviewer #3: Here some methodological suggestions for minor revision:

Bagging for decision tree learning hinges on the assumption of independence between observations. The dependence structures between regions for a particular feature, as well as between different features, could be addressed by performing proper outer cross-validation for the random forest.

It might be helpful to compare random forest algorithms with more similar methods such as gradient boosting (e.g., xgboost), where carefully tuned models can potentially yield predictors with lower variance. I am not sure whether neural networks are best suited for methodological comparison; given the limited number of features, their inherent overfitting and lack of accuracy are almost expected.

The authors show validation on the most recent part of the dengue case count data. To confirm accuracy and robustness, it could be helpful to validate the algorithm also on other parts of the time series (e.g., years 2015-2016).

--------------------

Summary and General Comments

Use this section to provide overall comments, discuss strengths/weaknesses of the study, novelty, significance, general execution and scholarship. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. If requesting major revision, please articulate the new experiments that are needed.

Reviewer #1: This is a quality paper that needs to make one simple but substantial change. The authors have collected a large amount of data across several sources and fit random forest and artificial neural networks to predict dengue incidence in Colombia. Showing that random forests make accurate forecasts of dengue incidence would be a significant addition to the growing literature of using machine learning techniques for disease prediction.

However, the authors only compare these methods to each other and not to a simple baseline method, such as an AR-1 or an ARIMA model, which are likely to be the existing methods of choice by local public health officials. Showing how much machine learning techniques improve over the traditional standards would make for a more compelling argument.

Making the code and data (which are freely available) easily accessible (e.g. in a GitHub repo) would be appreciated.

Reviewer #2: This work compares the predictive performance of models predicting dengue incidence in a given department using (a) a stratified RF model using training instances from the same department only, (b) an RF model using training instances from all departments, and (c) a one-hidden-layer ANN approach, using several types of covariates; it also describes methods for preparing national-level predictions. It demonstrates that, in certain settings and for certain models, prediction accuracy can be improved by the training instance pooling in (b), using the RF-based models presented rather than the ANN approach presented, and that incorporating socioeconomic factors can improve predictive performance.

Major issues:

- Evaluation takes place on a single season. Even with 30 departments, this is probably not reliable. The authors describe how the different seasons differ substantially from one another; it is quite reasonable to expect that the relative performance of the methods could differ as well. Suggest leave-one-season-out cross-validation or a similar technique (see the sketch after this list).

- The performance of the methods should be compared to existing ARIMA approaches or very simple baselines, such as last-observation-carried-forward.
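
To make the two points above concrete, the following is a minimal sketch of leave-one-season-out evaluation of a random forest against a last-observation-carried-forward (LOCF) baseline. It is illustrative only, not the authors' code; the data frame, column names, and horizon are hypothetical, and in practice the lagged case counts used as predictors would need to respect the same forecast horizon.

# Illustrative sketch only (hypothetical names): leave-one-season-out evaluation
# of a random forest against a last-observation-carried-forward (LOCF) baseline.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
def leave_one_season_out(df, feature_cols, target_col="cases", season_col="year", horizon=1):
    """Hold out one season (year) at a time, train on the rest, and report MAE and relative MAE."""
    rows = []
    for season in sorted(df[season_col].unique()):
        train, test = df[df[season_col] != season], df[df[season_col] == season]
        rf = RandomForestRegressor(n_estimators=500, random_state=0)
        rf.fit(train[feature_cols], train[target_col])
        rf_pred = rf.predict(test[feature_cols])
        # LOCF baseline: carry the latest observed count forward `horizon` weeks
        # (the first `horizon` weeks are back-filled, a simplification for this sketch).
        naive_pred = test[target_col].shift(horizon).bfill()
        mae_rf = mean_absolute_error(test[target_col], rf_pred)
        mae_naive = mean_absolute_error(test[target_col], naive_pred)
        rows.append({"season": season, "mae_rf": mae_rf,
                     "mae_naive": mae_naive, "rel_mae": mae_rf / mae_naive})
    return pd.DataFrame(rows)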

Comments (mixing technical suggestions that may not be necessary but would probably strengthen the work's contribution, alongside minor typesetting notes and more important clarity issues):

- Why is "ANN" a "comparator"? Is it viewed as being worse a priori in this context? Is there literature to back this up? If so, does the architecture of the ANN and the amount of data in this context resemble those based on this literature? Is this applying some pre-existing approach, or not?

- In abstract, several times throughout paper, referring to a "national" model seems a bit confusing or not very descriptive. Suggest finding an alternative description.

- "Furthermore, sociodemographic predictors are important to include to capture longer-term trends in dengue." --- This wording suggests almost a causal interpretation, but the performance analysis is not done using causal inference techniques; it would help to clarify here and any other places that suggest a causal conclusion.

- Line 91: missing "and"

- Line 95: suggests comparison to ARIMA

- Line 106: parametric assumptions may be difficult to test, but I would expect that the untestable assumptions are shared by the nonparametric methods

- Line 107--108: match singular/plural

- Line 109: wording may make it sound like RF is not an ML method

- Line 113: missing comma

- Line 113--115: suggests analyzing importance of all types of variables --- performance impact of covariates / types of covariates in addition to importance rank of individual covariates

- Line 119: Lowe et al. spatiotemporal dengue models in Brazil may qualify, but use a quite different approach

- Line 136: What is the data reporting lag for probable cases? For confirmed cases? Are data revised over time? (At the end of the time series plot, is the drop due to incomplete counts that are yet to be revised?) These are important questions and the answers will almost surely indicate that performance metrics for given forecast horizons will need to be approached with some care.

- Line 140: -> "within 14 days"?

- Line 190: Nonparametric models that are not properly regularized will overfit; probably even the very simplest parametric models as well (for the stratified RF model) given the number of covariates.

- Line 213: is there a way for RF to quantify uncertainty, and to evaluate these uncertainty estimates?

- Line 222: What about other hyperparameters, in particular: regularization parameter or number of iterations & learning rate, weight initialization, etc.? in [55], several aspects of the model are adjusted; does this work mirror that one, tune only the # of neurons, or something else? What data is used to evaluate different configurations during the tuning process?

- Line 231: is this performed using an ablation approach or a Shapley value approach? I would expect Shapley value approaches to give a better idea of contributions when certain covariates are redundant or nearly redundant with some other(s) but, taken together with those redundant variables, are large contributors (see the sketch after this comment list).

- Line 236: emphasizes why a leave-one-season-out or similar type of evaluation is important.

- Line 253: is there an explanation for the national/pooled RF model's better performance? Is this truly "successful transfer" or is it preventing overfitting?

- Line 254: can MAE and RMSE be put on a more interpretable scale? (Like the relative scale used for feature importance. But some similar options should be avoided if the scaling would directly emphasize instances where one particular method did well or poorly, e.g., by scaling individual errors by the error of one particular system.)

- Line 329--330: RF can also overfit, though.

- Table 3: suggests quantifying the contribution of the AR, Year&Week, weather&vegetation, and sociodemographic categories in terms of error. (But again, care must be taken not to hint without discussion any causal conclusions; top contributors may be confounded by any number of unobserved/unincorporated quantities.)

- Line 322: -> "humans"

- Line 324: shallow ANNs don't have this issue, and there are strategies to avoid these issues for deeper networks

- Line 331: how are the hyperparameters set for RF? If they are tuned based on performance metrics, what training and test data is used to provide these evaluations during the tuning process? What is the exact algorithm for the RF models? Is it really Breiman's original RF or some variant?

- Line 333: this doesn't actually seem like a limitation; probably any model is going to end up with a similar phenomenon. This does heavily suggest comparisons against last-observation-carried-forward, 1-lag or limited-lag linear autoregression, and linear autoregression augmented with week-of-season and year indicator covariates

- Line 339: this seems to be more a problem with the dimensionality than with RF; we can plot the point predictions against one or two variables at a time, but the issue is that it is a slice/aggregation over values for the other dimensions

- Line 344: this suggests modeling changes in counts or relative changes in counts rather than the counts themselves as a target, perhaps using lagged versions as covariates.
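
Relating to the Line 231 comment above on ablation versus Shapley approaches: the minimal sketch below (illustrative only; the fitted model rf and the held-out X_test, y_test are hypothetical) computes a permutation importance similar in spirit to randomForest's %IncMSE, whereas a Shapley-based attribution would instead distribute credit among correlated features.

# Illustrative sketch only (hypothetical names): permutation importance for a fitted
# random forest, analogous in spirit to randomForest's %IncMSE.
from sklearn.inspection import permutation_importance
result = permutation_importance(
    rf, X_test, y_test,                # rf, X_test, y_test are assumed to already exist
    scoring="neg_mean_squared_error",  # importance = increase in MSE after permuting a feature
    n_repeats=30,
    random_state=0,
)
ranked = sorted(zip(X_test.columns, result.importances_mean), key=lambda kv: kv[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")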

Reviewer #3: The present study suggests the use of a broad range of environmental, climate, and socio-economic data at a subnational level to accurately predict dengue cases in Colombia. While it uses standard methods (random forests), it would be helpful to address the dependence structure in the data more carefully.

From a practical point of view, the absence of mosquito prevalence data for such predictive tasks and the lack of public health relevance should be addressed more explicitly.

--------------------

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Stephen A Lauer

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see https://journals.plos.org/plosntds/s/submission-guidelines#loc-methods

PLoS Negl Trop Dis. doi: 10.1371/journal.pntd.0008056.r003

Decision Letter 1

Marc Choisy, Robert C Reiner

10 Jul 2020

Dear Prof Zinszer,

Thank you very much for submitting your manuscript "Machine learning and dengue forecasting: Comparing random forests and artificial neural networks for predicting dengue burdens at the national and sub-national scales in Colombia" for consideration at PLOS Neglected Tropical Diseases. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

The two reviewers and I found the revised version greatly improved, with all major comments from reviewers 1 and 2 satisfactorily addressed. Reviewer 4 is a new reviewer and is asking for more discussion of the pooled vs individual forecast, which I agree with. Please also address all the minor comments made by both reviewers.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.  

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. 

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Marc Choisy

Guest Editor

PLOS Neglected Tropical Diseases

Robert Reiner

Deputy Editor

PLOS Neglected Tropical Diseases

***********************

The authors did a great job of addressing the reviewers' comments. Reviewer 4 is a new reviewer and is asking for more discussion of the pooled vs individual forecast, which I agree with. Please also address all the minor comments made by both reviewers.

Reviewer's Responses to Questions

Key Review Criteria Required for Acceptance?

As you describe the new analyses required for acceptance, please consider the following:

Methods

-Are the objectives of the study clearly articulated with a clear testable hypothesis stated?

-Is the study design appropriate to address the stated objectives?

-Is the population clearly described and appropriate for the hypothesis being tested?

-Is the sample size sufficient to ensure adequate power to address the hypothesis being tested?

-Were correct statistical analysis used to support conclusions?

-Are there concerns about ethical or regulatory requirements being met?

Reviewer #2: Methods are now more precisely specified.

Reviewer #4: Please see review.

--------------------

Results

-Does the analysis presented match the analysis plan?

-Are the results clearly and completely presented?

-Are the figures (Tables, Images) of sufficient quality for clarity?

Reviewer #2: Yes, and results are now more complete.

Reviewer #4: Please see review.

--------------------

Conclusions

-Are the conclusions supported by the data presented?

-Are the limitations of analysis clearly described?

-Do the authors discuss how these data can be helpful to advance our understanding of the topic under study?

-Is public health relevance addressed?

Reviewer #2: Yes.

Reviewer #4: Please see review.

--------------------

Editorial and Data Presentation Modifications?

Use this section for editorial suggestions as well as relatively minor modifications of existing data that would enhance clarity. If the only modifications needed are minor and/or editorial, you may wish to recommend “Minor Revision” or “Accept”.

Reviewer #2: (No Response)

Reviewer #4: (No Response)

--------------------

Summary and General Comments

Use this section to provide overall comments, discuss strengths/weaknesses of the study, novelty, significance, general execution and scholarship. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. If requesting major revision, please articulate the new experiments that are needed.

Reviewer #2: The authors have made a thorough revision, addressing my primary concern regarding reliability of the evaluation metrics by adding a cross validation analysis. They have also better contextualized the results by providing an ARIMA baseline for comparison, describing the methods in more detail, and using a more featureful ANN model. My only remaining concern is that some of the text in the response regarding potential revisions to data should be included in the text.

Line 127 -- Line 133: What type of cases are being predicted? Suspected, confirmed, or the sum of the two?

Line 128: "confirmation" -> "confirmed cases".

Line 132: Some readers may appreciate the additional information included in the authors' responses regarding the timeliness and nature of revisions, and the reasoning for why it may not impact predictive performance that much. It seems more common that count surveillance systems will undergo large, biased revisions, potentially over longer time periods.

Line 171: it may help to reference the discussion on line 132 here as well, or to rearrange content so that these sections are even closer to each other.

Line 173: "Exponential smoothing approach..." is a sentence fragment. Please specify the decay factor.

Line 222 and elsewhere: "ARMIA" -> "ARIMA"

Line 338: (Again, I wouldn't see this as a limitation per se; it is likely a feature of many of the best existing predictors for various epidemiological forecasting tasks.)


Line 173: "Exponential [...] data to estimate" -> "An exponential [...] data was used to estimate"

Line 176: "local level" -> "the local level"

Line 198: "forest that is" -> "forecast, which is"

Line 204: "hidden layer" -> "hidden layers"

Line 244: "therefore not suitable to conduct" -> "therefore was not suitable for conducting"


Reviewer #4: The manuscript was well written and clear for the most part. The only pieces missing are clarification on a few points made.

--------------------

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #4: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/plosntds/s/submission-guidelines#loc-materials-and-methods

PLoS Negl Trop Dis. doi: 10.1371/journal.pntd.0008056.r005

Decision Letter 2

Marc Choisy, Robert C Reiner

12 Aug 2020

Dear Dr Zinszer,

We are pleased to inform you that your manuscript 'Machine learning and dengue forecasting: Comparing random forests and artificial neural networks for predicting dengue burdens at the national and sub-national scales in Colombia' has been provisionally accepted for publication in PLOS Neglected Tropical Diseases.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Neglected Tropical Diseases.

Best regards,

Marc Choisy

Guest Editor

PLOS Neglected Tropical Diseases

Robert Reiner

Deputy Editor

PLOS Neglected Tropical Diseases

***********************************************************

PLoS Negl Trop Dis. doi: 10.1371/journal.pntd.0008056.r006

Acceptance letter

Marc Choisy, Robert C Reiner

17 Sep 2020

Dear Dr Zinszer,

We are delighted to inform you that your manuscript, "Machine learning and dengue forecasting: Comparing random forests and artificial neural networks for predicting dengue burdens at the national and sub-national scales in Colombia," has been formally accepted for publication in PLOS Neglected Tropical Diseases.

We have now passed your article onto the PLOS Production Department who will complete the rest of the publication process. All authors will receive a confirmation email upon publication.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any scientific or type-setting errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Note: Proofs for Front Matter articles (Editorial, Viewpoint, Symposium, Review, etc...) are generated on a different schedule and may not be made available as quickly.

Soon after your final files are uploaded, the early version of your manuscript will be published online unless you opted out of this process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Neglected Tropical Diseases.

Best regards,

Shaden Kamhawi

co-Editor-in-Chief

PLOS Neglected Tropical Diseases

Paul Brindley

co-Editor-in-Chief

PLOS Neglected Tropical Diseases

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Weekly total counts of confirmed dengue cases over Colombia for 2014–2018 and the predicted counts of dengue cases by the national three-, five-, six-, seven-, nine-, and eleven-week-ahead models for 2018.

    (TIFF)

    S2 Fig. Accuracy comparison between the local and national random forest models at the department scale for each week-ahead prediction, using the relative mean absolute error (RMAE).

    (PDF)

    Attachment

    Submitted filename: response_letter_final.docx

    Attachment

    Submitted filename: response_letter_Aug8.docx

    Data Availability Statement

    The epidemiological data are freely available through www.ins.gov.co, the sociodemographic data are freely available through www.dane.gov.co, and the environmental data are freely available through lpdaac.usgs.gov (MODIS products) and www.cpc.ncep.noaa.gov (CMORPH product).

