AMIA Summits on Translational Science Proceedings. 2019 May 6;2019:680–685.

Integrating Multiple Data Sources and Learning Models to Predict Infectious Diseases in China

Wenxiao Jia 1, Yi Wan 1, Yanpu Li 1, Kewei Tan 1, Wenqing Lei 1, Yiying Hu 1, Zhao Ma 1, Xiang Li 1, Guotong Xie 1
PMCID: PMC6568090  PMID: 31259024

Abstract

Outbreaks of infectious diseases not only endanger people’s lives and property but can also cause negative social impact and economic loss. Therefore, establishing early warning technologies for infectious diseases is of great value. This paper builds on historical morbidity and mortality incidence data from 2012 to 2016 in China for 10 infectious diseases: typhoid fever, hemorrhagic fever with renal syndrome (HFRS), mumps, scarlatina, malaria, dysentery, pertussis, conjunctivitis, pulmonary tuberculosis, and diarrhea. We also integrated search engine query data and seasonal information into the prediction models. Multiple prediction models, including a linear model, a time series analysis model, a boosting tree model and a deep learning model (recurrent neural network, RNN), were constructed to predict the morbidity incidence of the 10 infectious diseases. The RNN model showed better predictive capability for most of these diseases. Improved techniques for infectious disease prediction can facilitate constructive and positive change towards disease prevention.

Introduction

The high prevalence and rapid variation of infectious diseases such as tuberculosis, diarrhea and malaria have massive impacts on public health globally. Among communicable diseases, lower respiratory infections caused 3.0 million deaths worldwide in 2016. The number of deaths from diarrhoeal diseases decreased by almost 1 million between 2000 and 2016, but these diseases still caused 1.4 million deaths in 2016. In the same year, there were an estimated 216 million cases of malaria in 91 countries, an increase of 5 million cases over 2015 [1]. Because China plays an important role in the international public health community and infectious diseases are one of the leading causes of illness in China, the establishment and development of early warning techniques for infectious diseases is of enormous significance. To establish these techniques, understanding how infectious diseases progress is crucial to guiding government decisions and strategies [2]. It is a first step towards implementing effective interventions to control infectious diseases and reduce the resulting mortality and morbidity in human populations.

Traditionally, outbreak analyses and predictive models have been used in public health responses to infectious diseases. These analyses have relied on linear models and time series analysis, both of which are widely accepted. Tang et al. proposed new Bayesian hierarchical generalized linear models (GLMs), called group spike-and-slab lasso GLMs, for predicting disease outcomes and detecting associated genes by incorporating large-scale molecular data and group structures [3]. Soebiyanto et al. used the autoregressive integrated moving average (ARIMA) model to analyze the role of climatic factors in the epidemiology of influenza transmission in two regions characterized by warm climate. They demonstrated that including the climatic variables as input series results in models with better performance than the univariate model [4]. While these models are valuable, their predictive ability is limited by their reliance on prior knowledge of parameters or by their inherent time lag. Furthermore, these models do not account for additional factors that influence the occurrence and development of infectious diseases.

In an effort to leverage additional data sources, Ginsberg et al. presented a method of analyzing large numbers of Google search queries to track influenza-like illness in a population. Their method can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day [5]. Smith et al. monitored influenza awareness through Twitter [6]. However, these methods employed only linear models, which are too simplistic to account for diseases influenced by data outside of search engine queries.

Machine learning methods enable us to build more precise predictive models that can comprehensively reflect the internal laws of the broader systems within which these diseases exist. Kane et al. applied random forest time series models to incidence data of outbreaks of highly pathogenic avian influenza (H5N1) in Egypt and found that the random forest model outperformed the ARIMA model in predictive ability [7]. Although the predictions were based on time series information, the model did not treat association in time as a factor.

To overcome the limitations of previous disease prediction models, we integrated additional data sources and used a deep learning method. Our models included search engine query data, monthly morbidity incidence, monthly mortality incidence and seasonal information. We applied the RNN method so that the model captures the time series information automatically, which increased the predictive power.

Method

Figure 1 shows our pipeline for building infectious disease prediction models for China from 2012 to 2016. We first collected the features of interest as model inputs. We then trained prediction models using time series and non-time series algorithms, varying the data sources used as model input across many experiments. For the non-time series models (linear model and boosting tree model), we selected the mortality incidence, search engine query data, morbidity incidence and seasonal information as model input. For the ARIMA model, we used only historical morbidity incidence as model input. For the RNN model, the morbidity incidence and seasonal information were selected as model input. Finally, we evaluated the prediction performance of the different models.

Figure 1. Pipeline of prediction models for infectious disease in China

Dataset and Feature Construction

This section presents the data preparation and feature construction schema that we followed.

The China Public Health Science Data Center (CPHSDC) is a national-level, open-source data-sharing platform managed and maintained by the Chinese Center for Disease Control (CDC). The CDC provides a database of infectious diseases. From this database, we collected the morbidity and mortality rates of ten infectious diseases: typhoid fever, hemorrhagic fever, mumps, scarlatina, malaria, dysentery, pertussis, conjunctivitis, pulmonary tuberculosis, and diarrhea [7]. We initially considered 31 diseases in 31 provinces in China from 2012 to 2016. After considering the characteristics of the incubation period, as well as the route and mode of transmission, we selected 10 infectious diseases with obvious predictive significance and value for modeling. We also included the Baidu indices of the different diseases as features in our prediction models. The Baidu index, developed by Baidu, represents the search volume and trends of keywords on the Baidu search engine.

In addition to the existing features, several more features were created to represent the underlying data. For instance, to account for the distributed lag effects of morbidity incidence, new variables such as the maximum, minimum, and average morbidity incidence over the past 3 months were included in our models. In this study, the data from 2012 to 2015 were used as the training set, and the data from 2016 were used as the testing set.
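The lag features described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `lag_features`, the dictionary keys and the sample values are all hypothetical, and only the 3-month window and the max/min/average statistics come from the text.

```python
from statistics import mean


def lag_features(series, window=3):
    """Build one feature row per month that has `window` earlier months.

    Each row summarises only the preceding `window` values, so the
    target month itself never leaks into its own features.
    """
    rows = []
    for t in range(window, len(series)):
        past = series[t - window:t]  # the previous `window` months
        rows.append({
            "index": t,
            "lag_max": max(past),
            "lag_min": min(past),
            "lag_mean": mean(past),
            "target": series[t],
        })
    return rows


# Hypothetical monthly morbidity incidence values (not from the paper).
monthly = [1.0, 2.0, 4.0, 3.0, 5.0]
features = lag_features(monthly)
for row in features:
    print(row)
```

The first `window` months produce no rows because they lack a full history, which slightly shortens the usable training period.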

Predictive Model

In this study, we applied several different predictive models, including a linear model, a boosting tree model, a time series analysis model and a deep learning model.

Linear Model

The linear regression model [9] is widely used to model a linear relationship, but it has low predictive accuracy when predictors are highly correlated. Like ordinary least squares (OLS) estimation in the linear model, penalized regression (e.g. lasso, ridge) minimizes the residual sum of squares (RSS), but it also imposes a penalty on the magnitude of the coefficients. Lasso shrinks the coefficients towards zero, setting some of them exactly to zero, and thus can be used to perform variable selection. Ridge regression does not zero out coefficients, but it shrinks them towards zero by an amount controlled by the tuning parameter. Therefore, when multicollinearity and high variability of the coefficient estimates are a major issue, lasso and ridge regression are preferred over the linear regression model. We tested both of these models, and ridge regression produced better results.
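The difference between the two penalties is easiest to see with a single centered feature and no intercept, where both estimators have a closed form. The sketch below assumes the objectives 0.5·RSS + 0.5·λ·b² for ridge and 0.5·RSS + λ·|b| for lasso. These scalings, the function names and the data are illustrative assumptions, not taken from the paper.

```python
def ridge_coef(x, y, lam):
    """Minimiser of 0.5*sum((y - b*x)**2) + 0.5*lam*b**2 for one feature."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Minimiser of 0.5*sum((y - b*x)**2) + lam*abs(b) (soft-thresholding)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    if sxy > lam:
        return (sxy - lam) / sxx
    if sxy < -lam:
        return (sxy + lam) / sxx
    return 0.0  # the penalty outweighs the signal, so the feature is dropped


# Hypothetical centered data with a weak linear relationship.
x = [-1.0, 0.0, 1.0]
y = [-0.5, 0.0, 0.5]

ols = ridge_coef(x, y, 0.0)    # lam = 0 reduces ridge to OLS
ridge = ridge_coef(x, y, 2.0)  # shrunk, but still non-zero
lasso = lasso_coef(x, y, 2.0)  # set exactly to zero
print(ols, ridge, lasso)
```

With the same penalty strength, ridge halves the coefficient while lasso removes it entirely, which is why lasso doubles as a variable selection method.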

Time Series Model

Time series models play an important role in identifying hidden patterns in a dataset, such as autocorrelation and seasonal variation. For diseases exhibiting cyclic or repeating patterns, a time series model such as the autoregressive integrated moving average (ARIMA) model can be used to describe the linear relationship between disease incidence and predictors [10]. The ARIMA model can capture the behavior of both stationary and non-stationary series, which lets it describe series with various patterns. For non-stationary time series data, differencing can be applied one or more times, starting with a first-order difference, to eliminate the non-stationarity and achieve high prediction accuracy based on the transformed series.
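As a small illustration of the differencing step, the sketch below removes a linear trend with one first-order difference and then restores the original series. The helper names and the sample series are hypothetical. A full ARIMA fit would normally come from a statistics library and is not shown here.

```python
def difference(series):
    """First-order difference: d[t] = y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def undifference(first_value, diffs):
    """Rebuild the original series from its first value and its differences."""
    out = [first_value]
    for d in diffs:
        out.append(out[-1] + d)
    return out


# Hypothetical series with a steady upward trend (non-stationary mean).
trend = [10, 12, 14, 16, 18]
d1 = difference(trend)                 # constant series: the trend is gone
d2 = difference(d1)                    # differencing again gives zeros
restored = undifference(trend[0], d1)  # forecasts are mapped back this way
print(d1, d2, restored)
```

Forecasts made on the differenced series have to be accumulated back in this way before they can be compared with observed incidence.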

Boosting Tree Model

Boosting trees are easy to interpret and have been widely applied in the healthcare industry. The algorithm learns by fitting each new tree to the residuals of the preceding trees, which tends to improve prediction accuracy. With embedded feature selection, a boosting tree model can dynamically select the optimal features during the learning process at each split in each tree. Among boosting tree implementations, XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable [11]. As a parallel tree boosting system, XGBoost provides an accurate and efficient implementation of the gradient boosting machine learning algorithm.
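The residual-fitting idea can be sketched with one-split regression trees (stumps) and squared-error loss. This is plain gradient boosting written for illustration only. It is not XGBoost, which adds regularization, second-order gradient information and many engineering optimizations. All names, data and hyperparameters below are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical one-feature training data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.0, 3.1, 6.0, 6.2]
base, stumps, preds = boost(xs, ys)
base_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, preds)
print(base_mse, final_mse)
```

Each round can only lower the training error for a learning rate in (0, 1], so in practice the number of rounds and the learning rate are tuned against held-out data to avoid overfitting.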

Recurrent neural network model

Recurrent neural networks (RNNs) are a state-of-the-art class of algorithms for sequential data. An RNN can remember its input through an internal memory, which makes it well suited to machine learning problems that involve sequential data. Long Short-Term Memory (LSTM) networks are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many people in subsequent work [12]. LSTM has a reputation for good performance in learning long dependencies. Based on this characteristic, we designed a two-layer LSTM model with a random walk strategy to keep our model simple, and it achieved good performance in predicting the 10 infectious diseases. We also took the following two factors into consideration when designing our model structure. First, to help the LSTM better capture the dependencies in our data, we introduced a time window. In general, LSTM can learn from sequences with an arbitrary number of time steps. Given the limited volume of our data and the fact that each record is monthly, we set the window size to 3. We also limited the number of neurons and the batch size to make our model suitable for a small dataset. Second, our data clearly exhibit trends, seasonality and abrupt changes in direction, which means that they are non-stationary. To address this problem, we used first-order differencing, which stabilizes the mean of a time series by removing the trend and reducing changes in level.
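The two preprocessing choices described above, first-order differencing and a window of 3 time steps, can be sketched as follows. The function name and the sample values are hypothetical. The LSTM network itself is omitted because the paper does not specify its framework or layer sizes.

```python
def make_windows(series, window=3):
    """Turn a series into (inputs, target) pairs of consecutive time steps."""
    samples = []
    for start in range(len(series) - window):
        inputs = series[start:start + window]
        target = series[start + window]
        samples.append((inputs, target))
    return samples


# Hypothetical monthly morbidity incidence (not from the paper).
monthly = [5, 7, 6, 9, 12, 11]

# First-order difference to remove the trend before windowing.
diffs = [b - a for a, b in zip(monthly, monthly[1:])]

# Each sample feeds 3 consecutive differences and predicts the next one.
samples = make_windows(diffs, window=3)

# A predicted difference is added to the last observed level
# to recover a prediction on the original scale.
last_inputs, last_target = samples[-1]
recovered_level = monthly[-2] + last_target
print(samples, recovered_level)
```

With monthly data over a few years, a window of 3 leaves only a few dozen training samples per disease, which is consistent with the decision to keep the network small.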

Result

We evaluated the performance of our approaches for predicting the morbidity incidence of infectious diseases in 2016. The mean absolute percentage error (MAPE), root mean square error (RMSE) and R squared (R2) were used to evaluate the prediction performance of the models. We performed feature engineering and model learning on the training set (2012-2015) for the traditional machine learning methods (including ridge regression and XGBoost) and applied the learned models to the testing set (2016). Finally, we evaluated the performance of the different models on the testing set for the ten infectious diseases. For comparison, we trained the boosting tree model both with and without the difference in morbidity incidence over the preceding two months as an input feature. Similarly, we trained the RNN model both with and without the differences over the preceding three months as model input.
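For reference, the three metrics can be computed as in the sketch below. The paper does not spell out its formulas, so the standard definitions are assumed here, with R2 taken as 1 − SS_res/SS_tot. Under that definition R2 becomes negative whenever a model predicts worse than the mean of the observed values. The sample data are hypothetical.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the units of the data."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations and two sets of predictions.
actual = [2.0, 4.0, 6.0]
good = [2.0, 4.0, 6.0]  # perfect predictions
bad = [6.0, 4.0, 2.0]   # trend predicted in the wrong direction

print(mape(actual, good), rmse(actual, good), r_squared(actual, good))
print(mape(actual, bad), rmse(actual, bad), r_squared(actual, bad))
```

Because RMSE is expressed in the units of the data, it is not comparable across diseases with very different incidence levels, whereas MAPE and R2 are scale-free.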

The results are shown in Tables 1 to 3, where the models with the best performance on the testing set are highlighted. Compared with the other machine learning methods, LSTM achieved the best performance for almost all of the infectious diseases. No manual feature selection is needed: we input only the historical data into the deep learning model, and it automatically learns the features and predicts the future morbidity incidence.

Table 1. Mean Absolute Percentage Error (MAPE) for different predictive models

| Disease | Ridge Regression | ARIMA | XGBoost | XGBoost(diff) | LSTM | LSTM(diff) |
| --- | --- | --- | --- | --- | --- | --- |
| Typhoid fever | 12.04% | 12.05% | 14.39% | 14.25% | 9.91% | 11.40% |
| Hemorrhagic fever | 37.45% | 29.55% | 69.02% | 53.41% | 12.20% | 15.44% |
| Mumps | 21.42% | 23.04% | 14.50% | 13.50% | 8.05% | 20.84% |
| Scarlatina | 23.79% | 45.31% | 14.74% | 17.95% | 8.98% | 8.63% |
| Malaria | 16.67% | 18.48% | 11.10% | 12.79% | 19.68% | 10.42% |
| Dysentery | 11.17% | 19.66% | 21.41% | 18.11% | 17.55% | 7.59% |
| Pertussis | 14.21% | 15.17% | 12.17% | 16.73% | 16.41% | 11.50% |
| Conjunctivitis | 24.05% | 14.87% | 22.45% | 17.97% | 9.21% | 10.04% |
| Pulmonary tuberculosis | 11.50% | 6.61% | 7.49% | 8.06% | 9.08% | 4.32% |
| Diarrhea | 20.61% | 18.33% | 11.96% | 8.69% | 19.49% | 18.76% |

Table 3. R squared (R2) for different predictive models

| Disease | Ridge Regression | ARIMA | XGBoost | XGBoost(diff) | LSTM | LSTM(diff) |
| --- | --- | --- | --- | --- | --- | --- |
| Typhoid fever | 0.6501 | 0.5293 | 0.417 | 0.4089 | 0.717 | 0.66 |
| Hemorrhagic fever | -0.0756 | 0.0024 | -7.3302 | -5.9487 | 0.573 | 0.641 |
| Mumps | -0.219 | -0.0903 | 0.3206 | 0.5393 | 0.813 | 0.211 |
| Scarlatina | 0.7256 | -0.0079 | 0.8805 | 0.6229 | 0.938 | 0.949 |
| Malaria | 0.1987 | -0.1066 | 0.4213 | 0.2782 | -0.575 | 0.463 |
| Dysentery | 0.9262 | 0.7811 | 0.7658 | 0.8304 | 0.825 | 0.938 |
| Pertussis | 0.1899 | -0.0314 | 0.4259 | 0.1123 | -0.072 | 0.556 |
| Conjunctivitis | -0.4545 | 0.3571 | -6.5479 | -4.7623 | 0.14 | 0.052 |
| Pulmonary tuberculosis | -0.5089 | -0.4965 | 0.3589 | 0.2147 | -0.037 | 0.688 |
| Diarrhea | -0.0526 | -0.1747 | 0.447 | 0.6207 | -0.414 | -0.254 |

Note: Diff means that we added the first-order difference variables of historical morbidity incidence as the model input.

As an example, we show only the predicted curves of the LSTM model for 5 infectious diseases. For comparison, we plotted the predicted curves of the LSTM model with and without the first-difference estimator for these 5 infectious diseases. As the curves show, the LSTM model with first-order difference information predicted upward and downward trends well, whereas the LSTM model without difference information predicted turning points well.

Discussion

In this study, we compared the performance of prediction models for the morbidity incidence of 10 different infectious diseases. The comparison yielded three key findings. Our most important finding was the superiority of the deep learning model, LSTM. For most diseases, LSTM demonstrated a lower error rate. For example, the MAPE for pulmonary tuberculosis was only 4.32 percent. This level of accuracy suggests that the prediction model has immediate practical application. By comparing LSTM’s performance with that of another time series model, ARIMA, we gained some insight into the reasons for LSTM’s better predictive ability. LSTM and ARIMA are similar in that both can capture seasonal patterns and trends. The high error rates resulting from ARIMA’s time-lag problem indicate that the absence of this problem may be a factor in LSTM’s accuracy.

We also found that if the morbidity incidence graph of a disease had a general trend line within the five-year range, as for typhoid fever, hemorrhagic fever, mumps and conjunctivitis, the LSTM model with the first-difference estimator was more predictive. Conversely, the diseases without a clear trend line, such as scarlatina, dysentery, pertussis and pulmonary tuberculosis, were more accurately predicted with the LSTM model.

One limitation of this study is that it used national-level morbidity and mortality data, which do not account for regional differences. Consequently, we did not include weather data or other region-level data. Furthermore, with weekly mortality and morbidity incidence data, we would be able to make more precise predictions. Advanced prediction models such as ours can facilitate efforts toward disease prevention if we can cooperate with CDC-China and build an early warning system.

Conclusion

For traditional machine learning models, feature engineering costs a large amount of time, and performance improves as more appropriate hand-crafted features are included. In contrast, deep learning models can learn automatically from raw data without much feature selection. In our experiments, LSTM with three features outperformed ridge regression, ARIMA, and XGBoost with around 11 input features. The comparison between LSTM with and without first-order differencing shows that differencing is a powerful strategy for handling trends in time series analysis.

Figure 2. The original and predicted values for the morbidity incidence of 10 infectious diseases using LSTM in 2012-2016. The left panel is the result of the LSTM model with the first-difference estimator and the right panel is the result of the LSTM model without the first-difference estimator.

Table 2. Root Mean Squared Error (RMSE) for different predictive models

| Disease | Ridge Regression | ARIMA | XGBoost | XGBoost(diff) | LSTM | LSTM(diff) |
| --- | --- | --- | --- | --- | --- | --- |
| Typhoid fever | 0.0091 | 0.0105 | 0.0117 | 0.0118 | 0.008 | 0.009 |
| Hemorrhagic fever | 0.0198 | 0.0191 | 0.0552 | 0.0504 | 0.012 | 0.011 |
| Mumps | 0.2797 | 0.2645 | 0.2088 | 0.1719 | 0.11 | 0.225 |
| Scarlatina | 0.083 | 0.159 | 0.0548 | 0.0973 | 0.039 | 0.036 |
| Malaria | 0.0039 | 0.0046 | 0.0033 | 0.0037 | 0.006 | 0.003 |
| Dysentery | 0.0903 | 0.1555 | 0.1609 | 0.1369 | 0.139 | 0.083 |
| Pertussis | 0.0063 | 0.0072 | 0.0053 | 0.0066 | 0.007 | 0.005 |
| Conjunctivitis | 0.0505 | 0.0336 | 0.115 | 0.1005 | 0.077 | 0.041 |
| Pulmonary tuberculosis | 0.6406 | 0.638 | 0.4176 | 0.4622 | 0.531 | 0.291 |
| Diarrhea | 1.31 | 1.3839 | 0.9495 | 0.7863 | 0.155 | 1.43 |

References

  • 1. World Health Organization. The top 10 causes of death fact sheet. <http://www.who.int/mediacentre/factsheets/fs310/en/index1.html>.
  • 2. Van Kerkhove MD, Ferguson NM. Epidemic and intervention modelling--a scientific rationale for policy decisions? Lessons from the 2009 influenza pandemic. Bulletin of the World Health Organization. 2012;90(4):306–10. doi: 10.2471/BLT.11.097949.
  • 3. Tang Z, Shen Y, Li Y, et al. Group Spike-and-Slab Lasso Generalized Linear Models for Disease Prediction and Associated Genes Detection by Incorporating Pathway Information. Bioinformatics. 2018;34(6):901. doi: 10.1093/bioinformatics/btx684.
  • 4. Soebiyanto RP, Adimi F, Kiang RK. Modeling and predicting seasonal influenza transmission in warm regions using climatological parameters. PLoS One. 2010;5(3):e9450. doi: 10.1371/journal.pone.0009450.
  • 5. Ginsberg J, Mohebbi MH, Patel RS, et al. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012. doi: 10.1038/nature07634.
  • 6. Smith MC, Broniatowski DA, Paul MJ, Dredze M. Towards Real-Time Measurement of Public Epidemic Awareness: Monitoring Influenza Awareness through Twitter.
  • 7. Kane MJ, Price N, Scotch M, et al. Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics. 2014;15(1):1–9. doi: 10.1186/1471-2105-15-276.
  • 8. Brooks LC, Farrow DC, Hyun S, et al. Flexible Modeling of Epidemics with an Empirical Bayes Framework. PLOS Computational Biology. doi: 10.1371/journal.pcbi.1004382.
  • 9. Hoerl AE, Kennard RW, Baldwin KF. Ridge regression: some simulations. Communications in Statistics. 1975;4(2):105–123.
  • 10. Hillmer SC, Tiao GC. An ARIMA-Model-Based Approach to Seasonal Adjustment. Publications of the American Statistical Association. 1982;77(377):63–70.
  • 11. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2016:785–794.
  • 12. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735.
