Abstract
Classification methods are widely used for identifying underlying groupings within datasets and predicting the class for new data objects given a trained classifier. This study introduces a project aimed at using a combination of simulations and classification techniques to predict epidemic curves and infer underlying disease parameters for an ongoing outbreak.
Seven supervised classification methods (random forest, support vector machines, nearest neighbor with three decision rules, and linear and flexible discriminant analysis) were used to identify partial epidemic curves from six agent-based stochastic simulations of influenza epidemics. The accuracy of the methods was compared using a performance metric based on the McNemar test.
The findings showed that: (1) assumptions made by the methods regarding the structure of an epidemic curve influence their performance, i.e. methods with fewer assumptions perform best, (2) the performance of most methods is consistent across different individual-based networks for Seattle, Los Angeles and New York, and (3) combining classifiers using a weighting approach does not guarantee better prediction.
Keywords: epidemic curves, supervised learning, agent-based epidemic models, classification, random forest
1 Introduction
Epidemic curves are graphical representations of the incidence of a disease plotted over time and are useful for inferring the magnitude, incubation duration and other attributes of an outbreak. In this study, we seek to predict the epidemic curve for an ongoing outbreak using a combination of simulations and classification methods. Predicting the epidemic curve implies that given data up to day j, we seek to predict the number of daily infections in the future. Real-time prediction of an epidemic curve during a (global) disease outbreak could be invaluable to public health officials since it could aid in the postulation of disease transmission parameters for studying the dynamics of the outbreak and influence selected measures for containment (McKinley et al. 2009, Nishiura 2011, Jiang et al. 2009, Ohkusa et al. 2011).
Infectious disease pandemics in recent years have increased interest in real-time forecasting and long-term prediction of epidemic curves and disease transmission parameters (see Jiang et al. 2009, Nishiura 2011, McKinley et al. 2009, Ohkusa et al. 2011 and Hall et al. 2007 for a few examples). McKinley et al. (2009) introduced a simulation-based approach for the estimation of disease transmission parameters. The similarities between their method and ours are: (i) both approaches use simulations in the prediction of epidemic curves, which allows for the estimation of missing data, (ii) given observed surveillance data for an ongoing epidemic, the surveillance data are matched to simulated samples using a metric and (iii) the analysis from both methods indicates that the choice of the metric affects the conclusions.
However, differences exist in the approach, the focus, and assumptions made in both studies. Some of the differences are: (i) the temporal SEIR model used by McKinley et al. (2009) assumes homogeneous mixing while the agent-based network approach captures the heterogeneity present in the spread of an infectious disease. (ii) Although not exploited in this study, the agent-based model has a representation of every pair of individuals connected in the network which enables analysis at the individual level and the prediction of the epidemic curve based on changes in individual behavior. (iii) Unlike McKinley et al. (2009), one of the focuses of this paper is to explore different classification methods with the purpose of finding the best ones for predicting outbreaks with different parameters and simulated over different networks. (iv) Also, contrary to McKinley et al. (2009), we do not use a Bayesian approach in this study.
The relevance of a simulation-based approach without likelihoods lies in the challenges faced by likelihood-based methods and alternative Bayesian approaches (McKinley et al. 2009). Most of the challenges are related to the availability of epidemic data and the tractability of the likelihood for large populations (Deardon et al. 2010). In addition, there are other potential advantages of using a simulation- and classification-based approach for predicting epidemic curves. In the event of a previously unobserved pandemic such as the recent 2009 H1N1(A), by comparing the data for the daily infected or influenza-like illness (ILI) cases to existing simulations of previous outbreaks, an initial model can be proposed for studying the spread of the disease. In addition, the agent-based epidemic modeling approach allows for easy introduction of behavioral changes that occur at the individual level which can affect the spread of a disease (Epstein 2009). Although compartmental models could be used to simulate similarly shaped epidemic curves as those used in this study, agent-based models (ABM) are used instead, because the social networks focus on particular regions, which implies that results observed for one region are not necessarily applicable to other regions due to spatial and demographic differences. More details on the ABM are presented in the Appendix.
In the long run, the overall goal of this project is to develop an extensive digital library of tens of thousands of simulated outbreaks and, during an outbreak of an infectious disease, quickly match real-world surveillance data to one or more (or possibly none) of these simulations. By matching the surveillance data to simulated cases, both the epidemic curve and underlying model parameters can be estimated. This paper represents the first step in this process. Here we compare eight supervised classification methods to determine those with a high accuracy rate in identifying the underlying disease parameters for an influenza epidemic. The supervised classification schemes are used to sequentially classify twelve hundred partial epidemic curves (half used as a classification training set while the other half represent surveillance samples) from six agent-based stochastic simulations of influenza epidemics. The underlying disease transmission model is an SEIR model with three parameters: incubation period distribution, infectious period distribution and transmissibility.
Sequential classification implies that for each day j of an outbreak, a set of transmission parameters is proposed for describing the outbreak by assigning the epidemic curve to one of the epidemic clusters in our library. An epidemic cluster is represented by stochastic simulations from the same SEIR disease model parameterization. Supervised classification methods are used in this study since each classifier can be trained using data already available in the library. Importantly, the choice of classification method can make a substantial difference in the sensitivity, specificity and accuracy of classification.
In a typical supervised classification scheme, given learning samples with known data classes, the goal is to build a classifier that can correctly predict the classes of new data objects. In this study, the new data objects (partial epidemic curves) do not have the complete information available in the learning samples. The sparse nature of the partial curves is likely to affect the performance of the classification techniques. The complexity of this study therefore lies in the incompleteness of information in the new data objects.
The two main aims of this study are to perform a systematic comparison of eight classification methods (three nearest neighbor methods, support vector machines (SVM), linear discriminant analysis (LDA), flexible discriminant analysis (FDA), random forests (RF) and a combined classifier) to find which methods perform best in correctly identifying partial epidemic curves for six epidemics and evaluate whether the performance of these methods differs by social networks. There are two main assumptions in this study. (i) Epidemic curves represent the counts of daily-infected cases. (ii) Epidemic curves in the test set are assumed to be described by one of the sets of disease transmission parameters in the data library.
However, the latter would not always be the case if surveillance data from a real outbreak is used. In general, the epidemic curve for an ongoing outbreak can be assigned to one, several or none of the clusters in the library. If the epidemic curve is assigned to one of the clusters, then the underlying model for that epidemic cluster can be used in modeling the outbreak. The aim is not to find a specific model for the epidemic but rather a set of possible models since in most scenarios uncertainty in prediction increases with sparsity of the data (McKinley et al. 2009). In addition, the set of possible parameters can be extremely large, which is one of the reasons for presenting this special case. To make this method applicable to the case where an epidemic curve cannot be assigned to any of the clusters in the library, an iterative scheme can be used whereby a combination of expert opinion and search methods are applied over multiple iterations to propose a set of parameters possibly describing the outbreak. New clusters of epidemics can be created based on the newly proposed parameters and the epidemic curve can be reclassified until a good fit is found.
The epidemics are simulated under the assumption of a novel virus with little or no prior immunity. Simulations could be conducted that further change the shape of the epidemic curves by including more details about individual behavior and immunity that improve the realism. To test out the classification scheme these curves are sufficient. One could argue that by maintaining fewer parameters, this further “differentiates” the curves thereby providing a strenuous exploration of the classification scheme’s performance since in essence it has fewer parameters to perform the classification on. In addition, the use of simulated data enables a thorough evaluation of the performance of the classification methods under a controlled setting. The outbreaks are simulated across social networks for metropolitan regions surrounding New York, Los Angeles and Seattle with population sizes of approximately 20 million, 16 million and 3.2 million respectively. For the purpose of this paper, these metropolitan regions are referred to as Seattle, Los Angeles and New York.
The rest of this paper is organized as follows: the proposed approach is presented in the next section, the methods are discussed in section 3, the results are presented in section 4, the conclusions follow in section 5 and the Appendix contains a description of the supervised learning methods, the ABM, and a compartmental model. Initial results for this study were published in the 2010 Joint Statistical Meeting Proceedings (Nsoesie et al. 2010).
2 Approach
Let the vector X = < x1, x2, …, xt > represent an epidemic curve, where t is the duration of the observed outbreak in days and the xi are the daily counts of infected individuals. Suppose that in the early stages of an outbreak, a partial curve Y = < y1, y2, …, yd > of duration d < t is observed. Since the model parameters underlying the new outbreak are unknown, predicting future daily infected counts is difficult.
A possible solution would involve using a classification approach. Based on the shape of the partial epidemic curve, several possible disease transmission parameters (e.g. incubation period, infectious period, serial interval etc.) can be hypothesized. Thousands of epidemic curves can be simulated based on the hypothesized parameters and the simulated epidemics can be organized in a library with clusters representing epidemics with the same parameters. Using a supervised classification method, the complete epidemic curve can be estimated by assigning the partial epidemic curve into the cluster with the most similar full curves.
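The matching step described above can be sketched concretely. The function below (a hypothetical helper, assuming curves are stored as rows of equal-length NumPy arrays keyed by cluster label) assigns a partial curve to the most similar cluster under each of the three nearest neighbor decision rules used in this study:

```python
import numpy as np

def assign_cluster(partial_curve, library, rule="mean"):
    """Assign a partial epidemic curve to the most similar cluster.

    library maps a cluster label (one SEIR parameterization) to an
    array of shape (n_simulations, T) of full simulated curves.
    rule selects the decision rule: 'mean' (nearest mean),
    'median' (nearest median) or 'min' (minimum distance to any
    single curve in the cluster).
    """
    y = np.asarray(partial_curve, dtype=float)
    d = len(y)
    best_label, best_dist = None, np.inf
    for label, curves in library.items():
        # Truncate the simulated curves to the observed duration d.
        trunc = np.asarray(curves, dtype=float)[:, :d]
        if rule == "mean":
            dist = np.linalg.norm(trunc.mean(axis=0) - y)
        elif rule == "median":
            dist = np.linalg.norm(np.median(trunc, axis=0) - y)
        else:  # minimum distance to the closest individual curve
            dist = np.linalg.norm(trunc - y, axis=1).min()
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

The returned label identifies the cluster whose full simulated curves then serve as candidate completions of the partial curve.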
In general, a partial epidemic curve observed during an outbreak can be assigned to one, several or none of the clusters in the library. If the partial curve is assigned to one (or several) of the clusters, then this can be used as a starting point for modeling the outbreak. If the partial epidemic curve is not assigned to any of the clusters in the library, a new set of disease models can be hypothesized based on a search for possible model parameters that could be used to describe the outbreak.
In order to find the best supervised classification method that consistently outperforms other methods in predicting the correct epidemic cluster early on in an outbreak, we compare the accuracy of seven supervised classification methods in addition to a combined classifier. The epidemic curves used in the classification are simulated using six SEIR model parameterizations. The parameters are shown in Table 1. For the purposes of this paper, these outbreaks are called catastrophic, mildly catastrophic, strong, mildly strong, moderate and mild flu epidemics.
Table 1.
Parameters used in simulating the epidemics in this study. Catastrophic flu infects about 50% of the population, while strong, mildly strong, and mild flu infect approximately 30%, 20% and 10% of the population respectively. Each infected individual has a randomly assigned probability of having a specific incubation or infectious duration. For example, for catastrophic flu, each infected individual has a probability of 0.3, 0.5 or 0.2 of having an incubation period of 1, 2, or 3 days respectively.
| Name | Transmissibility | Incubation Period (day: probability) | Infectious Period (day: probability) |
|---|---|---|---|
| Catastrophic * | 0.00006 | 1: 0.3, 2: 0.5, 3: 0.2 | 3: 0.3, 4: 0.4, 5: 0.2, 6: 0.1 |
| Mildly Catastrophic ** | 0.000083 | 0: 0.20, 1: 0.45, 2: 0.35 | 2: 0.66, 3: 0.33, 4: 0.01 |
| Strong | 0.000042 | same as * | same as * |
| Mildly Strong | 0.0000365 | same as * | same as * |
| Moderate | 0.0000581 | same as ** | same as ** |
| Mild | 0.0000333 | same as * | same as * |
The incubation and infectious period distributions used in the catastrophic model (Table 1) are based on a consensus among three research groups and were initially used in a study by Halloran et al. (2008). The same parameters have been used in several other studies (e.g. Eubank et al. 2010 and Goldstein et al. 2010). The parameters used in modeling the mildly catastrophic outbreaks have not been published but are based on the serial interval proposed by Cauchemez et al. (2009). The distributions of the incubation and infectious durations are obtained by adjusting the joint distributions of the incubation and infectious periods in Halloran et al. (2008) to match the serial interval in Cauchemez et al. (2009). The transmissibility parameters are selected so as to explore outbreaks ranging from what is observed during normal influenza seasons to more extreme outbreaks, providing a thorough examination of the sensitivity of the proposed method. Samples of epidemic curves simulated for Seattle are given in Figure 1. There are 200 replicates for each simulated epidemic; half are used as a training set and the other half are used as surveillance samples for the test set. Although all the epidemics are simulated for a duration of 365 days, most have a total duration of less than 210 days.
Figure 1.
Sample epidemic curves simulated using the disease models in Table 1 for Seattle. Each group of epidemic curves represents 200 stochastic simulations. Although the curves appear smooth, a closer look would reveal they are not.
In the classification scheme, the data in the training set are used to “learn” the methods in order to obtain the optimal classifier on each day j = 1,…,d. Each partial epidemic curve in the test set is then used as an input sequence < x1, x2,…, xd > into each of the trained classifiers and a single epidemic cluster label is returned as output. The accuracy of each classifier is estimated by epidemic (catastrophic, mildly catastrophic, strong, mildly strong, moderate and mild flu epidemics) based on the number of correct classifications on each day. This is a combination of a time series prediction and a sequence classification problem since each epidemic curve can be viewed as a time series. Dietterich (2002) discusses these classification problems in detail.
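The day-by-day scheme can be sketched as follows, with random forest standing in for each of the trained classifiers (a simplified illustration assuming scikit-learn and curves stored as rows of a NumPy array; the helper name is hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def daily_accuracy(train_curves, train_labels, test_curves, test_labels, days):
    """For each observation day j, train a classifier on curves
    truncated to their first j days, then score it on the test set
    truncated to the same duration."""
    accuracies = {}
    for j in days:
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(train_curves[:, :j], train_labels)
        accuracies[j] = clf.score(test_curves[:, :j], test_labels)
    return accuracies
```

Each day's classifier sees only the information available up to that day, mirroring the surveillance setting in which the partial curve grows by one observation per day.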
3 Methods
An agent-based modeling approach is used to simulate the outbreaks investigated in this study. This ABM has previously been used to study the transmission dynamics of an infectious agent through individual populations and to evaluate the effectiveness of control strategies over specific populations (Bisset et al. 2009, Goldstein et al. 2010). For discussions and studies using similar models see Barrett et al. (2011) and Eubank et al. (2004). The creation of this ABM involves two major steps: the creation of a social contact network from a state-of-the-art behavioral model and a computational model for disease transmission. To enable the readability of this paper, these methods are discussed in the Appendix.
3.1 Classification Methods
The supervised classification methods used in this study are support vector machines (SVM), random forest (RF), three nearest neighbor methods (nearest mean (Mean), nearest median (Median) and minimum distance (Minimum)), linear discriminant analysis (LDA) and flexible discriminant analysis (FDA). There are several professed advantages to each of these supervised learning techniques. The advantages of random forest include efficiency on large databases, high accuracy and estimation of variable importance (Hastie et al. 2009). Likewise, support vector machines have been shown to achieve a high accuracy rate across various data types and tend to perform well on high-dimensional data (Hastie et al. 2009). Discriminant analysis methods perform well due to the simplicity of the methodology and can provide low-dimensional views of high-dimensional data (Hastie et al. 2009). Nearest neighbor methods are relatively easy to implement and are highly adaptive (Holmes and Adams 2002). A brief discussion of these methods can be found in the Appendix; detailed treatments are given in Hastie et al. (2009).
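For illustration, rough scikit-learn analogues of four of the seven classifiers can be instantiated as below. This is an assumption about tooling, not the authors' actual setup: the nearest median and minimum distance rules have no stock estimator, and FDA would likewise require a custom implementation.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.svm import SVC

# Each entry pairs a method name from the text with a comparable
# off-the-shelf estimator; all share the fit/predict interface.
classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "Mean": NearestCentroid(),  # the nearest mean decision rule
    "LDA": LinearDiscriminantAnalysis(),
}
```

The shared fit/predict interface is what makes the sequential scheme easy to run over many methods in a single loop.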
The supervised classification methods are selected in a manner that allows exploration of different types of classification methods from machine learning and statistics: tree based methods (random forest), distance based methods (nearest neighbor), probabilistic classification methods (LDA) and maximum margin classifiers (support vector machine). These methods have been shown to perform differently based on the performance criteria and the data structure (Caruana and Niculescu-Mizil 2006).
In some cases, the selection of a single method from all supervised learning methods investigated might not prove to be the ideal choice since “potentially valuable information may be wasted by discarding the results of less-successful classifiers” (Tumer and Ghosh 1999). A pooled classifier might be useful in situations where no classification method is likely to consistently outperform the others or the surveillance data contains a large amount of noise and are high dimensional (Tumer and Ghosh 1999).
Several methods have been proposed for combining classifiers. The most popular of these is simple averaging of the output from each of the classification methods (Tumer and Ghosh 1999). Weighted averaging with different definitions of how to calculate the weights is an extension of this method (Tumer and Ghosh 1999). Rank-based combiners, voting schemes, order statistics combiners, and belief functions are other methods for pooling classifiers (Tumer and Ghosh 1999, Chan and Stolfo 1995).
Both simple and weighted voting classification schemes are used in this study. In a simple voting scheme, the prediction from each classification method receives a single vote and the final classification is based on a majority of votes; this scheme is easy to implement. Weighted voting is a variant of simple voting in which each classifier's vote is associated with a weight. The weights increase the influence of better classifiers on predictions and are calculated from the performance of each classification method on a validation set (Chan and Stolfo 1995). The weights in this study are defined as linear, polynomial and exponential, given by 2ε_{i,j}⁻¹, 2ε_{i,j}⁻³, and exp(2ε_{i,j}⁻¹) respectively. The weights are recalculated on each day of classification, and there are six misclassification errors on each day since there are six disease clusters. Here, ε_{i,j} is the mean of the six misclassification errors on day j for method i, where i = 1,…,7 and j = 5,…,t, with t the duration of the epidemic.
An additional set of 600 simulated epidemics from all parameter sets is used as a validation set in estimating the weights for each method on each day. The classification procedure using the combined weighted voting scheme is described below. The first half of the algorithm excluding the calculation of weights is the basic approach for classification based on the individual methods.
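A single day's weighted vote might look like the sketch below. The method names are placeholders, the helper is hypothetical, and the weight formulas are read as 2ε⁻¹, 2ε⁻³ and exp(2ε⁻¹), an interpretation of the flattened notation in the text:

```python
import math

def weighted_vote(predictions, errors, scheme="linear"):
    """Combine per-method predictions for one day using error-based weights.

    predictions: dict mapping method name -> predicted cluster label
    errors: dict mapping method name -> mean misclassification error
            eps_{i,j} of that method on the validation set for day j
    """
    tallies = {}
    for method, label in predictions.items():
        eps = max(errors[method], 1e-6)  # guard against a zero error
        if scheme == "linear":
            weight = 2 * eps ** -1
        elif scheme == "poly":
            weight = 2 * eps ** -3
        else:  # 'expo'
            weight = math.exp(2 * eps ** -1)
        tallies[label] = tallies.get(label, 0.0) + weight
    # The cluster with the largest total weighted vote wins.
    return max(tallies, key=tallies.get)
```

Note that a low-error method can outvote a majority: two mediocre classifiers agreeing on one cluster may still lose to a single accurate classifier voting for another.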
3.2 Performance Accuracy Metric
The McNemar test (Everitt 1992) is used in a pair-wise comparison of the methods. It is selected because it accounts for the lack of independence between data samples and has been shown to be less prone to Type I error than other statistical tests for comparing classification methods (Dietterich 1998). The test evaluates the null hypothesis of equality between p_b and p_c, where p_b is the number of test samples misclassified by classifier A but not by B, and p_c is the number of test samples misclassified by classifier B but not by A. The McNemar test is based on the hypotheses:
H0a: The error rate of method A is greater than or equal to the error rate of method B on day j.
H0b: The error rate of method B is greater than or equal to the error rate of method A on day j.
The McNemar test compares the accuracy of the methods at a single time point. However, since the epidemic curves are classified over several time points, an accuracy metric that evaluates method performance over time is needed. For the purposes of this paper, performance accuracy is defined under two categories, "better" and "consistent", in order to find which methods perform best in identifying epidemic curves from each parameter set and which methods consistently outperform the others across all epidemics. The performance accuracy metric is defined as follows:
Method A is significantly better than method B if there is statistically significant evidence for rejecting the null hypothesis on more than 50% of the days. Method A is consistent if it performs better than most methods in the identification of epidemic curves from all clusters. This implies that there can be more than one consistent method.
The significance level for the test is set at α = 0.05. The analysis is focused on the first few days of the outbreaks since the aim of the study is to find a method with a high accuracy rate at the early stages of an outbreak.
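The per-day comparison can be implemented with the exact binomial form of the McNemar test on the discordant counts (an illustrative choice; the paper does not state whether the exact or the chi-square version is used):

```python
from scipy.stats import binom

def mcnemar_exact(n_ab, n_ba):
    """Exact McNemar test on the discordant pair counts.

    n_ab: test samples misclassified by method A but not by B
    n_ba: test samples misclassified by method B but not by A
    Returns the two-sided p-value under H0 of equal error rates.
    """
    n = n_ab + n_ba
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    # Double the smaller binomial tail under p = 0.5, capped at 1.
    p = 2 * binom.cdf(min(n_ab, n_ba), n, 0.5)
    return min(p, 1.0)
```

Only the discordant counts enter the test; samples both methods classify the same way carry no information about which method is better.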
3.3 Chi-Square Tests
The Chi-square test is used to evaluate the null hypothesis of no association between the performance of the methods and the social networks across which the outbreaks are simulated. Both the Pearson Chi-square and likelihood-ratio Chi-square tests are considered. These tests are selected because they can be used for both nominal and ordinal data (Agresti 2002).
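With results arranged in a contingency table of correct versus incorrect classifications per network, both tests are available in scipy. The counts below are purely illustrative, not the study's data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of correct/incorrect classifications for one
# method across the three networks (illustrative values only).
table = np.array([
    [540, 60],   # Seattle:      correct, incorrect
    [545, 55],   # Los Angeles
    [538, 62],   # New York
])

# Pearson chi-square test of independence between network and outcome.
chi2, p, dof, expected = chi2_contingency(table)

# Likelihood-ratio (G) test variant of the same hypothesis.
g, p_g, _, _ = chi2_contingency(table, lambda_="log-likelihood")
```

A large p-value from either statistic, as these made-up counts produce, means no evidence that accuracy depends on the network.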
A summary of the experimental design for this study is given in Table 2.
Table 2.
Summary of the components of the experimental design
| Components | Number | Names |
|---|---|---|
| Social Networks | Three | Seattle, Los Angeles and New York |
| Classification Methods | Seven | Linear and Flexible Discriminant Analysis, Nearest Mean Method, Nearest Median Method, Minimum Distance Method, Random Forest, Support Vector Machines |
| Combined Classifiers | Four | Simple voting classifier, linear, exponential, and polynomial weighted voting classifiers |
| Influenza epidemics | Six | Catastrophic, Mildly Catastrophic, Strong, Mildly Strong, Moderate and Mild |
| Statistical Methods | Two | McNemar and Chi-square tests |
4 Results
The results are presented under four subsections. The first two sections, (4.1) and (4.2), answer the question of which methods perform best in identifying epidemic curves from each parameter set and which methods are consistent across all epidemics. Section (4.3) discusses the performance of the combined classifiers, while section (4.4) examines the performance of the methods across different social networks.
4.1 Daily Accuracy of the Classification Methods
As shown in Figure 1, 200 epidemic curves are simulated for Seattle using each set of parameters in Table 1. Epidemics based on the same assumptions are also simulated for New York and Los Angeles. Day 1 of the epidemics represents the day on which the first infected case is observed. Six hundred epidemics are randomly assigned to the training set and six hundred are assigned to the test set for each of the study regions.
The comparative performance of each supervised classification method in correctly identifying the epidemic cluster for all six hundred epidemic curves in the test set is shown in Figure 2 for Seattle. The results are presented for days five to eighty-one since the accuracy of most of the methods remains stable across all epidemics after day eighty-one. Each subfigure in Figure 2 represents the accuracy in the identification of epidemics from a single SEIR parameterization. The results are ordered by day of peak, i.e. the outbreak with the earliest peak (mildly catastrophic) based on Figure 1 is presented first.
Figure 2.
The daily accuracy of eight classification methods. Results are presented for Seattle.
The eight classification methods compared in Figure 2 are: support vector machines (SVM), random forest (RF), nearest neighbor methods (nearest mean (Mean), nearest median (Median) and minimum distance (Minimum)), linear discriminant analysis (LDA), flexible discriminant analysis (FDA) and the combined simple voting classifier (Combined). Only the combined simple voting classifier is shown here since there is not much difference between “simple voting” and “weighted voting” as discussed later in section (4.3). In addition, the results are presented only for Seattle since similar results are observed for New York and Los Angeles as discussed later in section (4.4).
For the two catastrophic flu epidemics (Figure 2), early identification is easy irrespective of the classification method. However, as the epidemic curves "get closer" to one another, correct identification becomes harder; in such cases, methods such as LDA should not be used. The prediction accuracy rate is unstable for LDA and FDA in the classification of mild, mildly strong and strong epidemics. The performance of the nearest mean and median methods is also unstable in the identification of strong and moderate epidemics. In contrast, random forest, the simple voting classifier and the minimum distance method appear to perform well across all epidemics.
The results shown in Figure 2 indicate that all methods have over a 50% accuracy rate in identifying mildly catastrophic epidemics on the first day of classification. Figure 2 also shows that the minimum distance method is the only method which achieves an accuracy above 50% in the identification of catastrophic epidemics on day 5. The high accuracy of the minimum distance method can be explained by the low variability between clusters at the start of the epidemics. With the minimum distance method, an observed outbreak can be matched to more than one of the clusters.
Most methods exhibit instability in the identification of both strong and moderate epidemics. However, the accuracy rate is much higher on day eighty-one across all methods in the identification of strong epidemics compared to moderate epidemics. The instability in the classification can be explained by the overlap between epidemic curves for mildly strong, strong and those of mild and moderate epidemics.
The accuracy of the methods in the identification of mild and mildly strong epidemics is important because these epidemics are the last to peak, which increases the complexity of identification. Except for the two discriminant analysis methods, all other methods perform relatively well in the identification of both mild and mildly strong epidemics. However, the methods achieved a higher accuracy by day eighty-one in identifying mild epidemics relative to mildly strong epidemics, although the mild epidemics peak after the mildly strong.
The results observed in Figure 2 indicate that “time to peak” affects the accuracy of random forest, support vector machines and nearest neighbor methods since curves which peak early are most easily identifiable. Figure 2 also suggests that in choosing methods which are likely to perform well across all epidemics, linear and flexible discriminant analysis methods should be avoided.
4.2 Consistency of Classification Methods
The observations in Figure 2 are further investigated by performing a pair-wise comparison of the methods using the McNemar test and the performance accuracy metric. The methods are ranked from one to three for each epidemic, where three is the most preferable method and one is the least preferable method. The rankings are displayed on the heatmaps in Figure 3. The rankings 1, 2, and 3 are represented with the colors royal blue, dark green, and orange respectively.
Figure 3.
Ranking of methods by epidemic and by region based on results of the McNemar test and the performance accuracy metric. 1 represents the least preferable and 3 represents the most preferable method for each epidemic. The color code is 1: royal blue, 2: dark green, and 3: orange.
Based on Figure 3, RF is ranked as the most preferable method for five of the six epidemics for Seattle. Minimum is ranked as the most preferable method for three of the six epidemics. Each of the other methods except LDA is ranked twice as the most preferable method. Although SVM has fewer best rankings than Minimum, it is not ranked as least preferable for any of the epidemics, which could suggest that SVM is a more consistent method than Minimum. Similar results are observed for Los Angeles. However, for New York, RF, SVM, FDA and LDA are ranked as least preferable in the classification of mild flu. Nevertheless, RF and SVM perform significantly better than LDA and FDA; SVM and RF are included in this group because the nearest neighbor methods perform better.
The nearest neighbor methods perform significantly better than all other methods in the identification of mild epidemics across all social networks. All the methods perform well in the classification of catastrophic epidemics, while RF is best for the classification of mildly strong, strong, mildly catastrophic and moderate epidemics. Comparisons of rankings across social networks are further discussed in section (4.4).
Based on Figures 2 and 3, RF appears to be the most consistent method. SVM is also consistent, ranking as the least preferable method only once. FDA and LDA are ranked as the least preferable methods more often than others which suggests they should be avoided.
4.3 Combined Classification Weighting Schemes
The simple voting combined classifier does not appear to perform better than the individual classification methods in the daily identification of epidemic curves for the six epidemics (Figure 2). In most cases, the classifier seems to capture the mean accuracy rate of all seven classification methods since it is influenced by both successful and less-successful classifiers.
In addition to the simple voting combined classifier, weighted voting classifiers were also proposed as discussed in section 3. Figure 4 shows the results of the comparison of the simple voting classifier to the weighted voting classifiers for all six epidemics. The combined classifiers are represented as follows: simple voting classifier by “simple voting”, weighted linear voting by “weighted-linear”, weighted polynomial voting by “weighted-poly” and weighted exponential voting by “weighted-expo”. In addition, the combined classifiers are also compared to the consistent methods (RF, SVM) from section (4.2).
Figure 4.
The performance of the combined classification schemes. Results are presented for Seattle.
There appear to be few differences in the classification of the catastrophic, mildly catastrophic and strong epidemics (Figure 4). Differences between the methods are more apparent for the mildly strong, moderate and mild epidemics. These differences are further investigated using the McNemar test and the performance accuracy metric.
The results from the McNemar tests indicate that none of the methods perform better in the classification of catastrophic and mildly catastrophic epidemics, which is reinforced by Figure 4. RF and SVM perform significantly better than all combined classifiers in the classification of moderate epidemics, while for the remaining epidemics none of the methods perform significantly better than the others. Similar results are observed for New York and Los Angeles. Based on these results, we conclude that in most cases the combined classifiers perform only as well as random forest and support vector machines.
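For reference, the pairwise comparison used here can be sketched as follows. This is a hedged illustration of a continuity-corrected McNemar test (the exact variant used in the study is not restated in this section); `b` and `c` are the discordant counts, i.e. the cases one method classifies correctly and the other does not, and the function name is ours.

```python
import math

def mcnemar_p(b, c):
    """Two-sided p-value for the continuity-corrected McNemar test.

    b, c: discordant counts (method A right / method B wrong, and vice versa).
    Uses the chi-square(1) survival function P(X > x) = erfc(sqrt(x / 2)).
    """
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))
```

For example, `mcnemar_p(10, 2)` gives a p-value a little above 0.04, suggesting a genuine performance difference at the 5% level, while balanced discordant counts such as `mcnemar_p(5, 5)` give a large p-value.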
4.4 Different Social Networks
The Chi-square test for independence is used in testing whether the performance of each of the classification methods is independent of the social networks over which the epidemics are simulated. Both the Pearson and Likelihood Chi-square tests indicate that there is no statistically significant evidence to reject the hypothesis of independence with p-values in the range [0.38, 1.00]. These results suggest that the accuracy of the eight classification methods in identifying epidemic curves from each of the six stochastic simulated epidemics is independent of the social networks over which the epidemics are simulated.
In addition, the best methods for identifying all six epidemics are also similar and the most consistent method across all epidemics is RF for all social networks (see Figure 3). However, there are a few differences in the rankings of the methods by epidemic across the social networks. For Seattle, RF and all nearest neighbor methods are ranked best in the classification of mildly strong epidemics. For New York, only RF is ranked best, while for Los Angeles, RF, Median and Mean are ranked best in the classification of mildly strong epidemics. Another difference can be observed in the classification of mildly catastrophic outbreaks. Only RF is ranked best for Seattle and Los Angeles, while for New York, RF, SVM and FDA are ranked best. These differences in rankings are, however, minor, and also depend on our definition of the accuracy metric.
There are several possible reasons why the results observed for the three regions are similar. Similarities in the area under the epidemic curve (total attack rate), the time to peak and the peak infection rate could imply that the shape and form of the simulated epidemic curves are alike across regions. However, an analysis of variance on these three measures showed that the attack rates and the peak infection rates are statistically significantly different across all regions (P < 0.00001 for both), and that the times to peak are also statistically significantly different across all regions except for mildly catastrophic epidemics. A comparison of the times to peak of mildly catastrophic epidemics for Los Angeles and Seattle suggested that there is no statistical evidence to reject the null hypothesis that the times to peak are the same (P = 0.21).
However, although the epidemic curves from the same parameter sets appear to be different across the different regions, the epidemic curves for each region peak in the same sequence (mildly catastrophic, catastrophic, strong, moderate, mildly strong and mild epidemics). This therefore suggests that the similarity in performance might be due to the sequence in which the epidemics reach their peaks across the three social networks.
The performance of the methods is also tested on a basic compartmental SEIR model representing a “null” network. Four parameterizations are used to simulate epidemics and the daily infected counts are used in constructing the epidemic curves. The SEIR model and results are presented in the Appendix. Similar conclusions are drawn from the “null” network as from the three previously discussed social networks; random forest and SVM are the most preferable methods.
5 Discussion
This paper serves as the first step in a project whose overall goal is to create a digital library of simulated outbreaks to be used to classify real-world surveillance data and assist in estimating infectious disease characteristics. The main aims of this paper were to perform a systematic comparison of seven supervised classification methods, in addition to a combined classifier, in order to determine which methods perform best at identifying outbreaks from six parameterizations of influenza epidemics, and whether the performance of these methods was affected by the social networks across which the outbreaks were simulated.
The correct identification of the underlying transmission parameters for an ongoing outbreak would provide an estimate of the epidemic curve under the assumption that no intervention methods have been applied to control the spread of the disease. Different epidemic curves with different transmission parameters are expected to have different shapes. However, this is not always apparent at the early stages of the outbreaks, which can make the differentiation of outbreaks with different transmission parameters difficult. For instance, in our study, mild outbreaks were misclassified as mildly strong and strong outbreaks at the start of the outbreaks.
The results show that multiple classification techniques can be used for predicting epidemic curves and inferring transmission parameters. However, random forest, which is easy to implement and has few parameters, would be the most preferable method. The consistency and high accuracy achieved by random forest in this problem could be due to its lack of assumptions regarding the structure of the data and to its methodology, which classifies based on a majority vote. In contrast, linear discriminant analysis had the poorest performance of all classification techniques examined. LDA assumes that all outbreak clusters share a common covariance matrix, an assumption unlikely to hold for curves drawn from different classes. In addition, LDA uses a single centroid per class, which can be inadequate in some cases (Hastie et al. 2009).
The results in this analysis demonstrate the complexities associated with dealing with sparse data in classification. In addition, combining classification methods using a voting scheme does not always outperform the individual classification methods, contrary to what is often assumed. The error introduced by less accurate methods heavily affects the performance of the combined classifier, whether or not weights are placed on the more accurate methods. Therefore, instead of combining several methods with different assumptions about the data, simpler and more efficient methods such as random forest can be used in the prediction of an epidemic curve.
Some of the limitations of our method stem from the assumption that the observed outbreak can be assigned to at least one of the clusters in our library. A method that systematically detects when an outbreak does not belong to any of the clusters and then proposes possible parameters for modeling the outbreak would be more beneficial. Also, depending on the healthcare infrastructure of a particular region or city, differences exist in the methods used for disease surveillance and the selection of interventions. Therefore, the proposed method would need to be adapted to the specific scenario under study by adjusting the details of the ABM.
In the next step, the methods used in this study can be extended to include a probabilistic framework for measuring uncertainty in our classification, and the analysis can be broadened to different intervention scenarios and subpopulations such as age groups. We think the results in this study are promising and reinforce the idea that a combination of simulations and classification methods can be used in the prediction and estimation of infectious disease transmission parameters.
Algorithm 1.
Classification
| Inputs: epidemic curves on day j |
| Outputs: predicted cluster on day j |
| Read in all epidemic curves in the training set |
| for (each day j) do |
|   Train each supervised learning method using all curves in the training set |
|   for (each epidemic in the test set) do |
|     Predict the epidemic cluster using each method |
|   end for |
|   Estimate error εi,j: the mean misclassification error on day j for method i |
| end for |
| For the weighted classifiers: |
| Using the validation data set, estimate the weights (2εi,j)^−1, (2εi,j)^−3 and exp((2εi,j)^−1) |
| for (each day j) do |
|   for (each epidemic in the test set) do |
|     Predict the epidemic cluster using each method and consider each prediction as a single vote |
|     Assign the appropriate weight to each method |
|     Sum the weighted votes assigned to each cluster |
|     Predict the epidemic cluster based on the majority weighted vote |
|   end for |
| end for |
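The weighted-voting step of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the study's implementation: the weight forms are one plausible reading of the inverse-error linear, polynomial and exponential schemes, and all function names are ours.

```python
import math
from collections import Counter

def weight(error, scheme="linear"):
    """Map a method's validation error rate to a vote weight (assumed forms)."""
    if scheme == "linear":
        return (2 * error) ** -1
    if scheme == "poly":
        return (2 * error) ** -3
    if scheme == "expo":
        return math.exp((2 * error) ** -1)
    raise ValueError(f"unknown scheme: {scheme}")

def weighted_vote(votes, errors, scheme="linear"):
    """Sum weighted votes per cluster; predict the cluster with the largest total.

    votes: predicted cluster label from each method.
    errors: the corresponding validation misclassification rates.
    """
    totals = Counter()
    for cluster, err in zip(votes, errors):
        totals[cluster] += weight(err, scheme)
    return totals.most_common(1)[0][0]
```

With this scheme, a single accurate method can outvote several weak ones: two methods with 40% and 30% error voting "mild" are outweighed by one method with 10% error voting "strong".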
Acknowledgments
We thank Dr. Scotland Leman and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments. This work has been partially supported by NSF NetSE Grant CNS-1011769, DTRA R&D Grant HDTRA1-0901-0017, DTRA CNIMS Grant HDTRA1-07-C-0113, NIH MIDAS project 2U01GM070694-7 and DTRA Rigorous Approaches for Validation and Verification of Networked Systems Grant.
6 Appendix
6.1 Computational Epidemiology Model
A detailed description of the agent-based model (ABM) used in simulating the outbreaks in this study is given in Bisset et al. (2009). The ABM belongs to a class of models called network-based epidemiology models that use a representation of the population that includes each individual and their minute-by-minute movements. The agents' interactions with other agents are used to generate a dynamic social network. These networks are in turn used to simulate epidemics and study the effects of changes in individual behavior and public policy on the propagation of an outbreak (Barrett et al. 2011). Although neither changes in individual behavior nor public policy are directly explored in this study, these models make it straightforward to change individual behaviors, such as keeping children home from school or, more generally, limiting the number of non-essential activities of specific members of the population. However, the purpose of this study is to assess the performance of classification techniques given partial epidemic curves. When the full procedure is implemented in the future, simulations will include epidemic curves that have these characteristics.
The creation of the agent-based epidemic model used in this study entails two major steps, the first of which is the creation of a social contact network from a state-of-the-art behavioral model. This involves creating synthetic populations and time-varying social networks. Synthetic individuals and households, located in specified geographical regions (such as Los Angeles), each with a set of demographic variables, are created using an iterative proportional fit to joint demographic distributions from the 2000 US census data provided in SF3 and PUMA (Public Use Microdata Area) files (Barrett et al. 2001). For example, demographic information such as household income, family size, age and education is available for each of the approximately 16 million individuals in the Los Angeles region.
The synthetic populations are created to produce realistic features and demographics while preserving the confidentiality of the original data sets. A node represents an individual in the synthetic population. Each node is placed in a household with other synthetic individuals and each household is geographically located such that a census of the synthetic population, aggregated to the block level, would be statistically identical to the real census data (Beckman et al. 1996). Additional information can be found in Beckman et al. (1996), Speckman et al. (1997a) and Speckman et al. (1997b).
Next, each individual in the synthetic household is allotted activities by time of day based on several thousand responses to an activity or time-use survey for a specific region. The National Household Transportation Survey was used in creating the activity templates assigned to each household. The time-use or activity survey is expected to vary by region given factors such as the geographical location and the age composition of the population. Presently, this modeling approach, known as activity-based travel demand modeling, is considered the de facto standard in transportation science (Barrett et al. 2011). See Bowman (2001) and Bowman et al. (1998) for additional information.
Using a decision tree based on demographic information (such as the number of people in a household, the number of children etc.), each household in the synthetic population is matched to a survey household. Each activity for each synthetic person is then assigned an appropriate real location based on a gravity model and land-use data (Beckman et al. 1996). The addresses of locations are obtained from Dun and Bradstreet’s Strategic Database Marketing Records. The activities for each household are assigned to actual locations based on “the distance from the previous activity and its attractiveness (a measure of how likely it is that the activity happens there: number of employees, school enrollment, square feet of retail shopping etc.)” (Eubank et al. 2010).
In addition to specific assignment of activities, the time at which each activity starts and ends is also included. This leads to each individual in each household having a minute-by-minute schedule for each day. Synthetic individuals in the population interact with each other based on their minute-by-minute schedule to produce realistic contact graphs where vertices represent individuals and edges represent contacts between individuals (Barrett et al. 2011). Individuals mimic the behaviors of real people by participating in everyday activities such as eating, socializing, shopping etc. and multiple edges can be used between each person and the locations representing their frequency of visits. The modeling approaches used in the ABM as presented in Barrett et al. (2011) are given in Table 3.
Table 3.
Models and Modeling Approaches used in ABM
| Models | References |
|---|---|
| Urban Population Mobility Models | Barrett et al. (2009), Bowman et al. (1998), TRBC (1995–2003), and TRB (1998–2006) |
| Natural Disease History | Bailey (1975), Elveback et al. (1976), Halloran et al. (2008), Hethcote (2000), and Longini et al. (2005) |
| Transmission Models | Halloran et al. (2008), Hethcote (2000), and Longini et al. (2005) |
| Social Network Models | Eubank et al. (2004), Halloran et al. (2008), Newman (2003) |
| Types of Interventions | Ferguson et al. (2005), Ferguson et al. (2006), Halloran et al. (2002), and Halloran et al. (2008) |
Next, a computational model is developed to represent disease within individuals and transmission between individuals in the synthetic population. The transition from one disease state (susceptible, exposed, infectious and removed) to another is probabilistic and timed (e.g. it may be represented by the distribution of the infectious period). The transitions between states can also depend on the attributes of the people (e.g. age, health status etc.) and the type of contact (e.g. casual, intimate etc.). For the disease model used in this study, the probability that an infectious person i infects a susceptible person j is given by:
| pi,j = f(S, Ti,j) | (1) |
where f(S, Ti,j) is a nonnegative monotone increasing function of S, the severity of the disease (called transmissibility), and Ti,j, the contact time between persons i and j. The disease model is combined with the information in the network to study an infectious disease. At each time step of a simulation, each node is in one of four states: susceptible, exposed, infectious or removed. Time is divided into units based on days and the state of each individual is noted at the start of each day.
To run a simulation experiment, a population (contact network), characteristics of a disease and initial conditions (such as duration) are specified. In this study, the social networks studied were based on Seattle, Los Angeles, New York and surrounding metropolitan regions. The disease characteristics were based on influenza epidemics. For each simulated outbreak, several realizations of the stochastic process of disease propagation are computed. Intervention options such as vaccination, antiviral and social distancing can be applied during the outbreak to control its propagation. Each simulation is seeded with a randomly selected set of initially infected individuals. An epidemic curve, the vulnerability of different individuals in the network, epidemic size, number of new exposures on each day etc. can be explored at the end of each simulation of a disease outbreak.
Several studies have been implemented to validate specific components of the model and the general approach. See Beckman et al. (1996), Eubank et al. (2004), and Halloran et al. (2008) for structural validity of these models.
6.2 Analysis on a “Null” Network
The epidemics for the “null” network are simulated using a stochastic compartmental SEIR model for an epidemic. The SEIR model is based on the discretized stochastic SEIR model given by Lekone and Finkenstädt (2006). S(t), E(t), I(t) and R(t) represent the number of susceptible, exposed, infectious and removed individuals respectively at time t. Given the initial number of individuals in each compartment, the model can be specified as follows:
| S(t + h) = S(t) − SE(t) |
| E(t + h) = E(t) + SE(t) − EI(t) |
| I(t + h) = I(t) + EI(t) − IR(t) | (2)
| R(t + h) = R(t) + IR(t) |
where SE(t) ~ Bin(S(t), P(t)) is the number of susceptible persons who become exposed, with P(t) = 1 − exp(−β(t) h I(t)/N) for a population of size N. EI(t) represents the number of exposed individuals who become infectious, EI(t) ~ Bin(E(t), pC) where pC = 1 − exp(−θh). IR(t) is the number of cases removed from the infectious compartment at time t, IR(t) ~ Bin(I(t), pR) where pR = 1 − exp(−γh). The time-dependent transmission rate, the mean incubation period and the mean infectious period are given by β(t), 1/θ and 1/γ respectively, and h is the length of the time step.
The counts of daily infected are used in constructing the epidemic curves. Epidemics are simulated from four parameterizations of the SEIR model with two sets of mean incubation and infectious periods, and two transmission rates. The mean incubation and infectious periods are based on the distributions in Table 1.
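The discretized stochastic SEIR model above can be sketched in a few lines. This is a minimal illustration following the binomial-transition structure of Lekone and Finkenstädt (2006); the parameter values and function names are ours, not the study's, and `h` is the time step in days.

```python
import math
import random

def binom(n, p, rng):
    """Draw from Binomial(n, p) by summing Bernoulli trials (fine for small n)."""
    return sum(rng.random() < p for _ in range(n))

def simulate_seir(N=1000, E0=0, I0=5, beta=0.6, theta=0.5, gamma=0.3,
                  h=1.0, days=100, seed=0):
    """Return an epidemic curve: the newly infectious counts EI(t) per step."""
    rng = random.Random(seed)
    S, E, I, R = N - E0 - I0, E0, I0, 0
    curve = []
    for _ in range(days):
        P = 1.0 - math.exp(-beta * h * I / N)   # susceptible -> exposed prob.
        pC = 1.0 - math.exp(-theta * h)         # exposed -> infectious prob.
        pR = 1.0 - math.exp(-gamma * h)         # infectious -> removed prob.
        SE, EI, IR = binom(S, P, rng), binom(E, pC, rng), binom(I, pR, rng)
        S, E, I, R = S - SE, E + SE - EI, I + EI - IR, R + IR
        curve.append(EI)
    return curve
```

Each run gives one stochastic realization; rerunning with different seeds under the same parameterization yields the family of curves from which a training library could be built.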
The methods are tested on all four sets of epidemic curves and ranked based on the performance metric as shown in Figure 5. Random forest and SVM are ranked best, while the nearest neighbor methods are ranked “acceptable” two out of four times. In agreement with the results in section 4.4, this also suggests that the discriminant analysis methods should be avoided. As previously mentioned, the ABM is used instead of the simple compartmental SEIR model due to the overall goal of the project. In addition, the ABM results can be specific to a particular region, while the “null” network results are generalized.
Figure 5.
Ranking of methods by epidemic based on results of the McNemar test and the performance accuracy metric. 1 represents the least preferable and 3 represents the most preferable method for each epidemic. The color code is 1: royal blue, 2: dark green, and 3: orange.
6.3 Classification Techniques
In this section, we present a brief introduction into the classification methods used in our analysis.
6.3.1 Random Forest
Bagging, which stands for bootstrap aggregating, is a method for decreasing the variance of an estimated prediction function by combining several predictors (Breiman 1996, Hastie et al. 2009). Random forest is an extension of bagging: it involves growing several de-correlated trees, which are then averaged (Hastie et al. 2009). Trees in a random forest are identically distributed and have the same expectation.
The random forest algorithm for classification as stated in Hastie et al. (2009) follows:
1. For b = 1 to B:
   (a) Draw a bootstrap sample Z* of size N from the training data.
   (b) Grow a random-forest tree Tb on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size nmin is reached:
      i. Select m variables at random from the p variables.
      ii. Pick the best variable/split-point among the m.
      iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {Tb, b = 1, …, B}.

To make a prediction at a new point x, let Ĉb(x) be the class prediction of the bth random-forest tree. Then Ĉrf(x) = majority vote {Ĉb(x), b = 1, …, B}.
There are several professed advantages to random forests. Random forest runs efficiently on large databases in a short amount of time, provides estimates of the variables important in classification and can be used in unsupervised classification (Breiman 2001). Random forest also calculates an out-of-bag (oob) error rate from its out-of-bag samples, which involves constructing a random forest predictor for an observation “by averaging only those trees corresponding to bootstrap samples in which the observation did not appear” (Hastie et al. 2009).
The random forest technique was modeled using the randomForest package in R (Ihaka and Gentleman 1996). To improve the classification rate, for each model fitted to the training data, the number of variables randomly sampled as candidates at each split was set to the value that produced the minimum error rate, with 500 trees grown.
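The algorithm above can be illustrated with a toy from-scratch forest. This sketch is ours, not the study's R implementation: it uses depth-1 trees (stumps) instead of fully grown trees for brevity, bootstrap sampling, a random feature subset of size m per tree, and majority voting.

```python
import random
from collections import Counter

def fit_stump(X, y, feats):
    """Best single-feature threshold split (by misclassification) over `feats`."""
    best = None
    for f in feats:
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] > t else -sign for row in X]
                err = sum(p != yi for p, yi in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    _, f, t, sign = best
    return lambda row: sign if row[f] > t else -sign

def fit_forest(X, y, B=25, m=1, seed=0):
    """Grow B stumps, each on a bootstrap sample with m random features."""
    rng = random.Random(seed)
    p = len(X[0])
    trees = []
    for _ in range(B):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feats = rng.sample(range(p), m)                       # random feature subset
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return trees

def predict(trees, row):
    """Majority vote over the ensemble, as in step 2 of the algorithm."""
    return Counter(t(row) for t in trees).most_common(1)[0][0]
```

On a trivially separable two-class data set, the majority vote recovers the correct labels even though individual bootstrapped stumps may be poor.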
6.3.2 Linear Discriminant Analysis
In linear discriminant analysis (LDA), the data are projected onto a low dimensional vector space to maximize the ratio of between-class variance to within-class variance, thereby, obtaining maximal discrimination between classes. The density for each class is modeled as a multivariate Gaussian and the classes are assumed to have a common covariance matrix (Hastie et al. 2009).
Consider the problem of classifying an object xi = (xi1,…, xid)T with d dimensions into one of K previously defined classes. The linear discriminant function (classification score) is given by:
| ϕk(x) = (1/2)(x − μk)T Σ−1 (x − μk) − log πk | (3) |
where Σ is the common covariance matrix, πk is the prior probability and μk is the mean vector for class k. The parameters of the Gaussian distributions are estimated using the training data:
| π̂k = Nk/N,  μ̂k = (1/Nk) Σ{gi=k} xi,  Σ̂ = (1/(N − K)) Σk Σ{gi=k} (xi − μ̂k)(xi − μ̂k)T | (4) |
where N is the number of objects in the training data and Nk is the number of objects in class k. Object xi is classified into the class with the smallest ϕk(x) (Hastie et al. 2009, Wu et al. 1996).
The linear discriminant analysis was modeled using the lda function in the MASS package in R (Ihaka and Gentleman 1996). Equal prior probabilities were set for the six data class memberships.
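As a toy numeric illustration of the discriminant rule above, the sketch below implements the one-dimensional case, where the pooled variance stands in for the common covariance matrix; the data, priors and function names are ours.

```python
import math

def fit_lda(samples):
    """Estimate class means, priors and the pooled (common) variance.

    samples: dict mapping class label -> list of 1-D observations.
    """
    N = sum(len(v) for v in samples.values())
    K = len(samples)
    means = {k: sum(v) / len(v) for k, v in samples.items()}
    priors = {k: len(v) / N for k, v in samples.items()}
    # pooled variance: the 1-D analogue of the shared covariance matrix
    var = sum((x - means[k]) ** 2 for k, v in samples.items() for x in v) / (N - K)
    return means, priors, var

def classify(x, means, priors, var):
    """Assign x to the class with the smallest phi_k(x)."""
    return min(means,
               key=lambda k: (x - means[k]) ** 2 / (2 * var) - math.log(priors[k]))
```

With well-separated class means, a new observation is assigned to the class whose mean it lies closest to, adjusted by the class prior.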
6.3.3 Flexible Discriminant Analysis
“LDA can be performed by a sequence of linear regressions, followed by classification to the closest class centroid in the space of fits” (Hastie et al. 2009). This leads to a generalization of LDA in which the linear regression fits are replaced by more flexible nonparametric fits such as multivariate adaptive regression splines (MARS), introduced by Friedman (1991). See Friedman (1991) for more information on MARS. The nonparametric fits result in a more flexible classifier, which is expected to classify better than LDA since the underlying relationships between variables are not assumed but are derived from basis functions.
6.3.4 Support Vector Machines
Classification using support vector machines involves two steps: mapping the data into a predetermined high-dimensional space via a kernel function (e.g. linear, polynomial or radial basis (Gaussian)) and finding the hyperplane that maximizes the margin between the data classes. The overall aim of an SVM is to find the hyperplane that maximizes separation and minimizes misclassifications.
Consider training data with two predictor variables and two separable classes, consisting of n objects (x⃗i, yi), where x⃗i is the vector of predictor values and yi ∈ {−1, +1} is the class label. Finding an optimal hyperplane for the data involves finding the weight vector w⃗ and bias (threshold) b that construct two margins providing maximum separation between the two classes:
| yi(w⃗ · x⃗i + b) ≥ 1, for i = 1, …, n | (5) |
where w⃗ = (w1,…, wd)T is the weight vector, with one component per predictor variable.
The optimal hyperplane maximizes the distance between the closest points of the two classes while separating them (Vapnik 1995). The vectors (or points) nearest the hyperplane are called support vectors. Any given hyperplane can be expressed as w⃗ · x⃗ + b = 0 and the width of the margin is γ = 2/‖w⃗‖. The optimal separating hyperplane can therefore be found by maximizing γ such that (5) holds. The case discussed above is the ideal, separable case; for non-linearly separable problems, the optimization is extended to include a penalty term for misclassifications (Hastie et al. 2009).
The support vector machine was modeled using the e1071 package in R (Ihaka and Gentleman 1996). The classification was implemented using a linear kernel and the cost of misclassification was set to the value resulting in the minimum error rate.
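The margin quantities above can be checked numerically. The sketch below is ours and uses a hand-picked hyperplane rather than a trained SVM; it verifies the margin constraint (5) and computes the margin width 2/‖w⃗‖.

```python
import math

def decision(w, b, x):
    """Evaluate w . x + b for a candidate hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin(w, b, data):
    """Check constraint (5): y_i (w . x_i + b) >= 1 for all (x_i, y_i)."""
    return all(y * decision(w, b, x) >= 1 for x, y in data)

def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

For instance, the hyperplane w⃗ = (1, 0), b = −3 separates points with a first coordinate below 2 from points above 4 with a margin width of 2, while b = 0 violates the constraint on the same data.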
6.3.5 Nearest Neighbor
The nearest neighbor procedure is defined as follows: for each new object, find the k most similar objects in the training set based on a distance metric. The cluster to which the new instance is assigned is the most frequent out of all k nearest objects.
Three nearest neighbor approaches were defined for the purposes of this paper: nearest median, nearest mean and minimum distance. Given a partial epidemic curve, the squared Euclidean distance was calculated between the partial epidemic curve and the median (mean) curve of each of the epidemic clusters. The partial curve was then assigned to the epidemic cluster with the nearest median (mean) curve. The minimum distance approach is based on the typical nearest neighbor approach: a new epidemic curve is compared to all curves in the library and then assigned to the cluster containing the closest curve.
The nearest median and mean approaches are less computationally intensive than the usual nearest neighbor method since the new object is not compared to every object in the training set.
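The three decision rules above can be sketched as follows. This is an illustrative implementation of the described distance rules on partial curves truncated at day j; the toy library and function names are ours.

```python
import statistics

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def summary_curve(curves, stat):
    """Pointwise mean or median curve of a cluster."""
    return [stat(day) for day in zip(*curves)]

def classify(partial, library, rule="mean"):
    """Assign a partial curve to a cluster.

    library: dict mapping cluster label -> list of full epidemic curves.
    rule: "mean", "median" (nearest summary curve) or "minimum" (nearest curve).
    """
    j = len(partial)
    if rule == "minimum":  # compare against every curve in the library
        return min(((sqdist(partial, c[:j]), k)
                    for k, curves in library.items() for c in curves))[1]
    stat = statistics.mean if rule == "mean" else statistics.median
    return min((sqdist(partial, summary_curve([c[:j] for c in curves], stat)), k)
               for k, curves in library.items())[1]
```

Note that the "mean" and "median" rules compute one distance per cluster, while "minimum" computes one per training curve, which is why the summary-based rules are cheaper.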
Contributor Information
Elaine O. Nsoesie, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute at Virginia Tech
Richard Beckman, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute at Virginia Tech.
Madhav Marathe, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute at Virginia Tech; Computer Science Department at Virginia Tech.
Bryan Lewis, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute at Virginia Tech.
References
- Agresti A. Categorical Data Analysis. 2 New York, NY: John Wiley & Sons; 2002. [Google Scholar]
- Bailey N. The Mathematical Theory of Infectious Diseases and its Applications. London: Griffin; 1975. [Google Scholar]
- Barrett C, Beckman R, Khan M, Kumar VSA, Marathe M, Stretz P, Dutta T, Lewis B. Generation and analysis of large synthetic social contact networks. Winter Simulation Conference, WSC ’09; 2009. pp. 1003–1014. [Google Scholar]
- Barrett C, Bisset K, Leidig J, Marathe A, Marathe M. Economic and social impact of influenza mitigation strategies by demographic class. Epidemics. 2011;3:19–31. doi: 10.1016/j.epidem.2010.11.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barrett CL, Beckman R, Berkbigler K, Bisset K, Bush K, Campbell K, Eubank S, Henson K, Hurford J, Kubicek D, Marathe M, Romero P, Smith J, Smith L, Speckman P, Stretz P, Thayer G, Van Eeckhout E, Williams M. TRANSIMS: Transportation analysis simulation system. Technical Report, LA-UR-00-1725, Los Alamos National Laboratory Unclassified Report. 2001;3 URL http://ndssl.vbi.vt.edu/transims.php. [Google Scholar]
- Beckman R, Baggerly K, Mckay M. Creating synthetic baseline populations. Transportation Research Part A: Policy and Practice. 1996;30:415–429. [Google Scholar]
- Bisset K, Chen J, Feng X, Kumar VSA, Marathe M. Epifast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems. Proceedings of the 23rd international conference on Supercomputing, ICS’. 2009;09:430–439. [Google Scholar]
- Bowman J. Activity-based disaggregate travel demand model system with activity schedules. Transportation Research Part A: Policy and Practice. 2001;35:1–28. [Google Scholar]
- Bowman J, Bradley M, Shiftan Y, Lawton TK, Ben-Akiva M. Demonstration of an activity based model system for Portland. Proceedings of the 8th World Conference on Transport Research.1998. [Google Scholar]
- Breiman L. Bagging predictors. Machine Learning. 1996;24:123–140. [Google Scholar]
- Breiman L. Random forests. Machine Learning. 2001;45:5–32. [Google Scholar]
- Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. Proceedings of the 23rd International Conference on Machine Learning, ICML ’06; 2006. pp. 161–168. [Google Scholar]
- Cauchemez S, Donnelly CA, Reed C, Ghani AC, Fraser C, Kent CK, Finelli L, Ferguson NM. Household transmission of 2009 pandemic influenza A (H1N1) virus in the United States. New England Journal of Medicine. 2009;361:2619–2627. doi: 10.1056/NEJMoa0905498. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chan P, Stolfo S. A comparative evaluation of voting and meta-learning on partitioned data. Proceedings of the Twelfth International Conference on Machine Learning; Morgan Kaufmann. 1995. pp. 90–98. [Google Scholar]
- Deardon R, Brooks SP, Grenfell BT, Keeling MJ, Tildesley MJ, Savill N. Inference for individual level models of infectious diseases in large populations. Statistica Sinica. 2010;20:239–261. [PMC free article] [PubMed] [Google Scholar]
- Dietterich T. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation. 1998;10:1895–1923. doi: 10.1162/089976698300017197. [DOI] [PubMed] [Google Scholar]
- Dietterich T. Machine learning for sequential data: A review. Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition; 2002. pp. 15–30. [Google Scholar]
- Elveback L, Fox J, Ackerman E, Langworthy A, Boyd M, Gatewood L. American Journal of Epidemiology. 1976;103:152–165. doi: 10.1093/oxfordjournals.aje.a112213. [DOI] [PubMed] [Google Scholar]
- Epstein J. Modelling to contain pandemics. Nature. 2009:687. doi: 10.1038/460687a. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eubank S, Barrett C, Beckman R, Bisset K, Durbeck L, Kuhlman C, Lewis B, Marathe A, Marathe M, Stretz P. Detail in network models of epidemiology: are we there yet? Journal of Biological Dynamics. 2010;4:446–455. doi: 10.1080/17513751003778687. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eubank S, Guclu H, Kumar VSA, Marathe M, Srinivasan A, Toroczkai Z, Wang N. Modelling disease outbreaks in realistic urban social networks. Nature. 2004 doi: 10.1038/nature02541. [DOI] [PubMed] [Google Scholar]
- Everitt B. The Analysis of Contingency Tables. 2 London: Chapman & Hall; 1992. [Google Scholar]
- Ferguson N, Cummings D, Cauchemez S, Fraser C, Riley S, Meeyai A, Iamsirithaworn S, Burke D. Strategies for containing an emerging influenza pandemic in Southeast Asia. Nature. 2005;43:209–214. doi: 10.1038/nature04017. [DOI] [PubMed] [Google Scholar]
- Ferguson N, Cummings D, Fraser C, Cajka J, Cooley P, Burke D. Strategies for mitigating an influenza pandemic. Nature. 2006;442:448–452. doi: 10.1038/nature04795.
- Friedman J. Multivariate adaptive regression splines. Annals of Statistics. 1991;19:1–67.
- Goldstein E, Apolloni A, Lewis B, Miller J, Macauley M, Eubank S, Lipsitch M, Wallinga J. Distribution of vaccine/antivirals and the “least spread line” in a stratified population. Journal of the Royal Society Interface. 2010;7:755–764. doi: 10.1098/rsif.2009.0393.
- Hall IM, Gani R, Hughes HE, Leach S. Real-time epidemic forecasting for pandemic influenza. Epidemiology and Infection. 2007;135:372–385. doi: 10.1017/S0950268806007084.
- Halloran E, Longini I, Cowart M, Nizam A. Community interventions and the epidemic prevention potential. Vaccine. 2002;20:3254–3262. doi: 10.1016/s0264-410x(02)00316-x.
- Halloran ME, Ferguson N, Eubank S, Longini I, Cummings D, Lewis B, Xu S, Fraser C, Vullikanti A, Germann T, Wagener D, Beckman R, Kadau K, Barrett C, Macken C, Burke D, Cooley P. Modeling targeted layered containment of an influenza pandemic in the United States. Proceedings of the National Academy of Sciences. 2008. doi: 10.1073/pnas.0706849105.
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer; 2009.
- Hethcote HW. The mathematics of infectious diseases. SIAM Review. 2000;42:599–653.
- Holmes C, Adams N. A probabilistic nearest neighbour method for statistical pattern recognition. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2002;64.
- Ihaka R, Gentleman R. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996:299–314.
- Jiang X, Wallstrom G, Cooper G, Wagner M. Bayesian prediction of an epidemic curve. Journal of Biomedical Informatics. 2009;42:90–99. doi: 10.1016/j.jbi.2008.05.013.
- Lekone P, Finkenstädt B. Statistical inference in a stochastic epidemic SEIR model with control intervention: Ebola as a case study. Biometrics. 2006;62:1170–1177. doi: 10.1111/j.1541-0420.2006.00609.x.
- Longini I, Nizam A, Xu S, Ungchusak K, Hanshaworakul W, Cummings D, Halloran E. Containing pandemic influenza at the source. Science. 2005;309:1083–1087. doi: 10.1126/science.1115717.
- McKinley T, Cook A, Deardon R. Inference in epidemic models without likelihoods. The International Journal of Biostatistics. 2009;5.
- Newman M. The structure and function of complex networks. SIAM Review. 2003;45:167–256.
- Nishiura H. Real-time forecasting of an epidemic using a discrete time stochastic model: a case study of pandemic influenza (H1N1-2009). BioMedical Engineering Online. 2011;10. doi: 10.1186/1475-925X-10-15.
- Nsoesie E, Beckman R, Marathe M. Estimation of an epidemic curve during an outbreak: A classification approach. Proceedings of the Joint Statistical Meetings, Section on Statistics and Epidemiology. 2010:5177–5191.
- Ohkusa Y, Sugawara T, Taniguchi K, Okabe N. Real-time estimation and prediction for pandemic A/H1N1(2009) in Japan. Journal of Infection and Chemotherapy. 2011. doi: 10.1007/s10156-010-0200-3.
- Speckman P, Vaughn K, Pas E. Generating household activity-travel patterns (HATPs) for synthetic populations. Transportation Research Board 1997 Annual Meeting; 1997a.
- Speckman P, Vaughn K, Pas E. A continuous spatial interaction model: Application to home-work travel in Portland, Oregon. Transportation Research Board 1997 Annual Meeting; 1997b.
- TRB. Transportation Research Board Annual Meetings; 1998–2006.
- TRBC. 5th–9th Biennial National Academies Transportation Research Board Conferences on Application of Transportation Planning Methods; 1995–2003.
- Tumer K, Ghosh J. Linear and order statistics combiners for pattern classification. In: Combining Artificial Neural Nets. Springer-Verlag; 1999. pp. 127–162.
- Vapnik V. The nature of statistical learning theory. New York, NY, USA: Springer-Verlag New York, Inc; 1995.
- Wu W, Mallet Y, Walczak B, Penninckx W, Massart D, Heuerding S, Erni F. Comparison of regularized discriminant analysis, linear discriminant analysis and quadratic discriminant analysis applied to NIR data. Analytica Chimica Acta. 1996;329:257–265.