Abstract
Communicable diseases ‘flow’ between locations. These flows dictate where and when certain communities will be affected. While the prediction of disease flows is essential for the timely intervention of epidemics, few studies have addressed this critical issue. This study predicts disease flows during an epidemic by considering the epidemiological, network, and temporal contextual factors using a deep learning approach. A series of scenario analyses helps identify the effects of these contextual factors on disease flows.
Results show that the extended spatial-temporal effect of the epidemiological factors stimulates disease flows. The compound effects of the network factors enhance the transmission efficiency of these flows. Lastly, the temporal effect accelerates the combined effects of epidemiological and network factors on the flows. Findings of this study reveal the intricate nature of disease flows and lay a solid foundation for real-time surveillance of epidemics and pandemics to inform timely interventions for a broad range of communicable diseases.
Keywords: disease flows, influenza, location network
1. Introduction
The dispersion of communicable diseases, such as influenza, is a dynamic process. A disease ‘flows’ from a source location to a target location (Cooley, Ganapathi, Ghneim, Holmberg, & Wheaton, 2008; Zhong & Bian, 2016). An infection case initially occurs at a source location and re-occurs at a target location as the individual who carries the infection travels. At the target location, this case may cause new cases within seven days, the typical infectious period of influenza transmission (Control & Prevention, 2011; Fiore et al., 2010; Heyman, 2004). As a chain effect, the target location of the initial case becomes the source location of the new cases, and these new cases may have their own target locations. As flows that originated at source locations actively generate new cases at target locations, these chained spatial flows of disease dictate where and when certain communities and individuals will be affected. While the timely prediction of disease flows is essential for the effective intervention of epidemics and pandemics, few studies have paid attention to the dynamics of flows and what drives the flows (Li et al., 2019; Shi & Kwan, 2015; Shi & Wang, 2015; Zhu, Huang, Shi, Wu, & Liu, 2018).
Prevailing epidemic research has mostly focused on a synoptic view of health outcomes, for example, the total population affected or the duration of an epidemic (Bian et al., 2012; Nsoesie, Brownstein, Ramakrishnan, & Marathe, 2014; Tizzoni et al., 2012). Less attention has been paid to localized and dynamic context. Some recent studies have begun to explore this context, for instance, the variation of disease cases at specific locations, while few have looked into how diseases flow between locations (Pastor-Satorras, Castellano, Van Mieghem, & Vespignani, 2015; Shaman & Karspeck, 2012). Specifically, given the footprint of past flows and the fronts of current flows, where is the disease potentially flowing to? Which individuals and communities are on the pathways of the flow?
The heterogeneous conditions at various locations give rise to a large number of localized and dynamic contextual factors that drive the flow (Ewing, Lee, Viboud, & Bansal, 2017). A thoughtful examination of these factors is essential for the timely deployment of disease intervention strategies. Among the many factors to consider, epidemiological factors, network factors, and temporal factors are the most relevant (Cauchemez et al., 2011; Charaudeau, Pakdaman, & Boelle, 2014; Salathe et al., 2010).
Epidemiological factors
The number of disease cases at a location is one of the most common epidemiological factors, as it indicates the potential of the disease flows. However, this set of factors has often been considered in an aggregated manner, e.g., the total number of disease cases over an aggregated area throughout the entire time period of an epidemic (Chowell & Rothenberg, 2018; Eggo, Cauchemez, & Ferguson, 2011; Gog et al., 2014; Riley, Eames, Isham, Mollison, & Trapman, 2015). Few studies have been concerned with their more resolved effects on disease flows.
Some of the recent studies are indeed more resolved in space and time, but limited to the specific date or location. The spatially-temporally extended effects of epidemiological factors have not been accounted for, e.g., whether disease flows that occur a few days earlier can affect the flows today, and whether the cases at neighboring locations sustain existing disease flows at the current location. These extended epidemiological contexts could have profound effects on disease flows but are not yet known to us.
Network factors
The disease flow between locations can be perceived as a location-centric network in which locations are nodes and the disease flows between them are edges (Zhong & Bian, 2016). The position of a location node in the network often determines how the disease flows. Centrally positioned nodes might be more prone to generating (or receiving) disease flows than peripheral nodes (Danon et al., 2011; Pellis et al., 2015).
Network factors are used to measure the positions of the nodes in a network. A node is considered centrally positioned if it reaches all the other nodes easily or if it serves as a popular node (Freeman, 2004; Newman, 2018). Current studies tend to consider the effects of these network measures individually but rarely their compound effects. In a complex network, a single measure might not be sufficient to understand the effect of location nodes on disease flows (Martincic-Ipsic, Mocibob, & Perc, 2017; Patel et al., 2015).
Temporal factor
In an epidemic, a large number of flows occurs during the peak time; the opposite occurs during troughs that represent a latent period for subsequent and often severe peaks. Existing studies are primarily concerned with peaks but rarely focus on the lapse between peaks and troughs (Conlan & Grenfell, 2007; Dushoff, Plotkin, Levin, & Earn, 2004; Wagner et al., 2001; Wilson & Brownstein, 2009). The lapse often indicates the lagged effect of the contextual factors on disease flows. For instance, how many days had passed before the epidemiological and network factors had started to affect the disease flows today? Although this temporal factor is implicitly embedded in the epidemiological and network factors, it should be considered to be as important as the other two factors to better understand the dynamic nature of disease flows.
An added complexity is the various combinations of three sets of factors as well as the ways in which they change over time. This presents additional challenges to understanding disease flows. An approach is needed that can address a wide range of factors and their associated complexities in predicting dynamic flows.
A deep learning approach, such as the Convolutional Neural Network (CNN), is part of a broader family of machine learning methods used for prediction and classification (Bengio, 2009; Goodfellow, Bengio, & Courville, 2016; Hinton, Osindero, & Teh, 2006; LeCun, Bengio, & Hinton, 2015). Although originally designed for image recognition, CNN has been extended to a variety of other applications, such as predicting temporal dynamics (Henaff, 2015; Silver et al., 2016).
By design, CNN uses a set of convolutional kernels to extract candidate features from the input factors, where the candidates should represent the essential characteristics of the input. A pooling process further filters improved features from the candidates; then, the kernels are updated through a series of iterations until the performance converges, where the predicted outcomes (presence of disease flows) well approximate the observed outcome (Krizhevsky, Sutskever, & Hinton, 2012; LeCun et al., 2015).
Leveraging these design principles, CNN is adopted in this study to capture the association between the known flows and the contextual factors observed on earlier days, then the model uses the established association to predict potential flows in subsequent days. The convolutional process of CNN can fuse a massive amount of contextual factors and help identify their compound effects (Shin et al., 2016). CNN is also known to amplify the effective factors (or their compound effects) through the convolutional process (Park, Han, Berg, & Berg, 2016).
In particular, CNN has overcome the limitations of conventional machine learning approaches, such as artificial neural networks (ANN), which often require manual selection of features from input data to complete the training process (Larsson, Maire, & Shakhnarovich, 2016; LeCun et al., 2015; Szegedy, Toshev, & Erhan, 2013). In contrast, the ability of CNN to automatically extract representative features renders the approach remarkably advantageous to achieve the above strengths.
As encountered by various prediction approaches, the dilemma between ‘goodness of fit’ and ‘interpretability of the model’ is a challenge; this is also true for deep learning methods. Intuitive understanding of the mechanism, especially how the factors contribute to the final results, remains challenging. Overcoming this may require complementary methods, such as the scenario analysis, to decipher the semantics of how the aforementioned factors, individually or in combination, contribute to the disease flows.
This study intends to predict disease flows in an urban area during a seasonal influenza epidemic while leveraging the capabilities of CNN. Specifically, the objective of this study is twofold. First, we predict the potential disease flows based on three sets of localized and dynamic contextual factors using CNN. The performance of CNN is compared with that of an analysis using ANN. Second, we evaluate the effects of the factors, individually and combined, on the disease flow through a series of scenario analyses to help interpret the prediction results.
Results of this study may shed light on the effect of localized dynamic factors in driving disease flows. Disease dynamics have rarely been investigated at both high spatial resolution (to specific locations) and high temporal resolution (on a daily basis). For a public health crisis such as a pandemic, findings of this study help contain the localized source of infection, identify the dynamic transmission pathways, and inform timely and spatially sensitive disease intervention strategies. Methodologically, the deep learning approach, complemented by scenario analysis, reveals the mechanism of how these localized dynamic factors affect disease flows. The methodological design of this study is also applicable to studies of a broad range of communicable diseases.
The remainder of this paper is organized as follows. Section 2 provides a description of the data and study area. Section 3 details the methodology design of the CNN prediction model and the scenario analysis. Section 4 presents the prediction results and estimates how the contextual factors affect the disease flows. Section 5 presents the conclusions.
2. Data and Study Area
The disease flow prediction is conducted in the metropolitan area of Lanzhou in Midwest China, a typical mid-sized city of the country. The metropolitan area was swept by a seasonal influenza epidemic in a recent year. The epidemic lasted for 72 days from September to mid-November (Chinese National Influenza Center, 2019; WHO FluNet, 2019). The dataset consists of the symptom onset date of the influenza cases and the residence and workplace (including schools and universities) addresses associated with each case. The information is obtained from the China Information System for Diseases Control and Prevention (Chinese National Influenza Center, 2019). The use of these data was approved by the Institutional Review Board (IRB) at the authors’ institute. A total of 1,026 locations were obtained; each residential address identifies a named residential community and each workplace address identifies an area equivalent to a U.S. census block group.
To prepare the network factors (details in Section 3.1), disease flow networks are constructed using the data discussed above. A disease flow is present between a residence-workplace pair if an influenza case occurred at a source location (e.g., residence), re-occurred at a target location (e.g., workplace), and was followed by new cases at the target location within the typical seven-day infectious period (Fiore et al., 2010; Heymann, 2008). A total of 72 disease flow networks are built, one for each day of the epidemic. Each of the networks consists of all 1,026 location nodes, while the number of disease flows between locations varies on a daily basis.
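The flow definition above can be sketched in code. This is only one simplified reading of the rule, not the study's actual procedure: the record layout `(onset_day, source, target)`, the function name `build_daily_flows`, and the choice to count any other case associated with the target location (as residence or workplace) as a "new case" are all assumptions made for illustration.

```python
from collections import defaultdict

INFECTIOUS_DAYS = 7  # typical infectious period cited in the text


def build_daily_flows(cases):
    """Group flows by day from (onset_day, source, target) case records.

    Assumed rule: a case creates a source -> target flow on its onset day
    if some other case tied to the target location has its onset within
    the following seven days.
    """
    # Index onset days by every location a case is associated with.
    onsets_by_location = defaultdict(list)
    for idx, (day, src, tgt) in enumerate(cases):
        onsets_by_location[src].append((day, idx))
        onsets_by_location[tgt].append((day, idx))

    flows = defaultdict(set)
    for idx, (day, src, tgt) in enumerate(cases):
        followed = any(
            other_idx != idx and day < other_day <= day + INFECTIOUS_DAYS
            for other_day, other_idx in onsets_by_location[tgt]
        )
        if followed:
            flows[day].add((src, tgt))
    return dict(flows)


# Hypothetical toy records, not data from the study.
toy_cases = [
    (1, "community_A", "school_X"),   # initial case
    (4, "community_B", "school_X"),   # later case tied to school_X
    (20, "community_C", "office_Y"),  # no follow-up case at office_Y
]
daily_flows = build_daily_flows(toy_cases)
print(daily_flows)
```

In this toy input only the first case yields a flow, because it is the only one followed by another case at its target location within seven days.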
The daily network captures the dynamics of disease flows and preserves the temporal continuity of flows across days, as the target locations in the previous day serve as the source locations of the next day. Accordingly, the number of cases at the target locations of the previous day is the number of cases at the source locations of the current day. These source and target locations and the time-stamped disease flows form a dynamic disease flow network over the epidemic (Figure 1). The number of cases at each location, the flow between the residence-workplace pair, and the daily disease network are used to support the derivation of the aforementioned epidemiological factors, network factors, and temporal factors.
Figure 1.
The disease flow networks throughout the entire epidemic. Dots denote the 1,026 locations and the color code of each link denotes the day each disease flow occurs.
3. Methodology
The three sets of contextual factors derived from the data are inputted into the CNN prediction model and the ANN model. The association between these factors and the observed flows on earlier dates is established in the CNN and ANN models, and then used to predict the potential flows on later dates. A series of scenario analyses are performed to evaluate the effects of the contextual factors on disease flows by experimenting with their various combinations.
3.1. Contextual factors
Each of the three sets of contextual factors (epidemiological, network, and temporal) contains multiple factors. The first set contains three epidemiological factors. The first factor is the number of cases at both the source and target locations a certain number of days prior to the current day. The second factor is the number of cases at the neighboring locations of the source and target locations in the prior days, respectively. The third factor is the presence of disease flows between the given location pair in the prior days. Both the prior days and the neighboring locations represent the extended spatial-temporal effects of these epidemiological factors.
The second set, the network factors, consists of ten factors. They are essential network metrics, used here to characterize how the position of the location nodes facilitates the disease flows. The first factor represents reachability: the shortest path length, which is the network distance between two location nodes. The remaining nine factors are structure-based factors: closeness, eccentricity, radiality, degree, clustering coefficient, topological coefficient, betweenness, bridging, and eigen-centrality. These factors measure how central the nodes are in disease transmission. More details can be found in Appendix A and in Freeman (2004), Newman (2018), and Wasserman and Faust (1994).
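To make a few of these metrics concrete, the sketch below computes shortest path length, degree, closeness, and eccentricity on a toy undirected network using a breadth-first search. It uses common textbook definitions (closeness as the number of reachable nodes divided by the sum of distances to them), which may differ in normalization from the definitions in the appendix; the function names and the toy network are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, start):
    """Hop counts from start to every reachable node (unweighted BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def node_metrics(adjacency, node):
    """Degree, closeness, and eccentricity of one node."""
    dist = bfs_distances(adjacency, node)
    others = [d for n, d in dist.items() if n != node]
    return {
        "degree": len(adjacency[node]),
        # Reachable nodes divided by total distance to them.
        "closeness": len(others) / sum(others) if others else 0.0,
        # Longest shortest path from this node.
        "eccentricity": max(others) if others else 0,
    }


# Toy path network a - b - c - d (hypothetical, not study data).
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
path_a_to_d = bfs_distances(toy, "a")["d"]
metrics_b = node_metrics(toy, "b")
print(path_a_to_d, metrics_b)
```

Node `b` sits closer to the middle of the path than node `a`, so it has a higher degree and closeness and a lower eccentricity, which is the sense in which such metrics describe central position.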
The third set of contextual factors, the temporal factors, although inherently integrated into the epidemiological and network factors, is represented as the number of days prior to the current day when the epidemiological and network factors are used to predict flows on subsequent days.
These three sets of contextual factors are utilized for two different purposes. First, they are inputted into the CNN model and the ANN model as input contextual factors to predict disease flows. Second, they are used in the scenario analysis (detailed below) in various combinations to evaluate their effects on the presence of disease flows.
3.2. Prediction models
The CNN model predicts the presence of disease flow as a binary value, via a training and a testing process. The training process captures the association between the contextual factors and flows; the testing process predicts the potential flows. The training process is formalized on a rolling basis, where flows on earlier days are utilized to predict those of later days, so as not to break the temporal continuity of disease flows:
$$y_{ij}^{t} = f\left(X_{ij}^{t-1}, X_{ij}^{t-2}, \ldots, X_{ij}^{t-n};\ W\right) \qquad \text{(Eq. 1)}$$
where $y_{ij}^{t}$ (either 1 or 0) is an indicator of flow presence between Source Location i and Target Location j on Day t; the input set ($X_{ij}^{t-1}$, $X_{ij}^{t-2}$, …, $X_{ij}^{t-n}$) represents the contextual factors of Locations i and j observed on one day, two days, up to n days prior to Day t, respectively. The n days are initially set as seven days, according to the typical infectious period of influenza transmission, to predict the flows in subsequent days (Carrat et al., 2008; Fiore et al., 2010; Heymann, 2008). The last term, W, is a set of weights to be solved in the CNN model. In the subsequent testing process, the solved W is utilized to predict flows in the t+n days. The testing process is formalized accordingly as:
$$\hat{y}_{ij}^{t+n} = f\left(X_{ij}^{t+n-1}, X_{ij}^{t+n-2}, \ldots, X_{ij}^{t};\ W\right) \qquad \text{(Eq. 2)}$$
The 72 daily networks are divided into the training set and the testing set using the common 75%-versus-25% rule (Weiss & Provost, 2003). The first 54 days are allocated to the training set; the remaining 18 days are assigned to the testing set. For the 54 days, data in each seven-consecutive-day set are used to predict the flows of the eighth day on a continuous, one-day increment, rolling basis. Furthermore, to explore whether the training/testing division has an impact on the prediction, the 75%-versus-25% rule is extended to both the short- and long-end, ranging from 70%-versus-30% (50 days for training and 22 days for testing) to 80%-versus-20% (58 days for training and 14 days for testing).
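The rolling scheme described above can be sketched as follows. The helper `rolling_windows` and the 1-based day numbering are assumptions for illustration; the sketch only reproduces the day bookkeeping, not the model training itself.

```python
TOTAL_DAYS = 72
LAG = 7


def rolling_windows(first_day, last_day, lag):
    """Yield (input_days, target_day) pairs inside [first_day, last_day].

    Each target day is predicted from the `lag` days immediately before it,
    and the window advances by one day at a time.
    """
    for target in range(first_day + lag, last_day + 1):
        yield list(range(target - lag, target)), target


# Training days implied by each division of the 72-day epidemic.
train_days_by_share = {
    share: round(share * TOTAL_DAYS) for share in (0.70, 0.75, 0.80)
}

train_days = train_days_by_share[0.75]
train_pairs = list(rolling_windows(1, train_days, LAG))
test_targets = list(range(train_days + 1, TOTAL_DAYS + 1))

print(train_days_by_share)
print(len(train_pairs), train_pairs[0], train_pairs[-1])
print(len(test_targets))
```

With the 75%-versus-25% division and a seven-day lag, the first window uses Days 1 to 7 to predict Day 8 and the last uses Days 47 to 53 to predict Day 54, giving 47 training targets and 18 testing days.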
Moreover, the initial temporal lag of seven days is expanded to a range of three to ten days, with a one-day increment. The three-day lag corresponds to the typical latent period of influenza transmission (Control & Prevention, 2011); the ten-day lag features the average interval between peaks and troughs during the epidemic.
The size of the training set is 129,944 × (54 − 7) = 6,107,368 entries, where 129,944 is the number of residence-workplace location pairs in each daily network; this includes both the presence and absence of flows. The term 54 − 7 = 47 refers to the number of daily networks in the training set. Given the initial seven-day temporal lag, the first seven daily networks are only used as input in the training process, while the eighth to 54th daily networks (i.e., the 54 − 7 = 47 networks) are used for both input and output (Eq. 1). Accordingly, the size of the testing set is 129,944 × (72 − 54) = 2,338,992 entries, where 72 − 54 = 18 refers to the number of daily networks in the testing set, given that 72 daily networks are constructed in total (see Section 2), of which 54 have been allocated to the training set. Furthermore, when the temporal lag is varied from three to ten days, the corresponding sizes of the training set range from 6,627,144 to 5,717,536. Accordingly, the sizes of the testing set range from 2,858,768 to 1,949,160.
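The training-set arithmetic can be checked with a few lines of code. The sketch assumes the rule stated above, namely pairs × (training days − lag), and reproduces only the training sizes for lags of three to ten days plus the testing size for the seven-day lag; the rule behind the other testing sizes is not spelled out in the text, so it is not reproduced here.

```python
PAIRS = 129_944      # residence-workplace pairs per daily network
TRAIN_DAYS = 54
TOTAL_DAYS = 72


def training_entries(lag):
    """Entries in the training set for a given temporal lag (in days)."""
    return PAIRS * (TRAIN_DAYS - lag)


training_sizes = {lag: training_entries(lag) for lag in range(3, 11)}
testing_entries_lag7 = PAIRS * (TOTAL_DAYS - TRAIN_DAYS)

for lag, size in training_sizes.items():
    print(f"lag={lag:2d} days -> {size:,} training entries")
print(f"testing entries (seven-day lag): {testing_entries_lag7:,}")
```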
At the beginning of the training process, the input factors are organized in a matrix format and fed into the CNN model (Figure 2). The kernels in the CNN model are initialized from random Gaussian distributions. The values of the hyper-parameters, including weight decay and learning rate, are tuned experimentally and selected following the comprehensive guidelines presented in Shin et al. (2016) to facilitate training convergence and to prevent under-fitting and over-fitting. Training convergence is achieved within five epochs.
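As a rough illustration of what a single convolution-and-pooling pass over such an input matrix looks like, the sketch below uses only the standard library. Everything specific in it is an assumption rather than the architecture of the study: the 13 × 7 layout (13 epidemiological and network factors by seven lag days), the 3 × 3 kernel, the Gaussian standard deviation of 0.1, the ReLU activation, and the 2 × 2 max pooling.

```python
import random


def conv2d_valid(matrix, kernel):
    """Slide the kernel over the matrix (no padding) and apply ReLU."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for c in range(cols - k_cols + 1):
            total = sum(
                matrix[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            out_row.append(max(0.0, total))  # ReLU keeps positive responses
        out.append(out_row)
    return out


def max_pool(feature_map, size=2):
    """Keep the strongest response in each non-overlapping size x size block."""
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(0, n_cols - size + 1, size)
        ]
        for r in range(0, n_rows - size + 1, size)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical input for one location pair: 13 factors x 7 lag days.
factors = [[rng.random() for _ in range(7)] for _ in range(13)]
# Kernel weights drawn from a Gaussian, as an example initialization.
kernel = [[rng.gauss(0.0, 0.1) for _ in range(3)] for _ in range(3)]

feature_map = conv2d_valid(factors, kernel)
pooled = max_pool(feature_map)
print(len(feature_map), len(feature_map[0]), len(pooled), len(pooled[0]))
```

In an actual model the kernel weights would be updated by training rather than left at their initial random values, and several kernels and layers would typically be stacked.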
Figure 2.
A schematic illustration of the Convolutional Neural Network architecture.
In the testing process, the prediction accuracy is evaluated by comparing the predicted flows with the actual flows. The F1 score is used for the evaluation, which integrates two accuracy measures: precision and recall. The precision focuses on the predicted flows to enumerate how many are actually correct. The recall focuses on the actual disease flows to enumerate how many of them are predicted.
Precision and recall are two separate perspectives of an accuracy measurement. Over- or under-prediction of flows are inevitably reflected in either the precision or recall (Ali, Shamsuddin, & Ralescu, 2015; Japkowicz & Stephen, 2002; Ma, Zhong, Gao, & Bian, 2019). The F1 score is adopted to balance the two measurements in order to comprehensively reflect the prediction accuracy (Powers, 2011; Sasaki, 2007). As shown below, the F1 measure is defined as the harmonic mean of precision and recall. A high F1 score indicates a balanced performance on the prediction of flows. The maximum F1 value is 1, which means perfect precision and recall; the minimum value is 0.
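$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad \text{(Eq. 3)}$$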
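The harmonic-mean definition of the F1 score can be illustrated with a short function over binary flow indicators. The function name and the toy vectors are hypothetical; the zero-division convention (returning 0 when a denominator is empty) is an assumption, not something stated in the text.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for binary (0/1) flow indicators."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # flows correctly predicted
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # predicted but absent
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # present but missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy indicators for six location pairs (not study data).
predicted = [1, 1, 0, 1, 0, 0]
actual = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(predicted, actual)
print(precision, recall, f1)
```

Because absent flows that are correctly predicted as absent do not enter either measure, the score is not inflated by the large number of location pairs without flows.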
To further evaluate the prediction model, the performance of CNN is compared with that of ANN. The ANN model is implemented using a typical three-layer architecture following the same design in Eq.1 (Jain, Mao, & Mohiuddin, 1996). The weight term W is obtained by tuning the ANN model with the commonly used back-propagation algorithm (LeCun et al., 1990). The contextual factors, the size and division of the training and testing data, and the temporal lags remain the same as in the CNN model.
3.3. Scenario analyses
To investigate the effects of the epidemiological and network factors on disease flows, the three epidemiological factors and the ten network factors (totaling 13) are removed from the training set both individually and in different combinations. The changes in the corresponding prediction accuracy are expected to reveal how the removed factor(s) contributed to predicting the influenza transmission flows. Two removal strategies are used. The first strategy uses a one-at-a-time removal process that excludes one factor each time while leaving the remaining 12 factors intact. This strategy yields 13 scenarios. The second strategy uses an exhaustive-combination removal process that excludes all possible combinations of the 13 factors (i.e., all combinations of one factor, two factors, …, and 13 factors), one combination at a time. This strategy yields 2^13 = 8,192 scenarios. The epidemiological, network, and temporal effects on the disease flows from individual factors and their combinations are analyzed.
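The two removal strategies can be enumerated with `itertools.combinations`. The factor identifiers below are hypothetical labels for the 13 factors named in Section 3.1. Note that enumerating non-empty removal sets gives 2^13 − 1 = 8,191 combinations; the count reaches 2^13 = 8,192 when the case with nothing removed (the full model) is included as well.

```python
from itertools import combinations

EPIDEMIOLOGICAL = [
    "cases_at_locations",
    "cases_at_neighbors",
    "prior_flow_presence",
]
NETWORK = [
    "shortest_path_length", "closeness", "eccentricity", "radiality",
    "degree", "clustering_coefficient", "topological_coefficient",
    "betweenness", "bridging", "eigen_centrality",
]
FACTORS = EPIDEMIOLOGICAL + NETWORK

# Strategy 1: remove one factor at a time, keeping the other 12.
one_at_a_time = [(factor,) for factor in FACTORS]


def removal_scenarios(factors):
    """Yield every non-empty set of factors to remove, smallest first."""
    for size in range(1, len(factors) + 1):
        yield from combinations(factors, size)


# Strategy 2: remove every possible combination of factors.
exhaustive = list(removal_scenarios(FACTORS))

print(len(one_at_a_time), len(exhaustive), 2 ** len(FACTORS))
```

Each scenario would then be evaluated by retraining or re-running the model without the listed factors and recording the change in prediction accuracy.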
4. Results and Discussion
The results of disease flow prediction are first reported, including the performance of the CNN model in comparison to that of the ANN model using the three sets of factors. The effects of the factors on disease flows are subsequently analyzed in the results of scenario analyses.
4.1. Prediction results
The CNN prediction model achieves an accuracy of 78.1% for disease flows between all of the 2,338,992 residence-workplace location pairs for the testing period. The approach performs consistently across different training vs. testing divisions, from the 70%-versus-30% division (50 days for training and 22 days for testing) to the 80%-versus-20% division (58 days for training and 14 days for testing) (Figure 3, left). This result demonstrates the robustness of the CNN model, as the training/testing division has little impact on the prediction, avoiding both the under-fitting and over-fitting problems.
Figure 3.
Prediction accuracy of disease flows by CNN (left) and ANN (right), with respect to temporal lag (horizontal axis) and the training vs. testing divisions (vertical axis). The green cells represent high prediction accuracies, while the gray ones are low accuracies.
In the dimension of temporal lag (horizontal), the prediction of the CNN model performs consistently well (over 80%) when the temporal lag is five days or shorter. Throughout the epidemic, there are two peaks and a major trough (Figure 4), each lasting approximately ten days. The five-day temporal lag corresponds to the rising and declining slope of the peak and trough, suggesting that the contextual factors within the past five days are effective in predicting the disease flows. The robust performance of the CNN model is also observed in the prediction accuracy regardless of the fluctuation between peaks and troughs of disease flows (Figure 4). The F1 score is as high as 86% near the highest peak day (Day 64); it is still more than 70% near the lowest trough day (Day 59).
Figure 4.
Prediction accuracy of disease flows across the testing period. The green curve corresponds to the F1 score, with reference to the vertical axis on the left. The blue curve shows the daily number of cases during the testing period, with reference to the vertical axis on the right. The peaks and trough are identified by red circles.
In practice, the five-day temporal lag can be perceived as the critical response time, and it calls for timely intervention strategies regardless of peak or trough time, as long as the early infections can be identified. Potential disease flows between locations could be effectively prevented to reduce further dispersion, using strategies such as quarantine or travel restrictions. In this study, the effect of temporal lag also refers to the temporal factor (see Section 1), which is further discussed in greater detail in the later part of this section.
In comparison, the ANN model (Figure 3, right) achieves a considerably lower overall accuracy than that of the CNN model. The performance of the ANN model shows little stability with respect to different training vs. testing divisions or different temporal lags. Given that the CNN model considerably outperforms the ANN model, the remaining discussion will focus on the results of CNN.
4.2. Scenario analyses
The removal of the contextual factors, individual or combined, reveals how these factors contribute to the prediction of disease flows. For the one-at-a-time removal strategy, the prediction accuracies of the 13 scenarios decrease to varying degrees (Figure 5). The removal of the shortest path length leads to a decrease of 14.98%; this implies that this factor is important in disease flow prediction. The accuracies of the other 12 scenarios only decrease minimally; this suggests that each of the 12 factors alone only has a minor effect on the flow prediction.
Figure 5.

Decrease in prediction accuracy using the one-at-a-time removal strategy.
For the exhaustive-combination removal strategy, three distinct groups stand out (Figure 6). The scenarios in Group 3 are found to have the greatest decrease in prediction accuracy, falling by 41.23%. This decrease is caused by the removal of two epidemiological factors at the same time: the number of cases at neighboring locations and the presence of the disease flow on earlier days. These two factors represent the extended spatial-temporal effect of the epidemiological factors. Furthermore, their compound effect is considerably greater than the simple addition of their individual effects (presence of disease flow: 0.91%, cases at neighboring locations: 0.83%, as shown in Figure 6).
Figure 6.
Three groups of scenarios using the exhaustive-combination removal strategies.
For the scenarios in Group 2, the prediction accuracies are found to decrease by 39.8% when two network factors, degree and closeness, are removed at the same time. These two factors in combination represent the reachability between locations in the flow networks. Their compound effect on the disease flows is much greater than the simple addition of their individual effects (degree: 1.22%, closeness: 0.5%, as shown in Figure 6). For the Group 1 scenarios, the prediction accuracies decrease by 17.9% when two different network factors, the shortest path length and eccentricity, are removed at the same time; both represent central positions in flow networks. Once again, the compound effect of these two factors is larger than their simple additive effects (the shortest path length: 14.98%, eccentricity: 0.2%, as shown in Figure 6). Although these three groups of factors have been widely used in many studies, their compound effects have rarely been revealed.
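The comparison between compound and additive effects is simple arithmetic on the reported accuracy decreases, as the sketch below shows. The numbers are those quoted in the text; the dictionary layout and the function name are illustrative only.

```python
# Reported accuracy decreases (in %) when factors are removed jointly
# versus individually, as quoted in the text.
GROUPS = {
    "Group 3": (41.23, {"prior flow presence": 0.91,
                        "cases at neighboring locations": 0.83}),
    "Group 2": (39.8, {"degree": 1.22, "closeness": 0.5}),
    "Group 1": (17.9, {"shortest path length": 14.98,
                       "eccentricity": 0.2}),
}


def excess_over_additive(joint, individual):
    """Joint decrease minus the sum of the individual decreases."""
    return joint - sum(individual.values())


excess = {
    name: round(excess_over_additive(joint, individual), 2)
    for name, (joint, individual) in GROUPS.items()
}
for name, value in excess.items():
    print(f"{name}: joint decrease exceeds the additive sum by {value}")
```

The gap is large for Groups 3 and 2 and small for Group 1, where the shortest path length alone accounts for most of the joint decrease.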
In addition to the compound effect of factors within each group, intertwined effects among these groups of factors are also revealed. Specifically, the effect of the epidemiological factors is mostly reflected in the universities and schools that endured a major influenza outbreak. The infections in the past few days and at neighboring locations (e.g., near-campus dorms and near-school residential communities) stimulate the potential for new flows. Further, the effect of network factors actualizes the flows from the dorms and residences to universities and schools, respectively, due to their high reachability (high degree and closeness) and efficiency (shortest path and high eccentricity). These dorms and residences in turn become more vulnerable to receiving disease flows. Divide-and-conquer intervention strategies between these locations might help prevent the intertwined effects and interrupt transmission across different communities.
Figure 7 reveals the effect of the temporal factor on the disease flows by showing how the three groups of factors vary one to five days prior to the occurrence of flows. The one to five-day temporal lag is chosen as it is found to be effective to predict flows (see Section 4.1).
Figure 7.

Temporal variation of three groups of compound factors, one to five days prior to the occurrence of flows. (A) presence of disease flow on earlier days and number of cases at neighboring locations, (B) degree and closeness, and (C) path length and eccentricity. Left and right vertical axes correspond to two respective factors.
For the two epidemiological factors in Group 3, the presence of flows on earlier days at the current location increases with time during the five effective days, increasing the potential for the current day flow. The other factor, the number of cases at neighboring locations, peaks a few days before the current day to allow the arrival of the disease flow at the current location. When combined, the accumulated cases and flows in the prior days and at neighboring locations (e.g., dorms and residences) boost the flows at the current time and location, typically the universities and schools. For the two network factors in Group 2, both the degree and closeness increase in the five days, implying active and accelerated new flows of the current day (Figure 7B). Lastly, for the network factors in Group 1, the decreased path length combined with increased eccentricity shortens the network distance of the locations (e.g., dorms and residences) to their neighboring locations (e.g., universities and schools) and all other locations, driving the flows on the current day. Collectively, the temporal progress of the three groups of factors during the five days advances the disease flows.
These results reveal the intertwined contribution of epidemiological, network, and temporal factors to the disease flows, informing the intricate nature of those flows. Understanding these effects lays a solid foundation for location-centric disease surveillance, where contextual factors from different perspectives should be simultaneously monitored and treated as signals for potential disease flows between locations. These findings could facilitate early warning of epidemic outbreaks so that effective time-sensitive intervention strategies can be deployed.
5. Conclusions
This study explored the potential disease flows between locations in an urban area using three sets of contextual factors, epidemiological, network, and temporal, based on a deep learning approach and a series of scenario analyses.
The extended spatial-temporal effect of the epidemiological factors is found to play a critical role in stimulating disease flows. The compound effects of the network factors, combined with the epidemiological factors, provide insights into how interactions between contextual factors affect disease flows. Within the five effective days, the extended and compound effects are accelerated to advance the disease flows throughout the epidemic.
The findings reveal the complicated nature of disease flows, which demands a comprehensive consideration of contextual factors that have long been ignored in prevailing health studies. The effects of contextual factors identified in this study could benefit new studies and strategies, such as real-time surveillance of disease outbreaks, and support timely intervention for a broad range of communicable diseases.
This study reveals distinct advantages of perceiving disease flows as a location-centric network phenomenon. It provides an alternative perspective to effectively capture and represent the dynamics of flows in a relative space, which goes beyond the Euclidean space (Shaw & Sui, 2020; Xing, Sieber, & Roche, 2020; Zhong & Bian, 2016). This perspective emphasizes the locations and flows that are semantically critical to understanding health threats. Such a perception enables the incorporation of the epidemiological factors as well as the inherent network factors, both rendered with the temporal factor. These advantages of the location-centric perception could inspire new designs for epidemiological modeling.
Acknowledgments
Research reported in this publication was supported in part by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM108731. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The use of the case data has been approved by the Institutional Review Board at the authors’ institution.
Appendix: Network metrics
This appendix provides a detailed mathematical explanation of the ten essential network metrics that are adopted as network factors in this study and presented in Section 3.1. These essential network metrics characterize the role and importance of the location nodes in the disease flow network from different perspectives (Freeman, 2004; Newman, 2018; Wasserman & Faust, 1994). For a given network G = (N, M) with ∣N∣ nodes and ∣M∣ links and a given pair of location nodes s and t:
1. Shortest path length
The path length is the number of consecutive links between a given pair of nodes in a network. Among all the possible paths between the node pair, the shortest path is the one with the least number of links. In disease flow networks, it measures the number of consecutive flows between the given pair of locations.
2. Closeness
The closeness of a node is the reciprocal of the summed length of the shortest paths between the node and all other nodes in a network. In disease flow networks, a high closeness value indicates that the location node has short transmission pathways to all other nodes.
$$C(s) = \frac{1}{\sum_{t \in N,\, t \neq s} d(s,t)} \qquad \text{(Eq. A1)}$$
where d(s,t) is the shortest path length between Nodes s and t.
3. Eccentricity
The eccentricity of a node in a network is defined as the reciprocal of the longest path among all the shortest paths between the node and all other nodes. High eccentricity values imply fast transmissions.
$$E(s) = \frac{1}{\max_{t \in N} d(s,t)} \qquad \text{(Eq. A2)}$$
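The three distance-based metrics described so far (shortest path length, closeness, and eccentricity) can be illustrated with a minimal sketch. It assumes an unweighted, undirected, connected network stored as an adjacency dictionary; the function names and the toy chain of locations are hypothetical and are not taken from the study.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: number of links on the shortest path
    from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closeness(adj, s):
    """Reciprocal of the summed shortest path lengths from s to all other nodes."""
    dist = shortest_path_lengths(adj, s)
    total = sum(d for t, d in dist.items() if t != s)
    return 1.0 / total if total > 0 else 0.0


def eccentricity(adj, s):
    """Reciprocal of the longest of the shortest paths from s (as defined in this appendix)."""
    dist = shortest_path_lengths(adj, s)
    longest = max(dist.values())
    return 1.0 / longest if longest > 0 else 0.0


# Hypothetical toy network: a chain of four locations.
toy = {
    "dorm": ["school"],
    "school": ["dorm", "university"],
    "university": ["school", "clinic"],
    "clinic": ["university"],
}

print(shortest_path_lengths(toy, "dorm")["clinic"])  # 3 consecutive links
print(closeness(toy, "school"))                      # 1 / (1 + 1 + 2) = 0.25
print(eccentricity(toy, "dorm"))                     # 1 / 3
```

In this toy chain, the interior location has a higher closeness and eccentricity than the end locations, which matches the reading that higher values imply shorter transmission pathways.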
4. Radiality
The radiality of a node is defined as the network diameter minus the shortest path length between the node and all other nodes, where the network diameter is the shortest distance between the two most distant nodes in a network. In disease flow networks, a high radiality value indicates high reachability of the flow to all other location nodes.
| (Eq. A3) |
where D is the network diameter.
5. Degree
The degree of a node measures the number of links to its direct neighbors. It implies the local transmission magnitude of a location node in the disease flow networks.
6. Clustering coefficient
The clustering coefficient measures the degree to which a node’s neighboring nodes also connect. In disease flow networks, a high clustering coefficient implies the disease flows are locally clustered.
| (Eq. A4) |
where k_s is the number of neighboring nodes of Node s and e_s is the number of links between the neighboring nodes of Node s.
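The two local metrics, degree and clustering coefficient, can be sketched as follows. The sketch assumes an undirected network without self-loops and uses the common undirected normalization, in which e_s is divided by the k_s(k_s − 1)/2 possible links among the neighbors; a directed flow network would divide by k_s(k_s − 1) instead. The function names and the toy network are hypothetical.

```python
from itertools import combinations


def degree(adj, s):
    """Number of links from s to its direct neighbors."""
    return len(adj[s])


def clustering_coefficient(adj, s):
    """Share of possible links among the neighbors of s that actually exist
    (undirected normalization, an assumption of this sketch)."""
    neighbors = adj[s]
    k_s = len(neighbors)
    if k_s < 2:
        return 0.0  # fewer than two neighbors: no neighbor pair to connect
    e_s = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * e_s / (k_s * (k_s - 1))


# Hypothetical toy network: a triangle (a, b, c) with one pendant node d.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(degree(toy, "a"))                  # 3
print(clustering_coefficient(toy, "a"))  # 1 of 3 neighbor pairs linked
print(clustering_coefficient(toy, "b"))  # 1.0
```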
7. Topological coefficient
The topological coefficient of a node measures the extent to which the node shares neighbors with other nodes. It indicates the reachability between the location and its neighbors via their mutual neighbors.
| (Eq. A5) |
where R(s, t) is the number of neighbors shared between Nodes s and t.
8. Betweenness
The betweenness of Node x is defined as the number of shortest paths (over all possible node pairs) that pass through the node. A high betweenness value implies that a location serves as a hub on the shortest transmission pathways between all other locations.
$$B(x) = \sum_{s \neq x \neq t} \frac{P_{st}(x)}{P_{st}} \qquad \text{(Eq. A6)}$$
where P_st is the total number of shortest paths between Nodes s and t, while P_st(x) is the number of those paths that pass through Node x.
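Betweenness can be computed for every node at once with Brandes' algorithm, sketched below for an unweighted, undirected network. This is an illustrative implementation and not necessarily the procedure used in the study; counting each unordered node pair once is an assumption, and the toy chain is hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm: for each node x, sum over node pairs (s, t) of the
    fraction of shortest s-t paths that pass through x."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Forward pass: BFS from s, counting shortest paths (sigma)
        # and recording the predecessors of each node on those paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward pass: accumulate pair dependencies from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected network each pair is visited from both ends.
    return {v: b / 2.0 for v, b in score.items()}


# Hypothetical toy network: a chain a - b - c - d.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(betweenness(toy))  # b and c each lie on two shortest paths
```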
9. Bridging
A bridging node connects the network components in a network, whereby within a component all the nodes are connected directly or indirectly. Intuitively, the bridging measures how well a location connects the transmission hubs.
$$Br(x) = BC(x) \times B(x) \qquad \text{(Eq. A7)}$$
where B(x) is the betweenness of Node x (Eq. A6) and BC(x) is the bridging coefficient of the node, which measures the extent to which a node is located between high degree nodes.
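A minimal sketch of the bridging metric follows. It assumes the commonly used definition in which the bridging coefficient of a node is its inverse degree divided by the summed inverse degrees of its neighbors, and in which bridging is the product of that coefficient and the node's betweenness. Both the definition and the names below are assumptions of this sketch; the betweenness values passed in were worked out by hand for the hypothetical toy network.

```python
def bridging_coefficient(adj, x):
    """Inverse degree of x relative to the summed inverse degrees of its
    neighbors; large when x sits between high degree nodes."""
    neighbors = adj[x]
    if not neighbors:
        return 0.0
    inverse_neighbor_degrees = sum(1.0 / len(adj[n]) for n in neighbors)
    return (1.0 / len(neighbors)) / inverse_neighbor_degrees


def bridging(adj, x, betweenness_x):
    """Bridging coefficient weighted by a precomputed betweenness value."""
    return bridging_coefficient(adj, x) * betweenness_x


# Hypothetical toy network: two triangles joined through node x.
toy = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "x"],
    "x": ["c", "d"],
    "d": ["x", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}

# Hand-computed betweenness for this toy network: x lies on the shortest
# paths of all 9 cross-triangle pairs, c on those of 8 pairs.
print(bridging(toy, "x", 9.0))  # connector between the two clusters
print(bridging(toy, "c", 8.0))  # lower, although its betweenness is similar
```

The connector node scores higher than its neighbor even though their betweenness values are close, which is the behavior the bridging metric is meant to capture.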
10. Eigen-centrality
The eigen-centrality is the influence of a node based on the influence of its neighbors. In disease flow networks, a high eigen-centrality implies that a location has a high transmission magnitude based on the magnitude of its neighbors. Given a 1-by-∣N∣ vector x, the eigen-centrality of all nodes in the network is obtained by solving for x in
$$A x = \lambda x \qquad \text{(Eq. A8)}$$
where A is the adjacency matrix of the network G with eigenvalue(s) λ.
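One common way to solve this eigenvector problem is power iteration, sketched below for an undirected, connected network. The study does not state which solver was used, so this is only an illustration. Iterating with A + I instead of A is a choice made here so that the iteration also settles on bipartite networks; it leaves the eigenvectors unchanged. The function name and toy network are hypothetical.

```python
import math


def eigen_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration for the leading eigenvector of the adjacency matrix."""
    nodes = list(adj)
    x = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # One multiplication by (A + I): each node keeps its own score
        # and adds the scores of its neighbors.
        nxt = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}
        norm = math.sqrt(sum(val * val for val in nxt.values()))
        nxt = {v: val / norm for v, val in nxt.items()}
        change = sum(abs(nxt[v] - x[v]) for v in nodes)
        x = nxt
        if change < tol:
            break
    return x


# Hypothetical toy network: a triangle (a, b, c) with a pendant node d on c.
toy = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

scores = eigen_centrality(toy)
print(max(scores, key=scores.get))  # the best-connected node, c
```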
Footnotes
Disclosure statement
No potential conflict of interest was reported by the authors.
References
- Ali A, Shamsuddin SM, & Ralescu AL (2015). Classification with class imbalance problem: a review. Int. J. Advance Soft Compu. Appl, 7(3), 176–204.
- Bengio Y (2009). Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1–127. doi: 10.1561/2200000006
- Bian L, Huang Y, Mao L, Lim E, Lee G, Yang Y, … Wilson D (2012). Modeling individual vulnerability to communicable diseases: A framework and design. Annals of the Association of American Geographers, 102(5), 1016–1025.
- Carrat F, Vergu E, Ferguson NM, Lemaitre M, Cauchemez S, Leach S, & Valleron A-J (2008). Time lines of infection and disease in human influenza: a review of volunteer challenge studies. American journal of epidemiology, 167(7), 775–785.
- Cauchemez S, Bhattarai A, Marchbanks TL, Fagan RP, Ostroff S, Ferguson NM, … Pennsylvania H. N. w. g. (2011). Role of social networks in shaping disease transmission during a community outbreak of 2009 H1N1 pandemic influenza. Proc Natl Acad Sci U S A, 108(7), 2825–2830. doi: 10.1073/pnas.1008895108
- Charaudeau S, Pakdaman K, & Boelle P-Y (2014). Commuter mobility and the spread of infectious diseases: application to influenza in France. PloS one, 9(1), e83002. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886984/pdf/pone.0083002.pdf
- Chinese National Influenza Center. (2019). Retrieved from http://www.chinaivdc.cn/cnic/en/
- Chowell G, & Rothenberg R (2018). Spatial infectious disease epidemiology: on the cusp. BMC Med, 16(1), 192. doi: 10.1186/s12916-018-1184-6
- Conlan AJ, & Grenfell BT (2007). Seasonality and the persistence and invasion of measles. Proceedings of the Royal Society B: Biological Sciences, 274(1614), 1133–1141.
- Control, C. f. D., & Prevention. (2011). The 2009 H1N1 pandemic: summary highlights, April 2009-April 2010. Website: http://www.cdc.gov/h1n1flu/cdcresponse.htm, Accessed on August, 2.
- Cooley P, Ganapathi L, Ghneim G, Holmberg S, & Wheaton W (2008). Using Influenza-Like Illness Data to Reconstruct an Influenza Outbreak. Math Comput Model, 48(5-6), 929–939. doi: 10.1016/j.mcm.2007.11.016
- Danon L, Ford AP, House T, Jewell CP, Keeling MJ, Roberts GO, … Vernon MC (2011). Networks and the epidemiology of infectious disease. Interdisciplinary perspectives on infectious diseases, 2011.
- Dushoff J, Plotkin JB, Levin SA, & Earn DJ (2004). Dynamical resonance can account for seasonality of influenza epidemics. Proceedings of the National Academy of Sciences of the United States of America, 101(48), 16915–16916. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC534740/pdf/pnas-0407293101.pdf
- Eggo RM, Cauchemez S, & Ferguson NM (2011). Spatial dynamics of the 1918 influenza pandemic in England, Wales and the United States. J R Soc Interface, 8(55), 233–243. doi: 10.1098/rsif.2010.0216
- Ewing A, Lee EC, Viboud C, & Bansal S (2017). Contact, Travel, and Transmission: The Impact of Winter Holidays on Influenza Dynamics in the United States. J Infect Dis, 215(5), 732–739. doi: 10.1093/infdis/jiw642
- Fiore AE, Uyeki TM, Broder K, Finelli L, Euler GL, Singleton JA, … Bresee JS (2010). Prevention and control of influenza with vaccines: recommendations of the Advisory Committee on Immunization Practices (ACIP), 2010.
- Freeman L (2004). The development of social network analysis. A Study in the Sociology of Science, 1.
- Gog JR, Ballesteros S, Viboud C, Simonsen L, Bjornstad ON, Shaman J, … Grenfell BT (2014). Spatial Transmission of 2009 Pandemic Influenza in the US. PLoS Comput Biol, 10(6), e1003635. doi: 10.1371/journal.pcbi.1003635
- Goodfellow I, Bengio Y, & Courville A (2016). Deep learning: MIT press.
- Henaff M, Bruna J, & LeCun Y (2015). Deep Convolutional Networks on Graph-Structured Data. arXiv preprint arXiv:1506.05163.
- Heyman D (2004). Control of Communicable Diseases Manual. American Public Health Association; Washington, DC.
- Heymann DL (2008). Control of communicable diseases manual: American Public Health Association.
- Hinton GE, Osindero S, & Teh Y-W (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527–1554.
- Jain AK, Mao J, & Mohiuddin K (1996). Artificial neural networks: A tutorial. Computer(3), 31–44.
- Japkowicz N, & Stephen S (2002). The class imbalance problem: A systematic study. Intelligent data analysis, 6(5), 429–449.
- Krizhevsky A, Sutskever I, & Hinton GE (2012). Imagenet classification with deep convolutional neural networks. Paper presented at the Advances in neural information processing systems.
- Larsson G, Maire M, & Shakhnarovich G (2016). Learning representations for automatic colorization. Paper presented at the European Conference on Computer Vision.
- LeCun Y, Bengio Y, & Hinton G (2015). Deep learning. Nature, 521(7553), 436.
- LeCun Y, Boser BE, Denker JS, Henderson D, Howard RE, Hubbard WE, & Jackel LD (1990). Handwritten digit recognition with a back-propagation network. Paper presented at the Advances in neural information processing systems.
- Li M, Shi X, Li X, Ma W, He J, & Liu T (2019). Epidemic Forest: A Spatiotemporal Model for Communicable Diseases. Annals of the American Association of Geographers, 1–25.
- Ma F, Zhong S, Gao J, & Bian L (2019). Influenza-Like Symptom Prediction by Analyzing Self-Reported Health Status and Human Mobility Behaviors. Paper presented at the Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics.
- Martincic-Ipsic S, Mocibob E, & Perc M (2017). Link prediction on Twitter. PloS one, 12(7), e0181079. doi: 10.1371/journal.pone.0181079
- Newman M (2018). Networks: Oxford university press.
- Nsoesie EO, Brownstein JS, Ramakrishnan N, & Marathe MV (2014). A systematic review of studies on forecasting the dynamics of influenza outbreaks. Influenza and other respiratory viruses, 8(3), 309–316.
- Park E, Han X, Berg TL, & Berg AC (2016). Combining multiple sources of knowledge in deep cnns for action recognition. Paper presented at the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).
- Pastor-Satorras R, Castellano C, Van Mieghem P, & Vespignani A (2015). Epidemic processes in complex networks. Reviews of modern physics, 87(3), 925.
- Patel NG, Rorres C, Joly DO, Brownstein JS, Boston R, Levy MZ, & Smith G (2015). Quantitative methods of identifying the key nodes in the illegal wildlife trade network. Proc Natl Acad Sci U S A, 112(26), 7948–7953. doi: 10.1073/pnas.1500862112
- Pellis L, Ball F, Bansal S, Eames K, House T, Isham V, & Trapman P (2015). Eight challenges for network epidemic models. Epidemics, 10, 58–62. doi: 10.1016/j.epidem.2014.07.003
- Powers DM (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.
- Riley S, Eames K, Isham V, Mollison D, & Trapman P (2015). Five challenges for spatial epidemic models. Epidemics, 10, 68–71. doi: 10.1016/j.epidem.2014.07.001
- Salathe M, Kazandjieva M, Lee JW, Levis P, Feldman MW, & Jones JH (2010). A high-resolution human contact network for infectious disease transmission. Proc Natl Acad Sci U S A, 107(51), 22020–22025. doi: 10.1073/pnas.1009094108
- Sasaki Y (2007). The truth of the F-measure. Teach Tutor mater, 1(5), 1–5.
- Shaman J, & Karspeck A (2012). Forecasting seasonal outbreaks of influenza. Proceedings of the National Academy of Sciences, 109(50), 20425–20430. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3528592/pdf/pnas.201208772.pdf
- Shaw S-L, & Sui D (2020). Understanding the new human dynamics in smart spaces and places: Toward a Splatial framework. Annals of the American Association of Geographers, 110(2), 339–348.
- Shi X, & Kwan M-P (2015). Introduction: geospatial health research and GIS. Annals of GIS, 21(2), 93–95. doi: 10.1080/19475683.2015.1031204
- Shi X, & Wang S (2015). Computational and data sciences for health-GIS. Annals of GIS, 21(2), 111–118. doi: 10.1080/19475683.2015.1027735
- Shin H-C, Roth HR, Gao M, Lu L, Xu Z, Nogues I, … Summers RM (2016). Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging, 35(5), 1285–1298. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4890616/pdf/nihms785980.pdf
- Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, … Lanctot M (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484.
- Szegedy C, Toshev A, & Erhan D (2013). Deep neural networks for object detection. Paper presented at the Advances in neural information processing systems.
- Tizzoni M, Bajardi P, Poletto C, Ramasco JJ, Balcan D, Gonçalves B, … Vespignani A (2012). Real-time numerical forecast of global epidemic spreading: case study of 2009 A/H1N1pdm. BMC medicine, 10(1), 165.
- Wagner MM, Tsui F-C, Espino JU, Dato VM, Sittig DF, Caruana RA, … Fridsma DB (2001). The emerging science of very early detection of disease outbreaks. Journal of public health management and practice, 7(6), 51–59.
- Wasserman S, & Faust K (1994). Social network analysis: Methods and applications (Vol. 8): Cambridge university press.
- Weiss GM, & Provost F (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of artificial intelligence research, 19, 315–354.
- WHO FluNet. (2019). Retrieved from http://www.who.int/influenza/gisrs_laboratory/flunet/en/
- Wilson K, & Brownstein JS (2009). Early detection of disease outbreaks using the Internet. Cmaj, 180(8), 829–831.
- Xing J, Sieber R, & Roche S (2020). Rethinking Spatial Tessellation in an Era of the Smart City. Annals of the American Association of Geographers, 110(2), 399–407.
- Zhong S, & Bian L (2016). A location-centric network approach to analyzing epidemic dynamics. Annals of the American Association of Geographers, 106(2), 480–488. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4968948/pdf/nihms805004.pdf
- Zhu D, Huang Z, Shi L, Wu L, & Liu Y (2018). Inferring spatial interaction patterns from sequential snapshots of spatial distributions. International Journal of Geographical Information Science, 32(4), 783–805.





