Author manuscript; available in PMC: 2022 Aug 1.
Published in final edited form as: Spat Spatiotemporal Epidemiol. 2021 Jun 30;38:100437. doi: 10.1016/j.sste.2021.100437

Decoding Influenza Outbreaks in a Rural Region of the USA with Archetypal Analysis

Elham Bayat Mokhtari 1, Erin L Landguth 2, Stacey Anderson 3, Emily Stone 4
PMCID: PMC8356570  NIHMSID: NIHMS1724721  PMID: 34353529

Abstract

We present the first application of archetypal analysis for influenza data from 2010–2018 in Montana, USA. Using archetypes, we decompose the data into spatial and temporal components, allowing for a more informed analysis of spatial-temporal dynamic trends during an influenza season. Initially, we reduce the dimension of the set of counties by using a mutual information measure on the influenza time series to create a smaller, maximal mutual information network. Archetypal analysis then describes the relationship between influenza cases across counties and regions in Montana. Finally, we discuss the potential implications this analysis can have for infectious disease modeling, particularly where data is sparse and limited.

1. Introduction

Mathematical models are often used to predict how infectious diseases, such as influenza, will spread, showing the likely outcome of an epidemic and thereby informing public health interventions. Incorporating data into these models presents many challenges. For influenza in particular, one such study is found in Shaman and Karspeck (2012), which uses a data assimilation technique commonly applied in numerical weather prediction on outbreaks in New York City. In this paper we document the use of archetypal analysis to study seasonal spatio-temporal influenza outbreaks in Montana.

Archetypal Analysis (AA) is a tool for the study of large data sets, used both in data compression and in the analysis of spatio-temporal patterns in data (Cutler and Stone, 1997; Stone and Cutler, 1996; Bauckhage, 2014; Vinue et al., 2015). Recently, with the advent of greater computational power, it has been used in a variety of applications in the analysis of “Big Data”. For instance, it has been used to analyze weather, climate and precipitation patterns (Hannachi and Trendafilov, 2017; Steinschneider and Lall, 2015; Su et al., 2017). A probabilistic framework for archetypes is developed in Seth and Eugster (2016). Its application to machine learning appears in Mørup and Hansen (2012). It is also used in biomedical and industrial engineering (Epifanio et al., 2013; Thøgersen et al., 2013), and in the analysis of terrorist events (Lundberg, 2019). To date, however, archetypal analysis has not been applied to epidemiological data.

Archetypal Analysis was introduced by Cutler and Breiman in 1994 as a variant of principal component analysis (PCA) that could capture ‘archetypal patterns’ in the data (Cutler and Breiman, 1994). Through AA, each time-based observation can be represented as a convex combination of a limited number of points, called ‘archetypes’ or ‘pure types’, which may or may not be observed. These influential data points best describe the exterior surface of the original data set, and as convex combinations of the data points themselves, resemble the observations.

AA provides a number of advantages over commonly used techniques for data compression, such as PCA (Pearson, 1901; Abdi and Williams, 2010) or k-means clustering (MacQueen, 1967; Lloyd, 1982). PCA can lead to a complex representation of the data and is restricted by orthogonality, so meaningful features may not be discovered. Clustering approaches provide easy interpretation, but tend to lack modeling flexibility, because each observation is assigned to exactly one cluster and no intermediate groupings are allowed. In contrast, with AA, each observed data point is either classified to its closest archetype (single) or associated with two or more archetypes (mixed). Therefore, AA combines the virtues of both methods; it is easy to interpret and, by allowing intermediates, provides more flexibility than clustering.

Influenza count data offers a unique opportunity to test the application of archetypal analysis to epidemiological data. While the burden of influenza can vary from season to season, it is estimated that between 9 and 45 million cases of influenza occur each year in the United States. Of these, an estimated 140,000–810,000 hospitalizations and up to 61,000 deaths due to influenza occur each year (www.cdc.gov/flu/about/burden/index.html). In the western US state of Montana, approximately 6,000 cases of influenza are reported each season (approximately October to May; https://www.cdc.gov/flu/about/season/flu-season.htm), but the actual number is likely higher, as not all individuals who are infected will seek medical care (MT DPHHS data). In Montana, influenza is associated with approximately 900 hospitalizations and 60 deaths each year.

The objective of this study is to evaluate the use of archetypal analysis on Montana county-level influenza data from 2010 to 2018. Specifically, we first reduce the dimension of the 51-county data set by pruning an unweighted network of the counties created with a mutual information measure of the influenza time series. We then apply AA to further decompose the resulting seasonal influenza count data set into a limited number of spatial patterns of these counties over the state of Montana, and a data reconstruction time series in terms of this lower dimensional set. The reconstruction time series allows us to examine the spread of flu from one region to another in a single season, and to compare the outbreaks in different flu seasons. We believe this to be a very useful recapitulation of the data, organizing it into information that can be more easily studied and acted on by public health experts.

2. Methods

2.1. Influenza Data

The data are weekly county-level case counts of positive diagnoses of influenza from all reporting sources, including laboratory confirmations, hospitalizations, and clinical diagnoses across Montana from 2010 to 2018. Influenza cases in those of all ages are reported. The Centers for Disease Control do not state the estimated under-reporting of influenza, but do acknowledge that it is largely under-reported (www.cdc.gov/flu/about/burden/how-cdc-estimates.htm). Six small population counties (Musselshell, Petroleum, Judith Basin, Wheatland, Golden Valley, and Fergus) were grouped into what is known as the Central Montana Health District (CMHD). The CMHD and all 50 other Montana counties were included in this study for a total of 51 regions which will be referred to as counties for simplicity. In total, the influenza data for Montana produced 51 counties over 8 years of weekly recorded time series. Case counts per 1000 for each week for all counties are shown in Figure 1a. Figure 1b depicts total flu incidence per 1,000 in each county for a sample flu season period (October 1, 2015 to April 30, 2016) for each county in Montana.

Figure 1:

(a) Total weekly influenza cases plotted for all Montana counties, 2010–2018. (b) Total case incidence (per 1,000) for each county in Montana for the 2015–2016 flu season.

In temperate climates, influenza counts are at their minimum during the northern latitude summer months (May to August), whereas winter months make up the predominant season for infection due to cold temperatures, low humidity, and increased indoor crowding (Finkelman et al., 2007; Cauchemez et al., 2011; Tamerius et al., 2013). We thus excluded flu counts from May 1 to August 30 (which were largely 0.0 or NA) from the data set, and found an average flu season length of 39 weeks.

Where counts showed missing entries (−9999 or NA) during the influenza season period, the missing values were imputed using the last observation carried forward method, a common statistical approach to the analysis of longitudinal repeated-measures data in which some follow-up observations may be missing. The combination of the observed and imputed data is then analyzed as a complete data set.
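The imputation step can be illustrated with a minimal Python sketch. It assumes that missing entries appear as the sentinel −9999, as `None`, or as NaN; the function names and the sample series are hypothetical and are not taken from the study's code.

```python
import math

MISSING_CODES = {-9999}  # sentinel named in the text; NA is modeled here as None or NaN


def is_missing(value):
    """Return True for None, NaN, or a sentinel missing code."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in MISSING_CODES


def locf(series):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for value in series:
        if not is_missing(value):
            last = value  # remember the most recent observed count
        filled.append(last)
    return filled


# Hypothetical weekly counts for one county.
weekly_counts = [3, -9999, None, 5, float("nan"), 2]
print(locf(weekly_counts))  # [3, 3, 3, 5, 5, 2]
```

A gap at the very start of a series has no earlier observation to carry forward, so this sketch leaves it as `None`; how such cases were handled in the study is not stated.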

To account for differences in population size among Montana counties, we normalized the weekly influenza cases for each county by that county's average population size from 2010 to 2018 and multiplied by 1000, giving cases per 1,000 residents. The flu season data set has n = 335 observations (weekly data from 2010–2018) and m = 51 attributes, spanning January 3, 2010 to June 3, 2018, with 8 entire flu seasons and one half season represented.
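As a small worked example of this population adjustment, the sketch below converts weekly counts to cases per 1,000 residents. The function name and all numbers are hypothetical and serve only to show the arithmetic.

```python
def incidence_per_1000(weekly_cases, yearly_population):
    """Scale weekly case counts by the average population, per 1,000 residents."""
    avg_population = sum(yearly_population) / len(yearly_population)
    return [1000.0 * cases / avg_population for cases in weekly_cases]


# Hypothetical county: 20,000 residents in each of two years, three weeks of counts.
print(incidence_per_1000([10, 0, 40], [20000, 20000]))  # [0.5, 0.0, 2.0]
```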

2.2. Archetypal Analysis: Mathematical Formulation

Consider an m × n matrix X, where n is the number of observations of flu cases across m Montana counties. AA decomposes the spatio-temporal variability of X in a similar way to PCA but with the following underlying constraints. Given a specified value for k, AA aims to identify m-dimensional vectors z1, ⋯ , zk that best describe k characteristic patterns, or archetypes, in the original data set, such that the data can be represented as convex combinations (i.e., linear combinations with non-negative coefficients that sum to unity) of these archetypal patterns:

$$z_j = \sum_{i=1}^{n} \beta_{ij} x_i, \qquad \beta_{ij} \geq 0, \quad \sum_{i=1}^{n} \beta_{ij} = 1. \tag{1}$$

The n-dimensional vector βj contains the convex weights for the jth archetype across all observations. The n × k matrix of all such weights is given by B = {β1, ⋯ , βk}. Each archetype is either a convex combination of the original observations or an actual observation (Cutler and Breiman, 1994), so archetypes are more readily interpreted than PCA eigenvectors. All observations can then be approximated by a convex combination of the archetypes:

$$\hat{x}_i = \sum_{j=1}^{k} \alpha_{ji} z_j, \qquad \alpha_{ji} \geq 0, \quad \sum_{j=1}^{k} \alpha_{ji} = 1. \tag{2}$$

Here, the convex weights αji, with j = 1, ⋯ , k, sometimes referred to as mixture coefficients, range from 0 to 1 and are used to reconstruct the ith observation from the k archetypes. The k × n matrix of all such weights is given by A = {α1, ⋯ , αn}. The weights αj associated with the jth archetype (the jth row of A) act like a (nonlinear) projection of the original data X onto the jth archetype zj, similar to PC scores in PCA. Thus the αj are time series that determine how much of each archetype is used in reconstructing each data point.

The m × k matrix Z of k archetypes is defined by the matrix factorization problem:

$$\min_{A,B} \lVert X - XBA \rVert, \tag{3}$$

where Z = XB. RSS = ||X − XBA|| is the residual sum of squared errors, where ||·|| is the spectral norm. AA seeks k m-dimensional archetypes such that the RSS is minimized. This approach is described in detail in Cutler and Breiman (1994), but can be summarized as follows: given some initial values of βij, AA uses a convex least-squares method (CLSM) to estimate the coefficients αji subject to the constraints in Eq. (2). It then finds the best βij using CLSM, given the new αji. This process repeats until the RSS fails to improve or the maximum number of iterations is reached. AA finds local minima, not necessarily the global minimum of the RSS; hence, running the algorithm from several starting βij values is recommended to improve the chance of reaching a global solution. Furthermore, there is no universal method for determining the optimal value of k. One commonly used approach is the “elbow” criterion, in which k is chosen as the value beyond which the RSS fails to improve appreciably, identified as an elbow in the plot of RSS against k (a scree plot). Since the introduction of AA, other algorithms have been developed to find an archetypal decomposition of data. In this study, we use the Principal Convex Hull Analysis (PCHA) package created by Morten Mørup, in Matlab.
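The alternating structure described above can be sketched in Python with NumPy. This is a simplified illustration and not the PCHA package or the convex least-squares routine of Cutler and Breiman: it swaps in projected-gradient updates with a Euclidean projection onto the simplex, and it measures the error with the squared Frobenius norm (the sum of squared residuals), which is an assumption of this sketch. All function names, the toy data, and the iteration count are hypothetical.

```python
import numpy as np


def project_columns_to_simplex(M):
    """Euclidean projection of every column of M onto the probability simplex."""
    rows, cols = M.shape
    U = -np.sort(-M, axis=0)                     # columns sorted in descending order
    css = np.cumsum(U, axis=0) - 1.0
    ranks = np.arange(1, rows + 1)[:, None]
    rho = (U - css / ranks > 0).sum(axis=0)      # support size of each projected column
    theta = css[rho - 1, np.arange(cols)] / rho
    return np.maximum(M - theta, 0.0)


def archetypal_analysis(X, k, n_iter=300, seed=0):
    """Alternate updates of A (k x n) and B (n x k) for X (m x n); Z = X B."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    B = project_columns_to_simplex(rng.random((n, k)))
    A = project_columns_to_simplex(rng.random((k, n)))
    history = [np.linalg.norm(X - X @ B @ A) ** 2]
    lip_x = np.linalg.norm(X.T @ X, 2)           # spectral norm, used for the step size
    for _ in range(n_iter):
        Z = X @ B
        # Update A with Z fixed: one projected-gradient step of size 1/L.
        step = 1.0 / (np.linalg.norm(Z.T @ Z, 2) + 1e-12)
        A = project_columns_to_simplex(A - step * (Z.T @ (Z @ A - X)))
        # Update B with A fixed: one projected-gradient step of size 1/L.
        step = 1.0 / (lip_x * np.linalg.norm(A @ A.T, 2) + 1e-12)
        B = project_columns_to_simplex(B - step * (X.T @ (X @ B @ A - X) @ A.T))
        history.append(np.linalg.norm(X - X @ B @ A) ** 2)
    return X @ B, A, B, history


# Toy data: 60 points that are convex mixtures of three hypothetical 2-D corners.
rng = np.random.default_rng(1)
corners = np.array([[0.0, 4.0, 0.0], [0.0, 0.0, 3.0]])
X = corners @ project_columns_to_simplex(rng.random((3, 60)))
Z, A, B, history = archetypal_analysis(X, k=3)
print("initial RSS:", history[0], "final RSS:", history[-1])
```

Because only a local minimum is found, repeating the call with several values of `seed` and keeping the run with the smallest final RSS mirrors the multiple-start recommendation above. Recording the final RSS for a range of k gives the values needed for a scree plot.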

Cutler and Breiman (1994) note that the vectors z are not orthogonal and have no natural nesting structure; i.e., as more archetypes are found, the archetypes in the smaller set can change. This is in contrast to PCA, where the set of the leading N principal components is a subset of the set of the leading M principal components for M > N. This characteristic makes PCA more flexible; however, this flexibility comes at the cost of interpretation.

2.3. Mutual Information

To initially reduce the dimension of the data set, we applied information-theoretic measures introduced by Shannon (Shannon and Weaver, 1998) to quantify the dependence of the flu count time series from different counties upon each other. For instance, if a county has a flu count time series that runs more or less independently of the other counties (as was typical of very small population counties), we could choose to remove it from the data set, thereby reducing the dimension. The exact meaning of “more or less independently” is explored below.

Mutual information measures the expected reduction in uncertainty about x that results from learning y, or vice versa, where x and y are samples of the random variables X and Y. This quantity can be formulated

$$I(X;Y) = H(X) + H(Y) - H(X,Y),$$

where entropy is defined

$$H(X) = -\sum_{x \in X} p(X=x) \log_2 p(X=x) \tag{4}$$

and the joint entropy of two random variables X and Y quantifies the uncertainty of their joint distribution.

$$H(X,Y) = -\sum_{y \in Y} \sum_{x \in X} p(X=x, Y=y) \log_2 p(X=x, Y=y) \tag{5}$$

Using Eqs. (4) and (5), the mutual information can be rewritten

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \frac{p(x \mid y)}{p(x)}. \tag{6}$$

The mutual information is symmetric in the variables X and Y, I(X; Y) = I(Y; X), and it is zero if and only if the random variables are independent. In terms of the conditional entropy H(X|Y) = H(X, Y) − H(Y), it can be written as I(X; Y) = H(X) − H(X|Y). If X is statistically dependent on Y, H(X|Y) will be less than H(X), and I will be greater than 0. If X is independent of Y, H(X|Y) = H(X) and I = 0. If X is uniquely determined by Y, H(X|Y) = 0 and I(X; Y) = H(X), the largest value it can take.
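A histogram-based estimate of these quantities can be sketched in plain Python. The sketch assumes equal-width bins spanning each series' own range and evaluates I(X;Y) = H(X) + H(Y) − H(X,Y) from the binned values; the function names, the binning rule, and the toy series are assumptions made for illustration only.

```python
import math
from collections import Counter


def bin_series(series, n_bins):
    """Assign each value to one of n_bins equal-width bins over the series range."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)
    width = (hi - lo) / n_bins
    return [min(int((value - lo) / width), n_bins - 1) for value in series]


def entropy(labels):
    """Shannon entropy in bits of the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y, n_bins=30):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) estimated from binned series."""
    bx, by = bin_series(x, n_bins), bin_series(y, n_bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))


# Toy checks: a series shares H(X) bits with itself, and none with an
# independent series (all four value pairs occur equally often).
x = [0, 1, 2, 3] * 25
print(mutual_information(x, x, n_bins=4))   # 2.0, which equals H(X)
u = [0, 0, 1, 1] * 25
v = [0, 1, 0, 1] * 25
print(mutual_information(u, v, n_bins=2))   # 0.0
```

The estimate depends on the number of bins, so values are comparable across pairs only when the same binning is used throughout.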

In general, association measures such as correlation coefficients or mutual information are used to estimate the relationship between two random variables. Correlation coefficients entail an assumption about the form of the dependence: linear for Pearson and monotonic for Spearman. Therefore, if two random variables are associated by a relationship outside these forms, these methods can fail to detect the link, or its strength will be wrongly estimated. Mutual information, however, is able to detect both linear and nonlinear dependencies, and it measures the amount of information connecting two random variables, in this case influenza cases in two Montana counties; in other words, it estimates the reduction in uncertainty about the influenza activity of one county when the activity of another county is known.

2.4. Network Graphs

We combined graph theory and mutual information measures to describe the relationship between influenza cases across the counties in Montana. With 51 counties as nodes, and the mutual information measure giving the strength of the edges, we can construct undirected network graphs, as follows.

An estimated mutual information matrix I can be cast into the form of an undirected network adjacency matrix A satisfying the following conditions:

  • 0 ≤ Aij ≤ 1,

  • Aij = Aji,

  • Aii = 1.

Note that the mutual information matrix is symmetric and square, and its entries are bounded below by 0. However, they are not bounded above by 1, and the diagonal elements equal the entropy of the corresponding variable rather than 1. To satisfy the above conditions, we divide each entry of the mutual information matrix I by the joint entropy of the corresponding pair of counties’ influenza cases, resulting in the adjusted mutual information matrix Iadj. Because I(X; X) = H(X) = H(X, X), the diagonal of Iadj is 1, and because I(X; Y) ≤ H(X, Y), every entry lies between 0 and 1, so the transformed matrix satisfies the network adjacency conditions.
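The normalization can be sketched as follows. The sketch relies on two facts stated above: the diagonal of the mutual information matrix holds each variable's entropy, and H(X,Y) = H(X) + H(Y) − I(X;Y), so the joint entropy can be recovered from the matrix itself. The function name, the handling of zero-entropy series, and the 2 × 2 example values are hypothetical.

```python
def adjusted_mutual_information(mi):
    """Divide each MI entry by the pair's joint entropy H(Xi, Xj).

    The joint entropy is recovered as mi[i][i] + mi[j][j] - mi[i][j],
    since the diagonal of the MI matrix holds the entropies.
    """
    n = len(mi)
    adjusted = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            joint = mi[i][i] + mi[j][j] - mi[i][j]
            if joint == 0:
                # Assumption: constant series carry no information;
                # keep the diagonal at 1 by convention.
                adjusted[i][j] = 1.0 if i == j else 0.0
            else:
                adjusted[i][j] = mi[i][j] / joint
    return adjusted


# Hypothetical two-county example: H(X) = 2 bits, H(Y) = 1 bit, I(X;Y) = 0.5 bits.
mi = [[2.0, 0.5],
      [0.5, 1.0]]
for row in adjusted_mutual_information(mi):
    print(row)
```

In this example the joint entropy is 2 + 1 − 0.5 = 2.5 bits, so the off-diagonal entries become 0.5 / 2.5 = 0.2 and the diagonal entries become 1.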

We then reduce the size of the network by pruning nodes that have low connectivity with the other nodes in the graph, leaving us with a smaller, unweighted, network of nodes. To this end we define an unweighted network adjacency matrix between counties i and j by hard thresholding Iadj as follows:

$$I_{\text{unw.adj}}(ij) = \begin{cases} 1, & \text{if } I_{\text{adj}}(ij) > \tau, \\ 0, & \text{otherwise.} \end{cases} \tag{7}$$

Here τ is the hard threshold parameter. While this leads to an intuitive concept of network connectivity, the challenging aspect is choosing the threshold. In many network studies, hard thresholding is based on the scale-free criterion for a graph. It is assumed that the probability that a node is connected with p other nodes follows a power-law distribution

$$P(p) \sim p^{-\gamma}, \tag{8}$$

where p is the node degree and γ is an exponent. We consider the adjusted mutual information matrix Iadj and define the binarized matrix Tτ(Iadj), whose ij entry is 1 if Iadj,ij > τ and 0 otherwise. For each candidate threshold τ, we fit a linear function $f(p) = \hat{\gamma} p + \hat{b}$ to the empirical degree distribution in log space (that is, log P(p) against log p) and compute the coefficient of determination, R2, of the fit. We then choose the threshold that results in the highest R2 value.
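One way to carry out this threshold search is sketched below in plain Python. The details are assumptions of the sketch rather than the authors' procedure: degrees exclude self-connections, nodes of degree zero are left out of the log-log fit, thresholds that produce fewer than three distinct degree values are skipped, and the candidate grid is supplied by the caller. All names are hypothetical.

```python
import math
from collections import Counter


def degrees_at_threshold(adj, tau):
    """Node degrees after hard thresholding, ignoring the diagonal."""
    n = len(adj)
    return [sum(1 for j in range(n) if j != i and adj[i][j] > tau) for i in range(n)]


def r_squared_loglog(degs):
    """R^2 of a straight-line fit of log P(p) against log p; None if undefined."""
    counts = Counter(d for d in degs if d > 0)
    if len(counts) < 3:
        return None
    total = sum(counts.values())
    xs = [math.log(d) for d in counts]
    ys = [math.log(c / total) for c in counts.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if syy == 0:
        return None
    return sxy ** 2 / (sxx * syy)   # squared correlation equals R^2 for a line fit


def choose_threshold(adj, candidates):
    """Return (tau, R^2) with the highest R^2, or None if no fit is possible."""
    best = None
    for tau in candidates:
        r2 = r_squared_loglog(degrees_at_threshold(adj, tau))
        if r2 is not None and (best is None or r2 > best[1]):
            best = (tau, r2)
    return best


# A degree list that follows P(p) proportional to 1/p exactly gives R^2 = 1
# up to floating-point rounding.
print(r_squared_loglog([1, 1, 1, 1, 2, 2, 4]))
```

A call such as `choose_threshold(adjusted, [t / 100 for t in range(5, 96)])` would scan thresholds from 0.05 to 0.95, where `adjusted` stands for an adjusted mutual information matrix given as a list of lists.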

A quantitative measure of the importance of a node in a graph is its connectivity degree, defined as the number of adjacent nodes. For the j-th node, with j = 1, ⋯ , p, where p here denotes the number of nodes, the degree dj is given by:

$$d_j = \sum_{i=1,\, i \neq j}^{p} I_{\text{unw.adj}}(ij), \tag{9}$$

where Iunw.adj(ij) is defined in Eq. (7).
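As a brief illustration of the degree computation and of pruning weakly connected nodes, the sketch below counts off-diagonal ones in each column of a binary adjacency matrix and keeps the nodes whose degree meets a minimum. The node labels, the matrix, and the function names are hypothetical.

```python
def connectivity_degrees(unweighted_adj):
    """Degree of node j: number of other nodes i with a 1 in entry (i, j)."""
    n = len(unweighted_adj)
    return [sum(unweighted_adj[i][j] for i in range(n) if i != j) for j in range(n)]


def prune_nodes(names, unweighted_adj, min_degree=1):
    """Keep the nodes whose connectivity degree is at least min_degree."""
    degs = connectivity_degrees(unweighted_adj)
    return [name for name, d in zip(names, degs) if d >= min_degree]


# Hypothetical three-node network: A and B are linked, C is isolated.
names = ["A", "B", "C"]
adj = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
print(connectivity_degrees(adj))   # [1, 1, 0]
print(prune_nodes(names, adj))     # ['A', 'B']
```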

3. Results

3.1. Archetypal Analysis of 51 County Data Set

We first applied archetypal analysis to decompose the seasonal flu counts in all counties into a limited number of spatial patterns over the state of Montana, and the data reconstruction time series in terms of this set of patterns. As described above, each time-based observation can be represented by a convex combination of the archetypes, and each archetype itself is a convex combination of the original observations. The archetypes can be interpreted as being a “pure type”, or as extremal spatial patterns (i.e., the data points on the boundary of convex hull of the data set).

While the archetypes and alphas do provide a spatio-temporal split of the data, and some insight is gained, the data set itself is too stochastic and too high dimensional for the analysis to give satisfactory results. The bulk of the activity is summarized in two archetypes: the zero archetype, or “no outbreak” archetype, and one that clusters the contiguous counties of Lincoln and Sanders, Jefferson and Gallatin, and Missoula and Ravalli. The rest of the archetypes at any truncation level are mainly representations of isolated outbreaks, because outliers in the data have a strong effect on their composition. There are formulations of archetypal analysis that address this issue; see, for instance, Moliner and Epifanio (2019). For more useful results, we decided to reduce the dimension of the data set following statistical and regional/population center information, as described in the next section.

3.2. Mutual Information-Based Influenza Network Analysis

We used the measure of mutual information between each pair of counties to arrive at a reduced dimension set. Following the formulas in Section 2.3, we first created histograms of the time series data for single counties, and joint distributions for each county with all the others. We note that the choice of bin size in these histograms will change the value of the entropy, but by choosing a fixed bin size for all the distributions, it is possible to compare the measures relative to each other. Accordingly, we calculated the entropy with a uniform bin size and 30 partitions, for each county with respect to all others, creating a 51 by 51 matrix. To determine which counties have the highest MI in total, the MI row/column for each county is summed and ordered, to create the graph shown in Figure 2. Note that the lowest MI occurs for 4 very small population counties (Carter, Garfield, Prairie and Treasure counties) with correspondingly more stochastic flu counts. The top 13 counties are: Ravalli (17.92), Missoula (16.96), Lake (16.57), Flathead (16.54), Gallatin (16.28), Silverbow (16.22), Lincoln (16.02), Big Horn (15.84), Cascade (15.81), Glacier (15.32), Yellowstone (15.12), Beaverhead (14.69) and Lewis and Clark (14.66). We can reduce this set further by only considering those counties with large population centers, removing Beaverhead, Glacier, and Big Horn. Beaverhead is contiguous with and connected by major highways to Silverbow, and the data show that its outbreaks follow the outbreaks of Silverbow. Similarly, Glacier is contiguous with Flathead and its flu outbreaks match Flathead in time, and Big Horn is contiguous with Yellowstone and follows its outbreaks. These counties are roughly “slaved” to the larger counties and can be ignored without loss of significant information. We refer to the remaining 10 counties as the maximal MI set of counties in what follows.

Figure 2:

Total mutual information across counties for each county in an increasing order.

Next, we compare this set to a network created by using the MI matrix. With the 51 counties as nodes, we constructed undirected network graphs by using mutual information as the weight of the edges of the network.

To create the unweighted network, we need a cut-off threshold to set the entries to 0 or 1, as mentioned in the Methods section. For our adjacency matrix, the threshold τ = 0.46 provides the highest R2 (0.97). We build our network based on the 0 and 1 adjacency matrix and filter out isolated nodes (counties), i.e., nodes that have a connectivity degree of 0. Since isolated nodes (counties) do not share information with other counties, we can remove them without any (large) loss of information. Those that create non-trivial subgraphs (subgraphs with more than one node) are assumed to be important. Figure 4 shows the Montana Counties Influenza Network, created using the unweighted, adjusted MI matrix with a threshold of 0.46, and including nodes with connectivity degree of at least 10. The node size represents the degree of the node, while the weight of the edge represents the MI between the two counties. Figure 3 shows that Ravalli, Missoula, Lake, Gallatin, Big Horn, Lincoln, Flathead, Silverbow, and Cascade have large connectivity degree, from 35 down to 28. There is a drop after that to Phillips, Yellowstone, Glacier and Lewis and Clark (at 24 and 23). This list includes all counties with large population centers plus three others: Phillips, Glacier and Big Horn counties. The network created with these counties as nodes is shown in Figure 5. If we prune it further by removing the counties without a large population center, Big Horn (contiguous to Yellowstone), Glacier (contiguous to Flathead) and Phillips, we arrive at the maximal MI group.

Figure 4:

Montana counties influenza cases network based on adjusted unweighted MI matrix. Only nodes with degree at least 10 are shown (29 in total). The darker and wider edges correspond to greater mutual information. The node and font size correspond to connectivity degree.

Figure 3:

Degree of connectivity across counties for each county in an increasing order

Figure 5:

Montana counties influenza cases network based on adjusted unweighted MI matrix. Only the top 13 counties ranked by degree are shown. The darker and wider edges correspond to greater mutual information. The node and font size correspond to connectivity degree.

The 10 counties in the Maximal MI subset include all these important nodes, so we now will refer to it as the Maximal MI network of counties. These counties are also the 10 largest population counties in the state. Their populations, in alphabetical order, are: Cascade (81366), Flathead (103806), Gallatin (114434), Lake (30458), Lewis & Clark (69432), Lincoln (19980), Missoula (119600), Ravalli (43806), Silverbow (34915) and Yellowstone (161300). We note that the counties can be grouped into rough geographic subregions, where they are connected by state and/or Interstate highways, each with one major city or town. These are noted in Table 1.

Table 1:

Counties Grouped into Geographical Regions

Region | Counties in Region
Northwest | Lincoln (Libby), Flathead (Kalispell), Lake (Polson)
North Central | Lewis & Clark (Helena, state capital), Cascade (Great Falls)
Southwest | Missoula (Missoula), Ravalli (Hamilton)
South | Silverbow (Butte), Gallatin (Bozeman), Yellowstone (Billings)

To illustrate this, Figure 6 shows a map of Montana with county names and the location of each county seat (a), and the network placed in rough geographic location (b).

Figure 6:

(a) Map of Montana showing counties and county seats. (b) Top 10 MI counties with nodes marked by county seat names, presented in their geographic location. The line weight on the edges is scaled by the MI between the 2 counties.

Montana is connected to Canada by U.S. Highway 93, which runs north-south from Canada through Flathead, Lake, Missoula and Ravalli counties to Idaho. Interstate 90 connects Missoula, Silverbow, Gallatin and Yellowstone counties, running west to east from the Idaho border before turning south into Wyoming. Interstate 15 also runs from Canada in the north to Idaho in the south, connecting Cascade, Lewis & Clark and Silverbow counties. These are the major transportation corridors through the state. This information will help guide our analysis of the archetypal decomposition.

3.3. Archetypal Analysis of Maximal MI network

We next present the archetypal analysis of this 10-dimensional data set. Figure 7 shows the residual sum of squares (RSS) for each truncation from 1 archetype to 20, to illustrate the drop-off in the error as the number of archetypes is increased. Note that the largest drop in RSS occurs after the first archetype, which, as we shall see below, captures the “no flu” state. Beyond that, the RSS declines more slowly, approaching zero as the number of archetypes grows larger than about 10, as expected. For this analysis we choose a truncation to 6 archetypes, which captures roughly 85% of the variance. Thus we have reduced the dimension of the data set from 10 to 6, or effectively 5, because the alpha time series must sum to one at each time point, so any one alpha is determined by the others.

Figure 7:

Scree plot of the residual sum of squares against the number of archetypes for the Maximal MI County Network

In Figure 8 we show bar graph representations of each archetype. The archetypes are ordered according to the overall size of their alpha contribution. The first archetype is the necessary ‘zero’ archetype, which captures quiescent periods in the flu season and thus has a very small amplitude in each county. It serves as an on/off switch for an outbreak. We note that the duration of a flu outbreak could be measured from the time series for the zero archetype alone. The other archetypes (Table 2) separate out clusters of counties that have simultaneous outbreaks (spatial), and the alphas give the sequence in which the outbreaks occur (temporal). The alpha time series (Figure 9) show when each archetype is “active” in the time series. The reconstruction of the data can be consulted to determine the accuracy of this approximation for any county and any outbreak; see Figure 10. In a majority of the outbreaks the error in the reconstruction is at an acceptable level. Missoula county counts, however, are consistently underestimated by the reconstruction. This is because the per capita size of the outbreaks in Missoula county is significantly smaller than in the other counties, so the error incurred by not capturing Missoula counts is smaller, and hence the algorithm does not favor points in that direction when minimizing the error. This could be corrected by weighting, if need be.

Figure 8:

Six Archetype set for the Maximal MI County Network presented as bar graphs. Note the magnitude of the first archetype. It captures the “no flu” state, and is first in the list because it captures the largest variance in the data set.

Table 2:

Archetype Composition-Maximal MI County Network

Archetype | Counties in Archetype
1 | The zero or “no flu” archetype.
2 | Outbreak in a northwest diagonal swath of contiguous counties: Lincoln, Flathead, Lewis & Clark.
3 | Statewide outbreak, with a large component in Lewis & Clark county.
4 | Broad, low outbreak, with a concentration in the central counties of Lake, Lewis & Clark and Cascade.
5 | Broad, low outbreak with a large Silverbow contribution, representing isolated outbreaks in Silverbow county.
6 | Broad, low outbreak statewide, with larger counts in the southern tier counties: Silverbow, Gallatin, and Yellowstone.

Figure 9:

6 Archetype alpha time series for the Maximal MI County Network

Figure 10:

Reconstruction of the data with 6 Archetypes for the Maximal MI County Network. Y-axis represent the weekly influenza cases. (Red: data, Turquoise: reconstruction)

As the sum of the alpha time series is equal to 1 for each data point, it is fair to ignore one archetype; its alpha time series can be calculated from the remaining. We choose to ignore the zero archetype when analyzing in detail the anatomy of an outbreak, as it serves to turn the epidemic “off”. We further classify archetypes 1–6 in Table 2.

To better understand this classification, refer to the map representation of each archetype in Figures 11–16. Archetype 3 represents a broad outbreak in all counties, while the other 4 can be understood more regionally: 2) Northwest diagonal swath of counties (Flathead, Lincoln, and Lewis & Clark counties), 4) Central counties (Lake, Lewis & Clark, Cascade), 5) Northwest diagonal swath stretching into Silverbow county, and 6) largest outbreak in the southern tier of counties (Silverbow, Gallatin, Yellowstone).

Figure 11:

Archetype 1 heat map

Figure 16:

Archetype 6 heat map

Next, we examine the spatio-temporal dynamics of each flu season by analyzing the simplified dynamics of the alphas. Specifically, we consider 8 full seasons of outbreaks. Archetypal analysis is used to parse out spatial regions and temporal spread of the epidemics. To begin, consider the first three outbreaks shown in Figure 17. The outbreak at week 0 is partial, beginning in January 2010, and will not be studied. In the next outbreak, beginning in Oct. 2010, we see large peaks in Lincoln, Lewis & Clark and Flathead counties, which are captured by the large peak in alpha 2. The initial rise of Lewis & Clark, and Flathead counties, are captured by the small peak in alpha 3. The outbreak in Dec. 2011 is largest in Flathead county, with contributions from Lincoln and Lewis & Clark counties, hence it is mainly captured by archetype 2. From this we can conclude that the second and third are largely outbreaks in the northwest diagonal swath of counties.

Figure 17:

Expanded window on the data and alpha time series from January 2010 to the end of the flu season starting in Dec. 2011, showing the first 2 full flu outbreaks. Top plot is the data, bottom is the reconstruction time series.

The outbreak in Dec. 2012 is widespread in all counties, with alpha 5 and 6 starting the spread with a large component in Yellowstone and Silverbow counties. Alpha 3 then picks up the broad large outbreak in all other counties, which indicates a spread north-westward from the southern counties. Alpha 4, with its large component in Lake, Lewis and Clark and Cascade counties follows, capturing the extended outbreak in these counties seen in the data. Sequentially, the outbreak, starting in the southeast, moves west and circles north, then curves back east and south into the central counties, like a spiral. The small outbreak starting in Dec. 2013 is captured first by alpha 3 and 6 together, then picked up by alpha 5. The data show the broad initial outbreak, with larger peaks in Lewis and Clark and Silverbow counties, consistent with the make-up of the archetypes themselves. The outbreak initiated in Nov. 2014 is mainly represented by archetype 2, which is the larger outbreak seen in Lincoln and Lewis and Clark counties in the northwest.

The small outbreak initiated in Jan. 2016 is initially largest in Flathead, with a later peak in Lake county. The first alpha to peak is alpha 3, with its large components in western counties, followed by a spike in alpha 4, capturing the large component in Lake county. It is a more or less simultaneous outbreak in all counties. The outbreak in Nov. 2016 is largest in alpha 3, which again captures the larger outbreaks in Flathead and Lewis & Clark counties seen in the data. Alpha 2 and alpha 6 then emerge to extend the outbreak in the northwest corner and southern tier of the state. The season beginning Dec. 2017 has a large outbreak in Silverbow county, accompanied by a widespread simultaneous outbreak in the rest of the state (alpha 3 and alpha 5). Later in the season, an increase in cases in the northwest region is represented by a peak in alpha 2, followed by a broader peak in alpha 4, picking up the spread south to Lake county and eastward into Cascade county.

4. Discussion

We have shown how archetypal analysis parses the data into spatial regions and time series representing outbreaks in these regions. Specifically, the alpha time series show which archetypes are involved in each outbreak, and in some cases show the spread of the flu from region to region within an outbreak. Together, the archetypes and their time series allow for a more ready characterization of each flu season than the time series of the flu counts in the 10-dimensional space. For instance, in the 2010–2011 and 2011–2012 seasons, we see a large archetype 2 outbreak and a small archetype 2 outbreak, respectively. This archetype has higher flu counts mainly in the swath of counties in a diagonal line from the northwest to the central part of the state: Lincoln to Flathead to Lewis and Clark counties. Lincoln and Flathead are connected by State Road (SR) 2, and the quickest route from Kalispell, in Flathead county, to Helena, in Lewis and Clark county, runs down the Seeley-Swan valley and picks up a small state road until it hits SR 287 west of Helena. This route skirts all other population centers, hence the three counties can be viewed as connected directly along this route. These counties must experience simultaneous outbreaks often enough in the data that the archetype captures the second largest fraction of the variance, after the zero archetype. It thus represents a large amount of the flu data in the entire data set, and hence a coherent outbreak in this northwest region is commonly seen.
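To make the decomposition concrete, the sketch below rebuilds the county counts for one time step as a weighted mix of archetypes. It assumes the standard archetypal analysis form, in which the weights are non-negative and sum to one. The county subset, the archetype values, and the names `ARCHETYPES` and `reconstruct` are all hypothetical illustrations, not the fitted Montana archetypes.

```python
# Toy illustration only: every number below is made up.
COUNTIES = ["Lincoln", "Flathead", "Lewis and Clark", "Yellowstone"]

# Hypothetical archetypes: each entry is a spatial pattern over COUNTIES.
ARCHETYPES = {
    "zero": [0.0, 0.0, 0.0, 0.0],       # baseline with no cases
    "northwest": [8.0, 6.0, 7.0, 1.0],  # high counts along a northwest band
    "south": [1.0, 1.0, 2.0, 9.0],      # high counts in a southern county
}


def reconstruct(alpha):
    """Convex combination of the archetypes for a single time step."""
    total = sum(alpha.values())
    if any(w < 0 for w in alpha.values()) or abs(total - 1.0) > 1e-9:
        raise ValueError("alpha weights must be non-negative and sum to 1")
    return [
        sum(alpha[name] * ARCHETYPES[name][j] for name in ARCHETYPES)
        for j in range(len(COUNTIES))
    ]


def dominant_archetype(alpha, ignore=("zero",)):
    """Label of the largest weight, skipping the baseline archetype."""
    candidates = {k: v for k, v in alpha.items() if k not in ignore}
    return max(candidates, key=candidates.get)


# Three consecutive time steps of made-up alpha weights.
alpha_series = [
    {"zero": 0.9, "northwest": 0.1, "south": 0.0},
    {"zero": 0.3, "northwest": 0.6, "south": 0.1},
    {"zero": 0.2, "northwest": 0.2, "south": 0.6},
]

sequence = [dominant_archetype(a) for a in alpha_series]
counts = reconstruct(alpha_series[1])
print(sequence)
print(dict(zip(COUNTIES, counts)))
```

Reading `sequence` over time gives the region-to-region narrative used in this section, while `reconstruct` shows how a single alpha vector maps back to counts in each county.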

In the 2012–2013 flu season, archetype 5 and archetype 6 combine to show an initial outbreak in the counties in the south: Silverbow, Gallatin and Yellowstone (the southern tier). The largest cities in these three counties, Butte, Bozeman and Billings, are linked by the main east-west transportation corridor in the state, I-90. The time series then show a spread to the central and northwest counties (archetype 3 and archetype 2). The outbreak persists in Cascade and Lake counties, represented by archetype 4. In 2013–2014 the outbreak is smaller, with a contribution from archetype 3 and archetype 6 followed by archetype 5. In terms of geographic regions, this means the outbreak is widespread, with a higher concentration in a complex of contiguous counties in the center of the state (Lewis and Clark, Flathead and Lake) and in Yellowstone county. Archetype 5 picks up the later large outbreak in Silverbow county. The outbreak in 2014–2015 is most prominent in Lincoln and Lewis and Clark counties, part of the northwest region, and is thus captured primarily by archetype 2.

In 2015–2016, first archetype 3 peaks, followed by a peak in archetype 4, representing the large outbreak in Lewis and Clark county, followed by the outbreak in Lake, Flathead and Cascade counties. This season is focused in the central region of the state, spreading outward from Helena. Lake and Flathead are contiguous counties, as are Lewis and Clark and Cascade. We will refer to the former as the Lake region, and the latter as the Central region. Spread from Lewis and Clark to the west could be accomplished via several routes that connect Kalispell and Great Falls or Polson to Helena. We note that Lewis and Clark county is the home of the state capital, Helena.

The flu season in 2016–2017 is initiated in both archetype 3 and archetype 4, which combined give significant outbreaks in the Central and Lake regions. The tail of the season is in Lincoln, Gallatin and Yellowstone counties, represented by archetype 2 and archetype 6. This season is then characterized by a widespread outbreak in the Central and Lake regions, spreading westward to Lincoln and southeast to the southern tier. The outbreak in 2017–2018 is at first broad with a large peak in Silverbow county, then spreads to the northwestern and central parts of the state.

To summarize, the flu seasons in 2010, 2011 and 2014 were similar, being concentrated largely in the Northwest-Central region. The 2015 and 2016 flu seasons were focused centrally, and spread outward from there. The season in 2012 was unique in that the predominant region of the outbreak was in the southern tier counties. The 2017 season was an outlier, with a very large outbreak in Silverbow county relative to the other counties. Only one season, in 2013, was characterized by a low, widespread outbreak across all regions.

The identification of geographic regions that share flu outbreaks is an important result of this analysis. A major transportation corridor links the counties in the southern tier, while the northwest-central group of counties in archetype 2 has contiguous borders and transportation routes connecting the major cities that do not pass through other counties. Smaller coherent regions are formed from pairs of contiguous counties, such as Lake and Flathead, or Cascade and Lewis and Clark. The analysis also picks up outliers, like the occurrence of large isolated outbreaks in Silverbow county. That particular season could be analyzed in more detail using the time series in the southern tier directly.

In conclusion, AA applied to epidemiological data has great potential to illuminate shared patterns in temporal and spatial dimensions. From here, AA results could be used to develop or strengthen already existing network SIR mathematical models by using the findings to calibrate models or as inputs for the models themselves. These analyses could be applied at larger scales as well, but from a local and state public health perspective, they can help find common patterns and connections of disease spread through data that may otherwise be sparse and limited.
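As one hedged illustration of how such results might feed a network SIR model, the sketch below couples patches (counties) that fall in the same archetype-derived region and leaves other patches uncoupled. This is a generic discrete-time patch SIR model with a forward Euler step, not a model from this study. The coupling strength `within`, the rates `beta` and `gamma`, the patch populations, and the grouping are all assumed values chosen for illustration.

```python
def build_coupling(n_patches, groups, within=0.5):
    """Coupling matrix: 1 on the diagonal, `within` between patches that
    share a region (for example, counties loading on the same archetype)."""
    coupling = [
        [1.0 if i == j else 0.0 for j in range(n_patches)]
        for i in range(n_patches)
    ]
    for group in groups:
        for i in group:
            for j in group:
                if i != j:
                    coupling[i][j] = within
    return coupling


def sir_step(state, coupling, beta, gamma, dt=1.0):
    """One forward Euler step; state is a list of (S, I, R) per patch."""
    prevalence = [I / (S + I + R) for S, I, R in state]
    new_state = []
    for i, (S, I, R) in enumerate(state):
        # Force of infection mixes prevalence from all coupled patches.
        force = beta * sum(c * p for c, p in zip(coupling[i], prevalence))
        infections = min(S, force * S * dt)
        recoveries = min(I, gamma * I * dt)
        new_state.append(
            (S - infections, I + infections - recoveries, R + recoveries)
        )
    return new_state


def simulate(state, coupling, beta=0.5, gamma=0.25, days=30):
    """Run the model for `days` steps and return the full history."""
    history = [state]
    for _ in range(days):
        state = sir_step(state, coupling, beta, gamma)
        history.append(state)
    return history


# Patches 0 and 1 share a hypothetical region; patch 2 is isolated.
initial = [(990.0, 10.0, 0.0), (1000.0, 0.0, 0.0), (1000.0, 0.0, 0.0)]
coupling = build_coupling(3, groups=[[0, 1]])
history = simulate(initial, coupling)
final = history[-1]
print([round(R) for _S, _I, R in final])
```

In this setup an outbreak seeded in patch 0 reaches patch 1 through the shared-region coupling but never reaches the uncoupled patch 2. In practice the coupling weights would need to be calibrated against data rather than set by hand.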

Figure 12:

Archetype 2 heat map

Figure 13:

Archetype 3 heat map

Figure 14:

Archetype 4 heat map

Figure 15:

Archetype 5 heat map

Figure 18:

Expanded window on the data and alpha time series for flu seasons starting in Fall 2012 through end of the season starting in Fall 2014, showing the next 3 outbreaks starting in Dec. 2012, Dec. 2013, and Nov. 2014. Top plot is the data, bottom is the reconstruction time series.

Figure 19:

Expanded window on the data and alpha time series from flu season starting 2015 through to the end of the flu season starting in 2017, showing the remaining 3 outbreaks in Jan. 2016, Nov. 2016, and Dec. 2017. Top plot is the data, bottom is the reconstruction time series.

Highlights.

  • We have developed a procedure for exploring epidemiological data using archetypal analysis.

  • Our methodology decomposes epidemiological data into spatial patterns and time series representing their contribution to the flu count signal.

  • Using Montana seasonal influenza data, we discover regions of the state and aggregations of counties that experience simultaneous outbreaks over multiple seasons.

  • The spread of influenza across the state can be typified by a transfer across these spatial patterns, which could be used in prediction in future seasons.

5. Acknowledgements

This research was supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH), United States [Award Number P20GM130418]. We thank the anonymous reviewers for offering feedback on the manuscript. We also thank the Montana Department of Public Health and Human Services, Communicable Disease Epidemiology Section, for allowing us access to the state’s influenza data.

Footnotes

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  1. Abdi H and Williams LJ, 2010: Principal component analysis. WIREs Computational Statistics, 2(4), 433–459.
  2. Bauckhage C, 2014: A note on archetypal analysis and the approximation of convex hulls, arXiv:1410.0642.
  3. Cauchemez S, Bhattarai A, Marchbanks TL, Fagan RP, Ostroff S, Ferguson NM, and Swerdlow D, 2011: Role of social networks in shaping disease transmission during a community outbreak of 2009 H1N1 pandemic influenza. Proceedings of the National Academy of Sciences, 108(7), 2825–2830.
  4. Cutler A and Breiman L, 1994: Archetypal analysis. Technometrics, 36(4), 338–347.
  5. Cutler A and Stone E, 1997: Moving archetypes. Physica D: Nonlinear Phenomena, 107(1), 1–16.
  6. Epifanio I, Vinue G, and Alemany S, 2013: Archetypal analysis: Contributions for estimating boundary cases in multivariate accommodation problem. Computers & Industrial Engineering, 64(3), 757–765.
  7. Finkelman BS, Viboud C, Koelle K, Ferrari MJ, Bharti N, and Grenfell BT, 2007: Global patterns in seasonal activity of influenza A/H3N2, A/H1N1, and B from 1997 to 2005: Viral coexistence and latitudinal gradients. PLOS ONE, 2(12), 1–10.
  8. Hannachi A and Trendafilov N, 2017: Archetypal Analysis: Mining Weather and Climate Extremes. Journal of Climate, 30(17), 6927–6944.
  9. Lloyd SP, 1982: Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28, 129–136.
  10. Lundberg R, 2019: Archetypal terrorist events in the United States. Studies in Conflict & Terrorism, 42(9), 819–835.
  11. MacQueen J, 1967: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, Berkeley, Calif., pp. 281–297.
  12. Moliner J and Epifanio I, 2019: Robust multivariate and functional archetypal analysis with application to financial time series analysis. Physica A: Statistical Mechanics and its Applications, 519, 195–208.
  13. Mørup M and Hansen LK, 2012: Archetypal analysis for machine learning and data mining. Neurocomputing, 80, 54–63, Special Issue on Machine Learning for Signal Processing 2010.
  14. Pearson K, 1901: On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559–572.
  15. Seth S and Eugster MJA, 2016: Probabilistic archetypal analysis. Machine Learning, 102(1), 85–113.
  16. Shaman J and Karspeck A, 2012: Forecasting seasonal outbreaks of influenza. PNAS, 109(50), 20425–20430.
  17. Shannon C and Weaver W, 1998: The Mathematical Theory of Communication. University of Illinois Press, ISBN: 9780252098031.
  18. Steinschneider S and Lall U, 2015: Daily Precipitation and Tropical Moisture Exports across the Eastern United States: An Application of Archetypal Analysis to Identify Spatiotemporal Structure. Journal of Climate, 28(21), 8585–8602.
  19. Stone E and Cutler A, 1996: Archetypal analysis of spatio-temporal dynamics. Physica D: Nonlinear Phenomena, 90(3), 209–224.
  20. Su Z, Hao Z, Yuan F, Chen X, and Cao Q, 2017: Spatiotemporal variability of extreme summer precipitation over the Yangtze River basin and the associations with climate patterns. Water, 9, 873.
  21. Tamerius JD, Shaman J, Alonso WJ, Bloom-Feshbach K, Uejio CK, Comrie A, and Viboud C, 2013: Environmental predictors of seasonal influenza epidemics across temperate and tropical climates. PLOS Pathogens, 9(3), 1–12.
  22. Thøgersen JC, Mørup M, Damkiær S, Molin S, and Jelsbak L, 2013: Archetypal analysis of diverse Pseudomonas aeruginosa transcriptomes reveals adaptation in cystic fibrosis airways. BMC Bioinformatics, 14(1), 279.
  23. Vinue G, Epifanio I, and Alemany S, 2015: Archetypoids: A new approach to define representative archetypal data. Computational Statistics & Data Analysis, 87, 102–115.
