PLOS One. 2024 Jan 3;19(1):e0283265. doi: 10.1371/journal.pone.0283265

Archetypal analysis of COVID-19 in Montana, USA, March 13, 2020 to April 26, 2022

Emily Stone 1,*, Sebastian Coombs 1, Erin Landguth 2
Editor: Maurizio Fiaschetti 3
PMCID: PMC10763954  PMID: 38170725

Abstract

Infectious disease data can often involve complex spatial patterns intermixed with temporal trends. Archetypal Analysis is a method to mine complex spatio-temporal data, and can be used to discover the dynamics of spatial patterns. The application of Archetypal Analysis to epidemiological data is relatively new, and here we present one of the first applications on COVID-19 data from March 13, 2020 to April 26, 2022, for the counties of Montana, USA. We present three views of the data set decomposed with Archetypal Analysis. First, we evaluate the entire 56 county data set. Second, we use a mutual information calculation to remove counties whose dynamics are mainly independent from the other counties, reducing the set to 17 counties. Finally, we analyze the top ten counties in terms of population size to focus on the dynamics in the large cities in the state. For each data set, we analyze four significant disease outbreaks across Montana. Archetypal Analysis uncovers distinct spatial patterns for each outbreak and demonstrates that each has a unique trajectory across the state.

Introduction

The COVID-19 pandemic launched an intensive effort to understand the processes and drivers of infectious disease outbreaks, with an emphasis on improving predictions and providing information to mitigate public health threats [1]. The unprecedented collection of spatially and temporally dense disease data from across the world, along with increased computing power, has made many theoretical spatio-temporal approaches computationally tractable. Combining spatial with temporal methods allows for the investigation of both the persistence and time evolution of patterns.

Archetypal Analysis (AA) is a promising machine learning tool for epidemiological data and the analysis of the spatio-temporal spread of disease. AA decomposes the data into spatial and temporal components: the archetypes are the spatial patterns, and the time dependence is captured by the reconstruction coefficient time series. Cutler and Breiman introduced AA as a variant of Principal Component Analysis (PCA) [2] that could capture ‘archetypal patterns’ in the data [3]. The archetypal patterns are convex combinations of the data points, and as such resemble the data, making interpretation much more transparent than for linear decompositions such as PCA. In turn, each data point can be approximated by a convex combination of the archetypes.

PCA provides the most efficient linear way to compress data in the least-squares sense, but the eigenvectors produced by PCA should not be interpreted as actual data points; they are directions in the high dimensional space being decomposed. In highly clustered data, AA will find the centroids of the clusters and represent the data as convex combinations of the centroids, which means it can indicate if a data point lies between several clusters. Therefore, AA combines the strengths of other commonly used techniques for data decomposition, providing the interpretability that PCA lacks and more flexibility than many clustering algorithms.

AA applications appear in many fields. The first spatio-temporal application was the study of cellular flames [4], and applications now include the analysis of weather and climate patterns [5–7], machine learning [8], market analysis [9], and biomedical and industrial engineering [10, 11]. For example, AA was recently applied to large scale patterns of sea-surface temperature to represent extreme climate variability associated with marine heatwaves [12]. Mokhtari et al. [13] used AA in one of the first applications to epidemiological data, reconstructing the spatio-temporal patterns in an influenza time series, and showed how prominent outbreaks developed in time and across space for each influenza season in the years 2010–2019, in the state of Montana.

Here, we follow the approach from Mokhtari et al. [13] and apply AA to COVID-19 county-level data in Montana, USA, from March 2020 to April 2022. We use AA to find different disease outbreaks across Montana from distinct sets of archetypes and follow these outbreaks in time with the reconstruction coefficient time series. Our goal is to further evaluate the use of AA to construct and interpret the spatial patterns for the pandemic. We apply AA to decompose the full data set (56 counties), and two reduced dimension data sets.

The sections of the paper are as follows:

Methods: the data, the AA method, and an explanation of mutual information.

Results: AA of the full 56-dimensional data set; AA of a set reduced using mutual information (17-dimensional); and AA of the set restricted to counties with large population centers (10-dimensional), to isolate the spread of COVID-19 between them.

Discussion: an overview of the conclusions arrived at in the Results section.

Methods

COVID-19 data for Montana, USA

COVID-19 data from counties (m = 56; i.e., spatial attributes) in Montana, USA, were acquired courtesy of the Montana Department of Health and Human Services, for March 13, 2020 to April 26, 2022. These data cover 110 full weeks or n = 776 daily case counts. The first case count was recorded on March 13, 2020; however, Montana went into lockdown and had no significant number of cases until June 2020. Therefore, we analyzed the period from June 20, 2020 to April 26, 2022, resulting in n = 676 daily data points over 56 counties. The running weekly average of COVID-19 cases per day was smoothed using the “smoothdata” function within MATLAB (default setting of mean = 3) to reduce noise in the time series (Fig 1). Cases are reported per 1000 people in each county, obtained by dividing by the county’s 2020 population (American Community Survey data from the Census Bureau) and multiplying by 1000. In what follows, we partition the COVID-19 pandemic for Montana empirically into four outbreaks (a rise in cases followed by a decline): Initial (Jun. 2020 to Mar. 2021), Early Delta (July to Aug. 2021), Delta (Sept. to Dec. 2021), and Omicron (Jan. 2022 to Apr. 2022).
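The normalization and smoothing steps can be sketched in Python with NumPy. This is a minimal illustration, not the MATLAB “smoothdata” call used in the study: the function names, the centred three-point window, and the shrinking window at the edges are all assumptions made for the example.

```python
import numpy as np

def cases_per_thousand(daily_cases, population):
    """Scale a (counties x days) array of counts to cases per 1000 residents."""
    daily_cases = np.asarray(daily_cases, dtype=float)
    population = np.asarray(population, dtype=float)
    return 1000.0 * daily_cases / population[:, None]

def moving_mean(series, window=3):
    """Centred moving mean along the time axis (assumed window handling:
    the window simply shrinks near the start and end of the series)."""
    series = np.asarray(series, dtype=float)
    half = window // 2
    n_days = series.shape[-1]
    out = np.empty_like(series)
    for t in range(n_days):
        lo, hi = max(0, t - half), min(n_days, t + half + 1)
        out[..., t] = series[..., lo:hi].mean(axis=-1)
    return out

# Hypothetical toy data: two counties, five days.
counts = np.array([[0, 3, 6, 3, 0], [10, 10, 10, 10, 10]])
pop = np.array([3000, 10000])
smoothed = moving_mean(cases_per_thousand(counts, pop), window=3)
```

Dividing by population before smoothing or after gives the same result here, because both operations are linear.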

Fig 1. COVID-19 cases in Montana, USA.

Running weekly average of COVID-19 cases plotted for all 56 Montana counties. The four outbreaks are seen here: Initial (Sept. 2020 to March 2021), Early Delta (April to Aug. 2021), Delta (July to Sept. 2021), and Omicron (Jan. 2022 to April 2022).

Archetypal analysis: Mathematical formulation

Consider an m × n matrix X, where n is the number of days (676) across m Montana counties (56). AA decomposes the spatio-temporal variability of X in a similar way to PCA but with the following underlying constraints. Given a specified value for k, AA identifies m-dimensional vectors z1, …, zk that best describe k characteristic patterns, or archetypes, in the original data set, such that the data can be represented as convex combinations (i.e., linear combinations with non-negative coefficients that sum to unity) of these archetypal patterns. The archetypes themselves are convex combinations of the data points, xi, i = 1, …, n:

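$$z_j = \sum_{i=1}^{n} \beta_{ij} x_i, \qquad \beta_{ij} \ge 0 \;\; \forall i,j \quad \& \quad \sum_{i=1}^{n} \beta_{ij} = 1 \;\; \forall j, \tag{1}$$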

where the n-dimensional vector βj contains the convex weights for the jth archetype across all data points. The n × k matrix of all such weights is given by B = {β1, …, βk}. All data points can then be approximated by a convex combination of the archetypes:

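$$\hat{x}_i = \sum_{j=1}^{k} \alpha_{ji} z_j, \qquad \alpha_{ji} \ge 0 \;\; \forall i,j \quad \& \quad \sum_{j=1}^{k} \alpha_{ji} = 1 \;\; \forall i. \tag{2}$$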

The convex weights αji, with j = 1, …, k, sometimes referred to as mixture coefficients, range from 0 to 1 and are used to reconstruct the ith observation from the k archetypes. The k × n matrix of all such weights is given by A = {α1, …, αn}, whose columns αi are k-dimensional. The rows of A, which we denote αj, j = 1, …, k, act like a (nonlinear) projection of the original data X onto zj, similar to PC scores in PCA. Thus the αj’s are time series that determine how much of each archetype is used in reconstructing each data point.

The m × k matrix Z of k archetypes is found by solving the optimization problem:

$$\underset{A,B}{\arg\min}\; \lVert X - XBA \rVert, \tag{3}$$

where Z = XB. RSS = ‖X − XBA‖ is the residual sum of square errors, where ‖.‖ is the spectral norm. AA seeks to find k m-dimensional archetypes such that the RSS is minimized. This approach is described in detail in [3], but can be summarized as follows: given some initial values of βij, AA uses a convex least-squares method (CLSM) to estimate the coefficients αji subject to the constraints. It then finds the best βij using CLSM with the new αji. This process repeats until the RSS fails to improve, or until a maximum number of iterations is reached. AA finds local minima, not necessarily the global minimum of the RSS, so running the algorithm from several starting βij values is recommended to improve the chance of reaching the global solution. Furthermore, there is no universal method for determining the optimal value of k. One commonly used approach is the “elbow” criterion, in which k is chosen at the point where the RSS stops decreasing appreciably as the number of archetypes increases. Since its introduction, other algorithms have been developed to find an archetypal decomposition of data. To compute the archetypes we used MATLAB packages by Morten Mørup and Lars Kai Hansen [8] for computing the Principal Convex Hull [14, 15]. The AA algorithm has also been implemented in R; see [16]. See [12] for a good explanation of the geometrical interpretation of the archetypes as points on the convex hull of the data, and for a new algorithm for computing archetypes based on differential geometry (the manifold-based algorithm). Archetypes are presented here as color-mapped counties within a state map, created with GeoPandas (GeoPandas.org).
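The alternating scheme summarized above can be illustrated with a short NumPy sketch. It is not the MATLAB package used in this study nor the original algorithm of [3]: it swaps the constrained least-squares solves for simple projected-gradient steps, it measures the error with the squared Frobenius norm (an assumption), and the function names `project_simplex_columns` and `archetypal_analysis` are hypothetical.

```python
import numpy as np

def project_simplex_columns(M):
    """Euclidean projection of every column of M onto the probability simplex."""
    n_rows, n_cols = M.shape
    U = -np.sort(-M, axis=0)                      # each column sorted descending
    css = np.cumsum(U, axis=0) - 1.0
    idx = np.arange(1, n_rows + 1)[:, None]
    rho = (U - css / idx > 0).sum(axis=0) - 1     # last index meeting the condition
    theta = css[rho, np.arange(n_cols)] / (rho + 1)
    return np.maximum(M - theta, 0.0)

def archetypal_analysis(X, k, n_iter=500, seed=0):
    """Alternate projected-gradient updates of A (k x n) and B (n x k)
    to reduce ||X - X B A||_F^2 for an (m x n) data matrix X."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    B = project_simplex_columns(rng.random((n, k)))
    A = project_simplex_columns(rng.random((k, n)))
    lip_x = np.linalg.norm(X.T @ X, 2)            # spectral norm, used for step size
    for _ in range(n_iter):
        Z = X @ B
        # Update A with B (and hence the archetypes Z) held fixed.
        lip_a = np.linalg.norm(Z.T @ Z, 2) + 1e-12
        A = project_simplex_columns(A + Z.T @ (X - Z @ A) / lip_a)
        # Update B with A held fixed.
        lip_b = lip_x * np.linalg.norm(A @ A.T, 2) + 1e-12
        R = X - X @ B @ A
        B = project_simplex_columns(B + X.T @ R @ A.T / lip_b)
    Z = X @ B
    rss = float(np.sum((X - Z @ A) ** 2))
    return Z, A, B, rss

# Hypothetical demo: 60 points mixed from 3 known archetypes in 2 dimensions.
rng = np.random.default_rng(1)
true_Z = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
X_demo = true_Z @ rng.dirichlet(np.ones(3), size=60).T
Z, A, B, rss = archetypal_analysis(X_demo, k=3, n_iter=200)
```

Because the result depends on the random start, one would typically call this with several values of `seed` and keep the run with the lowest RSS, in line with the multiple-start recommendation above.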

It is noted in [3] that the archetypal points z, viewed as vectors, are not orthogonal and have no natural nesting structure, i.e., as more archetypes are found, the archetypes in the smaller set can change. This is in contrast to PCA, where the set of the leading N principal components is a subset of the set of the leading M principal components for M > N. In PCA all the eigenvectors are found in a single decomposition, and the computation is fast and efficient. This is the result of the linearity of PCA, and it comes at the cost of interpretability.

We also note that, depending on how they are computed, the approximate archetypes may or may not represent extremes of the data, that is, points on the envelope of the convex hull. The convex hull of a data set is the smallest convex set that contains the data. The manifold-based algorithm will naturally find points on the envelope, but the alternating optimization algorithm, which minimizes the error in a convex representation of the data, may first find points representative of heavily sampled values. By the same token, the decomposition is sensitive to outliers, so extremes may be found first. This depends strongly on how the data are distributed, especially in high dimensional spaces. The error is the residual sum of squares, so if a point occurs often in a data set, the error in reproducing it is multiplied by the number of occurrences and the algorithm will necessarily use it as an archetype. It may or may not be an extreme in the data set. Mørup et al.’s algorithm performs the optimization with a projected gradient method to minimize the RSS efficiently.

Mutual information

To reduce the dimension of the data set, we applied information-theoretic measures introduced by Shannon [17] to quantify the dependence of the count time series from different counties upon each other. For instance, if a county has a COVID-19 count time series that runs essentially independently of the other counties (as is typical of very small population counties), we could choose to remove it from the data set, thereby reducing the dimension. We determined this by calculating the mutual information between counties.

Mutual information measures the expected reduction in uncertainty about x that results from learning y, or vice versa, where x and y are samples of the random variables X and Y. This quantity can be formulated as

$$I(X;Y) = H(X) + H(Y) - H(X,Y),$$

where the entropy is defined as

$$H(X) = -\sum_{x \in X} p(X = x) \log_2 p(X = x) \tag{4}$$

and the joint entropy of two random variables X and Y quantifies the uncertainty of their joint distribution:

$$H(X,Y) = -\sum_{y \in Y} \sum_{x \in X} p(X = x, Y = y) \log_2 p(X = x, Y = y). \tag{5}$$

Using Eqs (4) and (5), the mutual information can be rewritten as

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \frac{p(x|y)}{p(x)}. \tag{6}$$

I is symmetric in the variables X and Y, i.e. I(X; Y) = I(Y; X), and is zero if and only if the random variables are independent (nothing is learned about one from the other). Writing H(X|Y) = H(X, Y) − H(Y) for the conditional entropy, we have I(X; Y) = H(X) − H(X|Y). If X is statistically dependent on Y, H(X|Y) will be less than H(X), and I will be greater than 0. If X is independent of Y, H(X|Y) = H(X) and I = 0. If X is uniquely determined by Y, H(X|Y) = 0 and I(X; Y) = H(X).
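As a quick check of the two limiting cases, consider a pair of fair binary variables. This is a hypothetical worked example and is not drawn from the county data.

```latex
% Case 1: X and Y are independent fair coins
H(X) = H(Y) = 1 \text{ bit}, \qquad H(X,Y) = 2 \text{ bits}
\;\;\Rightarrow\;\; I(X;Y) = 1 + 1 - 2 = 0 .

% Case 2: Y is a copy of X, so X is uniquely determined by Y
H(X) = H(Y) = 1 \text{ bit}, \qquad H(X,Y) = 1 \text{ bit}
\;\;\Rightarrow\;\; I(X;Y) = 1 + 1 - 1 = 1 \text{ bit} = H(X) .
```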

In general, association measures such as the correlation coefficient or mutual information are used to estimate the relationship between two random variables. Correlation coefficients entail an assumption about the form of the dependence: linear for Pearson, monotonic for Spearman. Therefore, if two random variables are associated by a more general nonlinear relationship, these methods may fail to detect the link, or its strength will be wrongly estimated. Mutual information, however, is able to detect both linear and nonlinear dependencies, and it measures the amount of information connecting two random variables, in this case between disease cases in two Montana counties. In other words, it estimates the reduction in uncertainty about the COVID-19 activity of one county when the activity of another county is known.
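The difference between the two kinds of measure can be seen in a small synthetic example, sketched here in Python with NumPy. The data (y = x² on a symmetric interval), the function name `histogram_mutual_information`, and the choice of 10 bins are assumptions for illustration only. The estimator follows I(X;Y) = H(X) + H(Y) − H(X,Y) with probabilities taken from histograms.

```python
import numpy as np

def histogram_mutual_information(x, y, bins=10):
    """Estimate I(X;Y) in bits from a joint histogram of two series."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1)          # marginal of x
    p_y = p_xy.sum(axis=0)          # marginal of y

    def entropy(p):
        p = p[p > 0]                # empty bins contribute nothing
        return -np.sum(p * np.log2(p))

    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# A purely nonlinear, non-monotonic relationship.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

pearson_r = np.corrcoef(x, y)[0, 1]                        # essentially zero by symmetry
mi_bits = histogram_mutual_information(x, y, bins=10)      # clearly positive
```

Here the Pearson coefficient vanishes even though y is completely determined by x, while the histogram estimate of the mutual information is well above zero.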

Summary of data sets and analyses

We conducted three different Archetypal Analyses. First, we used AA on the entire data set (m = 56 counties). Second, we introduced MI as a systematic approach to reduce dimensionality and calculated AA on the highest mutual information data set (m = 17). Third, to focus on the dynamics within and between the significant population centers, we computed AA on the counties with large populations (m = 10). For each data set we constructed scree plots with the residual sum of squares error in a reconstruction (RSS) vs the number of archetypes used. These were used to choose the number of archetypes in the decomposition. Archetype sets are plotted as color-coded heat maps of counties in the state of Montana. In the reduced dimension sets, we plot the reconstruction of the data with the archetypes to illustrate the validity of the approach. We study the composition of the archetypes, and compare this with the data. The contribution of each archetype to an outbreak (Initial, Early Delta, Delta and Omicron) was then parsed out with the α coefficients, which allowed us to identify each with certain archetypes.

Results

Archetypal analysis of full data set (56 counties)

The application of AA to the data from all 56 counties of Montana is a natural place to begin our study and we include this analysis to illustrate potential problems with applying AA, and to show the complete picture of the COVID-19 epidemic in Montana. As mentioned in an earlier section, we normalized the time series by dividing by the population of each county in 2020 and multiplying it by 1000. This seems like a reasonable step, and indeed it is customary practice, but in a sparsely populated state like Montana (14 of 56 counties have population less than 5000, and 4 have populations less than 1000) it can lead to issues with the decomposition, as we shall demonstrate now.

We first apply AA to the entire time-truncated data set, which is 676 (n) data points in 56 (m) dimensions. The scree plot (Fig 2) illustrates the drop-off in error as the number of archetypes is increased. Using an elbow criterion would indicate truncating to one archetype, but this is the “no-disease” or zero archetype; it serves only to turn the outbreak off and on, and would not yield any information about spatial structure. Instead, we choose k = 10 archetypes, which captures approximately 86% of the total variance of the data set. We are left with a 56 by 10 (m × k) set of β values, and a 10 by 676 (k × n) set of α values.
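One way to turn a reconstruction error into a “fraction of variance captured” is sketched below in Python. The exact definition used for the percentages quoted here is not stated, so the formula 1 − RSS/TSS, with the total sum of squares taken about each county’s time mean, is an assumption, as is the function name.

```python
import numpy as np

def variance_captured(X, X_hat):
    """Return 1 - RSS/TSS for data X (counties x days) and a reconstruction X_hat.
    TSS is measured about each row's mean (an assumed convention)."""
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    rss = np.sum((X - X_hat) ** 2)
    tss = np.sum((X - X.mean(axis=1, keepdims=True)) ** 2)
    return 1.0 - rss / tss

# Hypothetical toy data: a perfect reconstruction and a mean-only reconstruction.
X_toy = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
perfect = variance_captured(X_toy, X_toy)
mean_only = variance_captured(X_toy, np.repeat(X_toy.mean(axis=1, keepdims=True), 3, axis=1))
```

Evaluating this quantity for k = 1, 2, … gives the same information as the scree plot, expressed as a fraction rather than as a raw RSS.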

Fig 2. Scree plot.

Residual sum of squares vs. number of archetypes in the 56 county data set. Used to select the number of archetypes in the decomposition.

These 10 archetypes are shown in Fig 3 (and summarized in Table 1) as color-mapped counties in Montana. Each archetypal map is a representation of the spatial features of the pandemic in a given period of time. The color is determined by the β value for that county. We define high, medium and low counts in counties as those with β above 1.5, between 0.75 and 1.5, and below 0.75, respectively.
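The three-level labelling can be written as a small helper, sketched here in Python. The function name is hypothetical, and assigning the boundary values 0.75 and 1.5 themselves to the medium class is an assumption.

```python
import numpy as np

def outbreak_level(values, low_cut=0.75, high_cut=1.5):
    """Label each county value as 'low', 'medium' or 'high'."""
    values = np.asarray(values, dtype=float)
    return np.where(values > high_cut, "high",
                    np.where(values < low_cut, "low", "medium"))

# Hypothetical values for five counties in one archetype.
levels = outbreak_level([0.2, 0.75, 1.0, 1.5, 2.0])
```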

Fig 3. Ten archetype set for all 56 counties.

Presented as β color-coded counties in the map of Montana. The first archetype [a] is nearly zero, as it captures the “no disease” state, and acts to “turn-off” the infection/spread in each county. The rest are in order: [b] archetype 2, [c] archetype 3, [d] archetype 4, [e] archetype 5, [f] archetype 6, [g] archetype 7, [h] archetype 8, [i] archetype 9, [j] archetype 10. High β values are considered those greater than 1.5, mid-size those between 0.75 and 1.5, and low those below 0.75. Figure created with ArcGIS: Esri, HERE, Garmin, FAO, NOAA, USGS, EPA.

Table 1. Archetype composition-all county set.

Archetype Counties in Archetype
1 The zero or “no disease” archetype.
2 Widespread mid-size outbreak across state, largest in small population counties.
3 Isolated larger outbreak in Big Horn county, small in Phillips, very low in the rest.
4 Widespread low to mid-size outbreak with larger outbreaks in Ferris, Valley, and Musselshell counties, and slightly larger outbreaks in Park, Dawson, and Phillips counties. Note that these are all very low population counties.
5 Very low level counts in all counties except for ten small population counties scattered all over the state, which have mid-size outbreaks.
6 Widespread mid-size outbreaks, with larger outbreaks in small population counties, with the exception of Silverbow.
7 Low to mid-size outbreaks in all counties except Big Horn, Cascade, Glacier, Blaine, Roosevelt, and Rosebud, which have large outbreaks. The last three are very small population counties.
8 Large outbreaks in the high population counties of Yellowstone, Missoula and Gallatin, as well as the small population counties of Deer Lodge, Garfield, Liberty, Mineral, Prairie, and Teton.
9 Low to mid-size outbreaks in all counties except for nine very small counties which have very large outbreaks (Daniels, Dawson, Fallon, Powell, Sheridan, Silverbow, Sweet Grass, Toole, Wibaux).
10 Widespread large outbreaks throughout state, both in large and small population counties.

The archetypes in this set are dominated by large counts in small population counties. When normalized by county population, these counties have counts much higher than the largest population counties, and thus are outliers. They would contribute a large amount to the RSS if not included in the set. However, their time series are the most stochastic, and are less likely to have any real predictive relationship with the other counties. Only two archetypes (numbers 8 and 10, Fig 3h and 3j) represent large counts in the largest population counties (Yellowstone, Missoula, Gallatin, Flathead, Cascade, Lewis & Clark, Ravalli, Silver Bow, Lake and Lincoln). To remove the small population outlier counties in a systematic way, we use mutual information.

AA on the data set with spatial dimension reduced using information theory

In sparsely populated states with a large number of counties, small population counties are numerous; 36 out of 56 in Montana have population below 10000. In decreasing order from 9391 to 434, these are: Beaverhead (at 9391 in 2020), Deer Lodge, Dawson, Stillwater, Madison, Rosebud, Valley, Blaine, Powell, Broadwater, Teton, Pondera, Chouteau, Toole, Musselshell, Mineral, Phillips, Sweet Grass, Sheridan, Granite, Fallon, Wheatland, Liberty, Judith Basin, Meagher, McCone, Powder River, Daniels, Carter, Prairie, Wibaux, Garfield, Golden Valley, Treasure, Petroleum (at 434 in 2020). These distort the AA by introducing stochastic outliers in the per county population data set. To determine which counties should be included in the Archetypal Analysis, we computed the mutual information between all counties, and ranked counties according to their total mutual information (see also [13]).

Following the formulas in the Methods Section, we created histograms of the time series data for single counties, and joint histograms for each county with all the others. We note that the choice of bin size in these histograms will change the value of the entropy, but by choosing a fixed bin size for all the histograms, it is possible to compare the measure across counties. Accordingly, we calculated the entropy with a uniform bin size and 30 partitions, for each county with respect to all others, creating a 56 by 56 matrix. To determine which counties have the highest MI in total, the MI row (or column) for each county is summed and the sums are ordered, to create the graph shown in Fig 4.
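The ranking procedure can be sketched in Python with NumPy as follows. The function names are hypothetical, the diagonal of the matrix is left at zero (whether self-information enters the row sums is not stated, so this is an assumption), and the demonstration uses three synthetic series with 5 bins rather than the 30 bins applied to the county data.

```python
import numpy as np

def pairwise_mutual_information(data, bins=30):
    """data: (series x time) array. Returns a symmetric MI matrix in bits,
    estimated from joint histograms with a fixed number of bins."""
    n_series = data.shape[0]
    mi = np.zeros((n_series, n_series))
    for a in range(n_series):
        for b in range(a + 1, n_series):
            counts, _, _ = np.histogram2d(data[a], data[b], bins=bins)
            p_ab = counts / counts.sum()
            p_a = p_ab.sum(axis=1, keepdims=True)   # column vector
            p_b = p_ab.sum(axis=0, keepdims=True)   # row vector
            nz = p_ab > 0
            value = np.sum(p_ab[nz] * np.log2(p_ab[nz] / (p_a @ p_b)[nz]))
            mi[a, b] = mi[b, a] = value
    return mi

def rank_by_total_mi(data, bins=30):
    """Sum each row of the MI matrix and return indices from highest to lowest."""
    total = pairwise_mutual_information(data, bins).sum(axis=1)
    return np.argsort(total)[::-1], total

# Hypothetical demo: two related series and one independent noise series.
t = np.arange(500)
s0 = np.sin(t / 10.0)
s1 = 2.0 * s0 + 1.0
s2 = np.random.default_rng(0).random(500)
order, total = rank_by_total_mi(np.vstack([s0, s1, s2]), bins=5)
```

In this toy case the noise series ends up last in `order`, which mirrors how counties with largely independent dynamics fall to the bottom of the total-MI ranking.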

Fig 4. Total mutual information for all 56 counties in Montana, USA.

Total mutual information (y-axis) across all counties in increasing order.

We chose the 17 top MI counties (see Figs 5 and 6) to analyze. Taking 17 includes all the large population counties, and the smaller population counties that have the largest MI. The large population counties, with population in 2023, in alphabetical order, are: Cascade (84864), Flathead (111814), Gallatin (124857), Lake (32853), Lewis & Clark (73832), Lincoln (21525), Missoula (121041), Ravalli (47298), Silverbow (36068) and Yellowstone (169852). The smaller population counties in alphabetical order are: Beaverhead (9719), Custer (12032), Dawson (8830), Fergus (11663), Hill (16068), Jefferson (12826), and Powell (7051). These counties can be grouped into rough geographic sub-regions. See Table 2.

Fig 5. Total mutual information for the largest MI counties.

Total mutual information in the top 17 total MI counties in increasing order.

Fig 6. Time series of COVID-19 case counts.

COVID-19 cases for the 17 largest total MI counties in Montana, USA.

Table 2. Top 17 MI counties grouped into geographical regions.

Region Counties in Region (Large Market Town, if one exists)
Northwest Flathead (Kalispell), Lake (Polson), Lincoln (Libby)
North Central Cascade (Great Falls), Fergus, Hill, Lewis & Clark (Helena, state capital)
Southwest Beaverhead, Missoula (Missoula), Powell, Ravalli (Hamilton)
South Gallatin (Bozeman), Jefferson, Silverbow (Butte), Yellowstone (Billings)
East Custer, Dawson

Next, AA was used to decompose the COVID-19 counts into a limited number of spatial patterns, and the α time series. Computing archetypal sets with increasing cardinality gives the scree plot in Fig 7. Note that the largest drop in RSS occurs after the first archetype, which, as mentioned earlier, captures the “no disease” state. Beyond that, the RSS declines more slowly to near zero for numbers larger than about 15, because the set is 17 dimensional.

Fig 7. Scree plot.

Residual sum of squares vs. number of archetypes being computed for the High MI County Set. Used to select the number of archetypes in the decomposition. Note: the RSS for 1 archetype is 3178, not seen with this y-range, which was chosen to show the decay at larger numbers.

Using an elbow criterion on the scree plot to determine a threshold, we chose 9 archetypes for the decomposition. See Fig 8 for color heat maps of the archetypes, which are summarized in Table 3. We can check the validity of the 9 archetype decomposition of the data by comparing the time series for each county, and its reconstruction with archetypes. See Fig 9. The visible error is generally negligible, because with 9 archetypes, 99% of the variance of the data set is captured.

Fig 8. Nine archetype set for the high MI county set.

Presented as color-coded counties in the map of Montana. The first archetype is not included, as it captures the “no disease” state, and acts to “turn-off” the disease in each county. The rest are presented in order: [a] archetype 2, [b] archetype 3, [c] archetype 4, [d] archetype 5, [e] archetype 6, [f] archetype 7, [g] archetype 8, [h] archetype 9. Note that high β values are considered those greater than 1.5, mid-size those between 0.75 and 1.5, and low those below 0.75. Figure created with ArcGIS: Esri, HERE, Garmin, FAO, NOAA, USGS, EPA.

Table 3. Archetype composition- largest MI county data set.

Archetype Counties in Archetype
1 The zero or “no disease” archetype.
2 Widespread low outbreak across state.
3 Widespread low outbreak with large outbreaks in Cascade and Flathead counties.
4 Mid-size over all, larger in Lincoln, in the northwest, Hill in the north, and Custer in the east.
5 Mid-size outbreaks overall, with larger outbreaks in Powell and Dawson.
6 Low level outbreaks in all counties except in Gallatin, which is high.
7 Widespread outbreak, with lowest levels in Lincoln, Lake, Missoula, Ravalli, Lewis & Clark, Jefferson and Silverbow counties. Mid-size outbreaks in Flathead, Powell, Beaverhead, Cascade, Gallatin, Fergus, Yellowstone, Custer and Dawson. Larger outbreak in Hill.
8 Mid-size to very high level outbreaks over all. Largest in Gallatin, and large in Missoula, Lewis & Clark and Cascade counties. Powell and Silverbow have the next largest number of cases, then Yellowstone and Hill. The rest have more moderate size outbreaks.
9 Mid-size outbreaks in the contiguous region made up of Flathead, Lake, Missoula, and Lewis & Clark counties, with a large outbreak in adjacent Cascade county, and a mid-size outbreak in Hill county. The remaining counties have low to mid-size outbreaks.

Fig 9. Reconstruction of high MI county data set with 9 archetypes.

Original data are plotted in blue, and the reconstruction of the data in red.

The sum of the α time series is equal to 1 for each data point, so one archetype can be ignored without loss of generality; its α time series can be calculated from the remaining ones. We chose to ignore the zero archetype when analyzing the outbreaks, as it can be inferred from the other α’s.
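Written out with the notation of Eq (2), and taking the zero archetype to be archetype 1 as in the tables, the omitted coefficient at each time point i follows directly from the sum-to-one constraint:

```latex
\sum_{j=1}^{k} \alpha_{ji} = 1
\quad\Longrightarrow\quad
\alpha_{1i} = 1 - \sum_{j=2}^{k} \alpha_{ji}, \qquad i = 1, \dots, n .
```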

The archetypes separate out clusters of counties which have simultaneous outbreaks (spatial), and the α’s give the sequence in which the outbreaks occur (temporal). The α time series indicate when each archetype is active.

Fig 10 illustrates how each of the 9 archetypes dominates uniquely during the different outbreaks, as summarized in Table 4. The Initial outbreak is captured by archetypes 7, 5 and 2, in that temporal order. The Early Delta low-level outbreak is captured by archetypes 6 and 3, the Delta outbreak by 4 and 2, and finally the Omicron outbreak is dominated by the sequence 6, 8, 9 and 3. There are small contributions from other archetypes as well, e.g. archetype 2 is a low widespread outbreak, and is present at the end of the first outbreak and the Delta outbreak. Archetypes 6 and 3 are present to a lesser degree in the Omicron phase. These results suggest a different spatio-temporal spread during each phase, recapitulated in Table 4.
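Identifying the dominant archetype at each time point amounts to taking, for every column of the α matrix, the archetype with the largest coefficient. A minimal Python sketch follows; the function name, the option to skip the zero archetype, and the toy α values are assumptions, and this is not necessarily how the colored bars in the figure were produced.

```python
import numpy as np

def dominant_archetype(alpha, skip_first=True):
    """alpha: (k x n) mixture coefficients, one column per time point.
    Returns the 1-based label of the largest coefficient in each column,
    optionally ignoring archetype 1 (the zero archetype)."""
    alpha = np.asarray(alpha, dtype=float)
    offset = 1 if skip_first else 0
    return np.argmax(alpha[offset:], axis=0) + offset + 1

# Hypothetical coefficients for 3 archetypes over 3 time points.
alpha_toy = np.array([[0.7, 0.1, 0.0],
                      [0.2, 0.6, 0.3],
                      [0.1, 0.3, 0.7]])
labels_no_zero = dominant_archetype(alpha_toy)                     # zero archetype ignored
labels_all = dominant_archetype(alpha_toy, skip_first=False)       # all archetypes considered
```

Runs of identical labels along the time axis then mark the periods during which a single spatial pattern dominates.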

Fig 10. Reconstruction coefficient (α) time series for the 9 archetype decomposition of the 17 county data set.

Bars across top are color-coded to show the dominant archetype during that time period.

Table 4. Dominant archetypes for 17 high MI counties in each phase of the pandemic in Montana, USA.

Phase/outbreak 2 3 4 5 6 7 8 9
Initial
Early Delta
Delta
Omicron

The analysis can be confirmed by comparing it to the time series of the major counties in each archetype during each phase. Fig 11 shows the COVID-19 time series in the counties with significant contributions to different archetypes, for comparison with the archetypal description above.

Fig 11. Time series of major county components in the archetypes.

As labeled: Initial, Early Delta, Delta and Omicron outbreaks, 17 county data set.

We see that COVID-19 initially appears in Hill county followed by Powell (represented by archetype 7), after which it spreads to a larger outbreak overall, with largest numbers in Powell, Silverbow and Dawson counties, represented by archetype 5. The last stage is characterized by large counts in Gallatin county, mid-size in Silverbow, Yellowstone, Lewis & Clark and Missoula counties, and smaller in the rest (seen in archetype 2).

The smaller Early Delta outbreak is captured by archetypes 2, 6 and 3, with largest counts in Gallatin county followed by large counts in Cascade and Flathead counties.

In September the Delta outbreak continues, with archetype 4, switching to archetype 2. Archetype 4 has large β values in Lincoln, Custer, Hill and Beaverhead counties, and archetype 2 is a widespread outbreak which is largest in Dawson county. We see in the time series that Lincoln and Custer do dominate initially, with Hill and Beaverhead also larger. As Lincoln and Custer counts decline, the Dawson count grows.

The Omicron outbreak was the most complex, as the counts grew and declined several times in different parts of the state, most likely reflecting the reduction in mitigation strategies combined with social gatherings, such as year-end holiday events, and the reconvening of schools in January. The outbreak began in December with archetype 6, switching to 8, then 9, then 3. This reflects the initial large case numbers in Gallatin county (archetype 6), followed by a widespread outbreak with large counts also in Missoula, Cascade, and Lewis & Clark counties (archetype 8), then large counts in Cascade (archetype 9), and finally a decay in all counties except Cascade and Flathead (archetype 3).

AA for the largest population counties

We next consider the counties with large population size to examine the spatio-temporal dynamics isolated to the cities and surrounding areas. There are 10 relatively large population cities in Montana, and we chose the counties that contained these cities. The time series for each is plotted in Fig 12.

Fig 12. Data time series: COVID case numbers for the large population county data set.

For this analysis we choose a truncation to 6 archetypes, which captures roughly 98% of the variance (see Fig 13). Fig 14 shows color maps of each archetype, which are also described in Table 5.

Fig 13. Scree plot.

Residual sum of squares vs. number of archetypes for the Large Population County Set. Used to select the number of archetypes in the decomposition.

Fig 14. The six archetypes for the large population county data set.

Note that the first archetype is not included, as it captures the “no disease” state, and acts to “turn-off” the outbreak in each county. The rest are in order: [a] archetype 2, [b] archetype 3, [c] archetype 4, [d] archetype 5, [e] archetype 6. Figure created with ArcGIS: Esri, HERE, Garmin, FAO, NOAA, USGS, EPA.

Table 5. Archetype composition-large population county data set.

Archetype Counties represented in Archetype
1 The zero or “no disease” archetype.
2 Mid-size overall, with larger counts in Lincoln.
3 Low level outbreak overall, with mid-size outbreaks in Cascade and Flathead.
4 Mid-size outbreaks overall. Flathead, Lewis & Clark, Cascade and Gallatin form a set of counties with larger counts on a major transportation corridor. The largest counts are seen in Silver Bow county.
5 Mid-size overall, with largest outbreak in Gallatin, followed by Missoula, Silver Bow and Yellowstone counties. Lewis & Clark and Cascade have the next highest level of outbreak, and the rest have low to mid-size outbreaks.
6 Large outbreaks in the west central counties of Cascade, Lewis & Clark and Missoula, with high mid-size outbreaks in the southern counties of Silver Bow, Gallatin and Yellowstone. The northwest corner counties of Lincoln, Flathead and Lake have mid-size outbreaks, with Ravalli county in the southwest corner with a low outbreak.

We can check the validity of the 6 archetype decomposition of the data by comparing the time series of the data for each county, and its reconstruction with archetypes. See Fig 15. The visible error is generally negligible, as expected because 98% of the variance of the data is captured with 6 archetypes.

Fig 15. Reconstruction of large population county data set with 6 archetypes.

Data are plotted in blue, reconstruction in red.
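The reconstruction check described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the data matrix X is taken to be counties by days, Z holds the archetypes as columns, A holds the α coefficients, and the captured variance is assumed to be 1 - RSS/TSS with an uncentered total sum of squares. The synthetic data and all names are hypothetical.

```python
import numpy as np


def explained_fraction(X, Z, A):
    """Fraction of the (uncentered) sum of squares of X captured by Z @ A.

    X : (counties, days) data, Z : (counties, k) archetypes, A : (k, days) coefficients.
    """
    residual = X - Z @ A
    return 1.0 - np.sum(residual ** 2) / np.sum(X ** 2)


def per_county_relative_error(X, Z, A):
    """Relative L2 reconstruction error of each county's time series."""
    residual = X - Z @ A
    return np.linalg.norm(residual, axis=1) / np.linalg.norm(X, axis=1)


# Synthetic example: 10 counties, 6 archetypes, 200 days, small additive noise.
rng = np.random.default_rng(0)
Z = rng.random((10, 6))
A = rng.dirichlet(np.ones(6), size=200).T  # each column is nonnegative and sums to 1
X = Z @ A + 0.01 * rng.standard_normal((10, 200))

frac = explained_fraction(X, Z, A)
errors = per_county_relative_error(X, Z, A)
print(frac, errors.max())
```

Plotting each row of X against the matching row of Z @ A gives the kind of comparison shown in Fig 15.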

As in the largest total MI data set, each archetype is almost uniquely identified with one of the major outbreaks. See Fig 16 in which the α time series is plotted along with colored bars that indicate the dominant archetype. Archetype 4 (Cascade, Flathead, Gallatin, Lewis & Clark, Silver Bow) peaks during the Initial outbreak, archetype 2 (Cascade, Lake, Lewis & Clark, and Yellowstone) during the Delta outbreak. Archetypes 5 (Gallatin, Missoula), 6 (Cascade, Missoula, Lewis & Clark) and 3 (Cascade and Flathead) illustrate 3 different epochs in the Omicron phase. These archetypes also make contributions to the shoulders of the major outbreaks, and 3 and 5, at low levels, are used to represent the Early Delta outbreak. See Table 6.

Fig 16. Reconstruction coefficient (α) time series.

The 6 archetype decomposition of the 10 county set. Bars across top are color-coded to show the dominant archetype during that time period.
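One simple way to obtain bars like those in Fig 16 is to label each day with the archetype that has the largest α coefficient and then collapse consecutive days with the same label into runs. The Python sketch below assumes this argmax rule, which may differ from how the bars in the figure were actually drawn; the function names and the small coefficient matrix are hypothetical.

```python
import numpy as np


def dominant_archetype(A, labels=None):
    """Label each day with the archetype whose alpha coefficient is largest.

    A : (k, days) array of alpha coefficients.
    labels : optional archetype numbers for the k rows (defaults to 1..k).
    """
    if labels is None:
        labels = np.arange(1, A.shape[0] + 1)
    return np.asarray(labels)[np.argmax(A, axis=0)]


def runs(seq):
    """Collapse a per-day label sequence into (label, first_day, last_day) runs."""
    out = []
    start = 0
    for t in range(1, len(seq) + 1):
        if t == len(seq) or seq[t] != seq[start]:
            out.append((int(seq[start]), start, t - 1))
            start = t
    return out


# Toy example: 3 archetypes over 4 days.
A = np.array([
    [0.7, 0.6, 0.2, 0.1],
    [0.2, 0.3, 0.5, 0.3],
    [0.1, 0.1, 0.3, 0.6],
])
dominant = dominant_archetype(A)
print(runs(dominant))
```

Each run corresponds to one colored bar segment; short runs near the transitions between outbreaks may be worth smoothing before plotting.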

Table 6. Dominant archetypes for 10 large population counties in each outbreak of the pandemic in Montana, USA.

Phase/outbreak 2 3 4 5 6
Initial
Early Delta
Delta
Omicron

We next compare the archetype time series with the data time series in Fig 17. In the initial outbreak, archetype 3 is first seen, because Gallatin, Cascade and Flathead counts rise first. In November 2020 archetype 4 becomes dominant, representing widespread contagion, with a large component in Silver Bow county. As the outbreak declines overall in January, α2 and α5 rise, with archetypes 2 and 5 giving a widespread lower level outbreak, with a larger component in Gallatin county.

Fig 17. Time series of major county components in the archetypes.

As labeled: Initial, early Delta, Delta and Omicron Phases, 10 county data set.

The early Delta outbreak is represented by a low level of archetype 5 switching to a low level of archetype 3, giving a rise in Cascade county followed by a rise in Gallatin county, with slow spread into the other counties.

The Delta outbreak continues with low α3 transitioning to large α2, as it begins in Cascade and Flathead counties, then spreads to all counties with a larger component in Lincoln. Archetype 2 is thus the dominant archetype for the Delta outbreak. It ends as α2 declines and α6 rises, reflecting the decline in Lincoln counts and the growing counts in all other counties, especially the counties with major cities, Missoula, Gallatin, Yellowstone, Cascade and Lewis & Clark.

The Omicron phase, in late December 2021, begins with large counts in Gallatin (archetype 5) and switches to large counts in all other counties (archetype 6). It ends with significant counts in Cascade and Flathead, but low elsewhere, hence the rise of archetype 3.

Discussion

We have shown how Archetypal Analysis can be used to good effect in studying a spatio-temporal data set of COVID-19 case counts in the counties of Montana, USA. Decomposing the entire 56-dimensional data set was problematic, however, because the large per-population counts in small population counties form stochastic outliers in the data set and dominate the decomposition. To mitigate this, we formed one reduced set by a straightforward truncation to only the large population counties with significant city centers. To include the small counties whose disease dynamics had significant interaction with the others, we used a mutual information (MI) measure between the case count time series of the different counties, and chose those small counties with the largest MI, in addition to the large population counties, to create another set for analysis.
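To make the MI-based selection concrete, the following Python sketch estimates pairwise mutual information between county time series with a simple histogram (plug-in) estimator and sums it over all other counties. This is an illustration under assumptions, not the procedure used in the paper: the number of bins, the use of bits, the summation used to form a "total MI" per county, and the synthetic series are all hypothetical choices.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of the mutual information of x and y, in bits."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nz = pxy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))


def total_mi(X, bins=8):
    """For each row (county) of X, sum its MI with every other row."""
    m = X.shape[0]
    mi = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            mi[i, j] = mi[j, i] = mutual_information(X[i], X[j], bins)
    return mi.sum(axis=1)


# Synthetic example: two series that share a signal and one independent series.
rng = np.random.default_rng(1)
shared = rng.random(500)
X = np.vstack([
    shared + 0.05 * rng.standard_normal(500),
    shared + 0.05 * rng.standard_normal(500),
    rng.random(500),
])
totals = total_mi(X)
print(totals)
```

Counties could then be ranked by their total and the top-ranked ones retained. Histogram estimators are biased for short series, so the ranking is more trustworthy than the absolute values.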

For each data set, the first archetype is necessarily the zero archetype, which indicates a disease free state in all counties. In the 17 county high total MI data set certain archetypes were tied uniquely to a given outbreak, while one archetype was used to represent low-level counts in between outbreaks. Including the small population counties showed the initiation of outbreaks from the boundaries of the state, in the north (Hill county), and the east (Custer county). Each of the archetypes of the 10 large population counties data set is also mostly identified with a single outbreak during the pandemic, and the archetypes are similar to those in the 17 county set, restricted to the high population counties. For instance, archetype 4 for the 10 county data set is similar to archetype 5 in the 17 county data set (without the low population counties), and both are the main component of the first outbreak. In the Delta outbreak, archetype 2 from the 10 county set is similar to archetype 4 from the 17 county decomposition. In the Omicron outbreak, archetype 5 from the 10 county data set is similar to 8 from the 17 county data set, archetype 6 is similar to 9, and archetype 3 to 3. This overlapping structure is expected, and confirms the validity of the results.

The composition of an archetype itself can illustrate the geographic spread of the disease. For instance, Flathead, Lewis & Clark and Cascade counties have similar outbreak levels in several archetypes, which is not surprising, as they are linked by major state roads and have larger cities, but the analysis confirms that these connections are important. Other clusters of counties involved in outbreaks are revealed automatically in the archetypes, for instance, in the Initial outbreak a hot spot in Hill county and larger counts in Powell county then spread to all eastern counties. Further analysis could be done by including more counties in the east to determine how COVID-19 spreads west if initiated in the east. Archetypal decompositions could also be used to predict the spread of disease in future outbreaks. For instance, if an outbreak begins in a county that features largely in one archetype, would this imply spread to other counties with high levels in that archetype?

We close by commenting on the care that must be used in the creation and analysis of the archetypes. They have the advantage of automatically showing the counties that experience simultaneous outbreaks, and the state of the other counties during these time periods. Understanding the archetypes is intuitive, unlike graphical representations of principal component vectors. However, if outliers are present in the data (like the inflated counts from small population counties), they can bias the selection of the archetypes. This can be mitigated by filtering out these data points, as we did using an information measure between time series. In doing so we retained the small population counties that were important in the initiation of the outbreaks, which lay largely on the northern and eastern sides of the state. In conclusion, Archetypal Analysis, in tandem with human observation and intuition, can sift through complicated spatio-temporal data to reveal important features for further analysis of the phenomenon.

Supporting information

S1 File

(PDF)

Acknowledgments

We thank the anonymous reviewers for offering feedback on the manuscript. We also thank the Montana Department of Public Health and Human Services, Communicable Disease Epidemiology Section, for allowing us access to Montana’s COVID-19 data.

Data Availability

Data cannot be shared publicly from the Montana Dept of Health and Human Services. Data are available from the MT DPHHS Communicable Disease Epidemiology Section (the point of contact at DPHHS is Laura Williamson: 406-444-0064) for researchers who meet the criteria for access.

Funding Statement

EL: This research was supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH), United States [Award Number P20GM130418] (nigms.nih.gov). The funders did not play any role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Landguth E, Holden Z, Graham J, Stark B, Bayat Mokhtari E, Kaleczyc E, et al. The delayed effect of wildfire season particulate matter on subsequent influenza season in a mountain west region of the USA. Environ Int 2020; 139 (50): 105668. doi: 10.1016/j.envint.2020.105668
  • 2. Abdi H, and Williams L. Principal component analysis. WIREs Computational Statistics 2010; 2 (4): 433–459. doi: 10.1002/wics.101
  • 3. Cutler A, and Breiman L. Archetypal Analysis. Technometrics 1994; 36 (4): 338–347. doi: 10.1080/00401706.1994.10485840
  • 4. Stone E, and Cutler A. Archetypal analysis of spatio-temporal dynamics. Physica D: Nonlinear Phenomena 1996; 90 (3): 209–224. doi: 10.1016/0167-2789(95)00244-8
  • 5. Hannachi A, and Trendafilov N. Archetypal Analysis: Mining Weather and Climate Extremes. Journal of Climate 2017; 30 (17): 6927–6944. doi: 10.1175/JCLI-D-16-0798.1
  • 6. Steinschneider S, and Lall U. Daily Precipitation and Tropical Moisture Exports across the Eastern United States: An Application of Archetypal Analysis to Identify Spatiotemporal Structure. Journal of Climate 2015; 28 (21): 8585–8602. doi: 10.1175/JCLI-D-15-0340.1
  • 7. Su Z, Hao Z, Yuan F, Chen X, and Cao Q. Spatiotemporal Variability of Extreme Summer Precipitation over the Yangtze River Basin and the Associations with Climate Patterns. Water 2017; 9: 873. doi: 10.3390/w9110873
  • 8. Mørup M, and Hansen L. Archetypal analysis for machine learning and data mining. Neurocomputing 2012, Special Issue on Machine Learning for Signal Processing 2010; 80: 54–63.
  • 9. Li S, Wang P, Louviere J, and Carson R. Archetypal analysis: a new way to segment markets based on extreme individuals. ANZMAC Conference Proceeding, A Celebration of Ehrenberg and Bass: Marketing Knowledge, Discoveries and Contribution, 2003.
  • 10. Epifanio I, Vinué G, and Aleman S. Archetypal analysis: Contributions for estimating boundary cases in multivariate accommodation problem. Computers & Industrial Engineering 2013; 64 (3): 757–765. doi: 10.1016/j.cie.2012.12.011
  • 11. Thøgersen J, Mørup M, Damkiær S, Molin S, and Jelsbak L. Archetypal analysis of diverse Pseudomonas aeruginosa transcriptomes reveals adaptation in cystic fibrosis airways. BMC Bioinformatics 2013; Sept. 23, 14 (1): 279. doi: 10.1186/1471-2105-14-279
  • 12. Chapman C, Monselesan D, Risby J, Feng M, and Sloyan B. A large-scale view of marine heatwaves revealed by archetype analysis. Nature Communications 2022; Dec. 21, 2022. doi: 10.1038/s41467-022-35493-x
  • 13. Bayat Mokhtari E, Landguth E, Anderson S, and Stone E. Decoding influenza outbreaks in a rural region of the USA with archetypal analysis. Spatiotemporal Epidemiology 2021; 38: 100437. doi: 10.1016/j.sste.2021.100437
  • 14. Bauckhage C, and Thurau C. Making Archetypal Analysis Practical. Pattern Recognition 2009; eds. Denzler J, Notni G, and Süße H, Springer Berlin Heidelberg: 272–281.
  • 15. Bauckhage C. A Note on Archetypal Analysis and the Approximation of Convex Hulls. ArXiv 2014: 1410.0642.
  • 16. Eugster M.J.A, and Leisch F. From Spider-Man to Hero – Archetypal Analysis in R. Journal of Statistical Software 2009; 30 (8): 1–23.
  • 17. Shannon C, and Weaver W. The Mathematical Theory of Communication. University of Illinois Press; 1998.

Decision Letter 0

Maurizio Fiaschetti

8 Jun 2023

PONE-D-23-06211
Archetypal analysis of COVID-19 in Montana, USA, March 13, 2020 to April 26, 2022
PLOS ONE

Dear Dr. Stone,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I strongly encourage the authors to address all the issues raised by both the reviewers. In particular, given your statement about the use of data publicly available, I encourage you to share your data and code to allow a thorough check of the paper's analysis for the revision.

Please submit your revised manuscript by Jul 23 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Maurizio Fiaschetti

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"This research was supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH), United States [Award Number P20GM130418]. We thank the anonymous reviewers for offering feedback on manuscript. We also thank he Montana Department of Public Health and Human Services, Communicable Disease Epidemiology Section, for allowing us access to Montana’s COVID-19 data. "

We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: "EL: This research was supported by the National Institute of General Medical Sciences of 428

the National Institutes of Health (NIH), United States [Award Number P20GM130418]

nigms.nih.gov

The funders did not play any role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ.

5. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

6. We note that Figures 3a to 3j, 8a to 8h, 13a to 13e  in your submission contain [map/satellite] images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

 We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

        1. You may seek permission from the original copyright holder of Figures 3a to 3j, 8a to 8h, 13a to 13e to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

      2. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

 USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The paper describes an application of the archetypal analysis to study of COVID-19 epidemic in the state of Montana. While the application by itself is interesting from the methodological perspective, the manuscript has several important drawbacks:

1) The authors have not provided access to their data and code. This violates the community standards for data availability, which stipulates that data and code should be accessible to reviewers and readers of the paper for validation purposes.

2) The results are not clearly formulated, rendering the central message of the study unclear. The authors did not supply any quantitative or statistical evidence to substantiate the associations described in the Results section.

3) The rationale behind the abrupt shift from discussing COVID-19 pandemic results to flu on page 10 is unclear, as is the relevance of flu to the overall topic of the paper.

4) It is unclear why a significant portion of the results section (pages 5-6) is devoted to results acknowledged by the authors to be biased.

5) The decision to select 10 archetypes for the analysis (as stated on page 5) lacks a clear justification.

6) The paper is not well-written. For instance, the first paragraph of the introduction leaves readers questioning the essence of the "new implementations" that rendered spatio-temporal approaches tractable (implementations of what?), why these approaches were previously intractable, and the nature of the "patterns over time" mentioned. Furthermore, the sentence "the methods can detect spatial clustering could may reveal environmental causatives" is particularly opaque.

7) Table 1 is not cited in the paper.

Reviewer #2: The authors apply archetypal analysis (AA), an innovative tool similar to principal components analysis (PCA). The rationale for applying AA is convincing and the results are interesting.

Major Comment:

The subsection on the mathematical formulation has some problems. I believe these problems are in the explanation, not in any underlying issue with the method itself. The authors say “Given a specified value for k, AA identifies m-dimensional vectors z_1, z_2, … , z_k that best describe k characteristic patterns, or archetypes in the original data set … .” They go on to give some restrictions regarding these z’s, but they don’t say precisely what “best” means. The authors seem to make an attempt to make things precise at the bottom of p. 3 where they say

“The m x k matrix Z of k archetypes is defined by the matrix factorization problem:

min_{A,B} || X – XBA ||

where Z = XB.”

A couple of remarks on this statement. First, the authors don’t exactly say what the matrix factorization problem is. It actually looks like a minimization problem, not a matrix factorization problem. Is Z the minimum of || X – XBA || over all A and B? If so, they could write

Z = arg min_{A,B} || X – XBA ||.

I see no connection between Z and the minimization problem. Second, at the end of the sentence, they say Z = XB. So is Z the solution to the given minimization problem, or is it simply XB?

Minor Comments

1. The authors often talk about “observations” when they really mean “days.” It would help the reader to say there are data values for counties and days. Every time I saw “observations” I had to stop reading and think about what that means.

2. Near the bottom of p. 2, the authors talk about the AA algorithm. At this point of the paper, I didn’t have enough background to understand what any of this means. I’d suggest trimming this back, and then coming back to it later in the paper once the groundwork is laid.

3. It looks like the authors used Matlab, a package familiar to most mathematicians. Statisticians tend to use R because of its array of functions for doing data analysis. I believe there is an R package to do AA, which is called "archetypes". See the paper

Eugster, M., & Leisch, F. (2009). From spider-man to hero-archetypal analysis in R.

https://cran.r-project.org/web/packages/archetypes/vignettes/archetypes.pdf

4. On p. 3, first paragraph, the authors say “To account for differences in population size, we weighted the COVID-19 cases … .” “Weighted” could mean a lot of things in this context. Did the authors just look at COVID rates on a per capita basis?

5. p. 3, just above equation (1). Should be “… coefficients that sum to unity …”

6. In equation (1) and thereafter, the betas are vectors and should be bold. This can be easily done in LaTeX using the amsmath package and the \boldsymbol{} command.

7. p. 3, just after equation (1), should be “Each archetype is either a convex combination of …”

8. p. 3, lines 84 and 90 (using numbering in the right margin). Write these matrices using brackets:

B = [ beta_1, … , beta_k ] and A = [ alpha_1 , … , alpha_n ]. These alphas and betas are themselves vectors and should be bold. Also use \ldots for the ellipsis between commas; save \cdots for the ellipsis between math operators like “+”.

9. In equation (1), the constraints must hold for all j. The “for all j” could be added at the end of (1).

10. The definition of mutual information at the bottom of p. 4 looked asymmetric to me. I got out some paper and tried to see whether it was symmetric, and of course it is. Then I turned the page and saw that the authors pointed out the symmetry. By the way, the description of MI at the top of p. 5 is very nice!

11. I couldn’t help but think that ten archetypes may be a bit much. I suspect that by the time you get past the first five or six, you get mostly noise. I’m not suggesting anything specific here, except maybe a statement that there will come a point beyond which we are observing noise.

12. p. 8, line 221. Shouldn’t “no flu” be “no COVID”? Same page, line 228, shouldn’t figure 8 be “Fig. 8”?

13. Throughout, I’d suggest not spelling out alpha. Just use $\alpha$.

14. p. 13, lines 355-357. “A high (relative) MI indicates that a county’s time series can be better predicted by considering the other, and vice versa.” First, congratulations on getting “vice versa” correct; I always put a hyphen between the words, which is incorrect. I think the idea is that you can augment data from a county with other counties for which the MI is high. The way it is worded suggests that you ignore the data from the current county and just use the data from counties with a high MI.

15. The graphs are generally fuzzy. I know that PLOS ONE allows files in TIFF or EPS format only. EPS is a vector graphics format so the figures should be clear (no fuzziness at any level of magnification). I suspect the authors used TIFF files. The best strategy would be to render the graphs in EPS format. If this isn’t possible, then maybe saving the files at a higher TIFF resolution would make them clear.

16. In Figure 3, the axes are not needed. The legend on the right is, however, needed, but this can be given once (assuming the legend is the same for all archetypes). The figure will probably be set in a 5 by 2 array of maps so it fits on one page. The county names will not be readable at this scale. Similar comments apply to Figure 8. If the county names are needed, the authors might consider giving one map of Montana with just the county names. This figure could contain the map information (e.g., compass, scale in miles, surrounding topography, etc.) along with the county names.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jan 3;19(1):e0283265. doi: 10.1371/journal.pone.0283265.r002

Author response to Decision Letter 0


28 Sep 2023

To the Reviewers and the Editor:

We have extensively rewritten sections of the manuscript, to improve clarity and flow. The results have not changed, and the contents are almost identical to the first submission. Two figures and two tables have been added to address issues raised by the reviewers.

The markup file shows the old submission, with the revisions added. Rewrites follow struck-out sections, and wholly new parts are highlighted in yellow.

We are not able to make the data publicly available, as it comes from the MT Dept. of Health and Human Services. Readers can apply to them directly for access to the data. We did, however, put our code up on GitHub, and the one package we used is listed in Supporting Information.

Response to the Reviewers:

Reviewer #1: The paper describes an application of the archetypal analysis to study of COVID-19 epidemic in the state of Montana. While the application by itself is interesting from the methodological perspective, the manuscript has several important drawbacks:

1) The authors have not provided access to their data and code. This violates the community standards for data availability, which stipulates that data and code should be accessible to reviewers and readers of the paper for validation purposes.

We have added a copy of all code used to the Supporting Information section, and have also created a GitHub repository for it.

2) The results are not clearly formulated, rendering the central message of the study unclear. The authors did not supply any quantitative or statistical evidence to substantiate the associations described in the Results section.

We have added a clarification of the purpose of the study to the Introduction. We follow the lead of similar papers, such as Mørup and Hansen, 2010, that provide a proof of principle of applying archetypes to data from various sources. That said, the statistical fitness of the archetypes is built into the algorithm for finding them. The RSS of the truncation level chosen gives the total error of representing the data with that number of archetypes. We have included a new figure to show an approximation to the time series by the archetypes, to show this graphically.

3) The rationale behind the abrupt shift from discussing COVID-19 pandemic results to flu on page 10 is unclear, as is the relevance of flu to the overall topic of the paper.

These are typos; thanks to the reviewer for spotting them. The text should read COVID-19 or disease, and it has been changed.

4) It is unclear why a significant portion of the results section (pages 5-6) is devoted to results acknowledged by the authors to be biased.

Another question that could be asked is why not compute archetypes on the entire data set? This section was included to explain the issues with doing so. We have highlighted the justification in the Abstract, Introduction and the Results section.

5) The decision to select 10 archetypes for the analysis (as stated on page 5) lacks a clear justification.

We have highlighted the justification in the Abstract, Introduction and the Results section, where we state: “We next consider the counties with large population centers to examine the spatio-temporal dynamics isolated to the cities and their surrounding areas. There are 10 relatively large population cities in Montana, and we chose the counties that contained these cities.”

6) The paper is not well-written. For instance, the first paragraph of the introduction leaves readers questioning the essence of the "new implementations" that rendered spatio-temporal approaches tractable (implementations of what?), why these approaches were previously intractable, and the nature of the "patterns over time" mentioned. Furthermore, the sentence "the methods can detect spatial clustering could may reveal environmental causatives" is particularly opaque.

The first paragraph has been rewritten and confusing statements reworded or removed.

7) Table 1 is not cited in the paper.

The citation label was used twice, now fixed. Thanks to the reviewer for spotting this.

Reviewer #2: The authors apply archival analysis (AA), an innovative tool similar to principal components analysis (PCA). The rationale for applying AA is convincing and the results are interesting.

Major Comment:

The subsection on the mathematical formulation has some problems. I believe these problems are in the explanation, not in any underlying issue with the method itself. The authors say “Given a specified value for k, AA identifies m-dimensional vectors z_1, z_2, … , z_k that best describe k characteristic patterns, or archetypes in the original data set … .” They go on to give some restrictions regarding these z’s, but they don’t say precisely what “best” means. The authors seem to make an attempt to make things precise at the bottom of p. 3 where they say

“The m x k matrix Z of k archetypes is defined by the matrix factorization problem:

min_{A,B} || X – XBA ||

where Z = XB.”

A couple of remarks on this statement. First, the authors don’t exactly say what the matrix factorization problem is. It actually looks like a minimization problem, not a matrix factorization problem. Is Z the minimum of || X – XBA || over all A and B? If so, they could write

Z = arg min_{A,B} || X – XBA ||.

I see no connection between Z and the minimization problem. Second, at the end of the sentence, they say Z = XB. So is Z the solution to the given minimization problem, or is it simply XB?

We have clarified the notation in this section, and included “AA seeks to find $k$ $m$-dimensional archetypes such that the RSS is minimized”. That is the meaning of “best”.
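
For readers following this exchange, the constrained problem under discussion can be written out as below. This is a sketch assembled from the symbols quoted in the reviewer's comment, not a copy of the revised manuscript: the squared Frobenius norm as the RSS and the m x n shape of X are assumptions.

```latex
% Assumed shapes: X is m x n, B is n x k, A is k x n.
\begin{aligned}
(\hat{A}, \hat{B}) &= \arg\min_{A,\,B} \; \lVert X - XBA \rVert_F^{2},
  \qquad Z = X\hat{B}, \\
\text{subject to}\quad
  & \beta_{ij} \ge 0, \quad \sum_{i=1}^{n} \beta_{ij} = 1
    \quad \text{for all } j = 1, \ldots, k, \\
  & \alpha_{ji} \ge 0, \quad \sum_{j=1}^{k} \alpha_{ji} = 1
    \quad \text{for all } i = 1, \ldots, n.
\end{aligned}
```

Written this way, Z is not itself the minimizer. The minimization is over A and B, and the archetypes are then read off as Z = XB at the optimal B, which addresses both of the reviewer's questions.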

Minor Comments

1. The authors often talk about “observations” when they really mean “days.” It would help the reader to say there are data values for counties and days. Every time I saw “observations” I had to stop reading and think about what that means.

Fixed.

2. Near the bottom of p. 2, the authors talk about the AA algorithm. At this point of the paper, I didn’t have enough background to understand what any of this means. I’d suggest trimming this back, and then coming back to it later in the paper once the groundwork is laid.

This paragraph has been moved to the Methods section.

3. It looks like the authors used Matlab, a package familiar to most mathematicians. Statisticians tend to use R because of its array of functions for doing data analysis. I believe there is an R package to do AA, which is called "archetypes". See the paper

Eugster, M., & Leisch, F. (2009). From spider-man to hero-archetypal analysis in R.

https://cran.r-project.org/web/packages/archetypes/vignettes/archetypes.pdf

We added this reference to the Methods Section: From Spider-Man to Hero – Archetypal Analysis in R. Eugster MJA and Leisch F (2009). Journal of Statistical Software, 30(8), pp. 1–23.

4. On p. 3, first paragraph, the authors say “To account for differences in population size, we weighted the COVID-19 cases … .” “Weighted” could mean a lot of things in this context. Did the authors just look at COVID rates on a per capita basis?

Changed the word to normalized, and yes, it makes the measurements per capita.
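
As an illustration of what per capita normalization means here, the following is a minimal sketch. The function name, the per-100,000 scale, and the counts are hypothetical and are not taken from the manuscript or its data.

```python
import numpy as np

def per_capita(cases, population, per=100_000):
    """Convert a (counties x days) array of case counts to rates per `per` residents.

    `per=100_000` is an assumed reporting scale, not the authors' choice.
    """
    cases = np.asarray(cases, dtype=float)
    population = np.asarray(population, dtype=float)
    # Divide each county's row by that county's population.
    return cases / population[:, None] * per

# Hypothetical counts for two counties over three days.
example_cases = [[5, 10, 20], [5, 10, 20]]
example_population = [1_000, 10_000]
rates = per_capita(example_cases, example_population)
```

With identical raw counts, the smaller county ends up with a rate ten times higher, which is the population effect the normalization is meant to account for.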

5. p. 3, just above equation (1). Should be “… coefficients that sum to unity …”

Fixed.

6. In equation (1) and thereafter, the betas are vectors and should be bold. This can be easily done in LaTeX using the amsmath package and the \boldsymbol{} command.

In equation 1 the beta is indexed by i and j and refers to a number. When beta is indexed by one subscript it is a vector.

Following this recommendation, we made all the alpha and beta vectors appear in boldface.

7. p. 3, just after equation (1), Should be “Each archetype is either a convex combination of …”

Fixed.

8. p. 3, lines 84 and 90 (using numbering in the right margin). Write these matrices using brackets:

B = [ beta_1, … , beta_k ] and A = [ alpha_1 , … , alpha_n ]. These alphas and betas are themselves vectors and should be bold. Also use \ldots for the ellipsis between commas; save \cdots for the ellipsis between math operators like “+”.

Fixed.
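
The typesetting the reviewer recommends in comments 6 and 8 might look like the following minimal LaTeX sketch. The surrounding document skeleton is an assumption for illustration and is not the authors' source file.

```latex
\documentclass{article}
\usepackage{amsmath} % provides \boldsymbol
\begin{document}
% Bold Greek vectors, with \ldots between commas.
\[
  B = [\, \boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_k \,], \qquad
  A = [\, \boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_n \,]
\]
% \cdots is reserved for ellipses between operators such as +.
\[
  \beta_{1j} + \beta_{2j} + \cdots + \beta_{nj} = 1
\]
\end{document}
```

The scalar entries with two subscripts stay in regular weight, matching the authors' note that a doubly indexed beta is a number rather than a vector.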

9. In equation (1), the constraints must hold for all j. The “for all j” could be added at the end of (1).

Fixed.

10. The definition of mutual information at the bottom of p. 4 looked asymmetric to me. I got out some paper and tried to see whether it was symmetric, and of course it is. Then I turned the page and saw that the authors pointed out the symmetry. By the way, the description of MI at the top of p. 5 is very nice!

Thank you.
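
The symmetry the reviewer checked by hand can also be seen numerically. Below is a minimal sketch of a histogram (plug-in) estimate of mutual information between two time series. The function name, the bin count, and the synthetic series are assumptions for illustration; the estimator actually used in the manuscript is not described in this correspondence.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in estimate of mutual information (in bits) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Synthetic series (not county data): b tracks a closely, c is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)
mi_ab, mi_ba, mi_ac = (mutual_information(a, b),
                       mutual_information(b, a),
                       mutual_information(a, c))
```

Swapping the arguments only transposes the joint histogram, so `mi_ab` and `mi_ba` agree, while the unrelated pair gives a much smaller value than the coupled pair.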

11. I couldn’t help but think that ten archetypes may be a bit much. I suspect that by the time you get past the first five or six, you get mostly noise. I’m not suggesting anything specific here, except maybe a statement that there will come a point beyond which we are observing noise.

The truncation level is always a tricky issue. There is an elbow in the scree plot around 6, but we found that the 6 archetype set missed some of the signal that we wanted to capture. Being able to check the reconstruction against the data allows for further refinement of the truncation.
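
The trade-off described here, RSS against the number of archetypes, can be explored with a small sketch like the one below. It uses a plain projected-gradient scheme, which is a stand-in chosen for brevity and is not the authors' Matlab code or the package they cite. All function names, the iteration count, and the random test matrix are assumptions.

```python
import numpy as np

def project_columns_to_simplex(M):
    """Euclidean projection of every column of M onto the probability simplex."""
    n, c = M.shape
    U = -np.sort(-M, axis=0)                     # each column sorted descending
    css = np.cumsum(U, axis=0) - 1.0
    ranks = np.arange(1, n + 1)[:, None]
    cond = U - css / ranks > 0
    rho = n - 1 - np.argmax(cond[::-1], axis=0)  # last row where cond holds
    theta = css[rho, np.arange(c)] / (rho + 1)
    return np.maximum(M - theta, 0.0)

def fit_aa(X, k, iters=200, seed=0):
    """Projected-gradient sketch of min ||X - XBA||_F^2 with simplex-constrained columns."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    B = project_columns_to_simplex(rng.random((n, k)))
    A = project_columns_to_simplex(rng.random((k, n)))
    G = X.T @ X
    norm_G = np.linalg.norm(G, 2)                # spectral norm, used for step sizes
    for _ in range(iters):
        Z = X @ B                                # current archetypes
        step_A = 1.0 / (np.linalg.norm(Z.T @ Z, 2) + 1e-12)
        A = project_columns_to_simplex(A - step_A * (Z.T @ (Z @ A - X)))
        step_B = 1.0 / (norm_G * np.linalg.norm(A @ A.T, 2) + 1e-12)
        B = project_columns_to_simplex(B - step_B * ((G @ B @ A - G) @ A.T))
    rss = float(np.linalg.norm(X - X @ B @ A) ** 2)
    return B, A, rss

def rss_by_k(X, ks):
    """RSS for each candidate number of archetypes, as plotted in a scree plot."""
    return {k: fit_aa(X, k)[2] for k in ks}

# Random stand-in data (5 variables, 40 samples); not the Montana data set.
X_demo = np.random.default_rng(1).random((5, 40))
scree = rss_by_k(X_demo, range(1, 7))
```

Plotting `scree` against k gives the kind of curve in which an elbow is sought, and `X @ B @ A` is the reconstruction that can be compared with the data when deciding whether a given truncation misses signal. Because the fit is only a local optimum, the values are indicative rather than exact.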

12. p. 8, line 221. Shouldn’t “no flu” be “no COVID”? Same page, line 228, shouldn’t figure 8 be “Fig. 8”?

We have removed all occurrences of the word “flu” in the document, and fixed the figure reference.

13. Throughout, I’d suggest not spelling out alpha. Just use $\alpha$.

Fixed.

14. p. 13, lines 355-357. “A high (relative) MI indicates that a county’s time series can be better predicted by considering the other, and vice versa.” First, congratulations on getting “vice versa” correct; I always put a hyphen between the words, which is incorrect. I think the idea is that you can augment data from a county with other counties for which the MI is high. The way it is worded suggests that you ignore the data from the current county and just use the data from counties with a high MI.

Changed it to “A high (relative) MI between two counties, indicates that a county’s time series can be better predicted in tandem with the other’s, and vice versa.”

15. The graphs are generally fuzzy. I know that PLOS ONE allows files in TIFF or EPS format only. EPS is a vector graphics format so the figures should be clear (no fuzziness at any level of magnification). I suspect the authors used TIFF files. The best strategy would be to render the graphs in EPS format. If this isn’t possible, then maybe saving the files at a higher TIFF resolution would make them clear.

We used the PACE package to convert our .eps figures to .tiff, which is recommended by PLOS. From our end they look very clear. We are not sure what the issue is with the draft the reviewer received, but we will make sure they are legible in the production copy.

16. In Figure 3, the axes are not needed. The legend on the right is, however, needed, but this can be given once (assuming the legend is the same for all archetypes). The figure will probably be set in a 5 by 2 array of maps so it fits on one page. The county names will not be readable at this scale. Similar comments apply to Figure 8. If the county names are needed, the authors might consider giving one map of Montana with just the county names. This figure could contain the map information (e.g., compass, scale in miles, surrounding topography, etc.) along with the county names.

Thanks to the reviewer for these helpful suggestions.

________________________________________

Attachment

Submitted filename: Response to Reviewers.pdf

Decision Letter 1

Maurizio Fiaschetti

20 Oct 2023

Archetypal analysis of COVID-19 in Montana, USA, March 13, 2020 to April 26, 2022

PONE-D-23-06211R1

Dear Dr. Stone,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Maurizio Fiaschetti

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: No

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: (No Response)

Reviewer #3: The present study is interesting because it applies Archetypal Analysis to epidemiological data on COVID-19 from March 13, 2020 to April 26, 2022, for the counties of Montana, USA. The manuscript has already undergone an extensive review, with many detailed comments that have substantially improved its quality. The authors made the suggested adjustments appropriately and the manuscript can be accepted for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Steven E. Rigdon

Reviewer #3: No

**********

Acceptance letter

Maurizio Fiaschetti

24 Nov 2023

PONE-D-23-06211R1

Archetypal analysis of COVID-19 in Montana, USA, March 13, 2020 to April 26, 2022

Dear Dr. Stone:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Maurizio Fiaschetti

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (PDF)

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Data Availability Statement

    Data cannot be shared publicly because they come from the Montana Dept of Health and Human Services. Data are available from the MT DPHHS Communicable Disease Epidemiology Section (the point of contact at DPHHS is Laura Williamson: 406-444-0064) for researchers who meet the criteria for access.


    Articles from PLOS ONE are provided here courtesy of PLOS
