Abstract
Generalized $k$-means can be combined with any similarity or dissimilarity measure for clustering. Using the well-known likelihood ratio or $F$-statistic as the dissimilarity measure, a generalized $k$-means method is proposed to group generalized linear models (GLMs) for exponential family distributions. Given the number of clusters $K$, the proposed method is established by the uniformly most powerful unbiased (UMPU) test statistic for the comparison between GLMs. If $K$ is unknown, then the proposed method can be combined with the generalized information criterion (GIC) to select the best $K$ automatically. Both AIC and BIC are investigated as special cases of GIC. Theoretical and simulation results show that the number of clusters can be correctly identified by BIC but not by AIC. The proposed method is applied to the state-level daily COVID-19 data in the United States, and it identifies 6 clusters. A further study shows that the models between clusters are significantly different from each other, which confirms the result with 6 clusters.
Keywords: Clustering, COVID-19, Exponential family distributions, Generalized $k$-means, Generalized information criterion (GIC), Generalized linear models (GLMs)
1. Introduction
Generalized $k$-means, including both $k$-means and $k$-medians as special cases, can be incorporated with any similarity or dissimilarity measure for grouping objects. The similarity or dissimilarity measure can be very general. In this work, we choose the dissimilarity measure as the well-known likelihood ratio or $F$-statistic and the objects as statistical models for exponential family distributions, such that the resulting method can be used to group generalized linear models (GLMs). In particular, we assume that each object is composed of a vector for a response and a design matrix for explanatory variables, and that a GLM has been established within each object. The linear component of the GLM provides the relationship between the expected value of the response and the explanatory variables within the object. The significance of the regression coefficients for the explanatory variables is determined by the likelihood ratio statistic, which means that we can combine the likelihood ratio test with the generalized $k$-means. The current research develops the method and uses it to group the patterns of the state-level daily confirmed cases of COVID-19 in the United States.
The outbreak of COVID-19 has become a worldwide ongoing pandemic since March 2020. According to the website of the World Health Organization (WHO), by January 31, 2021, the outbreak had affected over 200 countries and territories, with more than 100 million confirmed cases and over 2 million deaths in the entire world. The most seriously affected country is the United States, with over 25 million confirmed cases and more than 400 thousand deaths. To understand the outbreak in the United States in the early period, we compare the daily patterns of new cases in the fifty states and Washington DC until July 31, 2020. We find that some of these patterns are similar to each other and some are far away from each other, implying that we can carry out a clustering analysis to group these patterns. As statistical models are involved, we use the generalized $k$-means. We adopt the likelihood ratio or $F$-statistic because it is induced by the standard uniformly most powerful unbiased (UMPU) test for exponential family distributions. Based on the theory of the UMPU test, the proposed method should be more powerful than the conventional method based on $k$-means directly on regression coefficients. This is confirmed by our simulation studies.
Clustering is one of the most popular unsupervised statistical learning methods for unknown structures. Clustering methods are often carried out by similarity or dissimilarity measures between objects, and their goal is to group the objects into a few clusters. The definition of objects can be very general: they can be observations, images, or statistical models. The purpose of clustering is to make objects within clusters mostly homogeneous and objects between clusters mostly heterogeneous. In the literature, one of the most well-known clustering methods is the $k$-means. For objects from a Euclidean space, the method assigns each of them to the cluster with the nearest mean. Based on a given $K$, it provides $K$ clusters according to $K$ centers. The centers are solved by minimizing the sum-of-squares (SSQ) criterion, formulated by the Euclidean distance between the objects. Theoretically, the SSQ criterion in the $k$-means can be replaced by any similarity or dissimilarity measure, leading to the generalized $k$-means (Bock, 2008, Soheily-Khah et al., 2016). Because the choice of the dissimilarity measure is flexible, the generalized $k$-means can be combined with any divergence measure, including UMPU test statistics.
Many clustering methods have been proposed in the literature. Examples include hierarchical clustering (Zhao and Karypis, 2005), fuzzy clustering (Trauwaert et al., 1991), density-based clustering (Kriegel et al., 2001), model-based clustering, and partitioning clustering. Model-based clustering is usually carried out by EM algorithms or Bayesian methods under the framework of mixture models (Fraley and Raftery, 2002, Lau and Green, 2007). Partitioning clustering can be interpreted by the centroidal Voronoi tessellation method in mathematics (Du and Wong, 2002). It can be further specified to $k$-means (Forgy, 1965, Hartigan and Wong, 1979, Lloyd, 1982, MacQueen, 1967), $k$-medians (Charikar and Guha, 2002), and $k$-modes (Goyal and Aggarwal, 2017), where the $k$-means is the most popular. To implement these, one needs to express the observations in a metric space, such that a distance measure can be defined. Several approaches have been developed to specify the distance measure; a review can be found in Johnson and Wichern (2002, p. 670).
Challenges appear in grouping daily patterns for the state-level COVID-19 data in the United States. Suppose that the daily patterns have been fitted by statistical models (e.g., GLMs) with the response as daily confirmed cases and the explanatory variables as certain functions of time. The interest is to know whether the models for individual states can be grouped into a few clusters. At least two other methods can be used. The first is the direct usage of an existing clustering method on estimates of coefficients; a concern may arise because it is hard to address the variability in the estimates of the coefficients. The second is the usage of mixture models, which often leads to EM algorithms for mixture structures (Qin and Self, 2006). Here, we propose another method. We use a likelihood ratio or an $F$-statistic as the dissimilarity measure in the generalized $k$-means. Because these are formulated by the UMPU test, the resulting method should theoretically be more powerful than its alternatives. To verify this, we compare our method with the other two methods by simulation studies. We find that our method has lower clustering object error (OE) rates than its competitors.
We propose our method based on a known $K$ at the beginning. When $K$ is unknown, we use GIC to select the best $K$, specified to both BIC and AIC. We find that BIC is more reliable than AIC in selecting the number of clusters; therefore, we recommend using our BIC selector. To implement our method on the COVID-19 data in the United States, we have to define an unsaturated clustering problem. In particular, we partition the coefficient vector into two sub-vectors. The first sub-vector does not contain any information about time; therefore, we only need to study the second sub-vector. The goal is to know whether the time variations between these models are similar. This problem can be partially reflected by Fig. 1. Suppose that six regression lines are compared. The intercepts do not contain any time information, so we allow them to vary within clusters. We restrict the generalized $k$-means to the slopes only, leading to two clusters. Based on our intuition, we believe that the unsaturated clustering problem can also be carried out by mixture models with EM algorithms. Because our method is developed based on the UMPU test, it should be more powerful than such alternatives.
Fig. 1.
Generalized $k$-means clustering for six regression lines.
The article is organized as follows. In Section 2, we propose our method. In Section 3, we study theoretical properties of our method. In Section 4, we evaluate our method with the comparison to a few previous methods by simulation studies. In Section 5, we implement our method to the state-level COVID-19 data in the United States. In Section 6, we provide a discussion.
2. Method
We propose our method based on a known $K$ in Section 2.1. The method is combined with GIC (Zhang et al., 2010) to select the best $K$ when $K$ is unknown, and this is introduced in Section 2.2. In Section 2.3, we specify our method to regression models for normal data and loglinear models for Poisson data. These models are treated as special cases of GLMs. The loglinear model for Poisson data can be extended to models with overdispersion for quasi-Poisson data, and this is used in the analysis of the state-level COVID-19 data in the United States.
2.1. Generalized $k$-means in GLMs
The goal of clustering is to partition a set of objects, denoted by $\mathcal{O} = \{O_1, \dots, O_m\}$, into several non-empty subsets or clusters, such that the objects within clusters are mostly homogeneous and the objects between clusters are mostly heterogeneous. If the objects are points from a Euclidean space, then the $k$-means can be used. It partitions $\mathcal{O}$ into $K$ distinct clusters, denoted by $\mathcal{C} = \{C_1, \dots, C_K\}$, with $\hat{\mathcal{C}}$ given by
$$ \hat{\mathcal{C}} = \operatorname*{argmin}_{\{C_1,\dots,C_K\}} \sum_{k=1}^{K} \sum_{O_i \in C_k} \|O_i - \bar{O}_k\|^2, \tag{1} $$
where $\bar{O}_k$ is the center of $C_k$. The right-hand side of (1) is called the SSQ criterion in the $k$-means. The generalized $k$-means is induced if the SSQ criterion is replaced by any similarity or dissimilarity measure. In particular, let $d(O_i, C_k)$ be a selected dissimilarity measure, with $O_i$ representing an object and $C_k$ representing a cluster. The generalized $k$-means solves $\hat{\mathcal{C}}$ by
$$ \hat{\mathcal{C}} = \operatorname*{argmin}_{\{C_1,\dots,C_K\}} \sum_{k=1}^{K} \sum_{O_i \in C_k} d(O_i, C_k). \tag{2} $$
If the objects are points in a Euclidean space, then the generalized $k$-means becomes the $k$-means by choosing $d(O_i, C_k) = \|O_i - \bar{O}_k\|^2$, and it becomes the $k$-medians if the $L_1$ distance to the cluster median is used instead. Furthermore, the generalized $k$-means can also be implemented by adding a penalty function to the SSQ criterion. This induces the convex clustering problem studied by Chen et al. (2016), Chi and Lange (2015), Lindsten et al. (2011), and Hocking et al. (2011) in the literature.
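Conceptually, (2) only requires a way to score how well an object fits a cluster. A minimal sketch of this skeleton in R, where `dissim(obj, members)` is a hypothetical user-supplied function (a name we introduce here, not from the paper) returning the dissimilarity between one object and the other members of a cluster:

```r
generalized_kmeans <- function(objects, K, dissim, max_iter = 100) {
  m <- length(objects)
  memb <- sample(rep(1:K, length.out = m))   # random start, for illustration only
  for (iter in 1:max_iter) {
    old <- memb
    for (i in 1:m) {
      d <- sapply(1:K, function(k) {
        members <- objects[memb == k & seq_len(m) != i]
        if (length(members) == 0) Inf else dissim(objects[[i]], members)
      })
      memb[i] <- which.min(d)                # criterion (2), object by object
    }
    if (identical(memb, old)) break          # no object moved: converged
  }
  memb
}
```

Any dissimilarity plugged into `dissim` yields a different member of the generalized $k$-means family; the rest of this section specifies it through UMPU test statistics.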
We find that $d$ in (2) can be specified via the UMPU test statistic for grouping statistical models. In this work, we restrict our attention to GLMs for exponential family distributions, which can be linear models for normal data or loglinear models for Poisson data. The task of our method is to group the GLMs into a number of clusters.
Suppose that object $O_i$ contains a response vector $\mathbf{y}_i = (y_{i1},\dots,y_{in_i})^\top$ and a design matrix $X_i = (\mathbf{x}_{i1},\dots,\mathbf{x}_{in_i})^\top$, such that the sample size of the entire data is $n = \sum_{i=1}^{m} n_i$. In $O_i$, the responses $y_{ij}$ are independently collected from an exponential family distribution with the probability mass function (PMF) or the probability density function (PDF)
$$ f(y_{ij};\theta_{ij},\phi) = \exp\left\{\frac{y_{ij}\theta_{ij} - b(\theta_{ij})}{a(\phi)} + c(y_{ij},\phi)\right\}, \tag{3} $$
where $\theta_{ij}$ is a canonical parameter representing the location and $\phi$ is a dispersion parameter representing the scale. The linear component $\eta_{ij} = \mathbf{x}_{ij}^\top\boldsymbol\beta_i$ relates the response to the explanatory variables. The link function $g$ connects the mean $\mu_{ij} = \mathrm{E}(y_{ij})$ and $\eta_{ij}$ through
$$ g(\mu_{ij}) = \eta_{ij} = \mathbf{x}_{ij}^\top\boldsymbol\beta_i \tag{4} $$
for all $i = 1,\dots,m$ and $j = 1,\dots,n_i$, where $g^{-1}$ is the inverse function obtained from (4), so that $\mu_{ij} = g^{-1}(\mathbf{x}_{ij}^\top\boldsymbol\beta_i)$. In (3), there is $\mu_{ij} = b'(\theta_{ij})$ and $\mathrm{var}(y_{ij}) = a(\phi)V(\mu_{ij})$, where $V$ is the variance function. If the canonical link is used, then (4) becomes $\theta_{ij} = \mathbf{x}_{ij}^\top\boldsymbol\beta_i$, implying that $g \circ b'$ is the identity function.
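As a concrete illustration (not part of the original notation), consider the Poisson case, where (3) takes the form
$$ f(y;\theta) = \exp\{y\theta - e^{\theta} - \log(y!)\}, $$
so that $b(\theta) = e^{\theta}$, $a(\phi) = 1$, $\mu = b'(\theta) = e^{\theta}$, and $V(\mu) = \mu$; the canonical link is $g(\mu) = \log\mu$, under which $\theta_{ij} = \mathbf{x}_{ij}^\top\boldsymbol\beta_i$ directly.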
The MLE of $\boldsymbol\beta_i$, denoted by $\hat{\boldsymbol\beta}_i$, can only be solved numerically if the distribution is not normal. A popular and well-known algorithm is the iteratively reweighted least squares (IRWLS) (Green, 1984). The IRWLS is equivalent to the Fisher scoring algorithm, and it is identical to the Newton–Raphson algorithm under the canonical link. After $\hat{\boldsymbol\beta}_i$ is derived, a straightforward method is to estimate $\phi$ by moment estimation (McCullagh, 1983) as
$$ \hat\phi = \frac{1}{d_r}\sum_{i=1}^{m}\sum_{j=1}^{n_i}\frac{(y_{ij}-\hat\mu_{ij})^2}{V(\hat\mu_{ij})}, \tag{5} $$
where $\hat\mu_{ij} = g^{-1}(\mathbf{x}_{ij}^\top\hat{\boldsymbol\beta}_i)$ and $d_r$ is the residual degrees of freedom. If $\phi$ is not present in (3), then (5) is not needed; this occurs in Bernoulli, binomial, and Poisson models. The IRWLS is the standard algorithm for fitting GLMs and has been adopted by standard statistical software packages.
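A minimal sketch of this fitting step in R, assuming each object is stored as a response vector `y` and a design matrix `X` (the function name is ours); `glm()` runs IRWLS internally, and the moment estimator (5) is assembled from Pearson residuals:

```r
fit_object <- function(y, X, family = gaussian()) {
  fit <- glm(y ~ X - 1, family = family)   # glm() performs IRWLS internally
  mu <- fitted(fit)
  df_res <- length(y) - ncol(X)            # residual degrees of freedom
  # moment (Pearson) estimator (5): squared Pearson residuals over residual df
  phi_hat <- sum((y - mu)^2 / family$variance(mu)) / df_res
  list(beta = coef(fit), phi = phi_hat)
}
```

For Poisson or binomial families, `phi_hat` is not needed and can be ignored, matching the remark above.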
Our interest is to group $\boldsymbol\beta_1,\dots,\boldsymbol\beta_m$ into a few clusters, such that $\boldsymbol\beta_i = \boldsymbol\beta_{i'}$ if objects $O_i$ and $O_{i'}$ are in the same cluster and $\boldsymbol\beta_i \neq \boldsymbol\beta_{i'}$ otherwise. The regression version of this problem has been previously investigated for gene expressions by an EM algorithm for Gaussian mixture models (Qin and Self, 2006). Their interest is to know whether the entire coefficient vectors can be partitioned into a few clusters. In our method, we allow a few components of $\boldsymbol\beta_i$ to be different within clusters; therefore, we only need to partition the objects based on the remaining components.
Suppose that (4) is expressed as
$$ g(\mu_{ij}) = \mathbf{x}_{1ij}^\top\boldsymbol\beta_{1i} + \mathbf{x}_{2ij}^\top\boldsymbol\beta_{2i}, \tag{6} $$
where $\boldsymbol\beta_i = (\boldsymbol\beta_{1i}^\top, \boldsymbol\beta_{2i}^\top)^\top$ and $\mathbf{x}_{ij} = (\mathbf{x}_{1ij}^\top, \mathbf{x}_{2ij}^\top)^\top$. We want to know whether $\boldsymbol\beta_{21},\dots,\boldsymbol\beta_{2m}$ can be grouped into a few clusters, such that we only need $\boldsymbol\beta_{2i} = \boldsymbol\beta_{2i'}$ if objects $O_i$ and $O_{i'}$ are in the same cluster. Based on a given $K$, our clustering model is
$$ \boldsymbol\beta_{2i} = \boldsymbol\gamma_k \quad \text{if } O_i \in C_k \tag{7} $$
for $k = 1,\dots,K$. We call (7) the unsaturated clustering problem. The saturated clustering problem is induced if the unclustered component $\boldsymbol\beta_{1i}$ is absent in (7). As the choice of $\mathbf{x}_{1ij}$ and $\mathbf{x}_{2ij}$ is flexible in (7), our method can be used to group GLMs based on any arbitrary sub-vector of $\boldsymbol\beta_i$. In practice, the choice of $\mathbf{x}_{1ij}$ and $\mathbf{x}_{2ij}$ depends on interpretations or the interest of the application. If no information is provided, then we can simply study the saturated clustering problem.
The best measure of the difference between statistical models under (6) or (7) is the UMPU test statistic. The UMPU test is optimal in finite samples for the comparison between two statistical models, and it is also optimal in the simultaneous comparison between many statistical models. The UMPU test is more powerful than any other test with the same type I error probability. This motivates us to use the UMPU test statistic to define the dissimilarity measure in (2).
We want to start with a good initial partition in our generalized $k$-means. We do not follow the usual $k$-means algorithms, as they select the initial clusters randomly. Instead, we want to choose the initial clusters to be as heterogeneous as possible. This idea has been previously used in the initialization of the traditional $k$-means with observations from a Euclidean space based on a complete weighted graph (Gonzalez, 1985). It has also been used in the initialization of the $k$-means++ proposed by Arthur and Vassilvitskii (2007), who point out that $k$-means++ generally outperforms $k$-means with random initial centers in terms of both accuracy and speed by substantial margins.
The goal of our initialization can be achieved by selecting the $K$ most dissimilar seeds first and then using them to generate the entire initial partition. We use a sequential approach to obtain the seeds. At the beginning, we randomly choose the first seed from $\mathcal{O}$, denote it as $O_{i_1}$, and treat it as the seed for $C_1$. To obtain the second seed $O_{i_2}$ for $C_2$, we calculate the UMPU test statistic for
$$ H_0: \boldsymbol\beta_{2i} = \boldsymbol\beta_{2i_1} \quad \text{versus} \quad H_1: \boldsymbol\beta_{2i} \neq \boldsymbol\beta_{2i_1} \tag{8} $$
for any $O_i \neq O_{i_1}$. A larger value of the UMPU test statistic indicates greater dissimilarity between $O_i$ and $O_{i_1}$. The UMPU test statistic can be either a likelihood ratio or an $F$-statistic: it is the likelihood ratio statistic if $\phi$ is absent in (3) (e.g., in binomial or Poisson regression) or the $F$-statistic if $\phi$ is present (e.g., in linear regression). We want $O_{i_2}$ to be the most dissimilar to $O_{i_1}$. This can be achieved by maximizing the tail probability of the UMPU test statistic, which is equivalent to minimizing the $p$-value. Therefore, the resulting $O_{i_2}$ has the lowest $p$-value in (8).
Now we have two seeds $O_{i_1}$ and $O_{i_2}$. We want to derive the third seed $O_{i_3}$ for $C_3$. We cannot use the simple UMPU test given by (8) to select $O_{i_3}$, so we incorporate the minimax principle. For each remaining $O_i$, we calculate the UMPU test statistic for
$$ H_0: \boldsymbol\beta_{2i} = \boldsymbol\beta_{2i_\ell} \quad \text{versus} \quad H_1: \boldsymbol\beta_{2i} \neq \boldsymbol\beta_{2i_\ell}, \quad \ell = 1, 2. \tag{9} $$
For a given $O_i$, (9) contains two testing problems, obtained by taking $\ell = 1$ and $\ell = 2$, respectively. We want $O_{i_3}$ to be the most dissimilar to both $O_{i_1}$ and $O_{i_2}$. We can achieve this by minimizing the maximum of the two $p$-values; this gives $O_{i_3}$. Using this idea, we can obtain all of the seeds $O_{i_1},\dots,O_{i_K}$ for $C_1,\dots,C_K$, respectively.
To finalize our initial partition, the next task is to assign the remaining objects to one of $C_1,\dots,C_K$. We assign $O_i$ to cluster $C_k$ if it is the most similar to the seed $O_{i_k}$. We need this for all non-seed objects, which can also be achieved by the UMPU test statistic given by (9) with $\ell = 1,\dots,K$. We claim that $O_i$ is the most similar to $O_{i_k}$ if the $p$-value of the UMPU test is maximized at $\ell = k$. Then, we have our initial partition.
We next carry out an iterative method to update the partition. We want to reassign every object to the cluster candidate with an improved result, which can also be achieved by the UMPU test. In particular, let $\{C_1,\dots,C_K\}$ be the result given by the previous iteration. Then, for each $O_i$, there exists a unique $k$ such that $O_i \in C_k$. In the current iteration, we need to determine whether $O_i$ should be kept in $C_k$ or moved to another $C_{k'}$ with $k' \neq k$. After we do this for all $O_i$, we obtain an updated result for the current iteration, which plays the role of the previous result in the next iteration.
To derive the updated partition based on the previous one, for each $O_i$, we need to know whether $O_i$ should be kept in the current $C_k$ or moved to another $C_{k'}$. To fulfill the task, we calculate the UMPU test statistic for
$$ H_0: \boldsymbol\beta_{2i} = \boldsymbol\gamma_{k'} \quad \text{versus} \quad H_1: \boldsymbol\beta_{2i} \neq \boldsymbol\gamma_{k'} \tag{10} $$
for every $k' = 1,\dots,K$. Because there are $K$ cluster candidates, we obtain $K$ $p$-values for each $O_i$. We want to reassign $O_i$ to the most similar cluster by using these $p$-values: for every $O_i$, we reassign $O_i$ to cluster candidate $C_{k'}$ if the $p$-value of the UMPU test statistic given by (10) is maximized at $k'$. After we apply the method to all $O_i$, we obtain the updated partition, which becomes the previous partition in the next iteration. Although it is very unlikely, to theoretically ensure that each cluster is non-empty, we do not move the object with the largest $p$-value in the current $C_k$ to any other cluster. Then, we have the following algorithm.
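A compact sketch of the two-stage procedure in R, assuming a helper `pvalue(i, members)` (our name, not the paper's) that returns the $p$-value of the UMPU test comparing object `i` with the model pooled over the objects in `members`:

```r
gkmeans_glm <- function(m, K, pvalue, max_iter = 50) {
  # Stage 1: seeding. First seed at random; each later seed minimizes the
  # maximum p-value against the seeds already chosen (the minimax rule).
  seeds <- sample(m, 1)
  while (length(seeds) < K) {
    cand  <- setdiff(1:m, seeds)
    worst <- sapply(cand, function(i) max(sapply(seeds, function(s) pvalue(i, s))))
    seeds <- c(seeds, cand[which.min(worst)])
  }
  # Initial partition: every object joins the seed it is most similar to.
  memb <- sapply(1:m, function(i) which.max(sapply(seeds, function(s) pvalue(i, s))))
  memb[seeds] <- 1:K
  # Stage 2: reassign each object to the cluster with the largest p-value.
  for (iter in 1:max_iter) {
    old <- memb
    for (i in 1:m) {
      p <- sapply(1:K, function(k) {
        others <- setdiff(which(memb == k), i)
        if (length(others) == 0) 1 else pvalue(i, others)  # keep clusters non-empty
      })
      memb[i] <- which.max(p)
    }
    if (identical(memb, old)) break  # converged
  }
  memb
}
```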
Algorithm 1 has two major stages. The second stage is given by Step 2 to Step 5, which is common in many $k$-means algorithms. The goal of the first stage, given by Step 1, is to find the best initial partition; we want it to be as heterogeneous as possible. In the end, the algorithm provides $K$ non-empty clusters together with the value and the $p$-value of the UMPU test statistic based on the final partition.
The usage of Step 1 in Algorithm 1 can increase the accuracy and the speed compared to methods with a random assignment of the initial partition. This has been found in the $k$-means++ when data are collected from a Euclidean space (Arthur and Vassilvitskii, 2007). Because our Step 1 can be treated as an extension of the initialization in $k$-means++, we treat $k$-means++ as our method when we compare it with our competitors for data from Euclidean spaces. This is used in Section 4.2.
2.2. Generalized information criterion
The generalized $k$-means proposed in Section 2.1 cannot be used if $K$ is unknown. To overcome the difficulty, we use the likelihood function given by Algorithm 1 to construct a penalized likelihood function, which is used to determine $K$ when it is unknown. The penalized likelihood approach has been widely applied in variable selection problems. It has also been used in clustering analysis problems (Chen et al., 2016, Chi and Lange, 2015, Hocking et al., 2011). Here, we adopt the well-known GIC approach (Zhang et al., 2010) to construct our objective function, with the best $K$ obtained by optimizing the corresponding criterion.
Let $\ell(\boldsymbol\vartheta)$ be the loglikelihood of (7), where $\boldsymbol\vartheta$ represents all of the parameters involved in the model. If the dispersion parameter $\phi$ is not present, then $\boldsymbol\vartheta$ is composed of $\boldsymbol\beta_{1i}$ and $\boldsymbol\gamma_k$ for all $i$ and $k$ only, and it is enough for us to use $\ell(\boldsymbol\vartheta)$ to define the objective function in GIC. If the dispersion parameter is present, then we need to address the impact of the estimator of $\phi$, because the variance can be seriously underestimated in the penalized likelihood approach under the high-dimensional setting (Fan et al., 2012). We introduce our approach based on (3) without $\phi$ first, and we then modify it for the case when $\phi$ is present.
Assume that $\phi$ does not appear in (3). The GIC for (7) is defined as $-2\ell(\hat{\boldsymbol\vartheta}) + a_n d_K$, where $\hat{\boldsymbol\vartheta}$ is the MLE of $\boldsymbol\vartheta$, $d_K$ is the model degrees of freedom under (7), and $a_n$ is a positive number that controls the properties of GIC. If $p_1$ is the dimension of $\boldsymbol\beta_{1i}$ and $p_2$ is the dimension of $\boldsymbol\gamma_k$, then $d_K = mp_1 + Kp_2$. Because $mp_1$ does not vary with $K$, we define the objective function in our GIC as
$$ \mathrm{GIC}(K) = -2\ell(\hat{\boldsymbol\vartheta}) + a_n K p_2. \tag{11} $$
The best $K$ is solved by
$$ \hat{K} = \operatorname*{argmin}_{K} \mathrm{GIC}(K), \tag{12} $$
where $\hat{\boldsymbol\vartheta}$ in (11) is evaluated at the best grouping based on the current $K$. The GIC given by (11) includes AIC if we choose $a_n = 2$ or BIC if we choose $a_n = \log n$. If these are adopted, then the solutions given by (12) are denoted by $\hat{K}_{\mathrm{AIC}}$ and $\hat{K}_{\mathrm{BIC}}$, respectively.
We need to estimate the dispersion parameter $\phi$ if it is present. Because the estimator based on the current $K$ can be seriously biased, we recommend using $m$ (i.e., one cluster per object) as the number of clusters in the computation of the estimate of $\phi$. In particular, we calculate the best grouping based on the current $K$ in the generalized $k$-means and use it to compute $\hat{\boldsymbol\beta}_{1i}$ and $\hat{\boldsymbol\gamma}_k$ for all $i$ and $k$. Next, we estimate $\phi$ by (5) with the number of clusters set equal to $m$. This is analogous to the full model versus the reduced model approach in linear regression, where the variance parameter is always estimated under the full model. We treat the model with $m$ clusters in (7) as the full model and the model with $K < m$ clusters as the reduced model, and we estimate $\phi$ based on the full model but not the reduced model. After $\hat\phi_m$ is derived, we put it into (11) in the computation of GIC. We then use (12) to calculate the best $K$ when $\phi$ is present. This is used in our method for regression models.
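A minimal sketch of the selection step (12) in R, assuming a helper `loglik(K)` (the maximized loglikelihood of (7) with $K$ clusters, as returned by Algorithm 1) and the dimension `p2` of the clustered coefficients; both names are ours:

```r
select_K <- function(K_max, loglik, p2, n, criterion = c("BIC", "AIC")) {
  a_n <- if (match.arg(criterion) == "BIC") log(n) else 2   # penalty weight
  # objective (11); with a dispersion parameter, -2 * loglik(K) is replaced
  # by SSE_K / sigma2_m as in (18) below, or D_K / phi_m as in (22)
  gic <- sapply(1:K_max, function(K) -2 * loglik(K) + a_n * K * p2)
  which.min(gic)                                            # the selector (12)
}
```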
2.3. Specification
In regression, (6) becomes
$$ \mathbf{y}_i = X_{1i}\boldsymbol\beta_{1i} + X_{2i}\boldsymbol\beta_{2i} + \boldsymbol\epsilon_i, \tag{13} $$
where $X_{1i}$ and $X_{2i}$ are the sub-matrices of $X_i$ corresponding to $\boldsymbol\beta_{1i}$ and $\boldsymbol\beta_{2i}$, respectively, and $\boldsymbol\epsilon_i \sim \mathcal{N}(\mathbf{0}, \sigma^2 I_{n_i})$ is the error vector. With a given $K$, our generalized $k$-means model becomes
$$ \mathbf{y}_i = X_{1i}\boldsymbol\beta_{1i} + X_{2i}\boldsymbol\gamma_k + \boldsymbol\epsilon_i \quad \text{if } O_i \in C_k \tag{14} $$
for $k = 1,\dots,K$. We treat (14) as a special case of (13). Because the second stage in Algorithm 1 is common, we only discuss the first stage.
We select the seed $O_{i_1}$ for $C_1$ randomly. Suppose that $O_{i_1},\dots,O_{i_k}$ have been selected as the seeds for $C_1,\dots,C_k$ for some $1 \le k < K$, respectively. To determine the seed $O_{i_{k+1}}$ for $C_{k+1}$, we calculate the dissimilarity measure between $O_i$ and $O_{i_\ell}$ for each non-seed $O_i$ and $\ell = 1,\dots,k$ based on the combined model $\mathbf{y}_{i'} = X_{1i'}\boldsymbol\beta_{1i'} + X_{2i'}\boldsymbol\beta_2 + z_{i'}X_{2i'}\boldsymbol\delta + \boldsymbol\epsilon_{i'}$, where $i' = i$ or $i_\ell$, $z_{i'}$ is the dummy variable defined as $z_{i'} = 0$ if $i' = i_\ell$ or $z_{i'} = 1$ if $i' = i$, and $\boldsymbol\epsilon_{i'}$ is the error vector. As the UMPU test statistic becomes an $F$-statistic, we calculate the $F$-statistic for
$$ H_0: \boldsymbol\delta = \mathbf{0} \quad \text{versus} \quad H_1: \boldsymbol\delta \neq \mathbf{0}. \tag{15} $$
Let $p_{i\ell}$ be the $p$-value of the $F$-statistic. We define the $p$-value of the dissimilarity between $O_i$ and the current seeds as $\max_{1\le\ell\le k} p_{i\ell}$. We choose $O_{i_{k+1}}$ as the seed for $C_{k+1}$ if it has the lowest such value among all candidate objects. Therefore, $O_{i_{k+1}}$ is given by the minimax principle as
$$ i_{k+1} = \operatorname*{argmin}_{i \notin \{i_1,\dots,i_k\}} \max_{1 \le \ell \le k} p_{i\ell}. \tag{16} $$
After we obtain $\{O_{i_1},\dots,O_{i_K}\}$, which is the set of all of the seeds for $C_1,\dots,C_K$, we calculate the $p$-value of the $F$-statistic for (15) for every remaining $O_i$ and every $\ell = 1,\dots,K$. We assign $O_i$ to $C_k$ if $p_{i\ell}$ is maximized at $\ell = k$. Then, we have the initial partition. By iterating the second stage in Algorithm 1, we obtain the final partition based on a given $K$.
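A minimal sketch of the $F$-test (15) for a single pair of objects in R, with the unclustered components reduced to object-specific intercepts for brevity (an assumption of the sketch, not the paper's full model); `X21` and `X22` are the clustered design matrices of the two objects:

```r
f_test_pvalue <- function(y1, X21, y2, X22) {
  y  <- c(y1, y2)
  X2 <- rbind(X21, X22)                            # clustered design, stacked
  z  <- c(rep(0, length(y1)), rep(1, length(y2)))  # object indicator dummy
  reduced <- lm(y ~ X2 + z)     # common beta2, object-specific intercepts
  full    <- lm(y ~ X2 * z)     # adds the interaction X2:z, i.e., delta
  anova(reduced, full)[2, "Pr(>F)"]  # p-value of the F-test for delta = 0
}
```

Smaller p-values indicate more dissimilar objects, which is exactly how (16) ranks the seed candidates.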
Because $\sigma^2$ is present, we follow the GIC for variable selection in regression models (Zhang et al., 2010) and propose our GIC based on a known $\sigma^2$ as
$$ \mathrm{GIC}(K) = \frac{\mathrm{SSE}_K}{\sigma^2} + a_n K p_2, \tag{17} $$
where $\mathrm{SSE}_K$ is the sum of squared errors given by (14).
Because $\sigma^2$ cannot be known, we need to estimate it in our method. We use the full versus reduced model approach. If the current $K$ is used, then the estimate of $\sigma^2$ is $\mathrm{SSE}_K$ divided by its residual degrees of freedom, so the first term on the right-hand side of (17) is always equal to the residual degrees of freedom, implying that this estimate cannot be used. To overcome the difficulty, we use $K = m$ in (14) to estimate $\sigma^2$, denoted as $\hat\sigma_m^2$. Therefore, our GIC based on an unknown $\sigma^2$ becomes
$$ \mathrm{GIC}(K) = \frac{\mathrm{SSE}_K}{\hat\sigma_m^2} + a_n K p_2, \tag{18} $$
where $\mathrm{SSE}_K$ is the SSE with $K$ clusters in (14). This is appropriate: if the number of true clusters is less than or equal to $K$, then slightly increasing the number of clusters does not significantly change $\mathrm{SSE}_K$, implying that the second term dominates the right-hand side of (18); otherwise, $\mathrm{SSE}_K$ is significantly reduced, implying that the first term dominates. Therefore, the objective function in our GIC provides a nice trade-off between the SSE and the penalty function.
In loglinear models for Poisson data, (6) becomes
$$ \log(\mu_{ij}) = \mathbf{x}_{1ij}^\top\boldsymbol\beta_{1i} + \mathbf{x}_{2ij}^\top\boldsymbol\beta_{2i}. \tag{19} $$
With a given $K$, it reduces to
$$ \log(\mu_{ij}) = \mathbf{x}_{1ij}^\top\boldsymbol\beta_{1i} + \mathbf{x}_{2ij}^\top\boldsymbol\gamma_k \quad \text{if } O_i \in C_k \tag{20} $$
for $k = 1,\dots,K$. Analogous to the regression models, after selecting $O_{i_1}$ randomly, we investigate
$$ \log(\mu_{i'j}) = \mathbf{x}_{1i'j}^\top\boldsymbol\beta_{1i'} + \mathbf{x}_{2i'j}^\top\boldsymbol\beta_2 + z_{i'}\mathbf{x}_{2i'j}^\top\boldsymbol\delta \tag{21} $$
with $i' = i$ or $i_\ell$. We measure the dissimilarity between $O_i$ and $O_{i_\ell}$ by the likelihood ratio statistic for $H_0: \boldsymbol\delta = \mathbf{0}$. We derive the initial partition by the same idea that we displayed for regression models. With the second stage in Algorithm 1, we obtain the final partition based on a given $K$. To determine the best $K$, we choose $-2\ell(\hat{\boldsymbol\vartheta})$ in (11) as the residual deviance of (20). As the dispersion parameter is not present, the implementation of GIC is straightforward.
For quasi-Poisson data, there is $\mathrm{var}(y_{ij}) = \phi\mu_{ij}$, implying overdispersion when $\phi > 1$. We can still use (19), (20), and (21) to find the best partition with a given $K$. To determine the best $K$ when it is unknown, we estimate $\phi$ by (5), which is the Pearson goodness-of-fit statistic under (20) divided by its residual degrees of freedom. For the same reason, we choose the number of clusters equal to $m$ in (20) in estimating $\phi$, denoted as $\hat\phi_m$. This induces
$$ \mathrm{GIC}(K) = \frac{D_K}{\hat\phi_m} + a_n K p_2, \tag{22} $$
where $D_K$ is the residual deviance (i.e., the deviance goodness-of-fit statistic) with $K$ clusters in (20).
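A minimal sketch of (22) in R, assuming a helper `deviance_K(K)` that returns the residual deviance of model (20) with `K` clusters, and a dispersion estimate `phi_full` computed under $K = m$; the names are ours:

```r
gic_quasipoisson <- function(K, deviance_K, phi_full, p2, n, a_n = log(n)) {
  deviance_K(K) / phi_full + a_n * K * p2   # objective (22)
}

# phi_full itself follows (5): the Pearson statistic over the residual df of
# the fit with one cluster per object, e.g. for fitted means `mu`:
#   phi_full <- sum((y - mu)^2 / mu) / df_res
```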
3. Asymptotic properties
We evaluate asymptotic properties of our method under $n \to \infty$, achieved by letting every $n_i \to \infty$. To simplify our notation, we assume that $n_1,\dots,n_m$ are all equal to $n_0$, such that we have $n = mn_0$ in our data. The case with distinct $n_i$ can be proven under their minimum going to infinity with bounded ratios between the minimum and the maximum, where the idea is the same.
The asymptotic properties are evaluated under $n_0 \to \infty$, possibly with $m \to \infty$, which includes the case when both $m$ and $K$ are constants. For any $O_i \neq O_{i'}$, let $\Lambda_{ii'}$ be the likelihood ratio statistic for
$$ H_0: \boldsymbol\beta_{2i} = \boldsymbol\beta_{2i'} \quad \text{versus} \quad H_1: \boldsymbol\beta_{2i} \neq \boldsymbol\beta_{2i'}. \tag{23} $$
As $n_0 \to \infty$, $\Lambda_{ii'}$ is asymptotically $\chi^2$-distributed if $O_i$ and $O_{i'}$ are in the same cluster, and goes to $\infty$ with rate $n_0$ otherwise. Because (23) is applied to all pairs in $\mathcal{O}$, the multiple testing problem must be addressed. This can be solved by the method of higher criticism (Donoho and Jin, 2004). Because we restrict our methods to exponential family distributions, all of the usual regularity conditions (e.g., all those listed in Chapters 17, 18, and 22 of Ferguson (1996)) for consistency and asymptotic normality of the MLE and the asymptotic $\chi^2$-distribution of the likelihood ratio statistic hold. Therefore, we do not need to impose any other conditions.
Lemma 1
Assume that $(y_{ij}, \mathbf{x}_{ij})$ for $j = 1,\dots,n_0$ are iid copies of (7) with PDF or PMF given by (3) based on a non-degenerate common distribution of $\mathbf{x}_{ij}$ for any given $i$. If $O_i$ and $O_{i'}$ are in the same cluster, then $\Lambda_{ii'}$ converges in distribution to a $\chi^2$ distribution as $n_0 \to \infty$. If $O_i$ and $O_{i'}$ are in different clusters, then there exists a positive constant $c_{ii'}$ such that the limiting distribution of $n_0^{-1/2}(\Lambda_{ii'} - c_{ii'}n_0)$ is non-degenerate as $n_0 \to \infty$.
Proof
The conclusion can be proven by the standard approach to the asymptotic properties of maximum likelihood and M-estimation. Please refer to Chapter 22 in Ferguson (1996) and Chapter 5 in van der Vaart (1998). □
Theorem 1
If the assumption of Lemma 1 holds, and $\log m = o(n_0^{\kappa})$ for some $\kappa \in (0,1)$ when $m \to \infty$, then $\Pr(\hat{\mathcal{C}} = \mathcal{C}) \to 1$ as $n_0 \to \infty$, where $\hat{\mathcal{C}}$ is the partition given by Algorithm 1 under the true $K$ and $\mathcal{C}$ is the true partition.
Proof
Note that the likelihood ratio test based on $\Lambda_{ii'}$ is applied to $m(m-1)/2$ distinct pairs. We need to evaluate the impact of the multiple testing problem. We examine the distribution of the maximum of the $\Lambda_{ii'}$ based on Lemma 1. According to Donoho and Jin (2004), it is asymptotically bounded by a constant times $\log m$ if $O_i$ and $O_{i'}$ are in the same cluster, and it increases to $\infty$ with rate $n_0$ if $O_i$ and $O_{i'}$ are in different clusters. Thus, with probability going to one, the increasing rate of $\Lambda_{ii'}$ with $O_i$ and $O_{i'}$ in different clusters is faster than that with $O_i$ and $O_{i'}$ in the same cluster, implying the conclusion. □
Theorem 2
Assume that $\phi$ is not present in (3) or is consistently estimated by $\hat\phi_m$ used in the construction of GIC, and that the assumption of Theorem 1 holds. If $a_n \to \infty$ at a rate not lower than that of $\log m$ and $a_n = o(n_0)$ as $n_0 \to \infty$, then $\Pr(\hat K = K) \to 1$ and $\Pr(\hat{\mathcal{C}}_{\hat K} = \mathcal{C}) \to 1$.
Proof
If fewer than $K$ clusters are used, then we can find at least one pair $O_i$ and $O_{i'}$ that are not in the same true cluster but are grouped into the same cluster. By Lemma 1, the first term on the right-hand side of (11) goes to $\infty$ with rate $n_0$, which is faster than the rate of GIC under the true $K$, implying that $\Pr(\hat K < K) \to 0$ as $n_0 \to \infty$. Therefore, we only need to study the case with more than $K$ clusters. The loglikelihood function of (7) based on a given partition is equal to the sum of the loglikelihood functions obtained from the individual clusters. By Theorem 1, we can restrict our attention to the case when all objects grouped together are truly in the same cluster. By Donoho and Jin (2004), with probability going to one, the loglikelihood function of (7) with more than $K$ clusters is not higher than that under the true clusters plus a constant times $\log m$. By the property of the $\chi^2$-approximation of the likelihood ratio statistic under the true $K$, with probability going to one, the decrease of the first term on the right-hand side of (11) is dominated by the increase of the second term, and we conclude that $\Pr(\hat K > K) \to 0$. Finally, we draw the conclusion by Theorem 1. □
Theorem 1 implies that both $m$ and $K$ can increase exponentially fast in $n_0$ when $K$ is known, but the rate is significantly reduced when $K$ is unknown. If $a_n = 2$ (i.e., AIC), then the conditions of Theorem 2 do not hold in our method, implying that $\hat K_{\mathrm{AIC}}$ is not consistent, but we can still show that $\hat K_{\mathrm{BIC}}$ is consistent.
Corollary 1
Suppose that all of the assumptions of Theorem 2 on the model are satisfied. If $m \to \infty$ or $m$ is constant, and $\log n = o(n_0)$ when $n_0 \to \infty$, then $\Pr(\hat K_{\mathrm{BIC}} = K) \to 1$.
Proof
Note that the increasing rate of $\log n$ cannot be lower than the increasing rate of $\log m$. We draw the conclusion by Theorem 2. □
Corollary 1 implies that BIC can be used to determine the number of clusters if $K$ is unknown. This is consistent with many findings for BIC in tuning parameter determination. Examples include variable selection (Zhang et al., 2010) and dimension reduction (Bai et al., 2018) problems. In clustering analysis, if data are collected from a Euclidean space, then it is generally hard to provide a consistent estimator of the variance (or dispersion) parameter based on an unknown $K$, implying that it is unlikely that GIC can be implemented to determine the number of clusters. This issue can be easily solved in our method because the parameter can be consistently estimated through the statistical models. Therefore, we can use GIC to determine the number of clusters, but this approach cannot be migrated to data from Euclidean spaces.
4. Simulation
We carried out simulations to evaluate our methods. For an estimated cluster assignment $\hat{\mathcal{C}}$ and the true clustering assignment $\mathcal{C}$, we define the clustering error (CE) of $\hat{\mathcal{C}}$ as $\mathrm{CE}(\hat{\mathcal{C}}, \mathcal{C}) = \binom{m}{2}^{-1}\sum_{i<i'}|\delta_{ii'}(\hat{\mathcal{C}}) - \delta_{ii'}(\mathcal{C})|$, where $\delta_{ii'}(\hat{\mathcal{C}}) = 1$ if $O_i$ and $O_{i'}$ belong to the same cluster in $\hat{\mathcal{C}}$ and $\delta_{ii'}(\hat{\mathcal{C}}) = 0$ otherwise, and similarly for $\delta_{ii'}(\mathcal{C})$ in $\mathcal{C}$. For estimated clustering assignments $\hat{\mathcal{C}}_1,\dots,\hat{\mathcal{C}}_B$ obtained from $B$ simulation replications, we calculate the percentage of clustering object errors (OE) by
$$ \mathrm{OE} = \frac{100}{B}\sum_{b=1}^{B}\mathrm{CE}(\hat{\mathcal{C}}_b, \mathcal{C}). \tag{24} $$
This is a commonly used criterion in the clustering literature (Wang, 2010). We also study the percentage of numbers of clusters identified correctly (IC) as
$$ \mathrm{IC} = \frac{100}{B}\sum_{b=1}^{B} I(\hat{K}_b = K), \tag{25} $$
where $\hat{K}_1,\dots,\hat{K}_B$ are the estimated numbers of clusters obtained from the $B$ simulation replications and $K$ is the true number of clusters. We compare methods based on OE and IC.
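A minimal sketch of the CE computation in R for two assignment vectors; the function name is ours:

```r
# proportion of object pairs on which the estimated assignment and the
# truth disagree about co-membership
clustering_error <- function(est, truth) {
  pairs <- combn(length(est), 2)                      # all object pairs
  same_est <- est[pairs[1, ]] == est[pairs[2, ]]
  same_tru <- truth[pairs[1, ]] == truth[pairs[2, ]]
  mean(same_est != same_tru)
}

clustering_error(c(1, 1, 2, 2), c(1, 1, 1, 2))  # example: one object misplaced
```

OE in (24) is then 100 times the mean of `clustering_error` over the replications, and IC in (25) is 100 times the proportion of replications with the correct number of clusters.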
4.1. Regression models with a few explanatory variables
We generated data from regression models with $K$ clusters and two explanatory variables. This was treated as the implementation of our method under the low-dimensional setting. Each cluster had $m_0$ objects, and each object contained $n_0$ observations. We generated the two explanatory variables $x_{1ij}$ and $x_{2ij}$ independently. For each selected $\sigma$, $m_0$, and $K$, we generated the normal response from
$$ y_{ij} = \beta_{0i} + \beta_{1i}x_{1ij} + \beta_{2i}x_{2ij} + \epsilon_{ij} \tag{26} $$
for $i = 1,\dots,m$ and $j = 1,\dots,n_0$, where $\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ independently. We set $(\beta_{0i}, \beta_{1i}, \beta_{2i}) = (\beta_{0i'}, \beta_{1i'}, \beta_{2i'})$ if $O_i$ and $O_{i'}$ were in the same cluster in (26). If $K = 2$, we chose one set of values of $(\beta_{0i}, \beta_{1i}, \beta_{2i})$ when $O_i$ was in the first cluster and another set when $O_i$ was in the second cluster. If $K = 3$, we added one more cluster by choosing a third set of coefficient values for objects in the third cluster. Then, we obtained data from (26) with either $K = 2$ or $K = 3$ clusters.
We evaluated our method based on AIC and BIC with a comparison to the EM algorithm previously proposed by Qin and Self (2006). We implemented our AIC and BIC given by (18) by choosing $a_n = 2$ and $a_n = \log n$, respectively. The EM algorithm was implemented by the R package RegClust. To implement RegClust, we had to consider the saturated clustering problem, where no coefficients varied within clusters. We also considered two other competitors. The first was the usual $k$-means directly on regression coefficients. The second was the convex clustering (Chi and Lange, 2015) directly on regression coefficients. Following Tibshirani et al. (2001), we estimated the number of clusters for these competitors by maximizing the gap statistic.
Table 1 displays the simulation results for the percentage of numbers of clusters identified correctly by the EM algorithm, the $k$-means and convex clustering directly on regression coefficients, and our AIC and BIC selectors. Although it was also based on BIC for the number of clusters, in all of the simulations that we ran, we found that the number of clusters reported by the EM algorithm based on RegClust never exceeded two, implying that it could not identify the true number of clusters when $K = 3$. The performance of the $k$-means and convex clustering was slightly better than that of the EM algorithm, but it was not as good as our BIC selector. The true $K$ could be detected by our BIC but not by our AIC.
Table 1.
Percentage of numbers of clusters identified correctly (IC) based on 1000 simulation replications when data are generated from (26) with respect to $k$-means (K), convex clustering (Convex), the EM algorithm, and our AIC and BIC selectors in the generalized $k$-means.
| $\sigma$ | $m_0$ | $K$ | K | Convex | EM | AIC | BIC | K | Convex | EM | AIC | BIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 10 | 2 | 53.2 | 59.7 | 81.1 | 15.4 | 70.2 | 73.9 | 72.4 | 21.4 | ||
| 3 | 23.3 | 19.2 | 0.0 | 5.6 | 10.7 | 7.4 | 0.0 | 6.2 | ||||
| 20 | 2 | 85.1 | 87.2 | 75.1 | 0.5 | 92.3 | 91.7 | 75.2 | 0.3 | |||
| 3 | 6.7 | 5.6 | 0.0 | 0.0 | 1.7 | 1.5 | 0.0 | 0.0 | ||||
| 1.0 | 10 | 2 | 22.6 | 25.0 | 75.1 | 17.3 | 35.9 | 39.2 | 71.6 | 15.9 | ||
| 3 | 27.3 | 21.6 | 0.0 | 7.3 | 24.4 | 21.5 | 0.0 | 4.9 | ||||
| 20 | 2 | 52.4 | 52.7 | 74.2 | 0.5 | 69.5 | 71.5 | 72.2 | 0.3 | |||
| 3 | 15.6 | 13.0 | 0.0 | 0.3 | 15.1 | 14.0 | 0.0 | 0.0 | ||||
Table 2 displays the simulation results for the percentage of clustering object errors by the EM algorithm, the $k$-means and convex clustering directly on regression coefficients, and our BIC selector. We did not include AIC in the table because BIC was better. Our results show that our BIC selector was always better than its competitors: it was able to find the true number of clusters with lower clustering object errors. This is an advantage of our generalized $k$-means for regression models under the low-dimensional setting.
Table 2.
Percentage of clustering object errors (OE) based on 1000 simulation replications when data are generated from (26) with respect to $k$-means (K), convex clustering (Convex), the EM algorithm, and our BIC selector in the generalized $k$-means.
| $\sigma$ | $m_0$ | $K$ | K | Convex | EM | BIC | K | Convex | EM | BIC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 10 | 2 | 48.7 | 46.9 | 10.0 | 47.4 | 46.9 | 33.6 | ||
| 3 | 47.7 | 47.1 | 14.5 | 49.5 | 49.2 | 33.1 | ||||
| 20 | 2 | 49.7 | 49.5 | 12.8 | 49.5 | 49.4 | 33.1 | |||
| 3 | 49.8 | 49.7 | 12.8 | 50.1 | 50.0 | 31.8 | ||||
| 1.0 | 10 | 2 | 48.8 | 48.5 | 13.1 | 49.0 | 48.6 | 33.6 | ||
| 3 | 45.6 | 44.5 | 14.9 | 46.7 | 45.8 | 36.4 | ||||
| 20 | 2 | 49.8 | 49.6 | 13.3 | 49.7 | 49.6 | 34.5 | |||
| 3 | 48.4 | 47.8 | 14.3 | 49.0 | 48.7 | 36.5 | ||||
4.2. Regression models with many explanatory variables
We still generated data from regression models with $K$ clusters, but we increased the number of explanatory variables to $p$ such that the setting could reflect our method under the high-dimensional setting. We studied the unsaturated problem. We also chose $m_0$ objects in each cluster, and each object contained $n_0$ observations. We generated the $\ell$th explanatory variable $x_{\ell ij}$ independently from a common distribution. For each selected $\sigma$, $m_0$, and $K$, we generated the normal response from
$$ y_{ij} = \beta_{0i} + \sum_{\ell=1}^{p}\beta_{\ell i}x_{\ell ij} + \epsilon_{ij} \tag{27} $$
for $i = 1,\dots,m$ and $j = 1,\dots,n_0$, where $\epsilon_{ij} \sim \mathcal{N}(0,\sigma^2)$ independently. We generated $\beta_{0i}$ independently for each $O_i$. If $K = 2$, we chose one set of values of the clustered coefficients $\beta_{\ell i}$ when $O_i$ was in the first cluster and another set when $O_i$ was in the second cluster. If $K = 3$, we added one more cluster by choosing a third set of values for objects in the third cluster. We obtained data from (27) with $K = 2$ or $K = 3$ clusters.
We discarded our AIC and only used our BIC selector for the number of clusters (Table 3). We wanted to group the statistical models based on all of the regression coefficients except the intercepts (i.e., an unsaturated clustering problem). We could not use RegClust because it has not been formulated for unsaturated clustering problems yet. Therefore, we compared our method with the other two competitors: the $k$-means and the convex clustering directly on regression coefficients. We also included the $k$-means++ in our comparison because Step 1 in Algorithm 1 was motivated by the initialization of $k$-means++. Similar to Section 4.1, we estimated the number of clusters for the competitors by maximizing the gap statistic, applied to the $k$-means, the convex clustering, and the $k$-means++. Our results showed that all four methods were able to identify the number of clusters when $\sigma$ was small, but our BIC selector could still identify the number of clusters even when $\sigma$ was larger. This was because our method was formulated by the UMPU test, which is optimal in measuring the difference between statistical models. The performance of the convex clustering was better than that of the $k$-means and the $k$-means++, indicating that it is more appropriate than the other two methods in grouping regression models.
Table 3.
Percentage of numbers of clusters identified correctly (IC) based on 1000 simulation replications when data are generated from (27) with respect to $k$-means (K), convex clustering (Convex), and $k$-means++ (KPP) directly on regression coefficients based on the gap statistic, and our BIC selector in the generalized $k$-means.
| $\sigma$ | $m_0$ | $K$ | K | Convex | KPP | BIC | K | Convex | KPP | BIC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 10 | 2 | 92.6 | 98.5 | 88.2 | 96.5 | 93.9 | |||
| 3 | 72.3 | 99.2 | 90.2 | 72.4 | 99.9 | 93.1 | ||||
| 20 | 2 | 99.7 | 98.9 | 99.9 | ||||||
| 3 | 73.3 | 99.2 | 72.9 | 99.9 | ||||||
| 0.2 | 10 | 2 | 94.6 | 97.8 | 90.9 | 96.0 | 99.8 | 93.3 | ||
| 3 | 36.8 | 47.5 | 26.9 | 75.6 | 99.2 | 95.0 | ||||
| 20 | 2 | 99.7 | 99.0 | 99.9 | ||||||
| 3 | 24.6 | 24.4 | 10.6 | 77.2 | 99.8 | |||||
| 0.5 | 10 | 2 | 95.7 | 99.9 | 93.5 | 96.0 | 99.1 | 95.4 | ||
| 3 | 1.1 | 1.3 | 1.2 | 1.3 | 1.4 | 1.4 | ||||
| 20 | 2 | 99.9 | 99.7 | 99.9 | 99.8 | |||||
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||
| 1.0 | 10 | 2 | 96.2 | 93.0 | 97.7 | 95.8 | ||||
| 3 | 0.9 | 0.8 | 1.6 | 0.3 | 0.2 | 0.3 | ||||
| 20 | 2 | 99.6 | 99.6 | 99.9 | ||||||
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||
We also evaluated the percentage of clustering object errors (Table 4). We found that our method was again better than its competitors, as it had the lowest clustering object errors. The performance of the convex clustering was better than that of the $k$-means and the $k$-means++. Notice that many penalties could be used to determine the number of clusters, and the gap statistic is just one of them; examples can be found in Koepke and Clarke (2013). To know whether the performance of our competitors could be significantly improved if other penalties were adopted, we compared our BIC selector based on an unknown $K$ with our competitors based on a known $K$. In this case, the impact of the penalties was completely removed from our competitors. Our results (not shown) indicated that the percentage of clustering object errors was almost the same as that displayed in Table 4. This means that our method based on an unknown $K$ was better than our competitors based on a known $K$. Thus, our method can significantly enhance the precision and accuracy in grouping statistical models compared to our competitors. It is more appropriate to use our method than our competitors in grouping statistical models (see Table 5).
Table 4.
Percentage of clustering object errors (OE) based on 1000 simulation replications when data are generated from (27) with respect to $k$-means (K), convex clustering (Convex), and $k$-means++ (KPP) directly on regression coefficients, and our BIC selector in the generalized $k$-means.
| $\sigma$ | $m_0$ | $K$ | K | Convex | KPP | BIC | K | Convex | KPP | BIC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 10 | 2 | 0.4 | 0.1 | 0.6 | 0.2 | 0.3 | |||
| 3 | 1.4 | 0.2 | 1.5 | 0.1 | ||||||
| 20 | 2 | |||||||||
| 3 | 1.5 | 1.5 | ||||||||
| 0.2 | 10 | 2 | 0.3 | 0.1 | 0.5 | 0.2 | 0.3 | |||
| 3 | 14.0 | 12.4 | 16.1 | 1.4 | 0.1 | |||||
| 20 | 2 | |||||||||
| 3 | 17.5 | 17.6 | 20.6 | 1.5 | ||||||
| 0.5 | 10 | 2 | 10.8 | 10.3 | 10.4 | 0.9 | 0.6 | 0.9 | ||
| 3 | 33.1 | 32.6 | 32.2 | 26.0 | 26.8 | 25.7 | ||||
| 20 | 2 | 8.4 | 8.2 | 8.3 | 0.5 | 0.5 | 0.5 | |||
| 3 | 31.4 | 31.3 | 31.2 | 25.9 | 26.3 | 25.6 | ||||
| 1.0 | 10 | 2 | 41.1 | 40.2 | 40.0 | 22.8 | 21.7 | 20.5 | ||
| 3 | 46.8 | 45.7 | 46.4 | 38.9 | 38.1 | 38.0 | ||||
| 20 | 2 | 39.4 | 38.9 | 38.1 | 18.6 | 18.0 | 17.7 | |||
| 3 | 45.9 | 45.2 | 45.1 | 36.4 | 36.1 | 35.8 | ||||
Table 5.
Percentage of numbers of clusters identified correctly (IC) based on 1000 simulation replications when data are generated from a Euclidean space with respect to the $k$-means (K), the convex clustering (Convex), and the $k$-means++ (KPP) based on the gap statistic.
| $\sigma^2$ | $K$ | K | Convex | KPP | K | Convex | KPP |
|---|---|---|---|---|---|---|---|
| 0.001 | 2 | ||||||
| 3 | 62.1 | 89.4 | 60.2 | 91.7 | |||
| 4 | 44.9 | 80.4 | 44.5 | 85.7 | |||
| 0.002 | 2 | ||||||
| 3 | 62.0 | 88.9 | 59.6 | 93.4 | |||
| 4 | 46.2 | 85.0 | 48.9 | 86.7 | |||
| 0.005 | 2 | ||||||
| 3 | 65.3 | 89.7 | 64.5 | 93.5 | |||
| 4 | 45.3 | 83.9 | 47.8 | 85.7 | |||
| 0.01 | 2 | ||||||
| 3 | 67.0 | 89.6 | 67.1 | 90.7 | |||
| 4 | 46.9 | 83.3 | 48.4 | 84.1 | |||
To understand the importance of the UMPU test statistic, we compared the $k$-means, the convex clustering, and the $k$-means++ when they were applied to data from Euclidean spaces. In this case, we generated data from a Euclidean space with $K$ clusters. At the beginning, for each replication, we generated $K$ cluster centers by a uniform distribution. Then, for each cluster center, we generated points by the multivariate normal distribution with the mean vector equal to the cluster center and the variance–covariance matrix equal to $\sigma^2 I$. After that, we used the three methods to group the data with the number of clusters determined by the gap statistic. We found that, based on the gap statistic, all three methods could correctly identify the true number of clusters. The performance of the $k$-means++ was better than the other two because of its initialization. Because Step 1 in our algorithm is motivated by the $k$-means++, we conclude that this step can also increase the precision and accuracy compared to methods based on random initialization. Our result for the percentage of clustering object errors (Table 6) indicated that the three methods were all precise even if they did not find the correct number of clusters. To confirm this, we looked at the information contained in the additional clusters. We found that they were all small and did not significantly affect the results for the percentage of clustering object errors.
Table 6.
Percentage of clustering object errors (OE) based on 1000 simulation replications when data are generated from a Euclidean space with respect to the $k$-means (K), the convex clustering (Convex), and the $k$-means++ (KPP) based on the gap statistic.
| $\sigma^2$ | $K$ | K | Convex | KPP | K | Convex | KPP |
|---|---|---|---|---|---|---|---|
| 0.001 | 2 | ||||||
| 3 | 2.4 | 2.5 | |||||
| 4 | 4.6 | 0.3 | 4.2 | 0.4 | |||
| 0.002 | 2 | ||||||
| 3 | 2.4 | 2.6 | |||||
| 4 | 4.3 | 0.4 | 4.3 | 0.4 | |||
| 0.005 | 2 | ||||||
| 3 | 2.1 | 2.2 | 0.1 | ||||
| 4 | 4.1 | 0.3 | 4.5 | 0.4 | |||
| 0.01 | 2 | ||||||
| 3 | 2.0 | 0.1 | 2.0 | 0.5 | |||
| 4 | 4.1 | 0.4 | 4.3 | 0.4 | |||
In summary, our simulation shows that the $k$-means, the convex clustering, and the $k$-means++ are all precise and accurate in grouping data from Euclidean spaces, but our method is more precise and accurate than they are in grouping statistical models. This is because our method is formulated by the UMPU test, which is the best in measuring the difference between statistical models. Initialization of clustering methods is important: a good initialization can increase the precision and accuracy of the results.
4.3. Loglinear models
Similar to the regression models, we also chose $K \in \{2, 3\}$ clusters in loglinear models for Poisson data. Each cluster had $m_0$ objects, and each object contained $n_0$ observations. We generated the explanatory variables $x_{1ij}$ and $x_{2ij}$ independently. For each selected $\sigma$, $m_0$, and $K$, we independently generated the response $y_{ij}$ from $\mathrm{Poisson}(\mu_{ij})$ with
$$ \log(\mu_{ij}) = \beta_{0i} + \beta_{1i}x_{1ij} + \beta_{2i}x_{2ij} \tag{28} $$
for $i = 1,\dots,m$ and $j = 1,\dots,n_0$, where $\mu_{ij} = \mathrm{E}(y_{ij})$. We generated $\beta_{0i}$ independently from a normal distribution. We set $(\beta_{1i}, \beta_{2i})$ to one pair of values in the first cluster and another pair in the second cluster; this was used if $K = 2$. If $K = 3$, we chose a third pair in the third cluster. We evaluated our method based on AIC and BIC for the unsaturated clustering problem, where we allowed $\beta_{0i}$ to vary within clusters.
Table 7 displays the simulation results for the percentage of numbers of clusters identified correctly. We again found that the true $K$ could be identified by our BIC but not by our AIC. Table 8 displays the results for the percentage of clustering object errors based on BIC. It shows that the percentage of clustering object errors was still low, indicating that BIC can be used to find the correct number of clusters with a low error rate. Therefore, we recommend using BIC in our generalized $k$-means if the number of clusters is unknown.
Table 7.
Percentage of numbers of clusters identified correctly (IC) in loglinear models based on 1000 simulation replications when data are generated from (28).
| $\sigma$ | $m_0$ | AIC | BIC | AIC | BIC | AIC | BIC | AIC | BIC |
|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 10 | 1.6 | 0.3 | 1.6 | 0.4 | ||||
| 20 | 0.1 | 0.0 | 0.0 | 0.0 | |||||
| 1.0 | 10 | 2.1 | 1.2 | 1.3 | 0.6 | ||||
| 20 | 0.0 | 0.0 | 0.1 | 0.0 | |||||
Table 8.
BIC for the percentage of clustering object errors (OE) in loglinear models based on 1000 simulation replications with data generated from (28).
| $\sigma$ | $m_0$ | $K = 2$ | $K = 3$ | $K = 2$ | $K = 3$ |
|---|---|---|---|---|---|
| 0.5 | 10 | 1.3 | 0.6 | 0.8 | 0.4 |
| 20 | 4.4 | 2.1 | 3.0 | 1.5 | |
| 1.0 | 10 | 1.1 | 0.5 | 0.7 | 0.3 |
| 20 | 3.3 | 1.5 | 2.5 | 1.1 | |
5. Application
We implemented our method on the state-level daily COVID-19 data in the United States, reported by the United States Centers for Disease Control and Prevention (CDC). The data set contains confirmed disease counts, deaths, and recoveries, with the information updated every day. Data reported by the CDC are based on the most recent numbers reported by states and territories in the United States. COVID-19 can cause mild symptoms, which can induce delays in reporting and testing, leading to difficulties in reporting the exact numbers of COVID-19 cases. The accuracy of the data has been discussed by the CDC, which attempts to provide more accurate data by updating previous information. A detailed interpretation of the accuracy of the data can be found on the CDC website.
During the global pandemic of COVID-19, many countries in the Northern Hemisphere encountered dramatically increased cases and deaths after August 2020, leading to the appearance of the so-called second wave. Patients in the second wave were younger than those in the first wave, but the impact was still unclear (Iftimie et al., 2020). We applied our method to the data until July 31, 2020 to avoid this problem.
The outbreak of COVID-19 has become an ongoing worldwide pandemic since March 2020, and more than 200 countries and territories have been affected. The most seriously affected country is the United States. Until July 31, it had over 4.7 million confirmed cases and one hundred sixty thousand deaths; both were the highest in the world. After briefly looking at the patterns of the data (Fig. 2), we found that some of the curves were similar to each other (e.g., California and North Carolina) and some were far away from each other (e.g., California and Michigan). To address this issue, a straightforward approach is to group these curves by a clustering method, which can help us understand the connection of outbreaks between individual states. We found significant changes in the daily patterns before May 31 and after June 1. Two possible issues were identified based on social media. The first was the George Floyd issue, which occurred on May 25 in Minneapolis. The second was the economy reopening issue: most states reopened their economies or relaxed their restrictions for the prevention of the spread at the end of May.
Fig. 2.
Daily new cases of COVID-19 in the states of the mainland United States.
The first patient of COVID-19 appeared in Wuhan, China, on December 1, 2019. In late December, a cluster of pneumonia cases of unknown cause was reported by local health authorities in Wuhan, with clinical presentations greatly resembling viral pneumonia (Chen et al., 2020, Sun et al., 2020). Deep sequencing analysis from lower respiratory tract samples indicated a novel coronavirus (Feng, 2020, Huang et al., 2020). The virus of COVID-19 primarily spreads between people via respiratory droplets from breathing, coughing, and sneezing (World Health Organization (WHO), 2020). This can cause cluster infections in society. To avoid cluster infections, many countries imposed travel restrictions, which affected over 91% of the total population of the world, with roughly three billion people living in countries whose borders were completely closed to noncitizens and nonresidents (Pew Research Center, 2020).
Exponentially increasing trends are expected at the beginning of outbreaks of any infectious disease. This has been observed in the 2009 Influenza A (H1N1) pandemic (de Picoli et al., 2011) and the 2014 Ebola outbreak in West Africa (Hunt, 2014). Without any prevention efforts, the exponential trend would continue for a long time until a large portion of the population was infected. This trend can be changed by government prevention (Maier and Brockmann, 2020). This is the reason why we study the data until July 31, 2020.
To obtain an appropriate model, we investigate two candidate models, with the response chosen as the number of daily new cases and the explanatory variables as certain functions of time. The first is the exponential model given by
$$ \log(\mu_t) = \beta_0 + \beta_1(t - t_0), \tag{29} $$
where $t_0$ is the starting date, $t$ is the current date, $\mu_t = \mathrm{E}(y_t)$, and $y_t$ is the number of daily new cases observed on the current date. The second is the Gamma model given by
$$ \log(\mu_t) = \beta_0 + \beta_1\log(t - t_0) + \beta_2(t - t_0). \tag{30} $$
The Gamma model assumes that the expected number of daily new cases is proportional to the density of a Gamma distribution. If the second term is absent, then the Gamma model reduces to the exponential model, implying that (29) is a special case of (30).
In the case when $\beta_1 > 0$, if $\beta_2 > 0$, then the third term dominates the right-hand side of (30): the expected value of the response goes to infinity as time goes to infinity, leading to an exponentially increasing trend in the outbreak during the study period. If $\beta_2 < 0$, then the peak of the model is attained at $t - t_0 = -\beta_1/\beta_2$. An increasing trend is expected if $t - t_0 < -\beta_1/\beta_2$, and a decreasing trend is expected otherwise. Therefore, we can use the sign of $\beta_2$ to determine whether the outbreak is under control or not.
We chose $t_0$ as January 11, 2020 in both (29) and (30). We assumed that $y_t$ followed the quasi-Poisson model, such that we could fit the two models by the traditional loglinear model with the dispersion parameter estimated by (5). We assessed the two models by their $R^2$ values, where the $R^2$ value of a GLM was defined as one minus the residual deviance divided by the null deviance. We verified (29) and (30) by implementing them in eleven countries (Table 9), where the peak was estimated by $t_0 - \hat\beta_1/\hat\beta_2$ with $\hat\beta_1$ and $\hat\beta_2$ as the MLEs of $\beta_1$ and $\beta_2$ in the Gamma model. We found that the results given by the Gamma model were significantly better than those given by the exponential model.
Table 9.
Fitting results of the exponential and the Gamma models for the outbreak of COVID-19 in eleven selected countries between January 11 and May 31, 2020.
| Country | Exp. $\hat\beta_0$ | Exp. $\hat\beta_1$ | Exp. $R^2$ | Gamma $\hat\beta_0$ | Gamma $\hat\beta_1$ | Gamma $\hat\beta_2$ | Gamma $R^2$ | Peak |
|---|---|---|---|---|---|---|---|---|
| China | 7.91 | −0.032 | 0.368 | −9.6 | 7.77 | −0.290 | 0.813 | 02/07 |
| USA | 7.23 | 0.025 | 0.582 | −64.7 | 20.56 | −0.195 | 0.939 | 04/26 |
| Canada | 4.19 | 0.026 | 0.561 | −75.1 | 22.61 | −0.215 | 0.920 | 04/26 |
| Russia | 3.67 | 0.044 | 0.840 | −135.2 | 37.90 | −0.308 | 0.993 | 05/13 |
| Spain | 6.59 | 0.013 | 0.164 | −77.0 | 24.72 | −0.283 | 0.899 | 04/08 |
| UK | 5.53 | 0.023 | 0.446 | −82.9 | 25.25 | −0.248 | 0.857 | 04/22 |
| Italy | 6.66 | 0.010 | 0.110 | −58.0 | 19.53 | −0.238 | 0.945 | 04/03 |
| France | 6.26 | 0.012 | 0.096 | −96.9 | 30.47 | −0.353 | 0.694 | 04/07 |
| Germany | 6.33 | 0.011 | 0.103 | −83.7 | 26.80 | −0.317 | 0.862 | 04/05 |
| Switzerland | 4.90 | 0.006 | 0.030 | −116.0 | 36.50 | −0.463 | 0.853 | 03/30 |
| Sweden | 3.28 | 0.026 | 0.626 | −43.75 | 13.6 | −0.123 | 0.876 | 04/30 |
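A minimal sketch of this fit for a single series in R, assuming vectors `cases` (daily new counts) and `day` (elapsed days $t - t_0 > 0$); the function name and the internal details are ours, while the peak formula $-\hat\beta_1/\hat\beta_2$ and the pseudo-$R^2$ are as described above:

```r
fit_gamma_model <- function(cases, day, t0 = as.Date("2020-01-11")) {
  # quasi-Poisson loglinear fit of (30); `day` must be positive
  fit <- glm(cases ~ log(day) + day, family = quasipoisson(link = "log"))
  b <- coef(fit)
  # peak at t - t0 = -beta1/beta2 when beta2 < 0; out of control otherwise
  peak <- if (b["day"] < 0) t0 + unname(-b["log(day)"] / b["day"]) else NA
  r2 <- 1 - fit$deviance / fit$null.deviance  # one minus resid./null deviance
  list(coef = b, peak = peak, R2 = r2)
}
```

For the USA row of Table 9, for example, $-\hat\beta_1/\hat\beta_2 = 20.56/0.195 \approx 105$ days after January 11, which lands on April 26.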
We used our generalized $k$-means to group models for the fifty states and Washington DC. We modified the model given by (30) as
$$ \log(\mu_{it}) = \beta_{0i} + \gamma_{1k}\log(t - t_0) + \gamma_{2k}(t - t_0) \quad \text{if state } i \text{ is in cluster } C_k, \tag{31} $$
where $\mu_{it} = \mathrm{E}(y_{it})$, $y_{it}$ was the number of daily new cases from the $i$th state on the $t$th date, and $\gamma_{1k}$ and $\gamma_{2k}$ were the coefficients given by the $k$th cluster.
Because we allowed the intercepts $\beta_{0i}$ to differ within clusters, we were able to account for many state-level variables simultaneously by $\beta_{0i}$ alone in (31). For instance, if the population sizes of two states are different but we conclude that they belong to the same cluster, then the impact of the population sizes can be completely accounted for by $\beta_{0i}$ in (31). Using this idea, we can account for the combined effects of governmental restrictions, policies, population densities, and population demographics by $\beta_{0i}$ alone in (31), and this is the advantage of the unsaturated clustering method used in the data analysis. It is not necessary to develop additional statistical models to account for their separate effects in the clustering analysis.
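At a fixed cluster assignment, (31) can be expressed as a single quasi-Poisson fit. A minimal sketch, assuming a hypothetical data frame `covid` with columns `cases`, `state` (factor), `cluster` (factor), and `day` (= $t - t_0$):

```r
# state-specific intercepts absorb state-level effects; cluster-specific
# slopes for log(day) and day play the roles of gamma_{1k} and gamma_{2k}
fit <- glm(cases ~ state + cluster:log(day) + cluster:day - 1,
           family = quasipoisson(link = "log"), data = covid)
```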
After briefly looking at the data, we found that many of the daily new cases were zero in January and February and that the United States had only a small total number of confirmed cases until February 24. We decided to exclude data before February 24 in our analysis. We then applied (31) to the data between February 24 and May 31 and the data between February 24 and July 31, respectively. We looked at their differences because we wanted to know the impact of the two issues mentioned at the beginning of this section. Both AIC and BIC showed that there were six clusters in the data (Fig. 3). We then calculated the cluster maps based on the identified clusters (Fig. 4). To compare, we also directly used $k$-means to group the estimates of the regression coefficients given by (31) fitted state by state (i.e., based on the estimated coefficients of $\log(t - t_0)$ and $t - t_0$ for the $i$th state). We found that we were not able to identify the number of clusters based on the gap statistic (Fig. 5). This means that it is hard to use $k$-means to group statistical models for the patterns of COVID-19 data in the United States. Similar issues also appeared in convex clustering directly on regression coefficients. Our generalized $k$-means can overcome the difficulty because it is more powerful than methods directly on regression coefficients.
Fig. 3.
AIC and BIC for the number of clusters in the generalized $k$-means based on (30).
Fig. 4.
Six clusters identified by BIC in the generalized $k$-means for the period between February 24 and May 31 (left) and the period between February 24 and July 31 (right), respectively.
Fig. 5.
Gap statistics for the number of clusters in $k$-means on regression coefficients.
To verify our result, we examined three models. The first was the main effect model, which had only one cluster in (31). The second was the resulting (31) with six clusters. The third was the interaction effect model, which assumed that each state was an individual cluster in (31). We calculated the differences of residual deviance between the first and second models and between the first and third models, respectively. We obtained the partial $R^2$ as the ratio of the two differences. The partial $R^2$ value interpreted the proportion of the residual deviance reduction achieved by the model with six clusters. The partial $R^2$ was 0.9235 for the data between February 24 and May 31, and 0.9606 for the data between February 24 and July 31, implying that the model with six clusters was good enough to interpret the differences among the states and Washington DC.
We evaluated the properties of the identified clusters by the MLEs of $\gamma_{1k}$ and $\gamma_{2k}$ with $k = 1,\dots,6$ in (31) (Table 10). These coefficients were directly reported by the generalized $k$-means. We found that the situation in the entire United States was under control before May 31, as the signs of $\hat\gamma_{2k}$ were all negative. For the data between February 24 and July 31, the situations in the states contained in the first, the fourth, and the sixth clusters became worse, as they were out of control. The situations in the states contained in the second, the third, and the fifth clusters were still under control. This was likely caused by many people not keeping social distancing or stay-at-home orders during the summer in these states (according to social media).
Table 10.
Estimated peak dates in the six clusters with a selected state (State) for each cluster based on the Gamma model for the outbreak of COVID-19 in the United States, where the standard errors (in days) are given in parentheses and × means out of control.
| Cluster | State | Peak (02/24–05/31) | Peak (02/24–07/31) |
|---|---|---|---|
| 1 | California | 5/10(2.27) | × | ||||
| 2 | New York | 4/3(0.28) | 4/12 | ||||
| 3 | Illinois | 4/28(0.81) | 5/11 | ||||
| 4 | Louisiana | 4/8(0.39) | × | ||||
| 5 | Minnesota | 5/17(7.3) | 6/5 | ||||
| 6 | Florida | 4/26(0.42) | × | ||||
6. Discussion
We propose a new clustering method under the framework of the generalized $k$-means to group GLMs for exponential family distributions. The method can automatically select the number of clusters if it is combined with GIC. Our theoretical and simulation results show that the number of clusters can be identified by BIC but not by AIC. Therefore, we recommend using BIC to find the number of clusters. As the choice of the dissimilarity measure is flexible, our method can be extended to other models beyond GLMs. We implemented our method to partition loglinear models for the state-level COVID-19 data until July 31, 2020, in the United States, and we identified six clusters. In Fall 2020, the situation in the United States and many European countries became worse and the outbreaks went out of control. An analysis of that period is left to future research.
Basically, our generalized k-means can be treated as a modification of k-means++ in which the Euclidean distance measuring dissimilarity between points is replaced by the likelihood ratio statistic measuring dissimilarity between statistical models. The basic approach in our method can be migrated to any existing clustering method, including convex clustering, so that the modified method can be used to group statistical models. The idea is simply to replace the distance measure with a UMPU test statistic, as sketched below. Therefore, the impact of our research is not limited to generalizations of k-means or k-means++.
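To illustrate how little of the algorithm depends on the distance, the sketch below implements k-means++-style seeding with a pluggable dissimilarity. It is a generic sketch under our own naming (`generalized_kmeanspp_seed`, `dissim`), not the paper's implementation; passing squared Euclidean distance recovers standard k-means++ seeding, while passing a likelihood-ratio statistic between fitted models yields the seeding step of the generalized method.

```python
# Sketch: k-means++ seeding with an arbitrary dissimilarity d(object, center).
# Replacing d changes nothing in the seeding logic itself.
import random

def generalized_kmeanspp_seed(objects, k, dissim, rng=random.Random(0)):
    """Pick k initial centers; each new center is drawn with probability
    proportional to its dissimilarity to the nearest existing center."""
    centers = [rng.choice(objects)]
    while len(centers) < k:
        # distance of every object to its closest current center
        weights = [min(dissim(o, c) for c in centers) for o in objects]
        total = sum(weights)
        if total == 0:  # all remaining objects coincide with a center
            centers.append(rng.choice(objects))
            continue
        r, acc = rng.uniform(0, total), 0.0
        for o, w in zip(objects, weights):
            acc += w
            if acc >= r:
                centers.append(o)
                break
    return centers

# With squared Euclidean dissimilarity this reduces to standard k-means++:
pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
d2 = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
print(generalized_kmeanspp_seed(pts, 3, d2))
```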
Theoretically, generalized k-means can also be combined with the penalized likelihood approach. This is useful when the number of explanatory variables exceeds the number of observations within objects. As it is then impossible to estimate all the regression coefficients by maximum likelihood alone, a variable selection procedure is needed to reduce the number of explanatory variables before clustering. This is also left to future research.
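A minimal sketch of this idea, with the usual L1 penalty standing in for a penalized GLM likelihood; the simulated data and the penalty level `alpha=0.1` are our own choices for illustration.

```python
# Sketch: L1-penalized estimation when p > n, so that only a few
# coefficients survive and the reduced models can then be clustered.
# All data below are simulated for illustration only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 40, 200                       # more variables than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # only three active variables
y = X @ beta + rng.normal(scale=0.5, size=n)

fit = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(fit.coef_))     # indices of the selected variables
```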
Acknowledgments
The authors appreciate the great comments from the Associate Editor and two anonymous referees, which significantly improved the quality of the article.
References
- Arthur D., Vassilvitskii S. k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics; Philadelphia, PA, USA: 2007; pp. 1027–1035.
- Bai Z., Choi K.P., Fujikoshi Y. Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis. Ann. Statist. 2018;46:1050–1076.
- Bock H. Origins and extensions of the k-means algorithm in cluster analysis. Electron. J. Hist. Probab. Stat. 2008;4: Article 14.
- Charikar M., Guha S. A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 2002;65:129–149.
- Chen Y., Iyengar R., Iyengar G. Modeling multimodal continuous heterogeneity in conjoint analysis – a sparse learning approach. Mark. Sci. 2016;36:140–156.
- Chen N., Zhou M., Dong X., Qu J., Gong F., Han Y., Qiu Y., Wang J., Liu Y., Wei Y., Xia J., Yu T., Zhang X., Zhang L. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. Lancet. 2020;395:507–513. doi: 10.1016/S0140-6736(20)30211-7.
- Chi E.C., Lange K. Splitting methods for convex clustering. J. Comput. Graph. Statist. 2015;24:994–1013. doi: 10.1080/10618600.2014.948181.
- Donoho D., Jin J. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 2004;32:962–994.
- Du Q., Wong T.W. Numerical studies for MacQueen's k-means algorithms for computing the centroidal Voronoi tessellations. Comput. Math. Appl. 2002;44:511–523.
- Fan J., Guo S., Hao N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B. 2012;74:37–55. doi: 10.1111/j.1467-9868.2011.01005.x.
- Feng Z. Urgent research agenda for the novel coronavirus epidemic: transmission and non-pharmaceutical mitigation strategies. Chin. J. Epidemiol. 2020;41:135–138. doi: 10.3760/cma.j.issn.0254-6450.2020.02.001.
- Ferguson T.S. A Course in Large Sample Theory. CRC Press; New York: 1996.
- Forgy E.W. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics. 1965;21:768–769.
- Fraley C., Raftery A.E. Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 2002;97:611–631.
- Gonzalez T.F. Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci. 1985;38:293–306.
- Goyal M., Aggarwal S. A review on k-mode clustering algorithm. Int. J. Adv. Res. Comput. Sci. 2017;8:725–729.
- Green P.J. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. R. Stat. Soc. Ser. B. 1984;46:149–192.
- Hartigan J.A., Wong M.A. A k-means clustering algorithm. Appl. Stat. 1979;28:100–108.
- Hocking T.D., Joulin A., Bach F., Vert J.P. Clusterpath: an algorithm for clustering using convex fusion penalties. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011); 2011; pp. 745–752.
- Huang C., Wang Y., Li X., Ren L., Zhao J., Hu Y., Zhang L., Fan G., Xu J., Gu X., Cheng Z., Yu T., Xia J., Wei Y., Wu W., Xie X., Yin W., Li H., Liu M., Xiao Y., Gao H., Guo L., Xie J., Wang G., Jiang R., Gao Z., Jin Q., Wang J., Cao B. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395:497–506. doi: 10.1016/S0140-6736(20)30183-5.
- Hunt A.G. Exponential growth in Ebola outbreak since May 14, 2014. Complexity. 2014;20:8–11.
- Iftimie S., López-Azcona A.F., Vallverdú I., Hernández-Flix S., de Febrer G., Parra S., Hernández-Aguilera A., Riu F., Joven J., Camps J., Castro A. First and second waves of coronavirus disease-19: a comparative study in hospitalized patients in Reus, Spain. medRxiv; 2020.
- Johnson R.A., Wichern D.W. Applied Multivariate Statistical Analysis. Prentice Hall; New Jersey: 2002.
- Koepke H., Clarke B. A Bayesian criterion for cluster stability. Stat. Anal. Data Min. 2013;6:346–374.
- Kriegel H.P., Kröger P., Sander J., Zimek A. Density-based clustering. WIREs Data Min. Knowl. Discov. 2011;1:231–240.
- Lau J.W., Green P.J. Bayesian model-based clustering procedures. J. Comput. Graph. Statist. 2007;16:526–558.
- Lindsten F., Ohlsson H., Ljung L. Just Relax and Come Clustering! A Convexification of k-Means Clustering. Technical Report, Linköpings Universitet; 2011.
- Lloyd S.P. Least squares quantization in PCM. IEEE Trans. Inform. Theory. 1982;28:128–137.
- MacQueen J.B. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; 1967; pp. 281–297.
- Maier B.F., Brockmann D. Effective containment explains subexponential growth in recent confirmed COVID-19 cases in China. Science. 2020. doi: 10.1126/science.abb4557.
- McCullagh P. Quasi-likelihood functions. Ann. Statist. 1983;11:59–67.
- Pew Research Center. More than nine-in-ten people worldwide live in countries with travel restrictions amid COVID-19. 2020. https://www.pewresearch.org/fact-tank/2020/04/01
- de Picoli S., Teixeira J.J., Ribeiro H.V., Malacarne L.C., dos Santos R.P., dos Santos Mendes R. Spreading patterns of the influenza A (H1N1) pandemic. PLoS One. 2011;6. doi: 10.1371/journal.pone.0017823.
- Qin L.X., Self S.G. The clustering of regression models method with applications in gene expression data. Biometrics. 2006;62:526–533. doi: 10.1111/j.1541-0420.2005.00498.x.
- Soheily-Khah S., Douzal-Chouakria A., Gaussier E. Generalized k-means-based clustering for temporal data under weighted and kernel time warp. Pattern Recognit. Lett. 2016;75:63–69.
- Sun K., Chen J., Viboud C. Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. Lancet Digit. Health. 2020. doi: 10.1016/S2589-7500(20)30026-1.
- Tibshirani R., Walther G., Hastie T. Estimating the number of clusters in a dataset via the gap statistic. J. R. Stat. Soc. Ser. B Stat. Methodol. 2001;63:411–423.
- Trauwaert E., Kaufman L., Rousseeuw P. Fuzzy clustering algorithms based on the maximum likelihood principle. Fuzzy Sets and Systems. 1991;42:213–227.
- van der Vaart A.W. Asymptotic Statistics. Cambridge University Press; Cambridge, UK: 1998.
- Wang J. Consistent selection of the number of clusters via cross validation. Biometrika. 2010;97:893–904.
- World Health Organization (WHO). Getting your workplace ready for COVID-19. February 27, 2020.
- Zhang Y., Li R., Tsai C. Regularization parameter selections via generalized information criterion. J. Amer. Statist. Assoc. 2010;105:312–323. doi: 10.1198/jasa.2009.tm08013.
- Zhao Y., Karypis G. Hierarchical clustering algorithms for document datasets. Data Min. Knowl. Discov. 2005;10:141–168.