Abstract
Longitudinal data clustering is challenging because the grouping has to account for the similarity of individual trajectories in the presence of sparse and irregular observation times. This paper puts forward a hierarchical agglomerative clustering method based on a dissimilarity metric that quantifies the cost of merging two distinct groups of curves, with the repeatedly measured data represented by B-splines. Extensive simulations show that the proposed method has superior performance in determining the number of clusters, in classifying individuals into the correct clusters, and in computational efficiency. Importantly, the method is suitable not only for clustering multivariate longitudinal data with sparse and irregular measurements but also for intensely measured functional data. To facilitate such analyses, we provide an R package implementing the method. To illustrate the use of the proposed clustering method, two large data sets from real-world clinical studies are analyzed.
Keywords: B-splines, Dissimilarity metric, Functional Data, Longitudinal data, Multiple outcomes
1. Introduction
Biomedical studies often use multiple markers collected over time to depict the evolution of biological processes. In the practice of precision medicine, clustering such data helps divide patients into homogeneous groups so that targeted treatment can be applied to maximize therapeutic benefits (Paulsen et al., 2019).
Clustering the time course of observed data is never an easy task because data proximity at selected time points does not guarantee the closeness of temporal trajectories. Most clustering methods that emerged in the last two decades originated from functional data analysis, where observations are generated from stochastic processes and individual trajectories are characterized by functional curves (Ramsay and Silverman, 2005). In a review article, Jacques and Preda (2014a) classified the existing clustering algorithms for functional data into four broad categories: raw-data methods, filtering methods, adaptive methods, and distance-based methods. Some of these methods have used the K-means clustering technique on B-spline coefficients for individual curves (Abraham et al., 2003), while others have used random-effects models for the spline coefficients derived from individual curves (James and Sugar, 2003), wavelets (Giacofci et al., 2013), and functional principal component scores (Bouveyron and Jacques, 2011; Jacques and Preda, 2014b; Schmutz et al., 2020). These methods were mainly developed for clustering univariate longitudinal or functional outcomes. Clustering methods for multivariate data are few; notable works include Tokushige et al. (2007); Ieva et al. (2013); Genolini et al. (2015); Schmutz et al. (2020).
Longitudinal data present a different set of challenges for clustering analysis because the data tend to be more sparsely and often irregularly measured (Rice, 2004). For this reason, methods developed for functional data are not always applicable to longitudinal data. Some K-means-based algorithms, however, can be extended to longitudinal data. For example, Genolini and Falissard (2010, 2011) proposed a K-means method for longitudinal data (KmL) that measures the pairwise Euclidean distance between two trajectories when data are assessed regularly and at the same time points. One could also perform K-means clustering analyses on the coefficients of B-spline basis functions (Abraham et al., 2003; Garcia-Escudero and Gordaliza, 2005), Fourier basis functions (Serban and Wasserman, 2005), and wavelet basis functions (Giacofci et al., 2013), as long as the individual curve fitting is possible. Such K-means-based algorithms alleviate the constraints on observational regularity required by KmL but have reduced computational efficiency.
For sparse and irregularly measured longitudinal data, model-based methods are good alternatives to the K-means methods, especially considering the maturity of longitudinal modeling. Many clustering methods have been developed based on mixture models (Jones et al., 2001; De la Cruz-Mesía et al., 2008; McNicholas and Murphy, 2010; McNicholas and Subedi, 2012; Chamroukhi, 2016; Chamroukhi and Nguyen, 2019; Jacques and Preda, 2014b; Schmutz et al., 2020). With correctly specified models, these methods usually have respectable performance, even in sparsely observed data situations, because the model parameters are estimated with aggregated data from all subjects. A significant issue with the model-based methods, however, is the lack of verifiability of the model assumptions (Biernacki et al., 2000). Several authors have used mixture models with regression splines to reduce the risk of model misspecification (James and Sugar, 2003; Luan and Li, 2003; Coffey et al., 2014). Model-based methods also tend to have a heavier computational burden because different mixture models must be explored to determine the optimal number of clusters by the Bayesian Information Criterion (BIC) (Schwarz et al., 1978). As an improvement, Chamroukhi (2016) adopted a robust EM algorithm (Yang et al., 2012) to automatically ascertain the number of mixtures of regression models. More recently, Zhu et al. (2018) proposed a B-spline-based penalized regression method that estimates individual-specific spline coefficients and selects the number of clusters simultaneously using LASSO. Though the approach is theoretically elegant, its performance deteriorates quickly with an increasing sample size, as we shall demonstrate in Section 3.1.
In this paper, we put forward a new clustering method for longitudinal data with sparse and irregular observations. The method falls into the broad category of hierarchical agglomerative algorithms. In other words, it is a “bottom-up” procedure that merges groups of trajectories as it moves up the hierarchy whenever the “cost” of merging the two groups is low (Ward, 1963). We show that the proposed method exhibits excellent clustering accuracy, is insensitive to the underlying cluster sizes, and is computationally efficient. The method can easily handle multiple-outcome longitudinal data, including outcomes that are not readily distinguishable among clusters. Viewed as a whole, we believe that the proposed method provides a reliable and efficient clustering tool for longitudinal data.
The rest of the paper is organized as follows: In Section 2, we describe the proposed clustering method and indices for determining the number of clusters. In Section 3, we present numerical results from an extensive simulation study; the performance of the proposed method is compared with competing methods. In Section 4, we present two real-data applications to demonstrate the use of the new clustering method. We end the paper in Section 5 with a few remarks. We present the technical details and additional simulation results as Supplementary Materials.
2. Methods
We propose a new clustering method for multivariate longitudinal data by carrying out the grouping in a hierarchical fashion (Ward, 1963). The method uses a dissimilarity metric that is structurally analogous to the classic Chow test statistic (Chow, 1960) and reflects the “merging cost” of combining two clusters. We shall demonstrate the good operating characteristics of the proposed method.
2.1. Modeling and Estimating Longitudinal Trajectories
We first introduce the notation for describing the temporal trajectories of longitudinally measured outcomes.
Let $\{(\mathbf{t}_i, \mathbf{Y}_i) : i = 1, \dots, N\}$ be a collection of observed data from $N$ subjects, where the $i$th subject has $n_i$ observations $\mathbf{Y}_i = (Y_{i1}, \dots, Y_{in_i})^{\mathrm{T}}$, made at times $\mathbf{t}_i = (t_{i1}, \dots, t_{in_i})^{\mathrm{T}}$. The observation times are subject-specific and they do not have to be common across subjects.
Suppose that the longitudinal trajectory for the ith subject can be described by the following model
$$Y_{ij} = f_i(t_{ij}) + e_i(t_{ij}), \qquad j = 1, \dots, n_i, \tag{1}$$
where fi(·) is the fixed component, and ei(·) is the random component with mean zero.
To retain the maximal flexibility, we avoid specifying a parametric functional form for fi(·). Instead, we use B-splines with p basis functions to approximate fi(·) for i = 1, 2, …, N and conduct regression spline estimation for fi(·) using the observed data.
Letting Xij = [B1(tij), B2(tij), …, Bp(tij)] be the row vector of the B-spline basis functions evaluated at tij, we fit (1) by rewriting the model as
$$Y_{ij} = X_{ij}\beta_i + e_i(t_{ij}), \tag{2}$$
where βi = [βi1, …, βip]T are the B-spline coefficients for fi(·). Writing $\mathbf{Y}_i = [Y_{i1}, \dots, Y_{in_i}]^{\mathrm{T}}$, $\mathbf{X}_i = [X_{i1}^{\mathrm{T}}, \dots, X_{in_i}^{\mathrm{T}}]^{\mathrm{T}}$, and $\mathbf{e}_i = [e_i(t_{i1}), \dots, e_i(t_{in_i})]^{\mathrm{T}}$, we express model (2) as
$$\mathbf{Y}_i = \mathbf{X}_i \beta_i + \mathbf{e}_i, \tag{3}$$
where the random component ei satisfies (i) E[ei] = 0; (ii) Var[ei] = Gi(ti); and (iii) ei is independent of ej for i ≠ j, where Gi(ti) is the unspecified variance-covariance matrix at observation times ti for subject i. The B-spline coefficients βi are estimated by the ordinary least-squares method.
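For illustration, the subject-level spline fit amounts to ordinary least squares on a B-spline design matrix. The sketch below is a minimal example and not the clusterMLD implementation; the knot locations and the simulated subject are illustrative only.

```r
library(splines)

# Minimal sketch (not the clusterMLD implementation): ordinary least-squares
# fit of one subject's trajectory on a cubic B-spline basis.
# 'times' and 'y' are the subject's observation times and outcomes;
# the interior knot locations below are illustrative.
fit_spline_ols <- function(times, y, knots, degree = 3) {
  X <- bs(times, knots = knots, degree = degree, intercept = TRUE)  # basis matrix X_i
  fit <- lm.fit(x = X, y = y)
  list(beta   = fit$coefficients,          # estimated B-spline coefficients
       fitted = fit$fitted.values,
       ssr    = sum(fit$residuals^2))      # sum of squared residuals
}

# Example with one simulated subject
set.seed(1)
times <- sort(runif(10))
y     <- sin(2 * pi * times) + rnorm(10, sd = 0.2)
fit_i <- fit_spline_ols(times, y, knots = c(0.25, 0.5, 0.75))
```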
2.2. Cost of Merging Two Subgroups
Clustering is a process of partitioning an index set c = {1, 2, …, N} into mutually exclusive subsets {ck : k = 1, 2, …, K}, where
$$\bigcup_{k=1}^{K} c_k = c \quad \text{and} \quad c_k \cap c_{k'} = \emptyset \ \text{for } k \neq k',$$
such that subjects in the same set are homogeneous in some sense.
In longitudinal data clustering, we seek partitions where the subjects in cluster ck share a common longitudinal trajectory, i.e., they have the same fixed components. Here, we seek to identify ck for k = 1, 2, ⋯, K, such that $f_i(\cdot) = f_{(k)}(\cdot)$ for all $i \in c_k$, where $f_{(k)}(\cdot)$ is the common trajectory of cluster ck. This process requires fitting the fixed component given in (1).
We denote the observed data associated with cluster ck as 𝒞k, i.e., $\mathcal{C}_k = \{(\mathbf{t}_i, \mathbf{Y}_i) : i \in c_k\}$. Hence, $\{\mathcal{C}_k : k = 1, \dots, K\}$ forms a partition of the observed data. Similarly, for two clusters cu and cv, we merge their observations and let 𝒞u,v be the union of 𝒞u and 𝒞v, that is, 𝒞u,v = {𝒞u, 𝒞v}.
To determine the appropriateness of clustering, one needs a metric to quantify the cost of combining 𝒞u and 𝒞v. When the cost is low, the two subsets could be combined with a shared trajectory. When the cost is high, combining the subsets would lead to a larger error in the combined model and thus should be avoided. Errors associated with statistical models are typically described by sum of squared residuals (SSR). We, therefore, contend that the following ratio of SSRs under the separate and combined models gives a quantification for the merging cost:
$$\mathcal{D}(\mathcal{C}_u, \mathcal{C}_v) = \frac{SSR(\mathcal{C}_{u,v}) - SSR(\mathcal{C}_u) - SSR(\mathcal{C}_v)}{SSR(\mathcal{C}_u) + SSR(\mathcal{C}_v)}, \tag{4}$$
where
$$SSR(\mathcal{C}_k) = \sum_{i \in c_k} \sum_{j=1}^{n_i} \left(Y_{ij} - \hat{Y}_{ij}\right)^2.$$
In the calculation, $\hat{Y}_{ij} = X_{ij}\hat{\beta}_k$, where $\hat{\beta}_k$ is the ordinary least-squares estimate for βk, the identical B-spline coefficients for all i ∈ ck, using all the data in 𝒞k.
Alternatively, 𝒟(𝒞u, 𝒞v) can also be viewed as a metric of dissimilarity between two sets of longitudinal data, 𝒞u and 𝒞v. A larger 𝒟(𝒞u, 𝒞v) value indicates greater dissimilarity between 𝒞u and 𝒞v and thus provides stronger evidence against merging the two subsets.
Remark 1. The numerator of 𝒟(·, ·) is conceptually similar to the “merging cost” proposed by Ward (1963) under the classical linear models. We note that for any 𝒞u, 𝒞v, 𝒟(𝒞u, 𝒞v) ≥ 0, meaning that merging actions would not incur negative costs. See Proposition 1 in Web Supplementary Materials.
Remark 2. We also note that the structure of 𝒟(·, ·) is parallel to that of the Chow-test statistic (Chow, 1960), which was originally proposed to test whether two datasets are generated from a common linear model. For our purpose, two subsets with the smallest 𝒟(·, ·) value among all possible pairs of subsets are least likely to be rejected for homogeneity. Hence sequentially merging two subsets with the smallest 𝒟(·, ·) value gives rise to a hierarchical agglomerative clustering algorithm.
Remark 3. Metric 𝒟(·, ·) is readily extendable to situations of longitudinal clustering with multiple outcomes: A weighted sum of 𝒟(·, ·) from different outcomes represents the overall merging cost
$$\mathcal{D}(\mathcal{C}_u, \mathcal{C}_v) = \sum_{h} W_h\, \mathcal{D}\!\left(\mathcal{C}_u^{(h)}, \mathcal{C}_v^{(h)}\right),$$
where Wh is the standardized weight for outcome h, and $\mathcal{C}_k^{(h)}$ contains the data in cluster k for outcome h.
To maintain computing efficiency, we propose an ad-hoc method to determine the weights by expanding the Chow statistic defined above to contrast the full data with the bottom-level clusters,
$$\mathcal{D}^{(h)} = \frac{SSR(\mathcal{C}^{(h)}) - \sum_{i=1}^{N} SSR(\mathcal{C}_i^{(h)})}{\sum_{i=1}^{N} SSR(\mathcal{C}_i^{(h)})}, \tag{5}$$
where 𝒞(h) includes all subjects, and $\mathcal{C}_i^{(h)}$ is the “bottom-level cluster” that contains only observations from the ith subject, as long as the data allow for fitting an individual-specific spline for calculating $SSR(\mathcal{C}_i^{(h)})$. Some data preprocessing steps are needed when data from an individual are not sufficient to fit the spline model, as we shall illustrate in Section 2.4. The outcome-specific weight Wh can then be chosen as $W_h = \mathcal{D}^{(h)} / \sum_{h'} \mathcal{D}^{(h')}$.
Remark 4. In our data examples for multivariate clustering, we used the same number of spline basis functions (p) that were defined at the same knots based on the locations of observation times for numerical convenience. The proposed method, however, does not require the same knots to be used across outcomes. In fact, when observation times and/or the number of observations are drastically different across the outcomes, outcome-specific spline fittings would be recommended.
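To make the computation concrete, the sketch below evaluates the single-outcome merging cost (4) by stacking the observations of the two subclusters for the pooled fit. It reuses the fit_spline_ols() helper from the sketch in Section 2.1 and is an illustration only, not the package code; for multiple outcomes the same computation would be repeated per outcome and combined with the weights Wh as in Remark 3.

```r
# Sketch of the single-outcome merging cost D in (4): the relative increase in
# SSR when two subclusters are fitted with one shared spline rather than
# separately. Reuses fit_spline_ols() from the earlier sketch.
merge_cost <- function(times_u, y_u, times_v, y_v, knots) {
  ssr_u  <- fit_spline_ols(times_u, y_u, knots)$ssr
  ssr_v  <- fit_spline_ols(times_v, y_v, knots)$ssr
  ssr_uv <- fit_spline_ols(c(times_u, times_v), c(y_u, y_v), knots)$ssr
  (ssr_uv - ssr_u - ssr_v) / (ssr_u + ssr_v)   # nonnegative merging cost
}
```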
2.3. Determination of the Number of Clusters
Choosing an optimal number of clusters is an essential component of all clustering methods. Since there is no universally accepted criterion to define the “optimality” of data partitioning, solutions tend to be problem-specific. Two approaches are frequently used in the literature. One is to optimize the difference between within-cluster (or intra-cluster) and between-cluster (or inter-cluster) dissimilarity, as done in the CH index (Caliński and Harabasz, 1974) and the Silhouette statistic (Rousseeuw, 1987). The other is to maximize the “gap” between the within-cluster dissimilarity and its expectation under the hypothesis that the data are fully homogeneous (Tibshirani et al., 2001).
In this work, we explore both approaches in the context of longitudinal data clustering. For a given partition of K clusters, the within-cluster dissimilarity can be defined as
$$W(K) = \sum_{k=1}^{K} SSR(\mathcal{C}_k),$$
and the between-cluster dissimilarity as
$$B(K) = \sum_{k=1}^{K} \sum_{i \in c_k} \sum_{j=1}^{n_i} \left(\hat{Y}_{ij} - \bar{Y}_{ij}\right)^2,$$
where $\bar{Y}_{ij} = X_{ij}\hat{\beta}$ is the predicted outcome using all data, for which $\hat{\beta}$ is the B-spline estimator of the population mean function. The CH index at K partitions is defined, therefore, by
$$CH(K) = \frac{B(K)/(K-1)}{W(K)/(N-K)}$$
for longitudinal data. The optimal number of clusters corresponds to the maximum value of the CH index, i.e., $\hat{K} = \arg\max_{K \geq 2} CH(K)$.
Similarly, we propose a new Gap statistic using the between-cluster dissimilarity, defined as
$$B_b(K) = \sum_{k=1}^{K} \mathcal{D}(\mathcal{C}_k, \mathcal{C}_{-k}),$$
where 𝒞−k represents data from all subjects except those in set ck. The new Gap statistic is therefore
$$Gap_b(K) = B_b(K) - E\left[B_b(K)\right]. \tag{6}$$
The expectation in equation (6) can be calculated under certain data homogeneity assumptions. For example, Tibshirani et al. (2001) assumed that the observed data are uniformly distributed in the sample space for clustering cross-sectional data and they used the bootstrap method to compute the expectation. For longitudinal data, however, this approach appears to be numerically intractable. To maintain computational efficiency, we propose an ad-hoc procedure to compute the expectation under the following homogeneous longitudinal B-spline model:
$$Y_{ij} = X_{ij}\beta + e_{ij}, \tag{7}$$
where E[eij] = 0 and Var[eij] = σ². Under this model, the expectation in (6) admits a closed-form expression, following the result of Proposition 2 in Web Supplementary Materials. Hence, the expectation can be evaluated directly once σ² is estimated by fitting the B-spline model to all the data. Following the recommendation of Tibshirani et al. (2001), we use the first turning point as the chosen number of clusters, that is, $\hat{K} = \min\{K \geq 2 : Gap_b(K) \geq Gap_b(K+1)\}$.
Remark 5. Both CH(K) and Gapb(K) can be extended to longitudinal clustering with multiple outcomes by adding the SSR for each individual outcome.
Remark 6. Our numerical experiments showed that Gapb(K) performs better for data with a small number of clusters (≤ 5) and is more sensitive than the CH(K) statistic in picking out clusters of extreme size. Additionally, it is computationally convenient, as shown above, compared with the original Gap statistic, for which the bootstrap is required.
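For illustration, once the criterion values have been recorded along the merge path, the two selection rules amount to a maximum search and a first-turning-point search. The sketch below assumes numeric vectors whose Kth elements are CH(K) and Gapb(K); element 1 may be NA since neither statistic is defined at K = 1. The helper names are ours, not part of the package.

```r
# Sketch of the two selection rules applied to criterion values recorded
# along the merge path (element K corresponds to a K-cluster partition).
choose_K_CH <- function(CH) {
  which.max(CH)                              # K maximizing the CH index (NAs ignored)
}

choose_K_Gapb <- function(gap_b) {
  K_max <- length(gap_b)
  for (K in seq_len(K_max - 1)[-1]) {        # K = 2, ..., K_max - 1
    if (gap_b[K] >= gap_b[K + 1]) return(K)  # first turning point of Gap_b
  }
  K_max
}
```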
2.4. The Clustering Algorithm
We propose a clustering algorithm that uses a greedy search strategy with a hierarchical merging mechanism for longitudinal data. The algorithm is structured in an agglomerative (i.e., bottom-up) manner such that the size of the subclusters increases as the algorithm proceeds. The process yields increasingly accurate estimates of the mean function for each subcluster and of the dissimilarity measure between subclusters, thus enhancing the clustering performance in sparse longitudinal data situations.
The essential steps are as follows:
Step 1. Baseline subclusters: Assuming that, for the ith subject, the observed data {(tij, Yij) : j = 1, …, ni} (e.g., Figure 1 (a)) allow for the fitting of a B-spline with p basis functions, we start with each subject as a baseline subcluster, fit the B-spline model, and calculate the SSRs and Gapb.
Step 2. Bottom-up merging process: Conduct a greedy search to merge the two subclusters with the minimal cost metric 𝒟 into one cluster, and then update the SSRs and Gapb after each merge. This process is repeated in a hierarchical manner up to the top level, where only one cluster is left. During the process, the subclusters merged in previous steps stay in the same cluster. The process is depicted in a dendrogram (Figure 1 (b)).
Step 3. Determining the number of clusters (illustrated with Gapb): Plot the sequential Gapb statistics against the respective numbers of clusters during the merging process, and determine the optimal number of clusters by the first turning point in the plot, as shown in Figure 1 (c).
Step 4. Conclusion: Summarize the final clusters from the hierarchically structured clustering results with the number determined in Step 3, as shown in the color-coded dendrogram in Figure 1 (d) and the subject-level clustering results in Figure 1 (e).
Fig. 1.
Steps in the clustering algorithm: (a) Original sample data. (b) Dendrogram from the hierarchical clustering algorithm. (c) Graph of GAPb statistic. (d) Determine the number of clusters according to the GAPb statistic. (e) Final color-coded clustering results.
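As a concrete illustration of Steps 1 and 2, the simplified single-outcome sketch below starts with one subcluster per subject and repeatedly merges the pair with the smallest cost, recording the merge history. It assumes every subject has enough observations for its own spline fit and reuses merge_cost() from the sketch in Section 2.2; the full algorithm additionally updates the SSRs and the Gapb (or CH) sequence after each merge.

```r
# Simplified single-outcome sketch of the bottom-up merging loop.
# 'dat' is a list with one data.frame(time, y) per subject; 'knots' are the
# shared interior knots. Each subject is assumed to have enough observations
# to support its own spline fit (otherwise apply the preprocessing of Section 2.4).
hier_merge <- function(dat, knots) {
  clusters <- dat                      # current subclusters, one per subject at start
  history  <- list()                   # records each merge (pair, cost, K after merge)
  while (length(clusters) > 1) {
    best <- c(NA, NA); best_cost <- Inf
    for (u in 1:(length(clusters) - 1)) {
      for (v in (u + 1):length(clusters)) {
        cost <- merge_cost(clusters[[u]]$time, clusters[[u]]$y,
                           clusters[[v]]$time, clusters[[v]]$y, knots)
        if (cost < best_cost) { best_cost <- cost; best <- c(u, v) }
      }
    }
    clusters[[best[1]]] <- rbind(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
    history[[length(history) + 1]] <- list(pair = best, cost = best_cost,
                                           K = length(clusters))
  }
  history
}
```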
The data assumption in Step 1 is commonly used in longitudinal data clustering (Abraham et al., 2003; Garcia-Escudero and Gordaliza, 2005; Zhu et al., 2018). When the assumption is not met, one can add a preprocessing step to form the baseline subclusters for situations where some individual subjects cannot form a baseline subcluster by themselves: Let c(−) = {i ∈ c : ni ≤ p} be the set of subjects whose longitudinal observations are not sufficient for the fitting of a subject-specific B-spline model with p basis functions, and let c(+) = {i ∈ c : ni > p} be the set of subjects with sufficient data to fit a subject-specific B-spline. Each subject in c(+) forms an initial subcluster. The idea of the data preprocessing is to empty c(−) by combining the “most” similar subjects such that the combined data are sufficient to fit a baseline subcluster. The process can be accomplished through another greedy search strategy, as described below.
- For each i ∈ c(−), merge the data from this subject with that of a subject j satisfying ni + nj > p and fit a B-spline with p basis functions. The mean squared residual is calculated for the fitted model:
$$d(i, j) = \frac{SSR(\mathcal{C}_{i,j})}{n_i + n_j}.$$
- Identify (u, v) such that
$$d(u, v) = \min_{i \in c^{(-)},\, j \neq i} d(i, j).$$
If v ∈ c(−), update c(−) by removing subjects u and v, and update c(+) by adding their combined longitudinal observations as a new baseline subcluster; otherwise, update c(−) by removing subject u, and update c(+) by merging the data from subject u into those from v ∈ c(+). Repeat the two steps above until c(−) becomes empty. The modified subclusters in c(+) then constitute the baseline subclusters for Step 1 of the clustering algorithm; a code sketch of this preprocessing step is given below.
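The following is a minimal single-outcome sketch of this preprocessing step; it reuses fit_spline_ols() from the sketch in Section 2.1, uses the mean squared residual of the pooled fit as the pairing criterion, and is an illustration rather than the package implementation.

```r
# Sketch of the preprocessing step: subjects with too few observations (n_i <= p)
# are greedily combined with their closest partner until every baseline
# subcluster supports its own spline fit. 'dat' is a list of data.frame(time, y).
preprocess_sparse <- function(dat, knots, p) {
  n_obs <- vapply(dat, nrow, integer(1))
  while (any(n_obs <= p)) {
    short <- which(n_obs <= p)                     # the set c(-)
    best <- c(NA, NA); best_msr <- Inf
    for (i in short) {
      for (j in setdiff(seq_along(dat), i)) {
        if (n_obs[i] + n_obs[j] <= p) next         # pooled data must allow a fit
        pooled <- rbind(dat[[i]], dat[[j]])
        msr <- fit_spline_ols(pooled$time, pooled$y, knots)$ssr / nrow(pooled)
        if (msr < best_msr) { best_msr <- msr; best <- c(i, j) }
      }
    }
    dat[[best[2]]] <- rbind(dat[[best[2]]], dat[[best[1]]])   # merge u into v
    dat[[best[1]]] <- NULL
    n_obs <- vapply(dat, nrow, integer(1))
  }
  dat                                               # baseline subclusters for Step 1
}
```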
We have implemented the method in an R package clusterMLD, short for Clustering for Multivariate Longitudinal Data (https://github.com/junyzhou10/clusterMLD); an accompanying interactive web interface allows analysts who are not familiar with the R software to access the clustering function.
3. Simulation Studies
We examined the operating characteristics of the proposed method and compared it against existing methods through an extensive simulation study. We first assessed the method’s performance in a relatively straightforward setting involving one outcome and less sparse observations (Zhu et al., 2018); this is a setting where many existing clustering methods are readily applicable. We then examined the proposed method’s performance in situations of multiple outcomes and sparse observations, where the existing methods are expected to struggle. In the latter setting, only a few existing methods are available for comparison. Finally, we compared the proposed method with the competing clustering methods in a functional data setting with complicated functional shapes and a large number of clusters.
Metrics of assessment.
Key metrics include: (1) the number of identified clusters; (2) the overall accuracy of clustering, frequently quantified by the rate of pairwise agreement, referred to as the Rand index (RI) (Rand, 1971); and (3) the time in seconds to complete the task. Theoretically, RI takes a value between 0 and 1, but it is unlikely to be near 0 even for a random classification when dealing with more than two groups.
A more intuitive metric is the Adjusted Rand Index (ARI) proposed by Hubert and Arabie (1985):
$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]},$$
where the expected value of RI is calculated under completely random classification, and max(RI) is the highest value that a classification can achieve. ARI is close to zero for a poor clustering method and is bounded from above by 1; ARI approaches 1 as a clustering method becomes more accurate. ARI is applied to assess the overall accuracy of clustering, whether or not the true number of clusters is used. When the correct number of clusters is used, we also explored a clustering-specific accuracy (CSA) measure to quantify the percentage of subjects correctly assigned to the right clusters.
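In practice, ARI can be computed directly from the true and estimated labels, for example with the adjustedRandIndex() function in the R package mclust; the labels below are illustrative.

```r
library(mclust)

# ARI between true and estimated cluster labels
true_labels <- rep(1:4, each = 25)
est_labels  <- sample(true_labels)              # a deliberately poor (random) clustering
adjustedRandIndex(true_labels, est_labels)      # near 0 for random assignment
adjustedRandIndex(true_labels, true_labels)     # exactly 1 for perfect agreement
```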
We examined both Gapb and CH statistics for their performance in determining the number of clusters.
3.1. Case 1: Single Outcome with Non-sparse Observations
We first considered the following longitudinal model of the form (1),
$$Y_{ij} = f_i(t_{ij}) + r_i(t_{ij}) + \varepsilon_{ij}. \tag{8}$$
We used a setting previously explored by Zhu et al. (2018), which has four clusters with the mean trajectories , , , and , respectively. We assumed a random error εij ~ 𝒩(0, 0.4²). However, instead of using 10 evenly spaced observation times in [0, 1], we simulated the observation times from a continuous uniform distribution 𝒰(0, 1) to add more variability. We required the interval between two adjacent observations to be larger than 0.06. We modeled the random component ri(·) in model (8) with a random quadratic function
$$r_i(t) = b_{0i} + b_{1i} t + b_{2i} t^2, \tag{9}$$
where the random coefficients follow a multivariate normal distribution
$$(b_{0i}, b_{1i}, b_{2i})^{\mathrm{T}} \sim \mathcal{N}\left(\mathbf{0},\ \sigma_b\, \rho_b\, \sigma_b\right), \tag{10}$$
with the correlation coefficient matrix ρb given in (11),
and diagonal variance matrices σb = diag (0.1, 0.2, 0.2) and σb = diag (0.2, 0.4, 0.4), respectively, to indicate low and high noise scenarios. We generated data for 100 subjects for each of the two cases of cluster size combination: A balanced case with 25 subjects for each cluster and an unbalanced case with 5, 25, 25, and 45 for the four clusters.
For the clustering algorithm, we used cubic B-splines (Schoenberg, 1946; Schumaker, 2007) for the individual functional curve fitting, as described in the Methods section. Three internal knots were selected at the first, second, and third sample quartiles of all observation times to form the B-spline basis functions. This resulted in a total of 7 basis functions in the B-spline curve fitting (Ruppert, 2002).
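For reference, this basis construction can be reproduced with the bs() function in the R package splines: with three interior knots at the quartiles and the intercept included, a cubic basis has 3 + 3 + 1 = 7 columns. The observation times below are illustrative.

```r
library(splines)

# Cubic B-spline basis with three interior knots at the sample quartiles of the
# pooled observation times.
obs_times <- runif(500)
knots <- quantile(obs_times, probs = c(0.25, 0.50, 0.75))
B <- bs(obs_times, knots = knots, degree = 3, intercept = TRUE)
ncol(B)   # 7 basis functions
```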
We conducted comprehensive comparisons between the proposed method and eight other clustering methods mentioned in the Introduction for which computing packages are available: (1) Partitioning Around Medoids (PAM) with B-splines (Abraham et al., 2003; Garcia-Escudero and Gordaliza, 2005; Zhu et al., 2018); (2) Gaussian Mixtures (GM) with B-splines (Zhu et al., 2018); (3) the K-means method for longitudinal data (KmL) (Genolini and Falissard, 2011; Genolini et al., 2015); (4) unsupervised regression mixtures (UReMix) (Chamroukhi, 2016); (5) non-parametric pairwise grouping (NPG) (Zhu et al., 2018); (6) the Gaussian mixture model for longitudinal data (longclust) with the EM algorithm (McNicholas and Murphy, 2010; McNicholas and Subedi, 2012); (7) the Gaussian model-based clustering method based on functional principal component analysis scores (funHDDC) with the EM algorithm (Bouveyron and Jacques, 2011; Jacques and Preda, 2014b; Schmutz et al., 2020); and (8) the wavelet-decomposition-based Gaussian model (curvclust) (Giacofci et al., 2013). Among these, PAM and KmL are algorithm-based, whereas GM, UReMix, NPG, longclust, funHDDC, and curvclust are model-based. The Gap statistic (Tibshirani et al., 2001) was used for PAM, and BIC (Schwarz et al., 1978) was used for GM, longclust, funHDDC, and curvclust, to determine the number of clusters. For KmL, the CH index was used, as recommended among the alternatives provided in the R package kml, to select the optimal number of clusters. For UReMix, the number of mixtures was automatically determined through a robust EM algorithm (Yang et al., 2012). The number of clusters in NPG was decided by properly tuning the hyperparameters.
The simulation was repeated 100 times, and we reported the average number of identified clusters, ARI, CSA, and computing time. We also reported the standard deviations for the three performance measures. We summarized the results in Table 1.
Table 1.
Comparison of clustering performance among different methods with 100 replications in terms of mean (standard deviation) for (1) $\hat{K}$: identified cluster number; (2) ARI; (3) CSA; (4) computing time.
 | N=(25,25,25,25) | | | | N=(5,25,25,45) | | | 
---|---|---|---|---|---|---|---|---
 | $\hat{K}$ | ARI | CSA | Time (secs) | $\hat{K}$ | ARI | CSA | Time (secs)
Low Noise | ||||||||
clusterMLD(GAPb) | 4.01(0.10) | 0.994(0.013) | 0.998(0.005) | 0.92 | 4.00(0.14) | 0.995(0.016) | 0.999(0.004) | 0.95 |
clusterMLD(CH) | 4.00(0.00) | 0.994(0.013) | 0.998(0.005) | 0.92 | 3.01(0.10) | 0.921(0.025) | −(−) | 0.95 |
KmL | 4.00(0.00) | 0.997(0.009) | 0.999(0.003) | 5.64 | 3.00(0.00) | 0.919(0.015) | −(−) | 4.98 |
GM (B-splines) | 5.34(1.32) | 0.937(0.078) | 0.997(0.005) | 1.05 | 5.02(1.67) | 0.844(0.122) | 0.939(0.048) | 0.95 |
uReMix | 4.32(0.49) | 0.969(0.045) | 0.999(0.003) | 0.57 | 3.04(0.20) | 0.916(0.024) | 0.875(0.083) | 0.49 |
NPG | 3.86(0.43) | 0.950(0.119) | 0.999(0.003) | 328.97 | 4.06(0.34) | 0.968(0.028) | 0.985(0.013) | 334.22 |
longclust | 4.02(0.14) | 0.996(0.013) | 0.999(0.003) | 12.92 | 3.94(0.40) | 0.985(0.043) | 1.000(0.002) | 12.72 |
funHDDC | 3.93(2.08) | 0.473(0.221) | 0.705(0.069) | 20.48 | 3.37(1.81) | 0.359(0.227) | 0.756(0.088) | 17.74 |
curvclust | 6.09(1.93) | 0.876(0.113) | 0.998(0.004) | 9.96 | 5.71(1.50) | 0.845(0.130) | 1.000(0.002) | 8.97 |
High Noise | ||||||||
clusterMLD(GAPb) | 4.45(0.72) | 0.901(0.079) | 0.972(0.029) | 0.99 | 3.75(0.74) | 0.890(0.126) | 0.979(0.032) | 0.98 |
clusterMLD(CH) | 3.96(0.28) | 0.918(0.106) | 0.972(0.028) | 0.99 | 3.04(0.20) | 0.865(0.091) | 0.778(0.054) | 0.98 |
PAM (B-splines) | 1.05(0.22) | <0.010(0.001) | −(−) | 14.7 | 1.04(0.20) | <0.010(0.001) | −(−) | 14.16 |
KmL | 3.90(0.44) | 0.903(0.137) | 0.975(0.019) | 5.25 | 3.00(0.00) | 0.884(0.035) | −(−) | 5.00 |
GM (B-splines) | 5.26(1.28) | 0.917(0.097) | 0.989(0.011) | 0.94 | 4.96(1.66) | 0.817(0.144) | 0.937(0.034) | 1.01 |
uReMix | 4.87(0.75) | 0.857(0.076) | 0.969(0.026) | 0.68 | 3.73(0.69) | 0.787(0.113) | 0.824(0.060) | 0.61 |
NPG | 4.07(0.97) | 0.744(0.217) | 0.865(0.157) | 330.56 | 4.58(0.82) | 0.889(0.102) | 0.960(0.041) | 335.91 |
longclust | 4.03(0.17) | 0.973(0.031) | 0.991(0.011) | 14.58 | 3.74(0.44) | 0.960(0.044) | 0.992(0.010) | 14.26 |
funHDDC | 4.02(2.86) | 0.343(0.235) | 0.755(0.064) | 25.21 | 5.07(2.70) | 0.388(0.206) | 0.692(0.117) | 24.9 |
curvclust | 9.14(0.82) | 0.635(0.054) | −(−) | 8.74 | 8.67(1.21) | 0.567(0.077) | −(−) | 8.32 |
Note: CSA was only reported when there were at least two cases in 100 repetitions that resulted in the correct number of clusters.
Table 1 showed that the proposed method clusterMLD with GAPb yielded highly accurate clustering results in the low noise settings, and its performance was not sensitive to extreme cluster sizes; the performance was somewhat reduced in the high noise settings but remained respectable. With CH, clusterMLD yielded better results when the cluster sizes were balanced and was insensitive to the noise level, but its performance was largely reduced in the settings with extreme cluster sizes. PAM, GM, and curvclust, on the other hand, often failed to identify the correct number of clusters. When a method failed to identify the correct number of clusters, we did not calculate its CSA level. Similar to clusterMLD with CH, KmL performed well when the cluster sizes were balanced, but it was unable to identify the right cluster number when the sizes were highly unbalanced. This suggests that CH is a less reliable statistic for identifying the right cluster number when the cluster sizes are highly unbalanced. UReMix had outstanding computational efficiency owing to the explicit updating rule of the EM algorithm, but its clustering accuracy left much to be desired, especially when the cluster sizes were unbalanced. NPG and funHDDC were comparable to the proposed method with GAPb in determining the number of clusters on average in the tested settings, but not in the clustering accuracy measured by ARI and CSA, and they were much more costly in computing: In particular, NPG needed 334–358 times the computing time of the proposed method. longclust appeared to perform slightly better than the proposed method with GAPb when the noise level was high. However, the computational efficiency of longclust was much lower than that of the proposed method: On average, the computing times for longclust were 13–15 times those of the proposed method.
The empirical distribution of identified cluster numbers based on 100 simulation runs (Figure 1 in Web Supplementary Materials) showed that (1) the proposed method with GAPb had the highest probability of identifying the right cluster number in low noise settings and good probabilities in high noise settings; (2) the proposed method with CH had the highest probability of identifying the right cluster number in the balanced settings, but the performance was much reduced with extreme cluster sizes. NPG and longclust appeared to be comparable to the proposed method with GAPb in overall clustering accuracy in the tested settings but required much more computing resources, which may prevent them from being applied to clustering larger longitudinal data.
Overall, clusterMLD (GAPb)’s excellent ability to identify the right cluster number, as well as its high computational efficiency and clustering accuracy, offers a great advantage for longitudinal data clustering, especially when data volume is large.
3.2. Case 2: Multiple Outcomes with Sparse Observations
For this part of the simulation, we considered a multivariate longitudinal model,
$$Y_{ij}^{(h)} = f_i^{(h)}(t_{ij}) + r_i(t_{ij}) + \sigma_{ih} + \varepsilon_{ijh}, \qquad h = 1, \dots, 5, \tag{12}$$
where $Y_{ij}^{(h)}$ represented the hth outcome from subject i on the jth occasion, and $f_i^{(h)}(\cdot)$ the hth fixed component, or the hth mean trajectory, for the ith subject. Herein, we considered five outcomes with four underlying clusters. The mean trajectories f(h)(·), h = 1, …, 5, for each cluster were given in Table 2 and graphically illustrated in Figure 2 (A). For this setting, the clusters were mostly distinguished by Y(1) and Y(2) and less distinguished by Y(3) and Y(4). Y(5) was completely non-informative in separating the clusters.
Table 2.
The five cluster-specific mean trajectories used in Case 2
Fig. 2.
(A) Five mean functions used in the simulation. (B) Sample trajectories in five outcomes from each cluster with εijh ~ 𝒩(0, 4²).
The random component ri(·) was modeled following the same equation (9) as in Case 1, with σb = diag (2, 0.3, 0.06) and the same ρb. The σih, h = 1, …, 5, were additional random effects that described the correlations among the outcomes; they were generated from a multivariate normal distribution with zero mean and σ = diag (3, 3, 3, 3, 3). As pure noise, outcome 5 was set to be totally independent of the other four outcomes. The random measurement error εijh was simulated from either a normal distribution 𝒩(0, η²) or a uniform distribution 𝒰(−η/2, η/2). To evaluate the performance of the proposed method at different noise levels, we chose η = 3 or 6.
The sparse and irregular observations were simulated as follows. For subject i, we sampled the number of observations ni from a discrete uniform distribution between 4 and 12. Then the observation times were selected as the order statistics of the ni random observations from 𝒰(0, 11) with the first observation fixed at 0 and the interval between two adjacent observations set to be greater than 0.5. Some sample trajectories color-coded by clusters were plotted in Figure 2 (B) for each outcome.
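For completeness, the observation-time scheme described above can be generated with a simple rejection step, as in the illustrative sketch below (names and the rejection approach are ours).

```r
# Sparse, irregular visit times: n_i ~ discrete uniform on 4..12, first visit at 0,
# remaining times from U(0, 11), redrawn until all adjacent gaps exceed 0.5.
sim_obs_times <- function() {
  n_i <- sample(4:12, 1)
  repeat {
    t <- sort(c(0, runif(n_i - 1, min = 0, max = 11)))
    if (all(diff(t) > 0.5)) return(t)
  }
}

set.seed(2)
sim_obs_times()
```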
The performance under different cluster size combinations was also explored: In setting S0 we considered a balanced case wherein the ratio of four cluster sizes was 1:1:1:1; S1 represented a case where one cluster was extremely small, with the cluster size ratio being 1:13:13:13; and S2 represented a case where one cluster was much larger than the other clusters, with the cluster size ratio being 1:10.33:1:1. Settings S1 and S2 were designed to assess whether the clustering algorithm was influenced by an extremely small or large cluster, a situation likely to occur in studies of rare or overly abundant disease phenotypes. Furthermore, to explore the influence of sample size, simulations with a total sample size of 200 and 400 were conducted. When the total sample size was 200, all four clusters had 50 subjects in S0; one cluster had 5, the rest had 65 subjects each in S1; one cluster had 155, and the rest had 15 subjects each in S2.
We again used cubic B-splines (Schoenberg, 1946; Schumaker, 2007) with the three internal knots as described in Section 3.1 to carry out the individual functional curve fitting. For subjects without sufficient observations, the data preprocessing step was implemented, as described in Section 2.4.
Among the eight competing methods used for Case 1, only the KmL and funHDDC packages have been extended to handle multiple outcomes. However, KmL3d (Genolini et al., 2015), the multivariate version of KmL, requires the observations to be aligned at the same time points and thus could not accommodate the simulated data. As a result, we only compared clusterMLD with funHDDC. For each of the simulation settings, we ran 100 replications. Results on the mean and standard deviation of the identified cluster number, ARI, and CSA were summarized in Table 3 and Table 4, respectively.
Table 3.
Comparison of mean (sd) of $\hat{K}$ in 100 replicates between clusterMLD and funHDDC.
 | clusterMLD(Gapb) | | | | clusterMLD(CH) | | | | funHDDC | | | 
---|---|---|---|---|---|---|---|---|---|---|---|---
 | Uniform | | Normal | | Uniform | | Normal | | Uniform | | Normal | 
 | η = 3 | η = 6 | η = 3 | η = 6 | η = 3 | η = 6 | η = 3 | η = 6 | η = 3 | η = 6 | η = 3 | η = 6
N=200 | ||||||||||||
S0 | 3.82(0.39) | 3.88(0.36) | 3.94(0.34) | 4.00(0.47) | 3.46(0.83) | 3.22(0.87) | 2.73(0.79) | 2.00(0.00) | 2.50(1.62) | 2.43(1.54) | 2.78(1.94) | 3.18(2.56) |
S1 | 3.62(0.55) | 3.70(0.50) | 3.73(0.57) | 3.81(0.61) | 2.51(0.50) | 2.40(0.49) | 2.16(0.37) | 2.06(0.24) | 7.35(2.68) | 6.79(2.46) | 7.73(2.55) | 8.20(1.82) |
S2 | 3.76(1.18) | 3.69(0.84) | 3.68(0.55) | 3.77(0.57) | 3.26(1.27) | 2.90(1.06) | 2.68(0.87) | 2.49(0.76) | 3.66(1.60) | 3.63(1.47) | 3.42(1.43) | 3.79(1.93) |
N=400 | ||||||||||||
S0 | 3.99(0.10) | 3.97(0.17) | 3.99(0.17) | 4.01(0.27) | 3.74(0.68) | 3.50(0.86) | 2.77(0.69) | 2.00(0.00) | 2.30(1.16) | 2.42(1.16) | 2.36(1.12) | 3.96(3.21) |
S1 | 3.87(0.34) | 3.92(0.31) | 3.88(0.33) | 3.97(0.26) | 2.41(0.49) | 2.32(0.47) | 2.12(0.33) | 2.02(0.14) | 8.55(2.07) | 8.61(2.13) | 8.80(1.96) | 8.93(1.43) |
S2 | 3.88(0.57) | 3.83(0.59) | 3.81(0.46) | 3.94(0.31) | 3.14(1.07) | 3.00(1.01) | 2.68(0.68) | 2.32(0.55) | 4.01(1.67) | 3.82(1.51) | 3.92(1.38) | 4.58(2.67) |
Table 4.
Comparison of mean (sd) of ARI and CSA in 100 replicates between clusterMLD and funHDDC.
 | ARI | | | | CSA | | | 
---|---|---|---|---|---|---|---|---
 | Uniform | | Normal | | Uniform | | Normal | 
 | η = 3 | η = 6 | η = 3 | η = 6 | η = 3 | η = 6 | η = 3 | η = 6
clusterMLD (Gapb) | ||||||||
N=200 | ||||||||
S0 | 0.924(0.110) | 0.934(0.098) | 0.935(0.086) | 0.886(0.084) | 0.990(0.009) | 0.988(0.007) | 0.986(0.009) | 0.968(0.014) |
S1 | 0.954(0.076) | 0.967(0.031) | 0.944(0.086) | 0.883(0.086) | 0.984(0.047) | 0.993(0.006) | 0.989(0.007) | 0.950(0.071) |
S2 | 0.901(0.179) | 0.893(0.178) | 0.948(0.06) | 0.876(0.123) | 0.991(0.008) | 0.980(0.059) | 0.989(0.008) | 0.970(0.014) |
N=400 | ||||||||
S0 | 0.980(0.030) | 0.972(0.049) | 0.971(0.043) | 0.917(0.053) | 0.993(0.004) | 0.992(0.004) | 0.991(0.006) | 0.972(0.011) |
S1 | 0.981(0.017) | 0.98(0.045) | 0.976(0.022) | 0.928(0.048) | 0.994(0.005) | 0.990(0.041) | 0.993(0.005) | 0.972(0.036) |
S2 | 0.939(0.129) | 0.959(0.086) | 0.971(0.020) | 0.918(0.053) | 0.993(0.009) | 0.993(0.005) | 0.991(0.005) | 0.977(0.008) |
clusterMLD (CH) | ||||||||
N=200 | ||||||||
S0 | 0.839(0.210) | 0.773(0.216) | 0.633(0.206) | 0.429(0.072) | 0.991(0.008) | 0.989(0.007) | 0.988(0.009) | −(−) |
S1 | 0.751(0.202) | 0.707(0.196) | 0.609(0.148) | 0.550(0.088) | −(−) | −(−) | −(−) | −(−) |
S2 | 0.711(0.215) | 0.753(0.184) | 0.781(0.161) | 0.738(0.130) | 0.918(0.148) | 0.902(0.175) | 0.861(0.193) | 0.971(0.024) |
N=400 | ||||||||
S0 | 0.919(0.166) | 0.857(0.212) | 0.656(0.176) | 0.440(0.063) | 0.993(0.004) | 0.993(0.004) | 0.994(0.005) | −(−) |
S1 | 0.717(0.198) | 0.679(0.189) | 0.599(0.131) | 0.545(0.052) | −(−) | −(−) | −(−) | −(−) |
S2 | 0.791(0.188) | 0.801(0.181) | 0.835(0.146) | 0.741(0.107) | 0.991(0.019) | 0.992(0.003) | 0.923(0.173) | −(−) |
funHDDC | | | | | | | |
N=200 | ||||||||
S0 | 0.483(0.068) | 0.479(0.067) | 0.498(0.077) | 0.465(0.067) | −(−) | −(−) | −(−) | −(−) |
S1 | 0.458(0.186) | 0.505(0.178) | 0.446(0.167) | 0.432(0.113) | 0.854(0.058) | 0.876(0.059) | 0.842(0.064) | 0.837(0.072) |
S2 | 0.419(0.213) | 0.400(0.217) | 0.399(0.200) | 0.347(0.182) | 0.519(0.034) | 0.509(0.035) | 0.498(0.038) | 0.499(0.045) |
N=400 | ||||||||
S0 | 0.488(0.059) | 0.499(0.071) | 0.488(0.062) | 0.463(0.059) | −(−) | 0.625(0.025) | −(−) | −(−) |
S1 | 0.457(0.140) | 0.459(0.152) | 0.445(0.138) | 0.390(0.096) | 0.878(0.057) | 0.919(0.030) | 0.917(0.026) | 0.916(0.027) |
S2 | 0.360(0.201) | 0.370(0.189) | 0.342(0.176) | 0.321(0.198) | 0.512(0.040) | 0.492(0.028) | 0.500(0.035) | 0.486(0.024) |
Note: CSA was only reported when there were at least two cases in 100 repetitions that resulted in the correct number of clusters.
Table 3 showed that clusterMLD with GAPb was generally accurate in identifying the cluster number (K = 4), with small standard deviations, whereas the performance of clusterMLD with CH deteriorated, especially when the cluster sizes were severely unbalanced. funHDDC often deviated from the correct number, except for setting S2, where its estimates were closer. The performance of clusterMLD with GAPb improved with increasing sample size, and the method performed better with normal errors than with uniform errors. The empirical distribution of the identified cluster numbers, presented in Figure 2 in Web Supplementary Materials, showed that the probability of identifying the right cluster number was much higher in all tested settings for clusterMLD with GAPb than for funHDDC. Additionally, Table 4 showed that ARI and CSA (when available for calculation) were much higher for clusterMLD with GAPb than for funHDDC. It is possible that the less satisfactory performance of funHDDC was due to the fact that it was designed for functional data with more frequent observations. Computational efficiency was an added advantage for clusterMLD compared to funHDDC in this case: On average, clusterMLD took 2.92 and 11.35 seconds to complete the task for sample sizes N = 200 and N = 400, respectively; the corresponding numbers were 26.79 and 62.28 seconds for funHDDC.
In the low noise data setting (η = 3), the average estimated importance weights Wh for the five outcomes were 0.419, 0.295, 0.097, 0.125, and 0.064 with uniformly distributed measurement error and 0.411, 0.278, 0.111, 0.126, and 0.074 with normally distributed measurement errors, respectively. In the high noise setting (η = 6), the corresponding weights were 0.418, 0.286, 0.103, 0.126, and 0.067 with uniformly distributed measurement error and 0.384, 0.266, 0.124, 0.135, and 0.090 with normally distributed measurement error, respectively. The estimated weights reflected the order of importance of the multiple outcomes in differentiating the clusters, as shown in Figure 2.
Though the method we developed is mainly for clustering longitudinal data, it is also applicable to clustering functional data, where it requires no additional data preprocessing steps. We conducted an additional simulation study to assess its performance in clustering functional data in comparison with the existing methods. We generated functional data with a large number of clusters (K = 10) and more complicated functional shapes. The full results of the simulation study are presented in Web Supplementary Materials. Briefly, the simulation showed that the two best-performing methods were clusterMLD with CH and GM, while the other methods had rather disappointing performance to various degrees. Among the two top performers, GM had a 98% chance (ARI = 0.996) of identifying the correct cluster number, whereas clusterMLD with CH had an 81% chance (ARI = 0.976). A distant third, funHDDC, had only an 11% chance of identifying the right cluster number, with ARI = 0.808. Interestingly, clusterMLD with GAPb seemed to lose its advantage in selecting the right cluster number when the number of clusters is large.
In summary, clusterMLD with GAPb delivers a highly reliable performance for clustering longitudinal data in various data situations in comparison to the existing methods when the number of clusters is not large, and it is readily implementable with multivariate outcomes. The method’s computational efficiency offers an important advantage in clustering large volumes of data. As confirmed by the simulation studies, these characteristics have made clusterMLD with GAPb an appealing clustering tool for a broad class of longitudinal studies, as shown in the two real applications below.
4. Real-World Applications
The method discussed in the current research is readily applicable to real data applications. To emphasize its general applicability, we present two clustering analyses using data from two real clinical investigations.
4.1. The Systolic Blood Pressure Intervention Trial
The Systolic Blood Pressure Intervention Trial (SPRINT) is a randomized clinical trial aimed at reducing cardiovascular complications in people with hypertension by aggressively lowering systolic blood pressure (BP). Participants were randomly assigned to two arms: an intensive treatment arm in which the systolic BP goal was set to 120 mmHg, and a standard treatment arm in which the systolic BP goal remained at 140 mmHg. Therapeutic decisions on how to bring down BP were left to the treating physicians. The study tracked BP in study participants at approximately three-month intervals for up to five years. The study found that the intervention resulted in significantly lower systolic and diastolic BP in patients who received the intensive treatment (The-SPRINT-Research-Group, 2015). The SPRINT data are publicly available from the National Heart, Lung, and Blood Institute, through its Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) (https://biolincc.nhlbi.nih.gov/home) under signed Research Materials Distribution Agreements (RMDA).
In this research, we performed a clustering analysis of BP data in SPRINT participants. Since the original trial has established the blood pressure-lowering effect of SPRINT intervention, one would expect that a good clustering method would be able to successfully differentiate the patients with well-controlled blood pressure from those with poorly controlled blood pressure, and the cluster membership should roughly correspond to the original treatment group assignment – the lower blood pressure cluster should include mainly the intensively treated patients and the high blood pressure cluster should include mainly the control patients.
The SPRINT study had a total of 9,173 participants, including 4,600 in the intensive treatment arm and 4,573 in the control arm. The participants generated 144,824 BP measurements; each patient, on average, contributed 15.8 BP measures. We excluded those with only baseline BP because they did not contribute any discriminating information to the clustering. Systolic BP was used as the primary outcome so that all methods described in Section 3 could be applied. The pointwise mean longitudinal SBP patterns in the two treatment groups are shown in Figure 3 (a).
Fig. 3.
(a) The pointwise mean SBP trajectories by the original treatment groups. (b) The pointwise mean SBP trajectories by the clusters using the proposed algorithm. The pointwise vertical bar around the mean is the one standard deviation error bar for the mean.
The clustering results are presented in Table 5. We used Cohen’s Kappa (κ) coefficient (Cohen, 1960) to assess the agreement between the cluster membership and the original treatment assignment. The proposed method identified two clusters, resulting in a high Kappa coefficient (κ = 0.647), and had a superior computational efficiency: the computing time on a MacBook equipped with a 2.3GHz 8-Core Intel i9 processor was 147.17 seconds. The low and high systolic BP clusters included 4,674 and 4,499 patients, respectively. In the low BP cluster, 3,828 (81.9%) were from the intensive treatment arm; in the high BP cluster, 3,729 (82.8%) were from the standard treatment group. However, one should not anticipate a perfect agreement between the cluster membership and the original treatment assignment because the SPRINT intervention did not work for every patient, and some control patients could achieve lower BP than their intensively treated peers. We presented the pointwise mean systolic BP trajectories for the two clusters in Figure 3 (b); the mean trajectories clearly resembled those of the two treatment groups.
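For reference, Cohen’s kappa can be computed directly from the two-way table of cluster membership against treatment assignment, as in the small sketch below; the label vectors are hypothetical, and the categories must be coded consistently so that the diagonal of the table represents agreement.

```r
# Cohen's kappa from two categorical vectors coded with the same labels.
cohens_kappa <- function(x, y) {
  tab <- table(x, y)
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                       # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Illustrative use with hypothetical cluster and treatment labels
cluster_label   <- sample(c("low BP", "high BP"), 100, replace = TRUE)
treatment_label <- sample(c("low BP", "high BP"), 100, replace = TRUE)
cohens_kappa(cluster_label, treatment_label)
```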
Table 5.
Comparison of SPRINT data clustering results among the competing methods
 | K | κ | ARI | Time (sec)
---|---|---|---|---|
clusterMLD | 2 | 0.647 | 0.419 | 147.17 |
PAM (B-splines) | 1 | 0.000 | 0.000 | 66046.21 |
KmL | 2 | 0.202 | 0.366 | 3377.42 |
GM (B-splines) | 10 | 0.093 | 0.102 | 191.62 |
longclust | 15 | 0.047 | 0.073 | 19405.75 |
funHDDC | 10 | −0.025 | 0.056 | 3759.14 |
curvclust | 14 | 0.000 | 0.216 | 3633.78 |
Most of the competing methods did not fare well: PAM, GM, longclust, funHDDC, and curvclust yielded less meaningful numbers of clusters that contradicted the findings from the main trial paper (The-SPRINT-Research-Group, 2015). The only other method that identified two clusters was KmL, but its Kappa value (κ = 0.202) was disappointing, and the method had consumed 23 times more computing time. We were not able to ascertain results from UReMix and NPG because they were not equipped to deal with a dataset of this size. We had to stop the program for the two methods because of memory overflow and excessive computing time.
4.2. The PREDICT-HD Study
In the second application, we used clustering analysis to investigate Huntington’s disease (HD) progression phenotypes. HD is a neurodegenerative disease caused by an expanded trinucleotide cytosine-adenine-guanine (CAG) repeat in the first exon of the huntingtin (HTT) gene (MacDonald et al., 1993). The disease debilitates motor function in the afflicted, often accompanied by accelerated impairment of cognitive functions (Long et al., 2015; Walker, 2007). Diagnosis is typically made by using the motor subscale of the Unified HD Rating Scale (UHDRS) (Shoulson and Fahn, 1979; Kieburtz et al., 2001). PREDICT-HD is a 12-year observational study conducted between 2002 and 2014 on prodromal HD individuals who carried the CAG-expanded HTT gene but had not met the motor diagnostic criteria for HD at study entry. The study was conducted in 33 sites across six countries (USA, Canada, Germany, Australia, Spain, and UK). Large volumes of data on neuroimaging, motor, cognitive, and psychiatric assessments were collected for predicting the onset of HD (Paulsen et al., 2014).
In this research, we studied phenotypes in HD progression in motor and cognitive impairment and how the progression phenotypes affected disease onset by using the PREDICT-HD data. The data are publicly available through the NIH Database for Genotypes and Phenotypes (dbGap); data requests should be sent to ninds-dac@mail.nih.gov.
We selected five motor and cognitive measures that are commonly used for tracking HD progression: the total motor score (TMS) (Kieburtz et al., 2001; Hogarth et al., 2005), the Symbol Digit Modalities Test (SDMT) (Smith, 1973), and the three Stroop Color Word Tests, i.e., the Stroop Color Test (stroopco), the Stroop Word Test (stroopwo), and the Stroop Color-Word Interference Test (stroopin) (Stroop, 1935; Golden, 1978). All five measures are on numerical scales, with a smaller value indicating more impairment, except for TMS, where a larger value corresponds to greater motor impairment. Although assessments were scheduled annually, the actual assessment times varied considerably among the 1,006 PREDICT-HD participants. The average number of observations per participant was 5.62, with a minimum of 2 and a maximum of 13 observations. The application represents a typical example of multiple-outcome longitudinal data with irregular and sparse observations, for which the development of the proposed method was motivated.
Two clustering methods, clusterMLD and funHDDC, are readily applicable. For clusterMLD, we fitted cubic B-spline curves using internal knots selected at the three quartiles of the total observation time. clusterMLD identified three clusters, each representing a different disease progression phenotype, coinciding well with the disease progression patterns perceived among the Huntington’s disease study group (Duff et al., 2010; Harrington et al., 2012; Paulsen et al., 2013). Since funHDDC resulted in only two clusters, we further analyzed the data using the clusters generated by clusterMLD. For narrative convenience, we refer to them as Clusters 1, 2, and 3, respectively covering 317 (32.2%), 332 (33.7%), and 336 (34.1%) participants. We present the mean B-spline functional curves in Figure 4. The figure shows that SDMT and the three Stroop tests were clearly separated at baseline, with subjects in Cluster 1 being more impaired throughout the observational window, especially near the end, when impairment accelerated. For TMS, the analysis showed that Cluster 1 had a much higher rate of increase, suggesting a more rapid deterioration. In contrast, Cluster 3 progressed slowly in both motor and cognitive declines. A closer examination of the data showed that only 25 participants (7.4%) in Cluster 3 had an HD diagnosis, with a median time to diagnosis beyond 12 years after enrollment. The numbers of HD diagnoses for Clusters 2 and 1 were 64 (19.3%) and 149 (47.0%), with median times to HD diagnosis of 11.1 and 5.6 years from study enrollment, respectively. The estimated survival functions are presented in Figure 5 (a), which confirmed that participants in the three clusters had significantly different time-to-HD-diagnosis distributions (p-value < 0.001) per the log-rank test (Peto and Peto, 1972).
Fig. 4.
The estimated cluster-specific mean trajectories.
Fig. 5.
(a) Kaplan-Meier curves for time to HD diagnosis, according to the three detected clusters. (b) Box plot of CAP by cluster.
Previously, Zhang et al. (2011) explored using the CAG trinucleotide repeat length to quantify the HD genetic burden. They proposed a CAG-Age Product scale, hereafter referred to as the CAP score, CAP = AGE × (CAG − 33.7), which was found to strongly predict HD onset and has since been used to characterize prodromal HD risk and for disease screening (Rodrigues et al., 2019). In the current analysis, CAP was predictive of disease progression – the three clusters had significantly different CAP values (p-value < 0.001). But as a static metric assessed at baseline, CAP does not fully capture the disease progression, and a large overlap in CAP values was observed among the three clusters, as shown in Figure 5 (b). We performed an additional Cox proportional hazards analysis with both CAP and the cluster labels as covariates. The result showed that, after adjusting for CAP, the HD progression clusters remained significant (p-value < 0.001), confirming the added value of the clustering analysis. This clustering analysis has profound implications for designing intervention trials aimed at modifying disease progression – by identifying individuals at risk for a specific HD progression group to enrich the study sample for the targeted intervention (Paulsen et al., 2019).
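The survival comparisons reported above follow standard practice; a sketch with the R survival package is given below, using a synthetic data frame that stands in for the PREDICT-HD variables (all column names are hypothetical).

```r
library(survival)

# Synthetic stand-in for the PREDICT-HD analysis data (hypothetical column names).
set.seed(3)
hd <- data.frame(
  time_to_dx   = rexp(300, rate = 0.1),                      # years to HD diagnosis
  dx_event     = rbinom(300, 1, 0.3),                        # 1 = diagnosed, 0 = censored
  prog_cluster = factor(sample(1:3, 300, replace = TRUE)),   # detected progression cluster
  CAP          = rnorm(300, mean = 350, sd = 50)             # baseline CAP score
)

survdiff(Surv(time_to_dx, dx_event) ~ prog_cluster, data = hd)        # log-rank test
coxph(Surv(time_to_dx, dx_event) ~ CAP + prog_cluster, data = hd)     # cluster effect adjusted for CAP
```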
5. Discussion
In this research, we have developed an efficient hierarchical clustering method for multivariate longitudinal data with irregular and sparse assessments. The method utilizes functional curve-fitting techniques to alleviate the influences of noisy and sparse data observation. The method takes a hierarchical agglomerative approach that merges subjects and subclusters in a bottom-up fashion and hence lays out all possible candidates of clusters with the cluster number ranging from 1 to N, the total sample size. It is equipped with two statistics for selecting the optimal cluster number, GAPb and CH. The numerical studies showed the proposed method with GAPb is highly reliable for clustering longitudinal data when the true number of clusters is not large (≤ 5). Therefore, it is most applicable to longitudinal clinical data.
While the method’s development was motivated by the need for clustering longitudinal clinical data, evidence supports its use in functional data analysis, where the method is numerically more effective because it requires no preprocessing of the data. The proposed method with CH also appears to be reliable for functional data settings where the number of clusters is large and functional shapes are complex. Extensive simulation studies have highlighted the method’s superior performance over its competitors in terms of cluster number determination, clustering accuracy, and computational efficiency.
At the core of the algorithm is a metric that quantifies the cost of each merging. The metric is calculated by comparing the sum of squared residuals of the separate and combined models; thus, it can be viewed as a metric of distance between two subclusters and hence clusterMLD belongs to the general category of distance-based clustering methods per Jacques and Preda (2014a). However, this metric is not a traditional L2 measure of the “distance” between two curves, on which the other functional data clustering methods were based (Ferraty and Vieu, 2006; Ieva et al., 2013; Tarpey and Kinateder, 2003; Tokushige et al., 2007). Instead, it is based on the sum of squared residuals, which can be calculated for a much broader class of parametric and nonparametric models, including but not limited to the regression B-splines models used in the current research. For traditional linear models, the metric has a well-grounded root in the classical theory of statistical inference, a foundation that the L2-based methods lack. This may help explain the superior performance of the proposed algorithm. An essential advantage of the method is that the readily calculated sums of squared residuals have drastically cut down the computational burden. The least-square-based cost metric greatly simplifies the comparison of longitudinal data models between the subclusters during the clustering process. We were surprised that such an intuitive approach had not been previously studied as it has clear potential in clustering both longitudinal and functional data. The strongest support for the method perhaps comes from the numerical studies, which highlighted the many advantages that set this method apart from the existing ones. Major appeals include the accurate identification of cluster number, clustering accuracy, and computational efficiency, in addition to its ability to handle sparse and irregular data, as well as multiple outcomes. Notably, the advantages appear to grow with sample size, as demonstrated in the simulation studies and in the SPRINT data analysis.
In many practical analyses, the sample size is a double-edged sword. While larger data sets lend more information to the analysis, very large datasets tend to present greater challenges to data processing. The computational burden of running iterative and computationally heavy algorithms on larger datasets has frustrated many analysts. For a dataset of size N, agglomerative clustering algorithms typically have a time complexity of 𝒪(N²) and require Ω(N²) memory (Nielsen, 2016); these requirements can make them too slow even for medium-sized analyses. Parallel computing can help alleviate the burden by splitting the original data into multiple subsets, each of size n, and then applying the algorithm to each of them in a parallel fashion. The algorithm stops when d subclusters remain instead of continuing the process until only one cluster is left. No specific requirement is needed for d as long as it is larger than the actual number of clusters. The resultant subclusters from each subset are then combined. This split-and-pool procedure can be applied repeatedly until the number of remaining subclusters reaches a manageable scale. Finally, the algorithm is applied to keep merging subclusters until only one cluster is left. This strategy reduces the time complexity to 𝒪((n + d)N). In general, larger n and d improve the clustering performance but slow the algorithm. Our analysis of the SPRINT data has confirmed the method’s practical feasibility in moderate-to-large datasets.
As is true for all statistical methods, clusterMLD has its limitations: (i) we assume that the time trajectories are smooth functions so that they can be adequately depicted by spline models; (ii) although the method is designed to accommodate irregular and sparse observations, the observation times of different subjects are still expected to overlap to some extent so that individual curves can be fitted; (iii) neither the GAPb nor the CH statistic is defined for K = 1, which precludes testing for the existence of heterogeneity within the data; we therefore assume data heterogeneity a priori, an assumption that, when in doubt, could be examined with goodness-of-fit statistics; (iv) because clustering is an exploratory technique rather than an inference procedure, we did not model the correlations among the multivariate outcomes. Incorporating a proper working correlation matrix would no doubt enhance inferential efficiency (Liang and Zeger, 1986), but this gain is likely to come at the expense of an increased numerical burden without substantively improving the clustering performance. Finally, the method as it currently stands does not work for discrete data. To the best of our knowledge, no existing clustering method has been devoted to the analysis of multivariate mixed-type data, although the distance metric defined in the current paper may have the potential to be extended to other data types. In addition, prior knowledge about the data may help with the choice between GAPb and CH in selecting the number of clusters; a more attractive option would be an adaptive strategy that chooses between GAPb and CH during the hierarchical merging of subclusters, but this still requires further research. Notwithstanding these limitations, we put forward a reliable and efficient clustering algorithm that hierarchically merges subclusters of multivariate continuous data observed longitudinally.
ACKNOWLEDGMENTS
The authors thank the Associate Editor and two anonymous referees for their many insightful comments, which have greatly improved the quality of this work. This manuscript was prepared using the SPRINT Research Materials obtained from the NHLBI Biologic Specimen and Data Repository Information Coordinating Center, and the PREDICT-HD data obtained from the PREDICT-HD investigators and coordinators of the Huntington Study Group. The work does not necessarily reflect the opinions or views of the SPRINT research team, the NHLBI, or the Huntington Study Group. This research was partly supported by National Institutes of Health grants R01HL095086, U24AA026969, U54GM115458, and R01NS103475.
Footnotes
All authors declare that they have no conflicts of interest.
References
- Abraham C, Cornillon P-A, Matzner-Løber E, and Molinari N (2003). Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics 30 (3), 581–595.
- Biernacki C, Celeux G, and Govaert G (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (7), 719–725.
- Bouveyron C and Jacques J (2011). Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification 5 (4), 281–300.
- Caliński T and Harabasz J (1974). A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods 3 (1), 1–27.
- Chamroukhi F (2016). Unsupervised learning of regression mixture models with unknown number of components. Journal of Statistical Computation and Simulation 86 (12), 2308–2334.
- Chamroukhi F and Nguyen HD (2019). Model-based clustering and classification of functional data. WIREs Data Mining and Knowledge Discovery 9 (4), e1298.
- Chow GC (1960). Tests of equality between sets of coefficients in two linear regressions. Econometrica 28 (3), 591–605.
- Coffey N, Hinde J, and Holian E (2014). Clustering longitudinal profiles using P-splines and mixed effects models applied to time-course gene expression data. Computational Statistics & Data Analysis 71, 14–29.
- Cohen J (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), 37–46.
- De la Cruz-Mesía R, Quintana FA, and Marshall G (2008). Model-based clustering for longitudinal data. Computational Statistics & Data Analysis 52, 1441–1457.
- Duff K, Paulsen J, Mills L, Beglinger D, Moser M, Smith D, J L, Stout S, Queller S, Harrington D, the PREDICT-HD Investigators, and Coordinators of the Huntington Study Group (2010). Mild cognitive impairment in prediagnosed Huntington disease. Neurology 75, 500–507.
- Ferraty F and Vieu P (2006). Nonparametric Functional Data Analysis. Springer Series in Statistics. Springer, New York.
- Garcia-Escudero LA and Gordaliza A (2005). A proposal for robust curve clustering. Journal of Classification 22 (2), 185–201.
- Genolini C, Alacoque X, Sentenac M, Arnaud C, et al. (2015). kml and kml3d: R packages to cluster longitudinal data. Journal of Statistical Software 65 (4), 1–34.
- Genolini C and Falissard B (2010). KmL: k-means for longitudinal data. Computational Statistics 25 (2), 317–328.
- Genolini C and Falissard B (2011). KmL: A package to cluster longitudinal data. Computer Methods and Programs in Biomedicine 104 (3), e112–e121.
- Giacofci M, Lambert-Lacroix S, Marot G, and Picard F (2013). Wavelet-based clustering for mixed-effects functional models in high dimension. Biometrics 69 (1), 31–40.
- Golden CJ (1978). Stroop Color and Word Test: Cat. No. 30150M; A Manual for Clinical and Experimental Uses. Stoelting.
- Harrington D, Smith M, Zhang Y, Carlozzi N, Paulsen J, and the PREDICT-HD Investigators of the Huntington Study Group (2012). Cognitive domains that predict time to diagnosis in prodromal Huntington disease. Journal of Neurology, Neurosurgery & Psychiatry 83, 612–619.
- Hogarth P, Kayson E, Kieburtz K, Marder K, Oakes D, Rosas D, Shoulson I, Wexler NS, Young AB, Zhao H, et al. (2005). Interrater agreement in the assessment of motor manifestations of Huntington’s disease. Movement Disorders 20 (3), 293–297.
- Hubert L and Arabie P (1985). Comparing partitions. Journal of Classification 2 (1), 193–218.
- Ieva F, Paganoni A, Pigoli D, and Vitelli V (2013). Multivariate functional clustering for the morphological analysis of electrocardiograph curves. Journal of the Royal Statistical Society: Series C (Applied Statistics) 62 (3), 401–418.
- Jacques J and Preda C (2014a). Functional data clustering: a survey. Advances in Data Analysis and Classification 8, 231–255.
- Jacques J and Preda C (2014b). Model-based clustering for multivariate functional data. Computational Statistics & Data Analysis 71, 92–106.
- James GM and Sugar CA (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association 98 (462), 397–408.
- Jones BL, Nagin DS, and Roeder K (2001). A SAS procedure based on mixture models for estimating developmental trajectories. Sociological Methods & Research 29 (3), 374–393.
- Kieburtz K, Penney JB, Como P, Ranen N, Shoulson I, Feigin A, Abwender D, Greenamyre JT, Higgins D, Marshall FJ, et al. (2001). Unified Huntington’s disease rating scale: reliability and consistency. Neurology 11 (2), 136–142.
- Liang K-Y and Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika 73 (1), 13–22.
- Long JD, Paulsen JS, PREDICT-HD Investigators, and Coordinators of the Huntington Study Group (2015). Multivariate prediction of motor diagnosis in Huntington’s disease: 12 years of PREDICT-HD. Movement Disorders 30 (12), 1664–1672.
- Luan Y and Li H (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19 (4), 474–482.
- MacDonald ME, Ambrose CM, Duyao MP, Myers RH, Lin C, Srinidhi L, Barnes G, Taylor SA, James M, Groot N, et al. (1993). A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell 72 (6), 971–983.
- McNicholas PD and Murphy TB (2010). Model-based clustering of longitudinal data. The Canadian Journal of Statistics 38 (1), 153–168.
- McNicholas PD and Subedi S (2012). Clustering gene expression time course data using mixtures of multivariate t-distributions. Journal of Statistical Planning and Inference 142, 1114–1127.
- Nielsen F (2016). Hierarchical clustering. In Introduction to HPC with MPI for Data Science, pp. 195–211. Springer.
- Paulsen J, Smith M, Long J, the PREDICT-HD Investigators, and Coordinators of the Huntington Study Group (2013). Cognitive decline in prodromal Huntington disease: implications for clinical trials. Journal of Neurology, Neurosurgery & Psychiatry 84, 1233–1239.
- Paulsen JS, Long JD, Ross CA, Harrington DL, Erwin CJ, Williams JK, Westervelt HJ, Johnson HJ, Aylward EH, Zhang Y, et al. (2014). Prediction of manifest Huntington’s disease with clinical and imaging measures: a prospective observational study. The Lancet Neurology 13 (12), 1193–1201.
- Paulsen JS, Lourens S, Kieburtz K, and Zhang Y (2019). Sample enrichment for clinical trials to show delay of onset in Huntington disease. Movement Disorders 34 (2), 274–280.
- Peto R and Peto J (1972). Asymptotically efficient rank invariant test procedures. Journal of the Royal Statistical Society: Series A (General) 135 (2), 185–198.
- Ramsay J and Silverman B (2005). Functional Data Analysis, 2nd edn. Springer Series in Statistics. Springer, New York.
- Rand WM (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 (336), 846–850.
- Rice JA (2004). Functional and longitudinal data analysis: perspectives on smoothing. Statistica Sinica, 631–647.
- Rodrigues FB, Quinn L, and Wild EJ (2019). Huntington’s disease clinical trials corner: January 2019. Journal of Huntington’s Disease 8 (1), 115–125.
- Rousseeuw PJ (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65.
- Ruppert D (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11 (4), 735–757.
- Schmutz A, Jacques J, Bouveyron C, Chèze L, and Martin P (2020). Clustering multivariate functional data in group-specific functional subspaces. Computational Statistics 35, 1101–1131.
- Schoenberg IJ (1946). Contributions to the problem of approximation of equidistant data by analytic functions. Part B. On the problem of osculatory interpolation. A second class of analytic approximation formulae. Quarterly of Applied Mathematics 4 (2), 112–141.
- Schumaker L (2007). Spline Functions: Basic Theory (3rd ed.). Cambridge Mathematical Library. Cambridge University Press.
- Schwarz G (1978). Estimating the dimension of a model. Annals of Statistics 6 (2), 461–464.
- Serban N and Wasserman L (2005). CATS: clustering after transformation and smoothing. Journal of the American Statistical Association 100 (471), 990–999.
- Shoulson I and Fahn S (1979). Huntington disease: clinical care and evaluation. Neurology 29 (1), 1–1.
- Smith A (1973). Symbol Digit Modalities Test. Western Psychological Services, Los Angeles.
- Stroop JR (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology 18 (6), 643.
- Tarpey T and Kinateder K (2003). Clustering functional data. Journal of Classification 20 (1), 93–114.
- The SPRINT Research Group (2015). A randomized trial of intensive versus standard blood-pressure control. New England Journal of Medicine 373 (22), 2103–2116.
- Tibshirani R, Walther G, and Hastie T (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2), 411–423.
- Tokushige S, Yadohisa H, and Inada K (2007). Crisp and fuzzy k-means clustering algorithm for multivariate functional data. Computational Statistics 22, 1–16.
- Walker FO (2007). Huntington’s disease. The Lancet 369 (9557), 218–228.
- Ward JH (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58 (301), 236–244.
- Yang M-S, Lai C-Y, and Lin C-Y (2012). A robust EM clustering algorithm for Gaussian mixture models. Pattern Recognition 45 (11), 3950–3961.
- Zhang Y, Long JD, Mills JA, Warner JH, Lu W, Paulsen JS, PREDICT-HD Investigators, and Coordinators of the Huntington Study Group (2011). Indexing disease progression at study entry with individuals at-risk for Huntington disease. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 156 (7), 751–763.
- Zhu X and Qu A (2018). Cluster analysis of longitudinal profiles with subgroups. Electronic Journal of Statistics 12 (1), 171–193.