Abstract
Precision medicine is an important area of research whose goal is to identify the optimal treatment for each individual patient. In the literature, various methods have been proposed to divide the population into subgroups according to the heterogeneous effects of individuals. In this paper, a new exploratory machine learning tool, named latent supervised clustering, is proposed to identify heterogeneous subpopulations. In particular, we formulate the problem as a regression problem with subject-specific coefficients, and use adaptive fusion to cluster the coefficients into subpopulations. This method has two main advantages. First, it relies on little prior knowledge and only weak parametric assumptions on the underlying subpopulation structure. Second, it makes use of the outcome-predictor relationship, and hence can achieve competitive estimation and prediction accuracy. To estimate the parameters, we design a highly efficient accelerated proximal gradient algorithm with a guaranteed competitive convergence rate. Numerical studies show that the proposed method has competitive estimation and prediction accuracy, and can also produce interpretable clustering results for the underlying heterogeneous effects.
Keywords: Accelerated Proximal Gradient Algorithm, Clustering Analysis, Convex Clustering, Machine Learning, Precision Medicine, Subpopulation Identification
1. Introduction
In clinical research, precision medicine aims at developing the optimal treatment for each individual according to the subject’s personal characteristics. The motivation originates from the finding that different groups of patients can respond dramatically differently to the same health care intervention, which can be caused by their specific body mechanisms. Failing to detect a targeted subpopulation can eliminate promising drugs, because their effects are washed out when evaluated over the whole population. In practice, it takes tremendous effort to search for the targeted subpopulation of a given intervention (Brookes et al. (2004); Lagakos (2006)). One important reason is that the crucial features that determine the targeted subpopulation are usually either hidden among the numerous collected ones, or even unmeasured. Therefore, it is desirable to develop methods that can automatically detect such subpopulations.
Recently, various machine learning methods have been introduced and applied to identify subpopulations. In supervised learning, linear regression with two-way interactions has become widely used. However, such a method requires the strong parametric assumption that the underlying heterogeneity can be determined by those interactions (Greenland (2009)). Some nonparametric methods such as random forests are also popular, although the results remain less interpretable in practice (Wager and Athey (2018)). In addition, there are also unsupervised learning methods that have weak parametric assumptions. Clustering analysis is a good representative: it usually detects observation similarities using a pre-defined distance on the covariates. One typical example is the use of hierarchical clustering for heterogeneous gene expression analysis (Perou et al. (2000)). Recently, a new clustering method, convex clustering, was proposed by solving a convex optimization problem with a pairwise fusion penalty (Guo et al., 2010; Hocking et al., 2011; Chi and Lange, 2015). The introduced algorithm tremendously boosts efficiency, especially when applied to large datasets. However, such unsupervised learning tools can produce meaningful and desirable results only when the subpopulations are completely determined by the defined covariate similarities. In practice, subpopulations can also heavily depend on the outcomes, or even on the relationship between outcomes and covariates. This is evident in many clinical studies that aim to split the patient population according to drug effects.
Other than supervised and unsupervised learning, Wei and Kosorok (2013) recently introduced a new type of machine learning tool, named latent supervised learning, to maintain the advantages of both. Their method assumes that each observation comes with an unobserved label, i.e. the latent outcome, which identifies its subpopulation and determines the underlying distribution. Furthermore, they assume that the observed outcome follows a mixture Gaussian distribution with two latent components, which are in turn determined by a linear combination of features. There are follow-up extensions of Wei and Kosorok (2013) in the literature. For instance, Altstein and Li (2013) adapted this idea to time-to-event responses, and Shen and He (2015) suggested a logistic-normal mixture model instead. Although these methods showed competitive performance in detecting subpopulation boundaries, some drawbacks still exist. First, most of the existing methods still rely on prior knowledge of the parametric form of the subpopulation boundaries. In many exploratory studies, such information can be difficult to obtain. Second, many latent supervised learning methods also rely on certain distributional assumptions on the observed outcomes, which may not hold in practice, especially for complex subpopulation structures.
In this paper, we focus on subpopulation detection, and aim to address these two drawbacks of latent supervised learning. In particular, we propose an exploratory tool, named latent supervised clustering, to estimate the heterogeneous effects while simultaneously clustering the samples without prior knowledge of their boundaries. To achieve these two goals, we formulate the problem as regression with subject-specific coefficients, which can be treated as the heterogeneous relationships between the outcomes and covariates of the observed data. Then we cluster such relationships with the adaptive fusion penalty, which is extended from the perspective of convex clustering. The proposed method inherits the advantages of both latent supervised learning and traditional clustering analysis. On one hand, it lets the data orient the learning process so that no assumption is needed on the subpopulation structure. On the other hand, it utilizes the information in both covariates and outcomes to estimate the heterogeneity, so that it can both identify the subpopulations and predict the outcome values. In contrast, regular clustering does not utilize the outcome information.
Clustering such outcome-predictor relationships can be very challenging because they are not observed directly but can only be derived. One important question is how to define a distance properly to encourage such clustering patterns. We adapt the idea of convex clustering. Note that convex clustering formulates the clustering process as a convex minimization problem involving the sum of a loss function and a penalty term, with a tuning parameter balancing the two. The loss term is a sum of Euclidean distances between observations and their corresponding subject-specific centroids. The penalty term is a sum of fusion penalties between each pair of such centroids, encouraging them to merge. In latent supervised clustering, we propose a different loss + penalty form to accommodate the outcome information. For the loss term, we use a pre-defined loss calculated from the observed outcome and its fitted value under a model with subject-specific parameters; a smaller loss value then indicates a better goodness of fit. This model can be either parametric or nonparametric, such as smoothing splines. While we only discuss linear regression in this paper, extensions to classification or nonlinear regression can also be covered by, for example, using a linear function with the deviance loss or using smoothing splines with the quadratic loss. We assume that observations from the same subpopulation have identical values of the subject-specific parameters. To encourage such a pattern in the estimation, we impose an adaptive pairwise fusion penalty on each pair of parameters in the penalty term, with weights determined by the estimated differences of the pairs. In summary, this convex optimization formulation maximizes the overall goodness of fit while minimizing the heterogeneity within each detected cluster.
The main contributions of this paper are as follows. First, we propose a new method, within the supervised machine learning framework, that identifies heterogeneity by clustering the defined outcome-predictor relationships. We borrow the convex clustering idea but design new loss functions and penalty terms to achieve competitive performance and computational efficiency. Second, we design a novel optimization algorithm to solve the underlying convex minimization problem. This algorithm has several distinct features compared to existing proximal gradient methods: (1) it can handle non-smooth loss terms using a smoothing technique as in Nesterov (2005) in a homotopy fashion; (2) it allows inexact computation of the proximal operator of the penalty term; (3) as far as we know, it has the best known convergence rate guarantee among homotopy smoothing first-order methods. We point out that the proposed latent supervised clustering technique covers the problem discussed by Ma and Huang (2016) as a special case; they focused on the case where the subpopulations can be determined merely by a varying intercept term of a linear model.
The remainder of the paper is organized as follows. In Section 2, we introduce the proposed method. In Section 3, we present the proposed accelerated proximal gradient algorithm to solve the optimization problem and show its convergence rate properties. Simulated examples and data applications are presented in Section 4 to demonstrate the performance of the proposed method under finite samples. Some discussions are provided in Section 5. More technical details including statistical learning theory and proofs of theorems, together with additional numerical results, are left in the supplemental materials.
2. Methodology
We use {(xi, yi), i = 1, ⋯, n} to represent our training data, where xi is a p-dimensional covariate vector and yi is its response. We consider the following model:
| yi = f(xi; βi) + εi,  i = 1, ⋯, n, | (1) |
where xi always contains an intercept term as its first element, βi is the coefficient vector of the i-th observation xi, and εi is a noise term with mean zero and bounded variance. We further assume that the xi and εi are mutually independent. To describe the heterogeneity of covariate effects, we allow βi to take different values for different indices i. Our goal is to estimate and cluster these βi’s, and let the clustering results help to provide subpopulation identification. For model (1), we assume that the value of βi is determined by its underlying subpopulation. In other words, if we denote a partition of {1, ⋯, n} by 𝒢1, ⋯, 𝒢K, where K represents the number of subpopulations and is usually unknown in practice, then the values of βi from the same latent subpopulation are supposed to be identical. In this paper, we concentrate on linear models, i.e. f(xi; βi) = xiTβi. When the linear assumption is too strong, one can extend the function to be nonlinear, for example using basis expansions with f(xi; βi) = Σj=1m βij gj(xi), where g1, ⋯, gm are basis functions.
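As a concrete illustration, model (1) with two latent subpopulations can be simulated as follows (a minimal sketch; the group boundary, coefficient values, and sample size are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

# covariates; the first column is the intercept term
X = np.hstack([np.ones((n, 1)), rng.uniform(-2, 2, size=(n, p - 1))])

# two latent subpopulations with group-specific coefficient vectors
group = (X[:, 1] > 0).astype(int)        # illustrative subpopulation boundary
betas = np.array([[1.0, 2.0, -1.0],      # beta shared within subpopulation 0
                  [-1.0, -2.0, 1.0]])    # beta shared within subpopulation 1
eps = rng.normal(0.0, 0.1, size=n)       # mean-zero noise

# model (1): y_i = x_i^T beta_i + eps_i, with beta_i identical within a group
y = np.einsum("ij,ij->i", X, betas[group]) + eps
```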
To estimate and cluster these βi’s, we consider the following optimization problem:
| minβ1,⋯,βn Σi=1n ℓ(yi, f(xi; βi)) + λn Σi&lt;j wij ‖βi − βj‖1, | (2) |
where ℓ represents a pre-selected loss function that measures the goodness of fit. For the loss term, we consider two popular choices in regression: the check loss (Koenker (2005)) used in quantile regression, and the quadratic loss used in least squares estimation. Note that one can easily change the loss to solve classification problems. As for the penalty term, the adaptive pairwise fusion penalty is used to adjust for the potential bias created by the ℓ1-penalty. In practice, we find that wij = aij min{1/‖β̂i − β̂j‖, Bw} can be a good option, where aij indicates whether observation j is among i’s m-nearest neighbors defined by the Euclidean distance, and β̂i is an estimate of βi which can be initialized by the local regression coefficients. The upper bound Bw is added in case some pairs of (β̂i, β̂j) have values too close to each other. Considering the numerous terms in the fusion penalty (O(n2) of them), the m-nearest-neighbors strategy saves tremendous computational time when solving (2) while maintaining competitive performance.
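The weighting scheme can be sketched as follows; since the exact weight formula is not reproduced here, the form wij = aij · min{1/‖β̂i − β̂j‖1, Bw} below is an illustrative assumption built from the ingredients described above (the m-nearest-neighbor indicator aij, the initial estimates β̂i, and the cap Bw):

```python
import numpy as np

def fusion_weights(X, beta_hat, m=10, B_w=100.0):
    """Adaptive m-nearest-neighbor fusion weights (illustrative form:
    w_ij = 1{j in NN_m(i)} * min(1 / ||beta_i - beta_j||_1, B_w))."""
    n = X.shape[0]
    # pairwise Euclidean distances between covariate vectors
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D[i])[1:m + 1]      # m nearest neighbors of i (skip self)
        for j in nn:
            diff = np.abs(beta_hat[i] - beta_hat[j]).sum()
            # cap the weight at B_w when beta_i and beta_j are nearly equal
            W[i, j] = min(1.0 / max(diff, 1e-12), B_w)
    return W
```

Only O(nm) of the O(n²) pairs receive a nonzero weight, which is what makes the nearest-neighbor strategy cheap.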
Note that we can view our model (2) as an exact penalty formulation of the constrained problem of minimizing the loss subject to the equality constraints βi = βj, where the products λnwij can be considered penalty parameters. From the theory of exact penalty methods, the parameters wij are expected to be updated at each iteration of the algorithm until they go beyond a given threshold defined by the ℓ∞-norm of the optimal Lagrange multiplier associated with the constraint βi = βj (see e.g. (Nocedal and Wright, 2006, Chapter 17)).
In practice, one may find prior evidence showing that certain components of the covariates have homogeneous effects on the outcome yi, i.e., their coefficients remain constant for all i. In this case, we recommend imposing the penalty only on the heterogeneous part. Suppose that the collected observation includes (xi, zi, yi), where zi is a q-dimensional covariate vector known to have a homogeneous effect. We rewrite model (1) as yi = f(xi; βi) + h(zi; γ) + εi, where γ is the same for all i, and h is a measurable function that can have a parametric or nonparametric form. Similar to the form of f, we restrict our discussion to linear functions with h(zi; γ) = ziTγ. In this scenario, the optimization problem in (2) becomes
| minβ,γ Σi=1n ℓ(yi, f(xi; βi) + h(zi; γ)) + λn Σi&lt;j wij ‖βi − βj‖1. | (3) |
Moving redundant components from xi to zi can be crucial in saving computational time, especially when the dimension of the covariates is large. Here, we suggest a “forward screening” idea that can help distinguish zi from xi when no prior knowledge is available. We start with a parsimonious model in which xi contains only an intercept term. Then, we move from zi to xi the variable that boosts the model performance the most. This process is repeated until no further improvement can be obtained by moving more variables to xi. More illustrations of this idea are given in the data applications of Section 4. In the following sections, we concentrate on model (3).
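The screening loop can be sketched generically; `fit_and_score` is a hypothetical callback standing in for fitting the latent supervised clustering model and returning a validation score to be maximized:

```python
def forward_screening(variables, fit_and_score):
    """Greedy forward screening: move variables from the homogeneous part z to
    the heterogeneous part x one at a time, keeping a move only if it improves
    the score returned by `fit_and_score(x_vars)` (hypothetical callback)."""
    x_vars = []                      # start with intercept-only heterogeneity
    z_vars = list(variables)
    best = fit_and_score(x_vars)
    improved = True
    while improved and z_vars:
        improved = False
        # try each remaining homogeneous variable as a heterogeneous one
        scores = {v: fit_and_score(x_vars + [v]) for v in z_vars}
        v_best = max(scores, key=scores.get)
        if scores[v_best] > best:    # keep the single best improving move
            best = scores[v_best]
            x_vars.append(v_best)
            z_vars.remove(v_best)
            improved = True
    return x_vars, z_vars
```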
As a special case of latent supervised clustering, Ma and Huang (2016) focused on the setting where the subpopulations can be determined by a subject-specific intercept. They suggested using a concave fusion penalty in the objective function and applied the alternating direction method of multipliers (ADMM) to solve the optimization problem. However, this approach is not suitable for our problem. To apply ADMM to (3), which has a complex regularization term, one needs to introduce many intermediate variables, which significantly increases the problem size and becomes extremely inefficient. In addition, ADMM requires a complicated strategy to tune the penalty parameter of the augmented Lagrangian to obtain good performance, which can be very difficult with complex regularization. In practice, it can be observed from Section 4 that ADMM takes longer to compute and produces suboptimal prediction accuracy.
Our proposed method and algorithm enjoy several advantages. First, our method has significant computational benefits because the problem we solve is convex, and our algorithm does not need to generate the p · n2 additional intermediate parameters that ADMM does. This feature allows our algorithm to scale to relatively high dimensional problems compared to ADMM. In addition, our method has a theoretical convergence rate guarantee on the original model (3), as opposed to a guarantee on the constrained reformulation for ADMM. Second, the quadratic loss suggested by Ma and Huang (2016) can be a suboptimal choice due to its sensitivity to outliers. This can be partly attributed to the fact that least squares estimators have a breakdown point of zero (Huber (2004); Zhao et al. (2018)). That is, if a subject from the first subpopulation is wrongly assigned to the second subpopulation, it can strongly impact the coefficient estimates of the second subpopulation when the quadratic loss is applied. We compare the results of the quadratic loss and the check loss, and find that the latter can significantly improve the model performance. Third, it is not desirable to penalize all pairs (βi, βj) equally. Ideally, large weights should be assigned to pairs from the same subpopulation, while zero weights are used for pairs from different subpopulations.
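The breakdown-point argument above can be seen in one line: with the quadratic loss the location estimate is the mean, which a single misassigned observation can drag arbitrarily far, while with the check loss at τ = 0.5 it is the median, which barely moves:

```python
import numpy as np

# a small sample with one gross outlier, mimicking an observation
# wrongly assigned to this subpopulation
sample = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 100.0])

# quadratic loss -> the minimizer is the mean: pulled far by the outlier
mean_fit = sample.mean()        # 17.5

# check loss with tau = 0.5 -> the minimizer is the median: stays near 1
median_fit = np.median(sample)  # 1.025
```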
The proposed method can also be used to predict the subpopulation and response of a new observation. One can treat the identified cluster labels in the training dataset as the estimated underlying outcome, and fit them on the predictor information with a standard classification model. Some popular choices include discriminant analysis, k-nearest neighbors, and random forests. The fitted classification model can be used to predict which subpopulation a new observation should be assigned to. Then, one can plug in the corresponding estimated βi and γ to predict the response yi.
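This two-step prediction can be sketched with a simple 1-nearest-neighbor classifier (function and variable names are illustrative; any of the classifiers above can replace the nearest-neighbor step):

```python
import numpy as np

def predict_new(X_train, labels, beta_hat, gamma_hat, X_new, Z_new):
    """Two-step prediction sketch: (i) assign each new observation to a
    subpopulation by 1-nearest-neighbor on the training cluster labels,
    (ii) plug in that subpopulation's estimated coefficients."""
    preds = np.empty(len(X_new))
    for i, (x, z) in enumerate(zip(X_new, Z_new)):
        d = np.linalg.norm(X_train - x, axis=1)      # distances to training set
        k = labels[np.argmin(d)]                     # predicted subpopulation
        preds[i] = x @ beta_hat[k] + z @ gamma_hat   # fitted linear model
    return preds
```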
3. Algorithms
In this section, we design a novel accelerated proximal gradient algorithm to solve the proposed optimization problem efficiently, inspired by the fast iterative shrinkage-thresholding algorithm (FISTA, Beck and Teboulle (2009)). To further increase the speed of convergence, we integrate a restarting strategy at each iteration, together with inexact calculation of the proximal operator. Note that FISTA only works when the loss term is smooth. For the nonsmooth check loss, we propose approximating it with a smooth adaptive surrogate function in a way that guarantees the best known convergence rate.
3.1. Model Reformulation
For simplicity, we concatenate the parameters βi and γ of model (3) into one vector ζ ≜ (βT, γT)T, where β ≜ (β1T, ⋯, βnT)T. Then, we rewrite the loss and penalty terms as:
| fn(ζ) ≜ Σi=1n ℓ(yi, xiTβi + ziTγ)  and  Jn(ζ) ≜ λn ‖Dwζ‖1, | (4) |
where Dw is the weight matrix. In this way, the optimization problem (3) can be expressed in the following compact form:
| minζ { ϕn(ζ) ≜ fn(ζ) + Jn(ζ) }, | (5) |
We focus on two loss functions: a quadratic loss and a check loss ℓτ(r) = τrI(r ≥ 0) − (1 − τ)rI(r &lt; 0) = (τ − 0.5)r + 0.5|r|. When ℓ is the quadratic loss, fn has a Lipschitz gradient, i.e., there exists Lfn ≥ 0 such that ‖∇fn(ζ) − ∇fn(ζ′)‖ ≤ Lfn‖ζ − ζ′‖ for all ζ, ζ′. When ℓ is the check loss, (5) is still convex but fully nonsmooth (i.e., both fn and Jn are nonsmooth).
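The two losses, and the equivalence of the two expressions of the check loss, can be verified directly (the 0.5 r² scaling of the quadratic loss below is an illustrative normalization):

```python
import numpy as np

def check_loss(r, tau):
    """Check (pinball) loss: tau*r for r >= 0 and (tau - 1)*r for r < 0,
    equivalently (tau - 0.5)*r + 0.5*|r|."""
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

def quadratic_loss(r):
    return 0.5 * r ** 2

# the two expressions of the check loss agree on a grid of residuals
r = np.linspace(-3, 3, 101)
tau = 0.3
assert np.allclose(check_loss(r, tau), (tau - 0.5) * r + 0.5 * np.abs(r))
```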
3.2. Algorithmic Design and Convergence Properties
We develop the algorithm based on Beck and Teboulle (2009) and Nesterov (2013), while the following steps are new. First, we evaluate the proximal operator (see the definition below) of the penalty Jn using an adaptive fast projected gradient method with a warm start. Second, we incorporate a restart procedure recently studied in Fercoq and Qu (2016) to accelerate the performance of the algorithm. Third, for the check loss, we apply smoothing techniques to approximate it with a surrogate function depending on a parameter that can be adaptively updated within the algorithm. Last, we design a new variant of the adaptive method proposed in Tran-Dinh (2017) to solve (5) that has a convergence rate guarantee without any parameter tuning strategy.
The computational cost of the algorithm consists of two parts. First, we need to evaluate the gradient vector of fn or its smoothed approximation, and evaluate the Lipschitz constant of this gradient mapping. Second, we compute the proximal operator of Jn which is defined as:
| proxsJn(ζ) ≜ argminu { Jn(u) + (1/(2s)) ‖u − ζ‖2 }, | (6) |
for any ζ and s > 0. The main steps of the proposed algorithm for solving (5) are presented in Algorithm 1.
Algorithm 1.
Adaptive fast Proximal Gradient algorithm (APG)
| 1. Choose an arbitrary initial point and desired tolerances ε > 0 and ϵ0 ≥ 0. |
| 2. Evaluate ; Set τ0 = 1, and . |
| 3. If the check loss is used, then input η1 (e.g., ). |
| 4. For t = 0, 1, ⋯, tmax, perform: |
| Step 1: Set Lfn ≔ Lf for the quadratic loss, and for the check loss. Then compute the step-size . |
| Step 2: Compute approximately up to the accuracy ϵt as defined in (7) below. |
| Step 3: If stopping-criterion is satisfied, terminate the algorithm. |
| Step 4: If fn is the quadratic loss, update . If fn is the check loss, update τt+1 as the positive solution of . |
| Step 5: Update the accelerated step . |
| Step 6: If fn is the check loss, then update . |
| Step 7: Perform a restarting step if needed, and update ϵt+1. |
| 5. End of the main loop. |
In summary, when fn is a quadratic loss, we have fn(ζ) = ½‖Aζ − y‖2 for an appropriate design matrix A built from the xi and zi. Then its gradient is ∇fn(ζ) = AT(Aζ − y), which is Lipschitz continuous with Lipschitz constant Lfn = λmax(ATA), where λmax(ATA) denotes the maximum eigenvalue of ATA. When fn is the nonsmooth check loss, we approximate it by a smooth fn(·; η), with details described in Subsection 3.2.2. The next step is to compute the proximal operator proxsJn, which is presented in Subsection 3.2.3.
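The structure of the accelerated scheme in the quadratic-loss case can be sketched as follows, with soft-thresholding (the prox of a plain ℓ1 penalty) standing in for the inexactly computed proximal operator of Jn; the restart and inexactness machinery of Algorithm 1 is omitted in this sketch:

```python
import numpy as np

def fista(A, b, lam, iters=500):
    """Minimal FISTA sketch: minimize 0.5*||A z - b||^2 + lam*||z||_1.
    Soft-thresholding stands in for the (inexactly computed) proximal
    operator of the fusion penalty J_n discussed in the text."""
    L = np.linalg.eigvalsh(A.T @ A).max()     # Lipschitz constant of the gradient
    z = np.zeros(A.shape[1])
    v, t = z.copy(), 1.0
    for _ in range(iters):
        g = A.T @ (A @ v - b)                 # gradient of the quadratic loss at v
        u = v - g / L                         # gradient step
        z_new = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # prox step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        v = z_new + ((t - 1.0) / t_new) * (z_new - z)              # acceleration
        z, t = z_new, t_new
    return z
```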
Next we discuss the convergence guarantee of Algorithm 1. Let ζ* be an optimal solution of (5) with optimal value ϕn(ζ*), i.e. ϕn(ζ) ≥ ϕn(ζ*) for any ζ. We say that ζ(t) is an approximate solution of (5) with accuracy ε ≥ 0 if ϕn(ζ(t)) − ϕn(ζ*) ≤ ε. In (5), we are not able to compute the proximal operator of Jn exactly, but rather approximate it up to a given accuracy ϵt > 0. In particular, we define
and . Then by definition, is the corresponding objective function of the proximal operator problem at Step 2 of Algorithm 1. One can show that , as long as the following condition holds:
| (7) |
Now we provide a general convergence result for Algorithm 1. For the two particular losses, we present the theorems separately in the subsections below. Their technical proofs can be found in the supplemental material.
3.2.1. Convergence for a quadratic loss
If ℓ is a quadratic loss, then the following theorem provides a convergence rate guarantee of Algorithm 1.
Theorem 3.1. Let fn be a quadratic loss, and let {ζ(t)} be a sequence generated by Algorithm 1, where proxsJn is computed approximately with accuracy ϵt ≥ 0 as defined in (7). Then, we have
| (8) |
where . Consequently, for any accuracy ε > 0 and constant c ≥ 1, if the inner accuracy ϵt is chosen to be , then the maximum number of iterations needed to achieve such a ζ(t) does not exceed . Here, ⌊⋅⌋ denotes the floor function.
3.2.2. Smoothing technique for the check loss
The check loss ρτ(r) = τrI(r ≥ 0) − (1 − τ)rI(r &lt; 0) = (τ − 0.5)r + 0.5|r| is convex but nonsmooth. We consider approximating it by a smooth convex function ρτ(⋅; η) with a smoothness parameter η > 0. For any fixed η > 0, the smooth function ρτ(⋅; η) needs to satisfy the following basic properties. First, ρτ(⋅; η) is smooth and convex, and its gradient ∇rρτ(⋅; η) with respect to r is Lipschitz continuous with a Lipschitz constant Lρ depending on η. Second, ρτ(⋅; η) approximates ρτ(⋅) well: there exists a constant Dρ, independent of η, such that ρτ(r; η) ≤ ρτ(r) ≤ ρτ(r; η) + ηDρ for all r. There are several choices for ρτ(⋅; η), such as the following two:
Huber loss: with and .
Logit-type loss: with and .
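Since the exact smoothing constants are not reproduced above, the following is one standard Huber-type choice satisfying the two required properties (with Dρ = 1/4 under this construction): replace the 0.5|r| term of the check loss by its Moreau envelope, which is quadratic on [−η, η] and linear outside:

```python
import numpy as np

def check_loss(r, tau):
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

def smoothed_check_loss(r, tau, eta):
    """One Huber-type smoothing (illustrative constants): replace the
    0.5*|r| term of (tau - 0.5)*r + 0.5*|r| by its Moreau envelope."""
    huber = np.where(np.abs(r) <= eta,
                     r ** 2 / (2.0 * eta),
                     np.abs(r) - eta / 2.0)
    return (tau - 0.5) * r + 0.5 * huber
```

For this choice, 0 ≤ ρτ(r) − ρτ(r; η) ≤ η/4 uniformly in r, so the surrogate converges uniformly to the check loss as η → 0+.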
We now introduce the following properties of ρτ(⋅; η) with the proof placed in the supplemental material.
Lemma 3.2. Consider fn(⋅; η) as a smooth version of the check-loss objective in (4), using either the Huber loss or the logistic loss. Then this function is convex and differentiable, and its gradient ∇ζfn(⋅; η) is Lipschitz continuous with a Lipschitz constant depending on η. Moreover, we have
| fn(ζ; η) ≤ fn(ζ) ≤ fn(ζ; η) + nηDρ, | (9) |
for any and η > 0.
The lemma shows that fn(ζ; η) → fn(ζ) as η → 0+. Accordingly, we approximate problem (5) by
| minζ { ϕn(ζ; η) ≜ fn(ζ; η) + Jn(ζ) }. | (10) |
Our goal is to compute an ε-approximate solution ζ(t) to the true solution ζ* of (5), i.e. one with ϕn(ζ(t)) − ϕn(ζ*) ≤ ε. To this end, we apply the fast proximal gradient method to solve (10) approximately, with a homotopy scheme that decreases the smoothness parameter ηt at each iteration t so that ηt → 0+. The step-size parameter is updated using the unique positive solution τt+1 of the cubic equation in Step 4 of Algorithm 1.
The following theorem indicates that if ζ(t) generated by Algorithm 1 is an approximate solution of (10), then it is also an approximate solution of the original problem (5). The detailed proof is left to the supplemental material.
Theorem 3.3. Let {ζ(t)} be a sequence generated by Algorithm 1, where proxsJn is computed approximately as in (7) with accuracy for some constant c ≥ 1. Then we have
| (11) |
where , Γt = 0 for the Huber loss, and for the logistic loss. Hence, for any accuracy ε > 0, the maximum number of iterations for an approximate solution of (5) does not exceed , where Γε = 0 for the Huber loss, and for the logistic loss.
Remark 1. The worst-case convergence bounds in Theorems 3.1 and 3.3 depend on the maximum eigenvalue discussed above. If this eigenvalue is large, it affects the complexity bound tmax of our methods. To speed up the practical performance, we suggest performing a linesearch routine, as in the linesearch variant given in the supplementary material. Another possibility is to apply a preconditioning technique to the design matrix to reduce its leading eigenvalue before solving the problem.
3.2.3. Evaluating the proximal operator for the penalty
We now describe how to approximately evaluate the proximal operator of Jn under condition (7). We convert problem (6) into its dual, which is defined as
| (12) |
and Dw is defined in (4). To approximate ν*(u), we apply an accelerated projected gradient algorithm, which can be presented in a few lines as follows:
Accelerated projected gradient scheme to approximate proxsJn(u):
Given u, s > 0 and an initial point ν(0). Compute . Set , δ0 ≔ 1 and Γ0 ≔ 0. At each iteration j ≥ 0, we update
;
where ;
, where Γj+1 ≔ Γj + δj+1, and .
Here, πB∞(v) = max{min{v, 1}, −1} is the coordinate-wise projection of v onto the ℓ∞-unit ball B∞ ≔ {v | ‖v‖∞ ≤ 1}. This algorithm is terminated after a pre-specified number of iterations, jmax. The output of this routine is an approximation of proxsJn(u).
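The projection step is what makes the dual approach cheap: projecting onto the ℓ∞-unit ball is simply a coordinate-wise clip,

```python
import numpy as np

def project_linf_ball(v):
    """pi_{B_inf}(v) = max{min{v, 1}, -1}: clip each coordinate to [-1, 1],
    the Euclidean projection onto the l_inf unit ball."""
    return np.clip(v, -1.0, 1.0)
```

so each inner iteration costs only a matrix-vector product plus an elementwise clip.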
Next we analyze the computational effort needed to achieve the approximation point ζ(t+1) in (7). By adapting the results from Tran-Dinh (2017), we have , where and the function is defined in (12). In order to achieve an ϵt-approximate solution ζ(t+1) at the t-th iteration, we require . Hence, the maximum number of iterations jmax is
By exploiting a warm-start strategy that uses the previous approximate point ν(t−1) as ν(0), the initial distance to the solution becomes smaller. We therefore recommend fixing jmax at a small number, such as 50, when computing the approximate ζ(t+1) in the implementation.
3.3. Finding an Initial Point
The convergence properties in Section 3.2 indicate that a good selection of starting values can reduce the number of iterations. Based on the assumption that each variable in X is independent of each variable in Z, we introduce an ad-hoc method that can easily be applied to find a proper starting point in practice. The method consists of two steps:
First, we calculate the distance matrix of the dataset based on (X, y*), where y* is the residual of the linear regression of y on Z. We use y* instead of y since the expectation of y* is exactly Xβ; this holds because a linear regression produces unbiased coefficient estimates when the omitted variables are all independent of those included in the model. In this way, we treat the response y* as a new variable and calculate the distance matrix on (X, y*). For our numerical study, we use the Manhattan distance, i.e. the Minkowski distance (Borg and Groenen (2005)) with α = 1.
Second, for each subject i, we calculate an initial estimate β̂i based on the k-nearest neighbors of the i-th subject, using a linear regression model with the selected loss. The k-nearest-neighbor set of the i-th subject is defined as the k observations having the smallest distance d(i, j) to the i-th subject among those whose responses lie within an ϵ-neighborhood of yi. The second criterion increases the chance that those neighbors come from the same latent group as the i-th subject. The selection of the neighborhood radius ϵ depends on the variation of the noise. In our numerical examples, we vary ϵ from 0.5 to 6.
As a remark, we note that the weights need to be updated using the current estimate of the parameters. To simplify the computation, instead of solving each convex problem to global optimality before updating the weights, in the early iterations of Algorithm 1 we use a one-iteration approximation of the true solution, which yields a suboptimal solution. Although this suboptimal solution may not be very accurate, it serves as a rough approximation of the true solution under the current weights at the cost of only one iteration of the whole optimization routine, which greatly speeds up the computation. Once the weights stabilize, our algorithm automatically converges to the true global solution under the corresponding stable weights.
4. Numerical Analysis
In this section, we use numerical examples to test the performance of latent supervised clustering. We study the estimation accuracy, runtime, and prediction performance using simulated and real data. For comparison, we also implement the concave method of Ma and Huang (2016) with a general heterogeneous-effect vector xi (Concave), and its extension (Concave-2, Ma et al. (2018)). We additionally include two standard methods: linear regression with all two-way interactions and random forests. We pick the tuning parameters by selecting λn from {2−3, 2−2, 2−1, 20}, setting the number of neighbors to m = 10, and choosing the radius of the response neighborhood ϵ from {2−1, 20, 2, 22}.
4.1. Simulations
We evaluate the model using eight simulated examples with linear outcome-predictor relationships. In particular, Examples 1–2 include two subpopulations with linear boundaries. Example 3 extends the simulation in Wei and Kosorok (2013) by adding more covariates to xi. Example 4 has a higher dimensional zi with noise variables. Examples 5, 6 and 8 introduce nonlinear subpopulation boundaries with noise variables whose coefficients are zero. Example 7 covers the nonlinear boundary case where the number of underlying subgroups increases to 5.
For each example, we generate each component of xi and zi from an independent continuous uniform U(−2, 2), and the random noise εi from N(0, .1). We use a tuning dataset and choose the λn that minimizes the tuning error. To calculate the validation error, we consider the estimated parameters for all detected subpopulations from the training set, and choose the one with the smallest mean squared error as the validation prediction.
Once the optimal tuning parameters are found, we generate a test dataset ten times as large as the training set to assess the prediction performance. We divide prediction into two steps. First, we treat the cluster labels of the training set as an underlying outcome, and fit them on all the covariates with a classifier. In the simulation studies, we choose k-nearest neighbors for most of the cases and use kernel discriminant analysis as an alternative when xi is high dimensional with noise variables. Second, for each observation in the testing set, we use the fitted classifier to decide which cluster it belongs to, and then plug in the corresponding estimated coefficients to make predictions.
For the accuracy comparison, we report the averages and standard deviations of the mean squared errors. In addition, we also compare the estimated number of subpopulations. For the time comparison, we present the average run time (in seconds) of each method for all examples. The simulations are repeated 50 times, with the details listed below.
Example 1: Univariate Linear Regression with Two Subgroups.
Suppose the underlying true model is linear with:
where γ = (1, −5, 2, 1, −3, 1, 3, 2, −4)T and the random noise εi ~ N(0, .1).
Example 2: Three-Dimensional xi with Noisy Variables in zi.
The true model is:
where γ = (1, −5, 3, 2, 1, 0, 0, 0, 0)T and the random noise εi ~ N(0, .1).
Example 3: Higher xi Dimension with Noisy Variables.
Let xi contain 25 variables, with the subpopulation determined by the first five of them. The true model is:
where γ = (1, −5, 3, 2, 1, 0, 0, 0, 0)T and the random noise εi ~ N(0, .1).
Example 4: Higher zi Dimension with Three Subgroups.
Consider a model that has 50 homogeneous variables zi and three latent subpopulations as:
where 0_45 represents a 45-dimensional zero vector and εi ~ N(0, .1).
Example 5: Nonlinear Subgroup Boundary with Noisy Variables in xi.
Consider a model in which xi has 6 variables, 2 of which construct a nonlinear subpopulation boundary:
where γ = (1, −5, 3, 2, 1)T and the random noise εi ~ N(0, .1).
Example 6: Complex Subgroup Boundary with Noisy Variables in xi.
Consider a situation that has a complex nonlinear subpopulation boundary:
where γ = (1, −5, 3, 2, 1)T and the random noise εi ~ N(0, .1).
Example 7: Nonlinear Subgroup Boundaries with Five Subgroups.
There are five subpopulations determined by the cut points (b0, b1, b2, b3, b4, b5) = (−∞, 2, 3.5, 5, 7, ∞), with γ = (1, −5, 3, 2, 1)T and εi ~ N(0, .1).
Example 8: Complex Subgroup Boundaries with Three Subgroups and Noisy Variables.
Consider a model having eleven xi’s and three subpopulations:
where γ = (1, −5, 3, 2, 1)T and the random noise εi ~ N(0, .1).
The simulation results are reported in Table 1. From Table 1, we can see that latent supervised clustering (LSC) with the check loss almost always produces the best RMSE results for both sets of estimated coefficients. The Concave method maintains relatively small errors but suffers from large variability. The Concave-2 method of Ma et al. (2018), which aims at estimating the subgroup boundary directly, achieves competitive estimation accuracy when the subgroup boundary is simple, as in Examples 1–4. When the underlying subpopulation structure becomes more complex, as in Examples 5 and 6, the advantage of LSC-check becomes clearer due to its robustness to the noise caused by observations clustered into the wrong subpopulation. This demonstrates the advantage of using a more robust loss as well as the adaptive fusion penalty. When there are more than two subpopulations with even more complex boundaries, as in Examples 7 and 8, none of the methods produces satisfactory estimation, although LSC-check still outperforms the others. This indicates that one should consider data cleaning or variable selection before fitting LSC in practice when the covariates have complex relationships. As to the estimated numbers of subpopulations, the two LSC methods perform better than Concave, while all methods tend to overestimate this number except in the K = 5 case of Example 7.
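For reference, the check loss underpinning LSC-check is the quantile loss of Koenker (2005); a minimal sketch at τ = 0.5 shows why it downweights the large residuals produced by misclustered observations relative to the quadratic loss:

```python
import numpy as np

def check_loss(u, tau=0.5):
    """Koenker's check (quantile) loss: rho_tau(u) = u * (tau - 1{u < 0}).
    At tau = 0.5 it is half the absolute loss, so large residuals are
    penalized linearly rather than quadratically."""
    return u * (tau - (u < 0).astype(float))

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
robust = check_loss(u)        # 1.0, 0.25, 0.0, 0.25, 1.0
quadratic = 0.5 * u ** 2      # 2.0, 0.125, 0.0, 0.125, 2.0 -- grows faster in the tails
```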
Table 1:
Simulation results: the averages and standard deviations of the mean squared errors, with the best results in bold. The K columns provide the detected numbers of clusters. Concave is the method with the concave fusion penalty, Concave-2 refers to Ma et al. (2018), and LSC-quad and LSC-check represent latent supervised clustering with the quadratic and check losses, respectively.
| Case | Concave MSE(β̂) | MSE(γ̂) | K | Concave-2 MSE(β̂) | MSE(γ̂) | K | LSC-quad MSE(β̂) | MSE(γ̂) | K | LSC-check MSE(β̂) | MSE(γ̂) | K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.261 (0.153) | 0.034 (0.054) | 3.04 (0.68) | 0.214 (0.052) | 0.010 (0.004) | 2.02 (0.14) | 0.195 (0.061) | 0.011 (0.008) | 2.1 (0.48) | 0.185 (0.033) | 0.010 (0.006) | 2.17 (0.37) |
| 2 | 0.349 (0.116) | 0.123 (0.114) | 3.15 (0.61) | 0.236 (0.035) | 0.017 (0.005) | 2.46 (0.50) | 0.282 (0.133) | 0.012 (0.003) | 2.27 (0.52) | 0.225 (0.092) | 0.006 (0.002) | 2.21 (0.45) |
| 3 | 0.408 (0.22) | 0.115 (0.04) | 2.32 (0.68) | 0.111 (0.002) | 0.016 (0.007) | 1.87 (0.42) | 0.315 (0.180) | 0.032 (0.024) | 2.21 (0.49) | 0.177 (0.122) | 0.023 (0.018) | 2.11 (0.41) |
| 4 | 0.421 (0.035) | 0.099 (0.012) | 2.67 (0.89) | 0.381 (0.12) | 0.012 (0.003) | 2.62 (0.32) | 0.285 (0.041) | 0.010 (0.004) | 3.34 (0.73) | 0.273 (0.03) | 0.012 (0.007) | 3.12 (0.38) |
| 5 | 0.434 (0.058) | 0.254 (0.140) | 2.32 (0.36) | 0.359 (0.104) | 0.076 (0.040) | 2.61 (0.84) | 0.332 (0.70) | 0.009 (0.001) | 2.14 (0.36) | 0.285 (0.041) | 0.006 (0.001) | 2.1 (0.27) |
| 6 | 0.616 (0.161) | 0.091 (0.023) | 2.73 (0.52) | 0.654 (0.274) | 0.293 (0.189) | 2.89 (0.92) | 0.348 (0.152) | 0.078 (0.001) | 2.15 (0.22) | 0.386 (0.134) | 0.040 (0.001) | 2.13 (0.18) |
| 7 | 2.345 (0.251) | 0.148 (0.011) | 6.25 (1.54) | 2.129 (0.047) | 0.083 (0.010) | 6.40 (0.89) | 1.699 (0.271) | 0.111 (0.098) | 6.17 (1.06) | 1.896 (0.157) | 0.084 (0.044) | 6.23 (0.36) |
| 8 | 2.134 (0.360) | 0.230 (0.184) | 5.53 (0.82) | 2.159 (0.062) | 0.257 (0.038) | 4.67 (0.89) | 1.349 (0.236) | 0.219 (0.072) | 4.47 (0.70) | 1.341 (0.262) | 0.129 (0.032) | 4.32 (0.59) |
Table 2 reports the average runtime each selected method takes to return results for a given set of tuning parameters. Latent supervised clustering is faster than the concave penalty method with ADMM when the sample size n and dimension p are large. This is because the concave penalty method takes more iterations to converge, and ADMM creates O(n2p) intermediate parameters. Concave-2 is fastest in Example 3 because the boundary there is easy to estimate directly, so its algorithm converges much faster. In addition, the suggested specification of the weight vector w in LSC results in a sparse penalty coefficient matrix. Between the two LSC methods, LSC-quad is faster than LSC-check because the smoothed check loss slows the convergence of the proposed algorithm, as shown in the previous section.
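To see where that sparsity can come from, a common specification in the convex clustering literature (Chi and Lange, 2015) keeps only k-nearest-neighbor pairs with Gaussian-kernel weights. Whether this matches the paper's exact choice of w is an assumption, but it illustrates how the penalty coefficient matrix becomes sparse:

```python
import numpy as np

def knn_gaussian_weights(X, k=5, phi=0.5):
    """Sparse pairwise fusion weights, a common convex-clustering choice:
    w_ij = exp(-phi * ||x_i - x_j||^2) if j is among the k nearest
    neighbors of i (or vice versa), and 0 otherwise."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]   # position 0 is the point itself
        W[i, nn] = np.exp(-phi * d2[i, nn])
    return np.maximum(W, W.T)             # symmetrize

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(100, 3))
W = knn_gaussian_weights(X, k=5)
sparsity = np.mean(W > 0)                 # far fewer than n^2 nonzero pairs
```

With k = 5, each row has at most 2k nonzero entries after symmetrization, so the number of active fusion terms grows linearly rather than quadratically in n.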
Table 2:
Simulation runtime comparison: average runtime (in seconds) of the selected methods for each set of the tuning parameters with the best results in bold.
| # | n | p | q | K | Concave | Concave-2 | LSC-quad | LSC-check |
|---|---|---|---|---|---|---|---|---|
| 1 | 300 | 2 | 9 | 2 | 18.5 | 15.8 | 11.0 | 12.1 |
| 2 | 300 | 3 | 9 | 2 | 692 | 91.7 | 71.1 | 107.3 |
| 3 | 300 | 24 | 5 | 2 | 16954 | 1054.1 | 2364.1 | 2433.7 |
| 4 | 300 | 4 | 50 | 3 | 793.8 | 771.7 | 161.5 | 186.5 |
| 5 | 300 | 6 | 5 | 2 | 2225.1 | 1793.5 | 458.5 | 537.1 |
| 6 | 300 | 6 | 5 | 2 | 2244.5 | 6320.7 | 436.7 | 507.4 |
| 7 | 600 | 6 | 5 | 5 | 9513 | 7078.6 | 2666.6 | 3373.1 |
| 8 | 600 | 11 | 5 | 3 | 40514.5 | 35540.8 | 11116.2 | 14739.0 |
To make predictions for the test datasets, we use k-nearest neighbors with k = 10 to predict the underlying labels, except for Examples 3, 7, and 8, where the covariates are noisy; there we use kernel discriminant analysis with parameters selected on the tuning sets.
We report the mean squared prediction errors in Table 3. From the results, LSC with the check loss achieves the smallest prediction errors. For Examples 1 and 2, all the methods produce satisfactory results. Linear regression with two-way interactions performs as well as LSC in Example 3 because the underlying boundary fits the interaction assumption exactly. When the underlying boundaries become nonlinear, as in Examples 5–8, neither linear regression with interactions nor random forests can produce reliable predictions. The performance of LSC is significantly better than the concave penalty methods in terms of both the average prediction error and its variability. Similar to the estimation accuracy results, none of the methods produces good predictions when the latent subpopulation structure becomes too complicated, as in Examples 7 and 8.
Table 3:
Simulation prediction accuracy: the mean squared prediction errors and standard deviations of the selected methods with the best results in bold. Reg-Int means linear regression with two-way interactions of xi’s, and RF represents random forests.
| Example | Concave | LSC-quad | LSC-check | Reg-Int | RF |
|---|---|---|---|---|---|
| 1 | 0.284 (0.05) | 0.083 (0.004) | 0.057 (0.003) | 0.388 (0.025) | 0.499 (0.081) |
| 2 | 1.142 (0.155) | 0.644 (0.093) | 0.486 (0.085) | 2.897 (0.13) | 1.513 (0.126) |
| 3 | 10.531 (0.308) | 6.936 (0.202) | 3.481 (0.166) | 3.599 (0.113) | 9.759 (1.505) |
| 4 | 2.838 (0.243) | 0.935 (0.105) | 0.776 (0.106) | 1.506 (0.074) | 7.469 (1.21) |
| 5 | 12.453 (1.893) | 10.379 (1.778) | 9.985 (1.759) | 30.325 (1.017) | 46.881 (2.835) |
| 6 | 14.176 (2.073) | 11.226 (1.295) | 10.305 (1.26) | 122.594 (12.372) | 24.77 (3.999) |
| 7 | 14.639 (1.954) | 12.379 (0.91) | 9.75 (0.851) | 20.829 (0.675) | 16.787 (2.041) |
| 8 | 32.274 (2.797) | 21.748 (1.495) | 16.92 (1.351) | 67.239 (1.397) | 42.693 (2.894) |
To better illustrate how the value of λn affects the number of detected clusters, we draw the solution paths for both the quadratic and check loss functions in Example 1 in Figure 1. The solution path for the heterogeneous effect coefficients follows the same idea as the dendrogram in agglomerative hierarchical clustering (Hastie et al., 2009). It is calculated with a “bottom up” approach, starting from a small λn that does not show any clustering pattern. In Figure 1, the solution paths for the two loss functions look similar to each other. When λn reaches around 1.5, the coefficient estimates merge into two distinct values around −1 and 1, i.e. the underlying true values. As λn grows to a larger value such as 4, the estimates eventually converge to a single point close to zero.
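To convey the mechanics behind such a path, the following toy (not the paper's algorithm) fuses 40 one-dimensional subject-wise estimates by gradient descent on a smoothed fusion penalty, warm-starting each λ from the previous solution and counting the distinct merged values:

```python
import numpy as np

def fusion_path_1d(b, lambdas, eps=1e-2, lr=1e-3, iters=3000):
    """Trace a toy solution path: minimize
        sum_i (beta_i - b_i)^2 + lam * sum_{i<j} sqrt((beta_i - beta_j)^2 + eps)
    (a smoothed surrogate of the fusion penalty) by gradient descent,
    warm-starting each lambda from the previous solution."""
    beta, path = b.copy(), []
    for lam in lambdas:
        for _ in range(iters):
            diff = beta[:, None] - beta[None, :]
            grad = 2 * (beta - b) + lam * np.sum(diff / np.sqrt(diff ** 2 + eps), axis=1)
            beta = beta - lr * grad
        path.append(beta.copy())
    return path

def count_groups(beta, tol=0.05):
    """Count clusters as runs of sorted coefficients with gaps below tol."""
    return 1 + int(np.sum(np.diff(np.sort(beta)) > tol))

# Subject-wise estimates scattered around two true values, -1 and 1
rng = np.random.default_rng(3)
b = np.concatenate([rng.normal(-1, 0.1, 20), rng.normal(1, 0.1, 20)])
path = fusion_path_1d(b, [0.0, 0.05, 0.5])
n_clusters = [count_groups(beta) for beta in path]   # shrinks toward 1 as lam grows
```

As λ increases, the estimates first fuse within the two groups near ±1 and finally merge into a single value near zero, mirroring the behavior seen in Figure 1.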
Figure 1:

Solution paths for the estimated heterogeneous coefficients against λn in Example 1, with the selected λn shown by dashed lines.
Following a reviewer’s suggestion, we have performed additional simulation studies to investigate the performance of our “forward screening” idea and the effect of correlations among predictors. The results show that our proposed forward screening is effective in identifying variable roles. When the predictors are correlated, our algorithms take more iterations to converge as the correlation increases. This matches with our theoretical analysis. More details can be found in the Supplementary Material.
4.2. Real Data Applications
In this subsection, we apply the proposed method to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data (Kueper et al., 2018) and the Pima Indian Diabetes data (Lichman, 2013) to evaluate its performance. We use the same tuning parameter ranges as in the simulation examples, and 5-fold cross-validation with 50 replications for tuning and prediction. To predict the cluster labels of observations in the validation sets, we use k-nearest neighbors with k = 10. For both the proposed methods and the method with the concave penalty, we follow the “forward screening” idea suggested in Section 2 to separate xi from zi. In particular, we start with a parsimonious model that has only an intercept term in xi and all the covariates in zi. We then move variables from zi to xi one at a time, choosing the one that improves the validation prediction accuracy the most, and stop when the validation prediction no longer improves. To save time in practice, one may conduct the screening with a fixed, reasonable set of tuning parameters. We report the mean squared prediction errors of all the methods for both the training and validation sets, and then briefly describe the pattern of the detected subpopulations.
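The screening loop just described can be sketched in a few lines. Here `val_error` stands in for refitting the model with a candidate heterogeneous set and returning its validation error; the toy error surface below is purely illustrative, not fitted to any data:

```python
import numpy as np

def forward_screen(candidates, val_error):
    """Greedy forward screening: starting from an empty heterogeneous set
    (intercept-only x), repeatedly move the variable that most improves the
    validation error from z to x; stop when no move improves it.
    `val_error(selected)` is assumed to refit with `selected` in x."""
    selected, best = [], val_error([])
    while True:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        trials = {c: val_error(selected + [c]) for c in remaining}
        c_best = min(trials, key=trials.get)
        if trials[c_best] >= best:       # no improvement: stop screening
            break
        selected.append(c_best)
        best = trials[c_best]
    return selected, best

# Toy error surface: variables 0 and 2 are truly heterogeneous, so the error
# drops only when they are included; every other variable adds a small penalty.
def toy_val_error(sel):
    noise = len([c for c in sel if c not in (0, 2)])
    return 1.0 - 0.4 * (0 in sel) - 0.3 * (2 in sel) + 0.05 * noise

sel, err = forward_screen(list(range(5)), toy_val_error)   # picks [0, 2]
```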
4.2.1. Alzheimer’s Disease Neuroimaging Initiative (ADNI)
The dataset records the structural magnetic resonance imaging (MRI) scans of 226 normal controls, 393 patients with mild cognitive impairment (MCI), and 186 patients with Alzheimer’s disease (AD). It uses the Alzheimer’s Disease Assessment Scale–Cognitive Subscale (ADAS-Cog), a score ranging from 0 to 150, to assess the level of cognitive dysfunction in Alzheimer’s disease. The goal is to use 93 manually labeled regions of interest (ROIs) as covariates to predict the ADAS-Cog score and simultaneously detect the three subgroups (i.e., normal, MCI, and AD).
We first reduce the covariate dimension to 20 by recursive feature elimination (Hastie et al., 2009), and sort the remaining covariates using the “forward screening” idea. The indices of the first five ROIs are 42, 82, 43, 85, and 30. Fitting regressions by subgroup, we find the coefficients of these five ROIs to be quite different across the three subgroups, so it is reasonable to assume that the ROIs have heterogeneous effects on ADAS-Cog. Figure 2 shows that the best prediction accuracy is achieved when ROIs 42 and 82 are included. The columns LSC-quad and Concave represent latent supervised clustering with the quadratic loss and the method with the concave fusion penalty; both perform worse than LSC-check, with larger prediction errors and variability. Reg and RF denote linear regression with two-way interactions and random forests; these two commonly used methods fail to achieve satisfactory prediction results for this problem.
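A minimal regression analogue of recursive feature elimination, used here only to illustrate the dimension-reduction step; the ranking criterion (standardized OLS coefficients) is an assumption, not the authors' exact setup:

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Bare-bones recursive feature elimination for linear regression:
    repeatedly fit OLS on the surviving columns and drop the feature with
    the smallest standardized coefficient, until n_keep remain."""
    idx = list(range(X.shape[1]))
    while len(idx) > n_keep:
        Xs = X[:, idx]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        scaled = np.abs(coef) * Xs.std(axis=0)   # scale-free importance
        idx.pop(int(np.argmin(scaled)))
    return idx

# Toy check: only the first three of ten columns drive the outcome
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 0.1, 300)
kept = rfe_linear(X, y, n_keep=3)                # recovers columns 0, 1, 2
```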
Figure 2:

Alzheimer’s Disease: mean squared prediction errors. LSC-c1–LSC-c4 denote latent supervised clustering with the check loss including the corresponding number of ROIs in the heterogeneous vector via the “forward screening” idea; LSC-q is LSC with the quadratic loss; Concave represents the method with the concave fusion penalty; Reg means linear regression with two-way interactions; and RF represents random forests.
We are also interested in comparing the detected subgroups with the true diagnosed labels. Figure 3 presents the K-means clustering results of the estimated β’s, and the validation performance of linear regression and LSC with predicted subgroups as well as the true subgroup labels. One can clearly see that K-means suggests the correct number of clusters according to the within-group sum of squares. The detected subgroups match the ground truth well (around 60% accuracy), except for those lying around the boundary between normal and MCI, a case that has proven difficult in the literature.
Figure 3:

Alzheimer’s Disease: within-cluster variation plot and scatter plots of the observed and predicted outcome for the validation set in one realization of CV.
4.2.2. Pima Indian Diabetes
The Pima Indian Diabetes dataset contains records of 768 females of Pima Indian heritage who are at least 21 years old. The dataset has 8 attributes and a class variable indicating whether the subject tested positive for diabetes. Among the 8 attributes, the 2-hour serum insulin is considered a proper surrogate outcome for the binary indicator of the underlying diabetes test. Therefore, we fit the 2-hour serum insulin using all the other attributes, excluding the diabetes test binary variable, and remove all rows that contain missing values. Similar to the previous example, we use 5-fold cross-validation and fit the selected methods on the training sets. We also apply the “forward screening” idea; the selection order is diabetes pedigree function (1), diastolic blood pressure (2), body mass index (3), age (4), triceps skin fold thickness (5), and plasma glucose concentration (6). The mean squared prediction errors of all the methods are presented in Figure 4. The proposed method with the check loss and three variables in xi achieves the best prediction performance. LSC-c4 and LSC-c5 have competitive mean squared errors, though with slightly larger variances. LSC-c6 suffers from overfitting, with a prediction error larger than that of LSC with the quadratic loss. Linear regression with interactions and random forests produce the worst prediction errors among the compared methods. In addition, because the data contain no ground-truth subgroup labels, we conduct a χ2 test between the detected subgroup labels and the underlying diagnosis variable to roughly evaluate how informative the labels are. The proposed methods always suggest two latent subgroups, and the detected labels show a significant relationship with the underlying diabetes test indicator according to the χ2 test: the median of the p-values is 0.031, suggesting that the identified subgroups are reasonable.
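The χ2 evaluation above needs only a contingency table; a hand-rolled Pearson statistic is sketched below, with an illustrative 70% agreement rate between detected labels and diagnoses (synthetic labels, not the real data):

```python
import numpy as np

def chi2_stat(labels_a, labels_b):
    """Pearson chi-squared statistic for the contingency table of two label
    vectors. For a 2x2 table (1 degree of freedom), values above 3.841 are
    significant at the 5% level."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)                      # cross-tabulate counts
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
    return np.sum((table - expected) ** 2 / expected)

# Illustrative labels: detected subgroup agrees with diagnosis 70% of the time
rng = np.random.default_rng(5)
diag = rng.integers(0, 2, 500)
detected = np.where(rng.random(500) < 0.7, diag, 1 - diag)
stat = chi2_stat(detected, diag)                     # well above 3.841
```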
Figure 4:

Pima Indian Diabetes: mean squared prediction errors. LSC-c1–LSC-c6 denote latent supervised clustering with the check loss including the corresponding number of variables in xi via the “forward screening” idea.
5. Discussion
In this paper, we propose a novel machine learning method that clusters the underlying subpopulation structure based on the heterogeneous relationship between the outcome and covariates. Although we mainly focus on a linear relationship between the outcome and covariates, the proposed method can be a very useful exploratory tool in practice due to its weak assumptions on the underlying subpopulation structure. We develop an efficient algorithm with competitive convergence rates to solve the optimization problem, and also discuss the statistical consistency properties of the coefficient estimators. In numerical studies, the proposed method demonstrates strong performance in both subpopulation detection and outcome prediction. One interesting future direction is to extend the proposed method to accommodate various other types of outcomes.
Supplementary Material
Acknowledgments
The authors are indebted to the editor, the associate editor, and two reviewers, whose helpful suggestions led to a much improved presentation. This research was supported in part by NSF grants IIS1632951, DMS1821231, and NIH grants R01GM126550 and P01CA142538.
References
- Altstein L and Li G (2013). Latent subgroup analysis of a randomized clinical trial through a semiparametric accelerated failure time mixture model. Biometrics 69(1), 52–61.
- Beck A and Teboulle M (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences 2(1), 183–202.
- Borg I and Groenen PJF (2005). Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media.
- Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, and Peters TJ (2004). Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of Clinical Epidemiology 57(3), 229–236.
- Chi EC and Lange K (2015). Splitting Methods for Convex Clustering. Journal of Computational and Graphical Statistics 24(4), 994–1013.
- Fercoq O and Qu Z (2016). Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358.
- Greenland S (2009). Interactions in epidemiology: relevance, identification, and estimation. Epidemiology 20(1), 14–17.
- Guo J, Levina E, Michailidis G, and Zhu J (2010). Pairwise variable selection for high-dimensional model-based clustering. Biometrics 66(3), 793–804.
- Hastie T, Tibshirani R, and Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Science & Business Media.
- Hocking TD, Joulin A, and Bach F (2011). Clusterpath: an algorithm for clustering using convex fusion penalties. Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA.
- Huber PJ (2004). Robust Statistics. John Wiley & Sons.
- Koenker R (2005). Quantile Regression. Cambridge University Press.
- Kueper JK, Speechley M, and Montero-Odasso M (2018). The Alzheimer’s Disease Assessment Scale–Cognitive Subscale (ADAS-Cog): modifications and responsiveness in pre-dementia populations. A narrative review. Journal of Alzheimer’s Disease 63(2), 423–444.
- Lagakos SW (2006). The challenge of subgroup analyses: reporting without distorting. The New England Journal of Medicine 354(16), 1667–1669.
- Lichman M (2013). UCI Machine Learning Repository.
- Ma S and Huang J (2016). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association 112(517), 1–42.
- Ma S, Huang J, and Zhang Z (2018). Exploration of heterogeneous treatment effects via concave fusion. arXiv preprint arXiv:1607.03717.
- Nesterov Y (2005). Smooth minimization of non-smooth functions. Mathematical Programming 103(1), 127–152.
- Nesterov Y (2013). Gradient methods for minimizing composite objective function. Mathematical Programming 140(1), 125–161.
- Nocedal J and Wright S (2006). Numerical Optimization. Springer.
- Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge Ø, Pergamenschikov A, Williams C, Zhu SX, Lønning PE, Børresen-Dale AL, Brown PO, and Botstein D (2000). Molecular portraits of human breast tumours. Nature 406(6797), 747–752.
- Shen J and He X (2015). Inference for Subgroup Analysis With a Structured Logistic-Normal Mixture Model. Journal of the American Statistical Association 110(509), 303–312.
- Tran-Dinh Q (2017). Adaptive smoothing algorithms for nonsmooth composite convex minimization. Computational Optimization and Applications 66(3), 425–451.
- Wager S and Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523), 1228–1242.
- Wei S and Kosorok MR (2013). Latent Supervised Learning. Journal of the American Statistical Association 108(503), 957–970.
- Zhao J, Yu G, and Liu Y (2018). Angle breakdown point for classification. Annals of Statistics 46(6B), 3362–3389.