Abstract
Precision medicine is an important area of research whose goal is to identify the optimal treatment for each individual patient. In the literature, various methods have been proposed to divide the population into subgroups according to the heterogeneous effects of individuals. In this paper, a new exploratory machine learning tool, named latent supervised clustering, is proposed to identify heterogeneous subpopulations. In particular, we formulate the problem as a regression problem with subject-specific coefficients, and use adaptive fusion to cluster the coefficients into subpopulations. This method has two main advantages. First, it relies on little prior knowledge and only weak parametric assumptions on the underlying subpopulation structure. Second, it makes use of the outcome-predictor relationship, and hence can achieve competitive estimation and prediction accuracy. To estimate the parameters, we design a highly efficient accelerated proximal gradient algorithm with a guaranteed competitive convergence rate. Numerical studies show that the proposed method has competitive estimation and prediction accuracy, and can also produce interpretable clustering results for the underlying heterogeneous effects.
Keywords: Accelerated Proximal Gradient Algorithm, Clustering Analysis, Convex Clustering, Machine Learning, Precision Medicine, Subpopulation Identification
1. Introduction
In clinical research, precision medicine aims at developing the optimal treatment for each individual according to the subject’s personal characteristics. The motivation originates from the finding that different groups of patients can respond dramatically differently to the same health care intervention, which can be caused by their specific body mechanisms. Failing to detect a targeted subpopulation can eliminate promising drugs, because their effects are washed out when evaluated over the whole population. In practice, it takes tremendous effort to search for the targeted subpopulation of a given intervention (Brookes et al. (2004); Lagakos (2006)). One important reason is that the crucial features that determine the targeted subpopulation are usually either hidden among the numerous collected ones, or even unmeasured. Therefore, it is desirable to develop methods that can automatically detect such subpopulations.
Recently, various machine learning methods have been introduced and applied to identify subpopulations. In supervised learning, linear regression with two-way interactions has become widely used. However, such a method requires the strong parametric assumption that the underlying heterogeneity can be determined by those interactions (Greenland (2009)). Some nonparametric methods such as random forests are also popular, although the results remain less interpretable in practice (Wager and Athey (2018)). In addition, there are also unsupervised learning methods that have weak parametric assumptions. Clustering analysis is a good representative: it usually detects observation similarities using a pre-defined distance on the covariates. One typical example is the use of hierarchical clustering for heterogeneous gene expression analysis (Perou et al. (2000)). Recently, a new clustering method, convex clustering, was proposed by solving a convex optimization problem with a pairwise fusion penalty (Guo et al., 2010; Hocking et al., 2011; Chi and Lange, 2015). The introduced algorithm tremendously boosts efficiency, especially when applied to large datasets. However, such unsupervised learning tools can produce meaningful and desirable results only when the subpopulations are completely determined by the defined covariate similarities. In practice, subpopulations can also heavily depend on the outcomes, or even on the relationship between outcomes and covariates. This is evident in many clinical studies that aim to split the patient population according to drug effects.
Other than supervised and unsupervised learning, Wei and Kosorok (2013) recently introduced a new type of machine learning tool, named latent supervised learning, to maintain the advantages of both. Their method assumes that each observation comes with an unobserved label, i.e. the latent outcome, which identifies its subpopulation and determines the underlying distribution. Furthermore, they assume that the observed outcome follows a mixture Gaussian distribution with two latent components, which are in turn determined by a linear combination of features. There are follow-up extensions of Wei and Kosorok (2013) in the literature. For instance, Altstein and Li (2013) adapted this idea to time-to-event responses, and Shen and He (2015) suggested a logistic-normal mixture model instead. Although these methods showed competitive performance in detecting subpopulation boundaries, some drawbacks still exist. First, most of the existing methods still rely on prior knowledge of the parametric form of the subpopulation boundaries. In many exploratory studies, such information can be difficult to obtain. Second, many latent supervised learning methods also rely on certain distributional assumptions on the observed outcomes, which may not hold in practice, especially for complex subpopulation structures.
In this paper, we focus on subpopulation detection, and aim to address these two drawbacks of latent supervised learning. In particular, we propose an exploratory tool, named latent supervised clustering, to estimate the heterogeneous effects while simultaneously clustering the samples without prior knowledge of their boundaries. To achieve these two goals, we formulate the problem as regression with subject-specific coefficients, which can be treated as the heterogeneous relationships between the outcomes and covariates of the observed data. Then we cluster such relationships with the adaptive fusion penalty, which is extended from the perspective of convex clustering. The proposed method inherits the advantages of both latent supervised learning and traditional clustering analysis. On one hand, it lets the data orient the learning process so that no assumption is needed on the subpopulation structure. On the other hand, it utilizes the information in both covariates and outcomes to estimate the heterogeneity, so that it can both identify the subpopulations and predict the outcome values. In contrast, regular clustering does not utilize the outcome information.
Clustering such outcome-predictor relationships can be very challenging because they are not observed directly but can only be derived. One important question is how to define a distance properly to encourage such clustering patterns. We adapt the idea of convex clustering. Note that convex clustering formulates the clustering process as a convex minimization problem involving the sum of a loss function and a penalty term, with a tuning parameter balancing the two. The loss term is a sum of Euclidean distances between observations and their corresponding subject-specific centroids. The penalty term is a sum of fusion penalties between each pair of such centroids, encouraging them to merge. In latent supervised clustering, we propose a different loss + penalty form to accommodate the outcome information. For the loss term, we use a pre-defined loss calculated from the observed outcome and its fitted value under a model with subject-specific parameters; a smaller loss value then indicates a better goodness of fit. This model can be either parametric or nonparametric, such as smoothing splines. While we only discuss linear regression in this paper, extensions to classification or nonlinear regression can also be covered by, for example, using a linear function with the deviance loss or using smoothing splines with the quadratic loss. We assume that observations from the same subpopulation have identical values of the subject-specific parameters. To encourage such a pattern in the estimation, we impose an adaptive pairwise fusion penalty on each pair of parameters in the penalty term, with weights determined by the estimated differences of the pairs. In summary, this convex optimization formulation maximizes the overall goodness of fit while minimizing the heterogeneity within each detected cluster.
The main contributions of this paper are as follows. First, we propose a new method, within the supervised machine learning framework, that identifies heterogeneity by clustering the defined outcome-predictor relationships. We borrow the convex clustering idea but design new loss functions and penalty terms to achieve competitive performance and computational efficiency. Second, we design a novel optimization algorithm to solve the underlying convex minimization problem. This algorithm has several distinct features compared to existing proximal gradient methods: (1) it can handle non-smooth loss terms using a smoothing technique as in Nesterov (2005) in a homotopy fashion; (2) it allows inexact computation of the proximal operator of the penalty term; (3) as far as we know, it has the best known convergence rate guarantee among homotopy smoothing first-order methods. We point out that the proposed latent supervised clustering technique covers the problem discussed by Ma and Huang (2016) as a special case; they focused on the case where the subpopulations can be determined merely by a varying intercept term of a linear model.
The remainder of the paper is organized as follows. In Section 2, we introduce the proposed method. In Section 3, we present the proposed accelerated proximal gradient algorithm to solve the optimization problem and show its convergence rate properties. Simulated examples and data applications are presented in Section 4 to demonstrate the performance of the proposed method under finite samples. Some discussions are provided in Section 5. More technical details including statistical learning theory and proofs of theorems, together with additional numerical results, are left in the supplemental materials.
2. Methodology
We use {(xi, yi), i = 1, ⋯, n} to represent our training data, where xi is a p-dimensional covariate vector and yi is its response. We consider the following model:
| yi = f(xi; βi) + εi,  i = 1, ⋯, n, | (1) |
where xi always contains an intercept term as its first element, βi is the coefficient vector of the i-th observation xi, and εi is a noise term with mean zero and bounded variance. We further assume that the xi and εi are mutually independent. To describe the heterogeneity of covariate effects, we allow βi to take different values for different indices i. Our goal is to estimate and cluster these βi’s, and let the clustering results help to provide subpopulation identification. For model (1), we assume that the value of βi is determined by its underlying subpopulation. In other words, if we denote a partition of {1, ⋯, n} by 𝒢1, ⋯, 𝒢K, where K represents the number of subpopulations and is usually unknown in practice, then the values of βi from the same latent subpopulation are supposed to be identical. In this paper, we concentrate on linear models, i.e. f(xi; βi) = xiTβi. When the linear assumption is too strong, one can extend the function to be nonlinear, for example using basis expansions with f(xi; βi) = Σj=1m βij gj(xi), where g1, ⋯, gm are basis functions.
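As a concrete illustration, model (1) with two latent subpopulations can be simulated as follows (a minimal sketch; the group boundary, coefficient values, and sample size are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

# covariates; the first column is the intercept term
X = np.hstack([np.ones((n, 1)), rng.uniform(-2, 2, size=(n, p - 1))])

# two latent subpopulations with group-specific coefficient vectors
group = (X[:, 1] > 0).astype(int)        # illustrative subpopulation boundary
betas = np.array([[1.0, 2.0, -1.0],      # beta shared within subpopulation 0
                  [-1.0, -2.0, 1.0]])    # beta shared within subpopulation 1
eps = rng.normal(0.0, 0.1, size=n)       # mean-zero noise

# model (1): y_i = x_i^T beta_i + eps_i, with beta_i identical within a group
y = np.einsum("ij,ij->i", X, betas[group]) + eps
```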
To estimate and cluster these βi’s, we consider the following optimization problem:
| minβ1,⋯,βn Σi=1n ℓ(yi, f(xi; βi)) + λn Σi&lt;j wij ‖βi − βj‖1, | (2) |
where ℓ represents a pre-selected loss function that measures the goodness of fit. For the loss term, we consider two popular choices in regression: the check loss (Koenker (2005)) used in quantile regression, and the quadratic loss used in least squares estimation. Note that one can easily change the loss to solve classification problems. As for the penalty term, the adaptive pairwise fusion penalty is used to adjust for the potential bias created by the ℓ1-penalty. In practice, we find that wij = aij min{1/‖β̂i − β̂j‖, Bw} can be a good option, where aij indicates whether observation j is among i’s m-nearest neighbors defined by the Euclidean distance, and β̂i is an estimate of βi which can be initialized by the local regression coefficients. The upper bound Bw is added in case some pairs of (β̂i, β̂j) have values too close to each other. Considering the numerous terms in the fusion penalty (O(n2) of them), the m-nearest-neighbors strategy saves tremendous computational time when solving (2) while maintaining competitive performance.
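The weighting scheme can be sketched as follows; since the exact weight formula is not reproduced here, the form wij = aij · min{1/‖β̂i − β̂j‖1, Bw} below is an illustrative assumption built from the ingredients described above (the m-nearest-neighbor indicator aij, the initial estimates β̂i, and the cap Bw):

```python
import numpy as np

def fusion_weights(X, beta_hat, m=10, B_w=100.0):
    """Adaptive m-nearest-neighbor fusion weights (illustrative form:
    w_ij = 1{j in NN_m(i)} * min(1 / ||beta_i - beta_j||_1, B_w))."""
    n = X.shape[0]
    # pairwise Euclidean distances between covariate vectors
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D[i])[1:m + 1]      # m nearest neighbors of i (skip self)
        for j in nn:
            diff = np.abs(beta_hat[i] - beta_hat[j]).sum()
            # cap the weight at B_w when beta_i and beta_j are nearly equal
            W[i, j] = min(1.0 / max(diff, 1e-12), B_w)
    return W
```

Only O(nm) of the O(n²) pairs receive a nonzero weight, which is what makes the nearest-neighbor strategy cheap.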
Note that we can view our model (2) as an exact penalty formulation of the constrained problem of minimizing the loss subject to the equality constraints βi = βj, where the products λnwij can be considered penalty parameters. From the theory of exact penalty methods, the parameters wij are expected to be updated at each iteration of the algorithm until they go beyond a given threshold defined by the ℓ∞-norm of the optimal Lagrange multiplier associated with the constraint βi = βj (see e.g. (Nocedal and Wright, 2006, Chapter 17)).
In practice, one may find prior evidence showing that certain components of the covariates have homogeneous effects on the outcome yi, i.e., their coefficients remain constant for all i. In this case, we recommend imposing the penalty only on the heterogeneous part. Suppose that the collected observation includes (xi, zi, yi), where zi is a q-dimensional covariate vector known to have a homogeneous effect. We rewrite model (1) as yi = f(xi; βi) + h(zi; γ) + εi, where γ is the same for all i, and h is a measurable function that can have a parametric or nonparametric form. Similar to the form of f, we restrict our discussion to linear functions with h(zi; γ) = ziTγ. In this scenario, the optimization problem in (2) becomes
| minβ,γ Σi=1n ℓ(yi, f(xi; βi) + h(zi; γ)) + λn Σi&lt;j wij ‖βi − βj‖1. | (3) |
Moving redundant components from xi to zi can be crucial in saving computational time, especially when the dimension of the covariates is large. Here, we suggest a “forward screening” idea that can help distinguish zi from xi when no prior knowledge is available. We start with a parsimonious model in which xi contains only an intercept term. Then, we move from zi to xi the variable that boosts the model performance the most. This process is repeated until no further improvement can be obtained by moving more variables to xi. More illustrations of this idea are given in the data applications of Section 4. In the following sections, we concentrate on model (3).
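The screening loop can be sketched generically; `fit_and_score` is a hypothetical callback standing in for fitting the latent supervised clustering model and returning a validation score to be maximized:

```python
def forward_screening(variables, fit_and_score):
    """Greedy forward screening: move variables from the homogeneous part z to
    the heterogeneous part x one at a time, keeping a move only if it improves
    the score returned by `fit_and_score(x_vars)` (hypothetical callback)."""
    x_vars = []                      # start with intercept-only heterogeneity
    z_vars = list(variables)
    best = fit_and_score(x_vars)
    improved = True
    while improved and z_vars:
        improved = False
        # try each remaining homogeneous variable as a heterogeneous one
        scores = {v: fit_and_score(x_vars + [v]) for v in z_vars}
        v_best = max(scores, key=scores.get)
        if scores[v_best] > best:    # keep the single best improving move
            best = scores[v_best]
            x_vars.append(v_best)
            z_vars.remove(v_best)
            improved = True
    return x_vars, z_vars
```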
As a special case of latent supervised clustering, Ma and Huang (2016) focused on the setting where the subpopulations can be determined by a subject-specific intercept. They suggested using a concave fusion penalty in the objective function and applied the alternating direction method of multipliers (ADMM) to solve the optimization problem. However, this approach is not suitable for our problem. To apply ADMM to (3), which has a complex regularization term, one needs to introduce many intermediate variables, which significantly increases the problem size and becomes extremely inefficient. In addition, ADMM requires a complicated strategy to tune the penalty parameter of the augmented Lagrangian to obtain good performance, which can be very difficult with complex regularization. In practice, it can be observed from Section 4 that ADMM takes longer to compute and produces suboptimal prediction accuracy.
Our proposed method and algorithm enjoy several advantages. First, our method has significant computational benefits because the problem we solve is convex, and our algorithm does not need to generate the p · n2 additional intermediate parameters that ADMM does. This feature allows our algorithm to scale to relatively high dimensional problems compared to ADMM. In addition, our method has a theoretical convergence rate guarantee on the original model (3), as opposed to a guarantee on the constrained reformulation for ADMM. Second, the quadratic loss suggested by Ma and Huang (2016) can be a suboptimal choice due to its sensitivity to outliers. This can be partly attributed to the fact that least squares estimators have a breakdown point of zero (Huber (2004); Zhao et al. (2018)). That is, if a subject from the first subpopulation is wrongly assigned to the second subpopulation, it can strongly impact the coefficient estimates of the second subpopulation when the quadratic loss is applied. We compare the results of the quadratic loss and the check loss, and find that the latter can significantly improve the model performance. Third, it is not desirable to penalize all pairs (βi, βj) equally. Ideally, large weights should be assigned to pairs from the same subpopulation, while zero weights are used for pairs from different subpopulations.
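The breakdown-point argument above can be seen in one line: with the quadratic loss the location estimate is the mean, which a single misassigned observation can drag arbitrarily far, while with the check loss at τ = 0.5 it is the median, which barely moves:

```python
import numpy as np

# a small sample with one gross outlier, mimicking an observation
# wrongly assigned to this subpopulation
sample = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 100.0])

# quadratic loss -> the minimizer is the mean: pulled far by the outlier
mean_fit = sample.mean()        # 17.5

# check loss with tau = 0.5 -> the minimizer is the median: stays near 1
median_fit = np.median(sample)  # 1.025
```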
The proposed method can also be used to predict the subpopulation and response of a new observation. One can treat the identified cluster labels in the training dataset as the estimated underlying outcome, and fit them on the predictor information with a standard classification model. Some popular choices include discriminant analysis, k-nearest neighbors, and random forests. The fitted classification model can be used to predict which subpopulation a new observation should be assigned to. Then, one can plug in the corresponding estimated βi and γ to predict the response yi.
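This two-step prediction can be sketched with a simple 1-nearest-neighbor classifier (function and variable names are illustrative; any of the classifiers above can replace the nearest-neighbor step):

```python
import numpy as np

def predict_new(X_train, labels, beta_hat, gamma_hat, X_new, Z_new):
    """Two-step prediction sketch: (i) assign each new observation to a
    subpopulation by 1-nearest-neighbor on the training cluster labels,
    (ii) plug in that subpopulation's estimated coefficients."""
    preds = np.empty(len(X_new))
    for i, (x, z) in enumerate(zip(X_new, Z_new)):
        d = np.linalg.norm(X_train - x, axis=1)      # distances to training set
        k = labels[np.argmin(d)]                     # predicted subpopulation
        preds[i] = x @ beta_hat[k] + z @ gamma_hat   # fitted linear model
    return preds
```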
3. Algorithms
In this section, we design a novel accelerated proximal gradient algorithm to solve the proposed optimization problem efficiently, inspired by the fast iterative shrinkage-thresholding algorithm (FISTA, Beck and Teboulle (2009)). To further increase the speed of convergence, we integrate a restarting strategy at each iteration, together with inexact calculation of the proximal operator. Note that FISTA only works when the loss term is smooth. For the nonsmooth check loss, we propose approximating it with a smooth adaptive surrogate function in a way that guarantees the best known convergence rate.
3.1. Model Reformulation
For simplicity, we concatenate the parameters βi and γ of model (3) into one vector ζ ≜ (βT, γT)T, where β ≜ (β1T, ⋯, βnT)T. Then, we rewrite the loss and penalty terms as:
| fn(ζ) ≜ Σi=1n ℓ(yi, xiTβi + ziTγ)  and  Jn(ζ) ≜ λn ‖Dwζ‖1, | (4) |
where Dw is the weight matrix. In this way, the optimization problem (3) can be expressed in the following compact form:
| minζ { ϕn(ζ) ≜ fn(ζ) + Jn(ζ) }, | (5) |
We focus on two loss functions: a quadratic loss and a check loss ℓτ(r) = τrI(r ≥ 0) − (1 − τ)rI(r &lt; 0) = (τ − 0.5)r + 0.5|r|. When ℓ is the quadratic loss, fn has a Lipschitz gradient, i.e., there exists Lfn ≥ 0 such that ‖∇fn(ζ) − ∇fn(ζ′)‖ ≤ Lfn‖ζ − ζ′‖ for all ζ, ζ′. When ℓ is the check loss, (5) is still convex but fully nonsmooth (i.e., both fn and Jn are nonsmooth).
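The two losses, and the equivalence of the two expressions of the check loss, can be verified directly (the 0.5 r² scaling of the quadratic loss below is an illustrative normalization):

```python
import numpy as np

def check_loss(r, tau):
    """Check (pinball) loss: tau*r for r >= 0 and (tau - 1)*r for r < 0,
    equivalently (tau - 0.5)*r + 0.5*|r|."""
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

def quadratic_loss(r):
    return 0.5 * r ** 2

# the two expressions of the check loss agree on a grid of residuals
r = np.linspace(-3, 3, 101)
tau = 0.3
assert np.allclose(check_loss(r, tau), (tau - 0.5) * r + 0.5 * np.abs(r))
```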
3.2. Algorithmic Design and Convergence Properties
We develop the algorithm based on Beck and Teboulle (2009) and Nesterov (2013), while the following steps are new. First, we evaluate the proximal operator (see the definition below) of the penalty Jn using an adaptive fast projected gradient method with a warm start. Second, we incorporate a restart procedure recently studied in Fercoq and Qu (2016) to accelerate the performance of the algorithm. Third, for the check loss, we apply smoothing techniques to approximate it with a surrogate function depending on a parameter that can be adaptively updated within the algorithm. Last, we design a new variant of the adaptive method proposed in Tran-Dinh (2017) to solve (5) that has a convergence rate guarantee without any parameter tuning strategy.
The computational cost of the algorithm consists of two parts. First, we need to evaluate the gradient vector of fn or its smoothed approximation, and evaluate the Lipschitz constant of this gradient mapping. Second, we compute the proximal operator of Jn which is defined as:
| proxsJn(ζ) ≜ argminu { Jn(u) + (1/(2s)) ‖u − ζ‖2 }, | (6) |
for any ζ and s > 0. The main steps of the proposed algorithm for solving (5) are presented in Algorithm 1.
Algorithm 1.
Adaptive fast Proximal Gradient algorithm (APG)
| 1. Choose an arbitrary initial point and desired tolerances ε > 0 and ϵ0 ≥ 0. |
| 2. Evaluate ; Set τ0 = 1, and . |
| 3. If the check loss is used, then input η1 (e.g., ). |
| 4. For t = 0, 1, ⋯, tmax, perform: |
| Step 1: Set Lfn ≔ Lf for the quadratic loss, and for the check loss. Then compute the step-size . |
| Step 2: Compute approximately up to the accuracy ϵt as defined in (7) below. |
| Step 3: If stopping-criterion is satisfied, terminate the algorithm. |
| Step 4: If fn is the quadratic loss, update . If fn is the check loss, update τt+1 as the positive solution of . |
| Step 5: Update the accelerated step . |
| Step 6: If fn is the check loss, then update . |
| Step 7: Perform a restarting step if needed, and update ϵt+1. |
| 5. End of the main loop. |
In summary, when fn is a quadratic loss, we have fn(ζ) = ½‖Aζ − y‖2 for an appropriate design matrix A built from the xi and zi. Then its gradient is ∇fn(ζ) = AT(Aζ − y), which is Lipschitz continuous with Lipschitz constant Lfn = λmax(ATA), where λmax(ATA) denotes the maximum eigenvalue of ATA. When fn is the nonsmooth check loss, we approximate it by a smooth fn(·; η), with details described in Subsection 3.2.2. The next step is to compute the proximal operator proxsJn, which is presented in Subsection 3.2.3.
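The structure of the accelerated scheme in the quadratic-loss case can be sketched as follows, with soft-thresholding (the prox of a plain ℓ1 penalty) standing in for the inexactly computed proximal operator of Jn; the restart and inexactness machinery of Algorithm 1 is omitted in this sketch:

```python
import numpy as np

def fista(A, b, lam, iters=500):
    """Minimal FISTA sketch: minimize 0.5*||A z - b||^2 + lam*||z||_1.
    Soft-thresholding stands in for the (inexactly computed) proximal
    operator of the fusion penalty J_n discussed in the text."""
    L = np.linalg.eigvalsh(A.T @ A).max()     # Lipschitz constant of the gradient
    z = np.zeros(A.shape[1])
    v, t = z.copy(), 1.0
    for _ in range(iters):
        g = A.T @ (A @ v - b)                 # gradient of the quadratic loss at v
        u = v - g / L                         # gradient step
        z_new = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # prox step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        v = z_new + ((t - 1.0) / t_new) * (z_new - z)              # acceleration
        z, t = z_new, t_new
    return z
```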
Next we discuss the convergence guarantee of Algorithm 1. Let ζ* be an optimal solution of (5) with optimal value ϕn(ζ*), i.e. ϕn(ζ) ≥ ϕn(ζ*) for any ζ. We say that ζ(t) is an approximate solution of (5) with accuracy ε ≥ 0 if ϕn(ζ(t)) − ϕn(ζ*) ≤ ε. In (5), we are not able to compute the proximal operator of Jn exactly, but rather approximate it up to a given accuracy ϵt > 0. In particular, we define
and . Then by definition, is the corresponding objective function of the proximal operator problem at Step 2 of Algorithm 1. One can show that , as long as the following condition holds:
| (7) |
Now we provide a general convergence result for Algorithm 1. For the two particular losses, we present the theorems separately in the subsections below. Their technical proofs can be found in the supplemental material.
3.2.1. Convergence for a quadratic loss
If ℓ is a quadratic loss, then the following theorem provides a convergence rate guarantee of Algorithm 1.
Theorem 3.1. Let fn be a quadratic loss, and let {ζ(t)} be a sequence generated by Algorithm 1, where proxsJn is computed approximately with accuracy ϵt ≥ 0 as defined in (7). Then, we have
| (8) |
where . Consequently, for any accuracy ε > 0 and constant c ≥ 1, if the inner accuracy ϵt is chosen to be , then the maximum number of iterations needed to achieve such a ζ(t) does not exceed . Here, ⌊⋅⌋ denotes the floor function.
3.2.2. Smoothing technique for the check loss
The check loss ρτ(r) = τrI(r ≥ 0) − (1 − τ)rI(r &lt; 0) = (τ − 0.5)r + 0.5|r| is convex but nonsmooth. We consider approximating it by a smooth convex function ρτ(⋅; η) with a smoothness parameter η > 0. For any fixed η > 0, the smooth function ρτ(⋅; η) needs to satisfy the following basic properties. First, ρτ(⋅; η) is smooth and convex, and its gradient ∇rρτ(⋅; η) with respect to r is Lipschitz continuous with a Lipschitz constant Lρ depending on η. Second, ρτ(⋅; η) approximates ρτ(⋅) well: there exists a constant Dρ, independent of η, such that ρτ(r; η) ≤ ρτ(r) ≤ ρτ(r; η) + ηDρ for all r. There are several choices for ρτ(⋅; η), such as the following two:
Huber loss: with and .
Logit-type loss: with and .
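Since the exact smoothing constants are not reproduced above, the following is one standard Huber-type choice satisfying the two required properties (with Dρ = 1/4 under this construction): replace the 0.5|r| term of the check loss by its Moreau envelope, which is quadratic on [−η, η] and linear outside:

```python
import numpy as np

def check_loss(r, tau):
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

def smoothed_check_loss(r, tau, eta):
    """One Huber-type smoothing (illustrative constants): replace the
    0.5*|r| term of (tau - 0.5)*r + 0.5*|r| by its Moreau envelope."""
    huber = np.where(np.abs(r) <= eta,
                     r ** 2 / (2.0 * eta),
                     np.abs(r) - eta / 2.0)
    return (tau - 0.5) * r + 0.5 * huber
```

For this choice, 0 ≤ ρτ(r) − ρτ(r; η) ≤ η/4 uniformly in r, so the surrogate converges uniformly to the check loss as η → 0+.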
We now introduce the following properties of ρτ(⋅; η) with the proof placed in the supplemental material.
Lemma 3.2. Consider fn(⋅; η) as a smooth version of the check-loss objective in (4), using either the Huber loss or the logistic loss. Then this function is convex and differentiable, and its gradient ∇ζfn(⋅; η) is Lipschitz continuous with a Lipschitz constant depending on η. Moreover, we have
| fn(ζ; η) ≤ fn(ζ) ≤ fn(ζ; η) + nηDρ, | (9) |
for any and η > 0.
The lemma shows that fn(ζ; η) → fn(ζ) as η → 0+. Accordingly, we approximate problem (5) by
| minζ { ϕn(ζ; η) ≜ fn(ζ; η) + Jn(ζ) }. | (10) |
Our goal is to compute an ε-approximate solution ζ(t) to the true solution ζ* of (5), i.e. one with ϕn(ζ(t)) − ϕn(ζ*) ≤ ε. To this end, we apply the fast proximal gradient method to solve (10) approximately, with a homotopy scheme that decreases the smoothness parameter ηt at each iteration t so that ηt → 0+. The step-size parameter is updated using the unique positive solution τt+1 of the cubic equation in Step 4 of Algorithm 1.
The following theorem indicates that if ζ(t) generated by Algorithm 1 is an approximate solution of (10), then it is also an approximate solution of the original problem (5). The detailed proof is left to the supplemental material.
Theorem 3.3. Let {ζ(t)} be a sequence generated by Algorithm 1, where proxsJn is computed approximately as in (7) with accuracy for some constant c ≥ 1. Then we have
| (11) |
where , Γt = 0 for the Huber loss, and for the logistic loss. Hence, for any accuracy ε > 0, the maximum number of iterations for an approximate solution of (5) does not exceed , where Γε = 0 for the Huber loss, and for the logistic loss.
Remark 1. The worst-case convergence bounds in Theorems 3.1 and 3.3 depend on the maximum eigenvalue discussed above. If this eigenvalue is large, it affects the complexity bound tmax of our methods. To speed up the practical performance, we suggest performing a linesearch routine, as in the linesearch variant given in the supplementary material. Another possibility is to apply a preconditioning technique to the design matrix to reduce its leading eigenvalue before solving the problem.
3.2.3. Evaluating the proximal operator for the penalty
We now describe how to approximately evaluate the proximal operator of Jn under condition (7). We convert problem (6) into its dual, which is defined as
| (12) |
and Dw is defined in (4). To approximate ν*(u), we apply an accelerated projected gradient algorithm, which can be presented in a few lines as follows:
Accelerated projected gradient scheme to approximate proxsJn(u):
Given u, s > 0 and an initial point ν(0). Compute . Set , δ0 ≔ 1 and Γ0 ≔ 0. At each iteration j ≥ 0, we update
;
where ;
, where Γj+1 ≔ Γj + δj+1, and .
Here, πB∞(v) = max{min{v, 1}, −1} is the coordinate-wise projection of v onto the ℓ∞-unit ball B∞ ≔ {v | ‖v‖∞ ≤ 1}. This algorithm is terminated after a pre-specified number of iterations, jmax. The output of this routine is an approximation of proxsJn(u).
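The projection step is what makes the dual approach cheap: projecting onto the ℓ∞-unit ball is simply a coordinate-wise clip,

```python
import numpy as np

def project_linf_ball(v):
    """pi_{B_inf}(v) = max{min{v, 1}, -1}: clip each coordinate to [-1, 1],
    the Euclidean projection onto the l_inf unit ball."""
    return np.clip(v, -1.0, 1.0)
```

so each inner iteration costs only a matrix-vector product plus an elementwise clip.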
Next we analyze the computational effort needed to achieve the approximation point ζ(t+1) in (7). By adapting the results from Tran-Dinh (2017), we have , where and the function is defined in (12). In order to achieve an ϵt-approximate solution ζ(t+1) at the t-th iteration, we require . Hence, the maximum number of iterations jmax is
By exploiting a warm-start strategy that uses the previous approximate point ν(t−1) as ν(0), the initial distance to the solution becomes smaller. We therefore recommend fixing jmax at a small number, such as 50, when computing the approximate ζ(t+1) in the implementation.
3.3. Finding an Initial Point
The convergence properties in Section 3.2 indicate that a good selection of starting values can reduce the number of iterations. Based on the assumption that each variable in X is independent of each variable in Z, we introduce an ad-hoc method that can easily be applied to find a proper starting point in practice. The method consists of two steps:
First, we calculate the distance matrix of the dataset based on (X, y*), where y* is the residual of the linear regression of y on Z. We use y* instead of y since the expectation of y* is exactly Xβ; this holds because a linear regression produces unbiased coefficient estimates when the omitted variables are all independent of those included in the model. In this way, we treat the response y* as a new variable and calculate the distance matrix on (X, y*). For our numerical study, we use the Manhattan distance, i.e. the Minkowski distance (Borg and Groenen (2005)) with α = 1.
Second, for each subject i, we calculate an initial estimate β̂i based on the k-nearest neighbors of the i-th subject, using a linear regression model with the selected loss. The k-nearest-neighbor set of the i-th subject is defined as the k observations having the smallest distance d(i, j) to the i-th subject among those whose responses lie within an ϵ-neighborhood of yi. The second criterion increases the chance that those neighbors come from the same latent group as the i-th subject. The selection of the neighborhood radius ϵ depends on the variation of the noise. In our numerical examples, we vary ϵ from 0.5 to 6.
As a remark, we note that the weights need to be updated using the current estimate of the parameters. To simplify the computation, instead of solving each convex problem to global optimality before updating the weights, in the early iterations of Algorithm 1 we use a one-iteration approximation of the true solution, which yields a suboptimal solution. Although this suboptimal solution may not be very accurate, it serves as a rough approximation of the true solution under the current weights at the cost of only one iteration of the whole optimization routine, which greatly speeds up the computation. Once the weights stabilize, our algorithm automatically converges to the true global solution under the corresponding stable weights.
4. Numerical Analysis
In this section, we use numerical examples to test the performance of latent supervised clustering. We study the estimation accuracy, runtime, and prediction performance using simulated and real data. For comparison, we also implement the concave method of Ma and Huang (2016) with a general heterogeneous-effect vector xi (Concave), and its extension (Concave-2, Ma et al. (2018)). We additionally include two standard methods: linear regression with all two-way interactions and random forests. We pick the tuning parameters by selecting λn from {2−3, 2−2, 2−1, 20}, setting the number of neighbors to m = 10, and choosing the radius of the response neighborhood ϵ from {2−1, 20, 2, 22}.
4.1. Simulations
We evaluate the model using eight simulated examples with linear outcome-predictor relationships. In particular, Examples 1–2 include two subpopulations with linear boundaries. Example 3 extends the simulation in Wei and Kosorok (2013) by adding more covariates to xi. Example 4 has a higher dimensional zi with noise variables. Examples 5, 6 and 8 introduce nonlinear subpopulation boundaries with noise variables whose coefficients are zero. Example 7 covers the nonlinear boundary case where the number of underlying subgroups increases to 5.
For each example, we generate each component of xi and zi from an independent continuous uniform U(−2, 2), and the random noise εi from N(0, .1). We use a tuning dataset and choose the λn that minimizes the tuning error. To calculate the validation error, we consider the estimated parameters for all detected subpopulations from the training set, and choose the one with the smallest mean squared error as the validation prediction.
Once the optimal tuning parameters are found, we generate a test dataset ten times as large as the training set to assess the prediction performance. We divide prediction into two steps. First, we treat the cluster labels of the training set as an underlying outcome, and fit them on all the covariates with a classifier. In the simulation studies, we choose k-nearest neighbors for most of the cases and use kernel discriminant analysis as an alternative when xi is high dimensional with noise variables. Second, for each observation in the testing set, we use the fitted classifier to decide which cluster it belongs to, and then plug in the corresponding estimated coefficients to make predictions.
For the accuracy comparison, we report the averages and standard deviations of the mean squared errors. In addition, we also compare the estimated number of subpopulations. For the time comparison, we present the average run time (in seconds) of each method for all examples. The simulations are repeated 50 times, with the details listed below.
Example 1: Univariate Linear Regression with Two Subgroups.
Suppose the underlying true model is linear with:
where γ = (1, −5, 2, 1, −3, 1, 3, 2, −4)T and the random noise εi ~ N(0, .1).
Example 2: Three-Dimensional xi with Noisy Variables in zi.
The true model is:
where γ = (1, −5, 3, 2, 1, 0, 0, 0, 0)T and the random noise εi ~ N(0, .1).
Example 3: Higher xi Dimension with Noisy Variables.
Let xi contain 25 variables, with the subpopulation determined by the first five of them. The true model is:
where γ = (1, −5, 3, 2, 1, 0, 0, 0, 0)T and the random noise εi ~ N(0, .1).
Example 4: Higher zi Dimension with Three Subgroups.
Consider a model that has 50 homogeneous variables zi and three latent subpopulations as:
where 0_45 represents a 45-dimensional zero vector and εi ~ N(0, .1).
Example 5: Nonlinear Subgroup Boundary with Noisy Variables in xi.
Consider a model in which xi has 6 variables, 2 of which construct a nonlinear subpopulation boundary:
where γ = (1, −5, 3, 2, 1)T and the random noise εi ~ N(0, .1).
Example 6: Complex Subgroup Boundary with Noisy Variables in xi.
Consider a situation that has a complex nonlinear subpopulation boundary:
where γ = (1, −5, 3, 2, 1)T and the random noise εi ~ N(0, .1).
Example 7: Nonlinear Subgroup Boundaries with Five Subgroups.
There are five subpopulations determined by the cut points (b0, b1, b2, b3, b4, b5) = (−∞, 2, 3.5, 5, 7, ∞), with γ = (1, −5, 3, 2, 1)T and εi ~ N(0, .1).
Example 8: Complex Subgroup Boundaries with Three Subgroups and Noisy Variables.
Consider a model having eleven xi’s and three subpopulations:
where γ = (1, −5, 3, 2, 1)T and the random noise εi ~ N(0, .1).
The simulation results are reported in Table 1. From Table 1, we can see that latent supervised clustering (LSC) with the check loss almost always produces the best RMSE results for both sets of estimated coefficients. The Concave method maintains relatively small errors but suffers from large variability. The Concave-2 method of Ma et al. (2018), which aims at estimating the subgroup boundary directly, achieves competitive estimation accuracy when the subgroup boundary is simple, as in Examples 1–4. When the underlying subpopulation structure becomes more complex, as in Examples 5 and 6, the advantage of LSC-check becomes clearer due to its robustness to the noise caused by observations clustered into the wrong subpopulation. This demonstrates the advantage of using a more robust loss as well as the adaptive fusion penalty. When there are more than two subpopulations with even more complex boundaries, as in Examples 7 and 8, none of the methods produces satisfactory estimation, although LSC-check still outperforms the others. This indicates that one should consider data cleaning or variable selection before fitting LSC in practice when the covariates have complex relationships. As to the estimated numbers of subpopulations, the two LSC methods perform better than Concave, while all methods tend to overestimate this number except in the K = 5 case of Example 7.
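For reference, the check loss underpinning LSC-check is the quantile loss of Koenker (2005); a minimal sketch at τ = 0.5 shows why it downweights the large residuals produced by misclustered observations relative to the quadratic loss:

```python
import numpy as np

def check_loss(u, tau=0.5):
    """Koenker's check (quantile) loss: rho_tau(u) = u * (tau - 1{u < 0}).
    At tau = 0.5 it is half the absolute loss, so large residuals are
    penalized linearly rather than quadratically."""
    return u * (tau - (u < 0).astype(float))

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
robust = check_loss(u)        # 1.0, 0.25, 0.0, 0.25, 1.0
quadratic = 0.5 * u ** 2      # 2.0, 0.125, 0.0, 0.125, 2.0 -- grows faster in the tails
```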
Table 1:
Simulation results: the averages and standard deviations of the mean squared errors, with the best results in bold. The K columns provide the detected numbers of clusters. Concave is the method with the concave fusion penalty, Concave-2 refers to Ma et al. (2018), and LSC-quad and LSC-check represent latent supervised clustering with the quadratic and check losses, respectively.
| Case | Concave MSE(β̂) | MSE(γ̂) | K | Concave-2 MSE(β̂) | MSE(γ̂) | K | LSC-quad MSE(β̂) | MSE(γ̂) | K | LSC-check MSE(β̂) | MSE(γ̂) | K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.261 (0.153) | 0.034 (0.054) | 3.04 (0.68) | 0.214 (0.052) | 0.010 (0.004) | 2.02 (0.14) | 0.195 (0.061) | 0.011 (0.008) | 2.1 (0.48) | 0.185 (0.033) | 0.010 (0.006) | 2.17 (0.37) |
| 2 | 0.349 (0.116) | 0.123 (0.114) | 3.15 (0.61) | 0.236 (0.035) | 0.017 (0.005) | 2.46 (0.50) | 0.282 (0.133) | 0.012 (0.003) | 2.27 (0.52) | 0.225 (0.092) | 0.006 (0.002) | 2.21 (0.45) |
| 3 | 0.408 (0.22) | 0.115 (0.04) | 2.32 (0.68) | 0.111 (0.002) | 0.016 (0.007) | 1.87 (0.42) | 0.315 (0.180) | 0.032 (0.024) | 2.21 (0.49) | 0.177 (0.122) | 0.023 (0.018) | 2.11 (0.41) |
| 4 | 0.421 (0.035) | 0.099 (0.012) | 2.67 (0.89) | 0.381 (0.12) | 0.012 (0.003) | 2.62 (0.32) | 0.285 (0.041) | 0.010 (0.004) | 3.34 (0.73) | 0.273 (0.03) | 0.012 (0.007) | 3.12 (0.38) |
| 5 | 0.434 (0.058) | 0.254 (0.140) | 2.32 (0.36) | 0.359 (0.104) | 0.076 (0.040) | 2.61 (0.84) | 0.332 (0.70) | 0.009 (0.001) | 2.14 (0.36) | 0.285 (0.041) | 0.006 (0.001) | 2.1 (0.27) |
| 6 | 0.616 (0.161) | 0.091 (0.023) | 2.73 (0.52) | 0.654 (0.274) | 0.293 (0.189) | 2.89 (0.92) | 0.348 (0.152) | 0.078 (0.001) | 2.15 (0.22) | 0.386 (0.134) | 0.040 (0.001) | 2.13 (0.18) |
| 7 | 2.345 (0.251) | 0.148 (0.011) | 6.25 (1.54) | 2.129 (0.047) | 0.083 (0.010) | 6.40 (0.89) | 1.699 (0.271) | 0.111 (0.098) | 6.17 (1.06) | 1.896 (0.157) | 0.084 (0.044) | 6.23 (0.36) |
| 8 | 2.134 (0.360) | 0.230 (0.184) | 5.53 (0.82) | 2.159 (0.062) | 0.257 (0.038) | 4.67 (0.89) | 1.349 (0.236) | 0.219 (0.072) | 4.47 (0.70) | 1.341 (0.262) | 0.129 (0.032) | 4.32 (0.59) |
Table 2 reports the average runtime each selected method takes to return results for a given set of tuning parameters. Latent supervised clustering is faster than the concave penalty method with ADMM when the sample size n and dimension p are large. This is because the concave penalty method takes more iterations to converge, and ADMM creates O(n2p) intermediate parameters. Concave-2 is fastest in Example 3 because the boundary there is easy to estimate directly, so its algorithm converges much faster. In addition, the suggested specification of the weight vector w in LSC results in a sparse penalty coefficient matrix. Between the two LSC methods, LSC-quad is faster than LSC-check because the smoothed check loss slows the convergence of the proposed algorithm, as shown in the previous section.
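To see where that sparsity can come from, a common specification in the convex clustering literature (Chi and Lange, 2015) keeps only k-nearest-neighbor pairs with Gaussian-kernel weights. Whether this matches the paper's exact choice of w is an assumption, but it illustrates how the penalty coefficient matrix becomes sparse:

```python
import numpy as np

def knn_gaussian_weights(X, k=5, phi=0.5):
    """Sparse pairwise fusion weights, a common convex-clustering choice:
    w_ij = exp(-phi * ||x_i - x_j||^2) if j is among the k nearest
    neighbors of i (or vice versa), and 0 otherwise."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]   # position 0 is the point itself
        W[i, nn] = np.exp(-phi * d2[i, nn])
    return np.maximum(W, W.T)             # symmetrize

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(100, 3))
W = knn_gaussian_weights(X, k=5)
sparsity = np.mean(W > 0)                 # far fewer than n^2 nonzero pairs
```

With k = 5, each row has at most 2k nonzero entries after symmetrization, so the number of active fusion terms grows linearly rather than quadratically in n.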
Table 2:
Simulation runtime comparison: average runtime (in seconds) of the selected methods for each set of the tuning parameters with the best results in bold.
| # | n | p | q | K | Concave | Concave-2 | LSC-quad | LSC-check |
|---|---|---|---|---|---|---|---|---|
| 1 | 300 | 2 | 9 | 2 | 18.5 | 15.8 | 11.0 | 12.1 |
| 2 | 300 | 3 | 9 | 2 | 692 | 91.7 | 71.1 | 107.3 |
| 3 | 300 | 24 | 5 | 2 | 16954 | 1054.1 | 2364.1 | 2433.7 |
| 4 | 300 | 4 | 50 | 3 | 793.8 | 771.7 | 161.5 | 186.5 |
| 5 | 300 | 6 | 5 | 2 | 2225.1 | 1793.5 | 458.5 | 537.1 |
| 6 | 300 | 6 | 5 | 2 | 2244.5 | 6320.7 | 436.7 | 507.4 |
| 7 | 600 | 6 | 5 | 5 | 9513 | 7078.6 | 2666.6 | 3373.1 |
| 8 | 600 | 11 | 5 | 3 | 40514.5 | 35540.8 | 11116.2 | 14739.0 |
To make predictions for the test datasets, we use k-nearest neighbors with k = 10 to predict the underlying labels, except for Examples 3, 7, and 8, where the covariates are noisy; there we use kernel discriminant analysis with parameters selected on the tuning sets.
We report the mean squared prediction errors in Table 3. From the results, LSC with the check loss achieves the smallest prediction errors. For Examples 1 and 2, all the methods produce satisfactory results. Linear regression with two-way interactions performs as well as LSC in Example 3 because the underlying boundary fits the interaction assumption exactly. When the underlying boundaries become nonlinear, as in Examples 5–8, neither linear regression with interactions nor random forests can produce reliable predictions. The performance of LSC is significantly better than the concave penalty methods in terms of both the average prediction error and its variability. Similar to the estimation accuracy results, none of the methods produces good predictions when the latent subpopulation structure becomes too complicated, as in Examples 7 and 8.
Table 3:
Simulation prediction accuracy: the mean squared prediction errors and standard deviations of the selected methods with the best results in bold. Reg-Int means linear regression with two-way interactions of xi’s, and RF represents random forests.
| Example | Concave | LSC-quad | LSC-check | Reg-Int | RF |
|---|---|---|---|---|---|
| 1 | 0.284 (0.05) | 0.083 (0.004) | 0.057 (0.003) | 0.388 (0.025) | 0.499 (0.081) |
| 2 | 1.142 (0.155) | 0.644 (0.093) | 0.486 (0.085) | 2.897 (0.13) | 1.513 (0.126) |
| 3 | 10.531 (0.308) | 6.936 (0.202) | 3.481 (0.166) | 3.599 (0.113) | 9.759 (1.505) |
| 4 | 2.838 (0.243) | 0.935 (0.105) | 0.776 (0.106) | 1.506 (0.074) | 7.469 (1.21) |
| 5 | 12.453 (1.893) | 10.379 (1.778) | 9.985 (1.759) | 30.325 (1.017) | 46.881 (2.835) |
| 6 | 14.176 (2.073) | 11.226 (1.295) | 10.305 (1.26) | 122.594 (12.372) | 24.77 (3.999) |
| 7 | 14.639 (1.954) | 12.379 (0.91) | 9.75 (0.851) | 20.829 (0.675) | 16.787 (2.041) |
| 8 | 32.274 (2.797) | 21.748 (1.495) | 16.92 (1.351) | 67.239 (1.397) | 42.693 (2.894) |
To better illustrate how the value of λn affects the number of detected clusters, we draw the solution paths for both the quadratic and check loss functions in Example 1 in Figure 1. The solution path for the heterogeneous effect coefficients follows the same idea as the dendrogram in agglomerative hierarchical clustering (Hastie et al., 2009). It is calculated with a “bottom up” approach, starting from a small λn that does not show any clustering pattern. In Figure 1, the solution paths for the two loss functions look similar to each other. When λn reaches around 1.5, the coefficient estimates merge into two distinct values around −1 and 1, i.e. the underlying true values. As λn grows to a larger value such as 4, the estimates eventually converge to a single point close to zero.
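To convey the mechanics behind such a path, the following toy (not the paper's algorithm) fuses 40 one-dimensional subject-wise estimates by gradient descent on a smoothed fusion penalty, warm-starting each λ from the previous solution and counting the distinct merged values:

```python
import numpy as np

def fusion_path_1d(b, lambdas, eps=1e-2, lr=1e-3, iters=3000):
    """Trace a toy solution path: minimize
        sum_i (beta_i - b_i)^2 + lam * sum_{i<j} sqrt((beta_i - beta_j)^2 + eps)
    (a smoothed surrogate of the fusion penalty) by gradient descent,
    warm-starting each lambda from the previous solution."""
    beta, path = b.copy(), []
    for lam in lambdas:
        for _ in range(iters):
            diff = beta[:, None] - beta[None, :]
            grad = 2 * (beta - b) + lam * np.sum(diff / np.sqrt(diff ** 2 + eps), axis=1)
            beta = beta - lr * grad
        path.append(beta.copy())
    return path

def count_groups(beta, tol=0.05):
    """Count clusters as runs of sorted coefficients with gaps below tol."""
    return 1 + int(np.sum(np.diff(np.sort(beta)) > tol))

# Subject-wise estimates scattered around two true values, -1 and 1
rng = np.random.default_rng(3)
b = np.concatenate([rng.normal(-1, 0.1, 20), rng.normal(1, 0.1, 20)])
path = fusion_path_1d(b, [0.0, 0.05, 0.5])
n_clusters = [count_groups(beta) for beta in path]   # shrinks toward 1 as lam grows
```

As λ increases, the estimates first fuse within the two groups near ±1 and finally merge into a single value near zero, mirroring the behavior seen in Figure 1.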
Figure 1:

Solution paths for the estimated heterogeneous coefficients against λn in Example 1, with the selected λn shown by dashed lines.
Following a reviewer’s suggestion, we have performed additional simulation studies to investigate the performance of our “forward screening” idea and the effect of correlations among predictors. The results show that our proposed forward screening is effective in identifying variable roles. When the predictors are correlated, our algorithms take more iterations to converge as the correlation increases. This matches with our theoretical analysis. More details can be found in the Supplementary Material.
4.2. Real Data Applications
In this subsection, we apply the proposed method to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data (Kueper et al., 2018) and the Pima Indian Diabetes data (Lichman, 2013) to evaluate its performance. We use the same tuning parameter ranges as in the simulation examples, and 5-fold cross-validation with 50 replications for tuning and prediction. To predict the cluster labels of observations in the validation sets, we use k-nearest neighbors with k = 10. For both the proposed methods and the method with the concave penalty, we follow the “forward screening” idea suggested in Section 2 to separate xi from zi. In particular, we start with a parsimonious model that has only an intercept term in xi and all the covariates in zi. We then move variables from zi to xi one at a time, choosing the one that improves the validation prediction accuracy the most, and stop when the validation prediction no longer improves. To save time in practice, one may conduct the screening with a fixed, reasonable set of tuning parameters. We report the mean squared prediction errors of all the methods for both the training and validation sets, and then briefly describe the pattern of the detected subpopulations.
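The screening loop just described can be sketched in a few lines. Here `val_error` stands in for refitting the model with a candidate heterogeneous set and returning its validation error; the toy error surface below is purely illustrative, not fitted to any data:

```python
import numpy as np

def forward_screen(candidates, val_error):
    """Greedy forward screening: starting from an empty heterogeneous set
    (intercept-only x), repeatedly move the variable that most improves the
    validation error from z to x; stop when no move improves it.
    `val_error(selected)` is assumed to refit with `selected` in x."""
    selected, best = [], val_error([])
    while True:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        trials = {c: val_error(selected + [c]) for c in remaining}
        c_best = min(trials, key=trials.get)
        if trials[c_best] >= best:       # no improvement: stop screening
            break
        selected.append(c_best)
        best = trials[c_best]
    return selected, best

# Toy error surface: variables 0 and 2 are truly heterogeneous, so the error
# drops only when they are included; every other variable adds a small penalty.
def toy_val_error(sel):
    noise = len([c for c in sel if c not in (0, 2)])
    return 1.0 - 0.4 * (0 in sel) - 0.3 * (2 in sel) + 0.05 * noise

sel, err = forward_screen(list(range(5)), toy_val_error)   # picks [0, 2]
```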
4.2.1. Alzheimer’s Disease Neuroimaging Initiative (ADNI)
The dataset records the structural magnetic resonance imaging (MRI) scans of 226 normal controls, 393 patients with mild cognitive impairment (MCI), and 186 patients with Alzheimer’s disease (AD). It uses the Alzheimer’s Disease Assessment Scale–Cognitive Subscale (ADAS-Cog), a score ranging from 0 to 150, to assess the level of cognitive dysfunction in Alzheimer’s disease. The goal is to use 93 manually labeled regions of interest (ROIs) as covariates to predict the ADAS-Cog score and simultaneously detect the three subgroups (i.e., normal, MCI, and AD).
We first reduce the covariate dimension to 20 by recursive feature elimination (Hastie et al., 2009), and sort the remaining covariates using the “forward screening” idea. The indices of the first five ROIs are 42, 82, 43, 85, and 30. Fitting regressions by subgroup, we find the coefficients of these five ROIs to be quite different across the three subgroups, so it is reasonable to assume that the ROIs have heterogeneous effects on ADAS-Cog. Figure 2 shows that the best prediction accuracy is achieved when ROIs 42 and 82 are included. The columns LSC-quad and Concave represent latent supervised clustering with the quadratic loss and the method with the concave fusion penalty; both perform worse than LSC-check, with larger prediction errors and variability. Reg and RF denote linear regression with two-way interactions and random forests; these two commonly used methods fail to achieve satisfactory prediction results for this problem.
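A minimal regression analogue of recursive feature elimination, used here only to illustrate the dimension-reduction step; the ranking criterion (standardized OLS coefficients) is an assumption, not the authors' exact setup:

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Bare-bones recursive feature elimination for linear regression:
    repeatedly fit OLS on the surviving columns and drop the feature with
    the smallest standardized coefficient, until n_keep remain."""
    idx = list(range(X.shape[1]))
    while len(idx) > n_keep:
        Xs = X[:, idx]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        scaled = np.abs(coef) * Xs.std(axis=0)   # scale-free importance
        idx.pop(int(np.argmin(scaled)))
    return idx

# Toy check: only the first three of ten columns drive the outcome
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 0.1, 300)
kept = rfe_linear(X, y, n_keep=3)                # recovers columns 0, 1, 2
```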
Figure 2:

Alzheimer’s Disease: mean squared prediction errors. LSC-c1–LSC-c4 denote latent supervised clustering with the check loss including the corresponding number of ROIs in the heterogeneous vector via the “forward screening” idea; LSC-q is LSC with the quadratic loss; Concave represents the method with the concave fusion penalty; Reg means linear regression with two-way interactions; and RF represents random forests.
We are also interested in comparing the detected subgroups with the true diagnosed labels. Figure 3 presents the K-means clustering results of the estimated β’s, and the validation performance of linear regression and LSC with predicted subgroups as well as the true subgroup labels. One can clearly see that K-means suggests the correct number of clusters according to the within-group sum of squares. The detected subgroups match the ground truth well (around 60% accuracy), except for those lying around the boundary between normal and MCI, a case that has proven difficult in the literature.
Figure 3:

Alzheimer’s Disease: within-cluster variation plot and scatter plots of the observed and predicted outcome for the validation set in one realization of CV.
4.2.2. Pima Indian Diabetes
The Pima Indian Diabetes dataset contains records of 768 females of Pima Indian heritage who are at least 21 years old. The dataset has 8 attributes and a class variable indicating whether the subject tested positive for diabetes. Among the 8 attributes, the 2-hour serum insulin is considered a proper surrogate outcome for the binary indicator of the underlying diabetes test. Therefore, we fit the 2-hour serum insulin using all the other attributes, excluding the diabetes test binary variable, and remove all rows that contain missing values. Similar to the previous example, we use 5-fold cross-validation and fit the selected methods on the training sets. We also apply the “forward screening” idea; the selection order is diabetes pedigree function (1), diastolic blood pressure (2), body mass index (3), age (4), triceps skin fold thickness (5), and plasma glucose concentration (6). The mean squared prediction errors of all the methods are presented in Figure 4. The proposed method with the check loss and three variables in xi achieves the best prediction performance. LSC-c4 and LSC-c5 have competitive mean squared errors, though with slightly larger variances. LSC-c6 suffers from overfitting, with a prediction error larger than that of LSC with the quadratic loss. Linear regression with interactions and random forests produce the worst prediction errors among the compared methods. In addition, because the data contain no ground-truth subgroup labels, we conduct a χ2 test between the detected subgroup labels and the underlying diagnosis variable to roughly evaluate how informative the labels are. The proposed methods always suggest two latent subgroups, and the detected labels show a significant relationship with the underlying diabetes test indicator according to the χ2 test: the median of the p-values is 0.031, suggesting that the identified subgroups are reasonable.
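The χ2 evaluation above needs only a contingency table; a hand-rolled Pearson statistic is sketched below, with an illustrative 70% agreement rate between detected labels and diagnoses (synthetic labels, not the real data):

```python
import numpy as np

def chi2_stat(labels_a, labels_b):
    """Pearson chi-squared statistic for the contingency table of two label
    vectors. For a 2x2 table (1 degree of freedom), values above 3.841 are
    significant at the 5% level."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)                      # cross-tabulate counts
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
    return np.sum((table - expected) ** 2 / expected)

# Illustrative labels: detected subgroup agrees with diagnosis 70% of the time
rng = np.random.default_rng(5)
diag = rng.integers(0, 2, 500)
detected = np.where(rng.random(500) < 0.7, diag, 1 - diag)
stat = chi2_stat(detected, diag)                     # well above 3.841
```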
Figure 4:

Pima Indian Diabetes: mean squared prediction errors. LSC-c1–LSC-c6 denote latent supervised clustering with the check loss including the corresponding number of variables in xi via the “forward screening” idea.
5. Discussion
In this paper, we propose a novel machine learning method that clusters the underlying subpopulation structure based on the heterogeneous relationship between the outcome and covariates. Although we mainly focus on a linear relationship between the outcome and covariates, the proposed method can be a very useful exploratory tool in practice due to its weak assumptions on the underlying subpopulation structure. We develop an efficient algorithm with competitive convergence rates to solve the optimization problem, and also discuss the statistical consistency properties of the coefficient estimators. In numerical studies, the proposed method demonstrates strong performance in both subpopulation detection and outcome prediction. One interesting future direction is to extend the proposed method to accommodate various other types of outcomes.
Supplementary Material
Acknowledgments
The authors are indebted to the editor, the associate editor, and two reviewers, whose helpful suggestions led to a much improved presentation. This research was supported in part by NSF grants IIS1632951, DMS1821231, and NIH grants R01GM126550 and P01CA142538.
References
- Altstein L and Li G (2013). Latent subgroup analysis of a randomized clinical trial through a semiparametric accelerated failure time mixture model. Biometrics 69(1), 52–61.
- Beck A and Teboulle M (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences 2(1), 183–202.
- Borg I and Groenen PJF (2005). Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media.
- Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, and Peters TJ (2004). Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of Clinical Epidemiology 57(3), 229–236.
- Chi EC and Lange K (2015). Splitting Methods for Convex Clustering. Journal of Computational and Graphical Statistics 24(4), 994–1013.
- Fercoq O and Qu Z (2016). Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358.
- Greenland S (2009). Interactions in epidemiology: relevance, identification, and estimation. Epidemiology 20(1), 14–17.
- Guo J, Levina E, Michailidis G, and Zhu J (2010). Pairwise variable selection for high-dimensional model-based clustering. Biometrics 66(3), 793–804.
- Hastie T, Tibshirani R, and Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Science & Business Media.
- Hocking TD, Joulin A, and Bach F (2011). Clusterpath: an algorithm for clustering using convex fusion penalties. Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA.
- Huber PJ (2004). Robust Statistics. John Wiley & Sons.
- Koenker R (2005). Quantile Regression. Cambridge University Press.
- Kueper JK, Speechley M, and Montero-Odasso M (2018). The Alzheimer’s Disease Assessment Scale–Cognitive Subscale (ADAS-Cog): modifications and responsiveness in pre-dementia populations. A narrative review. Journal of Alzheimer’s Disease 63(2), 423–444.
- Lagakos SW (2006). The challenge of subgroup analyses: reporting without distorting. The New England Journal of Medicine 354(16), 1667–1669.
- Lichman M (2013). UCI Machine Learning Repository.
- Ma S and Huang J (2016). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association 112(517), 1–42.
- Ma S, Huang J, and Zhang Z (2018). Exploration of heterogeneous treatment effects via concave fusion. arXiv preprint arXiv:1607.03717.
- Nesterov Y (2005). Smooth minimization of non-smooth functions. Mathematical Programming 103(1), 127–152.
- Nesterov Y (2013). Gradient methods for minimizing composite objective function. Mathematical Programming 140(1), 125–161.
- Nocedal J and Wright S (2006). Numerical Optimization. Springer.
- Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge Ø, Pergamenschikov A, Williams C, Zhu SX, Lønning PE, Børresen-Dale AL, Brown PO, and Botstein D (2000). Molecular portraits of human breast tumours. Nature 406(6797), 747–752.
- Shen J and He X (2015). Inference for Subgroup Analysis With a Structured Logistic-Normal Mixture Model. Journal of the American Statistical Association 110(509), 303–312.
- Tran-Dinh Q (2017). Adaptive smoothing algorithms for nonsmooth composite convex minimization. Computational Optimization and Applications 66(3), 425–451.
- Wager S and Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523), 1228–1242.
- Wei S and Kosorok MR (2013). Latent Supervised Learning. Journal of the American Statistical Association 108(503), 957–970.
- Zhao J, Yu G, and Liu Y (2018). Angle breakdown point for classification. Annals of Statistics 46(6B), 3362–3389.