Published in final edited form as: Stat Med. 2021 Apr 5;40(13):3181–3195. doi: 10.1002/sim.8972

Inferring Latent Heterogeneity Using Many Feature Variables Supervised by Survival Outcome

Beilin Jia 1, Donglin Zeng 1, Jason JZ Liao 2, Guanghan F Liu 3, Xianming Tan 1, Guoqing Diao 4, Joseph G Ibrahim 1,*

Abstract

In cancer studies, it is important to understand disease heterogeneity among patients so that precision medicine can target high-risk patients at the right time. Many feature variables, such as demographic variables and biomarkers, combined with a patient's survival outcome, can be used to infer such latent heterogeneity. In this work, we propose a mixture model for each patient's latent survival pattern, where the mixing probabilities for the latent groups are modelled through a multinomial distribution. The Bayesian information criterion (BIC) is used to select the number of latent groups. Furthermore, we incorporate variable selection with the adaptive lasso into inference so that only a few feature variables are selected to characterize the latent heterogeneity. We show that our adaptive lasso estimator has oracle properties when the number of parameters diverges with the sample size. The finite sample performance is evaluated in a simulation study, and the proposed method is illustrated with two datasets.

Keywords: adaptive lasso, censoring, latent model, mixture distribution, oracle property

1 | INTRODUCTION

A typical clinical trial is designed to test a drug/vaccine on a large and diverse group of patients, in the hope that the one-size-fits-all approach succeeds. The benefit of this approach is the quick availability of an effective drug/vaccine to a broadly targeted population with an unmet medical need. However, with far fewer low-hanging fruits available, it has become more challenging to develop a blockbuster drug/vaccine that works for the entire study population. Especially in more advanced and hard-to-treat disease areas such as oncology, patients often present a heterogeneous survival experience, and their disease outcomes may range from early death to spontaneous regression of the tumor followed by cure. The traditional one-size-fits-all approach may not be cost- and time-effective because of the high heterogeneity of the study population. As technology advances, more personal clinical, genetic, genomic, and environmental information and other baseline characteristic variables are available before a clinical study begins. Consequently, sponsors are looking into ways to conduct studies in a more homogeneous subgroup with a much higher probability of success, so as to develop new medicines effectively. Thus, a challenging statistical problem of strong scientific and clinical interest in drug discovery and development is the identification of patient subgroups with different survival experiences. Recently, Liao and Liu1 demonstrated that many Kaplan-Meier survival curves commonly seen in oncology trials can be reconstructed using a mixture of two or three parametric survival profiles. In other words, a disease population can be approximately decomposed into two or three latent groups, each with its own survival behavior.

Statistical methods to identify such latent groups have received increasing attention in the past several decades, and the majority of the broad literature focuses on latent class modeling. One popular subclass of latent class models is the mixture model, which uses finitely many mixture components to represent the unobserved heterogeneity in the data. For survival outcomes, Farewell2 discussed the use of mixture models by assuming a fraction of long-term survivors. Larson and Dinse3 used a parametric mixture model to analyze competing risks data, where the mixing parameters correspond to the marginal probabilities of the various failure types. Altstein and Li4 studied a semiparametric accelerated failure time mixture model on a latent subgroup with time-to-event data in randomized clinical trials, and Shen and He5 proposed a structured logistic-normal mixture model to identify subgroups. These methods did not discuss variable selection. Wu et al.6 extended Shen and He's work5 and introduced a backward elimination algorithm to select important variables; however, such an algorithm is computationally intensive, and there is no theoretical justification for the procedure. More recently, Bussy et al.7 proposed a quasi-Newton expectation maximization (QNEM) algorithm to detect patient subgroups, but they considered only discrete survival times. In their algorithm, the negative log-likelihood is penalized by the elastic net in every iteration, which is computationally intensive and lacks theoretical justification. Bennis et al.8 proposed a neural network architecture to estimate a finite mixture of two-parameter Weibull distributions with right-censored data. This approach consists of multiple network layers, including fully connected layers, which makes the structure complicated and hard to interpret. Moreover, it offers neither variable selection nor theoretical justification.

In practice, datasets tend to be rich in information, and baseline characteristics are often considered predictive of the latent subgroup membership. A large number of covariates, such as demographic characteristics and biomarkers, may be involved in identifying latent subgroups. Variable selection becomes crucial for reducing dimensions without losing much information, and it helps in understanding important features. Hence, it is natural to introduce variable selection into latent subgroup identification under the premise that only a few covariates are truly predictive of latent subgroup membership. Various approaches to variable selection for data without time-to-event outcomes or heterogeneity have been discussed extensively in the literature, including test-based approaches (see Chatfield9, Harrell Jr et al.10, Steyerberg et al.11) and penalty-based methods (see Tibshirani12, Meinshausen and Bühlmann13, Fan and Li14, Zou15). There has also been extensive work on variable selection for survival outcomes (see Tibshirani16, Fan and Li17) and for Gaussian mixture models (see Law et al.18,19, Raftery and Dean20, Khalili and Chen21). However, no existing work considers variable selection for time-to-event data when there may be heterogeneity in the population.

In this paper, to predict the latent subgroup membership of future individuals and to identify the variables that are predictive of latent subgroup membership for individuals with specific survival profiles, we propose a method for variable selection in latent subgroup identification with time-to-event data. More specifically, we model the survival distribution through a mixture of Weibull distributions, where each mixture component represents a latent subgroup. The latent group membership is then modelled via a multinomial distribution that may vary with the feature variables. To select important feature variables for characterizing the latent groups, the EM algorithm is first applied to obtain the initial maximum likelihood estimate, and then the adaptive lasso penalty is introduced for variable selection. We show that our proposed estimator enjoys the oracle property when the number of covariates diverges with the sample size. The rest of the paper is organized as follows. Section 2 details the proposed method for variable selection in latent subgroup identification for individuals with specific survival profiles. Section 3 provides the theoretical properties of the proposed method. Section 4 shows the finite sample performance of our proposed method via a simulation study. Two real data examples in Section 5 demonstrate applications of the proposed method.

2 | METHODOLOGY

2.1 | Model

We assume that the whole population consists of K different subgroups, and each group of patients follows a specific survival profile. More specifically, we assume that the kth group has a survival distribution $S(t, \eta_k)$ of parametric form with unknown parameters $\eta_k$, for k = 1, …, K. In this paper, we assume that the survival outcome for each latent subgroup follows a Weibull distribution, a commonly used distribution in survival analysis because of its flexibility and reliability1. The survival function of the Weibull distribution for the kth latent subgroup is given by $S(t, \eta_k) = \exp\{-(t/\lambda_k)^{\kappa_k}\}$, where $\eta_k = (\kappa_k, \lambda_k)^T$, $\kappa_k$ is the shape parameter, and $\lambda_k$ is the scale parameter.
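
As a quick numerical check, the subgroup survival function can be coded directly; a minimal sketch (not code from the paper; `pweibull` with `lower.tail = FALSE` gives the same quantity):

```r
# Subgroup-specific Weibull survival function S(t, eta_k) = exp{-(t/lambda_k)^kappa_k}
weibull_surv <- function(t, kappa, lambda) exp(-(t / lambda)^kappa)

# Example: with shape kappa = 1 and scale lambda = 1 (subgroup 1 in Section 4),
# the 2-year survival probability is exp(-2), about 13.5%
weibull_surv(2, kappa = 1, lambda = 1)
# equivalently: pweibull(2, shape = 1, scale = 1, lower.tail = FALSE)
```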

We let T denote the time to event and X denote all the baseline covariates, where the number of baseline covariates could be large. To classify each patient into one of the survival groups using the baseline covariates X (X contains the constant 1), we introduce a latent group membership B and assume

$$P(T > t \mid B = k, X) = S(t, \eta_k) \quad (1)$$

and

$$P(B = k \mid X) = \frac{\exp\{\beta_k^T X\}}{\sum_{k'=1}^{K} \exp\{\beta_{k'}^T X\}} = \pi_k(X, \beta) \quad (2)$$

for k = 1, …, K, where $\beta_1 = 0$ and $\beta_2, \ldots, \beta_K$ are unknown parameters. Therefore, the latent group membership determines which group a patient belongs to, and this membership depends on the baseline covariates through a multinomial distribution. Clearly, the proposed model implies that the marginal survival distribution of T takes a mixture form:

$$P(T > t \mid X) = \sum_{k=1}^{K} S(t, \eta_k)\, \pi_k(X, \beta),$$

where $\beta = (\beta_2^T, \ldots, \beta_K^T)^T$. In a future trial, any new patient with baseline covariates X = x is then classified into the group k with the maximal value of $\beta_k^T x$, i.e., the most likely group membership.
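
To make the classification rule concrete, here is a minimal R sketch of the membership probabilities (2) and of this rule; the coefficient and covariate values in the example are illustrative only:

```r
# pi_k(x, beta): multinomial (softmax) membership probabilities of model (2).
# `beta` is a d x K matrix whose first column is 0 (group 1 as reference) and
# `x` includes the leading constant 1.
pi_k <- function(x, beta) {
  eta <- drop(crossprod(beta, x))              # beta_k^T x for k = 1, ..., K
  p <- exp(eta - max(eta))                     # subtract max for numerical stability
  p / sum(p)
}
# Classify a new patient into the group with maximal beta_k^T x
classify <- function(x, beta) which.max(drop(crossprod(beta, x)))

beta <- cbind(c(0, 0, 0), c(0.5, 0.4, -0.3))   # K = 2 groups, illustrative values
x <- c(1, 1.2, -0.7)                           # constant 1 plus two covariates
pi_k(x, beta)                                  # membership probabilities
classify(x, beta)                              # most likely group
```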

2.2 | Initial Estimate

Suppose that we have right-censored observations from n i.i.d. patients, denoted by

$$\{Y_i = T_i \wedge C_i,\ \Delta_i = I(T_i \le C_i),\ X_i,\ i = 1, \ldots, n\},$$

where $C_i$ is the censoring time. Assuming that the censoring time is independent of $T_i$ given $X_i$, and is independent of the latent group membership $B_i$, we obtain the observed-data log-likelihood function as

$$l_{n,\mathrm{obs}}(\theta) = \sum_{i=1}^{n}\left[\Delta_i \log\left\{\sum_{k=1}^{K} f(Y_i, \eta_k)\,\pi_k(X_i; \beta)\right\} + (1 - \Delta_i)\log\left\{\sum_{k=1}^{K} S(Y_i, \eta_k)\,\pi_k(X_i; \beta)\right\}\right], \quad (3)$$

where $\theta = (\eta^T, \beta^T)^T$, $\eta = (\eta_1^T, \ldots, \eta_K^T)^T$, and $f(t, \eta_k) = -S'(t, \eta_k)$.
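
For concreteness, the observed-data log-likelihood (3) under the Weibull specification can be coded directly; a minimal sketch (vector arguments `kappa` and `lambda` of length K, coefficient matrix `beta` as above):

```r
# Observed-data log-likelihood (3) for the K-component Weibull mixture.
loglik_obs <- function(y, delta, X, kappa, lambda, beta) {
  P <- exp(X %*% beta)
  P <- P / rowSums(P)                          # pi_k(X_i; beta), an n x K matrix
  K <- length(kappa)
  S <- sapply(1:K, function(k)                 # survival S(Y_i, eta_k)
    pweibull(y, shape = kappa[k], scale = lambda[k], lower.tail = FALSE))
  f <- sapply(1:K, function(k)                 # density f(Y_i, eta_k)
    dweibull(y, shape = kappa[k], scale = lambda[k]))
  sum(delta * log(rowSums(f * P)) + (1 - delta) * log(rowSums(S * P)))
}
```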

To estimate β, we introduce $B_1, \ldots, B_n$ as the latent group memberships of the subjects and use the EM algorithm to compute the maximum likelihood estimators, treating the B's as missing data. In the E-step of each iteration, we compute the expected log-likelihood based on the current estimates of all parameters, conditional on the observed data, which is equivalent to calculating the posterior probability of $B_i = k$ given the observed data, for k = 1, …, K and i = 1, …, n. More specifically, this posterior probability is

$$q_{ik} = \frac{f(Y_i, \eta_k)\,\pi_k(X_i; \beta)}{\sum_{k'=1}^{K} f(Y_i, \eta_{k'})\,\pi_{k'}(X_i; \beta)}$$

if Δi = 1, and it is

$$q_{ik} = \frac{S(Y_i, \eta_k)\,\pi_k(X_i; \beta)}{\sum_{k'=1}^{K} S(Y_i, \eta_{k'})\,\pi_{k'}(X_i; \beta)}$$

if Δi = 0. In the M-step, we compute the estimates that maximize the expected log-likelihood obtained in the E-step,

$$l_n(\eta, \beta) = \sum_{i=1}^{n}\sum_{k=1}^{K} q_{ik}\left[\Delta_i \log f(Y_i, \eta_k) + (1 - \Delta_i)\log S(Y_i, \eta_k) + \log \pi_k(X_i; \beta)\right]. \quad (4)$$

To estimate the survival distribution parameter η, we implement the Newton-Raphson algorithm to update the estimate based on the expected log-likelihood function (4). The expected log-likelihood function (4) is essentially a weighted multinomial regression. To obtain the maximum likelihood estimate $\tilde\beta$, we apply a one-step Newton-Raphson update in the M-step of each iteration. After convergence, we obtain the maximum likelihood estimates $\tilde\eta$ and $\tilde\beta$. It is easy to see that the expected log-likelihood (4) increases at each iteration, which implies that the algorithm is guaranteed to converge and stays unchanged once converged.
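
The following is a minimal sketch of one EM iteration under the Weibull model, reusing the conventions of `loglik_obs` above. As an assumption of this sketch, the M-step for $(\kappa_k, \lambda_k)$ uses a general-purpose optimizer rather than the explicit Newton-Raphson update of the paper, and the β update (the weighted multinomial regression) is deferred to the glmnet sketch in Section 2.3:

```r
# E-step: posterior membership probabilities q_ik of the displays above.
e_step <- function(y, delta, X, kappa, lambda, beta) {
  P <- exp(X %*% beta); P <- P / rowSums(P)
  L <- sapply(seq_along(kappa), function(k)
    ifelse(delta == 1,
           dweibull(y, shape = kappa[k], scale = lambda[k]),
           pweibull(y, shape = kappa[k], scale = lambda[k], lower.tail = FALSE)))
  q <- L * P
  q / rowSums(q)                               # n x K matrix of q_ik
}

# M-step for eta_k = (kappa_k, lambda_k): maximize the q-weighted Weibull
# log-likelihood; the log-parametrization keeps both parameters positive.
m_step_eta <- function(y, delta, q, kappa, lambda) {
  for (k in seq_along(kappa)) {
    nll <- function(p) {
      kap <- exp(p[1]); lam <- exp(p[2])
      -sum(q[, k] * (delta * dweibull(y, kap, lam, log = TRUE) +
                     (1 - delta) * pweibull(y, kap, lam,
                                            lower.tail = FALSE, log.p = TRUE)))
    }
    fit <- optim(log(c(kappa[k], lambda[k])), nll)
    kappa[k] <- exp(fit$par[1]); lambda[k] <- exp(fit$par[2])
  }
  list(kappa = kappa, lambda = lambda)
}
```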

To determine the best number of latent subgroups in the data, we consider several candidate numbers of latent subgroups. For each candidate, we apply the procedure stated above to obtain the initial estimates and then compute the corresponding value of the log-likelihood. The BIC, as suggested by Nylund et al.22, is then calculated to determine the best number of latent subgroups for the data; Nylund et al.22 evaluated the performance of several information criteria for correctly identifying the number of groups. The performance of BIC for determining the number of latent subgroups is also evaluated in Section 4.
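
A sketch of this selection step follows; `fit_mixture` is a hypothetical wrapper around the EM procedure above, assumed to return the maximized log-likelihood `loglik` and the number of free parameters `df`:

```r
# Choose the number of latent subgroups by BIC = -2 * loglik + df * log(n).
select_K <- function(y, delta, X, K_grid = 1:3, fit_mixture) {
  bic <- sapply(K_grid, function(K) {
    fit <- fit_mixture(y, delta, X, K)   # hypothetical EM wrapper
    -2 * fit$loglik + fit$df * log(length(y))
  })
  K_grid[which.min(bic)]
}
```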

2.3 | Variable Selection for Latent Groups

The objective function (4) in the M-step is essentially a weighted multinomial regression, with weights being the posterior probability of Bi = k given the observed data for k = 1, …, K and i = 1, …, n. We use this objective function to accommodate penalties for variable selection. Because of the strict concavity of the objective function (4), we can derive nice theoretical properties for the estimator after variable selection.

Among the many penalty functions, we apply the convex adaptive lasso penalty to the objective function (4). The weight for each coefficient in the adaptive lasso penalty is related to the importance of the corresponding covariate and allows each coefficient to be penalized adaptively with its own tuning parameter. Zou15 showed that the adaptive lasso enjoys the oracle properties by inflating the weights for zero-coefficient covariates while letting the weights of nonzero-coefficient covariates converge to a finite constant. The data-dependent adaptive weights can be the reciprocal of any consistent estimator of β (Zou15); here we use the maximum likelihood estimator $\tilde\beta$. The penalized objective function becomes

$$-l_n(\tilde\eta, \beta) + \lambda \sum_{k=1}^{K}\sum_{j=1}^{d} \frac{|\beta_{kj}|}{|\tilde\beta_{kj}|^{\gamma}}, \quad (5)$$

where γ is a prespecified positive constant (γ = 1 is the most common choice), and $\beta = (\beta_{11}, \beta_{12}, \ldots, \beta_{1d}, \beta_{21}, \ldots, \beta_{Kd})^T$. Here, we do not introduce a penalty on η, and $\tilde\eta$ is its maximum likelihood estimator. Hence, minimizing (5) is equivalent to applying the adaptive lasso penalty to a weighted multinomial regression.

To obtain the adaptive lasso estimates $\hat\beta$, we minimize the penalized objective function (5) via a two-step strategy. The first step is to calculate the maximum likelihood estimates $(\tilde\eta, \tilde\beta)$ that optimize (4) by iterative Newton-Raphson updates. Denote θ = (η, β), and define the gradient vector $\nabla l_n(\theta) = \partial l_n(\theta)/\partial\theta$ and the Hessian matrix $\nabla^2 l_n(\theta) = \partial^2 l_n(\theta)/\partial\theta\,\partial\theta^T$. The Newton-Raphson update is

$$\theta^{(t+1)} = \theta^{(t)} - \left\{\nabla^2 l_n(\theta)\big|_{\theta=\theta^{(t)}}\right\}^{-1} \nabla l_n(\theta)\big|_{\theta=\theta^{(t)}}. \quad (6)$$

The second step is to obtain the adaptive lasso estimates $\hat\beta$ by minimizing (5) via a coordinate descent algorithm, which cycles over the coefficients to minimize (5).

Hence, to minimize the penalized objective function (5) for any fixed γ, we use the following procedure.

  • Step 1.

    Use the EM algorithm and the Newton-Raphson update (6) to compute the maximum likelihood estimates η˜ and β˜.

  • Step 2.

    Calculate the weights in the adaptive lasso penalty, $\tilde{w}_{kj} = 1/|\tilde\beta_{kj}|^{\gamma}$ for k = 1, …, K and j = 1, …, d, by using $\tilde\beta$.

  • Step 3.

    Compute the weights, qik for i = 1, 2, …, n, and k = 1, 2, …, K, in the weighted multinomial regression ln(η˜,β) by using η˜ and β˜.

  • Step 4.

    Apply the coordinate descent algorithm to minimize the penalized objective function (5) until the convergence criterion is met.

In Step 3, the weights in the weighted multinomial regression are obtained by plugging in the estimates $\tilde\eta$ and $\tilde\beta$; since $\tilde\eta$ and $\tilde\beta$ are consistent maximum likelihood estimates, these plug-in weights are close to the true weights, as shown in the Supporting Information. The minimization in Step 4 is based on the coordinate descent algorithm, which can be implemented via a statistical package such as glmnet in R.
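
Concretely, Steps 2–4 can be sketched with glmnet as follows. The multinomial family in glmnet accepts a matrix response, so the posterior weights $q_{ik}$ from Step 3 can be passed directly as class proportions. One caveat: glmnet's `penalty.factor` is per variable rather than per class-specific coefficient, so the per-variable weight below (an assumption of this sketch) only approximates the per-coefficient adaptive weights in penalty (5):

```r
library(glmnet)

adaptive_lasso_step <- function(X, q, beta_tilde, gamma = 1) {
  # X: n x d covariate matrix WITHOUT the constant column (glmnet adds its own
  #    unpenalized intercept); q: n x K matrix of posterior weights q_ik;
  #    beta_tilde: d x K initial MLE coefficients (intercept rows excluded).
  w <- 1 / apply(abs(beta_tilde), 1, max)^gamma  # per-variable adaptive weight
  glmnet(X, q, family = "multinomial", penalty.factor = w)
}
```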

To select the data-dependent tuning parameter λ in the proposed algorithm, we use V-fold cross-validation. We consider λ on a set of grid points and partition the data into V subsets of equal size. For each value of λ, we fit the coefficients on V − 1 subsets and compute the deviance residual on the held-out subset, repeating this V times. Averaging the V deviance residuals gives an average deviance residual for each value of λ, and we choose the tuning parameter λ that yields the smallest average deviance residual. After variable selection, we reapply the EM algorithm and maximize (3) including only the selected important covariates. We then classify patients into their most likely latent subgroups based on the post-selection maximum likelihood estimates.
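
Continuing the sketch above, the cross-validation step can be delegated to `cv.glmnet`, whose multinomial deviance plays the role of the average deviance residual criterion; the grid $2^{16}, \ldots, 2^{-16}$ matches Section 4, and V = 10 folds here is an illustrative choice:

```r
# V-fold cross-validation over the lambda grid (X, q, w as in the sketch above).
cv_fit <- cv.glmnet(X, q, family = "multinomial", lambda = 2^(16:-16),
                    penalty.factor = w, nfolds = 10,
                    type.measure = "deviance")
best_lambda <- cv_fit$lambda.min   # lambda with the smallest CV deviance
```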

3 | THEORETICAL PROPERTIES

In this section, we describe the asymptotic properties of our estimators when the number of parameters grows with the sample size. With a slight abuse of notation, we write $\beta_n = (\beta_{n1}^T, \ldots, \beta_{nK}^T)^T = (\beta_{n1}, \ldots, \beta_{np_n})^T$, where $p_n$ is the number of variables, and $\theta_n = (\eta^T, \beta_n^T)^T$. We consider the penalized objective function based on n samples,

$$Q_n(\theta_n) = l_n(\theta_n) - n\lambda_n \sum_{j=1}^{p_n} \frac{|\beta_{nj}|}{|\tilde\beta_{nj}|^{\gamma}}.$$

Denote the true value of $\theta_n$ by $\theta_{n0}$, and write $\theta_{n0} = (\eta_0^T, \beta_{n10}^T, \beta_{n20}^T)^T$, where

$$\beta_{n10} = (\beta_{n10}, \beta_{n20}, \ldots, \beta_{nq0})^T$$

consists of all q nonzero components and

$$\beta_{n20} = (\beta_{n(q+1)0}, \beta_{n(q+2)0}, \ldots, \beta_{np_n0})^T$$

consists of the remaining zero components. Correspondingly, we have the adaptive lasso estimator $\hat\theta_n = (\hat\eta^T, \hat\beta_{n1}^T, \hat\beta_{n2}^T)^T$.

We require the following regularity conditions.

  • (C1)

    The function S(t, ηk) for k = 1, 2, …, K is non-increasing and continuously differentiable.

  • (C2)
    Let $g(X_i, Y_i, \Delta_i, \theta_n)$ denote the probability density of the observation $\{X_i, Y_i, \Delta_i\}$, for i = 1, 2, …, n. The observations $\{X_i, Y_i, \Delta_i\}$, i = 1, 2, …, n, are independent and identically distributed. Let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the minimum and maximum eigenvalues of a positive definite matrix A, respectively. Assume that, for all i, the Fisher information matrix
    $$I_n(\theta_n) = E\left[\left(\frac{\partial \log g(X_i, Y_i, \Delta_i, \theta_n)}{\partial \theta_n}\right)\left(\frac{\partial \log g(X_i, Y_i, \Delta_i, \theta_n)}{\partial \theta_n}\right)^T\right]$$
    satisfies
    $$C_1 \le \lambda_{\min}\{I_n(\theta_n)\} \le \lambda_{\max}\{I_n(\theta_n)\} \le C_2$$
    and, for j, l = 1, 2, …, $p_n$,
    $$E\left[\left(\frac{\partial \log g(X_i, Y_i, \Delta_i, \theta_n)}{\partial \eta}\right)^T\left(\frac{\partial \log g(X_i, Y_i, \Delta_i, \theta_n)}{\partial \eta}\right)\right]^2 \le C_3$$
    and
    $$E\left[\frac{\partial \log g(X_i, Y_i, \Delta_i, \theta_n)}{\partial \beta_{nj}}\,\frac{\partial \log g(X_i, Y_i, \Delta_i, \theta_n)}{\partial \beta_{nl}}\right]^2 \le C_4,$$
    where $C_1$, $C_2$, $C_3$ and $C_4$ are positive constants.
  • (C3)
    $\theta_{n0}$ is contained in a sufficiently large open set. For all $\theta_n$ within this open set, the third derivatives of $g(X_i, Y_i, \Delta_i, \theta_n)$ with respect to $\beta_n$ satisfy
    $$\left|\frac{\partial^3 \log g(X_i, Y_i, \Delta_i, \theta_n)}{\partial \beta_{nj}\,\partial \beta_{nl}\,\partial \beta_{nm}}\right| \le M_{njlm}(X_i, Y_i, \Delta_i)$$
    and
    $$E\left[M_{njlm}^2(X_i, Y_i, \Delta_i)\right] \le C_5,$$
    where $C_5$ is a positive constant, for j, l, m = 1, 2, …, $p_n$.
  • (C4) Assume that
    $$\min_{1 \le j \le q} |\beta_{nj0}| / \lambda_n \to \infty \quad \text{as } n \to \infty.$$

Condition (C1) requires $S(t, \eta_k)$, k = 1, 2, …, K, to be a valid survival distribution. Conditions (C2) and (C3) are similar to conditions (F) and (G) in Fan et al.23, which assume that the likelihood function is reasonably well behaved. Condition (C4) is used to establish the oracle property of the adaptive lasso estimator and is already implicitly assumed in the finite-dimensional setting. This condition is exactly condition (H) in Fan et al.23, which allows the nonzero coefficients to vanish, but at a rate that can be distinguished by the penalized likelihood.

Under conditions (C1) – (C4), we have the following asymptotic results for our estimators.

Theorem 1.

Denote the maximum likelihood estimate of $l_{n,\mathrm{obs}}(\theta_n)$ by $\tilde\theta_n$, where

$$l_{n,\mathrm{obs}}(\theta_n) = \sum_{i=1}^{n}\left[\Delta_i \log\left\{\sum_{k=1}^{K} f(Y_i, \eta_k)\,\pi_k(X_i; \beta_n)\right\} + (1 - \Delta_i)\log\left\{\sum_{k=1}^{K} S(Y_i, \eta_k)\,\pi_k(X_i; \beta_n)\right\}\right].$$

If $p_n^4/n \to 0$ as $n \to \infty$, then $\|\tilde\theta_n - \theta_{n0}\| = O_p(\sqrt{p_n}\, n^{-1/2})$.

Theorem 2.

If $\sqrt{n\,p_n}\,\lambda_n = O(1)$ and $p_n^4/n \to 0$ as $n \to \infty$, then there is a unique maximizer $\hat\theta_n$ of $Q_n(\theta_n)$ such that $\|\hat\theta_n - \theta_{n0}\| = O_p(\sqrt{p_n}\, n^{-1/2})$.

Finally, we provide the asymptotic distribution of the adaptive lasso estimator. We let

$$b_n = \left\{0, \ldots, 0,\ \lambda_n\,\mathrm{sign}(\beta_{n10})/|\tilde\beta_{n1}|^{\gamma},\ \ldots,\ \lambda_n\,\mathrm{sign}(\beta_{nq0})/|\tilde\beta_{nq}|^{\gamma}\right\}^T,$$

$\theta_{n1} = (\eta^T, \beta_{n1}^T, 0^T)^T$ and $\theta_{n10} = (\eta_0^T, \beta_{n10}^T, 0^T)^T$. Let s be the number of parameters for the survival distributions of the K latent subgroups. The first s zeros in $b_n$ reflect the fact that we do not penalize the parameters of the survival distributions.

Theorem 3.

If $\sqrt{n}\,\lambda_n \to 0$, $n\lambda_n/\sqrt{p_n} \to \infty$ and $p_n^5/n \to 0$ as $n \to \infty$, then under the conditions of Theorem 1, the adaptive lasso estimator $\hat\theta_n$ has the following properties:

  1. $\hat\beta_{n2} = 0$ with probability tending to 1;

  2. $\sqrt{n}\,A_n I_n^{-1/2}(\theta_{n10})\{I_n(\theta_{n10})\}\left[\hat\theta_{n1} - \theta_{n10} + \{I_n(\theta_{n10})\}^{-1} b_n\right] \xrightarrow{D} N(0, G)$,
    where $A_n$ is an $r \times (s + q)$ matrix such that $A_n A_n^T \to G$, and $G$ is an $r \times r$ nonnegative definite symmetric matrix.

One key step in the proofs is to obtain a uniform approximation rate for the weights in the expression of $Q_n(\theta_n)$; for this, we use the result established in Theorem 1. The proofs of Theorems 2 and 3 then follow the standard arguments in variable selection for parametric models, including the existence of a local maximum in a neighborhood of the true parameters and verification that the oracle estimator attains this local maximum, but with careful verification of certain approximation rates in terms of $p_n$. The details of the proofs are given in the Supporting Information. The theoretical properties of the post-selection estimator, that is, the maximum likelihood estimator of the selected important variables after refitting the model without the adaptive lasso penalty, are easily obtained: by Theorem 3, the probability that the adaptive lasso estimates of the unimportant variables are nonzero tends to 0, so the post-selection estimator has the same asymptotic distribution as the adaptive lasso estimator of the important variables, as stated in Theorem 3.

4 | SIMULATION STUDIES

We conduct a simulation study that assumes two latent subgroups exist. We consider 10, 30 and 50 covariates in the regression model, of which only a few have nonzero effects. The covariates X = (X1, X2, …, Xp), where p = 10, 30, 50, are generated from a standard normal distribution with moderate correlations. The time-to-event data for each latent subgroup follow a different Weibull distribution with scale parameter λ and shape parameter κ. The censoring time is generated from an exponential distribution whose mean is calibrated to a prespecified censoring rate of 10%.

The true values of the scale parameters (i.e., λ1 and λ2) of the Weibull distributions for the two latent subgroups are set to 1 and 4.5, respectively, and the true values of the shape parameters (i.e., κ1 and κ2) are 1 and 3, respectively. Around 40% of the individuals belong to latent subgroup 1, which has a 2-year survival probability of 13.5%; the other 60% are in latent subgroup 2, which has a 2-year survival probability of 91.5%. The subgroup-specific survival curves are illustrated in Figure 1. The true β associated with latent subgroup membership is calibrated so that the true proportions of the two subgroups are 40% and 60%, respectively. The three scenarios of the simulation study are described below. A sensitivity analysis that evaluates the proposed model when the link function of the multinomial distribution for the latent subgroup membership is nonlinear is included in the Supporting Information.

FIGURE 1. The true survival curves in the simulation study.

  • Scenario 1.

    10 covariates are independently generated from a standard normal distribution, and the first three of them are important. Subgroup 1 is regarded as the reference group and β1 is set to 0.

    $$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} 0, & 0, & 0, & 0, & 0, & \ldots, & 0 \\ 0.4, & 0.2, & 0.6, & 0.3, & 0, & \ldots, & 0 \end{pmatrix}$$
  • Scenario 2.

    30 covariates are generated from a standard normal distribution, and the first eight of them have nonzero effects. The correlations between X1 and X2, X3 and X4, and X7 and X10 are set to 0.2, 0.3 and 0.2, respectively.

    $$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} 0, & 0, & 0, & 0, & 0, & 0, & 0, & 0, & 0, & 0, & \ldots, & 0 \\ 0.4, & 0.2, & 0.6, & 0.3, & 0.5, & 0.5, & 0.7, & 0.7, & 0.5, & 0, & \ldots, & 0 \end{pmatrix}$$
  • Scenario 3.

    50 covariates are generated from a standard normal distribution, and the first eight of them have nonzero effects. The correlations between X1 and X2, X3 and X4, and X7 and X10 are set to 0.2, 0.3 and 0.2, respectively. The true values of the regression coefficients for the important variables are the same as in Scenario 2.
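
For concreteness, the following is a minimal sketch (not the paper's simulation code) of generating one dataset under Scenario 1. The ordering of the nonzero coefficients, with the intercept first, is our reading of the β2 vector above, and the censoring mean is illustrative rather than calibrated to the 10% rate:

```r
set.seed(1)
n <- 1000; p <- 10
X <- cbind(1, matrix(rnorm(n * p), n, p))        # constant 1 plus 10 covariates
beta2 <- c(0.4, 0.2, 0.6, 0.3, rep(0, p - 3))    # Scenario 1 values for group 2
pr2 <- plogis(drop(X %*% beta2))                 # P(B = 2 | X), since beta_1 = 0
B <- rbinom(n, 1, pr2) + 1                       # latent membership in {1, 2}
kappa <- c(1, 3); lambda <- c(1, 4.5)            # subgroup Weibull parameters
T0 <- rweibull(n, shape = kappa[B], scale = lambda[B])
C0 <- rexp(n, rate = 1 / 30)                     # illustrative censoring mean
y <- pmin(T0, C0); delta <- as.numeric(T0 <= C0) # observed time, event indicator
```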

To implement the EM algorithm for obtaining the maximum likelihood estimators of the β's and the survival distribution parameters $\kappa_k, \lambda_k$ for k = 1, 2, …, K, the stopping criterion for EM is $l_{\mathrm{obs}}(\theta^{(k+1)}) - l_{\mathrm{obs}}(\theta^{(k)}) < 10^{-4}$, where $\theta = (\kappa_1, \lambda_1, \ldots, \kappa_K, \lambda_K, \beta^T)^T$. For γ in the adaptive lasso penalty, we use γ = 1 in all simulation studies. For each simulated dataset, we first identify the number of latent subgroups by applying our method for estimation and calculating BIC. Once the number of latent subgroups is determined, we use the EM algorithm to obtain the maximum likelihood estimates and then implement the adaptive lasso procedure to perform variable selection. We consider the grid $2^{-16}, 2^{-15}, \ldots, 2^{15}, 2^{16}$ for the tuning parameter λ and report the results that yield the smallest average deviance residual. After variable selection, the EM algorithm is reapplied to the models with only the selected covariates. We repeat the simulation 1000 times and consider sample sizes of n = 300, 1000 and 3000.

We first calculate BIC for models assuming no latent subgroups, two latent subgroups, and three latent subgroups in datasets that truly consist of two latent subgroups. At a sample size of 1000, BIC identifies two latent groups in approximately 100% of the datasets; when the sample size increases to 3000, all datasets are correctly identified as consisting of two latent subgroups.

Table 1 summarizes the prediction accuracy, along with standard errors, for models without and after variable selection for all three scenarios, and also reports the average numbers of correct and incorrect zero coefficients with the corresponding standard errors. The prediction accuracy is calculated by applying the decision rule obtained from the training set to a validation set of size 10,000. The optimal accuracy rate is 1 minus the Bayes error rate, where the Bayes error rate, calculated as $1 - E\{\max_k P(B = k \mid X)\}$, is the lowest possible test error rate. Compared to the optimal accuracy rate, our method performs well in all three scenarios, especially for models with only the selected important covariates, and the prediction accuracy approaches the optimal accuracy rate as the sample size increases. When the number of covariates increases, our method also works well in terms of prediction accuracy and variable selection. For Scenario 1, when the sample size is 300, the important variables are correctly selected in approximately 80% of the datasets, and unimportant variables are selected in approximately 15% of the datasets. As the sample size increases to 1000, the important variables are identified in over 99% of the datasets; meanwhile, the ability to shrink zero coefficients to zero also improves, with the rate of incorrectly selecting unimportant variables falling below 5%. For Scenario 2, where the number of covariates increases to 30, the important variables are correctly identified in around 80% of the datasets when the sample size is 300, while unimportant variables are selected in around 32% of the datasets. The ability to identify important variables and shrink zero coefficients to zero improves when the sample size grows to 1000: the important covariates are picked out in approximately 95% of the datasets, and our method selects unimportant variables in only 5% of the datasets. For Scenario 3, with 50 covariates, when the sample size is 300 the important variables are correctly distinguished in around 80% of the datasets and the rate of incorrectly selecting unimportant variables is 35%. As the sample size increases to 1000, the rates of identifying important variables and of selecting unimportant variables are 80% and 11%, respectively. When the sample size grows further to 3000, the important variables are correctly recognized in over 99% of the datasets and the rate of incorrectly selecting unimportant variables decreases to around 1%.

Table 2 reports the biases of the nonzero post-selection coefficient estimates, their standard errors, and coverage probabilities of the nominal 95% confidence intervals for Scenario 1; because of limited space, results for Scenarios 2 and 3 are reported in the Supporting Information. To obtain the standard errors of the maximum likelihood estimates, we use the Louis formula24 because the latent group membership is treated as missing data in our method. For these three scenarios, we observe similar results: the post-selection estimates are slightly biased in small samples, the bias decreases as the sample size increases, and the 95% confidence intervals for the post-selection estimators based on the estimated coefficients and standard errors have accurate coverage of the true parameters.
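
For reference, the Louis formula used for the standard errors is the standard missing-information identity (stated here in general form; $l_c$ denotes the complete-data log-likelihood that treats the memberships $B_i$ as observed):

$$I_{\mathrm{obs}}(\hat{\theta}) = E\!\left[-\frac{\partial^2 l_c(\theta)}{\partial\theta\,\partial\theta^T}\,\Big|\,\text{observed data}\right]_{\theta=\hat{\theta}} - \operatorname{Var}\!\left[\frac{\partial l_c(\theta)}{\partial\theta}\,\Big|\,\text{observed data}\right]_{\theta=\hat{\theta}},$$

and the standard errors are the square roots of the diagonal entries of $I_{\mathrm{obs}}(\hat\theta)^{-1}$.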

TABLE 1.

Results from the simulation study with 2 latent groups

Scenario 1: 10 independent covariates, 3 of them important; the optimal accuracy rate is 0.648.

| N | Accuracy (SE), w/o var. sel. | Accuracy (SE), after var. sel. | Corr. (SE) | Incorr. (SE) |
|---|---|---|---|---|
| 300 | 0.617 (0.019) | 0.619 (0.030) | 5.70 (1.754) | 0.63 (0.876) |
| 1000 | 0.636 (0.009) | 0.640 (0.010) | 6.53 (0.946) | 0.08 (0.278) |
| 3000 | 0.643 (0.006) | 0.645 (0.015) | 6.90 (0.326) | 0.00 (0.045) |

Scenario 2: 30 covariates with moderate correlations, 8 of them important; the optimal accuracy rate is 0.732.

| N | Accuracy (SE), w/o var. sel. | Accuracy (SE), after var. sel. | Corr. (SE) | Incorr. (SE) |
|---|---|---|---|---|
| 300 | 0.675 (0.020) | 0.680 (0.029) | 13.47 (6.574) | 0.86 (1.146) |
| 1000 | 0.716 (0.007) | 0.724 (0.008) | 19.71 (3.245) | 0.24 (0.459) |
| 3000 | 0.728 (0.005) | 0.731 (0.005) | 21.41 (1.256) | 0.03 (0.159) |

Scenario 3: 50 covariates with moderate correlations, 8 of them important; the optimal accuracy rate is 0.732.

| N | Accuracy (SE), w/o var. sel. | Accuracy (SE), after var. sel. | Corr. (SE) | Incorr. (SE) |
|---|---|---|---|---|
| 300 | 0.653 (0.019) | 0.666 (0.030) | 27.39 (10.290) | 0.84 (1.178) |
| 1000 | 0.705 (0.008) | 0.720 (0.010) | 36.45 (6.969) | 0.22 (0.430) |
| 3000 | 0.724 (0.005) | 0.731 (0.005) | 40.86 (2.783) | 0.03 (0.179) |

Note. Accuracy columns report prediction accuracy with standard errors; Corr. (SE) and Incorr. (SE) report the average numbers of correct and incorrect zero coefficients, with standard errors, in the group 2 vs. group 1 coefficient vector, over 1000 simulated datasets.

TABLE 2.

Maximum likelihood estimates after variable selection, their standard errors, and coverage probabilities for nominal 95% confidence intervals from the simulation study with 2 latent groups

| N | Parameter | Bias | SE | SEE | CP |
|---|---|---|---|---|---|
| 300 | κ1 | 0.035 | 0.113 | 0.109 | 0.954 |
| | λ1 | 0.015 | 0.274 | 0.254 | 0.852 |
| | κ2 | 0.096 | 0.422 | 0.354 | 0.915 |
| | λ2 | −0.013 | 0.210 | 0.190 | 0.902 |
| | β0 | −0.003 | 0.350 | 0.318 | 0.901 |
| | β1 | −0.033 | 0.186 | 0.110 | 0.605 |
| | β2 | −0.002 | 0.277 | 0.172 | 0.866 |
| | β3 | 0.005 | 0.202 | 0.145 | 0.807 |
| 1000 | κ1 | 0.013 | 0.062 | 0.061 | 0.946 |
| | λ1 | −0.000 | 0.166 | 0.160 | 0.890 |
| | κ2 | 0.007 | 0.203 | 0.190 | 0.931 |
| | λ2 | −0.015 | 0.109 | 0.109 | 0.940 |
| | β0 | 0.010 | 0.195 | 0.187 | 0.916 |
| | β1 | −0.008 | 0.101 | 0.080 | 0.883 |
| | β2 | −0.017 | 0.101 | 0.097 | 0.949 |
| | β3 | −0.006 | 0.088 | 0.089 | 0.962 |
| 3000 | κ1 | 0.004 | 0.036 | 0.036 | 0.942 |
| | λ1 | 0.002 | 0.099 | 0.099 | 0.925 |
| | κ2 | 0.006 | 0.110 | 0.111 | 0.952 |
| | λ2 | −0.003 | 0.065 | 0.064 | 0.949 |
| | β0 | 0.002 | 0.111 | 0.112 | 0.939 |
| | β1 | 0.000 | 0.049 | 0.050 | 0.954 |
| | β2 | −0.007 | 0.056 | 0.055 | 0.948 |
| | β3 | −0.005 | 0.051 | 0.051 | 0.947 |

Note: SE, standard error; SEE, mean of standard error estimator; CP, coverage probability for nominal 95% confidence interval.

5 | REAL DATA APPLICATION

Our proposed method is applied to two datasets (see the Supporting Information for the other real data example). We apply the proposed methodology to data from a breast cancer clinical trial to study the potential heterogeneity of patients in terms of their survival outcomes and to investigate important variables associated with such heterogeneity. The data were collected from a large clinical trial, IBCSG Trial VI25, in premenopausal women with node-positive breast cancer, which studied both the duration of adjuvant chemotherapy and the reintroduction of delayed chemotherapy. Patients were randomized in a 2 × 2 factorial design to receive: (A) cyclophosphamide, methotrexate, and fluorouracil (CMF) for six consecutive cycles (CMF*6); (B) CMF*6 plus three single cycles of reintroduction CMF; (C) CMF*3; or (D) CMF*3 plus three single cycles of reintroduction CMF. The patients' quality of life (QOL) was also measured at baseline and was hypothesized to contain prognostic information reflecting breast cancer progression. Four aspects of QOL, namely physical well-being, mood, appetite and perceived coping, were assessed with a self-assessment QOL questionnaire. In addition to treatment assignment and the patients' QOL, the data contain disease-free survival (DFS), event status, age at baseline, estrogen receptor (ER) status (1 = positive, 0 = negative) and the number of positive nodes of the tumor (i.e., node group, 1 = number of positive nodes > 4, 0 = otherwise). After excluding missing values, data are available for 962 patients. The median follow-up for DFS is 7.47 years and the event rate is around 45%. We rescale the DFS to [0, 1] and standardize the continuous variables, namely age and the four QOL measures, for computation.

Our first step is to investigate whether latent subgroups exist in the data. BIC suggests that the data contain two latent subgroups, with a value of 806.1; the BIC values of the models assuming no latent subgroup and three latent subgroups are 962.0 and 886.5, respectively. We assume that the survival outcomes for patients in different latent subgroups follow different Weibull distributions. Next, the survival distributions and the coefficients of the covariates are estimated in the mixture model via the EM algorithm and Newton-Raphson. The estimated shape parameters of the Weibull distributions for the two latent subgroups are 3.05 and 1.90, respectively, with corresponding scale parameter estimates of 0.24 and 1.43. The regression coefficient estimates for the covariates are summarized in Table 3. Based on the initial estimates from this model, we then implement our variable selection procedure to identify important covariates associated with the latent subgroup membership assignment, and after variable selection we refit the model using the selected covariates. The estimated Weibull distributions for the two latent subgroups are very close to those without variable selection: the survival distribution for latent group 1 has a shape parameter estimate of 3.04 and a scale parameter estimate of 0.24, and that for latent group 2 has a shape parameter estimate of 1.91 and a scale parameter estimate of 1.43. The adaptive lasso estimates and the maximum likelihood estimates after variable selection can be found in Table 3. The latent subgroup membership is associated with age and the number of positive nodes. The latent subgroup membership of each individual is then predicted, based on the maximum likelihood estimates both without and after variable selection. The model without variable selection assigns 74 of the 962 individuals (about 7.7%) to latent group 1; after variable selection, 63 individuals (≈ 6.5%) belong to latent group 1. Kaplan-Meier curves for the two latent subgroups, shown in the left panel of Figure 2, illustrate their survival profiles.

TABLE 3.

Parameter estimates for the IBCSG trial data

MLE (p-value) from the overall model and from the treatment-specific models for treatments B and C.

| Covariates | Overall, w/o var. sel. | Overall, after var. sel. | Trt B, w/o var. sel. | Trt B, after var. sel. | Trt C, w/o var. sel. | Trt C, after var. sel. |
|---|---|---|---|---|---|---|
| (intercept) | 1.21 (<0.0001) | 1.21 (<0.0001) | 1.09 (0.0017) | 1.05 (0.0019) | 0.90 (0.0176) | 0.87 (0.0009) |
| age | 0.25 (0.0034) | 0.25 (0.0030) | 0.16 (0.3316) | 0.16 (0.3309) | 0.57 (0.0015) | 0.55 (0.0017) |
| node | −1.06 (<0.0001) | −1.05 (<0.0001) | −1.74 (<0.0001) | −1.77 (0.0033) | −1.26 (0.0007) | −1.16 (0.0011) |
| ER status | 0.13 (0.4869) | 0 (−) | 0.32 (0.3811) | 0.32 (0.3740) | 0.03 (0.9277) | 0 (−) |
| physical | 0.08 (0.4456) | 0 (−) | −0.17 (0.4591) | −0.08 (0.7095) | 0.22 (0.2959) | 0 (−) |
| mood | −0.24 (0.0375) | 0 (−) | −0.02 (0.0129) | −0.42 (0.0451) | −0.36 (0.1271) | 0 (−) |
| appetite | 0.04 (0.6430) | 0 (−) | 0.26 (0.2178) | 0 (−) | 0.02 (0.8964) | 0 (−) |
| cope | 0.21 (0.0302) | 0 (−) | 0.27 (0.2008) | 0 (−) | 0.32 (0.0953) | 0 (−) |
| trtB | −0.03 (0.8884) | 0 (−) | - | - | - | - |
| trtC | −0.17 (0.4925) | 0 (−) | - | - | - | - |
| trtD | −0.08 (0.7354) | 0 (−) | - | - | - | - |

Note. “node”: the number of positive nodes in the tumor; “physical”: physical well-being; “cope”: perceived coping.

FIGURE 2. Kaplan-Meier curves for the latent subgroups and for the subgroups determined by age and the number of positive nodes.

Note: "Time" is disease-free survival rescaled to [0, 1].

The results of variable selection indicate that treatment does not have a significant effect on the latent subgroup membership assignment. To further explore the heterogeneity of patients under different therapeutic procedures, we apply our method to the patients under each treatment separately. According to BIC, two latent subgroups are detected among patients on treatment B and among patients on treatment C, and no latent subgroup is identified among patients on treatments A and D. More specifically, for treatment A, the BIC values of the models assuming no latent subgroup, two latent subgroups and three latent subgroups are 257.5, 311.7 and 382.4, respectively. For treatment B, the BIC values of these three models are 258.2, 221.0 and 347.2; for treatment C, they are 225.0, 203.9 and 260.0; and for treatment D, they are 250.0, 278.4 and 305.1. Table 3 reports the estimated regression coefficients for patients under treatment B and under treatment C. Assuming that the survival outcomes for patients follow Weibull distributions, among patients under treatment B the estimates based on the model without variable selection yield a shape parameter of 3.06 and a scale parameter of 0.24 for latent group 1, and a shape parameter of 1.88 and a scale parameter of 1.44 for latent group 2. For patients under treatment C, the corresponding estimates without variable selection are a shape parameter of 2.57 and a scale parameter of 0.28 for latent group 1, and a shape parameter of 2.21 and a scale parameter of 1.74 for latent group 2. After variable selection, the two estimated Weibull distributions yield $\hat\kappa_1 = 2.88$, $\hat\lambda_1 = 0.27$, $\hat\kappa_2 = 1.87$ and $\hat\lambda_2 = 1.84$ for patients under treatment B, and $\hat\kappa_1 = 2.57$, $\hat\lambda_1 = 0.28$, $\hat\kappa_2 = 2.10$ and $\hat\lambda_2 = 1.75$ for patients under treatment C. The middle and right panels of Figure 2 show the survival profiles of the two latent groups among patients under treatments B and C. After obtaining the predicted latent group memberships, we perform a log-rank test to evaluate the difference between the survival profiles of the two latent groups among patients under treatment B and under treatment C. The p-values from the log-rank tests are smaller than 0.0001, which implies that, under both treatments, the two latent groups differ significantly in their survival profiles.

Based on the results of variable selection, we find that for patients under treatment B, the latent subgroup membership assignment is associated with age, the number of positive nodes, ER status, physical well-being and mood. For patients under treatment C, only age and the number of positive nodes are predictive of the latent subgroup membership, which agrees with the findings of the overall model. With these findings, we conclude that some latent subgroups of patients under treatments B and C respond to treatment differently, driven by important covariates such as age and the number of positive nodes. It is therefore of interest to study further how the treatments work in subgroups determined by these important covariates. We create four subgroups of patients based on dichotomized age and the number of positive nodes; more explicitly, we dichotomize age at a threshold of 40 years, which is drawn from previous findings of the IBCSG trial26. A Cox proportional hazards model with treatment as the only covariate is then fitted to each subgroup of patients to evaluate the treatment effect. In the subgroup of patients aged less than 40 years with more than 4 positive nodes, treatment C has a significant effect on the survival outcomes (p-value = 0.031): compared with treatment A, the hazard ratio of treatment C is 2.48 with 95% confidence interval (1.085, 5.666), which implies that the hazard for patients treated with CMF*3 is higher than for patients treated with CMF*6. Kaplan-Meier curves for patients under each treatment in the four subgroups are shown in Figure 3.
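
This subgroup-specific analysis is a standard Cox fit; a minimal sketch with the R survival package (the data frame `ibcsg` and its column names `dfs`, `event`, `trt`, `age40`, `node4` are hypothetical placeholders, since the IBCSG data are not publicly available):

```r
library(survival)

# Cox model with treatment as the only covariate, within one dichotomized
# subgroup (age < 40 years and more than 4 positive nodes).
sub <- subset(ibcsg, age40 == "<40" & node4 == ">4")
fit <- coxph(Surv(dfs, event) ~ trt, data = sub)
summary(fit)  # hazard ratios relative to treatment A (the reference level)
```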

FIGURE 3. Kaplan-Meier curves for patients under each treatment in different subgroups.

Note: Top-left panel: patients aged less than 40 years with 4 or fewer positive nodes. Top-right panel: patients aged more than 40 years with 4 or fewer positive nodes. Bottom-left panel: patients aged less than 40 years with more than 4 positive nodes. Bottom-right panel: patients aged more than 40 years with more than 4 positive nodes. Treatment A is the reference for the hazard ratio and confidence interval estimates. The p-value corresponds to the log-rank test of the treatment effect in each subgroup. "Time" is disease-free survival rescaled to [0, 1].

6 | CONCLUSION

In this article, we propose a novel algorithm to detect latent subgroups of individuals with different survival profiles and to identify important covariates associated with the latent subgroup membership assignment. We have shown that our proposed estimator is consistent and enjoys the oracle properties when the number of covariates diverges with the sample size. Our method simultaneously estimates the unknown survival distributions and the coefficients that are predictive of the latent subgroup membership, and data with a large number of covariates are handled well through a penalized objective function. The proposed methodology can serve as an exploratory step in clinical trial settings before implementing a subgroup analysis of the treatment effect: the selected important covariates may help to explicitly determine the subgroups and discover how patients in different subgroups respond differently to treatments, and specific treatments could subsequently be developed for a target group of patients. Furthermore, using the proposed algorithm, we can directly classify patients into high-risk and low-risk groups based on their survival profiles. Since the identified classes have distinct survival distributions, each class is clinically meaningful, corresponding to patients with either long or short survival trajectories. The obtained classes can thus be useful for differentiating subgroups of patients in at least the following directions. First, the latent classes can be used for patient recruitment in future clinical trials; for example, we can recruit more patients from the high-risk group to increase the power of a trial. Second, as illustrated in the real data application, our method can be used to identify subgroups of patients who may benefit more from one treatment than the rest, and to explore the baseline characteristics of these subgroups, or their intersections, based on the selected important covariates.

In our proposed algorithm, the survival distributions of the latent subgroups are assumed to follow Weibull distributions with unknown parameters. It is easy to extend our methodology to other parametric distributions, such as the exponential and lognormal distributions. Our parametric framework could also be weakened by assuming a semiparametric baseline hazard function and estimating the baseline cumulative hazard function with Breslow's estimator.

The distribution of the latent subgroup membership given the baseline covariates is assumed to be multinomial. This assumption could be relaxed by considering a tree-based partition of the data, which could identify latent subgroups and select important covariates in a nonparametric framework; such a tree-based method may be helpful for handling data with many covariates. Lastly, our way of selecting the best number of latent subgroups is essentially to apply the proposed method to the data under different assumed numbers of latent subgroups. A nonparametric approach to determining the best number of latent subgroups could be established as well: we could implement the tree-based partition procedure for several choices of the number of latent subgroups, compute the BIC value for each choice, and select the number of latent subgroups with the smallest BIC. Semiparametric approaches and their theoretical justification are currently being investigated.

We assume that the survival distributions of the different latent groups are distinct, which makes the mixture model identifiable. The selection of the number of latent subgroups using BIC can also help to flag non-identifiable cases, where BIC will suggest that no latent subgroups exist in the data. It is well understood that model selection with a fixed number of covariates using the BIC criterion27 can identify the true model consistently28,29. For a diverging number of covariates, the asymptotic behavior of a slightly modified BIC criterion has also been discussed extensively, and the consistency of linear regression model selection with a diverging number of covariates has been studied for penalized estimators30. When the number of covariates diverges in a mixture model setting, although the BIC criterion works well in our empirical studies, we are not aware of theoretical results for BIC in this situation; we will pursue this interesting topic in future work.

Moreover, our proposed approach does not cover the ultra-high dimensional case in which the dimensionality $p_n$ is much larger than the sample size n. In this case, several problems need to be solved: for example, the estimate of the number of latent groups based on BIC may not be consistent31, and the variable selection procedure will be challenged32. Further work is needed to apply our proposed method to ultra-high dimensional data. This work focuses on identifying latent groups that have distinct survival experiences; the same idea can be extended to study latent groups that may respond to treatments differently, which would be characterized by different treatment effects, constant or time-varying. Variable selection will again be important to determine a short list of feature variables for medical decisions. We will pursue these extensions in future work.



DATA ACCESSIBILITY

Data for the applied example (IBCSG trial) are not available for sharing. Data for the other applied example in the Supporting Information are available in the R survival package. Scripts to perform the simulation studies, including data generation and analysis, are available at https://github.com/beilinjia/mixtureSurv.

SUPPORTING INFORMATION

Additional supporting information, including theoretical justifications, additional simulation results and a real data analysis, can be found online in the Supporting Information at the end of this article.

References

  • 1. Liao JJ, Liu GF. A flexible parametric survival model for fitting time to event data in clinical trials. Pharmaceutical Statistics 2019; 18(5): 555–567.
  • 2. Farewell VT. The use of mixture models for the analysis of survival data with long-term survivors. Biometrics 1982: 1041–1046.
  • 3. Larson MG, Dinse GE. A mixture model for the regression analysis of competing risks data. Journal of the Royal Statistical Society: Series C (Applied Statistics) 1985; 34(3): 201–211.
  • 4. Altstein L, Li G. Latent subgroup analysis of a randomized clinical trial through a semiparametric accelerated failure time mixture model. Biometrics 2013; 69(1): 52–61.
  • 5. Shen J, He X. Inference for subgroup analysis with a structured logistic-normal mixture model. Journal of the American Statistical Association 2015; 110(509): 303–312.
  • 6. Wu R, Zheng M, Yu W. Subgroup analysis with time-to-event data under a logistic-Cox mixture model. Scandinavian Journal of Statistics 2016; 43(3): 863–878.
  • 7. Bussy S, Guilloux A, Gaïffas S, Jannot AS. C-mix: A high-dimensional mixture model for censored durations, with applications to genetic data. Statistical Methods in Medical Research 2019; 28(5): 1523–1539.
  • 8. Bennis A, Mouysset S, Serrurier M. Estimation of conditional mixture Weibull distribution with right censored data using neural network for time-to-event analysis. Springer; 2020: 687–698.
  • 9. Chatfield C. Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society: Series A (Statistics in Society) 1995; 158(3): 419–444.
  • 10. Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996; 15(4): 361–387.
  • 11. Steyerberg EW, Eijkemans MJ, Habbema JDF. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. Journal of Clinical Epidemiology 1999; 52(10): 935–942.
  • 12. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 1996; 58(1): 267–288.
  • 13. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 2006; 34(3): 1436–1462.
  • 14. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 2001; 96(456): 1348–1360.
  • 15. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 2006; 101(476): 1418–1429.
  • 16. Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine 1997; 16(4): 385–395.
  • 17. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. Annals of Statistics 2002; 30(1): 74–99.
  • 18. Law MH, Jain AK, Figueiredo M. Feature selection in mixture-based clustering. 2003: 641–648.
  • 19. Law MH, Figueiredo MA, Jain AK. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 2004; 26(9): 1154–1166.
  • 20. Raftery AE, Dean N. Variable selection for model-based clustering. Journal of the American Statistical Association 2006; 101(473): 168–178.
  • 21. Khalili A, Chen J. Variable selection in finite mixture of regression models. Journal of the American Statistical Association 2007; 102(479): 1025–1038.
  • 22. Nylund KL, Asparouhov T, Muthén BO. Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling: A Multidisciplinary Journal 2007; 14(4): 535–569.
  • 23. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics 2004; 32(3): 928–961.
  • 24. Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 1982; 44(2): 226–233.
  • 25. Colleoni M, Litman H, Castiglione-Gertsch M, et al. Duration of adjuvant chemotherapy for breast cancer: a joint analysis of two randomised trials investigating three versus six courses of CMF. British Journal of Cancer 2002; 86(11): 1705–1714.
  • 26. International Breast Cancer Study Group. Duration and reintroduction of adjuvant chemotherapy for node-positive premenopausal breast cancer patients. Journal of Clinical Oncology 1996; 14(6): 1885–1894.
  • 27. Schwarz G. Estimating the dimension of a model. The Annals of Statistics 1978; 6(2): 461–464.
  • 28. Shao J. An asymptotic theory for linear model selection. Statistica Sinica 1997; 7(2): 221–242.
  • 29. Shi P, Tsai CL. Regression model selection - a residual likelihood approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002; 64(2): 237–252.
  • 30. Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2009; 71(3): 671–683.
  • 31. Drton M, Plummer M. A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2017; 79(2): 323–380.
  • 32. Fan J, Samworth R, Wu Y. Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research 2009; 10: 2013–2038.
