Author manuscript; available in PMC: 2022 Jul 8.
Published in final edited form as: Ann Stat. 2022 Feb 16;50(1):487–510. doi: 10.1214/21-aos2117

SEMIPARAMETRIC LATENT-CLASS MODELS FOR MULTIVARIATE LONGITUDINAL AND SURVIVAL DATA

KIN YAU WONG 1,*, DONGLIN ZENG 2, D Y LIN 2
PMCID: PMC9269993  NIHMSID: NIHMS1764505  PMID: 35813218

Abstract

In long-term follow-up studies, data are often collected on repeated measures of multivariate response variables as well as on time to the occurrence of a certain event. To jointly analyze such longitudinal data and survival time, we propose a general class of semiparametric latent-class models that accommodates a heterogeneous study population with flexible dependence structures between the longitudinal and survival outcomes. We combine nonparametric maximum likelihood estimation with sieve estimation and devise an efficient EM algorithm to implement the proposed approach. We establish the asymptotic properties of the proposed estimators through novel use of modern empirical process theory, sieve estimation theory, and semiparametric efficiency theory. Finally, we demonstrate the advantages of the proposed methods through extensive simulation studies and provide an application to the Atherosclerosis Risk in Communities study.

MSC 2010 subject classifications: Primary 62N02; secondary 62G05, 62H30

Keywords and phrases: Censored data, Joint analysis, Mixture models, Nonparametric estimation, Sieve estimation

1. Introduction.

Many clinical and epidemiological studies generate data on repeated measures of response variables at multiple time points as well as on time to the occurrence of a clinical event. In cardiovascular cohort studies, for example, data are often recorded for both repeated measures of risk factors (e.g., blood pressures, cholesterol levels) and time to a cardiovascular event (e.g., stroke, heart attack) or death [5]. Shared random-effect models and joint latent-class models have been proposed to investigate the dynamic relationships among such longitudinal and survival data.

In shared random-effect models, a linear mixed model with a set of unobserved random effects is assumed for the longitudinal outcomes, and a proportional hazards model or transformation model with the same random effects as covariates is assumed for the survival time [4, 23, 18, 24, 8]. The shared random effects account for the dependence between the longitudinal and survival outcomes. These models typically assume that, conditional on the random effects, the distribution of the survival time and the effects of covariates on the longitudinal and survival outcomes are the same across subjects.

Joint latent-class models assume that the population consists of subgroups and that, within each subgroup, subjects have the same distributions of longitudinal and survival outcomes [14, 7]. These models allow the baseline risk of event and the association pattern between the longitudinal and survival outcomes to vary flexibly across subgroups. However, the existing work is mostly confined to fully parametric models. Lin et al. [6] proposed a semiparametric latent-class model with a nonparametric baseline hazard function for the survival time in each latent class but did not investigate the theoretical properties of the proposed nonparametric maximum likelihood estimators (NPMLEs). In fact, such NPMLEs are inconsistent [12, 20]; see Section S1 of the supplementary materials [21].

We propose a general model for the joint analysis of multivariate longitudinal data and survival time. We assume that the population consists of a mixture of latent subgroups such that within each subgroup, the joint distribution of the longitudinal and survival outcomes is described by a separate random-effect model, in which the survival time is characterized by a separate nonparametric baseline hazard function. This model naturally extends those of Henderson, Diggle and Dobson [4] and Tsiatis and Davidian [18] by allowing the existence of latent subgroups. The model can be used to address important scientific questions:

  1. Identification of latent subgroups within a heterogeneous study population;

  2. Estimation of the effects of baseline covariates, such as treatment, on longitudinal and survival outcomes within each subgroup;

  3. Evaluation of the event risk given baseline covariates and trajectories of longitudinal outcomes; and

  4. Estimation of the association between the trajectories of longitudinal outcomes and covariates with proper adjustment for informative dropout due to the occurrence of the event.

The proposed modeling framework also extends existing work by accommodating multivariate longitudinal outcomes measured at multiple time points. This framework is particularly useful in cardiovascular studies, where multiple risk factors, such as blood pressures and cholesterol levels, are repeatedly measured. Including multivariate longitudinal outcomes not only provides a comprehensive depiction of the dynamic relationships among the event of interest and relevant risk factors but also helps identify the latent subgroup structure.

Due to the presence of multiple nonparametric components in the model and the lack of a closed-form expression for the likelihood function, model estimation is highly challenging both theoretically and computationally. To overcome the non-identifiability of the fully nonparametric likelihood approach, we propose to combine nonparametric likelihood estimation with sieve estimation, such that the cumulative hazard function of a reference latent class is estimated by a step function with jumps at the observed event times, and the ratios of the baseline hazard functions across latent classes are estimated by spline functions. We develop a stable and efficient (accelerated) EM algorithm [3] to compute the proposed estimators.

We prove that the proposed estimators are consistent and the parametric components of the estimators are asymptotically efficient. The derivations involve novel applications of empirical process theory, sieve estimation theory, and semiparametric efficiency theory. One major challenge in our theoretical development is to show that the proposed model is identifiable with an invertible information operator. Due to the presence of latent classes, techniques for establishing model identifiability or invertibility of the information operator for semiparametric shared random-effect models are not directly applicable to the current setting. In addition, existing methods for latent-class models are not readily applicable to semiparametric models. To establish model identifiability and the invertibility of the information operator, we note that the likelihood and the score function are the sums of the terms arising from the likelihood of semiparametric shared random-effect models and show that the terms in the summation can be separated by properly varying the observed data values.

The rest of this article is structured as follows. In Section 2, we formulate the model and describe the proposed estimation approach. In Section 3, we discuss the computation of the proposed estimators, and in Section 4, we present the theoretical results. In Section 5, we report the results from our simulation studies. In Section 6, we provide an application to the Atherosclerosis Risk in Communities (ARIC) study [5]. In Section 7, we make some concluding remarks. We relegate technical proofs to the Appendix.

2. Model, likelihood, and sieve estimation.

Suppose that there are G latent classes. Let C denote the latent class membership, with C = g if a subject belongs to the gth latent class (g = 1,…,G). We relate C to a set of time-independent covariates W, which generally includes the constant 1, through a multinomial logistic regression model:

$$P(C = g \mid W) = \frac{e^{\alpha_g^T W}}{\sum_{l=1}^{G} e^{\alpha_l^T W}}, \tag{1}$$

where αg is the vector of class-specific regression parameters. For model identifiability, we set αG = 0. Each latent class is characterized by class-specific trajectories of multivariate longitudinal outcomes and a class-specific risk of the event of interest. The longitudinal outcomes and the event time are assumed to be conditionally independent given the latent class membership and a multivariate random effect.
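As a minimal sketch of model (1), the following Python snippet evaluates the class-membership probabilities for given coefficients, with the last class used as the reference (its coefficient vector fixed at zero). The function name `class_probabilities` and all numerical values are hypothetical and serve only as an illustration.

```python
import numpy as np

def class_probabilities(alpha, W):
    """Return P(C = g | W) under the multinomial logistic model (1).

    alpha : (G-1, p) array of class-specific coefficients; the G-th row is
            fixed at zero for identifiability and is appended here.
    W     : (n, p) covariate matrix (first column typically the constant 1).
    """
    alpha = np.asarray(alpha, dtype=float)
    W = np.asarray(W, dtype=float)
    alpha_full = np.vstack([alpha, np.zeros((1, alpha.shape[1]))])
    lin = W @ alpha_full.T                      # (n, G) linear predictors
    lin -= lin.max(axis=1, keepdims=True)       # stabilise the exponentials
    expo = np.exp(lin)
    return expo / expo.sum(axis=1, keepdims=True)

# Hypothetical values, for illustration only (G = 3 classes, p = 2 covariates).
alpha = [[0.5, -1.0], [0.0, 0.3]]
W = np.array([[1.0, 0.0], [1.0, 1.0]])
probs = class_probabilities(alpha, W)           # each row sums to one
```

Subtracting the row maximum before exponentiating does not change the probabilities but avoids overflow for large linear predictors.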

Suppose that there are $J$ types of longitudinal outcomes and the $j$th type is measured at $N_j$ time points. For $j = 1,\ldots,J$ and $k = 1,\ldots,N_j$, let $Y_{jk}$ denote the $k$th measurement of the $j$th longitudinal outcome and $X_{jk}$ and $\tilde{X}_{jk}$ denote corresponding covariates, which include the constant 1. The covariates $X_{jk}$, $\tilde{X}_{jk}$, and $W$ may partially or completely overlap. We relate $Y_{jk}$ to $X_{jk}$ and $\tilde{X}_{jk}$ through the multivariate linear mixed model:

$$Y_{jk} \mid (C = g) = \beta_g^T X_{jk} + b^T \tilde{X}_{jk} + \epsilon_{jk} \tag{2}$$

for $g = 1,\ldots,G$, where $\beta_g$ is a vector of class-specific regression parameters, $b$ is a vector of random effects assumed to follow the multivariate normal distribution with mean 0 and variance $\Sigma(\xi_g)$, $\epsilon_{j1},\ldots,\epsilon_{jN_j}$ are independent zero-mean normal random variables with variance $\sigma_{gj}^2$, and $\Sigma(\xi_g)$ is a covariance matrix indexed by a vector of class-specific variance parameters $\xi_g$.

Let T denote the event time of interest. We relate T to a set of potentially time-dependent covariates Z(·) through the proportional hazards model:

$$\lambda(t \mid Z, b, C = g) = \lambda_g(t)\, e^{\gamma_g^T Z(t) + \eta_g^T b}, \tag{3}$$

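where $\lambda_g(\cdot)$ is an arbitrary class-specific baseline hazard function, and $\gamma_g$ and $\eta_g$ are class-specific regression parameters. In the presence of censoring, we observe $\tilde{T} = T \wedge U$ and $\Delta = I(T \le U)$, where $U$ is the censoring time, and $I(\cdot)$ is the indicator function. Let $Y = (Y_{11},\ldots,Y_{1N_1},\ldots,Y_{J1},\ldots,Y_{JN_J})^T$, $X = (X_{11},\ldots,X_{1N_1},\ldots,X_{J1},\ldots,X_{JN_J})^T$, and $\tilde{X} = (\tilde{X}_{11},\ldots,\tilde{X}_{1N_1},\ldots,\tilde{X}_{J1},\ldots,\tilde{X}_{JN_J})^T$. The data consist of $n$ independent observations $O_i \equiv \{N_{i1},\ldots,N_{iJ}, Y_i, X_i, \tilde{X}_i, \tilde{T}_i, \Delta_i, W_i, \{Z_i(t)\}_{t \in [0, \tilde{T}_i]}\}$ for $i = 1,\ldots,n$. Throughout, $\tau$ denotes the end-of-study time.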

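Let $\theta \equiv (\alpha_1,\ldots,\alpha_{G-1}, \beta_1,\ldots,\beta_G, \sigma_{11}^2,\ldots,\sigma_{1J}^2,\ldots,\sigma_{GJ}^2, \xi_1,\ldots,\xi_G, \gamma_1,\ldots,\gamma_G, \eta_1,\ldots,\eta_G)$ denote the set of all Euclidean parameters and $\Lambda_g(t) = \int_0^t \lambda_g(u)\,du$ for $g = 1,\ldots,G$. Under the assumption of noninformative censoring and longitudinal measurement times, rigorously formulated in Section S2 of the supplementary materials [21], the likelihood function concerning $(\theta, \Lambda_1,\ldots,\Lambda_G)$ is proportional to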

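$$\prod_{i=1}^{n} \sum_{g=1}^{G} \frac{e^{\alpha_g^T W_i}}{\sum_{l=1}^{G} e^{\alpha_l^T W_i}} \int \prod_{j=1}^{J} \prod_{k=1}^{N_{ij}} \sigma_{gj}^{-1} \exp\left\{-\frac{(Y_{ijk} - \beta_g^T X_{ijk} - b^T \tilde{X}_{ijk})^2}{2\sigma_{gj}^2}\right\} \left\{\lambda_g(\tilde{T}_i)\, e^{\gamma_g^T Z_i(\tilde{T}_i) + \eta_g^T b}\right\}^{\Delta_i} \times \exp\left\{-\int_0^{\tilde{T}_i} e^{\gamma_g^T Z_i(t) + \eta_g^T b}\, d\Lambda_g(t)\right\} |\Sigma(\xi_g)|^{-1/2} e^{-\frac{1}{2} b^T \Sigma(\xi_g)^{-1} b}\, db. \tag{4}$$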

We reparametrize the model by setting $\Lambda = \Lambda_1$ and $\psi_g = \log(\lambda_g/\lambda_1)$; we then estimate $\Lambda$ nonparametrically and approximate $\psi_g$ using a sieve of B-spline functions for $g = 2,\ldots,G$. In particular, we treat $\Lambda$ as a step function that jumps at the observed event times and replace $\lambda_1(\tilde{T}_i)$ in the likelihood by $\Lambda\{\tilde{T}_i\}$, where $\Lambda\{t\}$ is the jump size of $\Lambda$ at $t$. Let $B_1,\ldots,B_{m_n}$ be B-spline functions on a grid over $[0,\tau]$, where the number of spline functions $m_n$ increases with the sample size. For $g = 2,\ldots,G$, we approximate $\psi_g$ by $\sum_{s=1}^{m_n} a_{gs} B_s$, where $a \equiv (a_{gs})_{g=2,\ldots,G;\, s=1,\ldots,m_n}$ is a set of regression parameters. Ideally, NPMLE would be adopted for every nonparametric function because it does not require tuning and is more flexible than splines. However, because the NPMLE for $(\Lambda_1,\ldots,\Lambda_G)$ is inconsistent, we estimate the cumulative baseline hazard function of a reference group using NPMLE and estimate the remaining nonparametric functions using splines, so as to achieve as much model flexibility as possible while ensuring estimation consistency.
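To make the spline approximation of $\psi_g$ concrete, the following Python sketch evaluates a B-spline basis on $[0,\tau]$ with the Cox-de Boor recursion and forms $\sum_s a_{gs} B_s(t)$. The function name `bspline_basis`, the knot locations, and the coefficient values are hypothetical; the clamped (repeated boundary) knot layout is an assumption of this sketch rather than a detail stated in the text.

```python
import numpy as np

def bspline_basis(t, interior_knots, tau, degree=1):
    """Evaluate all B-spline basis functions on [0, tau] at the points t.

    Uses the Cox-de Boor recursion with boundary knots repeated degree + 1
    times. Returns an array of shape (len(t), len(interior_knots) + degree + 1).
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    knots = np.concatenate([np.zeros(degree + 1),
                            np.sort(np.asarray(interior_knots, dtype=float)),
                            np.full(degree + 1, tau)])
    n_intervals = len(knots) - 1
    # Degree-0 basis: indicators of the half-open knot intervals.
    B = np.zeros((len(t), n_intervals))
    for i in range(n_intervals):
        B[:, i] = (knots[i] <= t) & (t < knots[i + 1])
    # Close the last non-empty interval on the right so that t = tau is covered.
    last = np.max(np.nonzero(knots[:-1] < knots[1:])[0])
    B[t == tau, last] = 1.0
    for d in range(1, degree + 1):
        B_new = np.zeros((len(t), n_intervals - d))
        for i in range(n_intervals - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            if left_den > 0:
                B_new[:, i] += (t - knots[i]) / left_den * B[:, i]
            if right_den > 0:
                B_new[:, i] += (knots[i + d + 1] - t) / right_den * B[:, i + 1]
        B = B_new
    return B

tau = 5.0
grid = np.linspace(0.0, tau, 11)
basis = bspline_basis(grid, interior_knots=[1.5, 3.0], tau=tau, degree=1)
a_g = np.array([0.2, -0.1, 0.4, 0.0])    # hypothetical spline coefficients
psi_g = basis @ a_g                      # approximates log(lambda_g / lambda_1)
```

With degree 1 the approximation to $\psi_g$ is piecewise linear, with slope changes only at the interior knots.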

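Let $(\hat{\theta}_n, \hat{\Lambda}_n, \hat{a}_n)$ be the maximizer of

$$\prod_{i=1}^{n} \sum_{g=1}^{G} \frac{e^{\alpha_g^T W_i}}{\sum_{l=1}^{G} e^{\alpha_l^T W_i}} \int \prod_{j=1}^{J} \prod_{k=1}^{N_{ij}} \sigma_{gj}^{-1} \exp\left\{-\frac{(Y_{ijk} - \beta_g^T X_{ijk} - b^T \tilde{X}_{ijk})^2}{2\sigma_{gj}^2}\right\} \times \left[\Lambda\{\tilde{T}_i\}\, e^{\gamma_g^T Z_i(\tilde{T}_i) + \sum_{s=1}^{m_n} a_{gs} B_s(\tilde{T}_i) + \eta_g^T b}\right]^{\Delta_i} \exp\left\{-\int_0^{\tilde{T}_i} e^{\gamma_g^T Z_i(t) + \sum_{s=1}^{m_n} a_{gs} B_s(t) + \eta_g^T b}\, d\Lambda(t)\right\} \times |\Sigma(\xi_g)|^{-1/2} e^{-\frac{1}{2} b^T \Sigma(\xi_g)^{-1} b}\, db,$$

and let $\hat{\psi}_{ng} = \sum_{s=1}^{m_n} \hat{a}_{ngs} B_s$, where $\hat{a}_{ngs}$ is the corresponding element of $\hat{a}_n$. Let $\mathcal{B} = (\psi_2,\ldots,\psi_G)$. The sieve NPMLE of $(\theta, \Lambda, \mathcal{B})$ is $(\hat{\theta}_n, \hat{\Lambda}_n, \hat{\mathcal{B}}_n)$, where $\hat{\mathcal{B}}_n = (\hat{\psi}_{n2},\ldots,\hat{\psi}_{nG})$.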

3. Computation of the sieve NPMLE.

In this section, we use $Z(\cdot)$ to denote the combination of the original set of time-dependent covariates and the B-spline functions $(B_1,\ldots,B_{m_n})$, with $\gamma_g$ being the corresponding vector of regression parameters for the $g$th latent class. We compute the sieve NPMLE using an accelerated version of the EM algorithm, with $C$ and $b$ treated as missing data. The proposed algorithm iteratively performs the EM steps. Unlike in the standard EM algorithm, an E-step need not be evaluated at the current parameter estimates; it may instead be evaluated at some function of the estimates from the previous steps.

We first introduce the standard EM algorithm. The complete-data log-likelihood function is

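$$\sum_{i=1}^{n} \sum_{g=1}^{G} I(C_i = g) \Bigg[ \alpha_g^T W_i - \log \sum_{l=1}^{G} e^{\alpha_l^T W_i} - \frac{1}{2} \log|\Sigma(\xi_g)| - \frac{1}{2} b_i^T \Sigma(\xi_g)^{-1} b_i - \sum_{j=1}^{J} \sum_{k=1}^{N_{ij}} \bigg\{ \frac{1}{2} \log \sigma_{gj}^2 + \frac{(Y_{ijk} - \beta_g^T X_{ijk} - b_i^T \tilde{X}_{ijk})^2}{2\sigma_{gj}^2} \bigg\} + \Delta_i \big\{ \gamma_g^T Z_i(\tilde{T}_i) + \eta_g^T b_i + \log \Lambda\{\tilde{T}_i\} \big\} - \sum_{s \le \tilde{T}_i} \Lambda\{s\}\, e^{\gamma_g^T Z_i(s) + \eta_g^T b_i} \Bigg].$$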

In the E-step, we compute the expectation of functions of (b,C) involved in the M-step. The conditional density of bi given Ci = g and the observed data is proportional to

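$$f_{ig}(b_i) \equiv \prod_{j=1}^{J} \sigma_{gj}^{-N_{ij}} \prod_{j=1}^{J} \prod_{k=1}^{N_{ij}} \exp\left\{-\frac{(Y_{ijk} - \beta_g^T X_{ijk} - b_i^T \tilde{X}_{ijk})^2}{2\sigma_{gj}^2}\right\} |\Sigma(\xi_g)|^{-1/2} \times \exp\left\{-\frac{1}{2} b_i^T \Sigma(\xi_g)^{-1} b_i\right\} e^{\Delta_i \{\gamma_g^T Z_i(\tilde{T}_i) + \eta_g^T b_i\}} \exp\left\{-\int_0^{\tilde{T}_i} e^{\gamma_g^T Z_i(t) + \eta_g^T b_i}\, d\Lambda(t)\right\},$$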

and the conditional probability of Ci = g given the observed data is proportional to

$$q_{ig} \equiv e^{\alpha_g^T W_i} \int f_{ig}(b)\, db.$$

The conditional expectation of any function h of (bi,Ci) given the observed data is

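$$E\{h(b_i, C_i) \mid O_i\} = \sum_{g=1}^{G} \hat{p}_{ig}\, \frac{\int h(b, g)\, f_{ig}(b)\, db}{\int f_{ig}(b)\, db},$$

where $\hat{p}_{ig} = q_{ig} / \sum_{l=1}^{G} q_{il}$. The integrations in the above equation can be approximated with the adaptive Gauss–Hermite quadrature [9].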
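The two ingredients of the E-step can be illustrated with a short Python sketch: a Gauss-Hermite rule for integrals against a normal density, and the normalization of $q_{ig}$ into $\hat{p}_{ig}$. This is a simplified illustration under stated assumptions: it uses the plain (non-adaptive) rule for a scalar random effect, whereas the text uses the adaptive version, and the function names and numerical inputs are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_expectation(h, sigma, n_nodes=20):
    """Approximate E[h(b)] for b ~ N(0, sigma^2) by Gauss-Hermite quadrature.

    hermgauss returns nodes x and weights w for the weight function exp(-x^2),
    so the change of variable b = sqrt(2) * sigma * x is applied.
    """
    x, w = hermgauss(n_nodes)
    return np.sum(w * h(np.sqrt(2.0) * sigma * x)) / np.sqrt(np.pi)

def posterior_class_probs(log_prior_weights, integrals):
    """Return p_ig = q_ig / sum_l q_il, where log q_ig is the sum of
    alpha_g^T W_i (log_prior_weights) and the log of the integral of f_ig."""
    log_q = np.asarray(log_prior_weights, dtype=float) + np.log(integrals)
    log_q -= log_q.max()                  # guard against underflow
    q = np.exp(log_q)
    return q / q.sum()

# Check against a closed form: E[exp(eta * b)] = exp(eta^2 sigma^2 / 2).
eta, sigma = 1.5, 0.8
approx = normal_expectation(lambda b: np.exp(eta * b), sigma)
exact = np.exp(0.5 * (eta * sigma) ** 2)

# Hypothetical inputs for one subject and two classes.
p_hat = posterior_class_probs([0.0, 0.0], [1.0, 3.0])
```

Working on the log scale before normalizing is a common safeguard, because the integrals of $f_{ig}$ can be extremely small when a subject has many longitudinal measurements.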

In the M-step, we update the parameters by maximizing the expected complete-data log-likelihood function given the observed data. In particular, we update αg (g = 1,…,G − 1) by maximizing the weighted multinomial log-likelihood

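$$\sum_{i=1}^{n} \sum_{g=1}^{G} \hat{p}_{ig} \left\{ \alpha_g^T W_i - \log \sum_{l=1}^{G} e^{\alpha_l^T W_i} \right\}$$

via the Newton-Raphson algorithm. Then, we update $\beta_g$ and $\sigma_{gj}^2$ $(j = 1,\ldots,J;\ g = 1,\ldots,G)$ by maximizing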

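$$-\frac{1}{2} \sum_{j=1}^{J} \sum_{i=1}^{n} \hat{p}_{ig} \left[ N_{ij} \log \sigma_{gj}^2 + \sum_{k=1}^{N_{ij}} \frac{1}{\sigma_{gj}^2}\, \hat{E}_g\left\{ (Y_{ijk} - \beta_g^T X_{ijk} - b_i^T \tilde{X}_{ijk})^2 \right\} \right]$$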

and update ξg (g = 1,…,G) by maximizing

$$-\frac{1}{2} \sum_{i=1}^{n} \hat{p}_{ig} \left[ \log|\Sigma(\xi_g)| + \hat{E}_g\left\{ b_i^T \Sigma(\xi_g)^{-1} b_i \right\} \right],$$

where $\hat{E}_g$ denotes the conditional expectation with respect to $b_i$ given $C_i = g$ and the observed data. If closed-form solutions for the maximization problems are not available, then we employ the Newton-Raphson algorithm. In addition, we update $(\gamma_g, \eta_g)$ $(g = 1,\ldots,G)$ by maximizing the (weighted) partial likelihood

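$$\sum_{i=1}^{n} \Delta_i \left[ \sum_{g=1}^{G} \hat{p}_{ig} \left\{ \gamma_g^T Z_i(\tilde{T}_i) + \eta_g^T \hat{E}_g(b_i) \right\} - \log \left\{ \sum_{g=1}^{G} \sum_{j=1}^{n} I(\tilde{T}_j \ge \tilde{T}_i)\, \hat{p}_{jg}\, e^{\gamma_g^T Z_j(\tilde{T}_i)}\, \hat{E}_g\left(e^{\eta_g^T b_j}\right) \right\} \right]$$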

via the Newton-Raphson algorithm. Finally, we update the cumulative baseline hazard function Λ by

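$$\hat{\Lambda}\{\tilde{T}_i\} = \frac{\Delta_i}{\sum_{g=1}^{G} \sum_{j=1}^{n} I(\tilde{T}_j \ge \tilde{T}_i)\, \hat{p}_{jg}\, e^{\hat{\gamma}_g^T Z_j(\tilde{T}_i)}\, \hat{E}_g\left(e^{\hat{\eta}_g^T b_j}\right)}$$

for $i = 1,\ldots,n$, where $(\hat{\gamma}_g, \hat{\eta}_g)$ are the current estimates of the parameters.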
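The update of the jump sizes has the form of a Breslow-type estimator and can be sketched in a few lines of Python. The function name `update_hazard_jumps` and the array layout are assumptions of this sketch; the weighted risk scores are taken as given, having been computed from the E-step quantities.

```python
import numpy as np

def update_hazard_jumps(time, delta, risk_score):
    """Jump of the cumulative baseline hazard at each subject's observed time.

    time       : (n,) observed times
    delta      : (n,) event indicators
    risk_score : (n, n) array whose [j, i] entry is the sum over classes g of
                 p_jg * exp(gamma_g^T Z_j(time_i)) * E_g[exp(eta_g^T b_j)]
    """
    time = np.asarray(time, dtype=float)
    delta = np.asarray(delta, dtype=float)
    at_risk = time[:, None] >= time[None, :]     # [j, i] = I(time_j >= time_i)
    denom = np.sum(at_risk * risk_score, axis=0)
    return delta / denom

# With unit risk scores the update reduces to Nelson-Aalen increments.
jumps = update_hazard_jumps([1.0, 2.0, 3.0], [1, 0, 1], np.ones((3, 3)))
```

Censored subjects receive a jump of zero, so the estimated cumulative hazard increases only at observed event times.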

The standard EM algorithm, which iteratively performs the E-step and M-step until convergence, may be slow, especially when the number of parameters is large. To accelerate the convergence, we adopt a modification of the EM algorithm proposed by Varadhan and Roland [19]. Let ϑ denote the set of all parameters and s(ϑ) be the set of updated parameters after a single EM step if the initial parameter value is ϑ. With ϑ(k) being the set of current estimates, a step of the accelerated EM algorithm consists of

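  1. Calculate $\vartheta_1 = s(\vartheta^{(k)})$.

  2. Calculate $\vartheta_2 = s(\vartheta_1)$.

  3. Calculate $r = \vartheta_1 - \vartheta^{(k)}$, $v = \vartheta_2 - \vartheta_1 - r$, and $a = -\|r\|_2 / \|v\|_2$.

  4. Update the parameter estimates by $\vartheta^{(k+1)} = s(\vartheta^{(k)} - 2ar + a^2 v)$.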
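The four steps above can be sketched in Python for a generic fixed-point map. Here `em_step` plays the role of $s(\cdot)$; the function name and the toy linear map are hypothetical and stand in for the actual E- and M-steps.

```python
import numpy as np

def accelerated_em_step(theta, em_step):
    """One accelerated step: two plain EM updates, an extrapolation, and a
    final stabilising EM update."""
    theta1 = em_step(theta)
    theta2 = em_step(theta1)
    r = theta1 - theta
    v = theta2 - theta1 - r
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:                 # no curvature information; keep theta2
        return theta2
    a = -np.linalg.norm(r) / norm_v
    return em_step(theta - 2.0 * a * r + a ** 2 * v)

# Toy linear contraction with fixed point (10, 20), for illustration only.
def toy_step(theta):
    return 0.9 * theta + np.array([1.0, 2.0])

theta_next = accelerated_em_step(np.zeros(2), toy_step)
```

For this toy map, plain iteration would need many steps to approach the fixed point, whereas the extrapolated step reaches it immediately; for a real EM map the gain is typically smaller and depends on how close the map is to linear near the solution.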

To improve stability, we update the parameters using the standard EM steps at early steps of the algorithm. Once the difference between consecutive parameter estimates becomes smaller than a certain threshold, we perform the accelerated EM steps until convergence. When the assumed number of latent classes is larger than the actual number, the model is nonidentifiable, and the parameter estimates may not converge; therefore, we terminate the algorithm when the difference between the log-likelihood values of consecutive iterations is smaller than a certain threshold.

The algorithm may converge to a local maximum of the log-likelihood. To improve the chance of obtaining the global maximum, we can run the algorithm with different initial values and set the estimates to the converged values that yield the largest log-likelihood. One strategy for setting the initial values is to classify subjects into G classes by some clustering method and set the parameter values for each class to be the estimates obtained from subjects assigned to the class.

Upon convergence, we use Louis’s formula [11] to compute the observed information matrix, essentially treating the model as parametric, with parameters $\theta$, $(\Lambda\{\tilde{T}_i\})_{i:\Delta_i=1}$, and $(a_{gs})_{g=2,\ldots,G;\, s=1,\ldots,m_n}$. The submatrix of the inverse of the observed information matrix corresponding to $\theta$ can be used to estimate the standard errors of $\hat{\theta}_n$. This submatrix is essentially an estimate of the inverse of the efficient information matrix $\tilde{I}$ defined in the proof of Theorem 4.2, where the least-favorable directions are estimated by solving the empirical counterparts of the integral equations they satisfy. The consistency of this standard error estimator is established in Theorem 4.3.

We propose to use the Bayesian information criterion (BIC) [16] to select the number of latent classes G. Specifically, for each G, we estimate the model using the sieve NPMLE and compute

$$\mathrm{BIC} = -2 \log L_n(\hat{\theta}_n, \hat{\Lambda}_n, \hat{\mathcal{B}}_n) + s \log n,$$

where Ln is the likelihood function, and s is the number of free parameters in the model, including the regression parameters for the B-spline functions. We select the G that yields the smallest BIC value.
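The selection rule can be written as a small Python helper. The log-likelihood values and parameter counts below are hypothetical placeholders, not results from the paper.

```python
import math

def bic(log_likelihood, n_free_params, n):
    """BIC = -2 log L + s log n."""
    return -2.0 * log_likelihood + n_free_params * math.log(n)

def select_num_classes(fits, n):
    """fits maps G to (maximised log-likelihood, number of free parameters).
    Returns the G with the smallest BIC together with all BIC values."""
    scores = {G: bic(ll, s, n) for G, (ll, s) in fits.items()}
    return min(scores, key=scores.get), scores

# Hypothetical fitted values, for illustration only.
fits = {2: (-5200.0, 30), 3: (-5100.0, 46), 4: (-5095.0, 62)}
best_G, scores = select_num_classes(fits, n=1000)
```

In this made-up example, moving from three to four classes improves the log-likelihood only slightly, so the penalty term dominates and the three-class model is selected.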

4. Asymptotic properties of the sieve NPMLE.

Assume that the degree of the B-spline functions is fixed at some $p \ge 1$ and that the distance between adjacent knots is within $[K^{-1} m_n^{-1}, K m_n^{-1}]$ for some large constant $K$. Let $d$ be the dimension of the Euclidean parameters and $\Theta$ be a known, compact parameter space of $\theta$. Let $(\theta_0, \Lambda_0, \mathcal{B}_0)$ denote the true parameter values, where $\mathcal{B}_0 = (\psi_{02},\ldots,\psi_{0G})$. Let $\Lambda_g(t) = \int_0^t \lambda_g(u)\,du$ and $\Lambda_{0g}$ be its true value $(g = 1,\ldots,G)$.

We impose the following conditions.

(C1) The parameter $\theta_0$ lies in the interior of $\Theta$, and the function $\Lambda_{0g}$ is continuously differentiable up to the third order on $[0,\tau]$ for $g = 1,\ldots,G$.

(C2) With probability one, $P(\tilde{T} = \tau \mid W, X, \tilde{X}, Z) > \delta_0$ for some fixed $\delta_0 > 0$.

(C3) With probability one, $Z(\cdot)$ has left-continuous sample paths on $[0,\tau]$ with right derivatives. In addition, there exists a large constant $K$ such that

$$P\left( \max_{j=1,\ldots,J} N_j + \|W\|_2 + \|X\|_2 + \|\tilde{X}\|_2 + \sup_{t \in [0,\tau]} \|Z(t)\|_2 + \sup_{t \in [0,\tau]} \|Z'(t)\|_2 < K \right) = 1,$$

where $Z'$ is the (componentwise) left derivative of $Z$.

(C4) The number of knots $m_n$ satisfies $m_n = O(n^q)$ for some $1/12 < q < 1/8$.

The next condition is more technical and ensures model identifiability and invertibility of the information operator. Essentially, it requires that the covariates take enough distinct values such that the class-specific distributions of the longitudinal outcomes can be distinguished and the effect of each covariate on each class-specific distribution can be identified. Let $\Sigma_{0g} = \mathrm{diag}(\sigma_{0g1}^2 1_{N_1},\ldots,\sigma_{0gJ}^2 1_{N_J})$, $\Gamma_{0g} = \Psi_{0g}(I + \Psi_{0g}^T \tilde{X} \Sigma_{0g}^{-1} \tilde{X}^T \Psi_{0g})^{-1} \Psi_{0g}^T$, and $\Sigma_{0Yg} = \tilde{X} \Psi_{0g} \Psi_{0g}^T \tilde{X}^T + \Sigma_{0g}$, where $1_k$ is a $k$-vector of ones, $\Psi_{0g}$ is an orthogonal matrix such that $\Sigma(\xi_{0g}) = \Psi_{0g} \Psi_{0g}^T$, and $\sigma_{0gj}^2$ and $\xi_{0g}$ are the true values of the corresponding parameters. Note that $\Sigma_{0Yg}$ is the covariance matrix of $Y$ given $C = g$ and $(N_1,\ldots,N_J)$.

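(C5) There exist some positive integers $(n_1,\ldots,n_J)$ such that $P(N_1 = n_1,\ldots,N_J = n_J) > 0$ and that the following holds. Let $\mathcal{X}$ be the set of possible values of $(X, \tilde{X})$ given $(N_1 = n_1,\ldots,N_J = n_J)$ such that $\tilde{X}^T \tilde{X}$ is invertible and

$$\tilde{X} \Sigma(\xi_{0g}) \tilde{X}^T + \Sigma_{0g} \ne \tilde{X} \Sigma(\xi_{0l}) \tilde{X}^T + \Sigma_{0l}$$
$$\text{or} \quad X\beta_{0g} \ne X\beta_{0l} \ \text{ and } \ \Sigma_{0Yg}^{-1} X\beta_{0g} + \Sigma_{0g}^{-1} \tilde{X} \Gamma_{0g}^T \eta_{0g} \ne \Sigma_{0Yl}^{-1} X\beta_{0l} + \Sigma_{0l}^{-1} \tilde{X} \Gamma_{0l}^T \eta_{0l}$$

whenever $g \ne l$. For $k = 1,\ldots,n_j$ and $j = 1,\ldots,J$, if $W^T h_W = 0$, $X_{jk}^T h_{X_{jk}} = 0$, $\tilde{X}_{jk}^T h_{\tilde{X}_{jk}} = 0$, and $Z(t)^T h_Z = 0$ almost surely for all $(X, \tilde{X}) \in \mathcal{X}$ and $t \in [0,\tau]$, then $h_W = 0$, $h_{X_{jk}} = 0$, $h_{\tilde{X}_{jk}} = 0$, and $h_Z = 0$, where $h_W$, $h_{X_{jk}}$, $h_{\tilde{X}_{jk}}$, and $h_Z$ are fixed vectors of appropriate dimensions.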

The final condition ensures that the least-favorable direction for the Euclidean parameters is sufficiently smooth.

(C6) The conditional density of the censoring variable U given the observed covariates is continuously differentiable on the support of U with respect to some dominating measure up to the third order.

Remark 1. Conditions (C1)–(C3) are common assumptions in the analysis of right-censored data under semiparametric survival models. Condition (C4) pertains to the rate at which the number of B-spline functions increases to infinity. Condition (C5) pertains to the class-specific distributions of the longitudinal outcomes and event time. Instead of directly assuming the identifiability and invertibility of the information operator of the proposed model, we derive these properties under assumptions on individual class-specific distributions. Condition (C5) requires that, after removing specific covariate values that yield equality of certain quantities of the class-specific distributions of the observed variables, the remaining possible covariate values are linearly independent. For latent-class models in general, linear independence of the covariates and distinctness of parameter values across latent classes are not sufficient for the invertibility of the information operator. To see this, consider a simple model with two latent classes, a known mixture probability of 0.5 for each class, a single binary covariate $X$, and a single outcome variable $Y$ with $Y \mid (X, C = g) \sim N(\alpha_g + \beta_g X, 1)$ for $g = 1,2$, where $C$ denotes the latent class membership. The score statistic along the direction $\alpha_1 = \alpha_{01} + \epsilon$, $\alpha_2 = \alpha_{02} - \epsilon$, $\beta_1 = \beta_{01} - \epsilon$, and $\beta_2 = \beta_{02} + \epsilon$ is zero when $\alpha_{01} = \alpha_{02}$, even if $\beta_{01} \ne \beta_{02}$, where $(\alpha_{01}, \alpha_{02}, \beta_{01}, \beta_{02})$ are the true parameter values. Indeed, at $X = 0$ the two class-specific densities coincide, so the opposite perturbations of $\alpha_1$ and $\alpha_2$ cancel, and at $X = 1$ each class mean $\alpha_g + \beta_g$ is left unchanged. This model does not satisfy (a simplified version of) condition (C5) because the two latent classes are different only at $X \ne 0$, but given $X \ne 0$, $(1, X)$ is no longer linearly independent. A simple sufficient condition for condition (C5) is that all covariates are linearly independent and the class-specific variances of $Y$ are distinct almost surely.
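The counterexample in Remark 1 can be checked numerically. The Python sketch below approximates the directional score of the two-class normal mixture by a central difference and evaluates it at $X = 0$ and $X = 1$; the function names, the step size, and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def mixture_density(y, x, a1, a2, b1, b2):
    """Density of Y given X = x for an equal-weight mixture of
    N(a1 + b1 * x, 1) and N(a2 + b2 * x, 1)."""
    def phi(z):
        return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return 0.5 * phi(y - a1 - b1 * x) + 0.5 * phi(y - a2 - b2 * x)

def directional_score(y, x, a0, b01, b02, eps=1e-6):
    """Central-difference derivative of the log-density along the direction
    (alpha1, alpha2, beta1, beta2) -> (+eps, -eps, -eps, +eps), evaluated at
    alpha01 = alpha02 = a0."""
    up = mixture_density(y, x, a0 + eps, a0 - eps, b01 - eps, b02 + eps)
    down = mixture_density(y, x, a0 - eps, a0 + eps, b01 + eps, b02 - eps)
    return (np.log(up) - np.log(down)) / (2.0 * eps)

y = np.linspace(-3.0, 5.0, 9)
scores = np.array([directional_score(y, x, a0=0.5, b01=-1.0, b02=2.0)
                   for x in (0.0, 1.0)])
max_abs_score = np.max(np.abs(scores))   # numerically indistinguishable from 0
```

The score vanishes for every value of $y$ at both covariate values even though the two slopes differ, which is the degeneracy that condition (C5) is designed to rule out.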

Let $\|\cdot\|$ be the supremum norm over $[0,\tau]$. We have the following results.

Theorem 4.1. Under conditions (C1)–(C5), there exists a local maximum of the nonparametric likelihood in the sieve space, denoted by θ^n,Λ^n,B^n, such that

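$$\|\hat{\theta}_n - \theta_0\|_2^2 + \|\hat{\Lambda}_n - \Lambda_0\|^2 + \sum_{g=2}^{G} \int_0^\tau \left\{ \hat{\psi}_{ng}(t) - \psi_{0g}(t) \right\}^2 dt = o_p(n^{-1/2}).$$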

This theorem provides a preliminary, combined rate of convergence for the estimators of the Euclidean and infinite-dimensional parameters. Based on this convergence rate, the following theorem establishes that the Euclidean parameter estimators converge at the optimal $n^{1/2}$ rate and attain the semiparametric efficiency bound [1].

Theorem 4.2. Under conditions (C1)–(C6), $n^{1/2}(\hat{\theta}_n - \theta_0)$ converges weakly to the normal distribution with zero mean, and its asymptotic variance attains the semiparametric efficiency bound.

Let $I_n$ be the negative Hessian matrix of the log-likelihood evaluated at the estimated parameters, with the jump sizes of $\hat{\Lambda}_n$ and the coefficients of the spline functions in $(\hat{\psi}_{n2},\ldots,\hat{\psi}_{nG})$ treated as Euclidean parameters. Let $\hat{V}_n$ be the submatrix of $(n^{-1} I_n)^{-1}$ that corresponds to $\theta$.

Theorem 4.3. Under conditions (C1)–(C6), $\|\hat{V}_n - \tilde{I}^{-1}\|_2 = o_p(1)$, where $\tilde{I}$ is the efficient information matrix of $\theta$ defined in the proof of Theorem 4.2.

The proofs of Theorems 4.1 and 4.2 are given in Appendix A, whereas the proof of Theorem 4.3 is given in Section S3 of the supplementary materials [21].

5. Simulation studies.

We considered a longitudinal study where data were collected on repeated measures of longitudinal outcomes as well as on the time to the occurrence of an event of interest. Each subject was examined periodically until the event of interest occurred or the subject was lost to follow-up. At the initial examination, a set of baseline covariates, which may represent sex, age, and other information, were measured, and at each examination, two types of longitudinal outcomes were measured. The latent class for each subject was generated from model (1) with $G = 3$ and $W = (1, X_1, X_2)^T$, where $X_1$ and $X_2$ are independent Bernoulli(0.5) and $N(0,1)$, respectively. We set the examination times at $s_k = 0.15(k-1)$ for $k = 1,\ldots,10$. For $j = 1,2$ and $k = 1,\ldots,10$, we generated

$$Y_{jk} \mid (C = g) = \beta_{gj}^T X_k + b_j + b_3 + \epsilon_{jk}, \tag{5}$$

where $\epsilon_{jk} \mid (C = g) \sim N(0, \sigma_{gj}^2)$, $X_k = (1, s_k, X_1, X_2)^T$, $b_j \mid (C = g) \sim N(0, \xi_{gj}^2)$, and $(b_1, b_2, b_3)$ are independent of each other and of $(X_1, X_2)$. Note that the random effects $b_1$ and $b_2$ account for the dependence among repeated measures of a single type of longitudinal outcome, whereas $b_3$ accounts for the dependence between the two types of longitudinal outcomes. The event time $T$ was generated from model (3) with a single random effect term $b_3$ and $Z(t) = (X_1, X_2)^T$ for all $t$, and the censoring variable $U$ was generated from Uniform$[0,\tau]$ with $\tau = 5$. Note that the number of longitudinal outcome measurements is $\max\{k : k \le 10,\ s_k \le T \wedge U\}$.

The true values of the Euclidean parameters are given in Table S1 of the supplementary materials [21]. The class-specific baseline hazard functions are $\lambda_1(t) = 0.5$, $\lambda_2(t) = \exp(0.25t)$, and $\lambda_3(t) = 1$. The proportions of subjects belonging to latent classes 1, 2, and 3 are approximately 35%, 35%, and 30%, respectively. The average number of longitudinal outcome measurements per subject is about 5.4. The censoring proportion is about 25%.
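A rough Python sketch of part of this data-generating process is given below: class membership from model (1), the event time from model (3) by inverting the class-specific cumulative hazard, uniform censoring, and the resulting number of examinations. All Euclidean parameter values in the code are hypothetical placeholders (the values actually used are in Table S1 of the supplementary materials), so the class proportions and censoring rate it produces will not match those reported above; the longitudinal outcomes from model (5) are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2022)
n, G, tau = 1000, 3, 5.0
s = 0.15 * np.arange(10)                       # examination times s_1..s_10

# Hypothetical parameter values, for illustration only.
alpha = np.array([[0.2, 0.5, -0.5], [0.1, -0.5, 0.5], [0.0, 0.0, 0.0]])
gamma = np.array([[0.5, -0.5], [0.3, 0.2], [-0.2, 0.4]])
eta = np.array([0.5, 1.0, -0.5])
xi3 = np.array([0.5, 0.7, 0.6])                # class-specific SD of b3

X1 = rng.binomial(1, 0.5, size=n)
X2 = rng.normal(size=n)
W = np.column_stack([np.ones(n), X1, X2])

# Latent class from the multinomial logistic model (1); classes coded 0, 1, 2.
lin = W @ alpha.T
prob = np.exp(lin) / np.exp(lin).sum(axis=1, keepdims=True)
C = np.array([rng.choice(G, p=p) for p in prob])

# Event time from model (3): solve Lambda_g(T) * risk = E with E ~ Exp(1).
b3 = rng.normal(scale=xi3[C])
risk = np.exp(np.sum(gamma[C] * np.column_stack([X1, X2]), axis=1) + eta[C] * b3)
E = rng.exponential(size=n)
T = np.where(C == 0, E / (0.5 * risk),                    # lambda_1(t) = 0.5
    np.where(C == 1, np.log1p(0.25 * E / risk) / 0.25,    # lambda_2(t) = exp(0.25 t)
             E / risk))                                   # lambda_3(t) = 1

U = rng.uniform(0.0, tau, size=n)              # censoring time
T_obs = np.minimum(T, U)
Delta = (T <= U).astype(int)
n_meas = np.sum(s[None, :] <= T_obs[:, None], axis=1)     # max{k : s_k <= T ^ U}
```

For the second class, the cumulative baseline hazard is $(e^{0.25t} - 1)/0.25$, which is why the inversion uses `log1p`.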

We set the degree of the B-spline functions to be 1 and the number of interior knots to be 2; in our experience, the results are largely insensitive to the choice of the number of knots. The locations of the knots were set data-adaptively to be the 33% and 66% empirical quantiles of the observed event times. We considered G = 2, 3, and 4 latent classes and used BIC to select G. To set the initial values, we used k-means clustering based on the event (or censoring) time, the censoring indicator, and the baseline longitudinal outcome values to classify subjects into subgroups with k = G. Then, we fit the generalized linear models and survival models (without random effects) on each subgroup and set the initial parameter values to be the corresponding estimated values. The initial values for the coefficients of the B-splines and the regression parameters of the random effects were set to 0, the initial values of Var(bj)+Var(ϵjk) were set to be the estimated variances in the corresponding fitted linear models with Var(bj) = Var(ϵjk) (j = 1,2; k = 1,…,10), and the variance of b3 was set to be 0.1. The initial cumulative baseline hazard function was set to be the Breslow estimator. We constrained all Euclidean parameter estimates (including the regression parameters for the B-spline functions and the logarithm of the variance parameters) to be smaller than or equal to 10 in absolute value. This constraint was imposed because in the early iterations of the EM algorithm, the unconstrained estimates may sometimes become too extreme and cause numerical problems. We set the sample size to be n = 1000 or 2000 and considered 1000 simulation replicates for each setting.

Under G = 3, in no replicates do any parameter estimates (in absolute value) equal the boundary value of 10. Some parameter estimates are equal to the boundary value in about 60% of the replicates for G = 4 and in less than 5% of the replicates for G = 2. The convergence to the boundary under G = 4 is expected, because the model is nonidentifiable. In all but ten replicates under n = 1000, BIC selected the correct number of latent classes, and thus we only present the estimation results under G = 3. Because the labels of the latent classes are arbitrary, after convergence of the EM algorithm, we redefined the latent classes such that the orders of the estimated values of certain parameters across latent classes match the orders of the corresponding true parameter values. The estimation results for n = 1000 and n = 2000 are summarized in Tables S1 and S2 in the supplementary materials [21], respectively. The estimators of all parameters, including the class-specific cumulative baseline hazard functions at particular time points, are virtually unbiased. The standard errors are estimated accurately, and the coverage probabilities of the confidence intervals are close to the nominal level, especially for n = 2000. Thus, the proposed estimation method effectively uncovers the latent structure of the population, produces consistent estimators, and yields valid statistical inference.

6. A real study.

The ARIC study is a prospective epidemiological cohort study conducted in the United States. In the study, a total of about 15,000 subjects received a baseline examination in 1987–1989 and potentially six subsequent examinations in 1990–1992, 1993–1995, 1996–1998, 2011–2013, 2016–2017, and 2018–2019. At each examination, medical data, such as body mass index (BMI), blood pressure, and cholesterol levels, were collected. The subjects were also followed through reviews of hospital records, and potentially right-censored observations on time to myocardial infarction (MI), stroke, and death were also obtained.

We aimed to study the risk of cardiovascular diseases or death among African American subjects and to detect the presence of latent subgroups. The event of interest is MI, stroke, or death. The African American subjects were recruited from two centers of study in Forsyth County, NC and Jackson, MS. We included study location, sex, and the first-examination values of BMI, glucose level, smoking status, and age as covariates; these are referred to as the baseline covariates in the sequel. We considered systolic blood pressure and total cholesterol level, which were measured at each examination, as longitudinal outcomes. After removing 347 subjects with prior (or unknown status of) stroke or coronary heart disease at baseline and 178 subjects with missing data, the sample size is 3284, and the censoring proportion is 49.2%.

We fit models (1)–(3), where T is the time from the first examination to MI, stroke, or death, whichever occurred first, (Y1k,Y2k) are respectively the systolic blood pressure and total cholesterol level at the kth examination, and Nj is the total number of examinations (k = 1,…,Nj; j = 1,2). The set of covariates W consists of the baseline covariates (and the constant 1 for the intercept). For the jth longitudinal outcome at the kth examination, we assumed model (5) with the set of covariates Xk consisting of the baseline covariates and the time of the kth examination. In the survival model, the set of covariates Z(t) is time-independent and consists of the baseline covariates, and the set of random effects consists of a single term b3. We set the degree of the B-spline functions to be 1 and considered 2–4 interior knots. The locations of the knots were chosen to be empirical quantiles of the observed event times. We ranged the number of latent classes G from 1 to 6.

For every number of knots considered for the B-spline functions, BIC selected G = 4 latent classes. The BIC values at G = 1,…,6 under 2 interior knots are plotted in Figure S1 of the supplementary materials [21]. Since the estimation results across different numbers of knots are similar, we report the results under 2 interior knots. The point estimates, standard errors, and p-values of all Euclidean parameters in the survival model are given in Table 1, and the estimated class-specific cumulative hazard functions are plotted in Figure 1; the estimation results for the remaining Euclidean parameters are given in Tables S3 and S4 of the supplementary materials [21]. The estimated trajectories of the mean longitudinal outcomes for a typical subject from each latent class are plotted in Figure S2 of the supplementary materials [21]. We classified a subject to a latent class if the (estimated) posterior probability of the class is larger than 0.7; a subject is unclassified if none of the posterior probabilities is larger than 0.7. The Kaplan–Meier curves for the (predicted) latent classes are plotted in Figure S3 of the supplementary materials [21].
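The classification rule based on posterior probabilities can be sketched as follows. The function name `classify` and the example probabilities are hypothetical; the label 0 is used here to mark unclassified subjects.

```python
import numpy as np

def classify(posterior, threshold=0.7):
    """Return the 1-based latent class for each row of posterior
    probabilities, or 0 if no class exceeds the threshold."""
    posterior = np.asarray(posterior, dtype=float)
    best = posterior.argmax(axis=1)
    confident = posterior.max(axis=1) > threshold
    return np.where(confident, best + 1, 0)

# Hypothetical posterior probabilities for three subjects and G = 4 classes.
labels = classify([[0.80, 0.10, 0.05, 0.05],
                   [0.40, 0.30, 0.20, 0.10],
                   [0.10, 0.10, 0.05, 0.75]])
```

Because the threshold exceeds 0.5, at most one class can satisfy the rule for any subject, so the assignment is unambiguous.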

TABLE 1.

Estimation results for the Euclidean parameters in the survival model for the ARIC data

Parameter        Estimate   SE       p-value
γ_{1,Center}      0.2431    0.3041   4.24E-01
γ_{1,BMI}        −0.0775    0.0949   4.14E-01
γ_{1,Glucose}     0.4086    0.1325   2.04E-03
γ_{1,Smoke}       0.7848    0.1505   1.84E-07
γ_{1,Sex}         0.5965    0.1617   2.25E-04
γ_{1,Age}         0.6440    0.1303   7.75E-07
γ_{2,Center}      0.1269    0.1887   5.01E-01
γ_{2,BMI}         0.1052    0.0552   5.65E-02
γ_{2,Glucose}     0.0634    0.0403   1.16E-01
γ_{2,Smoke}       0.6472    0.1378   2.65E-06
γ_{2,Sex}         0.3533    0.1298   6.49E-03
γ_{2,Age}         0.3426    0.0721   2.00E-06
γ_{3,Center}     −0.0954    0.1920   6.19E-01
γ_{3,BMI}         0.1853    0.0641   3.86E-03
γ_{3,Glucose}     0.2304    0.0450   3.15E-07
γ_{3,Smoke}       0.8147    0.1487   4.26E-08
γ_{3,Sex}         0.3840    0.1355   4.61E-03
γ_{3,Age}         0.5433    0.0673   7.13E-16
γ_{4,Center}      0.0770    0.3369   8.19E-01
γ_{4,BMI}        −0.1136    0.1082   2.94E-01
γ_{4,Glucose}     0.2954    0.0411   7.05E-13
γ_{4,Smoke}       0.5983    0.2039   3.34E-03
γ_{4,Sex}         0.4959    0.1986   1.25E-02
γ_{4,Age}         0.2654    0.0980   6.78E-03
η_1               1.8929    2.5689   4.61E-01
η_2               1.5561    0.6952   2.52E-02
η_3               0.9861    2.3893   6.80E-01
η_4               1.3614    1.0065   1.76E-01

NOTE: For the parameters labeled γ, the first subscript represents the latent class, and the second subscript represents the covariate that corresponds to the parameter. “Center” is the indicator for the Jackson center; “Sex” is the indicator for male; “Smoke” is the indicator for smoker; “Glucose” represents glucose level. All continuous covariates are standardized. The parameter ηg is the regression parameter of b3 for the gth latent class.
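The p-values in Table 1 appear consistent with two-sided Wald tests based on the ratio of the estimate to its standard error; this is an inference from the reported numbers rather than a statement in the text. A minimal sketch under that assumption (the function name `wald_p_value` is hypothetical):

```python
import math

def wald_p_value(estimate, se):
    """Two-sided p-value for H0: parameter = 0, using a standard normal
    reference distribution: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    erfc keeps precision for very small p-values."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2.0))
```

Applied to the Table 1 entry for the glucose effect in the first latent class (estimate 0.4086, SE 0.1325), this gives a value close to the reported 2.04E–03.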

FIG 1.

Estimated class-specific baseline hazard functions for the ARIC data.

Older subjects, males, and smokers have higher risk of MI, stroke, or death across all latent classes. Subjects with higher BMI tend to have higher risk of disease or death in the third latent class, but BMI has no significant association with the risk in other latent classes. Glucose level has highly significant positive effect on the risk of disease or death in all but the second latent class. The random effect b3, which captures the dependence of the systolic blood pressure and the total cholesterol level, is significantly associated with the risk of disease or death only in the second latent class. This suggests that systolic blood pressure and total cholesterol level are associated with the risk of disease or death even conditional on the latent class membership. The estimated class-specific cumulative hazard of the second latent class is substantially higher than those of the other classes, and the empirical survival probabilities of the second latent class are smaller. The mean systolic blood pressure of subjects in the second latent class tends to be higher than those of the other classes. The results suggest that the second latent class is characterized by elevated risk of disease or death. The other groups also exhibit differences in the risk of disease or death, distributions of the longitudinal outcomes, and effects of covariates on the longitudinal and survival outcomes. In the latent-class membership model, the regression parameters for glucose level are significantly negative for the first three latent classes, suggesting that the fourth latent class is characterized by high glucose level. In addition, the second latent class is characterized by older subjects, and the third latent class is characterized by males and subjects with higher BMI.

Suppose that we are interested in the conditional survival function for a subject at risk at time $s$ given the trajectories of the longitudinal outcome measurements up to $s$. For a subject with time-independent covariates in the survival model, this probability function can be estimated by $h(t)/h(s)$ for $t \ge s$, where

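$$h(t)=\sum_{g=1}^{G}\frac{e^{\alpha_g^{\mathrm T}W}}{\sum_{l=1}^{G}e^{\alpha_l^{\mathrm T}W}}\int\exp\bigl\{-\Lambda_g(t)\,e^{\gamma_g^{\mathrm T}Z+\eta_g^{\mathrm T}b}\bigr\}\prod_{j=1}^{J}\prod_{k=1}^{K_j}\sigma_{gj}^{-1}e^{-\frac{1}{2\sigma_{gj}^{2}}\left(Y_{jk}-\beta_g^{\mathrm T}X_{jk}-b^{\mathrm T}\tilde X_{jk}\right)^{2}}\times\bigl|\Sigma(\xi_g)\bigr|^{-1/2}e^{-\frac{1}{2}b^{\mathrm T}\Sigma(\xi_g)^{-1}b}\,db,$$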

Kj is the number of observations on the jth longitudinal outcome by time s, and the parameters are evaluated at the sieve NPMLE. Figure S4 in the supplementary materials [21] shows the estimated curves for two hypothetical subjects at s = 10.

We use cross-validation to evaluate the robustness of the latent-class structure. We split the data into 20 pairs of training and validation datasets with a ratio of sample sizes of 3 : 2. On each training dataset, we fit the latent-class model with G = 4 and 2 interior knots for the B-spline functions, and for each subject in the corresponding validation dataset, we used the estimated model to compute the posterior probabilities of class membership given the subject’s covariates and longitudinal outcomes (but not the event time). A subject is predicted to belong to a latent class if the posterior probability of the class is larger than 0.7; a subject is unclassified if none of the posterior probabilities is larger than 0.7. Note that the prediction of latent class does not directly involve the event time of the subjects in the validation dataset.
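The data-splitting step can be sketched as follows. The text does not say how the 20 splits were generated, so independent random splits with a 60% training fraction (a 3 : 2 ratio) are an assumption, and `make_splits` is a hypothetical helper.

```python
import random

def make_splits(n_subjects, n_splits=20, train_fraction=0.6, seed=0):
    """Generate (training, validation) index pairs by repeated random
    splitting; train_fraction = 0.6 corresponds to a 3 : 2 ratio."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_subjects)
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_subjects))
        rng.shuffle(idx)
        splits.append((sorted(idx[:n_train]), sorted(idx[n_train:])))
    return splits
```

Each pair partitions the subject indices, so no validation subject contributes to the model fitted on the corresponding training set.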

To evaluate the explanatory power of the (predicted) latent classes, in each validation dataset, we fit the Cox model with covariates, including the baseline systolic blood pressure, the baseline total cholesterol level, and the predicted latent classes; unclassified subjects were discarded. We tested the significance of the latent classes in the model using the likelihood-ratio test. The combined p-value across data splits is 0.0248, where the combined p-value is defined as $\Phi\bigl\{0.05\sum_{s=1}^{20}\Phi^{-1}(p_s)\bigr\}$, $p_s$ is the p-value for the $s$th split, and $\Phi$ is the standard normal distribution function. In addition, we fit a stratified Cox model, stratifying on the latent classes, with covariates including the baseline covariates, the baseline systolic blood pressure, the baseline total cholesterol level, and the interaction between the latent classes and the other covariates. The combined p-value for the likelihood-ratio tests for the interaction terms is 0.0250. These results suggest the existence of heterogeneity in the population that is not captured by the observed covariates. Subjects from different latent classes have not only different baseline hazards but also different association patterns between the covariates and the risk of disease or death.
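The combined p-value is the standard normal distribution function evaluated at the average of the normal scores of the split-specific p-values (the factor 0.05 is 1/20). A minimal sketch using only the Python standard library; the function name `combined_p_value` is hypothetical, and each p-value must lie strictly between 0 and 1:

```python
from statistics import NormalDist

def combined_p_value(p_values):
    """Phi applied to the mean of Phi^{-1}(p_s) over the data splits."""
    std_normal = NormalDist()
    mean_z = sum(std_normal.inv_cdf(p) for p in p_values) / len(p_values)
    return std_normal.cdf(mean_z)
```

When all splits return the same p-value, the combined value equals that p-value, so the rule does not gain apparent significance merely from repeating splits of the same data.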

7. Discussion.

In this article, we consider a semiparametric latent-class model for the joint analysis of longitudinal outcomes and a potentially right-censored event time. We develop a novel estimation approach that combines NPMLE and sieve estimation. We prove that the nonparametric components of the proposed estimators are consistent at a rate of $o_p(n^{-1/4})$. Although sieve estimators generally converge at a rate slower than $n^{1/2}$, the Euclidean components of the estimators are nevertheless $n^{1/2}$-consistent and asymptotically normal.

Under the proposed model, covariates may be associated with the event time through the latent class membership or directly through the class-specific survival models. The regression parameters in the survival models are best interpreted conditional on the latent variables b and C, so that for a subject in a specific latent class, each covariate in the survival model contributes multiplicatively to the baseline hazard. To obtain an “overall” effect of the covariates, we may adopt a Monte-Carlo approach: repeatedly generate data from the estimated model and the observed covariates, and fit the Cox model on the generated event times and covariates. The estimated regression parameters could be interpreted as the overall effects of the covariates, combining the effects on the latent class membership and the class-specific event-time distributions.

We proposed to estimate the standard error of the estimators by the inverse of the observed information matrix. This approach yields satisfactory performance in our extensive numerical studies, but it may be numerically unstable in very large samples or models. If one is interested only in the inference of the Euclidean parameters, then alternative methods based on the profile likelihood can be adopted [22].

The constraints on the number of B-spline functions given by condition (C4) guarantee that $(\hat\psi_{ng})_{g=2,\ldots,G}$ converges to the true value at a rate faster than $n^{-1/4}$, so that the Euclidean parameters can attain the efficiency bound. Because the $\psi_{0g}$'s are continuously differentiable up to the third order, the approximation error of the spline functions is of rate $O(n^{-3q})$, and $q > 1/12$ is necessary for $\|\hat\psi_{ng}-\psi_{0g}\|_2 = o_p(n^{-1/4})$; this bound can be relaxed under stronger assumptions on the smoothness of the $\Lambda_{0g}$'s. The upper limit $q < 1/8$ arises from the shrinking-neighborhood-based argument for consistency. In the proof, we show that a local maximum of the log-likelihood exists in an $o(n^{-1/4})$-neighborhood of the true parameter values. The upper limit $q < 1/8$ is to guarantee that the second-order term in the linear expansion of the log-likelihood dominates other terms in the expansion.
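The two limits on $q$ can be traced through the exponents. The following is a sketch only, assuming that the number of B-spline functions grows as $m_n \asymp n^{q}$ (implied by the $O(n^{-3q})$ approximation error, but an assumption here) and using the neighborhood radius $\epsilon_n$ from the proof of Theorem 4.1 in Appendix A:

```latex
\begin{align*}
\text{Lower limit:}\quad
  & m_n^{-3} \asymp n^{-3q} = o\bigl(n^{-1/4}\bigr)
    \iff 3q > \tfrac14 \iff q > \tfrac1{12}. \\
\text{Upper limit:}\quad
  & \text{the proof needs } \epsilon_n = o\bigl(m_n^{-3/2}\bigr)
    \text{ together with } \epsilon_n^2 \gg n^{-1/2} m_n
    \text{ (up to an } o_p(1) \text{ factor)}; \\
  & \text{both hold for some } \epsilon_n \text{ when }
    n^{-1/2} m_n = o\bigl(m_n^{-3}\bigr)
    \iff n^{4q-1/2} \to 0 \iff q < \tfrac18 .
\end{align*}
```

Under these assumptions the admissible range is $1/12 < q < 1/8$, which agrees with the limits quoted above.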

An intuitively appealing nonparametric estimation approach is to set each class-specific cumulative baseline hazard function to be a step function that jumps at the observed event times. This approach, however, yields inconsistent estimators even in the simple settings considered by Ma and Wang [12] and Wang, Garcia and Ma [20] because the parameter space is overly complex. Each (uncensored) observation belongs to a specific latent class and should only contribute to the jump of the corresponding cumulative baseline hazard function at the observed event time. However, the latent class membership is unknown, and this nonparametric approach incorrectly allows all cumulative baseline hazard functions to jump at the event time. To overcome this difficulty, we only estimate the cumulative baseline hazard function of a reference class nonparametrically and approximate the relative magnitudes of the baseline hazard functions between the reference class and other classes using spline functions. With a properly-chosen number of grid points for the spline functions, the complexity of the parameter space is controlled to yield consistent estimators.
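To illustrate the construction, suppose the reference-class cumulative baseline hazard $\Lambda$ is a step function and the class-specific cumulative baseline hazard is $\Lambda_g(t)=\int_0^t e^{\psi_g(s)}\,d\Lambda(s)$, as the form of the likelihood in Appendix A suggests. The sketch below is not the authors' implementation; `class_cumulative_hazard` and its arguments are hypothetical names, and `psi_g` stands for any log-ratio function, such as a fitted spline.

```python
import math

def class_cumulative_hazard(t, jump_times, jump_sizes, psi_g):
    """Lambda_g(t): sum over the reference-class jumps at times <= t of
    exp(psi_g(jump time)) * jump size. For the reference class psi_g is
    identically zero, which recovers Lambda(t) itself."""
    return sum(
        math.exp(psi_g(u)) * d
        for u, d in zip(jump_times, jump_sizes)
        if u <= t
    )
```

All classes share the jump times of the reference class, and only the smooth multiplier $e^{\psi_g}$ differs across classes, which is how the complexity of the parameter space is kept under control.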

During the preparation of this article, independent work of Liu et al. [10] was brought to our attention. Our model is more general than that of Liu et al. [10], which allows only a single type of longitudinal outcome with a random intercept in the longitudinal outcome model, and Liu et al. [10] adopted spline approximation for all nonparametric functions. In addition, we establish the asymptotic properties of the proposed estimators under specific assumptions on the proposed models and the observed data, whereas the assumptions in Liu et al. [10] are expressed in very general terms and are difficult to verify for given models. To demonstrate the extra flexibility of the proposed model over that of Liu et al. [10], we conducted a simulation study, which showed that misspecification of the latent variable structure may yield substantial estimation bias; see Section S4 of the supplementary materials [21].

Our work can be extended in several directions. First, one may be interested in the joint analysis of multiple event times, such as the times to the occurrence of different diseases. The proposed modeling framework can be readily extended to allow for multivariate event times by assuming a separate regression model for each event time with a set of shared random effects b. The sieve NPMLE can be easily extended to the multivariate setting, and its theoretical properties can be established along the lines of the proofs of Theorems 4.1 and 4.2.

Second, one may consider interval-censored event time(s). In ARIC, the onset of asymptomatic diseases, such as diabetes and hypertension, was not directly observed but was known to fall within certain time intervals. To accommodate interval censoring, we can extend the proposed methods and use the NPMLE [28] to estimate the cumulative baseline hazard function of the reference class. However, interval censoring results in a different likelihood function, which poses great challenges to the derivation of the asymptotic properties of the sieve NPMLE.

Finally, it would be of interest to consider high-dimensional longitudinal outcomes or covariates. In current biomedical studies, different types of molecular data, such as DNA alteration and gene expression, are collected along with clinical data. Such molecular data are often high-dimensional, with the number of variables much larger than the sample size. These data contain rich genetic information that can be used to classify subjects into biologically distinct disease subtypes [17]. We can set variables for the molecular data as longitudinal outcomes or covariates in models (1)–(3) and adopt a penalized (sieve) likelihood approach for estimation.

Acknowledgements.

This work was supported by a research grant from the Hong Kong Polytechnic University (P0030124), the Hong Kong Research Grants Council grant PolyU 253042/18P, and the National Institutes of Health awards R01-HL149683 and R01-HG009974. The Atherosclerosis Risk in Communities study has been funded in whole or in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services, under Contract nos. (HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700005I, HHSN268201700004I). The authors thank the staff and participants of the ARIC study for their important contributions.

APPENDIX A: PROOFS OF THEOREMS

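In this appendix, we prove Theorems 4.1 and 4.2. The proofs make use of the lemmas given in Appendix B. To facilitate the presentation, we introduce the following notation. Let $\mathcal{M}_K=\{\Lambda \text{ on } [0,\tau]: \Lambda \text{ is monotone nondecreasing},\ \Lambda(0)=0,\ \Lambda(\tau)<K\}$. For some large enough positive constant $K$, let $\Xi_K=\Theta\times\mathcal{M}_K\times BV_K[0,\tau]^{G-1}$ be the parameter space of $(\theta,\Lambda,\psi_2,\ldots,\psi_G)$, where $BV_K[0,\tau]=\{\psi \text{ on } [0,\tau]: \|\psi\|_V<K\}$, and $\|\cdot\|_V$ is the total variation over $[0,\tau]$, such that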

$$\|f\|_V=\sup_{0=t_0\le t_1<\cdots<t_m=\tau}\ \sum_{j=1}^{m}\bigl|f(t_j)-f(t_{j-1})\bigr|.$$
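As a small numerical illustration of this definition, the sum inside the supremum can be computed for function values on a fixed partition. This is a sketch with a hypothetical helper name; for a fixed partition the sum is only a lower bound on the total variation, with equality when the function is monotone between consecutive partition points.

```python
def total_variation(values):
    """Sum of absolute increments of a function evaluated on an ordered
    partition 0 = t_0 <= t_1 < ... < t_m = tau."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))
```

For a monotone nondecreasing function the sum telescopes to the difference between the last and first values, which is why a bound on $\Lambda(\tau)$ controls the total variation of $\Lambda$.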

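The subscript $K$ for the parameter spaces may be suppressed in the sequel. Let $\Psi(\theta,\Lambda,\mathcal{B})$ denote

$$\sum_{g=1}^{G}\frac{e^{\alpha_g^{\mathrm T}W}}{\sum_{l=1}^{G}e^{\alpha_l^{\mathrm T}W}}\int\prod_{j=1}^{J}\prod_{k=1}^{N_j}\sigma_{gj}^{-1}e^{-\frac{1}{2\sigma_{gj}^{2}}\left(Y_{jk}-\beta_g^{\mathrm T}X_{jk}-b^{\mathrm T}\tilde X_{jk}\right)^{2}}\Bigl\{e^{\gamma_g^{\mathrm T}Z(\tilde T)+\psi_g(\tilde T)+\eta_g^{\mathrm T}b}\Bigr\}^{\Delta}\times\exp\Bigl\{-\int_0^{\tilde T}e^{\gamma_g^{\mathrm T}Z(t)+\psi_g(t)+\eta_g^{\mathrm T}b}\,d\Lambda(t)\Bigr\}\bigl|\Sigma(\xi_g)\bigr|^{-1/2}e^{-\frac{1}{2}b^{\mathrm T}\Sigma(\xi_g)^{-1}b}\,db,$$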

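so that the likelihood for a generic subject is proportional to $\Lambda\{\tilde T\}^{\Delta}\,\Psi(\theta,\Lambda,\mathcal{B})$. Let $\dot\Psi_\theta(\theta,\Lambda,\mathcal{B})$ denote the derivative of $\Psi(\theta,\Lambda,\mathcal{B})$ with respect to $\theta$, $\dot\Psi_\Lambda(\theta,\Lambda,\mathcal{B})[H]$ denote the derivative of $\Psi(\theta,\Lambda,\mathcal{B})$ with respect to $\Lambda$ along the direction $H$, and $\dot\Psi_{\psi_g}(\theta,\Lambda,\mathcal{B})[h]$ denote the derivative of $\Psi(\theta,\Lambda,\mathcal{B})$ with respect to $\psi_g$ along the direction $h$.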

In the sequel, we use $\|\cdot\|$ to denote the Euclidean norm for vectors and the $L_2$-norm with respect to the Lebesgue measure for functions over $[0,\tau]$. For a set of functions $\mathcal{B}\equiv(\psi_2,\ldots,\psi_G)$, let $\|\mathcal{B}\|^2=\sum_{g=2}^{G}\|\psi_g\|^2$. Let $\mathbb{P}$ and $\mathbb{P}_n$ denote the true and empirical measures, respectively.

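Proof of Theorem 4.1. Following Schumaker [15], under condition (C1), there exist functions $\tilde\psi_{n2},\ldots,\tilde\psi_{nG}$ such that $\|\tilde\psi_{ng}-\psi_{0g}\|_\infty=O(m_n^{-3})$ for $g=2,\ldots,G$, where $\tilde\psi_{ng}=\sum_{s=1}^{m_n}\tilde a_{gs}B_s$ for some regression parameters $(\tilde a_{gs})_{g=2,\ldots,G;\,s=1,\ldots,m_n}$. Let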

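$$\mathcal{N}_{\epsilon_n}=\Bigl\{(\psi_2,\ldots,\psi_G):\ \psi_g=\sum_{s=1}^{m_n}a_{gs}B_s,\ \sum_{s=1}^{m_n}(a_{gs}-\tilde a_{gs})^2\le\epsilon_n^2,\ g=2,\ldots,G\Bigr\},$$

where $\epsilon_n$ is a positive sequence such that $\epsilon_n=o(m_n^{-3/2})$. For $\mathcal{B}_n\equiv(\psi_{n2},\ldots,\psi_{nG})\in\mathcal{N}_{\epsilon_n}$,

$$\|\psi_{ng}-\tilde\psi_{ng}\|_V\le\sum_{s=1}^{m_n}|a_{gs}-\tilde a_{gs}|\,\|B_s'\|_\infty=O\bigl(m_n(\epsilon_n^2m_n)^{1/2}\bigr)=o(1).$$

Therefore, each function $\psi_{ng}$ of $\mathcal{N}_{\epsilon_n}$ has bounded total variation and converges uniformly to $\psi_{0g}$.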

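The outline of the proof is as follows. For any sequence of $\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}$, we define

$$\bigl(\hat\theta_n(\mathcal{B}_n),\hat\Lambda_n(\mathcal{B}_n)\bigr)=\arg\max_{\theta,\Lambda}\ \mathbb{P}_n\ell(\theta,\Lambda,\mathcal{B}_n).$$

First, we show that $(\hat\theta_n(\mathcal{B}_n),\hat\Lambda_n(\mathcal{B}_n))\to_p(\theta_0,\Lambda_0)$ uniformly over $\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}$. Then, we derive the rate of convergence of $(\hat\theta_n(\mathcal{B}_n),\hat\Lambda_n(\mathcal{B}_n))$ in terms of $\epsilon_n$. Finally, we show that the maximum of the profile log-likelihood $\mathbb{P}_n\ell(\hat\theta_n(\mathcal{B}_n),\hat\Lambda_n(\mathcal{B}_n),\mathcal{B}_n)$ over $\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}$ lies in the interior of $\mathcal{N}_{\epsilon_n}$ for some $\epsilon_n=o(n^{-1/4}m_n^{1/2})$ and for large enough $n$. For simplicity of presentation, we suppress the argument $\mathcal{B}_n$ in $\hat\theta_n(\mathcal{B}_n)$ and $\hat\Lambda_n(\mathcal{B}_n)$ in the sequel.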

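Step 1. We prove the existence of the NPMLE, i.e., $\hat\Lambda_n(\tau)<\infty$. Let $\pi_g=e^{\alpha_g^{\mathrm T}W}/\sum_{l=1}^{G}e^{\alpha_l^{\mathrm T}W}$ and $f_g(Y,b)$ denote the joint density of $(Y,b)$ for the $g$th latent class (given $N_1,\ldots,N_J$); we suppress the parameter or covariate values in the expressions for simplicity of presentation. Note that

$$\begin{aligned}\Psi(O;\theta,\Lambda,\mathcal{B})&\lesssim\sum_{g=1}^{G}\pi_g\int\Bigl\{e^{\gamma_g^{\mathrm T}Z(\tilde T)+\eta_g^{\mathrm T}b+\psi_g(\tilde T)}\Bigr\}^{\Delta}\Bigl\{1+\int_0^{\tilde T}e^{\gamma_g^{\mathrm T}Z(s)+\eta_g^{\mathrm T}b+\psi_g(s)}\,d\Lambda(s)\Bigr\}^{-\kappa}f_g(Y,b)\,db\\&\lesssim\sum_{g=1}^{G}\pi_g\,e^{\psi_g(\tilde T)\Delta}\Bigl\{1+\int_0^{\tilde T}e^{\psi_g(s)}\,d\Lambda(s)\Bigr\}^{-\kappa}\int e^{O(1+\|b\|)}f_g(Y,b)\,db\end{aligned}$$

for some constant $\kappa>1$, where $\lesssim$ denotes “smaller than up to a scaling factor.” Therefore, if $\Lambda(\tau)=\infty$, then the right-hand side of the above inequality is zero. We conclude that $\hat\Lambda_n(\tau)<\infty$, so that the NPMLE exists.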

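Step 2. We show that the NPMLE is uniformly bounded. Note that

$$\frac1n\log L_n(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)\le\frac1n\sum_{i=1}^{n}\Delta_i\log\hat\Lambda_n\{\tilde T_i\}+\frac1n\sum_{i=1}^{n}\log\sum_{g=1}^{G}\pi_{gi}\,e^{\psi_{ng}(\tilde T_i)\Delta_i}\times\Bigl\{1+\int_0^{\tilde T_i}e^{\psi_{ng}(s)}\,d\hat\Lambda_n(s)\Bigr\}^{-\kappa}\int e^{O(1+\|b\|)}f_g(Y_i,b)\,db.$$

Let $\tilde N_n(\cdot)=n^{-1}\sum_{i=1}^{n}\Delta_iI(\tilde T_i\le\cdot)$. We have

$$\frac1n\log L_n(\theta_0,\tilde N_n,\mathcal{B}_n)\ge-\frac1n\sum_{i=1}^{n}\Delta_i\log n+O_p(1),$$

where the second term on the right-hand side is asymptotically bounded uniformly over $\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}$. Thus,

$$\frac1n\log L_n(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)-\frac1n\log L_n(\theta_0,\tilde N_n,\mathcal{B}_n)\le\frac1n\sum_{i=1}^{n}\Delta_i\log\bigl(n\hat\Lambda_n\{\tilde T_i\}\bigr)-\frac{\kappa}{n}\sum_{i=1}^{n}\log\bigl\{1+\hat\Lambda_n(\tau)\bigr\}+O_p(1).$$

Using a partitioning argument similar to that of Murphy [13], we can show that the right-hand side of the above inequality tends to $-\infty$ if $\limsup_n\hat\Lambda_n(\tau)=\infty$. By the definition of $(\hat\theta_n,\hat\Lambda_n)$, the left-hand side of the inequality is nonnegative, so that $\limsup_n\sup_{\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}}\hat\Lambda_n(\tau)<\infty$.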

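Step 3. We show that $(\hat\theta_n,\hat\Lambda_n)$ is consistent. Because $\hat\Lambda_n$ belongs to a function space with bounded total variation, by Helly’s selection theorem, for every subsequence of $\{n\}_{n=1,2,\ldots}$, there exists a further subsequence such that $\hat\theta_n\to\theta^*$ and $\hat\Lambda_n\to\Lambda^*$ for some $(\theta^*,\Lambda^*)$. We show that $\theta^*=\theta_0$ and $\Lambda^*=\Lambda_0$ for any subsequence. With an abuse of notation, let $\{n\}_{n=1,2,\ldots}$ be the subsequence. Let

$$\tilde\Lambda_n(t)=\sum_{i=1}^{n}\Delta_iI(\tilde T_i\le t)\Biggl[\sum_{j=1}^{n}\frac{\dot\Psi_\Lambda(O_j;\theta_0,\Lambda_0,\mathcal{B}_0)[I(\cdot\ge\tilde T_i)]}{\Psi(O_j;\theta_0,\Lambda_0,\mathcal{B}_0)}\Biggr]^{-1}.$$

Note that $\dot\Psi_\Lambda(\theta,\Lambda,\mathcal{B})[I(\cdot\ge t)]=I(\tilde T\ge t)\sum_{g=1}^{G}\pi_g\int Q_g(O,b)\,e^{\gamma_g^{\mathrm T}Z(t)+\eta_g^{\mathrm T}b+\psi_g(t)}\,db$, where

$$Q_g(O,b)=\Bigl\{e^{\gamma_g^{\mathrm T}Z(\tilde T)+\eta_g^{\mathrm T}b+\psi_g(\tilde T)}\Bigr\}^{\Delta}\exp\Bigl\{-\int_0^{\tilde T}e^{\gamma_g^{\mathrm T}Z(t)+\eta_g^{\mathrm T}b+\psi_g(t)}\,d\Lambda(t)\Bigr\}f_g(Y,b).$$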

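By the definition of the NPMLE, $\mathbb{P}_n\ell(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)\ge\mathbb{P}_n\ell(\theta_0,\tilde\Lambda_n,\mathcal{B}_n)$, so

$$\mathbb{P}_n\Bigl[\Delta\log\frac{\hat\Lambda_n\{\tilde T\}}{\tilde\Lambda_n\{\tilde T\}}\Bigr]+\mathbb{P}_n\log\frac{\Psi(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)}{\Psi(\theta_0,\tilde\Lambda_n,\mathcal{B}_n)}\ge0.\qquad(6)$$

Note that

$$\mathbb{P}_n\log\Psi(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)-\mathbb{P}\log\Psi(\theta^*,\Lambda^*,\mathcal{B}_0)=(\mathbb{P}_n-\mathbb{P})\log\Psi(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)+\mathbb{P}\bigl\{\log\Psi(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)-\log\Psi(\theta^*,\Lambda^*,\mathcal{B}_0)\bigr\},$$

where the first term on the right-hand side goes to zero almost surely because the class of $\log\Psi(\theta,\Lambda,\mathcal{B})$ is Glivenko–Cantelli by Lemma B.1, and the second term is $o(1)$ by the dominated convergence theorem; note that both terms converge uniformly over $\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}$. By a similar argument on $\mathbb{P}_n\log\Psi(\theta_0,\tilde\Lambda_n,\mathcal{B}_n)$, the second term on the left-hand side of (6) is

$$\mathbb{P}_n\log\frac{\Psi(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)}{\Psi(\theta_0,\tilde\Lambda_n,\mathcal{B}_n)}=\mathbb{P}\log\frac{\Psi(\theta^*,\Lambda^*,\mathcal{B}_0)}{\Psi(\theta_0,\Lambda_0,\mathcal{B}_0)}+o_p(1),$$

where the $o_p(1)$ term tends to 0 almost surely.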

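Consider the first term on the left-hand side of (6). Note that

$$\hat\Lambda_n(t)=\int_0^t\frac{\mathbb{P}_n\nu(\theta_0,\Lambda_0,\mathcal{B}_0;s)}{\mathbb{P}_n\nu(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n;s)}\,d\tilde\Lambda_n(s),\qquad(7)$$

where $\nu(\theta,\Lambda,\mathcal{B};t)=\dot\Psi_\Lambda(\theta,\Lambda,\mathcal{B})[I(\cdot\ge t)]/\Psi(\theta,\Lambda,\mathcal{B})$. By Lemma B.1, $\{\nu(\theta,\Lambda,\mathcal{B};t):t\in[0,\tau],(\theta,\Lambda,\mathcal{B})\in\Xi\}$ is Glivenko–Cantelli, so

$$\sup_{t\in[0,\tau]}\bigl|(\mathbb{P}_n-\mathbb{P})\nu(\theta_0,\Lambda_0,\mathcal{B}_0;t)\bigr|+\sup_{\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}}\sup_{t\in[0,\tau]}\bigl|(\mathbb{P}_n-\mathbb{P})\nu(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n;t)\bigr|\to_{a.s.}0.$$

By the dominated convergence theorem, $\mathbb{P}\nu(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n;t)$ converges to $\mathbb{P}\nu(\theta^*,\Lambda^*,\mathcal{B}_0;t)$ for each $t$. In addition, it is easy to see that the derivative of $\mathbb{P}\nu(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n;t)$ with respect to $t$ is uniformly bounded, so that $\mathbb{P}\nu(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n;t)$ is equicontinuous with respect to $t$. Thus, by the Arzelà–Ascoli theorem, $\mathbb{P}\nu(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n;t)\to\mathbb{P}\nu(\theta^*,\Lambda^*,\mathcal{B}_0;t)$ uniformly in $t\in[0,\tau]$. Furthermore, we can follow the argument in Zeng, Lin and Lin [26, p. 374] to show by contradiction that $\min_{t\in[0,\tau]}\mathbb{P}\nu(\theta^*,\Lambda^*,\mathcal{B}_0;t)>0$. Taking limit on both sides of (7) yields

$$\Lambda^*(t)=\int_0^t\frac{\mathbb{P}\nu(\theta_0,\Lambda_0,\mathcal{B}_0;s)}{\mathbb{P}\nu(\theta^*,\Lambda^*,\mathcal{B}_0;s)}\,d\Lambda_0(s).$$

We conclude that $\Lambda^*$ is absolutely continuous with respect to $\Lambda_0$ and thus is differentiable. Let $\lambda^*$ be the derivative of $\Lambda^*$. Combining the above results with (6), we have

$$\mathbb{P}\log\frac{\lambda^*(\tilde T)^{\Delta}\,\Psi(\theta^*,\Lambda^*,\mathcal{B}_0)}{\lambda_0(\tilde T)^{\Delta}\,\Psi(\theta_0,\Lambda_0,\mathcal{B}_0)}\ge0.$$

By the nonnegativity of the Kullback–Leibler divergence and Lemma B.2, the left-hand side of the above inequality is nonpositive and is equal to zero if and only if $(\theta^*,\Lambda^*)=(\theta_0,\Lambda_0)$. Therefore, $(\hat\theta_n,\hat\Lambda_n)$ is consistent.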

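Step 4. We derive a bound on $\|\hat\theta_n-\theta_0\|+\|\hat\Lambda_n-\Lambda_0\|$ in terms of $\|\mathcal{B}_n-\mathcal{B}_0\|$. For any $h_\theta\in\mathbb{R}^d$ and $h_\Lambda\in BV[0,\tau]$, let

$$\dot\ell_{\theta\Lambda}(\theta,\Lambda,\mathcal{B})[h_\theta,h_\Lambda]=\frac{\partial}{\partial\epsilon}\,\ell\Bigl(\theta+\epsilon h_\theta,\ \Lambda+\epsilon\int h_\Lambda\,d\Lambda,\ \mathcal{B}\Bigr)\Big|_{\epsilon=0}.$$

Clearly, $\mathbb{P}_n\dot\ell_{\theta\Lambda}(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)[h_\theta,h_\Lambda]=0$ and $\mathbb{P}\dot\ell_{\theta\Lambda}(\theta_0,\Lambda_0,\mathcal{B}_0)[h_\theta,h_\Lambda]=0$ for any $(h_\theta,h_\Lambda)$. Suppressing the arguments $(h_\theta,h_\Lambda)$, we have

$$\begin{aligned}&\mathbb{P}\dot\ell_{\theta\Lambda}(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_0)-\mathbb{P}\dot\ell_{\theta\Lambda}(\theta_0,\Lambda_0,\mathcal{B}_0)\\&\quad=\mathbb{P}\dot\ell_{\theta\Lambda}(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_0)-\mathbb{P}_n\dot\ell_{\theta\Lambda}(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)\\&\quad=-(\mathbb{P}_n-\mathbb{P})\dot\ell_{\theta\Lambda}(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)-\mathbb{P}\bigl\{\dot\ell_{\theta\Lambda}(\theta_0,\Lambda_0,\mathcal{B}_n)-\dot\ell_{\theta\Lambda}(\theta_0,\Lambda_0,\mathcal{B}_0)\bigr\}\\&\qquad-\mathbb{P}\bigl\{\dot\ell_{\theta\Lambda}(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)-\dot\ell_{\theta\Lambda}(\theta_0,\Lambda_0,\mathcal{B}_n)-\dot\ell_{\theta\Lambda}(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_0)+\dot\ell_{\theta\Lambda}(\theta_0,\Lambda_0,\mathcal{B}_0)\bigr\}.\end{aligned}$$

By Lemma B.1, the class $\{\dot\ell_{\theta\Lambda}(\theta,\Lambda,\mathcal{B})[h_\theta,h_\Lambda]:(\theta,\Lambda,\mathcal{B})\in\Xi,\ \|h_\theta\|\le1,\ \|h_\Lambda\|_V\le1\}$ is Donsker, so that the first term on the right-hand side above is $O_p(n^{-1/2})$ uniformly over $\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}$. By repeated applications of the mean-value theorem, we can show that the second term is $O(\|\mathcal{B}_n-\mathcal{B}_0\|)$ and the third term is $o(\|\hat\theta_n-\theta_0\|+\|\hat\Lambda_n-\Lambda_0\|)$. To evaluate the left-hand side of the above display, note that $\dot\ell_{\theta\Lambda}(\theta,\Lambda,\mathcal{B}_0)$ is the score statistic of a survival model with a single nonparametric component; the model falls under the framework of, for example, Zeng and Lin [25]. Using arguments analogous to the proof of Theorem 3.2 of Zeng and Cai [24] and the proof of Theorem 2 of Zeng and Lin [27], we can show that the map $(\theta,\Lambda)\mapsto\mathbb{P}\dot\ell_{\theta\Lambda}(\theta,\Lambda,\mathcal{B}_0)$ is Fréchet-differentiable with a derivative $\nabla\mathbb{P}\dot\ell_{\theta\Lambda}$ that takes the form of a Fredholm operator. By Lemma B.4, $\nabla\mathbb{P}\dot\ell_{\theta\Lambda}$ (evaluated at the true parameter values) is one-to-one, so it is continuously invertible. Therefore, there exists some positive constant $c_1$ such that $\|\nabla\mathbb{P}\dot\ell_{\theta\Lambda}(\hat\theta_n-\theta_0,\hat\Lambda_n-\Lambda_0)\|\ge c_1(\|\hat\theta_n-\theta_0\|+\|\hat\Lambda_n-\Lambda_0\|)$, where the norm on the left-hand side of the inequality is the supremum norm over $\{(h_\theta,h_\Lambda):\|h_\theta\|\le1,\|h_\Lambda\|_V\le1\}$. By the consistency of $(\hat\theta_n,\hat\Lambda_n)$ and the differentiability of $\mathbb{P}\dot\ell_{\theta\Lambda}$,

$$\bigl\|\mathbb{P}\dot\ell_{\theta\Lambda}(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_0)-\mathbb{P}\dot\ell_{\theta\Lambda}(\theta_0,\Lambda_0,\mathcal{B}_0)\bigr\|\ge\{c_1+o(1)\}\bigl(\|\hat\theta_n-\theta_0\|+\|\hat\Lambda_n-\Lambda_0\|\bigr).$$

Combining the above results, we conclude that

$$\|\hat\theta_n-\theta_0\|+\|\hat\Lambda_n-\Lambda_0\|\le A_n\bigl(n^{-1/2}+\|\mathcal{B}_n-\mathcal{B}_0\|\bigr),$$

where $A_n$ is some random variable that may depend on $\mathcal{B}_n$ and satisfies $\sup_{\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}}A_n=O_p(1)$.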

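Step 5. We show that a local maximum of $\mathbb{P}_n\ell(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)$ with respect to $\mathcal{B}_n$ exists in the interior of $\mathcal{N}_{\epsilon_n}$ for large enough $n$. It suffices to show that $\sup_{\mathcal{B}_n\in\partial\mathcal{N}_{\epsilon_n}}\mathbb{P}_n\ell(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)<\mathbb{P}_n\ell(\theta_0,\tilde\Lambda_n,\tilde{\mathcal{B}}_n)$ with probability going to 1 as $n\to\infty$, where $\tilde{\mathcal{B}}_n=(\tilde\psi_{n2},\ldots,\tilde\psi_{nG})$. Let

$$\begin{aligned}\ell(\mathcal{B}_n)&=\mathbb{P}_n\ell(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)-\mathbb{P}_n\ell(\theta_0,\tilde\Lambda_n,\tilde{\mathcal{B}}_n)\\&=(\mathbb{P}_n-\mathbb{P})\bigl\{\ell(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)-\ell(\theta_0,\tilde\Lambda_n,\tilde{\mathcal{B}}_n)\bigr\}+\mathbb{P}\bigl\{\ell(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)-\ell(\theta_0,\tilde\Lambda_n,\mathcal{B}_0)\bigr\}\\&\quad-\mathbb{P}\bigl\{\ell(\theta_0,\tilde\Lambda_n,\tilde{\mathcal{B}}_n)-\ell(\theta_0,\tilde\Lambda_n,\mathcal{B}_0)\bigr\}.\end{aligned}\qquad(8)$$

By Lemma B.1, the first term on the right-hand side of (8) can be written as $C_nn^{-1/2}$ for some variable $C_n$ such that $\sup_{\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}}C_n=o_p(1)$. To evaluate the second term on the right-hand side above, let

$$\xi(\epsilon;\Lambda)=\mathbb{P}\ell\Bigl(\theta_0+\epsilon(\hat\theta_n-\theta_0),\ \Lambda+\epsilon\int\hat h_\Lambda\,d\Lambda,\ \mathcal{B}_0+\epsilon(\mathcal{B}_n-\mathcal{B}_0)\Bigr),$$

where $\hat h_\Lambda$ is a step function that jumps at the observed event times, with $\hat h_\Lambda=d\hat\Lambda_n/d\tilde\Lambda_n-1$ at those points. The second term of the right-hand side of (8) is equal to $\xi(1;\tilde\Lambda_n)-\xi(0;\tilde\Lambda_n)=\xi'(0;\tilde\Lambda_n)+\tfrac12\xi''(\epsilon;\tilde\Lambda_n)$ for some $\epsilon\in[0,1]$. Note that $\xi'(0;\tilde\Lambda_n)$ is equal to

$$\begin{aligned}&\mathbb{P}\Bigl[\Delta\hat h_\Lambda(\tilde T)+\frac{\dot\Psi(\theta_0,\tilde\Lambda_n,\mathcal{B}_0)\bigl[\hat\theta_n-\theta_0,\int\hat h_\Lambda\,d\tilde\Lambda_n,\mathcal{B}_n-\mathcal{B}_0\bigr]}{\Psi(\theta_0,\tilde\Lambda_n,\mathcal{B}_0)}\Bigr]\\&\quad=\mathbb{P}\Bigl[\Delta\hat h_\Lambda(\tilde T)+\frac{\dot\Psi(\theta_0,\Lambda_0,\mathcal{B}_0)\bigl[\hat\theta_n-\theta_0,\int\hat h_\Lambda\,d\Lambda_0,\mathcal{B}_n-\mathcal{B}_0\bigr]}{\Psi(\theta_0,\Lambda_0,\mathcal{B}_0)}\Bigr]\\&\qquad+\mathbb{P}\Bigl[\frac{\dot\Psi(\theta_0,\tilde\Lambda_n,\mathcal{B}_0)\bigl[\hat\theta_n-\theta_0,\int\hat h_\Lambda\,d\tilde\Lambda_n,\mathcal{B}_n-\mathcal{B}_0\bigr]}{\Psi(\theta_0,\tilde\Lambda_n,\mathcal{B}_0)}-\frac{\dot\Psi(\theta_0,\Lambda_0,\mathcal{B}_0)\bigl[\hat\theta_n-\theta_0,\int\hat h_\Lambda\,d\Lambda_0,\mathcal{B}_n-\mathcal{B}_0\bigr]}{\Psi(\theta_0,\Lambda_0,\mathcal{B}_0)}\Bigr]\\&\quad=O_p\Bigl(\|\tilde\Lambda_n-\Lambda_0\|\bigl(\|\hat\theta_n-\theta_0\|+\|\hat h_\Lambda\|_V+\|\mathcal{B}_n-\mathcal{B}_0\|_V\bigr)\Bigr),\end{aligned}$$

where $\dot\Psi(\theta,\Lambda,\mathcal{B})[h_\theta,H_\Lambda,h_{\mathcal{B}}]=\dot\Psi_\theta(\theta,\Lambda,\mathcal{B})^{\mathrm T}h_\theta+\dot\Psi_\Lambda(\theta,\Lambda,\mathcal{B})[H_\Lambda]+\sum_{g=2}^{G}\dot\Psi_{\psi_g}(\theta,\Lambda,\mathcal{B})[h_g]$ for $h_{\mathcal{B}}=(h_2,\ldots,h_G)$. The last equality above follows from the mean-value theorem and the fact that the score statistic has mean zero. By standard arguments for the NPMLE, $\|\tilde\Lambda_n-\Lambda_0\|=O_p(n^{-1/2})$. Also, $\|\hat h_\Lambda\|_V=o_p(1)$ and $\|\mathcal{B}_n-\mathcal{B}_0\|_V=o(1)$, so the right-hand side of the above equation is $o_p(n^{-1/2})$. To evaluate $\xi''(\epsilon;\tilde\Lambda_n)$, we write

$$\xi''(\epsilon;\tilde\Lambda_n)=\bigl\{\xi''(\epsilon;\tilde\Lambda_n)-\xi''(0;\tilde\Lambda_n)\bigr\}+\bigl\{\xi''(0;\tilde\Lambda_n)-\xi''(0;\Lambda_0)\bigr\}+\xi''(0;\Lambda_0).$$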

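Using the mean-value theorem, we can show that the first term on the right-hand side of the above equation is $O_p(\|\hat\theta_n-\theta_0\|^3+\|\hat h_\Lambda\|^3+\|\mathcal{B}_n-\mathcal{B}_0\|_3^3)+o_p(\|\tilde\Lambda_n-\Lambda_0\|)$. Following the arguments for the evaluation of $\xi'(0;\tilde\Lambda_n)$, we can show that the second term is $o_p(n^{-1/2})$. Note that the third term is the negative information of the one-dimensional submodel $\theta=\theta_0+\epsilon h_\theta$, $d\Lambda=(1+\epsilon h_\Lambda)\,d\Lambda_0$, and $\mathcal{B}=\mathcal{B}_0+\epsilon h_{\mathcal{B}}$, where $h_\theta=\hat\theta_n-\theta_0$, $h_\Lambda=\hat h_\Lambda$, and $h_{\mathcal{B}}=\mathcal{B}_n-\mathcal{B}_0$. Let $\mathcal{H}=\mathbb{R}^d\times L_2[0,\tau]^G$. For any $h\equiv(h_\theta,h_\Lambda,h_{\psi_2},\ldots,h_{\psi_G})\in\mathcal{H}$, the score statistic of the submodel along direction $h$ is

$$\begin{aligned}\dot\ell[h]=\ &\sum_{g=1}^{G}\pi_g\int Q_g(\tilde T,\Delta,Y,b)\Biggl[\Bigl\{1-\frac{\sum_{l=1}^{G}\pi_l\int Q_l(\tilde T,\Delta,Y,\tilde b)\,d\tilde b}{\int Q_g(\tilde T,\Delta,Y,\tilde b)\,d\tilde b}\Bigr\}W^{\mathrm T}h_{\alpha_g}+\Delta\bigl\{Z(\tilde T)^{\mathrm T}h_{\gamma_g}+b^{\mathrm T}h_{\eta_g}+h_\Lambda(\tilde T)+h_{\psi_g}(\tilde T)\bigr\}\\&\quad-\int_0^{\tilde T}e^{Z(s)^{\mathrm T}\gamma_{0g}+\eta_{0g}^{\mathrm T}b+\psi_{0g}(s)}\bigl\{Z(s)^{\mathrm T}h_{\gamma_g}+b^{\mathrm T}h_{\eta_g}+h_\Lambda(s)+h_{\psi_g}(s)\bigr\}\,d\Lambda_0(s)+\frac{f_g^{(1)}(Y,b)^{\mathrm T}h_{Y_g}}{f_g(Y,b)}\Biggr]db\\&\Bigg/\sum_{g=1}^{G}\pi_g\int Q_g(\tilde T,\Delta,Y,b)\,db\ \equiv\ K(\tilde T,\Delta,Y;h),\end{aligned}$$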

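where $Q_g(\tilde T,\Delta,Y,b)=Q_g(O,b)$, $f_g^{(1)}(Y,b)$ is the derivative of $f_g(Y,b)$ with respect to $(\beta_g,\sigma_g^2,\xi_g)$, $h_{Y_g}=(h_{\beta_g}^{\mathrm T},h_{\sigma_g},h_{\xi_g}^{\mathrm T})^{\mathrm T}$, $(h_{\alpha_g},h_{\beta_g},h_{\sigma_g},h_{\xi_g},h_{\gamma_g},h_{\eta_g})$ are the directions that correspond to the parameters $(\alpha_g,\beta_g,\sigma_g^2,\xi_g,\gamma_g,\eta_g)$ for $g=1,\ldots,G$, $h_{\alpha_G}=0$, and $h_{\psi_1}=0$. For $h^{(1)},h^{(2)}\in\mathcal{H}$, we can write

$$\mathbb{P}\bigl(\dot\ell[h^{(1)}]\,\dot\ell[h^{(2)}]\bigr)=h_\theta^{(1)\mathrm T}G_1(h^{(2)})+\sum_{g=1}^{G}\int_0^\tau\bigl\{h_\Lambda^{(1)}(t)+h_{\psi_g}^{(1)}(t)\bigr\}G_{2g}(t;h^{(2)})\,dt,$$

where $G_1(h)$ is some linear function of $h$, and $G_{2g}(t;h)$ is equal to

$$\begin{aligned}&E\Bigl[\frac{\pi_g\int Q_g(t,1,Y,b)\,db}{\sum_{l=1}^{G}\pi_l\int Q_l(t,1,Y,b)\,db}f_T(t\mid Y)S_U(t\mid Y)K(t,1,Y;h)\Bigr]\\&\quad-E\Bigl[I(t\le\tilde T)\frac{\pi_g\int Q_g(\tilde T,\Delta,Y,b)\,e^{Z(t)^{\mathrm T}\gamma_{0g}+\eta_{0g}^{\mathrm T}b+\psi_{0g}(t)}\,db}{\sum_{l=1}^{G}\pi_l\int Q_l(\tilde T,\Delta,Y,b)\,db}K(\tilde T,\Delta,Y;h)\Bigr]\lambda_0(t)\\&=a^{\mathrm T}h_\theta+E\Bigl[\frac{\pi_g\int Q_g(t,1,Y,b)\,db}{\sum_{l=1}^{G}\pi_l\int Q_l(t,1,Y,b)\,db}f_T(t\mid Y)S_U(t\mid Y)\Bigr]h_\Lambda(t)\\&\quad+\sum_{k=2}^{G}E\Bigl[\frac{\pi_g\pi_k\int Q_g(t,1,Y,b)\,db\int Q_k(t,1,Y,b)\,db}{\bigl\{\sum_{l=1}^{G}\pi_l\int Q_l(t,1,Y,b)\,db\bigr\}^2}f_T(t\mid Y)S_U(t\mid Y)\Bigr]h_{\psi_k}(t)\\&\quad-\sum_{k=1}^{G}\int_0^\tau\bigl\{h_\Lambda(s)+h_{\psi_k}(s)\bigr\}\Biggl(I(s\le t)\,E\Bigl[\pi_g\pi_kf_T(t\mid Y)S_U(t\mid Y)\frac{\int Q_g(t,1,Y,b)\,db\int Q_k(t,1,Y,b)\,e^{Z(s)^{\mathrm T}\gamma_{0k}+\eta_{0k}^{\mathrm T}b+\psi_{0k}(s)}\,db}{\bigl\{\sum_{l=1}^{G}\pi_l\int Q_l(t,1,Y,b)\,db\bigr\}^2}\Bigr]\\&\qquad+I(t\le s)\,E\Bigl[\pi_g\pi_k\frac{\int Q_g(s,1,Y,b)\,e^{Z(t)^{\mathrm T}\gamma_{0g}+\eta_{0g}^{\mathrm T}b+\psi_{0g}(t)}\,db\int Q_k(s,1,Y,b)\,db}{\bigl\{\sum_{l=1}^{G}\pi_l\int Q_l(s,1,Y,b)\,db\bigr\}^2}f_T(s\mid Y)S_U(s\mid Y)\Bigr]\\&\qquad-E\Bigl[I(s\le\tilde T)I(t\le\tilde T)\,\pi_g\pi_k\frac{\int Q_g(\tilde T,\Delta,Y,b)\,e^{Z(t)^{\mathrm T}\gamma_{0g}+\eta_{0g}^{\mathrm T}b+\psi_{0g}(t)}\,db\int Q_k(\tilde T,\Delta,Y,b)\,e^{Z(s)^{\mathrm T}\gamma_{0k}+\eta_{0k}^{\mathrm T}b+\psi_{0k}(s)}\,db}{\bigl\{\sum_{l=1}^{G}\pi_l\int Q_l(\tilde T,\Delta,Y,b)\,db\bigr\}^2}\Bigr]\Biggr)\lambda_0(s)\,ds,\end{aligned}$$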

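where $f_T(\cdot\mid Y)$ is the conditional density of the survival time $T$ given $Y$, $S_U(\cdot\mid Y)$ is the conditional survival function of the censoring time $U$ given $Y$, and $a$ is a $d$-dimensional vector. Define an inner product $\langle\cdot,\cdot\rangle$ on $\mathcal{H}$ such that

$$\langle h^{(1)},h^{(2)}\rangle=h_\theta^{(1)\mathrm T}h_\theta^{(2)}+\int_0^\tau\Bigl\{h_\Lambda^{(1)}(t)h_\Lambda^{(2)}(t)+\sum_{g=2}^{G}h_{\psi_g}^{(1)}(t)h_{\psi_g}^{(2)}(t)\Bigr\}\,dt,$$

and let $\dot\ell^*$ be the adjoint operator of $\dot\ell$. By the definition of $\dot\ell^*$, $\mathbb{P}(\dot\ell[h^{(1)}]\,\dot\ell[h^{(2)}])=\langle h^{(1)},\dot\ell^*\dot\ell[h^{(2)}]\rangle$, such that

$$\dot\ell^*\dot\ell[h]=\Bigl(G_1(h),\ \sum_{g=1}^{G}G_{2g}(\cdot;h),\ G_{22}(\cdot;h),\ \ldots,\ G_{2G}(\cdot;h)\Bigr).$$

On the space $\mathcal{H}$, we define a seminorm $\|h\|_I=\langle h,\dot\ell^*\dot\ell[h]\rangle^{1/2}$. By Lemma B.4, $\|h\|_I=0$ implies that $h=0$, such that $\|\cdot\|_I$ is a norm in $\mathcal{H}$. Clearly, $\|h\|_I\le c_2\langle h,h\rangle^{1/2}$ for some constant $c_2$. By the bounded inverse theorem in Banach spaces, we have $\langle h,h\rangle^{1/2}\le c_3\|h\|_I$ for some constant $c_3$. We conclude that

$$-\xi''(0;\Lambda_0)=\bigl\|(\hat\theta_n-\theta_0,\hat h_\Lambda,\mathcal{B}_n-\mathcal{B}_0)\bigr\|_I^2\ge c_3^{-2}\Bigl(\|\hat\theta_n-\theta_0\|^2+\|\hat h_\Lambda\|^2+\sum_{g=2}^{G}\|\psi_{ng}-\psi_{0g}\|^2\Bigr).$$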

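By Donsker properties of the class of $\nu(\theta,\Lambda,\mathcal{B};t)$ and the mean-value theorem,

$$\|\hat h_\Lambda\|=O_p\bigl(\|\hat\theta_n-\theta_0\|+\|\hat\Lambda_n-\Lambda_0\|+\|\mathcal{B}_n-\mathcal{B}_0\|_2+n^{-1/2}\bigr).$$

In addition, a linear expansion argument shows that the third term of (8) is of order up to $\|\tilde{\mathcal{B}}_n-\mathcal{B}_0\|^2$. Combining the above results, we have

$$\begin{aligned}\ell(\mathcal{B}_n)&\le D_nn^{-1/2}+E_n\bigl(\|\mathcal{B}_n-\tilde{\mathcal{B}}_n\|_3^3+\|\tilde{\mathcal{B}}_n-\mathcal{B}_0\|^2\bigr)-c_3^{-2}\sum_{g=2}^{G}\|\psi_{ng}-\tilde\psi_{ng}\|^2\\&\le D_nn^{-1/2}+c_4E_n\bigl(m_n^{-1}\epsilon_n^3+m_n^{-6}\bigr)-c_3^{-2}\sum_{g=2}^{G}\|\psi_{ng}-\tilde\psi_{ng}\|^2\end{aligned}$$

for some sequences of positive variables $D_n$ and $E_n$ such that $\sup_{\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}}D_n=o_p(1)$ and $\sup_{\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}}E_n=O_p(1)$ and some positive constant $c_4$. The second inequality holds because by Theorem 5.2 of de Boor [2],

$$\|\psi_{ng}-\tilde\psi_{ng}\|_3^3=O\Bigl(m_n^{-1}\sum_{s=1}^{m_n}|a_{gs}-\tilde a_{gs}|^3\Bigr)=O\bigl(m_n^{-1}\epsilon_n^3\bigr).$$

Suppose that $\mathcal{B}_n\in\partial\mathcal{N}_{\epsilon_n}$. By the same theorem of de Boor [2], $\|\psi_{ng}-\tilde\psi_{ng}\|^2\ge c_5m_n^{-1}\epsilon_n^2$ for some $g$ and $c_5>0$. Therefore, by choosing $\epsilon_n$ such that $\epsilon_n=o(n^{-1/4}m_n^{1/2})$ and

$$\epsilon_n^2\gg\sup_{\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}}D_n\,n^{-1/2}m_n+m_n^{-5},$$

we have $P(\ell(\mathcal{B}_n)<0)\to1$; the existence of such an $\epsilon_n$ with $\epsilon_n=o(m_n^{-3/2})$ is guaranteed under condition (C4). We conclude that there exists a local maximum of $\mathbb{P}_n\ell(\hat\theta_n,\hat\Lambda_n,\mathcal{B}_n)$ with respect to $\mathcal{B}_n$ in the interior of $\mathcal{N}_{\epsilon_n}$; let $\hat{\mathcal{B}}_n$ be the maximizer. Note that by Theorem 5.2 of de Boor [2], $\|\psi_{ng}-\tilde\psi_{ng}\|^2=O\bigl(m_n^{-1}\sum_{s=1}^{m_n}(a_{gs}-\tilde a_{gs})^2\bigr)=O(m_n^{-1}\epsilon_n^2)$ for all $\mathcal{B}_n\in\mathcal{N}_{\epsilon_n}$. We have

$$\|\hat\theta_n-\theta_0\|^2+\|\hat\Lambda_n-\Lambda_0\|^2+\|\hat{\mathcal{B}}_n-\mathcal{B}_0\|^2=O_p\bigl(n^{-1}+\|\hat{\mathcal{B}}_n-\mathcal{B}_0\|^2\bigr)=O_p\bigl(m_n^{-1}\epsilon_n^2+m_n^{-6}\bigr)=o_p\bigl(n^{-1/2}\bigr).$$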

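Proof of Theorem 4.2. Let $\dot\ell_\theta$ be the score statistic for $\theta$, $\dot\ell_\Lambda[h_\Lambda]$ be the score statistic for $\Lambda$ along the submodel $\Lambda+\epsilon\int h_\Lambda\,d\Lambda$, and $\dot\ell_{\psi_g}[h_{\psi_g}]$ be the score statistic for $\psi_g$ along the submodel $\psi_g+\epsilon h_{\psi_g}$ ($g=2,\ldots,G$). For a set of functions $h\equiv(h_1,\ldots,h_d)$, let $\dot\ell_\Lambda[h]=(\dot\ell_\Lambda[h_1],\ldots,\dot\ell_\Lambda[h_d])^{\mathrm T}$ and $\dot\ell_{\psi_g}[h]=(\dot\ell_{\psi_g}[h_1],\ldots,\dot\ell_{\psi_g}[h_d])^{\mathrm T}$. Let $\tilde h_\Lambda$ and $\tilde h_{\psi_g}$ be the least favorable directions for the nonparametric functions, such that $(\tilde h_\Lambda,\tilde h_{\psi_2},\ldots,\tilde h_{\psi_G})=\arg\min_{h_\Lambda,h_{\psi_2},\ldots,h_{\psi_G}}\mathbb{P}\bigl\|\dot\ell_\theta-\dot\ell_\Lambda\bigl[\int h_\Lambda\,d\Lambda_0\bigr]-\sum_{g=2}^{G}\dot\ell_{\psi_g}[h_{\psi_g}]\bigr\|^2$, where the integration in the second term in the norm is carried out componentwise. The existence of $\tilde h_\Lambda$ and $\tilde h_{\psi_g}$ follows from the invertibility of the information operator, established in Step 5 of the proof of Theorem 4.1. In addition, from the expressions of $\dot\ell^*\dot\ell$ given in Step 5 of the proof of Theorem 4.1 and condition (C6), each component of $\tilde h_{\psi_g}$ is continuously differentiable up to the third order. Let $\tilde h_{n,\psi_g}$ be the (componentwise) projection of $\tilde h_{\psi_g}$ onto the sieve space, such that $\|\tilde h_{n,\psi_g}-\tilde h_{\psi_g}\|=O(m_n^{-3})$. By the definition of the sieve NPMLE, $\mathbb{P}_n\dot\ell_\theta(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)=0$, $\mathbb{P}_n\dot\ell_\Lambda(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)\bigl[\int\tilde h_\Lambda\,d\hat\Lambda_n\bigr]=0$, and $\mathbb{P}_n\dot\ell_{\psi_g}(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)[\tilde h_{n,\psi_g}]=0$. Note that

$$\begin{aligned}\mathbb{P}_n\dot\ell_{\psi_g}(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)[\tilde h_{\psi_g}]&=\mathbb{P}_n\dot\ell_{\psi_g}(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)[\tilde h_{n,\psi_g}]+\mathbb{P}\dot\ell_{\psi_g}(\theta_0,\Lambda_0,\mathcal{B}_0)[\tilde h_{\psi_g}-\tilde h_{n,\psi_g}]\\&\quad+(\mathbb{P}_n-\mathbb{P})\dot\ell_{\psi_g}(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)[\tilde h_{\psi_g}-\tilde h_{n,\psi_g}]\\&\quad+\mathbb{P}\bigl\{\dot\ell_{\psi_g}(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)[\tilde h_{\psi_g}-\tilde h_{n,\psi_g}]-\dot\ell_{\psi_g}(\theta_0,\Lambda_0,\mathcal{B}_0)[\tilde h_{\psi_g}-\tilde h_{n,\psi_g}]\bigr\}.\end{aligned}$$

The first two terms of the right-hand side above are zero. By Lemma B.1, the class of $\dot\ell_{\psi_g}(\theta,\Lambda,\mathcal{B})[h]$ is Donsker, so that the third term is $o_p(n^{-1/2})$. By the mean-value theorem, Theorem 4.1, and condition (C4), the fourth term is $o_p(n^{-1/2})$. Obviously, $\mathbb{P}\dot\ell_\theta(\theta_0,\Lambda_0,\mathcal{B}_0)=0$, $\mathbb{P}\dot\ell_\Lambda(\theta_0,\Lambda_0,\mathcal{B}_0)\bigl[\int\tilde h_\Lambda\,d\Lambda_0\bigr]=0$, and $\mathbb{P}\dot\ell_{\psi_g}(\theta_0,\Lambda_0,\mathcal{B}_0)[\tilde h_{\psi_g}]=0$. We have

$$\begin{aligned}&n^{1/2}(\mathbb{P}_n-\mathbb{P})\Bigl\{\dot\ell_\theta(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)-\dot\ell_\Lambda(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)\Bigl[\int\tilde h_\Lambda\,d\hat\Lambda_n\Bigr]-\sum_{g=2}^{G}\dot\ell_{\psi_g}(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)[\tilde h_{\psi_g}]\Bigr\}\\&\quad=-n^{1/2}\,\mathbb{P}\Bigl\{\dot\ell_\theta(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)-\dot\ell_\Lambda(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)\Bigl[\int\tilde h_\Lambda\,d\hat\Lambda_n\Bigr]-\sum_{g=2}^{G}\dot\ell_{\psi_g}(\hat\theta_n,\hat\Lambda_n,\hat{\mathcal{B}}_n)[\tilde h_{\psi_g}]\\&\qquad\qquad-\dot\ell_\theta(\theta_0,\Lambda_0,\mathcal{B}_0)+\dot\ell_\Lambda(\theta_0,\Lambda_0,\mathcal{B}_0)\Bigl[\int\tilde h_\Lambda\,d\Lambda_0\Bigr]+\sum_{g=2}^{G}\dot\ell_{\psi_g}(\theta_0,\Lambda_0,\mathcal{B}_0)[\tilde h_{\psi_g}]\Bigr\}+o_p(1).\end{aligned}\qquad(9)$$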

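By Lemma B.1, the class

$$\Bigl\{\dot\ell_\theta(\theta,\Lambda,\mathcal{B})^{\mathrm T}v-\dot\ell_\Lambda(\theta,\Lambda,\mathcal{B})[H_\Lambda]-\sum_{g=2}^{G}\dot\ell_{\psi_g}(\theta,\Lambda,\mathcal{B})[\tilde h_{\psi_g}]:\ (\theta,\Lambda,\mathcal{B})\in\Xi,\ \|v\|\le1,\ \|H_\Lambda\|_V\le1\Bigr\}$$

is Donsker. Therefore, the left-hand side of (9) is equal to

$$n^{1/2}(\mathbb{P}_n-\mathbb{P})\Bigl\{\dot\ell_\theta(\theta_0,\Lambda_0,\mathcal{B}_0)-\dot\ell_\Lambda(\theta_0,\Lambda_0,\mathcal{B}_0)\Bigl[\int\tilde h_\Lambda\,d\Lambda_0\Bigr]-\sum_{g=2}^{G}\dot\ell_{\psi_g}(\theta_0,\Lambda_0,\mathcal{B}_0)[\tilde h_{\psi_g}]\Bigr\}+o_p(1),$$

which converges in distribution to $N(0,\tilde I)$, where

$$\tilde I\equiv\mathbb{P}\Bigl\{\dot\ell_\theta(\theta_0,\Lambda_0,\mathcal{B}_0)-\dot\ell_\Lambda(\theta_0,\Lambda_0,\mathcal{B}_0)\Bigl[\int\tilde h_\Lambda\,d\Lambda_0\Bigr]-\sum_{g=2}^{G}\dot\ell_{\psi_g}(\theta_0,\Lambda_0,\mathcal{B}_0)[\tilde h_{\psi_g}]\Bigr\}^{\otimes2}$$

is the efficient information matrix for $\theta$. By the Taylor series expansion, Theorem 4.1, and the definition of $\tilde h_\Lambda$ and $(\tilde h_{\psi_g})_{g=2,\ldots,G}$, the right-hand side of (9) is

$$-n^{1/2}(\hat\theta_n-\theta_0)^{\mathrm T}\,\mathbb{P}\Bigl\{\ddot\ell_{\theta\theta}-\ddot\ell_{\Lambda\theta}\Bigl[\int\tilde h_\Lambda\,d\Lambda_0\Bigr]-\sum_{g=2}^{G}\ddot\ell_{\psi_g\theta}[\tilde h_{\psi_g}]\Bigr\}+o_p(1)=n^{1/2}\,\tilde I(\hat\theta_n-\theta_0)+o_p(1),$$

where $\ddot\ell_{\theta\theta}$, $\ddot\ell_{\Lambda\theta}$, and $\ddot\ell_{\psi_g\theta}$ are the derivatives of $\dot\ell_\theta$, $\dot\ell_\Lambda$, and $\dot\ell_{\psi_g}$ with respect to $\theta$, respectively. As established in Step 5 in the proof of Theorem 4.1, the information operator is invertible, so the efficient information matrix is invertible. We conclude that $n^{1/2}(\hat\theta_n-\theta_0)\to_dN(0,\tilde I^{-1})$. Because $\hat\theta_n$ is an asymptotically linear estimator with the influence function lying in the space spanned by the score functions, $\hat\theta_n$ is asymptotically efficient [1]. □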

APPENDIX B: USEFUL LEMMAS

In this appendix, we present four lemmas that are useful for the proofs of Theorems 4.1 and 4.2. The proofs of the lemmas are given in Section S3 of the supplementary materials [21].

Lemma B.1. For any finite K, the classes of functions

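$$\begin{aligned}\mathcal{G}_1&=\bigl\{\log\Psi(\theta,\Lambda,\mathcal{B}):(\theta,\Lambda,\mathcal{B})\in\Xi_K\bigr\},\\\mathcal{G}_2&=\Bigl\{\frac{\dot\Psi_\theta(\theta,\Lambda,\mathcal{B})^{\mathrm T}v}{\Psi(\theta,\Lambda,\mathcal{B})}:(\theta,\Lambda,\mathcal{B})\in\Xi_K,\ \|v\|<K\Bigr\},\\\mathcal{G}_3&=\Bigl\{\frac{\dot\Psi_\Lambda(\theta,\Lambda,\mathcal{B})[H_\Lambda]}{\Psi(\theta,\Lambda,\mathcal{B})}:(\theta,\Lambda,\mathcal{B})\in\Xi_K,\ \|H_\Lambda\|_V<K\Bigr\},\\\mathcal{G}_{4g}&=\Bigl\{\frac{\dot\Psi_{\psi_g}(\theta,\Lambda,\mathcal{B})[h_{\psi_g}]}{\Psi(\theta,\Lambda,\mathcal{B})}:(\theta,\Lambda,\mathcal{B})\in\Xi_K,\ \|h_{\psi_g}\|_V<K\Bigr\}\end{aligned}$$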

are Donsker.

Lemma B.2. Under conditions (C1)–(C3) and (C5), the latent-class model given by (1)–(3) is locally identifiable.

Lemma B.3. Consider the following normal mixture model. Let $W$ be a set of covariates and $C$ be a latent class indicator with distribution specified by (1). For $g=1,\ldots,G$, let $Y_g\sim N(\mu_g,\Sigma_g)$, where $(\mu_1,\ldots,\mu_G)$ are vectors of mean parameters, and $(\Sigma_1,\ldots,\Sigma_G)$ are covariance matrices. The observed outcome variable is $Y=\sum_{g=1}^{G}I(C=g)\,Y_g$. Let $(\mu_{0g},\Sigma_{0g})$ be the true values of $(\mu_g,\Sigma_g)$. If $(\mu_{01},\Sigma_{01}),\ldots,(\mu_{0G},\Sigma_{0G})$ are distinct and the components of $W$ are linearly independent, then the score statistic along any submodel is nonzero.

Lemma B.4. Under conditions (C1)–(C3) and (C5), the score statistic along any one-dimensional submodel for the latent-class model given by (1)–(3) is nonzero.

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Semiparametric latent-class models for multivariate longitudinal and survival data”. We present additional regularity conditions, the proofs of Theorem 4.3 and Lemmas B.1–B.4, additional simulation results, and additional real data analysis results.

REFERENCES

  • [1] BICKEL PJ, KLAASSEN CA, RITOV Y and WELLNER JA (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.
  • [2] DE BOOR C (1976). Splines as linear combinations of B-splines. A survey. In Approximation Theory II (Lorentz GG, Chui CK and Schumaker LL, eds.) 1–47. Academic Press, New York.
  • [3] DEMPSTER AP, LAIRD NM and RUBIN DB (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. Stat. Methodol. 39 1–22. 10.1111/j.2517-6161.1977.tb01600.x
  • [4] HENDERSON R, DIGGLE P and DOBSON A (2000). Joint modelling of longitudinal measurements and event time data. Biostatistics 1 465–480. 10.1093/biostatistics/1.4.465
  • [5] THE ARIC INVESTIGATORS (1989). The atherosclerosis risk in communities (ARIC) study: Design and objectives. American Journal of Epidemiology 129 687–702. 10.1093/oxfordjournals.aje.a115184
  • [6] LIN H, TURNBULL BW, MCCULLOCH CE and SLATE EH (2002). Latent class models for joint analysis of longitudinal biomarker and event process data: application to longitudinal prostate-specific antigen readings and prostate cancer. J. Amer. Statist. Assoc. 97 53–65. 10.1198/016214502753479220
  • [7] LIU Y, LIU L and ZHOU J (2015). Joint latent class model of survival and longitudinal data: an application to CPCRA study. Comput. Statist. Data Anal. 91 40–50. 10.1016/j.csda.2015.05.007
  • [8] LIU L, MA JZ and O’QUIGLEY J (2008). Joint analysis of multi-level repeated measures data and survival: an application to the end stage renal disease (ESRD) data. Stat. Med. 27 5679–5691. 10.1002/sim.3392
  • [9] LIU Q and PIERCE DA (1994). A note on Gauss-Hermite quadrature. Biometrika 81 624–629. 10.2307/2337136
  • [10] LIU Y, LIN Y, ZHOU J and LIU L (2020). A semi-parametric joint latent class model with longitudinal and survival data. Stat. Interface 13 411–422. 10.4310/SII.2020.v13.n3.a10
  • [11] LOUIS TA (1982). Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B. Stat. Methodol. 44 226–233. 10.1111/j.2517-6161.1982.tb01203.x
  • [12] MA Y and WANG Y (2012). Efficient distribution estimation for data with unobserved sub-population identifiers. Electron. J. Stat. 6 710–737. 10.1214/12-EJS690
  • [13] MURPHY SA (1994). Consistency in a proportional hazards model incorporating a random effect. Ann. Statist. 22 712–731. 10.1214/aos/1176325492
  • [14] PROUST-LIMA C, SÉNE M, TAYLOR JM and JACQMIN-GADDA H (2014). Joint latent class models for longitudinal and time-to-event data: a review. Stat. Methods Med. Res. 23 74–90. 10.1177/0962280212445839
  • [15] SCHUMAKER LL (2007). Spline Functions: Basic Theory (3rd ed.). Wiley.
  • [16] SCHWARZ G (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. 10.1214/aos/1176344136
  • [17] SHEN R, OLSHEN AB and LADANYI M (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25 2906–2912. 10.1093/bioinformatics/btp543
  • [18] TSIATIS AA and DAVIDIAN M (2004). Joint modeling of longitudinal and time-to-event data: an overview. Statist. Sinica 14 809–834.
  • [19] VARADHAN R and ROLAND C (2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand. J. Statist. 35 335–353. 10.1111/j.1467-9469.2007.00585.x
  • [20] WANG Y, GARCIA TP and MA Y (2012). Nonparametric estimation for censored mixture data with application to the Cooperative Huntington’s Observational Research Trial. J. Amer. Statist. Assoc. 107 1324–1338. 10.1080/01621459.2012.699353
  • [21] WONG KY, ZENG D and LIN DY (2021). Supplement to “Semiparametric latent-class models for multivariate longitudinal and survival data”.
  • [22] XU C, BAINES PD and WANG J-L (2014). Standard error estimation using the EM algorithm for the joint modeling of survival and longitudinal data. Biostatistics 15 731–744. 10.1093/biostatistics/kxu015
  • [23] XU J and ZEGER SL (2001). Joint analysis of longitudinal data comprising repeated measures and times to events. J. R. Stat. Soc. Ser. C. Appl. Stat. 50 375–387. 10.1111/1467-9876.00241
  • [24].ZENG D and CAI J (2005). Asymptotic results for maximum likelihood estimators in joint analysis of repeated measurements and survival time. Ann. Statist 33 2132–2163. 10.1214/009053605000000480 [DOI] [Google Scholar]
  • [25].ZENG D and LIN DY (2007). Maximum likelihood estimation in semiparametric regression models with censored data. J. R. Stat. Soc. Ser. B. Stat. Methodol 69 507–564. 10.1111/j.1369-7412.2007.00606.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].ZENG D, LIN DY and LIN X (2008). Semiparametric transformation models with random effects for clustered failure time data. Statist. Sinica 18 355–377. [PMC free article] [PubMed] [Google Scholar]
  • [27].ZENG D and LIN DY (2010). A general asymptotic theory for maximum likelihood estimation in semiparametric regression models with censored data. Statist. Sinica 20 871–910. [PMC free article] [PubMed] [Google Scholar]
  • [28].ZENG D, MAO L and LIN DY (2016). Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 103 253–271. 10.1093/biomet/asw013 [DOI] [PMC free article] [PubMed] [Google Scholar]
