Biostatistics (Oxford, England). 2017 Sep 23;19(3):374–390. doi: 10.1093/biostatistics/kxx047

A joint model for mixed and truncated longitudinal data and survival data, with application to HIV vaccine studies

Tingting Yu 1, Lang Wu 1, Peter B Gilbert 2,3
PMCID: PMC6193623  PMID: 29028943

SUMMARY

In HIV vaccine studies, a major research objective is to identify immune response biomarkers measured longitudinally that may be associated with risk of HIV infection. This objective can be assessed via joint modeling of longitudinal and survival data. Joint models for HIV vaccine data are complicated by the following issues: (i) left truncations of some longitudinal data due to lower limits of quantification; (ii) mixed types of longitudinal variables; (iii) measurement errors and missing values in longitudinal measurements; (iv) computational challenges associated with likelihood inference. In this article, we propose a joint model of complex longitudinal and survival data and a computationally efficient method for approximate likelihood inference to address the foregoing issues simultaneously. In particular, our model does not make unverifiable distributional assumptions for truncated values, which is different from methods commonly used in the literature. The parameters are estimated based on the h-likelihood method, which is computationally efficient and offers approximate likelihood inference. Moreover, we propose a new approach to estimate the standard errors of the h-likelihood based parameter estimates by using an adaptive Gauss–Hermite method. Simulation studies show that our methods perform well and are computationally efficient. A comprehensive data analysis is also presented.

Keywords: Adaptive Gauss–Hermite, Cox model, h-Likelihood, Lower limit of quantification, Mixed-effect model, Shared-parameter model

1. Introduction

In preventive HIV vaccine efficacy trials, participants are randomized to receive a series of vaccinations or placebos and are followed until the day of being diagnosed with HIV infection or until the end of study follow-up. We are often interested in the times to HIV infection. Meanwhile, blood samples are repeatedly collected over time for each participant in order to measure immune responses induced by the vaccine, such as CD4 T cell responses. A major research interest in HIV vaccine studies is to identify potential immune response biomarkers for HIV infection. For example, for many infectious diseases antibodies induced by a vaccine can recognize and kill a pathogen before it establishes infection; therefore high antibody levels are often associated with a lower risk of pathogen infection. Since longitudinal trajectories of some immune responses are often associated with the risk of HIV infection, in statistical analysis it is useful to jointly model the longitudinal and survival data. Moreover, such joint models can be used to address measurement errors and non-ignorable missing data in the longitudinal data.

There has been active research on joint models of longitudinal and survival data in recent years. Lawrence Gould and others (2015) have given a comprehensive review in this field. Rizopoulos and others (2009) proposed a computational approach based on the Laplace approximation for joint models of continuous longitudinal response and time-to-event outcome. Bernhardt and others (2014) discussed a multiple imputation method for handling left-truncated longitudinal variables used as covariates in AFT survival models. Król and others (2016) considered joint models of a left-truncated longitudinal variable, recurrent events, and a terminal event. The truncated values of the longitudinal variable were assumed to follow the same normal distributions as the untruncated values. Other recent work includes Fu and Gilbert (2017), Barrett and others (2015), Elashoff and others (2015), Chen and others (2014), Taylor and others (2013), Rizopoulos (2012b), and Zhu and others (2012). Analysis of HIV vaccine trial data offers the following new challenges: (i) some longitudinal data may be left truncated by a lower limit of quantification (LLOQ) of the biomarker assay, and the common approach of assuming that truncated values follow parametric distributions is unverifiable and may be unreasonable for vaccine trial data; (ii) the longitudinal multivariate biomarker response data are intercorrelated and may be of mixed types such as binary and continuous; (iii) the longitudinal data may exhibit periodic patterns over time, due to repeated administrations of the HIV vaccine; (iv) some longitudinal biomarkers may have measurement errors and missing data; and (v) the computation associated with likelihood inference can be very intensive and challenging. A comprehensive statistical analysis of HIV vaccine trial data requires us to address all the foregoing issues simultaneously. Therefore, despite extensive literature on joint models, new statistical models and methods are in demand.

In this article, we propose innovative models and methods to address the above issues. The contributions of the paper are: (i) when longitudinal data are left truncated, we propose a new method that does not assume any parametric distributions for the truncated values, which is different from existing approaches in the literature that unrealistically assume truncated values to follow the same distributions as those for the observed values; (ii) for longitudinal data with left truncation, the observed values are assumed to follow a truncated normal distribution; (iii) we incorporate the associations among several longitudinal responses of mixed types by linking the longitudinal models with shared and correlated random effects (Rizopoulos, 2012b); and (iv) we address the computational challenges of likelihood inference by proposing a computationally very efficient approximate method based on the h-likelihood method (Lee and others, 2006). It is known that, when the baseline hazard in the Cox survival model is completely unspecified, the standard errors of the parameter estimates in the joint models may be underestimated (Rizopoulos, 2012a; Hsieh and others, 2006). To address this issue, we also propose a new approach to estimate the standard errors of parameter estimates.

The article is organized as follows. In Section 2, we introduce the HIV vaccine trial data that motivates the research. In Section 3, we describe the proposed models and methods, which address all of the issues discussed above simultaneously. Section 4 presents analysis of the HIV vaccine trial data. Section 5 shows simulation studies to evaluate the proposed models and methods. We conclude the article with some discussion in Section 6.

2. The HIV vaccine data

Our research is motivated by the VAX004 trial, a 36-month efficacy study of a candidate vaccine to prevent HIV-1 infection. The vaccine contained two recombinant gp120 Envelope proteins from the MN and GNE8 HIV-1 strains (Flynn and others, 2005). One of the main objectives in the trial was to assess immune response biomarkers measured in vaccine recipients for their association with the incidence of HIV infection. It was addressed using plasma samples collected at the immunization visits, 2 weeks after the immunization visits, and the final visit (i.e. months 0, 0.5, 1, 1.5, 6, 6.5, …, 30, 30.5, 36) to measure several immune response variables in vaccine recipients. Eight immune response variables were measured in total, most of which are highly correlated with each other and some of which have up to 16% missing data. We focus on a subset of these variables that may be representative and have low rates of missing data. In particular, we use the NAb and the MNGNE8 variables, where NAb is the titer of neutralizing antibodies to the MN strain of the HIV-1 gp120 Env protein and MNGNE8 is the average level of binding antibodies (measured by ELISA) to the MN and GNE8 HIV-1 gp120 Env proteins (Gilbert and others, 2005). Due to space limitations, the details of the clinical research questions, participant selection procedure, and other immune response variables are described in Section 1 in the Supplementary material available at Biostatistics online.

The data set we consider has 194 participants in total, among whom 21 participants acquired HIV infection during the trial, with time of infection diagnosis ranging from day 43 to day 954 and an event rate of 10.8%. The average number of repeated measurements over time is 12.6 per participant. Moreover, NAb has an LLOQ of 1.477, and about 27% of NAb measurements are below the LLOQ (left-truncated). To minimize the potential for bias due to missing data on the immune response biomarkers, the models adjust for the dominant baseline prognostic factor for HIV-1 infection, namely the baseline behavioral risk score, which is grouped into three categories: 0, 1, or 2 for risk score 0, 1–3, 4–7, respectively.

Figure 1 shows the longitudinal trajectories of the immune responses of a few randomly selected participants, where the left truncated values in NAb are substituted by the LLOQ. The value of an immune response typically increases sharply right after each vaccination, and then starts to decrease about 2 weeks after the vaccination. Such patterns are shown as the reverse sawtooth waves in Figure 1. We see that participants HIV-infected later on seem to have lower values of MNGNE8 and NAb than those uninfected by the end of the study. In particular, for MNGNE8, there seem to be clear differences between HIV-infected and uninfected participants, separated by the median value. The figures show that the longitudinal patterns of some immune responses seem to be associated with HIV infection, motivating inference via joint models of the longitudinal and survival data. In addition, some immune responses are highly correlated over time and are of mixed types, so we should also incorporate the associations among different types of longitudinal variables. Moreover, due to substantial variations across subjects, mixed effects models may be useful. The random effects in mixed effects models can serve several purposes: (i) they represent individual variations or individual-specific characteristics of the participants; (ii) they incorporate the correlation among longitudinal measurements for each participant; and (iii) they may be viewed as summaries of the individual profiles. Therefore, mixed effects models seem to be a reasonable choice for modeling the HIV vaccine trial data.

Fig. 1.


Longitudinal trajectories of two immune response variables for a few randomly selected VAX004 vaccine recipients, where the solid lines represent pre-infection trajectories of participants who acquired HIV infection and the dashed lines represent trajectories for participants who never acquired HIV infection. The left truncated values in NAb are substituted by the LLOQ of 1.477.

More results of the exploratory data analysis are given in Sections 2 and 3 in the Supplementary material available at Biostatistics online, including the summary statistics of the immune responses, Kaplan–Meier plot of the time to HIV infection, and longitudinal trajectories of NAb and MNGNE8 with the time variable shifted and aligned at the event times.

3. Joint models and inference

3.1. The longitudinal, truncation, and survival models

3.1.1. Models for longitudinal data of mixed types.

In the following, we denote by Inline graphic a random variable, Inline graphic its observed value, Inline graphic a generic density function, with similar notation for other variables. For simplicity of presentation, we consider two correlated longitudinal variables, Inline graphic and Inline graphic, where Inline graphic is continuous and subject to left truncation due to LLOQ, and Inline graphic is binary or count (e.g. dichotomized variable or number of CD4 T cells). The models can be easily extended to more than two longitudinal processes. Let Inline graphic be the LLOQ of Inline graphic and Inline graphic be the truncation indicator of Inline graphic such that Inline graphic if Inline graphic and Inline graphic otherwise.

For the continuous longitudinal variable Inline graphic, after possibly some transformations such as a log-transformation, we may assume that the untruncated data of Inline graphic follow a truncated normal distribution. We consider a linear or nonlinear mixed effects (LME or NLME) model for the observed values of Inline graphic given that Inline graphic, that is,

graphic file with name M19.gif (3.1)

where Inline graphic is the longitudinal variable of participant Inline graphic at time Inline graphic, Inline graphic and Inline graphic are vectors of covariates, Inline graphic contains fixed parameters, Inline graphic contains random effects, Inline graphic is a known nonlinear function, Inline graphic and Inline graphic are covariance matrices, and Inline graphic are random errors independent of Inline graphic. An LME model is usually an empirical model, while an NLME model is a mechanistic model of the kind widely used in HIV viral dynamics (Wu, 2009). We assume that Inline graphic, i.e. the within-individual repeated measurements are independent conditional on the random effects.

In model (3.1), the observed Inline graphic’s given the random effects and the condition “Inline graphic” (or Inline graphic) are assumed to be normally distributed, so it is reasonable to assume the Inline graphic follow a truncated normal distribution (Mehrotra and others, 2000). For the truncated Inline graphic values (i.e. Inline graphic), any parametric distributional assumptions are unverifiable, although most existing literature makes such assumptions for convenience of likelihood inference. Moreover, the truncated values are unlikely to follow normal distributions in most cases, since the Inline graphic values must at least be positive while a normal random variable can take any real value. Thus, it is more reasonable to assume the truncated normal distribution for the observed Inline graphic values and leave the distribution of the truncated Inline graphic values completely unspecified. The density function of “Inline graphic” is given by (Mehrotra and others, 2000)

graphic file with name M43.gif (3.2)

where Inline graphic, Inline graphic is the probability density function of the standard normal distribution Inline graphic and Inline graphic is the corresponding cumulative distribution function.
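As a rough illustration of a left-truncated normal density of the kind used in (3.2), the following Python sketch evaluates the density of a normal variable conditioned on lying at or above the LLOQ. The function names and the values of `mu` and `sigma` are hypothetical and chosen only for illustration; only the LLOQ of 1.477 is taken from the text, and the exact parameterization of (3.2) in the paper may differ.

```python
import math


def std_normal_pdf(x: float) -> float:
    """Standard normal probability density function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def left_truncated_normal_pdf(y: float, mu: float, sigma: float, d: float) -> float:
    """Density of N(mu, sigma^2) conditioned on y >= d; zero below d."""
    if y < d:
        return 0.0
    upper_tail = 1.0 - std_normal_cdf((d - mu) / sigma)
    return std_normal_pdf((y - mu) / sigma) / (sigma * upper_tail)


# Hypothetical mean and standard deviation; only the LLOQ comes from the text.
mu, sigma, lloq = 2.0, 0.6, 1.477

# Midpoint Riemann sum over [lloq, lloq + 10] to check that the density
# integrates to (approximately) one on the observable range.
step = 0.001
total = sum(
    left_truncated_normal_pdf(lloq + (k + 0.5) * step, mu, sigma, lloq) * step
    for k in range(10000)
)
print(f"integral over the observable range: {total:.6f}")
```

The renormalization by the upper-tail probability is what distinguishes this density from the untruncated normal density; nothing is assumed about the values below the LLOQ.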

For the discrete longitudinal variable Inline graphic, we consider the following generalized linear mixed effects model (GLMM)

graphic file with name M49.gif (3.3)

where Inline graphic is the longitudinal variable of participant Inline graphic at time Inline graphic, Inline graphic is a known link function, Inline graphic and Inline graphic are vectors of covariates, Inline graphic are fixed parameters, Inline graphic is a vector of random effects with Inline graphic, and Inline graphic is assumed to follow a distribution in the exponential family.
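To make the structure of a GLMM such as (3.3) concrete, the sketch below computes the conditional log-likelihood of binary responses given the random effects, assuming a logit link. The argument names (`alpha` for fixed parameters, `b` for random effects, `x_rows` and `v_rows` for covariate vectors) are illustrative assumptions and need not match the notation of (3.3).

```python
import math


def expit(eta: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def bernoulli_glmm_loglik(z, x_rows, alpha, v_rows, b):
    """Conditional log-likelihood of binary responses z given random effects b.

    Each linear predictor is x_ij' alpha + v_ij' b, mapped through the
    inverse logit link (an assumed choice of link function).
    """
    total = 0.0
    for z_ij, x_ij, v_ij in zip(z, x_rows, v_rows):
        eta = sum(xk * ak for xk, ak in zip(x_ij, alpha))
        eta += sum(vk * bk for vk, bk in zip(v_ij, b))
        p = expit(eta)
        total += z_ij * math.log(p) + (1 - z_ij) * math.log(1.0 - p)
    return total


# Hypothetical data for one participant with three visits.
z_demo = [1, 0, 1]
x_demo = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # intercept and time
v_demo = [[1.0], [1.0], [1.0]]                  # random intercept only
print(bernoulli_glmm_loglik(z_demo, x_demo, [0.2, -0.1], v_demo, [0.5]))
```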

The longitudinal data may contain intermittent missing data and dropouts. We assume that the intermittent missing data and dropouts are missing at random. The fact that the missing data are biomarkers measuring immune responses to the vaccine (and not variables such as toxicity that could obviously be related to missed visits or dropout), and that the vaccine has a large safety data base showing it is not toxic, makes this assumption plausible.

3.1.2. A new approach for truncated longitudinal data.

When the Inline graphic values are truncated, a common approach in the literature is to assume that the truncated values continue to follow the normal distribution assumed for the observed values (Hughes, 1999; Wu, 2002). However, such an assumption is unverifiable and may be unreasonable in some cases, as noted earlier. In particular, when the truncation rate is high, the normality assumption is even less reasonable as the truncation rate can be much larger than the left-tail probability of the normal distribution for the observed data. For example, Figure 2 displays histograms of NAb for two participants, where the left truncated data are substituted by the LLOQ of 1.477. The truncation rates, 27% for participant 1 and 33% for participant 2, seem much larger than the left-tail probabilities of the assumed distributions for the observed data.

Fig. 2.


Histograms of NAb of two VAX004 vaccine recipients, where the left-truncated data are substituted by the LLOQ of 1.477.
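The comparison between a truncation rate and a normal left-tail probability can be illustrated numerically. In the Python sketch below, the LLOQ (1.477) and the truncation rate of participant 1 (27%) come from the text, while the normal mean and standard deviation are hypothetical values standing in for a distribution fitted to the observed data.

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


lloq = 1.477            # LLOQ of NAb, from the text
observed_rate = 0.27    # truncation rate of participant 1, from the text
mu, sigma = 2.5, 0.5    # hypothetical normal distribution for the observed values

# Probability mass that the assumed normal distribution places below the LLOQ.
implied_rate = normal_cdf(lloq, mu, sigma)
print(f"implied left-tail probability: {implied_rate:.3f}")
print(f"observed truncation rate:      {observed_rate:.2f}")
```

Under these assumed values the normal distribution places far less mass below the LLOQ than the observed truncation rate, which is the kind of mismatch the paragraph above describes.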

Here we propose a different approach: we do not assume any parametric distributions for the truncated values, but instead we conceptually view the truncated values as a point mass or cluster of unobserved values below the LLOQ without any distributional assumption. Note that, although the truncation status Inline graphic can be determined by the Inline graphic values, in HIV vaccine studies, many biomarkers are measured infrequently over time, due to both budget and practical considerations, while some other variables can be measured more frequently. For this reason, when the Inline graphic values are not measured, we can roughly predict the truncation status of the Inline graphic value based on other measured variables that are associated with Inline graphic, including time. It is important to predict the truncation status of Inline graphic values, since left-truncated Inline graphic values have important implications (e.g. a positive immune response may be needed for protection by vaccination). A model for the truncation indicator Inline graphic can help make reasonable predictions of Inline graphic when such predictions are needed. Therefore, we assume the following model for the truncation indicator Inline graphic:

graphic file with name M71.gif (3.4)

where Inline graphic and Inline graphic contain covariates, Inline graphic contains fixed parameters, and Inline graphic contains random effects. The contribution of the longitudinal data of Inline graphic for individual Inline graphic to the likelihood given the random effects is Inline graphic where Inline graphic is given by (3.2) and Inline graphic.
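A minimal sketch of how such a likelihood contribution might be evaluated is given below. It assumes that each visit contributes the probability of its truncation status and, when the value is observed, the truncated normal density of that value; the truncation probabilities `p_trunc` are taken as given (for example, from a logistic model for the truncation indicator). All function and argument names are hypothetical.

```python
import math


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_normal_logpdf(y: float, mu: float, sigma: float, d: float) -> float:
    """Log-density of N(mu, sigma^2) conditioned on y >= d."""
    z = (y - mu) / sigma
    log_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    upper_tail = 1.0 - std_normal_cdf((d - mu) / sigma)
    return log_phi - math.log(sigma) - math.log(upper_tail)


def longitudinal_loglik_i(y, c, mu, sigma, d, p_trunc):
    """Log-likelihood contribution of one participant given the random effects.

    c[j] = 1 marks a value below the LLOQ d (y[j] is then ignored);
    p_trunc[j] is the assumed probability that c[j] = 1.
    """
    total = 0.0
    for y_ij, c_ij, mu_ij, p_ij in zip(y, c, mu, p_trunc):
        # Contribution of the truncation indicator.
        total += c_ij * math.log(p_ij) + (1 - c_ij) * math.log(1.0 - p_ij)
        # Contribution of the observed value; nothing is assumed below d.
        if c_ij == 0:
            total += truncated_normal_logpdf(y_ij, mu_ij, sigma, d)
    return total


# Hypothetical participant: second value is below the LLOQ of 1.477.
print(longitudinal_loglik_i([2.1, None, 2.6], [0, 1, 0],
                            [2.0, 1.8, 2.4], 0.5, 1.477, [0.2, 0.4, 0.1]))
```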

Another use of model (3.4) is modeling non-ignorable or informative missing data in the longitudinal Inline graphic data. When longitudinal data have both left-truncated data and non-ignorable missing data, we should consider two separate models similar to (3.4). Here we do not consider the issue of non-ignorable missing data, but the models and methods can be easily extended to handle missing data. In fact, left truncated data may be viewed as non-ignorable missing data.

3.1.3. Association between mixed types of longitudinal variables.

Different immune response variables are typically highly correlated and may be of different types, such as one being continuous and another one being binary. The exact structures of the associations among different longitudinal variables may be complicated. However, we can reasonably assume that the variables are associated through shared or correlated random effects from different models. This is a reasonable assumption, since the random effects represent individual deviations from population averages and can be interpreted as unobserved or latent individual characteristics, such as individual genetic information or health status, which govern different longitudinal processes. This can be seen from Figure 1 where different immune response variables within the same individual exhibit similar patterns over time, including the truncation process. Therefore, we assume that Inline graphic, where Inline graphic is an arbitrary covariance matrix. Note that we allow the random effects in the longitudinal models to be different since the longitudinal trajectories of different variables may exhibit different between-individual variations (as measured by random effects), especially for different types of longitudinal variables such as binary and continuous variables.
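As an illustration of correlated random effects, the sketch below draws pairs of random effects from a bivariate normal distribution with an arbitrary 2x2 covariance matrix, using a hand-written Cholesky factor. The covariance entries and names are hypothetical; the models in the text allow random-effects vectors of any dimension.

```python
import math
import random


def cholesky_2x2(s11: float, s12: float, s22: float):
    """Lower-triangular Cholesky factor of [[s11, s12], [s12, s22]]."""
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    return l11, l21, l22


def draw_random_effects(s11: float, s12: float, s22: float, rng: random.Random):
    """Draw one pair (a_i, b_i) from N(0, Sigma) via the Cholesky factor."""
    l11, l21, l22 = cholesky_2x2(s11, s12, s22)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return l11 * e1, l21 * e1 + l22 * e2


rng = random.Random(2017)
# Hypothetical covariance matrix: variances 1.0 and 2.0, covariance 0.6.
draws = [draw_random_effects(1.0, 0.6, 2.0, rng) for _ in range(20000)]
sample_cov = sum(a * b for a, b in draws) / len(draws)
print(f"sample covariance of the two random effects: {sample_cov:.3f}")
```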

3.1.4. A Cox model for time-to-event data.

The times to HIV infection may be related to the longitudinal patterns of the immune responses and left-truncated statuses. The specific nature of this dependence may be complicated. There are several possibilities: (i) the infection time may depend on the current immune response values at infection times; (ii) the infection time may depend on past immune response values; and (iii) the infection time may depend on summaries or key characteristics of the longitudinal or truncation trajectories. Here we consider case (iii) for the following reasons: (a) the random effects may be viewed as summaries of individual-specific longitudinal trajectories; (b) the immune response values may be truncated due to lower detection limits; and (c) this approach is also widely used in the joint model literature. Since the random effects in the longitudinal models may be interpreted as “summaries” or individual-specific characteristics of the longitudinal processes, we may use random effects from the longitudinal models as “covariates” in the survival model. Such an approach is commonly used in the literature and is often called “shared parameter models” (Wulfsohn and Tsiatis, 1997; Rizopoulos, 2012b).

Let Inline graphic be the time to HIV infection, Inline graphic be the right-censoring time, Inline graphic be the observed time, and Inline graphic be the event indicator. We assume the censoring is non-informative and consider a Cox model for the observed survival data Inline graphic,

graphic file with name M89.gif (3.5)

where Inline graphic is an unspecified baseline hazard function, Inline graphic contains baseline covariates of individual Inline graphic, and Inline graphic and Inline graphic are vectors of fixed parameters. In model (3.5), the parameters Inline graphic link the risk of HIV infection at time Inline graphic to the random effects in the longitudinal or truncation models, which allow us to check if individual-specific characteristics of the longitudinal immune responses are associated with the risk of HIV infection. We assume that the survival data and the longitudinal data are conditionally independent given the random effects.
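The role of the random effects as "covariates" in the survival model can be sketched as follows. The code assumes a proportional hazards form in which the linear predictor adds baseline covariates and random effects, and it uses the standard censored-data log-likelihood contribution: the event indicator times the log hazard at the observed time, minus the cumulative hazard. The names `gamma0`, `gamma1`, `w`, and `re` are illustrative and are not taken from (3.5).

```python
import math


def linear_predictor(w, gamma0, re, gamma1) -> float:
    """Baseline covariates w and random effects re enter the log hazard linearly."""
    return (sum(wk * gk for wk, gk in zip(w, gamma0))
            + sum(rk * gk for rk, gk in zip(re, gamma1)))


def survival_loglik_i(delta: int, log_base_hazard: float,
                      cum_base_hazard: float, eta: float) -> float:
    """Log-likelihood contribution of one participant given the random effects.

    delta is the event indicator; log_base_hazard and cum_base_hazard are the
    log baseline hazard and cumulative baseline hazard at the observed time.
    """
    return delta * (log_base_hazard + eta) - cum_base_hazard * math.exp(eta)


# Hypothetical participant: risk category indicators (0, 1), one random effect.
eta_demo = linear_predictor([0.0, 1.0], [0.3, 0.8], [-0.4], [-1.2])
print(survival_loglik_i(1, math.log(0.05), 0.4, eta_demo))
```

A negative coefficient on a random intercept of an immune response model would correspond to a lower hazard for participants whose response is above the population average.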

3.2. An approximate method for likelihood inference

We consider the likelihood method for parameter estimation and inference for the above models. Let Inline graphic be the collection of all mean parameters and Inline graphic be the collection of variance–covariance (dispersion) parameters. The (joint) likelihood for all the observed longitudinal data and time-to-infection data is given by

graphic file with name M99.gif

Since the dimension of the random effects Inline graphic is often high and some density functions can be highly complicated, evaluation of the above integral can be a major challenge. The common approach based on the Monte Carlo EM algorithm can run into difficulties such as very slow convergence or even non-convergence (Hughes, 1999). Numerical integration methods such as the Gauss–Hermite (GH) quadrature method can also be very tedious. Therefore, in the following we consider an approximate method based on the h-likelihood, which can be computationally much more efficient while maintaining reasonable accuracy (Lee and others, 2006; Ha and others, 2003; Molas and others, 2013). Its performance in the current context will be evaluated by simulations later.

Essentially, the h-likelihood method uses Laplace approximations to the intractable integral in the likelihood. A first-order Laplace approximation can be viewed as the GH quadrature method with one node. So a Laplace approximation can be less accurate than the GH quadrature method with more than one node. However, when the dimension of the integral is high, a Laplace approximation can be computationally much less intensive than the GH quadrature method whose computational intensity grows exponentially with the dimension of the integral. Moreover, it produces approximate MLEs for the mean parameters and approximate restricted maximum likelihood estimates (REMLs) for the variance–covariance (dispersion) parameters. For the models (3.1), (3.3)–(3.5) in the previous section, the log h-likelihood function is given by
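The relationship between the Laplace approximation and GH quadrature can be checked numerically in one dimension. The sketch below is a toy example, not the estimation procedure of this article: it integrates a hypothetical binomial-normal integrand over a scalar random effect, using a first-order Laplace approximation and an adaptive GH rule centered at the mode with one or three nodes. With a single node the two approximations coincide. All names and data values are assumptions made for illustration.

```python
import math

SQRT_PI = math.sqrt(math.pi)
# Gauss-Hermite rules for the weight function exp(-x^2): (nodes, weights).
GH_RULES = {
    1: ([0.0], [SQRT_PI]),
    3: ([-math.sqrt(1.5), 0.0, math.sqrt(1.5)],
        [SQRT_PI / 6.0, 2.0 * SQRT_PI / 3.0, SQRT_PI / 6.0]),
}


def second_derivative(h, b: float, eps: float = 1e-4) -> float:
    return (h(b + eps) - 2.0 * h(b) + h(b - eps)) / (eps * eps)


def find_mode(h, start: float = 0.0, iters: int = 50, eps: float = 1e-4) -> float:
    """Newton iterations with numerical derivatives (h assumed concave)."""
    b = start
    for _ in range(iters):
        grad = (h(b + eps) - h(b - eps)) / (2.0 * eps)
        step = grad / second_derivative(h, b, eps)
        b -= step
        if abs(step) < 1e-10:
            break
    return b


def laplace_integral(h) -> float:
    """First-order Laplace approximation to the integral of exp(h(b))."""
    b_hat = find_mode(h)
    curvature = -second_derivative(h, b_hat)
    return math.exp(h(b_hat)) * math.sqrt(2.0 * math.pi / curvature)


def adaptive_gh_integral(h, n_nodes: int) -> float:
    """Adaptive GH rule: nodes recentered at the mode, rescaled by the curvature."""
    b_hat = find_mode(h)
    scale = 1.0 / math.sqrt(-second_derivative(h, b_hat))
    nodes, weights = GH_RULES[n_nodes]
    return math.sqrt(2.0) * scale * sum(
        w * math.exp(x * x + h(b_hat + math.sqrt(2.0) * scale * x))
        for x, w in zip(nodes, weights)
    )


def log_integrand(b: float) -> float:
    """Hypothetical example: 3 successes in 10 trials, N(0, 1) random intercept."""
    eta = -0.5 + b
    log_p = -math.log1p(math.exp(-eta))
    log_q = -math.log1p(math.exp(eta))
    return 3 * log_p + 7 * log_q - 0.5 * b * b - 0.5 * math.log(2.0 * math.pi)


print("Laplace:      ", laplace_integral(log_integrand))
print("aGH, 1 node:  ", adaptive_gh_integral(log_integrand, 1))
print("aGH, 3 nodes: ", adaptive_gh_integral(log_integrand, 3))
```

In several dimensions a product GH rule with k nodes per random effect needs k to the power of the dimension function evaluations, whereas the Laplace approximation needs only the mode and the curvature there, which is the trade-off described above.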

graphic file with name M101.gif (3.6)

Based on Ha and others (2003) and Molas and others (2013), we propose the following estimation procedure via the h-likelihood. Beginning with some starting values Inline graphic, we iterate the steps below:

  • Step 1: At iteration Inline graphic, given Inline graphic, Inline graphic, Inline graphic, obtain updated estimates of the random effects Inline graphic by maximizing Inline graphic in (3.6) with respect to Inline graphic;

  • Step 2: Given Inline graphic, Inline graphic, Inline graphic, obtain updated estimates of the mean parameters Inline graphic by maximizing the following adjusted profile h-likelihood as in Lee and Nelder (1996) with respect to Inline graphic:
    graphic file with name M115.gif
  • Step 3: Given Inline graphic, Inline graphic, Inline graphic, obtain updated estimates of the variance–covariance parameters Inline graphic by maximizing the following adjusted profile h-likelihood,
    graphic file with name M120.gif
    where Inline graphic
  • Step 4: Given Inline graphic, Inline graphic, Inline graphic, obtain an updated nonparametric estimate of the baseline hazard Inline graphic as follows
    graphic file with name M126.gif
    where Inline graphic is an indicator function.
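The nonparametric baseline hazard update in Step 4 can be sketched with a Breslow-type estimator, which is assumed here to match the form of the estimator in the text: at each distinct event time, the number of events is divided by the sum of the exponentiated linear predictors over the participants still at risk. The function and argument names are hypothetical.

```python
import math


def breslow_baseline_hazard(times, events, eta):
    """Breslow-type estimate of the baseline hazard jumps.

    times: observed times; events: event indicators (1 = event, 0 = censored);
    eta: current linear predictors, including the random-effects terms.
    Returns a dict mapping each distinct event time to its hazard jump.
    """
    jumps = {}
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    for t in event_times:
        n_events = sum(1 for ti, di in zip(times, events) if di == 1 and ti == t)
        at_risk = sum(math.exp(e) for ti, e in zip(times, eta) if ti >= t)
        jumps[t] = n_events / at_risk
    return jumps


# Hypothetical data: four participants, one censored at time 3.
demo = breslow_baseline_hazard([2, 3, 3, 5], [1, 0, 1, 1], [0.0, 0.0, 0.0, 0.0])
print(demo)
```

Within the iterative procedure, `eta` would be recomputed from the current parameter and random-effects estimates before each such update.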

By iterating the above four steps until convergence, we can obtain approximate MLEs for the mean parameters, approximate REMLs for the variance–covariance parameters, empirical Bayes estimates of the random effects, and a nonparametric estimate of the baseline hazard function. To set starting values, we may first fit the models separately and then choose the resulting parameter estimates as the starting values for Inline graphic. More details are described in Section 4.3. The standard errors of the parameter estimates can be obtained based on

graphic file with name M129.gif

That is, the estimated variances of Inline graphic can be chosen to be the diagonal elements of the top left corner of the matrix Inline graphic (Lee and Nelder, 1996; Ha and others, 2003).

As mentioned in Section 1, the standard errors of parameter estimates may be under-estimated when the baseline hazard Inline graphic is unspecified (Rizopoulos, 2012a; Hsieh and others, 2006). A bootstrap method for obtaining standard errors is a good choice, but it is computationally intensive. Thus, here we propose a new approach to estimate the standard errors of parameter estimates based on the adaptive Gauss–Hermite (aGH) method (Rizopoulos, 2012a; Hartzel and others, 2001; Pinheiro and Bates, 1995). The basic idea is as follows. After convergence of the above steps, we can approximate the score function of Inline graphic for subject Inline graphic by the following:

graphic file with name M135.gif

where Inline graphic is the dimension of the random effects, Inline graphic is the number of quadrature points for each random effect, Inline graphic are weights for the original GH nodes Inline graphic, Inline graphic with Inline graphic being the upper triangular factor of the Cholesky decomposition of Inline graphic, and Inline graphic. Then, the standard errors of the parameter estimates can be estimated based on Inline graphic. In practice, we can calculate Inline graphic numerically using the central difference approximation (Rizopoulos, 2012a).
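The central difference step can be sketched as follows, under the assumption that the standard errors are taken from the inverse of the negative Hessian obtained by numerically differentiating the score. A Poisson log-linear score with two parameters stands in for the aGH-approximated score of the joint model; the data, the value of `theta_hat`, and all names are hypothetical.

```python
import math


def central_difference_jacobian(score, theta, eps: float = 1e-5):
    """Approximate d score_j / d theta_k by central differences."""
    p = len(theta)
    jac = [[0.0] * p for _ in range(p)]
    for k in range(p):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        s_up, s_down = score(up), score(down)
        for j in range(p):
            jac[j][k] = (s_up[j] - s_down[j]) / (2.0 * eps)
    return jac


def standard_errors_2x2(hessian):
    """Square roots of the diagonal of the inverse of the negative 2x2 Hessian."""
    a, b = -hessian[0][0], -hessian[0][1]
    c, d = -hessian[1][0], -hessian[1][1]
    det = a * d - b * c
    return math.sqrt(d / det), math.sqrt(a / det)


# Hypothetical stand-in model: Poisson counts with an intercept and one covariate.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
Y = [1, 2, 4, 9]


def poisson_score(theta):
    s = [0.0, 0.0]
    for x, y in zip(X, Y):
        resid = y - math.exp(x[0] * theta[0] + x[1] * theta[1])
        s[0] += x[0] * resid
        s[1] += x[1] * resid
    return s


theta_hat = [0.0, 0.7]  # placeholder for the converged estimates
hessian = central_difference_jacobian(poisson_score, theta_hat)
se = standard_errors_2x2(hessian)
print("standard errors:", se)
```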

4. Data analysis

4.1. HIV vaccine data and new time variables

In this section, we analyze the VAX004 data set described in Section 2, based on the proposed models and methods. Our objective is to check if individual-specific longitudinal characteristics of immune responses are associated with the risk of HIV infection. A comprehensive analysis may be infeasible due to space limitation, but we will focus on the essential features of the data. Since the immune response variables are mostly highly correlated, we choose two variables, “MNGNE8” and “NAb”, which may represent the key features of the longitudinal immune response data. Note that some variables are often conveniently converted to binary data for simpler clinical interpretations. Here, we let Inline graphic be the dichotomized MNGNE8 data such that Inline graphic if the MNGNE8 value of individual Inline graphic at time Inline graphic is larger than the sample median 0.57 and Inline graphic otherwise. Let Inline graphic be the original NAb value of individual Inline graphic at time Inline graphic. Recall that 27% of the original NAb values are below this variable’s LLOQ (i.e. left truncated).

A unique feature of vaccine trial data is that the longitudinal immune response data typically exhibit periodic patterns, due to repeated administration of the vaccine. This can be clearly seen in Figure 1. Statistical modeling must incorporate these features. Here we use a simple periodic function Inline graphic to empirically capture the periodic patterns and further define the following time variables (in months): (i) the time from the beginning of the study to the current scheduled measurement time, denoted by Inline graphic; (ii) the time from the most recent immunization to the current scheduled measurement time, denoted by Inline graphic (so Inline graphic); and (iii) the time between two consecutive vaccine administrations, denoted by Inline graphic, so there will be at least one Inline graphic and one Inline graphic between Inline graphic and Inline graphic. For measurement time Inline graphic scheduled after the final vaccination, we define Inline graphic as the time between the final vaccination and the final measurement time. These different time variables are needed in modeling the longitudinal trajectories. Figure 3 gives an example of how different time variables are defined for a randomly chosen participant Inline graphic. Recall that vaccinations are scheduled at months 0, 1, 6, 12, 18, 24, 30, and the study ends at month 36. For this participant Inline graphic, s/he receives the first four vaccinations, but then drops out from the study before receiving the fifth vaccination. There are eight measurements over time in total, denoted by the cross symbols, where the measurement times may be different from the vaccination times. Suppose that the sixth measurement is taken at month 9, i.e. Inline graphic, then we have Inline graphic, the difference between the sixth measurement time and the latest vaccination time (Vac 3 at month 6) for this participant, and Inline graphic, since the sixth measurement happens between the third vaccination (Vac 3 at month 6) and the fourth vaccination (Vac 4 at month 12). To avoid very large or small parameter estimates, we also re-scale the times as follows: Inline graphic (in weeks) and Inline graphic (in years).
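The three time variables can be computed mechanically from the vaccination times a participant actually received. The Python sketch below reproduces the worked example from the text (a measurement at month 9 for a participant vaccinated at months 0, 1, 6, and 12). The function name and the final measurement time of month 14 are hypothetical, and the re-scaling to weeks and years is omitted.

```python
from bisect import bisect_right


def time_variables(t, vaccination_times, final_measurement_time):
    """Return the three time variables (in months) for a measurement at time t.

    vaccination_times: sorted times of the vaccinations actually received.
    Returns (time since study start, time since the most recent vaccination,
    length of the vaccination interval containing t). After the final
    vaccination, the interval runs to the final measurement time.
    """
    k = bisect_right(vaccination_times, t) - 1
    if k < 0:
        raise ValueError("measurement precedes the first vaccination")
    last_vaccination = vaccination_times[k]
    since_vaccination = t - last_vaccination
    if k + 1 < len(vaccination_times):
        interval = vaccination_times[k + 1] - last_vaccination
    else:
        interval = final_measurement_time - last_vaccination
    return t, since_vaccination, interval


received = [0, 1, 6, 12]   # first four scheduled vaccinations, as in the text
print(time_variables(9, received, 14))    # sixth measurement, month 9
print(time_variables(14, received, 14))   # hypothetical final measurement
```

For the month 9 measurement this gives 3 months since the most recent vaccination and a 6-month vaccination interval, in agreement with the example above.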

Fig. 3.


Illustration of three time variables in VAX004. The cross symbols indicate the measurement times of subject Inline graphic. The dashed vertical lines show the scheduled times of vaccinations and the end of study (i.e. month 0, 1, 6, 12, 18, 24, 30, 36), where the black dashed lines represent the times when subject Inline graphic received vaccines and the gray dashed lines represent the times when subject Inline graphic missed the scheduled vaccinations. The arrow lines represent the time periods Inline graphic of the sixth and eighth measurements with Inline graphic and Inline graphic, respectively.

4.2. Models

Based on rationales discussed in Sections 3 and 4.1, we consider empirical models for the continuous and binary longitudinal data and survival model. The longitudinal models are selected based on AIC values (see details in Section 3 in the Supplementary material available at Biostatistics online). For the NAb data with 27% truncation, we model the untruncated data by using the LME model

graphic file with name M178.gif (4.1)

where Inline graphic and Inline graphic are categories 1 and 2 of baseline behavioral risk score, the random effects Inline graphic, Inline graphic is the variance parameter, Inline graphic follows a truncated normal distribution with mean 0 and variance Inline graphic, and Inline graphic follows a truncated normal distribution. To ensure identifiability of the models, we assume that Inline graphic. We only consider a random intercept in the model because adding more random effects does not substantially reduce AIC values while making the models more complicated.

We also model the truncation indicator, Inline graphic, of NAb to find possible associations of truncation with the time variables and other covariates and to predict the truncation status of NAb at times when NAb values are unavailable. The selected model is given as below,

graphic file with name M188.gif (4.2)

which shares the same random effect as the NAb model (4.1), since these two processes seem to be highly correlated with each other. In many studies, the Inline graphic and Inline graphic values are measured sparsely and we can use model (4.2) to predict the truncation status of Inline graphic at times when Y-measurements are unavailable.

For the binary MNGNE8 data, variable selection by AIC leads to the model

graphic file with name M192.gif (4.3)

where Inline graphic is the variance parameter with Inline graphic and the individual characteristics are incorporated via random slope Inline graphic and random intercept Inline graphic shared by models (4.1)–(4.2).

The association among the longitudinal models is incorporated through shared and correlated random effects from different models: Inline graphic, with Inline graphic and Inline graphic. Note that the random effect Inline graphic is shared by all the longitudinal models, since all the immune response longitudinal data exhibit similar individual-specific patterns and the random effect Inline graphic for the continuous NAb data best summarizes these patterns. For example, when a participant has a high baseline measurement of NAb, s/he likely also has a high baseline value of MNGNE8 and a low baseline probability that NAb is left truncated.

The survival model for the time to HIV infection is given by the “shared-parameter” model

graphic file with name M202.gif (4.4)

where Inline graphic is the measurement of GNE8_CD4 (i.e. blocking of the binding of the GNE8 HIV-1 gp120 Env protein to soluble CD4) for individual Inline graphic on the first day of the study after the first immunization, rescaled to have a mean of 0 and a standard deviation of 1. We call Inline graphic the standardized baseline GNE8_CD4. Since the analysis in this section is exploratory in nature, for simplicity we ignore other covariates.

4.3. Parameter estimates, model diagnostics, and new findings

We estimate the model parameters using the proposed h-likelihood method. For comparison, we also use the two-step method: in the first step, each longitudinal model is fitted separately to obtain random effect estimates; in the second step, these estimates are substituted for the random effects in the Cox model. The results of the two-step method are obtained using the R packages lme4 and survival. The two-step method has three drawbacks: (i) it may under-estimate the standard errors of the parameter estimates in the survival model, since it fails to incorporate the estimation uncertainty in the first step; (ii) it fails to incorporate the associations among the longitudinal variables; and (iii) it may lead to biased estimates of longitudinal model parameters when longitudinal data are terminated by event times and/or truncated longitudinal data are simply replaced by the LLOQ or half this limit (Wu, 2009). Table 1 summarizes the estimation results based on these two methods. Algorithms based on the h-likelihood method were terminated when the relative change became less than Inline graphic in the estimates or Inline graphic in the approximated log-likelihood. Since our main objective is to investigate whether individual-specific characteristics of the longitudinal immune responses are associated with the risk of HIV infection, we mainly focus on Inline graphic and Inline graphic in the survival model (4.4), as these parameters link the random effects to the hazard of HIV infection.

Table 1.

Estimates of all model parameters in VAX004

Model Par Two-step method H-likelihood method
    Est SE p-value Est SEInline graphic p-valueInline graphic
Estimates of mean parameters
  Inline graphic 1.57 0.05 Inline graphic 2.35 Inline graphic Inline graphic
  Inline graphic 1.89 0.05 Inline graphic 0.95 Inline graphic Inline graphic
LME model (4.1) Inline graphic Inline graphic0.54 0.02 Inline graphic Inline graphic0.31 Inline graphic Inline graphic
for NAb Inline graphic 0.55 0.05 Inline graphic 1.46 Inline graphic Inline graphic
  Inline graphic Inline graphic 0.04 0.93 Inline graphic0.04 Inline graphic Inline graphic
  Inline graphic Inline graphic0.27 0.12 0.03 Inline graphic0.10 Inline graphic Inline graphic
  Inline graphic 2.09 0.17 Inline graphic 1.94 Inline graphic Inline graphic
  Inline graphic Inline graphic6.52 0.31 Inline graphic Inline graphic6.27 Inline graphic Inline graphic
  Inline graphic 1.71 0.11 Inline graphic 1.65 Inline graphic Inline graphic
Truncation model (4.2) Inline graphic Inline graphic0.64 0.22 Inline graphic Inline graphic0.61 Inline graphic Inline graphic
  Inline graphic Inline graphic0.09 0.14 0.50 Inline graphic0.03 Inline graphic Inline graphic
  Inline graphic 1.15 0.38 Inline graphic 0.96 Inline graphic Inline graphic
  Inline graphic Inline graphic1.60 0.10 Inline graphic Inline graphic1.68 Inline graphic Inline graphic
  Inline graphic 0.11 0.01 Inline graphic 0.14 Inline graphic Inline graphic
GLMM (4.3) Inline graphic 1.76 0.19 Inline graphic 1.82 Inline graphic Inline graphic
for MNGNE8 Inline graphic Inline graphic0.05 0.01 Inline graphic Inline graphic0.05 Inline graphic Inline graphic
  Inline graphic Inline graphic0.72 0.23 Inline graphic Inline graphic0.79 Inline graphic Inline graphic
  Inline graphic 0.70 0.56 0.21 Inline graphic0.28 Inline graphic Inline graphic
Survival model (4.4) Inline graphic 1.46 1.14 0.20 1.86 Inline graphic Inline graphic
  Inline graphic Inline graphic0.01 0.24 0.96 Inline graphic1.69 Inline graphic Inline graphic
  Inline graphic 0.13 0.23 0.58 2.37 Inline graphic Inline graphic
Estimates of variance-covariance parameters
  Inline graphic 0.66     0.48    
  Inline graphic 0.22     0.54    
  Inline graphic Inline graphic0.89     Inline graphic1.60    
  Inline graphic 0.43     0.004    
  Inline graphic 0.05     0.19    
  Inline graphic 0.37     0.76    

SEInline graphic and p-valueInline graphic: Standard error and p-value based on the h-likelihood method.

SEInline graphic and p-valueInline graphic: Standard error and p-value based on the newly proposed method with 4 quadrature points.

From Table 1, we see that the two methods lead to quite different results, especially the estimates of Inline graphic and Inline graphic in the survival model that are our main focus in this analysis. For the two-step method, the estimates of Inline graphic and Inline graphic are near zero with confidence intervals including zero, not supporting that individual-specific immune response longitudinal trajectories are associated with the risk of HIV infection. However, these parameter estimates based on the proposed joint model with the h-likelihood method lead to different conclusions. Both Inline graphic and Inline graphic are highly significant based on the standard errors estimated by the joint model with the h-likelihood method (denoted as SEInline graphic), suggesting that individual-specific immune response longitudinal trajectories are highly associated with the risk of HIV infection. Since the standard errors based on the h-likelihood method may be under-estimated (Hsieh and others, 2006; Rizopoulos, 2012b), as discussed earlier, we also calculate the standard errors using the proposed method based on the aGH method, and the results with four quadrature points are given as SEInline graphic in the table. We see that, based on the new standard errors, the p-value for testing Inline graphic is slightly larger than Inline graphic while that for testing Inline graphic is still highly significant. Therefore, we may conclude that individual-specific immune response longitudinal trajectories are associated with the risk of HIV infection. This conclusion is unavailable based on the two-step method.
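As a rough check on how estimates and standard errors translate into p-values, the sketch below computes a two-sided Wald p-value in Python. That Wald tests underlie the reported p-values is an assumption, not something stated in the text; it does reproduce the two-step p-values in Table 1 up to rounding. The function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: parameter = 0 under a normal approximation."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Two-step estimates and standard errors from the survival model rows of Table 1.
p_first = wald_p_value(1.46, 1.14)
p_second = wald_p_value(-0.01, 0.24)
print(round(p_first, 2), round(p_second, 2))
```

Because the table entries are rounded, small discrepancies in the last digit are to be expected.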

The negative estimate of Inline graphic suggests that higher NAb values are associated with a lower risk of HIV infection, and the positive estimate of Inline graphic suggests that large increases in MNGNE8 over time are associated with a higher risk of HIV infection. Specifically, there is an estimated 81.6% decrease (i.e. Inline graphic) in the hazard/risk with a one unit increase in the individual effect Inline graphic and an estimated 10.6 times increase (i.e. Inline graphic) in the hazard/risk with a one unit increase in the individual-specific slope Inline graphic, holding other covariates constant. These findings are original, since they are unavailable based on the two-step method, and show the important contribution of the proposed joint model and the h-likelihood method.
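The hazard interpretation above follows from exponentiating the coefficients of the survival model. A minimal Python sketch, using the rounded h-likelihood estimates −1.69 and 2.37 from Table 1 (variable names are illustrative only):

```python
import math

# Rounded h-likelihood estimates from Table 1 for the two association
# parameters in the survival model (4.4).
coef_intercept_effect = -1.69
coef_slope_effect = 2.37

# Hazard ratio for a one unit increase in each random effect.
hr_intercept = math.exp(coef_intercept_effect)
hr_slope = math.exp(coef_slope_effect)

# Percentage decrease in the hazard implied by a hazard ratio below 1.
pct_decrease = 100.0 * (1.0 - hr_intercept)

print(f"hazard ratio (intercept effect): {hr_intercept:.3f}")
print(f"percentage decrease in hazard:   {pct_decrease:.1f}%")
print(f"hazard ratio (slope effect):     {hr_slope:.1f}")
```

Since the inputs are rounded to two decimals, the results agree with the reported 81.6% decrease and 10.6-fold increase only up to rounding.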

The joint model method and the two-step method agree on the statistical significance of the parameters in the longitudinal models (4.1)–(4.3), except for Inline graphic and Inline graphic. By the two-step method, the tests for Inline graphic and for Inline graphic yield significant p-values, suggesting that participants with baseline behavioral risk score in category 2 (i.e. risk2 = 1) have significantly lower NAb values than other participants. By the joint model method, on the other hand, such a negative association is not statistically significant. For model (4.1), the mean square error (MSE) based on the joint model is 0.296, while the MSE based on the two-step method is 0.403.

Model diagnostics were conducted to check the assumptions and goodness-of-fit of the models. The results are given in Section 4 of the Supplementary material available at Biostatistics online. Overall, the assumptions hold and the models fit the data well. The data used in this example may be requested through a concept proposal to the owner of the data, Global Solutions for Infectious Diseases.

5. Simulation studies

In this section, we conduct three simulation studies to evaluate the proposed joint model with the h-likelihood method. The models and their true parameter values in the simulation studies are chosen to be similar to the estimated values in the models for real data in the previous section.

5.1. Simulation study Inline graphic

Conditional on the random effects, the binary data Inline graphic are generated from a Bernoulli distribution with probabilities Inline graphic, where Inline graphic. For the continuous data Inline graphic, we first randomly generate Inline graphic from a normal distribution Inline graphic with Inline graphic. Then we create truncations so that Inline graphic is observed if Inline graphic and truncated otherwise and we choose LLOQ = 2. The random effects Inline graphic are generated from a multivariate normal distribution Inline graphic. The true values of the parameters are set to be: Inline graphic, Inline graphic, Inline graphic, Inline graphic. The survival times Inline graphic are generated from a Weibull distribution with shape parameter of Inline graphic and scale parameter of Inline graphic, where Inline graphic is a baseline covariate generated from the standard normal distribution. The non-informative censoring times Inline graphic are generated from a Weibull distribution with shape parameter of 5 and scale parameter of Inline graphic. The true parameter values are given as Inline graphic.
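The data-generating steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' simulation code: all numerical values except LLOQ = 2 and the censoring shape parameter of 5 are hypothetical, a single shared random intercept replaces the multivariate normal random effects, and the linear predictors and the Weibull scale are simplified stand-ins for the expressions that appear as inline formulas in the original.

```python
import math
import random

LLOQ = 2.0                        # lower limit of quantification (from the text)
B0, B1, SIGMA = 2.0, -0.3, 0.5    # hypothetical continuous-model parameters
A0, A1 = 1.8, -0.05               # hypothetical binary-model parameters
SD_B = 0.7                        # hypothetical random-intercept standard deviation
G0, G1, G2 = -2.0, 0.5, -1.5      # hypothetical survival-model parameters
EVENT_SHAPE = 2.0                 # hypothetical Weibull shape for event times
CENS_SHAPE, CENS_SCALE = 5.0, 3.0  # censoring shape 5 (text); scale hypothetical


def simulate_subject(rng, visit_times):
    """Simulate one subject's survival outcome and longitudinal records."""
    b = rng.gauss(0.0, SD_B)      # shared random effect
    x = rng.gauss(0.0, 1.0)       # baseline covariate, standard normal
    # Weibull event time whose scale depends on the covariate and random effect.
    scale = math.exp(-(G0 + G1 * x + G2 * b) / EVENT_SHAPE)
    t_event = rng.weibullvariate(scale, EVENT_SHAPE)   # (scale, shape)
    t_cens = rng.weibullvariate(CENS_SCALE, CENS_SHAPE)
    t_obs = min(t_event, t_cens)
    rows = []
    for t in visit_times:         # visit times assumed sorted in increasing order
        if t > t_obs:
            break                 # no measurements after the event or censoring
        y_star = rng.gauss(B0 + B1 * t + b, SIGMA)
        truncated = y_star < LLOQ  # value is observed only at or above the LLOQ
        prob = 1.0 / (1.0 + math.exp(-(A0 + A1 * t + b)))
        z = 1 if rng.random() < prob else 0
        rows.append({"time": t, "y": None if truncated else y_star,
                     "truncated": truncated, "z": z})
    return {"x": x, "time": t_obs, "event": t_event <= t_cens, "rows": rows}


rng = random.Random(2017)
subjects = [simulate_subject(rng, [0.0, 0.5, 1.0, 1.5]) for _ in range(50)]
```

Sharing one random effect across the two longitudinal outcomes and the event time is what induces the association that the joint model is meant to recover.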

5.2. Simulation studies Inline graphic and Inline graphic

To better evaluate the performance of the h-likelihood method, we conducted two additional simulation studies: (i) a joint model with higher dimensions of random effects (Study 2, four random effects); and (ii) a joint model with a parametric survival model (Study 3, a Weibull survival model). For the parametric joint model in Study 3, we also estimate the model parameters using the aGH method for comparison. Owing to space limitations, the details of these two simulation studies are given in Sections 5 and 6 of the Supplementary material available at Biostatistics online.

5.3. Simulation results and discussions

We compare the performance of the methods based on the relative bias and MSE of the parameter estimates, which are defined as follows (say, for parameter Inline graphic):

  • relative bias (%) of Inline graphic,

  • relative MSE (%) of Inline graphic,

where Inline graphic is the estimate of Inline graphic in simulation iteration Inline graphic, Inline graphic is the total number of repetitions, and Inline graphic is the true parameter value.
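The two summary measures can be computed as in the Python sketch below. The exact formulas are not legible in this copy, so the definitions used here are assumptions: relative bias as the absolute difference between the mean estimate and the true value, divided by the absolute true value, and relative MSE as the mean squared error divided by the absolute true value, both in percent. These forms are approximately consistent with Table 2 (for example, a true value of 2.00 with a mean estimate of 2.14 and SSE of 0.04 gives about 7% and about 1.1%). The function names are hypothetical.

```python
def relative_bias_pct(estimates, true_value):
    """Assumed form: 100 * |mean(estimates) - true| / |true|."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * abs(mean_estimate - true_value) / abs(true_value)


def relative_mse_pct(estimates, true_value):
    """Assumed form: 100 * mean((estimate - true)^2) / |true|."""
    mse = sum((est - true_value) ** 2 for est in estimates) / len(estimates)
    return 100.0 * mse / abs(true_value)


# Hypothetical estimates from four simulation repetitions, true value 2.0.
demo_estimates = [2.1, 2.2, 2.0, 2.3]
demo_rbias = relative_bias_pct(demo_estimates, 2.0)
demo_rmse = relative_mse_pct(demo_estimates, 2.0)
print(demo_rbias, demo_rmse)
```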

Table 2 summarizes the results of Simulation Study 1 (Inline graphic) when the longitudinal measurements are collected bi-weekly. The proposed h-likelihood method outperforms the two-step method as it returns much less biased estimates for most of the parameters. As for the bias in Inline graphic, it is known that the h-likelihood method may perform less satisfactorily for logistic mixed effects models (Kuk and Cheng, 1999; Waddington and Thompson, 2004). The standard errors of Inline graphic’s seem to be underestimated by the h-likelihood method. This problem has been reported elsewhere (Hsieh and others, 2006; Rizopoulos, 2012b). However, our newly proposed method based on the aGH method with 4 quadrature points returns coverage probabilities much closer to the nominal coverage probabilities. The results of Simulation Studies 2 and 3 are given in Tables S5 and S6 in the Supplementary material available at Biostatistics online. The main conclusions are consistent with those from Table 2. In Simulation Study 3, the Inline graphic’s based on the h-likelihood method are much less biased, though with slightly larger MSEs, than those based on the aGH method, while for Inline graphic, Inline graphic, and Inline graphic, the aGH method has less biased estimates than the h-likelihood method (see Table S6 in the Supplementary material available at Biostatistics online). Synthesizing the simulation results, we conclude that the proposed h-likelihood method, with the new approach of estimating the standard errors, performs reasonably well. Its performance remains consistent with higher dimensions of random effects and parametric survival models. Although it is sometimes less accurate than the aGH method, it is computationally much more efficient.
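Coverage probabilities such as those in Table 2 are typically computed as the percentage of simulation repetitions whose confidence interval contains the true value. A minimal Python sketch follows, assuming 95% Wald-type intervals (estimate ± 1.96 × standard error); the nominal level and interval type are assumptions, and the function name is hypothetical.

```python
def coverage_probability_pct(estimates, std_errors, true_value, z=1.96):
    """Percentage of Wald intervals estimate +/- z * SE that contain true_value."""
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= true_value <= est + z * se
    )
    return 100.0 * covered / len(estimates)


# Hypothetical estimates and standard errors from four repetitions.
demo_cov = coverage_probability_pct(
    [2.0, 2.5, 1.9, 3.0], [0.1, 0.2, 0.1, 0.3], true_value=2.0
)
print(demo_cov)
```

Under-estimated standard errors shrink the intervals and therefore pull the coverage below the nominal level, which is the pattern discussed above for the plain h-likelihood standard errors.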

Table 2.

Simulation results with bi-weekly longitudinal measurements based on the two-step (TS) method and the h-likelihood (HL) method

Model Par True Estimate SSE rBias (%) rMSE (%) Coverage probability (%)
    value TS HL TS HL TS HL TS HL TS HLInline graphic HLInline graphic
Inline graphic 2.00 2.14 2.01 0.04 0.06 7.05 0.55 1.07 0.16 15 94 98
(4.1) Inline graphic 1.00 0.88 1.00 0.04 0.04 11.55 0.20 1.47 0.20 13 94 99
  Inline graphic Inline graphic0.30 Inline graphic0.26 Inline graphic0.30 0.02 0.02 13.07 0.00 0.60 0.12 32 94 99
  Inline graphic 1.50 1.41 1.49 0.02 0.02 5.70 0.37 0.52 0.04 1 94 98
  Inline graphic Inline graphic1.65 Inline graphic1.64 Inline graphic1.64 0.10 0.10 0.57 0.59 0.62 0.63 94 95 99
(4.3) Inline graphic 0.15 0.15 0.15 0.02 0.02 2.02 2.45 0.18 0.17 94 89 96
  Inline graphic 1.80 1.80 1.78 0.11 0.10 0.26 0.84 0.61 0.61 96 95 99
  Inline graphic Inline graphic0.05 Inline graphic0.05 Inline graphic0.05 0.01 0.01 1.80 2.68 0.06 0.06 93 93 97
  Inline graphic Inline graphic0.75 Inline graphic0.68 Inline graphic0.74 0.16 0.17 9.44 0.74 4.00 3.76 88 85 97
(4.4) Inline graphic Inline graphic1.50 Inline graphic1.10 Inline graphic1.44 0.22 0.29 26.91 3.99 13.96 5.92 38 65 84
  Inline graphic 2.00 1.53 2.02 0.23 0.31 23.40 1.10 13.70 4.88 34 69 87

HLInline graphic: Coverage probability based on the h-likelihood method.

HLInline graphic: Coverage probability based on the newly proposed method for standard errors with 4 quadrature points.

6. Discussion

In this article, we have considered a joint model for mixed types of longitudinal data with left truncation and a survival model and proposed a new method to handle the left-truncation in longitudinal data. A main advantage of this method, compared with existing methods in the literature (e.g. Hughes, 1999), is that it does not make any untestable distributional assumption for the truncated data that are below a measurement instrument's LLOQ. Different types of longitudinal data are assumed to be associated via shared and correlated random effects. We have also proposed an h-likelihood method for approximate joint likelihood inference, which is computationally much more efficient than the aGH method. Moreover, we have proposed a new method to better estimate the standard errors of parameter estimates from the h-likelihood method. On a MacBook Pro (Version 10.11.4), the average computing times of the h-likelihood method were 2.7 min for the semiparametric joint model with 2 random effects and 21.9 min for the semiparametric joint model with 4 random effects. For the parametric joint model with 2 random effects, the average running time of the h-likelihood method was 9.1 min, much faster than the aGH method, which took 28.4 min.

Analysis of the real HIV vaccine data based on the proposed method shows that the individual-specific characteristics of longitudinal immune responses, summarized by random effects in the models, are highly associated with the risk of HIV infection. This finding is of interest and may be helpful for designing future HIV vaccine studies. We have also proposed a model for the left-truncation indicator of the longitudinal immune response data and showed that the left-truncation status follows certain patterns as functions of time. Such a model can be used to predict the left-truncation status (below LLOQ status) of some longitudinal immune response values when measurement schedules are infrequent or sparse.

The joint model in this article may be extended in several directions. For example, the Cox model may be replaced by an accelerated failure time model or survival model for interval censored data or competing risks data. The association among different types of longitudinal processes may also be modeled in other ways such as shared latent processes. In addition, the dropouts in the real data may be associated with longitudinal patterns, so we may consider incorporating missing data mechanisms into the joint models in future research. Research for these extensions will be reported separately.

7. Software

Software in the form of R code and a sample input data set are available at https://github.com/oliviayu/HHJMs.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Acknowledgments

The authors thank the reviewers for their thoughtful comments, which greatly improved the article. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or BMGF. The authors thank the participants, investigators, and sponsors of the VAX004 trial, including Global Solutions for Infectious Diseases. Conflict of Interest: None declared.

Funding

National Institute of Allergy and Infectious Diseases of the National Institutes of Health (NIH) (Award Numbers R37AI054165 and UM1AI068635); and Bill and Melinda Gates Foundation (BMGF) (Award Number OPP1110049).

References

  1. Barrett J., Diggle P., Henderson R., and Taylor-Robinson D. (2015). Joint modelling of repeated measurements and time-to-event outcomes: flexible model specification and exact likelihood inference. Journal of the Royal Statistical Society: Series B, 77, 131–148.
  2. Bernhardt P. W., Wang H. J., and Zhang D. (2014). Flexible modeling of survival data with covariates subject to detection limits via multiple imputation. Computational Statistics and Data Analysis, 69, 81–91.
  3. Chen Q., May R. C., Ibrahim J. G., Chu H., and Cole S. R. (2014). Joint modeling of longitudinal and survival data with missing and left-censored time-varying covariates. Statistics in Medicine, 33, 4560–4576.
  4. Elashoff R. M., Li G., and Li N. (2015). Joint Modeling of Longitudinal and Time-to-Event Data. Boca Raton, FL: Chapman & Hall/CRC.
  5. Flynn N., Forthal D., Harro C., Judson F., Mayer K., Para M., and Gilbert P., The rgp120 Inline graphic Vaccine Study Group (2005). Placebo-controlled phase 3 trial of recombinant glycoprotein 120 vaccine to prevent HIV-1 infection. Journal of Infectious Diseases, 191, 654–65.
  6. Fu R. and Gilbert P. B. (2017). Joint modeling of longitudinal and survival data with the Cox model and two-phase sampling. Lifetime Data Analysis, 23, 136–159.
  7. Gilbert P. B., Peterson M. L., Follmann D., Hudgens M. G., Francis D. P., Gurwith M., Heyward W. L., Jobes D. V., Popovic V., Self S. G., et al. (2005). Correlation between immunologic responses to a recombinant glycoprotein 120 vaccine and incidence of HIV-1 infection in a phase 3 HIV-1 preventive vaccine trial. Journal of Infectious Diseases, 191, 666–677.
  8. Ha I. D., Park T., and Lee Y. (2003). Joint modelling of repeated measures and survival time data. Biometrical Journal, 45, 647–658.
  9. Hartzel J., Agresti A., and Caffo B. (2001). Multinomial logit random effects models. Statistical Modelling, 1, 81–102.
  10. Hsieh F., Tseng Y.-K., and Wang J.-L. (2006). Joint modeling of survival and longitudinal data: likelihood approach revisited. Biometrics, 62, 1037–1043.
  11. Hughes J. P. (1999). Mixed effects models with censored data with application to HIV RNA levels. Biometrics, 55, 625–629.
  12. Król A., Ferrer L., Pignon J.-P., Proust-Lima C., Ducreux M., Bouché O., Michiels S., and Rondeau V. (2016). Joint model for left-censored longitudinal data, recurrent events and terminal event: predictive abilities of tumor burden for cancer evolution with application to the FFCD 2000–05 trial. Biometrics, 72, 907–916.
  13. Kuk A. Y. and Cheng Y. W. (1999). Pointwise and functional approximations in Monte Carlo maximum likelihood estimation. Statistics and Computing, 9, 91–99.
  14. Lawrence Gould A., Boye M. E., Crowther M. J., Ibrahim J. G., Quartey G., Micallef S., and Bois F. Y. (2015). Joint modeling of survival and longitudinal non-survival data: current methods and issues. Report of the DIA Bayesian Joint Modeling Working Group. Statistics in Medicine, 34, 2181–2195.
  15. Lee Y. and Nelder J. A. (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society: Series B, 58, 619–678.
  16. Lee Y., Nelder J. A., and Pawitan Y. (2006). Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Boca Raton, FL: Chapman & Hall/CRC.
  17. Mehrotra K. G., Kulkarni P. M., Tripathi R. C., and Michalek J. E. (2000). Maximum likelihood estimation for longitudinal data with truncated observations. Statistics in Medicine, 19, 2975–2988.
  18. Molas M., Noh M., Lee Y., and Lesaffre E. (2013). Joint hierarchical generalized linear models with multivariate Gaussian random effects. Computational Statistics and Data Analysis, 68, 239–250.
  19. Pinheiro J. C. and Bates D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4, 12–35.
  20. Rizopoulos D. (2012a). Fast fitting of joint models for longitudinal and event time data using a pseudo-adaptive Gaussian quadrature rule. Computational Statistics and Data Analysis, 56, 491–501.
  21. Rizopoulos D. (2012b). Joint Models for Longitudinal and Time-to-Event Data: With Applications in R. Boca Raton, FL: Chapman & Hall/CRC.
  22. Rizopoulos D., Verbeke G., and Lesaffre E. (2009). Fully exponential Laplace approximations for the joint modelling of survival and longitudinal data. Journal of the Royal Statistical Society: Series B, 71, 637–654.
  23. Taylor J. M., Park Y., Ankerst D. P., Proust-Lima C., Williams S., Kestin L., Bae K., Pickles T., and Sandler H. (2013). Real-time individual predictions of prostate cancer recurrence using joint models. Biometrics, 69, 206–213.
  24. Waddington D. and Thompson R. (2004). Using a correlated probit model approximation to estimate the variance for binary matched pairs. Statistics and Computing, 14, 83–90.
  25. Wu L. (2002). A joint model for nonlinear mixed-effects models with censoring and covariates measured with error, with application to AIDS studies. Journal of the American Statistical Association, 97, 955–964.
  26. Wu L. (2009). Mixed Effects Models for Complex Data. Boca Raton, FL: Chapman & Hall/CRC.
  27. Wulfsohn M. S. and Tsiatis A. A. (1997). A joint model for survival and longitudinal data measured with error. Biometrics, 53, 330–339.
  28. Zhu H., Ibrahim J. G., Chi Y.-Y., and Tang N. (2012). Bayesian influence measures for joint models for longitudinal and survival data. Biometrics, 68, 954–964.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
