Abstract
In jointly modelling longitudinal and survival data, the longitudinal data may be complex in the sense that they may contain outliers and may be left censored. Motivated by an HIV vaccine study, we propose a robust method for joint models of longitudinal and survival data, where the outliers in longitudinal data are addressed using a multivariate t-distribution for b-outliers and using an M-estimator for e-outliers. We also propose a computationally efficient method for approximate likelihood inference. The proposed method is evaluated by simulation studies. Based on the proposed models and method, we analyze the HIV vaccine data and find a strong association between longitudinal biomarkers and the risk of HIV infection.
Keywords: biomarker, outliers, robust joint model, h-likelihood, left-censoring
1. Introduction.
In HIV vaccine studies, joint models of longitudinal and survival data are instrumental, since it is crucial to study the association between longitudinal immune response biomarkers during repeated immunizations over time and the risk of HIV infection. Joint modelling of longitudinal and survival data has received significant attention in the literature in the past two decades (e.g., Rizopoulos (2012); Elashoff, Li and Li (2015); Wu and Yu (2016)). The following issues complicate joint modelling of HIV vaccine data: (i) some longitudinal biomarker data may be left-censored because they fall below a lower limit of quantification (LLOQ) or a detection limit; (ii) there may be outliers in the longitudinal data; (iii) various immune response biomarkers may be strongly associated with each other and may be of mixed types, with some binary and some continuous; and (iv) standard likelihood estimation of the model parameters can be computationally intensive. There seem to be no existing joint modelling methods that address all these issues simultaneously.
Our research is motivated by the VAX004 phase III trial. It is a 36-month efficacy study of a candidate vaccine, which contains two recombinant gp120 Envelope proteins (the MN and GNE8 HIV-1 strains), to prevent HIV-1 infection (Flynn et al., 2005). Participants assigned to the vaccine group received the candidate vaccine, with repeated immunizations at months 0, 1, 6, 12, 18, 24, 30 post-enrollment. They were followed until the day of diagnosis of HIV infection or until the end of the study. Meanwhile, their plasma samples were collected at the immunization visits, two weeks after each immunization visit, and at the final visit. A number of immune response biomarkers were measured at each visit, including NAb (the titer of neutralizing antibodies to the MN strain of the HIV-1 gp120 Env protein) and MNGNE8 (the average level of binding antibodies, measured by ELISA, to the MN and GNE8 HIV-1 gp120 Env proteins). Figure 1 shows the trajectories of four immune response biomarkers (Yu, Wu and Gilbert (2018)). Besides the features noted in the previous paragraph, we also see the following features from Figure 1: (a) different biomarkers seem strongly associated with each other since they exhibit similar patterns over time; (b) there may be associations between the biomarkers’ longitudinal patterns and the risk of HIV infection; and (c) there seem to be large variations between individuals. More details are presented in Section 4. Due to the large between-individual variations, mixed effects or random effects models may be desirable for modelling the longitudinal biomarker data.
Fig 1:

Longitudinal measurements of four immune response biomarkers. The solid lines show randomly selected subjects with HIV infection by the end of the study, while the dotted lines show randomly selected subjects without HIV infection by the end of the study.
Many authors have reviewed the literature on joint models of longitudinal and survival data (e.g., Rizopoulos (2012); Taylor et al. (2013); Elashoff, Li and Li (2015); Wu and Yu (2016), among others). There has also been active research on longitudinal data with left censoring (e.g., Hughes (1999); Wu (2009); Yu, Wu and Gilbert (2018)), longitudinal data with outliers (e.g., Cantoni and Ronchetti (2001); Sinha (2004); Qin et al. (2016)), and computational tools for joint models (e.g., Rizopoulos (2012); Lee and Nelder (1996); Barrett et al. (2015)). However, to our knowledge, little literature seems available to address all these issues simultaneously.
In this paper, we propose a joint model for longitudinal and survival data, which addresses left censoring and outliers simultaneously. While the methodology for each of these issues is not new, addressing all issues simultaneously in the joint model is not trivial. For (unobserved) left censored longitudinal data, we avoid distributional assumptions by treating them as a point mass, following Yu, Wu and Gilbert (2018). An outlier may be defined as a data point which appears to be inconsistent with the remainder of the data, or a data point which differs significantly from other data. For longitudinal data, the terms e-outliers and b-outliers have appeared in the literature: an e-outlier is an outlying data point among the repeated measurements within an individual, while a b-outlier is an outlying individual in the sample. In other words, an e-outlier may arise in the model for the within-individual repeated measurements, while a b-outlier may arise in the assumed model/distribution for the random effects. Thus, naturally, we may consider a heavy-tailed multivariate t-distribution for the b-outliers rather than the normal distribution usually assumed for the random effects, and we may use an M-estimator for the e-outliers. An M-estimator downweights potentially outlying observations and is widely used in the robust inference literature (e.g., Cantoni and Ronchetti (2001); Sinha (2004); Wu and Qiu (2011)). Note that, if there are too many outliers (say, over 5%), these outliers may be viewed as coming from a separate population, so we might either study them separately or consider a mixture of two distributions such as a mixture of two normal distributions.
To address the association between different types of longitudinal data, we consider several mixed effects models with correlated or shared random effects. For such a complex joint model, computation for likelihood inference is a major challenge, since the standard EM algorithm may converge very slowly or fail to converge. Here we extend the h-likelihood method of Lee and Nelder (1996) to the proposed joint model for approximate likelihood estimation, which is computationally very efficient.
The rest of the paper is organized as follows. In Section 2, we describe the joint model and its likelihood in a general form. Section 3 introduces the proposed robust method, its asymptotic results, and the proposed approximate estimation method. In Section 4, we analyze the HIV vaccine data in detail. Section 5 shows simulation results evaluating the proposed method. We conclude the article with some discussion in Section 6.
2. The Joint Model and Its Likelihood.
2.1. The models.
In this section, we present the models in general forms so that they can be applied in other applications. In the motivating HIV vaccine study described in the previous section, the longitudinal biomarkers are highly correlated, as can be seen from the similar patterns in Figure 1. In practice, some continuous longitudinal biomarker data are sometimes dichotomized for biological reasons. Thus, to simplify the presentation, here we consider one continuous longitudinal biomarker Y* and one binary longitudinal biomarker Z (taking values 0 or 1). These two different types of biomarkers may be associated with each other, as noted earlier. In addition, the biomarker Y* value may be left censored due to an assay’s lower limit of quantification (LLOQ) d (or detection limit d). We denote the observed (un-censored) Y*-value by Y. Another complication in the data is that some observed Y data may contain outliers. Let C be the left censoring indicator of Y* such that C = 1 if the Y* value is below the LLOQ (i.e., Y* < d) and C = 0 otherwise. Occasionally, to simplify presentation, we may not strictly distinguish Y and Y* if the context is clear. We also use bold-face y to denote a column vector, yT to denote its transpose, and plain y to denote a scalar. We use similar notation for other variables.
Note that the value of the left-censoring indicator variable C at time t is determined by the longitudinal biomarker Y* value if Y* is measured at time t. However, for the motivating HIV study, the biomarker Y* value is measured infrequently over time, so the values of the censoring status C at many time points are not available when Y* is not measured, especially at time points close to HIV infection. On the other hand, the longitudinal patterns of the censoring status C, especially at time points close to HIV infection, may be associated with the risk of HIV infection, as shown in the data analysis and other studies. Thus, by modelling the censoring indicator C based on observed data, we may be able to “predict” the left censoring status C at time t based on its observed patterns and other available information, even when Y* is not measured at time t. As noted earlier, some longitudinal biomarkers in the study are highly correlated, so we may use this information to help predict the left-censoring status C at time t when Y* is not measured. We may incorporate this information by allowing the random effects in the C model to be correlated with the random effects in the models for other biomarkers (e.g., Z), or by directly using other biomarkers as predictors in the model for C. Here we consider correlated or shared random effects.
Suppose that there are n individuals in the study, with ni repeated measurements on the longitudinal biomarkers for individual i, i = 1, 2, · · · , n. Some biomarker values may be missing at scheduled times, and the missing data are assumed to be missing at random. Let Yij be the biomarker Y for individual i at time tij and yij be its observed value, i = 1, 2, · · · , n, j = 1, 2, · · · , ni. We define Cij, Zij and cij, zij similarly. The observed data are {(yij, cij, zij), i = 1, 2, · · · , n, j = 1, 2, · · · , ni}.
We consider a linear mixed effects (LME) model for the continuous Y-data, and generalized linear mixed effects (GLME) models for the binary Z-data and C-data. Specifically, we consider the following GLME model for the binary variable Zij
$$\operatorname{logit}\{P(Z_{ij} = 1)\} = w_{ij}^{T}\alpha + u_{ij}^{T} b_{1i}, \tag{1}$$
where wij and uij are covariate vectors containing time, α is a vector of fixed effect parameters, b1i is a vector of random effects. For simplicity, we suppress the dimensions of these vectors, with the understanding that the dimensions match in the model. We use similar notation for other models.
For the observed (un-censored) data Y, i.e., given Cij = 0, we consider the following LME model
$$Y_{ij} = x_{ij}^{T}\beta + v_{ij}^{T} b_{2i} + \epsilon_{ij}, \tag{2}$$
where xij and vij contain covariates including time, β contains fixed effect parameters, b2i contains random effects, and ϵij is an error term with ϵij ~ N(0, σ²). We assume that the observed data Y follow a normal distribution left-truncated at d, with density function $f(y_{ij} \mid c_{ij} = 0, b_{2i}) = \sigma^{-1}\phi\{(y_{ij} - \mu_{ij})/\sigma\} / [1 - \Phi\{(d - \mu_{ij})/\sigma\}]$ for $y_{ij} \ge d$, where $\mu_{ij} = x_{ij}^{T}\beta + v_{ij}^{T} b_{2i}$ and ϕ and Φ are the standard normal density and distribution functions. Note that here we only model the observed (uncensored) Y-data, not the unobserved left-censored Y*-data which are below the detection limit d. Thus, we avoid making unverifiable model or distributional assumptions for the unobserved left-censored data. The unobserved left-censored data are treated as a point mass, as shown in the joint likelihood (5) in Section 2.2. This may lead to some loss of information, compared with a model assuming a distribution for the censored values. However, such a model cannot be verified and is often unreasonable, since the unobserved below-detection Y*-values may behave very differently from the observed above-detection Y-values.
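The truncated normal assumption for the observed Y-data can be illustrated numerically. The sketch below assumes the usual left-truncated normal form (a normal density renormalized by the probability mass above d); the function names are hypothetical, and the numbers plugged in (mean 3.13, standard deviation 0.59, LLOQ 1.477) are borrowed from the descriptive summary of NAb in Section 4.1 purely for illustration, not as fitted model values.

```python
import math

def std_normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_pdf(y, mu, sigma, d):
    """Density of N(mu, sigma^2) restricted to y >= d (left truncation at d)."""
    if y < d:
        return 0.0
    mass_above_d = 1.0 - std_normal_cdf((d - mu) / sigma)
    return std_normal_pdf((y - mu) / sigma) / (sigma * mass_above_d)

def integrate_trapezoid(f, lo, hi, n=20000):
    """Plain trapezoid rule, used only to check that the density integrates to one."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h

# Illustrative values only (assumption): descriptive NAb summary and LLOQ.
MU, SIGMA, LLOQ = 3.13, 0.59, 1.477
total_mass = integrate_trapezoid(
    lambda y: truncated_normal_pdf(y, MU, SIGMA, LLOQ), LLOQ, MU + 10.0 * SIGMA
)
```

Here `total_mass` should be numerically close to one, which confirms that dividing by the mass above d is what turns the untruncated normal density into a proper density for the uncensored observations.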
In model (2), we actually model Y|C = 0, so we also need to model C, similar to the idea of mixture models. As noted earlier, for the left-censoring indicator C, while its value is determined by the value of Y* if Y* is measured, the values of Y* are often un-available at times of interest due to its infrequent measurement schedules. Since the values of Y are highly correlated with the values of other biomarkers and exhibit similar periodic patterns (see Figure 1), we may use information from other observed biomarkers and the observed longitudinal pattern of left-censoring statuses to possibly “predict” the censoring status of Y* when the values of Y* are unavailable. The information from other observed biomarkers may be incorporated by allowing the random effects from these biomarker models to be correlated with the random effects for the C-model, since these random effects summarize the individual-specific characteristics of the corresponding longitudinal trajectories. Thus, we consider the following GLME model for the binary censoring indicator Cij
$$\operatorname{logit}\{P(C_{ij} = 1)\} = \phi_{ij}^{T}\eta + \psi_{ij}^{T} b_{3i}, \tag{3}$$
where ϕij and ψij contain covariates including time, η contains fixed effect parameters, and b3i contains random effects. Note that, as shown below, the random effects b3i in model (3), which reflect the individual characteristics of the censoring process, are assumed to be correlated with the random effects b1i and b2i in models (1) and (2) respectively, allowing information from other biomarker data to help predict the censoring status when it is not known. Such censoring information may be predictive of the risk of HIV infection.
The values of Y, Z, and C are correlated. Since Yij is a continuous random variable while Zij and Cij are binary random variables, the usual multivariate mixed effects models are not applicable. Here, we incorporate the association by assuming that the random effects in the three models are correlated or shared. This is a reasonable assumption since the three variables exhibit similar individual-specific longitudinal patterns. Specifically, we assume that
where Σ is an arbitrary covariance matrix incorporating the correlations among the random effects, and f(bi|Σ) is typically a multivariate normal distribution with zero mean and covariance matrix Σ but it can also be a multivariate t-distribution for robust inference. Note that bi only contains the distinct elements in (b1i, b2i, b3i).
In the HIV vaccine study, our main interest is in the risk of event “HIV infection” and if this risk is associated with the longitudinal biomarker data. At the end of the study, some subjects may not experience HIV infection, so their event times are right censored. We assume that the right censoring is non-informative. Let {(si, δi)|i = 1, · · · , n} denote the observed survival (or event time) data, where si is either the event time or the censoring time, and δi is the event indicator such that δi = 1 if subject i is infected and δi = 0 otherwise. For the survival data, we consider the following Cox proportional hazards model
$$h_i(t) = h_0(t)\exp\left(x_{0i}^{T}\lambda_0 + b_i^{T}\lambda_1\right), \tag{4}$$
where hi(t) is the hazard/risk of HIV infection for subject i at time t, h0(t) is an unspecified baseline hazard, x0i contains baseline covariates, and λ0 and λ1 are vectors of fixed parameters. Thus, the risk of HIV infection hi(t) is assumed to be associated with the random effects bi in the longitudinal biomarker models. This assumption may be reasonable since the random effects in the longitudinal models reflect the individual-specific characteristics of the corresponding longitudinal processes. Such a joint model is called a shared-parameter model in the joint model literature, and is perhaps the most commonly assumed joint model for longitudinal and survival data.
In the above models, the three longitudinal models (1), (2), (3) are linked through correlated or shared random effects, and these random effects are used as “predictors” in the survival model (4). The specifications of these models are motivated by the HIV vaccine data described in Section 1. Since the four models are associated, inference of all the model parameters can be based on the joint likelihood of all the observed data, as described in the next section.
2.2. The likelihood.
Let λ = (λ0T, λ1T)T be a collection of parameters in the survival model (4), let θ = (αT, βT, ηT, λT)T be a collection of all the mean parameters in all the longitudinal and survival models, let vec(Σ) be a vector of unconstrained parameters that determines Σ based on spherical parameterization, and let ξ = (σ, vec(Σ)) be a collection of all the distinct variance-covariance parameters. Let f(·) denote a generic density function. We assume that, conditioning on the random effects, the longitudinal variables and the survival variable are independent of each other. The (joint) likelihood of all the observed longitudinal data and survival data based on the assumed models can be written as
| (5) |
where , , , , and ϕ() is the probability density function (pdf) of a standard normal distribution.
The MLEs of all the parameters can be obtained by maximizing the above likelihood, which leads to solving the following estimating equations
| (6) |
| (7) |
| (8) |
| (9) |
| (10) |
| (11) |
Before discussing computational issues of solving the above estimating equations, we note that the MLEs of the parameters in the mixed effects models are very sensitive to outliers in the longitudinal data (Sinha (2004), Wu (2009)). In other words, ignoring outliers in the data may lead to misleading parameter estimates. In the next section, we discuss robust estimates of the model parameters.
3. Robust Estimation.
3.1. A robust method.
For longitudinal data, there are two types of outliers: e-outliers (outlying observations within an individual) and b-outliers (outlying individuals). Since it can often be challenging to detect outliers in longitudinal data, robust methods which automatically accommodate outliers can be very useful. In the following, we focus on outliers in the continuous data of Y. The method can be conceptually extended to outliers in the binary data of Z, as shown in Sinha (2004). For potential outliers in the Y-data, we propose to handle e-outliers based on M-estimators and handle b-outliers based on heavy-tailed distributions such as the multivariate t-distribution. Since the longitudinal data of Y and Z exhibit similar between-individual variations, b-outliers in the Y-data may also imply b-outliers in the Z-data.
An M-estimator for e-outliers.
For e-outliers in the Y-data, we consider an M-estimator method to downweight the influence of outliers in the estimating equation (7) for estimating β, as in Sinha (2004). Note that the estimating equation (7) of β can be re-written as
We apply a bounded function ψ() (e.g., Huber’s function ψc(x) = max(−c, min(x, c)), c > 0, with turning point c) on the term and obtain the following robust estimating equation for β
| (12) |
where is a correction term to ensure Fisher consistency of an estimator of β. When ψ(x) = ψc(x) (Huber’s function), we may choose the value of c to reflect different degrees of robustness: the smaller the value of c, the more strongly large residuals are downweighted. When c = ∞, the robust M-estimator reduces to the usual MLE. Common choices of c are c = 1.5 or c = 2.
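To make the role of the turning point c concrete, the following minimal Python sketch evaluates Huber's function ψc(x) = max(−c, min(x, c)) on a few hypothetical standardized residuals. The residual values are made up for illustration and are not taken from the VAX004 data.

```python
def huber_psi(x, c):
    """Huber's function psi_c(x) = max(-c, min(x, c)) with turning point c > 0."""
    if c <= 0:
        raise ValueError("the turning point c must be positive")
    return max(-c, min(x, c))

# Hypothetical standardized residuals; the first and last mimic e-outliers.
residuals = [-4.0, -1.0, 0.3, 1.5, 6.0]

# With c = 1.5, values beyond +/-1.5 are clipped, bounding their influence.
clipped = [huber_psi(r, 1.5) for r in residuals]

# With c = infinity nothing is clipped, matching the non-robust (MLE) case.
unclipped = [huber_psi(r, float("inf")) for r in residuals]
```

Residuals inside [−c, c] pass through unchanged, so the robust and non-robust estimating equations differ only through the observations that fall beyond the turning point.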
Similarly, for the variance parameter σ² in the model for the Y-data, we may consider the following robust estimating equation
| (13) |
where .
A multivariate t-distribution for b-outliers.
The standard distributional assumption for the random effects is the multivariate normal distribution. When the variations between individuals suggest potential b-outliers, a multivariate t-distribution may be more desirable since it has heavier tails than the multivariate normal distribution. Thus, for robust inference, we assume that the random effects follow a multivariate t-distribution with degrees of freedom ν and covariance matrix Σ, i.e.,
We may then choose the value of the degrees of freedom ν to reflect different degrees of robustness: the smaller the value of ν, the heavier the tails of the t-distribution. When ν = ∞, the distribution reduces to the multivariate normal distribution N(0, Σ). With sufficient data, we may estimate ν from the data by maximum likelihood. For small samples, we may fix it a priori at a sensible value (often between 3 and 9) and perform a sensitivity analysis (Lange, Little and Taylor, 1989). According to Lucas (1997), the protection against outliers is preserved only if the degrees-of-freedom parameter is fixed. In the data analysis, we treat ν as fixed, use the profile likelihood plotted over ν to choose an appropriate value, and also perform a sensitivity analysis.
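The effect of ν on tail behaviour can be checked by simulation. The sketch below draws from a multivariate t-distribution through its standard representation as a normal vector divided by the square root of an independent scaled chi-square variable. It makes an assumption worth stating: Σ is interpreted as the covariance matrix, which requires ν > 2, so the scale matrix is set to Σ(ν − 2)/ν. If Σ is instead meant as the scale matrix, that rescaling should be dropped. The function names and the 2 × 2 matrix are hypothetical.

```python
import numpy as np

def sample_multivariate_t(n, cov, nu, rng):
    """Draw n zero-mean multivariate t vectors whose covariance matrix is `cov`."""
    if nu <= 2:
        raise ValueError("a finite covariance matrix requires nu > 2")
    cov = np.asarray(cov, dtype=float)
    dim = cov.shape[0]
    # Scale matrix chosen so that the resulting covariance equals `cov`.
    scale = cov * (nu - 2.0) / nu
    z = rng.multivariate_normal(np.zeros(dim), scale, size=n)
    w = rng.chisquare(nu, size=n) / nu
    return z / np.sqrt(w)[:, None]

def tail_fraction(draws, threshold=3.0):
    """Share of draws whose first coordinate exceeds `threshold` in absolute value."""
    return float(np.mean(np.abs(draws[:, 0]) > threshold))

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])  # illustrative covariance (assumption)
n = 200_000

normal_draws = rng.multivariate_normal(np.zeros(2), cov, size=n)
t_draws = sample_multivariate_t(n, cov, nu=3, rng=rng)

normal_tail = tail_fraction(normal_draws)
t_tail = tail_fraction(t_draws)
```

Even though both samples share the same covariance matrix, the t-distribution with ν = 3 places noticeably more mass beyond three standard deviations, which is what lets it accommodate outlying individuals.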
3.2. Asymptotic results.
We show some asymptotic results for the robust estimate of β, following similar proofs in the literature (e.g., Sinha (2004)). These asymptotic results can be used to obtain standard errors of the robust estimates in data analysis when the sample size is reasonably large. When the sample size is small, a bootstrap method may be used to obtain the standard errors of robust parameter estimates. For other parameter estimates, standard asymptotic results for MLEs may be used.
Let be a collection of all the longitudinal and survival response data of individual i, i = 1, 2, · · · , n. Let
and . Let be a function whose derivative function with respect to β is and . Then we have . Let .
Theorem.
Let be the robust estimator obtained by solving the estimating equation (12), and let β* be the true value of β. Under the regularity assumptions R1–R7 in the Supplementary Material (Yu et al., 2022), we have the following results:
(a) Consistency: the robust estimator converges to β* almost surely (a.s.) as n → ∞;
(b) Asymptotic normality:
where
and
The proofs are given in Section 1 of the Supplementary Material (Yu et al., 2022).
The above asymptotic normality result (b) can be used for inference of parameter β, such as confidence interval and hypothesis testing. Let ,, be the parameter estimates obtained by solving likelihood score equations (6), (8), and (9), respectively. Then these estimates have the usual properties of consistency and asymptotic normality of MLEs.
3.3. A computationally efficient approximate method.
The estimating equations in the previous sections involve intractable integrations with respect to the (unobserved) random effects bi = (b1i, b2i, b3i). The dimension of the random effects bi is usually high (at least 3), so it can be computationally challenging to solve these estimating equations, since the commonly used numerical integration methods or Monte Carlo methods can be computationally intensive. In the following, we propose an approximate method based on the h-likelihood method, which is computationally efficient (Lee and Nelder (1996)). As an attractive computational tool, the h-likelihood method has become increasingly popular for likelihood estimation of mixed effects models; see Lee, Nelder and Pawitan (2018) for a recent review. Essentially, it uses Laplace approximations to intractable integrals in the likelihoods for mixed effects models, with adjustments to reduce bias, which produce approximate maximum likelihood estimates (MLEs) for the mean parameters and approximate restricted MLEs (REMLs) for the variance-covariance (dispersion) parameters. It is particularly useful for likelihood estimation involving high-dimensional integrations with respect to the random effects, since it is computationally much more efficient than the corresponding Monte Carlo or numerical methods and it performs well in various settings. In this paper, we propose an h-likelihood method for robust inference in our joint model setting and evaluate its performance via simulations.
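The Laplace approximation underlying the h-likelihood method can be illustrated on a deliberately small problem. The sketch below is not the proposed joint-model procedure: it considers one individual, a single normally distributed random intercept, and a logistic model for eight hypothetical binary responses, and compares the Laplace approximation of the log marginal likelihood with a brute-force grid integration. All names and numbers are assumptions made for illustration.

```python
import numpy as np

def log_integrand(b, z, eta_fixed, sd_b):
    """log f(z | b) + log f(b) for a random-intercept logistic model."""
    eta = eta_fixed + b
    loglik = np.sum(z * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * (b / sd_b) ** 2 - np.log(sd_b * np.sqrt(2.0 * np.pi))
    return loglik + logprior

def laplace_log_marginal(z, eta_fixed, sd_b, n_iter=50):
    """Laplace approximation: expand the log integrand around its mode in b."""
    b = 0.0
    for _ in range(n_iter):  # Newton iterations for the mode
        p = 1.0 / (1.0 + np.exp(-(eta_fixed + b)))
        grad = np.sum(z - p) - b / sd_b**2
        hess = -np.sum(p * (1.0 - p)) - 1.0 / sd_b**2  # always negative
        b = b - grad / hess
    p = 1.0 / (1.0 + np.exp(-(eta_fixed + b)))
    neg_hess = np.sum(p * (1.0 - p)) + 1.0 / sd_b**2
    value = log_integrand(b, z, eta_fixed, sd_b) + 0.5 * np.log(2.0 * np.pi / neg_hess)
    return value, b

def grid_log_marginal(z, eta_fixed, sd_b, half_width=10.0, n=20001):
    """Reference value from trapezoid integration over a fine grid in b."""
    grid = np.linspace(-half_width, half_width, n)
    vals = np.array([np.exp(log_integrand(b, z, eta_fixed, sd_b)) for b in grid])
    step = grid[1] - grid[0]
    return np.log(np.sum(0.5 * (vals[1:] + vals[:-1])) * step)

# Hypothetical data for a single individual.
z = np.array([1, 1, 0, 1, 1, 0, 1, 1], dtype=float)
eta_fixed = np.array([-0.5, 0.0, 0.5, 1.0, -0.5, 0.0, 0.5, 1.0])

laplace_value, b_hat = laplace_log_marginal(z, eta_fixed, sd_b=1.0)
grid_value = grid_log_marginal(z, eta_fixed, sd_b=1.0)
```

The Laplace step needs only the mode and the curvature at the mode, whereas the grid needs thousands of function evaluations in one dimension and becomes infeasible as the dimension of the random effects grows. This is the computational advantage that motivates the approximate method.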
The log h-likelihood function based on the assumed models can be written as
| (14) |
where and similar notation for other variables. Let θ−β be the collection of all the mean parameters except for β. Beginning with some starting values (θ(0), ξ(0)), we propose the following iterative procedure: at iteration k (k = 1, 2, · · ·), we do the following steps
Step 1: Given θ(k) and ξ(k), obtain updated estimates of the random effects b(k+1) by maximizing ℓh with respect to b.
Step 2: Given the current estimates of the other parameters and of the random effects, obtain an updated robust estimate β(k+1) by solving the estimating equation (12), where the expectation is approximated using samples drawn by the Metropolis-Hastings algorithm.
Step 3: Given the current estimates of the other parameters and of the random effects, obtain updated estimates of θ−β by maximizing the following adjusted profile h-likelihood with respect to θ−β:
where H(ℓh, b) = −∂²ℓh/(∂b ∂bT).
Step 4: Given the current estimates of the mean parameters and of the random effects, obtain an updated estimate σ(k+1) by solving the estimating equation (13), where the expectation is approximated using samples drawn by the Metropolis-Hastings algorithm. Then, obtain the estimates of the rest of the parameters in ξ (denoted as ξ−σ) by maximizing the following adjusted profile h-likelihood with respect to ξ−σ,
Step 5: Given the current estimates of the parameters and of the random effects, obtain an updated nonparametric estimate of the baseline hazard as follows
where I(·) is an indicator function. If the survival data are modelled by a parametric model such as a Weibull model, the parameter estimates can be obtained in a similar way as in the previous steps.
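The nonparametric baseline hazard update in Step 5 can be sketched in code. The exact expression is not reproduced here, so the sketch assumes the standard Breslow-type form, in which the hazard jump at an event time t is the number of events at t divided by the sum of exp(linear predictor) over subjects still at risk at t. The function name and the toy data are hypothetical.

```python
import numpy as np

def breslow_baseline_hazard(times, events, linear_predictor):
    """Return distinct event times and Breslow-type baseline hazard jumps."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    risk_score = np.exp(np.asarray(linear_predictor, dtype=float))
    event_times = np.unique(times[events == 1])
    jumps = []
    for t in event_times:
        n_events = np.sum((times == t) & (events == 1))  # sum_i delta_i I(s_i = t)
        at_risk = np.sum(risk_score[times >= t])          # sum_i exp(lp_i) I(s_i >= t)
        jumps.append(n_events / at_risk)
    return event_times, np.array(jumps)

# Toy survival data: (s_i, delta_i) with all linear predictors equal to zero.
s = [2.0, 3.0, 3.0, 5.0]
delta = [1, 0, 1, 1]
lp = [0.0, 0.0, 0.0, 0.0]

event_times, hazard_jumps = breslow_baseline_hazard(s, delta, lp)
```

In the joint model, the linear predictor for subject i would combine the baseline covariates and the current random effects estimates, so the baseline hazard is refreshed after each update of the other parameters.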
Iterating the above Steps 1–5 until convergence, we obtain the approximate estimates of the parameters and random effects. After convergence, we may estimate the covariance matrices of the parameter estimates by calculating the derivatives of the estimating equations (6), (12), (8), and (9),
The estimated standard errors of the parameter estimates are the square roots of the diagonal elements of the covariance matrices.
4. Analysis of HIV Vaccine Data.
4.1. The Data.
In this section, we return to the VAX004 phase III trial dataset briefly described in Section 1. The dataset contains 194 participants in total. By the end of the study, 21 (10.8%) subjects had acquired HIV infection during the follow-up, and the times to infection diagnosis range from 43 to 954 days. A number of longitudinal variables are collected throughout the study, but most of them are highly correlated, with correlations over 0.9. Thus, we choose two representative variables of primary interest: NAb and MNGNE8. The variable MNGNE8 is dichotomized based on its median value, as done in Yu, Wu and Gilbert (2018).
From Figure 1, we see that data on the longitudinal variables exhibit similar periodic patterns, due to repeated administrations of the vaccine. Moreover, variable NAb is left-censored with a lower limit of quantification (LLOQ) of 1.477 and censoring rate of 27%. The observed NAb data range from 1.79 to 4.82 with a mean of 3.13 and a standard deviation of 0.59. Figure 2 shows the longitudinal trajectories of NAb of some selected participants, with left-censoring and possible outliers. The two points indicated in blue boxes in Figure 2 are examples of possible outliers, because they do not appear to follow the general patterns of sawtooth waves. In HIV vaccine trials, the values of an immune response typically increase sharply right after each vaccination, and then start to decrease about two weeks after the vaccination. So the point outlier 1, measured about 23 weeks after vaccination, is supposed to be below the previous measurement point, and the point outlier 2, measured about 2 weeks after vaccination, is supposed to be above the previous measurement point. Thus, these two points are possible e-outliers in the data. As discussed in Section 1, outliers are often model-dependent and may not always be easily detected. Thus, robust methods such as the M-estimators and t-distributions are useful since they automatically incorporate potential outliers. For longitudinal data, e-outliers refer to outlying observations among within-individual repeated measurements, while b-outliers refer to outlying individuals in the sample. As exploratory analysis, some outliers may be detected by plotting standardized residuals and estimated random effects (Zheng, Fung and Zhu, 2013; Koller, 2016), or by inspecting the distributions of the Mahalanobis distances based on the estimates (Gill, 2000; Waternaux, Laird and Ware, 1989; Copt and Victoria-Feser, 2006).
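As a rough illustration of the exploratory check based on Mahalanobis distances, the sketch below computes squared distances of estimated random effects from zero and flags individuals whose distance exceeds a chi-square cutoff. The estimates, the identity covariance matrix, and the 0.975 cutoff are all hypothetical. Comparing estimated random effects with a chi-square quantile is a common convention rather than something prescribed by the cited references, and it ignores shrinkage in the estimates.

```python
import math
import numpy as np

def squared_mahalanobis(random_effects, cov):
    """Squared Mahalanobis distance b_i^T cov^{-1} b_i for each row b_i."""
    b = np.asarray(random_effects, dtype=float)
    solved = np.linalg.solve(np.asarray(cov, dtype=float), b.T)  # shape (q, n)
    return np.sum(b * solved.T, axis=1)

def chi2_cutoff_two_dims(prob):
    """Quantile of the chi-square distribution with 2 df: -2 * log(1 - prob)."""
    return -2.0 * math.log(1.0 - prob)

# Hypothetical estimated random effects for three individuals (two effects each).
b_hat = [[0.1, -0.2], [0.0, 0.3], [3.0, -3.0]]
sigma_hat = np.eye(2)  # hypothetical estimated covariance matrix

distances = squared_mahalanobis(b_hat, sigma_hat)
cutoff = chi2_cutoff_two_dims(0.975)
flagged = distances > cutoff
```

Only the third individual is flagged here. A check of this kind is a screening aid, and the robust method remains useful precisely because such screening may miss model-dependent outliers.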
Fig 2:

Longitudinal trajectories of variable NAb of some selected participants, where the left-censored values were substituted by LLOQ of 1.477 and two possible outliers were indicated in boxes.
A main objective of the study is to investigate possible association between the risk of HIV infection and the immune response longitudinal measurements. From Figure 1, it appears that such association may exist, but it needs to be confirmed by modelling the data. Thus, a joint model for the longitudinal data and the survival (time-to-infection) data is desirable. A main challenge in the data analysis is to handle outliers and left censoring in the longitudinal data simultaneously. In the following, we analyze the dataset based on the proposed joint model and the estimation method.
4.2. The models.
Let Zij be the dichotomized MNGNE8 data such that Zij = 1 if the MNGNE8 value of individual i at time tij is larger than its sample median 0.57 and Zij = 0 otherwise (Yu, Wu and Gilbert (2018)). Let Yij be the observed (uncensored) NAb value (i.e. Yij > LLOQ), and let Cij be the left-censoring indicator of NAb such that Cij = 1 if NAb is left-censored and Cij = 0 otherwise.
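The three responses can be derived from raw visit records as in the following sketch, which uses the LLOQ of 1.477 and the sample median 0.57 quoted in the text. The record layout and function name are hypothetical, and the sketch assumes that a left-censored NAb value is stored as the LLOQ itself, so values at or below the LLOQ are treated as censored.

```python
LLOQ_NAB = 1.477       # lower limit of quantification for NAb (from the text)
MEDIAN_MNGNE8 = 0.57   # sample median of MNGNE8 (from the text)

def encode_visit(nab, mngne8, lloq=LLOQ_NAB, median=MEDIAN_MNGNE8):
    """Return (y, c, z) for one visit of one participant."""
    c = 1 if nab <= lloq else 0          # assumption: censored values are stored at the LLOQ
    z = 1 if mngne8 > median else 0      # Z = 1 when MNGNE8 exceeds its sample median
    y = None if c == 1 else nab          # Y is only defined for uncensored NAb values
    return y, c, z

# Hypothetical visits: (NAb, MNGNE8).
visits = [(3.20, 0.80), (1.477, 0.40), (2.10, 0.57)]
encoded = [encode_visit(nab, mngne8) for nab, mngne8 in visits]
```

Keeping Y undefined for censored visits mirrors the modelling choice of not assuming a distribution for values below the LLOQ.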
As we can see from Figure 1, the longitudinal trajectories of Y and Z are closely associated with time variables, which may be defined in different ways, such as scheduled measurement times, vaccination times, times since most recent vaccinations, and time intervals between two consecutive vaccinations. Thus, in modelling these longitudinal data, we may treat these time variables as covariates. More specifically, we define the following time variables (in months): (i) sampledays, denoted by tij, which is the measurement time of individual i at j-th measurement; (ii) the vaccination time of individual i at k-th vaccination, denoted by ; (iii) dosedays, denoted by , which is the time from the most recent vaccination to the current measurement time tij, i.e., where ; and (iv) the time between two consecutive vaccinations within which tij lies, denoted by Δij, i.e., with . For measurement time tij after the final vaccination, we define Δij as the time between the final vaccination and the final measurement time. In data analysis, to avoid too small or too large parameter estimates, we re-scale some time variables to different time units (year/week): (in years) and (in weeks).
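The time variables defined above can be computed from a vaccination schedule as in the sketch below. It assumes the scheduled immunization months 0, 1, 6, 12, 18, 24, 30 from Section 1 rather than individual-specific vaccination times, treats a measurement taken exactly at a vaccination time as belonging to the interval that starts at that vaccination, and works in months without the year/week rescaling. The function name and example times are hypothetical.

```python
from bisect import bisect_right

SCHEDULE_MONTHS = [0.0, 1.0, 6.0, 12.0, 18.0, 24.0, 30.0]  # scheduled immunizations

def time_covariates(t, vaccination_times, final_measurement_time):
    """Return (time since most recent vaccination, length of the enclosing interval)."""
    idx = bisect_right(vaccination_times, t) - 1
    if idx < 0:
        raise ValueError("measurement time precedes the first vaccination")
    last_dose = vaccination_times[idx]
    time_since_dose = t - last_dose
    if idx + 1 < len(vaccination_times):
        # Interval between the two consecutive vaccinations that enclose t.
        interval = vaccination_times[idx + 1] - last_dose
    else:
        # After the final vaccination: time from it to the final measurement.
        interval = final_measurement_time - last_dose
    return time_since_dose, interval

between_doses = time_covariates(6.5, SCHEDULE_MONTHS, final_measurement_time=36.0)
after_last_dose = time_covariates(31.0, SCHEDULE_MONTHS, final_measurement_time=36.0)
```

For a measurement at month 6.5, the most recent vaccination is at month 6 and the enclosing interval runs from month 6 to month 12; for a measurement at month 31, the interval is taken to end at the final measurement time.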
Next, we consider several mixed effects models for modelling the longitudinal biomarker data, due to the large between-individual variations. Since the longitudinal measurement schedules are infrequent in this study, nonparametric or semiparametric mixed effects models may not be suitable as they require more frequent measurements. As shown in Figure 1, the longitudinal biomarker data exhibit clear periodic patterns due to repeated vaccinations. Thus, we may use simple periodic functions, such as the sin() function, to partially capture these periodic patterns. For the remaining variations in the data, we consider linear and quadratic functions of time and choose the models using standard model selection criteria such as AIC or BIC values and likelihood ratio test. For models which are similar based on these criteria, we choose the ones which are simpler and make more scientific sense. As noted in the previous paragraph, to simplify the presentations of the models, we use different time variables tij, , , , and Δij – they are illustrated in Figure 1 in the Supplementary Material (Yu et al., 2022). The details of the model selections and the goodness-of-fit of the models are given in Sections 2–3 of the Supplementary Material (Yu et al., 2022). As a result, we obtain the following final models for the longitudinal biomarker data:
| (15) |
| (16) |
| (17) |
where the dk’s are dispersion/variance parameters, with identification conditions d1 > 0, d2 > 0, and ϵij is an error term with ϵij ~ N(0, σ²). In the above models, the random effect b1i in the Z-model also appears in the models for Y and C since the three longitudinal trajectories exhibit very similar between-individual variations and thus have closely related random effects. We assume that the random effects
with degree of freedom ν and covariance matrix , |r12| < 1.
The random effects bi in the above longitudinal models represent individual-specific characteristics of the longitudinal trajectories. We are interested in investigating whether such individual-specific characteristics are associated with the risk of HIV infection. Therefore, for the survival data, we consider the following Cox model
| $h_i(t) = h_0(t)\exp(\lambda_0 x_i + \lambda_1 b_{1i} + \lambda_2 b_{2i})$ | (18) |
where h0(t) is an unspecified baseline hazard, xi is the standardized value of the immune response GNE8_CD4 of individual i at the initial visit, and the random effects bi are used as predictors in the survival model. Statistically significant estimates of λ1 and λ2 indicate a possible association between the longitudinal processes and the risk of HIV infection.
4.3. Data analysis results.
The longitudinal and survival models (15)–(18) are linked through shared random effects. We use the proposed joint model and the robust estimation method to estimate the model parameters simultaneously, accommodating outliers and left-censored values. We choose the values of the turning point c in the M-estimator and the degrees-of-freedom parameter ν in the t-distribution based on a plot of the profile likelihood. The profile log-likelihood plot of ν given different values of c is shown in Figure 2 in the Supplementary Material (Yu et al., 2022). The turning points c = 1.5 and c = 2 return very similar approximate log-likelihood values, with the latter slightly better than the former. Given a fixed c, the approximate log-likelihood generally increases with the degrees-of-freedom parameter ν. This implies that there may be few or no b-outliers, or that the impact of b-outliers is minor. The combination of c = 2 and ν = ∞ (i.e., bi ~ N(0, Σ)) gives the largest approximate log-likelihood among all combinations considered.
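To see what the turning point c controls, consider a Huber-type ψ function, one common choice for M-estimators. Whether the proposed method uses exactly this form is an assumption made here for illustration, and the function name `huber_psi` is hypothetical. The sketch shows that residuals within ±c pass through unchanged, larger residuals are clipped to ±c, and c = ∞ recovers the non-robust case.

```python
import math


def huber_psi(r, c):
    """Huber-type psi: identity for |r| <= c, clipped to +/-c beyond the turning point."""
    if abs(r) <= c:
        return r
    return math.copysign(c, r)


# Standardized residuals, the last one being a gross outlier.
residuals = [0.3, -1.2, 6.0]
for c in (1.5, 2.0, math.inf):
    # With c = inf nothing is clipped, so the outlier keeps its full influence.
    print(c, [huber_psi(r, c) for r in residuals])
```

A smaller c bounds the influence of large residuals more aggressively, at the cost of some efficiency when the data contain no outliers.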
Table 1 shows the parameter estimation results based on the non-robust (NR) method and the robust method with c = 2 and ν = ∞ (R1), c = ∞ and ν = 3 (R2), and c = 2 and ν = 3 (R3), respectively. One of the main objectives of this analysis is to check whether the parameter estimates and their standard errors are affected by the potential outliers in the longitudinal data. We see that the robust and non-robust methods produce somewhat similar estimates and standard errors, with relatively minor differences between them. This suggests that the impact of the potential outliers on the parameter estimates may be small in this setting. Thus, we can be reasonably confident about these results and the implied conclusions. Note that the key parameters of interest here are the λk’s, since they link the risk of HIV infection to the random effects from the longitudinal models, which reflect the individual-specific characteristics of the longitudinal processes. The negative estimates of λ1 suggest that larger values of NAb (Y) and/or a larger probability of MNGNE8 (Z) being above its median are associated with a lower risk of HIV infection. The positive estimates of λ2 suggest that a larger increasing trend in MNGNE8 over time is associated with a higher risk of HIV infection.
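Because the survival model is a Cox model, each λk can be read on the hazard-ratio scale: exp(λk) is the multiplicative change in the hazard per one-unit increase in the corresponding random effect. The sketch below converts the R1 estimate of λ1 from Table 1 (−1.07, standard error 0.29) into a hazard ratio with a Wald-type 95% interval. The Wald interval and the helper name `hazard_ratio` are assumptions for illustration; if the standard errors of the survival parameters are under-estimated, as discussed in Section 5.2, such an interval would be too narrow.

```python
import math


def hazard_ratio(estimate, std_error, z=1.96):
    """Return (HR, lower, upper) from a log-hazard coefficient and its standard error."""
    hr = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return hr, lower, upper


# R1 estimate of lambda_1 and its standard error, as reported in Table 1.
hr, lower, upper = hazard_ratio(-1.07, 0.29)
print(f"HR = {hr:.3f}, 95% Wald interval = ({lower:.3f}, {upper:.3f})")
```

A hazard ratio below 1 corresponds to the negative sign of λ1 discussed above.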
Table 1.
Parameter estimates based on robust and non-robust methods.
| Model | Par. | Estimate NR | Estimate R1 | Estimate R2 | Estimate R3 | Std.Error NR | Std.Error R1 | Std.Error R2 | Std.Error R3 | P-value NR | P-value R1 | P-value R2 | P-value R3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Estimates of mean parameters | | | | | | | | | | | | | |
| Model (15) | α0 | −1.55 | −1.62 | −1.59 | −1.59 | 0.10 | 0.12 | 0.10 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 |
| | α1 | 0.11 | 0.12 | 0.11 | 0.11 | 0.03 | 0.04 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| | α2 | 1.73 | 1.81 | 1.79 | 1.79 | 0.19 | 0.20 | 0.19 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 |
| | α3 | −0.05 | −0.05 | −0.05 | −0.05 | 0.01 | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| Model (16) | β0 | 2.42 | 2.38 | 2.42 | 2.37 | 0.04 | 0.04 | 0.04 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 |
| | β1 | 0.85 | 0.89 | 0.84 | 0.89 | 0.05 | 0.06 | 0.05 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 |
| | β2 | −0.28 | −0.29 | −0.28 | −0.29 | 0.02 | 0.02 | 0.02 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
| | β3 | 1.34 | 1.47 | 1.34 | 1.46 | 0.07 | 0.08 | 0.07 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 |
| | β4 | 0.20 | 0.21 | 0.22 | 0.22 | 0.01 | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| Model (17) | η0 | 2.03 | 1.94 | 1.95 | 1.95 | 0.14 | 0.14 | 0.14 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 |
| | η1 | −6.56 | −6.23 | −6.25 | −6.26 | 0.30 | 0.29 | 0.29 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 |
| | η2 | 1.73 | 1.64 | 1.65 | 1.65 | 0.10 | 0.10 | 0.10 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 |
| | η3 | −0.67 | −0.64 | −0.66 | −0.66 | 0.22 | 0.22 | 0.22 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 |
| Model (18) | λ0 | −0.71 | −0.70 | −0.74 | −0.75 | 0.24 | 0.23 | 0.23 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 |
| | λ1 | −0.91 | −1.07 | −1.41 | −1.40 | 0.29 | 0.29 | 0.29 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 |
| | λ2 | 1.64 | 1.60 | 1.92 | 1.94 | 0.26 | 0.24 | 0.27 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 |
| Estimates of dispersion parameters | | | | | | | | | | | | | |
| | σ | 0.50 | 0.49 | 0.50 | 0.49 | | | | | | | | |
| | d1 | 0.35 | 0.29 | 0.23 | 0.23 | | | | | | | | |
| | d2 | 0.05 | 0.06 | 0.07 | 0.07 | | | | | | | | |
| | d4 | −0.88 | −0.79 | −0.75 | −0.75 | | | | | | | | |
| | d3 | 0.20 | 0.21 | 0.22 | 0.22 | | | | | | | | |
| | d12 | 0.52 | 0.50 | 0.61 | 0.61 | | | | | | | | |
NR: non-robust method with c = ∞ and (b1i, b2i)T ∼ N(0, Σ).
R1: robust method with c = 2 and (b1i, b2i)T ∼ N(0, Σ).
R2: robust method with c = ∞ and (b1i, b2i)T ∼ t3(0, Σ).
R3: robust method with c = 2 and (b1i, b2i)T ∼ t3(0, Σ).
We also conducted model diagnostics and present the results in Section 3 of the Supplementary Material (Yu et al., 2022). For the linear mixed effects model of NAb, we checked the goodness-of-fit using plots of fitted versus observed values. For the logistic mixed effects models of C and MNGNE8ind, we plotted their ROC curves. For the survival model, we plotted the Kaplan-Meier estimates against the fitted survival curves, together with the Schoenfeld residuals. All these plots can be found in the Supplementary Material (Yu et al., 2022). Overall, the proposed models fit the data reasonably well without major concerns.
5. Simulation Studies.
In this section, we evaluate the proposed robust method and h-likelihood method via comprehensive simulation studies. The models and the true parameter values are chosen to be similar to those in the real data analysis (see Section 4.2). However, different percentages of two types of outliers are randomly created, and the proposed robust methods with various degrees of robustness are used to fit the data to evaluate and compare their performances. The details are given below.
5.1. Simulation design.
We simulate the data as follows. Given random effects (b1i, b2i)T ~ N(0, Σ), we generate binary longitudinal data Zij and Cij from Bernoulli distributions with probabilities
respectively, where
When Cij = 0 (i.e., when Yij is not left censored or Yij ≥ d), we generate Yij from a truncated normal distribution, i.e. with a lower bound of d, where
We generate the survival times Ti from a Weibull distribution with shape parameter 15 and scale parameter 800 exp(λ0xi + λ1b1i + λ2b2i)^(−1/15), where xi is a baseline covariate generated from N(0, 1). Moreover, we generate the non-informative right-censoring times from a Weibull distribution with shape parameter 5 and scale parameter 1000, in order to achieve right censoring rates of the survival times similar to those in the real data. The observed survival data are (si, δi), where si is the smaller of the survival time Ti and the censoring time, and δi is the event indicator (δi = 1 if the event is observed and δi = 0 otherwise). The true parameter values in all the models are set as α = (−1.65, 0.15, 1.8, −0.05)T, β = (2, 1, −0.3, 1.5)T, η = (1, −3, 0.9, −3.5)T, λ = (−0.75, −1.5, 2)T, (d1, d2, d3, d4) = (0.4, 0.15, −1.5, 0.5), d = 1.5, σ = 0.5, and .
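The survival part of this design can be sketched with the Python standard library. With shape k = 15 and scale 800 exp(η)^(−1/k), where η = λ0xi + λ1b1i + λ2b2i, the Weibull hazard is k t^(k−1) exp(η) / 800^k, so the λ's act as log hazard ratios, matching the proportional hazards form of the survival model. The function names, and the use of the minimum of event and censoring times as the observed time, are assumptions of this sketch. Note that `random.weibullvariate(alpha, beta)` takes the scale first and the shape second.

```python
import math
import random

SHAPE_EVENT, SCALE_BASE = 15.0, 800.0  # event-time Weibull settings from the text
SHAPE_CENS, SCALE_CENS = 5.0, 1000.0   # censoring-time Weibull settings from the text
TRUE_LAMBDA = (-0.75, -1.5, 2.0)       # true (lambda_0, lambda_1, lambda_2)


def event_scale(linear_predictor):
    """Weibull scale 800 * exp(lp)^(-1/15) for the event time."""
    return SCALE_BASE * math.exp(linear_predictor) ** (-1.0 / SHAPE_EVENT)


def simulate_survival(x_i, b1_i, b2_i, lam=TRUE_LAMBDA, rng=random):
    """Draw one observed pair (s_i, delta_i) for a single individual."""
    lp = lam[0] * x_i + lam[1] * b1_i + lam[2] * b2_i
    t_event = rng.weibullvariate(event_scale(lp), SHAPE_EVENT)
    t_cens = rng.weibullvariate(SCALE_CENS, SHAPE_CENS)
    s_i = min(t_event, t_cens)                # observed time
    delta_i = 1 if t_event <= t_cens else 0   # 1 = event observed
    return s_i, delta_i


rng = random.Random(2022)
sample = [simulate_survival(rng.gauss(0.0, 1.0), 0.0, 0.0, rng=rng) for _ in range(5)]
print(sample)
```

A larger linear predictor shrinks the scale and therefore shortens the expected event time, which is the direction one expects for a higher hazard.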
To create outliers in Y, we consider the following three scenarios: (i) Scenario 1 (e-outliers): We randomly select a subset (say 5%) of the simulated data Y and add a large number (say ±5 * σ) to the original values; (ii) Scenario 2 (b-outliers): We randomly select a subset (say 5%) of individuals and add a large number (say ±10) to the generated random effects b1i’s; (iii) Scenario 3 (e-outliers + b-outliers): We add large numbers to a set (say 2.5%) of the observations of Y and to a set (say 2.5%) of the random effects b1i’s.
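The contamination schemes can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the sign of each shift is drawn with equal probability, and the contaminated fraction is rounded to the nearest whole number of observations or individuals. In a full simulation the b-outliers would be injected before the longitudinal and survival data are generated from the random effects.

```python
import random


def add_e_outliers(y, sigma, frac=0.05, mult=5.0, rng=random):
    """Scenario 1: shift a random fraction of observations by +/- mult * sigma."""
    y_out = list(y)
    n_out = round(frac * len(y_out))
    for idx in rng.sample(range(len(y_out)), n_out):
        y_out[idx] += rng.choice((-1.0, 1.0)) * mult * sigma
    return y_out


def add_b_outliers(b1, frac=0.05, shift=10.0, rng=random):
    """Scenario 2: shift the random effect b1i of a random fraction of individuals."""
    b_out = list(b1)
    n_out = round(frac * len(b_out))
    for idx in rng.sample(range(len(b_out)), n_out):
        b_out[idx] += rng.choice((-1.0, 1.0)) * shift
    return b_out


rng = random.Random(0)
y = [rng.gauss(2.0, 0.5) for _ in range(100)]   # toy measurements
b1 = [rng.gauss(0.0, 1.0) for _ in range(200)]  # toy random effects
y_contaminated = add_e_outliers(y, sigma=0.5, rng=rng)
b1_contaminated = add_b_outliers(b1, rng=rng)
print(sum(a != b for a, b in zip(y, y_contaminated)),
      sum(a != b for a, b in zip(b1, b1_contaminated)))
```

Scenario 3 would combine the two helpers with `frac=0.025` each.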
We evaluate the parameter estimates by relative bias (rBias) and relative root mean square error (rRMSE):
relative bias (%) of $\hat{\theta}$: $\left|\frac{1}{M}\sum_{m=1}^{M}\frac{\hat{\theta}^{(m)}-\theta}{\theta}\right|\times 100\%$,
relative RMSE (%) of $\hat{\theta}$: $\frac{1}{|\theta|}\sqrt{\frac{1}{M}\sum_{m=1}^{M}\left(\hat{\theta}^{(m)}-\theta\right)^{2}}\times 100\%$,
where $\hat{\theta}^{(m)}$ is the estimate of θ in simulation replication m, M = 100 is the number of replications in all the simulations, and θ is the true parameter value. To check whether the standard error formulas based on the h-likelihood method given at the end of Section 3.3 are reliable, we compute two types of standard errors for each parameter estimate: the average of the standard errors based on the h-likelihood method over the simulation replications (denoted by SE), and the sample standard error of the estimates over the simulation replications, computed without using the h-likelihood formula (denoted by SSE). If the SE’s from the h-likelihood method are reliable, they should be close to the corresponding SSE’s. Finally, we report the estimated coverage probabilities of the 95% confidence intervals for the corresponding parameters using the h-likelihood-based SE’s.
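The evaluation criteria can be computed from the M replicate estimates and their model-based standard errors as in the sketch below. It assumes the usual definitions: the relative bias is the absolute value of the average relative error, the relative RMSE is the root mean square error divided by |θ|, and the confidence intervals are Wald intervals of the form estimate ± 1.96 × SE. The function names and the toy numbers are illustrative only.

```python
import math
import statistics


def rbias_pct(estimates, theta):
    """Relative bias (%): absolute value of the mean relative error."""
    mean_rel = sum((e - theta) / theta for e in estimates) / len(estimates)
    return abs(mean_rel) * 100.0


def rrmse_pct(estimates, theta):
    """Relative root mean square error (%)."""
    mse = sum((e - theta) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / abs(theta) * 100.0


def se_and_sse(estimates, std_errors):
    """SE: average model-based standard error; SSE: sample SD of the estimates."""
    return statistics.mean(std_errors), statistics.stdev(estimates)


def coverage_pct(estimates, std_errors, theta, z=1.96):
    """Share (%) of Wald intervals estimate +/- z * SE that contain theta."""
    hits = sum(1 for e, se in zip(estimates, std_errors) if abs(e - theta) <= z * se)
    return 100.0 * hits / len(estimates)


# Toy replicate estimates of a parameter whose true value is 1.0.
est = [1.1, 0.9, 1.2]
ses = [0.1, 0.1, 0.05]
print(rbias_pct(est, 1.0), rrmse_pct(est, 1.0))
print(se_and_sse(est, ses), coverage_pct(est, ses, 1.0))
```

When the average SE is noticeably smaller than the SSE, the Wald intervals are too narrow and the coverage falls below the nominal 95%.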
5.2. Simulation results.
Tables 2 – 4 provide the main simulation results: Table 2 summarizes the simulation results when the random effects b1i’s contain 5% outliers by adding ±10; Table 3 summarizes the simulation results when there are 5% e-outliers in the Y -data by adding ±5σ; Table 4 summarizes the simulation results when there are 2.5% outliers in the random effects b1i’s by adding ±5 and 2.5% e-outliers in the measurements of Y by adding ±5σ. The overall censoring rates for Y are about 13.5%, 15%, and 14.5% for Tables 2 – 4 respectively. More simulation results with different percentages of outliers can be found in Section 4 of the Supplementary Material (Yu et al., 2022). As noted earlier, in the proposed joint model, the key parameters of interest are λ1 and λ2 since they indicate if the individual longitudinal trajectories are associated with risk of HIV infection, which is our main objective. Thus, in the following we focus on these two parameters in the discussion of simulation results.
Table 2.
Simulation results with 5% b-outliers.
| Par | True value | Estimate NR | Estimate R1 | Estimate R2 | SE NR | SE R1 | SE R2 (SE*) | SSE NR | SSE R1 | SSE R2 | rBias% NR | rBias% R1 | rBias% R2 | rRMSE% NR | rRMSE% R1 | rRMSE% R2 | CP% NR | CP% R1 | CP% R2 (CP*) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α 0 | −1.65 | −1.70 | −1.70 | −1.69 | .09 | .09 | .09 (.10) | .09 | .09 | .09 | 2.8 | 2.9 | 2.3 | 6 | 6 | 6 | 90 | 89 | 93 (94) |
| α 1 | 0.15 | 0.16 | 0.16 | 0.15 | .01 | .01 | .01 (.02) | .01 | .01 | .01 | 5.1 | 4.4 | 3.3 | 7 | 7 | 6 | 75 | 88 | 92 (100) |
| α 2 | 1.80 | 1.85 | 1.84 | 1.84 | .11 | .11 | .11 (.11) | .12 | .12 | .11 | 2.7 | 2.5 | 2.1 | 7 | 7 | 7 | 93 | 92 | 93 (93) |
| α 3 | −0.05 | −0.05 | −0.05 | −0.05 | .01 | .01 | .01 (.01) | .01 | .01 | .00 | 5.0 | 4.2 | 3.5 | 11 | 11 | 10 | 95 | 96 | 97 (96) |
| β 0 | 2.00 | 2.01 | 2.00 | 2.00 | .03 | .03 | .03 (.05) | .03 | .03 | .03 | 0.3 | 0.1 | 0.1 | 1 | 1 | 1 | 95 | 95 | 94 (100) |
| β 1 | 1.00 | 1.00 | 1.00 | 1.00 | .04 | .04 | .04 (.04) | .04 | .04 | .04 | 0.2 | 0.1 | 0.1 | 4 | 4 | 4 | 95 | 96 | 95 (95) |
| β 2 | −0.30 | −0.30 | −0.30 | −0.30 | .02 | .02 | .02 (.02) | .02 | .02 | .02 | 0.4 | 0.3 | 0.1 | 6 | 6 | 6 | 94 | 95 | 94 (95) |
| β 3 | 1.50 | 1.50 | 1.50 | 1.50 | .02 | .02 | .02 (.02) | .03 | .03 | .03 | 0.1 | 0.1 | .04 | 2 | 2 | 2 | 89 | 90 | 91 (88) |
| η 0 | 1.00 | 0.97 | 0.98 | 0.98 | .12 | .12 | .12 (.19) | .11 | .12 | .11 | 3.2 | 1.8 | 1.7 | 12 | 12 | 11 | 95 | 96 | 98 (100) |
| η 1 | −3.00 | −3.00 | −2.99 | −2.97 | .25 | .25 | .25 (.25) | .22 | .22 | .21 | 0.1 | 0.4 | 1.0 | 7 | 7 | 7 | 98 | 98 | 98 (98) |
| η 2 | 0.90 | 0.89 | 0.89 | 0.88 | .12 | .12 | .12 (.11) | .10 | .10 | .10 | 0.8 | 1.0 | 1.8 | 11 | 11 | 11 | 98 | 97 | 97 (96) |
| η 3 | −3.50 | −3.53 | −3.52 | −3.53 | .16 | .16 | .16 (.16) | .14 | .14 | .14 | 0.9 | 0.6 | 0.7 | 4 | 4 | 4 | 96 | 96 | 96 (96) |
| λ 0 | −0.75 | −0.77 | −0.75 | −0.73 | .14 | .14 | .14 (.17) | .16 | .16 | .16 | 2.1 | 0.2 | 2.7 | 22 | 21 | 22 | 90 | 91 | 92 (95) |
| λ 1 | −1.50 | −1.59 | −1.54 | −1.45 | .22 | .20 | .19 (.36) | .31 | .35 | .33 | 5.9 | 2.5 | 3.1 | 22 | 23 | 22 | 77 | 75 | 73 (93) |
| λ 2 | 2.00 | 2.14 | 2.09 | 2.01 | .19 | .18 | .17 (.28) | .30 | .31 | .29 | 7.0 | 4.4 | 0.3 | 17 | 16 | 14 | 71 | 76 | 78 (93) |
NR: non-robust method with bi ∼ N(0, Σ) and c = ∞.
R1: robust method with bi ∼ t5(0, Σ) and c = ∞.
R2: robust method with bi ∼ t3(0, Σ) and c = ∞.
SE: average standard error. SSE: sample standard error; CP: coverage probability.
SE* and CP*: average standard error and coverage probability based on the nonparametric bootstrap method respectively.
Table 4.
Simulation results with 2.5% b-outliers and 2.5% e-outliers.
| Par | True value | Estimate NR | Estimate R1 | Estimate R2 | Estimate R3 | SE NR | SE R1 | SE R2 | SE R3 | SSE NR | SSE R1 | SSE R2 | SSE R3 | rBias% NR | rBias% R1 | rBias% R2 | rBias% R3 | rRMSE% NR | rRMSE% R1 | rRMSE% R2 | rRMSE% R3 | CP% NR | CP% R1 | CP% R2 | CP% R3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α 0 | −1.65 | −1.68 | −1.68 | −1.69 | −1.69 | .09 | .09 | .09 | .09 | .09 | .09 | .09 | .09 | 2.0 | 2.1 | 2.2 | 2.2 | 6 | 6 | 6 | 6 | 97 | 95 | 94 | 93 |
| α 1 | 0.15 | 0.16 | 0.15 | 0.15 | 0.15 | .01 | .01 | .01 | .01 | .01 | .01 | .01 | .01 | 3.8 | 2.8 | 3.0 | 2.2 | 6 | 6 | 6 | 5 | 93 | 96 | 95 | 97 |
| α 2 | 1.80 | 1.84 | 1.83 | 1.83 | 1.84 | .11 | .11 | .11 | .11 | .11 | .11 | .11 | .11 | 2.4 | 1.9 | 1.9 | 2.1 | 7 | 6 | 6 | 7 | 96 | 94 | 93 | 93 |
| α 3 | −0.05 | −0.05 | −0.05 | −0.05 | −0.05 | .01 | .01 | .01 | .01 | .01 | .01 | .01 | .01 | 5.6 | 2.2 | 2.3 | 2.4 | 13 | 12 | 12 | 12 | 90 | 92 | 93 | 92 |
| β 0 | 2.00 | 1.99 | 1.98 | 1.98 | 1.97 | .03 | .03 | .03 | .03 | .03 | .03 | .03 | .03 | 0.6 | 1.2 | 0.8 | 1.4 | 2 | 2 | 2 | 2 | 95 | 87 | 94 | 87 |
| β 1 | 1.00 | 1.01 | 1.02 | 1.01 | 1.02 | .05 | .05 | .05 | .05 | .05 | .05 | .05 | .05 | 1.0 | 1.5 | 1.2 | 1.9 | 5 | 5 | 5 | 5 | 97 | 97 | 97 | 97 |
| β 2 | −0.30 | −0.30 | −0.31 | −0.30 | −0.31 | .02 | .02 | .02 | .02 | .02 | .02 | .02 | .02 | 1.6 | 2.0 | 1.6 | 2.4 | 7 | 8 | 7 | 8 | 97 | 95 | 95 | 94 |
| β 3 | 1.50 | 1.51 | 1.52 | 1.51 | 1.52 | .03 | .03 | .03 | .03 | .03 | .03 | .03 | .03 | 0.5 | 1.4 | 1.0 | 1.4 | 2 | 2 | 2 | 2 | 92 | 88 | 93 | 88 |
| η 0 | 1.00 | 0.99 | 1.01 | 1.01 | 1.02 | .12 | .12 | .12 | .12 | .11 | .13 | .13 | .12 | 0.7 | 1.2 | 1.5 | 1.8 | 11 | 13 | 13 | 12 | 98 | 91 | 90 | 95 |
| η 1 | −3.00 | −3.04 | −3.02 | −3.02 | −3.02 | .25 | .25 | .25 | .25 | .23 | .28 | .28 | .27 | 1.4 | 0.6 | 0.6 | 0.7 | 8 | 9 | 9 | 9 | 96 | 96 | 94 | 96 |
| η 2 | 0.90 | 0.92 | 0.90 | 0.90 | 0.90 | .12 | .12 | .12 | .12 | .11 | .14 | .14 | .13 | 2.3 | 0.2 | 0.2 | 0.5 | 12 | 15 | 15 | 14 | 97 | 93 | 93 | 92 |
| η 3 | −3.50 | −3.50 | −3.53 | −3.54 | −3.55 | .15 | .15 | .15 | .15 | .16 | .16 | .16 | .15 | .03 | 1.0 | 1.1 | 1.3 | 5 | 5 | 5 | 5 | 95 | 92 | 93 | 94 |
| λ 0 | −0.75 | −0.77 | −0.77 | −0.78 | −0.76 | .15 | .14 | .14 | .14 | .18 | .18 | .19 | .17 | 2.7 | 2.8 | 4.1 | 1.3 | 23 | 24 | 25 | 22 | 91 | 89 | 86 | 92 |
| λ 1 | −1.50 | −1.61 | −1.55 | −1.56 | −1.55 | .20 | .17 | .18 | .18 | .30 | .33 | .33 | .38 | 7.2 | 3.0 | 3.8 | 3.1 | 21 | 22 | 22 | 25 | 75 | 73 | 69 | 71 |
| λ 2 | 2.00 | 2.15 | 2.11 | 2.13 | 2.09 | .27 | .19 | .19 | .19 | .33 | .34 | .35 | .36 | 7.3 | 5.3 | 6.4 | 4.4 | 18 | 18 | 18 | 18 | 82 | 71 | 67 | 72 |
NR: non-robust method with bi ∼ N(0, Σ) and c = ∞.
R1: robust method with and c = 2.
R2: robust method with and c = 1.5.
R3: robust method with and c = 2.
SE: average standard error. SSE: sample standard error; CP: coverage probability.
Table 3.
Simulation results with 5% e-outliers.
| Par | True value | Estimate NR | Estimate R1 | Estimate R2 | SE NR | SE R1 | SE R2 | SSE NR | SSE R1 | SSE R2 | rBias% NR | rBias% R1 | rBias% R2 | rRMSE% NR | rRMSE% R1 | rRMSE% R2 | CP% NR | CP% R1 | CP% R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α 0 | −1.65 | −1.69 | −1.69 | −1.68 | .09 | .09 | .09 | .09 | .09 | .09 | 2.4 | 2.2 | 2.0 | 6 | 6 | 6 | 97 | 97 | 97 |
| α 1 | 0.15 | 0.16 | 0.16 | 0.16 | .01 | .01 | .01 | .01 | .01 | .01 | 4.2 | 4.1 | 3.8 | 7 | 6 | 6 | 90 | 88 | 93 |
| α 2 | 1.80 | 1.85 | 1.85 | 1.84 | .11 | .11 | .11 | .11 | .11 | .11 | 3.0 | 2.8 | 2.4 | 7 | 7 | 7 | 95 | 96 | 96 |
| α 3 | −0.05 | −0.05 | −0.05 | −0.05 | .01 | .01 | .01 | .01 | .01 | .01 | 6.1 | 6.0 | 5.6 | 13 | 12 | 13 | 89 | 93 | 90 |
| β 0 | 2.00 | 1.94 | 1.98 | 1.99 | .04 | .03 | .03 | .04 | .04 | .03 | 3.2 | 1.0 | 0.6 | 4 | 2 | 2 | 67 | 90 | 95 |
| β 1 | 1.00 | 1.04 | 1.01 | 1.01 | .07 | .05 | .05 | .07 | .06 | .05 | 4.3 | 1.4 | 1.0 | 8 | 6 | 5 | 93 | 95 | 97 |
| β 2 | −0.30 | −0.32 | −0.31 | −0.30 | .03 | .02 | .02 | .03 | .02 | .02 | 5.4 | 2.0 | 1.6 | 11 | 8 | 7 | 95 | 95 | 97 |
| β 3 | 1.50 | 1.55 | 1.51 | 1.51 | .04 | .03 | .03 | .04 | .03 | .03 | 3.3 | 0.9 | 0.5 | 4 | 2 | 2 | 74 | 90 | 92 |
| η 0 | 1.00 | 1.01 | 0.99 | 0.99 | .13 | .12 | .12 | .12 | .10 | .11 | 1.2 | 0.7 | 0.7 | 12 | 10 | 11 | 94 | 97 | 98 |
| η 1 | −3.00 | −3.08 | −3.04 | −3.04 | .25 | .25 | .25 | .25 | .24 | .23 | 2.6 | 1.5 | 1.4 | 9 | 8 | 8 | 95 | 95 | 96 |
| η 2 | 0.90 | 0.94 | 0.92 | 0.92 | .12 | .12 | .12 | .12 | .12 | .11 | 4.1 | 2.6 | 2.3 | 13 | 13 | 12 | 93 | 95 | 97 |
| η 3 | −3.50 | −3.53 | −3.49 | −3.50 | .15 | .15 | .15 | .16 | .16 | .16 | 0.9 | 0.2 | .03 | 5 | 4 | 5 | 93 | 97 | 95 |
| λ 0 | −0.75 | −0.80 | −0.78 | −0.77 | .16 | .15 | .15 | .18 | .17 | .18 | 7.2 | 4.1 | 2.7 | 25 | 23 | 23 | 93 | 93 | 91 |
| λ 1 | −1.50 | −1.71 | −1.57 | −1.61 | .23 | .19 | .20 | .34 | .29 | .30 | 14 | 4.5 | 7.2 | 27 | 20 | 21 | 76 | 78 | 75 |
| λ 2 | 2.00 | 2.28 | 2.11 | 2.15 | .34 | .28 | .27 | .36 | .33 | .33 | 14 | 5.6 | 7.3 | 23 | 17 | 18 | 79 | 81 | 82 |
NR: non-robust method with bi ∼ N(0, Σ) and c = ∞.
R1: robust method with bi ∼ N(0, Σ) and c = 2.
R2: robust method with bi ∼ N(0, Σ) and c = 1.5.
SE: average standard error. SSE: sample standard error; CP: coverage probability.
Table 2 shows that the robust method for b-outliers (R1 and R2), which is based on a multivariate tν-distribution with ν degrees of freedom, generally returns smaller relative biases (rBias) and relative root mean square errors (rRMSE) in parameter estimates than the non-robust (NR) method. To check model mis-specification and perform sensitivity analysis, two cases are considered for the robust method: R1 (ν = 5) and R2 (ν = 3). We see that R2 is preferred for and . This is because t3 in R2 has heavier tails than t5 in R1, so t3 can accommodate more outliers. Note that the standard errors (SE) of the estimates of λ1 and λ2 in the survival model based on the proposed method are under-estimated, since these SE’s are smaller than the corresponding SSE’s (the sample standard errors from the simulations). Thus, the corresponding coverage probabilities of the 95% confidence intervals are below the nominal level. This problem has been recognized in the joint-model literature when the baseline hazard in the survival model is completely unspecified (Hsieh, Tseng and Wang (2006); Rizopoulos (2012); Yu, Wu and Gilbert (2018)). In practice, we may consider bootstrap methods to obtain standard errors, or consider a parametric survival model such as a Weibull survival model. Table 2 shows the standard errors and coverage probabilities based on the nonparametric bootstrap method for method R2, denoted by SE* and CP* respectively, and Table 7 in the Supplementary Material (Yu et al., 2022) provides simulation results when a Weibull survival model is used for the case of c = ∞ and . We see that the coverage probabilities of the key parameters λ1 and λ2 based on the bootstrap are now close to the nominal level 0.95.
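A nonparametric bootstrap standard error can be sketched generically as below. The resampling unit is assumed to be the individual (all longitudinal and survival records of a sampled subject are kept together), which is the usual choice for clustered data but is not spelled out in the text. The `estimator` argument is a hypothetical stand-in for refitting the joint model and returning one parameter estimate; the toy example uses a sample mean so that the sketch runs on its own.

```python
import random
import statistics


def bootstrap_se(subjects, estimator, n_boot=200, rng=random):
    """Standard deviation of the estimator over resamples of whole subjects."""
    n = len(subjects)
    boot_estimates = []
    for _ in range(n_boot):
        # Draw n subjects with replacement, then re-estimate on the resample.
        resample = [subjects[rng.randrange(n)] for _ in range(n)]
        boot_estimates.append(estimator(resample))
    return statistics.stdev(boot_estimates)


# Toy usage: each "subject" is a single number and the estimator is the mean.
toy_subjects = [float(v) for v in range(1, 11)]
se_star = bootstrap_se(toy_subjects, statistics.mean, n_boot=500, rng=random.Random(2022))
print(round(se_star, 3))
```

For a joint model, each bootstrap replicate requires a full refit, so the computational efficiency of the approximate likelihood method matters in practice.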
Table 3 shows that, in the presence of e-outliers only, the robust methods substantially outperform the non-robust method, with much smaller rBiases and rRMSEs for the key parameter estimates of λ1 and λ2. To check model mis-specification and perform sensitivity analysis, we again consider two cases for the robust method with different turning points c for the M-estimator: R1 with c = 2 and R2 with c = 1.5. We see that R1 leads to smaller rBiases and rRMSEs for these key parameters, which is consistent with the common choice of c = 2 for M-estimators in the literature on robust methods. Again, the coverage probabilities are smaller than the nominal level of 95% due to under-estimated standard errors, as mentioned earlier. Table 8 in the Supplementary Material (Yu et al., 2022) provides the simulation results when a Weibull survival model is used with c = 2 and bi ~ N(0, Σ). The robust method leads to reasonable coverage probabilities for λ1 and λ2 when a parametric survival model is considered.
Table 4 shows the simulation results in the presence of both b-outliers and e-outliers. The results are consistent with those in Tables 2 and 3 in terms of biases and coverage probabilities. The rRMSE values are comparable across the different methods.
From these simulation results, we can conclude that the proposed robust method performs well and is preferred to the non-robust method in the presence of outliers. The advantages of the robust method are more substantial for e-outliers than for b-outliers. This is not surprising since joint models are often less sensitive to the mis-specification of the random effect distribution (Rizopoulos, 2012). As noted above, one issue with the proposed robust method is the under-estimated standard errors for the parameters in the Cox PH model, which may lead to lower coverage rates and is a common issue in this joint model setting. On the other hand, the non-robust method, which ignores outliers, may inflate the standard errors and thus produce higher coverage rates, but it produces larger biases and mean square errors in general. The issue of under-estimated standard errors may be addressed by bootstrap methods or by considering a parametric survival model, such as a Weibull survival model, instead of the Cox PH model.
6. Discussion.
We have proposed a robust method for joint models of mixed types of longitudinal data and survival data, where the longitudinal data may contain outliers and left censoring. We address the e-outliers by an M-estimator and the b-outliers by a multivariate t-distribution. Moreover, we have proposed a computationally efficient approximate method based on the h-likelihood for parameter estimation. We find that parameter estimates in joint models of longitudinal and survival data are very sensitive to e-outliers but are less sensitive to b-outliers in the longitudinal data. Thus, in data analysis using joint models, we should focus more on outlying observations among the repeated measurements within an individual than on outlying individuals.
Based on the proposed method, we analyze the motivating HIV vaccine data and find a strong association between the individual-specific longitudinal immune response trajectories and the risk of HIV infection. This finding is important since we may then possibly predict the risk of HIV infection based on early longitudinal trajectories of potential biomarkers, avoiding the more expensive approach of collecting all the longitudinal data later in the study. An advantage of using random effects from the longitudinal models as predictors in the survival model is that such an approach avoids the complications of the longitudinal values being censored, missing, or mis-measured at event times. Moreover, the random effects may be viewed as summaries of the longitudinal trajectories. For these specific HIV vaccine data, the analysis results based on the robust method and the non-robust method seem close. However, the proposed robust method allows us to be more confident about the analysis results and conclusions since left censoring and potential outliers are addressed. Moreover, simulation results show that the robust method may perform much better than the non-robust method in some cases. Thus, the proposed method may be useful for other datasets. Finally, the proposed method may be used as a tool for sensitivity analysis to see whether analysis results are sensitive to left censoring and potential outliers.
In this study, since the biomarkers are highly correlated, it may be difficult to recognize which of the biomarkers leads to the event, or whether the biomarkers jointly lead to the event. As we have done here, a simple approach would be to choose one or two representative biomarkers in the modelling, since a multivariate longitudinal model for all the biomarkers may involve too many parameters, making the joint model overly complicated. In future research, we might consider a single underlying latent process which governs all the longitudinal biomarkers. We might also consider a principal component analysis for the longitudinal biomarkers and then choose one or two longitudinal principal components in the joint model.
We have assumed a point mass for the censored values in order to avoid an unverifiable parametric distributional assumption for the censored values. This may lead to some loss of information, compared to a method assuming an unverifiable parametric distribution for the censored values (e.g., Hughes (1999)). However, since no information regarding the distribution of the censored values is available, there is actually little loss of information, so avoiding an unverifiable distributional assumption may be a reasonable approach.
Kong and Nan (2016) considered a left-censored covariate in regression models for cross-sectional data. They used survival analysis methods for the left-censored covariate after converting the left-censoring to right censoring, similar to the methods used in survival analysis for right-censored data. They assumed an accelerated failure time (AFT) model for the transformed covariate, and included other observed predictors (covariates) in the AFT model, without a distributional assumption for the error term of the AFT model. Then, they used the nonparametric Kaplan-Meier estimator to estimate the residual distribution. While they did not explicitly assume a point mass for the censored values, the Kaplan-Meier estimator of the error distribution is based on the observed data of the covariate with censoring, and they used other observed data as predictors in the AFT model to improve the estimation. They claim that the efficiency gain of their method depends on the strength of the association between the covariate with censoring and other predictors. Essentially, this is similar to our idea of using other observed predictors to possibly predict the censoring status of Y* when Y* is not measured, based on a logistic regression model. Neither method assumes a distribution for the censored values. A comparison of the two methods will be reported separately.
Although there has been extensive research on joint models in the last two decades, computation remains a main challenge for statistical inference in joint models. The approximate method based on the h-likelihood appears to work reasonably well and is computationally efficient. The h-likelihood is essentially a modification of the Laplace approximation. Such an approximation often performs well for continuous data but may perform less satisfactorily for discrete data such as binary data. Its performance may also depend on the frequencies of the repeated measurements. We plan to investigate the approximate method further via more extensive simulation studies.
The models in the paper are motivated from a real dataset from an HIV vaccine study, so they may also be applicable in other similar studies. While the models appear to be complicated, the main longitudinal models are in fact a linear mixed effects (LME) model for the continuous response and a logistic mixed effects model for the binary response. Both types of models are commonly used in practice. The LME model in the data analysis appears to be complicated since it incorporates the periodic patterns of the longitudinal data due to repeated vaccinations over time, but it is still an LME model. Left censoring and outliers are common problems in many longitudinal studies. Therefore, the proposed models are more generally applicable, and the LME model may be simpler in other applications.
7. Software.
R code and a sample data set are available at https://github.com/oliviayu/robust-HHJMs or upon request to the first author.
SUPPLEMENTARY MATERIAL
Supplementary to “Robust joint modelling of left-censored longitudinal data and survival data, with application to HIV vaccine studies”
The supplementary material includes four sections: 1. Proof of asymptotic properties; 2. Model selection procedure; 3. Model Diagnostics of the Real Data Analysis; and 4. Additional simulation results.
REFERENCES
- Barrett J, Diggle P, Henderson R and Taylor-Robinson D (2015). Joint modelling of repeated measurements and time-to-event outcomes: flexible model specification and exact likelihood inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77 131–148.
- Cantoni E and Ronchetti E (2001). Robust inference for generalized linear models. Journal of the American Statistical Association 96 1022–1030.
- Copt S and Victoria-Feser M-P (2006). High-breakdown inference for mixed linear models. Journal of the American Statistical Association 101 292–300.
- Elashoff RM, Li G and Li N (2015). Joint modeling of longitudinal and time-to-event data. Chapman and Hall.
- Flynn N, Forthal D, Harro C, Judson F, Mayer K, Para M, Gilbert P and the rgp120 HIV Vaccine Study Group (2005). Placebo-controlled phase 3 trial of recombinant glycoprotein 120 vaccine to prevent HIV-1 infection. Journal of Infectious Diseases 191 654–65.
- Gill PS (2000). A robust mixed linear model analysis for longitudinal data. Statistics in Medicine 19 975–987.
- Hsieh F, Tseng Y-K and Wang J-L (2006). Joint modeling of survival and longitudinal data: likelihood approach revisited. Biometrics 62 1037–1043.
- Hughes JP (1999). Mixed effects models with censored data with application to HIV RNA levels. Biometrics 55 625–629.
- Koller M (2016). robustlmm: An R package for robust estimation of linear mixed-effects models. Journal of Statistical Software, Articles 75 1–24.
- Kong S and Nan B (2016). Semiparametric approach to regression with a covariate subject to a detection limit. Biometrika 103 161–174.
- Lange KL, Little RJ and Taylor JM (1989). Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84 881–896.
- Lee Y and Nelder JA (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society. Series B (Methodological) 619–678.
- Lee Y, Nelder JA and Pawitan Y (2018). Generalized linear models with random effects: unified analysis via H-likelihood 153. CRC Press.
- Lucas A (1997). Robustness of the student t based M-estimator. Communications in Statistics-Theory and Methods 26 1165–1182.
- Qin G, Zhang J, Zhu Z and Fung W (2016). Robust estimation of partially linear models for longitudinal data with dropouts and measurement error. Statistics in Medicine 35 5401–5416.
- Rizopoulos D (2012). Joint models for longitudinal and time-to-event data: with applications in R. CRC Press.
- Sinha SK (2004). Robust analysis of generalized linear mixed models. Journal of the American Statistical Association 99 451–460.
- Taylor JM, Park Y, Ankerst DP, Proust-Lima C, Williams S, Kestin L, Bae K, Pickles T and Sandler H (2013). Real-time individual predictions of prostate cancer recurrence using joint models. Biometrics 69 206–213.
- Waternaux C, Laird NM and Ware JH (1989). Methods for analysis of longitudinal data: blood-lead concentrations and cognitive development. Journal of the American Statistical Association 84 33–41.
- Wu L (2009). Mixed effects models for complex data. CRC Press.
- Wu L and Qiu J (2011). Approximate bounded influence estimation for longitudinal data with outliers and measurement errors. Journal of Statistical Planning and Inference 141 2321–2330.
- Wu L and Yu T (2016). Joint modeling of longitudinal and survival data. In Wiley StatsRef: Statistics Reference Online 1–9. American Cancer Society. 10.1002/9781118445112.stat07849
- Yu T, Wu L and Gilbert PB (2018). A joint model for mixed and truncated longitudinal data and survival data, with application to HIV vaccine studies. Biostatistics 19 374–390.
- Yu T, Wu L, Qiu J and Gilbert PB (2022). Supplement to “Robust joint modelling of left-censored longitudinal data and survival data, with application to HIV vaccine studies”.
- Zheng X, Fung WK and Zhu Z (2013). Robust estimation in joint mean–covariance regression model for longitudinal data. Annals of the Institute of Statistical Mathematics 65 617–638.
