Author manuscript; available in PMC: 2017 Jul 30.
Published in final edited form as: Stat Med. 2015 Jul 15;35(17):2991–3006. doi: 10.1002/sim.6590

Joint Multiple Imputation for Longitudinal Outcomes and Clinical Events Which Truncate Longitudinal Follow-up

Bo Hu 1,*, Liang Li 2, Tom Greene 3
PMCID: PMC4714958  NIHMSID: NIHMS707142  PMID: 26179943

Abstract

Longitudinal cohort studies often collect both repeated measurements of longitudinal outcomes and times to clinical events whose occurrence precludes further longitudinal measurements. Although joint modeling of the clinical events and the longitudinal data can be used to provide valid statistical inference for target estimands in certain contexts, the application of joint models in medical literature is currently rather restricted due to the complexity of the joint models and the intensive computation involved. We propose a multiple imputation (MI) approach to jointly impute missing data of both the longitudinal and clinical event outcomes. With complete imputed datasets, analysts are then able to use simple and transparent statistical methods and standard statistical software to perform various analyses without dealing with the complications of missing data and joint modeling. We show that the proposed MI approach is flexible and easy to implement in practice. Numerical results are also provided to demonstrate its performance.

Keywords: multiple imputation, missing data, joint modeling, competing risk, longitudinal data

1. Introduction

Longitudinal clinical studies often include repeated measurements of biomarkers, patient reported outcomes or other longitudinal outcomes, as well as clinical events that preclude further measurement of the longitudinal outcomes. In particular, studies of chronic kidney disease (CKD) often center on longitudinal assessments of markers of disease severity or comorbidity, such as estimated glomerular filtration rate (eGFR), proteinuria, quality of life or physical function, etc., which are measured according to a designed schedule until death or kidney failure, at which point follow-up is terminated. When subjects are followed over time in a longitudinal study, it is almost inevitable that some longitudinal assessments will be missing [1, 2, 3]. Meanwhile the survival outcomes, measured as times to the clinical events, may also be missing due to loss to follow-up or administrative censoring.

Historically, most studies in the medical literature have considered the longitudinal and survival outcomes in separate analyses, in which planned longitudinal measurements occurring after the terminating events are treated as ignorable missing data. This practice has a number of drawbacks. When the longitudinal outcomes and the clinical events are correlated, failure to appropriately account for "non-ignorable" dropout can lead to biased statistical inference for the longitudinal outcomes. Furthermore, conducting separate analyses reduces efficiency by failing to incorporate the information provided jointly by both types of outcomes.

Joint modeling of the longitudinal and survival outcomes has gained popularity over the past decades (see [4] for a systematic review). Many joint modeling approaches fall under the framework of shared-parameter models [5, 6, 7]. A shared-parameter model typically consists of two sub models, one for the longitudinal outcome and the other for the survival outcome. The two sub models are linked through a shared parameter that represents subject-specific characteristics of the longitudinal profiles. Joint analysis may also be performed using multi-state approaches [8]. Hu et al. (2012) [9] proposed a multi-state representation that classifies subjects into discrete states defined jointly by the longitudinal and survival outcomes. Although a number of joint modeling approaches are available, their applications in medical literature are currently limited due to their statistical and computational complexity. For example, the shared-parameter models often involve a multi-dimensional integration, which may become intractable for a high dimensional shared parameter that is often needed in models with non-linear longitudinal profiles [10, 11]. The available software for joint modeling also has limited model options for the great diversity of problems encountered in practice [12].

In this paper we propose a computationally simple and transparent multiple imputation (MI) approach to impute missing data in both longitudinal and survival outcomes to support statistical inferences. We shall restrict our consideration to estimands that can be defined by the joint distribution of the survival outcomes and longitudinal measurements obtained prior to the occurrence of the clinical events. This avoids the conceptual problems that occur when attempting to conduct statistical inference on outcomes obtained after death or other terminating events [13]. Accordingly, we impute only the survival outcomes themselves and longitudinal measurements that are truly missing because of dropout or intermittent missingness, as opposed to the occurrence of the clinical events. For the survival part, we also restrict attention to estimands and imputed events within the follow-up period. With complete data after imputation, simple and direct statistical approaches can be used for analysis. In other words, imputation plus simple approaches can serve as a practical alternative to the complicated joint modeling approaches.

There is an extensive literature on the imputation of missing longitudinal data [2, 14, 15]. Ali and Siddiqui (2009) [16] specifically discussed the imputation of longitudinal data in the informative dropout setting, but their imputation approach does not depend on survival outcomes. There also exist many imputation methods for missing covariates in survival analysis (e.g., [17, 18, 19]), where the missing data are usually for the baseline covariates. It is generally recommended that the survival outcome should be used for the imputation of covariates. Several approaches have been proposed for the imputation of censored event times (e.g., [20, 21]). [22] and [23] discussed the imputation of censored event times by using longitudinal data as auxiliary variables. [24] discussed the use of multiple imputation for the diagnostics of joint models. However, this approach requires fitting the joint model prior to multiple imputations. Our proposed approach differs from the current literature by imputing both missing survival and longitudinal data. The imputation proceeds in a sequential fashion so that the imputation of each outcome depends on the other.

Section 2 introduces a motivating example. Section 3 describes the general framework of the MI approach and a flexible parametric imputation approach under this framework that is easy to implement in practice. Section 4 applies the MI algorithms to a simulation study and the motivating example to examine the numerical performance. Section 5 concludes the paper with a brief discussion.

2. The AASK Study

The African American Study of Kidney Disease and Hypertension (AASK) is a randomized clinical trial with a follow-on cohort study in 1094 African Americans with chronic hypertensive kidney disease [25]. The follow-up time ranges from 8.5 to 11.5 years depending on the date of randomization of the participants.

In the AASK Study, as in most clinical trials of chronic kidney disease, follow-up of longitudinal outcomes was terminated at the occurrence of end-stage renal disease (ESRD) or death, whichever occurred first. Therefore the survival data in AASK fall under the competing-risk framework, since the occurrence of death censored the observation of ESRD and vice versa. ESRD is defined by the initiation of renal replacement therapy, which dramatically alters the interpretation of many biomarkers measured in the serum or the urine. In general, renal replacement therapy may be provided in the form of dialysis or kidney transplantation, but in the AASK study all occurrences of ESRD were signified by the initiation of dialysis. A total of 318 AASK patients reached dialysis and another 176 patients died prior to the end of the follow-up period.

The AASK study database includes an extensive array of longitudinal data and provides a rare opportunity to study the long-term progression of chronic kidney disease. In this paper we consider the bivariate longitudinal outcome of estimated glomerular filtration rate (eGFR) and proteinuria. eGFR is calculated from serum creatinine, age and gender, and is the standard measure of the level of kidney function [26]. In the AASK study, eGFR was assessed at baseline and approximately every 6 months thereafter until the patient died, reached dialysis, or was otherwise lost to follow-up. Proteinuria, measured as the ratio of urinary protein and creatinine in a 24-hour urine collection, was assessed at baseline, every 6 months during the first five years and annually thereafter. The maximum number of eGFR measurements was 21 and the maximum number of proteinuria measurements was 15 for the AASK participants.

Figure 1 shows the status of data collection at each scheduled patient visit. For simplicity, we have defined a common right censoring time for both the longitudinal and survival outcomes, and have categorized any missing measurements of the longitudinal outcome prior to right censoring and the clinical events as intermittently missing. At the time of each scheduled visit, a patient could be deceased, in ESRD, right censored, continuing in follow-up with intermittently missing longitudinal data (eGFR or proteinuria), or continuing in follow-up with observed longitudinal data. Under our framework, the status of the patient is regarded as known at a given visit if they previously reached ESRD or death, so that only the categories defined by right censoring or intermittent missingness indicate true missing data requiring imputation.

Figure 1. Percentages of observed and missing data at follow-up visits in AASK

In AASK, 823 patients had intermittent missing patterns of the longitudinal data, with intermittent missing gap ranging from 9 to 64 months (with a mean of 16.5 months). 355 patients were lost to follow-up, causing missingness in both the longitudinal and survival data.

3. Multiple Imputation

3.1. Data Notations

We consider a competing-risk type of survival outcome with M event types. T* denotes the time to the first event and δ* takes the value m, m = 1, ···, M, if event type m occurs first. In the AASK study, m = 1 for dialysis and m = 2 for death.

For the longitudinal data, we consider multivariate outcomes, y1, ···, yJ, where yj is a vector of repeated measurements of the jth outcome that are scheduled to be measured at a set of time points $t_j = (t_{j1}, \ldots, t_{jm_j})^T$. J = 2 in the AASK example, where $y_1 = (y_{11}, \ldots, y_{1,21})^T$ is a vector of longitudinal eGFR measurements at $t_1 = (0, 6, \ldots, 120)^T$ and $y_2 = (y_{21}, \ldots, y_{2,15})^T$ is a vector of longitudinal proteinuria measurements at $t_2 = (0, 6, \ldots, 48, 60, \ldots, 120)^T$. For simplicity of notation, we denote Y as the vector of all longitudinal outcomes that are pooled together and arranged chronologically. Outcomes with tied measurement times may be arranged randomly.

Although the longitudinal outcomes are scheduled to be measured at all time points, their measurements are not feasible or their meanings are significantly altered after the occurrence of the events [27]. Therefore we consider longitudinal outcomes with measurement times prior to the time to the first clinical event, denoted by $Y_{T^*}$. Then $W = (Z^T, T^*, \delta^*, Y_{T^*}^T)^T$ denotes the complete data, where Z is a vector of baseline covariates.

Let C be the censoring time for the clinical events, which we assume to be independent of the longitudinal and survival variables given baseline covariates Z. The observed survival data are then T = min(T*, C) and δ = δ*I(T* ≤ C). The actual event time T* is censored and can be considered as missing but greater than T if δ = 0. Meanwhile we decompose the longitudinal data $Y_{T^*}$ as Yobs and Ymis, where Yobs and Ymis are the observed and missing parts of $Y_{T^*}$, respectively.
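As a small illustration of this notation, the following Python sketch derives the observed pair (T, δ) from latent event times, event types and censoring times. The function name `observe_survival` and the toy values are hypothetical and not part of the original analysis; the indicator is assumed to be I(T* ≤ C).

```python
import numpy as np

def observe_survival(t_star, delta_star, c):
    """Return observed (T, delta) given latent (T*, delta*) and censoring time C.

    T = min(T*, C); delta = delta* if T* <= C, and 0 (censored) otherwise.
    """
    t_star = np.asarray(t_star, dtype=float)
    delta_star = np.asarray(delta_star, dtype=int)
    c = np.asarray(c, dtype=float)
    t_obs = np.minimum(t_star, c)
    delta_obs = np.where(t_star <= c, delta_star, 0)
    return t_obs, delta_obs

# Toy example: subject 1 reaches dialysis (type 1) at month 10 before censoring;
# subject 2 would have died (type 2) at month 50 but is censored at month 30.
t_obs, delta_obs = observe_survival([10, 50], [1, 2], [20, 30])
```

Subjects with `delta_obs == 0` are the ones whose event time and event type need to be imputed.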

Suppose that the study has n subjects and that $W_i = (Z_i^T, T_i^*, \delta_i^*, Y_i^T)^T$ is the vector of complete data for subject i, i = 1, ···, n. Meanwhile $(Z_i^T, T_i, \delta_i, Y_{i,obs}^T)^T$ represents the observed data. For simplicity, we assume no missing data for the baseline variables Z throughout this paper, but our approach can be easily extended to account for missing baseline data.

3.2. General Framework

Multiple imputation replaces each missing value with a random sample of plausible values. More formally, imputed values should be drawn from the posterior predictive distribution of the missing data given the observed data. In our context, the target predictive distribution can be written as

$$p(T^*, \delta^*, Y_{mis} \mid T, \delta = 0, Y_{obs}, Z). \tag{1}$$

Since the missing data in (1) consist of both longitudinal and survival parts, we factor (1) as

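$$p(T^*, \delta^*, Y_{mis} \mid T, \delta = 0, Y_{obs}, Z) = p(T^*, \delta^* \mid T, \delta = 0, Y_{obs}, Z)\, p(Y_{mis} \mid T^*, \delta^*, T, \delta = 0, Y_{obs}, Z). \tag{2}$$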

For the missing survival data, we can further decompose p(T*, δ*|T, δ = 0, Yobs, Z) in (2) as

$$p(T^* \mid T, \delta = 0, Y_{obs}, Z)\, p(\delta^* \mid T^*, T, \delta = 0, Y_{obs}, Z). \tag{3}$$

Therefore we can impute the composite event time T* and event type δ* sequentially.

Based on equations (2) and (3), it is intuitive to impute the missing data in two steps, where the first step imputes the missing survival data (T*, δ*) and the second step imputes the missing longitudinal data Ymis. For subjects with observed survival data, that is, δ > 0, the target predictive distribution is p(Ymis|Yobs, T*, δ*). Therefore imputation is only needed for missing longitudinal data. In principle, any appropriate approach can be applied to impute the missing data in each step. In the following section, we describe a simple parametric approach to illustrate the general framework.

3.3. A Parametric Approach

For the parametric imputation approach, we express (1) as

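$$\begin{aligned} p(T^*, \delta^*, Y_{mis} \mid T, \delta = 0, Y_{obs}, Z) &= \int p(T^*, \delta^*, Y_{mis} \mid T, \delta = 0, Y_{obs}, Z; \theta)\, p(\theta \mid T, \delta = 0, Y_{obs}, Z)\, d\theta \\ &= \int p(T^* \mid T, \delta = 0, Y_{obs}, Z; \alpha)\, p(\delta^* \mid T^*, T, \delta = 0, Y_{obs}, Z; \beta) \\ &\qquad \times p(Y_{mis} \mid T^*, \delta^*, T, \delta = 0, Y_{obs}, Z; \gamma)\, p(\theta \mid T, \delta = 0, Y_{obs}, Z)\, d\theta, \end{aligned} \tag{4}$$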

where θ = (α, β, γ)T is a vector of unknown parameters and each component of θ indexes one corresponding conditional distribution in the second equality. Rather than evaluating the integral in (4) directly, it is generally more convenient to use the data augmentation (DA) algorithm [28], which proceeds by randomly sampling the parameters and missing data iteratively.

Based on (4), we propose an imputation approach based on sequential conditional parametric models. After l iterations, let $T^{*(l)} = (T_{mis}^{*(l)}, T_{obs}^{*})$, $\delta^{*(l)} = (\delta_{mis}^{*(l)}, \delta_{obs}^{*})$ and $Y^{(l)} = (Y_{mis}^{(l)}, Y_{obs})$ denote the complete survival and longitudinal data, where $T_{mis}^{*(l)}$, $\delta_{mis}^{*(l)}$ and $Y_{mis}^{(l)}$ represent imputed data. At iteration l + 1, we impute the missing values of T*, δ* and Y in the following steps.

Step 1a. Impute event times

For each censored time t at which imputation is needed, we assume a Weibull model for the composite event time T* for the set of subjects at risk, that is, R(t) = {i | Ti ≥ t}. More specifically, the hazard function is

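$$h(s; \alpha_t) = \alpha_{t0}\, s^{\alpha_{t0} - 1} \exp\left(\alpha_{t1} + \alpha_{t2}^T f(Y^{(l)}) + \alpha_{t3}^T Z\right) \quad \text{for } s \ge t, \tag{5}$$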

where αt = (αt0, ···, αt3)T is part of the parameter vector α in (4). f(Y(l)) represents some characteristics of the longitudinal data. For example, in the AASK study, f(Y(l)) could be the baseline eGFR and proteinuria, the latest eGFR and proteinuria prior to t or both baseline and the latest values [23].

Let $\hat\alpha_t = (\hat\alpha_{t0}, \ldots, \hat\alpha_{t3}^T)^T$ denote the vector of the maximum likelihood estimates of the parameters in model (5), which asymptotically follows a multivariate normal (MVN) distribution, $\sqrt{n_t}(\hat\alpha_t - \alpha_t) \sim MVN(0, \Sigma_{\alpha_t})$, where nt is the size of R(t). For the imputation, we

  1. draw α̃ = (α̃t0, α̃t1, α̃t2, α̃t3)T from the distribution MVN(α̂t, Σα̂t/nt);

  2. compute the hazard function h(s; α̃) for subjects that need imputation, that is, {i|Ti = t, δi = 0};

  3. draw T*(l+1) from the cumulative distribution function 1 − S(s; α̃)/S(t; α̃) for s ≥ t, where S(t; α̃) is the survival probability function.

Step 1 above uses an empirical Bayes approach by drawing α̃ from the sampling distribution of α̂t in fitting the Weibull model (5), which is a common approximation approach used for regression-based imputation methods [29]. Based on model (5), the logarithm of the survival probability in the third step above can be readily calculated as

$$\log S(t; \alpha) = -\int_0^t h(x; \alpha)\, dx = -t^{\alpha_{t0}} \exp\left(\alpha_{t1} + \alpha_{t2}^T f(Y^{(l)}) + \alpha_{t3}^T Z\right).$$
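The three sub-steps of Step 1a can be sketched in Python as follows. Under the Weibull form in (5) the cumulative hazard is s^{αt0} exp(η), with η the linear predictor, so the conditional distribution 1 − S(s)/S(t) can be inverted in closed form. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `draw_event_time`, the argument layout (shape parameter first, then coefficients matched to a design vector `x` = (1, f(Y), Z)), and the redraw of non-positive shape values are illustrative choices, and `cov_alpha` is assumed to be the covariance of the sampling distribution (already divided by nt).

```python
import numpy as np

def draw_event_time(t, alpha_hat, cov_alpha, x, rng):
    """Draw one imputed event time s >= t for a subject censored at time t.

    alpha_hat : estimates (shape, coefficients...) from the fitted Weibull model
    cov_alpha : covariance of the sampling distribution of alpha_hat (assumed)
    x         : design vector (1, f(Y), Z) matching the coefficients (assumed layout)
    """
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    # Sub-step 1: perturb the estimates; redraw if the shape is not positive.
    while True:
        alpha = rng.multivariate_normal(alpha_hat, cov_alpha)
        if alpha[0] > 0:
            break
    shape = alpha[0]
    eta = alpha[1:] @ np.asarray(x, dtype=float)  # linear predictor
    # Sub-steps 2-3: S(s)/S(t) = exp(-(s^shape - t^shape) * exp(eta)).
    # Setting this ratio equal to u ~ Uniform(0, 1] and solving for s:
    u = 1.0 - rng.uniform()
    return (t ** shape - np.log(u) * np.exp(-eta)) ** (1.0 / shape)

rng = np.random.default_rng(0)
s = draw_event_time(24.0, [1.2, -5.0, 0.1], np.eye(3) * 1e-4, [1.0, 2.0], rng)
```

Because log(u) ≤ 0, every draw is at least as large as the censoring time t, as required for an event that is known to occur after censoring.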

Step 1b. Impute event types

We assume a multinomial logistic model for the event indicator δ*. Denote by pm(βt) = Pr(δ* = m) the probability that event m occurs first for the subjects with imputed event time $T_i^{*(l+1)}$ from Step 1a. Under the multinomial logistic model, we have

$$\log\frac{p_m(\beta_t)}{p_1(\beta_t)} = \beta_{t0m} + \beta_{t1m} \log T^{*(l+1)} + \beta_{t2m} f(Y^{(l)}) + \beta_{t3m} Z \tag{6}$$

for m > 1, where $\beta_t = (\beta_{t02}, \ldots, \beta_{t3M})^T$ is part of the parameter vector β in (4) and $\sum_{m=1}^{M} p_m(\beta_t) = 1$. The log-transformation of the event time, log T*(l+1), is included as a covariate in model (6) to model the dependence of the event type on the event time. Other transformations of T*(l+1) may also be considered, e.g., H(T*(l+1)) with H(·) being the Nelson-Aalen estimator of the cumulative hazard function [19].

Fitting model (6) should be restricted to subjects in R(t) with δ > 0, which then yields a maximum likelihood estimate β̂t that asymptotically follows a multivariate normal distribution MVN(βt, Σβt). To impute the event types for subject i with i ∈ {i|Ti = t, δi = 0}, we

  1. draw β̃t from MVN(β̂t, Σβ̂t);

  2. compute probabilities pm(β̃t), m = 1, ···, M;

  3. draw an M-vector r from a multinomial distribution with probabilities pm(β̃t), m = 1, ···, M;

  4. set δ*(l+1) = m if the mth element of r is equal to 1.
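The event-type draw in Step 1b can be sketched in Python as below. This is an illustrative sketch only: the function name `draw_event_type`, the (M − 1) × p coefficient layout with event type 1 as the reference category, and the design vector `x` = (1, log T*, f(Y), Z) are assumptions made for the example, and fitting model (6) itself is left to standard software.

```python
import numpy as np

def draw_event_type(beta_hat, cov_beta, x, rng):
    """Draw an event type in 1..M from a multinomial logistic model.

    beta_hat : (M-1, p) array, row m-2 holds the coefficients of type m vs type 1
    cov_beta : covariance of the flattened estimates, shape ((M-1)p, (M-1)p)
    x        : length-p design vector (assumed layout: 1, log T*, f(Y), Z)
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    # 1. perturb the estimates
    beta = rng.multivariate_normal(beta_hat.ravel(), cov_beta).reshape(beta_hat.shape)
    # 2. compute probabilities; the reference type has logit 0
    logits = np.concatenate(([0.0], beta @ np.asarray(x, dtype=float)))
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    # 3. draw the indicator vector r; 4. return the index of its single 1
    r = rng.multinomial(1, probs)
    return int(np.argmax(r)) + 1, probs

rng = np.random.default_rng(1)
beta_hat = np.array([[0.5, -0.2, 0.1]])  # M = 2 event types, p = 3 (toy values)
m, probs = draw_event_type(beta_hat, np.eye(3) * 1e-6, [1.0, np.log(36.0), 0.3], rng)
```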

Steps 1a and 1b are repeated for all distinct observed censoring times, after which all missing survival data are imputed. The resulting data can be expressed as $(T_{mis}^{*(l+1)}, T_{obs}^{*}, \delta_{mis}^{*(l+1)}, \delta_{obs}^{*}, Y_{mis}^{(l)}, Y_{obs})$. Step 1b can be skipped for standard survival data without competing events. In practice, the risk set R(t) may have a small sample size for some t, and fitting models (5) and (6) might then encounter numerical problems. An approximation can be obtained by running these two imputation steps at a set of landmark times (τ1 < ··· < τK) instead of all observed censoring times. For each landmark time τk, models (5) and (6) are fit to subjects with Ti ≥ τk, and the fitted models are used to impute missing data for subjects with Ti ∈ [τk, τk+1) and δi = 0. This essentially assumes a constant modeling relationship within each landmark interval [τk, τk+1). A similar approach is used for dynamic prediction by landmarking [30]. The landmark times could be selected to ensure proper fit of models (5) and (6).
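To make the landmark approximation concrete, the sketch below assigns each censored subject to the landmark interval whose fitted models would be used for its imputation. The helper name `landmark_index` and the example landmark times are hypothetical; the last interval is taken to extend beyond the final landmark.

```python
import numpy as np

def landmark_index(censor_times, landmarks):
    """Return k such that landmarks[k] <= T_i < landmarks[k + 1].

    Subjects censored before the first landmark get -1; subjects censored at or
    after the last landmark are assigned to the last one.
    """
    landmarks = np.asarray(landmarks, dtype=float)
    return np.searchsorted(landmarks, censor_times, side="right") - 1

# Toy example with yearly landmarks (in months)
idx = landmark_index([5, 12, 24, 40], [0, 12, 24, 36])
```

Models (5) and (6) would then be fit once per landmark, on subjects still at risk at that landmark, and reused for all censored subjects sharing the same index.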

Step 2. Impute longitudinal data

If the vector of complete data W or an appropriate transformation of W follows a multivariate normal distribution, the conditional distribution p(Y|T*, δ*, Z; γ) is also multivariate normal, where γ consists of the mean vector and the variance-covariance matrix of the MVN distribution. In this case, a direct approach is to use the Markov chain Monte Carlo (MCMC) method [31] to impute the missing longitudinal data. The MCMC method initially estimates γ using the Expectation-Maximization (EM) algorithm, and then iterates an I-step and a P-step. The I-step draws Ymis(l+1) from the conditional MVN distribution of p(Ymis|Yobs, T*(l+1), δ*(l+1), Z; γ). The P-step simulates γ based on the observed and imputed values of Y.
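For a fixed value of γ, the I-step amounts to drawing the missing entries from the conditional normal distribution given the observed entries. The Python sketch below shows this draw for one subject, assuming the mean vector `mu` and covariance matrix `sigma` of the full vector are given; the function name `draw_missing_mvn` and the use of `np.nan` to mark missing values are illustrative assumptions, and the P-step and the EM initialization are not shown.

```python
import numpy as np

def draw_missing_mvn(w, mu, sigma, rng):
    """Fill np.nan entries of w with a draw from the conditional MVN distribution."""
    w = np.asarray(w, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    mis = np.isnan(w)
    if not mis.any():
        return w.copy()
    obs = ~mis
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(mis, obs)]
    s_mm = sigma[np.ix_(mis, mis)]
    # k = S_mo S_oo^{-1}, computed with a linear solve instead of an explicit inverse
    k = np.linalg.solve(s_oo, s_mo.T).T
    cond_mean = mu[mis] + k @ (w[obs] - mu[obs])
    cond_cov = s_mm - k @ s_mo.T
    out = w.copy()
    out[mis] = rng.multivariate_normal(cond_mean, cond_cov)
    return out

rng = np.random.default_rng(3)
sigma = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
filled = draw_missing_mvn([1.0, np.nan, 0.5], np.zeros(3), sigma, rng)
```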

In cases where it is difficult to identify a suitable multivariate distribution for W or its transformations, a popular alternative is the fully conditional specification (FCS) approach [32]. The FCS approach is able to handle mixed variable types since each variable is imputed using its own imputation model. The FCS approach is also known as multiple imputation by chained equations (MICE) in the literature [29]. More specifically, let Yj be the jth component of the longitudinal data Y and let Wj denote the vector containing the remaining variables of W except Yj. Under the assumption that the conditional distribution p(Yj|Wj, γj) follows a parametric regression model with parameter γj, we

  1. regress Yj on Wj by restricting to subjects with observed values of Yj;

  2. draw γ̃j from the sampling distribution of the estimates of γj in fitting the regression model;

  3. draw Yj(l+1) from p(Yj|Wj; γ̃j).

The regressors Wj in the first step contain both observed and imputed values of other variables. The above process is repeated for all components of the longitudinal data Y. Subsequently iteration l + 1 completes and data are updated as $(T_{mis}^{*(l+1)}, T_{obs}^{*}, \delta_{mis}^{*(l+1)}, \delta_{obs}^{*}, Y_{mis}^{(l+1)}, Y_{obs})$. For subjects with observed survival data, the imputation consists only of Step 2.
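One FCS update for a single continuous component can be sketched as follows. The sketch assumes a normal linear regression as the conditional model, with the usual draw of the residual variance and coefficients from their approximate sampling distribution; this specific model, the function name `fcs_impute_column` and the simulated toy data are assumptions for illustration, since the text allows any parametric regression model.

```python
import numpy as np

def fcs_impute_column(y, X, rng):
    """One FCS update of y (np.nan = missing) given fully filled regressors X."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), X])  # add an intercept
    obs = ~np.isnan(y)
    Xo, yo = X[obs], y[obs]
    # 1. regress y on X using subjects with observed y
    xtx_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = xtx_inv @ Xo.T @ yo
    resid = yo - Xo @ beta_hat
    df = len(yo) - X.shape[1]
    # 2. draw the residual variance, then the coefficients
    sigma2 = resid @ resid / rng.chisquare(df)
    beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
    # 3. draw the missing values from the fitted conditional distribution
    out = y.copy()
    n_mis = int((~obs).sum())
    out[~obs] = X[~obs] @ beta + rng.normal(0.0, np.sqrt(sigma2), n_mis)
    return out

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.1, size=50)
y[[3, 17]] = np.nan
y_filled = fcs_impute_column(y, X, rng)
```

Cycling this update over all components of Y, with the regressors refreshed after each update, gives one pass of Step 2.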

The multi-step imputation procedure may iterate several times to achieve numerical convergence, which is similar to the chained equation approach for multiple imputation. While there is no formal approach to determine the number of iterations, convergence can be monitored by examining the mean values of each variable under imputation over iterations [33]. Simulation studies have indicated that satisfactory performance can be achieved with just 5 or 10 iterations [34, 35].

Upon completion of the imputation, an administrative censoring time may be applied to each imputed dataset, which typically coincides with the planned follow-up period of the study. For example, the AASK study ended on June 30th, 2007, so any imputed event times later than that date can be discarded.

4. Numerical Results

4.1. Simulation Studies

Simulation model

We considered a univariate longitudinal outcome in the simulation studies. The serial measurements of the longitudinal outcome Y (t), t = 0, 6, ···, 120, and the survival outcome (T*, δ*) were generated from the following models,

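$$\begin{aligned} Y(t) &= b_0 + b_1 t + e(t), \\ \log T^* &= \alpha_0 + \alpha_1^T Z + \alpha_2^T b + \varepsilon, \\ \log\frac{P(\delta^* = 2)}{P(\delta^* = 1)} &= \beta_0 + \beta_1^T Z + \beta_2^T b + \beta_3 \log T^*, \end{aligned} \tag{7}$$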

where δ* = 1 and 2 denote dialysis and death, respectively; Z ~ MVN(μz, Σz) represents a two-dimensional vector of baseline variables; b = (b0, b1)T ~ MVN(μb, Σb) represents the intercept and slope of the longitudinal trajectory; $e(t) \sim N(0, \sigma_e^2)$ is the residual error; and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$. The values of all parameters, μz, μb, Σz, Σb, σe, σε, α and β, were estimated from the AASK data.

An independent censoring time C was simulated from a mixture of an exponential distribution exp(1/300) and a truncated normal distribution N(130, 30)I(0,130). The mixture distribution was chosen to yield an overall high censoring rate and an accelerated censoring rate towards the end of the follow-up period. C censors the observation of the time to event, leading to T = min(T*, C) and δ = δ*I(T* < C) as the observed survival data.

Missing model of longitudinal data

We generated missing data of the longitudinal outcome according to the following strategy:

  • (Dropout). We only simulated Y (t) with t ≤ T.

  • (Intermittent missing). A simulated Y (t) was randomly deleted according to the proportion of intermittent missing eGFR at time t in AASK.

  • (Consecutively-intermittent missing). We simulated the number of consecutively-intermittent missing longitudinal measurements from a Poisson distribution, nc ~ Poisson(λc). We then randomly selected a time point and deleted all nc subsequent longitudinal measurements.

  • (Lost to follow-up). We simulated nl ~ Poisson(λl), and then deleted nl consecutive measurements prior to the time T .

The baseline longitudinal outcome Y (0) was retained for all subjects. We considered two missing mechanisms, missing at random (MAR) and missing not at random (MNAR), by selecting different models for λc and λl. For MAR, log λc and log λl were linear functions of the baseline longitudinal data Y (0), T and δ, which were all kept observed in the simulation study, so that missingness of the longitudinal data depended only on observed data. For MNAR, λc and λl further depended on b0 and b1, the unobserved random intercept and slope of the longitudinal trajectory. Details of the models for λc and λl can be found in the Supplementary Materials.

We ran 100 simulations, each with a sample size of 500. Figure 2 shows the proportion of missing measurements of the longitudinal outcome at each scheduled visit, averaged over the 100 simulations, for the missing at random scenario. Similar missing rates were observed under the MNAR mechanism. For each simulated dataset, we applied two approaches to impute the missing data. The two approaches differed only in the imputation of the missing longitudinal data: the first approach applied the MCMC method by assuming a multivariate normal distribution for the longitudinal outcome given all other variables, and the second approach used the FCS method. We refer to these two approaches as the "MVN" and "FCS" approaches throughout the remainder of this paper. The number of imputations was 10 for each simulated dataset, and Steps 1a, 1b and 2 in Section 3.3 were iterated 10 times for each imputation.

Figure 2. Average missing rates of longitudinal outcomes in the simulation study (MAR)

To evaluate the performance of the MI approach, we considered the multi-state model, where the longitudinal and survival outcomes are analyzed jointly by classifying subjects into a set of mutually exclusive states. For the simulation study, we considered 5 states at each time t,

$$s_t = k, \quad k = 1, \ldots, 5, \tag{8}$$

where k = 1, 2 and 3 represent the event-free states with Y (t) below 30, between 30 and 60, and above 60, respectively, and k = 4 and 5 represent dialysis and death, respectively.
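With complete (imputed) data, assigning a subject to one of the five states at a given time is a simple rule, sketched below in Python. The function name `classify_state`, the event-type coding (1 = dialysis, 2 = death) and the handling of boundary values (an event at exactly time t counts as having occurred, and a value of exactly 30 or 60 falls in the middle state) are assumptions made for this example.

```python
def classify_state(y_t, event_time, event_type, t):
    """Return the state (1..5) of one subject at time t.

    y_t        : longitudinal value at time t (used only if no event has occurred)
    event_time : observed or imputed time of the first event
    event_type : 1 = dialysis, 2 = death
    """
    if event_time <= t:
        return 4 if event_type == 1 else 5
    if y_t < 30:
        return 1
    if y_t <= 60:
        return 2
    return 3

# A subject with a value of 45 at month 36 who reaches dialysis at month 100
state = classify_state(45, 100, 1, 36)
```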

We considered the following estimands associated with the multi-state model: (i) the multi-state probabilities Pkt = P(st = k) for t ∈ [0, 120]; (ii) the areas under the multi-state probability curves, $\int_0^{120} P_{kt}\, dt$ for k = 1, ···, 5; (iii) transition probabilities between states over designated time intervals, P(st = k|su = l) with t > u. We also considered (iv) the mean intercept and slope of the longitudinal trajectory; (v) the average one-year change of the longitudinal outcome preceding dialysis or death. With a complete imputed dataset, all estimands except for (iv) can be estimated as simple proportions or means. The mean intercept and slope of the longitudinal data were estimated by using the unweighted least squares method [36]. Additional details concerning the estimation procedures are described in the Supplementary Materials.
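Estimands (i) and (ii) can be computed from a complete dataset with a few lines of Python, as sketched below. The state probabilities are simple proportions across subjects at each scheduled time, and the area under each curve is approximated here with the trapezoidal rule; the integration rule and the function names are assumptions for illustration, since the exact procedure is described in the Supplementary Materials.

```python
import numpy as np

def state_probabilities(states, n_states=5):
    """states: (n_subjects, n_times) array of states 1..n_states.
    Returns an (n_states, n_times) array of proportions P_kt."""
    states = np.asarray(states)
    return np.stack([(states == k).mean(axis=0) for k in range(1, n_states + 1)])

def area_under_curves(probs, times):
    """Trapezoidal approximation of the integral of each row of probs over times."""
    times = np.asarray(times, dtype=float)
    dt = np.diff(times)
    return ((probs[:, 1:] + probs[:, :-1]) / 2.0 * dt).sum(axis=1)

# Toy example: two subjects observed at months 0, 6 and 12
states = [[1, 1, 4],
          [2, 2, 2]]
probs = state_probabilities(states)
areas = area_under_curves(probs, [0, 6, 12])
```

Because the state probabilities sum to one at every time point, the areas sum to the length of the follow-up window.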

The estimates from multiple imputed datasets were summarized using Rubin’s rule [1]. For any estimand Q, let $\hat{Q}_j$ be the estimate obtained from the jth imputed dataset, j = 1, ···, m, and let Vj be the variance estimate associated with $\hat{Q}_j$. The multiple imputation estimate for Q is the average of the m estimates

$$\hat{Q} = \frac{\sum_j \hat{Q}_j}{m}$$

with the variance estimate as

$$\hat{V} = \frac{\sum_j V_j}{m} + \left(1 + \frac{1}{m}\right) B,$$

where B is the sample variance of the estimates across m imputed datasets. The relative increase in variance due to missingness is calculated by $r = (m + 1)B / \sum_j V_j$ and the relative efficiency is $m/(m + r)$.
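The combining rules above translate directly into code. The following Python sketch is a minimal illustration with a hypothetical function name, `rubin_combine`, and toy inputs; it follows the formulas as stated in the text.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine m estimates and their variances across imputed datasets.

    Returns (pooled estimate, total variance, relative increase in variance r,
    relative efficiency m / (m + r)).
    """
    q = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # average of the m estimates
    b = q.var(ddof=1)                     # between-imputation sample variance
    total_var = v.mean() + (1.0 + 1.0 / m) * b
    r = (m + 1) * b / v.sum()             # relative increase in variance
    rel_eff = m / (m + r)
    return q_bar, total_var, r, rel_eff

# Toy example with m = 3 imputed datasets
q_bar, total_var, r, rel_eff = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The standard error reported for an estimand is then the square root of `total_var`.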

Figure 3 compares the true and estimated multi-state probability curves when data are missing at random. Each curve is an average across 100 simulations. The red lines represent the true probability curves, which are obtained from the complete datasets without missing data. The blue and green lines are from the datasets imputed with the MVN and FCS approaches, respectively, and both are very close to the true curves. The black lines are the probability curves estimated from the observed datasets with missing data, which deviate clearly from the true curves. We display the black curves as a frame of reference to illustrate the discrepancies that arise when the simple approaches used for the imputed data are applied directly to the datasets with missing data. In practice, simple approaches are rarely directly applicable to data with missing values, and more sophisticated approaches, e.g., the multi-state model developed by [9], are needed. Table 1 shows the simulation results for other estimands under the MAR mechanism. For the areas under the probability curves and the transition probabilities, the estimates obtained from the imputed data are generally close to the true values. The estimates are somewhat greater than the true values around the tail end of the follow-up period, which may be due to the higher missing rate and smaller number of longitudinal measurements therein. The results of the MVN and FCS approaches are very similar to each other in general. As expected, the direct estimates from the observed data are biased. For the mean intercept and slope of the longitudinal trajectories, all estimates are close to the true values since the unweighted least squares method is known to be unbiased in the case of informative dropout. For the decline of Y one year prior to dialysis or death, all estimates are also close to the true values, but those from the imputed data have smaller variances since more subjects are involved in estimating the declines with imputed data.

Figure 3. Comparison of true and estimated multi-state probability curves (MAR)

Table 1. Simulation results when data are missing at random (MAR)

| Estimand | True Q | Observed Est. | Observed SE | Observed SD | MVN Est. | MVN SE | MVN SD | MVN RE | FCS Est. | FCS SE | FCS SD | FCS RE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Area under the multi-state probability curve, Ck | | | | | | | | | | | | |
| k = 1 | 14.459 | 13.413 | 1.309 | 1.554 | 14.431 | 1.381 | 1.531 | 0.998 | 14.417 | 1.382 | 1.539 | 0.997 |
| k = 2 | 49.543 | 47.430 | 1.545 | 1.761 | 49.322 | 1.634 | 1.595 | 0.996 | 49.275 | 1.632 | 1.604 | 0.995 |
| k = 3 | 39.315 | 35.375 | 1.234 | 1.291 | 39.285 | 1.330 | 1.234 | 0.995 | 39.347 | 1.326 | 1.239 | 0.995 |
| k = 4 | 11.772 | 17.039 | 1.384 | 1.447 | 11.919 | 0.975 | 1.067 | 0.998 | 11.912 | 0.974 | 1.070 | 0.998 |
| k = 5 | 4.911 | 6.742 | 0.810 | 0.887 | 5.042 | 0.587 | 0.610 | 0.996 | 5.048 | 0.588 | 0.618 | 0.996 |
| Transition probability | | | | | | | | | | | | |
| P (s(30) = 2\|s(0) = 1) | 0.542 | 0.387 | 0.044 | 0.047 | 0.543 | 0.049 | 0.047 | 0.985 | 0.543 | 0.049 | 0.047 | 0.985 |
| P (s(30) = 3\|s(0) = 2) | 0.323 | 0.279 | 0.025 | 0.026 | 0.322 | 0.028 | 0.027 | 0.986 | 0.323 | 0.028 | 0.026 | 0.985 |
| P (s(60) = 3\|s(30) = 2) | 0.375 | 0.235 | 0.025 | 0.032 | 0.379 | 0.032 | 0.029 | 0.981 | 0.380 | 0.033 | 0.028 | 0.978 |
| P (s(60) = 4\|s(30) = 3) | 0.185 | 0.141 | 0.028 | 0.031 | 0.183 | 0.032 | 0.033 | 0.998 | 0.183 | 0.032 | 0.033 | 0.997 |
| P (s(90) = 4\|s(60) = 3) | 0.289 | 0.207 | 0.028 | 0.035 | 0.289 | 0.032 | 0.030 | 0.991 | 0.288 | 0.032 | 0.030 | 0.991 |
| P (s(90) = 5\|s(60) = 3) | 0.132 | 0.099 | 0.021 | 0.025 | 0.132 | 0.024 | 0.023 | 0.991 | 0.132 | 0.024 | 0.022 | 0.991 |
| P (s(120) = 4\|s(90) = 3) | 0.256 | 0.224 | 0.033 | 0.042 | 0.266 | 0.035 | 0.031 | 0.984 | 0.265 | 0.035 | 0.031 | 0.984 |
| P (s(120) = 5\|s(90) = 3) | 0.252 | 0.228 | 0.033 | 0.041 | 0.258 | 0.035 | 0.030 | 0.983 | 0.257 | 0.035 | 0.029 | 0.983 |
| Intercept and slope of longitudinal trajectory | | | | | | | | | | | | |
| μb0 | 6.945 | 6.945 | 0.049 | 0.051 | 6.944 | 0.049 | 0.051 | 0.998 | 6.941 | 0.057 | 0.051 | 0.995 |
| μb1 | −0.025 | −0.025 | 0.001 | 0.001 | −0.025 | 0.001 | 0.001 | 0.992 | −0.025 | 0.001 | 0.001 | 0.987 |
| 1-yr decline of Y prior to events | | | | | | | | | | | | |
| Dialysis | −0.034 | −0.034 | 0.009 | 0.008 | −0.033 | 0.005 | 0.003 | 0.955 | −0.033 | 0.005 | 0.003 | 0.955 |
| Death | −0.030 | −0.032 | 0.011 | 0.011 | −0.030 | 0.006 | 0.004 | 0.948 | −0.030 | 0.006 | 0.004 | 0.949 |

Est.: point estimate; SE: standard error from Rubin’s rule; SD: Empirical standard error; RE: relative efficiency

Figure 4 and Table 2 show the simulation results when data are missing not at random (MNAR). In this case, we also assumed a Gamma distribution for b1, the slope of Y (t). The estimates based on the FCS and the MVN approaches again exhibited little evidence of bias, with the exception of small deviations at the tail end of the follow-up period.

Figure 4. Comparison of true and estimated multi-state probability curves (MNAR)

Table 2.

Simulation results when data are missing at random (MNAR)

Estimand True Observed MVN FCS
Q SE SD SE SD RE SE SD RE
Area under the multi-state probability curve, Ck
k = 1 14.284 12.815 1.248 1.392 14.212 1.372 1.427 0.997 14.194 1.376 1.416 0.996
k = 2 49.366 46.634 1.496 1.726 49.019 1.630 1.573 0.995 49.012 1.634 1.558 0.994
k = 3 39.456 35.433 1.232 1.299 39.559 1.325 1.364 0.994 39.586 1.331 1.356 0.993
k = 4 11.998 18.209 1.443 1.509 12.185 0.985 1.016 0.998 12.182 0.986 1.005 0.998
k = 5 4.896 6.908 0.831 0.931 5.025 0.589 0.616 0.996 5.026 0.589 0.620 0.996

Transition probability
P (s(30) = 2|s(0) = 1) 0.551 0.382 0.044 0.047 0.550 0.049 0.048 0.984 0.551 0.049 0.047 0.984
P (s(30) = 3|s(0) = 2) 0.322 0.274 0.025 0.031 0.324 0.028 0.029 0.984 0.324 0.028 0.029 0.984
P (s(60) = 3|s(30) = 2) 0.381 0.221 0.025 0.030 0.387 0.033 0.028 0.976 0.387 0.034 0.027 0.975
P (s(60) = 4|s(30) = 3) 0.192 0.141 0.028 0.034 0.189 0.032 0.035 0.997 0.189 0.032 0.035 0.997
P (s(90) = 4|s(60) = 3) 0.290 0.209 0.028 0.036 0.290 0.032 0.032 0.991 0.290 0.032 0.032 0.989
P (s(90) = 5|s(60) = 3) 0.133 0.095 0.020 0.023 0.131 0.024 0.022 0.991 0.131 0.024 0.022 0.990
P (s(120) = 4|s(90) = 3) 0.255 0.219 0.033 0.035 0.266 0.035 0.028 0.984 0.266 0.035 0.028 0.984
P (s(120) = 5|s(90) = 3) 0.250 0.218 0.033 0.046 0.256 0.035 0.033 0.984 0.256 0.035 0.032 0.983

Intercept and slope of longitudinal trajectory
μb0 6.945 6.945 0.049 0.054 6.945 0.050 0.055 0.997 6.944 0.050 0.055 0.997
μb1 −0.025 −0.025 0.001 0.001 −0.025 0.001 0.001 0.989 −0.025 0.001 0.001 0.987

1-yr decline of Y prior to events
Dialysis −0.034 −0.033 0.010 0.010 −0.033 0.005 0.003 0.956 −0.032 0.006 0.003 0.952
Death −0.031 −0.034 0.011 0.013 −0.030 0.006 0.004 0.950 −0.030 0.006 0.004 0.946

SE: standard error from Rubin’s rule; SD: Empirical standard error; RE: relative efficiency
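
For readers less familiar with the SE and RE columns, Rubin's rules [1] for pooling the estimates from m imputed datasets can be written as below. This is the standard textbook form and is not taken from this section; in particular, the relative-efficiency expression is one common definition and is assumed here rather than confirmed by the tables. The symbols \hat{Q}_j and \hat{W}_j (the point estimate and its estimated variance from the jth imputed dataset) and \gamma (the fraction of missing information) are introduced only for this illustration.

```latex
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad
\bar{W} = \frac{1}{m}\sum_{j=1}^{m}\hat{W}_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^2,

T = \bar{W} + \Bigl(1+\frac{1}{m}\Bigr)B, \qquad
\mathrm{SE} = \sqrt{T}, \qquad
\mathrm{RE} = \Bigl(1+\frac{\gamma}{m}\Bigr)^{-1}.
```

Under this definition, RE close to 1 indicates that using a finite number of imputations costs little precision compared with infinitely many.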

4.2. An Application to AASK

According to the scheduled timeline of measuring eGFR and proteinuria, the complete longitudinal data can be arranged chronologically as the vector,

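$$Y = (\mathrm{Prot}_1, \mathrm{eGFR}_1, \ldots, \mathrm{Prot}_9, \mathrm{eGFR}_9, \mathrm{eGFR}_{10}, \mathrm{Prot}_{10}, \mathrm{eGFR}_{11}, \mathrm{eGFR}_{12}, \mathrm{Prot}_{11}, \ldots, \mathrm{Prot}_{15}, \mathrm{eGFR}_{21})^{T},$$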

where eGFRj denotes the jth square-root-transformed eGFR measurement and Protk denotes the kth log-transformed proteinuria measurement. The baseline covariates Z include age, gender, smoking status, LVH, albumin, urinary acid and serum nitrogen.

Missing data are imputed using the same approaches described in Section 4.1. Figure 5 shows the eGFR trajectories of 4 randomly chosen subjects. Each black dot represents an observed value and each blue dot represents an imputed value. Three subjects have observed event times, while the bottom-right panel shows imputed event times (dashed orange lines). The top-right panel illustrates a case in which the observed time of dialysis is used to impute prior missing eGFR data. These results are from the MVN approach; similar results are obtained using the FCS approach. Figure 6 illustrates the joint imputation of missing data in the bivariate longitudinal outcome of proteinuria and eGFR measurements for three AASK subjects. Figures 1 and 2 in the Supplementary Materials show the means of the longitudinal variables and the cumulative event probabilities over 100 iterations, which suggest satisfactory convergence after 10 iterations.

Figure 5.

eGFR trajectories of a random sample of 4 AASK subjects

Figure 6.

eGFR and proteinuria trajectories of three AASK subjects

The simulation results in Section 4.1 suggest that we can apply simple approaches to imputed datasets to obtain satisfactory estimates. On the other hand, there may exist more sophisticated approaches for estimating the same parameters based on the observed dataset with missing values. For example, the multi-state probabilities can be estimated using the approach proposed by [9], which we refer to as the “HLT” approach in this section; more details can be found in [9]. Table 3 compares the multi-state probabilities estimated by applying the simple approach to the imputed datasets with those estimated by applying the HLT approach to the originally observed datasets with missing values. A bootstrap procedure is needed to obtain standard errors for the HLT approach. The estimates from the two approaches are close to each other, suggesting that the simple approach, when applied to the “complete” imputed datasets, can yield results similar to those of the more sophisticated HLT approach.

Table 3.

Comparison of multi-state probabilities estimated using the simple and HLT approaches

State t Simple Approach (Estimate SE) HLT Approach (Estimate SE)
1 30 0.061 0.007 0.061 0.007
60 0.168 0.011 0.167 0.011
90 0.235 0.013 0.238 0.013
120 0.298 0.015 0.305 0.014
2 30 0.036 0.006 0.037 0.006
60 0.077 0.008 0.077 0.008
90 0.125 0.010 0.122 0.010
120 0.166 0.012 0.162 0.011
3 30 0.175 0.011 0.186 0.013
60 0.159 0.011 0.171 0.013
90 0.150 0.011 0.164 0.012
120 0.116 0.013 0.137 0.012
4 30 0.521 0.016 0.509 0.017
60 0.469 0.016 0.467 0.016
90 0.405 0.016 0.399 0.015
120 0.344 0.018 0.324 0.017
5 30 0.206 0.012 0.207 0.013
60 0.127 0.010 0.118 0.012
90 0.084 0.009 0.078 0.008
120 0.077 0.010 0.082 0.014
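
As a rough illustration of how a simple estimate of this kind might be computed from imputed data, the sketch below takes the state occupied by each subject at a fixed time in each imputed dataset, computes the sample proportion in a given state, and pools the results with Rubin's rules. It is a minimal sketch under stated assumptions, not the authors' code: the function names `occupancy_estimate` and `pool_rubin`, the binomial variance, and the toy data are all hypothetical.

```python
import math
import random


def occupancy_estimate(states, k):
    """Proportion of subjects in state k and its binomial variance.

    `states` holds one state label per subject at a fixed time point
    (hypothetical input layout, chosen for this illustration).
    """
    n = len(states)
    p = sum(1 for s in states if s == k) / n
    return p, p * (1 - p) / n


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules.

    Returns the pooled point estimate and its standard error.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                    # pooled estimate
    w_bar = sum(variances) / m                    # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total = w_bar + (1 + 1 / m) * b
    return q_bar, math.sqrt(total)


# Toy example: 10 "imputed" datasets of 1000 subjects with random states 1..5.
# Real imputed datasets would come from the imputation procedure instead.
rng = random.Random(0)
imputed = [[rng.choice([1, 2, 3, 4, 5]) for _ in range(1000)] for _ in range(10)]

per_dataset = [occupancy_estimate(d, k=4) for d in imputed]
estimates = [p for p, _ in per_dataset]
variances = [v for _, v in per_dataset]
p_hat, se = pool_rubin(estimates, variances)
print(f"pooled P(state = 4): {p_hat:.3f} (SE {se:.3f})")
```

Because each imputed dataset is complete, the per-dataset estimate needs no adjustment for censoring or missing measurements; the uncertainty due to imputation enters only through the between-imputation variance.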

5. Discussion

Missing data are usually inevitable in clinical studies and multiple imputation has been a popular method to handle missing data. In this paper, we discuss multiple imputation for missing longitudinal and survival data simultaneously when analyses are complicated by informative dropout, non-ignorable missingness, etc. Many joint modeling approaches have been proposed to analyze such data over the past three decades. However, applications of joint models in the medical literature have been limited due to their complexity in both implementation and interpretation. The proposed MI approach can serve as an alternative to the joint modeling approaches. With complete imputed datasets, simple and transparent statistical approaches can be applied to obtain satisfactory results, as shown by the simulation studies and an application to the AASK study.

Our approach is easy to implement using mainstream software such as SAS and R. Furthermore, the parametric imputation approach is able to naturally accommodate multivariate longitudinal outcomes and competing-risk survival outcomes. Another flexibility of our approach is that investigators may choose their own preferred imputation models at each step.

Our MI approach is analogous to the chained equation approach [32, 29]. The chained equation approach is a flexible alternative to the common MI methods that require an appropriate multivariate distribution of the variables for imputation, which is often not applicable in the presence of non-normal or categorical variables [37]. However, theoretical properties of the chained equation approach have not been established in general [29]. Our approach shares the same limitation. Our procedure may not guarantee consistent estimation in all situations, but as long as good imputations are created, the proposed method should reduce, to a great extent, any bias due to informative dropout, censoring, and intermittent missingness.

As is the case for multiple imputation in general, it is advisable to include in the imputation model all variables that enter into the analysis of the complete data [31, 38]. Therefore, one should include longitudinal outcomes as covariates in the imputation models for survival outcomes and vice versa. In practice, model selection procedures may be used to tailor specific imputation models. For example, one could include in the imputation model all variables that significantly predict the incomplete variable or those that predict the missingness of the incomplete variable. The validity of the imputation models can be checked with standard diagnostic techniques. For example, residual plots might be drawn for each imputation model based on data from the final iteration.

Supplementary Material

Supp MaterialS1

Acknowledgments

This research was sponsored by grant R01DK090046 from the National Institutes of Health.

References

  • 1. Rubin DB. Multiple Imputation for Nonresponse in Surveys. Wiley; New York: 1987.
  • 2. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. Wiley; Hoboken, NJ: 2002.
  • 3. Ibrahim JG, Molenberghs G. Missing data methods in longitudinal studies: a review. Test (Madrid, Spain) 2009;18(1):1–43. doi: 10.1007/s11749-009-0138-x.
  • 4. Tsiatis AA, Davidian M. Joint modeling of longitudinal and time-to-event data: An overview. Statistica Sinica. 2004;14:809–834.
  • 5. Vonesh EF, Greene T, Schluchter MD. Shared parameter models for the joint analysis of longitudinal data and event times. Statistics in Medicine. 2006;25:143–163. doi: 10.1002/sim.2249.
  • 6. Hsieh F, Tseng YK, Wang JL. Joint modeling of survival and longitudinal data: Likelihood approach revisited. Biometrics. 2006;62:1037–1043. doi: 10.1111/j.1541-0420.2006.00570.x.
  • 7. Li L, Hu B, Greene T. A semiparametric joint model for longitudinal and survival data with application to hemodialysis study. Biometrics. 2009;65:737–745. doi: 10.1111/j.1541-0420.2008.01168.x.
  • 8. Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: Competing risks and multi-state models. Statistics in Medicine. 2007;26:2389–2430. doi: 10.1002/sim.2712.
  • 9. Hu B, Li L, Wang X, Greene T. Nonparametric multistate representations of survival and longitudinal data with measurement error. Statistics in Medicine. 2012;31:2303–2317. doi: 10.1002/sim.5369.
  • 10. Rizopoulos D, Verbeke G, Lesaffre E. Fully exponential Laplace approximations for the joint modeling of survival and longitudinal data. Journal of the Royal Statistical Society (Series B) 2009;71:637–654.
  • 11. Rizopoulos D. Fast fitting of joint models for longitudinal and event time data using a pseudo-adaptive Gaussian quadrature rule. Computational Statistics & Data Analysis. 2012;56:491–501.
  • 12. Rizopoulos D. Joint Models for Longitudinal and Time-to-Event Data, with Applications in R. Boca Raton: Chapman and Hall/CRC; 2012.
  • 13. Zhang JL, Rubin DB. Estimation of causal effects via principal stratification when some outcomes are truncated by death. Journal of Educational and Behavioral Statistics. 2003;28:353–368.
  • 14. Schafer JL. Multiple imputation with PAN. In: Sayer AG, Collins LM, editors. New Methods for the Analysis of Change. Washington, DC: American Psychological Association; 2001. pp. 355–377.
  • 15. Yang X, Li J, Shoptaw S. Imputation-based strategies for clinical trial longitudinal data with nonignorable missing values. Statistics in Medicine. 2008;27:2826–49. doi: 10.1002/sim.3111.
  • 16. Ali MW, Siddiqui O. Multiple imputation compared with some informative dropout procedures in the estimation and comparison of rates of change in longitudinal clinical trials with dropouts. Journal of Biopharmaceutical Statistics. 2000;10:165–181. doi: 10.1081/BIP-100101020.
  • 17. Paik MC, Tsai WY. On using the Cox proportional hazards model with missing covariates. Biometrika. 1997;84:579–593. doi: 10.1023/a:1009657116403.
  • 18. van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine. 1999;18:681–694. doi: 10.1002/(sici)1097-0258(19990330)18:6<681::aid-sim71>3.0.co;2-r.
  • 19. White IR, Royston P. Imputing missing covariate values for the Cox model. Statistics in Medicine. 2009;28(15):1982–98. doi: 10.1002/sim.3618.
  • 20. Taylor JMG, Murray S, Hsu C-H. Survival estimation and testing via multiple imputation. Statistics & Probability Letters. 2002;58(3):221–232.
  • 21. Xiang F, Murray S, Liu X. Analysis of transplant urgency and benefit via multiple imputation. Statistics in Medicine. 2014. doi: 10.1002/sim.6250.
  • 22. Faucett CL, Schenker N, Taylor JM. Survival analysis using auxiliary variables via multiple imputation, with application to AIDS clinical trial data. Biometrics. 2002;58:37–47. doi: 10.1111/j.0006-341x.2002.00037.x.
  • 23. Hsu CH, Taylor JM, Murray S, Commenges D. Survival analysis using auxiliary variables via non-parametric multiple imputation. Statistics in Medicine. 2006;25:3503–3517. doi: 10.1002/sim.2452.
  • 24. Rizopoulos D, Verbeke G, Molenberghs G. Multiple-imputation-based residuals and diagnostic plots for joint models of longitudinal and survival outcomes. Biometrics. 2010;66:20–29. doi: 10.1111/j.1541-0420.2009.01273.x.
  • 25. Appel LJ, Middleton J, Miller ER, Lipkowitz M, Norris K, Agodoa LY, Bakris G, Douglas JG, Charleston J, Gassman J, Greene T, Jamerson K, Kusek JW, Lewis JA, Phillips RA, Rostand SG, Wright JT. The rationale and design of the AASK cohort study. J Am Soc Nephrol. 2003;14(Suppl 2):S166–S172. doi: 10.1097/01.asn.0000070081.15137.c0.
  • 26. K/DOQI clinical practice guidelines for chronic kidney disease. National Kidney Foundation; 2002.
  • 27. Kurland B, Heagerty P. Directly parameterized regression conditioning on being alive: Analysis of longitudinal data truncated by deaths. Biostatistics. 2005;6:241–258. doi: 10.1093/biostatistics/kxi006.
  • 28. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association. 1987;82(398):528–540.
  • 29. White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine. 2011;30:377–399. doi: 10.1002/sim.4067.
  • 30. van Houwelingen HC, Putter H. Dynamic predicting by landmarking as an alternative for multi-state modeling: an application to acute lymphoid leukemia data. Lifetime Data Analysis. 2008;14(4):447–463. doi: 10.1007/s10985-008-9099-8.
  • 31. Schafer JL. Analysis of Incomplete Multivariate Data. Chapman & Hall; London: 1997.
  • 32. van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research. 2007;16(3):219–242. doi: 10.1177/0962280206074463.
  • 33. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software. 2011;45(3):1–67.
  • 34. Brand JPL. PhD thesis. Erasmus University; Rotterdam: 1999. Development, Implementation and Evaluation of Multiple Imputation Strategies for the Statistical Analysis of Incomplete Data Sets.
  • 35. van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM, Rubin DB. Fully Conditional Specification in Multivariate Imputation. Journal of Statistical Computation and Simulation. 2006;76(12):1049–1064.
  • 36. Wu MC, Bailey KR. Estimation and comparison of changes in the presence of informative right censoring: Conditional linear model. Biometrics. 1989;45:939–955.
  • 37. Bernaards CA, Belin TR, Schafer JL. Robustness of a multivariate normal approximation for imputation of incomplete binary data. Statistics in Medicine. 2007;26:1368–1382. doi: 10.1002/sim.2619.
  • 38. Moons KG, Donders RA, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. Journal of Clinical Epidemiology. 2006;59:1092–1101. doi: 10.1016/j.jclinepi.2006.01.009.
