Author manuscript; available in PMC: 2021 Dec 1.
Published in final edited form as: Psychometrika. 2020 Oct 2;85(4):890–904. doi: 10.1007/s11336-020-09729-y

Using multiple imputation with GEE with non-monotone missing longitudinal binary outcomes

Stuart R Lipsitz 1, Garrett M Fitzmaurice 2, Roger D Weiss 3
PMCID: PMC7855014  NIHMSID: NIHMS1645646  PMID: 33006740

Summary

This paper considers multiple imputation (MI) approaches for handling non-monotone missing longitudinal binary responses when estimating parameters of a marginal model using generalized estimating equations (GEE). GEE has been shown to yield consistent estimates of the regression parameters for a marginal model when data are missing completely at random (MCAR). However, when data are missing at random (MAR), the GEE estimates may not be consistent; the MI approaches proposed in this paper minimize bias under MAR. The first MI approach proposed is based on a multivariate normal distribution, but with the addition of pairwise products among the binary outcomes to the multivariate normal vector. Even though the multivariate normal does not impute 0 or 1 values for the missing binary responses, as discussed by Horton et al. (2003), we suggest not rounding when filling in the missing binary data because rounding could increase bias. The second MI approach considered is the fully conditional specification (FCS) approach. In this approach, we specify a logistic regression model for each outcome given the outcomes at other time points and the covariates. Typically, one would only include main effects of the outcomes at the other times as predictors in the FCS approach, but we explore whether bias can be reduced by also including pairwise interactions of the outcomes at other time points in the FCS models. In a study of asymptotic bias with non-monotone missing data, the proposed MI approaches are also compared to GEE without imputation. Finally, the proposed methods are illustrated using data from a longitudinal clinical trial comparing four psychosocial treatments from the National Institute on Drug Abuse Collaborative Cocaine Treatment Study, where patients’ cocaine use is collected monthly for 6 months during treatment.

Keywords: Fully conditional specification, generalized estimating equations, missing completely at random, missing at random, multivariate normal

1. Introduction

Longitudinal studies in which each subject is to be observed at a fixed number of times are common in medicine. In this paper we consider statistical methods for the analysis of such data when the outcome is binary (e.g., success or failure) and the missing data pattern is not monotone; e.g., a subject’s outcome variable can be observed at one time point, missing at the next time point, and then observed at later time point(s).

Our motivating example is a longitudinal clinical trial from the National Institute on Drug Abuse Collaborative Cocaine Treatment Study (CCTS) whose goal was to compare four psychosocial treatments to reduce cocaine use in patients with cocaine dependence (Crits-Christoph et al., 1999). A total of 487 patients with a principal diagnosis of cocaine dependence were randomized to one of the four treatments. In the CCTS, cocaine use was assessed monthly for the 6 month duration of treatment; the main outcome at each time point was cocaine use (yes/no) as determined by urine screen. The main interest is in determining whether the treatments differ in terms of reducing cocaine use during the 6 months of treatment. However, a feature of this study which complicates the analysis is that only 285 (58.5 %) of the patients have measurements at all 6 occasions. With 6 time points, there are 2^6 − 1 = 63 possible missing data patterns, although only 42 of these patterns are found in the dataset. To appreciate the extent of missingness over time, we can examine the percent of subjects observed at each time point. At time 1, 437 (89.7 %) subjects are observed; at time 2, 407 (83.6 %); at time 3, 390 (80.1 %); at time 4, 386 (79.3 %); at time 5, 380 (78.0 %); and at time 6, 377 (77.4 %). We see that the percent observed decreases slightly at each successive time point. However, the missingness pattern is not monotone; there are 119 (24.4 %) subjects who missed at least one measurement, but returned for a later measurement. There were 83 (17.0 %) subjects with monotone missingness, meaning once these subjects miss a visit, they never return for a subsequent visit.

To obtain consistent estimates of the regression parameters of marginal models, Liang and Zeger (1986) proposed the generalized estimating equations (GEE) approach. This approach does not require the complete specification of the joint distribution of the repeated responses, but only the first two moments.
When some individuals’ response vectors are only partially observed, GEE approaches (Liang and Zeger, 1986; Carey et al., 1993; Lipsitz et al., 2000) circumvent the problem of missing data by simply basing inferences on the observed responses. This approach yields consistent marginal regression parameter estimates provided that the responses are missing completely at random (MCAR) (Rubin, 1976; Laird, 1988). In particular, when the outcome data are MCAR, missingness depends only on the covariates (that are included in the model), and the GEE provides consistent regression parameter estimates. However, when missingness is related to the observed data (covariates and observed responses), but conditionally independent of the missing responses given the observed data, the missing data are said to be missing at random (MAR) (Rubin, 1976; Laird, 1988) and GEE can yield biased regression parameter estimates. One approach for handling missing data that are missing at random within the GEE framework is weighted estimating equations (Robins, et al., 1995). However, this approach is more appealing for monotone missing data and difficult to apply with non-monotone missing data that are MAR, although there has been some recent work in this area for non-monotone missing data that are non-ignorably missing in which missingness is related to the unobserved data (Tchetgen et al., 2017).

Multiple imputation (Rubin, 1978; Rubin & Schenker, 1986; Rubin, 1987; Barnard & Meng, 1999; Schafer, 1999; Horton & Lipsitz, 2001; Little & Rubin, 2002; Scheuren, 2005) is a general technique to reduce bias when missing data are MAR; the approach is flexible in that it is appropriate for missing covariate and/or missing outcomes in univariate or multivariate settings. With multiple imputation, missing data on the outcome are imputed or “filled-in” based on some assumed model for the missing data given the observed data. For repeated measures data, multiple imputation has also been developed within the GEE framework (Paik, 1997; Beunckens et al., 2008). Both of these authors propose methods for imputing missing values for the outcome from models for the conditional mean of the missing outcome at a particular occasion given the history of past observed outcomes. Specifically, they focus on the special case of monotone missing data patterns as might arise from dropout or attrition and propose a sequence of conditional (on past outcomes) or Markov type models for imputing missing data. For example, Paik (1997) proposes a telescoping series of regression models conditional on the past that can be used to directly impute the missing outcome data in a sequential way. However, this sequential method of imputation of the missing data is applicable only to monotone missing data patterns. With non-monotone missing data patterns, the conditional mean of the missing outcome at a particular occasion, given the past observed outcomes, can no longer be directly obtained from the sequence of regression models; instead, imputation would require Gibbs sampling from the sequence of regression models. Consequently, the multiple imputation approaches of Paik (1997) and Beunckens et al. (2008) for GEE are best suited for monotone missing data.

The first multiple imputation approach for non-monotone missing data that we propose here is imputation from a multivariate normal distribution, but with the addition of pairwise products among the binary outcomes to the multivariate (normal) vector. Schafer (1997) has shown that assuming that multivariate binary data is multivariate normal for imputation purposes works well for estimating marginal proportions under the MAR mechanisms posed in his simulations, or when the fraction of missing data is small (say < 10%). Here, we highlight scenarios where imputing the longitudinal binary data using the multivariate normal does not work well for estimation in GEE. Instead, we show that adding the pairwise products of the longitudinal binary observations to the multivariate normal vector reduces bias in GEE. For non-monotone missing data, the multivariate normal gives a simple form of the conditional distribution of the missing vector given the observed vector (also multivariate normal). Even though the multivariate normal does not impute 0 or 1 values for the missing binary responses, as discussed by Horton et al. (2003), we suggest not rounding when filling in the missing binary data because doing so could increase bias.

The second multiple imputation approach considered is the fully conditional specification (FCS) approach (Van Buuren, 2007). In this approach, we specify a logistic regression model for each outcome given the outcomes at other time points and the covariates. In the FCS approach, we do not need to specify the full joint distribution, only the conditional distributions. Typically, one would only include main effects of the outcomes at the other times as predictors in the FCS approach. As with the imputation from the multivariate normal distribution, we explore if bias can be reduced in FCS by including the main effects as well as pairwise interactions among the outcomes at other time points in the conditional logistic regression model.

In Section 2, we introduce some notation and consider a marginal regression model for the vector of binary responses. In Section 3, we describe generalized estimating equations (GEE) for complete and missing data. In Section 4, we discuss the two multiple imputation approaches. In Section 5, we present a study of the asymptotic bias of GEE when missing data are handled using multiple imputation. Finally, in Section 6, we present results from analysis of the CCTS longitudinal clinical trial introduced earlier.

2. Notation and Distributional Assumptions

Suppose that N individuals are to be observed on T binary responses. Then, for the ith individual (i = 1, …, N) we can form a (T × 1) response vector, Yi = [Yi1, …, YiT]′, where Yit = 1 if the ith individual has a positive response (say, “success”) on the tth response, and Yit = 0 otherwise. Associated with Yit, each individual also has a J × 1 covariate vector xit. Let Xi = [xi1, …, xiT]′ represent the T × J matrix of covariates for the ith individual. The marginal distribution of Yit is Bernoulli and it is assumed that the probability of success,

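$$\pi_{it} = \pi_{it}(\beta) = E(Y_{it} \mid x_{it}, \beta) = \mathrm{pr}(Y_{it} = 1 \mid x_{it}, \beta) = \frac{\exp(x_{it}^{\prime}\beta)}{1 + \exp(x_{it}^{\prime}\beta)}, \tag{1}$$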

where β is a vector of logistic regression parameters. The πit(β) can be grouped together to form a vector πi(β) containing the marginal probabilities of success, πi(β) = E[Yi|Xi, β] = [πi1, …, πiT]′. Note that we are primarily interested in making inference about β. In this paper we consider the case where individuals are not observed on all T binary responses; however, we assume that no covariates are missing.

The joint probability for any pair of binary responses,

$$\pi_{ist} = E(Y_{is}Y_{it} \mid x_{is}, x_{it}, \beta, \alpha) = \mathrm{pr}(Y_{is} = 1, Y_{it} = 1 \mid x_{is}, x_{it}, \beta, \alpha),$$

can be modeled in terms of the two marginal probabilities πis(β) and πit(β), in addition to an association parameter vector α. For example, the correlation between any pair of binary responses, Yis and Yit, is

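$$\rho_{ist} = \rho_{ist}(\beta, \alpha) = \mathrm{Corr}(Y_{is}, Y_{it} \mid \beta, \alpha) = \frac{\pi_{ist} - \pi_{is}\pi_{it}}{[\pi_{is}(1 - \pi_{is})\,\pi_{it}(1 - \pi_{it})]^{1/2}}.$$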

In terms of the correlation coefficient, the joint probability πist can be expressed as

$$\pi_{ist}(\beta, \alpha) = \pi_{ist} = \pi_{is}\pi_{it} + \rho_{ist}\,[\pi_{is}(1 - \pi_{is})\,\pi_{it}(1 - \pi_{it})]^{1/2}. \tag{2}$$

Note that other association parameters could also be used, e.g., marginal odds ratios instead of correlations.
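As a small numerical illustration of equation (2), the sketch below computes the joint probability and the implied covariance for one pair of occasions. The function names and the input values (0.73 and 0.2) are illustrative assumptions, not part of the paper's software.

```python
import math

def joint_prob(pi_s, pi_t, rho):
    """pr(Y_is = 1, Y_it = 1) from equation (2), given two marginal
    probabilities and their correlation."""
    return pi_s * pi_t + rho * math.sqrt(pi_s * (1 - pi_s) * pi_t * (1 - pi_t))

def covariance(pi_s, pi_t, rho):
    """Cov(Y_is, Y_it) = pi_ist - pi_is * pi_it."""
    return joint_prob(pi_s, pi_t, rho) - pi_s * pi_t

# Illustrative values only: equal marginals of 0.73 and correlation 0.2.
p11 = joint_prob(0.73, 0.73, 0.2)
cov = covariance(0.73, 0.73, 0.2)
print(round(p11, 5), round(cov, 5))
```

With these inputs the joint probability is 0.73² + 0.2 × 0.73 × 0.27 = 0.57232, so the covariance is 0.03942.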

3. Generalized estimating equations

When there are no missing data, the generalized estimating equations (GEE) for β are given by

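$$u_1(\hat{\beta}) = \sum_{i=1}^{N} u_{1i}(\hat{\beta}) = \sum_{i=1}^{N} \hat{D}_i^{\prime}\,\hat{V}_i^{-1}\,[Y_i - \pi_i(\hat{\beta})] = 0, \tag{3}$$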

where Di = ∂πi(β)/∂β, and Vi = Vi(α, β) is the T × T “working” or approximate covariance matrix of Yi (Liang and Zeger, 1986). Since Yit is binary, the tth diagonal element of Vi is Var(Yit) = πit(1 − πit), which is specified entirely by the marginal distributions (i.e., by the vector of regression parameters β). The (s, t)th off-diagonal element of Vi is Cov(Yis, Yit) = πist − πisπit, where πist is specified by equation (2). A second set of estimating equations similar to (3) can be used to estimate α, the parameters associated with the correlations in (2).

When there are missing outcome data, we can write Yi = (Ym,i, Yo,i), where Yo,i is a (Ti × 1) vector containing the observed components of Yi, and Ym,i is a [(T − Ti) × 1] vector containing the missing components of Yi. The missing data patterns are assumed to be non-monotone, i.e., subjects can have missing values on at least one occasion, but observed values at a later occasion. With missing data, the GEE for β based on the observed data is

$$u_1(\hat{\beta}) = \sum_{i=1}^{N} \hat{D}_{o,i}^{\prime}\,\hat{V}_{o,i}^{-1}\,[Y_{o,i} - \pi_{o,i}(\hat{\beta})] = 0, \tag{4}$$

where πo,i and Vo,i are the elements of πi and Vi corresponding to Yo,i. The solution to this GEE yields consistent estimates of β when data are MCAR, but consistency may not hold when the data are MAR.

4. Multiple Imputation

In general, using multiple imputation, we ‘fill-in’ or ‘impute’ the missing data Ym,i for each subject to create a set of, say K ‘filled-in’ or ‘imputed’ datasets, and then we estimate β in each of these imputed datasets, and average the estimated β’s over the K imputed datasets to obtain the multiple imputation estimator.

Rubin & Schenker (1986) and Rubin (1987) give a detailed summary of multiple imputation. Here, we review some relevant parts. First, we create K imputed datasets by sampling the missing data Ym,i K times using one of the imputation methods discussed below; this creates $Y_{m,i}^{(k)}$, k = 1, …, K. For the kth imputed dataset, we calculate the GEE estimate $\hat{\beta}_k$ as well as the within imputation variance $U_k = \widehat{\mathrm{Var}}(\hat{\beta}_k)$. Thus, the K imputed datasets give us $\hat{\beta}_k$ and $U_k$, for k = 1, …, K.

With K imputations, the multiple imputation estimate of β is

$$\hat{\beta}^{*} = \frac{1}{K}\sum_{k=1}^{K} \hat{\beta}_k.$$

Then, normal based inferences for β can be made (Rubin, 1978; Rubin, 1987),

$$(\hat{\beta}^{*} - \beta) \sim N(0, V),$$

where

$$V = \hat{W} + \left(\frac{K+1}{K}\right)\hat{B}, \qquad \hat{W} = \frac{1}{K}\sum_{k=1}^{K} U_k \tag{5}$$

is the average within imputation variance, and

$$\hat{B} = \frac{1}{K-1}\sum_{k=1}^{K} (\hat{\beta}_k - \hat{\beta}^{*})(\hat{\beta}_k - \hat{\beta}^{*})^{\prime}$$

is the between imputation variance.
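The combining rules above can be sketched for a single scalar coefficient as follows. This is a minimal illustration, not the authors' implementation; the function name `pool_rubin` and the three example estimates are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, within_variances):
    """Combine K per-imputation estimates of one coefficient.

    estimates:        list of K point estimates (beta_hat_k)
    within_variances: list of K variance estimates (U_k)
    Returns (beta_star, V), where V = W + ((K + 1) / K) * B as in (5).
    """
    K = len(estimates)
    beta_star = mean(estimates)      # MI point estimate
    W = mean(within_variances)       # average within-imputation variance
    B = variance(estimates)          # between-imputation variance, divisor K - 1
    V = W + (K + 1) / K * B
    return beta_star, V

# Hypothetical estimates from K = 3 imputed datasets.
beta_star, V = pool_rubin([0.4, 0.5, 0.6], [0.04, 0.05, 0.06])
print(round(beta_star, 4), round(V, 4))
```

In this toy example W = 0.05 and B = 0.01, so V = 0.05 + (4/3)(0.01) ≈ 0.0633. For a coefficient vector, the same rules apply with covariance matrices in place of the scalar variances.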

Next, we describe the approaches to imputing the missing data. If the missing data are MAR, a consistent estimate of β can be obtained by replacing Ym,i with the conditional expectation of Ym,i given (Yo,i, Xi). Note, however, that the computation of the conditional expectation E(Ym,i|Yo,i, Xi) requires the full specification of the joint distribution of Yi. With a vector of T binary responses, there are 2^T possible response sequences, and Yi has a multinomial distribution with 2^T joint cell probabilities. The primary appeal of GEE lies in avoiding the full specification of this joint distribution of Yi.

Therefore, for multiple imputation, we first consider an approximation for E(Ym,i|Yo,i, Xi) based on the multivariate normal distribution for Yi. We use multiple imputation to fill in Ym,i given (Yo,i, Xi). However, rather than assuming only that Yi given Xi is multivariate normal in the imputation, we also form the [T(T − 1)/2 × 1] vector of cross-products

$$U_i = \{U_{i12}, U_{i13}, \ldots, U_{i(T-1)T}\}$$

where Uist = YisYit, and assume Y = [Yi, Ui] is multivariate normal. With this assumption, the conditional distribution of Ym,i given (Yo,i, Xi) will depend on Yo,i as well as the cross-products of the elements of Yo,i, say Uo,i. This conditional distribution is straightforward to impute from since it is also multivariate normal with conditional mean E(Ym,i|Yo,i, Uo,i, Xi) and conditional variance Var(Ym,i|Yo,i, Uo,i, Xi), which are both functions of the marginal mean E(Ym,i, Yo,i, Ui|Xi) and marginal variance Var(Ym,i, Yo,i, Ui|Xi); see for example Johnson & Wichern (2002) for the conditional mean and variance for a multivariate normal distribution. This can be considered a joint modelling approach in which the conditional distribution (multivariate normal) is derived from the joint distribution (multivariate normal). To impute the missing data from the conditional multivariate normal distribution, we used a Bayesian MCMC (Markov chain Monte Carlo) approach (Schafer, 1997; Gilks et al., 1996) implemented in SAS Proc MI (SAS Institute Inc, 2020) to sample Ym,i from the conditional multivariate normal distribution. For an arbitrary missing data pattern for multivariate normal data for a subject, the Bayesian MCMC approach draws from the posterior distribution of the missing data given the observed data using the two-step (imputation step, parameter step) algorithm developed by Schafer (1997). In the MCMC algorithm, one can also specify Bayesian priors for the parameters of the marginal means and variances (Liu et al., 2000), and we used a non-informative Jeffreys prior in our example since we had no prior information on the longitudinal outcomes.
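The augmented vector of outcomes and cross-products can be assembled before it is passed to an imputation routine. The sketch below is one way to do this; it assumes that a cross-product is treated as missing whenever either of its two outcomes is missing, and the function name and the use of `None` for a missing value are illustrative choices rather than anything prescribed by the paper.

```python
from itertools import combinations

def augment_with_cross_products(y):
    """Return [Y_i1, ..., Y_iT, U_i12, U_i13, ..., U_i(T-1)T].

    y is a list of T values coded 0, 1, or None (missing).
    Assumption: U_ist = Y_is * Y_it is missing if either factor is missing,
    so that it is imputed alongside the missing outcomes.
    """
    cross = [
        None if y[s] is None or y[t] is None else y[s] * y[t]
        for s, t in combinations(range(len(y)), 2)
    ]
    return list(y) + cross

# A subject observed at times 1, 3 and 4 but missing at time 2 (T = 4).
row = augment_with_cross_products([1, None, 1, 0])
print(row)
```

For T = 4 the augmented vector has 4 + 6 = 10 entries, matching the dimension T + T(T − 1)/2 implied by the text.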

Even though the true joint distribution of the binary outcomes is not multivariate normal, we use the multivariate normal distribution for imputation purposes. Further, even though the imputed value for the binary Yit from the conditional multivariate normal distribution is continuous, we do not round the continuous value to 0 or 1, but keep the original imputed continuous value in the estimation procedure. Previous work (Horton et al., 2003) has shown that even though the imputed value is continuous, rounding when filling in the missing binary data could increase bias of the resulting estimate. For example, if we consider a simple one-sample case where some subjects are missing and data are MCAR, the observed proportion (mean of the binary variable) is unbiased. If we impute the missing observations from a normal distribution with mean equal to the observed proportion in complete cases, the imputed values (not rounded) of the missing observations will have mean equal to the observed proportion. If we create the imputations by rounding the normally distributed value to 0 or 1, then an imputed value of a missing observation will equal the observed proportion plus an error term that does not have mean 0 and thus the mean of the imputed values will no longer equal the observed proportion. Further, GEE is a quasi-likelihood approach in which one only needs to specify the model for the mean and variance (and correlations). Thus, even with imputed values of Yit that are continuous, one can still implement the GEE approach with marginal mean πit and marginal variance πit(1 − πit).
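The one-sample example above can be checked with a short calculation. The sketch assumes that imputations are drawn from a normal distribution with mean p and variance p(1 − p), and that rounding assigns 1 when the draw exceeds 0.5; both the variance and the cutoff are assumptions made for illustration, as is the value p = 0.8.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mean_after_rounding(p, sd, cutoff=0.5):
    """Expected value of a N(p, sd^2) draw after it is rounded to 1 when it
    exceeds the cutoff and to 0 otherwise."""
    return normal_cdf((p - cutoff) / sd)

p = 0.8                        # observed proportion (illustrative value)
sd = math.sqrt(p * (1 - p))    # assumed standard deviation of the imputation model
mean_unrounded = p             # an unrounded N(p, sd^2) draw has mean p
mean_rounded = mean_after_rounding(p, sd)
print(round(mean_unrounded, 3), round(mean_rounded, 3))
```

Under these assumptions the rounded imputations have mean Φ(0.75) ≈ 0.773 rather than 0.8, whereas the unrounded imputations keep the observed proportion. The two agree only in special cases such as p = 0.5.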

The second multiple imputation approach we consider is the fully conditional specification (FCS) approach (Van Buuren, 2007), or a so-called “chained equation” approach. In this approach, we specify a logistic regression model for each Yit given the Yis’s at all other time points and Xi,

$$\mathrm{pr}(Y_{it} = 1 \mid Y_{i1}, \ldots, Y_{i,t-1}, Y_{i,t+1}, \ldots, Y_{iT}, X_i, \theta_t), \tag{6}$$

where θt is the logistic regression parameter vector from this conditional distribution. In the FCS approach, we do not need to specify the full joint distribution, only the series of conditional distributions given by (6). The FCS approach uses a Markov chain Monte Carlo (MCMC) method known as the Gibbs sampler to generate imputed values from the predictive distribution of the missing data, given the observed data. Briefly, the FCS approach involves iterating between the following steps. At each iteration, the logistic regression model (6) is fit to the observed values of Yit given both the observed and imputed values of Yis (for s ≠ t) and Xi; this yields logistic regression parameter estimates, $\hat{\theta}_t$ (and their associated covariance matrix, say $\hat{C}_t$). A parameter vector θt for each logistic regression model (6) is drawn from the posterior predictive distribution of θt given the observed and imputed data; the posterior predictive distribution is assumed to be multivariate normal with mean $\hat{\theta}_t$ and covariance $\hat{C}_t$. Finally, the parameter vector θt is then used to impute the missing values for Yit (given observed and imputed values of Yis, for s ≠ t). The above steps are iterated a sufficiently large number of times to ensure that the imputed values are (at least approximately) draws from the posterior predictive distribution of the missing data Ym,i, given the observed data Yo,i. This Gibbs sampling approach is particularly appealing given that the series of conditional distributions, specified by the logistic regression models (6), are straightforward to specify and fit. We note that in our example, since we had no prior information on θt, we implicitly used an improper non-informative uniform prior (Zellner & Rossi, 1984) for these parameters.

As with the imputation from the multivariate normal distribution, we explore if bias can be reduced by including all pairwise interactions of the Yis’s at other time points in the logistic regression model for each Yit given the other Yis’s, i.e.,

$$\mathrm{pr}(Y_{it} = 1 \mid Y_{i1}, \ldots, Y_{i,t-1}, Y_{i,t+1}, \ldots, Y_{iT},\; Y_{is}Y_{iu} \text{ for all } s, u \neq t,\; X_i).$$
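A single imputation draw from this conditional model, with main effects and pairwise interactions of the outcomes at the other times, might look like the sketch below. It assumes one scalar covariate x and a parameter vector `theta` that has already been drawn; the function names, the ordering of the predictors, and the numerical values of `theta` are all hypothetical. Refitting the model and redrawing θt at each iteration are not shown.

```python
import math
import random
from itertools import combinations

def fcs_design_row(others, x):
    """Predictors for one subject: intercept, covariate x, main effects of the
    outcomes at the other times, then all pairwise products of those outcomes."""
    pairs = [a * b for a, b in combinations(others, 2)]
    return [1.0, float(x)] + [float(v) for v in others] + [float(v) for v in pairs]

def impute_binary(others, x, theta, rng):
    """Return (probability, imputed 0/1 value) for a missing Y_it."""
    row = fcs_design_row(others, x)
    eta = sum(coef * value for coef, value in zip(theta, row))
    prob = 1.0 / (1.0 + math.exp(-eta))
    return prob, int(rng.random() < prob)

# Hypothetical drawn parameters for T = 4: intercept, x, 3 main effects, 3 interactions.
theta = [0.5, -1.0, 1.0, 1.0, 1.0, -2.0, 0.0, 0.0]
rng = random.Random(2020)
prob, y_imputed = impute_binary([1, 1, 0], x=1, theta=theta, rng=rng)
print(round(prob, 4), y_imputed)
```

Here the linear predictor is 0.5 − 1 + 1 + 1 − 2 = −0.5, so the imputation probability is about 0.378. Dropping the last three entries of `theta` and of the design row gives the main-effects-only version of the model.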

It is of interest to examine the potential bias of the multivariate normal imputation and FCS approaches when the data are MAR. We conjecture that adding the cross-product terms to the multivariate normal outcome vector and the pairwise interactions to the FCS approach can reduce the bias compared to multiple imputation without the inclusion of these cross-product or pairwise interaction terms. We explore this conjecture in a study of asymptotic bias in the following section.

5. Study of Asymptotic Bias

In this section we study the asymptotic bias in estimating β using 5 approaches for handling missing data, including the two proposed in the previous section. For simplicity, we consider a two-group longitudinal design configuration with a binary response measured on four occasions. We assume that half of the individuals are assigned to each of the two groups, i.e., the group indicator variable, xi, equals 0 or 1 with pr(xi = 1) = 0.5. The following marginal model for πit = E(Yit|xi, β) is assumed,

$$\mathrm{logit}(\pi_{it}) = \beta_0 + \beta_1 x_i, \qquad t = 1, 2, 3, 4. \tag{7}$$

So far, we have only specified the marginal distribution of each Yit separately. Next, we assume that the joint distribution of (Yi1, Yi2, Yi3, Yi4|xi) is given by the Bahadur distribution (Bahadur, 1961). To describe the Bahadur distribution, we define the standardized variable Zit to be

$$Z_{it} = \frac{Y_{it} - \pi_{it}}{\{\pi_{it}(1 - \pi_{it})\}^{1/2}}.$$

The pairwise correlation between Yis and Yit is ρst = E(ZisZit); the 3rd-order correlation is defined as ρstu = E(ZisZitZiu); and the 4th-order correlation is defined as ρstuv = E(ZisZitZiuZiv). For subject i (with covariate xi equal to 0 or 1), the Bahadur distribution is

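$$\mathrm{pr}(Y_{i1} = y_1, Y_{i2} = y_2, Y_{i3} = y_3, Y_{i4} = y_4 \mid x_i, \beta) = \left\{\prod_{t=1}^{4} \pi_{it}(\beta)^{y_t}\,[1 - \pi_{it}(\beta)]^{1 - y_t}\right\} \left\{1 + \sum_{s>t} \rho_{st} z_{is} z_{it} + \sum_{s>t>u} \rho_{stu} z_{is} z_{it} z_{iu} + \rho_{1234}\, z_{i1} z_{i2} z_{i3} z_{i4}\right\}. \tag{8}$$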

We note that as long as the pairwise correlations (the ρst’s) are non-zero, the conditional distribution of one Yit given the other three Yis’s depends on the cross-products of pairs of the other Yis’s, even if ρstu = ρ1234 = 0. Thus, any imputation approach that does not include the cross-products could produce bias.
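The Bahadur probabilities in (8) are simple to evaluate directly. The sketch below assumes a common success probability across the four occasions and exchangeable correlations of each order; the function name and the numerical values of the higher-order correlations are illustrative assumptions. It then checks three properties that follow from the form of (8): the 16 cell probabilities sum to 1, the marginal mean equals the specified probability, and the pairwise correlation equals the specified value.

```python
import math
from itertools import combinations, product

def bahadur_pmf(y, pi, rho2, rho3, rho4):
    """pr(Y1..Y4 = y) under (8), with common success probability pi and
    exchangeable 2nd-, 3rd- and 4th-order correlations."""
    sd = math.sqrt(pi * (1 - pi))
    z = [(yt - pi) / sd for yt in y]
    indep = math.prod(pi if yt == 1 else 1 - pi for yt in y)
    pair = sum(z[s] * z[t] for s, t in combinations(range(4), 2))
    triple = sum(z[s] * z[t] * z[u] for s, t, u in combinations(range(4), 3))
    quad = math.prod(z)
    return indep * (1 + rho2 * pair + rho3 * triple + rho4 * quad)

pi = math.exp(1.0) / (1 + math.exp(1.0))   # about 0.73
rho = 0.2
rho3, rho4 = rho ** 3, rho ** 4            # illustrative higher-order values
cells = {y: bahadur_pmf(y, pi, rho, rho3, rho4) for y in product((0, 1), repeat=4)}

total = sum(cells.values())
mean_y1 = sum(p for y, p in cells.items() if y[0] == 1)
p11 = sum(p for y, p in cells.items() if y[0] == 1 and y[1] == 1)
corr12 = (p11 - pi * pi) / (pi * (1 - pi))
print(round(total, 6), round(mean_y1, 6), round(corr12, 6))
```

Not every combination of marginal probabilities and correlations yields non-negative cell probabilities under the Bahadur form, so in practice one would also verify that all 16 values lie in [0, 1].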

In this study, to create a plausible non-monotone MAR missingness model where the probability of being missing depends on previous observed outcomes, we let subjects be missing at times 3 or 4 (or both), but do not allow missingness at times 1 and 2. We define the indicator random variable Rit which equals 1 if Yit is observed and 0 if Yit is missing. We let missingness be non-monotone, so there are 3 possible patterns of (Ri3, Ri4) that define missingness. If a subject is observed at time 3 but missing at time 4, then (Ri3, Ri4) = (1, 0); if a subject is observed at time 4 but missing at time 3, then (Ri3, Ri4) = (0, 1); if a subject is missing at both times 3 and 4, then (Ri3, Ri4) = (0, 0); if a subject is observed at both times 3 and 4, then (Ri3, Ri4) = (1, 1). To allow for non-monotone MAR missingness, we use a simple missing at random model where

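$$\mathrm{pr}(R_{i3} = r_3, R_{i4} = r_4 \mid Y_i, x_i) = \mathrm{pr}(R_{i3} = r_3 \mid y_{i1}, y_{i2}, x_i)\,\mathrm{pr}(R_{i4} = r_4 \mid y_{i1}, y_{i2}, x_i) = \phi_{i3}^{r_3}(1 - \phi_{i3})^{(1 - r_3)}\,\phi_{i4}^{r_4}(1 - \phi_{i4})^{(1 - r_4)}, \tag{9}$$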

where ϕit = pr(Rit = 1|yi1, yi2, xi) for t = 3, 4. Note that the missing data are missing at random since (9) does not depend on the possibly missing outcomes, (Yi3, Yi4). For simplicity, we also let the logistic regression models for ϕi3 and ϕi4 be identical

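$$\mathrm{logit}(\phi_{i3}) = \mathrm{logit}(\phi_{i4}) = \gamma_0 + \gamma_1 x_i + \gamma_2 y_{i1} + \gamma_3 y_{i2} + \gamma_{23}\, y_{i1} y_{i2}. \tag{10}$$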
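Equations (9) and (10) together give the probability of each missingness pattern for a subject with given (xi, yi1, yi2). A minimal sketch follows; the function names are hypothetical, and the γ values are those of the first missing data model in Table 1.

```python
import math

def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))

def phi(x, y1, y2, gamma):
    """pr(R_it = 1 | y_i1, y_i2, x_i) from (10); identical for t = 3 and t = 4."""
    g0, g1, g2, g3, g23 = gamma
    return expit(g0 + g1 * x + g2 * y1 + g3 * y2 + g23 * y1 * y2)

def pattern_probs(x, y1, y2, gamma):
    """Probabilities of the four (R_i3, R_i4) patterns from (9), keyed by (r3, r4)."""
    p = phi(x, y1, y2, gamma)
    return {(1, 1): p * p, (1, 0): p * (1 - p), (0, 1): (1 - p) * p, (0, 0): (1 - p) ** 2}

gamma = (0.17, -1.0, 2.0, 2.0, -4.0)
probs = pattern_probs(x=0, y1=1, y2=1, gamma=gamma)
print({k: round(v, 3) for k, v in probs.items()})
```

With γ23 = −4, subjects with yi1 = yi2 = 1 have the same linear predictor as subjects with yi1 = yi2 = 0, while subjects with discordant outcomes are more likely to be observed. These are conditional probabilities; the marginal pattern probabilities reported in Table 1 additionally average over the joint distribution of (Yi1, Yi2) and xi.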

Letting $\hat{\beta}$ be the estimated parameter vector from a given approach, the asymptotic bias of that approach is defined as $E_A[\hat{\beta} - \beta] = (\beta^{*} - \beta)$, where β is the true parameter value and $E_A[\hat{\beta}] = \beta^{*}$ is the asymptotic expectation of $\hat{\beta}$. With a discrete set of outcomes and covariates, Rotnitzky & Wypij (1994) showed that the asymptotic bias of an approach can be determined by considering an artificial sample of weighted observations for each possible realization of (Yi1, Yi2, Yi3, Yi4, Ri3, Ri4|xi). Since each Yit, Rit and xi are binary, the artificial sample will contain 2^7 = 128 observations. The weight for each observation is given by its respective joint probability

$$w_i = \mathrm{pr}(Y_{i1} = y_1, Y_{i2} = y_2, Y_{i3} = y_3, Y_{i4} = y_4, R_{i3} = r_3, R_{i4} = r_4 \mid x_i).$$

To then obtain the asymptotic expectation β* for a given estimation approach, we solve for β* in the usual way for each approach, with each individual’s contribution to the GEE weighted by its respective joint probability. Basically, we create a dataset of all possible values of (Yi1, Yi2, Yi3, Yi4, Ri3, Ri4, xi) with corresponding weight wi, and fit a weighted GEE in a program like SAS Proc Genmod (SAS Institute Inc, 2020) or the R package gee (Carey et al., 2012) to get β*.
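The artificial sample can be enumerated mechanically. The sketch below builds the 128 weighted rows from two user-supplied functions: a joint probability for the four outcomes given x and an observation probability given (x, y1, y2). The two functions used in the demonstration are deliberately simple placeholders (independent outcomes and a two-valued observation probability), not the Bahadur distribution (8) or the logistic model (10); those could be substituted in their place. Fitting the weighted GEE to the resulting rows is not shown.

```python
from itertools import product

def artificial_sample(joint_pmf, phi):
    """Enumerate every (x, y1..y4, r3, r4) cell with its weight
    w = pr(Y = y | x) * pr(R3 = r3, R4 = r4 | y1, y2, x),
    where R3 and R4 are conditionally independent with a common
    observation probability phi(x, y1, y2), as in (9)."""
    rows = []
    for x in (0, 1):
        for y in product((0, 1), repeat=4):
            p_obs = phi(x, y[0], y[1])
            for r3, r4 in product((0, 1), repeat=2):
                p_r = (p_obs if r3 else 1 - p_obs) * (p_obs if r4 else 1 - p_obs)
                rows.append((x, y, r3, r4, joint_pmf(y, x) * p_r))
    return rows

def independent_pmf(y, x):
    """Placeholder joint distribution: independent Bernoulli outcomes."""
    pi = 0.82 if x == 1 else 0.73
    prob = 1.0
    for yt in y:
        prob *= pi if yt == 1 else 1 - pi
    return prob

def phi_placeholder(x, y1, y2):
    """Placeholder observation probability depending only on (y1, y2)."""
    return 0.9 if y1 != y2 else 0.6

rows = artificial_sample(independent_pmf, phi_placeholder)
weight_x0 = sum(w for x, y, r3, r4, w in rows if x == 0)
print(len(rows), round(weight_x0, 6))
```

Because the weights are conditional on xi, they sum to 1 within each group; multiplying by pr(xi = 1) = 0.5 gives weights that sum to 1 over the whole artificial sample.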

For the study of asymptotic bias, we let the true β0 = 1.0, β1 = 0.5 in (7). In our simulations, a subject’s probability of success is either exp(1.5)/(1 + exp(1.5)) = 0.82 when x = 1 or exp(1)/(1 + exp(1)) = 0.73 when x = 0. Thus, our asymptotic study considers the case with a high probability of success. We let the true correlation model be exchangeable with value ρ = ρst = 0.2 and 0.4. We let ρstu = ρ^3 and ρstuv = ρstu^4. We also include missing data models with an interaction between Yi1 and Yi2 (γ23 = −4) and without an interaction (γ23 = 0) to determine if this interaction in the missingness model affects the results.

Also, even though the true marginal model only has a group effect, with no time or time by group interaction, we fit a marginal model for each approach that includes these terms,

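$$\mathrm{logit}(\pi_{it}) = \beta_0 + \beta_1 x_i + \beta_2 I(t=2) + \beta_3 I(t=3) + \beta_4 I(t=4) + \beta_{12}\, x_i I(t=2) + \beta_{13}\, x_i I(t=3) + \beta_{14}\, x_i I(t=4), \tag{11}$$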

where I(·) is an indicator function. We emphasize here that we fit an overspecified model that included the time by group interaction terms, even though the true model only has a non-zero main effect of group. In most longitudinal studies, the main interest is in assessing if the trends over time differ in the two groups, which is represented by the time by group interaction. Thus, our interest is in determining if the different approaches give biased estimates of the time by group interaction, and not in the main effects in the model. Any approach that gives a non-zero asymptotic expectation of the time by group interaction terms will be a biased approach. In particular, if there is no asymptotic bias, then for a given approach, the time by group interaction terms should all converge to 0, i.e., β12*=β13*=β14*=0.

In the asymptotic study, we fit GEE with exchangeable correlation for all approaches (for standard GEE and within imputation). Further, for each multiple imputation approach, we performed 1000 multiple imputations (the multiple imputation estimate is again the average of the estimates over the 1000 imputations). The asymptotic bias from the imputation approaches is the asymptotic bias specifically for K = 1000 imputations as N → ∞. The asymptotic bias does depend on the number of imputations, but 1000 imputations is large enough to minimize any bias due to a small number of imputations (Graham et al., 2007).

Table 1 gives the asymptotic bias of the various approaches for different values of the missingness parameters γ. In Table 1, we denote multivariate normal imputation without cross-products as MVN-MI, i.e., assuming

(Yi1,Yi2,Yi3,Yi4)

given xi is multivariate normal; we denote multivariate normal imputation with cross-products as MVN-MI-cross, i.e., assuming

(Yi1,Yi2,Yi3,Yi4,Yi1Yi2,Yi1Yi3,Yi1Yi4,Yi2Yi3,Yi2Yi4,Yi3Yi4)

given xi is multivariate normal; we denote FCS imputation without pairwise interactions as FCS-MI, with a logistic regression for the conditional probability

pr(Yit=1|Yis,Yiu,Yiv,xi);

and we denote FCS imputation with pairwise interactions as FCS-MI-interact, with a logistic regression for the conditional probability

pr(Yit=1|Yis,Yiu,Yiv,YisYiu,YisYiv,YiuYiv,xi).

We note that in all of these multiple imputation approaches, we assumed that the associations between outcomes at pairs of time are different, and thus that there is an ‘unstructured covariance’ matrix for the vector of outcomes for a subject.

Table 1.

The vector value β* to which $\hat{\beta}$ converges. The marginal logistic model has parameters β12 = β13 = β14 = 0 with exchangeable correlation, ρ.

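| pr(Ri3, Ri4) = (1, 1) | pr(Ri3, Ri4) = (0, 1) | pr(Ri3, Ri4) = (1, 0) | pr(Ri3, Ri4) = (0, 0) | Missing data model (γ0, γ1, γ2, γ3, γ23)^a | Approach | ρ = 0.2: β12* | ρ = 0.2: β13* | ρ = 0.2: β14* | ρ = 0.4: β12* | ρ = 0.4: β13* | ρ = 0.4: β14* |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.30 | 0.21 | 0.21 | 0.28 | (0.17, −1, 2, 2, −4) | GEE | 0.000 | −0.143 | −0.143 | 0.000 | −0.432 | −0.432 |
| | | | | | MVN-MI | 0.000 | −0.118 | −0.113 | 0.000 | −0.470 | −0.471 |
| | | | | | MVN-MI-cross | 0.000 | 0.009 | 0.006 | 0.000 | −0.002 | −0.001 |
| | | | | | FCS-MI | 0.000 | −0.166 | −0.164 | 0.000 | −0.518 | −0.517 |
| | | | | | FCS-MI-interact | 0.000 | −0.005 | −0.004 | 0.000 | −0.003 | −0.001 |
| 0.49 | 0.19 | 0.19 | 0.13 | (1, −1, 2, 2, −4) | GEE | 0.000 | −0.090 | −0.090 | 0.000 | −0.260 | −0.260 |
| | | | | | MVN-MI | 0.000 | −0.065 | −0.068 | 0.000 | −0.268 | −0.261 |
| | | | | | MVN-MI-cross | 0.000 | 0.001 | 0.002 | 0.000 | 0.002 | 0.004 |
| | | | | | FCS-MI | 0.000 | −0.094 | −0.091 | 0.000 | −0.296 | −0.298 |
| | | | | | FCS-MI-interact | 0.000 | 0.000 | −0.003 | 0.000 | 0.002 | 0.000 |
| 0.34 | 0.17 | 0.17 | 0.32 | (1.5, −1, −1, −1, 0) | GEE | 0.000 | −0.004 | −0.004 | 0.000 | −0.001 | −0.001 |
| | | | | | MVN-MI | 0.000 | −0.047 | −0.049 | 0.000 | −0.162 | −0.163 |
| | | | | | MVN-MI-cross | 0.000 | 0.005 | −0.001 | 0.000 | −0.003 | −0.008 |
| | | | | | FCS-MI | 0.000 | −0.063 | −0.070 | 0.000 | −0.185 | −0.187 |
| | | | | | FCS-MI-interact | 0.000 | −0.002 | 0.002 | 0.000 | −0.002 | −0.006 |
| 0.52 | 0.16 | 0.16 | 0.16 | (2.5, −1, −1, −1, 0) | GEE | 0.000 | −0.022 | −0.022 | 0.000 | −0.052 | −0.052 |
| | | | | | MVN-MI | 0.000 | −0.033 | −0.022 | 0.000 | −0.090 | −0.089 |
| | | | | | MVN-MI-cross | 0.000 | 0.001 | 0.001 | 0.000 | 0.000 | −0.003 |
| | | | | | FCS-MI | 0.000 | −0.037 | −0.032 | 0.000 | −0.105 | −0.103 |
| | | | | | FCS-MI-interact | 0.000 | −0.002 | 0.002 | 0.000 | −0.001 | 0.003 |

a The missing data model is logit(ϕi3) = logit(ϕi4) = γ0 + γ1xi + γ2yi1 + γ3yi2 + γ23yi1yi2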

MVN-MI is multivariate normal imputation without cross-products

MVN-MI-cross is multivariate normal imputation with cross-products

FCS-MI is FCS imputation without pairwise interactions

FCS-MI-interact is FCS imputation with pairwise interactions

We see from Table 1 that standard GEE based on the observed data only has substantial bias when the true missingness mechanism depends on the interaction between Yi1 and Yi2, but has much less bias when this interaction effect is 0. When this interaction term is non-zero, standard GEE has its largest bias when the proportion missing is the highest. Multiple imputation with the multivariate normal without cross-products and FCS without pairwise interactions in the conditional models perform similarly in Table 1. When the interaction term in the missingness model is not 0, multivariate normal without cross-products and FCS without pairwise interactions perform very similarly to standard GEE; in this case, imputing does not reduce the bias. On the other hand, multiple imputation with the multivariate normal with cross-products and FCS with pairwise interactions reduce the bias to a negligible amount. When the interaction term in the missingness model equals 0, multivariate normal without cross-products and FCS without pairwise interactions perform very similarly to each other, but actually give more bias than standard GEE. In this case, imputing without cross-products does not reduce the bias. On the other hand, multiple imputation with the multivariate normal with cross-products and FCS with pairwise interactions again reduce the bias to a negligible amount.

Typically, for most longitudinal studies, the marginal parameters β are the parameters of main scientific interest, while ρ is considered to be a nuisance parameter. Thus, the bias for the correlation is not displayed in Table 1. However, the estimate of the exchangeable correlation (the working correlation posed for all approaches) had less than 10% relative asymptotic bias under every approach. Also, although not shown here, for each of the multiple imputation approaches, we also fit GEE under independence and autoregressive correlation structures, and the asymptotic bias was very similar to that of GEE with exchangeable structure. This agrees with the GEE theory for a single dataset with no missing data (a balanced dataset across time), in that the GEE approach for estimating the marginal model is robust to the ‘working correlation’ model when there are no missing data (Liang & Zeger, 1986). In particular, once we impute the missing values in the dataset for a given imputation approach, we have a dataset with no missing data, so that any ‘working correlation’ will produce similar estimates of the marginal parameters. Thus, the asymptotic bias in our study would not be due to a mis-specification of the ‘working correlation’, but due to the imputation approach.

In summary, this asymptotic study suggests that when using GEE for longitudinal studies with missing data, one should at the very least perform a sensitivity analysis using multiple imputation with the multivariate normal with cross-products and/or FCS with pairwise interactions in the conditional models.
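As a rough illustration of what "with cross-products" means for the multivariate normal approach, the sketch below appends all pairwise products of the binary outcomes to the outcome vector before imputation. It is a minimal sketch, not the authors' implementation: the function name `add_cross_products` and the toy data are hypothetical, and a product is treated as missing whenever either of its factors is missing.

```python
from itertools import combinations

import numpy as np


def add_cross_products(y):
    """Append all pairwise products Y_s * Y_t (s < t) as extra columns.

    y is an (n, T) array of 0/1 outcomes with np.nan marking missing values.
    A product is missing (nan) whenever either factor is missing.
    """
    y = np.asarray(y, dtype=float)
    pairs = list(combinations(range(y.shape[1]), 2))
    products = np.column_stack([y[:, s] * y[:, t] for s, t in pairs])
    return np.hstack([y, products]), pairs


# Hypothetical toy data: 3 subjects, T = 4 occasions, non-monotone missingness.
y = np.array([
    [1, 0, np.nan, 1],
    [0, 1, 1, np.nan],
    [1, 1, 0, 0],
])
augmented, pairs = add_cross_products(y)
print(augmented.shape)  # 4 outcomes plus 6 pairwise products per subject
```

The augmented matrix would then be passed to whatever multivariate normal imputation routine is in use; that step is omitted here because it depends on the software.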

6. Application

In this section, we illustrate the application of the proposed methodology to the analysis of the National Institute on Drug Abuse Collaborative Cocaine Treatment Study (CCTS), the longitudinal clinical trial dataset described earlier. Full details on the design and procedures of the CCTS can be found in previous publications on the trial (e.g., Crits-Christoph et al., 1999). Briefly, the CCTS was a multi-site randomized clinical trial that compared the efficacy of four psychosocial treatments for cocaine dependence. Each treatment consisted of 6 months of active phase treatment. A total of 487 outpatients, who were diagnosed with DSM-IV cocaine dependence and had used cocaine during the past 30 days, were randomly assigned to one of the four treatment conditions. At intake, patients on average used cocaine 10 days out of the past 30 and had been using cocaine for an average of 7 years (SD = 4.8). A composite cocaine use outcome measure, constructed by pooling information from self-report data and weekly observed urine samples, was used to code each month of treatment as abstinent versus any cocaine use. Thus, in the CCTS there were 6 monthly assessments of cocaine use during follow-up (corresponding to t = 1, 2, 3, 4, 5, 6), yielding repeated measures of a binary response. That is, the response variable of interest at each occasion is cocaine use, Yit, which equals 1 if the patient was found to be using at time t and equals 0 otherwise. The main scientific interest is in determining any treatment differences in the cocaine use profile over the treatment period. In this study, 41.5% of subjects were missing at least one of the six monthly cocaine use assessments and it was thought likely that missingness is not completely at random; for the analyses presented here, we assume missingness is at random (MAR). We must emphasize that the assumption of MAR (which also includes MCAR) is an unverifiable assumption in the sense that MAR cannot be distinguished from not missing at random (NMAR) based on the distribution of the observed data.

We considered the following marginal logistic regression model as a function of three indicator variables for treatment group (with IDC chosen as the reference group),

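$$\log\left[\frac{\pi_{it}}{1-\pi_{it}}\right] = \beta_0 + \beta_1 I(\mathrm{CT}) + \beta_2 I(\mathrm{SE}) + \beta_3 I(\mathrm{GDC}) + \boldsymbol{\beta}_{\mathrm{time}}^{\prime}\,\mathbf{t}, \qquad t = 1, 2, 3, 4, 5, 6,$$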

where I(·) is an indicator for treatment group, and the vector t contains 5 indicators for categorical time effects. Because the 6 repeated binary outcomes are obtained post-randomization, comparisons of the treatment groups are made in terms of rates of cocaine use during the 6 months of active treatment, $\sum_{t=1}^{6} \pi_{it}(\beta)$. That is, the expected frequency of cocaine use during the six months is determined by the marginal probabilities of the six binary outcomes, pr(Yit = 1|xit, β), and estimates of the marginal regression parameters, β, are used as the basis for inference about treatment group differences in the rates of cocaine use during the 6 months of active treatment. To account for the association among the binary responses, an autoregressive correlation model, $\mathrm{Corr}(Y_{is}, Y_{it}) = \rho^{|s-t|}$, was fit to the data. As was noted earlier, there was a substantial amount of missing data on the 6 binary outcomes. We accounted for the missing data using multiple imputation and performed 100 imputations for each imputation approach.
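To make the two model components concrete, the following sketch evaluates the six marginal probabilities from a logistic model with categorical time effects, sums them to give the expected number of months with cocaine use, and builds the autoregressive working correlation matrix. All numerical values (intercept, treatment effect, time effects, and ρ) are hypothetical and chosen only for illustration; they are not estimates from the CCTS.

```python
import numpy as np


def marginal_probs(beta0, beta_trt, beta_time):
    """Marginal probabilities pi_t = expit(beta0 + beta_trt + beta_time[t]).

    beta_time has one entry per occasion; the reference occasion is coded 0.
    """
    eta = beta0 + beta_trt + np.asarray(beta_time, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))


def ar1_corr(rho, n_times):
    """Autoregressive working correlation: Corr(Y_s, Y_t) = rho ** |s - t|."""
    idx = np.arange(n_times)
    return rho ** np.abs(idx[:, None] - idx[None, :])


# Hypothetical parameter values (NOT taken from the paper).
beta_time = [0.0, -0.1, -0.2, -0.3, -0.4, -0.5]
pi_ref = marginal_probs(beta0=0.2, beta_trt=0.0, beta_time=beta_time)
pi_trt = marginal_probs(beta0=0.2, beta_trt=0.5, beta_time=beta_time)

# Expected number of months (out of 6) with cocaine use in each group.
expected_months_ref = pi_ref.sum()
expected_months_trt = pi_trt.sum()

R = ar1_corr(rho=0.5, n_times=6)
print(expected_months_ref, expected_months_trt)
print(R.round(3))
```

A positive treatment coefficient raises every monthly probability, and hence the expected number of months of use, relative to the reference group.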

We compare the estimates of the treatment effects (β1, β2, β3) using five alternative approaches: 1) the standard GEE approach; 2) multiple imputation from the multivariate normal without cross-products in the multivariate normal outcome; 3) multiple imputation from the multivariate normal with cross-products in the multivariate normal outcome; 4) FCS imputations with no pairwise interactions in the model for Yit given the other elements of the vector Yi; and 5) FCS imputations with pairwise interactions between the elements of the vector Yi as additional covariates in the model for Yit. We note that for all imputation approaches, when imputing missing data, no distinction is made between monotone and non-monotone patterns (i.e., imputing Ym,i given Yo,i does not differ whether or not the elements of Yo,i have a monotone pattern.)
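Each imputation approach produces multiple completed datasets, and the GEE estimates from those datasets have to be combined into a single estimate and standard error. The sketch below shows the standard combining rules of Rubin (1987); we assume, without confirmation from this section, that these are the rules applied to the 100 imputations. The function name `pool_rubin` and the three example estimates are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates using Rubin's (1987) rules.

    estimates: point estimates, one per imputed dataset.
    variances: squared standard errors, one per imputed dataset.
    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled point estimate
    within = sum(variances) / m                     # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between      # total variance
    return q_bar, math.sqrt(total)


# Hypothetical estimates of one coefficient from m = 3 imputed datasets.
pooled, pooled_se = pool_rubin([0.50, 0.52, 0.54], [0.04, 0.04, 0.04])
print(pooled, pooled_se)
```

The between-imputation term is what carries the extra uncertainty due to the missing data; with no variation across imputations the pooled standard error reduces to the usual complete-data one.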

Among the five approaches, the parameter estimates are similar (see Table 2). When compared to the reference group, IDC, only CT has higher rates of within-treatment drug use. This is an important clinical study, and even though the estimates using all approaches are similar, it was important for the clinical investigators that a sensitivity analysis was performed to assess whether the non-monotone missing data could lead to biased results. Thus, it is reassuring (to our clinical collaborators) that in this application all approaches yield similar results. We note here that since the estimates are so similar between standard GEE (the approach that is unbiased under MCAR) and all of the imputation approaches (which are unbiased under MAR assuming the imputation models are correct), it is tempting to conclude that the data are likely to be MCAR. However, we caution against such an interpretation because the assumption of MAR (which also includes MCAR) is an unverifiable assumption in the sense that MAR cannot be distinguished from not missing at random (NMAR) based on the distribution of the observed data.

Table 2.

Logistic regression parameter estimates for the CCTS longitudinal clinical trial

Effect      Method^a          Estimate   Standard Error   p-value
INTERCEPT   GEE               0.211      0.139            0.130
            MVN-MI            0.197      0.139            0.155
            MVN-MI-cross      0.192      0.142            0.176
            FCS-MI            0.198      0.137            0.148
            FCS-MI-interact   0.173      0.139            0.216
CT          GEE               0.516      0.207            0.013
            MVN-MI            0.507      0.206            0.014
            MVN-MI-cross      0.525      0.211            0.013
            FCS-MI            0.516      0.205            0.012
            FCS-MI-interact   0.499      0.201            0.013
SE          GEE               0.387      0.203            0.056
            MVN-MI            0.363      0.202            0.073
            MVN-MI-cross      0.365      0.207            0.078
            FCS-MI            0.359      0.199            0.072
            FCS-MI-interact   0.377      0.202            0.062
GDC         GEE               0.249      0.203            0.220
            MVN-MI            0.270      0.202            0.183
            MVN-MI-cross      0.262      0.206            0.202
            FCS-MI            0.266      0.201            0.185
            FCS-MI-interact   0.226      0.200            0.259

MVN-MI is multivariate normal imputation without cross-products

MVN-MI-cross is multivariate normal imputation with cross-products

FCS-MI is FCS imputation without pairwise interactions

FCS-MI-interact is FCS imputation with pairwise interactions
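To help read Table 2, the sketch below converts the standard GEE estimates for the three treatment effects into odds ratios (relative to IDC) with approximate 95% confidence intervals, and recomputes two-sided Wald p-values. It assumes the reported p-values are Wald tests against a standard normal reference, which the table does not state explicitly; small differences from the published p-values are expected because the inputs are rounded to three decimals.

```python
import math

# Standard GEE rows from Table 2: (estimate, standard error, reported p-value).
gee_rows = {
    "CT": (0.516, 0.207, 0.013),
    "SE": (0.387, 0.203, 0.056),
    "GDC": (0.249, 0.203, 0.220),
}


def wald_summary(estimate, se):
    """Odds ratio, approximate 95% CI, and two-sided normal-theory p-value."""
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    odds_ratio = math.exp(estimate)
    ci = (math.exp(estimate - 1.96 * se), math.exp(estimate + 1.96 * se))
    return odds_ratio, ci, p_value


results = {name: wald_summary(est, se) for name, (est, se, _) in gee_rows.items()}
for name, (odds_ratio, ci, p_value) in results.items():
    print(f"{name}: OR={odds_ratio:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f}), p={p_value:.3f}")
```

Only the interval for CT excludes an odds ratio of 1, which matches the statement that only CT shows higher rates of within-treatment drug use than IDC.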

7. Discussion

In this paper we consider multiple imputation approaches for estimating parameters of a marginal model using GEE. The results of the asymptotic study suggest that multiple imputation using either a multivariate normal approach with additional cross-product terms or an FCS approach with interactions yields marginal regression parameter estimates with much less bias than multiple imputation without the cross-product terms or interactions. Further, when the true missing data model does not contain the interaction, it was found that the multiple imputation approaches without cross-product terms yielded larger bias than standard GEE based on the observed data only. Thus, in our study of asymptotic bias, MI without cross-product terms did not perform any better than standard GEE, and in some cases performed worse. Our conjecture is that since the true Bahadur model in the asymptotic study is such that the conditional distribution of any Yit given the other three Yis’s depends on the cross-products of pairs, any imputation approach that fails to include these cross-products could produce bias, which is why the multivariate normal approach without cross-product terms and the FCS approach without interactions had the largest bias. In summary, the purpose of this paper was to “enrich” the information (the cross-product terms) in the conditional distribution of the missing data given the observed data in the imputation step in order to reduce the bias of the resulting multiple imputation estimates.

Because of the broad range of possible data configurations with multiple covariate and missing data distributions, it is difficult to draw definitive conclusions from this study of asymptotic bias. Nonetheless, in the asymptotic study reported here, MI with cross-product terms performs discernibly better than MI without cross-product terms. At the very least, we suggest investigators conduct a sensitivity analysis with inclusion of appropriate cross-product terms in the imputation model to ensure that missing data will not lead to substantial bias in estimation via GEE. Further, in this paper, we focused on developing new imputation approaches to minimize the bias. Important future work should explore other performance criteria such as mean squared error. Our asymptotic study with non-monotone missingness has T = 4 time points, but we expect similar bias with more than 4 time points. Even though our approaches should minimize bias for any T, with increasing T, the computational burden of the proposed imputation approaches increases due to increases in both the number of pairwise products (and interactions) and the number of missing data patterns. Thus, the main issue with increased T is the computational burden.
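The growth in computational burden can be quantified with simple counting. The sketch below, under the assumption that every observed/missing configuration of the T outcomes is possible, counts the pairwise products, T(T − 1)/2, and the upper bound on the number of missing data patterns, 2^T (which includes the fully observed and fully missing patterns). The function name `imputation_burden` is hypothetical.

```python
from math import comb


def imputation_burden(n_times):
    """Count pairwise products and the maximum number of missingness patterns."""
    return {
        "pairwise_products": comb(n_times, 2),   # T * (T - 1) / 2
        "max_missing_patterns": 2 ** n_times,    # each outcome observed or missing
    }


for n_times in (4, 6, 10):
    print(n_times, imputation_burden(n_times))
```

Going from the T = 4 occasions of the asymptotic study to the T = 6 occasions of the CCTS raises the number of pairwise products from 6 to 15 and the maximum number of patterns from 16 to 64.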

Although beyond the scope of this paper, it would be of interest to explore how the MI approach with cross-products performs for other types of non-Gaussian data besides binary data. Finally, note that if the missing outcome data are assumed to be non-ignorable (i.e., the probability of missingness depends on unobserved data) with non-monotone missing data patterns, then multiple imputation lends itself well to non-ignorable sensitivity analyses (Carpenter & Kenward, 2013); this topic is also beyond the scope of this paper.

Acknowledgments

The authors were supported by grants from the National Institutes of Health, National Institute on Drug Abuse [NIDA R33 DA042847, UG1 DA015831, K24 DA022288].

Contributor Information

Stuart R. Lipsitz, Brigham and Women’s Hospital and Ariadne Labs, Boston, MA, U.S.A

Garrett M. Fitzmaurice, McLean Hospital, Belmont, MA, U.S.A

Roger D. Weiss, McLean Hospital, Belmont, MA, U.S.A

References

  1. Bahadur RR (1961). A representation of the joint distribution of responses to n dichotomous items. In Studies in Item Analysis and Prediction, Ed. Solomon H, pp. 158–68. Stanford Mathematical Studies in the Social Sciences VI. Stanford University Press.
  2. Barnard J, & Meng XL (1999). Applications of Multiple Imputation in Medical Studies: From AIDS to NHANES. Statistical Methods in Medical Research, 8, 17–36.
  3. Beunckens C, Sotto C, & Molenberghs G (2008). A simulation study comparing weighted estimating equations with multiple imputation based estimating equations for longitudinal binary data. Computational Statistics & Data Analysis, 52, 1533–1548.
  4. Carey V, Zeger SL, & Diggle PJ (1993). Modelling multivariate binary data with alternating logistic regressions. Biometrika, 80, 517–526.
  5. Carey VJ, Lumley T, & Ripley BD (2012). gee: Generalized Estimation Equation Solver. R package version 4.13–18. URL http://CRAN.R-project.org/package=gee
  6. Carpenter JR, & Kenward MG (2013). Multiple Imputation and Its Application. New York: Wiley.
  7. Crits-Christoph P, et al. (1999). Psychosocial treatments for cocaine dependence: National Institute on Drug Abuse Collaborative Cocaine Treatment Study. Archives of General Psychiatry, 56, 493–502.
  8. Enders CK (2010). Applied Missing Data Analysis. New York: The Guilford Press.
  9. Gilks WR, Richardson S, & Spiegelhalter DJ, Eds. (1996). Markov Chain Monte Carlo in Practice. New York: Chapman & Hall.
  10. Graham JW, Olchowski AE, & Gilreath TD (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci, 8, 206–213.
  11. Horton NJ, & Lipsitz SR (2001). Multiple Imputation in Practice: Comparison of Software Packages for Regression Models with Missing Variables. American Statistician, 55, 244–254.
  12. Horton NJ, Parzen M, & Lipsitz SR (2003). A Potential for Bias When Rounding in Multiple Imputation. Am Stat, 57, 229–232.
  13. Johnson RA, & Wichern DW (2002). Applied multivariate statistical analysis. Upper Saddle River, NJ: Prentice Hall.
  14. Laird NM (1988). Missing data in longitudinal studies. Statistics in Medicine, 7, 305–315.
  15. Liang KY, & Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
  16. Lipsitz SR, Laird NM, & Harrington DP (1992). A three-stage estimator for studies with repeated and possibly missing binary outcomes. Applied Statistics, 41, 203–213.
  17. Lipsitz SR, Fitzmaurice GM, Orav EJ, & Laird NM (1994). Performance of generalized estimating equations in practical situations. Biometrics, 50, 270–278.
  18. Lipsitz SR, Molenberghs G, Fitzmaurice GM, & Ibrahim J (2000). GEE with Gaussian estimation of the correlations when data are incomplete. Biometrics, 56, 528–536.
  19. Little RJA, & Rubin DB (2002). Statistical Analysis with Missing Data. 2nd ed. New York: John Wiley & Sons.
  20. Liu M, Taylor JM, & Belin TR (2000). Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics, 56, 1157–1163.
  21. Paik M (1997). The generalized estimating equation approach when data are not missing completely at random. J. Amer. Statist. Assoc, 92, 1320–1329.
  22. Robins JM, Rotnitzky A, & Zhao LP (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Amer. Statist. Assoc, 90, 106–121.
  23. Rotnitzky A, & Wypij D (1994). A note on the bias of estimators with missing data. Biometrics, 50, 1163–1170.
  24. Rubin DB (1976). Inference and missing data. Biometrika, 63, 581–592.
  25. Rubin DB (1978). Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. In Proceedings of the International Statistical Institute, Manila, 517–532.
  26. Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
  27. Rubin DB, & Schenker N (1986). Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. JASA, 81, 366–374.
  28. SAS Institute Inc (2020). SAS/STAT Software, Version 9.4. Cary, NC: URL http://www.sas.com/.
  29. Schafer JL (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall Ltd.
  30. Schafer JL (1999). Multiple Imputation: A Primer. Statistical Methods in Medical Research, 8, 3–15.
  31. Scheuren F (2005). Multiple imputation: How it began and continues. The American Statistician, 59, 315–319.
  32. Tchetgen E, Wang L, & Sun B (2017). Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Unpublished Manuscript. Archived as arXiv:1607.02631v3 [stat.ME] at: https://arxiv.org/abs/1607.02631v3.
  33. Van Buuren S (2007). Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification. Statistical Methods in Medical Research, 16, 219–242.
  34. Van Buuren S, & Groothuis-Oudshoorn K (2011). mice: multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–68.
  35. Van Buuren S (2012). Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC.
  36. Zellner A, & Rossi PE (1984). Bayesian analysis of dichotomous quantal response models. Journal of Econometrics, 25, 365–393.
