Using multiple imputation with GEE with non-monotone missing longitudinal binary outcomes

Stuart R Lipsitz; Garrett M Fitzmaurice; Roger D Weiss

doi:10.1007/s11336-020-09729-y

Author manuscript; available in PMC: 2021 Dec 1.

Published in final edited form as: Psychometrika. 2020 Oct 2;85(4):890–904. doi: 10.1007/s11336-020-09729-y

PMCID: PMC7855014 NIHMSID: NIHMS1645646 PMID: 33006740

Summary

This paper considers multiple imputation (MI) approaches for handling non-monotone missing longitudinal binary responses when estimating parameters of a marginal model using generalized estimating equations (GEE). GEE has been shown to yield consistent estimates of the regression parameters for a marginal model when data are missing completely at random (MCAR). However, when data are missing at random (MAR), the GEE estimates may not be consistent; the MI approaches proposed in this paper minimize bias under MAR. The first MI approach proposed is based on a multivariate normal distribution, but with the addition of pairwise products among the binary outcomes to the multivariate normal vector. Even though the multivariate normal does not impute 0 or 1 values for the missing binary responses, as discussed by Horton et al. (2003), we suggest not rounding when filling in the missing binary data because it could increase bias. The second MI approach considered is the fully conditional specification (FCS) approach. In this approach, we specify a logistic regression model for each outcome given the outcomes at other time points and the covariates. Typically, one would only include main effects of the outcomes at the other times as predictors in the FCS approach, but we explore whether bias can be reduced by also including pairwise interactions of the outcomes at other time points in the FCS. In a study of asymptotic bias with non-monotone missing data, the proposed MI approaches are also compared to GEE without imputation. Finally, the proposed methods are illustrated using data from a longitudinal clinical trial comparing four psychosocial treatments from the National Institute on Drug Abuse Collaborative Cocaine Treatment Study, where patients’ cocaine use is collected monthly for 6 months during treatment.

Keywords: Fully conditional specification, generalized estimating equations, missing completely at random, missing at random, multivariate normal

1. Introduction

Longitudinal studies in which each subject is to be observed at a fixed number of times are common in medicine. In this paper we consider statistical methods for the analysis of such data when the outcome is binary (e.g., success or failure) and the missing data pattern is not monotone; e.g., a subject’s outcome variable can be observed at one time point, missing at the next time point, and then observed at later time point(s).

Our motivating example is a longitudinal clinical trial from the National Institute on Drug Abuse Collaborative Cocaine Treatment Study (CCTS) whose goal was to compare four psychosocial treatments to reduce cocaine use in patients with cocaine dependence (Crits-Christoph, 1999). A total of 487 patients with a principal diagnosis of cocaine dependence were randomized to one of the four treatments. In the CCTS, cocaine use was assessed monthly for the 6-month duration of treatment; the main outcome at each time point was cocaine use (yes/no) as determined by urine screen. The main interest is in determining whether the treatments differ in terms of reducing cocaine use during the 6 months of treatment. However, a feature of this study which complicates the analysis is that only 285 (58.5 %) of the patients have measurements at all 6 occasions. With 6 time points, there are 2⁶ − 1 = 63 possible missing data patterns in the dataset, although only 42 of these patterns are found in the dataset. To appreciate the extent of missingness over time, we can examine the percent of subjects observed at each time point. At time 1, 437 (89.7 %) subjects are observed; at time 2, 407 (83.6 %); at time 3, 390 (80.1 %); at time 4, 386 (79.3 %); at time 5, 380 (78.0 %); and at time 6, 377 (77.4 %). We see that the percent observed decreases slightly at each successive time point. However, the missingness pattern is not monotone; there are 119 (24.4 %) subjects who missed at least one measurement, but returned for a later measurement. There were 83 (17.0 %) subjects with monotone missingness, meaning once these subjects miss a visit, they never return for a subsequent visit. To obtain consistent estimates of the regression parameters of marginal models, Liang and Zeger (1986) proposed the generalized estimating equations (GEE) approach. This approach does not require the complete specification of the joint distribution of the repeated responses, but only the first two moments.
When some individuals’ response vectors are only partially observed, GEE approaches (Liang and Zeger, 1986; Carey et al., 1993; Lipsitz et al., 2000) circumvent the problem of missing data by simply basing inferences on the observed responses. This approach yields consistent marginal regression parameter estimates provided that the responses are missing completely at random (MCAR) (Rubin, 1976; Laird, 1988). In particular, when the outcome data are MCAR, missingness depends only on the covariates (that are included in the model), and the GEE provides consistent regression parameter estimates. However, when missingness is related to the observed data (covariates and observed responses), but conditionally independent of the missing responses given the observed data, the missing data are said to be missing at random (MAR) (Rubin, 1976; Laird, 1988) and GEE can yield biased regression parameter estimates. One approach for handling missing data that are missing at random within the GEE framework is weighted estimating equations (Robins, et al., 1995). However, this approach is more appealing for monotone missing data and difficult to apply with non-monotone missing data that are MAR, although there has been some recent work in this area for non-monotone missing data that are non-ignorably missing in which missingness is related to the unobserved data (Tchetgen et al., 2017).

Multiple imputation (Rubin, 1978; Rubin & Schenker, 1986; Rubin, 1987; Barnard & Meng, 1999; Schafer, 1999; Horton & Lipsitz, 2001; Little & Rubin, 2002; Scheuren, 2005) is a general technique to reduce bias when missing data are MAR; the approach is flexible in that it is appropriate for missing covariate and/or missing outcomes in univariate or multivariate settings. With multiple imputation, missing data on the outcome are imputed or “filled-in” based on some assumed model for the missing data given the observed data. For repeated measures data, multiple imputation has also been developed within the GEE framework (Paik, 1997; Beunckens et al., 2008). Both of these authors propose methods for imputing missing values for the outcome from models for the conditional mean of the missing outcome at a particular occasion given the history of past observed outcomes. Specifically, they focus on the special case of monotone missing data patterns as might arise from dropout or attrition and propose a sequence of conditional (on past outcomes) or Markov type models for imputing missing data. For example, Paik (1997) proposes a telescoping series of regression models conditional on the past that can be used to directly impute the missing outcome data in a sequential way. However, this sequential method of imputation of the missing data is applicable only to monotone missing data patterns. With non-monotone missing data patterns, the conditional mean of the missing outcome at a particular occasion, given the past observed outcomes, can no longer be directly obtained from the sequence of regression models; instead, imputation would require Gibbs sampling from the sequence of regression models. Consequently, the multiple imputation approaches of Paik (1997) and Beunckens et al. (2008) for GEE are best suited for monotone missing data.

The first multiple imputation approach for non-monotone missing data that we propose here is imputation from a multivariate normal distribution, but with the addition of pairwise products among the binary outcomes to the multivariate (normal) vector. Schafer (1997) has shown that assuming that multivariate binary data is multivariate normal for imputation purposes works well for estimating marginal proportions under the MAR mechanisms posed in his simulations, or when the fraction of missing data is small (say < 10%). Here, we highlight scenarios where imputing the longitudinal binary data using the multivariate normal does not work well for estimation in GEE. Instead, we show that adding the pairwise products of the longitudinal binary observations to the multivariate normal vector reduces bias in GEE. For non-monotone missing data, the multivariate normal gives a simple form of the conditional distribution of the missing vector given the observed vector (also multivariate normal). Even though the multivariate normal does not impute 0 or 1 values for the missing binary responses, as discussed by Horton et al. (2003), we suggest not rounding when filling in the missing binary data because doing so could increase bias.

The second multiple imputation approach considered is the fully conditional specification (FCS) approach (Van Buuren, 2007). In this approach, we specify a logistic regression model for each outcome given the outcomes at other time points and the covariates. In the FCS approach, we do not need to specify the full joint distribution, only the conditional distributions. Typically, one would only include main effects of the outcomes at the other times as predictors in the FCS approach. As with the imputation from the multivariate normal distribution, we explore if bias can be reduced in FCS by including the main effects as well as pairwise interactions among the outcomes at other time points in the conditional logistic regression model.

In Section 2, we introduce some notation and consider a marginal regression model for the vector of binary responses. In Section 3, we describe generalized estimating equations (GEE) for complete and missing data. In Section 4, we discuss the two multiple imputation approaches. In Section 5, we present a study of the asymptotic bias of GEE when missing data are handled using multiple imputation. Finally, in Section 6, we present results from analysis of the CCTS longitudinal clinical trial introduced earlier.

2. Notation and Distributional Assumptions

Suppose that N individuals are to be observed on T binary responses. Then, for the i^th individual (i = 1, …, N) we can form a (T × 1) response vector, Y_i = [Y_i1, …, Y_iT]′, where Y_it = 1 if the i^th individual has a positive response (say, “success”) on the t^th response, and Y_it = 0 otherwise. Associated with Y_it, each individual also has a J × 1 covariate vector x_it. Let X_i = [x_i1, …, x_iT]′ represent the T × J matrix of covariates for the i^th individual. The marginal distribution of Y_it is Bernoulli and it is assumed that the probability of success,

π_{i t} = π_{i t} (β) = E (Y_{i t} | x_{i t}, β) = pr (Y_{i t} = 1 | x_{i t}, β) = \frac{\exp (x_{i t}^{'} β)}{1 + \exp (x_{i t}^{'} β)},

(1)

where β is a vector of logistic regression parameters. The π_it(β) can be grouped together to form a vector π_i(β) containing the marginal probabilities of success, π_i(β) = E[Y_i|X_i, β] = [π_i1, …, π_iT]′. Note that we are primarily interested in making inference about β. In this paper we consider the case where individuals are not observed on all T binary responses; however, we assume that no covariates are missing.

The joint probability for any pair of binary responses,

π_{ist} = E (Y_{i s} Y_{i t} | x_{i s}, x_{i t}, β, α) = pr (Y_{i s} = 1, Y_{i t} = 1 | x_{i s}, x_{i t}, β, α),

can be modeled in terms of the two marginal probabilities π_is(β) and π_it(β), in addition to an association parameter vector α. For example, the correlation between any pair of binary responses, Y_is and Y_it, is

ρ_{ist} = ρ_{ist} (β, α) = Corr (Y_{i s}, Y_{i t} | β, α) = \frac{π_{ist} - π_{i s} π_{i t}}{{[π_{i s} (1 - π_{i s}) π_{i t} (1 - π_{i t})]}^{1 / 2}} .

In terms of the correlation coefficient, the joint probability π_ist can be expressed as

π_{ist} (β, α) = π_{ist} = π_{i s} π_{i t} + ρ_{ist} {[π_{i s} (1 - π_{i s}) π_{i t} (1 - π_{i t})]}^{1 / 2} .

(2)

Note that other association parameters could also be used, e.g., marginal odds ratios instead of correlations.
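As a small numerical illustration of equations (1) and (2), the sketch below computes two marginal probabilities from a logistic model and then the joint probability implied by a given correlation. The function names and the example values of x, β and ρ are assumptions chosen for illustration; they are not taken from any software used in this paper.

```python
import math


def marginal_prob(x, beta):
    """Marginal success probability from the logistic model in equation (1)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))


def joint_prob(pi_s, pi_t, rho):
    """Joint probability pr(Y_s = 1, Y_t = 1) from equation (2)."""
    sd = math.sqrt(pi_s * (1.0 - pi_s) * pi_t * (1.0 - pi_t))
    return pi_s * pi_t + rho * sd


def correlation(pi_st, pi_s, pi_t):
    """Correlation recovered from the joint and marginal probabilities."""
    sd = math.sqrt(pi_s * (1.0 - pi_s) * pi_t * (1.0 - pi_t))
    return (pi_st - pi_s * pi_t) / sd


# Illustrative values only: intercept plus one binary covariate.
beta = [1.0, 0.5]
pi_s = marginal_prob([1.0, 0.0], beta)
pi_t = marginal_prob([1.0, 1.0], beta)
pi_st = joint_prob(pi_s, pi_t, rho=0.2)
print(pi_s, pi_t, pi_st)
```

Because π_ist is a probability, it must lie between max(0, π_is + π_it − 1) and min(π_is, π_it), so not every value of ρ is compatible with a given pair of marginal probabilities.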

3. Generalized estimating equations

When there are no missing data, the generalized estimating equations (GEE) for β are given by

u_{1} (\hat{β}) = \sum_{i = 1}^{N} u_{1 i} (\hat{β}) = \sum_{i = 1}^{N} {\hat{D}}_{i}^{'} {\hat{V}}_{i}^{- 1} [Y_{i} - π_{i} (\hat{β})] = 0,

(3)

where D_i = ∂π_i(β)/∂β, and V_i = V_i(α, β) is the T × T “working” or approximate covariance matrix of Y_i (Liang and Zeger, 1986). Since Y_it is binary, the t^th diagonal element of V_i is Var(Y_it) = π_it(1 − π_it), which is specified entirely by the marginal distributions (i.e., by the vector of regression parameters β). The (s, t)^th off-diagonal element of V_i is Cov(Y_is, Y_it) = π_ist − π_isπ_it, where π_ist is specified by equation (2). A second set of estimating equations similar to (3) can be used to estimate α, the parameters associated with the correlations in (2).

When there are missing outcome data, we can write $Y_{i}^{'} = (Y_{m, i}^{'}, Y_{o, i}^{'})$ where Y_o,i is a (T_i × 1) vector containing the observed components of Y_i, and Y_m,i is a [(T − T_i) × 1] vector containing the missing components of Y_i. The missing data patterns are assumed to be non-monotone, i.e., subjects can have missing values on at least one occasion, but observed values at a later occasion. With missing data, the GEE for β based on the observed data is

u_{1} (\hat{β}) = \sum_{i = 1}^{N} {\hat{D}}_{o, i}^{'} {\hat{V}}_{o, i}^{- 1} [Y_{o, i} - π_{o, i} (\hat{β})] = 0,

(4)

where π_o,i and V_o,i are the elements of π_i and V_i corresponding to Y_o,i. The solution to this GEE yields consistent estimates of β when data are MCAR, but consistency may not hold when the data are MAR.

4. Multiple Imputation

In general, using multiple imputation, we ‘fill-in’ or ‘impute’ the missing data Y_m,i for each subject to create a set of, say K ‘filled-in’ or ‘imputed’ datasets, and then we estimate β in each of these imputed datasets, and average the estimated β’s over the K imputed datasets to obtain the multiple imputation estimator.

Rubin & Schenker (1986) and Rubin (1987) give a detailed summary of multiple imputation. Here, we review some relevant parts. First, we create K imputed datasets by sampling the missing data Y_m,i, K times using one of the imputation methods discussed below; this creates $Y_{m, i}^{k}$, k = 1, …, K. For the k^th imputed dataset, we calculate the GEE estimate ${\hat{β}}^{k}$ as well as the within imputation variance $U^{k} = \hat{Var} ({\hat{β}}^{k})$. Thus, the K imputed datasets give us ${\hat{β}}^{k}$ and U^k, for k = 1, …, K.

With K imputations, the multiple imputation estimate of β is

{\hat{β}}^{*} = \frac{\sum_{k = 1}^{K} {\hat{β}}^{k}}{K} .

Then, normal based inferences for β can be made (Rubin, 1978; Rubin, 1987),

({\hat{β}}^{*} - β) ~ N (0, V),

where

V = \hat{W} + (\frac{K + 1}{K}) \hat{B}, \hat{W} = \frac{\sum_{k = 1}^{K} U^{k}}{K}

(5)

is the average within imputation variance, and

\hat{B} = \frac{\sum_{k = 1}^{K} ({\hat{β}}^{k} - {\hat{β}}^{*}) {({\hat{β}}^{k} - {\hat{β}}^{*})}^{'}}{K - 1}

is the between imputation variance.
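The combining rules above can be sketched in a few lines. The function below is a minimal illustration for a single component of β (so each U^k is a scalar variance rather than a matrix); the function name and the example numbers are hypothetical.

```python
def rubin_combine(estimates, variances):
    """Pool K scalar estimates and their within-imputation variances.

    Returns the pooled estimate and the total variance V of equation (5).
    """
    K = len(estimates)
    if K < 2 or len(variances) != K:
        raise ValueError("need K >= 2 estimates with matching variances")
    beta_star = sum(estimates) / K                      # MI estimate
    W = sum(variances) / K                              # average within variance
    B = sum((b - beta_star) ** 2 for b in estimates) / (K - 1)  # between variance
    V = W + (K + 1) / K * B                             # total variance
    return beta_star, V


# Hypothetical estimates and variances from K = 3 imputed datasets.
beta_star, V = rubin_combine([0.4, 0.5, 0.6], [0.04, 0.05, 0.06])
print(beta_star, V)
```

For the full parameter vector, the same steps apply with the squared deviation replaced by the outer product shown in the expression for the between imputation variance.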

Next, we describe the approaches to imputing the missing data. If the missing data are MAR, a consistent estimate of β can be obtained by replacing Y_m,i with the conditional expectation of Y_m,i given (Y_o,i, X_i). Note, however, that the computation of the conditional expectation E(Y_m,i|Y_o,i, X_i) requires the full specification of the joint distribution of Y_i. With a vector of T binary responses, there are 2^T possible response sequences, and Y_i has a multinomial distribution with 2^T joint cell probabilities. The primary appeal of GEE lies in avoiding the full specification of this joint distribution of Y_i.

Therefore, for multiple imputation, we first consider an approximation for E(Y_m,i|Y_o,i, X_i) based on the multivariate normal distribution for Y_i. We use multiple imputation to fill-in Y_m,i given (Y_o,i, X_i). However, instead of assuming only that Y_i given X_i is multivariate normal in the imputation, we also form the [T(T − 1)/2 × 1] vector of cross-products

U_{i} = [U_{i 12}, U_{i 13}, \dots, U_{i (T - 1) T}]^{'}

where U_ist = Y_isY_it and assume $Y = {[Y_{i}^{'}, U_{i}^{'}]}^{'}$ is multivariate normal. With this assumption, the conditional distribution of Y_m,i given (Y_o,i, X_i) will depend on Y_o,i as well as the cross-products of the elements of Y_o,i, say U_o,i. This conditional distribution is straightforward to impute from since it is also multivariate normal with conditional mean E(Y_m,i|Y_o,i, U_o,i, X_i) and conditional variance Var(Y_m,i|Y_o,i, U_o,i, X_i), which are both functions of the marginal mean E(Y_m,i, Y_o,i, U_i|X_i) and marginal variance Var(Y_m,i, Y_o,i, U_i|X_i); see for example Johnson & Wichern (2002) for the conditional mean and variance for a multivariate normal distribution. This can be considered a joint modelling approach in which the conditional distribution (multivariate normal) is derived from the joint distribution (multivariate normal). To impute the missing data from the conditional multivariate normal distribution, we used a Bayesian MCMC (Markov chain Monte Carlo) approach (Schafer, 1997; Gilks, et al., 1996), implemented in SAS Proc MI (SAS Institute Inc, 2020), to sample Y_m,i. For an arbitrary missing data pattern for multivariate normal data for a subject, the Bayesian MCMC approach draws from the posterior distribution of the missing data given the observed data using the 2-step Imputation-Parameter-step algorithm developed by Schafer (1997). In the MCMC algorithm, one can also specify Bayesian priors for the parameters of the marginal means and variances (Liu et al., 2000), and we used a non-informative Jeffreys prior in our example since we had no prior information on the longitudinal outcomes.
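
For reference, the standard conditional-normal result that this step relies on can be written as follows. The symbols A_m and A_o (the missing and observed blocks of the augmented vector of outcomes and cross-products), μ and Σ are notation introduced here for illustration only, and the dependence of the mean on X_i is suppressed.

```latex
\begin{aligned}
\begin{pmatrix} A_{m} \\ A_{o} \end{pmatrix}
  &\sim N\!\left(
     \begin{pmatrix} \mu_{m} \\ \mu_{o} \end{pmatrix},
     \begin{pmatrix} \Sigma_{mm} & \Sigma_{mo} \\ \Sigma_{om} & \Sigma_{oo} \end{pmatrix}
   \right), \\
E(A_{m} \mid A_{o}) &= \mu_{m} + \Sigma_{mo} \Sigma_{oo}^{-1} (A_{o} - \mu_{o}), \\
\mathrm{Var}(A_{m} \mid A_{o}) &= \Sigma_{mm} - \Sigma_{mo} \Sigma_{oo}^{-1} \Sigma_{om}.
\end{aligned}
```

The conditional mean is linear in A_o, so including the observed cross-products U_o,i in A_o is what allows the imputation model to depend on products of the observed outcomes.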

Even though the true joint distribution of the binary outcomes is not multivariate normal, we use the multivariate normal distribution for imputation purposes. Further, even though the imputed value for the binary Y_it from the conditional multivariate normal distribution is continuous, we do not round the continuous value to 0 or 1, but keep the original imputed continuous value in the estimation procedure. Previous theory (Horton et al. 2003) has shown that even though the imputed value is continuous, rounding when filling in the missing binary data could increase bias of the resulting estimate. For example, if we consider a simple one-sample case where some subjects are missing and data are MCAR, the observed proportion (mean of the binary variable) is unbiased. If we impute the missing observations from a normal distribution with mean equal to the observed proportion in complete cases, the imputed values (not rounded) of the missing observations will have mean equal to the observed proportion. If we create the imputations by rounding the normally distributed value to 0 or 1, then an imputed value of a missing observation will equal the observed proportion plus an error term that does not have mean 0 and thus the mean of the imputed values will no longer equal the observed proportion. Further, GEE is a quasi-likelihood approach in which one only needs to specify the model for the mean and variance (and correlations). Thus, even with imputed values of Y_it that are continuous, one can still implement the GEE approach with marginal mean π_it and marginal variance π_it(1 − π_it).
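
The one-sample argument can be checked numerically. The sketch below assumes, purely for illustration, that imputations are drawn from a normal distribution with mean p and variance p(1 − p) and that rounding uses a cut-off of 0.5; neither choice is specified in the text above, and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mean_after_rounding(p):
    """Expected imputed value when a N(p, p(1 - p)) draw is rounded at 0.5.

    The rounded value is 1 with probability pr(draw > 0.5), so this
    probability is also the mean of the rounded imputations.
    """
    sd = math.sqrt(p * (1.0 - p))
    return 1.0 - normal_cdf((0.5 - p) / sd)


# Unrounded draws have mean p by construction; rounded draws generally do not.
for p in (0.5, 0.73, 0.82):
    print(p, mean_after_rounding(p))
```

Under these assumptions the rounded imputations have mean p only in the symmetric case p = 0.5, which is consistent with the recommendation not to round.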

The second multiple imputation approach we consider is the fully conditional specification (FCS) approach (Van Buuren, 2007), or a so-called “chained equation” approach. In this approach, we specify a logistic regression model for each Y_it given the Y_is’s at all other time points and X_i,

pr (Y_{i t} = 1 | Y_{i 1}, \dots, Y_{i, t - 1}, Y_{i, t + 1}, \dots, Y_{i T}, X_{i}, θ_{t}),

(6)

where θ_t is the logistic regression parameter vector from this conditional distribution. In the FCS approach, we do not need to specify the full joint distribution, only the series of conditional distributions given by (6). The FCS approach uses a Markov chain Monte Carlo (MCMC) method known as the Gibbs sampler to generate imputed values from the predictive distribution of the missing data, given the observed data. Briefly, the FCS approach involves iterating between the following steps. At each iteration, the logistic regression model (6) is fit to the observed values of Y_it given both the observed and imputed values of Y_is (for s ≠ t) and X_i; this yields logistic regression parameter estimates, ${\hat{θ}}_{t}$ (and their associated covariance matrix, say ${\hat{C}}_{t}$). A parameter vector θ_t for each logistic regression model (6) is drawn from the posterior predictive distribution of θ_t given the observed and imputed data; the posterior predictive distribution is assumed to be multivariate normal with mean ${\hat{θ}}_{t}$ and covariance ${\hat{C}}_{t}$. Finally, the parameter vector θ_t is then used to impute the missing values for Y_it (given observed and imputed values of Y_is, for s ≠ t). The above steps are iterated a sufficiently large number of times to ensure that the imputed values are (at least approximately) draws from the posterior predictive distribution of the missing data Y_m,i, given the observed data Y_o,i. This Gibbs sampling approach is particularly appealing given that the series of conditional distributions, specified by the logistic regression models (6), are straightforward to specify and fit. We note that in our example, since we had no prior information on θ_t, we implicitly used an improper non-informative uniform prior (Zellner & Rossi, 1984) for these parameters.

As with the imputation from the multivariate normal distribution, we explore if bias can be reduced by including all pairwise interactions of the Y_is’s at other time points in the logistic regression model for each Y_it given the other Y_is’s, i.e.,

pr (Y_{i t} = 1 | Y_{i 1}, \dots, Y_{i, t - 1}, Y_{i, t + 1}, \dots, Y_{i T}, Y_{i s} Y_{i u} for all s < u with s, u \neq t, X_{i}) .
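
One way to make the two FCS specifications concrete is to write out their linear predictors. The display below is a sketch under the assumption that the other outcomes and the covariates enter linearly on the logit scale; the coefficient names θ_{t0}, θ_{ts}, θ_{tsu} and θ_{tx}, and the covariate vector x_i^* formed from X_i, are illustrative notation that does not appear elsewhere in this paper.

```latex
\begin{aligned}
\text{main effects only:}\quad
\operatorname{logit}\,\mathrm{pr}(Y_{it} = 1 \mid \cdot)
  &= \theta_{t0} + \sum_{s \neq t} \theta_{ts} Y_{is} + x_{i}^{*\prime} \theta_{tx}, \\
\text{with pairwise interactions:}\quad
\operatorname{logit}\,\mathrm{pr}(Y_{it} = 1 \mid \cdot)
  &= \theta_{t0} + \sum_{s \neq t} \theta_{ts} Y_{is}
     + \sum_{\substack{s < u \\ s, u \neq t}} \theta_{tsu} Y_{is} Y_{iu}
     + x_{i}^{*\prime} \theta_{tx}.
\end{aligned}
```

With T time points, the second form adds (T − 1)(T − 2)/2 interaction terms to each conditional model, for example 3 terms when T = 4 and 10 terms when T = 6.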

It is of interest to examine the potential bias of the multivariate normal imputation and FCS approaches when the data are MAR. We conjecture that adding the cross-product terms to the multivariate normal outcome vector and the pairwise interactions to the FCS approach can reduce the bias compared to multiple imputation without the inclusion of these cross-product or pairwise interaction terms. We explore this conjecture in a study of asymptotic bias in the following section.

5. Study of Asymptotic Bias

In this section we study the asymptotic bias in estimating β using 5 approaches for handling missing data, including the two proposed in the previous section. For simplicity, we consider a two-group longitudinal design configuration with a binary response measured on four occasions. We assume that half of the individuals are assigned to each of the two groups, i.e., the group indicator variable, x_i, equals 0 or 1 with pr(x_i = 1) = 0.5. The following marginal model for π_it = E(Y_it|x_i, β) is assumed,

logit (π_{i t}) = β_{0} + β_{1} x_{i}, t = 1, 2, 3, 4.

(7)

So far, we have only specified the marginal distribution of each Y_it separately. Next, we assume that the joint distribution of (Y_i1, Y_i2, Y_i3, Y_i4|x_i) is given by the Bahadur distribution (Bahadur, 1961). To describe the Bahadur distribution, we define the standardized variable Z_it to be

Z_{i t} = \frac{Y_{i t} - π_{i t}}{[π_{i t} (1 - π_{i t})]^{1 / 2}} .

The pairwise correlation between Y_is and Y_it is ρ_st = E(Z_isZ_it); the 3^rd-order correlation is defined as ρ_stu = E(Z_isZ_itZ_iu); and the 4^th-order correlation is defined as ρ_stuv = E(Z_isZ_itZ_iuZ_iv). For subject i (with covariate x_i equal to 0 or 1), the Bahadur distribution is

pr \{(Y_{i 1} = y_{1}), (Y_{i 2} = y_{2}), (Y_{i 3} = y_{3}), (Y_{i 4} = y_{4}) | x_{i}, β\} = \{\prod_{t = 1}^{4} π_{i t} {(β)}^{y_{t}} {[1 - π_{i t} (β)]}^{(1 - y_{t})}\} \{1 + \sum_{s > t} ρ_{s t} z_{i s} z_{i t} + \sum_{s > t > u} ρ_{s t u} z_{i s} z_{i t} z_{i u} + ρ_{1234} z_{i 1} z_{i 2} z_{i 3} z_{i 4}\} .

(8)

We note that as long as the pairwise correlations (the ρ_st’s) are non-zero, the conditional distribution of one Y_it given the other three Y_is’s depends on the cross-products of pairs of the other Y_is’s, even if ρ_stu = ρ₁₂₃₄ = 0. Thus, any imputation approach that does not include the cross-products could produce bias.
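
The dependence on cross-products can be seen numerically. The sketch below evaluates the Bahadur probability (8) for T = 4 under the simplifying assumption of common (exchangeable) correlations of each order, and then computes an interaction contrast on the logit scale for the conditional distribution of Y_i4 given the first three outcomes. The function names, the common marginal probability 0.73 and ρ = 0.2 are illustrative choices.

```python
import itertools
import math


def bahadur_pmf(y, pi, rho2, rho3=0.0, rho4=0.0):
    """Bahadur probability (8) of the binary vector y, for T = 4.

    pi holds the four marginal probabilities; rho2, rho3 and rho4 are
    common 2nd-, 3rd- and 4th-order correlations (an assumption here).
    """
    z = [(yt - pt) / math.sqrt(pt * (1.0 - pt)) for yt, pt in zip(y, pi)]
    indep = math.prod(pt if yt == 1 else 1.0 - pt for yt, pt in zip(y, pi))
    idx = range(4)
    corr = 1.0
    corr += rho2 * sum(z[s] * z[t] for s, t in itertools.combinations(idx, 2))
    corr += rho3 * sum(z[s] * z[t] * z[u]
                       for s, t, u in itertools.combinations(idx, 3))
    corr += rho4 * z[0] * z[1] * z[2] * z[3]
    return indep * corr


def conditional_logit_y4(y1, y2, y3, pi, rho2):
    """Logit of pr(Y4 = 1 | y1, y2, y3) with rho3 = rho4 = 0."""
    p1 = bahadur_pmf((y1, y2, y3, 1), pi, rho2)
    p0 = bahadur_pmf((y1, y2, y3, 0), pi, rho2)
    return math.log(p1 / p0)


pi = [0.73] * 4
total = sum(bahadur_pmf(y, pi, 0.2)
            for y in itertools.product((0, 1), repeat=4))

# If the conditional logit were additive in y1 and y2, this contrast would be 0.
contrast = (conditional_logit_y4(1, 1, 0, pi, 0.2)
            - conditional_logit_y4(1, 0, 0, pi, 0.2)
            - conditional_logit_y4(0, 1, 0, pi, 0.2)
            + conditional_logit_y4(0, 0, 0, pi, 0.2))
print(total, contrast)
```

The probabilities sum to 1 and the contrast is non-zero, so a conditional logistic model with main effects only cannot reproduce this conditional distribution. Note that (8) is a valid distribution only for parameter values that keep every cell probability non-negative.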

In this study, to create a plausible non-monotone MAR missingness model where the probability of being missing depends on previous observed outcomes, we let subjects be missing at times 3 or 4 (or both), but do not allow missingness at times 1 and 2. We define the indicator random variable R_it which equals 1 if Y_it is observed and 0 if Y_it is missing. We let missingness be non-monotone, so there are 3 possible patterns of (R_i3, R_i4) with missing data, in addition to the fully observed pattern. If a subject is observed at time 3 but missing at time 4, then (R_i3, R_i4) = (1, 0); if a subject is observed at time 4 but missing at time 3, then (R_i3, R_i4) = (0, 1); if a subject is missing at both times 3 and 4, then (R_i3, R_i4) = (0, 0); if a subject is observed at both times 3 and 4, then (R_i3, R_i4) = (1, 1). To allow for non-monotone MAR missingness, we use a simple missing at random model where

pr (R_{i 3} = r_{3}, R_{i 4} = r_{4} | Y_{i}, x_{i}) = pr (R_{i 3} = r_{3} | y_{i 1}, y_{i 2}, x_{i}) pr (R_{i 4} = r_{4} | y_{i 1}, y_{i 2}, x_{i}) = ϕ_{i 3}^{r_{3}} {(1 - ϕ_{i 3})}^{(1 - r_{3})} ϕ_{i 4}^{r_{4}} {(1 - ϕ_{i 4})}^{(1 - r_{4})},

(9)

where ϕ_it = pr(R_it = 1|y_i1, y_i2, x_i) is the probability that Y_it is observed (t = 3, 4). Note that the missing data are missing at random since (9) does not depend on the possibly missing outcomes, (Y_i3, Y_i4). For simplicity, we also let the logistic regression models for ϕ_i3 and ϕ_i4 be identical

logit (ϕ_{i 3}) = logit (ϕ_{i 4}) = γ_{0} + γ_{1} x_{i} + γ_{2} y_{i 1} + γ_{3} y_{i 2} + γ_{23} y_{i 1} y_{i 2} .

(10)
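
Equations (9) and (10) together determine the probability of each missingness pattern for a given subject. The sketch below is a minimal illustration; the function names are hypothetical, and the parameter vector used in the example is one of the (γ₀, γ₁, γ₂, γ₃, γ₂₃) settings listed in Table 1.

```python
import math


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))


def pattern_probs(x, y1, y2, gamma):
    """Probabilities of the four (R_i3, R_i4) patterns under (9) and (10).

    phi is the common probability of being observed at time 3 and at time 4;
    the two indicators are independent given (y1, y2, x).
    """
    g0, g1, g2, g3, g23 = gamma
    phi = expit(g0 + g1 * x + g2 * y1 + g3 * y2 + g23 * y1 * y2)
    return {
        (r3, r4): (phi if r3 else 1.0 - phi) * (phi if r4 else 1.0 - phi)
        for r3 in (0, 1)
        for r4 in (0, 1)
    }


# Example: x = 0 and successes at times 1 and 2.
probs = pattern_probs(x=0, y1=1, y2=1, gamma=(1.0, -1.0, 2.0, 2.0, -4.0))
print(probs)
```

Because the two logistic models are identical, the patterns (1, 0) and (0, 1) always have equal probability for a given subject, which is why the corresponding columns of Table 1 coincide.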

Letting $\hat{β}$ be the estimated parameter vector from a given approach, the asymptotic bias of that approach is defined as $E_{A} [\hat{β} - β] = (β^{*} - β)$, where β is the true parameter value and $E_{A} [\hat{β}] = β^{*}$ is the asymptotic expectation of $\hat{β}$. With a discrete set of outcomes and covariates, Rotnitzky & Wypij (1994) showed that the asymptotic bias of an approach can be determined by considering an artificial sample of weighted observations for each possible realization of (Y_i1, Y_i2, Y_i3, Y_i4, R_i3, R_i4|x_i). Since each Y_it, R_it and x_i is binary, the artificial sample will contain 2⁷ = 128 observations. The weight for each observation is given by its respective joint probability

w_{i} = pr (Y_{i 1} = y_{1}, Y_{i 2} = y_{2}, Y_{i 3} = y_{3}, Y_{i 4} = y_{4}, R_{i 3} = r_{3}, R_{i 4} = r_{4} | x_{i}) .

To then obtain the asymptotic expectation β* for a given estimation approach, we solve for β* in the usual way for each approach, with each individual’s contribution to the GEE weighted by its respective joint probability. Basically, we create a dataset of all possible values of (Y_i1, Y_i2, Y_i3, Y_i4, R_i3, R_i4, x_i) with corresponding weight w_i, and fit a weighted GEE in a program like SAS Proc Genmod (SAS Institute Inc, 2020) or the R gee package (Carey, et al. 2012) to get β*.

For the study of asymptotic bias, we let the true β₀ = 1.0, β₁ = 0.5 in (7). In our simulations, a subject’s probability of success is either exp(1.5)/(1 + exp(1.5)) = 0.82 when x = 1 or exp(1)/(1 + exp(1)) = 0.73 when x = 0. Thus, our asymptotic study considers the case with a high probability of success. We let the true correlation model be exchangeable with value ρ = ρ_st = 0.2 and 0.4. We let ρ_stu = ρ³ and $ρ_{stuv} = ρ_{stu}^{4}$. We also include missing data models with an interaction between Y_i1 and Y_i2 (γ₂₃ = −4) and without an interaction (γ₂₃ = 0) to determine if this interaction in the missingness model affects the results.

Also, even though the true marginal model only has a group effect, with no time or time by group interaction, we fit a marginal model for each approach that includes these terms,

logit (π_{i t}) = β_{0} + β_{1} x_{i} + β_{2} I (t = 2) + β_{3} I (t = 3) + β_{4} I (t = 4) + β_{12} x_{i} I (t = 2) + β_{13} x_{i} I (t = 3) + β_{14} x_{i} I (t = 4),

(11)

where I(·) is an indicator function. We emphasize here that we fit an overspecified model that included the time by group interaction terms, even though the true model only has a non-zero main effect of group. In most longitudinal studies, the main interest is in assessing if the trends over time differ in the two groups, which is represented by the time by group interaction. Thus, our interest is in determining if the different approaches give biased estimates of the time by group interaction, and not in the main effects in the model. Any approach that gives a non-zero asymptotic expectation of the time by group interaction terms will be a biased approach. In particular, if there is no asymptotic bias, then for a given approach, the time by group interaction terms should all converge to 0, i.e., $β_{12}^{*} = β_{13}^{*} = β_{14}^{*} = 0$.

In the asymptotic study, we fit GEE with exchangeable correlation for all approaches (for standard GEE and within imputation). Further, for each multiple imputation approach, we performed 1000 multiple imputations (the multiple imputation estimate is again the average of the estimates over the 1000 imputations). The asymptotic bias from the imputation approaches is specifically the asymptotic bias for K = 1000 imputations (as N → ∞). The asymptotic bias does depend on the number of imputations, but 1000 imputations is large enough to minimize any bias due to a small number of imputations (Graham et al. 2007).

Table 1 gives the asymptotic bias of the various approaches for different values of the missingness parameters γ. In Table 1, we denote multivariate normal imputation without cross-products as MVN-MI, i.e., assuming

(Y_{i 1}, Y_{i 2}, Y_{i 3}, Y_{i 4})

given x_i is multivariate normal; we denote multivariate normal imputation with cross-products as MVN-MI-cross, i.e., assuming

(Y_{i 1}, Y_{i 2}, Y_{i 3}, Y_{i 4}, Y_{i 1} Y_{i 2}, Y_{i 1} Y_{i 3}, Y_{i 1} Y_{i 4}, Y_{i 2} Y_{i 3}, Y_{i 2} Y_{i 4}, Y_{i 3} Y_{i 4})

given x_i is multivariate normal; we denote FCS imputation without pairwise interactions as FCS-MI, with a logistic regression for the conditional probability

pr (Y_{i t} = 1 | Y_{i s}, Y_{i u}, Y_{i v}, x_{i});

and we denote FCS imputation with pairwise interactions as FCS-MI-interact, with a logistic regression for the conditional probability

pr (Y_{i t} = 1 | Y_{i s}, Y_{i u}, Y_{i v}, Y_{i s} Y_{i u}, Y_{i s} Y_{i v}, Y_{i u} Y_{i v}, x_{i}) .

We note that in all of these multiple imputation approaches, we assumed that the associations between outcomes may differ across pairs of times, and thus that there is an ‘unstructured covariance’ matrix for the vector of outcomes for a subject.

Table 1.

The vector value $β^{*}$ to which $\hat{β}$ converges. The marginal logistic model has parameters β₁₂ = β₁₃ = β₁₄ = 0 with exchangeable correlation, ρ.

pr(R_i3, R_i4)				Missing Data Model		ρ = 0.2			ρ = 0.4
(1, 1)	(0, 1)	(1, 0)	(0, 0)	(γ₀, γ₁, γ₂, γ₃, γ₂₃)^a	Approach	$β_{12}^{*}$	$β_{13}^{*}$	$β_{14}^{*}$	$β_{12}^{*}$	$β_{13}^{*}$	$β_{14}^{*}$
0.30	0.21	0.21	0.28	(0.17,−1,2,2,−4)	GEE	0.000	−0.143	−0.143	0.000	−0.432	−0.432
					MVN-MI	0.000	−0.118	−0.113	0.000	−0.470	−0.471
					MVN-MI-cross	0.000	0.009	0.006	0.000	−0.002	−0.001
					FCS-MI	0.000	−0.166	−0.164	0.000	−0.518	−0.517
					FCS-MI-interact	0.000	−0.005	−0.004	0.000	−0.003	−0.001
0.49	0.19	0.19	0.13	(1,−1,2,2,−4)	GEE	0.000	−0.090	−0.090	0.000	−0.260	−0.260
					MVN-MI	0.000	−0.065	−0.068	0.000	−0.268	−0.261
					MVN-MI-cross	0.000	0.001	0.002	0.000	0.002	0.004
					FCS-MI	0.000	−0.094	−0.091	0.000	−0.296	−0.298
					FCS-MI-interact	0.000	0.000	−0.003	0.000	0.002	0.000
0.34	0.17	0.17	0.32	(1.5,−1,−1,−1,0)	GEE	0.000	−0.004	−0.004	0.000	−0.001	−0.001
					MVN-MI	0.000	−0.047	−0.049	0.000	−0.162	−0.163
					MVN-MI-cross	0.000	0.005	−0.001	0.000	−0.003	−0.008
					FCS-MI	0.000	−0.063	−0.070	0.000	−0.185	−0.187
					FCS-MI-interact	0.000	−0.002	0.002	0.000	−0.002	−0.006
0.52	0.16	0.16	0.16	(2.5,−1,−1,−1,0)	GEE	0.000	−0.022	−0.022	0.000	−0.052	−0.052
					MVN-MI	0.000	−0.033	−0.022	0.000	−0.090	−0.089
					MVN-MI-cross	0.000	0.001	0.001	0.000	0.000	−0.003
					FCS-MI	0.000	−0.037	−0.032	0.000	−0.105	−0.103
					FCS-MI-interact	0.000	−0.002	0.002	0.000	−0.001	0.003

^a The missing data model is logit(ϕ_i3) = logit(ϕ_i4) = γ₀ + γ₁x_i + γ₂y_i1 + γ₃y_i2 + γ₂₃y_i1y_i2.

MVN-MI is multivariate normal imputation without cross-products

MVN-MI-cross is multivariate normal imputation with cross-products

FCS-MI is FCS imputation without pairwise interactions

FCS-MI-interact is FCS imputation with pairwise interactions
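As a small worked example of the missing data model in the table footnote, the sketch below evaluates ϕ for each combination of (y_i1, y_i2) under one of the γ configurations in Table 1. The function name `phi_prob` and the covariate value x = 0 are assumptions made for illustration. Whether ϕ denotes the probability of being observed or of being missing is defined in an earlier section and is not restated here.

```python
import math


def phi_prob(gamma, x, y1, y2):
    """Evaluate phi from logit(phi) = g0 + g1*x + g2*y1 + g3*y2 + g23*y1*y2."""
    g0, g1, g2, g3, g23 = gamma
    eta = g0 + g1 * x + g2 * y1 + g3 * y2 + g23 * y1 * y2
    return 1.0 / (1.0 + math.exp(-eta))


gamma = (1.0, -1.0, 2.0, 2.0, -4.0)  # second configuration in Table 1
phi = {
    (y1, y2): phi_prob(gamma, x=0, y1=y1, y2=y2)  # x = 0 is an illustrative choice
    for y1 in (0, 1)
    for y2 in (0, 1)
}
for key, value in phi.items():
    print(key, round(value, 3))
```

With γ₂₃ = −4 exactly offsetting γ₂ + γ₃ = 4, the linear predictor is the same for (y_i1, y_i2) = (0, 0) and (1, 1), so ϕ depends on the two earlier outcomes only through whether they agree, which a main-effects-only model cannot represent.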

We see from Table 1 that standard GEE based only on the observed data has substantial bias when the true missingness mechanism depends on the interaction between Y_i1 and Y_i2, but has much less bias when this interaction effect is 0. When this interaction term is non-zero, standard GEE has its largest bias when the proportion missing is the highest. Multiple imputation with the multivariate normal without cross-products and FCS without pairwise interactions in the conditional models perform similarly to each other in Table 1. When the interaction term in the missingness model is not 0, these two approaches perform very similarly to standard GEE; in this case, imputing without cross-products or interactions does not reduce the bias. On the other hand, multiple imputation with the multivariate normal with cross-products and FCS with pairwise interactions reduce the bias to a negligible amount. When the interaction term in the missingness model equals 0, multivariate normal without cross-products and FCS without pairwise interactions again perform very similarly to each other, but actually give more bias than standard GEE, so that imputing without cross-products increases rather than reduces the bias. In this case too, multiple imputation with the multivariate normal with cross-products and FCS with pairwise interactions reduce the bias to a negligible amount.

Typically, for most longitudinal studies, the marginal parameters β are the parameters of main scientific interest, while ρ is considered to be a nuisance parameter. Thus, the bias for the correlation is not displayed in Table 1. However, the estimate of the exchangeable correlation (the working correlation posed for all approaches) had less than 10% relative asymptotic bias for all approaches. Also, although not shown here, for each of the multiple imputation approaches, we also fit GEE under independence and autoregressive correlation structures, and the asymptotic bias was very similar to that of GEE with exchangeable structure. This agrees with the GEE theory for a single dataset with no missing data (a balanced dataset across time), in that the GEE approach for estimating the marginal model is robust to the choice of ‘working correlation’ model when there are no missing data (Liang & Zeger, 1986). In particular, once we impute the missing values in the dataset for a given imputation approach, we have a dataset with no missing data, so that any ‘working correlation’ will produce similar estimates of the marginal parameters. Thus, the asymptotic bias in our study would not be due to mis-specification of the ‘working correlation’, but due to the imputation approach.

In summary, this asymptotic study suggests that when using GEE for longitudinal studies with missing data, one should at the very least perform a sensitivity analysis using multiple imputation with the multivariate normal with cross-products and/or FCS with pairwise interactions in the conditional models.

6. Application

In this section, we illustrate the application of the proposed methodology to the analysis of the National Institute on Drug Abuse Collaborative Cocaine Treatment Study (CCTS), the longitudinal clinical trial dataset described earlier. Full details on the design and procedures of the CCTS can be found in previous publications on the trial (e.g., Crits-Christoph et al., 1999). Briefly, the CCTS was a multi-site randomized clinical trial that compared the efficacy of four psychosocial treatments for cocaine dependence. Each treatment consisted of 6 months of active phase treatment. A total of 487 outpatients, who were diagnosed with DSM-IV cocaine dependence and had used cocaine during the past 30 days, were randomly assigned to one of the four treatment conditions. At intake, patients on average used cocaine 10 days out of the past 30 and had been using cocaine for an average of 7 years (SD = 4.8). A composite cocaine use outcome measure, constructed by pooling information from self-report data and weekly observed urine samples, was used to code each month of treatment as abstinent versus any cocaine use. Thus, in the CCTS there were 6 monthly assessments of cocaine use during follow-up (corresponding to t = 1, 2, 3, 4, 5, 6), yielding repeated measures of a binary response. That is, the response variable of interest at each occasion is cocaine use, Y_it, which equals 1 if the patient was found to be using at time t and equals 0 otherwise. The main scientific interest is in determining any treatment differences in the cocaine use profile over the treatment period. In this study, 41.5% of subjects were missing at least one of the six monthly cocaine use assessments, and it was thought likely that missingness is not completely at random; for the analyses presented here, we assume missingness is at random (MAR). We must emphasize that the assumption of MAR (which also includes MCAR) is an unverifiable assumption in the sense that MAR cannot be distinguished from not missing at random (NMAR) based on the distribution of the observed data.

We considered the following marginal logistic regression model as a function of three indicator variables for treatment group (with IDC chosen as the reference group),

log [\frac{π_{it}}{1 - π_{it}}] = β_{0} + β_{1} I(CT) + β_{2} I(SE) + β_{3} I(GDC) + β_{time}^{'} t, for t = 1, 2, 3, 4, 5, 6,

where I(·) is an indicator for treatment group, and the vector t contains 5 indicators for categorical time effects. Because the 6 repeated binary outcomes are obtained post-randomization, comparisons of the treatment groups are made in terms of rates of cocaine use during the 6 months of active treatment, $\sum_{t = 1}^{6} π_{i t} (β)$. That is, the expected frequency of cocaine use during the six months is determined by the marginal probabilities of the six binary outcomes, pr(Y_it = 1|x_it, β), and estimates of the marginal regression parameters, β, are used as the basis for inference about treatment group differences in the rates of cocaine use during the 6 months of active treatment. To account for the association among the binary responses, an autoregressive correlation model, Corr(Y_is, Y_it) = ρ^|s−t|, was fit to the data. As was noted earlier, there was a substantial amount of missing data on the 6 binary outcomes. We accounted for the missing data using multiple imputation and performed 100 imputations for each imputation approach.
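The two quantities in this paragraph, the autoregressive working correlation and the expected number of months of cocaine use, can be sketched as follows. The function names, the value ρ = 0.5, and the linear predictors are hypothetical choices for illustration only; they are not estimates from the CCTS data.

```python
import math


def ar1_corr(rho, n_times):
    """Working correlation matrix with Corr(Y_is, Y_it) = rho ** |s - t|."""
    return [[rho ** abs(s - t) for t in range(n_times)] for s in range(n_times)]


def expected_months_of_use(linear_predictors):
    """Sum over occasions of the marginal probabilities pi_it = expit(eta_it)."""
    return sum(1.0 / (1.0 + math.exp(-eta)) for eta in linear_predictors)


R = ar1_corr(rho=0.5, n_times=6)  # rho = 0.5 is an arbitrary illustrative value
etas = [0.0] * 6                  # hypothetical linear predictors, so pi_it = 0.5
print(R[0])
print(expected_months_of_use(etas))
```

In an actual analysis, each linear predictor would be the fitted value of the marginal logistic model for the subject's treatment group at that month, so that group differences in β translate into differences in the summed probabilities.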

We compare the estimates of the treatment effects (β₁, β₂, β₃) using five alternative approaches: 1) the standard GEE approach; 2) multiple imputation from the multivariate normal without cross-products in the multivariate normal outcome; 3) multiple imputation from the multivariate normal with cross-products in the multivariate normal outcome; 4) FCS imputations with no pairwise interactions in the model for Y_it given the other elements of the vector Y_i; and 5) FCS imputations with pairwise interactions between the elements of the vector Y_i as additional covariates in the model for Y_it. We note that for all imputation approaches, when imputing missing data, no distinction is made between monotone and non-monotone patterns (i.e., imputing Y_m,i given Y_o,i does not differ according to whether or not the elements of Y_o,i have a monotone pattern).

Among the five approaches, the parameter estimates are similar (see Table 2). When compared to the reference group, IDC, only CT has higher rates of within-treatment drug use. This is an important clinical study, and even though the estimates using all approaches are similar, it was important for the clinical investigators that a sensitivity analysis was performed to assess whether the non-monotone missing data could lead to biased results. Thus, it is reassuring (to our clinical collaborators) that in this application all approaches yield similar results. We note here that since the estimates are so similar between standard GEE (the approach that is unbiased under MCAR) and all of the imputation approaches (which are unbiased under MAR assuming the imputation models are correct), it is tempting to conclude that the data are likely to be MCAR. However, we caution against such an interpretation because the assumption of MAR (which also includes MCAR) is an unverifiable assumption in the sense that MAR cannot be distinguished from not missing at random (NMAR) based on the distribution of the observed data.

Table 2.

Logistic regression parameter estimates for the CCTS longitudinal clinical trial

Effect	Method^a	Estimate	Standard Error	p-value
INTERCEPT	GEE	0.211	0.139	0.130
	MVN-MI	0.197	0.139	0.155
	MVN-MI-cross	0.192	0.142	0.176
	FCS-MI	0.198	0.137	0.148
	FCS-MI-interact	0.173	0.139	0.216
CT	GEE	0.516	0.207	0.013
	MVN-MI	0.507	0.206	0.014
	MVN-MI-cross	0.525	0.211	0.013
	FCS-MI	0.516	0.205	0.012
	FCS-MI-interact	0.499	0.201	0.013
SE	GEE	0.387	0.203	0.056
	MVN-MI	0.363	0.202	0.073
	MVN-MI-cross	0.365	0.207	0.078
	FCS-MI	0.359	0.199	0.072
	FCS-MI-interact	0.377	0.202	0.062
GDC	GEE	0.249	0.203	0.220
	MVN-MI	0.270	0.202	0.183
	MVN-MI-cross	0.262	0.206	0.202
	FCS-MI	0.266	0.201	0.185
	FCS-MI-interact	0.226	0.200	0.259

^a MVN-MI is multivariate normal imputation without cross-products

MVN-MI-cross is multivariate normal imputation with cross-products

FCS-MI is FCS imputation without pairwise interactions

FCS-MI-interact is FCS imputation with pairwise interactions
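As a quick arithmetic check on how the columns of Table 2 relate, the sketch below computes a two-sided Wald p-value from an estimate and its standard error using a standard normal reference distribution. This is an assumption for illustration: the section does not state how the p-values were computed, and p-values from multiple imputation are often based on a t reference distribution instead. The function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value from a normal-reference Wald test of H0: beta = 0."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# CT effect under standard GEE in Table 2: estimate 0.516, standard error 0.207.
p_ct = wald_p_value(0.516, 0.207)
print(round(p_ct, 3))
```

Small discrepancies from the tabulated p-values are to be expected, because the estimates and standard errors in the table are themselves rounded to three decimals.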

7. Discussion

In this paper we consider multiple imputation approaches for estimating parameters of a marginal model using GEE. The results of the asymptotic study suggest that multiple imputation using either a multivariate normal approach with additional cross-product terms or an FCS approach with interactions yields marginal regression parameter estimates with much less bias than multiple imputation without the cross-product terms or interactions. Further, when the true missing data model does not contain the interaction, it was found that the multiple imputation approaches without cross-product terms yielded larger bias than standard GEE based on the observed data only. Thus, in our study of asymptotic bias, MI without cross-product terms did not perform any better than standard GEE, and in some cases performed worse. Our conjecture is that, since the true Bahadur model in the asymptotic study is such that the conditional distribution of any Y_it given the other three Y_is’s depends on the cross-products of pairs, any imputation approach that fails to include these cross-products could produce bias, which is why the multivariate normal approach without cross-product terms and the FCS approach without interactions had the largest bias. In summary, the purpose of this paper was to “enrich” the information (the cross-product terms) in the conditional distribution of the missing data given the observed data in the imputation step in order to reduce the bias of the resulting multiple imputation estimates.

Because of the broad range of possible data configurations with multiple covariate and missing data distributions, it is difficult to draw definitive conclusions from this study of asymptotic bias. Nonetheless, in the asymptotic study reported here, MI with cross-product terms performs discernibly better than MI without cross-product terms. At the very least, we suggest investigators conduct a sensitivity analysis with inclusion of appropriate cross-product terms in the imputation model to ensure that missing data will not lead to substantial bias in estimation via GEE. Further, in this paper, we focused on developing new imputation approaches to minimize the bias. Important future work should explore other performance criteria such as mean square error. Our asymptotic study with non-monotone missingness has T = 4 time points, but we expect similar bias with more than 4 time points. Even though our approaches should minimize bias for any T, the computational burden of the proposed imputation approaches increases with T because both the number of pairwise products (and interactions) and the number of missing data patterns increase. Thus, the main issue with increased T is the computational burden.
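To give a sense of how this burden grows, the short sketch below counts the pairwise products among T binary outcomes, T(T − 1)/2, and an upper bound of 2^T on the number of observed/missing patterns when every outcome may be missing. The function name `imputation_model_size` is hypothetical, and the 2^T bound is an assumption that counts the fully observed and fully missing patterns as well.

```python
from math import comb


def imputation_model_size(n_times):
    """Count pairwise products and possible missingness patterns for T occasions."""
    n_products = comb(n_times, 2)  # T(T-1)/2 pairwise products among the outcomes
    n_patterns = 2 ** n_times      # upper bound: each outcome observed or missing
    return n_products, n_patterns


for T in (4, 6, 10):
    print(T, imputation_model_size(T))
```

For T = 4 there are 6 pairwise products, matching the augmented vector used in the asymptotic study, while for the T = 6 occasions of the CCTS there are 15.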

Although beyond the scope of this paper, it would be of interest to explore how the MI approach with cross-products performs for other types of non-Gaussian data besides binary data. Finally, note that if the missing outcome data are assumed to be non-ignorable (i.e., the probability of missingness depends on unobserved data) with non-monotone missing data patterns, then multiple imputation lends itself well to non-ignorable sensitivity analyses (Carpenter & Kenward, 2013); this is also a topic that is beyond the scope of this paper.

Acknowledgments

The authors were supported by National Institutes of Health, National Institute on Drug Abuse grants [NIDA R33 DA042847, UG1 DA015831, K24 DA022288].

Contributor Information

Stuart R. Lipsitz, Brigham and Women’s Hospital and Ariadne Labs, Boston, MA, U.S.A

Garrett M. Fitzmaurice, McLean Hospital, Belmont, MA, U.S.A

Roger D. Weiss, McLean Hospital, Belmont, MA, U.S.A

References

Bahadur RR (1961). A representation of the joint distribution of responses to n dichotomous items. In Studies in Item Analysis and Prediction, Ed. Solomon H, pp. 158–68. Stanford Mathematical Studies in the Social Sciences VI. Stanford University Press.
Barnard J, & Meng XL (1999). Applications of Multiple Imputation in Medical Studies: From AIDS to NHANES. Statistical Methods in Medical Research, 8, 17–36.
Beunckens C, Sotto C, & Molenberghs G (2008). A simulation study comparing weighted estimating equations with multiple imputation based estimating equations for longitudinal binary data. Computational Statistics & Data Analysis, 52, 1533–1548.
Carey V, Zeger SL, & Diggle PJ (1993). Modelling multivariate binary data with alternating logistic regressions. Biometrika, 80, 517–526.
Carey VJ, Lumley T, & Ripley BD (2012). gee: Generalized Estimation Equation Solver. R package version 4.13–18. URL http://CRAN.R-project.org/package=gee
Carpenter JR, & Kenward MG (2013). Multiple Imputation and Its Application. New York: Wiley.
Crits-Christoph P, et al. (1999). Psychosocial treatments for cocaine dependence: National Institute on Drug Abuse Collaborative Cocaine Treatment Study. Archives of General Psychiatry, 56, 493–502.
Enders CK (2010). Applied Missing Data Analysis. New York: The Guilford Press.
Gilks WR, Richardson S, & Spiegelhalter DJE (1996). Markov Chain Monte Carlo in Practice. New York: Chapman & Hall.
Graham JW, Olchowski AE, & Gilreath TD (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206–213.
Horton NJ, & Lipsitz SR (2001). Multiple Imputation in Practice: Comparison of Software Packages for Regression Models with Missing Variables. The American Statistician, 55, 244–254.
Horton NJ, Parzen M, & Lipsitz SR (2003). A Potential for Bias When Rounding in Multiple Imputation. The American Statistician, 57, 229–232.
Johnson RA, & Wichern DW (2002). Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice Hall.
Laird NM (1988). Missing data in longitudinal studies. Statistics in Medicine, 7, 305–315.
Liang KY, & Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
Lipsitz SR, Laird NM, & Harrington DP (1992). A three-stage estimator for studies with repeated and possibly missing binary outcomes. Applied Statistics, 41, 203–213.
Lipsitz SR, Fitzmaurice GM, Orav EJ, & Laird NM (1994). Performance of generalized estimating equations in practical situations. Biometrics, 50, 270–278.
Lipsitz SR, Molenberghs G, Fitzmaurice GM, & Ibrahim J (2000). GEE with Gaussian estimation of the correlations when data are incomplete. Biometrics, 56, 528–536.
Little RJA, & Rubin DB (2002). Statistical Analysis with Missing Data. 2nd ed. New York: John Wiley & Sons.
Liu M, Taylor JM, & Belin TR (2000). Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics, 56, 1157–1163.
Paik M (1997). The generalized estimating equation approach when data are not missing completely at random. Journal of the American Statistical Association, 92, 1320–1329.
Robins JM, Rotnitzky A, & Zhao LP (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, 106–121.
Rotnitzky A, & Wypij D (1994). A note on the bias of estimators with missing data. Biometrics, 50, 1163–1170.
Rubin DB (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin DB (1978). Multiple imputations in sample surveys - a phenomenological Bayesian approach to nonresponse. In Proceedings of the International Statistical Institute, Manila, 517–532.
Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
Rubin DB, & Schenker N (1986). Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366–374.
SAS Institute Inc (2020). SAS/STAT Software, Version 9.4. Cary, NC. URL http://www.sas.com/
Schafer JL (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall Ltd.
Schafer JL (1999). Multiple Imputation: A Primer. Statistical Methods in Medical Research, 8, 3–15.
Scheuren F (2005). Multiple imputation: How it began and continues. The American Statistician, 59, 315–319.
Tchetgen E, Wang L, & Sun B (2017). Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Unpublished manuscript. Archived as arXiv:1607.02631v3 [stat.ME] at: https://arxiv.org/abs/1607.02631v3
Van Buuren S (2007). Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification. Statistical Methods in Medical Research, 16, 219–242.
Van Buuren S, & Groothuis-Oudshoorn K (2011). mice: multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–68.
Van Buuren S (2012). Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC.
Zellner A, & Rossi PE (1984). Bayesian analysis of dichotomous quantal response models. Journal of Econometrics, 25, 365–393.
