Abstract
Longitudinal data are common in practice, but outcomes or time-dependent risk factors are often missing, which makes the data highly unbalanced and complex. Missing data may arise under various missing patterns or mechanisms, and handling them properly for unbiased and valid inference remains a significant challenge. Here, we propose a novel semiparametric framework for analyzing longitudinal data in which both responses and covariates are intermittently missing at random, a general and widely encountered situation in observational studies. Within this framework, we consider multiple robust estimation procedures based on innovative calibrated propensity scores, which offer additional protection against misspecification of the missing data mechanism and show more satisfactory numerical performance. In addition, a corresponding robust information criterion for consistent variable selection in the proposed model is developed based on empirical likelihood. The advocated methods are evaluated both in theory and in extensive simulation studies under a variety of situations, showing competitive properties and advantages over existing approaches. We illustrate the utility of our approach by analyzing data from the HIV Epidemiology Research Study.
Keywords: empirical likelihood, missing at random, propensity scores, semiparametric models, variable selection
1 |. INTRODUCTION
Robust inference in longitudinal data analysis has been an active research topic over the past decades. Parametric models are parsimonious but risk inducing modeling bias if the assumed distribution is far from the truth, while nonparametric methods are often too flexible to yield concise inference. Semiparametric approaches have therefore gained wide attention as a compromise (Liang and Zeger, 1986). However, in the presence of missing data, which are typical and even inevitable in longitudinal follow-up studies, existing semiparametric approaches might be invalid if the missing data are simply ignored or improperly treated.
Generalized estimating equations (GEE), proposed by Liang and Zeger (1986), are a popular semiparametric approach for marginal inference on longitudinal data. However, GEE might yield biased estimates when the missing mechanism is not missing completely at random (Preisser et al., 2002). Robins et al. (1995) adopted inverse probability weighting (IPW) to adjust for this bias and proposed weighted generalized estimating equations (WGEE) for data that are missing at random (MAR). However, WGEE provides consistent estimates only when the missing data model is correctly specified. Thereafter, augmented inverse probability weighting (AIPW) was proposed in the context of dropout missingness under MAR to incorporate ancillary information and gain robustness to misspecification of the missing data model (Robins and Rotnitzky, 1995; Bang and Robins, 2005; Seaman and Copas, 2009; Lin et al., 2017); it requires either the missing data model or the conditional mean model constructed from ancillary variables to be correctly specified. However, AIPW still demands the specification of a series of conditional mean models, which poses an even tougher practical question in applications. Also, extensive simulation studies have shown that AIPW is sensitive to near-zero values of the estimated probabilities, leading to more severely biased estimates when both the missing data model and the conditional mean models are misspecified (Robins and Wang, 2000; Seaman and Copas, 2009; Chen and Zhou, 2011; Han, 2014). Later, specifically for intermittent missingness in both outcomes and covariates under MAR, Chen et al. (2010) proposed an extended WGEE, and the corresponding AIPW approach was investigated by Chen and Zhou (2011). However, that approach requires specifying a sequence of models for the missing covariates, together with additional numerical integration, to achieve robust estimation, and their work focused only on binary outcomes.
Compared to the double robust property in AIPW, a so-called multiple robust method has been recently proposed (Han and Wang, 2013; Han, 2014; Chen and Haziza, 2017), where the estimators would be consistent if one of the multiple missing data models or the conditional mean model is correctly specified. However, only cross-sectional data were previously considered in their work. For longitudinal data with dropouts in time-dependent variables, Han (2016) constructed propensity scores by calibrating the estimated observing probabilities to achieve multiple robustness and intrinsic efficiency properties. However, only the mean estimation of the response measured at the study end was investigated, and also the missing data model has to be correctly specified to achieve those desired properties.
Currently, substantial work focuses on dropout missingness, for which well-established approaches exist (Liang and Zeger, 1986; Robins and Rotnitzky, 1995; Chen and Zhou, 2011; Chen et al., 2019); however, studies addressing intermittent missingness remain limited. Under intermittent missingness, some participants are not observed at every visit during follow-up, so that at least one response is missing at some time point. This nonmonotone missingness is a common missing-data pattern in real applications, for reasons such as missed visits, study withdrawal, or data entry errors (Lin et al., 2018). Our current work is motivated by data from the HIV Epidemiology Research Study (HERS), a longitudinal cohort study of 1310 women with or at high risk for HIV infection, conducted from 1993 to 2000. During follow-up, a variety of clinical, behavioral, demographic, and sociological outcomes were recorded approximately every 6 months, with intermittent missingness in several measures (e.g., CD4+ counts, human papillomavirus [HPV]). However, so far few studies have utilized these rich data to identify potential risk factors for HPV infection. More details of the study design and data collection can be found in Smith et al. (1997) and Grady et al. (1998).
In this paper, we focus on longitudinal data with missing responses and missing covariates that are MAR and conduct robust inference based on semiparametric models with the propensity score method. Here, the term “robustness” refers to tolerance of misspecification of the model for missing data. This unified framework includes consistent estimation and variable selection while accommodating intermittent missingness in either responses or time-varying covariates. A novel calibration procedure is introduced for parameter estimation, which is multiple robust to misspecification of the missing data model. The theory in this paper also provides some insight that the proposed estimators might still be empirically robust even when none of the candidate models for missing data is correct. Accordingly, we further advocate a robust information criterion based upon empirical likelihood for consistent variable selection (ie, to exactly capture the true model with probability tending to one) (Owen, 2001). Note that Chen et al. (2019) considered the scenario of longitudinal data with dropouts under MAR, and proposed a joint empirical Akaike information criterion (JEAIC) and a joint empirical Bayesian information criterion (JEBIC) for simultaneous selection of marginal mean and correlation structures. These criteria outperform existing approaches such as the missing longitudinal information criterion (MLIC) (Shen and Chen, 2012) and the weighted quasi-likelihood information criterion (QICW) (Gosho, 2016) in extensive simulation studies, and attain appealing asymptotic properties; however, those theoretical results were obtained under the assumption that the missing data are correctly modeled. Recently, Chen et al. (2020) provided a more general robust criterion, called the empirical likelihood-based consistent information criterion (ELCIC), with theoretical derivations of its consistency property. Here, we further adapt the idea of ELCIC to handle variable selection in this complicated context with missing data.
The rest of the paper is organized as follows. In Section 2, we present notation and a unified semiparametric framework for analyzing longitudinal responses and covariates with intermittent missingness under MAR, including robust estimation procedures and a robust information criterion for variable selection. A specific case of intermittent missing data in both responses and covariates is discussed in Section 3. Section 4 conducts extensive simulations to evaluate the finite-sample performance of our proposal. A real data application from the HERS is presented in Section 5 for illustration. Section 6 presents concluding remarks and discusses potential future work.
2 |. NOTATION AND METHOD FORMULATION
2.1 |. The model setups
First, let us introduce some basic notations. For i = 1, …, n and j = 1, …, T, we consider Dij as the set of {Wi, Yij, Xij}, where Yij and Xij are, respectively, responses and covariates, which are potentially unobserved, and Wi = {Wi1, …, WiT} is the collection of other covariates that are always observed including the intercept. For i = 1, …, n, we have , with Yi = (Yi1, …, YiT)T and Xi = (Xi1, …, XiT)T, independently and identically distributed (i.i.d). Besides, denote Rij as a missing indicator for the data Dij, which equals 1 if the data Dij are all observed, and 0 otherwise. For j ≥ 2, notation is the collection of all the history data {Di1, Di2, …, Dij−1}, denotes the history of missing indicators {Ri1, Ri2, …, Rij−1}; and and denote overall and the history of observed data, respectively. Throughout, we use ∑i,j to represent the summation across all nT pairs of {i, j}, and denote and as the summation and product across observed data {i, j}, respectively; similarly, and are defined across j. E(·) and var(·) represent expectation and variance of random variable or random vector, respectively. We use A⊗2 to denote operation AAT for any vector A. The notation ‖·‖ is the Euclidean norm. Also, we model the marginal mean of Yij given {Xij, Wij}, denoted by μij, which has the form
| (1) |
for i = 1, …, n and j = 1, …, T, where are the regression coefficients with the true values β0, and l(·) is any monotone and differentiable link function depending on the type of responses. The covariates Wij are always observed as we mentioned above.
There are several different assumptions for MAR defined in the longitudinal framework in the literature (Robins et al., 1995; Chen et al., 2010). A reasonable and common assumption in applications, though not the mildest one, is that the probability of Rij = 1 given Di satisfies
| (2) |
for each i = 1, …, n and j = 2, …, T, and πi1 = 1 for all i, ie, the baseline data at the first visit are always observed. Under this assumption, any systematic differences between the observed and missing data can be explained by their associations with the observed data: given the observed data for subject i, the probability that the data at time j are observed is assumed to be independent of the unobserved data in {Yi, Xi}. The same assumption was adopted by Chen et al. (2010) to characterize intermittent missingness in both responses and covariates. Throughout, we assume the observing probability is bounded away from zero by a constant, ie,
| (3) |
for all i = 1, …, n and j = 1, …, T. Therefore, based on the assumptions (2) and (3), it is easy to show that
| (4) |
holds for any measurable function H, and this is fundamental for deriving desired properties of our proposed estimators.
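The role of (4) can be checked numerically: weighting each observed record by the reciprocal of its observing probability recovers a full-data expectation, whereas a complete-case average does not. The following is a minimal sketch under illustrative assumptions that are not taken from the paper: a single visit, a hypothetical logistic observing probability that depends only on an always-observed covariate `w`, and H taken to be the outcome itself.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 200_000

# Always-observed covariate and an outcome that depends on it.
w = rng.normal(size=n)
y = 1.0 + 2.0 * w + rng.normal(size=n)

# Hypothetical MAR mechanism: the chance of observing y depends on w only.
pi = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * w)))
r = rng.binomial(1, pi)

full_mean = y.mean()              # target E{H(D)}; unavailable with real missing data
naive_mean = y[r == 1].mean()     # complete-case average, biased under this mechanism
ipw_mean = np.mean(r * y / pi)    # empirical version of E{R / pi * H(D)}

print(f"full: {full_mean:.3f}  complete-case: {naive_mean:.3f}  IPW: {ipw_mean:.3f}")
```

In this toy setting the weighted average tracks the full-data mean, while the complete-case average drifts upward because records with large `w` are observed more often.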
2.2 |. The proposed estimators
Based on Equation (4), given an observing probability πij(θ) that is correctly modeled with some parameters θ, a consistent estimator of β can be obtained by solving
| (5) |
where , and is a consistent estimator. However, one major limitation of (5) is that it requires a correctly specified model for πij(θ). Also, this approach is sensitive to extreme values of the weights in practice.
Instead, we adopt a calibration technique to obtain more robust estimators, which in essence relies on some moment equations (Deville and Särndal, 1992; Han, 2014, 2016). In the following, we first present the rationale and procedure of calibration in a general framework, and then establish the desired properties. The calibration is founded on the following conditional moment equation. Given the i.i.d assumption and any measurable function fij(Di; θ) indexed by some parameters θ, we have
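A weighted estimating equation of the type in (5) can be solved by Newton's method. The sketch below assumes, for illustration only, that Uij(β) reduces to xij{yij − μij(β)}, as it would for a logit link with an independence working correlation; the function name `solve_weighted_logistic_ee` and the toy observing probability are hypothetical and not taken from the paper.

```python
import numpy as np

def solve_weighted_logistic_ee(x, y, weights, n_iter=50, tol=1e-10):
    """Solve sum_k weights[k] * x[k] * (y[k] - expit(x[k] @ beta)) = 0 by Newton's method.

    x: (m, p) design matrix over observed (i, j) records, intercept included.
    y: (m,) binary responses.
    weights: (m,) positive weights, e.g. 1 / pi_hat or calibrated weights.
    """
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-x @ beta))
        score = x.T @ (weights * (y - mu))
        # Negative Jacobian of the score; positive definite for positive weights.
        info = (x * (weights * mu * (1.0 - mu))[:, None]).T @ x
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy check with a known (assumed) observing probability.
rng = np.random.default_rng(7)
n = 20_000
x1 = rng.uniform(-1.0, 1.0, size=n)
x2 = rng.binomial(1, 0.5, size=n)
design = np.column_stack([np.ones(n), x1, x2])
beta_true = np.array([0.5, 0.5, -0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-design @ beta_true)))

pi = 1.0 / (1.0 + np.exp(-(0.5 + x1)))       # hypothetical observing probability
obs = rng.binomial(1, pi).astype(bool)
beta_hat = solve_weighted_logistic_ee(design[obs], y[obs], 1.0 / pi[obs])
print(beta_hat)
```

The same solver accepts any positive weights, so inverse-probability weights can be swapped for calibrated weights without changing the code.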
| (6) |
where with , and hj(θ) could be any quantities free of D1. One example, as described later, is that the function fij(Di; θ) could be πij(θ) with hj(θ) = E{f1j(D1;θ)}. Under the assumption (2), it is easy to check that Equation (6) holds only when the observation-level weight under the correctly specified observing probability π1j(θ). Accordingly, the empirical version of the above Equation (6) can be written as
| (7) |
where , and and are some consistent estimators. Note that the above equality does not hold in general due to sample randomness. Thus, the proposed calibration recalculates the weights wij by the following constrained optimization procedure. Consider wij ≥ 0 such that , where wij are calibrated by maximizing subject to the above constraint as well as for identifiability. By the standard Lagrange multiplier method, we have
| (8) |
with M defined as the total number of observations and satisfying by minimizing the loss function . Here, we call the calibrated weights. This constrained optimization agrees with the idea in Deville and Särndal (1992) to fit a hybrid propensity score model.
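The constrained optimization above has the structure of a standard empirical likelihood problem: maximizing the product of nonnegative weights that sum to one and satisfy a moment constraint gives weights of the form 1/{M(1 + ρᵀg)}, with ρ minimizing −Σ log(1 + ρᵀg). A minimal sketch of that generic computation follows. The function name `calibrate_weights`, the cross-sectional toy data, and the two candidate observing-probability models (one matching the mechanism, one using an irrelevant covariate, both treated as known rather than fitted) are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def calibrate_weights(g, n_iter=100, tol=1e-10):
    """Empirical-likelihood calibration over M observed records.

    Maximizes prod(w) subject to w >= 0, sum(w) = 1 and sum(w[:, None] * g) = 0.
    The solution is w_k = 1 / (M * (1 + rho @ g_k)), where rho minimizes
    -sum(log(1 + g @ rho)).  g has shape (M, q).
    """
    m = g.shape[0]
    rho = np.zeros(g.shape[1])

    def loss(r):
        arg = 1.0 + g @ r
        return np.inf if np.any(arg <= 1e-10) else -np.sum(np.log(arg))

    for _ in range(n_iter):
        scaled = g / (1.0 + g @ rho)[:, None]
        grad = -scaled.sum(axis=0)
        if np.max(np.abs(grad)) < tol * m:
            break
        step = np.linalg.solve(scaled.T @ scaled, -grad)   # Newton direction
        # Backtrack so every 1 + rho'g stays positive and the loss does not increase.
        t, current = 1.0, loss(rho)
        while loss(rho + t * step) > current and t > 1e-12:
            t /= 2.0
        rho = rho + t * step
    return 1.0 / (m * (1.0 + g @ rho)), rho

# Toy example: one correct and one misspecified candidate model.
rng = np.random.default_rng(11)
n = 50_000
w, z = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * w + rng.normal(size=n)
expit = lambda v: 1.0 / (1.0 + np.exp(-v))
pi_true = expit(0.5 + w)      # candidate 1: matches the missing mechanism
pi_wrong = expit(0.2 + z)     # candidate 2: depends on an irrelevant covariate
obs = rng.binomial(1, pi_true).astype(bool)

cands = np.column_stack([pi_true, pi_wrong])
g = cands[obs] - cands.mean(axis=0)   # candidate value minus its full-sample average
weights, rho = calibrate_weights(g)
calibrated_mean = weights @ y[obs]
print(calibrated_mean)
```

Because the weights are bounded by the constraint set rather than formed as raw reciprocals of small probabilities, a few records with very small observing probabilities have less leverage than under plain inverse weighting.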
Thereafter, we can define the estimator by solving the following estimating equations:
| (9) |
where Uij(β) is defined as in (5). It is worth mentioning at this point that formulas (5) and (9) do not account for correlations among repeated measures. Further discussion of correlation structures is deferred to Section 6.
2.3 |. Asymptotic properties
We will apply the aforementioned calibration to achieve multiple robustness of the estimates from (9) with regard to misspecification of the missing data model, and then describe their asymptotic properties from three perspectives: consistency, high robustness, and asymptotic normality. First, we discuss a potential form of to achieve these desired properties; other choices are discussed in Section 6. For any i and j, we consider
| (10) |
where the observing probabilities πij(θ) are correctly specified; ; and c1(θ) is a vector which only depends on θ or could be constant. A special and simple example is to set c1(θ) = 1, ie, . In this case, general notations in Section 2.2 are specified as , and , respectively.
First, to establish the consistency of , let us draw a connection to biased sampling problems by maximizing subject to the following constraints:
where pij is the conditional empirical probability mass on Dij given Rij = 1, and πij(θ) are the underlying true observing probabilities. Then, based on the empirical likelihood procedure, we have , where the Lagrange multipliers are obtained by solving the following equations:
| (11) |
Compared with (8), we conclude that by the uniqueness of the solution , which leads to the estimates of wij as
| (12) |
Similar to Han (2014), the extra weight in (12) would dilute the severe impact of extreme values to the estimators, and thus leads to more stable numerical performance. Moreover, the weights can be recognized as the functions of , and . Note that, given the probability πij under the correctly specified model with the true value of θ0, we have λ0 = 0, with details provided in the Supplementary Material. Also, we can derive with τj(θ0) = E{fij(Di; θ0)}, and . Thus, applying (4) leads to
| (13) |
for all i and j, which is the fundamental property for validating the estimating equations. Accordingly, from (13), the consistency is summarized in the following theorem.
Theorem 1. Given the assumptions (2) and (3) and regularity conditions in the Supplementary Material, we have , where is obtained from solving (9).
Next, we show the estimator is multiple robust. From the previous discussion, we know that the consistency of holds when the functions satisfy (10). Although this appears to be a strong restriction at first glance, it covers a wide family of choices of . For instance, if the first element of is equal to , then (10) would be satisfied by setting c(θ) = (1, 0, …, 0)T. This implies that estimation consistency holds as long as one of the elements in equals regardless of the remaining elements, which shares the multiple robust property proposed for cross-sectional data analysis in Han (2014). In theory, however, consistency still holds under a milder condition, namely that a linear combination of equals . The essential merit of this milder condition is the insight that, even though no element of is exactly equal to , we still expect a well-behaved estimator empirically if we believe there is a linear combination of that approximates . Such higher robustness to misspecification of the missing data model has already been observed in the simulation studies under cross-sectional data as well as the case studies in section 4 of Han (2014). The theory derived in this paper provides more theoretical insights into this phenomenon.
Finally, suppose that are some estimating equations for parameters only involved in the missing data model, which will be discussed in detail in Section 3. Then, based on the consistency under regularity conditions in the Supplementary Material, we derive the influence function of the estimator with the true value β0 as before, which is summarized in the following theorem.
Theorem 2. Given the assumptions (2) and (3) and regularity conditions in the Supplementary Material, the consistency property further implies that the influence function of the estimator with the true value β0 is
| (14) |
where
with gij(θ0) satisfying (10).
Here, and . Based on Theorem 2, we conclude that converges in distribution to the multivariate normal distribution MVN(0, Σ) with . Note that the influence function in (14) differs from that of Theorem 2 in Han (2014). By conducting the calibration based on the moment equations (6), our proposed method automatically accounts for the marginal observing probabilities μπj(θ0) = E{πij(θ0)} in the asymptotic standard error of the estimators, which places heavier weights on the estimating equations corresponding to visits j with less missingness. We believe that it is desirable and more reasonable to give more credit to visits with more observations, in contrast to WGEE, where observations from different visits are treated equally regardless of their missing rates. Further discussion of efficiency, though not the key focus of this paper, can be found in the Supplementary Material. In practice, since we often do not know the true missing data model, the subject-resampling bootstrap, ie, resampling the subjects with replacement, could be implemented to estimate the standard errors of the estimators (Han, 2016). Here we mainly focus on robustness; how to model the observing probabilities πij(θ) is illustrated in Section 3. Despite its high robustness with respect to missing data models, the estimator is still vulnerable to misspecification of the marginal mean model μij(β), which could also lead to biased estimates and invalid inference in practice. We therefore turn next to robust and consistent variable selection for the marginal mean structure.
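The subject-resampling bootstrap mentioned above draws whole subjects with replacement, so that all visits of a selected subject stay together and within-subject correlation is preserved. A minimal sketch follows; the function name `subject_bootstrap_se`, the array layout, and the toy estimator (an overall mean on complete data) are hypothetical choices for illustration.

```python
import numpy as np

def subject_bootstrap_se(data_by_subject, estimator, n_boot=200, seed=0):
    """Bootstrap standard errors by resampling whole subjects with replacement.

    data_by_subject: array of shape (n, T, ...) holding each subject's records.
    estimator: function mapping such an array to a 1-D array of estimates.
    """
    rng = np.random.default_rng(seed)
    n = data_by_subject.shape[0]
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # indices of subjects, not visits
        draws.append(estimator(data_by_subject[idx]))
    return np.std(np.asarray(draws), axis=0, ddof=1)

# Toy data: n subjects, T visits, exchangeable within-subject correlation of 0.5.
rng = np.random.default_rng(1)
n, T = 400, 3
y = rng.normal(size=(n, 1)) + rng.normal(size=(n, T))

se_boot = subject_bootstrap_se(y, lambda d: np.atleast_1d(d.mean()))
se_ref = y.mean(axis=1).std(ddof=1) / np.sqrt(n)  # analytic SE for this simple estimator
print(se_boot[0], se_ref)
```

For the proposed estimators, `estimator` would rerun the whole pipeline (fitting the candidate missing data models, calibrating the weights, and solving the estimating equations) on each resampled data set.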
2.4 |. Variable selection via ELCIC
Variable selection is critical for assessing model goodness of fit and remains an area of active research. Here, we present a novel information criterion to identify the true marginal mean model. Since the estimation proposed in the previous sections is under a semiparametric framework, traditional likelihood-based information criteria cannot be applied. Moreover, the multiple robust estimation procedure leaves the true underlying missing data model undetermined, which substantially increases the difficulty of specifying an information criterion and makes existing criteria infeasible.
Note that the empirical likelihood-based consistent information criterion (Chen et al., 2020) is well suited to this situation in the sense that, first, it does not require any specification of the data distribution, and, second, no extra model specification is needed, as long as proper full estimating equations and well-behaved plug-in estimators are provided. We now embed ELCIC into our unified framework to facilitate robust and consistent variable selection in the marginal mean structure, ie, selecting the true model with probability tending to one, without any prior information on the correct missing data model.
Suppose the largest model we consider involves a covariate matrix , which may include X, which are potentially unobserved, or W, which are always observed. Also, define with the true values Ω0, where , λ and θ are defined above. The full estimating equations for variable selection are given by
| (15) |
where with the true values as . Here, the coefficient parameters β correspond to the variables in a candidate model, and 0 are coefficients corresponding to the remaining variables in . Based upon the full estimating equations (15), the empirical likelihood ratio can be derived in the following manner:
| (16) |
From (16), the point mass probabilities for the subject-level observations are utilized to construct the empirical likelihood ratio. Based on the theory of empirical likelihood, the information from the data is automatically and efficiently borrowed through the estimating equation constraints without any distributional assumptions (Qin and Lawless, 1994).
Similar to Chen et al. (2020), we replace the parameters Ω by the consistent estimator with the convergence rate of order Op(n−1/2), under which a special case is provided in the proof of Theorem 1 in the Supplementary Material. Finally, based on the standard Lagrange approach (Owen, 2001), we could derive our information criterion
| (17) |
where ; p denotes the number of predictors in a candidate model; and the Lagrange multiplier is obtained by solving the estimating equations . This is a special extension of ELCIC to our framework, and we still name it ELCIC for consistency with the literature. The model with the smallest value of ELCIC will be selected.
It is worth mentioning that Chen et al. (2020) provided a general theory to validate the consistency of ELCIC under mild conditions. To justify our proposal, it remains to show that E{gF(Ω0)} = 0, which follows from an argument similar to that for (13). From Equation (12),
| (18) |
Similar to Section 2.3, λ0 = 0 holds under the correct missing data model πij(θ0), and thus by applying (4), we have E{gF(Ω0)} = 0. Accordingly, under the regularity conditions in Chen et al. (2020), ELCIC selects the true model with probability tending to one; the detailed technical proofs can be obtained along the same lines. Thus, by utilizing (15), ELCIC inherits all the robustness properties illustrated in Section 2.3.
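The mechanics of an empirical-likelihood information criterion can be sketched on a much simpler problem than the one treated here. The code below assumes the criterion takes the form of twice the negative log empirical likelihood ratio, evaluated at plug-in estimates under the full estimating equations, plus a penalty of p log n; this form, the complete-data linear model, and the names `el_log_ratio` and `elcic_linear` are assumptions for illustration and stand in for (15) to (17) rather than reproduce them.

```python
import numpy as np

def el_log_ratio(g, n_iter=100, tol=1e-10):
    """Return sum(log(1 + g @ lam)) at the empirical-likelihood multiplier lam."""
    n = g.shape[0]
    lam = np.zeros(g.shape[1])

    def objective(v):                     # convex in v; +inf outside its domain
        arg = 1.0 + g @ v
        return np.inf if np.any(arg <= 1e-10) else -np.sum(np.log(arg))

    for _ in range(n_iter):
        scaled = g / (1.0 + g @ lam)[:, None]
        grad = -scaled.sum(axis=0)
        if np.max(np.abs(grad)) < tol * n:
            break
        step = np.linalg.solve(scaled.T @ scaled, -grad)
        t, current = 1.0, objective(lam)
        while objective(lam + t * step) > current and t > 1e-12:
            t /= 2.0                      # backtracking keeps 1 + lam'g positive
        lam = lam + t * step
    return -objective(lam)

def elcic_linear(x_full, y, keep):
    """ELCIC-type score for the candidate model using columns `keep` of x_full."""
    n = len(y)
    beta = np.zeros(x_full.shape[1])      # coefficients of excluded columns stay 0
    beta[keep] = np.linalg.lstsq(x_full[:, keep], y, rcond=None)[0]
    g_full = x_full * (y - x_full @ beta)[:, None]   # full estimating equations
    return 2.0 * el_log_ratio(g_full) + len(keep) * np.log(n)

rng = np.random.default_rng(3)
n = 500
x_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = 0.5 + 1.0 * x_full[:, 1] - 1.0 * x_full[:, 2] + rng.normal(size=n)

candidates = {"under": [0, 1], "true": [0, 1, 2], "full": [0, 1, 2, 3]}
scores = {name: elcic_linear(x_full, y, cols) for name, cols in candidates.items()}
print(scores)
```

The candidate with the smallest score would be selected. An underfitted candidate leaves a nonzero mean in the estimating equations of the omitted covariates, which inflates the likelihood-ratio term, while an overfitted candidate pays only through the penalty.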
3 |. MODELS FOR OBSERVING PROBABILITY
We now describe how to specify the model for the observing probabilities πij(θ), where θ are the parameters involved only in the observing probabilities. Here, we consider the scenario with both missing responses and missing covariates under MAR with assumption (2), which allows a general missing mechanism, ie, intermittent missingness, for broad application. For illustration, we consider only one missing covariate; extensions to multiple missing covariates can be found in Chen et al. (2010) and Chen and Zhou (2011). In this situation, we advocate obtaining πij(θ) under assumption (2) by decomposing it into three components, following Chen et al. (2010): , and ψij, a relative odds ratio defined as
where and are, respectively, missing indicators for Xij and Yij. Indeed, such a decomposition systematically characterizes the individual missing processes of Xi and Yi as well as the association between these two processes. Thus, this modeling is in some sense more valid and interpretable than solely modeling . Suppose we have the vectors of covariates ηij, vij, and Γij; then , ψij could be modeled as
Notice that, in Lipsitz et al. (1991), the conditional joint probability could be calculated by
where . Therefore, for i = 1, …, n and j = 2, …, T, a typical case of the conditional observing probabilities πij(θ) defined in (2) with could be derived as below (Chen et al., 2010):
The summations and take over all possible historical values of and respectively. To obtain the estimates of the parameters θ, we apply the GEE method for the outcomes , , and with the estimating equations given below
| (19) |
where with , the covariance matrix of Sij is calculated by
By conducting the estimation based on (19), are the efficient estimates, which are required for applying the generalized information property (Pierce, 1982) in the proof of Theorem 2. Some other modeling approaches for both missing responses and covariates can be found in the literature (Schafer, 1997; van Buuren, 2007; Shardell and Miller, 2008).
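The step from two marginal probabilities and an odds ratio to a joint probability can be made concrete with the standard closed-form expression for a 2 × 2 table with fixed margins, which is the kind of formula used for odds-ratio parameterizations of paired binary variables. The sketch below is generic: the function name `joint_prob_from_odds_ratio` and the numerical inputs are hypothetical, and the conditioning on history that appears in the paper's models is omitted.

```python
import math

def joint_prob_from_odds_ratio(p1, p2, psi):
    """P(A = 1, B = 1) for binary A, B with margins p1, p2 and odds ratio psi > 0."""
    if abs(psi - 1.0) < 1e-12:
        return p1 * p2                         # odds ratio 1 means independence
    a = 1.0 + (p1 + p2) * (psi - 1.0)
    disc = a * a - 4.0 * psi * (psi - 1.0) * p1 * p2
    # The smaller root of the quadratic is the one lying in the valid range.
    return (a - math.sqrt(disc)) / (2.0 * (psi - 1.0))

# Example: both indicators equal 1 (covariate and response observed).
p_x, p_y, psi = 0.7, 0.6, 3.0
p11 = joint_prob_from_odds_ratio(p_x, p_y, psi)
p10, p01 = p_x - p11, p_y - p11
p00 = 1.0 - p11 - p10 - p01
print(p11, p11 * p00 / (p10 * p01))            # second value recovers psi
```

Setting psi to 1 recovers the independence model, which corresponds to the simplification used for the misspecified candidate models in the simulation study.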
4 |. SIMULATION STUDIES
We evaluate the finite-sample performance of our method under intermittent missingness in both responses and covariates through simulation. Here, we adopt a simpler assumption in practice that the observing probability at time j only depends on the previously observed data, a special case of the assumption (2). For binary repeated outcomes, we consider logit{μ(β)} = β0 + β1x1i + β2x2ij, where logit(x) = log{x/(1 − x)}, the time-independent covariate x1i follows Unif[−1,1], and the time-dependent covariate x2ij follows Bin(1,0.5) for i = 1, …, n and j = 1, …, 3. Note that only the time-dependent covariate x2ij is potentially missing. The parameters are set to (β0, β1, β2)T = (0.5, 0.5, −0.5)T. The within-subject correlation structure of the repeated outcomes is exchangeable with a correlation coefficient of 0.5. Based on Section 3, the true missing mechanisms of and are modeled as
for j = 2, 3. For the joint probability of and , we assume the constant odds ratio model . The parameters and (0, −2.5, 1)T with so that the overall missing probability is around 0.3 and 0.5, respectively. In practice, two missing processes and are often assumed to be independent for simplicity, ie, . As indicated in Chen et al. (2010), it may lead to biased estimates. Here, we consider the two misspecified models (m1 and m2) without accounting for the association between and ,
| (20) |
and
| (21) |
with . Let us denote the true missing data model and two misspecified models as πij(θt), πij(θm1), and πij(θm2), respectively. To evaluate the performance of the calibration introduced in Section 2, we consider the following four typical calibration situations: (a) under the correctly specified missing data model λij, (b) three models , (c) only one misspecified model , and (d) two misspecified models and , which are, respectively, corresponding to , , and to construct the calibrated weights, given . Here, we use the notation of MULTI_XXX to denote the results based on our proposal with the suffix indicating if missing data model is incorporated into modeling with 1 for yes and 0 for no (the first number index for λij, which is the true one; the second number index for ; the third number index for ). Therefore, we have MULTI_100, MULTI_111, MULTI_010, and MULTI_011 to denote the results under these four cases, respectively. A similar notation strategy is applied for WGEE_XXX.
We implement 1000 Monte Carlo replicates for each of the four combinations of sample size (n = 400, 700) and overall missing probability (around 0.3 and 0.5). For a direct comparison, we only consider the WGEE method from Chen et al. (2010). Methods based on other philosophies, such as imputation and the EM algorithm, have already been investigated and compared with WGEE in the literature (Chen et al., 2010; Chen and Zhou, 2011). Note that WGEE_100 and WGEE_010 are the estimators under the correctly specified (λij) and misspecified (λm1) models for missingness, respectively, with an independent working correlation structure. Descriptive statistics including relative bias (RB), standard deviation (SD), absolute bias (AB), and root mean square error (RMSE) are summarized for comparison.
From Table 1, we find that, in general, our proposed estimators perform better and are more stable than WGEE. When the model of missingness is misspecified, WGEE has nonnegligible bias. On the other hand, MULTI_100 and MULTI_111 are much more robust to model misspecification and even perform better than WGEE under the correct missing data model, since the calibrated propensity scores are less sensitive to extreme values from the missing data models. We also observed that when the missing rate becomes relatively large, WGEE often experiences convergence issues, so that artificial truncation should be applied. Furthermore, MULTI_010 performs better than WGEE_010, and more interestingly, MULTI_011 behaves better than MULTI_010, which supports our statement in Section 2.3. This empirically suggests that including more candidate models of missingness in our proposal might further protect the estimation even when none of them is correctly specified. It is also interesting to observe that under this complex missing mechanism with high missing probability, our proposed estimators have better efficiency than WGEE. This is expected, since the calibrated propensity scores give more credit to visits where more data are observed. Per reviewers’ suggestion, we also conducted additional comparisons with the naive GEE method; these do not change our conclusions and are not shown because of space limits.
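The four summary statistics can be computed from the Monte Carlo estimates of a single coefficient as sketched below. The precise definitions are not spelled out in the text, so the ones used here are assumptions: RB as the absolute mean error divided by the absolute true value, AB as the mean absolute error, SD as the sample standard deviation of the estimates, and RMSE as the square root of the mean squared error. The function name `summarize_replicates` is hypothetical.

```python
import numpy as np

def summarize_replicates(estimates, truth):
    """RB, SD, AB and RMSE of Monte Carlo estimates of a scalar parameter."""
    estimates = np.asarray(estimates, dtype=float)
    error = estimates - truth
    return {
        "RB": abs(error.mean()) / abs(truth),     # relative bias (assumed definition)
        "SD": estimates.std(ddof=1),              # Monte Carlo standard deviation
        "AB": np.abs(error).mean(),               # mean absolute error (assumed definition)
        "RMSE": np.sqrt(np.mean(error ** 2)),     # root mean square error
    }

# Four hypothetical replicate estimates of a coefficient whose true value is 0.5.
summary = summarize_replicates([0.4, 0.5, 0.6, 0.7], truth=0.5)
print({name: round(100 * value, 1) for name, value in summary.items()})
```

Multiplying by 100, as in the final line, matches the scaling convention stated in the table captions.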
TABLE 1.
Performance of multiple robust estimators compared with WGEE estimators under 500 Monte Carlo data (all numbers listed in the table are multiplied by 100)
| Setups | Method | β0: RB | β0: SD | β0: AB | β0: RMSE | β1: RB | β1: SD | β1: AB | β1: RMSE | β2: RB | β2: SD | β2: AB | β2: RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n = 400, m = 0.3 | WGEE_100 | 3.4 | 13.2 | 10.8 | 13.3 | 1.4 | 17.6 | 14.0 | 17.6 | 0.3 | 15.7 | 12.5 | 15.7 |
| | WGEE_010 | 13.8 | 13.0 | 11.9 | 14.7 | 2.0 | 18.0 | 14.3 | 18.1 | 1.4 | 16.2 | 12.9 | 16.3 |
| | MULTI_100 | 0.9 | 11.8 | 9.4 | 11.8 | 0.6 | 15.9 | 12.5 | 15.8 | 2.3 | 14.1 | 11.1 | 14.2 |
| | MULTI_111 | 1.2 | 11.7 | 9.4 | 11.8 | 0.8 | 15.9 | 12.6 | 15.9 | 2.3 | 14.1 | 11.0 | 14.1 |
| | MULTI_001 | 2.3 | 11.8 | 9.5 | 11.8 | 0.6 | 15.9 | 12.6 | 15.9 | 2.4 | 14.1 | 11.1 | 14.2 |
| | MULTI_011 | 1.4 | 11.8 | 9.4 | 11.8 | 0.8 | 15.9 | 12.6 | 15.9 | 2.3 | 14.1 | 11.1 | 14.1 |
| n = 700, m = 0.3 | WGEE_100 | 3.9 | 9.4 | 7.7 | 9.6 | 0.3 | 13.2 | 10.5 | 13.2 | 0.3 | 12.0 | 9.6 | 12.0 |
| | WGEE_010 | 13.7 | 9.9 | 9.8 | 12.0 | 1.3 | 13.9 | 11.3 | 13.9 | 0.4 | 12.8 | 10.2 | 12.8 |
| | MULTI_100 | 0.3 | 9.2 | 7.5 | 9.2 | 0.1 | 12.5 | 10.0 | 12.5 | 0.9 | 11.5 | 9.3 | 11.5 |
| | MULTI_111 | 0.0 | 9.1 | 7.4 | 9.1 | 0.3 | 12.4 | 9.9 | 12.4 | 0.9 | 11.5 | 9.3 | 11.5 |
| | MULTI_001 | 1.0 | 9.2 | 7.4 | 9.2 | 0.1 | 12.5 | 10.0 | 12.5 | 1.0 | 11.5 | 9.3 | 11.5 |
| | MULTI_011 | 0.2 | 9.1 | 7.4 | 9.1 | 0.3 | 12.4 | 9.9 | 12.4 | 1.0 | 11.5 | 9.3 | 11.5 |
| n = 400, m = 0.5 | WGEE_100 | 3.8 | 19.3 | 15.3 | 19.3 | 1.7 | 24.8 | 19.3 | 24.8 | 1.4 | 24.7 | 19.4 | 24.7 |
| | WGEE_010 | 27.9 | 24.1 | 22.1 | 27.9 | 2.5 | 32.3 | 25.4 | 32.3 | 2.8 | 32.5 | 26.0 | 32.6 |
| | MULTI_100 | 0.4 | 13.5 | 10.6 | 13.4 | 0.8 | 17.5 | 13.7 | 17.5 | 2.4 | 17.0 | 13.6 | 17.0 |
| | MULTI_111 | 0.5 | 13.5 | 10.6 | 13.5 | 0.7 | 17.6 | 13.8 | 17.6 | 2.5 | 17.1 | 13.6 | 17.1 |
| | MULTI_001 | 1.8 | 13.4 | 10.6 | 13.4 | 0.7 | 17.5 | 13.8 | 17.5 | 2.7 | 16.9 | 13.5 | 17.0 |
| | MULTI_011 | 0.9 | 13.5 | 10.6 | 13.5 | 0.8 | 17.5 | 13.8 | 17.5 | 2.7 | 17.0 | 13.5 | 17.0 |
| n = 700, m = 0.5 | WGEE_100 | 4.0 | 13.8 | 11.2 | 13.9 | 0.4 | 19.2 | 15.4 | 19.2 | 0.6 | 18.7 | 15.0 | 18.7 |
| | WGEE_010 | 26.3 | 17.9 | 18.0 | 22.2 | 0.4 | 24.9 | 20.0 | 24.9 | 0.1 | 25.6 | 20.2 | 25.6 |
| | MULTI_100 | 0.9 | 10.4 | 8.3 | 10.4 | 0.1 | 13.9 | 11.2 | 13.9 | 0.8 | 13.6 | 10.9 | 13.6 |
| | MULTI_111 | 0.5 | 10.4 | 8.3 | 10.4 | 0.0 | 13.9 | 11.1 | 13.9 | 0.7 | 13.7 | 10.9 | 13.7 |
| | MULTI_001 | 0.5 | 10.4 | 8.3 | 10.4 | 0.0 | 13.9 | 11.1 | 13.9 | 0.8 | 13.6 | 10.9 | 13.6 |
| | MULTI_011 | 0.2 | 10.4 | 8.3 | 10.4 | 0.1 | 13.9 | 11.2 | 13.9 | 0.8 | 13.7 | 10.9 | 13.6 |
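As a reading aid for the columns of Table 1, the following minimal Python sketch computes the four summaries from a vector of Monte Carlo estimates of a single coefficient. The definitions used here are assumptions based on the usual conventions (RB as the absolute relative bias of the Monte Carlo mean, SD as the sample standard deviation, AB as the mean absolute deviation from the true value, and RMSE as the root mean squared error); the function name `mc_summary` and the toy inputs are hypothetical and are not taken from the authors' code.

```python
import math
import statistics


def mc_summary(estimates, true_value, scale=100.0):
    """Summarize Monte Carlo estimates of one coefficient.

    Assumed definitions (not confirmed by the paper's text in this section):
      RB   = |mean(estimates) - true| / |true|
      SD   = sample standard deviation of the estimates
      AB   = mean of |estimate - true|
      RMSE = sqrt(mean of (estimate - true)^2)
    All values are multiplied by `scale` (100, as in the table caption).
    """
    mean_est = statistics.fmean(estimates)
    rb = abs(mean_est - true_value) / abs(true_value)
    sd = statistics.stdev(estimates)
    ab = statistics.fmean(abs(e - true_value) for e in estimates)
    rmse = math.sqrt(statistics.fmean((e - true_value) ** 2 for e in estimates))
    return {"RB": scale * rb, "SD": scale * sd, "AB": scale * ab, "RMSE": scale * rmse}


# Toy, hypothetical estimates of a coefficient whose true value is 0.5.
toy = mc_summary([0.4, 0.6, 0.5, 0.7], true_value=0.5)
print({name: round(value, 2) for name, value in toy.items()})
```

Under these definitions the squared RMSE is approximately the squared bias plus the squared SD, which is why the RMSE of WGEE_010 for β0 is noticeably larger than its SD while the two nearly coincide for the MULTI estimators.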
In practice, bootstrap estimates of the standard errors are recommended because the true model for the missing data is unknown. To evaluate their performance, we implement a resampling scheme with 100 bootstrap replications and obtain the bootstrap-based estimates of the standard errors. The standard error estimates and the coverage percentages of the 95% confidence intervals over 500 Monte Carlo replicates are recorded. The results for sample sizes n = 400 and 700 are summarized in Table 2; the bootstrap-based estimates closely approximate the true values, with satisfactory coverage percentages around 95%. In addition, we consider other situations where the marginal mean model may be misspecified, and satisfactory results are still achieved. Please refer to the Supplementary Material for more details.
TABLE 2.
Performance of bootstrap estimates of standard errors for multiple robust estimators over 500 Monte Carlo replicates (all numbers listed in the table are multiplied by 100)
| Setups | Method | β0: MCSE | β0: BSE | β0: CP | β1: MCSE | β1: BSE | β1: CP | β2: MCSE | β2: BSE | β2: CP |
|---|---|---|---|---|---|---|---|---|---|---|
| n = 400, m = 0.3 | MULTI_100 | 11.56 | 11.84 | 95.6 | 16.58 | 16.39 | 93.8 | 14.36 | 14.54 | 95.0 |
| | MULTI_010 | 11.49 | 11.79 | 95.8 | 16.53 | 16.38 | 93.8 | 14.43 | 14.56 | 95.2 |
| | MULTI_001 | 11.50 | 11.81 | 95.8 | 16.57 | 16.39 | 93.8 | 14.39 | 14.55 | 94.8 |
| | MULTI_111 | 11.50 | 11.81 | 95.6 | 16.49 | 16.37 | 93.8 | 14.45 | 14.57 | 95.0 |
| | MULTI_011 | 11.49 | 11.80 | 95.6 | 16.47 | 16.37 | 94.0 | 14.47 | 14.57 | 94.8 |
| n = 700, m = 0.3 | MULTI_100 | 8.85 | 8.96 | 95.4 | 12.16 | 12.35 | 94.4 | 11.13 | 10.94 | 95.2 |
| | MULTI_010 | 8.77 | 8.93 | 94.8 | 12.19 | 12.33 | 95.0 | 11.13 | 10.96 | 95.0 |
| | MULTI_001 | 8.80 | 8.94 | 94.8 | 12.17 | 12.35 | 94.8 | 11.14 | 10.96 | 94.8 |
| | MULTI_111 | 8.78 | 8.94 | 94.8 | 12.18 | 12.32 | 94.6 | 11.13 | 10.96 | 95.0 |
| | MULTI_011 | 8.76 | 8.93 | 94.6 | 12.18 | 12.32 | 94.8 | 11.13 | 10.95 | 94.8 |
| n = 400, m = 0.5 | MULTI_100 | 13.17 | 13.54 | 94.2 | 17.98 | 17.99 | 94.6 | 18.06 | 17.53 | 93.2 |
| | MULTI_010 | 13.10 | 13.48 | 94.6 | 18.00 | 18.00 | 94.8 | 18.13 | 17.56 | 93.4 |
| | MULTI_001 | 13.10 | 13.48 | 94.6 | 17.97 | 17.97 | 94.8 | 18.10 | 17.54 | 93.4 |
| | MULTI_111 | 13.13 | 13.49 | 94.6 | 18.07 | 18.02 | 94.8 | 18.19 | 17.61 | 93.2 |
| | MULTI_011 | 13.07 | 13.47 | 94.6 | 18.00 | 18.00 | 94.6 | 18.12 | 17.57 | 93.4 |
| n = 700, m = 0.5 | MULTI_100 | 10.07 | 10.28 | 95.8 | 13.30 | 13.64 | 95.8 | 13.52 | 13.24 | 93.4 |
| | MULTI_010 | 9.95 | 10.24 | 95.8 | 13.32 | 13.65 | 96.0 | 13.63 | 13.27 | 93.0 |
| | MULTI_001 | 9.98 | 10.24 | 95.6 | 13.32 | 13.62 | 95.8 | 13.60 | 13.25 | 93.6 |
| | MULTI_111 | 9.93 | 10.24 | 95.6 | 13.30 | 13.65 | 95.8 | 13.57 | 13.27 | 93.4 |
| | MULTI_011 | 9.92 | 10.23 | 95.6 | 13.32 | 13.64 | 95.2 | 13.62 | 13.26 | 93.6 |
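The bootstrap standard errors (BSE) reported in Table 2 can be illustrated with a minimal Python sketch that resamples whole subjects with replacement, so that each subject's repeated measurements stay together. This is only a sketch under stated assumptions: the subject-level resampling scheme, the function names `bootstrap_se` and `wald_interval`, the stand-in estimator `observed_mean`, and the toy data are all hypothetical, and the Wald-type interval is one common way to turn a BSE into a 95% interval rather than a construction confirmed by the text. In an actual analysis the stand-in estimator would be replaced by the multiple robust estimator.

```python
import random
import statistics


def bootstrap_se(subjects, estimator, n_boot=100, seed=0):
    """Subject-level (cluster) bootstrap standard error of `estimator`.

    `subjects` is a list with one entry per subject; whole subjects are
    resampled with replacement and the estimator is recomputed each time.
    """
    rng = random.Random(seed)
    n = len(subjects)
    replicates = []
    for _ in range(n_boot):
        resample = [subjects[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)


def wald_interval(estimate, se, z=1.96):
    """Normal-approximation 95% interval: estimate +/- z * se."""
    return estimate - z * se, estimate + z * se


def observed_mean(subjects):
    """Stand-in estimator (hypothetical): mean of all observed values."""
    values = [y for visits in subjects for y in visits if y is not None]
    return statistics.fmean(values)


# Toy data: three visits per subject, None marks an intermittently missing value.
toy_subjects = [
    [1.0, None, 2.0],
    [0.5, 1.5, None],
    [2.0, 2.5, 3.0],
    [1.0, 1.0, None],
]
estimate = observed_mean(toy_subjects)
se = bootstrap_se(toy_subjects, observed_mean, n_boot=100, seed=1)
print(estimate, se, wald_interval(estimate, se))
```

Repeating this over many simulated data sets and recording how often the interval covers the true value gives a coverage percentage of the kind reported in the CP columns.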
In addition, we evaluate our proposed information criterion for variable selection, with six models considered as candidates. To show the robustness of the proposed criterion clearly, we also apply ELCIC with the full estimating equations. In that case, ELCIC is consistent only when the missing data model is correctly specified; the corresponding results are denoted by WGEE. The selection rates over 500 Monte Carlo replicates are summarized in Table 3. We observe that misspecification of the missing data model can seriously affect variable selection via WGEE, and this poor performance does not improve much as the sample size increases, especially when the missing probability is relatively high. In contrast, our proposed criteria MULTI_100, MULTI_111, and MULTI_011 have much higher and more stable selection rates. Moreover, as the sample size increases from 400 to 700, we observe substantial improvement in the selection rates of MULTI_100, MULTI_111, and MULTI_011, which agrees with the consistency property that ELCIC selects the true model with probability tending to one.
TABLE 3.
Performance of ELCIC under multiple robust estimators and WGEE estimators with an independent correlation structure: selection rates of six candidate models across 500 Monte Carlo replicates
| Setups | Method | x1 | x1, x2, x3, x4, x5 | x2 | x1, x2 | x1, x2, x3 | x1, x2, x4, x5 |
|---|---|---|---|---|---|---|---|
| n = 400, m = 0.3 | WGEE_100 | 0.174 | 0 | 0.272 | 0.538 | 0.014 | 0.002 |
| | WGEE_010 | 0.192 | 0 | 0.298 | 0.496 | 0.01 | 0.004 |
| | MULTI_100 | 0.142 | 0 | 0.248 | 0.6 | 0.01 | 0 |
| | MULTI_111 | 0.142 | 0 | 0.254 | 0.596 | 0.008 | 0 |
| | MULTI_010 | 0.138 | 0 | 0.256 | 0.596 | 0.01 | 0 |
| | MULTI_011 | 0.14 | 0 | 0.252 | 0.6 | 0.008 | 0 |
| n = 700, m = 0.3 | WGEE_100 | 0.042 | 0 | 0.104 | 0.85 | 0.004 | 0 |
| | WGEE_010 | 0.052 | 0 | 0.118 | 0.818 | 0.008 | 0.004 |
| | MULTI_100 | 0.018 | 0 | 0.068 | 0.906 | 0.008 | 0 |
| | MULTI_111 | 0.022 | 0 | 0.07 | 0.904 | 0.004 | 0 |
| | MULTI_010 | 0.02 | 0 | 0.068 | 0.908 | 0.004 | 0 |
| | MULTI_011 | 0.02 | 0 | 0.072 | 0.904 | 0.004 | 0 |
| n = 400, m = 0.5 | WGEE_100 | 0.408 | 0 | 0.402 | 0.186 | 0 | 0.004 |
| | WGEE_010 | 0.448 | 0 | 0.424 | 0.12 | 0.004 | 0.004 |
| | MULTI_100 | 0.276 | 0 | 0.292 | 0.428 | 0.002 | 0.002 |
| | MULTI_111 | 0.276 | 0 | 0.296 | 0.422 | 0.004 | 0.002 |
| | MULTI_010 | 0.272 | 0 | 0.302 | 0.422 | 0.002 | 0.002 |
| | MULTI_011 | 0.276 | 0 | 0.296 | 0.422 | 0.004 | 0.002 |
| n = 700, m = 0.5 | WGEE_100 | 0.31 | 0.002 | 0.29 | 0.388 | 0.01 | 0 |
| | WGEE_010 | 0.396 | 0.002 | 0.348 | 0.244 | 0.01 | 0 |
| | MULTI_100 | 0.098 | 0.002 | 0.12 | 0.764 | 0.012 | 0.004 |
| | MULTI_111 | 0.096 | 0.002 | 0.122 | 0.764 | 0.012 | 0.004 |
| | MULTI_010 | 0.102 | 0.002 | 0.114 | 0.762 | 0.016 | 0.004 |
| | MULTI_011 | 0.1 | 0.002 | 0.122 | 0.756 | 0.016 | 0.004 |
Notes. The model containing the covariates {x1, x2} is the true model.
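The selection rates in Table 3 are tallies of which candidate model the criterion picks in each Monte Carlo replicate. A minimal Python sketch of that tally is given below. It assumes that the candidate with the smallest criterion value is the one selected, and every criterion value in the example is hypothetical; the function names `select_model` and `selection_rates` are likewise illustrative only.

```python
from collections import Counter


def select_model(criterion_by_model):
    """Return the candidate with the smallest criterion value (assumed rule)."""
    return min(criterion_by_model, key=criterion_by_model.get)


def selection_rates(replicates):
    """Proportion of replicates in which each candidate model is selected.

    `replicates` is a list of dicts mapping a candidate-model label to its
    criterion value in one Monte Carlo replicate.
    """
    counts = Counter(select_model(rep) for rep in replicates)
    total = len(replicates)
    return {model: counts.get(model, 0) / total for model in replicates[0]}


# Hypothetical criterion values for three candidates over four replicates.
toy_replicates = [
    {"x1": 41.2, "x1, x2": 35.0, "x1, x2, x3": 39.8},
    {"x1": 36.1, "x1, x2": 37.4, "x1, x2, x3": 42.0},
    {"x1": 44.9, "x1, x2": 33.3, "x1, x2, x3": 38.1},
    {"x1": 40.0, "x1, x2": 34.7, "x1, x2, x3": 36.2},
]
print(selection_rates(toy_replicates))
```

Each row of Table 3 corresponds to one such dictionary of rates, computed over 500 replicates for a given setup and method.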
5 |. DATA APPLICATION
We illustrate our proposed method using the HERS. One particular clinical interest is to investigate potential risk factors for HPV infection, which is a crucial risk factor for cervical cancer. The literature has shown that HPV and HIV are both viruses that cause sexually transmitted infections, leading to different conditions, and that people with HIV are more susceptible to HPV than others. A variety of studies have evaluated CD4+ counts, an important marker of disease progression and treatment effects, among women infected with HIV because HIV directly attacks this lymphocyte; however, how this longitudinal marker is associated with HPV (1 = yes, 0 = no) is still unclear owing to differences in symptoms, outlook, and treatment (Bang and Robins, 2005). We select the first three visits from the HERS, where the potential risk factors include time-dependent CD4+ counts (CD4, log-transformed), visit time (VST, coded 1, 2, 3), and baseline variables such as age (AGE, years), race (RACE, 1 = Black, 0 = Non-Black), ever injected drugs (DRUG, 1 = yes, 0 = no), education level (EDU, 1 = 9th grade or higher, 0 = others), ever smoked (SMK, 1 = yes, 0 = no), sexual behavior within the last 6 months (SEX, 1 = yes, 0 = no), and viral load (VIL, 1 = larger than 5000, 0 = others). Of note, the selected data exhibit an intermittent missing pattern for both HPV and CD4+ counts, and the data are missing for a variety of reasons (eg, refusal to report, failure to record) (Smith et al., 1997). We exclude patients with missing responses or covariates at baseline, which leaves about 19% missingness in either HPV or CD4+ counts at the second visit and 17% at the third visit. A visualization of the missing data patterns from eight representative samples in the HERS data is provided in Figure 1. After data cleaning and preprocessing, 667 subjects remain for the final analysis.
FIGURE 1.

Visualization of missing data patterns from eight representative samples in the HERS data. This figure appears in color in the electronic version of this paper, and any mention of color refers to that version
For the marginal mean model of HPV as a binary outcome, we consider CD4, VST, RACE, AGE, DRUG, EDU, SMK, and SEX as potential risk factors and implement variable selection among six candidate models using the ELCIC described in Section 2.4. The robust estimates proposed in Section 2.2 are reported for each candidate model, with bootstrap standard errors based on 200 bootstrap replicates. To model the missing data and derive the calibrated weights, we consider two candidate models. One model is based on the method illustrated in Section 3, where the two conditional observing probabilities for HPV and CD4+ counts, respectively, are modeled by
where RHPV and RCD4 are missing indicators for HPV and CD4+ counts, respectively. Besides, we assume the constant odds ratio model log(υij) = θ to characterize the association between two missing mechanisms. The other model is to assume the conditional independence between two missing processes of HPV and CD4+ counts given the covariates:
where R* = RHPV × RCD4. Note that the first missing data model aims to jointly model the observing probabilities for outcomes and covariates, which offers appealing interpretability of the complex missing mechanism, while the second is simpler under the conditional independence assumption and thus easier to implement computationally. In practice, we do not know the true underlying models for the missing data; however, the advantage of our proposed method is that it allows multiple candidate models for the missing data to be incorporated to maintain robust inference.
The model-fitting results obtained by using both models mentioned above for the calibrated weights are summarized in Table 4. The model with CD4+ counts and drug injection at baseline is selected by ELCIC, and these variables match the significant ones in the largest candidate model (Model 2). Compared with Model 4, Model 5, which additionally includes baseline age, has a comparable ELCIC with a negligible difference. Thus, we recommend adding this variable to the regression if the effect of baseline age on HPV is also of clinical interest. Also, all risk factors are significant in Models 4 and 5 based on the proposed multiple robust estimation, though the age effect is not significant (p-value = .07) in the largest candidate model (Model 2). Our findings agree with the existing literature (Robbins et al., 2015); for instance, the odds of an HPV-positive diagnosis were increased among HIV-positive individuals with a low CD4+ count. Additional results on model comparison and further evaluation of our proposed approach are provided in the Supplementary Material. In particular, for Models 2 and 4 fitted to all the data, we perform comparisons using only the first missing data model to calculate the propensity scores and also using WGEE; the significance results remain the same, though there are slight differences in the parameter estimates and standard errors. To further illustrate the benefit and robustness of our proposal, we perform a subset analysis by selecting the subjects with at least one missing observation in either HPV or CD4+ counts. Note that this subpopulation is comparable to the original full data in terms of population characteristics, with a sample size of 184 and intermittent missing rates of 66% and 60% at the second and third visits, respectively. Our proposal successfully captures the significant effect of CD4+ counts, an important risk factor, while WGEE fails to detect it.
TABLE 4.
Analysis of the HERS based on six candidate models
| Variables | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| Intercept | 3.015 (0.598)* | 2.863 (0.847)* | 2.903 (0.6)* | 3.043 (0.565) * | 3.639 (0.677)* | 3.33 (0.705)* |
| VST | 0.068 (0.045) | 0.062 (0.047) | 0.067 (0.045) | |||
| CD4 | −0.436 (0.097)* | −0.485 (0.108)* | −0.444 (0.1)* | −0.449 (0.095) * | −0.479 (0.102)* | −0.485 (0.104)* |
| AGE | −0.021 (0.012) | −0.025 (0.011)* | −0.019 (0.012) | |||
| RACE (=1) | 0.267 (0.158) | 0.265 (0.153) | ||||
| DRUG (=1) | 0.380 (0.162)* | 0.307 (0.148) * | 0.406 (0.156)* | 0.415 (0.155)* | ||
| EDU (=1) | 0.092 (0.275) | |||||
| SMK (=1) | 0.198 (0.234) | |||||
| SEX (=1) | 0.289 (0.181) | 0.305 (0.18) | ||||
| ELCIC | 38.07 | 58.53 | 40.55 | 35.03 | 35.66 | 39.09 |
Notes. Summary results include multiple robust estimates with standard error estimates in parentheses and the values of ELCIC for variable selection in the marginal mean structure (* denotes the P-value < .05).
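To make the CD4 row of Table 4 easier to interpret, the following Python sketch converts each coefficient into an odds ratio with a Wald-type 95% interval and identifies the candidate model with the smallest ELCIC. The coefficients, standard errors, and ELCIC values are copied from Table 4; the logit link for the marginal mean model, the normal-approximation interval, and the rule that the smallest ELCIC is preferred are assumptions made for this illustration, and the function name `odds_ratio_ci` is hypothetical.

```python
import math

# CD4 coefficient and bootstrap standard error per candidate model (Table 4).
cd4 = {
    "Model 1": (-0.436, 0.097),
    "Model 2": (-0.485, 0.108),
    "Model 3": (-0.444, 0.100),
    "Model 4": (-0.449, 0.095),
    "Model 5": (-0.479, 0.102),
    "Model 6": (-0.485, 0.104),
}

# ELCIC value per candidate model (Table 4).
elcic = {
    "Model 1": 38.07,
    "Model 2": 58.53,
    "Model 3": 40.55,
    "Model 4": 35.03,
    "Model 5": 35.66,
    "Model 6": 39.09,
}


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio exp(beta) with a Wald-type interval, assuming a logit link."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


for model, (beta, se) in cd4.items():
    odds_ratio, lower, upper = odds_ratio_ci(beta, se)
    print(f"{model}: OR = {odds_ratio:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")

# Candidate with the smallest ELCIC (assumed selection rule).
best = min(elcic, key=elcic.get)
print("Smallest ELCIC:", best)
```

Under these assumptions, the Model 4 estimate corresponds to an odds ratio of about 0.64 per unit increase in log-transformed CD4+ count, with an interval that stays below 1, in line with the reported association between low CD4+ counts and higher odds of HPV.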
6 |. DISCUSSION
Multiple robustness to the missing data model is more crucial in the longitudinal setting than in the cross-sectional case, because more complicated missing mechanisms and dynamic patterns are involved; this leads to a broader pool of candidate models for the missing data and, in turn, a variety of estimation approaches for the selected model. In this paper, we present a promising propensity score approach for longitudinal data analysis under MAR, in which the calibrated propensity scores are multiple robust to misspecification of the missing data model and more stable than IPW. Under this framework, we can conduct valid estimation and consistent variable selection in widely encountered situations, such as intermittent missing data in both responses and covariates. Note that dropout missingness, a special missing data pattern, is not directly incorporated in this framework and requires additional technical adjustments. In particular, the overall mean μπj(θ) cannot be estimated by Equation (10), and a subtle modification is needed, which will be pursued in future work. For the consistency of variable selection, we assume that the true model is within the pool of candidate models. Thus, it would be of interest to investigate the situation where no candidate model is correctly specified. Furthermore, we currently consider only one covariate with missing data, but this work can be generalized to settings with multiple covariates that are potentially missing. The modeling strategy is similar but carries a higher computational burden, which calls for further investigation into efficient computing algorithms. Multiple robust estimation under the missing not at random assumption is also of interest for future work. In addition, we provide further discussion and extensions (e.g., efficiency improvement, dropout missingness, and other options for gij(θ)) in the Supplementary Material.
ACKNOWLEDGMENTS
Wang’s research was partially supported by Grants UL1 TR002014 and KL2 TR002015 from the National Center for Advancing Translational Sciences (NCATS) and was also funded, in part, under a grant from the Pennsylvania Department of Health using Tobacco CURE Funds. Research of Liu was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD). Data from the HERS were collected under from the U.S. Centers for Disease Control and Prevention, and the authors especially thank the HERS participants and the HERS Research Group. The content is solely the responsibility of the authors and does not represent the official views of the National Institutes of Health, the U.S. Centers for Disease Control and Prevention, or other affiliated institutes, and the Department specifically disclaims responsibility for any analyses, interpretations, or conclusions.
Funding information
Pennsylvania Department of Health; National Center for Advancing Translational Sciences, Grant/Award Numbers: UL1 TR002014, KL2 TR002015; Eunice Kennedy Shriver National Institute of Child Health and Human Development
Footnotes
SUPPORTING INFORMATION
Web Appendices of proofs and tables for the simulation studies and data application referenced in Sections 4 and 5 are available with this paper at the Biometrics website on Wiley Online Library. The R program code for the simulation studies is also available in this Supporting Information section.
DATA AVAILABILITY STATEMENT
The data that support the findings in this paper are available from the U.S. Centers for Disease Control and Prevention (CDC) https://aspe.hhs.gov/report/inventory-federally-sponsored-hiv-and-hiv-relevant-databases/database-hiv-epidemiologic-research-study-hers. Restrictions apply to the availability of the data, and approval for data usage is needed by the HERS Executive Committee. For more information, contact: Alex Ewing (yhy4@cdc.gov) from Surveillance Branch, Division of HIV/AIDS Prevention, Surveillance, and Epidemiology, CDC.
REFERENCES
- Bang H and Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61, 962–973.
- van Buuren S (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16, 219–242.
- Chen B, Yi GY and Cook RJ (2010). Weighted generalized estimating functions for longitudinal response and covariate data that are missing at random. Journal of the American Statistical Association, 105, 336–353.
- Chen B and Zhou X (2011). Doubly robust estimates for binary longitudinal data analysis with missing response and missing covariates. Biometrics, 67, 830–842.
- Chen C, Shen B, Zhang L, Xue Y and Wang M (2019). Empirical-likelihood-based criteria for model selection on marginal analysis of longitudinal data with dropout missingness. Biometrics, 75, 950–965.
- Chen C, Wang M, Wu R and Li R (2020). A general information criterion for model selection based on empirical likelihood. arXiv:2006.13281.
- Chen S and Haziza D (2017). Multiply robust imputation procedures for the treatment of item nonresponse in surveys. Biometrika, 104, 439–453.
- Deville JC and Särndal CE (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–382.
- Gosho M (2016). Model selection in the weighted generalized estimating equations for longitudinal data with dropout. Biometrical Journal, 58, 570–587.
- Grady D, Applegate W, Bush T, Furberg C, Riggs B and Hulley S (1998). Heart and estrogen/progestin replacement study (HERS): design, methods, and baseline characteristics. Controlled Clinical Trials, 19, 314–335.
- Han P (2014). Multiply robust estimation in regression analysis with missing data. Journal of the American Statistical Association, 109, 1159–1173.
- Han P (2016). Intrinsic efficiency and multiple robustness in longitudinal studies with drop-out. Biometrika, 103, 683–700.
- Han P and Wang L (2013). Estimation with missing data: beyond double robustness. Biometrika, 100, 417–430.
- Liang K-Y and Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
- Lin H, Fu B, Qin G and Zhu Z (2017). Doubly robust estimation of generalized partial linear models for longitudinal data with dropouts. Biometrics, 73, 1132–1139.
- Lin T-I, Lachos VH and Wang W-L (2018). Multivariate longitudinal data analysis with censored and intermittent missing responses. Statistics in Medicine, 37, 2822–2835.
- Lipsitz SR, Laird NM and Harrington DP (1991). Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika, 78, 153–160.
- Owen AB (2001). Empirical Likelihood. Boca Raton, FL: Chapman and Hall/CRC.
- Pierce DA (1982). The asymptotic effect of substituting estimators for parameters in certain types of statistics. Annals of Statistics, 10, 475–478.
- Preisser JS, Lohman KK and Rathouz PJ (2002). Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. Statistics in Medicine, 21, 3035–3054.
- Qin J and Lawless J (1994). Empirical likelihood and general estimating equations. Annals of Statistics, 22, 300–325.
- Robbins HA, Fennell CE, Gillison M, Xiao W, Guo Y, Wentz A, et al. (2015). Prevalence of and risk factors for oral human papillomavirus infection among HIV-positive and HIV-negative people who inject drugs. PLoS One, 10, e0143698.
- Robins JM and Rotnitzky A (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90, 122–129.
- Robins JM, Rotnitzky A and Zhao LP (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, 106–121.
- Robins JM and Wang N (2000). Inference for imputation estimators. Biometrika, 87, 113–124.
- Schafer JL (1997). Analysis of Incomplete Multivariate Data. New York: Chapman and Hall.
- Seaman S and Copas A (2009). Doubly robust generalized estimating equations for longitudinal data. Statistics in Medicine, 28, 937–955.
- Shardell M and Miller R (2008). Weighted estimating equations for longitudinal studies with death and non-monotone missing time-dependent covariates and outcomes. Statistics in Medicine, 27, 1008–1025.
- Shen C-W and Chen Y-H (2012). Model selection for generalized estimating equations accommodating dropout missingness. Biometrics, 68, 1046–1054.
- Smith DK, Warren DL, Vlahov D, Schuman P, Stein MD, Greenberg BL, et al. (1997). Design and baseline participant characteristics of the human immunodeficiency virus epidemiology research (HER) study: a prospective cohort study of human immunodeficiency virus infection in US women. American Journal of Epidemiology, 146, 459–469.
