ABSTRACT
Analyses of cluster randomized trials (CRTs) can be complicated by informative missing outcome data. Methods such as inverse probability weighted generalized estimating equations have been proposed to account for informative missingness by weighting the observed individual outcome data in each cluster. These existing methods have focused on settings where missingness occurs at the individual level and each cluster has partially or fully observed individual outcomes. When entire clusters are missing, for example, when all outcomes from a cluster are missing because the cluster dropped out, these approaches ignore the cluster-level missingness and can lead to biased inference if that missingness is informative. Informative missingness at multiple levels can also occur in CRTs with a multi-level structure where study participants are nested in subclusters such as healthcare providers, and the subclusters are nested in clusters such as clinics. In this paper, we propose new estimators of the marginal treatment effect in CRTs that account for missing outcome data at multiple levels, based on weighted generalized estimating equations. We show that the proposed multi-level multiply robust estimator is consistent and asymptotically normally distributed provided that one of the multiple propensity score models postulated at each clustering level is correctly specified. We evaluate the performance of the proposed method through extensive simulations and illustrate its use with a CRT evaluating a malaria risk-reduction intervention in rural Madagascar.
Keywords: cluster randomized trials, expectation-maximization (EM) algorithm, generalized estimating equation (GEE), inverse probability weighting (IPW), multi-level missing data, multiply robust, propensity score
1. INTRODUCTION
Cluster randomized trials (CRTs), with groups of individuals as randomization units, are commonly used in biomedical research for intervention evaluation (Hayes and Moulton, 2017). Because outcomes from individuals within the same cluster are likely to be correlated, analysis of CRTs must account for this within-cluster dependence. The generalized estimating equation (GEE) approach has often been adopted to estimate the marginal treatment effect in CRTs (Liang and Zeger, 1986). Compared with mixed effects models, GEE targets the population-level marginal effect parameter and requires fewer parametric assumptions on the outcome distribution (Hubbard et al., 2010). It yields valid inference provided that the mean model is correctly specified, and it is robust to misspecification of the correlation structure. However, in the presence of informative missing outcome data, the GEE estimator based only on the observed outcomes may be biased (Hossain et al., 2017a, b).
Here, we consider the setting where outcome missingness is independent of unobserved and observed outcomes, conditional on baseline covariates and exposure. This missingness mechanism has been termed “covariate-dependent missingness (CDM)” (Hossain et al., 2017a, b) or “restricted missing at random” (Prague et al., 2016). Two common approaches to handling data under CDM are multilevel multiple imputation (Schafer and Yucel, 2002; Diaz-Ordaz et al., 2016; Hossain et al., 2017a, b) and inverse probability weighting (IPW) (Robins et al., 1995). We adopt the IPW framework, which avoids the need to correctly specify the joint distribution of the clustered outcomes.
Existing IPW methods for missing outcomes in CRTs handle the setting where missingness occurs at the individual level and each cluster has partially or fully observed individual outcomes (Prague et al., 2016). When entire clusters are missing, for example, when all outcomes from a cluster are missing because the cluster dropped out, these approaches ignore the cluster-level missingness and can lead to biased inference if that missingness is informative (Giraudeau and Ravaud, 2009).
Missing clusters are not uncommon in CRTs. A systematic review of CRTs reported that 31% of 86 trials had missing clusters (Fiero et al., 2016). Multi-level missingness can also occur in CRTs with a multi-level structure. For example, in a study evaluating whether proactive community case management (pro-CCM) is effective in reducing malaria burden in a rural endemic area of Madagascar, 22 fokontanies (smallest administrative units) were randomized to pro-CCM or conventional integrated community case management (iCCM) (Ratovoson et al., 2022). The study participants were nested in households, which were nested in each fokontany. About 24% of the study participants and 22% of the households were lost to follow-up due to moving away, absence, death, or refusal to participate.
In this paper, we develop new estimators for estimating the marginal treatment effect in CRTs with multi-level missing outcomes based on the GEE framework. We derive the multi-level weights to account for informative missingness and incorporate a multiply robust estimation approach (Han, 2014), which allows analysts to specify multiple propensity score (PS) models at each of the clustering levels. We note that the term “multiply robust” has been used in other contexts (Tchetgen Tchetgen and Shpitser, 2012). The proposed multi-level multiply robust GEE (MMR-GEE) estimator enjoys the multiple robustness property in the sense that the target parameter will be consistently estimated as long as one of the multiple PS models postulated at each clustering level is correctly specified.
The remainder of the article is organized as follows. Sections 2.1 and 2.2 introduce the notation, setting, and assumptions. Sections 2.3 and 2.4 present the proposed MMR-GEE estimator. Section 2.5 establishes the theoretical properties of the MMR-GEE estimator. Section 2.6 addresses the misclassification issue of the observed missingness indicator at the cluster level. For notational simplicity, in Sections 2.1-2.6 we anchor our presentation around 2-level CRTs, where informative missingness may occur at both the cluster and the individual level. In Section 2.7, we present an extension to 3-level CRTs where informative missingness can occur at both the subcluster and the individual level. In Section 3, we illustrate the use of the proposed methods with the Pro-CCM study (Ratovoson et al., 2022). Results from extensive simulations based on the Pro-CCM study are reported in Section 4. The paper concludes with practical considerations and discussion in Section 5.
2. METHODS
2.1. Notation and models
We consider a 2-arm parallel CRT with $M$ clusters, where study participants are followed over time with the outcome $Y_{ij}$, a vector of $p$ cluster-level baseline covariates $X_i$, and a vector of $q$ individual-level baseline covariates $Z_{ij}$ for participant $j$ ($j = 1, \ldots, n_i$) in cluster $i$ ($i = 1, \ldots, M$). Let $A_i$ be the binary treatment indicator for cluster $i$ (treated $A_i = 1$ and control $A_i = 0$); the treatment assignment probability is known and given by $\pi_A = P(A_i = 1)$. The vector of cluster-level covariates $X_i$ and matrix of individual-level covariates $Z_i = (Z_{i1}, \ldots, Z_{in_i})^T$ are assumed to be fully observed before randomization. Note that $X_i$ can include cluster-specific information (e.g., cluster location or size) as well as summary statistics of individual-level covariates (e.g., average age or the proportion of male/female within a cluster). Here, we use $R_i = (R_{i1}, \ldots, R_{in_i})^T$ to denote the vector of individual-level missingness indicators and $C_i$ to denote the cluster-level missingness indicator for outcomes $Y_i = (Y_{i1}, \ldots, Y_{in_i})^T$. $R_{ij} = 1$ when $Y_{ij}$ is observed and $R_{ij} = 0$ when $Y_{ij}$ is missing. $C_i = 0$ if the cluster $i$ drops out of the study so that no individual outcomes in that cluster can be observed, and $C_i = 1$ otherwise.
Let $s = \sum_{i=1}^{M} C_i$ be the number of observed clusters, and $m_i = \sum_{j=1}^{n_i} R_{ij}$ be the number of observed outcomes in cluster $i$. Without loss of generality, let $i = 1, \ldots, s$ be the indexes for observed clusters and $i = s+1, \ldots, M$ be the indexes for missing clusters; for participants in observed cluster $i$, let $j = 1, \ldots, m_i$ be the indexes of participants whose outcomes are observed and $j = m_i + 1, \ldots, n_i$ be the indexes of participants whose outcomes are missing. See Table 1 for an illustration of the data structure under the multi-level missingness setting.
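TABLE 1.
Data structure of multi-level missingness in cluster randomized trials, including outcome $Y_{ij}$, cluster-level missingness indicator $C_i$ ($C_i = 1$ if the cluster is observed and $C_i = 0$ if the cluster is missing), and individual-level missingness indicator $R_{ij}$ ($R_{ij} = 1$ if $Y_{ij}$ is observed and $R_{ij} = 0$ if $Y_{ij}$ is missing).
| Cluster | Unit | $Y_{ij}$ | $R_{ij}$ | $C_i$ |
|---|---|---|---|---|
| 1 | 1 | $Y_{1,1}$ | 1 | 1 |
| 1 | ⋮ | ⋮ | 1 | 1 |
| 1 | $m_1$ | $Y_{1,m_1}$ | 1 | 1 |
| 1 | $m_1+1$ | $Y_{1,m_1+1}$ | 0 | 1 |
| 1 | ⋮ | ⋮ | 0 | 1 |
| 1 | $n_1$ | $Y_{1,n_1}$ | 0 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | 1 |
| s | 1 | $Y_{s,1}$ | 1 | 1 |
| s | ⋮ | ⋮ | 1 | 1 |
| s | $m_s$ | $Y_{s,m_s}$ | 1 | 1 |
| s | $m_s+1$ | $Y_{s,m_s+1}$ | 0 | 1 |
| s | ⋮ | ⋮ | 0 | 1 |
| s | $n_s$ | $Y_{s,n_s}$ | 0 | 1 |
| s+1 | 1 | $Y_{s+1,1}$ | 0 | 0 |
| s+1 | ⋮ | ⋮ | 0 | 0 |
| s+1 | $n_{s+1}$ | $Y_{s+1,n_{s+1}}$ | 0 | 0 |
| ⋮ | ⋮ | ⋮ | 0 | 0 |
| M | 1 | $Y_{M,1}$ | 0 | 0 |
| M | ⋮ | ⋮ | 0 | 0 |
| M | $n_M$ | $Y_{M,n_M}$ | 0 | 0 |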
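Our primary interest lies in the estimation of and inference about the parameters in the marginal mean model $g(\mu_{ij}) = \beta_0 + \beta_A A_i$ with link function $g$, where $\mu_{ij} = E(Y_{ij} \mid A_i)$ and $\beta_A$ targets the marginal treatment effect. When there is no missing data, an estimator of $\beta = (\beta_0, \beta_A)^T$ can be obtained by solving the following estimating equation (Liang and Zeger, 1986):

$$\sum_{i=1}^{M} D_i^T V_i^{-1} \{Y_i - \mu_i(\beta)\} = 0. \tag{1}$$

$D_i = \partial \mu_i(\beta) / \partial \beta^T$ is the design matrix with $\mu_i(\beta) = (\mu_{i1}, \ldots, \mu_{in_i})^T$. $V_i = U_i^{1/2} \Gamma_i(\rho) U_i^{1/2}$ is the covariance matrix with $U_i = \mathrm{diag}\{\mathrm{var}(Y_{ij})\}$. $\Gamma_i(\rho)$ is the working correlation matrix indexed by non-diagonal elements $\rho$.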
When outcomes are missing under CDM, fitting Model (1) with complete data could lead to biased inference for $\beta$ (Prague et al., 2016). Provided that all clusters are observed, that is, $C_i = 1$ for all $i$, one can attempt to correct the bias through IPW-GEE (Robins et al., 1995):

$$\sum_{i=1}^{M} D_i^T V_i^{-1} W_i \{Y_i - \mu_i(\beta)\} = 0. \tag{2}$$

Model (2) recovers population moments by reweighting the complete data according to the weighting matrix $W_i = \mathrm{diag}(R_{ij} / \pi_{ij})$, where $R_{ij} = 1$ if $Y_{ij}$ is observed and 0 otherwise. The conditional probability that $Y_{ij}$ is observed, also called the PS, is denoted by $\pi_{ij} = P(R_{ij} = 1 \mid A_i, X_i, Z_i)$. In practice, the true PS is unknown. One can postulate a logistic regression model that regresses the missingness indicator on the treatment indicator and baseline covariates. A consistent and asymptotically normal estimator of $\beta$ can be obtained when the PS model is correctly specified for the CDM missingness mechanism.
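As a rough illustration of how inverse probability weights enter the estimating equation, the following Python sketch estimates the PS by logistic regression and then solves the weighted equation. It rests on two simplifying assumptions that are ours rather than the paper's: an independence working correlation, and a logit link with only an intercept and the treatment indicator in the mean model. Under those assumptions the weighted GEE reduces to a weighted logistic score equation. All names (`fit_logistic`, `ipw_gee_independence`, `treat`, `observed`, `covariates`) are hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(design, y, weights=None, n_iter=50, tol=1e-10):
    """Newton-Raphson for the weighted logistic score equation
    sum_i w_i * x_i * (y_i - expit(x_i' b)) = 0."""
    design = np.asarray(design, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    coef = np.zeros(design.shape[1])
    for _ in range(n_iter):
        mu = expit(design @ coef)
        score = design.T @ (w * (y - mu))
        info = (design * (w * mu * (1.0 - mu))[:, None]).T @ design
        step = np.linalg.solve(info, score)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef

def ipw_gee_independence(y, treat, observed, covariates):
    """Individual-level IPW-GEE under an independence working correlation.

    y          : outcomes (entries with observed == 0 are ignored)
    treat      : cluster treatment indicator, repeated for each individual
    observed   : 1 if the outcome is observed, 0 otherwise
    covariates : baseline covariates used in the PS model
    """
    observed = np.asarray(observed, dtype=float)
    ones = np.ones(len(observed))
    # PS model: logistic regression of the observation indicator
    # on treatment and baseline covariates.
    ps_design = np.column_stack([ones, treat, covariates])
    ps_hat = expit(ps_design @ fit_logistic(ps_design, observed))
    weights = observed / ps_hat              # zero for missing outcomes
    y_filled = np.where(observed == 1, y, 0.0)
    # Marginal mean model: logit(mu) = b0 + bA * treat.
    beta = fit_logistic(np.column_stack([ones, treat]), y_filled, weights)
    return beta, weights
```

The second element of `beta` is the estimated marginal log odds ratio for treatment. With a non-independence working correlation the weighted equation no longer collapses to this simple form, so the sketch should not be read as a general GEE solver.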
2.2. Multi-level missingness processes and assumptions
We consider the following multi-level missingness processes: Clusters drop out or withdraw from the study after randomization and before outcome data collection, where this cluster-level missingness is induced by the model
with parameters
. For clusters that remain throughout the study, the outcomes of individual participants may be missing, where this individual-level missingness is induced by another model
with parameters
. In this setting, the IPW-GEE method ignores the cluster-level missingness process and may lead to biased estimates of
. Throughout the paper, we make the following assumptions:
- Non-informative cluster size: Outcomes do not depend on cluster sizes.
- Multi-level CDM: The multi-level missingness processes depend on neither observed nor missing outcomes, conditional on baseline covariates and treatment.
- Positivity: Conditional on baseline covariates and treatment, the probability that a cluster remains in the study and the probability that an individual outcome in a remaining cluster is observed are both bounded away from zero.
2.3. Multi-level IPW-GEE
We adapt weighting methods from the longitudinal drop-out setting (Robins et al., 1995; Mitani et al., 2022) to estimate
under the multi-level CDM setting. The conditional probability of observing
can be expressed as
![]() |
(3) |
where
corresponds to the cluster-level missingness process induced by
and
corresponds to the individual-level missingness process induced by
. By modifying the weighting matrix of (2), we propose the multi-level IPW-GEE (MIPW-GEE) estimator as follows:
![]() |
(4) |
The consistency of the MIPW-GEE estimator requires correct specification of both PS models: that is,
and
for some
and
.
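The multi-level weight described above can be sketched in a few lines of Python: an outcome receives a nonzero weight only if its cluster remained in the study and the outcome itself was observed, and the weight is the inverse of the product of the two fitted probabilities. The function and argument names are hypothetical, and the fitted probabilities are assumed to come from PS models estimated elsewhere.

```python
import numpy as np

def multilevel_ipw_weights(cluster_observed, indiv_observed, p_cluster, p_indiv):
    """Weights for a multi-level IPW estimating equation.

    All inputs are individual-level arrays (cluster-level values are
    repeated for every member of the cluster):
      cluster_observed : 1 if the individual's cluster remained in the study
      indiv_observed   : 1 if the individual's outcome was observed
      p_cluster        : fitted probability that the cluster remains
      p_indiv          : fitted probability that the outcome is observed,
                         given that the cluster remains
    """
    seen = np.asarray(cluster_observed) * np.asarray(indiv_observed)
    joint = np.asarray(p_cluster, dtype=float) * np.asarray(p_indiv, dtype=float)
    if np.any(joint[seen == 1] <= 0):
        raise ValueError("positivity violated: zero probability for an observed outcome")
    weights = np.zeros(len(seen), dtype=float)
    weights[seen == 1] = 1.0 / joint[seen == 1]
    return weights
```

For example, an observed outcome in a cluster with a retention probability of 0.8 and an individual-level observation probability of 0.25 receives weight 1 / (0.8 × 0.25) = 5, whereas individual-level weighting alone would assign it 4.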
2.4. Multi-level multiply robust GEE
To protect against misspecification of the PS models, we propose a multiply robust estimator of
, denoted by
, based on the empirical likelihood theory (Owen, 2001; Han, 2014). The proposed MMR-GEE estimator allows analysts to specify multiple sets of PS models, and
will be a consistent estimator for
provided that one of the cluster-level and one of the individual-level PS models are correctly specified.
The MMR-GEE estimator can be obtained by solving estimating Equation (2) with weights replaced by the multiply robust weights:
![]() |
(5) |
We derive the multiply robust weights by extending the method from the independent data setting in Han and Wang (2013) to the clustered data setting where informative missingness can occur at multiple levels. Let
denote the set of
postulated individual-level PS models for
and
denote the set of
postulated cluster-level PS models for
, where
and
are vectors of parameters for the
th and
th models. Let
and
be the estimators for
and
, respectively. Now define
. The multiply robust weights for individuals with observed outcomes
,
can be obtained from solving a constrained optimization problem:
![]() |
(6) |
subject to the following constraints:
![]() |
The first constraint requires that the weights are non-negative. The second constraint requires that the weights sum to 1. The third constraint calibrates the weights so that the weighted average of each postulated model over the observed (biased) sample matches its average over the full sample. For
,
, now define
,
, and
![]() |
The constrained optimization problem (6) can be solved through the Lagrange multiplier technique, which yields
![]() |
(7) |
where
is a
vector by solving the following equation:
![]() |
(8) |
The detailed derivation can be found in Web Appendix A. Equation (8) may have multiple roots. We apply the convex minimization approach of Han (2014) to obtain
.
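To make the convex minimization step concrete, the sketch below follows the general recipe of Han (2014) for empirical-likelihood weights: center each postulated model's fitted probabilities at their full-sample mean, minimize the convex function $-\sum_i \log(1 + \rho^T g_i)$ over the observed units, and set each weight proportional to $1 / (1 + \hat{\rho}^T g_i)$. How the columns of `ps_observed` are built from the cluster-level and individual-level PS models is left open here and should follow the definitions in this section; the function name, argument names, and the damped Newton solver are our own assumptions.

```python
import numpy as np

def multiply_robust_weights(ps_observed, ps_full_mean, n_iter=100, tol=1e-10):
    """Empirical-likelihood weights for the observed units.

    ps_observed  : (m, J) fitted values of J postulated models, observed units
    ps_full_mean : (J,) average of each model's fitted values over the full sample
    Returns (weights, rho); weights are positive, sum to 1, and satisfy
    sum_i weights_i * g_i = 0 with g_i = ps_observed_i - ps_full_mean.
    """
    g = np.asarray(ps_observed, dtype=float) - np.asarray(ps_full_mean, dtype=float)
    m, n_models = g.shape
    rho = np.zeros(n_models)

    def objective(r):
        denom = 1.0 + g @ r
        if np.any(denom <= 0):          # outside the convex domain
            return np.inf
        return -np.sum(np.log(denom))

    for _ in range(n_iter):
        denom = 1.0 + g @ rho
        scaled = g / denom[:, None]
        grad = -scaled.sum(axis=0)
        hess = scaled.T @ scaled        # positive semi-definite
        step = np.linalg.solve(hess, -grad)
        # Backtracking keeps the iterate inside the domain and ensures descent.
        t, f0 = 1.0, objective(rho)
        while objective(rho + t * step) > f0 + 1e-4 * t * (grad @ step):
            t *= 0.5
            if t < 1e-12:
                break
        rho = rho + t * step
        if np.max(np.abs(t * step)) < tol:
            break

    weights = 1.0 / (m * (1.0 + g @ rho))
    return weights, rho
```

Because the objective is convex on its domain, the minimizer is unique whenever it exists, which sidesteps the multiple-root issue of solving the estimating equation for the Lagrange multiplier directly.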
2.5. Consistency and asymptotic normality of MMR-GEE
Below, we first demonstrate that the MMR-GEE estimator has the multiply robust property, that is,
is a consistent estimator for
when both
and
contain a correctly specified PS model. We then establish the asymptotic distribution of
. We consider the case where the number of clusters grows to infinity and the cluster sizes are bounded above. For notational simplicity, all clusters have the same cluster size. The results can be generalized to varying cluster sizes by invoking the Lindeberg–Feller central limit theorem. In what follows, we use subscript asterisk to denote probability limits,
to denote the true parameters of the PS models, and
to denote the true parameters of the marginal mean model.
2.5.1. Multiple robustness of
We first show that the multiply robust weights of Equation (5) under which both
and
contain a correctly specified PS model are asymptotically equivalent to the multi-level inverse probability weights of Equation (4) when the true correct models are known. This asymptotic equivalence can be established by building the connection between the multiply robust weights and another version of the empirical likelihood weights conditional on the observed sample assuming that the correct PS models are known, which we will derive below.
Without loss of generality, let
and
, the first model in
and
, respectively, be the correctly specified models. Furthermore, let
be the empirical probability of
conditional on
for
. The estimator for
, denoted by
, can be obtained by solving the empirical version of the constrained optimization problem (6) using the same Lagrange multiplier method as in Section 2.4. After some algebraic manipulation,
can be expressed as (see Web Appendix B for derivation)
![]() |
(9) |
Plugging
back to the weighting matrix of (5), we establish the relationship that
![]() |
(10) |
which are asymptotically proportional to the weights of the correctly specified MIPW-GEE estimator of Equation (4). As the number of clusters
goes to infinity, we have
![]() |
(11) |
which proves the consistency of
. The results are summarized below:
Theorem 1
When
contains a correct model for
and
contains a correct model for
, as
,
.
The proof is provided in Web Appendix B.
2.5.2. Asymptotic distribution
We derive the asymptotic distribution of
by following the approach from Theorem 2 of Han (2014) assuming that the correct models are known. Without loss of generality, let
and
be the correctly specified models for
and
. The score functions of
and
, denoted by
and
, are
![]() |
Let
![]() |
where
and for any matrix
,
. Furthermore, write
,
, and
. The following theorem gives the asymptotic distribution of
.
Theorem 2
When both
and
contain a correctly specified model for
and
, respectively,
has an asymptotic normal distribution with mean
and variance var(
), where
See Web Appendix C for detailed derivation and proof.
The asymptotic variance of the MMR-GEE estimator requires knowledge of the correct PS models, which is usually unavailable. Therefore, the asymptotic variance formula cannot easily be used to obtain standard error estimates. We recommend using the non-parametric “clustered bootstrap” approach for inference. The “clustered bootstrap” approach samples
clusters with replacement, with all individuals from the resampled clusters included in the bootstrap sample (Field and Welsh, 2007).
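A minimal Python sketch of the clustered bootstrap is given below. Whole clusters are drawn with replacement, every individual in a drawn cluster is kept, and clusters drawn more than once are relabeled so that they enter the bootstrap sample as distinct clusters. The `estimator` callable is hypothetical; we assume it reruns the entire analysis on the resampled data, including fitting the PS models, although the text above does not spell this out.

```python
import numpy as np

def clustered_bootstrap(data, cluster_ids, estimator, n_boot=200, seed=0):
    """Clustered bootstrap standard error.

    data        : (N, k) array with one row per individual
    cluster_ids : (N,) cluster label of each row
    estimator   : callable(data_rows, new_cluster_ids) -> scalar or 1-D array
    Returns (standard_error, bootstrap_estimates).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    cluster_ids = np.asarray(cluster_ids)
    unique = np.unique(cluster_ids)
    rows_by_cluster = {c: np.flatnonzero(cluster_ids == c) for c in unique}

    estimates = []
    for _ in range(n_boot):
        # Draw as many clusters as in the original sample, with replacement.
        drawn = rng.choice(unique, size=len(unique), replace=True)
        rows = np.concatenate([rows_by_cluster[c] for c in drawn])
        # Relabel so that a cluster drawn twice counts as two clusters.
        sizes = [len(rows_by_cluster[c]) for c in drawn]
        new_ids = np.repeat(np.arange(len(drawn)), sizes)
        estimates.append(estimator(data[rows], new_ids))

    estimates = np.asarray(estimates, dtype=float)
    return estimates.std(axis=0, ddof=1), estimates
```

A Wald-type 95% CI can then be formed as the point estimate plus or minus 1.96 times the returned standard error.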
2.6. An EM algorithm to address misclassification of cluster-level missingness indicators
The consistency of the proposed MMR-GEE estimator requires parameters of the PS models to be consistently estimated. When no individual outcomes from a cluster are available, it is possible that outcome data from this cluster are missing by the cluster-level missingness process (that is, the true cluster-level missingness indicator
); it is also possible that the cluster remains in the study, but all individual outcomes from this cluster are missing, especially when cluster size is small (that is,
, but
for all
). Let
denote the observed cluster-level missingness indicator. In both cases, we observe
, but
can be either 0 or 1. Because consistent estimation of parameters in the PS models requires knowing
, potential misclassification can occur if one naively assigns
. We summarize all possible patterns of the true and observed cluster-level missingness indicators:
(1) The cluster drops out after randomization and before outcome data collection. No participant’s outcome in that cluster can be observed, so both the true and the observed cluster-level missingness indicators are 0.
(2) The cluster does not drop out, but all individual outcomes within the cluster are missing. The observed indicator is then 0 although the true indicator is 1. Such a scenario is more likely for small cluster sizes.
(3) The cluster does not drop out and at least one participant’s outcome in the cluster is observed. The observed cluster-level missingness indicator then equals the true indicator, and both are 1.
The patterns are also summarized in Table S1 in Web Appendix D.1. Under pattern (2), the observed
misclassifies the true
, leading to bias in the estimated parameters for the PS models. More specifically, suppose that the PS models are
![]() |
(12) |
where
,
,
, and
. Because
and
, estimators of
and
based on
may be biased even if the PS models are correctly specified. To address this potential misclassification problem, we treat
as partially observed data and propose an EM algorithm (Dempster et al., 1977) to estimate the parameters in the PS models.
In the current setting, the “complete” data are
, which is denoted by (
) for simplicity of notation. The complete data log likelihood is
![]() |
(13) |
The conditional expectation of the Expectation step (E-step) at iteration
given the observed data
is
![]() |
(14) |
For the Maximization step (M-step), we recommend using optimization software such as the Optimr function in R (Nash, 2016; R Core Team, 2021) to maximize the conditional expectation of the complete data log likelihood obtained in the E-step. By applying the E-step and M-step iteratively, the EM estimators
and
can be obtained after the algorithm converges. When the PS models are correctly specified,
and
would be consistent for the true parameters of the PS models despite misclassification of the cluster-level missingness indicators, that is,
and
. The detailed derivation of the complete data likelihood and E-step as well as pseudo code for the algorithm can be found in Web Appendix D.
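The key quantity in the E-step is the posterior probability that a cluster with no observed outcomes nevertheless remained in the study. The Python sketch below computes it by Bayes' rule under an assumption of ours that may differ from the derivation in Web Appendix D: given the covariates and given that the cluster remains, individual outcomes are missing independently of one another. The function and argument names are hypothetical.

```python
import numpy as np

def e_step_cluster_posterior(p_cluster, p_indiv, any_observed):
    """Posterior probability that the cluster truly remained in the study.

    p_cluster    : current fitted probability that the cluster remains
    p_indiv      : current fitted probabilities that each member's outcome
                   is observed, given that the cluster remains
    any_observed : True if at least one outcome in the cluster was observed
    """
    if any_observed:
        # An observed outcome proves that the cluster did not drop out.
        return 1.0
    # Probability that every outcome is missing although the cluster stayed
    # (assumes independent individual-level missingness within the cluster).
    all_missing_given_stay = np.prod(1.0 - np.asarray(p_indiv, dtype=float))
    stay_and_all_missing = p_cluster * all_missing_given_stay
    return stay_and_all_missing / ((1.0 - p_cluster) + stay_and_all_missing)
```

For a cluster of two members with `p_cluster = 0.9` and individual observation probabilities of 0.5 each, the posterior is 0.225 / (0.1 + 0.225), roughly 0.69. It shrinks quickly as the cluster grows, which matches the remark above that misclassification matters mainly for small clusters.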
2.7. Extension to 3-level CRTs
In 3-level CRTs, study participants are nested in subclusters such as households or healthcare providers, and subclusters are nested in clusters such as regions or clinics. Below, we extend the proposed methods to address informative outcome missingness at both the subcluster and the individual levels.
Let
be the outcome and
be a vector of
baseline covariates for participant
from subcluster
in cluster
. Here, the baseline covariates
can contain individual-, subcluster-, and cluster-level information and are fully observed. We consider a 2-arm parallel CRT using the same binary treatment indicator notation
. For the multi-level missingness processes,
denotes the vector of individual-level missingness indicator, and
is used to denote the subcluster-level missingness indicator for outcomes
.
when
is observed and
when
is missing.
when all participants’ outcomes in subcluster
are missing and
otherwise. Essentially, Table 1 can be read as the data structure of a single cluster, with its rows grouped by subcluster and all subclusters sharing the same treatment status; the data structure for 3-level CRTs is the concatenation of such tables over all clusters.
Under the multi-level missingness setting for 3-level CRTs, the estimating equation for MIPW-GEE and MMR-GEE can be modified as follows:
![]() |
(15) |
is the design matrix with
.
is the covariance matrix with
. Under 3-level CRTs, exchangeable and block exchangeable correlation structures are common choices for the specification of
. To account for informative missing subclusters, the multi-level weighting matrix takes the following form:
![]() |
(16) |
where
is the individual-level missingness process for
and
is the subcluster-level missingness process for
. The extension of the MMR-GEE estimator can be obtained by replacing the weighting matrix of Equation (15) by
![]() |
(17) |
The estimation of
follows the same strategy as in Section 2.4.
3. APPLICATION
We illustrate our proposed methods using data from the Pro-CCM study (Ratovoson et al., 2022), which investigated the efficacy of the pro-CCM intervention in reducing the prevalence of malaria in the Mananjary district of Madagascar. A total of 22 clusters (i.e., fokontany) were randomized to pro-CCM (treatment) or iCCM (control). Study participants were nested in households, which were nested in each fokontany. The disease status of each participant was assessed at baseline and at endline. Here, we focus on the individual-level diagnostic test result ($Y_{ijk}$) at endline for participant $k$ from household $j$ in fokontany $i$ ($Y_{ijk} = 1$ if positive, 0 if negative). The dataset consists of 29,683 participants with 7 individual-level baseline covariates (male indicator, age, primary school indicator, secondary school indicator, high level school indicator, sleep in mosquito nets indicator, and sleep in the yard indicator) and 4 household-level baseline covariates (household size, % of male, highest education level, and indoor residual spraying indicator).
The overall missingness of the individual-level outcome at endline was 31%, with 22.3% of households missing entirely. Results based on a mixed effects model adjusting for socio-demographic characteristics suggested no statistically significant difference in test positivity at endline between participants in the intervention and control arms (OR = 0.71; 95% CI: 0.36-1.43) (Ratovoson et al., 2022). We reanalyze this dataset using the GEE approaches to estimate the marginal treatment effect assuming outcomes have CDM. The specification of the PS models in
and
would ideally be based on knowledge about the underlying causal structure of the missingness processes. In the absence of such information, we apply a backward step-wise procedure based on the AIC to select covariates for the PS, yielding the following models:
![]() |
(18) |
![]() |
(19) |
![]() |
(20) |
The model fitting results are provided in Table S2 in Web Appendix E.1. We carry out the following 4 analyses: CC-GEE based on participants with observed test results at endline, IPW-GEE using Model (18) for the PS, MIPW-GEE using Models (19) and (20) for the subcluster- and individual-level PS, and MMR-GEE with 2 sets of PS models
and
.
contains Model (19) and another model that includes (
,
,
);
contains Model (20) and another model that includes (
,
,
,
,
,
,
). The parameters in the PS models for MIPW-GEE and MMR-GEE are estimated with the EM algorithm proposed in Section 2.6.
CC-GEE yields a marginal treatment effect estimate (
= 0.75, 95% CI: 0.38-1.50) that is similar to the original finding. Approaches that incorporate potentially informative missing outcomes lead to effect estimates slightly closer to the null (IPW-GEE
= 0.81, 95% CI: 0.36-1.82; MIPW-GEE
= 0.85, 95% CI: 0.28-2.41; MMR-GEE
= 0.82, 95% CI: 0.38-1.77). Nevertheless, the conclusion remains the same, as confidence intervals from all approaches include the null. The IPW-GEE estimator, which does not explicitly model the subcluster-level missingness, yields effect estimates similar to those from our proposed MIPW-GEE and MMR-GEE estimators. This similarity suggests that subclusters may be missing completely at random. Indeed, even though 22.3% of households are missing, the estimated subcluster-level PS values, that is, the estimated probabilities that a household remains in the study, are all close to 1 (i.e., mean of
= 0.99 with range 0.96-1.00). Although all approaches lead to the same conclusion in this particular application, the proposed methods permit assessment of the impact of potentially informative missingness on effect estimates under a range of assumptions about the outcome missingness mechanisms at multiple levels.
4. SIMULATION STUDIES
We conduct 2 sets of simulation studies to assess the finite-sample performance of our proposed MMR-GEE estimator. The first set, presented in this section, is structured under a 3-level clustering setting based on the Pro-CCM study, whereas the second set, presented in Web Appendix F, is structured under a 2-level clustering setting.
4.1. Simulation design and data generating processes
We consider 3 designs: the original Pro-CCM design (Org-Pro-CCM) that replicates the Pro-CCM study, as well as alternative design 1 (Alt-1) and alternative design 2 (Alt-2) under varying cluster sizes, outcome models, and missingness processes (see Table 2).
TABLE 2.
Details of the outcome models, missingness models, and proportion of missingness under the Org-Pro-CCM, Alt-1, and Alt-2 designs.
| Org-Pro-CCM | Alt-1 | Alt-2 | |
|---|---|---|---|
| Outcome model | |||
|
(−1.22, −0.34) | (−0.50, 0.50) | (−0.50, 0.50) |
|
|
|
|
|
(−0.99, −1.32) | (1.00, 0.50) | (1.00, 0.50) |
|
(0.29, 0.87) | (1.00, 0.50) | (1.00, 0.50) |
|
|
|
|
|
(0.28, −0.30, 0.67, 1.39, 0.22) | (−0.50, −0.10) | (−0.50, −0.10) |
|
(0, 0, −0.22, −0.70, −0.36) | (−0.50, −0.10) | (−0.50, −0.10) |
| Missingness models | |||
|
(0.90, 0) | (4.50, −0.50) | (4.50, −0.50) |
|
|
|
|
|
(0.03, 0.02) | (−0.50, −0.50) | (−1.00, −0.50, −0.50) |
|
− | (−3.50, −2.00) | (0, −3.50, −2.00) |
|
(2.42, −0.22) | (1.5, -0.50) | (0, −0.50) |
|
|
– |
|
|
(−0.07, −0.28, 0.22) | – | 1 |
|
(0.04, 0.13, 0) | – | – |
|
|
|
|
|
(−0.21, 0.03, 0.24) | 0.2 | 0.2 |
|
– | 0.2 | 0.2 |
| Proportion of missingness | |||
| % of missing households | 0.26 | 0.18 | 0.16 |
| % of overall missing individuals | 0.36 | 0.31 | 0.36 |
Baseline information is resampled from the Pro-CCM study: we sample 22 clusters (i.e., fokontany) in the Org-Pro-CCM and Alt-1 designs and 50 clusters in the Alt-2 design with replacement. Within each cluster, we further sample 30 subclusters (i.e., households) with replacement, yielding a total of 660 households in the Org-Pro-CCM and Alt-1 designs and 1,500 households in the Alt-2 design. To better understand the impact of small subcluster sizes, a constraint is applied in the Alt-2 design, limiting household size to 1–3 members. Treatment assignment (
) is simulated from a Bernoulli distribution with probability
at the cluster level. The primary outcome
is generated from the following model:
![]() |
(21) |
with nested block exchangeable correlation matrix defined by the within- and between-household intracluster correlation coefficients (ICCs), denoted as
.
characterizes the dependency between 2 outcomes within the same household, whereas
characterizes the dependency between 2 outcomes from distinct households within the same fokontany. Here, we choose
and generate the binary outcome using the SimCorMultRes package in R (Touloumis, 2016; R Core Team, 2021).
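For readers who want to see the nested block exchangeable structure explicitly, the Python sketch below builds the correlation matrix for one cluster from its household sizes and the two ICCs. The function name and the numeric values in the usage note are hypothetical placeholders, not the values used in the simulation, and the sketch only constructs the matrix; it does not reproduce how SimCorMultRes generates correlated binary outcomes.

```python
import numpy as np

def nested_exchangeable_corr(household_sizes, rho_within, rho_between):
    """Nested block exchangeable correlation matrix for one cluster.

    household_sizes : number of participants in each household of the cluster
    rho_within      : correlation between two outcomes in the same household
    rho_between     : correlation between two outcomes from different
                      households in the same cluster
    """
    # Household label of every participant in the cluster.
    household = np.repeat(np.arange(len(household_sizes)), household_sizes)
    same_household = household[:, None] == household[None, :]
    corr = np.where(same_household, rho_within, rho_between).astype(float)
    np.fill_diagonal(corr, 1.0)
    return corr
```

Calling `nested_exchangeable_corr([2, 1], 0.3, 0.1)` gives a 3 × 3 matrix in which the two members of the first household are correlated at 0.3 and each is correlated at 0.1 with the single member of the second household.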
Our interest lies in estimating
from the marginal model
![]() |
(22) |
which is different from the conditional treatment effect parameter
in model (21). To obtain the true
, we fit the (unadjusted) GEE model with an independence working correlation matrix to 20,000 full data sets. We then obtain
by averaging over 20,000 estimates. The outcome missingness processes are induced through the following models:
![]() |
(23) |
![]() |
(24) |
The parameter values and the proportion of missingness under the 3 designs are provided in Table 2. Overall, the proportion of missing households ranges from 16% to 26%, and the proportion of overall missing individuals ranges from 31% to 36%.
4.2. Analysis approaches
To demonstrate the importance of correcting potential bias due to informative missing outcomes, we compare the following 4 approaches. First, we carry out an unweighted CC-GEE analysis based on Model (1). Second, we apply the IPW-GEE method based on Model (2), where the PS is estimated by the unconditional logistic regression model with the same functional form as Model (24) but ignores subcluster-level missingness:
![]() |
(25) |
Third, we employ the MIPW-GEE method, where both the subcluster- and individual-level PS models are correctly specified. Lastly, we implement our proposed MMR-GEE estimator by specifying
and
. Both
and
contain one correctly specified and one misspecified model. To estimate the parameters in the PS models, we fit the standard logistic regression model based on
(referred to as MIPW-GEE-no-EM and MMR-GEE-no-EM) and also apply the EM algorithm (referred to as MIPW-GEE-EM and MMR-GEE-EM). In the Org-Pro-CCM and Alt-1 designs, the misclassification rate of
is less than 1%, rendering the application of the EM algorithm unnecessary. For all approaches, we adopt an independence working correlation structure and use the “clustered bootstrap” method to obtain standard error estimates. All results are based on 1,000 replicates and 200 bootstrap resamples.
4.3. Simulation results
Table 3 summarizes the estimates of
, empirical and bootstrap standard errors, and empirical coverage probability. In the Org-Pro-CCM design, all methods produce comparable results, exhibiting unbiased estimates of the marginal odds ratio. The percentage of 95% CI covering the true parameter values is close to 95%. In the Alt-1 and Alt-2 designs, CC-GEE and IPW-GEE provide biased estimates of the marginal odds ratio, with biases ranging from −0.46 to −0.34 for CC-GEE and −0.37 to −0.26 for IPW-GEE, whereas the averages of estimates from MIPW-GEE and MMR-GEE are very close to the true parameter values. The averages of bootstrap standard errors closely match the empirical standard errors in general. In the Alt-2 design, the bootstrap standard errors for MIPW-GEE-EM and MMR-GEE-EM deviate slightly from their empirical counterparts, likely due to numerical instability from the EM algorithm. When
is consistently estimated (i.e., MIPW-GEE-no-EM and MMR-GEE-no-EM for Alt-1 and MIPW-GEE-EM and MMR-GEE-EM for Alt-2), the percentage of 95% CI that covers the true value is close to 95%. The empirical coverage associated with CC-GEE and IPW-GEE can be substantially lower than the nominal level (e.g.,
60% for CC-GEE and
75% for IPW-GEE).
TABLE 3.
Empirical estimates of
and its corresponding odds ratio, empirical standard errors, mean of the estimated bootstrapping-based standard errors using the “clustered bootstrap” method, and empirical coverage probability based on 1,000 replicates and 200 bootstrapping resamples. The coverage probability is the percentage of true
contains in the 95% CI constructed from the bootstrapping-based standard errors.
| Method | Est. coef. | Est. OR | Emp. S.E. | Est. S.E. (clustered bootstrap) | Cov. prob. (clustered bootstrap) |
|---|---|---|---|---|---|
| Org-Pro-CCM | | | | | |
| Full data | −0.24 | 0.87 | 0.43 | 0.42 | 0.93 |
| CC-GEE | −0.24 | 0.87 | 0.45 | 0.44 | 0.93 |
| IPW-GEE | −0.24 | 0.87 | 0.45 | 0.44 | 0.93 |
| MIPW-GEE-no-EM | −0.24 | 0.87 | 0.45 | 0.44 | 0.93 |
| MMR-GEE-no-EM | −0.24 | 0.88 | 0.45 | 0.44 | 0.93 |
| Alt-1 | | | | | |
| Full data | 0.41 | 1.57 | 0.28 | 0.27 | 0.94 |
| CC-GEE | 0.17 | 1.23 | 0.28 | 0.28 | 0.84 |
| IPW-GEE | 0.23 | 1.31 | 0.28 | 0.28 | 0.87 |
| MIPW-GEE-no-EM | 0.41 | 1.57 | 0.29 | 0.28 | 0.94 |
| MMR-GEE-no-EM | 0.41 | 1.58 | 0.29 | 0.28 | 0.95 |
| Alt-2 | | | | | |
| Full data | 0.27 | 1.35 | 0.23 | 0.23 | 0.94 |
| CC-GEE | −0.15 | 0.89 | 0.27 | 0.25 | 0.57 |
| IPW-GEE | −0.06 | 0.98 | 0.26 | 0.25 | 0.71 |
| MIPW-GEE-no-EM | 0.12 | 1.21 | 0.36 | 0.33 | 0.88 |
| MIPW-GEE-EM | 0.26 | 1.37 | 0.32 | 0.42 | 0.96 |
| MMR-GEE-EM | 0.26 | 1.37 | 0.31 | 0.28 | 0.91 |
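The summary measures reported in Table 3 can be computed from per-replicate output along the following lines. This is a sketch under stated assumptions: the function name and the four toy values are hypothetical, and the 95% CI is assumed to be the Wald interval, estimate ± 1.96 × bootstrap standard error.

```python
import statistics


def summarize_replicates(estimates, std_errors, truth, z=1.96):
    """Summarize simulation replicates for one method.

    estimates  : point estimates, one per replicate
    std_errors : bootstrap standard errors, one per replicate
    truth      : true parameter value used to generate the data
    """
    mean_est = statistics.fmean(estimates)
    # A Wald interval covers the truth when |estimate - truth| <= z * SE.
    covered = [abs(est - truth) <= z * se for est, se in zip(estimates, std_errors)]
    return {
        "mean_est": mean_est,
        "bias": mean_est - truth,
        "emp_se": statistics.stdev(estimates),    # SD of estimates across replicates
        "est_se": statistics.fmean(std_errors),   # mean of bootstrap SEs
        "coverage": sum(covered) / len(covered),
    }


# Hypothetical toy input: four replicates, true value 0.28.
summary = summarize_replicates(
    estimates=[0.10, 0.30, 0.50, 0.90],
    std_errors=[0.20, 0.20, 0.20, 0.20],
    truth=0.28,
)
```

In this toy input, the interval from the last replicate (0.90 ± 0.392) misses 0.28, so the coverage is 3 out of 4.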
Figure 1 presents the empirical distribution of the treatment effect estimates for the Alt-2 design, with figures from the other 2 settings provided in Web Appendix E.2. The solid line denotes the true marginal effect, and the dashed line is the empirical mean across the 1,000 coefficient estimates. Overall, the proposed MMR-GEE-EM and MIPW-GEE-EM estimators lead to estimates that are centered at the true value. Because CC-GEE ignores informative missing data and IPW-GEE fails to account for subcluster-level missingness, both result in biased estimates. MIPW-GEE-no-EM attempts to adjust for the multi-level missingness processes; however, without the EM algorithm to correct the misclassification in the observed (sub)cluster-level missingness indicator, it still yields biased estimates. In contrast, MIPW-GEE-EM appropriately accounts for this misclassification, and the bias disappears.
FIGURE 1.
Empirical distribution of the treatment effect estimates based on 1,000 replicates for the Alt-2 design. The solid line denotes the truth (0.28). The dashed line denotes the empirical mean of the estimates.
We further compare strategies for estimating the parameters in the correctly specified PS models with and without the EM algorithm and present the estimated parameter values in Web Appendix E.3. Under all scenarios, the PS parameter estimates obtained with the EM algorithm are centered at the true values, whereas the parameters estimated by the standard logistic regression model using the observed (sub)cluster-level missingness indicator can be substantially biased for the Alt-2 setting. As mentioned in Section 2.6, the misclassification in this indicator due to the missingness of all individual outcomes in the (sub)cluster is more likely to happen for small (sub)cluster sizes. As (sub)cluster size increases, the misclassification becomes less probable because the probability of all individual outcomes within a (sub)cluster being missing would be very small. Therefore, a large (sub)cluster size obviates the need to apply the EM algorithm.
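A back-of-the-envelope calculation illustrates how quickly this probability shrinks. Purely as a simplifying assumption (not the model used in the paper), suppose the n individual outcomes in a (sub)cluster are missing independently, each with the same probability p; both symbols are introduced here for illustration only.

```latex
\Pr(\text{all } n \text{ outcomes missing}) = \prod_{i=1}^{n} p = p^{n},
\qquad \text{e.g., } p = 0.3:\quad
p^{2} = 0.09,\quad p^{5} \approx 0.0024,\quad p^{20} \approx 3.5 \times 10^{-11}.
```

Under this assumption, a (sub)cluster of size 2 has all outcomes missing about 9% of the time, whereas for size 20 the event is practically impossible. Positive within-cluster correlation in missingness would make the decay slower than this independent case suggests.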
5. DISCUSSION
Drawing upon the empirical likelihood theory, this paper proposes a new estimation procedure for the marginal treatment effect in CRTs with multi-level missing outcomes that guards against partial misspecification of the PS models. The proposed MMR-GEE estimator allows analysts to specify multiple sets of PS models and leads to consistent treatment effect estimates provided that one of the multiple PS models specified at each clustering level is correctly specified and the parameters in the PS models are consistently estimated. We assume that outcome missingness is driven by baseline variables. If outcome missingness also depends on post-baseline variables, one could consider including post-randomization variables in the PS models (Carpenter and Kenward, 2007). The 2-stage targeted minimum loss-based estimation approach proposed by Balzer et al. (2023) considered outcome missingness driven by both baseline information and post-randomization variables. To guide the selection of the PS models, graphical models such as missingness directed acyclic graphs (m-DAGs), which integrate information on the missingness process alongside context-specific knowledge, can be used to assess the causes of missingness (Moreno-Betancur et al., 2018; Mohan and Pearl, 2021; Lee et al., 2023). One potential strategy is to specify multiple plausible m-DAGs (Lee et al., 2023); the PS models inferred from these m-DAGs can then be incorporated into the MMR-GEE estimator. Selecting the number of PS models for the MMR-GEE estimator involves balancing two considerations: including enough candidate models to improve the chances of correctly specifying the missingness process, and controlling the dimension of the Lagrange multipliers. As the number of PS models increases, so does the complexity of the Lagrange multipliers, which can lead to numerical challenges when solving the constrained optimization problem (Han and Wang, 2013).
When using the GEE approach to analyze clustered data through a marginal model, the correlation among cluster members is modeled to determine the weight assigned to the data from each member. Here, we assume that the cluster size is non-informative (i.e., the outcome is independent of the cluster size), so that clustering does not affect the marginal mean model specification. In such settings, the participant-average and the cluster-average treatment effect estimands coincide (Kahan et al., 2023), and the proposed weighted-GEE estimators are consistent provided that the PS models for outcome missingness are correctly specified. However, in the presence of informative cluster size, the choice of working covariance matrix may correspond to different fitted marginal models, potentially resulting in treatment effect estimators targeting different estimands (Kahan et al., 2023; Williamson et al., 2003). For example, commonly used estimators such as GEEs with an exchangeable correlation structure can be biased for both the participant-average and the cluster-average treatment effect estimands. Therefore, it is recommended to carefully consider the target estimand and the likelihood of informative cluster size when choosing an appropriate analysis method.
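The distinction between the two estimands can be made concrete with a small numerical sketch. Following the definitions in Kahan et al. (2023), the participant-average estimand weights every participant equally, whereas the cluster-average estimand weights every cluster equally. The toy data and function names below are hypothetical, and a risk difference is used instead of the odds ratio purely to keep the arithmetic transparent.

```python
def participant_average(clusters):
    """Mean outcome giving every participant equal weight."""
    total_events = sum(sum(outcomes) for outcomes in clusters)
    total_people = sum(len(outcomes) for outcomes in clusters)
    return total_events / total_people


def cluster_average(clusters):
    """Mean outcome giving every cluster equal weight."""
    cluster_means = [sum(outcomes) / len(outcomes) for outcomes in clusters]
    return sum(cluster_means) / len(cluster_means)


# Hypothetical arms in which small clusters have higher event rates,
# i.e., cluster size is informative.
treated = [[1, 1], [1, 0, 0, 0, 0, 0, 0, 0]]
control = [[1, 0], [1, 0, 0, 0, 0, 0, 0, 0]]

rd_participant = participant_average(treated) - participant_average(control)
rd_cluster = cluster_average(treated) - cluster_average(control)

# With equal cluster sizes, the two summaries coincide.
balanced = [[1, 0], [0, 0]]
same = participant_average(balanced) == cluster_average(balanced)
```

In this toy example, the participant-average risk difference is 0.30 − 0.20 = 0.10, while the cluster-average risk difference is 0.5625 − 0.3125 = 0.25, so an analysis that implicitly weights clusters rather than participants would target a different quantity.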
We create a flowchart to guide the selection of an appropriate method for handling missing outcome data at multiple levels (Figure 2). First, our proposed approach is targeted toward the multi-level CDM setting. If one believes that clusters are missing completely at random, it may be sufficient to apply the IPW-GEE method to account for informative missingness at the individual level. Second, although MMR-GEE provides the flexibility to specify multiple sets of PS models, analysts can apply the MIPW-GEE method if they have substantial knowledge about the true multi-level missingness processes. Finally, the goal of the EM algorithm is to address the challenge in estimating the parameters of the PS models due to misclassification in the observed cluster-level missingness indicator, which is more likely to happen for small cluster sizes. When cluster sizes are large, the likelihood of all individual outcomes within a cluster being missing diminishes, making this misclassification highly improbable, so the EM algorithm becomes unnecessary. When clusters contain a mixture of different sizes and the likelihood of this misclassification is uncertain, the EM algorithm is recommended. In the absence of misclassification, the EM algorithm would converge very quickly, so the added computational burden is minimal.
FIGURE 2.
Recommendations for modeling and estimating the propensity score under various missingness mechanisms and scenarios.
We consider parametric logistic regression models when modeling the outcome missingness processes. Machine learning methods such as classification and regression trees could also be used (Lee et al., 2010). Augmented IPW-GEEs, which combine a PS model and a covariate-conditional mean outcome model, have been developed for handling missing outcome data in CRTs (Prague et al., 2016). The inclusion of an outcome model adds an extra layer of protection against model misspecification, and can also enhance the estimation efficiency if correctly specified. Extension of the MMR-GEE estimator to incorporate outcome models in the multi-level missing outcome settings requires further investigation. Lastly, we assume the missingness mechanism to be multi-level CDM. Developing sensitivity analysis methods to evaluate the impact of violating missing data assumptions would be useful.
Supplementary Material
Web Appendices, Tables, and Figures referenced in Sections 2.4, 2.5, 2.6, 3, and 4, and source code are available with this paper at the Biometrics website on Oxford Academic. Software in the form of R code is available at: https://github.com/JerryChiaRuiChang/MMR-GEE.
ACKNOWLEDGMENTS
We thank the editor, associate editor, and two reviewers for their helpful feedback and comments.
Contributor Information
Chia-Rui Chang, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115, United States.
Rui Wang, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115, United States; Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, MA 02215, United States.
FUNDING
Research in this article was in part supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (NIH) R01 AI136947.
CONFLICT OF INTEREST
None declared.
DATA AVAILABILITY
The data that support the findings in this paper are openly available at https://doi.org/10.7910/DVN/IIDE2B.
References
- Balzer L. B., van der Laan M., Ayieko J., Kamya M., Chamie G., Schwab J. et al. (2023). Two-stage TMLE to reduce bias and improve efficiency in cluster randomized trials. Biostatistics, 24, 502–517.
- Carpenter J. R., Kenward M. G. (2007). Missing data in randomised controlled trials: a practical guide. Health Technology Assessment Methodology Programme, Birmingham, p. 199. https://researchonline.lshtm.ac.uk/id/eprint/4018500.
- Dempster A. P., Laird N. M., Rubin D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39, 1–22.
- Diaz-Ordaz K., Kenward M., Gomes M., Grieve R. (2016). Multiple imputation methods for bivariate outcomes in cluster randomised trials. Statistics in Medicine, 35, 3482–3496.
- Field C. A., Welsh A. H. (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B, 69, 369–390.
- Fiero M. H., Huang S., Oren E., Bell M. L. (2016). Statistical analysis and handling of missing data in cluster randomized trials: a systematic review. Trials, 17, 72.
- Giraudeau B., Ravaud P. (2009). Preventing bias in cluster randomised trials. PLoS Medicine, 6, e1000065.
- Han P. (2014). Multiply robust estimation in regression analysis with missing data. Journal of the American Statistical Association, 109, 1159–1173.
- Han P., Wang L. (2013). Estimation with missing data: beyond double robustness. Biometrika, 100, 417–430.
- Hayes R. J., Moulton L. H. (2017). Cluster Randomised Trials. New York: Chapman and Hall/CRC.
- Hossain A., Diaz-Ordaz K., Bartlett J. W. (2017a). Missing binary outcomes under covariate-dependent missingness in cluster randomised trials. Statistics in Medicine, 36, 3092–3109.
- Hossain A., Diaz-Ordaz K., Bartlett J. W. (2017b). Missing continuous outcomes under covariate dependent missingness in cluster randomised trials. Statistical Methods in Medical Research, 26, 1543–1562.
- Hubbard A. E., Ahern J., Fleischer N. L., Van der Laan M., Satariano S. A., Jewell N. et al. (2010). To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk factors and health. Epidemiology, 21, 467–474.
- Kahan B. C., Li F., Copas A. J., Harhay M. O. (2023). Estimands in cluster-randomized trials: choosing analyses that answer the right question. International Journal of Epidemiology, 52, 107–118.
- Lee B. K., Lessler J., Stuart E. A. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine, 29, 337–346.
- Lee K. J., Carlin J. B., Simpson J. A., Moreno-Betancur M. (2023). Assumptions and analysis planning in studies with missing data in multiple variables: moving beyond the MCAR/MAR/MNAR classification. International Journal of Epidemiology, 52, 1268–1275.
- Liang K.-Y., Zeger S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
- Mitani A. A., Kaye E. K., Nelson K. P. (2022). Accounting for drop-out using inverse probability censoring weights in longitudinal clustered data with informative cluster size. The Annals of Applied Statistics, 16, 596–611.
- Mohan K., Pearl J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116, 1023–1037.
- Moreno-Betancur M., Lee K. J., Leacy F. P., White I. R., Simpson J. A., Carlin J. B. (2018). Canonical causal diagrams to guide the treatment of missing data in epidemiologic studies. American Journal of Epidemiology, 187, 2705–2715.
- Nash J. C. (2016). optimr: A Replacement and Extension of the ‘optim’ Function. http://cran.r-project.org/package=optimr.
- Owen A. B. (2001). Empirical Likelihood. New York: Chapman and Hall/CRC.
- Prague M., Wang R., Stephens A., Tchetgen Tchetgen E., DeGruttola V. (2016). Accounting for interactions and complex inter-subject dependency in estimating treatment effect in cluster-randomized trials with missing outcomes. Biometrics, 72, 1066–1077.
- R Core Team (2021). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
- Ratovoson R., Garchitorena A., Kassie D., Ravelonarivo J. A., Andrianaranjaka V., Razanatsiorimalala S. et al. (2022). Proactive community case management decreased malaria prevalence in rural Madagascar: results from a cluster randomized trial. BMC Medicine, 20, 1–15.
- Robins J. M., Rotnitzky A., Zhao L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, 106–121.
- Schafer J. L., Yucel R. M. (2002). Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics, 11, 437–457.
- Tchetgen Tchetgen E. J., Shpitser I. (2012). Semiparametric theory for causal mediation analysis: efficiency bounds, multiple robustness, and sensitivity analysis. Annals of Statistics, 40, 1816.
- Touloumis A. (2016). Simulating correlated binary and multinomial responses under marginal model specification: the SimCorMultRes Package. The R Journal, 8, 79.
- Williamson J. M., Datta S., Satten G. A. (2003). Marginal analyses of clustered data when cluster size is informative. Biometrics, 59, 36–42.