Biometrika 99(2), 439–456 (2012). doi: 10.1093/biomet/ass013

Improved double-robust estimation in missing data and causal inference models

Andrea Rotnitzky 1, Quanhong Lei 2, Mariela Sued 3, James M Robins 4

Abstract

Recently proposed double-robust estimators for a population mean from incomplete data and for a finite number of counterfactual means can have much higher efficiency than the usual double-robust estimators under misspecification of the outcome model. In this paper, we derive a new class of double-robust estimators, with similar efficiency properties, for the parameters of regression models with incomplete cross-sectional or longitudinal data and of marginal structural mean models for cross-sectional data. Unlike the recent proposals, our estimators solve outcome regression estimating equations. In a simulation study, the new estimator shows improvements in variance relative to the standard double-robust estimator that are in agreement with those suggested by asymptotic theory.

Keywords: Drop-out, Marginal structural model, Missing at random

1. Introduction

In a missing data model, an estimator is double-robust if it is consistent when either a model for the missingness mechanism or a model for the full-data law is correctly specified. In a causal inference model, an estimator is double-robust if it is consistent when either a model for the treatment assignment mechanism or a model for the counterfactual data distribution is correctly specified.

Scharfstein et al. (1999) noted that an estimator originally developed in Robins et al. (1994), and identified there as the locally efficient estimator in the class of augmented inverse probability weighted estimators for missing data models, was double-robust. Since then, many estimators with the double-robust property have been proposed, several of which were recently reviewed by Kang & Schafer (2007) and the discussants of that paper, for the special case of estimating a population mean and the causal effect of a binary treatment.

Rubin & van der Laan (2008) noted that the locally efficient estimator of Robins et al. (1994) can be quite inefficient if the model for the full-data distribution is incorrectly specified. To remedy this, they described a new general approach yielding locally efficient estimators with desirable efficiency properties when the full-data model is incorrectly specified. Tan (2008) and Cao et al. (2009) demonstrated that for estimating a population mean with missing data and unknown missingness probabilities, a particular form of the Rubin and van der Laan procedure yields double-robust estimators. In a recent paper, Tan (2010a) combines that procedure with restricted empirical maximum likelihood estimation to derive new double-robust estimators of a population mean with missing data and of population average treatment effects that have the efficiency properties of the Rubin and van der Laan estimator. A property of the procedure in Tan (2010a) not satisfied by the proposals of Tan (2008) and Cao et al. (2009) is that means are estimated as weighted averages with positive weights. Thus, estimated means always fall in the parameter space and in the range of observed outcomes.

In this paper, we describe a new general approach to constructing locally efficient double robust estimators for the parameters of regression models with outcomes missing at random and for parameters of marginal structural mean models for point exposure studies with continuous or discrete exposures that have the advantageous efficiency properties of the Rubin and van der Laan procedure. For the special case of a population mean, our estimators do not reduce to any of the earlier estimators of Tan (2008, 2010a) and Cao et al. (2009). In fact, unlike these other proposals, our estimators solve outcome regression estimating equations. These are equations identical to the ordinary weighted least squares estimating equations but with the missing outcome or counterfactual response replaced by an estimate of its conditional expectation given baseline covariates. As such, unlike augmented inverse probability weighted estimating equations, our equations always have a solution, as long as their full-data analogues also have a solution and the estimated conditional expectation falls in the sample space of the outcome. In particular, like Tan (2010a), our estimators of a population mean always fall in the parameter range.

Several proposals exist for constructing locally efficient double-robust estimators that solve outcome regression estimating equations. For regression models with missing data and for marginal structural models, these include the procedures in Bang & Robins (2005) and the targeted maximum likelihood methodology (van der Laan & Rubin, 2006; van der Laan, 2010). For a population mean, Kang & Schafer (2007, Equation (10)) described a so-called double-robust weighted least squares outcome regression estimator. These authors reported a simulation study in which this estimator performed better than the Bang and Robins estimator. These approaches have not yet been adapted to provide estimators with the improved efficiency properties of Rubin & van der Laan (2008).

The procedure described here is essentially a generalization to the regression setting of Kang and Shafer’s weighted least squares outcome regression estimator of a population mean. The key innovation is that the weights depend on an augmented model for the missingness or treatment probability which incorporates covariates constructed so as to ensure the advantageous efficiency properties of the Rubin & van der Laan (2008) procedure. Our approach exploits the counterintuitive fact that the efficiency of augmented inverse probability weighted estimators improves as the dimension of the model under which the missingness or treatment probabilities are estimated increases.

2. Cross-sectional studies with missing data

2.1. Existing double-robust estimators

Consider a study in which the intended data, (Yi, Li) (i = 1, …, n), are independent and identically distributed across i, where Yi is a scalar outcome and Li is a vector of additional variables. Assume Li is observed but Yi is missing in a subsample. Let Ai = 1 if Yi is observed and Ai = 0 otherwise. In this section, we consider estimation of the unknown p × 1 parameter vector β0 of the regression model

E(Y|Z)=h(Z;β0), (1)

where h(· ; ·) is a known smooth function and Z is a subset of the components of L, possibly empty, from independent and identically distributed copies (Ai, AiYi, Liᵀ) (i = 1, …, n) of (A, AY, Lᵀ). We make the missing at random assumption (Rubin, 1976) that f(A | Y, L) = f(A | L) and the positivity assumption that pr(A = 1 | Y, L) > 0. Under these assumptions, β0 solves, for any p × 1 function b(·), the population moment equation

E[b(Z){E(Y | L, A = 1) − h(Z; β)}] = 0. (2)

Equation (2) has motivated the so-called outcome regression estimator β̂reg(η̂). To compute it, one first estimates the unknown parameter η0 of a working outcome regression model for the respondents

E(Y|L,A=1)=m(L;η0), (3)

where m(·; ·) is a known smooth function, by some η̂ using the units with observed Y. Next, one computes β̂reg(η̂) where, by definition, for any η, β̂reg(η) solves

En[b(Z){m(L; η) − h(Z; β)}] = 0, (4)

with b(·) a p × 1 user-specified vector function. In what follows, En(·) and En(· | ·) stand for the empirical mean and conditional mean operators, i.e., En(U) ≡ n⁻¹ Σ_{i=1}^{n} Ui and En(U | A = 1, V) ≡ {Σ_{i=1}^{n} I(Ai = 1, Vi = V)Ui}/{Σ_{i=1}^{n} I(Ai = 1, Vi = V)}. Under regularity conditions, which include the requirement that (2) has a unique solution, the estimator β̂reg(η̂) is consistent for β0 and asymptotically normal if model (3) is correctly specified, provided η̂ is consistent for η0 and asymptotically normal. Note that (4) is the same as the weighted least squares estimating equations En[b(Z){Y − h(Z; β)}] = 0 that one would ordinarily use in the absence of missing data, for example, with b(Z) = (1, Zᵀ)ᵀ if h(Z; β) = Zᵀβ or h(Z; β) = expit(Zᵀβ), where expit(u) = {1 + exp(−u)}⁻¹, but with the outcome Y replaced by m(L; η). Then, if the range of m(·; ·) falls in the sample space of Y, equation (4) will ordinarily have a solution.
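As an illustration, the following is a minimal sketch of β̂reg(η̂) in the simplest case β0 = E(Y), i.e., Z empty and h(Z; β) = β, assuming a linear working model (3) fitted by ordinary least squares on the respondents; the Python/numpy implementation and all names are our own illustrative choices, not part of the paper.

```python
# Sketch: outcome regression estimator of beta0 = E(Y) under a linear working
# model (3); assumes y is NaN where missing and a is the response indicator.
import numpy as np

def beta_reg(y, L, a):
    X = np.column_stack([np.ones(len(a)), L])                    # design with intercept
    eta = np.linalg.lstsq(X[a == 1], y[a == 1], rcond=None)[0]   # eta-hat from respondents
    return np.mean(X @ eta)                                      # E_n{m(L; eta-hat)}, (4) with b = 1
```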

The moment equation (2) can be re-written as

E[b(Z)ω{Y − h(Z; β)}] = 0,

where ω = Af (A | L)−1. This formulation has motivated the so-called augmented inverse probability weighted estimators β̂g,aipw. To compute them, one first posits a parametric model for the missingness probability

f(A|L)=f(A|L;α0) (5)

for instance, f(A | L; α) = expit(αᵀL̃)^A {1 − expit(αᵀL̃)}^(1−A), where L̃ = (1, Lᵀ)ᵀ, and estimates α0 by its maximum likelihood estimator α̂. Next, one computes β̂g,aipw solving, with α evaluated at α̂, the equations

En[ω(α)b(Z){Y − h(Z; β)}] − En[g(A, L) − Eα{g(A, L) | L}] = 0, (6)

where g(·) and b(·) are user-specified p × 1 vector functions, ω(α) = A f(A | L; α)⁻¹ and Eα{g(A, L) | L} is the conditional expectation of g(A, L) given L under f(A | L; α). When g = 0, β̂g,aipw, throughout denoted by β̂ipw, is called an inverse probability weighted estimator. If model (5) is correctly specified, then under regularity conditions, β̂g,aipw is consistent and asymptotically normal with asymptotic variance no smaller than that of β̂gopt,aipw, where gᵇopt(A, L) = ωb(Z){E(Y | A, L) − h(Z; β0)}. This motivates estimating β0 by β̂(η̂, α̂) where, for a given b(·) and each (η, α), β̂(η, α) solves

En[ω(α)b(Z){Y − h(Z; β)}] − En[gᵇη,α,β(A, L) − Eα{gᵇη,α,β(A, L) | L}] = 0, (7)

with gᵇη,α,β(A, L) = ω(α)b(Z){m(L; η) − h(Z; β)} (Robins & Rotnitzky, 1995).

Regardless of the validity of the outcome regression model (3), if the missingness model (5) is correct, no variance correction due to estimation of η is needed, i.e.,

√n{β̂(η̂, α̂) − β̂(η*, α̂)} = op(1), (8)

where η* is the probability limit of η̂. When both the outcome regression and the missingness models are correct, no adjustment to the asymptotic variance of β̂(η0, α̂) due to estimation of α is needed, i.e., √n{β̂(η0, α̂) − β̂(η0, α0)} = op(1).

Robins & Rotnitzky (1995) noted that the estimator β̂ (η̂, α̂) satisfies:

Property 1. if the missingness model is correct, it is consistent for β0 and asymptotically normal;

Property 2. if both the outcome regression and missingness models are correct, it has asymptotic variance equal to the smallest asymptotic variance of all estimators β̂g,aipw that use the same b.

Unlike the outcome regression estimator β̂reg(η), the estimator β̂(η̂, α̂) may fall outside the range of plausible values for β. For example, in the absence of Z and when h(Z; β) = β, (7) is equivalent to

β = En[ω(α){Y − m(L; η)}] + En{m(L; η)}. (9)

If Y is binary and m(L; η) = expit(ηᵀL̃), then 0 < En{m(L; η̂)} < 1. However, |En[ω(α̂){Y − m(L; η̂)}]| may be very large if a few weights ωi(α̂) are exceedingly large, as is the case when a few units with Ai = 1 have relatively very small estimated values f(Ai | Li; α̂). Another manifestation of this problem is estimating equations with no solution. For instance, if we parameterize E(Y) = expit(β0), then (9) with β replaced by expit(β) has no solution if the right-hand side of (9) is outside the interval (0, 1).
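The boundary problem is easy to reproduce numerically. Below is a hedged sketch of the standard double-robust estimator (9), computed from already-fitted values m̂(Li) and π̂(Li) = f(1 | Li; α̂); all names are illustrative assumptions. Feeding it a few very small π̂i with Ai = 1 typically pushes the estimate of a binary mean outside (0, 1).

```python
# Sketch: standard double-robust estimator (9) for a population mean.
import numpy as np

def beta_dr_standard(y, m_hat, pi_hat, a):
    w = a / pi_hat                               # omega(alpha-hat); zero when A = 0
    y0 = np.nan_to_num(y)                        # missing outcomes enter only through w = 0
    return np.mean(w * (y0 - m_hat)) + np.mean(m_hat)
```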

Scharfstein et al. (1999) noted that β̂ (η̂, α̂) is double-robust: it is consistent for β0 and asymptotically normal provided either model (5) or model (3) is correct. The estimator β̂ (η̂, α̂) with η̂ the ordinary or iteratively reweighted least squares estimator based on units with observed Y is known as the standard double-robust estimator. Several authors (Robins & Wang, 2000; Kang & Schafer, 2007; Rubin & van der Laan, 2008) have noted that standard double-robust estimators may have substantial bias and large variance, even under correct specification of the missingness model, if the estimated missingness probabilities are highly variable and/or the outcome regression model is misspecified. Alternative double-robust estimators have been recently developed to address these problems (van der Laan & Rubin, 2006; van der Laan, 2010; Tan, 2006, 2007, 2008, 2010a, 2010b; Robins et al., 2007; Cao et al., 2009). In particular, Tan (2008, 2010a) and Cao et al. (2009) derived double-robust estimators of E(Y) = β0, i.e., in the special case in which Z is absent and h(Z; β) = β, which satisfy Properties 1 and 2 and have the enhanced efficiency benefit that:

Property 3. if the missingness model is correct, they have asymptotic variance smaller than or equal to the asymptotic variance of any estimator β̂ (η, α̂) with η fixed but arbitrary even if the outcome regression model is incorrect.

The estimator of Tan (2008) agrees with that of Cao et al. (2009) when m(L; η) is linear in η, but otherwise it has a subtle difference that ensures that, when the missingness model is correct, it also has asymptotic variance no larger than that of any estimator computed like β̂(η, α̂) but with m(L; η) replaced by c1 + c2m(L; η) for any c1 and c2. Both proposals adapt a particular version of an estimator proposed by Rubin & van der Laan (2008) to the setting in which the missingness probabilities are unknown and estimated under a parametric model. Under both proposals, the estimators are of the form β̂(η̃, α̂), where η̃ minimizes σ̂²(η), an adequately chosen consistent estimator of the asymptotic variance σ²(η) of β̂(η, α̂). A drawback of these approaches is that they are based on solving equations of the form (7), which may not have a solution or may yield estimators that fall outside the parameter space. To remedy this, Tan (2010a) derived an estimator that maximizes a constrained empirical likelihood. A clever choice of constraints, combined with a calibration of a linear model for the missingness probabilities, ensures that the estimator is double-robust, is efficient in the aforementioned larger class of estimators considered by Tan (2008), satisfies Properties 1–3 and is a weighted average of the observed outcomes.

In the next section, we introduce a new class of estimators β̂dr of parameters of regression models with the following properties:

Property 4. they are double-robust, i.e., consistent for β0 and asymptotically normal if either model (3) or (5) is correct;

Property 5. they satisfy Properties 1 and 2;

Property 6. they solve an outcome regression estimating equation;

Property 7. given user-specified real-valued functions ϕ1(·), …, ϕK (·), each ϕk(β̂dr) (k = 1, …, K), has asymptotic variance smaller than or equal to that of any ϕk{β̂ (η, α̂)} with η fixed but arbitrary, if the missingness model is correct and even if the outcome regression model is incorrect;

Property 8. they have asymptotic variance smaller than or equal to that of β̂ipw if the missingness model is correct even if the outcome regression model is incorrect.

For β the population mean of a scalar outcome, as in Tan (2008, 2010a) and Cao et al. (2009), or for any other scalar parameter, Property 7 with ϕ1(β) = β is the same as Property 3. For β of dimension 2 or more, Property 7 ensures enhanced efficiency for estimation of a finite set of target scalar features of the vector β specified by the data analyst. For example, if in a two-group comparison analysis h(Z; β) = expit(β1 + β2Z) with Z binary, one would choose ϕ1(β) = h(0; β), ϕ2(β) = h(1; β) and ϕ3(β) = h(1; β) − h(0; β) to ensure enhanced efficiency in estimating the group means and their difference. A more ambitious goal would be to construct estimators that satisfy Property 7 for all smooth scalar functions ϕ, not just for a finite number of them, or equivalently, Property 3 even when the dimension of β is 2 or more. Whether this can be accomplished remains an open question.

For estimating regression models with missing data, several procedures exist that satisfy some, but not all, of Properties 4–8. The standard double-robust estimator β̂(η̂, α̂) satisfies Properties 4 and 5 but not 6–8. The Bang & Robins (2005) estimator and estimators derived from application of targeted maximum likelihood methodology (van der Laan & Rubin, 2006; van der Laan, 2010) satisfy Properties 4–6 but not 7 and 8. The one-step corrected estimator in § 2.5 of van der Laan & Robins (2003) satisfies Properties 5 and 8 but not 4, 6 and 7. The estimators in Tan (2010b) satisfy Properties 4, 5 and 8 but not 6 and 7.

2.2. Proposed estimator

To compute β̂dr it is required that the parameter η of model (3) has dimension q greater than or equal to the dimension p of β0. If this is not the case, one first augments model (3) by including additional covariates based on transformations of L, e.g., powers of L. The computation of β̂dr is carried out in the following three steps:

Step 1. Estimate η by η̂or solving the p equations

En[ω(α̂)b(Z){Y − m(L; η)}] = 0, (10)

and the additional q − p equations

En[Ad(L; η){Y − m(L; η)}] = 0, (11)

where d(L; η) is any user-specified (q − p) × 1 function, say d(L; η) = (∂m(L; η)/∂η1, …, ∂m(L; η)/∂η_{q−p})ᵀ. Compute β̂or = β̂reg(η̂or) solving the outcome regression equation (4) with η = η̂or.

Step 2. For each k (k = 1, …, K), compute

η̃k = argmin_η En([Îk{M(α̂, β̂or) − U(η, α̂, β̂or) − ρ̂ηᵀS(α̂)}]²), (12)

where

S(α) = ∂ log f(A | L; α)/∂α,
Îk = ∂ϕk(β)/∂βᵀ|β̂or En{∂M(α̂, β)/∂βᵀ|β̂or}⁻¹,
M(α, β) = ω(α)b(Z){Y − h(Z; β)},
U(η, α, β) = gᵇη,α,β(A, L) − Eα{gᵇη,α,β(A, L) | L}

and

ρ̂ηᵀ = En[{M(α̂, β̂or) − U(η, α̂, β̂or)}S(α̂)ᵀ] En[S(α̂)S(α̂)ᵀ]⁻¹.

Step 3. Compute the maximum likelihood estimator (α̃, δ̃) of (α, δ) in the augmented missingness model

f(A | L; α, δ) = c(L; α, δ) f(A | L; α) exp{Σ_{k=1}^{K+1} δkᵀuk(A, L)}, (13)

where uk(A, L) = U(η̃k, α̂, β̂or) (k = 1, …, K), uK+1(A, L) = U(η̂or, α̂, β̂or) and c(L; α, δ) is the normalizing constant. Estimate η by η̃or jointly solving (11) and (10) with ω(α̂) replaced with ω(α̃, δ̃) = Af(A | L; α̃, δ̃)⁻¹. Finally, β̂dr is the estimator β̂reg(η̃or) solving the outcome regression equation (4) with η = η̃or.

The estimator β̂or of β returned by Step 1 is the extension to the regression setting of the so-called weighted least squares outcome regression double-robust estimator of a population mean of Kang & Schafer (2007, Equation (10)). Step 2 follows Rubin & van der Laan's (2008) prescription to compute, separately for each k, an estimator η̃k of η targeted at minimizing the asymptotic variance of ϕk{β̂(η, α̂)} if model (5) is correct. Under this assumption, the empirical mean in (12) is a consistent estimator of the asymptotic variance of ϕk{β̂(η, α̂)}. Step 3 simply repeats Step 1, after re-estimating the missingness probabilities under the extended model (13). The subscripts in η̃or and η̂or are a reminder that, when either is substituted for η in β̂(η, α̂), it yields an estimator solving outcome regression estimating equations.
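For concreteness, a sketch of the Step 2 minimization for one scalar target follows; the arrays M and S and the map η ↦ U(η) are assumed to have been formed from the displayed definitions, and the use of scipy and a derivative-free optimizer are our choices rather than the authors'.

```python
# Sketch: minimize the empirical variance criterion (12) over eta (K = 1, I-hat = 1).
import numpy as np
from scipy.optimize import minimize

def step2_eta(M, S, U_of_eta, eta_init):
    """M: n-vector; S: n x dim(alpha) score matrix; U_of_eta: eta -> n-vector."""
    def criterion(eta):
        r = M - U_of_eta(eta)                          # M - U(eta)
        rho = np.linalg.lstsq(S, r, rcond=None)[0]     # rho-hat_eta by least squares
        return np.mean((r - S @ rho) ** 2)             # E_n([{M - U - rho'S}]^2)
    return minimize(criterion, eta_init, method="Nelder-Mead").x
```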

The vector θ̂ = {α̂, β̂or, η̂or, η̃or, β̂dr, α̃, δ̃, (η̃k, ρ̂η̃k)1⩽k⩽K} solves a system of estimating equations En{Ψ(X; θ)} = 0. Under the conditions of van der Vaart (2000, Theorems 5.9 and 5.21), θ̂ − θ* = −Vθ*⁻¹ En{Ψ(X; θ*)} + op(n⁻¹ᐟ²), for θ* satisfying E{Ψ(X; θ*)} = 0 and Vθ = ∂E{Ψ(X; θ)}/∂θ. In what follows, we will assume that these conditions hold and argue that the estimator β̂dr satisfies Properties 4–8.

For Property 4, the estimator η̃or converges in probability to a solution of the joint equations E[ω(α*, δ*)b(Z){Y − m(L; η)}] = 0 and E[Ad(L; η){Y − m(L; η)}] = 0, where (α*, δ*) = plim (α̃, δ̃). If model (3) is correct, η0 is a solution to this system. Thus β̂dr, being equal to the estimator β̂reg(η̃or), is consistent for β0 and asymptotically normal if model (3) is correct and the preceding joint population equations have a unique solution. On the other hand, β̂dr is equal to β̂{η̃or, (α̃, δ̃)} solving (7) with η = η̃or and α = (α̃, δ̃), because (7) is the same as the equations

En[b(Z){m(L; η) − h(Z; β)}] + En[ω(α)b(Z){Y − m(L; η)}] = 0, (14)

and, by construction of β̂dr and η̃or, En[b(Z){m(L; η̃or) − h(Z; β̂dr)}] = 0 and En[ω(α̃, δ̃)b(Z){Y − m(L; η̃or)}] = 0. When model (5) is correct, model (13) is also correct with a true value of δ equal to 0. Consequently, when (5) is correct, β̂{η̃or, (α̃, δ̃)}, just as any estimator solving (7), is consistent for β0 and asymptotically normal.

Property 5 holds because, as indicated above, β̂dr solves equation (7) with η and α replaced by estimators that converge to the true parameter values when the models they index are correct.

Property 6 holds by construction.

For Property 7, under model (5), β̂(η, α̂) satisfies the expansion

β̂(η, α̂) − β0 = En[I{M0 − U0(η) − ρηᵀS(α0)}] + op(n⁻¹ᐟ²),

where M0 = M(α0, β0), U0(η) = U(η, α0, β0), I = E{∂M(α0, β) / ∂βT|β0}−1 and

ρηᵀ = E[{M0 − U0(η)}S(α0)ᵀ] E{S(α0)S(α0)ᵀ}⁻¹

is the population least squares coefficient in the multivariate regression of M0 − U0(η) on the components of the score vector S(α0) (Robins et al., 1994). Thus, an application of the delta method yields that

avar[ϕk{β̂(η, α̂)}] = σ²k(η) = min_ρ E([Ik{M0 − U0(η) − ρᵀS(α0)}]²), (15)

where Ik = ∂ϕk(β)/∂βᵀ|β0 E{∂M(α0, β)/∂βᵀ|β0}⁻¹, and in the sequel avar(·) stands for the variance of the limiting distribution of its argument. As the dimension of the vector α indexing the missingness model increases, so does the dimension of S(α0), and consequently σ²k(η) decreases. With this in mind, we construct in Step 3 an augmented missingness model, enlarging model (5) with just the right additional covariates so as to ensure that the resulting estimator of β0 is asymptotically at least as efficient as ϕk{β̂(η*k, α̂)} (k = 1, …, K), where

η*k = argmin_η σ²k(η).

Specifically, when model (5) is correct, so too is the enlarged model (13) with true parameter values α0 and δ0,k = 0 (k = 1, …, K + 1). It then follows from (8) and the fact that β̂dr = β̂ {η̃or, (α̃, δ̃)} that

β̂dr − β0 = En[I{M0 − U0(η**or) − ρ*ᵀS*α(α0, δ0) − Σ_{k=1}^{K+1} ν*kᵀS*δk(α0, δ0)}] + op(n⁻¹ᐟ²), (16)

where η**or = plim η̃or, S*α(α, δ) = ∂ log f*(A | L; α, δ)/∂α and S*δk(α, δ) = ∂ log f*(A | L; α, δ)/∂δk are the scores in the model

f*(A | L; α, δ) = c*(L; α, δ) f(A | L; α) exp{Σ_{k=1}^{K} δkᵀU0(η*k) + δK+1ᵀU0(η*or)}, (17)

with c*(L; α, δ) the normalizing constant and η*or = plim η̂or. Furthermore, (ρ*ᵀ, ν*1ᵀ, …, ν*K+1ᵀ) is the population least squares coefficient in the regression of M0 − U0(η**or) on S*α(α0, δ0), S*δ1(α0, δ0), …, S*δK+1(α0, δ0). Now, because of the precise form of model (17), it holds that S*α(α0, δ0) = S(α0), S*δk(α0, δ0) = U0(η*k) (k = 1, …, K), and S*δK+1(α0, δ0) = U0(η*or). Furthermore, U0(η**or) = U0(η*or) because, as argued in the Appendix, when model (5) holds, η*or = η**or. Thus, expansion (16) reduces to

β̂dr − β0 = En[I{M0 − ρ*ᵀS(α0) − Σ_{k=1}^{K} ν*kᵀU0(η*k) − ν*K+1ᵀU0(η*or)}] + op(n⁻¹ᐟ²), (18)

with (ρ*ᵀ, ν*1ᵀ, …, ν*K+1ᵀ) re-defined as the population least squares coefficient in the regression of M0 on S(α0), U0(η*1), …, U0(η*K) and U0(η*or).

An application of the delta method then yields

avar{ϕk(β̂dr)} = min_{(ρ,ν)} E([Ik{M0 − ρᵀS(α0) − Σ_{k=1}^{K} νkᵀU0(η*k) − νK+1ᵀU0(η*or)}]²). (19)

This shows that avar{ϕk(β̂dr)} ⩽ avar[ϕk{β̂(η, α̂)}] for any η: by definition of η*k, the smallest possible avar[ϕk{β̂(η, α̂)}] is the right-hand side of (15) evaluated at η*k, and this quantity is greater than or equal to the right-hand side of the last display, which is a minimum over a larger set.

Property 8 follows by noticing that

β̂ipw − β0 = En[I{M0 − ρ**ᵀS(α0)}] + op(n⁻¹ᐟ²),

where M0 − ρ**ᵀS(α0) is the residual from the population regression of M0 on S(α0). This residual has variance larger than or equal to, in the positive definite sense, that of the residual between curly brackets in (18), as the latter is the residual from the regression of M0 on covariates that include S(α0).

The following remarks help clarify our construction. First, Step 1 is needed in order to include the covariate uK+1(A, L) in model (13). Without this covariate, the asymptotic variance of ϕk(β̂dr) would be equal to the variance of the expression between square brackets in (16) with Ik instead of I and without the term ν*K+1ᵀS*δK+1(α0, δ0). But this variance would not necessarily be smaller than the right-hand side of (15) evaluated at η*k. Second, we can modify Step 3 to yield β̂dr additionally as efficient as β̂g,aipw for any specified g by simply adding the term δK+2ᵀuK+2(A, L), where uK+2(A, L) = g(A, L) − Eα̂{g(A, L) | L}, to the exponential tilt in model (13). Third, the computation of η̃or in Step 3 depends on (α̃, δ̃) only through the f(Ai | Li; α̃, δ̃) (i = 1, …, n). If the outcome regression model (3) is correctly specified, then some or all of the uk(A, L) (k = 1, …, K + 1) may converge in probability to the same function of (L, A) and thus they may be highly collinear. In such a case, δ̃ may not be unique. However, Σ_{k=1}^{K+1} δ̃kᵀuk(A, L), and hence f(A | L; α̃, δ̃), will still be unique. Formula (19) for the asymptotic variance of ϕk(β̂dr) remains valid with the provision that some or all of the U0(η*k) (k = 1, …, K) and U0(η*or) may be equal. This provision does not invalidate the preceding arguments justifying that β̂dr satisfies Properties 7 and 8.

Standard empirical sandwich variance techniques could in principle be used to derive an estimator that is consistent for the asymptotic variance of β̂dr regardless of the validity of models (5) or (3), because, as indicated earlier, θ̂ = {α̂, β̂or, η̂or,η̃or, β̂dr, α̃, δ̃, (η̃k, ρ̂η̃k)1⩽k⩽K} solves an estimating equation En{Ψ (θ)} = 0. However, finding the analytic expression for Ψ would be cumbersome. Nevertheless, the nonparametric bootstrap can be used to compute a consistent variance estimator of β̂dr because θ̂ is regular and asymptotically linear (Gill, 1989).
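A minimal sketch of that bootstrap recipe is given below; estimate_beta_dr stands for the entire three-step procedure applied to a resampled dataset and is an assumed user-supplied function, not something defined in the paper.

```python
# Sketch: nonparametric bootstrap variance estimator for beta-dr.
import numpy as np

def bootstrap_var(data, estimate_beta_dr, B=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [estimate_beta_dr(data[rng.integers(0, n, n)]) for _ in range(B)]
    return np.var(reps, ddof=1)
```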

Example 1. Consider estimation of β0 = E(Y) with Y binary. In this case, h(β) = β and Z is absent in model (1). Suppose we assume that (5) and (3) are logistic regressions with covariate L and intercept. The score for α is S(α) = expit(αᵀL̃){ω(α) − 1}L̃. The function b(·) in equation (4) is a scalar constant since Z is absent. All bs yield the same estimator of β0, so we assume b = 1. In Step 1, η̂or solves (10) and En[AL̃r{Y − expit(ηᵀL̃)}] = 0 (r = 1, …, q − 1), where L̃r is the rth component of L̃, yielding β̂or = En{expit(η̂orᵀL̃)}. In Step 2, K = 1, ϕ1(β) = β and I1 = 1. Furthermore, U(η, α, β) = {ω(α) − 1}{expit(ηᵀL̃) − β}. Consequently,

η̃1 = argmin_η En([ω(α̂)Y − {ω(α̂) − 1}{expit(ηᵀL̃) + ρ̂ηᵀL̃ expit(α̂ᵀL̃)} − β̂or]²).

Model (13) of Step 3 is a logistic regression with intercept and covariates L, x1(L) = expit(α̂ᵀL̃)⁻¹{expit(η̃1ᵀL̃) − β̂or} and x2(L) = expit(α̂ᵀL̃)⁻¹{expit(η̂orᵀL̃) − β̂or}. The estimator η̃or is computed just like η̂or, except that in equation (10) ω(α̂) is replaced by ω(α̃, δ̃) = A expit{α̃ᵀL̃ + δ̃1x1(L) + δ̃2x2(L)}⁻¹. Finally, β̂dr = En{expit(η̃orᵀL̃)}, which has asymptotic variance under model (5),

avar(β̂dr) = min_{(ρ,ν)} E[{M0 − ρᵀS(α0) − ν1U0(η*1) − ν2U0(η*or)}²],

with M0 = ω(α0)(Y − β0), U0(η*j) = {ω(α0) − 1}{expit(η*jᵀL̃) − β0} and η*j = plim η̃j for j = 1, or. Interestingly, the estimator of Cao et al. (2009) has asymptotic variance min_ρ E[{M0 − ρᵀS(α0) − U0(η*1)}²], which is generally strictly larger than avar(β̂dr) due to the nonlinear dependence of U0(η*1) on η*1.
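To make Example 1 concrete, here is a sketch of the complete three-step algorithm for a binary mean with logistic working models. It is our reading of the algorithm, not the authors' code; all function names are assumptions, and the root-finding and optimization routines are implementation choices.

```python
# Sketch: beta-dr of Section 2.2 in the setting of Example 1.
import numpy as np
from scipy.optimize import minimize, root

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def logit_mle(X, a):
    # maximum likelihood for a logistic regression of a on X (X includes the intercept)
    nll = lambda c: np.mean(np.logaddexp(0.0, X @ c) - a * (X @ c))
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

def step1(y, Lt, a, w):
    # solve (10) with b = 1 jointly with the q - 1 equations (11)
    y0 = np.nan_to_num(y)
    def score(eta):
        r = y0 - expit(Lt @ eta)
        return np.concatenate([[np.mean(w * r)], (a * r) @ Lt[:, :-1] / len(a)])
    eta = root(score, np.zeros(Lt.shape[1])).x
    return eta, np.mean(expit(Lt @ eta))                 # (eta-hat, beta-hat)

def beta_dr_example1(y, L, a):
    n = len(a)
    Lt = np.column_stack([np.ones(n), L])                # L-tilde = (1, L')'
    alpha = logit_mle(Lt, a)                             # alpha-hat for model (5)
    pi = expit(Lt @ alpha)
    w = a / pi                                           # omega(alpha-hat)
    eta_or, beta_or = step1(y, Lt, a, w)                 # Step 1
    y0 = np.nan_to_num(y)
    M = w * (y0 - beta_or)                               # M(alpha-hat, beta-or)
    S = (pi * (w - 1.0))[:, None] * Lt                   # score of model (5)
    U = lambda eta: (w - 1.0) * (expit(Lt @ eta) - beta_or)
    def criterion(eta):                                  # Step 2, criterion (12)
        r = M - U(eta)
        rho = np.linalg.lstsq(S, r, rcond=None)[0]
        return np.mean((r - S @ rho) ** 2)
    eta1 = minimize(criterion, eta_or, method="Nelder-Mead").x   # eta-tilde_1
    x1 = (expit(Lt @ eta1) - beta_or) / pi               # Step 3: augmented model (13)
    x2 = (expit(Lt @ eta_or) - beta_or) / pi
    Xaug = np.column_stack([Lt, x1, x2])
    gamma = logit_mle(Xaug, a)                           # (alpha-tilde, delta-tilde)
    w_aug = a / expit(Xaug @ gamma)                      # omega(alpha-tilde, delta-tilde)
    _, beta_dr = step1(y, Lt, a, w_aug)                  # re-solve the Step 1 equations
    return beta_dr
```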

3. Causal inference in point-exposure studies

We now consider estimation of marginal structural mean models for causal inference in point exposure observational studies. Suppose that in an observational study with n subjects drawn at random from a population of interest, we observe (Ai, Yi, Liᵀ) (i = 1, …, n), independent and identically distributed across i, where Yi is a scalar outcome, Ai is a treatment variable taking values in a set 𝒜 and Li is a vector of pre-treatment confounding variables. For each a ∈ 𝒜, let Y(a),i be the potential outcome of the ith unit under treatment a. We make the usual consistency assumption that Y(Ai),i = Yi. Suppose that Zi is a subvector of Li, possibly empty. A marginal structural mean model postulates that

E(Y(a)|Z)=h(a,Z;β0),

where h(·; ·) is a known smooth function and β0 is an unknown p × 1 parameter vector (Robins, 1999). For example, h(a, Z; β) = β1 + β2a + β3aZ. We examine estimation of β0 from data (Ai, Yi, Liᵀ) (i = 1, …, n), under the no-unmeasured-confounders assumption, which states that Y(a) and A are conditionally independent given L for all a ∈ 𝒜. Under this assumption, regarding Y(a) as an outcome variable that is observed and equal to Y only in units that received treatment A equal to a, it holds, as in § 2.1, that E(Y(a) | Z) = E{E(Y | L, A = a) | Z} = E(ωaY | Z), where ωa = Ia(A) f(A | L)⁻¹ and Ia(A) = 1 if A = a and Ia(A) = 0 otherwise.

In this setting, an outcome regression estimator β̂reg(η̂) is the solution of

∫𝒜 En[b(a, Z){m(a, L; η̂) − h(a, Z; β)}] dμ(a) = 0, (20)

where b(a, z) is a p × 1 user-specified vector function, μ denotes the Lebesgue measure when A is continuous and the counting measure if A is discrete, and η̂ is an estimator of η0 under a postulated outcome regression model

E(Y|L,A)=m(A,L;η0), (21)

which simultaneously parameterizes the separate outcome regressions E(Y | L, A = a) for all a ∈ 𝒜. If model (21) is correct, then β̂reg(η̂) solving (20) is consistent for β0 and asymptotically normal provided η̂ is consistent for η0 and asymptotically normal. Alternatively, positing a model (5) for the treatment mechanism we can consider estimating equations (6) and (7) where b(Z), m(L; η), h(Z; β) are replaced with b(A, Z), m(A, L; η), h(A, Z; β), η̂ is now an estimator that is consistent for η0 when model (21) is correct and with the re-definitions

ω = ωA = f(A | L)⁻¹,  ω(α) = f(A | L; α)⁻¹.

With these modifications, we obtain the estimators α̂, β̂g,aipw, β̂ipw, β̂(η, α) and β̂(η̂, α̂) as in § 2.1. Robins (1999) showed that (8) holds and Scharfstein et al. (1999) showed that β̂(η̂, α̂) satisfies Properties 4 and 5 of § 2.1, where the words missingness model are replaced by treatment model and the outcome regression model is (21) rather than (3).
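As a simple instance of these re-definitions, the sketch below computes the inverse probability weighted estimator β̂ipw of a marginal structural mean model h(a, Z; β) = β1 + β2a for a binary treatment; the Newton iteration for the logistic propensity model and all names are our own illustrative choices.

```python
# Sketch: IPW fit of the marginal structural mean model E{Y(a)} = beta1 + beta2 * a.
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def msm_ipw(y, L, a):
    n = len(a)
    Lt = np.column_stack([np.ones(n), L])
    alpha = np.zeros(Lt.shape[1])
    for _ in range(25):                                  # Newton steps for the logistic MLE
        p = expit(Lt @ alpha)
        alpha += np.linalg.solve((Lt * (p * (1 - p))[:, None]).T @ Lt, Lt.T @ (a - p))
    e = expit(Lt @ alpha)
    w = a / e + (1 - a) / (1 - e)                        # omega = f(A | L)^{-1}
    X = np.column_stack([np.ones(n), a])                 # b(A, Z) = (1, A)'
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))  # weighted least squares
```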

As in the missing data case, standard double-robust estimators β̂ (η̂, α̂) with η̂ the ordinary or iteratively reweighted least squares estimator of η0 based on all sampled units may not exist, because equation (7) evaluated at η̂ and α̂ may not have a solution. Furthermore, β̂ (η̂, α̂) does not have the advantageous efficiency Properties 7 and 8 of § 2.2. In the present setting, we can define the estimator β̂dr identically as in § 2.2, but with the re-definitions and replacements indicated in the preceding paragraph. Arguments identical to those given in § 2.2 imply that β̂dr satisfies Properties 4–8 of § 2.2 with the outcome regression equation referred to in Property 6 being now (20).

4. Monotone missing data in longitudinal studies

We now turn to double-robust estimation in regression models for longitudinal data with drop-out. The intended data are n independent and identically distributed copies of L̄J = (L0ᵀ, L1ᵀ, …, LJᵀ)ᵀ, where B̄j denotes (B0, …, Bj) throughout. The vector Lj denotes the data we intend to measure at the jth occasion on a sample unit. Let C denote the drop-out occasion: C = j on a given sample unit if and only if we observe L̄j for that unit. We assume L0 is always observed, so C > 0. The outcome of interest Y is r(L̄J), where r(·) is a user-specified function which for simplicity we assume is scalar valued; for example, r(L̄J) is some component of the vector LJ. The goal is to estimate the p × 1 parameter vector β0 of regression model (1), where Z is a subvector of L0, possibly empty, from n independent and identically distributed copies of (C, L̄Cᵀ) under the missing at random assumption

f(Aj | Āj−1, L̄J) = f(Aj | Āj−1, L̄j−1)  (j = 1, …, J),

where Aj is the on-study indicator, i.e., Aj = 1 if C ⩾ j and Aj = 0 otherwise. Provided pr(Aj = 1 | Aj−1 = 1, L̄j−1) > 0 (j = 1, …, J), E(Y | Z) can be expressed in two different ways (Bang & Robins, 2005), each leading to a different estimation strategy,

E(Y|Z)=E(H0|Z)=E(ωJY|Z),

with H0 defined from the recursion HJ = Y, and Hj−1 = E(Hj | Aj = 1, L̄j−1) (j = J, …, 1), and

ωj = Aj × {f(A1 | Ā0, L̄0) × ⋯ × f(Aj | Āj−1, L̄j−1)}⁻¹  (j = 1, …, J).

To generalize β̂reg(η̂) to the longitudinal setting, we posit outcome regression models

E(Hj | Aj = 1, L̄j−1) = mj(L̄j−1; η0)  (j = 1, …, J), (22)

with η0 an unknown q × 1 parameter vector and mj(·; ·) known functions, for instance mj(L̄j−1; η) = Φ⁻¹{ηᵀsj(L̄j−1)} for some link function Φ(·) and some user-specified functions sj(·), and we estimate η0 by some η̂, for example, by η̂ solving

En[AJ dJ(L̄J−1; η){Y − mJ(L̄J−1; η)} + Σ_{j=1}^{J−1} Aj dj(L̄j−1; η){mj+1(L̄j; η) − mj(L̄j−1; η)}] = 0, (23)

where dj(L̄j−1; η) = ∂mj(L̄j−1; η)/∂η and Σ_{j=1}^{J−1}(·) ≡ 0 if J = 1. The outcome regression estimator β̂reg(η̂) then solves (4) with m(L; η) re-defined as m1(L̄0; η) here and throughout this section. If (22) holds, β̂reg(η̂) − β0 = Op(n⁻¹ᐟ²) provided η̂ solves (23) or, more generally, provided η̂ − η0 = Op(n⁻¹ᐟ²).
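The recursion defining H0 suggests the following sketch of the sequential outcome-regression estimator of β0 = E(Y) under monotone drop-out, assuming linear working models (22) fitted by least squares; the data layout and all names are assumptions made for illustration.

```python
# Sketch: sequential regression estimator of E(Y) with monotone drop-out.
import numpy as np

def beta_reg_longitudinal(y, L_hist, A):
    # L_hist[j] stacks the covariate history L-bar_j per unit (n x dim_j);
    # A is n x J with columns A_1, ..., A_J; y observed where A[:, -1] == 1.
    J = A.shape[1]
    H = np.nan_to_num(y)                                          # H_J = Y
    for j in range(J, 0, -1):                                     # j = J, ..., 1
        X = np.column_stack([np.ones(len(H)), L_hist[j - 1]])     # regressors: L-bar_{j-1}
        on = A[:, j - 1] == 1
        coef = np.linalg.lstsq(X[on], H[on], rcond=None)[0]       # fit m_j on A_j = 1 units
        H = X @ coef                                              # H_{j-1} = m_j(L-bar_{j-1})
    return np.mean(H)                                             # E_n{m_1(L-bar_0; eta-hat)}
```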

Alternatively, to generalize β̂g,aipw to the longitudinal setting, we posit drop-out models

f(Aj | Āj−1, L̄j−1) = Aj−1 f(Aj | Aj−1 = 1, L̄j−1; α0)  (j = 1, …, J) (24)

for instance, f(Aj | Aj−1 = 1, L̄j−1; α) = expit{αᵀtj(L̄j−1)} for some user-specified functions tj(·), and we compute the maximum likelihood estimator α̂ of α0. Then, we compute β̂g,aipw solving, with α evaluated at α̂, the equation

En[ωJ(α)b(Z){Y − h(Z; β)}] − En[Σ_{j=1}^{J} gj(Āj, L̄j−1) − Eα{gj(Āj, L̄j−1) | Āj−1, L̄j−1}] = 0

for some p × 1 vector functions g1(·), …, gJ (·) and b(·). Under regularity conditions, β̂g,aipw is consistent for β0 and asymptotically normal (Robins & Rotnitzky, 1995) when model (24) holds with asymptotic variance no smaller than that of β̂gopt,aipw where gopt,jb(Āj, L̄j−1) = ωj b(Z){E(Hj | Aj = 1, L̄j−1) − h(Z; β0)} (Robins et al., 1994). The extension of β̂ (η, α) to the longitudinal setting solves

En[ωJ(α)b(Z){Y − h(Z; β)}] − En{Σ_{j=1}^{J} Gᵇη,α,β,j − Eα(Gᵇη,α,β,j | Āj−1, L̄j−1)} = 0, (25)

with

Gᵇη,α,β,j ≡ gᵇη,α,β,j(Āj, L̄j−1) = ωj(α)b(Z){mj(L̄j−1; η) − h(Z; β)},

and ωj (α) = Aj × {f (A1 | Ā0, L̄0; α) × ⋯ × f (Aj | Āj−1, L̄j−1; α)}−1. The estimator β̂(η̂, α̂) satisfies (8) and Properties 4 and 5 with models (24) and (22) instead of (5) and (3) (Robins, 2000). However, as in § 2, for most estimators η̂, e.g., for η̂ solving (23), equation (25) evaluated at η̂ and α̂ may not have a solution. Furthermore, β̂ (η̂, α̂) does not satisfy the efficiency Properties 7 and 8.

An estimator β̂dr satisfying Properties 4–8 of § 2.1, with models (24) and (22) in place of (5) and (3), can be constructed by implementing the following generalization of the three-step procedure of § 2.2 to the longitudinal setting.

Step 1. Compute β̂or = β̂reg(η̂or) solving (4), in which m(L; η) is re-defined as m1(L̄0; η), with η = η̂or, the estimator of the q × 1 parameter vector η of model (22) (q ⩾ p) solving the p equations

En(b(Z)[Σ_{j=1}^{J−1} ωj(α̂){mj+1(L̄j; η) − mj(L̄j−1; η)} + ωJ(α̂){Y − mJ(L̄J−1; η)}]) = 0,

and the first q − p equations in the system (23).

Step 2. For each k compute

η̃k = argmin_η En([Îk{M(α̂, β̂or) − U(η, α̂, β̂or) − ρ̂ηᵀS(α̂)}]²),

where U(η, α, β) = Σ_{j=1}^{J} Uj(η, α, β) with

Uj(η, α, β) = gᵇη,α,β,j(Āj, L̄j−1) − Eα{gᵇη,α,β,j(Āj, L̄j−1) | Āj−1, L̄j−1},
S(α) = Σ_{j=1}^{J} ∂ log f(Aj | Āj−1, L̄j−1; α)/∂α,  M(α, β) = ωJ(α)b(Z){Y − h(Z; β)},
Îk = ∂ϕk(β)/∂βᵀ|β̂or En{∂M(α̂, β)/∂βᵀ|β̂or}⁻¹,
ρ̂ηᵀ = En[{M(α̂, β̂or) − U(η, α̂, β̂or)}S(α̂)ᵀ][En{S(α̂)S(α̂)ᵀ}]⁻¹.

Step 3. Compute the maximum likelihood estimator (α̃, δ̃) of (α, δ) in the augmented dropout models (j = 1, …, J)

f(Aj | Āj−1, L̄j−1; α, δ) = Cj(α, δ) f(Aj | Āj−1, L̄j−1; α) exp{Σ_{k=1}^{K+1} δk,jᵀuk,j(Āj, L̄j−1)},

where uk,j(Āj, L̄j−1) = Uj(η̃k, α̂, β̂or) (k = 1, …, K), uK+1,j(Āj, L̄j−1) = Uj(η̂or, α̂, β̂or) and Cj(α, δ) = cj(Āj−1, L̄j−1; α, δ) is the normalizing constant. Estimate η by η̃or, computed just like η̂or but with ωj(α̂) replaced with ωj(α̃, δ̃). Finally, compute β̂dr = β̂reg(η̃or) solving (4) with m(L; η) re-defined as m1(L̄0; η) and with η = η̃or.

An argument similar to that in § 2.2 shows that, under regularity conditions, the estimator β̂dr from the preceding algorithm satisfies Properties 4–8 with models (24) and (22) instead of (5) and (3) and with m1(L̄0; η) instead of m(L; η) in equation (4). The key points are: β̂dr is consistent for β0 and asymptotically normal under model (22) because it is of the form β̂reg(η̃or), and η̃or is consistent for η0 and asymptotically normal under (22). The estimator β̂dr is also consistent for β0 and asymptotically normal under model (24) because it is equal to β̂{η̃or, (α̃, δ̃)} solving equation (25) since, as argued in the Appendix, (25) is the same as

0 = En[b(Z){m1(L̄0; η) − h(Z; β)}] + En(b(Z)[Σ_{j=1}^{J−1} ωj(α){mj+1(L̄j; η) − mj(L̄j−1; η)} + ωJ(α){Y − mJ(L̄J−1; η)}]), (26)

and the right-hand side evaluates to zero when β, η and α are replaced with β̂dr, η̃or and (α̃, δ̃). The rest of the argument follows with the re-definition S*δk(α0, δ0) = {S*δk,1(α0, δ0)ᵀ, …, S*δk,J(α0, δ0)ᵀ}ᵀ, where S*α(α, δ) and S*δk,j(α, δ) are the scores for α and δk,j in the models

f*(Aj | Āj−1, L̄j−1; α, δ) = C*j(α, δ) f(Aj | Āj−1, L̄j−1; α) × exp{Σ_{k=1}^{K} δk,jᵀU0,j(η*k) + δK+1,jᵀU0,j(η*or)}  (j = 1, …, J),

with C*j(α, δ) = c*j(Āj−1, L̄j−1; α, δ) the normalizing constant, U0,j(η*k) = Uj(η*k, α0, β0) = S*δk,j(α0, δ0) (k = 1, …, K) and U0,j(η*or) = Uj(η*or, α0, β0) = S*δK+1,j(α0, δ0).

5. Simulation studies

We carried out four simulation experiments to assess the performance of β̂dr for estimation of β0 = E(Y) with sample sizes n = 200, 1000. In each experiment, we generated 1000 Monte Carlo datasets and computed, in addition to β̂dr, the estimators β̂reg(η̂), β̂(η̂, α̂), β̂ipw and β̂Cao, the estimator called μ̂PROJ in Cao et al. (2009).

In the first two experiments, we generated data as in Kang & Schafer (2007): (L1, …, L4, ε) ∼ N(0, I5), where I5 is the 5 × 5 identity matrix, Y = 210 + 27.4L1 + 13.7 Σ_{s=2}^{4} Ls + ε and A ∼ Ber{expit(−L1 + 0.5L2 − 0.25L3 − 0.1L4)}. As in Kang & Schafer (2007), we computed X = (X1, …, X4), where X1 = exp(L1/2), X2 = L2/{1 + exp(L1)} + 10, X3 = (L1L3/25 + 0.6)³ and X4 = (L2 + L4 + 20)². In the first experiment, we assumed that the observed data were (A, AY, L). We computed the estimators using the same outcome and missingness models as in Kang & Schafer (2007). The first, correctly specified, outcome and missingness models were, respectively, an additive linear regression of Y on L with intercept and a logistic regression with intercept, covariate L and outcome A. The second models, incorrectly specified, were as the first ones, but with covariate X instead of L. In the second experiment, as in Robins et al. (2007), we recoded A as 1 − A and replicated the first experiment. We conducted this experiment because Robins et al. (2007) noted that the favourable performance of β̂reg(η̂) compared with β̂(η̂, α̂) reported by Kang & Schafer (2007) was reversed when the observed data were {1 − A, (1 − A)Y, L}; thus, our study includes scenarios favourable to β̂reg(η̂) and to β̂(η̂, α̂).

In the last two experiments, we generated L, X and A as in the first experiment, but Y ∼ Ber{expit(−60 + 27.4L1 + 13.7 Σ_{s=2}^{4} Ls)}, yielding E(Y) ≈ 0.0496. We purposely chose to simulate a rare outcome Y, as we wanted to examine the performance of β̂dr in a setting where β̂(η̂, α̂) had some nontrivial probability of falling outside the parameter space. Our estimators used the same correct and incorrect missingness models as in the first experiment and two logistic regression models for Y: the first, correctly specified, with intercept and covariate L, and the second, incorrectly specified, with intercept and covariate X. The fourth experiment differed from the third in that A was recoded as 1 − A.
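For reference, here is a sketch of the Kang & Schafer data-generating mechanism used in these experiments, including the misspecification covariates X; the function name and seed handling are arbitrary, and the binary-outcome variant of the last two experiments is obtained by replacing Y with the stated Bernoulli draw.

```python
# Sketch: Kang & Schafer (2007) simulation design with Y observed iff A = 1.
import numpy as np

def kang_schafer(n, rng=None):
    rng = rng or np.random.default_rng(0)
    L = rng.standard_normal((n, 4))
    y = 210 + 27.4 * L[:, 0] + 13.7 * (L[:, 1] + L[:, 2] + L[:, 3]) + rng.standard_normal(n)
    p = 1 / (1 + np.exp(L[:, 0] - 0.5 * L[:, 1] + 0.25 * L[:, 2] + 0.1 * L[:, 3]))
    a = rng.binomial(1, p)                                   # A ~ Ber{expit(-L1 + 0.5L2 - ...)}
    X = np.column_stack([np.exp(L[:, 0] / 2),
                         L[:, 1] / (1 + np.exp(L[:, 0])) + 10,
                         (L[:, 0] * L[:, 2] / 25 + 0.6) ** 3,
                         (L[:, 1] + L[:, 3] + 20) ** 2])     # misspecification covariates
    y = np.where(a == 1, y, np.nan)                          # Y observed iff A = 1
    return y, L, X, a
```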

The estimator η̂ was the ordinary least squares estimator and the standard logistic regression estimator in the first two and last two experiments, respectively. The estimators η̂or and η̃or were computed as η̂ except that each unit was weighted by the inverse of specific estimates of the missingness probabilities as described in Steps 1 and 3 of the procedure in § 2.2.

Tables 1 and 2 report results for continuous and binary Y, respectively, and provide Monte Carlo estimates of the bias, root mean square error and median absolute error of the estimators of β0. Bootstrap estimators of their Monte Carlo standard errors can be found in Tables 3 and 4 of the Supplementary Material.

Table 1.

Monte Carlo study of the performance of the proposed estimator with outcome normally distributed and missing at random

          Miss-C, OR-C        Miss-I, OR-C        Miss-C, OR-I        Miss-I, OR-I
          Bias RMSE MAE       Bias RMSE MAE       Bias RMSE MAE       Bias RMSE MAE
n = 200, Y observed iff A = 1
β̂reg(η̂) 0.07 2.54 1.77 0.07 2.54 1.77 −0.51 3.38 2.36 −0.51 3.38 2.36
β̂(η̂, α̂) 0.07 2.54 1.78 0.07 2.60 1.79 0.38 3.50 2.30 −7.71 44.28 3.59
β̂ipw −0.09 4.16 2.54 2.07 10.99 3.20 −0.09 4.16 2.54 2.07 10.29 3.20
β̂Cao 0.07 2.54 1.79 0.06 2.53 1.78 0.08 2.59 1.82 −0.41 3.51 2.14
β̂dr 0.07 2.54 1.78 0.06 2.54 1.77 0.33 2.92 2.05 −1.71 3.58 2.42
n = 1000, Y observed iff A = 1
β̂reg(η̂) 0.01 1.19 0.82 0.01 1.19 0.82 −0.82 1.70 1.19 −0.82 1.70 1.19
β̂(η̂, α̂) 0.01 1.19 0.82 0.03 1.20 0.81 0.08 1.59 1.08 −9.78 23.55 5.24
β̂ipw −0.01 1.68 1.14 4.40 9.20 2.55 −0.01 1.68 1.14 4.40 9.20 2.55
β̂Cao 0.01 1.19 0.81 0.01 1.19 0.81 0.04 1.19 0.84 −1.26 1.81 1.34
β̂dr 0.01 1.19 0.81 0.01 1.19 0.81 0.06 1.22 0.86 −2.56 2.91 2.54
n = 200, Y observed iff A = 0
β̂reg(η̂) 0.07 2.54 1.76 0.07 2.54 1.76 5.01 5.79 5.02 5.01 5.79 5.02
β̂(η̂, α̂) 0.07 2.54 1.75 0.07 2.54 1.76 0.53 3.83 2.38 3.25 4.59 3.44
β̂ipw 0.29 3.89 2.47 3.85 5.02 4.09 0.29 3.89 2.47 3.85 5.02 4.09
β̂Cao 0.07 2.55 1.79 0.08 2.54 1.77 0.47 2.61 1.81 0.94 3.21 2.26
β̂dr 0.08 2.54 1.76 0.08 2.54 1.77 1.16 3.01 2.03 2.54 3.89 2.82
n = 1000, Y observed iff A = 0
β̂reg(η̂) 0.01 1.19 0.83 0.01 1.19 0.83 4.93 5.10 4.93 4.93 5.10 4.93
β̂(η̂, α̂) 0.01 1.19 0.83 0.01 1.19 0.83 0.18 1.67 1.15 3.09 3.40 3.09
β̂ipw 0.12 1.68 1.19 3.71 3.98 3.71 0.12 1.68 1.19 3.71 3.98 3.71
β̂Cao 0.01 1.19 0.83 0.01 1.19 0.84 0.14 1.21 0.84 1.12 1.70 1.24
β̂dr 0.01 1.19 0.83 0.01 1.19 0.84 0.29 1.29 0.94 1.47 2.07 1.54

RMSE, root mean square error; MAE, median absolute error; Miss-C and Miss-I (OR-C and OR-I), correct and incorrect missingness (outcome regression) models; β̂reg(η̂), outcome regression estimator; β̂(η̂, α̂), standard double-robust estimator; β̂ipw, inverse probability weighted estimator; β̂Cao, Cao et al. estimator; β̂dr, new double-robust estimator.

Table 2.

Monte Carlo study of the performance of the proposed estimator with binary outcome missing at random

          Miss-C, OR-C        Miss-I, OR-C        Miss-C, OR-I        Miss-I, OR-I
          Bias RMSE MAE       Bias RMSE MAE       Bias RMSE MAE       Bias RMSE MAE
n = 200, Y observed iff A = 1
β̂reg(η̂) −0.42 2.92 1.96 −0.42 2.92 1.96 0.06 2.90 1.95 0.06 2.90 1.95
β̂(η̂, α̂) −0.42 2.92 1.96 −0.42 2.92 1.96 0.07 2.92 1.96 0.04 2.96 1.95
β̂ipw −0.24 4.01 2.56 2.55 9.67 3.31 −0.24 4.01 2.56 2.55 9.67 3.31
β̂Cao −0.21 3.04 1.99 −0.25 3.04 1.96 0.05 2.92 1.98 0.00 2.93 1.96
β̂dr −0.41 2.93 1.96 −0.39 2.93 1.96 0.05 2.90 1.96 0.02 2.91 1.95
n = 1000, Y observed iff A = 1
β̂reg(η̂) −0.02 0.98 0.64 −0.02 0.98 0.64 0.27 1.04 0.67 0.27 1.04 0.67
β̂(η̂, α̂) −0.01 0.95 0.64 0.00 0.95 0.64 0.17 1.09 0.75 −0.42 1.54 0.86
β̂ipw −0.12 1.74 1.11 5.51 11.80 2.89 −0.12 1.74 1.11 5.51 11.80 2.89
β̂Cao 0.07 1.08 0.74 0.08 1.26 0.75 0.25 1.07 0.72 0.19 1.12 0.72
β̂dr −0.02 0.95 0.64 −0.01 0.95 0.64 0.06 1.01 0.71 0.04 1.00 0.67
n = 200, Y observed iff A = 0
β̂reg(η̂) 0.05 1.65 1.04 0.05 1.65 1.04 0.87 2.24 1.46 0.87 2.24 1.46
β̂(η̂, α̂) 0.05 1.65 1.04 0.02 1.65 1.04 0.78 2.19 1.46 0.83 2.23 1.46
β̂ipw −0.02 1.75 1.25 −0.02 1.80 1.25 −0.02 1.75 1.25 −0.02 1.80 1.25
β̂Cao 0.07 1.65 1.04 0.06 1.65 1.04 0.89 2.23 1.46 0.92 2.30 1.46
β̂dr 0.05 1.65 1.04 0.05 1.65 1.04 0.82 2.18 1.43 0.84 2.20 1.44
n = 1000, Y observed iff A = 0
β̂reg(η̂) −0.03 0.71 0.47 −0.03 0.71 0.47 0.39 0.88 0.59 0.39 0.88 0.59
β̂(η̂, α̂) −0.03 0.71 0.48 −0.03 0.71 0.48 0.11 0.85 0.56 0.25 0.84 0.56
β̂ipw −0.02 0.80 0.56 −0.02 0.84 0.56 −0.02 0.80 0.56 −0.02 0.84 0.56
β̂Cao −0.02 0.72 0.49 −0.03 0.73 0.51 0.30 0.86 0.57 0.44 0.97 0.61
β̂dr −0.04 0.71 0.48 −0.04 0.71 0.49 0.15 0.78 0.52 0.26 0.84 0.55

RMSE, root mean square error; MAE, median absolute error. All figures in the table are multiplied by 100. Miss-C and Miss-I (OR-C and OR-I), correct and incorrect missingness (outcome regression); β̂reg(η̂), outcome regression estimator; β̂(η̂, α̂), standard double robust estimator; β̂ipw, inverse probability weighted estimator; β̂Cao, Cao et al. estimator; β̂dr, new double robust estimator.

According to theory, when the outcome model is incorrect and the missingness model is correct, β̂(η̂, α̂), β̂ipw, β̂dr and β̂Cao are consistent, β̂reg(η̂) is inconsistent, and both β̂dr and β̂Cao are asymptotically more efficient than β̂(η̂, α̂) and β̂ipw. The performance observed for n = 1000, as quantified by bias and mean squared error, agrees with the asymptotic results, except that when Y is binary, β̂(η̂, α̂), β̂ipw, β̂dr and β̂Cao are slightly biased. The comparison of the relative biases of β̂dr and β̂Cao depends on the data generating mechanism. For A = 0 and Y continuous, the bias of β̂Cao is smaller than that of β̂dr. On the other hand, for Y binary, β̂dr has substantially smaller bias than β̂Cao. Although not predicted by theory, when the missingness model is incorrect and the outcome model is correct, all double-robust estimators perform as well as β̂reg(η̂) when n = 1000. When both models were incorrect, β̂Cao outperformed β̂dr in Table 1; however, β̂dr outperformed β̂Cao in Table 2. It is not surprising, and indeed predicted by asymptotic theory, that their relative performance varies with the data generating mechanism. Results for n = 200 were qualitatively similar, except that β̂Cao had smaller root mean squared error for Y continuous when the missingness model was correct and the outcome regression model was incorrect.

Finally, when n = 200, A = 1 and Y binary, out of the 1000 replications β̂(η̂, α̂) fell below zero 27, 54, 27 and 56 times, and β̂Cao fell below zero 49, 94, 59 and 87 times, under the four scenarios combining correct and incorrect specifications of the missingness and outcome regression models: both correct; missingness incorrect and outcome correct; missingness correct and outcome incorrect; and both incorrect. In all other cases, β̂(η̂, α̂) and β̂Cao fell between 0 and 1.

6. Concluding remarks

The proposal in this paper relies on the key observation that under the missing at random or the no-unmeasured confounders assumption, efficient estimation of parameters of increasingly larger models for the missingness or treatment probabilities improves the efficiency with which the parameters of models for the full or counterfactual data are estimated. As such, the present proposal can be extended, along the lines of § 4, to the estimation of the parameters of marginal structural mean models and of structural nested mean models for time-dependent treatments in longitudinal studies with time-dependent confounders (Robins, 1999). This extension will be reported elsewhere.

Acknowledgments

Andrea Rotnitzky, Lei Gomez and James Robins were funded by grants from the National Institutes of Health, U.S.A. Andrea Rotnitzky is also affiliated with the Harvard School of Public Health. Mariela Sued was funded by grants from the Agencia de Promocion Cientifica y Tecnica de Argentina and the Consejo Nacional de Investigaciones Cientificas y Tecnicas de Argentina. The authors wish to thank the reviewers for helpful comments.

Appendix

Proof of the equivalence of equations (7) and (14), and of equations (25) and (26). To prove the equivalence of (25) and (26), first note that Gᵇη,α,β,j − Eα(Gᵇη,α,β,j | Āj−1, L̄j−1) = b(Z){mj(L̄j−1; η) − h(Z; β)}{ωj(α) − ωj−1(α)}. Replacing Gᵇη,α,β,j − Eα(Gᵇη,α,β,j | Āj−1, L̄j−1) with this expression in equation (25) and rearranging the terms of the sums on the left-hand side of (25), we arrive at equation (26). The equivalence between (7) and (14) is the special case of the equivalence between (25) and (26) when J = 1.

Sketch of the proof that the estimators η̃or and η̂or converge in probability to the same limit. This follows because, first, η̂or and η̃or solve the same system of q estimating equations except that, to compute η̃or, ω(α̂) is replaced by ω(α̃, δ̃) in the p equations (10), and second, the left-hand side of (10) evaluated at either ω(α̂) or ω(α̃, δ̃) converges in probability to the same expectation.

Supplementary material

Supplementary material available at Biometrika online includes the programme used to run the simulation study and tables with bootstrap estimators of the Monte Carlo standard errors of the reported bias, root mean squared error and mean absolute deviation.

References

  1. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61:962–72. doi: 10.1111/j.1541-0420.2005.00377.x.
  2. Cao W, Tsiatis A, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 2009;96:723–34. doi: 10.1093/biomet/asp033.
  3. Gill RD. Non- and semi-parametric maximum likelihood estimators and the von Mises method. Scand J Statist 1989;16:97–128.
  4. Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data (with discussion and rejoinder). Statist Sci 2007;22:523–80. doi: 10.1214/07-STS227.
  5. Robins JM. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran ME, Berry D, editors. Statistical Models in Epidemiology: The Environment and Clinical Trials. New York: Springer; 1999. pp. 95–134. Institute for Mathematics and its Applications 116.
  6. Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proc Am Statist Assoc Sect Bayesian Statist Sci 2000:6–10.
  7. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Am Statist Assoc 1995;90:122–9.
  8. Robins JM, Wang N. Inference for imputation estimators. Biometrika 2000;87:113–24.
  9. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Statist Assoc 1994;89:846–66.
  10. Robins JM, Sued M, Lei-Gomez Q, Rotnitzky A. Performance of double-robust estimators when inverse probability weights are highly variable. Statist Sci 2007;22:544–59.
  11. Rubin DB. Inference and missing data. Biometrika 1976;63:581–92.
  12. Rubin D, van der Laan MJ. Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. Int J Biostatist 2008;4:article 5.
  13. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric non-response models. J Am Statist Assoc 1999;94:1096–120.
  14. Tan Z. A distributional approach for causal inference using propensity scores. J Am Statist Assoc 2006;101:1619–37.
  15. Tan Z. Understanding OR, PS and DR. Statist Sci 2007;22:560–8.
  16. Tan Z. Comment: improved local efficiency and double robustness. Int J Biostatist 2008;4:article 10. doi: 10.2202/1557-4679.1109.
  17. Tan Z. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 2010a;97:661–82.
  18. Tan Z. Nonparametric likelihood and doubly robust estimating equations for marginal and nested structural models. Can J Statist 2010b;38:609–32.
  19. van der Laan MJ. Targeted maximum likelihood based causal inference: part I. Int J Biostatist 2010;6:article 2. doi: 10.2202/1557-4679.1211.
  20. van der Laan MJ, Robins JM. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer; 2003.
  21. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostatist 2006;2:article 11.
  22. van der Vaart AW. Asymptotic Statistics. Cambridge: Cambridge University Press; 2000.
