Abstract
Recently proposed double-robust estimators for a population mean from incomplete data and for a finite number of counterfactual means can have much higher efficiency than the usual double-robust estimators under misspecification of the outcome model. In this paper, we derive a new class of double-robust estimators for the parameters of regression models with incomplete cross-sectional or longitudinal data, and of marginal structural mean models for cross-sectional data with similar efficiency properties. Unlike the recent proposals, our estimators solve outcome regression estimating equations. In a simulation study, the new estimator shows improvements in variance relative to the standard double-robust estimator that are in agreement with those suggested by asymptotic theory.
Keywords: Drop-out, Marginal structural model, Missing at random
1. Introduction
In a missing data model, an estimator is double-robust if it is consistent when either a model for the missingness mechanism or a model for the full-data law is correctly specified. In a causal inference model, an estimator is double-robust if it is consistent when either a model for the treatment assignment mechanism or a model for the counterfactual data distribution is correctly specified.
Scharfstein et al. (1999) noted that an estimator originally derived in Robins et al. (1994) as the locally efficient estimator in the class of augmented inverse probability weighted estimators for missing data models was double-robust. Since then, many estimators with the double-robust property have been proposed, several of which were recently reviewed by Kang & Schafer (2007) and the discussants of that paper, for the special case of estimating a population mean and the causal effect of a binary treatment.
Rubin & van der Laan (2008) noted that the locally efficient estimator of Robins et al. (1994) can be quite inefficient if the model for the full-data distribution is incorrectly specified. To remedy this, they described a new general approach yielding locally efficient estimators with desirable efficiency properties when the full-data model is incorrectly specified. Tan (2008) and Cao et al. (2009) demonstrated that for estimating a population mean with missing data and unknown missingness probabilities, a particular form of the Rubin and van der Laan procedure yields double-robust estimators. In a recent paper, Tan (2010a) combines that procedure with restricted empirical maximum likelihood estimation to derive new double-robust estimators of a population mean with missing data and of population average treatment effects that have the efficiency properties of the Rubin and van der Laan estimator. A property of the procedure in Tan (2010a) not satisfied by the proposals of Tan (2008) and Cao et al. (2009) is that means are estimated as weighted averages with positive weights. Thus, estimated means always fall in the parameter space and in the range of observed outcomes.
In this paper, we describe a new general approach to constructing locally efficient double robust estimators for the parameters of regression models with outcomes missing at random and for parameters of marginal structural mean models for point exposure studies with continuous or discrete exposures that have the advantageous efficiency properties of the Rubin and van der Laan procedure. For the special case of a population mean, our estimators do not reduce to any of the earlier estimators of Tan (2008, 2010a) and Cao et al. (2009). In fact, unlike these other proposals, our estimators solve outcome regression estimating equations. These are equations identical to the ordinary weighted least squares estimating equations but with the missing outcome or counterfactual response replaced by an estimate of its conditional expectation given baseline covariates. As such, unlike augmented inverse probability weighted estimating equations, our equations always have a solution, as long as their full-data analogues also have a solution and the estimated conditional expectation falls in the sample space of the outcome. In particular, like Tan (2010a), our estimators of a population mean always fall in the parameter range.
Several proposals exist for constructing locally efficient double-robust estimators that solve outcome regression estimating equations. For regression models with missing data and for marginal structural models, these include the procedures in Bang & Robins (2005) and the targeted maximum likelihood methodology (van der Laan & Rubin, 2006; van der Laan, 2010). For a population mean, Kang & Schafer (2007, Equation (10)) described a so-called double-robust weighted least squares outcome regression estimator. These authors reported a simulation study in which this estimator performed better than the Bang and Robins estimator. These approaches have not yet been adapted to provide estimators with the improved efficiency properties of Rubin & van der Laan (2008).
The procedure described here is essentially a generalization to the regression setting of Kang and Shafer’s weighted least squares outcome regression estimator of a population mean. The key innovation is that the weights depend on an augmented model for the missingness or treatment probability which incorporates covariates constructed so as to ensure the advantageous efficiency properties of the Rubin & van der Laan (2008) procedure. Our approach exploits the counterintuitive fact that the efficiency of augmented inverse probability weighted estimators improves as the dimension of the model under which the missingness or treatment probabilities are estimated increases.
2. Cross-sectional studies with missing data
2.1. Existing double-robust estimators
Consider a study in which the intended data, (Yi, Li) (i = 1, …, n), are independent and identically distributed across i, where Yi is a scalar outcome and Li is a vector of additional variables. Assume Li is observed but Yi is missing in a subsample. Let Ai = 1 if Yi is observed and Ai = 0 otherwise. In this section, we consider estimation of the unknown p × 1 parameter vector β0 of the regression model
(1) E(Y | Z) = h(Z; β0),
where h(· ; ·) is a known smooth function and Z is a subset of the components of L, possibly empty, from independent and identically distributed copies (Ai, AiYi, LiT) (i = 1, …, n), of (A, AY, LT). We make the missing at random assumption (Rubin, 1976) that f (A | Y, L) = f (A | L) and the positivity assumption that pr(A = 1 | Y, L) > 0. Under these assumptions, β0 solves, for any p × 1 function b(·), the population moment equation
(2) E[b(Z){E(Y | A = 1, L) − h(Z; β0)}] = 0.
Equation (2) has motivated the so-called outcome regression estimator β̂reg(η̂). To compute it, one first estimates the unknown parameter η0 of a working outcome regression model for the respondents
(3) E(Y | A = 1, L) = m(L; η0),
where m(·; ·) is a known smooth function, by some η̂ using the units with observed Y. Next, one computes β̂reg(η̂) where, by definition, for any η, β̂reg(η) solves
(4) En[b(Z){m(L; η) − h(Z; β)}] = 0,
with b(·) a p × 1 user-specified vector function. In what follows, En(·) and Eα(· | ·) stand for the empirical mean and conditional mean operators, i.e., En{t(A, Y, L)} = n−1Σi=1n t(Ai, Yi, Li) and Eα{t(A, L) | L} = Σa t(a, L)f (a | L; α). Under regularity conditions, which include the requirement that (2) has a unique solution, the estimator β̂reg(η̂) is consistent for β0 and asymptotically normal if model (3) is correctly specified, provided η̂ is consistent for η0 and asymptotically normal. Note that (4) is the same as the weighted least squares estimating equations En[b(Z){Y − h(Z; β)}] = 0 that one would ordinarily use in the absence of missing data, for example with b(Z) = (1, ZT)T if h(Z; β) = ZTβ or h(Z; β) = expit(ZTβ), where expit(u) = {1 + exp(−u)}−1, but with the outcome Y replaced by m(L; η). Then, if the range of m(·; ·) falls in the sample space of Y, equation (4) will ordinarily have a solution.
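To fix ideas, the sketch below (our own illustration, not code from the paper) computes β̂reg(η̂) in the simplest linear case, under a hypothetical data-generating mechanism: m(L; η) and h(Z; β) are both linear and b(Z) = (1, ZT)T, so (4) is ordinary least squares with the pseudo-outcome m(L; η̂).

```python
# A minimal sketch (assumptions: linear working models, made-up data) of the
# outcome regression estimator beta_reg(eta_hat).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
L = rng.normal(size=(n, 3))
Z = L[:, :1]                                   # Z is a subset of L
Y = 1.0 + L @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
pA = 1 / (1 + np.exp(-(0.3 + L @ np.array([0.5, -0.5, 0.2]))))
A = rng.binomial(1, pA)                        # missing-at-random indicator

# Fit the working outcome model E(Y | A = 1, L) = m(L; eta) on respondents.
X_L = np.column_stack([np.ones(n), L])
eta_hat = np.linalg.lstsq(X_L[A == 1], Y[A == 1], rcond=None)[0]
m_hat = X_L @ eta_hat                          # imputed conditional means, all units

# Solve (4): least squares of the pseudo-outcome m(L; eta_hat) on (1, Z).
X_Z = np.column_stack([np.ones(n), Z])
beta_reg = np.linalg.lstsq(X_Z, m_hat, rcond=None)[0]
print(beta_reg)                                # approx (1, 2) here
```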
The moment equation (2) can be re-written as
E[ωb(Z){Y − h(Z; β0)}] = 0,
where ω = Af (A | L)−1. This formulation has motivated the so-called augmented inverse probability weighted estimators β̂g,aipw. To compute them, one first posits a parametric model for the missingness probability
(5) f (A | L) = f (A | L; α0),
for instance, f (A | L; α) = expit(αT L̃)A{1 − expit(αT L̃)}1−A where L̃T = (1, LT), and estimates α0 by its maximum likelihood estimator α̂. Next, one computes β̂g,aipw solving, with α evaluated at α̂, the equations
(6) En[ω(α)b(Z){Y − h(Z; β)} − g(A, L) + Eα{g(A, L) | L}] = 0,
where g(·) and b(·) are user-specified p × 1 vector functions, ω(α) = Af (A | L; α)−1 and Eα{g(A, L) | L} is the conditional expectation of g(A, L) given L under f (A | L; α). When g = 0, β̂g,aipw, throughout denoted by β̂ipw, is called an inverse probability weighted estimator. If model (5) is correctly specified, then under regularity conditions, β̂g,aipw is consistent and asymptotically normal with asymptotic variance no smaller than that of β̂gopt,aipw, where gopt(A, L) = ωb(Z){E(Y | A, L) − h(Z; β0)}. This motivates estimating β0 by β̂(η̂, α̂) where, for a given b(·) and each (η, α), β̂(η, α) solves
(7) En{M(α, β) − U(η, α, β)} = 0,
with M(α, β) = ω(α)b(Z){Y − h(Z; β)} and U(η, α, β) = {ω(α) − 1}b(Z){m(L; η) − h(Z; β)} (Robins & Rotnitzky, 1995).
Regardless of the validity of the outcome regression model (3), if the missingness model (5) is correct, no variance correction due to estimation of η is needed, i.e.,
(8) √n{β̂(η̂, α̂) − β̂(η*, α̂)} = op(1),
where η* is the probability limit of η̂. When both the outcome regression and the missingness models are correct, no adjustment due to estimation of α on the asymptotic variance of β̂(η0, α̂) is needed, i.e., √n{β̂ (η0, α̂) − β̂ (η0, α0)} = op(1).
Robins & Rotnitzky (1995) noted that the estimator β̂ (η̂, α̂) satisfies:
Property 1. if the missingness model is correct, it is consistent for β0 and asymptotically normal;
Property 2. if both the outcome regression and missingness models are correct, it has asymptotic variance equal to the smallest asymptotic variance of all estimators β̂g,aipw that use the same b.
Unlike the outcome regression estimator β̂reg(η), the estimator β̂(η̂, α̂) may fall outside the range of plausible values for β. For example, in the absence of Z and when h(Z; β) = β, (7) is equivalent to
(9) β = En{m(L; η̂)} + En[ω(α̂){Y − m(L; η̂)}].
If Y is binary and m(L; η) = expit(ηTL̃), then 0 < En{m(L; η̂)} < 1. However, |En[ω(α̂){Y − m(L; η̂)}]| may be very large if a few weights ωi(α̂) are exceedingly large, as is the case when a few units with Ai = 1 have relatively very small estimated values f (Ai | Li; α̂). Another manifestation of this problem is estimating equations with no solution. For instance, if we parameterize E(Y) = expit(β0), then (9) with β replaced by expit(β) has no solution if the right-hand side of (9) is outside the interval (0, 1).
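The following toy computation (a hedged sketch under an invented data-generating mechanism, with α0 treated as known) makes the instability concrete: it evaluates the right-hand side of (9) with a deliberately poor outcome fit, which in unlucky samples can land outside (0, 1).

```python
# A small sketch of equation (9): beta = En{m} + En[omega(Y - m)].  With a
# bad outcome fit and a few tiny fitted probabilities f(A = 1 | L), the
# correction term is highly variable and can push the estimate outside [0, 1]
# in some samples; the outcome regression form (4) cannot.
import numpy as np

rng = np.random.default_rng(1)
n = 500
L = rng.normal(size=n)
pA = 1 / (1 + np.exp(-(-2.5 + 2.0 * L)))       # some units have tiny pr(A = 1 | L)
A = rng.binomial(1, pA)
Y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + L))))

m = np.full(n, 0.9)                            # a deliberately misspecified fit
omega = A / pA                                 # weights, alpha taken as known here
beta_aipw = m.mean() + (omega * (Y - m)).mean()
print(beta_aipw)                               # need not lie in (0, 1)
```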
Scharfstein et al. (1999) noted that β̂ (η̂, α̂) is double-robust: it is consistent for β0 and asymptotically normal provided either model (5) or model (3) is correct. The estimator β̂ (η̂, α̂) with η̂ the ordinary or iteratively reweighted least squares estimator based on units with observed Y is known as the standard double-robust estimator. Several authors (Robins & Wang, 2000; Kang & Schafer, 2007; Rubin & van der Laan, 2008) have noted that standard double-robust estimators may have substantial bias and large variance, even under correct specification of the missingness model, if the estimated missingness probabilities are highly variable and/or the outcome regression model is misspecified. Alternative double-robust estimators have been recently developed to address these problems (van der Laan & Rubin, 2006; van der Laan, 2010; Tan, 2006, 2007, 2008, 2010a, 2010b; Robins et al., 2007; Cao et al., 2009). In particular, Tan (2008, 2010a) and Cao et al. (2009) derived double-robust estimators of E(Y) = β0, i.e., in the special case in which Z is absent and h(Z; β) = β, which satisfy Properties 1 and 2 and have the enhanced efficiency benefit that:
Property 3. if the missingness model is correct, they have asymptotic variance smaller than or equal to the asymptotic variance of any estimator β̂ (η, α̂) with η fixed but arbitrary even if the outcome regression model is incorrect.
The estimator of Tan (2008) agrees with that of Cao et al. (2009) when m(L; η) is linear in η, but otherwise it has a subtle difference that ensures that, when the missingness model is correct, it also has asymptotic variance no larger than that of any estimator computed like β̂(η, α̂) but with m(L; η) replaced by c1 + c2m(L; η) for any constants c1 and c2. Both proposals adapt a particular version of an estimator proposed by Rubin & van der Laan (2008) to the setting in which the missingness probabilities are unknown and estimated under a parametric model. Under both proposals, the estimators are of the form β̂(η̃, α̂), where η̃ minimizes σ̂2(η), an adequately chosen consistent estimator of the asymptotic variance σ2(η) of β̂(η, α̂). A drawback of these approaches is that they are based on solving equations of the form (7), which may not have a solution or may yield estimators that fall outside the parameter space. To remedy this, Tan (2010a) derived an estimator that maximizes a constrained empirical likelihood. A clever choice of constraints, combined with a calibration of a linear model for the missingness probabilities, ensures that the estimator is double-robust, is efficient in the aforementioned larger class of estimators considered by Tan (2008), satisfies Properties 1–3 and is a weighted average of the observed outcomes.
In the next section, we introduce a new class of estimators β̂dr of parameters of regression models with the following properties:
Property 4. they are double-robust, i.e., consistent for β0 and asymptotically normal if either model (3) or (5) is correct;
Property 5. they satisfy Properties 1 and 2;
Property 6. they solve an outcome regression estimating equation;
Property 7. given user-specified real-valued functions ϕ1(·), …, ϕK (·), each ϕk(β̂dr) (k = 1, …, K), has asymptotic variance smaller than or equal to that of any ϕk{β̂ (η, α̂)} with η fixed but arbitrary, if the missingness model is correct and even if the outcome regression model is incorrect;
Property 8. they have asymptotic variance smaller than or equal to that of β̂ipw if the missingness model is correct even if the outcome regression model is incorrect.
For β the population mean of a scalar outcome, as in Tan (2008, 2010a) and Cao et al. (2009), or for any other scalar parameter, Property 7 with ϕ1(β) = β is the same as Property 3. For β of dimension 2 or more, Property 7 ensures enhanced efficiency for estimation of a finite set of target scalar features of the vector β specified by the data analyst. For example, if in a two-group comparison analysis h(Z; β) = expit(β1 + β2Z) with Z binary, one would choose ϕ1(β) = h(0; β), ϕ2(β) = h(1; β) and ϕ3(β) = h(1; β) − h(0; β) to ensure enhanced efficiency in estimating the group means and their difference. A more ambitious goal would be to construct estimators that satisfy Property 7 for all smooth scalar functions ϕ, not just for a finite number of them, or equivalently, Property 3 even if the dimension of β is 2 or more. Whether or not this can be accomplished remains an open question.
For estimating regression models with missing data, several procedures exist that satisfy some, but not all, of Properties 4–8. The standard double-robust estimator β̂(η̂, α̂) satisfies Properties 4 and 5 but not 6–8. The Bang & Robins (2005) estimator and estimators derived from the targeted maximum likelihood methodology (van der Laan & Rubin, 2006; van der Laan, 2010) satisfy Properties 4–6 but not 7 and 8. The one-step corrected estimator in § 2.5 of van der Laan & Robins (2003) satisfies Properties 5 and 8 but not 4, 6 and 7. The estimators in Tan (2010b) satisfy Properties 4, 5 and 8 but not 6 and 7.
2.2. Proposed estimator
To compute β̂dr, it is required that the parameter η of model (3) have dimension q greater than or equal to the dimension p of β0. If this is not the case, one first augments model (3) by including additional covariates based on transformations of L, e.g., powers of L. The computation of β̂dr is carried out in the following three steps.
Step 1. Estimate η by η̂or solving the p equations
(10) En[ω(α̂)b(Z){Y − m(L; η)}] = 0
and the additional q − p equations
(11) En[Ad(L; η){Y − m(L; η)}] = 0,
where d(L; η) is any user-specified (q − p) × 1 function, say d(L; η) = {∂m(L; η)/∂η1, …, ∂m(L; η)/∂ηq−p}T. Compute β̂or = β̂reg(η̂or) solving the outcome regression equation (4) with η = η̂or.
Step 2. For each k (k = 1, …, K), compute
(12) η̃k = arg minη En([Îk{M(α̂, β̂or) − U(η, α̂, β̂or)} − ρ̂ηTS(α̂)]2),
where Îk = ∂ϕk(β)/∂βT|β̂or[En{∂M(α̂, β)/∂βT|β̂or}]−1, S(α) is the score for α under model (5),
and ρ̂η is the least squares coefficient in the empirical regression of Îk{M(α̂, β̂or) − U(η, α̂, β̂or)} on S(α̂).
Step 3. Compute the maximum likelihood estimator (α̃, δ̃) of (α, δ) in the augmented missingness model
(13) f (A | L; α, δ) = c(L; α, δ)−1f (A | L; α) exp{δ1Tu1(A, L) + ⋯ + δK+1TuK+1(A, L)},
where uk(A, L) = U(η̃k, α̂, β̂or) (k = 1, …, K), uK + 1(A, L) = U(η̂or, α̂, β̂or) and c(L; α, δ) is the normalizing constant. Estimate η by η̃or jointly solving (11) and (10) with ω (α̂) replaced with ω (α̃, δ̃) = Af (A | L; α̃, δ̃)−1. Finally, β̂dr is the estimator β̂reg(η̃or) solving the outcome regression equation (4) with η = η̃or.
The estimator β̂or of β returned by Step 1 is the extension to the regression setting of the so-called weighted least squares outcome regression double-robust estimator of a population mean of Kang & Schafer (2007, Equation (10)). Step 2 follows Rubin & van der Laan's (2008) prescription to compute, separately for each k, an estimator η̃k of η targeted at minimizing the asymptotic variance of ϕk{β̂(η, α̂)} if model (5) is correct. Under this assumption, the empirical mean in (12) is a consistent estimator of the asymptotic variance of ϕk{β̂(η, α̂)}. Step 3 simply repeats Step 1, after re-estimating the missingness probabilities under the extended model (13). The subscripts in η̃or and η̂or are a reminder that when either is substituted for η in β̂(η, α̂), it yields estimators solving outcome regression estimating equations.
The vector θ̂ = {α̂, β̂or, η̂or, η̃or, β̂dr, α̃, δ̃, (η̃k, ρ̂η̃k)1⩽k⩽K} solves a system of estimating equations En{Ψ(X; θ)} = 0. Under the conditions of van der Vaart (2000, Theorems 5.9 and 5.21), θ̂ − θ* = −Vθ*−1En{Ψ(X; θ*)} + op(n−1/2), for θ* satisfying E{Ψ(X; θ*)} = 0 and Vθ = ∂E{Ψ(X; θ)}/∂θT. In what follows, we will assume that these conditions hold and argue that the estimator β̂dr satisfies Properties 4–8.
For Property 4, the estimator η̃or converges in probability to a solution of the joint equations E[ω(α*, δ*)b(Z){Y − m(L; η)}] = 0 and E[Ad(L; η){Y − m(L; η)}] = 0 where (α*, δ*) = plim (α̃, δ̃). If model (3) is correct, η0 is a solution to this system. Thus β̂dr, being equal to the estimator β̂reg(η̃or), is consistent for β0 and asymptotically normal if model (3) is correct and the preceding joint population equations have a unique solution. On the other hand, β̂dr is equal to β̂{η̃or, (α̃, δ̃)} solving (7) with η = η̃or and α = (α̃, δ̃) because (7) is the same as equations
(14) En[ω(α)b(Z){Y − m(L; η)}] + En[b(Z){m(L; η) − h(Z; β)}] = 0
and, by construction of β̂dr and η̃or, En[b(Z){m(L; η̃or) − h(Z; β̂dr)}] = 0 and En[ω(α̃, δ̃) b(Z){Y − m(L; η̃or)}] = 0. When model (5) is correct, model (13) is also correct with a true value of δ equal to 0. Consequently, when (5) is correct β̂{η̃or, (α̃, δ̃)}, just as any estimator solving (7), is consistent for β0 and asymptotically normal.
Property 5 holds because, as indicated above, β̂dr solves equation (7) with η and α replaced by estimators that converge to the true parameter values when the models they index are correct.
Property 6 holds by construction.
For Property 7, under model (5), β̂(η, α̂) satisfies the expansion
√n{β̂(η, α̂) − β0} = −I√nEn{M0 − U0(η) − ρ(η)TS(α0)} + op(1),
where M0 = M(α0, β0), U0(η) = U(η, α0, β0), I = E{∂M(α0, β)/∂βT|β0}−1 and
ρ(η) = [E{S(α0)S(α0)T}]−1E[S(α0){M0 − U0(η)}T]
is the population least squares coefficient in the multivariate regression of M0 − U0(η) on the components of the score vector S(α0) (Robins et al., 1994). Thus, an application of the delta method yields that
(15) avar[ϕk{β̂(η, α̂)}] = E([Ik{M0 − U0(η) − ρ(η)TS(α0)}]2),
where Ik = ∂ϕk(β)/∂βT|β0E{∂M(α0, β)/∂βT|β0}−1 and, in the sequel, avar(·) stands for the variance of the limiting distribution of its argument. As the dimension of the vector α indexing the missingness model increases, so does the dimension of S(α0), and consequently the right-hand side of (15) decreases. With this in mind, we construct in Step 3 an augmented missingness model, choosing to enlarge model (5) with just the right additional covariates so as to ensure that the resulting estimator of β0 is at least as efficient asymptotically as ϕk{β̂(η̄k, α̂)} (k = 1, …, K), where η̄k = plim η̃k minimizes the right-hand side of (15) when model (5) is correct.
Specifically, when model (5) is correct, so too is the enlarged model (13) with true parameter values α0 and δ0,k = 0 (k = 1, …, K + 1). It then follows from (8) and the fact that β̂dr = β̂ {η̃or, (α̃, δ̃)} that
(16) √n(β̂dr − β0) = −I√nEn[M0 − U0(η̄or) − ρ̄TS*(α0, δ0)] + op(1),
where η̄or = plim η̃or, S*(α0, δ0) = {Sα*(α0, δ0)T, Sδ1*(α0, δ0)T, …, SδK+1*(α0, δ0)T}T, and Sα*(α, δ) = ∂ log f*(A | L; α, δ)/∂α and Sδk*(α, δ) = ∂ log f*(A | L; α, δ)/∂δk are the scores in the model
(17) f*(A | L; α, δ) = c*(L; α, δ)−1f (A | L; α) exp{δ1TU(η̄1, α0, β0) + ⋯ + δKTU(η̄K, α0, β0) + δK+1TU(η*or, α0, β0)},
with c*(L; α, δ) the normalizing constant and η*or = plim η̂or. Furthermore, ρ̄ is the population least squares coefficient in the regression of M0 − U0(η̄or) on Sα*(α0, δ0), Sδ1*(α0, δ0), …, SδK+1*(α0, δ0). Now, because of the precise form of model (17), it holds that Sα*(α0, δ0) = S(α0), Sδk*(α0, δ0) = U0(η̄k) (k = 1, …, K) and SδK+1*(α0, δ0) = U0(η*or). Furthermore, U0(η̄or) = U0(η*or) because, as argued in the Appendix, when model (5) holds, η̄or = η*or. Thus, expansion (16) reduces to
(18) √n(β̂dr − β0) = −I√nEn{M0 − ρ*TS+(α0)} + op(1),
with S+(α0) = {S(α0)T, U0(η̄1)T, …, U0(η̄K)T, U0(η*or)T}T and ρ* re-defined as the population least squares coefficient in the regression of M0 on S(α0), U0(η̄1), …, U0(η̄K) and U0(η*or).
An application of the delta method then yields
(19) avar{ϕk(β̂dr)} = minρ E[{IkM0 − ρTS+(α0)}2].
This shows that avar{ϕk(β̂dr)} ⩽ avar[ϕk{β̂(η, α̂)}] for any η because, by definition of η̄k, the smallest possible avar[ϕk{β̂(η, α̂)}] is the right-hand side of (15) evaluated at η = η̄k, and this quantity is greater than or equal to the right-hand side of (19), since the latter is a minimum over a larger set of regressors, one that includes every component of U0(η̄k).
Property 8 follows by noticing that
√n(β̂ipw − β0) = −I√nEn{M0 − ρ**TS(α0)} + op(1),
where M0 − ρ**TS(α0) is the residual from the population regression of M0 on S(α0). This residual has variance larger than or equal to, in the positive semidefinite sense, the variance of the residual in curly brackets in (18), as the latter is the residual from the regression of M0 on covariates that include S(α0).
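Both arguments rest on the same elementary fact about population least squares, which we record here in generic notation for reference; Π(W | V) denotes the population least squares projection of W on V.

```latex
% Nested-projection inequality: enlarging the regressor set can only shrink
% the residual variance.  S is a subvector of S+.
\[
\operatorname{var}\{W - \Pi(W \mid S^{+})\} \;\preceq\; \operatorname{var}\{W - \Pi(W \mid S)\}.
\]
% Property 8 is the case W = M_0, S = S(\alpha_0) and
% S^{+} = \{S(\alpha_0), U_0(\bar\eta_1), \ldots, U_0(\bar\eta_K), U_0(\eta^{*}_{or})\}.
```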
The following remarks help clarify our construction. First, Step 1 is needed in order to include the covariate uK+1(A, L) in model (13). Without this covariate, the asymptotic variance of ϕk(β̂dr) would be equal to the variance of the expression between square brackets in (16), with Ik instead of I and without the score SδK+1*(α0, δ0) among the regressors; this variance would not necessarily be smaller than the right-hand side of (15) evaluated at η = η̄k. Second, we can modify Step 3 to yield β̂dr additionally as efficient as β̂g,aipw for any specified g by simply adding the term δK+2TuK+2(A, L), where uK+2(A, L) = g(A, L) − Eα̂{g(A, L) | L}, to the exponential tilt in model (13). Third, the computation of η̃or in Step 3 depends on (α̃, δ̃) only through the f (Ai | Li; α̃, δ̃) (i = 1, …, n). If the outcome regression model (3) is correctly specified, then some or all of the uk(A, L) (k = 1, …, K + 1) may converge in probability to the same function of (A, L) and thus they may be highly collinear. In such a case, δ̃ may not be unique. However, the fitted combination δ̃1Tu1(A, L) + ⋯ + δ̃K+1TuK+1(A, L), and hence f (A | L; α̃, δ̃), will still be unique. Formula (19) for the asymptotic variance of ϕk(β̂dr) remains valid with the provision that some or all of the U0(η̄k) (k = 1, …, K) and U0(η*or) may be equal. This provision does not invalidate the preceding arguments justifying that β̂dr satisfies Properties 7 and 8.
Standard empirical sandwich variance techniques could in principle be used to derive an estimator that is consistent for the asymptotic variance of β̂dr regardless of the validity of models (5) or (3) because, as indicated earlier, θ̂ = {α̂, β̂or, η̂or, η̃or, β̂dr, α̃, δ̃, (η̃k, ρ̂η̃k)1⩽k⩽K} solves an estimating equation En{Ψ(X; θ)} = 0. However, finding the analytic expression for Ψ would be cumbersome. Nevertheless, the nonparametric bootstrap can be used to compute a consistent variance estimator of β̂dr because θ̂ is regular and asymptotically linear (Gill, 1989).
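A minimal sketch of such a bootstrap follows; estimate_beta_dr is a placeholder for a user-written function implementing Steps 1–3, not a function defined in the paper.

```python
# Nonparametric bootstrap variance for beta_dr (sketch, hypothetical interface).
import numpy as np

def bootstrap_var(data, estimate_beta_dr, B=500, seed=0):
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # resample units with replacement
        reps[b] = estimate_beta_dr(data[idx])  # recompute the full three-step estimator
    return reps.var(ddof=1)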
Example 1. Consider estimation of β0 = E(Y) with Y binary. In this case, h(β) = β and Z is absent in model (1). Suppose we assume that (5) and (3) are logistic regressions with intercept and covariate L. The score for α is S(α) = L̃ expit(αTL̃){ω(α) − 1}. The function b(·) in equation (4) is a scalar constant since Z is absent; all bs yield the same estimator of β0, so we take b = 1. In Step 1, η̂or solves (10) and En[AL̃r{Y − expit(ηTL̃)}] = 0 (r = 1, …, q − 1), yielding β̂or = En{expit(η̂orTL̃)}. In Step 2, K = 1, ϕ1(β) = β and I1 = 1. Furthermore, U(η, α, β) = {ω(α) − 1}{expit(ηTL̃) − β}. Consequently,
η̃1 = arg minη En([ω(α̂){Y − β̂or} − {ω(α̂) − 1}{expit(ηTL̃) − β̂or} − ρ̂ηTS(α̂)]2).
Model (13) of Step 3 is a logistic regression with intercept and covariates L, x1(L) = expit(α̂TL̃)−1{expit(η̃1TL̃) − β̂or} and x2(L) = expit(α̂TL̃)−1{expit(η̂orTL̃) − β̂or}. The estimator η̃or is computed just like η̂or, except that in equation (10) ω(α̂) is replaced by ω(α̃, δ̃) = A expit{α̃TL̃ + δ̃1x1(L) + δ̃2x2(L)}−1. Finally, β̂dr = En{expit(η̃orTL̃)}, which under model (5) has asymptotic variance
avar(β̂dr) = minρ E([M0 − ρT{S(α0)T, U0(η̄1), U0(η̄or)}T]2),
with M0 = ω(α0)(Y − β0), U0(η̄j) = {ω(α0) − 1}{expit(η̄jTL̃) − β0} and η̄j = plim η̃j for j = 1, or. Interestingly, the estimator of Cao et al. (2009) has asymptotic variance minρ E[{M0 − ρTS(α0) − U0(η̄1)}2], which is generally strictly larger than avar(β̂dr) due to the nonlinear dependence of U0(η) on η.
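The sketch below assembles the three steps for Example 1. It is our own schematic implementation, not the authors' code: for simplicity it weights every component of the outcome regression score in Steps 1 and 3, a variant that still solves (10) because the model contains an intercept (the paper leaves the q − 1 equations of (11) unweighted), and all function names are ours.

```python
# Schematic beta_dr for Example 1 (binary Y, logistic models).  A sketch only.
import numpy as np
from scipy.optimize import minimize

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def wlogit(X, y, w=None, iters=100):
    # Weighted logistic score: sum_i w_i X_i {y_i - expit(X_i' b)} = 0,
    # solved by Newton iterations with a small ridge for stability.
    w = np.ones(len(y)) if w is None else w
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ b)
        H = X.T @ ((w * p * (1 - p))[:, None] * X) + 1e-8 * np.eye(X.shape[1])
        b = b + np.linalg.solve(H, X.T @ (w * (y - p)))
    return b

def beta_dr_example1(L, A, Y):
    n = L.shape[0]
    Lt = np.column_stack([np.ones(n), L])              # L-tilde = (1, L^T)^T
    Yf = np.where(A == 1, Y, 0.0)                      # Y used only where A = 1

    alpha = wlogit(Lt, A)                              # MLE of alpha in model (5)
    p_hat = expit(Lt @ alpha)
    omega = A / p_hat

    # Step 1: weighted outcome fit among respondents, then beta_or = En{m}.
    eta_or = wlogit(Lt[A == 1], Y[A == 1], w=1.0 / p_hat[A == 1])
    beta_or = expit(Lt @ eta_or).mean()

    # Step 2: minimize the empirical variance criterion (12); the least
    # squares coefficient rho on the score S(alpha-hat) is profiled out.
    S = Lt * (A - p_hat)[:, None]                      # score of alpha at alpha-hat
    M = omega * (Yf - beta_or)
    def crit(eta):
        U = (omega - 1) * (expit(Lt @ eta) - beta_or)
        r = M - U
        rho = np.linalg.lstsq(S, r, rcond=None)[0]
        return np.mean((r - S @ rho) ** 2)
    eta_1 = minimize(crit, eta_or, method="Nelder-Mead").x

    # Step 3: refit the missingness model with the augmentation covariates
    # x1, x2 of Example 1, recompute the weighted outcome fit, and average.
    x1 = (expit(Lt @ eta_1) - beta_or) / p_hat
    x2 = (expit(Lt @ eta_or) - beta_or) / p_hat
    Xa = np.column_stack([Lt, x1, x2])
    gam = wlogit(Xa, A)                                # (alpha-tilde, delta-tilde)
    w_aug = 1.0 / expit(Xa @ gam)
    eta_or2 = wlogit(Lt[A == 1], Y[A == 1], w=w_aug[A == 1])
    return expit(Lt @ eta_or2).mean()                  # beta_dr
```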
3. Causal inference in point-exposure studies
We now consider estimation of marginal structural mean models for causal inference in point exposure observational studies. Suppose that in an observational study with n subjects drawn at random from a population of interest, we observe (Ai, Yi, LiT) (i = 1, …, n), independent and identically distributed across i, where Yi is a scalar outcome, Ai is a treatment variable taking values in a set 𝒜 and Li is a vector of pre-treatment confounding variables. For each a ∈ 𝒜, let Y(a),i be the potential outcome of the ith unit under treatment a. We make the usual consistency assumption that Y(Ai),i = Yi. Suppose that Zi is a subvector of Li, possibly empty. A marginal structural mean model postulates that
E{Y(a) | Z} = h(a, Z; β0) (a ∈ 𝒜),
where h(·; ·) is a known smooth function and β0 is an unknown p × 1 parameter vector (Robins, 1999). For example, h(a, Z; β) = β1 + β2a + β3aZ. We examine estimation of β0 from data (Ai, Yi, LiT) (i = 1, …, n), under the no-unmeasured-confounders assumption, which states that Y(a) and A are conditionally independent given L for all a ∈ 𝒜. Under this assumption, regarding Y(a) as an outcome variable that is observed and equal to Y only in units that received treatment A equal to a, it holds, as in § 2.1, that E{Y(a) | Z} = E{E(Y | L, A = a) | Z} = E(ωaY | Z), where ωa = Ia(A)f (A | L)−1 and Ia(A) = 1 if A = a and Ia(A) = 0 otherwise.
In this setting, an outcome regression estimator β̂reg(η̂) is the solution of
(20) En[∫𝒜 b(a, Z){m(a, L; η) − h(a, Z; β)} dμ(a)] = 0,
where b(a, z) is a p × 1 user-specified vector function, μ denotes the Lebesgue measure when A is continuous and the counting measure if A is discrete, and η̂ is an estimator of η0 under a postulated outcome regression model
(21) E(Y | L, A = a) = m(a, L; η0) (a ∈ 𝒜),
which simultaneously parameterizes the separate outcome regressions E(Y | L, A = a) for all a ∈ 𝒜. If model (21) is correct, then β̂reg(η̂) solving (20) is consistent for β0 and asymptotically normal provided η̂ is consistent for η0 and asymptotically normal. Alternatively, positing a model (5) for the treatment mechanism, we can consider estimating equations (6) and (7) in which b(Z), m(L; η) and h(Z; β) are replaced with b(A, Z), m(A, L; η) and h(A, Z; β), η̂ is now an estimator that is consistent for η0 when model (21) is correct, and with the re-definitions
M(α, β) = ω(α)b(A, Z){Y − h(A, Z; β)}, ω(α) = f (A | L; α)−1,
U(η, α, β) = ω(α)b(A, Z){m(A, L; η) − h(A, Z; β)} − ∫𝒜 b(a, Z){m(a, L; η) − h(a, Z; β)} dμ(a).
With these modifications, we obtain the estimators α̂, β̂g,aipw, β̂ipw, β̂(η, α) and β̂ (η̂, α) as in § 2.1. Robins (1999) showed that (8) holds and Scharfstein et al. (1999) showed that β̂ (η̂, α̂) satisfies Properties 4 and 5 of § 2.1 where the words missingness model are replaced by treatment model and the outcome regression model is (21) rather than (3).
As in the missing data case, standard double-robust estimators β̂(η̂, α̂) with η̂ the ordinary or iteratively reweighted least squares estimator of η0 based on all sampled units may not exist, because equation (7) evaluated at η̂ and α̂ may not have a solution. Furthermore, β̂(η̂, α̂) does not have the advantageous efficiency Properties 7 and 8 of § 2.1. In the present setting, we can define the estimator β̂dr exactly as in § 2.2, but with the re-definitions and replacements indicated in the preceding paragraph. Arguments identical to those given in § 2.2 imply that β̂dr satisfies Properties 4–8 of § 2.1 with the outcome regression equation referred to in Property 6 being now (20).
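As an illustration, the following sketch (our own code, hypothetical data-generating mechanism) computes the outcome regression estimator (20) for a binary treatment with Z empty and saturated h(a; β) = β1 + β2a, in which case μ is counting measure and (20) reduces to averaging m(a, L; η̂) over the sample.

```python
# Outcome regression (g-formula) sketch for a marginal structural mean model
# with binary A: beta_1 = E{Y(0)}, beta_2 = E{Y(1)} - E{Y(0)}.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
L = rng.normal(size=(n, 2))
pA = 1 / (1 + np.exp(-(L @ np.array([0.8, -0.4]))))
A = rng.binomial(1, pA)                                    # confounded treatment
Y = 0.5 + A + L @ np.array([1.0, 1.0]) + rng.normal(size=n)

# Fit the working model m(a, L; eta), here linear in (1, a, L).
X = np.column_stack([np.ones(n), A, L])
eta = np.linalg.lstsq(X, Y, rcond=None)[0]
m0 = np.column_stack([np.ones(n), np.zeros(n), L]) @ eta   # m(0, L; eta-hat)
m1 = np.column_stack([np.ones(n), np.ones(n), L]) @ eta    # m(1, L; eta-hat)
beta1, beta2 = m0.mean(), m1.mean() - m0.mean()
print(beta1, beta2)                                        # approx (0.5, 1.0) here
```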
4. Monotone missing data in longitudinal studies
We now turn to double-robust estimation in regression models for longitudinal data with drop-out. The intended data are n independent and identically distributed copies of L̄J, where B̄j denotes (B0, …, Bj) throughout. The vector Lj denotes the data we intend to measure at the jth occasion on a sample unit. Let C denote the drop-out occasion: C = j on a given sample unit if and only if we observe L̄j for that unit. We assume L0 is always observed, so C > 0. The outcome of interest Y is r(L̄J), where r(·) is a user-specified function which for simplicity we assume is scalar valued; for example, r(L̄J) is some component of the vector LJ. The goal is to estimate the p × 1 parameter vector β0 of the regression model (1), where Z is a subvector of L0, possibly empty, from n independent and identically distributed copies of (C, L̄C) under the missing at random assumption
pr(Aj = 1 | Aj−1 = 1, L̄J) = pr(Aj = 1 | Aj−1 = 1, L̄j−1) (j = 1, …, J),
where Aj is the on-study indicator, i.e., Aj = 1 if C ⩾ j and Aj = 0 otherwise. Provided pr(Aj = 1 | Aj−1 = 1, L̄j−1) > 0 (j = 1, …, J), E(Y | Z) can be expressed in two different ways (Bang & Robins, 2005), each leading to a different estimation strategy,
E(Y | Z) = E(H0 | Z),
with H0 defined from the recursion HJ = Y and Hj−1 = E(Hj | Aj = 1, L̄j−1) (j = J, …, 1), and
E(Y | Z) = E(ωJY | Z), where ωJ = AJ{f (A1 | Ā0, L̄0) × ⋯ × f (AJ | ĀJ−1, L̄J−1)}−1.
To generalize β̂reg(η̂) to the longitudinal setting, we posit outcome regression models
(22) E(Hj | Aj = 1, L̄j−1) = mj(L̄j−1; η0) (j = 1, …, J),
with η0 an unknown q × 1 parameter vector and mj (·; ·) known functions, for instance mj (L̄j−1; η) = Φ−1{ηTsj (L̄j−1)} for some link function Φ (·) and some user-specified functions sj (·), and we estimate η0 by some η̂, for example, by η̂ solving
(23) En[ΣjAjdj(L̄j−1; η){mj+1(L̄j; η) − mj(L̄j−1; η)}] = 0,
where Σj denotes summation over j = 1, …, J here and throughout this section, dj(L̄j−1; η) = ∂mj(L̄j−1; η)/∂η and mJ+1(L̄J; η) ≡ Y. The outcome regression estimator β̂reg(η̂) then solves (4) with m(L; η) re-defined as m1(L̄0; η) here and in what follows. If (22) holds, β̂reg(η̂) − β0 = Op(n−1/2) provided η̂ solves (23) or, more generally, provided η̂ − η0 = Op(n−1/2).
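The recursion defining H0 suggests the familiar sequential-regression fit; the sketch below (our own code, with hypothetical argument conventions, linear working models and Z empty) makes it concrete.

```python
# Sequential-regression (Bang-Robins-style) sketch of the recursion H_J = Y,
# H_{j-1} = E(H_j | A_j = 1, history), fitted backwards with linear models.
import numpy as np

def sequential_regression(Lbar, Aobs, Y):
    """Lbar: list of (n, d_j) arrays L_0, ..., L_J; Aobs: (n, J) on-study
    indicators A_1, ..., A_J; Y = r(Lbar_J), arbitrary (e.g. zero) where A_J = 0."""
    n, J = Aobs.shape
    H = np.where(Aobs[:, J - 1] == 1, Y, 0.0)        # H_J = Y, used only if A_J = 1
    for j in range(J, 0, -1):                        # j = J, ..., 1
        X = np.column_stack([np.ones(n)] + Lbar[:j]) # history (1, L_0, ..., L_{j-1})
        at_risk = Aobs[:, j - 1] == 1                # units with A_j = 1
        coef = np.linalg.lstsq(X[at_risk], H[at_risk], rcond=None)[0]
        H = X @ coef                                 # H_{j-1}, predicted for all units
    return H.mean()                                  # En(H_0): estimate of E(Y), Z empty
```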
Alternatively, to generalize β̂g,aipw to the longitudinal setting, we posit drop-out models
(24) f (Aj | Aj−1 = 1, L̄j−1) = f (Aj | Aj−1 = 1, L̄j−1; α0) (j = 1, …, J),
for instance, f (Aj | Aj−1 = 1, L̄j−1; α) = expit{αTtj (L̄j−1)} for some user-specified functions tj(·), and we compute the maximum likelihood estimator α̂ of α0. Then, we compute β̂g,aipw solving, with α evaluated at α̂, the equation
En(ωJ(α)b(Z){Y − h(Z; β)} − Σj[gj(Āj, L̄j−1) − Eα{gj(Āj, L̄j−1) | Āj−1, L̄j−1}]) = 0
for some p × 1 vector functions g1(·), …, gJ(·) and b(·). Under regularity conditions, β̂g,aipw is consistent for β0 and asymptotically normal (Robins & Rotnitzky, 1995) when model (24) holds, with asymptotic variance no smaller than that of β̂gopt,aipw, where gopt,j(Āj, L̄j−1) = ωjb(Z){E(Hj | Aj = 1, L̄j−1) − h(Z; β0)} (Robins et al., 1994). The extension of β̂(η, α) to the longitudinal setting solves
(25) En{M(α, β) − ΣjUj(η, α, β)} = 0,
with
M(α, β) = ωJ(α)b(Z){Y − h(Z; β)}, Uj(η, α, β) = uj,η,β(Āj, L̄j−1) − Eα{uj,η,β(Āj, L̄j−1) | Āj−1, L̄j−1}, uj,η,β(Āj, L̄j−1) = ωj(α)b(Z){mj(L̄j−1; η) − h(Z; β)},
and ωj(α) = Aj × {f (A1 | Ā0, L̄0; α) × ⋯ × f (Aj | Āj−1, L̄j−1; α)}−1. The estimator β̂(η̂, α̂) satisfies (8) and Properties 4 and 5 with models (24) and (22) instead of (5) and (3) (Robins, 2000). However, as in § 2, for most estimators η̂, e.g., for η̂ solving (23), equation (25) evaluated at η̂ and α̂ may not have a solution. Furthermore, β̂(η̂, α̂) does not satisfy the efficiency Properties 7 and 8.
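For concreteness, the cumulative weights ωj(α) can be computed as below (a small sketch with hypothetical array conventions).

```python
# Cumulative inverse probability weights omega_j for monotone drop-out.
import numpy as np

def cumulative_weights(Aobs, hazards):
    """Aobs, hazards: (n, J) arrays; hazards[i, j-1] is the fitted
    pr(A_j = 1 | A_{j-1} = 1, Lbar_{j-1}) for unit i.  Returns the (n, J)
    array of omega_j = A_j / prod_{l <= j} f(A_l = 1 | ...)."""
    cumprob = np.cumprod(hazards, axis=1)
    return Aobs / cumprob                        # zero after drop-out, as required
```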
An estimator β̂dr satisfying Properties 4–8 of § 2.1, with models (24) and (22) in place of (5) and (3), can be constructed by implementing the following generalization of the three-step procedure of § 2.2 to the longitudinal setting.
Step 1. Compute β̂or = β̂reg(η̂or) solving (4), in which m(L; η) is re-defined as m1(L̄0; η), with η = η̂or, the estimator of the q × 1 parameter vector η of model (22) (q ⩾ p) solving the p equations
En[Σjωj(α̂)b(Z){mj+1(L̄j; η) − mj(L̄j−1; η)}] = 0
and the first q − p equations in the system (23).
Step 2. For each k (k = 1, …, K), compute
η̃k = arg minη En([Îk{M(α̂, β̂or) − ΣjUj(η, α̂, β̂or)} − ρ̂ηTS(α̂)]2),
where Îk = ∂ϕk(β)/∂βT|β̂or[En{∂M(α̂, β)/∂βT|β̂or}]−1, S(α) is the score for α under model (24) and ρ̂η is the least squares coefficient in the empirical regression of Îk{M(α̂, β̂or) − ΣjUj(η, α̂, β̂or)} on S(α̂).
Step 3. Compute the maximum likelihood estimator (α̃, δ̃) of (α, δ) in the augmented drop-out models (j = 1, …, J)
f (Aj | Aj−1 = 1, L̄j−1; α, δ) = cj(Āj−1, L̄j−1; α, δ)−1f (Aj | Aj−1 = 1, L̄j−1; α) exp{δ1Tu1,j(Āj, L̄j−1) + ⋯ + δK+1TuK+1,j(Āj, L̄j−1)},
where uk,j(Āj, L̄j−1) = Uj(η̃k, α̂, β̂or) (k = 1, …, K), uK+1,j(Āj, L̄j−1) = Uj(η̂or, α̂, β̂or) and cj(Āj−1, L̄j−1; α, δ) is the normalizing constant. Estimate η by η̃or, computed just like η̂or but with ωj(α̂) replaced with ωj(α̃, δ̃). Finally, compute β̂dr = β̂reg(η̃or) solving (4) with m(L; η) re-defined as m1(L̄0; η) and with η = η̃or.
An argument similar to that in § 2.2 shows that under regularity conditions, the estimator β̂dr from the preceding algorithm satisfies Properties 4–8 with models (24) and (22) instead of (5) and (3) and with m1(L̄0; η) instead of m(L; η) in equation (4). The key points are: β̂dr is consistent for β0 and asymptotically normal under model (22) because it is of the form β̂reg(η̃or), and η̃or is consistent for η0 and asymptotically normal under (22). The estimator β̂dr is also consistent for β0 and asymptotically normal under model (24) because it is equal to β̂{η̃or, (α̃, δ̃)} solving equation (25) since, as argued in the Appendix, (25) is the same as
(26) En{M(α, β) − ΣjUj(η, α, β)} = En[Σjωj(α)b(Z){mj+1(L̄j; η) − mj(L̄j−1; η)}] + En[b(Z){m1(L̄0; η) − h(Z; β)}],
and the right-hand side evaluates to zero when β, η and α are replaced with β̂dr, η̃or and (α̃, δ̃). The rest of the argument follows with the re-definition S*(α0, δ0) = {S1*(α0, δ0)T, …, SJ*(α0, δ0)T}T, where Sj*(α, δ) stacks the scores Sα,j*(α, δ) and Sδk,j*(α, δ) for α and δk,j in the models (j = 1, …, J)
f*(Aj | Aj−1 = 1, L̄j−1; α, δ) = cj*(Āj−1, L̄j−1; α, δ)−1f (Aj | Aj−1 = 1, L̄j−1; α) exp{δ1,jTU0,j(η̄1) + ⋯ + δK,jTU0,j(η̄K) + δK+1,jTU0,j(η*or)},
with cj*(Āj−1, L̄j−1; α, δ) the normalizing constant, U0,j(η̄k) = Uj(η̄k, α0, β0) (k = 1, …, K) and U0,j(η*or) = Uj(η*or, α0, β0).
5. Simulation studies
We carried out four simulation experiments to assess the performance of β̂dr for estimation of β0 = E(Y) with sample sizes n = 200, 1000. In each experiment, we generated 1000 Monte Carlo datasets and computed, in addition to β̂dr, the estimators β̂reg(η̂), β̂(η̂, α̂), β̂ipw and β̂Cao, the estimator called μ̂PROJ in Cao et al. (2009).
In the first two experiments, we generated data as in Kang & Schafer (2007): (L1, …, L4, ε) ∼ N(0, I5), where I5 is the 5 × 5 identity matrix, Y = 210 + 27.4L1 + 13.7L2 + 13.7L3 + 13.7L4 + ε and A ∼ Ber{expit(−L1 + 0.5L2 − 0.25L3 − 0.1L4)}. As in Kang & Schafer (2007), we computed X = (X1, …, X4), where X1 = exp(L1/2), X2 = L2/{1 + exp(L1)} + 10, X3 = (L1L3/25 + 0.6)3 and X4 = (L2 + L4 + 20)2. In the first experiment, we assumed that the observed data were (A, AY, L). We computed the estimators using the same outcome and missingness models as in Kang & Schafer (2007). The first, correctly specified, outcome and missingness models were, respectively, an additive linear regression of Y on L with intercept, and a logistic regression with intercept, covariate L and outcome A. The second models, incorrectly specified, were as the first ones but with covariate X instead of L. In the second experiment, as in Robins et al. (2007), we recoded A as 1 − A and replicated the first experiment. We conducted this experiment because Robins et al. (2007) noted that the favourable performance of β̂reg(η̂) compared with β̂(η̂, α̂) reported by Kang & Schafer (2007) was reversed when the observed data were {1 − A, (1 − A)Y, L}; thus, our study includes scenarios favourable to β̂reg(η̂) and to β̂(η̂, α̂).
In the last two experiments, we generated L, X and A as in the first experiment, but generated a binary outcome Y from a logistic regression model with intercept and covariate L, yielding E(Y) ≈ 0.0496. We purposely chose to simulate a rare outcome Y because we wanted to examine the performance of β̂dr in a setting where β̂(η̂, α̂) had some nontrivial probability of falling outside the parameter space. Our estimators used the same correct and incorrect missingness models as in the first experiment and two logistic regression models for Y: the first, correctly specified, with intercept and covariate L, and the second, incorrectly specified, with intercept and covariate X. The fourth experiment differed from the third in that A was recoded as 1 − A.
The estimator η̂ was the ordinary least squares estimator and the standard logistic regression estimator in the first two and last two experiments, respectively. The estimators η̂or and η̃or were computed as η̂ except that each unit was weighted by the inverse of specific estimates of the missingness probabilities as described in Steps 1 and 3 of the procedure in § 2.2.
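For reference, the first-experiment data-generating mechanism can be reproduced as follows (a sketch; the estimators themselves are omitted).

```python
# Kang & Schafer (2007) data-generating mechanism, continuous-outcome version.
import numpy as np

def kang_schafer(n, rng):
    L = rng.normal(size=(n, 4))
    eps = rng.normal(size=n)
    Y = 210 + 27.4 * L[:, 0] + 13.7 * (L[:, 1] + L[:, 2] + L[:, 3]) + eps
    pA = 1 / (1 + np.exp(L[:, 0] - 0.5 * L[:, 1] + 0.25 * L[:, 2] + 0.1 * L[:, 3]))
    A = rng.binomial(1, pA)
    # Transformed covariates X used by the incorrectly specified models.
    X = np.column_stack([
        np.exp(L[:, 0] / 2),
        L[:, 1] / (1 + np.exp(L[:, 0])) + 10,
        (L[:, 0] * L[:, 2] / 25 + 0.6) ** 3,
        (L[:, 1] + L[:, 3] + 20) ** 2,
    ])
    return L, X, A, Y

L, X, A, Y = kang_schafer(1000, np.random.default_rng(3))
```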
Tables 1 and 2 report results for continuous and binary Y, respectively, and provide Monte Carlo estimates of the bias, root mean square error and median absolute error of the estimators of β0. Bootstrap estimators of their Monte Carlo standard errors can be found in Tables 3 and 4 of the Supplementary Material.
Table 1. Simulation results for the continuous outcome Y
| | Bias | RMSE | MAE | Bias | RMSE | MAE | Bias | RMSE | MAE | Bias | RMSE | MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Miss-C, OR-C | | | Miss-I, OR-C | | | Miss-C, OR-I | | | Miss-I, OR-I | | |
n = 200, Y observed iff A = 1 | ||||||||||||
β̂reg(η̂) | 0.07 | 2.54 | 1.77 | 0.07 | 2.54 | 1.77 | −0.51 | 3.38 | 2.36 | −0.51 | 3.38 | 2.36 |
β̂(η̂, α̂) | 0.07 | 2.54 | 1.78 | 0.07 | 2.60 | 1.79 | 0.38 | 3.50 | 2.30 | −7.71 | 44.28 | 3.59 |
β̂ipw | −0.09 | 4.16 | 2.54 | 2.07 | 10.99 | 3.20 | −0.09 | 4.16 | 2.54 | 2.07 | 10.29 | 3.20 |
β̂Cao | 0.07 | 2.54 | 1.79 | 0.06 | 2.53 | 1.78 | 0.08 | 2.59 | 1.82 | −0.41 | 3.51 | 2.14 |
β̂dr | 0.07 | 2.54 | 1.78 | 0.06 | 2.54 | 1.77 | 0.33 | 2.92 | 2.05 | −1.71 | 3.58 | 2.42 |
n = 1000, Y observed iff A = 1 | ||||||||||||
β̂reg(η̂) | 0.01 | 1.19 | 0.82 | 0.01 | 1.19 | 0.82 | −0.82 | 1.70 | 1.19 | −0.82 | 1.70 | 1.19 |
β̂(η̂, α̂) | 0.01 | 1.19 | 0.82 | 0.03 | 1.20 | 0.81 | 0.08 | 1.59 | 1.08 | −9.78 | 23.55 | 5.24 |
β̂ipw | −0.01 | 1.68 | 1.14 | 4.40 | 9.20 | 2.55 | −0.01 | 1.68 | 1.14 | 4.40 | 9.20 | 2.55 |
β̂Cao | 0.01 | 1.19 | 0.81 | 0.01 | 1.19 | 0.81 | 0.04 | 1.19 | 0.84 | −1.26 | 1.81 | 1.34 |
β̂dr | 0.01 | 1.19 | 0.81 | 0.01 | 1.19 | 0.81 | 0.06 | 1.22 | 0.86 | −2.56 | 2.91 | 2.54 |
n = 200, Y observed iff A = 0 | ||||||||||||
β̂reg(η̂) | 0.07 | 2.54 | 1.76 | 0.07 | 2.54 | 1.76 | 5.01 | 5.79 | 5.02 | 5.01 | 5.79 | 5.02 |
β̂(η̂, α̂) | 0.07 | 2.54 | 1.75 | 0.07 | 2.54 | 1.76 | 0.53 | 3.83 | 2.38 | 3.25 | 4.59 | 3.44 |
β̂ipw | 0.29 | 3.89 | 2.47 | 3.85 | 5.02 | 4.09 | 0.29 | 3.89 | 2.47 | 3.85 | 5.02 | 4.09 |
β̂Cao | 0.07 | 2.55 | 1.79 | 0.08 | 2.54 | 1.77 | 0.47 | 2.61 | 1.81 | 0.94 | 3.21 | 2.26 |
β̂dr | 0.08 | 2.54 | 1.76 | 0.08 | 2.54 | 1.77 | 1.16 | 3.01 | 2.03 | 2.54 | 3.89 | 2.82 |
n = 1000, Y observed iff A = 0 | ||||||||||||
β̂reg(η̂) | 0.01 | 1.19 | 0.83 | 0.01 | 1.19 | 0.83 | 4.93 | 5.10 | 4.93 | 4.93 | 5.10 | 4.93 |
β̂(η̂, α̂) | 0.01 | 1.19 | 0.83 | 0.01 | 1.19 | 0.83 | 0.18 | 1.67 | 1.15 | 3.09 | 3.40 | 3.09 |
β̂ipw | 0.12 | 1.68 | 1.19 | 3.71 | 3.98 | 3.71 | 0.12 | 1.68 | 1.19 | 3.71 | 3.98 | 3.71 |
β̂Cao | 0.01 | 1.19 | 0.83 | 0.01 | 1.19 | 0.84 | 0.14 | 1.21 | 0.84 | 1.12 | 1.70 | 1.24 |
β̂dr | 0.01 | 1.19 | 0.83 | 0.01 | 1.19 | 0.84 | 0.29 | 1.29 | 0.94 | 1.47 | 2.07 | 1.54 |
RMSE, root mean square error; MAE, median absolute error; Miss-C and Miss-I (OR-C and OR-I), correct and incorrect missingness (outcome regression) models; β̂reg(η̂), outcome regression estimator; β̂(η̂, α̂), standard double-robust estimator; β̂ipw, inverse probability weighted estimator; β̂Cao, Cao et al. estimator; β̂dr, new double-robust estimator.
Table 2. Simulation results for the binary outcome Y
| | Bias | RMSE | MAE | Bias | RMSE | MAE | Bias | RMSE | MAE | Bias | RMSE | MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Miss-C, OR-C | | | Miss-I, OR-C | | | Miss-C, OR-I | | | Miss-I, OR-I | | |
n = 200, Y observed iff A = 1 | ||||||||||||
β̂reg(η̂) | −0.42 | 2.92 | 1.96 | −0.42 | 2.92 | 1.96 | 0.06 | 2.90 | 1.95 | 0.06 | 2.90 | 1.95 |
β̂(η̂, α̂) | −0.42 | 2.92 | 1.96 | −0.42 | 2.92 | 1.96 | 0.07 | 2.92 | 1.96 | 0.04 | 2.96 | 1.95 |
β̂ipw | −0.24 | 4.01 | 2.56 | 2.55 | 9.67 | 3.31 | −0.24 | 4.01 | 2.56 | 2.55 | 9.67 | 3.31 |
β̂Cao | −0.21 | 3.04 | 1.99 | −0.25 | 3.04 | 1.96 | 0.05 | 2.92 | 1.98 | 0.00 | 2.93 | 1.96 |
β̂dr | −0.41 | 2.93 | 1.96 | −0.39 | 2.93 | 1.96 | 0.05 | 2.90 | 1.96 | 0.02 | 2.91 | 1.95 |
n = 1000, Y observed iff A = 1 | ||||||||||||
β̂reg(η̂) | −0.02 | 0.98 | 0.64 | −0.02 | 0.98 | 0.64 | 0.27 | 1.04 | 0.67 | 0.27 | 1.04 | 0.67 |
β̂(η̂, α̂) | −0.01 | 0.95 | 0.64 | 0.00 | 0.95 | 0.64 | 0.17 | 1.09 | 0.75 | −0.42 | 1.54 | 0.86 |
β̂ipw | −0.12 | 1.74 | 1.11 | 5.51 | 11.80 | 2.89 | −0.12 | 1.74 | 1.11 | 5.51 | 11.80 | 2.89 |
β̂Cao | 0.07 | 1.08 | 0.74 | 0.08 | 1.26 | 0.75 | 0.25 | 1.07 | 0.72 | 0.19 | 1.12 | 0.72 |
β̂dr | −0.02 | 0.95 | 0.64 | −0.01 | 0.95 | 0.64 | 0.06 | 1.01 | 0.71 | 0.04 | 1.00 | 0.67 |
n = 200, Y observed iff A = 0 | ||||||||||||
β̂reg(η̂) | 0.05 | 1.65 | 1.04 | 0.05 | 1.65 | 1.04 | 0.87 | 2.24 | 1.46 | 0.87 | 2.24 | 1.46 |
β̂(η̂, α̂) | 0.05 | 1.65 | 1.04 | 0.02 | 1.65 | 1.04 | 0.78 | 2.19 | 1.46 | 0.83 | 2.23 | 1.46 |
β̂ipw | −0.02 | 1.75 | 1.25 | −0.02 | 1.80 | 1.25 | −0.02 | 1.75 | 1.25 | −0.02 | 1.80 | 1.25 |
β̂Cao | 0.07 | 1.65 | 1.04 | 0.06 | 1.65 | 1.04 | 0.89 | 2.23 | 1.46 | 0.92 | 2.30 | 1.46 |
β̂dr | 0.05 | 1.65 | 1.04 | 0.05 | 1.65 | 1.04 | 0.82 | 2.18 | 1.43 | 0.84 | 2.20 | 1.44 |
n = 1000, Y observed iff A = 0 | ||||||||||||
β̂reg(η̂) | −0.03 | 0.71 | 0.47 | −0.03 | 0.71 | 0.47 | 0.39 | 0.88 | 0.59 | 0.39 | 0.88 | 0.59 |
β̂(η̂, α̂) | −0.03 | 0.71 | 0.48 | −0.03 | 0.71 | 0.48 | 0.11 | 0.85 | 0.56 | 0.25 | 0.84 | 0.56 |
β̂ipw | −0.02 | 0.80 | 0.56 | −0.02 | 0.84 | 0.56 | −0.02 | 0.80 | 0.56 | −0.02 | 0.84 | 0.56 |
β̂Cao | −0.02 | 0.72 | 0.49 | −0.03 | 0.73 | 0.51 | 0.30 | 0.86 | 0.57 | 0.44 | 0.97 | 0.61 |
β̂dr | −0.04 | 0.71 | 0.48 | −0.04 | 0.71 | 0.49 | 0.15 | 0.78 | 0.52 | 0.26 | 0.84 | 0.55 |
RMSE, root mean square error; MAE, median absolute error. All figures in the table are multiplied by 100. Miss-C and Miss-I (OR-C and OR-I), correct and incorrect missingness (outcome regression); β̂reg(η̂), outcome regression estimator; β̂(η̂, α̂), standard double robust estimator; β̂ipw, inverse probability weighted estimator; β̂Cao, Cao et al. estimator; β̂dr, new double robust estimator.
According to theory, when the outcome model is incorrect and the missingness model is correct, β̂(η̂, α̂), β̂ipw, β̂dr and β̂Cao are consistent, β̂reg(η̂) is inconsistent, and both β̂dr and β̂Cao are asymptotically more efficient than β̂(η̂, α̂) and β̂ipw. The performance observed for n = 1000, as quantified by bias and mean squared error, agrees with the asymptotic results, except that when Y is binary, β̂(η̂, α̂), β̂ipw, β̂dr and β̂Cao are slightly biased. The comparison of the relative biases of β̂dr and β̂Cao depends on the data generating mechanism: when Y is continuous and observed iff A = 0, the bias of β̂Cao is smaller than that of β̂dr, whereas for Y binary, β̂dr has substantially smaller bias than β̂Cao. Although not predicted by theory, when the missingness model is incorrect and the outcome model is correct, all double-robust estimators perform as well as β̂reg(η̂) when n = 1000. When both models were incorrect, β̂Cao outperformed β̂dr in Table 1 but β̂dr outperformed β̂Cao in Table 2; it is not surprising, and indeed is predicted by asymptotic theory, that their relative performance varies with the data generating mechanism. Results for n = 200 were qualitatively similar, except that β̂Cao had smaller root mean squared error for Y continuous when the missingness model was correct and the outcome regression model was incorrect.
Finally, in the binary outcome experiments with n = 200 and Y observed iff A = 1, out of the 1000 replications β̂(η̂, α̂) fell below zero 27, 54, 27 and 56 times, and β̂Cao fell below zero 49, 94, 59 and 87 times, under the four scenarios combining correct and incorrect specifications of the missingness and outcome regression models: both correct; missingness incorrect and outcome correct; missingness correct and outcome incorrect; and both incorrect. In all other replications, β̂(η̂, α̂) and β̂Cao fell between 0 and 1.
6. Concluding remarks
The proposal in this paper relies on the key observation that under the missing at random or the no-unmeasured confounders assumption, efficient estimation of parameters of increasingly larger models for the missingness or treatment probabilities improves the efficiency with which the parameters of models for the full or counterfactual data are estimated. As such, the present proposal can be extended, along the lines of § 4, to the estimation of the parameters of marginal structural mean models and of structural nested mean models for time-dependent treatments in longitudinal studies with time-dependent confounders (Robins, 1999). This extension will be reported elsewhere.
Acknowledgments
Andrea Rotnitzky, Lei Gomez and James Robins were funded by grants from the National Institutes of Health, U.S.A. Andrea Rotnitzky is also affiliated with the Harvard School of Public Health. Mariela Sued was funded by grants from the Agencia de Promocion Cientifica y Tecnica de Argentina and the Consejo Nacional de Investigaciones Cientificas y Tecnicas de Argentina. The authors wish to thank the reviewers for helpful comments.
Appendix
Proof of the equivalence of equations (7) and (14), and of equations (25) and (26). To prove the equivalence of (25) and (26), first note that uj,η,β(Āj, L̄j−1) − Eα{uj,η,β(Āj, L̄j−1) | Āj−1, L̄j−1} = b(Z){mj(L̄j−1; η) − h(Z; β)}{ωj(α) − ωj−1(α)}, where ω0(α) ≡ 1. Replacing Uj(η, α, β) with this expression in equation (25) and rearranging the terms of the sums on the left-hand side of (25), we arrive at equation (26). The equivalence between (7) and (14) is the special case of the equivalence between (25) and (26) with J = 1.
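For completeness, the one-line telescoping algebra in the J = 1 case is:

```latex
% Algebra behind the equivalence of (7) and (14), i.e., the J = 1 case of the
% equivalence of (25) and (26):
\begin{aligned}
M(\alpha,\beta)-U(\eta,\alpha,\beta)
&=\omega(\alpha)b(Z)\{Y-h(Z;\beta)\}-\{\omega(\alpha)-1\}b(Z)\{m(L;\eta)-h(Z;\beta)\}\\
&=\omega(\alpha)b(Z)\{Y-m(L;\eta)\}+b(Z)\{m(L;\eta)-h(Z;\beta)\}.
\end{aligned}
```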
Sketch of the proof that the estimators η̃or and η̂or converge in probability to the same limit. This follows because, first, η̂or and η̃or solve the same system of q estimating equations except that ω(α̂) is replaced by ω(α̃, δ̃) in the p equations (10) and, second, the left-hand side of (10) evaluated at either ω(α̂) or ω(α̃, δ̃) converges in probability to the same expectation.
Supplementary material
REFERENCES
- Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–73.
- Cao W, Tsiatis A, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96:723–34.
- Gill RD. Non- and semi-parametric maximum likelihood estimators and the von Mises method. Scand J Statist. 1989;16:97–128.
- Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data (with discussion and rejoinder). Statist Sci. 2007;22:523–80.
- Robins JM. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran ME, Berry D, editors. Statistical Models in Epidemiology: The Environment and Clinical Trials. New York: Springer; 1999. pp. 95–134. Institute for Mathematics and its Applications 116.
- Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proc Am Statist Assoc Sect Bayesian Statist Sci. 2000:6–10.
- Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Am Statist Assoc. 1995;90:122–9.
- Robins JM, Wang N. Inference for imputation estimators. Biometrika. 2000;87:113–24.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Statist Assoc. 1994;89:846–66.
- Robins J, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: performance of double-robust estimators when "inverse probability" weights are highly variable. Statist Sci. 2007;22:544–59.
- Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92.
- Rubin D, van der Laan MJ. Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. Int J Biostatist. 2008;4: article 5.
- Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric non-response models. J Am Statist Assoc. 1999;94:1096–1120.
- Tan Z. A distributional approach for causal inference using propensity scores. J Am Statist Assoc. 2006;101:1619–37.
- Tan Z. Understanding OR, PS and DR. Statist Sci. 2007;22:560–8.
- Tan Z. Comment: improved local efficiency and double robustness. Int J Biostatist. 2008;4: article 10.
- Tan Z. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika. 2010a;97:661–82.
- Tan Z. Nonparametric likelihood and doubly robust estimating equations for marginal and nested structural models. Can J Statist. 2010b;38:609–32.
- van der Laan MJ. Targeted maximum likelihood based causal inference: part I. Int J Biostatist. 2010;6: article 2.
- van der Laan MJ, Robins JM. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer; 2003.
- van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostatist. 2006;2: article 11.
- van der Vaart AW. Asymptotic Statistics. Cambridge: Cambridge University Press; 2000.