Author manuscript; available in PMC: 2018 May 4.
Published in final edited form as: Stat Sci. 2018;33(2):184–197. doi: 10.1214/18-STS647

Introduction to Double Robust Methods for Incomplete Data

Shaun R Seaman 1, Stijn Vansteelandt 2,3
PMCID: PMC5935236  EMSID: EMS76739  PMID: 29731541

Abstract

Most methods for handling incomplete data can be broadly classified as inverse probability weighting (IPW) strategies or imputation strategies. The former model the occurrence of incomplete data; the latter, the distribution of the missing variables given observed variables in each missingness pattern. Imputation strategies are typically more efficient, but they can involve extrapolation, which is difficult to diagnose and can lead to large bias. Double robust (DR) methods combine the two approaches. They are typically more efficient than IPW and more robust to model misspecification than imputation. We give a formal introduction to DR estimation of the mean of a partially observed variable, before moving to more general incomplete-data scenarios. We review strategies to improve the performance of DR estimators under model misspecification, reveal connections between DR estimators for incomplete data and ‘design-consistent’ estimators used in sample surveys, and explain the value of double robustness when using flexible data-adaptive methods for IPW or imputation.

AMS 1991 subject classifications: primary – 62 Statistics, secondary – 62A01 Foundations and philosophical topics

Key words and phrases: augmented inverse probability weighting, calibration estimators, data-adaptive methods, doubly robust, empirical likelihood, imputation, inverse probability weighting, missing data, semiparametric methods

1. Introduction

Statistical analysis of data is often complicated by the data being incomplete, e.g., due to individuals in a survey not answering a question, patients missing a clinic visit, or data simply being lost. The individuals on whom complete data are obtained (the ‘complete cases’) often constitute a non-representative subset of the sample. This makes an analysis that uses only this subset potentially biased, as well as being potentially inefficient because it discards the data available on the individuals with incomplete data (the ‘incomplete cases’). More sophisticated approaches for analysing incomplete data are designed to reduce bias and/or increase efficiency. They can broadly be classified into imputation strategies and inverse probability weighting (IPW) approaches [17].

Imputation approaches involve specifying a model (the ‘imputation model’) for the partially observed variables given any fully observed variables. Missing values are then ‘predicted’ based on this model. Multiple imputation is the most popular such approach, and has close connections to maximum likelihood (ML) methods for incomplete data. The latter methods involve implicit imputation of the missing data. A drawback of imputation approaches is that they can involve much modelling of the incomplete data and there may be large bias when the imputation model is misspecified. This potential for bias is especially large when the distribution of observed data in individuals with a given missingness pattern is very different from that in the overall population. In that case, imputation involves extrapolation under the imputation model, so that even minor model misspecification over the range of the observed data may induce large bias. Additional concerns may arise from the difficulty of specifying the imputation model in a way that obeys the structure imposed by the model that will be used to analyse the imputed data (i.e. such that it is ‘congenial with the analysis model’ [19]).

IPW methods avoid these issues of extrapolation and uncongeniality by not using an imputation model. They instead rely on a missingness model, i.e. a model for the probability that an individual is a complete case given a set of predictors of missingness. The analysis model is then fitted to just the complete cases, inversely weighting each by its estimated probability of being complete given its missingness predictors. A drawback of IPW is that it can be very inefficient, because, like the complete-case analysis, it ignores potentially useful data on the incomplete cases. It can also be subject to large finite-sample bias. Recognition of these problems led to research on augmented IPW (AIPW) estimators. These, like imputation estimators, involve a model for the conditional distribution of the partially observed variables given fully observed variables. AIPW estimators are more efficient than (unaugmented) IPW estimators when this imputation model is correctly specified. Indeed, among all estimators that, like IPW estimators, are consistent whenever the missingness model is correctly specified, AIPW estimators with correctly specified imputation models are the most efficient.

In 1999, Scharfstein et al. [32] noted that an AIPW estimator previously developed by Robins et al. [28] for estimating the mean of a partially observed variable had the property of being consistent not only when the missingness model was correctly specified, but also when an imputation model for the conditional distribution of this variable was correctly specified and the missingness model was misspecified. This property became known as ‘double robustness’ [26]. At about the same time, it was recognised [24] that Robins et al.’s estimator was closely related to a ‘generalised regression’ estimator first developed in the 1970’s for improving the efficiency of an IPW estimator of a finite-sample population mean when sampling probabilities are known [6]. Since the double robust (DR) property was discovered, many estimators possessing this property have been developed. However, the DR property has also been criticised. Simulation studies which showed that minor misspecifications of both the imputation and missingness models can sometimes induce large bias and variance in the DR estimator led to a questioning of the practical usefulness of double robustness [12]. Such scepticism has been reinforced by the availability of imputation and IPW approaches based on flexible ‘data-adaptive’ methods for fitting the imputation and missingness model, respectively, which reduce the risk of model misspecification [16].

This article is an introduction to DR methodology for incomplete data. As in much of the literature on missing data (and DR estimators in particular), we shall assume that data are missing at random (MAR). Data are said to be MAR if the conditional probability that a particular missingness pattern occurs given the data does not depend on the missing values in that pattern [35]. In Section 2, we consider the problem of estimating the mean of a partially observed variable using fully observed auxiliary variables, using this example to contrast imputation with IPW and to present a DR AIPW estimator. In Section 3 we introduce more formality and give a review of the general semiparametric theory underlying DR estimation. This enables us to describe DR estimators for more general missing data problems. So-called ‘standard’ DR estimators use ML to estimate the parameters of the missingness and imputation models. In Section 4 we review more recently developed methods which seek to improve the performance of DR estimators (relative to standard DR estimators) under model misspecification by using alternative estimators of these parameters. In Section 5, we consider the use of data-adaptive methods (e.g. smoothing methods or regularisation methods) for the imputation or missingness model. We argue that there are advantages to using these methods in DR estimators (rather than in imputation or IPW estimators). Section 6 discusses the wide variety of statistical models for which DR estimators have been proposed, DR methods for non-monotone missing and missing not at random (MNAR) data (most work has been on monotone missing, MAR data), and some possible directions of future research. Implementation of DR estimators in standard statistical packages is described in the supplemental article [36].

2. IPW, RI and AIPW for a missing outcome

For pedagogic purposes, we first consider the problem of estimating the expectation β = E(Y) of a partially observed random variable Y from a sample of size n when auxiliary variables W are observed on the whole sample. This has been the focus of much of the work on DR estimation. In Section 3, we discuss DR estimation for more general missing data problems.

Let Yi and Wi denote Y and W for the ith individual in the sample, and Ri be an indicator that Yi is observed (Ri = 1 if Yi is observed; Ri = 0 if missing). Individuals with Ri = 1 are ‘complete cases’; those with Ri = 0 are ‘incomplete cases’. Assume (W1, Y1, R1), . . . , (Wn, Yn, Rn) are independent and identically distributed. Henceforth, we omit subscripts i unless needed.

The full-data (or ‘complete-data’) estimator, $n^{-1}\sum_{i=1}^n Y_i$, for β is infeasible when Y can be missing. The complete-case estimator, $\sum_{i=1}^n R_i Y_i / \sum_{i=1}^n R_i$, is typically inconsistent unless R is independent of Y. The IPW, regression imputation (RI) and DR estimators described below are valid under the weaker assumption that R is independent of Y given W, i.e. that (W1, Y1, R1), . . . , (Wn, Yn, Rn) are MAR.

In IPW, each complete case is weighted by $\pi(W)^{-1}$, where π(W) = P(R = 1 | W) is the probability that an individual with this value of W would be a complete case. Each complete case then represents $\pi(W)^{-1}$ individuals in the population, all with the same W value. One of these would have observed Y if sampled; the others would have missing Y. The weighted sample of complete cases therefore has (over repeated samples) the same distribution of W as the population, and by MAR, also the same distribution of Y as the population. This motivates the IPW estimators of β: $n^{-1}\sum_{i=1}^n R_i \pi(W_i)^{-1} Y_i$ [11] and $\sum_{i=1}^n R_i \pi(W_i)^{-1} Y_i / \sum_{i=1}^n R_i \pi(W_i)^{-1}$. Since π(W) is unknown unless data are missing by design, a model π(W; α), called the ‘missingness model’, is specified for it and an estimator α̂ of α calculated from data (R1, W1, . . . , Rn, Wn). E.g., one could use π(W; α) = expit(αW), with α estimated by ML. Let $\hat\beta_{IPW} = \hat\beta_{IPW}(\hat\alpha) = n^{-1}\sum_{i=1}^n R_i \pi(W_i;\hat\alpha)^{-1} Y_i$ and $\hat\beta_{IPW,B} = \hat\beta_{IPW,B}(\hat\alpha) = \sum_{i=1}^n R_i \pi(W_i;\hat\alpha)^{-1} Y_i / \sum_{i=1}^n R_i \pi(W_i;\hat\alpha)^{-1}$ denote the IPW estimators with estimated weights (‘B’ in the subscript ‘IPW,B’ stands for ‘sample bounded’: β̂IPW,B is guaranteed to lie within the range of the observed Y values). If π(W; α) is correctly specified and α̂ is a consistent estimator of α, then β̂IPW and β̂IPW,B are consistent estimators of β, provided that there exists a δ > 0 such that P{π(W) ≥ δ} = 1 (this ‘positivity’ assumption rules out scenarios where individuals with certain values of W cannot be complete cases) and π(W; α) is a sufficiently smooth function of α.
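The two IPW estimators differ only in their denominator, which a small sketch can make concrete. It assumes the fitted probabilities π(Wi; α̂) have already been obtained from some missingness model; the function name `ipw_estimates` and the toy data are illustrative and do not come from the paper.

```python
def ipw_estimates(y, r, pi_hat):
    """Return (beta_IPW, beta_IPW_B) for the mean of a partially observed Y.

    y      : outcomes; entries with r == 0 are never used (may be None)
    r      : response indicators (1 = observed, 0 = missing)
    pi_hat : fitted probabilities P(R = 1 | W), assumed strictly positive
    """
    n = len(y)
    # The weight R_i / pi_i is zero for every incomplete case.
    weights = [ri / pi for ri, pi in zip(r, pi_hat)]
    weighted_sum = sum(w * yi for w, yi, ri in zip(weights, y, r) if ri == 1)
    beta_ipw = weighted_sum / n               # divides by the sample size
    beta_ipw_b = weighted_sum / sum(weights)  # divides by the sum of weights
    return beta_ipw, beta_ipw_b


# Hypothetical toy data: four individuals, the third has missing Y.
y = [1.0, 2.0, None, 4.0]
r = [1, 1, 0, 1]
pi_hat = [0.5, 0.5, 0.5, 1.0]
beta_ipw, beta_ipw_b = ipw_estimates(y, r, pi_hat)
print(beta_ipw, beta_ipw_b)  # 2.5 2.0
```

Because the second estimator is a weighted average of observed Y values, it cannot leave their range, whereas the first can when the weights do not sum to n.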

In RI, a parametric model m(W; γ) for E(Y | W) is specified. This is called the ‘outcome model’. Let γ̂ be an estimator of γ (e.g. the ML estimator calculated using the complete cases). Parameter β is then estimated by $\hat\beta_{RI} = \hat\beta_{RI}(\hat\gamma) = n^{-1}\sum_{i=1}^n m(W_i;\hat\gamma)$. If m(W; γ) is correctly specified and γ̂ is consistent, then β̂RI is consistent. Moreover, if γ̂ is efficient, then so is β̂RI. Note that if m(W; γ) is a canonical generalised linear model that includes an intercept and γ̂ is the ML estimator, then $\sum_{i=1}^n R_i m(W_i;\hat\gamma) = \sum_{i=1}^n R_i Y_i$ and so β̂RI can be written as $n^{-1}\sum_{i=1}^n \{R_i Y_i + (1 - R_i) m(W_i;\hat\gamma)\}$. The RI estimator then equals the mean of Y after replacing missing values by imputed values m(W; γ̂).
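A minimal sketch of RI follows, assuming a scalar W and a linear outcome model fitted by least squares on the complete cases (which is ML under a normal working model). The helper names `fit_line` and `ri_estimate` and the data are hypothetical. The sketch also checks the ‘impute then average’ form numerically.

```python
def fit_line(w, y):
    """Least-squares intercept and slope for the line y = g1 + g2 * w."""
    n = len(w)
    w_bar = sum(w) / n
    y_bar = sum(y) / n
    sxx = sum((wi - w_bar) ** 2 for wi in w)
    sxy = sum((wi - w_bar) * (yi - y_bar) for wi, yi in zip(w, y))
    slope = sxy / sxx
    return y_bar - slope * w_bar, slope


def ri_estimate(w, y, r):
    """Fit the line on complete cases, then average fitted values over everyone."""
    w_cc = [wi for wi, ri in zip(w, r) if ri == 1]
    y_cc = [yi for yi, ri in zip(y, r) if ri == 1]
    g1, g2 = fit_line(w_cc, y_cc)
    fitted = [g1 + g2 * wi for wi in w]
    return sum(fitted) / len(w), fitted


# Hypothetical toy data: the fourth individual has missing Y.
w = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 4.0, None]
r = [1, 1, 1, 0]
beta_ri, fitted = ri_estimate(w, y, r)

# Equivalent form: keep observed Y, impute m(W; gamma-hat) where Y is missing.
# The two agree because least-squares residuals sum to zero over complete cases.
imputed = [yi if ri == 1 else mi for yi, ri, mi in zip(y, r, fitted)]
beta_impute = sum(imputed) / len(imputed)
print(round(beta_ri, 4), round(beta_impute, 4))
```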

The efficiency of β̂RI comes at the cost of assuming that model m(W; γ) is correctly specified. When there is little overlap between the distributions of W in complete and incomplete cases, the RI estimator works by extrapolating the relation between W and Y estimated from complete cases to regions of the W space where incomplete cases but few (if any) complete cases lie. This extrapolation is potentially risky, because even models that fit the data on complete cases perfectly may give a poor approximation of E(Y | W) in these regions [45]. This is illustrated by the following example.

Example 1: Let P(W = 0) = P(W = 1) = P(W = 2) = 1/3, logit P(R = 1 | W) = 4 − 4W, and either a) Y | W ~ N(W, $\sigma^2$) or b) Y | W ~ N(I(W ≥ 1), $\sigma^2$), where I(.) denotes the indicator function.

In case a), the RI estimator β̂RI based on linear regression model m(W; γ) = γ1 + γ2W with ML estimator γ̂ is consistent; in case b), it is inconsistent ($\hat\beta_{RI} \xrightarrow{p} 0.94$, whereas E(Y) = 0.67). This is a concern because, unless the sample size were very large, it would be difficult to decide on the basis of the observed data whether this linear regression model is correctly specified, as there would generally be few complete cases with W = 2. The IPW estimators β̂IPW and β̂IPW,B based on model logit π(W; α) = α1 + α2W with ML estimator α̂ are consistent in both cases. While these also rely on a model (for π(W)) which may be misspecified, its goodness of fit is arguably easier to assess because this requires data only on R and W, which are fully observed, and there is no need for extrapolation outside the observed data range. The variances of both IPW estimators are larger than that of β̂RI, because of the large weights attributed to the small proportion of complete cases with W = 2. For example, using simulation we estimated that, when n = 100000 and $\sigma^2 = 1$, the variances ($\times 10^5$) of β̂IPW, β̂IPW,B and β̂RI are 51, 28 and 6.1, respectively. The relatively large variances of the IPW estimators can be seen as reflecting genuine uncertainty about β, in contrast to the variance of β̂RI, which does not reflect model and extrapolation uncertainty about E(Y | W = 2). This uncertainty could be accommodated by using more flexible outcome models, but this would drastically increase the variance of β̂RI; indeed ultimately, if W is categorical and the missingness and outcome models are saturated, the IPW and RI estimators (and their variance estimators) are equivalent [21]. More generally, when the outcome model is not saturated and there is very little overlap between the distributions of W in complete and incomplete cases, it may be difficult to ensure that a flexible outcome model is sufficiently flexible outside the region of the W space where the complete cases lie.
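The probability limit quoted for case b) can be checked without simulation. The sketch below assumes that, in large samples, the complete-case least-squares line converges to the population least-squares line under the distribution of W among complete cases, and then averages that line over the full distribution of W. All variable names are illustrative.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


support = [0, 1, 2]                            # W is uniform on {0, 1, 2}
p_w = {w: 1.0 / 3.0 for w in support}
pi = {w: expit(4 - 4 * w) for w in support}    # P(R = 1 | W = w)
mean_y = {w: float(w >= 1) for w in support}   # case b): E(Y | W) = I(W >= 1)

# Distribution of W among complete cases: proportional to P(W = w) * pi(w).
total = sum(p_w[w] * pi[w] for w in support)
q = {w: p_w[w] * pi[w] / total for w in support}

# Population least-squares line among complete cases.
ew = sum(q[w] * w for w in support)
ey = sum(q[w] * mean_y[w] for w in support)
cov = sum(q[w] * (w - ew) * (mean_y[w] - ey) for w in support)
var = sum(q[w] * (w - ew) ** 2 for w in support)
slope = cov / var
intercept = ey - slope * ew

# Limit of the RI estimator: average the fitted line over all of W.
ri_limit = sum(p_w[w] * (intercept + slope * w) for w in support)
true_mean = sum(p_w[w] * mean_y[w] for w in support)
print(round(ri_limit, 2), round(true_mean, 2))  # 0.94 0.67
```

The fitted line is driven almost entirely by the complete cases at W = 0 and W = 1, so it extrapolates a value near 1.86 at W = 2, where the true conditional mean is 1.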

The inefficiency of the IPW estimators is a serious drawback, but it can be reduced by making more use of the W data on the incomplete cases. In particular, the augmented IPW (AIPW) estimator

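$$\hat\beta_{DR} = \hat\beta_{DR}(\hat\alpha,\hat\gamma) = \frac{1}{n}\sum_{i=1}^n \frac{R_i}{\pi(W_i;\hat\alpha)} Y_i + \frac{1}{n}\sum_{i=1}^n \left\{1 - \frac{R_i}{\pi(W_i;\hat\alpha)}\right\} m(W_i;\hat\gamma) \qquad (1)$$
$$= \frac{1}{n}\sum_{i=1}^n m(W_i;\hat\gamma) + \frac{1}{n}\sum_{i=1}^n \frac{R_i}{\pi(W_i;\hat\alpha)} \{Y_i - m(W_i;\hat\gamma)\}, \qquad (2)$$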

of β, where α̂ and γ̂ are estimators of α and γ, is efficient relative to all estimators that rely solely on correct specification of the missingness model, provided that the outcome model is also correctly specified (see Section 3). The first term on the right-hand side of equation (1) is just β̂IPW and the second term is called the augmentation term. This uses data on W on the incomplete cases to improve the efficiency of the estimator. In the alternative (equivalent) expression for β̂DR, equation (2), the first term on the right equals β̂RI and the second term can be viewed as a ‘correction’ term: it uses IPW to estimate how much β̂RI overestimates (or underestimates) E(Y) and then subtracts this. Estimator β̂DR is consistent and asymptotically normally distributed when either i) π(W; α) is correctly specified and α̂ is a consistent estimator of α, or ii) m(W; γ) is correctly specified and γ̂ is a consistent estimator of γ, a property known as ‘double robustness’. A formal proof of this is given in the supplemental article [36], but essentially it is because: i) when π(W; α) is correctly specified, the augmentation term converges to zero (because then α̂ converges to the true value of α and E{R/π(W; α) | W } = 1 at this true value); and ii) when m(W; γ) is correctly specified, the correction term converges to zero (because then γ̂ converges to the true value of γ and E(Y | W, R) = m(W; γ) at this true value).
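The equivalence of expressions (1) and (2) can be seen numerically in a short sketch. It assumes the fitted probabilities π(Wi; α̂) and fitted means m(Wi; γ̂) are supplied by the analyst; the function name `aipw_mean` and the toy numbers are illustrative only.

```python
def aipw_mean(y, r, pi_hat, m_hat):
    """Return the AIPW estimate of E(Y) computed via expressions (1) and (2)."""
    n = len(r)
    # Missing Y values are replaced by 0.0; they are always multiplied by R_i = 0.
    y0 = [yi if ri == 1 else 0.0 for yi, ri in zip(y, r)]
    ratio = [ri / pi for ri, pi in zip(r, pi_hat)]
    # Expression (1): IPW term plus augmentation term.
    form1 = (sum(k * yi for k, yi in zip(ratio, y0))
             + sum((1 - k) * mi for k, mi in zip(ratio, m_hat))) / n
    # Expression (2): RI term plus weighted residual correction.
    form2 = (sum(m_hat)
             + sum(k * (yi - mi) for k, yi, mi in zip(ratio, y0, m_hat))) / n
    return form1, form2


# Hypothetical toy data: the third individual has missing Y.
y = [1.0, 2.0, None, 4.0]
r = [1, 1, 0, 1]
pi_hat = [0.5, 0.5, 0.5, 1.0]
m_hat = [1.5, 1.5, 3.0, 3.5]
form1, form2 = aipw_mean(y, r, pi_hat, m_hat)
print(form1, form2)  # 2.5 2.5
```

In this toy data set the augmentation term happens to be exactly zero, so the estimate coincides with the unnormalised IPW estimate; in general the two differ.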

The DR estimator β̂DR can be much more efficient than β̂IPW when both π(W; α) and m(W; γ) are correctly specified and α̂ and γ̂ are consistent, especially when Var(Y | W) is small relative to Var(Y), i.e. when W is a strong predictor of Y [25]. This is because the correction term in equation (2) is then small relative to the first term, and so $\hat\beta_{DR} \approx \hat\beta_{RI}$. Indeed, when both π(W; α) and m(W; γ) are correctly specified, it can be shown (see [36]) that $n\mathrm{Var}(\hat\beta_{DR}) \to \mathrm{Var}(Y) + E[\{1 - \pi(W)\}\pi(W)^{-1}\mathrm{Var}(Y \mid W)]$ as n → ∞, which equals n times the variance of the (infeasible) full-data estimator $n^{-1}\sum_{i=1}^n Y_i$ when Var(Y | W) = 0. To illustrate this, we return to case a) of Example 1, where Y | W ~ N(W, $\sigma^2$).

Example 1 continued: When n = 100000 and $\sigma^2 = 1$, the variances ($\times 10^5$) of β̂IPW, β̂IPW,B, β̂RI, β̂DR and the full-data estimator are, respectively, 51, 28, 6.1, 20 and 1.7: the DR estimator is more efficient than the IPW estimators, though not as efficient as the RI estimator. When n = 100000 and $\sigma^2 = 0.01$, the variances ($\times 10^5$) are 31, 8.8, 0.72, 0.86 and 0.67: the DR, RI and full-data estimators are close to being equally efficient.
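The DR variances reported above can be compared with the limiting formula Var(Y) + E[{1 − π(W)}π(W)⁻¹ Var(Y | W)]. The following sketch evaluates it for case a) of Example 1, using Var(Y) = Var(W) + σ² and {1 − π(w)}/π(w) = exp(4w − 4); the function name is illustrative. Since n = 100000, the limit of nVar(β̂DR) is directly comparable to the tabulated variance multiplied by 10⁵.

```python
import math


def dr_asymptotic_variance(sigma2):
    """Limit of n * Var(beta_DR) and of n * Var(full-data mean), Example 1 case a)."""
    support = [0, 1, 2]
    p_w = 1.0 / 3.0                              # W uniform on {0, 1, 2}
    mean_w = sum(p_w * w for w in support)
    var_w = sum(p_w * (w - mean_w) ** 2 for w in support)
    var_y = var_w + sigma2                       # Var(Y) = Var(W) + sigma^2
    # Odds of being incomplete: (1 - pi) / pi = exp(4w - 4) when logit pi = 4 - 4w.
    mean_odds = sum(p_w * math.exp(4 * w - 4) for w in support)
    return var_y + mean_odds * sigma2, var_y


for sigma2 in (1.0, 0.01):
    dr_var, full_var = dr_asymptotic_variance(sigma2)
    print(sigma2, round(dr_var, 2), round(full_var, 2))
```

The values obtained, roughly 20.2 and 0.86 for the DR estimator, agree with the simulation-based figures of 20 and 0.86 up to rounding and Monte Carlo error.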

Example 2: Wirth et al. [48] used data from the National Family Health Survey 3 to estimate the percentage of sexually-active Indian men who had paid for sex in the past year. Of the 49700 men surveyed, 3% refused to answer the question about paying for sex; these were more likely to be young, unmarried, unemployed and to believe that a husband has the right to have sex with another woman. Among men who answered the question, the percentage reporting paying for sex was 0.9%. Wirth et al. built missingness and outcome models using 24 variables thought to be predictive of paying for sex and/or refusing to answer (e.g. age, education, marital status). The resulting DR estimate of the percentage paying for sex was 1.1%. Among unmarried men, 18% refused to answer the question, 6.9% of those who answered reported paying for sex, and the DR estimate was 12.3%.

3. Semiparametric theory of DR estimators

DR estimators do not require correct specification of the entire data-generating distribution, and are semiparametric in this sense. Semiparametric efficiency was, and continues to be, very important in DR theory: the development of DR estimators by Robins and others, and of earlier related survey sampling estimators, was motivated by the goal of improving the efficiency of IPW estimators; only later was the DR property of these estimators recognised. In this section, we give an introduction to the semiparametric theory that underlies DR estimators and describe estimators for more general missing data problems than that discussed in Section 2. A more detailed account of semiparametric theory for DR estimators can be found in, e.g. [40] or [41].

3.1. Semiparametric models and m-estimators

Assume that random variables Z1, . . . , Zn are independently and identically distributed with density f(z). A semiparametric model is a model ℳ for the density f(z) of Z that parameterises one or more aspects of f(z) in terms of an unknown finite-dimensional parameter β but leaves other aspects unrestricted.

An example is the model for Z = (Y, X, W) that assumes

E(Y|X)=μ(X;β), (3)

where µ(X; β) is a known vector function of X and β, but which otherwise leaves f(z) unrestricted. This is known as a restricted moment model and is usually fitted using generalised estimating equations [15]. A specific example of this model is the semiparametric regression model E(Y | X) = g(βX) for scalar outcome Y, covariates X and known link function g(.). Other examples of semiparametric models are the Cox proportional hazards model, which restricts hazard ratios but otherwise leaves f(z) unrestricted, and the nonparametric model, which places no restriction on f(z). The parameter of interest in the nonparametric model could be, e.g., $\beta = E(Y) = \int y f(z)\,dz$, where Y denotes an element of Z; then the obvious estimator of β is $n^{-1}\sum_{i=1}^n Y_i$.

Example 3: Schnitzer et al. [33] used data from randomised trials of anti-HIV therapy. The semiparametric regression model logit P (Y = 1 | X1, X2, X3) = βint + β1X1 + β2X2 + β3X3 was used to predict occurrence of a clinical event in a patient within five years (Y) as a function of his/her baseline CD4 (X1) and CD8 cell (X2) counts and age (X3) while he/she remained on assigned therapy.

Example 4: Seaman and Copas [34] used data from a different HIV trial. The binary outcome of interest Yt (t = 1, . . . , T) was whether HIV RNA was detectable in the patient at timepoint t (RNA was measured every 12 weeks for three years). Seaman and Copas estimated how the probability of detectable RNA changed over time in each of the three trial arms. They used the semiparametric regression model $\mathrm{logit}\, P(Y_t = 1 \mid X_1, X_2, X_3) = \sum_{k=1}^{3} X_k(\beta_{\mathrm{int},k} + \beta_{\mathrm{slo},k}\, t)$, where binary Xk = 1 if the patient is in arm k and βslo,k is the slope for arm k.

Example 5: Qi et al. [23] used Cox regression to model the dependence of the hazard of bone fracture on age and bone mineral density in a cohort of postmenopausal women. The semiparametric model was h(t | X1, X2) = h0(t) exp(β1X1 + β2X2), where h0(t) is the baseline hazard at time t and h(t | X1, X2) is the hazard given age (X1) and mineral density (X2).

Much of the focus of semiparametric theory has been on finding consistent estimators with the greatest asymptotic efficiency, i.e. smallest asymptotic variance. This search has been restricted to estimators that are regular and asymptotically linear (RAL) (see the supplemental article [36] for definition of RAL). If an estimator β̂ of parameter β in a semiparametric or parametric model ℳ is RAL, then, for all densities f(z) allowed by model ℳ, β̂ is consistent and asymptotically normally distributed (CAN). Therefore, in particular, β̂ converges to β and nVar(β̂) converges to a constant (which may depend on f(z)) as n → ∞.

For most models ℳ, the task of identifying which of the RAL estimators of β is asymptotically the most efficient among all the RAL estimators under that model requires correct specification of restrictions on aspects of f(z) beyond the restrictions already implied by model ℳ. E.g., ℳ might impose restrictions only on conditional expectations of Z, while identifying the most efficient RAL estimator under ℳ might additionally require correct specification of conditional variances. When this RAL estimator is only the most asymptotically efficient when those further aspects of f(z) are correctly modelled, it is called ‘locally (semiparametric) efficient’ under model ℳ; otherwise it is called ‘globally efficient’. For many models ℳ, locally semiparametric efficient estimators are difficult to obtain. We therefore often content ourselves with finding the most asymptotically efficient among all the RAL estimators in a large subclass of RAL estimators. Such estimators are called ‘locally (semiparametric) efficient’ over the considered class. Local efficiency is important in DR theory, because most — if not all — DR RAL estimators are locally efficient over a large class of RAL estimators. This explains why the search for DR estimators is often helped by the search for efficient estimators, as we show in the next section.

Many RAL estimators for parametric and semiparametric models are m-estimators. We shall focus on these. An m-estimator β̂ is the solution to estimating equations of the form $\sum_{i=1}^n u(Z_i;\hat\beta) = 0$ for some function u(Z; β) of Z and β such that E{u(Z; β0)} = 0, where β0 denotes the true value of β. Subject to regularity conditions [40], $\hat\beta \xrightarrow{p} \beta_0$ as n → ∞. One example of an m-estimator is that using u(Z; β) = Y − β to estimate β = E(Y) in the nonparametric model. Solving $\sum_{i=1}^n (Y_i - \beta) = 0$ yields the estimator $\hat\beta = n^{-1}\sum_{i=1}^n Y_i$. All RAL estimators of β in this model are asymptotically equivalent to this β̂ (which is therefore globally efficient over the class of all RAL estimators under this model). Another example is estimation of β in the restricted moment model (equation (3)). It can be shown that all RAL estimators of β in this model are asymptotically equivalent to an m-estimator with u(Z; β) = A(X){Y − µ(X; β)} for some conformable matrix A(X) of full rank, and conversely that all m-estimators of this form are RAL estimators of β in this model [40]. Over the class of all RAL estimators of β in this model, the locally efficient one at the true distribution of Z is that using $A(X) = D(X)V^{-1}(X)$, where $D(X) = \partial\mu(X;\beta)/\partial\beta$ evaluated at β = β0 and V(X) = Var(Y | X). A third example of an m-estimator is the ML estimator of β in a parametric model: here u(Z; β) is the score function.
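For a scalar β, an m-estimator can be computed by finding the root of the summed estimating function. The sketch below uses bisection, which assumes the sum changes sign exactly once on a user-supplied interval; the solver name `solve_m_estimator`, the interval bounds and the toy data are all hypothetical. It is applied first to u = Y − β and then to a one-parameter restricted moment model E(Y | X) = expit(βX) with A(X) = X.

```python
import math


def solve_m_estimator(u, data, lower, upper, tol=1e-10):
    """Solve sum_i u(z_i, beta) = 0 for scalar beta by bisection.

    Assumes the summed estimating function changes sign on [lower, upper].
    """
    def total(beta):
        return sum(u(z, beta) for z in data)

    f_lower = total(lower)
    if f_lower * total(upper) > 0:
        raise ValueError("estimating function does not change sign on the interval")
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        f_mid = total(mid)
        if f_lower * f_mid <= 0:
            upper = mid                    # root lies in [lower, mid]
        else:
            lower, f_lower = mid, f_mid    # root lies in [mid, upper]
    return 0.5 * (lower + upper)


# Nonparametric model: u(Z; beta) = Y - beta recovers the sample mean.
y = [2.0, 3.0, 7.0]
beta_mean = solve_m_estimator(lambda z, b: z - b, y, lower=0.0, upper=10.0)


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


# Restricted moment model with u(Z; beta) = X * {Y - expit(beta * X)}.
pairs = [(1.0, 1), (1.0, 0), (1.0, 1), (1.0, 1)]   # (x, y) records
beta_logit = solve_m_estimator(
    lambda z, b: z[0] * (z[1] - expit(b * z[0])), pairs, lower=-5.0, upper=5.0
)
print(round(beta_mean, 6), round(beta_logit, 6))
```

With every X equal to 1 and three of four outcomes equal to 1, the second equation is solved where expit(β) = 3/4, i.e. at β = log 3.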

3.2. Construction of DR estimators

Suppose Z is only partially observed. The aim is still to estimate β in the semiparametric model ℳ for the full data (Z1, . . . , Zn), but incompleteness of the data makes use of the full-data m-estimator of Section 3.1 infeasible and we instead seek an estimator that uses only the observed data. Semiparametric theory shows how to convert a RAL m-estimator for full data into a RAL m-estimator for observed data. This is relatively straightforward when data Z are MAR and monotone missing, and we now show how to do this. Consider first the situation where there are only two missingness patterns. Here we can write Z = (Z(1)⊤, Z(2)⊤)⊤, where Z(1) is observed on the whole sample and Z(2) is observed on a subset of the sample. The latter could be, e.g. the outcome or a covariate in a restricted moment model. For each individual, let R = 1 if Z(2) is observed and R = 0 if it is missing. Individuals with R = 1 are the complete cases. The observed data are $(Z_1^{(1)}, R_1 Z_1^{(2)}, R_1, \ldots, Z_n^{(1)}, R_n Z_n^{(2)}, R_n)$.

The MAR assumption implies that P(R = 1 | Z) = P(R = 1 | Z(1)). Let π(Z(1)) = P(R = 1 | Z(1)). Assume there exists a δ > 0 such that P{π(Z(1)) ≥ δ} = 1. A parametric model π(Z(1); α) is specified for π(Z(1)), where π(Z(1); α) is a sufficiently smooth function of α. Denote by ℳmiss the semiparametric model for (Z(1), RZ(2), R) defined by model ℳ for Z, model π(Z(1); α) for R given Z(1), and the MAR assumption. Suppose that the solution to the m-estimating equations $\sum_{i=1}^n u(Z_i;\beta) = 0$ is a full-data RAL estimator for β under model ℳ. Then a corresponding observed-data estimator is the solution to the AIPW estimating equations

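$$\sum_{i=1}^n \left[ \frac{R_i}{\pi(Z_i^{(1)};\hat\alpha)}\, u(Z_i;\beta) + \left\{1 - \frac{R_i}{\pi(Z_i^{(1)};\hat\alpha)}\right\} \phi(Z_i^{(1)};\beta) \right] = 0, \qquad (4)$$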

where α̂ is an estimator of α based on data $(R_1, Z_1^{(1)}, \ldots, R_n, Z_n^{(1)})$, e.g. the ML estimator, and ϕ(Z(1); β) is some function of Z(1) and β. If π(Z(1); α) is correctly specified and α̂ is a consistent estimator of α, then the solution to equation (4) is a RAL estimator for β under model ℳmiss. That is, it is CAN when models ℳ and π(Z(1); α) are correctly specified and data Z are MAR. We prove this later, after introducing the DR estimator. For the restricted moment model in particular, all observed-data RAL estimators of β are asymptotically equivalent to an m-estimator of the form of equation (4) with u(Z; β) = A(X){Y − µ(X; β)} for some conformable matrix A(X) of full rank.

If ϕ(Z(1); β) is chosen to be zero, equations (4) reduce to IPW estimating equations, which use only data on complete cases. Semiparametric theory shows that the optimally efficient choice of ϕ(Z(1); β) is ϕopt(Z(1); β) = E{u(Z; β) | Z(1), R = 1}. That is, the asymptotically most efficient RAL estimator of β among the class of estimators that solve equations (4) for a fixed choice of u(Z; β) is that which uses ϕ(Z(1); β) = ϕopt(Z(1); β). Put formally, V (u, ϕ)−V (u, ϕopt) is non-negative definite for any ϕ(.), where V (u, ϕ) denotes the asymptotic variance of the estimator that uses u(.) and ϕ(.).

In practice, E{u(Z; β) | Z(1), R = 1} is unknown. So, a parametric imputation model ϕ(Z(1); β, γ) for E{u(Z; β) | Z(1), R = 1} is specified. This model can be specified either directly, or indirectly by choosing a model f(z(2) | Z(1), R = 1; γ) for f(z(2) | Z(1), R = 1). Denote by ℳimp the semiparametric model for (Z(1), RZ(2), R) defined by models ℳ and ϕ(Z(1); β, γ) and the MAR assumption. Let γ̂ denote an estimator of γ based on the complete cases (e.g. the ML estimator). Now, β can be estimated as the solution β̂DR = β̂DR (α̂, γ̂) to the DR (AIPW) estimating equations

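$$\sum_{i=1}^n S_{\beta,i}(\beta,\hat\alpha,\hat\gamma) = \sum_{i=1}^n \left[ \frac{R_i}{\pi(Z_i^{(1)};\hat\alpha)}\, u(Z_i;\beta) + \left\{1 - \frac{R_i}{\pi(Z_i^{(1)};\hat\alpha)}\right\} \phi(Z_i^{(1)};\beta,\hat\gamma) \right] = 0. \qquad (5)$$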

This is a RAL estimator for β under model ℳmiss when π(Z(1); α) is correctly specified and α̂ is consistent. Moreover, it turns out that β̂DR is also a RAL estimator for β under model ℳimp when γ̂ is consistent. That is, β̂DR is CAN if model ℳ is correctly specified, the data are MAR, and either i) π(Z(1); α) is correctly specified and α̂ is consistent, or ii) ϕ(Z(1); β, γ) is correctly specified and γ̂ is consistent (or both) (see the supplemental article [36] for proof). For this reason, β̂DR is called ‘double robust’. On the other hand, the IPW estimator which replaces ϕ(Z(1); β, γ) with zero is a RAL estimator of β only under model ℳmiss. That is, it is CAN only if π(Z(1); α) is correctly specified.

Let us apply equation (5) to the missing outcome problem of Section 2. In this case, Z(2) = Y, Z(1) = W, u(Z; β) = Y − β and ϕ(Z(1); β, γ̂) = m(W; γ̂) − β. Here E{u(Z; β) | Z(1), R = 1} = E(Y | W, R = 1) − β, and so it suffices to specify a model m(W; γ) for E(Y | W, R = 1). It is easy to show that the solution β̂DR to equation (5) is the same estimator β̂DR that we met in Section 2. If, on the other hand, ϕ(Z(1); β, γ̂) is set to zero, then the solution to equation (5) is β̂IPW,B.
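The algebra behind this claim is short and can be written out as follows, using only the definitions above. Substituting u(Z; β) = Y − β and ϕ(Z(1); β, γ̂) = m(W; γ̂) − β into equation (5), the coefficients of β sum to −n, so the equation can be solved explicitly.

```latex
\begin{align*}
0 &= \sum_{i=1}^n \left[ \frac{R_i}{\pi(W_i;\hat\alpha)} (Y_i - \beta)
     + \left\{ 1 - \frac{R_i}{\pi(W_i;\hat\alpha)} \right\}
       \{ m(W_i;\hat\gamma) - \beta \} \right] \\
  &= \sum_{i=1}^n \left[ \frac{R_i}{\pi(W_i;\hat\alpha)} Y_i
     + \left\{ 1 - \frac{R_i}{\pi(W_i;\hat\alpha)} \right\} m(W_i;\hat\gamma) \right]
     - n\beta , \\
\hat\beta_{DR} &= \frac{1}{n} \sum_{i=1}^n \frac{R_i}{\pi(W_i;\hat\alpha)} Y_i
     + \frac{1}{n} \sum_{i=1}^n
       \left\{ 1 - \frac{R_i}{\pi(W_i;\hat\alpha)} \right\} m(W_i;\hat\gamma) .
\end{align*}
% With \phi set to zero, equation (5) becomes
% \sum_i R_i \pi(W_i;\hat\alpha)^{-1} (Y_i - \beta) = 0, whose solution is
\begin{equation*}
\hat\beta_{IPW,B} = \frac{\sum_{i=1}^n R_i \pi(W_i;\hat\alpha)^{-1} Y_i}
                         {\sum_{i=1}^n R_i \pi(W_i;\hat\alpha)^{-1}} .
\end{equation*}
```

The last display of the first block is equation (1) of Section 2.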

When π(Z(1); α) and ϕ(Z(1); β, γ) are correctly specified, the solution β̂DR to equations (5) is locally efficient over the class of estimators that solve equations (4) for the given u(Zi; β) and arbitrary $\phi(Z_i^{(1)};\beta)$. More broadly, however, the efficiency of β̂DR also depends on the choice of function u(Z; β). The choice that maximises efficiency under model ℳmiss is generally difficult to find [40]. It is usually different from that which gives local efficiency under model ℳ. For example, we saw in Section 3.1 that for the restricted moment model, $u(Z;\beta) = D(X)V^{-1}(X)\{Y - \mu(X;\beta)\}$ gives the locally efficient estimator under ℳ. This is not necessarily the efficient choice under ℳmiss. An exception to this general rule is the missing outcome problem of Section 2, where u(Z; β) = Y − β gives global efficiency under ℳ and local efficiency under ℳmiss.

So far, we have considered the case where there are only two missingness patterns. The general case of monotone missing data (e.g. longitudinal data with dropout) is treated in the supplemental article [36]. Here, the DR estimator is CAN if either of two sets of models is correctly specified. The first set is for the conditional probability of dropout at each time point given the variables available at that time. The second set is for the conditional expectation of u(Z; β) given the variables available up to each time point and not dropping out before that time.

Example 3 continued: The difficulty faced by Schnitzer et al. in estimating the parameters of their prediction model was that many clinical events were censored, due to loss to follow-up or deviation from assigned therapy. To deal with this, they used a DR estimator. During follow-up, CD4-cell and HIV-RNA counts were measured at least every 16 weeks, and the dropout and conditional expectation models for each timepoint used the CD4 and RNA counts measured at the previous timepoint.

Example 4 continued: In the trial considered by Seaman and Copas, 16% of patients dropped out before the end. Dropout was higher among patients who were younger, injected drugs or were no longer on assigned therapy, making estimates based on complete cases potentially biased. So, Seaman and Copas used DR estimation. The dropout and conditional expectation models for each timepoint used treatment arm, injecting behaviour, and an indicator of being on assigned therapy, CD4 cell count and RNA count at the previous timepoint.

Example 5 continued: In the cohort used by Qi et al. bone mineral density was measured in less than 10% of women, making a complete-case analysis potentially inefficient. They instead used DR estimation to handle these missing covariate data. This required models for the probability that mineral density was observed and for the distribution of mineral density given that it was observed. The covariates in these models were the event/censoring time, the event indicator and age, and both models were estimated using kernel smoothers. By using the data on all the women, the precisions of the hazard ratio estimates were increased relative to complete-case estimates. A description of the method used can be found in the supplemental article [36].

3.3. Asymptotic distribution of DR estimators

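The variance of β̂DR generally depends on the choice of estimators α̂ and γ̂. Suppose these are obtained as the solutions to estimating equations $\sum_{i=1}^n S_{\alpha,i}(\hat\alpha)=0$ and $\sum_{i=1}^n S_{\gamma,i}(\hat\gamma)=0$, and let θ = (β, α, γ). For example, if $\pi(Z^{(1)};\alpha)=\mathrm{expit}(\alpha^\top Z^{(1)})$, then $S_\alpha(\alpha)=Z^{(1)}\{R-\mathrm{expit}(\alpha^\top Z^{(1)})\}$. When π(Z(1); α) or ϕ(Z(1); β, γ) (or both) is correctly specified, the variance of β̂DR is consistently estimated by the sandwich estimator $n^{-1}\{\sum_{i=1}^n \partial S_{\beta,i}(\theta)/\partial\beta^\top\}^{-1}\{\sum_{i=1}^n S_i(\theta)S_i(\theta)^\top\}\{\sum_{i=1}^n \partial S_{\beta,i}(\theta)/\partial\beta^\top\}^{-\top}$ evaluated at θ̂ = (β̂DR, α̂, γ̂), where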

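$$S_i(\theta)=S_{\beta,i}(\theta)-\left\{\sum_{j=1}^n\frac{\partial S_{\beta,j}(\theta)}{\partial\alpha^\top}\right\}\left\{\sum_{j=1}^n\frac{\partial S_{\alpha,j}(\alpha)}{\partial\alpha^\top}\right\}^{-1}S_{\alpha,i}(\alpha)-\left\{\sum_{j=1}^n\frac{\partial S_{\beta,j}(\theta)}{\partial\gamma^\top}\right\}\left\{\sum_{j=1}^n\frac{\partial S_{\gamma,j}(\gamma)}{\partial\gamma^\top}\right\}^{-1}S_{\gamma,i}(\gamma).$$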

Here, the terms involving Sα,i(α) and Sγ,i(γ) can be viewed as accounting for the uncertainty in α̂ and γ̂, respectively. An alternative to the sandwich estimator is the nonparametric bootstrap. The bootstrap is commonly used, possibly because the sandwich estimator may be negatively biased when the effective sample size is small, or because the bootstrap allows construction of confidence intervals that do not rely on a normality assumption [7].
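As an illustration of the bootstrap alternative, the following is a minimal sketch for the missing outcome problem, not the procedure of any cited paper. It assumes a binary W, a missingness model that is saturated in W, and a deliberately crude intercept-only outcome model; the function names and toy data are hypothetical. The point it shows is that α̂ and γ̂ are re-estimated inside every resample, so that their uncertainty is propagated.

```python
import random
import statistics


def dr_mean(rows):
    """DR estimate of E(Y) from rows of (w, r, y), with w in {0, 1}.

    Illustrative nuisance models (assumptions of this sketch):
    pi_hat(w) is the observed fraction within each level of w, and
    m_hat is the overall complete-case mean (intercept-only model).
    """
    observed = [y for _, r, y in rows if r]
    if not observed:
        raise ValueError("no complete cases in this sample")
    m_hat = sum(observed) / len(observed)
    pi_hat = {}
    for level in (0, 1):
        flags = [r for w, r, _ in rows if w == level]
        pi_hat[level] = sum(flags) / len(flags) if flags else 0.0
    total = 0.0
    for w, r, y in rows:
        # r == 1 implies pi_hat[w] > 0, so this never divides by zero.
        weight = 1.0 / pi_hat[w] if r else 0.0
        total += (weight * y if r else 0.0) + (1.0 - weight) * m_hat
    return total / len(rows)


def bootstrap_se(rows, estimator, n_boot=500, seed=0):
    """Resample individuals with replacement and re-run the whole
    estimator (nuisance fits included) on each resample."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


# Hypothetical toy data: (w, r, y), with y = None when r == 0.
rows = []
for i in range(60):
    w = i % 2
    r = 0 if i % (3 + 2 * w) == 0 else 1
    y = float(w + i % 4) if r else None
    rows.append((w, r, y))

estimate = dr_mean(rows)
se = bootstrap_se(rows, dr_mean, n_boot=200, seed=1)
```

Percentile intervals could be obtained from the same resampled estimates instead of a standard error; which is preferable will depend on the application.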

When both π(Z(1); α) and ϕ(Z(1); β, γ) are correctly specified, Si(θ) = Sβ,i(θ) (up to a term that converges to zero in probability — see proof of DR in [36]). An important implication of this is that the asymptotic variance of β̂DR does not depend on the choice of (consistent) estimators α̂ and γ̂ in that case, and in fact equals the asymptotic variance of the DR estimator β̂DR(α, γ) that uses the true values of α and γ. It is therefore tempting to replace Si(θ) by Sβ,i(θ) in the sandwich variance estimator. We discourage this in general, because, although β̂DR is DR, inference for β is not DR when this is done, as consistency of the resulting variance estimator is no longer guaranteed as soon as one or both of π(Z(1); α) and ϕ(Z(1); β, γ) is misspecified. Under such misspecification, or when the sample size is small, the choice of estimators α̂ and γ̂ can be very important. We return to this issue in the next section.

4. Improved double robust estimators

For simplicity, we concentrate in this section on the missing outcome problem of Section 2. The notation is the same as used there. Also, α0 and γ0 denote the probability limits of α̂ and γ̂, i.e. $\hat\alpha\to_p\alpha_0$ and $\hat\gamma\to_p\gamma_0$. Much of the material in this section is adapted from Rotnitzky and Vansteelandt [31] and more details can be found there, including information on which methods have been extended to estimate the parameters of a semiparametric regression model with partially observed outcome and fully observed covariates or to handle longitudinal data with dropout.

4.1. Drawbacks of the standard DR estimator

Let α̂ML and γ̂ML denote locally efficient semiparametric estimators of α and γ under the missingness and outcome models, respectively. For example, α̂ML and γ̂ML could be ML estimators in logistic and linear regression models, respectively. When α̂ML and γ̂ML are used, the estimator β̂DR given by equation (1) or (equivalently) (2) is sometimes called the ‘standard’ DR estimator [5]. There are some issues with this estimator.

First, β̂DR may lie outside its parameter space (e.g. outside [0, 1] when Y is binary). Even when guaranteed to lie within its parameter space, it may not be within the range of the observed Y values. An estimate of E(Y) that is less (more) than the minimum (maximum) observed value of Y may be difficult to defend [25].

Second, when model m(W; γ) is misspecified, there is no guarantee that β̂DR will be at least as efficient as the IPW estimators β̂IPW and β̂IPW,B.

Third, in practical applications, both models π(W; α) and m(W; γ) are likely to be at least mildly misspecified, so that neither of the conditions for consistency of β̂DR applies. The hope is that β̂DR will still perform well when at least one of these models is approximately correctly specified. However, Kang and Schafer [12] demonstrated that this is not necessarily the case. They gave an example of a data-generating mechanism for (Y, W, R) and two misspecified models π(W; α) and m(W; γ) and showed that the standard DR estimator has very large bias and variance in this example, even though the model misspecification is not easily detected from the observed data on a sample of moderate size. They also showed that the RI estimator β̂RI has relatively small bias and variance in this example. Robins et al. [25] examined Kang and Schafer’s data-generating mechanism. They noted that the overlap between the distributions of W in the complete and incomplete cases was small. As discussed in Section 2, this means that β̂RI relies on potentially dangerous extrapolation, and thus that its good performance is partly a matter of luck. Indeed, Robins et al. [25] showed that if Kang and Schafer’s missingness mechanism was altered by making complete cases into incomplete cases and vice versa (by replacing R by 1 − R), the performance of β̂RI became much worse than that of β̂DR. Nevertheless, this example cast some doubt on the practical usefulness of the DR property of β̂DR.

The response to these issues has been the development of improved DR estimators, which aim at greater efficiency and reduced bias relative to the standard DR estimator. These differ from that estimator in the way that α and/or γ are estimated. As noted in Section 3, the choice of α̂ and γ̂ affects the asymptotic variance of β̂DR unless both π(W; α) and m(W; γ) are correctly specified, and affects its asymptotic bias when neither is correctly specified.

These improved estimators are not a panacea for scenarios where the population variance of the true weights $\pi(W)^{-1}$ is large. In this case, there is limited overlap between the distributions of W in complete and incomplete cases and, unless one is prepared to trust in extrapolation to incomplete cases of an outcome model fitted to complete cases, considerable uncertainty in the estimate of β is inevitable. However, the improved estimators go a long way to resolving the issues with the standard DR estimator listed above. First, most of them guarantee β̂ lies within the parameter space of β. Some are even sample bounded. As well as avoiding implausible estimates, sample boundedness can reduce the variance of β̂DR when the weights are highly variable. Second, some of the improved estimators are asymptotically efficient over a class of estimators that includes the simple IPW estimators, provided that π(W; α) is correctly specified, even when m(W; γ) is potentially misspecified. Third, some more recent methods aim to improve performance when both m(W; γ) and π(W; α) may be misspecified or when the true weights are unstable. We now review these improved DR methods.

4.2. DR RI and DR sample-bounded IPW estimators

Several methods calculate α̂ML and then estimate γ in a way that ensures

$$\sum_{i=1}^n \frac{R_i}{\pi(W_i;\hat\alpha_{ML})}\{Y_i-m(W_i;\hat\gamma)\}=0. \tag{6}$$

As the left-hand side of equation (6) is the ‘correction’ term in equation (2), this ensures that β̂DR reduces to a RI estimator, i.e. $\hat\beta_{DR}=n^{-1}\sum_{i=1}^n m(W_i;\hat\gamma)$. The advantage of this is that, if the range of m(W; γ) equals the parameter space of β, then β̂DR must lie within this space. Further, Gruber and van der Laan show how to ensure that the range of m(W; γ) equals the range of the observed Y values, making the resulting RI estimator sample-bounded [10].

When m(W; γ) is a generalised linear model with canonical link function, two ways to make equation (6) hold are: i) to estimate γ using the ML estimator with weights $\pi(W;\hat\alpha_{ML})^{-1}$ [12]; or ii) to include $\pi(W;\hat\alpha_{ML})^{-1}$ as an extra covariate in m(W; γ) and then estimate γ by ML [32] (the first way requires that m(W; γ) include an intercept term). In either case, equation (6) is one of the score equations for γ̂ (corresponding to the intercept in the first case and to the covariate $\pi(W;\hat\alpha_{ML})^{-1}$ in the second case) and hence holds at γ̂. Note that if the original model for E(Y | W) is correctly specified, then the extended model with covariate $\pi(W;\hat\alpha_{ML})^{-1}$ added will still be correctly specified. When the original model for E(Y | W) is misspecified, the first DR RI estimator usually has better performance than the second [31].
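The first of these two approaches can be checked numerically in the simplest canonical case, a linear model with identity link, for which weighted ML is weighted least squares. The sketch below is illustrative only: the data and the fitted probabilities `pi_hat` are hypothetical stand-ins for π(W; α̂ML), and equation (2) is taken to be the RI mean plus the correction term, as the text describes.

```python
def weighted_ls(weights, x, y):
    """Weighted least squares fit of y on (1, x); returns (intercept, slope)."""
    sw = sum(weights)
    xbar = sum(w * xi for w, xi in zip(weights, x)) / sw
    ybar = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxx = sum(w * (xi - xbar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - xbar) * (yi - ybar) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical toy data; Y is None where R == 0.
W = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
R = [1, 1, 0, 1, 0, 1]
Y = [1.0, 2.5, None, 3.0, None, 6.5]
pi_hat = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

# Fit the outcome model on complete cases with weights 1 / pi_hat.
cc = [i for i in range(len(W)) if R[i]]
b0, b1 = weighted_ls(
    [1.0 / pi_hat[i] for i in cc],
    [W[i] for i in cc],
    [Y[i] for i in cc],
)
m_hat = [b0 + b1 * w for w in W]

# Left-hand side of equation (6): zero because it is the intercept's score.
correction = sum((Y[i] - m_hat[i]) / pi_hat[i] for i in cc)

beta_ri = sum(m_hat) / len(W)
beta_dr = beta_ri + correction / len(W)  # coincides with beta_ri
```

Dropping the intercept from the working model would break this identity, which is why the text requires that m(W; γ) include one.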

Robins et al. [25] proposed calculating γ̂ML and then estimating α in such a way that $\sum_{i=1}^n R_i\,\pi(W_i;\hat\alpha)^{-1}\{m(W_i;\hat\gamma_{ML})-n^{-1}\sum_{j=1}^n m(W_j;\hat\gamma_{ML})\}=0$. The sample-bounded estimator β̂IPW,B(α̂) is then DR. This DR estimator is related to the minimum-discrepancy estimators discussed in [36]: they all calculate the weights in such a way that the weighted average of m(W; γ̂ML) in the complete cases is equal to the corresponding unweighted average in the whole sample.

4.3. Efficient estimators over a class of estimators

All the improved estimators described so far suffer from the drawback that, if m(W; γ) is misspecified, they can potentially be less efficient than the IPW estimators β̂IPW and β̂IPW,B. We now describe DR estimators that are, when π(W; α) is correctly specified, guaranteed to be at least as asymptotically efficient as the IPW estimators that use the same model π(W; α).

Consider a correctly specified model π(W; α) and a fixed choice of (possibly misspecified) model $m(W;\gamma)=h(\gamma^\top W)$, where h is a known link function, and let α̂ = α̂ML. Let β̂(ν1, ν2, γ), where ν1 and ν2 are real numbers, denote the estimator that solves equation (5) with $Z_i^{(1)}=W_i$, $u(Z_i;\beta)=Y_i-\beta$ and $\phi(Z_i^{(1)};\beta,\hat\gamma)$ replaced by ν1 + ν2m(Wi; γ) − β. So, in particular, β̂(0, 1, γ̂ML) is the standard DR estimator and β̂(0, 0, 0) and β̂(β, 0, 0) are, respectively, β̂IPW and β̂IPW,B.
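The special cases just listed can be verified with a few lines of code. The sketch below assumes that equation (5) has the usual AIPW form, in which the augmentation enters as −(R/π − 1)ϕ, so that the estimator has the closed form n⁻¹ Σ [RᵢYᵢ/πᵢ − (Rᵢ/πᵢ − 1)(ν1 + ν2mᵢ)]. That form, the function name and the toy values are assumptions made for illustration.

```python
def aipw_mean(R, Y, pi, m, nu1, nu2):
    """Closed-form solution for beta when the augmentation function is
    nu1 + nu2 * m(W) - beta (assumed AIPW form of equation (5))."""
    n = len(R)
    total = 0.0
    for r, y, p, mi in zip(R, Y, pi, m):
        weight = r / p
        ipw_term = weight * y if r else 0.0  # Y is unused when R == 0
        total += ipw_term - (weight - 1.0) * (nu1 + nu2 * mi)
    return total / n


# Hypothetical fitted probabilities pi and outcome predictions m.
R = [1, 1, 0, 1, 0, 1]
Y = [1.0, 2.5, None, 3.0, None, 6.5]
pi = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
m = [1.2, 2.0, 2.8, 3.6, 4.4, 5.2]

beta_ipw = aipw_mean(R, Y, pi, m, 0.0, 0.0)      # no augmentation
beta_std_dr = aipw_mean(R, Y, pi, m, 0.0, 1.0)   # augmentation m(W) - beta

# Normalised-weight (sample-bounded) IPW estimator, computed directly.
beta_ipw_b = (
    sum(y / p for r, y, p in zip(R, Y, pi) if r)
    / sum(1.0 / p for r, p in zip(R, pi) if r)
)

# Setting nu1 equal to beta itself gives a fixed-point equation whose
# solution is beta_ipw_b.
fixed_point = aipw_mean(R, Y, pi, m, beta_ipw_b, 0.0)
```

Because ν1 = β makes the augmentation term vanish identically, the estimating equation reduces to the weighted mean with normalised weights, which is why `fixed_point` reproduces `beta_ipw_b`.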

Cao et al. [5] and Tan [38, 39] independently derived estimators that are asymptotically efficient over the set {β̂(ν1, ν2, γ) : –∞ < ν1, ν2 < ∞, γ ∈ Γ}, where Γ is the parameter space of γ. That is, their asymptotic variances cannot be greater than that of any AIPW estimator that uses in its augmentation term ν1 + ν2m(W; γ) for any fixed ν1, ν2 and γ. In particular, they cannot be greater than those of β̂IPW, β̂IPW,B and the standard DR estimator (because the last has the same asymptotic variance as β̂(0, 1, γ0) when π(W; α) is correctly specified; see the proof of DR in [36]). Rotnitzky et al. [30] derived a DR RI estimator that is at least as asymptotically efficient as both β̂(0, 1, γ) for any γ ∈ Γ and β̂IPW,B. If m(W; γ) = 0 for some γ ∈ Γ, then β̂(0, 1, γ) = β̂(0, 0, 0) ≡ β̂IPW for this value of γ, so that Rotnitzky et al.’s estimator is also at least as asymptotically efficient as β̂IPW.

Tan’s [39] estimator (which builds on his earlier work [37]) has the advantage that it is sample bounded. Cao et al.’s [5] method (further developed by Tsiatis et al. [42]) and Rotnitzky et al.’s [30] method have the advantage that they allow estimation of the parameters of a semiparametric regression model, even for longitudinal data with dropout. However, when β is a vector, Cao et al.’s estimator ensures asymptotic efficiency for only one specified element of β. Rotnitzky et al.’s estimator ensures asymptotic efficiency for all elements of β (and indeed for a finite number of arbitrary scalar functions of β).

4.4. Bias-reduced DR estimators

The methods listed in Section 4.3 minimise the asymptotic variance of β̂DR over a class of AIPW estimators when π(W; α) is correctly specified, but are not guaranteed to do so when it is misspecified. Vermeulen and Vansteelandt [46] took a different approach. Rather than seeking directly to minimise the asymptotic variance, their ‘bias-reduced DR estimator’ uses the estimators α̂ and γ̂ obtained by locally minimising the squared asymptotic bias of β̂DR when both models π(W; α) and m(W; γ) are misspecified. This makes the bias-reduced DR estimator less sensitive than the standard DR estimator to mild model misspecification. This can be understood as follows. The asymptotic bias of β̂DR equals $E[\{\pi(W;\alpha_0)-\pi(W)\}\{m(W;\gamma_0)-E(Y\mid W)\}\,\pi(W;\alpha_0)^{-1}]$ [46]. That is, it is the product of the degrees of misspecification of the two models inversely weighted by π(W; α0). This weighting is concerning, because it is in the region where π(W; α0) is small that few complete cases are observed, and so misspecification of m(W; γ) is most likely to remain undetected. Vermeulen and Vansteelandt’s choice of α̂ and γ̂ makes the asymptotic bias reduce to E[{m(W; γ0) − E(Y | W)}{1 − π(W)}], hence avoiding this problem.
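The product form of the asymptotic bias quoted above can be recovered in a few lines. The following sketch assumes MAR, so that E(RY | W) = π(W)E(Y | W), and uses the shorthand μ(W) = E(Y | W), π₀ = π(W; α0) and m₀ = m(W; γ0); these abbreviations are introduced here only for the derivation.

```latex
\begin{aligned}
\hat\beta_{DR} \;\to_p\;
  & E\!\left[\frac{R}{\pi_0}\,Y + \left(1-\frac{R}{\pi_0}\right) m_0\right]
  = E\!\left[\frac{\pi(W)}{\pi_0}\,\mu(W)
      + \left(1-\frac{\pi(W)}{\pi_0}\right) m_0\right], \\[4pt]
\text{asymptotic bias}
  &= E\!\left[\frac{\pi(W)}{\pi_0}\,\mu(W)
      + \left(1-\frac{\pi(W)}{\pi_0}\right) m_0 - \mu(W)\right] \\
  &= E\!\left[\left(\frac{\pi(W)}{\pi_0}-1\right)\{\mu(W)-m_0\}\right] \\
  &= E\!\left[\{\pi_0-\pi(W)\}\,\{m_0-\mu(W)\}\,\pi_0^{-1}\right].
\end{aligned}
```

The last line vanishes if either π₀ = π(W) or m₀ = μ(W), which is the double robustness property, and otherwise shows how a small π₀ magnifies the product of the two misspecification errors.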

Bias-reduced DR estimation can be used for quite general semiparametric models, even when data are assumed to be MNAR. However, when β is a vector, the squared asymptotic bias is minimised only for one specified element of β.

Estimating the variance of β̂DR is straightforward for the bias-reduced estimator, because a fortunate effect of the way that α̂ and γ̂ are calculated is that uncertainty in these parameters can be ignored even when both models π(W; α) and m(W; γ) are misspecified. The variance can thus be estimated as explained in Section 3.3, replacing Si(θ) by Sβ,i(θ). This may also explain why the bias-reduced estimator appears to have good efficiency in simulation studies [46].

Simulation studies that compare many of the improved DR methods discussed in Sections 4.2–4.4 have been reported [39, 22, 46]. In these studies, the improved methods had less bias and greater efficiency than the standard DR estimator when the outcome model was misspecified; differences were less marked when only the missingness model was misspecified. The estimators of Sections 4.3 and 4.4 performed better than those of Section 4.2, but among the former group no method was uniformly best. The range of data-generating mechanisms considered in these studies was quite small, however, and more research would be welcome.

5. Data-adaptive methods

The increasing popularity and availability of data-adaptive statistical methods (e.g. kernel smoothing, penalised likelihood, ensemble learners) may lead the reader to wonder what is the use of DR estimators when RI estimators and IPW estimators can be based on outcome imputations and missingness probabilities, respectively, obtained via such flexible methods [16]. In this section, we provide insight into this matter, and argue that DR estimators are in fact especially useful when data-adaptive methods are used.

For simplicity, we return to the missing outcome problem of Section 2. Consider the RI estimator $\hat\beta_{RI}=n^{-1}\sum_{i=1}^n m(W_i;\hat\gamma)$, where γ̂ is an estimate obtained through some data-adaptive statistical method (e.g. standard variable selection). The estimator γ̂ will typically have a complicated finite-sample distribution [13] and non-uniform convergence of this distribution to a normal distribution, properties which the RI estimator β̂RI will usually inherit. The practical implication of this is that uniformly valid confidence intervals with nominal coverage for β based on β̂RI are difficult to obtain. Confidence intervals that are not uniformly valid are not guaranteed to perform well, because, for any given n, no matter how large, there exist distributions of the full data for which their coverage is poor. This problem is well known for, e.g., lasso-estimators γ̂, where a small change in the data-generating mechanism (e.g. an element of γ changing from 0 to $n^{-1/2}$) may lead to a relatively large change in the distribution of γ̂ even for large n, because it may lead to different variables being selected asymptotically [14, 13].

To develop more formal insight into this, we consider the difficulty that arises in the specific example of lasso or post-lasso (post-lasso is the procedure that uses lasso as a variable-selection procedure and then refits the selected model using a standard procedure (e.g. ML) to reduce shrinkage bias [2]). Similar problems arise with other data-adaptive methods. Let γ̂ be an estimator of γ obtained via lasso or post-lasso. Then [3, 8, 9],

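$$\begin{aligned}
\sqrt{n}(\hat\beta_{RI}-\beta_0)
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n \{m(W_i;\hat\gamma)-\beta_0\} \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n \{m(W_i;\gamma_0)-\beta_0\}
 + \frac{1}{\sqrt{n}}\sum_{i=1}^n \{m(W_i;\hat\gamma)-m(W_i;\gamma_0)\} \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n \{m(W_i;\gamma_0)-\beta_0\}
 + \frac{1}{n}\sum_{i=1}^n \left.\frac{\partial m(W_i;\gamma)}{\partial\gamma^\top}\right|_{\gamma=\gamma_0}\sqrt{n}(\hat\gamma-\gamma_0)
 + \sqrt{n}\,\|\hat\gamma-\gamma_0\|_2^2\,O_p(1),
\end{aligned} \tag{7}$$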

where ‖·‖2 denotes the Euclidean norm. Assuming that m(W; γ) is correctly specified, the first term in the expansion (7) generally has an asymptotic normal mean-zero distribution and the remainder term $\sqrt{n}\,\|\hat\gamma-\gamma_0\|_2^2\,O_p(1)$ tends to be of lower order than the other two terms. Although the term $n^{-1}\sum_{i=1}^n \partial m(W_i;\gamma)/\partial\gamma^\top|_{\gamma=\gamma_0}\times\sqrt{n}(\hat\gamma-\gamma_0)$ does (for fixed γ0 and assuming regularity conditions) converge in distribution to a normal distribution, this convergence is generally not uniform. That is, for any n, no matter how large, there exist values of γ0 for which the distribution of $\sqrt{n}(\hat\gamma-\gamma_0)$ is far from its asymptotic distribution, and hence for which $\sqrt{n}(\hat\beta_{RI}-\beta_0)$ is far from its asymptotic distribution.

An additional concern arises when p, the dimension of γ, is large relative to n. Lasso and other penalised likelihood methods are commonly used in such settings. Large-sample behaviour of γ̂ as p increases with n is therefore of interest. When p increases with n, there is (in addition to the aforementioned difficulty of obtaining uniformly valid confidence intervals) a problem that bias in β̂RI may vanish only slowly with increasing n unless the true data-generating mechanism shows sufficient sparsity, i.e. unless the rate at which s, the number of non-zero elements of γ0, increases as n increases is sufficiently small [2]. More specifically, it follows from [2] that, for lasso and post-lasso estimators, $\sqrt{n}\,\|\hat\gamma-\gamma_0\|_2^2=O_p\{(s/\sqrt{n})\log(p\vee n)\}$, where $a\vee b$ denotes the maximum of a and b. When there is sufficient sparsity to ensure that $(s/\sqrt{n})\log(p\vee n)$ converges to zero, the second-order term $\sqrt{n}\,\|\hat\gamma-\gamma_0\|_2^2$ converges to zero. However, greater sparsity is required to prevent the term $n^{-1}\sum_{i=1}^n \partial m(W_i;\gamma)/\partial\gamma^\top|_{\gamma=\gamma_0}\times\sqrt{n}(\hat\gamma-\gamma_0)$ in equation (7) from diverging to infinity, and so to ensure that bias in $\sqrt{n}(\hat\beta_{RI}-\beta_0)$ vanishes as n → ∞.

The above concerns largely disappear when data-adaptive methods are combined with DR estimators, because DR estimators enjoy a small bias property [20, 8]. This means that their bias vanishes faster than the bias in the nuisance parameter estimator (e.g. γ̂) when the smoothing parameter (e.g. the bandwidth in a kernel estimator or the penalty parameter in a lasso-estimator) goes to zero. This property is important for ensuring correct inference when data-adaptive methods are used [3]. This can more formally be understood as follows. Consider again the estimator $\hat\beta_{DR}=n^{-1}\sum_{i=1}^n \rho(R_i,R_iY_i,W_i;\hat\alpha,\hat\gamma)$, where

$$\rho(R,RY,W;\hat\alpha,\hat\gamma)=\frac{R}{\pi(W;\hat\alpha)}\,Y+\left\{1-\frac{R}{\pi(W;\hat\alpha)}\right\}m(W;\hat\gamma),$$

with γ̂ and α̂ obtained through some data-adaptive statistical method. Then upon repeating the expansion of equation (7) with β̂DR in place of β̂RI, ρ(R, RY, W; α̂, γ̂) in place of m(W; γ̂), and θ = (α, γ) in place of γ, one can see that slow convergence of γ̂ and α̂ does not necessarily induce erratic behaviour in β̂DR. This is because, as noted in the proof of DR in [36], ∂ρ(R, RY, W; α, γ)/∂θ has expectation zero at (α0, γ0) when π(W; α) and m(W; γ) are correctly specified, and so slow convergence of the first-order term $\sqrt{n}(\hat\theta-\theta_0)$ in the expansion is not a problem (so long as $\sqrt{n}\,\|\hat\theta-\theta_0\|_2^2$ converges to zero).

Farrell [9] uses this idea to demonstrate that, under conditions that we specify next, β̂DR is asymptotically unbiased and uniformly valid 95% confidence regions for β can be straightforwardly calculated as $\hat\beta_{DR}\pm 1.96\sqrt{\hat\sigma^2/n}$, where σ̂2 is the sample variance of ρ(R, RY, W; α̂, γ̂). These conditions are that the empirical mean squared errors of m(W; γ̂) and π(W; α̂) converge in probability to zero, and that their product converges at a rate faster than $n^{-1}$. This in particular allows slow convergence of α̂, so long as γ̂ converges sufficiently fast, and vice versa.
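The interval described above is simple to compute once fitted values are available. The sketch below is illustrative rather than a reproduction of Farrell's software: it takes the fitted probabilities and outcome predictions as given (however they were obtained), and the function name and toy numbers are hypothetical.

```python
import math


def dr_mean_ci(R, Y, pi_hat, m_hat, z=1.96):
    """DR estimate of E(Y) with a normal-approximation interval.

    R[i] is 1 if Y[i] is observed, else 0 (Y[i] is then ignored).
    pi_hat[i] and m_hat[i] are fitted values of P(R = 1 | W_i) and
    E(Y | W_i), assumed to come from models fitted elsewhere.
    """
    n = len(R)
    rho = []
    for r, y, p, m in zip(R, Y, pi_hat, m_hat):
        weight = r / p  # zero for incomplete cases
        rho.append((weight * y if r else 0.0) + (1.0 - weight) * m)
    beta = sum(rho) / n
    # Sample variance of the rho values, as in the text.
    var = sum((x - beta) ** 2 for x in rho) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return beta, (beta - half_width, beta + half_width)


# Hypothetical toy input: four individuals, two with Y missing.
beta, (lower, upper) = dr_mean_ci(
    R=[1, 0, 1, 0],
    Y=[2.0, None, 4.0, None],
    pi_hat=[0.5, 0.5, 0.5, 0.5],
    m_hat=[1.0, 1.0, 3.0, 3.0],
)
```

With these inputs the four ρ values are 3, 1, 5 and 3, so the point estimate is 3. No extra terms for the estimation of α̂ and γ̂ appear in the variance, which is the practical consequence of the small bias property under the stated rate conditions.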

The results of Farrell [9] apply to any data-adaptive method for estimating α̂ and γ̂, so long as it satisfies the aforementioned conditions. Targeted maximum likelihood estimation (TMLE), proposed by van der Laan and Rubin [44] and refined by Gruber and van der Laan [10], is one such procedure. It is designed to ensure that the DR estimator reduces to a RI estimator (or ‘substitution estimator’ in their terminology). It involves two steps. First, a preliminary estimate $m^{(0)}(W;\hat\gamma)$ of E(Y | W) based on a data-adaptive learning algorithm (e.g. an ensemble learner) is obtained, and a parametric missingness model is fitted to obtain α̂. Second, a canonical generalised linear model for E(Y | W) is fitted, with link function h(.), offset term $h^{-1}\{m^{(0)}(W;\hat\gamma)\}$ and the single covariate R/π(W; α̂). This covariate is chosen because ML estimation of its coefficient involves setting $\sum_{i=1}^n R_i\,\pi(W_i;\hat\alpha)^{-1}\{Y_i-m(W_i;\hat\gamma)\}$ to zero, thereby making the DR estimator equivalent to a RI estimator.
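The second step can be sketched for the simplest case of an identity link, where ML fitting of the one-covariate model is ordinary least squares with the preliminary prediction as a fixed offset. This is a simplified illustration, not the TMLE software of the cited authors; the preliminary predictions `m0`, the probabilities `pi_hat` and the function name are hypothetical, and updated predictions are evaluated at R = 1 for everyone so that they depend on W alone.

```python
def targeting_step(R, Y, pi_hat, m0):
    """One targeting step with identity link.

    Fits Y = m0 + eps * (R / pi_hat) by least squares among complete
    cases, treating m0 as an offset, and returns the updated
    predictions (evaluated at R = 1) together with eps.
    """
    num = sum((y - m) / p for r, y, p, m in zip(R, Y, pi_hat, m0) if r)
    den = sum(1.0 / p ** 2 for r, p in zip(R, pi_hat) if r)
    eps = num / den
    updated = [m + eps / p for m, p in zip(m0, pi_hat)]
    return updated, eps


# Hypothetical toy data; Y is None where R == 0.
R = [1, 1, 0, 1, 0, 1]
Y = [1.0, 2.5, None, 3.0, None, 6.5]
pi_hat = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
m0 = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]  # preliminary data-adaptive fit

m_star, eps = targeting_step(R, Y, pi_hat, m0)

# The correction term of the DR estimator is now zero ...
correction = sum(
    (y - m) / p for r, y, p, m in zip(R, Y, pi_hat, m_star) if r
)
# ... so the DR estimator equals the substitution (RI) estimator.
beta_tmle = sum(m_star) / len(m_star)
```

For a bounded or binary outcome a logistic link would be used instead, in which case the coefficient has no closed form and would be found iteratively.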

6. Discussion

Much research on DR estimators has been for the missing outcome problem of Section 2 and for restricted moment models with missing outcome or covariates (see Section 3 and [36]). Other applications have included, e.g., estimating the area under an ROC curve with missing outcome or predictor [18, 29]. The DR property is not unique to methods for incomplete data. The missing outcome problem of Section 2 is closely related to that of estimating an average causal effect, and essentially the same DR estimators appear in this literature (e.g. [1]). DR estimators have also been proposed for many other causal inference problems. Rotnitzky and Vansteelandt [31] list numerous examples of DR estimators, within and without the causal inference literature.

We have focussed on DR incomplete-data estimators for scenarios where a full-data m-estimator is available. In the supplemental article [36], we describe more general DR theory, and illustrate this using the Cox model with a partially observed covariate. The usual full-data estimator for the Cox model is the solution to partial-likelihood estimating equations, which do not take the form $\sum_{i=1}^n u(Z_i;\hat\beta)=0$.

The AIPW estimator of Section 2 has close connections to sample survey estimators that pre-date the work of Robins et al. [28], and to DR empirical likelihood (EL) and generalised EL estimators. In the supplemental article [36], we describe these connections and provide an introduction to DR EL estimators.

In missing-data problems, DR estimators require correct specification of either a model for the missingness process (given the full data) or a model for (some functional of) the outcome distribution (given the missing data patterns). When the data are non-monotone missing, plausible models for the missingness process can be difficult to construct. This has hindered the development of DR estimators in such settings [27]. The development of DR estimators for non-monotone missing data constitutes one of the primary open problems in this domain.

The construction of DR estimators for MNAR data is complicated by the lack of factorisation of the likelihood, which makes it difficult to describe the model for the missingness process (given the full data) and the model for (some functional of) the outcome distribution (given the missing data patterns) using variation-independent parameters. Such variation-independent parameterisation is needed to ensure that consistent estimators of the missingness probabilities can be obtained even when the outcome model is misspecified, and vice versa. Nevertheless, some progress has been made. A common approach uses a ‘tilt’ function (e.g. [29]). A simple application of this approach to the missing outcome problem of Section 2 would assume that P(R = 1 | W, Y) = expit{ωY + a(W)}, where a(W) is some function of W and ω is a known parameter (here ωY is the ‘tilt’ function). This implies that f(y | W, R = 0) = f(y | W, R = 1) exp(–ωy)c(W), where c(W) is a normalising constant. The DR estimator of β is consistent if either a model a(W; α) for a(W) or a model b(W; γ) for f(y | W, R = 1) is correctly specified.

Finally, although in Section 5 we considered the implications of using variable (or model) selection strategies for the missingness and/or imputation models, we did not discuss how such selection is best done. Just as the choice of estimators of the nuisance parameters (α and γ) can have a major impact on the performance of the DR estimator when at least one of these models is misspecified, also the choice of selection strategy can be extremely influential. This is well known when instrumental variables are observed, i.e. variables that are predictive of missingness, but not of the partially observed variables themselves [4]. The selection of such variables in the missingness model can cause a major loss of efficiency, and can moreover drastically amplify biases, e.g. due to model misspecification.

The development of variable selection strategies that prevent selection of instrumental variables in the missingness model has been an area of vigorous recent research [43, 47]. One such approach is the ‘collaborative TMLE’ method [43]. In the context of the missing outcome problem of Section 2, this method selects, from a given number of TMLEs for a nested sequence of models for π(W), the one which minimises a penalised log-likelihood criterion, e.g. the sum of the squared residuals from the fitted model for E(Y | W) plus the mean-squared error of the estimator of β estimated by cross-validation. Because selecting instrumental variables inflates the mean-squared error of the estimator of β without changing the sum of the squared residuals, such variables are unlikely to be selected. While targeted variable selection strategies like the above tend to bring major efficiency improvements relative to routine strategies, a concern is that all of them (directly or indirectly) involve jointly modelling the missingness process and the conditional distribution of partially observed variables. As such, they risk giving up on the DR property, since misspecification of one of these two models may then result in inconsistent estimation of the other model, even when it is correctly specified.

Grants and funding

SRS is funded by MRC grants MC_U105260558 and MC_UU_00002/10.

References

  • [1] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–972. doi: 10.1111/j.1541-0420.2005.00377.x.
  • [2] Belloni A, Chernozhukov V. l1-penalised quantile regression in high-dimensional sparse models. Annals of Statistics. 2011;39:82–130.
  • [3] Belloni A, Chernozhukov V, Hansen C. Lasso methods for Gaussian instrumental variables models. ArXiv. 2016.
  • [4] Brookhart MA, Van der Laan MJ. A semi-parametric model selection criterion with applications to the marginal structural model. Computational Statistics and Data Analysis. 2006;50:475–498.
  • [5] Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96:723–734. doi: 10.1093/biomet/asp033.
  • [6] Cassel CM, Sarndal CE, Wretman JH. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika. 1976;63:615–620.
  • [7] Cheng G, Yu Z, Huang JZ. The cluster bootstrap consistency in generalized estimating equations. Journal of Multivariate Analysis. 2013;115:33–47.
  • [8] Chernozhukov V, Escanciano JC, Ichimura H, Newey WK. Locally robust semiparametric estimation. ArXiv. 2016.
  • [9] Farrell MH. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics. 2015;189:1–23.
  • [10] Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. International Journal of Biostatistics. 2010;6. doi: 10.2202/1557-4679.1260.
  • [11] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.
  • [12] Kang JDY, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science. 2007;22:523–539. doi: 10.1214/07-STS227.
  • [13] Leeb H, Pötscher BM. Model selection and inference: Facts and fiction. Econometric Theory. 2005;21:21–59.
  • [14] Leeb H, Pötscher BM. Performance limits for estimators of the risk or distribution of shrinkage-type estimators, and some general lower risk-bound results. Econometric Theory. 2006;22:69–97.
  • [15] Liang K-Y, Zeger SL. Longitudinal data analysis using generalised linear models. Biometrika. 1986;73:13–22.
  • [16] Little R, An H. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica. 2004;14:949–968.
  • [17] Little RJA, Rubin DB. Statistical Analysis With Missing Data. Wiley; New Jersey: 2002.
  • [18] Long Q, Zhang X, Johnson BA. Robust estimation of area under ROC curve using auxiliary variables in the presence of missing biomarker values. Biometrics. 2011;67:559–567. doi: 10.1111/j.1541-0420.2010.01487.x.
  • [19] Meng X-L. Multiple-imputation inferences with uncongenial sources of input. Statistical Science. 1994;9:538–573.
  • [20] Newey WK, Hsieh F, Robins JM. Twicing kernels and a small bias property of semiparametric estimators. Econometrica. 2004;72:947–962.
  • [21] Paik MC. The generalized estimating equations approach when data are not missing completely at random. Journal of the American Statistical Association. 1997;92:1320–1329.
  • [22] Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. International Journal of Biostatistics. 2011;7. doi: 10.2202/1557-4679.1308.
  • [23] Qi L, Wang CY, Prentice RL. Weighted estimators for proportional hazards regression with missing covariates. Journal of the American Statistical Association. 2005;100:1250–1263.
  • [24] Robins J, Rotnitzky A. Discussion on the paper by Firth and Bennett. Journal of the Royal Statistical Society, Series B. 1998;60:51–52.
  • [25] Robins J, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statistical Science. 2007;22:544–559.
  • [26] Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association Section on Bayesian Statistical Science 1999; 2000. pp. 6–10.
  • [27] Robins JM, Gill RD. Non-response models for the analysis of non-monotone ignorable missing data. Statistics in Medicine. 1997;16:39–56. doi: 10.1002/(sici)1097-0258(19970115)16:1<39::aid-sim535>3.0.co;2-d.
  • [28].Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866. [Google Scholar]
  • [29].Rotnitzky A, Faraggi D, Schisterman E. Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. Journal of the American Statistical Association. 2006;101:1276–1288. [Google Scholar]
  • [30].Rotnitzky A, Lei QH, Sued M, Robins JM. Improved double-robust estimation in missing data and causal inference models. Biometrika. 2012;99:439–456. doi: 10.1093/biomet/ass013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [31].Rotnitzky A, Vansteelandt S. In: Handbook of Missing Data Methodology. Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G, editors. Chapman & Hall/CRC Press; 2014. Double-robust methods; pp. 185–212. chapter 9.
  • [32].Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models: Rejoinder. Journal of the American Statistical Association. 1999;94:1135–1146.
  • [33].Schnitzer ME, Lok JJ, Bosch RJ. Double robust and efficient estimation of a prognostic model for events in the presence of dependent censoring. Biostatistics. 2016;17:165–177. doi: 10.1093/biostatistics/kxv028.
  • [34].Seaman S, Copas A. Doubly robust generalized estimating equations for longitudinal data. Statistics in Medicine. 2009;28:937–955. doi: 10.1002/sim.3520.
  • [35].Seaman SR, Galati J, Jackson D, Carlin J. What is meant by ‘missing at random’? Statistical Science. 2013;28:257–268.
  • [36].Seaman SR, Vansteelandt S. Supplement to “Introduction to double robust methods for incomplete data”. 2018 doi: 10.1214/18-STS647.
  • [37].Tan Z. A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association. 2006;101:1619–1637.
  • [38].Tan Z. Comment: Improved local efficiency and double robustness. International Journal of Biostatistics. 2008;4 doi: 10.2202/1557-4679.1109.
  • [39].Tan Z. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika. 2010;97:661–682.
  • [40].Tsiatis AA. Semiparametric Theory and Missing Data. Springer; New York: 2006.
  • [41].Tsiatis AA, Davidian M. In: Handbook of Missing Data Methodology. Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G, editors. Chapman & Hall/CRC Press; 2014. Missing data methods: A semi-parametric perspective. chapter 8.
  • [42].Tsiatis AA, Davidian M, Cao W. Improved double-robust estimation when data are monotone coarsened, with application to longitudinal studies with dropout. Biometrics. 2011;67:536–545. doi: 10.1111/j.1541-0420.2010.01476.x.
  • [43].van der Laan MJ, Gruber S. Collaborative double robust targeted maximum likelihood estimation. International Journal of Biostatistics. 2010;6 doi: 10.2202/1557-4679.1181.
  • [44].van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. International Journal of Biostatistics. 2006;2 doi: 10.2202/1557-4679.1211.
  • [45].Vansteelandt S, Carpenter J, Kenward MG. Analysis of incomplete data using inverse probability weighting and doubly robust estimators. Methodology. 2010;6:37–48.
  • [46].Vermeulen K, Vansteelandt S. Bias-reduced doubly robust estimation. Journal of the American Statistical Association. 2015;110:1024–1036.
  • [47].Wilson A, Reich BJ. Confounder selection via penalized credible regions. Biometrics. 2014;70:852–861. doi: 10.1111/biom.12203.
  • [48].Wirth KE, Tchetgen Tchetgen EJ, Murray M. Adjustment for missing data in complex surveys using doubly robust estimation. Epidemiology. 2010;21:863–871. doi: 10.1097/EDE.0b013e3181f57571.
