INTRODUCTION
We congratulate Drs. Kang and Schafer (KS henceforth) for a careful and thought-provoking contribution to the literature regarding the so-called “double robustness” property, a topic that still engenders some confusion and disagreement. The authors’ approach of focusing on the simplest situation of estimation of the population mean μ of a response y when y is not observed on all subjects according to a missing at random (MAR) mechanism (equivalently, estimation of the mean of a potential outcome in a causal model under the assumption of no unmeasured confounders) is commendable, as the fundamental issues can be explored without the distractions of the messier notation and considerations required in more complicated settings. Indeed, as the article demonstrates, this simple setting is sufficient to highlight a number of key points.
As noted eloquently by Molenberghs (2005), in regard to how such missing data/causal inference problems are best addressed, two “schools” may be identified: the “likelihood-oriented” school and the “weighting-based” school. As we have emphasized previously (Davidian, Tsiatis and Leon, 2005), we prefer to view inference from the vantage point of semi-parametric theory, focusing on the assumptions embedded in the statistical models leading to different “types” of estimators (i.e., “likelihood-oriented” or “weighting-based”) rather than on the forms of the estimators themselves. In this discussion, we hope to complement the presentation of the authors by elaborating on this point of view.
Throughout, we use the same notation as in the paper.
SEMIPARAMETRIC THEORY PERSPECTIVE
As demonstrated by Robins, Rotnitzky and Zhao (1994) and Tsiatis (2006), exploiting the relationship between so-called influence functions and estimators is a fruitful approach to studying and contrasting the (large-sample) properties of estimators for parameters of interest in a statistical model. We remind the reader that a statistical model is a class of densities that could have generated the observed data. Our presentation here is for scalar parameters such as μ, but generalizes readily to vector-valued parameters. If one restricts attention to estimators that are regular (i.e., not “pathological”; see Davidian, Tsiatis and Leon, 2005, page 263 and Tsiatis 2006, pages 26–27), then, for a parameter μ in a parametric or semiparametric statistical model, an estimator μ̂ for μ based on independent and identically distributed observed data zi, i = 1, …, n, is said to be asymptotically linear if it satisfies
$$n^{1/2}(\hat{\mu} - \mu_0) = n^{-1/2}\sum_{i=1}^{n} \phi(z_i) + o_p(1) \tag{1}$$
for ϕ(z) with E{ϕ(z)} = 0 and E{ϕ²(z)} < ∞, where μ0 is the true value of μ generating the data, and expectation is with respect to the true distribution of z. The function ϕ(z) is the influence function of the estimator μ̂. A regular, asymptotically linear estimator with influence function ϕ(z) is consistent and asymptotically normal with asymptotic variance E{ϕ²(z)}. Thus, there is an inextricable connection between estimators and influence functions in that the asymptotic behavior of an estimator is fully determined by its influence function, so that it suffices to focus on the influence function when discussing an estimator’s properties. Many of the estimators discussed by KS are regular and asymptotically linear; in the sequel, we refer to regular and asymptotically linear estimators as simply “estimators.”
We capitalize on this connection by considering the problem of estimating μ in the setting in KS in terms of statistical models that may be assumed for the observed data, from which influence functions corresponding to estimators valid under the assumed models may be derived. In the situation studied by KS, the “full” data that would ideally be observed are (t, x, y); however, as y is unobserved for some subjects, the observed data available for analysis are z = (t, x, ty). As noted by KS, the MAR assumption states that y and t are conditionally independent given x; that is, P(t = 1|y, x) = P(t = 1|x). Under this assumption, all joint densities for the observed data have the form
$$p(z) = \{p(y \mid x)\}^{I(t=1)}\, p(t \mid x)\, p(x) \tag{2}$$
where p(y|x) is the density of y given x, p(t|x) is the density of t given x, and p(x) is the marginal density of x. Let p0(z) be the density in the class of densities of form (2) generating the observed data (the true joint density).
One may posit different statistical models by making different assumptions on the components of (2). We focus on three such models:
I. Make no assumptions on the forms of p(x) or p(t|x), leaving these entirely unspecified. Make a specific assumption on p(y|x), namely, that E(y|x) = m(x, β) for some given function m(x, β) depending on parameters β (p × 1). Denote the class of densities satisfying these assumptions as $\mathcal{M}_{\mathrm{I}}$.
II. Make no assumptions on the forms of p(x) or p(y|x). Make a specific assumption on p(t|x) that P(t = 1|x) = E(t|x) = π(x, α) for some given function π(x, α) depending on parameters α (s × 1). Here, we also require the assumption that P(t = 1|x) ≥ ε > 0 for all x and some ε. Denote the class of densities satisfying these assumptions as $\mathcal{M}_{\mathrm{II}}$.
III. Make no assumptions on the form of p(x), but make specific assumptions on p(y|x) and p(t|x), namely, that E(y|x) = m(x, β) and P(t = 1|x) = E(t|x) = π(x, α) ≥ ε > 0 for all x and some ε, for given functions m(x, β) and π(x, α) depending on parameters β and α. The class of densities satisfying these assumptions is $\mathcal{M}_{\mathrm{I}} \cap \mathcal{M}_{\mathrm{II}}$.
All of I–III are semiparametric statistical models in that some aspects of p(z) are left unspecified. Denote by m0(x) the true function E(y| x) and by π0(x) the true function P (t = 1|x) = E(t|x) corresponding to the true density p0(z).
Semiparametric theory yields the form of all influence functions corresponding to estimators for μ under each of the statistical models I–III. As discussed in Tsiatis (2006, page 52), loosely speaking, a consistent and asymptotically normal estimator for μ in a statistical model has the property that, for all p(z) in the class of densities defined by the model, $n^{1/2}(\hat{\mu} - \mu) \xrightarrow{\mathcal{D}(p)} N\{0, \sigma^2(p)\}$, where $\xrightarrow{\mathcal{D}(p)}$ means convergence in distribution under the density p(z), and σ²(p) is the asymptotic variance of μ̂ under p(z).
If model I is correct, then m0(x) = m(x, β) for some β, and it may be shown (e.g., Tsiatis, 2006, Section 4.5) that all estimators for μ have influence functions of the form
$$\{m_0(x) - \mu_0\} + t\,a(x)\{y - m_0(x)\} \tag{3}$$
for arbitrary functions a(x) of x. If model II is correct, then π0(x) = π(x, α) for some α, and all estimators for μ have influence functions of the form
$$\frac{t\,y}{\pi_0(x)} + \frac{t - \pi_0(x)}{\pi_0(x)}\,h(x) - \mu_0 \tag{4}$$
for arbitrary h(x), which is well known from Robins, Rotnitzky and Zhao (1994). If model III is correct, then m0(x) = m(x, β) and π0(x) = π(x, α) for some β and α, and influence functions for estimators μ̂ have the form
$$\{m_0(x) - \mu_0\} + t\,a(x)\{y - m_0(x)\} + \{t - \pi_0(x)\}\,h(x) \tag{5}$$
for arbitrary a(x) and h(x). Depending on forms of m(x, β) as a function of β and π(x, α) as a function of α, there will be restrictions on the forms of a(x) and h(x); see below.
We now consider estimators discussed by KS from the perspective of influence functions. The regression estimator μ̂OLS in (7) of KS comes about naturally if one assumes model I is correct. In terms of influence functions, μ̂OLS may be motivated by considering the influence function (3) with a(x) = 0, as this leads to the estimator $n^{-1}\sum_{i=1}^{n} m(x_i, \beta)$. In fact, although KS do not discuss it, the “imputation estimator” $n^{-1}\sum_{i=1}^{n} \{t_i y_i + (1 - t_i)\, m(x_i, \beta)\}$ may be motivated by taking a(x) = 1 in (3). Of course, in practice, β must be estimated. In general, (3) implies that all estimators for μ that are consistent and asymptotically normal if model I is correct must be asymptotically equivalent to an estimator of the form
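$$n^{-1}\sum_{i=1}^{n} \left[ m(x_i, \hat{\beta}) + t_i\,\tilde{a}(x_i)\{y_i - m(x_i, \hat{\beta})\} \right] \tag{6}$$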
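where β is estimated by solving an estimating equation $\sum_{i=1}^{n} t_i A(x_i, \beta)\{y_i - m(x_i, \beta)\} = 0$ for some (p × 1) A(x, β). Because β is estimated, the influence function of the estimator (6) with a particular ã(x) will not be exactly equal to (3) with a(x) = ã(x); instead, it may be shown that the influence function of (6) is of form (3) with a(x) in (3) equal to
$$\tilde{a}(x) + E\left[\{1 - \pi_0(x)\,\tilde{a}(x)\}\, m_\beta^{T}(x, \beta_0)\right] \left[E\{\pi_0(x)\, A(x, \beta_0)\, m_\beta^{T}(x, \beta_0)\}\right]^{-1} A(x, \beta_0) \tag{7}$$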
where mβ (x, β) is the vector of partial derivatives of elements of m(x, β) with respect to β, and β0 is such that m0(x) = m(x, β0).
The IPW estimator μ̂IPW-POP in (3) of KS and its variants arise if one assumes model II. In particular, μ̂IPW-POP can be motivated via the influence function (4) with h(x) = −μ. The estimator μ̂IPW-NR in (4) of KS follows from (4) with h(x) = −E[y {1 − π(x)}]/E[{1 − π(x)}]. In fact, if one restricts h(x) in (4) to be a constant, then, using the fact that the expectation of the square of (4) is the asymptotic variance of the estimator, one may find the “best” such constant minimizing the variance as h(x) = −E[y {1 − π(x)}/π(x)]/E [{1 − π(x)}/π(x)]. An estimator based on this idea was given in (10) of Lunceford and Davidian (2004, page 2943). In general, as for model I, (4) implies that all estimators for μ that are consistent and asymptotically normal if model II is correct must be asymptotically equivalent to an estimator of the form
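$$n^{-1}\sum_{i=1}^{n} \left[ \frac{t_i y_i}{\pi(x_i, \hat{\alpha})} + \frac{t_i - \pi(x_i, \hat{\alpha})}{\pi(x_i, \hat{\alpha})}\,\tilde{h}(x_i) \right] \tag{8}$$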
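where α is estimated by solving an equation of the form $\sum_{i=1}^{n} B(x_i, \alpha)\{t_i - \pi(x_i, \alpha)\} = 0$ for some (s × 1) B(xi, α), almost always maximum likelihood for binary regression. As above, because α is estimated, the influence function of (8) is equal to (4) with h(x) equal to
$$\tilde{h}(x) - E\left[\frac{\{m_0(x) + \tilde{h}(x)\}\,\pi_\alpha^{T}(x, \alpha_0)}{\pi_0(x)}\right] \left[E\{B(x, \alpha_0)\,\pi_\alpha^{T}(x, \alpha_0)\}\right]^{-1} B(x, \alpha_0)\,\pi_0(x) \tag{9}$$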
where πα (x, α) is the vector of partial derivatives of elements of π(x, α) with respect to α, and α0 satisfies π0(x) = π(x, α0).
Doubly robust (DR) estimators are estimators that are consistent and asymptotically normal for models in $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$, that is, under the assumptions of model I or model II. When the true density $p_0(z) \in \mathcal{M}_{\mathrm{I}} \cap \mathcal{M}_{\mathrm{II}}$, the influence function of any such DR estimator must be equal to (3) with a(x) = 1/π0(x) or, equivalently, equal to (4) with h(x) = −m0(x). Accordingly, when $p_0(z) \in \mathcal{M}_{\mathrm{I}} \cap \mathcal{M}_{\mathrm{II}}$, that is, both models have been specified correctly, all such DR estimators will have the same asymptotic variance. This also implies that, if both models are correctly specified, the asymptotic properties of the estimator do not depend on the methods used to estimate β and α.
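The equivalence of the two representations can be checked in one line. The display below assumes that (3) has the form {m0(x) − μ0} + t a(x){y − m0(x)} and that (4) has the form t y/π0(x) + {t − π0(x)}h(x)/π0(x) − μ0, which agrees with the special cases a(x) = 0, a(x) = 1 and h(x) = −μ discussed earlier.

```latex
\begin{align*}
\{m_0(x) - \mu_0\} + \frac{t}{\pi_0(x)}\{y - m_0(x)\}
  &= \frac{t\,y}{\pi_0(x)} - \frac{t - \pi_0(x)}{\pi_0(x)}\, m_0(x) - \mu_0 ,
\end{align*}
```

since m0(x) − t m0(x)/π0(x) = −{t − π0(x)} m0(x)/π0(x). The right-hand side is (4) with h(x) = −m0(x).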
KS discuss strategies for constructing DR estimators, and they present several specific examples: μ̂BC-OLS in their equation (8); the estimators below (8) using POP or NR weights, which we denote as μ̂BC-POP and μ̂BC-NR, respectively; the estimator μ̂WLS in their equation (10); μ̂π-cov in their equation (12); and a version of μ̂π-cov equal to the estimator proposed by Scharfstein, Rotnitzky and Robins (1999) and Bang and Robins (2005), which we denote as μ̂SRR. The results for these estimators under the “Correct-Correct” scenarios ($\mathcal{M}_{\mathrm{I}} \cap \mathcal{M}_{\mathrm{II}}$) in Tables 5–8 of KS are consistent with the asymptotic properties above. We note that μ̂π-cov is not DR under $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$ because of the additional assumption that the mean of y given π must be equal to a linear combination of basis functions in π. Making this additional assumption may not be unreasonable in practice; however, strictly speaking, it takes μ̂π-cov outside the class of DR estimators discussed here, and hence we do not consider it in the remainder of this section. However, μ̂SRR is still in this class.
KS suggest that a characteristic distinguishing the performance of DR estimators is whether or not the estimator is within or outside the augmented inverse-probability weighted (AIPW) class. We find this distinction artificial, as all of the above estimators μ̂BC-OLS, μ̂BC-POP, μ̂BC-NR, μ̂WLS and μ̂SRR can be expressed in an AIPW form. Namely, all of these estimators are algebraically exactly of the form (8) with h̃(xi ) replaced by a term -γ̂ - m(xi, β̂), where γ̂BC-OLS = γ̂WLS = γ̂SRR = 0,
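$$\hat{\gamma}_{\text{BC-POP}} = \frac{\sum_{i=1}^{n} t_i\,\hat{\pi}_i^{-1}(y_i - \hat{m}_i)}{\sum_{i=1}^{n} t_i\,\hat{\pi}_i^{-1}}, \qquad \hat{\gamma}_{\text{BC-NR}} = \frac{\sum_{i=1}^{n} t_i\,\hat{\pi}_i^{-1}(1 - \hat{\pi}_i)(y_i - \hat{m}_i)}{\sum_{i=1}^{n} t_i\,\hat{\pi}_i^{-1}(1 - \hat{\pi}_i)}, \tag{10}$$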
where we write π̂i = π(xi, α̂) and m̂i = m(xi, β̂) for brevity. For μ̂WLS and μ̂SRR, this identity follows from the fact that $\sum_{i=1}^{n} t_i\,\hat{\pi}_i^{-1}(y_i - \hat{m}_i) = 0$, which for μ̂WLS holds because KS restrict to m(x, β) = xᵀβ, with x including a constant term. Thus, we contend that issues of performance under $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$ are not linked to whether or not a DR estimator is AIPW, but, rather, are a consequence of forms of the influence functions of estimators under $\mathcal{M}_{\mathrm{I}}$ or $\mathcal{M}_{\mathrm{II}}$. In particular, under model II, it follows that the above estimators have influence functions of the form (4) with h(x) equal to (9) with h̃(x) = −{γ* + m(x, β*)}, where γ* and β* are the limits in probability of γ̂ and β̂, respectively. Thus, features determining performance of these estimators when model II is correct are how close γ* + m(x, β*) is to m0(x) and how α is estimated, where maximum likelihood is the optimal choice. In fact, this perspective reveals that, for fixed m(x, β), using ideas similar to those in Tan (2006), the optimal choice of γ̂ is as in γ̂BC-NR with ti(1 − π̂i)/π̂i replaced by ti(1 − π̂i)/π̂i².
Similarly, under model I, the influence functions of these estimators are of the form (3) with a(x) equal to (7) with ã(x) = ψ1/π(x, α*) + ψ2, where α* is the limit in probability of α̂ and ψ1 = 1 and ψ2 = 0 for μ̂BC-OLS, μ̂WLS and μ̂SRR; ψ1 = 1/E{π0(x)/π(x, α*)} and ψ2 = 0 for μ̂BC-POP; and ψ1 and ψ2 for μ̂BC-NR are more complicated expectations involving π0(x) and π(x, α*). Thus, under model I, features determining performance of these estimators are the form of ã(x) and how β is estimated through the choice of A(x, β).
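The claim that the bias-corrected estimators are algebraically of AIPW form can be verified numerically. The sketch below rests on assumptions: it takes μ̂BC-OLS to be the mean of the fitted values plus the inverse-probability-weighted mean residual, and μ̂BC-POP to be the mean of the fitted values plus a ratio-weighted mean residual. The arrays `pi_hat` and `m_hat` are arbitrary stand-ins for fitted values, not the output of any fitted model.

```python
import numpy as np

# Hypothetical inputs (assumptions): stand-ins for fitted propensities and fitted means.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
pi_hat = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))   # plays the role of pi(x_i, alpha-hat)
t = rng.binomial(1, pi_hat)
y = np.where(t == 1, 1.0 + 2.0 * x + rng.normal(size=n), 0.0)
m_hat = 0.5 + 1.5 * x                              # plays the role of m(x_i, beta-hat)

resid = t * (y - m_hat)                            # residuals, zero for nonrespondents

# Assumed forms of the bias-corrected estimators.
mu_bc_ols = m_hat.mean() + np.mean(resid / pi_hat)
mu_bc_pop = m_hat.mean() + np.sum(resid / pi_hat) / np.sum(t / pi_hat)

def aipw(gamma):
    """AIPW form: inverse-weighted mean plus augmentation with h-tilde = -(gamma + m_hat)."""
    h_tilde = -(gamma + m_hat)
    return np.mean(t * y / pi_hat + (t - pi_hat) / pi_hat * h_tilde)

gamma_pop = np.sum(resid / pi_hat) / np.sum(t / pi_hat)

print(np.isclose(aipw(0.0), mu_bc_ols), np.isclose(aipw(gamma_pop), mu_bc_pop))
```

The identities hold for any `pi_hat` and `m_hat`, which is why the stand-in values need not come from a fitted model.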
We may interpret some of the results in Tables 5, 6 and 8 of KS in light of these observations. Under the “π-model Correct–y-model Incorrect” scenario ($\mathcal{M}_{\mathrm{II}} \cap \mathcal{M}_{\mathrm{I}}^{c}$), μ̂BC-OLS, μ̂WLS and μ̂SRR show some nontrivial differences in performance, which, from above, are likely attributable to differences in m(x, β*). Under the “π-model Incorrect–y-model Correct” scenario ($\mathcal{M}_{\mathrm{I}} \cap \mathcal{M}_{\mathrm{II}}^{c}$), all three estimators share the same ã(x) but use different methods to estimate β, so that any differences are dictated entirely by the choice of A(x, β). The poor performance of μ̂SRR can be understood from this perspective: “β” for this estimator is actually β in the model m(x, β) used by the other two estimators augmented by an additional element, the coefficient of 1/π̂i. The A(x, β) for μ̂SRR thus involves a design matrix that is unstable for small π̂i, consistent with the comment of KS at the end of their Section 3.
In summary, we believe that studying the performance of estimators via their influence functions can provide useful insights. Our preceding remarks refer to large-sample performance, which depends directly on the influence function. Estimators with the same influence function can exhibit different finite-sample properties. It may be possible via higher-order expansions to gain an understanding of some of this behavior; to the best of our knowledge, this is an open question.
BOTH MODELS INCORRECT
The developments in the previous section are relevant in $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$. Key themes of KS are the performance of DR and other estimators outside this class, that is, when both the models π(x, α) and m(x, β) are incorrectly specified, and the choice of estimator under these circumstances.
One way to study performance in this situation is through simulation. KS have devised a very interesting and instructive specific simulation scenario that highlights some important features of various estimators. In particular, the KS scenario emphasizes the difficulties encountered with some of the DR estimators when π(xi, α̂) is small for some xi. Indeed, in our experience, poor performance of DR and IPW estimators in practice can result from a few small π(xi, α̂). When there are small π(xi, α̂), as noted by KS, responses are not observed for some portion of the x space. Consequently, estimators like μ̂OLS rely on extrapolation into that part of the x space. KS have constructed a scenario where failure to observe y in a portion of the x space can wreak havoc on some estimators that make use of the π(xi, α̂) but has minimal impact on the quality of extrapolations for these x based on m(x, β̂). One could equally well build a scenario where the x for which y is unobserved are highly influential for the regression m(x, β) and hence could result in deleterious performance of μ̂OLS. We thus reiterate the remark of KS that, although simulations can be illuminating, they cannot yield broadly applicable conclusions.
Given this, we offer some thoughts on other strategies for deriving estimators that may have some robustness properties under the foregoing conditions, that is, offer good performance outside $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$. One approach may be to search outside the class of DR estimators valid under $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$. For example, as suggested by the simulations of KS, estimators in the spirit of μ̂π-cov, which impose additional assumptions rendering them DR in the strict sense only in a subset of $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$, may compensate for this restriction by yielding more robust performance outside $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$; further study along these lines would be interesting. An alternative tactic for searching outside $\mathcal{M}_{\mathrm{I}} \cup \mathcal{M}_{\mathrm{II}}$ may be to consider the form of influence functions (5) for estimators valid under $\mathcal{M}_{\mathrm{I}} \cap \mathcal{M}_{\mathrm{II}}$. For instance, a “hybrid” estimator of the form
for δ small, may take advantage of the desirable properties of both μ̂OLS and DR estimators.
A second possible strategy for identifying robust estimators arises from the following observation. Consider the estimator
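$$n^{-1}\sum_{i=1}^{n} \left[ m(x_i, \hat{\beta}) + \frac{t_i\{y_i - m(x_i, \hat{\beta})\}}{\pi(x_i)} \right] \tag{11}$$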
If π(xi ) = π(xi, α̂), then (11) yields one form of a DR estimator. If π(xi ) ≡ 1, then (11) results in the imputation estimator. If π(xi ) = ∞, (11) reduces to μ̂OLS. This suggests that it may be possible to develop estimators based on alternative choices of π(xi ) that may have good robustness properties. For example, a method for obtaining estimators π(xi, α̂) that shrinks these toward a common value may prove fruitful. The suggestion of KS to move away from logistic regression models for π(xi, α) is in a similar spirit.
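The three special cases can be illustrated with a short sketch. It assumes that (11) is the average of m(xi, β̂) + ti{yi − m(xi, β̂)}/π(xi); the simulated inputs, the function name `estimator_11` and the linear shrinkage rule with weight `lam` are hypothetical choices made only for illustration.

```python
import numpy as np

def estimator_11(t, y, m_hat, pi):
    """Average of m_hat + t * (y - m_hat) / pi for user-supplied weights pi(x_i)."""
    return np.mean(m_hat + t * (y - m_hat) / pi)

# Hypothetical inputs (assumptions): stand-ins for fitted propensities and fitted means.
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
pi_hat = 1.0 / (1.0 + np.exp(-(0.2 + 1.0 * x)))
t = rng.binomial(1, pi_hat)
y = np.where(t == 1, 1.0 + 2.0 * x + rng.normal(size=n), 0.0)
m_hat = 0.9 + 1.8 * x

mu_dr = estimator_11(t, y, m_hat, pi_hat)               # pi(x_i) = fitted propensity: a DR form
mu_imp = estimator_11(t, y, m_hat, np.ones(n))          # pi(x_i) = 1: imputation estimator
mu_ols = estimator_11(t, y, m_hat, np.full(n, np.inf))  # pi(x_i) = infinity: mean of fitted values

# One hypothetical shrinkage rule: pull each fitted propensity toward the common mean.
lam = 0.5
pi_shrunk = (1.0 - lam) * pi_hat + lam * pi_hat.mean()
mu_shrunk = estimator_11(t, y, m_hat, pi_shrunk)

print(mu_dr, mu_imp, mu_ols, mu_shrunk)
```

Shrinking toward the mean raises the smallest weights' denominators, which limits the influence of any single respondent with a small fitted propensity; whether this improves robustness in a given problem is an open question, not a result.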
Finally, we note that yet another approach to developing estimators would be to start with the premise that one make no parametric assumption on the forms of E(y|x) and E(t| x) beyond some mild smoothness conditions. Here, it is likely that first-order asymptotic theory, as in the previous section, may no longer be applicable. It may be necessary to use higher-order asymptotic theory to make progress in this direction; see, for example, Robins and van der Vaart (2006).
CONCLUDING REMARKS
We again compliment the authors for their thoughtful and insightful article, and we appreciate the opportunity to offer our perspectives on this important problem. We look forward to new methodological developments that may overcome some of the challenges brought into focus by KS in their article.
Acknowledgments
This research was supported in part by Grants R01-CA051962, R01-CA085848 and R37-AI031789 from the National Institutes of Health.
Contributor Information
Anastasios A. Tsiatis is Drexel Professor of Statistics at North Carolina State University, Raleigh, North Carolina 27695-8203, USA (e-mail: tsiatis@stat.ncsu.edu).
Marie Davidian is William Neal Reynolds Professor of Statistics at North Carolina State University, Raleigh, North Carolina 27695-8203, USA (e-mail: davidian@stat.ncsu.edu).
References
- Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–972. doi: 10.1111/j.1541-0420.2005.00377.x. MR2216189.
- Davidian M, Tsiatis AA, Leon S. Semiparametric estimation of treatment effect in a pretest-posttest study with missing data. Statist Sci. 2005;20:261–301. doi: 10.1214/088342305000000151. MR2189002.
- Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine. 2004;23:2937–2960. doi: 10.1002/sim.1903.
- Molenberghs G. Discussion of “Semiparametric estimation of treatment effect in a pretest–posttest study with missing data,” by M. Davidian, A. A. Tsiatis and S. Leon. Statist Sci. 2005;20:289–292. doi: 10.1214/088342305000000151. MR2189002.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Amer Statist Assoc. 1994;89:846–866. MR1294730.
- Robins J, van der Vaart A. Adaptive nonparametric confidence sets. Ann Statist. 2006;34:229–253. MR2275241.
- Scharfstein DO, Rotnitzky A, Robins JM. Rejoinder to “Adjusting for nonignorable drop-out using semiparametric nonresponse models”. J Amer Statist Assoc. 1999;94:1135–1146. MR1731478.
- Tan Z. A distributional approach for causal inference using propensity scores. J Amer Statist Assoc. 2006;101:1619–1637. MR2279484.
- Tsiatis AA. Semiparametric Theory and Missing Data. Springer; New York: 2006. MR2233926.