A note on semiparametric efficient generalization of causal effects from randomized trials to target populations

Fan Li; Hwanhee Hong; Elizabeth A Stuart

doi:10.1080/03610926.2021.2020291

Author manuscript; available in PMC: 2023 Jul 21.

Published in final edited form as: Commun Stat Theory Methods. 2021 Dec 29;52(16):5767–5798. doi: 10.1080/03610926.2021.2020291


PMCID: PMC10361688 NIHMSID: NIHMS1836106 PMID: 37484707

Abstract

When effect modifiers influence the decision to participate in randomized trials, generalizing causal effect estimates to an external target population requires the knowledge of two scores – the propensity score for receiving treatment and the sampling score for trial participation. While the former score is known due to randomization, the latter score is usually unknown and estimated from data. Under unconfounded trial participation, we characterize the asymptotic efficiency bounds for estimating two causal estimands – the population average treatment effect and the average treatment effect among the non-participants – and examine the role of the scores. We also study semiparametric efficient estimators that directly balance the weighted trial sample toward the target population, and illustrate their operating characteristics via simulations.

Keywords: Causal inference, external validity, empirical likelihood, propensity score, sampling score, semiparametric efficiency bound

1. Introduction

While randomized trials ensure the optimal internal validity for treatment comparisons, they may not provide valid effect estimates for external target populations when effect modifiers also influence trial participation (Cole and Stuart 2010; Stuart et al. 2011). Generalizing causal inferences from potentially unrepresentative randomized trials to populations requires an understanding of the sampling mechanism by which individuals are selected to participate in the trial. In a randomized trial, the propensity score for receiving treatment is usually known by design (Rosenbaum and Rubin 1983). However, the sampling score, which is the conditional probability of trial participation given the covariates, is usually unknown and must be estimated. Assuming no unmeasured effect modifiers and a correct sampling model, inverse probability of sampling weights can be constructed to provide an unbiased estimate of the average causal effects in the target population; this approach has received considerable attention (Cole and Stuart 2010; Stuart, Bradshaw, and Leaf 2015; Buchanan et al. 2018). These weights are often estimated from a parametric regression model using maximum likelihood, without a formal mechanism to ensure balance of effect modifiers between the weighted trial population and the target population in finite samples. In addition, even if the model for the weights is correctly specified, inverse weighting alone may not be statistically efficient (Robins, Rotnitzky, and Zhao 1994) and could exhibit excessive variability in the effect estimates when the sampling score is estimated to be close to zero (Cole and Hernán 2008; Crump et al. 2009; Li, Thomas, and Li 2019).

In this note, we consider semiparametric efficient approaches for generalizing causal effects from randomized trials to target populations. We first obtain explicit forms of the semiparametric efficiency bounds with regard to the estimation of two population-level causal estimands – the population average treatment effect including the trial participants and the average treatment effect among non-participants. The estimation of the latter is also closely connected to the concept of causal transportability (Westreich et al. 2017). The efficiency bounds parallel those developed in Hahn (1998) and Hirano, Imbens, and Ridder (2003), where the primary interests are unconfounded treatment comparisons in observational studies. The asymptotic variance bounds provide an efficiency perspective to the role of propensity score and sampling score. We show that the propensity score is ancillary for estimation of the aforementioned two estimands while the sampling score is only ancillary for estimation of the population average treatment effect.

To improve the robustness and efficiency of the generalizability estimator, Dahabreh et al. (2019) discussed doubly robust estimators that combine the inverse probability of sampling weights and outcome regression; their estimators are consistent when either the sampling score or the outcome model is correct but not necessarily both. When both models are correct, their estimators are considered more efficient than inverse weighting by sampling scores alone. However, because their doubly robust estimator is a plug-in estimator based on the efficient influence function, it could still be subject to large variability in the presence of extreme sampling scores, which arise, for example, when there is a strong selection effect based on effect modifiers. Moreover, the parametric doubly robust approaches do not guarantee adequate balance of the effect modifiers between the weighted trial population and the target population in finite samples. To overcome these potential limitations, we provide robust and semiparametric efficient empirical likelihood estimators that solve for the weights from a set of balancing conditions. Since finite-sample covariate balance is enforced by these conditions, our estimators could provide greater efficiency by eliminating residual bias and maintain robustness against model misspecification. Estimating weights that exploit covariate-balancing conditions originates from the survey sampling literature (Deville and Särndal 1992), and has been increasingly used to address confounding bias in observational studies; see, for example, Qin and Zhang (2007); Hainmueller (2012); Imai and Ratkovic (2014); Zubizarreta (2015); and Zhao (2019). Our empirical likelihood estimators extend these earlier results to causal generalizability with new insights. We show that by carefully selecting the balancing conditions, one may tailor the estimator to achieve double robustness and efficiency with respect to different asymptotic variance bounds.

2. Basic setup and efficiency bounds

We consider a single randomized trial with n₁ participants nested within a population of trial-eligible individuals of size n. For each participant i = 1, …, n₁, the outcome Y_i, a binary treatment A_i ∈ {0, 1}, and a set of pretreatment covariates X_i are recorded. We define for each participant two potential outcomes, $\{Y_{i}^{1}, Y_{i}^{0}\}$ , mapped to each level of treatment. Under the Stable Unit Treatment Value Assumption (SUTVA) (Rubin 1980), the observed outcome is $Y_{i} = A_{i} Y_{i}^{1} + (1 - A_{i}) Y_{i}^{0}$ . Define n₀ = n − n₁; for each non-participant i = n₁ + 1, …, n, we observe the same set of covariates X_i but have no information on treatment or outcome. We define S_i = 1 if individual i participates in the trial and S_i = 0 otherwise. We also write ω = pr(A_i = 1|S_i = 1) and π = pr(S_i = 1) as two proportions in the population. Generalizing trial findings could focus on two estimands: the population average treatment effect

μ = μ_{1} - μ_{0} = E [Y_{i}^{1} - Y_{i}^{0}],

and the average treatment effect among the non-participants

τ = τ_{1} - τ_{0} = E [Y_{i}^{1} - Y_{i}^{0} ∣ S_{i} = 0] .

In the nested trial design (where the trial is a true subset of the entire target population), μ is of policy interest as it measures the average change in outcome when the intervention is rolled out to the entire population, while τ remains relevant for comparing the effect estimates among trial participants and non-participants. Although we focus on the nested trial design, the estimation of τ is akin to those for transporting inferences from a trial to an external target population, due to similarity in the identification conditions (Westreich et al. 2017). To proceed, we assume the following two identification conditions.

Assumption 2.1. (Strongly Ignorable Assignment)

(i) Assignment to treatment is unconfounded given pretreatment covariates X_i among trial participants, that is, $\{Y_{i}^{1}, Y_{i}^{0}\} ⊥ A_{i} ∣ X_{i}, S_{i} = 1$ .
(ii) The treatment propensity score e(X_i) = pr(A_i = 1|X_i, S_i = 1) is strictly bounded between zero and one for all values of X_i with a positive density.

Assumption 2.1 holds by randomization within the trial sample, and suffices to identify the within-trial causal estimand, $E [Y_{i}^{1} - Y_{i}^{0} ∣ S_{i} = 1]$ . This quantity typically differs from μ and τ, especially when X_i includes important effect modifiers that influence trial participation. Assumption 2.1 is also comparable to the strongly ignorable assumption in Rosenbaum and Rubin (1983) for observational studies, where the treatment is assumed to be conditionally randomized based on a set of measured covariates X_i.

Assumption 2.2. (Strongly Ignorable Participation)

(i) Trial participation is conditionally independent of the potential outcomes given the covariates, that is, $\{Y_{i}^{1}, Y_{i}^{0}\} ⊥ S_{i} ∣ X_{i}$ .
(ii) The sampling score, defined as the conditional probability of participation given the covariates, p(X_i) = pr(S_i = 1|X_i), is strictly bounded between zero and one for all values of X_i with a positive density.

Assumption 2.2 requires the absence of unmeasured treatment effect modifiers and its plausibility may only be indirectly assessed (Stuart et al. 2011). Even though weaker conditions such as mean generalizability and mean transportability (Dahabreh et al. 2020) are sufficient for point identification, we maintain Assumption 2.2 as it is required to obtain the expression of efficiency bounds. Part (ii) of Assumption 2.2 is often referred to as the positivity assumption, and critically depends on the definition of the target population. In settings where some individuals in the external population would never participate in the trial, the target population could be redefined to avoid extrapolation (Tipton 2013). Assumptions 2.1 and 2.2 permit the calculation of the semiparametric asymptotic variance bounds for estimation of μ and τ; in other words, there does not exist any sequence of regular estimators that has a smaller asymptotic variance. The bounds are introduced in Theorems 2.3 and 2.4, with derivations provided in the Appendices A and B.

Theorem 2.3. Assuming the sampling score p(X_i) is unknown, the efficient influence functions for estimating μ and τ are

Ψ_{μ} (Y_{i}, S_{i}, A_{i}, X_{i}) = \frac{S_{i} A_{i}}{p (X_{i}) e (X_{i})} (Y_{i} - μ_{1} (X_{i})) - \frac{S_{i} (1 - A_{i})}{p (X_{i}) (1 - e (X_{i}))} (Y_{i} - μ_{0} (X_{i})) + μ (X_{i}) - μ

Ψ_{τ} (Y_{i}, S_{i}, A_{i}, X_{i}) = \frac{1 - p (X_{i})}{1 - π} (\frac{S_{i} A_{i}}{p (X_{i}) e (X_{i})} (Y_{i} - τ_{1} (X_{i})) - \frac{S_{i} (1 - A_{i})}{p (X_{i}) (1 - e (X_{i}))} (Y_{i} - τ_{0} (X_{i})) + \frac{1 - S_{i}}{1 - p (X_{i})} (τ (X_{i}) - τ)) .

Therefore, the semiparametric efficiency bounds for μ and τ are

Σ_{μ} = E [\frac{σ_{1}^{2} (X_{i})}{p (X_{i}) e (X_{i})} + \frac{σ_{0}^{2} (X_{i})}{p (X_{i}) (1 - e (X_{i}))} + {(μ (X_{i}) - μ)}^{2}],

and

Σ_{τ} = E [{(\frac{1 - p (X_{i})}{1 - π})}^{2} (\frac{σ_{1}^{2} (X_{i})}{p (X_{i}) e (X_{i})} + \frac{σ_{0}^{2} (X_{i})}{p (X_{i}) (1 - e (X_{i}))} + \frac{{(τ (X_{i}) - τ)}^{2}}{1 - p (X_{i})})],

where $σ_{1}^{2} (X_{i}) = V [Y_{i}^{1} ∣ X_{i}]$ , $σ_{0}^{2} (X_{i}) = V [Y_{i}^{0} ∣ X_{i}]$ , $μ (X_{i}) = μ_{1} (X_{i}) - μ_{0} (X_{i}) = E [Y_{i}^{1} ∣ X_{i}] - E [Y_{i}^{0} ∣ X_{i}]$ , and $τ (X_{i}) = τ_{1} (X_{i}) - τ_{0} (X_{i}) = E [Y_{i}^{1} ∣ X_{i}, S_{i} = 0] - E [Y_{i}^{0} ∣ X_{i}, S_{i} = 0]$ .

Theorem 2.4. Assuming the sampling score p(X_i) is known, the efficient influence functions for estimating μ and τ are ${\tilde{Ψ}}_{μ} (Y_{i}, S_{i}, A_{i}, X_{i}) = Ψ_{μ} (Y_{i}, S_{i}, A_{i}, X_{i})$ and

{\tilde{Ψ}}_{τ} (Y_{i}, S_{i}, A_{i}, X_{i}) = \frac{1 - p (X_{i})}{1 - π} (\frac{S_{i} A_{i}}{p (X_{i}) e (X_{i})} (Y_{i} - τ_{1} (X_{i})) - \frac{S_{i} (1 - A_{i})}{p (X_{i}) (1 - e (X_{i}))} (Y_{i} - τ_{0} (X_{i})) + τ (X_{i}) - τ) .

Therefore, the semiparametric efficiency bounds for μ and τ are

{\tilde{Σ}}_{μ} = Σ_{μ} = E [\frac{σ_{1}^{2} (X_{i})}{p (X_{i}) e (X_{i})} + \frac{σ_{0}^{2} (X_{i})}{p (X_{i}) (1 - e (X_{i}))} + {(μ (X_{i}) - μ)}^{2}],

and

{\tilde{Σ}}_{τ} = E [{(\frac{1 - p (X_{i})}{1 - π})}^{2} (\frac{σ_{1}^{2} (X_{i})}{p (X_{i}) e (X_{i})} + \frac{σ_{0}^{2} (X_{i})}{p (X_{i}) (1 - e (X_{i}))} + {(τ (X_{i}) - τ)}^{2})] .

The derivations in Appendices A and B indicate that the efficiency bounds for μ and τ remain the same regardless of whether the propensity score e(X_i) is known. Therefore, e(X_i) is ancillary to the estimation of both μ and τ, irrespective of whether the sampling score is known. Interestingly, this suggests that the form of the efficiency bounds does not change whether the causal generalization is from a randomized experiment or an observational study, as long as Assumption 2.1 holds.

On the other hand, even if the trial participants are selected according to a well-planned survey with known sampling score, this knowledge of the sampling score does not reduce the efficiency bound for estimation of μ, because ${\tilde{Σ}}_{μ} = Σ_{μ}$ ; thus p(X_i) is ancillary to the estimation of population average treatment effect. However, the knowledge of the sampling score p(X_i) reduces the efficiency bound for estimating τ, by the amount of

Σ_{τ} - {\tilde{Σ}}_{τ} = {(1 - π)}^{- 2} E [{(τ (X_{i}) - τ)}^{2} p (X_{i}) (1 - p (X_{i}))] \geq 0.

This value quantifies the marginal information content of the sampling score for the estimation of τ, in parallel to the estimation of the average treatment effect among the treated in observational studies (Hahn 1998). In particular, $Σ_{τ} = {\tilde{Σ}}_{τ}$ when there is no effect modification by X_i, or τ(X_i) = τ, among non-participants.
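The displayed difference can be checked directly from the two bounds in Theorems 2.3 and 2.4. The conditional variance terms are identical in Σ_τ and its known-score counterpart and cancel, so only the effect-modification terms remain. The short derivation below uses nothing beyond the expressions already stated:

```latex
\begin{aligned}
\Sigma_{\tau} - \tilde{\Sigma}_{\tau}
&= E\left[\left(\frac{1 - p(X_i)}{1 - \pi}\right)^{2}
   \left\{\frac{(\tau(X_i) - \tau)^{2}}{1 - p(X_i)} - (\tau(X_i) - \tau)^{2}\right\}\right] \\
&= (1 - \pi)^{-2}\, E\left[(1 - p(X_i))^{2}\, (\tau(X_i) - \tau)^{2}\,
   \frac{p(X_i)}{1 - p(X_i)}\right] \\
&= (1 - \pi)^{-2}\, E\left[(\tau(X_i) - \tau)^{2}\, p(X_i)\,(1 - p(X_i))\right] \;\geq\; 0 .
\end{aligned}
```

The second line uses 1/(1 − p) − 1 = p/(1 − p). The final expression is zero exactly when τ(X_i) = τ almost surely, since p(X_i) lies strictly between zero and one under Assumption 2.2.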

3. Semiparametric efficient generalizability estimators

3.1. Estimation of the population average treatment effect

We study several semiparametric efficient estimators that achieve the above variance lower bounds under certain conditions. The empirical likelihood approach is used to directly balance the covariates between the trial sample and the target population (Owen 2001). Denote the set of trial participants receiving treatment by 𝒯 and those receiving control by 𝒞. We further define the set of non-participants by 𝒩, and the entire population by 𝒫 = 𝒯 ∪ 𝒞 ∪ 𝒩. The sample proportion of trial participants and that of treated participants are $\hat{π} = n_{1} / n$ and $\hat{ω} = n_{11} / n_{1}$ , where n₁₁ is the number of trial participants receiving treatment. Suppose $e (X_{i}; \hat{α})$ and $p (X_{i}; \hat{β})$ are the propensity score and sampling score estimated via “working” parametric models using maximum likelihood. We then estimate μ₁ by ${\hat{μ}}_{1} = \sum_{i \in 𝒯} q_{i} Y_{i}$ , where the empirical weights are obtained by maximizing $\sum_{i \in 𝒯}$ log q_i subject to the following three constraints:

\sum_{i \in 𝒯} q_{i} = 1,

(1)

\sum_{i \in 𝒯} q_{i} p (X_{i}; \hat{β}) e (X_{i}; \hat{α}) = \hat{π} \hat{ω},

(2)

\sum_{i \in 𝒯} q_{i} a (X_{i}) = \frac{1}{n} \sum_{i \in 𝒫} a (X_{i}) .

(3)

The first constraint ensures validity of the empirical weights so that they sum to unity. The second constraint ensures that the product of the estimated propensity score and sampling score among the treated is balanced toward its population mean, while the last constraint directly balances a k₁-vector of effect modifiers or functions of them, denoted by $a (X) = {(a_{1} (X), \dots, a_{k_{1}} (X))}^{T}$ , toward the respective population means. Similarly, we estimate μ₀ by ${\hat{μ}}_{0} = \sum_{i \in 𝒞} q_{i} Y_{i}$ , where the empirical weights maximize $\sum_{i \in 𝒞}$ log q_i and are subject to the following constraints:

\sum_{i \in 𝒞} q_{i} = 1,

(4)

\sum_{i \in 𝒞} q_{i} p (X_{i}; \hat{β}) (1 - e (X_{i}; \hat{α})) = \hat{π} (1 - \hat{ω}),

(5)

\sum_{i \in 𝒞} q_{i} b (X_{i}) = \frac{1}{n} \sum_{i \in 𝒫} b (X_{i}),

(6)

where $b (X) = {(b_{1} (X), \dots, b_{k_{0}} (X))}^{T}$ is a k₀-vector of effect modifiers one wishes to balance through weighting. Then the population average treatment effect μ is estimated by $\hat{μ} = {\hat{μ}}_{1} - {\hat{μ}}_{0}$ . The solutions to the empirical weights {q_i : i ∈ 𝒯} and {q_i : i ∈ 𝒞} can be obtained via the iterative algorithm suggested by Chen, Sitter, and Wu (2002). It is evident from the last constraints in Equations (3) and (6) that the weights are constructed to directly achieve balance on the effect modifiers and so avoid the iterative fitting-checking procedure required when using the inverse probability of sampling weights (Stuart et al. 2011). In addition to balancing the effect modifiers, the inclusion of the second constraints in Equations (2) and (5) ensures the double robustness property of $\hat{μ}$ , as explained in Theorem 3.1.
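To make the weight calculation concrete, the following is a minimal Python sketch of how empirical likelihood weights of this kind could be computed. It is not the authors' implementation, and it is a generic damped Newton iteration on the Lagrange multiplier rather than a transcription of the algorithm of Chen, Sitter, and Wu (2002). The function name `el_weights`, the matrix `G`, and the toy data are hypothetical. Each row of `G` is assumed to hold the constraint functions minus their targets; for Equations (2) and (3), for example, the columns would be $p (X_{i}; \hat{β}) e (X_{i}; \hat{α}) - \hat{π} \hat{ω}$ and $a (X_{i}) - n^{- 1} \sum_{j \in 𝒫} a (X_{j})$ . The solution has the standard form q_i = 1 / {m (1 + λᵀ g_i)}, where m is the number of units being weighted.

```python
import numpy as np


def el_weights(G, max_iter=200, tol=1e-10):
    """Empirical likelihood weights for centered constraints.

    Maximizes sum(log q_i) subject to sum(q_i) = 1 and sum(q_i * G[i]) = 0,
    where row G[i] holds the constraint functions minus their targets.
    """
    G = np.asarray(G, dtype=float)
    m, k = G.shape
    lam = np.zeros(k)  # Lagrange multiplier for the balancing constraints

    def dual(l):
        d = 1.0 + G @ l
        # The dual is only defined where every implied weight is positive.
        return np.sum(np.log(d)) if np.all(d > 0) else -np.inf

    for _ in range(max_iter):
        d = 1.0 + G @ lam
        Gd = G / d[:, None]
        grad = Gd.sum(axis=0)  # estimating equation: sum g_i / (1 + lam' g_i)
        if np.linalg.norm(grad) < tol:
            break
        step = np.linalg.solve(Gd.T @ Gd, grad)  # Newton ascent direction
        t, f0 = 1.0, dual(lam)
        while dual(lam + t * step) < f0 and t > 1e-12:
            t /= 2.0  # step-halving keeps all weights positive
        lam = lam + t * step
    return 1.0 / (m * (1.0 + G @ lam))


# Hypothetical toy data: balance two covariates toward made-up target means.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
target = np.array([0.2, -0.1])
q = el_weights(X - target)
balanced_means = q @ X  # weighted covariate means, close to `target`
```

A solution exists only when the zero vector lies inside the convex hull of the rows of `G`; with strong selection this can fail in small samples, in which case the sketch above would not converge to valid weights.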

Theorem 3.1. Assuming that $\hat{α}$ and $\hat{β}$ are maximum likelihood estimators of the regression coefficients in the propensity score and sampling score models, we have $\hat{μ} - μ = n^{- 1} \sum_{i = 1}^{n} Φ_{μ, i} + o_{p} (n^{- 1 / 2})$ . The influence function is expanded as

Φ_{μ, i} = \frac{S_{i} A_{i}}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} (Y_{i} - μ_{1}) - \frac{S_{i} (1 - A_{i})}{p (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} (Y_{i} - μ_{0}) + B_{1}^{T} G_{1}^{- 1} t_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) - B_{0}^{T} G_{0}^{- 1} t_{0} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, b) + Δ_{μ, i} (β_{*}, α_{*}),

where β*, α* are probability limits of $\hat{β}$ , $\hat{α}$ , $a = E [a (X_{i})]$ , $b = E [b (X_{i})]$ , B₁, B₀, G₁, G₀, t₁(·), t₀(·) are defined in the Appendix C, and Δ_{μ, i}(β*, α*) is a residual term arising from estimating β and α through maximum likelihood. The following properties hold for $\hat{μ}$ :

(i) when the propensity score and sampling score models are correct, $\hat{μ} \overset{p}{\to} μ$ ;
(ii) when the implied outcome model is correct such that $μ_{1} (X) = γ_{10} + \sum_{l = 1}^{k_{1}} γ_{1 l} a_{l} (X)$ , $μ_{0} (X) = γ_{00} + \sum_{l = 1}^{k_{0}} γ_{0 l} b_{l} (X)$ , the estimator remains consistent and $\hat{μ} \overset{p}{\to} μ$ ; therefore $\hat{μ}$ is doubly robust;
(iii) when all models are correct, $V [Φ_{μ, i}] = Σ_{μ}$ , namely $\hat{μ}$ is efficient with respect to the efficiency bound Σ_μ.

In fact, the same robustness and efficiency property can be achieved by enforcing an alternative set of constraints. Specifically, one could define a new estimator $\tilde{μ}$ by replacing the second constraints in Equations (2) and (5) with

\sum_{i \in 𝒯} q_{i} p (X_{i}; \hat{β}) e (X_{i}; \hat{α}) = \tilde{π} \hat{ω},

(7)

\sum_{i \in 𝒞} q_{i} p (X_{i}; \hat{β}) (1 - e (X_{i}; \hat{α})) = \tilde{π} (1 - \hat{ω}) .

(8)

where $\tilde{π} = n^{- 1} \sum_{i = 1}^{n} p (X_{i}; \hat{β})$ is the average of the estimated sampling scores. Intuitively, had one known the true sampling score model, $\tilde{π}$ would be a potentially more efficient estimator for π than $\hat{π}$ . But Theorem 2.4 suggests that knowledge of the true sampling score does not reduce the asymptotic variance bound for μ, and correspondingly, one can show that both $\tilde{μ}$ and $\hat{μ}$ are semiparametric efficient with respect to the same efficiency bound Σ_μ. However, because the right-hand sides of Equations (7) and (8) are consistent for pr[S = 1, A = 1] and pr[S = 1, A = 0] only when p(X) is correctly specified, the general asymptotic expansion of $\tilde{μ}$ is unavailable under misspecified p(X).

3.2. Estimation of the average treatment effect among non-participants

Robust and efficient estimation of τ can proceed with a similar strategy, by instead weighting the trial sample toward the population of non-participants. Specifically, τ₁ is estimated by ${\hat{τ}}_{1} = \sum_{i \in 𝒯} q_{i} Y_{i}$ , where the empirical weights are obtained by maximizing $\sum_{i \in 𝒯}$ log q_i subject to

\sum_{i \in 𝒯} q_{i} = 1,

(9)

\sum_{i \in 𝒯} q_{i} \frac{p (X_{i}; \hat{β}) e (X_{i}; \hat{α})}{1 - p (X_{i}; \hat{β})} = \frac{\hat{π} \hat{ω}}{1 - \hat{π}},

(10)

\sum_{i \in 𝒯} q_{i} a (X_{i}) = \frac{1}{n_{0}} \sum_{i \in 𝒩} a (X_{i}) .

(11)

Likewise, τ₀ is estimated by ${\hat{τ}}_{0} = \sum_{i \in 𝒞} q_{i} Y_{i}$ , where the empirical weights maximize $\sum_{i \in 𝒞}$ log q_i and are subject to

\sum_{i \in 𝒞} q_{i} = 1,

(12)

\sum_{i \in 𝒞} q_{i} \frac{p (X_{i}; \hat{β}) (1 - e (X_{i}; \hat{α}))}{1 - p (X_{i}; \hat{β})} = \frac{\hat{π} (1 - \hat{ω})}{1 - \hat{π}},

(13)

\sum_{i \in 𝒞} q_{i} b (X_{i}) = \frac{1}{n_{0}} \sum_{i \in 𝒩} b (X_{i}),

(14)

where a(X), b(X) are again sets of effect modifiers defined in Section 3.1. Similar to $\hat{μ}$ , $\hat{τ}$ is also a doubly robust construction, as summarized in Theorem 3.2.

Theorem 3.2. Assuming that $\hat{α}$ and $\hat{β}$ are maximum likelihood estimators of the regression coefficients in the propensity score and sampling score models, we have $\hat{τ} - τ = n^{- 1} \sum_{i = 1}^{n} Φ_{τ, i} + o_{p} (n^{- 1 / 2})$ . The influence function is expanded as

Φ_{τ, i} = \frac{1}{1 - π} [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{*})) (Y_{i} - τ_{1})}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} - \frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{*})) (Y_{i} - τ_{0})}{p (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} + J_{1}^{T} K_{1}^{- 1} η_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) - J_{0}^{T} K_{0}^{- 1} η_{0} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, b) + Δ_{τ, i} (β_{*}, α_{*})],

where β*, α* are probability limits of $\hat{β}$ , $\hat{α}$ , $a = E [a (X_{i}) ∣ S_{i} = 0]$ , $b = E [b (X_{i}) ∣ S_{i} = 0]$ , J₁, J₀, K₁, K₀, η₁(·), η₀(·) are defined in the Appendix D, and Δ_{τ, i}(β*, α*) is a residual term arising from estimating β and α through maximum likelihood. The following properties hold for $\hat{τ}$ :

(i) when the propensity score and sampling score models are correct, $\hat{τ} \overset{p}{\to} τ$ ;
(ii) when the implied outcome model is correct such that $τ_{1} (X) = ξ_{10} + \sum_{l = 1}^{k_{1}} ξ_{1 l} a_{l} (X)$ , $τ_{0} (X) = ξ_{00} + \sum_{l = 1}^{k_{0}} ξ_{0 l} b_{l} (X)$ , the estimator remains consistent and $\hat{τ} \overset{p}{\to} τ$ ; therefore $\hat{τ}$ is doubly robust;
(iii) when all models are correct, $V [Φ_{τ, i}] = Σ_{τ}$ , namely $\hat{τ}$ is semiparametric efficient with respect to the efficiency bound Σ_τ.

Note that the doubly robust estimator $\hat{τ}$ is semiparametric efficient with respect to Σ_τ, the efficiency bound without knowledge of the sampling score. Because Theorem 2.4 shows that knowledge of the sampling score further reduces the bound, there should exist a more efficient estimator than $\hat{τ}$ , perhaps at the cost of some of its robustness. Intuitively, had one known the true sampling score, not only would $\tilde{π}$ be a more efficient estimator than $\hat{π}$ , but ${[n_{0} (1 - \tilde{π})]}^{- 1} \sum_{i \in 𝒫} a (X_{i}) (1 - p (X_{i}; \hat{β}))$ and ${[n_{0} (1 - \tilde{π})]}^{- 1} \sum_{i \in 𝒫} b (X_{i}) (1 - p (X_{i}; \hat{β}))$ could also be more efficient estimators for $E [a (X) ∣ S = 0]$ , $E [b (X) ∣ S = 0]$ than $n_{0}^{- 1} \sum_{i \in 𝒩} a (X_{i})$ and $n_{0}^{- 1} \sum_{i \in 𝒩} b (X_{i})$ , which are simple averages. This observation motivates us to employ constraints that exploit more information from the data and possibly arrive at an estimator for τ with higher asymptotic efficiency. Specifically, we could estimate τ₁ by ${\tilde{τ}}_{1} = \sum_{i \in 𝒯} q_{i} Y_{i}$ , where the empirical weights are obtained by maximizing $\sum_{i \in 𝒯}$ log q_i subject to

\sum_{i \in 𝒯} q_{i} = 1,

(15)

\sum_{i \in 𝒯} q_{i} \frac{p (X_{i}; \hat{β}) e (X_{i}; \hat{α})}{1 - p (X_{i}; \hat{β})} = \frac{\tilde{π} \hat{ω}}{1 - \tilde{π}},

(16)

\sum_{i \in 𝒯} q_{i} a (X_{i}) = \frac{\sum_{i \in 𝒫} a (X_{i}) (1 - p (X_{i}; \hat{β}))}{n_{0} (1 - \tilde{π})},

(17)

and estimate τ₀ by ${\tilde{τ}}_{0} = \sum_{i \in 𝒞} q_{i} Y_{i}$ , where the empirical weights are obtained by maximizing $\sum_{i \in 𝒞}$ log q_i subject to

\sum_{i \in 𝒞} q_{i} = 1,

(18)

\sum_{i \in 𝒞} q_{i} \frac{p (X_{i}; \hat{β}) (1 - e (X_{i}; \hat{α}))}{1 - p (X_{i}; \hat{β})} = \frac{\tilde{π} (1 - \hat{ω})}{1 - \tilde{π}},

(19)

\sum_{i \in 𝒞} q_{i} b (X_{i}) = \frac{\sum_{i \in 𝒫} b (X_{i}) (1 - p (X_{i}; \hat{β}))}{n_{0} (1 - \tilde{π})} .

(20)

The contrast between $\tilde{τ}$ and $\hat{τ}$ exemplifies the bias-variance tradeoff. Because the right-hand sides of the last constraints in Equations (17) and (20) converge in probability to the average values of the effect modifiers among the non-participants only when the sampling score is correctly specified, the consistency of $\tilde{τ} = {\tilde{τ}}_{1} - {\tilde{τ}}_{0}$ for τ necessitates the correct specification of p(X). In other words, $\tilde{τ}$ is only “singly” robust in the sense that it is consistent for τ if and only if the sampling score model for p(X) is correctly specified, regardless of the specification of a(X), b(X). But as we encode more information in the constraints, $\tilde{τ}$ achieves a lower efficiency bound in large samples compared to $\hat{τ}$ , as elucidated in the following result.

Theorem 3.3. Assuming that the sampling score model is correctly specified, and that $\hat{α}$ , $\hat{β}$ are maximum likelihood estimators of the regression coefficients in the propensity score and sampling score models, we have $\tilde{τ} - τ = n^{- 1} \sum_{i = 1}^{n} {\tilde{Φ}}_{τ, i} + o_{p} (n^{- 1 / 2})$ . The influence function is expanded as

{\tilde{Φ}}_{τ, i} = \frac{1}{1 - π} [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{0})) (Y_{i} - τ_{1})}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} - \frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{0})) (Y_{i} - τ_{0})}{p (X_{i}; β_{0}) (1 - e (X_{i}; α_{0}))} + J_{1}^{T} K_{1}^{- 1} {\tilde{η}}_{1} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, a) - J_{0}^{T} K_{0}^{- 1} {\tilde{η}}_{0} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, b) + {\tilde{Δ}}_{τ, i} (β_{0}, α_{0})],

where β₀, α₀ are the true values of β, α, $a = E [a (X_{i}) ∣ S_{i} = 0]$ , $b = E [b (X_{i}) ∣ S_{i} = 0]$ , J₁, J₀, K₁, K₀, ${\tilde{η}}_{1} (\cdot)$ , ${\tilde{η}}_{0} (\cdot)$ are defined in the Appendix E, and ${\tilde{Δ}}_{τ, i} (β_{0}, α_{0})$ is a residual term arising from estimating β and α through maximum likelihood. The following properties hold for $\tilde{τ}$ :

(i) since $E [{\tilde{Φ}}_{τ, i}] = 0$ , $\tilde{τ}$ is a consistent estimator, $\tilde{τ} \overset{p}{\to} τ$ ;
(ii) when the implied outcome model is correct such that $τ_{1} (X) = ξ_{10} + \sum_{l = 1}^{k_{1}} ξ_{1 l} a_{l} (X)$ , $τ_{0} (X) = ξ_{00} + \sum_{l = 1}^{k_{0}} ξ_{0 l} b_{l} (X)$ , the large-sample variance $V [{\tilde{Φ}}_{τ, i}] = {\tilde{Σ}}_{τ}$ , therefore $\tilde{τ}$ is semiparametric efficient with respect to the efficiency bound ${\tilde{Σ}}_{τ}$ .

The asymptotic analysis of the empirical likelihood estimators for μ and τ exemplifies the role of the sampling score for causal generalization. When all models are correctly specified, $\tilde{μ}$ encodes the knowledge of the sampling score but still has the same asymptotic variance as $\hat{μ}$ ; this illustrates that the sampling score is ancillary for the estimation of μ. By contrast, when all models are correctly specified, $\tilde{τ}$ encodes the knowledge of the sampling score and achieves a lower asymptotic variance than $\hat{τ}$ ; this illustrates that the sampling score is not ancillary for the estimation of τ. Therefore, the insights obtained from deriving the efficiency bounds provide an interesting perspective on the properties of different empirical likelihood estimators.

Although the true propensity score is known in randomized trials, the constraints we imposed involve an estimated propensity score $e (X_{i}; \hat{α})$ . In this regard, the above estimators apply even when the generalization is from a non-randomized, observational study, provided the ignorability condition in Assumption 2.1 holds. In fact, the way the estimated propensity score is utilized in the constraints also illustrates the role of the propensity score. One could replace $\hat{ω} = n_{11} / n_{1}$ with a potentially more efficient estimator $\tilde{ω} = n_{1}^{- 1} \sum_{i \in 𝒯} e (X_{i}; \hat{α})$ to re-construct all the above empirical likelihood estimators without modifying their semiparametric efficiency properties; such results then exemplify that the propensity score is ancillary for the estimation of both μ and τ.

4. Numerical studies

We study the finite-sample performance of the empirical likelihood estimators and compare them with the classical augmented inverse probability weighting (AIPW) estimators. The AIPW estimator for estimating μ can be obtained as the solution to the estimating equations $0 = \sum_{i = 1}^{n} Ψ_{μ} (Y_{i}, S_{i}, A_{i}, X_{i})$ with estimated propensity scores, sampling scores, and predicted potential outcomes. This estimator is discussed in Dahabreh et al. (2019) and has the form

{\hat{μ}}_{AIPW} = \frac{1}{n} \sum_{i = 1}^{n} [\frac{S_{i} A_{i} (Y_{i} - μ_{1} (X_{i}; {\hat{γ}}_{1}))}{p (X_{i}; \hat{β}) e (X_{i}; \hat{α})} - \frac{S_{i} (1 - A_{i}) (Y_{i} - μ_{0} (X_{i}; {\hat{γ}}_{0}))}{p (X_{i}; \hat{β}) (1 - e (X_{i}; \hat{α}))} + μ_{1} (X_{i}; {\hat{γ}}_{1}) - μ_{0} (X_{i}; {\hat{γ}}_{0})],

where $μ_{1} (X_{i}; {\hat{γ}}_{1})$ , $μ_{0} (X_{i}; {\hat{γ}}_{0})$ are estimated values of $E [Y_{i}^{1} ∣ X_{i}]$ and $E [Y_{i}^{0} ∣ X_{i}]$ , for example, through linear regression with finite-dimensional parameters γ₁ and γ₀. Similar to the two types of the empirical likelihood estimators introduced in Section 3.2, the two types of AIPW estimators for estimating τ can be obtained as the solutions to the estimating equations $0 = \sum_{i = 1}^{n} Ψ_{τ} (Y_{i}, S_{i}, A_{i}, X_{i})$ and $0 = \sum_{i = 1}^{n} {\tilde{Ψ}}_{τ} (Y_{i}, S_{i}, A_{i}, X_{i})$ and are expressed by

{\hat{τ}}_{AIPW 1} = \frac{1}{n_{0}} \sum_{i = 1}^{n} (1 - p (X_{i}; \hat{β})) [\frac{S_{i} A_{i} (Y_{i} - τ_{1} (X_{i}; {\hat{γ}}_{1}))}{p (X_{i}; \hat{β}) e (X_{i}; \hat{α})} - \frac{S_{i} (1 - A_{i}) (Y_{i} - τ_{0} (X_{i}; {\hat{γ}}_{0}))}{p (X_{i}; \hat{β}) (1 - e (X_{i}; \hat{α}))} + \frac{1 - S_{i}}{1 - p (X_{i}; \hat{β})} \{τ_{1} (X_{i}; {\hat{γ}}_{1}) - τ_{0} (X_{i}; {\hat{γ}}_{0})\}],

{\hat{τ}}_{AIPW 2} = \frac{1}{\sum_{i = 1}^{n} (1 - p (X_{i}; \hat{β}))} \sum_{i = 1}^{n} (1 - p (X_{i}; \hat{β})) [\frac{S_{i} A_{i} (Y_{i} - τ_{1} (X_{i}; {\hat{γ}}_{1}))}{p (X_{i}; \hat{β}) e (X_{i}; \hat{α})} - \frac{S_{i} (1 - A_{i}) (Y_{i} - τ_{0} (X_{i}; {\hat{γ}}_{0}))}{p (X_{i}; \hat{β}) (1 - e (X_{i}; \hat{α}))} + τ_{1} (X_{i}; {\hat{γ}}_{1}) - τ_{0} (X_{i}; {\hat{γ}}_{0})],

where $τ_{1} (X_{i}; {\hat{γ}}_{1})$ , $τ_{0} (X_{i}; {\hat{γ}}_{0})$ are estimated values of $E [Y_{i}^{1} ∣ X_{i}, S_{i} = 0]$ and $E [Y_{i}^{0} ∣ X_{i}, S_{i} = 0]$ , for example, through linear regression with finite-dimensional parameters γ₁ and γ₀. Among them, ${\hat{τ}}_{AIPW 1}$ is a doubly robust estimator whose consistency only requires the correct specification of either the sampling score model or the outcome model (Dahabreh et al. 2020), while ${\hat{τ}}_{AIPW 2}$ is only singly robust, in parallel to the empirical likelihood estimators $\hat{τ}$ and $\tilde{τ}$ . In the following simulation studies, the outcome models in the AIPW procedures are all estimated by least-squares regression using the trial sample.
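As an illustration of the AIPW estimator of μ displayed above, the following minimal Python sketch evaluates the formula term by term. The function name `aipw_mu`, its argument names, and the four-unit toy data are hypothetical; the fitted sampling scores, propensity scores, and outcome predictions are assumed to be supplied by whatever working models the analyst has estimated. Outcomes and treatments of non-participants are unobserved, so they are zeroed out before use; they never contribute because S_i = 0 multiplies those terms.

```python
import numpy as np


def aipw_mu(y, s, a, p_hat, e_hat, m1_hat, m0_hat):
    """AIPW estimate of the population average treatment effect.

    y, a   : outcome and treatment (may be NaN where s == 0)
    s      : trial participation indicator
    p_hat  : estimated sampling scores p(X_i; beta_hat)
    e_hat  : estimated propensity scores e(X_i; alpha_hat)
    m1_hat : predictions of E[Y^1 | X_i]; m0_hat: predictions of E[Y^0 | X_i]
    """
    s = np.asarray(s, dtype=float)
    # Replace unobserved values so that NaN does not propagate through s * (...).
    y = np.where(s == 1, y, 0.0)
    a = np.where(s == 1, a, 0.0)
    treated = s * a * (y - m1_hat) / (p_hat * e_hat)
    control = s * (1 - a) * (y - m0_hat) / (p_hat * (1 - e_hat))
    return np.mean(treated - control + m1_hat - m0_hat)


# Hypothetical toy data: two participants and two non-participants.
y = np.array([3.0, 1.0, np.nan, np.nan])
s = np.array([1, 1, 0, 0])
a = np.array([1.0, 0.0, np.nan, np.nan])
p_hat = np.full(4, 0.5)
e_hat = np.full(4, 0.5)
m1_hat = np.full(4, 2.0)
m0_hat = np.full(4, 1.0)
mu_aipw = aipw_mu(y, s, a, p_hat, e_hat, m1_hat, m0_hat)
```

In the toy data the regression part contributes 1 for every unit, and the only nonzero residual term is (3 − 2)/(0.5 × 0.5) = 4 for the treated participant, so the estimate is (4 + 4)/4 = 2.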

4.1. Simulation study I

To compare the empirical likelihood estimators with the classical AIPW estimators, we first consider the following data generating mechanisms. The covariates are generated from X₁ ~ N(0, 1), $X_{2} = X_{1}^{2}$ , X₃ ~ unif(−2, 2) and X₄ = I(X₃ > 0), with the true sampling score model

p (X) = \frac{exp (β_{0} - 0.8 X_{1} + 0.3 X_{2} + X_{3} - 0.6 X_{4})}{1 + exp (β_{0} - 0.8 X_{1} + 0.3 X_{2} + X_{3} - 0.6 X_{4})} .

We fix the trial sample size n₁ = 500, and consider three values of β₀ so that the proportion of the trial sample is 50%, 10%, and 5%, corresponding to target population sizes n = 1000, n = 5000 and n = 10,000. Overall, the data generating processes represent moderate to strong selection effects based on the effect modifiers. Specifically, as the proportion of the trial sample decreases, extremely small sampling scores p(X) become increasingly likely for specific combinations of covariate values. The distributions of the true sampling scores in the population are provided in Figure 1. These data generating processes were chosen to challenge the inferential procedures when there is limited overlap between the trial participants and non-participants.

Figure 1. — Distributions of true sampling scores in the population when the proportion of trial participation is (a) 50%, (b) 10%, and (c) 5%, in the first simulation study. The trial sample size is fixed at n₁ = 500, while the population size corresponding to each panel is (a) n = 1000, (b) n = 5000, and (c) n = 10,000.

The potential outcomes are generated from

Y^{1} ~ N (1 + 2 X_{1} + 2 X_{2} + 3 X_{3} + 2 X_{4}, 1)

Y^{0} ~ N (- 1 - 2 X_{1} + X_{2} - 2 X_{3} + X_{4}, 1) .
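The data generating mechanism above can be sketched in code as follows. This is an illustrative sketch rather than the simulation code used for the reported results: the function names are hypothetical, β₀ is calibrated by bisection on a large Monte Carlo sample (one possible way of matching the target participation proportion), trial participation is drawn as independent Bernoulli variables so that n₁ equals 500 only on average, and treatment assignment within the trial is not included.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def draw_covariates(n, rng):
    """X1 ~ N(0,1), X2 = X1^2, X3 ~ unif(-2,2), X4 = I(X3 > 0)."""
    x1 = rng.normal(size=n)
    x2 = x1 ** 2
    x3 = rng.uniform(-2.0, 2.0, size=n)
    x4 = (x3 > 0).astype(float)
    return np.column_stack([x1, x2, x3, x4])

def sampling_linear_part(X):
    """Linear predictor of the true sampling score, without beta0."""
    return -0.8 * X[:, 0] + 0.3 * X[:, 1] + X[:, 2] - 0.6 * X[:, 3]

def calibrate_beta0(target, rng, n_mc=200_000):
    """Bisection on beta0 so that the mean sampling score is `target`."""
    eta = sampling_linear_part(draw_covariates(n_mc, rng))
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if expit(mid + eta).mean() < target:
            lo = mid  # mean score too small: increase beta0
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate_population(n, beta0, rng):
    """Draw covariates, participation indicators and potential outcomes."""
    X = draw_covariates(n, rng)
    p = expit(beta0 + sampling_linear_part(X))
    S = rng.binomial(1, p)
    mean1 = 1 + 2 * X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2] + 2 * X[:, 3]
    mean0 = -1 - 2 * X[:, 0] + X[:, 1] - 2 * X[:, 2] + X[:, 3]
    Y1 = rng.normal(mean1, 1.0)
    Y0 = rng.normal(mean0, 1.0)
    return X, p, S, Y1, Y0
```

With a large simulated population, `(Y1 - Y0).mean()` approximates μ and `(Y1 - Y0)[S == 0].mean()` approximates τ. For this design the outcome means imply μ = 2 + E[X₂] + E[X₄] = 3.5, which offers a simple check on the Monte Carlo value.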

The true values of μ and τ are estimated using Monte Carlo integration. For each estimator, we use the first letter to label whether the fitted sampling score model is correct (C) or misspecified (M). The misspecified model omits X₂, X₄ in the logistic model. We similarly use a second letter to label whether the outcome model is correct (C) or misspecified (M), with the misspecified outcome model omitting X₂, X₄ (for the empirical likelihood estimators, the misspecification refers to omitting X₂, X₄ in the covariate-balancing constraints). For simplicity, the treatment propensity score is estimated based on X₁ and X₃ for all methods. In theory, the influence functions in Theorems 3.1 and 3.2 could suggest a plug-in variance estimator for $\hat{μ}$ and $\hat{τ}$ . However, as one does not know whether the sampling score model or the implied outcome model is correct, all terms in the influence functions must be calculated even though certain terms are actually zero. Because the implementation of the plug-in variance is quite complex and would propagate a substantial amount of estimation error, we bootstrapped 200 samples to estimate variance and construct percentile-based confidence intervals. For the estimation of μ, we report bias, root mean squared error (RMSE) and 95% coverage of all estimators over 1000 Monte Carlo replications in Table 1. Corresponding results for the estimation of τ can be found in Table 2.
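The bootstrap step can be sketched generically as below. The function name `percentile_bootstrap_ci` is hypothetical, and the sketch assumes that whole rows of the stacked data (trial participants and non-participants together) are resampled with replacement; the text does not specify whether resampling was stratified by trial participation, so this is one possible reading.

```python
import numpy as np

def percentile_bootstrap_ci(data, estimator, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap interval and variance for a scalar estimator.

    data      : 2-d array, one row per unit
    estimator : callable mapping such an array to a scalar estimate
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        stats[b] = estimator(data[idx])
    alpha = 1.0 - level
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper, stats.var(ddof=1)
```

Any of the estimators considered here could be passed as `estimator`, provided it refits the sampling score and outcome models on each resampled data set.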

Table 1.

Summary of results in the first simulation study based on 1000 Monte Carlo replications for estimating μ; all values have been multiplied by 100.

	n = 1000			n = 5000			n = 10,000
	Bias	RMSE	Coverage	Bias	RMSE	Coverage	Bias	RMSE	Coverage
${\hat{μ}}_{AIPW} (CC)$	4	26	94	4	20	94	3	19	94
${\hat{μ}}_{AIPW} (CM)$	3	35	95	4	54	93	2	66	93
${\hat{μ}}_{AIPW} (MC)$	4	26	95	4	33	96	1	59	96
${\hat{μ}}_{AIPW} (MM)$	46	98	89	284	805	85	553	2284	83
$\hat{μ} (CC)$	4	26	94	4	22	93	4	21	95
$\hat{μ} (CM)$	4	29	95	3	30	93	2	32	94
$\hat{μ} (MC)$	4	26	95	4	23	95	4	23	95
$\hat{μ} (MM)$	23	44	92	21	41	89	15	38	93
$\tilde{μ} (CC)$	4	26	94	4	22	93	4	21	95
$\tilde{μ} (CM)$	4	29	95	3	30	94	2	32	94
$\tilde{μ} (MC)$	4	26	95	4	23	94	4	23	95
$\tilde{μ} (MM)$	23	44	92	21	41	89	15	38	92


Table 2.

Summary of results in the first simulation study based on 1000 Monte Carlo replications for estimating τ; all values have been multiplied by 100.

	n = 1000			n = 5000			n = 10,000
	Bias	RMSE	Coverage	Bias	RMSE	Coverage	Bias	RMSE	Coverage
${\hat{τ}}_{AIPW 1} (CC)$	6	35	94	2	21	94	3	20	94
${\hat{τ}}_{AIPW 1} (CM)$	5	47	95	3	59	94	2	69	93
${\hat{τ}}_{AIPW 1} (MC)$	6	36	94	3	36	96	1	62	96
${\hat{τ}}_{AIPW 1} (MM)$	90	183	87	314	893	84	581	2401	83
${\hat{τ}}_{AIPW 2} (CC)$	6	35	94	2	21	94	3	20	94
${\hat{τ}}_{AIPW 2} (CM)$	5	47	95	3	59	94	2	69	93
${\hat{τ}}_{AIPW 2} (MC)$	21	41	91	6	37	95	3	62	95
${\hat{τ}}_{AIPW 2} (MM)$	90	183	87	314	893	84	581	2401	83
$\hat{τ} (CC)$	6	35	95	2	23	94	3	22	95
$\hat{τ} (CM)$	6	38	95	1	32	94	2	34	94
$\hat{τ} (MC)$	6	37	95	2	25	93	4	24	94
$\hat{τ} (MM)$	38	58	88	20	41	90	14	39	93
$\tilde{τ} (CC)$	6	35	95	2	23	93	3	22	94
$\tilde{τ} (CM)$	6	38	95	1	32	94	2	34	95
$\tilde{τ} (MC)$	21	41	92	6	25	93	6	24	94
$\tilde{τ} (MM)$	38	58	88	20	41	90	14	39	93


Overall, the simulation results demonstrate the double robustness properties of both the AIPW and the empirical likelihood estimators. While the empirical likelihood and the AIPW estimators perform similarly when all models are correct, the empirical likelihood estimators provide notably greater efficiency when the outcome model is misspecified. For instance, the RMSE of ${\hat{μ}}_{AIPW}$ is about 60 times that of $\hat{μ}$ or $\tilde{μ}$ when all models are misspecified and n = 10,000. Moreover, when at least one model is misspecified, the variability of the AIPW estimators increases sharply with decreasing proportion of trial participation, whereas the variability of the empirical likelihood estimators remains relatively stable across scenarios. For the estimation of the population average treatment effect μ, the two empirical likelihood estimators, $\hat{μ}$ and $\tilde{μ}$ , have almost identical performance, matching our theoretical discussion in Section 3.1.
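The RMSE comparison under joint misspecification can be read directly off the (MM) rows of Table 1. The short computation below simply takes ratios of the tabulated values; the dictionary layout is only for illustration.

```python
# RMSE (x100) from the (MM) rows of Table 1, by population size n.
rmse_mm = {
    1000: {"AIPW": 98, "EL": 44},
    5000: {"AIPW": 805, "EL": 41},
    10000: {"AIPW": 2284, "EL": 38},
}

# Ratio of the AIPW RMSE to the empirical likelihood RMSE at each n.
ratios = {n: row["AIPW"] / row["EL"] for n, row in rmse_mm.items()}
for n, ratio in ratios.items():
    print(f"n = {n}: RMSE ratio = {ratio:.1f}")
```

The ratio grows as the proportion of trial participation shrinks, which is the pattern described in the text.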

For the estimation of τ, when all models are correct, there seems to be negligible difference between the variances of $\hat{τ}$ and $\tilde{τ}$ in finite samples, even though the latter theoretically achieves a lower efficiency bound. This observation suggests that the marginal information content of the sampling score is usually a small positive quantity relative to the population size n. For this reason, $\hat{τ}$ appears preferable to $\tilde{τ}$ because $\hat{τ}$ is doubly robust while $\tilde{τ}$ is only singly robust.

4.2. Simulation study II

To further compare the empirical likelihood estimators with the AIPW estimators, we consider an additional data generating mechanism with more covariates and an even higher degree of selection bias, leading to a high degree of positivity violation. We consider seven covariates generated from X₁ ~ N(0, 1), $X_{2} = X_{1}^{2}$ , X₃ ~ unif(−2, 2), X₄ = I(X₃ > 0), X₅ ~ N(0, 1), X₆ ~ unif(−1, 1) and X₇ = X₅X₆. The true sampling score model is given by

p (X) = \frac{exp (β_{0} - X_{1} + 0.6 X_{2} + X_{3} - X_{4} + X_{5} - X_{6} - 0.6 X_{7})}{1 + exp (β_{0} - X_{1} + 0.6 X_{2} + X_{3} - X_{4} + X_{5} - X_{6} - 0.6 X_{7})} .

Similar to the first simulation study, we fix the trial sample size n₁ = 500, and consider three values of β₀ so that the proportion of the trial sample is 50%, 10% and 5%, corresponding to target population sizes n = 1000, n = 5000 and n = 10,000. Overall, the data generating processes represent a very strong selection effect, giving rise to extremely small sampling scores p(X) for a substantial proportion of the population. For example, 28% and 55% of the non-participants have a true sampling score p(X) smaller than 0.01 when n = 5000 and n = 10,000, respectively. The distributions of the true sampling scores in the population are visualized in Figure 2, which clearly shows separation between the trial participants and non-participants.

Figure 2. — Distributions of true sampling scores in the population when the proportion of trial participation is (a) 50%, (b) 10%, and (c) 5%, in the second simulation study with high degree of positivity violation. The trial sample size is fixed at n₁ = 500, while the population size corresponding to each panel is (a) n = 1000, (b) n = 5000, and (c) n = 10,000.
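The degree of positivity violation in this design can be explored numerically with a sketch such as the following. The function names are hypothetical, β₀ is calibrated by bisection on a Monte Carlo sample (an assumption about how the target proportions might be matched), and the share of non-participants with p(X) below a cutoff is computed as E[(1 − p(X)) I(p(X) < c)] / E[1 − p(X)]. The sketch is not claimed to reproduce the reported 28% and 55% exactly.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_part_study2(n, rng):
    """Linear predictor of the second sampling score model, without beta0."""
    x1 = rng.normal(size=n)
    x2 = x1 ** 2
    x3 = rng.uniform(-2.0, 2.0, size=n)
    x4 = (x3 > 0).astype(float)
    x5 = rng.normal(size=n)
    x6 = rng.uniform(-1.0, 1.0, size=n)
    x7 = x5 * x6
    return -x1 + 0.6 * x2 + x3 - x4 + x5 - x6 - 0.6 * x7

def share_small_scores(target, rng, cutoff=0.01, n_mc=400_000):
    """Return (beta0, share of non-participants with p(X) < cutoff)."""
    eta = linear_part_study2(n_mc, rng)
    lo, hi = -30.0, 30.0
    for _ in range(60):  # bisection: mean sampling score equals target
        mid = 0.5 * (lo + hi)
        if expit(mid + eta).mean() < target:
            lo = mid
        else:
            hi = mid
    beta0 = 0.5 * (lo + hi)
    p = expit(beta0 + eta)
    # Weight each unit by its probability of being a non-participant.
    share = np.sum((1 - p) * (p < cutoff)) / np.sum(1 - p)
    return beta0, share
```

Calling `share_small_scores(0.10, rng)` and `share_small_scores(0.05, rng)` gives the shares for the 10% and 5% participation scenarios.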

Correspondingly, the potential outcomes of each unit are generated from

Y^{1} ~ N (1 + 2 X_{1} + 2 X_{2} + 3 X_{3} + 2 X_{4} + X_{5} + X_{6} + 2 X_{7}, 1)

Y^{0} ~ N (- 1 - 2 X_{1} + X_{2} - 2 X_{3} + X_{4} + X_{5} - X_{6} - X_{7}, 1) .

The true values of μ and τ are estimated using Monte Carlo integration. For each estimator, we also use the first letter to label whether the fitted sampling score model is correct (C) or misspecified (M). The misspecified model omits X₂, X₄, and X₇ in the logistic model. We similarly use a second letter to label whether the outcome model is correct (C) or misspecified (M), with the misspecified outcome model omitting X₂, X₄, and X₇ in the covariate-balancing constraints. For simplicity, the treatment propensity score is estimated based on X₁, X₃, X₄, and X₅ for all methods. For the estimation of μ and τ, we report bias, root mean squared error (RMSE) and 95% coverage of all estimators over 1000 Monte Carlo replications in Tables 3 and 4.
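The three performance measures reported in the tables can be computed from the Monte Carlo output as sketched below. The function name `summarize` and the array layout are hypothetical; the only inputs assumed are the point estimates and interval endpoints from each replication and the true value of the estimand.

```python
import numpy as np

def summarize(estimates, ci_lower, ci_upper, truth):
    """Bias, RMSE and interval coverage, each multiplied by 100."""
    estimates = np.asarray(estimates, dtype=float)
    ci_lower = np.asarray(ci_lower, dtype=float)
    ci_upper = np.asarray(ci_upper, dtype=float)
    bias = estimates.mean() - truth
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    # Proportion of replications whose interval contains the truth.
    coverage = np.mean((ci_lower <= truth) & (truth <= ci_upper))
    return {"bias": 100 * bias, "rmse": 100 * rmse, "coverage": 100 * coverage}
```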

Table 3.

Summary of results in the second simulation study with high degree of positivity violation based on 1000 Monte Carlo replications for estimating μ; all values have been multiplied by 100.

	n = 1000			n = 5000			n = 10,000
	Bias	RMSE	Coverage	Bias	RMSE	Coverage	Bias	RMSE	Coverage
${\hat{μ}}_{AIPW} (CC)$	3	28	92	4	31	92	5	41	91
${\hat{μ}}_{AIPW} (CM)$	5	42	93	8	111	91	1	165	91
${\hat{μ}}_{AIPW} (MC)$	3	29	93	−8	549	94	−74	3482	94
${\hat{μ}}_{AIPW} (MM)$	58	171	89	2651	14,295	80	18,648	133,237	81
$\hat{μ} (CC)$	3	27	93	2	49	98	−7	82	93
$\hat{μ} (CM)$	3	33	93	2	63	91	8	82	93
$\hat{μ} (MC)$	3	27	92	2	54	95	−8	89	92
$\hat{μ} (MM)$	23	51	90	33	99	88	23	119	92
$\tilde{μ} (CC)$	3	27	93	2	49	96	−7	82	93
$\tilde{μ} (CM)$	3	33	93	2	63	91	8	82	93
$\tilde{μ} (MC)$	3	28	92	2	54	95	−8	89	92
$\tilde{μ} (MM)$	23	51	90	33	99	88	23	119	92


Table 4.

Summary of results in the second simulation study with high degree of positivity violation based on 1000 Monte Carlo replications for estimating τ; all values have been multiplied by 100.

	n = 1000			n = 5000			n = 10,000
	Bias	RMSE	Coverage	Bias	RMSE	Coverage	Bias	RMSE	Coverage
${\hat{τ}}_{AIPW 1} (CC)$	0	39	93	3	34	92	4	42	91
${\hat{τ}}_{AIPW 1} (CM)$	3	64	92	8	122	91	0	174	91
${\hat{τ}}_{AIPW 1} (MC)$	−1	43	92	−10	608	94	−78	3661	94
${\hat{τ}}_{AIPW 1} (MM)$	110	333	88	2940	15,853	81	19,631	140,198	81
${\hat{τ}}_{AIPW 2} (CC)$	0	39	93	3	34	92	4	42	91
${\hat{τ}}_{AIPW 2} (CM)$	3	64	92	8	122	91	0	174	91
${\hat{τ}}_{AIPW 2} (MC)$	8	43	92	−7	608	94	−76	3661	94
${\hat{τ}}_{AIPW 2} (MM)$	110	333	88	2940	15,853	81	19,631	140,198	81
$\hat{τ} (CC)$	0	40	93	2	47	95	−7	80	93
$\hat{τ} (CM)$	−4	48	92	1	71	92	8	87	93
$\hat{τ} (MC)$	0	44	94	1	55	95	−9	86	91
$\hat{τ} (MM)$	34	71	88	33	107	89	22	124	92
$\tilde{τ} (CC)$	0	40	93	2	47	95	−7	80	93
$\tilde{τ} (CM)$	−4	48	92	1	71	92	8	87	93
$\tilde{τ} (MC)$	9	41	92	6	50	95	−5	79	93
$\tilde{τ} (MM)$	34	71	88	33	107	89	22	124	92


The simulation results under a high degree of positivity violation are largely consistent with those in Section 4.1. When n = 1000 and n = 5000 and both models are correctly specified, the empirical likelihood estimators have slightly larger RMSE compared to the AIPW estimators. However, when at least one model is misspecified, the empirical likelihood estimators demonstrate substantially smaller bias and RMSE, and closer to nominal coverage, for estimating both μ and τ, compared to the AIPW estimators. As a typical example, when the outcome model is misspecified, both empirical likelihood estimators $\hat{μ}$ and $\tilde{μ}$ are 10 times more efficient than ${\hat{μ}}_{AIPW}$ . When both models are misspecified, $\hat{μ}$ and $\tilde{μ}$ are then 142 times more efficient than ${\hat{μ}}_{AIPW}$ . In the most challenging scenario with n = 10,000 and severe positivity violation (over half of the non-participants have true sampling score < 0.01), the robustness of the empirical likelihood estimators appears more evident under model misspecification, even though they can be slightly less efficient compared to the AIPW estimators in the absence of model misspecification. Finally, this second simulation also indicates that there is generally little difference between $\hat{μ}$ and $\tilde{μ}$ , and between $\hat{τ}$ and $\tilde{τ}$ in finite samples, a finding that is consistent with that shown in Tables 1 and 2.

5. Discussion

The asymptotic results on efficiency bounds in this note could be extended to generalizing results in randomized trials with multiple arms, where the treatment variable A_i ∈ {0, 1, …, J}. Extending the steps outlined in Cattaneo (2010), one could obtain the asymptotic covariance bounds for estimating μ = (μ₀, μ₁, …, μ_J)^T and τ = (τ₀, τ₁, …, τ_J)^T, where $μ_{j} = E [Y_{i}^{j}]$ and $τ_{j} = E [Y_{i}^{j} ∣ S_{i} = 0]$ . By replacing the estimated propensity scores with the estimated generalized propensity scores (Imbens 2000) in the balancing constraints, one could also derive the corresponding empirical likelihood estimators to provide semiparametric efficient inference for μ and τ. On the other hand, the sharp covariate-balancing constraints imposed by the empirical likelihood approach may come with a cost. In practice where the number of covariates is large and the sample size is small, the search algorithm may fail to identify a solution that ensures exact balance of the effect modifiers. Our future work will include relaxing the exact covariate-balancing conditions to tolerate a small amount of imbalance, such as along the lines of Fong, Hazlett, and Imai (2018) and Wang and Zubizarreta (2020).

Acknowledgement

The authors thank Can Meng for computational assistance with the simulations in Section 4.2, and are grateful to the Editor and anonymous reviewer for providing constructive comments, which improved the exposition of this work.

Funding

The research of Fan Li is partially supported by CTSA Grant Number UL1 TR001863 from the National Center for Advancing Translational Science (NCATS), a component of the National Institutes of Health (NIH). The statements presented in this article are solely the responsibility of the authors and do not necessarily represent the views of the National Institutes of Health.

Appendix A. Proof of Theorem 2.3

The calculation of efficiency bounds follows the general approach outlined in Bickel et al. (1993) and Hahn (1998) and includes the following three steps: characterization of the tangent space, verification of pathwise differentiability of the parameter and computation of the asymptotic variance bound. By Assumptions 1 and 2, the density of the observed data (Y, S, A, X) is

f (y, s, a, x) = {[{e (x) f_{1} (y ∣ x)}^{a} {(1 - e (x)) f_{0} (y ∣ x)}^{1 - a} p (x)]}^{s} {[1 - p (x)]}^{1 - s} f (x),

where f₁(y|x) = dF₁(y|x) and f₀(y|x) = dF₀(y|x) are densities of Y¹ and Y⁰ given X = x.

Consider a regular parametric submodel for the joint distribution of (Y, S, A, X)

{[{e (x) f_{1} (y ∣ x; θ)}^{a} {(1 - e (x)) f_{0} (y ∣ x; θ)}^{1 - a} p (x; θ)]}^{s} {[1 - p (x; θ)]}^{1 - s} f (x; θ)

which equals f(y, s, a, x) when θ = θ₀, the true value. Define the score to be H(y, s, a, x; θ₀) = H_y(y, s, a, x) + H_p(s, x) + H_x(x), where

H_{y} (y, s, a, x) = s a h_{1} (y, x) + s (1 - a) h_{0} (y, x), h_{a} (y, x) = \frac{\partial}{\partial θ} log f_{a} (y ∣ x; θ_{0}),

H_{p} (s, x) = \frac{s - p_{0} (x)}{p_{0} (x) (1 - p_{0} (x))} {\dot{p}}_{0} (x), p_{0} (x) = p (x; θ_{0}), {\dot{p}}_{0} (x) = \frac{\partial}{\partial θ} p (x; θ_{0}),

H_{x} (x) = \frac{\partial}{\partial θ} log f (x; θ_{0}) .

Let 𝓛²(F) be the usual Hilbert space of mean-zero and square-integrable measurable functions with respect to distribution F. The tangent space is then characterized by 𝓗_y + 𝓗_p + 𝓗_x, with 𝓗_y = {H_y(Y, S, A, X) : h_a(Y, X) ∈ 𝓛²(F_a), a = 0, 1}, 𝓗_p = {H_p(S, X) : H_p(S, X) ∈ 𝓛²(F_S|X)} and 𝓗_x = {H_x(X) : H_x(X) ∈ 𝓛² (F_X)}.

The moment function for μ based on the potential outcomes is m(Y¹, Y⁰; μ) = m(μ) = (Y¹ − Y⁰) – μ, where $E [m (μ (θ_{0}))] = 0$ . Interchanging differentiation and integration, we have $0 = \frac{\partial}{\partial θ} E [m (μ (θ_{0}))] = \frac{\partial}{\partial μ} E [m (μ)] \frac{\partial}{\partial θ} μ (θ_{0}) + \frac{\partial}{\partial θ} E [m (μ)]$ . Since $\frac{\partial}{\partial μ} E [m (μ)] = - 1$ , we obtain

\frac{\partial}{\partial θ} μ (θ_{0}) = {\frac{\partial}{\partial θ} E [m (μ)] |}_{θ = θ_{0}} = \int (y - μ_{1}) f_{1} (y ∣ x; θ_{0}) f (x; θ_{0}) [h_{1} (y, x) + H_{x} (x)] dydx - \int (y - μ_{0}) f_{0} (y ∣ x; θ_{0}) f (x; θ_{0}) [h_{0} (y, x) + H_{x} (x)] dydx = E [(Y^{1} - μ_{1}) h_{1} (Y^{1}, X)] - E [(Y^{0} - μ_{0}) h_{0} (Y^{0}, X)] + E [(μ (X) - μ) H_{x} (X)] .

(A1)

The mapping μ(θ) is pathwise differentiable if there exists a function Ψ_μ(Y, S, A, X) ∈ 𝓗 such that $\frac{\partial}{\partial θ} μ (θ_{0}) = E [Ψ_{μ} (Y, S, A, X) H (Y, S, A, X; θ_{0})]$ for all regular parametric submodels. We can verify that the efficient score

Ψ_{μ} (Y, S, A, X) = \frac{S A}{p (X) e (X)} (Y - μ_{1} (X)) - \frac{S (1 - A)}{p (X) (1 - e (X))} (Y - μ_{0} (X)) + μ (X) - μ

(A2)

is an element in 𝓗 and satisfies the above condition, because

E [\frac{S A}{p (X) e (X)} (Y - μ_{1} (X)) H (Y, S, A, X; θ_{0})] = E [(Y^{1} - μ_{1} (X)) h_{1} (Y^{1}, X)],

E [\frac{S (1 - A)}{p (X) (1 - e (X))} (Y - μ_{0} (X)) H (Y, S, A, X; θ_{0})] = E [(Y^{0} - μ_{0} (X)) h_{0} (Y^{0}, X)],

E [(μ (X) - μ) H (Y, S, A, X; θ_{0})] = E [(μ (X) - μ) H_{x} (X)],

and $E [(μ_{1} (X) - μ_{1}) h_{1} (Y^{1}, X)] = E [(μ_{0} (X) - μ_{0}) h_{0} (Y^{0}, X)] = 0$ .
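A quick numerical check of (A2) is possible in a toy design where all nuisance functions are known. The sketch below is illustrative only: the design (a single covariate, a logistic sampling score, e(X) = 0.5 and unit-variance normal outcomes) and the function name `check_efficient_score` are assumptions made for the example. It verifies by Monte Carlo that Ψ_μ has mean zero and that its variance agrees with the expression obtained by expanding the square of (A2), namely E[Var(Y¹ ∣ X) / (p(X) e(X)) + Var(Y⁰ ∣ X) / (p(X)(1 − e(X))) + (μ(X) − μ)²], which should coincide with Σ_μ.

```python
import numpy as np

def check_efficient_score(n=400_000, seed=2021):
    """Monte Carlo mean and variance of Psi_mu in a toy design."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(0.5 - 0.5 * X))  # assumed sampling score
    e = 0.5                                   # known propensity score
    S = rng.binomial(1, p)
    A = rng.binomial(1, e, size=n)
    mu1_x = 1.0 + X                           # E[Y^1 | X]
    mu0_x = -X                                # E[Y^0 | X]
    Y1 = rng.normal(mu1_x, 1.0)
    Y0 = rng.normal(mu0_x, 1.0)
    Y = A * Y1 + (1 - A) * Y0                 # only used when S == 1
    mu_x = mu1_x - mu0_x
    mu = 1.0                                  # E[1 + 2X] with X ~ N(0, 1)

    psi = (S * A / (p * e) * (Y - mu1_x)
           - S * (1 - A) / (p * (1 - e)) * (Y - mu0_x)
           + mu_x - mu)
    # Expansion of E[psi^2]; conditional outcome variances equal 1 here.
    bound = np.mean(1.0 / (p * e) + 1.0 / (p * (1 - e)) + (mu_x - mu) ** 2)
    return psi.mean(), psi.var(), bound

mean_psi, var_psi, bound = check_efficient_score()
print(mean_psi, var_psi, bound)
```

The cross terms vanish in the expansion because the two residual terms have conditional mean zero given (S, A, X) and cannot both be nonzero for the same unit.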

The moment function for τ is m(Y¹, Y⁰; τ) = m(τ) = (Y¹ - Y⁰) – τ, where $E [m (τ (θ_{0})) ∣ S = 0] = 0$ . Differentiating under the integral sign, we have

\frac{\partial}{\partial θ} τ (θ_{0}) = {\frac{\partial}{\partial θ} E [m (τ) ∣ S = 0] |}_{θ = θ_{0}} = \frac{1}{1 - π} \int (y - τ_{1}) f_{1} (y ∣ x; θ_{0}) (1 - p (x; θ_{0})) f (x; θ_{0}) [h_{1} (y, x) - \frac{{\dot{p}}_{0} (x)}{1 - p_{0} (x)} + H_{x} (x)] dydx - \frac{1}{1 - π} \int (y - τ_{0}) f_{0} (y ∣ x; θ_{0}) (1 - p (x; θ_{0})) f (x; θ_{0}) [h_{0} (y, x) - \frac{{\dot{p}}_{0} (x)}{1 - p_{0} (x)} + H_{x} (x)] dydx = E [(Y^{1} - τ_{1}) h_{1} (Y^{1}, X) ∣ S = 0] - E [(Y^{0} - τ_{0}) h_{0} (Y^{0}, X) ∣ S = 0] - E [(τ (X) - τ) \frac{{\dot{p}}_{0} (X)}{1 - p_{0} (X)} ∣ S = 0] + E [(τ (X) - τ) H_{x} (X) ∣ S = 0] .

(A3)

The mapping τ(θ) is pathwise differentiable if there exists a function Ψ_τ(Y, S, A, X) ∈ 𝓗 such that $\frac{\partial}{\partial θ} τ (θ_{0}) = E [Ψ_{τ} (Y, S, A, X) H (Y, S, A, X; θ_{0})]$ for all regular parametric submodels. We can verify that the function

Ψ_{τ} (Y, S, A, X) = \frac{1 - p (X)}{1 - π} (\frac{S A}{p (X) e (X)} (Y - τ_{1} (X)) - \frac{S (1 - A)}{p (X) (1 - e (X))} (Y - τ_{0} (X)) + \frac{1 - S}{1 - p (X)} (τ (X) - τ))

(A4)

is an element in 𝓗 and satisfies the above condition, because

E [\frac{1 - p (X)}{1 - π} \frac{S A}{p (X) e (X)} (Y - τ_{1} (X)) H (Y, S, A, X; θ_{0})] = E [(Y^{1} - τ_{1} (X)) h_{1} (Y^{1}, X) ∣ S = 0]

(A5)

E [\frac{1 - p (X)}{1 - π} \frac{S (1 - A)}{p (X) (1 - e (X))} (Y - τ_{0} (X)) H (Y, S, A, X; θ_{0})] = E [(Y^{0} - τ_{0} (X)) h_{0} (Y^{0}, X) ∣ S = 0]

(A6)

E [\frac{1 - S}{1 - π} (τ (X) - τ) H (Y, S, A, X; θ_{0})] = - E [(τ (X) - τ) \frac{{\dot{p}}_{0} (X)}{1 - p_{0} (X)} ∣ S = 0] + E [(τ (X) - τ) H_{x} (X) ∣ S = 0],

and $E [(τ_{1} (X) - τ_{1}) h_{1} (Y^{1}, X) ∣ S = 0] = E [(τ_{0} (X) - τ_{0}) h_{0} (Y^{0}, X) ∣ S = 0] = 0$ . By Theorem 2 in Section 3.3 of Bickel et al. (1993), the asymptotic variance bounds are the expected square of the projection of Ψ_μ and Ψ_τ onto 𝓗. But Ψ_μ, Ψ_τ ∈ 𝓗, and therefore $Σ_{μ} = E [Ψ_{μ}^{2} (Y, S, A, X)]$ , $Σ_{τ} = E [Ψ_{τ}^{2} (Y, S, A, X)]$ . The explicit forms of Σ_μ and Σ_τ are given in Theorem 2.3.

The above calculation assumes the knowledge of the propensity score e(X), but the results remain virtually the same even if e(X) is unknown, as when the generalization is based on an observational study. To see why, notice that with e(X) unknown, the score becomes H(y, s, a, x; θ₀) = H_y(y, s, a, x) + H_e(s, a, x) + H_p(s, x) + H_x(x), with

H_{e} (s, a, x) = \frac{s (a - e_{0} (x)) {\dot{e}}_{0} (x)}{e_{0} (x) (1 - e_{0} (x))}, e_{0} (x) = e (x; θ_{0}), {\dot{e}}_{0} (x) = \frac{\partial}{\partial θ} e (x; θ_{0}),

and the tangent space is characterized by 𝓗_y + 𝓗_e + 𝓗_p + 𝓗_x, where 𝓗_e = {H_e(S, A, X) : H_e(S, A, X) ∈ 𝓛²(F_{A|X, S=1})}. This change in tangent space, however, does not affect Equations (A1) and (A3). Further because $E [Ψ_{μ} (Y, S, A, X) H_{e} (S, A, X)] = E [Ψ_{τ} (Y, S, A, X) H_{e} (S, A, X)] = 0$ , we can show that the efficiency bounds for estimating μ and τ are still Σ_μ and Σ_τ. This proves that e(X) is ancillary for the estimation of μ and τ, when p(X) is unknown.

Appendix B. Proof of Theorem 2.4

If the sampling score p(X) is known, then the score becomes H(y, s, a, x; θ₀) = H_y(y, s, a, x) + H_x(x), and the tangent space is characterized by 𝓗_y + 𝓗_x. However, one can verify that this change in tangent space does not change Equation (A1). Further, μ(θ) is pathwise differentiable since $\frac{\partial}{\partial θ} μ (θ_{0}) = E [Ψ_{μ} (Y, S, A, X) H (Y, S, A, X; θ_{0})]$ for all regular parametric submodels. This indicates that the efficiency bound for estimation of μ is the same as that when p(X) is unknown. In other words, p(X) is ancillary for the estimation of μ.

For estimation of τ, we obtain

\frac{\partial}{\partial θ} τ (θ_{0}) = \frac{\partial}{\partial θ} E [m (τ) ∣ S = 0] = \frac{1}{1 - π} \int (y - τ_{1}) f_{1} (y ∣ x; θ_{0}) (1 - p (x)) f (x; θ_{0}) [h_{1} (y, x) + H_{x} (x)] dydx - \frac{1}{1 - π} \int (y - τ_{0}) f_{0} (y ∣ x; θ_{0}) (1 - p (x)) f (x; θ_{0}) [h_{0} (y, x) + H_{x} (x)] dydx = E [(Y^{1} - τ_{1}) h_{1} (Y^{1}, X) ∣ S = 0] - E [(Y^{0} - τ_{0}) h_{0} (Y^{0}, X) ∣ S = 0] + E [(τ (X) - τ) H_{x} (X) ∣ S = 0] .

Define

{\tilde{Ψ}}_{τ} (Y, S, A, X) = \frac{1 - p (X)}{1 - π} (\frac{S A}{p (X) e (X)} (Y - τ_{1} (X)) - \frac{S (1 - A)}{p (X) (1 - e (X))} (Y - τ_{0} (X)) + τ (X) - τ),

(B1)

which is an element in the tangent space 𝓗. We can show that for all regular parametric submodels, $\frac{\partial}{\partial θ} τ (θ_{0}) = E [{\tilde{Ψ}}_{τ} (Y, S, A, X) H (Y, S, A, X; θ_{0})]$ , because of Equations (A5) and (A6), and

E [\frac{1 - p (X)}{1 - π} (τ (X) - τ) H (Y, S, A, X; θ_{0})] = E [(τ (X) - τ) H_{x} (X) ∣ S = 0] .

This verifies that τ(θ) is pathwise differentiable and therefore ${\tilde{Σ}}_{τ} = E [{\tilde{Ψ}}_{τ}^{2} (Y, S, A, X)]$ , whose explicit expression is given in Theorem 2.4. This also shows that p(X) is not ancillary for the estimation of τ.
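The gap between the two bounds for τ can be made explicit by comparing (A4) with (B1). The short derivation below is a sketch based only on those two displays; the symbol Δ is introduced here for convenience.

```latex
\begin{aligned}
\Delta &:= \Psi_{\tau} - \tilde{\Psi}_{\tau}
        = \frac{(1 - S) - (1 - p(X))}{1 - \pi}\,(\tau(X) - \tau)
        = \frac{p(X) - S}{1 - \pi}\,(\tau(X) - \tau), \\
E[\Delta \mid X] &= 0, \qquad
E[\tilde{\Psi}_{\tau}\,\Delta] = 0, \\
\Sigma_{\tau} &= E[(\tilde{\Psi}_{\tau} + \Delta)^{2}]
        = \tilde{\Sigma}_{\tau}
        + \frac{E\left[p(X)\,(1 - p(X))\,(\tau(X) - \tau)^{2}\right]}{(1 - \pi)^{2}}
        \;\ge\; \tilde{\Sigma}_{\tau}.
\end{aligned}
```

The orthogonality in the second line holds because Δ depends only on (S, X): the residual terms of ${\tilde{Ψ}}_{τ}$ have conditional mean zero given (S, A, X), and its remaining term depends only on X, given which Δ has mean zero. The two bounds therefore coincide only when τ(X) = τ wherever p(X)(1 − p(X)) > 0, that is, when the effect among non-participants does not vary with the covariates.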

Finally, under the assumption that the sampling score p(X) is known, the ancillarity of e(X) to the estimation of μ and τ follows the same reasoning as in the last paragraph of Appendix A.

Appendix C. Proof of Theorem 3.1

Define $\hat{π} = n_{1} / n$ , $\hat{ω} = n_{11} / n_{1}$ , $\hat{a} = n^{- 1} \sum_{i \in 𝒫} a (X_{i})$ and $a = E [a (X_{i})]$ . Then $\hat{π} = π + o_{p} (1)$ , $\hat{ω} = ω + o_{p} (1)$ and $\hat{a} = a + o_{p} (1)$ . From Qin and Lawless (1994) and Owen (2001), the empirical probabilities for estimating μ₁ are written as

q_{i} = \frac{1}{n_{11}} \frac{1}{1 + {\hat{r}}_{1} {p (X_{i}; \hat{β}) e (X_{i}; \hat{α}) - \hat{π} \hat{ω}} + {\hat{r}}_{2}^{T} {a (X_{i}) - \hat{a}}}, i \in 𝒯

where $\hat{r} = {({\hat{r}}_{1}, {\hat{r}}_{2}^{T})}^{T}$ satisfy

\sum_{i \in 𝒯} \frac{p (X_{i}; \hat{β}) e (X_{i}; \hat{α}) - \hat{π} \hat{ω}}{1 + {\hat{r}}_{1} {p (X_{i}; \hat{β}) e (X_{i}; \hat{α}) - \hat{π} \hat{ω}} + {\hat{r}}_{2}^{T} {a (X_{i}) - \hat{a}}} = 0,

\sum_{i \in 𝒯} \frac{a (X_{i}) - \hat{a}}{1 + {\hat{r}}_{1} {p (X_{i}; \hat{β}) e (X_{i}; \hat{α}) - \hat{π} \hat{ω}} + {\hat{r}}_{2}^{T} {a (X_{i}) - \hat{a}}} = 0.
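Numerically, the Lagrange multipliers in the two estimating equations above can be found by a damped Newton iteration on the concave function r ↦ Σ log(1 + rᵀgᵢ), whose gradient is the left-hand side of those equations. The sketch below is a generic illustration and not necessarily the algorithm used for the reported results: the function name `el_weights` is hypothetical, and each row gᵢ of the input matrix is assumed to stack the constraint functions, here $p (X_{i}; \hat{β}) e (X_{i}; \hat{α}) - \hat{π} \hat{ω}$ and $a (X_{i}) - \hat{a}$ , for the units in 𝒯.

```python
import numpy as np

def el_weights(G, max_iter=100, tol=1e-9):
    """Empirical likelihood probabilities q_i = 1 / (m (1 + r'g_i)).

    G : (m, k) array whose rows are the constraint functions g_i.
    The multiplier r solves sum_i g_i / (1 + r'g_i) = 0.
    """
    m, k = G.shape
    r = np.zeros(k)
    for _ in range(max_iter):
        denom = 1.0 + G @ r
        grad = (G / denom[:, None]).sum(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        # Negative Hessian of sum_i log(1 + r'g_i); positive definite.
        hess = (G / denom[:, None] ** 2).T @ G
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Halve the step until every implied probability stays positive.
        while np.min(1.0 + G @ (r + t * step)) <= 1e-8:
            t *= 0.5
        r = r + t * step
    return 1.0 / (m * (1.0 + G @ r))
```

At the solution the probabilities sum to one and satisfy Σ qᵢ gᵢ = 0, so the reweighted sample reproduces the targeted moments exactly. The iteration presupposes that zero lies inside the convex hull of the gᵢ; otherwise no solution exists, which relates to the convergence difficulties that can arise with many covariates and small samples.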

Following the results in Qin (1993, Section 2), the Lagrange multiplier ${\hat{r}}_{1}$ converges in probability to a non-zero limit; a centralization step is therefore required to re-parameterize the above estimating equations. Define ${\hat{λ}}_{1} = \hat{π} \hat{ω} {\hat{r}}_{1} - 1 = n_{11} {\hat{r}}_{1} / n - 1$ , ${\hat{λ}}_{2} = \hat{π} \hat{ω} {\hat{r}}_{2} = n_{11} {\hat{r}}_{2} / n$ , and

h_{1} (X_{i}; β, α, π, ω, a) = \frac{1}{p (X_{i}; β) e (X_{i}; α)} (\begin{matrix} p (X_{i}; β) e (X_{i}; α) - π ω \\ a (X_{i}) - a \end{matrix}),

it follows that we could re-write the empirical probabilities as

q_{i} = \frac{1}{n_{11}} \frac{\hat{π} \hat{ω} / p (X_{i}; \hat{β}) e (X_{i}; \hat{α})}{1 + {\hat{λ}}^{T} h_{1} (X_{i}; \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a})}, i \in 𝒯

with the centralized $\hat{λ} = {({\hat{λ}}_{1}, {\hat{λ}}_{2}^{T})}^{T}$ converging in probability to zero and satisfying

Q (\hat{λ}, \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a}) = \sum_{i \in 𝒯} h_{1} (X_{i}; \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a}) / {1 + {\hat{λ}}^{T} h_{1} (X_{i}; \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a})} = 0.

We define β_*, α_* as the probability limits of $\hat{β}$ and $\hat{α}$ . According to the results of White (1982), β_* and α_* are the least favorable values minimizing the Kullback-Leibler distance between the distributions implied from the assumed models and those from the true data generating processes. When the propensity model and the sampling score model are both correct, the probability limits equal the true values, β_* = β₀ and α_* = α₀. Notice that in randomized trials, the propensity score is known and so α_* = α₀, while this may not be the case in observational studies. Here we allow the more general case where α_* can differ from α₀, of which α_* = α₀ is a special case. For the sampling score model, we have $\hat{β} - β_{*} = n^{- 1} \sum_{i = 1}^{n} I_{β_{*}, i} + o_{p} (n^{- 1 / 2})$ , where

I_{β_{*}, i} = {E [\frac{\dot{p} {(X_{i}; β_{*})}^{\otimes 2}}{p (X_{i}; β_{*}) (1 - p (X_{i}; β_{*}))}]}^{- 1} \frac{(S_{i} - p (X_{i}; β_{*})) \dot{p} (X_{i}; β_{*})}{p (X_{i}; β_{*}) (1 - p (X_{i}; β_{*}))},

$\dot{p} (X_{i}; β_{*}) = \partial p (X_{i}; β_{*}) / \partial β$ , and z^⊗2 = zz^T. Similarly for the propensity score model, $\hat{α} - α_{*} = n^{- 1} \sum_{i = 1}^{n} I_{α_{*}, i} + o_{p} (n^{- 1 / 2})$ , where

I_{α_{*}, i} = {E [S_{i} \frac{\dot{e} {(X_{i}; α_{*})}^{\otimes 2}}{e (X_{i}; α_{*}) (1 - e (X_{i}; α_{*}))}]}^{- 1} \frac{S_{i} (A_{i} - e (X_{i}; α_{*})) \dot{e} (X_{i}; α_{*})}{e (X_{i}; α_{*}) (1 - e (X_{i}; α_{*}))},

where $\dot{e} (X_{i}; α_{*}) = \partial e (X_{i}; α_{*}) / \partial α$ .
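For a logistic sampling score p(X; β) = expit(Xᵀβ), the derivative is ṗ(X; β) = p(1 − p)X, so the influence function above reduces to E[p(1 − p)XXᵀ]⁻¹(S − p)X. The sketch below evaluates its empirical analogue at the fitted value; the logistic form, the Newton-Raphson fitting routine and the function names are assumptions made for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, S, n_iter=25):
    """Maximum likelihood for logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        score = X.T @ (S - p)
        info = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, score)
    return beta

def influence_beta(X, S, beta):
    """Rows are the empirical influence functions I_{beta, i}."""
    p = expit(X @ beta)
    # Empirical analogue of E[p (1 - p) X X^T].
    info = (X * (p * (1 - p))[:, None]).T @ X / X.shape[0]
    return np.linalg.solve(info, (X * (S - p)[:, None]).T).T
```

The rows average to zero at the maximum likelihood estimate because the score equations hold there, and `infl.T @ infl / n**2` gives a variance estimate for the fitted coefficients under the same assumptions.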

Define $D_{1} = E [S_{i} A_{i} / (p (X_{i}; β_{*}) e (X_{i}; α_{*}))]$ , $G_{1} = E [S_{i} A_{i} h_{1} {(X_{i}; β_{*}, α_{*}, π, ω, a)}^{\otimes 2}]$ ,

F_{11} = E [(\begin{matrix} - π ω \\ a (X_{i}) - a \end{matrix}) \frac{- S_{i} A_{i} {\dot{p}}^{T} (X_{i}; β_{*})}{p^{2} (X_{i}; β_{*}) e (X_{i}; α_{*})}],

F_{12} = E [\frac{- S_{i} A_{i} {\dot{e}}^{T} (X_{i}; α_{*})}{p (X_{i}; β_{*}) e^{2} (X_{i}; α_{*})} (\begin{matrix} - π ω \\ a (X_{i}) - a \end{matrix})] .

We expand $Q (\hat{λ}, \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a})$ around the limiting values of its arguments as

0 = Q (0, β_{*}, α_{*}, π, ω, a) - n G_{1} \hat{λ} + n F_{11} (\hat{β} - β_{*}) + n F_{12} (\hat{α} - α_{*}) - n D_{1} (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) (\hat{π} \hat{ω} - π ω) - n D_{1} (\begin{matrix} 0 \\ 1_{k_{1} \times 1} \end{matrix}) (\hat{a} - a) + o_{p} (n^{\frac{1}{2}}) = - \sum_{i = 1}^{n} t_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) - n G_{1} \hat{λ} + n F_{11} (\hat{β} - β_{*}) + n F_{12} (\hat{α} - α_{*}) + o_{p} (n^{\frac{1}{2}}),

where

t_{1} (S_{i}, A_{i}, X_{i}; β, α, π, ω, a) = [D_{1} - \frac{S_{i} A_{i}}{p (X_{i}; β) e (X_{i}; α)}] (\begin{matrix} p (X_{i}; β) e (X_{i}; α) - π ω \\ a (X_{i}) - a \end{matrix}) + D_{1} {S_{i} A_{i} - p (X_{i}; β) e (X_{i}; α)} (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) .

Therefore

\hat{λ} = \frac{1}{n} \sum_{i = 1}^{n} G_{1}^{- 1} {- t_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) + F_{11} I_{β_{*}, i} + F_{12} I_{α_{*}, i}} + o_{p} (n^{- \frac{1}{2}}) .

(C1)

It follows that

{\hat{μ}}_{1} - μ_{1} = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{p (X_{i}; \hat{β}) e (X_{i}; \hat{α})} \frac{S_{i} A_{i}}{1 + {\hat{λ}}^{T} h_{1} (X_{i}; \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a})} (Y_{i} - μ_{1}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{S_{i} A_{i} (Y_{i} - μ_{1})}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} - B_{1}^{T} \hat{λ} + C_{11} (\hat{β} - β_{*}) + C_{12} (\hat{α} - α_{*}) + o_{p} (n^{- \frac{1}{2}})

(C2)

where

B_{1} = E [\frac{S_{i} A_{i}}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} (Y_{i}^{1} - μ_{1}) h_{1} (X_{i}; β_{*}, α_{*}, π, ω, a)]

C_{11} = E [\frac{- S_{i} A_{i} {\dot{p}}^{T} (X_{i}; β_{*})}{p^{2} (X_{i}; β_{*}) e (X_{i}; α_{*})} (Y_{i}^{1} - μ_{1})],

C_{12} = E [\frac{- S_{i} A_{i} {\dot{e}}^{T} (X_{i}; α_{*})}{p (X_{i}; β_{*}) e^{2} (X_{i}; α_{*})} (Y_{i}^{1} - μ_{1})] .

Inserting Equation (C1) into Equation (C2) leads to ${\hat{μ}}_{1} - μ_{1} = n^{- 1} \sum_{i = 1}^{n} Φ_{μ_{1}, i} + o_{p} (n^{- 1 / 2})$ , where

Φ_{μ_{1}, i} = \frac{S_{i} A_{i} (Y_{i} - μ_{1})}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} + B_{1}^{T} G_{1}^{- 1} t_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) + (C_{11} - B_{1}^{T} G_{1}^{- 1} F_{11}) I_{β_{*}, i} + (C_{12} - B_{1}^{T} G_{1}^{- 1} F_{12}) I_{α_{*}, i} .

We next show that ${\hat{μ}}_{1}$ is doubly robust. When models for p(X) and e(X) are correctly specified, β_* = β₀ and α_* = α₀ (the latter usually holds automatically in randomized trials), it is easy to see that $E [I_{β_{0}, i}] = E [I_{α_{0}, i}] = 0$ , and

E [\frac{S_{i} A_{i} (Y_{i} - μ_{1})}{p (X_{i}; β_{0}) e (X_{i}; α_{0})}] = E [\frac{E [S_{i} A_{i} ∣ X_{i}]}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} (μ_{1} (X_{i}) - μ_{1})] = 0

by the ignorability conditions in Assumptions 2.1 and 2.2. In addition, D₁ = 1 and so

t_{1} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, a) = - \frac{S_{i} A_{i} - p (X_{i}; β_{0}) e (X_{i}; α_{0})}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} (\begin{matrix} - π ω \\ a (X_{i}) - a \end{matrix}),

which has zero expectation. Therefore $E [Φ_{μ_{1}, i}] = 0$ and ${\hat{μ}}_{1} \overset{p}{\to} μ_{1}$ .

We now show consistency under a correct outcome model implied from the constraints. Write

M_{i} = \frac{S_{i} A_{i}}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} (Y_{i}^{1} - μ_{1}),

R_{i} = \frac{S_{i} A_{i}}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} g_{1} (X_{i}; β_{*}, α_{*}, π, ω, a),

where

g_{1} (X_{i}; β_{*}, α_{*}, π, ω, a) = (\begin{matrix} p (X_{i}; β_{*}) e (X_{i}; α_{*}) - π ω \\ a (X_{i}) - a \end{matrix}) .

Observe that $B_{1}^{T} = E [M_{i} R_{i}^{T}]$ , $G_{1} = E [R_{i}^{\otimes 2}]$ . When the “outcome model” is correct in the sense that $μ_{1} (X) = γ_{10} + \sum_{l = 1}^{k_{1}} γ_{1 l} a_{l} (X)$ , the conditional expectation, $E [Y_{i}^{1} - μ_{1} ∣ X_{i}] = μ_{1} (X_{i}) - μ_{1}$ , lies in the linear space spanned by {a(X_i) - a₀}. Because

E ([M_{i} - \frac{S_{i} A_{i}}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} E [Y_{i}^{1} - μ_{1} ∣ X_{i}]] [\frac{S_{i} A_{i}}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} L (X_{i})]) = 0

for all functions L(X_i) and g₁(X_i; β_*, α_*, π, ω, a) only involves X_i, we have

B_{1}^{T} G_{1}^{- 1} R_{i} = E [M_{i} R_{i}^{T}] {E [R_{i}^{\otimes 2}]}^{- 1} R_{i} = \frac{S_{i} A_{i}}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} (μ_{1} (X_{i}) - μ_{1}) .

Therefore $B_{1}^{T} G_{1}^{- 1} g_{1} (X_{i}; β_{*}, α_{*}, π, ω, a) = μ_{1} (X_{i}) - μ_{1}$ . Further since

B_{1}^{T} G_{1}^{- 1} (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) = 0,

we obtain

B_{1}^{T} G_{1}^{- 1} t_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) = [D_{1} - \frac{S_{i} A_{i}}{p (X_{i}; β_{*}) e (X_{i}; α_{*})}] (μ_{1} (X_{i}) - μ_{1}) .

Similar projection-based arguments lead to $C_{11} - B_{1}^{T} G_{1}^{- 1} F_{11} = C_{12} - B_{1}^{T} G_{1}^{- 1} F_{12} = 0$ . Therefore

E [Φ_{μ_{1}, i}] = E [\frac{S_{i} A_{i} (Y_{i} - μ_{1})}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} + B_{1}^{T} G_{1}^{- 1} t_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a)] = D_{1} E [μ_{1} (X_{i}) - μ_{1}] = 0.

When all models are correctly specified, it is easy to verify that $Φ_{μ_{1}, i}$ becomes the efficient influence function for estimating μ₁, i.e.,

Φ_{μ_{1}, i} = Ψ_{μ_{1}, i} = \frac{S_{i} A_{i}}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} (Y_{i} - μ_{1} (X_{i})) + μ_{1} (X_{i}) - μ_{1} .

The asymptotic analysis for ${\hat{μ}}_{0}$ proceeds in exactly the same fashion, except that we now replace A_i by 1 − A_i, and the propensity score e(X_i; α) by its complement 1 – e(X_i; α). We omit the details for brevity, but note that the influence function of ${\hat{μ}}_{0}$ is

Φ_{μ_{0}, i} = \frac{S_{i} (1 - A_{i}) (Y_{i} - μ_{0})}{p (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} + B_{0}^{T} G_{0}^{- 1} t_{0} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, b) + (C_{01} - B_{0}^{T} G_{0}^{- 1} F_{01}) I_{β_{*}, i} + (C_{02} - B_{0}^{T} G_{0}^{- 1} F_{02}) I_{α_{*}, i},

where $b = E [b (X_{i})]$ and

t_{0} (S_{i}, A_{i}, X_{i}; β, α, π, ω, b) = [D_{0} - \frac{S_{i} (1 - A_{i})}{p (X_{i}; β) (1 - e (X_{i}; α))}] (\begin{matrix} p (X_{i}; β) (1 - e (X_{i}; α)) - π (1 - ω) \\ b (X_{i}) - b \end{matrix}) + D_{0} {S_{i} (1 - A_{i}) - p (X_{i}; β) (1 - e (X_{i}; α))} (\begin{matrix} 1 \\ 0_{k_{0} \times 1} \end{matrix}),

h_{0} (X_{i}; β, α, π, ω, b) = \frac{1}{p (X_{i}; β) (1 - e (X_{i}; α))} (\begin{matrix} p (X_{i}; β) (1 - e (X_{i}; α)) - π (1 - ω) \\ b (X_{i}) - b \end{matrix}),

D_{0} = E [\frac{S_{i} (1 - A_{i})}{p (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))}],

G_{0} = E [S_{i} (1 - A_{i}) h_{0} {(X_{i}; β_{*}, α_{*}, π, ω, b)}^{\otimes 2}],

F_{01} = E [\frac{- S_{i} (1 - A_{i}) {\dot{p}}^{T} (X_{i}; β_{*})}{p^{2} (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} (\begin{matrix} - π (1 - ω) \\ b (X_{i}) - b \end{matrix})],

F_{02} = E [\frac{S_{i} (1 - A_{i}) {\dot{e}}^{T} (X_{i}; α_{*})}{p (X_{i}; β_{*}) {(1 - e (X_{i}; α_{*}))}^{2}} (\begin{matrix} - π (1 - ω) \\ b (X_{i}) - b \end{matrix})],

B_{0} = E [\frac{S_{i} (1 - A_{i})}{p (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} (Y_{i}^{0} - μ_{0}) h_{0} (X_{i}; β_{*}, α_{*}, π, ω, b)],

C_{01} = E [\frac{- S_{i} (1 - A_{i}) {\dot{p}}^{T} (X_{i}; β_{*})}{p^{2} (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} (Y_{i}^{0} - μ_{0})],

C_{02} = E [\frac{S_{i} (1 - A_{i}) {\dot{e}}^{T} (X_{i}; α_{*})}{p (X_{i}; β_{*}) {(1 - e (X_{i}; α_{*}))}^{2}} (Y_{i}^{0} - μ_{0})] .

The last quantities in Theorem 3.1 are thus defined as $Δ_{μ, i} (β_{*}, α_{*}) = (C_{11} - B_{1}^{T} G_{1}^{- 1} F_{11}) I_{β_{*}, i} - (C_{01} - B_{0}^{T} G_{0}^{- 1} F_{01}) I_{β_{*}, i} + (C_{12} - B_{1}^{T} G_{1}^{- 1} F_{12}) I_{α_{*}, i} - (C_{02} - B_{0}^{T} G_{0}^{- 1} F_{02}) I_{α_{*}, i}$ . Finally, when all models are correct, we could verify that the efficient score in (A2), $Ψ_{μ, i} (Y_{i}, S_{i}, A_{i}, X_{i}) = Ψ_{μ_{1}, i} - Ψ_{μ_{0}, i} = Φ_{μ_{1}, i} - Φ_{μ_{0}, i}$ . Therefore, $\hat{μ}$ is semiparametric efficient with respect to the variance bound Σ_μ.

Appendix D. Proof of Theorem 3.2

Define $\hat{π} = n_{1} / n$ , $\hat{ω} = n_{11} / n_{1}$ , $\hat{a} = n_{0}^{- 1} \sum_{i \in 𝒩} a (X_{i})$ and $a = E [a (X_{i}) ∣ S_{i} = 0]$ ; then $\hat{π} = π + o_{p} (1)$ , $\hat{ω} = ω + o_{p} (1)$ and $\hat{a} = a + o_{p} (1)$ . Denoting m₁(X_i; β, α) = p(X_i; β)e(X_i; α)/(1 – p(X_i; β)), the empirical probabilities for estimating τ₁ are written as

q_{i} = \frac{1}{n_{11}} \frac{1}{1 + {\hat{r}}_{1} {m_{1} (X_{i}; \hat{β}, \hat{α}) - \hat{π} \hat{ω} / (1 - \hat{π})} + {\hat{r}}_{2}^{T} {a (X_{i}) - \hat{a}}}, i \in 𝒯

where $\hat{r} = {({\hat{r}}_{1}, {\hat{r}}_{2}^{T})}^{T}$ satisfy

\sum_{i \in 𝒯} \frac{m_{1} (X_{i}; \hat{β}, \hat{α}) - \hat{π} \hat{ω} / (1 - \hat{π})}{1 + {\hat{r}}_{1} {m_{1} (X_{i}; \hat{β}, \hat{α}) - \hat{π} \hat{ω} / (1 - \hat{π})} + {\hat{r}}_{2}^{T} {a (X_{i}) - \hat{a}}} = 0,

\sum_{i \in 𝒯} \frac{a (X_{i}) - \hat{a}}{1 + {\hat{r}}_{1} {m_{1} (X_{i}; \hat{β}, \hat{α}) - \hat{π} \hat{ω} / (1 - \hat{π})} + {\hat{r}}_{2}^{T} {a (X_{i}) - \hat{a}}} = 0.
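Estimating equations of this form, $\sum_{i} g_{i} / (1 + r^{T} g_{i}) = 0$ , can be solved numerically for the Lagrange multipliers. The sketch below uses a damped Newton iteration on the convex dual objective $- \sum_{i} \log (1 + r^{T} g_{i})$ ; this is one common choice and not necessarily the algorithm used by the authors. The function name `el_multipliers`, the two stand-in covariate columns and the target mean of zero are hypothetical; in the setting above the rows of `g` would stack $m_{1} (X_{i}; \hat{β}, \hat{α}) - \hat{π} \hat{ω} / (1 - \hat{π})$ and $a (X_{i}) - \hat{a}$ .

```python
import numpy as np

def el_multipliers(g, n_iter=50, tol=1e-10):
    """Solve sum_i g_i / (1 + r^T g_i) = 0 for r by damped Newton steps.

    g : (n, k) array of centered constraint functions on the trial sample.
    Returns r and the weights q_i = 1 / (n * (1 + r^T g_i)).
    """
    n, k = g.shape
    r = np.zeros(k)
    for _ in range(n_iter):
        denom = 1.0 + g @ r
        score = (g / denom[:, None]).sum(axis=0)       # left side of the equations
        if np.linalg.norm(score) < tol:
            break
        hess = (g / denom[:, None] ** 2).T @ g         # sum_i g_i g_i^T / denom_i^2
        step = np.linalg.solve(hess, score)
        f_old = -np.log(denom).sum()
        t = 1.0
        for _ in range(60):                            # step halving
            d_new = 1.0 + g @ (r + t * step)
            # keep all weights positive and do not increase the dual objective
            if np.all(d_new > 1e-10) and -np.log(d_new).sum() <= f_old:
                break
            t *= 0.5
        r = r + t * step
    q = 1.0 / (n * (1.0 + g @ r))
    return r, q

# Hypothetical example: reweight a trial sample so that two covariate
# means match an assumed target value of zero.
rng = np.random.default_rng(1)
x_trial = rng.normal(loc=[0.3, -0.2], size=(400, 2))
a_hat = np.zeros(2)                                    # assumed target-sample mean
g = x_trial - a_hat
r_hat, q = el_multipliers(g)
print("multipliers:", r_hat)
print("weighted covariate means:", q @ x_trial)
```

At the solution the weights sum to one automatically, since $\sum_{i} 1 / (1 + r^{T} g_{i}) = n - r^{T} \sum_{i} g_{i} / (1 + r^{T} g_{i})$ .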

Similar to Appendix C, the Lagrange multiplier ${\hat{r}}_{1}$ now converges in probability to a non-zero limit due to biased sampling; therefore, a centralization step is required to re-parameterize the above estimating equations (Qin 1993, Section 2). Define ${\hat{λ}}_{1} = {(1 - \hat{π})}^{- 2} \hat{π} \hat{ω} {\hat{r}}_{1} - {(1 - \hat{π})}^{- 1} = (n / n_{0}) (n_{11} {\hat{r}}_{1} / n_{0} - 1)$ , ${\hat{λ}}_{2} = {(1 - \hat{π})}^{- 1} \hat{π} \hat{ω} {\hat{r}}_{2} = n_{11} {\hat{r}}_{2} / n_{0}$ , and

ρ_{1} (X_{i}; β, α, π, ω, a) = \frac{1 - p (X_{i}; β)}{p (X_{i}; β) e (X_{i}; α)} (\begin{matrix} (1 - π) m_{1} (X_{i}; β, α) - π ω \\ a (X_{i}) - a \end{matrix}),

it follows that we could re-write the empirical probabilities as

q_{i} = \frac{1}{n_{11} (1 - \hat{π})} \frac{\hat{π} \hat{ω} / m_{1} (X_{i}; \hat{β}, \hat{α})}{1 + {\hat{λ}}^{T} ρ_{1} (X_{i}; \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a})}, i \in 𝒯

with the centralized $\hat{λ} = {({\hat{λ}}_{1}, {\hat{λ}}_{2}^{T})}^{T}$ converging in probability to zero and satisfying

Q (\hat{λ}, \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a}) = \sum_{i \in 𝒯} ρ_{1} (X_{i}; \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a}) / {1 + {\hat{λ}}^{T} ρ_{1} (X_{i}; \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a})} = 0.
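The centralization step is a purely algebraic re-parameterization, so the two expressions for q_i should agree for any value of the multipliers. The short check below confirms this numerically with arbitrary stand-in inputs; the counts, the values of m₁, the centered covariates and the multipliers are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample sizes: treated trial units, trial units, all units.
n11, n1, n = 40, 90, 300
n0 = n - n1
pi_hat, omega_hat = n1 / n, n11 / n1

k = 2
m1 = rng.uniform(0.2, 1.5, size=n11)           # stand-in for m_1(X_i; beta_hat, alpha_hat)
a_c = rng.normal(size=(n11, k))                # stand-in for a(X_i) - a_hat
r1 = 0.3                                       # arbitrary multiplier r_1
r2 = rng.normal(scale=0.1, size=k)             # arbitrary multiplier r_2

# Original parameterization of the empirical probabilities.
c = pi_hat * omega_hat / (1 - pi_hat)
q_orig = 1.0 / (n11 * (1 + r1 * (m1 - c) + a_c @ r2))

# Centralized multipliers lambda_1, lambda_2.
lam1 = pi_hat * omega_hat * r1 / (1 - pi_hat) ** 2 - 1 / (1 - pi_hat)
lam2 = pi_hat * omega_hat * r2 / (1 - pi_hat)

# rho_1 = (1 / m_1) * ((1 - pi) m_1 - pi omega, a(X) - a), split by component.
rho_first = ((1 - pi_hat) * m1 - pi_hat * omega_hat) / m1
rho_rest = a_c / m1[:, None]
q_new = (pi_hat * omega_hat / m1) / (
    n11 * (1 - pi_hat) * (1 + lam1 * rho_first + rho_rest @ lam2)
)

print("max abs difference:", np.max(np.abs(q_orig - q_new)))
```

The same inputs also reproduce the count-based forms of the centralized multipliers, (n / n₀)(n₁₁ r₁ / n₀ − 1) and n₁₁ r₂ / n₀.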

Write β_*, α_* as the probability limits of $\hat{β}$ and $\hat{α}$ as in Appendix C, and now define $K_{1} = E [S_{i} A_{i} ρ_{1} {(X_{i}; β_{*}, α_{*}, π, ω, a)}^{\otimes 2}]$ ,

D_{1} = E [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) e (X_{i}; α_{*}) (1 - π)}],

F_{11} = E [\frac{- S_{i} A_{i} {\dot{p}}^{T} (X_{i}; β_{*})}{p^{2} (X_{i}; β_{*}) e (X_{i}; α_{*})} (\begin{matrix} - π ω \\ a (X_{i}) - a \end{matrix})],

F_{12} = E [\frac{- S_{i} A_{i} (1 - p (X_{i}; β_{*})) {\dot{e}}^{T} (X_{i}; α_{*})}{p (X_{i}; β_{*}) e^{2} (X_{i}; α_{*})} (\begin{matrix} - π ω \\ a (X_{i}) - a \end{matrix})] .

We expand $Q (\hat{λ}, \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a})$ around the limiting values of its arguments as

0 = Q (0, β_{*}, α_{*}, π, ω, a) - n K_{1} \hat{λ} + n F_{11} (\hat{β} - β_{*}) + n F_{12} (\hat{α} - α_{*}) - n (\begin{matrix} π + (1 - π) D_{1} \\ 0_{k_{1} \times 1} \end{matrix}) (\hat{π} - π) ω - n (1 - π) D_{1} (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) π (\hat{ω} - ω) - n (1 - π) D_{1} (\begin{matrix} 0 \\ 1_{k_{1} \times 1} \end{matrix}) (\hat{a} - a) + o_{p} (n^{\frac{1}{2}}) = Q (0, β_{*}, α_{*}, π, ω, a) - n K_{1} \hat{λ} + n F_{11} (\hat{β} - β_{*}) + n F_{12} (\hat{α} - α_{*}) - (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) \sum_{i = 1}^{n} π ω (S_{i} - π) - (1 - π) D_{1} (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) \sum_{i = 1}^{n} (S_{i} A_{i} - π ω) - D_{1} (\begin{matrix} 0 \\ 1_{k_{1} \times 1} \end{matrix}) \sum_{i = 1}^{n} (1 - S_{i}) (a (X_{i}) - a) + o_{p} (n^{\frac{1}{2}}) = - \sum_{i = 1}^{n} η_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) - n K_{1} \hat{λ} + n F_{11} (\hat{β} - β_{*}) + n F_{12} (\hat{α} - α_{*}) + o_{p} (n^{\frac{1}{2}}),

where

η_{1} (S_{i}, A_{i}, X_{i}; β, α, π, ω, a) = [D_{1} (1 - S_{i}) - \frac{S_{i} A_{i} (1 - p (X_{i}; β))}{p (X_{i}; β) e (X_{i}; α)}] (\begin{matrix} (1 - π) m_{1} (X_{i}; β, α) - π ω \\ a (X_{i}) - a \end{matrix}) - D_{1} (1 - π) (1 - S_{i}) m_{1} (X_{i}; β, α) (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) + D_{1} (1 - π) S_{i} A_{i} (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) + (1 - D_{1}) π ω (S_{i} - π) (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) .

Therefore

\hat{λ} = \frac{1}{n} \sum_{i = 1}^{n} K_{1}^{- 1} {- η_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) + F_{11} I_{β_{*}, i} + F_{12} I_{α_{*}, i}} + o_{p} (n^{- \frac{1}{2}}) .

(D1)

It follows that

{\hat{τ}}_{1} - τ_{1} = \frac{1}{n} \sum_{i = 1}^{n} \frac{n}{n_{0}} \frac{1 - p (X_{i}; \hat{β})}{p (X_{i}; \hat{β}) e (X_{i}; \hat{α})} \frac{S_{i} A_{i}}{1 + {\hat{λ}}^{T} ρ_{1} (X_{i}; \hat{β}, \hat{α}, \hat{π}, \hat{ω}, \hat{a})} (Y_{i} - τ_{1}) = \frac{1}{1 - π} [\frac{1}{n} \sum_{i = 1}^{n} \frac{S_{i} A_{i} (1 - p (X_{i}; β_{*})) (Y_{i} - τ_{1})}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} - J_{1}^{T} \hat{λ} + C_{11} (\hat{β} - β_{*}) + C_{12} (\hat{α} - α_{*})] + o_{p} (n^{- \frac{1}{2}})

(D2)

where

J_{1} = E [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} (Y_{i}^{1} - τ_{1}) ρ_{1} (X_{i}; β_{*}, α_{*}, π, ω, a)],

C_{11} = E [\frac{- S_{i} A_{i} {\dot{p}}^{T} (X_{i}; β_{*})}{p^{2} (X_{i}; β_{*}) e (X_{i}; α_{*})} (Y_{i}^{1} - τ_{1})],

C_{12} = E [\frac{- S_{i} A_{i} (1 - p (X_{i}; β_{*})) {\dot{e}}^{T} (X_{i}; α_{*})}{p (X_{i}; β_{*}) e^{2} (X_{i}; α_{*})} (Y_{i}^{1} - τ_{1})] .

Inserting Equation (D1) into Equation (D2) gives ${\hat{τ}}_{1} - τ_{1} = n^{- 1} \sum_{i = 1}^{n} Φ_{τ_{1}, i} + o_{p} (n^{- 1 / 2})$ , where

Φ_{τ_{1}, i} = \frac{1}{1 - π} [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{*})) (Y_{i} - τ_{1})}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} + J_{1}^{T} K_{1}^{- 1} η_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) + (C_{11} - J_{1}^{T} K_{1}^{- 1} F_{11}) I_{β_{*}, i} + (C_{12} - J_{1}^{T} K_{1}^{- 1} F_{12}) I_{α_{*}, i}] .

We now show that ${\hat{τ}}_{1}$ is also doubly robust. When the models for p(X) and e(X) are correctly specified, so that β_* = β₀ and α_* = α₀ (the latter holds automatically in randomized trials), it is easy to verify that

E [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{0})) (Y_{i} - τ_{1})}{p (X_{i}; β_{0}) e (X_{i}; α_{0})}] = E [(1 - S_{i}) (Y_{i}^{1} - τ_{1})] = 0

by the Law of Iterated Expectations. In addition, D₁ = 1 and so

η_{1} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, a) = [(1 - S_{i}) - \frac{S_{i} A_{i} (1 - p (X_{i}; β_{0}))}{p (X_{i}; β_{0}) e (X_{i}; α_{0})}] (\begin{matrix} - π ω \\ a (X_{i}) - a \end{matrix}),

which has zero expectation. Because $E [I_{β_{0}, i}] = E [I_{α_{0}, i}] = 0$ , we have $E [Φ_{τ_{1}, i}] = 0$ and ${\hat{τ}}_{1} \overset{p}{\to} τ_{1}$ .
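The moment identity used in this step, $E [S_{i} A_{i} (1 - p) (Y_{i} - τ_{1}) / (p e)] = E [(1 - S_{i}) (Y_{i}^{1} - τ_{1})] = 0$ under correct scores, can be illustrated by Monte Carlo. The sketch below assumes a logistic sampling score, e(X) = 0.5 and a linear outcome mean, and obtains τ₁ = E[Y¹ ∣ S = 0] by Gauss-Hermite quadrature; all of these choices are assumptions made for the example only.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# tau_1 = E[(1 - p(X)) mu_1(X)] / E[1 - p(X)] for X ~ N(0, 1), by quadrature.
nodes, wts = hermegauss(60)
wts = wts / wts.sum()                          # normalize to the N(0, 1) density
p_q = expit(-0.5 + 0.5 * nodes)
tau1 = np.sum(wts * (1 - p_q) * (1 + 2 * nodes)) / np.sum(wts * (1 - p_q))

# Hypothetical data-generating process matching the quadrature above.
rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
p = expit(-0.5 + 0.5 * x)                      # assumed true sampling score
e = 0.5                                        # randomization probability
s = rng.binomial(1, p)
a = s * rng.binomial(1, e, size=n)
y1 = 1 + 2 * x + rng.normal(size=n)            # potential outcome Y^1

lhs = np.mean(s * a * (1 - p) * (y1 - tau1) / (p * e))   # weighted trial moment
rhs = np.mean((1 - s) * (y1 - tau1))                     # non-participant moment
print("tau_1:", tau1, "weighted moment:", lhs, "target moment:", rhs)
```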

To show the consistency under a correct outcome model implied from constraints, we write

M_{i} = \frac{S_{i} A_{i} (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} (Y_{i}^{1} - τ_{1}),

R_{i} = \frac{S_{i} A_{i} (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} g_{1} (X_{i}; β_{*}, α_{*}, π, ω, a),

where

g_{1} (X_{i}; β_{*}, α_{*}, π, ω, a) = (\begin{matrix} (1 - π) m_{1} (X_{i}; β_{*}, α_{*}) - π ω \\ a (X_{i}) - a \end{matrix}) .

Observe that $J_{1}^{T} = E [M_{i} R_{i}^{T}]$ , $K_{1} = E [R_{i}^{\otimes 2}]$ . When the “outcome model” is correct in the sense that $τ_{1} (X) = ξ_{10} + \sum_{l = 1}^{k_{1}} ξ_{1 l} a_{l} (X)$ , the conditional expectation, $E [Y_{i}^{1} - τ_{1} ∣ X_{i}] = E [Y_{i}^{1} - τ_{1} ∣ X_{i}, S_{i} = 0] = τ_{1} (X_{i}) - τ_{1}$ , lies in the linear space spanned by {a(X_i) − a}. Because

E ([M_{i} - \frac{S_{i} A_{i} (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} E [Y_{i}^{1} - τ_{1} ∣ X_{i}]] [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} L (X_{i})]) = 0

for all functions L(X_i) and g₁(X_i; β_*, α_*, π, ω, a) only involves X_i, we have

J_{1}^{T} K_{1}^{- 1} R_{i} = E [M_{i} R_{i}^{T}] {E [R_{i}^{\otimes 2}]}^{- 1} R_{i} = \frac{S_{i} A_{i} (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} (τ_{1} (X_{i}) - τ_{1}) .

Therefore $J_{1}^{T} K_{1}^{- 1} g_{1} (X_{i}; β_{*}, α_{*}, π, ω, a) = τ_{1} (X_{i}) - τ_{1}$ . Further because

J_{1}^{T} K_{1}^{- 1} (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) = 0,

we obtain

J_{1}^{T} K_{1}^{- 1} η_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a) = [D_{1} (1 - S_{i}) - \frac{S_{i} A_{i} (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) e (X_{i}; α_{*})}] (τ_{1} (X_{i}) - τ_{1}) .

Similar projection-based arguments lead to $C_{11} - J_{1}^{T} K_{1}^{- 1} F_{11} = C_{12} - J_{1}^{T} K_{1}^{- 1} F_{12} = 0$ . Therefore

(1 - π) E [Φ_{τ_{1}, i}] = E [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{*})) (Y_{i}^{1} - τ_{1})}{p (X_{i}; β_{*}) e (X_{i}; α_{*})} + J_{1}^{T} K_{1}^{- 1} η_{1} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, a)] = D_{1} E [(1 - S_{i}) (τ_{1} (X) - τ_{1})] = 0,

and ${\hat{τ}}_{1} \overset{p}{\to} τ_{1}$ .

Finally when all models are correct, it is easy to verify that $Φ_{τ_{1}, i}$ becomes the efficient influence function for estimating τ₁ in Theorem 1, i.e.,

Φ_{τ_{1}, i} = Ψ_{τ_{1}, i} = \frac{1}{1 - π} (\frac{S_{i} A_{i} (1 - p (X_{i}; β_{0}))}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} (Y_{i} - τ_{1} (X_{i})) + (1 - S_{i}) (τ_{1} (X_{i}) - τ_{1})) .
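Setting the sample mean of $Ψ_{τ_{1}, i}$ to zero and solving for τ₁ yields an augmented weighting estimate, which makes the structure of the display above concrete. The sketch below does this with known nuisance functions under a hypothetical data-generating process (logistic sampling score, e(X) = 0.5, τ₁(X) = 1 + 2X); it is not the empirical likelihood estimator ${\hat{τ}}_{1}$ analyzed in this appendix, and the name `tau1_from_eif` is introduced only for this example.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def tau1_from_eif(y, s, a, p, e, t1x):
    """Solve mean(Psi_{tau_1}) = 0 for tau_1 given p(X), e(X) and tau_1(X)."""
    w = s * a * (1 - p) / (p * e)              # odds-type weight on treated trial units
    return np.mean(w * (y - t1x) + (1 - s) * t1x) / np.mean(1 - s)

# Reference value of tau_1 = E[Y^1 | S = 0] by Gauss-Hermite quadrature.
nodes, wts = hermegauss(60)
wts = wts / wts.sum()
p_q = expit(-0.5 + 0.5 * nodes)
tau1_true = np.sum(wts * (1 - p_q) * (1 + 2 * nodes)) / np.sum(wts * (1 - p_q))

# Hypothetical data; Y is used only where S_i A_i = 1.
rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
p = expit(-0.5 + 0.5 * x)
e = 0.5
s = rng.binomial(1, p)
a = s * rng.binomial(1, e, size=n)
t1x = 1 + 2 * x                                # assumed tau_1(X)
y = np.where(s * a == 1, t1x + rng.normal(size=n), 0.0)

tau1_hat = tau1_from_eif(y, s, a, p, e, t1x)
print("estimate:", tau1_hat, "reference:", tau1_true)
```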

The asymptotic analysis for ${\hat{τ}}_{0}$ proceeds in the same way, except that we now replace A_i by 1 − A_i, and the propensity score e(X_i; α) by its complement 1 – e(X_i; α). The details are omitted for brevity, but we note that the influence function of ${\hat{τ}}_{0}$ is

Φ_{τ_{0}, i} = \frac{1}{1 - π} [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{*})) (Y_{i} - τ_{0})}{p (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} + J_{0}^{T} K_{0}^{- 1} η_{0} (S_{i}, A_{i}, X_{i}; β_{*}, α_{*}, π, ω, b) + (C_{01} - J_{0}^{T} K_{0}^{- 1} F_{01}) I_{β_{*}, i} + (C_{02} - J_{0}^{T} K_{0}^{- 1} F_{02}) I_{α_{*}, i}],

where $b = E [b (X_{i}) ∣ S_{i} = 0]$ , m₀(X_i; β, α) = p(X_i; β)(1 - e(X_i; α))/(1 - p(X_i; β)) and

η_{0} (S_{i}, A_{i}, X_{i}; β, α, π, ω, b) = [D_{0} (1 - S_{i}) - \frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β))}{p (X_{i}; β) (1 - e (X_{i}; α))}] (\begin{matrix} (1 - π) m_{0} (X_{i}; β, α) - π (1 - ω) \\ b (X_{i}) - b \end{matrix}) - D_{0} (1 - π) (1 - S_{i}) m_{0} (X_{i}; β, α) (\begin{matrix} 1 \\ 0_{k_{0} \times 1} \end{matrix}) + D_{0} (1 - π) S_{i} (1 - A_{i}) (\begin{matrix} 1 \\ 0_{k_{0} \times 1} \end{matrix}) + (1 - D_{0}) π (1 - ω) (S_{i} - π) (\begin{matrix} 1 \\ 0_{k_{0} \times 1} \end{matrix}),

ρ_{0} (X_{i}; β, α, π, ω, b) = \frac{1 - p (X_{i}; β)}{p (X_{i}; β) (1 - e (X_{i}; α))} (\begin{matrix} (1 - π) m_{0} (X_{i}; β, α) - π (1 - ω) \\ b (X_{i}) - b \end{matrix}),

D_{0} = E [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) (1 - e (X_{i}; α_{*})) (1 - π)}],

K_{0} = E [S_{i} (1 - A_{i}) ρ_{0} {(X_{i}; β_{*}, α_{*}, π, ω, b)}^{\otimes 2}],

F_{01} = E [\frac{- S_{i} (1 - A_{i}) {\dot{p}}^{T} (X_{i}; β_{*})}{p^{2} (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} (\begin{matrix} - π (1 - ω) \\ b (X_{i}) - b \end{matrix})],

F_{02} = E [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{*})) {\dot{e}}^{T} (X_{i}; α_{*})}{p (X_{i}; β_{*}) {(1 - e (X_{i}; α_{*}))}^{2}} (\begin{matrix} - π (1 - ω) \\ b (X_{i}) - b \end{matrix})],

J_{0} = E [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{*}))}{p (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} (Y_{i}^{0} - τ_{0}) ρ_{0} (X_{i}; β_{*}, α_{*}, π, ω, b)],

C_{01} = E [\frac{- S_{i} (1 - A_{i}) {\dot{p}}^{T} (X_{i}; β_{*})}{p^{2} (X_{i}; β_{*}) (1 - e (X_{i}; α_{*}))} (Y_{i}^{0} - τ_{0})],

C_{02} = E [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{*})) {\dot{e}}^{T} (X_{i}; α_{*})}{p (X_{i}; β_{*}) {(1 - e (X_{i}; α_{*}))}^{2}} (Y_{i}^{0} - τ_{0})] .

The residual quantities in Theorem 3.2 are thus defined as $Δ_{τ, i} (β_{*}, α_{*}) = (C_{11} - J_{1}^{T} K_{1}^{- 1} F_{11}) I_{β_{*}, i} - (C_{01} - J_{0}^{T} K_{0}^{- 1} F_{01}) I_{β_{*}, i} + (C_{12} - J_{1}^{T} K_{1}^{- 1} F_{12}) I_{α_{*}, i} - (C_{02} - J_{0}^{T} K_{0}^{- 1} F_{02}) I_{α_{*}, i}$ . Finally, when all models are correct, we could verify that the efficient score in Equation (A4), $Ψ_{τ, i} (Y_{i}, S_{i}, A_{i}, X_{i}) = Ψ_{τ_{1}, i} - Ψ_{τ_{0}, i} = Φ_{τ_{1}, i} - Φ_{τ_{0}, i}$ . Therefore, $\hat{τ}$ is semiparametric efficient with respect to Σ_τ, the bound without the knowledge of the sampling score.

Appendix E. Proof of Theorem 3.3

Although $\hat{ω} = n_{11} / n_{1} = ω + o_{p} (1)$ , the consistency of $\tilde{π} = n^{- 1} \sum_{i = 1}^{n} p (X_{i}; \hat{β})$ and $\tilde{a} = n_{0}^{- 1} \sum_{i \in 𝒫} a (X_{i}) (1 - p (X_{i}; \hat{β}))$ to π = pr(S_i = 1) and $a = E [a (X_{i}) ∣ S_{i} = 0]$ relies on the correct specification of the sampling score model. Because of this, the general asymptotic expansion of $\tilde{τ}$ cannot be obtained: the probability limits of the Lagrange multipliers change depending on whether p(X_i) is correctly estimated. We therefore focus on the case when both the sampling score and the propensity score are correctly estimated, so that β_* = β₀ and α_* = α₀. Denoting m₁(X_i; β, α) = p(X_i; β)e(X_i; α)/(1 – p(X_i; β)), the empirical probabilities for estimating τ₁ are written as

q_{i} = \frac{1}{n_{11}} \frac{1}{1 + {\tilde{r}}_{1} {m_{1} (X_{i}; \hat{β}, \hat{α}) - \tilde{π} \hat{ω} / (1 - \tilde{π})} + {\tilde{r}}_{2}^{T} {a (X_{i}) - \tilde{a}}}, i \in 𝒯

where $\tilde{r} = {({\tilde{r}}_{1}, {\tilde{r}}_{2}^{T})}^{T}$ satisfies

\sum_{i \in 𝒯} \frac{m_{1} (X_{i}; \hat{β}, \hat{α}) - \tilde{π} \hat{ω} / (1 - \tilde{π})}{1 + {\tilde{r}}_{1} {m_{1} (X_{i}; \hat{β}, \hat{α}) - \tilde{π} \hat{ω} / (1 - \tilde{π})} + {\tilde{r}}_{2}^{T} {a (X_{i}) - \tilde{a}}} = 0,

\sum_{i \in 𝒯} \frac{a (X_{i}) - \tilde{a}}{1 + {\tilde{r}}_{1} {m_{1} (X_{i}; \hat{β}, \hat{α}) - \tilde{π} \hat{ω} / (1 - \tilde{π})} + {\tilde{r}}_{2}^{T} {a (X_{i}) - \tilde{a}}} = 0.

Since we assume the sampling scores are correctly estimated, the probability limits of the Lagrange multipliers are the same as those in Appendix D, and therefore we can employ a similar centralization step to re-parameterize the above estimating equations. Define ${\tilde{λ}}_{1} = {(1 - \tilde{π})}^{- 2} \tilde{π} \hat{ω} {\tilde{r}}_{1} - {(1 - \tilde{π})}^{- 1}$ , ${\tilde{λ}}_{2} = {(1 - \tilde{π})}^{- 1} \tilde{π} \hat{ω} {\tilde{r}}_{2}$ , and

ρ_{1} (X_{i}; β, α, π, ω, a) = \frac{1 - p (X_{i}; β)}{p (X_{i}; β) e (X_{i}; α)} (\begin{matrix} (1 - π) m_{1} (X_{i}; β, α) - π ω \\ a (X_{i}) - a \end{matrix}),

it follows that we could re-write the empirical probabilities as

q_{i} = \frac{1}{n_{11} (1 - \tilde{π})} \frac{\tilde{π} \hat{ω} / m_{1} (X_{i}; \hat{β}, \hat{α})}{1 + {\tilde{λ}}^{T} ρ_{1} (X_{i}; \hat{β}, \hat{α}, \tilde{π}, \hat{ω}, \tilde{a})}, i \in 𝒯

with the centralized $\tilde{λ} = {({\tilde{λ}}_{1}, {\tilde{λ}}_{2}^{T})}^{T}$ converging in probability to zero and satisfying

Q (\tilde{λ}, \hat{β}, \hat{α}, \tilde{π}, \hat{ω}, \tilde{a}) = \sum_{i \in 𝒯} ρ_{1} (X_{i}; \hat{β}, \hat{α}, \tilde{π}, \hat{ω}, \tilde{a}) / {1 + {\tilde{λ}}^{T} ρ_{1} (X_{i}; \hat{β}, \hat{α}, \tilde{π}, \hat{ω}, \tilde{a})} = 0.

Defining $V_{1} = E [{\dot{p}}^{T} (X_{i}; β_{0})]$ and $E_{1} = E [a (X_{i}) {\dot{p}}^{T} (X_{i}; β_{0})]$ , we have

\tilde{π} = \frac{1}{n} \sum_{i = 1}^{n} p (X_{i}; β_{0}) + V_{1} (\hat{β} - β_{0}) + o_{p} (n^{- \frac{1}{2}})

and

\tilde{a} = \frac{1}{n (1 - π)} \sum_{i = 1}^{n} a (X_{i}) (1 - p (X_{i}; β_{0})) + \frac{a}{1 - π} (\tilde{π} - π) - \frac{E_{1}}{1 - π} (\hat{β} - β_{0}) + o_{p} (n^{- \frac{1}{2}}) = a + \frac{1}{n (1 - π)} \sum_{i = 1}^{n} a (X_{i}) (1 - p (X_{i}; β_{0})) - \frac{a}{1 - π} (1 - \tilde{π}) - \frac{E_{1}}{1 - π} (\hat{β} - β_{0}) + o_{p} (n^{- \frac{1}{2}}) = a + \frac{1}{n (1 - π)} \sum_{i = 1}^{n} (a (X_{i}) - a) (1 - p (X_{i}; β_{0})) - \frac{E_{1} - a V_{1}}{1 - π} (\hat{β} - β_{0}) + o_{p} (n^{- \frac{1}{2}}) = a + \frac{1}{n (1 - π)} \sum_{i = 1}^{n} (a (X_{i}) - a) (1 - p (X_{i}; β_{0})) + o_{p} (n^{- \frac{1}{2}})

where the last equation results from the fact that

E_{1} - a V_{1} = \int (a (x) - a) {(\frac{\partial p (x; β_{0})}{\partial β})}^{T} f (x) d x = \frac{\partial}{\partial β^{T}} \int (a - a (x)) (1 - p (x; β_{0})) f (x) d x = 0.

Now define

K_{1} = E [S_{i} A_{i} ρ_{1} {(X_{i}; β_{0}, α_{0}, π, ω, a)}^{\otimes 2}],

F_{11} = E [\frac{- S_{i} A_{i} {\dot{p}}^{T} (X_{i}; β_{0})}{p^{2} (X_{i}; β_{0}) e (X_{i}; α_{0})} (\begin{matrix} - π ω \\ a (X_{i}) - a \end{matrix})],

F_{12} = E [\frac{- S_{i} A_{i} (1 - p (X_{i}; β_{0})) {\dot{e}}^{T} (X_{i}; α_{0})}{p (X_{i}; β_{0}) e^{2} (X_{i}; α_{0})} (\begin{matrix} - π ω \\ a (X_{i}) - a \end{matrix})] .

We expand $Q (\tilde{λ}, \hat{β}, \hat{α}, \tilde{π}, \hat{ω}, \tilde{a})$ around the limiting values of its arguments as

0 = Q (0, β_{0}, α_{0}, π, ω, a) - n K_{1} \tilde{λ} + n F_{11} (\hat{β} - β_{0}) + n F_{12} (\hat{α} - α_{0}) - n (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) (\tilde{π} - π) ω - n (1 - π) (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) π (\hat{ω} - ω) - n (1 - π) (\begin{matrix} 0 \\ 1_{k_{1} \times 1} \end{matrix}) (\tilde{a} - a) + o_{p} (n^{\frac{1}{2}}) = Q (0, β_{0}, α_{0}, π, ω, a) - n K_{1} \tilde{λ} + n {\tilde{F}}_{11} (\hat{β} - β_{0}) + n F_{12} (\hat{α} - α_{0}) - (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) \sum_{i = 1}^{n} ω (p (X_{i}; β_{0}) - π) - (1 - π) (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) \sum_{i = 1}^{n} S_{i} (A_{i} - ω) - (\begin{matrix} 0 \\ 1_{k_{1} \times 1} \end{matrix}) \sum_{i = 1}^{n} (1 - p (X_{i}; β_{0})) (a (X_{i}) - a) + o_{p} (n^{\frac{1}{2}}) = - \sum_{i = 1}^{n} {\tilde{η}}_{1} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, a) - n K_{1} \tilde{λ} + n {\tilde{F}}_{11} (\hat{β} - β_{0}) + n F_{12} (\hat{α} - α_{0}) + o_{p} (n^{\frac{1}{2}}),

where ${\tilde{F}}_{11} = F_{11} - (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) ω V_{1}$ and

{\tilde{η}}_{1} (S_{i}, A_{i}, X_{i}; β, α, π, ω, a) = [(1 - p (X_{i}; β)) - \frac{S_{i} A_{i} (1 - p (X_{i}; β))}{p (X_{i}; β) e (X_{i}; α)}] (\begin{matrix} (1 - π) m_{1} (X_{i}; β, α) - π ω \\ a (X_{i}) - a \end{matrix}) + (1 - π) (S_{i} A_{i} - p (X_{i}; β) e (X_{i}; α)) (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) - (1 - π) ω (S_{i} - p (X_{i}; β)) (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) .

Therefore

\tilde{λ} = \frac{1}{n} \sum_{i = 1}^{n} K_{1}^{- 1} {- {\tilde{η}}_{1} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, a) + {\tilde{F}}_{11} I_{β_{0}, i} + F_{12} I_{α_{0}, i}} + o_{p} (n^{- \frac{1}{2}}) .

(E1)

It follows that

{\tilde{τ}}_{1} - τ_{1} = \frac{1}{n_{11}} \sum_{i = 1}^{n} \frac{\tilde{π} \hat{ω}}{1 - \tilde{π}} \frac{1 - p (X_{i}; \hat{β})}{p (X_{i}; \hat{β}) e (X_{i}; \hat{α})} \frac{S_{i} A_{i}}{1 + {\tilde{λ}}^{T} ρ_{1} (X_{i}; \hat{β}, \hat{α}, \tilde{π}, \hat{ω}, \tilde{a})} (Y_{i} - τ_{1}) = \frac{1}{1 - π} [\frac{1}{n} \sum_{i = 1}^{n} \frac{S_{i} A_{i} (1 - p (X_{i}; β_{0})) (Y_{i} - τ_{1})}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} - J_{1}^{T} \tilde{λ} + C_{11} (\hat{β} - β_{0}) + C_{12} (\hat{α} - α_{0})] + o_{p} (n^{- \frac{1}{2}})

(E2)

where

J_{1} = E [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{0}))}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} (Y_{i}^{1} - τ_{1}) ρ_{1} (X_{i}; β_{0}, α_{0}, π, ω, a)],

C_{11} = E [\frac{- S_{i} A_{i} {\dot{p}}^{T} (X_{i}; β_{0})}{p^{2} (X_{i}; β_{0}) e (X_{i}; α_{0})} (Y_{i}^{1} - τ_{1})],

C_{12} = E [\frac{- S_{i} A_{i} (1 - p (X_{i}; β_{0})) {\dot{e}}^{T} (X_{i}; α_{0})}{p (X_{i}; β_{0}) e^{2} (X_{i}; α_{0})} (Y_{i}^{1} - τ_{1})] .

Inserting Equation (E1) into Equation (E2) gives ${\tilde{τ}}_{1} - τ_{1} = n^{- 1} \sum_{i = 1}^{n} {\tilde{Φ}}_{τ_{1}, i} + o_{p} (n^{- 1 / 2})$ , where

{\tilde{Φ}}_{τ_{1}, i} = \frac{1}{1 - π} [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{0})) (Y_{i} - τ_{1})}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} + J_{1}^{T} K_{1}^{- 1} {\tilde{η}}_{1} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, a) + (C_{11} - J_{1}^{T} K_{1}^{- 1} {\tilde{F}}_{11}) I_{β_{0}, i} + (C_{12} - J_{1}^{T} K_{1}^{- 1} F_{12}) I_{α_{0}, i}] .

The estimator ${\tilde{τ}}_{1}$ is only singly robust since the above derivation rests on correct specification of the sampling score (regardless of the implied outcome model). It is easy to see that in this case ${\tilde{τ}}_{1} \overset{p}{\to} τ_{1}$ because

E [\frac{S_{i} A_{i} (1 - p (X_{i}; β_{0})) (Y_{i} - τ_{1})}{p (X_{i}; β_{0}) e (X_{i}; α_{0})}] = E [{\tilde{η}}_{1} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, a)] = E [I_{β_{0}, i}] = E [I_{α_{0}, i}] = 0.

In addition, we can write

M_{i} = \frac{S_{i} A_{i} (1 - p (X_{i}; β_{0}))}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} (Y_{i}^{1} - τ_{1}),

R_{i} = \frac{S_{i} A_{i} (1 - p (X_{i}; β_{0}))}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} g_{1} (X_{i}; β_{0}, α_{0}, π, ω, a),

where

g_{1} (X_{i}; β_{0}, α_{0}, π, ω, a) = (\begin{matrix} (1 - π) m_{1} (X_{i}; β_{0}, α_{0}) - π ω \\ a (X_{i}) - a \end{matrix}) .

Observe that $J_{1}^{T} = E [M_{i} R_{i}^{T}]$ , $K_{1} = E [R_{i}^{\otimes 2}]$ . When the “outcome model” is correct in the sense that $τ_{1} (X) = ξ_{10} + \sum_{l = 1}^{k_{1}} ξ_{1 l} a_{l} (X)$ , the conditional expectation, $E [Y_{i}^{1} - τ_{1} ∣ X_{i}] = τ_{1} (X_{i}) - τ_{1}$ , lies in the linear space spanned by {a(X_i) – a}. Similar projection-based arguments to those in Appendix D lead to

J_{1}^{T} K_{1}^{- 1} g_{1} (X_{i}; β_{0}, α_{0}, π, ω, a) = τ_{1} (X_{i}) - τ_{1}, J_{1}^{T} K_{1}^{- 1} (\begin{matrix} 1 \\ 0_{k_{1} \times 1} \end{matrix}) = 0.

Therefore we obtain

J_{1}^{T} K_{1}^{- 1} {\tilde{η}}_{1} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, a) = [(1 - p (X_{i}; β_{0})) - \frac{S_{i} A_{i} (1 - p (X_{i}; β_{0}))}{p (X_{i}; β_{0}) e (X_{i}; α_{0})}] (τ_{1} (X_{i}) - τ_{1}) .

Repeating the projection-based arguments, we also get $C_{11} - J_{1}^{T} K_{1}^{- 1} {\tilde{F}}_{11} = C_{11} - J_{1}^{T} K_{1}^{- 1} F_{11} = C_{12} - J_{1}^{T} K_{1}^{- 1} F_{12} = 0$ . This proves that when all models are correctly specified, ${\tilde{Φ}}_{τ_{1}, i}$ in fact becomes the efficient influence function for estimating τ₁ in Theorem 2.4, i.e.,

{\tilde{Φ}}_{τ_{1}, i} = {\tilde{Ψ}}_{τ_{1}, i} = \frac{1}{1 - π} (\frac{S_{i} A_{i} (1 - p (X_{i}; β_{0}))}{p (X_{i}; β_{0}) e (X_{i}; α_{0})} (Y_{i} - τ_{1} (X_{i})) + (1 - p (X_{i}; β_{0})) (τ_{1} (X_{i}) - τ_{1})) .
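The only difference between ${\tilde{Ψ}}_{τ_{1}, i}$ and the influence function $Ψ_{τ_{1}, i}$ of Appendix D is that the augmentation term uses 1 − p(X_i; β₀) in place of 1 − S_i, which removes the binary noise in S_i and lowers the variance. The sketch below compares the two sample variances under a hypothetical data-generating process (logistic sampling score, e(X) = 0.5, τ₁(X) = 1 + 2X); the function name `eif_tau1` and all numerical settings are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def eif_tau1(y, s, a, p, e, t1x, tau1, pi, known_sampling_score):
    """Influence function for tau_1 with a known or an unknown sampling score."""
    w = s * a * (1 - p) / (p * e)
    aug = (1 - p) if known_sampling_score else (1 - s)
    return (w * (y - t1x) + aug * (t1x - tau1)) / (1 - pi)

# pi = E[p(X)] and tau_1 = E[Y^1 | S = 0] by Gauss-Hermite quadrature.
nodes, wts = hermegauss(60)
wts = wts / wts.sum()
p_q = expit(-0.5 + 0.5 * nodes)
pi = np.sum(wts * p_q)
tau1 = np.sum(wts * (1 - p_q) * (1 + 2 * nodes)) / (1 - pi)

# Hypothetical data matching the quadrature above.
rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(size=n)
p = expit(-0.5 + 0.5 * x)
e = 0.5
s = rng.binomial(1, p)
a = s * rng.binomial(1, e, size=n)
t1x = 1 + 2 * x
y = t1x + rng.normal(size=n)

psi_unknown = eif_tau1(y, s, a, p, e, t1x, tau1, pi, known_sampling_score=False)
psi_known = eif_tau1(y, s, a, p, e, t1x, tau1, pi, known_sampling_score=True)
var_unknown, var_known = psi_unknown.var(), psi_known.var()
print("variance, unknown sampling score:", var_unknown)
print("variance, known sampling score:  ", var_known)
```

In this setup the gap between the two variances is approximately $E [p (X) (1 - p (X)) {(τ_{1} (X) - τ_{1})}^{2}] / {(1 - π)}^{2}$ , so it vanishes only when τ₁(X) is constant or the sampling score is degenerate.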

The asymptotic analysis for ${\tilde{τ}}_{0}$ follows the same fashion. The details are omitted for brevity, but we note that with a correctly specified sampling score, the influence function of ${\tilde{τ}}_{0}$ is

{\tilde{Φ}}_{τ_{0}, i} = \frac{1}{1 - π} [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{0})) (Y_{i} - τ_{0})}{p (X_{i}; β_{0}) (1 - e (X_{i}; α_{0}))} + J_{0}^{T} K_{0}^{- 1} {\tilde{η}}_{0} (S_{i}, A_{i}, X_{i}; β_{0}, α_{0}, π, ω, b) + (C_{01} - J_{0}^{T} K_{0}^{- 1} {\tilde{F}}_{01}) I_{β_{0}, i} + (C_{02} - J_{0}^{T} K_{0}^{- 1} F_{02}) I_{α_{0}, i}],

where $b = E [b (X_{i}) ∣ S_{i} = 0]$ , m₀(X_i; β, α) = p(X_i; β)(1 - e(X_i; α))/(1 - p(X_i; β)) and

{\tilde{η}}_{0} (S_{i}, A_{i}, X_{i}; β, α, π, ω, b) = [(1 - p (X_{i}; β)) - \frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β))}{p (X_{i}; β) (1 - e (X_{i}; α))}] (\begin{matrix} (1 - π) m_{0} (X_{i}; β, α) - π (1 - ω) \\ b (X_{i}) - b \end{matrix}) + (1 - π) (S_{i} (1 - A_{i}) - p (X_{i}; β) (1 - e (X_{i}; α))) (\begin{matrix} 1 \\ 0_{k_{0} \times 1} \end{matrix}) - (1 - π) (1 - ω) (S_{i} - p (X_{i}; β)) (\begin{matrix} 1 \\ 0_{k_{0} \times 1} \end{matrix}),

ρ_{0} (X_{i}; β, α, π, ω, b) = \frac{1 - p (X_{i}; β)}{p (X_{i}; β) (1 - e (X_{i}; α))} (\begin{matrix} (1 - π) m_{0} (X_{i}; β, α) - π (1 - ω) \\ b (X_{i}) - b \end{matrix}),

K_{0} = E [S_{i} (1 - A_{i}) ρ_{0} {(X_{i}; β_{0}, α_{0}, π, ω, b)}^{\otimes 2}],

{\tilde{F}}_{01} = E [\frac{- S_{i} (1 - A_{i}) {\dot{p}}^{T} (X_{i}; β_{0})}{p^{2} (X_{i}; β_{0}) (1 - e (X_{i}; α_{0}))} (\begin{matrix} - π (1 - ω) \\ b (X_{i}) - b \end{matrix})] + (\begin{matrix} 1 \\ 0_{k_{0} \times 1} \end{matrix}) (1 - ω) V_{1},

F_{02} = E [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{0})) {\dot{e}}^{T} (X_{i}; α_{0})}{p (X_{i}; β_{0}) {(1 - e (X_{i}; α_{0}))}^{2}} (\begin{matrix} - π (1 - ω) \\ b (X_{i}) - b \end{matrix})],

J_{0} = E [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{0}))}{p (X_{i}; β_{0}) (1 - e (X_{i}; α_{0}))} (Y_{i}^{0} - τ_{0}) ρ_{0} (X_{i}; β_{0}, α_{0}, π, ω, b)],

C_{01} = E [\frac{- S_{i} (1 - A_{i}) {\dot{p}}^{T} (X_{i}; β_{0})}{p^{2} (X_{i}; β_{0}) (1 - e (X_{i}; α_{0}))} (Y_{i}^{0} - τ_{0})],

C_{02} = E [\frac{S_{i} (1 - A_{i}) (1 - p (X_{i}; β_{0})) {\dot{e}}^{T} (X_{i}; α_{0})}{p (X_{i}; β_{0}) {(1 - e (X_{i}; α_{0}))}^{2}} (Y_{i}^{0} - τ_{0})] .

The residual quantities in Theorem 3.3 are thus defined as ${\tilde{Δ}}_{τ, i} (β_{0}, α_{0}) = (C_{11} - J_{1}^{T} K_{1}^{- 1} {\tilde{F}}_{11}) I_{β_{0}, i} - (C_{01} - J_{0}^{T} K_{0}^{- 1} {\tilde{F}}_{01}) I_{β_{0}, i} + (C_{12} - J_{1}^{T} K_{1}^{- 1} F_{12}) I_{α_{0}, i} - (C_{02} - J_{0}^{T} K_{0}^{- 1} F_{02}) I_{α_{0}, i}$ . Finally, when all models are correct, we could verify that the efficient score in Equation (B1), ${\tilde{Ψ}}_{τ, i} (Y_{i}, S_{i}, A_{i}, X_{i}) = {\tilde{Ψ}}_{τ_{1}, i} - {\tilde{Ψ}}_{τ_{0}, i} = {\tilde{Φ}}_{τ_{1}, i} - {\tilde{Φ}}_{τ_{0}, i}$ . Therefore, $\tilde{τ}$ is semiparametric efficient with respect to ${\tilde{Σ}}_{τ}$ , the lower variance bound assuming the knowledge of the sampling score.

Footnotes

Disclosure statement

No potential conflict of interest was reported by the authors.

References

Bickel PJ, Klaassen CAJ, Ritov Y, and Wellner JA. 1993. Efficient and adaptive estimation for semiparametric models. New York: Springer.
Buchanan AL, Hudgens MG, Cole SR, Mollan KR, Sax PE, Daar ES, Adimora AA, Eron JJ, and Mugavero MJ. 2018. Generalizing evidence from randomized trials using inverse probability of sampling weights. Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (4):1193–209. doi: 10.1111/rssa.12357.
Cattaneo MD 2010. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155 (2):138–54. doi: 10.1016/j.jeconom.2009.09.023.
Chen J, Sitter RR, and Wu C. 2002. Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika 89 (1):230–7. doi: 10.1093/biomet/89.1.230.
Cole SR, and Hernán MA. 2008. Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology 168 (6):656–64. doi: 10.1093/aje/kwn164.
Cole SR, and Stuart EA. 2010. Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology 172 (1):107–15. doi: 10.1093/aje/kwq084.
Crump RK, Hotz VJ, Imbens GW, and Mitnik OA. 2009. Dealing with limited overlap in estimation of average treatment effects. Biometrika 96 (1):187–99. doi: 10.1093/biomet/asn055.
Dahabreh IJ, Robertson SE, Steingrimsson JA, Stuart EA, and Hernán MA. 2020. Extending inferences from a randomized trial to a new target population. Statistics in Medicine 39 (14):1999–2014. doi: 10.1002/sim.8426.
Dahabreh IJ, Robertson SE, Tchetgen Tchetgen EJ, Stuart EA, and Hernán MA. 2019. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics 75 (2):685–94. doi: 10.1111/biom.13009.
Deville JC, and Särndal CE. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87 (418):376–82. doi: 10.1080/01621459.1992.10475217. [DOI] [Google Scholar]
Fong C, Hazlett C, and Imai K. 2018. Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. The Annals of Applied Statistics 12 (1):156–77. doi: 10.1214/17-AOAS1101. [DOI] [Google Scholar]
Hahn J 1998. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66 (2):315. doi: 10.2307/2998560. [DOI] [Google Scholar]
Hainmueller J 2012. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis 20 (1):25–46. doi: 10.1093/pan/mpr025. [DOI] [Google Scholar]
Hirano K, Imbens GW, and Ridder G. 2003. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71 (4):1161–89. doi: 10.1111/1468-0262.00442. [DOI] [Google Scholar]
Imai K, and Ratkovic M. 2014. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1):243–63. doi: 10.1111/rssb.12027. [DOI] [Google Scholar]
Imbens GW 2000. The role of the propensity score in estimating dose-response functions. Biometrika 87 (3):706–10. doi: 10.1093/biomet/87.3.706. [DOI] [Google Scholar]
Li F, Thomas LE, and Li F. 2019. Addressing extreme propensity scores via the overlap weights. American Journal of Epidemiology 188 (1):250–7. [DOI] [PubMed] [Google Scholar]
Owen A. 2001. Empirical likelihood. New York: Chapman & Hall/CRC Press.
Qin J. 1993. Empirical likelihood in biased sample problems. The Annals of Statistics 21 (3):1182–96. doi: 10.1214/aos/1176349257.
Qin J, and Lawless J. 1994. Empirical likelihood and general estimating equations. The Annals of Statistics 22 (1):300–25. doi: 10.1214/aos/1176325370.
Qin J, and Zhang B. 2007. Empirical-likelihood-based inference in missing response problems and its application in observational studies. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (1):101–22.
Robins JM, Rotnitzky A, and Zhao LP. 1994. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89 (427):846–66. doi: 10.1080/01621459.1994.10476818.
Rosenbaum PR, and Rubin DB. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1):41–55. doi: 10.1093/biomet/70.1.41.
Rubin DB. 1980. Comment on ‘Randomization analysis of experimental data: The Fisher randomization test’ by D. Basu. Journal of the American Statistical Association 75 (371):591–3. doi: 10.2307/2287653.
Stuart EA, Bradshaw CP, and Leaf PJ. 2015. Assessing the generalizability of randomized trial results to target populations. Prevention Science: The Official Journal of the Society for Prevention Research 16 (3):475–85. doi: 10.1007/s11121-014-0513-z.
Stuart EA, Cole SR, Bradshaw CP, and Leaf PJ. 2011. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society) 174 (2):369–86. doi: 10.1111/j.1467-985X.2010.00673.x.
Tipton E. 2013. Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics 38 (3):239–66. doi: 10.3102/1076998612441947.
Wang Y, and Zubizarreta JR. 2020. Minimal dispersion approximately balancing weights: Asymptotic properties and practical considerations. Biometrika 107 (1):93–105.
Westreich D, Edwards JK, Lesko CR, Stuart E, and Cole SR. 2017. Transportability of trial results using inverse odds of sampling weights. American Journal of Epidemiology 186 (8):1010–4. doi: 10.1093/aje/kwx164.
White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50 (1):1–25. doi: 10.2307/1912526.
Zhao Q. 2019. Covariate balancing propensity score by tailored loss functions. The Annals of Statistics 47 (2):965–93.
Zubizarreta JR. 2015. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association 110 (511):910–22. doi: 10.1080/01621459.2015.1023805.