Proximal Causal Inference without Uniqueness Assumptions

Jeffrey Zhang; Wei Li; Wang Miao; Eric Tchetgen Tchetgen

doi:10.1016/j.spl.2023.109836

Author manuscript; available in PMC: 2024 Jul 1.

Published in final edited form as: Stat Probab Lett. 2023 Mar 21;198:109836. doi: 10.1016/j.spl.2023.109836

PMCID: PMC10887303 NIHMSID: NIHMS1884513 PMID: 38405420

Abstract

We consider identification and inference about a counterfactual outcome mean when there is unmeasured confounding using tools from proximal causal inference. Proximal causal inference requires existence of solutions to at least one of two integral equations. We motivate the existence of solutions to the integral equations from proximal causal inference by demonstrating that, assuming the existence of a solution to one of the integral equations, $\sqrt{n}$ -estimability of a mean functional of that solution requires the existence of a solution to the other integral equation. Solutions to the integral equations may not be unique, which complicates estimation and inference. We construct a consistent estimator for the solution set for one of the integral equations and then adapt the theory of extremum estimators to find from the estimated set a consistent estimator for a uniquely defined solution. A debiased estimator is shown to be root- $n$ consistent, regular, and semiparametrically locally efficient under additional regularity conditions.

Keywords: Proximal Causal Inference, √n-estimability

1. Introduction

It is widely acknowledged that unmeasured confounding is pervasive in observational studies, as it is unlikely that an investigator will have measured all confounders of the treatment and outcome. Often, the best one can hope for is that some measured confounders can act as proxies for true, unmeasured confounders. Proximal causal inference was developed to circumvent the issue of unmeasured confounders through the use of suitable proxy variables. See Tchetgen Tchetgen et al. (2020), Shi et al., Miao et al. (2018), and Shi et al. (2020) for a more comprehensive overview of the framework. Proximal causal inference leverages the existence of either a treatment confounding bridge function or an outcome confounding bridge function, which solve certain integral equations (Miao et al. (2018), Cui et al. (2020), Deaner (2018)). Then, the population average treatment effect (ATE) and the average treatment effect on the treated (ATT) are each uniquely identified nonparametrically as a certain linear mean functional of a confounding bridge function, without the latter necessarily being uniquely identified (Miao et al. (2018), Cui et al. (2020), Deaner (2018)). However, construction of a root- $n$ consistent, regular and asymptotically linear semiparametric locally efficient estimator of the ATE or ATT in prior literature has relied exclusively on uniqueness of the bridge functions. In this note, the goal is to investigate estimation and inference of the counterfactual mean, and thus of the ATE and ATT, when uniqueness of solutions to the integral equations does not hold. Specifically, we construct a root- $n$ consistent, regular and asymptotically linear nonparametric estimator of the ATE without requiring uniqueness of the confounding bridge functions. The proposed methods build on methods recently developed in Santos (2011) and Li et al. (2021). In somewhat related settings, the former considers a linear functional of the structural function in (i) the widely studied nonparametric instrumental variable problem (Chen and Pouzo (2012)), while the latter considers (ii) a nonparametric shadow variable framework in a nonignorable missing data problem (D’Haultfoeuille (2010), Miao and Tchetgen Tchetgen (2016)). Proximal identification and inference differ from both of these settings in that the identification challenge requires the use of two proxies, while (i) and (ii) technically require a single proxy, namely a valid instrument in (i) and a valid shadow variable in (ii).

An outline of the paper is as follows: in Section 2, we review identification strategies for the counterfactual mean, and thus the ATE and ATT, from previous works under the proximal framework, and describe sufficient and necessary conditions for identification and root- $n$ estimation of the ATE. In Section 3, we develop an estimator for the counterfactual mean, and therefore for the ATE and ATT, establish the asymptotic theory, and discuss its semiparametric efficiency. We conclude with a discussion in Section 4 and include proofs in the Supplementary Material. While for ease of exposition, all results are given for the counterfactual outcome mean, they equally apply to a broader class of functionals as discussed in the Supplementary Material.

2. Identification

We wish to estimate the effect of a binary treatment $A$ on an outcome $Y$ in a setting where there is unmeasured confounding. Define $Y (a)$ for $a = 0, 1$ to be the potential outcomes of the response if treatment had been externally set to $a$ . The overall goal is to estimate the average treatment effect, $E [Y (1) - Y (0)]$ . Let $U$ be an unmeasured confounder and $L$ a vector of observed covariates. First, consider the following familiar assumptions:

Assumption 1.

(Consistency) $Y = Y (A)$ almost surely.

Assumption 2.

(Positivity) $0 < ℙ (A = a ∣ L) < 1$ almost surely, $a = 0, 1$ .

Assumption 3.

(Exchangeability) $Y (a) ⊥ A ∣ L$ for $a = 0, 1$ .

Under these assumptions, the average treatment effect is identified. However, the exchangeability assumption requires that there be no unmeasured confounders, an assumption that is often untenable in observational study settings. Instead, we adopt the recent proximal causal inference framework wherein we require there to be a treatment confounding proxy $Z$ and an outcome confounding proxy $W$ . This leads to the following assumptions as introduced by Cui et al. (2020):

Assumption 4.

(Latent Unconfoundedness)

(Z, A) ⊥ (Y (a), W) ∣ U, X \quad \text{for } a = 0, 1

Assumption 5.

(Positivity) $0 < ℙ (A = a ∣ U, X) < 1$ almost surely, $a = 0, 1$ .

Assumption 6.

(Completeness 1) For any $g (U)$ square-integrable, $a, x,$ if $E [g (U) ∣ Z, A = a, X = x] = 0$ almost surely, then $g (U) = 0$ almost surely.

There are two ways to identify the counterfactual mean $E [Y (a)]$ . The first method is given by the following theorem:

Theorem 2.1.

(Miao et al. (2018)) Suppose that there exists an outcome confounding bridge function $h (w, a, x)$ that solves the following integral equation

E [Y ∣ Z, A, X] = \int h (w, A, X) d F (w ∣ Z, A, X)

(1)

almost surely. Under Assumptions 1, 4, 5, and 6, one has that

E [Y (a)] = \int_{𝓧} \int h (w, a, x) d F (w ∣ x) d F (x)

(2)
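To make Theorem 2.1 concrete, the following sketch works through a fully discrete toy case with binary $U$ , $Z$ , $W$ and no covariates $X$ , in which integral equation 1 reduces to a 2 × 2 linear system for each treatment level. All probabilities and names below (`p_u`, `p_z1_ua`, `bridge_mean`, and so on) are hypothetical values chosen for illustration; they are assumptions of this sketch and are not taken from the source.

```python
import numpy as np

# Hypothetical data-generating law (illustrative values only).
p_u = np.array([0.6, 0.4])            # P(U = u)
p_a1_u = np.array([0.3, 0.7])         # P(A = 1 | U = u)
p_z1_ua = np.array([[0.2, 0.3],       # P(Z = 1 | U = u, A = a); rows u, columns a
                    [0.7, 0.8]])
p_w1_u = np.array([0.25, 0.75])       # P(W = 1 | U = u); W depends on U only
m_ua = np.array([[1.0, 2.0],          # E[Y | U = u, A = a]; rows u, columns a
                 [3.0, 5.0]])


def bridge_mean(a):
    """Solve equation 1 for h(., a) from observed-data quantities, then apply equation 2."""
    p_a_u = p_a1_u if a == 1 else 1.0 - p_a1_u
    p_z_u = np.stack([1.0 - p_z1_ua[:, a], p_z1_ua[:, a]], axis=1)  # rows u, columns z
    joint = (p_u * p_a_u)[:, None] * p_z_u                          # P(U = u, Z = z, A = a)
    p_u_given_z = (joint / joint.sum(axis=0)).T                     # rows z, columns u
    p_w_u = np.stack([1.0 - p_w1_u, p_w1_u], axis=1)                # rows u, columns w
    p_w_given_z = p_u_given_z @ p_w_u                               # P(W = w | Z = z, A = a)
    ey_given_z = p_u_given_z @ m_ua[:, a]                           # E[Y | Z = z, A = a]
    h = np.linalg.solve(p_w_given_z, ey_given_z)                    # bridge values h(w, a)
    p_w = p_u @ p_w_u                                               # marginal P(W = w)
    return float(h @ p_w)


def true_mean(a):
    """Counterfactual mean E[Y(a)], computable here only because U is known in the toy law."""
    return float(p_u @ m_ua[:, a])


def naive_mean(a):
    """Unadjusted E[Y | A = a], which is confounded by U."""
    p_a_u = p_a1_u if a == 1 else 1.0 - p_a1_u
    weights = p_u * p_a_u
    return float((weights / weights.sum()) @ m_ua[:, a])
```

In this toy law, `bridge_mean(a)` agrees with `true_mean(a)` while `naive_mean(a)` does not, which is the behavior equation 2 predicts when the conditional probability matrices involved are invertible (the discrete analogue of the completeness condition).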

Assumption 7.

(Completeness 2) For any $g (U)$ square-integrable, $a, x,$ if $E [g (U) ∣ W, A = a, X = x] = 0$ almost surely, then $g (U) = 0$ almost surely.

Using this second completeness assumption, the following theorem provides an alternative identification scheme:

Theorem 2.2.

(Cui et al. (2020)) Suppose that there exists a treatment confounding bridge function $q (z, a, x)$ that solves the integral equation:

E [q (Z, a, X) ∣ W, A = a, X] = \frac{1}{f (A = a ∣ W, X)}

(3)

Then under Assumptions 1, 4, 5, and 7, one has

E [Y (a)] = \int_{𝓧} \int I (\tilde{a} = a) q (z, a, x) y d F (y, z, \tilde{a} ∣ x) d F (x)

(4)

In the above theorems, only existence of an outcome confounding bridge function $h$ or treatment confounding bridge function $q$ is required for identification of the ATE; they need not be unique. Suppose that one has observed $n$ i.i.d. data samples consisting of variables $(A, Z, X, W, Y)$ . Let $L_{2} (Y, W, Z, A, X)$ denote the space of real valued functions of $(Y, W, Z, A, X)$ that are square integrable with respect to the distribution of $(Y, W, Z, A, X)$ , equipped with the inner product $〈 f_{1}, f_{2} 〉 : = E [f_{1} f_{2}]$ . For any bounded linear map $T$ , define $𝓓 (T), ℛ (T), 𝓝 (T), T^{'}$ to be the domain, range, null space, and adjoint of $T$ , respectively. Let $S^{⊥}$ be the orthocomplement of a set $S$ . Let $T_{o} : L_{2} (W, A, X) \to L_{2} (Z, A, X)$ where $T_{o} g (W, A, X) = E [g (W, A, X) ∣ Z, A, X]$ . Let $T_{t} : L_{2} (Z, A, X) \to L_{2} (W, A, X)$ where $T_{t} g (Z, A, X) = E [g (Z, A, X) ∣ W, A, X]$ .

Before proceeding, we provide a purely statistical motivation for the integral equations 1 and 3. The following therefore requires no reference to an unmeasured confounder $U$ and makes neither assumption 2.1 nor 2.2. Consider the following two scenarios:

Suppose (1) holds, i.e. there exists a function $h \in L_{2} (W, A, X)$ such that $E [Y ∣ Z, A, X] = E [h (W, A, X) ∣ Z, A, X]$
Suppose (3) holds, i.e. there exists a function $q \in L_{2} (Z, A, X)$ such that $\frac{1}{f (A ∣ W, X)} = E [q (Z, A, X) ∣ W, A, X]$

Under the first scenario, consider the problem of estimating a functional of the following form:

β_{o} = E [ϕ_{o} (W, A, X) h (W, A, X)]

(5)

where $ϕ_{o}$ is a known function in $L_{2} (W, A, X)$ . Let $T_{o} : L_{2} (W, A, X) \to L_{2} (Z, A, X)$ where $T_{o} g (W, A, X) = E [g (W, A, X) ∣ Z, A, X]$ . Then we have the following:

Proposition 2.3.

Under the assumption that 1 holds, $β_{o}$ is identified iff $ϕ_{o} \in 𝓝 {(T_{o})}^{⊥}$ .

Proof. First, suppose $β_{o}$ is identified. Consider $h_{1}, h_{2}$ that satisfy Equation 1. Note that this implies that $h_{1} - h_{2} \in 𝓝 (T_{o})$ . Since $β_{o}$ is identified, both $h_{1}$ and $h_{2}$ yield the same value of $β_{o}$ . Thus, we have that

0 = E [ϕ_{o} (W, A, X) (h_{1} - h_{2})]

and so $ϕ_{o} (W, A, X) \in 𝓝 {(T_{o})}^{⊥}$ since $h_{1} - h_{2}$ is an arbitrary element of $𝓝 (T_{o})$ . Conversely, suppose $ϕ_{o} (W, A, X) \in 𝓝 {(T_{o})}^{⊥}$ . Then for any $h_{1}$ and $h_{2}$ that satisfy Equation 1, we have $h_{1} - h_{2} \in 𝓝 (T_{o})$ and so $E [ϕ_{o} (W, A, X) h_{1}] = E [ϕ_{o} (W, A, X) h_{2}]$ and so $β_{o}$ is identified. □
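A finite-dimensional analogue may help fix ideas for Proposition 2.3. In the sketch below, a hypothetical 2 × 3 matrix `T` stands in for $T_{o}$ and the Euclidean inner product stands in for $E [f_{1} f_{2}]$ ; both are simplifying assumptions made only for illustration. Two different solutions of the same linear equation give the same value of the functional exactly when the weight vector is orthogonal to the null space of `T`.

```python
import numpy as np

# Hypothetical stand-in for T_o: maps h in R^3 to R^2, so it has a nontrivial null space.
T = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
null_vec = np.array([1.0, -1.0, 1.0])      # spans the null space: T @ null_vec = 0

h1 = np.array([1.0, 2.0, 0.0])             # one solution of T h = T h1
h2 = h1 + 2.5 * null_vec                   # a different solution of the same equation

phi_identified = T.T @ np.array([0.5, -1.0])   # lies in the range of the adjoint of T
phi_not_identified = np.array([1.0, 0.0, 0.0]) # not orthogonal to the null space


def functional(phi, h):
    """Finite-dimensional analogue of beta_o = E[phi_o h]."""
    return float(phi @ h)
```

Here `functional(phi_identified, h)` is the same for `h1` and `h2`, whereas `functional(phi_not_identified, h)` changes with the chosen solution. In finite dimensions the range of the adjoint is automatically closed, so the distinction between $cl (ℛ (T_{o}^{'}))$ and $ℛ (T_{o}^{'})$ discussed next only arises in the infinite-dimensional setting.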

Note that $𝓝 {(T_{o})}^{⊥} = cl (ℛ (T_{o}^{'}))$ . However, the following lemma establishes that the stronger condition $ϕ_{o} (W, A, X) \in ℛ (T_{o}^{'})$ is necessary for $β_{o}$ to be $\sqrt{n}$ -estimable.

Lemma 2.4.

Assuming equation 1 holds and additional regularity conditions described in the Supplementary Material, $ϕ_{o} (W, A, X) \in ℛ (T_{o}^{'})$ is necessary for $β_{o}$ to be $\sqrt{n}$ -estimable.

This result is analogous to a result derived in Severini and Tripathi (2012) in the non-parametric instrumental variables context. Next, note that

E [h (W, a, X)] = E [E [h (W, a, X) ∣ W, X]] = E [E [h (W, a, X) I (A = a) / ℙ (A = a ∣ W, X) ∣ W, X]] = E [h (W, a, X) / ℙ (A = a ∣ W, X) I (A = a)] = E [h (W, A, X) / ℙ (A = a ∣ W, X) I (A = a)]

which is in the form of equation 5 with $ϕ_{o} (W, A, X) = I (A = a) / ℙ (A = a ∣ W, X)$ which for current purposes may be assumed known. Lemma 2.4 thus implies that for $E [h (W, a, X)]$ to be $\sqrt{n}$ estimable, there must be a function $q (Z, A, X)$ that satisfies

E [q (Z, A, X) ∣ W, A, X] = I (A = a) / ℙ (A = a ∣ W, X)

This corresponds to the condition from Equation 3. Likewise, consider the problem of estimating a functional of the following form:

β_{t} = E [ϕ_{t} (Z, A, X) q (Z, A, X)]

(6)

where $ϕ_{t}$ is a known function in $L_{2} (Z, A, X)$ . Write $T_{t} : L_{2} (Z, A, X) \to L_{2} (W, A, X)$ where $T_{t} g (Z, A, X) = E [g (Z, A, X) ∣ W, A, X]$ . Analogous to Proposition 2.3, we have the following:

Proposition 2.5.

Under the assumption that 3 holds, $β_{t}$ is identified iff $ϕ_{t} \in 𝓝 {(T_{t})}^{⊥}$ .

Proof. Note that for any $q_{1}$ and $q_{2}$ that satisfy 3, we must have $q_{1} - q_{2} \in 𝓝 (T_{t})$ . Then the argument follows in the exact same manner as in Proposition 2.3. □

As above, it is possible to establish that $ϕ_{t} \in ℛ (T_{t}^{'})$ is necessary for $β_{t}$ to be $\sqrt{n}$ -estimable.

Lemma 2.6.

Assuming equation 3 holds and additional regularity conditions described in the Supplementary Material, $ϕ_{t} (Z, A, X) \in ℛ (T_{t}^{'})$ is necessary for $β_{t}$ to be $\sqrt{n}$ -estimable.

Next, observe that from Equation 4, we have

E [I (A = a) q (Z, a, X) Y] = E [E [I (A = a) q (Z, a, X) Y ∣ Z, A, X]] = E [I (A = a) q (Z, a, X) E [Y ∣ Z, A, X]] = E [I (A = a) q (Z, A, X) E [Y ∣ Z, A, X]]

which is in the form of equation 6 with $ϕ_{t} (Z, A, X) = I (A = a) E [Y ∣ Z, A, X]$ which for current purposes may be assumed known. Lemma 2.6 thus implies that for $E [I (A = a) q (Z, A, X) Y]$ to be $\sqrt{n}$ estimable, there must be a function $h (W, A, X)$ such that

E [h (W, A, X) ∣ Z, A, X] = I (A = a) E [Y ∣ Z, A, X]

This corresponds to the condition from Equation 1. We may conclude that, taking as a primitive condition that a solution to Equation 1 exists everywhere in the model, i.e. at all laws included in the semiparametric model, identification and root- $n$ estimation of the counterfactual outcome mean necessarily imply that a solution to Equation 3 exists at the true data generating law. Conversely, taking as a primitive condition that a solution to Equation 3 exists at all laws of the semiparametric model, identification and root- $n$ estimation of the counterfactual outcome mean necessarily imply that a solution to Equation 1 exists at the true data generating law.

The present setting differs from the shadow variable missing data setting studied in Li et al. (2021) in ways worth discussing. In the current setting, we aim to account for the presence of an unmeasured confounder $U$ , and the key identifying Assumption 4 involves this latent variable together with two fully observed auxiliary factors in the form of a pair of proxies $Z$ and $W$ , each of which plays a specific role. In contrast, identification in a shadow variable setting does not require invoking a latent factor, and requires only a single fully observed auxiliary variable which satisfies a certain conditional independence condition (Li et al. (2021)). Despite these differences, our paper demonstrates that the analytic framework of Li et al. (2021) readily extends to the proximal causal inference setting. We further establish in the Supplementary Material that the approach applies to a general class of doubly robust functionals studied by Ghassami et al. (2022), for which a pair of nuisance functions is defined as solutions to Fredholm integral equations.

The above propositions give some motivation for assuming the existence of solutions to the Fredholm integral equations in Theorems 2.1 and 2.2. In the next section, we describe an estimation strategy for the counterfactual mean without assuming a unique $h$ or $q$ that solves the integral equations.

3. Estimation Strategy

We follow estimation strategies from Santos (2011) and Li et al. (2021). By the above discussion, it is sensible to construct solution sets for either of the Fredholm integral equations 1 and 3. However, estimating the solution set for the latter requires an estimate of the propensity score. Thus, we consider estimating the solution set of equation 1. First, let $𝓗$ be a set of smooth functions. Define the solution set of the Fredholm integral equation 1 as follows:

𝓗_{0} = {h \in 𝓗 : E [h (W, A, X) ∣ Z, A, X] = E [Y ∣ Z, A, X]}

(7)

Under the assumptions from Theorem 2.1 and Theorem 2.2, $E [h (W, a, X)]$ has a causal interpretation as the counterfactual mean $E [Y (a)]$ . Under these assumptions, to estimate $μ_{a} : = E [Y (a)]$ , we first construct a consistent estimator ${\hat{𝓗}}_{0}$ for the set $𝓗_{0}$ ; next, we choose a specific ${\hat{h}}_{0} \in {\hat{𝓗}}_{0}$ so that it is a consistent estimator for a fixed element $h_{0} \in 𝓗_{0}$ .

3.1. Estimation of solution sets

Define the criterion function

C (h) = E [{E [Y - h (W, A, X) ∣ Z, A, X]}^{2}]

In practice, the estimation of the solution set can be done in the two arms separately, for example, by taking the criterion function $C (h) = E [I (A = a) {E [Y - h (W, a, X) ∣ Z, A = a, X]}^{2}]$ . Note that

𝓗_{0} = {h \in 𝓗 : C (h) = 0}

i.e., the solution set of the Fredholm integral equation consists of the zeros of the criterion function. To proceed with estimation, we adopt a two-stage approach. We aim to construct a sample analogue $C_{n}$ of the criterion function $C$ . We let $𝓗_{n}$ be a sieve for $𝓗$ . Specifically, for a known sequence of approximating functions $\{ψ_{m} (w, a, x)\}_{m = 1}^{\infty}$ , let

𝓗_{n} = {h \in 𝓗 : h (w, a, x) = \sum_{m = 1}^{m_{n}} β_{m} ψ_{m} (w, a, x)}

(8)

with $β$ and $h$ unknown and $m_{n}$ known. To construct $C_{n}$ , we require a nonparametric estimator of conditional expectations. For this, let $\{ϕ_{k} (z, a, x)\}_{k = 1}^{\infty}$ be a known sequence of approximating functions. Denote

ϕ (z, a, x) = {ϕ_{1} (z, a, x), \dots, ϕ_{k_{n}} (z, a, x)}^{T}

and let

Φ = {ϕ (Z_{1}, A_{1}, X_{1}), \dots, ϕ (Z_{n}, A_{n}, X_{n})}^{T}

For a generic random variable $B = B (W, A, X, Y)$ with realizations $\{B_{i} = B (W_{i}, A_{i}, X_{i}, Y_{i})\}_{i = 1}^{n}$ , the nonparametric sieve estimator of $E [B ∣ A, Z, X]$ is

\hat{E} (B ∣ A, Z, X) = ϕ^{T} (Z, A, X) {(Φ^{T} Φ)}^{- 1} \sum_{i = 1}^{n} ϕ (Z_{i}, A_{i}, X_{i}) B_{i}

(9)

The sample analogue $C_{n} (h)$ is then

C_{n} (h) = \frac{1}{n} \sum_{i = 1}^{n} {\hat{e}}^{2} (Z_{i}, A_{i}, X_{i}, h)

(10)

where

\hat{e} (Z_{i}, A_{i}, X_{i}, h) = \hat{E} [Y - h (W, A, X) ∣ A_{i}, Z_{i}, X_{i}]

(11)

Then the proposed estimator of $𝓗_{0}$ is ${\hat{𝓗}}_{0} = {h \in 𝓗_{n} : C_{n} (h) \leq c_{n}}$ where $c_{n}$ is an appropriately chosen sequence that tends to 0.
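The construction in equations 9 to 11 can be sketched numerically as follows. This is a minimal illustration under several assumptions that are not in the source: a single treatment arm with no covariates, polynomial bases for both sieves, simulated Gaussian data in which $h (w) = w$ happens to solve the integral equation, and a fixed tolerance `c_n` picked by hand. The function names (`sieve_fit`, `criterion`, `in_estimated_set`) are hypothetical.

```python
import numpy as np


def sieve_fit(Phi, b):
    """Least-squares fitted values Phi (Phi^T Phi)^{-1} Phi^T b, as in the sieve regression (9)."""
    coef, *_ = np.linalg.lstsq(Phi, b, rcond=None)
    return Phi @ coef


def criterion(beta, Psi, y, Phi):
    """Sample criterion C_n(h) in (10) for the sieve element h = Psi @ beta."""
    e_hat = sieve_fit(Phi, y - Psi @ beta)   # estimated E[Y - h | Z], as in (11)
    return float(np.mean(e_hat ** 2))


def in_estimated_set(beta, Psi, y, Phi, c_n):
    """Membership check for the estimated solution set {h : C_n(h) <= c_n}."""
    return criterion(beta, Psi, y, Phi) <= c_n


# Simulated data (illustrative): U is latent, and Z, W, Y are noisy copies of U.
rng = np.random.default_rng(0)
n = 2000
u = rng.normal(size=n)
z = u + rng.normal(size=n)
w = u + rng.normal(size=n)
y = u + rng.normal(size=n)

Psi = np.column_stack([np.ones(n), w, w ** 2])          # sieve basis for h(w), m_n = 3
Phi = np.column_stack([np.ones(n), z, z ** 2, z ** 3])  # basis for conditioning on z, k_n = 4
c_n = 0.05                                              # hypothetical tolerance

beta_solution = np.array([0.0, 1.0, 0.0])  # h(w) = w satisfies E[Y - h(W) | Z] = 0 in this design
beta_zero = np.zeros(3)                    # h = 0 does not
```

In this design `criterion(beta_solution, Psi, y, Phi)` is close to zero while `criterion(beta_zero, Psi, y, Phi)` is not, so only the former passes `in_estimated_set`. How $c_{n}$ should shrink with $n$ is governed by the rate conditions in the text; the constant used here is only a placeholder.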

3.2. Set consistency

In this section, we establish the set consistency of ${\hat{𝓗}}_{0}$ for $𝓗_{0}$ under Hausdorff distances. For this, define the Hausdorff distance between two sets $𝓗_{1}, 𝓗_{2} \subset 𝓗$ as

d_{H} (𝓗_{1}, 𝓗_{2}, ‖ \cdot ‖) = \max {d (𝓗_{1}, 𝓗_{2}), d (𝓗_{2}, 𝓗_{1})}

where $d (𝓗_{1}, 𝓗_{2}) = \sup_{h_{1} \in 𝓗_{1}} \inf_{h_{2} \in 𝓗_{2}} ‖ h_{1} - h_{2} ‖$ and $‖ \cdot ‖$ is a given norm. Consider the following two norms:

‖ h ‖_{w}^{2} = E [{E [h (W, A, X) ∣ Z, A, X]}^{2}] , \quad ‖ h ‖_{\infty} = \sup_{w, a, x} | h (w, a, x) |
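As a concrete reading of $d_{H}$ , the sketch below evaluates the Hausdorff distance under the supremum norm for two finite sets of functions, with each function stored as its values on a common grid. The grid representation and the names `directed_distance` and `hausdorff` are illustrative assumptions; the sets $𝓗_{0}$ and ${\hat{𝓗}}_{0}$ in the text are in general infinite.

```python
import numpy as np


def directed_distance(set1, set2):
    """sup over h1 in set1 of inf over h2 in set2 of max_grid |h1 - h2|.

    Each set is a 2-D array whose rows are functions evaluated on a shared grid.
    """
    pairwise = np.abs(set1[:, None, :] - set2[None, :, :]).max(axis=2)
    return float(pairwise.min(axis=1).max())


def hausdorff(set1, set2):
    """Hausdorff distance: the larger of the two directed distances."""
    return max(directed_distance(set1, set2), directed_distance(set2, set1))


# Two small sets of functions on a two-point grid.
set_a = np.array([[0.0, 0.0],
                  [1.0, 1.0]])
set_b = np.array([[0.0, 0.0]])
```

For these inputs every element of `set_b` is also in `set_a`, so `directed_distance(set_b, set_a)` is zero, while the function with constant value 1 in `set_a` is at sup-norm distance 1 from `set_b`. The asymmetry of the directed distance is why the maximum over both directions is taken.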

Notice that any $h, h_{0} \in 𝓗_{0}$ satisfy 1, so it holds that for any ${\hat{h}}_{0} \in {\hat{𝓗}}_{0}$ and any $h, h_{0} \in 𝓗_{0}, {‖ {\hat{h}}_{0} - h_{0} ‖}_{w} = {‖ {\hat{h}}_{0} - h ‖}_{w}$ and so

{‖ {\hat{h}}_{0} - h_{0} ‖}_{w} = \inf_{h \in 𝓗_{0}} {‖ {\hat{h}}_{0} - h ‖}_{w} \leq d_{H} ({\hat{𝓗}}_{0}, 𝓗_{0}, ‖ \cdot ‖_{w})

(12)

Thus, we can calculate the convergence rate of ${‖ {\hat{h}}_{0} - h_{0} ‖}_{w}$ by finding the convergence rate of $d_{H} ({\hat{𝓗}}_{0}, 𝓗_{0}, ‖ \cdot ‖_{w})$ . We will need to consistently estimate $𝓗_{0}$ under the supremum norm because under the $‖ \cdot ‖_{w}$ norm, elements of $𝓗_{0}$ form an equivalence class. We will require several assumptions.

Assumption 8.

The vector of covariates $X \in ℝ^{d}$ has support ${[0, 1]}^{d}$ , and outcome $Y \in ℝ$ and proxies $Z, W \in ℝ$ have compact support.

Definition 1.

For a generic function $ρ (ω)$ defined on $ω \in ℝ^{d}$ we define

‖ ρ ‖_{\infty, α} = \max_{| λ | \leq \underline{α}} \sup_{ω} | D^{λ} ρ (ω) | + \max_{| λ | = \underline{α}} \sup_{ω \neq ω^{'}} \frac{| D^{λ} ρ (ω) - D^{λ} ρ (ω^{'}) |}{{‖ ω - ω^{'} ‖}^{α - \underline{α}}}

where $λ$ is a $d$ -dimensional vector of nonnegative integers, $| λ | = \sum_{i = 1}^{d} λ_{i}$ , $\underline{α}$ denotes the largest integer smaller than $α$ , $D^{λ} ρ (ω) = \partial^{| λ |} ρ (ω) / \partial ω_{1}^{λ_{1}} \dots \partial ω_{d}^{λ_{d}}$ , and $D^{0} ρ (ω) = ρ (ω)$ .

Assumption 9.

The following conditions hold:

(i) $\sup_{h \in 𝓗} ‖ h ‖_{\infty, α} < \infty$ for some $α > (d + 1) / 2$ ; $𝓗_{0} \neq \emptyset$ ; and $𝓗_{n}$ and $𝓗$ are closed.
(ii) For every $h \in 𝓗$ , there is a $Π_{n} h \in 𝓗_{n}$ such that $\sup_{h \in 𝓗} {‖ h - Π_{n} h ‖}_{\infty} = O (η_{n})$ for some $η_{n} = o (1)$ .

Assumption 10.

The following conditions hold:

(i) The smallest and largest eigenvalues of $E [ϕ (Z, A, X) ϕ {(Z, A, X)}^{T}]$ are bounded above and away from zero for all $k_{n}$ .
(ii) For every $h \in 𝓗$ , there is a $π_{n} (h) \in ℝ^{k_{n}}$ such that $\sup_{h \in 𝓗} {‖ E [h (W, A, X) ∣ Z = z, A = a, X = x] - ϕ^{T} (z, a, x) π_{n} (h) ‖}_{∞} = O (k_{n}^{- \frac{α}{d + 1}})$ .
(iii) $ξ_{n}^{2} k_{n} = o (n)$ , where $ξ_{n} = \sup_{z, a, x} ‖ ϕ (z, a, x) ‖_{2}$ .

Then we have the following proposition:

Proposition 3.1.

Suppose that Assumptions 8-10 hold. If $a_{n} = O (λ_{n}^{- 1}), b_{n} \to \infty$ , and $b_{n} = o (a_{n})$ , then

d_{H} ({\hat{𝓗}}_{0}, 𝓗_{0}, ‖ \cdot ‖_{\infty}) = o_{p} (1) and d_{H} ({\hat{𝓗}}_{0}, 𝓗_{0}, ‖ \cdot ‖_{w}) = O_{p} (c_{n}^{1 / 2})

After obtaining a consistent estimator ${\hat{𝓗}}_{0}$ of $𝓗_{0}$ , we select a specific estimator from ${\hat{𝓗}}_{0}$ such that it converges to a unique element in $𝓗_{0}$ .

3.3. A representer-based estimator

To do so, we let $M : 𝓗 \to ℝ$ be a population criterion function that attains a unique minimum $h_{0}$ on $𝓗_{0}$ and $M_{n} (h)$ its sample analogue. We then select

{\hat{h}}_{0} \in {argmin}_{h \in {\hat{𝓗}}_{0}} M_{n} (h)

(13)

We make the following assumption about $M$ :

Assumption 11.

The function set $𝓗$ is convex; the functional $M : 𝓗 \to ℝ$ is strictly convex and attains a unique minimum at $h_{0}$ on $𝓗_{0}$ ; its sample analogue $M_{n} : 𝓗 \to ℝ$ is continuous and $\sup_{h \in 𝓗} | M_{n} (h) - M (h) | = o_{p} (1)$ .

A sensible choice for $M$ is the squared length $M (h) = E [h {(W, A, X)}^{2}]$ with corresponding sample analog $M_{n} (h) = \frac{1}{n} \sum_{i = 1}^{n} h {(W_{i}, A_{i}, X_{i})}^{2}$ .
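The role of a strictly convex $M$ can be illustrated in finite dimensions. In the sketch below, a hypothetical matrix `T` and vector `y` define an underdetermined linear system whose solutions form a line, and the squared Euclidean norm plays the role of $M (h) = E [h {(W, A, X)}^{2}]$ . These stand-ins are assumptions made for illustration; the estimator in the text minimizes $M_{n}$ over the estimated set ${\hat{𝓗}}_{0}$ rather than over an exact solution set.

```python
import numpy as np

# Hypothetical underdetermined system T h = y with a one-dimensional null space.
T = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
y = np.array([3.0, 2.0])
null_vec = np.array([1.0, -1.0, 1.0])   # T @ null_vec = 0

# The Moore-Penrose pseudoinverse returns the solution of smallest Euclidean norm.
h_min = np.linalg.pinv(T) @ y

# Other exact solutions are obtained by moving along the null space.
other_solutions = [h_min + t * null_vec for t in (-1.0, 0.5, 2.0)]
```

Every vector in `other_solutions` solves the same system as `h_min` but has strictly larger norm, since `h_min` is orthogonal to the null space. This is the sense in which minimizing a strictly convex criterion singles out one element of a non-unique solution set.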

Proposition 3.2.

Suppose that assumptions 8-11 hold. Then

{‖ {\hat{h}}_{0} - h_{0} ‖}_{\infty} = o_{p} (1)

where ${\hat{h}}_{0}$ is defined in equation 13. If $a_{n} = O (λ_{n}^{- 1}), b_{n} \to \infty$ and $b_{n} = o (a_{n})$ , we then have

{‖ {\hat{h}}_{0} - h_{0} ‖}_{w} = O_{p} (c_{n}^{1 / 2})

Based on the identification condition $μ_{a} = E [Y (a)] = E [h_{0} (W, a, X)]$ , we propose the following estimator for $μ_{a}$ :

{\hat{μ}}_{a} = \frac{1}{n} \sum_{i = 1}^{n} {{\hat{h}}_{0} (W_{i}, a, X_{i})}

(14)

To facilitate analysis of the asymptotic properties of the estimator, let $\bar{𝓗}$ be the closure of the linear span of $𝓗$ under $‖ \cdot ‖_{w}$ , and define

{〈 h_{1}, h_{2} 〉}_{w} = E [I (A = a) E {h_{1} (W, A, X) ∣ A, Z, X} E {h_{2} (W, A, X) ∣ A, Z, X}]

Next, we require the following representer assumption:

Assumption 12.

The following conditions hold:

(i) There exists a function $g_{0} \in 𝓗$ such that ${〈 g_{0}, h 〉}_{w} = E {h (W, a, X)}$ for all $h \in \bar{𝓗}$ .
(ii) $η_{n} = o (n^{- 1 / 3}), k_{n}^{- 3 α / (d + 1)} = o (n^{- 1}), k_{n}^{3} = o (n), ξ_{n}^{2} k_{n}^{2} = o (n)$ , and $ξ_{n}^{2} k_{n}^{- 2 α / (d + 1)} = o (1)$ .

Observe that any $g_{0}$ that satisfies Assumption 12(i) satisfies the following for all $h \in \bar{𝓗}$ :

E [I (A = a) E {g_{0} (W, A, X) ∣ A, Z, X} h (W, A, X)] = E [I (A = a) / P (A = a ∣ W, X) h (W, A, X)]

(15)

Then suppose that

E [I (A = a) E {g_{0} (W, A, X) ∣ A, Z, X} - I (A = a) / P (A = a ∣ W, X) ∣ W, A, X] \in \bar{𝓗}

Using 15, this implies that

E [I (A = a) (E {g_{0} (W, A, X) ∣ A, Z, X} - 1 / P (A = a ∣ W, X)) ∣ W, A, X] = 0 .

In other words, $E {g_{0} (W, A, X) ∣ A, Z, X}$ solves the integral equation 3. Thus, Assumption 12(i) can be viewed as a strengthening of the assumption that there exists a function $q (Z, A, X)$ that solves Equation 3. As in Santos (2011), the function $g_{0}$ is unique up to an equivalence class. Now we can address the asymptotic expansion of our estimator:

Theorem 3.3.

Suppose that assumptions 8-12 hold. Then we have that

\sqrt{n} ({\hat{μ}}_{a} - μ_{a}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} [h_{0} (W_{i}, a, X_{i}) - μ_{a} + I (A_{i} = a) E {g_{0} (W, A, X) ∣ A_{i}, Z_{i}, X_{i}} \times (Y_{i} - h_{0} (W_{i}, A_{i}, X_{i}))] - \sqrt{n} r_{n} ({\hat{h}}_{0}) + o_{p} (1)

where

r_{n} ({\hat{h}}_{0}) = \frac{1}{n} \sum_{i = 1}^{n} I (A_{i} = a) \hat{E} {Π_{n} g_{0} (W, A, X) ∣ A_{i}, Z_{i}, X_{i}} \hat{e} (Z_{i}, A_{i}, X_{i}, {\hat{h}}_{0})

(16)

To get an asymptotically normal estimator, we may de-bias the estimator, which requires estimating the term $r_{n} ({\hat{h}}_{0})$ , which may not be asymptotically negligible.

3.4. A de-biased estimator

First, we define a new criterion function

R (h) = E [I (A = a) {E [h (W, A, X) ∣ Z, A, X]}^{2}] - 2 E {h (W, a, X)}, \quad h \in 𝓗

and the sample analog

R_{n} (h) = \frac{1}{n} \sum_{i = 1}^{n} I (A_{i} = a) {[\hat{E} {h (W, A, X) ∣ Z_{i}, A_{i}, X_{i}}]}^{2} - \frac{2}{n} \sum_{i = 1}^{n} h (W_{i}, a, X_{i}), \quad h \in 𝓗

Observe that since ${〈 g_{0}, h 〉}_{w} = E {h (W, a, X)}$ , we have that $R (h) = {‖ h - g_{0} ‖}_{w}^{2} - {‖ g_{0} ‖}_{w}^{2}$ . It follows that $g_{0}$ is the unique minimizer of the mapping $h \to R (h)$ . Then since $g_{0}$ is close to $Π_{n} g_{0}$ by assumption 9(ii), we can estimate the term $Π_{n} g_{0}$ by

\hat{g} \in {argmin}_{h \in 𝓗_{n}} R_{n} (h)

(17)
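When the sieve is linear and no further constraints are imposed, $R_{n}$ is a convex quadratic in the sieve coefficients, so the minimization in equation 17 has a closed form. The sketch below illustrates this under simplifying assumptions that are not in the source: no covariates, arm-specific fitting (the conditional expectation is estimated using only units with $A = a$ ), polynomial bases, simulated data, and no smoothness restriction from $𝓗$ . The names `sieve_fit`, `r_n`, and `fit_g_hat` are hypothetical.

```python
import numpy as np


def sieve_fit(Phi, B):
    """Least-squares fitted values of B (vector or matrix) on the columns of Phi."""
    coef, *_ = np.linalg.lstsq(Phi, B, rcond=None)
    return Phi @ coef


def r_n(beta, Psi, Phi, arm):
    """Sample criterion R_n(h) for h(w) = Psi @ beta, with the projection fitted within the arm."""
    fitted = sieve_fit(Phi[arm], Psi[arm] @ beta)        # estimated E{h | Z} for units with A = a
    return float(np.sum(fitted ** 2) / len(arm) - 2.0 * np.mean(Psi @ beta))


def fit_g_hat(Psi, Phi, arm):
    """Coefficients minimizing R_n: solve G beta = b from the first-order condition."""
    fitted = sieve_fit(Phi[arm], Psi[arm])               # column m: fitted E{psi_m | Z}
    G = fitted.T @ fitted                                # quadratic part of n * R_n
    b = Psi.sum(axis=0)                                  # linear part: sum_i psi(W_i) over all units
    beta, *_ = np.linalg.lstsq(G, b, rcond=None)
    return beta


# Simulated data (illustrative only).
rng = np.random.default_rng(0)
n = 400
u = rng.normal(size=n)
arm = rng.random(n) < 1.0 / (1.0 + np.exp(-u))           # boolean indicator of A = a
z = u + rng.normal(size=n)
w = u + rng.normal(size=n)
Psi = np.column_stack([np.ones(n), w, w ** 2])
Phi = np.column_stack([np.ones(n), z, z ** 2, z ** 3])

beta_g = fit_g_hat(Psi, Phi, arm)
```

Perturbing `beta_g` in any coordinate direction does not decrease `r_n`, which is a simple numerical check of the first-order condition. A constrained sieve, as required by the smoothness conditions in the text, would call for a constrained quadratic program instead of this unconstrained solve.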

With this estimate, we can construct the following estimator for $r_{n} ({\hat{h}}_{0})$ as

{\hat{r}}_{n} ({\hat{h}}_{0}) = \frac{1}{n} \sum_{i = 1}^{n} I (A_{i} = a) \hat{E} {\hat{g} (W, A, X) ∣ A_{i}, Z_{i}, X_{i}} \hat{e} (Z_{i}, A_{i}, X_{i}, {\hat{h}}_{0})

(18)

Next, we have a lemma characterizing the convergence of ${\hat{r}}_{n} ({\hat{h}}_{0})$ to $r_{n} ({\hat{h}}_{0})$ .

Lemma 3.4.

Suppose that assumptions 8-10 and 12 hold. Then

\sup_{{\hat{h}}_{0} \in {\hat{𝓗}}_{0}} | {\hat{r}}_{n} ({\hat{h}}_{0}) - r_{n} ({\hat{h}}_{0}) | = O_{p} [c_{n}^{1 / 2} {{(\frac{k_{n}}{n})}^{1 / 4} + k_{n}^{- \frac{α}{2 (d + 1)}}}]

Using this lemma, we can construct the following debiased estimator that is asymptotically normal:

{\hat{μ}}_{a - d b} = {\hat{μ}}_{a} + {\hat{r}}_{n} ({\hat{h}}_{0})

(19)
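Given fitted quantities, equations 14, 18, and 19 amount to two sample averages. The following sketch assumes the inputs have already been computed elsewhere: `h_a` holds ${\hat{h}}_{0} (W_{i}, a, X_{i})$ , `arm` holds $I (A_{i} = a)$ , `g_fit` holds $\hat{E} \{\hat{g} (W, A, X) ∣ A_{i}, Z_{i}, X_{i}\}$ , and `e_hat` holds $\hat{e} (Z_{i}, A_{i}, X_{i}, {\hat{h}}_{0})$ . The function name and the toy numbers are hypothetical.

```python
import numpy as np


def debiased_mean(h_a, arm, g_fit, e_hat):
    """Plug-in estimate (14) plus the estimated bias correction (18), as in (19)."""
    mu_plugin = np.mean(h_a)                     # average of h_hat(W_i, a, X_i)
    r_hat = np.mean(arm * g_fit * e_hat)         # estimated correction term r_n
    return float(mu_plugin + r_hat)


# Toy inputs for four units (illustrative values only).
h_a = np.array([1.0, 2.0, 3.0, 4.0])
arm = np.array([1.0, 0.0, 1.0, 0.0])
g_fit = np.array([2.0, 2.0, 2.0, 2.0])
e_hat = np.array([0.5, 9.0, -0.1, 9.0])

mu_db = debiased_mean(h_a, arm, g_fit, e_hat)
```

Units outside the arm contribute to the plug-in average but not to the correction, because the indicator zeroes out their residual terms.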

Theorem 3.5.

Suppose assumptions 8-12 hold. If $a_{n} = O (λ_{n}^{- 1}), b_{n} \to \infty$ , and $n^{2 / 3} b_{n} = o (a_{n})$ , then $\sqrt{n} ({\hat{μ}}_{a - d b} - μ_{a})$ converges in distribution to $N (0, σ^{2})$ , where $σ^{2}$ is the variance of

h_{0} (W, a, X) - μ_{a} + I (A = a) E {g_{0} (W, A, X) ∣ Z, A, X} \times (Y - h_{0} (W, A, X))

(20)

Equation 20 is the influence function for the de-biased estimator ${\hat{μ}}_{a - d b}$ . Cui et al. (2020) consider a nonparametric model that leaves the $h$ and $q$ functions that solve the Fredholm integral equations unrestricted. With this model and under additional assumptions that ensure uniqueness of $h$ and $q$ , they derive the efficient influence function for the counterfactual mean as $h (W, a, X) - μ_{a} + I (A = a) q (Z, a, X) \times (Y - h (W, a, X))$ . Thus, we have the immediate corollary:

Corollary 3.6.

The influence function 20 attains the semiparametric efficiency bound under the model considered in Cui et al. (2020) under assumptions 1 and 4-7 together with assumption 10 in Cui et al. (2020).
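Theorem 3.5 suggests one natural route to interval estimates: evaluate the influence function 20 at fitted values and use its sample variance. The source does not specify a variance estimator, so the plug-in approach below is an assumption of this sketch, as are the function name `wald_interval` and the toy inputs. The arrays `h_a`, `h_obs`, and `g_fit` stand for fitted values of $h_{0} (W_{i}, a, X_{i})$ , $h_{0} (W_{i}, A_{i}, X_{i})$ , and $E \{g_{0} (W, A, X) ∣ Z_{i}, A_{i}, X_{i}\}$ .

```python
import numpy as np


def wald_interval(mu_hat, h_a, h_obs, g_fit, arm, y, z_crit=1.96):
    """Normal-approximation interval based on the sample variance of the influence function (20)."""
    infl = h_a - mu_hat + arm * g_fit * (y - h_obs)      # estimated influence function values
    se = np.sqrt(np.var(infl, ddof=1) / len(y))          # standard error of the estimate
    return float(mu_hat - z_crit * se), float(mu_hat + z_crit * se)


# Toy inputs for four units (illustrative values only).
h_a = np.array([1.0, 2.0, 3.0, 4.0])
h_obs = np.array([1.0, 2.0, 3.0, 4.0])
g_fit = np.array([2.0, 2.0, 2.0, 2.0])
arm = np.array([1.0, 0.0, 1.0, 0.0])
y = np.array([1.5, 0.0, 3.5, 0.0])

lower, upper = wald_interval(2.5, h_a, h_obs, g_fit, arm, y)
```

The default `z_crit=1.96` corresponds to a nominal 95% level. Whether such an interval has good coverage in finite samples depends on how well the nuisance fits behave, which this sketch does not address.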

4. Discussion

Under the proximal causal inference framework, we have established an estimator for the counterfactual mean. We have shown it is consistent and presented conditions under which it is asymptotically normal. We have also discussed under what conditions it achieves the semiparametric efficiency bound. Note that if interest is in the average causal effect, $E [Y (1) - Y (0)]$ , the proposed methodology can be adapted by slightly adjusting Assumption 12(i).

Supplementary Material

NIHMS1884513-supplement-1.pdf (303.8 KB, PDF)

Acknowledgements

Jeffrey Zhang was supported by NIH grant 5R01HD101415-02. Wei Li’s research was supported by the National Natural Science Foundation of China (NSFC 12101607), the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China. Eric Tchetgen Tchetgen (PI) was supported by NIH Grants: R01AI27271, R01CA222147, R01AG065276, R01GM139926.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

Chen X. and Pouzo D. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/41336586.
Cui Y, Pu H, Shi X, Miao W, and Tchetgen Tchetgen E. Semiparametric proximal causal inference, 2020. URL https://arxiv.org/abs/2011.08411.
Deaner B. Proxy controls and panel data, 2018. URL https://arxiv.org/abs/1810.00283.
D’Haultfoeuille X. A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1):1–15, 2010. URL https://EconPapers.repec.org/RePEc:eee:econom:v:154:y:2010:i:1:p:1-15.
Ghassami A, Ying A, Shpitser I, and Tchetgen Tchetgen E. Minimax kernel machine learning for a class of doubly robust functionals with application to proximal causal inference. In Camps-Valls G, Ruiz FJR, and Valera I, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 7210–7239. PMLR, 28–30 Mar 2022. URL https://proceedings.mlr.press/v151/ghassami22a.html.
Li W, Miao W, and Tchetgen Tchetgen E. Identification and estimation of nonignorable missing outcome mean without identifying the full data distribution, 2021. URL https://arxiv.org/abs/2110.05776.
Miao W. and Tchetgen Tchetgen EJ. On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2):475–482, 2016. ISSN 00063444. URL http://www.jstor.org/stable/43908634.
Miao W, Geng Z, and Tchetgen Tchetgen EJ. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987–993, 2018. ISSN 0006–3444. doi: 10.1093/biomet/asy038.
Santos A. Instrumental variable methods for recovering continuous linear functionals. Journal of Econometrics, 161(2):129–146, 2011. ISSN 0304–4076. doi: 10.1016/j.jeconom.2010.11.014. URL https://www.sciencedirect.com/science/article/pii/S0304407610002253.
Severini TA and Tripathi G. Efficiency bounds for estimating linear functionals of nonparametric regression models with endogenous regressors. Journal of Econometrics, 170(2):491–498, 2012. ISSN 0304–4076. doi: 10.1016/j.jeconom.2012.05.018. URL https://www.sciencedirect.com/science/article/pii/S0304407612001303. Thirtieth Anniversary of Generalized Method of Moments.
Shi X, Miao W, and Tchetgen Tchetgen E. A selective review of negative control methods in epidemiology. Current Epidemiology Reports, 7(4):190–202.
Shi X, Miao W, Nelson JC, and Tchetgen Tchetgen EJ. Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2):521–540, 2020. doi: 10.1111/rssb.12361. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12361.
Tchetgen Tchetgen EJ, Ying A, Cui Y, Shi X, and Miao W. An introduction to proximal causal learning, 2020. URL https://arxiv.org/abs/2009.10982.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS1884513-supplement-1.pdf^{(303.8KB, pdf)}

