Estimating causal effects with hidden confounding using instrumental variables and environments

James P Long; Hongxu Zhu; Kim-Anh Do; Min Jin Ha

doi:10.1214/23-ejs2160

Author manuscript; available in PMC: 2024 Jul 2.

Published in final edited form as: Electron J Stat. 2023 Nov 10;17(2):2849–2879. doi: 10.1214/23-ejs2160

PMCID: PMC11219021 NIHMSID: NIHMS1953713 PMID: 38957485

Abstract

Recent works have proposed regression models which are invariant across data collection environments [24, 20, 11, 16, 8]. These estimators often have a causal interpretation under conditions on the environments and type of invariance imposed. One recent example, the Causal Dantzig (CD), is consistent under hidden confounding and represents an alternative to classical instrumental variable estimators such as Two Stage Least Squares (TSLS). In this work we derive the CD as a generalized method of moments (GMM) estimator. The GMM representation leads to several practical results, including 1) creation of the Generalized Causal Dantzig (GCD) estimator which can be applied to problems with continuous environments where the CD cannot be fit 2) a Hybrid (GCD-TSLS combination) estimator which has properties superior to GCD or TSLS alone 3) straightforward asymptotic results for all methods using GMM theory. We compare the CD, GCD, TSLS, and Hybrid estimators in simulations and an application to a Flow Cytometry data set. The newly proposed GCD and Hybrid estimators have superior performance to existing methods in many settings.

Keywords and phrases: Causal inference, hidden confounding, instrumental variables, Causal Dantzig

MSC2020 subject classifications: Primary 62D20, 62D20

1. Introduction

Causal inference is challenging because of confounding and reverse causality. One solution is to make strong assumptions about confounding (e.g. no hidden confounding) and the direction of causation (e.g. $X$ is a cause of $Y$ and not the reverse). Under these assumptions, causal parameters may be identifiable from observational data.

When these assumptions are not valid, instrumental variables (IV) are a classical method for identifying causal effects. The variable $E$ is an instrument for the $X \to Y$ causal relation if 1) $E$ is uncorrelated with the error term in the $Y$ on $X$ regression and 2) $E$ is correlated with $X$ (valid first stage). IV estimators such as Two Stage Least Squares (TSLS) remain consistent under hidden confounding and unknown direction of causality. IV methods date back to [28] and have more recently been generalized to high dimensional problems [14, 9, 2] and causal discovery applications where $X \in ℝ^{p}$ is a vector and identifying the causes of each $X_{i}$ is of interest [6].

Recently, several works have proposed causal estimators based on the concept of data collection environment [24, 20, 11, 16, 8]. [20] introduced the concept of data collection environment and developed a causal estimator, Invariant Causal Prediction (ICP). In this framework, each observation is collected in an environment. The environment may represent randomized experiments on some of the exposures of interest, shift, and/or noise interventions. Environments are typically discrete and often small in number (e.g. 2 or 3).

Estimators are constructed from environments based on the principle that parameters in a causal regression model $Y$ on $X$ should be invariant across environments while parameters in a merely associational model will vary. As a simple heuristic example, suppose we are interested in estimating the causal effect of $X$ on $Y$ . In truth $Y$ is a cause of $X$ and the true causal effect of $X$ on $Y$ is 0. Standard regression based estimators are inconsistent. More generally with purely observational data it will be impossible to determine the causal effect. However if we have data from two environments, e.g. an observational environment and an interventional environment where noise is added to $X$ , then it is possible (under some conditions) to infer the causal effect as 0 by noting that the distribution of $Y$ is identical in the observational and interventional environment. This would not be the case if $X$ has a causal effect on $Y$ .

The original environment estimator ICP has been generalized to problems with sequential data and non-linear models [21, 11]. [24] proposed the Causal Dantzig (CD) environment estimator to address two weaknesses of ICP: computational complexity and inconsistency under hidden confounding. While ICP requires fitting models on all subsets of the exposure variables (making the algorithm superexponential in the number of exposures), the CD estimator has computational burden similar to linear regression. Further, like instrumental variable estimators, the CD is consistent when hidden variables confound the $X \to Y$ causal relation.

In this work, we show that the Causal Dantzig can be represented as a generalized method of moments estimator (GMM). This immediately leads to a new estimator, termed the Generalized Causal Dantzig (GCD), which is equivalent to the CD in the two environment case but can be applied with continuous environments, a setting not handled by the original CD. The GMM representation facilitates straightforward asymptotic results based on GMM theory. GMM theory shows how to optimally weight the moment constraints in over-identified problems, which occurs whenever there are more than two data collection environments. Finally, the GMM representation of the CD facilitates construction of a Hybrid estimator which uses both the CD and IV moment constraints for estimating parameters. This Hybrid estimator is consistent in some settings where neither the CD nor TSLS are consistent.

This work is organized as follows. In Section 2 we review the GMM representation of IV estimators and the environment invariance representation of the Causal Dantzig. In Section 3 we propose the Generalized Causal Dantzig (GCD), a GMM estimator, and construct a Hybrid estimator which uses both IV and CD/GCD moment constraints. Section 4 derives asymptotic results for the GCD based on the GMM representation of the estimator. In Section 5, we assess consistency of the estimators in causal Structural Equation Models. Section 6 contains simulations which demonstrate some of the potential applications of the GCD and hybrid GCD-IV estimators. In Section 7 we apply IV, CD, GCD, and Hybrid estimators to Flow Cytometry data of [26]. In several cases, we show that IV and Hybrid estimators identify more plausible causal relations than the CD alone. We conclude with a discussion in Section 8. All code and data for reproducing the computational aspects of this work are available.¹

2. Instrumental variables and the Causal Dantzig

Let $X \in ℝ^{p}$ be a set of endogenous exposures and $Y \in ℝ^{1}$ be a response. The goal is to estimate the causal effect of $X$ on $Y$ . Consider a linear model of the form

Y = X^{T} β_{0} + δ_{Y}

(2.1)

where $E [δ_{Y}] = 0$ . Under a potential outcomes [1] or a structural equation modelling [19] framework, $β_{0 j}$ ( $j^{th}$ element of $β_{0}$ ) can be given a causal interpretation as the average treatment effect (ATE) of $X_{j}$ on $Y$ when shifting $X_{j}$ by 1 unit. In either of these frameworks, correlation between $X$ and $δ_{Y}$ is induced by hidden confounders which exert a causal effect on $X$ and $Y$ . Straightforward regression of $Y$ on $X$ may result in inconsistent estimates of $β_{0}$ when the error term $δ_{Y}$ is correlated with $X$ .

In this section we review two approaches to constructing consistent estimators with hidden confounding and reverse causality: the classical Two Stage Least Squares (TSLS) which uses Instrumental variables (IV) and the recently proposed Causal Dantzig which uses data collection environments.

2.1. Instrumental variable estimators

Instrumental variables techniques, dating back to [28], use instrumental variables (IVs) $E \in ℝ^{q}$ to construct consistent estimates of $β_{0}$ in the presence of hidden confounding. Suppose that 1) the instruments $E$ are uncorrelated with the error term $δ_{Y}$ (i.e. $E [E δ_{Y}] = 0$) and 2) $E [E X^{T}] \in ℝ^{q \times p}$ is of rank at least $p$. The latter condition implies that $q \geq p$ (i.e. there are at least as many instruments as exposures). We review the construction of IV estimators from a GMM perspective. See [15] for more background.

Let $Z = (Y, X, E)$ and $g_{I V} (Z, β) = E (Y - X^{T} β)$ . Then the true causal parameter $β_{0}$ is the unique solution to

m_{I V} (β) = E [g_{I V} (Z, β)] = E [E (Y - X^{T} β)] = 0.

(2.2)

This can be seen by noting

E [E (Y - X^{T} β)] = \underbrace{E [E δ_{Y}]}_{= 0} + \underbrace{E [E X^{T}]}_{rank \geq p} (β_{0} - β) .

The rank condition implies that the null space of $E [E X^{T}]$ is $\{0\}$, so $β_{0}$ is the only solution. To construct a consistent estimator, the expectation is approximated with a sample. Define $X \in ℝ^{n \times p}, E \in ℝ^{n \times q}, Y \in ℝ^{n \times 1}$ to be matrices of $n$ i.i.d. observations. With $i$ indexing observations, we have

{\hat{m}}_{I V} (β) = \frac{1}{n} \sum_{i = 1}^{n} E_{i} (Y_{i} - X_{i}^{T} β) = \frac{1}{n} E^{T} (Y - X β) .

(2.3)

When $q > p$, the model is over-identified and there will typically be no $\hat{β}$ such that ${\hat{m}}_{I V} (\hat{β}) = 0$ in Equation (2.3). In this case, the standard GMM approach is to use the estimator

{\hat{β}}_{I V} (\hat{W}) = \underset{β}{argmin} {‖ {\hat{m}}_{I V} (β) ‖}_{\hat{W}}^{2} = \underset{β}{argmin} {\hat{m}}_{I V} {(β)}^{T} \hat{W} {\hat{m}}_{I V} (β)

(2.4)

where $\hat{W} ≻ 0$ is a positive definite weighting matrix. The TSLS IV estimator uses

\tilde{W} = {(\frac{1}{n} E^{T} E)}^{- 1} .

This weight matrix is chosen for asymptotic efficiency considerations which we discuss further in Section 4.2 (see also Section 1.3.4.2 of [15]). With $\tilde{W}$ , Equation (2.4) has the form

{\hat{β}}_{T S L S} \equiv {\hat{β}}_{I V} (\tilde{W}) = {({\hat{X}}^{T} \hat{X})}^{- 1} {\hat{X}}^{T} Y

(2.5)

where $\hat{X} = E {(E^{T} E)}^{- 1} E^{T} X$ . The TSLS estimator derives its name from the fact that it is computed by first regressing $X$ on $E$ (first stage) and then regressing $Y$ on the predicted values from the first stage (second stage).

Note that when $p = q$ (just identified case) the unique ${\hat{β}}_{I V} (\hat{W})$ does not depend on $\hat{W}$ and has the form

{\hat{β}}_{I V} (\hat{W}) = {(E^{T} X)}^{- 1} E^{T} Y .
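
The equivalence between the GMM form in Equation (2.4) with the TSLS weight and the two-regression form in Equation (2.5) can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the function names `gmm_iv` and `tsls`, the first-stage matrix `Gamma`, and the simulated data-generating process (a single hidden confounder `H`) are illustrative choices and are not taken from the paper.

```python
import numpy as np

def gmm_iv(E, X, Y, W):
    """Closed-form minimizer of m(b)^T W m(b), with m(b) = E^T (Y - X b) / n."""
    n = len(Y)
    A = E.T @ X / n   # q x p sample analogue of E[E X^T]
    c = E.T @ Y / n   # q-vector sample analogue of E[E Y]
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ c)

def tsls(E, X, Y):
    """Two-stage form: regress X on E, then regress Y on the fitted values."""
    X_hat = E @ np.linalg.lstsq(E, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, Y, rcond=None)[0]

# Assumed toy model: q = 3 instruments, p = 2 exposures, hidden confounder H.
rng = np.random.default_rng(0)
n = 20_000
beta0 = np.array([1.0, -0.5])
Gamma = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # assumed first stage
E = rng.normal(size=(n, 3))
H = rng.normal(size=n)
X = E @ Gamma + H[:, None] + rng.normal(size=(n, 2))
Y = X @ beta0 + 2.0 * H + rng.normal(size=n)

W_tilde = np.linalg.inv(E.T @ E / n)             # TSLS weight matrix
beta_gmm = gmm_iv(E, X, Y, W_tilde)
beta_tsls = tsls(E, X, Y)
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]  # ignores confounding

print("GMM with TSLS weight:", beta_gmm)
print("Two-stage regression:", beta_tsls)
print("OLS (confounded):    ", beta_ols)
```

In this toy model the two IV computations agree up to floating point error, while ordinary regression of $Y$ on $X$ is pulled away from $β_{0}$ by the hidden confounder.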

2.2. Causal Dantzig

The Causal Dantzig (CD) uses environments to estimate $β_{0}$ in Equation (2.1). Each observation belongs to one of a discrete set of environments. Let $ℰ = (1, 2, \dots)$ be a set of data collection environments with $# ℰ$ denoting the number of environments (at least 2). Let $X^{e} (Y^{e})$ denote exposures (response) collected in environment $e \in ℰ$ . For any $f, g \in ℰ$ , the CD seeks a $β$ which satisfies

E [X^{f} (Y^{f} - X^{f^{T}} β)] = E [X^{g} (Y^{g} - X^{g^{T}} β)] .

(2.6)

Under conditions specified in Section 4, Equation (2.6) will have a unique solution which equals the causal estimand $β_{0}$. The CD estimator is constructed by enforcing sample based versions of constraints in Equation (2.6). Since $X^{e} \in ℝ^{p}$, when $# ℰ = 2$ (two data collection environments), the invariances specified in Equation (2.6) produce $p$ sample constraints on $β \in ℝ^{p}$. Suppose there are $n_{e}$ observations from environment $e$. Let $X^{e} \in ℝ^{n_{e} \times p}$ ($Y^{e} \in ℝ^{n_{e}}$) represent the design matrix (response vector) for environment $e$. Then the sample version of constraints in Equation (2.6) (with $f = 2$ and $g = 1$) is

\frac{1}{n_{2}} X^{2^{T}} (Y^{2} - X^{2} {\hat{β}}_{C D}) = \frac{1}{n_{1}} X^{1^{T}} (Y^{1} - X^{1} {\hat{β}}_{C D}) .

Solving for ${\hat{β}}_{C D}$ one obtains

{\hat{β}}_{C D} = {(\frac{1}{n_{2}} X^{2^{T}} X^{2} - \frac{1}{n_{1}} X^{1^{T}} X^{1})}^{- 1} (\frac{1}{n_{2}} X^{2^{T}} Y^{2} - \frac{1}{n_{1}} X^{1^{T}} Y^{1})

(2.7)

assuming the inverse exists (see Equation 7 of [24]). Equation (2.7) shows that the CD exploits how the covariance structure of $X$ changes with the environment to construct a consistent estimate of $β$ . This is in contrast to TSLS which exploits how the mean of $X$ changes with the instrument $E$ . The CD is a consistent and asymptotically normal estimator of $β_{0}$ under conditions specified in Sections 4 and 5.
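
Equation (2.7) translates directly into a few lines of linear algebra. The sketch below is illustrative only: `causal_dantzig` and `simulate` are hypothetical helper names, and the toy model (one hidden confounder, with the second environment inflating the noise added to $X$) is an assumption chosen so that the covariance of $X$ differs across environments.

```python
import numpy as np

def causal_dantzig(X1, Y1, X2, Y2):
    """Two-environment Causal Dantzig, following Equation (2.7)."""
    G = X2.T @ X2 / len(Y2) - X1.T @ X1 / len(Y1)  # difference of Gram matrices
    z = X2.T @ Y2 / len(Y2) - X1.T @ Y1 / len(Y1)  # difference of cross moments
    return np.linalg.solve(G, z)

def simulate(n, noise_sd, rng, beta0):
    """Assumed toy model: hidden H confounds X and Y; noise_sd depends on the environment."""
    H = rng.normal(size=n)
    X = H[:, None] + noise_sd * rng.normal(size=(n, len(beta0)))
    Y = X @ beta0 + 2.0 * H + rng.normal(size=n)
    return X, Y

rng = np.random.default_rng(1)
beta0 = np.array([1.0, -0.5])
X1, Y1 = simulate(50_000, 1.0, rng, beta0)  # environment 1: baseline noise
X2, Y2 = simulate(50_000, 3.0, rng, beta0)  # environment 2: noise intervention

beta_cd = causal_dantzig(X1, Y1, X2, Y2)
X_pool, Y_pool = np.vstack([X1, X2]), np.concatenate([Y1, Y2])
beta_ols = np.linalg.lstsq(X_pool, Y_pool, rcond=None)[0]

print("Causal Dantzig:", beta_cd)
print("Pooled OLS:    ", beta_ols)
```

Here the mean of $X$ is the same in both environments, so only the change in second moments carries information about $β_{0}$.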

3. New estimators

We now construct the Generalized Causal Dantzig (GCD) and a Hybrid estimator. The GCD extends the CD to work in new settings (e.g. continuous environments). The Hybrid estimator imposes both TSLS and CD moment constraints to produce an estimator with potentially better asymptotic properties than either the CD or TSLS. Finally we discuss connections between the GCD and estimators proposed by Lewbel [13].

3.1. Generalized Causal Dantzig

Following notation used for the IV estimators in Section 2.1 where $Z = (Y, X, E)$ , define the Generalized Causal Dantzig (GCD) using the moment conditions

m_{G C D} (β) = E [g_{G C D} (Z, β)] = E [vec (E X^{T}) (Y - X^{T} β)] = 0.

(3.1)

Here $E X^{T} \in ℝ^{q \times p}$ and $v e c (E X^{T})$ vectorizes (column stacks) the matrix $E X^{T}$ [12]. Specifically

v e c (E X^{T}) = v e c (\begin{matrix} E_{1} X_{1} & \dots & E_{1} X_{p} \\ ⋮ & ⋮ & ⋮ \\ E_{q} X_{1} & \dots & E_{q} X_{p} \end{matrix}) = (\begin{matrix} E_{1} X_{1} \\ ⋮ \\ E_{q} X_{1} \\ ⋮ \\ E_{1} X_{p} \\ ⋮ \\ E_{q} X_{p} \end{matrix}) \in ℝ^{q p} .

We now justify the name Generalized Causal Dantzig (GCD) by showing an equivalence between GCD moment conditions in Equation (3.1) and CD invariance constraints in Equation (2.6). To achieve this result, we first translate CD environment data notation ( $(Y^{e}, X^{e})$ for $e \in ℰ$ ) to IV and GCD data notation $(Z = (Y, X, E))$ . We term this translation $Z$ -Encoding.

Definition 1 ( $Z$ -Encoding). Let $(Y^{e}, X^{e})$ denote data collected in environment $e \in ℰ$ . Define $Y = Y^{e}, X = X^{e}$ and $R = e$ . Further encode the categorical variable $R$ with random variable $E \in ℝ^{# ℰ - 1}$ where

E_{f} = \begin{cases} s_{f 1} & R = f \\ s_{f 0} & R \neq f \end{cases}

(3.2)

for $f \in \{1, \dots, # ℰ - 1\}$. Select $s_{f 1}$ and $s_{f 0}$ such that $E [E_{f}] = 0$. The $Z$ -Encoding of environment data $(Y^{e}, X^{e})$ is $(Y, X, E)$.
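
One way to realize Definition 1 in practice is with centered indicator columns. The sketch below assumes that $s_{f 1}$ and $s_{f 0}$ are chosen as $1 - \hat{π}_{f}$ and $- \hat{π}_{f}$, where $\hat{π}_{f}$ is the sample fraction of observations in environment $f$; this sample-based centering and the helper name `z_encode` are assumptions made for illustration, since the definition only requires $E [E_{f}] = 0$.

```python
import numpy as np

def z_encode(R, n_env):
    """Encode labels R in {1, ..., n_env} as n_env - 1 centered indicator columns."""
    R = np.asarray(R)
    # Column f (1-based) is the indicator of R == f, for f = 1, ..., n_env - 1.
    ind = (R[:, None] == np.arange(1, n_env)[None, :]).astype(float)
    # Subtracting the column means gives s_f1 = 1 - pi_f and s_f0 = -pi_f.
    return ind - ind.mean(axis=0)

R = np.array([1, 1, 2, 3, 3, 3])  # hypothetical environment labels
E = z_encode(R, n_env=3)
print(E)
```

Each column takes exactly two values and has sample mean zero, and the last environment serves as the reference level.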

Theorem 3.1. Let $(Y^{e}, X^{e})$ for $e \in ℰ$ be environment data and $Z = (Y, X, E)$ be its $Z$ -Encoding constructed according to Definition 1. Then the Causal Dantzig invariance constraints (Equation (2.6)) applied to the environment data and the GCD moment constraints (Equation (3.1)) applied to the Z-Encoding are identical. Specifically

\{β : m_{G C D} (β) = 0\} = \{β : E [X^{f} (Y^{f} - X^{f^{T}} β)] = E [X^{g} (Y^{g} - X^{g^{T}} β)] \forall f, g \in ℰ\} .

See Section 9.2 for a proof. The result has conceptual and practical implications. On the conceptual side, comparing the IV moment constraints (Equation (2.2)) with the GCD moment constraints (Equation (3.1)) shows the environment and instrument $(E)$ play similar roles in the estimators. TSLS requires each instrument/environment to be orthogonal to $Y - X^{T} β$ while CD requires each instrument/environment to be orthogonal to each element of $X (Y - X^{T} β)$. On the practical side, the GCD provides a natural generalization of the CD to problems with continuous environments/instruments, and GMM theory enables straightforward derivation of GCD asymptotics. This includes optimal weighting of invariance (equivalently moment) criteria in the overidentified case using a two-step estimator. We explore these ideas further in Sections 4 and 6.

We now construct the GCD estimator by imposing the constraints in Equation (3.1) on the sample. Note that $v e c (E X^{T}) \in ℝ^{q p}$ induces $q p$ constraints on $β$ . Define

E • X = (\begin{matrix} v e c {(E_{1} X_{1}^{T})}^{T} \\ ⋮ \\ v e c {(E_{n} X_{n}^{T})}^{T} \end{matrix}) = (\begin{matrix} E_{11} X_{11} & \dots & E_{1 q} X_{11} & \dots & E_{11} X_{1 p} & \dots & E_{1 q} X_{1 p} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ E_{n 1} X_{n 1} & \dots & E_{n q} X_{n 1} & \dots & E_{n 1} X_{n p} & \dots & E_{n q} X_{n p} \end{matrix}) \in ℝ^{n \times p q} .

Define the GCD moment equations as

{\hat{m}}_{G C D} (β) = \frac{1}{n} {(E • X)}^{T} (Y - X β) .

(3.3)

and the resulting GCD estimator as

{\hat{β}}_{G C D} (\hat{W}) = \underset{β}{argmin} {‖ {\hat{m}}_{G C D} (β) ‖}_{\hat{W}}^{2}

(3.4)

where $\hat{W} ≻ 0$ is some weighting matrix. When $E$ is constructed from two environments, its dimension is 1 because $q = # ℰ - 1 = 1$ . In this case the $q p = p$ constraints just identify $β$ . Further ${\hat{β}}_{G C D} (\hat{W})$ is invariant to different choices of $\hat{W}$ and identical to the original Causal Dantzig estimator ${\hat{β}}_{C D}$ . The following theorem formalizes these results.

Theorem 3.2. Suppose $\hat{W} ≻ 0$ and $X^{T} (E • X) \in ℝ^{p \times p q}$ has column rank $p$ . Then

a) The unique minimizer in Equation (3.4) is
${\hat{β}}_{G C D} (\hat{W}) = {(X^{T} (E • X) \hat{W} {(E • X)}^{T} X)}^{- 1} (X^{T} (E • X) \hat{W} {(E • X)}^{T} Y) .$
b) If $q = dim (E) = 1$ (just identified case), then ${\hat{β}}_{G C D} (\hat{W})$ is invariant to the choice of $\hat{W}$ and
${\hat{β}}_{G C D} (\hat{W}) = {\hat{β}}_{G C D} = {({(E • X)}^{T} X)}^{- 1} {(E • X)}^{T} Y .$
c) If $Z = (Y, X, E)$ is the $Z$ -Encoding (Definition 1) of environment data $(Y^{e}, X^{e})$ with $# ℰ = 2$ (two environments), then ${\hat{β}}_{G C D} = {\hat{β}}_{C D}$ .

See Section 9.3 for a proof. With more than two environments, ${\hat{β}}_{G C D} (\hat{W})$ will generally not be equivalent to the $# ℰ > 2$ CD estimators proposed in [24]. [24] proposed two approaches for adapting the CD to the case with more than two environments: 1) Merge environments to obtain two distinct environments. Fit the two environment CD estimator on the resulting data. 2) Fit a minimax estimator which seeks to satisfy the CD invariance constraints in a one-versus-all environment approach (Equation 8 in [24]). No theory is given to guide merging of environments (if approach 1 is taken) or how to compute uncertainties on parameter estimates (if approach 2 is taken). The CD was previously applied to the Flow Cytometry Application studied in Section 6 of this work [17]. In that application, a third strategy was used for adapting the CD to more than two environments which involves iteratively fitting the two environment CD to pairs of environments (see Section 6.2 of this work and [17] for a more detailed description). In contrast, the GCD with more than two environments uses the matrix $\hat{W}$ to weight moment conditions. GMM theory is then used to derive the optimal weight matrix (minimizes asymptotic variance) which can be estimated using standard two-step procedures. GMM theory provides (asymptotically) valid uncertainty estimates. This approach is used both in the simulations and the Flow Cytometry application of Section 6. We discuss optimal selection of $\hat{W}$ in Section 4.2.
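
The closed form in Theorem 3.2 and its two-environment equivalence with the CD can be illustrated numerically. The sketch below is a hedged example: `row_kron`, `gcd`, `causal_dantzig` and `simulate` are hypothetical helper names, the data come from an assumed toy noise-intervention model, and the environment indicator is centered with the sample proportion (an assumption about how $s_{f 1}$ and $s_{f 0}$ are chosen).

```python
import numpy as np

def row_kron(E, X):
    """E • X: row i is vec(E_i X_i^T)^T, ordered E_1 X_1, ..., E_q X_1, ..., E_q X_p."""
    n = E.shape[0]
    return (X[:, :, None] * E[:, None, :]).reshape(n, -1)

def gcd(E, X, Y, W=None):
    """GCD estimator of Equation (3.4) in closed form (identity weight by default)."""
    n = len(Y)
    EX = row_kron(E, X)
    A = EX.T @ X / n
    c = EX.T @ Y / n
    W = np.eye(A.shape[0]) if W is None else W
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ c)

def causal_dantzig(X1, Y1, X2, Y2):
    """Two-environment Causal Dantzig, Equation (2.7)."""
    G = X2.T @ X2 / len(Y2) - X1.T @ X1 / len(Y1)
    z = X2.T @ Y2 / len(Y2) - X1.T @ Y1 / len(Y1)
    return np.linalg.solve(G, z)

def simulate(n, noise_sd, rng, beta0):
    """Assumed toy model with a hidden confounder H and environment-specific noise."""
    H = rng.normal(size=n)
    X = H[:, None] + noise_sd * rng.normal(size=(n, len(beta0)))
    Y = X @ beta0 + 2.0 * H + rng.normal(size=n)
    return X, Y

rng = np.random.default_rng(1)
beta0 = np.array([1.0, -0.5])
X1, Y1 = simulate(3000, 1.0, rng, beta0)
X2, Y2 = simulate(5000, 3.0, rng, beta0)

# Z-Encoding with a single centered indicator for environment 1.
X = np.vstack([X1, X2])
Y = np.concatenate([Y1, Y2])
ind = np.concatenate([np.ones(3000), np.zeros(5000)])
E = (ind - ind.mean())[:, None]

beta_cd = causal_dantzig(X1, Y1, X2, Y2)
beta_gcd = gcd(E, X, Y)

# Just-identified case: an arbitrary positive definite weight gives the same answer.
B = rng.normal(size=(2, 2))
beta_gcd_w = gcd(E, X, Y, W=B @ B.T + np.eye(2))

print(beta_cd, beta_gcd, beta_gcd_w)
```

With the sample-centered encoding, the GCD sample moments are a nonzero scalar multiple of the difference of the two CD sample moments, which is why the estimates agree up to floating point error rather than only asymptotically.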

3.2. Hybrid estimator

The GMM representation of the GCD in Equation (3.1) and the GMM representation of the TSLS estimator in Equation (2.2) share a similar structure. The GCD enforces orthogonality between $Y - X^{T} β$ and $v e c (E X^{T})$ while TSLS enforces orthogonality between $Y - X^{T} β$ and $E$ . Using GMMs, it is straightforward to construct estimators which enforce both of these constraints. We define the Hybrid estimator moment conditions

m_{H} (β) = E [(\begin{matrix} E \\ v e c (E X^{T}) \end{matrix}) (Y - X^{T} β)] = 0.

(3.5)

This estimator enforces orthogonality between $Y - X^{T} β$ and both $v e c (E X^{T})$ and $E$ , resulting in a total of $q + q p$ constraints. Following standard GMM practice, the sample version of these orthogonality constraints is

{\hat{m}}_{H} (β) = \frac{1}{n} {(E, E • X)}^{T} (Y - X β)

where $(E, E • X) \in ℝ^{n \times (q + q p)}$ column-joins the matrices $E$ and $E • X$ . The hybrid estimator is

{\hat{β}}_{H} (\hat{W}) = \underset{β}{argmin} {‖ {\hat{m}}_{H} (β) ‖}_{\hat{W}}^{2}

(3.6)

where $\hat{W} \in ℝ^{(q + q p) \times (q + q p)} ≻ 0$ is some weighting matrix. Since ${\hat{β}}_{I V}, {\hat{β}}_{G C D}$ , and ${\hat{β}}_{H}$ are all GMMs, asymptotic results for these estimators can be derived using GMM theory. We do this in the following section.
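
To make the stacked moment matrix $(E, E • X)$ concrete, the sketch below builds the Hybrid estimator with identity weighting on an assumed toy model. In this model, a single continuous $E$ shifts the mean of $X_{1}$ and the variance of $X_{2}$, so neither the IV nor the GCD moment matrix has rank $p = 2$ on its own, while the stacked matrix does. The helper names `row_kron` and `gmm_linear` and the data-generating process are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

def row_kron(E, X):
    """E • X: row i is vec(E_i X_i^T)^T."""
    n = E.shape[0]
    return (X[:, :, None] * E[:, None, :]).reshape(n, -1)

def gmm_linear(Zm, X, Y, W=None):
    """argmin_b || Zm^T (Y - X b) / n ||_W^2; also returns A = Zm^T X / n."""
    n = len(Y)
    A = Zm.T @ X / n
    c = Zm.T @ Y / n
    W = np.eye(A.shape[0]) if W is None else W
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ c), A

rng = np.random.default_rng(2)
n = 100_000
beta0 = np.array([1.0, -0.5])
E = rng.normal(size=(n, 1))                         # one continuous instrument/environment
H = rng.normal(size=n)                              # hidden confounder
X1 = E[:, 0] + H + rng.normal(size=n)               # E shifts the mean of X_1
X2 = H + np.exp(E[:, 0] / 2) * rng.normal(size=n)   # E shifts the variance of X_2
X = np.column_stack([X1, X2])
Y = X @ beta0 + 2.0 * H + rng.normal(size=n)

EX = row_kron(E, X)
beta_h, A_h = gmm_linear(np.hstack([E, EX]), X, Y)  # Hybrid: q + qp = 3 moments
_, A_gcd = gmm_linear(EX, X, Y)                     # GCD alone: qp = 2 moments

# Smallest singular values indicate how close each moment matrix is to rank deficiency.
sv_h = np.linalg.svd(A_h, compute_uv=False)
sv_gcd = np.linalg.svd(A_gcd, compute_uv=False)
print("Hybrid estimate:", beta_h)
print("Smallest singular value, Hybrid:", sv_h[-1], " GCD:", sv_gcd[-1])
```

IV alone is not applicable here because $q = 1 < p = 2$. Identity weighting is used only for simplicity in this sketch.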

3.3. Relation to Lewbel [13]

Lewbel [13] constructed consistent estimators of causal effects with hidden confounding by exploiting heteroscedasticity in endogenous variables. The relationship between Lewbel’s estimators and the CD was briefly discussed in the original CD paper [24]. [24] claimed that the methods are different because Lewbel uses exogenous variables while the CD directly exploits endogenous covariance structure, “resulting in a different method.” Complicating comparison of the methods is the fact that the original CD was constructed based on invariance and environments (Equation (2.6)) while Lewbel used GMM.

Our result that the CD can be represented as a GMM (Theorem 3.2) facilitates comparison between the methods. In fact, for certain choices of Lewbel’s variable Z, the CD and Lewbel are identical. See Supplementary Material Section 9.1 for details. Neither Lewbel’s estimators nor the CD/GCD/Hybrid completely overlap in the cases they consider. Lewbel considers fully simultaneous systems and models additional exogenous variables which are not part of this work. Lewbel primarily considers univariate endogenous variable models, briefly developing an extension of the GMM to the $p = 2$ case (Section 3.3). Here we study the $p > 1$ case in depth. The Hybrid estimator is to our knowledge completely new and not considered in Lewbel.

4. Asymptotic properties of estimators

The GMM representation of the IV, GCD, and Hybrid estimators makes derivation of asymptotic properties straightforward. We first discuss consistency and then asymptotic normality.

4.1. Consistency

Assumptions 1 and Theorem 4.1 below may be specialized to the IV, GCD, or Hybrid estimators by substituting in the corresponding $m, \hat{m}$ , and $\hat{β} (\hat{W})$ . For example, for the GCD, let $m = m_{G C D}$ , $\hat{m} = {\hat{m}}_{G C D}$ , and $\hat{β} (\hat{W}) = {\hat{β}}_{G C D} (\hat{W})$ .

Assumptions 1. Suppose

a) $m (β)$ exists and is finite for all $β \in Θ$
b) $m (β) = 0$ iff $β = β_{0}$ , the causal parameter
c) $\hat{m} (β) \to_{P} m (β)$ uniformly in $β$
d) $\hat{W} \to_{P} W$ where $\hat{W}, W ≻ 0$

Theorem 4.1. Under Assumptions 1, $\hat{β} (\hat{W})$ is a consistent estimator of $β_{0}$ .

See Theorem 1.1 and Section 1.3.4.1 of [15] for a proof and application to the i.i.d. case. Assumption 1 d) is the simplest to satisfy. For example, $\hat{W} = W = I$ satisfies the assumption. In general $\hat{W}$ will be chosen to be a consistent estimator of a $W$ which minimizes the asymptotic variance of the estimator, as discussed in Section 4.2. Assumption 1 a) will generally hold whenever error terms have sufficient moments. While Assumption 1 c) is strong, it may be replaced with assumptions restricting $β$ to a compact subset of $ℝ^{p}$ and continuity of $m (β)$ [10, 18]. Assumption 1 b) is closely related to the concepts of instrument/environment validity. We give conditions under which Causal Structural Equation Models (SEM) will satisfy Assumption 1 b) in Section 9.5. These results clarify the strengths and weaknesses of the IV, GCD, and Hybrid estimators.

4.2. Asymptotic normality

Asymptotic normality of the GCD and Hybrid estimators can be derived from standard GMM theory. The asymptotic theory provides guidance on selection of the weight matrix $\hat{W}$ in the overidentified case (recall that overidentification will occur for the GCD whenever there are more than two environments or more generally when the dimension of $E$ is greater than 1). The optimal weight matrix is chosen to minimize the asymptotic variance. For the GCD, this optimal weight matrix can be estimated using a GMM two-step procedure.

Assumptions 2 below are used for showing asymptotic normality of GMM estimators. These may be specialized to the IV, GCD, or Hybrid estimators by substituting in the corresponding $g, m, \hat{m}$ , and $\hat{β}$ . For example, for the GCD, let $g = g_{G C D}, m = m_{G C D}, \hat{m} = {\hat{m}}_{G C D}$ , and $\hat{β} (\hat{W}) = {\hat{β}}_{G C D} (\hat{W})$ .

Assumptions 2. Suppose

a) $g (Z, β)$ is continuously differentiable for $β \in Θ$ .
b) Let $k = dim (\hat{m})$ and define
$\hat{M} (β) = \frac{\partial \hat{m} (β)}{\partial β} \in ℝ^{k \times p} .$
Suppose $\hat{β} \to_{P} β_{0}$ and that there exists an $M \in ℝ^{k \times p}$ of full column rank such that
$\hat{M} (\hat{β}) \to_{P} M .$
c) $V \equiv V a r (g (Z, β_{0}))$ exists (i.e. $g (Z, β_{0})$ has two moments).

Theorem 4.2. Under Assumptions 1 and 2,

\sqrt{n} (\hat{β} (\hat{W}) - β_{0}) \to_{d} N (0, Σ (W))

with asymptotic variance

Σ (W) = {(M^{T} W M)}^{- 1} M^{T} W V W M {(M^{T} W M)}^{- 1} .

See Section 1.3.4.1 of [15] for these results. The asymptotic variance is minimized by setting $W = \bar{W} \propto V^{- 1}$ which results in

Σ (\bar{W}) = {(M^{T} V^{- 1} M)}^{- 1} .

(4.1)
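
The simplification leading to Equation (4.1) follows by direct substitution into the sandwich formula of Theorem 4.2. As a short worked step (using only symbols defined above, with $c > 0$ an arbitrary proportionality constant):

```latex
\Sigma(c V^{-1})
  = (c\, M^{T} V^{-1} M)^{-1} \, (c\, M^{T} V^{-1}) \, V \, (c\, V^{-1} M) \, (c\, M^{T} V^{-1} M)^{-1}
  = (M^{T} V^{-1} M)^{-1} (M^{T} V^{-1} M) (M^{T} V^{-1} M)^{-1}
  = (M^{T} V^{-1} M)^{-1} .
```

The constant $c$ cancels, which is why $\bar{W}$ only needs to be specified up to proportionality.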

We now review weighting for the TSLS estimator. For the instrumental variables estimator

\bar{W} \propto E {[g_{I V} (Z, β_{0}) g_{I V} {(Z, β_{0})}^{T}]}^{- 1} = E {[E (Y - X^{T} β_{0}) (Y - X^{T} β_{0}) E^{T}]}^{- 1} = {(E [E E^{T} δ_{Y}^{2}])}^{- 1} .

When $E$ is independent of $δ_{Y}, \bar{W} \propto E {[E E^{T}]}^{- 1}$ . The matrix $\tilde{W} = {(n^{- 1} E^{T} E)}^{- 1}$ is a consistent estimator of $\bar{W}$ and thus is an asymptotically optimal weighting for the IV estimator. The estimator ${\hat{β}}_{I V} (\tilde{W})$ can be computed using a set of two regressions, giving it the name Two Stage Least Squares. Since consistency of ${\hat{β}}_{I V}$ only requires orthogonality of $E$ and $δ_{Y}$ , there may be cases where ${\hat{β}}_{I V} (W)$ is a consistent, asymptotically normal estimate of $β$ but $W = E {[E E^{T}]}^{- 1}$ is not asymptotically efficient. In these cases, an asymptotically efficient IV estimator can be constructed assuming one can construct a consistent estimate of $\bar{W}$ . One possibility is to use two-step procedures. First a pilot estimator ${\hat{β}}_{I V} (W)$ is computed using some initial weight matrix $W$ . Possible choices include $W = I$ (identity) or $W = \tilde{W}$ (TSLS weight). Residuals are defined as

{\hat{δ}}_{Y, i} = Y_{i} - X_{i}^{T} {\hat{β}}_{I V} (W) .

Define $\hat{Σ}$ as a diagonal matrix with ${\hat{Σ}}_{i i} = {\hat{δ}}_{Y, i}^{2}$ and

{\hat{W}}_{T S} = {(\frac{1}{n} E^{T} \hat{Σ} E)}^{- 1} .

If ${\hat{W}}_{T S} \to_{P} \bar{W}$ , then ${\hat{β}}_{I V} ({\hat{W}}_{T S})$ is asymptotically efficient. This two-step procedure is sometimes referred to as optimal GMM. See Section 6.4.2 of [4] for a discussion of two-step optimal GMM estimators and comparison with TSLS.

For the GCD, the asymptotically optimal estimators can be constructed from two-step procedures, following the same strategy as used for IV. The following theorem provides formal justification of this approach. Note that this is only necessary in the over-identified case since in the just identified case the estimator is invariant to different choices of $W$ .

Theorem 4.3. Suppose Assumptions 1 and 2 hold for the GCD estimator. Let $\hat{W} ≻ 0$ be some initial weight matrix such that $\hat{W} \to_{P} W ≻ 0$ . Define

{\hat{δ}}_{Y, i} = Y_{i} - X_{i}^{T} {\hat{β}}_{G C D} (\hat{W}) .

Let $\hat{Σ}$ be a diagonal matrix with ${\hat{Σ}}_{i i} = {\hat{δ}}_{Y, i}^{2}$ and define

{\hat{W}}_{G C D} = {(\frac{1}{n} {(E • X)}^{T} \hat{Σ} (E • X))}^{- 1} .

Then ${\hat{β}}_{G C D} ({\hat{W}}_{G C D})$ is asymptotically efficient with variance specified in Equation (4.1).

See Section 9.4 for a proof. For the GCD, the asymptotic variance can be estimated by first estimating $M$ with

\hat{M} = \frac{1}{n} {(E • X)}^{T} X

and $V$ with

\hat{V} = \frac{1}{n} {(E • X)}^{T} \hat{Σ} (E • X) .

Then

\hat{Σ} (\bar{W}) = {({\hat{M}}^{T} {\hat{V}}^{- 1} \hat{M})}^{- 1} .

(4.2)
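
The two-step procedure of Theorem 4.3 and the plug-in variance estimate in Equation (4.2) can be sketched as follows. This is an illustrative implementation under assumptions: the helper names (`row_kron`, `gmm_linear`, `two_step_gcd`), the identity pilot weight, the three-environment toy model, and the sample-centered indicator encoding are all choices made for this example.

```python
import numpy as np

def row_kron(E, X):
    """E • X: row i is vec(E_i X_i^T)^T."""
    n = E.shape[0]
    return (X[:, :, None] * E[:, None, :]).reshape(n, -1)

def gmm_linear(Zm, X, Y, W):
    """argmin_b || Zm^T (Y - X b) / n ||_W^2 in closed form."""
    n = len(Y)
    A = Zm.T @ X / n
    c = Zm.T @ Y / n
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ c)

def two_step_gcd(E, X, Y):
    """Two-step GCD with standard errors from (M^T V^{-1} M)^{-1} / n."""
    n = len(Y)
    EX = row_kron(E, X)
    beta_pilot = gmm_linear(EX, X, Y, np.eye(EX.shape[1]))  # step 1: pilot, W = I
    resid = Y - X @ beta_pilot
    V_hat = (EX * resid[:, None] ** 2).T @ EX / n   # (E•X)^T Sigma_hat (E•X) / n
    W_hat = np.linalg.inv(V_hat)                    # estimated optimal weight
    beta = gmm_linear(EX, X, Y, W_hat)              # step 2: reweighted estimate
    M_hat = EX.T @ X / n
    avar = np.linalg.inv(M_hat.T @ W_hat @ M_hat)   # Equation (4.2)
    return beta, np.sqrt(np.diag(avar) / n)

# Assumed toy model: three environments with noise scales 1, 2, 3 on X.
rng = np.random.default_rng(3)
beta0 = np.array([1.0, -0.5])
n_e = 20_000
R = np.repeat([1, 2, 3], n_e)
sd = np.array([1.0, 2.0, 3.0])[R - 1]
H = rng.normal(size=3 * n_e)
X = H[:, None] + sd[:, None] * rng.normal(size=(3 * n_e, 2))
Y = X @ beta0 + 2.0 * H + rng.normal(size=3 * n_e)

# Centered indicators for environments 1 and 2 (q = 2, so qp = 4 > p = 2 moments).
ind = (R[:, None] == np.array([1, 2])[None, :]).astype(float)
E = ind - ind.mean(axis=0)

beta_hat, se = two_step_gcd(E, X, Y)
print("Two-step GCD estimate:", beta_hat)
print("Standard errors:      ", se)
```

The sign of $\hat{M}$ is immaterial here because it enters Equation (4.2) only through a quadratic form.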

5. Causal models and consistency

Recall that Assumption 1 b) states

m (β) = 0 iff β = β_{0} .

We now consider Causal Structural Equation Models (SEM) which guarantee that Assumption 1 b) holds. In combination with regularity Assumptions 1 a), c), and d), these will ensure consistency of the estimators by Theorem 4.1. We also contrast these estimators and assumptions with Independent Instrumental Variable (IIV) methods.

Assumptions 3 (Causal SEM). Let $Z = (Y, X, E)$ be generated from a Causal SEM with independent exogenous variables $ϵ_{E}, ϵ_{H}, ϵ_{X}, ϵ_{Y}$ and endogenous variables $X \in ℝ^{p}$ and $Y \in ℝ^{1}$ :

\begin{aligned}
E &\leftarrow f_{E} (ϵ_{E}) \\
H &\leftarrow f_{H} (ϵ_{H}) \\
X &\leftarrow f_{X} (H, E, ϵ_{X}) = E [f_{X} (H, E, ϵ_{X}) ∣ E] + δ_{X} (H, E, ϵ_{X}) \\
Y &\leftarrow f_{Y} (H, X, ϵ_{Y}) = X^{T} β_{0} + δ_{Y} (H, X, ϵ_{Y})
\end{aligned}

where

\begin{aligned}
E [Y ∣ d o (X = x)] &= x^{T} β_{0} \\
δ_{Y} (H, X, ϵ_{Y}) &= δ_{Y}^{1} (H, ϵ_{Y}) + δ_{Y}^{2} (X, ϵ_{Y}) \\
E [E] &= 0
\end{aligned}

The corresponding Directed Acyclic Graph (DAG) is shown in Figure 1 a). The dashed circle around $H$ indicates that it is not measured and thus a hidden confounder. For GCD and Hybrid consistency, we need an additional assumption.

Fig 1. — a) Causal DAG model with instrument/environment E. b) Causal DAG for observational data. c) Causal DAG for interventional data $(d o (X = x_{1}))$ .

Assumptions 4 (Error Decomposition).

δ_{X} (H, E, ϵ_{X}) = δ_{X}^{1} (E, ϵ_{X}) + δ_{X}^{2} (H, ϵ_{X}) .

We discuss this decomposition assumption further in Section 5.1. We have the following result.

Theorem 5.1. Suppose Assumptions 1 a), c) and d) and Assumption 3 hold.

a) Then $m_{I V} (β_{0}) = 0$ and ${\hat{β}}_{I V} (\hat{W})$ is consistent if $E [E X^{T}]$ has column rank $p$ .
b) If Assumption 4 holds, then $m_{G C D} (β_{0}) = 0$ and ${\hat{β}}_{G C D} (\hat{W})$ is consistent if $E [v e c (E X^{T}) X^{T}]$ has column rank $p$ .
c) If Assumption 4 holds, then $m_{H} (β_{0}) = 0$ and ${\hat{β}}_{H} (\hat{W})$ is consistent if $E [(\begin{matrix} E \\ vec (E X^{T}) \end{matrix}) X^{T}]$ has column rank $p$ .

See Section 9.2 for a proof. Column rank conditions in Theorem 5.1 depend on observed random variables $X$ and $E$ and are empirically verifiable. The structure of $f_{X}$ dictates whether these column rank conditions hold and which estimator (IV, GCD, or Hybrid) will be most appropriate for a given problem. We discuss some specific cases now. For notational convenience let

r (E) \equiv E [f_{X} (H, E, ϵ_{X}) ∣ E]

IV: The IV rank condition is $E [E X^{T}] = E [Er {(E)}^{T}]$ . The IV estimator leverages how the instruments change the mean of $X$ . If $r {(E)}_{j} = 0$ for any $j \in ℝ^{p}$ , then IV is not consistent. Further if $q = dim (E) < dim (X) = p$ , then the IV is not consistent because the column rank of $E [E X^{T}]$ is bounded by $q < p$ . Thus with a single instrument, IV may only estimate the causal effect of a single exposure on the response.
GCD: Suppose $r (E) = 0$ so the IV estimator is not consistent. The GCD rank condition may be rewritten as
$E [v e c (E X^{T}) X^{T}] = E [v e c (E δ_{X} {(H, E, ϵ_{X})}^{T}) δ_{X} {(H, E, ϵ_{X})}^{T}] = E [v e c (E δ_{X}^{1} {(E, ϵ_{X})}^{T}) δ_{X}^{1} {(E, ϵ_{X})}^{T}] .$

The GCD leverages how $E$ shifts higher moments of $X (δ_{X}^{1} (E, ϵ_{X}))$ . The GCD does not require $q \geq p$ because $E [v e c (E X^{T}) X^{T}] \in ℝ^{p q \times p}$ . With a single binary instrument/environment taking values $s_{1}$ or $s_{0}$ , the GCD will be consistent if
$E [X X^{T} ∣ E = s_{1}] - E [X X^{T} ∣ E = s_{0}]$

is column rank $p$ .
Hybrid: Consistency of the Hybrid estimator is weaker than for the GCD since
$colrank (E [(\begin{matrix} E \\ v e c (E X^{T}) \end{matrix}) X^{T}]) \geq colrank (E [v e c (E X^{T}) X^{T}]) .$

The Hybrid estimator leverages how $E$ shifts the mean (via the IV constraints) or higher moments (via the GCD constraints) of $X$ .
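The rank conditions above can be inspected on data through the singular values of the empirical moment matrices. The following is a minimal sketch rather than the authors' code: the function name `moment_matrices`, the simulated two-exposure model, and its noise scales are assumptions chosen for illustration. It uses a single instrument ($q = 1$) that shifts only the variances of two exposures ($p = 2$), so the IV matrix should be close to zero while the GCD and Hybrid matrices should have two clearly nonzero singular values.

```python
import numpy as np

def moment_matrices(E, X):
    """Empirical versions of E[E X^T] (IV) and E[vec(E X^T) X^T] (GCD).

    E has shape (n, q) and X has shape (n, p). E is centered in the sample.
    """
    n = E.shape[0]
    Ec = E - E.mean(axis=0)
    iv = Ec.T @ X / n                                   # shape (q, p)
    # Row i of Z holds all products E_ik * X_ij, a permutation of vec(E_i X_i^T).
    Z = np.einsum("ik,ij->ikj", Ec, X).reshape(n, -1)   # shape (n, q * p)
    gcd = Z.T @ X / n                                   # shape (q * p, p)
    hybrid = np.vstack([iv, gcd])                       # shape (q + q * p, p)
    return iv, gcd, hybrid

# Hypothetical model: E changes the noise scale of both exposures, not their means.
rng = np.random.default_rng(0)
n = 200_000
E = rng.uniform(0, 1, size=(n, 1))
H = rng.standard_normal(n)
X = np.column_stack([
    H + (1 + 3 * E[:, 0]) * rng.standard_normal(n),
    H + (1 + 5 * E[:, 0]) * rng.standard_normal(n),
])

iv, gcd, hybrid = moment_matrices(E, X)
sv = {name: np.linalg.svd(M, compute_uv=False)
      for name, M in [("IV", iv), ("GCD", gcd), ("Hybrid", hybrid)]}
for name, s in sv.items():
    print(name, np.round(s, 3))
```

Because the Hybrid matrix stacks the IV rows on top of the GCD rows, its singular values can never be smaller than those of the GCD matrix, which mirrors the column rank inequality stated for the Hybrid estimator.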

Theorem 5.1 shows consistency of estimators under assumptions on the data generating model. In some settings, particular elements of the estimator may be consistent while others are not, e.g. ${\hat{β}}_{j} (\hat{W}) \to_{P} β_{0 j}$ for some but not all $j \in \{1, \dots, p\}$ . Such partial identifiability/consistency results have been derived for IV estimators under weaker conditions than considered here [25, 20]. Detailed consideration of these cases for the GCD and Hybrid estimators is beyond the scope of this work.

5.1. Decomposition condition and do operators

Assumption 4 will hold for many models including an additive hidden confounder model where $δ_{X}^{2} (H, ϵ_{X}) = γ (H)$ for some function $γ$ . The assumption will not hold when the instrument/environment $E$ represents a do operator. Consider the structural model for $X$

X \leftarrow f_{X} (H, E, ϵ_{X}) = \begin{cases} f_{X} (H, ϵ_{X}) & \text{if } E = - 1 \\ x_{1} & \text{if } E = 1 \end{cases}

for some $f_{X} (H, ϵ_{X})$ . When $E = - 1$ , the data is observational and generated from the DAG in Figure 1 b). When $E = 1$ , the data is interventional and generated from the DAG in Figure 1 c). For this model the error term is

δ_{X} (H, E, ϵ_{X}) = \begin{cases} f_{X} (H, ϵ_{X}) - r (E) & \text{if } E = - 1 \\ x_{1} - r (E) & \text{if } E = 1 \end{cases},

and cannot be decomposed according to Assumption 4. Inconsistency of the CD under do interventions was discussed in [24]. Since the Hybrid estimator also requires the error decomposition assumption, it will also be inconsistent for instruments/environments which are used to represent do operators.
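A small simulation can illustrate this inconsistency. The sketch below is not from the paper: it assumes a univariate model with $f_{X} (H, ϵ_{X}) = H + ϵ_{X}$, $Y \leftarrow β_{0} X + H + ϵ_{Y}$, $β_{0} = 1$, $x_{1} = 2$, and $E$ equal to $\pm 1$ with probability 1/2 each, and it uses the univariate ratio form of the GCD with $E$ centered in the sample. Under these assumed choices the population ratio works out to $β_{0} - 1 / (x_{1}^{2} - 2) = 0.5$, so the estimate should settle near 0.5 rather than 1 even with a very large sample.

```python
import numpy as np

def gcd_univariate(E, X, Y):
    """Univariate GCD ratio with the instrument centered in the sample."""
    Ec = E - E.mean()
    return np.sum(Ec * X * Y) / np.sum(Ec * X * X)

rng = np.random.default_rng(1)
n = 400_000
beta0, x1 = 1.0, 2.0            # assumed values, for illustration only

E = rng.choice([-1.0, 1.0], size=n)
H = rng.standard_normal(n)
eps_X = rng.standard_normal(n)
eps_Y = rng.standard_normal(n)

# Observational arm (E = -1): X depends on the hidden confounder H.
# Interventional arm (E = 1): do(X = x1) removes that dependence.
X = np.where(E == 1.0, x1, H + eps_X)
Y = beta0 * X + H + eps_Y

beta_hat = gcd_univariate(E, X, Y)
print(round(beta_hat, 3))
```

The bias arises because the covariance between $X$ and the hidden confounder differs across the two arms, which is exactly what the error decomposition in Assumption 4 rules out.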

5.2. Independent instrumental variables

Independent instrumental variable (IIV) models assume that $δ_{Y} ⫫ E$ . This can be leveraged to construct consistent estimates of $β_{0}$ even when $r (E) = 0$ , much like the GCD [27, 7, 22]. Note that $δ_{Y} ⫫ E$ is equivalent to

E [η (E) ϕ (Y - X^{T} β_{0})] = E [η (E)] E [ϕ (Y - X^{T} β_{0})]

(5.1)

for all bounded, continuous functions $η$ and $ϕ$ . Thus estimates of $β_{0}$ may be constructed by finding $β$ which satisfy empirical versions of Equation (5.1) (up to sampling variability) for a large set of $η$ and $ϕ$ . See [22] for specific implementations of this approach.
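An empirical version of Equation (5.1) can be sketched as follows. This is only an illustration of the idea, not one of the implementations in [22]: the function name `independence_gap`, the particular test functions, and the simulated variance-shift model with $β_{0} = 1$ are all assumptions. The gap is the largest absolute difference between the two sides of Equation (5.1) over a small set of $η$ and $ϕ$.

```python
import numpy as np

def independence_gap(beta, E, X, Y, etas, phis):
    """Largest |mean(eta(E) phi(r)) - mean(eta(E)) mean(phi(r))|, r = Y - X beta."""
    r = Y - X * beta
    gaps = []
    for eta in etas:
        a = eta(E)
        for phi in phis:
            b = phi(r)
            gaps.append(abs(np.mean(a * b) - np.mean(a) * np.mean(b)))
    return max(gaps)

# Hypothetical bounded, continuous test functions.
etas = [np.sin, np.cos]
phis = [np.tanh, lambda r: 1.0 / (1.0 + r ** 2)]

rng = np.random.default_rng(2)
n = 200_000
E = rng.uniform(0, 1, n)
H = rng.standard_normal(n)
X = H + (1 + 3 * E) * rng.standard_normal(n)   # E shifts only the variance of X
Y = 1.0 * X + H + rng.standard_normal(n)       # assumed beta_0 = 1

gap_true = independence_gap(1.0, E, X, Y, etas, phis)
gap_wrong = independence_gap(0.0, E, X, Y, etas, phis)
print(gap_true, gap_wrong)
```

At the true parameter the residual is $H + ϵ_{Y}$, which is independent of $E$ in this assumed model, so the gap reflects sampling noise only. At a wrong value of $β$ the residual inherits the $E$-dependent variance of $X$ and the gap is visibly larger.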

Assumptions 3 do not require $δ_{Y} ⫫ E$ because $δ_{Y} (H, X, ϵ_{X})$ is a function of $X$ which is a function of $E$ . One common setting where Assumptions 3 hold but $δ_{Y} ⫫ E$ fails is heteroskedastic response models such as

Y \leftarrow X^{T} β_{0} + H + s (X) ϵ_{Y}

where $s : ℝ^{p} \to ℝ^{+}$ and $E [ϵ_{Y}] = 0$ . Here IIV models should not be used.

6. Simulations

We demonstrate some of the applications of the GCD and Hybrid estimators in simulations.

6.1. Continuous environments

We fit the GCD to a model with continuous environments/instruments.

\begin{aligned}
E &\leftarrow Unif [0, 1] \\
H &\leftarrow N (0, 1) \\
X &\leftarrow 3 H + (1 + 10 E) ϵ_{X} \\
Y &\leftarrow X + 9 H + ϵ_{Y}
\end{aligned}

where exogenous variables $ϵ_{X}$ and $ϵ_{Y}$ are standard normal. The true causal parameter is $β_{0} = 1$ . We simulate $n = 100$ samples.

IV is not consistent for this model because $E$ shifts the variance of $X$ (larger $E$ implies larger variance for $X$ ), but not the mean of $X$ . With univariate $E$ and $X$ the GCD is

{\hat{β}}_{G C D} = \frac{\sum_{i = 1}^{n} E_{i} X_{i} Y_{i}}{\sum_{i = 1}^{n} E_{i} X_{i}^{2}} .

(6.1)

The CD is not directly applicable here because the environment is continuous. One could discretize $E$ with some function $e : ℝ \to \{0, 1\}$ and then apply the CD using environment $e (E)$ . We consider this approach with $e (E_{i}) = 1_{E_{i} > median (E_{i})}$ (the environment is 1 if $E_{i}$ is greater than the sample median of the $E$ values). We simulate $N = 1000$ times and plot sampling distributions of the Generalized Causal Dantzig (GCD), the Causal Dantzig (CD), and Ordinary Least Squares (OLS) in Figure 2. Note that $E$ is centered (mean shifted to 0 in the sample) before the GCD is fit. OLS is inconsistent due to hidden confounding. The GCD and CD sampling distributions are both centered at the true causal effect of 1. The GCD empirical sampling distribution is more concentrated around the causal effect. The GCD is also easier to fit because it does not require selection of the function $e$ to binarize the continuous variable $E$ into discrete environments.

Fig 2. — The GCD sampling distribution is centered around the true causal effect with lower asymptotic variance than the CD. Ordinary Least Squares is inconsistent due to hidden confounding. IV (not shown) is also inconsistent because the instrument does not shift the mean of the exposures.
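The simulation in this subsection can be sketched in a few lines. This is a minimal reconstruction under stated assumptions, not the code used for Figure 2: the binarized variant is computed as the GCD ratio of Equation (6.1) applied to the centered indicator $e (E)$, on the assumption (in line with the remark in Section 6.2 that the two coincide for a binary environment) that this matches the CD point estimate. Function names and the seed are illustrative.

```python
import numpy as np

def gcd(E, X, Y):
    """Equation (6.1) applied row-wise, with E centered within each data set."""
    Ec = E - E.mean(axis=1, keepdims=True)
    return np.sum(Ec * X * Y, axis=1) / np.sum(Ec * X ** 2, axis=1)

def simulate(n=100, N=1000, seed=0):
    """Each row of the (N, n) arrays is one simulated data set of size n."""
    rng = np.random.default_rng(seed)
    E = rng.uniform(0, 1, (N, n))
    H = rng.standard_normal((N, n))
    X = 3 * H + (1 + 10 * E) * rng.standard_normal((N, n))
    Y = X + 9 * H + rng.standard_normal((N, n))

    # Discretized environment: 1 if E exceeds its sample median, else 0.
    e = (E > np.median(E, axis=1, keepdims=True)).astype(float)

    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    return {
        "GCD": gcd(E, X, Y),
        "GCD on binarized E": gcd(e, X, Y),
        "OLS": np.sum(Xc * Yc, axis=1) / np.sum(Xc ** 2, axis=1),
    }

est = simulate()
for name, b in est.items():
    q75, q25 = np.percentile(b, [75, 25])
    print(f"{name:>20s}: median {np.median(b):.2f}, IQR {q75 - q25:.2f}")
```

Medians and interquartile ranges are reported instead of means because ratio estimators can have heavy tails at $n = 100$.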

6.2. Overidentified models

The CD/GCD constraints overidentify $β_{0}$ whenever there is more than one instrument/environment. We consider the data generating model:

\begin{aligned}
E_{1} &\leftarrow Bernoulli (1 / 2) \\
E_{2} &\leftarrow Unif (0, 1) \\
X_{1} &\leftarrow Y + X_{2} + (1 + 3 E_{1}) ϵ_{1} \\
X_{2} &\leftarrow H + (1 + 3 E_{1} + 5 E_{2}) ϵ_{2} \\
X_{3} &\leftarrow H + X_{1} + (1 + 5 E_{2}) ϵ_{3} \\
Y &\leftarrow H + X_{2} + ϵ_{Y} .
\end{aligned}

All exogenous variables $(H, ϵ_{1}, ϵ_{2}, ϵ_{3}, ϵ_{Y})$ are standard normal. The true parameter value is $β_{0} = (0, 1, 0)$ because only $X_{2}$ has a causal effect on $Y$ . The hidden confounder $H$ will cause OLS to be inconsistent. The sample size is $n = 200$ . We simulate $N = 500$ runs.

We consider four estimators: GCD (using both $E_{1}$ and $E_{2}$ ), GCDE1 (uses only environment $E_{1}$ ), GCDE2 (uses only environment $E_{2}$ ), and OLS (ordinary least squares). Note that GCDE1 is equivalent to the CD using $E_{1}$ to indicate the environment of the observation. GCDE2 is not equivalent to a CD estimator because $E_{2}$ is a continuous environment. For the GCD, the two-step estimator is used since the two environments overidentify $β_{0}$ . For the initial GCD weight matrix we use

\hat{W} = \frac{1}{n} {(E • X)}^{T} (E • X) .

Table 1 shows the empirical coverage probabilities of 95% confidence intervals for each parameter and the median CI width. The rows of the table correspond to different estimators. OLS is inconsistent because of hidden confounding. This results in the coverage probabilities being well below nominal levels (0 in the case of $β_{2}$ ). The three GCD methods (GCD, GCDE1, and GCDE2) all have empirical coverage probabilities near or above 95%. However GCDE1 and GCDE2 obtain this coverage by producing extremely wide confidence intervals. For example, the median CI width for GCDE2 for $β_{2}$ is 10 times that for GCD. Similarly, the median width of the $β_{3}$ CI for GCDE1 is about 10 times the median width for GCD. By only using the information in one of the environment variables, GCDE1 and GCDE2 produce highly uncertain estimators with very wide confidence intervals.

Table 1.

Coverage and median width of confidence intervals for different estimators.

	Coverage			Median Width
	$β_{1}$	$β_{2}$	$β_{3}$	$β_{1}$	$β_{2}$	$β_{3}$
GCD	0.94	0.96	0.94	0.25	0.39	0.16
GCDE1	1.00	0.99	1.00	1.61	0.63	1.68
GCDE2	0.98	0.98	0.99	1.94	3.91	0.24
OLS	0.09	0.00	0.35	0.15	0.23	0.09
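A two-step fit of the overidentified GCD on this model can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the GMM objective is the quadratic form $\hat{m} (β)^{T} A \hat{m} (β)$ for a positive definite matrix $A$ (a placeholder symbol introduced here), uses a TSLS-style first-step weighting (the inverse second-moment matrix of the instruments), and uses a standard heteroskedasticity-robust second step. How $\hat{W}$ enters the paper's objective is defined in an earlier section and may differ. A sample size of 20,000, larger than the $n = 200$ of the text, is used so that a single run is stable.

```python
import numpy as np

def gcd_instruments(E, X):
    """Rows hold all products E_ik * X_ij after centering E in the sample."""
    Ec = E - E.mean(axis=0)
    return np.einsum("ik,ij->ikj", Ec, X).reshape(E.shape[0], -1)

def linear_gmm(Z, X, Y, A):
    """Closed-form minimizer of m(b)^T A m(b), with m(b) = Z^T (Y - X b) / n."""
    n = len(Y)
    G, g = Z.T @ X / n, Z.T @ Y / n
    return np.linalg.solve(G.T @ A @ G, G.T @ A @ g)

def two_step_gmm(Z, X, Y):
    n = len(Y)
    # Step 1: TSLS-style weighting (an assumed choice).
    b1 = linear_gmm(Z, X, Y, np.linalg.inv(Z.T @ Z / n))
    # Step 2: weight by the inverse second-moment matrix of the moments at b1.
    u = Y - X @ b1
    S = (Z * u[:, None] ** 2).T @ Z / n
    A2 = np.linalg.inv(S)
    b2 = linear_gmm(Z, X, Y, A2)
    G = Z.T @ X / n
    se = np.sqrt(np.diag(np.linalg.inv(G.T @ A2 @ G) / n))
    return b2, np.column_stack([b2 - 1.96 * se, b2 + 1.96 * se])

def simulate(n, rng):
    E1 = rng.binomial(1, 0.5, n).astype(float)
    E2 = rng.uniform(0, 1, n)
    H, e1, e2, e3, eY = rng.standard_normal((5, n))
    X2 = H + (1 + 3 * E1 + 5 * E2) * e2
    Y = H + X2 + eY
    X1 = Y + X2 + (1 + 3 * E1) * e1
    X3 = H + X1 + (1 + 5 * E2) * e3
    return np.column_stack([E1, E2]), np.column_stack([X1, X2, X3]), Y

rng = np.random.default_rng(3)
E, X, Y = simulate(20_000, rng)
beta_hat, ci = two_step_gmm(gcd_instruments(E, X), X, Y)
print(np.round(beta_hat, 3))
print(np.round(ci, 3))
```

Passing only the first or only the second column of `E` to `gcd_instruments` gives analogues of GCDE1 and GCDE2, which is one way to explore why single-environment fits are so much less certain in Table 1.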


6.3. Hybrid estimator

Consider a model in which the instrument $E$ shifts the mean and variance of $X$ :

\begin{aligned}
E &\leftarrow Unif [0, 1] \\
H &\leftarrow N (0, 1) \\
X &\leftarrow H + R E + (α_{v} E + α_{0}) ϵ_{X} \\
Y &\leftarrow H + β_{0} X + ϵ_{Y} .
\end{aligned}

IV is inconsistent when $R = 0$ . It will be a poor estimator when $R$ is near 0 because the instrument is weak. In similar fashion, the GCD is inconsistent when $α_{v} = 0$ . It will be a poor estimator when $α_{v}$ is near 0. The Hybrid estimator is consistent whenever IV or GCD is consistent because it can leverage changes in the mean or variance induced by the instrument/environment $E$ . Here we consider two simulation settings. In Model 1, $R = 5$ and $α_{v} = 1$ so that $E$ has a strong effect on the mean of $X$ and only a weak effect on the variance. In Model 2, $R = 1$ and $α_{v} = 5$ so that $E$ has a weak effect on the mean of $X$ and a strong effect on the variance. We fit IV, GCD, and the Hybrid estimator on these two models. IV and GCD do not require specification of a weight matrix because the number of parameters equals the number of constraints. For the Hybrid estimator, we perform a two-step procedure to estimate the optimal weight matrix, using

\hat{W} = \frac{1}{n} {(E, E • X)}^{T} (E, E • X)

for an initial weighting. We let $β_{0} = 1$ and simulate $n = 200$ samples. Empirical sampling distributions with $N = 1000$ simulations are shown in Figure 3. The Hybrid estimator performs well for both models while IV and the GCD each only perform well for one of the models.

Fig 3. — a) Model 1: IV and Hybrid dominate the GCD when the mean shift is strong but noise shift is weak. b) Model 2: GCD and Hybrid dominate IV when the mean shift is weak but the noise shift is strong.
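For this univariate model the Hybrid estimator stacks one IV moment and one GCD moment, and the resulting estimator has a closed form. The following worked sketch assumes $E$ is centered in the sample and that the GMM objective is the quadratic form $\hat{m}_{H} (β)^{T} A \hat{m}_{H} (β)$ for a positive definite $2 \times 2$ matrix $A$. The symbols $A$, $\hat{g}_{X}$, and $\hat{g}_{Y}$ are introduced here for illustration; how $\hat{W}$ enters the objective is defined earlier in the paper and is not restated.

```latex
\hat{m}_{H}(β) = \frac{1}{n} \sum_{i=1}^{n} \begin{pmatrix} E_{i} \\ E_{i} X_{i} \end{pmatrix} (Y_{i} - X_{i} β) = \hat{g}_{Y} - \hat{g}_{X} β,
\qquad
\hat{g}_{Y} = \frac{1}{n} \sum_{i=1}^{n} \begin{pmatrix} E_{i} Y_{i} \\ E_{i} X_{i} Y_{i} \end{pmatrix},
\quad
\hat{g}_{X} = \frac{1}{n} \sum_{i=1}^{n} \begin{pmatrix} E_{i} X_{i} \\ E_{i} X_{i}^{2} \end{pmatrix},

\hat{β}_{H}(A) = \arg\min_{β} \hat{m}_{H}(β)^{T} A \, \hat{m}_{H}(β) = \frac{\hat{g}_{X}^{T} A \, \hat{g}_{Y}}{\hat{g}_{X}^{T} A \, \hat{g}_{X}} .
```

In the limiting cases where $A$ puts all of its weight on the first or on the second moment, the ratio reduces to the IV ratio $\sum_{i} E_{i} Y_{i} / \sum_{i} E_{i} X_{i}$ or to the GCD ratio of Equation (6.1), respectively. Intermediate weightings combine the mean-shift and higher-moment information.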

7. Application to flow cytometry data

[26] measured the abundance of 11 biochemical agents in thousands of cells using flow cytometry. These data were collected under several conditions (or environments) in which external reagents were added to the system. Each reagent has the effect of stimulating or inhibiting particular agents in the system. Five conditions used in this work (1 observational and 4 interventional) are described in Table 2. In the observational condition, only a general perturbation (CD3+CD28) is applied. For the observational condition, the expression of the 11 biochemical agents was measured in 853 cells. In Condition 3, the reagent Psitectorignin, an inhibitor of PIP2, was added to the system in addition to the general perturbation. This perturbation should reduce the abundance of PIP2 (one of the 11 measured agents) as well as alter the abundance of any agents which PIP2 itself affects.

Table 2.

Description of the conditions we used in our data application.

	Additional reagent	Target	Sample size
Observational	–	–	853
Condition 1	AKT-inhibitor	AKT	911
Condition 2	G0076	PKC	723
Condition 3	Psitectorignin	PIP2	810
Condition 4	U0126	MEK	799


[17] fit the CD to this data to infer a causal signalling network (graph) among the 11 agents. Here we compare the performance of the Causal Dantzig with IV, GCD, and Hybrid estimators. Before fitting any models, we apply a hyperbolic arcsine transform to the data. This technique is used to approximately normalize the flow cytometry data to better satisfy modelling assumptions and reduce the influence of outliers [23].
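The transform itself is a one-liner. In the sketch below the `cofactor` argument is a hypothetical parameter added for illustration (a scaling of this kind is sometimes used with cytometry data); the text does not state one, and the default of 1 gives the plain hyperbolic arcsine.

```python
import numpy as np

def arcsinh_transform(x, cofactor=1.0):
    """arcsinh(x / cofactor); cofactor=1 gives the plain hyperbolic arcsine."""
    return np.arcsinh(np.asarray(x, dtype=float) / cofactor)

# Made-up abundance values spanning several orders of magnitude.
raw = np.array([0.0, 1.0, 10.0, 100.0, 1000.0, 10000.0])
print(np.round(arcsinh_transform(raw), 2))
```

The function is approximately linear near zero and grows like $\log (2 x)$ for large $x$, which compresses the long right tail of abundance measurements while remaining defined at zero.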

7.1. Univariate analysis

We first consider the problem of determining whether a particular agent $X$ has a total causal effect on another agent $Y$ . In general, simple regression of $Y$ on $X$ will not consistently estimate the total causal effect of $X$ on $Y$ because hidden confounding and reverse causality will induce an association between $X$ and $Y$ even when $X$ has no causal effect on $Y$ . To address this problem, we consider $(X, Y)$ data from two environments: an observational environment involving only a general system perturbation and an interventional environment which includes the general perturbation plus a reagent designed to perturb $X$ . Presence or absence of the additional reagent is modeled using an instrument (environment) variable $E$ . We focus on two causes (different $X$ variables), PIP2 and MEK, in order to demonstrate similarities and differences in IV and CD modeling results.

7.1.1. PIP2

Using the Causal Dantzig, [17] (Figure 3) did not find that PIP2 is a direct cause of changes in any of the other 10 biochemical agents. This implies that PIP2 should not have a total effect on any of the agents in the system. To investigate this, we consider abundance measures from cells in two conditions: observational (the general perturbation only) and Condition 3 (Psitectorignin plus general perturbation). The condition is treated as a binary instrumental variable / environment. This is justified by the fact that Psitectorignin is meant to directly target PIP2 and any effects on other agents should occur by way of PIP2. Since the CD and GCD are identical in this case, we compare the CD, IV, and Hybrid estimators.

Figure 4 displays scatter plots of cellular abundances of Plcg versus PIP2 and PIP3 versus PIP2. Red points are for cells measured with the intervention Psitectorignin applied while blue points are cells measured without the intervention. As expected, Psitectorignin has a strong effect on PIP2, substantially decreasing its mean. Plcg and PIP3 abundances are also strongly influenced by the intervention. This suggests that PIP2 is a cause (either directly or possibly indirectly through other agents) of both Plcg and PIP3. Note that the intervention primarily affects the mean, rather than the variance, of PIP2. Thus the CD may struggle to identify an effect because it is sensitive to variance, not mean, shifts. In contrast, IV is better suited to settings where the instrument/environment affects the exposure mean. This is a possible explanation for why the CD did not identify PIP2 as a cause of changes of other agents in [17].

Fig 4. — a) Scatter plot for PIP2 vs Plcg. b) Scatter plot for PIP2 vs PIP3.
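The dependence of IV on a mean shift can be made explicit for a binary instrument/environment. As a worked sketch, assume the univariate IV estimator takes the ratio form $\sum_{i} (E_{i} - \bar{E}) Y_{i} / \sum_{i} (E_{i} - \bar{E}) X_{i}$ , with $E_{i} \in \{s_{0}, s_{1}\}$ , and write $n_{1}$ , $n_{0}$ for the number of cells with and without the intervention and $\bar{A}_{1}$ , $\bar{A}_{0}$ for the corresponding sample means of a quantity $A$ (notation introduced here for illustration).

```latex
\sum_{i=1}^{n} (E_{i} - \bar{E}) A_{i} = \frac{n_{1} n_{0}}{n} (s_{1} - s_{0}) (\bar{A}_{1} - \bar{A}_{0})
\quad \Longrightarrow \quad
\hat{β}_{IV} = \frac{\bar{Y}_{1} - \bar{Y}_{0}}{\bar{X}_{1} - \bar{X}_{0}} .
```

The denominator is the shift in the mean of the exposure between the two conditions, so the estimate is stable when that shift is large, as it is for PIP2 in Figure 4.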

Table 3 contains parameter estimates, confidence intervals, and p-values for the CD, IV and Hybrid estimators fit to the data in Figure 4. As expected, the CD does not find a significant causal effect while IV does. The Hybrid estimator, which can leverage changes in mean or variance, produces estimates very similar to the IV.

Table 3.

Estimation results using PIP2 as the input variable. The left columns show the results when Plcg is the response; the right columns show the results when PIP3 is the response.

	PIP2 -> Plcg			PIP2 -> PIP3
	CD(GCD)	IV	Hybrid	CD(GCD)	IV	Hybrid
Coefficient	1.88	0.42	0.43	−1.44	0.22	0.22
P value	1	<0.0001	<0.0001	1	<0.0001	<0.0001
95% CI	(−5.46, 9.21)	(0.40, 0.45)	(0.40, 0.45)	(−8.50, 5.62)	(0.20, 0.25)	(0.19, 0.25)


7.1.2. MEK

We now consider estimating the total causal effect of MEK on RAF, using data from the observational condition and condition 4. Since the condition 4 reagent targets MEK, this serves as a good instrument. Using the Causal Dantzig, [17] found that MEK has a direct effect on RAF.

Figure 5 shows the scatter plot of MEK versus RAF. The mean of MEK in condition 4 is higher than in the observational condition. Further, the variance of MEK has increased in condition 4 relative to the observational condition. Thus both CD and IV are likely suitable for estimating a causal effect in this situation. Table 4 shows IV, CD and Hybrid parameter estimates, confidence intervals, and p-values fit using the data in Figure 5. All three estimators identify a causal effect.

Fig 5. — Scatter plot that shows the relation between Mek and Raf.

Table 4.

Estimation result for Mek → Raf.

	CD (GCD)	IV	Hybrid
Coefficient	0.94	0.60	0.63
P value	<0.0001	<0.0001	<0.0001
95% CI	(0.87, 1.00)	(0.58, 0.62)	(0.62, 0.65)


7.2. Multivariate analysis

For the multivariate analysis, we fit models using the 5 conditions specified in Table 2. Each condition is treated as a data generating environment. Thus there are 5 environments. An instrument/environment variable $E \in ℝ^{4}$ is created following the procedure outlined in Equation (3.2). We iteratively treat each of the 11 agents as a response and regress it on the other 10 agent abundance measurements. Due to the limited number of conditions relative to agents (4 instruments and 10 exposures), the IV cannot be directly used in this situation. However, we would like the estimator to be sensitive to shifts in mean induced by the interventions. Thus we fit a Hybrid estimator specified in Equation (3.5). We compare the performance of the Hybrid estimator with the CD and GCD.

Note that, if an agent is used as the target or response variable, then intervention on that agent is not allowed in any of the CD, GCD, or Hybrid. This is equivalent to an instrument having a direct effect on the response and would violate Assumption 1b. So when one of Akt, PIP2, Mek or PKC is used as the response, the total number of environments is 4 (observational and 3 conditions), while for all the other cases, the number of environments is 5 (observational and 4 conditions). For constructing the weight matrices, we use the two-step estimators proposed in Theorem 4.3 with initial weight matrices as described in Section 6.2 for the GCD and Section 6.3 for the Hybrid.

When the number of environments is greater than two, the GCD and the CD are not equivalent. [24] proposed two methods for handling the case with more than 2 environments. We use the hiddenICP version of the CD from the R package InvariantCausalPrediction. With $K > 2$ environments, hiddenICP iteratively fits the CD with one environment versus all the other environments ( $K$ fits). This produces $K$ point estimates and $K$ confidence intervals for each parameter. The parameter estimates across the $K$ fits are averaged to create a single point estimate. The confidence interval's lower limit is the smallest of the lower limits of the individual CD fit confidence intervals. Likewise, the confidence interval's upper limit is the largest of the upper limits of the individual confidence intervals. Thus the intervals are conservative.
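The aggregation rule described above can be written compactly. The sketch below follows the description in the text only; it is not the code of the InvariantCausalPrediction package, and the function name `merge_one_vs_rest` and the input numbers are made up for illustration.

```python
import numpy as np

def merge_one_vs_rest(estimates, lower, upper):
    """Combine K one-vs-rest fits: mean estimate, widest enclosing interval.

    Each argument has shape (K, p): one row per fit, one column per parameter.
    """
    estimates, lower, upper = map(np.asarray, (estimates, lower, upper))
    return estimates.mean(axis=0), lower.min(axis=0), upper.max(axis=0)

# Made-up numbers for K = 3 fits of p = 2 parameters.
est = [[0.9, 0.1], [1.1, -0.2], [1.0, 0.4]]
lo = [[0.5, -0.3], [0.8, -0.9], [0.6, 0.0]]
hi = [[1.3, 0.5], [1.4, 0.5], [1.4, 0.8]]
point, lower, upper = merge_one_vs_rest(est, lo, hi)
print(point, lower, upper)
```

Because the merged interval contains every individual interval, its width is driven by the least informative of the $K$ fits, which is one reason the resulting intervals can be very wide.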

Figure 6 compares 95% confidence intervals for the CD, GCD, and Hybrid estimators with a) Plcg and b) PIP3 as the response. CD confidence intervals are very wide relative to GCD and Hybrid. The GCD and the Hybrid estimators perform similarly. In our univariate analysis, we found that PIP2 had a total effect on both Plcg and PIP3. In the multivariate analysis here, GCD and Hybrid identify PIP2 as a direct cause of changes in both Plcg and PIP3. The CD fails to identify this effect, due to suboptimal merging of environments and an excessively conservative strategy for constructing confidence intervals.

Fig 6. — a) Hybrid, GCD and CD estimation results when Plcg is the response. b) Hybrid, GCD and CD estimation results when PIP3 is the response.

We define a causal effect as strong if the entire 95% confidence interval is outside the range of (−.2, .2). Figure 7 shows the strong causal relations found by the Hybrid estimator in a graph. Compared with the results from [17] (Figure 3), the Hybrid estimator finds more strong causal relations: 24 strong relations for the Hybrid versus 13 for the CD. Some relations are found by both methods, e.g. the causal effects and the reverse causal effects between Raf-Mek, P38-Jnk, and Erk-Akt. The Hybrid estimator identifies the entire Raf -> Mek -> Erk path, which was seen as a major validation of the original Bayesian network applied to the system [26], while the original application of the CD failed to identify the Mek -> Erk edge. Further, the CD did not discover any causal effects from PIP2 to Plcg or from PIP2 to PIP3, which Figure 4 suggests exist and which the Hybrid estimator discovers. In addition, the Hybrid estimator found causal relations similar to those found by other methods (ICP from [20] and the Bayesian network from [26]). For example, ICP and the Hybrid both discovered the causal relations from PIP2 to Plcg and PKA to Erk, while the Bayesian network and the Hybrid both discovered the relations from Plcg to PKC and PIP2 to PKC.

Fig 7. — Causal network found by the Hybrid estimator. The red circles represent the agents on which interventions were applied.
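The "strong effect" rule used to build the network is simple to state in code. In this sketch the function name `is_strong` is hypothetical and the treatment of intervals that end exactly at ±0.2 (counted as strong here) is an assumption, since the text does not say how ties are handled. The first two intervals are borrowed from the univariate tables purely as inputs; the third is made up.

```python
def is_strong(ci_lower, ci_upper, threshold=0.2):
    """True when the whole interval lies outside (-threshold, threshold)."""
    return ci_lower >= threshold or ci_upper <= -threshold

print(is_strong(0.58, 0.62))    # True: interval lies entirely above 0.2
print(is_strong(-5.46, 9.21))   # False: interval straddles zero
print(is_strong(0.05, 0.35))    # False: excludes zero but overlaps (-0.2, 0.2)
```

The third case shows that the rule is stricter than statistical significance at the 5% level: an interval can exclude zero and still fail to count as a strong effect.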

8. Discussion

In this work, we proposed two new methods, the Generalized Causal Dantzig (GCD) and Hybrid, for estimating causal effects in the presence of hidden confounding and reverse causality. The GCD generalizes the Causal Dantzig estimator of [24] to problems with continuous environments. Further we developed theory (based on GMM estimation) for the GCD in the over-identified case. This was not present in the original CD paper. The Hybrid estimator enforces both GCD environment and instrumental variable (IV) moment constraints, further illustrating connections between the concepts of environment and instrument which have been previously noted [11, 20]. We demonstrated the utility of these estimators in simulations and an application to Flow Cytometry data.

In this work, we did not discuss high-dimensional estimation. [24] proposed using the Dantzig selector [5] to regularize the Causal Dantzig in high-dimensional problems. This required extensive theoretical development. Since both the GCD and Hybrid are GMM estimators, existing methods for fitting high-dimensional GMM estimators (penalty terms, tuning parameter selection algorithms, and theory) may be applicable for the GCD and Hybrid estimators [3]. This represents a direction for future research. The performance of these high-dimensional estimators may be compared with existing high-dimensional instrumental variable estimators [2, 9, 14].

Supplementary Material

NIHMS1953713-supplement-1.pdf (159.3KB, pdf)

Funding

James P. Long was partially supported by National Institutes of Health SPORE [P50CA127001, P50CA140388] and CCTS [UL1TR003167]. Kim-Anh Do was partially supported by the National Institutes of Health [P30CA016672], SPORE [P50CA140388], CCTS [TR000371] and CPRIT [RP160693].

Footnotes

https://github.com/longjp/gcd-code

Contributor Information

James P. Long, Department of Biostatistics, University of Texas MD Anderson Cancer Center

Hongxu Zhu, Department of Biostatistics, University of Texas, School of Public Health.

Kim-Anh Do, Department of Biostatistics, University of Texas MD Anderson Cancer Center.

Min Jin Ha, Department of Biostatistics, Graduate School of Public Health, Yonsei University.

References

[1]. Angrist JD, Imbens GW, and Rubin DB. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
[2]. Belloni A, Chen D, Chernozhukov V, and Hansen C. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429, 2012. MR3001131
[3]. Belloni A, Chernozhukov V, Chetverikov D, Hansen C, and Kato K. High-dimensional econometrics and regularized GMM. arXiv preprint arXiv:1806.01888, 2018.
[4]. Cameron AC and Trivedi PK. Microeconometrics: Methods and Applications. Cambridge University Press, 2005.
[5]. Candes E, Tao T, et al. The dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007. MR2382644
[6]. Chen C, Ren M, Zhang M, and Zhang D. A two-stage penalized least squares method for constructing large systems of structural equations. The Journal of Machine Learning Research, 19(1):40–73, 2018. MR3862409
[7]. Dunker F. Adaptive estimation for some nonparametric instrumental variable models with full independence. Electronic Journal of Statistics, 15(2):6151–6190, 2021. MR4355705
[8]. Gimenez JR and Rothenhäusler D. Causal aggregation: estimation and inference of causal effects by constraint-based data fusion. arXiv preprint arXiv:2106.03024, 2021. MR4577774
[9]. Gold D, Lederer J, and Tao J. Inference for high-dimensional instrumental variables regression. Journal of Econometrics, 217(1):79–111, 2020. MR4093746
[10]. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, 1029–1054, 1982. MR0666123
[11]. Heinze-Deml C, Peters J, and Meinshausen N. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2), 2018. MR4335430
[12]. Henderson HV and Searle SR. The vec-permutation matrix, the vec operator and Kronecker products: A review. Linear and Multilinear Algebra, 9(4):271–288, 1981. MR0611262
[13]. Lewbel A. Using heteroscedasticity to identify and estimate mismeasured and endogenous regressor models. Journal of Business & Economic Statistics, 30(1):67–80, 2012. MR2899185
[14]. Lin W, Feng R, and Li H. Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics. Journal of the American Statistical Association, 110(509):270–288, 2015. MR3338502
[15]. Mátyás L, Gourieroux C, Phillips PC, et al. Generalized Method of Moments Estimation, volume 5. Cambridge University Press, 1999. MR1688695
[16]. Meinshausen N. Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), pages 6–10. IEEE, 2018.
[17]. Meinshausen N, Hauser A, Mooij JM, Peters J, Versteeg P, and Bühlmann P. Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences, 113(27):7361–7368, 2016.
[18]. Newey WK and McFadden D. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245, 1994. MR1315971
[19]. Pearl J et al. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009. MR2545291
[20]. Peters J, Bühlmann P, and Meinshausen N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 947–1012, 2016. MR3557186
[21]. Pfister N, Bühlmann P, and Peters J. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264–1276, 2019. MR4011778
[22]. Poirier A. Efficient estimation in models with independence restrictions. Journal of Econometrics, 196(1):1–22, 2017. MR3572810
[23]. Ray S and Saumyadipta P. A computational framework to emulate the human perspective in flow cytometric data analysis. PLoS ONE, 7(5)(35693), 2012.
[24]. Rothenhäusler D, Bühlmann P, Meinshausen N, et al. Causal dantzig: fast inference in linear structural equation models with hidden variables under additive interventions. The Annals of Statistics, 47(3):1688–1722, 2019. MR3911127
[25]. Rothenhäusler D, Meinshausen N, Bühlmann P, and Peters J. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(2):215–246, 2021. MR4250274
[26]. Sachs K, Perez O, Pe’er D, Lauffenburger DA, and Nolan GP. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
[27]. Saengkyongam S, Henckel L, Pfister N, and Peters J. Exploiting independent instruments: Identification and distribution generalization. In International Conference on Machine Learning, pages 18935–18958. PMLR, 2022.
[28]. Wright PG. Tariff on Animal and Vegetable Oils. Macmillan Company, New York, 1928.
