Author manuscript; available in PMC: 2019 Dec 28.
Published in final edited form as: Proc Mach Learn Res. 2019 Apr;89:3080–3088.

A Potential Outcomes Calculus for Identifying Conditional Path-Specific Effects

Daniel Malinsky 1, Ilya Shpitser 2, Thomas Richardson 3
PMCID: PMC6935349  NIHMSID: NIHMS1063775  PMID: 31886462

Abstract

The do-calculus is a well-known deductive system for deriving connections between interventional and observed distributions, and has been proven complete for a number of important identifiability problems in causal inference [1, 8, 18]. Nevertheless, as it is currently defined, the do-calculus is inapplicable to causal problems that involve complex nested counterfactuals which cannot be expressed in terms of the “do” operator. Such problems include analyses of path-specific effects and dynamic treatment regimes. In this paper we present the potential outcome calculus (po-calculus), a natural generalization of do-calculus for arbitrary potential outcomes. We thereby provide a bridge between identification approaches which have their origins in artificial intelligence and statistics, respectively. We use po-calculus to give a complete identification algorithm for conditional path-specific effects with applications to problems in mediation analysis and algorithmic fairness.

1. Introduction

Pearl’s do-calculus [6, 7, 8] is an abstract set of rules for reasoning about interventions that has proven to be influential in settings, such as computer science and artificial intelligence, where graphical models are used to represent causal relationships. In statistics and some social/biomedical sciences, the potential outcome framework [4, 15] is more commonly used to express causal assumptions and reason about interventions. Richardson and Robins [11] have made an important contribution by unifying causal formalisms grounded in graphical causal models with the potential outcomes framework. In this paper we build on those connections, presenting a calculus for reasoning about interventions in the potential outcomes notation that is equivalent to Pearl’s do-calculus for standard interventions, but allows generalizations to nested causal quantities pertinent to evaluating (e.g.) dynamic treatment regimes or path-specific interventions (for which the “do” notation is insufficiently expressive). We show how the new calculus can be applied to problems in mediation analysis, specifically the identification of conditional path-specific causal effects. We introduce a procedure which is complete for expressing such quantities as functions of the observed data distribution, i.e., an algorithm which will produce an identifying expression for a conditional path-specific effect if and only if the effect is identifiable.

Conditional path-specific effects are quantified via conditional distributions over potential outcomes, where treatment variables are assigned to possibly distinct values for different causal pathways. In mediation analysis, functions of such distributions are used to isolate the effect of a drug, therapy, or other treatment assignment along a specific pathway in a specific subpopulation, defined by pre-treatment variables (such as age or gender) or post-treatment variables (such as adverse reactions to the treatment). Importantly, there are settings where the marginal path-specific effect is identified but the conditional path-specific effect is not identified; we later discuss one simple example shown in Fig. 1.

Figure 1:

(a) A hidden variable causal DAG where p(Y(a, M(a′))) is identified, but p(Y(a, M(a′)) | C) is not identified. (b) A seemingly similar hidden variable causal DAG where both p(Y(a, M(a′))) and p(Y(a, M(a′)) | C) are identified.

Another context in which conditional path-specific effects may be of interest is in the study of algorithmic fairness. Recent papers [2, 3, 21] have proposed to combat disparities perpetuated by some automated decision-making systems by identifying, estimating, and constraining unfair causal influences that propagate along certain pathways, e.g., the direct effect of gender on hiring outcomes or the indirect effect of race on criminal justice outcomes via geographical factors. It may also be desirable to constrain such path-specific effects for certain subpopulations, which requires identifying conditional path-specific effects.

We begin by introducing potential outcomes, causal models, graphs, and some relevant results. Then we review the do-calculus, propose our potential outcome calculus, demonstrate they are equivalent, and give some simple derivations to establish the soundness of the rules in the language of potential outcomes. Finally, we introduce a formalism for expressing path-specific effects (PSEs) and a complete identification procedure for conditional PSEs.

2. Potential Outcomes, the Do Operator and Causal Models

Fix a set of indices $K \equiv \{1, \dots, k\}$ under a total ordering $\prec$. For each random variable $V_i$, $i \in K$, define a state space $\mathfrak{X}_i$ and the set $\mathrm{Pre}_i \equiv \{1, \dots, i-1\}$. Given $A \subseteq K$, we will denote subsets of random variables indexed by $A$ by $V_A$, and elements $v_A$ of $\mathfrak{X}_A$ by $a$ (lowercase letters).

We assume the existence of all one-step-ahead potential outcome random variables (a.k.a. counterfactuals) of the form $V_i(pa_i) \equiv V_i(v_{Pa_i})$, where $Pa_i$ is a fixed subset of $\mathrm{Pre}_i$, and $pa_i \equiv v_{Pa_i}$ is any element in $\mathfrak{X}_{Pa_i}$. The variable $V_i(pa_i)$ denotes the value of $V_i$ had the set of direct causes of $V_i$, or $V_{Pa_i}$, been set, possibly contrary to fact, to values $pa_i$. The existence of a total ordering $\prec$ on indices, and the fact that $Pa_i \subseteq \mathrm{Pre}_i$, precludes the existence of cyclic causation. (That is, we consider causal models that are recursive.) $V_i(pa_i)$ may be conceptualized as the output of a structural equation $f_i : \mathfrak{X}_{Pa_i \cup \{\epsilon_i\}} \to \mathfrak{X}_i$, a function representing a causal mechanism that maps values of $Pa_i$, as well as the value of an exogenous disturbance variable $\epsilon_i$, to values of $V_i$. We define causal models as sets of densities over the set of random variables

$\mathbb{V} \equiv \{ V_i(pa_i) \mid i \in \{1, \dots, k\},\ pa_i \in \mathfrak{X}_{Pa_i} \}.$

For simplicity of presentation, we assume Xi is always finite, and thus ignore the measure theoretic complications that arise with defining densities over sets of random variables above in the case where some state spaces on Pai are infinite.1

Given a set of one-step-ahead potential outcomes $\mathbb{V}$, for any $A \subseteq K$ and $i \in K$ we define the potential outcome $V_i(a)$, the response of $V_i$ had variables in $V_A$ been set to $a$, by the definition known as recursive substitution:

$V_i(a) \equiv V_i\big(a_{Pa_i}, \{V_j(a) \mid j \in Pa_i \setminus A\}\big).$ (1)

In words, this states that $V_i(a)$ is the potential outcome where the variables of $Pa_i$ in $A$ are set to their corresponding values in $a$, and all elements of $Pa_i$ not in $A$ are set to whatever values their recursively defined counterfactual versions would have taken had $A$ been set to $a$. Equivalently, $V_i(a)$ is the random variable induced by a modified set of structural equations: specifically, the functions $f_j$ for all $V_j \in A$ are replaced by constant functions $f_j^*$ that set $V_j$ to the corresponding value in $a$.
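Recursive substitution can be sketched in a few lines of code. The example below is a minimal illustration, not part of the paper: the three-variable model (C, M, Y), its structural equations, and the function name `potential_outcome` are all assumptions chosen for the sketch. For one fixed draw of the disturbances it evaluates the potential outcome of a variable under an intervention, recursing on every parent that is not intervened on.

```python
from typing import Callable, Dict, Mapping, Tuple

# Hypothetical binary model C -> M -> Y with C -> Y (not taken from the paper).
# Each structural equation maps parent values and a disturbance to a value.
PARENTS: Dict[str, Tuple[str, ...]] = {"C": (), "M": ("C",), "Y": ("C", "M")}
EQUATIONS: Dict[str, Callable[..., int]] = {
    "C": lambda eps: eps,
    "M": lambda c, eps: c ^ eps,
    "Y": lambda c, m, eps: (c & m) ^ eps,
}


def potential_outcome(i: str, a: Mapping[str, int], eps: Mapping[str, int]) -> int:
    """Evaluate V_i(a) for one draw of the disturbances via recursive substitution."""
    parent_values = []
    for j in PARENTS[i]:
        if j in a:
            # Parent is intervened on: plug in its assigned value from a.
            parent_values.append(a[j])
        else:
            # Parent is not intervened on: recurse on its counterfactual V_j(a).
            parent_values.append(potential_outcome(j, a, eps))
    return EQUATIONS[i](*parent_values, eps[i])
```

With `eps = {"C": 1, "M": 0, "Y": 0}`, calling `potential_outcome("Y", {"M": 0}, eps)` evaluates Y had M been set to 0. Note that the intervention only enters through the parents, so `potential_outcome("M", {"M": 0}, eps)` returns the natural value of M rather than the constant 0, in line with the convention for singletons stated in the text.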

We denote by $\mathbb{V}^*$ the set of all variables derived by (1) from $\mathbb{V}$, together with $\mathbb{V}$. In addition, for notational conciseness, we will use index sets to denote sets of potential outcomes themselves. That is, for $Y \subseteq K$, $A \subseteq K$, we will denote the set $\{V_i(a) \mid i \in Y\}$ by $Y(a)$. Note that we allow $Y$ and $A$ to intersect. Thus, we allow sets of potential outcomes of the form $A(a)$, which denote the sets $\{V_i(a) \mid i \in A\}$, where each $V_i(a)$ is defined using (1) above. In particular, if $A = \{V_i\}$ (a singleton), $V_i(v_i)$ is defined in our notation to be the random variable $V_i$, not the constant $v_i$.

In cases where Y and A do not intersect, the distribution p(Y(a)) has been denoted by Pearl as p(Y | do(a)) [8]. This formulation places emphasis on the intervention operator do(a), which replaces structural equations by constants.

Recursive substitution provides a link between observed variables and potential outcomes. In particular, it implies the consistency property:2 for any disjoint $A, B \subseteq K$, $i \in K \setminus (A \cup B)$, $a \in \mathfrak{X}_A$, $b \in \mathfrak{X}_B$,

B(a)=b implies Vi(a,b)=Vi(a). (2)

Proposition 1 (consistency) Given V* derived from V via (1), (2) holds.

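Proof: By (1), $V_i(a)$ and $V_i(a, b)$ are defined as

$V_i\big(a_{Pa_i}, \{V_j(a) \mid j \in Pa_i \setminus (A \cup B)\}, \{V_j(a) = b_j \mid j \in B \cap Pa_i\}\big)$

and $V_i\big(a_{Pa_i}, \{V_j(a, b) \mid j \in Pa_i \setminus (A \cup B)\}, b_{Pa_i}\big)$, respectively. The conclusion follows immediately. □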

(1) implies that every Vi(a) can be written as a function of a unique minimally causally relevant subset of a.

Proposition 2 (causal irrelevance) Given $\mathbb{V}^*$ derived from $\mathbb{V}$ via (1), let $V_i(a) \in \mathbb{V}^*$, and let $A^*$ be the maximal subset of $A$ such that for every $A_j \in A^*$, there exists a sequence $V_{w_1}, \dots, V_{w_m}$ that does not intersect $A$, where $A_j \in Pa_{w_1}$, $V_{w_i} \in Pa_{w_{i+1}}$ for $i = 1, \dots, m-1$, and $V_{w_m} \in Pa_i$. Then $V_i(a) = V_i(a^*)$.

Proof: Follows by definition of A* and (1). □

A functional causal model (a.k.a. a non-parametric structural equation model with independent errors, NPSEM-IE) asserts that the sets of variables

$\big\{ \{V_i(pa_i) \mid pa_i \in \mathfrak{X}_{Pa_i}\} \mid i \in \{1, \dots, k\} \big\}$ (3)

are mutually independent. Phrased in terms of structural equations, the functional causal model states that the joint distribution of the disturbance terms factorizes into a product of marginals: $p(\epsilon_1, \dots, \epsilon_k) = \prod_{i=1}^{k} p(\epsilon_i)$.

Alternative causal models, which make fewer assumptions than the functional model but are sufficient for all inferences we aim to make in this paper, are discussed in [11, 20]. We focus on the functional causal model here, since it is simpler to describe and is the original setting of Pearl’s do-calculus. We discuss how our results apply to a weaker causal model [11] in the Supplement.

3. Graphical Models

Much conceptual clarity may be gained by viewing causal models as graphs. We will consider graphs with either directed edges only (→), or mixed graphs with both directed and bidirected (↔) edges. Vertices correspond to random variables, and we simplify notation by using Vi to refer to both the graph vertex and corresponding random variable. In all cases we will require the absence of directed cycles, meaning that whenever the graph contains a path of the form Vi → ⋯ → Vj, the edge Vj → Vi cannot exist. Directed graphs with this property are called directed acyclic graphs (DAGs), and mixed graphs with this property are called acyclic directed mixed graphs (ADMGs). We will refer to graphs by G(V), where V is the set of random variables indexed by {1, …, k}. We will use the following standard definitions for sets of vertices in a graph:

$Pa_i^G \equiv \{V_j \mid V_j \to V_i \text{ in } G\}$ (parents of $V_i$)
$An_i^G \equiv \{V_j \mid V_j \to \cdots \to V_i \text{ in } G\}$ (ancestors of $V_i$)
$De_i^G \equiv \{V_j \mid V_j \leftarrow \cdots \leftarrow V_i \text{ in } G\}$ (descendants of $V_i$)

By convention, we assume $V_i \in An_i^G$ and $V_i \in De_i^G$. We will generally drop the superscript G if the relevant graph is obvious and sometimes write G in place of G(V) when the vertex set is clear. Given a DAG G(V), a statistical DAG model (a.k.a. a Bayesian network) associated with G(V) is a set of distributions that are Markov relative to G(V), i.e., the set of distributions that can be written as the following product of conditional densities:

$p(V) = \prod_{i=1}^{k} p(V_i \mid Pa_i).$ (4)

Given p(V) that is Markov relative to a DAG G(V), conditional independence relations (written $(Y \perp\!\!\!\perp Z \mid X)$, where X, Y, Z are disjoint subsets of the index set K) satisfied by p(V) can be derived using the well-known d-separation criterion [5], which we reproduce in the Supplement. We write $(Y \perp\!\!\!\perp_d Z \mid X)_{G(V)}$ when Y is d-separated from Z given X in G(V). If p(V) is Markov relative to G(V), then the following global Markov property holds: for any disjoint X, Y, Z,

$(Y \perp\!\!\!\perp_d Z \mid X)_{G(V)} \Rightarrow (Y \perp\!\!\!\perp Z \mid X)$ in $p(V)$.

Functional causal models may also be associated with a DAG G by identifying Pai with the graphical parents of Vi in G(V). Given a functional causal model for DAG G, the joint distribution for any V(a) derived from V using (1) is identified via the following formula:

$p(V(a)) = \prod_{i=1}^{K} p(V_i \mid Pa_i \setminus A, a_{Pa_i}),$ (5)

provided $\prod_{a_j \in a} p(a_j \mid Pa_j \setminus A, a_{Pa_j}) > 0$. See [11] for a simple proof. The modified factorization (5) is known as the extended g-formula [11, 13]. Note that (5) has a term for every $V_i \in V$, just like (4).
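As a concrete illustration of (5), the sketch below evaluates the extended g-formula by enumeration for a small binary DAG C → A → Y with C → Y. The graph, the probability tables, and the function names are hypothetical values chosen for the example; only the factorization itself comes from the text.

```python
from itertools import product

# Hypothetical conditional probability tables for a binary DAG C -> A -> Y, C -> Y.
p_c = {0: 0.5, 1: 0.5}                                     # P(C = c)
p_a_given_c = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}   # p_a_given_c[c][a]
p_y1_given_ac = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.9}  # P(Y=1 | a, c)


def g_formula_joint(a_star):
    """Return p(C=c, A(a*)=a_nat, Y(a*)=y) as a dict, following factorization (5)."""
    joint = {}
    for c, a_nat, y in product((0, 1), repeat=3):
        # In Y's factor the parent A is replaced by the fixed value a*;
        # the factor for A itself keeps its natural (observed) form.
        p_y1 = p_y1_given_ac[(a_star, c)]
        p_y = p_y1 if y == 1 else 1.0 - p_y1
        joint[(c, a_nat, y)] = p_c[c] * p_a_given_c[c][a_nat] * p_y
    return joint


def marginal_y1(a_star):
    """Marginalise the joint over C and A to obtain P(Y(a*) = 1)."""
    return sum(p for (c, a_nat, y), p in g_formula_joint(a_star).items() if y == 1)
```

Summing out C and A recovers the familiar adjustment formula, so under these tables `marginal_y1(1)` is 0.5 · 0.5 + 0.5 · 0.9 = 0.7 up to floating-point rounding.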

The formula (5) resembles (4) and in fact may be viewed as a factorization of p(V(a)) with respect to a certain graph derived from G. Such graphs, called Single World Intervention Graphs (SWIGs), were introduced in [11]. SWIGs are graphical representations of potential outcome densities that help unify the graphical and potential outcome formalisms. Given a set A of variables which are assigned to values a, a SWIG G(a) is constructed from G(V) by splitting all vertices in A into a random half and a fixed half, with the random half inheriting all edges with an incoming arrowhead and the fixed half inheriting all outgoing directed edges. Then, all random vertices $V_i$ are re-labelled as $V_i(a)$ or equivalently (due to Proposition 2) as $V_i(a_{an_i^*})$, where $an_i^*$ consists of values of the ancestors of $V_i$ in the split graph. In [11], unsplit vertices were drawn as circles, and split nodes as half circles, with fixed nodes denoted by a lowercase letter. Fixed nodes are enclosed by a double line. For an example of a SWIG representing the joint density p(Y(a), M(a), C(a), A(a)) = p(Y(a), M(a), C, A), see Fig. 2 (b). Because of the resemblance of (5) to a DAG factorization, we say that p(V(a)) is Markov relative to a SWIG G(a) if p(V(a)) may be written as (5).
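The node-splitting step can be sketched as a small graph transformation. The function below is an illustrative assumption rather than code from the paper: it represents a DAG as a set of (parent, child) pairs, names the fixed half of a vertex by lowercasing its label, and does not relabel the random vertices as V(a).

```python
def swig(edges, intervened):
    """Split each intervened vertex of a DAG into a random half and a fixed half.

    edges: iterable of (parent, child) pairs; intervened: set of vertex names.
    The fixed half of vertex "A" is named "a" (lowercase), a labelling convention
    used only in this sketch. Returns (new edge set, set of fixed vertices).
    """
    fixed = {v: v.lower() for v in intervened}
    out = set()
    for parent, child in edges:
        # Outgoing edges of an intervened vertex leave from its fixed half;
        # incoming edges still point at the random half, so `child` is unchanged.
        out.add((fixed.get(parent, parent), child))
    return out, set(fixed.values())
```

Applied to the DAG of Fig. 2 (a) with A intervened on, the edges A → M and A → Y become a → M and a → Y, while C → A is retained, which matches the description of Fig. 2 (b).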

Figure 2:

(a) A simple causal DAG G, with a treatment A, an outcome Y, a vector C of baseline variables, and a mediator M. (b) A SWIG G(a) derived from (a) corresponding to the world where A is intervened on to value a. (c) An extended graph Ge derived from (a).

A SWIG G(a) is a DAG with a vertex set {V(a), a}, and may be viewed as a conditional graph, with vertices in V(a) corresponding to random variables, and vertices in a corresponding to variables fixed to a value. We extend the notion of d-separation to allow fixed vertices. Specifically, we allow d-separation statements of the form $(Y(a), a' \perp\!\!\!\perp_d Z(a) \mid X(a))_{G(a)}$, for disjoint random subsets Y(a), Z(a), X(a) of V(a) and a′ a subset of a. Note that a possibly d-connecting path may only contain random nodes as non-endpoint vertices (as in [11] where fixed nodes are always blocked). Our extension here consists only in allowing fixed vertices to also appear as one endpoint in a d-separation statement. Just as (4) implied the global Markov property for a DAG, the modified factorization (5) implies a global Markov property for a SWIG.

Proposition 3 (SWIG global Markov property) If p(V(a)) is Markov relative to G(a), then for any disjoint subsets Y(a), Z(a), X(a) of V(a) and a subset $a'$ of $a$, if $(Y(a), a' \perp\!\!\!\perp_d Z(a) \mid X(a))_{G(a)}$ then, for some f(·),

$p(Z(a) \mid Y(a), X(a)) = p(Z(a) \mid X(a)) = f(Z, X, a \setminus a').$

Proof: The first equality is due to Theorem 12 in [11], the second follows from Theorem 19 in [10]. □

Note that f(Z, X, a\a′) is not necessarily equal to p(Z(a\a′) | X(a \ a′)).

The SWIG global Markov property implies the following intuitive result (proved in the Supplement) relating independence statements in p(V(a)) for various sets A. Specifically, the result is that interventions “always help” when it comes to conditional independence.

Proposition 4 (intervention monotonicity) For any disjoint subsets Y(a), Z(a), X(a) of V(a) and a subset $a'$ of $a$, if $(Y(a), a' \perp\!\!\!\perp Z(a) \mid X(a))_{G(a)}$ then for any $A'' \supseteq A$, $(Y(a''), a' \perp\!\!\!\perp Z(a'') \mid X(a''))_{G(a'')}$.

Graphical Models With Hidden Variables

We also consider causal models where some variables are unmeasured (a.k.a. “latent” or “hidden” variables). Given a DAG G(V ∪ H), define a latent projection mixed graph G(V) as follows. V is the vertex set of G(V), and for any Vi, Vj ∈ V there is an edge Vi → Vj if there exists a directed path from Vi to Vj in G(V ∪ H), with all intermediate nodes on the path in H; there is an edge Vi ↔ Vj if there exists a path from Vi to Vj of the form Vi ← ⋯ → Vj, where every intermediate node on the path is in H and no consecutive edges on the path are of the form → Hk ← for Hk ∈ H. The latent projection G(V) obtained from a DAG G(V ∪ H) is always an ADMG. Our results in this paper apply to ADMGs, and indeed this is the intended setting for Pearl’s do-calculus (he used the terminology “semi-Markovian models”).
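The latent projection can be computed directly from this definition. The sketch below is one possible implementation under stated assumptions: the DAG is given as (parent, child) pairs, `hidden` is the set H, and bidirected edges are returned as unordered pairs. It relies on the observation that a collider-free path of the form Vi ← ⋯ → Vj through hidden vertices has a single hidden source from which both endpoints are reached by directed paths with hidden intermediates.

```python
from itertools import combinations


def latent_projection(edges, hidden):
    """Project a DAG over observed and hidden vertices onto the observed ones.

    edges: iterable of (parent, child); hidden: set of hidden vertex names.
    Returns (directed, bidirected) edge sets over the observed vertices.
    """
    hidden = set(hidden)
    children, nodes = {}, set()
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        nodes.update((u, v))

    def observed_reach(start):
        """Observed vertices reachable from `start` by a directed path whose
        intermediate vertices are all hidden."""
        reached, seen, stack = set(), set(), list(children.get(start, ()))
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            if v in hidden:
                stack.extend(children.get(v, ()))   # keep walking through hidden vertices
            else:
                reached.add(v)                      # stop at the first observed vertex
        return reached

    observed = nodes - hidden
    directed = {(u, v) for u in observed for v in observed_reach(u)}
    bidirected = set()
    for h in hidden:
        # Any two observed vertices reached from the same hidden source are confounded.
        for u, v in combinations(sorted(observed_reach(h)), 2):
            bidirected.add(frozenset((u, v)))
    return directed, bidirected
```

For instance, with edges A → M → Y, U → A, U → Y, A → H → Y and hidden set {U, H}, the projection has directed edges A → M, M → Y, A → Y and the single bidirected edge A ↔ Y.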

The definition of d-separation naturally generalizes to ADMGs with minor modification for bidirected edges; the resulting criterion is called m-separation [9]. We write $(Y \perp\!\!\!\perp_m Z \mid X)_{G(V)}$ if Y is m-separated from Z given X in ADMG G(V). In the following we sometimes drop the d or m subscripts and just write ⫫, where the relevant criterion is implicit.

Given an ADMG G(V), we define a SWIG G(V)(a) by the analogous node splitting construction as for DAGs. Specifically, each node is split into a random half and a fixed half, with random halves inheriting all incoming directed and bidirected edges, and fixed halves inheriting all outgoing directed edges. Alternatively, given a SWIG G(V)(a) derived from a DAG G(V ∪ H), we define the latent projection operation in the natural way, yielding the SWIG G(a)(V) with random vertices V, fixed vertices a, and directed edges from ai ∈ a or Vi ∈ V to Vj ∈ V if there is a directed path from the corresponding vertices in G(a) with all intermediate vertices in H, and bidirected edges from Vi ∈ V to Vj ∈ V if there exists a path from Vi to Vj of the form Vi ← … → Vj, where every intermediate node on the path is in H and no consecutive edges on the path are of the form → Hk ← for Hk ∈ H. These operations commute, and we can derive independence statements via m-separation on G(V)(a), as we prove in the Supplement.

4. Do-Calculus and Potential Outcomes Calculus

Pearl formulated the do-calculus originally as follows:

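Rule 1: $p(y \mid z, w, do(x)) = p(y \mid w, do(x))$ if $(Y \perp\!\!\!\perp Z \mid W, X)_{G_{\bar{X}}}$
Rule 2: $p(y \mid z, w, do(x)) = p(y \mid w, do(z), do(x))$ if $(Y \perp\!\!\!\perp Z \mid W, X)_{G_{\bar{X}, \underline{Z}}}$
Rule 3: $p(y \mid w, do(z), do(x)) = p(y \mid w, do(x))$ if $(Y \perp\!\!\!\perp Z \mid W, X)_{G_{\bar{X}, \overline{Z(W)}}}$

where $G_{\bar{X}}$ denotes the graph obtained from $G$ by removing all edges with arrowheads into $X$, $G_{\underline{Z}}$ denotes the graph obtained from $G$ by removing all directed edges out of $Z$, and $Z(W) \equiv Z \setminus An_{G_{\bar{X}}}(W)$.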

Here we present the do-calculus entirely in terms of potential outcomes (the “potential outcomes calculus” or “po-calculus” for short). The conditions are phrased in terms of conditional independencies implied by SWIGs, e.g., G(x) for the SWIG where X is assigned value x. We restate the rules as follows:

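Rule 1: $p(Y(x) \mid Z(x), W(x)) = p(Y(x) \mid W(x))$ if $(Y(x) \perp\!\!\!\perp Z(x) \mid W(x))_{G(x)}$
Rule 2: $p(Y(x,z) \mid W(x,z)) = p(Y(x) \mid W(x), Z(x) = z)$ if $(Y(x,z) \perp\!\!\!\perp Z(x,z) \mid W(x,z))_{G(x,z)}$
Rule 3: $p(Y(x,z) \mid W(x,z)) = p(Y(x) \mid W(x))$ if $(Y(x,z_1), W(x,z_1) \perp\!\!\!\perp z_1)_{G(x,z_1)}$ and $(Y(x,z_1) \perp\!\!\!\perp Z_2(x,z_1) \mid W(x,z_1))_{G(x,z_1)}$, where $Z_1 = Z \setminus An_{G(x)}(W)$ and $Z_2 = Z \cap An_{G(x)}(W)$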

Recall that random variables in a SWIG G(x) are labelled $V_i(x)$ or equivalently as $V_i(x_{an_i^*})$, where $an_i^*$ consists of values of the ancestors of $V_i$ in the split graph. We can view Rule 1 as the fragment of the SWIG global Markov property that pertains to random variables in V(a). Rule 2 may be called “generalized conditional ignorability” because it is a general version of the standard ignorability assumption used in causal inference settings, where (Y(a) ⫫ A | C), or equivalently (Y(a) ⫫ A(a) | C(a)), enables identification of (e.g.) the average treatment effect by adjusting for C. Note that Rule 3 does not have a simple interpretation, as it involves an equality of interventional distributions in two distinct “worlds,” given an independence condition in a third. However, below we suggest an alternative, simpler rule which may be used without loss of generality, and is more intuitive. First, we state some basic results.
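The graphical precondition of Rule 2 can be checked mechanically. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses the standard moralised-ancestral-graph test for d-separation in a DAG, handles fixed SWIG vertices by deleting them (they have only outgoing edges and always block as non-endpoints), and covers only statements between random vertices. The vertex labels such as "Y(a)" are plain strings chosen for the example.

```python
def d_separated(edges, ys, zs, xs, fixed=frozenset()):
    """Check whether ys and zs are d-separated given xs in a DAG or SWIG.

    edges: iterable of (parent, child); ys, zs, xs: sets of random vertices;
    fixed: fixed SWIG vertices, removed before testing.
    """
    ys, zs, xs = set(ys), set(zs), set(xs)
    edges = {(u, v) for u, v in edges if u not in fixed and v not in fixed}
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # 1. Restrict to ancestors of ys, zs and xs (each vertex is its own ancestor).
    keep, stack = set(), list(ys | zs | xs)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, ()))

    # 2. Moralise: connect each vertex to its parents, and its parents to each other.
    nbrs = {v: set() for v in keep}
    for v in keep:
        ps = list(parents.get(v, ()))
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
            for q in ps:
                if p != q:
                    nbrs[p].add(q)

    # 3. Delete xs and test whether any vertex in ys can still reach one in zs.
    seen, stack = set(), [y for y in ys if y not in xs]
    while stack:
        v = stack.pop()
        if v in seen or v in xs:
            continue
        seen.add(v)
        stack.extend(nbrs[v])
    return not (seen & zs)
```

On the SWIG of Fig. 2 (b), `d_separated(swig_edges, {"Y(a)"}, {"A"}, {"C"}, fixed={"a"})` holds, which is the condition under which Rule 2 gives p(Y(a) | C) = p(Y | A = a, C); dropping C from the conditioning set breaks the separation.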

Proposition 5 Rule 1 of po-calculus holds if and only if Rule 1 of do-calculus holds.

Proof: Follows from the definition of $G(x)$ and $G_{\bar{X}}$, and the definition of m-separation. □

Proposition 6 Rule 2 of po-calculus holds if and only if Rule 2 of do-calculus holds.

Proof: Follows from the definition of $G(x,z)$ and $G_{\bar{X}, \underline{Z}}$, and the definition of m-separation in $G(x,z)$. □

Proposition 7 Rule 3 of po-calculus holds if and only if Rule 3 of do-calculus holds.

Proof: Since path separation criteria on graphs quantify over elements in vertex sets, and since $Z$ is a disjoint union of $Z_1$ ($Z(W)$ in Pearl’s terminology) and $Z_2$, the precondition in Rule 3 of do-calculus may be written as two preconditions: $(Y \perp\!\!\!\perp Z_1 \mid W, X)_{G_{\bar{X}, \overline{Z_1}}}$ and $(Y \perp\!\!\!\perp Z_2 \mid W, X)_{G_{\bar{X}, \overline{Z_1}}}$.

By definition of $Z_1$, it contains only non-ancestors of $W$ in $G_{\bar{X}}$ (and therefore also in $G_{\bar{X}, \overline{Z_1}}$, which is an edge sub-graph of $G_{\bar{X}}$). Since $Z_1$ only has adjacent outgoing directed arrows in $G_{\bar{X}, \overline{Z_1}}$, all elements of $W$ are marginally m-separated from $Z_1$ in $G_{\bar{X}, \overline{Z_1}}$. Thus, $(W(x,z_1) \perp\!\!\!\perp z_1)_{G(x,z_1)}$ by the definition of $G(x,z_1)$. Furthermore, no element of $Z_1$ can be an ancestor of $Y$ in $G_{\bar{X}, \overline{Z_1}}$. To see this, suppose an element $Z_i$ of $Z_1$ were an ancestor of $Y$. Then since $(Y \perp\!\!\!\perp Z_1 \mid W, X)_{G_{\bar{X}, \overline{Z_1}}}$, the directed path from $Z_i$ must be blocked by $W$ and $X$. $W$ cannot be on this directed path because it is a non-descendant of $Z_1$, and $X$ cannot be on the path because $G_{\bar{X}, \overline{Z_1}}$ has no directed edges into $X$. So we conclude that $Z_i$ is not an ancestor of $Y$ in $G_{\bar{X}, \overline{Z_1}}$ and therefore $(Y(x,z_1) \perp\!\!\!\perp z_1)_{G(x,z_1)}$ by the definition of $G(x,z_1)$. Thus, if the precondition of do-calculus Rule 3 holds, the precondition of po-calculus Rule 3 holds.

We now prove the converse. If $(Y(x,z_1) \perp\!\!\!\perp z_1)_{G(x,z_1)}$ then $Z_1$ is not an ancestor of $Y$ in $G_{\bar{X}, \overline{Z_1}}$. Similarly, if $(W(x,z_1) \perp\!\!\!\perp z_1)_{G(x,z_1)}$ then $Z_1$ is not an ancestor of $W$ in $G_{\bar{X}, \overline{Z_1}}$. Since $Z_1$ only has adjacent edges that are outgoing directed edges, this implies $(Y, W \perp\!\!\!\perp Z_1 \mid X)_{G_{\bar{X}, \overline{Z_1}}}$ holds. Since semi-graphoid axioms hold for m-separation, this implies $(Y \perp\!\!\!\perp Z_1 \mid W, X)_{G_{\bar{X}, \overline{Z_1}}}$ holds. Finally, $(Y(x,z_1) \perp\!\!\!\perp Z_2(x,z_1) \mid W(x,z_1))_{G(x,z_1)}$ holds if and only if $(Y \perp\!\!\!\perp Z_2 \mid W, X)_{G_{\bar{X}, \overline{Z_1}}}$ holds, by the definitions of $G(x,z_1)$, $G_{\bar{X}, \overline{Z_1}}$, and m-separation. □

We now briefly demonstrate the soundness of the three rules of the po-calculus using only potential outcomes machinery and our background assumptions.

Proposition 8 Rules 1, 2, and 3 are sound.

Proof: Proposition 3 licenses deriving conditional independence statements corresponding to the graphical conditions in each rule. Then we have the following derivations:

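Rule 1: $p(Y(x) \mid Z(x), W(x)) = p(Y(x) \mid W(x))$ by $Y(x) \perp\!\!\!\perp Z(x) \mid W(x)$.

Rule 2: $p(Y(x,z) \mid W(x,z)) = p(Y(x,z) \mid Z(x,z) = z, W(x,z))$ by $Y(x,z) \perp\!\!\!\perp Z(x,z) \mid W(x,z)$
$= p(Y(x) \mid Z(x) = z, W(x))$ by consistency.

Rule 3: $p(Y(x) \mid W(x)) = p(Y(x,z_1) \mid W(x,z_1))$ since $Y(x,z_1), W(x,z_1) \perp\!\!\!\perp z_1$
$= p(Y(x,z_1) \mid Z_2(x,z_1) = z_2, W(x,z_1))$ since $Y(x,z_1) \perp\!\!\!\perp Z_2(x,z_1) \mid W(x,z_1)$
$= p(Y(x,z_1,z_2) \mid Z_2(x,z_1,z_2) = z_2, W(x,z_1,z_2))$ by consistency
$= p(Y(x,z) \mid Z_2(x,z) = z_2, W(x,z))$
$= p(Y(x,z) \mid W(x,z))$ since $Y(x,z_1) \perp\!\!\!\perp Z_2(x,z_1) \mid W(x,z_1)$ and $Z_2 \subseteq Z$, so that Proposition 4 applies.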

The proof of Proposition 8 has a number of interesting consequences. First, the soundness of Rule 2 follows by Rule 1 and consistency. Second, the soundness of Rule 3 follows by applications of Rule 1, Rule 2, consistency, causal irrelevance, and intervention monotonicity.

Causal irrelevance, as used in the proof, is implied by m-separation statements in the SWIG G(x,z1); however this property, like consistency, follows by (1) alone and does not require any assumption regarding the distributions p(V(a)) for any A ⊆ V; specifically, (5) is not required. As a result the three rules of po-calculus, taken as a whole, are consequences of consistency and causal irrelevance, which hold in any recursive causal model, together with the SWIG Markov property for random variables in V(a). (Intervention monotonicity follows from these.)

The proof of Proposition 8 also implies that a simpler reformulation of po-calculus suffices without loss of generality. Specifically, this reformulation replaces Rule 3 by the following simpler rule (encoding causal irrelevance in graphical form):

Rule 3*: $p(Y(x,z)) = p(Y(x))$ if $(Y(x,z) \perp\!\!\!\perp z)_{G(x,z)}$.

A benefit of translating the do-calculus exactly into our potential outcomes formulation is that the do-calculus rules as stated have been shown to be sufficient for a wide class of possible derivations on distributions expressible in terms of the do operator [1, 18]. However, since we phrased the rules for arbitrary potential outcomes, they may be applied to causal contrasts not expressible in standard do notation. We illustrate this by applying these rules to mediation analysis.

5. Path-Specific Effects and Extended Graphs

The identification theory for path-specific effects generally proceeds by considering nested, path-specific potential outcomes. Fix a set of treatment variables A, and a subset of proper causal paths π from any element in A. A proper causal path only intersects A at the source node. Next, pick a pair of value sets a and a′ for elements in A. For any Vi ∈ V, define the potential outcome Vi(π, a, a′) by setting A to a for the purposes of paths in π, and to a′ for the purposes of proper causal paths from A to Y not in π. Formally, the definition is as follows, for any Vi ∈ V:

$V_i(\pi, a, a') \equiv a$ if $V_i \in A$
$V_i(\pi, a, a') \equiv V_i\big(\{V_j(\pi, a, a') \mid V_j \in Pa_i^{\pi}\}, \{V_j(a') \mid V_j \in Pa_i^{\bar{\pi}}\}\big)$ (6)

where $V_j(a') \equiv a'$ if $V_j \in A$, and is given by (1) otherwise, $Pa_i^{\pi}$ is the set of parents of $V_i$ along an edge which is a part of a path in $\pi$, and $Pa_i^{\bar{\pi}}$ is the set of all other parents of $V_i$.

A counterfactual Vi(π, a, a′) is said to be edge inconsistent if counterfactuals of the form $V_j(a_k, \dots)$ and $V_j(a'_k, \dots)$ occur in Vi(π, a, a′); otherwise it is said to be edge consistent. It is well known that a joint distribution p(V(π, a, a′)) containing an edge-inconsistent counterfactual Vi(π, a, a′) is not identified in the functional causal model (nor in weaker causal models); the corresponding graphical criterion on π and G(V) is called the ‘recanting witness’ [16, 20]. For example, in Fig. 2 (a), given π = {C → A → Y}, Y(π, c, c′) ≡ Y(c′, M(c′, A(c′)), A(c)), while given π = {A → Y}, Y(π, a, a′) ≡ Y(C, a, M(a′, C)). Note that Y(π, c, c′) is edge inconsistent due to the presence of A(c) and A(c′), while Y(π, a, a′) is edge consistent.
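To make definition (6) concrete, the sketch below renders path-specific counterfactuals for the DAG of Fig. 2 (a) as nested strings. Everything in it is an assumption made for illustration: it handles a single treatment variable, the argument order follows the `PARENTS` table rather than the paper's typesetting, and `edge_inconsistent` is a simple proxy that flags any variable appearing with two different counterfactual labels.

```python
# Parent sets of the DAG in Fig. 2 (a); the argument order is a choice made here.
PARENTS = {"C": (), "A": ("C",), "M": ("C", "A"), "Y": ("C", "A", "M")}


def _emit(v, args, terms):
    """Render v(args...) and record it so repeated variables can be compared."""
    text = f"{v}({', '.join(args)})" if args else v
    terms.setdefault(v, set()).add(text)
    return text


def natural(v, treat, value, terms):
    """Render V(value) by recursive substitution (1) with `treat` fixed to `value`."""
    if v == treat:
        return value
    return _emit(v, [natural(p, treat, value, terms) for p in PARENTS[v]], terms)


def path_specific(v, treat, a, a_prime, pi_edges, terms):
    """Render V(pi, a, a') following (6); pi_edges holds the edges on paths in pi."""
    if v == treat:
        return a
    args = [
        path_specific(p, treat, a, a_prime, pi_edges, terms)
        if (p, v) in pi_edges
        else natural(p, treat, a_prime, terms)
        for p in PARENTS[v]
    ]
    return _emit(v, args, terms)


def edge_inconsistent(terms):
    """Proxy check: some variable occurs with two different counterfactual labels."""
    return any(len(labels) > 1 for labels in terms.values())
```

With treatment A and π = {A → Y} the rendering is `Y(C, a, M(C, a'))`, and no variable is recorded twice. With treatment C and π = {C → A → Y} the rendering contains both `A(c)` and `A(c')`, so the proxy reports edge inconsistency, in agreement with the example in the text.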

Counterfactuals defined by (6) form the basis for direct, indirect, and path-specific effects estimated in the mediation analysis literature. There are generalizations where elements in A are set to arbitrary values for different paths, under the name of path interventions [20]. Similarly, edge consistent counterfactuals V(π, a, a′) generalize to responses to edge interventions [20]. We do not discuss this further here in the interests of space, although the results presented below generalize without issue. Note that edge consistent counterfactuals cannot, in general, be phrased in terms of the do operator.

We have the following result, proven in [20].

Theorem 1 If V(π, a, a′) is edge consistent, then under the functional causal model for DAG G,

$p(V(\pi, a, a')) = \prod_{i=1}^{K} p\big(V_i \mid a_{Pa_i^{\pi}}, a'_{Pa_i^{\bar{\pi}}}, Pa_i^G \setminus A\big).$ (7)

As an example, the distribution p(Y(π, a, a′)) = p(Y(C, a, M(a′, C))) of the edge consistent counterfactual in Fig. 2 (a) is identified as a marginal distribution derived from (7), specifically ∑C,M p(Y | a, M, C)p(M | a′, C)p(C). The po-calculus as presented above may be applied to any sort of potential outcome, including nested potential outcomes representing path-specific effects. In the following, we exploit an equivalence between path-specific potential outcomes and standard potential outcomes defined from an extended graph Ge, which is constructed from G following [14]. This both simplifies complex nested potential outcome expressions and enables us to leverage a series of prior results to identify conditional PSEs.
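The identifying functional for the edge-consistent example can be evaluated numerically. The sketch below does so for binary variables; the probability tables and the function name `pse_prob_y1` are hypothetical and serve only to show how the sum over C and M is assembled.

```python
from itertools import product

# Hypothetical binary tables for the DAG in Fig. 2 (a).
p_c1 = 0.5                                                        # P(C = 1)
p_m1 = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.8}        # P(M = 1 | a, c)
p_y1 = {(a, m, c): 0.1 + 0.4 * a + 0.3 * m + 0.1 * c               # P(Y = 1 | a, m, c)
        for a, m, c in product((0, 1), repeat=3)}


def pse_prob_y1(a, a_prime):
    """P(Y(C, a, M(a', C)) = 1) = sum_{c,m} p(Y=1 | a, m, c) p(m | a', c) p(c)."""
    total = 0.0
    for c, m in product((0, 1), repeat=2):
        pc = p_c1 if c == 1 else 1.0 - p_c1
        pm1 = p_m1[(a_prime, c)]          # mediator responds to a', not a
        pm = pm1 if m == 1 else 1.0 - pm1
        total += p_y1[(a, m, c)] * pm * pc
    return total
```

A contrast such as `pse_prob_y1(1, 0) - pse_prob_y1(0, 0)` then compares the two treatment values along the direct path A → Y while the mediator is held at its distribution under a′ = 0.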

Given an ADMG $G(V)$, define for each $A_i \in A \subseteq V$ the set of variables $A_i^{Ch} \equiv \{A_i^j \mid V_j \in Ch_i\}$, and let $A^{Ch} \equiv \bigcup_{A_i \in A} A_i^{Ch}$. We define the extended graph of $G(V)$, written $G^e(V \cup A^{Ch})$, as the graph with the vertex set $V \cup A^{Ch}$, with edges of the form $A_i \to A_i^j \to V_j$ if and only if $A_i \to V_j$ is present in $G(V)$, for $A_i \in A$, $V_j \in V$; furthermore, any other edge between $V_i$ and $V_j$ is present in $G^e(V \cup A^{Ch})$ if and only if it is present for $V_i, V_j \in V$ in $G(V)$. As an example, the extended graph for the DAG in Fig. 2 (a), with A = V, is shown in Fig. 2 (c). For conciseness, we will generally drop explicit references to the vertices $V \cup A^{Ch}$, and denote the extended graph of $G(V)$ by $G^e$. Extended graphs as we define them here are straightforward generalizations of those presented in [14], where they only consider “node copies” of a single “treatment” variable, whereas here extended graphs have “copies” corresponding to every parent-child relationship of a set of treatments A.
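The construction of the directed part of an extended graph can be sketched as follows. This is an illustrative assumption rather than the authors' code: edges are (parent, child) pairs, the node copy for the edge from a treatment to a child is named with a caret (for example `A^M`), and bidirected edges, which would simply be carried over, are not represented.

```python
def extended_graph(directed, treatments):
    """Build the directed part of an extended graph from a list of directed edges.

    Each edge A_i -> V_j with A_i in `treatments` is replaced by the two edges
    A_i -> A_i^j -> V_j; the copy A_i^j is named f"{A_i}^{V_j}", which is a
    naming choice for this sketch only. All other edges are kept unchanged.
    """
    out = []
    for parent, child in directed:
        if parent in treatments:
            copy = f"{parent}^{child}"
            out.append((parent, copy))   # deterministic edge from treatment to its copy
            out.append((copy, child))    # the copy now acts as the child's parent
        else:
            out.append((parent, child))
    return out
```

For the DAG of Fig. 2 (a) with the single treatment A, the edges A → M and A → Y become A → A^M → M and A → A^Y → Y, so that the two copies can later be assigned different values.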

The edges $A_i \to A_i^j$ in $G^e$ are understood to represent deterministic relationships. More precisely, we associate a causal model with $G^e$ as follows. For $G$ we had associated a set of potential outcomes $\mathbb{V}$, and for $G^e$ we have $\mathbb{V}^e$. For every $V_i(pa_i) \in \mathbb{V}$, we let $V_i(pa_i)$ be in $\mathbb{V}^e$. Note that this is well-defined, since $V_i$ in $G$ and $G^e$ share the number of parents, and the parent sets for every $V_i$ share state spaces. In addition, for every $A_i^j \in A^{Ch}$, we let $A_i^j(a_i)$ for $a_i \in \mathfrak{X}_{A_i}$ be in $\mathbb{V}^e$. By assumption, every $A_i^j \in A^{Ch}$ has a single parent $A_i$, and we further require that $p(A_i^j(a_i))$ is a deterministic density, with $p(A_i^j(a_i) = a_i) = 1$. To fix intuitions, consider the example of Pearl’s discussed in [14]. They consider an analysis where $A_i$ corresponds to smoking status, and affects hypertensive status $V_j$ as well as myocardial infarction status $V_k$ through nicotine $A_i^j$ and non-nicotine $A_i^k$ components respectively. The relationships $A_i \to A_i^j$ and $A_i \to A_i^k$ are deterministic relationships between smoking and exposure to nicotine/non-nicotine components. [14] go on to consider potential outcomes of the form $V_k(a_i^j, a_i^k)$ (where the “node copies” $A_i^j$ and $A_i^k$ are assigned to perhaps different values) inspired by a hypothetical intervention on the nicotine components of cigarette exposure that fixes non-nicotine components at some reference value (e.g., a new nicotine-free cigarette). In this case, the path-specific effect of smoking on outcome via nicotine components is easy to write down and identify, at the price of introducing new variables and deterministic relationships into the model.

We now show the following two results. First, we show that an edge-consistent V(π, a, a′) may be represented without loss of generality by a counterfactual response to an intervention on a subset of ACh in Ge with the causal model defined above. Second, we show that this response is identified by the same functional (7).

Given an edge consistent V(π, a, a′), define $G^e$ via $A \subseteq V$. We define $a^{\pi}$ that assigns $a_i$ to $A_i^j \in A^{Ch}$ if $A_i \to V_j$ in $G(V)$ is in $\pi$, and assigns $a'_i$ to $A_i^j \in A^{Ch}$ if $A_i \to V_j$ in $G(V)$ is not in $\pi$. The resulting set of counterfactuals $V(a^{\pi})$ is well defined in the model for $\mathbb{V}^e$, and we have the following result, proved in the Supplement.

Proposition 9 Fix an element $p(V)$ in the causal model for a DAG $G(V)$, and consider the corresponding element $p^e(\mathbb{V}^e)$ in the restricted causal model associated with a DAG $G^e(V \cup A^{Ch})$. Then $p(V) = p^e(V \cup A^{Ch})$ and $p(V(\pi, a, a')) = p^e(V(a^{\pi}))$.

Corollary 1 Given an extended DAG $G^e$,

$p(V(a^{\pi})) = \prod_{i=1}^{K} p^e\big(V_i \mid a^{\pi}_{Pa_i}, Pa_i^{G^e} \setminus A\big).$

Proof: This follows from Proposition 9, and the fact that the functional in (7) in $p(V)$ is equal to $\prod_{i=1}^{K} p^e\big(V_i \mid a^{\pi}_{Pa_i}, Pa_i^{G^e} \setminus A\big)$ in $p^e(V \cup A^{Ch})$. □

In the causal models derived from DAGs with unobserved variables (e.g., $G(V \cup H)$), identification of distributions on potential outcomes such as p(V(a)) or p(V(π, a, a′)) may be stated without loss of generality on the latent projection ADMG G(V). A complete algorithm for identification of path-specific effects in hidden variable models was given in [16] and presented in a more concise form in [19]. We describe this form in detail in the Supplement. We also note (and prove in the Supplement) that the latent projection and the extended graph operations commute.

We now show that identification theory for p(V(π, a, a′)) in latent projection ADMGs G(V) may be restated, without loss of generality, in terms of identification of $p(V(a^{\pi}))$ in $G^e(V \cup A^{Ch})$.

Proposition 10 For any $Y \subseteq V$, $p(Y(\pi, a, a'))$ is identified in the ADMG $G(V)$ if and only if $p(Y(a^{\pi}))$ is identified in the ADMG $G^e(V \cup A^{Ch})$. Moreover, if $p(Y(a^{\pi}))$ is identified, it is by the same functional as $p(Y(\pi, a, a'))$.

Note that this Proposition is a generalization of Corollary 1 from DAGs to latent projection ADMGs. The proof of this claim, and all claims in the next section, are given in the Supplement.

6. Identification of Conditional PSEs

Having established that we can identify path-specific effects by working with potential outcomes derived from the $G^e$ model, we turn to the identification of conditional path-specific effects using the po-calculus. In [17], the authors present the conditional identification (IDC) algorithm for identifying quantities of the form p(Y(x)|W(x)) (in our notation), given an ADMG. Since conditional path-specific effects correspond to exactly such quantities defined on the extended model $G^e$, we can leverage their scheme for our purposes. The idea is to reduce the conditional problem, identification of $p(Y(a^{\pi}) \mid W(a^{\pi}))$, to an unconditional (joint) identification problem for which a complete identification algorithm already exists.

The algorithm has three steps: first, exhaustively apply Rule 2 of the po-calculus to reduce the conditioning set as much as possible; second, identify the relevant joint distribution using Proposition 10 and the complete algorithm in [19]; third, divide that joint by the marginal distribution of the remaining conditioning set to yield the conditional path-specific potential outcome distribution. The procedure is presented formally as Algorithm 1, with the subroutine corresponding to Proposition 10 named PS-ID.

Note that we make use of SWIGs defined from extended graphs, e.g., $G^e(a^{\pi}, z)$. Beginning with $G^e$, the SWIG $G^e(a^{\pi}, z)$ is constructed by the usual node-splitting operation: split nodes $Z$ and $A_i^j$ into random and fixed halves, where the fixed copy of $A_i^j$ takes its value from $a$ if the edge $A_i \to V_j$ in $G(V)$ is in π, and from $a'$ if $A_i \to V_j$ in $G(V)$ is not in π. Relabeling of random nodes proceeds as previously described.
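The assignment of fixed values to the edge copies can be illustrated with a short Python sketch. It assumes the convention that an edge out of a treatment takes its value from a when the edge lies in π and from a′ otherwise. The function name `edge_assignment`, the use of (parent, child) pairs as keys standing for the copies $A_i^j$, and the small mediation graph are hypothetical choices made only for illustration.

```python
def edge_assignment(edges, treatments, pi_edges, a, a_prime):
    """Return the fixed value attached to each treatment edge (A_i, V_j).

    edges      : iterable of (parent, child) pairs in G(V)
    treatments : set of treatment variable names (the set A)
    pi_edges   : set of (A_i, V_j) edges that belong to pi
    a, a_prime : dicts mapping each treatment name to its two settings
    """
    assignment = {}
    for parent, child in edges:
        if parent not in treatments:
            continue  # only edges out of a treatment receive a fixed copy
        source = a if (parent, child) in pi_edges else a_prime
        assignment[(parent, child)] = source[parent]
    return assignment


# Hypothetical graph A -> M -> Y and A -> Y, with pi containing only A -> Y.
edges = [("A", "M"), ("M", "Y"), ("A", "Y")]
a_pi = edge_assignment(edges, {"A"}, {("A", "Y")}, a={"A": 1}, a_prime={"A": 0})
print(a_pi)
```

In this toy case the direct edge into Y carries the value from a and the edge into M carries the value from a′, which is the pattern behind a quantity such as p(Y(a, M(a′))).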

The following two results are adapted from [17]; they are simply translated into potential outcomes and applied to extended graphs $G^e$.

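Proposition 11 If $(Y(x,z) \perp\!\!\!\perp Z(x,z) \mid W(x,z))_{G^e(x,z)}$ and $T \subseteq W$, then $(Y(x,t) \perp\!\!\!\perp T(x,t) \mid Z(x,t), W_1(x,t))_{G^e(x,t)}$ if and only if $(Y(x,z,t) \perp\!\!\!\perp T(x,z,t) \mid W_1(x,z,t))_{G^e(x,z,t)}$, where $W_1 = W \setminus T$.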

Corollary 2 For any $G^e(x)$ and any conditional distribution $p(Y(x) \mid W(x))$, there exists a unique maximal set $Z(x) = \{Z_i(x) \in W(x) \mid p(Y(x) \mid W(x)) = p(Y(x, z_i) \mid W(x, z_i) \setminus \{Z_i(x, z_i)\})\}$ such that Rule 2 applies for $Z(x, z)$ in $G^e(x, z)$ for $p(Y(x, z) \mid W(x, z))$.

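Algorithm 1 PS-IDC($Y, a^{\pi}, W, G^e$)
Input: outcome $Y$, path-specific setting $a^{\pi}$, conditioning set $W$, and graph $G^e$
Output: $p(Y(a^{\pi}) \mid W(a^{\pi}))$
 1: if $\exists Z \in W$ s.t. $(Y(a^{\pi}, z) \perp\!\!\!\perp Z(a^{\pi}, z) \mid W(a^{\pi}, z))_{G^e(a^{\pi}, z)}$
  return PS-IDC($Y, a^{\pi} \cup z, W \setminus Z, G^e$)
 2: else let $p(Y(a^{\pi}), W(a^{\pi})) \leftarrow$ PS-ID($Y \cup W, a^{\pi}, G^e$)
  return $p(Y(a^{\pi}), W(a^{\pi})) / \sum_{y} p(Y(a^{\pi}), W(a^{\pi}))$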
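The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the independence check in the SWIG and the PS-ID subroutine are passed in as hypothetical callables (`rule2_applies` and `ps_id`), distributions are represented as plain dictionaries, and the toy tables at the bottom contain made-up numbers.

```python
def ps_idc(Y, a_pi, w, rule2_applies, ps_id):
    """Recursive skeleton of PS-IDC (a sketch under the assumptions stated above).

    Y             : tuple of outcome variable names
    a_pi          : dict {intervened name: fixed value}
    w             : dict {conditioning variable name: observed value}
    rule2_applies : hypothetical oracle, callable(Y, a_pi, Z, w) -> bool, standing
                    in for the independence check in the SWIG G^e(a_pi, z)
    ps_id         : hypothetical stand-in for PS-ID, callable(targets, a_pi) that
                    returns {value tuple ordered like targets: probability},
                    or None when the joint is not identified
    """
    for Z in sorted(w):
        if rule2_applies(Y, a_pi, Z, w):
            # Step 1: Rule 2 moves Z from the conditioning set to the interventions.
            rest = {name: value for name, value in w.items() if name != Z}
            return ps_idc(Y, {**a_pi, Z: w[Z]}, rest, rule2_applies, ps_id)

    # Step 2: identify the joint over Y and the remaining conditioning set.
    remaining = tuple(sorted(w))
    joint = ps_id(tuple(Y) + remaining, a_pi)
    if joint is None:
        return None  # the conditional query is reported as not identified

    # Step 3: keep rows that agree with w, then divide by the sum over y.
    n = len(Y)
    w_values = tuple(w[name] for name in remaining)
    rows = {key[:n]: p for key, p in joint.items() if key[n:] == w_values}
    total = sum(rows.values())
    if total == 0:
        raise ValueError("conditioning event has probability zero")
    return {y: p / total for y, p in rows.items()}


# Toy stand-in for PS-ID with made-up numbers (purely illustrative).
def toy_ps_id(targets, a_pi):
    if "C" in a_pi:                      # C was moved into the intervention set
        return {(0,): 0.5, (1,): 0.5}    # table over ("Y",)
    return {(0, 0): 0.25, (1, 0): 0.25,  # table over ("Y", "C")
            (0, 1): 0.125, (1, 1): 0.375}


a_pi = {"A->Y": 1, "A->M": 0}
reduced = ps_idc(("Y",), a_pi, {"C": 1}, lambda Y, s, Z, w: True, toy_ps_id)
direct = ps_idc(("Y",), a_pi, {"C": 1}, lambda Y, s, Z, w: False, toy_ps_id)
print(reduced, direct)
```

Passing the two subroutines as arguments keeps the sketch independent of any particular graph library; a real implementation would replace the oracle with a d-separation test on the SWIG and the stub with the complete algorithm of [19].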

The following is similar to Theorem 6 in [17], but extended to path-specific queries in extended graphs. The proof is in the Supplement.

Theorem 2 Let $p(Y(\pi, a, a') \mid W(\pi, a, a'))$ be a conditional path-specific distribution in the causal model for $G$, and let $p(Y(a^{\pi}) \mid W(a^{\pi}))$ be the corresponding distribution in the extended causal model for $G^e(V \cup A^{Ch})$. Let $Z$ be the maximal subset of $W$ such that $p(Y(a^{\pi}) \mid W(a^{\pi})) = p(Y(a^{\pi}, z) \mid W(a^{\pi}, z) \setminus Z(a^{\pi}, z))$. Then $p(Y(a^{\pi}) \mid W(a^{\pi}))$ is identifiable in $G^e$ if and only if $p(Y(a^{\pi}, z), W(a^{\pi}, z) \setminus Z(a^{\pi}, z))$ is identifiable in $G^e$.

We then have by Corollary 2, Theorem 2, and completeness of the identification algorithm for path-specific effects [19]:

Theorem 3 Algorithm 1 is complete.

As an example, p(Y(a, M(a′))) is identified from p(C, A, M, Y) in the causal model in Fig. 1 (a), via

$$\sum_{M} \frac{\sum_{C} p(Y, M \mid a, C)\, p(C)}{\sum_{C} p(M \mid a, C)\, p(C)} \sum_{C} p(M \mid a', C)\, p(C).$$

However, p(Y(a, M(a′))|C) is not identified, since PS-IDC concludes p(Y(a, M(a′)), C) must first be identified, and this joint distribution is not identified via results in [16]. On the other hand, p(Y(a, M(a′))|C) is identified from p(C, A, M, Y) in a seemingly similar graph in Fig. 1 (b), via $\sum_{M} p(Y \mid M, a, C)\, p(M \mid a', C)$.
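To make the last functional concrete, the following Python sketch evaluates ∑M p(Y | M, a, C) p(M | a′, C) from a discrete joint distribution over binary C, A, M, Y. The conditional probability tables are made up for illustration and are not taken from the paper, the helper names (`prob`, `cond`, `conditional_pse`) are hypothetical, and the code only evaluates the functional; it does not check identifiability in any graph.

```python
from itertools import product

# Made-up conditional probability tables for binary variables (illustrative only).
p_c = {0: 0.5, 1: 0.5}                                          # P(C = c)
p_a1 = {0: 0.25, 1: 0.75}                                       # P(A = 1 | C = c)
p_m1 = {(0, 0): 0.25, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.75}   # P(M = 1 | A = a, C = c)


def p_y1(m, a, c):
    """P(Y = 1 | M = m, A = a, C = c) for the toy model."""
    return 0.125 + 0.25 * m + 0.375 * a + 0.125 * c


def bern(p1, x):
    """Probability that a binary variable with P(X = 1) = p1 equals x."""
    return p1 if x == 1 else 1 - p1


NAMES = ("C", "A", "M", "Y")
joint = {
    (c, a, m, y): p_c[c] * bern(p_a1[c], a) * bern(p_m1[(a, c)], m) * bern(p_y1(m, a, c), y)
    for c, a, m, y in product((0, 1), repeat=4)
}


def prob(**fixed):
    """Probability of the event that fixes some of C, A, M, Y to given values."""
    return sum(p for key, p in joint.items()
               if all(key[NAMES.index(k)] == v for k, v in fixed.items()))


def cond(target, given):
    """Conditional probability p(target | given), both given as dicts."""
    return prob(**target, **given) / prob(**given)


def conditional_pse(y, a, a_prime, c):
    """Evaluate sum_M p(Y = y | M, a, C = c) p(M | a', C = c) from the joint."""
    return sum(cond({"Y": y}, {"M": m, "A": a, "C": c})
               * cond({"M": m}, {"A": a_prime, "C": c})
               for m in (0, 1))


value = conditional_pse(y=1, a=1, a_prime=0, c=1)
print(value)
```

As a sanity check on the formula, setting a′ equal to a collapses the sum to the ordinary conditional p(Y | a, C), which the toy model reproduces.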

7. Conclusion

In this paper we introduced the potential outcome calculus, a generalization of do-calculus that applies to arbitrary potential outcomes. We have shown that the potential outcome calculus is equivalent to Pearl’s do-calculus for standard interventional quantities, and that it is a logical consequence of the properties of consistency and causal irrelevance, together with the global Markov property associated with SWIGs. Finally, we used the potential outcome calculus to give a sound and complete algorithm for identifying conditional distributions defined on potential outcomes associated with path-specific effects. This algorithm may be viewed as a path-specific generalization of the identification algorithm for conditional interventional distributions in [17].

Supplementary Material

Appendix

8. Acknowledgments

The authors would like to thank the American Institute of Mathematics for supporting this research via the SQuaRE program. This project is sponsored in part by the National Institutes of Health grant R01 AI127271-01 A1, and the Office of Naval Research grants N00014-18-1-2760 and N00014-15-1-2672. The authors would like to thank James M. Robins for helpful discussions.

Footnotes

1. The set of p(V) for a particular set of $\mathrm{Pa}_i$ and an ordering ≺ was called the finest causally interpretable structured tree graph (FCISTG) in [12].

2. Some readers may be more familiar with the simpler formulation where a = ∅, so “B = b implies $V_i(b) = V_i$.” Our reasons for allowing multiple intervention sets will become clear in what follows.

Contributor Information

Daniel Malinsky, Johns Hopkins University, Department of Computer Science, Baltimore, MD USA.

Ilya Shpitser, Johns Hopkins University, Department of Computer Science, Baltimore, MD USA.

Thomas Richardson, University of Washington, Department of Statistics, Seattle, WA USA.

References

  • [1] Huang Yimin and Valtorta Marco. Pearl’s calculus of interventions is complete. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 217–224. AUAI Press, 2006.
  • [2] Nabi Razieh, Malinsky Daniel, and Shpitser Ilya. Learning optimal fair policies. arXiv preprint arXiv:1809.02244, 2018.
  • [3] Nabi Razieh and Shpitser Ilya. Fair inference on outcomes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 1931–1940. AAAI Press, 2018.
  • [4] Neyman Jerzy. On the application of probability theory to agricultural experiments: essay on principles (1923), section 9. Reprinted in English, with Discussion. Statistical Science, pages 463–480, 1990.
  • [5] Pearl Judea. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.
  • [6] Pearl Judea. A probabilistic calculus of actions. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94), pages 452–462. Morgan Kaufmann, 1994.
  • [7] Pearl Judea. Causal diagrams for empirical research. Biometrika, 82(4):669–709, 1995.
  • [8] Pearl Judea. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.
  • [9] Richardson Thomas S. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30(1):145–157, 2003.
  • [10] Richardson Thomas S., Evans Robin J., Robins James M., and Shpitser Ilya. Nested Markov properties for acyclic directed mixed graphs. Preprint: arXiv:1701.06686, 2017.
  • [11] Richardson Thomas S. and Robins James M. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Preprint: http://www.csss.washington.edu/Papers/wp128.pdf, 2013.
  • [12] Robins James M. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512, 1986.
  • [13] Robins James M., Hernan Miguel A., and Siebert Uwe. Effects of multiple interventions. In Comparative Quantification of Health Risks: Global and Regional Burden of Disease Attributable to Selected Major Risk Factors, volume 2, chapter 28, pages 2191–2230. World Health Organization, 2004.
  • [14] Robins James M. and Richardson Thomas S. Alternative graphical causal models and the identification of direct effects. In Causality and Psychopathology: Finding the Determinants of Disorders and their Cures. Oxford University Press, 2011.
  • [15] Rubin DB. Causal inference and missing data (with discussion). Biometrika, 63:581–592, 1976.
  • [16] Shpitser Ilya. Counterfactual graphical models for longitudinal mediation analysis with unobserved confounding. Cognitive Science (Rumelhart Special Issue), 37:1011–1035, 2013.
  • [17] Shpitser Ilya and Pearl Judea. Identification of conditional interventional distributions. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 437–444. AUAI Press, 2006.
  • [18] Shpitser Ilya and Pearl Judea. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First AAAI Conference on Artificial Intelligence (AAAI-06), pages 1219–1226. AAAI Press, 2006.
  • [19] Shpitser Ilya and Sherman Eli. Identification of personalized effects associated with causal pathways. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI-18). AUAI Press, 2018.
  • [20] Shpitser Ilya and Tchetgen Tchetgen Eric J. Causal inference with a graphical hierarchy of interventions. Annals of Statistics, 44(6):2433–2466, 2016.
  • [21] Zhang Lu, Wu Yongkai, and Wu Xintao. A causal framework for discovering and removing direct and indirect discrimination. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 3929–3935, 2017.
