Abstract
Targets of inference that establish causality are phrased in terms of counterfactual responses to interventions. These potential outcomes operationalize cause effect relationships by means of comparisons of cases and controls in hypothetical randomized controlled experiments. In many applied settings, data on such experiments is not directly available, necessitating assumptions linking the counterfactual target of inference with the factual observed data distribution. This link is provided by causal models. Originally defined on potential outcomes directly (Rubin, 1976), causal models have been extended to longitudinal settings (Robins, 1986), and reformulated as graphical models (Spirtes et al., 2001; Pearl, 2009). In settings where common causes of all observed variables are themselves observed, many causal inference targets are identified via variations of the expression referred to in the literature as the g-formula (Robins, 1986), the manipulated distribution (Spirtes et al., 2001), or the truncated factorization (Pearl, 2009).
In settings where hidden variables are present, identification results become considerably more complicated. In this manuscript, we review identification theory in causal models with hidden variables for common targets that arise in causal inference applications, including causal effects, direct, indirect, and path-specific effects, and outcomes of dynamic treatment regimes. We will describe a simple formulation of this theory (Tian and Pearl, 2002; Shpitser and Pearl, 2006b,a; Tian, 2008; Shpitser, 2013) in terms of causal graphical models, and the fixing operator, a statistical analogue of the intervention operation (Richardson et al., 2017).
Keywords: identification, graphical models, causal inference
AMS 2000 subject classifications: 62H99, 60E05
1. Introduction
In causal inference, the relationship between causes (termed exposures or treatments) and effects (termed outcomes) is quantified via responses of outcomes to different hypothetical assignments of values to causes. These responses are called potential outcomes, and were first defined in the context of analysis of experimental data in (Neyman, 1923). Randomized treatment assignment in trials of the kind considered by Neyman implies that variation in outcomes in response to changes in treatment assignments, detected by standard statistical inference methods, can be used to draw valid causal inferences.
In datasets where treatment assignment is not randomized, causal inference is not always possible due to spurious associations between treatments and outcomes introduced by their unobserved common causes. Nevertheless, certain assumptions, given by causal models, allow a link to be made between data that is actually observed, and causal parameters involving outcomes that would have occurred under hypothetical treatment assignments. An early example of such a link via the conditionally ignorable model and the stable unit treatment value assumption (SUTVA) is found in (Rubin, 1976). More general models which allow identification of causal parameters in longitudinal observational data via the g-computation algorithm formula (g-formula) were developed in (Robins, 1986). Causal models in general settings were reconceptualized in terms of graphical models by Pearl (2009) and Spirtes et al. (2001), with early versions for the case of linear causal relationships given in (Wright, 1921; Haavelmo, 1943). Graphical models were used to derive algorithms which express causal parameters of interest as functions of observed data, and which generalize g-computation (Tian and Pearl, 2002; Shpitser and Pearl, 2006b,a). These algorithms have also been extended to mediation analysis problems, where effects along causal pathways are of interest (Shpitser, 2013; Shpitser and Tchetgen Tchetgen, 2016), and to problems, such as those encountered in precision medicine, where responses to treatments assigned by a policy are of interest (Tian, 2008).
In this paper, we review modern non-parametric identification theory for parameters of interest in causal models represented by directed acyclic graphs (DAGs), possibly with hidden variables. We describe how functionals of the observed data corresponding to causal parameters identified under a fully observed DAG model have a close relationship to a truncated version of the Markov factorization associated with DAGs. Similarly, we describe how not every causal parameter is identified if hidden variables are present, and that causal parameters that are identified under a hidden variable DAG model have a close relationship to a truncated version of the nested Markov factorization of the observed marginal distribution, associated with a mixed graph representing that hidden variable DAG (Richardson et al., 2017). We reformulate existing identification theory in terms of this factorization, phrased in terms of mixed graphs, kernels (which generalize conditional densities), and a fixing operation which can be viewed as a statistical analogue of interventions, and which generalizes marginalization, conditioning, and applications of the g-formula.
2. Preliminaries
We will associate random variables with vertices in graphs. We will denote both a single vertex and a single corresponding random variable as an uppercase Roman letter, e.g. $A$. Sets of vertices (and corresponding random variables) will be denoted by uppercase vectors, e.g. $\vec{A}$. For a random variable $V$, we denote the state space of $V$ as $\mathfrak{X}_V$. For example if $V$ is binary, then $\mathfrak{X}_V = \{0, 1\}$. We denote elements of a set $\mathfrak{X}_A$ (values of $A$) by lowercase Roman letters: $a \in \mathfrak{X}_A$. The state space of a set of random variables $\vec{A}$ is simply the Cartesian product of the individual state spaces: $\mathfrak{X}_{\vec{A}} \equiv \times_{A \in \vec{A}} \mathfrak{X}_A$. Sets of values corresponding to sets of random variables will be denoted by lowercase vectors, e.g. $\vec{a} \in \mathfrak{X}_{\vec{A}}$. Sometimes we will denote a restriction of a set of values by a set subscript. That is, if $\vec{a}$ is a set of values of $\vec{A}$, and $\vec{B} \subseteq \vec{A}$, then $\vec{a}_{\vec{B}}$ is a restriction of $\vec{a}$ to values in $\vec{B}$.
2.1. The Intervention Operation
Just as statistical models are sets of joint distributions representing uncertainty about an observable situation, causal models are sets of joint distributions representing uncertainty about a counterfactual situation. Counterfactual situations are represented by means of an intervention operation (Pearl, 2009), and hypothetical responses to this operation (Neyman, 1923).
For a subset of random variables $\vec{A} \subseteq \vec{V}$, and a value assignment $\vec{a}$ to $\vec{A}$, an intervention is a forced assignment of $\vec{A}$ to an element $\vec{a}$ of $\mathfrak{X}_{\vec{A}}$. The intervention operation which maps $\vec{A}$ to $\vec{a}$ was denoted by $do(\vec{a})$ by Pearl (2009). The result of the intervention operation is a counterfactual world “minimally altered” from the factual world such that variables in $\vec{A}$ have values $\vec{a}$. If the factual world is represented by a set of factual random variables $\vec{V}$, and their joint distribution $p(\vec{V})$, the counterfactual world is represented by a set of potential outcome random variables $\vec{V}(\vec{a}) \equiv \{V(\vec{a}) \mid V \in \vec{V}\}$, and their joint distribution written as $p(\vec{V}(\vec{a}))$. In general, the intervention operation does not correspond to conditioning on the event $\vec{A} = \vec{a}$. To define potential outcomes, and the notion of “minimal alteration” formally, we first introduce graphs, graphical models, and causal models.
2.2. Acyclic Directed Graphs
We will define causal models using directed graphs. A directed graph only contains directed edges (→). We will denote graphs by capital calligraphic letters, e.g. $\mathcal{G}$, and when necessary will explicitly add their vertex sets as part of the notation: $\mathcal{G}(\vec{V})$. A directed graph may contain at most one edge between two vertices.
We denote edges as ordered pairs of distinct vertices subscripted by the type of edge. For example (AB)→ is a directed edge from A to B. We omit the subscript when the edge type is not relevant. A path is a sequence of edges of the form 〈(V1V2), (V2V3), …, (Vk−2Vk−1), (Vk−1Vk)〉, where V1 ≠ Vk. Edges in a path may only occur once, and vertices may appear at most twice as elements of edges that are adjacent on the path. We will denote paths by indexed Greek letters, e.g. πi, and sets of paths by Greek letters, e.g. π.
A directed path from V1 to Vk has the form 〈(V1V2)→, (V2V3)→, …, (Vk−2Vk−1)→, (Vk−1Vk)→〉. A directed graph has a directed cycle if it contains a path of the form 〈(V1V2)→, (V2V3)→, …, (Vk−2Vk−1)→, (Vk−1Vk)→〉, and an edge (VkV1)→. A directed acyclic graph (DAG) is a directed graph with no directed cycles. Given a DAG $\mathcal{G}(\vec{V})$, and a subset $\vec{A} \subseteq \vec{V}$, define the induced subgraph $\mathcal{G}_{\vec{A}}$ as the DAG containing vertices $\vec{A}$ and any edge in $\mathcal{G}$ between elements in $\vec{A}$.
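Given a DAG $\mathcal{G}(\vec{V})$, define the following genealogic sets: parents, children, ancestors, and descendants of $V$, to be
$$\begin{aligned}
\operatorname{pa}_{\mathcal{G}}(V) &\equiv \{W \in \vec{V} \mid (WV)_{\to} \text{ is in } \mathcal{G}\}, \\
\operatorname{ch}_{\mathcal{G}}(V) &\equiv \{W \in \vec{V} \mid (VW)_{\to} \text{ is in } \mathcal{G}\}, \\
\operatorname{an}_{\mathcal{G}}(V) &\equiv \{W \in \vec{V} \mid \text{there is a directed path from } W \text{ to } V \text{ in } \mathcal{G}\}, \\
\operatorname{de}_{\mathcal{G}}(V) &\equiv \{W \in \vec{V} \mid \text{there is a directed path from } V \text{ to } W \text{ in } \mathcal{G}\}.
\end{aligned}$$
Define the set of non-descendants of $V$ to be $\operatorname{nd}_{\mathcal{G}}(V) \equiv \vec{V} \setminus \operatorname{de}_{\mathcal{G}}(V)$. By convention, each vertex is counted among its own ancestors and descendants, so that for any $V \in \vec{V}$, $\operatorname{an}_{\mathcal{G}}(V) \cap \operatorname{de}_{\mathcal{G}}(V) = \{V\}$. A total ordering ≺ on elements in $\vec{V}$ in a DAG $\mathcal{G}(\vec{V})$ is called topological with respect to $\mathcal{G}$ if for all distinct $V_1, V_2 \in \vec{V}$, whenever V1 ≺ V2, $V_2 \notin \operatorname{an}_{\mathcal{G}}(V_1)$.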
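The genealogic sets and the topological ordering condition can be sketched in a few lines of Python. This is only an illustration: the edge list below is an assumed encoding of a DAG shaped like Fig. 1 (a) (W → A, W → M, W → Y, A → M, A → Y, M → Y), the function names are hypothetical, and the code assumes the convention that a vertex counts among its own ancestors and descendants.

```python
from typing import Callable, List, Set, Tuple

Edge = Tuple[str, str]  # (tail, head) stands for the directed edge tail -> head

# Assumed encoding of a DAG shaped like Fig. 1 (a); not taken from the figure itself.
VERTICES: List[str] = ["W", "A", "M", "Y"]
EDGES: List[Edge] = [("W", "A"), ("W", "M"), ("W", "Y"),
                     ("A", "M"), ("A", "Y"), ("M", "Y")]

def parents(v: str, edges: List[Edge]) -> Set[str]:
    return {tail for tail, head in edges if head == v}

def children(v: str, edges: List[Edge]) -> Set[str]:
    return {head for tail, head in edges if tail == v}

def _closure(v: str, edges: List[Edge],
             step: Callable[[str, List[Edge]], Set[str]]) -> Set[str]:
    # Repeatedly apply `step` (parents or children), starting from v itself.
    seen, stack = {v}, [v]
    while stack:
        for u in step(stack.pop(), edges):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def ancestors(v: str, edges: List[Edge]) -> Set[str]:
    return _closure(v, edges, parents)      # includes v by convention

def descendants(v: str, edges: List[Edge]) -> Set[str]:
    return _closure(v, edges, children)     # includes v by convention

def non_descendants(v: str, vertices: List[str], edges: List[Edge]) -> Set[str]:
    return set(vertices) - descendants(v, edges)

def is_topological(order: List[str], edges: List[Edge]) -> bool:
    # For V1 earlier than V2 in the order, V2 must not be an ancestor of V1.
    return all(order[j] not in ancestors(order[i], edges)
               for i in range(len(order))
               for j in range(i + 1, len(order)))
```

For this assumed graph, `is_topological(["W", "A", "M", "Y"], EDGES)` holds, while placing A before W violates the condition because W is an ancestor of A.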
DAGs have been used to define a statistical model via sets of conditional independence statements represented by Markov properties and a factorization. We review this model as a prelude for the more complex definition of a causal model of a DAG.
2.3. Statistical Models Of A DAG
A statistical model of a DAG $\mathcal{G}(\vec{V})$, or a Bayesian network, is a set of distributions $p(\vec{V})$ such that
$$p(\vec{V}) = \prod_{V \in \vec{V}} p(V \mid \operatorname{pa}_{\mathcal{G}}(V)). \tag{1}$$
Any distribution in the above set is said to Markov factorize or be Markov relative to $\mathcal{G}$. An alternative formulation of statistical models of a DAG is in terms of the global Markov property defined via d-separation (Pearl, 1988), which we reproduce below.
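A sequence of two consecutive edges on a path is called a triple. A triple of the form 〈(AC)→, (CB)←〉 is called a collider, a triple of the form 〈(AC)→, (CB)→〉 is called a chain, and a triple of the form 〈(AC)←, (CB)→〉 is called a fork. The latter two types of triples are collectively called non-colliders. Given a DAG $\mathcal{G}(\vec{V})$, disjoint vertices $A, B \in \vec{V}$, and a set $\vec{C} \subseteq \vec{V}$, such that $A, B \notin \vec{C}$, a path from A to B in $\mathcal{G}$ is said to be d-separated given a set $\vec{C}$ if there exists on the path a collider 〈(AC)→, (CB)←〉 such that $\operatorname{de}_{\mathcal{G}}(C) \cap \vec{C} = \emptyset$, or a non-collider 〈(AC), (CB)〉 such that $C \in \vec{C}$. For three disjoint sets $\vec{A}$, $\vec{B}$, $\vec{C}$, we say $\vec{A}$ is d-separated from $\vec{B}$ given $\vec{C}$ in $\mathcal{G}$, written $(\vec{A} \perp\!\!\!\perp_d \vec{B} \mid \vec{C})_{\mathcal{G}}$ as a shorthand, if every path from any $A \in \vec{A}$ to any $B \in \vec{B}$ is d-separated by $\vec{C}$. The d-separation criterion states that d-separation always implies conditional independence in statistical DAG models. That is, if $p(\vec{V})$ is Markov relative to a DAG $\mathcal{G}(\vec{V})$, the following implication holds:
$$(\vec{A} \perp\!\!\!\perp_d \vec{B} \mid \vec{C})_{\mathcal{G}} \;\Rightarrow\; (\vec{A} \perp\!\!\!\perp \vec{B} \mid \vec{C})_{p(\vec{V})},$$
where $(\vec{A} \perp\!\!\!\perp \vec{B} \mid \vec{C})_{p(\vec{V})}$ is a shorthand for the statement “$\vec{A}$ is independent of $\vec{B}$ conditional on $\vec{C}$ in $p(\vec{V})$.” The global Markov property given by d-separation and the Markov factorization (1) are equivalent ways of defining the statistical DAG model, without requiring that elements of the model be positive distributions. Such a requirement is necessary for the Hammersley-Clifford theorem for undirected graphical models. See (Lauritzen, 1996) for details.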
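A d-separation query can be checked mechanically. The Python sketch below does not enumerate paths as in the definition above; it swaps in the standard equivalent moralization criterion: restrict the DAG to the ancestors of the three sets, connect parents that share a child, drop edge directions, delete the conditioning set, and test whether the two sets are still connected. The function names and the toy edge list are hypothetical, and the three sets are assumed to be disjoint.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (tail, head) stands for tail -> head

def ancestors_of(nodes: Iterable[str], edges: List[Edge]) -> Set[str]:
    """All ancestors of the given nodes, including the nodes themselves."""
    result, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for tail, head in edges:
            if head == v and tail not in result:
                result.add(tail)
                stack.append(tail)
    return result

def d_separated(xs: Set[str], ys: Set[str], zs: Set[str], edges: List[Edge]) -> bool:
    """True if xs is d-separated from ys given zs (sets assumed disjoint)."""
    keep = ancestors_of(xs | ys | zs, edges)
    sub = [(t, h) for t, h in edges if t in keep and h in keep]
    # Moralize: undirected version of every edge, plus links between co-parents.
    undirected = {frozenset(e) for e in sub}
    for v in keep:
        pas = [t for t, h in sub if h == v]
        for p, q in combinations(pas, 2):
            undirected.add(frozenset((p, q)))
    # Search from xs in the moral graph, never entering the conditioning set.
    reached, stack = set(xs), list(xs)
    while stack:
        v = stack.pop()
        for e in undirected:
            if v in e:
                (w,) = e - {v}
                if w not in zs and w not in reached:
                    reached.add(w)
                    stack.append(w)
    return not (reached & ys)

# Toy DAG (an assumption for illustration): A -> C <- B, C -> D.
TOY_EDGES: List[Edge] = [("A", "C"), ("B", "C"), ("C", "D")]
```

In the toy graph, A and B are d-separated given the empty set, but conditioning on the collider C, or on its descendant D, connects them.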
2.4. Atomic Potential Outcomes and Causal Models Of A DAG
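Just as a statistical model of $\mathcal{G}(\vec{V})$ is a set of distributions defined by (1), a causal model of $\mathcal{G}(\vec{V})$ is a set of distributions over (atomic) potential outcome random variables in the set
$$\{ V(\vec{a}) \mid V \in \vec{V},\ \vec{a} \in \mathfrak{X}_{\operatorname{pa}_{\mathcal{G}}(V)} \} \tag{2}$$
defined by some restrictions. Here, each $V(\vec{a})$ is generated as $f_V(\vec{a}, \epsilon_V)$, where $f_V$ is a structural equation which maps values $\vec{a}$ of $\operatorname{pa}_{\mathcal{G}}(V)$ and a random variable $\epsilon_V$, representing unobserved exogenous factors, to values $v$ of $V$. Thus, $f_V$ and the random variable $\epsilon_V$ induce the random variable $V(\vec{a})$.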
We consider a causal model of , described in (Pearl, 2009) and (Richardson and Robins, 2013). The functional model, also known as the multiple worlds model (Shpitser and Tchetgen Tchetgen, 2016), or the non-parametric structural equation model with independent errors (NPSEM-IE) is the set of all distributions such that variables in
$$\big\{ \{ V(\vec{a}_V) \mid \vec{a}_V \in \mathfrak{X}_{\operatorname{pa}_{\mathcal{G}}(V)} \} \;\big|\; V \in \vec{V} \big\} \tag{3}$$
are mutually independent. The assumption (3) corresponding to the functional model can be interpreted to mean that the joint distribution of the exogenous terms factorizes as $p(\{\epsilon_V \mid V \in \vec{V}\}) = \prod_{V \in \vec{V}} p(\epsilon_V)$. The name “non-parametric structural equation model with independent errors” comes from this property, and the fact that the structural equations $f_V$ are unrestricted. As an example, the binary functional model associated with the DAG in Fig. 1 (a) asserts that sets of random variables {W}, {A(w) | w ∈ {0, 1}}, {M(a, w) | a ∈ {0, 1}, w ∈ {0, 1}}, {Y(a, m, w) | a ∈ {0, 1}, m ∈ {0, 1}, w ∈ {0, 1}} are mutually independent.
Figure 1.
(a) A simple causal DAG, with a single treatment A, a single outcome Y, a vector W of baseline variables, and a single mediator M. (b) A more complex causal DAG with two mediators M and Z. (c) A version of the DAG in (b) with an unobserved confounder H. (d) The ADMG obtained from the DAG in (c) via the latent projection operation collapsing over the unobserved variable H.
Weaker models than the functional model exist, such as the finest fully randomized causally interpreted structured tree graph (FFRCISTG) model, also known as the single world model (Robins, 1986; Shpitser and Tchetgen Tchetgen, 2016; Richardson et al., 2017). In the interests of space, we will not discuss this model further in this paper, although its assumptions generally suffice for identification of any targets described in this paper other than path-specific effects.
2.5. Defining Arbitrary Potential Outcomes
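Given a set of atomic potential outcomes of the form $V(\vec{a}_V)$, with $\vec{a}_V \in \mathfrak{X}_{\operatorname{pa}_{\mathcal{G}}(V)}$, other potential outcomes of the form $Y(\vec{a})$, where $\vec{A}$ is an arbitrary subset of $\vec{V}$, can be defined by means of recursive substitution:
$$Y(\vec{a}) \equiv Y\big(\vec{a}_{\operatorname{pa}_{\mathcal{G}}(Y) \cap \vec{A}},\ \{ W(\vec{a}) \mid W \in \operatorname{pa}_{\mathcal{G}}(Y) \setminus \vec{A} \}\big). \tag{4}$$
In words, this states that the response of Y had we applied the intervention operator $do(\vec{a})$, that is had we, possibly contrary to fact, set values of $\vec{A}$ to $\vec{a}$, is defined as the potential outcome for Y where
all parents of Y which are in $\vec{A}$ are counterfactually assigned an appropriate value from $\vec{a}$, and
all other parents $W$ are counterfactually assigned whatever value $W(\vec{a})$ they would have attained had we applied the intervention operator $do(\vec{a})$.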
Responses involving W are themselves counterfactual and defined recursively. The definition terminates because of the lack of directed cycles in . For example, in the graph in Fig. 1 (a), Y(a) = Y(a, M(a, W), W). The appearance of W as a capital letter in the expression is interpreted to mean “counterfactually assign W to whatever value is actually observed.” Recursively setting all ancestors of Y to their actually observed values results in the factual random variable Y ≡ Y(A, M(A, W), W). More generally, every factual variable that is a part of the observed data distribution can be formed in this way.
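Recursive substitution is easy to mimic in code. The following Python sketch uses hypothetical binary structural equations for a DAG shaped like Fig. 1 (a) (the thresholds are arbitrary choices, not taken from any source) to build Y(a) = Y(a, M(a, W), W) for a single unit, and to recover the factual Y by plugging in the factual value of A.

```python
import random
from typing import Dict, Tuple

# Hypothetical structural equations f_V(parents, eps_V); each eps_V is Uniform(0, 1).
def f_W(eps: float) -> int:
    return int(eps < 0.5)

def f_A(w: int, eps: float) -> int:
    return int(eps < (0.7 if w else 0.3))

def f_M(a: int, w: int, eps: float) -> int:
    return int(eps < 0.2 + 0.5 * a + 0.2 * w)

def f_Y(a: int, m: int, w: int, eps: float) -> int:
    return int(eps < 0.1 + 0.3 * a + 0.4 * m + 0.1 * w)

def draw_unit(rng: random.Random) -> Dict[str, float]:
    """Independent exogenous terms for one unit, one per variable."""
    return {v: rng.random() for v in ("W", "A", "M", "Y")}

def Y_of_a(a: int, eps: Dict[str, float]) -> int:
    """Y(a) = Y(a, M(a, W), W): A is set to a, other parents are evaluated recursively."""
    w = f_W(eps["W"])
    m = f_M(a, w, eps["M"])          # M(a, W)
    return f_Y(a, m, w, eps["Y"])

def factual(eps: Dict[str, float]) -> Tuple[int, int, int, int]:
    """Factual (W, A, M, Y): every variable evaluated at its parents' observed values."""
    w = f_W(eps["W"])
    a = f_A(w, eps["A"])
    m = f_M(a, w, eps["M"])
    y = f_Y(a, m, w, eps["Y"])
    return w, a, m, y
```

For every simulated unit the factual outcome equals `Y_of_a(a, eps)` evaluated at the unit's own factual `a`, while `Y_of_a(1 - a, eps)` is a quantity that would never appear in an observed dataset.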
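The causal model may impose restrictions on counterfactual variables defined via (4). For example, it is a straightforward consequence of (4) that for any $V \in \vec{V}$ and $\vec{A} \subseteq \vec{V} \setminus \{V\}$ such that $\operatorname{pa}_{\mathcal{G}}(V) \subseteq \vec{A}$, and any $\vec{a} \in \mathfrak{X}_{\vec{A}}$, $V(\vec{a}) = V(\vec{a}_{\operatorname{pa}_{\mathcal{G}}(V)})$. In other words, V is not affected by interventions on any variable outside the set $\operatorname{pa}_{\mathcal{G}}(V)$, as long as all variables in the set $\operatorname{pa}_{\mathcal{G}}(V)$ are already intervened on. In this sense, the variables in $\operatorname{pa}_{\mathcal{G}}(V)$ can be viewed as the observed direct causes of V, and the structural equation fV as the causal mechanism determining the value of V in terms of values of observed direct causes, and the exogenous direct cause εV.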
The causal model may impose restrictions on factual variables as well. In particular, for the set of factual variables $\vec{V}$ obtained via (4) from the atomic counterfactuals in (2), it is known that if the joint distribution of the atomic counterfactuals lies in the functional causal model of $\mathcal{G}(\vec{V})$, then $p(\vec{V})$ obeys (1), in other words lies in the statistical model of $\mathcal{G}(\vec{V})$. As a matter of convenience, and to avoid complications with identifiability, we will assume positive observed data distributions from this point on.
2.6. The Fundamental Problem Of Causal Inference
Most random variables defined via (4) are counterfactual, meaning realizations from these variables are not available as observed data. Observed data is restricted to realizations of the observed distribution $p(\vec{V})$. Making inferences about any counterfactual parameter thus entails building a link between factual and counterfactual variables.
A standard assumption which provides this link is called consistency, and states that for any disjoint sets $\vec{A}$, $\vec{B}$, and any Y, $\vec{B}(\vec{a}) = \vec{b}$ implies $Y(\vec{a}, \vec{b}) = Y(\vec{a})$. Though consistency (in the simple form where $\vec{A}$ is empty) is often stated as a standalone part of the stable unit treatment value assumption (SUTVA) (Rubin, 1976), it is also a logical consequence of (4) above, see (Malinsky et al., 2019) for a simple proof. In words, consistency states that in any given counterfactual situation where $\vec{A}$ were intervened on to values $\vec{a}$, if it were the case that random variables $\vec{B}$ were observed to attain values $\vec{b}$, then any counterfactual random variable $Y(\vec{a})$ in that situation is equal to the counterfactual random variable $Y(\vec{a}, \vec{b})$ representing the situation where $\vec{B}$ were also intervened on to the values $\vec{b}$. Viewed in terms of structural equations, consistency asserts a kind of mechanism modularity: for every V, fV reliably maps inputs to outputs regardless of the type of process that may have set those inputs.
Given an observed dataset specifying values yi, ai, for i = 1, …, n, consistency allows inferences to be made about Y(ai) from realizations Yj(ai) where j is such that aj = ai. However, for all rows j, realizations of Y(ai) are missing from the dataset for row j, if ai ≠ aj. This means that for any row j, only half of the potential outcomes are available (possibly less if A has more than two values). This missing data problem is referred to as the fundamental problem of causal inference (Rubin, 1976). Making inferences using counterfactual realizations missing in observed data is not always possible, and when it is, relies on additional assumptions over and above consistency, given by a causal model.
2.7. Identification
If assumptions imposed by a particular causal model imply a counterfactual quantity is a unique functional of the observed data distribution in every element of the model, we say the counterfactual is identified in the model. Here we concentrate on identification of counterfactual distributions. Formally, a counterfactual distribution such as $p(\vec{Y}(\vec{a}))$, or a parameter of it, is said to be identified from $p(\vec{V})$ in a causal model, if there exists a function of $p(\vec{V})$ which yields that distribution (or parameter) in every element of the model. Identification in this sense is important to establish before estimating from observed data. A distribution or parameter is said to be non-identified from $p(\vec{V})$ in a causal model if there exist two elements in the model which share $p(\vec{V})$ but differ in that distribution (or parameter). Estimation for non-identified parameters or distributions is an ill-posed problem.
3. Targets of Inference
We now describe common targets of inference, phrased in terms of potential outcome random variables, that arise in causal inference applications.
3.1. Average Treatment Effects
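The relationship of a set of treatments $\vec{A}$ and an outcome Y is quantified via a comparison of potential outcomes $Y(\vec{a})$, $Y(\vec{a}')$ under two hypothetical value assignments $\vec{a}$ and $\vec{a}'$ to $\vec{A}$, representing cases and controls. These comparisons are often made on the mean difference scale. This contrast is known as the average causal effect (ACE):
$$\mathbb{E}[Y(\vec{a})] - \mathbb{E}[Y(\vec{a}')].$$
For a set of treatments $\vec{A}$, a comparison of interest might include the controlled direct effect, where outcomes Y are compared for two treatment trajectories that agree on all values except the values $a_i$, $a_i'$ of a single treatment $A_i \in \vec{A}$:
$$\mathbb{E}[Y(a_i, \vec{a}_{\vec{A} \setminus \{A_i\}})] - \mathbb{E}[Y(a_i', \vec{a}_{\vec{A} \setminus \{A_i\}})].$$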
Identifiable average causal effects and controlled direct effects, possibly also conditioned on a vector of baseline covariates, are sometimes modeled directly in semi-parametric approaches based on structural nested models and marginal structural models (Robins, 1999a,b).
3.2. Mediation Analysis And Path-Specific Effects
Given that a treatment effect of A on Y is established, it is often desirable to understand the mechanism by which the causal influence takes place. A simple type of mechanism analysis is mediation analysis, the decomposition of the treatment effect into components associated with causal pathways from A to Y.
While controlled direct effects capture a type of direct effect of A on Y (that is, the effect not mediated by other variables), they suffer from two disadvantages. First, since other direct causes are fixed to a set of reference values, the controlled direct effect is best viewed not as a single contrast, but as a potentially high dimensional mapping from the values of other direct causes of the outcome to contrasts. This makes this type of effect potentially difficult to estimate and interpret. Second, there is no corresponding version of the indirect effect.
An alternative, proposed in (Robins and Greenland, 1992; Pearl, 2001), is to define natural direct and indirect effects by means of nested counterfactuals of the form Y(a, M(a′)), and Y(a′, M(a)), where a is the treatment value corresponding to cases, and a′ the treatment value corresponding to controls. These direct and indirect effects are defined in such a way that they form a decomposition of the average treatment effect on the appropriate scale. For example, the following definitions give an additive decomposition of the average treatment effect defined on the mean difference scale:
$$\mathbb{E}[Y(a)] - \mathbb{E}[Y(a')] = \underbrace{\big(\mathbb{E}[Y(a)] - \mathbb{E}[Y(a, M(a'))]\big)}_{\text{total indirect effect}} + \underbrace{\big(\mathbb{E}[Y(a, M(a'))] - \mathbb{E}[Y(a')]\big)}_{\text{pure direct effect}}.$$
One way of conceptualizing these types of effects (Robins and Richardson, 2010) is to assume a treatment A can be partitioned into two components: a component that only directly influences the outcome Y, and a component that only directly influences a mediator variable M. In the toy example discussed in (Robins and Richardson, 2010), we may believe smoking affects health due to the presence of smoke in the lungs, which may cause cancer, and by means of nicotine in the bloodstream, which may contribute to heart disease. If we are interested in the effect of smoking mediated by cancer, we may consider comparing smokers to a population which only received the component of the treatment where the influence of smoke is absent, and the influence of nicotine is present, such as a nicotine patch. If we view Y as health status, M as cancer, and A as smoking, then the expected value contrast above corresponding to the total indirect effect of smoking would represent the comparison of health outcomes in smokers and nicotine patch users.
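As a numerical illustration of an additive decomposition of this kind, the Python sketch below computes nested counterfactual means E[Y(a, M(a′))] exactly for a hypothetical binary model shaped like Fig. 1 (a) with independent errors, and splits the average causal effect into a total indirect effect, E[Y(a)] − E[Y(a, M(a′))], and a pure direct effect, E[Y(a, M(a′))] − E[Y(a′)]. The mechanism probabilities are assumptions chosen for illustration, and the means are computed from the assumed mechanisms directly rather than from observed data.

```python
from fractions import Fraction as F

# Hypothetical mechanisms for binary W, M, Y (probabilities of the value 1).
P_W1 = F(1, 2)

def p_m1(a: int, w: int) -> F:
    return F(2, 10) + F(5, 10) * a + F(2, 10) * w

def p_y1(a: int, m: int, w: int) -> F:
    return F(1, 10) + F(3, 10) * a + F(4, 10) * m + F(1, 10) * w

def mean_nested(a_y: int, a_m: int) -> F:
    """E[Y(a_y, M(a_m))]: A is set to a_y for Y, and to a_m for the mediator M."""
    total = F(0)
    for w in (0, 1):
        pw = P_W1 if w else 1 - P_W1
        for m in (0, 1):
            pm = p_m1(a_m, w) if m else 1 - p_m1(a_m, w)
            total += pw * pm * p_y1(a_y, m, w)
    return total

a, a_prime = 1, 0
ace = mean_nested(a, a) - mean_nested(a_prime, a_prime)                   # E[Y(a)] - E[Y(a')]
total_indirect = mean_nested(a, a) - mean_nested(a, a_prime)              # E[Y(a)] - E[Y(a, M(a'))]
pure_direct = mean_nested(a, a_prime) - mean_nested(a_prime, a_prime)     # E[Y(a, M(a'))] - E[Y(a')]
```

Exact rational arithmetic is used so that the telescoping identity `ace == total_indirect + pure_direct` holds without floating-point tolerance.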
The intuition behind direct and indirect effects can be generalized to settings where an effect along a particular causal path is of interest. Consider an example from (Miles et al., 2017) represented by Fig. 1 (b), representing a cross-sectional study of HIV patients. Here, W is a set of baseline characteristics, A is exposure to one of two HIV treatments, M is treatment toxicity, Z is a measure of treatment adherence (how much of the prescribed treatment the patient actually took), and Y is the outcome such as viral failure. In this example, we may be interested in assessing not only the effectiveness of the drug itself, as quantified by the average causal effect, but also the extent to which the drug might influence the outcome through adherence but not toxicity. This type of adherence might be affected by the size of the pill, if the drug is ingested in pill form, or a patient in a low food security situation being prescribed a drug that must be taken with a meal. The influence of the drug on the outcome involved in this type of adherence corresponds to the influence of A on Y along the path 〈(AZ)→, (ZY)→〉. Another issue of possible interest is lack of adherence due to toxicity of the drug. The influence of A on Y involved in this type of adherence corresponds to the influence of A on Y along the path 〈(AM)→, (MZ)→, (ZY)→〉.
Effects of treatments on outcomes along a predefined set of paths are called path-specific effects (Pearl, 2001). Such effects are defined using a special type of potential outcome, where the treatment is set to one value a with respect to variables on the set of paths of interest, and to another value a′ with respect to variables on all other paths. A single variable may occur in both types of paths: those of interest, and those that are not of interest. We can modify (4) to define these types of potential outcomes as follows. Fix a set of directed paths π from A to Y, and let $\operatorname{pa}^{\pi}_{\mathcal{G}}(Y)$ be the set of parents of Y along an edge which is a part of a path in π, and $\operatorname{pa}^{\bar{\pi}}_{\mathcal{G}}(Y)$ be the set of all other parents of Y. Given π and values a, a′, define the π-specific potential outcome Y(π, a, a′) as
$$Y(\pi, a, a') \equiv \begin{cases} a & \text{if } Y = A, \\ Y\big(\{ W(\pi, a, a') \mid W \in \operatorname{pa}^{\pi}_{\mathcal{G}}(Y) \},\ \{ W(a') \mid W \in \operatorname{pa}^{\bar{\pi}}_{\mathcal{G}}(Y) \}\big) & \text{otherwise,} \end{cases} \tag{5}$$
where W(a′) ≡ a′ if W = A. This definition says that the π-specific potential outcome Y(π, a, a′) is defined by setting values of $\operatorname{pa}_{\mathcal{G}}(Y)$ using the following four cases:
If $A \in \operatorname{pa}_{\mathcal{G}}(Y)$ and (AY)→ is in a path in π, set A to a.
If $A \in \operatorname{pa}_{\mathcal{G}}(Y)$ and (AY)→ is not in a path in π, set A to a′.
If $W \in \operatorname{pa}_{\mathcal{G}}(Y) \setminus \{A\}$ and (WY)→ is not in any path in π, set W to the value it would have attained under potential outcomes W(a′).
If $W \in \operatorname{pa}_{\mathcal{G}}(Y) \setminus \{A\}$ and (WY)→ is in some path in π, set W to the value it would have attained under the (recursively defined) π-specific potential outcome W(π, a, a′).
As was the case with (4), the inductive definition terminates because $\mathcal{G}$ is acyclic. This definition is a natural generalization of the way natural direct effects were defined above, with the direct path 〈(AY)→〉 being a path of interest, and the indirect path 〈(AM)→, (MY)→〉 being a path not of interest.
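The four cases can be traced mechanically. The Python sketch below builds the nested expression for Y(π, a, a′) as a string by recursing over parents. The parent lists are an assumed encoding of a DAG shaped like Fig. 1 (b) (in particular, Z is assumed to have parents A and M only), the function names are hypothetical, and π is represented by the set of edges lying on its paths.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)

# Assumed parent lists for a DAG shaped like Fig. 1 (b); argument order follows these lists.
PARENTS: Dict[str, List[str]] = {
    "W": [], "A": ["W"], "M": ["A", "W"], "Z": ["A", "M"], "Y": ["A", "Z", "M", "W"],
}

def ordinary(v: str, value: str, treatment: str = "A") -> str:
    """Expression for V(value): the treatment is set to `value` everywhere, as in (4)."""
    if v == treatment:
        return value
    args = [ordinary(p, value, treatment) for p in PARENTS[v]]
    return f"{v}({', '.join(args)})" if args else v

def path_specific(v: str, pi_edges: Set[Edge], a: str, a_prime: str,
                  treatment: str = "A") -> str:
    """Expression for V(pi, a, a'), following the four cases in the text."""
    if v == treatment:
        return a
    args = []
    for p in PARENTS[v]:
        if (p, v) in pi_edges:
            # Edge lies on a path of interest: recurse with the pi-specific definition.
            args.append(path_specific(p, pi_edges, a, a_prime, treatment))
        else:
            # Edge is not on any path of interest: the parent behaves as under a'.
            args.append(ordinary(p, a_prime, treatment))
    return f"{v}({', '.join(args)})" if args else v

# pi = {A -> Z -> Y}, written as the set of its edges.
expr = path_specific("Y", {("A", "Z"), ("Z", "Y")}, "a", "a'")
```

Under these assumptions `expr` is `Y(a', Z(a, M(a', W)), M(a', W), W)`: only the A → Z edge carries the case value a, and every other occurrence of the treatment uses a′.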
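Applying this definition to π consisting of a single path 〈(AZ)→, (ZY)→〉 in Fig. 1 (b) yields
$$Y(\pi, a, a') = Y(a', Z(a, M(a', W)), M(a', W), W).$$
By analogy with direct and indirect effects, we can use this counterfactual to define the total effect not through π as
$$\mathbb{E}[Y(a)] - \mathbb{E}[Y(\pi, a, a')],$$
and the pure π-specific effect as
$$\mathbb{E}[Y(\pi, a, a')] - \mathbb{E}[Y(a')].$$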
Equation (5) generalizes in a natural way to multiple treatments $\vec{A}$, and multiple outcomes $\vec{Y}$. In such cases, attention is restricted to sets π of proper causal paths for $\vec{A}$ and $\vec{Y}$, which are directed paths from an element in $\vec{A}$ to any element in $\vec{Y}$ that otherwise do not intersect $\vec{A}$. In such cases, definition (5) is applied to every $Y \in \vec{Y}$, with a set of proper causal paths π, a set of case values $\vec{a}$, and a set of control values $\vec{a}'$. The resulting distribution is $p(\vec{Y}(\pi, \vec{a}, \vec{a}'))$.
3.3. Dynamic Treatment Regimes
All targets considered so far represented potential outcomes where treatments were set to specific constant values, representing case and control treatments. Distributions over these outcomes can be used to make quantitative comparisons of how different treatments affect average responses in a population. However, in settings such as precision medicine, the primary goal is obtaining good outcomes for every individual, rather than assessing average treatment effects in a population. This difference is crucial where treatments can cause allergies and side effects: a treatment may be both very effective on average and extremely harmful for certain subsets of patients.
In simple versions of this setting, shown in Fig. 1 (a), the goal is not to learn the potential outcome Y(a) given a particular treatment assignment a, but the potential outcome Y(A = g(W)), where A is counterfactually set not to a fixed value a, but to a value given by a policy or dynamic treatment regime g(W) that depends on the vector of baseline factors W (Chakraborty and Moodie, 2013). Policy quality can be assessed by comparing two different policies, as was done with treatment effects, or finding the optimal policy directly. Just as was the case with Y(a), the response Y(A = g(W)) to a dynamic treatment regime is a counterfactual random variable, and is not necessarily identified from observed data, as we will see below. This is because assignment to A given W in the observed data is not necessarily according to the policy of interest g(.). There is a close relationship between the distinction between factual and counterfactual policies and off-policy learning in the reinforcement learning literature.
More generally, given a temporally ordered set of treatments $\vec{A}$, assessing the quality of a set of policies $g_{\vec{A}} \equiv \{g_A \mid A \in \vec{A}\}$ with respect to a response Y might be of interest. In such settings, complex interdependence between policies is possible. For a simple example, consider Fig. 1 (b), where W is a set of baseline factors, A, Z are treatments of interest, M is an intermediate outcome, and Y is the final outcome. We assume a temporal order W, A, M, Z, Y on the variables, and allow the policy determining the value of each treatment to depend on the entire observed history, that is gA is a function of W and gZ is a function of M, A, W. In medical settings, A may represent first line treatments, and Z secondary treatment options given poor response to the first line treatment. The interdependence of policies gA and gZ occurs because gZ depends on the entire observed history up to Z, and specifically on M, which itself depends on gA. Defining $Y(A = g_A, Z = g_Z)$ entails recursively setting all parents of Y to values they would have attained under $\{g_A, g_Z\}$, with the result being the following generalization of (4):
$$Y(A = g_A, Z = g_Z) \equiv Y\big(g_A(W),\ g_Z(M(g_A(W), W), g_A(W), W),\ M(g_A(W), W),\ W\big). \tag{6}$$
In oncology, an example corresponding to Fig. 1 (b) would represent assigning treatments to cancer patients, with A representing types of induction chemotherapy, and Z being either continuation of induction chemotherapy, or a switch to salvage chemotherapy in patients that are not responding to primary induction chemotherapy. Naturally, a policy gZ which switches treatments effectively would depend on the intermediate outcome M(A = gA(W), W), which measures the degree of response to induction chemotherapy. This intermediate outcome would itself depend on the choice of induction therapy. This choice, in turn, may be governed by patient age, and other baseline covariates that form a part of W.
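The definition of the response Y to an arbitrary set of policies is as follows. Given a set of treatments $\vec{A}$ in a causal model represented by a DAG $\mathcal{G}(\vec{V})$, fix a topological ordering ≺ on $\vec{V}$ consistent with $\mathcal{G}$ (typically the temporal order for variables in the data), and consider for every $A \in \vec{A}$ a set of variables $\vec{W}_A$ earlier in the ordering ≺ than A. Fix a set of policies $g_{\vec{A}} \equiv \{g_A \mid A \in \vec{A}\}$ that determine the value of each $A \in \vec{A}$ using values of $\vec{W}_A$. For any $Y \in \vec{V} \setminus \vec{A}$, the counterfactual response $Y(g_{\vec{A}})$ had every $A \in \vec{A}$ been determined by $g_A$ is defined using the appropriate generalization of the recursive substitution definition (4):
$$Y(g_{\vec{A}}) \equiv Y\big(\{ g_A(\vec{W}_A(g_{\vec{A}})) \mid A \in \operatorname{pa}_{\mathcal{G}}(Y) \cap \vec{A} \},\ \{ W(g_{\vec{A}}) \mid W \in \operatorname{pa}_{\mathcal{G}}(Y) \setminus \vec{A} \}\big). \tag{7}$$
In words, this says that to define the response of Y given a set of policies $g_{\vec{A}}$, we counterfactually set each element of $\operatorname{pa}_{\mathcal{G}}(Y)$ as follows. Each element A of $\operatorname{pa}_{\mathcal{G}}(Y) \cap \vec{A}$ is set to the value of the appropriate gA evaluated given the values of its inputs $\vec{W}_A$, which are themselves evaluated recursively given $g_{\vec{A}}$. Each element W of $\operatorname{pa}_{\mathcal{G}}(Y) \setminus \vec{A}$ is set to a value it would have attained had it been evaluated, recursively, under the policy set $g_{\vec{A}}$.
The quality of the policy set $g_{\vec{A}}$ is often expressed as the outcome expectation under the policy set: $\mathbb{E}[Y(g_{\vec{A}})]$, or any other appropriate function of $p(Y(g_{\vec{A}}))$.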
4. Identification In Fully Observed Causal Models Of A DAG
Having given the necessary preliminaries, and defined appropriate targets of inference, we now consider the question of their identification. In causal models of a DAG where all variables are observed, interventional distributions of the form $p(\vec{Y}(\vec{a}))$ and responses to dynamic treatment regimes are always identified, while path-specific effects are identified according to a simple criterion on the graph, given further below.
4.1. Identification Of Interventional Distributions And Responses To Dynamic Treatment Regimes
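For any value set $\vec{a}$ of $\vec{A} \subseteq \vec{V}$, the interventional distribution over counterfactuals $\{V(\vec{a}) \mid V \in \vec{V} \setminus \vec{A}\}$ is identified by
$$p(\{V(\vec{a}) \mid V \in \vec{V} \setminus \vec{A}\}) = \prod_{V \in \vec{V} \setminus \vec{A}} p(V \mid \operatorname{pa}_{\mathcal{G}}(V)) \Big|_{\vec{A} = \vec{a}}. \tag{8}$$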
This equation is known as the g-formula (Robins, 1986), the manipulated distribution (Spirtes et al., 2001), or the truncated factorization (Pearl, 2009).
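The g-formula asserts that in the functional model corresponding to a DAG $\mathcal{G}(\vec{V})$, the effect of setting any set of variables $\vec{A}$ to values $\vec{a}$ using the intervention operation $do(\vec{a})$ amounts to replacing the set of structural equations $\{f_V \mid V \in \vec{V}\}$ by another set
$$\{f^{\vec{a}}_V \mid V \in \vec{V}\},$$
where for every $A \in \vec{A}$, $f^{\vec{a}}_A \equiv \vec{a}_A$, and for every $V \in \vec{V} \setminus \vec{A}$, $f^{\vec{a}}_V$ is $f_V$ with its inputs in $\operatorname{pa}_{\mathcal{G}}(V) \cap \vec{A}$ held at $\vec{a}_{\operatorname{pa}_{\mathcal{G}}(V) \cap \vec{A}}$, where $\vec{a}_{\operatorname{pa}_{\mathcal{G}}(V) \cap \vec{A}}$ is the subset of values of $\vec{a}$ corresponding to $\operatorname{pa}_{\mathcal{G}}(V) \cap \vec{A}$. In words, $do(\vec{a})$ is implemented by replacing all structural equations $f_A$ that determine elements $A \in \vec{A}$, by new structural equations that ignore all inputs and set A to the appropriate value for A in $\vec{a}$, and replacing all structural equations fV that determine elements $V \in \vec{V} \setminus \vec{A}$ by structural equations that agree with fV, except they always evaluate $\operatorname{pa}_{\mathcal{G}}(V) \cap \vec{A}$ using the appropriate subset of values in $\vec{a}$.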
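Implementing interventions by replacing structural equations generalizes in a straightforward way to yield responses to a set of dynamic treatment regimes $g_{\vec{A}}$. Such responses are generated by replacing $f_A$, for each $A \in \vec{A}$, not by a constant value, but by $g_A(\vec{W}_A)$. Similarly, for each $V \in \vec{V} \setminus \vec{A}$, the replacement structural equations agree with fV, except they always evaluate $\operatorname{pa}_{\mathcal{G}}(V) \cap \vec{A}$ using the appropriate subset of values of $\vec{A}$ determined by $g_{\vec{A}}$. This immediately yields the following expression as the identifying formula for the distribution over responses in $\vec{V} \setminus \vec{A}$ to a treatment regime $g_{\vec{A}}$:
$$p(\{V(g_{\vec{A}}) \mid V \in \vec{V} \setminus \vec{A}\}) = \prod_{V \in \vec{V} \setminus \vec{A}} p\big(V \mid \{A = g_A(\vec{W}_A) \mid A \in \operatorname{pa}_{\mathcal{G}}(V) \cap \vec{A}\},\ \operatorname{pa}_{\mathcal{G}}(V) \setminus \vec{A}\big). \tag{9}$$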
This formula is a generalization of (8).
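The expression (8) has a number of well-known special cases. For instance, in Fig. 1 (a),
$$p(Y(a)) = \sum_{W, M} p(Y \mid a, M, W)\, p(M \mid a, W)\, p(W) = \sum_{W} p(Y \mid a, W)\, p(W), \tag{10}$$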
where the last equality follows by marginalizing out M and applying the chain rule. This recovers the well-known adjustment or backdoor formula (Pearl, 2009). In Fig. 1 (b),
$$p(Y(a, z)) = \sum_{W, M} p(Y \mid z, M, a, W)\, p(M \mid a, W)\, p(W).$$
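Let gA be a mapping from values of W to values of A, and gZ be a mapping from values of M, A, W to values of Z. Then the distribution of the potential outcome $Y(g_A, g_Z)$ is identified as
$$p(Y(g_A, g_Z)) = \sum_{W, M} p(Y \mid Z = g_Z(M, g_A(W), W), M, A = g_A(W), W)\, p(M \mid A = g_A(W), W)\, p(W).$$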
The first expression recovers the g-computation algorithm formula (Robins, 1986) for inferring causal effects in longitudinal studies, while the second gives the appropriate generalization for dynamic treatment regimes. Both expressions are given here for two time points, but generalize in a straightforward way.
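The equality between the g-formula and the backdoor formula can be checked numerically. The sketch below is only an illustration: it assumes a hypothetical binary model with topological order W, A, M, Y (the conditional probability tables are made up and are not taken from the paper), and uses exact rational arithmetic so that the two expressions can be compared exactly.

```python
from fractions import Fraction
from itertools import product

# Hypothetical binary model with topological order W, A, M, Y (illustrative numbers).
p_w1 = Fraction(7, 10)                                    # p(W = 1)
p_a1 = {w: Fraction(1 + 2 * w, 5) for w in (0, 1)}        # p(A = 1 | W)
p_m1 = {(a, w): Fraction(1 + a + 2 * w, 5)
        for a, w in product((0, 1), repeat=2)}            # p(M = 1 | A, W)
p_y1 = {(a, m, w): Fraction(1 + a + 2 * m + 3 * w, 8)
        for a, m, w in product((0, 1), repeat=3)}         # p(Y = 1 | A, M, W)


def bern(p_one, value):
    """Probability that a binary variable with p(1) = p_one takes `value`."""
    return p_one if value == 1 else 1 - p_one


def joint(w, a, m, y):
    """Observed joint p(W, A, M, Y) given by the DAG factorization."""
    return (bern(p_w1, w) * bern(p_a1[w], a)
            * bern(p_m1[(a, w)], m) * bern(p_y1[(a, m, w)], y))


def prob(**event):
    """Marginal probability of an event such as prob(y=1, a=1)."""
    total = Fraction(0)
    for w, a, m, y in product((0, 1), repeat=4):
        point = {"w": w, "a": a, "m": m, "y": y}
        if all(point[name] == value for name, value in event.items()):
            total += joint(w, a, m, y)
    return total


def g_formula(y, a):
    """sum_{W,M} p(y | a, M, W) p(M | a, W) p(W), computed from the joint."""
    return sum(prob(y=y, a=a, m=m, w=w) / prob(a=a, m=m, w=w)
               * prob(a=a, m=m, w=w) / prob(a=a, w=w)
               * prob(w=w)
               for w, m in product((0, 1), repeat=2))


def backdoor(y, a):
    """sum_W p(y | a, W) p(W), computed from the joint."""
    return sum(prob(y=y, a=a, w=w) / prob(a=a, w=w) * prob(w=w)
               for w in (0, 1))
```

In this assumed model `g_formula(1, 1)` and `backdoor(1, 1)` coincide, while both differ from the ordinary conditional probability `prob(y=1, a=1) / prob(a=1)`, which is confounded by W.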
4.2. Identification Of Path-Specific Effects
Potential outcomes associated with path-specific effects and defined via (5) are more complicated objects than potential outcomes defined via (4), due to the presence of two conflicting treatment assignments within the same object. As a consequence, even in causal models of a DAG where every element of $\mathbf{V}$ is observed, distributions over some such potential outcomes are not identified. A characterization of identifiable π-specific potential outcomes exists, based on a feature of the graph and the path set π (Avin et al., 2005).
Fix a set of treatments $\mathbf{A}$, a set of outcomes $\mathbf{Y}$, and a set π of proper causal paths for $\mathbf{A}$ and $\mathbf{Y}$ in a DAG $\mathcal{G}$. A variable $W$ is called a recanting witness for π if there exists a directed path in π of the form 〈(AW)→, …, (Z1Y1)→〉 for some $A \in \mathbf{A}$ and $Y_1 \in \mathbf{Y}$, and another proper causal path of the form 〈(AW)→, …, (Z2Y2)→〉 for some $Y_2 \in \mathbf{Y}$ that is not in π. As an example, in Fig. 1 (b), if we are interested in the path-specific effect of A on Y along a single path 〈(AM)→, (MY)→〉, that is if π = {〈(AM)→, (MY)→〉}, then M is a recanting witness for π, since the path 〈(AM)→, (MZ)→, (ZY)→〉 is a proper causal path for A and Y, is not an element of π, and has as its first edge (AM)→, which is also the first edge of 〈(AM)→, (MY)→〉. Despite the existence of a recanting witness, the potential outcome Y(π, a, a′) is still definable via (5), and is equal to Y(a′, Z(a′, M(a′, W)), M(a, W), W). Both potential outcomes M(a′, W) and M(a, W) appear in this expression, and this is what prevents identification. This issue arises because the same child M of the intervened-on variable A appears both in a path of interest 〈(AM)→, (MY)→〉 and in another path 〈(AM)→, (MZ)→, (ZY)→〉 that is not of interest, but that is involved in defining the potential outcome.
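The recanting witness condition is purely graphical and can be checked by enumerating proper causal paths. The following is a minimal sketch under stated assumptions: paths are represented as tuples of vertex names, and the edge list used in the usage note is an assumed rendering of Fig. 1 (b) (a complete DAG on W, A, M, Z, Y), since the figure itself is not reproduced here. The function names are hypothetical.

```python
def proper_causal_paths(edges, treatments, outcomes):
    """Directed paths from a treatment to an outcome whose only treatment
    vertex is the first one. Each path is a tuple of vertex names."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    paths = []

    def extend(path):
        last = path[-1]
        if last in outcomes and len(path) > 1:
            paths.append(tuple(path))
        for child in children.get(last, []):
            if child not in treatments:      # keep the path proper
                extend(path + [child])

    for a in treatments:
        extend([a])
    return paths


def recanting_witnesses(edges, treatments, outcomes, pi):
    """Children of a treatment whose incoming treatment edge starts both a
    path in pi and a proper causal path outside pi."""
    pi = {tuple(p) for p in pi}
    all_paths = proper_causal_paths(edges, treatments, outcomes)
    first_in = {p[:2] for p in all_paths if p in pi}
    first_out = {p[:2] for p in all_paths if p not in pi}
    return {edge[1] for edge in first_in & first_out}


# Assumed edge list standing in for Fig. 1 (b).
FIG_1B = [("W", "A"), ("W", "M"), ("W", "Z"), ("W", "Y"), ("A", "M"),
          ("A", "Z"), ("A", "Y"), ("M", "Z"), ("M", "Y"), ("Z", "Y")]
```

With this assumed graph, `recanting_witnesses(FIG_1B, {"A"}, {"Y"}, [("A", "M", "Y")])` returns `{"M"}`, matching the example above, while adding the path `("A", "M", "Z", "Y")` to π leaves no recanting witness.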
In fact, identification of potential outcomes involved in a path-specific effect is characterized in DAGs by the absence of a recanting witness. Specifically, given disjoint vertex sets $\mathbf{A}$, $\mathbf{Y}$ in a DAG $\mathcal{G}$, and a set π of proper causal paths for $\mathbf{A}$ and $\mathbf{Y}$, the distribution $p(\mathbf{Y}(\pi, \mathbf{a}, \mathbf{a}'))$ is identified if and only if there are no recanting witnesses for π. If no recanting witness exists, then the joint counterfactual distribution over the variables $\{\mathbf{V} \setminus \mathbf{A}\}(\pi, \mathbf{a}, \mathbf{a}')$ is identified via a generalization of equation (8) called the edge g-formula (Shpitser and Tchetgen Tchetgen, 2016), with an early version appearing in (Avin et al., 2005):
| (11) |
Just like the ordinary g-formula, the edge g-formula can be viewed as a truncated DAG factorization. However, in the edge g-formula a variable $A \in \mathbf{A}$ in the Markov factor of $V$ can be set to either its value in $\mathbf{a}$ or its value in $\mathbf{a}'$, depending on whether the edge (AV)→ is a part of a path in π or not.
As an example, a recanting witness does not exist for the path-specific effect of A on Y along the set of paths π ≡ {〈(AM)→, (MY)→〉; 〈(AM)→, (MZ)→, (ZY)→〉} in Fig. 1 (b). The counterfactual distribution p(Y(π, a, a′)) corresponding to this set of paths is then identified as
$$p(Y(\pi, a, a')) = \sum_{W, M, Z} p(Y \mid a', Z, M, W)\, p(Z \mid a', M, W)\, p(M \mid a, W)\, p(W).$$
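In the classical mediation analysis setting shown in Fig. 1 (a), where we are interested in the direct effect of A on Y, in other words in the path set π ≡ {〈(AY)→〉}, the counterfactual distribution p(Y({〈(AY)→〉}, a, a′)) = p(Y(a, M(a′))) is identified by the edge g-formula as
$$p(Y(a, M(a'))) = \sum_{W, M} p(Y \mid a, M, W)\, p(M \mid a', W)\, p(W).$$
If we are interested in the pure direct effect $\mathbb{E}[Y(a, M(a'))] - \mathbb{E}[Y(a')]$, the formula above recovers the well-known mediation formula (Pearl, 2011):
$$\mathbb{E}[Y(a, M(a'))] - \mathbb{E}[Y(a')] = \sum_{W, M} \left\{ \mathbb{E}[Y \mid a, M, W] - \mathbb{E}[Y \mid a', M, W] \right\} p(M \mid a', W)\, p(W).$$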
5. Identification In Causal Models Of A DAG With Hidden Variables
Identification results described so far assumed a causal model of a DAG $\mathcal{G}(\mathbf{V})$ where every element in $\mathbf{V}$ corresponds to an observed variable. Unfortunately, this assumption is unrealistic in practice. For example, in observational studies in healthcare, compliance and health status of enrolled patients are generally not completely observed, except through imperfect proxies. This motivates the study of causal models of a DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ where $\mathbf{V}$ corresponds to observed variables, and $\mathbf{H}$ to hidden variables.
The presence of hidden variables considerably complicates identification theory. We will describe a simple characterization of identifiable targets of causal inference in hidden variable causal DAGs, based on mixed graphs, and the fixing operation that can be viewed as a statistical version of the intervention operation. This characterization was described in (Richardson et al., 2017) in the context of treatment effects, and generalized for mediation analysis and dynamic treatment regime problems in (Shpitser and Sherman, 2018).
In a fully observed DAG $\mathcal{G}(\mathbf{V})$, all identified functionals are based on the g-formula (8), which is a truncated version of the Markov factorization of $p(\mathbf{V})$ associated with $\mathcal{G}(\mathbf{V})$. In a hidden variable DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$, all identified functionals are also based on a truncated version of a Markov factorization. However, this nested Markov factorization is of the marginal distribution $p(\mathbf{V})$, and is taken with respect not to the DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$, but to a special mixed graph we denote $\mathcal{G}(\mathbf{V})$, obtained by a latent projection operation (Verma and Pearl, 1990) from $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$. This factorization is defined in terms of objects called kernels that generalize conditional distributions and which can be “put together” to construct the observed marginal $p(\mathbf{V})$, as well as certain other distributions that correspond to causal targets of inference.
We now give a roadmap of the definitions needed for the nested factorization. First, we describe special mixed graphs called acyclic directed mixed graphs (ADMGs) and their conditional versions (CADMGs), and generalize existing definitions and genealogic relations defined for DAGs in Section 2.2 to ADMGs and CADMGs. Then we define a special ADMG called a latent projection (Verma and Pearl, 1990), which represents identification theory for an infinite class of “structurally similar” hidden variable DAGs. We then describe how targets of inference can be defined on this ADMG directly in a way that represents the target in any hidden variable DAG in its class. We then describe kernels, which are generalizations of conditional distributions that will represent terms of the nested factorization. Next, we define the fixing operator (Richardson et al., 2017) on graphs and kernels, which will be used iteratively to give the nested factorization. The nested factorization will link CADMG and kernel pairs, with each pair derived from $\mathcal{G}(\mathbf{V})$ and $p(\mathbf{V})$ via the fixing operator.
Finally, we define the nested factorization of a distribution with respect to an ADMG, reformulate the ID algorithm (Tian and Pearl, 2002; Shpitser and Pearl, 2006b) as a truncated version of this factorization (Richardson et al., 2017), and generalize this formulation to also give identifying functionals for responses to dynamic treatment regimes and path-specific potential outcomes. We also describe the potential outcomes calculus (Malinsky et al., 2019), a generalization of Pearl’s do-calculus defined in terms of single world intervention graphs (SWIGs) (Richardson and Robins, 2013) and potential outcomes, and show how this calculus may be used to give a complete identification theory for conditional interventional distributions.
5.1. Mixed Graphs
The mixed graphs we consider contain only directed (→) and bidirected (↔) edges, although other types of mixed graphs have been considered in the literature (Lauritzen, 1996; Drton, 2009; Shpitser, 2015). In mixed graphs representing causal systems, directed edges represent the possible presence of a direct causal relationship between the vertices sharing the edge, while bidirected edges represent the possible presence of a particular type of spurious association between the vertices sharing the edge, as would occur had there been some unnamed and unobserved common cause. A mixed graph may contain at most two edges between two vertices. Moreover, if two vertices have two edges in common, one of them must be directed and one must be bidirected. An acyclic directed mixed graph (ADMG) is a mixed graph with no directed cycles.
A conditional ADMG (CADMG) $\mathcal{G}(\mathbf{V}, \mathbf{W})$ (Richardson et al., 2017) is an ADMG with a vertex set $\mathbf{V} \cup \mathbf{W}$ where for every $W \in \mathbf{W}$ it is the case that neither (ZW)→ nor (ZW)↔ edges exist in $\mathcal{G}(\mathbf{V}, \mathbf{W})$ for any $Z \in \mathbf{V} \cup \mathbf{W}$. That is, in any CADMG $\mathcal{G}(\mathbf{V}, \mathbf{W})$, there is no edge of any kind with an arrowhead pointing into any element $W \in \mathbf{W}$. Although every element in $\mathbf{W}$ has this property, there may also exist elements $V \in \mathbf{V}$ such that neither (ZV)→ nor (ZV)↔ edges exist in $\mathcal{G}(\mathbf{V}, \mathbf{W})$ for any $Z \in \mathbf{V} \cup \mathbf{W}$. The distinction between elements in $\mathbf{V}$ and elements in $\mathbf{W}$ does not come from the way these vertex sets are defined, but from how they are used when defining graphical models. Vertices in $\mathbf{V}$ correspond to random variables, as in the statistical model of a DAG. Vertices in $\mathbf{W}$ correspond to variables that were fixed to a value. CADMGs will correspond to sets of distributions that depend on values in $\mathbf{W}$. These distributions, called kernels, along with CADMGs, will be used later to concisely express identification in terms of a truncated factorization.
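A bidirected path from V1 to Vk has the form 〈(V1V2)↔, (V2V3)↔, …, (Vk−2Vk−1)↔, (Vk−1Vk)↔〉. For $V \in \mathbf{V}$, define the district (Richardson and Spirtes, 2002) or the c-component (Tian and Pearl, 2002) of V as the set
$$\text{dis}_{\mathcal{G}}(V) \equiv \{V\} \cup \{V' \in \mathbf{V} : \text{there is a bidirected path from } V' \text{ to } V \text{ in } \mathcal{G}(\mathbf{V}, \mathbf{W})\}.$$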
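The set of districts of $\mathcal{G}(\mathbf{V}, \mathbf{W})$, denoted by $\mathcal{D}(\mathcal{G}(\mathbf{V}, \mathbf{W}))$, is a partition of $\mathbf{V}$. Specifically, elements of $\mathcal{D}(\mathcal{G}(\mathbf{V}, \mathbf{W}))$ correspond to connected components in the graph derived from $\mathcal{G}(\mathbf{V}, \mathbf{W})$ by dropping all vertices in $\mathbf{W}$ and edges adjacent to such vertices, and all directed edges. In such a graph, vertices in connected components are connected by bidirected paths only, and thus correspond precisely to districts in the original CADMG. The same definition applies to any ADMG $\mathcal{G}(\mathbf{V})$ to yield $\mathcal{D}(\mathcal{G}(\mathbf{V}))$.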
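Since districts are connected components of the bidirected part of the graph restricted to random vertices, they can be computed with a standard graph traversal. The sketch below is illustrative only; the function name and the representation of bidirected edges as unordered pairs of vertex names are assumptions.

```python
def districts(random_vertices, bidirected):
    """Partition the random vertices of a (C)ADMG into districts, that is,
    connected components of the bidirected edges among random vertices."""
    random_vertices = set(random_vertices)
    neighbors = {v: set() for v in random_vertices}
    for u, v in bidirected:
        # Edges touching fixed vertices (not in random_vertices) are dropped.
        if u in random_vertices and v in random_vertices:
            neighbors[u].add(v)
            neighbors[v].add(u)
    seen, result = set(), set()
    for v in sorted(random_vertices):
        if v in seen:
            continue
        component, stack = set(), [v]
        while stack:
            x = stack.pop()
            if x not in component:
                component.add(x)
                stack.extend(neighbors[x] - component)
        seen |= component
        result.add(frozenset(component))
    return result
```

For instance, for an ADMG on W, A, M, Y with bidirected edges among W, A and Y only, `districts({"W", "A", "M", "Y"}, [("W", "A"), ("A", "Y"), ("W", "Y")])` yields the two districts {W, A, Y} and {M}.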
Other definitions, and genealogic relations defined on DAGs in Section 2.2, generalize in a straightforward way to ADMGs and CADMGs. In particular, d-separation generalizes to m-separation (Richardson, 2003) in ADMGs. In an ADMG, triples of the form 〈(AC)→, (CB)←〉; 〈(AC)↔, (CB)←〉; 〈(AC)→, (CB)↔〉; 〈(AC)↔, (CB)↔〉 are called colliders, and any other kind of triple is called a non-collider. Given an ADMG $\mathcal{G}$, distinct vertices $A$, $B$, and a set $\mathbf{C}$ such that $A, B \notin \mathbf{C}$, a path from A to B is said to be m-separated given $\mathbf{C}$ if the path contains a collider triple whose middle vertex is not in $\mathbf{C}$ and has no descendant in $\mathbf{C}$, or a non-collider triple whose middle vertex is in $\mathbf{C}$. The vertices A and B are m-separated given $\mathbf{C}$ if every path from A to B is m-separated given $\mathbf{C}$.
5.2. Latent Projections
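Identification theory in a causal model described by a hidden variable DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$, where $\mathbf{V}$ corresponds to observed variables and $\mathbf{H}$ to hidden variables, is often described in terms of the ADMG $\mathcal{G}(\mathbf{V})$ constructed from $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ via the latent projection operation (Verma and Pearl, 1990).
Given a DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$, a latent projection onto $\mathbf{V}$ is the following ADMG $\mathcal{G}(\mathbf{V})$. First, $\mathcal{G}(\mathbf{V})$ contains all directed edges in $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ between elements in $\mathbf{V}$. Second, for every pair $V_1, V_2 \in \mathbf{V}$, whenever a directed path from V1 to V2 exists in $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ such that all intermediate elements of the path are in $\mathbf{H}$, $\mathcal{G}(\mathbf{V})$ contains an edge (V1V2)→. Finally, whenever a path from V1 to V2 without collider triples exists in $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$, where all intermediate elements of the path are in $\mathbf{H}$, the first edge points to V1, and the last edge points to V2, $\mathcal{G}(\mathbf{V})$ contains an edge (V1V2)↔. As an example, the latent projection of a hidden variable DAG in Fig. 1 (c), where W, A, M, Y are observed, and H is hidden is an ADMG shown in Fig. 1 (d), while the latent projection of a hidden variable DAG in Fig. 2 (a), where W, A, Y are observed, and H is hidden is an ADMG shown in Fig. 2 (b). Corresponding to the ADMG $\mathcal{G}(\mathbf{V})$ is the class of all hidden variable DAGs $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ such that the latent projection of $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ onto $\mathbf{V}$ results in $\mathcal{G}(\mathbf{V})$.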
Figure 2.
(a) A causal DAG with an unobserved common cause of the treatment A and the outcome Y which prevents identification of p(Y(a)). (b) The ADMG obtained from the DAG in (a) via the latent projection operation collapsing over the unobserved variable H. (c) A subgraph of the ADMG in (b) relevant for identification of p(Y | do(a)). (d) A causal DAG with an unobserved common cause of the baseline variables W, the treatment A and the outcome Y. This DAG also contains a mediator M that “captures” all the causal influence of A on Y and that is not confounded by H. (e) The ADMG $\mathcal{G}$ obtained from the DAG in (d) via the latent projection operation collapsing over the unobserved variable H. (f) A subgraph of the ADMG in (e) relevant for identification of p(Y | do(a)). (g) The CADMG $\phi_M(\mathcal{G})$ obtained from the ADMG in (e) by fixing M. (h) The CADMG $\phi_{\langle M, A \rangle}(\mathcal{G})$ obtained from the ADMG in (e) by fixing M and then A. (i) The CADMG $\phi_Y(\mathcal{G})$ obtained from the ADMG in (e) by fixing Y. (j) The CADMG $\phi_{\langle Y, A \rangle}(\mathcal{G})$ obtained from the ADMG in (e) by fixing Y and then A.
The latent projection of a hidden variable DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ where vertices in $\mathbf{H}$ corresponding to hidden variables are “projected out” is written as $\mathcal{G}(\mathbf{V})$. Similarly, the marginal distribution of $p(\mathbf{V} \cup \mathbf{H})$ where variables in $\mathbf{H}$ are marginalized out is written as $p(\mathbf{V})$. The notation for latent projections in graphs thus intentionally resembles the notation for marginalization in distributions, as the latent projection operation can be viewed as the graphical analogue of the marginalization operation.
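The latent projection can be computed directly from its definition. The following is a minimal sketch, assuming the DAG is given as a list of directed edges over vertex names; the function name is hypothetical. It uses the observation that a collider-free path through hidden vertices with arrowheads at both ends exists exactly when some hidden vertex reaches both endpoints by directed paths through hidden vertices.

```python
def latent_projection(edges, observed):
    """Return (directed, bidirected) edge sets of the latent projection of a
    DAG with the given directed edges onto the observed vertices."""
    observed = set(observed)
    children = {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)

    def observed_reach(start):
        """Observed vertices reachable from `start` by directed paths whose
        intermediate vertices are all hidden."""
        found, visited = set(), set()
        stack = list(children.get(start, ()))
        while stack:
            x = stack.pop()
            if x in visited:
                continue
            visited.add(x)
            if x in observed:
                found.add(x)                 # an observed vertex ends the path
            else:
                stack.extend(children.get(x, ()))
        return found

    vertices = set(children) | {v for _, v in edges}
    directed = {(v, c) for v in observed for c in observed_reach(v)}
    bidirected = set()
    for h in vertices - observed:
        reach = sorted(observed_reach(h))
        # Every pair of observed vertices with a common hidden "source".
        bidirected |= {(u, v) for i, u in enumerate(reach) for v in reach[i + 1:]}
    return directed, bidirected
```

For a hypothetical DAG with edges H → A, H → Y, A → M, M → Y and H hidden, the projection onto {A, M, Y} has directed edges A → M and M → Y and a single bidirected edge between A and Y (stored as the sorted pair `("A", "Y")`).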
5.3. Targets Of Inference In Hidden Variable DAG Models
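Defining $\mathbf{Y}(\mathbf{a})$ via (4), $\mathbf{Y}(\pi, \mathbf{a}, \mathbf{a}')$ via (5), and $\mathbf{Y}(\mathbf{g}_{\mathbf{A}})$ via (7) in a hidden variable DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ can, in certain cases, be done directly on the latent projection $\mathcal{G}(\mathbf{V})$, and in a way that only variables in $\mathbf{V}$ are mentioned, without changing the meaning of the counterfactual. For $\mathbf{Y}(\mathbf{a})$ the only requirement is that $\mathbf{A} \subseteq \mathbf{V}$ and $\mathbf{Y} \subseteq \mathbf{V}$. For $\mathbf{Y}(\mathbf{g}_{\mathbf{A}})$, where each policy $g_A$ maps values of a set $\mathbf{W}_A$ to values of A, in addition it must be the case that for every A, $\mathbf{W}_A \subseteq \mathbf{V}$.
Since directed paths in π may involve vertices in $\mathbf{H}$, defining a version of $\mathbf{Y}(\pi, \mathbf{a}, \mathbf{a}')$ on the latent projection involves considering a counterfactual $\mathbf{Y}(\pi_{\mathbf{V}}, \mathbf{a}, \mathbf{a}')$, where $\pi_{\mathbf{V}}$ is a set of directed paths containing only vertices in $\mathbf{V}$. Specifically, $\pi_{\mathbf{V}}$ is the set of directed paths consisting of, for every path πi ∈ π, the largest subpath of πi consisting only of elements in $\mathbf{V}$. Every path in π yields a path in $\pi_{\mathbf{V}}$, but multiple paths in π may end up yielding the same path in $\pi_{\mathbf{V}}$. The counterfactual $\mathbf{Y}(\pi_{\mathbf{V}}, \mathbf{a}, \mathbf{a}')$ is well defined if for any path in $\pi_{\mathbf{V}}$ formed from πi ∈ π, it is not the case that there exists πj ∉ π such that the largest subpath of πj consisting only of elements in $\mathbf{V}$ is that same path. This requirement is necessary for a consistent assignment of treatment values along each path in $\pi_{\mathbf{V}}$.
Under conditions mentioned above, defining $\mathbf{Y}(\mathbf{a})$ via (4), $\mathbf{Y}(\pi_{\mathbf{V}}, \mathbf{a}, \mathbf{a}')$ via (5), and $\mathbf{Y}(\mathbf{g}_{\mathbf{A}})$ via (7) in $\mathcal{G}(\mathbf{V})$ yields the same counterfactual as if these definitions were applied in $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ to define $\mathbf{Y}(\mathbf{a})$, $\mathbf{Y}(\pi, \mathbf{a}, \mathbf{a}')$, and $\mathbf{Y}(\mathbf{g}_{\mathbf{A}})$, respectively. These results follow from the way these counterfactuals are defined via (4), (5) and (7), and the definition of the latent projection ADMG, and are also described in (Shpitser, 2017). We will restrict attention to counterfactuals that can be defined on $\mathcal{G}(\mathbf{V})$ directly.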
5.4. Kernels As Generalized Conditional Distributions
Functionals of identified causal parameters based on the g-formula described in section 4 can be viewed as (functions of) truncated versions of the DAG factorization (1) of $p(\mathbf{V})$. In hidden variable DAGs represented by latent projection ADMGs, the functionals for identified causal parameters can also be viewed as truncated versions of a particular factorization of $p(\mathbf{V})$ with respect to the latent projection ADMG $\mathcal{G}(\mathbf{V})$. However, pieces of this factorization are not simple functions of the observed distribution corresponding to conditional distributions. Instead, these pieces must be obtained from the observed distribution by an operator that sequentially applies a certain fixing operation. To describe this operator, we introduce a special type of distribution that represents, along with CADMGs introduced earlier, the operator’s intermediate inputs and outputs.
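A kernel $q_{\mathbf{V}}(\mathbf{V} \mid \mathbf{W})$ (Lauritzen, 1996) is a mapping from values of $\mathbf{W}$ to normalized densities over $\mathbf{V}$. For a subset $\mathbf{W}' \subseteq \mathbf{W}$ and values $\mathbf{w}'$ of $\mathbf{W}'$, $q_{\mathbf{V}}(\mathbf{V} \mid \mathbf{w}', \mathbf{W} \setminus \mathbf{W}')$ is a restriction of $q_{\mathbf{V}}$ to a mapping from values of $\mathbf{W} \setminus \mathbf{W}'$ to normalized densities over $\mathbf{V}$. A conditional distribution is a type of kernel, though other types of kernels are possible. For example, the identifying functional in (10) is a kernel q(Y | a) ≡ ΣW p(Y | a,W)p(W) that is not in general equal to the conditional distribution p(Y | a). Given a subset $\mathbf{A} \subseteq \mathbf{V}$, marginalization and conditioning for kernels are defined in the usual way:
$$q(\mathbf{A} \mid \mathbf{W}) \equiv \sum_{\mathbf{V} \setminus \mathbf{A}} q(\mathbf{V} \mid \mathbf{W}), \qquad q(\mathbf{V} \setminus \mathbf{A} \mid \mathbf{A}, \mathbf{W}) \equiv \frac{q(\mathbf{V} \mid \mathbf{W})}{q(\mathbf{A} \mid \mathbf{W})}.$$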
CADMGs and kernels involved in identification are derived from $\mathcal{G}(\mathbf{V})$ and $p(\mathbf{V})$ by sequential application of the fixing operation, defined in (Richardson et al., 2017).
5.5. The Fixing Operation And Reachable And Intrinsic Sets
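Given a CADMG $\mathcal{G}(\mathbf{V}, \mathbf{W})$, a vertex $V \in \mathbf{V}$ is said to be fixable if $\text{de}_{\mathcal{G}}(V) \cap \text{dis}_{\mathcal{G}}(V) = \{V\}$. Given a fixable vertex V in $\mathcal{G}(\mathbf{V}, \mathbf{W})$, define the fixing operator on graphs $\phi_V(\mathcal{G})$ to be an operator that produces a CADMG $\mathcal{G}(\mathbf{V} \setminus \{V\}, \mathbf{W} \cup \{V\})$ which is obtained from $\mathcal{G}(\mathbf{V}, \mathbf{W})$ by removing all edges of the form (WV)→, (WV)↔. As an example, M is fixable in the ADMG $\mathcal{G}$ shown in Fig. 2 (e), since $\text{de}_{\mathcal{G}}(M) \cap \text{dis}_{\mathcal{G}}(M) = \{M, Y\} \cap \{M\} = \{M\}$. The CADMG $\phi_M(\mathcal{G})$ is shown in Fig. 2 (g).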
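The fixability condition and the graph part of the fixing operator can be sketched in a few lines. This is an illustration under assumptions: the edge lists below are an assumed rendering of the ADMG in Fig. 2 (e) (directed edges W → A, W → M, W → Y, A → M, M → Y and bidirected edges among W, A, Y), and all function names are hypothetical.

```python
def descendants(v, directed):
    """v together with every vertex reachable from v along directed edges."""
    found, stack = {v}, [v]
    while stack:
        x = stack.pop()
        for a, b in directed:
            if a == x and b not in found:
                found.add(b)
                stack.append(b)
    return found


def district(v, bidirected):
    """v together with every vertex joined to v by a bidirected path."""
    found, stack = {v}, [v]
    while stack:
        x = stack.pop()
        for a, b in bidirected:
            for s, t in ((a, b), (b, a)):
                if s == x and t not in found:
                    found.add(t)
                    stack.append(t)
    return found


def is_fixable(v, directed, bidirected):
    """A random vertex is fixable if it is the only one of its descendants
    that lies in its own district."""
    return descendants(v, directed) & district(v, bidirected) == {v}


def fix(v, directed, bidirected):
    """Graph part of the fixing operator: drop every edge with an arrowhead into v."""
    return ({(a, b) for a, b in directed if b != v},
            {(a, b) for a, b in bidirected if v not in (a, b)})


# Assumed edges standing in for the ADMG of Fig. 2 (e).
DIRECTED = {("W", "A"), ("W", "M"), ("W", "Y"), ("A", "M"), ("M", "Y")}
BIDIRECTED = {("W", "A"), ("W", "Y"), ("A", "Y")}
```

Under these assumptions M is fixable and A is not; after applying `fix("M", DIRECTED, BIDIRECTED)`, the edge A → M disappears and A becomes fixable in the resulting graph.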
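Given a CADMG $\mathcal{G}(\mathbf{V}, \mathbf{W})$ and a kernel $q_{\mathbf{V}}(\mathbf{V} \mid \mathbf{W})$, and any fixable V in $\mathcal{G}(\mathbf{V}, \mathbf{W})$, define the fixing operator on kernels $\phi_V(q_{\mathbf{V}}; \mathcal{G})$ to be one that produces the kernel
$$\phi_V(q_{\mathbf{V}}; \mathcal{G}) \equiv \frac{q_{\mathbf{V}}(\mathbf{V} \mid \mathbf{W})}{q_{\mathbf{V}}(V \mid \text{nd}_{\mathcal{G}}(V), \mathbf{W})}.$$
We remind the reader that the set of non-descendants of V, $\text{nd}_{\mathcal{G}}(V)$, is defined as every element that is not a descendant of $V$. As an example, given the joint distribution p(W, A, M, Y) associated with the ADMG $\mathcal{G}$ shown in Fig. 2 (e), we have, since M is fixable in $\mathcal{G}$,
$$\phi_M(p(W, A, M, Y); \mathcal{G}) = \frac{p(W, A, M, Y)}{p(M \mid A, W)} = p(Y \mid M, A, W)\, p(A, W).$$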
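The fixing operator is applied iteratively to CADMG/kernel pairs, starting with the latent projection $\mathcal{G}(\mathbf{V})$ of a hidden variable DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ and the observed distribution $p(\mathbf{V})$, which is a marginal of the true distribution $p(\mathbf{V} \cup \mathbf{H})$ that is Markov relative to $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$. Not all sequences of fixing operations are allowed.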
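Given a vertex set $\mathbf{S}$ in a CADMG $\mathcal{G}(\mathbf{V}, \mathbf{W})$, we will denote sequences on elements in $\mathbf{S}$ as $\sigma_{\mathbf{S}}$ or 〈V1, …, Vk〉. Given $\sigma_{\mathbf{S}}$, where $\mathbf{S}$ is not the empty set, the operator $\tau(\sigma_{\mathbf{S}})$ yields the subsequence consisting of all but the first element V1, in the same order as in $\sigma_{\mathbf{S}}$.
A sequence $\sigma_{\mathbf{S}}$ of the set $\mathbf{S}$ is said to be fixable in a CADMG $\mathcal{G}(\mathbf{V}, \mathbf{W})$ if
- $\sigma_{\mathbf{S}} = \langle \rangle$, that is if $\mathbf{S}$ is the empty set, or
- $\sigma_{\mathbf{S}}$ has at least one element, V1 is fixable in $\mathcal{G}(\mathbf{V}, \mathbf{W})$, and $\tau(\sigma_{\mathbf{S}})$ is a sequence fixable in $\phi_{V_1}(\mathcal{G}(\mathbf{V}, \mathbf{W}))$.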
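If there exists a sequence $\sigma_{\mathbf{S}}$ fixable in $\mathcal{G}(\mathbf{V}, \mathbf{W})$, we say $\mathbf{S}$ is fixable in $\mathcal{G}(\mathbf{V}, \mathbf{W})$, and $\mathbf{V} \setminus \mathbf{S}$ is reachable in $\mathcal{G}(\mathbf{V}, \mathbf{W})$. A reachable set $\mathbf{R}$ is said to be intrinsic if the CADMG obtained by fixing all elements of $\mathbf{V} \setminus \mathbf{R}$ contains a single district. In particular, for any reachable $\mathbf{R}$ (due to a fixable sequence $\sigma_{\mathbf{V} \setminus \mathbf{R}}$), all districts in the resulting CADMG are intrinsic sets. The notions of reachable and intrinsic sets are purely graphical, hence every ADMG $\mathcal{G}(\mathbf{V})$ has a fixed set of reachable and intrinsic subsets of $\mathbf{V}$; we denote the latter by $\mathcal{I}(\mathcal{G}(\mathbf{V}))$.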
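Given a CADMG $\mathcal{G}(\mathbf{V}, \mathbf{W})$, a kernel $q_{\mathbf{V}}(\mathbf{V} \mid \mathbf{W})$, a set $\mathbf{S} \subseteq \mathbf{V}$, and a sequence $\sigma_{\mathbf{S}}$ fixable in $\mathcal{G}(\mathbf{V}, \mathbf{W})$, we define the fixing operator for both graphs and kernels for $\sigma_{\mathbf{S}}$ as follows:
- If $\sigma_{\mathbf{S}} = \langle \rangle$, that is if $\mathbf{S}$ is the empty set, $\phi_{\sigma_{\mathbf{S}}}(\mathcal{G}) \equiv \mathcal{G}$ and $\phi_{\sigma_{\mathbf{S}}}(q_{\mathbf{V}}; \mathcal{G}) \equiv q_{\mathbf{V}}$.
- If $\sigma_{\mathbf{S}} = \langle V_1, \ldots, V_k \rangle$ with $k \geq 1$, $\phi_{\sigma_{\mathbf{S}}}(\mathcal{G}) \equiv \phi_{\tau(\sigma_{\mathbf{S}})}(\phi_{V_1}(\mathcal{G}))$ and $\phi_{\sigma_{\mathbf{S}}}(q_{\mathbf{V}}; \mathcal{G}) \equiv \phi_{\tau(\sigma_{\mathbf{S}})}(\phi_{V_1}(q_{\mathbf{V}}; \mathcal{G}); \phi_{V_1}(\mathcal{G}))$.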
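For any two distinct sequences $\sigma^1_{\mathbf{S}}$, $\sigma^2_{\mathbf{S}}$ of vertices in $\mathbf{S}$ fixable in $\mathcal{G}(\mathbf{V}, \mathbf{W})$, it is known (Richardson et al., 2017) that $\phi_{\sigma^1_{\mathbf{S}}}(\mathcal{G}) = \phi_{\sigma^2_{\mathbf{S}}}(\mathcal{G})$. That is, any fixable sequence applied to the same set of vertices in the same original CADMG yields the same final CADMG. Hence, we define the fixing operator on graphs for reachable $\mathbf{V} \setminus \mathbf{S}$ directly on sets as $\phi_{\mathbf{S}}(\mathcal{G})$, to mean “apply the fixing operator to elements in $\mathbf{S}$ according to some (any) fixable sequence.”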
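As an example, the sequence 〈M, A〉 is fixable in the ADMG $\mathcal{G}$ shown in Fig. 2 (e). The result of applying $\phi_{\langle M, A \rangle}(\mathcal{G})$ is a CADMG shown in Fig. 2 (h). Similarly, the result of applying the fixing operator according to this sequence to p(W, A, M, Y) and $\mathcal{G}$ yields
$$\phi_{\langle M, A \rangle}(p(W, A, M, Y); \mathcal{G}) = \sum_{A} p(Y \mid M, A, W)\, p(A, W).$$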
5.6. The Nested Markov Factorization
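A distribution $p(\mathbf{V})$ is said to nested Markov factorize with respect to, or be nested Markov relative to, an ADMG $\mathcal{G}(\mathbf{V})$ if there exists a set of kernels $\{q_{\mathbf{S}} : \mathbf{S} \in \mathcal{I}(\mathcal{G}(\mathbf{V}))\}$ such that for every reachable $\mathbf{R}$ and any corresponding fixable sequence $\sigma_{\mathbf{V} \setminus \mathbf{R}}$,
$$\phi_{\sigma_{\mathbf{V} \setminus \mathbf{R}}}(p(\mathbf{V}); \mathcal{G}(\mathbf{V})) = \prod_{\mathbf{D} \in \mathcal{D}(\phi_{\mathbf{V} \setminus \mathbf{R}}(\mathcal{G}(\mathbf{V})))} q_{\mathbf{D}}.$$
In words, this says that $p(\mathbf{V})$ nested Markov factorizes with respect to an ADMG $\mathcal{G}(\mathbf{V})$ if
- there exists a set of intrinsic kernels corresponding to intrinsic sets in $\mathcal{G}(\mathbf{V})$, such that
- every kernel corresponding to a reachable set $\mathbf{R}$, obtained by applying the fixing operator to $p(\mathbf{V})$ and $\mathcal{G}(\mathbf{V})$ using the fixable sequence $\sigma_{\mathbf{V} \setminus \mathbf{R}}$,
- factorizes into a product of intrinsic kernels corresponding to the districts of the CADMG corresponding to $\mathbf{R}$,
- where this CADMG is obtained by applying the fixing operator to the graph $\mathcal{G}(\mathbf{V})$ and the set $\mathbf{V} \setminus \mathbf{R}$.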
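The factorization is nested because it describes the factorization for many kernels corresponding to many reachable sets, some “nested” within others. In particular, the set $\mathbf{V}$ is (trivially) reachable in $\mathcal{G}(\mathbf{V})$, so all other reachable sets are “nested” within $\mathbf{V}$. Since $\mathbf{V}$ is reachable, the nested factorization asserts that the original distribution $p(\mathbf{V})$ factorizes according to districts in $\mathcal{G}(\mathbf{V})$.
If $p(\mathbf{V})$ is nested Markov relative to the ADMG $\mathcal{G}(\mathbf{V})$, then for any two distinct sequences $\sigma^1_{\mathbf{S}}$, $\sigma^2_{\mathbf{S}}$ of vertices in $\mathbf{S}$ fixable in $\mathcal{G}(\mathbf{V})$, it is known (Richardson et al., 2017) that $\phi_{\sigma^1_{\mathbf{S}}}(p(\mathbf{V}); \mathcal{G}(\mathbf{V})) = \phi_{\sigma^2_{\mathbf{S}}}(p(\mathbf{V}); \mathcal{G}(\mathbf{V}))$. In other words, if $p(\mathbf{V})$ is in the nested Markov model, then any distinct fixable sequences on the same set applied to $p(\mathbf{V})$ and $\mathcal{G}(\mathbf{V})$ lead to the same kernel. This result justifies restating the fixing operation on kernels in terms of sets as well. For any reachable $\mathbf{V} \setminus \mathbf{S}$, we write $\phi_{\mathbf{S}}(p(\mathbf{V}); \mathcal{G}(\mathbf{V}))$ to mean “the kernel obtained from applying the fixing operator to $p(\mathbf{V})$ and $\mathcal{G}(\mathbf{V})$ according to some (any) fixable sequence.”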
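In addition, it can be shown (Richardson et al., 2017) that the set of intrinsic kernels is, in fact, $\{\phi_{\mathbf{V} \setminus \mathbf{S}}(p(\mathbf{V}); \mathcal{G}(\mathbf{V})) : \mathbf{S} \in \mathcal{I}(\mathcal{G}(\mathbf{V}))\}$, and the nested Markov factorization can be rephrased as:
$$\phi_{\mathbf{V} \setminus \mathbf{R}}(p(\mathbf{V}); \mathcal{G}(\mathbf{V})) = \prod_{\mathbf{D} \in \mathcal{D}(\phi_{\mathbf{V} \setminus \mathbf{R}}(\mathcal{G}(\mathbf{V})))} \phi_{\mathbf{V} \setminus \mathbf{D}}(p(\mathbf{V}); \mathcal{G}(\mathbf{V}))$$
for every reachable $\mathbf{R}$ in $\mathcal{G}(\mathbf{V})$.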
5.7. The ID Algorithm
Given a hidden variable DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ representing a causal model, assume we are interested in identification of the interventional distribution $p(\mathbf{Y}(\mathbf{a}))$, for arbitrary disjoint subsets $\mathbf{A}$, $\mathbf{Y}$ of $\mathbf{V}$, from the observed distribution $p(\mathbf{V})$. This identification problem was given a general solution in terms of a recursive algorithm called the ID algorithm (Tian and Pearl, 2002), which was shown to be complete in (Shpitser and Pearl, 2006b; Huang and Valtorta, 2006). Here we give a simple one line reformulation of the ID algorithm as a truncated nested factorization.
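Let $\mathcal{G}(\mathbf{V})$ be the latent projection of $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$ onto observable vertices $\mathbf{V}$, and let $\mathbf{Y}^* \equiv \text{an}_{\mathcal{G}(\mathbf{V})_{\mathbf{V} \setminus \mathbf{A}}}(\mathbf{Y})$ be the set of ancestors of $\mathbf{Y}$ via only directed paths that exclude elements in $\mathbf{A}$ (in particular, $\mathbf{Y}^*$ does not contain any element of $\mathbf{A}$). Let $\mathcal{G}_{\mathbf{Y}^*}$ be the induced subgraph of $\mathcal{G}(\mathbf{V})$ containing only elements in $\mathbf{Y}^*$ and edges in $\mathcal{G}(\mathbf{V})$ among those elements. Let $\mathcal{D}(\mathcal{G}_{\mathbf{Y}^*})$ be the set of districts in $\mathcal{G}_{\mathbf{Y}^*}$. If every $\mathbf{D} \in \mathcal{D}(\mathcal{G}_{\mathbf{Y}^*})$ is an intrinsic set, then
$$p(\mathbf{Y}(\mathbf{a})) = \sum_{\mathbf{Y}^* \setminus \mathbf{Y}} \prod_{\mathbf{D} \in \mathcal{D}(\mathcal{G}_{\mathbf{Y}^*})} \phi_{\mathbf{V} \setminus \mathbf{D}}(p(\mathbf{V}); \mathcal{G}(\mathbf{V})) \Big|_{\mathbf{A} = \mathbf{a}}. \tag{12}$$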
If some $\mathbf{D} \in \mathcal{D}(\mathcal{G}_{\mathbf{Y}^*})$ is not intrinsic, then $p(\mathbf{Y}(\mathbf{a}))$ is not identifiable in the causal model, meaning no other algorithm can obtain the identifying expression for $p(\mathbf{Y}(\mathbf{a}))$ without additional assumptions. In words, this says that $p(\mathbf{Y}(\mathbf{a}))$ is identifiable if and only if the induced graph $\mathcal{G}_{\mathbf{Y}^*}$ for a certain set $\mathbf{Y}^*$ factorizes into districts that all correspond to intrinsic sets in the original ADMG $\mathcal{G}(\mathbf{V})$. The set $\mathbf{Y}^*$ may not be reachable itself, but may still yield this sort of factorization. The factorization implies the interventional distribution of interest may be obtained from the product of the corresponding intrinsic kernels (which are all themselves identifiable), and possibly a marginalization operation that sums out elements in $\mathbf{Y}^* \setminus \mathbf{Y}$.
The proof of soundness of the original formulation of the ID algorithm appears in (Tian and Pearl, 2002), and of completeness in (Shpitser and Pearl, 2006b; Huang and Valtorta, 2006). The proof of soundness of the simplified version shown in (12) appears in (Richardson et al., 2017). Since preconditions for the application of (12), and all expressions within (12) itself, were phrased in terms of the latent projection $\mathcal{G}(\mathbf{V})$, and not the original hidden variable DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$, all hidden variable DAGs that share the latent projection agree on which distributions are identifiable and which are not, and further agree on all identifying functionals. An implementation of the ID algorithm, and a number of related algorithms, exists in the causaleffect package in the R programming language.
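The graphical precondition of (12) can be checked mechanically. The sketch below is not the causaleffect implementation and makes several assumptions: the ADMG is given as sets of directed and bidirected edges over vertex names, function names are hypothetical, and fixability of a set is tested greedily (fix any currently fixable vertex), which relies on the fact that fixing one vertex only removes edges and so cannot make another vertex unfixable.

```python
def descendants(v, directed):
    """v together with every vertex reachable from v along directed edges."""
    found, stack = {v}, [v]
    while stack:
        x = stack.pop()
        for a, b in directed:
            if a == x and b not in found:
                found.add(b)
                stack.append(b)
    return found


def district(v, bidirected):
    """v together with every vertex joined to v by a bidirected path."""
    found, stack = {v}, [v]
    while stack:
        x = stack.pop()
        for a, b in bidirected:
            for s, t in ((a, b), (b, a)):
                if s == x and t not in found:
                    found.add(t)
                    stack.append(t)
    return found


def is_intrinsic(d, vertices, directed, bidirected):
    """Greedily fix every vertex outside d; d is intrinsic if this succeeds
    and d forms a single district in the resulting CADMG."""
    directed, bidirected = set(directed), set(bidirected)
    to_fix = set(vertices) - set(d)
    while to_fix:
        fixable = [v for v in sorted(to_fix)
                   if descendants(v, directed) & district(v, bidirected) == {v}]
        if not fixable:
            return False
        v = fixable[0]
        directed = {(a, b) for a, b in directed if b != v}
        bidirected = {(a, b) for a, b in bidirected if v not in (a, b)}
        to_fix.remove(v)
    return district(next(iter(d)), bidirected) == set(d)


def id_criterion(vertices, directed, bidirected, treatments, outcomes):
    """Return (identified, districts of the subgraph on Y*) for p(Y(a))."""
    kept = set(vertices) - set(treatments)
    sub_directed = {(a, b) for a, b in directed if a in kept and b in kept}
    # Y*: ancestors of the outcomes in the subgraph that excludes treatments.
    y_star, stack = set(outcomes), list(outcomes)
    while stack:
        x = stack.pop()
        for a, b in sub_directed:
            if b == x and a not in y_star:
                y_star.add(a)
                stack.append(a)
    star_bidirected = {(a, b) for a, b in bidirected
                       if a in y_star and b in y_star}
    dists = {frozenset(district(v, star_bidirected)) for v in y_star}
    ok = all(is_intrinsic(d, vertices, directed, bidirected) for d in dists)
    return ok, dists
```

For a graph with A → Y and a bidirected edge between A and Y, `id_criterion({"A", "Y"}, {("A", "Y")}, {("A", "Y")}, {"A"}, {"Y"})` reports non-identification, whereas for A → M → Y with a bidirected edge between A and Y the criterion holds with districts {M} and {Y}.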
We now illustrate how (12) is applied with two examples. The first example is the causal model corresponding to the DAG in Fig. 2 (a). This DAG contains an unobserved common cause of A and Y, which will prevent identification of p(Y | do(a)), as we now show. The latent projection of this graph is the ADMG shown in Fig. 2 (b), with $\mathcal{G}_{\mathbf{Y}^*}$ shown in Fig. 2 (c). The district of $\mathcal{G}_{\mathbf{Y}^*}$ containing Y is {Y}, and the set {W, A} of all remaining vertices is not fixable, since Y is a descendant of A and lies in the district of A. Thus p(Y | do(a)) = p(Y(a)) is not identified in the causal model represented by Fig. 2 (a).
Next, consider the causal model corresponding to the DAG in Fig. 2 (d). Like in Fig. 2 (a), there is a common cause of A and Y. However, in Fig. 2 (d) there is, in addition, a mediator variable M which lies on a causal pathway from A to Y, indicated by the presence of a directed path A → M → Y. In fact, this is the only directed path from A to Y, meaning that M captures or mediates all of the causal influence of A on Y. Finally, M is not a child of H, meaning it is determined entirely by W and A and remains unconfounded by H, unlike A and Y. The presence of this type of mediator allows p(Y | do(a)) to be identified, as we now show.
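The latent projection of Fig. 2 (d) is the ADMG shown in Fig. 2 (e). Here $\mathbf{Y}^* = \{W, M, Y\}$, with $\mathcal{G}_{\mathbf{Y}^*}$ shown in Fig. 2 (f). It is easy to verify that $\mathcal{D}(\mathcal{G}_{\mathbf{Y}^*}) = \{\{M\}, \{W, Y\}\}$. In this case, the sets {A, M} and {Y, W, A} are fixable. We first consider {A, M}. M is fixable in Fig. 2 (e), which yields the CADMG in Fig. 2 (g), with the corresponding kernel
$$q_{\{W, A, Y\}}(W, A, Y \mid M) = \frac{p(W, A, M, Y)}{p(M \mid A, W)} = p(Y \mid M, A, W)\, p(A, W).$$
In this CADMG, A becomes fixable (it was not fixable in Fig. 2 (e)), yielding the CADMG in Fig. 2 (h), with the corresponding kernel
$$q_{\{W, Y\}}(W, Y \mid M, A) = \sum_{A} p(Y \mid M, A, W)\, p(A, W).$$
Similarly, the set {Y, W, A} is also fixable. First, Y is fixable in Fig. 2 (e), yielding the CADMG in Fig. 2 (i), with the corresponding kernel
$$q_{\{W, A, M\}}(W, A, M \mid Y) = \frac{p(W, A, M, Y)}{p(Y \mid W, A, M)} = p(W, A, M).$$
Next, A becomes fixable in Fig. 2 (i), yielding the CADMG in Fig. 2 (j), with the corresponding kernel
$$q_{\{W, M\}}(W, M \mid A, Y) = \frac{p(W, A, M)}{p(A \mid W)} = p(M \mid A, W)\, p(W).$$
Finally, W is fixable in Fig. 2 (j), yielding a CADMG obtained from Fig. 2 (j) by drawing W as a square, with the corresponding kernel
$$q_{\{M\}}(M \mid W, A, Y) = \frac{p(M \mid A, W)\, p(W)}{p(W)} = p(M \mid A, W).$$
Combining these two kernels into (12), where $\mathbf{Y}^* \setminus \mathbf{Y} = \{W, M\}$, and evaluating $q_{\{M\}}(M \mid W, A, Y)$ at A = a yields
$$p(Y(a)) = \sum_{W, M} p(M \mid a, W) \sum_{A} p(Y \mid M, A, W)\, p(A, W).$$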
This is known as the front-door formula (Pearl, 2009).
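The front-door formula can be checked numerically against a known hidden variable model. The sketch below is an illustration under assumptions: it uses a simplified version of the setting without the baseline variable W (hidden H → A, H → Y, A → M, M → Y), with made-up binary conditional probability tables and exact rational arithmetic. The "truth" is computed using the hidden variable, while the front-door expression only uses the observed marginal over A, M, Y.

```python
from fractions import Fraction
from itertools import product

# Hypothetical binary model: H -> A, H -> Y, A -> M, M -> Y, with H hidden.
p_h1 = Fraction(3, 5)                                      # p(H = 1)
p_a1 = {0: Fraction(1, 4), 1: Fraction(3, 4)}              # p(A = 1 | H)
p_m1 = {0: Fraction(1, 5), 1: Fraction(7, 10)}             # p(M = 1 | A)
p_y1 = {(m, h): Fraction(1 + 2 * m + 3 * h, 8)
        for m, h in product((0, 1), repeat=2)}             # p(Y = 1 | M, H)


def bern(p_one, value):
    """Probability that a binary variable with p(1) = p_one takes `value`."""
    return p_one if value == 1 else 1 - p_one


def observed(**event):
    """Probability of an event on A, M, Y in the observed marginal (H summed out)."""
    total = Fraction(0)
    for h, a, m, y in product((0, 1), repeat=4):
        point = {"a": a, "m": m, "y": y}
        if all(point[name] == value for name, value in event.items()):
            total += (bern(p_h1, h) * bern(p_a1[h], a)
                      * bern(p_m1[a], m) * bern(p_y1[(m, h)], y))
    return total


def truth(y, a):
    """p(Y(a) = y) computed from the full model, using the hidden variable."""
    return sum(bern(p_h1, h) * bern(p_m1[a], m) * bern(p_y1[(m, h)], y)
               for h, m in product((0, 1), repeat=2))


def front_door(y, a):
    """sum_M p(M | a) sum_A' p(y | M, A') p(A'), from the observed marginal only."""
    return sum(observed(m=m, a=a) / observed(a=a)
               * sum(observed(y=y, m=m, a=a2) / observed(m=m, a=a2) * observed(a=a2)
                     for a2 in (0, 1))
               for m in (0, 1))
```

In this assumed model `front_door(1, 1)` agrees exactly with `truth(1, 1)`, while the ordinary conditional `observed(y=1, a=1) / observed(a=1)` does not, because of the hidden common cause of A and Y.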
Expressions obtained from (12) may become quite complicated in arbitrary ADMGs. However, the ID algorithm as expressed in (12) may still be viewed as a kind of “nested g-formula” appropriate to the hidden variable DAG setting, in the following sense. Just as was the case in (8), no term in (12) is a density over any element in $\mathbf{A}$. This is because no variable in $\mathbf{A}$ is an element of any $\mathbf{D} \in \mathcal{D}(\mathcal{G}_{\mathbf{Y}^*})$, which means all elements in $\mathbf{A}$ are fixed in every term in (12). In fact, removing terms of the form $p(A \mid \text{pa}_{\mathcal{G}}(A))$ from the DAG factorization corresponds to fixing $A$ in DAGs without hidden variables.
5.8. Path-Specific Effects
Path-specific effects in DAGs with all variables observed were not always identified due to the presence of recanting witnesses. In hidden variable DAGs, an additional type of graphical structure may also prevent identification.
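Given a latent projection ADMG $\mathcal{G}(\mathbf{V})$, fix disjoint sets $\mathbf{A}$, $\mathbf{Y}$, and a set of proper causal paths π for $\mathbf{A}$ and $\mathbf{Y}$, where each $A \in \mathbf{A}$ is the origin of at least one path in π. Let $\mathbf{Y}^* \equiv \text{an}_{\mathcal{G}(\mathbf{V})_{\mathbf{V} \setminus \mathbf{A}}}(\mathbf{Y})$. Then $\mathbf{D} \in \mathcal{D}(\mathcal{G}_{\mathbf{Y}^*})$ is said to be a recanting district for π if there exists a path in π of the form 〈(AD1)→, …, (W1Y1)→〉, and another proper causal path not in π of the form 〈(AD2)→, …, (W2Y2)→〉, where $A \in \mathbf{A}$, $D_1, D_2 \in \mathbf{D}$, and $Y_1, Y_2 \in \mathbf{Y}$. Identification of potential outcomes involved in a path-specific effect is characterized in DAGs with hidden variables by the absence of a recanting district, and identifiability of the effect of $\mathbf{A}$ on $\mathbf{Y}$ along all paths. Specifically, given disjoint vertex sets $\mathbf{A}$, $\mathbf{Y}$ in a DAG $\mathcal{G}(\mathbf{V} \cup \mathbf{H})$, and a set π of proper causal paths for $\mathbf{A}$ and $\mathbf{Y}$, the distribution $p(\mathbf{Y}(\pi, \mathbf{a}, \mathbf{a}'))$ is identified if and only if there does not exist a recanting district for π, and $p(\mathbf{Y}(\mathbf{a}))$ is identified. If $p(\mathbf{Y}(\pi, \mathbf{a}, \mathbf{a}'))$ is identified, its identifying expression is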
| (13) |
where is defined to be if all elements in are connected to elements in via edges in π, defined to be if all elements in are connected to elements in via edges not in π, and defined to be the empty set if . The absence of a recanting district guarantees these three possibilities are exhaustive. Just as (12) was the appropriate nested generalization of the g-formula (8) to hidden variable DAGs, so is (13) the appropriate nested generalization of the edge g-formula (11) to hidden variable DAGs. The original version of this algorithm was described in (Shpitser, 2013), with the above reformulation based on the fixing operator found in (Shpitser and Sherman, 2018).
Consider Fig. 1 (c), where we are interested in the path-specific effect of A on Y via the path 〈(AZ)→, (ZY)→〉. The latent projection of this graph is shown in Fig. 1 (d). Here , and . Note that there is no recanting district – the district containing the first post-exposure variable on the only path of interest is {Z}, and no path other than 〈(AZ)→, (ZY)→〉 has the first post-exposure variable in this district. Furthermore, p(Y | do(a)) is identifiable. Thus, the counterfactual corresponding to the path-specific effect is identified:
where is the graph shown in Fig. 1 (d).
On the other hand, if we were interested in the path-specific effect of A on Y along paths π = {〈(AZ)→, (ZY)→〉; 〈(AY)→〉}, this path-specific effect is not identified. This is because the path 〈(AM)→, (MY)→〉 is not in π but has (AM)→ as the first edge, while 〈(AY)→〉 is a path in π. M and Y share a district in , where . This implies {M, Y} is a recanting district, and will prevent identification of p(Y(π, a, a′)).
5.9. Responses To Dynamic Treatment Regimes
An adaptation of the ID algorithm for identification of distributions over responses to dynamic treatment regimes in causal models represented by a hidden variable DAG was given in (Tian, 2008).
As before, we rephrase this algorithm in terms of the fixing operation, CADMGs and kernels. Given a latent projection of , define the graph to be an ADMG obtained from by removing all edges pointing into and adding a directed edge W → A for any . Define . Then is identified if is identified. Moreover, the identification formula is
where for every , is defined to be if is not empty, and is defined to be the empty set otherwise. The sum over is vacuous if is a set of deterministic policies, since in this case there is no variation in values of in any . This algorithm was shown to be complete in (Shpitser and Sherman, 2018).
As an example, in Fig. 1 (c), if , is shown in Fig. 1 (d), and is the same graph as . Since p(Y, M, W | do(a, z)) is identified as
is also identified as
6. Conditional Counterfactual Distributions
A common version of the identification problem is identification of conditional interventional distributions, written as $p(\mathbf{Y}(\mathbf{a}) \mid \mathbf{Z}(\mathbf{a}))$ or $p(\mathbf{Y} \mid \mathbf{Z}, \text{do}(\mathbf{a}))$. These types of distributions are of interest in causal inference applications where effect modification, that is variation of causal effects within particular subpopulations, is of interest.
Characterizing identification of these distributions is possible using (12), (13) and a subset of obtained using a generalized version of a conditional ignorability argument, where the necessary conditional independence is read off from a particular version of a causal graph. Prior work phrased the independence needed in terms of rule 2 of do-calculus (Pearl, 2009). Here we describe a generalization of do-calculus called potential outcome calculus that describes identities governing distributions on potential outcomes defined via (4), based on graphs describing the Markov properties of distributions over these potential outcomes. These graphs are called single world intervention graphs (SWIGs) (Richardson and Robins, 2013). The advantage of the formulation we describe here is it generalizes in a straightforward way to counterfactuals not readily expressed by means of Pearl’s do(.) operator, such as counterfactuals that describe path-specific effects (Malinsky et al., 2019).
6.1. Single World Intervention Graphs
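Given a causal DAG $\mathcal{G}(\mathbf{V})$ and a set of variables $\mathbf{A} \subseteq \mathbf{V}$ we wish to intervene on, we construct the SWIG $\mathcal{G}(\mathbf{a})$ as follows from $\mathcal{G}(\mathbf{V})$. Each vertex $A \in \mathbf{A}$ is split into a random vertex A and a fixed vertex a (note the lower case). All edges with arrowheads into A in $\mathcal{G}(\mathbf{V})$ are inherited by the random vertex, and all directed edges out of A in $\mathcal{G}(\mathbf{V})$ are inherited by the fixed vertex. All other edges in $\mathcal{G}(\mathbf{V})$ remain in $\mathcal{G}(\mathbf{a})$. Finally, a vertex in $\mathcal{G}(\mathbf{a})$ corresponding to V in $\mathcal{G}(\mathbf{V})$ is labeled as a counterfactual $V(\mathbf{a}_V)$, with $\mathbf{a}_V$ a subset of $\mathbf{a}$ consisting of all fixed vertices with a directed path in $\mathcal{G}(\mathbf{a})$ to the vertex corresponding to V. As an example, given a hidden variable DAG in Fig. 3 (a), the SWIG $\mathcal{G}(a)$ is shown in Fig. 3 (c).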
Figure 3.
(a) A hidden variable graph corresponding to the observed data distribution. (b) The latent projection of the graph in (a), assuming H is a hidden variable. (c) A SWIG corresponding to the world where A is set to value a, derived from the DAG in (a). (d) A latent projection version of the SWIG in (c).
The resulting vertices correspond to the set of counterfactuals defined by (4). The following result was proved in (Richardson and Robins, 2013):
In other words, $p(\mathbf{V}(\mathbf{a}))$ Markov factorizes with respect to the SWIG $\mathcal{G}(\mathbf{a})$. An important consequence of this result is that conditional independences in $p(\mathbf{V}(\mathbf{a}))$ may be directly read off from $\mathcal{G}(\mathbf{a})$ using d-separation. As an example, p(A, H, M(a), Y(a)) Markov factorizes with respect to the SWIG shown in Fig. 3 (c):
We can therefore use d-separation in the SWIG to obtain independence statements that hold in the model. For example, the conditional ignorability constraint (A ⊥⊥ Y(a) | H) holds by d-separation in Fig. 3 (c).
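Such a d-separation query can be checked mechanically. The sketch below uses the standard moralized ancestral graph criterion, written in plain Python so that it does not depend on any particular graph library. The SWIG edge list is an assumption (H → A, H → Y(a), a → M(a), M(a) → Y(a)), chosen to be consistent with the independence stated in the text rather than copied from Fig. 3 (c), and the fixed vertex a is handled by always including it in the conditioning set.

```python
from itertools import combinations


def d_separated(edges, xs, ys, zs):
    """Return True if xs and ys are d-separated given zs in the DAG `edges`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
        parents.setdefault(u, set())

    # Step 1: restrict to the ancestors of xs, ys and zs (including themselves).
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # Step 2: moralize, i.e. link each vertex to its parents, link parents
    # that share a child, and forget edge directions.
    nbrs = {v: set() for v in keep}
    for v in keep:
        ps = parents.get(v, set())
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)

    # Step 3: delete zs and test whether any vertex of ys is still reachable.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in seen or node in zs:
            continue
        seen.add(node)
        stack.extend(nbrs[node] - zs)
    return not (seen & ys)


# Assumed SWIG edges; "a" is the fixed vertex and is always conditioned on.
swig = [("H", "A"), ("H", "Y(a)"), ("a", "M(a)"), ("M(a)", "Y(a)")]
given_h = d_separated(swig, {"A"}, {"Y(a)"}, {"H", "a"})
marginal = d_separated(swig, {"A"}, {"Y(a)"}, {"a"})
print(given_h, marginal)
```

In this assumed graph the query conditional on H succeeds, while the marginal query fails because the path through H is open.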
SWIGs may be constructed from latent projection ADMGs of a hidden variable DAG by an identical “splitting” procedure, and conditional independences of a marginal counterfactual distribution may be directly read off from the resulting SWIG using m-separation, by a simple corollary of the above result. As an example, the SWIG in Fig. 3 (d) is obtained from the latent projection in Fig. 3 (b), which is itself obtained from Fig. 3 (a). In this SWIG, A ⊥⊥ M(a) can be obtained by m-separation.
6.2. The Potential Outcomes Calculus
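The potential outcomes calculus consists of the following three identities, with preconditions given by d-separation (or m-separation) on SWIGs. Fix disjoint subsets $\mathbf{Y}$, $\mathbf{Z}$, $\mathbf{A}$, $\mathbf{W}$ of $\mathbf{V}$. Then
Rule 1: $p(\mathbf{Y}(\mathbf{a}) \mid \mathbf{Z}(\mathbf{a}), \mathbf{W}(\mathbf{a})) = p(\mathbf{Y}(\mathbf{a}) \mid \mathbf{W}(\mathbf{a}))$ if $(\mathbf{Y}(\mathbf{a}) \perp\!\!\!\perp \mathbf{Z}(\mathbf{a}) \mid \mathbf{W}(\mathbf{a}))$ in $\mathcal{G}(\mathbf{a})$.
Rule 2: $p(\mathbf{Y}(\mathbf{a}, \mathbf{z}) \mid \mathbf{W}(\mathbf{a}, \mathbf{z})) = p(\mathbf{Y}(\mathbf{a}) \mid \mathbf{W}(\mathbf{a}), \mathbf{Z}(\mathbf{a}) = \mathbf{z})$ if $(\mathbf{Y}(\mathbf{a}, \mathbf{z}) \perp\!\!\!\perp \mathbf{Z}(\mathbf{a}, \mathbf{z}) \mid \mathbf{W}(\mathbf{a}, \mathbf{z}))$ in $\mathcal{G}(\mathbf{a}, \mathbf{z})$.
Rule 3: $p(\mathbf{Y}(\mathbf{a}, \mathbf{z})) = p(\mathbf{Y}(\mathbf{a}))$ if $(\mathbf{Y}(\mathbf{a}, \mathbf{z}) \perp\!\!\!\perp \mathbf{z})$ in $\mathcal{G}(\mathbf{a}, \mathbf{z})$.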
Rule 1 encodes the fact that conditional independences in a counterfactual distribution may be read off by d-separation or m-separation in the appropriate SWIG $\mathcal{G}(\mathbf{a})$. Rule 2 is a kind of generalized conditional ignorability that governs when interventions on a set $\mathbf{Z}$ are equivalent to conditioning on $\mathbf{Z}$, for a particular set of outcome variables $\mathbf{Y}$, given that other variables were either intervened on ($\mathbf{A}$) or conditioned on ($\mathbf{W}$). The classical conditional ignorability assumption is often assumed directly, whereas here conditional ignorability type statements are read off by m-separation from the appropriate SWIG $\mathcal{G}(\mathbf{a}, \mathbf{z})$.
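As a worked instance, the derivation below sketches how rule 2 recovers the familiar adjustment formula. It takes rule 2 in the form given by Malinsky et al. (2019), namely $p(\mathbf{Y}(\mathbf{a}, \mathbf{z}) \mid \mathbf{W}(\mathbf{a}, \mathbf{z})) = p(\mathbf{Y}(\mathbf{a}) \mid \mathbf{W}(\mathbf{a}), \mathbf{Z}(\mathbf{a}) = \mathbf{z})$, and specializes it under several assumptions made only for illustration: the set of prior interventions is empty, the variable exchanged is a single treatment $A$ set to $a$, the conditioning set is a single covariate $H$ that is not a descendant of $A$ (so that $H(a) = H$), and $(Y(a) \perp\!\!\!\perp A \mid H)$ holds in $\mathcal{G}(a)$.

```latex
\begin{align*}
p(Y(a) \mid H)
  &= p(Y \mid H, A = a)
  && \text{rule 2, using } (Y(a) \perp\!\!\!\perp A \mid H) \text{ in } \mathcal{G}(a), \\
p(Y(a))
  &= \sum_{h} p(Y(a) \mid H = h)\, p(H = h)
  && \text{law of total probability,} \\
  &= \sum_{h} p(Y \mid H = h, A = a)\, p(H = h)
  && \text{substituting the first line.}
\end{align*}
```

The last expression is a function of the observed data distribution only if $H$ is observed. When $H$ is hidden, as in the Fig. 3 example, the first line still holds as an identity but does not by itself identify $p(Y(a))$.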
Rules 1 and 2 are straightforward analogues of Pearl’s do-calculus rules, rephrased in terms of SWIGs and potential outcomes. The rule 3 we state here is far simpler than Pearl’s rule 3: it states that a counterfactual $\mathbf{Y}(\mathbf{a}, \mathbf{z})$ does not depend on $\mathbf{z}$ if the corresponding set of fixed vertices in the SWIG is d-separated (or m-separated) from $\mathbf{Y}(\mathbf{a}, \mathbf{z})$, meaning that there is no directed path from the former to the latter. This rule simply encodes the fact that recursive substitution (4) may yield a counterfactual that does not depend on some of the values being set. This may occur if the corresponding variables are in the future relative to Y, or due to some exclusion restriction in the causal model, as may happen with instrumental variable models. Importantly, it was shown in (Malinsky et al., 2019) that simplifying Pearl’s rule 3 in this way is without loss of generality: every do-calculus derivation can still be carried out with the three rules given here, including the simpler rule 3.
Rule 1 may be termed the “interventional global Markov property,” rule 2 may be termed “generalized conditional ignorability,” and rule 3 may be termed “causal irrelevance.” Rule 2 is of particular relevance to the identification theory we present below.
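The causal irrelevance check behind rule 3 amounts to a reachability test from fixed vertices. The following sketch illustrates this on a hypothetical instrumental variable SWIG; the graph, the vertex names, and the helper functions `has_directed_path` and `irrelevant_fixed` are assumptions introduced for this example. A hidden variable U confounds treatment and outcome, and the exclusion restriction is encoded by the absence of any edge from the instrument to the outcome.

```python
def has_directed_path(edges, source, target):
    """Depth-first search for a directed path from source to target."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    seen, stack = set(), [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return False


def irrelevant_fixed(swig_edges, fixed, outcome):
    """Fixed vertices with no directed path to the outcome (rule 3 applies)."""
    return {f for f in fixed if not has_directed_path(swig_edges, f, outcome)}


# Hypothetical SWIG for interventions on an instrument Z and a treatment A.
# The fixed vertex z points only into A(z); the fixed vertex a points into Y(a).
iv_swig = [("U", "A(z)"), ("U", "Y(a)"), ("z", "A(z)"), ("a", "Y(a)")]
dropped = irrelevant_fixed(iv_swig, {"z", "a"}, "Y(a)")
print(dropped)
```

In this assumed graph only z is reported, which corresponds to the statement that Y(z, a) = Y(a): the outcome responds to the treatment value but not to the instrument value.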
6.3. Conditional Causal Effects
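Given rule 2 of the potential outcomes calculus, identification of distributions of the form $p(\mathbf{Y}(\mathbf{a}) \mid \mathbf{W}(\mathbf{a}))$ is quite simple to characterize. Let $\mathbf{Z}$ be any maximal subset of $\mathbf{W}$ such that, for each of its elements, rule 2 allows conditioning on that element to be exchanged for an intervention on it. Such a set is the unique largest set to which rule 2 applies.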
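Given this set, the distribution $p(\mathbf{Y}(\mathbf{a}) \mid \mathbf{W}(\mathbf{a}))$ is identifiable if and only if $p(\mathbf{Y}(\mathbf{a}, \mathbf{z}), \mathbf{W}(\mathbf{a}, \mathbf{z}) \setminus \mathbf{Z}(\mathbf{a}, \mathbf{z}))$ is identifiable. Moreover, the identification formula is given by
$$p(\mathbf{Y}(\mathbf{a}) \mid \mathbf{W}(\mathbf{a})) = \frac{p(\mathbf{Y}(\mathbf{a}, \mathbf{z}), \mathbf{W}(\mathbf{a}, \mathbf{z}) \setminus \mathbf{Z}(\mathbf{a}, \mathbf{z}))}{p(\mathbf{W}(\mathbf{a}, \mathbf{z}) \setminus \mathbf{Z}(\mathbf{a}, \mathbf{z}))},$$
where the numerator is identified via (12), and the denominator is obtained from the numerator via marginalization (Shpitser and Pearl, 2006a).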
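The final step of this recipe, passing from an identified joint counterfactual distribution to the conditional one by marginalizing and dividing, can be illustrated numerically. The sketch below assumes a finite state space, uses made-up probabilities that stand in for an already identified joint distribution over an outcome and a conditioning variable, and uses exact fractions to avoid rounding. The function name `condition_on_w` is hypothetical.

```python
from fractions import Fraction


def condition_on_w(joint):
    """Turn a joint table {(y, w): p(y, w)} into a conditional table p(y | w)."""
    # Marginalize the outcome out of the joint to get p(w).
    marginal = {}
    for (y, w), prob in joint.items():
        marginal[w] = marginal.get(w, Fraction(0)) + prob
    # Divide each joint entry by the matching marginal, skipping zero-mass strata.
    return {
        (y, w): prob / marginal[w]
        for (y, w), prob in joint.items()
        if marginal[w] > 0
    }


# Made-up joint distribution standing in for an identified counterfactual joint.
joint = {
    (0, 0): Fraction(1, 10),
    (1, 0): Fraction(3, 10),
    (0, 1): Fraction(2, 5),
    (1, 1): Fraction(1, 5),
}
conditional = condition_on_w(joint)
print(conditional)
```

Each stratum of the conditioning variable yields a proper distribution over the outcome, so the entries for a fixed value of w sum to one.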
As an example, in the graph shown in Fig. 2 (e), if the conditional interventional distribution p(Y(a) | W(a)) is of interest, we conclude that $\mathbf{Z} = \emptyset$, since W and Y(a, w) are not m-separated in the SWIG $\mathcal{G}(a, w)$ derived from this graph. Thus, to identify p(Y(a) | W(a)) we first identify p(Y(a), W(a)) as
which yields
An extension of these results that gives a complete identification formula for conditional distributions associated with path-specific effects, using rule 2 of the potential outcomes calculus, is described in (Malinsky et al., 2019).
7. Conclusion
In this paper we introduced a number of functions of potential outcome random variables used in causal inference applications, including treatment effects, direct, indirect, and path-specific effects, and responses to dynamic treatment regimes. We described a simple characterization of identifiability of these targets in fully observed causal models of a directed acyclic graph, based on variations of the g-formula (Robins, 1986). Finally, we gave a simplified description of identification algorithms for these targets in causal models of a directed acyclic graph where some variables are unobserved, which first appeared in (Tian and Pearl, 2002; Tian, 2008; Shpitser and Pearl, 2006a; Shpitser, 2013). This description was based on a truncated nested Markov factorization, which is itself based on conditional mixed graphs, kernels, and the fixing operator described in (Richardson et al., 2017). The identification algorithms described are known to be complete for treatment effects, conditional treatment effects, path-specific effects, and responses to dynamic treatment regimes.
7.1. Acknowledgements
Definitions, some examples, and some identification results that appear in this review manuscript also appeared in (Shpitser, 2017).
Footnotes
To avoid measure-theoretic complications, we consider finite state spaces here.
References
- Avin C, Shpitser I, and Pearl J (2005). Identifiability of path-specific effects. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05), volume 19, pages 357–363. Morgan Kaufmann, San Francisco.
- Chakraborty B and Moodie EEM (2013). Statistical Methods for Dynamic Treatment Regimes (Reinforcement Learning, Causal Inference, and Personalized Medicine). Springer, New York.
- Drton M (2009). Discrete chain graph models. Bernoulli, 15(3):736–753.
- Haavelmo T (1943). The statistical implications of a system of simultaneous equations. Econometrica, 11:1–12.
- Huang Y and Valtorta M (2006). Pearl’s calculus of interventions is complete. In Twenty Second Conference On Uncertainty in Artificial Intelligence.
- Lauritzen SL (1996). Graphical Models. Oxford, U.K.: Clarendon.
- Malinsky D, Shpitser I, and Richardson TS (2019). A potential outcomes calculus for identifying conditional path-specific effects. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics.
- Miles C, Shpitser I, Kanki P, Melone S, and Tchetgen Tchetgen EJ (2017). Quantifying an adherence path-specific effect of antiretroviral therapy in the Nigeria PEPFAR program. Journal of the American Statistical Association.
- Neyman J (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. Excerpts reprinted (1990) in English. Statistical Science, 5:463–472.
- Pearl J (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo.
- Pearl J (2001). Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 411–420. Morgan Kaufmann, San Francisco.
- Pearl J (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, 2 edition.
- Pearl J (2011). The causal mediation formula – a guide to the assessment of pathways and mechanisms. Technical Report R-379, Cognitive Systems Laboratory, University of California, Los Angeles.
- Richardson T and Spirtes P (2002). Ancestral graph Markov models. Annals of Statistics, 30:962–1030.
- Richardson TS (2003). Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30(1):145–157.
- Richardson TS, Evans RJ, Robins JM, and Shpitser I (2017). Nested Markov properties for acyclic directed mixed graphs. Working paper.
- Richardson TS and Robins JM (2013). Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Preprint: http://www.csss.washington.edu/Papers/wp128.pdf.
- Robins JM (1986). A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512.
- Robins JM (1999a). Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology: The Environment and Clinical Trials. NY: Springer-Verlag.
- Robins JM (1999b). Testing and estimation of direct effects by reparameterizing directed acyclic graphs with structural nested models. In Glymour C and Cooper G, editors, Computation, Causation, and Discovery, pages 349–405. Menlo Park, CA, Cambridge, MA: AAAI Press/The MIT Press.
- Robins JM and Greenland S (1992). Identifiability and exchangeability of direct and indirect effects. Epidemiology, 3:143–155.
- Robins JM and Richardson TS (2010). Alternative graphical causal models and the identification of direct effects. Causality and Psychopathology: Finding the Determinants of Disorders and their Cures.
- Rubin DB (1976). Causal inference and missing data (with discussion). Biometrika, 63:581–592.
- Shpitser I (2013). Counterfactual graphical models for longitudinal mediation analysis with unobserved confounding. Cognitive Science (Rumelhart special issue), 37:1011–1035.
- Shpitser I (2015). Segregated graphs and marginals of chain graph models. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc.
- Shpitser I (2017). Identification in graphical causal models. In Handbook of Graphical Models.
- Shpitser I and Pearl J (2006a). Identification of conditional interventional distributions. In Proceedings of the Twenty Second Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 437–444. AUAI Press, Corvallis, Oregon.
- Shpitser I and Pearl J (2006b). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). AAAI Press, Palo Alto.
- Shpitser I and Sherman E (2018). Identification of personalized effects associated with causal pathways. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI-18).
- Shpitser I and Tchetgen Tchetgen EJ (2016). Causal inference with a graphical hierarchy of interventions. Annals of Statistics, 44(6):2433–2466.
- Spirtes P, Glymour C, and Scheines R (2001). Causation, Prediction, and Search. Springer Verlag, New York, 2 edition.
- Tian J (2008). Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 554–561, Corvallis, Oregon. AUAI Press.
- Tian J and Pearl J (2002). On the testable implications of causal models with hidden variables. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI-02), volume 18, pages 519–527. AUAI Press, Corvallis, Oregon.
- Verma TS and Pearl J (1990). Equivalence and synthesis of causal models. Technical Report R-150, Department of Computer Science, University of California, Los Angeles.
- Wright S (1921). Correlation and causation. Journal of Agricultural Research, 20:557–585.



