Abstract
Causal inference quantifies cause-effect relationships by means of counterfactual responses had some variable been artificially set to a constant. A more refined notion of manipulation, where a variable is artificially set to a fixed function of its natural value, is also of interest in particular domains. Examples include increases in financial aid, changes in drug dosing, and modifying the length of stay in a hospital.
We define counterfactual responses to manipulations of this type, which we call shift interventions. We show that in the presence of multiple variables being manipulated, two types of shift interventions are possible. Shift interventions on the treated (SITs) are defined with respect to natural values, and are connected to effects of treatment on the treated. Shift interventions as policies (SIPs) are defined recursively with respect to values of responses to prior shift interventions, and are connected to dynamic treatment regimes. We give sound and complete identification algorithms for both types of shift interventions, and derive efficient semi-parametric estimators for the mean response to a shift intervention in a special case motivated by a healthcare problem. Finally, we demonstrate the utility of our method by using an electronic health record dataset to estimate the effect of extending the length of stay in the intensive care unit (ICU) in a hospital by an extra day on patient ICU readmission probability.
1. INTRODUCTION
Establishing cause-effect relationships is a fundamental goal in data-driven empirical science and decision making. An influential approach to causal inference quantifies causal effects by means of responses to an intervention operation, which manipulates variables to attain specified values, possibly contrary to fact. This intervention operation is denoted by do(·) in (Pearl, 2009), and is used to define potential outcome random variables in wide use in statistics and public health (Neyman, 1923; Rubin, 1976).
Other kinds of intervention operations have been considered in the literature. Dynamic treatment regimes (DTRs), used in precision medicine and related applications (Chakraborty and Moodie, 2013), manipulate variables to values that depend on causally prior variables. Edge and path interventions (Shpitser and Tchetgen Tchetgen, 2016) manipulate variables to distinct values with respect to different causal pathways the variables are involved in. These interventions have been used to quantify direct, indirect, and path-specific effects in mediation analysis. Soft interventions (Eberhardt, 2014) “nudge” variables (or the data-generating process for variables) away from their natural state, rather than manipulating them to attain specific constant values. A recent type of intervention of this sort that manipulates the propensity score was considered in (Kennedy, 2019).
In this paper we consider a particular type of soft intervention where variables are manipulated to attain values given by fixed functions of their existing values. We call such interventions shift interventions. Shift interventions arise in settings where the counterfactual change of interest is most naturally expressed in terms of existing realizations of variables to be manipulated. Examples of such settings include changes in drug dosing, increases in financial aid, or policy deviations from an existing standard in medical, social, or economic domains. We show that in the presence of multiple variables being manipulated, two types of shift interventions are possible. Shift interventions on the treated (SITs) are defined with respect to their naturally observed values, and are connected to effects of treatment on the treated (ETTs) (Shpitser and Pearl, 2009). Shift interventions as policies (SIPs) are defined recursively with respect to values of responses to prior shift interventions, and are connected to dynamic treatment regimes. Despite these connections, responses to shift interventions are distinct types of counterfactuals, and we show their identification gives rise to subtleties not present in identification of either DTRs or ETTs.
We give sound and complete identification algorithms for both types of shift interventions, and derive an efficient semi-parametric estimator for the response to a shift intervention in a special case motivated by a healthcare problem. Finally, we demonstrate the utility of our method by using an electronic health record dataset to estimate the effect of extending the length of stay in the intensive care unit (ICU) in a hospital by an extra day on patient ICU readmission probability.
2. PRELIMINARIES
Causal inference aims to establish a link between observed random variables V ≡ {V1, … , Vk} and counterfactual random variables Vi(a), which denote the response of Vi had variables A ⊆ V been manipulated, possibly contrary to fact, to obtain values a. The distribution over Vi(a) is denoted by p(Vi|do(a)) in (Pearl, 2009). Counterfactuals quantify causal effects as contrasts defined by two manipulations, representing treatment and control arms of a hypothetical randomized controlled trial (RCT). For example, the average causal effect (ACE) is defined as E[Y(a)] − E[Y(a′)], where Y is an outcome variable, and A are one or more treatment variables manipulated to values a representing treatments of interest, or values a′ representing the baseline treatments in the control group.
An elegant formalism for defining causal models uses directed acyclic graphs (DAGs). A DAG is a directed graph with no directed cycles. In a causal model represented by a DAG 𝒢, each vertex in 𝒢 corresponds to a variable (we will use the same letter, e.g. Vi, for both). For each Vi, the set of variables with directed arrows into Vi in 𝒢, denoted as parents of Vi, or pa(Vi), is the set of direct causes of Vi, in the following sense. We assume the existence of atomic counterfactual random variables of the form Vi(a), for each value assignment a to pa(Vi). We use these random variables to define other counterfactuals by means of the recursive substitution definition. For any Vi ∈ V, A ⊆ V, we have
Vi(a) ≡ Vi({aj : Aj ∈ pa(Vi) ∩ A}, {Vj(a) : Vj ∈ pa(Vi) \ A}).    (1)
As an example, consider the DAG in Fig. 1 (a), representing an observational study with a single treatment A (representing a drug dose), an outcome Y, and a vector of baseline covariates L. In this model, atomic counterfactuals are of the form L, A(l), Y(a, l), for any values a, l in the state spaces of A and L. Further, we define Y(a) using (1) as Y(a, L). Note that (1) allows definitions of the form A(a, L) ≡ A(L) ≡ A, since A ∉ pa(A), and A ∉ pa(L).
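To make recursive substitution concrete, here is a minimal simulation sketch for the model in Fig. 1 (a); the linear structural equations and coefficients are illustrative assumptions, not taken from this paper.

```python
import random

# Illustrative structural equations (assumptions) for Fig. 1 (a):
# L -> A -> Y, with L -> Y.

def sample_unit(rng):
    """Draw one unit's exogenous noise terms."""
    return {"eL": rng.gauss(0, 1), "eA": rng.gauss(0, 1), "eY": rng.gauss(0, 1)}

def L(u):
    """Baseline covariate."""
    return u["eL"]

def A(u, l):
    """Atomic counterfactual A(l): natural treatment given L = l."""
    return 0.5 * l + u["eA"]

def Y(u, a, l):
    """Atomic counterfactual Y(a, l)."""
    return 2.0 * a + 1.0 * l + u["eY"]

def Y_do(u, a):
    """Y(a) by recursive substitution: Y(a) = Y(a, L)."""
    return Y(u, a, L(u))

def Y_natural(u):
    """Observed outcome Y = Y(A(L), L)."""
    l = L(u)
    return Y(u, A(u, l), l)

u = sample_unit(random.Random(0))
# Consistency: setting A to its naturally attained value recovers observed Y.
assert abs(Y_do(u, A(u, L(u))) - Y_natural(u)) < 1e-12
```

The final assertion is the usual consistency property: intervening with the naturally attained treatment value reproduces the factual outcome.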
Figure 1:

(a) A causal graph representing a single treatment cross-sectional study. (b) A causal graph representing a two stage observational study, with a first line treatment A, and a second line treatment B. (c) A causal graph representing a two stage observational study, where the second line treatment B is not assigned using intermediate outcomes W. (d) The latent projection representing the front-door causal model. (e) The latent projection representing the bow arc causal model. (f) A class of latent projections representing failure of identification of SITs in Theorem 4.
Causal models are defined by restrictions on counterfactual random variables. We will work with a popular model called the structural causal model (Pearl, 2009), which asserts the following marginal independences: the sets of atomic counterfactuals {Vi(pai) : pai}, for i = 1, …, k, are mutually independent.
In our example, these assert that {L}, {A(l) : l}, and {Y(a, l) : a, l} are mutually independent.
Causal models such as the structural causal model allow counterfactual quantities such as p(Y(a)) to be expressed in terms of the observed data distribution. If all variables in the causal model are observed, every p(V(a)) is identified by the following functional:
p(V(a)) = ∏Vi ∈ V p(Vi | {aj : Aj ∈ pa(Vi) ∩ A}, pa(Vi) \ A),    (2)
known as the extended g-formula. If A is empty, we have
p(V) = ∏Vi ∈ V p(Vi | pa(Vi)),    (3)
which is the well-known Bayesian network factorization of the observed data distribution p(V) (Pearl, 1988). In other words, assuming a causal model on a DAG 𝒢 implies the observed data distribution p(V) factorizes according to 𝒢 as in (3), and all interventional distributions p(V(a)) are identified by modified versions, as in (2), of this factorization.
In our example, p(V(a)) = p(Y(a), A, L) is identified as p(Y|a, L)p(A|L)p(L). If we are only interested in p(Y(a)), we simply marginalize the modified factorization appropriately to yield the adjustment formula ∑L p(Y|a, L)p(L).
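As an illustration, the adjustment formula ∑L p(Y|a, L)p(L) can be evaluated by a simple plug-in over empirical frequencies. The sketch below is an assumption of this presentation (all variables binary), not a procedure from the paper.

```python
# Plug-in sketch of the adjustment formula sum_L p(Y = 1 | a, L) p(L)
# for binary L, A, Y, using empirical frequencies (illustrative only).

def adjustment(data, a):
    """data: list of (l_i, a_i, y_i) tuples with binary entries.
    Returns the plug-in estimate of p(Y = 1 | do(a))."""
    n = len(data)
    est = 0.0
    for l in (0, 1):
        p_l = sum(1 for (li, _, _) in data if li == l) / n
        cell = [yi for (li, ai, yi) in data if li == l and ai == a]
        if cell:  # positivity p(a | l) > 0 is required here
            est += (sum(cell) / len(cell)) * p_l
    return est
```

The empirical conditionals stand in for p(Y|a, L) and p(L); any consistent density estimator could be substituted.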
Generalized Interventions and Targets of Inference
Before describing shift interventions, we consider two related counterfactual quantities considered in the causal inference literature. Aside from the example in Fig. 1 (a), we will also consider the causal model in Fig. 1 (b) representing an observational study with two treatments given in stages. A is given based on a set of baseline characteristics L, and would represent the primary treatment in healthcare contexts. W represents an intermediate outcome, while B, given based on values of L, A, W, would represent salvage therapy or second line treatment in cases of poor response to A. Y represents the final outcome of interest.
For a single treatment A, the effect of treatment on the treated (ETT) is defined as E[Y(a) − Y(a′) | A = a]. Such an effect can be viewed as a version of the ACE among the set of people naturally exposed to a particular level of the treatment. For example, ETT compares the effect of smoking one pack of cigarettes a day to smoking nothing in the set of people who happen to smoke one pack of cigarettes a day. In the causal model represented by Fig. 1 (a), the ETT is identified as E[Y | A = a] − ∑L E[Y | a′, L] p(L | a).
For multiple treatments, the effect of treatment on the multiply treated is defined similarly. In Fig. 1 (b), this effect is defined as E[Y(a, b) − Y(a′, b′) | A = a, B = b]. While it can be shown that in Fig. 1 (b) the ACE is identified via the g-formula, with E[Y(a, b)] = ∑L,W E[Y | a, b, W, L] p(W | a, L) p(L), the ETT is not identified. This is due to the fact that the second term of the ETT is a function of variables Y(a′, b′) and B, where the former is defined via (1) as Y(a′, b′, W(a′, L), L), while the latter is defined via (1) as B(W(A, L), A(L), L). In other words, the ETT is a function of a joint distribution containing p(W(a′), W) as a marginal, which is not identified under the structural causal model. This issue is described in detail in (Shpitser and Tchetgen Tchetgen, 2016).
The ACE and the ETT, where variables are manipulated to constants in order to mimic RCTs, are contrasts of substantive interest in applied settings such as econometrics and public health. In settings such as precision medicine, variables are manipulated based on observed patient characteristics, with the aim of improving positive outcomes or minimizing harmful ones. The resulting counterfactuals are defined as follows. For every Ai ∈ A to be manipulated, define a set Li ⊆ V \ A to be some set of variables not causally determined by Ai (graphically, this means there is no directed path from Ai to any element in Li in 𝒢). Given a set of functions f ≡ {fi : Ai ∈ A}, where each fi maps values of Li to values of Ai, we define the response Y(f) to setting values of each Ai ∈ A according to its corresponding function fi as

Y(f) ≡ Y({fi(Li(f)) : Ai ∈ pa(Y) ∩ A}, {Vj(f) : Vj ∈ pa(Y) \ A}),
by analogy with (1). As an example, in the model shown in Fig. 1 (a), given a function fA mapping values of L to values of A, Y(fA) ≡ Y(A = fA(L), L). Similarly, in the model shown in Fig. 1 (b), given functions fA of L, and fB of L and W, Y(fA, fB) ≡ Y(L, A = fA(L), W(L, A = fA(L)), B = fB(L, W(L, A = fA(L)))). Here the response of Y is defined according to the value of B set by fB using values of W recursively determined by counterfactually setting A according to fA. Functions in the set f above are also known as dynamic treatment regimes (DTRs).
As before, if all variables in a causal model are observed, p(V(f)) is identified for any set A ⊆ V, and set of functions fA ≡ {fi : Ai ∈ A}, by the following variation of (2):

p(V(f)) = ∏Vi ∈ V p(Vi | pa(Vi)), with each Ai ∈ A appearing in a conditioning set evaluated at fi(Li).

Responses of specific variables in V to A being set according to f are obtained from the above formula by marginalization, as before. As an example, p(Y(fA)) = ∑L p(Y|A = fA(L), L)p(L) in Fig. 1 (a).
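A matching plug-in sketch (binary variables, illustrative only) for the marginalized formula ∑L p(Y|A = fA(L), L)p(L), where fA assigns treatment as a function of L:

```python
# Plug-in sketch for the DTR response sum_L p(Y = 1 | A = f(L), L) p(L),
# with binary L, A, Y and empirical frequencies (illustrative only).

def policy_gformula(data, f):
    """data: list of (l_i, a_i, y_i) tuples; f maps values of L to values of A.
    Returns the plug-in estimate of p(Y(f) = 1)."""
    n = len(data)
    est = 0.0
    for l in (0, 1):
        p_l = sum(1 for (li, _, _) in data if li == l) / n
        a = f(l)  # treatment assigned by the policy at L = l
        cell = [yi for (li, ai, yi) in data if li == l and ai == a]
        if cell:  # positivity p(f(l) | l) > 0 is required here
            est += (sum(cell) / len(cell)) * p_l
    return est
```

With a constant policy f(l) = a, this reduces to the adjustment formula for do(a).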
Having described the ETT and responses to DTRs, we are now ready to describe shift interventions. Assume we are interested, in Fig. 1 (a), in the outcome Y had the drug dose A been changed from its given value a by a known function fA. We define such a counterfactual, by analogy with (1), as Y(A = fA(A), L). Note that unlike Y(a), each person in the data is assigned a potentially different dose, as would be the case for responses to DTRs. However, unlike DTR counterfactuals, the function only uses values of A as inputs.
Assuming A, B in Fig. 1 (b) also represent drug doses administered over time, we may be interested in how the outcome Y changes had drug doses been changed from their values by known functions fA, fB. Note that there are two ways to define such a counterfactual, which diverge in how the second treatment B is manipulated. One definition might consider the response of Y to the first treatment A being given by a fixed function fA of the observed treatment A, and the second treatment B being given by a fixed function fB of the observed treatment B. This response Y(A = fA(A), B = fB(B)) is defined as Y(A = fA(A), B = fB(B), W(A = fA(A), L), L). Another definition might consider the response of Y to the first treatment A being given by a fixed function fA of the observed treatment A, and the second treatment B being given by a fixed function fB of the treatment B observed in the world where the first treatment A was counterfactually shifted according to fA. This response Y(A = fA(A), B = fB(B(A = fA(A)))) is defined as Y(A = fA(A), B = fB(B̃), W(A = fA(A), L), L), where B̃ ≡ B(W(A = fA(A), L), A = fA(A), L) is the response of B to the shift of A.
We call the first definition shift interventions on the treated (SITs), and the second definition shift interventions as policies (SIPs). Unsurprisingly, identification theory for SITs bears some similarity to that of ETTs, while identification theory for SIPs bears some similarity to that of DTRs, although in both cases new subtleties present themselves.
SITs are of interest whenever deviations from current best practices are investigated. For instance, responses to SITs would be the correct counterfactual to use in healthcare settings to investigate the effect of dosing changes from an existing standard. SIPs are of interest when variable manipulations have a compound effect, and therefore effects of prior shift interventions on intermediate outcomes must be taken into account. For instance, responses to SIPs could be used to evaluate changes to financial aid, or a medical treatment administered over time with a compound effect. SIPs have been described, under a different name, in section 5.1 in (Richardson and Robins, 2013).
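To make the SIT/SIP distinction concrete, here is a small simulation sketch contrasting the two counterfactuals for the model in Fig. 1 (b); the linear structural equations and coefficients are illustrative assumptions, not taken from this paper.

```python
import random

# Illustrative structural equations (assumptions) for Fig. 1 (b).

def unit(rng):
    """One unit's exogenous noise terms (eL, eA, eW, eB, eY)."""
    return [rng.gauss(0, 1) for _ in range(5)]

def natural(u):
    """Observed (L, A, W, B, Y) generated by the structural equations."""
    eL, eA, eW, eB, eY = u
    L = eL
    A = 0.5 * L + eA
    W = A + eW
    B = 0.3 * W + 0.2 * A + eB
    Y = B + 0.5 * W + 0.1 * A + eY
    return L, A, W, B, Y

def sit(u, fA, fB):
    """SIT: fB is applied to the *naturally observed* B."""
    eL, eA, eW, eB, eY = u
    L, A, W, B, _ = natural(u)
    a = fA(A)
    w = a + eW                      # W(A = fA(A), L)
    b = fB(B)                       # shift of the observed B
    return b + 0.5 * w + 0.1 * a + eY

def sip(u, fA, fB):
    """SIP: fB is applied to B's response to the prior shift of A."""
    eL, eA, eW, eB, eY = u
    L, A, W, B, _ = natural(u)
    a = fA(A)
    w = a + eW
    b = fB(0.3 * w + 0.2 * a + eB)  # B(A = fA(A)) is shifted
    return b + 0.5 * w + 0.1 * a + eY

u = unit(random.Random(1))
f = lambda x: x + 1.0
# The two counterfactuals differ whenever shifting A changes B's response:
assert sit(u, f, f) != sip(u, f, f)
```

In this linear example the gap between the two responses is a constant determined by the coefficients on the A → W → B pathway, reflecting the compounding of the first shift under the SIP.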
Before describing identification theory for SITs and SIPs, we give their general definitions, using a modification of (1). Fix A ⊆ V and a set of functions f ≡ {fi : Ai ∈ A}, where each fi maps values of Ai to values of Ai. By analogy with (1), we define for any Y ∈ V, the counterfactual response Y(f(A)) to SITs on A as

Y(f(A)) ≡ Y({fi(Ai) : Ai ∈ pa(Y) ∩ A}, {Vj(f(A)) : Vj ∈ pa(Y) \ A}),

where each Ai denotes the naturally observed treatment value, and the counterfactual response Y(f) to SIPs on A as

Y(f) ≡ Y({fi(Ai(f)) : Ai ∈ pa(Y) ∩ A}, {Vj(f) : Vj ∈ pa(Y) \ A}),

where each Ai(f) is itself defined recursively by this display.
3. IDENTIFICATION UNDER FULL OBSERVABILITY
We first describe identification theory for SITs and SIPs in cases where all variables in a causal model are observed. Identification for SIPs in fully observed models is given by the following result.
Theorem 1
Fix A ⊆ V, and a set of functions f ≡ {fi : Ai ∈ A} in a fully observed functional causal model given by the DAG 𝒢. Then p(V(f)) is identified and equal to

p(V(f)) = ∏Vi ∈ V p(Vi | pa(Vi)), with each Aj ∈ pa(Vi) ∩ A evaluated at fj(Aj).
For example, given fA, fB in Fig. 1 (b), p({L, A, W, B, Y}(fA, fB)) is identified as p(L)p(A|L)p(W|A = fA(A), L) p(B|W, A = fA(A), L) p(Y|B = fB(B), W, A = fA(A), L), and so p(Y(fA, fB)) is equal to

∑L,A,W,B p(Y | B = fB(B), W, A = fA(A), L) p(B | W, A = fA(A), L) p(W | A = fA(A), L) p(A | L) p(L).
That is, identification of responses to SIPs in fully observed models resembles identification of DTRs.
Now let us consider identification of responses to SITs. It turns out that even if the causal model is fully observed, SITs may not be identified if multiple treatments are manipulated simultaneously, due to the same issue that prevents identification of ETTs. We have the following result.
Theorem 2
Fix disjoint A, Y ⊆ V, and a set of unrestricted functions f ≡ {fi : Ai ∈ A} in a fully observed functional causal model given by the DAG 𝒢.
Fix the set of all directed paths π in 𝒢 which start with Ai ∈ A, end in some element in A ∪ Y, and which do not intersect elements in A ∪ Y otherwise. Then p(Y(f(A))) is identified if and only if there are no two elements in π which share the first edge and where one path ends in an element in A, and another path ends in an element in Y. Moreover, if p(Y(f(A))) is identified, it is equal to

∑Y* \ Y ∏Vi ∈ Y* p(Vi | pa(Vi)), with each Aj ∈ pa(Vi) ∩ A evaluated at fj(Aj) whenever Vi ∈ WY ∪ Y, and at its natural value otherwise,

where Y* is the set of ancestors of Y in 𝒢, and WY is the set of variables not in A which lie on a path in π that ends in Y.
For example, given fA, fB, p(Y(fA(A), fB(B))) is not identified in Fig. 1 (b), since the set of directed paths in π will contain B → Y, A → Y, A → W → Y, A → W → B, and A → B. Since A → W → Y and A → W → B share the first edge, and have final elements in Y and B, the condition of Theorem 2 applies.
However, if we consider identification of the same distribution p(Y(fA(A), fB(B))) in Fig. 1 (c), where the edge W → B is absent, we obtain identification:

p(Y(fA(A), fB(B))) = ∑L,A,W,B p(Y | B = fB(B), W, A = fA(A), L) p(B | A, L) p(W | A = fA(A), L) p(A | L) p(L).    (4)
Note that while identification of ETTs and SITs in fully observed DAGs runs into a similar difficulty having to do with recanting witnesses (Avin et al., 2005), identification results for these two types of counterfactuals are nevertheless quite different. This is because ETTs are defined as functions of counterfactual conditionals p(Y(a)|A = a′) for some set A, while SITs are defined as counterfactual marginals.
4. IDENTIFICATION WITH HIDDEN VARIABLES
Most causal inference problems of practical importance contain hidden but relevant variables, motivating the use of causal models of a DAG where some variables are not observed. As we now show, identification theory implied by the structural causal model of DAGs with hidden variables is more involved for both SIPs and SITs.
Identification theory of a causal model of a DAG with vertices V ∪ H, where V corresponds to observed variables and H corresponds to hidden variables is often phrased on an acyclic directed mixed graph (ADMG) called a latent projection (Verma and Pearl, 1990). By an ADMG we mean a graph with directed (→) and bidirected (↔) edges and no directed cycles.
Given a DAG 𝒟 where V are observed variables and H are hidden variables, we define the latent projection ADMG 𝒢 with vertices V as follows. For every Vi, Vj ∈ V, if there exists in 𝒟 a directed path from Vi to Vj with all intermediate vertices in H, an edge Vi → Vj exists in 𝒢. For every Vi, Vj, if there exists a collider-free path from Vi to Vj in 𝒟 with the first edge on the path of the form Vi ← and the last edge on the path of the form → Vj, an edge Vi ↔ Vj exists in 𝒢. For example, if L is unobserved in Fig. 1 (a), then the resulting latent projection is shown in Fig. 1 (e). This example illustrates that latent projections are not always simple graphs.
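The latent projection construction can be sketched in code for the common special case where every hidden variable is exogenous (has no parents), which suffices for the projections in Fig. 1; the general construction also handles hidden vertices with parents via collider-free paths.

```python
# Latent projection sketch, assuming every hidden vertex is exogenous.

def latent_projection(parents, hidden):
    """parents: dict mapping each vertex to its set of parents.
    hidden: set of hidden vertices (assumed to have no parents here).
    Returns (directed, bidirected) edge sets over the observed vertices."""
    observed = set(parents) - hidden
    children = {u: {w for w in parents if u in parents[w]} for u in parents}

    def reachable_observed(v):
        """Observed vertices reachable from v through hidden intermediates."""
        out, stack, seen = set(), [v], set()
        while stack:
            x = stack.pop()
            for c in children[x]:
                if c in observed:
                    out.add(c)
                elif c not in seen:
                    seen.add(c)
                    stack.append(c)
        return out

    directed = {(v, w) for v in observed for w in reachable_observed(v)}
    bidirected = set()
    for h in hidden:
        reach = sorted(reachable_observed(h))
        for i in range(len(reach)):
            for j in range(i + 1, len(reach)):
                bidirected.add((reach[i], reach[j]))
    return directed, bidirected
```

On the bow-arc example (hidden common cause of A and Y plus the edge A → Y), this produces both A → Y and A ↔ Y, illustrating that latent projections need not be simple graphs.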
Latent projections are used because any two distinct DAGs that share the same latent projection also share the same non-parametric identification theory (Richardson et al., 2017).
Before describing this theory, we introduce a few additional definitions we will need. Given an ADMG 𝒢, and S ⊆ V, define the induced subgraph 𝒢S to be a graph containing vertices in S, and any edges in 𝒢 connecting elements of S. Given an ADMG 𝒢, a district of 𝒢 is a bidirected-connected component. The set of districts of 𝒢 forms a partition of vertices in 𝒢, and is denoted by 𝒟(𝒢). Finally, given a set S in 𝒢, define pa(S) ≡ (⋃Vi ∈ S pa(Vi)) \ S.
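The district partition 𝒟(𝒢) is simply the set of connected components of the bidirected part of the graph; a short sketch:

```python
# Districts of an ADMG: connected components of its bidirected part.
# Vertex names are assumed sortable, for deterministic output ordering.

def districts(vertices, bidirected):
    """Partition `vertices` into bidirected-connected components.
    bidirected: set of (u, w) pairs denoting u <-> w edges."""
    adj = {v: set() for v in vertices}
    for (u, w) in bidirected:
        adj[u].add(w)
        adj[w].add(u)
    seen, parts = set(), []
    for v in sorted(vertices):
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            seen.add(x)
            stack.extend(adj[x] - comp)
        parts.append(comp)
    return parts
```

On the latent projection of the front-door model (A ↔ Y the only bidirected edge), this returns the partition {{A, Y}, {M}}.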
Identification theory in hidden variable models uses ADMGs in a way analogous to how identification theory in fully observed models uses DAGs. Just as the structural causal model defined on a fully observed DAG implies the factorization (3) of the observed data distribution, and identification of all interventional distributions p(V(a)) in terms of the modified factorization (2), so the structural causal model defined on a hidden variable DAG implies the nested Markov factorization (Richardson et al., 2017) of the observed data distribution with respect to the latent projection ADMG, and identification of certain marginal interventional distributions p(Y(a)) in terms of a modified nested factorization given by the ID algorithm (Tian and Pearl, 2002; Shpitser and Pearl, 2006).
The nested Markov factorization of p(V) with respect to an ADMG 𝒢 is defined in terms of Markov kernels of the form qS(S|WS), with a single kernel for each subset S ⊆ V that is an intrinsic set. A Markov kernel qS(S|WS) is any map from values of WS to normalized densities over S. For any A ⊆ S, conditioning and marginalization in Markov kernels are defined in the usual way as:

qS(S \ A | WS) ≡ ∑A qS(S | WS),    qS(A | (S \ A) ∪ WS) ≡ qS(S | WS) / qS(S \ A | WS).
A set S is intrinsic in 𝒢 if the induced subgraph 𝒢S contains a single district and S is reachable in 𝒢. A set S is said to be reachable in 𝒢 if there exists a sequence of ADMGs 𝒢1, …, 𝒢k such that 𝒢1 = 𝒢, the vertex set of 𝒢k is S, and each 𝒢j+1 is obtained from 𝒢j by removing a specific vertex Vi and all edges with Vi as one endpoint. Finally, for each 𝒢j, the vertex Vi to be removed to obtain 𝒢j+1 has no other vertex in 𝒢j to which it has both a directed path and a bidirected path (consisting entirely of ↔ edges).
The Markov kernels defining the nested Markov model are always functionals of p(V). For example, in Fig. 1 (d), the Markov kernels corresponding to all intrinsic sets are:

q{A}(A) = p(A),  q{M}(M | A) = p(M | A),  q{Y}(Y | M) = ∑A p(Y | A, M) p(A),  q{A,Y}(A, Y | M) = p(Y | A, M) p(A).
We describe the general scheme for deriving functionals for intrinsic Markov kernels from p(V) in the Supplement.
The nested Markov factorization expresses p(V), and any kernel qR(R|WR) where R is a reachable set, in terms of Markov kernels corresponding to intrinsic sets, as follows:

qR(R | WR) = ∏D ∈ 𝒟(𝒢R) qD(D | pa(D)),

with p(V) corresponding to the special case R = V.
For instance, the nested Markov factorization for the ADMG in Fig. 1 (d) implies p(Y, M, A) = q{Y, A}(Y, A|M)qM(M|A), which is sometimes called the district or c-component factorization of an ADMG.
Given disjoint subsets Y, A of V, the nested Markov factorization naturally leads to the following reformulation of the complete algorithm for identification of p(Y(a)), sometimes called the ID algorithm (Shpitser and Pearl, 2006). This algorithm can be expressed as a modified nested Markov factorization as follows:

p(Y(a)) = ∑Y* \ Y ∏D ∈ 𝒟(𝒢Y*) qD(D | pa(D)) |A = a,

where Y* is the set of ancestors of Y in 𝒢V \ A, the graph obtained from 𝒢 by removing the vertices in A. This factorization is defined provided each D on the right hand side is intrinsic; otherwise it is undefined, and p(Y(a)) is not identified given the structural causal model for any hidden variable DAG that yields the latent projection 𝒢.
For example, in the graph shown in Fig. 1 (d), we have

p(Y(a)) = ∑M p(M | a) ∑A p(Y | A, M) p(A),

known as the front-door formula, while p(Y(a)) is not identified in Fig. 1 (e).
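A plug-in sketch of the front-door formula for binary data, with empirical frequencies standing in for the true distribution (illustrative only):

```python
# Plug-in sketch of the front-door formula for binary A, M, Y.

def front_door(data, a):
    """data: list of (a_i, m_i, y_i) tuples. Returns the plug-in estimate of
    p(Y = 1 | do(a)) = sum_M p(M | a) sum_A' p(Y = 1 | A', M) p(A')."""
    n = len(data)
    est = 0.0
    rows_a = [mi for (ai, mi, _) in data if ai == a]  # requires p(a) > 0
    for m in (0, 1):
        p_m = sum(1 for mi in rows_a if mi == m) / len(rows_a)
        inner = 0.0
        for ap in (0, 1):
            p_ap = sum(1 for (ai, _, _) in data if ai == ap) / n
            cell = [yi for (ai, mi, yi) in data if ai == ap and mi == m]
            if cell:  # positivity of p(A = ap, M = m) assumed
                inner += (sum(cell) / len(cell)) * p_ap
        est += p_m * inner
    return est
```

Any consistent estimators of the conditionals could replace the empirical frequencies.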
Identification of SIPs can be characterized in terms of the nested Markov factorization, with an additional subtlety, by the following result.
Theorem 3
Fix disjoint subsets A, Y ⊆ V, and a set of unrestricted functions f ≡ {fi : Ai ∈ A} in a functional causal model given by a DAG that yields the latent projection ADMG 𝒢. Define Y* as the set of ancestors of Y in 𝒢. Then p(Y(f)) is identified if and only if every district D ∈ 𝒟(𝒢Y*) is intrinsic, and no element of A in D has children in D in 𝒢Y*. Moreover, if p(Y(f)) is identified, it is equal to

p(Y(f)) = ∑Y* \ Y ∏D ∈ 𝒟(𝒢Y*) qD(D | pa(D)), with each Aj ∈ pa(D) ∩ A evaluated at fj(Aj).
As an example, the distribution p(Y(f)) in Fig. 1 (d) is identified, since the districts of ancestors of Y are {A, Y}, and {M}, and no district contains a child of A in the induced subgraph for that district. The identifying formula is ∑M,A p(Y|A,M)p(A)p(M|A = f(A)). On the other hand, the distribution p(Y(f)) in Fig. 1 (e) is not identified, even though the single district among the ancestors of Y, namely {A, Y}, is intrinsic. This is because this district contains a child of A.
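A plug-in sketch of the identifying functional ∑M,A p(Y|A, M)p(A)p(M|A = f(A)) for binary data (illustrative; empirical frequencies stand in for the true distribution, and f must map into the support of A):

```python
# Plug-in sketch of the SIP functional for Fig. 1 (d), binary M, Y,
# and A in {0, 1}.

def sip_frontdoor(data, f):
    """data: list of (a_i, m_i, y_i). Returns the plug-in estimate of
    sum_{M, A} p(Y = 1 | A, M) p(A) p(M | A = f(A))."""
    n = len(data)
    est = 0.0
    for a in (0, 1):
        p_a = sum(1 for (ai, _, _) in data if ai == a) / n
        rows_fa = [(mi, yi) for (ai, mi, yi) in data if ai == f(a)]
        for m in (0, 1):
            p_m = sum(1 for (mi, _) in rows_fa if mi == m) / len(rows_fa)
            cell = [yi for (ai, mi, yi) in data if ai == a and mi == m]
            if cell:
                est += (sum(cell) / len(cell)) * p_a * p_m
    return est
```

As a sanity check, with f the identity the functional collapses to p(Y = 1), consistent with identity shifts leaving the observed distribution unchanged.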
Identification of SITs is a little more involved, as we must also ensure that the difficulty described for the ETT, where the counterfactual is a function of a non-identified marginal of the form p(W(Ai = fi(Ai)), W), is avoided.
Theorem 4
Fix disjoint subsets A, Y ⊆ V, and a set of unrestricted functions f ≡ {fi : Ai ∈ A} in a functional causal model given by a DAG that yields the latent projection ADMG 𝒢. Fix the set of all directed paths π in 𝒢 which start with Ai ∈ A, end in some element in A ∪ Y, and which do not intersect elements in A ∪ Y otherwise. Define Y* as the set of ancestors of Y in 𝒢. Then p(Y(f(A))) is identified if and only if
There are no two paths in π which start with the same edge, and where one path ends in an element of Y, and another in an element of A.
Every element of A that lies in a district D in 𝒟(𝒢Y*) has no children in D in 𝒢Y*.
For any two paths in π where the second vertex on the path is in district D, either both paths have the final element in A or both paths have the final element in Y.
Moreover, if p(Y(f(A))) is identified, it is equal to

p(Y(f(A))) = ∑Y* \ Y ∏D ∈ 𝒟(𝒢Y*) qD(D | pa(D)), with each Aj ∈ paY(D) ∩ A evaluated at fj(Aj), and all other parents in A evaluated at their natural values,

where paY(D) are parents of D along edges that are first edges on paths in π that end in Y.
As an example of the application of this theorem, consider Fig. 1 (f), where we are interested in identifying p(Y(A = fA(A), B = fB(B))). If all green edges are absent, the conditions of the theorem are satisfied, and this distribution is identified, in fact by the same functional as in (4). If the edge (1) is present, identification fails because of the presence of paths A → W → B and A → W → Y, as in Theorem 2. If the edge (2) is present, there exists a district in 𝒢Y*, namely {A, L, W, Y}, with an element A in the district that also has a child in the district (W). If the edge (3) is present, a path A → B ends in a treatment, while a path A → Y ends in an outcome, and both paths have a second vertex in the same district.
A Note On Completeness
Completeness results in this section have assumed an unrestricted set of shift functions f, and only hold under a sufficiently large class of shift functions that allows the counterexamples in our proofs to be constructed. Results of this type are in the spirit of non-parametric identification theory, in the sense that shift functions act as a kind of user-specified structural equation, and identification theory results are often stated in a way that does not restrict structural equations. A similar notion of completeness for Tian's identification algorithm for responses to dynamic treatment regimes (Tian, 2008) was shown to hold in (Shpitser and Sherman, 2018).
Identification theory for sufficiently restricted classes of shift functions becomes considerably more complicated than stated here; indeed, identification may hold even if the response to an unrestricted class of shift functions is not identified. The situation is similar to one where semi-parametric restrictions are placed on structural equations in a causal model.
It is also worth noting that we always have identification when shift functions are specified as identity functions, in which case the interventional distributions p(Y(f)) and p(Y(f(A))) are equal to p(Y).
Differences In Identifying Functionals
We now give another example illustrating that when SITs and SIPs involving multiple treatments are identified, they will in general give different identifying functionals. Consider the hidden variable causal model represented by the graph in Fig. 2, where Y is the outcome of interest, and we are interested in its response to both SITs and SIPs on treatment variables A0 and A1.
Figure 2:

An example where SITs and SIPs give different identifying functionals for p(Y(f)) and p(Y(f(A))) respectively.
Here, Y* = {Y, T, A1, Z, W, A0}, and the set of districts in 𝒟(𝒢Y*) is D1 = {A0}, D2 = {W, A1, Y}, D3 = {Z}, D4 = {T}. We first note that the SIP p(Y(f)) is identified because no element of A in any district D has children in that district. In particular, A0 has no children in D1 = {A0}, and A1 has no children in D2 = {W, A1, Y}.
The corresponding sets pa(D) for each district D are pa(D1) = ∅, pa(D2) = {A0, Z, T}, pa(D3) = {W, A0}, pa(D4) = {A0, A1}, and therefore A ∩ pa(D1) = ∅, A ∩ pa(D2) = {A0}, A ∩ pa(D3) = {A0}, A ∩ pa(D4) = {A0, A1}.
The identifying functional from applying Theorem 3 is therefore

∑A0,W,A1,Z,T {p(A0)} {qD2(W, A1, Y | A0 = f0(A0), Z, T)} {p(Z | W, A0 = f0(A0))} {p(T | A0 = f0(A0), A1 = f1(A1))},

with each term corresponding to the districts in Y* enclosed in braces.
Next, consider the SIT p(Y(f(A))) for A = {A0, A1}. We note that A ∪ Y = {A0, A1, Y}. Y* is unchanged, and all three identification conditions are satisfied. The set of paths π can be read directly from Fig. 2.
paY(D) for each district are paY(D1) = ∅, paY(D2) = {A0}, paY(D3) = ∅, paY(D4) = {A1}. paY(D2) only includes A0, as there is only one path ending in Y whose first edge is a parent of D2, namely A0 → Y. paY(D3) is empty since no such paths exist. A ∩ paY(D) for each D gives A ∩ paY(D1) = ∅, A ∩ paY(D2) = {A0}, A ∩ paY(D3) = ∅, A ∩ paY(D4) = {A1}, which means that the identifying functional is changed in exactly one place: p(Z|W, A0 = f0(A0)) is replaced with p(Z|W, A0), yielding

∑A0,W,A1,Z,T {p(A0)} {qD2(W, A1, Y | A0 = f0(A0), Z, T)} {p(Z | W, A0)} {p(T | A0 = f0(A0), A1 = f1(A1))}.
Once again, each term corresponding to the districts in Y* is enclosed in braces.
5. PARAMETRIC AND SEMI-PARAMETRIC INFERENCE
Assessing the impact of responses to SIPs and SITs entails evaluating functions of counterfactual distributions p(Y(f)) and p(Y(f(A))) from data. Here we concentrate on estimating expected-value parameters β in cases where these distributions are identified, e.g. β ≡ E[Y(f)] and β ≡ E[Y(f(A))].
If a parametric model for the observed data distribution p(V), or a sufficiently large part of the distribution, can be correctly specified, maximum likelihood plug-in estimators are used for efficient statistical inference for β. In the fully observed model, plug-in estimators may be straightforwardly derived from a DAG observed data likelihood. For example, β with respect to the distribution in (4) may be estimated by plugging maximum likelihood estimates of the parameters of parametric models for each conditional density in (4) into that formula.
If β is identified in a hidden variable model with a latent projection ADMG , parametric statistical inference is sometimes possible using plug-in estimators that maximize nested Markov likelihoods, which are known for discrete data (Evans and Richardson, 2018), and multivariate normal distributions (Shpitser et al., 2018). We do not discuss these estimators further in the interests of space.
If a parametric likelihood cannot be assumed, statistical inference must proceed within a semi-parametric or non-parametric model, where a part of the likelihood or the whole likelihood is infinite-dimensional. In such cases, plug-in estimators are known to have non-negligible first order bias. A principled alternative approach to obtaining high quality consistent estimators is based on semi-parametric theory and influence functions (Tsiatis, 2006).
The resulting regular asymptotically linear (RAL) estimators take the form

β̂ = β + (1/n) ∑i=1..n ϕ(Zi) + op(n−1/2),

where ϕ(Zi) is the influence function (IF) of the ith observation for the parameter vector β, with mean zero and finite variance, and op(n−1/2) denotes a term that approaches zero in probability faster than n−1/2. RAL estimators are consistent and asymptotically normal (CAN), with the asymptotic variance of the estimator given by the variance of its IF:

√n (β̂ − β) →d N(0, E[ϕ(Z) ϕ(Z)T]).
Thus, there is a bijective correspondence between RAL estimators and IFs.
We now derive the IF for β in a single treatment setting given by Fig. 1 (a), where SITs and SIPs coincide.
Theorem 5
Fix β ≡ E[Y(f(A))], which is equal to E[E[Y | A = f(A), C]] under the model in Fig. 1 (a), with C denoting the baseline covariates. The efficient influence function for β under the non-parametric observed data model is given by

U(β) = (p(f−1(A) | C) / p(A | C)) (Y − E[Y | A, C]) + E[Y | A = f(A), C] − β,    (5)

where f−1 denotes the inverse of the shift function f.
The influence function U(β) leads to a RAL estimator which solves the estimating equation ∑i Ui(β) = 0, and which resembles augmented inverse probability weighted (AIPW) estimators derived in other contexts in causal inference (Scharfstein et al., 1999). As is often the case with these estimators, our estimator exhibits the property of double robustness, where the estimator remains consistent in the union model where either E[Y | A, C] or p(A|C) is correctly specified.
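A sketch of the resulting estimator for a discrete treatment. The density-ratio weight p(f−1(A)|C)/p(A|C) and the interfaces below are assumptions of this sketch rather than a verbatim transcription of (5).

```python
# Hedged sketch of an AIPW-style estimator for beta = E[Y(f(A))] with a
# discrete treatment A. The weight p(f^{-1}(A) | C) / p(A | C) follows the
# usual doubly robust form for shift-type policies; treat it as an
# assumption of this sketch.

def shift_aipw(data, f, f_inv, mu, pi):
    """data: list of (c, a, y) tuples.
    f: shift function; f_inv: its inverse (None outside the range of f);
    mu(a, c): outcome regression E[Y | A = a, C = c];
    pi(a, c): propensity p(A = a | C = c).
    The weighting term is unbiased only if shifted values stay within the
    support of A given C (positivity)."""
    total = 0.0
    for (c, a, y) in data:
        ainv = f_inv(a)
        w = pi(ainv, c) / pi(a, c) if ainv is not None else 0.0
        total += w * (y - mu(a, c)) + mu(f(a), c)
    return total / len(data)
```

With a correctly specified outcome regression, the augmentation term is mean zero for any weight function, which is one half of the double robustness property discussed above.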
Theorem 6
The estimator of β which solves the estimating equation ∑i Ui(β) = 0 is consistent and asymptotically normal (CAN) in the union model where at least one of the propensity model π(C; ηA) = p(A|C) and the outcome regression E[Y | A, C] is correctly specified.
In the Supplement we also derive the efficient influence function for the shift intervention p(Y(f)) in a variant of the causal model shown in Fig. 1 (d) that also contains a vector of baseline covariates.
6. SIMULATIONS AND A DATA APPLICATION
We now present a simulation study that demonstrates that our estimator is doubly robust to misspecification of either the E[Y|A, C] model or the p(A|C) model. The precise data generating process is described in the Supplement.
In this simulation, our parameter of interest β = E[Y(f(A))], where f(A) = A + 0.5, is equal to 6.5. We simulated datasets of size 500 and used 5000 replicates. The results are shown in Fig. 3 (a), which considers four scenarios: both E[Y|A, C] and p(A|C) correctly specified, only E[Y|A, C] correctly specified, only p(A|C) correctly specified, and both misspecified. As expected, the estimates show no bias in the first three scenarios, while bias is introduced when both models are misspecified.
Figure 3:

(a) Estimation of β using (6) under various types of model misspecification. (b) Empirical distribution (N = 500) of the bootstrap estimates of the effect contrast, with 95% confidence interval (−0.0043, 0.0047). (c) The bounceback probability (Y axis) learned by the random forest model for E[Y | A, C] vs discretized length of stay (X axis) for all patients in the data set. Blue values denote that bounceback actually occurred, red values that it did not.
Data Application
We now describe our data application. Intensive care unit (ICU) readmission (“bounceback”) after cardiac surgery is costly and associated with worse mortality and morbidity outcomes (Benetis et al., 2013). We used our methods to estimate the response to a shift intervention, in order to investigate whether increasing length of stay influences the probability of bounceback. Data from 5242 visits by patients at our institution who had undergone a surgical procedure on the heart, entered the hospital ICU at any point, and did not die during the visit were curated from our institution’s contribution to the Society of Thoracic Surgeons Adult Cardiac Surgery database and our internal electronic health records. 151 discrete and continuous variables covering patient demographics, medications, and pre-, intra-, and post-operative status were used.
We partitioned variables in the dataset into three types: the treatment variable A, which is the number of initial ICU hours, discretized into 12-hour time intervals; the binary outcome Y representing bounceback; and a vector C of covariates representing potential confounders. We discretized A to avoid issues with lack of support; specifically, we avoid the unstable or invalid inferences which occur if p(A | C) = 0. We are interested in the change in probability of bounceback after a hypothetical increase of length of stay by 24 hours, i.e., a policy where patients receive an additional 24 initial ICU hours, denoted f(A) = A + 2 in discretized units. We estimate this probability using (6), where the outer expectation is evaluated empirically, and the required nuisance models p(A | C) and E[Y | A, C] are estimated via a negative binomial regression (to accommodate overdispersion) and a random forest classifier, respectively.
We compare the total effect under the shift intervention, E[Y(f(A))], against the mean outcome under the observed distribution of A, E[Y]. The distribution of the estimated contrast E[Y(f(A))] − E[Y] under 500 bootstrap samples is given in Fig. 3b. As the 95% bootstrap confidence interval contains 0, we fail to reject the null hypothesis that the shift intervention of increased initial ICU hours has no effect on ICU readmission rates.
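The bootstrap comparison above can be sketched as follows; the data and the `plug_in` estimator are hypothetical stand-ins (the real analysis uses the doubly robust estimator (6) with negative binomial and random forest nuisance models).

```python
import numpy as np

def bootstrap_contrast_ci(Y, A, C, estimator, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the contrast E[Y(f(A))] - E[Y].

    estimator(Y, A, C) should return an estimate of beta = E[Y(f(A))];
    nuisance models are refit on each resample by the estimator itself.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    taus = []
    for _ in range(n_boot):
        i = rng.integers(0, n, size=n)  # resample units with replacement
        taus.append(estimator(Y[i], A[i], C[i]) - Y[i].mean())
    lo, hi = np.quantile(taus, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def plug_in(Y, A, C, shift=1.0):
    # Stand-in estimator: linear outcome regression evaluated at A + shift.
    X = np.column_stack([np.ones(len(A)), A, C])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Xs = np.column_stack([np.ones(len(A)), A + shift, C])
    return float((Xs @ b).mean())

# Synthetic null data: the treatment has no effect on the outcome, so the
# interval should cover zero and we fail to reject the null hypothesis.
rng = np.random.default_rng(1)
n = 500
C = rng.normal(size=n)
A = rng.normal(size=n)
Y = C + 0.5 * rng.normal(size=n)
lo, hi = bootstrap_contrast_ci(Y, A, C, plug_in, n_boot=200)
print(f"95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

The percentile interval is the same construction used for Fig. 3b: the contrast is recomputed on each resample and its empirical 2.5% and 97.5% quantiles form the interval.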
To explore why the null hypothesis was not rejected, we considered the behavior of the learned outcome regression E[Y | A, C] with respect to A. Fig. 3c shows the predicted bounceback probabilities for each unit in our data, plotted against their observed discretized length of stay. Red values denote no bounceback (by far the more common case), while blue values denote bounceback. The response to the shift intervention that we estimated via (6) can be viewed as a modified empirical average of this regression, augmented with an inverse probability weighted term. The learned regression function indicates that our data contains two types of patients: the far more common low-risk patients, and the rarer high-risk patients. Both types occur at all durations of length of stay, and length of stay is not a significantly predictive feature for patient type. In particular, variations in A do not significantly alter a patient’s risk from the level predicted by other features.
7. CONCLUSIONS
In this paper we define a type of soft intervention where a set of variables is manipulated to obtain values which are fixed functions of their natural values. We call this type of intervention a shift intervention. We showed that if multiple variables are manipulated, shift interventions may be defined with respect to naturally occurring values of manipulated variables, or with respect to recursively defined values of manipulated variables responding to previous shift interventions. We gave sound and complete identification algorithms for both types of shift interventions in fully observed and hidden variable causal models.
In addition, we derived an efficient semi-parametric estimator based on efficient influence functions for a special case of responses to shift interventions motivated by a clinical problem. We demonstrated the utility of our method by a simulation study, and applied it to consider how the readmission probability to the intensive care unit (ICU) of a hospital changes if the duration of the patients’ stay in the ICU is manipulated to be longer.
Supplementary Material
8. Acknowledgments
This project is sponsored in part by the National Institutes of Health grant R01-AI127271-01A1, Office of Naval Research grant N00014-18-1-2760, National Science Foundation grant CAREER-1942239, and National Science Foundation grant 1939675. The contents of this paper do not necessarily reflect the position or policy of the government, and no official endorsement should be inferred.
References
- Avin C, Shpitser I, and Pearl J. Identifiability of path-specific effects. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05), volume 19, pages 357–363. Morgan Kaufmann, San Francisco, 2005.
- Benetis R, Širvinskas E, Kumpaitiene B, and Kinduris Š. A case-control study of readmission to the intensive care unit after cardiac surgery. Medical Science Monitor: International Medical Journal of Experimental and Clinical Research, 19:148–152, Feb. 2013. doi: 10.12659/MSM.883814.
- Chakraborty B and Moodie E. Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine. Springer-Verlag, New York, 2013.
- Eberhardt F. Direct causes and the trouble with soft interventions. Erkenntnis, 79(4):755–777, 2014.
- Evans RJ and Richardson TS. Smooth, identifiable supermodels of discrete DAG models with latent variables. Bernoulli, 2018. (to appear).
- Kennedy EH. Nonparametric causal effects based on incremental propensity score interventions. Journal of the American Statistical Association, 114(526):645–656, 2019. doi: 10.1080/01621459.2017.1422737.
- Malinsky D, Shpitser I, and Richardson TS. A potential outcomes calculus for identifying conditional path-specific effects. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
- Neyman J. Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. 1923. Excerpts reprinted in English in Statistical Science, 5:463–472, 1990.
- Pearl J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.
- Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009. ISBN 978-0521895606.
- Richardson TS and Robins JM. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Preprint: http://www.csss.washington.edu/Papers/wp128.pdf, 2013.
- Richardson TS, Evans RJ, Robins JM, and Shpitser I. Nested Markov properties for acyclic directed mixed graphs, 2017. Working paper.
- Robins JM, Mark SD, and Newey WK. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, pages 479–495, 1992.
- Rubin DB. Causal inference and missing data (with discussion). Biometrika, 63:581–592, 1976.
- Scharfstein DO, Rotnitzky A, and Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94:1096–1146, 1999.
- Shpitser I and Pearl J. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). AAAI Press, Palo Alto, 2006.
- Shpitser I and Pearl J. Effects of treatment on the treated: identification and generalization. In Uncertainty in Artificial Intelligence, volume 25. AUAI Press, 2009.
- Shpitser I and Sherman E. Identification of personalized effects associated with causal pathways. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.
- Shpitser I and Tchetgen Tchetgen EJ. Causal inference with a graphical hierarchy of interventions. Annals of Statistics, 44(6):2433–2466, 2016.
- Shpitser I, Evans RJ, and Richardson TS. Acyclic linear SEMs obey the nested Markov property. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.
- Tian J. Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 554–561, Corvallis, Oregon, 2008. AUAI Press.
- Tian J and Pearl J. A general identification condition for causal effects. In Eighteenth National Conference on Artificial Intelligence, pages 567–573, 2002. ISBN 0-262-51129-0.
- Tsiatis A. Semiparametric Theory and Missing Data. Springer-Verlag, New York, 1st edition, 2006.
- Verma TS and Pearl J. Equivalence and synthesis of causal models. Technical Report R-150, Department of Computer Science, University of California, Los Angeles, 1990.