Abstract
This paper frames causal structure estimation as a machine learning task. The idea is to treat indicators of causal relationships between variables as ‘labels’ and to exploit available data on the variables of interest to provide features for the labelling task. Background scientific knowledge or any available interventional data provide labels on some causal relationships and the remainder are treated as unlabelled. To illustrate the key ideas, we develop a distance-based approach (based on bivariate histograms) within a manifold regularization framework. We present empirical results on three different biological data sets (including examples where causal effects can be verified by experimental intervention), that together demonstrate the efficacy and general nature of the approach as well as its simplicity from a user’s point of view.
Keywords: causal learning, manifold regularization, semi-supervised learning, interventional data, causal graphs
1. Introduction
Causal structure learning is concerned with learning causal relationships between variables. Such relationships are often represented using directed graphs with nodes corresponding to the variables of interest. Consider a set of p variables or nodes indexed by V = {1, … , p}. The aspect we focus on in this paper is to determine, for each (ordered) pair (i, j) ∈ V × V, whether or not node i exerts a causal influence on node j. In particular, our focus is on the binary ‘detection’ problem (of learning whether or not node i exerts a causal influence on node j) rather than estimation of the magnitude of any causal effect.
Methods for learning causal structures can be usefully classified according to whether the graph is intended to encode direct or total (ancestral) causal relationships. For example if variable A acts on B which in turn acts on C, A has an ancestral effect on C (via B). Here, the graph of direct effects has edges A → B → C, while the graph of total or ancestral effects has in addition the edge A → C. Methods based on (causal) directed acyclic graphs (DAGs) are a natural and popular choice for causal discovery (Spirtes et al., 2000; Pearl, 2009). The PC algorithm (Spirtes et al., 2000) is an important example of such a method. Using a sequence of tests of conditional independence, the PC algorithm estimates an underlying causal DAG. Due to the fact that the graph may not be identifiable, the output is an equivalence class of DAGs (encoded as a completed partially directed acyclic graph or CPDAG). Here the estimand is intended to encode direct influences. IDA (Intervention calculus when the DAG is Absent; Maathuis et al., 2009) uses the PC output to bound the quantitative total causal effect of any node i on any other node j. These estimated effects can be thresholded to provide a set of edges. FCI (Fast Causal Inference; Spirtes et al., 2000) and RFCI (Really Fast Causal Inference; Colombo et al., 2012) consider a type of ancestral graph as estimand and allow for latent variables. Greedy Interventional Equivalence Search (GIES; Hauser and Bühlmann, 2012) is a score-based approach that allows for the inclusion of interventional data.
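The distinction between direct and total (ancestral) effects in the A → B → C example can be sketched in a few lines. This is an illustrative sketch only: the function name `ancestral_closure` is hypothetical, and it assumes that every directed path corresponds to a total effect, which need not hold in real systems.

```python
import numpy as np

def ancestral_closure(direct: np.ndarray) -> np.ndarray:
    """Adjacency matrix of ancestral effects implied by a graph of direct edges."""
    reach = direct.astype(bool).copy()
    p = reach.shape[0]
    for mid in range(p):
        # Warshall's algorithm: add i -> j whenever i -> mid and mid -> j.
        reach |= np.outer(reach[:, mid], reach[mid, :])
    return reach.astype(int)

# Direct edges A -> B -> C, with variables ordered (A, B, C).
direct = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])
total = ancestral_closure(direct)  # gains the extra edge A -> C
```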
Methods for learning causal structures (such as those above) are often rooted in data-generating causal models. In a quite different vein, there have been some interesting recent efforts in the direction of labelling pairs of variables as causal or otherwise, such as in Lopez-Paz et al. (2015) and Mooij et al. (2016). These approaches are ‘discriminative’ in spirit, in the sense that they need not be rooted in an explicit data-generating model; rather the emphasis is on learning how to tell causal and non-causal apart. Our work is in this latter vein. We address a specific aspect of causal learning—that of estimating edges in a graph encoding causal relationships between a defined set of vertices—but via a machine learning approach that allows the inclusion of any available information concerning known cause-effect relationships. The output of our method is a directed graph that need not be acyclic (see Spirtes, 1995; Richardson, 1996; Hyttinen et al., 2012, for discussion of cyclic causality) and whose edges may encode either direct or total/ancestral relationships, as discussed below. The main differences between our work and previous work on labelling causal pairs (Lopez-Paz et al., 2015; Mooij et al., 2016) are the specific methods and associated theory that we put forward, the manifold regularization framework, and the empirical examples.
In general terms the idea is as follows: let 𝒟 denote the available data and Φ denote any available knowledge on causal relationships among the variables indexed in V (e.g., based on background knowledge or experimental intervention). We view the causal learning task in terms of constructing an estimator of the form Ĝ(𝒟, Φ), where Ĝ is a directed graph with vertex set V and edge set E(Ĝ), with (i, j) ∈ E(Ĝ) corresponding to the claim that variable i has a causal influence on variable j. To put this another way: entries in a binary adjacency matrix encoding causal relationships are treated as ‘labels’ in a machine learning sense. From this point of view, the task of constructing the estimator Ĝ(𝒟, Φ) is essentially one of learning these labels from available data and from any a priori known labels (derived from Φ). Thus, a key difference with respect to a number of existing methods is the nature of the inputs needed: our approach requires causal background information Φ as an input while several existing methods (such as PC) use only observational data. The causal background information Φ need not be interventional data per se, but must encode knowledge on some causal relationships in the system (we consider both scenarios in empirical examples below). Note also that in our approach the causal status of multiple pairs is coupled via the learning scheme: loosely speaking (see below for technical details), it is the position of a test pair on a classification manifold (relative to other pairs) that determines its status.
Our approach differs in several ways from graphical model-based methods. In our approach, the same framework can be used to estimate either direct or ancestral causal relationships, depending on the precise input (we show real data examples of both tasks below). This is because the classifier can be agnostic to the label semantics: provided the Bayes’ risk for the label of interest is sufficiently low, these labels can in principle be learned. In contrast to much of the literature, our approach does not try to provide a full data-generating model of the causal system but instead focuses on the specific problem of learning edges encoding causal relationships. As we see in experiments below, this can lead to good empirical performance, but the output is in a sense less rich than a full causal model (see the Discussion). Our work is motivated by scientific problems where good performance with respect to this narrower task can be useful in reducing the hypothesis space and targeting future work.
The remainder of the paper is organized as follows. We first introduce some notation and discuss in more detail how causal learning can be viewed as a semi-supervised task. We then discuss a specific instantiation of the general approach, based on manifold regularization using a simple bivariate featurization. Using this specific approach—which we call Manifold Regularized Causal Learning (MRCL)—we present empirical results using three biological data sets. The results cover a range of scenarios and include examples with explicitly interventional data.
2. Methods
2.1. Notation
Let V = {1, … , p} index a set of variables whose mutual causal relationships are of interest. Let G denote a directed graph with vertex set V and edge set E; where useful, we use V (G), E(G) to denote its vertex and edge sets and A(G) to denote the corresponding p×p binary adjacency matrix. To make the connection between causal relationships and machine learning more transparent, we introduce linear indexing by [k] of the pairs (i, j) ∈ V × V. Where needed, we make the correspondence explicit, denoting by (i[k], j[k]) the variable pair corresponding to linear index [k] and by [k(i, j)] the linear index for pair (i, j). Suppose A is the adjacency matrix of the unknown graph of interest. Let y[k] ∈ {−1, +1} be a binary variable (for convenience mapped onto {−1, +1}) corresponding to the entry (i[k], j[k]) in A; these y[k]’s are the labels or outputs to be learned. Available data are denoted 𝒟. Available a priori knowledge about causal relationships between the variables V is denoted Φ.
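As a minimal sketch of the linear indexing [k] described above, the following maps between pairs (i, j) and linear indices and converts an adjacency matrix into labels in {−1, +1}. The row-major ordering and the function names are assumptions made for illustration; the text does not fix a particular ordering.

```python
import numpy as np

def pair_to_index(i, j, p):
    """Linear index k(i, j) in 1..p*p for 1-based (i, j); row-major order is assumed."""
    return (i - 1) * p + j

def index_to_pair(k, p):
    """Inverse map: 1-based linear index k back to the 1-based pair (i[k], j[k])."""
    i0, j0 = divmod(k - 1, p)
    return i0 + 1, j0 + 1

def labels_from_adjacency(A):
    """Map adjacency entries {0, 1} onto labels {-1, +1}, in the same linear order."""
    return 2 * np.asarray(A).astype(int).ravel() - 1

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
y = labels_from_adjacency(A)
k = pair_to_index(1, 2, p=3)  # the pair (1, 2)
label_12 = y[k - 1]           # array positions are 0-based
```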
2.2. Causal Semantics
Given data 𝒟 and background knowledge Φ we aim to construct an estimate Ĝ, the latter being a directed graph that need not be acyclic. The information in Φ guides the learner. Two main cases arise, both of which we consider in experiments below:
Total or ancestral effects. Here, Φ contains information on total effects—for example via interventional experiments as performed in biology—and the edges in the estimate Ĝ are intended to describe such effects. This means that an edge (i, j) ∈ E(Ĝ) is interpreted to mean that node i is inferred to be a causal ancestor of node j.
Direct effects. Here, Φ contains information on direct effects (relative to the variable set V) and the edges in the estimated graph Ĝ are intended to describe direct effects. Then, an edge (i, j) ∈ E(Ĝ) is interpreted to mean that i is inferred to be a direct cause of j (relative to the variable set V).
Our immediate motivation comes from the experimental sciences and we focus in particular on causal influences that can, at least in principle, be experimentally verified (even in the presence of latent variables) and where causal cycles are possible (as is often the case in biology or economics, see e.g., Hyttinen et al., 2012). Accordingly, we do not demand acyclicity. In our empirical work in biology, the nature of the underlying chemical/physical systems means that there are many small magnitude causal effects that are essentially irrelevant in the scientific context and this is a characteristic of many problem settings in the natural and social sciences. This motivates a pragmatic approach assuming that estimated graphs are not very dense or fully connected nor necessarily transitive.
2.3. Semi-Supervised Causal Learning
With the notation above, the task is to learn the y[k]’s using 𝒟 and Φ. This is done using a semi-supervised estimator ŷ[k](𝒟, Φ) (we make the connection to semi-supervised learning explicit shortly). For now assume availability of such an estimator (we discuss one specific approach below). Then from the ŷ[k] we have an estimate of the graph of interest as Ĝ(𝒟, Φ) = (V, E(Ĝ(𝒟, Φ))) (recall that the vertex set V is known) with the edge set specified via the semi-supervised learner as
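$$E(\hat{G}(\mathcal{D}, \Phi)) = \{ (i, j) \in V \times V : \hat{y}_{[k(i,j)]}(\mathcal{D}, \Phi) = +1 \}. \tag{1}$$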
Background knowledge Φ could be based on relevant science or on available interventional data. For example, in a given scientific setting, certain cause-effect information may be known from previous work or theory. Alternatively, if some interventional data are available in the study at hand, this gives information on some causal relationships. Whatever the source of the information, assume that it is known that certain pairs (i, j) are either causal pairs (positive information) or not causal pairs (negative information). Using the notation above, this amounts to knowing, for some pairs [k], the value of y[k]. In semi-supervised learning terms, the pairs whose causal status is known correspond to the labelled objects and the remaining pairs are the unlabelled objects.
For each pair [k], some of the data, or some transformation thereof, will be used as predictors or inputs; we denote these generically as g[k](𝒟). That is, g[k] is a featurization of the data, with the featurization specific to variables (i[k], j[k]). Let 𝒦 be the set of linear indices (i.e., each [k] ∈ 𝒦 is a variable pair), ℒ ⊂ 𝒦 be the variable pairs with labels available (via Φ) and 𝒰 = 𝒦 \ ℒ be the set of unlabelled pairs. Let yℒ be a binary vector comprising the mℒ = |ℒ| available labels and y𝒰 be an unknown binary vector of length m𝒰 = |𝒰|. The available labels are determined by the background information Φ and we can write yℒ(Φ) to make this explicit. A semi-supervised learner gives estimates for the unlabelled objects, given the data and available labels. That is, an estimate of the form ŷ𝒰(g(𝒟), yℒ(Φ)). With these in hand we have estimates for all labels and therefore for all edges via (1).
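The final step described above, combining known labels with estimated labels to obtain an edge set, can be sketched as follows. The helper `assemble_graph` and its argument names are hypothetical, and a row-major ordering of pairs with 0-based array positions is assumed.

```python
import numpy as np

def assemble_graph(y_lab, y_unlab_hat, lab_idx, unlab_idx, p):
    """Merge known and estimated labels, then read off the edges with label +1."""
    y = np.empty(p * p, dtype=int)
    y[lab_idx] = y_lab            # labels supplied by the background knowledge
    y[unlab_idx] = y_unlab_hat    # labels estimated by the semi-supervised learner
    A_hat = (y.reshape(p, p) == 1).astype(int)
    # Report edges as 1-based (i, j) pairs, matching the notation in the text.
    edges = {(int(i) + 1, int(j) + 1) for i, j in zip(*np.nonzero(A_hat))}
    return A_hat, edges

# Toy example with p = 2: two labelled pairs and two unlabelled pairs.
A_hat, edges = assemble_graph(
    y_lab=np.array([-1, 1]), y_unlab_hat=np.array([1, -1]),
    lab_idx=np.array([0, 1]), unlab_idx=np.array([2, 3]), p=2,
)
```

The toy output contains both (1, 2) and (2, 1), which is permitted here because the estimated graph need not be acyclic.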
Formulated in this way, it is clear that essentially any combination of featurization g and semi-supervised learner could be used in this setting. Below, as a practical example, we explore graph-based manifold learning (following Belkin et al., 2006) combined with a simple bivariate featurization.
2.4. A Bivariate Featurization
For distance-based learning, we require a distance measure between objects (here, variable pairs) [k], [k′] ∈ 𝒦. The simplest candidate distance between variable pairs [k], [k′] is based only on the bivariate distribution for the variables comprising the pairs (we make this notion precise below). Proofs of propositions appearing in this Section are provided in Appendix A.
2.4.1. Distance between variable pairs
Let Z denote the p-dimensional random variable whose n realizations z^(l), l = 1, …, n, comprise the data set 𝒟. Assume Z ∈ Ƶ^p = [z_min, z_max]^p and that Ƶ^p is endowed with the Borel σ-algebra ℬ^p = ℬ(Ƶ^p). Let 𝒫 be the set of all twice continuously differentiable probability density functions, generically denoted π, with respect to Lebesgue measure Λ_2 on (Ƶ^2, ℬ^2). Let Π[k] be the bivariate (marginal) distribution for components i[k], j[k] ∈ V of Z.
Assumption 1 Each Π[k] admits a density function π[k] ∈ 𝒫.
If available, the densities π[k], π[k′] could be used to define a distance between the pairs [k], [k′]. Let d𝒫 : 𝒫 × 𝒫 → [0, ∞) denote a pseudo-metric on 𝒫. Since we do not have access to the underlying probability density functions, we construct an analogue using the available data 𝒟. Let 𝒮n ≔ [z_min, z_max]^(2n) denote the space of possible bivariate samples (the sample size is n) and S[k] ∈ 𝒮n denote the subset of the data for the variable pair [k]. That is,
$$S_{[k]} = \big( (z^{(l)}_{i[k]}, z^{(l)}_{j[k]}) \big)_{l = 1, \dots, n}.$$
Let κ : 𝒮n → 𝒫 be a density estimator (DE). We consider sample quantities of the form d𝒮 = d𝒫 ○ (κ × κ). That is, given data S[k], S[k′] ∈ 𝒮n on two pairs [k], [k′], the DE is applied separately to produce density estimates κ(S[k]) and κ(S[k′]), that are compared using d𝒫 to give d𝒮 (S[k], S[k′]) = d𝒫 (κ(S[k]), κ(S[k′])). This construction ensures that d𝒮 is a pseudo-metric without assumptions on the DE κ:
Proposition 1 Assume that d𝒫 is a pseudo-metric on 𝒫. Then d𝒮 is a pseudo-metric on 𝒮n. If, in addition, κ is injective and d𝒫 is a metric on 𝒫, then d𝒮 is a metric on 𝒮n.
2.4.2. Choice of distance
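For semi-supervised learning we need a notion of distance under which causal pairs are relatively ‘close’ to each other. For a measurable space 𝒳 equipped with a measure ρ we let
$$\| f \|_{L^2(\rho)} = \Big( \int_{\mathcal{X}} f(x)^2 \, \mathrm{d}\rho(x) \Big)^{1/2}.$$
The notion of distance that we consider is
$$d_{\mathcal{P}}(\pi, \pi') = \| \pi - \pi' \|_{L^2(\Lambda_2)}.$$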
The right hand side exists since the integrand is continuous on a compact set and thus bounded. This can be contrasted with the kernel embedding that was proposed for supervised causal learning in Lopez-Paz et al. (2015).
Proposition 2 d𝒫 is a metric on 𝒫.
The main requirement that we have of the DE is that it provides consistent estimation in the ǁ · ǁL2(Λ2) norm when π ∈ 𝒫. Specifically, consider a sequence S(n) in 𝒮n indexed by the number n of data points. In particular, suppose that S(n) is built from n independent data points whose distribution is Π (the shorthand notation S(n) ∼ Π will be used). Let π be the density function for Π. Then κ is said to be “consistent” if ǁπ − κ(S(n))ǁL2(Λ2) = oP(1) holds whenever π ∈ 𝒫.
Proposition 3 Suppose κ is consistent and that Π, Π̃ admit densities π, π̃ ∈ 𝒫. Then, for S(n) ∼ Π and S̃(n) ∼ Π̃, where S(n) and S̃(n) are not necessarily independent, we have that
$$d_{\mathcal{S}}(S^{(n)}, \tilde{S}^{(n)}) = d_{\mathcal{P}}(\pi, \tilde{\pi}) + o_P(1).$$
Thus d𝒮 approximates the idealized metric d𝒫 in the limit of draws from Π and Π̃. Note that, in our intended use case, the S(n) and S̃(n) will correspond to bivariate scatter plots S[k], S[k′] generated from the same underlying z^(l), l = 1, …, n, and hence S(n) and S̃(n) will not be independent.
For the experiments in this paper, motivated by computational ease, we used a simple bivariate histogram as the DE κ. To this end, partition Ƶ^2 into an M × M regular grid whose (m1, m2)th element is denoted B_{m1,m2}. The standard bandwidth notation h = M^{−1} will also be used. For a scatter plot S ∈ 𝒮n, let x_{m1,m2} denote the number of elements that belong to the set B_{m1,m2}. Then the histogram estimator is
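$$\kappa(S)(z) = \sum_{m_1, m_2 = 1}^{M} \frac{x_{m_1, m_2}}{n \, \Lambda_2(B_{m_1, m_2})} \, \mathbb{1}[z \in B_{m_1, m_2}]. \tag{2}$$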
This DE is consistent in the sense of Proposition 3. Indeed:
Proposition 4 Let the bandwidth parameter h of the histogram estimator κ be chosen such that nh² → ∞. Then κ is consistent. Moreover, an optimal choice of h ≍ n^{−1/4} leads to ǁπ − κ(S(n))ǁL2(Λ2) = OP(n^{−1/4}) whenever S(n) ∼ Π and π ∈ 𝒫.
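The rate in Proposition 4 can be motivated by the usual bias-variance heuristic for histograms in two dimensions. The following is a sketch of that standard argument under the smoothness assumed for 𝒫, not the proof given in Appendix A; constants are suppressed.

```latex
\begin{align*}
\mathbb{E}\,\bigl\| \pi - \kappa(S^{(n)}) \bigr\|_{L^2(\Lambda_2)}^2
  &\asymp \underbrace{h^2}_{\text{squared bias}}
   + \underbrace{\frac{1}{n h^2}}_{\text{variance}}, \\
h^2 \asymp \frac{1}{n h^2}
  &\iff h \asymp n^{-1/4}, \\
\mathbb{E}\,\bigl\| \pi - \kappa(S^{(n)}) \bigr\|_{L^2(\Lambda_2)}^2 \asymp n^{-1/2}
  &\implies \bigl\| \pi - \kappa(S^{(n)}) \bigr\|_{L^2(\Lambda_2)} = O_P\bigl(n^{-1/4}\bigr).
\end{align*}
```

The variance term also shows why nh² → ∞ is needed: each bin must receive a growing number of points for the estimate to stabilize.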
We note that this histogram DE is not rate optimal for the class 𝒫 (for comparison, kernel DEs attain a rate of OP(n^{−2/3}) over the same class 𝒫 of twice continuously differentiable bivariate densities considered here, see Wand and Jones, 1994). However, an important advantage of the histogram DE is that the subsequent evaluation of κ(S) is O(1), compared with O(n) for the kernel DE.
2.4.3. Implementation of the DE
The above arguments support the use of a bivariate histogram to provide a simple featurization for variable pairs. In practice, for all examples below, the data were standardized, then truncated to [−3, 3]², following which a bivariate histogram with bins of fixed width 0.2 was used. The dimension of the resulting feature matrix was then reduced (to 100) using PCA.
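The featurization pipeline just described can be sketched with NumPy as below. Several details are assumptions made for illustration: truncation is implemented by clipping values to the boundary (discarding out-of-range points would be an alternative reading), PCA is computed by a plain SVD, and the function names are hypothetical. The toy data have only 25 pairs, so 10 components are used here rather than 100.

```python
import numpy as np

def histogram_features(data, i, j, lim=3.0, width=0.2):
    """Bin counts for the ordered pair (i, j); columns of `data` are variables."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)   # standardize each variable
    z = np.clip(z, -lim, lim)                            # assumed form of truncation
    n_bins = int(round(2 * lim / width))                 # 30 bins per axis by default
    edges = np.linspace(-lim, lim, n_bins + 1)
    counts, _, _ = np.histogram2d(z[:, i], z[:, j], bins=[edges, edges])
    return counts.ravel()                                # vector of length n_bins ** 2

def pca_reduce(X, n_components):
    """Project the rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))                         # n = 200 samples, p = 5 variables
pairs = [(i, j) for i in range(5) for j in range(5)]     # all ordered pairs in V x V
X = np.vstack([histogram_features(data, i, j) for i, j in pairs])
X_red = pca_reduce(X, n_components=10)
```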
2.5. Manifold Regularization
Recall that the goal is to estimate binary labels y𝒰 for a subset 𝒰 ⊂ 𝒦 of variable pairs given available data 𝒟 and known labels yℒ(Φ) for a subset ℒ = 𝒦 \ 𝒰 (these are taken to be obtained from available interventional experiments and/or background knowledge). For any two pairs [k], [k′] ∈ 𝒦, we also have available a distance d𝒮(S[k], S[k′]). This is a task in semi-supervised learning (see e.g., Belkin et al., 2006; Fergus et al., 2009) and a number of formulations and methods could be used for estimation in this setting. Here we describe a specific approach in detail, using manifold regularization methods discussed in Belkin et al. (2006).
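Let x[k] denote a vector whose entries are the bin-counts x_{i,j}, 1 ≤ i, j ≤ M, appearing in (2), for scatter plot S[k]. Let 𝒳 = ×_{1≤i,j≤M} [0, n] and note that x[k] ∈ 𝒳. Then we make the observation that, for the histogram estimator, the distance between two scatter plots is proportional to the Euclidean distance between their bin-count vectors:
$$d_{\mathcal{S}}(S_{[k]}, S_{[k']}) \propto \| x_{[k]} - x_{[k']} \|_2.$$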
This perspective emphasizes that g[k](𝒟) = x[k] is the featurization that underpins this work, and that the classification task can be considered as the construction of a map c : 𝒳 → {−1, +1}. To develop an approach to semi-supervised classification in the manner of Belkin et al. (2006), let ρ𝒳 be a reference measure on 𝒳 and let K : 𝒳 × 𝒳 → ℝ be a Mercer kernel; i.e., continuous, symmetric and positive semi-definite. The reproducing kernel Hilbert space, ℋK, associated to K can be defined via the integral operator Σ_K : L2(ρ𝒳) → L2(ρ𝒳) where
$$(\Sigma_K f)(x) = \int_{\mathcal{X}} K(x, x') f(x') \, \mathrm{d}\rho_{\mathcal{X}}(x').$$
From the fact that K is a Mercer kernel it follows that Σ_K is self-adjoint, positive semi-definite and compact. In particular, $\Sigma_K^{\alpha}$ is well-defined for α ∈ (0, ∞). The reproducing kernel Hilbert space is defined as $\mathcal{H}_K = \Sigma_K^{1/2} L^2(\rho_{\mathcal{X}})$ and its norm is $\| f \|_{\mathcal{H}_K} = \| \Sigma_K^{-1/2} f \|_{L^2(\rho_{\mathcal{X}})}$; c.f. Corollary 4.13 in Cucker and Zhou (2007).
Recall that mℒ = |ℒ| is the number of available labels and m𝒰 = |𝒰| the number of unlabelled pairs. Let m = m𝒰 + mℒ (= |𝒦|) be the total number of pairs. Using the distance function d𝒮 we first define an m × m similarity matrix W with entries
$$W_{[k],[k']} = \exp\Big( -\frac{d_{\mathcal{S}}(S_{[k]}, S_{[k']})^2}{2 \sigma_1^2} \Big), \tag{3}$$
where σ1 > 0 must be specified. The squared-exponential form is motivated by an analytic connection between the heat kernel and the Laplace-Beltrami operator, which will be exploited in Section 2.5.1. We will use a partition of the matrix corresponding to the sets 𝒰, ℒ as follows
$$W = \begin{bmatrix} W_{\mathcal{L}\mathcal{L}} & W_{\mathcal{L}\mathcal{U}} \\ W_{\mathcal{U}\mathcal{L}} & W_{\mathcal{U}\mathcal{U}} \end{bmatrix},$$
where we have assumed, without loss of generality, that the variable pairs are ordered so that the labelled pairs appear in the first mℒ places, followed by the m𝒰 = m − mℒ unlabelled pairs. Correspondingly let
$$y = \begin{bmatrix} y_{\mathcal{L}} \\ y_{\mathcal{U}} \end{bmatrix}$$
denote the label vector, where +1 indicates those pairs [k] for which y[k] = 1. The vector y𝒰 is unknown and is the object of estimation.
Let D be the m × m diagonal matrix with diagonal entries D[k],[k] = ∑[k′]∈𝒦 W[k],[k′]. Define L = D − W (i.e., the un-normalized graph Laplacian; all matrices with O(m²) entries are denoted as bold capitals to emphasize the potential bottleneck that is associated with storage and manipulation of these matrices). Let
$$f = \begin{bmatrix} f^{\mathcal{L}} \\ f^{\mathcal{U}} \end{bmatrix}$$
be a vector corresponding to a classification function f : 𝒳 → ℝ evaluated at the m variable pairs 𝒦, with the superscripts indicating correspondence with the labelled and unlabelled pairs. Intuitively, we want the sign of f to agree with the known labels yℒ and also to take account of the manifold structure encoded in L.
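The construction of W, D and L = D − W can be sketched as follows. The feature matrix stands in for the bin-count vectors, Euclidean distance between feature vectors is used as the distance, and the factor 2σ1² in the exponent is an assumed convention for the squared-exponential form; the function name is hypothetical.

```python
import numpy as np

def similarity_and_laplacian(X, sigma1):
    """Squared-exponential similarity W and un-normalized graph Laplacian L = D - W."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    W = np.exp(-d2 / (2.0 * sigma1 ** 2))   # assumed scaling of the exponent
    D = np.diag(W.sum(axis=1))              # degree matrix: row sums of W
    return W, D - W

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                 # 8 variable pairs, 3 features each
W, L = similarity_and_laplacian(X, sigma1=1.0)

# The quadratic form f' L f penalizes label changes between similar pairs.
f = rng.normal(size=8)
quad = f @ L @ f
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
```

Here `quad` and `pairwise` agree up to rounding, which is the standard identity linking the Laplacian quadratic form to a weighted sum of squared differences.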
In this work we consider a classifier of the form ĉ(x) = sign(f̂(x)) where f̂ arises from the Laplacian-regularized least squares method
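$$\hat{f} = \operatorname*{arg\,inf}_{f \in \mathcal{H}_K} \; \frac{1}{m_{\mathcal{L}}} \| f^{\mathcal{L}} - y_{\mathcal{L}} \|_2^2 + \lambda_1 \, f^{\top} L f + \lambda_2 \| f \|_{\mathcal{H}_K}^2 \tag{4}$$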
following Section 4.2 of Belkin et al. (2006). Here the first term relates the known labels to the values of the function f. The second term imposes ‘smoothness’ on the label assignment in the sense of encouraging solutions where the labels do not change quickly with respect to the distance metric. The third term is principally to ensure that the infimum remains well-defined and unique in the situation where there is insufficient data for the first penalty alone to be sufficient (see Remark 2 in Belkin et al., 2006).
Remark 5 (Choice of loss) It is important to comment on our choice of a squared-error loss function in (4), which differs from the more natural approach of using hinge loss for a binary classification task. Our motivation here is principally computational expedience; the computational burden associated with the m = O(p²) different scatter plots requires that a light-weight estimation procedure is used. However, we note that we are not the first to propose the use of squared-error loss in the classification context; it is in fact a standard approach to classification in the situation when the number of classes is > 2 (e.g., Wang et al., 2008).
2.5.1. Consistency of the Classifier
As explained in Remark 5, the use of a squared-error loss function in a classification context is somewhat unnatural. It is therefore incumbent on us to establish consistency of the proposed method.
To this end, we exploit the specific form of the similarity matrix used in (3). Indeed, if we re-write
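$$f^{\top} L f = \frac{1}{2} \sum_{[k], [k'] \in \mathcal{K}} W_{[k],[k']} \big( f_{[k]} - f_{[k']} \big)^2 \tag{5}$$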
then it can be established (under certain regularity conditions) that, if input data x are independently drawn from ρ𝒳, then (5) converges to the quantity $\int_{\mathcal{M}} f \, \Delta_{\mathcal{M}} f \, \mathrm{d}\rho_{\mathcal{X}}$ (up to proportionality), a smoothness penalty based on the weighted Laplace-Beltrami operator Δℳ on the manifold ℳ induced by ρ𝒳 (Grigor’yan, 2006). The convergence occurs as m → ∞ with σ1 → 0 at an appropriate rate (Theorem 3.1 of Belkin and Niyogi, 2008).
This convergence of the graph Laplacian to the Laplace-Beltrami operator underlies existing consistency results for semi-supervised regression (e.g., Cao and Chen, 2012) and is exploited again to establish the consistency of our classifier ĉ(x) = sign(f̂(x)) in Appendix B. In summary, the ability to assign the correct label to an unlabelled pair [k] ∈ 𝒰 depends on both the intrinsic predictability of the label as a function of the scatter plot S[k], as quantified by the Bayes risk, and the smoothness of the Bayes classifier fρ, as quantified by the largest value α ∈ (0, 1] for which the smoothness condition of Corollary 9 holds; see Appendix B for full detail.
2.5.2. Implementation of the Classifier
Given training labels yℒ, label estimates ŷ𝒰 = sign(f̂^𝒰) are obtained by minimizing the objective function described above, as explained in Equation 8 in Belkin et al. (2006). This gives
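$$\hat{f}^{\mathcal{U}} = K_{\mathcal{U},\mathcal{K}} \big( J K_{\mathcal{K},\mathcal{K}} + \lambda_2 m_{\mathcal{L}} I_m + \lambda_1 m_{\mathcal{L}} L K_{\mathcal{K},\mathcal{K}} \big)^{-1} \begin{bmatrix} y_{\mathcal{L}} \\ 0 \end{bmatrix} \tag{6}$$
where K𝒰,𝒦 is the m𝒰 × m kernel matrix based on the unlabelled 𝒰 and total 𝒦 data, K𝒦,𝒦 is the m × m kernel matrix based on the total data 𝒦, J is the m × m diagonal matrix whose first mℒ diagonal entries are 1 and whose remaining entries are 0, and Im denotes an m-dimensional identity matrix.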
Here ŷ𝒰 provides a point estimate for the unknown labels while f̂^𝒰 is real-valued and can be used to rank candidate pairs if required. The linear system in (6) can be solved at a naive computational cost of O(m³). Computation for large-scale semi-supervised learning has been studied in the literature (see e.g., Fergus et al., 2009) and a number of approaches could be used to scale up to larger problems, but were not pursued in this work.
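A minimal sketch of Laplacian-regularized least squares in the style of Equation 8 of Belkin et al. (2006) is given below. It assumes an objective of the form (1/mℒ)‖fℒ − yℒ‖² + λ1 fᵀLf + λ2‖f‖², with any normalizing constants absorbed into λ1 and λ2, and labelled pairs ordered first. The function names, the toy data and the parameter values are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lap_rls(K, L, y_lab, lam1, lam2):
    """Solve the m x m linear system for the expansion coefficients; O(m^3) cost."""
    m, m_lab = K.shape[0], len(y_lab)
    J = np.zeros((m, m))
    J[:m_lab, :m_lab] = np.eye(m_lab)          # selects the labelled pairs
    y_pad = np.concatenate([np.asarray(y_lab, dtype=float), np.zeros(m - m_lab)])
    lhs = J @ K + lam2 * m_lab * np.eye(m) + lam1 * m_lab * (L @ K)
    alpha = np.linalg.solve(lhs, y_pad)
    return alpha, K @ alpha                     # coefficients and fitted values f

def objective(alpha, K, L, y_lab, lam1, lam2):
    """The assumed objective, written in terms of the coefficients alpha."""
    f = K @ alpha
    fit = np.sum((f[:len(y_lab)] - y_lab) ** 2) / len(y_lab)
    return fit + lam1 * (f @ L @ f) + lam2 * (alpha @ K @ alpha)

# Toy data: two labelled points (first two rows) and four unlabelled points.
X = np.array([[0.0], [10.0], [0.5], [1.0], [10.5], [11.0]])
y_lab = np.array([1.0, -1.0])
K = gaussian_kernel(X, sigma=1.0)
W = K.copy()                                    # same length scale for W and K
L = np.diag(W.sum(axis=1)) - W
alpha, f_hat = lap_rls(K, L, y_lab, lam1=0.01, lam2=0.01)
y_hat_unlab = np.sign(f_hat[len(y_lab):])       # point estimates for unlabelled pairs
```

The real-valued entries of `f_hat` can be used to rank pairs, while `y_hat_unlab` gives the thresholded labels. For large m the dense solve becomes the bottleneck noted above.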
For experiments reported below we employed a similarity matrix (with length scale σ1 as in (3)) and a kernel
$$K(x, x') = \exp\Big( -\frac{\| x - x' \|_2^2}{2 \sigma_2^2} \Big)$$
whose length-scale parameter σ2 was set equal to σ1 in the absence of prior knowledge about the manifold ℳ. The scale σ1 was set to the average distance to the nearest 50 points in the feature space (in practice estimated via a subsample).
The two penalty parameters in (4) were set to small positive values (λ1 = λ2 = 0.001; we found results were broadly insensitive to this choice). Following common practice we worked with the normalized graph Laplacian in place of L (see Remark 3 of Belkin et al., 2006).
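The two implementation choices above, the nearest-neighbour length scale and the normalized graph Laplacian, can be sketched as follows. The reading of "average distance to the nearest 50 points" as a mean over all points of the mean distance to their k nearest neighbours is an assumption, as are the function names; the symmetric normalization I − D^{−1/2} W D^{−1/2} is one common form.

```python
import numpy as np

def knn_length_scale(X, k=50):
    """Mean distance from each point to its k nearest neighbours, averaged over points."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt(np.sum(diff ** 2, axis=2))
    nearest = np.sort(d, axis=1)[:, 1:k + 1]    # column 0 is the zero self-distance
    return float(nearest.mean())

def normalized_laplacian(W):
    """Symmetric normalized graph Laplacian I - D^{-1/2} W D^{-1/2}."""
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - inv_sqrt_deg[:, None] * W * inv_sqrt_deg[None, :]

# Toy check on ten equally spaced points on a line, with k = 2 neighbours.
X = np.arange(10.0).reshape(-1, 1)
sigma1 = knn_length_scale(X, k=2)
d2 = (X - X.T) ** 2
W = np.exp(-d2 / (2.0 * sigma1 ** 2))
L_norm = normalized_laplacian(W)
```

The full pairwise distance array costs O(m²) memory, which is one reason a subsample might be used in practice.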
3. Empirical Results
We tested our approach using three data sets with different characteristics. The key features of each data set are outlined below, with a full description of each data set appearing in the respective subsection. In all cases performance was assessed using either held-out interventional data or scientific knowledge.
D1: Yeast knockout data. Here, we used a data set due to Kemmeren et al. (2014), previously considered for causal learning in Peters et al. (2016); Meinshausen et al. (2016). The data consist of a large number of gene deletion experiments with corresponding gene expression measurements.
D2: Kinase intervention data from human cancer cell lines. These data, due to Hill et al. (2017), involve a small number of interventions on human cells, with corresponding protein measurements over time.
D3: Protein data from cancer patient samples. These data arise from The Cancer Genome Atlas (TCGA) and are presented in Akbani et al. (2014). There are no interventional data, but the data pertain to relatively well-understood biological processes allowing inferences to be checked against causal scientific knowledge.
An appealing feature of MRCL is the simplicity with which it can be applied to diverse problems. In each case below, we simply concatenate available data to form the data set 𝒟 and available knowledge/interventions to form Φ, then directly apply the methods as described.
3.1. General Problem Set-Up
The basic idea in all three problems was as follows: given data on a set of variables, for each (ordered) pair (i, j) of variables we sought to determine whether or not i has a causal effect on j. In the case of data sets D1 and D2 the results were assessed against the outcome of experiments involving explicit interventions. As discussed above, such experiments reveal ancestral relationships (that need not be direct) and the goal in these examples was to learn such relationships. The availability of a large number of interventions in D1 allowed a wider range of experiments, whereas D2 is a smaller data set (but from human cells), allowing only a relatively limited assessment. In the case of D3, where interventional data (i.e., interventions on the same biological material that give rise to the training data) were not available but the relevant biological mechanisms are relatively well understood, we compared results to a reference mechanistic graph derived from the domain literature. The literature itself is in effect an encoding of extensive interventional experiments combined with biochemical and biophysical knowledge. This gives information on direct edges and here the edges learned are intended to represent direct causes (relative to the set of observed variables). Within the semi-supervised set-up, a subset of pairs were labelled at the outset and the remaining pairs were unlabelled. All empirical results below are for unlabelled pairs; that is, in all cases assessment is carried out with respect to causal (and non-causal) relationships that were not used to train the models.
3.2. Data Set D1: Yeast Gene Expression
Data
The data consisted of gene expression levels (log ratios) for a total of ptotal = 6170 genes. Some of the data samples were measurements after knocking out a specific gene (interventional data) and the other samples were without any such intervention (observational data), with sample sizes of nint = 1479 and nobs = 153 respectively. Each of the genes intervened on was one of the ptotal genes. Let t(l) be the index of the gene targeted by the lth intervention. That is, the lth interventional sample was an experiment in which gene t(l) was knocked out. Let T = {t(1), … , t(nint)} be the subset of genes that were the target of an interventional experiment.
Problem set-up
Our problem set-up was as follows. We sampled a subset C ⊂ T of the genes that were intervened upon, with |C| = 50, and treated this as the vertex set of interest (i.e., setting V = C and p = |C| = 50). The goal was to uncover causal relationships between these p variables.
Since by design interventional data were available for all variables j ∈ C, we used these data to define an interventional ‘gold standard’. To this end we used a robust z-score that considered the change in a variable of interest under intervention, relative to its observational variation. Let $z_j^{\mathrm{int}(i)}$ denote the expression level of gene j following intervention on gene i. For any pair of genes i, j ∈ C we say that gene i has a causal effect on gene j if and only if
$$\zeta_{ij} = \frac{\big| z_j^{\mathrm{int}(i)} - \operatorname{med}_j \big|}{\operatorname{IQR}_j} > \tau,$$
where med_j is the median level of gene j (calculated using half of the observational data samples; the remaining samples were used as training data, see below), IQR_j the corresponding inter-quartile range and τ = 5 was a fixed threshold. That is, we say there is an (experimentally verified) causal relationship between gene i and gene j if and only if ζij > τ. An absence of causal effects precludes estimation of true positive rates; hence we sampled C subject to a sparsity condition (that at least 2.5% of gene pairs show an effect).
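The robust z-score rule can be sketched as below. The array layout (row i of `x_int` holds the expression levels after knocking out gene i) and the function name are assumptions for illustration; diagonal entries are not treated specially here.

```python
import numpy as np

def gold_standard(x_int, x_obs, tau=5.0):
    """Binary matrix A with A[i, j] = 1 when the robust z-score zeta_ij exceeds tau.

    x_int: (p, p) array, x_int[i, j] = level of gene j after intervening on gene i.
    x_obs: (n_obs, p) array of observational samples used for the median and IQR.
    """
    med = np.median(x_obs, axis=0)
    q75, q25 = np.percentile(x_obs, [75, 25], axis=0)
    zeta = np.abs(x_int - med[None, :]) / (q75 - q25)[None, :]
    return (zeta > tau).astype(int), zeta

# Toy example with two genes: observational median 2 and inter-quartile range 2.
x_obs = np.tile(np.arange(5.0)[:, None], (1, 2))
x_int = np.array([[2.0, 13.0],
                  [2.5, 2.0]])
A, zeta = gold_standard(x_int, x_obs)
```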
Let A(C) be a p×p binary matrix encoding the causal effects as described in the foregoing (i.e., A(C)ij = 1 indicates that i has an experimentally verified causal effect on j). Then, given data on genes C, we set up the learning problem as follows. We treated a fraction ρ of the entries in A(C) as the available labels Φ. Thus, here m = p² = 2500, mℒ = ⌊ρ m⌋ and m𝒰 = m − mℒ. Using these labels and data on the variables C, we learned causal edges as described. This gave estimates for the remaining (unseen) entries in A(C), which we compared against the corresponding true values. The data set 𝒟 comprised expression measurements for the genes in C for observational data samples (those samples not used to calculate the robust z-scores), plus interventional data samples where genes outside the set of interest were intervened upon; that is, a subset of the 1429 genes in T\C. This set-up ensured that 𝒟 included none of the interventional or observational data used to obtain the ground-truth matrix A(C). The total amount of training data is denoted by ntrain. We considered ntrain = 200, 500 and 1000, with the interventional samples included in 𝒟 sampled at random.
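The labelled/unlabelled split can be sketched as follows, assuming that the labelled entries are chosen uniformly at random (the text does not state the sampling scheme) and using a hypothetical helper name.

```python
import numpy as np

def split_labels(A, rho, rng):
    """Reveal a fraction rho of the entries of A as labels; the rest are held out."""
    y = 2 * np.asarray(A).astype(int).ravel() - 1    # labels in {-1, +1}
    m = y.size
    m_lab = int(np.floor(rho * m))
    perm = rng.permutation(m)
    return y, perm[:m_lab], perm[m_lab:]             # labels, labelled idx, unlabelled idx

rng = np.random.default_rng(0)
p = 50
A = (rng.random((p, p)) < 0.05).astype(int)          # synthetic stand-in for A(C)
y, lab_idx, unlab_idx = split_labels(A, rho=0.25, rng=rng)
```

Performance is then assessed only on `unlab_idx`, so that no entry used for training contributes to the evaluation.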
Results
We compared the proposed Manifold Regularized Causal Learning (MRCL) approach with the following approaches:
Penalized regression with an ℓ1 penalty (Lasso; Tibshirani, 1996). Each variable j ∈ C was regressed on all other variables i ∈ C, i ≠ j to obtain regression coefficients. This is not a causal approach as such, but is included as a simple multivariate baseline.
Intervention-calculus when the DAG is absent (IDA; Maathuis et al., 2009, 2010). A lower bound for the total causal effect of variable i on variable j was estimated for each pair i, j ∈ C, i ≠ j.
The PC algorithm (PC; Spirtes et al., 2000). This provides a CPDAG estimate for the variables C.
Greedy Interventional Equivalence Search (GIES; Hauser and Bühlmann, 2012). This provides an essential graph estimate for the variables C, and allows inclusion of interventional data in a principled manner.
As simple baselines, we also included Pearson and Kendall correlation coefficients (Pearson and Kendall) and, following a suggestion from a referee, a simple k-nearest neighbor approach based on the featurization introduced above (k-NN).
We note that the causal methods compared against here differ in various ways from MRCL in the nature of their inputs and outputs and should not be regarded as direct competitors. Rather, the aim of the experiments is to investigate how MRCL performs on real data, whilst providing a set of baselines corresponding to well-known causal tools and standard correlation measures.
For the methods resulting in a score sij for all pairs i, j ∈ C, i ≠ j (i.e., correlation or regression coefficients, total causal effects, or, for MRCL, the real-valued output in (6)), the scores were thresholded and pairs (i, j) whose absolute score exceeded the threshold were labelled as ‘causal’. Varying the threshold and calculating true positives and false positives with respect to the binary unseen entries in the matrix A(C) resulted in a receiver operating characteristic (ROC) curve.
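The thresholding procedure can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the experiments: the function names are hypothetical, the threshold is swept over the observed absolute scores, and the area under the resulting curve is obtained by the trapezoidal rule. It assumes the held-out pairs contain at least one causal and one non-causal pair.

```python
def roc_points(scores, labels):
    """ROC points from thresholding absolute scores.

    scores: real-valued scores for the held-out pairs.
    labels: matching binary ground-truth entries (1 = causal).
    Assumes both classes are present among the held-out pairs.
    """
    abs_scores = [abs(s) for s in scores]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Sweep the threshold downwards; a pair is called causal when its
    # absolute score is at least the current threshold.
    for t in sorted(set(abs_scores), reverse=True):
        tp = sum(1 for s, y in zip(abs_scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(abs_scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: both causal pairs have the largest absolute scores.
curve = roc_points([0.9, -0.8, 0.3, 0.1], [1, 1, 0, 0])
area = auc(curve)
```

Note that the sign of a score is discarded before thresholding, so a strongly negative coefficient counts as evidence for a causal relationship in the same way as a strongly positive one.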
Figure 1 shows the area under the ROC curve (AUC) as a function of the proportion ρ of entries in A(C) that were observed, for the three sample sizes. Results were averaged over 25 iterations. MRCL showed good performance relative to the other approaches for all 12 considered combinations of ntrain and ρ (for the other methods shown in Figure 1, any variation in performance with ρ was solely due to the changing test set as these methods do not use the background knowledge Φ). Results for PC, which provides a point estimate of a graphical object, are shown as points on the ROC plane for the 12 different regimes in Appendix C (Fig. 6). We considered also the transitive closure (motivated by the nature of the experimental data) and exploiting the background information Φ via additional constraints. MRCL performs well relative to the other methods in all regimes (see also the Discussion).
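The transitive closure mentioned above turns a graph of direct edges into one of ancestral relationships, which is the kind of effect seen in interventional data. A minimal sketch using Warshall's algorithm is given below; the function name is illustrative, and this is not necessarily how the closure was computed in the experiments.

```python
def transitive_closure(adj):
    """Transitive closure of a binary adjacency matrix (Warshall's algorithm).

    adj[i][j] = 1 encodes a direct edge i -> j. In the result, entry (i, j)
    is 1 whenever j is reachable from i along one or more directed edges.
    """
    p = len(adj)
    reach = [row[:] for row in adj]  # copy so the input is left unchanged
    for k in range(p):
        for i in range(p):
            if reach[i][k]:
                for j in range(p):
                    if reach[k][j]:
                        reach[i][j] = 1
    return reach


# The chain A -> B -> C gains the ancestral edge A -> C.
direct = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
ancestral = transitive_closure(direct)
```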
Figure 1.
Results for data set D1 (yeast data), random sampling. Area under the ROC curve (AUC; with respect to causal relationships determined from unseen interventional data), as a function of the fraction ρ of labels available (labels were sampled at random). Results are shown for three training data sample sizes ntrain. Results are mean values over 25 iterations and error bars indicate standard error of the mean. Additional results for the PC algorithm appear in Appendix C (see text for details).
In the above results the pairs whose causal relationship was to be predicted were chosen at random (i.e., the set of unlabelled pairs was a random subset of the set of all pairs). In contrast, in some settings it may be relevant to predict the effect of intervening on variable i, without knowing the effect of intervening on i on any other variable. For this setting, the unlabelled set should comprise entire rows of the causal adjacency matrix A(C). Figure 2 considers this case. To ensure a sufficient number of rows were non-empty, we imposed the additional restriction on the gene subset C that at least half of the rows had at least one causal effect. Results for PC are shown in Appendix C (Fig. 7) as points on the ROC plane. As for the random sampling case above, MRCL offers an improvement over the other methods. k-NN also performs well relative to the other approaches here.
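The difference between the two sampling schemes can be made concrete with a short sketch. The helper names and the use of a seeded generator are assumptions for illustration only: under random sampling a fraction ρ of all entries of the adjacency matrix is revealed, whereas under row-wise sampling a fraction ρ of entire rows is revealed and the remaining rows form the test set.

```python
import math
import random


def split_random(A, rho, seed=0):
    """Random sampling: floor(rho * m) of the m = p * p entries become labels."""
    p = len(A)
    pairs = [(i, j) for i in range(p) for j in range(p)]
    random.Random(seed).shuffle(pairs)
    m_lab = math.floor(rho * len(pairs))
    labelled = {(i, j): A[i][j] for (i, j) in pairs[:m_lab]}
    unlabelled = pairs[m_lab:]
    return labelled, unlabelled


def split_rowwise(A, rho, seed=0):
    """Row-wise sampling: floor(rho * p) entire rows become labels."""
    p = len(A)
    rows = list(range(p))
    random.Random(seed).shuffle(rows)
    n_lab = math.floor(rho * p)
    labelled = {(i, j): A[i][j] for i in rows[:n_lab] for j in range(p)}
    unlabelled = [(i, j) for i in rows[n_lab:] for j in range(p)]
    return labelled, unlabelled


# Toy 4 x 4 adjacency matrix.
A = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
lab, unlab = split_random(A, 0.25)
lab_r, unlab_r = split_rowwise(A, 0.5)
```

With row-wise sampling no test pair shares its intervened variable i with any labelled pair, which is what makes this setting a test of generalization to entirely unseen interventions.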
Figure 2.
Results for data set D1 (yeast data), row-wise sampling. As Figure 1, except the subset of labels available to the learner were obtained by sampling entire rows of the causal adjacency matrix. As before, a proportion ρ were sampled. The remaining rows were then used as test data. Additional results for the PC algorithm appear in Appendix C (see text for details).
We additionally compared MRCL with GIES. GIES and MRCL differ in terms of their required inputs: In addition to data 𝒟, MRCL requires binary labels on causal relationships via background information Φ, while GIES requires the interventional data itself and metadata specifying the intervention targets. For row-wise sampling, to allow for a reasonable comparison, we ran GIES providing the interventional data corresponding to the rows whose labels are provided to MRCL. The same data were also provided as input to the other approaches, including in data set 𝒟 for MRCL. This means the data matrices differ from those above, with sample size dependent on ρ, and for MRCL, 𝒟 now includes data that were used to obtain background information Φ (train/test validity is preserved since it remains the case that all testing is done with respect to entirely unseen interventions). Results appear in Figure 3, with PC and GIES shown as points on the ROC plane. MRCL appears to offer an improvement relative to the other methods (see also the Discussion). Note that GIES is not directly applicable to the random sampling setting above since it requires the interventional data with respect to all other variables (and not just a subset thereof).
Figure 3.
Results for data set D1 (yeast data), comparison including GIES, row-wise sampling. ROC curves are shown with respect to causal relationships determined from unseen interventional data. “TC” indicates use of a transitive closure operation and “cnstrnts” indicates that the background information Φ was included via input constraints. Results for PC and GIES are shown as points on the ROC plane. Note that due to the nature of input required by GIES the data matrices in this example differ from the row-wise sampling example in Figure 2 (see text for details). Results are averages over 25 iterations.
3.3. Data Set D2: Protein Time-Course Data
Data
The data consisted of protein measurements for p = 35 proteins measured at seven time points in four different ‘cell lines’ (BT20, BT549, MCF7 and UACC812; these are laboratory models of human cancer) and under eight growth conditions. The proteins under study act as kinases (i.e., catalysts for a biochemical process known as phosphorylation) and interventions were carried out using kinase inhibitors that block the kinase activity of specific proteins. A total of four intervention regimes were considered, plus a control regime with no interventions. The data used here were a subset of the complete data set reported in detail in Hill et al. (2017) and were also previously used in a Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge on learning causal networks (Hill et al., 2016).
Problem set-up
Treating each cell line as a separate, independent problem, the intervention regimes were used to define an interventional ‘gold standard’, in a similar vein as for data set D1. This followed the procedure described in detail in Hill et al. (2016) with an additional step of taking a majority vote across growth conditions to give a causal gold standard for each cell line c. For each cell line c, we formed a data matrix Zc consisting of all available data for the p = 35 proteins except for one of the intervention regimes. The intervention regime not included was a kinase inhibitor targeting the protein mTOR. This intervention was entirely held out and used to provide the test labels. As background knowledge Φc we took as training labels causal effects under the other interventions. With this set-up, the task was to determine the (ancestral) causal effects of the entirely unseen intervention. Note that each cell line c was treated as an entirely different data set and task, with its own data matrix, background knowledge and interventional test data.
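The majority-vote step can be sketched as follows. This is only an illustration under stated assumptions: the per-condition calls are taken to be binary matrices of equal shape, and ties (possible with an even number of growth conditions) are resolved as "no effect". The tie rule and the function name are assumptions and are not taken from Hill et al. (2016).

```python
def majority_vote(calls):
    """Combine binary effect calls across growth conditions.

    calls: list (one item per growth condition) of binary matrices of equal
    shape, where entry [i][j] = 1 means an effect of intervention i on
    protein j was called in that condition.
    An entry of the result is 1 only under a strict majority (assumption:
    ties are resolved as 0).
    """
    n = len(calls)
    n_rows, n_cols = len(calls[0]), len(calls[0][0])
    return [
        [1 if 2 * sum(c[i][j] for c in calls) > n else 0 for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy example: one intervention, two proteins, three growth conditions.
gold = majority_vote([[[1, 0]], [[1, 1]], [[0, 0]]])
```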
Results
Figure 4 shows AUCs (with respect to changes seen under the test intervention) for each of the four cell lines and each of the methods. There was no single method that outperformed all others across all four cell lines. MRCL performed particularly well relative to the other methods for cell lines BT549 and MCF7 (k-NN also performed well for BT549), was competitive for cell line UACC812, but performed less well for cell line BT20. We note also that, for cell lines BT549 and MCF7, the performance of MRCL was competitive with the best performers in the DREAM challenge and with an analysis reported in Hill et al. (2017). The latter involved a Bayesian model specifically designed for such data. In contrast, MRCL was applied directly to a data matrix comprising all training samples simply collected together.
Figure 4.
Results for data set D2 (protein time course data). Each panel is a different cell line, with its own training and (interventional) test data. AUC is with respect to an entirely held-out intervention. See text for details.
3.4. Data Set D3: Human Cancer Data
Data
The data consisted of protein measurements for p = 35 proteins measured in n = 820 human breast cancer samples (from biopsies). The data originate from The Cancer Genome Atlas (TCGA) Project, are described in Akbani et al. (2014) and were retrieved from The Cancer Proteome Atlas (TCPA) data portal (Li et al., 2013, https://tcpaportal.org; data release version 4.0; Pan-Can 19 Level 4 data). Data for many cancer types are available, but here we focus on a single type (breast cancer) to minimize the potential for confounding by cancer type. It is at present difficult to carry out interventions in biopsy samples of this kind. However, we focused on the same 35 proteins as in data set D2, whose mutual causal relationships are relatively well-understood, and used a reference causal graph for these proteins based on the biochemical literature (as reported in Hill et al., 2017).
Problem set-up
We formed a data set 𝒟 consisting of measurements for the p = 35 proteins for three different sample sizes: (i) ntrain = 200, (ii) ntrain = 500 or (iii) all ntrain = 820 patient samples. For (i) and (ii) patient samples were selected at random. We then used a random fraction ρ of the reference graph as background knowledge, testing output on the (unseen) remainder.
Results
Figure 5 shows AUCs (with respect to the held-out causal labels) as a function of the proportion ρ of causal labels that were observed, for each of the methods and for the three sample sizes. Results were averaged over 25 iterations. MRCL performed well relative to the other methods, with performance improving with ρ. Results were qualitatively similar for the three sample sizes, with increases in AUC for ntrain = 820 and ntrain = 500 relative to ntrain = 200. Results for PC are shown in Appendix C (Fig. 8) as points on the ROC plane.
Figure 5.
Results for data set D3 (human cancer data). Data are protein measurements from breast cancer patient samples from The Cancer Genome Atlas (TCGA). AUC is with respect to a reference graph based on the (causal) biochemical literature. Results are mean values over 25 iterations and error bars indicate standard error of the mean. See text for details. Additional results appear in Appendix C.
4. Discussion
In this paper, we showed how a key aspect of causal structure learning can be framed as a machine learning task. Although many available approaches, including those based on DAGs and related graphical models, offer a well-studied framework, we think it may be fruitful to revisit some questions in causality using machine learning tools.
In our experiments, based on three real data sets, we found that MRCL performed well relative to a range of graphical model-based approaches. However, two points should be noted regarding these comparative results. First, the various methods differ with respect to their required inputs and the nature of their outputs. This means that in some cases specific methods may not be an ideal fit to the context of the specific data/task (as detailed when presenting the empirical results above). Second, the biological systems underlying these data sets are likely to have features (such as causal insufficiency and cycles) that violate one or more of the assumptions of some of these methods. That said, we think biological data sets of the kind we focused on here offer perhaps the best opportunity at present to empirically study causal learning methods and that causal learning tasks of the kind addressed here are highly relevant in many applications, in biology and beyond. Hence, we think that pursuing empirical work on such data is valuable both from methodological and applied points of view. As more interventional data become available in the future, it will be important to carry out similar analyses in other contexts, in order to better understand the extent to which our findings generalize to other scientific settings.
An open question from a theoretical point of view is to understand conditions on data-generating processes needed to permit a discriminative approach as pursued here and we think this will be an interesting direction for future work. One point of view—analogous to that used in practical applications of classification—is to estimate the risk of the learner and thereby report an estimate of (causal) efficacy without having to directly consider requirements on the underlying system. We think this approach is acceptable when some causal information is available, since one can then empirically test problem-specific efficacy (as in our examples above). This then gives confidence with respect to generalization to new interventions on the system of interest (but does not address the broader theoretical question).
In our approach, information on multiple variable pairs is coupled via the classifier but not by global constraints on the graph. In the scientific settings we focused on we did not consider further coupling via global constraints but such constraints (e.g. enforcing transitivity) could be relevant in some applications and an interesting direction for further work. The main advantage of our approach is that it allows regularities in the data to emerge via learning, rather than having to be encoded via an explicit causal or mechanistic model. It also naturally provides some uncertainty quantification, in the sense of scores that can be used to guide decisions or future experimental work. The main disadvantage relative to methods rooted in DAGs and related graphical models is the lack of a full causal model. Albeit under relatively strong assumptions, DAG-based models, once estimated, can be used to shed light on a huge range of questions concerning causal relationships, including direct and ancestral effects, and details of post-intervention distributions. In contrast, our approach in itself provides only estimates of binary causal relationships. That said, given the efficacy and simplicity of our approach, we think it would be fruitful to consider coupling it to established causal tools in a two-step approach, with our methods used to learn an edge structure in a data-driven manner and this structure used to inform a full analysis in a second step. Such an approach would require some care to avoid bias, and sample splitting techniques that have been studied in high-dimensional statistics could be relevant (Wasserman and Roeder, 2009; Städler and Mukherjee, 2017).
Acknowledgments
The authors are grateful to the Editor and Reviewers for their constructive feedback on an earlier version of the manuscript. The authors are grateful to Umberto Noè for input on the empirical work and to Oliver Crook for feedback on an earlier version of the manuscript. This work was supported by the UK Medical Research Council (University Unit Programme number MC_UU_00002/2). CJO was supported by the ARC Centre of Excellence for Mathematics and Statistics, Australia, and the Lloyd’s Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
Footnotes
We emphasize that these are pragmatic assumptions motivated by the nature of experimental data and scientific applications, and not intended to be fundamental statements about causality. For example, Hyttinen et al. (2012) make the point that cycles can be removed by considering time-varying data on a suitable time scale, but that nevertheless cycles are common in causal scientific models in economics, engineering and biology due to the fact that measurements are usually taken at wider intervals.
Recall that a pseudo-metric d satisfies all of the properties of a metric with the exception that d(x, y) = 0 ⇏ x = y.
Code Availability
All computational analysis was performed in R (R Core Team, 2018). Source code for MRCL and scripts to generate the empirical results presented in Section 3 are available at https://github.com/Steven-M-Hill/MRCL.
Author contributions: CJO, DB and SM developed methodology based on original ideas by SM. SMH performed the computational work and CJO contributed theory. CJO and SM wrote the paper with input from SMH and DB.
Contributor Information
Steven M. Hill, Email: steven.hill@mrc-bsu.cam.ac.uk, MRC Biostatistics Unit, University of Cambridge, Cambridge, CB2 0SR, UK.
Chris J. Oates, Email: chris.oates@ncl.ac.uk, School of Mathematics, Statistics and Physics, Newcastle University, Newcastle-upon-Tyne, NE1 7RU, UK.
Duncan A. Blythe, Email: duncanblythe@googlemail.com.
Sach Mukherjee, Email: sach.mukherjee@dzne.de.
References
- Akbani R, Ng PKS, Werner HM, Shahmoradgoli M, Zhang F, Ju Z, Liu W, Yang JY, Yoshihara K, Li J, Ling S, et al. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nature Communications. 2014;5:3887. doi: 10.1038/ncomms4887.
- Belkin M, Niyogi P. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences. 2008;74(8):1289–1308.
- Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research. 2006;7:2399–2434.
- Cao Y, Chen D. Generalization errors of Laplacian regularized least squares regression. Science China Mathematics. 2012;55(9):1859–1868.
- Colombo D, Maathuis MH, Kalisch M, Richardson TS. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics. 2012;40(1):294–321.
- Cucker F, Zhou DX. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press; 2007.
- Fergus R, Weiss Y, Torralba A. Semi-supervised learning in gigantic image collections. Proceedings of the 23rd Annual Conference on Neural Information Processing Systems; 2009. pp. 522–530.
- Grigor’yan A. Heat kernels on weighted manifolds and applications. Cont Math. 2006;398:93–191.
- Hauser A, Bühlmann P. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research. 2012;13:2409–2464.
- Hill SM, Heiser LM, Cokelaer T, Unger M, Nesser NK, Carlin DE, Zhang Y, Sokolov A, Paull EO, Wong CK, Graim K, et al. Inferring causal molecular networks: empirical assessment through a community-based effort. Nature Methods. 2016;13(4):310–318. doi: 10.1038/nmeth.3773.
- Hill SM, Nesser NK, Johnson-Camacho K, Jeffress M, Johnson A, Boniface C, Spencer SEF, Lu Y, Heiser LM, Lawrence Y, Pande NT, et al. Context specificity in causal signaling networks revealed by phosphoprotein profiling. Cell Systems. 2017;4(1):73–83. doi: 10.1016/j.cels.2016.11.013.
- Hyttinen A, Eberhardt F, Hoyer PO. Learning linear cyclic causal models with latent variables. Journal of Machine Learning Research. 2012;13:3387–3439.
- Kemmeren P, Sameith K, van de Pasch LA, Benschop JJ, Lenstra TL, Margaritis T, O’Duibhir E, Apweiler E, van Wageningen S, Ko CW, van Heesch S, et al. Large-scale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors. Cell. 2014;157(3):740–752. doi: 10.1016/j.cell.2014.02.054.
- Li J, Lu Y, Akbani R, Ju Z, Roebuck PL, Liu W, Yang J-Y, Broom BM, Verhaak RGW, Kane DW, Wakefield C, et al. TCPA: a resource for cancer functional proteomics data. Nature Methods. 2013;10(11):1046–1047. doi: 10.1038/nmeth.2650.
- Lopez-Paz D, Muandet K, Schölkopf B, Tolstikhin I. Towards a learning theory of causation. Proceedings of the 32nd International Conference on Machine Learning; 2015. pp. 1452–1461.
- Maathuis MH, Kalisch M, Bühlmann P. Estimating high-dimensional intervention effects from observational data. Annals of Statistics. 2009;37(6A):3133–3164.
- Maathuis MH, Colombo D, Kalisch M, Bühlmann P. Predicting causal effects in large-scale systems from observational data. Nature Methods. 2010;7(4):247–248. doi: 10.1038/nmeth0410-247.
- Meinshausen N, Hauser A, Mooij JM, Peters J, Versteeg P, Bühlmann P. Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences. 2016;113(27):7361–7368. doi: 10.1073/pnas.1510493113.
- Mooij JM, Peters J, Janzing D, Zscheischler J, Schölkopf B. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research. 2016;17(32):1–102.
- Pearl J. Causality. Cambridge University Press; 2009.
- Peters J, Bühlmann P, Meinshausen N. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B. 2016;78(5):947–1012.
- R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2018. URL https://www.R-project.org/.
- Richardson T. A discovery algorithm for directed cyclic graphs. Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence; 1996. pp. 454–461.
- Spirtes P. Directed cyclic graphical representations of feedback models. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence; 1995. pp. 491–498.
- Spirtes P, Glymour CN, Scheines R. Causation, Prediction, and Search. MIT Press; 2000.
- Städler N, Mukherjee S. Two-sample testing in high dimensions. Journal of the Royal Statistical Society: Series B. 2017;79(1):225–246.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B. 1996;58(1):267–288.
- Wand MP, Jones MC. Kernel Smoothing. CRC Press; 1994.
- Wang J, Jebara T, Chang S-F. Graph transduction via alternating minimization. Proceedings of the 25th International Conference on Machine Learning; 2008. pp. 1144–1151.
- Wasserman L, Roeder K. High dimensional variable selection. Annals of Statistics. 2009;37(5A):2178. doi: 10.1214/08-aos646.
- Wasserman L. All of Nonparametric Statistics. Springer; 2006.