Summary
Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations in causal assumptions. However, prediction performance does depend on the selection of training and test sets. In particular, biased training sets can lead to optimistic assessments of model performance. In this work, we revisit the prediction performance of several recently proposed causal models tested on a genetic perturbation data set of Kemmeren [5]. We find that sample selection bias is likely a key driver of model performance. We propose using a less-biased evaluation set for assessing prediction performance and compare models on this new set. In this setting, the causal models have similar or worse performance compared to standard association-based estimators such as the Lasso. Finally, we compare the performance of causal estimators in simulation studies which reproduce the Kemmeren structure of genetic knockout experiments but without any sample selection bias. These results provide an improved understanding of the performance of several causal models and offer guidance on how future studies should use Kemmeren.
Keywords: causal inference, prediction, sample selection bias, genetic perturbation experiments
1 |. INTRODUCTION
Modern scientific experiments offer the possibility to evaluate causal models using prediction performance. A causal estimator predicts whether X is a cause of Y for a large number of (X, Y) pairs. Ground truth for some subset of these pairs is obtained and is used to score the predictions. Prediction performance assesses models without relying on statistical assumptions regarding the joint distribution of the random variables or complex and often unverifiable assumptions regarding confounding.
Large scale gene perturbation experiments offer one such opportunity. In these experiments, the expression of certain genes is perturbed and the effects of these perturbations on other genes are measured. These experiments provide insight into which genes causally influence each other. For gene perturbation experiments, causal prediction performance may be assessed as follows. Gene X has a causal effect on gene Y if when gene X is set to a value outside its “normal” range, the expression of Y is outside its “normal” range. Given p genes there are p(p−1) possible ordered cause–effect pairs (excluding self pairs). Ground truth can be obtained for some fraction of these gene pairs via knockout or knockdown experiments where the expression of a gene is set to a very low value. A causal model may use observational and/or perturbation data to make causal predictions for all gene pairs. The ground truth from the knockout experiments can then be used to score models and compare performance across models. High scoring models could be used to prioritize limited resources for follow–up experiments.
This strategy was used to compare performance of several recently proposed causal models on a yeast knockout data set produced in [5] (hereafter Kemmeren). The causal models assessed include Invariant Causal Prediction (ICP) and the Causal Dantzig (CD) which fit regression models which are invariant across data collection environments [9, 8, 6] and Local Causal Discovery (LCD) which tests for dependence among gene expression and context variables [11]. These models were compared to standard (non–causal) models such as Lasso regression. The causal models (ICP, CD, LCD) are expected to outperform non–causal models because when X and Y are correlated, non–causal models are unable to distinguish between X causes Y, Y causes X, or no causal relationship (correlation induced by a third variable which is a cause of both X and Y). Indeed on the Kemmeren data set ICP, CD, and LCD were all found to have performance superior to non–causal models (in particular the Lasso and Boosting).
The purpose of this work is to identify a source of bias in performance evaluation on Kemmeren, propose an alternative, less-biased method of evaluation, and study the performance of the causal estimators in simulations which reproduce the structure of the Kemmeren data. These results provide an improved understanding of the predictive performance of these causal models on gene perturbation data sets. Our contributions and results are summarized as follows: The essential problem in Kemmeren is that perturbations (knockouts) are performed not on a random sample of genes but on genes which are believed to be causes of expression changes in other genes (“putative regulators” according to [5]). Sample selection bias produces optimistic assessments of the predictive performance of causal models because correct predictions are more likely to have ground truth available (and thus be scored) than incorrect predictions. We see evidence of this when reviewing top ranked causal gene pairs returned by the causal estimators. To address this problem, we propose scoring only causal predictions (X, Y) where ground truth is available for both the X and Y knockout ((X, Y) and (Y, X) have ground truth available). We reassess the performance of several causal estimators on Kemmeren using this new criterion. Under this new criterion, the causal estimators do not demonstrate performance improvement over Lasso regression. To further explore these issues in an environment where biased follow–up does not affect results, we simulate data following the experimental setup of Kemmeren. These simulations reproduce the sample size, data dimension, and knockout complexity of Kemmeren. None of the previous works studying Kemmeren presented simulation results from data generating distributions meant to approximate Kemmeren. In the simulation settings studied, causal models are unable to consistently outperform association based estimators.
The simulations provide further evidence that the experimental setup of Kemmeren is challenging for discovering cause–effect gene pairs.
This work is organized as follows. Section 2 introduces several of the causal estimators studied in this work, describes the Kemmeren data set, and identifies a potential source of bias in performance assessment on Kemmeren. Section 3 proposes a new criterion for scoring causal predictions on Kemmeren and reassesses the performance of models based on this metric. Section 4 presents simulation results which provide insight into the difficulty of causal inference with the Kemmeren data set. We conclude with discussion in Section 5.
2 |. CAUSAL MODELS AND KEMMEREN
2.1 |. Causal Inference and Prediction
Causal models seek to answer how changing a variable X will affect another variable Y. Causal estimands may be formally expressed in the Neyman–Rubin potential–outcomes framework or via Pearl counterfactuals derived from structural equations [10, 7]. For example, in the Neyman–Rubin potential–outcomes framework, E[YX=x] is the expected value of Y when X is set to x by an external intervention. Formally, YX=x is a set of random variables (indexed by x) called potential outcomes which are distinct from Y. Causal estimands are not generally identifiable from the joint distribution of the observed random variables (X, Y). Thus statistical assumptions on the joint distribution (e.g. Y is a linear function of X) are not sufficient for causal inference. Causal estimators require a separate set of causal assumptions regarding the potential outcome random variables. For example, the “ignorability assumption” YX=x ⫫ X|C (read as “the value Y would take if X is set to x is independent of X given C”) and causal consistency assumptions imply E[YX=x] = EC[E[Y | X = x, C]]. Since EC[E[Y | X = x, C]] is a function of the joint distribution of (X, Y, C), under the ignorability and consistency assumptions the causal estimand of interest, E[YX=x], is identifiable. Causal assumptions such as ignorability can be especially questionable (relative to statistical assumptions) since they are generally not verifiable from observational data (even asymptotically).
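As a toy illustration of identification under ignorability (a Python sketch with a simulated binary confounder C, not an analysis from this paper), the adjusted contrast recovers the causal effect while the naive contrast does not:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
C = rng.integers(0, 2, size=n)                    # observed confounder
X = (rng.random(n) < 0.2 + 0.6 * C).astype(int)   # treatment depends on C
Y = 1.0 * X + 2.0 * C + rng.normal(size=n)        # true causal effect of X on Y is 1.0

# naive contrast is confounded by C
naive = Y[X == 1].mean() - Y[X == 0].mean()

# adjustment: average the within-stratum contrasts, weighted by P(C = c)
adjusted = sum(
    (Y[(X == 1) & (C == c)].mean() - Y[(X == 0) & (C == c)].mean()) * (C == c).mean()
    for c in (0, 1)
)
print(round(naive, 1), round(adjusted, 1))  # naive is biased upward; adjusted is near 1.0
```

Here ignorability holds by construction because C is the only confounder; with an unobserved confounder the adjusted estimate would remain biased.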
Recently several works have sought to validate causal models using prediction. This form of validation is possible when there are many causal questions of interest and ground truth can be obtained for a subset. A statement such as “Causal model M provides an accurate estimate of the causal effect of smoking on lung cancer” cannot generally be validated using prediction performance because there is a single causal effect of interest (effect of smoking on cancer) and validating the model experimentally would obviate the need for the model in the first place (along with being ethically and financially impractical).
In contrast, a statement such as “Causal model M successfully identifies cause–effect yeast gene pairs” can be validated in a prediction accuracy framework. Model M predicts whether gene X causes gene Y for every (X, Y) pair (e.g. ranks all gene pairs from most to least likely to be cause–effect pairs).
Ground truth is obtained for some subset of these predictions by external interventions on the purported causes. The method of obtaining ground truth may vary from application to application. One method is to “knockout” gene X. In a knockout experiment, X is set to a low value (typically outside its normal range) and then Y is observed. Gene X has a causal effect on Y if the distribution under the knockout is different from the distribution in an observational (wild type) condition, i.e. YX=0 ≠d Y (the distributions differ). Since we may only have one knockout of X, we will only have one observation yx=0 generated from the YX=0 distribution. Thus one cannot conclusively determine YX=0 ≠d Y. Further it may be of interest to identify gene pairs (X, Y) for which X has a large causal effect on Y, i.e. the distribution of YX=0 is very different from the distribution of Y. One possibility is to estimate the range of Y from a non–interventional sample y1, …, yn. If the knockout sample is outside the range of observational Y, i.e. yx=0 ∉ [min yi, max yi], then we conclude that X has a (most likely large) effect on Y.
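In code, this range criterion is a one-liner; the following Python sketch (illustrative data, not from the paper) makes the definition concrete:

```python
import numpy as np

def significant_effect(y_obs, y_knockout):
    """TRUE if the single post-knockout measurement of Y lies outside
    the range of Y observed in the wild-type (observational) samples."""
    return bool(y_knockout < np.min(y_obs) or y_knockout > np.max(y_obs))

y_obs = np.array([0.10, -0.30, 0.25, 0.05, -0.15])  # observational sample of Y
print(significant_effect(y_obs, y_knockout=-1.70))  # -> True (far below the range)
print(significant_effect(y_obs, y_knockout=0.00))   # -> False (inside the range)
```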
Once ground truth, however defined, is obtained for a sample of (X, Y) pairs, the performance of a set of causal models Mj for j = 1, …, J can be computed and compared. For example we may consider Mj to be the best causal model if it has the highest accuracy or the largest area under the Receiver Operating Characteristic curve as compared with all other models (when scored on the set of (X, Y) for which ground truth is available). If the performance of model Mj is deemed sufficiently good by some measure, then performance on non–validated causal pairs (X, Y) should be good as well (up to sampling variability). In practical terms Mj could be used to guide the choice of future knockout experiments to maximize the number of causal pairs discovered.
Prediction performance measures do not rely on validity of either the statistical or causal assumptions used by the model. For example, M1 may make ignorability assumptions while M2 may avoid these assumptions using instrumental variable techniques but instead assume linear causal relations among the variables. Which model is better? The standard approach is to attempt to assess which model assumptions are more plausible for the given application. This is very difficult. Both models are approximations and will be wrong to some degree. The ignorability assumption of model M1 is generally impossible to validate, even asymptotically. An alternative is to select the model with better causal prediction performance. This prediction based evaluation of causal models can be seen as an extension of Breiman’s proposals for model validation to the field of causality [1].
A major caveat to the above discussion regards the selection of the set of (X, Y) on which ground truth is obtained. If ground truth is available for a simple random sample of all potential causal pairs, then we expect that a model’s performance on the ground truth set will generalize well to the set of untested pairs. However if the ground truth set represents a biased selection, e.g. highly enriched for true effects, then the performance may not generalize. We discuss this issue now in the context of the Kemmeren data set.
2.2 |. Kemmeren
Kemmeren measured expression of p = 6170 genes with n1 = 262 wild-type samples where no intervention was performed and n2 = 1479 interventional samples where one gene was knocked out. Each gene was knocked out at most once (so there are n2 genes which were knocked out once and p − n2 genes which were not knocked out). Figure 1 shows scatterplots of the observational and interventional data for two pairs of genes. The red points (ko) are interventional samples where one gene’s expression is set very low. The blue points (obs) are non-interventional samples. The variance of the gene expression is higher for the interventional samples because knockouts perturb the expression of the knocked out gene and all downstream targets, leading to greater variability in expression.
FIGURE 1.

Two gene pairs from Kemmeren.
Kemmeren has been used to evaluate the predictive performance of several causal models including ICP, the CD and LCD [8, 9, 11, 6]. The performance of these models has been compared to purely association based methods such as Lasso regression. The following is a summary of how these models are fit and evaluated on Kemmeren. We follow closely the procedures used in the existing studies so that we can reproduce their results. Since each of these studies used somewhat different evaluation strategies, we do not exactly follow any single study.
Since there are p genes, there are p(p − 1) ordered pairs of genes (i, j) for which causal predictions can be made. The causal models rank all these ordered pairs from most likely to least likely to be cause–effect pairs. Ground truth for the gene pair (i, j) is determined by observing the expression of gene j when gene i is knocked out. Following [8], we say that gene i has a significant causal effect on gene j if in the sample where gene i is knocked out, the expression of gene j is outside of the range of gene j measured in the n1 observational samples. See [11] and [6] for other possible definitions of causal effect.
Ground truth (i.e. knowledge of whether i has a significant causal effect on j) is only available for n2(p − 1) = 9,125,430 of these gene pairs (roughly 25% of all gene pairs). When the causal models are scored, only these predictions are considered. Predictions for the remaining ~ 75% of gene pairs are not used because ground truth is not available. About 6% of gene pairs with ground truth are significant causal effects.
All models are fit using a cross–validation strategy which prevents the model from using the result of the i knockout when making a prediction on the (i, j) cause–effect pair. In particular, the n2 interventional samples are split into K folds. The causal models are trained on all the n1 observational samples and K − 1 folds of the interventional samples. The causal models then make predictions for the held out interventions (approximately n2/K × p pairs). Here we set K = 3.
Finally the models are trained on 100 bootstrap samples of the training data (n1 wild type samples and K − 1 folds of interventional data). For each bootstrap sample the models are fit and coefficients computed. Coefficients are ranked by absolute size. These ranks are averaged across the bootstrap samples to produce final rankings. This bootstrap procedure does not use the held out fold of data.
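The bootstrap ranking step can be sketched as follows (illustrative Python with a generic `fit` function standing in for any of the estimators; the paper's actual implementation uses R packages):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_rank(fit, X, Y, n_boot=100):
    """Average each coefficient's rank (1 = largest absolute value)
    over bootstrap resamples of the training rows."""
    n, p = X.shape
    ranks = np.zeros((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        coef = np.abs(fit(X[idx], Y[idx]))
        order = np.argsort(-coef)                 # variable indices sorted by |coef|
        r = np.empty(p)
        r[order] = np.arange(1, p + 1)
        ranks[b] = r
    return ranks.mean(axis=0)                     # smaller = more consistently large

# toy check using ordinary least squares as the "fit"
X = rng.normal(size=(200, 3))
Y = 2.0 * X[:, 0] + rng.normal(size=200)          # only the first variable matters
avg = bootstrap_rank(lambda A, y: np.linalg.lstsq(A, y, rcond=None)[0], X, Y, n_boot=20)
print(int(avg.argmin()))  # the truly relevant variable gets the best average rank
```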
We assess the performance of four methods:
Lasso Regression (L1): The response gene is regressed on the expression of all other genes, ignoring knockout information. The glmnet package is used to fit the model [2]. The model with 4 nonzero coefficients is selected.
Causal Dantzig (CD): The CD is fit to the 4 variables selected by L1. We briefly describe the CD now and refer readers to [9] for full details. Let X denote a set of genes that may be causes of Y. The CD assumes a linear model

Y = Xβ + ϵ,
where the error term ϵ may be correlated with X to account for hidden confounders. The parameter βj from the above model is the causal effect of Xj on Y. With hidden confounders and reverse causality, regressing Y on X is well known to produce inconsistent estimates of β. The CD assumes observations (X, Y) are generated from environments e ∈ {1, 2}. For Kemmeren, e = 1 corresponds to the n1 observational samples while e = 2 corresponds to the n2 knockout samples. Let Xe (Ye) represent the design matrix (response vector) for environment e. The CD estimates β with

β̂ = [(X1)ᵀX1/n1 − (X2)ᵀX2/n2]^(−1) [(X1)ᵀY1/n1 − (X2)ᵀY2/n2],

the inverse of the difference of the two environments’ Gram matrices applied to the difference of the cross–moment vectors.
The form of the estimator is motivated by seeking inner product invariance across environments. Under assumptions on the complexity and form of environment interventions and regularity assumptions on the error term, β̂ is an asymptotically normal estimator of β. The CD is fit using the causalDantzig function in the InvariantCausalPrediction R package (version 0.7–1).
Invariant Causal Prediction (ICP): ICP is fit to the 4 variables selected by L1 and maximin coefficients returned. We briefly describe ICP now and refer readers to [8] for full details. ICP identifies a set of covariates R such that the distribution of Ye given XRe = x does not depend on the environment e for any x. The implementation of ICP used here (as well as in the original application of ICP to Kemmeren in [8]) assumes linear models so Ye = XReβR + ϵe. The set R and the coefficient values βR are identified by first fitting the linear model using least squares and then testing if a) the distribution of ϵe depends on e and b) the value of βR depends on e. For a given R, if either hypothesis is rejected, then R is rejected and is unlikely to be the set of causes of Y. If coefficient sets R1, …, RK are not rejected, then the estimated causal set is the intersection of the Rk. The maximin coefficient for βj is computed by taking a union of the K confidence intervals for βj (one confidence interval for each Rk) and then selecting the point in the confidence interval union closest to 0. ICP is fit using the ICP function in the InvariantCausalPrediction R package (version 0.7–1).
L1 Random (L1R): The coefficients for L1 selected variables (i.e. variables with non–zero coefficients) are randomly permuted among the selected variables. Thus L1 is used for variable selection, but the actual non-zero coefficient estimates are replaced by noise (the coefficients of different variables). This serves as a useful benchmark (particularly in the simulations) which a well performing method should outperform.
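To make the CD construction concrete, here is a minimal numerical sketch of the unregularized estimator (the inverse of the difference of the two environments' Gram matrices, as described in [9]); the data-generating process is made up for illustration and is not the Kemmeren data:

```python
import numpy as np

rng = np.random.default_rng(1)

def causal_dantzig(X1, Y1, X2, Y2):
    """Unregularized Causal Dantzig: solve the difference of Gram
    matrices against the difference of cross-moment vectors."""
    G = X1.T @ X1 / len(X1) - X2.T @ X2 / len(X2)
    Z = X1.T @ Y1 / len(X1) - X2.T @ Y2 / len(X2)
    return np.linalg.solve(G, Z)

def gen(n, shift):
    """One environment: hidden confounder H biases ordinary regression;
    `shift` plays the role of an intervention on X."""
    H = rng.normal(size=n)
    X = H + rng.normal(size=n) + shift
    Y = 1.5 * X + H + rng.normal(size=n)          # true causal effect is 1.5
    return X[:, None], Y

X1, Y1 = gen(20_000, shift=0.0)                   # "observational" environment
X2, Y2 = gen(20_000, shift=3.0)                   # "interventional" environment
ols = (X1.ravel() @ Y1) / (X1.ravel() @ X1.ravel())
cd = causal_dantzig(X1, Y1, X2, Y2)[0]
print(round(ols, 2), round(cd, 2))  # OLS is biased (near 2.0); CD is near 1.5
```

The shift intervention changes the second moments of X across environments, which is exactly the variation the Gram-matrix difference exploits; when the intervention is too weak the difference becomes near-singular, matching the invertibility caveat discussed in Section 4.2.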
2.3 |. Performance Evaluation
Figure 2 presents an ROC curve comparing the performance of the methods. Panel a) containing the top predictions is a zoomed in version of panel b). Panel a) shows that the performance of Invariant Causal Prediction (ICP) and the Causal Dantzig (CD) is particularly good for the top predictions (5/5 for ICP and 4/5 for CD). Further out on the ROC curve in b), the performance of CD, L1, and L1R are similar and ICP is worse than the other three. This roughly reproduces the results shown in Figure 2 of [6] and Figure 4 of [11]. The significant enrichment of causal effects in the top 5 or 10 predictions made by ICP and CD was seen in these works as evidence that these methods are identifying causal relations. For example, since only 6% of (i, j) pairs represent significant causal effects, the probability of randomly selecting 5 true significant causal effects in the top 5 predictions is less than 10^−6.
FIGURE 2.

ROC curves for estimators scoring all gene pairs with ground truth available. a) Top predictions. b) Larger set of predictions.
To further investigate these predictions, Table 1 shows the top 10 predictions made by ICP for which ground truth was available. The column cause is the name of the purported causal gene and the column effect contains the name of the purported effect gene. The column res is TRUE if this pair is truly causal as determined by the ground truth knockout experiment and FALSE otherwise. All of the top 5 predictions are correct and 8 of the top 10 are correct. Several of the top predicted pairs match the predictions contained in Table 4 of [6]. The rank column contains the overall rank of the prediction, which includes gene pairs for which ground truth is not available. For example, the causal pair YJL141C→YJL142C is the second ranked pair for which ground truth is available (because it is in the second row of the table) but the 8th-ranked pair overall (ranks may have decimals to break ties). The pair YCL040W→YCL042W is the third ranked pair for which ground truth is available (because it is in the third row) but the 17th-ranked pair overall.
TABLE 1.
ICP rankings
| cause | effect | res | rank | res-flip | rank-flip |
|---|---|---|---|---|---|
| YLL019C | YLL020C | TRUE | 1.0 | NA | 3 |
| YJL141C | YJL142C | TRUE | 8.0 | NA | 16 |
| YCL040W | YCL042W | TRUE | 17.0 | NA | 60 |
| YDR432W | YDR433W | TRUE | 20.5 | NA | 7 |
| YGR152C | YGR151C | TRUE | 33.0 | NA | 53 |
| YDR155C | YDR154C | TRUE | 52.0 | NA | 796 |
| YDR101C | YLR276C | FALSE | 55.5 | NA | 28845024 |
| YJL168C | YDL074C | FALSE | 61.0 | FALSE | 34506690 |
| YML058W | YKL037W | TRUE | 76.0 | NA | 19112995 |
| YMR104C | YMR103C | TRUE | 82.0 | NA | 14 |
The column res-flip contains ground truth for the reverse knockout. It is NA when the reverse knockout was not performed. For example, for the first row the reverse prediction is YLL020C→YLL019C. Since YLL020C was not knocked out, there is no ground truth for this prediction. Thus the res-flip column is NA. Of the top 10 predictions, in only one case was the reverse knockout performed and in that case the reverse pair was not causal (YDL074C is not a cause of YJL168C). The column rank–flip is the rank of the flipped cause–effect pair, e.g. for the first row the rank–flip is the rank of YLL020C→YLL019C (which is 3). Interestingly, ICP often ranks the flipped pairs as being as likely to be cause–effect as the actual pair. For example for the first row the rank of the reverse pair YLL020C→YLL019C is 3, meaning it was judged the third most likely cause–effect pair of all ~ 36 million pairs. In two instances the flipped rank is lower than the rank of the pair itself, indicating ICP believes it is more likely the flipped pair is cause–effect than the actual scored pair. Ground truth is generally not available for the often highly ranked flipped pairs, so they are not scored.
Thus we observe that ICP often ranks both (i, j) and (j, i) highly but that typically only one of these pairs is scored. There are two possible explanations for this phenomenon:
ICP is correctly identifying feedback loops (gene i is a cause of j and gene j is a cause of gene i). Under this explanation, if the reverse knockouts had been done, then the effects would alter the causes.
Since the Kemmeren data set obtained ground truth on “putative regulators”, when ICP ranks both (i, j) and (j, i) highly, whichever is the true cause–effect pair is actually scored while the incorrect pair is not scored. Essentially if i is a cause of j and not the reverse, then Kemmeren is more likely to knockout i (and thus obtain ground truth for (i, j)) than knock out j and obtain ground truth for (j, i). Under this explanation, if the reverse knockouts had been done, few or none of them would have been scored as correct.
ICP is not generally consistent under feedback loops that include the target variable, a point which favors the second explanation. We note that the situation is similar, although less severe, for other methods including L1 regression and the Causal Dantzig. We conclude that the sample selection bias introduced by knocking out “putative regulators” may be a key driver in the performance of methods. In the following sections we seek performance metrics which are less sensitive to gene knockout selection bias and thus more informative regarding the causal predictive performance of the models.
3 |. MODIFIED SCORING SET
We propose scoring causal algorithms on a subset of gene pairs (i, j) for which ground truth is known. We argue that this protects against the performance inflation identified in the previous section. Let Fij = 1 if gene i has a significant causal effect on gene j, and Fij = 0 otherwise.
Causal prediction algorithms predict the Fij values. Since there are p genes, there are p(p − 1) interesting Fij values (Fii is not interesting because it is the causal effect of gene i on itself). Since only n2 gene knockouts were performed in Kemmeren, Fij is known (and algorithms can be scored) on at most n2(p − 1) gene pairs. Figure 3 illustrates the Fij matrix. Fij is known for the first n2 rows (union of orange and blue regions). Scoring algorithms on this entire set can result in biased conclusions because the blue region will have a higher proportion of causal effects (higher proportion of Fij = 1) than the grey region due to selection of putative regulators as knockouts. Specifically, an algorithm can rank both (i, j) and (j, i) highly when i and j are highly correlated and then use the selection effect to only score the pair that is most likely to be causal. In the case of Figure 3, (i, j) will be scored and (j, i) will not be scored.
FIGURE 3.

Ground truth availability and proposed scoring set for Kemmeren. Scoring predictions on all the data where ground truth is available (union of the blue and orange regions) results in biased assessments of model performance. We propose scoring models on the orange set where ground truth for the reverse knockout is always available.
Our solution is to only score gene pairs in the orange region of Figure 3. We term this scoring set S. S satisfies the property (i, j) ∈ S if and only if (j, i) ∈ S. Models are scored on both (i, j) and (j, i) or neither. Since there are n2 = 1479 genes which were knocked out (and the expression of every other gene was measured on these knockouts), the orange scoring region contains a total of n2(n2 − 1) = 2,185,962 gene pairs. This represents approximately 1/16 of all pairs of genes (1/4 of genes are knocked out and these can be scored on the 1/4 of knocked out genes). Using scoring set S, a model cannot obtain good performance by ranking highly correlated pairs highly and letting the “putative regulators” selection effect score only the correct prediction. For example, if i′ and j′ are highly correlated, then an algorithm can rank both (i′, j′) and (j′, i′) as being causal pairs. Both of these will be scored since they are both in the orange region. In contrast, neither (i, j) nor (j, i) will be scored because no ground truth is available for (j, i) and scoring only (i, j) could bias performance measures.
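Constructing S is mechanical once the list of knocked-out genes is known; a small Python sketch (gene indices illustrative):

```python
def scoring_set(knocked_out):
    """Ordered pairs (i, j), i != j, where BOTH genes were knocked out,
    so ground truth exists for the pair and for its reverse."""
    return {(i, j) for i in knocked_out for j in knocked_out if i != j}

S = scoring_set({0, 1, 2})                # 3 knockouts -> 3 * 2 = 6 ordered pairs
print(len(S))                             # -> 6
assert all((j, i) in S for (i, j) in S)   # closed under reversal, as required

# At Kemmeren scale, n2 = 1479 knockouts give n2 * (n2 - 1) scorable pairs
print(1479 * 1478)                        # -> 2185962
```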
In the original studies with Kemmeren, the algorithms (ICP, CD) were scored on the union of the blue and orange regions. We now score the algorithms only on set S (orange region). Figure 4 shows the ROC curves for the four models when scoring the algorithm performance only on set S. The performance of all methods has significantly deteriorated. ICP is now the worst performing method. The performance improvement of CD over L1 is the result of only 3 correct predictions. There is no evidence that ICP is outperforming purely association based methods such as L1 and little evidence that CD is outperforming L1.
FIGURE 4.

ROC curves for estimators scoring on set S. a) Top predictions. b) Larger set of predictions.
4 |. SIMULATIONS
We test the performance of the four estimators in simulations which reproduce the structure of Kemmeren in terms of sample size (n), number of variables (p), number of knockouts performed (n2), and type of knockout (single gene knockouts with each gene knocked out at most once). The simulations use linear models without any hidden confounding or causal feedback loops. This setting is designed to be favorable for ICP and CD because both estimators assume linear models and ICP assumes no hidden confounding.
A main goal of these simulations is to study when and by how much causal estimators such as CD and ICP can outperform association based methods such as L1 regression. Towards this end, we simulate under parameter settings which are easy for L1 and very difficult for L1. In the former, one hopes that CD and ICP can match the performance of L1 while in the latter CD and ICP can beat L1.
4.1 |. Simulation Parameters
Data is generated from a directed acyclic graph with p + 1 nodes X1, …, Xp+1 where p = 6400. Let upper triangular matrix A represent edges in the DAG with Aij the causal effect of Xi on Xj. The response of interest is Y = Xp/2+1. The goal is to identify direct causes of Y, i.e. determine which entries of column p/2 + 1 of A are non–zero.
The matrix A is randomly generated on each simulation run. We simulate Strong causes where the causal coefficients of X on Y are large (relative to the effects of Y) and Weak causes where they are small (again relative to the effects of Y). In the Strong causes case we expect L1 regression to perform well because the nodes which are most strongly associated with Y are actually causes of Y. In the Weak case, we expect L1 regression to perform poorly because the nodes which are most strongly associated with Y are actually effects of Y. A is generated in the following manner:
Aij is drawn independently with P(Aij = −1) = P(Aij = 1) = nt/(2p) and P(Aij = 0) = 1 − nt/p. The nt parameter controls the density of the connections in the network. We consider nt = 20, 40, 80, 160, 320. For the nt = 20 simulation, X1 has approximately 20 descendants and for the nt = 320 simulations, X1 has approximately 320 descendants.
The lower triangle (and diagonal) entries of A are subsequently set to 0, ensuring A represents an acyclic graph.
All causes and effects of Y are set to 0 (p/2 + 1 row and column of A set to 0).
Nodes Xp/2+1−p0, …, Xp/2 are set to be causes of Y = Xp/2+1 and nodes Xp/2+2, …, Xp/2+1+p0 are set to be effects of Y. Note that since p0 is 1 or 2, this amounts to Y having either 1 direct cause and 1 direct effect (p0 = 1 case) or 2 direct causes and 2 direct effects (p0 = 2 case). Specifically P(Ak,p/2+1 = s1) = P(Ak,p/2+1 = −s1) = 1/2 for k = p/2+1−p0, …, p/2 and P(Ap/2+1,k = s2) = P(Ap/2+1,k = −s2) = 1/2 for k = p/2 + 2, …, p/2 + 1 + p0. The parameters s1 and s2 control the strength of the causes of Y and the effects of Y respectively. Large values of s1 result in causes of Y which are highly correlated with Y. Similarly large values of s2 result in the effects of Y being strongly correlated with Y. Thus the regime where s1 is small and s2 is large is challenging for association based estimators because the X nodes which are most strongly correlated with Y are the effects of Y, not the causes of Y. In the Strong (easy for L1) setting, s1 = s2 = 1 and in the Weak (challenging for L1) setting, s1 = 1 and s2 = 1000.
A is column normalized.
In each simulation run, n1 = 300 non-interventional observations are generated and n2 = p/4 = 1600 knockouts are generated. Thus there are a total of n = 1900 observations. Following Kemmeren, each gene is knocked out either 0 or 1 times. Non–interventional data is generated according to X1 = ϵ1 and

Xj = Σi<j Aij Xi + ϵj

for j > 1 where ϵj ~ N(0, 1). For the interventional data, when gene j is knocked out its expression is shifted by −40. The p0 direct causes and p0 direct effects of Y are never knocked out.
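The data-generating steps above can be sketched as follows (illustrative Python at a much smaller scale than p = 6400; the column normalization step is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(p=200, nt=20, n1=50):
    """Upper triangular A with +/-1 entries of expected density nt/p;
    observational data follow the linear SEM X_j = sum_{i<j} A_ij X_i + eps_j."""
    A = rng.choice([-1.0, 0.0, 1.0], size=(p, p),
                   p=[nt / (2 * p), 1 - nt / p, nt / (2 * p)])
    A = np.triu(A, k=1)                   # zero diagonal and lower triangle: acyclic
    eps = rng.normal(size=(n1, p))
    X = np.zeros((n1, p))
    for j in range(p):                    # generate in topological (index) order
        X[:, j] = X[:, :j] @ A[:j, j] + eps[:, j]
    return A, X

A, X = simulate()
print(X.shape)                            # -> (50, 200)
```

A knockout of gene j would be simulated by adding the shift of −40 to X[:, j] inside the loop, so that the perturbation propagates to all descendants of j.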
4.2 |. Fitting Models and Performance Evaluation
The four models (ICP, CD, L1, and L1R) are fit as described in Section 2.2. Performance is evaluated in the following manner: Each model returns a coefficient vector β̂ where β̂j is the estimated causal effect of Xj on Y. The estimates are ranked by absolute size and the number of true causes among the top ranked p0 estimates is recorded. For p0 = 1 the number of true causes among the top ranked p0 estimates is either 0 or 1. For p0 = 2, it is 0, 1 or 2. We compute the mean number of true causes in the top ranked p0 estimates across the N = 100 simulation runs, regenerating A, the errors ϵ, and the set of genes to knock out on each run. The best possible performance for the p0 = 1 simulations is 1: in all N simulation runs the true cause of Y always had the largest coefficient estimate in absolute size. The best possible performance for the p0 = 2 simulations is 2: in all N simulation runs the two true causes of Y always had the largest two coefficient estimates in absolute size. We compare the performance of the same four methods analyzed in Section 2.2. We do not implement the bootstrap sampling strategy for these results.
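The evaluation metric is a simple top-p0 count; a Python sketch (coefficient values and cause indices illustrative):

```python
import numpy as np

def true_causes_in_top(coef, true_causes, p0):
    """Number of genuine direct causes among the p0 largest
    coefficient estimates in absolute value."""
    top = np.argsort(-np.abs(np.asarray(coef)))[:p0]
    return len(set(top.tolist()) & set(true_causes))

# p0 = 2 example: the true causes are variables 0 and 3
coef = [2.5, 0.1, -0.4, -3.0, 0.2]
print(true_causes_in_top(coef, true_causes=[0, 3], p0=2))  # -> 2
print(true_causes_in_top(coef, true_causes=[1], p0=1))     # -> 0
```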
On a given simulation run, ICP and/or CD may return all-0 parameter estimates or an error. For example, the maximin coefficients computed by ICP are 0 when the method determines there is insufficient information to assign causality to any variable. CD requires the inverse of the difference of two Gram matrices to be well defined; if this difference is not invertible, the CD estimator is undefined. In these cases, we assign these methods the L1R parameter estimates. This is most likely to happen when the interventions are not sufficiently complex to estimate the causal parameters [9].
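The fallback rule can be sketched as follows (a sketch with hypothetical names; the exception type caught for a CD failure is an assumption for illustration):

```python
import numpy as np

def estimates_with_fallback(fit_primary, beta_l1r):
    """Return the primary estimator's coefficients, falling back to the
    L1R estimates when the fit errors out or returns all zeros."""
    try:
        beta = fit_primary()
    except np.linalg.LinAlgError:   # e.g. a non-invertible Gram-matrix difference in CD
        return beta_l1r
    if not np.any(beta):            # e.g. all-zero maximin estimates from ICP
        return beta_l1r
    return beta
```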
4.3 |. Results
Results are summarized in Table 2. With Strong causes, L1 regression obtains perfect performance: 1 in the p0 = 1 case and 2 in the p0 = 2 case. This is not surprising: in the Strong case the true causes of Y are very highly correlated with Y relative to all other variables, and so have the largest parameter coefficients.
TABLE 2.
Simulation results. For the p0 = 1 rows, the best performance is 1 and the worst performance is 0. For the p0 = 2 rows, the best performance is 2 and the worst performance is 0.
| nt | Strong: L1 | Strong: L1R | Strong: CD | Strong: ICP | Weak: L1 | Weak: L1R | Weak: CD | Weak: ICP |
|---|---|---|---|---|---|---|---|---|
| p0 = 1 | | | | | | | | |
| 20 | 1 | 0.27 | 0.70 | 0.29 | 0 | 0.19 | 0.43 | 0.19 |
| 40 | 1 | 0.32 | 0.71 | 0.34 | 0 | 0.34 | 0.37 | 0.34 |
| 80 | 1 | 0.25 | 0.82 | 0.27 | 0 | 0.25 | 0.35 | 0.25 |
| 160 | 1 | 0.32 | 0.84 | 0.33 | 0 | 0.26 | 0.46 | 0.26 |
| 320 | 1 | 0.28 | 0.85 | 0.30 | 0 | 0.32 | 0.39 | 0.32 |
| p0 = 2 | | | | | | | | |
| 20 | 2 | 0.96 | 1.37 | 0.97 | 0 | 0.95 | 0.78 | 0.95 |
| 40 | 2 | 1.02 | 1.47 | 1.02 | 0 | 1.05 | 0.79 | 1.05 |
| 80 | 2 | 1.06 | 1.52 | 1.06 | 0 | 0.99 | 0.66 | 0.99 |
| 160 | 2 | 1.16 | 1.61 | 1.17 | 0 | 1.00 | 0.78 | 1.00 |
| 320 | 2 | 1.10 | 1.77 | 1.11 | 0 | 1.01 | 0.82 | 1.01 |
Recall that L1R randomly permutes the ranks of the coefficients selected by L1. Thus a method that matches the performance of L1R provides no benefit beyond what is already provided by the L1 variable prescreening. L1R obtains performance of approximately 0.25 in the p0 = 1 case and 1 in the p0 = 2 case. This is expected: Lasso regression selects a model with 4 variables, which will typically include the true causes of Y (1 if p0 = 1 and 2 if p0 = 2) along with 3 (if p0 = 1) or 2 (if p0 = 2) non-causes of Y. When the coefficients are randomly permuted, L1R therefore selects the true cause about 25% of the time when p0 = 1 and has on average 1 correct in the top 2 when p0 = 2.
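These expected counts can be checked directly: picking p0 of the 4 Lasso-selected variables uniformly at random gives a hypergeometric count of true causes, whose mean the following sketch computes (`expected_hits` is a hypothetical helper name):

```python
from math import comb

def expected_hits(p0, n_selected=4):
    # mean of a hypergeometric count: p0 variables drawn at random from the
    # n_selected Lasso-chosen variables, of which p0 are true causes
    return sum(k * comb(p0, k) * comb(n_selected - p0, p0 - k) / comb(n_selected, p0)
               for k in range(p0 + 1))

# expected_hits(1) -> 0.25, expected_hits(2) -> 1.0
```

These match the L1R rows of Table 2 up to simulation noise.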
CD performs worse than L1 in all Strong settings tested. Relatively speaking, CD performs best with dense networks (high nt). For example, in the most favorable setting for CD relative to L1 (p0 = 2 and nt = 320), CD still identifies about 12% fewer causal effects than L1 (1.77 versus 2). This can be explained by the fact that with sparser networks (such as nt = 20, where each node affects few other nodes), few of the gene knockouts will affect the true causes and effects of Y. The perturbation environment distribution then resembles the observational environment distribution, a challenging setting for CD. ICP performs worse than CD and is comparable to L1R.
The Weak columns represent a very challenging setting for L1. The effects of Y are consistently more strongly correlated with Y than the causes of Y, so the top ranked genes by L1 are expected to be effects of Y, not causes. At all nt and p0 values, L1 never selects the true causes of Y in its top 1 or 2 predictions, the worst possible performance a method can have. This is a setting where we might hope a causal estimator can outperform L1.
Indeed, both CD and ICP consistently outperform L1. In this case, L1R provides a useful performance benchmark for CD and ICP. ICP performs very similarly to L1R because the maximin coefficients returned by ICP are very often 0, so the method defaults to the L1R parameter estimates. Qualitatively, ICP returns 0 maximin parameter estimates because it is unable to determine the causes of Y. CD performs somewhat better than L1R for p0 = 1 and somewhat worse than L1R for p0 = 2.
Our general conclusion is that causal inference with sample sizes approximating those of Kemmeren is quite difficult. Under the scenarios considered, CD and ICP are unable to consistently beat the L1 and L1R models, both association based estimators, and ICP has particularly poor performance. While these simulations do not reproduce all the complexity of an actual genetic perturbation experiment, they provide a setting in which to test the models that is not subject to sample selection bias.
5 |. DISCUSSION
Sample selection bias is a well-known issue in non-causal prediction models [4, 3]. In this work we demonstrated that sample selection bias in Kemmeren is likely a key driver of the reported predictive performance of several recently proposed causal estimators. To avoid potential sample selection bias, we suggest that future studies with Kemmeren examine the ranks of flipped predictions (as done in Table 1 here), evaluate prediction performance on the set S (as proposed in Section 3 here), and conduct simulation studies, which can evaluate models in a way that is free of sample selection bias. More broadly, the generation and curation of data sets suitable for causal prediction will aid in the unbiased evaluation of causal predictive models.
ACKNOWLEDGEMENTS
J.P.L. was supported by NIH/NCI grants P50CA127001, P30CA016672, and P50CA140388. M.J.H. was supported by NIH/NCI grants R21CA22029 and 1R01CA244845-01A1.
Footnotes
Data from file Kemmeren.RData hosted at http://deleteome.holstegelab.nl/downloads_causal_inference.php. This data set and all code used to analyze it are available in the GitHub repository https://github.com/longjp/causal-bias-code.
References
- [1]. Breiman L, et al., 2001: Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16, no. 3, 199–231.
- [2]. Friedman J, Hastie T, and Tibshirani R, 2010: Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, no. 1, 1.
- [3]. Globerson A and Roweis S, 2006: Nightmare at test time: robust learning by feature deletion. Proceedings of the 23rd International Conference on Machine Learning, 353–360.
- [4]. Huang J, Gretton A, Borgwardt K, Schölkopf B, and Smola A, 2006: Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 19, 601–608.
- [5]. Kemmeren P, Sameith K, Van De Pasch LA, Benschop JJ, Lenstra TL, Margaritis T, O'Duibhir E, Apweiler E, van Wageningen S, Ko CW, et al., 2014: Large-scale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors. Cell, 157, no. 3, 740–752.
- [6]. Meinshausen N, Hauser A, Mooij JM, Peters J, Versteeg P, and Bühlmann P, 2016: Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences, 113, no. 27, 7361–7368.
- [7]. Pearl J, et al., 2009: Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.
- [8]. Peters J, Bühlmann P, and Meinshausen N, 2016: Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 947–1012.
- [9]. Rothenhäusler D, Bühlmann P, Meinshausen N, et al., 2019: Causal dantzig: fast inference in linear structural equation models with hidden variables under additive interventions. The Annals of Statistics, 47, no. 3, 1688–1722.
- [10]. Rubin DB, 2005: Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100, no. 469, 322–331.
- [11]. Versteeg P and Mooij JM, 2019: Boosting local causal discovery in high-dimensional expression data. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2599–2604.
