
This is a preprint; it has not yet been peer reviewed by a journal.


[Preprint]. 2025 Jun 5:2025.06.05.657988. [Version 1] doi: 10.1101/2025.06.05.657988

Learning Genetic Perturbation Effects with Variational Causal Inference

Emily Liu 1,2, Jiaqi Zhang 1,2,3,♯,*, Caroline Uhler 1,2,3,*
PMCID: PMC12157634  PMID: 40501829

Abstract

Advances in sequencing technologies have enhanced the understanding of gene regulation in cells. In particular, Perturb-seq has enabled high-resolution profiling of the transcriptomic response to genetic perturbations at the single-cell level. This understanding has implications in functional genomics and potentially for identifying therapeutic targets. Various computational models have been developed to predict perturbational effects. While deep learning models excel at interpolating observed perturbational data, they tend to overfit and may not generalize well to unseen perturbations. In contrast, mechanistic models, such as linear causal models based on gene regulatory networks, hold greater potential for extrapolation, as they encapsulate regulatory information that can predict responses to unseen perturbations. However, their application has been limited to small studies due to overly simplistic assumptions, making them less effective in handling noisy, large-scale single-cell data. We propose a hybrid approach that combines a mechanistic causal model with variational deep learning, termed Single Cell Causal Variational Autoencoder (SCCVAE). The mechanistic model employs a learned regulatory network to represent perturbational changes as shift interventions that propagate through the learned network. SCCVAE integrates this mechanistic causal model into a variational autoencoder, generating rich, comprehensive transcriptomic responses. Our results indicate that SCCVAE exhibits superior performance over current state-of-the-art baselines for extrapolating to predict unseen perturbational responses. Additionally, for the observed perturbations, the latent space learned by SCCVAE allows for the identification of functional perturbation modules and simulation of single-gene knockdown experiments of varying penetrance, presenting a robust tool for interpreting and interpolating perturbational responses at the single-cell level.

Keywords: Perturb-seq, Perturbation Modeling, Causal Inference, Variational Inference

Author summary:

Understanding how genes interact and respond to perturbations is crucial for uncovering the mechanisms of cells and identifying potential ways to treat diseases. Recent advances in sequencing technologies now allow us to measure how individual cells react when specific genes are altered. However, making sense of this complex data requires advanced computational tools. In our work, we address the challenge of predicting how cells respond to potentially new untested genetic perturbations. We noticed that while deep learning models perform well on data measured before, they struggle with making predictions on new cases. On the other hand, models based on biological understanding can, in theory, make better predictions, but they often rely on overly simple assumptions that do not hold with real-world data. We developed a new method that combines the strengths of both approaches. Our model, called SCCVAE, uses knowledge of gene networks together with deep learning to better predict how cells will respond to gene changes. It can simulate new experiments and help identify groups of genes that work together. This tool could be valuable for researchers studying perturbational changes, as well as gene functions and diseases.

1. Introduction

Gene-editing technologies provide useful probes for the study of gene regulation in cells [1]. By perturbing individual genes and observing transcriptomic changes, we can disentangle and resolve the downstream effects of these perturbations. These insights facilitate a range of downstream applications, from identifying genes involved in fundamental cellular processes (e.g., [2, 3, 4]) to discovering potential drug targets for therapeutic use (e.g., [5, 6]). There are numerous potential target genes for perturbation. Perturb-seq allows for large-scale exploration by combining high-throughput CRISPR gene editing with single-cell RNA sequencing [7, 8]. Recent advances have further expanded its scale, enabling the collection of data on genome-wide perturbations in millions of cells [9]. Understanding cellular responses to the genetic perturbations introduced through these high-content data is of great importance.

Various computational approaches have been proposed to interpret and predict perturbational effects. One predominant line of work explores the expressive power and inductive biases of popular deep learning architectures. For example, Lotfollahi et al. [10] uses a compositional architecture combined with an adversarial network to disentangle perturbational effects from the basal cell states. Yu and Welch [11] utilizes two separate autoencoders to learn perturbation-specific and cell-specific latent representations and employs a normalizing flow to map between these representations. Roohani et al. [12] uses a graph-based network that leverages a gene ontology-based graph to simultaneously learn perturbation embeddings and their corresponding effects. Lopez et al. [13] proposes a modified variational autoencoder with carefully designed noise models, modeling perturbational effects as sparse shifts of these noise distributions. Wu et al. [14] uses a variational autoencoder with a graph attention architecture to encode gene regulations. Several variational-inference-based analyses have also been proposed to interpret the observed perturbations [15, 16]; see Rood et al. [17] for a comprehensive review. Recently, transformer architectures have also been used to learn unsupervised representations of cells and genes, with perturbation prediction as a downstream task [18, 19]. While these methods excel at interpolating observed data, they are prone to overfitting, which may limit their ability to generalize to unseen perturbations. Ahlmann-Eltze et al. [20] observed that simple linear models can outperform sophisticated deep learning models in generalizing to unseen perturbations. However, their approach focuses on pseudo-bulk analysis, whereas this work aims to resolve perturbational effects at the single-cell level.

There are several axes to consider when generalizing to unseen perturbations. First, the generalization task may involve extrapolating from single-gene perturbations to combinatorial perturbations that target multiple genes, or to single-gene perturbations targeting novel genes, potentially administered at varying penetrance / multiplicity of infection. In this work, we mainly focus on extrapolation to novel single-gene perturbations. Second, depending on the task, the principles one uses for generalization may differ. For example, to generalize to combinatorial perturbations, a popular principle is to use additivity to combine the effects (potentially in a latent space) and learn the interactions as residuals (e.g., [12, 21, 22, 23]). To generalize to unseen novel perturbations, two major principles are to use prior knowledge of how the new target relates to observed targets (e.g., [11, 12] reviewed above) or a mechanistic model that specifies how the perturbation propagates according to a gene regulatory network (e.g., [24]). Prominent deep learning models, as discussed above, are mostly prior-knowledge-based, typically relying on gene ontology annotations, which might suffer from poor generalization as such annotations usually capture partial correlations. On the other hand, mechanistic models hold greater potential for extrapolation as they capture regulatory information. Prior mechanistic models usually make assumptions about how perturbations change transcriptomic profiles. For example, Kamimoto et al. [24] and Zhang et al. [25] use a linear causal model based on a learned gene regulatory network; Dibaeinia and Sinha [26] use a chemical master equation to simulate transcription. However, their applications have been limited to relatively small studies due to simplistic parametric assumptions and/or expensive stochastic differential equation simulations.

In our work, we propose a hybrid model that combines a mechanistic causal model with variational deep learning, termed the Single Cell Causal Variational Autoencoder (SCCVAE). The mechanistic causal model captures regulatory information by employing a learned regulatory network and models perturbations as shift interventions that propagate through this network. It is defined on a lower-dimensional space, capturing essential information to reconstruct the entire transcriptomic readout. To address the issue of simplistic parametric assumptions in most mechanistic models, SCCVAE integrates this model into a variational autoencoder. This integration allows SCCVAE to learn and generate rich, comprehensive transcriptomic responses. Our findings demonstrate that SCCVAE excels at generalizing beyond observed perturbations, enabling accurate predictions of unseen single-gene perturbations, and outperforms both standard and state-of-the-art baselines. The mechanistic model specifies the penetrance of the perturbation, allowing simulation of single-gene knockdown perturbations with varying penetrance. As for the observed perturbations, since SCCVAE learns how the perturbation shifts the variables defined by the mechanistic model, we can extract this information to serve as a perturbation representation, which we observe to capture functional perturbation modules.

2. Methods

2.1. Variational Causal Model for Perturbations

Consider the gene expressions of a cell, denoted by $X^p$, perturbed by a single-gene perturbation (i.e., single-guide RNA) represented by $p$. When the cell is a negative control (e.g., with non-targeting guide RNAs), we denote its gene expressions as $X^\varnothing$, or $X$ in short with $p = \varnothing$. Each cell is measured by the expression levels of $m$ distinct genes, i.e., $X^p \in \mathbb{R}^{m \times 1}$. We now explain the components of SCCVAE.

2.1.1. Structural causal model.

Central to the design of our model is the concept of structural causal models [27, 28]. A structural causal model (SCM) is a framework for understanding complex systems by modeling the regulations between causal variables. Formally, a structural causal model is defined as a tuple $(Z, U, F)$ consisting of a set of exogenous noise variables $Z$, a set of endogenous random variables $U$, and a set of functions $F: Z \times U \to U$. Each SCM is associated with a directed acyclic graph $\mathcal{G}$, where node $i$ in $\mathcal{G}$ corresponds to an endogenous variable $U_i$ and edges denote causal relationships between variables. For a given $i$, the endogenous variable $U_i$ depends on its associated exogenous noise variable $Z_i$ and its endogenous parents $U_{\mathrm{pa}(i)}$, where $j \in \mathrm{pa}(i)$ if there is an edge $j \to i$ in $\mathcal{G}$. Mathematically, we have

$$U_i = F_i\big(U_{\mathrm{pa}(i)}, Z_i\big). \tag{1}$$

In the context of gene regulations, the endogenous variables $U \in \mathbb{R}^{n \times 1}$ denote an abstracted representation of $n$ gene modules, where each $U_i$ corresponds to one gene module. Intuitively, this representation can be interpreted as a low-dimensional feature summarizing the expression of $m$ genes in the context of a cell, where $m$ can be much larger than $n$. The causal graph $\mathcal{G}$ indicates gene regulations, where $j \in \mathrm{pa}(i)$ if genes in module $j$ directly regulate genes in module $i$. The exogenous variables $Z$ record the intrinsic variations of $U$ not explained by regulations. Given the values of $U_{\mathrm{pa}(i)}$ and $Z_i$, the function $F_i$ then generates the $U_i$ associated with module $i$. In our setting, we choose a specific parametric family of SCMs, where the $Z_i$ are assumed to be Gaussian and the $F_i$ are linear functions, as this parameterization can encapsulate all multivariate Gaussian distributions, which we observe to sufficiently fit the transcriptomic data. In other words, Eq. (1) is realized by

$$U_i = \sum_{j \in \mathrm{pa}(i)} A_{ij} U_j + Z_i, \qquad Z_i \sim \mathcal{N}\big(0, \sigma_i^2\big), \tag{2}$$

where $A_{ij}$ describes the contribution of $U_j$ to $U_i$. Since the scaling of the endogenous variables can be set arbitrarily, we further assume that $\sigma_i = 1$.

A genetic perturbation $p$ alters the endogenous variables $U$ into $U^p$ by modifying the regulatory relationships in Eq. (2). Here, we use an additive shift model, summarized by

$$U_i^p = \sum_{j \in \mathrm{pa}(i)} A_{ij} U_j^p + Z_i + S_i^p, \qquad Z_i \sim \mathcal{N}\big(0, \sigma_i^2\big), \tag{3}$$

where $S^p$ is the shift vector associated with perturbation $p$. Figure 1 shows an example. In the following section, we explain how to learn this SCM within a variational inference framework.
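The linear SCM of Eq. (2) and the shift intervention of Eq. (3) can be sketched numerically. Below is a minimal NumPy illustration on a hypothetical three-module chain; the weight matrix, shift vector, and sample sizes are illustrative, not values from the paper:

```python
import numpy as np

def sample_scm(A, S=None, n_samples=1000, rng=None):
    """Sample endogenous variables from the linear Gaussian SCM of Eqs. (2)-(3).

    A is an n x n weight matrix with A[i, j] nonzero only if j is a parent of i
    (triangular for a DAG); S is an optional length-n shift vector encoding a
    perturbation (S = None, i.e., S = 0, recovers the control model of Eq. (2)).
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    Z = rng.standard_normal((n_samples, n))            # Z_i ~ N(0, 1)
    shift = np.zeros(n) if S is None else np.asarray(S)
    # Solve (I - A) U = Z + S for every sample: U = (I - A)^{-1} (Z + S).
    U = np.linalg.solve(np.eye(n) - A, (Z + shift).T).T
    return U

# Toy 3-module chain: module 0 -> module 1 -> module 2.
A = np.zeros((3, 3))
A[1, 0] = 0.8
A[2, 1] = 0.5
U_ctrl = sample_scm(A, n_samples=5000, rng=0)
U_pert = sample_scm(A, S=[2.0, 0.0, 0.0], n_samples=5000, rng=0)
# A shift on module 0 propagates downstream: the means move by (I - A)^{-1} S,
# here [2.0, 0.8 * 2.0, 0.5 * 0.8 * 2.0] = [2.0, 1.6, 0.8].
print(U_pert.mean(axis=0) - U_ctrl.mean(axis=0))
```

Because the same noise seed is used for both calls, the per-sample difference is exactly the propagated shift, which illustrates how a single-module intervention affects all of its descendants.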

Figure 1:


Key components of SCCVAE. (A) Illustration of a structural causal model, where the parameters associated with gene $i$ are annotated. (B) The architecture of SCCVAE. It contains: an expression encoder that maps $X$ to exogenous noise variables $Z$, a shift encoder that maps $p$ to a shift vector $S^p$, a structural causal model (e.g., as illustrated in (A)) that maps $(Z, S^p)$ to $U^p$, and an expression decoder.

We note that the definition of an SCM assumes an acyclic $\mathcal{G}$ to ensure that Eq. (1) has a unique solution. However, it can easily be extended to cyclic graphs to model feedback loops by setting $\mathrm{pa}(i)$ to be all parent nodes of $i$, excluding itself, in the cyclic graph.

2.1.2. Hybrid variational causal model.

SCCVAE integrates and learns the SCM in a variational autoencoder. It consists of four key components: the expression encoder, the shift encoder, the SCM, and the expression decoder. Figure 1 shows all inputs, outputs, and components of the model architecture.

The expression encoder takes in the expression of one control cell $X \in \mathbb{R}^{m \times 1}$ and encodes it into the exogenous variables $Z \in \mathbb{R}^{n \times 1}$, with $n < m$. Intuitively, such $Z$ represents intrinsic variations of $n$ gene modules, not explained by regulations, that are shared between the control and the perturbed distributions. This set of $n$ modules captures essential information for reconstructing the entire transcriptomic readout. In our implementation, we select $n = 512$. We encode the conditional distribution $q(Z \mid X)$ using a diagonal normal distribution, where the mean and variance are parameterized by neural networks, and minimize the KL divergence $\mathbb{E}_X\, D_{\mathrm{KL}}(q(Z \mid X)\,\|\,p(Z))$, where $p(Z)$ corresponds to $\mathcal{N}(0, I_n)$ with $I_n$ being the identity matrix of order $n$. We then use the reparameterization trick [29] to sample $Z$ from $q(Z \mid X)$ to obtain the endogenous variables.
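A minimal NumPy sketch of the diagonal Gaussian encoder, the closed-form KL divergence to $\mathcal{N}(0, I_n)$, and the reparameterization trick. The linear "networks" and small dimensions below are toy stand-ins for the actual neural networks and for $n = 512$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 4                       # m genes -> n latent modules (paper uses n = 512)

# Toy stand-ins for the mean/log-variance neural networks (hypothetical weights).
W_mu = rng.standard_normal((n, m)) * 0.1
W_logvar = rng.standard_normal((n, m)) * 0.1

def encode(X):
    """q(Z | X) = N(mu(X), diag(exp(logvar(X)))) for a batch of cells X (b x m)."""
    return X @ W_mu.T, X @ W_logvar.T

def reparameterize(mu, logvar, rng):
    """Sample Z ~ q(Z | X) differentiably: Z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(q(Z | X) || N(0, I_n)), averaged over the batch."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1).mean()

X = rng.standard_normal((8, m))    # a batch of 8 control cells
mu, logvar = encode(X)
Z = reparameterize(mu, logvar, rng)
assert Z.shape == (8, n)
print(kl_to_standard_normal(mu, logvar))   # a nonnegative scalar
```

The closed-form KL for two diagonal Gaussians is what makes the regularizer cheap; the reparameterization keeps the sampling step differentiable with respect to the encoder parameters.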

As inputs to the shift encoder, we need a label representing each single-gene perturbation. For generalization purposes, this label should be attainable for unseen perturbations. Here, we use the top principal components of the transposed control cell expression matrix, of shape $\mathbb{R}^{m \times l}$, as perturbation labels, where $l$ is the number of control cells. For each of the $m$ measured genes, the corresponding row of this matrix is a vector in $\mathbb{R}^l$; we take its top principal component scores as the label representing a perturbation of this gene.
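One way to realize such labels is to run PCA on the transposed (genes × control cells) matrix and take each gene's top principal component scores. The number of components `k` below is a hypothetical choice:

```python
import numpy as np

def perturbation_labels(X_ctrl, k=10):
    """Compute a k-dimensional label for each of the m genes by projecting the
    transposed control expression matrix (m x l: genes x control cells) onto
    its top-k principal components. `k` is an illustrative choice here.
    """
    G = X_ctrl.T                            # m x l: one row per measured gene
    G = G - G.mean(axis=0, keepdims=True)   # center across genes
    # SVD gives the principal axes; rows of U * s are the PC scores per gene.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k] * s[:k]                 # m x k label matrix

# l = 100 control cells measured over m = 30 genes (synthetic data).
X_ctrl = np.random.default_rng(0).standard_normal((100, 30))
labels = perturbation_labels(X_ctrl, k=5)
assert labels.shape == (30, 5)
```

Because the labels are derived purely from control-cell expression, they are available for any gene, including perturbation targets never seen during training, which is what makes this representation suitable for out-of-distribution prediction.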

The shift encoder encodes a perturbation label $p$ into a shift vector $S^p \in \mathbb{R}^{n \times 1}$. Intuitively, such $S^p$ denotes the direct effects that a perturbation has on these $n$ gene modules, capturing both on-target and potentially off-target effects. Such direct effects then propagate to other endogenous variables through Eq. (3). Note that Eq. (2) is a special case of Eq. (3) with $p = \varnothing$ and $S^p = 0$.

We can stack Eq. (3) over $i = 1, \ldots, n$ and write it in matrix form:

$$U^p = A U^p + Z + S^p,$$

which is equivalent to

$$U^p = (I_n - A)^{-1}\big(Z + S^p\big),$$

where $U^p = (U_1^p, \ldots, U_n^p)^\top$ and $A_{ij} = 0$ if $j \notin \mathrm{pa}(i)$. Therefore, in the SCM we only need to specify the graph $\mathcal{G}$ and parameterize $A$ by masking it according to the sparsity pattern in $\mathcal{G}$, upon which $U^p$ can be directly computed from $Z$ and $S^p$. In our implementation, we find that a learned network $\mathcal{G}$ (i.e., using an upper-triangular mask) that is optimized during training works well.
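As a sketch of this parameterization, one can mask an unconstrained weight matrix with a strictly upper-triangular pattern (mirroring the upper-triangular mask mentioned above) and solve the resulting unit-triangular system. All dimensions and weights below are illustrative:

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
A_free = rng.standard_normal((n, n)) * 0.2       # unconstrained learnable weights
mask = np.triu(np.ones((n, n)), k=1)             # strictly upper-triangular mask
A = A_free * mask                                # A_ij = 0 unless j > i, i.e., a DAG ordering

Z = rng.standard_normal((100, n))                # exogenous noise for 100 cells
S = np.zeros(n)                                  # control: no shift
U = np.linalg.solve(np.eye(n) - A, (Z + S).T).T  # U^p = (I_n - A)^{-1}(Z + S^p)

# With a strictly triangular mask, I - A is unit-triangular, so the linear
# system is always well posed (determinant is exactly 1).
assert abs(np.linalg.det(np.eye(n) - A) - 1.0) < 1e-9
```

This is why the learned-graph variant needs no acyclicity penalty: the mask itself guarantees a DAG under the fixed variable ordering.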

Finally, the expression decoder takes the perturbed endogenous variables $U^p$ and reconstructs the expression profile $X^p$ of perturbed cells. The model is trained using the variational lower bound with an added maximum mean discrepancy (MMD) term [30], specified by

$$\mathcal{L}(\theta) = \mathbb{E}_p \mathbb{E}_X \big\|X^p - \hat{X}^p\big\|_2^2 + \beta\, \mathbb{E}_X D_{\mathrm{KL}}\big(q(Z \mid X)\,\|\,p(Z)\big) + \gamma\, \mathbb{E}_{p \neq \varnothing}\, \mathrm{MMD}\big(X^p, \hat{X}^p\big).$$

The variational lower bound in the first two terms (with $p = \varnothing$) maximizes the likelihood of the hybrid causal model over the control distribution, whereas the mean squared error in the first term (with $p \neq \varnothing$) and the added MMD term match and stabilize the training for the perturbational distributions.

Here, $\theta$ subsumes all the unknown parameters, $\hat{X}^p$ denotes the output of SCCVAE, and the hyperparameters $\beta$, $\gamma$ are scaling factors specified in Appendix A.1.

In summation form, $$\mathcal{L}(\theta) = \frac{1}{|P|}\sum_{p \in P} \frac{1}{D_p}\sum_{i=1}^{D_p} \big\|X_i^p - \hat{X}_i^p\big\|_2^2 + \frac{\beta}{D_\varnothing}\sum_{i=1}^{D_\varnothing} D_{\mathrm{KL}}\big(q(Z \mid X_i)\,\|\,p(Z)\big) + \frac{\gamma}{|P|}\sum_{p \in P} \mathrm{MMD}\big(X^p, \hat{X}^p\big),$$ where $P$ is the set of all perturbations, $D_p$ is the number of cells for $p \in P$, and $D_\varnothing$ is the number of control cells.
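A minimal NumPy sketch of this objective, using a biased RBF-kernel MMD estimator. The kernel bandwidth, batch layout, and the `is_control` indicator are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased squared MMD between samples X (a x d) and Y (b x d), RBF kernel."""
    def k(P, Q):
        d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def sccvae_loss(X_true, X_hat, mu, logvar, is_control, beta=1.0, gamma=1.0):
    """Sketch of the SCCVAE objective: reconstruction MSE over all cells,
    KL(q(Z|X) || N(0, I)) on control cells, and MMD on perturbed cells.
    """
    recon = ((X_true - X_hat) ** 2).sum(axis=1).mean()
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)[is_control].mean()
    mmd = rbf_mmd2(X_true[~is_control], X_hat[~is_control])
    return recon + beta * kl + gamma * mmd

rng = np.random.default_rng(0)
X_true = rng.standard_normal((10, 5))
X_hat = X_true + 0.1 * rng.standard_normal((10, 5))
mu, logvar = 0.1 * rng.standard_normal((10, 3)), np.zeros((10, 3))
is_control = np.arange(10) < 5             # first 5 cells are controls
print(sccvae_loss(X_true, X_hat, mu, logvar, is_control))
```

The biased MMD estimator is always nonnegative, which keeps the distributional term well behaved as a penalty on the perturbed cells.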

2.2. Shift Selection for Unseen Perturbations

When generalizing to an unseen single-gene perturbation, especially in knockdown or activation experiments, it is essential to account for the penetrance of the perturbation, i.e., the magnitude of its effect. This is because such perturbations may be administered with guides of different efficiencies compared to the perturbations seen during training. As a result, simply encoding the identity of the new perturbation is insufficient; we must also infer how strongly it acts. Specifically, for an unseen perturbation $q$, we can encode it into a shift vector $S^q \in \mathbb{R}^{n \times 1}$ using SCCVAE trained on observed perturbations. To quantify the effect of different penetrance, we attribute a single scalar $c^q \in \mathbb{R}$ to denote its effect on the endogenous variables. In other words,

$$U^q = A U^q + Z + c^q S^q.$$

Then $U^q$ is passed to the decoder to generate in-silico transcriptomic profiles of single cells perturbed by $q$.

If additional metadata exists, e.g., a proliferation screen with the same library, one can use this information to specify the intensity of the perturbation. When such information is not available, one can use, e.g., bulk perturbation data to select $c^q$ in order to evaluate the capability of the model. Specifically, one can grid-search for $c^q$ within a range by comparing the average predicted perturbed cell expression $\hat{X}^q$ with the bulk expression of cells subjected to perturbation $q$ using the mean squared error. The correct penetrance $c^q$ should give rise to the minimal error. This shift selection process allows for the identification of penetrance from pseudo-bulk data without access to single-cell data for novel perturbations.
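The shift-selection grid search described above might be sketched as follows, with a hypothetical `decode` stand-in for the trained expression decoder (an identity decoder is used in the toy check):

```python
import numpy as np

def select_shift_scale(decode, A, S_q, bulk_q, c_grid, n_cells=256, rng=0):
    """Grid-search the penetrance scalar c^q: for each candidate c, generate
    cells via U^q = (I - A)^{-1}(Z + c * S_q), decode them, and compare the
    predicted pseudo-bulk mean against the observed bulk profile by MSE.
    `decode` is a hypothetical stand-in for the trained expression decoder.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    Z = rng.standard_normal((n_cells, n))        # shared noise across candidates
    best_c, best_err = None, np.inf
    for c in c_grid:
        U = np.linalg.solve(np.eye(n) - A, (Z + c * np.asarray(S_q)).T).T
        X_hat = decode(U)                        # in-silico perturbed cells
        err = np.mean((X_hat.mean(axis=0) - bulk_q) ** 2)
        if err < best_err:
            best_c, best_err = c, err
    return best_c

# Toy check with an identity decoder and ground-truth penetrance c = 1.5.
A = np.zeros((3, 3)); A[1, 0] = 0.5
S_q = np.array([1.0, 0.0, 0.0])
bulk_q = np.linalg.solve(np.eye(3) - A, 1.5 * S_q)   # expected mean of U^q
c_hat = select_shift_scale(lambda U: U, A, S_q, bulk_q, np.linspace(0.0, 3.0, 31))
print(c_hat)   # close to 1.5
```

Drawing the noise once and reusing it for every candidate makes the comparison across grid points consistent, so the recovered minimum tracks the true penetrance up to finite-sample noise and grid resolution.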

Note that such shift selection is unavoidable for obtaining accurate predictions, as the unseen perturbation could have a very different penetrance. However, the performance of the model still relies on whether SCCVAE successfully learns the underlying regulatory information, as a single scalar parameter alone does not provide sufficient capacity to generalize to unseen perturbations.

2.2.1. Computational complexity.

Consider $N_I$ perturbations at inference time. Generating $B$ cells per perturbation, the computational complexity of evaluating $C$ candidate shift values is $O(N_I B C m)$, where $m$ is the number of genes in the considered expression data. A typical choice of $B$ is around 32–128. A finer level of shift selection with larger $C$ yields more accurate results, with the computational complexity scaling linearly in $C$.

3. Results

We evaluate on the single-gene perturbational datasets of [31], which contain normalized gene expression transcripts of essential genes for two cell lines, K562 and RPE1, with $m = 8563$ for K562 cells and $m = 8749$ for RPE1 cells.

In our first set of analyses, we focus on the cancerous K562 cell line and examine a subset of perturbations that induced expression distributions distinct from the control distribution, in order to reduce the noise-to-signal ratio. This subset is chosen by first filtering out the perturbations with fewer than 200 single cells. Out of the remaining perturbations, we consider only those that yield an average five-fold cross-validation score greater than 0.6 in a logistic regression task distinguishing perturbed cells from negative controls. This ensures that the subset of perturbations (N=279) is distinct from control and requires nontrivial predictions. Further details can be found in Appendix A.3.1.
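A sketch of this filtering criterion, assuming scikit-learn is available; the thresholds follow the text, while the data below are synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def keep_perturbation(X_pert, X_ctrl, min_cells=200, min_score=0.6):
    """Filtering sketch: keep a perturbation if it has at least `min_cells`
    cells and a logistic regression separates its cells from negative
    controls with mean five-fold CV accuracy above `min_score`.
    """
    if X_pert.shape[0] < min_cells:
        return False
    X = np.vstack([X_pert, X_ctrl])
    y = np.r_[np.ones(len(X_pert)), np.zeros(len(X_ctrl))]
    score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    return score > min_score

rng = np.random.default_rng(0)
X_ctrl = rng.standard_normal((300, 20))
X_sep = rng.standard_normal((250, 20)) + 2.0   # clearly shifted perturbation
X_null = rng.standard_normal((250, 20))        # indistinguishable from control
print(keep_perturbation(X_sep, X_ctrl), keep_perturbation(X_null, X_ctrl))
```

A clearly shifted synthetic perturbation passes the classifier test while a control-like one does not, which is exactly the behavior the filter is meant to enforce.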

We additionally include experiments on the entire K562 cell line (without filtering perturbations) for comparison, along with experiments on a selected subset of the RPE1 cell line. The additional experimental results can be found in Appendix A.3.6.

3.1. Setup

3.1.1. In-distribution experiments.

In this set of experiments, we evaluate how well different models can interpolate observed perturbations. All cells in the test set are drawn from the same perturbational distributions as those in the training set. For each perturbation in the dataset, we assign 70% of the cells to the training set, 10% to the validation set, and the remaining 20% to the test set.

3.1.2. Out-of-distribution experiments.

In this set of experiments, we evaluate the performance of different models in generalizing to unseen perturbations. We designate the train/test split so that there is no overlap between the test-set and train-set perturbations. First, we randomly select 20% of the perturbations to make up the test set. Out of the remaining perturbations, we assign 85% of the cells from each distribution to the training set and the remaining 15% to the validation set. To account for variation in the amount of distributional shift that arises from the random selection of test perturbations, we repeat this experiment using five different train/test splits, so that each perturbation is included in the test set for exactly one of the splits. Results are averaged across all splits. For each unseen perturbation $q$ (defined in Section 2.2), we select $c^q$ by searching through the range of all learned $c^p$ in the model and selecting the value that minimizes the mean squared error of the pseudo-bulk predictions. Details of this search process are found in Appendix A.1.3.
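The five-fold construction of disjoint test perturbations can be sketched as follows (perturbation-level only; the 85/15 cell-level train/validation split within each fold is omitted here):

```python
import numpy as np

def ood_splits(perturbations, n_splits=5, rng=0):
    """Construct out-of-distribution splits: each split holds out a disjoint
    ~20% of perturbations as the test set, so across the five splits every
    perturbation appears in the test set exactly once.
    """
    rng = np.random.default_rng(rng)
    perm = rng.permutation(perturbations)
    folds = np.array_split(perm, n_splits)
    return [(np.setdiff1d(perm, test), test) for test in folds]

splits = ood_splits(np.arange(20))
all_test = np.sort(np.concatenate([test for _, test in splits]))
assert np.array_equal(all_test, np.arange(20))        # full coverage
for train, test in splits:
    assert len(np.intersect1d(train, test)) == 0      # no perturbation leakage
```

Partitioning at the perturbation level (rather than the cell level) is what makes these splits genuinely out-of-distribution: no cell from a test perturbation is ever seen during training.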

In both in-distribution and out-of-distribution tasks, training hyperparameters were selected based on minimizing loss (defined in Section 2.1) within the validation set, drawn from the same distribution as the training set and determined as described above. Hyperparameters are selected through the validation set separately for each train/test split. As such, no points from the test set are seen in the hyperparameter search for any given split. Further details are given in Appendix A.1.

3.2. Prediction of Single-Cell Expressions

To evaluate SCCVAE, we consider six metrics: mean squared error, Pearson correlation, maximum mean discrepancy, energy distance, and the fractions of genes changed/unchanged from control in the opposite direction of the ground truth. Formulas for computing each metric are found in Appendix A.2. Each perturbation is evaluated separately, and we report the mean and standard deviation across all test perturbations for each metric. Each metric is computed on the set of all essential genes and on the more difficult task of predicting just the top 50 most highly variable genes.
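A sketch of the pseudo-bulk portion of these metrics; the direction-of-change computation below is one plausible reading of the text, and the exact formulas are in Appendix A.2 of the paper:

```python
import numpy as np

def bulk_metrics(X_true, X_pred, X_ctrl):
    """Per-perturbation evaluation sketch: pseudo-bulk MSE and Pearson r,
    plus the fraction of genes whose predicted change from control points in
    the opposite direction of the ground-truth change.
    """
    mu_t = X_true.mean(axis=0)     # ground-truth pseudo-bulk profile
    mu_p = X_pred.mean(axis=0)     # predicted pseudo-bulk profile
    mu_c = X_ctrl.mean(axis=0)     # control pseudo-bulk profile
    mse = np.mean((mu_t - mu_p) ** 2)
    pearson = np.corrcoef(mu_t, mu_p)[0, 1]
    frac_opposite = np.mean(np.sign(mu_p - mu_c) != np.sign(mu_t - mu_c))
    return mse, pearson, frac_opposite

# Sanity check on synthetic data: a perfect prediction scores perfectly.
rng = np.random.default_rng(0)
X_ctrl = rng.standard_normal((50, 8))
X_true = X_ctrl + 1.0
mse, r, opp = bulk_metrics(X_true, X_true, X_ctrl)
```

Distributional metrics such as MMD and energy distance are computed over the full sets of single cells rather than the pseudo-bulk means, which is why they are the more discriminative metrics at the single-cell level.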

3.2.1. SCCVAE versus single-cell-level baselines.

Here we focus on perturbational effects at the single-cell level. For baselines, we compare against the control cell distribution, a popular deep learning model (GEARS [12]), and a transformer-based model (scGPT [19]) for single-cell predictions.

Results of the in-distribution experiments can be found in Table 3 in Appendix A.3.2, where we observed that SCCVAE outperforms or is on par with both baselines on all metrics. This shows the benefit of utilizing an expressive network in SCCVAE to interpolate observed perturbations.

Table 1 shows the results of the out-of-distribution experiments. It can be seen that SCCVAE outperforms both the GEARS and control baselines in generalizing to unseen perturbations. The improvements over the baselines on the single-cell-based metrics, i.e., MMD and energy distance, are more pronounced when evaluating on all essential genes. This is further illustrated in Fig. 2, where we observe that SCCVAE matches the perturbation distributions much better than the baselines. The fraction of genes changed from control in the same direction as the ground truth is also improved in the SCCVAE predictions. SCCVAE additionally achieves better MSE and Pearson r than the baselines, although all baselines perform well on these metrics; notably, SCCVAE shows much less variation across different perturbations.

Table 1:

SCCVAE vs baselines on the out-of-distribution task.

All genes Control GEARS scGPT Linear SCCVAE

MSE 0.00817±0.00575 0.00734±0.00524 0.0101±0.00105 0.00438±0.000392 0.00542±0.002
PearsonR 0.989±0.00748 0.989±0.00738 0.997±0.000644 0.992±0.00071 0.992±0.004
MMD 0.240±0.0391 0.239±0.0405 0.266±0.0104 - 0.229±0.0193
Energy Dist. 0.0363±0.0138 0.0333±0.0132 0.0337±0.00929 - 0.0299±0.00958
Frac. Same 0.496±0.0175 0.551±0.0590 0.717±0.0468 0.627±0.0123 0.630±0.0509
Frac. Changed 0.471±0.0175 0.449±0.0590 0.283±0.0468 0.373±0.0123 0.370±0.0509


Top 50 genes Control GEARS scGPT Linear SCCVAE

MSE 0.00969±0.00830 0.00879±0.00770 0.0215±0.00362 0.00561±0.000359 0.00651±0.00537
PearsonR 0.995±0.00503 0.995±0.00498 0.998±0.000742 0.996±0.000303 0.996±0.00376
MMD 0.246±0.0512 0.245±0.0531 0.313±0.0246 - 0.236±0.0323
Energy Dist. 0.0385±0.0199 0.0372±0.0203 0.0588±0.0221 - 0.0338±0.0160
Frac. Same 0.492±0.0541 0.553±0.0945 0.721±0.0763 0.618±0.0163 0.612±0.0820
Frac. Changed 0.468±0.0541 0.447±0.0945 0.279±0.0763 0.382±0.0163 0.388±0.0821

Results are averaged across five different splits, each containing a different set of test perturbations, together covering the set of all perturbations. When evaluating both all essential genes and the top 50 genes, SCCVAE achieves better metrics than control or GEARS on every metric except Pearson correlation, where all three methods achieve similar mean values but SCCVAE has lower variance in its predictions. SCCVAE outperforms the transformer-based model scGPT on both distribution-based metrics. Compared to the simple linear model from Ahlmann-Eltze et al. [20], SCCVAE achieves similar metrics while additionally achieving low distributional loss, expanding its scope beyond the linear model.

Figure 2:


(A) Comparing quantitative performance on the OOD task of GEARS vs. SCCVAE relative to the control distribution: GEARS on average learns the control distribution, whereas SCCVAE more closely approximates the ground truth and is comparable to the linear method across all metrics. This effect is most pronounced when observing all essential genes; in the case of the top 50 genes, a few outlier perturbations have unusually high error. (B) UMAP visualizations of SCCVAE and GEARS outputs versus ground-truth perturbations for select perturbations in the OOD task. Consistent with the quantitative results, GEARS outputs match the control distribution, while SCCVAE outputs match the perturbationally distinct ground truth.

Additionally, although scGPT attains better Pearson correlations and fraction changed/same metrics, its distributional error metrics (MMD and energy distance) are all higher than the control's. This suggests that transformer-based models capture large-scale patterns in the data better than the causal model, but the large distributional error indicates that the model still overfits and cannot generalize to finer distributional details, unlike the causal model.

3.2.2. SCCVAE versus linear model.

The performance of SCCVAE in comparison to the simple linear model of Ahlmann-Eltze et al. [20] is shown in Table 1 above. We use the same principal-component-based representation of perturbations as SCCVAE. It can be seen that both SCCVAE and the linear model achieve on-par performance on average bulk metrics. Recall our discussion in Section 2.2; notably, the linear model predicts the bulk expression well without explicitly specifying the penetrance. This may be because unseen perturbations in this dataset have similar penetrance to those in the training set; specifically, their guides have similar efficiencies. As such, one possible alternative for SCCVAE to perform shift selection in this library is to use the predicted bulk expression from the linear model.

On the other hand, the SCCVAE is able to handle single-cell data on a distributional level, while the linear model is limited to bulk analysis. Extending the linear model to capture a distribution over gene expressions is challenging, as it requires specifying higher-order moments of the distribution (beyond just the mean) and modifying the loss function to account for these moments.

3.3. Ablation Studies

We evaluate the SCCVAE architecture using different choices of graph 𝒢 (Section 2.1). The results in Table 1 are generated using a learned DAG without sparsity restrictions. We compare the performance using a non-hybrid model with a conditional variational autoencoder (Conditional), a model using a pre-specified gene regulatory network inferred using a causal discovery algorithm on negative control cells [32] (Causal-GSP), and a model using a pre-specified randomly generated graph (Random). We evaluate the performance in the out-of-distribution setting. The details of this setup are described in Appendix A.1.

The results of this comparison are shown in Table 2 and Figure 3. We observe that SCCVAE with either a learned graph or the DAG inferred by GSP outperforms the conditional model and the random-graph model. Because the bulk-level metrics experience minimal change, we focus on MMD. These results further demonstrate that incorporating a mechanistic causal model improves the model's performance in generalizing to unseen perturbations.

Table 2:

Ablation studies.

All genes SCCVAE Conditional Causal-GSP Random

MSE 0.00542±0.00295 0.00580±0.00399 0.00569±0.00367 0.00779±0.00270
PearsonR 0.992±0.00469 0.991±0.00552 0.991±0.00532 0.991±0.00432
MMD 0.229±0.0193 0.232±0.0293 0.231±0.0271 0.268±0.0259
Energy Dist. 0.0299±0.00958 0.0299±0.0112 0.0300±0.0104 0.0332±0.00703
Frac. Same 0.630±0.0509 0.618±0.0655 0.622±0.0630 0.644±0.0521
Frac. Changed 0.370±0.0509 0.382±0.0655 0.378±0.0630 0.356±0.0521


Top 50 genes SCCVAE Conditional Causal-GSP Random

MSE 0.00651±0.00537 0.00728±0.00685 0.00689±0.00641 0.0101±0.00542
PearsonR 0.996±0.00376 0.996±0.00436 0.996±0.00425 0.995±0.00377
MMD 0.236±0.0323 0.243±0.0435 0.240±0.0413 0.289±0.0413
Energy Dist. 0.0338±0.0160 0.0329±0.0182 0.0328±0.0167 0.0389±0.013
Frac. Same 0.612±0.082 0.610±0.0939 0.616±0.0912 0.633±0.0717
Frac. Changed 0.388±0.0821 0.390±0.0938 0.384±0.0912 0.367±0.0716

SCCVAE with a learned graph (SCCVAE) consistently achieves superior MSE and MMD values. Both the conditional model and the pre-specified graph-based model (Causal-GSP) still outperform the GEARS baselines from Table 1, and the pre-specified graph-based model outperforms the conditional model, but both are more restrictive than SCCVAE with a learned graph. The models with random causal graphs achieve high error, as expected, but outperform all other models on the fraction same/changed metrics. This is likely due to the shifts being tuned to extreme values during the training/shift-selection process to compensate for the poor distributional match from the random graph.

Figure 3:


Distributional loss (MMD) on SCCVAE ablations. The learned causal graph in the SCCVAE model achieves lower MMD than the conditional, sparse causal graph, and random graph equivalents.

3.3.1. SCCVAE versus conditional model.

The causal SCCVAE achieves lower error than the equivalent conditional model, where the input to the decoder is $Z_i + c^p S_i^p$. The difference is most noticeable in the distributional loss metrics (MMD and energy distance), as the conditional model has no way of learning the interactions between latent variables that aid out-of-distribution generalization.

3.3.2. Learned versus pre-specified causal graphs.

Our results indicate that the sparse graph learned by GSP is overly restrictive in the OOD setting: it does not generalize to the out-of-distribution split as well as SCCVAE with a learned graph in terms of the distribution-level metrics (MMD and Energy Distance). Notably, however, Causal-GSP still performs consistently better than the conditional model, especially on the top 50 variable genes, indicating that the sparse graph identified by the GSP algorithm still captures certain regulatory relationships.

3.3.3. SCCVAE versus random DAGs.

The error metrics (MSE, Pearson, MMD, Energy Distance) for the random-graph model are significantly worse than for the other models. However, the average direction-of-change metrics improve, likely because the shift value c (defined in Section 2.2) is selected to overcompensate for the lack of distributional precision in the predictions. Recall that the shift-selection process for c is a required step for out-of-distribution prediction in our mechanistic model, because the penetrance of novel genetic perturbations is unknown. However, when the mechanisms in the model are mis-specified (e.g., with a random graph), shift selection is unable to correct the prediction. This provides further evidence that the ability of the model to generalize to unseen perturbations depends on whether the regulatory information can be successfully learned.

3.4. Latent Gene Embeddings

To examine the latent representations learned by SCCVAE, we first investigate whether they contain information on whether the perturbations induce distribution shifts from the negative controls. Figure 4 compares the MMD between $X^p$ and $X^\emptyset$ in the gene expression space with the $L_2$ distance between $U^p$ and $U^\emptyset$ in the latent space for each $p$, across all five OOD splits. For each split, these two distances are highly correlated, which is consistent with SCCVAE learning the perturbational changes through $U^p$, since $U^p = (I_n - A)^{-1}(Z + c_p S^p)$ and $U^\emptyset = (I_n - A)^{-1} Z$ (Section 2.1). A further breakdown of the scatter plots into training and testing perturbations is found in Appendix A.3.3.

Figure 4:


Distance from the control distribution in the observational space vs. in the latent space $U^p$, for all perturbations in each OOD split. The Euclidean distance in the latent space is strongly correlated with the MMD in the expression space.

By modulating $c_p$, we can simulate varying penetrance of the perturbation, pushing the post-perturbational distribution closer to or further from control. Figure 5 demonstrates this on the gene MED7, as $c_p$ is set to various values in [−1, 3]. As $c_p$ is increased from −1 to 2, the error decreases until reaching a minimum at $c_p \approx 2.0$ (Fig. 5A). Likewise, the output distribution matches the control distribution when $c_p \approx 0$ and most closely matches the ground truth in the dataset at $c_p \approx 2.0$, while values of $c_p > 2.0$ push the prediction further away from the control distribution (Fig. 5B). Further analysis of the effect of shift selection on other perturbation predictions is given in Appendix A.3.4.
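The shift-selection step can be sketched as a simple grid search that minimizes the MSE of pseudo-bulk (per-gene mean) predictions. The `predict` function and grid below are toy stand-ins, not the model's actual output; the true minimizer here is placed at 2.0 to mirror the MED7 example.

```python
import numpy as np

def select_shift(predict, X_target, candidates):
    """Grid-search the shift c that minimizes pseudo-bulk MSE.

    `predict(c)` returns a (cells x genes) prediction; the pseudo-bulk
    is its per-gene mean. Names are illustrative, not from the paper's code.
    """
    target_bulk = X_target.mean(axis=0)
    errs = []
    for c in candidates:
        pred_bulk = predict(c).mean(axis=0)
        errs.append(float(((pred_bulk - target_bulk) ** 2).mean()))
    return candidates[int(np.argmin(errs))], errs

# Toy predictor: the predicted mean expression moves linearly with c.
rng = np.random.default_rng(1)
truth = rng.normal(2.0, 0.1, size=(100, 10))          # ground-truth cells centered at 2.0
predict = lambda c: rng.normal(c, 0.1, size=(100, 10))

best_c, errs = select_shift(predict, truth, candidates=np.linspace(-1, 3, 9))
```

Sweeping `c` over a grid and picking the pseudo-bulk-MSE minimizer recovers a shift near 2.0 in this toy setup, analogous to the error curve in Fig. 5A.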

Figure 5:


(A) For OOD test perturbations, the shift is selected to minimize the MSE of pseudo-bulk predictions. (B) Shift values closer to zero yield predictions closer to control, while larger-magnitude shifts produce more distinct perturbational distributions. The shift value that most closely matches the ground truth is $c_p \approx 2.0$.

Figure 6 visualizes $U^p$ embeddings for both training and testing distributions from one of the out-of-distribution splits. These embeddings were generated by encoding the perturbations (p or q) along with a sample of control cells (X); no reparameterization was used, simulating the effect of averaging over infinitely many samples. Perturbations of genes with similar functions map to similar variables in the latent space, and identifiable perturbational modules (such as mediator complex genes, ribosomal proteins, cell cycle regulators, proteasome subunits, and genes involved in DNA replication and repair) cluster together in the latent space, regardless of whether the genes in a module were included in the original training set. As such, it is possible to infer the function of an unknown gene with SCCVAE by examining the functions of its neighbors when encoded into $U^p$. More visualizations of perturbation modules in $U^p$ can be found in Appendix A.3.5.
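The neighbor-based function lookup amounts to a nearest-neighbor query over per-perturbation mean embeddings. In the sketch below, the gene names and 2-D coordinates are fabricated for illustration; real $U^p$ embeddings would be higher-dimensional model outputs.

```python
import numpy as np

def nearest_perturbations(emb, query, k=3):
    """Return the k perturbations whose mean latent embedding is closest
    to `query`'s, by Euclidean distance. `emb` maps name -> mean U^p vector."""
    q = emb[query]
    others = [n for n in emb if n != query]
    dists = [np.linalg.norm(emb[n] - q) for n in others]
    order = np.argsort(dists)[:k]
    return [others[i] for i in order]

# Toy embeddings: two synthetic "modules" in a 2-D latent space (assumed values).
emb = {
    "RPL1": np.array([0.00, 0.10]), "RPL2": np.array([0.10, 0.00]),
    "RPS1": np.array([0.05, 0.05]),
    "MED7": np.array([5.00, 5.10]), "MED11": np.array([5.10, 5.00]),
}
neighbors = nearest_perturbations(emb, "MED7", k=1)
```

Querying a gene of unknown function and inspecting the annotated functions of its latent neighbors is the inference strategy described above; here the nearest neighbor of "MED7" is the other synthetic mediator-complex gene.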

Figure 6:


(A) UMAP visualizations of latent causal variables with various functional perturbation modules. Genes belonging to the same perturbation module lie close together in the latent space. (B) When visualizing only the average $U^p$ of each perturbation, the perturbation modules form distinct clusters in the latent space.

4. Conclusions

In this work, we demonstrated the ability of a hybrid variational causal model (SCCVAE) to effectively model gene expression outcomes following single-gene perturbations, particularly in scenarios when extrapolating to unseen perturbations. By leveraging a learned mechanistic model within a flexible variational neural network, SCCVAE is able to both interpolate observed perturbations and generalize to unseen perturbations. These findings underscore the importance of modeling gene regulatory relationships in the predictive model for accurate predictions of genetic interventions. Additionally, the ability of SCCVAE to learn shifts through a structural causal model provides a robust foundation for simulating gene knockdown experiments with varying penetrance.

Our findings open several directions for future exploration. First, we used SCCVAE for single-gene knockdown perturbations; extending the model to other types of perturbations, or to combinations of perturbations, is an interesting future direction. Second, the current version of SCCVAE implicitly uses a Gaussian likelihood in the output space (i.e., $P(X \mid Z)$ is modeled as a Gaussian distribution) to approximate the normalized gene expressions. However, raw gene expression counts can be modeled with a zero-inflated negative binomial (ZINB) distribution, whose parameters reveal additional gene-specific information about the perturbed distribution. Extending and further analyzing SCCVAE with a ZINB-based likelihood is of interest. Lastly, we demonstrated how to incorporate a mechanistic model into a variational autoencoder framework; it would be interesting to build upon this work to test incorporation into alternative generative models.
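For reference, a ZINB log-likelihood in the common mean/inverse-dispersion parameterization can be written directly with standard-library functions. This is a standalone sketch of the density itself, not code from the paper; the parameter values are arbitrary.

```python
import math

def nb_logpmf(x, mu, theta):
    """Log-pmf of a negative binomial with mean mu and inverse-dispersion theta:
    P(x) = Gamma(x+theta) / (Gamma(theta) x!) * (theta/(theta+mu))^theta * (mu/(theta+mu))^x."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_logpmf(x, mu, theta, pi):
    """Log-pmf of a zero-inflated NB: with probability pi the count is a
    structural zero, otherwise it is drawn from the NB component."""
    nb = math.exp(nb_logpmf(x, mu, theta))
    if x == 0:
        return math.log(pi + (1 - pi) * nb)
    return math.log((1 - pi) * nb)

counts = [0, 0, 1, 3, 7]  # a toy vector of raw UMI counts for one gene
ll = sum(zinb_logpmf(x, mu=2.0, theta=1.5, pi=0.3) for x in counts)
```

The zero-inflation parameter `pi` raises the probability mass at zero above what the NB component alone provides, which is what makes the ZINB a better fit for sparse raw count data.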

Supplementary Material

Supplement 1

Acknowledgements

We would like to acknowledge E. Forte for manuscript proofreading. E.L. was supported by the Eric and Wendy Schmidt Center at the Broad Institute. J.Z. was partially supported by an Apple AI/ML PhD Fellowship. C.U. was partially supported by NCCIH/NIH (1DP2AT012345), ONR (N00014-22-1-2116 and N00014-24-1-2687), the United States Department of Energy (DE-SC0023187), the Eric and Wendy Schmidt Center at the Broad Institute, and a Simons Investigator Award.

