Abstract
Diverse machine learning methods promise to forecast gene expression changes in response to novel genetic perturbations. However, these methods’ accuracy is not well characterized. We created a benchmarking platform that combines a panel of 11 large-scale perturbation datasets with an expression forecasting software engine that encompasses or interfaces to a wide variety of methods. We used our platform to assess methods, parameters, and sources of auxiliary data, finding that it is uncommon for expression forecasting methods to outperform simple baselines. Our platform will serve as a resource to improve methods and to identify contexts in which expression forecasting can succeed.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03840-y.
Keywords: Transcriptional regulation, Transcription factor, Network inference, Gene regulatory network, Perturb-seq, Expression prediction, Expression forecasting
Background
Genetic screening is of fundamental importance to basic biology, and in drug development, it roughly doubles the chance that a preclinical finding will survive translation [1–3]. Fueled by ATAC-seq [4], single-cell RNA-seq [5], and related perturbation assays [6–9], a raft of recent computational methods now offer expression forecasting: prediction of genetic perturbation effects on the transcriptome [10–24]. These expression forecasting methods aim to serve as a new type of general-purpose screening tool. Compared to Perturb-seq and similar assays, in silico modeling is cheaper, less labor-intensive, and easier to apply to less accessible cell types [25]. Applications frequently include nomination, ranking, or screening of up to hundreds of genetic perturbations that are expected to have interesting or valuable effects on cell state [10–12, 14, 18, 21–23]. For example, expression forecasting is being used to optimize reprogramming protocols, to search for antiaging transcription factor (TF) cocktails, and to nominate drug targets for heart disease [18, 19, 26, 27]. In many contexts, expression forecasting is poised to augment or even circumvent genetic screens as a candidate gene selection method.
Existing scholarship provides a number of reasons to believe that expression forecasting under novel genetic perturbations can work, but empirical tests lag behind available methods and data. Reasons for optimism include mathematical identifiability guarantees [12, 28], prior knowledge of gene function [14], gene regulatory networks (GRNs) based on TF-to-target binding from ChIP-seq or motif analysis [10, 11, 20, 21], and empirical tests against simulated perturbation outcomes [23] or real perturbation outcomes [10–12, 15, 18, 22, 27, 29]. However, existing empirical results have shortcomings. Genetic perturbation encompasses a vast range of possible experiments: for example, pluripotent stem cell (PSC) reprogramming via overexpression of TFs yields a completely different predictive task compared to knockdown of stress-response proteins in K562 cells. Whereas data from diverse contexts will be crucial to build general-purpose models of the cell [25], most expression forecasting methods have been tested in only a small number of cellular contexts [11, 14, 15, 18, 20, 22, 28]. Expression forecasting benchmarks on mammalian data have often been conducted by the authors of competing prediction tools in a process that iterates between running tests and altering or tuning methods. This is necessary for method development, yet it raises the possibility of overoptimistic results due to researcher degrees of freedom [30, 31]. Interpreting forecasts and improving tools will require benchmarking of expression forecasting performance on a diverse collection of data.
Here, we fill this gap by creating an expression forecasting framework called the Grammar of Gene Regulatory Networks (GGRN) and a benchmarking platform called PEREGGRN (PErturbation Response Evaluation via a Grammar of Gene Regulatory Networks). Our software enables neutral evaluation across varied methods, parameters, datasets, and evaluation schemes, and we present multiple experiments outlining specific challenges in expression forecasting.
We have recently become aware of many similar benchmarking projects conducted around the same time as our study. These other benchmarking projects are largely complementary to our study, covering different methods, and we synthesize our findings with other recent work in the “Discussion”.
Results
Modular software for GRN-based expression forecasting and benchmarking
GGRN uses supervised machine learning to forecast expression of each gene based on the expression of candidate regulators. GGRN is inspired by the modular structure of CellOracle [10] and the grammar of Oates and Mukherjee [32] but includes additional features. GGRN can use any of nine different regression methods, including mean and median dummy predictors (Fig. 1A). Samples where gene j is directly perturbed are omitted when training models to predict gene j’s expression, which allows GGRN to be trained on interventional data. GGRN can efficiently incorporate user-provided network structures, including dense (all TFs regulate all genes) or empty (no TF regulates any gene) negative control networks (Fig. 1B). GGRN can train models to predict gene expression from regulators measured in the same sample under a steady-state assumption or can instead match each sample to a control or other baseline sample and predict the change in expression (Fig. 1C). GGRN can be run for multiple iterations depending on the desired prediction timescale (Fig. 1D). GGRN can fit cell type-specific models or can use all training data to fit global models (Fig. 1E). A precise description of how these choices interact is given in Additional File 1. In addition to implementing a “grammar,” the GGRN software can also interface with any containerized method (Fig. 1F). Thus, GGRN allows easy head-to-head comparison across individual pipeline components or across full expression forecasting methods.
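The per-gene training scheme described above can be sketched as follows. This is a minimal illustration, not the GGRN implementation: the function names, the closed-form ridge solver, the `-1` code for controls, and the exclusion of a gene from its own feature set are all assumptions made for the example.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, float(y_mean - x_mean @ coef)

def fit_per_gene_models(expr, perturbed_gene, regulators, alpha=1.0):
    """Fit one model per target gene (hypothetical helper).

    expr: (n_samples, n_genes) expression matrix.
    perturbed_gene: per-sample index of the perturbed gene, -1 for controls.
    regulators: column indices of candidate regulators.
    """
    perturbed_gene = np.asarray(perturbed_gene)
    models = {}
    for j in range(expr.shape[1]):
        # Omit samples where gene j itself was directly perturbed.
        keep = perturbed_gene != j
        # Assumption: a gene is not used as a feature for itself.
        features = [r for r in regulators if r != j]
        coef, intercept = fit_ridge(expr[keep][:, features], expr[keep, j], alpha)
        models[j] = (features, coef, intercept)
    return models
```

Because the rows where gene j is targeted are dropped before fitting, the model for gene j never sees the direct effect of the intervention on its own target.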
Fig. 1.
GGRN and PEREGGRN features. A GGRN can predict gene expression using nine different general-purpose supervised learning methods from scikit-learn [37]. The models predict target gene expression (y-axis) from expression of other genes (x-axis). B GGRN interfaces to a collection of published GRNs for use as candidate causal structures. GRNs connect target genes (blue) with candidate regulators (gray). C GGRN predicts expression in a given observation using features either from the same observation or from a matched observation. Matched observations are taken from previous timepoints or from negative controls. Hexagons represent individual observations. D GGRN can be run for multiple iterations depending on the desired prediction timescale. E GGRN can fit a single set of models trained on all cell types or different sets of models for each cell type. F GGRN’s software can interface to any containerized expression forecasting method even when the method is not described by the grammar itself. This panel uses clipart from the Noun Project: “Machine Learning” by apixlabs and “Whale” by Juicy Fish, each licensed under CC BY-SA 3.0. G PEREGGRN includes a variety of genetically perturbed transcriptome datasets. H For robust benchmarking, PEREGGRN is configurable with different evaluation metrics, ways of splitting data, and numbers of features to select
To facilitate performance evaluation, we pair GGRN with PEREGGRN. PEREGGRN includes a collection of 11 quality-controlled and uniformly formatted perturbation transcriptomics datasets (Table 1, Fig. 1G) and a uniformly formatted collection of cell type-specific gene networks derived from motif analysis, co-expression, and other approaches (Table 2). PEREGGRN includes configurable benchmarking software, allowing users to easily choose different numbers of genes, datasets, data splitting schemes, or performance metrics (Table 3, Fig. 1H). PEREGGRN is designed to be reused and extended, and online documentation explains how to add new experiments, datasets, networks, and metrics.
Table 1.
Perturbation transcriptomics data used in this study
| Identifier | Transcriptomic method | Citation | Genes perturbed | Gene selection | Perturbation method | Duration, days | Cell line |
|---|---|---|---|---|---|---|---|
| Nakatake | SureSelect, SurePrint G3 (Agilent) | [43] | 714 | Curated TFs | OE: Transfection with dox-inducible transgenes | 2 | SEES3 (PSC) |
| Nakatake scRNA simulated: resampled with more samples and less depth per sample (methods) | |||||||
| Joung | 10 × 3′ scRNA | [44] | 3548 isoforms | All TFs | OE: Lentiviral ORF transduction | 7 | H1 (PSC) |
| Norman | 10 × 3′ scRNA | [45] | 112 | Screen for fitness | CRISPRa: See legend | 7 | K562 |
| replogle1 | 10 × 3′ scRNA | [46] | 63 | Manual | CRISPRa: dCas9-SunTag/scFV-VP64 | 8 | K562 |
| replogle3 | 10 × 3′ scRNA | [47] | 2285 | All essential genes | CRISPRi: Lentiviral delivery of guide RNA and dCas9-KRAB | 6 | K562 |
| replogle4 | 10 × 3′ scRNA | [47] | 2285 | All essential genes | CRISPRi: Lentiviral delivery of guide RNA, dCas9-Zim3-KRAB | 7 | RPE1 |
| Adamson | 10 × 3′ scRNA | [8] | 87 | UPR-related genes | CRISPRi: dCas9-KRAB and lentiviral guide RNA transduction | 7 | K562 |
| replogle2 | 10 × 3′ scRNA | [47] | 9867 | All expressed genes | CRISPRi: Lentiviral delivery of guide RNA and dCas9-KRAB | 8 | K562 |
| replogle2 large effect: as above but controls and perturbations with energy score above 0.6 | |||||||
| replogle2 tf only: as above but controls and perturbations of TFs | |||||||
| Freimer | 3′ Tag-seq | [48] | 24 | Regulators of IL2RA, IL-2, CTLA4 | CRISPR: Cas9 RNP with lentiviral guide RNA delivery | 5 | CD4+ T cells |
| Dixit | 10 × 3′ scRNA | [7] | 24 | Prior work [49] | CRISPR: Cas9 mouse, lentiviral guide RNA delivery | 7 | K562 |
| Frangieh | 10 × 3′ scRNA | [50] | 248 | Checkpoint inhibitor resistance genes | CRISPR: Cas9 RNP nucleofection and lentiviral guide RNA delivery | 14 | Primary melanoma |
| Frangieh IFNg v1: IFN-gamma-exposed subset, preprocessed identically to DCD-FG experiments | |||||||
| Frangieh IFNg v2: As v1 but with stricter inclusion criteria (“Methods”) | |||||||
| Frangieh IFNg v3: As v2 but pseudo-bulk aggregated within each guide RNA combination (“Methods”) | |||||||
OE overexpression, RNP ribonucleoprotein, PSC pluripotent stem cell. The CRISPRa system used by Norman et al. is a dCas9-SunTag fusion that recruits multiple copies of an scFV-VP64 fusion, where scFV is a single-chain antibody and VP64 is a transcriptional activator [51]
Table 2.
Published networks that are expected to be enriched for human transcriptional regulatory relationships
| Internal name | Description | Data used | Species | Citation |
|---|---|---|---|---|
| encode-nets_human | TF-ChIP data from ENCODE | TF ChIP | Human | [52] |
| chea | TF ChIP data (miscellaneous) | ChIP-X | Human, mouse | [53] |
| humanbase | Bayesian classifier predicts GO term co-occurrence from diverse gene-pair features | RNA-seq, motif matches, perturbation responses | Human | [54] |
| celloracle_human | Motif analysis of promoters | None | Human | [10] |
| csnets | Motif analysis + RNA integration | DNAse, RNA | Human | [55] |
| ANANSE | Motif analysis of enhancers | ATAC and ChIP | Human | [56] |
| MAGNUM | Motif analysis of FANTOM5 CAGE data | CAGE | Human | [57] |
| MARA_FANTOM4 | MARA on FANTOM4 THP-1 data | CAGE | Human | [58] |
| gtex_rna | Graphical LASSO on gTEX data with tree-structured random effects | RNA-seq | Human | [59] |
| CellNet | CLR analysis of microarray data from GEO | Expression microarrays | Human, mouse | [60] |
Table 3.
Evaluation metrics
| Metric | Definition | Motivation |
|---|---|---|
| Proportion correct direction | mean( (O > 0 and E > 0) or (O == 0 and E == 0) or (O < 0 and E < 0) ) | Readily interpretable by end users |
| Spearman | corr(rank(O), rank(E)) | Used by [61], outlier-robust |
| MAE | $\lVert O - E \rVert_1$ | Used by DCD-FG, outlier-robust |
| MSE | $\lVert O - E \rVert_2^2$ | Common in transcriptomics |
| MSE top 100 | $\lVert (O - E)[\mathrm{rank}(-\lvert O \rvert) \le 100] \rVert_2^2$ | Used by GEARS, prioritizes signal when effects are sparse |
| Pearson top 100 | corr((O, E)[rank(-abs(O)) <= 100]) | Similar to an evaluation from [11]* |
| Overlap top 100 | sum((rank(-abs(O)) <= 100) & (rank(-abs(E)) <= 100))/100 | Complements MSE top 100 and Pearson top 100 |
| Cell label accuracy | mean(classifier(C + O) == classifier(C + E)) | Relevant to reprogramming applications |
O indicates observed log fold change, E indicates expected log fold change, and C indicates log-scale expression of controls. Classifiers are trained on Louvain cluster labels in the training data (“Methods”). *Wang et al. use a p-value cutoff, meaning their evaluation can include a variable number of genes depending on the observed differential expression, not always the top 100 genes
PEREGGRN is designed to evaluate predictions about unseen genetic interventions. A key aspect of our evaluations is a nonstandard data split: no perturbation condition is allowed to occur in both the training and the test set. Randomly chosen perturbation conditions and all controls are allocated to the training data, while a distinct set of perturbation conditions is allocated to the test data. Other data splitting strategies are also included and can be a useful point of comparison, but good performance on held-out perturbation conditions is essential to realize the real-world potential of in silico perturbation screening.
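A split of this kind can be sketched in a few lines. The function name, the `"control"` label, and the default test fraction are hypothetical choices for illustration and are not taken from the PEREGGRN code.

```python
import random

def split_by_perturbation(conditions, test_fraction=0.5, control_label="control", seed=0):
    """Return (train_idx, test_idx) such that no perturbation appears in both sets.

    conditions: one label per sample, either a perturbation name or control_label.
    All controls are assigned to the training set.
    """
    perturbations = sorted(set(conditions) - {control_label})
    rng = random.Random(seed)
    rng.shuffle(perturbations)
    n_test = int(round(test_fraction * len(perturbations)))
    test_perts = set(perturbations[:n_test])
    train_idx = [i for i, c in enumerate(conditions) if c not in test_perts]
    test_idx = [i for i, c in enumerate(conditions) if c in test_perts]
    return train_idx, test_idx
```

The unit of randomization is the perturbation condition rather than the individual sample, which is what separates this scheme from a conventional random split.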
Testing on unseen perturbations also requires special handling of the directly targeted gene in order to avoid illusory success: it is not biologically insightful to outperform a baseline by predicting that a knocked-down gene will produce fewer transcripts [33, 34]. To predict perturbation outcomes, we begin with the average expression of all controls. The perturbed gene is then set to 0 (for knockout experiments) or to its observed value after intervention (for knockdown or overexpression experiments). The resulting expression vectors are provided to the predictive models, so predictions must be made for all genes except the genes directly intervened on.
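The construction of model inputs for a held-out perturbation might look like the sketch below. The function names and the string codes for perturbation type are assumptions for illustration. The logic follows the description above: start from the control average, overwrite the targeted gene, and exclude that gene from scoring.

```python
import numpy as np

def make_query_profile(control_expr, gene_idx, kind, observed_value=None):
    """Build the input expression vector for one held-out perturbation.

    control_expr: (n_controls, n_genes) expression of control samples.
    kind: "knockout", "knockdown", or "overexpression" (hypothetical codes).
    observed_value: post-intervention expression of the targeted gene.
    """
    profile = control_expr.mean(axis=0)  # average of all controls
    if kind == "knockout":
        profile[gene_idx] = 0.0
    elif kind in ("knockdown", "overexpression"):
        if observed_value is None:
            raise ValueError("observed_value is required for knockdown/overexpression")
        profile[gene_idx] = observed_value
    else:
        raise ValueError(f"unknown perturbation kind: {kind}")
    return profile

def scoring_mask(n_genes, gene_idx):
    """Boolean mask that excludes the directly targeted gene from evaluation."""
    mask = np.ones(n_genes, dtype=bool)
    mask[gene_idx] = False
    return mask
```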
Given predictions, PEREGGRN reports a variety of evaluation metrics (Table 3). These metrics fall into three broad categories. Some are obvious or commonly used performance metrics (mean absolute error (MAE), mean squared error (MSE), Spearman correlation, proportion of genes whose direction of change is predicted correctly). Others are computed on the top 100 most differentially expressed genes, emphasizing signal over noise for datasets with sparse effects. Finally, accuracy when classifying cell type is of special interest in studies of reprogramming or cell fate.
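Most of the metrics in Table 3 can be written compactly for a single perturbation. The sketch below is an illustrative NumPy version with hypothetical function names, not the PEREGGRN code. It reports MAE and MSE as means, which differ from the norms in Table 3 only by a constant factor, and it omits cell label accuracy because that metric requires a trained classifier.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with ties assigned their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return np.bincount(inverse, weights=ranks)[inverse] / counts[inverse]

def top_k_mask(values, k):
    """Mask selecting the k entries with the largest absolute value."""
    idx = np.argsort(-np.abs(values), kind="stable")[:k]
    mask = np.zeros(len(values), dtype=bool)
    mask[idx] = True
    return mask

def evaluate(observed, expected, k=100):
    """observed (O) and expected (E) are per-gene log fold changes."""
    O = np.asarray(observed, dtype=float)
    E = np.asarray(expected, dtype=float)
    k = min(k, len(O))
    top_o, top_e = top_k_mask(O, k), top_k_mask(E, k)
    return {
        "proportion_correct_direction": float(np.mean(np.sign(O) == np.sign(E))),
        "spearman": float(np.corrcoef(average_ranks(O), average_ranks(E))[0, 1]),
        "mae": float(np.mean(np.abs(O - E))),
        "mse": float(np.mean((O - E) ** 2)),
        "mse_top_k": float(np.mean((O - E)[top_o] ** 2)),
        "pearson_top_k": float(np.corrcoef(O[top_o], E[top_o])[0, 1]),
        "overlap_top_k": float(np.sum(top_o & top_e) / k),
    }
```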
There is no consensus about what type of metric to use for evaluating and interpreting perturbation predictions. Some methods are intended to provide only cell-type classifications or low-dimensional embeddings [10, 12, 15, 18], while others provide gene-specific predictions of expression, velocity, or fold change [11, 13, 14, 21, 23]. When mining perturbation transcriptomic data for functional relatedness among genes, distance within a low-dimensional space has been shown to outperform other distance metrics [35]. Meanwhile, other studies have recommended mean squared error [36] but not as a blanket solution. In Additional File 2, we discuss the behavior of MSE, MSE on a subset of genes, and MSE on low-dimensional embeddings via a bias-variance decomposition. Different metrics give different results when used for model selection, and the best metric depends on biological assumptions. We will also show that different metrics sometimes give substantially different conclusions empirically.
Large-scale perturbation datasets display a high rate of success in targeting individual genes and mostly small global effects
As a first step in building a comprehensive benchmarking infrastructure, we collected and quality-controlled datasets with transcriptome-wide profiles of many genetic perturbations (Table 1). For clarity, we will always refer to each dataset using the identifier shown in Table 1. We focused on datasets with many genes perturbed and on datasets that were previously used to showcase expression forecasting methods. We focused on human data: despite the abundance of perturbation transcriptomic data in Saccharomyces cerevisiae, Escherichia coli, and Caenorhabditis elegans [38–42], such data are less useful for drug target discovery or optimizing directed differentiation protocols for stem cells.
To detect and minimize technical issues, we conducted typical exploratory analysis, quality control, filtering, aggregation, and normalization (“Methods”). For each overexpression or knockdown experiment, the targeted genes’ expression mostly increased or decreased as expected, with the lowest success rates in Joung, Nakatake, and replogle1: in Joung, 73% of overexpressed transcripts increased as expected, and in Nakatake and replogle1, about 92% of overexpressed transcripts increased as expected. We removed knockdown or overexpression samples where the targeted transcript did not increase or decrease as expected. Transcriptome-wide effect size was not obviously correlated with targeted-transcript effect size (Additional File 3: Fig. S1A, B). As a measure of robustness, we examined the Spearman correlation in log fold change between replicates (Additional File 3: Fig. S1C). In cases lacking biological replicates, we examined correlation between technical replicates (e.g., different guide RNAs) (Additional File 3: Fig. S1C). We note that replogle2, replogle3, and replogle4 mostly lack replication even on the level of different guide RNAs, and the samples for which replication is available may not be representative of a typical perturbation. We also computed correlations across any pair of datasets with the same cell type or cell line, the same direction of perturbation effect, and multiple genes perturbed in both datasets (Additional File 3: Fig. S1D). The lowest correlations were between Nakatake and Joung, which are in different PSC lines with different takedown timepoints (2 days versus 7 days) and culture media (StemDiff AK03/AN02N versus StemDiff APEL), and between Dixit (which used CRISPR knockout) and other K562 loss-of-function datasets (which used CRISPRi knockdown). Log fold change was well correlated across some pairs of datasets, including pairs where each individual dataset had low between-replicate correlations.
In summary, this collection provides conservative quality control and a well-characterized testing ground for expression forecasting under genetic perturbation.
There are some key exceptions to our preprocessing. For Adamson, Dixit, and Norman, we used preprocessing identical to the GEARS evaluations [29]. For Frangieh, we preprocess in multiple ways, and version 1 is identical to that used by the DCD-FG evaluations [28]. These choices facilitate closer comparisons to other work, especially papers using the same evaluation setup as GEARS [14, 15, 34].
In most of these datasets, genes to perturb were not selected at random but rather were known to be important for survival or regulation (Table 1). One exception is replogle2, in which all expressed genes were knocked down. To understand how nonrandom perturbation selection affects results, we separately labeled subsets of replogle2 that include (1) only TF perturbations or (2) only large effects.
Regression-based analyses do not outperform non-informative baselines
Different expression forecasting methods are based on different GRN structures and different supervised machine learning methods [10, 11, 13, 21, 23, 28]. Seeing this diversity in existing literature, we investigated what regression methods and network structures perform best while holding other factors constant.
We first compared nine supervised learning methods. In contrast to expression forecasting evaluations to date, we also included mean and median baseline predictors, which ignore the input features and return the training-set mean or median of the target (dependent) variable. We further included a “linear embedding” baseline method based on linear functions of gene embeddings that showed good relative performance in a recent benchmark study [62]. To our surprise, the mean or median baseline was almost always the top performer (Fig. 2, left column). Some performance improvements over mean and median baselines were seen on Nakatake and replogle3 when evaluating cell label accuracy (Additional File 3: Fig. S2, left column), but for the large majority of evaluation metrics and datasets, the mean or median baseline performed best.
Fig. 2.
Evaluation of supervised ML and feature selection methods against simple baseline predictors. Each dot represents performance of one method on one dataset by one metric. Boxplots summarize performance across all datasets. The right margin shows the performance metrics described in Table 3. The y-axis indicates performance as the percent change over the “mean” baseline, capped at ± 100%. Metrics where larger values would normally indicate worse performance, such as MSE, are inverted such that percentages above zero always indicate improved performance. The x-axis labels describe which regression method was used (left), which network was used for feature selection (middle), or how models handle time (right). “Closest” indicates that features used to predict each perturbed profile are drawn from the closest control profile, except for the perturbed gene itself. “Steady state” indicates that features used to predict each perturbed profile are drawn from that same profile
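The y-axis transformation described in the caption can be illustrated with a short sketch. The exact formula used for the figure is not given in this section, so the normalization by the absolute baseline value and the function name are assumptions.

```python
def percent_change_over_baseline(value, baseline, lower_is_better=False, cap=100.0):
    """Percent improvement of a metric over the "mean" baseline, capped at +/- cap.

    lower_is_better: set True for error metrics such as MSE or MAE, so that
    positive percentages always indicate improved performance.
    """
    if baseline == 0:
        raise ValueError("baseline metric is zero; percent change is undefined")
    change = 100.0 * (value - baseline) / abs(baseline)
    if lower_is_better:
        change = -change
    return max(-cap, min(cap, change))
```

For instance, an MSE of 0.5 against a baseline MSE of 1.0 maps to +50%, while any result more than twice as bad as the baseline is clipped to -100%.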
We suspected these generally negative results could be driven by overfitting or lack of causal identification, so we sought to reduce the difficulty of the task. In several cell types, TF regulons based on evidence of direct binding are enriched for target genes that respond to perturbation of those TFs [63–66], and many methods restrict causal relationships by predicting a target gene’s expression using only TFs with nearby motif occurrences in promoters or enhancers [10, 11, 67, 68]. This yields easier regression problems with far fewer predictors. We collected 10 sets of previously published gene networks that had been generated in a variety of ways, including co-expression analysis, shared function of gene products, and cis-regulatory element motif analysis (Table 2). We processed them into a uniform format and incorporated them into our benchmarking framework (“Methods”). The software allows users to easily train diverse supervised ML methods using expression of network neighbors as the input features.
Separately using each of several networks for feature selection, we trained ridge regression models with TF expression as features and target gene expression as labels. (Ridge regression is a simple, fast approach with previous success in interpretable models for post-perturbation expression forecasting [10].) We predicted expression after held-out interventions and then evaluated performance. As controls, we included a dense network (all features are used for each target) and an empty network (no features are used, equivalent to the “mean” baseline). These results use the same data split as the regression model tests described above, so performance can be directly compared. Use of certain prior network structures led to improved cell label accuracy, but by other metrics, the mean, median, or empty network always remained among the top performers (Fig. 2 and Additional File 3: Fig. S2, center column).
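Network-based feature selection for a single target gene can be sketched as below. The function names and the representation of a network as a set of regulator names per target are hypothetical, and the closed-form ridge solver stands in for whatever implementation is used in practice. The sketch also shows why the empty network is equivalent to the “mean” baseline: with no features, only the intercept remains.

```python
import numpy as np

def fit_with_network(X, y, tf_names, regulators, alpha=1.0):
    """Ridge regression restricted to the TFs that the network links to the target.

    X: (n_samples, n_tfs) TF expression; y: (n_samples,) target expression.
    regulators: set of TF names allowed as features for this target.
    """
    cols = [i for i, tf in enumerate(tf_names) if tf in regulators]
    if not cols:
        # Empty network: the model reduces to the training-set mean.
        return cols, np.zeros(0), float(y.mean())
    Xs = X[:, cols]
    x_mean, y_mean = Xs.mean(axis=0), y.mean()
    Xc = Xs - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(len(cols)), Xc.T @ (y - y_mean))
    return cols, coef, float(y_mean - x_mean @ coef)

def predict(model, x):
    """Predict the target from one TF expression vector x."""
    cols, coef, intercept = model
    if not cols:
        return intercept
    return intercept + float(np.asarray(x)[cols] @ coef)
```

Passing `set(tf_names)` as `regulators` gives the dense control, and passing `set()` gives the empty control.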
Finally, we conducted additional experiments surrounding handling of time, another feature of regression-based methods whose treatment in the literature varies. The simplest way of training a regression model for expression forecasting uses predictive features drawn from the same cell as the target (dependent) variable. Viewing the transcriptome as a dynamic process, this implies an assumption that the regulation of interest can be inferred from steady-state measurements alone [10, 11, 28]. But some tools do not assume steady state, rather predicting gene expression from features drawn from a control sample or a previous time point, with only the perturbed gene(s) altered [12, 14, 21]. Among steady-state methods, some tools iterate predictions to simulate total effects, and the number of iterations can affect the content and quality of the results [10, 11]. To understand the effect of these choices, we constructed features to predict each observation from either itself (“steady state” in Fig. 2 and Additional File 3: Fig. S2) or from the closest control (“closest” in Fig. 2 and Additional File 3: Fig. S2). When controls were chosen, we set the perturbed gene to its post-perturbation expression level. We iterated predictions for 1, 3, or 10 time-steps. These configurations did not outperform baselines overall (Fig. 2) but did succeed on certain datasets and metrics (Additional File 3: Fig. S2, right column).
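Iterating predictions can be illustrated with a toy linear model. This is a sketch under strong assumptions: a single linear map `W` with offset `b` stands in for the fitted per-gene models, and the perturbed gene is clamped after every step. The actual GGRN iteration scheme is specified in Additional File 1 and may differ.

```python
import numpy as np

def iterate_forecast(start, W, b, perturbed_idx, perturbed_value, n_steps=1):
    """Apply a linear one-step model repeatedly, holding the perturbed gene fixed.

    start: (n_genes,) starting expression, e.g. a control profile.
    W, b: hypothetical one-step model, next_state = W @ state + b.
    """
    state = np.array(start, dtype=float)
    state[perturbed_idx] = perturbed_value
    for _ in range(n_steps):
        state = W @ state + b
        state[perturbed_idx] = perturbed_value  # keep the intervention in place
    return state
```

In a chain where gene 0 regulates gene 1 and gene 1 regulates gene 2, one iteration changes only gene 1, and a second iteration is needed before the indirect effect reaches gene 2. This is the sense in which the number of iterations determines whether direct or total effects are simulated.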
Very few subsets of genes are predictable
Even if expression forecasting methods do not outperform baseline predictors in terms of transcriptome-wide mean absolute error, they may work for certain target genes or perturbations. The MAE has the advantage of being additive, so contributions from individual samples or features can be examined separately. We used this additivity to search for predictable sets of genes, which we define as sets of genes whose contribution to MAE is lowest for a non-baseline model. For each of six gene-level features (probability of being loss-of-function intolerant [69]; number of exons; mean, variance, or dispersion in the training data; and degree from a CellNet human co-expression network), we grouped target genes into quintiles of that feature. Baselines outperformed other models in most groups of genes (Additional File 3: Fig. S3). Above-baseline performance was concentrated on matching-method experiments on the Norman data (Additional File 3: Fig. S3). Thus, poor performance of simple expression forecasting methods is not driven by specific categories of genes, and it is also not avoidable by focusing on any of these specific subsets of genes.
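The additivity argument can be made concrete with a short sketch that splits MAE by quintiles of a gene-level feature. The function names are hypothetical, and the quintile boundaries are computed with `np.quantile`, which may differ in detail from the grouping used in the study.

```python
import numpy as np

def mae_by_quintile(observed, predicted, gene_feature):
    """Per-quintile MAE, grouping genes by quintile of a gene-level feature.

    observed, predicted: (n_samples, n_genes); gene_feature: (n_genes,).
    Returns (per-quintile MAE, quintile index 0..4 for each gene).
    """
    per_gene = np.abs(observed - predicted).mean(axis=0)
    edges = np.quantile(gene_feature, [0.2, 0.4, 0.6, 0.8])
    quintile = np.searchsorted(edges, gene_feature, side="right")
    group_mae = np.array([per_gene[quintile == q].mean() for q in range(5)])
    return group_mae, quintile

def predictable_quintiles(observed, model_pred, baseline_pred, gene_feature):
    """Quintiles in which the model has lower MAE than the baseline."""
    model_mae, _ = mae_by_quintile(observed, model_pred, gene_feature)
    base_mae, _ = mae_by_quintile(observed, baseline_pred, gene_feature)
    return np.flatnonzero(model_mae < base_mae)
```

Because MAE is an average of per-gene terms, the overall MAE is the size-weighted average of the per-quintile values, so each group's contribution can be inspected on its own.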
Results are robust to different preprocessing choices
Transcriptomic outcomes of genetic perturbation are often analyzed solely in terms of bulk RNA or average expression effects [7, 8, 35, 45–47, 70]. However, GRNs can also be inferred from variation across single cells [10, 11, 71]. For some of the smaller datasets, PEREGGRN retains information on different cells, donors, guide RNAs, and/or clones, rather than only pseudo-bulk profiles. In each experiment, PEREGGRN allows users to specify whether observations should be averaged within each perturbation condition. For five datasets (Nakatake, Adamson, Dixit, Norman, and Freimer), we compared ridge regression models based on empty, dense, and CellOracle default GRNs, with or without averaging. The performance relative to baseline was similar (Additional File 3: Fig. S4A). We also repeated a subset of the initial network experiments using different numbers of highly variable genes to fit the models themselves, again finding little variation in performance (Additional File 3: Fig. S4B). This subset of results is robust to alternative preprocessing decisions.
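Averaging observations within each perturbation condition can be sketched as follows. The function name is hypothetical, and the example assumes expression values are already normalized so that a simple mean is meaningful.

```python
import numpy as np

def average_within_condition(expr, conditions):
    """Average rows of expr that share the same perturbation condition.

    expr: (n_samples, n_genes); conditions: one label per sample.
    Returns (sorted unique labels, (n_conditions, n_genes) mean profiles).
    """
    labels, inverse = np.unique(np.asarray(conditions), return_inverse=True)
    sums = np.zeros((len(labels), expr.shape[1]))
    np.add.at(sums, inverse, expr)  # accumulate each sample into its condition
    counts = np.bincount(inverse, minlength=len(labels))
    return labels, sums / counts[:, None]
```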
Investigating strong performance of the “mean” baseline
We next sought to understand why the mean of the training data has such strong relative performance. Indeed, the “mean” baseline works surprisingly well even in absolute terms: the Spearman correlations between the observed and predicted log fold changes are typically well above zero (Additional File 3: Fig. S5A). Recall that because the fold change is defined relative to control experiments, the mean is not expected to be exactly zero and does vary from gene to gene. To explain the performance of the mean baseline, we considered two hypotheses that are not mutually exclusive: a statistical hypothesis rooted in the bias-variance trade-off and a biological hypothesis rooted in stereotypical responses of each gene to perturbation. Here, we demonstrate that either could contribute to the surprisingly strong mean baseline performance we observe.
Regarding the bias-variance trade-off, consider two alternative estimators of baseline expression: the mean of all training data or the mean of only the control samples. The mean of the control samples is unbiased, but given limited sample sizes, it has high variance. The mean of all training data will be biased by perturbation effects but will have lower variance. If perturbation effects are generally small and noise levels generally high, the mean of all samples may be a better estimator of baseline expression (supporting derivations are provided in Additional File 2). A simulation (“Methods”) shows that in simulated data with only noise and no true perturbation effects, the mean of the training data can easily yield correlations greater than 0.6 between observed and predicted expression changes. This simulation illustrates how the “mean” baseline can yield large, positive correlations between predicted and observed log fold change over controls.
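The effect can be reproduced in a toy setting. The sketch below is not the simulation described in “Methods”: it assumes independent standard normal noise, log fold changes computed against a shared control mean, Pearson rather than Spearman correlation, and hypothetical parameter names. With a noisy control estimate, the noise in the control mean is shared by every log fold change, so the training mean correlates with held-out observations even though no perturbation has any true effect.

```python
import numpy as np

def simulate_null_correlation(n_genes=2000, n_controls=1, n_train=200, n_test=50, seed=0):
    """Mean correlation between the "mean" baseline and held-out log fold changes
    when the data contain only noise and no true perturbation effects."""
    rng = np.random.default_rng(seed)
    controls = rng.normal(size=(n_controls, n_genes))
    train = rng.normal(size=(n_train, n_genes))
    test = rng.normal(size=(n_test, n_genes))
    control_mean = controls.mean(axis=0)   # noisy estimate of baseline expression
    train_lfc = train - control_mean       # log fold change over controls
    test_lfc = test - control_mean
    prediction = train_lfc.mean(axis=0)    # the "mean" baseline
    corrs = [np.corrcoef(prediction, obs)[0, 1] for obs in test_lfc]
    return float(np.mean(corrs))
```

In this toy model the correlation is governed by the ratio of control-mean variance to per-sample variance, so increasing `n_controls` shrinks it.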
Alternatively, we consider a second hypothesis: stereotypical perturbation responses. A stereotypical response for a given gene is similar across many perturbations, perhaps due to common affected signaling pathways, cell stress, or inadequate controls (e.g., off-target effects of a control guide RNA). If perturbation responses are stereotypical, the mean of the training data would perform well on held-out perturbations as it learns the usual response pattern of each gene. Of course, there may be perturbations for which a gene deviates from its stereotypical response, but there is still information in the mean. Specifically, this hypothesis predicts that for independent experiments A and B, fold change of treatment A over control A will be correlated with fold change of treatment B over control B, even when control A and control B are generated independently and even when treatment A is different from treatment B. For datasets with replicated controls, we computed this type of correlation for 100 independent pairs of treatment-over-control log fold change. It was positive and far from zero for some datasets, notably Joung, replogle4, replogle2 large effect, and Nakatake (Additional File 3: Fig. S5B). The same datasets tended to have the highest correlations between predicted and observed log fold change in the “mean” baseline analyses (Additional File 3: Fig. S5A).
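The logic of this test can be illustrated with simulated data. The sketch below is an assumption-laden toy: it uses Pearson correlation, standard normal noise, and an additive shared response, and the function names are hypothetical. Because the two controls are drawn independently, shared control noise cannot produce the correlation, so a positive value points to a response shared across treatments.

```python
import numpy as np

def lfc_correlation(treat_a, ctrl_a, treat_b, ctrl_b):
    """Correlation across genes between two independent treatment-over-control contrasts."""
    return float(np.corrcoef(treat_a - ctrl_a, treat_b - ctrl_b)[0, 1])

def simulate_pairs(shared_sd, n_genes=5000, n_pairs=100, seed=0):
    """Average correlation over independent pairs of experiments.

    shared_sd: standard deviation of a stereotypical response added to every treatment.
    """
    rng = np.random.default_rng(seed)
    shared = rng.normal(scale=shared_sd, size=n_genes)
    corrs = []
    for _ in range(n_pairs):
        # Four independent noise profiles: two treatments and two controls.
        treat_a, ctrl_a, treat_b, ctrl_b = rng.normal(size=(4, n_genes))
        corrs.append(lfc_correlation(treat_a + shared, ctrl_a, treat_b + shared, ctrl_b))
    return float(np.mean(corrs))
```

Setting `shared_sd=0` removes the stereotypical response and drives the average correlation toward zero, which gives a reference point for interpreting values computed on real data.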
To understand the biological meaning of stereotypical perturbation responses, we used Enrichr [72–74] for interactive analysis of the top 100 genes ranked by log fold change from controls to the average post-perturbation profile: genes that, on average, increase the most after perturbation. In Nakatake, these had significant overlap with POU5F1 knockdown response, suggesting that on average, perturbations cause a signature of exit from pluripotency. The list also included 12 known effectors of p53, a known regulator of pluripotency [75]. In Joung, some of the top annotations were for binding of p300, SMAD4, and SMARCD1 in ESCs. SMARCD1 is a SWI/SNF complex member, and p300 is a histone acetyltransferase, suggesting that highly accessible genes in pluripotent cells are poised to increase upon perturbation. In Joung, Enrichr results also showed that 45/100 genes were among the longest 10% in the human genome. This may indicate a technical effect or a transition upon treatment from shorter, ubiquitously expressed genes to longer, tissue-specific genes [76]. In the replogle4 data, the top 100 genes significantly overlapped with responders to ZNF395 knockdown, p53 effectors, and regulators of apoptosis. Since all genes perturbed in replogle4 are essential, LoF may often trigger apoptosis [47]. In replogle2, 34 of the top 100 genes were near NCOR1-binding sites. The replogle2 data are from K562 cells, and NCOR1 targets are important to erythroid differentiation of K562 cells [77]. Supporting the notion of an average effect in replogle2 enriched for erythroid differentiation signature genes, the top 100 genes included 3 hemoglobin subunits, 2 glycophorins, the iron importer SLC25A37, and the heme metabolism component BLVRB. Also among the top 100 are BTG1 and BTG2, which regulate proliferation; 2 E2 ubiquitin ligases, which regulate protein turnover; and PRG2, which is a major component of eosinophil granules (K562 cells can also differentiate into eosinophils [78]).
These findings are exploratory, and the choice of enrichments to discuss is subjective, but overall, the results suggest that the average of the training data often captures biologically meaningful aspects of perturbation responses. Thus, by learning the mean of the training data, perturbation prediction models could appear to display some level of biological insight, even without including any correct mechanisms.
When predicting previously seen perturbations, fully connected networks match or exceed performance of published GRNs
Optimizing for low MSE can in principle select the training data mean even over the correct causal structure (Additional File 2). To determine whether poor performance is caused by noise alone or also by incorrect causal inferences, we repeated the network-selection experiment using the opposite type of data split: we guaranteed that every test-set perturbation condition was seen during training. Many networks outperformed the mean, and the dense (fully connected) network was always the best or among the best performers (Additional File 3: Fig. S6). Thus, statistical shortcomings of MSE as a model-selection criterion do not fully explain our results. Data splitting based on a specific causal inference task, namely prediction of novel interventions, is essential to our results.
Results are mostly consistent across different data splits
To assess the variation caused by specific random seeds used in data splitting, we repeated a small experiment across three separate splits, comparing ridge regression models with features selected using CellOracle’s default human network, a dense (fully connected) network, and an empty-network baseline (Additional File 3: Fig. S7). The qualitative outcome (i.e., whether the model outperformed the baseline) did not change with data split in 59 of the 64 (92%) combinations of dataset and performance metric. This analysis shows that our results are robust and are highly unlikely to be due to an unrepresentative data split.
In network-based simulated data, correct network structures perform best
In some mathematical formulations, GRN reconstruction requires all genes to be perturbed, and held-out perturbations cannot necessarily be predicted [17, 79, 80]. To determine whether evaluation on held-out perturbations can yield meaningful results in an ideal scenario, we simulated data from autoregressive models whose structures were based on several known networks (“Methods”). Compared to a fully connected network, an empty network, or another mis-specified network, the network used to generate the data achieves equal or slightly better performance (Additional File 3: Fig. S8). Other top performers were often networks from similar sources (Additional File 3: Fig. S8). This shows that correctly specified structural models can successfully predict responses to unseen perturbations.
Review and benchmarking of diverse published methods
Given the preceding largely negative findings, we sought to test additional, previously published expression forecasting methods. We included any computational method capable of predicting the outcome of novel genetic perturbations as long as the training requires only single-timepoint transcriptome data (not time-series data or RNA velocity data). We included all methods with a peer-reviewed paper and code available as of Sept 1, 2023, which were DCD-FG, GEARS, and GeneFormer [14, 18, 28].
Although all of these methods demonstrate perturbation predictions after training on expression data, they employ very different strategies, and they have different primary purposes. DCD-FG is intended for causal graph structure inference, not prediction. GEARS is built for predicting post-perturbation gene expression but with a primary emphasis on novel genetic interactions rather than unseen interventions. GeneFormer is a general-purpose foundation model with usage demonstrations focusing on gene dosage sensitivity, chromatin dynamics predictions, gene network dynamics predictions, and prediction of cell state after genetic perturbation. Our evaluations are modeled after a specific demonstration in which GeneFormer predicted changes in cardiac muscle expression state in response to genetic perturbation, after being fine-tuned to predict cell type labels but with no fine-tuning on any perturbation data [18].
When we computed the median performance relative to a non-informative baseline, no methods were consistently better by any metric (Fig. 3). Seeking to reconcile this negative overall result with prior work, we sought to understand how each dataset, evaluation metric, and other specific choices affect the outcomes (Additional File 3: Fig. S9).
Fig. 3.
Performance of published expression forecasting methods. The y-axis shows different performance metrics as described in Table 3. The x-axis labels describe which method was used. The y-axis indicates performance as the percent change over the “mean” baseline. Each boxplot aggregates over all datasets, and each outlying point indicates a single dataset. Variables where larger values indicate worse performance are inverted such that percentages above zero always indicate improved performance
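As a minimal sketch of the orientation convention described in this caption, the Python function below expresses a metric as percent change over a baseline and flips the sign for metrics where larger values are worse. The exact formula used to produce the figure is not stated in the text, so the normalization by the absolute baseline value is an assumption, and the function name is hypothetical.

```python
def percent_improvement(value, baseline, larger_is_better):
    """Percent change relative to a baseline, oriented so positive means better.

    Assumption: the change is normalized by the absolute baseline value.
    """
    if baseline == 0:
        raise ValueError("baseline metric must be nonzero")
    change = 100.0 * (value - baseline) / abs(baseline)
    # Error-type metrics (e.g., MSE, MAE) are inverted so that a decrease counts as improvement.
    return change if larger_is_better else -change

# Hypothetical numbers: an MSE 10% below the baseline MSE.
mse_gain = percent_improvement(0.9, 1.0, larger_is_better=False)
# Hypothetical numbers: a correlation of 0.45 against a baseline of 0.50.
cor_gain = percent_improvement(0.45, 0.50, larger_is_better=True)
```

Under this convention, `mse_gain` is positive (the method beats the baseline) and `cor_gain` is negative (the method is worse than the baseline).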
Using code published by the DCD-FG authors, we repeated the benchmarks of DCD-FG on the gamma interferon-treated subset of Frangieh, which was used for expression forecasting benchmarks in the DCD-FG paper [28]. Comparisons are made against NO-TEARS-LR and NO-TEARS, which are closely related to DCD-FG and are also implemented in the DCD-FG software [81]. The relative performance of DCD-FG and related methods is consistent with Lopez et al. (Additional File 3: Fig. S9a). In addition to the methods implemented in DCD-FG, we included an IID Gaussian baseline that ignores perturbations, and we fit either a diagonal or a full covariance matrix. Surprisingly, the IID baseline outperformed all causal inference methods when using a full covariance matrix (Additional File 3: Fig. S9a).
When predicting a target gene’s expression, the DCD-FG benchmarks allow each method to use held-out expression of genes that are causally upstream according to the learned model (Romain Lopez, personal communication). This is a natural choice due to the mathematical formulation of DCD-FG, but it does not fit the use-case motivating our study, in which held-out perturbations are completely unobserved. Thus, we modified our benchmarking framework to include DCD-FG and to explicitly control the use of held-out gene expression during prediction. Using the top (roughly) 1,000 genes in each dataset with a gene-selection procedure identical to the DCD-FG paper (Methods), we compared DCD-FG against NO-TEARS-LR and two baselines by a variety of metrics. Different metrics yielded substantially different results in many cases (Additional File 3: Fig. S9b). By some metrics, especially MSE top 100, DCD-FG and/or NO-TEARS-LR very narrowly outperformed baselines on Nakatake, Norman, replogle4, and replogle2 large effect (Additional File 3: Fig. S9b). However, the largest of DCD-FG’s improvements over baseline mostly required access to held-out expression of upstream genes (Additional File 3: Fig. S9b).
Since the DCD-FG model does not include measurement error and the melanoma data contain many perturbational regimes with only one cell measured (66% of regimes), measurement error may substantially affect results. We repeated the Frangieh benchmarks after removing perturbational regimes with fewer than 50 cells (Frangieh IFNg v2). This preprocessing also eliminated combinations of knockouts and focused the evaluation entirely on perturbations of genes that are not perturbed in the training data, rather than novel combinations of genes that may be perturbed individually in the training data. On this dataset, DCD-FG did not outperform baselines, even with access to held-out expression of upstream genes (Additional File 3: Fig. S9b). To further understand the effect of bulk versus single-cell measurement error, we repeated the same experiments after removing perturbational regimes with fewer than 50 cells and averaging expression within each perturbational regime (this is labeled Frangieh IFNg v3). DCD-FG did not outperform baselines, even with access to held-out expression of upstream genes.
Since Lopez et al. tested DCD-FG on single-cell data, not bulk data, we also repeated the Nakatake experiments after simulating single-cell data based on the real bulk transcriptome data (Methods). We further tested DCD-FG on Norman, where the focus is on genetic interactions [45]. In Norman, there is a mix of individual perturbations and combinations, so even though no combination is present in both the training and test data, the same gene may be perturbed in both the training and the test data. In both of these experiments, and without having access to post-perturbation expression of upstream genes, DCD-FG’s predictions narrowly outperformed NO-TEARS-LR and baselines across several metrics (Additional File 3: Fig. S9b).
GEARS is another expression forecasting method that has been successfully tested on held-out genetic perturbations or held-out combinations of genetic perturbations [14]. Compared to methods mentioned so far, GEARS uses a different principle to generalize to genes not perturbed in the training data: a graph neural network that enforces similar outcomes for genes with similar membership in Gene Ontology terms. Although its primary purpose is in predicting genetic interactions, GEARS has also shown robust improvement over diverse baseline methods when predicting fold change due to held-out perturbations [14].
We examined relative performance of GEARS stratified by dataset and evaluation metric (Additional File 3: Fig. S10). We note that three of the Perturb-seq datasets [7, 8, 45] in our collection were acquired directly from GEARS, guaranteeing identical preprocessing, although only 1000 highly variable genes were selected (the GEARS evaluations used 6000). GEARS performed best specifically on datasets and evaluation metrics used in the original paper (MSE top 100 on Adamson and Norman). We note that two key factors limit this evaluation. First, GEARS is not designed for use on bulk RNA data, and PEREGGRN cannot scale to millions of cells, so for bulk data (Freimer, Nakatake) or data that were pseudo-bulked (replogle2–4, Frangieh IFNg v3), no GEARS results are presented or included in aggregate performance metrics. Second, to reduce computational burden, we conducted only a single train-test split of each dataset. This could yield volatile results especially on datasets with few perturbations, such as Dixit.
Finally, stratifying GeneFormer results by dataset shows that it outperformed baselines on some datasets for cell label accuracy prediction, but not other metrics (Additional File 3: Fig. S10). This corresponds to how GeneFormer was fine-tuned. We note that label prediction is not the only way of fine-tuning GeneFormer for expression forecasting, and other fine-tuning schemes may yield good performance on a wider variety of metrics. Also, datasets acquired directly from GEARS (Dixit, Adamson, Norman) lacked raw count data, meaning that GeneFormer could not be tested on these datasets.
Discussion
Expression forecasting could have a large impact as a way to augment genetic screens, which are important for drug target discovery and basic science. New expression forecasting methods are being developed rapidly, and in some cases, their predictions are being reported without independent validation [10–24]. End users urgently need concrete evidence regarding their accuracy. To begin to answer this need, we have provided expression forecasting benchmarks across a wide variety of human cell types.
Our framework enables head-to-head comparison of diverse expression forecasting methods and sources of prior knowledge. We test numerous methods using interpretable causal networks, as well as GEARS, which uses prior knowledge of functional relatedness, and GeneFormer, which is pretrained on tens of millions of single-cell transcriptomes. One key result is that most evaluations are negative, with very simple baselines often performing at least as well as the benchmarked methods.
Other recent benchmarks also compare diverse expression forecasting methods using Perturb-seq or similar data [33, 34, 62, 82–86]. Relative to this work, PEREGGRN focuses much more on interpretable models using previously published GRNs. PEREGGRN also puts a distinct priority on biological diversity in the data assembled, with most other studies focusing on K562 cells and none of the other studies including the Joung or Nakatake PSC reprogramming experiments. Unexpectedly, the most influential contribution of PEREGGRN is the mean baseline. Other benchmarking projects have adopted this baseline [33, 34, 62, 84] or have noted that complex models often converge on the mean [82, 85]. Since GEARS, the mean baseline, and the Norman and Adamson datasets are included in our results and very often in others’ results, some findings have begun to converge on a consensus. Specifically, many of these studies either showed that the mean baseline was most often the best method, or they did not include the mean baseline, but did show that GEARS outperformed other methods. In the latter case, we believe the mean baseline would likely outperform GEARS if it were included. For straightforward comparison of this unexpected finding across different studies, we strongly encourage benchmark projects to include the training data mean (and benchmarks using MAE should also include the median). When implementing evaluation metrics, we also encourage omitting directly targeted genes or similar special handling. Wong et al. and Csendes et al. demonstrate how this can improve baselines relative to published methods, especially on evaluations that involve only the top few most differentially expressed genes [33, 34]. 
To summarize, PEREGGRN was intended to investigate how different GRNs and supervised learning approaches affect prediction accuracy across different cell types; however, PEREGGRN and other recent work have shown that for predicting outcomes of novel genetic perturbations, the mean of the training data outperforms a wide variety of models, including foundation models and GRN models, across many different cell types and evaluation metrics.
Poor perturbation prediction has important implications for competing conceptions of biological systems. Some mathematical frameworks have proven that causal models or structures can be identified even with few or no perturbations [28, 87, 88], but these guarantees begin by assuming that all relevant causal factors have random biological variation and are perfectly observed. In other frameworks, it is not necessarily possible to predict novel perturbation outcomes [79, 80], because some causal factors are unobserved or because causal factors remain constant until reached by a perturbation. Our results indicate that the latter frameworks may be more useful conceptual models.
Another result with key implications for future studies is that expression forecasting performance evaluation has enough degrees of freedom to show many methods as the best performer, or at least better than baseline, in ways that are not robust or generalizable. Results can depend on the choice of demo dataset, data split, and baselines. Results can also vary according to the performance metric chosen [35, 36]. The MSE, MAE, and Spearman correlation tended to show baseline methods performing best — and due to noise in the control samples, the baselines themselves can produce high correlations between observed and predicted fold change. Given that Perturb-seq effects are sparse, it seems logical to predict which genes will be most affected, and then only for those genes, to predict an effect size. Focusing on the top most differentially expressed genes frequently yielded slight positive results for non-baseline methods. However, no methods consistently predicted which genes would change the most, meaning the top differentially expressed genes must be known in advance. Because of these pitfalls, unbiased evaluation will require pre-specified, exactly repeatable experiments with strong baselines in a wide variety of biological contexts.
There are two settings where expression forecasting would be especially valuable in lieu of experiments that are impossible, expensive, or unethical. First, in cell types or conditions where large-scale data are not available, especially primary human cells, in silico genetic screens would meet an obvious need. Second, in cell types where many individual perturbations are available [47, 89], in silico perturbation would be useful for combinatorial screening. For the first domain, expression forecasting in primary cells, it would be useful to develop methods and benchmarks for using accessible cell types to assist models of less accessible cell types [90]. Alternatively, natural variation in time-series or RNA velocity data could in principle be enough to reconstruct GRNs and forecast expression, and expression forecasting based on such data has seen promising empirical results, especially when augmented with GRN structures based on motif analysis of open chromatin [10, 12, 13, 21, 68]. For combinatorial screening, the purpose is typically to search for cocktails of genes with specific effects such as reprogramming [91]. Combinatorial explosion of the search space means that expression forecasts could contribute value through scale, prioritizing a test set out of a huge number of candidate groupings [29]. Additional methods development, data generation, and benchmarking will be important in order to define and extend the limits of modern data for cell type transfer and combinatorial screen planning.
Enabling future extensions is crucial for bioinformatics benchmarks [92]. Accordingly, a major component of this work is open-source software and online documentation, with guides on not only how to repeat our results but also how to conduct evaluations involving new methods, new datasets, new evaluation metrics, and alternative ways of splitting data that emphasize different challenges. We hope PEREGGRN will provide a durable resource to the expression forecasting research community.
Conclusions
Expression forecasting under novel genetic perturbations based on a sample of observed genetic perturbations may not be feasible. Simple baselines outperform published methods and network-based methods on most datasets and evaluation metrics. Future work should include strong baselines, especially the training data mean, and may benefit from focusing on genetic interaction prediction or transfer across cell types rather than completely unseen genetic perturbations.
Methods
Simulated data (noise only)
To demonstrate that high correlation can occur between predicted and observed log fold change even when using the “mean” baseline predictor, we used the following R code, which outputs a correlation value of 0.649.
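The idea can be illustrated with a minimal Python sketch. It is not the authors' R code, its settings (one control profile, 100 training perturbations, 5000 genes, unit-variance Gaussian noise) are hypothetical, and it is not expected to reproduce the value 0.649. No gene responds to any perturbation in this simulation, so any correlation arises purely because the predicted and observed log fold changes subtract the same noisy control.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_train = 5000, 100

# Pure noise on a log scale: every profile is an independent draw around the same state.
control = rng.normal(size=n_genes)             # a single noisy control profile
train = rng.normal(size=(n_train, n_genes))    # training-set perturbations
test = rng.normal(size=n_genes)                # one held-out perturbation

# The "mean" baseline predicts the training mean for every held-out perturbation.
predicted_lfc = train.mean(axis=0) - control
observed_lfc = test - control

# Both vectors share the term "-control", which induces a positive correlation.
r = float(np.corrcoef(predicted_lfc, observed_lfc)[0, 1])
print(round(r, 2))
```

With these assumed settings the expected correlation is 1/sqrt((1 + 1)(1/100 + 1)), roughly 0.7, even though the simulation contains no perturbation effects at all.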
Simulated data (based on real networks)
During simulations, we generated data from a vector autoregressive model using the following algorithm.
- Inputs
  ◦ D by D matrix M
  ◦ N by D initial states $X_0$
  ◦ Genes to perturb P
  ◦ Number of steps S
- Outputs
  ◦ Simulated expression profiles of size N by D for the following:
    ▪ Each perturbation at the final time-point
    ▪ Each control at the final time-point
    ▪ Each control’s initial state
- Procedure
  ◦ For p in P
    ▪ $X_{0,p} = X_0$
    ▪ For t in 1 … S
      - $X_{t,p} = M X_{t-1,p}$
      - $X_{t,p}[:, p] = 1$
    ▪ Return $X_{S,p}$
- Initial values
  ◦ D = 811, 713, 788, or 663: the number of TFs in the CellOracle, gtex, cellnet_human_Hg1332, or cellnet_human_Hugene networks, respectively
  ◦ $M_{ij} = 1$ if i regulates j in the indicated network, else 0, and M is scaled to have a maximum eigenvalue of 0.1
  ◦ $X_0$ = independent random numbers, each uniform on [0, 1]
  ◦ P includes all genes
  ◦ S = 1
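The procedure can be sketched in Python with NumPy as follows. This is a minimal illustration, not the released simulation code, and it makes several labeled assumptions. Because states are stored as N by D arrays with samples in rows, the update is written as a right-multiplication by M. "Maximum eigenvalue" is read as the largest eigenvalue modulus. Controls are assumed to follow the same update without clamping. The toy network and all function names are hypothetical.

```python
import numpy as np

def scale_to_spectral_radius(adjacency, target=0.1):
    """Rescale a matrix so that its largest eigenvalue modulus equals `target`."""
    radius = np.max(np.abs(np.linalg.eigvals(adjacency)))
    return adjacency * (target / radius) if radius > 0 else adjacency

def simulate(M, X0, perturbed_genes, n_steps=1):
    """Return final-time-point profiles for each perturbed gene, plus final controls."""
    perturbed = {}
    for p in perturbed_genes:
        X = X0.copy()
        for _ in range(n_steps):
            X = X @ M        # samples are rows, so the update is written X M here
            X[:, p] = 1.0    # clamp the perturbed gene after each step
        perturbed[p] = X
    control = X0.copy()
    for _ in range(n_steps):
        control = control @ M   # assumption: controls follow the same update, unclamped
    return perturbed, control

# Toy network (hypothetical): a 5-gene cycle with one extra edge.
D, N = 5, 3
adjacency = np.roll(np.eye(D), 1, axis=1)   # adjacency[i, j] = 1 when i regulates j
adjacency[0, 2] = 1.0
M = scale_to_spectral_radius(adjacency)
X0 = np.random.default_rng(0).uniform(0.0, 1.0, size=(N, D))
perturbed, control = simulate(M, X0, perturbed_genes=range(D), n_steps=1)
```

With a single step, each perturbed profile differs from the matched control only in the clamped gene, which is why matched controls are informative in this setting.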
Perturbation data acquisition and preprocessing
The Nakatake data were acquired as normalized expression values by personal communication with ExAtlas maintainer Alexei Sharov [43]. The Adamson, Dixit, and Norman datasets were obtained using GEARS. The Dixit paper describes multiple Perturb-seq experiments, but we include only the K562 TF and cell-cycle regulator experiments [7]. For all other datasets, source URLs and citations are provided in the metadata accompanying our data collection.
Our default preprocessing included pseudo-bulk aggregation, DESeq2 normalization, and log1p transformation. For pseudo-bulk aggregation, we summed raw counts from all cells in a given perturbation condition, reasoning that the main source of error in scRNA-seq is multinomial noise in counts and that perturbation responses in cell lines are homogeneous [93]. We also conducted exploratory analysis and clustering via scanpy [94]: we ran PCA, selected neighbors in a 100-dimensional principal subspace, and used the PCA embeddings and neighbors as input to run UMAP [95] and the Louvain algorithm [96]. Louvain clusters were omitted from Freimer, Frangieh, Norman, and Joung datasets because they did not seem attributable to perturbation effects but rather to cell cycle, donor, or depth. For knockdown and overexpression experiments, we removed samples where directly perturbed genes do not change in the expected direction. For knockout experiments, no such filter was applied, since mRNA may still be produced and may even increase due to compensatory mechanisms.
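A minimal sketch of the default aggregation and normalization steps, written in Python with NumPy, is given here for orientation. It does not call the DESeq2 package: the size factors are a simplified median-of-ratios re-implementation, which is an assumption about what the normalization amounts to. The toy counts, labels, and function names are hypothetical.

```python
import numpy as np

def pseudobulk(counts, labels):
    """Sum raw counts (cells x genes) over cells sharing a perturbation label."""
    conditions = sorted(set(labels))
    labels = np.asarray(labels)
    summed = np.vstack([counts[labels == c].sum(axis=0) for c in conditions])
    return conditions, summed

def median_of_ratios_size_factors(bulk):
    """Simplified median-of-ratios size factors (samples x genes input)."""
    expressed = (bulk > 0).all(axis=0)        # use genes with nonzero counts in every sample
    log_counts = np.log(bulk[:, expressed])
    log_geo_mean = log_counts.mean(axis=0)    # per-gene log geometric mean across samples
    return np.exp(np.median(log_counts - log_geo_mean, axis=1))

def normalize_log1p(bulk):
    """Divide each sample by its size factor, then apply log(1 + x)."""
    size_factors = median_of_ratios_size_factors(bulk)
    return np.log1p(bulk / size_factors[:, None])

# Toy data (hypothetical): 6 cells, 4 genes, 3 perturbation conditions.
rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(6, 4)) + 1
labels = ["control", "control", "KLF4", "KLF4", "GATA1", "GATA1"]
conditions, bulk = pseudobulk(counts, labels)
log_expr = normalize_log1p(bulk)
```

Summing rather than averaging keeps the aggregated values as counts, which matters because count-based normalization assumes integer-like input.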
Some dataset-specific procedures differ from the above steps.
- For “nakatake,” missing values were replaced with control expression, and replication enabled testing of most perturbed genes, so we included samples only if the overexpressed gene saw a significant increase (p < 0.1, t-test) or was not measured.
  ◦ Nakatake scRNA simulated consists of 50 simulated cells for each profile in Nakatake. The expression level of cell i and gene g is Poisson with mean $s_i x_g$, where $x_g$ is the normalized expression in the corresponding bulk sample and $s_i$ is a size factor ensuring the total expected count is 10,000 reads per cell. No normalization is applied.
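The single-cell simulation can be sketched in Python with NumPy as follows. This is a minimal illustration under the assumption that the bulk profile is supplied on a linear, nonnegative scale; the function name and the toy profile are hypothetical.

```python
import numpy as np

def simulate_cells(bulk_profile, n_cells=50, depth=10_000, rng=None):
    """Draw Poisson counts for `n_cells` cells from one normalized bulk profile."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(bulk_profile, dtype=float)
    size_factor = depth / x.sum()    # makes the expected total count equal to `depth`
    # The Poisson mean for each gene is size_factor * x_g, shared by all simulated cells.
    return rng.poisson(size_factor * x, size=(n_cells, x.size))

# Toy bulk profile (hypothetical) with three genes.
cells = simulate_cells([1.0, 2.0, 7.0], rng=np.random.default_rng(0))
```

Each row of `cells` is one simulated cell, and the row sums scatter around 10,000 because only the expected total is fixed.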
For “Freimer,” we follow the original authors and remove the outlying sample Donor_4_AAVS1_6, and we use the first 10 principal components. Samples cluster by donor, so we do not evaluate methods based on Louvain cluster label predictions.
To preprocess replogle1, we removed cells with under 2000 UMIs or over 17,000 UMIs (99th percentile). We removed cells where over 40% of UMIs were from the 50 highest expressed genes, cells where over 20% of UMIs were from mitochondrially encoded genes, and cells where over 30% of UMIs were ribosomal protein subunits. We removed genes detected in fewer than 10 cells, unless they were overexpressed. These cutoffs are based on preliminary clustering analysis. We summed counts for each guide RNA and ran TMM normalization, and, starting from the normalized pseudo-bulk expression, we removed guide RNAs that did not increase their targets’ levels as expected.
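The cell-level cutoffs above can be expressed as a boolean mask, sketched here in Python with NumPy. The sketch assumes that the "50 highest expressed genes" are determined separately for each cell, which the text does not specify, and the function and argument names are hypothetical.

```python
import numpy as np

def qc_keep_mask(counts, is_mito, is_ribo, min_umis=2000, max_umis=17_000,
                 max_top50=0.40, max_mito=0.20, max_ribo=0.30):
    """Boolean mask of cells (rows of a cells x genes count matrix) that pass QC."""
    total = counts.sum(axis=1)
    safe_total = np.maximum(total, 1)    # avoid division by zero for empty cells
    # Fraction of UMIs in each cell's own 50 most abundant genes (assumption: per cell).
    top50 = np.sort(counts, axis=1)[:, -50:].sum(axis=1) / safe_total
    mito = counts[:, is_mito].sum(axis=1) / safe_total
    ribo = counts[:, is_ribo].sum(axis=1) / safe_total
    return ((total >= min_umis) & (total <= max_umis)
            & (top50 <= max_top50) & (mito <= max_mito) & (ribo <= max_ribo))

# Toy data (hypothetical): 3 cells, 200 genes; genes 0-4 mitochondrial, 5-14 ribosomal.
n_genes = 200
is_mito = np.arange(n_genes) < 5
is_ribo = (np.arange(n_genes) >= 5) & (np.arange(n_genes) < 15)
good = np.full(n_genes, 20)          # 4000 UMIs, evenly spread
shallow = np.full(n_genes, 5)        # 1000 UMIs, below the minimum
stressed = np.full(n_genes, 20)
stressed[:5] = 400                   # about a third of UMIs are mitochondrial
keep = qc_keep_mask(np.vstack([good, shallow, stressed]), is_mito, is_ribo)
```

Only the first toy cell passes; the second fails on depth and the third on mitochondrial fraction.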
Preprocessing for Frangieh includes three distinct versions. Version 1 follows Lopez et al. exactly to facilitate comparisons with prior work [28]. Version 2 additionally removes cells with total UMIs above the 99th percentile, with 6000 or fewer total UMIs, with 30% or more UMIs in the top 50 expressed genes, with 10% or more UMIs from mitochondrially encoded genes, or with 20% or more UMIs from ribosomal protein subunits. Version 2 also removes cells with two or more guide RNAs detected, which excludes all perturbation conditions having fewer than 50 cells because the low-MOI study design only produced high cell numbers for individual guide RNAs. Version 3 follows version 2 but uses pseudo-bulk aggregation: we summed raw counts for all cells sharing a guide RNA and applied total count normalization and log1p transformation.
Preprocessing for Adamson, Dixit, and Norman follows GEARS [14] to facilitate comparisons with their benchmarks; this includes only the 6000 most variable genes in each dataset (including perturbed samples).
We used the pseudo-bulk aggregated forms of replogle2, 3, and 4 as provided by the associated studies. We subset “replogle2” as follows: Transcription factors were identified using the catalog of Lambert et al. [97], and large effects were defined as mean_leverage_score at least 0.6.
In Joung, expression levels were summed within groups defined by TF ORF and predicted cell cycle phase, and then total count was normalized.
Gene selection
All experiments followed the DCD-FG paper’s gene-selection procedure [28]. Namely, for each dataset, genes were ranked by dispersion using scanpy.pp.highly_variable_genes(…, flavor="seurat_v3"). To select roughly N genes in a dataset with P perturbed genes, we included the top N − P genes by rank together with all perturbed genes. These lists may overlap, yielding fewer than N genes. Experiments use N = 10,000 by default, with the following exceptions. All experiments involving Adamson, Dixit, and Norman include at most 6000 genes due to upstream preprocessing (see above). All experiments involving DCD-FG use N = 1000 to reduce compute time. Exact specifications for each experiment are released alongside our software (see “Availability of data and materials”).
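The selection rule can be sketched in plain Python as follows. The sketch assumes the genes have already been ranked from most to least variable (the ranking step itself is not shown), and the function name and toy gene names are hypothetical.

```python
def select_genes(ranked_genes, perturbed_genes, n_target):
    """Combine the top-ranked genes with all perturbed genes.

    ranked_genes: gene names ordered from most to least variable.
    perturbed_genes: gene names that were directly perturbed.
    n_target: the approximate number of genes to keep (N in the text).
    """
    perturbed = list(dict.fromkeys(perturbed_genes))    # drop duplicates, keep order
    n_top = max(n_target - len(perturbed), 0)
    selected = list(ranked_genes[:n_top])
    already_selected = set(selected)
    for gene in perturbed:
        if gene not in already_selected:                # overlap yields fewer than N genes
            selected.append(gene)
            already_selected.add(gene)
    return selected

# Toy example (hypothetical gene names): N = 4 with two perturbed genes.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
chosen = select_genes(ranked, perturbed_genes=["g2", "g9"], n_target=4)
```

Here `chosen` contains three genes rather than four, because the perturbed gene `g2` is already among the top-ranked genes.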
Network acquisition and preprocessing
Source URLs, citations, and descriptions are given in the metadata accompanying our network collection, and most network acquisition is fully automated using R or python. Networks from FNTM and HumanBase were subsetted to retain only edges with posterior probability over 50%, and networks from ANANSE_0.5 or ANANSE_0.8 were subsetted to retain edges with weight over 50% or 80%, respectively. GTEx co-expression networks were symmetrized: for any edge A -> B, the edge B -> A was added. No other edges were added to or removed from any original source. For CellNet, starting from the documentation at http://pcahan1.github.io/cellnetr/, we downloaded the network files as follows:
cnProc_Hg1332_062414.R
cnProc_mogene_062414.R
cnProc_Hugene_062414.R
cnProc_mouse4302_062414.R
We installed cellnetr using the instructions at https://groups.google.com/forum/#!topic/cellnet_r/pXHt2J6ZH6I. The remainder of the processing is automated.
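The edge thresholding and symmetrization steps described in this section can be sketched in plain Python. The edge-list representation, function names, and toy edges are assumptions made for illustration; they do not reflect the file formats of the original network sources.

```python
def filter_edges(edges, threshold):
    """Keep (regulator, target, weight) edges whose weight exceeds the threshold."""
    return [(reg, tgt, w) for reg, tgt, w in edges if w > threshold]

def symmetrize(edges):
    """For every edge A -> B, also include B -> A; duplicates are removed."""
    pairs = set()
    for reg, tgt in edges:
        pairs.add((reg, tgt))
        pairs.add((tgt, reg))
    return sorted(pairs)

# Toy weighted network (hypothetical): keep edges with weight over 50%.
weighted = [("GATA1", "HBB", 0.9), ("SOX2", "NANOG", 0.4), ("TP53", "CDKN1A", 0.51)]
kept = filter_edges(weighted, threshold=0.5)

# Toy co-expression network (hypothetical): add the reverse of each edge.
undirected = symmetrize([("A", "B"), ("B", "C")])
```

After symmetrization the toy co-expression network contains four directed edges, one in each direction for each original pair.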
GRN inference methods
In initial evaluations, we use simple methods based on supervised machine learning. For each gene, a scikit-learn regressor is trained using the gene’s expression as labels and using the expression of certain TFs as features. By default, all TFs are included, using a manually curated list of TFs [97]. But, if a network is used in the evaluation, then only network-adjacent TFs are included as features. The empty network or dense network baselines fit the same models using no features or all available features. No gene is allowed to autoregulate. Instances where gene A was perturbed directly are not used to train the model that is used to predict expression of gene A. To predict perturbation outcomes, we begin with the average expression of all controls. The corresponding transcription factor is then set to 0 (for knockout experiments) or to its observed test-set value (for knockdown or overexpression experiments). From the resulting perturbed features, regression models are used to make predictions for any gene that is not directly intervened on. The “mean” and “median” methods are implemented alongside all other regression-based methods using sklearn.dummy.DummyRegressor with strategy="mean" or strategy="median".
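The regression framework can be sketched in Python with NumPy as follows. This is a simplified illustration, not the GGRN implementation: it substitutes a closed-form ridge regression for the scikit-learn regressors, treats every listed TF as a candidate feature (no network-based feature selection), and uses simulated data. All function names and the toy dataset are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, penalty=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + penalty * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef

def fit_models(expr, perturbed_gene, tf_idx, penalty=1.0):
    """Fit one model per target gene.

    expr: samples x genes matrix; perturbed_gene[s] is the index of the gene
    perturbed in sample s, or -1 for controls.
    """
    models = {}
    for g in range(expr.shape[1]):
        features = [t for t in tf_idx if t != g]   # no gene may autoregulate
        rows = perturbed_gene != g                 # skip samples where g was perturbed directly
        coef, intercept = fit_ridge(expr[rows][:, features], expr[rows, g], penalty)
        models[g] = (features, coef, intercept)
    return models

def predict_perturbation(models, control_mean, gene, value):
    """Start from the control mean, set the perturbed gene, predict all other genes."""
    state = control_mean.copy()
    state[gene] = value
    prediction = state.copy()                      # the perturbed gene keeps its set value
    for g, (features, coef, intercept) in models.items():
        if g != gene:
            prediction[g] = state[features] @ coef + intercept
    return prediction

# Toy data (hypothetical): genes 0 and 1 are TFs; gene 2 depends on gene 0.
rng = np.random.default_rng(0)
n = 40
expr = np.column_stack([rng.normal(5, 1, n), rng.normal(5, 1, n),
                        np.zeros(n), rng.normal(3, 0.1, n)])
expr[:, 2] = 2 * expr[:, 0] + rng.normal(0, 0.1, n)
perturbed_gene = np.full(n, -1)                    # all samples are controls in this toy example
models = fit_models(expr, perturbed_gene, tf_idx=[0, 1], penalty=1e-3)
control_mean = expr.mean(axis=0)
knockout = predict_perturbation(models, control_mean, gene=0, value=0.0)
```

In this toy example, knocking out gene 0 drives the predicted level of its target, gene 2, from roughly 10 toward 0, while the unregulated gene 3 stays near its control level.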
Within this same regression framework, GeneFormer is used as an alternate feature extraction method. We use GeneFormer to generate perturbed embeddings using in_silico_perturber.delete_index, in_silico_perturber.overexpress_index, and a forward pass through the model. This yields 256 features per observation. Raw counts with all genes, not normalized counts with variable genes, were passed to the GeneFormer tokenizer. When cell-type labels were available, GeneFormer was fine-tuned according to the maintainers’ cell-type classification examples, with a hyperparameter optimization step included.
DCD-FG and GEARS are used according to the maintainers’ documentation. GEARS was run using the “no_test” type of split unless the model could not be initialized, in which case the default type of split was used with 95% of the data used for training. NO-TEARS-LR was run using DCD-FG with the LinearModuleGaussianModel. DCD-FG and NO-TEARS-LR hyperparameters were selected by reserving a simple random sample of ⅓ of the training data and then training 20 models for 600 epochs with constraint mode spectral_radius. The 20 models spanned all combinations of 4 latent dimensions (5, 10, 20, 50) and 5 LASSO penalty parameters (10, 1, 0.1, 0.01, 0.001). The combination with the best MSE was selected, and a model was retrained on the full training data with those parameters.
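The hyperparameter search can be sketched in plain Python as follows. The sketch only enumerates the grid and selects the combination with the lowest validation MSE. Model training is not shown, and `validation_mse` is a hypothetical stand-in for training a model and scoring it on the reserved third of the training data.

```python
import itertools
import random

LATENT_DIMS = [5, 10, 20, 50]
LASSO_PENALTIES = [10, 1, 0.1, 0.01, 0.001]
grid = list(itertools.product(LATENT_DIMS, LASSO_PENALTIES))   # 4 x 5 = 20 combinations

def split_for_validation(sample_ids, fraction=1 / 3, seed=0):
    """Reserve a simple random sample of the training data for model selection."""
    ids = list(sample_ids)
    held_out = set(random.Random(seed).sample(ids, round(len(ids) * fraction)))
    return [i for i in ids if i not in held_out], sorted(held_out)

def select_hyperparameters(grid, validation_mse):
    """Return the (latent_dim, lasso_penalty) pair with the lowest validation MSE."""
    return min(grid, key=lambda combo: validation_mse(*combo))

fit_ids, validation_ids = split_for_validation(range(30))

# Hypothetical stand-in for "train a model, then compute MSE on the reserved samples".
def toy_validation_mse(latent_dim, lasso_penalty):
    return abs(latent_dim - 20) + abs(lasso_penalty - 0.1)

best = select_hyperparameters(grid, toy_validation_mse)
```

After selection, the chosen pair would be used to retrain a model on the full training data, as described above.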
The linear embedding baseline was re-implemented in Python based on the original authors’ R code, available at https://github.com/const-ae/linear_perturbation_prediction-Paper/blob/main/benchmark/src/run_linear_pretrained_model.R. It was deployed with a ridge penalty of 0.01 and a latent dimension of 10, except on the Freimer data, where the latent dimension was reduced to 4.
Evaluations
Initial experiments with GGRN were run on a subset of datasets chosen for either large numbers of perturbations or coverage of unique cell types. We attempted to test published methods on all applicable datasets. Any omissions are due to RAM usage exceeding 64 GB, convergence problems, runtime errors, or invalid input. Specifically, certain tree-based regression methods failed due to memory requirements exceeding 64 GB. DCD-FG and NO-TEARS sometimes yielded NaN predictions across all hyperparameters. GEARS is designed for single-cell data with high sample count; we did not run it on datasets where our collection contains only bulk or pseudo-bulk expression. GeneFormer requires raw count for all genes; we did not run it on datasets where only a subset of genes were available.
For evaluation, we used the type of split marked “interventional” in the PEREGGRN documentation. Samples were grouped by perturbation condition, and each perturbation condition was allocated to the training set or the test set. In datasets with multigene perturbations (Frangieh version 1 preprocessing and Norman), each combination is considered a separate perturbation condition. The post-perturbation expression of any directly perturbed gene was revealed to each algorithm, including code that otherwise only returned the training data mean or median. Perturbations were eligible for the test set only if the perturbed gene was also measured. We used a 50–50 split, meaning 50% of perturbations (but not necessarily 50% of observations) were allocated to the test set. If fewer than 50% of perturbations were eligible for the test set, we allocated all eligible perturbations to the test set. All controls were allocated to the training set. We verified that GeneFormer was not pretrained on any of the datasets we use for evaluation.
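A minimal sketch of this kind of split in plain Python is given here. It is not the PEREGGRN implementation; it handles only single-gene perturbation labels (eligibility rules for multigene combinations are not modeled), and the function name, the control label, and the toy data are hypothetical.

```python
import random

def interventional_split(sample_perturbations, measured_genes, control_label="control",
                         test_fraction=0.5, seed=0):
    """Allocate whole perturbation conditions to the training or test set.

    sample_perturbations: one perturbation label per sample.
    measured_genes: genes whose expression is measured; only these are test-eligible.
    """
    conditions = sorted(set(sample_perturbations) - {control_label})
    eligible = [c for c in conditions if c in measured_genes]
    # Aim for the requested fraction of conditions; if too few are eligible, use them all.
    n_test = min(round(test_fraction * len(conditions)), len(eligible))
    test_conditions = set(random.Random(seed).sample(eligible, n_test))
    train_idx = [i for i, p in enumerate(sample_perturbations) if p not in test_conditions]
    test_idx = [i for i, p in enumerate(sample_perturbations) if p in test_conditions]
    return train_idx, test_idx, test_conditions

# Toy data (hypothetical): two controls and four perturbed genes.
samples = ["control", "control", "A", "A", "B", "C", "D", "D"]
train_idx, test_idx, test_conditions = interventional_split(samples, {"A", "B", "C", "D"})

# If only one perturbed gene is measured, it becomes the entire test set.
_, _, only_eligible = interventional_split(samples, {"A"})
```

Because allocation is by condition rather than by sample, replicates of the same perturbation never appear on both sides of the split, and controls always remain in the training set.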
To compute cell labeling accuracy, we trained a scikit-learn random forest classifier with 100 trees on each training dataset. The classifier was trained on the cluster labels obtained by running the Louvain algorithm [96], which is further described in the section “Perturbation data acquisition and preprocessing.” The classifier was applied to the observed and predicted test-set expression profiles, and accuracy was measured as the fraction where classifier-assigned labels for observed and predicted expression match.
Supplementary Information
Additional file 1. Full documentation of the Grammar of Gene Regulatory Networks.
Additional file 2. Bias-variance analysis of simple models for unperturbed expression and log fold change after perturbation.
Additional file 3: Fig. S1. Related to Fig. 1. Quality analysis of large perturbation datasets. All values shown in this figure are computed after filtering, aggregation, and normalization as described in the Methods. A. Effect on total-count-normalized transcript levels of the perturbed gene (x axis). B. Effect on the perturbed gene (x axis) and the whole transcriptome (y axis). Log fold change (logFC) over controls is estimated using a pseudocount of 1. For Norman, Dixit, and Adamson, only 6,000 highly variable genes are included. C. Pearson correlation between log fold change vectors for pairs of replicates (nakatake, joung), donors (freimer), or distinct guide RNAs (other datasets). Correlations are computed across all genes available in the dataset (Methods).
Fig. S2. Related to Fig. 2. Various performance metrics to assess prediction of held-out perturbations. All experiments within each dataset use the same data split and can be directly compared. The y axis shows different performance metrics as described in Table 3. The x axis labels describe which regression method was used (left), which network was used for feature selection (middle), or how models handle time (right). Black lines separate baselines within each panel. In the right column, “closest” indicates that features used to predict each perturbed profile are drawn from the closest control profile, except for the perturbed gene itself. “Steady state” indicates that features used to predict each perturbed profile are drawn from that same profile. ExtraTrees and KernelRidge are missing from certain panels due to memory requirements exceeding 64 GB. The Ahlmann-Eltze baseline is missing from the replogle2 tf only panel because the entire training set consisted of genes that were perturbed but not measured in the original source data.
Fig. S3. Related to Fig. 2. Grouping target genes into quintiles using gene-level features reveals worse-than-baseline MAE for most gene sets. The y axis shows the MAE contribution of each gene set, expressed as percent change from the best baseline (inverted so that higher is better). The x axis indicates which feature we stratified genes on, and within each feature, the color indicates the quintile.
Fig. S4. Related to Fig. 2. Robustness checks. A. Robustness to preprocessing. The x axis shows GRNs used. Each point represents performance of a single method on a single dataset according to a single metric. The y axis shows performance as percent improvement relative to the empty-network baseline, oriented so that higher numbers are better. The right margin shows different evaluation metrics and the top margin shows whether or not replicates or cells were averaged within each perturbation condition. B. Robustness to the number of highly variable genes selected. The x axis shows GRNs used. Each point represents performance of a single method on a single dataset according to a single metric. The y axis shows performance as percent improvement relative to the empty-network baseline, oriented so that higher numbers are better. The right margin shows different evaluation metrics and the top margin shows how many variable genes were included in the analysis. The actual number of genes may vary from the number shown because all perturbed genes are always included.
Fig. S5. Investigating strong performance of the “mean” baseline. A. Spearman correlation between observed fold change over controls and fold change over controls predicted by the “mean” baseline (y axis) across various datasets (x axis). B. Pearson correlation between observed fold change of treatment A over control A and treatment B over control B, where treatments A and B are different perturbations and controls A and B are separate replicates. Display includes 100 randomly selected pairs per dataset.
Fig. S6. Related to Fig. 2. Relative performance of different GRN structures on previously seen observations. In this experiment, the data split guarantees that each perturbation condition present in the test data was seen at least once in the training data. Each point represents performance of a single method on a single dataset according to a single metric. The x axis shows networks used. The y axis shows performance as percent improvement relative to the empty-network baseline, oriented so that higher numbers are better. The right margin shows different evaluation metrics.
Fig. S7. Variability of various performance metrics (y axis) across data splits (x axis) in a comparison of ridge regression models using feature selection based on the default CellOracle human network, a dense (fully connected) network, and an empty-network baseline (y = 0).
Fig. S8. Related to Fig. 2. Impact of regulatory network source (x axis) on expression forecasting performance (color scale). Evaluations use multiple train-test splits (top margin) of data generated from a specific network structure (right margin). Metrics are defined in Table 3. Data are generated from autoregressive models with random initial state that are run for a single time-step, with matched controls for each sample provided during training (Methods). Red lines indicate that the network used for inference matches the network used to generate the data.
Fig. S9. Evaluations of DCD-FG. A. Negative log likelihood on held-out perturbations in frangieh, performed using the original code for DCD-FG with Gaussian baselines added. The red line (“optimal”) indicates the best median performance across all methods. The blue line (“DCDFG-Best”) indicates the best median performance across all DCD-FG-related methods. B. Benchmarks of DCD-FG and baselines (x axis) on various datasets (top margin). We use as an initial state either the control or held-out expression of upstream genes (right margin). Evaluation metrics (y axis) are as defined in Table 3. Missing data indicate that DCD-FG or NO-TEARS did not converge, or that cell type labels were not available. Performance (color scale) is shown as percent improvement over the “mean” baseline, oriented so that higher numbers always indicate better performance. Percentages are capped at ± 100%.
Fig. S10. Evaluations of published expression forecasting methods (x axis) versus simple baselines. Evaluation metrics (y axis) are as defined in Table 3. Performance (color scale) is shown as percent improvement over the “mean” baseline, oriented so that higher numbers always indicate better performance. Percentages are capped at ± 100%. Missing data indicate that metrics were not available, that methods did not converge, or that methods could not accept the available input data format.
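Two quantities recur in the captions above: log fold change over controls with a pseudocount of 1 (Fig. S1) and percent improvement over a baseline, oriented so that higher is better and capped at ± 100% (Figs. S9 and S10). The following is a minimal sketch of how such quantities could be computed, not the code used in this study. The function names, the use of log base 2, the division by the absolute baseline score, and the handling of a zero baseline are all assumptions made for illustration.

```python
import numpy as np


def log_fold_change(perturbed, control, pseudocount=1.0):
    """Log fold change over controls with a pseudocount.

    Log base 2 is an assumption; the captions only specify the pseudocount.
    """
    perturbed = np.asarray(perturbed, dtype=float)
    control = np.asarray(control, dtype=float)
    return np.log2(perturbed + pseudocount) - np.log2(control + pseudocount)


def percent_improvement(method_score, baseline_score, higher_is_better, cap=100.0):
    """Percent improvement over a baseline, oriented so higher is better.

    For error-type metrics such as MAE (lower is better), the sign is flipped.
    The result is clipped to [-cap, +cap]. Returning NaN for a zero baseline
    is a hypothetical convention chosen for this sketch.
    """
    if baseline_score == 0:
        return float("nan")
    change = (method_score - baseline_score) / abs(baseline_score) * 100.0
    if not higher_is_better:
        change = -change
    return float(np.clip(change, -cap, cap))
```

For example, under these assumptions a method with MAE 0.8 against a baseline MAE of 1.0 scores roughly +20%, while a method with MAE 5.0 is reported at the −100% cap rather than −400%.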
Acknowledgements
We are grateful to Romain Lopez, Yusuf Roohani, and Christina Theodoris for generous help running DCD-FG, GEARS, and Geneformer, respectively. Thanks to Prashanthi Ravichandran and Emily Su for useful discussions and to Stephen Rosen and Dan Peng for help navigating Python packaging. Thanks to Zexi Liu for helping test the code.
Peer review information
Wenjing She was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer review history is available in the online version of this article.
Authors’ contributions
E.K. initiated the study, harmonized the data, and conducted most of the benchmarking. Y.Y. benchmarked DCD-FG and performed QC on several perturbation datasets. J.W. tested code, interpreted results, and advised simulations. P.C. and A.B. funded and supervised the study. All authors contributed to the manuscript. All authors read and approved the final manuscript.
Funding
A.B. was funded by NIH grant R35GM139580. P.C. was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM124725.
Data availability
The datasets supporting the conclusions of this article were previously published elsewhere [7, 8, 43–47, 50, 98]. The harmonized data are available as a collection from Zenodo with DOI 10.5281/zenodo.15048156 [98]. Maintained versions of the project infrastructure are linked from the project homepage [99], and the exact code used is archived on Zenodo [100]. Code is under the MIT license, which is OSI-compliant. All code is written for Python 3.10 on Linux/Unix, with detailed installation and dependency information available online.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
A.B. is a stockholder in Alphabet, Inc., is a co-founder and equity holder in CellCipher, and has consulted for Third Rock Ventures. All other authors have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Alexis Battle, Email: ajbattle@jhu.edu.
Patrick Cahan, Email: patrick.cahan@jhmi.edu.
References
- 1. Nelson MR, Tipney H, Painter JL, Shen J, Nicoletti P, Shen Y, et al. The support of human genetic evidence for approved drug indications. Nat Genet. 2015;47(8):856–60.
- 2. Minikel EV, Painter JL, Dong CC, Nelson MR. Refining the impact of genetic evidence on clinical success. Nature. 2024;629(8012):624–9.
- 3. King EA, Davis JW, Degner JF. Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS Genet. 2019 Dec 12;15(12):e1008489.
- 4. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013;10(12):1213–8.
- 5. Ziegenhain C, Vieth B, Parekh S, Reinius B, Guillaumet-Adkins A, Smets M, et al. Comparative analysis of single-cell RNA sequencing methods. Mol Cell. 2017 Feb 16;65(4):631-643.e4.
- 6. Jaitin DA, Weiner A, Yofe I, Lara-Astiaso D, Keren-Shaul H, David E, et al. Dissecting immune circuits by linking CRISPR-pooled screens with single-cell RNA-seq. Cell. 2016 Dec 15;167(7):1883-1896.e15.
- 7. Dixit A, Parnas O, Li B, Chen J, Fulco CP, Jerby-Arnon L, et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell. 2016 Dec 15;167(7):1853-1866.e17.
- 8. Adamson B, Norman TM, Jost M, Cho MY, Nuñez JK, Chen Y, et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell. 2016 Dec 15;167(7):1867-1882.e21.
- 9. Datlinger P, Rendeiro AF, Schmidl C, Krausgruber T, Traxler P, Klughammer J, et al. Pooled CRISPR screening with single-cell transcriptome readout. Nat Methods. 2017;14(3):297–301.
- 10. Kamimoto K, Stringa B, Hoffmann CM, Jindal K, Solnica-Krezel L, Morris SA. Dissecting cell identity via network inference and in silico gene perturbation. Nature. 2023 Feb 8;614(7949):742–51.
- 11. Wang L, Trasanidis N, Wu T, Dong G, Hu M, Bauer DE, et al. Dictys: dynamic gene regulatory network dissects developmental continuum with single-cell multi-omics. Nat Methods. 2023;20:1368–78. 10.1038/s41592-023-01971-3.
- 12. Yeo GHT, Saksena SD, Gifford DK. Generative modeling of single-cell time series with PRESCIENT enables prediction of cell trajectories with interventions. Nat Commun. 2021 May 28;12(1):3222.
- 13. Qiu X, Zhang Y, Martin-Rufino JD, Weng C, Hosseinzadeh S, Yang D, et al. Mapping transcriptomic vector fields of single cells. Cell. 2022 Feb 17;185(4):690-711.e45.
- 14. Roohani Y, Huang K, Leskovec J. GEARS: predicting transcriptional outcomes of novel multi-gene perturbations. Nat Biotechnol. 2024;42(6):927–35. 10.1038/s41587-023-01905-6.
- 15. Cui H, Wang C, Maan H, Wang B, et al. scGPT: towards building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024;21(8):1470–80. 10.1038/s41592-024-02201-0.
- 16. Osorio D, Zhong Y, Li G, Xu Q, Yang Y, Tian Y, et al. scTenifoldKnk: an efficient virtual knockout tool for gene function predictions via single-cell gene regulatory network perturbation. Patterns (N Y). 2022 Mar 11;3(3):100434.
- 17. Hyttinen A, Eberhardt F, Hoyer PO. Learning linear cyclic causal models with latent variables. J Machine Learn Res. 2012;13:3387–439. https://jmlr.org/papers/v13/hyttinen12a.html.
- 18. Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, Hill MC, et al. Transfer learning enables predictions in network biology. Nature. 2023;618(7965):616–24.
- 19. Amrute JM, Lai L, Ma P, Koenig AL, Kamimoto K, Bredemeyer A, et al. Defining cardiac recovery at single cell resolution. BioRxiv. 2022. 10.1101/2022.09.11.507463.
- 20. Tran A, Yang P, Yang JYH, Ormerod JT. scREMOTE: using multimodal single cell data to predict regulatory gene relationships and to build a computational cell reprogramming model. NAR Genom Bioinform. 2022;4(1):lqac023.
- 21. Burdziak C, Zhao CJ, Haviv D, Alonso-Curbelo D, Lowe SW, Pe’er D. scKINETICS: inference of regulatory velocity with single-cell transcriptomics data. Bioinformatics. 2023;39(39 Suppl 1):i394–403.
- 22. Jiang J, Chen S, Tsou T, McGinnis CS, Khazaei T, Zhu Q, et al. D-SPIN constructs gene regulatory network models from multiplexed scRNA-seq data revealing organizing principles of cellular perturbation response. BioRxiv. 2023. 10.1101/2023.04.19.537364.
- 23. Erbe R, Stein-O’Brien G, Fertig EJ. Transcriptomic forecasting with neural ODEs. BioRxiv. 2023. 10.1101/2022.08.04.502825.
- 24. Lambert J, Oc S, Worssam MD, Häußler D, Figg NL, Baxter R, et al. Network-based prioritisation and validation of novel regulators of vascular smooth muscle cell proliferation in disease. BioRxiv. 2023. 10.1101/2023.08.25.554834.
- 25. Bunne C, Roohani Y, Rosen Y, Gupta A, Zhang X, Roed M, et al. How to build the virtual cell with artificial intelligence: priorities and opportunities. arXiv. 2024. https://arxiv.org/abs/2409.11654.
- 26. Kimmel JC. 2023 Year in Review. NewLimit Blog. 2024 [cited 2024 Sep 3]. Available from: https://blog.newlimit.com/p/2023-year-in-review
- 27. Kamimoto K, Adil MT, Jindal K, Hoffmann CM, Kong W, Yang X, et al. Gene regulatory network reconfiguration in direct lineage reprogramming. Stem Cell Reports. 2023 Jan 10;18(1):97–112.
- 28. Lopez R, Hütter J-C, Pritchard JK, Regev A. Large-scale differentiable causal discovery of factor graphs. arXiv. 2022.
- 29. Roohani Y, Huang K, Leskovec J. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nat Biotechnol. 2024;42(6):927–35.
- 30. Boulesteix A-L, Lauer S, Eugster MJA. A plea for neutral comparison studies in computational sciences. PLoS ONE. 2013 Apr 24;8(4):e61562.
- 31. Nießl C, Herrmann M, Wiedemann C, Casalicchio G, Boulesteix A. Over-optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results. WIREs Data Min & Knowl. 2022. 10.1002/widm.1441.
- 32. Oates CJ, Mukherjee S. Network inference and biological dynamics. Ann Appl Stat. 2012;6(3):1209–35.
- 33. Wong DR, Hill AS, Moccia R. Simple controls exceed best deep learning algorithms and reveal foundation model effectiveness for predicting genetic perturbations. BioRxiv. 2025. 10.1101/2025.01.06.631555.
- 34. Csendes G, Sanz G, Szalay KZ, Szalai B. Benchmarking foundation cell models for post-perturbation RNA-seq prediction. BMC Genomics. 2025;26(1):393. 10.1186/s12864-025-11600-2.
- 35. Dorrity MW, Saunders LM, Queitsch C, Fields S, Trapnell C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat Commun. 2020 Mar 24;11(1):1537.
- 36. Ji Y, Green T, Peidli S, Bahrami M, Liu M, Zappia L, et al. Optimal distance metrics for single-cell RNA-seq populations. BioRxiv. 2023. 10.1101/2023.12.26.572833.
- 37. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Machine Learn Res. 2011;12:2825–30. https://jmlr.org/papers/v12/pedregosa11a.html.
- 38. Kemmeren P, Sameith K, van de Pasch LAL, Benschop JJ, Lenstra TL, Margaritis T, et al. Large-scale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors. Cell. 2014 Apr 24;157(3):740–52.
- 39. Han Y, Li W, Filko A, Li J, Zhang F. Genome-wide promoter responses to CRISPR perturbations of regulators reveal regulatory networks in Escherichia coli. Nat Commun. 2023 Sep 16;14(1):5757.
- 40. Hackett SR, Baltz EA, Coram M, Wranik BJ, Kim G, Baker A, et al. Learning causal networks using inducible transcription factors and transcriptome-wide time series. Mol Syst Biol. 2020;16(3):e9174.
- 41. MacNeil LT, Pons C, Arda HE, Giese GE, Myers CL, Walhout AJM. Transcription factor activity mapping of a tissue-specific in vivo gene regulatory network. Cell Syst. 2015 Aug 26;1(2):152–62.
- 42. Alam MT, Zelezniak A, Mülleder M, Shliaha P, Schwarz R, Capuano F, et al. The metabolic background is a global player in Saccharomyces gene expression epistasis. Nat Microbiol. 2016 Feb 1;1:15030.
- 43. Nakatake Y, Ko SBH, Sharov AA, Wakabayashi S, Murakami M, Sakota M, et al. Generation and profiling of 2,135 human ESC lines for the systematic analyses of cell states perturbed by inducing single transcription factors. Cell Rep. 2020 May 19;31(7):107655.
- 44. Joung J, Ma S, Tay T, Geiger-Schuller KR, Kirchgatterer PC, Verdine VK, et al. A transcription factor atlas of directed differentiation. Cell. 2023 Jan 5;186(1):209-229.e26.
- 45. Norman TM, Horlbeck MA, Replogle JM, Ge AY, Xu A, Jost M, et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science. 2019 Aug 23;365(6455):786–93.
- 46. Replogle JM, Norman TM, Xu A, Hussmann JA, Chen J, Cogan JZ, et al. Combinatorial single-cell CRISPR screens by direct guide RNA capture and targeted sequencing. Nat Biotechnol. 2020;38(8):954–61.
- 47. Replogle JM, Saunders RA, Pogson AN, Hussmann JA, Lenail A, Guna A, et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell. 2022 Jul 7;185(14):2559-2575.e28.
- 48. Freimer JW, Shaked O, Naqvi S, Sinnott-Armstrong N, Kathiria A, Garrido CM, et al. Systematic discovery and perturbation of regulatory genes in human T cells reveals the architecture of immune networks. Nat Genet. 2022;54(8):1133–44.
- 49. Amit I, Garber M, Chevrier N, Leite AP, Donner Y, Eisenhaure T, et al. Unbiased reconstruction of a mammalian transcriptional network mediating pathogen responses. Science. 2009 Oct 9;326(5950):257–63.
- 50. Frangieh CJ, Melms JC, Thakore PI, Geiger-Schuller KR, Ho P, Luoma AM, et al. Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion. Nat Genet. 2021 Mar 1;53(3):332–41.
- 51. Gilbert LA, Horlbeck MA, Adamson B, Villalta JE, Chen Y, Whitehead EH, et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell. 2014 Oct 23;159(3):647–61.
- 52. Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan K-K, Cheng C, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012 Sep 6;489(7414):91–100.
- 53. Lachmann A, Xu H, Krishnan J, Berger SI, Mazloom AR, Ma’ayan A. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics. 2010;26(19):2438–44.
- 54. Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet. 2015;47(6):569–76.
- 55. Ma S, Jiang T, Jiang R. Constructing tissue-specific transcriptional regulatory networks via a Markov random field. BMC Genomics. 2018 Dec 31;19(Suppl 10):884.
- 56. Xu Q, Georgiou G, Frölich S, van der Sande M, Veenstra GJC, Zhou H, et al. ANANSE: an enhancer network-based computational approach for predicting key transcription factors in cell fate determination. Nucleic Acids Res. 2021 Aug 20;49(14):7966–85.
- 57. Marbach D, Lamparter D, Quon G, Kellis M, Kutalik Z, Bergmann S. Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nat Methods. 2016;13(4):366–70.
- 58. FANTOM Consortium, Suzuki H, Forrest ARR, van Nimwegen E, Daub CO, Balwierz PJ, et al. The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat Genet. 2009;41(5):553–62.
- 59. Pierson E, GTEx Consortium, Koller D, Battle A, Mostafavi S, Ardlie KG, et al. Sharing and specificity of co-expression networks across 35 human tissues. PLoS Comput Biol. 2015;11(5):e1004220.
- 60. Cahan P, Li H, Morris SA, Lummertz da Rocha E, Daley GQ, Collins JJ. CellNet: network biology applied to stem cell engineering. Cell. 2014;158(4):903–15.
- 61. Magaletta ME, Lobo M, Kernfeld EM, Aliee H, Huey JD, Parsons TJ, et al. Integration of single-cell transcriptomes and chromatin landscapes reveals regulatory programs driving pharyngeal organ development. Nat Commun. 2022 Jan 24;13(1):457.
- 62. Ahlmann-Eltze C, Huber W, Anders S. Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods. BioRxiv. 2024. 10.1101/2024.09.16.613342.
- 63. Minaeva M, Domingo J, Rentzsch P, Lappalainen T. Specifying cellular context of transcription factor regulons for exploring context-specific gene regulation programs. NAR Genom Bioinform. 2025;7(1):lqae178. 10.1093/nargab/lqae178.
- 64. Lenstra TL, Holstege FCP. The discrepancy between chromatin factor location and effect. Nucleus. 2012 May 1;3(3):213–9.
- 65. Cusanovich DA, Pavlovic B, Pritchard JK, Gilad Y. The functional consequences of variation in transcription factor binding. PLoS Genet. 2014 Mar 6;10(3):e1004226.
- 66. Duttke SH, Guzman C, Chang M, Delos Santos NP, McDonald BR, Xie J, et al. Position-dependent function of human sequence-specific transcription factors. Nature. 2024 Jul 17;631(8022):891–8.
- 67. Young WC, Raftery AE, Yeung KY. Fast bayesian inference for gene regulatory networks using ScanBMA. BMC Syst Biol. 2014;8:47.
- 68. Pemberton-Ross PJ, Pachkov M, van Nimwegen E. ARMADA: using motif activity dynamics to infer gene regulatory networks from gene expression data. Methods. 2015 Sep 1;85:62–74.
- 69. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016 Aug 18;536(7616):285–91.
- 70. Nadig A, Replogle JM, Pogson AN, McCarroll SA, Weissman JS, Robinson EB, et al. Transcriptome-wide analysis of differential expression in perturbation atlases. Nat Genet. 2025;57(5):1228–37. 10.1038/s41588-025-02169-3.
- 71. Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali TM. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat Methods. 2020;17(2):147–54.
- 72. Xie Z, Bailey A, Kuleshov MV, Clarke DJB, Evangelista JE, Jenkins SL, et al. Gene set knowledge discovery with Enrichr. Curr Protoc. 2021;1(3):e90.
- 73. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016 Jul 8;44(W1):W90–7.
- 74. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013 Apr 15;14:128.
- 75. Ungewitter E, Scrable H. Delta40p53 controls the switch from pluripotency to differentiation by regulating IGF signaling in ESCs. Genes Dev. 2010 Nov 1;24(21):2408–19.
- 76. Telonis AG, Rigoutsos I. The transcriptional trajectories of pluripotency and differentiation comprise genes with antithetical architecture and repetitive-element content. BMC Biol. 2021 Mar 25;19(1):60.
- 77. Long MD, van den Berg PR, Russell JL, Singh PK, Battaglia S, Campbell MJ. Integrative genomic analysis in K562 chronic myelogenous leukemia cells reveals that proximal NCOR1 binding positively regulates genes that govern erythroid differentiation and Imatinib sensitivity. Nucleic Acids Res. 2015 Sep 3;43(15):7330–48.
- 78. Sadaf S, Singh AK, Awasthi D, Nagarkoti S, Agrahari AK, Srivastava RN, et al. Augmentation of iNOS expression in myeloid progenitor cells expedites neutrophil differentiation. J Leukoc Biol. 2019;106(2):397–412.
- 79. Wagner A. How to reconstruct a large genetic network from n gene perturbations in fewer than n² easy steps. Bioinformatics. 2001;17(12):1183–97.
- 80. Zhang J, Squires C, Greenewald K, Srivastava A, Shanmugam K, Uhler C. Identifiability guarantees for causal disentanglement from soft interventions. arXiv. 2023. https://arxiv.org/abs/2307.06250.
- 81. Zheng X, Aragam B, Ravikumar P, Xing EP. DAGs with no tears: continuous optimization for structure learning. arXiv. 2018.
- 82. Wenteler A, Occhetta M, Branson N, Huebner M, Curean V, Dee WT, et al. PertEval-scFM: benchmarking single-cell foundation models for perturbation effect prediction. BioRxiv. 2024. https://biorxiv.org/content/early/2024/10/03/2024.10.02.616248.abstract.
- 83. Liu T, Li K, Wang Y, Li H, Zhao H. Evaluating the utilities of foundation models in single-cell data analysis. BioRxiv. 2024. 10.1101/2023.09.08.555192.
- 84. Li C, Gao H, She Y, Bian H, Chen Q, Liu K, et al. Benchmarking AI models for in silico gene perturbation of cells. BioRxiv. 2024. 10.1101/2024.12.20.629581.
- 85. Wu Y, Wershof E, Schmon SM, Nassar M, Osiński B, Eksi R, et al. PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis. arXiv. 2025. https://arxiv.org/abs/2408.10609.
- 86. Li L, You Y, Liao W, Fan X, Lu S, Cao Y, et al. A systematic comparison of single-cell perturbation response prediction models. BioRxiv. 2024. 10.1101/2024.12.23.630036.
- 87. Hashimoto T, Gifford D, Jaakkola T. Learning population-level diffusions with generative RNNs. Proc Mach Learn Res. 2016;48:2417–26. https://proceedings.mlr.press/v48/hashimoto16.html.
- 88. Akutsu T, Miyano S, Kuhara S. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pac Symp Biocomput. 1999;17–28. 10.1142/9789814447300_0003.
- 89. Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell. 2017 Nov 30;171(6):1437-1452.e17.
- 90. Ji Y, Lotfollahi M, Wolf FA, Theis FJ. Machine learning for perturbational single-cell omics. Cell Syst. 2021 Jun 16;12(6):522–37.
- 91. Takahashi K, Yamanaka S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell. 2006 Aug 25;126(4):663–76.
- 92. Weber LM, Saelens W, Cannoodt R, Soneson C, Hapfelmeier A, Gardner PP, et al. Essential guidelines for computational method benchmarking. Genome Biol. 2019 Jun 20;20(1):125.
- 93. Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet. 2021 Jun;53(6):770–7.
- 94. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018 Feb 6;19(1):15.
- 95. McInnes L, Healy J, Saul N, Großberger L. UMAP: uniform manifold approximation and projection. JOSS. 2018 Sep 2;3(29):861.
- 96. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008 Oct 9;2008(10):P10008.
- 97. Lambert SA, Jolma A, Campitelli LF, Das PK, Yin Y, Albu M, et al. The human transcription factors. Cell. 2018 Feb 8;172(4):650–65.
- 98. Kernfeld E, Yang Y, Weinstock J, Battle A, Cahan P. A collection of draft gene regulatory networks and perturbation transcriptomics data. 2025. Zenodo. 10.5281/zenodo.15048156.
- 99. Kernfeld E, Yang Y, Weinstock J, Battle A, Cahan P. perturbation_benchmarking. 2025. GitHub. https://github.com/ekernf01/perturbation_benchmarking.
- 100. Kernfeld E, Yang Y, Weinstock J, Battle A, Cahan P. perturbation_benchmarking. 2025. Zenodo. 10.5281/zenodo.17103072.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Additional file 1. Full documentation of the Grammar of Gene Regulatory Networks.
Additional file 2. Bias-variance analysis of simple models for unperturbed expression and log fold change after perturbation.
Additional file 3: Fig. S1. Related to Fig. 1. Quality analysis of large perturbation datasets. All values shown in this figure are computed after filtering, aggregation, and normalization as described in the Methods. A. Effect on total-count-normalized transcript levels of the perturbed gene (x axis). B. Effect on the perturbed gene (x axis) and the whole transcriptome (y axis). Log fold change (logFC) over controls is estimated using a pseudocount of 1. For Norman, Dixit, and Adamson, only 6,000 highly variable genes are included. C. Pearson correlation between logFC vectors for pairs of replicates (nakatake, joung), donors (freimer), or distinct guide RNAs (other datasets). Correlations are computed across all genes available in the dataset (Methods).
Fig. S2. Related to Fig. 2. Various performance metrics to assess prediction of held-out perturbations. All experiments within each dataset use the same data split and can be directly compared. The y axis shows different performance metrics as described in Table 3. The x axis labels describe which regression method was used (left), which network was used for feature selection (middle), or how models handle time (right). Black lines separate baselines within each panel. In the right column, “closest” indicates that features used to predict each perturbed profile are drawn from the closest control profile, except for the perturbed gene itself. “Steady state” indicates that features used to predict each perturbed profile are drawn from that same profile. ExtraTrees and KernelRidge are missing from certain panels because their memory requirements exceeded 64 GB. The Ahlmann-Eltze baseline is missing from the replogle2 tf only panel because the entire training set consisted of genes that were perturbed but not measured in the original source data.
Fig. S3. Related to Fig. 2. Grouping target genes into quintiles using gene-level features reveals worse-than-baseline MAE for most gene sets. The y axis shows the MAE contribution of each gene set, expressed as percent change from the best baseline (inverted so that higher is better). The x axis indicates which feature we stratified genes on, and within each feature, the color indicates the quintile.
Fig. S4. Related to Fig. 2. Robustness checks. A. Robustness to preprocessing. The x axis shows GRNs used. Each point represents performance of a single method on a single dataset according to a single metric. The y axis shows performance as percent improvement relative to the empty-network baseline, oriented so that higher numbers are better. The right margin shows different evaluation metrics and the top margin shows whether or not replicates or cells were averaged within each perturbation condition. B. Robustness to the number of highly variable genes selected. The x axis shows GRNs used. Each point represents performance of a single method on a single dataset according to a single metric. The y axis shows performance as percent improvement relative to the empty-network baseline, oriented so that higher numbers are better. The right margin shows different evaluation metrics and the top margin shows how many variable genes were included in the analysis. The actual number of genes may vary from the number shown because all perturbed genes are always included.
Fig. S5. Investigating strong performance of the “mean” baseline. A. Spearman correlation between observed fold change over controls and fold change over controls predicted by the “mean” baseline (y axis) across various datasets (x axis). B. Pearson correlation between observed fold change of treatment A over control A and treatment B over control B, where treatments A and B are different perturbations and controls A and B are separate replicates. Display includes 100 randomly selected pairs per dataset.
Fig. S6. Related to Fig. 2. Relative performance of different GRN structures on previously seen observations. In this experiment, the data split guarantees that each perturbation condition present in the test data was seen at least once in the training data. Each point represents performance of a single method on a single dataset according to a single metric. The x axis shows networks used. The y axis shows performance as percent improvement relative to the empty-network baseline, oriented so that higher numbers are better. The right margin shows different evaluation metrics.
Fig. S7. Variability of various performance metrics (y axis) across data splits (x axis) in a comparison of ridge regression models using feature selection based on the default CellOracle human network, a dense (fully connected) network, and an empty-network baseline (y = 0).
Fig. S8. Related to Fig. 2. Impact of regulatory network source (x axis) on expression forecasting performance (color scale). Evaluations use multiple train-test splits (top margin) of data generated from a specific network structure (right margin). Metrics are defined in Table 3. Data are generated from autoregressive models with random initial state that are run for a single time step, with matched controls for each sample provided during training (Methods). Red lines indicate that the network used for inference matches the network used to generate the data.
Fig. S9. Evaluations of DCD-FG. a. Negative log likelihood on held-out perturbations in frangieh, performed using the original code for DCD-FG with Gaussian baselines added. The red line (“optimal”) indicates the best median performance across all methods. The blue line (“DCDFG-Best”) indicates the best median performance across all DCD-FG-related methods. b. Benchmarks of DCD-FG and baselines (x axis) on various datasets (top margin). We use as an initial state either the control or held-out expression of upstream genes (right margin). Evaluation metrics (y axis) are as defined in Table 3. Missing data indicate that DCD-FG or NOTEARS did not converge, or that cell type labels were not available. Performance (color scale) is shown as percent improvement over the “mean” baseline, oriented so that higher numbers always indicate better performance. Percentages are capped at ± 100%.
Fig. S10. Evaluations of published expression forecasting methods (x axis) versus simple baselines. Evaluation metrics (y axis) are as defined in Table 3. Performance (color scale) is shown as percent improvement over the “mean” baseline, oriented so that higher numbers always indicate better performance. Percentages are capped at ± 100%. Missing data indicate that metrics were not available, that methods did not converge, or that methods could not accept the available input data format.
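Two conventions recur in the captions above: log fold change over controls estimated with a pseudocount of 1, and performance reported as percent improvement over a baseline, oriented so that higher is better and capped at ± 100%. The following Python sketch illustrates one way these quantities could be computed. It is not the benchmarking code itself: the function names, the base-2 logarithm, and the normalization by the absolute value of the baseline are assumptions made for illustration.

```python
import numpy as np


def log_fold_change(perturbed, control, pseudocount=1.0):
    """Log fold change of perturbed over control expression.

    The pseudocount of 1 follows the Fig. S1 caption; the base-2
    logarithm is an assumption, since the caption does not state the base.
    """
    perturbed = np.asarray(perturbed, dtype=float)
    control = np.asarray(control, dtype=float)
    return np.log2((perturbed + pseudocount) / (control + pseudocount))


def percent_improvement(metric, baseline, higher_is_better, cap=100.0):
    """Percent improvement of a metric over a baseline value.

    The result is oriented so that positive numbers always mean "better
    than baseline" and is clipped to [-cap, +cap]. Dividing by the
    absolute baseline value is an illustrative assumption.
    """
    if baseline == 0:
        return float("nan")  # relative change is undefined for a zero baseline
    change = 100.0 * (metric - baseline) / abs(baseline)
    if not higher_is_better:
        change = -change  # for error metrics such as MAE, lower is better
    return float(np.clip(change, -cap, cap))
```

For an error metric such as MAE, `percent_improvement(0.8, 1.0, higher_is_better=False)` gives roughly 20, meaning the method's error is about 20% lower than the baseline's, while a method with five times the baseline error is reported at the cap of -100.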
Data Availability Statement
The datasets supporting the conclusions of this article were previously published elsewhere [7, 8, 43–47, 50, 98]. The harmonized data are available as a collection from Zenodo with DOI 10.5281/zenodo.15048156 [98]. Maintained versions of the project infrastructure are linked from the project homepage [99], and the exact code used is archived on Zenodo [100]. Code is under the MIT license, which is OSI-compliant. All code is written for Python 3.10 on Linux/Unix, with detailed installation and dependency information available online.




