Abstract
Advances in single-cell sequencing and CRISPR technologies have enabled detailed case-control comparisons and experimental perturbations at single-cell resolution. However, uncovering causal relationships in observational genomic data remains challenging due to selection bias and inadequate adjustment for unmeasured confounders, particularly in heterogeneous datasets. To address these challenges, we introduce causarray, a doubly robust causal inference framework for analyzing array-based genomic data at both bulk-cell and single-cell levels. causarray integrates a generalized confounder adjustment method to account for unmeasured confounders and employs semiparametric inference with flexible machine learning techniques to ensure robust statistical estimation of treatment effects. Benchmarking results show that causarray robustly separates treatment effects from confounders while preserving biological signals across diverse settings. We also apply causarray to two single-cell genomic studies: (1) an in vivo Perturb-seq study of autism risk genes in developing mouse brains and (2) a case-control study of Alzheimer’s disease using three human brain transcriptomic datasets. In these applications, causarray identifies clustered causal effects of multiple autism risk genes and consistent causally affected genes across Alzheimer’s disease datasets, uncovering biologically relevant pathways directly linked to neuronal development and synaptic functions that are critical for understanding disease pathology.
Keywords: causal inference, confounder adjustment, counterfactual, double robustness, differential expression analysis
Introduction
The advent of genomic research has transformed our understanding of biological processes and disease mechanisms. Advances in single-cell RNA sequencing (scRNA-seq) have driven this rapid progress, offering unprecedented insights into gene expression patterns at the cellular level (1). The high resolution provided by scRNA-seq data is essential to elucidate cellular heterogeneity and its implications for health and disease (2–4). However, fully harnessing the potential of these data requires robust analytical frameworks capable of moving beyond association to unravel complex causal relationships at single-cell resolution (5–7). The fundamental difference between association and causation is that association assesses correlations between treatments and outcomes, whereas causal inference aims to quantify the effect of a treatment on an outcome. A popular framework for causal inference is the potential outcomes framework, which estimates the counterfactual: what would have happened had a different treatment been assigned (7, 8). Causal inference is required to understand the mechanisms of biological processes and diseases well enough to inform treatments, precision medicine, and genomic medicine (9, 10).
One of the primary challenges in leveraging scRNA-seq data for causal inference is its inherent hierarchical organization and heterogeneity (6, 7, 11). Cells derived from the same individual are not independent observations; they share biological factors, such as correlated variability, and technical factors, such as batch effects introduced during storage and sequencing. These dependencies violate the assumption of independent and identically distributed (i.i.d.) samples, complicating statistical analyses and rendering traditional methods inadequate for handling heterogeneous data with unwanted variation (12, 13). Furthermore, most genomic studies are observational in nature. Unlike randomized controlled trials, observational studies lack complete knowledge of the disease or treatment assignment mechanism, leading to potential biases in counterfactual estimation.
CRISPR perturbation experiments, a more recent but rapidly expanding area, offer a new set of challenging analysis scenarios (14–16). For this experimental setting, perturbed cells are contrasted with cells that receive a non-targeting perturbation. While there is some randomness in the treatment assignment, it is not entirely random: continuous unmeasured confounders such as variability in cell size or differential drug exposure can result in biased causal estimates. Additionally, when such experiments are performed in vivo, the possibility of confounding increases (17), further justifying the need for robust causal inference analysis.
Existing methods for causal inference, such as CoCoA-diff (6) and CINEMA-OT (11), rely on simple matching techniques that assume the causal structure is transferable between treatment and control groups. However, this assumption breaks down when covariate distributions differ significantly across groups, leading to biased estimates. Moreover, even after controlling for observed confounders, unmeasured confounders can undermine the validity of causal conclusions (18, 19). Other methods like surrogate variable analysis (SVA) (20) and RUV (13) aim to address confounding and unwanted variation via linear models that assume additive relationships between covariates and outcomes. While effective for certain bulk RNA-seq datasets, these approaches often fail to capture the sparsity, zero inflation, and over-dispersion inherent in single-cell genomic data (18, 21). Tackling these challenges requires integrating robust confounder adjustment with flexible modeling techniques to ensure valid causal inference in complex genomic data.
In response to these challenges, we introduce a new framework for applying causal inference in genomic studies. Our approach leverages generalized factor models tailored to count data to adjust robustly for unmeasured confounders while preserving biological signals. It further relies on the potential outcomes framework and employs a doubly robust estimation procedure, which combines outcome and propensity score models to ensure reliable statistical inference even if one model is misspecified (22, 23). This framework effectively addresses biases introduced by both observed and unobserved confounders, making it particularly well-suited for analyzing complex genomic data at both bulk and single-cell levels (Fig. 1a). By integrating advanced statistical and machine learning techniques with a causal inference framework, our method enables a range of downstream analyses, including accurate estimation of counterfactual distributions, causal gene detection, and conditional treatment effect analysis. This approach not only improves the interpretability and precision of genomic analyses but also uncovers critical insights into gene expression dynamics under disease or perturbation conditions, advancing our understanding of underlying biological mechanisms.
Fig. 1. Overview of the proposed causarray method.
a, Illustration of the data generation process for pseudo-bulk and single-cell data. b, The gene expression matrix is linked to the treatment, measured covariates, and unmeasured confounding variables via a generalized linear model (GLM). The cell-wise size factors and gene-wise dispersion parameters are estimated from the data, and the unmeasured confounders are estimated through the augmented GCATE method. c, Generalized linear models and flexible machine learning methods, including random forests and neural networks, can be applied for outcome modeling and propensity modeling. The estimated outcome and propensity score functions give rise to the estimated potential outcomes for each cell and each gene. d, Downstream analysis includes contrasting the estimated counterfactual distributions, performing causal inference, and estimating the conditional average treatment effects.
We demonstrate the effectiveness of causarray through benchmarking on several simulated datasets, comparing its performance with existing single-cell-level perturbation analysis methods and pseudo-bulk-level differential expression (DE) analysis methods. Next, we apply causarray to two single-cell genomic studies: a Perturb-seq study investigating autism spectrum disorder/neurodevelopmental disorder (ASD/ND) genes in developing mouse brains and a case-control study of Alzheimer’s disease using human brain transcriptomic datasets. For the Alzheimer’s disease analysis, we validate our findings across three independent datasets, showcasing the robustness and reproducibility of causarray in identifying causally affected genes and uncovering biologically meaningful pathways. These applications highlight the potential of causarray to advance our understanding of complex disease mechanisms through rigorous causal inference.
Results
Doubly-robust counterfactual imputation and inference
Our objective is to determine whether a gene is causally affected by a “treatment” variable after controlling for other technical and biological covariates, which may affect the treatment and outcome variables. Here, we use the term treatment generally; in the narrow sense, it can mean genetic and/or chemical perturbations (17, 24), such as CRISPR-Cas9, and, more broadly, it can mean the phenotype of a disease (6). We acknowledge that while many differentially expressed genes can be considered a result of disease status, for most late-onset disorders, a smaller fraction of genes could have initiated disease phenotypes. Our method aims to determine the direct effects of treatments on modulated gene expression outcomes.
In observational data, the response variable can be confounded by measured and unmeasured biological and technical covariates, making it difficult to separate the treatment effect from other unknown covariates. As a consequence, it is challenging to draw causal inferences; even tests of association may lead to an excess of false discoveries and/or low power. Fortunately, the potential outcomes framework (22, 23) formulates general causal problems in a way that allows for the treatment effect to be separated from the effects of other variables. However, even this framework is challenged by unmeasured covariates. Before introducing our method for estimating unmeasured confounders, we first outline the general potential outcomes framework.
Consider a study in which $Y$ is the response variable and $A$ is the binary treatment variable for an observation. In the potential outcomes framework, $Y(a)$ is the outcome that we would have observed if we set the treatment to $A = a$. Naturally, we can only observe one of the two potential outcomes for each observation, so the observed response is
$$Y = A\,Y(1) + (1 - A)\,Y(0),$$
and a causal effect is defined as a contrast between the two potential outcomes, such as the average treatment effect $\mathbb{E}[Y(1) - Y(0)]$.
In the context of a case-control study of a disease, this contrast answers the question: What is the expected difference in gene expression if an individual had the disease (case, $A = 1$) versus if they did not (control, $A = 0$)?
Doubly robust methods provide a powerful tool for estimating potential outcomes in observational studies where randomization is not possible (22, 23). Specifically, we estimate two key quantities: (1) $\mu_a(X)$, the mean response of the outcome variable conditional on treatment $A = a$ and covariates $X$, and (2) $\pi_a(X)$, the propensity score, which is defined as the probability of receiving treatment $a$ given covariates $X$, i.e., $\pi_a(X) = \mathbb{P}(A = a \mid X)$. Using these estimates, we compute potential outcomes as
$$\hat{Y}(a) = \frac{\mathbb{1}\{A = a\}}{\hat{\pi}_a(X)}\,\bigl(Y - \hat{\mu}_a(X)\bigr) + \hat{\mu}_a(X).$$
The doubly robust estimator’s name comes from the fact that it provides a consistent estimate as long as either the outcome model, $\mu_a$, or the propensity score model, $\pi_a$, is correctly specified. Given this estimate, we can easily perform downstream inference tasks such as computing log fold change (LFC) (Methods) and testing for causal effects on gene expression (Fig. 1a). An advantage of this approach is that counterfactual imputation denoises and balances gene expression under the two conditions. Additionally, having access to estimated potential outcomes facilitates downstream analyses such as estimating causal effects conditional on measured confounders like age.
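The double robustness property described above can be illustrated with a minimal Python sketch. It is not the causarray implementation: the function name `aipw_pseudo_outcomes`, the toy data-generating model, and the notation (outcome predictions `mu0`, `mu1` and propensity scores `pi1`) are assumptions made for illustration. The outcome model is deliberately misspecified as a constant while the propensity score is correct, and the estimate of the average treatment effect still lands near the true value of 2, whereas the naive group difference is biased by the confounder.

```python
import random


def aipw_pseudo_outcomes(y, a, mu0, mu1, pi1):
    """Doubly robust (AIPW) pseudo-outcomes under control and treatment.

    y, a     : observed outcomes and binary treatments (0 or 1)
    mu0, mu1 : outcome-model predictions under control and treatment
    pi1      : propensity scores P(A = 1 | X), assumed strictly inside (0, 1)
    """
    y0_hat, y1_hat = [], []
    for yi, ai, m0, m1, p in zip(y, a, mu0, mu1, pi1):
        # Outcome-model prediction plus an inverse-probability-weighted residual.
        y1_hat.append(m1 + ai / p * (yi - m1))
        y0_hat.append(m0 + (1 - ai) / (1 - p) * (yi - m0))
    return y0_hat, y1_hat


def mean(values):
    return sum(values) / len(values)


# Toy data (hypothetical): x confounds both treatment and outcome.
rng = random.Random(0)
n = 20000
x = [rng.random() for _ in range(n)]
pi1 = [0.2 + 0.6 * xi for xi in x]                 # true propensity score
a = [1 if rng.random() < p else 0 for p in pi1]
y = [1.0 + 2.0 * ai + 3.0 * xi + rng.gauss(0.0, 1.0) for ai, xi in zip(a, x)]

# Deliberately misspecified outcome model: a constant that ignores x and a.
y_bar = mean(y)
mu0 = [y_bar] * n
mu1 = [y_bar] * n

y0_hat, y1_hat = aipw_pseudo_outcomes(y, a, mu0, mu1, pi1)
ate_dr = mean(y1_hat) - mean(y0_hat)               # close to the true effect, 2

treated = [yi for yi, ai in zip(y, a) if ai == 1]
control = [yi for yi, ai in zip(y, a) if ai == 0]
ate_naive = mean(treated) - mean(control)          # biased upward by x
```

Swapping the roles (a correct outcome model with a wrong propensity score) would also give a consistent estimate; only when both models are wrong does the guarantee fail.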
A key step in these types of analyses is estimating unmeasured confounders. To adjust for confounding, factor models were popularized in surrogate variable analysis literature and have since been widely adopted in bulk gene expression studies (20). Recently, we extended this approach to single-cell RNA-seq data using generalized linear models that better accommodate pseudobulk and single-cell outcome variables (18). Using this generalized factor analysis approach, we estimate unmeasured confounders alongside potential outcomes (Fig. 1b–c), enabling direct estimation of downstream quantities such as LFC (Fig. 1d).
Simulation study demonstrates the advantages of causarray
We evaluate the performance of causarray in two simulated settings (Appendix S3). In the first setting, we generate simulated pseudo-bulk data, while in the second, we generate simulated single-cell data using the Splatter simulator (25), which explicitly models the hierarchical Gamma-Poisson processes underlying scRNA-seq data and captures multi-faceted variability. Each dataset consists of 100–300 cells, approximately 2,000 genes, 1–2 covariates, and 4 unmeasured confounders.
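To make the Gamma-Poisson hierarchy concrete, the following is a minimal stdlib-only sketch of drawing over-dispersed counts. It is not the Splatter simulator and does not reproduce the simulation design in Appendix S3; the function names, the mean of 5, and the dispersion of 0.5 are illustrative assumptions. A Gamma-distributed rate is drawn first, and a Poisson count is then drawn given that rate, which yields a variance of mean + dispersion × mean².

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_gamma_poisson(mean, dispersion, rng):
    """Draw one count from a Gamma-Poisson (negative binomial) hierarchy."""
    shape = 1.0 / dispersion
    scale = mean * dispersion            # shape * scale equals the target mean
    rate = rng.gammavariate(shape, scale)
    return sample_poisson(rate, rng)


rng = random.Random(0)
draws = [sample_gamma_poisson(5.0, 0.5, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
# A plain Poisson would have variance close to the mean; here it is much larger.
```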
To benchmark causarray, we compare it with several existing methods designed for differential expression (DE) testing, both with and without confounder adjustment (Fig. 2a). For methods that do not account for unmeasured confounders, we include the Wilcoxon rank-sum test and DESeq2 (26). In the presence of measured covariates, these two methods regress the gene expression counts on the covariates using a Poisson or negative binomial generalized linear model, respectively. The Wilcoxon rank-sum test then takes the deviance residuals as input. For confounder-adjusted methods, we consider CoCoA-diff (6), CINEMA-OT (11), CINEMA-OT-W (11), RUV (12), and RUV-III-NB (13), where the recommended DE test methods are subsequently applied with the estimated confounders. A short summary of each of these benchmarking comparison methods can be found in Methods.
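As an illustration of the residuals passed to the rank-sum test, the sketch below computes Poisson deviance residuals from counts and fitted means. The function name is hypothetical, and the fitted means are assumed to come from a Poisson GLM fitted elsewhere; the toy example uses an intercept-only fit, for which the fitted mean is the sample mean.

```python
import math


def poisson_deviance_residual(y, mu):
    """Signed square root of the Poisson unit deviance for count y and fitted mean mu."""
    if mu <= 0:
        raise ValueError("fitted mean must be positive")
    # By convention, y * log(y / mu) is taken to be 0 when y == 0.
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    unit_deviance = 2.0 * (log_term - (y - mu))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.copysign(math.sqrt(max(unit_deviance, 0.0)), y - mu)


# Toy example: an intercept-only fit, so every fitted mean equals the sample mean.
counts = [0, 1, 3, 8]
fitted_mean = sum(counts) / len(counts)
residuals = [poisson_deviance_residual(c, fitted_mean) for c in counts]
```

The residuals of treated and control samples could then be compared with a rank-sum test, for example `scipy.stats.ranksums`.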
Fig. 2. Benchmarking of causarray against other methods for single-cell differential expression testing on synthetic expression data with unmeasured confounders.
a, The analysis pipeline produces a confounder adjustment and a statistic for DE testing. We illustrate two types of criteria used for benchmarking confounder adjustment and DE methods in bulk simulations (b-e) and single-cell simulations (Fig. S1). b, Performance comparison of causarray and other methods with a well-specified number of latent factors. Bar plots show median ARI and ASW scores for confounder estimation, while box plots display FPR and TPR for biological signal preservation. The top and bottom hinges represent the top and bottom quartiles, and whiskers extend from the hinge to the largest or smallest value no further than 1.5 times the interquartile range from the hinge. The center indicates the median. c, Robustness analysis of causarray, RUV-III-NB, and RUV under varying numbers of latent factors. Bar plots show ARI and ASW scores for confounder estimation, while box plots display FPR and TPR for DE testing. d-e, causarray disentangles the treatment effects and unmeasured confounding effects in the response and confounder spaces. UMAP projection of (d) expression data colored by the values of the treatment (purple for control and yellow for treated) and of the unmeasured continuous confounder; and (e) estimated potential outcome under control colored by the values of the treatment and of the continuous confounder.
To assess the performance of unmeasured confounder adjustment procedures, we use two metrics: adjusted Rand index (ARI) and average silhouette width (ASW). More specifically, we use ARI to quantify the alignment between estimated and true unmeasured confounders and ASW to evaluate cell type separation in the control response space. A higher ARI value indicates better coherence and a higher ASW value reflects better preservation of biological signals after removing confounding effects. Additionally, to assess the performance of DE testing, we use two metrics: false positive rate (FPR) and true positive rate (TPR) (Methods).
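The two DE testing metrics can be computed as in the short sketch below, which uses the textbook definitions (false positives over true nulls, true positives over true signals). The exact definitions used for benchmarking are given in Methods, so the function name and the toy inputs here are illustrative assumptions.

```python
def fpr_tpr(rejected, is_signal):
    """Return (false positive rate, true positive rate) for a set of test decisions.

    rejected  : list of booleans, True if the gene was called DE
    is_signal : list of booleans, True if the gene is truly DE in the simulation
    """
    false_pos = sum(1 for r, s in zip(rejected, is_signal) if r and not s)
    true_pos = sum(1 for r, s in zip(rejected, is_signal) if r and s)
    n_null = sum(1 for s in is_signal if not s)
    n_signal = sum(1 for s in is_signal if s)
    fpr = false_pos / n_null if n_null else 0.0
    tpr = true_pos / n_signal if n_signal else 0.0
    return fpr, tpr


# Toy example: five genes, two of which are truly DE.
rejected = [True, True, False, False, True]
is_signal = [True, False, False, False, True]
fpr, tpr = fpr_tpr(rejected, is_signal)
```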
We first evaluate how sample size and confounding levels influence the performance of DE testing across methods. Among all tested approaches, only causarray, RUV, Wilcoxon, and DESeq2 effectively control FPR across all settings (Fig. 2b and Fig. S1ab). causarray maintains FPR close to the nominal level of 0.1 across all sample sizes and confounding levels, while RUV-III-NB, CINEMA-OT-W, CINEMA-OT, and CoCoA-diff exhibit inflated FPRs exceeding 0.5 in most cases. Notably, causarray achieves the highest TPRs across all scenarios, with values ranging from approximately 0.8 to 0.9 depending on sample sizes and confounding levels (Fig. 2b and Fig. S1ab). This is significantly higher than RUV-III-NB and CoCoA-diff, which achieve TPRs below 0.5 in most settings, particularly for smaller sample sizes or higher confounding levels. These results highlight causarray’s ability to balance sensitivity and specificity effectively.
In terms of unmeasured confounder adjustment, causarray, RUV-III-NB, and CoCoA-diff achieve both ARI and ASW scores consistently above 0.7 across all sample sizes in both bulk and single-cell data (Fig. 2b, Fig. S1ab), outperforming RUV, CINEMA-OT-W, and CINEMA-OT, which show ARI scores below 0.5 in most cases. Furthermore, causarray effectively disentangles treatment effects from unmeasured confounding effects. In the response space (Fig. 2d), treatment groups are distinctly separated with minimal overlap, while variations within groups reflect unmeasured confounders. In the confounder space (Fig. 2e), causarray produces a uniform mixing of treatment groups while accurately reconstructing continuous confounder values.
Finally, we assess the robustness of causarray, RUV-III-NB, and RUV under varying numbers of latent factors (Fig. 2c and Fig. S1c). Among these methods, only causarray consistently controls FPR at the nominal level of 0.1 regardless of the number of factors or sample size. In contrast, RUV-III-NB exhibits inflated median FPRs exceeding 0.2 when more latent factors are included. While RUV-III-NB performs well in terms of ARI (above 0.8) and ASW (above 0.7), its DE testing performance is inferior to RUV due to poor FPR control under certain conditions. Based on these findings, we proceed with causarray and RUV for real data analysis.
causarray applied to an in vivo Perturb-seq study reveals causal effects of ASD/ND genes
An integrative analysis of multiple single perturbations.
Autism spectrum disorders and neurodevelopmental delay (ASD/ND) represent a complex group of conditions that have been extensively studied using genetic approaches. To investigate the underlying mechanisms of these disorders, researchers have employed scalable genetic screening with CRISPR-Cas9 technology (17). Frameshift mutations were introduced in the developing mouse neocortex in utero, followed by single-cell transcriptomic analysis of perturbed cells from the early postnatal brain (17). These in vivo single-cell Perturb-seq data allow for the investigation of causal effects of a panel of ASD/ND risk genes. We analyze the transcriptome of cortical projection neurons (excitatory neurons) perturbed by one risk gene or a non-targeting control perturbation, which serves as a negative control.
Unmeasured confounders, such as batch effects and unwanted variation, are likely present in this dataset due to the batch design being highly correlated with perturbation conditions (Fig. S2ab). Additionally, the heterogeneity of single cells assessed in vivo introduces further complexity. These confounding factors may reduce statistical power for gene-level differential expression (DE) tests, as noted in the original study (17), which instead focused on gene module-level effects. To address this limitation, we apply causarray to incorporate unmeasured confounder adjustment and conduct a more granular analysis at the single-gene level. This approach enables us to uncover nuanced genetic interactions and causal effects that may provide deeper insights into the etiology of ASD/ND.
Functional analysis.
Gene module-level analyses have been shown to provide greater statistical power for detecting biologically meaningful perturbation effects when fewer cells are available (17). The original study adopted this approach but relied on a linear model rather than a negative binomial model, potentially limiting its ability to detect broader signals at the individual gene level. Here, we compare causarray with RUV and DESeq2 (without confounder adjustment) to identify significant genes and enriched gene ontology (GO) terms associated with various perturbations.
In terms of significant gene detection, causarray identifies a comparable number of significant genes to RUV across most perturbations, while DESeq2 consistently detects fewer significant genes (Fig. 3a). The variation in significant detections across different perturbed genes suggests distinct biological impacts of each knockout. Functional analysis focuses on enriched GO terms on the DE genes under each perturbation condition where discrepancies arise between causarray and other methods. Genes identified by causarray are enriched for biologically relevant GO terms with clear clustering patterns (Fig. 3b–c, Fig. S2c). In contrast, RUV shows less distinct clustering and enrichment patterns.
Fig. 3. Statistical test results of the effects of CRISPR perturbation on gene expression in excitatory neuron data.
a, Number of significant genes detected under all perturbations using three different methods. The detection threshold for significant genes is FDR< 0.1 for all methods. b-c, Heatmaps of GO terms enriched (adjusted ) in discoveries from causarray and RUV, respectively, where the common GO terms are highlighted in blue. Only the top 20 GO terms that have the most occurrences in all perturbations are displayed. d-e, Barplots of GO terms enriched in discoveries under Satb2 perturbation from causarray and RUV, respectively.
Notably, while RUV identifies GO terms related to ribosome processes previously implicated in ASD studies (27), these findings remain controversial. Some argue that dysregulation in translation processes and ribosomal proteins may reflect secondary changes triggered by expression alterations in synaptic genes rather than direct causal effects (28). In contrast, GO terms identified by causarray align more closely with the expected causal effects of ASD/ND gene perturbations (29, 30).
To further validate these findings, we examine the perturbation condition for Satb2, which yields the largest number of significant genes identified by both methods (adjusted ). Satb2 is known to play critical roles in neuronal development, synaptic function, and cognitive processes (31, 32). Using causarray, we detect enrichment for GO terms directly related to neuronal function and development, such as “regulation of neuron projection development,” “regulation of synapse structure or activity,” and “synapse organization” (Fig. 3d). These findings are consistent with Satb2’s established roles in neuronal development and synaptic plasticity (33, 34). On the other hand, RUV identifies enrichment for terms related to mitochondrial function and energy metabolism, such as “mitochondrial electron transport,” “cellular respiration,” and “ATP synthesis” (Fig. 3e). While these processes are important for general cellular function, they are less directly relevant to Satb2’s primary biological roles.
Overall, this analysis demonstrates that causarray provides greater specificity in detecting biologically meaningful causal effects of gene perturbations. Its ability to disentangle confounding influences while preserving relevant biological signals highlights its effectiveness in analyzing complex genomic datasets.
causarray reveals causally affected genes of Alzheimer’s disease in a case-control study
An integrative analysis of excitatory neurons.
We analyze three Alzheimer’s disease (AD) single-nucleus RNA sequencing (snRNA-seq) datasets: a transcriptomic atlas from the Religious Orders Study and Memory and Aging Project (ROSMAP) (35) and two datasets from the Seattle Alzheimer’s Disease Brain Cell Atlas (SEA-AD) consortium (36), which include samples from the middle temporal gyrus (MTG) and prefrontal cortex (PFC). Our objective is to compare the performance of causarray and RUV in pseudo-bulk DE tests of AD in excitatory neurons.
To evaluate the validity, we perform a permutation experiment on the ROSMAP-AD dataset by permuting phenotypic labels. Ideally, no significant discoveries should be made under this null scenario. However, RUV produces a large number of false discoveries, with its performance deteriorating as the number of latent factors increases. In contrast, causarray effectively controls the false discovery rate (FDR), producing minimal false positives (Fig. 4a). Additionally, we assess coherence across datasets by examining effect sizes in SEA-AD (MTG) and SEA-AD (PFC). Effect sizes estimated by causarray exhibit higher consistency across varying q-value cutoffs compared to RUV (Fig. 4b, Fig. S3b). When inspecting DE genes across all three AD datasets, causarray identifies more consistent discoveries than RUV (Fig. 4c), highlighting its robustness in detecting causally affected genes.
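The logic of the permutation check can be sketched on toy data as follows. This is a simplified stand-in, not the analysis above: it uses a two-sample z-test with a normal approximation and Benjamini-Hochberg FDR control instead of causarray or RUV, and the function names, the data dimensions, and the effect size are illustrative assumptions. With the true labels the planted signals are discovered; after shuffling the labels, almost nothing should pass the same threshold.

```python
import math
import random


def benjamini_hochberg(pvalues, alpha):
    """Return a list of booleans marking hypotheses rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank                     # step-up: keep the largest passing rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def two_sample_pvalue(values, labels):
    """Two-sided p-value of a Welch-type z-test (normal approximation)."""
    groups = {0: [], 1: []}
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    stats = []
    for g in (groups[1], groups[0]):
        mu = sum(g) / len(g)
        var = sum((v - mu) ** 2 for v in g) / (len(g) - 1)
        stats.append((mu, var, len(g)))
    (m1, v1, n1), (m0, v0, n0) = stats
    z = (m1 - m0) / math.sqrt(v1 / n1 + v0 / n0)
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_discoveries(expr, labels, alpha=0.1):
    pvals = [two_sample_pvalue(gene, labels) for gene in expr]
    return sum(benjamini_hochberg(pvals, alpha))


# Toy data: 200 genes, 60 samples; the first 20 genes carry a true shift of 2.
rng = random.Random(1)
labels = [1] * 30 + [0] * 30
expr = [[rng.gauss(2.0 if (g < 20 and lab == 1) else 0.0, 1.0) for lab in labels]
        for g in range(200)]

permuted = labels[:]
rng.shuffle(permuted)                              # breaks the label-expression link

n_true = count_discoveries(expr, labels)
n_perm = count_discoveries(expr, permuted)         # ideally close to zero
```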
Fig. 4. Comparison of DE genes discovered by causarray and RUV on excitatory neurons for Alzheimer’s disease.
a, The ratio of false discoveries to all 15586 genes of DE test results with permuted disease labels on the ROSMAP-AD dataset. Three methods, causarray with FDX control, causarray with FDR control, and RUV with FDR control, are compared. b, The similarity of estimated effect sizes on SEA-AD MTG and PFC datasets. The slope is estimated from linear regression of effect sizes on the PFC dataset against those on the MTG dataset. c, DE genes by causarray and RUV over 15586 genes (adjusted ). d, Venn diagram of associated GO terms from causarray and RUV (adjusted ). e, Considering only the top 50 positively regulated and the top 50 negatively regulated DE genes from causarray and RUV, we map them to the top 5 biological processes (the green nodes).
Functional analysis.
We further compare functional enrichment results between causarray and RUV using gene ontology (GO) terms associated with DE genes. Across the three datasets, causarray identifies 165 common GO terms, significantly more than the 60 identified by RUV (Fig. 4d). Both methods detect GO terms relevant to neuronal development and synaptic functions, which are critical for understanding AD pathology. However, causarray shows distinct enrichment in categories such as “positive regulation of cell development” and “negative regulation of cell cycle,” reflecting its increased sensitivity to synaptic and neurotransmission-related processes. In contrast, RUV’s results exhibit more dataset-specific enrichments, such as biosynthetic processes in SEA-AD (PFC), apoptotic processes in SEA-AD (MTG), and catabolic processes in ROSMAP-AD (Fig. S3c). These findings suggest that causarray captures more generalizable biological signals across datasets.
Both methods identify overlapping top functional categories related to key biological processes associated with AD pathology (Fig. S3e). However, causarray associates a larger number of genes with these categories, identifying 3393 DE genes compared to 3187 for RUV (Fig. 4c). The visualization of the discovered networks, defined as the top 5 GO terms and the associated genes among the top 100 DE gene discoveries, further highlights the enhanced sensitivity and comprehensiveness of causarray. Specifically, the causarray network contains 17 gene nodes and 81 edges, compared to 14 gene nodes and 57 edges in the RUV network (Fig. 4e). This greater interconnectedness in the larger causarray network suggests a more intricate and informative representation of underlying biological relationships, emphasizing its ability to capture broader and more relevant genetic factors associated with AD pathology.
Counterfactual analysis.
The counterfactual framework employed by causarray enables downstream analyses that directly utilize estimated potential outcomes. By examining counterfactual distributions for significant genes (Fig. 5a), we observe distinct shifts in expression levels between treatment and control groups. Downregulated genes show a shift toward lower expression levels under disease conditions, while upregulated genes exhibit increased expression. Conditional average treatment effects (CATEs) reveal age-dependent trends for these genes (Fig. 5b). For example, upregulated genes such as SLC16A6 and RFLNA show stronger effects at extreme ends of the age distribution, while others like SLC38A2 and BAG6 display nuanced changes across the aging spectrum.
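A crude way to see how estimated potential outcomes support conditional effects is to average their difference within age bins, as in the sketch below. This is only an illustration: the smooth curves in Fig. 5b are obtained with LOESS and are reported as log-fold changes, whereas this sketch uses simple bin means of differences. The function name, bin edges, and toy values are hypothetical.

```python
from bisect import bisect_right


def binned_cate(age, y1_hat, y0_hat, edges):
    """Average of (y1_hat - y0_hat) within half-open age bins [edges[i], edges[i+1]).

    Returns one value per bin, or None for a bin with no observations.
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y1, y0 in zip(age, y1_hat, y0_hat):
        b = bisect_right(edges, x) - 1
        if 0 <= b < n_bins:                        # ignore ages outside the bins
            sums[b] += y1 - y0
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy example with hypothetical estimated potential outcomes for one gene.
age = [60, 62, 80, 85]
y1_hat = [3.0, 5.0, 10.0, 12.0]
y0_hat = [1.0, 1.0, 4.0, 4.0]
cate_by_bin = binned_cate(age, y1_hat, y0_hat, edges=[50, 70, 90])
# Gives 3.0 for ages in [50, 70) and 7.0 for ages in [70, 90).
```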
Fig. 5. Results of DE analysis of 10 selected genes by causarray.
The top 5 up-regulated and top 5 down-regulated genes in estimated LFCs (adjusted ) are visualized. a, Estimated counterfactual distributions. The values are shown in the log scale after adding one pseudo-count. b, Estimated log-fold change of treatment effects, conditional on age for selected genes. The center lines represent the mean of the locally estimated scatter plot smoothing (LOESS) regression, and the shaded area represents a 95% confidence interval at each value of age.
These findings align with prior studies highlighting the roles of specific genes in aging-related processes. For instance, ZFR2, RFLNA, BAG6, and RAD21 have been implicated in chromatin remodeling, synaptic plasticity, and cellular stress responses critical for aging and neurodegeneration (37–40). While nonparametric fitted curves may exaggerate age effects due to uncertainty bands, significant trends observed for key genes underscore their potential relevance in AD pathology. Overall, these results demonstrate that causarray provides nuanced insights into age-dependent gene regulation mechanisms while maintaining robust control over confounding influences.
Discussion
The rapid growth of high-throughput single-cell technologies has created an urgent need for robust causal inference frameworks capable of disentangling treatment effects from confounding influences. Existing methods, such as CINEMA-OT (11), have advanced the field by separating confounder and treatment signals and providing per-cell treatment-effect estimates. However, these methods rely on the assumption of no unmeasured confounders, which is often violated in observational studies and in vivo experiments. Additionally, many confounder adjustment methods, such as RUV (12), depend on linear model assumptions that do not directly model count data or provide robust differential expression testing at the gene level. Addressing these limitations, causarray introduces a doubly robust framework that integrates generalized confounder adjustment with semiparametric inference to enable reliable and interpretable causal analysis.
causarray directly models count data using generalized linear models for unmeasured confounder estimation, overcoming a key limitation of RUV in DE analysis. Unlike CINEMA-OT (11) and CoCoA-diff (6), which rely on optimal transport or matching techniques, causarray employs a doubly robust framework that combines flexible machine learning models with semiparametric inference. This approach enhances stability and interpretability while enabling valid statistical inference of treatment effects. Benchmarking results demonstrate that causarray outperforms existing methods in disentangling treatment effects from confounding influences across diverse experimental settings, maintaining superior control over false positive rates while achieving higher true positive rates.
In an in vivo Perturb-seq study of ASD/ND genes, causarray uncovered gene-level perturbation effects that were missed by prior module-based analyses. It identified biologically relevant pathways linked to neuronal development and synaptic functions for multiple autism risk genes. Similarly, in a case-control study of Alzheimer’s disease using three human brain transcriptomic datasets, causarray revealed consistent causal gene expression changes across datasets and highlighted key biological processes such as synaptic signaling and cell development. These findings underscore the ability of causarray to provide biologically meaningful insights across diverse contexts.
Despite its strengths, causarray has certain limitations. Its performance depends on the accurate estimation of unmeasured confounders, which may vary with dataset complexity and experimental design. Furthermore, while causarray provides robust DE testing, its integration with advanced spatial or trajectory analysis frameworks remains unexplored (41, 42). Future research could focus on extending causarray to incorporate prior biological knowledge or extrapolate to unseen perturbation-cell pairs, similar to emerging methods like CPA (43). Such advancements would further enhance its applicability in single-cell causal inference.
Methods
Counterfactual
Potential outcomes framework.
Let $(A, X, Y)$ be a tuple of random vectors, where $A \in \{0, 1\}$ is the binary treatment variable (e.g., presence or absence of a disease or perturbation), $X$ is the vector of covariates (e.g., biological or technical factors influencing both treatment and outcome), and $Y \in \mathbb{R}^{p}$ is the vector of observed outcomes, defined as $Y = A\,Y(1) + (1 - A)\,Y(0)$, where $Y(1)$ and $Y(0)$ are the potential outcomes under treatment and control, respectively.
The potential outcomes framework assumes that for each individual or observation, there exist two potential outcomes: one if the individual receives the treatment ($Y(1)$) and one if they do not ($Y(0)$). However, only one of these outcomes can be observed for each individual, depending on whether they were treated ($A = 1$) or not ($A = 0$). This framework allows us to define causal effects in terms of these unobservable potential outcomes.
To estimate causal effects, we rely on the following key assumptions:
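Assumption 1 (Consistency) The observed response is consistent such that $Y = A\,Y(1) + (1 - A)\,Y(0)$.
Assumption 2 (Positivity) The propensity score $\pi(X) := \mathbb{P}(A = 1 \mid X) \in (\epsilon, 1 - \epsilon)$ for some $\epsilon \in (0, 1/2)$.
Assumption 3 (No unmeasured confounders) $A \perp Y(a) \mid X$, for all $a \in \{0, 1\}$.
Under these assumptions (Assumptions 1–3), the potential outcome $Y(a)$ is conditionally independent of the treatment $A$, given the covariates $X$. This allows us to identify the expected potential outcome for gene $j$ under treatment ($a = 1$) or control ($a = 0$) as:
$$\mathbb{E}[Y_j(a)] = \mathbb{E}\big[\mathbb{E}[Y_j \mid A = a, X]\big] = \mathbb{E}[\mu_j(a, X)],$$
where $\mu_j(a, x) = \mathbb{E}[Y_j \mid A = a, X = x]$ is a regression function that models the relationship between covariates, treatment, and outcomes.
Suppose we have a dataset $\mathcal{D} = \{(A_i, X_i, Y_i)\}_{i=1}^{n}$ consisting of $n$ i.i.d. samples from the same distribution as $(A, X, Y)$. Let $\mathbb{P}_n$ denote the empirical measure over $\mathcal{D}$, defined as:
$$\mathbb{P}_n f = \frac{1}{n} \sum_{i=1}^{n} f(A_i, X_i, Y_i)$$
for any measurable function $f$. This represents the sample average of a function evaluated on all observations in the dataset.
A naive plug-in estimator for $\mathbb{E}[Y_j(a)]$ can then be constructed by replacing the true regression function $\mu_j$ with its estimated counterpart $\hat{\mu}_j$ and using sample averages to approximate expectations. The resulting estimator is:
$$\mathbb{P}_n\,\hat{\mu}_j(a, X) = \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}_j(a, X_i).$$
This plug-in estimator provides an estimate of the expected potential outcome by averaging predictions from the estimated regression model over all observations in the dataset.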
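The plug-in idea can be illustrated with a minimal sketch. It assumes a single discrete covariate, so that the regression function can be estimated by the sample mean within each (treatment, covariate) cell; this stratum-mean model, the toy data, and the function names `fit_stratum_means` and `plug_in_mean` are illustrative assumptions, not part of causarray.

```python
from collections import defaultdict

def fit_stratum_means(A, X, Y):
    """Estimate mu(a, x) = E[Y | A=a, X=x] by the sample mean of each (a, x) cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, x, y in zip(A, X, Y):
        sums[(a, x)] += y
        counts[(a, x)] += 1
    return {key: sums[key] / counts[key] for key in sums}

def plug_in_mean(mu_hat, a, X):
    """Average the predictions mu_hat(a, X_i) over ALL observations, treated or not."""
    return sum(mu_hat[(a, x)] for x in X) / len(X)

# Toy data: X is a binary covariate (e.g., batch), A is treatment, Y is one gene's count.
A = [1, 1, 0, 0, 1, 0, 0, 0]
X = [0, 0, 0, 0, 1, 1, 1, 1]
Y = [6, 8, 3, 5, 20, 10, 12, 8]

mu_hat = fit_stratum_means(A, X, Y)
ey1 = plug_in_mean(mu_hat, 1, X)  # estimate of E[Y(1)]
ey0 = plug_in_mean(mu_hat, 0, X)  # estimate of E[Y(0)]

# Unadjusted contrast of group means, for comparison.
n_treated = sum(A)
naive_diff = (sum(y for a, y in zip(A, Y) if a == 1) / n_treated
              - sum(y for a, y in zip(A, Y) if a == 0) / (len(A) - n_treated))
print(ey1, ey0, ey1 - ey0, naive_diff)
```

Because treated cells are over-represented in the stratum with lower expression, the unadjusted contrast differs from the covariate-adjusted plug-in contrast. Positivity matters here in a concrete way: every (treatment, covariate) cell must contain at least one observation for the lookup to succeed.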
While Assumptions 1–3 are foundational for causal inference, violations of the no unmeasured confounders assumption (Assumption 3) are common in real-world applications (18, 19). For instance, in single-cell transcriptomic studies, technical factors such as batch effects or biological heterogeneity (e.g., cell size or cell cycle stage) may act as unmeasured confounders. These unmeasured variables can bias estimates of causal effects by introducing spurious associations between treatment and outcome. Addressing this limitation motivates the need for methods that explicitly model and adjust for unmeasured confounders.
The probabilistic modeling of confounders.
To account for unmeasured confounders, we propose an improved version of the GCATE method (18), which identifies potential unmeasured confounders under generalized linear models (GLMs). This approach extends traditional confounder adjustment methods by incorporating more flexible nonlinear models that better capture the unique characteristics of genomic count data, such as zero-inflation (an excess of zero counts) and over-dispersion (greater variability than expected under standard Poisson assumptions). These enhancements allow for more accurate modeling of gene expression data, addressing limitations of simpler linear models in high-dimensional genomic analyses.
For the $i$th observation (e.g., a single cell or sample) and the $j$th gene, we model the adjusted expression $\tilde{Y}_{ij} = Y_{ij} / s_i$, where $Y_{ij}$ is the observed expression level, and $s_i$ is the size factor for the $i$th observation. The size factor accounts for differences in sequencing depth or library size across samples, ensuring that comparisons are not biased by technical variability. We assume that $\tilde{Y}_{ij}$ follows an exponential family distribution, which is a flexible class of probability distributions commonly used in GLMs. The density of $\tilde{Y}_{ij}$ is given by:
$$p(y \mid \theta_{ij}) = h(y) \exp\{ y\,\theta_{ij} - b(\theta_{ij}) \},$$
where $\theta_{ij}$ is the natural parameter that determines the mean and variance of $\tilde{Y}_{ij}$, $h$ is a known base measure, and $b$ is the log-partition function, which ensures that the density integrates to 1.
In matrix form, we model the natural parameters $\Theta = (\theta_{ij}) \in \mathbb{R}^{n \times p}$ as a decomposition into two components:
$$\Theta = \tilde{\mathbf{X}} B^{\top} + Z \Gamma^{\top}.$$
Here, $\tilde{\mathbf{X}} = [\mathbf{X}, \mathbf{A}] \in \mathbb{R}^{n \times d}$ combines observed covariates $\mathbf{X}$ (e.g., biological or technical factors) with treatment indicators $\mathbf{A}$, where $n$ is the number of observations, and $d$ is the number of columns of $\tilde{\mathbf{X}}$; $B \in \mathbb{R}^{p \times d}$ represents unknown regression coefficients for the effects of covariates and treatments on gene expression; $Z \in \mathbb{R}^{n \times r}$ represents latent variables capturing unmeasured confounders, where $r$ is the number of latent factors; and $\Gamma \in \mathbb{R}^{p \times r}$ represents unknown coefficients linking unmeasured confounders to gene expression.
This decomposition assumes that gene expression levels are influenced by both observed covariates ($\tilde{\mathbf{X}}$) and unmeasured confounders ($Z$). The term $\tilde{\mathbf{X}} B^{\top}$ captures the effects of observed covariates and treatments, while $Z \Gamma^{\top}$ captures the effects of unmeasured confounders.
To estimate these unknown quantities $(B, Z, \Gamma)$, we employ methods detailed in Appendix S1. This includes techniques for estimating latent factors and extending the framework to handle multiple treatments. Once these quantities are estimated, we treat $(\mathbf{X}, \hat{Z})$ as the complete set of confounding covariates, combining both observed covariates $\mathbf{X}$ and estimated unmeasured confounders $\hat{Z}$.
With this expanded set of covariates, we perform doubly robust estimation and inference as described in subsequent sections. This approach ensures that treatment effects are estimated while accounting for both observed and unmeasured confounding influences, improving robustness and reliability in causal inference.
Doubly robust estimation.
Throughout the paper, we consider the log fold change (LFC) as the target estimand:
$$\tau_j = \log \frac{\mathbb{E}[Y_j(1)]}{\mathbb{E}[Y_j(0)]},$$
which quantifies the relative change in expected gene expression levels between treatment and control conditions for gene $j$. Extensions to other estimands are provided in Appendix S2.
The doubly robust estimation framework is a widely used approach that is agnostic to the underlying data-generating process. It provides valid estimation and inference results as long as either the conditional mean model or the propensity score model is correctly specified. This robustness property ensures reliable causal effect estimation even in the presence of potential misspecification of one of the models.
More specifically, a one-step estimator $\hat{\tau}_j$ of the estimand $\tau_j$ admits a linear expansion:
$$\hat{\tau}_j - \tau_j = \frac{1}{n} \sum_{i=1}^{n} \varphi_j(A_i, X_i, Y_i; \pi, \mu_j) + o_{\mathbb{P}}(n^{-1/2}),$$
where $\varphi_j$ is the influence function of $\tau_j$, which quantifies how individual observations contribute to the overall estimate. Here, $\pi$ is the propensity score model, and $\mu_j$ is the outcome model for gene $j$. See Appendix S2 for detailed derivations of these functions.
To estimate the nuisance functions $\mu_j$ (outcome models) and $\pi$ (propensity score model), we use flexible statistical machine learning methods. Specifically, for outcome models $\mu_j$, we employ generalized linear models (GLMs) with a negative binomial likelihood and log link function. This choice accounts for over-dispersion in count data while ensuring computational efficiency given the high dimensionality of genomic data. For the propensity score model $\pi$, we provide two built-in options: (i) logistic regression and (ii) random forests. In our experiments, random forests are configured with 1,000 trees, a minimum leaf size of 3, and a maximum tree depth of 11. Extrapolated cross-validation (ECV) (44) is used to select hyperparameters by minimizing the estimated mean squared error. Users can also supply alternative estimates for these nuisance functions if desired.
To perform inference, we first compute the estimated influence function values $\hat{\varphi}_{ij} = \varphi_j(A_i, X_i, Y_i; \hat{\pi}, \hat{\mu}_j)$ and use them to estimate the variance for gene $j$:
$$\hat{\sigma}_j^2 = \frac{1}{n} \sum_{i=1}^{n} \hat{\varphi}_{ij}^2.$$
Using these quantities, a $t$-statistic for gene $j$ can be computed as:
$$t_j = \frac{\sqrt{n}\,\hat{\tau}_j}{\hat{\sigma}_j}.$$
This statistic enables hypothesis testing and confidence interval construction for causal effects on gene expression.
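As a rough illustration of the estimation and inference steps above, the sketch below computes a doubly robust log fold change for a single gene from nuisance estimates that are assumed to be already available. It uses the textbook augmented inverse-probability-weighted (AIPW) form for each mean and a delta-method influence function for the log ratio; the exact influence function used by causarray is derived in Appendix S2 and may differ. The function names, the toy inputs, and the absence of cross-fitting are all simplifying assumptions.

```python
import math

def aipw_mean(A, Y, pi, mu, a):
    """AIPW estimate of E[Y(a)] and the per-sample uncentered scores.

    pi[i] is the estimated P(A=1 | X_i); mu[i] is the estimated E[Y | A=a, X_i].
    """
    scores = []
    for A_i, Y_i, pi_i, mu_i in zip(A, Y, pi, mu):
        prob_a = pi_i if a == 1 else 1.0 - pi_i
        weight = (1.0 if A_i == a else 0.0) / prob_a
        scores.append(weight * (Y_i - mu_i) + mu_i)
    return sum(scores) / len(scores), scores

def dr_log_fold_change(A, Y, pi, mu1, mu0):
    """Doubly robust LFC, its variance estimate, and the t-statistic for one gene."""
    n = len(Y)
    psi1, s1 = aipw_mean(A, Y, pi, mu1, 1)
    psi0, s0 = aipw_mean(A, Y, pi, mu0, 0)
    tau = math.log(psi1 / psi0)
    # Delta method: influence values of log(psi1) - log(psi0).
    phi = [(u - psi1) / psi1 - (v - psi0) / psi0 for u, v in zip(s1, s0)]
    sigma2 = sum(p * p for p in phi) / n
    t_stat = math.sqrt(n) * tau / math.sqrt(sigma2)
    return tau, sigma2, t_stat

# Toy inputs (hypothetical): six cells, constant nuisance estimates.
A = [1, 0, 1, 0, 1, 0]
Y = [12, 5, 9, 4, 15, 6]
pi_hat = [0.5] * 6
mu1_hat = [12.0] * 6
mu0_hat = [5.0] * 6

tau, sigma2, t_stat = dr_log_fold_change(A, Y, pi_hat, mu1_hat, mu0_hat)
print(tau, sigma2, t_stat)
```

In practice the nuisance estimates would come from fitted outcome and propensity models, and both estimated means must be positive for the log ratio to be defined.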
False discovery rate control.
Genomic studies often involve testing thousands of hypotheses simultaneously, making it crucial to control statistical Type-I errors. Two widely recognized error rate metrics are the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR), each suited to different contexts. Consider $p$ hypothesis tests; let $\mathcal{R}$ denote the set of discoveries, and let $\mathcal{H}_0$ denote the set of true null hypotheses. The false discovery proportion (FDP) is defined as the ratio of false positives to total discoveries:
$$\mathrm{FDP} = \frac{|\mathcal{R} \cap \mathcal{H}_0|}{\max(|\mathcal{R}|, 1)}.$$
The FWER controls the probability of making at least one false discovery:
$$\mathrm{FWER} = \mathbb{P}(|\mathcal{R} \cap \mathcal{H}_0| \geq 1) \leq \alpha,$$
where $\alpha$ is a predefined significance level. This stringent control is particularly useful in scenarios where even a single false positive is unacceptable. However, FWER control often leads to reduced statistical power, especially in high-dimensional settings with many hypotheses, potentially overlooking true effects.
In contrast, FDR control provides a more balanced approach by controlling the expected proportion of false discoveries among all discoveries:
$$\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}] \leq \alpha.$$
This approach enhances power in multiple testing scenarios and has become the standard for differential expression analysis in genomics due to its ability to identify more significant features while maintaining a low proportion of false positives (45). Importantly, FDR controls the expected proportion of false discoveries across repeated experiments but does not guarantee bounds on FDP in any single experiment. This distinction becomes critical in genomic studies where test statistics are often highly dependent, leading to variability in FDP across experiments.
To address limitations of standard FDR procedures, such as their inability to capture FDP variability in a single experiment, alternative error control metrics like False Discovery Exceedance (FDX) have been proposed:
$$\mathrm{FDX} = \mathbb{P}(\mathrm{FDP} > c) \leq \alpha$$
for a threshold $c \in (0, 1)$. FDX provides stricter control by limiting the probability that FDP exceeds a predefined threshold $c$. This makes it particularly useful in applications where minimizing false positives is critical or when restricting analysis to a small subset of discoveries is desired.
To ensure robust error rate control tailored to genomic applications, causarray implements two complementary strategies for FDR control: (i) Benjamini–Hochberg (BH) Procedure: The BH procedure (45) is applied directly to P-values obtained from the doubly robust estimation framework. BH controls the FDR under independence or specific positive dependence structures among test statistics. (ii) Gaussian Multiplier Bootstrap: For tighter control of FDP variability, particularly when test statistics are highly dependent, causarray incorporates a Gaussian multiplier bootstrap approach (Algorithm S2). This method simulates null distributions to estimate FDP more accurately and provides robust FDR control even under complex dependence structures (7).
The choice between BH and Gaussian multiplier bootstrap depends on the dependency structure among test statistics. While BH is computationally efficient and widely used, it may not adequately control FDR under strong dependencies. The Gaussian multiplier bootstrap, on the other hand, accounts for complex dependency structures and provides more accurate bounds on FDP variability. Additionally, incorporating FDX offers an extra layer of conservatism for applications where minimizing false positives is critical. By offering these complementary strategies, causarray ensures robust error rate control tailored to diverse genomic applications while balancing power and error control.
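For concreteness, the BH step-up rule and the FDP definition can be written in a few lines. This is a generic sketch of the standard procedure rather than causarray's own code; the p-values and the set of true nulls are made up for illustration, and the Gaussian multiplier bootstrap (Algorithm S2) is not reproduced here.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return the (sorted) indices rejected by the BH step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value falls under its threshold alpha * rank / m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])

def false_discovery_proportion(rejected, true_nulls):
    """FDP = |R ∩ H0| / max(|R|, 1)."""
    r = set(rejected)
    return len(r & set(true_nulls)) / max(len(r), 1)

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)

# Hypothetical ground truth, available only in simulations.
true_nulls = [1, 5, 6]
fdp = false_discovery_proportion(rejected, true_nulls)
print(rejected, fdp)
```

The example also shows why FDR and FDP differ: the procedure targets the average FDP over repeated experiments, while the realized FDP in a single run can be well above the nominal level.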
Data simulation and analysis
We consider two simulation settings. In the first simulation, we generate cells from zero-inflated Poisson distributions. In the second simulation, we use a specialized single-cell simulator Splatter (25) to generate cells with batch effects. Both simulations include 1 observed covariate and 4 unmeasured confounders. The details of the simulation are provided in Appendix S3.
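To make the first setting more tangible, the following sketch draws zero-inflated Poisson counts whose log-rate depends on one observed covariate, a treatment indicator, and four latent confounders, with treatment assignment depending on a latent factor. All coefficient distributions, the intercept, the zero-inflation probability, and the function names are arbitrary assumptions for illustration; the actual simulation design is specified in Appendix S3.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for moderate rates."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_zip(rng, lam, zero_prob):
    """Zero-inflated Poisson: a structural zero with probability zero_prob."""
    return 0 if rng.random() < zero_prob else sample_poisson(rng, lam)

def simulate_cells(n_cells, n_genes, n_latent=4, zero_prob=0.3, seed=0):
    rng = random.Random(seed)
    # Gene-level coefficients (arbitrary illustrative scales).
    beta_x = [rng.gauss(0, 0.3) for _ in range(n_genes)]
    beta_a = [rng.gauss(0, 0.5) for _ in range(n_genes)]
    gamma = [[rng.gauss(0, 0.3) for _ in range(n_latent)] for _ in range(n_genes)]
    A, X, Y = [], [], []
    for _ in range(n_cells):
        x = rng.gauss(0, 1)                          # observed covariate
        z = [rng.gauss(0, 1) for _ in range(n_latent)]  # unmeasured confounders
        # Treatment probability depends on a latent factor, which induces confounding.
        p_treat = 1.0 / (1.0 + math.exp(-z[0]))
        a = 1 if rng.random() < p_treat else 0
        row = []
        for j in range(n_genes):
            theta = (1.0 + beta_x[j] * x + beta_a[j] * a
                     + sum(g * zk for g, zk in zip(gamma[j], z)))
            row.append(sample_zip(rng, math.exp(theta), zero_prob))
        A.append(a)
        X.append(x)
        Y.append(row)
    return A, X, Y

A_sim, X_sim, Y_sim = simulate_cells(n_cells=50, n_genes=5)
```

Because the latent factors enter both the treatment assignment and the expression rates, a method that ignores them would attribute part of their effect to the treatment.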
Benchmarking methods.
To evaluate the performance of differential expression (DE) testing, we compare causarray with several established methods, both with and without confounder adjustment. These methods are grouped into two categories based on whether they account for unmeasured confounders.
Methods without confounder adjustment include:
- Wilcoxon rank-sum test: This nonparametric test is applied to deviance residuals obtained by regressing gene expression counts on measured covariates using a negative binomial generalized linear model (GLM). The deviance residuals serve as input for the test, which does not explicitly account for unmeasured confounders.
- DESeq2 (26): This widely used method fits a negative binomial GLM to gene expression counts and adjusts for measured covariates. However, it does not account for unmeasured confounders, which may bias results in the presence of hidden variation.
Methods with confounder adjustment include:
- CoCoA-diff (R package mmutilR 1.0.5) (6): Designed for individual-level case-control studies, CoCoA-diff prioritizes disease genes by adjusting for confounders estimated from parametric models. After adjusting for these confounders, the Wilcoxon rank-sum test is applied to the adjusted residuals, as recommended in the original paper.
- CINEMA-OT (Python package cinemaot 0.0.3) (11): CINEMA-OT separates confounding sources of variation from perturbation effects using optimal transport matching to estimate counterfactual cell pairs. Similar to CoCoA-diff, the Wilcoxon rank-sum test is applied to the adjusted residuals of CINEMA-OT.
- RUV-III-NB (R package ruvIIInb 0.8.2.0) (13): This method normalizes gene expression data using pseudo-replicates and a negative binomial model to remove unwanted variation induced by library size differences. The Kruskal-Wallis test (equivalent to the Wilcoxon test for two-group comparisons) is then applied to log-percentile adjusted counts, as suggested by the authors. However, RUV-III-NB does not directly adjust for library size and its ability to control FDR remains unclear, as it was not demonstrated in their experiments.
- RUV (R package ruv 0.9.7.1) (12): RUVr is used to estimate unmeasured confounders, which are then incorporated into DESeq2 for statistical inference based on both observed and estimated covariates. Before running RUV, we successively use the functions calcNormFactors, estimateGLMCommonDisp, estimateGLMTagwiseDisp, and glmFit of the edgeR package (4.0.16) (46) to extract residuals not explained by observed covariates and treatments.
This comprehensive benchmarking enables a thorough evaluation of each method’s ability to address unmeasured confounder estimation and perform robust statistical inference in simulated data settings.
Evaluation metrics.
To compare the performance of different methods, we use four evaluation metrics, focusing on two aspects: confounder estimation and biological signal preservation. DESeq2 and Wilcoxon are excluded from confounder estimation evaluation as they do not estimate unmeasured confounders or counterfactuals.
The performance of confounder estimation is assessed using two clustering-based metrics: Adjusted Rand Index (ARI) and Average Silhouette Width (ASW) (47). These metrics evaluate the quality of mixing in response and confounder spaces, respectively. Formally, ARI measures the similarity between the clustering results based on the estimated control responses and the true cell-type labels of the same samples. It adjusts for similarities that occur by chance:
$$\mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2}\right] \big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2}\right] - \left[\sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2}\right] \big/ \binom{n}{2}},$$
where $n$ is the total number of samples, $n_{ij}$ is the number of samples in both cluster $i$ and partition $j$, $a_i = \sum_{j} n_{ij}$ is the sum over rows in the contingency table, and $b_j = \sum_{i} n_{ij}$ is the sum over columns. Higher ARI values indicate better conservation of cell identity based on estimated counterfactuals compared to true labels. ARI ranges from −1 (complete disagreement) to 1 (perfect agreement), with 0 indicating random clustering. On the other hand, ASW quantifies how well each sample fits within its assigned cluster compared to other clusters. It is defined as:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \qquad \mathrm{ASW} = \frac{1}{n} \sum_{i=1}^{n} s(i),$$
where $a(i)$ is the average dissimilarity of sample $i$ to all other samples within its cluster, and $b(i)$ is the average dissimilarity to samples in the nearest neighboring cluster. ASW values range from −1 to 1, with higher values indicating better-defined clusters (47). For both metrics, median scores are scaled between 0 and 1 across methods within each simulation setup. For these two metrics, we use the implementations from the scib (1.1.5) package (47).
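The two clustering metrics can be computed directly from their definitions. The sketch below is a plain-Python illustration, not the scib implementation used in the paper; it restricts the silhouette to one-dimensional points with absolute-difference dissimilarity, assigns width 0 to singleton clusters (a common convention), and requires at least two clusters. The function names and toy inputs are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings."""
    n = len(labels_true)
    n_ij = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # row sums
    b = Counter(labels_pred)   # column sums
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

def average_silhouette_width(points, labels):
    """ASW for 1-D points using |p - q| as the dissimilarity."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    widths = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # The distance of p to itself is 0, so divide by the number of OTHER members.
        a_i = sum(abs(p - q) for q in own) / (len(own) - 1)
        b_i = min(sum(abs(p - q) for q in other) / len(other)
                  for k, other in clusters.items() if k != l)
        widths.append((b_i - a_i) / max(a_i, b_i))
    return sum(widths) / len(widths)

# Same partition with permuted label names: perfect agreement.
ari_same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# Two labelings that split every pair the other keeps together.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
# Two well-separated clusters on a line.
asw = average_silhouette_width([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1])
print(ari_same, ari_crossed, asw)
```

ARI is invariant to how clusters are named, which is why the first comparison scores 1 even though the label values differ.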
To evaluate biological signal preservation, we use False Positive Rate (FPR) and True Positive Rate (TPR), which are standard metrics derived from confusion matrices. FPR quantifies the proportion of false positives among all actual negatives:
$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}},$$
where FP and TN are false positives and true negatives, respectively. A lower FPR indicates fewer false discoveries relative to actual negatives. Also known as sensitivity or recall, TPR measures the proportion of true positives among all actual positives:
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
where TP and FN are true positives and false negatives, respectively. A higher TPR indicates better detection of true signals. These metrics provide complementary insights: FPR evaluates specificity by penalizing false discoveries, while TPR assesses sensitivity by rewarding correct detections. Together, they measure how well a method balances identifying true signals while avoiding false discoveries.
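A small helper makes the bookkeeping explicit. It is an illustrative sketch with hypothetical names and inputs: `discoveries` stands for the indices of genes a method calls significant, and `true_signals` for the genes simulated with a nonzero effect.

```python
def confusion_rates(discoveries, true_signals, n_genes):
    """Compute FPR and TPR from sets of gene indices."""
    d, s = set(discoveries), set(true_signals)
    tp = len(d & s)               # called and truly affected
    fp = len(d - s)               # called but not affected
    fn = len(s - d)               # affected but missed
    tn = n_genes - tp - fp - fn   # correctly left uncalled
    return {"FPR": fp / (fp + tn), "TPR": tp / (tp + fn)}

# Ten genes, four with a true effect; the method calls four genes, three correctly.
rates = confusion_rates(discoveries=[0, 1, 2, 7],
                        true_signals=[0, 1, 2, 3],
                        n_genes=10)
print(rates)
```

The helper assumes that at least one gene is truly null and at least one carries a signal; otherwise one of the denominators is zero.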
Single-cell Perturb-Seq dataset
We utilize the Perturb-Seq dataset from (17), which enables high-resolution transcriptomic profiling of genetic perturbations in excitatory neurons. This scalable platform systematically investigates gene functions across diverse cell types and perturbation conditions, providing critical insights into neurodevelopmental processes (17). We focus on excitatory neurons of the dataset, a key population implicated in neurodevelopmental disorders such as autism spectrum disorders and neurodevelopmental delay, with perturbations targeting genes involved in neuronal development and synaptic function (17).
For preprocessing, we filter out cells with perturbations measured in fewer than 50 cells and genes expressed in fewer than 50 cells, resulting in a dataset containing 2926 cells under 30 perturbation conditions. The GFP (Green Fluorescent Protein) condition is used as a negative control to benchmark the effects of other perturbations by providing a baseline for comparison in downstream analyses. After filtering lowly expressed genes with a maximum count of fewer than 10, we retain 3221 genes.
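The filtering steps can be sketched as follows. The thresholds are passed as parameters so the toy example can use small values, whereas the text above uses 50 cells per perturbation, 50 cells per gene, and a maximum count of 10. Applying the gene filters after the cell filter, the list-of-lists data layout, the labels other than GFP, and the function name are assumptions made for illustration.

```python
from collections import Counter

def filter_counts(counts, perturbation, min_cells_per_pert,
                  min_cells_per_gene, min_max_count):
    """counts: one list of gene counts per cell; perturbation: one label per cell."""
    pert_sizes = Counter(perturbation)
    # Keep cells whose perturbation is observed in enough cells.
    keep_cells = [i for i, p in enumerate(perturbation)
                  if pert_sizes[p] >= min_cells_per_pert]
    keep_genes = []
    for j in range(len(counts[0])):
        column = [counts[i][j] for i in keep_cells]
        expressed_in = sum(c > 0 for c in column)
        # Keep genes expressed in enough cells and with a large enough peak count.
        if column and expressed_in >= min_cells_per_gene and max(column) >= min_max_count:
            keep_genes.append(j)
    filtered = [[counts[i][j] for j in keep_genes] for i in keep_cells]
    return filtered, keep_cells, keep_genes

# Toy matrix: five cells by three genes.
counts = [
    [12, 0, 1],
    [3, 0, 2],
    [15, 1, 0],
    [0, 0, 9],
    [7, 0, 1],
]
perturbation = ["GFP", "GFP", "pertA", "pertA", "pertB"]

filtered, keep_cells, keep_genes = filter_counts(
    counts, perturbation,
    min_cells_per_pert=2, min_cells_per_gene=2, min_max_count=10)
print(keep_cells, keep_genes, filtered)
```

In the toy example the single `pertB` cell is dropped, the second gene fails the expression filter, and the third gene fails the maximum-count filter.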
The batch design is highly correlated with perturbation conditions; therefore, it is not included as a covariate in the model for testing. Instead, only the intercept is included as a covariate. For propensity score estimation, we incorporate the logarithm of library sizes as an additional covariate to account for technical variability and use GLM as the propensity score model.
Single-nucleus Alzheimer’s disease dataset
This study integrates data from three single-nucleus RNA sequencing (snRNA-seq) datasets to investigate Alzheimer’s disease (AD): the ROSMAP-AD dataset (35) and two datasets from the Seattle Alzheimer’s Disease Brain Cell Atlas (SEA-AD) consortium (36), covering the middle temporal gyrus (MTG) and prefrontal cortex (PFC). These datasets provide complementary insights into AD pathology across different brain regions and donor cohorts.
The ROSMAP-AD dataset is derived from a single-nucleus transcriptomic atlas of the aged human prefrontal cortex, including 2.3 million cells from postmortem brain samples of 427 individuals with varying degrees of AD pathology and cognitive impairment (35). To ensure balanced representation across subjects, we perform stratified down-sampling of 300 cells per subject, focusing on excitatory neurons while excluding two rare subtypes (‘Exc RELN CHD7’ and ‘Exc NRGN’). This preprocessing results in a dataset with 124997 cells and 33538 genes.
Next, we create pseudo-bulk gene expression profiles by aggregating gene expression counts across cells for each subject. Genes expressed in fewer than 10 subjects are filtered out, resulting in a final dataset of 427 samples and 26,106 genes. Binary treatment is defined based on the variable ‘age_first_ad_dx’, which approximates the “age at the time of onset of Alzheimer’s dementia.” Covariates included in the analysis are ‘msex’ (biological sex), ‘pmi’ (postmortem interval), and ‘age_death’ (age at death). Missing values for ‘pmi’ are imputed using the median of observed values.
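The down-sampling, aggregation, and imputation steps can be sketched in plain Python. The sketch uses simple per-subject random sampling without replacement as a stand-in for the stratified down-sampling described above, whose stratification details are not reproduced here; the function names, the seed, and the toy values are hypothetical.

```python
import random
from collections import defaultdict
from statistics import median

def downsample_per_subject(subject_ids, n_per_subject, seed=0):
    """Keep at most n_per_subject cell indices for each subject."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for i, sid in enumerate(subject_ids):
        by_subject[sid].append(i)
    keep = []
    for idx in by_subject.values():
        keep.extend(idx if len(idx) <= n_per_subject else rng.sample(idx, n_per_subject))
    return sorted(keep)

def pseudo_bulk(counts, subject_ids):
    """Sum per-cell gene counts within each subject."""
    totals = {}
    for row, sid in zip(counts, subject_ids):
        if sid not in totals:
            totals[sid] = [0] * len(row)
        totals[sid] = [t + c for t, c in zip(totals[sid], row)]
    return totals

def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    m = median(v for v in values if v is not None)
    return [m if v is None else v for v in values]

subjects = ["s1"] * 5 + ["s2"] * 2
kept = downsample_per_subject(subjects, n_per_subject=3)

bulk = pseudo_bulk([[1, 0], [2, 3], [0, 5]], ["s1", "s1", "s2"])
pmi = impute_median([2.0, None, 6.0, 4.0])
print(kept, bulk, pmi)
```

Aggregating to one profile per subject makes the subject, rather than the cell, the unit of analysis, which matches the 427-sample dataset described above.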
The SEA-AD data are obtained from a multimodal cell atlas of AD developed by the Seattle Alzheimer’s Disease Brain Cell Atlas (SEA-AD) consortium (36). This resource includes snRNA-seq datasets from two brain regions: the middle temporal gyrus (MTG) and prefrontal cortex (PFC), covering 84 donors with varying AD pathologies.
For both MTG and PFC datasets, we perform stratified down-sampling of 300 cells per subject, focusing on excitatory neurons. Pseudo-bulk gene expression profiles are created by aggregating counts across cells for each subject. Genes expressed in fewer than 40 subjects are filtered out, resulting in final datasets with: 80 samples and 24,621 genes for MTG and 80 samples and 25,361 genes for PFC. Covariates included in the analysis are ‘sex’, ‘pmi’, and ‘Age_at_death’. These variables account for biological and technical variability across donors.
To enable comparative analyses across the three datasets (ROSMAP-AD, SEA-AD MTG, and SEA-AD PFC), we restrict the analysis to 15586 common genes that are expressed in all three datasets. Genes with a maximum expression count below 10 among subjects are excluded to ensure robust comparisons.
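Restricting to shared genes amounts to a set intersection followed by the count filter. The sketch below uses placeholder gene identifiers and a hypothetical helper name, and it keeps the gene order of the first dataset, which is an arbitrary choice.

```python
def common_genes(*gene_lists):
    """Genes present in every list, in the order of the first list."""
    shared = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [g for g in gene_lists[0] if g in shared]

# Placeholder identifiers standing in for the three datasets' gene sets.
rosmap_genes = ["g1", "g2", "g3", "g4"]
mtg_genes = ["g2", "g1", "g4"]
pfc_genes = ["g4", "g1", "g5"]

shared = common_genes(rosmap_genes, mtg_genes, pfc_genes)
print(shared)
```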
ACKNOWLEDGEMENTS
This work used the Bridges-2 system at the Pittsburgh Supercomputing Center (PSC) through allocation MTH230011P from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program. This project was funded by the National Institute of Mental Health (NIMH) grant R01MH123184.
Footnotes
CODE AVAILABILITY
The code for reproducing the results in the paper and the causarray package can be accessed at https://github.com/jaydu1/causarray.
DATA AVAILABILITY
All datasets used in this paper are previously published and freely available, except the metadata for donors from the ROSMAP cohort. The Perturb-seq dataset is available through the Broad single cell portal as txt files. The gene expression count matrices of the ROSMAP-AD dataset (35) can be obtained from the supplementary website; they have been de-identified to protect confidentiality. The mapping to ROSMAP IDs and the complete metadata can be found on Synapse as Seurat objects (rds files). The SEA-AD datasets of nuclei-by-gene matrices with counts and normalized expression values from the snRNA-seq assay (36) are available through the Open Data Registry in an AWS bucket (sea-ad-single-cell-profiling) as AnnData objects (h5ad files).
Bibliography
- 1. Svensson Valentine, Vento-Tormo Roser, and Teichmann Sarah A. Exponential scaling of single-cell rna-seq in the past decade. Nature protocols, 13(4):599–604, 2018.
- 2. Tirosh Itay, Izar Benjamin, Prakadan Sanjay M, Wadsworth Marc H, Treacy Daniel, Trombetta John J, Rotem Asaf, Rodman Christopher, Lian Christine, Murphy George, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. Science, 352(6282):189–196, 2016.
- 3. Luecken Malte D and Theis Fabian J. Current best practices in single-cell rna-seq analysis: a tutorial. Molecular systems biology, 15(6):e8746, 2019.
- 4. Editorial. A focus on single-cell omics. Nat Rev Genet, 24(8):485, Aug 2023. doi: 10.1038/s41576-023-00628-3.
- 5. Lähnemann David, Köster Johannes, Szczurek Ewa, McCarthy Davis J, Hicks Stephanie C, Robinson Mark D, Vallejos Catalina A, Campbell Kieran R, Beerenwinkel Niko, Mahfouz Ahmed, et al. Eleven grand challenges in single-cell data science. Genome biology, 21:1–35, 2020.
- 6. Park Yongjin P and Kellis Manolis. Cocoa-diff: counterfactual inference for single-cell gene expression analysis. Genome Biology, 22(1):1–23, 2021.
- 7. Du Jin-Hong, Zeng Zhenghao, Kennedy Edward H, Wasserman Larry, and Roeder Kathryn. Causal inference for genomic data with multiple heterogeneous outcomes. arXiv preprint arXiv:2404.09119, 2024.
- 8. Imbens Guido W and Rubin Donald B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
- 9. Shendure Jay, Findlay Gregory M, and Snyder Matthew W. Genomic medicine-progress, pitfalls, and promise. Cell, 177(1):45–57, Mar 2019. doi: 10.1016/j.cell.2019.02.003.
- 10. Sanchez Pedro, Voisey Jeremy P, Xia Tian, Watson Hannah I, O’Neil Alison Q, and Tsaftaris Sotirios A. Causal machine learning for healthcare and precision medicine. R Soc Open Sci, 9(8):220638, Aug 2022. doi: 10.1098/rsos.220638.
- 11. Dong Mingze, Wang Bao, Wei Jessica, Fonseca Antonio H de O., Perry Curtis J, Frey Alexander, Ouerghi Feriel, Foxman Ellen F, Ishizuka Jeffrey J, Dhodapkar Rahul M, et al. Causal identification of single-cell experimental perturbation effects with cinema-ot. Nature Methods, pages 1–11, 2023.
- 12. Risso Davide, Ngai John, Speed Terence P, and Dudoit Sandrine. Normalization of rna-seq data using factor analysis of control genes or samples. Nat Biotechnol, 32(9):896–902, Sep 2014. doi: 10.1038/nbt.2931.
- 13. Salim Agus, Molania Ramyar, Wang Jianan, Livera Alysha De, Thijssen Rachel, and Speed Terence P. Ruv-iii-nb: Normalization of single cell rna-seq data. Nucleic Acids Research, 50(16):e96–e96, 2022.
- 14. Kampmann Martin. Crispr-based functional genomics for neurological disease. Nat Rev Neurol, 16(9):465–480, Sep 2020. doi: 10.1038/s41582-020-0373-z.
- 15. Hong Derek and Iakoucheva Lilia M. Therapeutic strategies for autism: targeting three levels of the central dogma of molecular biology. Transl Psychiatry, 13(1):58, Feb 2023. doi: 10.1038/s41398-023-02356-y.
- 16. Cheng Junyun, Lin Gaole, Wang Tianhao, Wang Yunzhu, Guo Wenbo, Liao Jie, Yang Penghui, Chen Jie, Shao Xin, Lu Xiaoyan, Zhu Ling, Wang Yi, and Fan Xiaohui. Massively parallel CRISPR-based genetic perturbation screening at single-cell resolution. Adv Sci (Weinh), 10(4):e2204484, Feb 2023. doi: 10.1002/advs.202204484.
- 17. Jin Xin, Simmons Sean K, Guo Amy, Shetty Ashwin S, Ko Michelle, Nguyen Lan, Jokhi Vahbiz, Robinson Elise, Oyler Paul, Curry Nathan, Deangeli Giulio, Lodato Simona, Levin Joshua Z, Regev Aviv, Zhang Feng, and Arlotta Paola. In vivo perturb-seq reveals neuronal and glial abnormalities associated with autism risk genes. Science, 370(6520), Nov 2020. doi: 10.1126/science.aaz6063.
- 18. Du Jin-Hong, Wasserman Larry, and Roeder Kathryn. Simultaneous inference for generalized linear models with unmeasured confounders. arXiv preprint arXiv:2309.07261, 2023.
- 19. Du Jin-Hong, Roeder Kathryn, and Wasserman Larry. Assumption-lean post-integrated inference with negative control outcomes. arXiv preprint arXiv:2410.04996, 2024.
- 20. Leek Jeffrey T and Storey John D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet, 3(9):1724–35, Sep 2007. doi: 10.1371/journal.pgen.0030161.
- 21. Sarkar Abhishek and Stephens Matthew. Separating measurement and expression models clarifies confusion in single-cell rna sequencing analysis. Nature genetics, 53(6):770–777, 2021.
- 22. Robins James M, Rotnitzky Andrea, and Zhao Lue Ping. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994.
- 23. Scharfstein Daniel O, Rotnitzky Andrea, and Robins James M. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999.
- 24. McFaline-Figueroa José L, Srivatsan Sanjay, Hill Andrew J, Gasperini Molly, Jackson Dana L, Saunders Lauren, Domcke Silvia, Regalado Samuel G, Lazarchuck Paul, Alvarez Sarai, et al. Multiplex single-cell chemical genomics reveals the kinase dependence of the response to targeted therapy. Cell Genomics, 4(2), 2024.
- 25. Zappia Luke, Phipson Belinda, and Oshlack Alicia. Splatter: simulation of single-cell RNA sequencing data. Genome biology, 18(1):174, 2017.
- 26. Love Michael, Anders Simon, and Huber Wolfgang. Differential analysis of count data–the deseq2 package. Genome Biol, 15(550):10–1186, 2014.
- 27. Lombardo Michael V. Ribosomal protein genes in post-mortem cortical tissue and ipsc-derived neural progenitor cells are commonly upregulated in expression in autism. Mol Psychiatry, 26(5):1432–1435, May 2021. doi: 10.1038/s41380-020-0773-x.
- 28. Griesi-Oliveira Karina and Passos-Bueno Maria Rita. Reply to lombardo, 2020: An additional route of investigation: what are the mechanisms controlling ribosomal protein genes dysregulation in autistic neuronal cells? Mol Psychiatry, 26(5):1436–1437, May 2021. doi: 10.1038/s41380-020-0792-7.
- 29. Lalli Matthew A, Avey Denis, Dougherty Joseph D, Milbrandt Jeffrey, and Mitra Robi D. High-throughput single-cell functional elucidation of neurodevelopmental disease–associated genes reveals convergent mechanisms altering neuronal differentiation. Genome research, 30(9):1317–1331, 2020.
- 30. Fu Jack M, Satterstrom F Kyle, Peng Minshi, Brand Harrison, Collins Ryan L, Dong Shan, Wamsley Brie, Klei Lambertus, Wang Lily, Hao Stephanie P, Stevens Christine R, Cusick Caroline, Babadi Mehrtash, Banks Eric, Collins Brett, Dodge Sheila, Gabriel Stacey B, Gauthier Laura, Lee Samuel K, Liang Lindsay, Ljungdahl Alicia, Mahjani Behrang, Sloofman Laura, Smirnov Andrey N, Barbosa Mafalda, Betancur Catalina, Brusco Alfredo, Chung Brian H Y, Cook Edwin H, Cuccaro Michael L, Domenici Enrico, Ferrero Giovanni Battista, Gargus J Jay, Herman Gail E, Hertz-Picciotto Irva, Maciel Patricia, Manoach Dara S, Passos-Bueno Maria Rita, Persico Antonio M, Renieri Alessandra, Sutcliffe James S, Tassone Flora, Trabetti Elisabetta, Campos Gabriele, Cardaropoli Simona, Carli Diana, Chan Marcus C Y, Fallerini Chiara, Giorgio Elisa, Girardi Ana Cristina, Hansen-Kiss Emily, Lee So Lun, Lintas Carla, Ludena Yunin, Nguyen Rachel, Pavinato Lisa, Pericak-Vance Margaret, Pessah Isaac N, Schmidt Rebecca J, Smith Moyra, Costa Claudia I S, Trajkova Slavica, Wang Jaqueline Y T, Yu Mullin H C, Autism Sequencing Consortium (ASC), Broad Institute Center for Common Disease Genomics (Broad-CCDG), iPSYCH-BROAD Consortium, Cutler David J, Rubeis Silvia De, Buxbaum Joseph D, Daly Mark J, Devlin Bernie, Roeder Kathryn, Sanders Stephan J, and Talkowski Michael E. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nat Genet, 54(9):1320–1331, Sep 2022. doi: 10.1038/s41588-022-01104-0.
- 31.Zhang Lei, Song Ning-Ning, Zhang Qiong, Mei Wan-Ying, He Chun-Hui, Ma Pengcheng, Huang Ying, Chen Jia-Yin, Mao Bingyu, Lang Bing, et al. Satb2 is required for the regionalization of retrosplenial cortex. Cell Death & Differentiation, 27(5):1604–1617, 2020.
- 32.Wahl Nico, Espeso-Gil Sergio, Chietera Paola, Nagel Amelie, Laighneach Aodán, Morris Derek W, Rajarajan Prashanth, Akbarian Schahram, Dechant Georg, and Apostolova Galina. Satb2 organizes the 3D genome architecture of cognition in cortical neurons. Molecular Cell, 84(4):621–639, 2024.
- 33.Jaitner Clemens, Reddy Chethan, Abentung Andreas, Whittle Nigel, Rieder Dietmar, Delekate Andrea, Korte Martin, Jain Gaurav, Fischer Andre, Sananbenesi Farahnaz, et al. Satb2 determines miRNA expression and long-term memory in the adult central nervous system. eLife, 5:e17361, 2016.
- 34.Guo Qiufang, Wang Yaqiong, Wang Qing, Qian Yanyan, Jiang Yinmo, Dong Xinran, Chen Huiyao, Chen Xiang, Liu Xiuyun, Yu Sha, et al. In the developing cerebral cortex: axonogenesis, synapse formation, and synaptic plasticity are regulated by SATB2 target genes. Pediatric Research, 93(6):1519–1527, 2023.
- 35.Mathys Hansruedi, Peng Zhuyu, Boix Carles A, Victor Matheus B, Leary Noelle, Babu Sudhagar, Abdelhady Ghada, Jiang Xueqiao, Ng Ayesha P, Ghafari Kimia, et al. Single-cell atlas reveals correlates of high cognitive function, dementia, and resilience to Alzheimer’s disease pathology. Cell, 186(20):4365–4385, 2023.
- 36.Gabitto Mariano I, Travaglini Kyle J, Rachleff Victoria M, Kaplan Eitan S, Long Brian, Ariza Jeanelle, Ding Yi, Mahoney Joseph T, Dee Nick, Goldy Jeff, et al. Integrated multimodal cell atlas of Alzheimer’s disease. Nature Neuroscience, pages 1–18, 2024.
- 37.Lee Ming-Hui, Shih Yao-Hsiang, Lin Sing-Ru, Chang Jean-Yun, Lin Yu-Hao, Sze Chun-I, Kuo Yu-Min, and Chang Nan-Shan. Zfra restores memory deficits in Alzheimer’s disease triple-transgenic mice by blocking aggregation of TRAPPC6AΔ, SH3GLB2, tau, and amyloid β, and inflammatory NF-κB activation. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 3(2):189–204, 2017.
- 38.He Kan, Zhang Jian, Liu Justin, Cui Yandi, Liu Leyna G, Ye Shoudong, Ban Qian, Pan Ruolan, and Liu Dahai. Functional genomics study of protein inhibitor of activated STAT1 in mouse hippocampal neuronal cells revealed by RNA sequencing. Aging (Albany NY), 13(6):9011, 2021.
- 39.Kasu Yasar Arfat T, Arva Akshaya, Johnson Jess, Sajan Christin, Manzano Jasmin, Hennes Andrew, Haynes Jacy, and Brower Christopher S. Bag6 prevents the aggregation of neurodegeneration-associated fragments of TDP43. iScience, 25(5), 2022.
- 40.Nativio Raffaella, Lan Yemin, Donahue Greg, Shcherbakova Oksana, Barnett Noah, Titus Katelyn R, Chandrashekar Harshini, Phillips-Cremins Jennifer E, Bonini Nancy M, and Berger Shelley L. The chromatin conformation landscape of Alzheimer’s disease. bioRxiv, pages 2024–04, 2024.
- 41.Zhou Wenbin and Du Jin-Hong. Distance-preserving spatial representations in genomic data. arXiv preprint arXiv:2408.00911, 2024.
- 42.Du Jin-Hong, Chen Tianyu, Gao Ming, and Wang Jingshu. Joint trajectory inference for single-cell genomics using deep learning with a mixture prior. Proceedings of the National Academy of Sciences, 121(37):e2316256121, 2024.
- 43.Lotfollahi Mohammad, Susmelj Anna Klimovskaia, Donno Carlo De, Hetzel Leon, Ji Yuge, Ibarra Ignacio L, Srivatsan Sanjay R, Naghipourfar Mohsen, Daza Riza M, Martin Beth, et al. Predicting cellular responses to complex perturbations in high-throughput screens. Molecular Systems Biology, 19(6):e11517, 2023.
- 44.Du Jin-Hong, Patil Pratik, Roeder Kathryn, and Kuchibhotla Arun Kumar. Extrapolated cross-validation for randomized ensembles. Journal of Computational and Graphical Statistics, pages 1–12, 2024.
- 45.Benjamini Yoav and Hochberg Yosef. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
- 46.Chen Yunshun, Chen Lizhong, Lun Aaron TL, Baldoni Pedro L, and Smyth Gordon K. edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv, pages 2024–01, 2024.
- 47.Luecken Malte D, Büttner Maren, Chaichoompu Kridsadakorn, Danese Anna, Interlandi Marta, Müller Michaela F, Strobl Daniel C, Zappia Luke, Dugas Martin, Colomé-Tatché Maria, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 19(1):41–50, 2022.
- 48.Lin Yingxin, Ghazanfar Shila, Wang Kevin YX, Gagnon-Bartsch Johann A, Lo Kitty K, Su Xianbin, Han Ze-Guang, Ormerod John T, Speed Terence P, Yang Pengyi, et al. scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proceedings of the National Academy of Sciences, 116(20):9775–9784, 2019.
- 49.Kennedy Edward H, Kangovi Shreya, and Mitra Nandita. Estimating scaled treatment effects with multiple outcomes. Statistical Methods in Medical Research, 28(4):1094–1104, 2019.
Data Availability Statement
All datasets used in this paper have been previously published and are freely available, except for the donor metadata from the ROSMAP cohort. The Perturb-seq dataset is available through the Broad Single Cell Portal as txt files. The gene expression count matrices of the ROSMAP-AD datasets (35), which have been deidentified to protect confidentiality, can be obtained from the supplementary website; the mapping to ROSMAP IDs and the complete metadata can be found on Synapse as Seurat objects (rds files). The SEA-AD nuclei-by-gene matrices with counts and normalized expression values from the snRNA-seq assay (36) are available through the Open Data Registry in an AWS bucket (sea-ad-single-cell-profiling) as AnnData objects (h5ad files).





