Briefings in Bioinformatics. 2024 Feb 6;25(2):bbae011. doi: 10.1093/bib/bbae011

scDecouple: decoupling cellular response from infected proportion bias in scCRISPR-seq

Qiuchen Meng 1, Lei Wei 2, Kun Ma 3,4, Ming Shi 5, Xinyi Lin 6,7, Joshua W K Ho 8,9, Yinqing Li 10,11, Xuegong Zhang 12,13
PMCID: PMC10849189  PMID: 38324621

Abstract

Single-cell clustered regularly interspaced short palindromic repeats-sequencing (scCRISPR-seq) is an emerging high-throughput CRISPR screening technology in which the true cellular response to perturbation is coupled with the infected proportion bias of guide RNAs (gRNAs) across different cell clusters. The mixing of these effects introduces noise into scCRISPR-seq data analysis and thus hinders related studies. We developed scDecouple to decouple the true cellular response to perturbation from the influence of infected proportion bias. scDecouple first models the distribution of gene expression profiles in perturbed cells and then iteratively finds the maximum likelihood estimates of the cell cluster proportions as well as the cellular response for each gRNA. We demonstrated its performance in a series of simulation experiments. By applying scDecouple to real scCRISPR-seq data, we found that scDecouple enhances the identification of biologically relevant perturbation-related genes. scDecouple can benefit scCRISPR-seq data analysis, especially in the case of heterogeneous samples or complex gRNA libraries.

Keywords: single cell, pooled CRISPR-screening, perturbation effects, statistical model

INTRODUCTION

With the advancement in single-cell sequencing and CRISPR (clustered regularly interspaced short palindromic repeats) technologies, scCRISPR-seq [1] (single-cell CRISPR sequencing) has emerged as a novel high-throughput gene function profiling method. scCRISPR-seq first leverages CRISPR to perturb a set of genes and then assesses the resulting profiles of each perturbation by single-cell sequencing. There exist multiple scCRISPR-seq protocols, including Perturb-seq [2–4], CROP-seq [5], CRISP-seq [6], Mosaic-seq [7], Spear-ATAC [8] and CRISPR-sciATAC [9]. In a typical scCRISPR-seq protocol, a pool of guide RNAs (gRNAs) targeting different genes are usually packed into lentivirus and then introduced into cells. These gRNAs each introduce perturbation to a subgroup of cells. Then, single-cell sequencing [10–14] is used to capture one or more types of profiles for each cell, such as single-cell RNA sequencing (scRNA-seq) [10, 11], single-cell ATAC sequencing (scATAC-seq) [12, 13] and Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) [14]. These scCRISPR-seq methods enable high-throughput perturbations as well as data-rich read-outs for each perturbation, providing informative data for gene regulation study [2, 15–17], disease target identification [4] and drug development [18].

Obtaining the true effect of each perturbation is a prerequisite for scCRISPR-seq analysis. A primary challenge lies in the inability to directly measure the original expression profile of perturbed cells. In typical experimental settings, a control group, comprising either unperturbed cells or those infected by non-targeting (NT) gRNAs, is introduced to approximate the profiles of cells before perturbations [2, 15, 19]. When the control group is regarded as the original state of the cells subjected to the perturbation (termed the perturbation group), the data analysis may suffer from gRNA ‘infected proportion bias’. Infected proportion bias refers to significant differences in the proportions of cell clusters between the perturbation group and the control group, which may be caused by inconsistent infection efficiencies of different gRNAs, inherent differences in growth rates among cell clusters or sampling bias in sequencing. This bias makes the average gene expression level of the control group differ from the true original expression profile of the perturbation group. As a result, the traditional estimate of the perturbation effect, the fold change (FC) in average expression level between the perturbation group and the control group, cannot reflect the true cellular response. With the development of scCRISPR-seq technologies, the increasing heterogeneity of samples [4] and the expanding complexity of gRNA libraries [15] make it increasingly important to decouple the cellular response from infected proportion bias.

Here, we developed scDecouple to solve this problem by maximum likelihood estimation. scDecouple uses a Gaussian mixture model to describe the distribution of gene expression profiles in the principal component (PC) space. Through the expectation–maximization (EM) algorithm, we iteratively approximate the genuine cluster proportions of infected cells along with their cellular responses. We evaluated the performance of scDecouple on both synthetic and real-world datasets. The synthetic data were generated under various parameter settings, including the distance between clusters and the infection ratio. scDecouple consistently exhibited good performance across these settings. Application to real-world scCRISPR-seq datasets revealed that scDecouple not only provided more precise cellular responses but also enhanced biologically relevant pathway identification and gene ranking. scDecouple facilitates a deeper comprehension of perturbation consequences and paves the way for more intricate scCRISPR-seq protocols.

METHODS

Mathematical description of infected proportion bias

We mathematically described the entire experimental process of scCRISPR-seq. Without loss of generality, we assumed that there are two clusters in the experimental cells before perturbation (Figure 1A). Their average expression ($\bar{x}_0$) is approximately equal to the average expression of cells in the control group ($\bar{x}_c$). Due to the infected proportion bias, the average expression of the infected cells may differ from $\bar{x}_c$ even before considering the expression perturbation caused by gRNAs. We named this average expression $\bar{x}_b$ (Figure 1B). Besides, we assumed that the perturbation brought by a gRNA introduces the same alteration to the expression profiles of all cells in the perturbation group in the feature space, so that the average expression of the perturbation group changes from $\bar{x}_b$ to the observed one, $\bar{x}_p$ (Figure 1C). The observed change between the control and perturbation group is $\bar{x}_p - \bar{x}_c$, which contains the bias caused by infected proportion bias ($\bar{x}_b - \bar{x}_c$) and the true cellular response to perturbation ($\bar{x}_p - \bar{x}_b$).

Figure 1. Mathematical description of average expression alteration of the perturbation group. Distributions and average expressions of cells are shown in the feature space. (A) Cells before perturbation. (B) Cells after infection without the consideration of perturbation effects. (C) Cells observed by high-throughput sequencing. Cells in different clusters are represented by different shapes.

General workflow of scDecouple

scDecouple comprises four distinct steps: data preprocessing, PC selection, decoupling and downstream analysis (Figure 2A).

Figure 2. General workflow and probabilistic model of scDecouple. (A) Four main steps in scDecouple. (B) Plate notation of the GMM probabilistic model of the control and perturbation groups. Smaller squares represent fixed parameters. Inline graphic means a vector of size K. The circles represent random variables, where filled-in circles denote known values. The directed edges between variables indicate dependencies between the variables. The squiggly line with a crossbar indicates the value selected from upstream variables. The N or K in the corner of the plate indicates that the variables inside are repeated N or K times. Here, Inline graphic, Inline graphic.

During the data preprocessing phase, the gene expression of each cell is first normalized by its total expression to reduce the technical variance caused by sequencing depth. The expression then undergoes a log transformation to reduce the effects of extremely highly expressed outliers. Next, highly variable features (HVFs) of cells in the control group are selected, and the cell-HVF matrix is transformed into the PC space.
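The normalization and log-transformation steps can be illustrated with a minimal sketch. The function name `normalize_log`, the scale factor of 10,000 and the use of log(1 + x) are assumptions made for illustration; the text specifies only normalization by total expression followed by a log transformation.

```python
import math


def normalize_log(counts, scale=10_000.0):
    """Depth-normalize each cell (row) and apply log(1 + x).

    counts: list of per-cell lists of raw gene counts.
    scale:  hypothetical target library size (an assumption, not from the paper).
    """
    out = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all-zero instead of dividing by zero.
            out.append([0.0] * len(cell))
            continue
        out.append([math.log1p(c / total * scale) for c in cell])
    return out


# Two cells with identical composition but a 10-fold difference in depth.
cells = [[1, 3, 6], [10, 30, 60]]
normed = normalize_log(cells)
```

Cells with the same composition but different sequencing depth map to the same profile, which is the technical variance this step is meant to remove.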

The second step is PC selection. The number of cells effectively impacted by a gRNA is limited in scCRISPR-seq, making it hard to estimate a large number of parameters. Thus, we designed a PC selection approach to reduce the number of parameters to be estimated. Our analytical derivations revealed that the influence of infected proportion bias is most pronounced in PCs with both high variance and multimodality. scDecouple uses this finding as the criterion for selecting PCs bearing infected proportion bias for further analysis. Details of this step are given in the section PC selection.

Next, the decoupling procedure is executed on the selected PCs to disentangle cellular response from infected proportion bias. We projected cells in the perturbation group into the PC space defined by HVFs in the control group. We then constructed two Gaussian mixture models (GMMs) for the control and perturbation groups, respectively. Both models share the same number of cell clusters, but the parameters of these clusters differ. The EM algorithm is used to iteratively estimate the unknown parameters. With this approach, the genuine cluster proportions of infected cells along with their cellular responses are approximated. Details of this step are given in the sections Decoupling with GMMs and Inferring cellular response with the EM algorithm.

Finally, we integrated several common downstream analyses based on the decoupled results, including pathway enrichment and data visualization. Besides, the gene expression profile can be regenerated from the PC space, and a ranked list of perturbation-related genes is provided to help explore the cellular response of any gene to a perturbation.
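Assuming the PC space comes from a linear PCA, mapping a response from the PC space back to genes amounts to multiplying by the loading matrix. The sketch below illustrates this and the subsequent ranking by absolute response; `pc_response_to_genes`, `rank_genes`, the gene names and the toy numbers are hypothetical and are not taken from the scDecouple package.

```python
def pc_response_to_genes(beta_pc, loadings):
    """Map a response vector in PC space back to gene space.

    beta_pc:  response on each PC (length P).
    loadings: per-gene PCA loadings, one row of length P per gene.
    """
    return [sum(w * b for w, b in zip(row, beta_pc)) for row in loadings]


def rank_genes(gene_names, gene_response):
    """Order genes by decreasing absolute response."""
    order = sorted(range(len(gene_names)), key=lambda g: -abs(gene_response[g]))
    return [gene_names[g] for g in order]


genes = ["geneA", "geneB", "geneC"]                  # placeholder gene names
loadings = [[0.8, 0.1], [-0.5, 0.3], [0.1, -0.9]]    # toy loadings for two PCs
beta_pc = [2.0, 0.5]                                 # hypothetical decoupled response
response = pc_response_to_genes(beta_pc, loadings)
ranking = rank_genes(genes, response)
```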

PC selection

Despite the large number of sequenced cells, the number of cells effectively impacted by a gRNA is usually small in scCRISPR-seq due to the high complexity of gRNA pools and limited gRNA efficiency. For example, the median number of cells captured per gRNA is 121 in a K562 scCRISPR-seq experiment [15] and 83 in a Jurkat cell experiment [5]. Reducing the number of parameters to be estimated is crucial for improving estimation accuracy. To address this issue, we classified PCs into two types based on their data distribution. The first type contains PCs with high variance and a multimodal data distribution. All other PCs, which have low variance or a unimodal data distribution, are classified as the second type. We show below that only the first type of PCs is susceptible to infected proportion bias when calculating perturbation effects. Therefore, the decoupling step can be performed only on the first type of PCs to improve the power of fitting. Below is the derivation of the formulas and the criteria for classifying the two types of PCs.

We first derived the relationship between the observed change, cellular response and infected proportion bias on each PC. We denoted the expression of cell $i$ in the control group on this PC as $x_i$ and the expression of cell $j$ in the perturbation group on this PC as $y_j$. The observed change (OC) is the difference between the two groups:

$$\mathrm{OC} = \frac{1}{N_p}\sum_{j=1}^{N_p} y_j - \frac{1}{N_c}\sum_{i=1}^{N_c} x_i \tag{1}$$

where $N_p$ and $N_c$ represent the number of cells in the perturbation and control group, respectively. We assumed that cells in the control group have $K$ clusters, and that the average expression on this PC and the proportion of cluster $k$ are $\mu_k$ and $\pi_k$, respectively. After perturbation, we assumed that the cluster number remains $K$ but the proportion of cluster $k$ in the perturbation group changes to $\pi'_k$. Writing $\mu'_k$ for the average expression of cluster $k$ in the perturbation group, the two group means are

$$\frac{1}{N_c}\sum_{i=1}^{N_c} x_i = \sum_{k=1}^{K}\pi_k\mu_k, \qquad \frac{1}{N_p}\sum_{j=1}^{N_p} y_j = \sum_{k=1}^{K}\pi'_k\mu'_k \tag{2}$$

Besides, we assumed that all perturbed cells share a similar response $\beta$, so that $\mu'_k = \mu_k + \beta$. Thus,

$$\mathrm{OC} = \sum_{k=1}^{K}\pi'_k(\mu_k + \beta) - \sum_{k=1}^{K}\pi_k\mu_k = \beta + \sum_{k=1}^{K}(\pi'_k - \pi_k)\mu_k \tag{3}$$

In this equation, the first term represents the cellular response, and the second term represents the infected proportion bias.
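A small numerical check of this decomposition is sketched below, with hypothetical values for the control cluster means, the cluster proportions in the two groups and the shared response. The function name `decompose_observed_change` is illustrative only.

```python
def decompose_observed_change(mu, pi_ctrl, pi_pert, beta):
    """Return (observed change, infected proportion bias) for one PC.

    mu:      control-group mean of each cluster on this PC.
    pi_ctrl: cluster proportions in the control group.
    pi_pert: cluster proportions in the perturbation group.
    beta:    response shared by all perturbed cells.
    """
    mean_ctrl = sum(p * m for p, m in zip(pi_ctrl, mu))
    mean_pert = sum(p * (m + beta) for p, m in zip(pi_pert, mu))
    oc = mean_pert - mean_ctrl
    bias = sum((pp - pc) * m for pp, pc, m in zip(pi_pert, pi_ctrl, mu))
    return oc, bias


oc, bias = decompose_observed_change(
    mu=[-2.0, 2.0], pi_ctrl=[0.5, 0.5], pi_pert=[0.8, 0.2], beta=1.0
)
```

In this toy setting the observed change is about -0.2 while the true response is 1.0, because the proportion bias of about -1.2 dominates; subtracting the bias from the observed change recovers the response.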

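The infected proportion bias approaches zero under two specific conditions. One condition is that the data on this PC follow a unimodal distribution ($K = 1$, and thus $\pi'_1 = \pi_1 = 1$). Another condition is that the data variance explained by this PC is low, so that the cluster means are close to a common value $\bar{\mu}$, and thus

$$\mu_k \approx \bar{\mu}, \quad k = 1, \dots, K \tag{4}$$
$$\sum_{k=1}^{K}(\pi'_k - \pi_k)\mu_k \approx \bar{\mu}\sum_{k=1}^{K}(\pi'_k - \pi_k) = 0 \tag{5}$$

where $\mu_k$ is the control-group mean of cluster $k$ on this PC, and $\pi_k$ and $\pi'_k$ are its proportions in the control and perturbation groups.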

Therefore, in cases where PCs exhibit a unimodal distribution or low explained variance, the observed change is equal or approximately equal to real cellular response. In such scenarios, there is little space for improvement through decoupling. Decoupling is primarily required for PCs that exhibit both multimodality and high explained variance.

The explained variance of a PC is calculated during principal component analysis (PCA). The multimodality score of a PC is measured by Hartigan’s dip test [20], which quantifies the maximum difference between the empirical distribution and the best-fitting unimodal distribution. PCs that are both multimodal and have high explained variance are selected for decoupling. For any remaining PC, the OC calculated by Equation (1) is directly regarded as the cellular response.
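The selection rule itself reduces to two thresholds applied jointly, as the following sketch shows. The function name `select_pcs`, both threshold values and the toy scores are hypothetical; the dip statistics are assumed to come from an existing implementation of Hartigan's dip test and are typed in by hand here.

```python
def select_pcs(explained_variance, multimodality_score, var_threshold, mm_threshold):
    """Return indices of PCs that pass both the variance and multimodality thresholds."""
    return [
        i
        for i, (v, m) in enumerate(zip(explained_variance, multimodality_score))
        if v >= var_threshold and m >= mm_threshold
    ]


explained = [0.30, 0.12, 0.05, 0.01]   # toy explained-variance ratios per PC
dip = [0.08, 0.01, 0.06, 0.07]         # toy dip statistics per PC
selected = select_pcs(explained, dip, var_threshold=0.04, mm_threshold=0.05)
```

Only the PCs at indices 0 and 2 pass both thresholds; index 1 is high-variance but unimodal, and index 3 is multimodal but explains little variance, so the observed change would be used directly for them.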

Decoupling with GMMs

We used GMMs to describe the gene expression of cells in the control or perturbation group on all selected PCs. The density function of the control group is

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Lambda_k^{-1}) \tag{6}$$

Here, $\pi_k$ is the proportion of cell cluster $k$, and $\mu_k$ and $\Lambda_k$ represent the expectation (mean) and precision (inverse of variance) of cell cluster $k$ in the control group on all selected PCs, respectively.

We applied the Bayesian information criterion (BIC) to determine the number of clusters $K$. The BIC is calculated using the following formula:

$$\mathrm{BIC}_K = m_K \ln n - 2 \ln \hat{L}_K \tag{7}$$

where $\hat{L}_K$ is the likelihood of the model with $K$ clusters, $m_K$ is the number of parameters for the model with $K$ clusters and $n$ is the sample size.
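To make the model comparison concrete, the sketch below evaluates a one-dimensional Gaussian mixture log-likelihood and the resulting BIC for one and two clusters. It is an illustration under stated assumptions: the convention "parameter count times log sample size minus twice the log-likelihood, lower is better" is the standard Schwarz form and may differ in sign from the package, the mixture parameters are supplied by hand rather than fitted, and all names are hypothetical.

```python
import math


def gmm_loglik(xs, weights, means, sds):
    """Log-likelihood of 1-D data under a Gaussian mixture with given parameters."""
    total = 0.0
    for x in xs:
        dens = sum(
            w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for w, m, s in zip(weights, means, sds)
        )
        total += math.log(dens)
    return total


def bic(loglik, n_params, n_samples):
    """Schwarz criterion; lower is better under this (assumed) sign convention."""
    return n_params * math.log(n_samples) - 2.0 * loglik


# Toy bimodal data: two tight groups around -3 and +3.
xs = [-3.2, -3.0, -2.8, -3.1, -2.9, 2.8, 3.0, 3.2, 2.9, 3.1]
# K = 1: one mean and one sd -> 2 parameters.
bic_k1 = bic(gmm_loglik(xs, [1.0], [0.0], [3.0]), 2, len(xs))
# K = 2: two means, two sds and one free weight -> 5 parameters.
bic_k2 = bic(gmm_loglik(xs, [0.5, 0.5], [-3.0, 3.0], [0.15, 0.15]), 5, len(xs))
```

On these clearly bimodal data the two-cluster model has the lower BIC despite its larger parameter count.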

The density function of the perturbation group is

$$p(y) = \sum_{k=1}^{K} \pi'_k \, \mathcal{N}\big(y \mid \mu_k + \beta, (\Lambda'_k)^{-1}\big) \tag{8}$$

Here, $\pi'_k$ and $\Lambda'_k$ represent the proportion and precision of cell cluster $k$ in the perturbation group, $\mu_k$ is the mean of cluster $k$ in the control group, and $\beta$ represents the cellular response on all selected PCs. Figure 2B shows the plate notation of the two GMMs for the control and perturbation groups.

Inferring cellular response with the EM algorithm

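The likelihood function for all observed data is

$$L = \prod_{i=1}^{N_c} p(x_i) \cdot \prod_{j=1}^{N_p} p(y_j) \tag{9}$$

where $x_i$ and $y_j$ are the profiles of the $N_c$ control cells and the $N_p$ perturbed cells on the selected PCs. The first component represents the likelihood of the control group ($L_c$), and the second represents the likelihood of the perturbation group ($L_p$).

The EM algorithm is first employed to maximize $L_c$ and estimate the parameters $\theta_c = \{\pi_k, \mu_k, \Lambda_k\}_{k=1}^{K}$ (the proportion, mean and precision of each control cluster):

$$\hat{\theta}_c = \arg\max_{\theta_c} \sum_{i=1}^{N_c} \ln \sum_{z_i} p(x_i, z_i \mid \theta_c) \tag{10}$$

where $z_i$ indicates the component of $x_i$ and each $z_i$ has $K$ possible outcomes. In other words,

$$p(x_i, z_i = k \mid \theta_c) = \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Lambda_k^{-1}), \quad k = 1, \dots, K$$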
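The control-group fit is an ordinary Gaussian mixture EM. A minimal one-dimensional sketch is given below; the quantile-based initialisation, the fixed iteration count, the variance floor and the function names are assumptions made for illustration, and the real method operates on several PCs at once.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(xs, k, n_iter=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    srt = sorted(xs)
    n = len(xs)
    # Initialise means at evenly spaced quantiles, equal weights, pooled sd.
    means = [srt[int((c + 0.5) * n / k)] for c in range(k)]
    grand = sum(xs) / n
    sds = [math.sqrt(sum((x - grand) ** 2 for x in xs) / n)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each cell.
        resp = []
        for x in xs:
            row = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            tot = sum(row)
            resp.append([r / tot for r in row])
        # M-step: update proportion, mean and sd of each component.
        for c in range(k):
            nk = sum(r[c] for r in resp)
            weights[c] = nk / n
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[c] = math.sqrt(max(var, 1e-6))  # floor avoids a collapsed component
    return weights, means, sds


# Simulated control group: two equally sized clusters at -3 and +3.
random.seed(0)
ctrl = [random.gauss(-3.0, 1.0) for _ in range(500)] + [
    random.gauss(3.0, 1.0) for _ in range(500)
]
weights, means, sds = fit_gmm_1d(ctrl, k=2)
```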

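The EM algorithm is then used to maximize $L_p$, the likelihood of the perturbation group:

$$(\hat{\pi}', \hat{\Lambda}', \hat{\beta}) = \arg\max_{\pi', \Lambda', \beta} \sum_{j=1}^{N_p} \ln \sum_{z'_j} p(y_j, z'_j \mid \hat{\theta}_c, \pi', \Lambda', \beta) \tag{11}$$

As the control-group parameters $\theta_c$ (including the cluster means $\mu_k$) have already been estimated in the previous step, the iterative EM algorithm is used to estimate the perturbation-group proportions $\pi'_k$, the precisions $\Lambda'_k$ and the cellular response $\beta$. We use $z'_j$ to indicate the component of the perturbed cell $y_j$.

$$p(y_j, z'_j = k) = \pi'_k \, \mathcal{N}\big(y_j \mid \mu_k + \beta, (\Lambda'_k)^{-1}\big), \quad k = 1, \dots, K$$

For the expectation step (E-step):

$$\gamma_{jk} = p(z'_j = k \mid y_j) = \frac{\pi'_k \, \mathcal{N}\big(y_j \mid \mu_k + \beta, (\Lambda'_k)^{-1}\big)}{\sum_{l=1}^{K}\pi'_l \, \mathcal{N}\big(y_j \mid \mu_l + \beta, (\Lambda'_l)^{-1}\big)} \tag{12}$$

For the maximization step (M-step):

$$\pi'_k = \frac{1}{N_p}\sum_{j}\gamma_{jk}, \quad \beta = \Big(\sum_{j,k}\gamma_{jk}\Lambda'_k\Big)^{-1}\sum_{j,k}\gamma_{jk}\Lambda'_k\,(y_j - \mu_k), \quad (\Lambda'_k)^{-1} = \frac{\sum_{j}\gamma_{jk}\,(y_j - \mu_k - \beta)(y_j - \mu_k - \beta)^{\top}}{\sum_{j}\gamma_{jk}} \tag{13}$$

The E-step and M-step are iteratively applied until convergence. In certain cases, we can choose to fix $\Lambda'_k$ as $\Lambda_k$ to make the shape of cell clusters in the perturbation group similar to those in the control group.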
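The following one-dimensional sketch illustrates the second EM stage under the simplifying option mentioned above, in which the cluster spreads of the perturbation group are fixed to those of the control group. With that restriction, only the cluster proportions and the shared shift need to be updated. The control parameters are supplied directly rather than estimated, and the function name `decouple_1d`, the starting values and the simulated data are assumptions for illustration, not the package's implementation.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def decouple_1d(ys, ctrl_means, ctrl_sds, ctrl_weights, n_iter=50):
    """Estimate cluster proportions and a shared shift for perturbed cells.

    Control means and sds are held fixed; only proportions and shift are updated.
    """
    k = len(ctrl_means)
    n = len(ys)
    weights = list(ctrl_weights)  # start from the control proportions
    beta = 0.0                    # start from "no response"
    for _ in range(n_iter):
        # E-step: responsibility of each shifted control cluster for each cell.
        resp = []
        for y in ys:
            row = [
                w * normal_pdf(y, m + beta, s)
                for w, m, s in zip(weights, ctrl_means, ctrl_sds)
            ]
            tot = sum(row)
            resp.append([r / tot for r in row])
        # M-step: proportions are the mean responsibilities.
        weights = [sum(r[c] for r in resp) / n for c in range(k)]
        # M-step: shift is the precision-weighted mean residual to the control means.
        num = sum(
            r[c] * (y - ctrl_means[c]) / ctrl_sds[c] ** 2
            for r, y in zip(resp, ys)
            for c in range(k)
        )
        den = sum(r[c] / ctrl_sds[c] ** 2 for r in resp for c in range(k))
        beta = num / den
    return weights, beta


# Simulated perturbation group: true shift 1.0, cluster proportions 0.8 / 0.2.
random.seed(1)
ctrl_means, ctrl_sds, ctrl_weights = [-3.0, 3.0], [1.0, 1.0], [0.5, 0.5]
true_beta = 1.0
pert = [random.gauss(-3.0 + true_beta, 1.0) for _ in range(1600)] + [
    random.gauss(3.0 + true_beta, 1.0) for _ in range(400)
]
pi_hat, beta_hat = decouple_1d(pert, ctrl_means, ctrl_sds, ctrl_weights)

# Naive observed change: perturbed mean minus the control mixture mean.
ctrl_mean = sum(w * m for w, m in zip(ctrl_weights, ctrl_means))
naive_oc = sum(pert) / len(pert) - ctrl_mean
```

In this toy setting the naive observed change is pulled far from the true shift by the change in cluster proportions, whereas the EM estimate stays close to it.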

RESULTS

Performance of scDecouple on synthetic data

We conducted a series of simulation experiments to demonstrate the performance of scDecouple in decoupling cellular response from infected proportion bias. We first generated a synthetic dataset. We randomly generated 1000 NT cells with two-dimensional PCs as the control group. These cells followed a two-cluster GMM (Figure 3A), whose bimodality is concentrated on PC 1. We used the same probability model to randomly generate the initial states of cells in the perturbation group and added perturbation effects to these cells under the assumption that all cells exhibit similar responses to the perturbation in the PC space. Let $\mu_k$ denote the control-group mean of cluster $k$ on a PC, and $\pi_k$ and $\pi'_k$ its proportions in the control and perturbation groups. According to Equation (3), when the cluster number is 2, $\pi_1 + \pi_2 = \pi'_1 + \pi'_2 = 1$, and thus the infected proportion bias of any PC in the scenario of two clusters is

$$\sum_{k=1}^{2}(\pi'_k - \pi_k)\mu_k = (\pi'_1 - \pi_1)(\mu_1 - \mu_2) \tag{14}$$

The bias is linear in two components: the distance $d = \mu_1 - \mu_2$ between the two clusters and the ratio change $\Delta\pi = \pi'_1 - \pi_1$ of cluster 1 between the two groups:

$$\text{infected proportion bias} = d \cdot \Delta\pi \tag{15}$$

Figure 3. Schematic diagram of synthetic data and performance of different methods on synthetic data. (A) The illustration of synthetic data. (B) Performance of different methods on synthetic data with different cluster distances $d$. (C) Performance of different methods on synthetic data with different ratio changes $\Delta\pi$. (D) The distribution of cells in the perturbation group with different covariance matrix angles. The dashed lines represent the direction of maximum variance in two clusters. (E) Performance of different methods on synthetic data with different cluster angles, gRNA infection efficiency and cluster distance. Each block represents the gRNA infection efficiency for cluster 1 from 0.4 to 0.1, while the ratio of cluster 2 was fixed at 0.3. The x-axis represents the distance between two clusters. Different columns represent different covariance matrix angles. (F) The distribution of cells in the perturbation group with multiple clusters. Different colors represent different clusters. (G) Performance of different methods on synthetic multi-cluster data with different gRNA infection efficiencies. The x-axis represents different experimental batches with varied gRNA efficiencies.

By varying the values of the cluster distance $d$ and the ratio change $\Delta\pi$, we systematically controlled the magnitude of infected proportion bias. For each parameter combination, we calculated OC as the result of naive analysis and applied scDecouple to estimate the real cellular response. We compared the results of both methods with the ground truth. We also evaluated the effectiveness of Hartigan’s dip test as the metric for assessing the inherent multimodality of the control group.
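A simplified, single-PC version of such a simulation is sketched below to show how the error of the naive estimate grows with cluster distance at a fixed ratio change. The function name `simulate_oc_error`, the cell number, the unit within-cluster standard deviation and the chosen parameter values are assumptions for illustration, and a single random draw is used here rather than repeated simulations.

```python
import random


def simulate_oc_error(d, delta_pi, beta=1.0, pi1=0.5, n=4000, seed=0):
    """Return (observed change - true response) on a two-cluster PC.

    d:        distance between the two cluster means.
    delta_pi: change in the proportion of cluster 1 in the perturbation group.
    """
    rng = random.Random(seed)
    mu1, mu2 = d / 2.0, -d / 2.0

    def draw(p1, shift):
        # Each cell picks cluster 1 with probability p1, then adds unit noise.
        return [
            rng.gauss((mu1 if rng.random() < p1 else mu2) + shift, 1.0)
            for _ in range(n)
        ]

    ctrl = draw(pi1, 0.0)
    pert = draw(pi1 + delta_pi, beta)
    oc = sum(pert) / n - sum(ctrl) / n
    return oc - beta


# Error of the naive estimate for increasing cluster distance, ratio change 0.3.
errors = {d: simulate_oc_error(d, delta_pi=0.3) for d in (0.0, 2.0, 4.0, 8.0)}
```

Up to sampling noise, each error is close to the product of the cluster distance and the ratio change, so it vanishes when the clusters coincide and grows linearly as they separate.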

We first fixed the ratio change $\Delta\pi$ and varied the cluster distance $d$ (Figure 3B) and then fixed $d$ and varied $\Delta\pi$ (Figure 3C). We conducted 100 simulations for each parameter setting. When $d$ or $\Delta\pi$ was small, OC and the scDecouple results both exhibited low losses on PC 1 (Figure 3B and C). However, as data multimodality or ratio change increased, estimation by OC showed higher losses, whereas scDecouple continued to perform well. In the PC 2 dimension, where the data distribution is unimodal, both methods showed similar results (Figure 3B and C). Besides, our results demonstrated that Hartigan’s dip test in scDecouple exhibited high sensitivity to changes in multimodality (Figure 3B and C).

The results revealed that when facing cellular heterogeneity and uneven gRNA sampling across different cell clusters, OC can hardly be regarded as cellular response as it is affected by infected proportion bias. In contrast, scDecouple decouples these intertwined factors and consistently produces low-error estimation outcomes, irrespective of the degree of infected proportion bias.

We conducted two further simulations to compare scDecouple with a conventional strategy. The conventional strategy consists of clustering and annotating the control and perturbation groups separately and then mapping clusters between the two groups to calculate the perturbation effects. Here, we used GMM estimation to compute the cluster centers and annotated cell clusters by the rank of their centers on the first PC. Then, we calculated the average change in cluster centers before and after perturbation as the perturbation effect.
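The conventional strategy can be sketched in one dimension as follows. For brevity this sketch swaps in a simple one-dimensional k-means in place of the GMM estimation described above; the function names and the toy data are hypothetical.

```python
def kmeans_1d(xs, k=2, n_iter=50):
    """Cluster 1-D data with k-means and return the sorted cluster centers."""
    srt = sorted(xs)
    n = len(xs)
    # Initialise centers at evenly spaced quantiles of the data.
    centers = [srt[int((c + 0.5) * n / k)] for c in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)]
    return sorted(centers)


def conventional_effect(ctrl, pert, k=2):
    """Cluster each group separately, match clusters by rank, average the shifts."""
    c_ctrl = kmeans_1d(ctrl, k)
    c_pert = kmeans_1d(pert, k)
    return sum(p - c for p, c in zip(c_pert, c_ctrl)) / k


ctrl = [-3.1, -3.0, -2.9, 2.9, 3.0, 3.1]
pert = [-2.1, -2.0, -1.9, -2.0, 3.9, 4.1]   # both clusters shifted by +1
effect = conventional_effect(ctrl, pert)
```

With well-separated clusters the rank-based matching works and the shift is recovered; the approach depends on each group being clustered correctly on its own, which becomes harder when few cells carry a given gRNA.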

The first experiment simulated a control group with two clusters with varied covariance matrix angles, cluster distances and gRNA infection efficiencies. The covariance matrix angle is the angle between the axes of maximum variance of cluster 1 and cluster 2 (Figure 3D). A low covariance matrix angle results in an ambiguous separation of the two clusters. The perturbation effects were simulated to be identical for both clusters. For each parameter combination, we ran 100 randomly sampled replicates. We used the conventional strategy and scDecouple, respectively, to infer perturbation effects. As shown in Figure 3E, scDecouple achieved lower errors than the conventional strategy. The performance of the conventional strategy is less stable, especially when the distance between the two clusters decreases or the number of cells captured by a gRNA diminishes. This becomes more severe as the cluster angle decreases to 0. In these scenarios, the conventional strategy may incur more errors when clustering the control and perturbation groups separately. In contrast, scDecouple learns the cluster information from the control group, which can be used to assist the clustering of the perturbation group. This particularly matters for current scCRISPR-seq data, which usually show low gRNA efficiency and few cells in the perturbation group.

The second simulation experiment addressed a multi-cluster scenario. We randomly generated four clusters following GMMs on a two-dimensional PC space as the control group, each with distinct means and covariance matrices, as shown in Figure 3F. We randomly generated 10 gRNAs, each having a unique array of infection efficiency for different clusters. For each parameter, we conducted 50 experiments by randomly generating data. We compared the performance of scDecouple and the conventional strategy on these data (Figure 3G). The results showed that when experimental data involve multiple clusters, the use of a single marker in the conventional strategy cannot consistently capture the correct cluster correspondence between the control and perturbation groups. In contrast, scDecouple considers the relative distances between clusters during parameter estimation and thus ensures the correct correspondence between clusters.

From the simulation experiments, it is evident that scDecouple consistently shows more stable estimation results and lower errors compared with other strategies. Moreover, scDecouple does not rely on the definition of cluster markers or additional steps to solve for the correspondence between clusters before and after perturbation, which simplifies the process of perturbation effect estimation.

Performance of scDecouple on simulated biologically derived genome-wide data

We conducted simulations using real scCRISPR-seq data to further evaluate the performance of scDecouple. We collected a scCRISPR-seq dataset from a genome-wide Perturb-seq study [15] and selected the experimental data from two distinct cell lines, K562 and RPE1. We mixed these data to simulate two cell clusters in a single experiment. We found that the number of cells infected by different gRNAs varied in both cell lines (Figure 4A). The correlation between cell lines in the number of cells infected by the same gRNA is notably low (Pearson’s correlation coefficient = 0.155, Figure 4B), suggesting significant infected proportion bias between cell clusters. The difference in infection ratio between the RPE1 and K562 clusters varied across gRNAs and differed from that in the control group (Figure 4C).

Figure 4. Performance of different methods on simulated genome-scale scCRISPR-seq data. (A) The distribution of the cell number infected by different gRNAs in different cell lines. Solid lines show the smoothing spline fitting curve with a 0.5 smoothing parameter. (B) The cell number infected by the same gRNA between different cell lines. (C) The gRNA infection ratio difference between the RPE1 and K562 cluster in the perturbation group. The dashed line represents the infection ratio difference in the control group. (D) The explained variance and multimodality score of each PC. Dashed lines represent thresholds for identifying high variance or multimodality. (E) The distribution of cells in PC 1 and PC 2. (F) Cluster ratio changes and MAEs of cell response estimations on PC1. Each dot represents a gRNA. (G) Absolute error of cell response estimations in gene expression profiles. One sample represents a gene in the gene expression profile. The x-axis represents gRNA targets, sorted by cluster ratio changes. (H) Absolute errors of cell response estimations in gene expression profiles on the top 20 gRNAs.

We applied scDecouple to the simulated data. We found that PC 1 showed both the highest explained variance and the highest multimodality score among all PCs (Figure 4D and E). Thus, we performed decoupling on PC 1 with the cell cluster number $K = 2$ and regarded OC as the cellular response for all other PCs. For the perturbation group, we selected 168 editing loci that were targeted in both cell lines and showed similar responses in both cell types (a difference in cellular response on PC 1 of less than 5). We took the mean response (log FC) of the two cell lines on PC 1 as the true cellular response and used the mean absolute error (MAE) to assess the accuracy of cell response estimations. We found that scDecouple estimated the true cellular response better than OC, especially when cells infected by a gRNA displayed a substantial cluster ratio change (Figure 4F). We then randomly selected several gRNAs and regenerated the cellular response of these gRNAs from the PC space to gene expression. We found that responses calculated by scDecouple showed lower MAEs than OC, especially when ratio changes were large (Figure 4G). In general, the results showed that scDecouple revealed more accurate cellular responses and reduced the error introduced by infected proportion bias.

We then explored the impact of batch effects between different samples on scDecouple. We introduced different levels of batch effects by applying 10%, 40% and 70% dropout rates to the K562 data. Then, we evaluated the performance of scDecouple by calculating the absolute error of each experiment. We visualized the performance of scDecouple on the top 20 gRNAs with the highest OC errors (Figure 4H). The results demonstrated that scDecouple was robust to these batch effects and consistently delivered stable estimation results.

Applying scDecouple on real scCRISPR-seq data

We applied scDecouple to a real scCRISPR-seq dataset targeting T cell receptor (TCR)–related genes in human Jurkat cells [5], which have been reported to be heterogeneous [21, 22]. Each gene was targeted by three gRNAs. We selected 22 target genes, each with over 60 infected cells. We considered the cells affected by gRNAs targeting the same gene as one perturbation group. We thus obtained 22 perturbation groups, with cells carrying NT gRNAs as the control group, for downstream analysis. We preprocessed the data and then transformed them into the PC space of the control group based on the top 700 highly variable genes. We identified three PCs with both high explained variances and high multimodality scores (Figure 5A). We performed decoupling on these three PCs with the cell cluster number $K = 2$. We found that the estimated cell cluster proportion for each target varied (Figure 5B), and the inferred clusters showed distinct differences in the PC space (Figure 5C). We regenerated the cellular response of all targets from the PC space to gene expression and obtained the average response of a gene by averaging the responses of all gRNAs targeting this gene. We visualized the responses of TCR pathway signature genes defined in the original publication [5] (Figure 5D). The results showed that knockouts of some genes caused pathway activation, while others led to pathway inhibition.

Figure 5. The performance of scDecouple on real scCRISPR-seq data. (A) The explained variance and multimodality score of each PC. Dashed lines represent thresholds for identifying high variance or multimodality. (B) The ratios of cluster 1 across all targets. (C) The distribution of cells in PC 1 and PC 17. The circles highlight the cells in cluster 1. (D) Cellular response estimations of all targets in the expression of all TCR pathway–related signature genes. (E) The enrichment scores of differential genes in the TCR signaling pathway for each target. (F) The median rank increase of gRNAs with similar enrichment scores using scDecouple versus OC. We used Fisher’s exact test to evaluate the overlap between differential genes and TCR pathway–related signature genes. (G) Average gene ranking of top 60 genes from the TCR pathway signature genes across all gRNAs. Each dot represents one signature gene. The black line represents $y = x$. The color of each dot represents the change in ranking relative to all gRNAs. The transparency of each dot corresponds to the absolute value of the ranking change, where lower transparency indicates a smaller absolute change. (H) Ranking of CD69 in the absolute values of cellular responses calculated by OC or scDecouple. Each dot represents one target. The black line represents $y = x$.

We then compared the cellular response estimated by OC and scDecouple. We identified the genes with the top 1500 absolute values of cellular responses as differential genes for the OC and scDecouple results, respectively, and then calculated the enrichment score of the TCR signaling pathway for each target by Fisher’s exact test (Figure 5E). The results showed that scDecouple achieved similar or higher enrichment scores than OC. We quantified the comparison of pathway enrichment between scDecouple and OC from the perspective of gene rankings. For four gRNAs with improved enrichment scores versus OC (PTPN11, FOS, EGR2 and DOK2), we found that scDecouple identified more TCR pathway–related differential genes, leading to an average increase of 280 in gene ranking. We also investigated the effect of scDecouple on TCR pathway–related genes for other gRNAs that achieved enrichment scores similar to those of OC. We compared the differences in cellular response rankings of differential genes between the two methods and focused on TCR-related genes with ranking differences greater than 100. As shown in Figure 5F, for a majority of gRNAs, using scDecouple significantly improved the ranking of TCR-related genes, with an average increase of 125 in ranking across all gRNAs.
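The overlap test behind such an enrichment score can be written with the standard library alone. The sketch below computes a one-sided Fisher's exact (hypergeometric) p-value for the overlap between a list of differential genes and a pathway signature; the choice of a one-sided test, the function name `enrichment_pvalue` and the toy counts are assumptions, and how the p-value is turned into an enrichment score is not specified here.

```python
from math import comb


def enrichment_pvalue(n_universe, n_signature, n_selected, n_overlap):
    """One-sided Fisher's exact p-value: P(overlap >= n_overlap).

    n_universe:  number of genes considered in total.
    n_signature: number of pathway signature genes among them.
    n_selected:  number of differential genes.
    n_overlap:   number of differential genes that are signature genes.
    """
    upper = min(n_signature, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_signature, i) * comb(n_universe - n_signature, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 20 genes, 5 in the signature, 5 differential, 3 shared.
p = enrichment_pvalue(n_universe=20, n_signature=5, n_selected=5, n_overlap=3)
```

For these toy counts the p-value is 1126/15504, roughly 0.073.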

We further investigated whether scDecouple helps biological discovery. We selected the top 60 genes from the TCR pathway signature genes according to the original study [5] and calculated the average rank of these genes in the absolute values of cellular responses calculated by OC and scDecouple, respectively. We found that these genes ranked higher in the scDecouple results (Figure 5G). We specifically examined CD69, a widely recognized early activation marker of the TCR pathway [23, 24]. The results demonstrated that CD69 rankings were greatly improved for most targets (Figure 5H). The most significant improvement was observed in the perturbation group targeting RELA, which was reported to be highly associated with CD69 [25–27]. All the results indicated that scDecouple can benefit the analysis of real scCRISPR-seq data with more precise gene ranking and more accurate gene regulation identification.

DISCUSSION

We first formalized the process of cellular response estimation in scCRISPR-seq and derived equations for the estimation bias induced by infected proportion bias. To reduce the influence of infected proportion bias, we introduced scDecouple, a method aimed at unraveling observed changes in scCRISPR-seq data. Our approach decouples cellular response from infected proportion bias through maximum likelihood estimation. We validated the efficacy of scDecouple through a series of simulations and its application to a real dataset. scDecouple yielded more enriched pathways and improved the ranking of perturbation-related genes. The good performance of scDecouple is mainly attributed to its estimation of the actual cluster proportions of infected cells in perturbation groups.
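To make the decoupling idea concrete, the following one-dimensional toy example jointly estimates cluster proportions and a shared shift by maximum likelihood. It is a sketch under strong assumptions, not the scDecouple implementation: cluster means and standard deviations are taken as known from control cells, the response is a single shift shared by all clusters, and the optimizer is a standard expectation-maximization (EM) iteration, which may differ from the procedure used in the package. All names and numbers are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def decouple_1d(x, means, sds, init_props, n_iter=50):
    """EM for a Gaussian mixture with known cluster means and SDs,
    unknown cluster proportions and one shift shared by all clusters."""
    props = list(init_props)
    shift = 0.0
    clusters = range(len(means))
    for _ in range(n_iter):
        # E-step: posterior cluster membership of every cell.
        resp = []
        for xi in x:
            weights = [props[k] * normal_pdf(xi, means[k] + shift, sds[k])
                       for k in clusters]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: proportions are mean responsibilities; the shift is the
        # precision-weighted mean offset from the control cluster means.
        props = [sum(r[k] for r in resp) / len(x) for k in clusters]
        num = sum(r[k] * (xi - means[k]) / sds[k] ** 2
                  for xi, r in zip(x, resp) for k in clusters)
        den = sum(r[k] / sds[k] ** 2 for r in resp for k in clusters)
        shift = num / den
    return props, shift


# Simulated perturbed cells: true shift 1.0, cluster proportions 0.8/0.2,
# whereas control cells are assumed to be split 0.5/0.5.
rng = random.Random(0)
means, sds = [0.0, 10.0], [1.0, 1.0]
control_props, true_props, true_shift = [0.5, 0.5], [0.8, 0.2], 1.0
x = []
for _ in range(2000):
    k = 0 if rng.random() < true_props[0] else 1
    x.append(rng.gauss(means[k] + true_shift, sds[k]))

control_mean = sum(p * m for p, m in zip(control_props, means))
naive_shift = sum(x) / len(x) - control_mean  # confounded by proportions
est_props, est_shift = decouple_1d(x, means, sds, control_props)
print(naive_shift, est_shift, est_props)
```

In this toy setting the naive difference of means mixes the shift with the change in cluster proportions, while the joint estimate separates the two.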

As the technology develops, scCRISPR-seq with more complex gRNA libraries will be applied to more heterogeneous cell populations. As a result, reducing infected proportion bias will become increasingly important in the analysis of scCRISPR-seq data. In addition, double-strand breaks induced by CRISPR knockout may cause cell state arrest, which will introduce additional bias into cell cluster proportions, especially in studies related to cell cycle, senescence or aging. scDecouple can help to remove these effects and focus on the cellular responses of interest. Also, unlike the simple approach of conducting separate differential expression analyses for different clusters, the uniqueness of scDecouple lies in its ability to preserve the shape and variance of individual clusters before and after perturbation. This is particularly valuable when dealing with data that exhibit multiple clusters and when there is no clear cluster structure in continuous data, as our method does not rely on a pre-defined set of clusters and analyzes the entire dataset to reach the maximum likelihood.

We have built an R package for scDecouple that offers a comprehensive suite of functionalities for streamlined scCRISPR-seq data analysis. The package includes an integrated one-click data preprocessing pipeline, infected proportion bias quantification, cellular response estimation and downstream analytical tools. Additionally, the package provides many visualization options, including PC plots, PC variance and multimodality visualizations, cellular response heatmaps and Gene Ontology enrichment plots. Moreover, the package supports both step-by-step execution and parameter customization to meet diverse needs.

As the first method designed to handle infected proportion bias in scCRISPR-seq, scDecouple can be further optimized. Currently, scDecouple operates under three key assumptions: cells follow a GMM within the PC space, the NT and perturbation groups share the same cell clusters, and cellular responses are similar across cell clusters. The first assumption holds well in certain cell lines and tissues, but it might not hold in intricate systems or diseased tissues, such as tumors. We can expand scDecouple by considering alternative data representation methods or statistical models to handle more complex tissues and systems. The second assumption might not hold true in extreme cases, such as when some rare cell types completely escape infection. In such a situation, the number of clusters in the control and perturbation groups could differ. We plan to incorporate additional information to address this problem, such as including marker genes of cell clusters to link clusters in the perturbation group to those in the control group. The third assumption implies that scDecouple is suitable for cells with moderate heterogeneity, a setting validated in this study by applying scDecouple to real scCRISPR-seq data. In the future, we intend to integrate more intricate models, such as deep learning, with scDecouple to handle more complex experimental data.

Key Points

  • We found that infected proportion bias distorts the genuine cellular response to perturbation in scCRISPR-seq data.

  • We proposed scDecouple to decouple the cellular response from infected proportion bias by utilizing maximum likelihood estimation.

  • scDecouple improves estimation of perturbation effects in both simulation experiments and real scCRISPR-seq data.

  • The scDecouple R package implements the decoupling process and streamlines scCRISPR-seq data analysis.

Author Biographies

Qiuchen Meng is a PhD candidate at the Department of Automation, Tsinghua University. His research interests include single-cell bioinformatics and machine learning.

Lei Wei is a research assistant professor at BNRIST, Tsinghua University. His research interests include single-cell bioinformatics, synthetic biology and machine learning.

Kun Ma is an MPhil candidate at the School of Biomedical Sciences, The University of Hong Kong. Her research interests include spatial transcriptomic methods and single-cell transcriptomic analysis.

Ming Shi earned his PhD degree at the Department of Automation, Tsinghua University. His research interests include statistical learning and bioinformatics.

Xinyi Lin is a PhD student in the School of Biomedical Sciences, The University of Hong Kong. Her research interests include statistical modeling and single-cell multi-omics analysis.

Joshua Wing Kei Ho is an associate professor in the School of Biomedical Sciences, The University of Hong Kong. His research interests include single-cell data analysis and microbiome functional systems biology.

Yinqing Li is an associate professor at the School of Pharmaceutical Sciences, Tsinghua University. His research interests include machine learning, single-cell multi-omics analysis and cell editing.

Xuegong Zhang is the director of the Bioinformatics Division, BNRIST and a professor at the Department of Automation, Tsinghua University. His research interests include pattern recognition, machine learning and bioinformatics.

Contributor Information

Qiuchen Meng, MOE Key Lab of Bioinformatics & Bioinformatics Division BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China.

Lei Wei, MOE Key Lab of Bioinformatics & Bioinformatics Division BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China.

Kun Ma, School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Laboratory of Data Discovery for Health Limited (D24H), Hong Kong Science Park, Hong Kong SAR, China.

Ming Shi, MOE Key Lab of Bioinformatics & Bioinformatics Division BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China.

Xinyi Lin, School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Laboratory of Data Discovery for Health Limited (D24H), Hong Kong Science Park, Hong Kong SAR, China.

Joshua W K Ho, School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Laboratory of Data Discovery for Health Limited (D24H), Hong Kong Science Park, Hong Kong SAR, China.

Yinqing Li, MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China; IDG-McGovern Institute for Brain Research, Center for Synthetic and Systems Biology, School of Pharmaceutical Sciences, Tsinghua University, Beijing 100084, China.

Xuegong Zhang, MOE Key Lab of Bioinformatics & Bioinformatics Division BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China; Center for Synthetic and Systems Biology, School of Life Sciences and School of Medicine, Tsinghua University, Beijing 100084, China.

FUNDING

This work is supported in part by the National Key R&D Program of China (2021YFF1200900, 2019YFA0904402, 2019YFA0906700), the National Natural Science Foundation of China (62250005, 61721003, 32171448 and 62373210), the Tsinghua University Initiative Scientific Research Program (2022Z11QYJ032, 2021Z11JCQ020), the Beijing Natural Science Foundation (Z210010), the Tsinghua University Spring Breeze Fund (2020Z99CFG006) and AIR@InnoHK, administered by the Innovation and Technology Commission of Hong Kong.

DATA AVAILABILITY

We used only public datasets in this study. The K562 and RPE1 datasets employed for simulations were obtained through accession code GSE92872. The Jurkat cell dataset was downloaded from the website https://gwps.wi.mit.edu.

All code for this study, including the scDecouple R package, the simulation experiments and the real scCRISPR-seq data analysis, is available on GitHub at https://github.com/MengQiuchen/scDecouple.

References

  • 1. Bock C, Datlinger P, Chardon F, et al. High-content CRISPR screening. Nat Rev Methods Primers 2022;2(1):8.
  • 2. Dixit A, Parnas O, Li B, et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 2016;167(7):1853–1866.e17.
  • 3. Adamson B, Norman TM, Jost M, et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 2016;167(7):1867–1882.e21.
  • 4. Jin X, Simmons SK, Guo A, et al. In vivo perturb-Seq reveals neuronal and glial abnormalities associated with autism risk genes. Science 2020;370(6520):eaaz6063.
  • 5. Datlinger P, Rendeiro AF, Schmidl C, et al. Pooled CRISPR screening with single-cell transcriptome readout. Nat Methods 2017;14(3):297–301.
  • 6. Jaitin DA, Weiner A, Yofe I, et al. Dissecting immune circuits by linking CRISPR-pooled screens with single-cell RNA-Seq. Cell 2016;167(7):1883–1896.e15.
  • 7. Xie S, Duan J, Li B, et al. Multiplexed engineering and analysis of combinatorial enhancer activity in single cells. Mol Cell 2017;66(2):285–299.e5.
  • 8. Pierce SE, Granja JM, Greenleaf WJ. High-throughput single-cell chromatin accessibility CRISPR screens enable unbiased identification of regulatory networks in cancer. Nat Commun 2021;12(1):2969.
  • 9. Liscovitch-Brauer N, Montalbano A, Deng J, et al. Profiling the genetic determinants of chromatin accessibility with scalable single-cell CRISPR screens. Nat Biotechnol 2021;39(10):1270–7.
  • 10. Picelli S, Faridani OR, Björklund ÅK, et al. Full-length RNA-seq from single cells using smart-seq2. Nat Protoc 2014;9(1):171–81.
  • 11. Macosko EZ, Basu A, Satija R, et al. Highly parallel genome-wide expression profiling of individual cells using Nanoliter droplets. Cell 2015;161(5):1202–14.
  • 12. Lareau CA, Duarte FM, Chew JG, et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat Biotechnol 2019;37(8):916–24.
  • 13. Chen X, Miragaia RJ, Natarajan KN, Teichmann SA. A rapid and robust method for single cell chromatin accessibility profiling. Nat Commun 2018;9(1):5345.
  • 14. Stoeckius M, Hafemeister C, Stephenson W, et al. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods 2017;14(9):865–8.
  • 15. Replogle JM, Saunders RA, Pogson AN, et al. Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq. Cell 2022;185(14):2559–2575.e28.
  • 16. Duan B, Zhou C, Zhu C, et al. Model-based understanding of single-cell CRISPR screening. Nat Commun 2019;10(1):2233.
  • 17. Pirkl M, Beerenwinkel N. Single cell network analysis with a mixture of nested effects models. Bioinformatics 2018;34(17):i964–71.
  • 18. McFarland JM, Paolella BR, Warren A, et al. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action. Nat Commun 2020;11(1):4296.
  • 19. Satpathy AT, Granja JM, Yost KE, et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat Biotechnol 2019;37(8):925–36.
  • 20. Hartigan JA, Hartigan PM. The dip test of unimodality. Ann Statist 1985;13(1):70–84.
  • 21. Erdoğan İ, Coşacak M, Aldanmaz A, Akgül B. Deep sequencing reveals two Jurkat subpopulations with distinct miRNA profiles during camptothecin-induced apoptosis. Turk J Biol 2018;42(2):113–22.
  • 22. Snow K, Judd W. Heterogeneity of a human T-lymphoblastoid cell line. Exp Cell Res 1987;171(2):389–403.
  • 23. Cibrián D, Sánchez-Madrid F. CD69: from activation marker to metabolic gatekeeper. Eur J Immunol 2017;47(6):946–53.
  • 24. Ziegler SF, Ramsdell F, Alderson MR. The activation antigen CD69. Stem Cells 1994;12(5):456–65.
  • 25. Brownlie RJ, Zamoyska R. T cell receptor signalling networks: branched, diversified and bounded. Nat Rev Immunol 2013;13(4):257–69.
  • 26. Mondor I, Schmitt-Verhulst AM, Guerder S. RelA regulates the survival of activated effector CD8 T cells. Cell Death Differ 2005;12(11):1398–406.
  • 27. Saldanha-Araujo F, Haddad R, de Farias KCRM, et al. Mesenchymal stem cells promote the sustained expression of CD69 on activated T lymphocytes: roles of canonical and non-canonical NF-κB signalling. J Cell Mol Med 2012;16(6):1232–44.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press