Cell Genomics. 2025 Feb 5;5(2):100766. doi: 10.1016/j.xgen.2025.100766

Characterization and bioinformatic filtering of ambient gRNAs in single-cell CRISPR screens using CLEANSER

Siyan Liu 1,6,7, Marisa C Hamilton 2,6,7, Thomas Cowart 4, Alejandro Barrera 4,6, Lexi R Bounds 3,6, Alexander C Nelson 3, Sophie F Dornbaum 2,6, Julia W Riley 3, Richard W Doty 4, Andrew S Allen 4,6,∗, Gregory E Crawford 1,2,5,6,∗∗, William H Majoros 1,4,6,∗∗∗, Charles A Gersbach 1,2,3,6,8,∗∗∗∗
PMCID: PMC11872138  PMID: 39914388

Summary

Single-cell RNA sequencing CRISPR (perturb-seq) screens enable high-throughput investigation of the genome, allowing for characterization of thousands of genomic perturbations on gene expression. Ambient gRNAs, which are contaminating gRNAs, are a major source of noise in perturb-seq experiments because they result in an excess of false-positive gRNA assignments. Here, we utilize CRISPR barnyard assays to characterize ambient gRNAs in perturb-seq screens. We use these datasets to develop CRISPR Library Evaluation and Ambient Noise Suppression for Enhanced single-cell RNA-seq (CLEANSER), a mixture model that filters ambient gRNAs. CLEANSER includes both gRNA and cell-specific normalization parameters, correcting for confounding technical factors that affect individual gRNAs and cells. The output of CLEANSER is the probability that a gRNA-cell assignment is in the native distribution over the ambient distribution. We find that ambient gRNA filtering methods impact differential gene expression analysis outcomes and that CLEANSER outperforms alternate approaches by increasing gRNA-cell assignment accuracy across multiple screen formats.

Keywords: single-cell RNA-seq, CRISPR screen, perturb-seq, ambient gRNA noise


Highlights

  • CRISPR barnyard screens characterize the distribution of ambient gRNAs

  • CRISPR barnyard screens identify factors influencing the abundance of ambient gRNAs

  • CLEANSER is open-source software that assigns gRNAs to cells in perturb-seq screens

  • CLEANSER demonstrates increased accuracy in calling hits in published datasets


In this manuscript, Liu and Hamilton et al. experimentally characterize ambient gRNA library noise in perturb-seq screens; identify factors influencing ambient gRNA noise; develop CLEANSER, an open-source statistical model to discriminate noise from signal in perturb-seq screens; and provide evidence of additional screen hits from published datasets using CLEANSER.

Introduction

Single-cell RNA sequencing (scRNA-seq) CRISPR (e.g., perturb-seq)1 screens are powerful tools to conduct high-throughput functional genomic mapping.2 Perturb-seq screens have proved to be instrumental in efforts to understand both basic biology (i.e., gene function and cellular behavior) and human disease (i.e., cancer biology and complex genetic disorders).1,3,4,5,6 In droplet-based perturb-seq, cells are typically transduced with lentivirus encoding a gRNA library and then partitioned into distinct droplets so that RNA (e.g., mRNA and gRNA) transcripts from each cell can be tagged with a unique molecular identifier (UMI) and cell barcode (CB).7 At a low multiplicity of infection (MOI), most cells contain only one gRNA integration. Alternatively, in experiments where cells are transduced at a high MOI, there are multiple gRNA integrations per cell. This allows for fewer cells to be profiled to obtain the same coverage of each gRNA. In both methodologies, cells expressing a gRNA are compared to cells harboring alternate targeting gRNAs and/or negative control gRNAs to determine the impact of a perturbation on gene expression.8,9,10

Because gRNAs are typically expressed from RNA polymerase III (pol III) promoters and not poly-adenylated, they are not captured by single-cell RNA-seq methods using polyT priming. Therefore, in perturb-seq experiments, gRNAs are captured and identified through the use of CRISPR droplet sequencing (CROP-seq)11 or direct-capture lentiviral vectors.12 Following lentiviral integration into the target cell genome, the CROP-seq vector expresses both a pol III-transcribed gRNA and an RNA polymerase II (pol II)-transcribed poly-adenylated transcript containing the corresponding gRNA. The CROP-seq pol II mRNA is captured alongside all other poly-adenylated transcripts, allowing for the assignment of gRNAs to each cell. Alternatively, the direct-capture system expresses a modified gRNA harboring a capture sequence in the hairpin region that is targeted by sequence probes during RNA tagging. In both CROP-seq and direct-capture systems, two libraries are generated from each cell: (1) the gene expression library containing cellular mRNA, and (2) the CRISPR feature library containing sequences representing the gRNAs that arises from a series of PCR-based enrichment amplifications.

Previous studies have demonstrated the presence of ambient mRNAs represented by low UMI counts in gene expression libraries generated from scRNA-seq experiments that confound downstream analyses.13,14 These ambient mRNAs are attributed in part to cell lysis, exosome transfer between cells, PCR chimeras, and/or barcode swapping.13,15,16,17,18 Similarly, the presence of ambient noise in CRISPR gRNA libraries generated during perturb-seq screens is supported by the overabundance of low UMI transcripts observed in these libraries.9,10,19 These contaminating transcripts include both ambient mRNAs within CROP-seq gRNA libraries and ambient gRNAs within direct-capture gRNA libraries, referred to herein collectively as “ambient gRNAs.” The presence of these ambient gRNAs contributes to false-positive gRNA-cell assignments and a decrease in the sensitivity of downstream differential expression analyses. Single-cell experiments mixing human and mouse cells, known as “barnyard” experiments, have been used to investigate the abundance and sources of ambient mRNA contamination.13 However, ambient gRNAs in perturb-seq libraries have yet to be systematically characterized to understand the abundance and source of ambient gRNA contamination. As a result, the accurate filtering of ambient gRNA noise while retaining native (integrated and expressed) gRNA transcripts during bioinformatic assignment of gRNA to cells continues to be a statistical challenge in perturb-seq screens.

Design

Several filtering strategies have been used to remove ambient gRNA noise in perturb-seq libraries. The most commonly used approach applies a singular UMI threshold as a requisite cutoff for assignment to any cell.10,20 This cutoff is typically chosen by plotting an empirical cumulative distribution function (eCDF) of gRNA UMIs and approximating where the slope increases, known as the “elbow method.” However, this method is ad hoc and does not capture possible gRNA-specific or cell-specific biases. More recently, a number of mixture proportion methods have been developed, including the gRNA assignment modules in SCEPTRE, FBA, and Cellranger (10X Mixture Model).21,22,23 These methods improve upon the strict UMI cutoff by addressing gRNA- and/or cell-specific biases. However, to our knowledge, no gRNA assignment method uses mixtures that are fitted to experimental data where ambient gRNAs are known. In addition, the 10X Mixture Model fails to address cell-specific biases and has a restrictive license, making it unmodifiable to fit unique experimental considerations. Although all models can be applied to both CROP-seq and direct-capture datasets, it is unclear how the accuracy of each model varies for each capture system. Therefore, there is a need for an ambient gRNA filtering method that (1) considers gRNA- and cell-specific biases, (2) is trained on a dataset of ground-truth ambient gRNAs for both CROP-seq and direct-capture libraries, and (3) is open source to be further modified as new CRISPR methodologies are developed.
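The fixed-cutoff strategy described above can be sketched in a few lines. This is a minimal illustration and not code from CLEANSER or any of the cited tools: the function name `assign_by_umi_cutoff`, the dictionary layout, the toy counts, and the choice of an inclusive cutoff (count ≥ cutoff) are all assumptions.

```python
def assign_by_umi_cutoff(umi_counts, cutoff=5):
    """Return the (cell, gRNA) pairs whose UMI count meets a single global cutoff.

    umi_counts: dict mapping (cell_barcode, grna_id) -> UMI count
    (hypothetical layout chosen for this sketch).
    The comparison is inclusive (>=); that choice is an assumption.
    """
    return {pair for pair, n in umi_counts.items() if n >= cutoff}


# Toy counts, made up for illustration only.
toy_counts = {
    ("cell_1", "gRNA_A"): 12,
    ("cell_1", "gRNA_B"): 1,
    ("cell_2", "gRNA_B"): 5,
    ("cell_2", "gRNA_C"): 4,
}

assigned = assign_by_umi_cutoff(toy_counts, cutoff=5)
print(sorted(assigned))
```

Because the same cutoff is applied to every gRNA and every cell, a gRNA that is captured inefficiently or a shallowly sequenced cell is penalized in exactly the way the paragraph above describes.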

Here, we develop CRISPR Library Evaluation and Ambient Noise Suppression for Enhanced scRNA-seq (CLEANSER), a gRNA-cell assignment method that uses a mixture of two distinct distributions to model ambient and native gRNA presence in perturb-seq CRISPR libraries. We conducted a single-cell CRISPR (scCRISPR) barnyard experiment, in which human and mouse cells are transduced with distinct gRNA libraries and mixed to experimentally characterize ambient gRNA contamination in perturb-seq experiments. The components of CLEANSER are trained on CROP-seq and direct-capture scCRISPR barnyard datasets. CLEANSER considers gRNA- and cell-specific biases and generates a probability value that a gRNA is expressed natively in a cell or is likely ambient and therefore removed from analysis. We benchmark CLEANSER against current filtering methods on publicly available CROP-seq10 and direct-capture20,24 perturb-seq datasets. We quantify the presence of ambient gRNA noise in scCRISPR libraries and determine ideal approaches for increasing gRNA-cell assignment accuracy through a combination of experimental and computational approaches. We show CLEANSER is compatible with both CROP-seq and direct-capture experimental platforms and can improve the sensitivity of downstream differential gene expression analysis compared to a strict UMI cutoff and the 10X Mixture Model. CLEANSER is publicly available and packaged in a command-line interface (Github: https://github.com/Gersbachlab-Bioinformatics/CLEANSER).

Results

Discordance of gRNA-cell assignments using current ambient filtering methods

We first aimed to compare the gRNA-cell assignments produced by current ambient filtering methods. We applied the 5-UMI cutoff previously described in Gasperini et al. and the cutoffs generated by the 10X Mixture Model to a high-MOI CROP-seq dataset profiling K562 dCas9KRAB cells.10,20,25 The resulting MOIs ranged from 16.5 for the 5-UMI cutoff to 18.1 for the 10X Mixture Model (Figure 1A). Although 818,207 gRNA-cell assignments were concordantly assigned by both methods, we observed 93,292 additional assignments identified only by the 10X Mixture Model and two assignments uniquely detected by the 5-UMI method (Figure 1A). Overall, these data demonstrate a substantial discordance in the outputs generated by commonly used gRNA-cell assignment methods and underscore the current gap in our understanding of how to accurately filter ambient gRNAs.
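The comparison above amounts to set operations on gRNA-cell pairs. The sketch below uses a hypothetical helper, `compare_assignments`, on toy sets, and then recomputes the discordant fraction from the three counts reported in this paragraph; defining discordance as the uniquely assigned pairs divided by the union of both methods is an assumption made here for illustration.

```python
def compare_assignments(a, b):
    """Summarize agreement between two sets of (cell, gRNA) assignments."""
    shared = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    union = shared + only_a + only_b
    return {
        "shared": shared,
        "only_a": only_a,
        "only_b": only_b,
        # Fraction of all assigned pairs that only one method reports.
        "discordant_fraction": (only_a + only_b) / union if union else 0.0,
    }


# Toy assignment sets (illustrative only).
toy_umi5 = {("c1", "g1"), ("c2", "g2")}
toy_mixture = {("c1", "g1"), ("c2", "g2"), ("c3", "g3")}
summary = compare_assignments(toy_umi5, toy_mixture)

# Counts reported in the text for the K562 CROP-seq dataset.
shared, only_mixture, only_umi5 = 818_207, 93_292, 2
reported_discordant = (only_mixture + only_umi5) / (shared + only_mixture + only_umi5)
print(summary)
print(round(reported_discordant, 3))
```

With the reported counts, roughly 10% of all assigned gRNA-cell pairs are called by only one of the two methods.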

Figure 1.


gRNA UMIs and gRNA-cell assignments are variable across scCRISPR screen workflows

(A) Left: histograms of gRNA UMIs in K562 dCas9KRAB cells using the CROP-seq system after applying no UMI threshold, a 5-UMI threshold, or a 10X Mixture Model threshold. Right: Venn diagram of gRNA-cell assignments produced by each gRNA assignment method. Dark red indicates gRNA-cell assignments identified by no filtering method; blue indicates gRNA-cell assignments identified by both filtering methods; light red indicates gRNA-cell assignments identified by only one filtering method.

(B) Schematic of 3′ perturb-seq experimental workflow.

(C and D) Histogram of library #1 gRNA UMIs in NIH3T3 dCas9KRAB cells using the (C) CROP-seq and (D) direct-capture system with no UMI threshold applied.

(E) Violin plot of gRNA UMIs across all cells for both the CROP-seq and direct-capture dataset profiling HEK293T dCas9KRAB and NIH3T3 dCas9KRAB cells. Wilcoxon rank-sum test.

(F) qPCR of the Pol II and Pol III transcript levels in HEK293T dCas9KRAB cells. n = 3. Two-tailed t test.

(G and H) Violin plot of library #1 gRNA UMIs grouped by mean UMI quantile for both the (G) CROP-seq and (H) direct-capture dataset profiling NIH3T3 dCas9KRAB cells. Wilcoxon rank-sum test.

(I and J) Left: histogram of library #1 gRNA UMIs in NIH3T3 dCas9KRAB cells using the (I) CROP-seq and (J) direct-capture system after applying a 5-UMI threshold or a 10X Mixture Model threshold. Right: Venn diagram of gRNA-cell assignments produced by each gRNA assignment method.

Dark red indicates gRNA-cell assignments identified by no filtering method; blue indicates gRNA-cell assignments identified by both filtering methods; light red indicates gRNA-cell assignments identified by only one filtering method. p values are represented by asterisks (∗p ≤ 0.05, ∗∗p ≤ 0.01, ∗∗∗p ≤ 0.001). See also Figure S1 and Tables S1 and S2.

UMI variability between CROP-seq and direct-capture CRISPR feature libraries

To better understand the variability of filtering methods across different perturb-seq platforms and cell types, we systematically characterized the distribution and abundance of gRNA UMIs across two widely used 3′ scRNA-seq gRNA capture methods, CROP-seq and direct-capture perturb-seq, in human HEK293T cells and mouse NIH3T3 cells that had both been engineered to express the dCas9KRAB transcriptional repressor. We generated two distinct non-targeting libraries of 100 gRNAs each (gRNA library #1 and gRNA library #2; Tables S1 and S2) that were both cloned into either a CROP-seq or direct-capture vector. For both CROP-seq and direct-capture barnyard experiments, gRNA library #1 was transduced into HEK293T dCas9KRAB cells, while gRNA library #2 was transduced into NIH3T3 dCas9KRAB cells (Figure 1B). Samples were processed in separate channels of a 10X Genomics Chromium X platform, resulting in 7,299–9,864 high-quality cells per replicate (Figure S1A). To initially evaluate the distribution of all captured gRNAs, we assigned gRNAs to cells based on the presence of a UMI count of ≥1 within each cell.

We observed that gRNA UMI counts generated from the CROP-seq system exhibited a ∼20-fold lower magnitude and smaller variance (Figures 1C and S1B) compared to those generated by the direct-capture system in both cell types (Figures 1D, 1E, and S1C). The differences in the number of detected UMIs per gRNA-cell pairing between the two systems may be due to variability in the expression, stability, and/or capture efficiencies of RNA pol II transcripts captured through the CROP-seq system and RNA pol III transcripts captured using the direct-capture perturb-seq system. This difference in pol II and pol III transcripts is further supported by perturb-seq screens utilizing poly-adenylated “guide barcode” (GBC) indexes where the majority of UMI counts ranged from ∼1 to 150 in a prior K562 CRISPR inhibition (CRISPRi) screen.2 To assess differences in the relative RNA levels of pol II and pol III transcripts in the CROP-seq vector, we transduced HEK293T dCas9KRAB cells with a single non-targeting gRNA (sgNT-73). This gRNA demonstrated a 20-fold increase in mean direct-capture UMIs per cell compared to CROP-seq in the HEK293T dCas9KRAB perturb-seq datasets, representing characteristics of a typical library #1 gRNA (Figure S1D). Using RT-qPCR, we show that the sgNT-73 pol III transcript was ∼10× more abundant than the pol II transcript (Figure 1F), which likely explains the differences in magnitude that are detected between CROP-seq and direct-capture vectors.

To examine the variability of gRNA UMIs detected in the CROP-seq and direct-capture perturb-seq datasets, we partitioned gRNAs into quantiles based on their mean observed UMIs across all cells within each dataset. For the CROP-seq dataset, the distribution of gRNA UMI counts was relatively consistent across gRNA quantiles in NIH3T3 dCas9KRAB (Figure 1G) and HEK293T dCas9KRAB cells (Figure S1E), with an increase of 1.2-fold across quantiles for both cell lines. Conversely, the direct-capture dataset distribution of gRNA UMI counts across quantiles differed by 5- to 13-fold in both NIH3T3 dCas9KRAB (Figure 1H) and HEK293T dCas9KRAB cells (Figure S1F). These biases may be due in part to variability in gRNA expression, stability, and/or capture efficiency of the pol III transcript. This indicates that the gRNA-to-gRNA biases are more pronounced for the pol III transcript and, therefore, are a more significant concern for direct-capture compared to CROP-seq libraries.
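The quantile grouping described above can be illustrated as follows. This is a sketch under stated assumptions: the helper names, the equal-sized rank-based bins, the number of bins, and taking the mean over cells in which a gRNA was observed are all choices made here, not details given in the text.

```python
from collections import defaultdict


def mean_umi_per_grna(umi_counts):
    """Mean UMI count per gRNA over the cells in which it was observed."""
    totals = defaultdict(int)
    n_cells = defaultdict(int)
    for (_cell, grna), count in umi_counts.items():
        totals[grna] += count
        n_cells[grna] += 1
    return {g: totals[g] / n_cells[g] for g in totals}


def quantile_bins(means, n_bins=4):
    """Rank gRNAs by mean UMI and split the ranking into n_bins groups (1 = lowest)."""
    ranked = sorted(means, key=lambda g: (means[g], g))  # ties broken by name
    return {g: (i * n_bins) // len(ranked) + 1 for i, g in enumerate(ranked)}


# Toy counts keyed by (cell, gRNA); values are illustrative only.
toy_counts = {
    ("c1", "g_low"): 1, ("c2", "g_low"): 1,
    ("c1", "g_mid"): 4, ("c2", "g_mid"): 6,
    ("c1", "g_high"): 40, ("c3", "g_high"): 60,
    ("c2", "g_top"): 90, ("c3", "g_top"): 110,
}

grna_means = mean_umi_per_grna(toy_counts)
grna_bins = quantile_bins(grna_means, n_bins=2)
print(grna_means)
print(grna_bins)
```

Comparing the UMI distributions of the lowest and highest bins then gives a fold difference of the kind reported for the CROP-seq and direct-capture datasets.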

We evaluated the concordance of current methods of ambient gRNA removal across both CROP-seq and direct-capture systems. gRNA-cell assignments were compared after filtering out ambient gRNAs using (1) a 5-UMI cutoff determined by the elbow method (Figures S1G and S1H) or (2) the 10X Mixture Model.10,20,25 Most gRNA-cell pairs were concordantly assigned for both CROP-seq (42,242 and 46,771 gRNA-cell assignments; Figures 1I and S1I) and direct-capture (38,829 and 15,575 gRNA-cell assignments; Figures 1J and S1J). For CROP-seq, the 10X Mixture Model uniquely identified 271 and 9,346 additional gRNA-cell pairs, and the 5-UMI cutoff identified no additional gRNA-cell pairs (Figures 1I, S1I, and S1K), similar to our analysis of the published CROP-seq dataset10 (Figure 1A). Across the two direct-capture datasets, a maximum additional 628 and 2,370 gRNA-cell pairs were uniquely identified by either the 10X Mixture Model or the 5-UMI cutoff, respectively (Figures 1J, S1J, and S1L). This suggests the 5-UMI cutoff may overestimate the number of gRNA-cell assignments for direct-capture experiments, likely a result of the higher expression level, detection rate, and UMI count variance of the pol III transcript (Figures 1F and S1D). Increasing the stringency of the UMI cutoff decreased the number of unique assignments identified by the strict UMI cutoff method (Figures S1M and S1N). However, this results in a corresponding increase in the number of unique assignments identified by the 10X Mixture Model. The non-overlapping categorization of gRNA-cell assignments using different filtering methods is substantial and may have significant effects on downstream differential gene expression analysis of perturb-seq screens.

CRISPR barnyard assay for detection of ambient gRNAs

To more accurately characterize the abundance, distribution, and source of ambient gRNAs in CROP-seq and direct-capture perturb-seq screens, we conducted scRNA-seq CRISPR barnyard assays. In these experiments, we profiled human HEK293T dCas9KRAB cells transduced with gRNA library #1 (CROP-seq or direct-capture) that were mixed with mouse NIH3T3 dCas9KRAB cells transduced with gRNA library #2 (CROP-seq or direct-capture). This design provides ground-truth confidence in distinguishing transduced (native plus ambient) and non-transduced (ambient) gRNA distributions. The human-mouse cell mixtures were prepared by two approaches: (1) immediate mixing just prior to loading the 10X chip, referred to as “non-co-cultured,” and (2) pre-mixing followed by a 3-day co-culture, referred to as “72-h co-cultured” (Figure 2A). The number of high-quality individual cells ranged from 7,217 to 7,887 per experiment and the proportion of multiplet cells ranged from 0.9% to 12.7% (Figure S2A).

Figure 2.


A scCRISPR barnyard screen characterizes ambient gRNA noise

(A) Schematic of CRISPR barnyard perturb-seq experimental workflow.

(B) Violin plot of the fraction of non-transduced gRNA transcripts identified in each cell for CROP-seq and direct-capture libraries in either HEK293T dCas9KRAB or NIH3T3 dCas9KRAB cells. Wilcoxon rank-sum test.

(C and D) Histogram of library #1 and library #2 gRNA UMIs in the non-co-cultured barnyard dataset using the (C) CROP-seq and (D) direct-capture system.

(E and F) Scatterplots depicting the Pearson correlation between the number of normalized gRNA counts in the gDNA pool and the number of non-transduced gRNA UMIs for (E) CROP-seq and (F) direct-capture CRISPR feature libraries for library #1 gRNAs in the non-co-cultured barnyard datasets.

(G and H) Scatterplots depicting the Pearson correlation between the number of transduced gRNA UMIs and the number of non-transduced gRNA UMIs for (G) CROP-seq and (H) direct-capture CRISPR feature libraries for library #1 gRNAs in the non-co-cultured barnyard datasets.

(I) Representative plot for empty-droplet identification showing droplets ranked by their total gene expression UMI count for the non-co-cultured CROP-seq dataset. High-quality singlet cells are highlighted in light gray and empty droplets are highlighted in dark gray.

(J and K) Violin plots of library #1 and library #2 gRNA UMIs grouped by non-transduced, transduced, or empty-droplet gRNAs for (J) CROP-seq and (K) direct-capture non-co-cultured barnyard datasets.

(L) Schematic depicting the four major contributors to gRNA UMI counts in perturb-seq screens and the major source of ambient gRNA noise. Red arrows indicate conditions that result in a decrease in gRNA UMI counts and green arrows indicate conditions that result in an increase in gRNA UMI counts.

p values are represented by asterisks (∗p ≤ 0.05, ∗∗p ≤ 0.01, ∗∗∗p ≤ 0.001). See also Figure S2.

Characterizing ambient gRNAs in CRISPR feature barcode libraries

To determine the prevalence of ambient gRNA-cell assignments in perturb-seq screens, we calculated the total fraction of non-transduced/transduced gRNA library transcripts in each mouse and human cell. The fraction of non-transduced gRNA assignments in a cell ranged from 0% to 100% for the CROP-seq and direct-capture libraries, indicating considerable variation in the presence of ambient gRNA library noise across cells in perturb-seq screens (Figures 2B and S2B). The direct-capture dataset had a smaller median fraction of non-transduced gRNA library transcripts (0.0%–0.1%) relative to the CROP-seq dataset (1.5%–8.9%). This is likely due to increased library complexity because of increased expression of the pol III transcript (Figures 1F and S1D) and, therefore, decreased sequencing depth per recovered gRNA in the direct-capture gRNA libraries. Overall, we did not observe a significant increase in non-transduced transcript abundance in the 72-h co-cultured samples relative to the non-co-cultured samples, indicating most ambient contamination occurred during droplet generation and/or the single-cell library preparation, rather than exchange of gRNAs between cells during co-culture (Figure S2B). The mean UMI count of transduced gRNAs compared to non-transduced ambient gRNAs was 7-fold to 13-fold greater for CROP-seq (Figure 2C) and 50-fold to 266-fold greater for direct-capture (Figures 2D, S2C, and S2D). These data support the conclusions that ambient gRNA assignments are characterized by low UMI counts, ambient gRNAs originate from other cells, and the direct-capture platform produces a higher range of separation between native and ambient gRNAs compared to CROP-seq.
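The per-cell fraction of non-transduced transcripts follows directly from the barnyard design, in which library #1 was delivered to the human cells and library #2 to the mouse cells. The sketch below is illustrative: the function name, the `lib1`/`lib2` labels, the species labels, and the toy counts are assumptions, and species calling for each cell is taken as given.

```python
# Which gRNA library each species received in the barnyard design.
TRANSDUCED_LIBRARY = {"human": "lib1", "mouse": "lib2"}


def non_transduced_fraction(cell_species, umis_by_library):
    """Fraction of a cell's gRNA UMIs that come from the library it did NOT receive.

    umis_by_library: dict mapping library label -> total gRNA UMIs in this cell.
    Returns None when the cell has no gRNA UMIs at all.
    """
    total = sum(umis_by_library.values())
    if total == 0:
        return None
    native_library = TRANSDUCED_LIBRARY[cell_species]
    foreign = sum(n for lib, n in umis_by_library.items() if lib != native_library)
    return foreign / total


# Toy cells (illustrative only).
human_cell = non_transduced_fraction("human", {"lib1": 95, "lib2": 5})
mouse_cell = non_transduced_fraction("mouse", {"lib1": 3, "lib2": 0})
print(human_cell, mouse_cell)
```

A cell such as the toy mouse cell, which carries only UMIs from the other species' library, is an example of the 100% extreme of the reported range.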

Given the observation of global gRNA-to-gRNA biases in our non-barnyard direct-capture datasets (Figures 1G, 1H, S1E, and S1F), we determined whether the transduced and non-transduced gRNA profiles observed in the barnyard datasets also exhibited gRNA-to-gRNA biases. The median UMI counts for transduced and non-transduced CROP-seq gRNA assignments were approximately consistent across gRNA quantiles (Figures S2E and S2F). However, the median UMI counts for transduced and non-transduced direct-capture gRNA assignments varied across quantiles, indicating substantial gRNA-to-gRNA bias (Figures S2G and S2H). This is consistent with our previous observation of gRNA-to-gRNA biases in the non-barnyard direct-capture perturb-seq datasets. We also observed gRNA UMI biases across cell types. In the direct-capture dataset, we observed larger UMI counts originating from NIH3T3 cells compared to HEK293T cells (Figure 2D). We observed the opposite in our CROP-seq dataset, where HEK293T cells received larger UMI counts relative to NIH3T3 cells (Figure 2C). This would indicate increased stability, expression, and/or recovery of pol III gRNAs in NIH3T3 cells and pol II gRNAs in HEK293T cells.

Previous studies have demonstrated a correlation in UMI abundance of native and ambient mRNA transcripts.13 Therefore, we determined whether gRNAs with a larger abundance in the plasmid pool and/or gRNAs with a larger number of genomic DNA (gDNA) integrations contribute more to the non-transduced ambient population. The relative abundance of each gRNA in each plasmid pool was correlated with the relative number of gDNA integrations for both HEK293T dCas9KRAB and NIH3T3 dCas9KRAB cells (Figures S2I and S2J). The relative number of gDNA integrations was strongly correlated with non-transduced UMI counts for the CROP-seq dataset (Figure 2E). However, we observed a discrepancy in direct-capture libraries, where gRNAs with larger total UMI counts had more non-transduced UMI counts compared to their corresponding gDNA counts (Figure 2F). We did not observe this trend for CROP-seq gRNAs (Figure 2E). This indicates the pol II transcript correlates well with vector DNA integration number, while the pol III transcript does not. This may reflect gRNA-to-gRNA biases in transcription efficiency and stability of the pol III transcript that are not reflected in the pol II transcript. This is consistent with our observation of larger gRNA-to-gRNA bias in the direct-capture libraries compared to CROP-seq libraries (Figures S2E–S2H). In addition, we found gRNAs with larger transduced UMI counts contribute more to the non-transduced distribution in CRISPR feature libraries. The number of transduced UMI counts at the gRNA level was highly correlated to non-transduced UMI counts for both CROP-seq (Figure 2G) and for direct-capture (Figure 2H). Together, these data support that gRNAs that are more highly represented in the original gRNA library plasmid pool are also more often integrated in transduced cells, which ultimately leads to higher ambient contamination of those gRNAs in other cells. However, gRNA biases have an additional influence on the number of ambient and transduced UMI counts in direct-capture libraries.
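The gRNA-level comparisons above rest on Pearson correlations between per-gRNA totals. A minimal, dependency-free sketch is shown here; the function name `pearson_r` and the toy totals are assumptions, and any transformation applied to the counts before correlating (for example, normalization or log scaling) is not specified in this section and is therefore omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy per-gRNA totals (illustrative only): UMIs in cells that received the
# gRNA library versus UMIs of the same gRNAs found in cells that did not.
transduced_umis = [100, 200, 400, 800]
non_transduced_umis = [2, 4, 8, 16]

r = pearson_r(transduced_umis, non_transduced_umis)
print(round(r, 3))
```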

To further characterize the source of ambient gRNA contamination, we compared non-transduced gRNA profiles detected in cell-containing droplets to empty droplets. We distinguished empty droplets from cell-containing droplets based on the total number of UMI counts detected in the gene expression libraries (Figure 2I). We find a similar abundance of gRNA UMI counts in empty droplets compared to non-transduced gRNAs in cell-containing droplets, indicating that gRNAs present in the experimental buffer are the main contribution to ambient noise in CRISPR feature libraries (Figures 2J and 2K). We wash cells three times before loading onto the 10X chip (STAR Methods), which indicates that this degree of washing is not sufficient to remove ambient gRNA contamination. This is in line with previous findings for ambient mRNAs present in gene expression libraries.13

Next, we reasoned that the number of gRNAs with low UMI counts detected in a cell is a representative metric for cell-to-cell biases such as variation in capture efficiency, reaction efficiencies, and sequencing depth given that deeper sequencing of an individual cell will uncover rarer transcripts. We compared the sum of low-UMI-count gRNAs to the sum of non-transduced gRNA UMI counts detected in each cell and found a significant correlation for both the CROP-seq (Figure S2K) and direct-capture datasets (Figure S2L). This indicates that cell-to-cell biases in sequencing depth influence the number of non-transduced gRNA UMI counts detected in a cell.

Through an experimental approach, we have determined the factors that influence native and ambient gRNA UMI distributions (Figure 2L). These scCRISPR barnyard screens provide evidence for four major contributors to a gRNA’s native UMI abundance: (1) the gRNA capture system, (2) a gRNA’s abundance in the plasmid library and/or transduced pool of cells, (3) a gRNA’s bias in expression and/or capture efficiency, and (4) cell-to-cell biases in reaction efficiencies and sequencing depth. gRNAs that are expressed from a direct-capture vector, gRNAs that are more abundant in the transduced pool of cells, and gRNAs with higher capture efficiencies generate a larger number of native UMI counts. These native gRNA transcripts are subsequently released into the experimental buffer if cell lysis occurs during the experiment, transitioning into ambient gRNAs. Ambient gRNA UMI counts are highly correlated with native gRNA UMI counts as they originate from native gRNA transcripts.

Statistical analysis of gRNA assignments for perturb-seq screens using a mixture model

For a given gRNA, we observed two distributions in the CRISPR barnyard datasets: (1) a low UMI distribution of non-transduced (ambient) gRNA transcripts and (2) a UMI distribution of transduced (ambient + native) gRNAs (Figure 3A). The latter distribution is bimodal, as it is made up of both ambient and native gRNA UMIs, further supporting the use of a mixture model to remove ambient gRNA noise. This bimodal distribution is present throughout all of the datasets we analyzed (Figures 1A, 1C, 1D, S1B, S1C, 2C, 2D, S2C, and S2D). In accordance with our previous observations, we find significant differences in gRNA UMI count distributions between the CROP-seq and direct-capture perturb-seq datasets (Figure 3A), indicating that vector-specific mixture models are required to effectively bin ambient and native gRNA-cell assignments across distinct capture methods.

Figure 3.


CLEANSER accurately distinguishes ambient gRNA noise from signal

(A) Histograms of UMI counts for sgNT-73 (library #1 gRNA) across all cells in the non-co-cultured CROP-seq (top) and direct-capture (bottom) barnyard datasets.

(B) Scatterplots showing correlations between the mean and variance of ground-truth ambient gRNA UMIs in HEK293T dCas9KRAB cells for CROP-seq and direct-capture methods in the non-co-cultured and 72-h co-cultured barnyard datasets.

(C and D) Graphical model of (C) CROP-seq CLEANSER (csCLEANSER) and (D) Direct-capture CLEANSER (dcCLEANSER).

(E and F) Scatterplots depicting the relationship between the probability of assignment and UMI count size for each gRNA-cell pair in (E) csCLEANSER’s analysis of CROP-seq barnyard perturb-seq data and (F) dcCLEANSER’s analysis of direct-capture barnyard perturb-seq non-co-cultured data.

(G and H) Bar chart of the number of transduced and non-transduced assignments in (G) csCLEANSER’s analysis of non-co-cultured CROP-seq data and (H) dcCLEANSER’s analysis of direct-capture non-co-cultured barnyard data. CLEANSER’s assignments are compared to unfiltered data, a 5-UMI cutoff, and the 10X Mixture Model.

(I) Scatterplot depicting the correlation of the distance between noise and signal distributions and the sample variances of r.

(J) Boxplot of the sample variances of r with varying amounts of coverage per gRNA.

See also Figure S3.

We observed overdispersion of gRNA UMIs in both the CROP-seq and direct-capture CRISPR barnyard datasets. This is consistent with a negative binomial distribution, which successfully models Poisson-overdispersed datasets such as RNA-seq data. Therefore, we chose a negative binomial distribution to model the gRNA UMI count distributions for both ambient and native gRNAs in CLEANSER (Figure 3A).26 We isolated non-transduced gRNA transcripts in the CRISPR barnyard dataset to better understand the distribution of ambient UMI counts. While the non-transduced gRNA UMI counts for each gRNA in the direct-capture perturb-seq datasets showed a large variance compared to the mean, the non-transduced gRNA UMI counts for each gRNA in the CROP-seq datasets showed a similar mean and variance and less overdispersion compared to the direct-capture datasets (Figures 3B and S3A). Therefore, we chose to model the ambient gRNA distribution as a Poisson distribution (a negative binomial distribution with one parameter modeling both mean and variance) in CROP-seq experiments (Figure 3C) and as a negative binomial in direct-capture perturb-seq experiments (Figure 3D) to allow for different mean and variance parameters.
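The choice between a Poisson and a negative binomial noise component can be motivated by comparing the sample mean and variance of ambient counts for a gRNA. The sketch below is illustrative only: it assumes the common parameterization var = μ + μ²/ϕ and uses a simple method-of-moments estimate, which is not necessarily how CLEANSER estimates dispersion; the function names and toy counts are hypothetical.

```python
def mean_and_variance(counts):
    """Sample mean and unbiased sample variance of a list of counts."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return mean, variance


def nb_dispersion_moments(counts):
    """Method-of-moments dispersion, assuming var = mean + mean**2 / phi.

    Returns infinity when variance <= mean, i.e. when the counts show no
    overdispersion relative to a Poisson distribution.
    """
    mean, variance = mean_and_variance(counts)
    if variance <= mean:
        return float("inf")
    return mean * mean / (variance - mean)


# Toy ambient UMI counts for one gRNA across cells (illustrative only).
poisson_like = [0, 1, 1, 2, 1, 0, 2, 1]
overdispersed = [0, 0, 1, 0, 30, 2, 0, 15]

print(mean_and_variance(poisson_like), nb_dispersion_moments(poisson_like))
print(mean_and_variance(overdispersed), nb_dispersion_moments(overdispersed))
```

Under this parameterization a small ϕ corresponds to strong overdispersion, while a very large ϕ approaches the Poisson case in which mean and variance coincide.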

Separate priors were added to CROP-seq CLEANSER (csCLEANSER) and direct-capture CLEANSER (dcCLEANSER). In the csCLEANSER model, weakly informative priors allow a small parameter for the noise distribution (λ), and the mean of the signal negative binomial component (μ) is always larger than λ (Figure 3C). In the dcCLEANSER model, the dispersion parameter (ϕ) for the two negative binomial distributions allows for the larger variance observed in the barnyard perturb-seq experiments (Figure 3D). A normalized cell-specific parameter (L) allows for confounding technical factors that affect individual cells such as sequencing depth and batch effects. These cell-specific values contain information about the number of gRNA UMIs ≤2 detected in a cell and are used to normalize the mean values of the two distributions (Figures S2K–S2L).
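To make the role of the two components concrete, the sketch below computes the posterior probability that an observed UMI count came from the native (signal) component of a two-component mixture with a Poisson noise component and a negative binomial signal component, as in the csCLEANSER description. It is not the CLEANSER implementation: the parameters are supplied by hand rather than fitted with priors, the mixing weight `pi_native` is a hypothetical name, and scaling both means multiplicatively by a per-cell factor is an assumed form of the normalization by L.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def nb_logpmf(k, mu, phi):
    """Log-probability of k under a negative binomial with mean mu, dispersion phi."""
    return (math.lgamma(k + phi) - math.lgamma(phi) - math.lgamma(k + 1)
            + phi * math.log(phi / (phi + mu))
            + k * math.log(mu / (phi + mu)))


def native_posterior(k, lam, mu, phi, pi_native, cell_factor=1.0):
    """P(native | k) for a Poisson-noise / negative-binomial-signal mixture.

    cell_factor rescales both component means (assumed multiplicative form).
    """
    log_native = math.log(pi_native) + nb_logpmf(k, mu * cell_factor, phi)
    log_ambient = math.log1p(-pi_native) + poisson_logpmf(k, lam * cell_factor)
    top = max(log_native, log_ambient)  # subtract the max for numerical stability
    w_native = math.exp(log_native - top)
    w_ambient = math.exp(log_ambient - top)
    return w_native / (w_native + w_ambient)


# Hand-picked illustrative parameters with lam < mu, as the priors require.
params = dict(lam=0.5, mu=10.0, phi=2.0, pi_native=0.1)
for k in (1, 5, 20):
    print(k, round(native_posterior(k, **params), 4))
```

With these toy parameters a single UMI is attributed almost entirely to the ambient component, whereas a count of 20 is attributed almost entirely to the native component.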

We found that gRNA-cell pairs with high UMI counts generally have high assignment probabilities, indicating no obvious cases of false-negative assignments (Figures 3E and 3F). We applied three gRNA-cell pair assignment probability thresholds to the barnyard datasets: (1) a lenient cutoff, (2) a moderate cutoff (CLEANSER default), and (3) a strict cutoff (STAR Methods) and found that the percentage of non-transduced gRNA-cell assignments remained low at all applied thresholds (0%–2.12%) (Figures S3B and S3C). We compared the number of transduced and non-transduced gRNA-cell assignments after filtering with CLEANSER (default thresholds), a 5-UMI threshold, or the 10X Mixture Model. The percentage of discordant assignments across filtering methods ranged from 5.3% to 12.7% in the barnyard datasets, indicating substantial differences between assignment methods (Figure S3D). In both CROP-seq and direct-capture experiments, CLEANSER outperformed one or both assignment methods by assigning a smaller number of ambient gRNAs and not under-assigning native gRNAs (Figures 3G, 3H, S2, S3E, and S3F). Interestingly, all three filtering methods retained the largest number of non-transduced gRNA assignments in HEK293T dCas9KRAB cells in the direct-capture experiment, indicating variability in assignment method performance between CROP-seq and direct capture (Figures 3H and S3F).

Given the variability of assignment method performance across datasets, we employed CLEANSER to isolate variables influencing the accuracy of gRNA-cell assignment. Our analyses revealed a negative correlation between the sample variances of the ratio (r) between assignment to the ambient and native distributions and the distance between ambient and native distributions (Figure 3I). This suggests that enhanced model performance is associated with datasets exhibiting greater separation between ambient and native gRNA UMIs, a pattern more frequently observed in direct-capture datasets compared to CROP-seq datasets and in distinct cell types depending on the capture method. For instance, we observed increased model performance in NIH3T3 cells in the direct-capture dataset due to greater separation between the non-transduced and transduced gRNA populations (Figures 2D, S2D, 3H, and S3F). Furthermore, we noted a decrease in sample variances of r with increasing coverage of individual gRNAs (Figure 3J), highlighting the critical role of coverage in the accuracy of gRNA-cell assignment in perturb-seq screens.

Benchmarking ambient gRNA filtering tools

To benchmark CLEANSER against existing ambient gRNA removal methods, we analyzed publicly available CROP-seq10 and direct-capture20,24 datasets after filtering with (1) CLEANSER, (2) the strict UMI cutoffs previously applied to these datasets, or (3) the 10X Mixture Model (Figure 4A). To effectively assign gRNAs to cells using CLEANSER, we first determined an optimal filtering pipeline. In both K562 CROP-seq and direct-capture perturb-seq experiments, each sample was profiled across multiple lanes. We hypothesized that sequencing depth, cell number, amount of cell lysis, and reaction efficiencies could impact ambient gRNA presence across technical replicates. Therefore, we tested whether CLEANSER is influenced by lane-specific batch effects. We compared μ (mean of signal) and μn or λ (mean of noise) across technical replicates in each K562 screen and found that they were only weakly correlated by lane (Figures S2 and S4A–S4D). To reduce these lane-specific batch effects, we applied all filtering methods at a lane-by-lane level.

Figure 4.

Ambient gRNA filtering methods produce differential gRNA-cell assignments across benchmark perturb-seq datasets

(A) Schematic depicting the gRNA libraries, cell types, and gRNA capture systems used in the datasets downloaded for benchmarking analyses.

(B and C) Violin plots depicting the distribution of UMI counts for several (B) CROP-seq and (C) direct-capture datasets. Mean UMI count is italicized above each dataset.

(D) Percentage of discordant gRNA-cell assignments across the three filtering methods for all three benchmarking datasets.

(E–G) Histograms of gRNA UMIs in the (E) K562 CROP-seq, (F) CD8+CCR7+ T cell direct-capture, and (G) K562 direct-capture datasets after applying CLEANSER, a UMI cutoff, or the 10X Mixture Model.

(H–J) Density plots showing the fraction of gRNA-cell assignments assigned by one (red), two (purple), or all three (blue) filtering methods across all gRNAs screened in the (H) K562 CROP-seq, (I) CD8+CCR7+ T cell direct-capture, and (J) K562 direct-capture datasets.

See also Figure S4.

Another important factor when implementing CLEANSER is choosing an appropriate posterior cutoff. We first assessed UMI counts across all gRNA-cell pairs for each dataset relative to the barnyard datasets and other publicly available datasets employing the same capture methods (Figures 4B and 4C). We found similar mean UMI counts across CROP-seq datasets (Figure 4B). In contrast, the direct-capture datasets exhibited a wider range of variability in UMI counts. The T cell dataset displayed approximately 11-fold to 60-fold fewer mean UMI counts compared to the barnyard datasets and other direct-capture datasets (Figure 4C). We initially applied the default threshold parameters when filtering with cs/dcCLEANSER (Figures S4E–S4G). However, when using the default dcCLEANSER parameters with the T cell dataset, we noted that several gRNAs received no or very few assignments (Figure S4H). This is likely attributable to the low UMI counts in the T cell dataset relative to the barnyard training data. Consequently, we adjusted the probability of assignment threshold to a more lenient cutoff using an eCDF plot (Figure S4F). We selected a probability of assignment threshold positioned at the first inflection point of the plot to increase the number of gRNA assignments while preserving a unimodal distribution and minimizing the predicted false-positive rate (Figures S3C and 4F).
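The eCDF-based threshold adjustment can be explored with a short helper that reports the empirical cumulative distribution of posterior probabilities and the fraction of gRNA-cell pairs retained at a candidate cutoff. This is a sketch under stated assumptions: the posterior values are invented, and locating the inflection point is left to visual inspection of the plotted curve, as no automated rule is described here.

```python
from bisect import bisect_left, bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("ecdf requires at least one value")
    n = len(ordered)

    def cumulative_fraction(x):
        return bisect_right(ordered, x) / n

    return cumulative_fraction


def retained_fraction(posteriors, threshold):
    """Fraction of gRNA-cell pairs kept when requiring posterior >= threshold."""
    ordered = sorted(posteriors)
    if not ordered:
        raise ValueError("retained_fraction requires at least one value")
    return (len(ordered) - bisect_left(ordered, threshold)) / len(ordered)


# Hypothetical posterior probabilities for ten gRNA-cell pairs.
posteriors = [0.01, 0.02, 0.05, 0.10, 0.30, 0.55, 0.70, 0.85, 0.95, 0.99]
F = ecdf(posteriors)
for cutoff in (0.3, 0.5, 0.8):
    print(cutoff, F(cutoff), retained_fraction(posteriors, cutoff))
```

Lowering the cutoff increases the number of retained pairs, which is the trade-off weighed against the predicted false-positive rate when selecting a more lenient threshold.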

After establishing thresholds for ambient gRNA filtering, we compared gRNA-cell assignments produced from the three ambient gRNA filtering methods. As predicted, we found that gRNA-cell assignments differed across distinct filtering methods (Figures 4D–4G and S4I–S4K). For all datasets, we found that the 10X Mixture Model generated the largest number of gRNA-cell assignments (Figures S4I–S4K). In contrast, the smallest number of gRNA-cell assignments was produced by the strict UMI cutoff in the K562 CROP-seq and T cell direct-capture datasets (Figures S4I–S4J) and by CLEANSER in the K562 direct-capture dataset (Figure S4K). This is consistent with our prior findings that the 10X Mixture Model generates more gRNA-cell assignments in CROP-seq datasets than a strict UMI cutoff (Figures 1A, 1I, and S1I). However, this is in contrast with our prior observation of a larger number of gRNA-cell assignments using a strict UMI cutoff relative to the 10X Mixture Model in direct-capture datasets (Figures 1J and S1J), indicating variability in assignment outcomes using these filtering methods.
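A comparison of this kind can be sketched by treating each method's output as a set of (cell, gRNA) pairs and counting how many methods support each pair. The definition of discordance used below (a pair assigned by at least one but not all methods, as a percentage of all assigned pairs) is an assumption for illustration, since the exact calculation is not spelled out in this section; the cell and gRNA names are hypothetical.

```python
from collections import Counter


def method_agreement(assignments_by_method):
    """Map each (cell, gRNA) pair to the number of methods that assigned it."""
    support = Counter()
    for pairs in assignments_by_method.values():
        support.update(set(pairs))
    return support


def percent_discordant(assignments_by_method):
    """Percentage of assigned pairs that are not supported by every method."""
    support = method_agreement(assignments_by_method)
    if not support:
        return 0.0
    n_methods = len(assignments_by_method)
    discordant = sum(1 for n in support.values() if n < n_methods)
    return 100.0 * discordant / len(support)


# Hypothetical assignments from three filtering methods.
assignments = {
    "cleanser": {("cell1", "g1"), ("cell2", "g2")},
    "umi_cutoff": {("cell1", "g1")},
    "mixture_model": {("cell1", "g1"), ("cell2", "g2"), ("cell3", "g3")},
}
print(method_agreement(assignments))
print(round(percent_discordant(assignments), 1))
```

The per-pair support counts correspond to the one-, two-, and three-method categories shown in Figures 4H–4J.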

Upon further investigation at the gRNA level, we found that differences in gRNA-cell pairings in the CROP-seq dataset were uniformly spread across gRNAs (Figure 4H). In contrast, the direct-capture datasets had high variability in cell assignments across gRNAs, with some gRNAs having large differences in gRNA-cell assignments and some having minor differences (Figures 4I and 4J). This is likely due to the minimal gRNA-specific biases found in CROP-seq gRNA UMI counts, which results in a more consistent filtering across gRNAs after applying the three methods. In agreement with this observation, we found significant gRNA-to-gRNA bias in the number of UMI counts for direct-capture CRISPR feature libraries but minimal gRNA bias for CROP-seq libraries (Figures S4L–S4N).

Effect of ambient gRNA noise removal on differential gene expression

In order to better understand the effects of ambient gRNA filtering on differential gene expression analysis, we conducted differential expression testing on the CROP-seq10 and direct-capture20,24 datasets filtered by CLEANSER, a UMI cutoff, or the 10X Mixture Model. In the CRISPRi K562 CROP-seq and T cell datasets, we observed a similar number of significant positive control gRNA-gene pairs for data filtered by the three methods (Figures S5A and S5B; Tables S3 and S4). Likewise, we observed a strong correlation in the p values generated during differential expression testing (Figures S5C and S5D). This finding is consistent with our observation that the CROP-seq dataset was filtered similarly by the three methods (Figures 4E and 4H). However, this is inconsistent with the large differences we observed in gRNA-cell assignment for the T cell direct-capture dataset (Figures 4F and 4I). This may be due in part to the strong predicted effects for all targeting gRNAs in this experiment, since they were preselected for strong modulation of gene promoters, which may reduce the impact of gRNA misassignment. Nevertheless, we found that non-targeting gRNAs in the T cell dataset were often differentially assigned by the three filtering methods, with one non-targeting gRNA only receiving assignments when filtered with the 10X Mixture Model (Figure S5E). As a result, in comparison to CLEANSER, the number of non-targeting gRNA-gene pair (false-positive) hits was 1.4-fold higher for the 10X Mixture Model and 1.3-fold higher for the strict UMI cutoff (Figure S5F). These results support that the differential expression results produced by CLEANSER assignments for positive controls are comparable to those of alternate ambient gRNA filtering methods for certain datasets. In addition, filtering with CLEANSER reduced the total number of false-positive hits.

For the direct-capture K562 CRISPRa dataset, we compared the differential expression results produced by the three ambient filtering methods and found that CLEANSER detected more gRNA-gene hits relative to both the strict UMI cutoff and the 10X Mixture Model (Figures 5A and 5B; Table S5). Notably, CLEANSER hits encompassed more gRNA-gene pairs in both categories of promoter- and enhancer-targeting gRNAs, the majority of which upregulated their predicted gene targets, defined by the putative transcriptional start site (TSS) or previously identified enhancer-gene links in K562 cells20 (Figure 5C). In contrast to CLEANSER, the 10X Mixture Model identified the fewest significant gRNA-gene pairs (Figures 5A and 5B) with only eight of these hits belonging to a set of gRNA-gene links previously identified in a K562 CRISPRi screen,10 while CLEANSER yielded 11 of these gRNA-gene links (Figure 5D). When examining gRNAs upregulating their predicted gene targets, we observed larger changes in gene expression and smaller p values for CLEANSER relative to the UMI cutoff and the 10X Mixture Model (Figures 5E and S5G). We found similar results for the 32 gRNA hits identified concordantly through the three filtering methods (Figures 5F and S5H). The observed increase in differential expression testing sensitivity indicates an increase in gRNA-cell assignment accuracy after implementing CLEANSER as opposed to alternative methods.

Figure 5.

Ambient gRNA filtering methods impact differential gene expression analysis outcomes

(A) Venn diagram depicting the overlap of significant gRNA-gene links in the direct-capture perturb-seq K562 CRISPRa dataset produced by each ambient filtering method. Blue indicates gRNA-gene pairs identified by all filtering methods. Light red indicates gRNA-gene pairs identified by two filtering methods. Dark red indicates gRNA-gene pairs identified by only one filtering method.

(B) Volcano plot of gRNA-gene pairs identified after filtering the direct-capture perturb-seq K562 CRISPRa dataset with CLEANSER.

(C) Number of significant gRNA-gene links in the direct-capture perturb-seq K562 CRISPRa dataset produced by each ambient filtering method separated by TSS targeting or enhancer targeting.

(D) Number of significant K562 CRISPRa gRNA-gene pairs overlapping previously identified gRNA-gene pairs in a K562 CRISPRi screen10 for each ambient filtering method.

(E) Violin plot of log2(fold changes) for predicted gRNAs-gene pairs across three ambient gRNA filtering methods. Wilcoxon rank-sum test.

(F) Violin plot of log2(fold changes) across three ambient gRNA filtering methods for significant gRNAs-gene pairs identified by all three methods. Wilcoxon rank-sum test. p values are represented by asterisks (∗p ≤ 0.05, ∗∗p ≤ 0.01, ∗∗∗p ≤ 0.001).

(G) Density plots showing the UMI count of gRNA-cell assignments in unfiltered, strict UMI cutoff filtered, 10X Mixture Model filtered, or CLEANSER filtered K562 CRISPRa perturb-seq data for the six unique gRNA hits identified by CLEANSER.

(H) Violin plot of normalized gene expression for CLEANSER filtered cells assigned with a given gRNA and control cells for six unique gRNA hits identified by CLEANSER. p values are represented by asterisks (Benjamini-Hochberg [BH]-corrected empirical p value ≤0.1).

See also Figure S5 and Tables S4, S5, and S7.
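The legend for panel H refers to Benjamini-Hochberg (BH)-corrected empirical p values with a cutoff of 0.1. For readers unfamiliar with the correction, a minimal sketch of the standard BH step-up adjustment is given below. The input p values are invented, and the sketch does not reproduce how the empirical p values themselves were computed in this study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonic adjusted values.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical empirical p values for four gRNA-gene pairs.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
hits = [q <= 0.1 for q in adjusted]
print(adjusted)
print(hits)
```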

The six hits that are unique to CLEANSER included four gRNAs that upregulated their predicted gene target and two gRNAs that upregulated an alternate gene (Figures 5G and 5H; Table S5). We noted that CLEANSER assignments for these six gRNAs contained a smaller proportion of low UMI counts and were more unimodal than gRNA-cell assignments generated by alternative filtering methods (Figure 5G). When individually validated by RT-qPCR, four of the six gRNA hits uniquely identified by CLEANSER significantly upregulated their corresponding gene targets (Figure S5I). In contrast, two of three gRNA-gene links uniquely identified by the strict UMI cutoff were not significant in the validation experiment, while one gene target was too lowly expressed to test by RT-qPCR (Figure S5I). The gRNA-gene link uniquely identified by the 10X Mixture Model was also significant by RT-qPCR; we found that this gRNA-gene link was not identified by CLEANSER due to the low number of cells identified with the targeting gRNA (two cells for CLEANSER vs. 71 cells for the 10X Mixture Model).

One of the individually validated hits unique to CLEANSER was a gRNA-gene pair previously linked in a K562 CRISPRi perturb-seq screen: the chr6.702_479 gRNA (an enhancer gRNA previously identified in Gasperini et al.) and RNF182.10 This previous study used a strict UMI cutoff to show that CRISPRi of this enhancer leads to downregulation of RNF182.10 However, only CLEANSER (not a strict UMI cutoff) detected that CRISPR activation (CRISPRa) of this same enhancer results in significant upregulation of RNF182 (Figure 5H). We examined the differentially assigned gRNA-cell pairs for the chr6.702_479 gRNA and found that the gRNA-cell pairs uniquely assigned by CLEANSER included fewer cells with relatively low expression of RNF182 than gRNA-cell pairs differentially assigned by the alternative methods (Figure S5J). This indicates that CLEANSER’s increased sensitivity may be due in part to the removal of false-positive gRNA-cell assignments. Other possibilities include the inclusion of true-positive gRNA-cell assignments filtered by alternative methods or the removal of gRNA assignments to cells with low expression of the gRNA, in which the effects of the gRNA may be weaker. The six instances of gRNA-gene hits identified solely by CLEANSER and 32 additional gRNA-gene hits relative to the 10X Mixture Model demonstrate the importance of accurate gRNA assignment in perturb-seq screens to aid in the detection of subtle but biologically and/or therapeutically relevant changes in gene expression, such as identification of non-coding regulatory elements or off-target effects.

Discussion

In this study, we characterized ambient gRNA contamination and quantified its impact on downstream differential gene expression analyses in perturb-seq screens. Our side-by-side comparison of gRNA-cell assignments generated by widely used ambient filtering methods using both CROP-seq and direct-capture perturb-seq datasets underscores the discordance of current methods and the knowledge gap surrounding ambient gRNA filtering. Our findings shed light on the characteristics of ambient gRNAs and introduce a novel computational tool, CLEANSER, to efficiently target and remove ambient noise from CRISPR screening libraries.

Our findings from scCRISPR barnyard experiments highlight the differences in the distribution of gRNA UMI counts in CROP-seq and direct-capture perturb-seq datasets. We observed a larger number of gRNA UMI counts in direct-capture libraries compared to CROP-seq, likely linked to expression or stability differences in pol III vs. pol II transcripts. This underscores the need for vector-specific characterization of ambient gRNA noise. While we show differences in pol II CROP-seq versus pol III direct-capture transcripts, we note that CROP-seq also generates pol III gRNA transcripts that are not detected in this analysis. While pol III transcripts do not affect gRNA assignment in the current CROP-seq system, it is likely that these transcripts share similar characteristics with direct-capture pol III transcripts. Therefore, the potency of perturbation on gene expression would be similar with both approaches, even if there are stark differences in detection of the gRNA. Additionally, we characterize ambient gRNAs by their significantly lower mean UMI counts compared to native gRNAs in both detection methods. Nevertheless, these UMI counts vary across gRNAs and cells, emphasizing the need for ambient removal methods that account for both gRNA- and cell-specific biases (Figure 2L).

We observed that more abundant gRNAs in the CRISPR libraries were more likely to contribute to ambient contamination, in line with previous studies characterizing ambient mRNAs.14 Furthermore, our analyses revealed that ambient gRNA profiles from co-cultured cells, non-co-cultured cells, and empty droplets are similar, which supports that ambient gRNAs are consistently present in experimental solutions. Further investigation into more stringent washing or alternative strategies to experimentally reduce ambient gRNA noise is needed, and the methods we describe here provide a blueprint to evaluate those new experimental methods.

To address the limitations of current gRNA-cell assignment methods, we introduced the CLEANSER mixture model, which leverages the distinct bimodal distribution of ambient and native gRNA UMIs to differentiate signal from ambient noise. Our benchmarking results demonstrated that CROP-seq datasets more often display concordance across filtering methods. Additionally, even with high discordance of gRNA-cell assignments across filtering methods in direct-capture datasets, the differential expression analyses of gRNAs with very large effects are less likely to be impacted than gRNAs with small effects. Nonetheless, filtering a direct-capture dataset with CLEANSER produced a larger number of significant gRNA-gene pairs compared to a UMI cutoff or the 10X Mixture Model, with 32 additional gRNA-gene pairs identified by CLEANSER relative to the 10X Mixture Model. The gRNA-gene pairs uniquely discovered by CLEANSER exhibited more subtle regulatory relationships than pairs discovered by all methods (Figure 5B), which may be particularly critical for CRISPR screening studies designed to dissect genome-wide association study (GWAS) loci or other genetic determinants of common, complex disease. We also provided orthogonal evidence supporting these unique gRNA-gene pairs, including RT-qPCR (Figure S5I) and linkage by additional perturbation screens (Figure 5D), further indicating that these are true-positive interactions uniquely identified with CLEANSER. CLEANSER also generated larger changes in gene expression for positive control and predicted gRNA-gene pairings. This highlights the critical role of accurate filtering in downstream analysis and the impact of ambient gRNA removal methods on differential expression testing.

Our study provides a comprehensive analysis of ambient gRNA contamination in both CROP-seq and direct-capture scCRISPR screens, highlighting the need for effective ambient gRNA removal methods. The CLEANSER mixture model offers a publicly available tool for researchers to improve the accuracy of perturb-seq data analysis, enabling more reliable differential expression results. This tool can be modified by editing the statistical distributions for each variable, choosing different priors, or appending additional components as new perturb-seq platforms with distinct ambient gRNA characteristics are developed.

Limitations of the study

CLEANSER is trained on CRISPR barnyard data generated using human HEK293T and mouse NIH3T3 cells and therefore its performance may vary across additional cell types and experimental conditions. We tested both CROP-seq and direct-capture platforms, and additional training might be needed for other systems that are developed in the future. gRNAs were integrated and expressed through lentiviral transduction. Thus, alternative delivery methods may result in different UMI counts/distributions. In addition, gRNA-cell assignments can be impacted by coverage and strength of the signal component vs. the noise component. For instance, gRNAs with greater coverage and larger UMI counts are generally assigned with higher accuracy.

Additional limitations of our analysis are that gRNA-gene pairs identified downstream of CLEANSER can include false positives and/or false negatives and that, along with accurate gRNA-cell assignment, the design of differential expression analysis further influences the identification of gRNA-gene pairs. gRNA hits from perturb-seq screens should be validated individually by RT-qPCR.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Charles A. Gersbach (charles.gersbach@duke.edu).

Materials availability

Plasmids generated in this study have been deposited to Addgene.

Data and code availability

Acknowledgments

We thank the Duke sequencing core and Duke cell culture facility for excellent assistance. We also thank the teams at High-throughput Applied Research Data Analysis Cluster (HARDAC) and Duke Computing Cluster (DCC) for computing resources. Schematics were created with BioRender.com. The work is funded by National Institutes of Health grants RM1-HG011123 (C.A.G., G.E.C., W.H.M., and A.S.A.), MH125236 (C.A.G. and G.E.C.), HG012053 (C.A.G. and G.E.C.), R35-GM150404-01 (W.H.M.), U01-HG011967-03 (A.S.A. and W.H.M.), NSF EFMA-1830957 (C.A.G.), and Open Philanthropy (C.A.G.). L.R.B. was supported by the NSF-GRFP (NSF-GRFP DGE-2139754).

Author contributions

Conceptualization, methodology, writing – original draft, and writing – reviewing & editing, C.A.G., G.E.C., W.H.M., A.S.A., M.C.H., S.L., A.B., and L.R.B.; investigation and validation, C.A.G., G.E.C., M.C.H., L.R.B., S.F.D., J.W.R., and A.C.N.; software, formal analysis and visualization, C.A.G., G.E.C., W.H.M., A.S.A., M.C.H., S.L., T.C., A.B., A.C.N., and R.W.D.; funding acquisition and supervision, C.A.G., G.E.C., W.H.M., A.S.A., and L.R.B.

Declaration of interests

C.A.G. is an inventor on patents and patent applications related to genome engineering and CRISPR screens, and is a co-founder and advisor to Tune Therapeutics, an advisor to Sarepta Therapeutics, and a co-founder of Locus Biosciences.

STAR★Methods

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Bacterial and virus strains

NEB Stable Competent E. coli (High Efficiency) New England Biolabs Cat#C3040H

Chemicals, peptides, and recombinant proteins

Puromycin Dihydrochloride Thermo Fisher Scientific Cat#A1113803
Blasticidin S HCl Thermo Fisher Scientific Cat#A1113903
NEBuilder HiFi DNA Assembly Master Mix New England Biolabs Cat#E2621L
DMEM, high glucose, pyruvate Gibco Cat#11995073
Fetal Bovine Serum Thermo Fisher Scientific Cat#10437028
Penicillin-Streptomycin (10,000 U/mL) Gibco Cat#15140122
Opti-MEM Reduced Serum Medium Gibco Cat#11058021
Trypsin-EDTA (0.25%), phenol red Gibco Cat#25200072
PureLink™ Genomic DNA Mini Kit Invitrogen Cat#K182002
DPBS, no calcium, no magnesium Gibco Cat#14190250
QIAquick PCR Purification Kit Qiagen Cat#28106
QIAquick Gel Extraction Kit Qiagen Qiagen Cat#28706
QIAprep Spin Miniprep Kit Qiagen Qiagen Cat#27106
SYBR Safe DNA Gel Stain Invitrogen Cat#S33102
Esp3I New England Biolabs Cat#R0734L
1 Kb Plus DNA Ladder Invitrogen Cat#10787026
UltraPure™ Agarose Invitrogen Cat#16500100
NEBNext Ultra II Q5 Master Mix New England Biolabs Cat#M0544L
Dimethyl sulfoxide Sigma-Aldrich Cat#D2650
Polybrene infection/Transfection Reagent Sigma-Aldrich Cat#TR-1003
RNEasy Plus mini kit Qiagen Cat#74134
Lenti-X Concentrator Takara Cat#631232
TaqMan Fast Advanced Master Mix for qPCR, no UNG Thermo Fisher Scientific Cat#A44360
TaqMan Gene Expression Assay (FAM) Assay ID: Hs00360405_g1 Gene Symbol: CUTA Thermo Fisher Scientific Cat#4448892
TaqMan Gene Expression Assay (FAM) Hs00185435_m1 Gene Symbol: HLA-DMA Thermo Fisher Scientific Cat#4453320
TaqMan Gene Expression Assay (FAM) Hs00171876_m1 Gene Symbol: DNMT3B Thermo Fisher Scientific Cat#4453320
TaqMan Gene Expression Assay (FAM) Hs02596927_g1 Gene Symbol: RPL7 Thermo Fisher Scientific Cat#4448892
TaqMan Gene Expression Assay (FAM) Hs00908900_m1 Gene Symbol: FOXP1 Thermo Fisher Scientific Cat#4448892
TaqMan Gene Expression Assay (FAM) Hs00162613_m1 Gene Symbol: TCF4 Thermo Fisher Scientific Cat#4453320
TaqMan Gene Expression Assay (FAM) Hs00381656_m1 Gene Symbol: EIF4E3 Thermo Fisher Scientific Cat#4448892
TaqMan Gene Expression Assay (FAM) Hs00153380_m1 Gene Symbol: CCND2 Thermo Fisher Scientific Cat#4453320
TaqMan Gene Expression Assay (FAM) Hs00381370_m1 Gene Symbol: RNF182 Thermo Fisher Scientific Cat#4448892
TaqMan Gene Expression Assay (FAM) Hs02576472_g1 Gene Symbol: EIF1AX Thermo Fisher Scientific Cat#4448892
TaqMan Gene Expression Assay, VIC primer-limited (VIC-MGB_PL) Hs99999905_m1 Gene Symbol: GAPDH Thermo Fisher Scientific Cat#4448485

Deposited data

Single Cell RNA-seq data GEO GEO: GSE272454 and GEO: GSE272457

Experimental models: Cell lines

NIH/3T3 ATCC Cat#CRL-1658
HEK293T/17 ATCC Cat#ACS-4500

Oligonucleotides

gRNA_60bp_fw TAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAACACCG This paper Method Details
gRNA_60bp_rv GTTGATAACGGACTAGCCTTATTTAAACTTGCTATGCTGTTTCCAGCATAGCTCTTAAAC This paper Method Details
U6_Bc_r1seq_halftail (24 distinct versions of this primer with staggered-length in-line barcodes denoted here as NNNNN) 5′ ACTCTTTCCCTACACGACGCTCTTCCGATCT NNNNN GGAAAGGACGAAACACCG 3′ This paper Method Details
gRNAFE_r2seq_halftail 5′ GACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCCTTATTTAAACTTGCTATGCTGT 3′ This paper Method Details
r1seq_fulltail 5′ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTC 3′ This paper Method Details
r2seq_fulltail (up to 8 distinct indexed versions of this primer were used to maximize pooling) 5′ CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAGACGTGTGCT 3′ This paper Method Details
hU6_promoter_FW: CTTGTGGAAAGGACGAAACACCG This paper Method Details
gRNA_hairpin_RV: CGACTCGGTGCCACTTTTTCAAG This paper Method Details
sgNT-73_gRNA_FW: CGGTGGAGTTTAAGAGCTATGCTG This paper Method Details
See Table S6 for lentiviral titration qPCR primers This paper Method Details
See Table S7 for individual gRNA oligos This paper Method Details

Recombinant DNA

pCMV-VSV-G Addgene Addgene Cat#8454
pRSV-Rev Addgene Addgene Cat#12259
pMDLg/pRRE Addgene Addgene Cat#12259
pJR85 Addgene Addgene Cat#140095
hU6 modified direct capture perturb-seq vector This paper Addgene Cat#230934

Software and algorithms

Excel Microsoft https://www.microsoft.com/en-us/microsoft-365/excel
Rstudio Rstudio https://rstudio.com
Biorender Biorender https://biorender.com/
CLEANSER This paper GitHub: https://github.com/Gersbachlab-Bioinformatics/CLEANSER, Zenodo: https://doi.org/10.5281/zenodo.12744932
Cell Ranger Cell Ranger https://www.10xgenomics.com/support/software/cell-ranger/latest
Bowtie2 Bowtie2 https://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Experimental model and subject details

HEK293T/17 and NIH3T3 cells were purchased from the Duke Cell Culture Facility (originally sourced from ATCC).

Method details

Cell lines and culture conditions

All cells were grown at 37°C. HEK293T/17 cells were cultured in DMEM + 10% FBS and NIH3T3 cells were cultured in DMEM + 10% CBS.

gRNA library cloning

Non-targeting gRNA library #1 and #2 were designed with 100 non-overlapping, non-targeting gRNAs each. All oligonucleotide libraries (Tables S1, S2) were ordered in the following sequence format:

ATATATCTTGTGGAAAGGACGAAACACCG [20-bp protospacer] GTTTAAGAGCTATGCTGGAAACAGCATAG.

Libraries were amplified by PCR with NEBNext Ultra II Q5 Master Mix (NEB) and the following primers:

gRNA_60bp_fw TAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAACACCG

gRNA_60bp_rv GTTGATAACGGACTAGCCTTATTTAAACTTGCTATGCTGTTTCCAGCATAGCTCTTAAAC

gRNA libraries were cloned into either a CROP-seq or modified direct capture perturb-seq vector (Addgene plasmid #230934, derived from Addgene plasmid #140095 by replacing the mU6 promoter with an hU6 promoter and modifying a single base-pair in the gRNA hairpin) through BsmBI vector digest and NEBuilder HiFi DNA assembly, ensuring >100-fold representation of each gRNA.
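The oligo format and amplification primers given above can be checked programmatically. The sketch below assembles a library oligo from a 20-bp protospacer and confirms that the 3′ ends of the two 60-bp primers match the constant flanks of the oligo. The flank and primer sequences are copied from the text; the example protospacer is hypothetical, and the predicted amplicon assumes that each primer anneals across the full constant flank, which is an inference from the sequences rather than a statement from the protocol.

```python
FLANK_5 = "ATATATCTTGTGGAAAGGACGAAACACCG"
FLANK_3 = "GTTTAAGAGCTATGCTGGAAACAGCATAG"
FWD_PRIMER = "TAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAACACCG"
REV_PRIMER = "GTTGATAACGGACTAGCCTTATTTAAACTTGCTATGCTGTTTCCAGCATAGCTCTTAAAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def library_oligo(protospacer):
    """Oligo in the ordered format: 5' flank + 20-bp protospacer + 3' flank."""
    protospacer = protospacer.upper()
    if len(protospacer) != 20 or set(protospacer) - set("ACGT"):
        raise ValueError("protospacer must be 20 bases of A, C, G, or T")
    return FLANK_5 + protospacer + FLANK_3


def predicted_amplicon(protospacer):
    """PCR product, assuming the primers anneal across the full flanks."""
    fwd_tail = FWD_PRIMER[: len(FWD_PRIMER) - len(FLANK_5)]
    rev_tail = reverse_complement(REV_PRIMER)[len(FLANK_3):]
    return fwd_tail + library_oligo(protospacer) + rev_tail


example = "ACGT" * 5  # hypothetical protospacer, not from Tables S1 or S2
print(FWD_PRIMER.endswith(FLANK_5))
print(reverse_complement(REV_PRIMER).startswith(FLANK_3))
print(len(library_oligo(example)), len(predicted_amplicon(example)))
```

Under this assumption, each primer contributes a tail beyond the oligo flank, which would provide the vector homology used in the assembly step.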

qPCR gRNA library titration

HEK293T dCas9KRAB cells were seeded at a density of 5 × 10⁴ cells/cm² and NIH3T3 dCas9KRAB cells were seeded at a density of 1.25 × 10⁴ cells/cm² on a 24-well plate in one biological replicate per lentiviral transduction. The cells were transduced with varying volumes of lentivirus in the presence of 8 μg/mL polybrene. Ten days post-transduction, cells were washed three times and gDNA from each sample was isolated using an Invitrogen PureLink Genomic DNA Mini Kit. MOI was determined using a qPCR titration approach described in Gordon et al. (2020), with the primers listed in Table S6 and the following reaction and cycling conditions.

Reaction mix (15 μL total volume):

  • 25 ng template DNA

  • 1X OneTaq 2X Master Mix with Standard Buffer

  • 0.5 μM Fw primer

  • 0.5 μM Rv primer

  • 1X EvaGreen Dye

  • dH2O to total volume of 15 μL

Cycling conditions:

  • 98°C for 30 s

  • 35 cycles of 98°C for 10 s, 54°C for 30 s, and 68°C for 60 s

  • 68°C for 5 min

  • 4°C hold

gRNA library transduction

HEK293T dCas9KRAB cells were seeded at a density of 5 × 10⁴ cells/cm² and NIH3T3 dCas9KRAB cells were seeded at a density of 1.25 × 10⁴ cells/cm² on 6-well plates in one biological replicate each. The cells were transduced with lentivirus using 8 μg/mL polybrene at a multiplicity of infection (MOI) of ∼10 as determined by titration. Two days post-transduction, cells were treated with either puromycin at 500 ng/mL (HEK293T dCas9KRAB + non-targeting library #1 cells) or 1000 ng/mL (NIH3T3 dCas9KRAB + non-targeting library #2 cells), or blasticidin at 20 μg/mL (HEK293T dCas9KRAB + non-targeting library #1 cells) or 80 μg/mL (NIH3T3 dCas9KRAB + non-targeting library #2 cells), and were selected for 10 days.

Seven days post-transduction, cells were trypsinized and seeded on 6-well plates in three conditions:

  • 1) HEK293T dCas9KRAB + non-targeting library #1 cells at a density of 3.9 × 10⁴ cells/cm²

  • 2) NIH3T3 dCas9KRAB + non-targeting library #2 cells at a density of 1.5 × 10⁴ cells/cm²

  • 3) HEK293T dCas9KRAB + non-targeting library #1 cells at a density of 2.0 × 10⁴ cells/cm² and NIH3T3 dCas9KRAB + non-targeting library #2 cells at a density of 2.0 × 10⁴ cells/cm²

CRISPR barnyard single-cell RNA-seq

Ten days post-transduction, cells were washed three times, trypsinized, and strained through a 40 μm cell strainer. The cells were diluted to 1K cells/μL, and a fourth condition was generated by mixing HEK293T dCas9KRAB + non-targeting library #1 cells and NIH3T3 dCas9KRAB + non-targeting library #2 cells. Eight lanes were loaded for single-cell transcriptome profiling, with one lane per condition for each of the CROP-seq and modified direct capture perturb-seq vectors. Approximately 10,000 cells were captured per lane of a 10x Chromium chip (Next GEM Chip G) using Chromium Next GEM Single Cell 3ʹ HT Reagent Kits v3.1 with Feature Barcoding technology for CRISPR Screening (10x Genomics, Inc, Document number CG000418, Rev D). CROP-seq protospacer sequences were amplified from barcoded cDNA as described previously.10

CRISPR barnyard single-cell RNA-seq library sequencing

Final libraries were pooled and sequenced on a NovaSeq S4 flow cell (R1: 28, I1: 10, I2: 10, R2: 90), aiming for ∼15,000 reads per cell for gene expression libraries and ∼5,000 reads per cell for gRNA libraries.

Transcriptome data processing and cell filtering for CRISPR barnyard screens

Each lane of cells was processed using cellranger (version 6.0.1) count using default parameters and mapping to the GRCh38-and-mm10-2020 reference genome from 10x Genomics. Using Seurat, cells with less than 15% mitochondrial reads, between 1,500 and 6,000 features, and between 3,500 and 20,000 UMIs were retained as high-quality cells. Cells with >90% human transcripts were labeled as HEK293T dCas9KRAB cells and cells with >90% mouse transcripts were labeled as NIH3T3 dCas9KRAB cells. The resulting count matrices for gene expression and CRISPR feature libraries after this filtering were used for all downstream analyses.
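The filtering and species-labeling rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Seurat code used in the study: the `CellQC` record, its field names, and the "mixed" and "unassigned" labels are hypothetical, and the feature and UMI ranges are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Hypothetical per-cell summary; field names are illustrative."""
    barcode: str
    pct_mito: float    # percent of reads that are mitochondrial
    n_features: int    # number of detected genes
    n_umis: int        # total transcriptome UMIs
    human_umis: int    # UMIs mapped to GRCh38 genes
    mouse_umis: int    # UMIs mapped to mm10 genes


def passes_qc(cell: CellQC) -> bool:
    # Thresholds from the text; inclusive bounds are an assumption.
    return (cell.pct_mito < 15.0
            and 1500 <= cell.n_features <= 6000
            and 3500 <= cell.n_umis <= 20000)


def species_label(cell: CellQC, min_fraction: float = 0.90) -> str:
    # Label a cell by the fraction of its transcripts from each species.
    total = cell.human_umis + cell.mouse_umis
    if total == 0:
        return "unassigned"
    if cell.human_umis / total > min_fraction:
        return "HEK293T"
    if cell.mouse_umis / total > min_fraction:
        return "NIH3T3"
    return "mixed"  # neither species exceeds 90% of transcripts


cells = [
    CellQC("A", 5.0, 3000, 10000, 9800, 200),
    CellQC("B", 5.0, 3000, 10000, 300, 9700),
    CellQC("C", 5.0, 3000, 10000, 6000, 4000),
    CellQC("D", 20.0, 3000, 10000, 9800, 200),  # fails the mitochondrial filter
]
labels = {c.barcode: species_label(c) for c in cells if passes_qc(c)}
```

Cells that pass QC but fall below 90% for both species carry transcripts from both genomes and would not be labeled as either cell line under this rule.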

Filtering ambient gRNAs in the CRISPR barnyard screen

A UMI threshold of ≥5 UMI and ≥0.5% of total gRNA UMIs in the cell was used for the 5 UMI cutoff. The lane-level gRNA thresholds produced by Cellranger count were used as minimum UMI values to assign gRNAs to cells for the 10x Mixture Model method. CLEANSER posterior probability cutoffs of ≥0.8 and ≥0.5 were used as thresholds for CROP-seq and direct capture CRISPR libraries, respectively.
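As a minimal sketch of the fixed UMI cutoff described above, the function below assigns a gRNA to a cell only when it meets both the absolute and the fractional threshold. The function name and the dictionary input format are assumptions made for illustration.

```python
def assign_by_umi_cutoff(grna_umis, min_umis=5, min_fraction=0.005):
    """Return the gRNAs assigned to one cell under a fixed UMI cutoff.

    grna_umis: dict mapping gRNA name -> UMI count in this cell
               (hypothetical input format).
    A gRNA is kept when it has >= min_umis UMIs AND accounts for
    >= min_fraction of all gRNA UMIs in the cell.
    """
    total = sum(grna_umis.values())
    if total == 0:
        return []
    return sorted(
        name for name, n in grna_umis.items()
        if n >= min_umis and n / total >= min_fraction
    )


# g3 fails the absolute threshold (2 < 5 UMIs).
cell_1 = assign_by_umi_cutoff({"g1": 40, "g2": 5, "g3": 2})
# g2 passes the absolute threshold but is < 0.5% of the cell's gRNA UMIs.
cell_2 = assign_by_umi_cutoff({"g1": 2000, "g2": 6})
```

The fractional criterion matters mainly in cells with very high gRNA UMI totals, where a handful of ambient UMIs can exceed the absolute cutoff.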

Genomic DNA isolation and NGS

Genomic DNA was isolated from cells using the Purelink Genomic DNA mini kit (Thermo Fisher), and up to 20 μg of genomic DNA per sample was used to amplify the U6-3′ to gRNA hairpin region. PCR2 was performed to add full-length Illumina sequencing adapters using internally ordered primers with equivalent sequences to NEBNext Index Primer Sets 1 and 2 (New England Biolabs). All PCRs were performed using Q5UltraII polymerase (NEB). Pooled samples were sequenced using MiSeq (Illumina), using 50-nt reads and collecting greater than 100 reads per gRNA in the library.

The library prep primers were as follows:

PCR1:

U6_BcA_r1seq_halftail

5′ ACTCTTTCCCTACACGACGCTCTTCCGATCTACTAGGGAAAGGACGAAACACCG 3′

gRNAFE_r2seq_halftail

5′ GACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCCTTATTTAAACTTGCTATGCTGT 3′

PCR2:

r1seq_fulltail

5′ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTC 3′

r2seq_fulltail (two distinct indexed versions of this primer were used to allow for pooling)

5′ CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAGACGTGTGCT 3′

Pol II and III transcript abundance RT-qPCR

An oligonucleotide including the nt-73 protospacer sequence was ordered in the following format: GGAAAGGACGAAACACCG CGTGCGACTCTTTCGGTGGA GTTTAAGAGCTATGCTGGAAAC. The nt-73 oligonucleotide was directly cloned into the CROP-seq backbone through NEBuilder HiFi DNA assembly. The resulting gRNA construct was packaged into lentivirus and transduced into HEK293T dCas9-KRAB cells seeded at a density of 2.86 × 10⁴ cells/cm² on a 12-well plate in three biological replicates in the presence of 8 μg/mL polybrene. The cells were selected with Blasticidin S (5 μg/mL) on days 2–7. Seven days post-transduction, RNA was harvested from the cells using Qiagen RNeasy Plus Mini kit (Qiagen, 74134) and DNase treated using RQ1 RNase free DNase (Promega, M6101). cDNA was generated using ProtoScript First Strand cDNA Synthesis Kit (NEB, E6300S) and the following RT primer.

gRNA_hairpin_RV: CGACTCGGTGCCACTTTTTCAAG

RT-qPCR was performed using SensiMix SYBR Master Mix (OriGene, QP100001) using the following primers and cycling conditions:

Primer pair 1:

hU6_promoter_FW: CTTGTGGAAAGGACGAAACACCG

gRNA_hairpin_RV: CGACTCGGTGCCACTTTTTCAAG

Primer pair 2:

sgNT-73_gRNA_FW: CGGTGGAGTTTAAGAGCTATGCTG

gRNA_hairpin_RV: CGACTCGGTGCCACTTTTTCAAG

Reaction mix:

  • 1 μL template cDNA

  • 1X SensiMix SYBR Master Mix (from 2X stock)

  • 0.5 μM Fw primer

  • 0.5 μM Rv primer

  • dH2O to a total volume of 15 μL

Cycling conditions:

  • 95°C for 10 min

  • 35 cycles of 95°C for 15 s, 60°C for 15 s, and 72°C for 15 s

  • 72°C for 5 min

  • 4°C hold

The results are expressed as fold-increase in pol III gRNA expression normalized to pol II mRNA expression by the ΔΔCt method.
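For readers unfamiliar with the ΔΔCt calculation, the sketch below shows the standard arithmetic. It assumes roughly 100% amplification efficiency (a doubling per cycle), and the function and argument names are illustrative rather than taken from the authors' analysis.

```python
def delta_delta_ct_fold_change(ct_target_sample, ct_ref_sample,
                               ct_target_control, ct_ref_control):
    """Fold change by the standard ddCt method.

    Assumes ~100% PCR efficiency, so each cycle of difference
    corresponds to a two-fold difference in template abundance.
    """
    delta_ct_sample = ct_target_sample - ct_ref_sample      # dCt in the sample
    delta_ct_control = ct_target_control - ct_ref_control   # dCt in the control
    delta_delta_ct = delta_ct_sample - delta_ct_control     # ddCt
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target crosses threshold 4 cycles earlier
# (relative to the reference) in the sample than in the control.
fold = delta_delta_ct_fold_change(20.0, 18.0, 24.0, 18.0)
```

In this made-up example ΔΔCt = (20 − 18) − (24 − 18) = −4, giving a 2⁴ = 16-fold increase.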

Statistical analysis: CLEANSER

We built a mixture model in which the two components represent the ambient gRNA noise and the native gRNA signal. For csCLEANSER, the native distribution is a negative binomial distribution and the ambient distribution is a Poisson distribution. csCLEANSER can be formally specified via the likelihood below, where $X_i$ is the gRNA count for cell $i$ and $r$ (prior: beta(1, 10)) is the mixing weight, i.e., the proportion of transduced assignments to the negative binomial distribution (NB) relative to ambient assignments to the Poisson distribution (Poi). $\mu$ (prior: lognormal(log(100), log(100))) and $\phi$ (prior: beta(1, 10)) denote the mean and dispersion parameters, respectively, of the transduced negative binomial, and $\lambda$ (prior: lognormal(log(1.1), log(1.1))) denotes the ambient Poisson parameter:

$$X_i \sim r\,\mathrm{NB}(\mu, \phi) + (1 - r)\,\mathrm{Poi}(\lambda)$$

In dcCLEANSER, the ambient and native distributions are two separate negative binomials. dcCLEANSER can be formally specified via the likelihood below, where $\mu_n$ (prior: lognormal(log(1.1), log(1.1))) and $\phi_n$ (prior: lognormal(log(1.1), log(1.1))) denote the mean and dispersion parameters, respectively, of the ambient negative binomial distribution:

$$X_i \sim r\,\mathrm{NB}(\mu, \phi) + (1 - r)\,\mathrm{NB}(\mu_n, \phi_n)$$

The probability that $G_i = 1$, i.e., that the gRNA-cell pair belongs to the native probability distribution, can be expressed in terms of $X_i$, $\lambda$, $\mu$, $\phi$, and $r$ after $G$ has been summed out.

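$$
\begin{aligned}
P(G_i = 1 \mid X_i, \lambda, \mu, \phi, r) &= \frac{P(G_i = 1 \mid r)\,P(X_i \mid G_i = 1, \mu, \phi)}{P(G_i = 0 \mid r)\,P(X_i \mid G_i = 0, \lambda) + P(G_i = 1 \mid r)\,P(X_i \mid G_i = 1, \mu, \phi)} \\
&= \frac{r\,\mathrm{NB}(X_i \mid \mu, \phi)}{(1 - r)\,\mathrm{Poi}(X_i \mid \lambda) + r\,\mathrm{NB}(X_i \mid \mu, \phi)}
\end{aligned}
$$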
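To make the posterior concrete, the following sketch evaluates it for fixed parameter values in the csCLEANSER setting (negative binomial native component, Poisson ambient component). It is not the CLEANSER implementation, which samples the parameters with Stan. The negative binomial is assumed here to use a mean/dispersion parameterization with variance μ + μ²/ϕ (as in Stan's `neg_binomial_2`), and the numeric values are arbitrary illustrations, not fitted estimates.

```python
import math


def nb_logpmf(x, mu, phi):
    # Negative binomial log-pmf with mean mu and dispersion phi
    # (assumed parameterization: variance = mu + mu**2 / phi).
    return (math.lgamma(x + phi) - math.lgamma(phi) - math.lgamma(x + 1)
            + phi * math.log(phi / (phi + mu))
            + x * math.log(mu / (phi + mu)))


def poisson_logpmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def native_posterior(x, r, mu, phi, lam):
    """P(G = 1 | x): probability that a count x came from the native component."""
    log_native = math.log(r) + nb_logpmf(x, mu, phi)
    log_ambient = math.log(1.0 - r) + poisson_logpmf(x, lam)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_native, log_ambient)
    native = math.exp(log_native - m)
    ambient = math.exp(log_ambient - m)
    return native / (native + ambient)


# Illustrative parameter values only.
params = dict(r=0.1, mu=100.0, phi=1.0, lam=1.1)
p_low = native_posterior(1, **params)    # a single UMI: most likely ambient
p_mid = native_posterior(5, **params)
p_high = native_posterior(50, **params)  # many UMIs: most likely native
```

With these values, the posterior rises with the UMI count, so a probability cutoff behaves like a gRNA-specific, data-driven UMI threshold.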

Due to the large number of 0 UMI counts in perturb-seq datasets, the likelihood is conditioned on the UMI count for a gRNA-cell pair being larger than 0. Adding this condition allowed CLEANSER to process perturb-seq datasets in a time-efficient manner. Below is the condition added to the CROP-seq CLEANSER formula.

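$$P(x \mid x > 0) = \frac{(1 - r)\,\mathrm{Poi}(x \mid \lambda) + r\,\mathrm{NB}(x \mid \mu, \phi)}{1 - \big((1 - r)\,\mathrm{Poi}(0 \mid \lambda) + r\,\mathrm{NB}(0 \mid \mu, \phi)\big)}$$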
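The zero-truncated mixture can be checked numerically: after dividing by the probability of a nonzero count, the probabilities over x ≥ 1 should again sum to one. The sketch below does this under the same assumed mean/dispersion parameterization of the negative binomial (variance μ + μ²/ϕ); the parameter values are arbitrary.

```python
import math


def nb_pmf(x, mu, phi):
    # Negative binomial pmf, mean mu and dispersion phi (assumed parameterization).
    log_p = (math.lgamma(x + phi) - math.lgamma(phi) - math.lgamma(x + 1)
             + phi * math.log(phi / (phi + mu))
             + x * math.log(mu / (phi + mu)))
    return math.exp(log_p)


def poisson_pmf(x, lam):
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def mixture_pmf(x, r, mu, phi, lam):
    # Untruncated two-component mixture: ambient Poisson + native NB.
    return (1.0 - r) * poisson_pmf(x, lam) + r * nb_pmf(x, mu, phi)


def zero_truncated_mixture_pmf(x, r, mu, phi, lam):
    # P(x | x > 0): renormalize by the probability of a nonzero count.
    if x < 1:
        raise ValueError("the truncated pmf is defined only for x >= 1")
    p_zero = mixture_pmf(0, r, mu, phi, lam)
    return mixture_pmf(x, r, mu, phi, lam) / (1.0 - p_zero)


params = dict(r=0.1, mu=20.0, phi=2.0, lam=1.1)  # illustrative values only
total = sum(zero_truncated_mixture_pmf(x, **params) for x in range(1, 2001))
```

Because only nonzero gRNA-cell pairs enter the likelihood, the large majority of entries in a sparse gRNA count matrix never need to be evaluated.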

A cell-level normalization component ($L_i$) is calculated for each cell by dividing the sum of all gRNA UMI counts less than or equal to a threshold (default threshold of 2) by the average of this sum across all cells. That normalization factor is then used to calculate cell-specific distribution parameters.

CROP-seq CLEANSER:

$$\lambda_{ij} = \lambda_j L_i$$
$$\mu_{ij} = \mu_j L_i$$

Direct capture CLEANSER:

$$\mu_{ij}^{\mathrm{signal}} = \mu_j^{\mathrm{signal}} L_i$$
$$\mu_{ij}^{\mathrm{noise}} = \mu_j^{\mathrm{noise}} L_i$$
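The cell-level normalization factor and its use in scaling a gRNA-level parameter can be illustrated as follows. This is a sketch of one reading of the description above (counts at or below the threshold are summed per cell, then divided by the mean of those sums); the function names and the list-of-lists input are hypothetical.

```python
def cell_normalization_factors(counts_per_cell, threshold=2):
    """Compute L_i for every cell.

    counts_per_cell: list of lists; entry i holds the gRNA UMI counts
                     observed in cell i (hypothetical input format).
    Only counts <= threshold contribute, so L_i reflects the low-count
    (presumably ambient) gRNA load of each cell.
    """
    low_sums = [sum(c for c in cell if c <= threshold) for cell in counts_per_cell]
    mean_low = sum(low_sums) / len(low_sums)
    if mean_low == 0:
        raise ValueError("no low-count gRNA UMIs; normalization is undefined")
    return [s / mean_low for s in low_sums]


def cell_specific_parameter(grna_level_value, l_i):
    # e.g. lambda_ij = lambda_j * L_i or mu_ij = mu_j * L_i
    return grna_level_value * l_i


counts = [
    [1, 2, 50, 0],   # low-count sum = 3
    [2, 2, 2, 30],   # low-count sum = 6
    [0, 1, 0, 0],    # low-count sum = 1
]
factors = cell_normalization_factors(counts)
lambda_ij = [cell_specific_parameter(1.1, l) for l in factors]
```

By construction the factors average to one, so the gRNA-level parameters keep their scale while each cell's ambient load shifts its own distribution up or down.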

The model is written in CmdStan27 and runs a predetermined number of steps of Markov Chain Monte Carlo sampling to estimate the posterior distribution of Gij. The model generates 300 samples per gRNA, as well as a posterior for each gRNA-cell pair. The posterior generated for each gRNA-cell pair will be the final model output of the probability of gRNA assignment to a cell.

CLEANSER can be accessed through GitHub: https://github.com/Gersbachlab-Bioinformatics/CLEANSER (https://doi.org/10.5281/zenodo.12744932).

| Feature | CLEANSER | 10x Mixture Model |
| Ambient gRNA distribution | Negative binomial distribution | Gaussian distribution |
| Native gRNA distribution | Negative binomial distribution | Gaussian distribution |
| Cell-specific bias | Normalization parameter | N/A |
| Vector bias: direct capture vs. CROP-seq | dcCLEANSER and csCLEANSER | N/A |
| Output | Probability of assignment for cell-gRNA pair, QC metrics | UMI threshold per gRNA |
| Input/Use | Standalone pipeline. Gene expression and gRNA libraries can be processed and subset using preferred pipeline prior to gRNA assignment | Used in conjunction with Cell Ranger. gRNA and gene expression libraries processed in Cell Ranger |

K562 CRISPRi CROP-seq and CD8+CCR7+ T cell CRISPRi direct capture CLEANSER benchmarking analysis

For this benchmark, we reprocessed and analyzed the publicly available Gasperini et al. 2019 and McCutcheon et al. 2023 datasets. Transcriptomic and gRNA BAM files were downloaded from GEO (GSE120861, GSE218988) and converted back to FASTQ files with the `bamtofastq` program included in 10x Genomics Cell Ranger 7.1.0 (referred to as cellranger from here on). Next, count output files for each pool were obtained with the cellranger `count` command, providing the reference list of gRNA information and protospacer sequences through the `--feature-ref` argument. The outputs for each pool were merged with the cellranger `aggr` command, without normalizing the counts (`--normalize none` argument). A basic QC was applied to the resulting sparse matrix containing GEX and gRNA UMI counts. Cells with a high fraction of mitochondrial gene UMI counts (≥20%), or with a number of detected genes or total transcriptomic UMIs ≥2 median absolute deviations (MAD), were excluded from downstream analyses. To assign gRNAs to cells using a strict UMI cutoff, we used a UMI threshold of ≥5 UMI and ≥1% of total gRNA UMIs in the cell in Gasperini et al. 2019 and a UMI threshold of ≥4 UMI in McCutcheon et al. 2023. For the 10x Mixture Model, we used the UMI thresholds generated by cellranger count for each lane. For CLEANSER, a posterior probability cutoff of ≥0.8 was used as a threshold in Gasperini et al. 2019 and a posterior probability cutoff of ≥0.08 was used in McCutcheon et al. 2023. For each targeting gRNA, genes within 1 kb of the protospacer midpoint were tested for differential expression, comparing the gene counts across cells with a given gRNA against cells with any other gRNA. A negative binomial generalized linear model was applied to these counts to detect significant gRNA-gene associations.
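The MAD-based part of the QC can be sketched as below. The text does not spell out how the 2 MAD criterion was applied, so this example assumes one common reading: a cell is flagged when a metric deviates from the median by at least 2 MAD in either direction. The function name and the handling of a zero MAD are likewise assumptions.

```python
from statistics import median


def mad_outlier_flags(values, n_mads=2.0):
    """Flag values whose absolute deviation from the median is >= n_mads * MAD.

    Assumption: deviations are measured from the median in both directions.
    """
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # Degenerate case (more than half the values are identical):
        # flag nothing rather than flagging every cell.
        return [False] * len(values)
    return [d >= n_mads * mad for d in deviations]


# Hypothetical per-cell counts of detected genes (scaled down for readability).
genes_detected = [10, 11, 12, 13, 14, 15, 100]
flags = mad_outlier_flags(genes_detected)
kept = [v for v, is_outlier in zip(genes_detected, flags) if not is_outlier]
```

Here the median is 13 and the MAD is 2, so only the cell with 100 detected genes lies 4 or more units from the median and is removed.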

K562 CRISPRa direct capture perturb-seq CLEANSER benchmarking analysis

Cellranger count output files and differential expression testing pipelines were obtained at https://krishna.gs.washington.edu/content/members/CRISPRa_QTL_website/public/. Using Seurat, cells with greater than 10% mitochondrial reads and less than 4,000 UMIs were filtered out. To assign gRNAs to cells using a strict UMI cutoff, a global UMI filter of >5 gRNA UMIs/cell was used. For the 10x Mixture Model, we used the UMI thresholds generated by Cellranger count for each lane. For CLEANSER, a posterior probability cutoff of ≥0.5 was used as a threshold. Differential expression tests were run for each gRNA-gene pair using a modified version of the pipeline described in Chardon and McDiarmid et al. (2023).20 This version used all other cells without a gRNA targeting the same gene as control.

gRNA validation RT-qPCR

Twelve oligonucleotides were ordered in the following format: GGAAAGGACGAAACACCG [protospacer] GTTTAAGAGCTATGCTGGAAAC (Table S7). All oligonucleotides were directly cloned into a lentiviral hU6-gRNA BFP-P2A-PuroR backbone (Addgene plasmid #230935) through NEBuilder HiFi DNA assembly. The resulting gRNA constructs were packaged into lentivirus and transduced into K562 dCas92XVP64−BlastR cells seeded at a density of 6.5 × 10⁴ cells/cm² on a 24-well plate in four to eight biological replicates each in media supplemented with 10 μg/mL blasticidin and 8 μg/mL polybrene. The following day, the media was refreshed to contain 10 μg/mL blasticidin. Two days post-transduction, cells were treated with 10 μg/mL blasticidin and 0.5 μg/mL puromycin for eight days. Ten days post-transduction, RNA was harvested from the cells using Qiagen RNeasy Plus Mini kit and DNase treated using RQ1 RNase free DNase. cDNA was generated using ProtoScript First Strand cDNA Synthesis Kit.

qPCRs were conducted using the following TaqMan probes and cycling conditions:

  • TaqMan GEA (FAM) Assay ID: Hs00360405_g1 Gene Symbol: CUTA

  • TaqMan GEA (FAM) Assay ID: Hs00185435_m1 Gene Symbol: HLA-DMA

  • TaqMan GEA (FAM) Assay ID: Hs00171876_m1 Gene Symbol: DNMT3B

  • TaqMan GEA (FAM) Assay ID: Hs02596927_g1 Gene Symbol: RPL7

  • TaqMan GEA (FAM) Assay ID: Hs00908900_m1 Gene Symbol: FOXP1

  • TaqMan GEA (FAM) Assay ID: Hs00162613_m1 Gene Symbol: TCF4

  • TaqMan GEA (FAM) Assay ID: Hs00381656_m1 Gene Symbol: EIF4E3

  • TaqMan GEA (FAM) Assay ID: Hs00381370_m1 Gene Symbol: RNF182

  • TaqMan GEA (FAM) Assay ID: Hs02576472_g1 Gene Symbol: EIF1AX

  • TaqMan GEA (VIC-MGB_PL) Assay ID: Hs99999905_m1 Gene Symbol: GAPDH

Reaction mix:

  • 1 μL 20X TaqMan GEA (20X FAM dye-labeled)

  • 1 μL 20X TaqMan GEA (20X VIC dye-labeled, primer-limited)

  • 10 μL TaqMan Gene Expression Master Mix (2X)

  • 4 μL cDNA template

  • 4 μL dH2O

RT-qPCRs were run on the Applied Biosystems 7500 Fast Real-Time PCR System in 96-well plates according to the following settings:

  • 50°C for 2 min

  • 95°C for 10 min

  • 45 cycles of 95°C for 15 s and 60°C for 1 min

The results are expressed as fold-changes in gene expression after calculating the relative expression level compared to the housekeeping gene GAPDH using the ΔΔCt method.

Quantification and statistical analysis

The number of replicates can be found in the figure legends or in the method details. All measurements were taken from distinct samples. All figures show the mean with standard error bars unless specified otherwise. For case-control comparisons, two-tailed t-tests and Mann-Whitney U-tests were performed to compare treatment and control groups as indicated in the figure legends. P-values are represented by asterisks (∗p ≤ 0.05, ∗∗p ≤ 0.01, ∗∗∗p ≤ 0.001). Statistical analysis and visualization were carried out in R version 4.2.2.

Published: February 5, 2025

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.xgen.2025.100766.

Contributor Information

Andrew S. Allen, Email: asallen@duke.edu.

Gregory E. Crawford, Email: greg.crawford@duke.edu.

William H. Majoros, Email: william.majoros@duke.edu.

Charles A. Gersbach, Email: charles.gersbach@duke.edu.

Supplemental information

Document S1. Figures S1–S5
mmc1.pdf (2.8MB, pdf)
Table S1. Non-targeting gRNA library #1 gRNA information, related to Figure 1
mmc2.xlsx (7.1KB, xlsx)
Table S2. Non-targeting gRNA library #2 gRNA information, related to Figure 1
mmc3.xlsx (7.1KB, xlsx)
Table S3. K562 CRISPRi CROP-seq benchmarking positive control gRNA-gene pairs, related to Figure 5
mmc4.xlsx (91KB, xlsx)
Table S4. CD8+CCR7+ T cell CRISPRi direct-capture perturb-seq benchmarking positive-control gRNA-gene pairs, related to Figure 5
mmc5.xlsx (7.1KB, xlsx)
Table S5. K562 CRISPRa direct-capture perturb-seq benchmarking significant gRNA-gene pairs, related to Figure 5
mmc6.xlsx (8.8KB, xlsx)
Table S6. qPCR primer sequences used for lentiviral transduction titration, related to STAR Methods
mmc7.xlsx (5KB, xlsx)
Table S7. gRNA protospacer sequences used in the K562 CRISPRa benchmarking RT-qPCR, related to Figure 5
mmc8.xlsx (5.4KB, xlsx)
Document S2. Article plus supplemental information
mmc9.pdf (10.3MB, pdf)

References

  • 1. Dixit A., Parnas O., Li B., Chen J., Fulco C.P., Jerby-Arnon L., Marjanovic N.D., Dionne D., Burks T., Raychowdhury R., et al. Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens. Cell. 2016;167:1853–1866. doi: 10.1016/j.cell.2016.11.038.
  • 2. Adamson B., Norman T.M., Jost M., Cho M.Y., Nuñez J.K., Chen Y., Villalta J.E., Gilbert L.A., Horlbeck M.A., Hein M.Y., et al. A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response. Cell. 2016;167:1867–1882. doi: 10.1016/j.cell.2016.11.048.
  • 3. Replogle J.M., Saunders R.A., Pogson A.N., Hussmann J.A., Lenail A., Guna A., Mascibroda L., Wagner E.J., Adelman K., Lithwick-Yanai G., et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell. 2022;185:2559–2575. doi: 10.1016/j.cell.2022.05.013.
  • 4. Jin X., Simmons S.K., Guo A., Shetty A.S., Ko M., Nguyen L., Jokhi V., Robinson E., Oyler P., Curry N., et al. In vivo Perturb-Seq reveals neuronal and glial abnormalities associated with autism risk genes. Science. 2020;370. doi: 10.1126/science.aaz6063.
  • 5. Yao D., Binan L., Bezney J., Simonton B., Freedman J., Frangieh C.J., Dey K., Geiger-Schuller K., Eraslan B., Gusev A., et al. Scalable genetic screening for regulatory circuits using compressed Perturb-seq. Nat. Biotechnol. 2024;42:1282–1295. doi: 10.1038/s41587-023-01964-9.
  • 6. Zhou P., Shi H., Huang H., Sun X., Yuan S., Chapman N.M., Connelly J.P., Lim S.A., Saravia J., Kc A., et al. Single-cell CRISPR screens in vivo map T cell fate regulomes in cancer. Nature. 2023;624:154–163. doi: 10.1038/s41586-023-06733-x.
  • 7. Heumos L., Schaar A.C., Lance C., Litinetskaya A., Drost F., Zappia L., Lücken M.D., Strobl D.C., Henao J., Curion F., et al. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 2023;24:550–572. doi: 10.1038/s41576-023-00586-w.
  • 8. Xie S., Duan J., Li B., Zhou P., Hon G.C. Multiplexed Engineering and Analysis of Combinatorial Enhancer Activity in Single Cells. Mol. Cell. 2017;66:285–299. doi: 10.1016/j.molcel.2017.03.007.
  • 9. Xie S., Armendariz D., Zhou P., Duan J., Hon G.C. Global Analysis of Enhancer Targets Reveals Convergent Enhancer-Driven Regulatory Modules. Cell Rep. 2019;29:2570–2578. doi: 10.1016/j.celrep.2019.10.073.
  • 10. Gasperini M., Hill A.J., McFaline-Figueroa J.L., Martin B., Kim S., Zhang M.D., Jackson D., Leith A., Schreiber J., Noble W.S., et al. A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens. Cell. 2019;176:377–390. doi: 10.1016/j.cell.2018.11.029.
  • 11. Datlinger P., Rendeiro A.F., Schmidl C., Krausgruber T., Traxler P., Klughammer J., Schuster L.C., Kuchler A., Alpar D., Bock C. Pooled CRISPR screening with single-cell transcriptome readout. Nat. Methods. 2017;14:297–301. doi: 10.1038/nmeth.4177.
  • 12. Replogle J.M., Norman T.M., Xu A., Hussmann J.A., Chen J., Cogan J.Z., Meer E.J., Terry J.M., Riordan D.P., Srinivas N., et al. Combinatorial single-cell CRISPR screens by direct guide RNA capture and targeted sequencing. Nat. Biotechnol. 2020;38:954–961. doi: 10.1038/s41587-020-0470-y.
  • 13. Young M.D., Behjati S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience. 2020;9. doi: 10.1093/gigascience/giaa151.
  • 14. Yang S., Corbett S.E., Koga Y., Wang Z., Johnson W.E., Yajima M., Campbell J.D. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol. 2020;21:57. doi: 10.1186/s13059-020-1950-6.
  • 15. Tian L., Dong X., Freytag S., Lê Cao K.-A., Su S., JalalAbadi A., Amann-Zalcenstein D., Weber T.S., Seidi A., Jabbari J.S., et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods. 2019;16:479–487. doi: 10.1038/s41592-019-0425-8.
  • 16. Costello M., Fleharty M., Abreu J., Farjoun Y., Ferriera S., Holmes L., Granger B., Green L., Howd T., Mason T., et al. Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms. BMC Genom. 2018;19:332. doi: 10.1186/s12864-018-4703-0.
  • 17. de Jong O.G., Murphy D.E., Mäger I., Willms E., Garcia-Guerra A., Gitz-Francois J.J., Lefferts J., Gupta D., Steenbeek S.C., van Rheenen J., et al. A CRISPR-Cas9-based reporter system for single-cell detection of extracellular vesicle-mediated functional transfer of RNA. Nat. Commun. 2020;11:1113. doi: 10.1038/s41467-020-14977-8.
  • 18. Griffiths J.A., Richard A.C., Bach K., Lun A.T.L., Marioni J.C. Detection and removal of barcode swapping in single-cell RNA-seq data. Nat. Commun. 2018;9:2667. doi: 10.1038/s41467-018-05083-x.
  • 19. Yang L., Zhu Y., Yu H., Cheng X., Chen S., Chu Y., Huang H., Zhang J., Li W. scMAGeCK links genotypes with multiple phenotypes in single-cell CRISPR screens. Genome Biol. 2020;21:19. doi: 10.1186/s13059-020-1928-4.
  • 20. Chardon F.M., McDiarmid T.A., Page N.F., Daza R.M., Martin B., Domcke S., Regalado S.G., Lalanne J.B., Calderon D., Li X., et al. Multiplex, single-cell CRISPRa screening for cell type specific regulatory elements. Preprint at bioRxiv. 2024. doi: 10.1101/2023.03.28.534017.
  • 21. Barry T., Wang X., Morris J.A., Roeder K., Katsevich E. SCEPTRE improves calibration and sensitivity in single-cell CRISPR screen analysis. Genome Biol. 2021;22:344. doi: 10.1186/s13059-021-02545-2.
  • 22. Duan J., Hon G.C. FBA: feature barcoding analysis for single cell RNA-Seq. Bioinformatics. 2021;37:4266–4268. doi: 10.1093/bioinformatics/btab375.
  • 23. Xin H., Lian Q., Jiang Y., Luo J., Wang X., Erb C., Xu Z., Zhang X., Heidrich-O’Hare E., Yan Q., et al. GMM-Demux: sample demultiplexing, multiplet detection, experiment planning, and novel cell-type verification in single cell sequencing. Genome Biol. 2020;21:188. doi: 10.1186/s13059-020-02084-2.
  • 24. McCutcheon S.R., Swartz A.M., Brown M.C., Barrera A., McRoberts Amador C., Siklenka K., Humayun L., Ter Weele M.A., Isaacs J.M., Reddy T.E., et al. Transcriptional and epigenetic regulators of human CD8+ T cell function identified through orthogonal CRISPR screens. Nat. Genet. 2023;55:2211–2223. doi: 10.1038/s41588-023-01554-0.
  • 25. Zheng G.X.Y., Terry J.M., Belgrader P., Ryvkin P., Bent Z.W., Wilson R., Ziraldo S.B., Wheeler T.D., McDermott G.P., Zhu J., et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 2017;8. doi: 10.1038/ncomms14049.
  • 26. Hafemeister C., Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20:296. doi: 10.1186/s13059-019-1874-1.
  • 27. Stan Development Team (2012). Stan Modeling Language Users Guide and Reference Manual. Version 2.34. https://mc-stan.org/docs/2_34/reference-manual-2_34.pdf.

Articles from Cell Genomics are provided here courtesy of Elsevier
