Abstract
Off-target effects are well-established confounders of CRISPR negative selection screens that impair the identification of essential genomic loci. In particular, non-coding regulatory elements and repetitive regions are often difficult to target with specific gRNAs, effectively precluding the unbiased screening of a large portion of the genome. To address this, we developed CRISPR Specificity Correction (CSC), a computational method that corrects for the effect of off-targeting on gRNA depletion. We benchmark CSC with data from the Cancer Dependency Map and show that it significantly improves the overall sensitivity and specificity of viability screens while preserving known essentialities, particularly for genes targeted by highly promiscuous gRNAs. We believe this tool will further enable the functional annotation of the genome as it represents a robust alternative to the traditional filtering strategy of discarding unspecific guides from the analysis. CSC is open-source software that can be seamlessly integrated into current CRISPR analysis pipelines.
Subject terms: Functional genomics, CRISPR-Cas systems
Off-target effects often confound the interpretation of CRISPR screens. Here, the authors introduce a computational method that corrects for off-targets in gRNA depletion assays, circumventing the need to completely discard unspecific guides.
Introduction
High-throughput loss-of-function screens can help catalog loci essential to cellular fitness1–4 and have been leveraged to systematically identify cancer vulnerabilities that can be exploited therapeutically1. The CRISPR-Cas9 genome editing system has become instrumental in these efforts, owing to the ease with which null alleles can be generated in a multiplex manner in both coding and non-coding regions.
Nevertheless, measurements of cellular fitness in CRISPR loss-of-function screens can be confounded by off-target cleavage because gRNAs that lead Cas9 to cleave multiple loci can trigger a DNA-damage response that includes cell cycle arrest5. The consequences of off-target cleavage on screen performance have been best characterized for gRNAs targeting amplified genomic regions6,7. However, they have also been documented for unspecific gRNAs within published genome-wide libraries, whether they have perfect alignment or single mismatches to off-target sites8. To minimize the confounding effects of off-targets, unspecific gRNAs are typically identified and discarded during library design. In addition, guides suspected of off-target activity that were unintentionally included in the libraries are further removed from the analysis through the implementation of filters2,6. Both these filtering steps are reasonable strategies for screens of protein-coding genes, but they represent major obstacles to the implementation of fitness screens of genomic features that cannot be targeted by specific gRNAs9. These include a large fraction of non-coding regulatory elements9. Indeed, recent work reported that off-target activity caused the majority of fitness effects in CRISPR essentiality screens of CTCF binding sites9, indicating that gRNA specificity is a key confounder in this setting. However, removing unspecific gRNAs at the library design step would render a large fraction of those sites, as well as binding sites for numerous transcription factors, untargetable9. Thus, despite the ongoing efforts to comprehensively discover and annotate genomic features10,11, a large fraction of them cannot currently be screened for essentiality using conventional CRISPR strategies. This limitation significantly hinders our ability to gain insight into the functional roles of large segments of the genome.
It also argues for the need for approaches that can be used as alternatives to filtering unspecific gRNAs, analogous to those used for gRNAs targeting amplified genomic regions12,13.
Here, we build upon previous work9 by developing a computational method that identifies and corrects for the confounding effect of gRNA off-targeting in high-throughput CRISPR fitness screens. We apply our CRISPR Specificity Correction (CSC) algorithm to genome-wide fitness screens performed by the Cancer Dependency Map initiative1,2,12, allowing us to use gold-standard sets of essential and non-essential genes to benchmark our approach14. We find that CSC significantly improves screen performance across all cellular lineages, outperforming gRNA filtering strategies in its ability to discriminate between known essential and non-essential genes. Correction of depletion data with CSC also captured previously missed gene dependencies, even for genes targeted by highly unspecific gRNAs. We believe this tool will further enable the comprehensive functional characterization of coding and non-coding elements in the genome by expanding the set of usable gRNAs in CRISPR libraries. We package CSC as open-source Python software, which we make freely available to the community.
Results and discussion
We set out to develop a computational strategy to correct for the effect of off-targeting on gRNA depletion without the need to filter out unspecific gRNAs. To validate our strategy, we focused our analysis on loss-of-function screens from the 19Q4 release of the Project Achilles Avana dataset from the Cancer Dependency Map initiative2,12, performed across 26 distinct cellular lineages (Fig. 1a). This dataset represents a useful proof-of-principle scenario as previous work described the presence of unspecific gRNAs in Avana8,15 and showed that these unspecific guides confound the analysis of essentiality and contribute to false-positive hits in negative selection screens8. Equally important, the Avana genome-wide library targets gold-standard sets of curated essential and non-essential genes14. These gene sets are commonly used to evaluate the performance of CRISPR tools because gRNAs that target them can be considered true positives and true negatives in the context of viability screens: gRNAs that disrupt essential genes are expected to drop out from the population of infected cells over time, while the abundance of gRNAs that disrupt non-essential genes is expected to remain unchanged.
To systematically evaluate the effect of specificity on gRNA depletion, we enumerated all potential off-targets (at Hamming distances of 0 to 3) for each gRNA in the Avana library using GuideScan16. GuideScan is a retrieval-tree-based algorithm that outperforms Bowtie-based tools in the identification of off-target loci16, providing an exact and direct enumeration of the potential target space of a gRNA within a user-specified number of mismatches to the guide sequence (Supplementary Note 1 and accompanying Supplementary Table 1, Supplementary Data 1, Supplementary Data 2). Our catalog of potential off-target loci for Avana using this approach surpassed the original off-target estimates reported for this library15 as well as the estimates used by Project Achilles in the DepMap data processing pipeline2 (Supplementary Data 2, Fig. 1b–d, Supplementary Note 1). To summarize the specificity of each gRNA in this library, we further computed GuideScan’s specificity score. This score aggregates Cutting Frequency Determination values (or CFD, describing the likelihood of an off-target being cut by Cas9 based on the number, position, and identity of mismatches to a 20 nucleotide (nt)-long gRNA)15 for all potential target sites enumerated by GuideScan16, so that the most specific gRNAs receive a score of 1 and the most unspecific a score of 0 (Fig. 1e, f). In agreement with previous studies8,9, gRNAs with low specificities were on average more depleted from the population during viability screens, often beyond the levels observed for gRNAs targeting known essential genes (Fig. 2a, left, Supplementary Fig. 1a–d). This observation held true even for guides that had a single perfect target site in the genome (H0 = 1) but increasing numbers of off-targets with mismatches (Fig. 2b, Supplementary Fig. 1c).
Of note, when we looked at gRNAs targeting known non-essential genes14, whose representation in the library should remain unchanged over the course of the screen, we found that gRNAs with specificity scores below 0.16 were significantly depleted compared to highly specific guides (specificity score = 1; Kolmogorov–Smirnov test, adjusted for multiple testing). Fold-change distributions of gRNAs with a specificity score equal to or above 0.16, however, were indistinguishable from those of highly specific guides, suggesting that above this threshold the effect of off-target cutting on the guide’s representation in the library is statistically minimal (Supplementary Fig. 1a, Supplementary Note 2).
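The two kinds of specificity metric described above can be illustrated with a minimal sketch. It is not the GuideScan implementation: it scans a toy list of candidate sites instead of traversing a retrieval tree, and it assumes the aggregate score is the reciprocal of the summed per-site cut likelihoods, with the on-target site contributing 1.0. The sequences and likelihood values are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def count_neighbors(guide, sites, max_dist=3):
    """Count candidate target sites at Hamming distance 0..max_dist from the guide."""
    counts = Counter(hamming(guide, s) for s in sites)
    return {f"H{d}": counts.get(d, 0) for d in range(max_dist + 1)}

def specificity(cut_likelihoods):
    """Aggregate per-site cut likelihoods into a score in (0, 1].

    Assumption: score = 1 / sum(likelihoods), on-target site included as 1.0.
    """
    return 1.0 / sum(cut_likelihoods)

# Toy 8-nt guide and candidate sites (hypothetical; real guides are 20 nt).
guide = "ACGTACGT"
sites = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "TCGTTCGA", "TTTTTTTT"]

print(count_neighbors(guide, sites))   # one site each at distances 0, 1, 2, 3
print(specificity([1.0]))              # a unique target scores 1.0
print(specificity([1.0, 0.5, 0.5]))    # extra likely off-targets lower the score
```

Under this assumed definition a guide with a single target and no likely off-targets scores exactly 1, and the score approaches 0 as likely off-target sites accumulate.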
To determine the extent to which off-target mediated gRNA depletion acted as a confounder in the Achilles dataset, we calculated Bayes Factors (BF) for each gene in individual screens17. In this context, BF are an assessment of gene essentiality, with positive values indicating a gene is essential and negative values indicating a gene is non-essential. Gene Set Enrichment Analysis (GSEA) showed that genes targeted by unspecific gRNAs were significantly enriched in high BF values, particularly as the number of unspecific gRNAs per gene increased or as the specificity of the gRNAs that target each gene decreased (see Supplementary Fig. 1e–g for an example cell line). This suggests that, in agreement with previous reports8,9, off-targeting may contribute to false-positive dependencies even when multiple independent gRNAs per gene are present in a library. Taken together, these data also validate the Avana dataset from Project Achilles as a suitable model to test our CRISPR Specificity Correction (CSC) algorithm for its ability to correct the confounding effect of off-targeting on gRNA depletion.
CSC takes as inputs the sequence and depletion values of all gRNAs in a screen (Fig. 2c). As a first step, it uses the sequence information to retrieve for each guide the number of potential target sites it has at a Hamming distance of zero (H0), one (H1), two (H2), or three (H3) from the gRNA sequence, as well as its GuideScan specificity score. Although cleavage of sites with a Levenshtein distance of 1 to the gRNA (often referred to as ‘bulged’ sites) has also been reported18,19, these types of mismatches rarely preserve Cas9 activity15 and therefore were not considered. To retrieve off-target information for each guide in a library, CSC accesses hash tables whose keys contain every Cas9 gRNA that can target the human (hg38) or mouse (mm10) genomes. In the tables, each of these keys is mapped to the gRNA’s specificity metrics, which have been previously computed using GuideScan16. CSC then uses the five specificity metrics as covariates in a model to assess the contribution of off-target parameters to gRNA depletion via a multivariate adaptive regression spline (EARTH, see Methods). Piecewise linear spline functions have been previously used to model the effect of genome copy-number alterations on gRNA depletion as they can model nonlinearities and are well suited to deal with the saturating effect of high numbers of cut sites12. The EARTH model utilized by CSC has the advantage of automating both the partitioning of the data through its hinge functions and the variable selection step, thus providing a flexible approach for off-target correction. In addition, the backward pass improves generalizability of the model and minimizes overfitting of the training data by pruning the least effective terms and selecting the best performing model through generalized cross validation (see Methods for more details on model selection).
Specificity-corrected depletion values for each guide are outputted along with the enumeration of off-target sites and the gRNA’s GuideScan specificity score (Fig. 2c). Importantly, specificity metrics for all inputted gRNA sequences are provided even in the absence of accompanying depletion values. Thus, CSC not only allows users to correct off-target influence in their screening data but also provides direct access to GuideScan’s specificity metrics for any Cas9 gRNA that targets the human or mouse genome. This allows users to determine the potential target space of a gRNA or set of gRNAs even prior to a screen and may further aid CRISPR library design.
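The lookup behaviour described above (precomputed metrics keyed by guide sequence, returned whether or not depletion values are supplied) can be sketched as follows. This is an illustration only and does not reproduce the CSC package interface: the table contents, the `annotate` function, and the record field names are all hypothetical.

```python
# Hypothetical precomputed table: guide sequence -> specificity metrics.
METRICS = {
    "ACGTACGTACGTACGTACGT": {"specificity": 1.0, "H0": 1, "H1": 0, "H2": 0, "H3": 2},
    "GGGGCCCCGGGGCCCCGGGG": {"specificity": 0.05, "H0": 4, "H1": 12, "H2": 80, "H3": 310},
}

def annotate(guides, depletion=None):
    """Return one record per guide with its metrics; depletion values are optional."""
    depletion = depletion or {}
    records = []
    for g in guides:
        metrics = METRICS.get(g)          # None when the guide is not in the table
        record = {"guide": g, "found": metrics is not None}
        if metrics is not None:
            record.update(metrics)
        if g in depletion:
            record["log2fc"] = depletion[g]
        records.append(record)
    return records

# Metrics are returned even when no depletion values are provided.
records = annotate(["ACGTACGTACGTACGTACGT", "TTTTTTTTTTTTTTTTTTTT"])
for r in records:
    print(r)

with_data = annotate(["GGGGCCCCGGGGCCCCGGGG"], {"GGGGCCCCGGGGCCCCGGGG": -2.3})
print(with_data[0]["log2fc"], with_data[0]["H0"])
```

A dictionary keyed by sequence gives constant-time retrieval per guide, which is why a precomputed table suits library-scale annotation.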
To validate CSC, we applied it to all screens from the DepMap 19Q4 Achilles dataset. As predicted, CSC removed the correlation between gRNA specificity and gRNA depletion (Fig. 2a, b). Additionally, when inferring gene essentiality in each cell line of the Achilles dataset, we found that correction of off-target mediated gRNA depletion by CSC significantly increased both the recall of constitutive essential genes at 5% False Discovery Rate (FDR) of constitutive non-essential genes (Fig. 3) as well as the Area Under the Curve (AUC) of precision-recall curves (Supplementary Fig. 2a, b). Both these metrics evaluate the success of predicting true gene essentialities. This trend remained true when looking at aggregate data for the entire Achilles data set (Fig. 3a), as well as at the level of individual lineages and cell lines (Fig. 3b, c, Supplementary Fig. 3). Of note, CSC markedly outperformed the filtering strategy implemented by Project Achilles to deal with unspecific gRNAs2 (Fig. 3, Supplementary Fig. 3b).
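One way to compute recall at a fixed FDR from gold-standard gene sets is sketched below. It assumes genes are ranked by an essentiality score (higher meaning more essential), that FDR at a cutoff is the fraction of gold-standard non-essential genes among all gold-standard genes above it, and that recall is taken at the deepest cutoff still satisfying the FDR bound. Gene names and scores are hypothetical, and ties are not treated specially.

```python
def recall_at_fdr(scores, essential, nonessential, max_fdr=0.05):
    """Recall of essential genes at the deepest rank where FDR <= max_fdr."""
    ranked = sorted(
        (g for g in scores if g in essential or g in nonessential),
        key=lambda g: scores[g],
        reverse=True,
    )
    tp = fp = best_tp = 0
    for gene in ranked:
        if gene in essential:
            tp += 1
        else:
            fp += 1
        if fp / (tp + fp) <= max_fdr:
            best_tp = tp          # remember the last cutoff that meets the bound
    return best_tp / len(essential)

# Toy example (hypothetical scores).
essential = {"E1", "E2", "E3", "E4"}
nonessential = {"N1", "N2"}
scores = {"E1": 9.0, "E2": 8.0, "E3": 7.0, "N1": 6.0, "E4": 5.0, "N2": 1.0}

recall = recall_at_fdr(scores, essential, nonessential)
print(recall)  # three of four essentials rank above the first non-essential
```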
With the increased recall at 5% FDR, we observed a concomitant increase in the number of genes identified as essential. In total, 12,444 genes scored as a dependency in at least one screen when CSC was implemented, compared with 5831 and 6018 genes for data not corrected for off-targeting or when unspecific gRNAs were removed with the Achilles filter, respectively. To determine if these results constituted true gene essentialities, we first looked at the occurrence of known false positives. Because the number of false positives is influenced by the total number of genes identified as hits, we varied Bayes Factor thresholds for each screen to yield the same number of hits across each pipeline as previously described20. Then, for each screen we counted the number of gold-standard non-essential genes14 that had scored as hits. We found that CSC led to a significant reduction in the number of false positives compared to uncorrected data, again outperforming the filtering strategy implemented by Project Achilles (Fig. 4a). At this fixed number of positive hits, CSC also led to a significant increase in both precision and recall (Supplementary Fig. 2c). We then examined the expression levels of genes inferred as essential by each of the three analysis pipelines. We found that those genes identified as essential after off-target correction by CSC tended to be well expressed in the cell line in which they scored as hits (Fig. 4b, Supplementary Fig. 4a). By contrast, for each screen, genes scoring as essential in uncorrected data but not in data corrected with CSC tended to have significantly lower expression levels in the respective cell lines. In fact, a substantial subset of these genes was below the bulk RNA-seq detection threshold, suggesting that they may represent false-positive hits.
Similarly, genes identified as essential when data was corrected with the Achilles filter but not when data was corrected with CSC tended to be lowly expressed in the cell lines they scored as hits or below detection threshold, suggesting these may also represent false positives (Fig. 4b, Supplementary Fig. 4a).
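The equal-hit-number comparison of false positives described above can be sketched as follows. Rather than tuning a Bayes Factor threshold explicitly, this simplified version takes the top-N genes by Bayes Factor in each pipeline, which fixes the number of hits in the same way. All gene names and values are hypothetical.

```python
def top_hits(bayes_factors, n_hits):
    """The n_hits genes with the highest Bayes Factors."""
    ranked = sorted(bayes_factors, key=bayes_factors.get, reverse=True)
    return set(ranked[:n_hits])

def false_positives(bayes_factors, nonessential, n_hits):
    """Number of gold-standard non-essential genes among the top n_hits."""
    return len(top_hits(bayes_factors, n_hits) & nonessential)

nonessential = {"N1", "N2"}
uncorrected = {"A": 10.0, "N1": 9.0, "B": 8.0, "C": 2.0, "N2": 1.0}
corrected = {"A": 10.0, "B": 9.0, "C": 8.0, "N1": 1.0, "N2": 0.0}

n_hits = 3  # same number of hits for both pipelines
fp_uncorrected = false_positives(uncorrected, nonessential, n_hits)
fp_corrected = false_positives(corrected, nonessential, n_hits)
print(fp_uncorrected, fp_corrected)
```

Fixing the number of hits removes the trivial effect whereby a pipeline that calls more genes also calls more false positives.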
Next, we looked for evidence of functional essentiality for genes identified as putative dependencies. Essential genes are under evolutionary pressure to maintain sequence integrity, and mutations that inactivate their functions are expected to be depleted from natural human populations. In contrast, non-essential genes, whose disruption does not affect organism health or function, are expected to tolerate the accumulation of inactivating mutations. As such, we took advantage of the Genome Aggregation Database (gnomAD)21, which catalogs high-confidence predicted loss-of-function variants and uses these to classify human genes according to the mutational constraint they are under. Specifically, the LOEUF score places genes along a spectrum of tolerance to inactivating mutations, where genes that play essential cellular roles, and therefore are under high mutational constraint, receive low LOEUF scores, while genes whose disruption has no impact on cell viability or organismal health, and are therefore under low mutational constraint in the human population, receive high scores21. We confirmed that these scores can reflect gene essentiality by applying them to the curated sets of essential and non-essential genes14 (Fig. 4c, left). We then retrieved LOEUF scores for genes consistently identified as essential by one analysis pipeline (defined as genes that score as hits in more than 15 distinct cell lines) but not the other. We found that genes consistently identified as essential in data corrected using CSC but that did not score as essential in any screen when using uncorrected data or data corrected using the Achilles filter tended to have low LOEUF values. On average these scores were well below those attributed to constitutive non-essential genes (Fig. 4c).
This suggests that genes exclusively identified as dependencies after correcting off-targeting with CSC are under mutational constraint in the human population and therefore presumably play essential roles in human cells. In contrast, genes that scored in more than 15 distinct screens only before data correction or only after removing promiscuous gRNAs through the Achilles filter—but that did not score as hits after data was corrected for off-targeting using CSC—tended to have higher LOEUF values, often above those of constitutive non-essentials. These results suggest that their inactivation may be well tolerated in humans. Taken together, these data are consistent with the notion that genes identified after computational correction by CSC reflect true essentialities, and that CSC implementation minimizes the occurrence of false-positive hits. It also suggests that CSC outperforms the current filtering approach implemented by Project Achilles to deal with unspecific gRNAs (Fig. 4a–c).
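The gene selection rule used for the LOEUF comparison (hits in more than 15 distinct cell lines in one pipeline and in no screen of the other) can be written as a small set operation. The data structures and names below are hypothetical.

```python
def consistent_exclusive(hits_a, hits_b, min_lines=15):
    """Genes scoring in more than min_lines cell lines in A and in none in B.

    hits_a, hits_b: dict mapping gene -> set of cell lines where it scored as a hit.
    """
    return {
        gene
        for gene, lines in hits_a.items()
        if len(lines) > min_lines and not hits_b.get(gene)
    }

sixteen_lines = {f"line{i}" for i in range(16)}
hits_csc = {"GENE1": sixteen_lines, "GENE2": sixteen_lines, "GENE3": {"line0"}}
hits_uncorrected = {"GENE2": {"line3"}}

selected = consistent_exclusive(hits_csc, hits_uncorrected)
print(selected)
```

GENE2 is excluded because it scores in one uncorrected screen, and GENE3 because it scores in too few cell lines.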
To test CSC’s performance in the context of highly unspecific libraries, we selected only genes that are targeted by at least one gRNA in the Avana library with multiple perfect target sites in the human genome (H0 > 1). The resulting subset of gRNAs better approximates the low specificity of libraries designed to target non-coding regulatory elements in the genome9 (Supplementary Fig. 4b, c), while still targeting true-positive and true-negative genes that can be used to benchmark CSC. The low specificity of gRNA libraries cutting within non-coding motifs stems from the fact that predicted cis-regulatory elements are typically small, with lengths several fold below those of average exons. As a result, the traditional approach used for screens of protein-coding genes, where known unspecific gRNAs are excluded at the step of library design, is not viable for a large fraction of regulatory sequences. Indeed, filtering out gRNAs suspected of off-target activity would render many of these regulatory sequences untargetable9 (Supplementary Fig. 4d).
We used this pool of unspecific Avana gRNAs to reanalyze all screens from the DepMap 19Q4 dataset and re-calculate precision and recall curves. As expected, in the context of this highly unspecific set of gRNAs, the simple filtering of known promiscuous guides was an ineffective strategy to correct off-targeting (Fig. 4d). By contrast, we found that correction of off-target mediated gRNA depletion by CSC substantially increased recall (5% FDR) across all lineages of the dataset (Fig. 4d). This suggests CSC can help retrieve known gene dependencies even from highly promiscuous gRNA pools.
To test the generalizability of CSC, we next applied it to the genome-wide screens released by Project Score1 (Fig. 5). This dataset was generated using the Sanger genome-wide library22, which was designed to contain gRNAs with 19 nt-long complementarity to the genome and therefore follows off-target rules distinct from those established for the more common 20-nt design. Like Avana, the Sanger library contains a non-negligible fraction of gRNAs with perfect (H0 > 1) or near perfect (H1 > 0) off-targets (Fig. 5a, Supplementary Data 3). In addition, as described for Project Achilles screens, gRNAs with increasing numbers of perfect target sites were increasingly depleted from essentiality screens released by Project Score (Fig. 5b, left). This was also true for gRNAs with only one perfect target site in the genome (H0 = 1) but increasing numbers of off-targets with single mismatches (Fig. 5b, right). This suggests that even in the context of a 19-nt gRNA design, cleavage of mismatched off-targets can cause measurable toxicity to the cells. In contrast to the Achilles dataset, we found no correlation between GuideScan’s specificity score and gRNA depletion in Project Score screens (Fig. 5c). This is expected as this score is computed using the CFD metric, which was empirically developed using 20 nt-long gRNAs. Thus, the Project Score dataset provides a unique challenge for CSC and an opportunity to test its flexibility in correcting off-target mediated gRNA depletion in CRISPR viability screens based on distinct gRNA designs.
Applying CSC to Project Score screens resulted in correction models that included both H0 (number of Hamming 0 targets) and H1 (number of Hamming 1 targets) as the main covariates, in stark contrast to models generated on Avana data where GuideScan’s specificity score was consistently the most important covariate (compare example metric output files for each dataset, provided as Supplementary Data 4 and Supplementary Data 5 for an Avana and Sanger screen, respectively). This reflects the weakness of GuideScan’s specificity score in summarizing off-target cleavage by 19-nt gRNAs (Fig. 5c) and highlights the ability of CSC to discriminate the most informative covariates for its correction models. As before, we evaluated the performance of CSC in the Project Score dataset by calculating the recall of known essential genes at 5% FDR (Fig. 5d). We also calculated the AUC for precision-recall curves before and after correction. We found that CSC led to a significant increase in both metrics (recall p-value <2.2e-16, AUC p-value = 5.58e-8; two-sided, Wilcoxon test). Together, these data demonstrate the generalizability of CSC and its ability to increase the sensitivity of essentiality screens performed with distinct gRNA library designs.
We note that for a small number of Avana and Sanger screens (38 and 26 screens, respectively) correction of off-targeting by CSC led to an apparent drop in recall at 5% FDR (Supplementary Fig. 3 and Fig. 5d). This effect does not appear to be caused by distortions introduced by CSC, as the mean distance between the fold-changes of known essential and non-essential genes is minimally impacted by the correction and is statistically identical between screens where CSC increases or decreases recall (Supplementary Fig. 2d; two-sided Kolmogorov–Smirnov test). In addition, while increased recall is accompanied by a significant increase in the AUC of the corresponding curve (Project Achilles p-value <2.2e-16, Project Score p-value = 9.404e-09; two-sided, Wilcoxon test), lower recall is not accompanied by statistically lower AUC values (Project Achilles p-value = 0.1159, Project Score p-value = 0.4186; two-sided, Wilcoxon test). This suggests that the drop in recall at 5% FDR we have documented in a minor fraction of screens represents only a local feature of the curve and not a general trend across all decision thresholds (see Supplementary Fig. 2b and Fig. 5d for representative curves). To further ensure the robustness of the corrections and provide users with full information on how they were generated, CSC outputs for each screen a file describing the model used along with its performance metrics. These include the RMSE of the model (i.e., the square root of the variance of the residuals), a measure of how accurately it predicts the impact of off-targets on gRNA depletion (see Supplementary Data 4 and Supplementary Data 5 for example files). Additionally, CSC allows users to define RMSE thresholds above which no corrections are performed and only off-target descriptions for each gRNA in the library are outputted.
Finally, BAGEL2 is a recently published hit-calling algorithm that includes some degree of multi-targeting correction20. BAGEL2 is efficient at gene essentiality calling, but it does not separate the multi-targeting correction and hit-calling functions, preventing us from directly comparing it to CSC. However, BAGEL2 uses Bowtie to generate its off-target predictions using settings that can miss perfect and near perfect target sites (see Supplementary Note 1, Supplementary Data 1, and Methods). Additionally, it discards from its analysis gRNAs for which it identifies more than 10 perfect sites or more than 10 sites with a single mismatch, and therefore will have difficulty in handling libraries with particularly unspecific gRNAs, such as those that may be used for the screening of non-coding regulatory elements. Additional advantages of CSC that may be particularly useful to users include the full description of the off-target information of every gRNA submitted as an input, as well as information on how off-target metrics were incorporated in the correction model, how well the model performed, and how the correction modified original depletion values. Finally, CSC provides users with means to correct for the confounding effect of off-targets as part of their standard hit-calling pipelines, therefore providing increased flexibility in its employment.
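The RMSE-gated behaviour described above can be sketched as follows. This is a simplified illustration, not the CSC code: it assumes the correction subtracts the model's predicted off-target contribution from the observed depletion, and that the RMSE is computed between predicted and observed values. Function names and numbers are hypothetical.

```python
import math

def rmse(predicted, observed):
    """Root mean squared error between predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def correct_if_reliable(observed, predicted, max_rmse):
    """Apply the correction only when the model error is within the threshold.

    Returns (values, error, applied). When the error exceeds max_rmse the
    observed values are returned unchanged.
    """
    error = rmse(predicted, observed)
    if error > max_rmse:
        return list(observed), error, False
    return [o - p for o, p in zip(observed, predicted)], error, True

observed = [-2.0, -0.5, 0.0, -1.0]    # log2 fold changes (hypothetical)
predicted = [-1.0, -0.5, 0.0, 0.0]    # modelled off-target contribution

corrected, error, applied = correct_if_reliable(observed, predicted, max_rmse=1.0)
print(corrected, round(error, 4), applied)

unchanged, _, skipped = correct_if_reliable(observed, predicted, max_rmse=0.5)
print(unchanged, skipped)
```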
In summary, we present here a flexible computational correction that minimizes the confounding effect of unspecific gRNAs in CRISPR-Cas9 essentiality screens leading to improved sensitivity and reduced false-positive hits in genome-wide screens. We believe CSC will be a powerful aid to ongoing efforts to catalog genomic loci required for cellular fitness, particularly in the context of screens targeting highly repetitive genomic regions—such as non-coding regulatory elements—where the design of specific libraries and gRNA filtering approaches are not feasible9. To facilitate its incorporation into existing CRISPR analysis pipelines including those that correct for genomic amplifications (Supplementary Fig. 2e, see Methods), we make the software freely available as a Python package at https://bitbucket.org/arp2012/csc_public/src/master/.
Methods
Screening data
Raw read counts of CRISPR viability screens (Broad DepMap project 19Q4) performed with the Avana library were downloaded from the DepMap project data repository (https://figshare.com/articles/DepMap_19Q4_Public/11384241/2). Raw read counts of CRISPR essentiality screens (Sanger, release 1, 5th April 2019) performed with the Sanger library were downloaded from the Project Score page (https://score.depmap.sanger.ac.uk/downloads).
Guide RNA preprocessing
Guide RNAs with fewer than 30 reads in the initial plasmid counts were removed. Counts for all screen replicates and the corresponding plasmid library were adjusted by median-ratio normalization to account for the effect of library sizes and read count distributions. Finally, for each screen, log2-fold changes for individual gRNAs were calculated between the initial plasmid library counts and the post-screen counts for each replicate experiment. The mean of the log2-fold changes across replicates was used as the final log2-fold change value for each gRNA.
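The preprocessing steps above can be sketched in a few lines. Several details are assumptions rather than statements about the actual pipeline: size factors are computed as the median ratio of each sample's counts to the per-guide geometric mean across samples, guides with a zero count in any sample are excluded from the size-factor estimate, and a pseudocount of 1 is added before taking log ratios. All counts are hypothetical.

```python
import math
from statistics import mean, median

def size_factors(samples):
    """Median-of-ratios size factor per sample (samples: name -> guide -> count)."""
    guides = list(next(iter(samples.values())))
    log_geo_mean = {}
    for g in guides:
        counts = [samples[s][g] for s in samples]
        if all(c > 0 for c in counts):          # skip guides with any zero count
            log_geo_mean[g] = mean(math.log(c) for c in counts)
    return {
        s: math.exp(median(math.log(samples[s][g]) - lg
                           for g, lg in log_geo_mean.items()))
        for s in samples
    }

def log2_fold_changes(plasmid, replicates, min_reads=30, pseudocount=1.0):
    """Filter, normalize, and average log2 fold changes across replicates."""
    kept = {g: c for g, c in plasmid.items() if c >= min_reads}
    samples = {"plasmid": kept}
    for name, counts in replicates.items():
        samples[name] = {g: counts[g] for g in kept}
    sf = size_factors(samples)
    lfc = {}
    for g in kept:
        base = kept[g] / sf["plasmid"] + pseudocount
        lfc[g] = mean(
            math.log2((samples[r][g] / sf[r] + pseudocount) / base)
            for r in replicates
        )
    return lfc

plasmid = {"g1": 100, "g2": 200, "g3": 10, "g4": 400}
replicates = {
    "rep1": {"g1": 100, "g2": 200, "g3": 5, "g4": 100},
    "rep2": {"g1": 200, "g2": 400, "g3": 30, "g4": 200},  # sequenced twice as deep
}
lfc = log2_fold_changes(plasmid, replicates)
print({g: round(v, 3) for g, v in lfc.items()})
```

In this toy data g3 is dropped by the read filter, g1 and g2 are unchanged once sequencing depth is accounted for, and g4 is depleted roughly fourfold in both replicates.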
Off-target data
We downloaded the sequence and annotation data for the hg38 assembly of the human genome from the UCSC database23 and used it to construct a retrieval tree (trie) consisting of all possible 20mer Cas9 gRNA target sites in the human genome with the GuideScan software16. In contrast to the original trie16, this retrieval tree was constructed without inclusion of alternative chromosome data, so that they did not artificially inflate the enumeration of off-targets. To determine the mismatch neighborhood for each gRNA in the Avana library, we traversed their sequences through the trie to exhaustively determine all neighbors up to and including Hamming distances of 3. Specificity scores for each gRNA were computed using Hamming distance neighbors using our previously described strategy16, which incorporates Guidescan’s ability to faithfully enumerate all potential target sites up to a specified number of mismatches, with CFD score’s prediction of how likely each of those sites is to be cut15. All these metrics are used as covariates (x) in the CSC model (see below).
Gene set enrichment analysis
GSEA was performed using the FGSEA R package (Release 3.13), available at: http://bioconductor.org/packages/release/bioc/html/fgsea.html.
Enrichment and p-values were calculated against 100,000 random gene sets.
Model comparison and selection
We tested various machine learning models, including a lasso linear model fit with least angle regression, random forest regression, and multivariate adaptive regression splines (EARTH), by regressing the mean depletion values of each gRNA against its GuideScan specificity score and the number of mismatch neighbors at Hamming distances 0, 1, 2, and 3. We modeled the impact of specificity on gRNA depletion on all screens from the 19Q4 Project Achilles dataset. For each screen we used 90% of the data for training with 10% being held out for testing. The mean squared error of these models was significantly lower with the multivariate adaptive regression splines (Supplementary Fig. 5). Because EARTH performed best out of the set of regressors tested and automatically engineers non-linearities and interaction terms from the input covariates, it was selected as the base model for CSC.
Multivariate adaptive regression splines
We developed a model that assumes that the measured depletion value (D) of a gRNA (i) in any individual screen is the sum of gene-knockout effects (Gi) and off-target effects (Oi).
$$D_i = G_i + O_i \tag{1}$$
To estimate Oi, we use Multivariate Adaptive Regression Splines24 (EARTH) which can model non-linearities in the data as well as interactions between variables. The model takes the form of the following equation
$$\hat{O}_i = \sum_{j=0}^{J-1} C_j B_j(x) \tag{2}$$
where the estimated contribution of off-targeting to gRNA depletion ($\hat{O}_i$) can be approximated by the weighted sum of J basis functions Bj derived from the model predictor variables (x). Cj are coefficients of expansion whose values are jointly adjusted to give the best fit to the data. The basis functions Bj can take the form of (i) a constant 1, which represents the intercept of the model; (ii) a hinge function derived from a predictor variable, or (iii) a product of two or more hinge functions each derived from different predictors to capture their interaction.
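To make the three kinds of basis function concrete, the sketch below evaluates a weighted sum of an intercept, a single hinge term, and a product of two hinges, and then subtracts the result from an observed depletion value as implied by the additive model. The knots, coefficients, and covariate values are hypothetical and are not taken from any fitted CSC model.

```python
def hinge(x, knot, direction=1):
    """max(0, x - knot) when direction=1; max(0, knot - x) when direction=-1."""
    return max(0.0, direction * (x - knot))

def predict_offtarget(x, intercept, terms):
    """Evaluate intercept + sum_j C_j * B_j(x).

    terms: list of (coefficient, [(variable, knot, direction), ...]); a term
    with several hinges is their product and models an interaction.
    """
    total = intercept
    for coefficient, hinges in terms:
        basis = 1.0
        for variable, knot, direction in hinges:
            basis *= hinge(x[variable], knot, direction)
        total += coefficient * basis
    return total

# Hypothetical fitted model: depletion grows as specificity falls below 0.2
# and with the interaction between extra perfect sites and 1-mismatch sites.
intercept = 0.0
terms = [
    (-2.0, [("specificity", 0.2, -1)]),
    (-0.1, [("H0", 1, 1), ("H1", 0, 1)]),
]

promiscuous = {"specificity": 0.1, "H0": 3, "H1": 5}
specific = {"specificity": 1.0, "H0": 1, "H1": 0}

o_hat = predict_offtarget(promiscuous, intercept, terms)
print(round(o_hat, 3))                        # estimated off-target effect
print(predict_offtarget(specific, intercept, terms))

observed = -1.5
print(round(observed - o_hat, 3))             # estimated gene-knockout effect
```

Each hinge is zero on one side of its knot, so a highly specific guide contributes nothing to the estimated off-target effect in this toy model.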
Model training and pruning
The model starts with the intercept term (B0(x) = 1, with intercept at C0). It then iteratively adds new basis functions in the form of hinge functions or products of hinge functions. At each step, the new terms are selected and added into the model as to minimize the sum of squared error using ordinary least squares method. This forward pass proceeds until the residual error consistently falls below the stopping threshold (minimal change in mean squared error (MSE) with additional terms). To prevent over-fitting and improve generalization, the forward pass model undergoes a backwards pass, where model terms are removed in a stepwise manner with subsequent reassessment for increases in the sum of squared error obtained in this sub-model. Selection for the optimal sub-model is done using generalized cross-validation (GCV) which optimizes tradeoff between bias and variance. The model with the lowest GCV is selected as the optimal model. An example file detailing model metrics for an Avana and a Sanger screen are provided as Supplementary Data 4 and Supplementary Data 5.
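The selection of a pruned sub-model by GCV can be illustrated with a short sketch. The formula used here is one common form of the criterion for regression splines and is an assumption about the details: the residual sum of squares is divided by the number of observations and by a complexity correction in which the effective number of parameters grows with the number of terms and a per-knot penalty. The candidate sub-models and the penalty value are hypothetical.

```python
def gcv(rss, n_samples, n_terms, penalty=3.0):
    """Generalized cross-validation score for a sub-model (lower is better).

    Assumed form: (RSS / N) / (1 - C / N)^2 with
    C = n_terms + penalty * (n_terms - 1) / 2.
    """
    effective = n_terms + penalty * (n_terms - 1) / 2.0
    if effective >= n_samples:
        return float("inf")
    return (rss / n_samples) / (1.0 - effective / n_samples) ** 2

def select_submodel(candidates, n_samples, penalty=3.0):
    """Pick the (n_terms, rss) pair with the lowest GCV."""
    return min(candidates, key=lambda c: gcv(c[1], n_samples, c[0], penalty))

# Hypothetical sub-models produced by the backward pass: (number of terms, RSS).
candidates = [(1, 500.0), (3, 200.0), (5, 190.0), (9, 185.0)]
n_samples = 100

for n_terms, rss in candidates:
    print(n_terms, round(gcv(rss, n_samples, n_terms), 3))

best = select_submodel(candidates, n_samples)
print(best)
```

The larger sub-models fit the training data slightly better, but the complexity correction outweighs that gain, so the three-term model is retained.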
During model development, the training data consisted of 90% of all input data and the test data consisted of the remaining 10%. Test error was assessed as the root mean squared error between the predicted and actual values of the test data.
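The split and error metric can be sketched as follows. The random shuffling and the fixed seed are assumptions made for illustration; the text does not state how the 90% and 10% partitions were drawn.

```python
import math
import random


def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of `rows` and hold out `test_fraction` of it as test data."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Toy usage with 100 placeholder observations.
train, test = train_test_split(range(100))
```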
CSC software and implementation
CSC was packaged in Python (version 3.8.8) with the Avana, Brunello, GeckoV1, GeckoV2, and Sanger libraries as package data. Pickle files for the hg38 and mm10 genomes are also provided in a repository, to allow CSC to be implemented for any custom human or mouse library based on a 20-mer gRNA design. We also provide a Docker image, and the software is available via PyPI. All these files are freely available to download from our Bitbucket repository (see Code Availability).
Alternative approaches for gRNA off-targeting correction
To benchmark CSC, we compared its performance against the current approach of filtering out gRNAs suspected of off-target activity, as implemented by Project Achilles. Information about this filter (which can be downloaded from the DepMap data repository as “Achilles_dropped_guides.csv”) is provided here as part of Supplementary Data 2 (columns 8 and 9). The Achilles filter list was generated from runs of CERES and includes guides that CERES flagged for potential off-target activity because they were the sole efficacious gRNA for a gene; these guides receive the label “guide_dropped_by_ceres”. In addition, this file enumerates the estimated number of perfect matches for each guide in the column “Achilles n_alignments”. These alignments are performed with Bowtie using the 20-nucleotide sequence of each gRNA and are subsequently filtered for the presence of a PAM sequence motif in the form of NGG2. gRNAs that have no perfect alignment to hg38 or that are found to have more than one perfect target site through this method are dropped from the analysis and flagged as “not_aligned” or “in_dropped_guides”, respectively.
As discussed in Supplementary Note 1 and shown in Supplementary Data 2, the filtering list generated through the method described above significantly underestimates the number of promiscuous gRNAs in the Avana library. In fact, the number of perfect target sites estimated by this approach surpasses that of the GuideScan retrieval trie algorithm in only 8 cases (Supplementary Table 2).
Manual curation of each of these shows that the additional sites identified by Bowtie but not GuideScan were not adjacent to PAM motifs, and therefore do not represent potential target sites for the guide RNA in question.
In our final analysis to test CSC’s performance in highly promiscuous gRNA pools (Fig. 4d), we compare it with a filter that accurately removes gRNAs with more than a single perfect target site, as identified by GuideScan. The identity of these guides, containing only 1 target site with Hamming distance of 0 (H0) compared to the gRNA, can be found in Supplementary Data 2. Each of these three approaches (CSC, Achilles filter, filter H0 = 1) was integrated within otherwise identical analysis pipelines and compared against a pipeline in which no off-target management was performed.
BAGEL2 is a recently published essentiality classifier that includes off-target correction as part of its algorithm20. We were unable to segregate the off-target correction component of this tool from its hit-calling component and therefore could not directly compare it to CSC. However, like the Achilles filter, the off-target enumerations generated and used by BAGEL2 for its corrections are done using Bowtie. As discussed in Supplementary Note 1, this can result in an underestimation of off-targets. We have attempted to compare the off-target description generated by BAGEL2 for the Avana library with that of CSC. However, the BAGEL2 publication does not provide a list of the off-target enumerations for this library used for the multi-targeting correction. Nevertheless, the ‘precalc_library_alignment_info.py’ script deposited on the BAGEL2 GitHub page (https://github.com/hart-lab/bagel) suggests that alignments are performed with Bowtie, using the following parameters:
As we show in Supplementary Data 1, these parameters can miss off-targets at edit distances of 0, 1, and 2. In addition, these parameters are unable to retrieve off-targets at an edit distance of 3, because N is considered a mismatch by Bowtie and -v is an integer between 0 and 3. Therefore, these settings do not guarantee an exhaustive off-target search within edit distances of 2 and cannot identify off-targets at an edit distance of 3, which in our model can still contribute to improper gRNA depletion.
Evaluation of alignment methods for gRNA off-target search
To compare different alignment methods in their ability to identify gRNA off-targets, we took the sequence of the most unspecific gRNA in the Avana library (TGTAATCCCAGCACTTTGGG) and appended ‘NGG’ at its 3′ end. The resulting 23-nucleotide long sequence represents a potential perfect target sequence for this gRNA and was used as the query sequence for all alignment tools, using the UCSC hg38 genome assembly. These tools include BLAT25, Bowtie26, Bowtie227, STAR28, and BWA29. In Supplementary Data 1, we show the total number of alignments obtained with each of these tools and the parameters used in the corresponding run. To identify which of these alignments corresponded to possible target sites for this gRNA, we extracted the sequences corresponding to each alignment coordinate and filtered them based on the presence of an intact NGG PAM, to remove alignments in which mismatches disrupted this motif. We classified alignments with an intact PAM based on their edit distance to the gRNA sequence, using the NM:i:* tag and taking into consideration whether the alignment tool considered ambiguous characters mismatches.
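The PAM filtering and distance classification step can be sketched as below. This is a simplified stand-in: it computes Hamming distance directly on extracted 23-nucleotide sequences rather than reading the NM:i:* tag from alignment records, it assumes each sequence is already oriented on the protospacer strand, and the candidate sites listed are invented for illustration.

```python
GRNA = "TGTAATCCCAGCACTTTGGG"  # most unspecific Avana gRNA, as given in the text


def has_ngg_pam(site):
    """True if a 23-nt site ends in an intact NGG PAM (positions 22-23 are GG)."""
    return len(site) == 23 and site[21:23] == "GG"


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def classify_sites(grna, sites):
    """Count PAM-adjacent sites by their distance to the 20-nt gRNA."""
    counts = {}
    for site in sites:
        site = site.upper()
        if not has_ngg_pam(site):
            continue  # mismatch disrupted the PAM: not a potential target site
        distance = hamming(grna, site[:20])
        counts[distance] = counts.get(distance, 0) + 1
    return counts


# Hypothetical sequences extracted from alignment coordinates.
sites = [
    GRNA + "AGG",                     # perfect match, intact PAM
    GRNA + "TGA",                     # perfect protospacer, disrupted PAM
    "TGTAATCCCAGCACTTTGGA" + "CGG",   # one mismatch, intact PAM
    "AGTAATCCCAGCACTTTGGA" + "GGG",   # two mismatches, intact PAM
]
site_counts = classify_sites(GRNA, sites)
```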
Precision-recall analysis
To benchmark the performance of CSC, we generated precision-recall curves for all screens after being processed through each analysis pipeline, using the set of constitutive essential and non-essential genes defined in Hart et al.14, as references. Precision and Recall were calculated as:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$
where TP (True Positives) is the number of positive hits from the gold-standard essential gene set, FP (False Positives) is the number of positive hits from the gold-standard non-essential gene set, and FN (False Negatives) is the number of gold-standard essential genes that did not score as hits. Briefly, these gold-standard sets of genes were defined in ref. 14 by selecting genes that emerged as essential in shRNA screens in 72 different cell lines, and then filtering for genes that are constitutively and invariantly expressed across both the ENCODE and the Illumina BodyMap RNA-seq datasets. Reference non-essential genes were selected as those probed in the shRNA screens with no evidence of an impact on cell growth and that, in addition, were not expressed (< 0.1 FPKM) in the majority of tissues and cell lines tested (15/16 BodyMap tissues; 16/17 ENCODE cell lines). Since they were first defined, these lists of ‘constitutive essential and non-essential genes’ have been further validated by multiple independent approaches.
Precision-recall curves were generated based on ordered BF values (see below). The BF value corresponding to 95% precision (meaning the value for which at least 95% of genes are known essentials) was taken as the 5% False Discovery Rate threshold (FDR = 1 - precision). The percentage of reference essential genes identified as essential at that BF threshold was taken as the recall value at 5% FDR. In cases where multiple threshold values had a precision of 95%, the one corresponding to the highest recall value was used. Area Under the Curve (AUC) was calculated using the ‘PRROC’ R package (version 1.3.1)30.
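A minimal sketch of the threshold selection described above, assuming BF values are stored per gene and that every precision of at least 95% is an eligible threshold. The gene names and BF values are invented, tied BF values are not handled specially, and genes outside both reference sets are ignored.

```python
def recall_at_precision(bf_by_gene, essential, nonessential, min_precision=0.95):
    """Walk genes in decreasing BF order and return (bf_threshold, recall).

    Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP + FN is
    the size of the reference essential set. Among thresholds that reach
    `min_precision`, the one with the highest recall is returned.
    Returns None if no threshold qualifies.
    """
    ranked = sorted(bf_by_gene.items(), key=lambda item: item[1], reverse=True)
    tp = fp = 0
    best = None
    for gene, bf in ranked:
        if gene in essential:
            tp += 1
        elif gene in nonessential:
            fp += 1
        else:
            continue  # gene is in neither reference set
        precision = tp / (tp + fp)
        recall = tp / len(essential)
        if precision >= min_precision and (best is None or recall > best[1]):
            best = (bf, recall)
    return best


# Hypothetical reference sets and Bayes Factors.
essential = {"A", "B", "C", "D"}
nonessential = {"X", "Y"}
bf_by_gene = {"A": 50, "B": 40, "X": 30, "C": 20, "D": 10, "Y": -5, "Z": 100}

threshold_5pct_fdr = recall_at_precision(bf_by_gene, essential, nonessential)
```

Genes with BF values at or above the returned threshold would then be called essential at 5% FDR.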
Inference of gene essentiality
The BAGEL software (version 0.91)17 was used to infer gene essentiality based on the log2-fold changes of the gRNAs for each gene. This software uses a supervised learning method that implements Bayesian statistics and outputs, for each gene, a Bayes Factor (BF) value based on the likelihood that the observed fold-changes of the gRNAs that target it were drawn from reference essential or non-essential distributions17. For each screen, essential genes at 5% FDR were identified by selecting those with BF values above the threshold identified in the precision-recall analysis.
Distortion metric
We quantified the distortion introduced by CSC for each screen by calculating the Mean and Median Distortion31.
Gene expression data
RNA-seq TPM gene expression data (log2-transformed using a pseudo-count of 1) for protein coding genes was downloaded from the DepMap project data repository (https://figshare.com/articles/DepMap_19Q4_Public/11384241/2).
Genome annotations
Binding site predictions for miRNAs expressed by the miR-17~92 cluster32 were retrieved from TargetScan33. Bed files for transcription factor motif archetypes overlapping consensus DNaseI footprints34 were downloaded from https://www.vierstra.org/resources/dgf.
Combined correction of off-targets and copy-number effects
To correct for both copy-number and off-target effects, we ran CRISPRcleanR13 on median-ratio normalized reads using the pre-computed files provided for the Avana library. The CRISPRcleanR-corrected fold-changes were then used as input to CSC to correct for off-target effects.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Acknowledgements
The authors thank all members of the Vidigal lab for critical comments on the manuscript. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov) and the UCSF Wynton cluster (https://wynton.ucsf.edu). This work was supported by the Intramural Research Program of the National Institutes of Health (NIH) and a FLEX grant from the Center for Cancer Research (JAV).
Author contributions
A.R.P. and J.A.V. conceived and designed the study. A.R.P. wrote and implemented all software. A.R.P., L.S. and J.A.V. processed, managed, and analyzed the data. R.K.P. assisted with computational analysis. J.A.V., A.R.P. and L.S. wrote and/or revised the manuscript with assistance from R.K.P. J.A.V. supervised the study.
Data availability
The DepMap data used in this study are available at the DepMap project data repository (https://figshare.com/articles/DepMap_19Q4_Public/11384241/2) for screens performed with the Avana library, and at the Project Score page (https://score.depmap.sanger.ac.uk/downloads) for screens performed with the Sanger library. The human genome sequence used to enumerate potential off-target sites was downloaded from the UCSC genome database (https://genome.ucsc.edu/). RNA-seq TPM gene expression data (log2-transformed using a pseudo-count of 1) for protein coding genes can be downloaded from the DepMap project data repository (https://figshare.com/articles/DepMap_19Q4_Public/11384241/2). Binding site predictions for miRNAs expressed by the miR-17~92 cluster can be retrieved from TargetScan. Bed files for transcription factor motif archetypes overlapping consensus DNaseI footprints can be downloaded from https://www.vierstra.org/resources/dgf.
Code availability
Scripts for off-target enumeration and CSC implementation are freely available at our Bitbucket repository (https://bitbucket.org/arp2012/csc_public/src/master/). A PyPI package of the software can be installed with pip3 install crispr-CSC.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature Communications thanks Maximilian Haeussler, Francesco Iorio, and Max Salick for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
4/1/2022
A Correction to this paper has been published: 10.1038/s41467-022-29598-6
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-021-26722-w.
References
- 1. Behan FM, et al. Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature. 2019;568:511–516. doi: 10.1038/s41586-019-1103-9.
- 2. Dempster JM, et al. Extracting biological insights from the project Achilles genome-scale CRISPR screens in cancer cell lines. Preprint at bioRxiv. 2019. doi: 10.1101/720243.
- 3. Lenoir WF, Lim TL, Hart T. PICKLES: the database of pooled in-vitro CRISPR knockout library essentiality screens. Nucleic Acids Res. 2018;46:D776–D780. doi: 10.1093/nar/gkx993.
- 4. Wang T, et al. Identification and characterization of essential genes in the human genome. Science. 2015;350:1096–1101. doi: 10.1126/science.aac7041.
- 5. Haapaniemi E, Botla S, Persson J, Schmierer B, Taipale J. CRISPR-Cas9 genome editing induces a p53-mediated DNA damage response. Nat. Med. 2018;24:927–930. doi: 10.1038/s41591-018-0049-z.
- 6. Aguirre AJ, et al. Genomic copy number dictates a gene-independent cell response to CRISPR/Cas9 targeting. Cancer Discov. 2016;6:914–929. doi: 10.1158/2159-8290.CD-16-0154.
- 7. Munoz DM, et al. CRISPR screens provide a comprehensive assessment of cancer vulnerabilities but generate false-positive hits for highly amplified genomic regions. Cancer Discov. 2016;6:900–913. doi: 10.1158/2159-8290.CD-16-0178.
- 8. Fortin JP, et al. Multiple-gene targeting and mismatch tolerance can confound analysis of genome-wide pooled CRISPR screens. Genome Biol. 2019;20:21. doi: 10.1186/s13059-019-1621-7.
- 9. Tycko J, et al. Mitigation of off-target toxicity in CRISPR-Cas9 screens for essential non-coding elements. Nat. Commun. 2019;10:4063. doi: 10.1038/s41467-019-11955-7.
- 10. Consortium EP, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710. doi: 10.1038/s41586-020-2493-4.
- 11. Frankish A, et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2019;47:D766–D773. doi: 10.1093/nar/gky955.
- 12. Meyers RM, et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 2017;49:1779–1784. doi: 10.1038/ng.3984.
- 13. Iorio F, et al. Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. BMC Genomics. 2018;19:604. doi: 10.1186/s12864-018-4989-y.
- 14. Hart T, Brown KR, Sircoulomb F, Rottapel R, Moffat J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol. Syst. Biol. 2014;10:733. doi: 10.15252/msb.20145216.
- 15. Doench JG, et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 2016;34:184–191. doi: 10.1038/nbt.3437.
- 16. Perez AR, et al. GuideScan software for improved single and paired CRISPR guide RNA design. Nat. Biotechnol. 2017;35:347–349. doi: 10.1038/nbt.3804.
- 17. Hart T, Moffat J. BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics. 2016;17:164. doi: 10.1186/s12859-016-1015-8.
- 18. Lin Y, et al. CRISPR/Cas9 systems have off-target activity with insertions or deletions between target DNA and guide RNA sequences. Nucleic Acids Res. 2014;42:7473–7485. doi: 10.1093/nar/gku402.
- 19. Tsai SQ, et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat. Biotechnol. 2015;33:187–197. doi: 10.1038/nbt.3117.
- 20. Kim E, Hart T. Improved analysis of CRISPR fitness screens and reduced off-target effects with the BAGEL2 gene essentiality classifier. Genome Med. 2021;13:2. doi: 10.1186/s13073-020-00809-3.
- 21. Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
- 22. Tzelepis K, et al. A CRISPR dropout screen identifies genetic vulnerabilities and therapeutic targets in acute myeloid leukemia. Cell Rep. 2016;17:1193–1205. doi: 10.1016/j.celrep.2016.09.079.
- 23. Haeussler M, et al. The UCSC genome browser database: 2019 update. Nucleic Acids Res. 2019;47:D853–D858. doi: 10.1093/nar/gky1095.
- 24. Friedman JH. Multivariate adaptive regression splines. Ann. Stat. 1991;19:1–67. doi: 10.1177/096228029500400303.
- 25. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202.
- 26. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
- 27. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
- 28. Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
- 29. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 30. Grau J, Grosse I, Keilwagen J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics. 2015;31:2595–2597. doi: 10.1093/bioinformatics/btv153.
- 31. Abraham I, Bartal Y, Neiman O. Advances in metric embedding theory. Adv. Math. 2011;228:101. doi: 10.1016/j.aim.2011.08.003.
- 32. Han YC, et al. An allelic series of miR-17~92-mutant mice uncovers functional specialization and cooperation among members of a microRNA polycistron. Nat. Genet. 2015;47:766–775. doi: 10.1038/ng.3321.
- 33. Agarwal V, Bell GW, Nam JW, Bartel DP. Predicting effective microRNA target sites in mammalian mRNAs. Elife. 2015;4:e05005.
- 34. Vierstra J, et al. Global reference mapping of human transcription factor footprints. Nature. 2020;583:729–736. doi: 10.1038/s41586-020-2528-x.