Abstract
Motivation
The CRISPR/Cas9 system is widely used for genome editing. The editing efficiency of CRISPR/Cas9 is mainly determined by the guide RNA (gRNA). Although many computational algorithms have been developed in recent years, it is still a challenge to select optimal bioinformatics tools for gRNA design in different experimental settings.
Results
We performed a comprehensive comparative analysis of 15 public algorithms for gRNA design, using 16 experimental gRNA datasets. Based on this analysis, we identified the top-performing algorithms, with which we further implemented various computational strategies to build ensemble models for performance improvement. Validation analysis indicates that the new ensemble model had improved performance over any individual algorithm alone in predicting gRNA efficacy under various experimental conditions.
Availability and implementation
The new sgRNA design tool is freely accessible as a web application via https://crisprdb.org. The source code and stand-alone version are available at Figshare (https://doi.org/10.6084/m9.figshare.21295863) and GitHub (https://github.com/wang-lab/CRISPRDB).
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
The CRISPR system has quickly become the first choice for performing site-specific genome editing. It has been used in a variety of organisms and cell types (Doudna and Charpentier, 2014). Among all the systems with various choices of Cas proteins, the CRISPR/Cas9 system is the most widely used because of its simplicity, high efficiency and low cost (Doudna and Charpentier, 2014). This system consists of two core components, the Cas9 protein and the guide RNA (gRNA). The Cas9/gRNA complex is guided by the gRNA to its target site by recognizing a 20-nucleotide target sequence proximal to an NGG protospacer adjacent motif (PAM). Then, Cas9 induces a site-specific double-strand break (DSB) at the target site. As for the editing performance of the CRISPR/Cas9 system, two major considerations are targeting specificity and cleavage efficiency. Off-target effects occur when the Cas9/gRNA complex binds and cleaves unintended genomic loci (Hsu et al., 2013). This is a potentially serious concern, especially in clinical applications. Thus, many studies have addressed this issue using both experimental (Chen et al., 2017; Kleinstiver et al., 2016; Slaymaker et al., 2016) and computational (Chuai et al., 2018; Doench et al., 2016; Hsu et al., 2013) methods. Besides targeting specificity, there is also a high demand for high cleavage efficiency, as inefficient gene editing often obscures real functional changes in experiments.
To predict the cleavage efficiency of gRNAs, individual studies have utilized various experimental datasets to train prediction algorithms. Besides the differences in training data, individual algorithms employed different feature sets based on various sequence and structural attributes related to cleavage efficiency, such as nucleotide composition and gRNA secondary structure, as recently reviewed by Konstantakos et al. (2022). Moreover, a variety of traditional machine learning (Chari et al., 2017; Doench et al., 2016; Hiranniramol et al., 2020; Kaur et al., 2016; Kuan et al., 2017; Peng et al., 2018; Rahman and Rahman, 2017; Wilson et al., 2018; Xu et al., 2015) and deep learning (Chuai et al., 2018; Kim et al., 2019; Wang et al., 2019; Xiang et al., 2021) models have been employed by individual prediction algorithms. Among them, we recently reported an ensemble learning-based computational model, sgDesigner, which utilized the Stacking ensemble method to combine multiple machine learning models (Hiranniramol et al., 2020). As individual algorithms were developed using different computational strategies based on different training datasets and feature designs, their performance could vary significantly when applied in different experimental settings (Konstantakos et al., 2022). Thus, there is a strong demand in the CRISPR field for comprehensive guidance on selecting optimal tools for gRNA design. To this end, we performed a comprehensive comparison of existing gRNA design algorithms using a large collection of published datasets. The top-performing algorithms were then combined using the Stacking ensemble method to further enhance the power of the predictive model. Our validation analysis showed the new ensemble model has the best overall performance compared with any individual algorithm included in this study.
2 Materials and methods
2.1 Computational algorithms for predicting CRISPR/Cas9 efficacy
We assessed seventeen published scoring algorithms for gRNA cleavage efficiency: CRISPRon (Xiang et al., 2021), DeepSpCas9 (Kim et al., 2019), DeepHF (Wang et al., 2019), sgDesigner (Hiranniramol et al., 2020), uCRISPR (Zhang et al., 2019), TSAM (Peng et al., 2018), RuleSet2 (Doench et al., 2016), SSC (Xu et al., 2015), predictSGRNA (Kuan et al., 2017), E-crisp (Heigwer et al., 2014), TUSCAN (Wilson et al., 2018), ge-CRISPR (Kaur et al., 2016), DeepCRISPR (Chuai et al., 2018), sgRNAScorer.2.0 (Chari et al., 2017), CRISPRpred (Rahman and Rahman, 2017), CRISPRscan (Moreno-Mateos et al., 2015) and CRISPRater (Labuhn et al., 2018). The prediction results from most algorithms were generated using the respective stand-alone packages. Among them, CRISPRon, DeepSpCas9, sgDesigner, SSC, predictSGRNA and CRISPRpred were run with default settings; DeepHF was run with ‘wt_u6’ as the ‘enzyme’ option; uCRISPR scores were generated using the ‘on-target’ module; TSAM was run with additional ‘pHMM’ features as instructed by the authors; TUSCAN scores were computed using the regression version; DeepCRISPR predictions were generated using the sequence features only; sgRNA Scorer 2.0 model parameters were set to 20-bp as the seed length and ‘NGG’ as the PAM sequence. As for the remaining algorithms, the prediction results of RuleSet2, ge-CRISPR and E-crisp were obtained via the respective web servers; CRISPRscan was not included in our comparative analysis because the website does not support batch prediction for multiple gRNA sequences; CRISPRater was excluded from our analysis as we could not finish processing all queries due to unusually long waiting times at the web server. More details about these scoring algorithms are summarized in Supplementary Table S1.
2.2 CRISPR gRNA datasets for evaluation of editing efficiency
Relevant experimental studies for gRNA potency analysis are summarized in Supplementary Table S2. We downloaded the Chari_293T dataset directly from the Supplementary Tables at the journal’s website (Chari et al., 2015). The Shalem dataset was originally generated by Shalem et al. (2014) and further processed by Doench et al. (2014) to calculate ‘Normalized sgRNA Activity’. The gene percent rank from the Doench2014_hs and Doench2014_mm datasets were used to assess gRNA efficiency. The Shalem, Doench2014_hs and Doench2014_mm datasets were downloaded from the Supplementary Tables of Doench et al. (2014). We downloaded the Doench2016 dataset from the Azimuth website (Doench et al., 2016). The XuHL60 and XuKBM7 datasets were originally generated by Wang et al. (2014) and further curated by Xu et al. (2015). We merged the gRNAs targeting both ‘ribosomal’ and ‘non-ribosomal’ genes and used the negative log2-fold change values with inverted sign to represent gRNA cleavage efficiency. As for the data from Hart et al. (2015), we included the representative Hct116-2 Lib1 dataset as curated by Haeussler et al. (2016). The CRISPRon_train, DeepSpCas9_train, DeepHF_train and sgDesigner_train datasets were respective training data of four individual scoring algorithms and downloaded from the journals’ websites (Hiranniramol et al., 2020; Kim et al., 2019; Wang et al., 2019; Xiang et al., 2021). The Cheruiyot dataset was obtained from the authors (Cheruiyot et al., 2021). The CHChen and Fiona_Breast datasets were downloaded from the journals’ websites (Behan et al., 2019; Chen et al., 2018). The Achilles dataset was originally generated in Project Achilles and hosted on the Cancer Dependency Map Portal (DepMap). It is publicly available on Figshare (Tsherniak et al., 2017). The Gagnon_2014, Varshney_2015 and Moreno-Mateos_2015 datasets from T7 expression systems were curated by Haeussler et al. (2016). 
For the datasets without originally provided efficiency labels, the gRNAs were first ranked by the corresponding experimental values. Then, we assigned ‘high’ and ‘low’ efficiency labels to the top and bottom 20% gRNAs, respectively. Details of all included validation datasets are described in Supplementary Table S2.
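The labeling scheme above can be sketched in a few lines. This is a minimal illustration with a made-up toy dataset, not the study's actual processing code; the function and column names are hypothetical:

```python
import numpy as np
import pandas as pd

def assign_efficiency_labels(df, value_col, frac=0.20):
    """Label the top/bottom `frac` of gRNAs by experimental value.

    Returns a copy with a 'label' column: 'high', 'low', or missing
    for gRNAs in the middle of the distribution.
    """
    out = df.copy()
    ranked = out[value_col].rank(pct=True)  # percentile rank in (0, 1]
    out["label"] = np.where(ranked > 1 - frac, "high",
                   np.where(ranked <= frac, "low", None))
    return out

# Toy example with 10 gRNAs and made-up activity values
toy = pd.DataFrame({"activity": [0.1, 0.9, 0.5, 0.3, 0.8,
                                 0.2, 0.7, 0.4, 0.6, 0.05]})
labeled = assign_efficiency_labels(toy, "activity")
```

With 10 gRNAs and a 20% cutoff, the two most and two least active gRNAs receive labels; the rest remain unlabeled and are excluded from ROC analysis.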
2.3 Assessment of model performance
In this study, we employed multiple computational methods to evaluate the performance of individual algorithms. Specifically, Spearman correlation analysis was performed to assess the non-parametric correlation between the prediction scores and experimental values. Receiver operating characteristic (ROC) curve analysis was performed to assess the diagnostic ability of a prediction model based on both prediction sensitivity and specificity. Random forest (RF)-based feature analysis was performed with the Python scikit-learn package to determine the relative contribution of each algorithm in the final ensemble model. During the training of the RF model, a specified number (n = 5000, assigned to ‘n_estimators’) of independent decision trees were trained. Each decision tree was trained on a random subset of the features, and the importance of a feature was then quantified collectively across all trees.
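The two evaluation metrics can be computed with standard library calls. The sketch below uses simulated activities and predictions (not data from this study) to show both metrics for one algorithm/dataset pair; the 20% label cutoff follows Section 2.2:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical data: experimental activities and one algorithm's scores
activity = rng.uniform(0, 1, 200)
scores = activity + rng.normal(0, 0.3, 200)   # a deliberately noisy predictor

# Spearman correlation between prediction scores and experimental values
rho, pval = spearmanr(scores, activity)

# ROC analysis: 'high' (top 20%) versus 'low' (bottom 20%) efficiency labels
hi, lo = np.quantile(activity, [0.8, 0.2])
mask = (activity >= hi) | (activity <= lo)
labels = (activity[mask] >= hi).astype(int)
auc = roc_auc_score(labels, scores[mask])
```

Because only the extreme 40% of gRNAs carry labels, the AUC reflects discrimination between clearly efficient and clearly inefficient guides, which is typically an easier task than ranking the full dataset.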
3 Results
3.1 Evaluating the performance of public algorithms for prediction of CRISPR/Cas9 editing efficiency
We selected fifteen published algorithms that were developed for predicting gRNA efficiency (see Section 2, Supplementary Table S1). These algorithms employed various computational techniques as well as diverse training datasets to predict the efficiency of gRNAs. Specifically, CRISPRon, DeepSpCas9, DeepHF and DeepCRISPR applied deep learning neural networks to their prediction models; RuleSet2, predictSGRNA, SSC, E-crisp, TUSCAN, ge-CRISPR, sgRNA Scorer 2.0 and CRISPRpred employed single machine learning models, whereas sgDesigner and TSAM combined multiple machine learning models into ensemble models; uCRISPR relied on an empirical scoring function based on experimental observations. As for the experimental strategies for generating model training data, CRISPRon, DeepSpCas9, DeepHF and sgDesigner relied on DNA sequencing of edited target plasmid libraries for direct quantitation of CRISPR/Cas9 efficiency; in contrast, the other algorithms relied on DNA sequencing data from genome-wide functional screens with CRISPR/Cas9 libraries. To comprehensively evaluate the performance of individual algorithms, we collected 16 independent datasets, encompassing more than 90 000 gRNAs, from various experimental sources. This collection of data included the training datasets used to build gRNA prediction algorithms (Chari et al., 2015; Doench et al., 2014; 2016; Hart et al., 2015; Hiranniramol et al., 2020; Kim et al., 2019; Wang et al., 2019; Xiang et al., 2021; Xu et al., 2015) as well as other datasets from high-throughput screening studies (Behan et al., 2019; Chen et al., 2018; Cheruiyot et al., 2021; Shalem et al., 2014; Tsherniak et al., 2017) (Supplementary Table S2) that are independent of all included algorithms.
We first conducted correlation analysis to evaluate the prediction performance of individual algorithms on the testing datasets. To this end, Spearman correlation analysis was performed to assess the predictive power of individual algorithms. Specifically, for each pair of the algorithm and testing dataset, we computed Spearman correlation coefficient between the prediction scores and the experimental values of gRNA activities (Fig. 1). With each testing dataset, individual algorithms were ranked based on their correlation coefficients. Then, the average rank from all datasets was calculated for each algorithm. Overfitting is a common concern in data modeling as the model may perform well on its training data, but not on independent testing data. Thus, in this ranking analysis, we only included the rank for an algorithm/dataset pair if the dataset had not been used to train the algorithm.
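The average-rank calculation with self-training exclusion can be expressed concisely in pandas. The sketch below uses a hypothetical 3-algorithm, 3-dataset correlation matrix rather than the real results in Figure 1:

```python
import pandas as pd
import numpy as np

# Hypothetical Spearman coefficients: rows = algorithms, cols = datasets
corr = pd.DataFrame(
    {"dataset_A": [0.60, 0.55, 0.40],
     "dataset_B": [0.50, 0.65, 0.35],
     "dataset_C": [0.45, 0.48, 0.70]},
    index=["algo1", "algo2", "algo3"])

# Mask algorithm/dataset pairs where the dataset trained the algorithm
self_pairs = {("algo3", "dataset_C")}   # pretend algo3 was trained on dataset_C
masked = corr.copy()
for algo, ds in self_pairs:
    masked.loc[algo, ds] = np.nan

# Rank within each dataset (1 = best), then average across datasets;
# the masked self-training entries are skipped automatically
ranks = masked.rank(axis=0, ascending=False)
avg_rank = ranks.mean(axis=1, skipna=True)
```

Here algo3's high correlation on its own training data is excluded, so its average rank reflects only the two independent datasets, mirroring the overfitting safeguard described above.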
Fig. 1.

Performance comparison of 15 public algorithms with Spearman correlation analysis. The heatmap matrix table presents the Spearman correlation coefficients between algorithm prediction scores and corresponding experimental values in individual validation datasets. In this table, each row represents one prediction algorithm, and each column represents one testing dataset. The correlation between an algorithm and its own training dataset is marked in gray. The average rank for each algorithm was computed by averaging individual performance ranks with all independent datasets (i.e. algorithm self-training datasets, as marked in gray, were excluded from the analysis)
From the correlation matrix, most deep learning models showed better predictive power than conventional models, with the top-performing algorithms being CRISPRon, DeepSpCas9 and DeepHF. Besides the adoption of deep learning techniques, these models also took advantage of high-quality training datasets generated from large-scale plasmid libraries for direct quantification of gRNA efficiency. In our analysis, CRISPRon ranked first when applied to 11 out of 14 validation datasets (CRISPRon_train and DeepSpCas9_train were excluded from analysis as they were used to train CRISPRon). The average rank of CRISPRon was 1.36 when all validation results were combined. In the same way, the average ranks of DeepHF and DeepSpCas9 were computed to be 2.69 and 2.93, respectively.
Among other models, one promising computational strategy is to combine multiple single models into ensemble models. The advantages of this ensemble approach were exemplified by sgDesigner and TSAM, which outperformed other single-model algorithms in overall prediction performance. In combined analysis of all validation results, the average ranks of sgDesigner and TSAM were 3.8 and 4.54, respectively. As for other single-model algorithms, predictSGRNA and RuleSet2 had comparable overall performance, with the respective average ranks being 6.57 and 6.77. Both algorithms outperformed the remaining ones trained using single machine learning methods. Although most algorithms were trained using machine learning methods, one exception is uCRISPR, which was based on an empirical scoring scheme. uCRISPR had relatively robust performance, with the average rank being 5.8. By correlating prediction results to experimental testing data, we demonstrated that CRISPRon, DeepSpCas9, DeepHF, sgDesigner and TSAM were the top-performing algorithms according to average correlation rank.
We further performed receiver operating characteristic (ROC) curve analysis to evaluate the overall sensitivity and specificity of individual algorithms. Fifteen published datasets were directly used to construct ROC curves for algorithm evaluation, and the area-under-the-curve (AUC) values of these ROC curves were determined to quantify the performance of individual algorithms (Fig. 2). As in the correlation analysis, we also calculated the average AUC rank of each algorithm when applied to all testing datasets. For each algorithm, the self-training datasets were excluded from the performance ranking.
Fig. 2.

Performance comparison of 15 public algorithms with ROC analysis. The heat map matrix table presents the AUC values of the ROC curves generated by comparing the algorithm prediction scores with the corresponding efficiency labels (high versus low) from each validation dataset. In this table, each row represents one prediction algorithm, and each column represents one testing dataset. The AUC between an algorithm and its own training data is marked in gray. The average rank for each algorithm was computed by averaging individual performance ranks with all independent datasets
According to the AUC matrix, we identified the same top-performing deep learning algorithms as in the correlation analysis. Specifically, CRISPRon had AUC values in the range of 0.792–0.963 for independent testing datasets, with the best average rank of 1.5 among all individual algorithms. Similarly, DeepSpCas9 and DeepHF were also top-performing algorithms, with average ranks of 2.8 and 2.92, respectively. sgDesigner and TSAM, which adopted ensemble modeling strategies, had average ranks of 4.13 and 4.62, respectively. They performed worse than the top-performing deep learning algorithms but better than other single-model algorithms, consistent with the correlation analysis. Among the remaining algorithms, we observed variable algorithm rankings when different validation datasets were tested. For instance, ge-CRISPR performed worse than E-crisp on the XuHL60 dataset but better on the Chari_293T dataset. From the ROC analysis, it was clear that CRISPRon, DeepSpCas9, DeepHF, sgDesigner and TSAM consistently outperformed the other algorithms when generally applied to various testing datasets.
In summary, from both Spearman correlation analysis and ROC analysis, we demonstrated that CRISPRon, DeepSpCas9, DeepHF, sgDesigner and TSAM were consistently identified as the top-performing algorithms for predicting the efficacy of gRNAs.
3.2 Developing an ensemble model for improved prediction of CRISPR/Cas9 efficiency
Although the top-performing algorithms produced robust prediction results, there was significant room for improvement based on both ROC and correlation analyses. In our previous experience, ensemble models (e.g. sgDesigner) that integrated multiple algorithms had superior performance when compared with any individual algorithm included in the ensemble model. Based on that observation, we hypothesized that, by assembling top-performing algorithms, it would be possible to generate an integrated model with improved performance over any individual top-performing algorithm.
To achieve this goal, one computational strategy is the stacking ensemble method. The stacking ensemble models have two model layers. The first layer contains multiple individual models to be assembled, with each model generating prediction results independently. The second layer is a meta-model that collects the predictions from the first layer as input features to generate final integrated prediction results. The stacking method not only can be used to assemble various internal models but also can be adopted as a framework to integrate various published algorithms (Fig. 3).
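The two-layer structure described above can be sketched with a ridge meta-model, which is also the meta-model ultimately selected in Section 3.2.3. The first-layer scores below are simulated stand-ins; in the real pipeline they would be the outputs of the individual published algorithms:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# First layer: hypothetical prediction scores from k individual
# algorithms on n training gRNAs (simulated here, not real outputs)
n, k = 500, 5
true_eff = rng.uniform(0, 1, n)
first_layer = np.column_stack(
    [true_eff + rng.normal(0, 0.2 + 0.05 * i, n) for i in range(k)])

# Second layer: a meta-model trained on the first-layer scores
meta = Ridge(alpha=1.0)
meta.fit(first_layer, true_eff)

# Final ensemble prediction for new gRNAs = meta-model applied to
# the individual algorithms' scores for those gRNAs
new_scores = first_layer[:10]
ensemble_pred = meta.predict(new_scores)
```

Because each first-layer algorithm carries independent noise, the meta-model can weight and combine them to track the true efficacy more closely than any single score.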
Fig. 3.

The stacking ensemble modeling strategy. In the first layer, the gRNA sequences were input to individual algorithms to generate prediction scores. Then, these scores were merged as input features for second layer modeling. Specifically, a meta-model was used in the second layer to predict the final score for gRNA efficiency
3.2.1 Selection of training data to build ensemble models
For machine learning and deep learning models, high-quality training data are crucial for robust performance. From the Spearman correlation analysis on public algorithms and datasets, we found that the four algorithms with top performance implemented similar strategies to generate their training data (Hiranniramol et al., 2020; Kim et al., 2019; Wang et al., 2019; Xiang et al., 2021). Specifically, a pool of oligonucleotides was first designed and synthesized, which contained the gRNA sequences and corresponding target regions. A plasmid library containing these oligonucleotides was transduced into Cas9-expressing cells, which led to indel formation at the target region with frequencies dependent on the gRNA on-target activity. The final products were amplified and the indel frequency of every design was determined by deep sequencing. In other words, the efficacy of each gRNA was determined by the observed rate of mutagenesis at the target site. In this way, we focused on CRISPR/Cas9 cleavage of double-stranded DNA that led to detectable gene mutations. The high quality of these datasets was clearly demonstrated by the high performance of the respective algorithms being trained, using either machine learning or deep learning methods. In addition, these experimental datasets also had relatively high Spearman correlation coefficients with the prediction results from other public algorithms. In our analysis, we combined the high-quality training data for CRISPRon, DeepSpCas9, DeepHF and sgDesigner (Supplementary Table S2) to train new ensemble models.
3.2.2 Ranking the relative contribution of individual algorithms in the ensemble model
One crucial step in developing ensemble models is to decide which individual algorithms should be included to further boost the performance of the final model. To this end, the relative contribution of individual algorithms was evaluated by feature importance analysis in random forest (RF) modeling. Specifically, RF models were trained with the prediction scores of individual algorithms on each non-self training dataset (CRISPRon_train, DeepSpCas9_train, DeepHF_train or sgDesigner_train). In other words, algorithm prediction scores on self-training datasets were excluded from the analysis to reduce overfitting concerns. Prediction results from selected algorithms were treated as input features for the second-layer RF model (Fig. 3), and the Mean Decrease in Impurity (MDI) score computed by the model was used to assess the contribution of each algorithm. As we were interested in both classification and regression-based ensemble models, the RF classifier and regressor were separately applied for algorithm evaluation. The MDI score for each subset of the training data is presented in Supplementary Figures S1 and S2 for the regression and classification strategies, respectively. Then, combined MDI scores from all subsets of the training data were used to determine the overall feature ranking (Fig. 4a and b for the regression and classification models, respectively). From this RF analysis, the algorithms with top performance in Spearman correlation analysis (i.e. CRISPRon, DeepSpCas9, DeepHF, sgDesigner in Fig. 1) maintained their top ranks. In contrast, for other algorithms, we observed inconsistent ranking between RF analysis and Spearman correlation analysis, suggesting that correlative results may not fully reflect the independent contribution of each algorithm in the ensemble model.
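The MDI-based ranking can be reproduced with scikit-learn's built-in `feature_importances_` attribute. The sketch below uses simulated first-layer scores and a much smaller forest than the paper's n_estimators = 5000, purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Hypothetical first-layer scores from 4 algorithms on 300 gRNAs;
# the first simulated algorithm tracks the experimental values most closely
y = rng.uniform(0, 1, 300)
X = np.column_stack([y + rng.normal(0, s, 300)
                     for s in (0.1, 0.3, 0.5, 0.8)])

# A smaller forest than the paper's (n_estimators = 5000), for speed
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)

# Mean Decrease in Impurity (MDI) scores; higher = larger contribution
mdi = rf.feature_importances_
ranking = np.argsort(mdi)[::-1]   # algorithms ordered by contribution
```

For the classification strategy, `RandomForestClassifier` exposes the same `feature_importances_` attribute, so the identical ranking procedure applies to both model types.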
Fig. 4.

Contribution of individual algorithms and optimal order of the ensemble model. The feature importance score (MDI) of each algorithm was computed using either (a) random forest regressor or (b) random forest classifier, as presented in the boxplots. Further, the optimal order of the ensemble model was determined by stepwise addition of individual algorithms as ranked by either (c) the random forest regressor or (d) the random forest classifier. The optimal order of the ensemble model is indicated by the peak in each curve
3.2.3 Developing ensemble models by integrating top-ranking individual algorithms
We tested two common ensemble strategies that were based on either regression or classification models. For both strategies, we implemented different machine learning meta-models in the Stacking framework (Fig. 3). After assembling all 15 algorithms into meta-models, we evaluated the prediction performance of each ensemble model by Spearman correlation analysis with 11 independent validation datasets. For regression-based ensemble modeling, we evaluated ridge regression, lasso regression, XGBoost regression and random forest regression models. Among them, ridge regression was ranked as the top ensemble model for 7 out of 11 validation datasets (Supplementary Fig. S3a). For classification-based analysis, we compared ensemble models based on logistic regression, XGBoost classification and random forest classification. Logistic regression outperformed other ensemble models for 9 out of 11 validation datasets (Supplementary Fig. S3b). Thus, we implemented ridge regression and logistic regression as the meta-models for further regression and classification modeling analyses, respectively.
Then, we tested various collections of individual algorithms as input to maximize the performance of the ensemble model. Specifically, we started the process by assembling the two best-performing algorithms based on relative feature importance (Fig. 4a and b). Then, the ensemble model was expanded by stepwise inclusion of the next best-performing algorithm. This process was repeated, with one new algorithm added to the ensemble model at each step based on the algorithm ranking, until all 15 individual algorithms were assembled. We further tested these models by performing Spearman correlation analysis using leave-one-dataset-out cross-validation with the four training datasets. Specifically, for each iteration in the cross-validation analysis, we trained the ensemble model using three training datasets and tested it on the remaining fourth dataset by Spearman correlation analysis. This process was repeated until all four training datasets had been used as testing data. Then, the performance of the ensemble model was evaluated by the average value of the Spearman correlation coefficients. The performance of the regression and classification models built on various numbers of individual algorithms is summarized in Figure 4c and d, respectively. Interestingly, performance of the regression model peaked when the top five algorithms were assembled (Fig. 4c); similarly, maximal performance of the classification model was observed when the top six algorithms were assembled (Fig. 4d). However, after more algorithms were assembled into the models, the prediction performance decreased as assessed by the Spearman correlation coefficient. One possible explanation for this observation is that assembling algorithms with relatively poor performance worsens the final prediction results, reflecting the nature of the Stacking ensemble method.
Thus, five and six top-ranking algorithms were included as input to build the final regression and classification ensemble models, respectively.
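The stepwise expansion with leave-one-dataset-out cross-validation described above might be sketched as follows. All datasets and first-layer scores below are simulated placeholders, with algorithms already column-ordered by importance rank:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)

# Hypothetical setup: 4 training datasets, each with first-layer scores
# from 6 algorithms (columns ordered by importance) and measured efficacies
datasets = []
for _ in range(4):
    y = rng.uniform(0, 1, 200)
    X = np.column_stack([y + rng.normal(0, 0.2 + 0.1 * i, 200)
                         for i in range(6)])
    datasets.append((X, y))

def loo_dataset_cv(n_algos):
    """Leave-one-dataset-out CV using the top n_algos features."""
    rhos = []
    for held_out in range(4):
        train = [d for i, d in enumerate(datasets) if i != held_out]
        X_tr = np.vstack([X[:, :n_algos] for X, _ in train])
        y_tr = np.concatenate([y for _, y in train])
        X_te, y_te = datasets[held_out]
        pred = Ridge().fit(X_tr, y_tr).predict(X_te[:, :n_algos])
        rhos.append(spearmanr(pred, y_te)[0])
    return float(np.mean(rhos))

# Stepwise expansion: assemble the top 2, 3, ... algorithms; pick the peak
curve = {k: loo_dataset_cv(k) for k in range(2, 7)}
best_k = max(curve, key=curve.get)
```

In the paper's analysis, the analogous curve peaked at five algorithms for the regression model and six for the classification model, which determined the final ensemble compositions.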
3.3 Validation of the ensemble models with independent datasets
Among the 16 collected gRNA datasets, our ensemble models were trained with CRISPRon_train, DeepSpCas9_train, DeepHF_train and sgDesigner_train. Thus, we used the other 12 datasets for independent model validation. The performance of the ensemble models on individual datasets was evaluated by both Spearman correlation analysis (Fig. 5a) and ROC analysis (Fig. 6a). Both regression and classification ensemble models generated robust prediction results, with the regression model outperforming the classification model in 10 out of 12 validation datasets by correlation analysis, and 9 out of 12 datasets by ROC analysis. Moreover, compared with CRISPRon (the top-performing individual algorithm), the ensemble regression model produced improved Spearman correlation results for 10 out of 12 validation datasets and tied for the remaining two datasets. Similarly, the ensemble regression model had higher AUC-ROC values than CRISPRon for 10 out of 12 validation datasets. To summarize the performance of the ensemble models, the average values of the Spearman correlation coefficients and the AUC-ROC for all included datasets were calculated. In this analysis, the training datasets of respective ensemble models were excluded to reduce overfitting bias. From Spearman correlation analysis, the ensemble_ridge and ensemble_logistic models had average correlation coefficients of 0.54 and 0.52, respectively, both of which were higher than the average correlation coefficients from all individual algorithms (Fig. 1). We also observed similar results from the ROC analysis, where the average AUC values of the ensemble models (0.89 and 0.88 for ensemble_ridge and ensemble_logistic, respectively) showed improved performance over all individual algorithms (Fig. 2).
Fig. 5.

Validation of the ensemble models by Spearman correlation analysis. (a) The Spearman correlation coefficients between the prediction scores and the experimental values were calculated using 12 independent validation datasets. (b) Performance comparison of the ensemble models with included public algorithms, ordered by the average rank of the correlation coefficients computed with individual validation datasets
Fig. 6.

Validation of the ensemble models by ROC analysis. (a) The AUC-ROC values were determined using 12 independent validation datasets. (b) Performance comparison of the ensemble models with included public algorithms, ordered by the average rank of the AUC-ROC values computed with individual validation datasets
Next, the rank distributions of the models among all validation datasets were determined to assess general model performance across experimental conditions. Specifically, for each model, we computed its performance rank for every validation dataset as determined either by correlation coefficient or AUC-ROC. To alleviate overfitting concerns, training datasets for respective models were not considered in this analysis. In this way, the rank distributions for each model were determined, and are presented in Figure 5b for correlation rank and Figure 6b for AUC rank, respectively. From these results, the mean correlation ranks of the two ensemble models were 1.25 for ensemble_ridge and 2.75 for ensemble_logistic, both better than that of any individual algorithm (Fig. 5b). Similarly, the two ensemble models had the best mean AUC ranks, with 1.42 for ensemble_ridge and 2.5 for ensemble_logistic, respectively (Fig. 6b). In summary, the two ensemble models outperformed all individual algorithms as evaluated by both correlation and AUC ranks.
Based on these validation analyses, we concluded that the two ensemble models had improved performance over all included individual algorithms, as shown by consistent results across various experimental datasets. In particular, the regression ensemble model, ensemble_ridge, demonstrated the best performance. To further demonstrate the superior performance of ensemble_ridge, we conducted overall Spearman correlation analysis (Supplementary Fig. S5), which correlated algorithm prediction scores to experimental data. In this analysis, we determined the overall Spearman correlation performance of ensemble_ridge, as well as other public algorithms on the combined dataset consisting of all twelve independent validation datasets (CRISPRon_train, DeepSpCas9_train, DeepHF_train and sgDesigner_train were excluded). Specifically, for each algorithm, the prediction scores of the twelve validation datasets were obtained. Then, only prediction results of independent validation datasets were combined for Spearman correlation analysis. Among all individual algorithms, CRISPRon had the best performance (r = 0.445), followed by DeepHF (r = 0.416), DeepSpCas9 (r = 0.399) and sgDesigner (r = 0.396). The ensemble_ridge model outperformed all individual algorithms in this correlative analysis (r = 0.471).
From the aforementioned comparative analyses, CRISPRon was the best single algorithm among the 15 public algorithms, while ensemble_ridge outperformed CRISPRon consistently across various validation datasets. To examine the cases where the prediction results of ensemble_ridge and CRISPRon disagreed, we conducted a case study using the Achilles dataset. To this end, we normalized the original efficacy and prediction scores from individual algorithms to percentage rank values. Then, the performance difference between ensemble_ridge and CRISPRon was assessed. To focus on the most discrepant cases, 50 gRNAs with the largest percentage rank differences between ensemble_ridge and CRISPRon were selected for further examination. As shown in Supplementary Table S4, individual algorithms including CRISPRon had variable performance on predicting the efficacy of these gRNAs. Specifically, CRISPRon, DeepSpCas9, DeepHF, sgDesigner and TSAM had the best performance on 1, 6, 20, 18 and 5 gRNAs, respectively. With the adoption of the ensemble strategy, the ensemble model balanced the performance of individual algorithms, leading to more stable and robust performance overall than any single algorithm. As a result, ensemble_ridge correlated better with experimental data than CRISPRon for 86% of the 50 selected gRNAs.
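The percentage-rank normalization and discrepancy selection used in this case study can be sketched as follows, with simulated scores standing in for the real Achilles data and the two models' predictions:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(4)

# Hypothetical scores on one validation dataset
measured = rng.uniform(0, 1, 1000)
score_a = measured + rng.normal(0, 0.3, 1000)   # stand-in for ensemble_ridge
score_b = measured + rng.normal(0, 0.3, 1000)   # stand-in for CRISPRon

def pct_rank(x):
    """Normalize values to percentage ranks in (0, 100]."""
    return 100.0 * rankdata(x) / len(x)

# Discrepancy between the two predictors, in percentage-rank units
diff = np.abs(pct_rank(score_a) - pct_rank(score_b))

# Indices of the 50 most discrepant gRNAs for closer inspection
most_discrepant = np.argsort(diff)[::-1][:50]
```

Working in percentage-rank space puts experimental efficacies and each algorithm's raw scores on a common scale, so prediction errors can be compared directly across algorithms with different score distributions.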
3.4 Re-training the ensemble model for the T7 promoter system
Previous studies reported that prediction algorithms trained with U6 promoter-based datasets may not be ideal for predicting the efficacy of gRNAs designed for the T7 promoter system. To evaluate the performance of the new ensemble model, as well as five other top-performing algorithms, on T7 promoter-based datasets, we collected three datasets from zebrafish screens (Gagnon_2014, Varshney_2015 and Moreno-Mateos_2015). Then, we conducted Spearman correlation analysis between the prediction scores from each algorithm and the experimental gRNA efficiencies in the T7 datasets. Comparative analysis indicates that all these algorithms (trained with the U6 datasets) performed relatively poorly on the T7 datasets (Supplementary Fig. S4a), with Spearman correlations in the range of 0.074–0.321.
To address the data discrepancy between the U6 and T7 systems, the DeepHF and TSAM studies generated T7-focused algorithms (named DeepHF_T7 and TSAM_T7, respectively) by training with the Moreno-Mateos_2015 dataset. Comparative analysis indicates that these T7-tailored algorithms had improved performance over the corresponding U6-based algorithms (Supplementary Fig. S4b). Accordingly, we re-trained our ensemble_ridge model on the Moreno-Mateos_2015 dataset by assembling these two T7 algorithms (DeepHF_T7 and TSAM_T7). As shown in Supplementary Figure S4b, the resulting model, ensemble_ridge_T7, had the best performance on two independent validation datasets generated from the T7 expression system (Spearman correlations in the range of 0.385–0.513).
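The stacking step can be illustrated with a small ridge regression that combines base-algorithm scores into a single ensemble score. This is a minimal sketch with a closed-form solver, not the study's implementation; in practice a library such as scikit-learn's `Ridge` would be used, and the regularization strength would be tuned by cross-validation.

```python
def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an (unpenalized) intercept.
    X: list of feature rows (one score per base algorithm), y: targets."""
    A = [[1.0] + list(row) for row in X]  # prepend intercept column
    n, p = len(A), len(A[0])
    # normal equations: (A^T A + alpha * I) w = A^T y
    G = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]
    for i in range(1, p):       # skip the intercept term
        G[i][i] += alpha
    b = [sum(A[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Gauss-Jordan elimination (fine for this small, well-posed system)
    for i in range(p):
        piv = G[i][i]
        for j in range(i, p):
            G[i][j] /= piv
        b[i] /= piv
        for r in range(p):
            if r != i and G[r][i]:
                f = G[r][i]
                for j in range(i, p):
                    G[r][j] -= f * G[i][j]
                b[r] -= f * b[i]
    return b  # [intercept, weight_1, ..., weight_m]

def ridge_predict(w, X):
    """Ensemble score: intercept plus weighted sum of base-algorithm scores."""
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)) for row in X]
```

Re-training for the T7 system then amounts to re-fitting the same small model with DeepHF_T7 and TSAM_T7 scores as the two features, using the Moreno-Mateos_2015 efficacies as targets.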
3.5 Online web server for gRNA design
In our study, with the final ensemble model ‘ensemble_ridge’, we performed genome-wide design of gRNAs predicted to have high efficacy against human and mouse genes. Specifically, for each gene, we searched for potential target sites within the 5′-end exons (the first 70% of the coding sequence). This focus on 5′-end exons was intended to maximize functional disruption of the gene by frameshift mutations. Besides on-target activity, we also predicted the specificity of each gRNA using our previously published algorithm (Wong et al., 2015). Specifically, we performed a gRNA seed search to identify potential off-target sites with any 13-mer seed match to the gRNA sequence. In parallel, BLAST alignment was conducted to search for potential off-target sequences with at least 85% overall homology to the gRNAs. By applying these specificity filters, we selected gRNAs with reduced off-target effects. Of note, previous studies indicate that limited off-target editing (i.e. involving <20 nucleotides) has little functional consequence when the target sites are within intergenic regions (Doench et al., 2014; Ho et al., 2015). Therefore, we focused our off-target analysis on all known exon regions, from both protein-coding genes and non-coding genes. An online database resource, CRISPRDB, which contains all pre-designed gRNAs, is freely accessible at https://crisprdb.org. CRISPRDB ranks all gRNAs by predicted efficacy and presents up to 20 gRNAs with the highest scores. This online resource also provides a web server interface for custom design of gRNAs from user-provided sequences: users can input target site sequences and use either the ‘ensemble_ridge’ or ‘ensemble_ridge_T7’ model to design gRNAs for the U6 or T7 expression system, respectively.
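The 13-mer seed filter described above can be sketched as follows. This is an illustrative, assumption-laden version: it flags a gRNA if any 13-mer window of its sequence occurs in a background set of exon sequences, whereas the published filter may restrict the seed to the PAM-proximal region and must exclude the intended target site itself. The sequences and function names here are hypothetical.

```python
SEED_LEN = 13

def kmers(seq, k=SEED_LEN):
    """All k-mer windows of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_offtarget_index(exon_seqs, k=SEED_LEN):
    """Index every k-mer occurring in the given exon sequences."""
    index = set()
    for seq in exon_seqs:
        index |= kmers(seq.upper(), k)
    return index

def has_seed_match(grna, index, k=SEED_LEN):
    """True if any 13-mer of the gRNA occurs in the off-target index,
    i.e. the gRNA would be rejected by this seed filter."""
    return any(mer in index for mer in kmers(grna.upper(), k))
```

A set-based k-mer index keeps the per-gRNA check fast, which matters for a genome-wide design run; the complementary BLAST step catches near-matches (≥85% homology) that an exact 13-mer lookup would miss.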
4 Discussion
Although the CRISPR/Cas9 system has been widely used, it remains a challenge to consistently design gRNAs with high cleavage efficiency. Previous studies described numerous computational algorithms for predicting gRNA on-target activity, based on various experimental datasets and modeling methods. For experimental datasets, many studies adopted biological enrichment schemes, in which gene editing events impact cell survival or other observable biological phenotypes. Although these strategies have been widely used, such indirect biological readouts could introduce undesired noise into the training data for computational modeling, as equally efficient Cas9 cleavage sites may not result in equal phenotypic changes or survival pressure. Furthermore, gRNAs tested in experimental screens are typically designed for a subset of genes and tested in a single cell line. These restrictions could introduce biases specific to each experimental setting, such as differences in genomic accessibility, or cell line- or gene-specific responses to DNA cleavage. All these factors may reduce model generalizability. As a result, these algorithms often had inconsistent performance when predicting gRNA efficiency under different experimental settings. Nonetheless, we envision that common biological features learned from multiple experimental screening datasets, which are currently missing from artificial plasmid expression systems, would be valuable for further enhancing prediction performance. To this end, our ensemble model is a flexible framework that can easily incorporate new advances in this field for performance improvement.
One major issue in comparing gRNA design algorithms is that researchers usually validate their algorithms with different independent datasets and computational analysis strategies, which makes it difficult to directly compare the performance of algorithms reported by different studies. To fill this gap, we performed a comprehensive comparison of the 15 most widely used algorithms for gRNA design, using a variety of independent experimental datasets for performance validation. In this way, we ranked these algorithms by their overall performance across many independent datasets, providing useful guidance to CRISPR researchers for selecting the most appropriate gRNA design tools for their studies. Moreover, our comparison analysis revealed interesting features that are characteristic of top-performing prediction models. For example, all four top-performing algorithms applied the same experimental strategy for training data generation. Although these four algorithms adopted various modeling strategies, they consistently outperformed other algorithms, which highlights the paramount importance of high-quality data for model training.
As described above, we identified important characteristics that are common among top-performing models. Thus, it would be desirable to develop a new prediction algorithm that combines the advantages of the top-performing algorithms while avoiding their shortcomings. In computational modeling, the stacking ensemble method is commonly used to linearly combine prediction results from multiple models for further performance improvement (Naimi and Balzer, 2018). In the present study, we extended this stacking strategy to assembling top-performing published algorithms. Through validation analysis, the final ensemble model clearly demonstrated performance gains over the individual algorithms in the assembly. We hope this ensemble-learning method will further contribute to future improvement of gRNA design as more robust algorithms, based on high-quality training data and enhanced modeling strategies, become available.
Funding
This research was supported by grants [R35GM141535 and R01DE026471] from the National Institutes of Health.
Conflict of Interest: none declared.
Supplementary Material
Contributor Information
Yuhao Chen, Department of Pharmacology and Regenerative Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA; Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO 63112, USA.
Xiaowei Wang, Department of Pharmacology and Regenerative Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA; University of Illinois Cancer Center, Chicago, IL 60612, USA.
Data Availability
Our sgRNA design tool, CRISPRDB, is freely accessible as a web application via https://crisprdb.org. In addition, the source code and stand-alone version of CRISPRDB is distributed under GNU General Public License v3.0 and freely accessible at GitHub (https://github.com/wang-lab/CRISPRDB) and Figshare (https://doi.org/10.6084/m9.figshare.21295863).
References
- Behan F.M. et al. (2019) Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature, 568, 511–516.
- Chari R. et al. (2015) Unraveling CRISPR-Cas9 genome engineering parameters via a library-on-library approach. Nat. Methods, 12, 823–826.
- Chari R. et al. (2017) sgRNA scorer 2.0: a species-independent model to predict CRISPR/Cas9 activity. ACS Synth. Biol., 6, 902–904.
- Chen C.H. et al. (2018) Improved design and analysis of CRISPR knockout screens. Bioinformatics, 34, 4095–4101.
- Chen J.S. et al. (2017) Enhanced proofreading governs CRISPR-Cas9 targeting accuracy. Nature, 550, 407–410.
- Cheruiyot A. et al. (2021) Nonsense-mediated RNA decay is a unique vulnerability of cancer cells harboring SF3B1 or U2AF1 mutations. Cancer Res., 81, 4499–4513.
- Chuai G. et al. (2018) DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol., 19, 80.
- Doench J.G. et al. (2016) Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol., 34, 184–191.
- Doench J.G. et al. (2014) Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol., 32, 1262–1267.
- Doudna J.A., Charpentier E. (2014) Genome editing. The new frontier of genome engineering with CRISPR-Cas9. Science, 346, 1258096.
- Haeussler M. et al. (2016) Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biol., 17, 148.
- Hart T. et al. (2015) High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell, 163, 1515–1526.
- Heigwer F. et al. (2014) E-CRISP: fast CRISPR target site identification. Nat. Methods, 11, 122–123.
- Hiranniramol K. et al. (2020) Generalizable sgRNA design for improved CRISPR/Cas9 editing efficiency. Bioinformatics, 36, 2684–2689.
- Ho T.T. et al. (2015) Targeting non-coding RNAs with the CRISPR/Cas9 system in human cell lines. Nucleic Acids Res., 43, e17.
- Hsu P.D. et al. (2013) DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol., 31, 827–832.
- Kaur K. et al. (2016) ge-CRISPR – an integrated pipeline for the prediction and analysis of sgRNAs genome editing efficiency for CRISPR/Cas system. Sci. Rep., 6, 30870.
- Kim H.K. et al. (2019) SpCas9 activity prediction by DeepSpCas9, a deep learning-based model with high generalization performance. Sci. Adv., 5, eaax9249.
- Kleinstiver B.P. et al. (2016) High-fidelity CRISPR-Cas9 nucleases with no detectable genome-wide off-target effects. Nature, 529, 490–495.
- Konstantakos V. et al. (2022) CRISPR-Cas9 gRNA efficiency prediction: an overview of predictive tools and the role of deep learning. Nucleic Acids Res., 50, 3616–3637.
- Kuan P.F. et al. (2017) A systematic evaluation of nucleotide properties for CRISPR sgRNA design. BMC Bioinformatics, 18, 297.
- Labuhn M. et al. (2018) Refined sgRNA efficacy prediction improves large- and small-scale CRISPR-Cas9 applications. Nucleic Acids Res., 46, 1375–1385.
- Moreno-Mateos M.A. et al. (2015) CRISPRscan: designing highly efficient sgRNAs for CRISPR-Cas9 targeting in vivo. Nat. Methods, 12, 982–988.
- Naimi A.I., Balzer L.B. (2018) Stacked generalization: an introduction to super learning. Eur. J. Epidemiol., 33, 459–464.
- Peng H. et al. (2018) CRISPR/Cas9 cleavage efficiency regression through boosting algorithms and Markov sequence profiling. Bioinformatics, 34, 3069–3077.
- Rahman M.K., Rahman M.S. (2017) CRISPRpred: a flexible and efficient tool for sgRNAs on-target activity prediction in CRISPR/Cas9 systems. PLoS One, 12, e0181943.
- Shalem O. et al. (2014) Genome-scale CRISPR-Cas9 knockout screening in human cells. Science, 343, 84–87.
- Slaymaker I.M. et al. (2016) Rationally engineered Cas9 nucleases with improved specificity. Science, 351, 84–88.
- Tsherniak A. et al. (2017) Defining a cancer dependency map. Cell, 170, 564–576.e16.
- Wang D. et al. (2019) Optimized CRISPR guide RNA design for two high-fidelity Cas9 variants by deep learning. Nat. Commun., 10, 4284.
- Wang T. et al. (2014) Genetic screens in human cells using the CRISPR-Cas9 system. Science, 343, 80–84.
- Wilson L.O.W. et al. (2018) High activity target-site identification using phenotypic independent CRISPR-Cas9 core functionality. CRISPR J., 1, 182–190.
- Wong N. et al. (2015) WU-CRISPR: characteristics of functional guide RNAs for the CRISPR/Cas9 system. Genome Biol., 16, 218.
- Xiang X. et al. (2021) Enhancing CRISPR-Cas9 gRNA efficiency prediction by data integration and deep learning. Nat. Commun., 12, 3238.
- Xu H. et al. (2015) Sequence determinants of improved CRISPR sgRNA design. Genome Res., 25, 1147–1157.
- Zhang D. et al. (2019) Unified energetics analysis unravels SpCas9 cleavage activity for optimal gRNA design. Proc. Natl. Acad. Sci. USA, 116, 8693–8698.