Journal of Molecular Cell Biology. 2020 Jan 3;12(11):909–911. doi: 10.1093/jmcb/mjz116

GNL-Scorer: a generalized model for predicting CRISPR on-target activity by machine learning and featurization

Jun Wang 1,2, Xi Xiang 3,4, Lars Bolund 3,4,5, Xiuqing Zhang 1,3, Lixin Cheng 2, Yonglun Luo 3,4,5
Editor: Luonan Chen
PMCID: PMC7883820  PMID: 31900489

CRISPR/Cas is an adaptive immune system of bacteria and most archaea (Koonin and Makarova, 2009; Horvath and Barrangou, 2010). The CRISPR/Cas9 gene editing system comprises two key components, a small guide RNA (gRNA) and the Cas9 endonuclease (Deltcheva et al., 2011; Jinek et al., 2012). The gRNA is a chimeric RNA molecule of tracrRNA and crRNA (Ran et al., 2013) that guides the Cas9 protein to its target site in the genome. Therefore, selecting a target site with high on-target activity and a low off-target effect is crucial for gene editing. We previously showed that the gene editing activity of CRISPR-Cas9 in mammalian cells is affected by several factors, such as the secondary structure of the guide sequence and the chromatin accessibility of the target site (Jensen et al., 2017). Strikingly, results from us and other research groups consistently revealed that CRISPR gRNA activities are highly variable. Consequently, several in silico gRNA design web tools and algorithms have been developed to facilitate CRISPR design and applications.

Currently, machine learning-based tools are recommended for high-efficiency sgRNA design (Chuai et al., 2017). However, these tools have limitations when predicting gRNA activity across distinct species and data sets (Haeussler et al., 2016). Therefore, an algorithm with good cross-species generalization is needed, and the species generalization of on-target prediction warrants further investigation. Here, we propose GNL-Scorer (providing two scores, GNL and GNL-Human), which combines optimal data sets, models, and features to address the cross-species problem. GNL-Scorer comprises four main steps, as shown in Figure 1A.

Figure 1 Construction and evaluation of GNL-Scorer for predicting CRISPR sgRNA cleavage efficiency. (A) GNL-Scorer comprises four main steps: (i) 13 data sets were collected by literature mining and normalized to the same distribution to eliminate batch effects, yielding 19797 sgRNAs across 10 data sets. The HEL data set was held out as an independent test set, and the C. elegans and HL60 data sets were removed because of their abnormal distributions. (ii) Eight candidate models were trained on each data set, and the optimal model for each data set was selected for downstream analysis. After testing the generalization of these models, the models trained on the Hela and Hct116 data sets were identified as generalizing best. (iii) Features were ranked by minimal-redundancy-maximal-relevance (mRMR) and then selected by incremental feature selection (IFS) to establish the top 200 important features for feature explanation. (iv) GNL and GNL-Human were compared with seven other algorithms. (B) Heatmap of SCCs among different models and data sets. Columns represent models trained on 80% of each data set, and rows represent tests on the remaining 20%. Darker color indicates higher SCC. (C) Boxplot of SCCs of the models trained by 5-fold cross-validation on each data set. (D) Comparison of model performance. The x-axis indicates six independent data sets from five species. GNL and GNL-Human were developed using the best performing data sets (Hela or Hela + Hct116) and the optimal feature set; they were tested together with the seven other machine learning-based models on the same independent validation sets. The y-axis shows test performance measured by SCC. (E) Comprehensive evaluation of the nine tools based on the six validation data sets. The horizontal line represents the median SCC, and the whiskers are the upper and lower quantile values.

Initially, we collected 10 data sets (Supplementary Text and Table S1), and each was trained with eight different models to identify the optimal model (Supplementary Figure S1). We first tested whether simply augmenting the training data improves prediction. Specifically, we trained both the normalized and non-normalized versions of the Doench (V1 + V2) and All In One (AIO) data sets using the Gradient Boosted Regression Tree, a state-of-the-art model used in gRNA Designer (rule set II), based on feature set I (Supplementary Text). Compared with the Doench (V1 + V2) data sets, the AIO data set did not improve the Spearman correlation coefficient (SCC) by simply increasing data volume without accounting for batch effects (Supplementary Figure S2). The SCC of the normalized AIO data set was significantly improved (P-value < 0.045) compared with the non-normalized one, whereas there was no significant difference between the normalized and non-normalized Doench (V1 + V2) data sets (Supplementary Figure S2). However, even with the combined normalized AIO data set, the SCC was significantly lower (P-value = 0.044) than that of the Doench (V1 + V2) data sets. These results collectively reveal strong batch effects among the data sets we collected, suggesting that algorithms developed on a specific training data set tend to be strongly biased when applied to other data sets.
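To make the normalization and evaluation steps concrete, the following is a minimal Python sketch of putting two activity data sets on a common scale before pooling them and of scoring predictions by Spearman correlation. Rank-based quantile normalization is used here purely for illustration; the exact normalization applied to the AIO data set, and the data sets themselves, are described in the Supplementary Text, and the variable names are hypothetical.

```python
# Minimal sketch: rank-based normalization of per-study activity values to a
# common (standard normal) distribution, then SCC between predictions and truth.
import numpy as np
from scipy.stats import rankdata, spearmanr, norm

def quantile_normalize(activity):
    """Map measured activities to standard-normal quantiles by rank."""
    ranks = rankdata(activity)                      # 1..n, ties averaged
    return norm.ppf(ranks / (len(activity) + 1.0))  # common target distribution

# Hypothetical per-study activity vectors with different scales/distributions.
study_a = np.random.gamma(2.0, 1.0, size=500)
study_b = np.random.beta(2.0, 5.0, size=800)

pooled = np.concatenate([quantile_normalize(study_a),
                         quantile_normalize(study_b)])

# Model quality is reported throughout as the Spearman correlation coefficient
# (SCC) between predicted and measured activities.
predicted = pooled + np.random.normal(scale=0.5, size=pooled.size)
scc, p_value = spearmanr(predicted, pooled)
print(f"SCC = {scc:.3f} (P = {p_value:.3g})")
```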

Next, the performance of each model was assessed across all 10 data sets (Figure 1A), and the optimal model with the highest SCC was selected. The corresponding data sets used for training were defined as the most powerful data sets (Figure 1B). We found that Hela, Hct116, and Doench_V1 were the three data sets that worked best across all the gRNA activity prediction models (Figure 1C). The activities of the gRNAs in the Hela and Hct116 data sets were measured by a quantitative sequencing-based method (Hart et al., 2015). Importantly, the gRNA activity data sets generated by sequencing showed the best generalization performance compared with their counterparts, indicating that data sets measured by sequencing-based methods tend to have high generalization ability. Similar to the FC and RES data sets (Fusi et al., 2015), the Hela and Hct116 data sets provide a ‘cleaner’ supervisory signal for machine learning. We then trained the Bayesian Ridge regression (BRR) model (MacKay, 1992; Tipping, 2001), which performed best after the model-filtering step, on four data sets: Hela, Hela + Doench_V1, Hela + Hct116, and Hela + Hct116 + Doench_V1 (Supplementary Figure S1). The results showed that the generalized prediction power of the combined Hct116 + Hela data set was comparable to that of the single Hela data set, and both were superior to the other data set combinations (Supplementary Figure S3).
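The cross-data-set generalization test behind Figure 1B and C can be sketched as follows: train a model on 80% of one data set and score both its own held-out 20% and gRNAs from a different data set. The feature matrices below are random placeholders standing in for the sequence/epigenetic features described later, so the printed numbers are meaningless; only the evaluation pattern is illustrative.

```python
# Sketch of within-set vs cross-set evaluation with Bayesian Ridge regression,
# the model ultimately selected by GNL-Scorer.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_hela, y_hela = rng.normal(size=(2000, 400)), rng.normal(size=2000)      # placeholder "Hela"
X_hct116, y_hct116 = rng.normal(size=(1800, 400)), rng.normal(size=1800)  # placeholder "Hct116"

# 80/20 split within the training data set, as in the heatmap of Figure 1B.
X_tr, X_te, y_tr, y_te = train_test_split(X_hela, y_hela, test_size=0.2,
                                          random_state=0)
model = BayesianRidge().fit(X_tr, y_tr)

scc_within, _ = spearmanr(model.predict(X_te), y_te)          # within-data-set
scc_cross, _ = spearmanr(model.predict(X_hct116), y_hct116)   # cross-data-set
print(f"within-set SCC = {scc_within:.3f}, cross-set SCC = {scc_cross:.3f}")
```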

After evaluating the importance of epigenetic features for model performance, the best performing feature set was selected. To identify the optimal feature combination for gRNA activity prediction, a feature set without epigenetic features was constructed (feature set II, Supplementary Text). For comparison, we also generated a combined set of 2701 features, including epigenetics-related ones (feature set III, Supplementary Text). The two best performing data sets, the normalized Hela and the normalized Hela + Hct116, were used for training. The highest SCC (0.4913, Supplementary Figure S4A) was achieved when 1839 features were used after incremental feature selection (IFS, Supplementary Text) in feature set III of the Hct116 + Hela data set. For feature set II (Supplementary Text), 1163 and 777 features were obtained as the best feature combinations by IFS, with the highest SCCs (0.502 and 0.499) based on the Hela and Hct116 + Hela data sets, respectively (Supplementary Figure S4A). To avoid overfitting, we selected a subset of the human Doench_V1 data (NB4, Supplementary Table S1) as test data to evaluate the feature selection step, because epigenetic features are available for the NB4 cell line. We also compared the IFS-trained models with the non-IFS-trained models using both feature sets II and III on the NB4 data set. Our results show that the model trained on Hct116 + Hela without IFS using feature set II performed best on the NB4 test data (SCC = 0.442, Supplementary Figure S4B), suggesting that the Hct116 + Hela model outperforms the Hela model on human data. The top 200 features in different categories are summarized in Supplementary Figure S4C.
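The IFS procedure itself follows a simple pattern: with features already ranked (e.g. by mRMR), add them in rank order and keep the prefix that maximizes the cross-validated SCC. The sketch below assumes the feature columns are pre-sorted by rank and uses placeholder data; the ranking step and the exact search grid used in the paper are described in the Supplementary Text.

```python
# Sketch of incremental feature selection (IFS) over mRMR-ranked features.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_predict

def incremental_feature_selection(X_ranked, y, step=50, cv=5):
    """Return the number of top-ranked features giving the best cross-validated SCC."""
    best_k, best_scc = 0, -np.inf
    for k in range(step, X_ranked.shape[1] + 1, step):
        pred = cross_val_predict(BayesianRidge(), X_ranked[:, :k], y, cv=cv)
        scc, _ = spearmanr(pred, y)
        if scc > best_scc:
            best_k, best_scc = k, scc
    return best_k, best_scc

# Placeholder data: columns of X are assumed to be sorted by mRMR rank.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 300)), rng.normal(size=1000)
k, scc = incremental_feature_selection(X, y)
print(f"best prefix: top {k} features, CV SCC = {scc:.3f}")
```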

Finally, we developed two generalization scores (GNL and GNL-Human) and evaluated their performance against other state-of-the-art models in predicting sgRNA cleavage efficiency by SCC. To develop an improved CRISPR gRNA activity prediction algorithm with decent generalization across cell types and species, GNL and GNL-Human were built using the BRR model, the two best performing data sets (Hela and Hct116), and feature set II (without IFS). Specifically, GNL was trained on the normalized Hela data set, and GNL-Human was trained on the Hela + Hct116 data set. Notably, both the Hela and Hct116 data sets were generated by sequencing, indicating that sequencing-based data sets outperform other sources for measuring sgRNA activity. For generalization evaluation, we compared GNL and GNL-Human with seven other counterparts on six test data sets from five species (Supplementary Text). GNL and GNL-Human ranked at the top for the majority of the test data sets (5 of 6). GNL performs well for almost all the other species, while GNL-Human tops the list for the human data set (HEL), revealing its practical potential for human cellular experiments (Figure 1D and E). In summary, we have developed two CRISPR gRNA activity prediction models with improved generalization and improved human gRNA activity prediction. The code of GNL-Scorer is available on GitHub (https://github.com/TerminatorJ/GNL_Scorer).
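As a rough illustration of the kind of featurization such scorers use, the sketch below encodes a target context as position-specific one-hot nucleotides plus a simple composition feature. It is a hypothetical simplification: the actual feature sets I to III are defined in the Supplementary Text, and the released implementation should be taken from the GitHub repository above.

```python
# Hypothetical featurization of a 30-nt target context (protospacer + flanks + PAM).
import numpy as np

NUCS = "ACGT"

def featurize(guide_with_context):
    """One-hot encode each position and append GC content."""
    seq = guide_with_context.upper()
    onehot = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        if base in NUCS:
            onehot[i, NUCS.index(base)] = 1.0
    gc = (seq.count("G") + seq.count("C")) / len(seq)   # GC content
    return np.concatenate([onehot.ravel(), [gc]])

example = "ACGTACGTACGTACGTACGTACGTCGGACG"   # hypothetical 30-nt context
features = featurize(example)
print(features.shape)   # (121,) = 30 positions x 4 bases + GC content
# A trained BayesianRidge model (as sketched above) would then map `features`
# to a predicted cleavage-efficiency score.
```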

Taken together, using the most representative (sequencing-based) data sets, we developed improved CRISPR gRNA activity prediction algorithms (GNL and GNL-Human) based on BRR, which will facilitate in silico CRISPR gRNA design and promote the appropriate application of CRISPR gene editing technology.

[Supplementary material is available at Journal of Molecular Cell Biology online. This work was supported by the Lundbeck Foundation (R219-2016-1375 and R173-2014-1105), the Danish Research Council for Independent Research (DFF-1337-00128 and 9041-00317B), the Sapere Aude Young Research Talent Prize (DFF-1335-00763A), the Innovation Fund Denmark (BrainStem), and Aarhus University Strategic Grant (AU-iCRISPR). The project is also partially supported by the Sanming Project of Medicine in Shenzhen (SZSM201612074), BGI-Research, and Guangdong Provincial Key Laboratory of Genome Read and Write (2017B030301011).]

Supplementary Material

mjz116_Supplementary_Material

References

  1. Chuai, G.-H., Wang, Q.-L., and Liu, Q. (2017). In silico meets in vivo: towards computational CRISPR-based sgRNA design. Trends Biotechnol. 35, 12–21.
  2. Deltcheva, E., Chylinski, K., Sharma, C.M., et al. (2011). CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 471, 602–607.
  3. Fusi, N., Smith, I., Doench, J., et al. (2015). In silico predictive modeling of CRISPR/Cas9 guide efficiency. bioRxiv, doi: 10.1101/021568.
  4. Haeussler, M., Schonig, K., Eckert, H., et al. (2016). Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biol. 17, 148.
  5. Hart, T., Chandrashekhar, M., Aregger, M., et al. (2015). High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell 163, 1515–1526.
  6. Horvath, P., and Barrangou, R. (2010). CRISPR/Cas, the immune system of bacteria and archaea. Science 327, 167–170.
  7. Jensen, K.T., Floe, L., Petersen, T.S., et al. (2017). Chromatin accessibility and guide sequence secondary structure affect CRISPR-Cas9 gene editing efficiency. FEBS Lett. 591, 1892–1901.
  8. Jinek, M., Chylinski, K., Fonfara, I., et al. (2012). A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–821.
  9. Koonin, E.V., and Makarova, K.S. (2009). CRISPR-Cas: an adaptive immunity system in prokaryotes. F1000 Bio. Rep. 59, 615–624.
  10. MacKay, D.J.C. (1992). Bayesian interpolation. Neural Comput. 4, 415–447.
  11. Ran, F.A., Hsu, P.D., Wright, J., et al. (2013). Genome engineering using the CRISPR-Cas9 system. Nat. Protoc. 8, 2281–2308.
  12. Tipping, M.E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244.

