TABLE 3.
Databases associated with genome editing research for the development of AI models.
Dataset name | Data description | Data link | Type of editing | Target | Machine learning model used |
---|---|---|---|---|---|
CHANGE-seq data Lazzarotto et al. (2020), (publicly available) | There are a total of 201,934 off-target sites scattered throughout the human genome | https://github.com/tsailabSJ/changeseq | CRISPR-Cas9 GED | Off-target | GBT |
DeepHF data Wang et al. (2019), (publicly available) | For each nuclease, there are 50,000 gRNAs available, collectively targeting approximately 20,000 genes | http://www.deephf.com/ | CRISPR-Cas9 GED | On-target | RNN (recurrent neural network), Bi-LSTM (bidirectional long short-term memory) |
Abadi et al. (2017), (publicly available) | This comprises 33 sets of sgRNAs, each associated with its specific targets | https://doi.org/10.1371/journal.pcbi.1005807.s014 | CRISPR-Cas9 GED | Off-target | Random forest |
Genome CRISPR database Rauscher et al. (2017), (publicly available) | There is a total of 400,000 sgRNA sequences from the GenomeCRISPR project dataset | http://genomecrispr.org/ | CRISPR-Cas9 GED | On-target | CNN (named as DeepSgRNA) |
GUIDE-seq data Tsai et al. (2015), (publicly available) | RNA-guided nucleases were examined at different sites in two human cell lines, U2OS and HEK293 | https://github.com/tsailabSJ/guideseq | CRISPR-Cas9 GED | Off-target | CRISTA (CRISPR Target Assessment using RF regression) |
Arbab et al. (2020), (publicly available) | Data from as many as 10,638 sgRNA-target pairs were randomly divided into partitions | https://ars.els-cdn.com/content/image/1-s2.0-S0092867420306322-mmc5.csv | BED | On-target | BE-Hive (autoregressive neural network) |
Kim et al. (2023), (publicly available) | Nine Cas9 variants | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA821929/ | BED | On- and off-target | SVM, L1-regularized LR, L2-regularized LR, AdaBoost, and random forest |
Pallaseni et al. (2022), (publicly available) | Used the dataset from (Arbab et al., 2020; Song et al., 2020) | https://www.ebi.ac.uk/ena/browser/home | BED | Off-target | GBT |
Li et al. (2022), (private) | 1,134 target sequences | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA885770/ | BED | On-target | XGBoost |
Kim et al. (2021), (publicly available) | There are 54,836 pairs consisting of pegRNAs and their corresponding target sequences | https://github.com/julianeweller/MinsePIE | PED | On- and off-target | DeepPE |