2021 Sep 5;4(3):62. doi: 10.3390/mps4030062

Table 4.

List of ML prediction tools and the strategies they use.

| Tool Name | Prediction | Model | Datasets | Key-Points | Ref |
|---|---|---|---|---|---|
| CADD | Score of pathogenicity | First version: linear SVM. Later versions: L2-regularized logistic regression | Training datasets: Benign: evolutionarily neutral variants; Pathogenic: simulated de novo pathogenic variants. Testing datasets: Benign: benign variants; Pathogenic: pathogenic ClinVar variants, somatic cancer mutation frequencies | Effective tool for protein-coding impact prediction; may not be informative for poorly conserved regions | [74,75] |
| CryptSplice | Impact of variants on existing splice sites, cryptic splice site prediction | SVM with RBF kernel | True and false splice sites from GenBank-derived datasets | Identifies creation of cryptic acceptor/donor sites; relies on a rather outdated database | [78] |
| DARTS | Prediction of alternative splicing using both cis sequence features and mRNA levels of trans RBPs | DNN and Bayesian hypothesis testing | RNA-seq data (*) | Evaluation of RBP impact on splicing | [82] |
| MMSplice | Multiple predictions: exon skipping, competitive interactions, changes in splicing efficiency, and pathogenicity | Modular NN, linear and logistic regression | Donor/acceptor modules: GENCODE v24 true (known sites) and false (random sequences) splice sites. Exon/intron modules: MPRA data from [83] | Easily clinically applicable training set; contains false positive/unverified sites | [84] |
| MutPred Splice | Impact of coding region substitutions on disruption of pre-mRNA splicing | Linear SVM | Positive: HGMD exonic disease-causing/disease-associated variants. Negative: HGMD disease-causing missense, not reported to disrupt exon splicing, high-frequency exonic SNPs (SNP- from 1000 Genomes Project [85] | Suitable for use in an NGS high-throughput setting to identify and prioritize potentially splice-altering variants | [86] |
| PEPSI | Prediction of coding and noncoding variant impact on pre-mRNA splicing based on sequence conservation, RNA secondary structure, and regulatory sequence elements | Random forest regression model | Data obtained from the Vex-seq experiment (measurement of the ΔPSI of 2055 variants from the Exome Aggregation Consortium (ExAC; [Kircher et al., 2014]) v24 a selection of chromosomes as training set, the remaining ones as testing set (*) | Indels and intronic variants included | [80] |
| S-CAP | Score of variant pathogenicity using compartmentalization of genomic regions | DNN | Pathogenic variants selected from HGMD and ClinVar; benign variants from gnomAD | Evaluation of intronic pathogenic variants; variants lying more than 50 bp into the intron are not covered by the model | [79] |
| SPANR | Cassette exon skipping prediction | NN modeled on Bayesian framework | PSI values for all human exons across 16 tissues, based on the Illumina Human Body Map project (*) | Easy-to-use web server; pre-computed scores available for all eligible variants in the genome; evaluation of exon sequence only | [87] |
| SpliceAI | Prediction of variant impact on loss or gain of acceptor/donor sites | 32-layer DNN | Protein-coding transcripts from GENCODE v24 (a selection of chromosomes as training set, the remaining ones as testing set) (*) | Very powerful tool able to use a “near-agnostic” approach | [81] |
| SpliceFinder | Classification of variants based on impact on donor site, acceptor site or non-splice-site | CNN | Sequences of donor, acceptor, and non-splice-site, randomly selected from human reference genome (90% for training, 10% for testing, and then 20% of the training data for validation) | Non-canonical splice sites can also be predicted correctly; decreased number of false positives | [88] |
| TraP | Quantification of impact of variant on transcripts | Random forest | Benign: de novo mutations in healthy individuals. Pathogenic: selected synonymous variants associated with rare disease (*) | High performance in distinguishing pathogenic and benign variants, both intronic and synonymous; evaluation of potential impact of variants across multiple transcripts | [76] |

(*) Data from NGS experiments. SVM: Support Vector Machine; RBF: Radial Basis Function; DNN: Deep Neural Network; NN: Neural Network; CNN: Convolutional Neural Network.
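
Two of the dataset strategies listed in Table 4 can be illustrated with a small sketch: the chromosome-level hold-out described for SpliceAI ("a selection of chromosomes as training set, the remaining ones as testing set") and the SpliceFinder-style split (90% training, 10% testing, then 20% of the training data for validation). This is a minimal illustration under stated assumptions, not the tools' actual code: the function names, the toy records, and the held-out chromosomes `chr1` and `chr3` are hypothetical, and the table does not state which chromosomes each tool held out.

```python
import random


def split_by_chromosome(records, held_out):
    """Chromosome-level hold-out: records on held-out chromosomes form the test set."""
    train, test = [], []
    for rec in records:
        (test if rec["chrom"] in held_out else train).append(rec)
    return train, test


def split_90_10_then_20(items, seed=0):
    """Random 90/10 train/test split, then 20% of the training part kept for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = len(shuffled) // 10            # 10% for testing
    test, train_full = shuffled[:n_test], shuffled[n_test:]
    n_val = len(train_full) // 5            # 20% of the remaining 90% for validation
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test


# Toy records standing in for splice-site examples (hypothetical data).
records = [{"chrom": f"chr{c}", "pos": p} for c in (1, 2, 3, 4) for p in range(5)]

# Assumption: chr1 and chr3 are held out; chosen only for illustration.
train, test = split_by_chromosome(records, held_out={"chr1", "chr3"})
print(len(train), len(test))        # 10 10

tr, va, te = split_90_10_then_20(range(1000))
print(len(tr), len(va), len(te))    # 720 180 100
```

With 1000 items the second split yields 720 training, 180 validation, and 100 test examples, that is 72%, 18%, and 10% of the total. Splitting by chromosome rather than at random is commonly used to keep overlapping or neighbouring sequences from appearing in both the training and the test set.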