. 2021 Sep 5;4(3):62. doi: 10.3390/mps4030062

Table 4.

List of ML prediction tools and the strategies they use.

Tool Name	Prediction	Model	Datasets	Key-Points	Ref
CADD	Score of pathogenicity	First version: linear SVM; later versions: L2-regularized logistic regression	Training datasets: Benign: evolutionarily neutral variants; Pathogenic: simulated de novo pathogenic variants. Testing datasets: Benign: benign variants; Pathogenic: pathogenic ClinVar variants, somatic cancer mutation frequencies	Effective tool for protein-coding impact prediction; may not be informative for poorly conserved regions	[74,75]
CryptSplice	Impact of variants on existing splice sites, cryptic splice site prediction	SVM with RBF kernel	True and false splice sites from GenBank-derived datasets	Identifies creation of cryptic acceptor/donor sites; relies on a rather outdated database	[78]
DARTS	Prediction of alternative splicing using both cis sequence features and mRNA levels of trans RBPs	DNN and Bayesian Hypothesis Testing	RNA-seq data (*)	Evaluation of RBP impact on splicing	[82]
MMSplice	Multiple predictions: exon skipping, competitive interactions, changes in splicing efficiency, and pathogenicity	Modular NN, linear and logistic regression	Donor/acceptor modules: GENCODE v24 true (known sites) and false (random sequences) splice sites. Exon/intron modules: MPRA data from [83]	Easily clinically applicable training set; contains false positive/unverified sites	[84]
MutPred Splice	Impact of coding region substitutions on disruption of pre-mRNA splicing	Linear SVM	Positive: HGMD exonic disease-causing/disease-associated variants. Negative: HGMD disease-causing missense variants not reported to disrupt exon splicing, high-frequency exonic SNPs (SNP-) from the 1000 Genomes Project [85]	Suitable for use in an NGS high-throughput setting to identify and prioritize potentially splice-altering variants	[86]
PEPSI	Prediction of coding and noncoding variant impact on pre-mRNA splicing based on sequence conservation, RNA secondary structure, and regulatory sequence elements	Random forest regression model	Data obtained from the Vex-seq experiment (measurement of the ΔPSI of 2055 variants from the Exome Aggregation Consortium (ExAC; [Kircher et al., 2014]) v24); a selection of chromosomes as training set, the remaining ones as testing set (*)	Indels and intronic variants included	[80]
S-CAP	Score of variant pathogenicity using compartmentalization of genomic regions	DNN	Pathogenic variants selected from HGMD and ClinVar; benign variants from gnomAD	Evaluation of intronic pathogenic variants; variants lying more than 50 bp into the intron are not covered by the model	[79]
SPANR	Cassette exon skipping prediction	NN modeled on Bayesian framework	PSI values for all human exons across 16 tissues, based on the Illumina Human Body Map project (*)	Easy-to-use web server; pre-computed scores available for all eligible variants in the genome; evaluates exon sequence only	[87]
SpliceAI	Prediction of variant impact on loss or gain of acceptor/donor sites	32-layer DNN	Protein-coding transcripts from GENCODE v24 (a selection of chromosomes as training set, the remaining ones as testing set) (*)	Very powerful tool able to use a “near-agnostic” approach	[81]
SpliceFinder	Classification of variants based on impact on donor site, acceptor site, or non-splice-site	CNN	Donor, acceptor, and non-splice-site sequences randomly selected from the human reference genome (90% for training, 10% for testing, and then 20% of the training data for validation)	Non-canonical splice sites can also be predicted correctly; decreased number of false positives	[88]
TraP	Quantification of impact of variant on transcripts	Random forest	Benign: de novo mutations in healthy individuals. Pathogenic: selected synonymous variants associated with rare disease (*)	High performance in distinguishing pathogenic and benign variants, both intronic and synonymous; evaluation of potential impact of variants across multiple transcripts	[76]

(*) Data from NGS experiments. SVM: Support Vector Machine; RBF: Radial Basis Function; DNN: Deep Neural Network; NN: Neural Network; CNN: Convolutional Neural Network.
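
The two-stage split reported for SpliceFinder in the table (90% training, 10% testing, then 20% of the training data held out for validation) can be illustrated with a minimal sketch. This is not the SpliceFinder code: the function name `split_indices`, the seed, and the sample count are assumptions made for illustration, and the real tool operates on sequences rather than bare indices.

```python
import random


def split_indices(n_samples, test_frac=0.10, val_frac=0.20, seed=0):
    """Hypothetical helper: shuffle sample indices, hold out `test_frac`
    for testing, then hold out `val_frac` of the remaining training
    data for validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Stage 1: 10% of all samples form the test set.
    n_test = round(n_samples * test_frac)
    test = indices[:n_test]
    train_full = indices[n_test:]

    # Stage 2: 20% of the remaining training data form the validation set.
    n_val = round(len(train_full) * val_frac)
    val = train_full[:n_val]
    train = train_full[n_val:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 720 180 100
```

Under these assumptions, the effective proportions of the full dataset are 72% training, 18% validation, and 10% testing, because the validation fraction is taken from the training portion rather than from the whole.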