Table 4.
Tool Name | Prediction | Model | Datasets | Key-Points | Ref |
---|---|---|---|---|---|
CADD | Score of pathogenicity | First version: linear SVM; later versions: L2-regularized logistic regression | Training datasets: Benign: evolutionarily neutral variants; Pathogenic: simulated de novo pathogenic variants. Testing datasets: Benign: benign variants; Pathogenic: pathogenic ClinVar variants, somatic cancer mutation frequencies | Effective tool for protein-coding impact prediction; may not be informative for poorly conserved regions | [74,75] |
CryptSplice | Impact of variants on existing splice sites, cryptic splice site prediction | SVM with RBF kernel | True and false splice sites from GenBank-derived datasets | Identifies creation of cryptic acceptor/donor sites; relies on a rather outdated database | [78] |
DARTS | Prediction of alternative splicing using both cis sequence features and mRNA levels of trans RBPs | DNN and Bayesian Hypothesis Testing | RNA-seq data (*) | Evaluation of RBP impact on splicing | [82] |
MMSplice | Multiple predictions: exon skipping, competitive interactions, changes in splicing efficiency, and pathogenicity | Modular NN, linear and logistic regression | Donor/acceptor modules: GENCODE v24 true (known sites) and false (random sequences) splice sites. Exon/intron modules: MPRA data from [83] | Easily clinically applicable training set; contains false positive/unverified sites | [84] |
MutPred Splice | Impact of coding region substitutions on disruption of pre-mRNA splicing | Linear SVM | Positive: HGMD exonic disease-causing/disease-associated variants. Negative: HGMD disease-causing missense variants not reported to disrupt exon splicing; high-frequency exonic SNPs (SNP- from 1000 Genomes Project [85]) | Suitable for use in an NGS high-throughput setting to identify and prioritize potentially splice-altering variants | [86] |
PEPSI | Prediction of coding and noncoding variant impact on pre-mRNA splicing based on sequence conservation, RNA secondary structure, and regulatory sequence elements | Random forest regression model | Data obtained from the Vex-seq experiment (measurement of the ΔPSI of 2055 variants from the Exome Aggregation Consortium (ExAC; [Kircher et al., 2014]) v24); a selection of chromosomes as training set, the remaining ones as testing set (*) | Indels and intronic variants included | [80] |
S-CAP | Score of variant pathogenicity using compartmentalization of genomic regions | DNN | Pathogenic variants selected from HGMD and ClinVar; benign variants from gnomAD | Evaluation of intronic pathogenic variants; variants lying more than 50 bp into the intron are not covered by the model | [79] |
SPANR | Cassette exon skipping prediction | NN modeled on Bayesian framework | PSI values for all human exons across 16 tissues, based on the Illumina Human Body Map project (*) | Easy-to-use web server; pre-computed scores available for all eligible variants in the genome; evaluation of exon sequence only | [87] |
SpliceAI | Prediction of variant impact on loss or gain of acceptor/donor sites | 32-layer DNN | Protein-coding transcripts from GENCODE v24 (a selection of chromosomes as training set, the remaining ones as testing set) (*) | Very powerful tool able to use a “near-agnostic” approach | [81] |
SpliceFinder | Classification of variants based on impact on donor site, acceptor site or non-splice-site | CNN | Sequences of donor, acceptor, and non-splice-site, randomly selected from human reference genome (90% for training, 10% for testing, and then 20% of the training data for validation) | Non-canonical splice sites can also be predicted correctly; decreased number of false positives | [88] |
TraP | Quantification of impact of variant on transcripts | Random forest | Benign: de novo mutations in healthy individuals; Pathogenic: selected synonymous variants associated with rare disease (*) | High performance in distinguishing pathogenic and benign variants, both intronic and synonymous; evaluation of potential impact of variants across multiple transcripts | [76] |
(*) data from NGS experiments. SVM: Support Vector Machine; RBF: Radial Basis Function; DNN: Deep Neural Network; NN: Neural Network; CNN: Convolutional Neural Network.