Abstract
Alternative splicing can be disrupted by genetic variants that are associated with diseases such as cancer. Characterizing the influence of genetic variants on alternative splicing will improve our understanding of their pathogenicity. Here, we developed a new approach, PredPSI-SVR, that uses support vector regression to predict the impact of variants on exon skipping events. From the sequence of a given exon and its flanking regions, 42 features related to splicing events were extracted. Using a greedy feature selection algorithm, we found eight features that contribute most to the prediction. The trained model achieved a Pearson correlation coefficient (PCC) of 0.570 in 10-fold cross validation on the training dataset provided by the “vex-seq” challenge of the 5th Critical Assessment of Genome Interpretation (CAGI). In the blind test held by the same challenge, our prediction ranked 2nd with a PCC of 0.566, indicating the robustness of our method. A further test indicated that PredPSI-SVR is helpful in prioritizing deleterious synonymous mutations.
The method is available on https://github.com/chenkenbio/PredPSI-SVR.
Keywords: alternative splicing, support vector machine, Vex-seq, splice site motif, synonymous mutation
Introduction
Human genes often express pre-mRNAs containing multiple introns and exons. Usually, introns are spliced out and the exons are joined to form mature mRNAs. Alternative splicing of exons allows a pre-mRNA to be spliced into diverse mature mRNAs (G.-S. Wang & Cooper, 2007). Thus, alternative splicing greatly contributes to the complexity of the human genome and allows a single gene to express protein isoforms with different functions (Baralle & Giudice, 2017). Changes in alternative splicing are widely known to be related to human diseases, including cancers (Climente-González, Porta-Pardo, Godzik, & Eyras, 2017). For example, variants occurring around splice sites can cause Birt-Hogg-Dubé syndrome, cystic fibrosis, Duchenne muscular dystrophy and others (Anna & Monika, 2018; Furuya et al., 2018). Importantly, many synonymous mutations in exons, which do not change the encoded proteins, were found to influence gene functions (Goodman, Church, & Kosuri, 2013; Parmley, Chamary, & Hurst, 2006) or to act as driver mutations in cancer due to their associations with splicing changes (Supek, Miñana, Valcárcel, Gabaldón, & Lehner, 2014).
Genetic variants that affect splicing events usually alter splicing signals in pre-mRNAs. The most fundamental splicing signals are located at 5’ splice sites, 3’ splice sites, and branch point sequences (Will & Lührmann, 2011). Introns usually start with “GU” at the 5’ splice site and end with “AG” at the 3’ splice site. Branch point sequences lie in introns shortly upstream of the 3’ splice site and help to form the lariat-like intermediates of introns that are spliced out. In addition, splicing regulatory elements located in exons and introns are required to identify splice sites precisely, including exonic splicing enhancers and silencers (ESE/ESS) and intronic splicing enhancers and silencers (ISE/ISS). These regulatory elements are short sequences in pre-mRNAs that can modulate alternative splicing by interacting with regulatory proteins (Z. Wang & Burge, 2008). Apart from splicing signals in pre-mRNA sequences, the secondary structure of pre-mRNAs can affect splicing as well (McManus & Graveley, 2011).
A common form of alternative splicing in mammals is exon skipping, where an exon is either spliced into the mature mRNA or skipped entirely (Katz, Wang, Airoldi, & Burge, 2010). The skipping of an exon is often measured by the percentage of transcripts in which the exon is spliced in, namely PSI or Ψ, and the difference of Ψ (ΔΨ) can be used to quantify the change of exon splicing. To quantify alternative splicing, Xiong et al. employed a high-throughput sequencing technique to measure genome-wide exon splicing, from which they designed a method, SPANR, to predict Ψ based on a deep Bayes network (Xiong et al., 2015). Though the method can obtain ΔΨ by predicting Ψ individually for wild-type sequences and their genetic variants, this indirect way of predicting ΔΨ is usually less accurate than methods designed specifically for the task. At the same time, Rosenberg et al. designed a method, HAL, to predict ΔΨ by using hexamer motifs of splicing patterns trained from more than two million synthetic mini-genes (Rosenberg, Patwardhan, Shendure, & Seelig, 2015). However, HAL can only make predictions for variants occurring in exons or splice donors (intronic positions within 6 bp of the 5’ splice site), but not in other regions. Additionally, it does not consider other factors that affect splicing.
Recently, Adamson et al. employed a novel experimental technique, variant exon sequencing (Vex-seq), to measure the impact of genomic variants on alternative splicing, which is hard to detect by traditional approaches using poly(A)+ RNA-seq alone (Adamson, Zhan, & Graveley, 2018). Vex-seq adopts a barcoding approach and is able to detect variants in exons and flanking introns. The method was applied to 2059 variants and produced a precise dataset of the ΔΨ caused by each variant. On this dataset, SPANR achieved poor correlations, while HAL could not make predictions for mutations outside exons and donor regions. Thus, the Vex-seq dataset is valuable for developing an accurate model for predicting ΔΨ.
Here, we present a new method, PredPSI-SVR, that uses support vector regression to predict the ΔΨ caused by variants. The method was trained on selected features, including DNA sequence, DNA conservation scores, splice site scores, splicing regulatory elements and mRNA secondary structure. The 10-fold cross validation test indicated that the method outperformed SPANR. It ranked 2nd, with a Pearson correlation coefficient of 0.566, in the blind prediction of the Vex-seq challenge of the 5th Critical Assessment of Genome Interpretation (CAGI). An additional application indicated that PredPSI-SVR is helpful for prioritizing pathogenic synonymous mutations.
MATERIALS AND METHODS
Changes of alternative splicing (ΔΨ):
The expression level of an alternative exon can be quantified by the fraction of mRNA containing the exon, which is denoted as PSI (Ψ):

$$\Psi = \frac{\text{inclusion reads}}{\text{inclusion reads} + \text{exclusion reads}},$$

where inclusion reads are counts of sequenced fragments aligned to the exon or its junctions with adjacent exons, and exclusion reads are the counts aligned to junctions supporting the exon’s exclusion. The inclusion of an exon in alternative splicing may be affected by genetic variants, especially those occurring around the junction sites. To study the effects of variants on junction sites, the change of Ψ (ΔΨ) is commonly computed as the difference in Ψ between the wild type and its variant.
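The read-count definition of Ψ and the resulting ΔΨ can be illustrated with a minimal sketch. The function names and the sign convention (variant minus wild type) are assumptions made here for illustration; the text only states that ΔΨ is the difference between the two Ψ values.

```python
def psi(inclusion_reads: int, exclusion_reads: int) -> float:
    """Fraction of reads supporting inclusion of the exon (Ψ)."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("Ψ is undefined when no reads support either isoform")
    return inclusion_reads / total


def delta_psi(psi_wild_type: float, psi_variant: float) -> float:
    """ΔΨ, assuming the convention variant minus wild type."""
    return psi_variant - psi_wild_type


# Toy read counts (not taken from the Vex-seq data).
wt = psi(inclusion_reads=80, exclusion_reads=20)
mt = psi(inclusion_reads=30, exclusion_reads=70)
change = delta_psi(wt, mt)
```

Under this convention a negative ΔΨ means that the variant reduces exon inclusion.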
Vex-seq Dataset:
All variants and the ΔΨ they cause were downloaded from the official website of the Critical Assessment of Genome Interpretation (CAGI) (URL: https://genomeinterpretation.org/content/vex-seq). The dataset was produced by the barcoding approach of variant exon sequencing (Vex-seq) and was provided by the CAGI 5 organizers to assess methods for predicting the effect of genomic variants on exon splicing. The dataset consists of 957 variants on chromosomes 1 to 8 for model training, namely TR957, and 1098 variants on chromosomes 9 to X for testing, namely TS1098. Each variant is located either in a central exon or in the flanking intronic region. The CAGI competition is a blind test, and the experimental ΔΨ values of the test set were released only after all predictions had been submitted by participants. Therefore, TS1098 is a strictly independent test set for our method.
Features
All variants in the Vex-seq dataset were annotated by ANNOVAR (K. Wang, Li, & Hakonarson, 2010) to determine their locations in exons. For each exon, the genome sequence was fetched to cover the exon and 300 nt of flanking region upstream and downstream, from which 42 features were extracted: six splice site motif features, eight splicing regulatory element features, two pre-mRNA secondary structure features, Ψ, 17 CADD annotations, SPIDEXΔΨ, and seven features describing variant location or codon (detailed in Supp. Table S1). In short, the splice site motif features were calculated by MaxEntScan (Yeo & Burge, 2004), which was applied to score the 5’ splice site and the 3’ splice site in the wild-type (WT) and mutant (MT) sequences. These scores were denoted as MES5WT, MES3WT, MES5MT, and MES3MT, respectively. The differences in MES between the mutant and the wild-type sequences at the 5’ and 3’ splice sites were denoted as ΔMES5 and ΔMES3.
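As a hedged illustration of how the six splice site motif features relate to each other, the sketch below extracts the splice site windows from wild-type and mutant sequences and takes score differences. It assumes the window sizes that MaxEntScan is generally used with (a 9-mer of 3 exonic plus 6 intronic bases at the 5’ splice site, and a 23-mer of 20 intronic plus 3 exonic bases at the 3’ splice site), 0-based half-open exon coordinates, and SNVs only. The `score5` and `score3` callables are hypothetical stand-ins for the actual MaxEntScan scorers, which are not reproduced here.

```python
from typing import Callable, Dict


def apply_snv(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def donor_window(seq: str, exon_end: int) -> str:
    """9-mer at the 5' splice site: last 3 exonic + first 6 intronic bases."""
    return seq[exon_end - 3: exon_end + 6]


def acceptor_window(seq: str, exon_start: int) -> str:
    """23-mer at the 3' splice site: last 20 intronic + first 3 exonic bases."""
    return seq[exon_start - 20: exon_start + 3]


def splice_site_features(
    wt_seq: str, mt_seq: str, exon_start: int, exon_end: int,
    score5: Callable[[str], float], score3: Callable[[str], float],
) -> Dict[str, float]:
    """Compute MES5WT, MES3WT, MES5MT, MES3MT and their differences."""
    f = {
        "MES5WT": score5(donor_window(wt_seq, exon_end)),
        "MES3WT": score3(acceptor_window(wt_seq, exon_start)),
        "MES5MT": score5(donor_window(mt_seq, exon_end)),
        "MES3MT": score3(acceptor_window(mt_seq, exon_start)),
    }
    f["ΔMES5"] = f["MES5MT"] - f["MES5WT"]
    f["ΔMES3"] = f["MES3MT"] - f["MES3WT"]
    return f


# Toy sequence: 20-nt intron ending in AG, 10-nt exon, intron starting with GT.
wt_seq = "T" * 18 + "AG" + "C" * 10 + "GTAAGT" + "A" * 4
mt_seq = apply_snv(wt_seq, 30, "A")  # destroys the GT dinucleotide


def toy5(w: str) -> float:  # hypothetical scorer, NOT MaxEntScan
    return 1.0 if w[3:5] == "GT" else 0.0


def toy3(w: str) -> float:  # hypothetical scorer, NOT MaxEntScan
    return 1.0 if w[18:20] == "AG" else 0.0


features = splice_site_features(wt_seq, mt_seq, 20, 30, toy5, toy3)
```

Indels would shift the exon coordinates in the mutant sequence, so a real implementation has to remap the splice site positions before extracting the mutant windows.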
The splicing regulatory elements used in our models include the exonic splicing enhancer motif of the SR protein SF2/ASF from ESEfinder (Smith et al., 2006), the exonic splicing silencer FAS-hex3 hexamers from FAS-ESS (Z. Wang et al., 2004), and putative exonic splicing enhancers and silencers, pESE/pESS (Zhang, Kangsamaksin, Chao, Banerjee, & Chasin, 2005). These features were scored using scripts provided by the SilVA program (Buske, Manickaraj, Mital, Ray, & Brudno, 2013). As SilVA was designed only for synonymous mutations, we slightly modified the scripts so that they can be applied to other SNVs or indels, in exons or introns.
Pre-mRNA secondary structure features include the change of free energy (ΔΔG) calculated by UNAFold 3.8 (Markham & Zuker, 2008) and the change of ensemble diversity (ΔD) calculated by ViennaRNA 2.4.4 (Lorenz et al., 2011).
The Ψ values used in the CAGI challenge dataset were provided by the CAGI organizers. To be able to make predictions for datasets without available Ψ values, we calculated Ψ for other exons by using the program MISO 0.5.4 (Katz et al., 2010) on data from the Human BodyMap 2.0 project (NCBI GSE30611). As the Human BodyMap 2.0 project provides raw RNA-seq data across 16 human tissues, we computed Ψ separately for each tissue sample to avoid tissue bias. The alignment of paired-end reads to the reference genome (hg19) was performed by BWA-MEM 0.7.17 (Li & Durbin, 2009). Since the RNA-seq data have a low sequencing depth, MISO could not obtain Ψ for most exons, and for each exon we calculated the average over the tissues in which Ψ was available. Finally, we obtained Ψ for 33922 exons, covering about 2.7% of all exons according to the GENCODE GRCh37 gene annotation file (Frankish et al., 2019).
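The per-exon averaging over available tissues can be sketched as follows. This is a minimal illustration, assuming that a tissue without a MISO estimate is represented by `None`; the data layout and function name are hypothetical and not taken from the published pipeline.

```python
from typing import Dict, Optional


def mean_psi_over_tissues(psi_by_tissue: Dict[str, Optional[float]]) -> Optional[float]:
    """Average Ψ over tissues with an estimate; None if no tissue has one."""
    values = [v for v in psi_by_tissue.values() if v is not None]
    if not values:
        return None  # the exon stays without a Ψ value
    return sum(values) / len(values)


# Toy example: Ψ could be estimated in two of three tissues.
exon_psi = mean_psi_over_tissues({"brain": 0.9, "liver": None, "heart": 0.7})
missing = mean_psi_over_tissues({"brain": None, "liver": None})
```

Exons for which the function returns `None` correspond to the cases where the text falls back on a model that does not use Ψ.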
CADD annotation features were extracted from the variant annotations provided by CADD (Kircher et al., 2014), which contain DNA conservation scores, histone methylation levels and other sequence features. SPIDEXΔΨ is a pre-computed ΔΨ score provided by SPANR (Xiong et al., 2015). Missing values from SPIDEX were simply filled with zero. We also defined another seven features to describe the locations of variants and whether they introduce stop codons, but none of them was chosen by the feature selection.
Support Vector Regression
We trained regression models by the support vector regression implemented in the LIBSVM 3.23 package (Chang & Lin, 2011). LIBSVM is a user-friendly SVM package that supports model training as well as feature scaling, hyper-parameter tuning, etc. In this study, we used the ε-SVR formulation in the package with the radial basis function (RBF) as the kernel. This model has two hyper-parameters: the cost parameter C and the gamma (γ) of the RBF kernel. To find the best combination of these two hyper-parameters, we adopted a grid search strategy that tests each combination of $C \in \{2^{-5}, 2^{-3}, \ldots, 2^{11}\}$ and $\gamma \in \{2^{-11}, 2^{-9}, \ldots, 2^{3}\}$.
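The grid of hyper-parameter values described above can be enumerated with a short sketch. The `cv_score` callable is a hypothetical placeholder for "train an ε-SVR with these values and return the cross-validated PCC"; the LIBSVM training itself is not reproduced here.

```python
import math
from itertools import product
from typing import Callable, Tuple

# Exponents step by 2: C from 2^-5 to 2^11, gamma from 2^-11 to 2^3.
C_GRID = [2.0 ** e for e in range(-5, 12, 2)]
GAMMA_GRID = [2.0 ** e for e in range(-11, 4, 2)]


def grid_search(cv_score: Callable[[float, float], float]) -> Tuple[float, float, float]:
    """Return (C, gamma, score) for the best-scoring grid point."""
    best_c, best_gamma = max(product(C_GRID, GAMMA_GRID),
                             key=lambda cg: cv_score(cg[0], cg[1]))
    return best_c, best_gamma, cv_score(best_c, best_gamma)


def toy_score(c: float, gamma: float) -> float:
    """Hypothetical scorer peaking at C = 2^3 and gamma = 2^-5."""
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 5)


best_c, best_gamma, best_score = grid_search(toy_score)
```

With these ranges the search evaluates 9 values of C and 8 values of γ, i.e. 72 combinations.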
Feature Selection
A greedy feature selection algorithm was used as in a previous study (Huiying Zhao et al., 2013). In the selection process, we first selected the feature with the highest PCC and used it as the initial optimal subset. Given the current optimal subset, we scanned all remaining features by adding them individually, and added to the subset the feature that most improved the prediction accuracy. This continued until no remaining feature could increase the performance. During this procedure, we employed a 10-fold cross-validation (CV) strategy to evaluate the performance of models, where all variants were randomly separated into 10 folds. Variants from the same gene were put into the same fold to avoid sharing gene information between the training and validation sets (Huiying Zhao et al., 2018). Each time, nine folds were used for training and the remaining fold was used for prediction. This process was repeated 10 times, and all predictions were collected to calculate the Pearson correlation coefficient (PCC) between predicted and experimental ΔΨ.
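The selection procedure and the gene-aware fold assignment can be sketched as below. This is an illustration under stated assumptions rather than the published implementation: `cv_score` is a hypothetical callable that trains a model on a feature subset and returns the cross-validated PCC, the first feature is chosen by the same score as later ones, and genes are distributed over folds in round-robin order after shuffling.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def gene_grouped_folds(genes: Sequence[str], n_folds: int = 10,
                       seed: int = 0) -> List[List[int]]:
    """Assign variant indices to folds so that a gene never spans two folds."""
    by_gene: Dict[str, List[int]] = defaultdict(list)
    for idx, gene in enumerate(genes):
        by_gene[gene].append(idx)
    names = sorted(by_gene)
    random.Random(seed).shuffle(names)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for i, gene in enumerate(names):
        folds[i % n_folds].extend(by_gene[gene])
    return folds


def greedy_forward_selection(
    features: Sequence[str], cv_score: Callable[[List[str]], float]
) -> Tuple[List[str], float]:
    """Add the best-scoring feature until no addition improves the score."""
    selected: List[str] = []
    remaining = list(features)
    best = float("-inf")
    while remaining:
        score, feature = max(((cv_score(selected + [f]), f) for f in remaining),
                             key=lambda t: t[0])
        if score <= best:
            break  # no remaining feature increases the performance
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


# Toy scorer: each hypothetical feature contributes a fixed gain.
gains = {"a": 0.4, "b": 0.1, "c": -0.05}
chosen, final_score = greedy_forward_selection(
    list(gains), lambda subset: sum(gains[f] for f in subset))
folds = gene_grouped_folds(["G1", "G1", "G2", "G3", "G3", "G3"], n_folds=3)
```

Because each step only looks one feature ahead, the procedure can stop at a subset that is not globally optimal.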
Synonymous Mutation Datasets
Changes in alternative splicing were found to be an important factor for pathogenic synonymous mutations (Livingstone et al., 2017). To further test our model and evaluate the relation between changes in alternative splicing and diseases, we compiled a synonymous mutation dataset consisting of both pathogenic and normal mutations. The pathogenic synonymous mutations were downloaded from dbDSM (Wen, Xiao, & Xia, 2016), a database of deleterious synonymous mutations collected from public databases and the literature. We first removed duplicate and invalid records, and then converted the chromosome coordinates from the hg38 to the hg19 assembly by using CrossMap (Hao Zhao et al., 2014). The normal synonymous mutations were obtained from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) with an allele frequency between 0.1 and 0.9. We further removed mutations more than 300 bp away from the nearest splice site, leading to 890 pathogenic and 14030 normal synonymous mutations. After applying SPANR and MaxEntScan to these synonymous mutations, 133 pathogenic and 3208 normal mutations had no SPANR score and were removed from the dataset. Finally, 757 pathogenic and 10822 normal mutations remained, namely SynonMut-complete.
Since our method PredPSI-SVR requires Ψ as input, we mapped the synonymous mutations to exons. After excluding mutations in exons without experimental Ψ, we obtained a subset consisting of 87 pathogenic and 826 normal synonymous mutations, namely SynonMut-PSI. Compared with SynonMut-complete, this dataset is 8.7 times smaller in pathogenic mutations and 13.1 times smaller in normal mutations.
RESULTS AND CONCLUSION
Feature analysis
We first computed the Pearson correlation coefficient (PCC) between individual features and the PSI change (ΔΨ). As the PCC ranges from −1 to 1, with a negative value indicating negative correlation, features were sorted according to the absolute value of their PCC. Table 1 lists the nine most important features, those with an absolute PCC greater than 0.1 in the training set TR957. MES scores are the most relevant group, with five types of MES score features in the list. ΔMES5 is the most correlated feature, and ΔMES3, MES3MT, MES5MT, and MES3WT ranked 3rd, 4th, 6th, and 8th, respectively. The MES score was designed to reflect the strength of a splice site junction, where a lower MES score indicates that the exon is more likely to be disrupted by splicing variants (Eng et al., 2004). The second most important feature is SPIDEXΔΨ, originally developed for predicting ΔΨ in a previous study (Xiong et al., 2015), which was obtained from the pre-scored database of the ANNOVAR package (K. Wang et al., 2010). The fifth feature, “dist-Splice”, is the distance of the variant from the nearest splice site (5’ or 3’ site). The seventh feature, verPhyloP, is the phyloP conservation score for vertebrates. The last one is GC, the percentage of GC in a window of 75 bp. Both GC and verPhyloP were extracted from the CADD annotations. Since pre-mRNA structure was reported to affect splicing (Lin, Taggart, & Fairbrother, 2016), we evaluated both the free energy change and the structural ensemble diversity change, measured by UNAFold 3.8 and ViennaRNA 2.4.4, respectively, but they show only weak correlations with ΔΨ, with the highest PCC of 0.032 obtained by the free energy change (ΔΔG) computed by UNAFold 3.8.
Table 1.
Top features with the greatest absolute values of Pearson correlation coefficients (PCC) to the ΔΨ computed in the training set TR957. Features with absolute values greater than 0.1 were listed. Their PCCs in the test set TS1098 were listed in the last column.
| Rank | Feature | PCC (TR957) | PCC (TS1098) |
|---|---|---|---|
| 1 | ΔMES5 | 0.402 | 0.348 |
| 2 | SPIDEXΔΨ | 0.270 | 0.241 |
| 3 | ΔMES3 | 0.263 | 0.425 |
| 4 | MES3MT | 0.186 | 0.152 |
| 5 | dist-Splice | 0.176 | 0.167 |
| 6 | MES5MT | 0.167 | 0.084 |
| 7 | verPhyloP | −0.128 | −0.080 |
| 8 | MES3WT | 0.111 | −0.019 |
| 9 | GC | 0.108 | 0.030 |
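The ranking behind Table 1 (sort features by the absolute value of their PCC with ΔΨ and keep those above 0.1) can be sketched in a few lines. The feature names and values below are toy data, not the TR957 features.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(columns: Dict[str, Sequence[float]], target: Sequence[float],
                  min_abs_pcc: float = 0.1) -> List[Tuple[str, float]]:
    """Sort features by |PCC| with the target and drop weak ones."""
    scored = [(name, pearson(values, target)) for name, values in columns.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return [(name, r) for name, r in scored if abs(r) > min_abs_pcc]


# Toy data: one positively, one negatively and one uncorrelated feature.
delta_psi_values = [1.0, 2.0, 3.0, 4.0]
toy_columns = {
    "uncorrelated": [1.0, -1.0, -1.0, 1.0],
    "down": [3.0, 1.0, 2.0, 0.0],
    "up": [2.0, 4.0, 6.0, 8.0],
}
ranking = rank_features(toy_columns, delta_psi_values)
```

Sorting on the absolute value keeps negatively correlated features such as verPhyloP in the list while preserving the sign in the reported PCC.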
Model training and Feature selection
We employed a greedy feature selection algorithm to select effective features from all 42 features by using 10-fold cross validation on the training set. As shown in Figure 1, the PCC of the 10-fold CV gradually increases with the addition of features and reaches its highest value of 0.570 with eight features. Further addition of features decreased the performance. In the independent test set, the eight-feature input likewise gave the highest PCC, though there is a slight drop in PCC with 3–5 features. The most important feature is ΔMES5, which individually shows the highest correlation with ΔΨ (PCC = 0.402). Two other scores (ΔMES3 and SPIDEXΔΨ) also correlated strongly with ΔΨ individually, while the remaining five features (the exonic splicing enhancer feature SR protein loss [SR−], MES5WT, the conservation score priPhyloP, wild-type Ψ, and minDistTSS [distance to the closest transcript start]) individually showed weak correlations with ΔΨ. As shown in Table 2, adding these five features and SPIDEXΔΨ increases the PCC of model predictions from 0.503 to 0.570. In the independent test, the PCC increases from 0.322 to 0.516 by combining the two MaxEntScan features, and to 0.566 by combining all eight features (PredPSI-SVR). At the same time, removing any individual feature decreases the PCC, with the largest drop for ΔMES5 and the smallest for SPIDEXΔΨ. This is probably because the information in SPIDEXΔΨ is partially covered by other features. Figure 2 compares experimental ΔΨ with the ΔΨ predicted by PredPSI-SVR and by SPIDEXΔΨ.
Surprisingly, when we prepared the final server version, we found that removing priPhyloP and minDistTSS, both obtained from CADD, leads to a slight increase in PCC over the full eight-feature model in both tests: from 0.570 to 0.590 in the 10-fold CV, and from 0.566 to 0.577 in the independent test. This indicates a limitation of our current greedy feature selection algorithm. Therefore, our final server version (PredPSI-SVR) was trained by using six features on the combined training and test sets from the CAGI.
Figure 1.

The growth of PCC as the number of features increases. The solid line shows the results of 10-fold cross validation on the training set, and the dashed line for the independent test set.
Table 2.
Performances of models by incremental addition of features, or by removing each feature from the final model tested on the training dataset (10-fold cross validation).
| Features added (a) | PCC | Feature excluded (b) | PCC |
|---|---|---|---|
| Final model | 0.570 | | |
| ΔMES5 | 0.414 | −ΔMES5 | 0.444 |
| +ΔMES3 | 0.503 | −ΔMES3 | 0.482 |
| +SR− | 0.518 | −SR− | 0.555 |
| +MES5WT | 0.524 | −MES5WT | 0.542 |
| +priPhyloP | 0.537 | −priPhyloP | 0.545 |
| +SPIDEXΔΨ | 0.548 | −SPIDEXΔΨ | 0.565 |
| +Ψ | 0.556 | −Ψ | 0.508 |
| +minDistTSS | 0.570 | −minDistTSS | 0.556 |

(a) Performance by incremental addition of each feature.
(b) Performance by removing each feature from the final model.
Figure 2.

Comparison of predicted ΔΨ by (A) PredPSI-SVR and (B) SPIDEXΔΨ (SPANR method) and experimental values on the independent test set TS1098.
The CAGI provides experimental Ψ for only a small portion of exons, and MISO cannot compute Ψ for all exons. For general use where exons have no Ψ values, we built another model, PredPSI-SVR-noPSI, that does not use Ψ. This model achieved lower performance, with PCCs of 0.525 and 0.479 on the 10-fold CV of the training set and on the independent test set, respectively.
The prioritization of pathogenic synonymous mutations
The PredPSI-SVR model was further used to prioritize pathogenic synonymous mutations and compared with SPANR and MaxEntScan. For PredPSI-SVR and SPANR, we directly used the absolute values of the predicted ΔΨ. For MaxEntScan, we took the sum of the absolute values of ΔMES5 and ΔMES3 to capture information from both the 5’ and 3’ splice sites. These scores were used to distinguish pathogenic mutations from normal ones: mutations with a score above a threshold are classified as pathogenic. Figure 3 shows the receiver operating characteristic (ROC) curves of PredPSI-SVR, SPANR, and MaxEntScan on the SynonMut-PSI dataset. PredPSI-SVR performs the best, while SPANR performs the worst and is close to random on this dataset. As shown in Table 3, the area under the ROC curve (AUC) indicates that PredPSI-SVR is significantly better than SPANR (P-value = 0.036) and 6.6% higher than MaxEntScan. PredPSI-SVR-noPSI, which does not use experimental Ψ, shows a large drop in AUC (from 0.579 to 0.508), likely due to the small dataset. On the larger SynonMut-complete dataset, PredPSI-SVR-noPSI achieves an AUC of 0.575, which is significantly better than SPANR and MaxEntScan, with P-values of 0.004 and 0.049, respectively, according to the Hanley & McNeil test (Hanley & McNeil, 1982), a statistical method for testing whether two AUC values differ significantly. These results also indicate that our predictions of changes in alternative splicing can help in prioritizing pathogenic synonymous mutations. In the low-FPR region (FPR < 0.1), the curve of MaxEntScan is slightly above that of PredPSI-SVR, even though MaxEntScan scores are input features of the PredPSI-SVR model. This is likely because our method was optimized for overall performance, which lowered the results in this region. The problem may be overcome by using other machine learning algorithms such as XGBoost (Chen & Guestrin, 2016) or a bigger training dataset. In addition, we divided the SynonMut-complete dataset into two portions: 555 mutations within the scanning scope of MaxEntScan and the remaining 11024 mutations. For the first portion, our model has essentially the same performance as MaxEntScan (Figure 3C), while for the remaining mutations without MaxEntScan scores, our model achieves an AUC of 0.536, which is significantly better than the AUC of 0.501 by SPANR, with a P-value of 0.018 (Figure 3D). These results suggest that our model can exploit features beyond the MaxEntScan scores.
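The AUC comparison can be illustrated with a sketch of the Hanley & McNeil (1982) standard error followed by a normal test. Two simplifications are assumptions made here and are not stated in the text: the two AUCs are treated as independent, and the test is one-sided.

```python
import math


def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Standard error of an AUC following Hanley & McNeil (1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)


def one_sided_p(auc_a: float, auc_b: float, n_pos: int, n_neg: int) -> float:
    """P-value for auc_a > auc_b, treating both AUCs as independent (assumption)."""
    se = math.hypot(hanley_mcneil_se(auc_a, n_pos, n_neg),
                    hanley_mcneil_se(auc_b, n_pos, n_neg))
    z = (auc_a - auc_b) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the normal


# AUCs from Table 3 on SynonMut-PSI (87 pathogenic, 826 normal mutations).
p_vs_spanr = one_sided_p(0.579, 0.495, n_pos=87, n_neg=826)
```

Under these assumptions the sketch gives a P-value near 0.036 for PredPSI-SVR against SPANR, in line with the value reported in Table 3, although the exact procedure used for the table may differ in its details.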
Figure 3.

ROC curves for PredPSI-SVR, PredPSI-SVR-noPSI, SPANR and MaxEntScan on (A) the SynonMut-PSI dataset and (B) the SynonMut-complete dataset. PredPSI-SVR does not appear in panel B because the dataset includes mutations in exons without experimental PSI values. ROC curves for the different methods on mutations (C) with MaxEntScan scores or (D) without MaxEntScan scores are also shown.
Table 3.
The performance of methods to discriminate pathogenic from normal synonymous mutations.
| Dataset | Methods | AUC | P-value (a) |
|---|---|---|---|
| SynonMut-PSI | PredPSI-SVR | 0.579 | − |
| | PredPSI-SVR-noPSI | 0.508 | − (0.064) (b) |
| | SPANR | 0.495 | 0.389 (0.036) |
| | MaxEntScan | 0.543 | 0.150 (0.220) |
| SynonMut-complete | PredPSI-SVR-noPSI | 0.575 | − |
| | SPANR | 0.534 | 0.004 |
| | MaxEntScan | 0.549 | 0.049 |

(a) Significance of the difference between each method and PredPSI-SVR-noPSI according to the statistical test (Hanley & McNeil, 1982).
(b) Values in parentheses give the significance of the difference compared with PredPSI-SVR.
DISCUSSION
In this study, we present a new method, PredPSI-SVR, to predict the change of exon splicing caused by genetic variants. PredPSI-SVR is a support vector regression model that integrates features of splice sites, splicing regulatory elements, DNA conservation scores, the SPIDEXΔΨ provided by SPANR, and the Ψ of wild-type exons to predict ΔΨ. The method achieved PCCs of 0.570 and 0.566 for the 10-fold CV on the training dataset and on the strictly independent test set, respectively. This performance is significantly better than that of SPANR’s SPIDEXΔΨ (PCC = 0.24), and ranked 2nd in the CAGI competition.
To build such a model, we first extracted 42 features and analyzed their correlations with ΔΨ. We found that the splice site features computed by MaxEntScan have the highest correlations. The model trained on ΔMES5 and ΔMES3 alone achieves a PCC of about 0.51 on the test set, indicating the importance of variants around splice sites for alternative splicing. With greedy feature selection, the model built from eight selected features increased the PCC from 0.503 to 0.570 on the training set and from 0.516 to 0.566 on the test set. Five of the eight selected features individually show weak correlations with ΔΨ (|PCC| < 0.1), indicating the importance of extracting comprehensive features.
Our method ranked 2nd in the CAGI challenge, and it is of interest to compare it with the other methods. According to the descriptions of the prediction methods, available at https://genomeinterpretation.org/content/vex-seq, two groups (groups 1 and 2, which ranked 3rd and 4th, respectively) used features similar to those of our method (group 4). In contrast to our approach, group 1 did not fit their model directly to the experimental ΔΨ values. They trained a classification model to predict the sign of ΔΨ and then fitted the predicted scores to ΔΨ. Group 2 did not employ cross validation to optimize the hyper-parameters of their random forest model, which might have limited its generalization to the test set. Group 3 (ranked 5th) did not provide implementation details. On the other hand, group 5 made the best predictions by using their MMSplice method (Cheng et al., 2019). In their method, six deep neural networks were trained to extract features of the splice donor, splice acceptor, 5’ exon, 3’ exon, 5’ intron, and 3’ intron, which were then combined by a simple linear regression to predict ΔΨ. With the benefit of deep learning techniques, the method achieved a PCC of 0.675 on the test set. Therefore, the predictions might be further improved by combining the merits of different methods, e.g. by using features mined by deep learning together with the knowledge-based information used here, such as conservation scores, and by training a non-linear model with machine learning methods like the SVM used in our method.
We also noticed that removing mutations with small Ψ changes leads to a better correlation between predicted and experimental values. After removing mutations whose absolute ΔΨ is less than two standard deviations, we observed improved correlations for all methods on the remaining 53 mutations. For example, the PCC of PredPSI-SVR increased from 0.566 to 0.665, and that of the top method MMSplice increased from 0.675 to 0.782. This is likely because mutations with a small change of Ψ may be affected by many other factors with relatively weak impact, while current methods can only capture the dominant factors due to the limited data.
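The filtering step described above can be sketched as follows. It assumes that the standard deviation is the population standard deviation of the experimental ΔΨ values over the whole test set; the text does not specify which estimator or reference set was used.

```python
import statistics
from typing import List, Sequence


def large_change_indices(delta_psi: Sequence[float], n_sd: float = 2.0) -> List[int]:
    """Indices of variants with |ΔΨ| of at least n_sd standard deviations."""
    threshold = n_sd * statistics.pstdev(delta_psi)
    return [i for i, value in enumerate(delta_psi) if abs(value) >= threshold]


# Toy ΔΨ values: most variants have no effect, two have a large one.
toy_delta_psi = [0.0] * 8 + [1.0, -1.0]
kept = large_change_indices(toy_delta_psi)
```

The correlation between predicted and experimental ΔΨ would then be recomputed on the retained indices only.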
At present, PredPSI-SVR does not effectively include features of branch point sequences and pre-mRNA secondary structure because of the relatively small number of samples in the dataset. Moreover, the small number of samples prevented us from using more powerful learning algorithms such as deep learning. Another limitation of our method is its need for the wild-type Ψ of exons. Without experimental Ψ, the PCCs dropped by 0.045 on the training set (from 0.570 to 0.525) and by 0.087 on the test set (from 0.566 to 0.479). Currently only RNA-seq data from the Human BodyMap 2.0 project were used, and Ψ could not be obtained for many exons in the MISO analysis due to limited sequencing depth. With advances in sequencing technology, more and more public datasets are becoming available, which will make it possible to obtain more accurate Ψ for exons and thus improve the performance. Moreover, the tissue dependence of Ψ suggests using tissue-specific Ψ in PredPSI-SVR to better discover pathogenic variants in specific diseases.
The PredPSI-SVR method is available as a standalone version at https://github.com/chenkenbio/PredPSI-SVR. The program runs on Linux/Unix systems and takes variants in the VCF format as input.
Supplementary Material
Acknowledgments
This project was supported in part by the National Natural Science Foundation of China (61772566, U1611261, and 81801132), the program for Guangdong Introducing Innovative and Entrepreneurial Teams (2016ZT06D211) and Guangdong Province Key Laboratory of Malignant Tumor Epigenetics and Gene Regulation (2017B030314026). We would like to thank the Critical Assessment of Genome Interpretation (CAGI) group and data providers. The CAGI experiment coordination is supported by NIH U41 HG007346 and the CAGI conference by NIH R13 HG006650.
Contract Grant sponsors
National Natural Science Foundation of China (61772566, U1611261, and 81801132); Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2016ZT06D211); Guangdong Province Key Laboratory of Malignant Tumor Epigenetics and Gene Regulation (2017B030314026)
REFERENCES
- Adamson SI, Zhan L, & Graveley BR (2018). Vex-seq: high-throughput identification of the impact of genetic variation on pre-mRNA splicing efficiency. Genome Biology, 19(1), 71 10.1186/s13059-018-1437-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- Anna A, & Monika G (2018). Splicing mutations in human genetic disorders: examples, detection, and confirmation. Journal of Applied Genetics, 59(3), 253–268. 10.1007/s13353-018-0444-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baralle FE, & Giudice J (2017). Alternative splicing as a regulator of development and tissue identity. Nature Reviews Molecular Cell Biology, 18, 437. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Buske OJ, Manickaraj A, Mital S, Ray PN, & Brudno M (2013). Identification of deleterious synonymous variants in human genomes. Bioinformatics, 29(15), 1843–1850. 10.1093/bioinformatics/btt308 [DOI] [PubMed] [Google Scholar]
- Chang C-C, & Lin C-J (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1–27:27. [Google Scholar]
- Chen T, & Guestrin C (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. 10.1145/2939672.2939785 [DOI] [Google Scholar]
- Cheng J, Nguyen TYD, Cygan KJ, Çelik MH, Fairbrother WG, Avsec žiga, & Gagneur J (2019). MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biology, 20(1), 48 10.1186/s13059-019-1653-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- Climente-González H, Porta-Pardo E, Godzik A, & Eyras E (2017). The Functional Impact of Alternative Splicing in Cancer. Cell Reports, 20(9), 2215–2226. 10.1016/j.celrep.2017.08.012 [DOI] [PubMed] [Google Scholar]
- Eng L, Coutinho G, Nahas S, Yeo G, Tanouye R, Babaei M, … Gatti RA (2004). Nonclassical splicing mutations in the coding and noncoding regions of the ATM Gene: Maximum entropy estimates of splice junction strengths. Human Mutation, 23(1), 67–76. 10.1002/humu.10295 [DOI] [PubMed] [Google Scholar]
- Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, … Flicek P (2019). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research, 47(D1), D766–D773. 10.1093/nar/gky955
- Furuya M, Kobayashi H, Baba M, Ito T, Tanaka R, & Nakatani Y (2018). Splice-site mutation causing partial retention of intron in the FLCN gene in Birt-Hogg-Dubé syndrome: a case report. BMC Medical Genomics, 11(1), 42. 10.1186/s12920-018-0359-5
- Goodman DB, Church GM, & Kosuri S (2013). Causes and Effects of N-Terminal Codon Bias in Bacterial Genes. Science, 342(6157), 475. 10.1126/science.1241934
- Hanley JA, & McNeil BJ (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36. 10.1148/radiology.143.1.7063747
- Katz Y, Wang ET, Airoldi EM, & Burge CB (2010). Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nature Methods, 7(12), 1009–1015. 10.1038/nmeth.1528
- Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, & Shendure J (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 46(3), 310–315. 10.1038/ng.2892
- Li H, & Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England), 25(14), 1754–1760. 10.1093/bioinformatics/btp324
- Lin C-L, Taggart AJ, & Fairbrother WG (2016). RNA structure in splicing: An evolutionary perspective. RNA Biology, 13(9), 766–771. 10.1080/15476286.2016.1208893
- Livingstone M, Folkman L, Yang Y, Zhang P, Mort M, Cooper DN, … Zhou Y (2017). Investigating DNA-, RNA-, and protein-based features as a means to discriminate pathogenic synonymous variants. Human Mutation, 38(10), 1336–1347. 10.1002/humu.23283
- Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, & Hofacker IL (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6, 26. 10.1186/1748-7188-6-26
- Markham NR, & Zuker M (2008). UNAFold: software for nucleic acid folding and hybridization. Methods in Molecular Biology (Clifton, N.J.), 453, 3–31. 10.1007/978-1-60327-429-6_1
- McManus CJ, & Graveley BR (2011). RNA structure and the mechanisms of alternative splicing. Current Opinion in Genetics & Development, 21(4), 373–379. 10.1016/j.gde.2011.04.001
- Parmley JL, Chamary JV, & Hurst LD (2006). Evidence for Purifying Selection Against Synonymous Mutations in Mammalian Exonic Splicing Enhancers. Molecular Biology and Evolution, 23(2), 301–309. 10.1093/molbev/msj035
- Rosenberg AB, Patwardhan RP, Shendure J, & Seelig G (2015). Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences. Cell, 163(3), 698–711. 10.1016/j.cell.2015.09.054
- Smith PJ, Zhang C, Wang J, Chew SL, Zhang MQ, & Krainer AR (2006). An increased specificity score matrix for the prediction of SF2/ASF-specific exonic splicing enhancers. Human Molecular Genetics, 15(16), 2490–2508. 10.1093/hmg/ddl171
- The 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. 10.1038/nature15393
- Wang G-S, & Cooper TA (2007). Splicing in disease: disruption of the splicing code and the decoding machinery. Nature Reviews Genetics, 8(10), 749–761. 10.1038/nrg2164
- Wang K, Li M, & Hakonarson H (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16), e164–e164. 10.1093/nar/gkq603
- Wang Z, & Burge CB (2008). Splicing regulation: From a parts list of regulatory elements to an integrated splicing code. RNA, 14(5), 802–813. 10.1261/rna.876308
- Wang Z, Rolish ME, Yeo G, Tung V, Mawson M, & Burge CB (2004). Systematic Identification and Analysis of Exonic Splicing Silencers. Cell, 119(6), 831–845. 10.1016/j.cell.2004.11.010
- Wen P, Xiao P, & Xia J (2016). dbDSM: a manually curated database for deleterious synonymous mutations. Bioinformatics, 32(12), 1914–1916. 10.1093/bioinformatics/btw086
- Will CL, & Lührmann R (2011). Spliceosome Structure and Function. Cold Spring Harbor Perspectives in Biology, 3(7), a003707. 10.1101/cshperspect.a003707
- Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, Yuen RKC, … Frey BJ (2015). The human splicing code reveals new insights into the genetic determinants of disease. Science, 347(6218), 1254806–1254806. 10.1126/science.1254806
- Yeo G, & Burge CB (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of Computational Biology: A Journal of Computational Molecular Cell Biology, 11(2–3), 377–394. 10.1089/1066527041410418
- Zhang XH-F, Kangsamaksin T, Chao MSP, Banerjee JK, & Chasin LA (2005). Exon Inclusion Is Dependent on Predictable Exonic Splicing Enhancers. Molecular and Cellular Biology, 25(16), 7323–7332. 10.1128/MCB.25.16.7323-7332.2005
- Zhao Hao, Sun Z, Wang J, Huang H, Kocher J-P, & Wang L (2014). CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics, 30(7), 1006–1007. 10.1093/bioinformatics/btt730
- Zhao Huiying, Yang Y, Lin H, Zhang X, Mort M, Cooper DN, … Zhou Y (2013). DDIG-in: discriminating between disease-associated and neutral non-frameshifting micro-indels. Genome Biology, 14(3), R23. 10.1186/gb-2013-14-3-r23
- Zhao Huiying, Yang Y, Lu Y, Mort M, Cooper DN, Zuo Z, & Zhou Y (2018). Quantitative mapping of genetic similarity in human heritable diseases by shared mutations. Human Mutation, 39(2), 292–301. 10.1002/humu.23358
