Author manuscript; available in PMC 2020 Sep 1.
Published in final edited form as: Hum Mutat. 2019 Jun 23;40(9):1261–1269. doi: 10.1002/humu.23794

Predicting the impact of single nucleotide variants on splicing via sequence-based deep neural networks and genomic features

Tatsuhiko Naito 1
PMCID: PMC7265986  NIHMSID: NIHMS1030139  PMID: 31090248

Abstract

Single nucleotide mutations in exonic regions can significantly affect gene function through disruption of splicing, and various computational methods have been developed to predict the splicing-related effects of a single nucleotide mutation. We implemented a new method using ensemble learning that combines two types of predictive models: (a) base sequence-based deep neural networks (DNNs) and (b) machine learning models based on genomic attributes. This method was applied to the Massively Parallel Splicing Assay challenge of the Fifth Critical Assessment of Genome Interpretation, in which participants predicted various experimentally defined exonic splicing mutations, and achieved a promising result. We showed that combining different predictive models via stacked generalization significantly improved prediction performance. In addition, whereas most of the genomic features adopted in constructing the machine learning models had been reported previously, the feature values generated with DSSP, a DNN-based splice site prediction tool, were novel and helpful for the prediction. Learning the sequence patterns associated with normal splicing, together with the change in splice site probability caused by a mutation, is presumed to be helpful in predicting splicing disruption.

Keywords: CAGI, splicing, single nucleotide variant, ensemble learning, deep neural networks

1. INTRODUCTION

Large-scale sequencing has enabled comprehensive identification of genetic variants. One of the next challenges is to predict the effect of variants on protein function. Although predicting the effect of genetic variants on protein function is often challenging, the characterization of splicing mutations is more tractable. The Massively Parallel Splicing Assay (MaPSy) is an innovative system used to identify and evaluate the effect of exonic mutations on mRNA splicing (Soemedi et al., 2017). This assay screened single nucleotide mutations in exonic regions reported in the Human Gene Mutation Database (Stenson et al., 2003) via both in vivo and in vitro techniques, using fragments of either the mutant (MT) or wild-type (WT) sequences. For approximately 10% of the mutations, MaPSy confirmed splicing alterations both in vivo and in vitro; such mutations are termed exonic splicing mutations (ESMs).

Compared with laboratory experimentation, in silico prediction minimizes cost and time, and various computational methods have been applied to predict the splicing-altering effects of single nucleotide mutations (Jian et al., 2014). For example, MaxEntScan (Yeo and Burge, 2004) provides sequence-based prediction of splice sites, whereas ESEfinder (Cartegni et al., 2003) and Spliceman (Lim and Fairbrother, 2012) identify splicing regulatory elements from sequence. Furthermore, Human Splicing Finder (Desmet et al., 2009) and MutPred Splice (Mort et al., 2014) combine a sequence-based approach with various genomic attributes to evaluate the effect of a particular mutation.

The combination of predictive models is known as ensemble learning in machine learning communities and is a powerful tool in bioinformatics. Ensemble learning combines individual base predictors to achieve highly accurate classification decisions by voting on the decisions of the base predictors (Dietterich, 2000). Ensemble techniques help address the issues associated with small sample sizes by averaging over multiple classification models, reducing the risk of overfitting the training data. Ideally, the individual base predictors should be moderately diverse. Supervised machine learning methods can be categorized into two types: (a) the classical approach, in which features are generated manually from the input data, and (b) feature learning, which automatically extracts features directly from the input data and is represented by deep neural networks (DNNs). For genomic problems, convolutional layers can learn latent patterns from base sequences (Naito, 2018), although this type of DNN application lacks genome-specific attributes.

We hypothesized that ensemble learning combining a sequence-based DNN model with machine learning models based on genomic attributes could improve prediction accuracy. We participated in the MaPSy challenge of the Fifth Critical Assessment of Genome Interpretation (CAGI 5), in which participants were provided with the dataset used in the MaPSy experiment and were required to predict the effects of exonic single nucleotide variants on splicing.

2. MATERIALS AND METHODS

2.1. MaPSy experiment and dataset

In the MaPSy experiment (Soemedi et al., 2017), the differences between WT and MT splicing of template DNA were analyzed to identify allelic imbalances in splicing efficiency. These changes were expressed as allelic ratios (ARs): $\log_2 \frac{d/c}{b/a}$, where $a$, $b$, $c$, and $d$ are the read counts of WT input DNA, WT spliced RNA, MT input DNA, and MT spliced RNA, respectively. Statistical significance was assessed with a two-sided Fisher’s exact test. Mutations resulting in an AR change ≥1.5-fold with a P-value <0.05 were defined as ESMs.
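For concreteness, the AR computation and the ESM call can be sketched as follows; the exact layout of the 2×2 contingency table passed to Fisher’s exact test is an assumption, as the text does not specify it.

```python
# A minimal sketch of the AR calculation and ESM definition described
# above; the 2x2 table passed to Fisher's exact test is an assumption.
import numpy as np
from scipy.stats import fisher_exact

def log2_allelic_ratio(a, b, c, d):
    """log2 AR = log2((d/c) / (b/a)), where a = WT input, b = WT spliced,
    c = MT input, and d = MT spliced read counts."""
    return np.log2((d / c) / (b / a))

def is_esm(a, b, c, d, fold=1.5, alpha=0.05):
    log2_ar = log2_allelic_ratio(a, b, c, d)
    # Two-sided Fisher's exact test on the spliced vs. input counts.
    _, p = fisher_exact([[b, a], [d, c]], alternative="two-sided")
    return abs(log2_ar) >= np.log2(fold) and p < alpha
```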

The dataset included the single nucleotide variants tested in the MaPSy experiment. There were 4,964 variants, comprising 453 ESM variants and 4,511 non-ESM variants. The data included the WT and MT sequences, the position of the exons, the read counts of the input and spliced fractions for WT and MT in vivo and in vitro, the log2 AR in vivo and in vitro, and the ESM class (0, non-ESM; 1, ESM). The sequences were 200 base pairs in length, consisting of a 170-mer reporter sequence and a pair of forward and reverse 15-mer common primers, which were removed from the analysis because the primers were identical in every sample.

The challenge participants were required to predict the ESM class and the log2 AR in vivo and in vitro. We generated models that predicted AR instead of log2 AR, because samples whose log2 AR was negative infinity could still be used after exponentiation, their AR being 0. Training data in which log2 AR was either positive infinity or NaN were excluded.

2.2. General outline of our method

Our method combined two types of base predictors: (a) a DNN-based predictive model that automatically extracts information from the nucleotide sequence and (b) machine learning models trained with “handcrafted” features. For validation, 10% of the data was set aside. To adjust for the imbalance between ESM and non-ESM data, ESM samples were randomly oversampled when generating the training set for ESM classification, as sketched below.
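A minimal sketch of the oversampling step is given below, assuming feature/label arrays X and y; balancing the classes to parity and the fixed seed are assumptions, as the final class ratio is not stated in the text.

```python
# A minimal sketch of random oversampling of the ESM (minority) class;
# balancing to parity is an assumption.
import numpy as np

def oversample_minority(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)  # ESM (minority) indices
    neg = np.flatnonzero(y == 0)  # non-ESM (majority) indices
    # Draw minority samples with replacement until the classes balance.
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```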

2.3. Deep neural networks (DNN)

DNNs are computational models composed of multiple processing layers that learn representations of data with multiple levels of abstraction (LeCun et al., 2015). These models have achieved record-breaking results in various fields, largely owing to the recent revival of convolutional neural networks. Their performance is especially remarkable in image processing (Krizhevsky et al., 2012), and they have been applied to automated image recognition, for example of pathological images, with a high level of accuracy (Naito et al., 2017). In addition, there are several successful examples of applying DNNs to predict the effect of genetic variants, and DNNs have also been employed in the CAGI challenges (Laksshman et al., 2017). One advantage of DNNs is that they do not require feature engineering, as the models learn features directly from the data. We therefore took advantage of DNNs to learn directly from the base sequence information.

The DNN architecture used in the challenge is shown in Figure 1. The input layer receives sequence data encoded as one-hot vectors, where each base is converted to a five-element (“A”, “C”, “G”, “T”, and “mutated”) vector in which only one element is 1 and the others are 0. The mutated base, which differs between the WT and MT sequences, was converted to a vector in which only the last element is 1. Two convolutional layers, each followed by a max-pooling layer, follow the input layer. The last two layers were fully connected, followed by the binary output layer. To return the probability of ESM, a softmax activation was applied to the final output. Batch normalization (Ioffe and Szegedy, 2015) was applied to the convolutional layers. Dropout (Srivastava et al., 2014) was used on the convolutional and fully connected layers. The Adam optimizer was used for training, with categorical cross-entropy and mean squared error as the loss functions for the ESM probability and the allelic ratio, respectively. Hyper-parameters were determined with the tree-structured Parzen estimator (TPE) approach (Bergstra et al., 2015), with the following candidate values: {2, 3} for the number of convolutional layers; {4, 8, 16, 32, 64} for the number of convolutional filters; {4, 8, 16} for the convolutional kernel size; {2, 4} for the pooling size; {0.25, 0.5} for the dropout ratio; and {32, 64, 128, 256, 512} for the fully connected layer size. The final parameters were: two convolutional layers (64 filters with kernel size 16 for the first, 8 filters with kernel size 4 for the second), pooling size 2 for both layers, dropout ratio 0.5 for both layers, and a fully connected layer of size 128. The architecture was implemented with Keras, a Python library (Chollet, 2015).
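A minimal Keras sketch of this architecture with the final hyper-parameters is shown below; the 170-base input length (the 200-bp fragment minus the two 15-mer primers), the ReLU activations, the padding, and the exact ordering of batch normalization, pooling, and dropout within each block are assumptions not fully specified above.

```python
# A minimal sketch of the ESM classifier; layer sizes follow the final
# hyper-parameters in the text, while the block ordering is an assumption.
from tensorflow.keras import layers, models, optimizers

def build_esm_classifier(seq_len=170, channels=5):
    model = models.Sequential([
        # One-hot input: A, C, G, T, and a fifth "mutated" channel.
        layers.Input(shape=(seq_len, channels)),
        # First convolutional block: 64 filters, kernel size 16.
        layers.Conv1D(64, 16, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling1D(2),
        layers.Dropout(0.5),
        # Second convolutional block: 8 filters, kernel size 4.
        layers.Conv1D(8, 4, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling1D(2),
        layers.Dropout(0.5),
        layers.Flatten(),
        # Fully connected layer of size 128, then a binary softmax output.
        layers.Dense(128, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```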

Figure 1.

The architecture of deep neural networks used in our method.

2.4. Feature engineering and application of machine learning models

2.4.1. Selected feature values

The majority of the features investigated in this study were chosen because of prior evidence of their ability to predict the splicing-related outcomes of genetic variants (Sinha et al., 2008; Woolfe et al., 2010; Mort et al., 2014). A subset of these features was also employed in the original MaPSy paper (Soemedi et al., 2017) and included not only naive genomic attributes (e.g., exon length and GC content) but also values calculated with previously established tools, such as MaxEntScan. Features were categorized into three types: (a) exon/intron (either or both), (b) mutation, and (c) gene. All the features used here are shown in Table 1.

Table 1.

Summary of features investigated in this method.

Feature | Description | Type
3’ SS | 3’ splice site score calculated with MaxEntScan. | Exon/Intron
3’ SS diff | Difference in 3’ SS between WT and MT. | Mutation
5’ SS | 5’ splice site score calculated with MaxEntScan. | Exon/Intron
5’ SS diff | Difference in 5’ SS between WT and MT. | Mutation
Exon ESRseq | Hexamer splicing score of the WT exon. | Exon
ESRseq diff | Difference in hexamer splicing score between WT and MT. | Mutation
3’ intron ESRseq | Hexamer splicing score of the 3’ intron. | Intron
5’ intron ESRseq | Hexamer splicing score of the 5’ intron. | Intron
HI score | Haploinsufficiency score. | Gene
HI proba | Haploinsufficiency probability. | Gene
MFE | Minimum free energy computed with the ViennaRNA package. | Exon/Intron
MFE diff | Difference in minimum free energy between WT and MT. | Mutation
Exon GC rate | GC content of the exon sequence. | Exon
Exon POS | Exon number divided by the total number of exons in the gene. | Exon
Exon num | Number of exons composing the gene. | Exon
Exon len | Length of the exon. | Exon
MT dist | Distance from the mutation to the nearer splice site. | Mutation
3’ DSSP | 3’ splice site probability of MT calculated with DSSP. | Exon/Intron
3’ DSSP diff | Difference in 3’ splice site probability between WT and MT. | Mutation
5’ DSSP | 5’ splice site probability of MT calculated with DSSP. | Exon/Intron
5’ DSSP diff | Difference in 5’ splice site probability between WT and MT. | Mutation

Splice site scores at the 3’ and 5’ sites were computed using MaxEntScan (Yeo and Burge, 2004). ESRseq scores were computed as hexamer splicing scores (Ke et al., 2011). The difference between the ESRseq scores derived from the exonic/intronic regions of the WT vs. MT sequences was investigated. Haploinsufficiency (HI) scores and probabilities were obtained from a previous study that developed an HI prediction model using a large deletion dataset (Huang et al., 2010). The minimum free energy (MFE) was computed using the ViennaRNA package (Lorenz et al., 2011) at default settings. Naive genomic attributes were obtained from the human reference assembly (GRCh37/hg19).

Splice site probabilities calculated by DSSP (Naito, 2018) were used as new, additional feature values. DSSP is a DNN-based model that calculates splice site probability from a 140-nucleotide sequence in which the middle nucleotides represent the consensus sequence. DSSP can predict the probability of both 3’ and 5’ splice sites; therefore, the probabilities of splice sites within the WT sequence, as well as the differences between the WT and MT sequences, were calculated as feature values.

2.4.2. Machine learning models

We utilized several machine learning methods to generate predictive models from the aforementioned feature values, including random forests (RF) (Breiman, 2001) and eXtreme gradient boosting (XGBoost) (Chen and Guestrin, 2016). We also used logistic regression for ESM class prediction and least-squares linear regression for AR prediction (both abbreviated as LR, denoting either linear or logistic regression depending on the context). The models were implemented with scikit-learn, a machine learning library for Python (Pedregosa et al., 2011), as sketched below. Feature selection for each model was conducted using backward elimination.
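For illustration, the level-0 classification models can be instantiated as follows; the hyper-parameter values shown are placeholders, since the actual values were chosen by random search on the validation data.

```python
# A minimal sketch of the level-0 models; hyper-parameter values are
# illustrative placeholders, not the values selected in the study.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

level0_models = {
    "RF": RandomForestClassifier(n_estimators=500, max_depth=10),
    "XGBoost": XGBClassifier(max_depth=6, learning_rate=0.1,
                             n_estimators=300, gamma=0.0),
    "LR": LogisticRegression(max_iter=1000),
}

# All three expose the same fit/predict_proba interface:
# for name, model in level0_models.items():
#     model.fit(X_train, y_train)
```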

In the RF model, each tree in the forest is constructed from a different bootstrap sample of the original dataset, and the remaining (out-of-bag) samples are used for validation. The results from all trees are then averaged to provide unbiased estimates of predicted values, error rates, and measures of variable importance. The default parameters of the scikit-learn package were used to build the random forest model, with the exception of the number of trees and the maximum tree depth, which were chosen by a random search using the validation data.

XGBoost is an optimized, distributed gradient tree-boosting library designed to be highly efficient and fast. Gradient tree boosting builds a predictive model as an ensemble of regression trees: at each step, a new tree is fitted, via gradient descent on the loss, to correct the errors of the current ensemble. The maximum tree depth, learning rate, number of trees, and gamma value were determined by a random search using the validation data, and the default values of the package were used for the other parameters.
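A sketch of such a random search using scikit-learn’s RandomizedSearchCV is shown below; the candidate ranges are illustrative assumptions, and the study evaluated candidates against a held-out validation set rather than by cross-validation.

```python
# A minimal sketch of random hyper-parameter search for XGBoost; the
# candidate ranges are assumptions, not the values searched in the study.
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "max_depth": [3, 4, 5, 6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "n_estimators": [100, 200, 300, 500],
    "gamma": [0.0, 0.1, 0.5, 1.0],
}
search = RandomizedSearchCV(XGBClassifier(), param_distributions,
                            n_iter=50, scoring="roc_auc", cv=3)
# search.fit(X_train, y_train); the tuned model is search.best_estimator_
```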

2.5. Combination of models

We used stacked generalization to combine the aforementioned models. Stacked generalization is a scheme that uses a higher-level predictive model (the level-1 generalizer) to combine lower-level base predictive models (level-0 generalizers) to achieve greater predictive accuracy (Wolpert, 1992). Following the standard formulation (Ting and Witten, 1999), we performed stacked generalization with a level-1 linear regression for AR and a level-1 logistic regression for the ESM probability, each trained on the outputs of the aforementioned level-0 models.

The scheme of stacked generalization that we applied is shown in Figure 2. The training data (excluding the validation data) were split into 100 subsets. The base predictors (Models 1 to K) were trained on 99 subsets, and the probabilities they predicted for the remaining subset (Predictions 1 to K) were stacked. This process was repeated until predicted probabilities had been stacked for every subset. These out-of-fold probabilities were then used to train the level-1 generalizer, as in the sketch below.
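The following is a simplified sketch of this out-of-fold stacking for ESM classification, assuming level-0 models with scikit-learn-style fit and predict_proba methods (the DNN would require a thin wrapper exposing the same interface).

```python
# A minimal sketch of out-of-fold stacking; KFold stands in for the
# 100-subset split described above.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def stack(models, X, y, n_splits=100):
    level0_preds = np.zeros((len(X), len(models)))
    for train_idx, hold_idx in KFold(n_splits=n_splits).split(X):
        for j, m in enumerate(models):
            fitted = clone(m).fit(X[train_idx], y[train_idx])
            # Stack the held-out ESM probabilities for the level-1 model.
            level0_preds[hold_idx, j] = fitted.predict_proba(X[hold_idx])[:, 1]
    # Level-1 generalizer: logistic regression on the stacked probabilities.
    return LogisticRegression().fit(level0_preds, y)
```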

Figure 2.

The scheme for combining models by stacked generalization.

2.6. Performance Evaluation

2.6.1. Performance

The separated 10% of the data was used for performance evaluation. To evaluate the ESM class prediction models, the areas under the curve (AUC) of the receiver operating characteristic (ROC) and precision-recall (PR) curves were calculated. Spearman’s correlation coefficient (rho) and the mean squared error (MSE) were calculated to evaluate the predicted in vivo and in vitro log2 ARs. In addition, we compared the prediction performance of our method with that of SpliceAI, a novel tool that quantifies the effects of single nucleotide variants on splicing using DNNs (Jaganathan et al., 2019). This tool calculates delta scores ranging from 0 to 1, which can be interpreted as the probability of a variant being splice-altering. The delta scores for gain and loss of splicing at the 3’ and 5’ splice sites were calculated, and we took their maximum as the probability of a variant disrupting splicing.
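These metrics map onto standard library calls as in the sketch below, which assumes arrays of true ESM labels, predicted ESM probabilities, and true and predicted log2 ARs; average precision is used here as a stand-in summary of the PR curve.

```python
# A minimal sketch of the evaluation metrics; y_true, esm_prob, ar_true,
# and ar_pred are assumed NumPy arrays.
from scipy.stats import spearmanr
from sklearn.metrics import (average_precision_score, mean_squared_error,
                             roc_auc_score)

def evaluate(y_true, esm_prob, ar_true, ar_pred):
    return {
        "ROC AUC": roc_auc_score(y_true, esm_prob),
        # Average precision summarizes the PR curve (approximates PR AUC).
        "PR AUC": average_precision_score(y_true, esm_prob),
        "MSE": mean_squared_error(ar_true, ar_pred),
        "Spearman rho": spearmanr(ar_true, ar_pred).correlation,
    }
```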

2.6.2. Feature importance

The importance of the features used in this investigation was evaluated to assess the potential of each feature to discriminate between ESM and non-ESM. The feature importance ranking was generated from the RF model trained for ESM prediction. The importance of each feature was calculated as the total reduction in node impurity attributable to that feature, averaged over all trees, which is referred to as the Gini importance (Breiman, 2001).
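For reference, these Gini importances can be read directly from a fitted scikit-learn random forest, as in the sketch below; rf and feature_names are assumed inputs.

```python
# A minimal sketch of extracting the feature ranking from the trained RF;
# `rf` is a fitted RandomForestClassifier and `feature_names` matches the
# columns of the training matrix.
import pandas as pd

def feature_ranking(rf, feature_names):
    # feature_importances_ holds the impurity-based (Gini) importances.
    return (pd.Series(rf.feature_importances_, index=feature_names)
              .sort_values(ascending=False))
```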

We presumed that some features (e.g., exon GC content) could be captured by the DNN filters whereas others (e.g., exon number) could not; thus, we additionally analyzed which features improved performance when combined with the DNN. We generated an LR model for each feature, combined each feature-derived LR model with the DNN by stacked generalization, and compared the resulting performance improvements.

3. RESULTS AND DISCUSSION

3.1. Performance

The ROC-AUC and PR-AUC for all of the model combinations are shown in Table 2. Figure 3 presents these curves; for clarity, only the ROC and PR curves of the individual base predictors (DNN, RF, XGBoost, and LR) and the all-combined ensemble model (Ensemble) are shown. XGBoost performed best among the individual base predictors. The all-combined ensemble model displayed higher ROC-AUC and PR-AUC than any individual base predictor. In addition, the DNN combined with any of the machine learning models predicted ESM better than the individual base predictors. Among the ensemble models, the model composed of DNN, XGBoost, and LR performed best. Our ensemble models also outperformed SpliceAI in both ROC-AUC and PR-AUC. These results suggest that combining DNN and machine learning models, two very different approaches, can improve prediction performance, as gaps in one method can be compensated by the other. For instance, when converting sequence data into one-hot vectors, a mutated nucleotide was always converted to a vector in which only the last element is 1. This encoding discards the identities of the original and mutated bases, but this information may be complemented by the genomic attributes, in which it is indirectly included.

Table 2.

ROC AUC and PR AUC for ESM class prediction.

Method | ROC AUC | PR AUC
DNN | 0.776477 | 0.278518
RF | 0.766881 | 0.230489
RF + DNN | 0.823249 | 0.329222
XGBoost | 0.800596 | 0.319162
XGBoost + DNN | 0.829886 | 0.368649
LR | 0.720325 | 0.248547
LR + DNN | 0.811995 | 0.393629
RF + XGBoost | 0.801751 | 0.327176
RF + XGBoost + DNN | 0.831281 | 0.369969
RF + LR | 0.764717 | 0.228589
RF + LR + DNN | 0.824163 | 0.334075
XGBoost + LR | 0.804684 | 0.327790
XGBoost + LR + DNN | 0.837437 | 0.398162
RF + XGBoost + LR | 0.798144 | 0.298897
RF + XGBoost + LR + DNN | 0.831810 | 0.372592
SpliceAI | 0.762938 | 0.286482

Figure 3.

Receiver operating characteristic (ROC) curves (A) and precision-recall (PR) curves (B) for ESM class prediction. Areas under the curve (AUC) are shown in parentheses to the right of the labels.

The MSE and Spearman’s rho for predicting log2 AR in vivo and in vitro are shown in Table 3. Among the base predictors, LR performed best in both MSE and Spearman’s rho, in vivo and in vitro. In vivo, the all-combined ensemble model performed better than the base predictors. In contrast, no in vitro combination of models improved on LR when evaluated by MSE, which may be explained by LR performing far better in vitro than the other base predictors. The DNN performed poorly in Spearman’s rho, both in vivo and in vitro. In addition, a comparison of the standardized partial regression coefficients of the level-1 LR revealed that the DNN contributed little to the prediction of log2 AR, both in vivo and in vitro (Table 4). This suggests that the DNN architecture of the current study may not be suitable for these regression problems, and that other activation functions for the last fully connected layer or other loss functions might fit the distribution of AR values more adequately.

Table 3.

MSE and Spearman’s rho for prediction of log2 AR in vivo and in vitro

Method | MSE (in vivo) | Spearman’s rho (in vivo) | MSE (in vitro) | Spearman’s rho (in vitro)
DNN | 1.30548 | 0.05084 | 1.00574 | 0.09330
RF | 1.30418 | 0.34763 | 0.95797 | 0.40759
RF + DNN | 1.26058 | 0.31701 | 0.96434 | 0.41124
XGBoost | 1.40212 | 0.35378 | 0.92402 | 0.50614
XGBoost + DNN | 1.29030 | 0.08718 | 0.92394 | 0.50534
LR | 1.22814 | 0.38239 | 0.76193 | 0.50903
LR + DNN | 1.20719 | 0.37599 | 0.76509 | 0.50966
RF + XGBoost | 1.29381 | 0.36163 | 0.91377 | 0.49302
RF + XGBoost + DNN | 1.24926 | 0.33593 | 0.91547 | 0.49673
RF + LR | 1.19564 | 0.38680 | 0.76650 | 0.50660
RF + LR + DNN | 1.16961 | 0.38827 | 0.77075 | 0.50783
XGBoost + LR | 1.22412 | 0.38288 | 0.81491 | 0.51463
XGBoost + LR + DNN | 1.20390 | 0.37737 | 0.81552 | 0.51472
RF + XGBoost + LR | 1.19216 | 0.38758 | 0.81764 | 0.51270
RF + XGBoost + LR + DNN | 1.16580 | 0.39046 | 0.81891 | 0.51317

Table 4.

Standardized partial regression coefficient of LR as level-1 generalizer

Method | in vivo | in vitro
DNN | 0.008616 | −0.001116
RF | 0.102111 | 0.006465
XGBoost | 0.048226 | 0.056795
LR | 0.018087 | 0.015809

3.2. Feature importance

Figure 4 depicts the feature ranking. Mutation-type features contributed more to the prediction of ESM. Splice site strength and ESRseq exhibited high importance, a tendency consistent with previous investigations (Sinha et al., 2008; Woolfe et al., 2010; Mort et al., 2014). Because a previous study suggested that motifs of exonic splicing enhancers are also enriched in introns near splice sites (Wu et al., 2005), we experimentally adopted the ESRseq of the 3’ and 5’ introns as feature variables; however, they were not as informative as the ESRseq of the exon.

Figure 4.

Feature ranking generated from the RF model for ESM class prediction. Features are displayed vertically in descending order of importance, and the horizontal axis shows the feature importance (Gini importance). The red, blue, and green bars represent the mutation, exon/intron, and gene feature types, respectively.

Figure 5 shows the performance of each feature-derived LR model and of the ensemble model combining each feature with the DNN. Higher performance of a feature-derived LR model resulted in a corresponding improvement in the performance of the DNN; however, these changes in performance were slight. As shown in Table 2, the performance of the DNN improved significantly when combined with any machine learning method trained with the features. Thus, while the effect of each feature was too slight to judge which one contributed appreciably, their integration achieved a significant improvement in overall performance.

Figure 5.

ROC-AUC (A) and PR-AUC (B) of each feature-derived logistic regression model for ESM class prediction. For each feature, the left bar shows the performance of the LR model and the right bar shows the performance of the ensemble model combining that feature with the DNN. Features are displayed in ascending order of the AUC of each feature-derived LR model. The red, blue, and green bars represent the mutation, exon/intron, and gene feature types, respectively. The horizontal red lines show the AUC values of the DNN.

3.3. Prediction via a DNN-based splice site prediction tool

It is worth noting that, in the feature ranking, the features generated from DSSP contributed significantly to the prediction model. According to Figure 4, the DSSP features of the 5’ splice site were more important for prediction than those of the 3’ splice site. This observation may be attributable to the fact that DSSP predicts splice site probabilities more accurately at 5’ splice sites than at 3’ splice sites (Naito, 2018). As a post-challenge analysis, we generated simple machine learning models (with the default parameters of the libraries) trained only with 3’ DSSP, 3’ DSSP diff, 5’ DSSP, and 5’ DSSP diff, and evaluated them with the validation data. All of them, especially XGBoost, achieved performance comparable to the base predictors used in the challenge (Table 5).

Table 5.

ROC AUC and PR AUC of models trained with DSSP-derived features for ESM class prediction

Method | ROC AUC | PR AUC
RF | 0.769334 | 0.304683
XGBoost | 0.797182 | 0.305917
LR | 0.740525 | 0.246298

The probability scores of splice sites and their differences between the WT and MT sequences are presumed to be equivalent to the splice site strength and the change caused by a mutation, respectively. Hence, their integration by machine learning leads to moderate performance in predicting whether a mutation causes splicing disruption. This observation is consistent with the report that SpliceAI achieves its prediction capability by evaluating the change in splice site strength caused by a mutation (Jaganathan et al., 2019).

3.4. Limitations

It is important to mention several limitations of this study. With respect to ESM classification, we did not differentiate the two directions of AR change (gain and loss), and it may have been worthwhile to generate a model that incorporates AR values into the ESM classification. In terms of model generation and validation, the hyper-parameters were determined by evaluating a single held-out set that was also used for validation; thus, the performance of our methods might be overestimated owing to overfitting of the hyper-parameters. Furthermore, only a single validation set was used for performance evaluation. These limitations could be addressed by a more rigorous procedure such as nested cross-validation (Stone, 1974). Visualization of the DNN filters would help delineate what the DNN extracted from the sequence data and what caused the change in splicing. We tried several visualization approaches, including integrated gradients; however, no informative visualization was achieved.

As for usability, the current method is confined to single nucleotide variants and does not handle other types of genetic variation (e.g., microdeletions or microinsertions). In addition, some genomic feature values were calculated with various tools and others were extracted from web programs; thus, we could not implement an integrated tool that simply returns a prediction from an input. Instead, we uploaded the sequence-based DNN architecture used in ESM prediction, together with simple Python source code to run it, to a GitHub repository (https://github.com/DSSP-github/CAGI_MaPSy_Prediction_TN).

4. CONCLUSION

We participated in the MaPSy challenge of CAGI 5 and applied a computational method to predict exonic single nucleotide mutations that impaired pre-mRNA splicing. By combining a sequence-based DNN model and machine learning models based on genomic attributes, we achieved significant improvement in prediction performance. In addition, feature values generated by DSSP, a DNN-based splice site prediction system, may be informative for predicting splicing mutations.

ACKNOWLEDGEMENTS

The author acknowledges the organizers of CAGI 5 and the data providers for the MaPSy challenge. The CAGI experiment coordination is supported by NIH U41 HG007346 and the CAGI conference by NIH R13 HG006650.

Funding: No funding was provided for this study. The CAGI experiment coordination is supported by NIH U41 HG007346 and the CAGI conference by NIH R13 HG006650.

REFERENCES

1. Bergstra J, Komer B, Eliasmith C, Yamins D, Cox DD. (2015). Hyperopt: a Python library for model selection and hyperparameter optimization. Comput Sci Discov 8:014008.
2. Breiman L. (2001). Random forests. Mach Learn 45:5–32.
3. Cartegni L, Wang J, Zhu Z, Zhang MQ, Krainer AR. (2003). ESEfinder: a web resource to identify exonic splicing enhancers. Nucleic Acids Res 31:3568–3571.
4. Chen T, Guestrin C. (2016). XGBoost: a scalable tree boosting system. Proc 22nd ACM SIGKDD Int Conf Knowl Discov Data Min (KDD ’16) 785–794.
5. Chollet F. (2015). Keras. https://github.com/keras-team/keras.
6. Desmet FO, Hamroun D, Lalande M, Collod-Béroud G, Claustres M, Béroud C. (2009). Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic Acids Res 37:1–14.
7. Dietterich TG. (2000). Ensemble methods in machine learning. Proc First Int Workshop Mult Classifier Syst 1–15.
8. Huang N, Lee I, Marcotte EM, Hurles ME. (2010). Characterising and predicting haploinsufficiency in the human genome. PLoS Genet 6:1–11.
9. Ioffe S, Szegedy C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. Proc ICML 448–456.
10. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, et al. (2019). Predicting splicing from primary sequence with deep learning. Cell 176:535–548.
11. Jian X, Boerwinkle E, Liu X. (2014). In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res 42:13534–13544.
12. Ke S, Shang S, Kalachikov SM, Morozova I, Yu L, Russo JJ, Ju J, Chasin LA. (2011). Quantitative evaluation of all hexamers as exonic splicing elements. Genome Res 21:1360–1374.
13. Krizhevsky A, Sutskever I, Hinton GE. (2012). ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst (NIPS) 1097–1105.
14. Laksshman S, Bhat RR, Viswanath V, Li X. (2017). DeepBipolar: identifying genomic mutations for bipolar disorder via deep learning. Hum Mutat 38:1217–1224.
15. LeCun Y, Bengio Y, Hinton G. (2015). Deep learning. Nature 521:436–444.
16. Lim KH, Fairbrother WG. (2012). Spliceman — a computational web server that predicts sequence variations in pre-mRNA splicing. Bioinformatics 28:1031–1032.
17. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL. (2011). ViennaRNA Package 2.0. Algorithms Mol Biol 6:26.
18. Mort M, Sterne-Weiler T, Li B, Ball EV, Cooper DN, Radivojac P, Sanford JR, Mooney SD. (2014). MutPred Splice: machine learning-based prediction of exonic variants that disrupt splicing. Genome Biol 15:1–20.
19. Naito T. (2018). Human splice-site prediction with deep neural networks. J Comput Biol 25:954–961.
20. Naito T, Nagashima Y, Taira K, Uchio N, Tsuji S, Shimizu J. (2017). Identification and segmentation of myelinated nerve fibers in a cross-sectional optical microscopic image using a deep learning model. J Neurosci Methods 291:141–149.
21. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, et al. (2011). Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830.
22. Sinha R, Hiller M, Pudimat R, Gausmann U, Platzer M, Backofen R. (2008). Improved identification of conserved cassette exons using Bayesian networks. BMC Bioinformatics 9:1–14.
23. Soemedi R, Cygan KJ, Rhine CL, Wang J, Bulacan C, Yang J, Bayrak-Toydemir P, McDonald J, Fairbrother WG. (2017). Pathogenic variants that alter protein code often disrupt splicing. Nat Genet 49:848–855.
24. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. (2014). Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958.
25. Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NST, Abeysinghe S, Krawczak M, Cooper DN. (2003). Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat 21:577–581.
26. Stone M. (1974). Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Ser B 36:111–147.
27. Ting KM, Witten IH. (1999). Issues in stacked generalization. J Artif Intell Res 10:271–289.
28. Wolpert DH. (1992). Stacked generalization. Neural Networks 5:241–259.
29. Woolfe A, Mullikin JC, Elnitski L. (2010). Genomic features defining exonic variants that modulate splicing. Genome Biol 11:R20.
30. Wu Y, Zhang Y, Zhang J. (2005). Distribution of exonic splicing enhancer elements in human genes. Genomics 86:329–336.
31. Yeo G, Burge CB. (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 11:377–394.
