Abstract
Alternative polyadenylation (APA) is a crucial step in post-transcriptional regulation. Previous bioinformatic studies have mainly focused on the recognition of polyadenylation sites (PASs) in a given genomic sequence, which is a binary classification problem. Recently, computational methods for predicting the usage level of alternative PASs in the same gene have been proposed. However, all of them cast the problem as a non-quantitative pairwise comparison task and do not take the competition among multiple PASs into account. To address this, here we propose a deep learning architecture, Deep Regulatory Code and Tools for Alternative Polyadenylation (DeeReCT-APA), to quantitatively predict the usage of all alternative PASs of a given gene. To accommodate different genes with potentially different numbers of PASs, DeeReCT-APA treats the problem as a regression task with a variable-length target. Based on a convolutional neural network-long short-term memory (CNN-LSTM) architecture, DeeReCT-APA extracts sequence features with CNN layers, uses bidirectional LSTM to explicitly model the interactions among competing PASs, and outputs percentage scores representing the usage levels of all PASs of a gene. In addition to the fact that only our method can quantitatively predict the usage of all the PASs within a gene, we show that our method consistently outperforms other existing methods on three different tasks for which they are trained: pairwise comparison task, highest usage prediction task, and ranking task. Finally, we demonstrate that our method can be used to predict the effect of genetic variations on APA patterns and sheds light on future mechanistic understanding in APA regulation. Our code and data are available at https://github.com/lzx325/DeeReCT-APA-repo.
Keywords: Polyadenylation, Gene regulation, Sequence analysis, Deep learning, Bioinformatics
Introduction
In eukaryotic cells, the termination of Pol II transcription involves 3ʹ-end cleavage followed by addition of a poly(A) tail, a process termed “polyadenylation”. Often, one gene has multiple polyadenylation sites (PASs). The resulting alternative polyadenylation (APA) can generate different transcript isoforms with different 3ʹ-UTRs, and sometimes even different protein-coding sequences, from the same gene locus. The diverse 3ʹ-UTRs generated by APA may contain different sets of cis-regulatory elements, thereby modulating mRNA stability [1], [2], [3], translation [4], subcellular localization of mRNAs [5], [6], [7], or even the subcellular localization and function of the encoded proteins [8]. Importantly, it has been shown that dysregulation of APA can result in various human diseases [9], [10], [11], [12].
APA is regulated by the interaction between cis-elements located in the vicinity of PASs and the associated trans-factors [13]. The most well-known cis-element that defines a PAS is the hexamer AAUAAA and its variants located 15–30 nt upstream of the cleavage site, which is directly recognized by the cleavage and polyadenylation specificity factor (CPSF) components: CPSF30 and WDR33 [14]. Other auxiliary cis-elements located upstream or downstream of the cleavage site include upstream UGUA motifs bound by the cleavage factor Im (CFIm) and downstream U-rich or GU-rich elements targeted by the cleavage stimulation factor (CstF) [14]. The usage of individual PASs for a multi-PAS gene depends on how efficiently each alternative PAS is recognized by these 3′-end processing machineries, which is further regulated by additional RNA-binding proteins (RBPs) that could enhance or repress the usage of distinct PAS signals through binding in their proximity. In addition, the usage of alternative PASs is mutually exclusive. In particular, once an upstream PAS is utilized, all the downstream ones would have no chance to be used no matter how strong their PAS signals are. Therefore, proximal PASs, which are transcribed first, have positional advantage over the distal competing PASs [15]. Indeed, it has been observed that the terminal PASs more often contain the canonical AAUAAA hexamer which is considered to have higher affinity than the other variants, which possibly compensates for their positional disadvantage [16].
There has been a long-standing interest in predicting PASs based on genomic sequences using purely computational approaches. The so-called “PAS recognition problem” aims to discriminate between nucleotide sequences that contain a PAS and those that do not. A variety of hand-crafted features have been proposed, and statistical learning algorithms, e.g., random forest (RF), support vector machines (SVM), and hidden Markov models (HMM), are then applied to these features to solve the binary classification problem [17], [18], [19]. Recently, researchers started investigating the “PAS quantification problem”, which aims to predict a score that represents the strength of a PAS [20], [21]. This is much more difficult than the recognition problem.
Recent developments in deep learning have brought great improvements on many tasks [22]. It has also been applied, with remarkable success, to bioinformatics tasks such as protein–DNA binding [23], RNA splicing pattern prediction [24], enzyme function prediction [25], [26], Nanopore sequencing [27], [28], and promoter prediction [29]. Deep learning is favored for its automatic feature extraction ability and good scalability to large amounts of data. As for polyadenylation prediction, deep learning models have been applied to the PAS recognition problem, where they outperformed existing feature-based methods by a large margin [30]. Recently, deep learning models have also been applied to the PAS quantification problem: Polyadenylation Code [20] was developed to predict the stronger one of a given pair of competing PASs. Very recently, another model, DeepPASTA [21], has been proposed. DeepPASTA contains four different modules that deal with both the PAS recognition problem and the PAS quantification problem. Similar to Polyadenylation Code, DeepPASTA casts the PAS quantification problem as a pairwise comparison task.
In this study, we propose a novel deep learning method, Deep Regulatory Code and Tools for Alternative Polyadenylation (DeeReCT-APA), for the PAS quantification problem. DeeReCT-APA can quantitatively predict the usage of all the competing PASs from the same gene simultaneously, regardless of the number of PASs. The model is trained and evaluated on the dataset from a previous study [31], which consists of genome-wide PAS measurements of two different mouse strains [C57BL/6J (BL) and SPRET/EiJ (SP)] and their F1 hybrids. After training our model on the dataset, we comprehensively evaluate it based on a number of criteria. We demonstrate the necessity of modeling the competition among multiple PASs simultaneously. Finally, we show that our model can predict the effect of genetic variations on APA patterns, visualize APA regulatory motifs, and potentially facilitate the mechanistic understanding of APA regulation.
Method
Description of DeeReCT-APA architecture
The DeeReCT-APA method is based on a deep learning architecture that contains a set of neural network models composed of base networks (Base-Net, one for each competing PAS) and upper-level interaction layers. Each base network takes a 448-nt genomic DNA sequence centered around one competing PAS cleavage site as input, and outputs a vector which can be interpreted as the distilled features of that sequence. There are two types of base networks in our design, based on: 1) a hand-engineered feature extractor and 2) convolutional neural networks (CNNs). The output of the lower-level base network is then passed to the upper-level interaction layers, which computationally model the process of choosing among competing PASs. The interaction layers of DeeReCT-APA are based on long short-term memory networks (LSTMs) [32], which have achieved remarkable success in natural language processing and can naturally handle sentences of arbitrary length, and are therefore suitable for handling any number of alternative PASs from the same gene locus. The interaction layers then output the percentage values of all the competing PASs of the gene. The architecture is illustrated in Figure 1. The design of each part of the network is further explained in the following subsections.
Figure 1.
Illustration of the DeeReCT-APA architecture
The DeeReCT-APA architecture uses BiLSTM as an interaction layer. PAS, polyadenylation site.
We use three different base network designs: deep neural network architectures based on a single 1D convolution layer (Single-Conv-Net), multiple 1D convolution layers (Multi-Conv-Net), and a handcrafted feature extractor with fully connected layers (Feature-Net). Single-Conv-Net and Multi-Conv-Net are the two CNN structures for Base-Net. The Single-Conv-Net consists of only one 1D convolutional layer and takes the one-hot encoded sequences directly as input. The convolutional layer has a number of convolution filters which become automatically learned feature extractors after training. A rectified linear unit (ReLU) is used as the activation function. The subsequent max-pooling operation allows only values from highly activated neurons to pass to the upper fully connected layers. The three operations (convolution, ReLU, and max-pooling) form a convolution block. By contrast, the Multi-Conv-Net uses two convolution blocks before the fully connected layers. The increased depth makes it possible for the network to learn more complex representations. The structures of Single-Conv-Net and Multi-Conv-Net are shown in Figure 2A and B, respectively.
Figure 2.
Three designs of Base-Net
All three of the designs output a feature vector that represents distilled features of the input sequence. A. Single-Conv-Net uses a single convolution layer for feature extraction. B. Multi-Conv-Net uses multiple convolution layers for feature extraction. C. Feature-Net contains a handcrafted feature extractor before being processed by fully connected layers.
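To make the convolution block concrete, the following is a minimal sketch in plain Python of one-hot encoding followed by convolution, ReLU, and max-pooling for a single filter. It is an illustration only: the filter weights, the bias, the example sequences, and the use of global (rather than local) max-pooling are assumptions made here, not the trained parameters or exact layer configuration of Single-Conv-Net.

```python
NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); other letters become all zeros."""
    return [[1.0 if base == nt else 0.0 for nt in NUCLEOTIDES] for base in seq.upper()]

def conv1d(encoded, kernel, bias=0.0):
    """'Valid' 1D convolution of one filter (width x 4) over an encoded sequence."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for k in range(width):
            for c in range(4):
                total += encoded[start + k][c] * kernel[k][c]
        outputs.append(total)
    return outputs

def relu(values):
    return [max(0.0, v) for v in values]

def global_max_pool(values):
    """Keep only the strongest activation along the sequence (simplifying assumption)."""
    return max(values)

# Hypothetical hand-set filter that fires only on an exact AATAAA match.
motif = "AATAAA"
kernel = one_hot(motif)
bias = -(len(motif) - 1)  # a perfect match scores 6 - 5 = 1; anything less is clipped by ReLU

with_signal = global_max_pool(relu(conv1d(one_hot("GCGTAATAAAGCTTGC"), kernel, bias)))
without_signal = global_max_pool(relu(conv1d(one_hot("GCGTAATAAGGCTTGC"), kernel, bias)))
print(with_signal, without_signal)
```

In a trained network the filter weights are learned rather than set by hand, and many filters are applied in parallel to produce the feature vector.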
As a comparison, we also design a base network that works with hand-engineered features, named Feature-Net. The Feature-Net only consists of multiple fully connected layers and takes multiple types of features extracted from the sequences of interest as input. The features, as described in [20], include polyadenylation signals, auxiliary upstream elements, core upstream elements, core downstream elements, auxiliary downstream elements [33], and RBP motifs, as well as 1-mer, 2-mer, 3-mer, and 4-mer features (detailed in File S1 and Table S1). Each feature value corresponds to the occurrence of each motif. The extracted features are then z-score normalized. The architecture is illustrated in Figure 2C.
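As a rough illustration of the k-mer part of this feature set, the sketch below counts k-mer occurrences and z-score normalizes each feature across sequences. It is a simplified sketch under stated assumptions: overlapping occurrences are counted, population standard deviation is used, constant features are set to zero, and the example sequences are invented. The motif-based features of [20] are not reproduced here.

```python
from itertools import product
from statistics import mean, pstdev

def kmer_counts(seq, k):
    """Count (overlapping) occurrences of every k-mer over A/C/G/T, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    return [counts[kmer] for kmer in kmers]

def zscore_columns(rows):
    """Z-score normalize each feature (column) across sequences (rows)."""
    normalized_columns = []
    for column in zip(*rows):
        mu, sd = mean(column), pstdev(column)
        normalized_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*normalized_columns)]

# Invented example sequences; real inputs would be the PAS flanking regions.
sequences = ["AATAAAGTGT", "ATTAAACCCC", "GGGGTGTGTT"]
raw_features = [kmer_counts(s, 1) + kmer_counts(s, 2) for s in sequences]
features = zscore_columns(raw_features)
```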
Design of the interaction layers
The utilization of alternative PASs is intrinsically competitive. On the one hand, as a multi-PAS gene is transcribed, any one of its PASs along the already transcribed region can be used. However, once one of them has been used, the other PASs can no longer be chosen. On the other hand, given that the same polyadenylation machinery is used by all the alternative PASs, competition for these resources also contributes to the competitiveness of this process. However, previous work on polyadenylation usage prediction did not take this important point into account [20], [21]. Both existing models, Polyadenylation Code and DeepPASTA [tissue-specific relatively dominant poly(A) site prediction model], can only take in two PASs at a time, ignoring the competition with others. Here, to overcome this limitation, we consider all the competing PASs at the same time, i.e., our model takes all the PASs in a gene simultaneously as input and jointly predicts the usage levels of all of them.
To fulfil this, we design the interaction layers above the base networks to model the interactions between different PASs. In neural networks, the most common way to model interactions among inputs is to introduce a recurrent neural network (RNN) layer, which can capture the interdependencies among inputs corresponding to each time step. We choose the LSTM [32] as the foundation of the interaction layers. LSTM is a type of RNN that has hidden memory cells which are able to remember a state for an arbitrary number of time steps, making it one of the most popular RNNs. To fit the PAS usage level prediction task, each time step of the LSTM corresponds to one PAS, at which the LSTM takes the extracted features of that PAS from the lower-level base network. As there is influence both from upstream PASs to downstream PASs and vice versa, we use a bidirectional LSTM (BiLSTM), in which one LSTM’s time steps go from the upstream PAS to the downstream one and the other’s from downstream to upstream. The outputs of the two LSTMs at the same PAS are then concatenated and sent to the upper fully connected layer. The fully connected layer transforms the LSTM output to a scalar value representing the unnormalized log-probability (logit) of that PAS being used. After passing through a final SoftMax layer, the logits of all competing PASs are transformed to properly normalized percentage scores, which sum up to one and represent their probabilities of being chosen. The detailed architecture is shown in Figure 1. We point out that, although DeepPASTA also contains a BiLSTM component, its BiLSTM layer processes the sequence of one of the two competing PASs that are given as input. The time steps of that BiLSTM correspond to different positions in one particular sequence rather than to different PASs, and it therefore does not model the interactions between different PASs, which is clearly different from the design of DeeReCT-APA.
As mentioned above, the aim of our model is to take all PASs of a gene as a whole and predict the usage level of each PAS as accurately as possible. Therefore, we must take all PASs in a gene as input at one time. Considering that the number of PASs within a gene is not constant, we design our model to take inputs of variable length. Since most genes have a small number of PASs, we choose not to pad all the genes with dummy PASs to make them the same length, as doing so would be highly inefficient. Instead, we design the interaction layers so that they can take an arbitrary number of Base-Nets.
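The final normalization step can be sketched independently of the rest of the network. The snippet below applies a SoftMax over the per-PAS scores of each gene, so that genes with different numbers of PASs are handled without padding. The gene names and score values are made up for illustration, and the BiLSTM and fully connected layers that would produce these scores are not shown.

```python
import math

def softmax(scores):
    """Numerically stable SoftMax: turns per-PAS scores into usage values that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-PAS scores for genes with 2, 4, and 3 PASs (no padding needed).
scores_by_gene = {
    "geneA": [2.0, 0.5],
    "geneB": [1.0, 1.0, 1.0, 1.0],
    "geneC": [0.3, -1.2, 2.2],
}
usage_by_gene = {gene: softmax(scores) for gene, scores in scores_by_gene.items()}

for gene, usage in usage_by_gene.items():
    print(gene, [round(u, 3) for u in usage])
```

Because the normalization runs over all PASs of a gene, raising the score of one PAS necessarily lowers the predicted usage of the others, which is one simple way the competition is expressed in the output.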
We further design two experiments for ablation study of DeeReCT-APA’s BiLSTM interaction layer. The first is to remove the BiLSTM layer and only keep the fully connected layer and the SoftMax layer. In this scenario, the network still considers all PASs of a gene simultaneously, but with a non-RNN interaction layer. The second is to remove the interaction layer altogether and use comparison-based training (like in Polyadenylation Code) to train a Base-Net. We show their performance separately in the “Overall performance” section.
A genome-wide PAS quantification dataset
A genome-wide PAS quantification dataset derived from the fibroblast cells of BL and SP mice, as well as their F1 hybrids, was obtained from a previous study [31]. In the F1 cells, the two alleles share the same trans environment, so the PAS usage difference between the two alleles can only be due to the sequence variants between their genome sequences, making it a valuable system for studying APA cis-regulation. Apart from APA, this kind of system has also been used in the study of alternative splicing and translational regulation [34], [35].
The detailed description of the sequencing protocol and data analysis procedure can be found in [31]. As a brief summary, we used the fibroblast cell lines from BL, SP, and their F1 hybrids. The total RNA was extracted from the fibroblast cells of BL and SP, and then subjected to 3′-Region Extraction and Deep Sequencing (3′READS) [16] to build a good PAS reference of the two strains. The 3′-mRNA sequencing was then performed in all three cell lines to quantify those PASs in the reference. In the F1 hybrid cells, reads were assigned to BL and SP alleles according to their strain-specific SNPs. The PAS usage values were then computed by counting the sequencing reads assigned to each PAS. The sequence centering around each PAS cleavage site (448 nt in total) was extracted and then underwent feature extraction or one-hot encoding before training the model. The extracted features were then inputted to Feature-Net, while the one-hot encoded sequences were inputted to Single-Conv-Net and Multi-Conv-Net. As provided in [31], the raw sequencing data from which this dataset is derived are accessible at European Nucleotide Archive (ENA: PRJEB15336; http://www.ebi.ac.uk/ena).
Training and evaluation of the model
We train the DeeReCT-APA models on the parental BL/SP PAS usage level dataset. For F1 hybrid data, however, we start from the pre-trained parental model (we use either the BL parental model or the SP parental model, and the results are shown separately) and fine-tune the model on the F1 dataset. This is because, due to the read assignment problem, the usage of many PASs in F1 cannot be unambiguously characterized by 3′-mRNA sequencing [31]. As a result, the F1 dataset does not contain a sufficient number of PASs to train our model from scratch. At the training stage, genes are randomly selected from the training set and the sequences of their PASs’ flanking regions are fed into the network. The sequence of each PAS in a gene passes through one Base-Net. The parameters of the Base-Nets responsible for the different PASs are all shared. Each Base-Net then outputs a vector representing the distilled features of its PAS, which is sent to the interaction layers. The interaction layers generate a percentage score for each PAS of this gene. Cross-entropy loss between the predicted usage and the actual usage is used as the training objective. During back-propagation, the gradients are back-propagated through the path originating from each PAS. As the model parameters are shared between base networks, the gradients are then summed up to update the model parameters. We use several techniques to reduce overfitting. 1) Weight decay is applied to the weight parameters of the CNN and all fully connected layers. 2) Dropout is applied to the BiLSTM. 3) We stop training as soon as the mean absolute error of the predicted usage value no longer improves on the validation set. 4) While fine-tuning the model on the F1 dataset, we use a learning rate that is ∼ 100 times smaller than the one used when training from scratch.
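For one gene, the cross-entropy between the predicted and the measured usage can be written in a few lines. The sketch below assumes that both are given as lists of fractions summing to one and adds a small `eps` guard against log(0). These details, the function name, and the example values are assumptions for illustration rather than the exact implementation.

```python
import math

def usage_cross_entropy(predicted, observed, eps=1e-12):
    """Cross-entropy H(observed, predicted) over the PASs of one gene.

    Both arguments are lists of usage fractions that sum to one;
    `eps` is an assumed numerical guard, not part of the definition.
    """
    return -sum(t * math.log(max(p, eps)) for p, t in zip(predicted, observed))

# Hypothetical gene with three PASs.
observed = [0.7, 0.2, 0.1]
good_prediction = [0.65, 0.25, 0.10]
poor_prediction = [0.10, 0.20, 0.70]

print(usage_cross_entropy(good_prediction, observed))
print(usage_cross_entropy(poor_prediction, observed))
```

The loss is smallest when the predicted distribution equals the observed one, so minimizing it pushes the predicted percentages of all PASs of a gene toward the measured usage jointly.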
The network is trained with the adaptive moment estimation (Adam) optimizer [36]. A detailed list of hyperparameters we used is specified in File S1 and Table S2. We construct the network using the PyTorch deep learning framework [37] and utilize one NVIDIA GeForce GTX 980 Ti as the GPU hardware platform.
To evaluate the performance of the model, we conduct a 5-fold cross validation at the gene level using all the genes in our dataset for each strain. That is, if a gene is selected as a training (testing) sample, all of its PASs are in the training (testing) set. At each time, four folds are used for training and the remaining one is used for testing. To make a fair comparison with Polyadenylation Code and DeepPASTA, we also train (fine-tune) the two models and optimize their model parameters on the parental and F1 datasets.
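A gene-level split of this kind can be sketched as follows. The record layout (a tuple of gene ID and PAS index), the function names, the random seed, and the round-robin fold assignment are assumptions made for illustration.

```python
import random

def gene_level_folds(gene_ids, k=5, seed=0):
    """Partition the unique genes into k folds so that a gene never spans two folds."""
    genes = sorted(set(gene_ids))
    random.Random(seed).shuffle(genes)
    return [genes[i::k] for i in range(k)]

def split_by_genes(pas_records, test_genes):
    """Send every PAS record to train or test according to its gene (first tuple field)."""
    test_genes = set(test_genes)
    train = [r for r in pas_records if r[0] not in test_genes]
    test = [r for r in pas_records if r[0] in test_genes]
    return train, test

# Hypothetical records: (gene_id, pas_index); genes have 2 to 4 PASs.
records = [(f"gene{g}", p) for g in range(10) for p in range(2 + g % 3)]
folds = gene_level_folds([gene for gene, _ in records], k=5)

for fold_index, test_genes in enumerate(folds):
    train, test = split_by_genes(records, test_genes)
    print(fold_index, len(train), len(test))
```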
Performance measures
To comprehensively evaluate DeeReCT-APA and compare it against baseline and state-of-the-art methods, we use the following performance measures.
Mean absolute error
This metric is defined as the mean absolute error (MAE) of the usage prediction of each PAS, which is

$$\mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M}\left|p_i - t_i\right| \qquad (1)$$

where $p_i$ stands for the predicted usage, $t_i$ stands for the experimentally determined ground truth usage for PAS $i$, and $M$ is the total number of PASs across all genes in the testing set. This is the most intuitive way of measuring the performance of DeeReCT-APA. However, it is not applicable to Polyadenylation Code [20] or DeepPASTA [21] as they do not have quantitative outputs that can be interpreted as the PAS usage values. For the same reason, it is not applicable to DeeReCT-APA either, when its interaction layers are removed and the comparison-based training is used (see the “Design of the interaction layers” section).
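As a small worked example, the MAE over a pooled list of PASs can be computed as below. The usage values are invented and the function name is hypothetical.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and measured usage over all PASs."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, observed)) / len(predicted)

# Five PASs pooled from two hypothetical genes (usage as fractions).
predicted = [0.6, 0.4, 0.2, 0.3, 0.5]
observed = [0.7, 0.3, 0.1, 0.3, 0.6]
mae = mean_absolute_error(predicted, observed)
print(f"{mae:.2%}")
```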
Comparison accuracy
We here define the Pairwise Comparison Task. We enumerate all the pairs of PASs in a given gene and keep those pairs with PAS usage level difference greater than 5%. We then ask the model to predict which PAS in the pair is of the higher usage level. The accuracy is defined as:
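$$\text{Comparison accuracy} = \frac{\text{number of correctly predicted pairs}}{\text{number of all retained pairs}} \qquad (2)$$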
Note that the primary reason that we use this metric is to compare with Polyadenylation Code and DeepPASTA, as they are designed for predicting which one is stronger between the two competing PASs.
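The Pairwise Comparison Task can be sketched as follows. Each gene is represented by a pair of lists (model scores, measured usage as fractions); this layout, the function name, and the example values are assumptions for illustration.

```python
from itertools import combinations

def comparison_accuracy(genes, min_diff=0.05):
    """Fraction of retained PAS pairs whose order is predicted correctly.

    `genes` is a list of (predicted_scores, observed_usage) pairs; only PAS pairs
    whose observed usage differs by more than `min_diff` are evaluated.
    """
    correct = total = 0
    for predicted, observed in genes:
        for a, b in combinations(range(len(observed)), 2):
            if abs(observed[a] - observed[b]) <= min_diff:
                continue  # usage too similar: pair is not evaluated
            total += 1
            if (predicted[a] > predicted[b]) == (observed[a] > observed[b]):
                correct += 1
    return correct / total if total else float("nan")

genes = [
    ([0.5, 0.3, 0.2], [0.6, 0.1, 0.3]),  # three retained pairs, one ordered wrongly
    ([0.4, 0.6], [0.52, 0.48]),          # difference of 4%: pair is dropped
]
accuracy = comparison_accuracy(genes)
```

Only the order of the two scores matters, which is why this metric can also be computed for models whose outputs are not usage percentages.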
Highest usage prediction accuracy
We here define the Highest Usage Prediction Task. This task aims to test the model’s ability to predict which PAS has the highest usage level in a single gene. From the testing set, we select for evaluation all the genes whose highest PAS usage level exceeds the second highest one by at least 15%. For DeeReCT-APA, the predicted usage in percentage is used for ranking the PASs. For Polyadenylation Code and DeepPASTA, as they do not provide a predicted value in percentage, the logit value before the SoftMax layer is used instead. The logit values, though not on the scale of real usage percentage values, can at least give a ranking of different PASs. The highest usage prediction accuracy is the percentage of genes whose highest-usage PASs are correctly predicted.
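A minimal sketch of this evaluation is given below. As above, each gene is a pair of lists (model scores, measured usage as fractions); the data layout, function name, and example values are assumptions for illustration.

```python
def highest_usage_accuracy(genes, min_margin=0.15):
    """Fraction of evaluated genes whose top-usage PAS is also the top-scoring PAS.

    `genes` is a list of (scores, observed_usage) pairs. Scores may be usage
    percentages or logits, since only their ranking is used.
    """
    correct = total = 0
    for scores, observed in genes:
        ranked = sorted(observed, reverse=True)
        if len(ranked) < 2 or ranked[0] - ranked[1] < min_margin:
            continue  # no clearly dominant PAS: gene is not evaluated
        total += 1
        top_observed = max(range(len(observed)), key=observed.__getitem__)
        top_predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += top_observed == top_predicted
    return correct / total if total else float("nan")

genes = [
    ([2.1, -0.3, 0.4], [0.7, 0.1, 0.2]),  # evaluated, top PAS predicted correctly
    ([0.2, 0.8], [0.9, 0.1]),             # evaluated, top PAS predicted wrongly
    ([1.0, 0.0], [0.55, 0.45]),           # margin of 10%: gene is skipped
]
accuracy = highest_usage_accuracy(genes)
```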
Averaged Spearman’s correlation
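We here define the Ranking Task. We convert the usage levels predicted by each model into a ranking of PASs in that gene. We then compute the Spearman’s correlation between the predicted ranking and the ground truth ranking. The correlation values for all genes are then averaged to give an aggregated score. In other words,

$$\bar{\rho} = \frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{p=1}^{n_i}\left(r_{ip}-\bar{r}_i\right)\left(s_{ip}-\bar{s}_i\right)}{\sqrt{\sum_{p=1}^{n_i}\left(r_{ip}-\bar{r}_i\right)^2}\,\sqrt{\sum_{p=1}^{n_i}\left(s_{ip}-\bar{s}_i\right)^2}} \qquad (3)$$

where $N$ is the total number of genes; $n_i$ is the number of PASs in gene $i$; $r_{ip}$ is the predicted rank of PAS $p$ in gene $i$; $s_{ip}$ is the ground truth rank of PAS $p$ in gene $i$; and $\bar{r}_i$ and $\bar{s}_i$ are the averaged predicted and ground truth ranks in gene $i$, respectively.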
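The Ranking Task score can be sketched in plain Python as the Pearson correlation of the ranks, averaged over genes. Giving tied values their average rank is an assumption made here, as are the function names and example values; genes in which all predicted or all observed values are identical would need separate handling, since the correlation is then undefined.

```python
def average_ranks(values):
    """Rank values (1 = smallest); tied values receive their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(predicted, observed):
    """Spearman's correlation for one gene, computed as Pearson correlation of ranks."""
    r, s = average_ranks(predicted), average_ranks(observed)
    r_bar, s_bar = sum(r) / len(r), sum(s) / len(s)
    covariance = sum((a - r_bar) * (b - s_bar) for a, b in zip(r, s))
    var_r = sum((a - r_bar) ** 2 for a in r)
    var_s = sum((b - s_bar) ** 2 for b in s)
    return covariance / (var_r * var_s) ** 0.5

def averaged_spearman(genes):
    """Mean of the per-gene Spearman's correlations."""
    return sum(spearman(p, t) for p, t in genes) / len(genes)

genes = [
    ([0.5, 0.3, 0.2], [0.6, 0.3, 0.1]),  # same ordering
    ([0.1, 0.9], [0.8, 0.2]),            # reversed ordering
]
score = averaged_spearman(genes)
```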
Results
Overall performance
First, to compare the performance of different Base-Net designs, we evaluated DeeReCT-APA with different Base-Nets: Feature-Net, Single-Conv-Net, and Multi-Conv-Net. As shown in Table S3, both on the parental BL dataset and on the F1 dataset, DeeReCT-APA with Multi-Conv-Net performs the best, followed by that with Single-Conv-Net. This is expected, as deeper neural networks have higher representation learning capacity.
We then compared the performance of DeeReCT-APA with Multi-Conv-Net to Polyadenylation Code and DeepPASTA. As shown in Table 1, both on the parental BL dataset and on the F1 dataset, DeeReCT-APA with Multi-Conv-Net consistently performs the best on every metric for which a comparison is possible (MAE is not applicable to the other two methods). The standard deviation across 5-fold cross validation is generally higher in the F1 dataset than in the parental dataset, indicating increased instability in F1 prediction, which is probably due to the limited amount of F1 data. As we have a rather small dataset, a very complex model like DeepPASTA is prone to overfitting, which is probably the reason why it performs the worst here. Indeed, for the smaller F1 dataset, DeepPASTA lags even further behind the other methods. Similar results on the performance of the parental SP model and the F1 model (fine-tuned from the parental SP model) are shown in File S1 and Table S4. Unless otherwise stated, the F1 model used in the remaining part of this study is the one fine-tuned from the parental BL model, and uses the training set folds that do not include the gene or PAS to be tested.
Table 1.
Performance summary of three methods on the parental BL model and the F1 model
| Model | MAE | Comparison accuracy | Highest usage prediction accuracy | Averaged Spearman’s correlation |
|---|---|---|---|---|
| Performance on parental dataset | | | | |
| DeeReCT-APA (Multi-Conv-Net) | **17.22% ± 0.3%** | **77.64% ± 0.4%** | **63.48% ± 0.9%** | **0.5140 ± 0.021** |
| Polyadenylation Code | N/A | 75.88% ± 0.8% | 59.82% ± 1.5% | 0.4673 ± 0.022 |
| DeepPASTA | N/A | 74.08% ± 1.1% | 58.78% ± 1.4% | 0.4394 ± 0.017 |
| Performance on F1 dataset | | | | |
| DeeReCT-APA (Multi-Conv-Net) | **17.80% ± 0.3%** | **77.14% ± 1.2%** | **64.52% ± 0.7%** | **0.4957 ± 0.009** |
| Polyadenylation Code | N/A | 74.20% ± 0.1% | 59.04% ± 0.9% | 0.4224 ± 0.014 |
| DeepPASTA | N/A | 70.14% ± 1.5% | 53.82% ± 1.7% | 0.3693 ± 0.018 |
Note: The parental model is trained from scratch and the F1 model is fine-tuned from the parental BL model. The table shows the performances of three methods across four evaluation metrics. For the parental dataset, the values of MAE, comparison accuracy, and highest usage prediction accuracy for a random predictor are 43.12%, 50.00%, and 25.49%, respectively. For the F1 dataset, the values are 40.96%, 50.00%, and 28.56%, respectively. Data are shown as mean ± SD. The best performance is in bold. BL, mouse strain C57BL/6J; F1, the hybrids of mouse strains C57BL/6J and SPRET/EiJ; MAE, mean absolute error; N/A, not applicable; SD, standard deviation.
Next, we show that, in terms of comparison accuracy, the improvement made by DeeReCT-APA is statistically significant, even though it is not numerically substantial. For this purpose, we repeated the experiment five times, each with a different random split of the dataset, and reported the accuracy of DeeReCT-APA (Multi-Conv-Net), Polyadenylation Code, and DeepPASTA after 5-fold cross validation (File S1; Table S5). The performances of the three tools were then compared using P values computed by t-test. As shown in Table S5, the improvement of DeeReCT-APA over the other two methods is indeed statistically significant.
To demonstrate that the results of our comparison are independent of the datasets, we trained and tested DeeReCT-APA also on another dataset used in [20]. Since it consists of polyadenylation quantification data from multiple human tissues, we reported the performance (comparison accuracy) of DeeReCT-APA for each tissue separately (File S1; Table S6). The performance metrics of Polyadenylation Code and DeepPASTA were adapted from [20], [21] accordingly. For six out of eight tissues, DeeReCT-APA achieves higher accuracy than the other two methods.
We finally show through an ablation study that the BiLSTM interaction layer contributes to the performance of DeeReCT-APA. As shown in Table 2, we compared the performance of (1) DeeReCT-APA (Multi-Conv-Net; without interaction layers) to that of (2) DeeReCT-APA (Multi-Conv-Net; with interaction layers but without BiLSTM) and (3) DeeReCT-APA (Multi-Conv-Net; with both interaction layers and BiLSTM) (detailed architectures are shown in Figure S1). In terms of all four metrics, both the use of interaction layers and the use of BiLSTM improve the performance. Although not numerically substantial, the improvements are in general statistically significant, as assessed with an experiment similar to the one described earlier (Table S7). The improvement of (2) over (1) (P = 2.5E–6 for parental and P = 1.1E–3 for F1) is more statistically significant than the improvement of (3) over (2) (P = 3.7E–3 for parental and P = 9.9E–2 for F1), indicating that the majority of the performance gain of DeeReCT-APA comes from using the interaction layers and the simultaneous consideration of all PASs. We therefore conclude that DeeReCT-APA, with an RNN interaction layer that considers all PASs of a gene at the same time, can achieve better performance on the PAS quantification task.
Table 2.
Performance of DeeReCT-APA using different interaction layers
| Model | MAE | Comparison accuracy | Highest usage prediction accuracy | Averaged Spearman’s correlation |
|---|---|---|---|---|
| Performance on parental dataset | | | | |
| DeeReCT-APA (Multi-Conv-Net; without interaction layers) | N/A | 76.12% ± 0.5% | 60.02% ± 0.7% | 0.4988 ± 0.027 |
| DeeReCT-APA (Multi-Conv-Net; with interaction layers but without BiLSTM) | 17.54% ± 0.3% | 77.12% ± 0.5% | 61.73% ± 0.6% | 0.5007 ± 0.034 |
| DeeReCT-APA (Multi-Conv-Net; with both interaction layers and BiLSTM) | **17.22% ± 0.3%** | **77.64% ± 0.4%** | **63.48% ± 0.9%** | **0.5140 ± 0.021** |
| Performance on F1 dataset | | | | |
| DeeReCT-APA (Multi-Conv-Net; without interaction layers) | N/A | 76.28% ± 1.1% | 61.72% ± 0.8% | 0.4337 ± 0.019 |
| DeeReCT-APA (Multi-Conv-Net; with interaction layers but without BiLSTM) | 18.03% ± 0.2% | 76.77% ± 1.0% | 63.44% ± 0.3% | 0.4751 ± 0.011 |
| DeeReCT-APA (Multi-Conv-Net; with both interaction layers and BiLSTM) | **17.80% ± 0.4%** | **77.14% ± 1.2%** | **64.52% ± 0.7%** | **0.4957 ± 0.009** |
Note: For DeeReCT-APA without interaction layers, the model is trained based on comparison and its output cannot be interpreted as a percentage score. Therefore, like for Polyadenylation Code and DeepPASTA earlier, we do not report its MAE value. For parental BL dataset, the values of MAE, comparison accuracy, and highest usage prediction accuracy for a random predictor are 43.12%, 50.00%, and 25.49%, respectively. For the F1 dataset (fine-tuned from the parental BL model), the values are 40.96%, 50.00%, and 28.56%, respectively. Data are shown as mean ± SD. The best performance is in bold.
Benefits of modeling all PASs jointly — one example
To illustrate DeeReCT-APA’s ability of modeling all PASs of a gene jointly, we use the gene Srr (Ensembl Gene ID: ENSMUSG00000001323) as an example, which contains four different PASs (PAS 1–4; Figure 3A). The ground truth usage levels and the usage levels predicted by DeeReCT-APA (Multi-Conv-Net) and Polyadenylation Code in the F1 hybrid cells for these four PASs are shown in Figure 3B–D. As before, the logit values before the SoftMax layer of Polyadenylation Code are used as surrogates for predicted usage values and therefore not in the range from 0 to 1. As shown in Figure 3B–D, the prediction of DeeReCT-APA (Multi-Conv-Net) is much more consistent with the ground truth than that of Polyadenylation Code, and the relative magnitude between the BL allele and SP allele for the prediction of DeeReCT-APA (Multi-Conv-Net) is correct for all four PASs. In comparison, Polyadenylation Code model predicted PAS 4 in the BL allele to be of slightly higher usage than the one in the SP allele, whereas both in the ground truth and the prediction made by DeeReCT-APA (Multi-Conv-Net), the reverse is true. We hypothesize in this case that the genetic variants between the BL allele and SP allele in the sequences flanking PAS 4 alone might make the BL allele a stronger PAS than the SP allele because Polyadenylation Code only considers which one between the two is stronger and predicts the strength of a PAS solely by its own sequence, without considering those of the others. However, when simultaneously considering genetic variations in PAS 1, PAS 2, and PAS 3, which probably have stronger effects, the usage of PAS 4 becomes lower in BL allele than in SP allele.
Figure 3.
Prediction of Srr
This shows one example of the benefit of modeling all PASs jointly. A. PASs of Srr. B. Ground truth usage. C. Usage prediction by DeeReCT-APA (Multi-Conv-Net). D. PAS signal intensity prediction by Polyadenylation Code. E. Usage prediction of “mixed allele” by DeeReCT-APA (Multi-Conv-Net). “mixed allele” indicates a hypothetical allele of Srr that has the BL sequence of PAS 1, PAS 2, and PAS 3 and the SP sequence of PAS 4. BL, mouse strain C57BL/6J; SP, mouse strain SPRET/EiJ.
To test our hypothesis, we design an in silico experiment by constructing a hypothetical allele of Srr (hereafter referred to as “mixed allele”) that has the BL sequence containing PAS 1, PAS 2, and PAS 3 and the SP sequence containing PAS 4. We then used DeeReCT-APA (Multi-Conv-Net) to predict the usage level of each PAS in the “mixed allele”, where the usage differences between the BL allele and the “mixed allele” should then be purely due to the sequence variants in PAS 4, because the two alleles are exactly the same on the other PASs. As shown in Figure 3E, consistent with our hypothesis, the usage level of PAS 4 in the BL allele is indeed higher than that in the “mixed allele”. This example nicely demonstrates the benefit of jointly modeling all PASs in a gene simultaneously.
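The construction of such a hypothetical allele is straightforward to express in code. In the sketch below, `toy_predict_usage` is a deliberately crude, hypothetical stand-in for a trained model (it only counts two hexamers and applies a SoftMax), and the sequences are invented rather than taken from Srr. The point is only to show how swapping the sequence of a single PAS isolates its contribution while all other PASs stay identical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_predict_usage(pas_sequences):
    """Hypothetical stand-in for a trained joint model; NOT DeeReCT-APA itself."""
    scores = [2.0 * s.count("AATAAA") + 1.5 * s.count("ATTAAA") for s in pas_sequences]
    return softmax(scores)

def build_mixed_allele(bl_allele, sp_allele, pas_from_sp):
    """Take the SP sequence for the PAS indices in `pas_from_sp`, the BL sequence elsewhere."""
    return [sp if i in pas_from_sp else bl
            for i, (bl, sp) in enumerate(zip(bl_allele, sp_allele))]

# Invented toy sequences for four PASs on each allele (not the real Srr sequences).
bl_allele = ["GGAATAAAGG", "CCATTAAACC", "TTGCGCGCAA", "GGAATAAATT"]
sp_allele = ["GGAATAAGGG", "CCATTAAACC", "TTGCGCGCAA", "GGATTAAATT"]

mixed_allele = build_mixed_allele(bl_allele, sp_allele, pas_from_sp={3})
bl_usage = toy_predict_usage(bl_allele)
mixed_usage = toy_predict_usage(mixed_allele)

# Any difference between the two predictions is attributable to the PAS 4 sequence alone.
for i, (b, m) in enumerate(zip(bl_usage, mixed_usage), start=1):
    print(f"PAS {i}: BL {b:.3f}  mixed {m:.3f}")
```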
Allelic difference in PAS usage between BL and SP
One primary goal of developing DeeReCT-APA is to determine the effect of sequence variants on APA patterns. The F1 hybrid system we choose here is ideal to test how well such a goal is achieved, since in the F1 cells, the allelic difference in PAS usage can only be due to the sequence variants between their genome sequences.
Figure 4 shows two examples: Zfp709 (Ensembl Gene ID: ENSMUSG00000056019) and Lpar2 (Ensembl Gene ID: ENSMUSG00000031861). Previous analysis [31] demonstrated that in the distal PAS of Zfp709, a substitution (from A to T) in the SP allele relative to the BL allele disrupted the PAS signal (from ATTAAA to TTTAAA) (Figure 4A); in the distal PAS of Lpar2, a substitution (from A to G) in the SP allele relative to the BL allele disrupted another PAS signal (from AATAAA to AATAAG) (Figure 4B), causing both of them to be of lower usage in the SP allele than in the BL allele.
Figure 4.
Previous experimental findings and mutation maps of Zfp709 and Lpar2
Mutation maps are consistent with previous experimental findings on two genes, Zfp709 (A and C) and Lpar2 (B and D). Sequencing read coverage graphs of Zfp709 (A) and Lpar2 (B) (adapted from Figure 4H of [31]). The identified PASs are marked by red triangles on top of the sequencing read coverage (black coverage graph). The sequence variants of the PASs between BL and SP strains (shaded in pink) are shown on the top. Mutation maps of Zfp709 (C) and Lpar2 (D). The SP alleles of Zfp709 and Lpar2 can be viewed as undergoing a substitution relative to BL alleles. The four heatmap entries above each letter of the sequence show the relative change of usage level when the nucleotide at that position is substituted with the nucleotide of the corresponding row. Darker red indicates a greater increase in usage and darker blue indicates a greater decrease in usage. The entries that correspond to the genetic variants between BL and SP in (A) and (B) are marked by red squares.
To check whether our model could be used to identify the effects of these variants, we plot a “mutation map” for the two genes. In brief, for each gene, given the sequence around the most distal PAS (of length L), we generate 3L “mutated sequences”, each of which differs from the original sequence at exactly one nucleotide. Each of the 3L sequences is then fed into the model along with the other PAS sequences from that gene, and the model predicts the usage of all sites for each of the 3L sequences separately. The predicted usage values of the original sequence are then subtracted from each of the 3L predictions, and the differences are plotted in a heatmap, the “mutation map”.
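The procedure above can be sketched as follows. This is an illustrative sketch, not the code released with this study: `predict_usage` is a hypothetical callable standing in for the trained model, `toy_predict_usage` is a deliberately crude stand-in used only to make the example runnable, and for brevity the sketch records the change in usage of the distal PAS only.

```python
NUCLEOTIDES = "ACGT"


def single_nucleotide_mutants(seq):
    """Yield (position, new_base, mutated_sequence) for all 3L substitutions."""
    for pos, ref in enumerate(seq):
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield pos, alt, seq[:pos] + alt + seq[pos + 1:]


def mutation_map(distal_seq, other_seqs, predict_usage):
    """Return {base: [usage change per position]} for the distal PAS.

    `predict_usage` (hypothetical) takes the list of all PAS sequences of a
    gene and returns one predicted usage value per PAS, in the same order.
    """
    baseline = predict_usage(other_seqs + [distal_seq])[-1]
    # Entries for the reference base stay at 0.0 (no substitution, no change).
    table = {base: [0.0] * len(distal_seq) for base in NUCLEOTIDES}
    for pos, alt, mutant in single_nucleotide_mutants(distal_seq):
        table[alt][pos] = predict_usage(other_seqs + [mutant])[-1] - baseline
    return table


def toy_predict_usage(seqs):
    """Crude stand-in model: a PAS with an AATAAA hexamer gets twice the score."""
    scores = [2.0 if "AATAAA" in s else 1.0 for s in seqs]
    total = sum(scores)
    return [s / total for s in scores]


distal = "GCAATAAAGT"      # toy sequence, hexamer at positions 2-7
others = ["TTTTCCCCGG"]    # toy sequence of one competing PAS
table = mutation_map(distal, others, toy_predict_usage)
```

With the toy predictor, every substitution inside the hexamer yields a negative entry and every substitution outside it yields zero, which mirrors the qualitative pattern expected for signal-disrupting variants.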
As shown in Figure 4C and D, the heatmap entries that correspond to the sequence variants between BL and SP are consistent with the experimental findings from [31] (Figure 4A and B). In addition, the mutation maps can also show the predicted effects of sequence variants other than those between BL and SP, giving an overview of the effects from all potential mutations.
The two examples described above involve sequence variants that disrupt PAS signals, which makes the prediction relatively trivial. To check whether our model could be used for variants with more subtle effects, we choose a third example, the gene Alg10b. Previous experiments [31] have shown that the usage of the most distal PAS in its BL allele is higher than that in its SP allele (Figure 5A). Using reporter assays (Figure 5B), it has been demonstrated that an insertion of UUUU in the SP allele relative to the BL allele contributes to this reduction in usage (Figure 5C). To check whether DeeReCT-APA could reveal such effects, we construct the same four in silico sequences as in [31]: BL, SP, BL2SP, and SP2BL. Together with the other PASs of Alg10b, the four sequences are fed to the DeeReCT-APA model separately. As shown in Figure 5D, comparing BL with BL2SP and SP with SP2BL, our model is able to reveal the negative effect of the poly(U) tract.
Figure 5.
Previous experimental findings and DeeReCT-APA’s prediction of Alg10b
In silico prediction for the Alg10b PAS reporter is consistent with previous experimental findings [31]. A. Sequencing read coverage graph and sequence variants of Alg10b. The red triangles mark the identified PASs. B. The structures of PAS reporter constructs of Alg10b. “BL” is the original BL version of the most distal PAS, “SP” is the original SP version, “BL2SP” is the BL sequence with only TTTT inserted at the corresponding location, and “SP2BL” is the SP sequence with only TTTT deleted at the corresponding location. C. PAS reporter assay for the four reporters. D. In silico prediction of PAS reporter usage. Panels (A–C) are adapted from Figure 4H of [31].
To globally evaluate the performance of DeeReCT-APA on predicting the allelic difference in PAS usage, we compared the predicted allelic difference with the experimentally measured allelic difference in a genome-wide manner (Figure 6A). As a baseline control, we did the same for the prediction made by Polyadenylation Code, where logit values before SoftMax were again used as surrogates for the predicted allelic difference in PAS usage (Figure 6B). Here, the F1 model fine-tuned from the parental BL model was used. Similar results for the F1 model fine-tuned from the parental SP model are shown in File S1 and Figure S2. It is worth noting that this is a very challenging task because the training data do not represent the complete landscape of genetic mutations well. That is, the BL dataset only contains invariant sequences from different PASs, and the F1 dataset contains a limited number of genetic variants.
Figure 6.
Comparison of the allelic usage differences predicted by DeeReCT-APA and Polyadenylation Code
The F1 model fine-tuned from the parental BL model is used. A. Allelic usage difference predicted by DeeReCT-APA. The red line shows the perfect prediction. B. Allelic usage difference predicted by Polyadenylation Code. C. PCCs and their P values between two quantities at different minimum allelic usage differences. PCC, Pearson correlation coefficient.
We then computed the Pearson correlation coefficients (PCCs) between the experimentally measured allelic usage difference and the ones predicted by the two models. DeeReCT-APA clearly outperforms Polyadenylation Code. We further evaluated the PCCs using six subsets of the testing set, obtained by filtering out PASs with allelic usage difference less than 0.1, 0.2, 0.3, 0.4, 0.6, and 0.8, respectively (Figure 6C). When the allelic usage difference is small, the relative magnitude of the two alleles is more ambiguous, and the experimental measurement of the allelic usage difference (used here as ground truth) is less reliable. Indeed, with increasing allelic difference, the prediction accuracy increased for both DeeReCT-APA and Polyadenylation Code. Importantly, in all these groups, DeeReCT-APA shows consistently better performance.
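The thresholded evaluation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the evaluation script of this study: the function names and toy values are hypothetical, filtering on the absolute value of the measured difference is an assumption (the measured difference is signed), and P values are not computed here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns None if it is undefined."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def pcc_by_min_difference(measured, predicted, thresholds):
    """PCC on the subset of PASs with |measured difference| >= threshold."""
    results = {}
    for t in thresholds:
        kept = [(m, p) for m, p in zip(measured, predicted) if abs(m) >= t]
        ms = [m for m, _ in kept]
        ps = [p for _, p in kept]
        results[t] = pearson(ms, ps)
    return results


# Toy values (assumption: illustrative only, not data from this study).
measured = [0.05, -0.05, 0.3, -0.4, 0.7, -0.9]
predicted = [-0.02, 0.03, 0.2, -0.3, 0.5, -0.6]
results = pcc_by_min_difference(measured, predicted, [0.0, 0.1, 0.8])
```

In the toy data the two smallest measured differences are predicted with the wrong sign, so removing them (threshold 0.1) raises the correlation, which is the qualitative effect described in the text. A threshold that leaves fewer than two PASs gives an undefined PCC, reported here as `None`.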
Visualization of convolutional filters
To show the knowledge learned by the convolutional filters of DeeReCT-APA, we followed a similar procedure as in [36] to visualize the convolutional filters of the model. The aim of visualization is to reveal the important subsequences around PASs that activate a specific convolutional filter. In contrast to [38], in which the researchers only used sequences in the testing set for visualization, we used all sequences in the training and testing datasets of F1 for visualization due to the smaller size of our dataset. In visualization, neither the model parameters nor the hyperparameters were tuned on the testing set. Therefore, our usage of the testing set for visualization is legitimate. For all the learned filters in layer 1, we convolved them with all the sequences in the aforementioned dataset, and for each sequence, its subsequence (having the same size as the filters) with the highest activation on that filter was extracted and accumulated in a position frequency matrix (PFM). The PFM was then ready for visualization as the knowledge learned by that specific filter. For layer 2 convolutional filters, as they were not convolved with raw sequences during training and testing, directly convolving them with the sequences in the dataset as we did for layer 1 would be inappropriate. Instead, the layer 2 activations were calculated by a partial forward pass through the network, and the subsequences of the input sequences in the receptive field of the maximally activated neuron were extracted and accumulated in a PFM.
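The layer 1 procedure can be sketched as follows. This is a simplified illustration, not the visualization code of this study: the filter is represented as a plain list of per-position weights for A, C, G, and T, the bias and nonlinearity are ignored, the function names are hypothetical, and the toy filter and sequences are made up for the example.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def best_window(seq, filt):
    """Return the start of the window with the highest filter activation.

    `filt` is a list of `width` rows, each holding 4 weights (A, C, G, T).
    """
    width = len(filt)
    encoded = one_hot(seq)
    best_start, best_act = 0, float("-inf")
    for start in range(len(seq) - width + 1):
        act = sum(encoded[start + i][j] * filt[i][j]
                  for i in range(width) for j in range(4))
        if act > best_act:
            best_start, best_act = start, act
    return best_start


def position_frequency_matrix(seqs, filt):
    """Accumulate the maximally activating subsequence of each sequence."""
    width = len(filt)
    pfm = [[0] * 4 for _ in range(width)]
    for seq in seqs:
        if len(seq) < width:
            continue  # too short to contain a full window
        start = best_window(seq, filt)
        for i, base in enumerate(seq[start:start + width]):
            pfm[i][BASES.index(base)] += 1
    return pfm


# Toy filter that rewards the hexamer AATAAA (assumption: illustrative only).
toy_filter = [[1.0 if base == target else 0.0 for base in BASES]
              for target in "AATAAA"]
toy_seqs = ["GGAATAAACC", "TAATAAAGGG", "CCCCAATAAA"]
pfm = position_frequency_matrix(toy_seqs, toy_filter)
```

Each row of `pfm` holds the counts of A, C, G, and T at one position of the filter window; normalizing the rows would give the frequencies typically drawn as a sequence logo.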
As shown in Figure 7A and B, DeeReCT-APA is able to identify the two strongest PAS hexamers, AUUAAA and AAUAAA [31]. In addition, one of the layer 2 convolutional filters is able to recognize the pattern of a mouse-specific PAS hexamer, UUUAAA [30] (Figure 7C). Furthermore, a poly-U island motif previously reported in [38] could also be revealed by DeeReCT-APA (Figure 7D). A complete visualization of all the 40 filters in layer 1 and 40 filters in layer 2 is shown in Figures S3 and S4.
Figure 7.
Visualization of learned convolutional filters in DeeReCT-APA
Some visualization examples of the learned convolutional filters of DeeReCT-APA are shown. A and B. The most common polyadenylation motifs AUUAAA and AAUAAA are learned in layer 1 by convolutional filters #2 and #37, respectively. C. Visualization of layer 2 convolutional filter #38, showing a mouse-specific polyadenylation motif UUUAAA. D. Visualization of layer 2 convolutional filter #19, showing the poly-U island motif. Note that the position frequency matrices for layer 2 convolutional filter visualization are wider than those for the layer 1 convolutional filter visualization (12 nt), because the receptive field of neurons in a deeper layer is in general greater than their corresponding filter width.
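The remark on receptive fields can be made concrete with the standard recurrence for stacked convolution and pooling layers, in which each layer widens the receptive field by (kernel size − 1) times the product of the strides of all preceding layers. The sketch below is illustrative only: the 12-nt width of layer 1 is taken from the caption above, whereas the pooling and layer 2 settings are hypothetical placeholders rather than the hyperparameters of DeeReCT-APA.

```python
def receptive_field(layers):
    """Receptive field (in input positions) of one output neuron.

    `layers` is a list of (kernel_size, stride) pairs ordered from the
    input to the output; pooling layers are described the same way.
    """
    field, jump = 1, 1  # `jump` is the input distance between adjacent neurons
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Layer 1 alone: the receptive field equals the filter width.
layer1_only = receptive_field([(12, 1)])

# Hypothetical stacks (assumption: layer 2 and pooling sizes are made up).
two_convs = receptive_field([(12, 1), (12, 1)])
with_pooling = receptive_field([(12, 1), (3, 3), (12, 1)])
```

Under these placeholder settings, a layer 2 neuron sees 23 input positions without pooling and 47 with an intermediate pooling layer of size and stride 3, both wider than the 12-nt window of layer 1.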
Discussion and conclusion
In this study, we made the first attempt to simultaneously predict the usage of all competing PASs within a gene. Our method incorporates both sequence-specific information through automatic feature extraction by CNN and multiple PAS competition through interaction modeling by RNN. We trained and evaluated our model on the genome-wide PAS usage measurement obtained from 3′-mRNA sequencing of fibroblast cells from two mouse strains as well as their F1 hybrids. Our model, DeeReCT-APA, outperforms the state-of-the-art PAS quantification methods on the tasks that they are trained for, including pairwise comparison task, highest usage prediction task, and ranking task. In addition, we demonstrated that modeling all the PASs of a gene simultaneously captures the mechanistic competition among the PASs and reveals the genetic variants with regulatory effects on PAS usage.
A similar idea of using BiLSTM to model competitive biological processes was proposed recently in [39]. The researchers used BiLSTM to model the usage level of competitive alternative 5′/3′ splice sites. Given the similarity of modeling competing PASs and splice sites, it is therefore not surprising that DeeReCT-APA, which also incorporates BiLSTM to model the interactions among competing PASs, achieves the best performance on the PAS quantification task.
Although DeeReCT-APA provides the first-of-its-kind method to model all the PASs of a gene, it still has room for improvement. As shown in Figure 3B and C, the model has limited accuracy when the usage is very high or very low. In addition, for the allelic comparison shown in Figure 6, some PASs with high allelic usage difference are predicted to be of low difference (false negatives, along the X axis) and vice versa (false positives, along the Y axis). One of the main reasons for our model’s limitation, as well as for that of all the other PAS quantification methods, is that the existing genome-wide PAS quantification datasets used as training data can only sample a limited number of naturally occurring sequence variants. Although in our study the two parental strains from which the F1 hybrid mice were derived are already the evolutionarily most distant ones among all the 17 mouse strains with complete genomic sequences, the number of genetic variants is still rather limited. Another limitation of our current model is that it does not take all the factors with potential PAS regulatory effects into consideration. For example, transcription kinetics, i.e., the elongation rate of Pol II, which is not considered by the model in this study, can also affect APA choice [40]. Similarly, DeeReCT-APA does not take the distance between consecutive PASs into account, which, together with the transcription elongation rate, can also affect APA [41]. All of these are potential directions for further improvement.
Finally, Zhang et al. [42] recently showed that effectively combining the power of deep learning with the information in RNA-seq data can significantly boost the performance for investigating the pattern of alternative splicing. Indeed, our preliminary results showed that, for the recognition of APA patterns as well, there are a substantial number of cases in which deep learning cannot make an accurate prediction but the pattern of RNA-seq coverage around the cleavage site provides clear evidence, and vice versa. Future work integrating the strength of deep learning on genomic sequences with experimental RNA-seq data is expected not only to improve the model performance, but also to shed more light on the APA regulatory mechanisms.
Code availability
Our implementation of DeeReCT-APA using the PyTorch [37] library is available at https://github.com/lzx325/DeeReCT-APA-repo.
Data availability
The genome-wide PAS quantification dataset of the parental and F1 mouse fibroblast cells is available in the subfolder ‘APA_ML’ at https://github.com/lzx325/DeeReCT-APA-repo.
CRediT author statement
Zhongxiao Li: Conceptualization, Methodology, Software, Writing - original draft, Visualization. Yisheng Li: Data curation, Writing - review & editing. Bin Zhang: Data curation, Writing - review & editing. Yu Li: Methodology, Investigation, Writing - review & editing. Yongkang Long: Writing - review & editing. Juexiao Zhou: Investigation, Writing - review & editing. Xudong Zou: Investigation, Writing - review & editing. Min Zhang: Investigation, Writing - review & editing. Yuhui Hu: Investigation, Writing - review & editing, Supervision, Funding acquisition. Wei Chen: Investigation, Writing - review & editing, Supervision, Project administration, Funding acquisition. Xin Gao: Investigation, Writing - review & editing, Supervision, Project administration, Funding acquisition. All authors have read and approved the final manuscript.
Competing interests
The authors have declared no competing interests.
Acknowledgments
This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) (Grant Nos. URF/1/4098-01-01, BAS/1/1624-01, FCC/1/1976-18-01, FCC/1/1976-23-01, FCC/1/1976-25-01, FCC/1/1976-26-01, and FCS/1/4102-02-01), the International Cooperation Research Grant from Science and Technology Innovation Commission of Shenzhen Municipal Government, China (Grant No. GJHZ20170310161947503 to YH), and the Shenzhen Science and Technology Program, China (Grant No. KQTD20180411143432337 to YH and WC).
Handled by Yi Xing
Footnotes
Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2020.05.004.
Contributor Information
Yuhui Hu, Email: huyh@sustech.edu.cn.
Wei Chen, Email: chenw@sustech.edu.cn.
Xin Gao, Email: xin.gao@kaust.edu.sa.
Supplementary material
The following are the Supplementary data to this article:
Supplementary materials for DeeReCT-APA
The structures of DeeReCT-APA models used in the ablation study. A. The structure of DeeReCT-APA with interaction layers but without BiLSTM. B. The structure of DeeReCT-APA with interaction layers removed. Compared with Figure 1 in the main text, the model in A has the BiLSTM removed and retains only the affine layer in the interaction layers. In B, the interaction layers are removed altogether and DeeReCT-APA resorts to comparison-based training (to predict which one of two PASs has higher usage). Note that an additional affine layer is added on top of the Base Networks to cast the output of the base network (which is a vector) into a scalar.
Comparison of the allelic usage difference prediction of DeeReCT-APA and Polyadenylation Code. The F1 model fine-tuned from the SP parental model is used. A and B. The horizontal axis is the ground truth allelic usage difference between two homologous PASs (the BL usage value minus the SP usage value). The vertical axis shows the predicted allelic usage difference. The scatter plot for DeeReCT-APA is shown in Panel A and that for Polyadenylation Code in Panel B. As DeeReCT-APA predicts the usage value as a percentage, we draw a red line that shows the perfect prediction. C. Pearson correlations between the two quantities at different minimum allelic usage differences are shown in the table below.
Visualization of convolutional filters in layer 1 of DeeReCT-APA. There are 40 convolutional filters in layer 1 of DeeReCT-APA. The model is trained on the parental BL dataset and fine-tuned on F1.
Visualization of convolutional filters in layer 2 of DeeReCT-APA. There are 40 convolutional filters in layer 2 of DeeReCT-APA. The model is trained on the parental BL dataset and fine-tuned on F1.
List of features used in Feature-Net and their corresponding dimensions
List of hyperparameters for the three DeeReCT-APA models
Performance summary for the BL parental model and the F1 model fine-tuned from the BL parental model
Performance summary for the SP parental model and the F1 model fine-tuned from the SP parental model
Replicated experiments of 5-fold cross validation on 5 random splits
Comparison accuracy on dataset from Leung et al. 2018 [20]
Replicated experiments of ablation study
References
1. Barreau C., Paillard L., Osborne H.B. AU-rich elements and associated factors: are there unifying principles? Nucleic Acids Res. 2005;33:7138–7150. doi: 10.1093/nar/gki1012.
2. Chen C.Y., Shyu A.B. AU-rich elements: characterization and importance in mRNA degradation. Trends Biochem Sci. 1995;20:465–470. doi: 10.1016/s0968-0004(00)89102-1.
3. Jonas S., Izaurralde E. Towards a molecular understanding of microRNA-mediated gene silencing. Nat Rev Genet. 2015;16:421–433. doi: 10.1038/nrg3965.
4. Lau A.G., Irier H.A., Gu J., Tian D., Ku L., Liu G., et al. Distinct 3ʹUTRs differentially regulate activity-dependent translation of brain-derived neurotrophic factor (BDNF). Proc Natl Acad Sci U S A. 2010;107:15945–15950. doi: 10.1073/pnas.1002929107.
5. Bertrand E., Chartrand P., Schaefer M., Shenoy S.M., Singer R.H., Long R.M. Localization of ASH1 mRNA particles in living yeast. Mol Cell. 1998;2:437–445. doi: 10.1016/s1097-2765(00)80143-4.
6. Ephrussi A., Dickinson L.K., Lehmann R. Oskar organizes the germ plasm and directs localization of the posterior determinant nanos. Cell. 1991;66:37–50. doi: 10.1016/0092-8674(91)90137-n.
7. Niedner A., Edelmann F.T., Niessing D. Of social molecules: the interactive assembly of ASH1 mRNA-transport complexes in yeast. RNA Biol. 2014;11:998–1009. doi: 10.4161/rna.29946.
8. Berkovits B.D., Mayr C. Alternative 3ʹ UTRs act as scaffolds to regulate membrane protein localization. Nature. 2015;522:363–367. doi: 10.1038/nature14321.
9. Yasuda M., Shabbeer J., Osawa M., Desnick R.J. Fabry disease: novel alpha-galactosidase A 3ʹ-terminal mutations result in multiple transcripts due to aberrant 3ʹ-end formation. Am J Hum Genet. 2003;73:162–173. doi: 10.1086/376608.
10. Bennett C.L., Brunkow M.E., Ramsdell F., O'Briant K.C., Zhu Q., Fuleihan R.L., et al. A rare polyadenylation signal mutation of the FOXP3 gene (AAUAAA→AAUGAA) leads to the IPEX syndrome. Immunogenetics. 2001;53:435–439. doi: 10.1007/s002510100358.
11. Higgs D.R., Goodbourn S.E., Lamb J., Clegg J.B., Weatherall D.J., Proudfoot N.J. Alpha-thalassaemia caused by a polyadenylation signal mutation. Nature. 1983;306:398–400. doi: 10.1038/306398a0.
12. Orkin S.H., Cheng T.C., Antonarakis S.E., Kazazian H.H., Jr. Thalassemia due to a mutation in the cleavage-polyadenylation signal of the human beta-globin gene. EMBO J. 1985;4:453–456. doi: 10.1002/j.1460-2075.1985.tb03650.x.
13. Elkon R., Ugalde A.P., Agami R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat Rev Genet. 2013;14:496–506. doi: 10.1038/nrg3482.
14. Mandel C.R., Bai Y., Tong L. Protein factors in pre-mRNA 3ʹ-end processing. Cell Mol Life Sci. 2008;65:1099–1122. doi: 10.1007/s00018-007-7474-3.
15. Shi Y. Alternative polyadenylation: new insights from global analyses. RNA. 2012;18:2105–2117. doi: 10.1261/rna.035899.112.
16. Hoque M., Ji Z., Zheng D., Luo W., Li W., You B., et al. Analysis of alternative cleavage and polyadenylation by 3ʹ region extraction and deep sequencing. Nat Methods. 2013;10:133–139. doi: 10.1038/nmeth.2288.
17. Kalkatawi M., Rangkuti F., Schramm M., Jankovic B.R., Kamau A., Chowdhary R., et al. Dragon PolyA Spotter: predictor of poly(A) motifs within human genomic DNA sequences. Bioinformatics. 2013;29:1484. doi: 10.1093/bioinformatics/btt161.
18. Magana-Mora A., Kalkatawi M., Bajic V.B. Omni-PolyA: a method and tool for accurate recognition of Poly(A) signals in human genomic DNA. BMC Genomics. 2017;18:620. doi: 10.1186/s12864-017-4033-7.
19. Xie B., Jankovic B.R., Bajic V.B., Song L., Gao X. Poly(A) motif prediction using spectral latent features from human DNA sequences. Bioinformatics. 2013;29:i316–i325. doi: 10.1093/bioinformatics/btt218.
20. Leung M.K.K., Delong A., Frey B.J. Inference of the human polyadenylation code. Bioinformatics. 2018;34:2889–2898. doi: 10.1093/bioinformatics/bty211.
21. Arefeen A., Xiao X., Jiang T. DeepPASTA: deep neural network based polyadenylation site analysis. Bioinformatics. 2019;35:4577–4585. doi: 10.1093/bioinformatics/btz283.
22. LeCun Y., Bengio Y., Hinton G. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539.
23. Alipanahi B., Delong A., Weirauch M.T., Frey B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33:831–838. doi: 10.1038/nbt.3300.
24. Leung M.K., Xiong H.Y., Lee L.J., Frey B.J. Deep learning of the tissue-regulated splicing code. Bioinformatics. 2014;30:i121–i129. doi: 10.1093/bioinformatics/btu277.
25. Li Y., Wang S., Umarov R., Xie B., Fan M., Li L., et al. DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2018;34:760–769. doi: 10.1093/bioinformatics/btx680.
26. Zou Z., Tian S., Gao X., Li Y. mlDEEPre: multi-functional enzyme function prediction with hierarchical multi-label deep learning. Front Genet. 2019;9:714. doi: 10.3389/fgene.2018.00714.
27. Han R., Li Y., Wang S., Gao X., Bi C., Li M. DeepSimulator: a deep simulator for Nanopore sequencing. Bioinformatics. 2018;34:2899–2908. doi: 10.1093/bioinformatics/bty223.
28. Wang S., Li Z., Yu Y., Gao X. WaveNano: a signal-level nanopore base-caller via simultaneous prediction of nucleotide labels and move labels through bi-directional WaveNets. Quant Biol. 2018;6:359–368.
29. Umarov R., Kuwahara H., Li Y., Gao X., Solovyev V. Promoter analysis and prediction in the human genome using sequence-based deep learning models. Bioinformatics. 2019;35:2730–2737. doi: 10.1093/bioinformatics/bty1068.
30. Xia Z., Li Y., Zhang B., Li Z., Hu Y., Chen W., et al. DeeReCT-PolyA: a robust and generic deep learning method for PAS identification. Bioinformatics. 2019;35:2371–2379. doi: 10.1093/bioinformatics/bty991.
31. Xiao M.S., Zhang B., Li Y.S., Gao Q., Sun W., Chen W. Global analysis of regulatory divergence in the evolution of mouse alternative polyadenylation. Mol Syst Biol. 2016;12:890. doi: 10.15252/msb.20167375.
32. Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735.
33. Hu J., Lutz C.S., Wilusz J., Tian B. Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation. RNA. 2005;11:1485–1493. doi: 10.1261/rna.2107305.
34. Gao Q., Sun W., Ballegeer M., Libert C., Chen W. Predominant contribution of cis-regulatory divergence in the evolution of mouse alternative splicing. Mol Syst Biol. 2015;11:816. doi: 10.15252/msb.20145970.
35. Hou J., Wang X., McShane E., Zauber H., Sun W., Selbach M., et al. Extensive allele-specific translational regulation in hybrid mice. Mol Syst Biol. 2015;11:825. doi: 10.15252/msb.156240.
36. Kingma D.P., Ba J. Adam: a method for stochastic optimization. arXiv. 2014;1412.6980.
37. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019:8024–8035.
38. Bogard N., Linder J., Rosenberg A.B., Seelig G. A deep neural network for predicting and engineering alternative polyadenylation. Cell. 2019;178:91–106.e23. doi: 10.1016/j.cell.2019.04.046.
39. Zuberi K., Gandhi S., Bretschneider H., Frey B.J., Deshwar A.G. COSSMO: predicting competitive alternative splice site selection using deep learning. Bioinformatics. 2018;34:i429–i437. doi: 10.1093/bioinformatics/bty244.
40. Pinto P.A.B., Henriques T., Freitas M.O., Martins T., Domingues R.G., Wyrzykowska P.S., et al. RNA polymerase II kinetics in polo polyadenylation signal selection. EMBO J. 2011;30:2431–2444. doi: 10.1038/emboj.2011.156.
41. Gruber A.J., Zavolan M. Alternative cleavage and polyadenylation in health and disease. Nat Rev Genet. 2019;20:599–614. doi: 10.1038/s41576-019-0145-z.
42. Zhang Z., Pan Z., Ying Y., Xie Z., Adhikari S., Phillips J., et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat Methods. 2019;16:307–310. doi: 10.1038/s41592-019-0351-9.