Abstract
Dihydrouridine (D) is a modified pyrimidine nucleotide universally found in viral, prokaryotic, and eukaryotic species. It serves as a metabolic modulator for various pathological conditions, and its elevated levels in tumors are associated with a series of cancers. Precise identification of D sites on RNA is vital for understanding its biological function. A number of computational approaches have been developed for predicting D sites on tRNAs; however, none have considered mRNAs. We present here DPred, the first computational tool for predicting D on mRNAs in yeast from the primary RNA sequences. Built on a local self-attention layer and a convolutional neural network (CNN) layer, the proposed deep learning model outperformed classic machine learning approaches (random forest, support vector machines, etc.) and achieved reasonable accuracy and reliability with areas under the curve of 0.9166 and 0.9027 in jackknife cross-validation and on an independent testing dataset, respectively. Importantly, we showed that distinct sequence signatures are associated with the D sites on mRNAs and tRNAs, implying potentially different formation mechanisms and putative divergent functionality of this modification on the two types of RNA. DPred is available as a user-friendly Web server.
Keywords: MT: bioinformatics, dihydrouridine, CNN, local self-attention, RNA modification, sequence-derived features, epitranscriptomic mark
Graphical abstract

We present DPred, the first computational tool enabled by local self-attention and CNN for predicting dihydrouridine on mRNAs from the primary RNA sequences. We show that distinct sequence signatures are associated with the dihydrouridine sites on mRNAs and tRNAs, implying potentially different formation mechanisms on various types of RNA.
Introduction
Recent development of high-resolution transcriptome mapping has enabled the transcriptome-wide profiling of post-transcriptional RNA modification sites.1 They were reported on almost all kinds of RNA, including mRNA, rRNA, tRNA, and small nuclear RNA (snRNA).2,3 Over 170 post-transcriptional modifications have been identified and found to play important roles in various biological processes; e.g., fine-tuning RNA structures and functions, regulation of gene expression and protein synthesis, response to environmental exposures, cell differentiation, and mechanistic toxicology.4,5,6,7,8,9,10 Post-transcriptional RNA modification can also have implications in human health and medical science.11 To date, over 100 RNA modification enzyme mutations have been found to have an association with human diseases.12
Dihydrouridine (D) is a modified nucleotide universally found in viral, prokaryotic, and eukaryotic species. Recent studies showed that D is the second most prevalent modification in tRNA.13,14 Catalyzed by D synthases (Dus), D is formed by the hydrogenation of uridine (U) whose C5–C6 bond is reduced, and is found within the eponym D loop of tRNAs.15,16 With its unique structure, D can influence the RNA backbone and a series of RNA-involved processes. The modification can serve as a metabolic modulator for various pathological conditions, and its elevated levels in tumors are associated with a series of cancers in the lung,17 liver,18 kidney,19 prostate,20 and oral squamous cells.21 It is a non-trivial task to accurately identify D in RNA and understand its fundamental biological function in all species.
With the development of new high-resolution mapping methods, more and more D sites have been identified on tRNAs in multiple species, including Saccharomyces cerevisiae, Schizosaccharomyces pombe, Homo sapiens, Mus musculus, Escherichia coli, Bacillus subtilis, and Mycoplasma capricolum. Database Web servers such as MODOMICS1 and RMBase22 were developed and greatly facilitated the study of tRNA modifications, in which D sites are included. Furthermore, wet-lab experimental methods such as D sequencing (D-seq) and rhodamine sequencing (Rho-seq) established the presence of D on mRNAs in yeast and human, which further verified its physiological significance.14,23,24 Both methods take advantage of D-specific chemistry and next-generation sequencing to determine the location of D across the transcriptome, resulting in high-confidence single-nucleotide-resolution data.
Considering the high cost and long experiment time of wet-lab-based epitranscriptome profiling approaches, computational methods are often preferred as an alternative avenue.25 A number of computational methods have been developed for predicting epigenetic modifications of RNA.26,27,28,29,30,31,32,33,34,35,36,37 Among them, iRNAD is the first approach for D-site prediction from multiple species, which used a support vector machine to distinguish D and non-D sites.34 Later, iRNAD_XGBoost used XGBoost-selected multiple features to construct a model for D detection.33 However, to the best of our knowledge, all existing D-site prediction tools32,33,34,35 were trained on tRNAs, and it is not clear whether they can be applied to predict D sites on mRNAs. Although recent studies have unveiled the widely occurring nature and transcriptome-wide distribution of D (or the D epitranscriptome),14 there are still no prediction tools constructed for mRNA D sites using mRNA D datasets.
Here we present DPred, the first computational method for predicting D modification on mRNAs in yeast from the primary RNA sequences. In DPred, the sequence data were encoded by the combination of nucleotide chemical property and nucleotide density, which outperformed three other encoding schemes. By taking advantages of local self-attention mechanisms and convolutional neural networks (CNNs), DPred can effectively distinguish D-containing sites from the non-modified by exploiting informative features directly from primary sequences. It achieved reasonable accuracy and reliability with areas under the curve (AUCs) of 0.9166 and 0.9027 in jackknife cross-validation and on independent testing dataset, respectively. A simplified graphic framework of DPred is illustrated in Figure 1. Additionally, we showed that the sequence contexts of D sites on mRNAs and tRNAs are substantially different, and the D-prediction tools trained by mRNA and tRNA data should be clearly distinguished.
Figure 1.
Model architecture
The model architecture mainly consists of local self-attention and convolutional mechanisms.
To facilitate the access of our predictor, a user-friendly Web server has also been developed and made freely available at http://www.rnamd.org/dpred. It is expected that our predictor could be a useful tool for researchers interested in D modification.
Results
We developed the first mRNA D predictor enabled by deep neural networks. We established our benchmark dataset based on the high-confidence single-nucleotide-resolution data reported by the D-seq technique.23 Each unmodified site was randomly selected on the same transcript of each positive D site, making the positive-to-negative ratio 1:1. The dataset was randomly split into the training and testing part with a ratio of 8:2. Experiments were performed under jackknife cross-validation in the training phase.
Choosing a good encoding strategy is crucial for constructing a high-accurate predictor. Two nucleotide representation methods: one hot (OH), nucleotide chemical property (NCP), and two properties, namely nucleotide density (ND) and electron-ion interaction potential (EIIP), were employed in this study. Each nucleotide representation and each nucleotide property formed an encoding strategy. Thus, four encoding schemes, OH_ND, OH_EIIP, NCP_ND, and NCP_EIIP were compared in the following experiments and discussion. Details can be found in the “materials and methods” section.
The deep neural network model mainly consists of an additive local self-attention layer and a CNN layer (see Figure 1). The attention layer uses additive self-attention to calculate the alignment weights indicating how much attention it should give to each input state. The CNN layer exploits local sequence motifs for a given target through its local receptive fields. A max-pooling layer is followed to filter weak features and expand the receptive field. A dropout layer is implemented to avoid overfitting in model training. A fully connected layer eventually takes all the previous results and leads to a softmax function with a 0.5 threshold, which predicts whether the input sequence contains D sites or not. For more details, please see the “materials and methods” section.
Model performance evaluation and comparison with other methods
We evaluated proposed DPred and other competing classifiers on the benchmark dataset stated above with different encoding schemes. Results are summarized in Table 1. The competing methods include logistic regression (LR), random forest (RF), support vector machine (SVM), and XGBoost (XG). Model performance was evaluated by jackknife cross-validation and independent testing set. When using NCP-based encoding strategies, classifiers showed more reliable and robust results. The RF and XGBoost classifiers generally outperformed LR and SVM. In the case of OH-based encoding, the RF and XGBoost even achieved better accuracy than our DPred. In NCP_ND, DPred outperformed the competing classifiers and achieved the highest accuracy among all the tests. It obtained AUCs of 0.9166 and 0.9027 in jackknife cross-validation on the training dataset and on independent testing dataset, respectively. The receiver operating characteristic (ROC) curve is plotted in Figure S1. We also compared DPred with a series of deep learning neural networks. Results were summarized in Table S1. We found that our DPred, a combination of CNNs and local self-attention, outperformed other methods that only used either one of them individually, and it achieved the best performance in our study. Details of other methods can be viewed on GitHub.
Table 1.
Comparison of different methods and encoding methods (AUROC)
| Modes | Methods | LR | RF | SVM | XG | DPred |
|---|---|---|---|---|---|---|
| NCP_ND | cross-validation | 0.8686 | 0.8958 | 0.8449 | 0.8958 | 0.9166 |
| testing | 0.8242 | 0.8981 | 0.8056 | 0.8827 | 0.9027 | |
| NCP_EIIP | cross-validation | 0.8282 | 0.9275 | 0.8429 | 0.8750 | 0.9043 |
| testing | 0.8194 | 0.8734 | 0.8163 | 0.8688 | 0.8889 | |
| OH_ND | cross-validation | 0.8132 | 0.8645 | 0.7747 | 0.8560 | 0.8571 |
| testing | 0.7778 | 0.8518 | 0.7944 | 0.8256 | 0.8395 | |
| OH_EIIP | cross-validation | 0.8177 | 0.8428 | 0.7944 | 0.8583 | 0.8214 |
| testing | 0.8009 | 0.8611 | 0.7880 | 0.8441 | 0.8056 |
To give an intuitive assessment of the classification ability, we drew t-SNE (t-distributed Stochastic Neighbor Embedding) plots generated by different layers in DPred, which visualized the distance between samples after dimensionality reduction, as shown in Figure S2. The positive samples (red dots) are gradually clustered in the upper-left area through leaning, whereas the negative samples are concentrated in the lower-right area. As shown in the t-SNE plots, our proposed method is capable of extracting features that can efficiently discriminate D sites from non-D sites.
Since the available high-quality datasets are limited, this study has not involved multi-species prediction of mRNA D sites. To further demonstrate the generalization of proposed method, we collected available tRNA D sites reported by RMbase22 and MODOMICS1 in different species, including S. cerevisiae, H. sapiens, M. musculus, and E. coli. Results showed that the DPred framework achieved good performance in all datasets, which are summarized in Table S2.
The performance of our predictor was evaluated on the full transcript mode and mRNA mode. The full transcript mode kept both exonic and intronic regions of the positive and negative sites, while, in the mRNA mode, only the exons were kept. The sequences utilized in both modes were based on the same D and non-modified sites. We eventually found mRNA-based prediction outperformed full transcript-based prediction on both modes of 41 nt and 101 nt lengths. Results are summarized in Table S3.
Prediction tools of D sites trained by mRNA and tRNA data should be clearly distinguished
To the best of our knowledge, all existing proposed D-site prediction tools were trained on tRNAs, while none considered mRNAs. It is of interest to explore whether a tRNA-trained model can be applied to predict D sites on mRNAs, and also the reverse of the case. We therefore performed prediction for D over mRNA (tRNA) sequences using the DPred approach trained by tRNA (mRNA) sequences. The D-containing tRNA sequences in yeast were collected from the database RMBase 2.0.22 Interestingly, we found these two types of datasets basically cannot be used to predict each other (Table 2). This finding indicated that the D-prediction tools trained by mRNA and tRNA data should be clearly distinguished.
Table 2.
Performance evaluation results of D sites on tRNA (mRNA) predicted by mRNA (tRNA) data
| Training data | Testing data | Sn (%) | Sp (%) | ACC (%) | F1 scores | MCC | AUROC |
|---|---|---|---|---|---|---|---|
| tRNA | mRNA | 22.22 | 83.33 | 52.77 | 0.3200 | 0.0702 | 0.5617 |
| mRNA | tRNA | 15.38 | 84.62 | 51.92 | 0.2352 | 0.0494 | 0.5192 |
| tRNA | tRNA | 94.87 | 93.59 | 94.23 | 0.9427 | 0.8847 | 0.9675 |
| mRNA | mRNA | 94.44 | 88.88 | 91.66 | 0.9189 | 0.8062 | 0.9027 |
Analysis of nucleotide composition preference
Importantly, we revealed that the sequence contexts of D sites on mRNAs and tRNAs are substantially different, which explained why D-related mRNA and tRNA datasets cannot be used to train models that predict each other. We demonstrated the position-specific differences around D sites on mRNA, snRNA, and tRNA sequences respectively in Figure 2. On tRNA sequences, G bases are enriched around D sites, whereas, on mRNA sequences, this pattern is no longer the same, and A bases are found to be over-represented at positions 1 and 4. The plots were drawn by the tool ggseqlogo.38 Furthermore, we conducted motif analysis between positive and negative sequences on tRNAs and mRNAs respectively by the graphical tool two-sample-logo,39 and showed the differences in nucleotide distributions around D and non-D sites, suggesting it is feasible to design a computational model identifying the D only based on the primary sequences (Figure S3).
Figure 2.
Demonstration of nucleotide composition preference
(A–C) The sequence contexts around D modification sites on mRNA, snRNA, and tRNA sequences. (A), (B), and (C) were derived from D-seq data.23 (D) was based on RMBase data.22 D sites on mRNAs and tRNAs have distinct sequence signatures, which are quite informative. On the contrary, the sequence signatures of D sites on tRNA from D-seq technology and RMBase are very similar, suggesting technical bias should not be the major cause of the previous discrepancy.
Based on the datasets we collected, we also conducted mixed prediction about D sites between snRNA/tRNA/mRNA sequences and drew a heatmap with the accuracies shown in Figure S4. The RMBase tRNA-trained model achieved good performance on both RMBase and D-seq tRNA datasets but failed to predict D sites on mRNA and snRNA. On the contrary, mRNA- and snRNA-trained models made reliable predictions on their own datasets but could not classify tRNA D sites well. This is consistent with the motif analysis in Figure 2, where the nucleotide compositions around the D sites on mRNA and snRNA sequences are considerably different from that on tRNAs.
Web implementation
DPred takes FASTA sequences as input. Users can paste the sequences into the text box on the Web site or upload a FASTA file (Figure 3). The input sequence should be 41 nt long, and in the center should be a potential D site to be evaluated. The Web server will predict the D-site possibility using our well-trained DPred model. Results can be downloaded as an Excel table.
Figure 3.
Screenshot of the DPred Web server
Users can upload the query sequences. Results will then be presented after a while.
Discussion
In this study, we developed the first computational tool for prediction of D sites on mRNAs in yeast. We explored different ways to encode RNA so as to capture the informative features within each sequence. The model framework was mainly built on the additive local self-attention and CNN architecture. Accuracies of 0.9166 and 0.9027 were achieved in cross-validation and an independent testing dataset, showing reasonable accuracy and reliability of our model.
The universal D modification is first discovered within the eponym D loop of tRNAs. It results from a reduction of the uridine C5–C6 bond by Dus enzymes. Its unique structure leads to a series of special biochemical properties, including the potential destabilization of the RNA structure and molecular flexibility.14 Particularly, D affects the flexibility of tRNAs. It involves the formation of cloverleaf-related tRNA secondary structure and L-shaped tRNA tertiary structure by causing changes to the RNA backbone conformation.40,41
Recent studies show that the D distribution may spread to other types of RNA. Newly developed high-resolution mapping methods established the presence of D within mRNAs, which further verified its physiological significance.14,23,24 D-seq reported that certain (not all) mRNA D sites occur in regions where secondary structure potential is evolutionarily conserved, suggesting biological function possibly analogous to D sites on tRNAs.23 Another question is raised concerning how D on mRNA involves translation. An interesting finding is that a dihydrouridylated mRNA could be translationally repressed, where D may modulate the translational speed and ribosome stalling.42 A conjecture to explore here is that a D on mRNA enables co-translational folding by regulating the speed of translation.14 However, detailed biological functions and formation mechanisms of modified mRNAs still need to be investigated.
Interestingly, in our study, we found that the D sites on mRNAs and tRNAs have distinct sequence contexts. Although this may be due to technical bias embedded in data obtained from different resources, a more exciting hypothesis may be that the D sites on mRNAs and tRNAs may have different formation mechanisms and fulfill divergent biological functions, which remains to be tested in the future. The expression level of Dus enzymes and the D landscape will be the subject of further investigation of the roles dihydrouridylated mRNAs and tRNAs play in translation and other physiological activities.14
Since high-quality D epitranscriptome data are limited, multi-species prediction has not been conducted in this study. Other features, such as secondary structures, genome information, and RNA types, will need to be involved to further improve the robustness and generalization ability of the model in the future.
Materials and methods
Benchmark datasets
The benchmark dataset needs to include both positive and negative sequences. The sequences should contain a uridine in the middle that is a potential D modification to be evaluated; i.e., the sequence should be long, where is the number of nucleotides from center to both sides. The D-containing mRNA sequences (positive datasets) were obtained from the recently published single-nucleotide-resolution D-seq data.23 The D-containing tRNA sequences were collected from the database RMBase 2.0.22 Unmodified sequences (negative datasets) were randomly selected on the same transcripts of the positive D sites. There are a total of 956 RNA samples, of which 780 are tRNA sequences and 176 are mRNA sequences. Both have a 1:1 positive-to-negative ratio. We randomly split the dataset into training and testing datasets with a ratio of 8:2. Experiments were performed under jackknife cross-validation in the training phase. Each dataset contains an equal number of positive and negative sequences.
Furthermore, if a potential D site is near the 5′ or 3′ end, a head-to-end principle is utilized when extracting corresponding sequences.34 If the distance from a potential D site to the transcript head (end) is less than the window size , the missing part is filled by the transcript end (head) part nucleotides. This strategy ensures all sequences to be extracted have the same length and the D/uridine in the center.
Feature extraction strategy
In this study, four encoding schemes were employed, which were different combinations of nucleotide representations and properties. The nucleotide representation methods included OH and NCP. The nucleotide properties included ND and EIIP. Thus, we compared four encoding strategies, OH_ND, OH_EIIP, NCP_ND, and NCP_EIIP, to explore which combination of encoding strategies led to the best performance in the model training. Details of them are stated below.
OH is the most common feature representation method and maps each element into a vector. In this study, there are four types of nucleotide: A (adenine), C (cytosine), G (guanine), and U (uracil). Each nucleotide in sequences can be assigned to a vector of 4 ().
NCP is a nucleotide representation approach based on the chemical structures of RNA nucleotides. It was first proposed for site prediction in DNA sequences and now is also widely used in RNA sequences.43,44,45 Four different types of nucleotide are classified into three groups: the number of ring structure (A and G have two rings, while C and U have one ring), the chemical functionality (A and C have an amino group, G and U have a keto group), and the hydrogen bond strength (hydrogen bonding between C and G is strong, that between A and U is weak). The i-th nucleotide in a sequence is denoted by (), where is nucleotide type and reflect these three groups respectively:
| (Equation 1) |
| (Equation 2) |
| (Equation 3) |
ND indicates the distribution and cumulative frequency information of nucleotides at each position.46 The density of the i-th nucleotide is calculated as the number of nucleotides that have the same types as before the i + 1-th position, divided by i:
| (Equation 4) |
where
| (Equation 5) |
The EIIP value of the nucleotides was first proposed in 2006,47 indicating the electron-ion interaction potential of an mRNA nucleotide, and later was utilized by C prediction48 and other bioinformatics research areas.49,50 The EIIP values for different nucleotides are given in Table S4.
Each nucleotide in sequences was encoded into a discrete vector consisting of a nucleotide representation plus a nucleotide property. For example, in OH_EIIP, A, C, G, and U were encoded as vectors [1, 0, 0, 0, ], [0, 1, 0, 0, ], [0, 0, 1, 0, ], and [0, 0, 0, 1, ]. Each sequence was encoded as a × 5 matrix. In NCP_ND, A, C, G, and U were encoded as vectors [1, 1, 1, ], [0, 1, 0, ], [0, 0, 1, ], and [0, 0, 1, ]. Each sequence was encoded as a × 4 matrix. In the following discussion, we explore which combination of encoding strategies leads to the best performance in the model training.
Performance evaluation metrics
The model performance was evaluated by various statistical metrics, including sensitivity (Sn, also termed recall), specificity (Sp), accuracy (ACC), F1 scores, and Mathew’s correlation coefficient (MCC):
| (Equation 6) |
| (Equation 7) |
| (Equation 8) |
| (Equation 9) |
| (Equation 10) |
| (Equation 11) |
TP, TN, FP, and FN represent the number of true-positive, true-negative, false-positive, and false-negative, respectively. Re and Pre indicate recall and precision. The ROC curve and the area under the ROC (AUROC) were also included to quantitatively evaluate the proposed method.
Sequence length and feature extraction optimization
It is crucial to select a suitable input sequence length for the prediction model. If the length is too short, input sequences cannot provide enough information. If the sequence length is too long, redundant information could be introduced, increasing the difficulty for the model to extract crucial features. The encoding schemes can influence model performance as well and also need to be selected. Thus, we evaluated DPred under different sequence lengths and encoding schemes.
The range of sequence length to be evaluated is from 11 to 81 nt. Figure 4 shows the performance evaluation results based on different sequence sizes and four encoding schemes. We found the two NCP-based encoding methods generally outperformed the OH-based methods. In NCP_EIIP, the model achieved the best performance with an AUC of 0.8992 when the sequence size was 31 nt. In NCP_ND, an AUC of 0.8889 was obtained when the length was 31 nt, and a 0.9027 AUC was attained at 41 nt. We chose 41 nt as the sequence length and NCP_ND as the final encoding strategy for the predictor.
Figure 4.
Sequence length optimization by AUROC
Model performance was evaluated with different sequence lengths and encoding schemes by comparing AUROC values. 41nt sequence length and NCP_ND encoding scheme were chosen for the predictor.
The ROC curves for the independent testing dataset are plotted in Figure 5. Model performance was evaluated based on 41-nt sequence length for four encoding methods. NCP_ND generally yielded the best prediction among the four encoding schemes.
Figure 5.
ROC curves of DPred performance
The curves were drawn based on the model performance with 41nt sequence length and four encoding methods in the testing dataset.
Model architecture
The network framework is as follows: the first local attention51 layer uses additive attention and assigns attention scores to each state in the sequence; a convolutional layer exploits meaningful features in its local receptive fields by using weight-sharing strategy; a max-pooling layer removes unimportant features and enlarges the receptive field; all processed features are then flattened and fed into a dropout layer to avoid overfitting in model training and to increase the generalization of the network on unknown sequences. A fully connected layer eventually takes all the previous results and leads to a softmax function, which predicts whether the input sequence contains D sites or not with a 0.5 threshold. The rectified linear unit (ReLU) is used as an activation function throughout the framework except for the first local attention layer and the last layer, which utilize the softmax function.
L2 regularization was adopted in the dense layer. During the L2 regularization, the loss function of the neural network is extended by a regularization term, which is the L2 norm of the weight matrices; i.e., the sum over all squared weight values. The regularization term is weighted by a regularization rate being set to 0.01. Performing L2 regularization makes the weights of the neural network small (toward zero but not exactly zero). Smaller weights reduce the impact of the hidden neurons and lower the complexity of the model during training.
Attention-based methods have gained popularity in neural networks since the late 2010s. The attention structure highlights the importance of key points relative to their neighborhood regions and makes the model learn what to attend to from the given sequence.52 Self-attention was one of the attention mechanisms that attends to states within the sequence53; i.e., it only focuses on how one state attends to other states within a given sequence. Research has found that self-attention layers operate similarly to CNN layers54 but are more flexible with a learnable receptive field. This mechanism has been widely used in image analysis and NLP (Natural Language Processing) tasks, sometimes combing with or replacing CNNs, RNNs (Recurrent Neural Networks), and LSTMs (Long Short-Term Memory Networks).
Luong et al.51 first distinguished between global attention52 versus local attention. Global attention considers all the states, while the local attention only considers a subset of states when calculating attention weights. The latter combines the strength of both hard and soft attention and is differentiable, hence easier to implement and train.
The attention layer used in this model is the additive local self-attention. In this model, let and denote the i-th and the j-th element in the sequence respectively. It should first calculate the alignment (attention) weights , which is derived by aligning the state i with the state j:
| (Equation 12) |
| (Equation 13) |
| (Equation 14) |
The is alignment score function. Here, it used an additive structure, which is why we called this layer an additive local self-attention. Besides this additive structure, the alignment score function can also take dot product, scaled dot product, and cosine similarity operation. Furthermore, the local attention layer only focuses on a subset of context, which means, for each state i, we should not calculate the alignment of every state in the sequence with it, but only part of them with it. We used monotonic alignment here,51 just setting the aligned position to be i; i.e., the i-th state is aligned with a window of states around itself. For example, if the window size is 5, each state in the sequence is compared with two states in front of it, with two states behind it, and with itself, respectively. Next, the alignment weights were obtained by feeding alignment scores through a softmax function. The weight indicates how much attention it should give to each input state.
The states in the input layer were weighted to give a new vector that is used by the subsequent layer. The new vector is this attention-focused state representation , derived by the weighted average over the input sequence, which should have the same size with the input layer.
| (Equation 15) |
Selected hyper-parameters are presented in Table 3. The layers of the model and each output shape are summarized in Table S5. The loss function in this model computed a categorical cross entropy loss:
| (Equation 16) |
where is the true label of the s-th sample, is the prediction, and N is the number of values in the model output (2 in this study). The Adam optimizer,55 which combines the strength of momentum and adaptive learning rate method, was used for training the network with a small initial learning rate of 0.0001. The epoch was set to be 100.
Table 3.
Selected hyper-parameters for the DPred model
| Hyper-parameters | Selected values |
|---|---|
| Attention type | additive |
| Attention hidden size | 32 |
| Attention width | 2 |
| Number of filters | 100 |
| CNN stride size | 2 |
| Kernel size | 2 |
| Padding | same |
| Max-pooling stride-size | 1 |
| Max-pooling pool-size | 2 |
| Dropout ratio | 0.1 |
| Regularization rate | 0.01 |
| Dense size | 100 |
| Loss function | categorical_crossentropy |
| Optimizer | Adam |
| Learning rate | 0.0001 |
Acknowledgments
This project was supported by the National Science Foundation for Young Scientists of China (grant no. 62003273) and Natural Science Basic Research Program of Shaanxi (program no. 2020JQ-217).
Author contributions
J.M. initialized the project. Y.W. and J.M. conceived the ideas. Y.W. collected and processed all of the data. Y.W. designed and implemented the model. X.W. built the Web server. Y.W. performed the analysis and compared it with competing methods. Y.W. drafted the manuscript. All authors read, critically revised, and approved the final manuscript.
Declaration of interests
The authors declare no competing interests.
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.omtn.2023.01.014.
Supplemental information
Data and code availability
The DPred framework was implemented using Tensorflow 2.10.0, and the codes can be freely accessed at https://github.com/yue-wang-biomath/DPred. The corresponding Web server can be assessed from http://www.rnamd.org/dpred.
References
- 1.Boccaletto P., Stefaniak F., Ray A., Cappannini A., Mukherjee S., Purta E., Kurkowska M., Shirvanizadeh N., Destefanis E., Groza P., et al. MODOMICS: a database of RNA modification pathways. 2021 update. Nucleic Acids Res. 2022;50:D231–D235. doi: 10.1093/nar/gkab1083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Helm M., Alfonzo J.D. Posttranscriptional RNA modifications: playing metabolic games in a cell’s chemical legoland. Chem. Biol. 2014;21:174–185. doi: 10.1016/j.chembiol.2013.10.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ma J., Song B., Wei Z., Huang D., Zhang Y., Su J., de Magalhães J.P., Rigden D.J., Meng J., Chen K. m5C-Atlas: a comprehensive database for decoding and annotating the 5-methylcytosine (m5C) epitranscriptome. Nucleic Acids Res. 2022;50:D196–D203. doi: 10.1093/nar/gkab1075. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Batista P.J., Molinie B., Wang J., Qu K., Zhang J., Li L., Bouley D.M., Lujan E., Haddad B., Daneshvar K., et al. m6A RNA modification controls cell fate transition in mammalian embryonic stem cells. Cell Stem Cell. 2014;15:707–719. doi: 10.1016/j.stem.2014.09.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Delaunay S., Frye M. RNA modifications regulating cell fate in cancer. Nat. Cell Biol. 2019;21:552–559. doi: 10.1038/s41556-019-0319-0. [DOI] [PubMed] [Google Scholar]
- 6.Yang C. ToxPoint: dissecting functional RNA modifications in responses to environmental exposure—mechanistic toxicology research enters a new era. Toxicol. Sci. 2020;174:1–2. doi: 10.1093/toxsci/kfz252. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Pendleton K.E., Chen B., Liu K., Hunter O.V., Xie Y., Tu B.P., Conrad N.K. The U6 snRNA m6A methyltransferase METTL16 regulates SAM synthetase intron retention. Cell. 2017;169:824–835.e14. doi: 10.1016/j.cell.2017.05.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Liu N., Dai Q., Zheng G., He C., Parisien M., Pan T. N6-methyladenosine-dependent RNA structural switches regulate RNA–protein interactions. Nature. 2015;518:560–564. doi: 10.1038/nature14234. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Geula S., Moshitch-Moshkovitz S., Dominissini D., Mansour A.A., Kol N., Salmon-Divon M., Hershkovitz V., Peer E., Mor N., Manor Y.S., et al. Stem cells. m6A mRNA methylation facilitates resolution of naïve pluripotency toward differentiation. Science. 2015;347:1002–1006. doi: 10.1126/science.1261417. [DOI] [PubMed] [Google Scholar]
- 10.Song B., Huang D., Zhang Y., Wei Z., Su J., Pedro de Magalhães J., Rigden D.J., Meng J., Chen K. m6A-TSHub: unveiling the context-specific m6A methylation and m6A-affecting mutations in 23 human tissues. Dev. Reprod. Biol. 2022 doi: 10.1016/j.gpb.2022.09.001. [DOI] [PubMed] [Google Scholar]
- 11.Song B., Wang X., Liang Z., Ma J., Huang D., Wang Y., de Magalhães J.P., Rigden D.J., Meng J., Liu G., et al. RMDisease V2.0: an updated database of genetic variants that affect RNA modifications with disease and trait implication. Nucleic Acids Res. 2022:gkac750. doi: 10.1093/nar/gkac750. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Jonkhout N., Tran J., Smith M.A., Schonrock N., Mattick J.S., Novoa E.M. The RNA modification landscape in human disease. RNA. 2017;23:1754–1769. doi: 10.1261/rna.063503.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Machnicka M.A., Olchowik A., Grosjean H., Bujnicki J.M. Distribution and frequencies of post-transcriptional modifications in tRNAs. RNA Biol. 2014;11:1619–1629. doi: 10.4161/15476286.2014.992273. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Finet O., Yague-Sanz C., Marchand F., Hermand D. The dihydrouridine landscape from tRNA to mRNA: a perspective on synthesis, structural impact and function. RNA Biol. 2022;19:735–750. doi: 10.1080/15476286.2022.2078094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Byrne R.T., Jenkins H.T., Peters D.T., Whelan F., Stowell J., Aziz N., Kasatsky P., Rodnina M.V., Koonin E.V., Konevega A.L., Antson A.A. Major reorientation of tRNA substrates defines specificity of dihydrouridine synthases. Proc. Natl. Acad. Sci. USA. 2015;112:6033–6037. doi: 10.1073/pnas.1500161112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Yu F., Tanaka Y., Yamashita K., Suzuki T., Nakamura A., Hirano N., Suzuki T., Yao M., Tanaka I. Molecular basis of dihydrouridine formation on tRNA. Proc. Natl. Acad. Sci. USA. 2011;108:19593–19598. doi: 10.1073/pnas.1112352108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Kato T., Daigo Y., Hayama S., Ishikawa N., Yamabuki T., Ito T., Miyamoto M., Kondo S., Nakamura Y. A novel human tRNA-dihydrouridine synthase involved in pulmonary carcinogenesis. Cancer Res. 2005;65:5638–5646. doi: 10.1158/0008-5472.CAN-05-0600. [DOI] [PubMed] [Google Scholar]
- 18.Kuchino Y., Borek E. Tumour-specific phenylalanine tRNA contains two supernumerary methylated bases. Nature. 1978;271:126–129. doi: 10.1038/271126a0. [DOI] [PubMed] [Google Scholar]
- 19.The Cancer Genome Atlas Research Network Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature. 2013;499:43–49. doi: 10.1038/nature12222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Huang J., Mondul A.M., Weinstein S.J., Derkach A., Moore S.C., Sampson J.N., Albanes D. Prospective serum metabolomic profiling of lethal prostate cancer. Int. J. Cancer. 2019;145:3231–3243. doi: 10.1002/ijc.32218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Sridharan G., Ramani P., Patankar S. Serum metabolomics in oral leukoplakia and oral squamous cell carcinoma. J. Cancer Res. Ther. 2017;13:556–561. doi: 10.4103/jcrt.JCRT_1233_16. [DOI] [PubMed] [Google Scholar]
- 22.Xuan J.-J., Sun W.-J., Lin P.-H., Zhou K.-R., Liu S., Zheng L.-L., Qu L.-H., Yang J.-H. RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res. 2018;46:D327–D334. doi: 10.1093/nar/gkx934. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Draycott A.S., Schaening-Burgos C., Rojas-Duran M.F., Wilson L., Schärfen L., Neugebauer K.M., Nachtergaele S., Gilbert W.V. Transcriptome-wide mapping reveals a diverse dihydrouridine landscape including mRNA. PLoS Biol. 2022;20:e3001622. doi: 10.1371/journal.pbio.3001622. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Finet O., Yague-Sanz C., Hermand D. Epitranscriptomic mapping of RNA modifications at single-nucleotide resolution using rhodamine sequencing (Rho-seq) STAR Protoc. 2022;3:101369. doi: 10.1016/j.xpro.2022.101369. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Tang Q., Nie F., Kang J., Chen W. mRNALocater: enhance the prediction accuracy of eukaryotic mRNA subcellular localization by using model fusion strategy. Mol. Ther. 2021;29:2617–2623. doi: 10.1016/j.ymthe.2021.04.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.He W., Jia C., Zou Q. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction. Bioinformatics. 2019;35:593–601. doi: 10.1093/bioinformatics/bty668. [DOI] [PubMed] [Google Scholar]
- 27.Chen K., Wei Z., Zhang Q., Wu X., Rong R., Lu Z., Su J., de Magalhães J.P., Rigden D.J., Meng J. WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res. 2019;47:e41. doi: 10.1093/nar/gkz074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Liu Q., Gregory R.I. RNAmod: an integrated system for the annotation of mRNA modifications. Nucleic Acids Res. 2019;47:W548–W555. doi: 10.1093/nar/gkz479. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Lv H., Zhang Z.-M., Li S.-H., Tan J.-X., Chen W., Lin H. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief. Bioinform. 2020;21:982–995. doi: 10.1093/bib/bbz048. [DOI] [PubMed] [Google Scholar]
- 30.Zhou Y., Zeng P., Li Y.-H., Zhang Z., Cui Q. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 2016;44:e91. doi: 10.1093/nar/gkw104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Liu K., Chen W. iMRM: a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics. 2020;36:3336–3342. doi: 10.1093/bioinformatics/btaa155. [DOI] [PubMed] [Google Scholar]
- 32.Zhu H., Ao C.-Y., Ding Y.-J., Hao H.-X., Yu L. Identification of D Modification sites using a random forest model based on nucleotide chemical properties. Int. J. Mol. Sci. 2022;23:3044. doi: 10.3390/ijms23063044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Dou L., Zhou W., Zhang L., Xu L., Han K. Accurate identification of RNA D modification using multiple features. RNA Biol. 2021;18:2236–2246. doi: 10.1080/15476286.2021.1898160. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Xu Z.-C., Feng P.-M., Yang H., Qiu W.-R., Chen W., Lin H. iRNAD: a computational tool for identifying D modification sites in RNA sequence. Bioinformatics. 2019;35:4922–4929. doi: 10.1093/bioinformatics/btz358. [DOI] [PubMed] [Google Scholar]
- 35.Feng P., Xu Z., Yang H., Lv H., Ding H., Liu L. Identification of D Modification sites by integrating heterogeneous features in Saccharomyces cerevisiae. Molecules. 2019;24:380. doi: 10.3390/molecules24030380. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Chen W., Tang H., Ye J., Lin H., Chou K.C. iRNA-PseU: identifying RNA pseudouridine sites. Mol. Ther. Nucleic Acids. 2016;5:e332. doi: 10.1038/mtna.2016.37. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Song B., Chen K., Tang Y., Wei Z., Su J., de Magalhães J.P., Rigden D.J., Meng J. ConsRM: collection and large-scale prediction of the evolutionarily conserved RNA methylation sites, with implications for the functional epitranscriptome. Brief. Bioinform. 2021;22:bbab088. doi: 10.1093/bib/bbab088. [DOI] [PubMed] [Google Scholar]
- 38.Wagih O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics. 2017;33:3645–3647. doi: 10.1093/bioinformatics/btx469. [DOI] [PubMed] [Google Scholar]
- 39.Vacic V., Iakoucheva L.M., Radivojac P. Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006;22:1536–1537. doi: 10.1093/bioinformatics/btl151. [DOI] [PubMed] [Google Scholar]
- 40.Dyubankova N., Sochacka E., Kraszewska K., Nawrot B., Herdewijn P., Lescrinier E. Contribution of dihydrouridine in folding of the D-arm in tRNA. Org. Biomol. Chem. 2015;13:4960–4966. doi: 10.1039/c5ob00164a. [DOI] [PubMed] [Google Scholar]
- 41.Väre V.Y.P., Eruysal E.R., Narendran A., Sarachan K.L., Agris P.F. Chemical and conformational diversity of modified nucleosides affects tRNA structure and function. Biomolecules. 2017;7:E29. doi: 10.3390/biom7010029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Finet O., Yague-Sanz C., Krüger L.K., Tran P., Migeot V., Louski M., Nevers A., Rougemaille M., Sun J., Ernst F.G.M., et al. Transcription-wide mapping of dihydrouridine reveals that mRNA dihydrouridylation is required for meiotic chromosome segregation. Mol. Cell. 2022;82:404–419.e9. doi: 10.1016/j.molcel.2021.11.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Yang H., Lv H., Ding H., Chen W., Lin H. iRNA-2OM: a sequence-based predictor for identifying 2′-O-methylation sites in Homo sapiens. J. Comput. Biol. 2018;25:1266–1277. doi: 10.1089/cmb.2018.0004. [DOI] [PubMed] [Google Scholar]
- 44.Chen W., Feng P., Tang H., Ding H., Lin H. RAMPred: identifying the N1-methyladenosine sites in eukaryotic transcriptomes. Sci. Rep. 2016;6:31080. doi: 10.1038/srep31080. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Chen W., Tang H., Lin H. MethyRNA: a web server for identification of N6-methyladenosine sites. J. Biomol. Struct. Dyn. 2017;35:683–687. doi: 10.1080/07391102.2016.1157761. [DOI] [PubMed] [Google Scholar]
- 46.Chen W., Yang H., Feng P., Ding H., Lin H. iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics. 2017;33:3518–3523. doi: 10.1093/bioinformatics/btx479. [DOI] [PubMed] [Google Scholar]
- 47.Nair A.S., Sreenadhan S.P. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP) Bioinformation. 2006;1:197–202. [PMC free article] [PubMed] [Google Scholar]
- 48.Dou L., Li X., Ding H., Xu L., Xiang H. Prediction of m5C modifications in RNA sequences by combining multiple sequence features. Mol. Ther. Nucleic Acids. 2020;21:332–342. doi: 10.1016/j.omtn.2020.06.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Han S., Liang Y., Ma Q., Xu Y., Zhang Y., Du W., Wang C., Li Y. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief. Bioinform. 2019;20:2009–2027. doi: 10.1093/bib/bby065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Bonidia R.P., Sampaio L.D.H., Domingues D.S., Paschoal A.R., Lopes F.M., de Carvalho A.C.P.L.F., Sanches D.S. Feature extraction approaches for biological sequences: a comparative study of mathematical features. Brief. Bioinform. 2021;22:bbab011. doi: 10.1093/bib/bbab011. [DOI] [PubMed] [Google Scholar]
- 51.Luong T., Pham H., Manning C.D. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015. Effective approaches to attention-based neural machine translation. [Google Scholar]
- 52.Zheng G., Mukherjee S., Dong X.L., Li F. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018. OpenTag: open attribute value extraction from product profiles. [Google Scholar]
- 53.Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. Attention is all you need. [Google Scholar]
- 54.Cordonnier J.-B., Loukas A., Jaggi M. International Conference on Learning Representations. ICLR; 2019. On the relationship between self-attention and convolutional layers. [Google Scholar]
- 55.Kingma D., Ba J. International Conference on Learning Representations. ICLR; 2014. Adam: a method for stochastic optimization. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The DPred framework was implemented using Tensorflow 2.10.0, and the codes can be freely accessed at https://github.com/yue-wang-biomath/DPred. The corresponding Web server can be assessed from http://www.rnamd.org/dpred.





