Abstract
N6-methyladenosine (m6A) is the most abundant within eukaryotic messenger RNA modification, which plays an essential regulatory role in the control of cellular functions and gene expression. However, it remains an outstanding challenge to detect mRNA m6A transcriptome-wide at base resolution via experimental approaches, which are generally time-consuming and expensive. Developing computational methods is a good strategy for accurate in silico detection of m6A modification sites from the large amount of RNA sequence data. Unfortunately, the existing computational models are usually only for m6A site prediction in a single species, without considering the tissue level of species, while most of them are constructed based on low-confidence level data generated by an m6A antibody immunoprecipitation (IP)-based sequencing method, thereby restricting reliability and generalizability of proposed models. Here, we review recent advances in computational prediction of m6A sites and construct a new computational approach named im6APred using ensemble deep learning to accurately identify m6A sites based on high-confidence level data in multiple tissues of mammals. Our model im6APred builds upon a comprehensive evaluation of multiple classification methods, including four traditional classification algorithms and three deep learning methods and their ensembles. The optimal base–classifier combinations are then chosen by five-fold cross-validation test to achieve an effective stacked model. Our model im6APred can produce the area under the receiver operating characteristic curve (AUROC) in the range of 0.82–0.91 on independent tests, indicating that our model has the ability to learn general methylation rules on RNA bases and generalize to m6A transcriptome-wide identification. Moreover, AUROCs in the range of 0.77–0.96 were achieved using cross-species/tissues validation on the benchmark dataset, demonstrating differences in predictive performance at the tissue level and the need for constructing tissue-specific models for m6A site prediction.
Keywords: RNA modification, m6A site identification, ensemble deep learning
1. Introduction
Chemical modification is an efficient and important way to regulate the functions of biological macromolecules, including protein, DNA and RNA [1]. Besides modifications of protein and DNA, more than 150 different RNA post-transcriptional modifications have been characterized so far [2]. Among them, N6-methyladenosine (m6A), the methylation of adenosine at position 6, is the most abundant and prevalent on RNA molecules present in eukaryotes [3], regulated mainly by a variety of “writer”, “reader” and “eraser” proteins [4] (Figure 1). This modification has been proved to be implicated in a variety of cellular functions, such as the self-renewal and differentiation of stem cells [5], DNA-damage response [6], spermatogonia differentiation [7], cellular heat shock response [8], anti-tumor immunity [9], X-chromosome inactivation [10], long-term memory creation [11], circadian clock function [12] and tumorigenesis [13]. In addition, aberrant m6A methylation is also closely related to the disease progression, including acute myeloid leukemia, breast tumor, gastric cancer and glioblastoma. Therefore, identifying m6A modification sites accurately is crucial to understand and explore the regulatory mechanisms and functions of various RNAs.
The m6A modification is difficult to be discriminated by chemical reactions because its chemical properties are similar to adenosine. Thus, to accurately profile the transcriptome-wide distribution of m6A modification, a number of high-throughput m6A sequencing techniques have been developed, including MeRIP [14] or m6A-seq [15], PA-m6A-seq [16], miCLIP [17], UV-CLIP [18], m6A-REF-seq [19] and DART-seq [20]. Despite significant progress achieved in transcriptome-wide mapping of m6A at the nucleotide level, these experimental methods are expensive and time-consuming for genome-scale detection of m6A modification sites. Therefore, it is urgent to develop computational methods to identify m6A modification sites in RNA as an effective bioinformatics tool for scholars to study RNA m6A biology.
Recent advances in high-throughput m6A sequencing techniques and the accumulation of experimentally validated m6A modification data [21,22,23] have paved the way for training such in silico predictors. To date, numerous computational tools have been developed for in silico prediction of m6A modification sites from RNA sequences, including iRNA-Methyl [24], M6ATH [25], MethyRNA [26], iRNA-PseColl [27], DeepM6APred [28], iRNA(m6A)-PseDNC [29], RAM-ESVM [30], RAM-NPPS [31], RNAMethPre [32], iRNA-3typeA [33], M6APred-EL [34], pRNAm-PC [35], TargetM6A [36], AthMethPre [37], iMethyl-STTNC [38], SRAMP [39], HMpre [40], MASS [41], MultiRM [42], etc. A summary of the existing methods for m6A modification site prediction are shown in Table 1.
Table 1.
Species | Tool | Experimental Method | Single Nucleotide Resolution | Features a | Algorithm | Evaluation Strategy | Year | Webserver b |
---|---|---|---|---|---|---|---|---|
S. cerevisiae | iRNA-Methyl [24] | m6A-Seq | No | PseDNC | SVM | Jackknife | 2015 | http://lin-group.cn/server/iRNA-Methyl (accessed on 3 December 2022) |
pRNAm_PC [35] | m6A-Seq | No | PseDNC, AC, CC | SVM | Jackknife | 2015 | http://www.jci-bioinfo.cn/pRNAm-PC (accessed on 3 December 2022) | |
RNA-MethylPred [43] | m6A-Seq | No | DNC, KNN scores | SVM | Jackknife | 2016 | No | |
TargetM6A [36] | m6A-Seq | No | PSNP, PSDP, NC |
SVM | Jackknife, independent test | 2016 | http://202.119.84.36:3079/TargetM6A/ (accessed on 3 December 2022) | |
RAM-ESVM [30] | m6A-Seq | No | PseDNC | Ensemble SVM | 10-fold CV | 2017 | decommissioned | |
iRNA(m6A)-PseDNC [29] | m6A-Seq | No | PseDNC | SVM | 10-fold CV | 2018 | http://lin-group.cn/server/iRNA(m6A)-PseDNC.php (accessed on 3 December 2022) | |
M6APred-EL [34] | m6A-Seq | No | PS(k-mer)NP, RFHC-GACs, PCPs | Ensemble SVM | 10-fold CV | 2018 | decommissioned | |
DeepM6APred [28] | m6A-Seq | No | Deep features, NPPS | SVM | 10-fold CV | 2018 | decommissioned | |
iMethyl-STTNC [38] | m6A-Seq | No | PseDNC, PseTNC, STNC, STTNC | SVM | 10-fold CV | 2018 | No | |
H. sapiens | iRNA-PseColl [27] | m6A-Seq | No | CPD | SVM | Jackknife | 2017 | http://lin-group.cn/server/iRNA-PseColl.html (accessed on 3 December 2022) |
HMpre [40] | miCLIP | Yes | SLRF, FREI, SNP | XGBoost | Independent test | 2018 | No | |
MultiRM [42] | m6A-CLIP, miCLIP | Yes | One-hot | CNN + BiLSTM |
Independent test | 2021 | www.xjtlu.edu.cn/biologicalsciences/multirm (accessed on 3 December 2022) | |
DeepM6ASeq-EL [44] | m6A-CLIP, miCLIP | Yes | One-hot, CPD, Word2vec | Ensemble CNN + LSTM |
Independent test | 2021 | No | |
H. sapiens, M. musculus |
MethyRNA [26] | m6A-Seq, MeRIP-Seq | No | CPD | SVM | Jackknife | 2016 | http://lin-group.cn/server/MethyRNA (accessed on 3 December 2022) |
SRAMP [39] | miCLIP | Yes | One-hot, SPE, KNN scores, PSSP | RF | 5-fold CV, independent test | 2016 | http://www.cuilab.cn/sramp/ (accessed on 3 December 2022) | |
RNAMethPre [32] | miCLIP, m6A-CLIP | Yes | One-hot, NC, SLS | SVM | 5-fold CV, independent test | 2016 | decommissioned | |
iRNA-3typeA [33] | m6A-Seq, MeRIP-Seq | No | CPD | SVM | Jackknife | 2018 | http://lin-group.cn/server/iRNA-3typeA.php (accessed on 3 December 2022) | |
A. thaliana | M6ATH [25] | m6A-seq | No | CPD | SVM | Jackknife | 2016 | http://lin-group.cn/server/M6ATH (accessed on 3 December 2022) |
H. sapiens, Mouse, Zebrafish | DeepM6ASeq [45] | miCLIP | Yes | One-hot | CNN + BiLSTM |
independent test | 2018 | https://github.com/rreybeyb/DeepM6ASeq (accessed on 3 December 2022) |
S. cerevisiae, H. sapiens, A. thaliana | RAM-NPPS [31] | m6A-Seq, PA-m6A-seq | No | NPPS | SVM | 10-fold CV | 2017 | decommissioned |
H. sapiens, Mouse, Chimpanzee, Rhesus, Pig, Rat, Zebrafish | MASS [41] | m6A-Seq, MeRIP-Seq, m6A-CLIP, miCLIP | Bulking | One-hot, Phylogenetic tree | CNN + BiLSTM |
5-fold CV | 2021 | https://github.com/mlcb-thu/MASS (accessed on 3 December 2022) |
H. sapiens, M. musculus, Rat | iRNA-m6A [46] | m6A-REF-seq | Yes | PCPs, CPD, One-hot | SVM | 5-fold CV, independent test | 2020 | http://lin-group.cn/server/iRNA-m6A/ (accessed on 3 December 2022) |
im6A-TS-CNN [47] | m6A-REF-seq | Yes | One-hot | CNN | 5-fold CV, independent test | 2021 | No | |
TS-m6A-DL [48] | m6A-REF-seq | Yes | One-hot | CNN | 5-fold CV, independent test | 2021 | http://nsclbio.jbnu.ac.kr/tools/TS-m6A-DL/ (accessed on 3 December 2022) |
a PseDNC: pseudo dinucleotide composition; DNC: dinucleotide composition; AC: auto-covariance; CC: cross-covariance; KNN scores: K-nearest neighbor encoding; PSNP: position-specific nucleotide propensity; PSDP: position-specific dinucleotide propensity; NC: nucleotide composition; PS(k-mer)NP: position-specific k-mer nucleotide propensity; PCPs: physical-chemical properties; CPD: chemical property with density, RFHC-GAC: a method integrating by CPD, AC and CC; NPPS: nucleotide pair position specificity; PseTNC: pseudo-trinucleotide-composition; STNC: split-trinucleotide-composition; STTNC: split-tetranucleotide-composition; PSSP: predicted secondary structure pattern; spectrum encoding: nucleotide pair spectrum encoding; SLS: stability of the local structure; one-hot: binary encoding; SPE: spectrum encoding, SLRF: site location related features, FREI: features related to entropy information, SNP: single nucleotide polymorphism features. b decommissioned—the webserver/tool is no longer available; no—the publication has no webserver or tool.
Although promising achievements have been made in this field, as shown in Table 1, existing approaches are subject to the following limitations. Firstly, most of them were developed using traditional machine learning methods, such as support vector machine (SVM) [24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,43], random forest (RF) [39] and eXtreme gradient boosting (XGBoost) [40]. However, few computational tools [41,42,44,45] were developed based on deep learning, which has powerful capacity to extract meaningful feature representations from a variety of raw data. Secondly, most existing studies relied on a limited number of m6A modification data from a single species, such as S. cerevisiae, A. thaliana and H. sapiens. While few computational predictors were proposed for identifying m6A modification sites from multiple species, especially in different tissues. Thirdly, the benchmark data of most existing tools in the field consisted of m6A modification sites with a medium confidence level. These sites were derived from m6A-seq, which cannot identify precise m6A modification sites and can only provide a 100–200 nt mapping region with m6A. More high confidence m6A modification sites are required to be directly extracted from the experiments with the single-nucleotide level, making the prediction results more reliable.
Recently, Dao et al. [46] collected experimentally confirmed single-nucleotide resolution m6A sites in various tissues of humans, mice and rats to construct a high-quality dataset and then proposed an SVM-based predictor named iRNA-m6A for identifying m6A sites. Although the performance of iRNA-m6A is promising for m6A site prediction in various tissues of multiple species, the classification method that they adopted focused on traditional machine learning methods, restricting further improvements of performance. Subsequently, to further improve the performance for m6A site prediction at the tissue level, Liu et al. [47] used the one-hot encoding scheme and a convolutional neural network (CNN) containing a convolutional layer and a max-pooling layer to propose a new model called im6A-TS-CNN based on Dao’s benchmark datasets. Abbas et al. [48] proposed a CNN model TS-m6A-DL containing three convolutional and max-pooling blocks, where the corresponding flattened outputs of each block were concatenated and fed into the dense layer.
Ensemble learning and deep learning have achieved great achievements in bioinformatics because of their superior adaptability and flexibility. However, these two methods were mainly treated as two kinds of independent methodologies. Recently, their combination formed a new methodology, termed ensemble deep learning, for improving stability and generalization capability of the proposed model synergistically. Compared with one single deep learning model, ensemble deep learning has been confirmed to pose better generalizability, prompting a new wave of research and application in different bioinformatics fields, such as multi-omics [49,50,51], systems biology [52,53,54], structural bioinformatics [55,56], sequence analysis [57] and so on.
Motivated by these remarkable achievements of ensemble deep learning, we here propose an ensemble deep-learning-based method, named im6APred, to identify m6A modification sites from the primary RNA sequence in various tissues of multiple species, including (a) human brain, liver and kidney, (b) mouse brain, liver kidney, heart, and testis, and (c) rat brain, liver and kidney. The predictor im6APred mainly consists of two core components: (i) five base models that predict m6A modification sites; and (ii) a simple ensemble method for achieving more stable results. As diversity of an individual network is a critical factor of a good ensemble model, the base models were generated based on convolutional neural network (CNN) under different hyperparameter combinations for promoting diversity of the individual network. Then, an ensemble strategy that averages output from all the individual base models was adopted to combine these base models into a final ensemble model. Overall workflow of the proposed predictor im6APred is illustrated in Figure 2. In addition, we developed a user-friendly webserver, which is publicly accessible at http://47.94.248.117/im6APred (accessed on 3 December 2022) [58].
2. Results and Discussion
2.1. The Optimal Base-Classifier Combinations
An effective ensemble learning strategy can achieve the construction of a robust prediction model by integrating the information of various classifiers. In this study, we employed four commonly used machine learning algorithms and three deep learning methods as base classifiers, including support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), gradient boosting (GB), fully connected network (FCN), long short-term memory (LSTM) and convolutional neural network (CNN). To elucidate advantages of ensemble learning, we firstly evaluated predictive performance of each single classifier trained on specific tissue data of each species using five-fold validation. Their optimal parameters are determined by the grid search method during five-fold validation. This process can be implement using Grid Search CV in python, which tries all the exhaustive combinations of parameter values supplied by a user and selects the best from the parameter space. The optimal parameter combinations of these classifiers were deposited in Supplementary Note S1. Subsequently, we selected five different single classifiers among them according to the ACC metric as the base classifiers and then produced 21 base–classifier combinations. In addition, we also considered seven other base–classifier combinations generated by the same kind of single classifiers. In detail, in accordance with the ACC metric, we selected the five best performing models from the same kind of single classifiers with different hyper-parameters. Finally, a simple average method was employed as the ensemble strategy of these 35 base–classifier combinations to classify.
We comprehensively evaluated these 35 base–classifier combinations for human brain tissue and other various tissues of multiple species and listed the five-fold cross-validation results in Table 2 and Tables S1–S10, respectively. As can be seen, for a single classifier, CNN and LSTM models have better performance than other classifiers, and so do the ensemble models containing CNN and LSTM, indicating that CNN and LSTM could capture latent information of sequential features of RNA m6A modification by integrating possible local-range dependencies and long-range dependencies, respectively. Moreover, it is clear that the ensemble deep learning model generated by five CNNs with different hyper-parameters achieved the best performance in terms of almost all performance metrics, with the only exception of Sp. Therefore, we developed 11 m6A site predictors, collectively called im6APred, which integrate an ensemble of 5 CNN classifiers with the combined strategy of a simple average method for predicting m6A site from various tissues of multiple species.
Table 2.
Ensemble Framework | Base–Classifier Combination | ACC (%) | Sn (%) | Sp (%) | MCC | AUROC |
---|---|---|---|---|---|---|
Single classifier | (1) SVM | 70.79 | 74.25 | 67.34 | 0.42 | 0.7746 |
(2) KNN | 67.97 | 71.10 | 64.84 | 0.36 | 0.7526 | |
(3) RF | 70.80 | 76.37 | 65.23 | 0.42 | 0.7776 | |
(4) GB | 70.74 | 75.66 | 65.82 | 0.42 | 0.7782 | |
(5) CNN | 72.94 | 77.59 | 68.30 | 0.46 | 0.8139 | |
(6) FCN | 69.21 | 70.29 | 68.12 | 0.38 | 0.7479 | |
(7) LSTM | 72.67 | 78.05 | 67.30 | 0.46 | 0.8047 | |
Ensemble across multiple models | (1) + (2) + (3) + (4) + (5) | 71.64 | 75.94 | 67.34 | 0.43 | 0.8028 |
(1) + (2) + (3) + (4) + (6) | 70.92 | 75.70 | 66.15 | 0.42 | 0.7763 | |
(1) + (2) + (3) + (4) + (7) | 71.53 | 76.50 | 66.56 | 0.43 | 0.8071 | |
(1) + (2) + (3) + (5) + (6) | 71.35 | 76.92 | 65.78 | 0.43 | 0.7989 | |
(1) + (2) + (4) + (5) + (6) | 71.18 | 75.70 | 66.67 | 0.43 | 0.7986 | |
(1) + (3) + (4) + (5) + (6) | 71.90 | 76.66 | 67.14 | 0.44 | 0.8000 | |
(2) + (3) + (4) + (5) + (6) | 71.66 | 75.68 | 67.64 | 0.43 | 0.8029 | |
(1) + (2) + (3) + (6) + (7) | 71.27 | 76.35 | 66.19 | 0.43 | 0.8027 | |
(1) + (2) + (4) + (6) + (7) | 70.98 | 75.33 | 66.62 | 0.42 | 0.8009 | |
(1) + (3) + (4) + (6) + (7) | 71.66 | 77.37 | 65.95 | 0.44 | 0.8026 | |
(1) + (2) + (3) + (5) + (7) | 72.38 | 78.35 | 66.41 | 0.45 | 0.8172 | |
(1) + (2) + (4) + (5) + (7) | 72.42 | 77.42 | 67.43 | 0.45 | 0.8150 | |
(1) + (3) + (4) + (5) + (7) | 72.81 | 79.91 | 65.71 | 0.46 | 0.8165 | |
(2) + (3) + (4) + (5) + (7) | 72.55 | 77.29 | 67.82 | 0.45 | 0.8188 | |
(1) + (2) + (5) + (6) + (7) | 72.10 | 74.77 | 69.42 | 0.44 | 0.8137 | |
(2) + (3) + (4) + (6) + (7) | 71.57 | 77.98 | 65.17 | 0.44 | 0.8050 | |
(1) + (3) + (5) + (6) + (7) | 72.48 | 77.02 | 67.93 | 0.45 | 0.8129 | |
(1) + (4) + (5) + (6) + (7) | 72.30 | 76.96 | 67.64 | 0.45 | 0.8130 | |
(2) + (3) + (5) + (6) + (7) | 72.20 | 77.29 | 67.12 | 0.45 | 0.8170 | |
(2) + (4) + (5) + (6) + (7) | 72.40 | 77.09 | 67.71 | 0.45 | 0.8142 | |
(3) + (4) + (5) + (6) + (7) | 72.38 | 78.09 | 66.67 | 0.45 | 0.8135 | |
Ensemble with same kind of model | (1) + (1) + (1) + (1) + (1) | 70.42 | 73.88 | 66.97 | 0.41 | 0.7718 |
(2) + (2) + (2) + (2) + (2) | 68.58 | 71.62 | 65.54 | 0.37 | 0.7564 | |
(3) + (3) + (3) + (3) + (3) | 70.93 | 76.74 | 65.12 | 0.42 | 0.7793 | |
(4) + (4) + (4) + (4) + (4) | 70.81 | 75.74 | 65.88 | 0.42 | 0.7773 | |
(5) + (5) + (5) + (5) + (5) | 74.23 | 82.48 | 65.99 | 0.49 | 0.8241 | |
(6) + (6) + (6) + (6) + (6) | 69.71 | 69.45 | 69.97 | 0.39 | 0.7749 | |
(7) + (7) + (7) + (7) + (7) | 73.79 | 81.04 | 66.54 | 0.48 | 0.8200 |
2.2. Performance Evaluation of the Proposed Model at Tissue Level of Species
There are several published methods for predicting N6-methyladenosine sites in multiple tissues of mammals, such as iRNA-m6A, m6A-TS-CNN and TS-m6A-DL. To provide a fair comparison, we adopted the same assessment criteria to examine the existing methods. Five-fold cross-validation was used to evaluate performance of the proposed models, which were trained on training data from each tissue of the different species are shown in Table 1. Subsequently, the independent datasets of the different tissues from three species shown in Table 1 were used to evaluate the robustness and reliability of the corresponding model, respectively. The results from five-fold cross-validation and an independent test of our proposed models are all listed in Table 3. As shown in Figure 3, some slight differences between the AUROC values of five-fold cross-validation and the independent test indicate that the proposed method in the current study is robust.
Table 3.
Species | Tissues | Five-Fold Cross Validation | Independent Test | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
ACC (%) | Sn (%) | Sp (%) | MCC | AUROC | ACC (%) | Sn (%) | Sp (%) | MCC | AUROC | ||
Human | Brain | 74.23 | 82.48 | 65.99 | 0.49 | 0.8241 | 74.38 | 80.41 | 68.35 | 0.49 | 0.8290 |
Liver | 81.55 | 84.13 | 78.98 | 0.63 | 0.8915 | 81.21 | 82.61 | 79.80 | 0.62 | 0.8909 | |
Kidney | 80.67 | 84.89 | 76.45 | 0.62 | 0.8896 | 80.59 | 87.29 | 73.89 | 0.61 | 0.8892 | |
Mouse | Brain | 79.85 | 83.54 | 76.16 | 0.60 | 0.8847 | 79.50 | 83.18 | 75.83 | 0.59 | 0.8832 |
Liver | 73.53 | 84.01 | 63.05 | 0.48 | 0.8181 | 78.61 | 91.34 | 65.88 | 0.59 | 0.8848 | |
Kidney | 81.96 | 83.51 | 80.42 | 0.64 | 0.9008 | 81.78 | 81.96 | 81.60 | 0.64 | 0.9007 | |
Heart | 75.31 | 81.69 | 68.92 | 0.51 | 0.8350 | 75.91 | 82.55 | 69.27 | 0.52 | 0.8377 | |
Testis | 76.90 | 85.70 | 68.09 | 0.54 | 0.8522 | 77.61 | 84.49 | 70.74 | 0.56 | 0.8621 | |
Rat | Brain | 77.27 | 81.80 | 72.75 | 0.55 | 0.8580 | 77.20 | 87.32 | 67.08 | 0.56 | 0.8680 |
Liver | 81.81 | 82.24 | 81.38 | 0.64 | 0.8949 | 81.44 | 86.66 | 76.22 | 0.63 | 0.8991 | |
Kidney | 82.97 | 82.90 | 83.05 | 0.66 | 0.9061 | 83.92 | 84.91 | 82.93 | 0.68 | 0.9173 |
To highlight the generalization ability of our tissue-specific models for predicting m6A modification sites, the corresponding results from five-fold cross-validation and an independent test obtained by their rivals are listed in Tables S11 and S12, respectively. The results from the independent test show that our proposed model im6APred has overall higher performance than other existing models. As shown in Figure 4, the values of the most important metrics, MCC and AUROC, are increased to a certain degree. For example, compared with iRNA-m6A, our predictor for identifying m6A sites in mouse liver obtained the maximum AUROC growth of 0.12 and increased 0.09 compared to the other two predictors. It indicates that im6APred has good stability and high generalization ability for identifying the m6A modification sites for unknown RNA sequences.
To verify the superior generalization ability of our tissue-specific model in comparison with existing tools, we constructed three other datasets, named HEK293_data, HEK293T_data and HepG2_data, respectively, from the recent literature created by Song et al. [59]. Among them, HEK293_data and HEK293T_data originated from human kidney, while HepG2_data is from human liver. We eliminated overlapping sequences from the training data so that the constructed data had not been seen in the training data. The ratios of positive and negative samples were all set to 1:1. After removing sequence redundancy, the sizes of such datasets were as follows: 8440 for HEK293_data, 29740 for HEK293T_data, and 8574 for HepG2_data. These datasets can be downloaded from http://47.94.248.117/im6APred/download (accessed on 3 December 2022).
The results on the independent test are listed in Table 4, and the corresponding ROC curves are illustrated in Figure 5. It can be seen that m6A-TS-CNN achieved better generalization ability than the other two existing predictors on these three independent testing datasets, while im6APred showed the best predictive performance among all the tissue-specific models. More specifically, the ACC, MCC and AUROC for the im6APred model outperformed TS-m6A-DL on the independent test dataset HEK293_data by 3.91%, 7.58% and 2.36%, respectively. The performance of the im6APred model on HEK293_data was improved compared to m6A-TS-CNN by 2.04%, 3.98% and 1.96% in terms of ACC, MCC and AUROC, respectively. Moreover, those values of ACC, MCC and AUROC were increased compared to iRNA-m6A by 3.56%, 7.15% and 1.14%, respectively. In addition, the ACC, MCC and AUROC for the im6APred model outperformed m6A-TS-CNN on the independent test dataset HEK293T_data by 3.58%, 6.81% and 3.52%, respectively. The performance of the im6APred model on HepG2_data was improved compared to m6A-TS-CNN by 1.91%, 3.7% and 3.01% in terms of ACC, MCC and AUROC, respectively.
Table 4.
Independent Test Dataset | Tissues | Predictors | Evaluation Index | ||||
---|---|---|---|---|---|---|---|
ACC (%) | Sn (%) | Sp (%) | MCC | AUROC | |||
HEK293_data | Kidney | iRNA-m6A | 77.80 | 78.46 | 77.13 | 0.5560 | 0.8470 |
m6A-TS-CNN | 79.32 | 75.99 | 82.65 | 0.5877 | 0.8388 | ||
TS-m6A-DL | 77.45 | 72.55 | 82.34 | 0.5517 | 0.8348 | ||
im6APred | 81.36 | 82.93 | 79.78 | 0.6275 | 0.8584 | ||
HEK293T_data | Kidney | iRNA-m6A | 70.84 | 63.23 | 78.45 | 0.4217 | 0.7751 |
m6A-TS-CNN | 74.80 | 68.41 | 81.18 | 0.5001 | 0.7753 | ||
TS-m6A-DL | 72.66 | 62.62 | 82.70 | 0.4627 | 0.7528 | ||
im6APred | 78.38 | 76.03 | 80.73 | 0.5682 | 0.8105 | ||
HepG2_data | Liver | iRNA-m6A | 74.80 | 73.20 | 76.39 | 0.4962 | 0.8032 |
m6A-TS-CNN | 77.01 | 72.98 | 81.03 | 0.5419 | 0.7812 | ||
TS-m6A-DL | 76.21 | 75.95 | 76.48 | 0.5243 | 0.7822 | ||
im6APred | 78.92 | 76.97 | 80.87 | 0.5789 | 0.8113 |
2.3. Cross-Species/Tissues Validation
Since the benchmark datasets consist of various tissues of multiple species, it is necessary to further investigate the prediction capability of the proposed model on cross-species/cross-tissues data. Here, we implemented cross-species/cross-tissues validation experiments, in which the eleven tissue-specific models were trained on the training data of each tissue from three species, respectively. Subsequently, the remaining different tissue data were considered as independent test data to assess the performance of each tissue-specific model.
A heat map generated by the AUROCs is shown in Figure 6 to describe the prediction performance of tissue-specific models. Results in Figure 6 show that almost all cross-tissue prediction performance is accepted as all the AUROCs are higher than 0.8 in this heat map. Especially, the tissue-specific models trained on the datasets of human (liver and kidney), mouse (brain and kidney) and rat (brain, liver and kidney), respectively, have achieved superior results (AUROCs > 0.85), indicating that potential m6A sites in the sequences of different tissue can be accurately identified using these tissue-specific models. However, there is tissue specificity between human (brain) and mouse (liver, heart and testis), as confirmed by Dao et al. [46]. When testing the tissue-specific models trained on the human (brain) and mouse (liver), almost all the AUROC values produced were below 0.82, demonstrating that tissue-specific predictors are helpful for better detection of m6A sites from a specific tissue.
2.4. Weberver
To facilitate community-wide efforts in quick prediction of novel potential RNA m6A modification sites from RNA sequence data, we developed a user-friendly webserver of im6APred, which is publicly accessible at http://47.94.248.117/im6APred (accessed on 3 December 2022). The webserver im6APred can realize the prediction function for m6A modification sites at the species level or at the tissue level of species after specifying the tissue and species type in the drop-down menu simultaneously and inputting their sequences of interest or uploading an input sequence file in the FASTA format. Upon completion, the final prediction results will be displayed on the webpage or sent to the e-mail address users specified. A detailed user guide about how to use the webserver of im6APred can be found on the help page. In addition, all the data and the source code used in this study are provided on this webserver, which facilitates a further in-depth analysis of m6A modification.
3. Materials and Methods
3.1. Benchmark Dataset
Comprehensive interrogating of m6A at individual nucleotide resolution plays a crucial role in revealing the biological importance of this RNA modification. Most importantly, high-throughput m6A sequencing data with a single-nucleotide level is a pivotal resource as high-quality data for in silico identification of m6A modification sites accurately. However, most of the current computational methods can only predict m6A sites on a sample level. Recently, Zhang et al. [19] reported an antibody-independent enzymatic method termed m6A-REF-seq and identified a great number of m6A modification sites with the RNA ACA motif in different tissues of humans, mice and rats at a single-nucleotide level. These data can pave the way for the development of such computational predictors for predicting m6A sites in multiple tissues of mammals. According to the genome of humans (hg19), mice (mm10) and rats (rn6) downloaded from the University of California, Santa Cruz (UCSC), we employed a sliding window method with 41 nt length to construct the benchmark datasets based on such m6A site information of each species, including chromosome type, site position and strand. The sequence fragment is regarded as a positive sample if its center adenine in ACA motif is methylated. Otherwise, it is regarded as a negative sample if its center adenine in ACA motif is un-methylated. The final datasets were generated after removing the sequences with more than 80% sequence similarity by using CD-HIT software (Version 4.6.8, Burnham Institute for Medical Research, La Jolla, CA, USA) [60]. In this study, we first divided the benchmark dataset for specific tissue of each species into two equal parts randomly, one as a training set and the another as an independent testing set (Table 5), and then separately performed the analysis. In the process of model training, 5-fold cross-validation was implemented on the training set to optimize the parameters. Under the optimal parameter combination, all the training samples for specific tissue of each species were used to train the best model, respectively, and then the performance of the constructed model was evaluated using the corresponding independent testing set. The detailed procedure of benchmark data construction also can be seen in Dao et al. [46]. Additionally, the corresponding code for constructing the benchmark dataset and the exact data have been released on the Github https://github.com/pythonLzt/im6APred (accessed on 3 December 2022) to allow better reproducibility of the research.
Table 5.
Species | Tissues | Positive | Negative | ||
---|---|---|---|---|---|
Training | Testing | Training | Testing | ||
Human | Brain | 4605 | 4604 | 4605 | 4604 |
Liver | 2634 | 2634 | 2634 | 2634 | |
Kidney | 4574 | 4573 | 4574 | 4573 | |
Mouse | Brain | 8025 | 8025 | 8025 | 8025 |
Liver | 4133 | 4133 | 4133 | 4133 | |
Kidney | 3953 | 3952 | 3953 | 3952 | |
Heart | 2201 | 2200 | 2201 | 2200 | |
Testis | 4707 | 4706 | 4707 | 4706 | |
Rat | Brain | 2352 | 2351 | 2352 | 2351 |
Liver | 1762 | 1762 | 1762 | 1762 | |
Kidney | 3433 | 3432 | 3433 | 3432 |
3.2. Sample Formulation
Better feature representations can be conducive to distinguishing m6A sites from non-m6A sites more accurately, thereby improving the prediction performance of the proposed model. In the current study, we integrated two kinds of feature representations to represent the sample sequences.
3.2.1. Nucleotide Chemical Property
Nucleotide chemical property is an efficient and effective sequence encoding method, playing an important part in the prediction of other modification sites, such as 4 mC [61] and D sites [62]. The RNA sequences can be encoding by following the steps below.
Firstly, a three-dimensional vector (), in which its components represent the ring structure, the hydrogen bond and the chemical functionality of the four nucleotides, respectively, is used to encode a nucleotide in RNA sequence, as shown in Equation (1). Thus, A, C, G and U can be encoded by (1, 1, 1), (0, 0, 1), (1, 0, 0) and (0, 1, 0), respectively.
(1) |
where represents the -th nucleotide in the RNA sample sequence.
Next, the density of nucleotide is used to represent this nucleotide by calculating the accumulated frequency of nucleotides along the RNA sequence, as formulated by Equation (2).
(2) |
where , is the RNA sequence length.
The combination of nucleotide chemical property and nucleotide density can quantitatively represent the given RNA sample sequence to the most extent. Following this method, each RNA sample sequence with 41 nt length could be converted into a 4 × 41-dimensional vector.
3.2.2. Mono-Nucleotide Binary Encoding
Mono-nucleotide binary encoding is another simple and efficient method to encode sample sequences, widely used in representing nucleotide sequences and transforming them into numeric vectors [63,64,65,66]. Generally, A, C, G and U can be transferred to a 4-dimensional vector: (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1), respectively.
As these encoding methods only consider the nucleotides along either sense strand or non-sense strand, a lot of sequence order information could be lost. Here, to capture the long- and local-range sequence order information of RNA sequences as much as possible, these two encoding methods were employed to encode both sense and non-sense strands simultaneously. Thus, the given RNA sample sequence with 41 nt length in this study could be transferred into a 16 × 41-dimensional vector finally.
3.3. Classification Method
An ensemble across multiple models or within a single model with appropriate ensemble strategies could achieve complementary learning of the training data, thereby greatly improving model reliability, accuracy and efficiency. Moreover, convolutional neural network (CNN) is a multi-layer neural network framework, which has been widely applied in many bioinformatics fields and has made significant progress [67,68,69,70,71,72,73,74,75,76]. Thus, in the current study, we adopted a common ensemble method based on model perturbation, training five CNN base classifiers with different hyper-parameters and considering a simple average method as an ensemble strategy to classify (Figure 2).
Generally, a CNN is mainly composed of three parts: convolution layers, pooling layers and fully connected layers. A set of filters of the convolution layer slide over the output of the previous layer to extract high-level features and generate a multiple feature map. The pooling layer following each convolution layer is used for extracting dominant features from the feature map. Finally, the fully connected layer, which employs a dropout strategy for mitigating potential overfitting, maps the learned features to the label space for classifying and predicting. It is well-known that the filter size and the number of convolution kernels are main determinants of the performance of a CNN framework. Here, we denote CNN base classifiers with x convolution layer(s), y convolution kernel(s), filter size of z and drop out probability of w. In the training process, we select the best five models as base classifiers from different settings of parameters by five-fold cross-validation. In detail, we took x from {1, 2, 3, 4}, y from {4, 8, 16, 32, 64}, z from {2, 4} and w from {0.35, 0.4, 0.45, 0.5, 0.55}. For all components of the model, rectified linear-unit (ReLU) [77] was used as the activation for improving the computational efficiency and retaining the gradient. Moreover, the Adam optimizer was employed to minimize the loss function through updating network weights and learning rate based on calculated gradients [78].
3.4. Evaluation Metrics
Generally, for objectively evaluating the performance of proposed models, the popular metrics including sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC) [62,79,80,81,82,83,84,85] were used, as shown in Equation (3).
(3) |
where TP stands for the number of true m6A modification sites correctly predicted; TN, the number of true non-m6A modification sites correctly predicted; FP, the number of non-m6A modification sites incorrectly predicted as m6A modification sites; and FN, the number of m6A modification sites incorrectly predicted as non-m6A modification sites.
Furthermore, the area under the receiver operating characteristic curve or AUROC [78,86,87] is able to visually display the prediction performance of the proposed models. The value of AUROC ranges from 0 and 1. The larger the value is, the better the prediction performance is.
4. Conclusions
As an important RNA transcriptional modification, m6A is of vital importance in mRNA translation, mRNA stability, directional differentiation of hematopoietic stem cells and spermatogenesis. Therefore, the ability to accurately identify m6A modification sites in a genome at the single-nucleotide level would have profound effects on revealing its regulatory mechanism and assisting drug development. In the current study, we collected high confidence m6A modification sites identified by m6A-REF-seq from different tissues of three types of mammals, respectively, and constructed 11 tissue-specific models using ensemble deep learning. Our predictors could produce AUROC values in the range of 0.82–0.91 on the corresponding independent testing datasets. Compared with the existing predictors in this field, the AUROC values are improved to a certain degree, indicating that our models have the ability to learn general methylation rules on RNA bases and generalize to m6A transcriptome-wide identification. During cross-species/tissues validation, when testing the tissue-specific models trained on the human (brain) and mouse (liver), almost all the achieved AUROC values were less than 0.82, demonstrating that constructing tissue-specific predictors is helpful for better detection of m6A sites.
Despite the recent progress on m6A site identification in multiple tissues of mammals, there are several limitations outstanding: (1) These predictors for identifying m6A sites in multiple tissues of mammals were trained on such m6A modification data with a specific ACA motif, causing them difficulty in accurately identifying those m6A sites in another motif, such as canonical m6A motif DRACH (D = A, G or U; R = A or G; H = C or U). (2) Current computational methods for identifying m6A sites in multiple tissues of mammals are limited by algorithmic constraints because these methods were built upon classical machine learning or deep learning methods, such as SVM, CNN, or ensemble deep learning. It is necessary to adopt novel methods to improve the prediction performance of the model in the future. Thus, how to design a prediction method that can comprehensively and accurately predict m6A sites across various tissues of mammals is still a challenge.
To this end, we provide several insights for future directions of m6A site identification. Firstly, to identify potential m6A sites across various tissues of mammals efficiently, high-confidence m6A sites that have been experimentally annotated as methylated sites with the canonical m6A motif DRACH in tissues of mammals, and those sites that are clearly validated by experiment yet fail to be methylated, are the ideal data source for constructing accurate prediction models. Secondly, the use of graph neural network is also a promising direction in improving the performance of m6A modification site prediction. A given sequence can be transformed into a directed pattern graph, where each vertex could be represented by a k-mer of the given sequence. An edge connects two vertices if the joint subsequence of corresponding k-mers appear at least once in this sequence. Our expectation is that within the next few years, m6A modification sites will be able to be predicted with a high degree of accuracy, purely based on their sequence.
Acknowledgments
The authors are also grateful to Weizhong Lin for the extensive revision of written English. Moreover, the authors thank Guangfu Xue and Yideng Cai for the helpful discussions.
Supplementary Materials
The following are available online at https://www.mdpi.com/article/10.3390/ijms232415490/s1.
Author Contributions
Conceptualization, methodology, writing—original draft preparation, writing—review and editing, supervision, project administration, Z.X. and X.X.; software, visualization, Z.L. and L.L.; validation, Z.L.; formal analysis, investigation, W.Q.; data curation, L.L.; resources, funding acquisition, Z.X., X.X. and W.Q. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The training and independent testing used in this study can be downloaded from http://47.94.248.117/im6APred/download (accessed on 3 December 2022) or https://github.com/pythonLzt/im6APred (accessed on 3 December 2022).
Conflicts of Interest
The authors declare no conflict of interest.
Funding Statement
This research was funded by the National Nature Scientific Foundation of China, grant number 62062043, 32270789, 31860312, and 62162032.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Barbieri I., Kouzarides T. Role of RNA modifications in cancer. Nat. Rev. Cancer. 2020;20:303–322. doi: 10.1038/s41568-020-0253-2. [DOI] [PubMed] [Google Scholar]
- 2.Machnicka M.A., Milanowska K., Oglou O.O., Purta E., Kurkowska M., Olchowik A., Januszewski W., Kalinowski S., Dunin-Horkawicz S., Rother K.M., et al. MODOMICS: A database of RNA modification pathways-2013 update. Nucleic Acids Res. 2013;41:D262–D267. doi: 10.1093/nar/gks1007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Meyer K.D., Jaffrey S.R. The dynamic epitranscriptome: N-6-methyladenosine and gene expression control. Nat. Rev. Mol. Cell Biol. 2014;15:313–326. doi: 10.1038/nrm3785. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Frye M., Harada B.T., Behm M., He C. RNA modifications modulate gene expression during development. Science. 2018;361:1346–1349. doi: 10.1126/science.aau1646. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Roundtree I.A., Evans M.E., Pan T., He C. Dynamic RNA Modifications in Gene Expression Regulation. Cell. 2017;169:1187–1200. doi: 10.1016/j.cell.2017.05.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Xiang Y., Laurent B., Hsu C.-H., Nachtergaele S., Lu Z., Sheng W., Xu C., Chen H., Ouyang J., Wang S., et al. RNA m(6)A methylation regulates the ultraviolet-induced DNA damage response. Nature. 2017;543:573–576. doi: 10.1038/nature21671. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Xu K., Yang Y., Feng G.-H., Sun B.-F., Chen J.-Q., Li Y.-F., Chen Y.-S., Zhang X.-X., Wang C.-X., Jiang L.-Y., et al. Mettl3-mediated m(6)A regulates spermatogonial differentiation and meiosis initiation. Cell Res. 2017;27:1100–1114. doi: 10.1038/cr.2017.100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Zhou J., Wan J., Gao X., Zhang X., Jaffrey S.R., Qian S.-B. Dynamic m(6)A mRNA methylation directs translational control of heat shock response. Nature. 2015;526:591-U332. doi: 10.1038/nature15377. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Han D., Liu J., Chen C., Dong L., Liu Y., Chang R., Huang X., Liu Y., Wang J., Dougherty U., et al. Anti-tumour immunity controlled through mRNA m(6)A methylation and YTHDF1 in dendritic cells. Nature. 2019;566:270–274. doi: 10.1038/s41586-019-0916-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Patil D.P., Chen C.-K., Pickering B.F., Chow A., Jackson C., Guttman M., Jaffrey S.R. m(6)A RNA methylation promotes XIST-mediated transcriptional repression. Nature. 2016;537:369–373. doi: 10.1038/nature19342. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Zhang Z., Wang M., Xie D., Huang Z., Zhang L., Yang Y., Ma D., Li W., Zhou Q., Yang Y.-G., et al. METTL3-mediated N-6-methyladenosine mRNA modification enhances long-term memory consolidation. Cell Res. 2018;28:1050–1061. doi: 10.1038/s41422-018-0092-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Zhong X., Yu J., Frazier K., Weng X., Li Y., Cham C.M., Dolan K., Zhu X., Hubert N., Tao Y., et al. Circadian Clock Regulation of Hepatic Lipid Metabolism by Modulation of m(6)A mRNA Methylation. Cell Rep. 2018;25:1816–1828. doi: 10.1016/j.celrep.2018.10.068. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Jaffrey S.R., Kharas M.G. Emerging links between m(6)A and misregulated mRNA methylation in cancer. Genome Med. 2017;9:2. doi: 10.1186/s13073-016-0395-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Meyer K.D., Saletore Y., Zumbo P., Elemento O., Mason C.E., Jaffrey S.R. Comprehensive Analysis of mRNA Methylation Reveals Enrichment in 3′ UTRs and near Stop Codons. Cell. 2012;149:1635–1646. doi: 10.1016/j.cell.2012.05.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Dominissini D., Moshitch-Moshkovitz S., Schwartz S., Salmon-Divon M., Ungar L., Osenberg S., Cesarkas K., Jacob-Hirsch J., Amariglio N., Kupiec M., et al. Topology of the human and mouse m(6)A RNA methylomes revealed by m(6)A-seq. Nature. 2012;485:201-U84. doi: 10.1038/nature11112. [DOI] [PubMed] [Google Scholar]
- 16.Chen K., Lu Z.K., Wang X., Fu Y., Luo G.Z., Liu N., Han D., Dominissini D., Dai Q., Pan T., et al. High-Resolution N-6-Methyladenosine (m(6)A) Map Using Photo-Crosslinking-Assisted m(6)A Sequencing. Angew. Chem.-Int. Ed. 2015;54:1587–1590. doi: 10.1002/anie.201410647. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Linder B., Grozhik A., Olarerin-George A., Meydan C., Mason C., Jaffrey S. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat. Methods. 2015;12:767–772. doi: 10.1038/nmeth.3453. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ule J., Jensen K., Ruggiu M., Mele A., Ule A., Darnell R. CLIP identifies Nova-regulated RNA networks in the brain. Science. 2003;302:1212–1215. doi: 10.1126/science.1090095. [DOI] [PubMed] [Google Scholar]
- 19.Zhang Z., Chen L.Q., Zhao Y.L., Yang C.G., Roundtree L., Zhang Z.J., Ren J., Xie W., He C., Luo G.Z. Single-base mapping of m(6)A by an antibody-independent method. Sci. Adv. 2019;5:eaax0250. doi: 10.1126/sciadv.aax0250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Meyer K.D. DART-seq: An antibody-free method for global m(6)A detection. Nat. Methods. 2019;16:1275–1280. doi: 10.1038/s41592-019-0570-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Chen K.Q., Song B.W., Tang Y.J., Wei Z., Xu Q.R., Su J.L., Magalhães J.P., Rigden D.J., Meng J. RMDisease: A database of genetic variants that affect RNA modifications, with implications for epitranscriptome pathogenesis. Nucleic Acids Res. 2021;49:D1396–D1404. doi: 10.1093/nar/gkaa790. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Luo X.T., Li H.Q., Liang J.Q., Zhao Q., Xie Y.B., Ren J., Zuo Z.X. RMVar: An updated database of functional variants involved in RNA modifications. Nucleic Acids Res. 2021;49:D1405–D1412. doi: 10.1093/nar/gkaa811. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Zheng Y.Y., Nie P., Peng D., He Z.H., Liu M.N., Xie Y.B., Miao Y.Y., Zuo Z.X., Ren J. m6AVar: A database of functional variants involved in m(6)A modification. Nucleic Acids Res. 2018;46:D139–D145. doi: 10.1093/nar/gkx895. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Chen W., Feng P.M., Ding H., Lin H., Chou K.C. iRNA-Methyl: Identifying N-6-methyladenosine sites using pseudo nucleotide composition. Anal. Biochem. 2015;490:26–33. doi: 10.1016/j.ab.2015.08.021. [DOI] [PubMed] [Google Scholar]
- 25.Chen W., Feng P.M., Ding H., Lin H. Identifying N (6)-methyladenosine sites in the Arabidopsis thaliana transcriptome. Mol. Genet. Genom. 2016;291:2225–2229. doi: 10.1007/s00438-016-1243-7. [DOI] [PubMed] [Google Scholar]
- 26.Chen W., Tang H., Lin H. MethyRNA: A web server for identification of N-6-methyladenosine sites. J. Biomol. Struct. Dyn. 2017;35:683–687. doi: 10.1080/07391102.2016.1157761. [DOI] [PubMed] [Google Scholar]
- 27.Feng P.M., Ding H., Yang H., Chen W., Lin H., Chou K.C. iRNA-PseColl: Identifying the Occurrence Sites of Different RNA Modifications by Incorporating Collective Effects of Nucleotides into PseKNC. Mol. Ther.-Nucleic Acids. 2017;7:155–163. doi: 10.1016/j.omtn.2017.03.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Wei L.Y., Su R., Wang B., Li X.T., Zou Q., Gao X. Integration of deep feature representations and handcrafted features to improve the prediction of N-6-methyladenosine sites. Neurocomputing. 2019;324:3–9. doi: 10.1016/j.neucom.2018.04.082. [DOI] [Google Scholar]
- 29.Chen W., Ding H., Zhou X., Lin H., Chou K.C. iRNA(m6A)-PseDNC: Identifying N-6-methyladenosine sites using pseudo dinucleotide composition. Anal. Biochem. 2018;561:59–65. doi: 10.1016/j.ab.2018.09.002. [DOI] [PubMed] [Google Scholar]
- 30.Chen W., Xing P., Zou Q. Detecting N-6-methyladenosine sites from RNA transcriptomes using ensemble Support Vector Machines. Sci. Rep. 2017;7:40242. doi: 10.1038/srep40242. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Xing P.W., Su R., Guo F., Wei L.Y. Identifying N-6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine. Sci. Rep. 2017;7:46757. doi: 10.1038/srep46757. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Xiang S.N., Liu K., Yan Z.M., Zhang Y.O., Sun Z.R. RNAMethPre: A Web Server for the Prediction and Query of mRNA m(6)A Sites. PLoS ONE. 2016;11:e0162707. doi: 10.1371/journal.pone.0162707. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Chen W., Feng P.M., Yang H., Ding H., Lin H., Chou K.C. iRNA-3typeA: Identifying Three Types of Modification at RNA’s Adenosine Sites. Mol. Ther.-Nucleic Acids. 2018;11:468–474. doi: 10.1016/j.omtn.2018.03.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Wei L., Chen H., Su R. M6APred-EL: A Sequence-Based Predictor for Identifying N6-methyladenosine Sites Using Ensemble Learning. Mol. Ther.-Nucleic Acids. 2018;12:635–644. doi: 10.1016/j.omtn.2018.07.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Liu Z., Xiao X., Yu D.J., Jia J.H., Qiu W.R., Chou K.C. pRNAm-PC: Predicting N-6-methyladenosine sites in RNA sequences via physical-chemical properties. Anal. Biochem. 2016;497:60–67. doi: 10.1016/j.ab.2015.12.017. [DOI] [PubMed] [Google Scholar]
- 36.Li G.Q., Liu Z., Shen H.B., Yu D.J. TargetM6A: Identifying N-6-Methyladenosine Sites From RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine. IEEE Trans. Nanobioscience. 2016;15:674–682. doi: 10.1109/TNB.2016.2599115. [DOI] [PubMed] [Google Scholar]
- 37.Xiang S.N., Yan Z.M., Liu K., Zhang Y., Sun Z.R. AthMethPre: A web server for the prediction and query of mRNA m(6)A sites in Arabidopsis thaliana. Mol. Biosyst. 2016;12:3333–3337. doi: 10.1039/C6MB00536E. [DOI] [PubMed] [Google Scholar]
- 38.Akbar S., Hayat M. iMethyl-STTNC: Identification of N-6-methyladenosine sites by extending the idea of SAAC into Chou’s PseAAC to formulate RNA sequences. J. Theor. Biol. 2018;455:205–211. doi: 10.1016/j.jtbi.2018.07.018. [DOI] [PubMed] [Google Scholar]
- 39.Zhou Y., Zeng P., Li Y.H., Zhang Z.D., Cui Q.H. SRAMP: Prediction of mammalian N-6-methyladenosine (m(6)A) sites based on sequence-derived features. Nucleic Acids Res. 2016;44:e91. doi: 10.1093/nar/gkw104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Zhao Z.X., Peng H., Lan C.W., Zheng Y., Fang L., Li J.Y. Imbalance learning for the prediction of N-6-Methylation sites in mRNAs. Bmc Genom. 2018;19:574. doi: 10.1186/s12864-018-4928-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Xiong Y.P., He X., Zhao D., Tian T.Z., Hong L.X., Jiang T., Zeng J.Y. Modeling multi-species RNA modification through multi-task curriculum learning. Nucleic Acids Res. 2021;49:3719–3734. doi: 10.1093/nar/gkab124. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Song Z.T., Huang D.Y., Song B.W., Chen K.Q., Song Y.Y., Liu G., Su J.L., Magalhães J.P., Rigden D., Meng J. Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications. Nat. Commun. 2021;12:4011. doi: 10.1038/s41467-021-24313-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Jia C.Z., Zhang J.J., Gu W.Z. RNA-MethylPred: A high-accuracy predictor to identify N6-methyladenosine in RNA. Anal. Biochem. 2016;510:72–75. doi: 10.1016/j.ab.2016.06.012. [DOI] [PubMed] [Google Scholar]
- 44.Chen J., Zou Q., Li J. DeepM6ASeq-EL: Prediction of human N6-methyladenosine (m(6)A) sites with LSTM and ensemble learning. Front. Comput. Sci. 2022;16:162302. doi: 10.1007/s11704-020-0180-0. [DOI] [Google Scholar]
- 45.Zhang Y., Hamada M. DeepM6ASeq: Prediction and characterization of m6A-containing sequences using deep learning. BMC Bioinform. 2018;19((Suppl. 19)):524. doi: 10.1186/s12859-018-2516-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Dao F.Y., Lv H., Yang Y.H., Zulfiqar H., Gao H., Lin H. Computational identification of N6-methyladenosine sites in multiple tissues of mammals. Comput. Struct. Biotechnol. J. 2020;18:1084–1091. doi: 10.1016/j.csbj.2020.04.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Liu K., Cao L., Du P.F., Chen W. im6A-TS-CNN: Identifying the N(6)-Methyladenine Site in Multiple Tissues by Using the Convolutional Neural Network. Mol. Nucleic Acids. 2020;21:1044–1049. doi: 10.1016/j.omtn.2020.07.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Abbas Z., Tayara H., Zou Q., Chong K.T. TS-m6A-DL: Tissue-specific identification of N6-methyladenosine sites using a universal deep learning model. Comput. Struct. Biotechnol. J. 2021;19:4619–4625. doi: 10.1016/j.csbj.2021.08.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Liang M.X., Li Z.Z., Chen T., Zeng J.Y. Integrative Data Analysis of Multi-Platform Cancer Data with a Multimodal Deep Learning Approach. IEEE-Acm Trans. Comput. Biol. Bioinform. 2015;12:928–937. doi: 10.1109/TCBB.2014.2377729. [DOI] [PubMed] [Google Scholar]
- 50.Sharifi-Noghabi H., Zolotareva O., Collins C., Ester M. MOLI: Multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics. 2019;35:I501–I509. doi: 10.1093/bioinformatics/btz318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Poirion O., Jing Z., Chaudhary K., Huang S.J., Garmire L. DeepProg: An ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data. Genome Med. 2021;13:112. doi: 10.1186/s13073-021-00930-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Hu Y., Wang Z.Q., Hu H.L., Wan F.P., Chen L., Xiong Y.P., Wang X.X., Zhao D., Huang W.R., Zeng J.Y. ACME: Pan-specific peptide-MHC class I binding prediction through attention-based deep neural networks. Bioinformatics. 2019;35:4946–4954. doi: 10.1093/bioinformatics/btz427. [DOI] [PubMed] [Google Scholar]
- 53.Hu S.S., Zhang C.L., Chen P., Gu P.Y., Zhang J., Wang B. Predicting drug-target interactions from drug structure and protein sequence using novel convolutional neural networks. BMC Bioinform. 2019;20((Suppl. 25)):689. doi: 10.1186/s12859-019-3263-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Karimi M., Wu D., Wang Z.Y., Shen Y. DeepAffinity: Interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks. Bioinformatics. 2019;35:3329–3338. doi: 10.1093/bioinformatics/btz111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Singh J., Hanson J., Paliwal K., Zhou Y.Q. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat. Commun. 2019;10:5407. doi: 10.1038/s41467-019-13395-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Zhang Y.Q., Qiao S.J., Ji S.J., Zhou J.L. ENSEMBLE-CNN: Predicting DNA Binding Sites in Protein Sequences by an Ensemble Deep Learning Method. Intell. Comput. Theor. Appl. Pt II. 2018;10955:301–306. [Google Scholar]
- 58.Predicting N6-Methyladenosine Sites in Multiple Tissues of Mammals through Ensemble Deep Learning. [(accessed on 3 December 2022)]. Available online: http://47.94.248.117/im6APred. [DOI] [PMC free article] [PubMed]
- 59.Song B.W., Chen K.Q., Tang Y.J., Wei Z., Su J.L., Magalhães J.P., Rigden D., Meng J. ConsRM: Collection and large-scale prediction of the evolutionarily conserved RNA methylation sites, with implications for the functional epitranscriptome. Brief. Bioinform. 2021;22:bbab088. doi: 10.1093/bib/bbab088. [DOI] [PubMed] [Google Scholar]
- 60.Li W., Godzik A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. doi: 10.1093/bioinformatics/btl158. [DOI] [PubMed] [Google Scholar]
- 61.Chen W., Yang H., Feng P.M., Ding H., Lin H. iDNA4mC: Identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics. 2017;33:3518–3523. doi: 10.1093/bioinformatics/btx479. [DOI] [PubMed] [Google Scholar]
- 62.Xu Z.C., Feng P.M., Yang H., Qiu W.R., Chen W., Lin H. iRNAD: A computational tool for identifying D modification sites in RNA sequence. Bioinformatics. 2019;35:4922–4929. doi: 10.1093/bioinformatics/btz358. [DOI] [PubMed] [Google Scholar]
- 63.Yao Y., Zhang S., Liang Y. iORI-ENST: Identifying origin of replication sites based on elastic net and stacking learning. SAR QSAR Environ. Res. 2021;32:317–331. doi: 10.1080/1062936X.2021.1895884. [DOI] [PubMed] [Google Scholar]
- 64.Lv H., Dao F.Y., Zhang D., Guan Z.X., Yang H., Su W., Liu M.L., Ding H., Chen W., Lin H. iDNA-MS: An Integrated Computational Tool for Detecting DNA Modification Sites in Multiple Genomes. iScience. 2020;23:100991. doi: 10.1016/j.isci.2020.100991. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Hasan M.M., Manavalan B., Shoombuatong W., Khatun M.S., Kurata H. i6mA-Fuse: Improved and robust prediction of DNA 6mA sites in the Rosaceae genome by fusing multiple feature representation. Plant Mol. Biol. 2020;103:225–234. doi: 10.1007/s11103-020-00988-y. [DOI] [PubMed] [Google Scholar]
- 66.Lv H., Dao F.Y., Guan Z.X., Zhang D., Tan J.X., Zhang Y., Chen W., Lin H. iDNA6mA-Rice: A Computational Tool for Detecting N6-Methyladenine Sites in Rice. Front. Genet. 2019;10:793. doi: 10.3389/fgene.2019.00793. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Wang L., You Z.H., Huang Y.A., Huang D.S., Chan K.C. An efficient approach based on multi-sources information to predict circRNA-disease associations using deep convolutional neural network. Bioinformatics. 2020;36:4038–4046. doi: 10.1093/bioinformatics/btz825. [DOI] [PubMed] [Google Scholar]
- 68.Xia C.-Q., Pan X., Shen H.-B. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data. Bioinformatics. 2020;36:3018–3027. doi: 10.1093/bioinformatics/btaa110. [DOI] [PubMed] [Google Scholar]
- 69.Pan X., Shen H.-B. Predicting RNA-protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics. 2018;34:3427–3436. doi: 10.1093/bioinformatics/bty364. [DOI] [PubMed] [Google Scholar]
- 70.Cao Z., Zhang S. Simple tricks of convolutional neural network architectures improve DNA-protein binding prediction. Bioinformatics. 2019;35:1837–1843. doi: 10.1093/bioinformatics/bty893. [DOI] [PubMed] [Google Scholar]
- 71.Kulmanov M., Hoehndorf R. DeepGOPlus: Improved protein function prediction from sequence. Bioinformatics. 2020;36:422–429. doi: 10.1093/bioinformatics/btz595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Öztürk H., Özgür A., Ozkirimli E. DeepDTA: Deep drug-target binding affinity prediction. Bioinformatics. 2018;34:i821–i829. doi: 10.1093/bioinformatics/bty593. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Tsubaki M., Tomii K., Sese J. Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics. 2019;35:309–318. doi: 10.1093/bioinformatics/bty535. [DOI] [PubMed] [Google Scholar]
- 74.Peng J.J., Hui W.W., Li Q.Q., Chen B.L., Hao J.Y., Jiang Q.H., Shang X.Q., Wei Z.Y. A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics. 2019;35:4364–4371. doi: 10.1093/bioinformatics/btz254. [DOI] [PubMed] [Google Scholar]
- 75.Aoki G., Sakakibara Y. Convolutional neural networks for classification of alignments of non-coding RNA sequences. Bioinformatics. 2018;34:i237–i244. doi: 10.1093/bioinformatics/bty228. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Tahir M., Tayara H., Chong K.T. iPseU-CNN: Identifying RNA Pseudouridine Sites Using Convolutional Neural Networks. Mol. Nucleic Acids. 2019;16:463–470. doi: 10.1016/j.omtn.2019.03.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Eckle K., Schmidt-Hieber J. A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural. Netw. 2019;110:232–242. doi: 10.1016/j.neunet.2018.11.005. [DOI] [PubMed] [Google Scholar]
- 78.Xu Z.C., Luo M., Lin W.Z., Xue G.F., Wang P.P., Jin X.Y., Xu C., Zhou W.Y., Cai Y.D., Yang W.Y., et al. DLpTCR: An ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor. Brief. Bioinform. 2021;22:bbab335. doi: 10.1093/bib/bbab335. [DOI] [PubMed] [Google Scholar]
- 79.Jiang Q., Wang G.H., Jin S.L., Li Y., Wang Y.D. Predicting human microRNA-disease associations based on support vector machine. Int. J. Data Min. Bioinform. 2013;8:282–293. doi: 10.1504/IJDMB.2013.056078. [DOI] [PubMed] [Google Scholar]
- 80.Zhang D., Xu Z.C., Su W., Yang Y.H., Lv H., Yang H., Lin H. iCarPS: A computational tool for identifying protein carbonylation sites by novel encoded features. Bioinformatics. 2021;37:171–177. doi: 10.1093/bioinformatics/btaa702. [DOI] [PubMed] [Google Scholar]
- 81.Zhang L., Xiao X., Xu Z.-C. iPromoter-5mC: A Novel Fusion Decision Predictor for the Identification of 5-Methylcytosine Sites in Genome-Wide DNA Promoters. Front. Cell Dev. Biol. 2020;8:614. doi: 10.3389/fcell.2020.00614. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Xiao X., Xu Z.C., Qiu W.R., Wang P., Ge H.T., Chou K.C. iPSW(2L)-PseKNC: A two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics. 2019;111:1785–1793. doi: 10.1016/j.ygeno.2018.12.001. [DOI] [PubMed] [Google Scholar]
- 83.Xu Z.C., Wang P., Qiu W.R., Xiao X. iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder. Sci. Rep. 2017;7:8222. doi: 10.1038/s41598-017-08523-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Xu Z.C., Qiu W.R., Xiao X. iRSpotH-TNCPseAAC: Identifying Recombination Spots in Human by Using Pseudo Trinucleotide Composition With an Ensemble of Support Vector Machine Classifiers. Lett. Org. Chem. 2017;14:703–713. doi: 10.2174/1570178614666170608125909. [DOI] [Google Scholar]
- 85.Xu Z.C., Jiang S.Y., Qiu W.R., Liu Y.C., Xiao X. iDHSs-PseTNC: Identifying DNase I Hypersensitive Sites with Pseuo Trinucleotide Component by Deep Sparse Auto-encoder. Lett. Org. Chem. 2017;14:655–664. doi: 10.2174/1570178614666170213102455. [DOI] [Google Scholar]
- 86.Huang Y., Zhou D.S., Wang Y.H., Zhang X.D., Su M., Wang C., Sun Z.Y., Jiang Q.H., Sun B.Q., Zhang Y. Prediction of transcription factors binding events based on epigenetic modifications in different human cells. Epigenomics. 2020;12:1443–1456. doi: 10.2217/epi-2019-0321. [DOI] [PubMed] [Google Scholar]
- 87.Su Z.D., Huang Y., Zhang Z.Y., Zhao Y.W., Wang D., Chen W., Chou K.C., Lin H. iLoc-lncRNA: Predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics. 2018;34:4196–4204. doi: 10.1093/bioinformatics/bty508. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The training and independent testing used in this study can be downloaded from http://47.94.248.117/im6APred/download (accessed on 3 December 2022) or https://github.com/pythonLzt/im6APred (accessed on 3 December 2022).