ABSTRACT
As one of the most common post-transcriptional modifications in tRNAs, dihydrouridine (D) has prominent effects on tRNA flexibility and is implicated in cancer. Because the sequencing techniques used to detect D modification are expensive and time-consuming, precise computational tools can greatly advance the study of molecular mechanisms and medical developments. We propose a novel predictor, called iRNAD_XGBoost, to identify potential D sites using multiple RNA sequence representations. In this method, after addressing the class imbalance with the hybrid sampling method SMOTEENN, the XGBoost-selected top 30 features are applied to construct the model. The optimized model showed high Sn and Sp values of 97.13% and 97.38%, respectively, over the jackknife test. In the independent experiment, these two metrics reached 91.67% and 94.74%, respectively. Compared with the iRNAD method, this model exhibited high generalizability and consistent prediction efficiency for positive and negative samples, yielding satisfactory MCC scores of 0.94 and 0.86 on the training and testing sets, respectively. It is inferred that the chemical property and nucleotide density features (CPND), the electron-ion interaction pseudopotentials (EIIP and PseEIIP) and the dinucleotide composition (DNC) are crucial to the recognition of D modification. The proposed predictor is a promising tool to help experimental biologists investigate molecular functions.
KEYWORDS: Dihydrouridine, prediction, imbalanced datasets, feature selection, XGBoost
1. Introduction
Until now, more than 150 types of RNA post-transcriptional modifications have been reported in all kingdoms of life [1–9]; they play significant roles in various biological processes, including gene expression, tRNA recognition, immune response, and metabolic and stress responses. As a highly conserved tRNA modification, dihydrouridine (D, C9H14N2O6) has been found in eukaryotes, bacteria and some archaea [10–13]. As illustrated in Fig. 1, under the catalysis of dihydrouridine synthases (Dus), D is derived by adding two hydrogen atoms to the uridine base at positions 5 and 6, reducing the original carbon-carbon double bond (C5=C6, marked by the green circle in the right panel). This chemical modification is mostly observed in the D-loop of tRNAs (hence its name) [13].
Figure 1.

Formation of dihydrouridine modification. Two hydrogen atoms are added to the endocyclic carbon-carbon double bond at positions 5 and 6 under the catalysis of dihydrouridine synthases (Dus)
Due to the non-aromatic ring structure of the D site, its stacking interactions with other nucleotide bases are largely reduced [14]. Therefore, the non-planar D markedly perturbs the tRNA conformation [15]. Additionally, it has been reported that the D content in psychrophilic organisms is significantly higher than that in mesophiles and thermophiles [16]. Recent studies have also shown that human Dus is involved in pulmonary carcinogenesis [17] and that the D site regulates the dsRNA-activated protein kinase in cells [18]. At present, D is treated as a promising biomarker for developing effective cancer treatments [19]. Overall, D plays significant roles in biological function and medical treatment.
Precise detection of D sites is the basis of in-depth exploration. Experimentalists have developed various high-throughput sequencing techniques to detect D sites [15,20], and the results have been collected in popular chemical-modification databases such as RMBase [21] and MODOMICS [22]. Although these biochemical approaches provide reliable results, they usually require a huge amount of time and labour. Thus, building high-performance predictors is an urgent task for quickly locating D sites in given RNA sequences.
As listed in Table 1, up to now, only two prediction tools have been published to recognize RNA D modification [23,24]. The first was proposed by Feng et al. and focused on the yeast S. cerevisiae, with 68 positive and 68 negative samples [23]. Three feature extraction methods were first investigated, including nucleotide physicochemical property (NPCP), pseudo dinucleotide composition (PseDNC) and secondary structure component (SSC). Then, three single-feature-based support vector machine (SVM) classifiers were combined into an ensemble predictor via voting. The jackknife test gave a sensitivity (Sn) of 76.47% with a specificity (Sp) of 89.71%. For the more recent tool iRNAD, introduced by Xu et al., a total of 550 sequences from five species were considered, including H. sapiens, M. musculus, S. cerevisiae, E. coli and D. melanogaster [24]. By incorporating NPCP and nucleotide density features (merged as CPND), the SVM-based predictor achieved Sn and Sp values of 86.43% and 98.66% on the training dataset, and 86.11% and 96.05% on the testing dataset, respectively.
Table 1.
Summary of two published D predictors
| Tools | Species | Datasets | Features | Classifiers | Metrics | Sn (%) | Sp (%) |
|---|---|---|---|---|---|---|---|
| Feng et al. | S. cerevisiae | Pos: 68; Neg: 68 | NPCP, PseDNC, SSC | SVM & voting | Jackknife | 76.47 | 89.71 |
| iRNAD | H. sapiens, M. musculus, S. cerevisiae, E. coli, D. melanogaster | Pos: 176; Neg: 374 | CPND | SVM | Jackknife | 86.43 | 98.66 |
| | | | | | Independent | 86.11 | 96.05 |
It can be seen that only 136 samples (S. cerevisiae) were adopted in the first predictor [23]; thus, it is necessary to expand the dataset to construct more statistically significant models. Regarding the second tool, iRNAD, although the total number of instances reached 550, the prediction power on positive samples was on average about 11% lower than that on negatives. Even though the imbalance ratio of true to false D samples was only about 1:2, the prediction bias towards the negative majority class is still noticeable. Therefore, it is necessary to build more reliable models that account for the imbalanced nature of the datasets. Additionally, only the CPND descriptor was applied to encode nucleotide bases, and many other types of sequence encoding methods are worth investigating. In this work, we developed an efficient machine-learning model, called iRNAD_XGBoost, using eXtreme Gradient Boosting (XGBoost) to diagnose real D modification. As displayed in Fig. 2, the RNA sequences were first encoded using three kinds of sequence encoding approaches, namely CPND, electron-ion interaction pseudopotential properties (EIIP and PseEIIP) and nucleotide composition (Kmer). Then, the feature importances of the resulting vectors were analysed using the XGBoost algorithm. With the incremental feature selection technique (IFS), different feature subsets were generated and optimized under various resampling and classification algorithms. Ultimately, the best-performing model was built on the top 30 features, with the imbalanced training samples handled by the hybrid-sampling approach SMOTEENN.
Figure 2.

Pipeline for the identification of RNA D sites
2. Materials and methods
2.1. Benchmark datasets
In the current work, we directly used the benchmark datasets collected by Xu et al. [24]. Specifically, initial segments containing D sites were gathered from RMBase 2.0 [21] (H. sapiens, M. musculus and S. cerevisiae) and MODOMICS [22] (E. coli and D. melanogaster). In addition, tRNA sequences of those five species were retrieved from the genomic tRNA database GtRNAdb 2.0 [25]. After removing redundant samples [26] with a cut-off of 90% using the CD-HIT package [27], 550 instances remained, including 176 positive and 374 negative samples. To objectively assess model generalizability, one-fifth of the original samples was set aside as the testing dataset for the independent test.
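The hold-out protocol above can be sketched as follows. This is an illustrative stand-in, not the authors' exact splitting code: the sequence names and the random seed are hypothetical, and only the class sizes (176 positive, 374 negative, one-fifth held out) come from the text.

```python
import random

def split_benchmark(samples, labels, test_fraction=0.2, seed=42):
    """Hold out a random fraction of the benchmark set for the independent test."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_test = int(round(len(samples) * test_fraction))
    test_idx = set(idx[:n_test])
    train = [(samples[i], labels[i]) for i in idx if i not in test_idx]
    test = [(samples[i], labels[i]) for i in idx[:n_test]]
    return train, test

# Toy usage with the paper's class sizes (550 = 176 positive + 374 negative):
samples = ["seq%d" % i for i in range(550)]
labels = [1] * 176 + [0] * 374
train, test = split_benchmark(samples, labels)
```

With 550 samples and a one-fifth hold-out, this yields 440 training and 110 testing instances.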
2.2. Feature extraction
2.2.1. CPND
CPND is a popular feature extraction method that encodes RNA segments by incorporating nucleotide chemical properties and nucleotide density. As mentioned above, CPND has been successfully applied in the two existing D-site predictors [23,24], as well as for m4C [28], pseudouridine [29], m2G [30], and so on. In this method, the four nucleotide bases (adenine: A; cytosine: C; guanine: G; uracil: U) are classified into different categories according to their chemical properties. Given an RNA sequence, the nucleotide $n_i$ at position $i$ can first be represented by a 3D vector,
$$S_i = (x_i,\, y_i,\, z_i) \tag{1}$$

where $x_i$, $y_i$ and $z_i$ indicate the three properties of ring structure, functional group and hydrogen bond, expressed as

$$x_i = \begin{cases} 1, & n_i \in \{\mathrm{A}, \mathrm{G}\} \\ 0, & n_i \in \{\mathrm{C}, \mathrm{U}\} \end{cases} \tag{2}$$

$$y_i = \begin{cases} 1, & n_i \in \{\mathrm{A}, \mathrm{C}\} \\ 0, & n_i \in \{\mathrm{G}, \mathrm{U}\} \end{cases} \tag{3}$$

$$z_i = \begin{cases} 1, & n_i \in \{\mathrm{A}, \mathrm{U}\} \\ 0, & n_i \in \{\mathrm{C}, \mathrm{G}\} \end{cases} \tag{4}$$
Specifically, for the ring structure, purines (A and G) with two rings are encoded as 1, while pyrimidines (C and U) with one ring are encoded as 0. For the chemical functionality, the amino group (A and C) is written as 1 and the keto group (G and U) as 0. In terms of hydrogen-bond strength, A and U, which form weak interactions (two hydrogen bonds), are represented as 1, while C and G, which form strong interactions (three hydrogen bonds), are 0. Therefore, the four nucleotides A, C, G and U are encoded as (1, 1, 1), (0, 1, 0), (1, 0, 0) and (0, 0, 1), respectively.
Meanwhile, the accumulated density is calculated to reflect the position information, defined as

$$d_i = \frac{1}{i}\sum_{j=1}^{i} f(n_j), \qquad f(n_j) = \begin{cases} 1, & n_j = n_i \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
Combining the above two characteristics, every nucleotide can be expressed as

$$V_i = (x_i,\, y_i,\, z_i,\, d_i) \tag{6}$$

Finally, an $L$-length RNA sequence is encoded as a $4L$-D feature vector (92-D for the 23-nt segments used here). The state-of-the-art bioinformatics package iLearn was used to extract the CPND features [31].
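Eqs (1)–(6) can be sketched in a few lines. This is a minimal re-implementation for illustration, not the iLearn code the authors actually used:

```python
# Chemical-property codes from Eqs (2)-(4): A=(1,1,1), C=(0,1,0), G=(1,0,0), U=(0,0,1)
CHEM = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}

def cpnd_encode(seq):
    """Encode an RNA sequence into its 4L-D CPND vector: three
    chemical-property bits per base plus the cumulative density of Eq. (5)."""
    features = []
    for i, base in enumerate(seq, start=1):
        # density: occurrences of this base among the first i positions, over i
        density = seq[:i].count(base) / i
        features.extend(CHEM[base] + (density,))
    return features

v = cpnd_encode("ACGU")  # 4 bases -> 16-D vector
```

For a 23-nt segment the same call returns the 92-D representation used in the paper.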
2.2.2. Kmer
As a common sequence descriptor, Kmer calculates the occurrence frequencies of k neighbouring nucleic acids [32,33]. Three kinds of Kmer features with k = 1–3 were extracted in the current study, corresponding to nucleic acid composition (NAC), dinucleotide composition (DNC) and trinucleotide composition (TNC). Notably, this method induces a $4^k$-D feature vector for each value of k.
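A minimal sketch of the Kmer descriptor follows; it enumerates all $4^k$ RNA words and normalizes their counts by the number of windows, which is the standard formulation rather than the authors' specific implementation:

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Occurrence frequency of every k-mer over the 4^k words of the RNA alphabet."""
    words = ["".join(p) for p in product("ACGU", repeat=k)]
    total = len(seq) - k + 1  # number of sliding windows
    counts = {w: 0 for w in words}
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[w] / total for w in words]

dnc = kmer_frequencies("ACGUACGU", 2)  # 16-D dinucleotide composition (DNC)
```

With k = 1, 2, 3 this yields the 4-D NAC, 16-D DNC and 64-D TNC vectors described above.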
2.2.3. EIIP and PseEIIP
In 2006, Nair et al. measured the EIIP values of the four DNA nucleotides (EIIPA = 0.1260, EIIPC = 0.1340, EIIPG = 0.0806 and EIIPT = 0.1335) [34]. Subsequently, two related approaches, called EIIP and PseEIIP, were developed to represent nucleic acid sequences, and they have been widely adopted in multiple prediction tasks [35–37]. The EIIP value of T in DNA was directly used for the U base when representing RNA sequences.
In the EIIP method, the RNA segment is simply transferred to an $L$-D discrete vector (23-D here) by substituting each base with its EIIP value. Meanwhile, the PseEIIP feature is formulated as the frequency-weighted EIIP of trinucleotides,

$$V = [\mathrm{EIIP}_{AAA} \cdot f_{AAA},\ \mathrm{EIIP}_{AAC} \cdot f_{AAC},\ \ldots,\ \mathrm{EIIP}_{UUU} \cdot f_{UUU}] \tag{7}$$

where $\mathrm{EIIP}_{XYZ} = \mathrm{EIIP}_X + \mathrm{EIIP}_Y + \mathrm{EIIP}_Z$ is the overall EIIP of trinucleotide XYZ, obtained by summing the EIIP values of the three bases, and $f_{XYZ}$ is its occurrence frequency. Finally, a 64-D PseEIIP numerical vector is formed, reflecting the information of the 64 trinucleotides (AAA, AAC, …, UUU).
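Eq. (7) can be sketched directly from the published EIIP constants; this is an illustrative re-implementation, not the authors' extraction script:

```python
from itertools import product

# EIIP values of Nair et al., with T's value reused for U (as stated in the text)
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}

def pse_eiip(seq):
    """64-D PseEIIP vector of Eq. (7): summed trinucleotide EIIP weighted
    by the trinucleotide's occurrence frequency in the sequence."""
    tris = ["".join(p) for p in product("ACGU", repeat=3)]
    total = len(seq) - 2  # number of trinucleotide windows
    freq = {t: 0 for t in tris}
    for i in range(total):
        freq[seq[i:i + 3]] += 1
    return [(EIIP[t[0]] + EIIP[t[1]] + EIIP[t[2]]) * freq[t] / total
            for t in tris]

vec = pse_eiip("ACGUACGUACG")
```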
2.3. Feature selection and classification algorithms
Feature selection is an imperative step for choosing the most significant features to build a simple and interpretable model; the available methods are generally grouped into three categories, namely filter, wrapper and embedded methods [38]. This operation not only improves prediction quality but also saves running time and helps avoid the curse of dimensionality with high-dimensional features. In this work, a two-stage procedure was arranged: preliminary XGBoost-based feature ranking followed by application of the IFS strategy.
XGBoost is a scalable tree-boosting ensemble algorithm based on gradient-boosted decision trees (GBDT), which has been widely applied in many supervised-learning topics with excellent performance [39]. In this method, a sparsity-aware split-finding algorithm and a weighted quantile sketch are combined to handle sparse data. The method trains successive trees on the residuals of previous ones and takes the weighted combination of all trees as the final prediction. Besides the penalty from regularizing the objective function, two techniques prevent overfitting: shrinkage, introduced by Friedman [40], and feature subsampling borrowed from random forests, which also speeds up learning. In addition, XGBoost can rank feature importance by the average gain of each feature across all splits. It works well in bioinformatics projects such as RNA pseudouridine, protein glutarylation and glycation, and binding-site prediction [41–51]. More details of the XGBoost theory can be found in Ref. [39].
In this research, XGBoost, used through its scikit-learn-compatible Python interface [52], was first adopted as the filtering strategy to measure the contribution of each feature. Then, IFS was applied to generate different feature subsets from the top n ranked features. Finally, through systematic comparison of model generalization capacity and of the prediction gaps between positive and negative samples, the optimized XGBoost-based predictor with the best performance was obtained. At the same time, several common classifiers were investigated, including random forest (RF), SVM, logistic regression (LR) and naive Bayes (NB). Besides, the t-distributed stochastic neighbour embedding (t-SNE) algorithm [53] was adopted to visualize the sample distribution. It is a variation of stochastic neighbour embedding in terms of the cost function and the 'crowding problem', producing better visualizations by creating a single map that reveals structure at many different scales. First, joint probabilities are computed from the pairwise similarities of the data; then, the Kullback-Leibler divergence between the joint probabilities of the high-dimensional data and the low-dimensional embedding is minimized. Here, a 2D feature space was produced to analyse the imbalanced datasets.
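The ranking-plus-IFS loop can be sketched generically. The skeleton below is a hedged illustration: `ranked_idx` stands for the XGBoost importance ranking, and `score_fn` is a hypothetical stand-in for whatever cross-validated metric is optimized (the paper uses five-fold CV at this stage); the toy scorer that peaks at 30 features is purely for demonstration.

```python
def incremental_feature_selection(ranked_idx, score_fn, max_n=None):
    """IFS skeleton: grow feature subsets top-1, top-2, ... along an
    importance ranking and keep the best-scoring subset."""
    best_subset, best_score = [], float("-inf")
    max_n = max_n or len(ranked_idx)
    for n in range(1, max_n + 1):
        subset = ranked_idx[:n]
        s = score_fn(subset)  # e.g. cross-validated accuracy on this subset
        if s > best_score:
            best_subset, best_score = subset, s
    return best_subset, best_score

# Toy scorer whose optimum is a 30-feature subset, mirroring the paper's result:
subset, score = incremental_feature_selection(
    list(range(263)), lambda fs: -abs(len(fs) - 30))
```

In practice `score_fn` would fit the classifier on the candidate columns and return the CV score; everything else in the loop is unchanged.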
2.4. Resampling strategies
Data imbalance is a common challenge in classification tasks [54–71]. Traditional classifiers usually display a noticeable bias towards samples in the majority class. With the rapid growth of the machine-learning industry and artificial intelligence (AI), researchers have proposed numerous solutions, which can be generally sorted into three groups: data level, algorithm level and hybrid methods (i.e. combinations of the first two). The data-level strategy is a direct way to balance the dataset by increasing the number of samples in the minority class or deleting samples from the majority class, and it can be divided into over-sampling, under-sampling and hybrid-sampling methods [72,73]. The hybrid-sampling strategy SMOTEENN was finally applied to balance the training samples [74]: the powerful over-sampling approach Synthetic Minority Over-sampling Technique (SMOTE) [75] synthesizes new data in the minority class, followed by the under-sampling method Edited Nearest Neighbours (ENN) [76] to eliminate noisy data. This combination effectively mitigates both overfitting and the loss of key information, yielding superior results.
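The SMOTE half of this pipeline can be sketched in pure Python. This is a minimal illustration of the interpolation idea, not the imbalanced-learn implementation the paper presumably used; the ENN cleaning step (deleting samples misclassified by their neighbourhood) is noted but omitted for brevity.

```python
import random

def smote_oversample(minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: synthesize points by interpolating each chosen
    minority sample toward one of its k nearest minority-class neighbours.
    An ENN pass would follow in SMOTEENN to remove noisy samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a within the minority class (squared Euclidean)
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between a and b
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

new_pts = smote_oversample([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=4)
```

Each synthetic point is a convex combination of two real minority samples, so the new data stay inside the minority class's local geometry.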
2.5. Performance evaluation
In the present work, five-fold cross-validation (5-fold CV) and jackknife tests were used to evaluate the constructed models on the training datasets [77]; 5-fold CV reduces optimization time, while jackknife gives a more objective evaluation. Generally, in k-fold CV, all training samples are randomly divided into k subsets, of which k−1 are used to construct the model and the remaining one to test it; this is repeated until each subset has served once as the test set. Jackknife is the special case of k-fold CV in which k equals the total number of training sequences. Additionally, the independent test is a necessary step to measure model generalizability.
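The fold construction just described can be sketched as follows; this is a generic illustration (the actual shuffling and seeding used by the authors are not specified):

```python
import random

def k_fold_indices(n_samples, k, seed=1):
    """Partition shuffled sample indices into k folds; jackknife
    (leave-one-out) is the k = n_samples case described above."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]  # round-robin assignment to folds

folds = k_fold_indices(10, 5)        # 5-fold CV: five folds of two samples
jackknife = k_fold_indices(10, 10)   # jackknife: one sample per fold
```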
For binary classification, confusion-matrix-based metrics are usually applied to evaluate the predictor, including sensitivity or recall (Sn or Re), specificity (Sp), accuracy (Acc), the Matthews correlation coefficient (MCC), precision (Pre) and the F1-score (F1) [70,78–83], formulated as
$$Sn = \frac{TP}{TP + FN} \tag{8}$$

$$Sp = \frac{TN}{TN + FP} \tag{9}$$

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{11}$$

$$Pre = \frac{TP}{TP + FP} \tag{12}$$

$$F1 = \frac{2 \times Pre \times Sn}{Pre + Sn} \tag{13}$$
Here, TP and TN are the numbers of true positive and true negative samples, i.e. correctly identified sequences with and without D sites; FP and FN are the numbers of false positives and false negatives, respectively. Because it is threshold-independent, the receiver operating characteristic curve (ROC, TPR vs. FPR) is also applied to assess the model [84–92]. Given the imbalanced nature of the sample distribution, the metrics Sn, Pre and F1, which focus on the predictive power of positive samples, deserve particular attention.
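Eqs (8)–(13) translate directly into code; the confusion-matrix counts below are toy values chosen only to exercise the formulas:

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Compute the six metrics of Eqs (8)-(13) from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity / recall, Eq. (8)
    sp = tn / (tn + fp)                      # specificity, Eq. (9)
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy, Eq. (10)
    mcc = (tp * tn - fp * fn) / math.sqrt(   # Matthews correlation, Eq. (11)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    pre = tp / (tp + fp)                     # precision, Eq. (12)
    f1 = 2 * pre * sn / (pre + sn)           # F1-score, Eq. (13)
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc, "Pre": pre, "F1": f1}

m = confusion_metrics(tp=33, tn=72, fp=4, fn=3)
```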
3. Results and discussion
3.1. Performance based on the single features
CPND-related features were applied in the two reported studies to describe RNA sequences (see Table 1) [23,24]. Thus, we first extracted this kind of feature using the iLearn package [31]. Besides, we considered the EIIP and PseEIIP approaches to exploit electron-ion interaction pseudopotentials, and included the basic nucleotide composition properties NAC, DNC and TNC. Table 2 lists the preliminary performance of the different features using the XGBoost classifier. For the 92-D CPND features, the Sn and Sp values reached 80.71% and 96.64% over the jackknife test, and 83.33% and 94.74% over the independent test, which is comparable to iRNAD's reports. EIIP yielded better results than the PseEIIP approach, with sensitivity on positive samples (jackknife: 86.43%; independent: 91.67%) very competitive with iRNAD's performance (86.43%; 86.11%) [24]. Among the three nucleotide composition features, the 64-D TNC features showed good results. Overall, among these RNA representations, EIIP demonstrated the most advantageous prediction results.
Table 2.
Preliminary results of different feature descriptors
(In each row, the first Sn–F1 block reports jackknife results and the second block independent-test results.)

| Features | Dimension | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CPND | 92 | 80.71 | 96.64 | 91.55 | 0.80 | 0.94 | 91.87 | 0.86 | 83.33 | 94.74 | 91.07 | 0.79 | 0.78 | 88.24 | 0.86 |
| EIIP | 23 | 86.43 | 96.64 | 93.38 | 0.85 | 0.95 | 92.37 | 0.89 | 91.67 | 94.74 | 93.75 | 0.86 | 0.83 | 89.19 | 0.90 |
| PseEIIP | 64 | 77.14 | 95.97 | 89.95 | 0.76 | 0.87 | 90.00 | 0.83 | 83.33 | 88.16 | 86.61 | 0.70 | 0.78 | 76.92 | 0.80 |
| NAC | 4 | 44.29 | 78.19 | 67.35 | 0.23 | 0.59 | 48.82 | 0.46 | 36.11 | 89.47 | 72.32 | 0.31 | 0.70 | 61.90 | 0.46 |
| DNC | 16 | 71.43 | 89.93 | 84.02 | 0.63 | 0.83 | 76.92 | 0.74 | 66.67 | 89.47 | 82.14 | 0.58 | 0.79 | 75.00 | 0.71 |
| TNC | 64 | 77.14 | 95.97 | 89.95 | 0.76 | 0.87 | 90.00 | 0.83 | 83.33 | 88.16 | 86.61 | 0.70 | 0.78 | 76.92 | 0.80 |
| ALL | 263 | 82.14 | 96.64 | 92.01 | 0.81 | 0.95 | 92.00 | 0.87 | 83.33 | 93.42 | 90.18 | 0.77 | 0.80 | 85.71 | 0.85 |
For the incorporated 263 features (labelled ALL), the Sn and Sp values were 82.14% and 96.64% for the training set, and 83.33% and 93.42% for the testing set, respectively. Although the combined features contained sequence information from multiple aspects, there was no noticeable improvement in identification efficiency, likely because of redundant or noisy features in the directly merged set. Therefore, feature selection is necessary to analyse the features and remove interfering data to improve model quality. More importantly, the constructed models were obviously biased towards the negative samples (Sp is approximately 10% higher than Sn). Thus, we set out to develop higher-accuracy models from two directions: feature selection and imbalance-handling strategies.
3.2. Model optimization and feature analysis
Based on the combined 263 attributes, feature selection was performed using XGBoost ranking followed by the IFS algorithm. Different feature subsets were then formed and separately used to optimize the model under several resampling and classification schemes. To save time, we chose the five-fold CV experiment instead of jackknife for model evaluation at this stage. Ultimately, the best model was obtained using the top 30 ranked features together with the hybrid-sampling technique SMOTEENN. This model delivered outstanding results (five-fold CV: Sn = 96.33%, Sp = 97.38%, Acc = 96.78%, MCC = 0.93; independent test: Sn = 91.67%, Sp = 94.74%, Acc = 93.75%, MCC = 0.86). Compared with the original model based on all 263 features, the independent-test prediction score for positive samples (Sn) increased markedly from 83.33% to 91.67%, indicating the crucial contribution of the selected 30 features to distinguishing true from false D sites.
As mentioned above, in imbalanced prediction tasks, conventional classifiers usually display poor identification capacity on the positive sequences, which are exactly the focus of biologists' research. Based on the top 30 ranked features, the results of six different resampling strategies are listed in Table 3. In particular, over-sampling with SMOTE gave good Sn and Sp of 91.95% and 93.96% for the training samples, and 75.00% and 93.42% for the testing samples. As an improvement on SMOTE, ADASYN showed almost identical results. Among the under-sampling methods, Tomek and ENN achieved independent-test Sn scores of 83.33% and 100.00%, respectively. Meanwhile, two integrated sampling techniques, SMOTEENN and SMOTETomek, were applied to reduce the impact of imbalance, of which the former produced the best results (jackknife: Sn = 97.54%, Sp = 97.38%; independent: Sn = 91.67%, Sp = 94.74%). Compared with the results obtained without considering data imbalance in Table 2, the performance gap between positive and negative samples decreased greatly: the training MCC and F1 scores exceed 0.81 and 0.87, and the testing scores exceed 0.71 and 0.79, respectively. In summary, the hybrid-sampling techniques outperform purely over- or under-sampling strategies.
Table 3.
Comparisons of different resampling techniques using the selected top 30 features
(In each row, the first Sn–F1 block reports jackknife results and the second block independent-test results.)

| Resampling | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMOTE | 91.95 | 93.96 | 92.95 | 0.86 | 0.96 | 93.84 | 0.93 | 75.00 | 93.42 | 87.50 | 0.71 | 0.87 | 84.37 | 0.79 |
| ADASYN | 94.94 | 92.62 | 93.81 | 0.88 | 0.97 | 93.17 | 0.94 | 77.78 | 93.42 | 88.39 | 0.73 | 0.87 | 84.85 | 0.81 |
| Tomek | 85.00 | 94.62 | 91.41 | 0.81 | 0.96 | 88.81 | 0.87 | 83.33 | 94.74 | 91.07 | 0.79 | 0.87 | 88.24 | 0.86 |
| ENN | 95.00 | 94.80 | 94.89 | 0.90 | 0.98 | 93.66 | 0.94 | 100.00 | 80.26 | 86.61 | 0.75 | 0.87 | 70.59 | 0.83 |
| SMOTEENN | 97.54 | 97.38 | 97.47 | 0.95 | 0.99 | 97.94 | 0.98 | 91.67 | 94.74 | 93.75 | 0.86 | 0.87 | 89.19 | 0.90 |
| SMOTETomek | 94.39 | 94.04 | 94.21 | 0.88 | 0.98 | 94.06 | 0.94 | 77.78 | 93.42 | 88.39 | 0.73 | 0.87 | 84.85 | 0.81 |
To better understand the classification model, we analysed the nucleotide distributions in Fig. 3A. Symbols enriched and depleted in the positive set are located above (>0) and below (<0) the horizontal axis, respectively, reflecting the differences in occurrence frequency between the positive and negative sets. The nucleotide locations differ markedly between true and false D sequences, in both the upstream and downstream segments around the central uracil (U). In positive instances, U is preferentially located at positions 4, 8 and 16, G at positions 6, 11, 14 and 18, and A at positions 10, 18 and 19; in negative samples, A prefers position 4 and C positions 10, 11 and 14. Overall, these noticeable distribution preferences (20%–40% at positions 4, 6, 10, 11, 14, 16 and 18, labelled with blue dashed boxes) provide the basis for building the bioinformatics predictor.
Figure 3.

Feature analysis of RNA D modification. (A) Nucleotide location characteristics of positive and negative samples, where the blue dashed boxes indicate positions with obvious preferences. (B) Feature importances of the XGBoost-based top 30 features, where the inset counts the number of selected features of each type
Feature selection is a useful operation to increase model efficiency and interpretability. Based on the 30 retained features, we conducted a systematic feature analysis to explain the optimized predictor. Fig. 3B displays the feature importances, with the number of selected features of each type counted in the inset. The top-ranked feature is EIIP14 (relative importance 0.08), the EIIP value of the base at position 14. Combined with the base locations in Fig. 3A, G is enriched by more than 42% at this position in real D sequences, whereas C and U are depleted by 24% and 16% in false D samples. The next feature is DNC13 (importance 0.063), the occurrence frequency of the 13th dinucleotide, 'UA'. The third is PseEIIP12 (0.048), the weighted EIIP of the 12th trinucleotide, 'AGU'. Feature CPND15 (0.04) describes the hydrogen bond of the base at position 5; Fig. 3A shows that at this position A is more likely in positive segments (by 12%) and U in negative samples (by 14%), and both bases form weak hydrogen bonds. Similarly, the fifth feature, PseEIIP18, is the weighted EIIP of the 18th trinucleotide, 'CAC'. For convenient application and comparison, all feature-related data are provided in the supplementary materials.
From the overall perspective of Fig. 3B, the feature importances decrease rapidly: the highest score of 0.08 (1st feature, EIIP14) quickly drops to 0.04 (4th feature, CPND15) and finally to 0.0088 (30th feature, CPND37). Among the 30 selected features, 12, 6, 10 and 2 attributes belong to the four types of RNA representation, respectively. Notably, the 12 CPND features account for 40% of the selected features, illustrating the large contribution of the CPND encoding to the model, consistent with previous conclusions [23,24]. Besides, the EIIP, PseEIIP and DNC features also play crucial roles in D identification. In summary, although iRNAD_XGBoost retained only about 11% of the overall features, the constructed model yielded satisfactory performance.
In addition, the results of different classifiers, including RF, SVM, LR and NB, are compared in Table 4. The best XGBoost results were obtained after hyperparameter optimization (the number of estimators was set to 146), with Pre and F1 reaching 97.94% and 0.98 over the jackknife test, and 89.19% and 0.90 over the independent test. The RF model displayed a low independent-test Sn of 80.56%. Although the SVM-based predictor showed the highest Sn values (99.59% on training and 100.00% on testing), its independent-test Sp was only 75.00%. Similarly, LR showed a lower testing Sp of 78.95%, and the Sn of RF and NB (80.56% and 75.00%) still needs improvement. It can be concluded that XGBoost delivers the best results for RNA D identification.
Table 4.
Comparisons of different classifiers using the top 30 features with SMOTEENN technique
(In each row, the first Sn–F1 block reports jackknife results and the second block independent-test results.)

| Classifiers | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 97.54 | 97.38 | 97.47 | 0.95 | 0.99 | 97.94 | 0.98 | 91.67 | 94.74 | 93.75 | 0.86 | 0.87 | 89.19 | 0.90 |
| RF | 96.31 | 97.38 | 96.78 | 0.93 | 0.99 | 97.92 | 0.97 | 80.56 | 90.79 | 87.50 | 0.71 | 0.84 | 80.56 | 0.81 |
| SVM | 99.59 | 95.29 | 97.70 | 0.95 | 0.99 | 96.43 | 0.98 | 100.00 | 75.00 | 83.04 | 0.70 | 0.71 | 65.45 | 0.79 |
| LR | 96.72 | 94.76 | 95.86 | 0.92 | 0.98 | 95.93 | 0.96 | 94.44 | 78.95 | 83.93 | 0.69 | 0.83 | 68.00 | 0.79 |
| NB | 88.93 | 96.34 | 92.18 | 0.85 | 0.97 | 96.87 | 0.93 | 75.00 | 90.79 | 85.71 | 0.67 | 0.82 | 79.41 | 0.77 |
Hereafter, we visualized the distribution characteristics of the imbalanced training samples. As illustrated in Fig. 4A, the selected 30-D features were effectively projected into a 2D space using the t-SNE algorithm. The positive instances (red dots) are mostly located in the lower-left region, whereas the negatives (blue dots) occupy the upper-right area. This clear boundary between positive and negative samples greatly reduces the difficulty of the identification task. More intuitively, Fig. 4B plots violin graphs of component 1, where the coloured boxes indicate the 25%–75% interquartile range (IQR) and the vertical lines span 1.5 IQR. The different distributions of component 1 for positive and negative samples indicate that the samples can be well separated using the selected features.
3.3. Comparison with other tools
Figure 4.

Distributions of imbalanced training samples. (A) 2D representation based on the t-SNE method. (B) the violin plots of associated component1
Two tools have been reported to discriminate D from non-D RNA sequences (see Table 1). Owing to the limited sequences available in Feng et al.'s model [23], Table 5 compares our model only with iRNAD [24]. The first two rows (marked with '*') give the jackknife results using all 550 samples: Sn improved from 92.05% to 99.03% with a high Sp of 98.43%, and the corresponding MCC increased from 0.91 to 0.97 with an AUC of 1.00. To reflect model generalizability, a testing dataset was separated from the original data for the independent test. Although our prediction capacity on negative samples (Sp) is slightly lower (by 1.31%), the Sn on positive samples improved from 86.43% to 97.13% over the jackknife test and from 86.11% to 91.67% over the independent test. With only a slight reduction in the identification of negative samples, the results for positive samples improved substantially. The present model prominently reduces the bias towards negative samples and exhibits high performance in identifying D modification.
Table 5.
Comparisons of iRNAD and our present model to classify D sites
(In each row, the first Sn–F1 block reports jackknife results and the second block independent-test results.)

| Tools | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iRNAD* | 92.05 | 98.13 | 96.18 | 0.91 | 0.98 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| This work* | 99.03 | 98.43 | 98.76 | 0.97 | 1.00 | 0.99 | 0.99 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| iRNAD | 86.43 | 98.66 | 94.75 | 0.88 | 0.98 | N/A | N/A | 86.11 | 96.05 | 92.86 | 0.83 | 0.98 | N/A | N/A |
| This work | 97.13 | 97.38 | 97.24 | 0.94 | 0.99 | 97.93 | 0.98 | 91.67 | 94.74 | 93.75 | 0.86 | 0.87 | 89.19 | 0.90 |
* Evaluated performance of prediction tools using collected 550 samples over jackknife test.
Meanwhile, the ROC curves of the training (purple line) and testing experiments (green line) are drawn in Fig. 5 to visually depict prediction efficiency. Although the corresponding AUCs reached 0.99 and 0.87, the 0.12 gap between the training and testing datasets cannot be ignored; thus, generalizability still has room for improvement in future research. As mentioned, the 550 samples were collected from five species in total: H. sapiens (29 positive + 68 negative), M. musculus (13 + 48), S. cerevisiae (91 + 93), E. coli (34 + 127) and D. melanogaster (9 + 38). Because the number of samples per species is relatively small, species-specific models were not explored here.
Figure 5.

ROC curves of training and testing datasets, respectively
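An ROC curve of the kind shown in Fig. 5 plots the true-positive rate against the false-positive rate over all score thresholds. A minimal sketch of how such a curve and its AUC are computed with scikit-learn [52], using toy labels and scores rather than the paper's predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy labels and classifier scores (illustrative only, not the paper's data)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.9, 0.4])

# Sweep the decision threshold to obtain the ROC points, then integrate
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # area under the ROC curve; 15/16 = 0.9375 here
```

With no tied scores, this AUC equals the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one, which is why it is a natural summary of the train/test gap discussed above.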
4. Conclusions
As a ubiquitous tRNA modification, dihydrouridine (D) is formed by hydrogenation of the endocyclic carbon-carbon double bond of uridine. Accurate machine-learning-based predictors can accelerate research on biological regulation and cancerous tissues. In this work, we constructed a novel computational tool to accurately identify D sites using multiple features. Using the feature selection technique XGBoost followed by IFS, the top 30 important features were chosen from CPND, EIIP, PseEIIP and Kmer (k = 1–3) to build the best-performing predictor; the imbalance problem was addressed with the resampling method SMOTEENN. Compared with the reported Sn and Sp values of 92.05% and 98.13% on the overall samples in the iRNAD model, our model achieves 99.03% and 98.43% over the jackknife test. Further training and testing experiments showed improvements of 10.7% and 5.6% in Sn, and the associated MCC values increased by 0.06 and 0.03, respectively, indicating more consistent prediction capability between D and non-D sequences. The proposed model iRNAD_XGBoost is an efficient bioinformatics tool for identifying RNA D modification and provides valuable guidance for complex molecular experiments.
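The selection pipeline summarized above (importance ranking by a boosted-tree model, then incremental feature selection) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `GradientBoostingClassifier` stands in for XGBoost, the SMOTEENN resampling step is omitted, and the data are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 550-sample feature matrix used in the paper
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           weights=[0.6, 0.4], random_state=0)

# Step 1: rank features by boosted-tree importance (the paper uses XGBoost;
# GradientBoostingClassifier serves here as a stand-in with the same idea)
ranker = GradientBoostingClassifier(random_state=0).fit(X, y)
order = np.argsort(ranker.feature_importances_)[::-1]

# Step 2: incremental feature selection (IFS): grow the top-k feature set
# and keep the k with the best cross-validated accuracy
best_k, best_acc = 0, 0.0
for k in range(1, len(order) + 1):
    acc = cross_val_score(GradientBoostingClassifier(random_state=0),
                          X[:, order[:k]], y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
```

In the actual pipeline, imbalanced-learn's `SMOTEENN` would resample the training data before fitting, and an XGBoost classifier would supply the importance ranking and the final model.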
As is well known, high-quality datasets are a prerequisite for powerful and reliable bioinformatics tools. In this work, only 550 samples were available to train the learner; more importantly, the sequence-similarity threshold was as high as 90%. It is therefore necessary to collect more experimentally verified D-containing RNA segments and apply a stricter redundancy-removal process to build more objective, even species-specific, models. Another important challenge is data imbalance, which tends to depress prediction results for positive instances in the minority class. Additionally, deep learning methods are a promising alternative for improving prediction performance [93–96]. In summary, progress on these three fronts can yield more competitive models in the future.
Funding Statement
This work was supported by the Natural Science Foundation of China (No. 61902259, 62002242), the Natural Science Foundation of Guangdong Province (No. 2018A0303130084), the Scientific Research Foundation in Shenzhen (No. JCYJ20170818100431895, JCYJ20180306172207178) and Post-doctoral Foundation Project of Shenzhen Polytechnic (No. 6020330003K).
Disclosure statement
The authors declare no conflict of interest.
Author contributions
L. X. and K. H. designed this research. L. D. and W. Z. did the experiments and drafted the manuscript, L. Z. reviewed the manuscript. All authors approved the final manuscript.
Supplementary material
Supplemental data for this article can be accessed here.
References
- [1].Li S, Mason CE. The pivotal regulatory landscape of RNA modifications. Annu Rev Genomics Hum Genet. 2014;15:127–150. [DOI] [PubMed] [Google Scholar]
- [2].Meyer KD, Jaffrey SR. The dynamic epitranscriptome: N6-methyladenosine and gene expression control. Nat Rev Mol Cell Biol. 2014;15(5):313–326. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [3].Kirchner S, Ignatova Z. Emerging roles of tRNA in adaptive translation, signalling dynamics and disease. Nat Rev Genet. 2014;16(2):98–112. [DOI] [PubMed] [Google Scholar]
- [4].Sun WJ, Li JH, Liu S, et al. RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data. Nucleic Acids Res. 2016;44(D1):D259–D265. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [5].Roundtree IA, Evans ME, Pan T, et al. Dynamic RNA modifications in gene expression regulation. Cell. 2017;169(7):1187–1200. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].Boccaletto P, Machnicka MA, Purta E, et al. MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res. 2018;46(D1):D303–D307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].Guohua H, Jincheng L. Feature extractions for computationally predicting protein post- translational modifications. Curr Bioinf. 2018;13(4):387–395. [Google Scholar]
- [8].Dao F-Y, Lv H, Yang Y-H, et al. Computational identification of N6-methyladenosine sites in multiple tissues of mammals. Comput Struct Biotechnol J. 2020;18:1084–1091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Lv H, Zhang Z-M, Li S-H, et al. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform. 2019;21(3):982–995. [DOI] [PubMed] [Google Scholar]
- [10].Madison JT, Holley RW. The presence of 5,6-dihydrouridylic acid in yeast “soluble” ribonucleic acid. Biochem Biophys Res Commun. 1965;18(2):153–157. [DOI] [PubMed] [Google Scholar]
- [11].Edmonds CG, Crain PF, Gupta R, et al. Posttranscriptional modification of tRNA in thermophilic archaea (Archaebacteria). J Bacteriol. 1991;173(10):3138–3148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Sprinzl M, Vassilenko KS. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 2005;33:D139–D140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Yu F, Tanaka Y, Yamashita K, et al. Molecular basis of dihydrouridine formation on tRNA. Proc Natl Acad Sci U S A. 2011;108:19593–19598. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Jones CI, Spencer AC, Hsu JL, et al. A counterintuitive Mg2+-dependent and modification-assisted functional folding of mitochondrial tRNAs. J Mol Biol. 2006;362(4):771–786. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Dalluge JJ, Hashizume T, Sopchik AE, et al. Conformational flexibility in RNA: the role of dihydrouridine. Nucleic Acids Res. 1996;24(6):1073–1079. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [16].Dalluge JJ, Hamamoto T, Horikoshi K, et al. Posttranscriptional modification of tRNA in psychrophilic bacteria. J Bacteriol. 1997;179(6):1918–1923. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [17].Kato T, Daigo Y, Hayama S, et al. A novel human tRNA-dihydrouridine synthase involved in pulmonary carcinogenesis. Cancer Res. 2005;65(13):5638. [DOI] [PubMed] [Google Scholar]
- [18].Mittelstadt M, Frump A, Khuu T, et al. Interaction of human tRNA-dihydrouridine synthase-2 with interferon-induced protein kinase PKR. Nucleic Acids Res. 2007;36(3):998–1008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Kuchino Y, Borek E. Tumour-specific phenylalanine tRNA contains two supernumerary methylated bases. Nature. 1978;271(5641):126–129. [DOI] [PubMed] [Google Scholar]
- [20].Kellner S, Ochel A, Thüring K, et al. Absolute and relative quantification of RNA modifications via biosynthetic isotopomers. Nucleic Acids Res. 2014;42(18):e142–e142. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Xuan -J-J, Sun W-J, Lin P-H, et al. RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res. 2017;46(D1):D327–D334. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].Boccaletto P, Machnicka MA, Purta E, et al. MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res. 2017;46(D1):D303–D307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Feng P, Xu Z, Yang H, et al. Identification of D modification sites by integrating heterogeneous features in saccharomyces cerevisiae. Molecules. 2019;24(3):380. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [24].Xu Z-C, Feng P-M, Yang H, et al. iRNAD: a computational tool for identifying D modification sites in RNA sequence. Bioinformatics. 2019;35(23):4922–4929. [DOI] [PubMed] [Google Scholar]
- [25].Chan PP, Lowe TM. GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes. Nucleic Acids Res. 2016;44(D1):D184–D189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Zou Q, Lin G, Jiang X, et al. Sequence clustering in bioinformatics: an empirical study. Brief Bioinform. 2018;21:1–10. [DOI] [PubMed] [Google Scholar]
- [27].Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–3152. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Chen W, Yang H, Feng P, et al. iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics. 2017;33(22):3518–3523. [DOI] [PubMed] [Google Scholar]
- [29].Chen W, Tang H, Ye J, et al. iRNA-PseU: identifying RNA pseudouridine sites. Mol Ther Nucleic Acids. 2016;5:e332. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].Chen W, Song X, Lv H, et al. iRNA-m2G: identifying N2-methylguanosine sites based on sequence-derived information. Mol Ther Nucleic Acids. 2019;18:253–258. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2019;21(3):1047–1057. [DOI] [PubMed] [Google Scholar]
- [32].Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform. 2017;20(4):1280–1294. [DOI] [PubMed] [Google Scholar]
- [33].Wang J, Chen S, Dong L, et al. CHTKC: a robust and efficient k-mer counting algorithm based on a lock-free chaining hash table. Brief Bioinform. 2020. DOI: 10.1093/bib/bbaa063. [DOI] [PubMed] [Google Scholar]
- [34].Nair AS, Sreenadhan SP. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation. 2006;1(6):197–202. [PMC free article] [PubMed] [Google Scholar]
- [35].He W, Jia C, Duan Y, et al. 70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features. BMC Syst Biol. 2018;12(S4):44. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [36].Han S, Liang Y, Ma Q, et al. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief Bioinform 2018;20:2009-2027. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [37].He WY, Jia CZ, Zou Q. 4mCPred: machine learning methods for DNA N-4-methylcytosine sites prediction. Bioinformatics. 2019;35(4):593–601. [DOI] [PubMed] [Google Scholar]
- [38].Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–2517. [DOI] [PubMed] [Google Scholar]
- [39].Chen T, Guestrin C. XGBoost: a Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA: Association for Computing Machinery, 2016, 785–794. [Google Scholar]
- [40].Friedman J. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–1232. [Google Scholar]
- [41].Al-barakati HJ, Saigo H, Newman RH, et al. RF-GlutarySite: a random forest based predictor for glutarylation sites. Mol Omics. 2019;15(3):189–204. [DOI] [PubMed] [Google Scholar]
- [42].Liu K, Chen W, Lin H. XG-PseU: an eXtreme gradient boosting based method for identifying pseudouridine sites. Mol Genet Genomics. 2020;295(1):13–21. [DOI] [PubMed] [Google Scholar]
- [43].Jia C, Bi Y, Chen J, et al. PASSION: an ensemble neural network approach for identifying the binding sites of RBPs on circRNAs. Bioinformatics. 2020;36(15):4276–4282. [DOI] [PubMed] [Google Scholar]
- [44].Yu J, Shi S, Zhang F, et al. PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization. Bioinformatics. 2018;35(16):2749–2756. [DOI] [PubMed] [Google Scholar]
- [45].Qu K, Zou Q. A review of DNA-binding proteins prediction methods. Curr Bioinf. 2018;13(4):14. [Google Scholar]
- [46].Wei L, Xing P, Shi G, et al. Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(4):1264–1273. [DOI] [PubMed] [Google Scholar]
- [47].Wei L, Zhou C, Chen H, et al. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics. 2018;34(23):4007–4016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [48].Zhao X, Jiao Q, Li H, et al. ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles. Bmc Bioinformatics. 2020;21(1):43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].Zhou H, Chen C, Wang M, et al. Predicting golgi-resident protein types using conditional covariance minimization with xgboost based on multiple features fusion. Ieee Access. 2019;7:144154–144164. [Google Scholar]
- [50].Liu B, Luo Z, He J. sgRNA-PSM: predict sgRNAs on-target activity based on position-specific mismatch. Mol Ther Nucleic Acids. 2020;20:323–330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [51].Wang M, Cui X, Yu B, et al. SulSite-GTB: identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting. Neural Comput Appl. 2020;32(17):13843–13862. [Google Scholar]
- [52].Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–2830. [Google Scholar]
- [53].van der Maaten L, Hinton GE. Visualizing high-dimensional data using t-SNE. J Mach Learn Res. 2008;9:2579–2605. [Google Scholar]
- [54].Yu L, Yao S, Gao L, et al. Conserved disease modules extracted from multilayer heterogeneous disease and gene networks for understanding disease mechanisms and predicting disease treatments. Front Genet. 2019;9:745. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [55].Liu B, Zhu Y, Yan K. Fold-LTR-TCP: protein fold recognition based on triadic closure principle. Brief Bioinform. 2019;21(6):2185–2193. [DOI] [PubMed] [Google Scholar]
- [56].Li C-C, Liu B. MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks. Brief Bioinform. 2019;21:2133–2141. [DOI] [PubMed] [Google Scholar]
- [57].Yu L, Zhao J, Gao L. Drug repositioning based on triangularly balanced structure for tissue-specific diseases in incomplete interactome. Artif Intell Med. 2017;77:53–63. [DOI] [PubMed] [Google Scholar]
- [58].Zhang M, Xu Y, Li L, et al. Accurate RNA 5-methylcytosine site prediction based on heuristic physical-chemical properties reduction and classifier ensemble. Anal Biochem. 2018;550:41–48. [DOI] [PubMed] [Google Scholar]
- [59].Zeng X, Zhu S, Liu X, et al. deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics. 2019;35(24):5191–5198. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [60].Lin X, Quan Z, Wang Z-J, et al. A novel molecular representation with BiGRU neural networks for learning atom. Brief Bioinform. 2019;21(6):2099–2111. [DOI] [PubMed] [Google Scholar]
- [61].Liu X, Hong Z, Liu J, et al. Computational methods for identifying the critical nodes in biological networks. Brief Bioinform. 2019;21(2):486–497. [DOI] [PubMed] [Google Scholar]
- [62].Zeng X, Lin Y, He Y, et al. Deep collaborative filtering for prediction of disease genes. IEEE/ACM Trans Comput Biol Bioinform. 2020;17(5):1639–1647. [DOI] [PubMed] [Google Scholar]
- [63].Meng C, Wei L, Zou Q. SecProMTB: support vector machine-based classifier for secretory proteins using imbalanced data sets applied to mycobacterium tuberculosis. PROTEOMICS. 2019;19(17):1900007. [DOI] [PubMed] [Google Scholar]
- [64].Jin Q, Meng Z, Pham TD, et al. DUNet: a deformable network for retinal vessel segmentation. Knowledge-Based Syst. 2019;178:149–162. [Google Scholar]
- [65].Su R, Liu X, Wei L, et al. Deep-Resp-Forest: a deep forest model to predict anti-cancer drug response. Methods. 2019;166:91–102. [DOI] [PubMed] [Google Scholar]
- [66].Su R, Liu X, Xiao G, et al. Meta-GDBP: a high-level stacked regression model to improve anticancer drug response prediction. Brief Bioinform. 2019;21(3):996–1005. [DOI] [PubMed] [Google Scholar]
- [67].Wang Z, He W, Tang J, et al. Identification of highest-affinity binding sites of yeast transcription factor families. J Chem Inf Model. 2020;60(3):1876–1883. [DOI] [PubMed] [Google Scholar]
- [68].Wang H, Ding Y, Tang J, et al. Identification of membrane protein types via multivariate information fusion with Hilbert–Schmidt Independence Criterion. Neurocomputing. 2020;383:257–269. [Google Scholar]
- [69].Ding Y, Tang J, Guo F. Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing. 2019;325:211–224. [Google Scholar]
- [70].Liu B, Li K. iPromoter-2L2.0: identifying promoters and their types by combining smoothing cutting window algorithm and sequence-based features. Mol Ther Nucleic Acids. 2019;18:80–87. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [71].Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 2019;47(20):e127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [72].Branco P, Torgo L, Ribeiro RP. A survey of predictive modeling on imbalanced domains. ACM Comput Surv. 2016;49(2):31. [Google Scholar]
- [73].Kaur H, Pannu HS, Malhi AK. A systematic review on imbalanced data challenges in machine learning: applications and solutions. ACM Comput Surv. 2019;52(4):79. [Google Scholar]
- [74].Batista G, Bazzan A, Monard M-C. Balancing training data for automated annotation of keywords: a case study. 2003. [Google Scholar]
- [75].Batista G, Bazzan A, Monard M-C. Balancing training data for automated annotation of keywords: a case study. In: II Brazilian Workshop on Bioinformatics, Macaé, Brazil. 2003. [Google Scholar]
- [76].Wilson DL. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern. 1972;SMC-2(3):408–421. [Google Scholar]
- [77].Yang H, Yang W, Dao F-Y, et al. A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Brief Bioinform. 2019;21(5):1568–1580. [DOI] [PubMed] [Google Scholar]
- [78].Wei L, Wan S, Guo J, et al. A novel hierarchical selective ensemble classifier with bioinformatics application. Artif Intell Med. 2017;83:82–90. [DOI] [PubMed] [Google Scholar]
- [79].Wei L, Xing P, Zeng J, et al. Improved prediction of protein–protein interactions using novel negative samples, features, and an ensemble classifier. Artif Intell Med. 2017;83:67–74. [DOI] [PubMed] [Google Scholar]
- [80].Li J, Pu Y, Tang J, et al. DeepAVP: a dual-channel deep neural network for identifying variable-length antiviral peptides. IEEE J Biomed Health Inform. 2020;24(10):3012–3019. [DOI] [PubMed] [Google Scholar]
- [81].Shen Y, Tang J, Guo F. Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC. J Theor Biol. 2019;462:230–239. [DOI] [PubMed] [Google Scholar]
- [82].Shen Y, Ding Y, Tang J, et al. Critical evaluation of web-based prediction tools for human protein subcellular localization. Brief Bioinform. 2020;21(5):1628–1640. [DOI] [PubMed] [Google Scholar]
- [83].Jiang Q, Wang G, Zhang T, et al. Predicting human microRNA-disease associations based on support vector machine. In: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Hongkong, China. 2010, p. 467–472. [DOI] [PubMed] [Google Scholar]
- [84].Davis J, Goadrich M. The relationship between precision-recall and ROC Curves. In: CML '06: Proceedings of the 23rd Iinternational Conference on Machine Learning, New York, United States. 2006, p. 223–240. [Google Scholar]
- [85].Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006;27(8):861–874. [Google Scholar]
- [86].Wang GH, Wang YD, Feng WX, et al. Transcription factor and microRNA regulation in androgen-dependent and -independent prostate cancer cells. BMC Genomics. 2008;9:S22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [87].Ding YJ, Tang JJ, Guo F. Identification of drug-target interactions via fuzzy bipartite local model. Neural Comput Appl. 2020;32(14):10303–10319. [Google Scholar]
- [88].Ding Y, Tang J, Guo F. Identification of drug-side effect association via semisupervised model and multiple kernel learning. IEEE J Biomed Health Inform. 2019;23(6):2619–2632. [DOI] [PubMed] [Google Scholar]
- [89].Ding Y, Tang J, Guo F. Identification of drug-target interactions via multiple information integration. Inf Sci. 2017;418-419:546–560. [Google Scholar]
- [90].Li Z, Tang J, Guo F. Learning from real imbalanced data of 14-3-3 proteins binding specificity. Neurocomputing. 2016;217:83–91. [Google Scholar]
- [91].Wang G, Luo X, Wang J, et al. MeDReaders: a database for transcription factors that bind to methylated DNA. Nucleic Acids Res. 2017;46(D1):D146–D151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [92].Zhao Y, Wang F, Juan L. MicroRNA promoter identification in Arabidopsis using multiple histone markers. Biomed Res Int. 2015;2015:861402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [93].Bizhi W, Hangxiao Z, Limei L, et al. A similarity searching system for biological phenotype images using deep convolutional encoder-decoder architecture. Curr Bioinf. 2019;14(7):628–639. [Google Scholar]
- [94].Lv Z, Ao C, Zou Q. Protein function prediction: from traditional classifier to deep learning. PROTEOMICS. 2019;19:1900119. [DOI] [PubMed] [Google Scholar]
- [95].Li P, Manman P, Bo L, et al. The advances and challenges of deep learning application in biological big data processing. Curr Bioinf. 2018;13(4):352–359. [Google Scholar]
- [96].Tang Y-J, Pang Y-H, Liu B. IDP-Seq2Seq: identification of intrinsically disordered regions based on sequence to sequence learning. Bioinformatics. 2020;36(21):5177–5186. [DOI] [PubMed] [Google Scholar]