XGBoost framework with feature selection for the prediction of RNA N5-methylcytosine sites

Zeeshan Abbas; Mobeen ur Rehman; Hilal Tayara; Quan Zou; Kil To Chong

doi:10.1016/j.ymthe.2023.05.016

2023 Jun 3;31(8):2543–2551. doi: 10.1016/j.ymthe.2023.05.016

PMCID: PMC10422016 PMID: 37271991

Abstract

5-methylcytosine (m5C) is a critical post-transcriptional modification that is widely present in various kinds of RNAs and is crucial to fundamental biological processes. By correctly identifying m5C methylation sites on RNA, clinicians can better understand the precise function of these sites in different biological processes. Owing to their effectiveness and affordability, computational methods have received growing attention over the last few years for the identification of methylation sites in various species. To precisely identify RNA m5C sites in five different species, Homo sapiens, Arabidopsis thaliana, Mus musculus, Drosophila melanogaster, and Danio rerio, we propose a more effective and accurate model named m5C-pred. To create m5C-pred, five distinct feature encoding techniques were combined to extract features from the RNA sequence; we then used SHapley Additive exPlanations to choose the best features among them, followed by XGBoost as the classifier. We applied the Optuna optimization framework to determine the best hyperparameters quickly and efficiently. Finally, the proposed model was evaluated on independent test datasets, and the results were compared with those of previous methods. Our approach, m5C-pred, is anticipated to be useful for accurately identifying m5C sites, outperforming the currently available state-of-the-art techniques.

Keywords: 5-methylcytosine (m5C), Machine Learning, SHapley Additive exPlanations, Optuna, Sequence Analysis, Bioinformatics, Computational Biology, Explainable Biology, Classification, Methylation

Graphical abstract

Chong and colleagues studied 5-methylcytosine (m5C), which is a crucial post-transcriptional modification that plays a vital role in various biological processes. Accurate identification of m5C methylation sites can provide insight into their functional significance. The proposed machine learning based model, m5C-pred, can accurately identify m5C sites in a range of species.

Introduction

Transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), messenger RNAs (mRNAs), and non-coding RNAs have been found to carry more than 170 different types of RNA chemical modifications (RCMs).¹,²,³,⁴,⁵ Methyltransferases, RNA-binding proteins, and demethylases are the three regulating components that affect RCMs.¹,⁶,⁷ One of the most significant mRNA modifications is 5-methylcytosine (m5C), yet it can be difficult to identify these sites precisely. An RNA methyltransferase catalyzes the addition of a methyl group to the fifth position of the cytosine ring, forming m5C, as shown in Figure 1. By correctly identifying m5C sites, researchers can better understand the precise function of 5-cytosine methylation in biological processes. Numerous experimental techniques, including bisulfite sequencing,³,⁸ Aza-IP,⁹ m5C-RIP,¹⁰ RBS-seq,¹¹ mi-CLIP,¹² and Oxford Nanopore Technologies,¹³,¹⁴ have been designed to detect m5C sites in RNA. These techniques have shown some success in locating m5C sites in various species. However, they are expensive and laborious and often fall short of correctly identifying m5C sites because mRNA molecules are unstable.⁷,⁹,¹⁵,¹⁶ Computational methods may offer a quicker and more affordable alternative for identifying m5C modification sites.¹⁷,¹⁸,¹⁹,²⁰

Illustration of 5-methylcytosine (m5C) modification

Numerous computational techniques have been developed to date to identify m5C sites in many species, including Homo sapiens (H. sapiens), Arabidopsis thaliana (A. thaliana), Mus musculus (M. musculus), and Saccharomyces cerevisiae (S. cerevisiae). Feng et al.²¹ developed the first computational model, based on a support vector machine (SVM), to locate m5C sites, utilizing pseudo k-tuple nucleotide composition (PseKNC) encoding and relying on experimentally proven m5C sites in H. sapiens. Subsequently, Qiu et al.²² created the iRNAm5C-PseDNC predictor using random forest (RF) to identify m5C sites. After that, Zhang et al.²³ created another model, named m5CHPCR, by employing ensemble learning techniques. Using the same datasets as those of Feng et al.²¹ and Zhang et al.,²³ Sabooh et al.²⁴ created a new SVM-based model by combining composite encoding features such as di-nucleotide composition, trinucleotide composition (TNC), and tetra-nucleotide composition. A model named PEA-m5C, proposed by Song et al.²⁵ using RF, was concerned with predicting m5C sites in A. thaliana only; the authors tested their model on independent test datasets. For the prediction of m5C sites in eight different cell types of H. sapiens and M. musculus, Li et al.²⁶ compiled datasets from the Gene Expression Omnibus (GEO) database and created a web server based on an RF classifier, named RNAm5Cfinder. Lv et al.²⁷ proposed another model, named iRNA-m5C, using RF and multiple feature extraction techniques, including mono-nucleotide binary encoding, natural vector, K-tuple nucleotide frequency component, and PseKNC, for the prediction of m5C sites in H. sapiens, A. thaliana, M. musculus, and S. cerevisiae. Recently, Chai et al.²⁸ proposed a stacked ensemble approach for m5C-site prediction in A. thaliana and M. musculus that outperformed the previous techniques. They used five different machine learning classifiers (XGBoost, SVM, LightGBM, ExtraTree, and GBDT) with logistic regression as the meta-classifier and utilized the mlxtend package for model stacking. Although the approaches above recognize m5C sites in RNA sequences well, their performance can still be improved. Advances in feature extraction and model optimization techniques leave considerable room for improvement in these species.

In this work, we propose a new model, m5C-pred, to predict m5C sites in the RNA sequences of five distinct species: H. sapiens, A. thaliana, M. musculus, D. melanogaster, and D. rerio. As the foundation for the development of m5C-pred, five different feature extraction techniques were used: composition of k-spaced nucleic acid pairs (CKSNAP), nucleotide chemical property (NCP), label encoding (LE), electron-ion interaction pseudopotentials of trinucleotide (PseEIIP), and enhanced nucleic acid composition (ENAC). After combining the features of these techniques, we used SHAP (SHapley Additive exPlanations) to select the best features. Our approach was developed using an XGBoost classifier trained on the feature subset chosen by SHAP. Finally, the effectiveness of our methodology was evaluated against the previous approaches. The outcomes demonstrate that our strategy provides superior performance compared with these available methods.

Results

Predictive performance using different classifiers

We employed three different machine learning classifiers: XGBoost, light gradient boosting machine (LGBM), and SVM. Based on the chosen encoding strategies, and after picking the best features using SHAP, we fine-tuned the hyperparameters of each classifier using Optuna. Using the optimized parameters, we calculated the predictive performance of all three classifiers on H. sapiens, M. musculus, A. thaliana, D. melanogaster, and D. rerio, both with 10-fold cross-validation and on independent test datasets. Table S7 compares the achieved results and shows that the XGBoost classifier performs best among the three.

Furthermore, to show the importance of feature selection and hyperparameter optimization, we compared the results obtained with the original data, with hyperparameter optimization but without SHAP, and with both hyperparameter optimization and SHAP, using the best-performing classifier, XGBoost, as presented in Figure 4. Performance improves both in 10-fold cross-validation and in independent testing after feature selection using SHAP and hyperparameter optimization using Optuna. Compared with the original data (without SHAP and hyperparameter optimization), the accuracy and Matthews correlation coefficient (MCC) for H. sapiens improved by 4.75% and 9.2%, respectively, on 10-fold cross-validation and by 3.65% and 7.6%, respectively, on independent test data. Similarly, M. musculus improved by 2.56% and 5.13%, respectively, on 10-fold cross-validation and by 2.35% and 4.8%, respectively, on independent test data, and A. thaliana improved by 3.35% and 6.94%, respectively, on 10-fold cross-validation and by 4.05% and 8.1%, respectively, on independent test data. The improvement for D. melanogaster and D. rerio is also noticeable, as shown in Table S7.

Predictive performance comparison for *H. sapiens*, *M. musculus*, and *A. thaliana* on 10-fold cross-validation and independent testing using XGBoost

Comparison with previous best models

We compared the achieved results with those of the previous state-of-the-art techniques, including iRNA-m5C,²⁷ m5CPred-SVM,²⁹ m5Cpred-XS,³⁰ RNAm5Cfinder,²⁶ PEA-m5C,²⁵ and Staem5.²⁸ The results of iRNA-m5C for H. sapiens and M. musculus were calculated using the available webserver, and all the remaining results are quoted directly from the respective papers. Since Staem5 and PEA-m5C did not use the H. sapiens dataset, we have not included them in the H. sapiens comparison. PEA-m5C also did not use the M. musculus dataset, so it is excluded from that comparison as well. For the same reason, RNAm5Cfinder was excluded for A. thaliana. The independent test data were kept blind during the training process and were therefore used to compare the models, ensuring that the comparison remains unbiased.

Compared with m5CPred-SVM²⁹ for H. sapiens, the accuracy and MCC improved from 77.5% and 55.1% to 81.15% and 62.7%, respectively. Similarly, compared with the previous model Staem5,²⁸ the accuracy improved from 71.95% to 73.55% for M. musculus and from 73.7% to 76.15% for A. thaliana, while the MCC improved from 44.2% to 47.3% for M. musculus and from 47.4% to 52.3% for A. thaliana. Further, compared with m5Cpred-XS,³⁰ the accuracy and MCC for H. sapiens improved from 80.4% to 81.15% and from 62.0% to 62.7%, and those for M. musculus improved from 72.3% to 73.55% and from 46.0% to 47.3%, respectively. The performance on A. thaliana, however, was slightly lower: the accuracy and MCC achieved were 76.15% and 52.3%, compared with 77.2% and 54.5% for m5Cpred-XS. Figures 5, 6, and 7 compare the proposed model with all the previous best models for H. sapiens, M. musculus, and A. thaliana in terms of accuracy (Acc), MCC, sensitivity (Sn), specificity (Sp), and area under the curve (AUC). Missing values in these graphs were either not reported by the previous tools or are too small to be shown in the figure. As the D. melanogaster and D. rerio datasets are newly collected in this study, no computational tool is available in the literature for comparison; Table S7 shows the results achieved by the proposed model on these species. Overall, these results indicate that the proposed method predicts m5C sites more accurately than the previous state-of-the-art methods for H. sapiens and M. musculus and remains competitive for A. thaliana.

Comparison of *H. sapiens* with existing state-of-the-art methods on independent test datasets

Comparison of *M. musculus* with existing state-of-the-art methods on independent test datasets

Comparison of *A. thaliana* with existing state-of-the-art methods on independent test datasets

Cross-species validation

Since we have datasets from various species for training, it is essential to show that the model trained on a specific species could predict the m5C locations in another species. Therefore, we tested all other species using the independent test datasets while training on a specific species, and we present the results (accuracies) using the heatmap diagram in Figure 8.

Cross-species accuracies shown in percentages

The model was trained on the specific dataset mentioned on the y axis and tested on the specific independent dataset mentioned on the x axis.

Here, we can see that when the model is trained on the M. musculus dataset, testing on H. sapiens performs well, because mouse and human share high gene similarity and the M. musculus dataset contains many more sequences than the H. sapiens dataset. For the remaining cases, the model achieves the highest accuracy when it is tested on the species it was trained on. These results illustrate that the model is able to learn intrinsic genomic features that can be used to classify methylated and non-methylated sequences.

Discussion

In the proposed architecture, we tested nine different feature encoding techniques: CKSNAP, NCP, Kmer (k = 2), electron-ion interaction pseudopotentials value (EIIP), LE, electron-ion interaction pseudopotentials of trinucleotide (PseEIIP), binary, TNC, and ENAC. After trying different combinations, we found that the feature set obtained by integrating CKSNAP, NCP, LE, PseEIIP, and ENAC performs best. These five encoding schemes fall into three major groups: CKSNAP, NCP, and ENAC are based on nucleic acid composition, PseEIIP is based on EIIP, and LE assigns numerical labels to the nucleotides. After integrating these five encoding techniques, the best features were selected using SHAP; three different classifiers (XGBoost, LGBM, and SVM) were then applied, and their hyperparameters were tuned using Optuna.³¹ Figure 2 shows the complete architecture of the proposed framework.

Overview of m5C-pred, a prediction framework consisting of six major steps

(1) Dataset collection using GEO database. (2) Feature encoding using five different techniques. (3) Feature selection using SHAP. (4) Classification using XGBoost, using the Optuna-based optimized parameters. (5) Evaluation of the model. (6) Webserver creation.

Feature selection

We employed SHAP³² as a feature selection technique to eliminate the unnecessary features that come with high-dimensional input data and to identify the ideal feature subset. SHAP provides a metric of feature importance that assigns every feature a value based on how much it affects the prediction of the model. Assuming a total of F features, the Shapley value of a particular feature f can be calculated as

$$\phi_f(p) = \sum_{S \subseteq N \setminus \{f\}} \frac{|S|!\,(F - |S| - 1)!}{F!} \left( p(S \cup \{f\}) - p(S) \right),$$

(Equation 1)

where p is the prediction of the proposed model, N is the set of all F features, and S is a feature subset that does not contain the feature f.

In essence, this technique computes the output prediction of the proposed model with and without the particular feature f and then evaluates the difference between the two:

$$\text{importance of } f = p(\text{with } f) - p(\text{without } f)$$

(Equation 2)
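The weighted sum in Equation 1 can be illustrated with a brute-force computation. The following is a minimal sketch, not the SHAP library's implementation (which relies on faster, model-specific algorithms): the function name `shapley_values` and the additive toy model `toy_predict` are assumptions introduced only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, predict):
    """Exact Shapley value of every feature for a set function `predict`.

    `predict(S)` must return the model output when only the features in
    the frozenset S are available (hypothetical interface).
    """
    F = len(features)
    values = {}
    for f in features:
        others = [g for g in features if g != f]
        phi = 0.0
        for size in range(F):  # |S| ranges over 0 .. F-1
            weight = factorial(size) * factorial(F - size - 1) / factorial(F)
            for subset in combinations(others, size):
                S = frozenset(subset)
                # marginal contribution of f when added to S
                phi += weight * (predict(S | {f}) - predict(S))
        values[f] = phi
    return values


def toy_predict(S):
    """Hypothetical additive model: feature 'c' contributes nothing."""
    contribution = {"a": 0.5, "b": 0.2, "c": 0.0}
    return sum(contribution[g] for g in S)


phi = shapley_values(["a", "b", "c"], toy_predict)
# keep only the features with non-zero importance
kept = [f for f, v in phi.items() if abs(v) > 1e-12]
```

For an additive model such as this one, each Shapley value equals the feature's own contribution, so the feature with zero contribution is the one that gets dropped. The enumeration is exponential in F and is only practical for very small feature sets.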

Concatenating CKSNAP, NCP, LE, PseEIIP, and ENAC yields a total of 472 features. Using SHAP, we calculated the importance of each feature and removed all features with zero importance in the prediction, which leaves a different number of features for each species. The numbers of features used for training the models of H. sapiens, M. musculus, and A. thaliana were 256, 454, and 468, respectively. Reducing the number of features not only sped up the prediction process but also helped us to optimize the model and achieve higher performance. In addition, SHAP has the appealing property of making predictions easier to interpret. Figure 3 depicts the top 20 features based on the importance values calculated using SHAP for H. sapiens, M. musculus, and A. thaliana, respectively.

20 features with the highest importance value calculated by SHAP algorithm for *H. sapiens*, *M. musculus*, and *A. thaliana*

Features 1 to 96 represent CKSNAP, features 97 to 219 represent NCP, features 220 to 260 represent LE, features 261 to 324 represent PseEIIP, and features 325 to 472 represent ENAC.
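When reading feature indices off the SHAP ranking, the index ranges above can be turned into a small lookup. This is an illustrative sketch; the names `ENCODING_RANGES`, `encoding_of`, and `group_sizes` are hypothetical. The group sizes (96, 123, 41, 64, and 148) are consistent with a 41-nucleotide input sequence, which is an inference from the encoding definitions rather than something stated explicitly.

```python
# 1-based, inclusive feature index ranges as listed in the caption
ENCODING_RANGES = [
    ("CKSNAP", 1, 96),
    ("NCP", 97, 219),
    ("LE", 220, 260),
    ("PseEIIP", 261, 324),
    ("ENAC", 325, 472),
]


def encoding_of(feature_index):
    """Return the encoding scheme that produced a 1-based feature index."""
    for name, first, last in ENCODING_RANGES:
        if first <= feature_index <= last:
            return name
    raise ValueError(f"feature index {feature_index} is outside 1..472")


def group_sizes():
    """Number of features contributed by each encoding scheme."""
    return {name: last - first + 1 for name, first, last in ENCODING_RANGES}


sizes = group_sizes()
```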

Hyperparameter optimization using Optuna

Hyperparameters are the settings that regulate how a machine or deep learning model learns, and optimizing them is one of the most fundamental ways to enhance the performance of computational models. Traditional methods for hyperparameter optimization, such as grid search, become extremely time-consuming and struggle to locate minima as the hyperparameter space and the volume of data grow. To perform hyperparameter optimization quickly and effectively, Akiba et al.³¹ developed a framework called Optuna. Optuna optimizes hyperparameters by combining sampling and pruning algorithms, and it enables the dynamic construction and manipulation of hyperparameter search spaces. Pruning refers to the early termination of unpromising trials during hyperparameter tuning: Optuna checks the learning curve of each trial at regular intervals, identifies parameter sets that are unlikely to produce a satisfactory outcome, and stops them. In this manner, the best set of parameters is identified for use by the prediction model. After feature selection, we used this approach to further optimize the model. The parameters and their search ranges for each classifier are shown in Table S1, while Tables S2–S6 show the best parameters selected by Optuna for H. sapiens, M. musculus, A. thaliana, D. melanogaster, and D. rerio, respectively.
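The interplay of sampling and pruning can be sketched in plain Python. The block below does not use the Optuna library and is not the configuration used in this study: it substitutes simple random sampling and a median stopping rule as a simplified stand-in, and the search ranges, the learning curve `toy_curve`, and all function names are hypothetical.

```python
import math
import random
from statistics import median


def sample_params(rng):
    """Draw one hyperparameter configuration (ranges are hypothetical)."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
        "max_depth": rng.randint(2, 10),
    }


def toy_curve(params, step, n_steps=5):
    """Hypothetical learning curve: the score grows with the step and peaks
    near learning_rate = 0.1 and max_depth = 6."""
    quality = (1.0
               - abs(math.log10(params["learning_rate"]) + 1) / 3
               - abs(params["max_depth"] - 6) / 20)
    return quality * (step + 1) / n_steps


def search(n_trials=20, n_steps=5, seed=0):
    """Random search with a median stopping rule as the pruner."""
    rng = random.Random(seed)
    history = [[] for _ in range(n_steps)]  # intermediate scores per step
    best_params, best_score, n_pruned = None, float("-inf"), 0
    for _ in range(n_trials):
        params = sample_params(rng)
        pruned = False
        for step in range(n_steps):
            score = toy_curve(params, step, n_steps)
            earlier = history[step]
            # stop the trial if it lags behind the median of earlier trials
            if len(earlier) >= 3 and score < median(earlier):
                pruned = True
            earlier.append(score)
            if pruned:
                break
        if pruned:
            n_pruned += 1
        elif score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_pruned


best_params, best_score, n_pruned = search()
```

Because each toy learning curve is proportional to a fixed quality value, the best configuration is never below the median and therefore is never pruned; with real, noisy learning curves a pruner trades a small risk of discarding a late-improving trial for a large saving in compute.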

Materials and methods

Benchmark datasets

High-quality benchmark datasets are crucial for developing and testing any machine or deep learning model. In this work, high-quality m5C datasets of five different species were assembled. The datasets of H. sapiens, M. musculus, and A. thaliana were compiled from the literature: the H. sapiens dataset was constructed by Khoddami et al.¹¹ using the GEO database with the accession number GSE:93751, the M. musculus dataset by Yang et al.⁷ using GSE:93751, and the A. thaliana dataset by Lv et al.²⁷ using GSE:94065. For D. melanogaster and D. rerio, we collected data from m5C-Atlas.³³ Sequences sharing more than 70% similarity were eliminated from the datasets using the CD-HIT tool³⁴ to prevent redundancy bias.

Typically, the benchmark dataset is split into two sections, training data and independent test data. Table S9 summarizes the number of positive and negative samples existing in each training and testing dataset.

Encoding techniques

Feature encoding is incredibly important in the process of building any machine or deep learning model. Five different encoding schemes were selected in this study to make the data readable for the machine, as shown below.

Composition of k-spaced nucleic acid pairs

The CKSNAP feature encoding determines how frequently nucleic acid pairs are separated by any k nucleic acids, where k-value ranges from 0 to 5. For instance, if we take k = 0, we acquire 16 nucleic acid pairs with zero spacing (i.e., AA, AG, AC, CT, CC, CA, TG, AT, CG, GC, GG, GT, TA, TT, GA, TC). A feature vector can thus be expressed as follows:

$$\left( \frac{N_{AA}}{N_{total}}, \frac{N_{AT}}{N_{total}}, \frac{N_{AC}}{N_{total}}, \dots, \frac{N_{TT}}{N_{total}} \right)_{16}$$

(Equation 3)

The value of each descriptor indicates the proportion of the associated nucleic acid pair in the nucleotide sequence. Here, $N_{total}$ stands for the total number of k-spaced nucleic acid pairs in the given sequence, and $N_{AA}$ stands for the number of occurrences of the nucleic acid pair AA. For a sequence of length L, $N_{total}$ = L − k − 1, that is, L − 1, L − 2, L − 3, L − 4, L − 5, and L − 6 for k = 0, 1, 2, 3, 4, and 5, respectively.
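The CKSNAP computation can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `cksnap` is hypothetical, pairs are enumerated in lexicographic order, and the RNA alphabet ACGU is used by default (pass `alphabet="ACGT"` for sequences written with T).

```python
from itertools import product


def cksnap(seq, k_max=5, alphabet="ACGU"):
    """Frequencies of the 16 nucleotide pairs separated by k = 0..k_max bases.

    Returns 16 * (k_max + 1) values; the sequence must be longer than
    k_max + 1 so that every spacing has at least one pair.
    """
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    features = []
    for k in range(k_max + 1):
        n_total = len(seq) - k - 1  # number of k-spaced pairs
        counts = dict.fromkeys(pairs, 0)
        for i in range(n_total):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip pairs containing unknown symbols
                counts[pair] += 1
        features.extend(counts[p] / n_total for p in pairs)
    return features


# hypothetical 41-nt example sequence
example_seq = "ACGU" * 10 + "A"
cksnap_vector = cksnap(example_seq)
```

With k ranging from 0 to 5, the vector has 16 × 6 = 96 entries, matching the number of CKSNAP features in the concatenated feature set.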

Nucleotide chemical property

Adenine (A), cytosine (C), guanine (G), and uracil (U) are the four types of nucleotides that make up the RNA. Every nucleotide has a unique chemical binding and a particular chemical structure. According to these chemical characteristics, the four main types of nucleotides can be divided into three categories, as indicated in Table S10.

According to their chemical characteristics, A can be expressed as (1, 1, 1), G as (1, 0, 0), C as (0, 1, 0), and U as (0, 0, 1).
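A minimal sketch of this encoding, using the triplets given above, is shown below; the names `NCP` and `ncp_encode` are chosen here for illustration only.

```python
# chemical-property triplet of each nucleotide, as given in the text
NCP = {"A": (1, 1, 1), "G": (1, 0, 0), "C": (0, 1, 0), "U": (0, 0, 1)}


def ncp_encode(seq):
    """Concatenate the three-value code of every nucleotide in the sequence."""
    features = []
    for base in seq:
        features.extend(NCP[base])
    return features


ncp_example = ncp_encode("ACGU")
```

Each nucleotide contributes three values, so a sequence of length L yields a 3L-dimensional vector.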

Label encoding

LE is the process of transforming the nucleotides into a numerical format to make them understandable for the model. We simply assigned 1, 2, 3, and 4 to A, U, C, and G, respectively.
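As a short illustration of this mapping (the names `LABELS` and `label_encode` are hypothetical):

```python
# integer label of each nucleotide, as assigned in the text
LABELS = {"A": 1, "U": 2, "C": 3, "G": 4}


def label_encode(seq):
    """Map every nucleotide to its integer label, one value per position."""
    return [LABELS[base] for base in seq]


le_example = label_encode("GAUC")
```

The encoded vector has the same length as the input sequence.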

Electron-ion interaction pseudopotentials of trinucleotide

In EIIP the values of each nucleotide in RNA sequences are given as A: 0.1260, C: 0.1340, G: 0.0806, and U: 0.1335. Assuming EIIP_A, EIIP_C, EIIP_G, and EIIP_U represent EIIP values for A, C, G, and U, the PseEIIP value of trinucleotides can be calculated as

$$V = \left[ EIIP_{AAA} \cdot f_{AAA},\ EIIP_{AAC} \cdot f_{AAC},\ \dots,\ EIIP_{UUU} \cdot f_{UUU} \right]$$

(Equation 4)

Here, $f_{xyz}$ stands for the normalized frequency of the trinucleotide xyz, and $EIIP_{xyz} = EIIP_x + EIIP_y + EIIP_z$, where x, y, and z belong to the RNA nucleotides {A, C, G, U}.
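The PseEIIP vector can be sketched as below, using the EIIP values listed above. The function name `pse_eiip` and the lexicographic ordering of the 64 trinucleotides are assumptions made for illustration; frequencies are normalized by the number of overlapping trinucleotides, L − 2.

```python
from itertools import product

# EIIP value of each RNA nucleotide, as given in the text
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}


def pse_eiip(seq):
    """EIIP-weighted normalized frequency of each of the 64 trinucleotides."""
    trinucleotides = ["".join(t) for t in product("ACGU", repeat=3)]
    n_total = len(seq) - 2  # number of overlapping trinucleotides
    counts = dict.fromkeys(trinucleotides, 0)
    for i in range(n_total):
        tri = seq[i:i + 3]
        if tri in counts:  # skip trinucleotides with unknown symbols
            counts[tri] += 1
    return [
        sum(EIIP[base] for base in tri) * counts[tri] / n_total
        for tri in trinucleotides
    ]


pse_example = pse_eiip("AAAA")
```

For the sequence AAAA, the only trinucleotide present is AAA with a frequency of 1, so the first entry equals 3 × 0.1260 and all other entries are zero.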

Enhanced nucleic acid composition

The nucleic acid composition (NAC) encoding computes the occurrence of each nucleotide in a nucleotide sequence using the equation:

$$f(x) = \frac{N(x)}{N}, \quad x \in \{A, C, G, U\}$$

(Equation 5)

Here, N is the length of the sequence, and N(x) is the number of x nucleotides.

ENAC computes the NAC within a sliding window of fixed length that moves continuously from the 5′ to the 3′ terminal. For a sequence of length L and a window size W, the number of sliding windows is L − W + 1, and the dimension of the encoded vector is (L − W + 1) × 4. To select a suitable window size, we computed results for different values of W; these are reported in Table S8. Although different values of W perform well for different species, W = 5 consistently gives better results and was therefore chosen.
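The sliding-window computation can be sketched as follows; the function name `enac` and the base order A, C, G, U within each window are assumptions made for illustration.

```python
def enac(seq, window=5, alphabet="ACGU"):
    """Nucleotide composition inside every sliding window of fixed size.

    Returns (len(seq) - window + 1) * 4 values, four per window.
    """
    features = []
    for start in range(len(seq) - window + 1):
        win = seq[start:start + window]
        features.extend(win.count(base) / window for base in alphabet)
    return features


# hypothetical 41-nt example sequence
enac_vector = enac("ACGU" * 10 + "A")
```

For a 41-nucleotide sequence and W = 5 there are 37 windows, giving 37 × 4 = 148 values, which matches the number of ENAC features in the concatenated feature set.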

Performance evaluation metrics

Five metrics, namely Acc, MCC, Sn, Sp, and AUC, are typically used in binary classification to assess how well a model performs. The first four can be defined mathematically as follows:

$$\mathrm{Acc} = 1 - \frac{S_{-}^{+} + S_{+}^{-}}{S^{+} + S^{-}}$$

(Equation 6)

$$\mathrm{MCC} = \frac{1 - \left( \frac{S_{-}^{+}}{S^{+}} + \frac{S_{+}^{-}}{S^{-}} \right)}{\sqrt{\left( 1 + \frac{S_{+}^{-} - S_{-}^{+}}{S^{+}} \right) \left( 1 + \frac{S_{-}^{+} - S_{+}^{-}}{S^{-}} \right)}}$$

(Equation 7)

$$\mathrm{Sn} = 1 - \frac{S_{-}^{+}}{S^{+}}$$

(Equation 8)

$$\mathrm{Sp} = 1 - \frac{S_{+}^{-}}{S^{-}}$$

(Equation 9)

Here, $S^{+}$ and $S^{-}$ denote the total numbers of positive and negative sequences, respectively, while $S_{-}^{+}$ is the number of positive sequences incorrectly predicted as negative, and $S_{+}^{-}$ is the number of negative sequences incorrectly predicted as positive.
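Equations 6 to 9 translate directly into code. The sketch below uses hypothetical names (`evaluate`, `n_pos`, `n_neg`, `fn`, `fp`), where `fn` corresponds to $S_{-}^{+}$ and `fp` to $S_{+}^{-}$, and the counts in the example are made up for illustration.

```python
from math import sqrt


def evaluate(n_pos, n_neg, fn, fp):
    """Acc, MCC, Sn, and Sp from class totals and misclassification counts.

    n_pos, n_neg: total positive / negative sequences
    fn: positives predicted as negative; fp: negatives predicted as positive
    """
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    mcc = (1 - (fn / n_pos + fp / n_neg)) / sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg)
    )
    return {"Acc": acc, "MCC": mcc, "Sn": sn, "Sp": sp}


# made-up example: 100 positives (20 missed), 100 negatives (10 false alarms)
scores = evaluate(n_pos=100, n_neg=100, fn=20, fp=10)
```

This form of the MCC is algebraically equivalent to the more familiar confusion-matrix expression (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)), with TP = S⁺ − fn and TN = S⁻ − fp.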

Data availability

A user-friendly web server is available for researchers at http://nsclbio.jbnu.ac.kr/tools/m5C-pred/, and the data and code used in this study are available on GitHub: https://github.com/Z-Abbas/m5C-pred.

Acknowledgments

This work was supported in part by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2020R1A2C2005612).

Author contributions

Z.A.: conceptualization, methodology, software, writing – original draft, and writing – review and editing. M.U.R.: conceptualization, methodology, software, writing – original draft, and writing – review and editing. H.T.: conceptualization, validation, supervision, and writing – review and editing. Q.Z.: conceptualization, validation, supervision, and writing – review and editing. K.T.C.: conceptualization, validation, supervision, writing – review and editing, and funding acquisition.

Declaration of interests

The authors declare no competing interests.

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.ymthe.2023.05.016.

Contributor Information

Hilal Tayara, Email: hilaltayara@jbnu.ac.kr.

Quan Zou, Email: zouquan@nclab.net.

Kil To Chong, Email: kitchong@jbnu.ac.kr.

Supplemental information

Document S1. Figures S1–S5 and Tables S1–S10

Document S2. Article plus supplemental information

References

1. Frye M., Harada B.T., Behm M., He C. RNA modifications modulate gene expression during development. Science. 2018;361:1346–1349. doi: 10.1126/science.aau1646.
2. Xuan J.-J., Sun W.-J., Lin P.-H., Zhou K.-R., Liu S., Zheng L.-L., Qu L.-H., Yang J.-H. RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res. 2018;46:D327–D334. doi: 10.1093/nar/gkx934.
3. Squires J.E., Patel H.R., Nousch M., Sibbritt T., Humphreys D.T., Parker B.J., Suter C.M., Preiss T. Widespread occurrence of 5-methylcytosine in human coding and non-coding RNA. Nucleic Acids Res. 2012;40:5023–5033. doi: 10.1093/nar/gks144.
4. Boccaletto P., Machnicka M.A., Piątkowski P., Bagiński B., de Crécy-Lagard V., Ross R., Limbach P.A., Kotter A., et al. MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res. 2018;46:D303–D307. doi: 10.1093/nar/gkx1030.
5. Dubin D.T., Taylor R.H. The methylation state of poly A-containing-messenger RNA from cultured hamster cells. Nucleic Acids Res. 1975;2:1653–1668. doi: 10.1093/nar/2.10.1653.
6. Zheng G., Dahl J.A., Niu Y., Fedorcsak P., Huang C.-M., Li C.J., Vågbø C.B., Shi Y., Wang W.-L., Song S.-H., et al. Alkbh5 is a mammalian RNA demethylase that impacts RNA metabolism and mouse fertility. Mol. Cell. 2013;49:18–29. doi: 10.1016/j.molcel.2012.10.015.
7. Yang X., Yang Y., Sun B.-F., Chen Y.-S., Xu J.-W., Lai W.-Y., Li A., Wang X., Bhattarai D.P., Xiao W., et al. 5-methylcytosine promotes mRNA export – NSUN2 as the methyltransferase and ALYREF as an m5C reader. Cell Res. 2017;27:606–625. doi: 10.1038/cr.2017.55.
8. Schaefer M., Pollex T., Hanna K., Tuorto F., Meusburger M., Helm M., Lyko F. RNA methylation by Dnmt2 protects transfer RNAs against stress-induced cleavage. Genes Dev. 2010;24:1590–1595. doi: 10.1101/gad.586710.
9. Khoddami V., Cairns B.R. Identification of direct targets and modified bases of RNA cytosine methyltransferases. Nat. Biotechnol. 2013;31:458–464. doi: 10.1038/nbt.2566.
10. Edelheit S., Schwartz S., Mumbach M.R., Wurtzel O., Sorek R. Transcriptome-wide mapping of 5-methylcytidine RNA modifications in bacteria, archaea, and yeast reveals m5C within archaeal mRNAs. PLoS Genet. 2013;9:e1003602. doi: 10.1371/journal.pgen.1003602.
11. Khoddami V., Yerra A., Mosbruger T.L., Fleming A.M., Burrows C.J., Cairns B.R. Transcriptome-wide profiling of multiple RNA modifications simultaneously at single-base resolution. Proc. Natl. Acad. Sci. USA. 2019;116:6784–6789. doi: 10.1073/pnas.1817334116.
12. Hussain S., Sajini A.A., Blanco S., Dietmann S., Lombard P., Sugimoto Y., Paramor M., Gleeson J.G., Odom D.T., Ule J., Frye M. NSUN2-mediated cytosine-5 methylation of vault noncoding RNA determines its processing into regulatory small RNAs. Cell Rep. 2013;4:255–261. doi: 10.1016/j.celrep.2013.06.029.
13. Zhang Y., Jiang J., Ma J., Wei Z., Wang Y., Song B., Meng J., Jia G., de Magalhães J.P., Rigden D.J., et al. DirectRMDB: a database of post-transcriptional RNA modifications unveiled from direct RNA sequencing technology. Nucleic Acids Res. 2023;51:D106–D116. doi: 10.1093/nar/gkac1061.
14. Jenjaroenpun P., Wongsurawat T., Wadley T.D., Wassenaar T.M., Liu J., Dai Q., Wanchai V., Akel N.S., Jamshidi-Parsian A., Franco A.T., et al. Decoding the epitranscriptional landscape from native RNA sequences. Nucleic Acids Res. 2021;49:e7. doi: 10.1093/nar/gkaa620.
15. Song Z., Huang D., Song B., Chen K., Song Y., Liu G., Su J., Magalhães J.P.d., Rigden D.J., Meng J. Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications. Nat. Commun. 2021;12:4011. doi: 10.1038/s41467-021-24313-3.
16. Hussain S., Tuorto F., Menon S., Blanco S., Cox C., Flores J.V., Watt S., Kudo N.R., Lyko F., Frye M. The mouse cytosine-5 RNA methyltransferase NSUN2 is a component of the chromatoid body and required for testis differentiation. Mol. Cell Biol. 2013;33:1561–1570. doi: 10.1128/mcb.01523-12.
17. Furlan M., Delgado-Tejedor A., Mulroney L., Pelizzola M., Novoa E.M., Leonardi T. Computational methods for RNA modification detection from nanopore direct RNA sequencing data. RNA Biol. 2021;18:31–40. doi: 10.1080/15476286.2021.1978215.
18. Yan C., Zhang Z., Bao S., Hou P., Zhou M., Xu C., Sun J. Computational methods and applications for identifying disease-associated lncRNAs as potential biomarkers and therapeutic targets. Mol. Ther. Nucleic Acids. 2020;21:156–171. doi: 10.1016/j.omtn.2020.05.018.
19. Shi J., Cui Q. sTAM: an online tool for the discovery of miRNA-set level disease biomarkers. Mol. Ther. Nucleic Acids. 2020;21:670–675. doi: 10.1016/j.omtn.2020.07.004.
20. He Z., Xu J., Shi H., Wu S. m5CRegpred: epitranscriptome target prediction of 5-methylcytosine (m5C) regulators based on sequencing features. Genes. 2022;13:677. doi: 10.3390/genes13040677.
21. Feng P., Ding H., Chen W., Lin H. Identifying RNA 5-methylcytosine sites via pseudo nucleotide compositions. Mol. Biosyst. 2016;12:3307–3311. doi: 10.1039/c6mb00471g.
22. Qiu W.-R., Jiang S.-Y., Xu Z.-C., Xiao X., Chou K.-C. iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget. 2017;8:41178–41188. doi: 10.18632/oncotarget.17104.
23. Zhang M., Xu Y., Li L., Liu Z., Yang X., Yu D.-J. Accurate RNA 5-methylcytosine site prediction based on heuristic physical-chemical properties reduction and classifier ensemble. Anal. Biochem. 2018;550:41–48. doi: 10.1016/j.ab.2018.03.027.
24. Sabooh M.F., Iqbal N., Khan M., Khan M., Maqbool H.F. Identifying 5-methylcytosine sites in RNA sequence using composite encoding feature into Chou's PseKNC. J. Theor. Biol. 2018;452:1–9. doi: 10.1016/j.jtbi.2018.04.037.
25. Song J., Zhai J., Bian E., Song Y., Yu J., Ma C. Transcriptome-wide annotation of m5C RNA modifications using machine learning. Front. Plant Sci. 2018;9:519. doi: 10.3389/fpls.2018.00519.
26. Li J., Huang Y., Yang X., Zhou Y., Zhou Y. RNAm5Cfinder: a web-server for predicting RNA 5-methylcytosine (m5C) sites based on Random Forest. Sci. Rep. 2018;8:17299. doi: 10.1038/s41598-018-35502-4.
27. Lv H., Zhang Z.-M., Li S.-H., Tan J.-X., Chen W., Lin H. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief. Bioinform. 2020;21:982–995. doi: 10.1093/bib/bbz048.
28. Chai D., Jia C., Zheng J., Zou Q., Li F. Staem5: a novel computational approach for accurate prediction of m5C site. Mol. Ther. Nucleic Acids. 2021;26:1027–1034. doi: 10.1016/j.omtn.2021.10.012.
29. Chen X., Xiong Y., Liu Y., Chen Y., Bi S., Zhu X. m5CPred-SVM: a novel method for predicting m5C sites of RNA. BMC Bioinformatics. 2020;21:489. doi: 10.1186/s12859-020-03828-4.
30. Liu Y., Shen Y., Wang H., Zhang Y., Zhu X. m5Cpred-XS: a new method for predicting RNA m5C sites based on XGBoost and SHAP. Front. Genet. 2022;13:853258. doi: 10.3389/fgene.2022.853258.
31. Akiba T., Sano S., Yanase T., Ohta T., Koyama M. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019. Optuna: a next-generation hyperparameter optimization framework.
32. Lundberg S.M., Lee S.-I. Vol. 30. Curran Associates, Inc.; 2017. A unified approach to interpreting model predictions; pp. 4765–4774. (Advances in Neural Information Processing Systems).
33. Ma J., Song B., Wei Z., Huang D., Zhang Y., Su J., de Magalhães J.P., Rigden D.J., Meng J., Chen K. m5C-Atlas: a comprehensive database for decoding and annotating the 5-methylcytosine (m5C) epitranscriptome. Nucleic Acids Res. 2022;50:D196–D203. doi: 10.1093/nar/gkab1075.
34. Fu L., Niu B., Zhu Z., Wu S., Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–3152. doi: 10.1093/bioinformatics/bts565.

Associated Data

Supplementary Materials

Document S1. Figures S1–S5 and Tables S1–S10

mmc1.pdf (1.3 MB, PDF)

Document S2. Article plus supplemental information

mmc2.pdf (4.4 MB, PDF)

