Skip to main content
Turkish Journal of Biology logoLink to Turkish Journal of Biology
. 2019 Aug 5;43(4):274–280. doi: 10.3906/biy-1904-59

MicroRNA prediction based on 3D graphical representation of RNA secondary structures

Müşerref Duygu SAÇAR DEMİRCİ 1
PMCID: PMC6713912  PMID: 31582883

Abstract

MicroRNAs (miRNAs) are posttranscriptional regulators of gene expression. While a miRNA can target hundreds of messenger RNA (mRNAs), an mRNA can be targeted by different miRNAs, not to mention that a single miRNA might have various binding sites in an mRNA sequence. Therefore, it is quite involved to investigate miRNAs experimentally. Thus, machine learning (ML) is frequently used to overcome such challenges. The key parts of a ML analysis largely depend on the quality of input data and the capacity of the features describing the data. Previously, more than 1000 features were suggested for miRNAs. Here, it is shown that using 36 features representing the RNA secondary structure and its dynamic 3D graphical representation provides up to 98% accuracy values. In this study, a new approach for ML-based miRNA prediction is proposed. Thousands of models are generated through classification of known human miRNAs and pseudohairpins with 3 classifiers: decision tree, naïve Bayes, and random forest. Although the method is based on human data, the best model was able to correctly assign 96% of nonhuman hairpins from MirGeneDB, suggesting that this approach might be useful for the analysis of miRNAs from other species.

Keywords: MicroRNA, RNA structure, machine learning, random forest, decision tree, naïve Bayes

1. Introduction

Ribonucleic acid (RNA) is a major player in many cellular processes and for some organisms it is the source of genetic information. Not only the sequences but also the structures of the RNA molecules have great importance. There are three main levels of RNA structure: primary (base sequence), secondary (based on base pairs, e.g., hairpins or the cloverleaf structure of transfer RNA (tRNA)), and tertiary (interactions between secondary structure elements) (Batey et al., 1999).The RNA secondary structure is formed by hydrogen bonds between base pairs A-U and G-C (G-U pairing is often observed) (Varani and McClain, 2000). However, these bases and pairings do not have the same strength. The four bases can be divided into several classes, such as based on the strength of the hydrogen bond (weak H-bonds (A, U) and strong H-bonds (G, C)), based on the amino group (A, C) and keto group (G, U), and according to chemical structures of purine (A, G) and pyrimidine (C, U). By using the properties of bases and pairing information, various methods aiming to measure RNA similarity have been proposed. Some of these approaches are based on graphical representation of RNA 2D structure, which might suffer from the loss of information (Zhang et al., 2016). On the other hand, methods developed for 3D graphical representation of RNA secondary structures use sequence, chemical, and structural information. The method developed by Zhang et al. (2016) for dynamic 3D graphical representation for RNA structure analysis seems to be performing better than other approaches.In recent years, the small, noncoding RNAs known as microRNAs (miRNAs) that regulate posttranscriptional gene expression have been studied extensively. There are various reasons for miRNAs’ popularity. For instance, a wide range of organisms produce miRNAs and there are some reports about their involvement in host-parasite interactions (Saçar Demirci et al., 2016; Acar et al., 2018). Moreover, many disease phenotypes are associated with miRNAs, and it is possible to use miRNAs as disease markers and new therapeutic agents (Avci and Baran, 2014; Tüfekci et al., 2014). However, considering the capacity of a eukaryotic genome to produce miRNA precursors, it is a difficult task to distinguish new miRNAs experimentally. As a result, designing and employing computational approaches for miRNA analysis have become essential subjects.In addition to the capacity of single-stranded RNAs forming secondary structures by self-folding, miRNAs have a characteristic hairpin structure so that they can be recognized and modified by miRNA biogenesis machinery elements (Kozlowski et al., 2008). Thus, miRNA prediction analyses usually require information from primary and secondary structures. Unfortunately, this hairpin structure is not a unique property of miRNAs (Roden et al., 2017). The majority of tools designed to determine if a given sequence is miRNA are based on the application of machine learning (ML) (Saçar Demirci et al., 2017). Although ML is quite powerful and advantageous for miRNA studies, there are some essential points to consider for an efficient analysis such as data quality, feature selection, and ML algorithm selection (Saçar Demirci and Allmer, 2017a). In this paper, for the first time, a ML framework for miRNA prediction based on the 3D representation of known miRNA precursors and pseudohairpins is proposed. The method is developed and tested based on human miRNA data, but it is possible to apply and/or extend it for other organisms as well.

2. Materials and methods

Identification of miRNA hairpins is usually achieved by using 2-class classification-based ML approaches. In order to create models and test the effect of these models, different datasets were obtained and various classification algorithms were used in a workflow system.

2.1. Data

Sequence datasets that were used in training and testing were as follows:- 1917 human precursors (miRBase Release 22) (Kozomara and Griffiths-Jones, 2014) - learning data- 587 human precursors (MirGeneDB 2.0) (Fromm et al., 2018) - learning data- 7701 nonhuman precursors (MirGeneDB 2.0) (Fromm et al., 2018) - testing data- 8492 pseudohairpins (Ng and Mishra, 2007) - learning data

2.2. 3D graphical representation of RNA secondary structures

Secondary structures of RNA sequences were obtained by using RNAfold (Hofacker, 2003) with default settings. The best structure for each sequence was selected based on minimum free energy values (Figure 1). According to the dot-bracket (nonbonding and bonding bases, respectively) representation of 2D structures, bases in the sequence were modified as uppercase and lowercase characters. These sequences were then used as input to the RnaFeatureGenerator software (Zhang et al., 2016) to produce 36D vectors characterizing RNA secondary structures (Figure 1).

Figure 1.

Figure 1

Representation of sequence and secondary structure of hsa-let-7a-1. mfe: Minimum free energy.

2.3. Data mining

The Konstanz Information Miner (KNIME) (Berthold et al., 2008) platform was used for the data mining analysis. Datasets containing 36D vectors were loaded and used for classification. For learning, 3 classifiers, random forest, decision tree, and naïve Bayes, were trained with human miRNAs from miRGeneDB as positive and pseudohairpins as negative examples (Figure 2). To avoid class imbalance, equally sized samples from both datasets were selected randomly, and 70% learning - 30% testing sets were applied with 1000-fold Monte Carlo cross-validation (Xu and Liang, 2001). The models from each classifier with the highest accuracy score were saved and used for further testing analysis. The same learning strategy was followed, where miRBase human precursor sequences were used as positive data. To test the performance of the model, nonhuman precursors from MirGeneDB were used. Datasets and the best models are available in the supplementary files (https://data.mendeley.com/datasets/dms72w9ckc/1).

Figure 2.

Figure 2

The basic workflow of the analysis. Hairpin sequences were folded into their secondary structures and based on the state of the bases (bonded or nonbonded) for each hairpin 3D features were calculated. Learning datasets were used for classification analysis with 1000-fold Monte Carlo cross-validation and the best models with the highest accuracy scores were applied to the test datasets for prediction.

3. Results

ML has been a popular choice for miRNA studies (Saçar Demirci et al., 2017). However, there are many factors affecting the performance of ML-based approaches (Saçar Demirci and Allmer, 2017a, 2017b). Here, not only a new workflow for miRNA precursor prediction is proposed, but also some crucial points for reliable results are investigated. For instance, the effect of positive dataset selection on the accuracy of learners is shown in Figures 3 and 4. The models trained with validated human miRNA sequences obtained from MiRGeneDB (Figure 3) have higher scores than models generated with miRBase (Figure 4). Graphs for other measures like precision, recall, specificity, sensitivity, and F-measure are provided in the supplementary files. Three classifiers were used simultaneously on the same datasets for learning. Among them, random forest showed better performance based on almost all of the measures (Figure 3 and 4 ; Supplementary Figures 1 and 2 ). Due to this clear difference, the random forest model with the highest accuracy score from Figure 3 was selected for further analysis. The next step was analyzing the capacity of the model on new datasets. According to results shown in Table 1and Supplementary Figure 3among 7701 miRNA precursors from 32 nonhuman species listed in MirGeneDB, around 4% were labeled as negative, meaning that even though the model was generated based on human data, it might also be applied to analyze miRNAs from other organisms. When the developed model was compared with some of the existing approaches using classification for miRNA prediction, as shown in Table 2results showed that the prediction accuracy of our method was greater than the Triplet-SVM, MiPred, MicroPred, and izMiR. The performance scores of sensitivity, specificity, and accuracy were taken from the articles of corresponding methods. Considering that all of these approaches were constructed by using differing parts such as types of classifiers, sampling methods, and datasets, comparison of their performances cannot be achieved based on their reported performance measurements. Nevertheless, the values are presented here to provide a general idea.

Figure 3.

Figure 3

Accuracies of classifiers when positive dataset is MirGeneDB human precursors. DT: Decision tree, NB: naïve Bayes, RF: random forest.

Figure 4.

Figure 4

Accuracies of classifiers when positive dataset is miRBase human precursors. DT: Decision tree, NB: naïve Bayes, RF: random forest.

Table 1.

Prediction results for other organisms’ miRNA hairpins. All organisms from MirGeneDB (except human) were included. MiRNA #shows total number of hairpins per species. Prediction column shows the number of miRNA and negative predictions, respectively. The table is sorted alphabetically for species.

Species Acronym MiRNA # Prediction
Anolis carolinensis Aca 261 244, 17
Alligator mississippiensis Ami 272 259, 13
Ascaris suum Asu 95 94, 1
Branchiostoma floridae Bfl 90 88, 2
Bos taurus Bta 433 418, 15
Caenorhabditis elegans Cel 139 135, 4
Canis familiaris Cfa 444 427, 17
Crassostrea gigas Cgi 150 145, 5
Columba livia Cli 246 237, 9
Chrysemys picta bellii Cpi 290 278, 12
Cavia porcellus Cpo 397 384, 13
Capitella teleta Cte 102 96, 6
Drosophila melanogaster Dme 152 133, 19
Dasypus novemcinctus Dno 373 362, 11
Daphnia pulex Dpu 79 67, 12
Danio rerio Dre 385 369, 16
Eisenia fetida Efe 192 177, 15
Echinops telfairi Ete 339 328, 11
Gallus gallus Gga 262 248, 14
Ixodes sp. Isc 56 52, 4
Lottia gigantea Lgi 80 79, 1
Macaca mulatta Mml 498 488, 10
Mus musculus Mmu 448 428, 20
Oryctolagus cuniculus Ocu 366 361, 5
Ptychodera flava Pfl 83 81, 2
Patiria miniata Pmi 58 54, 4
Rattus norvegicus Rno 413 394, 19
Sarcophilus harrissii Sha 417 409, 8
Saccoglossus kowalevskii Sko 83 80, 3
Strongylocentrotus purpuratus Spu 57 51, 6
Tribolium castaneum Tca 188 186, 2
Xenopus tropicalis Xtr 253 241, 12

Table 2.

Comparison of the model developed in this work with the existing classifiers. FN: Number of features used to build the classification model, ML: machine learning method, SE: sensitivity, SP: specificity, Acc: accuracy, SVM: support vector machine, NB: naïve Bayes, MLP: multilayered perceptron, RF: random forest, DT: decision tree.

Method FN ML SE SP Acc
Triplet-SVM (Xue et al., 2005) 32 SVM 93.30
MiPred (Jiang et al., 2007) 34 RF, SVM 98.21 95.09 96.68
MicroPred (Batuwita et al., 2009) 21 RF, SVM 90.02 97.28
izMiR (Saçar Demirci et al., 2017) ~900 SVM, NB, DT 91.98 91.98 91.25
3D model 36 RF, NB, DT 98.87 98.87 98.58

4. Discussion

The majority of tools developed for miRNA identification are based on ML; thus, they are affected by the challenges of ML. For instance, one of the most important criteria for a successful classification system is having high quality datasets (Saçar Demirci and Allmer, 2017b). For miRNA analysis, established miRNAs available in public databases, like miRBase and MirGeneDB, are used as positive data. Unfortunately, it is demanding to create a true negative dataset since it should have entries with similar characteristics to known miRNAs, but not too similar so that the algorithm can accurately discriminate between them. Hence, it is currently impossible to have a true validated negative dataset. The most popular negative dataset, known as pseudohairpins, was selected and used for this study. Not only negative datasets but also positive ones seem to need further improvement. Previously, it has been shown that some of the entries in miRBase are unlikely to be true miRNAs (Saçar et al., 2013). Moreover, the results presented here show that in terms of quality, human miRNAs listed in MirGeneDB are better than human miRNA entries in miRBase. Nevertheless, miRBase is the standard source providing miRNA sequence information from 286 organisms (Release 22). Various classification algorithms have been used for miRNA precursor predictions. In this work, three of those classifiers, random forest, decision tree, and naïve Bayes, were trained and tested with the same datasets. Models of the random forest classifier produced higher performance scores of accuracy (Figures 3 and 4), F-measure, recall, precision, sensitivity, and specificity (Supplementary Figures 1 and 2), consistent with our previous research (Saçar Demirci and Allmer, 2017a).For ML analyses, some parameters explaining the dataset are required. There are various features proposed and used for miRNAs and these features can be grouped into structural, sequence-based, probability-based, and thermodynamic parameters. In earlier studies, we implemented hundreds of such features but we found that about 50 features are usually adequate for building an effective ML model (Saçar and Allmer, 2013a, 2013b). However, calculating such features is computationally expensive, especially for a genome-wide miRNA search. Moreover, the selection of informative features is an important step that has a large impact on the overall model performance (Yousef et al., 2016). Thus, an alternative approach like using 3D graphical representation of RNA secondary structures as features describing miRNAs seems like a promising method. 2D and 3D representations of RNA sequences create a data matrix based on the structural information. Although such representations have been used for measuring RNA similarities and classifying viruses (Yao et al., 2005; Li et al., 2012), they are rarely applied for pre-miRNA analysis (Fu et al., 2018). The workflow developed in this study is the first example of application of 3D representations of RNAs for ML-based miRNA prediction. The results presented here imply that when these features are used on a high quality dataset, they are sufficient for building a successful model for miRNA analysis.

Supplementary Material

Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MiRBase human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MiRBase human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MiRBase human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Prediction performances on MirGeneDB data. Gray indicates miRNAs while red shows negatives. X-axis lists the acronyms of the organisms. Y-axis shows the number of precursors.

References

  1. Acar IE Saçar Demirci MD Groß U Allmer J The expressed MicroRNA-mRNA interactions of Toxoplasma gondii. Frontiers in Microbiology. 2018;8:2630–2630. doi: 10.3389/fmicb.2017.02630. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Avci CB Baran Y Use of microRNAs in personalized medicine. Methods in Molecular Biology. 2014;1107:311. doi: 10.1007/978-1-62703-748-8_19. [DOI] [PubMed] [Google Scholar]
  3. Batey RT Rambo RP Doudna JA ( 1999;38:2326. doi: 10.1002/(sici)1521-3773(19990816)38:16<2326::aid-anie2326>3.0.co;2-3. [DOI] [PubMed] [Google Scholar]
  4. Batuwita R Palade V microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics. 2009;25:989. doi: 10.1093/bioinformatics/btp107. [DOI] [PubMed] [Google Scholar]
  5. Berthold MR Cebron N Dill F Gabriel TR Kötter T KNIME: The Konstanz Information Miner. SIGKDD Explorations. 2008;11:319. [Google Scholar]
  6. Fromm B Domanska D Hackenberg M Mathelier A Hoye E 0: The Curated microRNA Gene Database. BioRxiv. 2018;258749 [Google Scholar]
  7. Fu X Liao B Zhu W Cai L New 3D graphical representation for RNA structure analysis and its application in the pre-miRNA identification of plants. RSC Advances. 2018;8:30833. doi: 10.1039/c8ra04138e. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Hofacker IL Vienna RNA secondary structure server. Nucleic Acids Research. 2003;31:3429. doi: 10.1093/nar/gkg599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Jiang P Wu H Wang W Ma W Sun X MiPred: Classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Research. 2017;35:339. doi: 10.1093/nar/gkm368. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Kozlowski P Starega-Roslan J Legacz M Magnus M Krzyzosiak WJ Structures of microRNA precursors. In: Ying SY (editor) 2008;10:1. [Google Scholar]
  11. Kozomara A Griffiths-Jones S miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research. 2014;42:D68. doi: 10.1093/nar/gkt1181. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Li Y Duan M Liang Y Multi-scale RNA comparison based on RNA triple vector curve representation. BMC Bioinformatics. 2012;13:280–280. doi: 10.1186/1471-2105-13-280. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Ng KLS Mishra SK De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics. 2007;23:1321. doi: 10.1093/bioinformatics/btm026. [DOI] [PubMed] [Google Scholar]
  14. Roden C Gaillard J Kanoria S Rennie W Barish S Novel determinants of mammalian primary microRNA processing revealed by systematic evaluation of hairpin-containing transcripts and human genetic variation. Genome Research. 2017;27:374. doi: 10.1101/gr.208900.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Saçar MD Allmer J Comparison of four Ab initio microRNA prediction tools. In: Bioinformatics. 2013a;10:190. [Google Scholar]
  16. Saçar MD Allmer J Data mining for microRNA gene prediction: on the impact of class imbalance and feature number for microRNA gene prediction. In: 2013 8th International Symposium on Health Informatics and Bioinformatics. 2013b;10 [Google Scholar]
  17. Saçar MD Hamzeiy H Allmer J Can MiRBase provide positive data for machine learning for the detection of MiRNA hairpins? Journal of Integrative Bioinformatics. 2013;10:215–215. doi: 10.2390/biecoll-jib-2013-215. [DOI] [PubMed] [Google Scholar]
  18. Saçar Demirci MD Allmer J Delineating the impact of machine learning elements in pre-microRNA detection. PeerJ. 2017a;5 doi: 10.7717/peerj.3131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Saçar Demirci MD Allmer J Improving the quality of positive datasets for the establishment of machine learning models for pre-microRNA detection. Journal of Integrative Bioinformatics. 2017b;14:0032–0032. doi: 10.1515/jib-2017-0032. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Saçar Demirci MD Baumbach J Allmer J On the performance of pre-microRNA detection algorithms. Nature Communications. 2017;8:330–330. doi: 10.1038/s41467-017-00403-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Saçar Demirci MD Toprak M Allmer J A machine learning approach for MicroRNA precursor prediction in retro-transcribing virus genomes. Journal of Integrative Bioinformatics. 2016;13:303–303. doi: 10.2390/biecoll-jib-2016-303. [DOI] [PubMed] [Google Scholar]
  22. Tüfekci KU Oner MG Meuwissen RLJ Genç Ş The role of microRNAs in human diseases. Methods in Molecular Biology. 2014;1107:33. doi: 10.1007/978-1-62703-748-8_3. [DOI] [PubMed] [Google Scholar]
  23. Varani G McClain WH The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Reports. 2000;1:18. doi: 10.1093/embo-reports/kvd001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Xu QS Liang YZ Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems. 2001;56:1. [Google Scholar]
  25. Xue C Li F He T Liu GP Li Y Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics. 2005;6:1471–1471. doi: 10.1186/1471-2105-6-310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Yao YH Nan XY Wang TM A class of 2D graphical representations of RNA secondary structures and the analysis of similarity based on them. Journal of Computational Chemistry. 2005;26:1339. doi: 10.1002/jcc.20271. [DOI] [PubMed] [Google Scholar]
  27. Yousef M Saçar Demirci MD Khalifa W Allmer J Feature selection has a large impact on one-class classification accuracy for microRNAs in plants. Advances in Bioinformatics. 2016;10 doi: 10.1155/2016/5670851. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Zhang Y Huang H Dong X Fang Y Wang K A dynamic 3D graphical representation for RNA structure analysis and its application in non-coding RNA classification. PLoS One. 2016;11 doi: 10.1371/journal.pone.0152238. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MirGeneDB human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MiRBase human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MiRBase human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Boxplots of classification performance measures when positive dataset was selected from MiRBase human miRNA entries: F-measure, recall, precision, sensitivity, specificity (from top to bottom).
Prediction performances on MirGeneDB data. Gray indicates miRNAs while red shows negatives. X-axis lists the acronyms of the organisms. Y-axis shows the number of precursors.

Articles from Turkish Journal of Biology are provided here courtesy of The Scientific and Technological Research Council of Turkey

RESOURCES