Abstract
Increasing evidence has revealed that microRNAs (miRNAs) play important roles in the development and progression of human diseases. However, efforts made to uncover OMIM disease-miRNA associations are lacking and the majority of diseases in the OMIM database are not associated with any miRNA. Therefore, there is a strong incentive to develop computational methods to detect potential OMIM disease-miRNA associations. In this paper, random walk on OMIM disease similarity network is applied to predict potential OMIM disease-miRNA associations under the assumption that functionally related miRNAs are often associated with phenotypically similar diseases. Our method makes full use of global disease similarity values. We tested our method on 1226 known OMIM disease-miRNA associations in the framework of leave-one-out cross-validation and achieved an area under the ROC curve of 71.42%. Excellent performance enables us to predict a number of new potential OMIM disease-miRNA associations and the newly predicted associations are publicly released to facilitate future studies. Some predicted associations with high ranks were manually checked and were confirmed from the publicly available databases, which was a strong evidence for the practical relevance of our method.
1. Introduction
MicroRNAs (miRNAs) are a class of small noncoding RNAs typically about 22 nucleotides in length. They have been identified in eukaryotic organisms ranging from nematodes to humans [1–3]. Caenorhabditis elegans (C. elegans) lin-4 and let-7 are the first two discovered miRNAs [4, 5]. Over the past decade, thousands of miRNAs have been discovered. miRNAs normally function as negative regulators of gene expression [6–8]. Research also reports that miRNAs may act as positive regulators in some cases [9, 10].
Many investigators have reported that miRNAs are critical in tissue development [11], cell growth [12], cellular signalling [13], and so on. As such, the mutation of miRNAs, the dysfunction of miRNA biogenesis, and the dysregulation of miRNAs and their targets may result in various diseases, such as lung cancer [14], lymphoma [15], and breast cancer [16]. Therefore, research on the relationship between miRNAs and diseases has become an important biomedical goal.
The Online Mendelian Inheritance in Man database (OMIM, http://www.ncbi.nlm.nih.gov/omim/) is a comprehensive knowledgebase of human genetic disorders. It contains information about genes and genetic phenotypes. As of December 2012, OMIM comprised 5442 Mendelian diseases (which are prefixed using “#” (n = 3676) if the responsible gene is known and with “%” otherwise (n = 1766)). However, efforts made to reveal OMIM disease-miRNA associations are lacking and the majority of diseases in the OMIM database are not associated with any miRNA. To provide testable hypotheses to guide future experiments, it is of great importance to devise computational models to infer potential OMIM disease-miRNA associations.
Recently, some important conclusions and computational methods about the relationship between diseases and miRNAs have been presented. Lu et al. [17] performed a comprehensive analysis to the human miRNA-disease association data and disclosed that miRNAs tend to show similar or different dysfunctional evidences for the similar or different disease clusters, respectively. Jiang et al. [18] proposed a computational model based on the hypergeometric distribution to infer potential miRNA-OMIM disease associations by prioritizing the entire human microRNAome for diseases of interest. The notation that functionally related miRNAs tend to be associated with phenotypically similar diseases was reconfirmed in their manuscript. Although miRNAs functional network, disease similarity network, and known miRNA-disease associations were integrated in their work, only the neighbor information of each miRNA was used in their scoring system. Prediction accuracy would be increased by taking advantage of the global network similarity information. Another limitation is that in silico predicted associations were used as data sources in this method. It is known that these predicted associations used as data sources have some false-positive and false-negative results, thus influencing the final prediction accuracy. To test the hypothesis that some miRNAs could be potentially responsible for a number of “orphan” OMIM diseases, Rossi et al. [19] developed a novel approach OMiR to calculate the significance of the overlap between miRNA loci and OMIM disease loci. Results suggested that “orphan” genetic disease loci were proximal to miRNA loci more frequently than to loci for which the responsible protein coding gene is known.
In this paper, we propose a computational approach to infer potential human OMIM disease-miRNA associations by random walk to prioritize the candidate diseases for miRNAs of interest. We first constructed an OMIM disease similarity network and an OMIM disease-miRNA association network. We subsequently implement random walk on the OMIM disease similarity network to prioritize candidate diseases for an miRNA of interest. Cross-validation has illustrated the excellent performance of our method. The comprehensively predicted OMIM disease-miRNA associations also enable us to suggest many potential OMIM disease-miRNA associations, which can offer help in further experiments and hence increase research productivity. We further manually checked some strongly predicted associations and encouraging confirmation results were found from the publicly available databases.
2. Materials and Methods
2.1. Data Preparation
The benchmark dataset (see Supplementary material S1 available at http://dx.doi.org/10.1155/2013/204658) used in this manuscript is downloaded from [20, 21]. Here below we provide a brief description.
2.1.1. The OMIM Disease Similarity Data
We download the disease phenotype similarity scores from the MimMiner [21], developed by van Driel et al. who computed a phenotype similarity score for each phenotype pair by the text mining analysis of their phenotype descriptions in the Online Mendelian Inheritance in Man (OMIM) database [22]. The phenotypic similarity scores have been successfully used to predict or prioritize disease-related protein-coding genes [23, 24].
OMIM disease similarity matrix is defined as O, where the entity O(i, j) in row i column j is the similarity score between OMIM disease i and j. Based on the similarity matrix, OMIM disease similarity network (ODSN) is constructed, where vertex set D = {d 1, d 2,…, d n} denotes the set of n OMIM diseases. Vertices d i and d j are linked by an edge in the network if the similarity between diseases i and j is more than zero. The similarity score between diseases i and j is used as the weight of this edge.
2.1.2. The OMIM Disease-miRNA Association Data
Previous studies have produced a large number of miRNA-disease associations. Lu et al. [17] and Jiang et al. [20] manually retrieved the associations of miRNA and disease from literatures and constructed two curated databases, human miRNA-associated disease database (HMDD) and miR2Disease, respectively. They aim to offer comprehensive resources of experimentally confirmed miRNA-disease associations. Yang et al. [25] also created a publicly available database of Differentially Expressed miRNAs in human Cancers (dbDEMC) with the goal to provide potential cancer-related miRNAs by in silico computing.
The OMIM disease-miRNA association data used in our paper was downloaded from miR2Disease [20]. After mapping these downloaded diseases into OMIM disease IDs, we finally received 1226 OMIM disease-miRNA associations consisting of 61 OMIM diseases and 365 miRNAs. These associations were used for performance evaluation, and the latest versions of the HMDD [17] and dbDEMC [25] data were applied for prediction confirmation.
OMIM disease-miRNA association network (ODMAN) was constructed based on the 1226 verified associations, where vertex set D = {d 1, d 2,…, d n} denotes the set of n OMIM diseases and M = {m 1, m 2,…, m k} denotes the set of kmiRNAs. Vertices m i and d j are linked by an edge in the ODMAN if disease j is associated with miRNA i in our datasets. The weights of all edges are set to be 1.
2.2. Method Description
Random walk is a ranking algorithm. It simulates a random walker who starts on some given seed nodes and moves to their immediate neighbors randomly at each step. Finally, all the nodes in the network are ranked by the probability of the random walker reaching this node. Let p 0 be the initial probability vector and p s a vector in which the ith element holds the probability of finding the random walker at node i at steps. The probability vector at step s + 1 can be given by
(1) |
where r is the restart probability of random walk in every time step at source nodes and D norm is the normalized similarity network.
After some steps, the probability will reach a steady state. This is obtained by performing the iteration until the difference between p s and p s+1 (measured by the L1 norm) falls below 10−10. The steady-state probability p ∞ gives a measure of proximity to seed nodes. If p ∞(i) > p ∞(j), then node i is more proximate to seed nodes than node j.
In this paper, based on the observation that functionally related miRNAs are often associated with phenotypically similar diseases [17], random walk was proposed to uncover the potential associations between OMIM diseases and miRNAs. The source code in Matlab can be downloaded from Supplementary Material S2. As we want to predict potential OMIM diseases for a given miRNA m of interest, all the OMIM diseases which have already been confirmed to be associated with this miRNA will be considered as seed nodes. Other nonseed OMIM diseases will be considered as candidate diseases. The initial probability p 0 is formed such that equal probabilities are assigned to the seed nodes, with the sum equal to 1, while the initial probabilities of nonseed miRNAs are 0. Here we allow the restart of random walk in every time step at source nodes with probability r (0 < r < 1). After some iteration, the random walk is stable. The stable probability is defined as p ∞. Candidate OMIM diseases are ranked according to p ∞. The high-scored OMIM diseases can be expected to have a high probability to be associated with the given miRNA.
3. Results
3.1. OMIM Disease-miRNA Association Network (ODMAN) Analysis
In this study, we first focus on the verified OMIM disease-miRNA associations. The set of 1226 known OMIM disease-miRNA associations is regarded as the “gold standard” data and is used for evaluating the performance of our proposed method in the cross-validation experiments as well as training data in the comprehensive prediction. We constructed the OMIM disease-miRNA association network using a bipartite graph representation (see Figure 1) and analyzed some statistics for the OMIM disease-miRNA association network. In the bipartite graph, the heterogeneous nodes correspond to either miRNAs or diseases, and edges correspond to associations between them. An edge is placed between a miRNA node and a disease node if the disease is known to associate with the miRNA.
Figure 2 shows the degree distributions for miRNAs and diseases in the OMIM disease-miRNA association network. The degree of the miRNA (respective disease) node is the number of diseases that the miRNA has associations with (resp., the number of miRNAs targeting the disease).
Table 1 details some statistics for the OMIM disease-miRNA association network, such as the average degree of miRNAs and the average degree of diseases.
Table 1.
No. of OMIM diseases | No. of miRNAs | No. of OMIM disease-miRNA associations | Average degree of OMIM diseases | Average degree of miRNAs |
---|---|---|---|---|
61 | 365 | 1226 | 20.10 | 3.36 |
Inspection of the OMIM disease-miRNA association network shows that most edges in the network are connected and form a large connecting subnetwork.
3.2. Performance Evaluation of the Proposed Method
In order to assess the power of our method to predict OMIM disease-miRNA associations by prioritizing the entire candidate OMIM diseases, we performed a leave-one-out cross-validation on the 1226 known OMIM disease-miRNA associations. For a given miRNA m, each known related OMIM disease was left out in turn as test disease and other known OMIM diseases were taken as seed nodes. The candidate disease set consisted of all the OMIM diseases which have no evidence to show their association with miRNA m.
We calculated the sensitivity and specificity for each threshold. Sensitivity refers to the percentage of the associations whose ranking is higher than a given threshold, namely, the ratio of the successfully predicted experimentally verified OMIM disease-miRNA associations to the total experimentally verified OMIM disease-miRNA associations. Specificity refers to the percentage of associations that are below the threshold. The value of area under receiver-operating characteristics (ROC) curve (AUC) was calculated and an AUC value of 71.42% was achieved, suggesting that our method can recover the known experimentally verified OMIM disease-miRNA associations and therefore has the potential to infer new OMIM disease-miRNA associations.
3.3. Effects of Parameter in the Proposed Method
Restart probability r is one parameter in our method. To investigate the selection of the parameter for the performance of our method, we set various values for it and calculated the AUC values in the framework of leave-one-out cross-validation. Table 2 details the effect of the parameter on the cross-validation results in the benchmark dataset. After a comprehensive searching, the parameter (r = 0.8) which led to best AUC result is selected for further association prediction. It could also be observed that the predictive result is robust to the restart probability.
Table 2.
Restart probability | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
---|---|---|---|---|---|---|---|---|---|
AUC | 0.6703 | 0.6903 | 0.7011 | 0.7082 | 0.7126 | 0.7135 | 0.7138 | 0.7142 | 0.7138 |
3.4. Comparison with Other Methods
Until recently, efforts made to discover potential OMIM disease-miRNA associations are lacking. Meanwhile models have been constructed based on different data features, which makes direct performance comparison difficult. The most recent study related with our work is the computational model proposed by Jiang et al. [18], which was based on the hypergeometric distribution to infer potential miRNA-OMIM disease associations by prioritizing the entire human microRNAome for diseases of interest. Only the neighbor information of each miRNA was used in their scoring system. Another limitation is that in silico predicted associations were used as data sources in this method. It is known that these predicted associations have some false-positive and false-negative results, which may bring noises to the experiments. Our method is based on the experimentally verified OMIM disease-miRNA associations.
3.5. Comprehensive Prediction for Unknown OMIM Disease–miRNA Associations
After confirming the usefulness of our method, we conduct a comprehensive prediction of unknown associations between all possible OMIM diseases and miRNAs. In the inference process, we trained our method with all the known associations. We ranked the nonassociating pairs with respect to their probability scores and extracted the top 20 predicted associations for each of the 365 OMIM diseases. The full list of the prediction results can be obtained from Supplementary Material S3.
Furthermore, we manually checked some strongly predicted associations. Take the top 10 predicted associations of hsa-let-7g as an example. We confirmed that 6 associations (Table 3) are now annotated in at least one of the two latest online versions of HMDD [17] and dbDEMC [25] databases. We take these as a strong evidence to support the practical application of our method. Note that the predicted associations that are not reported yet may also exist in reality.
Table 3.
miRNA | OMIM ID | Rank | Source |
---|---|---|---|
hsa-let-7g | 114480 | 1 | HMDD |
hsa-let-7g | 155720 | 2 | HMDD |
hsa-let-7g | 180200 | 3 | HMDD |
hsa-let-7g | 256700 | 4 | |
hsa-let-7g | 188470 | 5 | |
hsa-let-7g | 109800 | 6 | |
hsa-let-7g | 133239 | 7 | dbDEMC |
hsa-let-7g | 603956 | 8 | HMDD |
hsa-let-7g | 155255 | 9 | dbDEMC |
hsa-let-7g | 607174 | 10 |
4. Discussion
We have applied random walk on OMIM disease similarity network to predict potential OMIM disease-miRNA associations. Differing from using local network similarity measures, like the method proposed by Jiang et al. [18], we adopted global network similarity measures. Excellent performance based on leave-one-out cross-validation suggested that our method has the potential to infer new OMIM disease-miRNA associations. The newly predicted associations are publicly released to facilitate future studies. Further confirmation of some strongly predicted associations in publicly accessible databases indicates the realistic application of our method.
Despite the encouraging results of our method, there are also limitations. The known experimentally verified OMIM disease-miRNA associations were rare. Therefore, integrating other bioinformatics sources, such as Gene Ontology, might improve model performance. Our method cannot be applied for miRNAs which do not have any known associated OMIM diseases. Thus miRNA similarity information should be taken into consideration. From a technical viewpoint, the performance of our method could be improved by using more accurate similarity information designed for OMIM diseases.
Supplementary Material
Acknowledgments
The authors are grateful to Dr. Yixiong Liang at the Central South University for useful discussions. They thank Dr. Qinghua Cui from Peking University Health Science Center and Professor Yadong Wang of Harbin Institute of Technology for their help. This research was supported by the National Natural Science Foundation of China (Grants 60970095, 61003124, and M1121008), Research Fund for the Doctoral Program of Higher Education of China (Grant no. 20120162110077) and the National High Technology Research and Development Program of China (863 Program, no. 2012AA011205).
References
- 1.Lagos-Quintana M, Rauhut R, Yalcin A, Meyer J, Lendeckel W, Tuschl T. Identification of tissue-specific microRNAs from mouse. Current Biology. 2002;12(9):735–739. doi: 10.1016/s0960-9822(02)00809-6. [DOI] [PubMed] [Google Scholar]
- 2.Lau NC, Lim LP, Weinstein EG, Bartel DP. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science. 2001;294(5543):858–862. doi: 10.1126/science.1065062. [DOI] [PubMed] [Google Scholar]
- 3.Lee RC, Ambros V. An extensive class of small RNAs in Caenorhabditis elegans. Science. 2001;294(5543):862–864. doi: 10.1126/science.1065329. [DOI] [PubMed] [Google Scholar]
- 4.Lee RC, Feinbaum RL, Ambros V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993;75(5):843–854. doi: 10.1016/0092-8674(93)90529-y. [DOI] [PubMed] [Google Scholar]
- 5.Reinhart BJ, Slack FJ, Basson M, et al. The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature. 2000;403(6772):901–906. doi: 10.1038/35002607. [DOI] [PubMed] [Google Scholar]
- 6.Ambros V. The functions of animal microRNAs. Nature. 2004;431(7006):350–355. doi: 10.1038/nature02871. [DOI] [PubMed] [Google Scholar]
- 7.Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004;116(2):281–297. doi: 10.1016/s0092-8674(04)00045-5. [DOI] [PubMed] [Google Scholar]
- 8.Meister G, Tuschl T. Mechanisms of gene silencing by double-stranded RNA. Nature. 2004;431(7006):343–349. doi: 10.1038/nature02873. [DOI] [PubMed] [Google Scholar]
- 9.Jopling CL, Yi M, Lancaster AM, Lemon SM, Sarnow P. Modulation of hepatitis C virus RNA abundance by a liver-specific MicroRNA. Science. 2005;309(5740):1577–1581. doi: 10.1126/science.1113329. [DOI] [PubMed] [Google Scholar]
- 10.Vasudevan S, Tong Y, Steitz JA. Switching from repression to activation: microRNAs can up-regulate translation. Science. 2007;318(5858):1931–1934. doi: 10.1126/science.1149460. [DOI] [PubMed] [Google Scholar]
- 11.Krichevsky AM, King KS, Donahue CP, Khrapk K, Kosik K. A microRNA array reveals extensive regulation of microRNAs during brain development. RNA. 2003;9(10):1274–1281. doi: 10.1261/rna.5980303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Esquela-Kerscher A, Slack FJ. Oncomirs—microRNAs with a role in cancer. Nature Reviews Cancer. 2006;6(4):259–269. doi: 10.1038/nrc1840. [DOI] [PubMed] [Google Scholar]
- 13.Cui Q, Yu Z, Purisima EO, Wang E. Principles of microRNA regulation of a human cellular signaling network. Molecular Systems Biology. 2006;2, article 46 doi: 10.1038/msb4100089. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Esquela-Kerscher A, Trang P, Wiggins JF, et al. The let-7 microRNA reduces tumor growth in mouse models of lung cancer. Cell Cycle. 2008;7(6):759–764. doi: 10.4161/cc.7.6.5834. [DOI] [PubMed] [Google Scholar]
- 15.Chen RW, Bemis LT, Amato CM, et al. Truncation in CCND1 mRNA alters miR-16-1 regulation in mantle cell lymphoma. Blood. 2008;112(3):822–829. doi: 10.1182/blood-2008-03-142182. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Miller TE, Ghoshal K, Ramaswamy B, et al. MicroRNA-221/222 confers tamoxifen resistance in breast cancer by targeting p27Kip1. Journal of Biological Chemistry. 2008;283(44):29897–29903. doi: 10.1074/jbc.M804612200. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Lu M, Zhang Q, Deng M, et al. An analysis of human microRNA and disease associations. PLoS ONE. 2008;3(10) doi: 10.1371/journal.pone.0003420.e3420 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Jiang Q, Hao Y, Wang G, et al. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Systems Biology. 2010;4(1, article S2) doi: 10.1186/1752-0509-4-S1-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Rossi S, Tsirigos A, Amoroso A, et al. OMiR: identification of associations between OMIM diseases and microRNAs. Genomics. 2011;97(2):71–76. doi: 10.1016/j.ygeno.2010.10.004. [DOI] [PubMed] [Google Scholar]
- 20.Jiang Q, Wang Y, Hao Y, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Research. 2009;37(1):D98–D104. doi: 10.1093/nar/gkn714. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JAM. A text-mining analysis of the human phenome. European Journal of Human Genetics. 2006;14(5):535–542. doi: 10.1038/sj.ejhg.5201585. [DOI] [PubMed] [Google Scholar]
- 22.Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research. 2005;33:D514–D517. doi: 10.1093/nar/gki033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Ala U, Piro RM, Grassi E, et al. Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Computational Biology. 2008;4(3) doi: 10.1371/journal.pcbi.1000043.e1000043 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Wu X, Jiang R, Zhang MQ, Li S. Network-based global inference of human disease genes. Molecular Systems Biology. 2008;4(article 9) doi: 10.1038/msb.2008.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Yang Z, Ren F, Liu C, et al. dbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genomics. 2010;11(4, article S5) doi: 10.1186/1471-2164-11-S4-S5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.