Prediction of disease-related miRNAs by voting with multiple classifiers

Changlong Gu; Xiaoying Li

doi:10.1186/s12859-023-05308-x

2023 Apr 30;24:177. doi: 10.1186/s12859-023-05308-x

PMCID: PMC10150488 PMID: 37122001

Abstract

There is strong evidence that mutations and dysregulation of miRNAs are associated with a variety of diseases, including cancer. However, the experimental methods used to identify disease-related miRNAs are expensive and time-consuming. Effective computational approaches to identify disease-related miRNAs are in high demand and would aid in the detection of miRNA biomarkers for disease diagnosis, treatment, and prevention. In this study, we develop an ensemble learning framework to reveal the potential associations between miRNAs and diseases (ELMDA). The ELMDA framework does not rely on the known associations when calculating miRNA and disease similarities and uses multi-classifier voting to predict disease-related miRNAs. As a result, the average AUC of the ELMDA framework was 0.9229 for the HMDD V2.0 database in a fivefold cross-validation. All potential associations in the HMDD V2.0 database were predicted, and 90% of the top 50 results were verified with the updated HMDD V3.2 database. The ELMDA framework was implemented to investigate gastric neoplasms, prostate neoplasms and colon neoplasms, and 100%, 94%, and 90%, respectively, of the top 50 potential miRNAs were validated by the HMDD V3.2 database. Moreover, the ELMDA framework can predict isolated disease-related miRNAs. In conclusion, ELMDA appears to be a reliable method to uncover disease-associated miRNAs.

Keywords: miRNA similarity, Disease similarity, Multi-classifiers voting, Cross-validation, XGBoost classification, Random forest classification

Background

Identification of disease-related biomarkers and the interaction of biomolecules is an emerging and challenging task [1–3]. Many effective methods have been proposed by scholars in different fields [4–6], and the recognition of disease-related microRNAs (miRNAs) is one of the important branches. MiRNAs are small single-stranded non-coding RNA molecules (containing approximately 22 nucleotides) that can regulate gene expression at the posttranscriptional level [7]. MiRNAs play an important role in multiple biological processes, including cellular differentiation, proliferation, apoptosis and tissue development [8]. Substantial evidence indicates that miRNA dysregulation is related to a number of human diseases, such as cardiovascular disease, schizophrenia, and cancer [9]. Thus, the identification of disease-related miRNAs will be helpful in the diagnosis, treatment, and prevention of diseases.

Through biological experiments, such as Northern hybridization, microarray analysis, and real-time quantitative PCR, scientists have verified a large number of miRNA-disease associations [10]. By collecting and sorting miRNA-disease associations, Cui et al. constructed a comprehensive database, namely, the Human miRNA-associated Disease Database (HMDD) [11, 12]. The current version is HMDD V3.2; they manually collected 35,547 miRNA-disease association entries, which included 1206 miRNA genes and 893 diseases from 19,280 papers. In addition, in 2010, the team of Andrew E. produced the first release of dbDEMC, which represents a database for collecting differentially expressed miRNAs in human cancers obtained from microarray data [13]. Since then, they have maintained and updated the database, and the latest version is dbDEMC 3.0. This current version contains 3268 differentially expressed miRNAs from 40 cancer types, whereas for humans, a total of 2584 differentially expressed miRNAs were included. Focusing on different studies, there are many miRNA-related databases that provide a strong data source for miRNA research.

Identifying disease-related miRNAs by biological experiments is costly and slow, so effective computational methods for predicting disease-related miRNAs have attracted extensive attention. In the past few years, significant progress has been made in the development of miRNA-disease association prediction models. These models can be roughly divided into three categories: models based on score functions, models based on network algorithms, and models based on machine learning.

Most methods that predict miRNA disease associations based on score functions are based on the assumption that functionally similar miRNAs tend to be associated with phenotypically similar diseases [14]. Xuan et al. [15] presented the HDMP prediction model based on the most highly weighted similar neighbors to predict potential miRNA-disease associations. The model combined the information content of disease terms and phenotype similarities among diseases to calculate miRNA functional similarities and used miRNA family information to further improve the prediction accuracy. However, this method will fail when miRNA has no known associated diseases. Chen et al. [16] developed a computational model named WBSMDA to predict disease-related miRNAs by integrating known miRNA-disease associations, miRNA functional similarities, disease semantic similarities and Gaussian interaction profile kernel similarities. This method obtains the final prediction scores by integrating Within-Scores and Between-Scores, which are used for miRNA disease association predictions, achieves a good prediction effect, and can be applied to diseases without any known related miRNAs and miRNAs without any known related diseases.

Some researchers predict disease-related miRNAs based on network algorithms, such as network embedding, network projection, matrix factorization, and random walk. These methods construct similarity networks of miRNAs and diseases from different perspectives and then implement network algorithms to predict the associations among miRNAs and diseases. For example, by integrating known human miRNA–disease associations, miRNA similarities and disease similarities, You et al. [17] proposed a path-based computational model for miRNA–disease association predictions. They constructed a heterogeneous graph with many paths and used the sum of all path scores to calculate the association probabilities of miRNA-disease pairs. The sparsity of the known miRNA-disease association matrix limits the performance of this model. Chen et al. [18] presented a prediction model of bipartite network projection for miRNA–disease association prediction (BNPMDA). Based on the known miRNA–disease association network, miRNA similarity network and disease similarity network, they constructed bias ratings for miRNAs and diseases and implemented a bipartite network recommendation algorithm to predict disease-related miRNAs. Recently, Chen et al. [19] developed a neoteric Bayesian model to predict potential miRNA-disease associations, named KBMFMDA, which combines kernel-based nonlinear dimensionality reduction, matrix factorization and binary classification. Based on random walk and binary regression, Niu et al. [20] presented a prediction model, RWBRMDA, which extracted the features of miRNAs by a random walk with restart and applied binary logistic regression to calculate the probability scores of miRNA-disease pairs. The limitation of RWBRMDA is that it cannot predict new diseases that have no known related miRNAs.

The analysis of biological molecular data related to diseases is highly complex, and examining the data from various perspectives can aid in comprehending the pathogenesis of diseases. Consequently, multi-network integrated learning models have emerged as a promising approach and have yielded favorable outcomes [21–24]. For example, Ma et al. [24] proposed a computational model, DeepMNE, which employs deep multi-network embedding to integrate multi-omics data and identify potential lncRNA-disease associations. Both cross-validation and case studies have demonstrated the excellent predictive performance of DeepMNE.

In recent years, machine learning-based methods for predicting disease-related miRNAs have proliferated. These methods predict disease-related miRNAs with a trained model. Training requires the features and labels of positive and negative samples, so the problems of feature selection and negative sample selection need to be solved. Chen et al. [25] developed a ranking-based k-nearest neighbor method, RKNNMDA, to predict disease-related miRNAs. By combining miRNA similarities, disease similarities, Gaussian kernel similarities and known miRNA-disease associations, the k-nearest neighbors (KNN) algorithm is used to search for the k nearest neighbors of each miRNA and disease. These k nearest neighbors are then reranked according to the SVM ranking model. Finally, the ranking results are weighted to obtain the final ranking of all potential miRNA-disease associations. The disadvantage of RKNNMDA is that it may be biased toward miRNAs with more known related diseases. Peng et al. [26] proposed a learning-based framework, MDA-CNN, for miRNA-disease association identification. The model extracts features by using an autoencoder based on three networks with an additional target gene layer, inputs the features into a CNN and identifies disease-connected miRNAs. Considering the difficulty in obtaining negative samples, Chen et al. [27] proposed a semi-supervised model to predict miRNA-disease associations. This model applied the regularized least squares (RLS) method to construct two optimal classifiers based on miRNA functional similarities and disease semantic similarities and can be applied to new diseases that have no associated miRNAs. However, the parameters that balance the contributions of the two classifiers must be adjusted manually. Chen et al. [28] implemented an ensemble learning model, named EDTMDA, to distinguish potential associations from unknown miRNA-disease associations. EDTMDA fuses multiple basic classifiers to infer novel miRNA-disease associations and achieves good prediction accuracy. Because they process unstructured data quickly and effectively, deep learning methods are widely used in miRNA-disease association predictions. For example, CNNMDA [29] utilized dual convolutional neural networks (CNNs) to learn the original and global representations of miRNA–disease pairs. However, machine learning-based algorithms face difficulties in retrieving negative samples, which may decrease their prediction performance. Recently, hypergraph learning has been used to identify disease-related biomarkers [30–32]. Based on attention-aware multi-view similarity networks and hypergraph learning, Ning et al. [32] developed a model called AMHMDA for identifying disease-related miRNAs. The experimental results have shown that AMHMDA has good performance, and the case study further confirms the predictive ability of the model.

In this work, we propose an ensemble learning framework for miRNA disease association prediction, named ELMDA. The ELMDA framework integrates miRNA and disease similarities along with known miRNA-disease associations to reveal potential miRNA-disease associations. The main contributions of the paper are summarized as follows.

The target data verified by experiments are used to construct similarity networks, which can avoid false-positives of target data.
Neither the disease nor the miRNA similarity network is constructed from the known association data, which avoids overestimating the prediction performance of the model in cross-validation.
The ELMDA framework extracts features from similarity data to reduce the data scale, adds structural feature data to obtain more complete data features, and selects appropriate negative samples through sample selection so that the model has good prediction performance.
The ELMDA framework uses multiple classifiers to vote for the final prediction, and the model has good generalization ability.
The ELMDA framework can be applied to predict isolated diseases (diseases without any known related miRNAs).

Results

Performance of ELMDA based on fivefold cross-validation

In this section, to validate the ability of ELMDA to predict potential miRNA-disease associations, we adopt fivefold cross-validation in our experiment. The training dataset is randomly and evenly divided into five subsets; then, one subset is used for testing, and the other four subsets are selected for training. This process is repeated until all subsets have been used as the test set. We assessed the performance of the methods using the following evaluation criteria: precision [Eq. (1)], recall [Eq. (2)] and F1-score [Eq. (3)]. The formulas are as follows:

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{3}$$

where TP and TN represent the numbers of correctly identified positive and negative samples, respectively, and FP and FN represent the numbers of false-positive and false-negative samples. In addition, we plot the receiver operating characteristic (ROC) curve and use the area under the curve (AUC) to evaluate these methods. The ROC curve plots the true-positive rate (TPR) against the false-positive rate (FPR) at different thresholds. However, because positive samples (experimentally verified miRNA-disease associations) are few, the AUC alone is not a sufficient measure of performance; thus, we also used the precision-recall (PR) curve and the area under the PR curve (AUPR) to complement the performance evaluation. In general, prediction performance is better when the ROC and PR curves show similar trends and the AUC and AUPR values are close to 1.
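
As a minimal illustration of Eqs. (1)–(3) and of the AUC, the following Python sketch computes the metrics from binary labels and prediction scores. The function names and the toy labels and scores are illustrative assumptions, not data from the study; the AUC is computed here through its pairwise-ranking interpretation rather than by integrating the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = association, 0 = none)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Eqs. (1)-(3); a ratio with a zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample scores higher; ties count as 0.5.
    Assumes at least one positive and one negative sample."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels and scores).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]
tp, fp, fn, tn = confusion_counts(labels, predicted)
print(precision_recall_f1(tp, fp, fn), roc_auc(labels, scores))
```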

The fivefold cross-validation results of the ELMDA framework are shown in Table 1. The ELMDA framework achieves an average AUC of 0.9229; across the folds listed in Table 1, the maximum AUC is 0.9298 and the minimum is 0.9201. The ROC and PR curves of the ELMDA framework, each with a locally enlarged view, are shown in Figs. 1 and 2, respectively. Based on these results, the ELMDA framework shows good prediction performance.

Table 1.

Fivefold cross validation results of the ELMDA framework

Fold	Precision	Recall	F1-score	AUC	AUPR
1	0.8709	0.8469	0.8587	0.9225	0.9200
2	0.8359	0.8412	0.8386	0.9216	0.9250
3	0.8544	0.8704	0.8623	0.9298	0.9261
4	0.8326	0.8510	0.8417	0.9207	0.9195
5	0.8463	0.8579	0.8521	0.9201	0.9184
Average	0.8480 ± 0.0138	0.8535 ± 0.0101	0.8507 ± 0.0093	0.9229 ± 0.0035	0.9218 ± 0.0031

Fig. 1 — ROC curve and the local enlarged figure of the ELMDA framework

Fig. 2 — PR curve and the local enlarged figure of the ELMDA framework

The ELMDA framework combines four classification algorithms: SVM, GBDT, RF and XGBoost. We therefore compared the performance of each single classifier with that of the ELMDA framework; the results are shown in Table 2. On every evaluation indicator (precision, recall, F1-score, AUC and AUPR), the ELMDA framework outperforms each single classifier. Combining the classifiers is therefore reasonable and improves the overall prediction ability.

Table 2.

Comparison of the prediction performance of the ELMDA framework and a single classifier

Method	Precision	Recall	F1-score	AUC	AUPR
SVM	0.8369 ± 0.0085	0.8371 ± 0.0143	0.8370 ± 0.0075	0.9091 ± 0.0031	0.9057 ± 0.0036
GBDT	0.8369 ± 0.0107	0.8490 ± 0.0057	0.8429 ± 0.0054	0.9172 ± 0.0034	0.9138 ± 0.0039
RF	0.8424 ± 0.0108	0.8354 ± 0.0131	0.8388 ± 0.0091	0.9141 ± 0.0049	0.9123 ± 0.0047
XGBoost	0.8471 ± 0.0090	0.8486 ± 0.0099	0.8478 ± 0.0076	0.9191 ± 0.0039	0.9165 ± 0.0045
ELMDA	0.8485 ± 0.0139	0.8536 ± 0.0101	0.8510 ± 0.0094	0.9229 ± 0.0035	0.9217 ± 0.0031
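
How the outputs of the four base classifiers are combined is not specified in this section. The sketch below assumes soft voting, in which the final score of a miRNA-disease pair is the unweighted mean of the probabilities predicted by the individual classifiers. The function name, the dictionary layout and the toy probabilities are illustrative assumptions, not the authors' implementation.

```python
def soft_vote(prob_by_classifier):
    """Average per-sample probabilities over classifiers (assumed soft voting).

    prob_by_classifier maps a classifier name to a list of predicted
    probabilities, one per miRNA-disease pair, all in the same order.
    """
    columns = list(prob_by_classifier.values())
    n_samples = len(columns[0])
    if any(len(col) != n_samples for col in columns):
        raise ValueError("all classifiers must score the same samples")
    return [sum(col[i] for col in columns) / len(columns)
            for i in range(n_samples)]


# Hypothetical probabilities for two candidate pairs.
toy_probs = {
    "SVM": [0.9, 0.2],
    "GBDT": [0.8, 0.4],
    "RF": [0.7, 0.1],
    "XGBoost": [1.0, 0.3],
}
final_scores = soft_vote(toy_probs)
final_labels = [1 if s >= 0.5 else 0 for s in final_scores]
print(final_scores, final_labels)
```

An unweighted mean treats all classifiers equally; assigning a separate weight to each classifier would be a natural variation of this sketch.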

Comparison with other methods

We compared the performance of the ELMDA framework with five other recent computational methods: MDA-CF [30], TCRWMDA [31], WBSMDA [32], ABMDA [33] and ICFMDA [34]. Like ELMDA, these methods were developed on HMDD V2.0 and evaluated by AUC under fivefold cross-validation. Each of the five groups is selected as the test set in turn, and the other four groups are used for training. The predicted scores of the test samples were obtained, and the scores of all miRNA-disease pairs were ranked. We then calculated the TPRs and FPRs at different thresholds and obtained the AUCs. The whole procedure was repeated 20 times to obtain the average results, which are shown in Fig. 3. The predicted AUCs of the six computational models were 92.13, 92.58, 92.09, 81.85, 90.45, and 90.23, respectively. The AUC of the ELMDA framework is slightly lower than that of MDA-CF and better than those of the other four methods. ELMDA achieves this performance without using known association data when computing similarities, whereas the other algorithms do use known association data. The results further confirm the effectiveness of the ELMDA framework for miRNA-disease association predictions.

Fig. 3 — Comparison of the performance of the ELMDA framework with other new computing methods
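
The fold rotation used in this evaluation (each of five groups serves once as the test set, and the whole procedure is repeated with a fresh random split) can be sketched as follows. This is a generic illustration in plain Python; the function name, the seeds and the sample count in the example are assumptions, and how the original experiments shuffled the samples is not stated.

```python
import random


def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for one shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Repeat the fivefold procedure several times with different shuffles.
n_repeats = 20
for repeat in range(n_repeats):
    for train, test in k_fold_splits(n_samples=100, n_folds=5, seed=repeat):
        pass  # train a model on `train`, score `test`, accumulate the AUC
```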

Case study

To investigate the ability of the ELMDA framework to infer unknown miRNA-disease associations, we implemented case studies from two different perspectives. We first evaluated the overall model performance. We trained the model with the 5430 known associations in the HMDD V2.0 dataset as positive samples and 5418 selected negative samples, predicted the unknown (candidate) associations in HMDD V2.0, sorted the prediction results, selected the 50 highest-ranked candidate associations, and verified them with HMDD V3.2 (the latest version). The predicted results are presented in Table 3. Among the top 50 predicted associations, only five were not confirmed by the HMDD V3.2 database, and the top 10 were all confirmed. Of the predictions not verified by HMDD V3.2, the 29th and 50th are supported by new evidence found through a literature search. Based on genome analysis, Barbato et al. [33] found that melanoma tissues with high hsa-mir-181a and hsa-mir-181b expression presented favorable outcomes in terms of progression-free survival, suggesting that hsa-mir-181 is a clinically relevant candidate for biomarker-based therapy selection. Wu et al. [34] suggest that miR-93-5p modulates tumorigenesis and gemcitabine resistance in pancreatic cancer cells by targeting the PTEN/PI3K/Akt signaling pathway.

Table 3.

Predictions of all potential associations in the HMDD V2.0 database and 90% of the top 50 results are verified by the updated HMDD V3.2 database

Rank	miRNA	Disease	Validated
1	hsa-mir-16	Lung neoplasms	True
2	hsa-mir-155	Glioblastoma	True
3	hsa-mir-21	Stomach neoplasms	True
4	hsa-mir-29a	Pancreatic neoplasms	True
5	hsa-mir-17	Carcinoma, renal cell	True
6	hsa-mir-155	Prostatic neoplasms	True
7	hsa-mir-9	Carcinoma, hepatocellular	True
8	hsa-mir-150	Breast neoplasms	True
9	hsa-mir-20a	Carcinoma, renal cell	True
10	hsa-mir-142	Breast neoplasms	True
11	hsa-mir-106b	Lung neoplasms	True
12	hsa-mir-155	Stomach neoplasms	True
13	hsa-mir-21	Retinoblastoma	True
14	hsa-mir-145	Stomach neoplasms	True
15	hsa-mir-34a	Heart failure	True
16	hsa-mir-98	Breast neoplasms	True
17	hsa-mir-155	Autistic disorder	False
18	hsa-mir-126	Stomach neoplasms	True
19	hsa-mir-155	Glioma	True
20	hsa-mir-125b	Heart failure	True
21	hsa-mir-21	Nasopharyngeal Neoplasms	True
22	hsa-mir-15b	Lung neoplasms	True
23	hsa-mir-17	Stomach neoplasms	True
24	hsa-mir-34b	Carcinoma, hepatocellular	True
25	hsa-mir-205	Colorectal neoplasms	True
26	hsa-mir-21	Hepatitis b	True
27	hsa-mir-15a	Lung neoplasms	True
28	hsa-mir-130a	Breast neoplasms	True
29	hsa-mir-181b	Melanoma	False
30	hsa-mir-29b	Colorectal neoplasms	True
31	hsa-mir-221	Heart failure	True
32	hsa-mir-195	Lung neoplasms	True
33	hsa-mir-20a	Stomach neoplasms	True
34	hsa-mir-98	Melanoma	True
35	hsa-mir-101	Melanoma	True
36	hsa-mir-9	Heart failure	False
37	hsa-mir-214	Colorectal neoplasms	True
38	hsa-mir-29a	Stomach neoplasms	True
39	hsa-mir-21	Schizophrenia	False
40	hsa-mir-21	Carcinoma, basal cell	True
41	hsa-mir-122	Lung neoplasms	True
42	hsa-mir-223	Melanoma	True
43	hsa-mir-17	Carcinoma	True
44	hsa-mir-378a	Breast neoplasms	True
45	hsa-mir-29c	Colorectal neoplasms	True
46	hsa-mir-143	Carcinoma, Hepatocellular	True
47	hsa-mir-222	Heart failure	True
48	hsa-mir-1	Neoplasms	True
49	hsa-mir-29a	Glioblastoma	True
50	hsa-mir-93	Pancreatic neoplasms	False
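
The validation rate reported for a ranked list is simply the fraction of the top-ranked candidates that appear in the newer database; for Table 3, 45 of 50 gives 90%. A minimal sketch follows, in which the function name is an assumption and the three example pairs are ranks 15 to 17 of Table 3.

```python
def top_k_validation_rate(ranked_pairs, validated_pairs, k=50):
    """Fraction of the k highest-ranked (miRNA, disease) pairs found in the
    validation set (for example, associations recorded in a newer database)."""
    top = ranked_pairs[:k]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in validated_pairs)
    return hits / len(top)


# Ranks 15-17 of Table 3: two validated pairs, one not validated.
ranked = [
    ("hsa-mir-34a", "Heart failure"),
    ("hsa-mir-98", "Breast neoplasms"),
    ("hsa-mir-155", "Autistic disorder"),
]
validated = {
    ("hsa-mir-34a", "Heart failure"),
    ("hsa-mir-98", "Breast neoplasms"),
}
print(top_k_validation_rate(ranked, validated, k=3))
```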

Furthermore, the same strategy was adopted for specific diseases, and three case studies were carried out on gastric neoplasms, prostate neoplasms and colon neoplasms. As shown in Table 4, HMDD V3.2 validated 50, 47 and 45 of the top 50 miRNAs predicted by the ELMDA framework for these three diseases, respectively.

Table 4.

The ELMDA framework was implemented to investigate gastric neoplasms, prostate neoplasms and colon neoplasms, and 100%, 94%, and 90%, respectively, of the top 50 potential miRNAs were validated by HMDD V3.2

Rank	Gastric neoplasms		Prostate neoplasms		Colon neoplasms
Rank	miRNA	Validated	miRNA	Validated	miRNA	Validated
1	hsa-mir-21	True	hsa-mir-21	True	hsa-mir-20a	True
2	hsa-mir-146a	True	hsa-mir-155	True	hsa-mir-93	True
3	hsa-mir-155	True	hsa-mir-34a	True	hsa-mir-21	True
4	hsa-mir-29a	True	hsa-mir-29a	True	hsa-mir-29a	True
5	hsa-mir-145	True	hsa-mir-222	True	hsa-mir-155	True
6	hsa-mir-17	True	hsa-mir-18a	True	hsa-mir-146a	True
7	hsa-mir-126	True	hsa-mir-146a	True	hsa-mir-122	True
8	hsa-mir-20a	True	hsa-mir-29b	True	hsa-mir-125b	True
9	hsa-mir-29b	True	hsa-mir-221	True	hsa-mir-221	True
10	hsa-mir-125b	True	hsa-mir-17	True	hsa-mir-106b	True
11	hsa-mir-200b	True	hsa-mir-122	True	hsa-mir-29b	True
12	hsa-mir-222	True	hsa-mir-20a	True	hsa-mir-182	True
13	hsa-mir-18a	True	hsa-mir-34c	True	hsa-mir-222	True
14	hsa-mir-221	True	hsa-mir-34b	True	hsa-mir-34a	True
15	hsa-mir-200c	True	hsa-mir-199a	True	hsa-mir-29c	True
16	hsa-mir-29c	True	hsa-mir-210	True	hsa-mir-20b	True
17	hsa-mir-1	True	hsa-mir-133b	True	hsa-mir-199a	True
18	hsa-mir-146b	True	hsa-mir-223	True	hsa-mir-205	True
19	hsa-mir-93	True	hsa-mir-93	True	hsa-mir-214	True
20	hsa-mir-107	True	hsa-mir-126	True	hsa-mir-34b	True
21	hsa-mir-9	True	hsa-mir-124	True	hsa-mir-34c	True
22	hsa-mir-34a	True	hsa-mir-184	False	hsa-mir-200b	True
23	hsa-mir-182	True	hsa-mir-214	True	hsa-mir-133a	True
24	hsa-mir-26a	True	hsa-mir-182	True	hsa-mir-200c	True
25	hsa-mir-23b	True	hsa-mir-27a	True	hsa-mir-18a	True
26	hsa-mir-34b	True	hsa-mir-29c	True	hsa-mir-31	True
27	hsa-let-7a	True	hsa-mir-342	True	hsa-mir-146b	False
28	hsa-mir-133b	True	hsa-mir-99b	True	hsa-mir-183	True
29	hsa-mir-27a	True	hsa-mir-23a	True	hsa-mir-486	True
30	hsa-mir-34c	True	hsa-mir-486	True	hsa-mir-107	True
31	hsa-let-7c	True	hsa-mir-133a	True	hsa-mir-133b	True
32	hsa-let-7b	True	hsa-mir-31	True	hsa-mir-23a	True
33	hsa-mir-106b	True	hsa-mir-200b	True	hsa-mir-223	True
34	hsa-mir-133a	True	hsa-mir-92a	True	hsa-mir-140	True
35	hsa-mir-183	True	hsa-mir-192	True	hsa-mir-143	True
36	hsa-mir-214	True	hsa-mir-16	True	hsa-mir-519d	False
37	hsa-mir-342	True	hsa-mir-224	True	hsa-let-7b	True
38	hsa-mir-200a	True	hsa-mir-7	True	hsa-mir-9	False
39	hsa-mir-196a	True	hsa-mir-146b	True	hsa-mir-1	True
40	hsa-mir-31	True	hsa-mir-200c	True	hsa-mir-124	False
41	hsa-mir-122	True	hsa-mir-151a	True	hsa-mir-125a	True
42	hsa-mir-181a	True	hsa-mir-137	False	hsa-mir-210	True
43	hsa-mir-16	True	hsa-mir-9	True	hsa-mir-429	True
44	hsa-let-7g	True	hsa-mir-429	True	hsa-mir-101	True
45	hsa-mir-142	True	hsa-mir-1	True	hsa-mir-96	True
46	hsa-mir-223	True	hsa-mir-205	True	hsa-mir-7	False
47	hsa-mir-206	True	hsa-mir-423	False	hsa-mir-10b	True
48	hsa-mir-141	True	hsa-mir-96	True	hsa-mir-130a	True
49	hsa-mir-101	True	hsa-mir-200a	True	hsa-mir-218	True
50	hsa-mir-486	True	hsa-mir-106b	True	hsa-let-7a	True

An isolated disease is a disease without any known associated miRNA. To evaluate how well the ELMDA framework predicts miRNAs related to isolated diseases, the prediction scores of potential associations were calculated after removing all known associations of the disease being predicted. Isolated disease-related miRNA predictions were implemented for gastric neoplasms, prostate neoplasms and colon neoplasms. As shown in Table 5, HMDD V3.2 validated 48, 42 and 43 of the top 50 miRNAs predicted by the ELMDA framework for these three diseases, respectively.

Table 5.

Predictions of isolated disease-related miRNAs for gastric neoplasms, prostate neoplasms and colon neoplasms; 96%, 84%, and 86%, respectively, of the top 50 potential miRNAs were validated by HMDD V3.2

Rank	Gastric neoplasms		Prostate neoplasms		Colon neoplasms
Rank	miRNA	Validated	miRNA	Validated	miRNA	Validated
1	hsa-mir-21	True	hsa-mir-17	True	hsa-mir-21	True
2	hsa-mir-146a	True	hsa-mir-21	True	hsa-mir-29a	True
3	hsa-mir-155	True	hsa-mir-20a	True	hsa-mir-221	True
4	hsa-mir-17	True	hsa-mir-122	True	hsa-mir-155	True
5	hsa-mir-145	True	hsa-mir-29a	True	hsa-mir-122	True
6	hsa-mir-20a	True	hsa-mir-146a	True	hsa-mir-222	True
7	hsa-mir-125b	True	hsa-mir-93	True	hsa-mir-146a	True
8	hsa-mir-93	True	hsa-mir-133b	True	hsa-mir-34a	True
9	hsa-mir-29b	True	hsa-mir-34a	True	hsa-mir-29c	True
10	hsa-mir-222	True	hsa-mir-199a	True	hsa-mir-133b	True
11	hsa-mir-29a	True	hsa-mir-29c	True	hsa-mir-18a	True
12	hsa-mir-1	True	hsa-mir-155	True	hsa-mir-146b	False
13	hsa-mir-221	True	hsa-mir-210	True	hsa-mir-29b	True
14	hsa-mir-133b	True	hsa-mir-200a	True	hsa-mir-223	True
15	hsa-mir-18a	True	hsa-mir-34c	True	hsa-mir-125b	True
16	hsa-mir-34a	True	hsa-mir-184	False	hsa-mir-486	True
17	hsa-mir-126	True	hsa-mir-29b	True	hsa-mir-34c	True
18	hsa-mir-27a	True	hsa-mir-126	True	hsa-mir-151a	False
19	hsa-mir-106b	True	hsa-mir-151a	True	hsa-mir-23a	True
20	hsa-let-7b	True	hsa-mir-18a	True	hsa-mir-34b	True
21	hsa-let-7a	True	hsa-mir-192	True	hsa-mir-133a	True
22	hsa-mir-29c	True	hsa-mir-222	True	hsa-mir-107	True
23	hsa-mir-16	True	hsa-mir-31	True	hsa-mir-210	True
24	hsa-mir-146b	True	hsa-mir-106b	True	hsa-mir-200c	True
25	hsa-mir-9	True	hsa-mir-200b	True	hsa-mir-214	True
26	hsa-mir-196a	True	hsa-mir-133a	True	hsa-mir-99a	True
27	hsa-mir-26a	True	hsa-mir-215	False	hsa-mir-200b	True
28	hsa-let-7c	True	hsa-mir-151b	True	hsa-mir-31	True
29	hsa-mir-206	True	hsa-mir-199b	False	hsa-mir-182	True
30	hsa-mir-124	True	hsa-mir-20b	True	hsa-mir-199a	True
31	hsa-mir-183	True	hsa-mir-429	True	hsa-mir-23b	True
32	hsa-mir-373	True	hsa-mir-146b	True	hsa-mir-9	False
33	hsa-mir-27b	True	hsa-mir-141	True	hsa-mir-183	True
34	hsa-mir-142	True	hsa-mir-34b	True	hsa-mir-96	True
35	hsa-mir-122	True	hsa-mir-1	True	hsa-mir-205	True
36	hsa-mir-98	False	hsa-mir-28	False	hsa-mir-137	True
37	hsa-mir-214	True	hsa-mir-371a	False	hsa-mir-342	False
38	hsa-mir-15b	True	hsa-mir-137	False	hsa-mir-429	True
39	hsa-mir-34b	True	hsa-mir-148a	True	hsa-mir-103a	True
40	hsa-mir-34c	True	hsa-mir-200c	True	hsa-mir-143	True
41	hsa-mir-107	True	hsa-mir-451a	False	hsa-mir-27a	True
42	hsa-mir-133a	True	hsa-mir-486	True	hsa-mir-150	True
43	hsa-let-7e	False	hsa-mir-449a	True	hsa-mir-708	False
44	hsa-mir-181a	True	hsa-mir-106a	True	hsa-mir-326	True
45	hsa-let-7g	True	hsa-mir-203	True	hsa-mir-124	False
46	hsa-mir-200b	True	hsa-mir-182	True	hsa-mir-10b	True
47	hsa-mir-20b	True	hsa-mir-185	True	hsa-mir-99b	True
48	hsa-mir-205	True	hsa-mir-224	True	hsa-mir-130a	True
49	hsa-mir-342	True	hsa-mir-99b	True	hsa-mir-138	True
50	hsa-mir-101	True	hsa-mir-326	False	hsa-mir-28	False
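
The isolated-disease setting can be simulated by clearing every known association of the target disease before training, so that the disease enters the model with no positive samples. A minimal sketch, assuming the association matrix is stored as a list of rows (miRNAs) by columns (diseases); the function name and the toy matrix are illustrative.

```python
def mask_disease(assoc, disease_index):
    """Return a copy of the association matrix in which the column of one
    disease is set to 0, leaving the input matrix unchanged."""
    return [[0 if j == disease_index else value
             for j, value in enumerate(row)]
            for row in assoc]


# Toy matrix: 2 miRNAs x 3 diseases; treat disease 2 as isolated.
toy_assoc = [
    [1, 0, 1],
    [0, 1, 1],
]
masked = mask_disease(toy_assoc, disease_index=2)
print(masked)
```

The removed associations then serve as held-out ground truth against which the ranked predictions for that disease can be checked.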

According to the above description, the ELMDA framework exhibits good performance for predicting potential miRNA-disease associations and isolated disease-related miRNAs.

Discussion

Accumulating evidence indicates that miRNAs play important roles in disease development. Identifying disease-related miRNAs helps to understand disease mechanisms at the molecular level. As valuable complements to experimental studies, computational models that identify associations between miRNAs and diseases are in high demand.

In this work, the miRNA-mRNA interactions verified by experiments are used to construct the miRNA similarity network, and the disease similarity network is constructed by using the similarities of disease functions and disease targets. The training dataset is reconstructed through feature extraction and sample selection, and the model is trained by multiclassifier voting. The model has shown good performance in both the fivefold cross validation and case studies and can predict isolated disease-related miRNAs.

Despite the favorable results obtained using the ELMDA framework, this study has some limitations. First, when we calculated the similarities among miRNAs and diseases, we used target data that were verified by experiments. However, target data that are verified by experiments are very sparse, resulting in no common target genes between many miRNAs and diseases, and the similarities of many miRNA pairs and disease pairs are 0. With the deepening of relevant research, considering that miRNA target genes and disease target genes are increasingly recognized, this problem will improve. Second, the ELMDA framework uses a form of multiclassifier voting to obtain the final prediction scores. We choose the top four classifiers with the highest scores to build the model through experimental methods. There is no theoretical basis for the selection of a single classifier. In the future, we will further study the selection method for classifiers and assign different weights to each classifier to improve the model.

Conclusion

We propose a model framework, named ELMDA, to predict unknown miRNA-disease associations. The framework integrates miRNA and disease similarity networks that are built without the known associations and predicts potential associations by multi-classifier voting. The performance of the model framework was evaluated through fivefold cross-validation, and the predictive ability of the model was verified through case studies. The model can predict miRNAs related to isolated diseases. In conclusion, ELMDA appears to be a reliable method to uncover disease-associated miRNAs.

Materials and methods

Human miRNA–disease associations

The experimentally verified human miRNA–disease associations were downloaded from the HMDD V2.0 database [11]. The database provides 5430 distinct high-quality experimentally verified miRNA–disease associations, which involve 495 miRNAs and 383 diseases. We use this dataset as the benchmark dataset and the variables $m$ and $d$ to represent the number of miRNAs and diseases, respectively. The adjacency matrix of miRNA–disease associations is denoted by $A$, where the entry $A(i, j)$ in row $i$ and column $j$ is 1 if miRNA $i$ is associated with disease $j$ and 0 otherwise. Matrix $A$ is very sparse, with a known association density of approximately 0.0286 (5430 known associations among 495 × 383 = 189,585 miRNA-disease pairs). The research task in this work is to discover the potential miRNA-disease associations (the 0 entries in matrix $A$).
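
The construction of $A$ and its density can be sketched as follows. The helper names and the toy miRNA and disease labels are placeholders; the last lines apply the same ratio to the benchmark counts quoted above, 5430 / (495 × 383), which evaluates to approximately 0.0286.

```python
def build_adjacency(pairs, mirnas, diseases):
    """Build the binary matrix A with A[i][j] = 1 if miRNA i is associated
    with disease j, and 0 otherwise."""
    row = {name: i for i, name in enumerate(mirnas)}
    col = {name: j for j, name in enumerate(diseases)}
    matrix = [[0] * len(diseases) for _ in mirnas]
    for mirna, disease in pairs:
        matrix[row[mirna]][col[disease]] = 1
    return matrix


def density(matrix):
    """Fraction of entries equal to 1 in a non-empty binary matrix."""
    ones = sum(sum(r) for r in matrix)
    return ones / (len(matrix) * len(matrix[0]))


# Toy example with placeholder names (not HMDD entries).
toy_A = build_adjacency(
    pairs=[("miR-x", "disease-1"), ("miR-y", "disease-2")],
    mirnas=["miR-x", "miR-y"],
    diseases=["disease-1", "disease-2", "disease-3"],
)
print(toy_A, density(toy_A))

# Density implied by the benchmark counts quoted in the text.
benchmark_density = 5430 / (495 * 383)
print(round(benchmark_density, 4))  # 0.0286
```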

Disease similarity and miRNA similarity

Many miRNA disease association prediction models construct miRNA and disease similarity networks, which combine known associations to improve model performance. However, the direct use of these similarity data in the cross validation of the model will overestimate the model performance. If the known association data in the training set are removed and the similarity is recalculated during each cross validation, this will involve great time costs. Considering this factor, we constructed miRNA and disease similarity networks without using known association information.

The disease similarity network consists of two parts: semantic similarity and functional similarity. We use the method proposed by Wang et al. [35] to calculate the disease semantic similarity and use the matrix, SD1, to represent it. SD1(A,B) represents the semantic similarity between disease A and disease B.

Based on the assumption that diseases with similar functions tend to be associated with similar target genes (mRNAs), we measured the functional similarity of two diseases by considering their related target genes. DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases [36]. Disease-mRNA interactions were obtained from the latest version, DisGeNET V7.0. Let $T_{d}^{A} = \{T_{d}^{A}(1), T_{d}^{A}(2), \dots, T_{d}^{A}(na)\}$ and $T_{d}^{B} = \{T_{d}^{B}(1), T_{d}^{B}(2), \dots, T_{d}^{B}(nb)\}$ denote the target gene sets of diseases $A$ and $B$, where $na$ and $nb$ are the numbers of target genes of diseases $A$ and $B$, respectively. The information entropy of $T_{d}^{A}$ is defined in Eq. (4):

$$\begin{cases} H(T_{d}^{A}) = -\sum_{i=1}^{na} p(T_{d}^{A}(i)) \log_{2} p(T_{d}^{A}(i)) \\ p(T_{d}^{A}(i)) = \dfrac{n(T_{d}^{A}(i))}{N} \end{cases} \tag{4}$$

where $N$ is the total number of disease-mRNA interactions, $n(T_{d}^{A}(i))$ is the number of occurrences of the $i$th target gene of disease $A$ in the disease-mRNA set, $p(T_{d}^{A}(i))$ is the frequency of the $i$th target gene of disease $A$ in the disease-mRNA set, and $H(T_{d}^{A})$ is the information entropy of $T_{d}^{A}$.

The normalized mutual information (NMI) of $T_{d}^{A}$ and $T_{d}^{B}$ is used to measure the functional similarity of diseases $A$ and $B$ , as shown in Eq. (5):

$$SD2(A, B) = \frac{2 \cdot H(T_{d}^{A} \cap T_{d}^{B})}{H(T_{d}^{A}) + H(T_{d}^{B})} \tag{5}$$

where $H(T_{d}^{A})$, $H(T_{d}^{B})$ and $H(T_{d}^{A} \cap T_{d}^{B})$ represent the information entropy of $T_{d}^{A}$, $T_{d}^{B}$ and the intersection of $T_{d}^{A}$ and $T_{d}^{B}$, respectively. Equation (5) thus measures the functional similarity between two diseases from their common target genes and the information entropy of their respective target gene sets, normalized in the manner of NMI. Matrix $SD2$ is the functional similarity matrix, and $SD2(i, j)$ in row $i$ and column $j$ represents the similarity between diseases $i$ and $j$.
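The entropy of Eq. (4) and the similarity of Eq. (5) can be sketched on a toy interaction list. The disease and gene names, the function names, and the choice to return 0 when both entropies are 0 are assumptions made for illustration; they are not taken from the article or from DisGeNET.

```python
import math
from collections import Counter

def gene_frequencies(interactions):
    """Map each gene to n(gene) / N over all (disease, gene) interactions."""
    counts = Counter(gene for _, gene in interactions)
    total = len(interactions)
    return {gene: c / total for gene, c in counts.items()}

def entropy(genes, freq):
    """H(T) = -sum_i p(T(i)) * log2 p(T(i)) over the genes in T (Eq. 4)."""
    return -sum(freq[g] * math.log2(freq[g]) for g in genes)

def functional_similarity(genes_a, genes_b, freq):
    """SD2(A, B) = 2 * H(A & B) / (H(A) + H(B)) (Eq. 5)."""
    denom = entropy(genes_a, freq) + entropy(genes_b, freq)
    if denom == 0:  # assumption: define the similarity as 0 in this edge case
        return 0.0
    return 2 * entropy(genes_a & genes_b, freq) / denom

# Hypothetical disease-gene interactions (N = 4).
interactions = [("d1", "g1"), ("d1", "g2"), ("d2", "g2"), ("d2", "g3")]
freq = gene_frequencies(interactions)

targets = {}
for disease, gene in interactions:
    targets.setdefault(disease, set()).add(gene)

sim_d1_d2 = functional_similarity(targets["d1"], targets["d2"], freq)
sim_d1_d1 = functional_similarity(targets["d1"], targets["d1"], freq)
```

In this toy case each disease has entropy 1 and the shared gene `g2` contributes entropy 0.5, so the two diseases have similarity 0.5, while a disease compared with itself has similarity 1.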

The disease similarity is obtained by integrating the semantic similarity and the functional similarity, as in Eq. (6):

$$SD(i, j) = α \cdot SD1(i, j) + (1 - α) \cdot SD2(i, j) \tag{6}$$

where $α$ and $(1 - α)$ are the contribution parameters of the semantic similarity and functional similarity, respectively. In our experiments, the two contributions are assumed to be equal, so $α$ is set to 0.5.
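As a small worked example of Eq. (6), the weighted combination can be written as an element-wise operation on two matrices. The matrices below are made-up values, and `combine_similarity` is a hypothetical helper name.

```python
def combine_similarity(sd1, sd2, alpha=0.5):
    """Element-wise SD = alpha * SD1 + (1 - alpha) * SD2 (Eq. 6)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * a + (1 - alpha) * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(sd1, sd2)
    ]

# Made-up 2 x 2 semantic and functional similarity matrices.
sd1 = [[1.0, 0.2], [0.2, 1.0]]
sd2 = [[1.0, 0.6], [0.6, 1.0]]
sd = combine_similarity(sd1, sd2, alpha=0.5)
```

With $α = 0.5$ the off-diagonal entry becomes the plain average of 0.2 and 0.6.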

MiRNAs are important regulatory RNAs that mainly function by repressing gene expression at the posttranscriptional level, binding to the 3'-UTR of target mRNAs through base pairing [37]. Based on the assumption that miRNAs with similar functions tend to be associated with similar target genes, we downloaded miRNA‒mRNA interaction data from the miRTarBase database [38, 39]. Let $T_{m}^{A} = \{T_{m}^{A}(1), T_{m}^{A}(2), \dots, T_{m}^{A}(ma)\}$ and $T_{m}^{B} = \{T_{m}^{B}(1), T_{m}^{B}(2), \dots, T_{m}^{B}(mb)\}$ denote the target gene sets of miRNAs $A$ and $B$, where $ma$ and $mb$ are the numbers of target genes of miRNAs $A$ and $B$, respectively. The miRNA functional similarities were calculated in the same way as the disease functional similarities and are represented by matrix $SM$, where $SM(i, j)$ in row $i$ and column $j$ represents the similarity between miRNAs $i$ and $j$.

ELMDA model

In this section, we will detail the ELMDA model construction process and show the overall process in Fig. 4.

Step 1: Structural feature extraction

To describe the sample characteristics more comprehensively and improve model performance, we extracted structural features of miRNAs and diseases and added them to each sample. We extracted 17 structural features for each miRNA–disease pair, expressed as $f = \{f(1), f(2), \dots, f(17)\}$. Feature $f(17)$ is the number of target genes associated with both miRNA $i$ and disease $j$, where the miRNA and disease target genes are obtained from the miRTarBase and DisGeNET databases, respectively. The remaining 16 features fall into two parts: the first 8 are structural features of miRNA $i$, and the other 8 are structural features of disease $j$. The first two structural features of miRNA $i$ are the number and density of miRNA-associated genes. The third and fourth features are the mean and variance of the miRNA similarity vector $SM(i, :)$. The remaining four features are obtained from the 200 miRNAs most similar to miRNA $i$: these are divided into groups of 50, and the average similarity of each group is used as one feature. In the same way, the structural features of disease $j$ comprise the number and density of disease-associated genes, the mean and variance of the disease similarity vector, and the group-wise average similarities of the most similar diseases.
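The similarity-based part of these structural features can be sketched as follows. Several details are assumptions, since the text does not fix them: the entity's own self-similarity is excluded before ranking, the groups are formed by rank order, and "density" is taken as the number of associated genes divided by the number of all genes. The toy call uses small values of `top_k` and `group_size` instead of 200 and 50.

```python
from statistics import fmean, pvariance

def similarity_features(sim_row, self_index, top_k=200, group_size=50):
    """Mean and variance of a similarity row, then rank-grouped averages.

    Assumption: the entity's own entry is dropped before ranking neighbours.
    """
    others = [s for idx, s in enumerate(sim_row) if idx != self_index]
    ranked = sorted(others, reverse=True)[:top_k]
    groups = [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]
    return [fmean(sim_row), pvariance(sim_row)] + [fmean(g) for g in groups]

def gene_features(n_associated, n_all_genes):
    """Number and density of associated genes (density definition is assumed)."""
    return [n_associated, n_associated / n_all_genes]

# Toy similarity row for entity 0 among five entities.
row = [1.0, 0.8, 0.6, 0.4, 0.2]
sim_part = similarity_features(row, self_index=0, top_k=4, group_size=2)
gene_part = gene_features(n_associated=3, n_all_genes=12)
features = gene_part + sim_part
```

With the settings in the text (`top_k=200`, `group_size=50`) the function returns six values, which together with the two gene-based values gives the eight features per miRNA or disease.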

Step 2: Coding the sample set

Each sample is formulated as $S(k) = \{SM(i, :), SD(j, :)\}$, where $SM(i, :)$ is row $i$ of the miRNA similarity matrix $SM$ and $SD(j, :)$ is row $j$ of the disease similarity matrix $SD$. The corresponding sample label, $L(k)$, is 1 if miRNA $i$ is associated with disease $j$ and 0 otherwise. The dataset contains 495 miRNAs and 383 diseases, so the initial sample set $S$ has size 189,585 × 878 (495 × 383 samples, each with 495 + 383 features). Because the number of features is large, we reduced the dimensions of the miRNA features and the disease features to improve the efficiency of model training. We used principal component analysis (PCA) to retain 80% of the information in the data, which yielded 80 miRNA features and 119 disease features. Finally, the 17 structural features were added to each sample (miRNA–disease pair) to form the final sample set, which includes 189,585 samples and 216 features (80 + 119 + 17).
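The coding step can be sketched in plain Python on toy matrices. PCA itself is not implemented here; the second helper only shows how the number of retained components follows from a variance threshold, given eigenvalues that are invented for the example. All function names are hypothetical.

```python
def encode_samples(SM, SD, A):
    """One sample per (miRNA i, disease j): concatenated rows plus label A[i][j]."""
    samples, labels = [], []
    for i, sm_row in enumerate(SM):
        for j, sd_row in enumerate(SD):
            samples.append(list(sm_row) + list(sd_row))
            labels.append(A[i][j])
    return samples, labels

def components_for_variance(eigenvalues, keep=0.8):
    """Smallest number of leading components whose variance share reaches `keep`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= keep:
            return k
    return len(ordered)

# Toy data: 2 miRNAs, 3 diseases.
SM = [[1.0, 0.3], [0.3, 1.0]]
SD = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.5], [0.1, 0.5, 1.0]]
A = [[1, 0, 0], [0, 0, 1]]

samples, labels = encode_samples(SM, SD, A)
n_kept = components_for_variance([6.0, 3.0, 1.0], keep=0.8)

# Dimensions reported in the text.
n_samples, n_raw_features = 495 * 383, 495 + 383
n_final_features = 80 + 119 + 17
```

In the toy case this gives 6 samples with 5 raw features each, mirroring the 189,585 × 878 layout of the full sample set.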

Step 3: Model architecture

In this work, we used an ensemble of machine learning classifiers, including support vector machine (SVM), gradient boosting decision tree (GBDT), random forest (RF) and eXtreme gradient boosting (XGBoost) classifiers. Their outputs are combined by a soft voting classifier to improve accuracy. The individual algorithms are briefly described below.
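Soft voting averages the class probabilities produced by the individual classifiers instead of counting their hard votes. The sketch below assumes that each classifier has already produced a positive-class probability for every candidate pair; the probability values and the 0.5 decision threshold are invented for illustration and are not results from the article.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of positive-class probabilities, one value per sample."""
    weights = weights or [1.0] * len(prob_lists)
    total = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[s] for w, probs in zip(weights, prob_lists)) / total
        for s in range(n_samples)
    ]

# Invented probabilities from four classifiers for three miRNA-disease pairs.
svm_p = [0.9, 0.4, 0.2]
gbdt_p = [0.8, 0.6, 0.1]
rf_p = [0.7, 0.5, 0.3]
xgb_p = [0.6, 0.7, 0.2]

scores = soft_vote([svm_p, gbdt_p, rf_p, xgb_p])
predicted = [1 if s >= 0.5 else 0 for s in scores]
```

The averaged scores can also be used directly to rank candidate pairs. scikit-learn's `VotingClassifier` with `voting="soft"` offers the same kind of averaging, although the text does not state which implementation was used.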

SVM classification: SVM is a supervised learning algorithm used for classification and regression [40]. An SVM aims to fit an optimal separating hyperplane (OSH) between classes by focusing on the training samples that lie at the edges of the class distributions, the support vectors. A hyperplane is defined as $ω ∙ x + b = 0$, where $x$ is a point lying on the hyperplane, $ω$ is the normal to the hyperplane, and $b$ is the bias. For the linearly separable case, a separating hyperplane for the two classes satisfies $ω ∙ x_{i} + b \geq +1$ (for $y_{i} = +1$) and $ω ∙ x_{i} + b \leq -1$ (for $y_{i} = -1$), where $y_{i}$ is the class label of sample $x_{i}$: $+1$ for a positive sample and $-1$ for a negative sample.
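The two margin constraints can be combined into the single condition $y_{i}(ω ∙ x_{i} + b) \geq 1$. The following sketch checks this condition for a hand-picked hyperplane and toy points; it does not train an SVM, and the weights and points are illustrative assumptions.

```python
def decision_value(w, b, x):
    """Signed score w . x + b of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margins(w, b, points, labels):
    """y_i * (w . x_i + b) for every sample; at least 1 when the constraints hold."""
    return [y * decision_value(w, b, x) for x, y in zip(points, labels)]

# Hand-picked hyperplane x_1 = 0 and linearly separable toy points.
w, b = (1.0, 0.0), 0.0
points = [(2.0, 1.0), (1.0, -1.0), (-1.0, 0.0), (-3.0, 2.0)]
labels = [1, 1, -1, -1]

m = margins(w, b, points, labels)
constraints_hold = all(value >= 1 for value in m)
# Samples lying exactly on the margin play the role of support vectors.
on_margin = [i for i, value in enumerate(m) if abs(value - 1) < 1e-9]
```

Here the second and third points lie exactly on the margin, so they are the samples that would determine the hyperplane.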

GBDT classification: The gradient boosting decision tree (GBDT) is a machine learning technique used for regression and classification tasks. GBDT provides a prediction model in the form of an ensemble of weak prediction models (usually decision trees) [41]. When the weak learner is a decision tree, the resulting algorithm is called gradient-boosted trees. In recent years, GBDT has achieved great success in the fields of machine learning and data mining. A gradient-boosted trees model is built in the same stage-wise manner as other boosting methods, but it generalizes them by allowing the optimization of any differentiable loss function.
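The stage-wise idea can be sketched with one-split regression trees (stumps) and a squared-error loss, for which the negative gradient is simply the residual. This is a simplified illustration on a single feature, not the GBDT configuration used in the study; classification variants usually boost on a log-loss instead.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x identical: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # Residuals are the negative gradient of 1/2 * (y - prediction)^2.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
```

Each round shrinks the remaining residual, so the predictions for the two groups move steadily toward 0 and 1.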

RF classification: Random forest (RF) [42] builds a forest of decision trees by random sampling. "Random" refers to the random sampling used to build the model; "forest" means that the model consists of many independent decision trees. The basic principle of random forest is as follows: N training datasets are randomly sampled from the original data with replacement; k features are randomly selected for each training dataset (k is less than the total number of features in the original data); M decision trees are built from these datasets using the selected k features; each decision tree is applied to predict the result, and all predictions are saved; finally, the predictions are put to a vote, the number of votes for each predicted class is counted, and the class with the most votes is selected as the final decision. Aggregating many decision trees in this way reduces the risk of overfitting.
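The procedure above (bootstrap sampling, random feature selection, and majority voting) can be sketched with one-split trees standing in for full decision trees. The data, the seed, and all function names are illustrative assumptions, and a practical forest would grow deeper trees.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label (ties resolved by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]

def bootstrap(rows, labels, rng):
    """Draw len(rows) samples with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def fit_stump(rows, labels, features):
    """Best one-split tree over the given feature subset."""
    best = (len(labels) + 1, None)
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best[0]:
                best = (errors, (f, t, l_lab, r_lab))
    if best[1] is None:  # no split possible: predict the majority class
        constant = majority(labels)
        return lambda row: constant
    f, t, l_lab, r_lab = best[1]
    return lambda row: l_lab if row[f] < t else r_lab

def fit_forest(rows, labels, n_trees=25, k=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and k random features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    trees = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap(rows, labels, rng)
        features = rng.sample(range(n_features), k)
        trees.append(fit_stump(sample_rows, sample_labels, features))
    return lambda row: majority([tree(row) for tree in trees])

rows = [[1, 0.1], [2, 0.2], [3, 0.3], [4, 0.4],
        [11, 1.1], [12, 1.2], [13, 1.3], [14, 1.4]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

Because each tree sees a different resample and a different feature subset, individual errors tend to be outvoted by the rest of the forest.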

XGBoost classification: XGBoost is a supervised learning algorithm for classification and regression problems. It implements gradient boosting to build a boosted model; it was initially developed by Tianqi Chen [43] and has since been adopted by many developers. XGBoost was designed mainly for speed and performance using gradient-boosted decision trees. Boosting is an ensemble learning technique that builds multiple models in sequence, with each new model attempting to correct the errors of the previous ones. In tree boosting, each new model added to the ensemble is a decision tree. XGBoost supports three major forms of gradient boosting: gradient boosting, regularized boosting, and stochastic boosting.
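For context on the regularized form of boosting, the objective given in the XGBoost paper [43] adds a complexity penalty for each tree to the training loss. The symbols below ($l$, $\Omega$, $f_{k}$, $T$, $w$, $\gamma$, $\lambda$) follow that paper and are not used elsewhere in this article.

```latex
\mathcal{L} = \sum_{i} l(y_{i}, \hat{y}_{i}) + \sum_{k} \Omega(f_{k}),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}
```

Here $l$ is a differentiable loss between the label $y_{i}$ and the prediction $\hat{y}_{i}$, $f_{k}$ is the $k$th tree, $T$ is its number of leaves, and $w$ is its vector of leaf weights.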

Training dataset

The full sample set is highly imbalanced: of its 189,585 samples, only 5430 are positive (known associations verified by experiments), a proportion of 2.86%. To better evaluate the model, we selected negative samples using the same method as in reference [44], obtaining a total of 5430 positive samples and 5418 negative samples, which form a relatively balanced dataset. After applying the feature extraction methods described earlier and merging the structural features, the final training dataset contains 10,849 samples with 216 features.
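The balancing idea can be illustrated with plain random undersampling of the unlabeled pairs. This is a stand-in chosen for simplicity and is not the negative-selection method of reference [44]; the function name, the seed, and the toy matrix are assumptions.

```python
import random

def balanced_pairs(A, n_negative=None, seed=0):
    """Return all positive pairs and an equally sized random set of unlabeled pairs.

    Stand-in strategy: uniform sampling without replacement from the zeros of A.
    """
    positives = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v == 1]
    unknown = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v == 0]
    if n_negative is None:
        n_negative = len(positives)
    rng = random.Random(seed)
    negatives = rng.sample(unknown, min(n_negative, len(unknown)))
    return positives, negatives

# Toy adjacency matrix with 3 known associations among 12 pairs.
A = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]
positives, negatives = balanced_pairs(A)
```

Unlabeled pairs are not confirmed negatives, so any sampling scheme of this kind may include true associations that have not yet been verified.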

Acknowledgements

Not applicable.

Author contributions

C.L.G. conceived the project, developed the prediction method, designed and implemented the experiments, analyzed the result, and wrote the paper. X.Y.L. implemented the experiments, analyzed the result, and wrote the paper. All authors reviewed the final manuscript.

Funding

This study is supported by National Nature Science Foundation of Hunan Province (Grant No. 2020JJ4209).

Availability of data and materials

The source code and datasets analysed during the current study are available at https://github.com/Changlong2020/ELMDA. All data used in the paper, including the miRNA-disease associations, miRNA-target interactions and disease-target interactions, were obtained from current public databases and are cited in the text. The experimentally verified human miRNA–disease associations were downloaded from the HMDD database (http://www.cuilab.cn/hmdd), where HMDD V2.0 was used for model training and HMDD V3.2 was used for case studies. The disease semantic similarity was downloaded from http://www.cuilab.cn/files/images/cuilab/misim.zip. The disease-mRNA interactions were obtained from DisGeNET V7.0 (https://www.disgenet.org/downloads) and used to calculate disease functional similarity. The miRNA‒mRNA interaction data were downloaded from the miRTarBase database, Release 9.0 (https://mirtarbase.cuhk.edu.cn/~miRTarBase/miRTarBase_2022/php/download.php).

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Changlong Gu, Email: guchanglong@hnu.edu.cn.

Xiaoying Li, Email: lixy@hnu.edu.cn.

References

1. Dean KR, Hammamieh R, Mellon SH, Abu-Amara D, Flory JD, Guffanti G, Wang K, Daigle BJ, Jr, Gautam A, Lee I, et al. Multi-omic biomarker identification and validation for diagnosing warzone-related post-traumatic stress disorder. Mol Psychiatry. 2020;25(12):3337–3349. doi: 10.1038/s41380-019-0496-z.
2. Shen L, Liu F, Huang L, Liu G, Zhou L, Peng L. VDA-RWLRLS: an anti-SARS-CoV-2 drug prioritizing framework combining an unbalanced bi-random walk and Laplacian regularized least squares. Comput Biol Med. 2021;140:105119. doi: 10.1016/j.compbiomed.2021.105119.
3. Peng L, Wang F, Wang Z, Tan J, Huang L, Tian X, Liu G, Zhou L. Cell–cell communication inference and analysis in the tumour microenvironments from single-cell transcriptomics: data resources and computational strategies. Brief Bioinform. 2022;23(4).
4. Lihong P, Wang C, Tian X, Zhou L, Li K. Finding lncRNA-protein interactions based on deep learning with dual-net neural architecture. IEEE/ACM Trans Comput Biol Bioinform. 2021.
5. Wang T, Shao W, Huang Z, Tang H, Zhang J, Ding Z, Huang K. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun. 2021;12(1):3445. doi: 10.1038/s41467-021-23774-w.
6. Zhou L, Wang Z, Tian X, Peng L. LPI-deepGBDT: a multiple-layer deep framework based on gradient boosting decision trees for lncRNA-protein interaction identification. BMC Bioinformatics. 2021;22(1):479. doi: 10.1186/s12859-021-04399-8.
7. Bartel DP. Metazoan MicroRNAs. Cell. 2018;173(1):20–51. doi: 10.1016/j.cell.2018.03.006.
8. Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136(2):215–233. doi: 10.1016/j.cell.2009.01.002.
9. Toden S, Zumwalt TJ, Goel A. Non-coding RNAs and potential therapeutic targeting in cancer. BBA Rev Cancer. 2021;1875(1).
10. Ye JW, Xu MC, Tian XK, Cai S, Zeng S. Research advances in the detection of miRNA. J Pharm Anal. 2019;9(4):217–226. doi: 10.1016/j.jpha.2019.05.004.
11. Li Y, Qiu CX, Tu J, Geng B, Yang JC, Jiang TZ, Cui QH. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;42(D1):D1070–D1074. doi: 10.1093/nar/gkt1023.
12. Huang Z, Shi JC, Gao YX, Cui CM, Zhang S, Li JW, Zhou Y, Cui QH. HMDD v3.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res. 2019;47(D1):D1013–D1017. doi: 10.1093/nar/gky1010.
13. Yang Z, Ren F, Liu CN, He SM, Sun G, Gao QA, Yao L, Zhang YD, Miao RY, Cao Y, et al. dbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genom. 2010;11.
14. Zhao Y, Wang CC, Chen X. Microbes and complex diseases: from experimental results to computational models. Brief Bioinform. 2021;22(3).
15. Xuan P, Han K, Guo MZ, Guo YH, Li JB, Ding J, Liu Y, Dai QG, Li J, Teng ZX, et al. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS ONE. 2013;8(8).
16. Chen X, Yan CC, Zhang X, You ZH, Deng LX, Liu Y, Zhang YD, Dai QH. WBSMDA: within and between score for MiRNA-disease association prediction. Sci Rep. 2016;6:1–9. doi: 10.1038/srep21106.
17. You ZH, Huang ZA, Zhu ZX, Yan GY, Li ZW, Wen ZK, Chen X. PBMDA: a novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol. 2017;13(3).
18. Chen X, Xie D, Wang L, Zhao Q, You ZH, Liu HS. BNPMDA: bipartite network projection for MiRNA-Disease association prediction. Bioinformatics. 2018;34(18):3178–3186. doi: 10.1093/bioinformatics/bty333.
19. Chen X, Li SX, Yin J, Wang CC. Potential miRNA-disease association prediction based on kernelized Bayesian matrix factorization. Genomics. 2020;112(1):809–819. doi: 10.1016/j.ygeno.2019.05.021.
20. Niu YW, Wang GH, Yan GY, Chen X. Integrating random walk and binary regression to identify novel miRNA-disease association. BMC Bioinform. 2019;20.
21. Ma YJ, He TT, Jiang XP. Projection-based neighborhood non-negative matrix factorization for lncRNA-protein interaction prediction. Front Genet. 2019;10.
22. Zhang ZC, Zhang XF, Wu M, Le OY, Zhao XM, Li XL. A graph regularized generalized matrix factorization model for predicting links in biomedical bipartite networks. Bioinformatics. 2020;36(11):3474–3481. doi: 10.1093/bioinformatics/btaa157.
23. Li W, Wang SL, Xu J. An ensemble matrix completion model for predicting potential drugs against SARS-CoV-2. Front Microbiol. 2021;12.
24. Ma YJ. DeepMNE: deep multi-network embedding for lncRNA-disease association prediction. IEEE J Biomed Health. 2022;26(7):3539–3549. doi: 10.1109/JBHI.2022.3152619.
25. Chen X, Wu QF, Yan GY. RKNNMDA: ranking-based KNN for MiRNA-disease association prediction. RNA Biol. 2017;14(7):952–962. doi: 10.1080/15476286.2017.1312226.
26. Peng JJ, Hui WW, Li QQ, Chen BL, Hao JY, Jiang QH, Shang XQ, Wei ZY. A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics. 2019;35(21):4364–4371. doi: 10.1093/bioinformatics/btz254.
27. Chen X, Yan GY. Semi-supervised learning for potential human microRNA-disease associations inference. Sci Rep. 2014;4.
28. Chen X, Zhu CC, Yin J. Ensemble of decision tree reveals potential miRNA-disease associations. PLoS Comput Biol. 2019;15(7).
29. Xuan P, Sun H, Wang X, Zhang TG, Pan SX. Inferring the disease-associated miRNAs based on network representation learning and convolutional neural networks. Int J Mol Sci. 2019;20(15).
30. Ma YJ, Liu QQ. Generalized matrix factorization based on weighted hypergraph learning for microbe-drug association prediction. Comput Biol Med. 2022;145.
31. Ma YJ, Ma YY. Hypergraph-based logistic matrix factorization for metabolite-disease interaction prediction. Bioinformatics. 2022;38(2):435–443. doi: 10.1093/bioinformatics/btab652.
32. Ning Q, Zhao YM, Gao J, Chen C, Li X, Li TT, Yin MH. AMHMDA: attention aware multi-view similarity networks and hypergraph learning for miRNA-disease associations identification. Brief Bioinform. 2023;24(2).
33. Barbato A, Iuliano A, Volpe M, D’Alterio R, Brillante S, Massa F, De Cegli R, Carrella S, Salati M, Russo A, et al. Integrated genomics identifies miR-181/TFAM pathway as a critical driver of drug resistance in melanoma. Int J Mol Sci. 2021;22(4).
34. Wu YF, Xu WX, Yang YM, Zhang ZW. miRNA-93-5p promotes gemcitabine resistance in pancreatic cancer cells by targeting the PTEN-mediated PI3K/Akt signaling pathway. Ann Clin Lab Sci. 2021;51(3):310–320.
35. Wang D, Wang JA, Lu M, Song F, Cui QH. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–1650. doi: 10.1093/bioinformatics/btq241.
36. Pinero J, Sauch J, Sanz F, Furlong LI. The DisGeNET cytoscape app: exploring and visualizing disease genomics data. Comput Struct Biotec. 2021;19:2960–2967. doi: 10.1016/j.csbj.2021.05.015.
37. Meister G, Tuschl T. Mechanisms of gene silencing by double-stranded RNA. Nature. 2004;431(7006):343–349. doi: 10.1038/nature02873.
38. Hsu SD, Lin FM, Wu WY, Liang C, Huang WC, Chan WL, Tsai WT, Chen GZ, Lee CJ, Chiu CM, et al. miRTarBase: a database curates experimentally validated microRNA-target interactions. Nucleic Acids Res. 2011;39:D163–D169. doi: 10.1093/nar/gkq1107.
39. Hsu SD, Tseng YT, Shrestha S, Lin YL, Khaleel A, Chou CH, Chu CF, Huang HY, Lin CM, Ho SY, et al. miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2014;42(D1):D78–D85. doi: 10.1093/nar/gkt1266.
40. Mathur A, Foody GM. Multiclass and binary SVM classification: implications for training and classification users. IEEE Geosci Remote S. 2008;5(2):241–245. doi: 10.1109/LGRS.2008.915597.
41. Piryonesi SM, El-Diraby TE. Data analytics in asset management: cost-effective prediction of the pavement condition index. J Infrastruct Syst. 2020;26(1).
42. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi: 10.1023/A:1010933404324.
43. Chen TQ, Guestrin C. XGBoost: a scalable tree boosting system. In: KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016;785–794.
44. Zhou S, Wang SL, Wu Q, Azim R, Li W. Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression. Comput Biol Chem. 2020;85.
