Molecular Therapy. Nucleic Acids. 2019 Dec 18;19:602–611. doi: 10.1016/j.omtn.2019.12.010

DBMDA: A Unified Embedding for Sequence-Based miRNA Similarity Measure with Applications to Predict and Validate miRNA-Disease Associations

Kai Zheng 1,4,∗∗∗, Zhu-Hong You 2,4,, Lei Wang 2,3,∗∗, Yong Zhou 1, Li-Ping Li 2, Zheng-Wei Li 1
PMCID: PMC6957846  PMID: 31931344

Abstract

MicroRNAs (miRNAs) play a critical role in human diseases. Determining the associations between miRNAs and diseases helps to elucidate disease pathogenesis and to identify effective treatments. Despite great recent advances in this field, verifying and recognizing associations efficiently at scale remains a serious challenge for biological experiments. Computational methods for predicting miRNA-disease associations have therefore become a research hotspot. In this paper, we present a new computational method, named distance-based sequence similarity for miRNA-disease association prediction (DBMDA), that directly learns a mapping from miRNA sequences to a Euclidean space. The notable feature of our approach is that it infers global similarity from region distances, which are computed from the miRNA sequences with the chaos game representation algorithm. In 5-fold cross-validation, DBMDA reached an area under the curve (AUC) of 0.9129 in predicting potential miRNA-disease associations. To assess DBMDA further, we compared it with different classifiers and with previous prediction models. In addition, we carried out two case studies, on prostate neoplasms and colon neoplasms. In each case, 39 of the top 40 predicted miRNAs were confirmed by other databases. DBMDA makes a new attempt at sequence similarity and achieves excellent results, while providing a new perspective for predicting the relationship between diseases and miRNAs. The source code and datasets explored in this work are available online from the University of Chinese Academy of Sciences (http://220.171.34.3:81/).

Keywords: miRNAs, disease, chaos game representation, heterogeneous information, rotation forest

Introduction

MicroRNAs (miRNAs) are a class of short noncoding RNAs (ncRNAs), about 22 nt long, that bind designated messenger RNAs by base pairing and control their translation and stability.1 Since Victor Ambros discovered the first miRNA in 1993, a large number of miRNAs have been identified over the past 20 years across a wide variety of species.2,3 miRNAs have an important influence on biological processes such as cell development, proliferation, and apoptosis,4 and their regulatory functions are related to the expression of particular genes at the post-transcriptional stage.5 Building on these findings, more and more miRNAs have been validated as connected with the development of complex human diseases.6 For instance, miR-137 inhibits the proliferation of lung cancer cells by targeting Cdc42 and Cdk6.7 von Brandenstein et al.8 reported that miR-15a is a potential biological marker for differentiating benign from malignant renal tumors in biopsy and urine samples. miR-211 promotes the progression of head and neck carcinomas by targeting transforming growth factor-β receptor 2 (TGF-βR2).9 However, experimentally verifying an association between a miRNA and a disease requires demanding conditions and is time-consuming and laborious. Computational algorithms for predicting potential miRNA-disease associations have therefore become a hot topic and are attracting increasing attention. By prioritizing candidates, computational predictions can help biological experiments validate disease-associated miRNAs more effectively.10

Over the years, an increasing number of studies have constructed computational models for predicting miRNA-disease associations.11, 12, 13, 14, 15, 16, 17 These models fall into two main types: similarity-based and machine learning-based. Similarity-based methods compute the strength of an association from the miRNA and disease networks. For example, Chen et al.18 proposed Random Walk with Restart for MiRNA-Disease Association prediction (RWRMDA), which calculates global network similarity by combining matrices of miRNA functional similarity. Li et al.19 presented a computational method that predicts potential associations by calculating the functional consistency score (FCS) of target genes and disease-related genes. Heterogeneous graph inference for miRNA-disease association prediction (HGIMDA), proposed by Chen et al.,20 computes the optimal solution set through an iterative process. Machine learning-based methods, on the other hand, train a model on known miRNA-disease associations and use it to predict potential ones.21 For example, Xu et al.22 used a support vector machine (SVM) classifier to identify positive and negative associations in a miRNA-target-dysregulated network. Chen and Yan23 proposed Regularized Least Squares for MiRNA-Disease Association (RLSMDA), a semi-supervised method that predicts new disease-related miRNAs without requiring negative samples. Restricted Boltzmann machine for multiple types of miRNA-disease association prediction (RBMMMDA), developed by Chen et al.,24 mainly improves on earlier work by predicting several types of new associations.

In this study, we build a distance-based sequence similarity model for miRNA-disease association prediction (DBMDA) based on chaos game representation (CGR). DBMDA combines miRNA sequence information, miRNA functional information, confirmed associations, and disease semantic information. The motivation is to map miRNA sequences to a Euclidean space in which regional distance directly corresponds to a measure of miRNA sequence similarity. In detail, we first obtained miRNA and disease similarity matrices from miRNA functional information and disease semantic information. Second, these similarity matrices are combined with the Gaussian interaction profile kernel similarity matrices of miRNAs and diseases to obtain integrated similarity matrices. Third, miRNA sequences are mapped, nucleotide by nucleotide, to a Euclidean space through CGR. Specifically, the CGR plane is divided into an 8 × 8 grid, and the average coordinate of the points in each grid cell is calculated. The regional distance between miRNAs is then used to quantify miRNA similarity and to construct a miRNA sequence similarity matrix, which is integrated with the similarity information from the second step into a comprehensive feature. Finally, the integrated feature vector is fed to a rotation forest (RoF) classifier to predict potential associations. We designed the following experiments to evaluate the reliability of the method. We used 5-fold cross-validation to assess the performance of DBMDA on the Human microRNA Disease Database (HMDD) v.3.0 dataset and obtained an AUC of 0.9129 ± 0.0113. Moreover, we carried out two case studies, on prostate neoplasms and colon neoplasms. In each case, 39 of the top 40 predicted miRNAs were verified by other datasets. These results show that DBMDA is an efficient method for predicting potential miRNA-disease associations.

Results

Performance Evaluation

Evaluation Criteria

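We use widely adopted evaluation measures, namely classification accuracy (Accu.), sensitivity (Sen.), precision (Prec.), and F1 score, to assess the performance of DBMDA. They are defined, respectively, by:

$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN} \tag{2}$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP} \tag{3}$$

$$F1 = \frac{2 \times \mathrm{Prec.} \times \mathrm{Sen.}}{\mathrm{Prec.} + \mathrm{Sen.}}, \tag{4}$$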

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. In addition, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) are used to summarize the overall performance of the model.
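As an illustration of Equations 1 to 4, the following minimal Python sketch computes the four measures from confusion counts, together with the AUC written as a rank statistic (the fraction of positive-negative pairs in which the positive sample receives the higher score). The function names and the example counts are hypothetical and are not taken from the DBMDA implementation.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, precision, F1) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, precision, f1


def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts, chosen only to exercise the formulas.
acc, sen, prec, f1 = classification_metrics(tp=80, fp=20, tn=90, fn=10)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration; a sort-based computation would be preferable at the scale of the full dataset.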

Prediction of miRNA-Disease Association

We used 5-fold cross-validation to assess the performance of DBMDA on the confirmed associations in the HMDD v.3.0 database.25 Li et al.25 selected 17,412 papers and extracted 32,281 known miRNA-disease associations involving 1,102 miRNAs and 850 diseases. We removed the miRNAs whose information could not be verified in the public database miRBase. After this screening, the associations confirmed by miRBase were chosen as positive samples.26 Negative samples were then drawn at random from the miRNA-disease pairs that are not among the known associations.
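The partitioning behind 5-fold cross-validation can be sketched as follows. This is a minimal illustration that assumes a shuffled, round-robin assignment of samples to folds; the function name, the seed, and the assignment scheme are hypothetical choices, since the text does not describe how the folds were drawn.

```python
import random


def k_fold_splits(samples, k=5, seed=0):
    """Yield (train, test) lists; each sample is tested exactly once."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Round-robin assignment keeps the fold sizes within one of each other.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test
```

Each of the five models is trained on four folds and evaluated on the remaining one, so every positive and negative sample contributes to exactly one test set.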

Figure 1 shows the ROC curves of the 5-fold cross-validation obtained by DBMDA. As the figure shows, DBMDA achieved an average prediction AUC of 0.9129 ± 0.0113. The AUCs of the five folds are 0.8904 (fold 1), 0.9177 (fold 2), 0.9174 (fold 3), 0.9206 (fold 4), and 0.9188 (fold 5). The average accuracy, sensitivity, precision, and F1-score are 85.36%, 85.74%, 85.09%, and 85.40%, respectively, as listed in Table 1.
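The summary statistic can be checked from the per-fold AUCs quoted above. The sketch below assumes that the ± value is the population standard deviation (dividing by the number of folds); the text does not say which estimator was used, but this choice gives a spread of about 0.0113, whereas the sample standard deviation would be larger.

```python
from statistics import mean, pstdev

# Per-fold AUCs as quoted in the text.
fold_aucs = [0.8904, 0.9177, 0.9174, 0.9206, 0.9188]

avg = mean(fold_aucs)
spread = pstdev(fold_aucs)  # population standard deviation (assumed estimator)
print(f"AUC = {avg:.4f} +/- {spread:.4f}")
```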

Figure 1. The ROCs of DBMDA and AUCs Based on 5-Fold Cross-Validation

Table 1. The Comparison Results of DBMDA Based on 5-Fold Cross-Validation

| Testing Set | Accuracy | Sensitivity | Precision | F1-Score |
| --- | --- | --- | --- | --- |
| 1 | 83.14% | 81.55% | 84.23% | 82.87% |
| 2 | 86.21% | 86.83% | 85.77% | 86.30% |
| 3 | 85.57% | 86.42% | 84.99% | 85.70% |
| 4 | 86.22% | 87.07% | 85.63% | 86.34% |
| 5 | 85.66% | 86.83% | 84.85% | 85.83% |
| Average | 85.36% ± 1.27% | 85.74% ± 2.35% | 85.09% ± 0.62% | 85.40% ± 1.44% |

Comparison with Different Classifier Models

In the 5-fold cross-validation, our proposed method achieved good results on the HMDD v.3.0 dataset using the RoF classifier. To illustrate why RoF was chosen, we compared it with SVM, random forest (RF), and decision tree (DT) in this experiment. The accuracies of the four classifiers are 85.00% (RoF), 83.73% (SVM), 82.06% (RF), and 80.33% (DT). Their AUCs are 91.15% (RoF), 89.01% (SVM), 90.77% (RF), and 80.29% (DT), as shown in Figure 2. The accuracy, sensitivity, precision, and F1-score are listed in Table 2. RoF does not have the highest precision among the four classifiers (RF reaches 85.43% against 84.11% for RoF). However, it obtained the best results on the other evaluation criteria, especially the AUC, which represents the overall performance of the model. Overall, rotation forest is the best of the four classifiers for the features we build.

Figure 2. The ROCs of Four Different Classifiers: RoF, SVM, Random Forest, and Decision Tree

Table 2. Performance Comparison among Four Different Classifiers: Rotation Forest, SVM, Random Forest, and Decision Tree

| Method | Accuracy | Sensitivity | Precision | F1-Score |
| --- | --- | --- | --- | --- |
| SVM | 83.73% | 83.56% | 83.33% | 83.45% |
| RF | 82.06% | 76.49% | 85.43% | 80.72% |
| DT | 80.33% | 78.12% | 81.10% | 79.58% |
| RoF | 85.00% | 85.60% | 84.11% | 84.85% |

Comparison with Related Methods

Many previous studies have explored the associations between miRNAs and diseases. To evaluate the performance of DBMDA, we compared it with nine state-of-the-art methods. Because the database versions used are not the same, we compare only the AUC values reported in the respective articles. DBMDA achieves a higher AUC than RLSMDA, PBSI, MBSI, NetCBI, MaxFlow, miRGOFS, HGIMDA, MDHGI, and LMTRDA, as shown in Table 3.20,23,27, 28, 29, 30, 31 There are several reasons why DBMDA outperforms methods based on traditional miRNA similarity. First, miRNA sequences carry attribute features and are an excellent source of essential information. Second, miRNA similarity derived from limited knowledge resources may contain errors caused by information loss. Third, inferring global similarity from regional distances also helps improve performance.

Table 3. The Comparison with Related Models

| Methods | AUC Scores |
| --- | --- |
| RLSMDA (a) | 86.17% |
| PBSI (b) | 54.02% |
| MBSI (b) | 74.83% |
| NetCBI (b) | 80.66% |
| MaxFlow (c) | 86.93% |
| miRGOFS (d) | 87.70% |
| HGIMDA (e) | 87.81% |
| MDHGI (f) | 87.94% |
| LMTRDA (g) | 90.54% |
| DBMDA | 91.29% |

(a) The results of the method are reported in Chen and Yan.23
(b) The results of the method are reported in Chen and Zhang.27
(c) The results of the method are reported in Yu et al.28
(d) The results of the method are reported in Yang et al.30
(e) The results of the method are reported in Chen et al.20
(f) The results of the method are reported in Chen et al.31
(g) The results of the method are reported in Wang et al.44

Case Studies

Here, DBMDA is applied to two human diseases, prostate neoplasms and colon neoplasms, to further evaluate its effectiveness based on the associations recorded in the HMDD v.3.0 database. The test samples are the miRNA-disease pairs formed by these two diseases and all possible miRNAs. We checked the top 40 ranked predictions against the dbDEMC v.2.0 database32 and the miR2Disease database.33

In the United States, prostate cancer has caused more than 20,000 deaths and has become one of the major threats to men's health. Age is a major risk factor for prostate cancer, and older men have a higher incidence. However, an increasing number of younger men are being diagnosed with prostate neoplasms. Prostate neoplasms may spread to other parts of the body, including surrounding tissue such as regional lymph nodes. We therefore took prostate neoplasms as an example to evaluate the performance of DBMDA. The results are shown in Table 4. Thirty-nine of the top 40 predicted miRNAs were confirmed by at least one of the two databases mentioned above.

Table 4. The Top 40 Predicted miRNAs Associated with Prostate Neoplasms, Checked against Known Associations in dbDEMC v.2.0 and miR2Disease

miRNA dbDEMC miR2Disease miRNA dbDEMC miR2Disease
hsa-mir-192 confirmed unconfirmed hsa-mir-181a-2 confirmed unconfirmed
hsa-let-7i confirmed unconfirmed hsa-mir-196a confirmed unconfirmed
hsa-mir-140 confirmed unconfirmed hsa-mir-208a confirmed unconfirmed
hsa-mir-199b confirmed confirmed hsa-mir-337 confirmed unconfirmed
hsa-mir-144 confirmed unconfirmed hsa-mir-1246 confirmed unconfirmed
hsa-mir-372 confirmed unconfirmed hsa-mir-30 confirmed unconfirmed
hsa-let-7e confirmed confirmed hsa-mir-184 confirmed confirmed
hsa-let-7f confirmed confirmed hsa-mir-509 unconfirmed unconfirmed
hsa-mir-10b confirmed confirmed hsa-mir-9-3 confirmed unconfirmed
hsa-mir-129 confirmed unconfirmed hsa-let-7f-2 confirmed unconfirmed
hsa-mir-9-1 confirmed unconfirmed hsa-mir-202 confirmed confirmed
hsa-mir-206 confirmed unconfirmed hsa-mir-33a confirmed unconfirmed
hsa-mir-125a confirmed confirmed hsa-mir-451a confirmed unconfirmed
hsa-mir-30b confirmed confirmed hsa-let-7f-1 confirmed unconfirmed
hsa-mir-362 confirmed unconfirmed hsa-mir-186 confirmed unconfirmed
hsa-mir-133 confirmed unconfirmed hsa-mir-302b confirmed unconfirmed
hsa-mir-139 confirmed unconfirmed hsa-mir-328 confirmed unconfirmed
hsa-mir-137 confirmed unconfirmed hsa-mir-383 confirmed unconfirmed
hsa-mir-181b-2 confirmed unconfirmed hsa-mir-431 confirmed unconfirmed
hsa-mir-338 confirmed unconfirmed hsa-mir-103a-2 confirmed unconfirmed

In the United States, colon neoplasms, a common type of malignant cancer, have the third highest morbidity and the third highest fatality rate. One study estimated that more than 135,000 individuals would be diagnosed with colon and rectum neoplasms. We therefore chose colon neoplasms as a case study to evaluate the performance of DBMDA. As a result, 39 of the top 40 predicted miRNAs associated with colon neoplasms were confirmed by experimental findings recorded in dbDEMC v.2.0 and miR2Disease, as shown in Table 5.

Table 5. The Top 40 Predicted miRNAs Associated with Colon Neoplasms, Checked against Known Associations in dbDEMC v.2.0 and miR2Disease

miRNA dbDEMC miR2Disease miRNA dbDEMC miR2Disease
hsa-mir-26a confirmed confirmed hsa-mir-497 confirmed confirmed
hsa-mir-182 confirmed confirmed hsa-mir-92a-2 confirmed unconfirmed
hsa-mir-342 confirmed confirmed hsa-mir-124 confirmed confirmed
hsa-mir-483 confirmed unconfirmed hsa-mir-129 confirmed confirmed
hsa-mir-139 confirmed unconfirmed hsa-mir-133a-1 confirmed confirmed
hsa-mir-372 confirmed unconfirmed hsa-mir-181b-1 confirmed confirmed
hsa-mir-181b-2 confirmed confirmed hsa-mir-26a-1 confirmed confirmed
hsa-mir-181a-2 confirmed confirmed hsa-mir-373 confirmed unconfirmed
hsa-mir-124-1 confirmed confirmed hsa-mir-423 confirmed unconfirmed
hsa-mir-193a confirmed unconfirmed hsa-mir-499 unconfirmed unconfirmed
hsa-mir-193b confirmed unconfirmed hsa-mir-128 confirmed confirmed
hsa-mir-26b confirmed unconfirmed hsa-mir-16 confirmed unconfirmed
hsa-mir-34b confirmed unconfirmed hsa-mir-212 confirmed unconfirmed
hsa-mir-1 confirmed confirmed hsa-mir-340 confirmed unconfirmed
hsa-mir-133a-2 confirmed confirmed hsa-mir-98 confirmed unconfirmed
hsa-mir-199b confirmed unconfirmed hsa-mir-100 confirmed unconfirmed
hsa-mir-27b confirmed confirmed hsa-mir-124-3 confirmed confirmed
hsa-mir-29c confirmed unconfirmed hsa-mir-133 confirmed confirmed
hsa-mir-451a confirmed unconfirmed hsa-mir-183 confirmed confirmed
hsa-mir-144 confirmed unconfirmed hsa-mir-370 confirmed unconfirmed

Discussion

Conclusions

Sequence-based miRNA similarity can aid in predicting miRNA-disease associations, extracting biological property information, and enhancing the analytical quality of high-throughput sequencing data. However, most existing methods do not use sequence information, and the current information sources (miRNA-disease associations) do not directly reflect the relationships between miRNAs. This paper therefore proposed DBMDA, a predictive model that infers miRNA similarity from sequence information. The improvement of the method is that it directly learns the mapping from miRNA sequences to a Euclidean space, in which the regional distance directly corresponds to a measure of miRNA sequence similarity. The experimental results indicate that DBMDA performs well in predicting disease-associated miRNAs with the support of new algorithms and sequence information. In addition, sequence information has sufficient coverage for human miRNAs, so DBMDA is broadly applicable in functional analysis.

Materials and Methods

Human miRNA-Disease Associations

We downloaded the confirmed association data from the HMDD database for this experiment.25 HMDD v.3.0 was last updated on October 9, 2018, and includes 32,281 experimentally confirmed associations between 850 diseases and 1,102 miRNAs from 17,412 papers. Based on these data, an adjacency matrix $X \in \mathbb{R}^{n_M \times n_D}$ is built to represent the associations, where $n_M$ and $n_D$ are the numbers of miRNAs and diseases in HMDD v.3.0. $X_{ij}$ is equal to 1 if miRNA $m_i$ has been confirmed to be associated with disease $d_j$, and 0 otherwise.34

miRNA Functional Similarity

Wang et al.35 proposed a method for quantifying the functional similarity between miRNAs based on the hypothesis that functionally similar miRNAs are more likely to affect the same disease and pathologically similar diseases are more likely to be affected by the same miRNA. The miRNA functional similarity data are available at http://www.cuilab.cn/files/images/cuilab/misim.zip. A 495 × 495 matrix $MF$ represents the miRNA functional similarity, and its element $MF(m_a, m_b)$ is the similarity score between miRNA $m_a$ and miRNA $m_b$.

Disease Semantic Similarity Model

We built a directed acyclic graph (DAG) to describe the relationships among diseases, following the method proposed by Wang et al.,35 which is based on the Medical Subject Headings (MeSH) descriptors.36 The MeSH descriptors can be downloaded from the U.S. National Library of Medicine database (https://www.nlm.nih.gov/). A disease $d_i$ can be described as $DAG_{d_i} = (D, N_{d_i}, E_{d_i})$, where $N_{d_i}$ is the node set containing disease $d_i$ and its ancestor diseases, and $E_{d_i}$ is the set of corresponding edges. Based on the DAG, the contribution of a disease $o$ in $DAG_{d_i}$ to the semantic value of disease $d_i$ is calculated as:

$$D_{d_i}(o) = \begin{cases} 1 & \text{if } o = d_i \\ \max\{\Delta \cdot D_{d_i}(o') \mid o' \in \text{children of } o\} & \text{if } o \neq d_i, \end{cases} \tag{5}$$

where $\Delta$ is the semantic contribution decay factor, which is set to 0.5 according to previous studies.29 Thus the contribution of disease $d_i$ to its own semantic value is 1, and the contribution of any other disease $o$ decreases with its distance from $d_i$ in the DAG. We then define the semantic value $DV(d_i)$ as follows:

$$DV(d_i) = \sum_{o \in N_{d_i}} D_{d_i}(o). \tag{6}$$

If diseases $d_i$ and $d_j$ share a larger part of their DAGs, they have a larger similarity score. The semantic similarity score is defined as follows:

$$Sim(d_i, d_j) = \frac{\sum_{o \in N_{d_i} \cap N_{d_j}} \left(D_{d_i}(o) + D_{d_j}(o)\right)}{DV(d_i) + DV(d_j)}. \tag{7}$$

$Sim$ is the 850 × 850 semantic similarity matrix, and its element $Sim(d_i, d_j)$ is the semantic similarity of $d_i$ and $d_j$ under disease semantic similarity model 1.
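The computation in Equations 5 to 7 can be sketched on a toy hierarchy. The disease names and parent links below are hypothetical and are not real MeSH data; the decay factor of 0.5 is the value given in the text. Contributions are propagated from a disease up to its ancestors, keeping the maximum over children at each node.

```python
DECAY = 0.5  # semantic contribution decay factor from the text

# Hypothetical toy hierarchy (child -> parents); NOT real MeSH descriptors.
PARENTS = {
    "neoplasms": [],
    "digestive neoplasms": ["neoplasms"],
    "urogenital neoplasms": ["neoplasms"],
    "colon neoplasms": ["digestive neoplasms"],
    "prostate neoplasms": ["urogenital neoplasms"],
}


def contributions(disease):
    """Map every node o of DAG(disease) to its contribution (Equation 5)."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in PARENTS[node]:
            value = DECAY * contrib[node]
            if value > contrib.get(parent, 0.0):  # keep the max over children
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_value(disease):
    """Sum of all contributions in the disease's DAG (Equation 6)."""
    return sum(contributions(disease).values())


def semantic_similarity(d_i, d_j):
    """Shared contributions over the total semantic value (Equation 7)."""
    c_i, c_j = contributions(d_i), contributions(d_j)
    shared = c_i.keys() & c_j.keys()
    numerator = sum(c_i[o] + c_j[o] for o in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))
```

In this toy hierarchy, colon neoplasms and prostate neoplasms share only the root node, so their similarity is low, while a disease compared with itself scores 1.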

Under the above formula, diseases in the same layer of a DAG make the same contribution. However, a more specific disease, one that appears in fewer DAGs, should contribute more. Hence, following the method of Xuan et al.,29 the contribution of disease $o$ in $DAG_{d_i}$ to the semantic value of disease $d_i$ is described as follows:

$$D'_{d_i}(o) = -\log\left(\frac{\text{number of DAGs including } o}{\text{number of diseases}}\right), \tag{8}$$

where $o$ is any of the diseases considered in our method. The semantic similarity between diseases $d_i$ and $d_j$, denoted $Sim'$, is based on their shared ancestor nodes and all of their ancestor nodes. Specifically, it is computed as follows:

$$Sim'(d_i, d_j) = \frac{\sum_{o \in N_{d_i} \cap N_{d_j}} \left(D'_{d_i}(o) + D'_{d_j}(o)\right)}{DV(d_i) + DV(d_j)}, \tag{9}$$

where $DV(d_i)$ and $DV(d_j)$ are the semantic values of $d_i$ and $d_j$, computed in the same way as in Equation 6.
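The second semantic model can be illustrated in the same spirit. The sketch below assumes that the contribution of a node is the negative logarithm of the fraction of disease DAGs that contain it, and that semantic values are obtained by summing these contributions over a disease's DAG. The node sets are hypothetical toy data, and the zero-denominator guard is an added safeguard rather than part of the described method.

```python
import math

# Hypothetical toy data: each disease mapped to the node set of its DAG.
DAG_NODES = {
    "neoplasms": {"neoplasms"},
    "digestive neoplasms": {"digestive neoplasms", "neoplasms"},
    "colon neoplasms": {"colon neoplasms", "digestive neoplasms", "neoplasms"},
    "prostate neoplasms": {"prostate neoplasms", "neoplasms"},
}


def ic_contribution(node):
    """Contribution of a node: rarer nodes (in fewer DAGs) score higher."""
    n_including = sum(1 for nodes in DAG_NODES.values() if node in nodes)
    return -math.log(n_including / len(DAG_NODES))


def semantic_value2(disease):
    """Semantic value under the second model (sum of node contributions)."""
    return sum(ic_contribution(o) for o in DAG_NODES[disease])


def semantic_similarity2(d_i, d_j):
    """Similarity from shared nodes; the contribution does not depend on d_i."""
    shared = DAG_NODES[d_i] & DAG_NODES[d_j]
    numerator = sum(2 * ic_contribution(o) for o in shared)
    denominator = semantic_value2(d_i) + semantic_value2(d_j)
    return numerator / denominator if denominator else 0.0
```

A node present in every DAG, such as the root here, contributes nothing, which is the intended contrast with the first model, where all nodes in the same layer contribute equally.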

GIP Similarity for Diseases and miRNA

The HMDD v.3.0 dataset provides plenty of association information.37 Based on the hypothesis that pathologically similar diseases tend to be affected by the same miRNAs and vice versa, we calculate disease and miRNA similarity using the Gaussian interaction profile (GIP) kernel. The binary vector $V(d_i)$ is the interaction profile of disease $d_i$, that is, the $i$-th column of the adjacency matrix $X$. The disease GIP similarity $GD(d_i, d_j)$ between $d_i$ and $d_j$ is computed by:

$$GD(d_i, d_j) = \exp\left(-\gamma_d \lVert V(d_i) - V(d_j) \rVert^2\right), \tag{10}$$

where the adjustment coefficient $\gamma_d$ controls the kernel bandwidth and is obtained by normalizing the original parameter $\gamma'_d$ as follows:

$$\gamma_d = \frac{1}{\gamma'_d \left(\frac{1}{n_D} \sum_{i=1}^{n_D} \lVert V(d_i) \rVert^2\right)}. \tag{11}$$

Similarly, the GIP similarity $GM(m_i, m_j)$ between miRNA $m_i$ and miRNA $m_j$ is calculated as follows:

$$GM(m_i, m_j) = \exp\left(-\gamma_m \lVert V(m_i) - V(m_j) \rVert^2\right) \tag{12}$$

$$\gamma_m = \frac{1}{\gamma'_m \left(\frac{1}{n_M} \sum_{i=1}^{n_M} \lVert V(m_i) \rVert^2\right)}, \tag{13}$$

where the binary vector $V(m_i)$ [or $V(m_j)$] is the interaction profile of miRNA $m_i$ (or $m_j$), which records whether $m_i$ (or $m_j$) is associated with each of the 850 diseases and corresponds to the $i$-th (or $j$-th) row of the adjacency matrix $X$.
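A minimal sketch of the GIP kernel follows. It takes a list of binary interaction profiles (one per disease, or one per miRNA) and assumes that the original bandwidth parameter equals 1, a value the text does not state, so that the bandwidth is simply the reciprocal of the mean squared profile norm. It also assumes that at least one profile contains an association, otherwise the mean norm would be zero.

```python
import math


def gip_similarity(profiles):
    """Return the GIP kernel matrix for a list of binary interaction profiles.

    The bandwidth is normalized by the mean squared norm of the profiles,
    taking the original parameter as 1 (an assumed value).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    gamma = 1.0 / mean_sq_norm
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


# Hypothetical profiles for three entities over three partners.
toy_profiles = [[1, 0, 1], [1, 0, 0], [0, 1, 0]]
toy_gip = gip_similarity(toy_profiles)
```

Profiles that agree on more partners are closer in Euclidean distance and therefore receive a kernel value nearer to 1; identical profiles score exactly 1.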

Multi-source Feature Fusion

By combining the semantic similarity of diseases with the GIP similarity constructed above, we compute a comprehensive similarity matrix that incorporates heterogeneous information.38 The element $DS(d_i, d_j)$ represents the combined similarity between diseases $d_i$ and $d_j$:

$$DS(d_i, d_j) = \begin{cases} \dfrac{Sim(d_i, d_j) + Sim'(d_i, d_j)}{2} & \text{if } d_i, d_j \text{ in } Sim \text{ and } Sim' \\ GD(d_i, d_j) & \text{otherwise.} \end{cases} \tag{14}$$

The miRNA similarity matrix $MS$ is constructed from the miRNA functional similarity $MF$ and the miRNA GIP similarity $GM$. The combined similarity between miRNA $m_i$ and miRNA $m_j$ is:

$$MS(m_i, m_j) = \begin{cases} MF(m_i, m_j) & \text{if } m_i, m_j \text{ in } MF \\ GM(m_i, m_j) & \text{otherwise.} \end{cases} \tag{15}$$
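The fusion rules of Equations 14 and 15 amount to a per-pair fallback: use the knowledge-based similarity when both entities are covered by it, and the GIP similarity otherwise. The sketch below assumes square matrices stored as nested lists and boolean coverage flags per entity; the function names and the flag representation are hypothetical.

```python
def fuse_disease_similarity(sim1, sim2, gip, has_semantic):
    """Average the two semantic similarities where available, else use GIP."""
    n = len(gip)
    return [
        [
            (sim1[i][j] + sim2[i][j]) / 2
            if has_semantic[i] and has_semantic[j]
            else gip[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]


def fuse_mirna_similarity(functional, gip, has_functional):
    """Use functional similarity where available, else fall back to GIP."""
    n = len(gip)
    return [
        [
            functional[i][j]
            if has_functional[i] and has_functional[j]
            else gip[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]
```

The fallback matters because the functional similarity matrix covers fewer miRNAs than the association data, so GIP similarity fills the pairs that the knowledge-based matrices cannot describe.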

CGR

In this study, building on the finding of Jessime et al.39 that homologs can be detected effectively even if all positions of an ncRNA are treated equally, we introduced CGR to map RNA sequences. In 1990, Jeffrey40 introduced CGR as a scale-independent representation of nucleotide sequences. CGR is an iterative mapping that can be traced back to chaos theory and has a basis in statistical mechanics. However, its use as a representation format for nucleotide sequences has not been fully explored. RNA sequences are mapped into the CGR space, which is planar: the four possible nucleotides bound the CGR space as the vertices of a unit square (Figure 3).

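$$nt_i = nt_{i-1} + \theta\,(\lambda_i - nt_{i-1}) \tag{16}$$

$$\lambda_i = \begin{cases} (0,0) & \text{if nucleotide} = A \\ (0,1) & \text{if nucleotide} = C \\ (1,1) & \text{if nucleotide} = G \\ (1,0) & \text{if nucleotide} = U, \end{cases} \tag{17}$$

where $nt_i$ is the CGR position of the $i$-th nucleotide, $N_{seq}$ is the length of the sequence, $\lambda_i$ is the vertex assigned to the $i$-th nucleotide, and $\theta$ is the decay factor, so that each position lies a fraction $\theta$ of the way from the previous position toward the vertex of the current nucleotide. We define $i = 1, \ldots, N_{seq}$ and $nt_0 = (0.5, 0.5)$.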
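The CGR iteration can be sketched in a few lines. Starting from the center of the unit square, each nucleotide moves the current point part of the way toward its vertex. The decay factor of 0.5 used below is the conventional CGR choice and is an assumption here, since the text does not state the value used in DBMDA; the function name is hypothetical.

```python
# Vertex assigned to each nucleotide, as in Equation 17.
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "U": (1.0, 0.0)}


def cgr_points(sequence, theta=0.5):
    """Return the CGR position of every nucleotide in an RNA sequence.

    theta=0.5 (move halfway toward the vertex) is an assumed default.
    """
    x, y = 0.5, 0.5  # starting point nt_0
    points = []
    for nucleotide in sequence.upper():
        vx, vy = VERTICES[nucleotide]
        x += theta * (vx - x)
        y += theta * (vy - y)
        points.append((x, y))
    return points
```

For example, `cgr_points("AG")` first moves toward the A vertex at (0, 0) and then toward the G vertex at (1, 1). Because every point stays inside the unit square, the plane can later be divided into a fixed grid.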

Figure 3. CGR of the miRNA Named hsa-mir-135

Sequence Similarity for miRNAs

miRNA sequences are mapped to a Euclidean space, and the region distance in that space is used to quantify miRNA sequence similarity. Once this space has been built, tasks such as miRNA sequence recognition, verification, and clustering can be implemented easily using standard methods with DBMDA embeddings as feature vectors. First, we downloaded 1,057 miRNA precursor sequences from miRBase. Second, each nucleotide is mapped to the Euclidean space, and the CGR space is divided into a grid of appropriate size. The average coordinate of the points in each grid cell is then computed (Figure 4). Third, the region distance between each miRNA and every other miRNA is calculated. The region distance $DR_{m_i m_j}(g(t))$ is given by:

$$DR_{m_i m_j}(g(t)) = \begin{cases} \sqrt{(x_{m_i} - x_{m_j})^2 + (y_{m_i} - y_{m_j})^2} & \text{if } m_i \text{ and } m_j \text{ both have an average coordinate in } g(t) \\ 0 & \text{if neither } m_i \text{ nor } m_j \text{ has an average coordinate in } g(t) \\ \alpha & \text{otherwise,} \end{cases} \tag{18}$$

where $g(t)$ denotes the $t$-th grid cell, $\alpha$ is the penalty parameter, and $(x_{m_i}, y_{m_i})$ is the average coordinate of $m_i$ in $g(t)$. Fourth, the similarity between sequences at any scale is calculated from the region distances $DR_{m_i m_j}(g(t))$, as follows:

$$Sim_{seq}(m_i, m_j) = \sum_{t=1}^{n_c^2} DR_{m_i m_j}(g(t)). \tag{19}$$

Finally, we used a $2^{n_c} \times 2^{n_c}$ grid to obtain the distance-based similarity matrix (1,057 × 1,057) for nucleotide length $n_c$. Each miRNA sequence can therefore be described by a 1,057-dimensional vector:

$$F_{seq} = (f_1, f_2, f_3, \ldots, f_{1056}, f_{1057}). \tag{20}$$
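The region-distance step can be sketched as follows, taking lists of CGR points as input. The sketch assumes an 8 × 8 grid, Euclidean distance between cell averages, and a penalty of 1.0 for cells occupied by only one of the two sequences; the penalty value and the function names are hypothetical, since the text does not give the value of the penalty parameter.

```python
import math


def cell_averages(points, cells_per_side=8):
    """Average coordinate of the CGR points falling in each occupied grid cell."""
    sums = {}
    for x, y in points:
        col = min(int(x * cells_per_side), cells_per_side - 1)
        row = min(int(y * cells_per_side), cells_per_side - 1)
        sx, sy, n = sums.get((row, col), (0.0, 0.0, 0))
        sums[(row, col)] = (sx + x, sy + y, n + 1)
    return {cell: (sx / n, sy / n) for cell, (sx, sy, n) in sums.items()}


def summed_region_distance(avg_a, avg_b, alpha=1.0, cells_per_side=8):
    """Sum the per-cell region distances over the whole grid.

    alpha is the penalty for a cell occupied by only one sequence
    (1.0 is an assumed value).
    """
    total = 0.0
    for row in range(cells_per_side):
        for col in range(cells_per_side):
            a = avg_a.get((row, col))
            b = avg_b.get((row, col))
            if a is not None and b is not None:
                total += math.hypot(a[0] - b[0], a[1] - b[1])
            elif a is not None or b is not None:
                total += alpha
            # Cells empty in both sequences contribute 0.
    return total
```

Computing this quantity for one miRNA against all 1,057 sequences yields one row of the distance-based matrix, which is the 1,057-dimensional vector described in the text. Note that the summed quantity grows as sequences become less alike.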

Figure 4. Flowchart for Quantifying Sequence Similarity Using Regional Distance

(A) The CGR of hsa-mir-27a, plotted with the average coordinate of each cell of the 8 × 8 grid. (B) The CGR of hsa-mir-651, plotted with the average coordinate of each cell of the 8 × 8 grid. (C) Computing the region distances between hsa-mir-27a and hsa-mir-651.

Rotation Forest

Rotation forest (RoF) trains decision trees independently, each on a different set of extracted features.41,42 Rodríguez et al.42 defined $F = [f_1, \ldots, f_n]^T$ as the $n$ features (attributes) of a training set represented by an $N \times n$ matrix, and $T = [T_1, \ldots, T_r]$ as the ensemble of $r$ classifiers. Each classifier is trained separately on its own bootstrap sample. The key idea of RoF is to extract features and rebuild a complete training set for each decision tree in $T$. Specifically, RoF randomly divides the feature set into $e$ subsets and runs principal-component analysis (PCA) separately on each. The data are then mapped into the new feature space and used to train classifier $T_i$. Different subsets yield different extracted features, which, together with bootstrap sampling, improves the diversity of the ensemble.

Method Overview

DBMDA assumes that similar diseases tend to be related to functionally similar miRNAs, an assumption that is also used to compute associations between target proteins and drugs. DBMDA has three main processes: first, choosing positive and negative examples; second, assembling feature vectors from the miRNA and disease similarity matrices; and third, building an effective model to predict potential miRNA-disease pairs. Each process is described in more detail below.

First, we constructed the training examples. Specifically, we analyzed HMDD v.3.0 and selected the known miRNA-disease associations as positive samples. We then combined all of the positive samples with negative samples to build the training set. Negative samples were selected in three steps: (1) we randomly selected a disease from all 850 known diseases, (2) we randomly selected a miRNA in the same way, and (3) we took the miRNA-disease pair as a negative sample if it was not among the positive samples.
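The three-step negative sampling can be sketched as rejection sampling over index pairs. The sketch assumes that miRNAs and diseases are identified by integer indices and that duplicate negatives are discarded; the function name, the seed, and the duplicate handling are illustrative assumptions, and the number of negatives to draw is left to the caller because the text does not state it.

```python
import random


def sample_negatives(positives, n_mirnas, n_diseases, n_samples, seed=0):
    """Draw distinct (miRNA, disease) index pairs not among the positives."""
    positive_set = set(positives)
    if n_samples > n_mirnas * n_diseases - len(positive_set):
        raise ValueError("not enough unlabeled pairs to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_samples:
        disease = rng.randrange(n_diseases)  # step 1: random disease
        mirna = rng.randrange(n_mirnas)      # step 2: random miRNA
        pair = (mirna, disease)
        if pair not in positive_set:         # step 3: keep if not a known association
            negatives.add(pair)
    return sorted(negatives)
```

Because unlabeled pairs vastly outnumber known associations, a randomly drawn pair is rarely rejected, although some sampled negatives may be true associations that have not yet been discovered.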

Second, we built the feature set. We combined three disease matrices, namely a Gaussian interaction profile kernel similarity matrix and two semantic similarity matrices, into the disease feature vector:

$$DS(d_i) = (\eta_1, \eta_2, \eta_3, \ldots, \eta_{849}, \eta_{850}), \tag{21}$$

where $DS(d_i)$ is the $i$-th row vector of matrix $DS$, and $\eta_j$ is the combined similarity between diseases $d_i$ and $d_j$. In the same way, we used the 1,057 similarity values of each miRNA in the miRNA similarity matrix to construct a 1,057-dimensional feature vector:

$$MS(m_a) = (\phi_1, \phi_2, \phi_3, \ldots, \phi_{1056}, \phi_{1057}), \tag{22}$$

where $MS(m_a)$ is the $a$-th column vector of matrix $MS$, and $\phi_b$ is the combined similarity between miRNAs $m_a$ and $m_b$. Each miRNA-disease sample can then be described by a 1,907-dimensional vector:

$$F_{sim} = (DS(d_i), MS(m_a)). \tag{23}$$

That is, $F_{sim} = (\eta_1, \eta_2, \eta_3, \ldots, \phi_{1906}, \phi_{1907})$, where $(\eta_1, \eta_2, \eta_3, \ldots, \eta_{850})$ are the 850 combined similarity values of the disease and $(\phi_{851}, \phi_{852}, \phi_{853}, \ldots, \phi_{1907})$ are the 1,057 combined similarity values of the miRNA. We then used an autoencoder (AE)43 to reduce $F_{sim}$ from 1,907 to 32 dimensions and, in the same way, the sequence feature vector $F_{seq}$ from 1,057 to 32 dimensions. Each miRNA-disease sample is thus defined as a 64-dimensional vector:

$$F = (F'_{sim}, F'_{seq}). \tag{24}$$

Finally, we used RoF to build the prediction model from the training set. Specifically, the 64-dimensional vectors obtained in the previous step were used as the training set. The training samples were fed into RoF to build a model for predicting potential miRNA-disease associations. The workflow of the DBMDA model is shown in Figure 5.

Figure 5. The Workflow of DBMDA Model to Predict Potential miRNA-Disease Associations

Availability of Data and Materials

The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Author Contributions

K.Z. conceived the algorithm, analyzed it, conducted the experiment, and wrote the manuscript. K.Z. and L.W. prepared the dataset. L.-P.L., Z.-W.L., and Y.Z. analyzed the experiment. The final draft was read and approved by all authors.

Conflicts of Interest

The authors declare no competing interests.

Acknowledgments

Z.-H.Y. was supported by National Natural Science Foundation of China Grant 61572506, the Pioneer Hundred Talents Program of Chinese Academy of Sciences, and the CCF-Tencent Open Fund. L.W. was supported by National Natural Science Foundation of China Grant 61702444, Chinese Postdoctoral Science Foundation Grant 2019M653804, and West Light Foundation of the Chinese Academy of Sciences Grant 2018-XBQNXZ-B-008. Z.-W.L. was supported by National Natural Science Foundation of China Grant 61873270. The authors would like to thank all anonymous reviewers for their constructive advice.

Contributor Information

Kai Zheng, Email: zhengkai951211@gmail.com.

Zhu-Hong You, Email: zhuhongyou@ms.xjb.ac.cn.

Lei Wang, Email: leiwang@ms.xjb.ac.cn.

References

  • 1. Ambros V. The functions of animal microRNAs. Nature. 2004;431:350–355. doi: 10.1038/nature02871.
  • 2. Bartel D.P. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004;116:281–297. doi: 10.1016/s0092-8674(04)00045-5.
  • 3. Lee R.C., Feinbaum R.L., Ambros V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993;75:843–854. doi: 10.1016/0092-8674(93)90529-y.
  • 4. Ambros V. microRNAs: tiny regulators with great potential. Cell. 2001;107:823–826. doi: 10.1016/s0092-8674(01)00616-x.
  • 5. Wightman B., Ha I., Ruvkun G. Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell. 1993;75:855–862. doi: 10.1016/0092-8674(93)90530-4.
  • 6. Meola N., Gennarino V.A., Banfi S. microRNAs and genetic diseases. PathoGenetics. 2009;2:7. doi: 10.1186/1755-8417-2-7.
  • 7. Zhu X., Li Y., Shen H., Li H., Long L., Hui L., Xu W. miR-137 inhibits the proliferation of lung cancer cells by targeting Cdc42 and Cdk6. FEBS Lett. 2013;587:73–81. doi: 10.1016/j.febslet.2012.11.004.
  • 8. von Brandenstein M., Pandarakalam J.J., Kroon L., Loeser H., Herden J., Braun G., Wendland K., Dienes H.P., Engelmann U., Fries J.W. MicroRNA 15a, inversely correlated to PKCα, is a potential marker to differentiate between benign and malignant renal tumors in biopsy and urine samples. Am. J. Pathol. 2012;180:1787–1797. doi: 10.1016/j.ajpath.2012.01.014.
  • 9. Chu T.-H., Yang C.C., Liu C.J., Lui M.T., Lin S.C., Chang K.W. miR-211 promotes the progression of head and neck carcinomas by targeting TGFβRII. Cancer Lett. 2013;337:115–124. doi: 10.1016/j.canlet.2013.05.032.
  • 10. Bandyopadhyay S., Mitra R., Maulik U., Zhang M.Q. Development of the human cancer microRNA network. Silence. 2010;1:6. doi: 10.1186/1758-907X-1-6.
  • 11. Chen X., Yan C.C., Zhang X., You Z.-H. Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2016;18:558–576. doi: 10.1093/bib/bbw060.
  • 12. Chen X., You Z.H., Yan G.Y., Gong D.W. IRWRLDA: improved random walk with restart for lncRNA-disease association prediction. Oncotarget. 2016;7:57919–57931. doi: 10.18632/oncotarget.11141.
  • 13. Chen X., Huang Y.A., Wang X.S., You Z.H., Chan K.C. FMLNCSIM: fuzzy measure-based lncRNA functional similarity calculation model. Oncotarget. 2016;7:45948–45958. doi: 10.18632/oncotarget.10008.
  • 14. You Z.H., Huang Z.A., Zhu Z., Yan G.Y., Li Z.W., Wen Z., Chen X. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput. Biol. 2017;13:e1005455. doi: 10.1371/journal.pcbi.1005455.
  • 15. Zheng K., You Z.H., Wang L., Li Y.R., Wang Y.B., Jiang H.J. MISSIM: improved miRNA-disease association prediction model based on chaos game representation and broad learning system. In: Huang D.S., Huang Z.K., Hussain A., editors. Intelligent Computing Methodologies: 15th International Conference, ICIC 2019. Springer; 2019. pp. 392–398.
  • 16. Zheng K., You Z.H., Wang L., Zhou Y., Li L.P., Li Z.W. MLMDA: a machine learning approach to predict and validate MicroRNA-disease associations by integrating of heterogenous information sources. J. Transl. Med. 2019;17:260. doi: 10.1186/s12967-019-2009-x.
  • 17. Zheng K., Wang L., You Z.-H. CGMDA: An Approach to Predict and Validate MicroRNA-Disease Associations by Utilizing Chaos Game Representation and LightGBM. IEEE Access. 2019;7:133314–133323.
  • 18. Chen X., Liu M.-X., Yan G.-Y. RWRMDA: predicting novel human microRNA-disease associations. Mol. Biosyst. 2012;8:2792–2798. doi: 10.1039/c2mb25180a.
  • 19. Li X., Wang Q., Zheng Y., Lv S., Ning S., Sun J., Huang T., Zheng Q., Ren H., Xu J. Prioritizing human cancer microRNAs based on genes’ functional consistency between microRNA and cancer. Nucleic Acids Res. 2011;39:e153. doi: 10.1093/nar/gkr770.
  • 20. Chen X., Yan C.C., Zhang X., You Z.H., Huang Y.A., Yan G.Y. HGIMDA: Heterogeneous graph inference for miRNA-disease association prediction. Oncotarget. 2016;7:65257–65269. doi: 10.18632/oncotarget.11251.
  • 21. Xuan P., Han K., Guo Y., Li J., Li X., Zhong Y., Zhang Z., Ding J. Prediction of potential disease-associated microRNAs based on random walk. Bioinformatics. 2015;31:1805–1815. doi: 10.1093/bioinformatics/btv039.
  • 22. Xu J., Li C.X., Lv J.Y., Li Y.S., Xiao Y., Shao T.T., Huo X., Li X., Zou Y., Han Q.L. Prioritizing candidate disease miRNAs by topological features in the miRNA target-dysregulated network: case study of prostate cancer. Mol. Cancer Ther. 2011;10:1857–1866. doi: 10.1158/1535-7163.MCT-11-0055.
  • 23. Chen X., Yan G.-Y. Semi-supervised learning for potential human microRNA-disease associations inference. Sci. Rep. 2014;4:5501. doi: 10.1038/srep05501.
  • 24. Chen X., Yan C.C., Zhang X., Li Z., Deng L., Zhang Y., Dai Q. RBMMMDA: predicting multiple types of disease-microRNA associations. Sci. Rep. 2015;5:13877. doi: 10.1038/srep13877.
  • 25. Li Y., Qiu C., Tu J., Geng B., Yang J., Jiang T., Cui Q. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;42:D1070–D1074. doi: 10.1093/nar/gkt1023.
  • 26. Griffiths-Jones S., Saini H.K., van Dongen S., Enright A.J. miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008;36:D154–D158. doi: 10.1093/nar/gkm952.
  • 27. Chen H., Zhang Z. Similarity-based methods for potential human microRNA-disease association prediction. BMC Med. Genomics. 2013;6:12. doi: 10.1186/1755-8794-6-12.
  • 28. Yu H., Chen X., Lu L. Large-scale prediction of microRNA-disease associations by combinatorial prioritization algorithm. Sci. Rep. 2017;7:43792. doi: 10.1038/srep43792.
  • 29. Xuan P., Han K., Guo M., Guo Y., Li J., Ding J., Liu Y., Dai Q., Li J., Teng Z., Huang Y. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS ONE. 2013;8:e70204. doi: 10.1371/journal.pone.0070204.
  • 30. Yang Y., Fu X., Qu W., Xiao Y., Shen H.-B. MiRGOFS: a GO-based functional similarity measurement for miRNAs, with applications to the prediction of miRNA subcellular localization and miRNA-disease association. Bioinformatics. 2018;34:3547–3556. doi: 10.1093/bioinformatics/bty343.
  • 31. Chen X., Yin J., Qu J., Huang L. MDHGI: Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction. PLoS Comput. Biol. 2018;14:e1006418. doi: 10.1371/journal.pcbi.1006418.
  • 32. Yang Z., Ren F., Liu C., He S., Sun G., Gao Q., Yao L., Zhang Y., Miao R., Cao Y. dbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genomics. 2010;11(Suppl 4):S5. doi: 10.1186/1471-2164-11-S4-S5.
  • 33. Jiang Q., Wang Y., Hao Y., Juan L., Teng M., Zhang X., Li M., Wang G., Liu Y. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 2009;37:D98–D104. doi: 10.1093/nar/gkn714.
  • 34. Chen L., Liu B., Yan C. DPFMDA: Distributed and privatized framework for miRNA-Disease association prediction. Pattern Recognit. Lett. 2018;109:4–11.
  • 35. Wang D., Wang J., Lu M., Song F., Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26:1644–1650. doi: 10.1093/bioinformatics/btq241.
  • 36. Lord P.W., Stevens R.D., Brass A., Goble C.A. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003;19:1275–1283. doi: 10.1093/bioinformatics/btg153.
  • 37. van Laarhoven T., Nabuurs S.B., Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011;27:3036–3043. doi: 10.1093/bioinformatics/btr500.
  • 38. Chen X., Yan C.C., Zhang X., You Z.H., Deng L., Liu Y., Zhang Y., Dai Q. WBSMDA: within and between score for MiRNA-disease association prediction. Sci. Rep. 2016;6:21106. doi: 10.1038/srep21106.
  • 39. Kirk J.M., Kim S.O., Inoue K., Smola M.J., Lee D.M., Schertzer M.D., Wooten J.S., Baker A.R., Sprague D., Collins D.W. Functional classification of long non-coding RNAs by k-mer content. Nat. Genet. 2018;50:1474–1482. doi: 10.1038/s41588-018-0207-8.
  • 40. Jeffrey H.J. Chaos game representation of gene structure. Nucleic Acids Res. 1990;18:2163–2170. doi: 10.1093/nar/18.8.2163.
  • 41. Kuncheva L.I., Rodríguez J.J. An experimental study on Rotation Forest ensembles. In: Haindl M., Kittler J., Roli F., editors. Multiple Classifier Systems: MCS 2007. Lecture Notes in Computer Science. Volume 4472. Springer; 2007. pp. 459–468.
  • 42. Rodríguez J.J., Kuncheva L.I., Alonso C.J. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 2006;28:1619–1630. doi: 10.1109/TPAMI.2006.211.
  • 43. Deng L., et al. Eleventh Annual Conference of the International Speech Communication Association.
  • 44. Wang L., You Z.H., Chen X., Li Y.M., Dong Y.N., Li L.P., Zheng K. LMTRDA: Using logistic model tree to predict MiRNA-disease associations by fusing multi-source information of sequences and similarities. PLoS Comput. Biol. 2019;15:e1006865. doi: 10.1371/journal.pcbi.1006865.


Articles from Molecular Therapy. Nucleic Acids are provided here courtesy of The American Society of Gene & Cell Therapy
