PLOS Computational Biology. 2021 Jul 12;17(7):e1009165. doi: 10.1371/journal.pcbi.1009165

SCMFMDA: Predicting microRNA-disease associations based on similarity constrained matrix factorization

Lei Li 1, Zhen Gao 1, Yu-Tian Wang 1, Ming-Wen Zhang 1, Jian-Cheng Ni 1,*, Chun-Hou Zheng 1,2,*, Yansen Su 2,*
Editor: Quan Zou 3
PMCID: PMC8345837  PMID: 34252084

Abstract

miRNAs are small non-coding RNAs involved in many complex biological processes, and numerous studies have shown that they are closely associated with human diseases. In this study, we propose a computational model based on Similarity Constrained Matrix Factorization for miRNA-Disease Association prediction (SCMFMDA). To combine different disease and miRNA similarity data effectively, we applied the similarity network fusion algorithm to obtain an integrated disease similarity (composed of disease functional similarity, disease semantic similarity and disease Gaussian interaction profile kernel similarity) and an integrated miRNA similarity (composed of miRNA functional similarity, miRNA sequence similarity and miRNA Gaussian interaction profile kernel similarity). In addition, L2 regularization terms and similarity constraint terms were added to the traditional Nonnegative Matrix Factorization algorithm to predict disease-related miRNAs. SCMFMDA achieved AUCs of 0.9675 and 0.9447 under global leave-one-out cross validation and five-fold cross validation, respectively. Furthermore, case studies on two common human diseases were carried out to demonstrate the prediction accuracy of SCMFMDA. Most of the top 50 predicted miRNAs were confirmed by experimental reports, indicating that SCMFMDA is effective for predicting associations between miRNAs and diseases.

Author summary

Numerous studies have suggested that miRNAs are closely associated with many human diseases, so predicting potential miRNA-disease associations can contribute to the diagnosis and treatment of diseases. Computational models for discovering unknown miRNA-disease associations make this search more productive and effective. We propose SCMFMDA, which aims at more accurate predictions by applying similarity network fusion to fuse multi-source disease and miRNA information and by using similarity constrained matrix factorization to make predictions from this biological information. Global leave-one-out cross validation and five-fold cross validation were applied to evaluate our model; SCMFMDA achieved AUCs of 0.9675 and 0.9447, respectively, higher than those of the previous computational models we compared against. Furthermore, in case studies on two important human diseases, colon neoplasms and lung neoplasms, 47 and 46 of the top 50 predicted miRNAs, respectively, were confirmed by experimental reports. These results indicate that SCMFMDA can be regarded as an effective way to discover unverified miRNA-disease associations.

Introduction

MicroRNAs (miRNAs) are non-coding RNAs of 17-24 nt that play a pivotal role in controlling gene expression through RNA cleavage or translational repression [1–3]. The first miRNA, lin-4, was identified experimentally by Lee et al. [4] in 1993. Since then, a large number of miRNAs have been discovered experimentally [4,5], in species ranging from viruses to animals and plants [6]. Because miRNAs regulate the expression of a large number of target genes, the miRNA pathway as a whole plays a key role in gene expression control [7–9]. miRNAs are involved in several crucial biological processes, such as cell development, differentiation and proliferation [10]. Dysregulation of miRNAs can cause developmental defects and is also associated with disease progression [11]. Moreover, many studies have indicated that miRNAs are linked to a series of human neoplasms, including lung neoplasms [12] and prostate neoplasms [13]. Identifying disease-associated miRNAs can therefore deepen our understanding of the genetic causes of complex diseases. Many miRNA-disease associations have been found by traditional experiments in the past few years [14,15]. Such experiments can establish associations between miRNAs and diseases, but they are time-consuming, laborious and have a high failure rate. Effective and stable computational methods are therefore needed to uncover potential miRNA-disease relationships, as they can yield increasingly reliable miRNA-disease associations [16].

In recent years, many computational algorithms and methods have been applied to predict potential miRNA-disease associations [17,18]. For example, Jiang et al. [19] proposed a model that used the human phenome-microRNAome network to predict potential interactions between functionally similar miRNAs and phenotypically similar diseases. However, its predictive performance fell short of expectations because of the high false positive and false negative rates in the miRNA-target associations. Later, the model WBSMDA [20] introduced Gaussian interaction profile similarity to enrich the similarity information of miRNAs and diseases. WBSMDA could also predict potential associations for new miRNAs and new diseases without any verified associations. CMFMDA [21] applied collaborative matrix factorization to predict miRNA-disease associations and could also exploit abundant biological information to uncover unknown interactions. The model EGBMMDA [22] used decision tree learning to discover novel miRNA-disease interactions by integrating verified miRNA-disease associations, miRNA functional similarity and disease semantic similarity; an informative feature vector was constructed from multiple measures to train a regression tree under the gradient boosting framework. Zhao et al. [23] applied adaptive boosting to identify unverified miRNA-disease associations in the ABMDA model, using k-means clustering on the negative samples to perform random sampling and thereby balance positive and negative samples. The BHCMDA [24] model used a biased heat conduction (BHC) algorithm to predict unknown associations between miRNAs and diseases by combining the miRNA similarity matrix, the disease similarity matrix and the miRNA-disease association matrix. The probabilistic matrix factorization (PMF) algorithm was used in the IMIPMF [25] model to infer potential miRNA-disease interactions. Because PMF is widely used in recommender systems, it can make effective use of all available information to recommend miRNAs that are strongly associated with a disease.

More recently, methods based on random walks have been proposed and have yielded more accurate predictions. Chen et al. [26] used the random walk with restart algorithm to construct the RWRMDA model. Because prediction based on global network similarity performs better than prediction based on local network similarity [27,28], RWRMDA employed global network similarity to identify plausible interactions between miRNAs and diseases. Unfortunately, RWRMDA cannot be applied to diseases without known associated miRNAs. Shi et al. [29] devised a model that exploits the functional links between human disease genes and miRNA targets, applying a random walk algorithm and a global network distance measure to search for plausible miRNA-disease relationships. Liu et al. [30] also used random walk with restart to improve prediction results, running the algorithm on a heterogeneous graph built from disease similarity and miRNA similarity. Luo et al. [31] employed an unbalanced bi-random walk on a heterogeneous network containing miRNA and disease information to identify plausible miRNA-disease interactions. Niu et al. [32] applied random walk with restart to extract miRNA features from an integrated miRNA similarity network in the RWBRMDA model; these features were then used by a binary logistic regression to predict potential miRNA-disease associations.

To obtain reliable and accurate predictions, machine learning-based methods have increasingly been used to predict unknown miRNA-disease associations. For instance, the model RBMMMDA [33] used a restricted Boltzmann machine to predict multiple types of miRNA-disease associations; it could identify not only novel associations between miRNAs and diseases but also the corresponding association types. The model PBMDA [34] constructed a heterogeneous graph consisting of interlinked sub-graphs and adopted a depth-first search algorithm to find potential miRNA-disease associations, providing a useful computational tool to accelerate the prediction of miRNA-disease interactions. The model DNRLMF-MDA [35] integrated dynamic neighborhood regularization with logistic matrix factorization: it applied logistic matrix factorization to estimate the association probability between miRNAs and diseases and then applied dynamic neighborhood regularization to improve predictive performance. Peng et al. [36] proposed the model MDA-CNN for identifying miRNA-disease associations. The miRNA-disease interaction features were first captured by a three-layer network, and an auto-encoder was then employed to identify the essential miRNA-disease feature combinations. After this dimensionality reduction, a convolutional neural network used the reduced feature representations to predict the final results. Zheng et al. proposed the machine learning-based model MLMDA [37]. A k-mer sparse matrix was used to extract miRNA sequence information, which was then integrated with miRNA and disease similarity information to construct feature vectors. A deep auto-encoder neural network (AE) and a random forest classifier used these feature vectors to calculate the prediction probability. The NCMCMDA [38] model integrated a neighborhood constraint with the matrix completion algorithm to turn the recovery task into an optimization problem, and applied the fast iterative shrinkage-thresholding algorithm to recover missing interactions between miRNAs and diseases. Zhang et al. [39] proposed the computational model MSFSP to predict miRNA-disease interactions more accurately. MSFSP first integrated various kinds of similarity information to construct the miRNA and disease similarities. The miRNA and disease similarity matrices and the verified miRNA-disease association matrix were then used to build a weighted network of miRNA-disease associations, and the final prediction labels were calculated by weighting the miRNA and disease space projection scores. Ji et al. [40] proposed the SVAEMDA model to infer more disease-related miRNAs, using miRNA similarity and disease similarity to obtain representations of miRNAs and diseases. A variational autoencoder-based predictor was then trained to predict unknown miRNA-disease interactions, combining verified miRNA-disease interactions with these representations to generate the feature vectors of miRNAs and diseases.

To address the limitations of previous models, we present a novel model based on Similarity Constrained Matrix Factorization for miRNA-Disease Association prediction (SCMFMDA). To obtain rich disease similarity data, we applied the similarity network fusion algorithm to integrate several disease similarities: disease functional similarity, disease semantic similarity and disease Gaussian interaction profile kernel similarity. Similarly, the miRNA similarity was obtained by applying similarity network fusion to miRNA functional similarity, miRNA sequence similarity and miRNA Gaussian interaction profile kernel similarity. In addition, we added L2 regularization terms and similarity constraint terms to the standard Nonnegative Matrix Factorization (NMF) method to predict unknown miRNA-disease associations. To evaluate the effectiveness of SCMFMDA, global leave-one-out cross validation and five-fold cross validation were carried out on the verified miRNA-disease association data downloaded from HMDD v2.0 [41]; SCMFMDA achieved AUC values of 0.9675 and 0.9447, respectively. Furthermore, we performed case studies on colon neoplasms and lung neoplasms and validated the results against the miR2Disease [42] and dbDEMC v2.0 [43] databases, obtaining high confirmation ratios. These experimental results show that SCMFMDA is effective for inferring possible relationships between miRNAs and diseases.

Materials

Human miRNA-disease associations

In this study, we downloaded verified human miRNA-disease associations from the HMDD v2.0 database, which included 5430 known associations between 383 diseases and 495 miRNAs. For computational convenience, we built an adjacency matrix $A \in \mathbb{R}^{n_d \times n_m}$ to represent the verified miRNA-disease associations, where $n_d$ and $n_m$ denote the numbers of diseases and miRNAs, respectively. We use $a_{ij}$ to denote the $(i,j)$th element of $A$: $a_{ij}$ is set to 1 if disease $d_i$ is related to miRNA $m_j$, and to 0 otherwise.
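As an illustration of this encoding, the sketch below builds a small adjacency matrix from a list of (disease, miRNA) pairs. The function name `build_association_matrix` and the toy identifiers are assumptions made for the example; they are not HMDD entries or part of the original implementation.

```python
def build_association_matrix(diseases, mirnas, known_pairs):
    """Return A as a list of rows: A[i][j] = 1 if disease i is linked to miRNA j."""
    d_index = {name: i for i, name in enumerate(diseases)}
    m_index = {name: j for j, name in enumerate(mirnas)}
    A = [[0] * len(mirnas) for _ in diseases]
    for disease, mirna in known_pairs:
        A[d_index[disease]][m_index[mirna]] = 1
    return A


# Toy identifiers (hypothetical, not real HMDD records).
diseases = ["d1", "d2", "d3"]
mirnas = ["m1", "m2", "m3", "m4"]
known_pairs = [("d1", "m1"), ("d1", "m3"), ("d3", "m4")]
A = build_association_matrix(diseases, mirnas, known_pairs)
```

With the real data, the same construction would give a 383 × 495 matrix containing 5430 ones.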

Disease functional similarity

Phenotypically similar diseases tend to be associated with similar genes, so disease functional similarity can be calculated from gene functional information. The log-likelihood score (LLS) represents the probability of a functional linkage between two genes; it can be downloaded from the HumanNet database [44] and normalized as follows:

$$\mathrm{LLS}_n(g_a,g_b)=\frac{\mathrm{LLS}(g_a,g_b)-\mathrm{LLS}_{\min}}{\mathrm{LLS}_{\max}-\mathrm{LLS}_{\min}} \tag{1}$$

where $\mathrm{LLS}(g_a,g_b)$ denotes the LLS between gene $g_a$ and gene $g_b$, $\mathrm{LLS}_{\max}$ and $\mathrm{LLS}_{\min}$ are the maximum and minimum LLS in the HumanNet database, and $\mathrm{LLS}_n(g_a,g_b)$ is the normalized LLS.

The gene functional similarity score is then calculated as follows:

$$FS(g_a,g_b)=\begin{cases}1 & \text{if } a=b\\ \mathrm{LLS}_n(g_a,g_b) & \text{if } a\neq b \text{ and } e(a,b)\in S_{\mathrm{HumanNet}}\\ 0 & \text{if } a\neq b \text{ and } e(a,b)\notin S_{\mathrm{HumanNet}}\end{cases} \tag{2}$$

where $S_{\mathrm{HumanNet}}$ is the set of all links between genes in the HumanNet database and $e(a,b)$ denotes the link between gene $g_a$ and gene $g_b$.

Furthermore, the functional similarity score between a gene $g$ and a gene set $G$ is defined as follows:

$$S_G(g)=\max_{g_b\in G}FS(g,g_b) \tag{3}$$
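Eqs (1)-(3) can be sketched in a few lines. The code below is a minimal illustration that assumes the raw LLS values are held in a dictionary keyed by unordered gene pairs; the function names and the toy scores are hypothetical, and the sketch assumes the maximum and minimum LLS differ.

```python
def normalize_lls(lls):
    """Eq (1): min-max normalisation of raw LLS values (assumes max > min)."""
    lo, hi = min(lls.values()), max(lls.values())
    return {pair: (score - lo) / (hi - lo) for pair, score in lls.items()}


def gene_similarity(ga, gb, lls_n):
    """Eq (2): 1 for identical genes, normalised LLS if linked, 0 otherwise."""
    if ga == gb:
        return 1.0
    return lls_n.get(frozenset((ga, gb)), 0.0)


def gene_to_set_similarity(g, gene_set, lls_n):
    """Eq (3): best similarity between gene g and any member of gene_set."""
    return max(gene_similarity(g, gb, lls_n) for gb in gene_set)


# Toy link scores (hypothetical values, not taken from HumanNet).
raw = {
    frozenset(("g1", "g2")): 1.0,
    frozenset(("g1", "g3")): 2.0,
    frozenset(("g2", "g3")): 3.0,
}
lls_n = normalize_lls(raw)
```

Note that, under Eq (1), the link with the minimum LLS is normalized to 0 and therefore scores the same as an absent link.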

SIDD [45] provides disease-gene association data, which are used to calculate the disease functional similarity $SD_1$ by the following equation:

$$SD_1(d_i,d_j)=\frac{\sum_{g_a\in G_i}S_{G_i}(g_a)+\sum_{g_b\in G_j}S_{G_j}(g_b)}{|G_i|\,|G_j|} \tag{4}$$

Disease semantic similarity

Following a previous study [46], Medical Subject Headings (MeSH) descriptors can be used to calculate disease semantic similarity. Each disease is described by a Directed Acyclic Graph (DAG) that captures its relationships with other diseases. Concretely, DAG(D) = (D, T(D), E(D)) represents the DAG of disease D, in which T(D) denotes the node set containing D itself and its ancestor nodes, and E(D) denotes the set of direct edges from parent nodes to their child nodes. The semantic value of disease D can then be calculated as follows:

$$DV^{1}(D)=\sum_{d\in T(D)}D^{1}_{D}(d) \tag{5}$$

where $D^{1}_{D}(d)$, the semantic contribution of disease $d$ to $D$, is calculated as follows:

$$D^{1}_{D}(d)=\begin{cases}1 & \text{if } d=D\\ \max\{\Delta\cdot D^{1}_{D}(d')\mid d'\in \text{children of } d\} & \text{if } d\neq D\end{cases} \tag{6}$$

Here, Δ is the semantic contribution factor, which is set to 0.5 following previous literature [47].

Two diseases are assumed to be similar if large parts of their DAGs are shared. Therefore, the semantic similarity $DS^{1}(d_i,d_j)$ between disease $d_i$ and disease $d_j$ is defined as follows:

$$DS^{1}(d_i,d_j)=\frac{\sum_{t\in T(d_i)\cap T(d_j)}\left(D^{1}_{d_i}(t)+D^{1}_{d_j}(t)\right)}{DV^{1}(d_i)+DV^{1}(d_j)} \tag{7}$$
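The first semantic similarity, Eqs (5)-(7), can be illustrated on a toy hierarchy. The sketch below assumes the DAG is given as a child-to-parents mapping; the node names and function names are hypothetical and do not correspond to real MeSH descriptors.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Eq (6): contribution of `disease` and each of its ancestors to `disease`."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution over all children of `parent`.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib


def semantic_similarity(di, dj, parents, delta=0.5):
    """Eq (7): shared contributions divided by the two semantic values (Eq (5))."""
    ci = semantic_contributions(di, parents, delta)
    cj = semantic_contributions(dj, parents, delta)
    shared = set(ci) & set(cj)
    numerator = sum(ci[t] + cj[t] for t in shared)
    return numerator / (sum(ci.values()) + sum(cj.values()))


# Toy hierarchy: A is the root, B its child, C and D are children of B.
parents = {"B": ["A"], "C": ["B"], "D": ["B"]}
sim_cd = semantic_similarity("C", "D", parents)
```

In this toy case C and D each have semantic value 1 + 0.5 + 0.25 = 1.75 and share the ancestors A and B, so the similarity is 1.5 / 3.5.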

According to a previous study [48], a disease that appears in fewer DAGs may be more specific and ought to receive a higher semantic contribution. Consequently, different diseases located in the same layer of one DAG may have different contribution values. Specifically, the semantic contribution of disease $d$ to $D$ can alternatively be calculated as follows:

$$D^{2}_{D}(d)=-\log\left(\frac{\text{the number of DAGs including } d}{\text{the number of diseases}}\right) \tag{8}$$

Correspondingly, the semantic value of disease $D$ and the semantic similarity $DS^{2}(d_i,d_j)$ between disease $d_i$ and disease $d_j$ are calculated as follows:

$$DV^{2}(D)=\sum_{d\in T(D)}D^{2}_{D}(d) \tag{9}$$

$$DS^{2}(d_i,d_j)=\frac{\sum_{t\in T(d_i)\cap T(d_j)}\left(D^{2}_{d_i}(t)+D^{2}_{d_j}(t)\right)}{DV^{2}(d_i)+DV^{2}(d_j)} \tag{10}$$

Finally, we averaged $DS^{1}$ and $DS^{2}$ to obtain the final disease semantic similarity $SD_2(d_i,d_j)$ between disease $d_i$ and disease $d_j$:

$$SD_2(d_i,d_j)=\frac{DS^{1}(d_i,d_j)+DS^{2}(d_i,d_j)}{2} \tag{11}$$

miRNA functional similarity

miRNA functional similarity is calculated following the method of [49,50], which assumes that functionally similar miRNAs tend to be linked with phenotypically similar diseases and vice versa. We downloaded the miRNA functional similarity data from http://www.cuilab.cn/files/images/cuilab/misim.zip. We constructed the matrix $SM_1$ with $n_m$ rows and $n_m$ columns to store this information, where the element $SM_1(m_i,m_j)$ is the functional similarity score between miRNA $m_i$ and miRNA $m_j$.

miRNA sequence similarity

We used the Needleman-Wunsch algorithm to calculate miRNA sequence similarity; the miRNA sequences can be obtained from the miRBase database [51]. As for miRNA functional similarity, we constructed a matrix $SM_2\in\mathbb{R}^{n_m\times n_m}$ to store the sequence similarity information, where $SM_2(m_i,m_j)$ is the sequence similarity score between miRNA $m_i$ and miRNA $m_j$.

Gaussian interaction profile kernel similarity for diseases and miRNAs

Following previous studies [49,50], and because miRNAs with similar functions are likely to be linked with diseases with similar phenotypes, the Gaussian interaction profile (GIP) kernel similarity can be calculated and used to represent miRNA similarity and disease similarity. Concretely, the binary vector $K(d_i)$ is the interaction profile of disease $d_i$, recording whether $d_i$ has a known association with each miRNA. The GIP kernel similarity $SD_3(d_i,d_j)$ between disease $d_i$ and disease $d_j$ is calculated as follows:

$$SD_3(d_i,d_j)=\exp\left(-\rho_d\,\|K(d_i)-K(d_j)\|^{2}\right) \tag{12}$$

$$\rho_d=\rho'_d\Big/\left(\frac{1}{n_d}\sum_{i=1}^{n_d}\|K(d_i)\|^{2}\right) \tag{13}$$

In the same way, the GIP kernel similarity $SM_3(m_i,m_j)$ between miRNA $m_i$ and miRNA $m_j$ is calculated by the following formulas:

$$SM_3(m_i,m_j)=\exp\left(-\rho_m\,\|K(m_i)-K(m_j)\|^{2}\right) \tag{14}$$

$$\rho_m=\rho'_m\Big/\left(\frac{1}{n_m}\sum_{i=1}^{n_m}\|K(m_i)\|^{2}\right) \tag{15}$$

where the binary vector $K(m_i)$ is the interaction profile of miRNA $m_i$, recording whether $m_i$ has a known association with each disease. The parameters $\rho_d$ and $\rho_m$ control the kernel bandwidth and are obtained by normalizing $\rho'_d$ and $\rho'_m$ by the average squared norm of the interaction profiles.
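A minimal sketch of Eqs (12)-(15) follows. The function name `gip_kernel` is hypothetical, and the default `rho_prime=1.0` is an assumption: the text does not state the values of $\rho'_d$ and $\rho'_m$. The sketch also assumes at least one profile is non-zero.

```python
import math


def gip_kernel(profiles, rho_prime=1.0):
    """GIP kernel similarity between the rows of `profiles` (Eqs (12)-(13))."""
    n = len(profiles)
    # Average squared norm of the interaction profiles (must be non-zero).
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    rho = rho_prime / mean_sq_norm

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [[math.exp(-rho * sq_dist(p, q)) for q in profiles] for p in profiles]


# Toy association matrix: rows are diseases, columns are miRNAs.
A = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
SD3 = gip_kernel(A)                               # disease profiles are the rows of A
SM3 = gip_kernel([list(col) for col in zip(*A)])  # miRNA profiles are the columns of A
```

Diseases with identical interaction profiles receive similarity 1, and the similarity decays with the number of miRNAs on which two profiles disagree.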

Methods

Overview

SCMFMDA consists of two major parts. First, similarity network fusion is applied to obtain the integrated disease similarity and the integrated miRNA similarity. Second, the known miRNA-disease associations and the integrated similarities are used in similarity constrained matrix factorization to infer unknown miRNA-disease associations. The flow chart of SCMFMDA is shown in Fig 1.

Fig 1. Flow chart of SCMFMDA.


Integrating similarity for diseases and miRNAs

The similarity between two diseases can be represented by disease functional similarity, disease semantic similarity and disease GIP kernel similarity. Similarly, miRNA functional similarity, miRNA sequence similarity and miRNA GIP kernel similarity can be used to represent the similarity between miRNAs. Here, the similarity network fusion (SNF) [52] method is applied to integrate the various similarities for diseases and miRNAs. According to the previous study, SNF can be expressed as an iterative update of similarity matrices. The main steps for integrating the disease similarities $SD_n$, $n = 1,2,3$, are as follows.

In the first step, we calculated the normalized weight matrix $P_n$ of each similarity network:

$$P_n(d_i,d_j)=\begin{cases}\dfrac{SD_n(d_i,d_j)}{2\sum_{k\neq i}SD_n(d_i,d_k)} & j\neq i\\[2ex] \dfrac{1}{2} & j=i\end{cases} \tag{16}$$

In the second step, we used the k nearest neighbor (KNN) algorithm to measure the local relationships in each similarity network. The corresponding matrix $K_n$ is obtained as follows:

$$K_n(d_i,d_j)=\begin{cases}\dfrac{SD_n(d_i,d_j)}{\sum_{k\in N_i}SD_n(d_i,d_k)} & j\in N_i\\[2ex] 0 & \text{otherwise}\end{cases} \tag{17}$$

where $N_i$ denotes the set of nearest neighbors of disease $d_i$.

In the third step, we applied SNF to integrate the normalized weight matrices $P_n$ and the local relationship matrices $K_n$:

$$P_n=K_n\times\left(\frac{\sum_{t\neq n}P_t}{m-1}\right)\times(K_n)^{T},\quad n=1,2,\ldots,m \tag{18}$$

Because there are three disease similarity networks (disease functional similarity, disease semantic similarity and disease GIP kernel similarity), $m = 3$. After the iterative update, the final disease similarity matrix $S_d$ is obtained as follows:

$$S_d=\frac{1}{3}\sum_{n=1}^{3}P_n \tag{19}$$

Similarly, we applied the SNF algorithm to obtain the final miRNA similarity matrix $S_m$.
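The three SNF steps above can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the neighborhood size `k`, the number of iterations `n_iter`, and the choice to let a node count among its own nearest neighbors are assumptions, since the text does not specify them. It also assumes every row has a positive off-diagonal sum.

```python
import numpy as np


def normalized_weights(S):
    """Eq (16): off-diagonal entries over twice the off-diagonal row sum, diagonal 1/2."""
    S = np.asarray(S, dtype=float)
    off = S - np.diag(np.diag(S))
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def knn_kernel(S, k):
    """Eq (17): keep the k largest entries of each row, rescaled to sum to 1."""
    S = np.asarray(S, dtype=float)
    K = np.zeros_like(S)
    for i in range(S.shape[0]):
        neighbours = np.argsort(S[i])[::-1][:k]
        K[i, neighbours] = S[i, neighbours] / S[i, neighbours].sum()
    return K


def snf(similarities, k=3, n_iter=10):
    """Eqs (18)-(19): cross-network diffusion followed by averaging."""
    P = [normalized_weights(S) for S in similarities]
    K = [knn_kernel(S, k) for S in similarities]
    m = len(P)
    for _ in range(n_iter):
        # Each network is updated from the average of the *other* networks.
        P = [K[n] @ (sum(P[t] for t in range(m) if t != n) / (m - 1)) @ K[n].T
             for n in range(m)]
    return sum(P) / m
```

Calling `snf([SD1, SD2, SD3])` on three disease similarity matrices of equal shape returns a single fused matrix playing the role of $S_d$; the same call on the three miRNA matrices gives $S_m$.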

Similarity constrained matrix factorization

After obtaining the integrated disease and miRNA similarities, similarity constrained matrix factorization is adopted to uncover unknown miRNA-disease interactions; Fig 2 shows the details. SCMFMDA factorizes the matrix $A\in\mathbb{R}^{n_d\times n_m}$ into $U\in\mathbb{R}^{n_d\times\gamma}$ and $V\in\mathbb{R}^{n_m\times\gamma}$, where $\gamma$ denotes the dimension of the disease and miRNA features in the low-rank space. Specifically, the association between a disease and a miRNA is approximated by the inner product of the disease feature vector and the miRNA feature vector: $a_{ij}\approx u_iv_j^{T}$, where $u_i$ and $v_j$ represent the $i$th row of $U$ and the $j$th row of $V$, respectively. The corresponding objective function is as follows:

$$\min\ \frac{1}{2}\sum_{ij}\left(a_{ij}-u_iv_j^{T}\right)^{2} \tag{20}$$

Fig 2. The details of similarity constrained matrix factorization.


Then, L2 regularization terms on $u_i$ and $v_j$ are added to Eq (20) to prevent overfitting:

$$\min\ \frac{1}{2}\sum_{ij}\left(a_{ij}-u_iv_j^{T}\right)^{2}+\frac{\sigma}{2}\sum_{i}\|u_i\|^{2}+\frac{\sigma}{2}\sum_{j}\|v_j\|^{2} \tag{21}$$

where $\sigma$ is the regularization parameter for $u_i$ and $v_j$.

According to previous studies [53,54], the geometric properties of data points can be preserved when they are mapped from the high-rank space into the low-rank space. The disease similarity $S_d$ and the miRNA similarity $S_m$ describe the geometric structure of the data points, so we introduce the similarity constraint terms $S_U$ and $S_V$ as follows:

$$S_U=\frac{1}{2}\sum_{ij}\|u_i-u_j\|^{2}S^{d}_{ij} \tag{22}$$

$$S_V=\frac{1}{2}\sum_{ij}\|v_i-v_j\|^{2}S^{m}_{ij} \tag{23}$$

where $S^{d}_{ij}$ represents the similarity between disease $d_i$ and disease $d_j$, and $S^{m}_{ij}$ denotes the similarity between miRNA $m_i$ and miRNA $m_j$. Because each squared distance is weighted by the corresponding similarity, $S_U$ incurs a heavy penalty when two highly similar diseases $d_i$ and $d_j$ are mapped far apart in the disease feature space. Minimizing $S_U$ therefore preserves the geometric structure of the disease data points, so that similar diseases are mapped close together in the low-dimensional space. The same holds for miRNAs. Hence, the objective function of SCMFMDA is obtained by adding $S_U$ and $S_V$ to Eq (21):

$$\min_{U,V}L=\frac{1}{2}\sum_{ij}\left(a_{ij}-u_iv_j^{T}\right)^{2}+\frac{\sigma}{2}\sum_{i}\|u_i\|^{2}+\frac{\sigma}{2}\sum_{j}\|v_j\|^{2}+\frac{\varepsilon}{2}\sum_{ij}\|u_i-u_j\|^{2}S^{d}_{ij}+\frac{\varepsilon}{2}\sum_{ij}\|v_i-v_j\|^{2}S^{m}_{ij} \tag{24}$$

where $\varepsilon$ is a hyperparameter that controls the smoothness of similarity consistency.

Optimization algorithm

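In this section, we describe an efficient optimization algorithm to minimize the objective function of SCMFMDA. First, the partial derivatives of $L$ with respect to $u_i$ and $v_j$ are calculated as follows:

$$\begin{aligned}\nabla_{u_i}L&=\sum_{j}\left(u_iv_j^{T}-a_{ij}\right)v_j+\sigma u_i+\varepsilon\Big(\sum_{j}(u_i-u_j)S^{d}_{ij}-\sum_{j}(u_j-u_i)S^{d}_{ji}\Big)\\&=u_i\Big(V^{T}V+\sigma I+\varepsilon\Big(\sum_{j}S^{d}_{ij}+\sum_{j}S^{d}_{ji}\Big)I\Big)-A(i,:)V-\varepsilon\sum_{j}\left(S^{d}_{ij}+S^{d}_{ji}\right)u_j\end{aligned} \tag{25}$$

where $A(i,:)$ denotes the $i$th row of matrix $A$.

$$\begin{aligned}\nabla_{v_j}L&=\sum_{i}\left(v_ju_i^{T}-a_{ij}\right)u_i+\sigma v_j+\varepsilon\Big(\sum_{i}(v_j-v_i)S^{m}_{ji}-\sum_{i}(v_i-v_j)S^{m}_{ij}\Big)\\&=v_j\Big(U^{T}U+\sigma I+\varepsilon\Big(\sum_{i}S^{m}_{ij}+\sum_{i}S^{m}_{ji}\Big)I\Big)-A(:,j)^{T}U-\varepsilon\sum_{i}\left(S^{m}_{ij}+S^{m}_{ji}\right)v_i\end{aligned} \tag{26}$$

where $A(:,j)$ denotes the $j$th column of matrix $A$.

Then, the second derivatives of $L$ with respect to $u_i$ and $v_j$ are calculated as follows:

$$\nabla^{2}_{u_i}L=V^{T}V+\sigma I+\varepsilon\Big(\sum_{j}S^{d}_{ij}+\sum_{j}S^{d}_{ji}\Big)I \tag{27}$$

$$\nabla^{2}_{v_j}L=U^{T}U+\sigma I+\varepsilon\Big(\sum_{i}S^{m}_{ij}+\sum_{i}S^{m}_{ji}\Big)I \tag{28}$$

According to Newton's method, $u_i$ and $v_j$ are updated iteratively as follows:

$$u_i\leftarrow u_i-\nabla_{u_i}L\left(\nabla^{2}_{u_i}L\right)^{-1} \tag{29}$$

$$v_j\leftarrow v_j-\nabla_{v_j}L\left(\nabla^{2}_{v_j}L\right)^{-1} \tag{30}$$

Hence, $u_i$ and $v_j$ can be updated by the following formulas:

$$u_i=\Big(A(i,:)V+\varepsilon\sum_{j}\left(S^{d}_{ij}+S^{d}_{ji}\right)u_j\Big)\Big(V^{T}V+\sigma I+\varepsilon\Big(\sum_{j}S^{d}_{ij}+\sum_{j}S^{d}_{ji}\Big)I\Big)^{-1} \tag{31}$$

$$v_j=\Big(A(:,j)^{T}U+\varepsilon\sum_{i}\left(S^{m}_{ij}+S^{m}_{ji}\right)v_i\Big)\Big(U^{T}U+\sigma I+\varepsilon\Big(\sum_{i}S^{m}_{ij}+\sum_{i}S^{m}_{ji}\Big)I\Big)^{-1} \tag{32}$$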
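The step from Eq (29) to Eq (31) can be made explicit with two shorthand symbols, $H_i$ and $b_i$, which are introduced here only for illustration and do not appear in the original derivation. With $u_j$ on the right-hand side held at its current value:

$$\begin{aligned}
H_i&=V^{T}V+\sigma I+\varepsilon\Big(\sum_{j}S^{d}_{ij}+\sum_{j}S^{d}_{ji}\Big)I,\qquad b_i=A(i,:)V+\varepsilon\sum_{j}\left(S^{d}_{ij}+S^{d}_{ji}\right)u_j,\\
\nabla_{u_i}L&=u_iH_i-b_i,\qquad \nabla^{2}_{u_i}L=H_i,\\
u_i-\nabla_{u_i}L\left(\nabla^{2}_{u_i}L\right)^{-1}&=u_i-\left(u_iH_i-b_i\right)H_i^{-1}=b_iH_i^{-1}.
\end{aligned}$$

The last expression is the right-hand side of Eq (31); the same substitution with $U^{T}U$, $S^{m}$ and $A(:,j)^{T}U$ turns Eq (30) into Eq (32).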

The updates of $u_i$ and $v_j$ stop when the convergence condition is met. The prediction matrix is then obtained from the updated $U$ and $V$:

$$A^{P}=UV^{T} \tag{33}$$

The value $A^{P}_{ij}$ denotes the association probability between disease $d_i$ and miRNA $m_j$; the higher the score, the more likely the association.
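The following NumPy sketch illustrates the objective of Eq (24) and the row-wise updates of Eqs (31)-(32). It is a sketch under stated assumptions, not the authors' code: the function names, the random initialization, the fixed iteration count (in place of the unspecified convergence condition) and the in-place, row-by-row update order are all choices made for the example. As in the equations above, no nonnegativity projection is applied.

```python
import numpy as np


def scmf_loss(A, U, V, Sd, Sm, sigma, eps):
    """Objective L of Eq (24)."""
    def smooth(F, S):
        # Pairwise squared distances between the rows of F, weighted by S.
        d2 = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=2)
        return 0.5 * eps * np.sum(d2 * S)

    fit = 0.5 * np.sum((A - U @ V.T) ** 2)
    reg = 0.5 * sigma * (np.sum(U ** 2) + np.sum(V ** 2))
    return fit + reg + smooth(U, Sd) + smooth(V, Sm)


def update_rows(F, G, targets, S, sigma, eps):
    """Eqs (31)/(32): F plays the role of U (or V) and G the role of V (or U)."""
    k = F.shape[1]
    GtG = G.T @ G
    W = S + S.T  # W[i, j] = S_ij + S_ji
    for i in range(F.shape[0]):
        H = GtG + (sigma + eps * W[i].sum()) * np.eye(k)
        b = targets[i] @ G + eps * (W[i] @ F)
        F[i] = np.linalg.solve(H, b)  # H is symmetric, so b H^{-1} = solve(H, b)
    return F


def scmf(A, Sd, Sm, rank, sigma=4.0, eps=1.0, n_iter=50, seed=0):
    """Alternate the row updates of U and V for a fixed number of sweeps."""
    rng = np.random.default_rng(seed)
    U = rng.random((A.shape[0], rank))
    V = rng.random((A.shape[1], rank))
    for _ in range(n_iter):
        update_rows(U, V, A, Sd, sigma, eps)
        update_rows(V, U, A.T, Sm, sigma, eps)
    return U, V
```

Given the factors, `U @ V.T` plays the role of the prediction matrix $A^{P}$ in Eq (33). The defaults `sigma=4.0` and `eps=1.0` correspond to the values $2^{2}$ and $2^{0}$ reported in the parameter analysis.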

Results

Parameters optimization

In this section, the parameters $\gamma$, $\sigma$ and $\varepsilon$ are analyzed quantitatively to study their effect on prediction performance. $\gamma$ represents the dimension of the disease and miRNA features in the low-rank space; since $\gamma<\min(n_d,n_m)$, it can be expressed as a percentage of $\min(n_d,n_m)$. The parameters $\sigma$ and $\varepsilon$ are the regularization parameters. The AUC of 5-CV is used to evaluate the influence of the parameter choice on model performance. Preliminary experiments indicated that $\gamma$ affects the results independently of $\sigma$ and $\varepsilon$. We therefore fixed $\sigma$ and $\varepsilon$ at a suitable combination and searched for the best value of $\gamma\in\{0, 10\%, \ldots, 100\%\}$; to check that the outcome did not depend on this choice, the search was repeated with several fixed combinations of $\sigma$ and $\varepsilon$. As shown in Fig 3A, SCMFMDA performed best when $\gamma$ = 50%. With $\gamma$ fixed at 50%, the effect of the regularization parameters was then evaluated over all combinations of $\sigma\in\{2^{-3},2^{-2},\ldots,2^{3}\}$ and $\varepsilon\in\{2^{-3},2^{-2},\ldots,2^{3}\}$. As shown in Fig 3B, SCMFMDA achieved the best AUC of 0.9447 when $\sigma=2^{2}$ and $\varepsilon=2^{0}$. In summary, $\gamma$, $\sigma$ and $\varepsilon$ are set to 50%, $2^{2}$ and $2^{0}$ in our model, respectively.
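The search over the regularization parameters can be sketched as a plain grid search over powers of two. The helper names are hypothetical, and the `evaluate` callable is an assumption standing in for whatever routine returns the 5-CV AUC for a given pair of parameters.

```python
from itertools import product


def parameter_grid(exponents=range(-3, 4)):
    """All (sigma, eps) pairs with both values in {2^-3, ..., 2^3}."""
    values = [2.0 ** k for k in exponents]
    return list(product(values, values))


def grid_search(evaluate, grid):
    """Return the (sigma, eps) pair for which evaluate(sigma, eps) is largest."""
    return max(grid, key=lambda pair: evaluate(*pair))


grid = parameter_grid()  # 7 x 7 = 49 combinations
```

In practice `evaluate` would train the model and return the cross-validated AUC, which makes the search 49 full cross-validation runs once the rank has been fixed.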

Fig 3. The influence of parameters on SCMFMDA: (A) the influence of γ; (B) the influence of σ and ε.

Model comparison

To evaluate the prediction ability of SCMFMDA, we compared it with several previous computational methods proposed for predicting unknown miRNA-disease associations. All methods were trained on the same dataset (the HMDD v2.0 database) so that the comparison would be fair. The methods are as follows.

  • MSCHLMDA [55] is a multi-similarity based combinative hypergraph learning model (published in 2020).

  • ICFMDA [56] is an improved collaborative filtering-based computational model (published in 2018).

  • SACMDA [57] is a short acyclic connections-based computational model (published in 2018).

  • GRNMF [58] is a graph regularized non-negative matrix factorization-based model (published in 2018).

  • GRL2,1NMF [59] is a graph Laplacian regularized L2,1-nonnegative matrix factorization-based computational model (published in 2020).

  • NPCMF [60] is a nearest profile-based collaborative matrix factorization model (published in 2019).

  • KBMFMDA [61] is a kernelized Bayesian matrix factorization-based computational model (published in 2020).

The HMDD v2.0 data comprise 5430 verified associations and 184155 unverified associations between 383 diseases and 495 miRNAs. On this basis, global leave-one-out cross validation (global LOOCV) and five-fold cross validation (5-CV) were used to evaluate the prediction performance of these methods. In global LOOCV, each verified miRNA-disease association was held out in turn as the test sample, and the remaining verified associations formed the training set. All unknown miRNA-disease associations were treated as candidate samples. In 5-CV, the verified miRNA-disease associations were randomly divided into five parts; each part served as the test set in turn, and the other four parts formed the training set. Again, all unknown miRNA-disease associations were treated as candidate samples. In both schemes, SCMFMDA was used to compute all predicted association scores, so that each test sample could be ranked relative to the candidate samples. A test sample ranked above a given threshold was regarded as a successful prediction. By varying the threshold, we obtained the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR), and calculated the area under the ROC curve (AUC), whose value lies between 0 and 1. The AUCs of the other computational methods were obtained in the same way from the HMDD v2.0 data.
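The evaluation protocol can be sketched as follows. The pairwise form below is a standard equivalent of the area under the ROC curve and is used here as a simplification: it pools all test scores against all candidate scores. The function names, the fixed seed and the fold construction are assumptions made for the example.

```python
import random


def five_fold_split(verified_pairs, seed=0):
    """Randomly partition the verified associations into five folds."""
    shuffled = list(verified_pairs)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[k::5] for k in range(5)]


def auc_from_scores(test_scores, candidate_scores):
    """Probability that a held-out association outranks a candidate (ties count 1/2)."""
    wins = 0.0
    for t in test_scores:
        for c in candidate_scores:
            if t > c:
                wins += 1.0
            elif t == c:
                wins += 0.5
    return wins / (len(test_scores) * len(candidate_scores))


folds = five_fold_split(range(12))
example_auc = auc_from_scores([0.9, 0.8], [0.7, 0.85, 0.1])
```

In the toy call, five of the six (test, candidate) pairs are ranked correctly, so the AUC is 5/6. With real data, each fold's associations would be set to 0 in the training matrix before the model is fitted and scored.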

Under global LOOCV, SCMFMDA, MSCHLMDA, ICFMDA and SACMDA achieved AUC values of 0.9675, 0.9287, 0.9072 and 0.8777, respectively (Fig 4). To reduce the potential bias caused by random sample division, we repeated the random division of the verified miRNA-disease associations 100 times in 5-CV; the average AUC values of SCMFMDA, MSCHLMDA, ICFMDA and SACMDA were 0.9447, 0.9263, 0.9046 and 0.8773, respectively (Fig 5). In both settings, SCMFMDA outperformed the other methods.

Fig 4. AUC of global LOOCV compared with those of MSCHLMDA, ICFMDA and SACMDA.


Fig 5. AUC of 5-CV compared with those of MSCHLMDA, ICFMDA and SACMDA.


To further assess the performance of SCMFMDA, we also compared it with other state-of-the-art matrix factorization-based methods: GRNMF, GRL2,1NMF, NPCMF and KBMFMDA. The 5-CV results of all models are shown in Table 1; SCMFMDA has the best AUC. The advantages of SCMFMDA over the other matrix factorization-based models are as follows: first, SCMFMDA uses more biological similarity data than the other models; second, SCMFMDA uses SNF instead of the traditional linear combination to integrate the various similarity data, which helps to preserve the completeness and effectiveness of the experimental data; third, the L2 regularization and similarity constraint terms added to the NMF objective function help to discover more unknown miRNA-disease associations correctly.

Table 1. Comparisons between SCMFMDA and other MF-based models.

Computational model | AUC of 5-CV
GRNMF | 0.869
GRL2,1NMF | 0.9276
NPCMF | 0.9429
KBMFMDA | 0.9008
SCMFMDA | 0.9447

Case studies

To demonstrate the effectiveness and accuracy of SCMFMDA, we carried out case studies on two human diseases, colon neoplasms and lung neoplasms. Both do great harm to human health. Colon neoplasms are malignant tumors that have been confirmed to be associated with several miRNAs [62,63]. Lung neoplasms are among the most dangerous malignancies, with the fastest increase in morbidity and mortality [12], and growing evidence indicates that they are closely related to a number of miRNAs. For a specific disease, the verified associations of all diseases in the HMDD v2.0 database were used as training samples, and the unverified associations of that disease were treated as candidate samples. After training the model, we ranked the candidate samples by predicted association score and selected the top 50 candidate miRNAs for the disease. We then used two databases, miR2Disease and dbDEMC v2.0, to check the ranked miRNAs. Tables 2 and 3 list the prediction results obtained with SCMFMDA for colon neoplasms and lung neoplasms, respectively. Of the top 50 miRNAs inferred by our model, 94% and 92% were confirmed to be associated with colon neoplasms and lung neoplasms, respectively, according to the miR2Disease and dbDEMC v2.0 databases. Only 3 and 4 of the top 50 predicted miRNAs for colon neoplasms and lung neoplasms, respectively, had no supporting evidence in these databases.
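The candidate selection used in the case studies amounts to ranking the unverified miRNAs of one disease by predicted score. A minimal sketch, with a hypothetical function name and toy values, is:

```python
def top_candidates(scores, known, mirna_names, k=50):
    """Rank miRNAs without a known association to the disease by predicted score."""
    candidates = [(score, name)
                  for score, flag, name in zip(scores, known, mirna_names)
                  if flag == 0]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in candidates[:k]]


# Toy row of the prediction matrix for one disease (hypothetical values).
scores = [0.9, 0.2, 0.7, 0.5]
known = [1, 0, 0, 0]  # the first miRNA is already a verified association
names = ["m1", "m2", "m3", "m4"]
top2 = top_candidates(scores, known, names, k=2)
```

Here `scores` would be one row of $A^{P}$ and `known` the matching row of $A$; the returned names are the ones that would be checked against external databases.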

Table 2. The top 50 potential miRNAs associated with colon neoplasms.

miRNA Evidence miRNA Evidence
hsa-mir-21 a; b hsa-mir-30a a; b
hsa-mir-20a a; b hsa-mir-10b a; b
hsa-mir-143 a; b hsa-mir-181b a; b
hsa-mir-155 a; b hsa-mir-106b a
hsa-mir-18a b hsa-mir-203 a; b
hsa-mir-92a b hsa-mir-9 a; b
hsa-mir-34a a; b hsa-mir-34c a
hsa-mir-19b a; b hsa-mir-196a a; b
hsa-mir-19a a; b hsa-mir-183 a; b
hsa-mir-125b a; b hsa-mir-142 unconfirmed
hsa-mir-146a b hsa-mir-24 a; b
hsa-mir-16 unconfirmed hsa-mir-222 b
hsa-mir-200c a hsa-mir-133b a; b
hsa-mir-223 a; b hsa-mir-34b a; b
hsa-mir-200b b hsa-mir-224 a; b
hsa-mir-221 a; b hsa-mir-93 a; b
hsa-mir-182 a; b hsa-mir-29a a; b
hsa-mir-31 a; b hsa-mir-29b a; b
hsa-mir-200a b hsa-mir-146b b
hsa-let-7a a; b hsa-mir-27a a; b
hsa-mir-205 b hsa-mir-210 b
hsa-mir-101 b hsa-mir-141 a; b
hsa-mir-218 b hsa-mir-148a b
hsa-mir-15a b hsa-mir-486 b
hsa-mir-181a a; b hsa-mir-199a unconfirmed

a: miR2Disease database; b: dbDEMC v2.0 database

Table 3. The top 50 potential miRNAs associated with lung neoplasms.

miRNA Evidence miRNA Evidence
hsa-mir-16 a; b hsa-mir-378a unconfirmed
hsa-mir-195 a; b hsa-mir-92b b
hsa-mir-141 a; b hsa-mir-342 b
hsa-mir-106b b hsa-mir-367 b
hsa-mir-15a b hsa-mir-23b b
hsa-mir-429 a; b hsa-mir-139 a; b
hsa-mir-296 unconfirmed hsa-mir-373 b
hsa-mir-99a a; b hsa-mir-452 b
hsa-mir-122 b hsa-mir-148b b
hsa-mir-130a a; b hsa-mir-339 a; b
hsa-mir-151a unconfirmed hsa-mir-302c b
hsa-mir-625 b hsa-mir-302d b
hsa-mir-193b b hsa-mir-423 a
hsa-mir-152 b hsa-mir-208a unconfirmed
hsa-mir-20b b hsa-mir-328 b
hsa-mir-15b b hsa-mir-708 b
hsa-mir-451a b hsa-mir-211 b
hsa-mir-194 b hsa-mir-181d b
hsa-mir-196b b hsa-mir-215 b
hsa-mir-302b b hsa-mir-302a b
hsa-mir-129 b hsa-mir-28 b
hsa-mir-204 a; b hsa-mir-153 b
hsa-mir-149 b hsa-mir-130b b
hsa-mir-10a b hsa-mir-345 a; b
hsa-mir-320a b hsa-mir-144 b

a: miR2Disease database; b: dbDEMC v2.0 database

Discussion and conclusion

In this paper, we introduced a new model named SCMFMDA, which uses a similarity constrained matrix factorization algorithm to predict possible miRNA-disease associations. To make use of multiple sources of disease similarity data and miRNA similarity data, the similarity network fusion algorithm is used to integrate the various kinds of disease and miRNA biological information, respectively. In addition, L2 regularization terms and similarity constraint terms are added to the standard NMF to predict more unobserved miRNA-disease associations. Under global LOOCV and 5-CV, SCMFMDA achieved AUCs of 0.9675 and 0.9447, respectively, indicating that our model improves on previous models. Furthermore, most of the predicted miRNAs related to colon neoplasms and lung neoplasms were confirmed by the experimental literature, which supports the reliability of the prediction results of our model.

It should be noted that the following factors may contribute to the reliable performance of SCMFMDA. First, the similarity network fusion algorithm was applied to integrate different disease and miRNA similarities, which ensures the richness of the biological data used in the experiments. Second, the L2 regularization terms help to avoid overfitting. Third, the similarity constraint terms, which consist of disease feature-based similarity and miRNA feature-based similarity, make the model more robust to the richness of the data.
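
To make the roles of these terms concrete, the sketch below evaluates one common form of a similarity-constrained factorization loss: a reconstruction error, an L2 penalty on the factors, and a penalty that pulls the latent vectors of similar entities together. This generic form is an assumption made for illustration; the exact objective of SCMFMDA is the one defined in the Methods section and may differ. The names `scmf_style_loss`, `pairwise_penalty`, `lam_reg` and `lam_sim` are hypothetical.

```python
import numpy as np


def pairwise_penalty(F, S):
    """sum_ij S_ij * ||f_i - f_j||^2, computed as 2 * trace(F^T L F).

    L = D - S is the graph Laplacian of a symmetric similarity matrix S.
    """
    L = np.diag(S.sum(axis=1)) - S
    return 2.0 * np.trace(F.T @ L @ F)


def scmf_style_loss(A, X, Y, S_row, S_col, lam_reg, lam_sim):
    """Generic similarity-constrained factorization loss (assumed form)."""
    fit = np.sum((A - X @ Y.T) ** 2)                        # reconstruction error
    l2 = lam_reg * (np.sum(X ** 2) + np.sum(Y ** 2))        # L2 terms limit overfitting
    sim = lam_sim * (pairwise_penalty(X, S_row) + pairwise_penalty(Y, S_col))
    return fit + l2 + sim
```

In this form, `A` is the association matrix, `X` and `Y` hold the latent vectors of its row and column entities, and `S_row` and `S_col` are the corresponding integrated similarity matrices. A larger `lam_sim` forces entities with high similarity to have closer latent vectors.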

However, several limitations may influence the performance of SCMFMDA. First, the model applies only to diseases and miRNAs that appear in the selected dataset and cannot make predictions for other diseases or miRNAs. In addition, we had no principled way to select the most suitable values for some important parameters in SCMFMDA other than testing all combinations. Therefore, we will continue to optimize the model to improve its performance in future work.
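
The exhaustive parameter selection mentioned above amounts to a grid search. The following minimal sketch shows the idea; the parameter names and candidate values in the example are hypothetical, since the actual parameters and ranges of SCMFMDA are not listed in this section, and `evaluate` stands in for any scoring routine such as a cross-validated AUC.

```python
import itertools


def grid_search(evaluate, grid):
    """Try every combination of values in `grid`; return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)       # e.g. a cross-validated AUC
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy example: hypothetical parameters and a stand-in scoring function.
grid = {"rank": [10, 20], "lam": [0.1, 1.0]}
best, score = grid_search(lambda rank, lam: -(rank - 20) ** 2 - (lam - 0.1) ** 2, grid)
```

The cost grows with the product of the grid sizes, which is why exhaustive search becomes expensive as more parameters or finer grids are added.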

Supporting information

S1 Table. Known human miRNA-disease associations obtained from HMDD v2.0 database.

(XLSX)

S2 Table. Names of 383 diseases involved in known human miRNA-disease associations obtained from HMDD v2.0 database.

(XLSX)

S3 Table. Names of 495 miRNAs involved in known human miRNA-disease associations obtained from HMDD v2.0 database.

(XLSX)

S4 Table. The constructed disease functional similarity score matrix.

(XLSX)

S5 Table. The constructed disease semantic similarity score matrix.

(XLSX)

S6 Table. The constructed miRNA functional similarity score matrix.

(XLSX)

S7 Table. The constructed miRNA sequence similarity score matrix.

(XLSX)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

This work was supported by the National Natural Science Foundation of China through grants 61873001 (C.Z., Y.W.), U19A2064 (C.Z.), 61872220 (C.Z.) and 11701318 (Y.W.), the Natural Science Foundation of Shandong Province grant ZR2020KC022 (J.N., Y.W., Z.G., L.L) and the Open Project of Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, No. MMC202006 (Y.W., L.L). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bartel DP. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell. 2004;116(2):281–297. doi: 10.1016/s0092-8674(04)00045-5
  • 2. Chatterjee S, Grosshans H. Active turnover modulates mature microRNA activity in Caenorhabditis elegans. Nature. 2009;461(7263):546–549. doi: 10.1038/nature08349
  • 3. He L, Hannon GJ. MicroRNAs: small RNAs with a big role in gene regulation. Nat Rev Genet. 2004;5(7):522–531. doi: 10.1038/nrg1379
  • 4. Lee RC, Feinbaum RL, Ambros V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993;75(5):843–854. doi: 10.1016/0092-8674(93)90529-y
  • 5. Wightman B, Ha I, Ruvkun G. Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell. 1993;75(5):855–862. doi: 10.1016/0092-8674(93)90530-4
  • 6. Jopling CL, Yi M, Lancaster AM, Lemon SM, Sarnow P. Modulation of Hepatitis C Virus RNA Abundance by a Liver-Specific MicroRNA. Science. 2005;309(5740):1577–1581. doi: 10.1126/science.1113329
  • 7. Xu P, Guo M, Hay BA. MicroRNAs and the regulation of cell death. Trends Genet. 2004;20(12):617–624. doi: 10.1016/j.tig.2004.09.010
  • 8. Bartel DP. MicroRNAs: Target Recognition and Regulatory Functions. Cell. 2009;136(2):215–233. doi: 10.1016/j.cell.2009.01.002
  • 9. Miska EA. How microRNAs control cell division, differentiation and death. Curr Opin Genet Dev. 2005;15(5):563–568. doi: 10.1016/j.gde.2005.08.005
  • 10. Harfe BD. MicroRNAs in vertebrate development. Curr Opin Genet Dev. 2005;15(4):410–415. doi: 10.1016/j.gde.2005.06.012
  • 11. Meola N, Gennarino V, Banfi S. microRNAs and genetic diseases. Pathogenetics. 2009;2(1):7. doi: 10.1186/1755-8417-2-7
  • 12. Yanaihara N, Caplen N, Bowman E, Seike M, Kumamoto K. Unique microRNA molecular profiles in lung cancer diagnosis and prognosis. Cancer Cell. 2006;9(3):189–198. doi: 10.1016/j.ccr.2006.01.025
  • 13. Sita-Lumsden A, Dart DA, Waxman J, Bevan CL. Circulating microRNAs as potential new biomarkers for prostate cancer. Br J Cancer. 2013;108(10):1925–1930. doi: 10.1038/bjc.2013.192
  • 14. Mohammadi-Yeganeh S, Paryan M, Samiee SM, Soleimani M, Arefian E, Azadmanesh K, et al. Development of a robust, low cost stem-loop real-time quantification PCR technique for miRNA expression analysis. Mol Biol Rep. 2013;40(5):3665–3674. doi: 10.1007/s11033-012-2442-x
  • 15. Thomson JM, Parker JS, Hammond SM. Microarray Analysis of miRNA Gene Expression. Methods Enzymol. 2007;427:107–122. doi: 10.1016/S0076-6879(07)27006-5
  • 16. Han K, Xuan P, Ding J, Zhao ZJ, Hui L, Zhong YL. Prediction of disease-related microRNAs by incorporating functional similarity and common association information. Genet Mol Res. 2014;13(1):2009–2019. doi: 10.4238/2014.March.24.5
  • 17. Yu S, Liang C, Xiao Q, Li G, Ding P, Luo J. MCLPMDA: A novel method for miRNA-disease association prediction based on matrix completion and label propagation. J Cell Mol Med. 2019;23(2):1427–1438. doi: 10.1111/jcmm.14048
  • 18. Chen X, Gong Y, Zhang D, You Z, Li Z. DRMDA: deep representations–based miRNA–disease association prediction. J Cell Mol Med. 2018;22(1):472–485. doi: 10.1111/jcmm.13336
  • 19. Jiang Q, Hao Y, Wang G, Juan L, Wang Y. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst Biol. 2010;4(SUPPL. 1):S2. doi: 10.1186/1752-0509-4-S1-S2
  • 20. Chen X, Yan C, Zhang X, You Z, Deng L, Liu Y, et al. WBSMDA: Within and Between Score for MiRNA-Disease Association prediction. Sci Rep. 2016;6:21106. doi: 10.1038/srep21106
  • 21. Shen Z, Zhang YH, Han K, Nandi AK, Honig B, Huang DS. miRNA-Disease Association Prediction with Collaborative Matrix Factorization. Complexity. 2017;2017:1–9.
  • 22. Chen X, Huang L, Xie D, Zhao Q. EGBMMDA: Extreme Gradient Boosting Machine for MiRNA-Disease Association prediction. Cell Death Dis. 2018;9(1):3. doi: 10.1038/s41419-017-0003-x
  • 23. Zhao Y, Chen X, Yin J. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics. 2019;35(22):4730–4738. doi: 10.1093/bioinformatics/btz297
  • 24. Zhu XY, Wang XZ, Zhao HC, Pei TR, Kuang LN, Wang L. BHCMDA: A New Biased Conduction Based Method for Potential MiRNA-Disease Association Prediction. Front Genet. 2020;11(1):384. doi: 10.3389/fgene.2020.00384
  • 25. Ha J, Park C, Park C, Park S. IMIPMF: Inferring miRNA-disease interactions using probabilistic matrix factorization. J Biomed Inform. 2020;102:103358. doi: 10.1016/j.jbi.2019.103358
  • 26. Chen X, Liu M, Yan G. RWRMDA: predicting novel human microRNA-disease associations. Mol Biosyst. 2012;8(10):2792–2798. doi: 10.1039/c2mb25180a
  • 27. Köhler S, Bauer S, Horn D, Robinson PN. Walking the Interactome for Prioritization of Candidate Disease Genes. Am J Hum Genet. 2008;82(4):949–958. doi: 10.1016/j.ajhg.2008.02.013
  • 28. Zhang H, Cao L, Gao S. A locality correlation preserving support vector machine. Pattern Recognition. 2014;47(9):3168–3178.
  • 29. Shi H, Xu J, Zhang G, Xu L, Xia L. Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes. BMC Syst Biol. 2013;7:101. doi: 10.1186/1752-0509-7-101
  • 30. Liu Y, Zeng X, He Z, Zou Q. Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(4):905–915. doi: 10.1109/TCBB.2016.2550432
  • 31. Luo J, Xiao Q. A novel approach for predicting microRNA-disease associations by unbalanced bi-random walk on heterogeneous network. J Biomed Inform. 2017;66:194–203. doi: 10.1016/j.jbi.2017.01.008
  • 32. Niu Y, Wang G, Yan G, Chen X. Integrating random walk and binary regression to identify novel miRNA-disease association. BMC Bioinformatics. 2019;20:59. doi: 10.1186/s12859-019-2640-9
  • 33. Chen X, Yan CC, Zhang X, Li Z, Deng L, Zhang Y, et al. RBMMMDA: predicting multiple types of disease-microRNA associations. Sci Rep. 2015;5(1):13877. doi: 10.1038/srep13877
  • 34. You Z, Huang Z, Zhu Z, Yan G, Chen X. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol. 2017;13(1):e1005455. doi: 10.1371/journal.pcbi.1005455
  • 35. Yan C, Wang JX, Ni P, Lan W, Wu FX, Pan Y. DNRLMF-MDA: Predicting microRNA-Disease Associations Based on Similarities of microRNAs and Diseases. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(1):233–243. doi: 10.1109/TCBB.2017.2776101
  • 36. Peng J, Hui W, Li Q, Chen B, Hao J, Jiang Q, et al. A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics. 2019;35(21):4364–4371. doi: 10.1093/bioinformatics/btz254
  • 37. Zheng K, You ZH, Wang L, Zhou Y, Li LP, Li ZW. MLMDA: a machine learning approach to predict and validate MicroRNA-disease associations by integrating of heterogeneous information source. J Transl Med. 2019;17(1):260. doi: 10.1186/s12967-019-2009-x
  • 38. Chen X, Sun LG, Zhao Y. NCMCMDA: miRNA-disease association prediction through neighborhood constraint matrix completion. Brief Bioinform. 2021;22(1):485–496. doi: 10.1093/bib/bbz159
  • 39. Zhang Y, Chen M, Cheng X, Wei H. MSFSP: A Novel miRNA-Disease Association Prediction Model by Federating Multiple-Similarities Fusion and Space Projection. Front Genet. 2020;11:389. doi: 10.3389/fgene.2020.00389
  • 40. Ji C, Wang Y, Gao Z, Li L, Zheng C. A Semi-Supervised Learning Method for MiRNA-Disease Association Prediction Based on Variational Autoencoder. IEEE/ACM Trans Comput Biol Bioinform. 2021;1(1):99. doi: 10.1109/TCBB.2021.3067338
  • 41. Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2013;42(Database issue):D1070–D1074. doi: 10.1093/nar/gkt1023
  • 42. Jiang Q, Wang Y, Hao Y, Liran J, Teng M, Zhang X, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 2009;37(Database issue):D98–D104. doi: 10.1093/nar/gkn714
  • 43. Yang Z, Wu L, Wang A, Tang W, Zhao Y, Zhao H, et al. dbDEMC 2.0: Updated database of differentially expressed miRNAs in human cancers. Nucleic Acids Res. 2016;45(D1):D812–D818. doi: 10.1093/nar/gkw1079
  • 44. Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 2011;21(7):1109–1121. doi: 10.1101/gr.118992.110
  • 45. Cheng L, Wang G, Li J, Zhang T, Xu P, Wang Y, et al. SIDD: A Semantically Integrated Database towards a Global View of Human Disease. PLoS One. 2013;8(10):e75504. doi: 10.1371/journal.pone.0075504
  • 46. Lipscomb CE. Medical Subject Headings (MeSH). Bull Med Libr Assoc. 2000;88(3):265–266.
  • 47. Wang D, Wang JY, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–1650. doi: 10.1093/bioinformatics/btq241
  • 48. Xuan P, Han K, Guo MZ, Guo YH, Li JB, Ding J, et al. Correction: Prediction of microRNAs Associated with Human Diseases Based on Weighted k Most Similar Neighbors. PLoS One. 2013;8(9):e70204. doi: 10.1371/annotation/a076115e-dd8c-4da7-989d-c1174a8cd31e
  • 49. Goh KI, Cusick ME, Valle D, Childs B, Barabási AL. The human disease network. Proc Natl Acad Sci U S A. 2007;104(27):8685–8690. doi: 10.1073/pnas.0701361104
  • 50. Lu M, Zhang Q, Min D, Jing M, Guo Y, Guo W, et al. An Analysis of Human MicroRNA and Disease Associations. PLoS One. 2008;3(10):e3420. doi: 10.1371/journal.pone.0003420
  • 51. Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 2013;42(D1):D68–D73. doi: 10.1093/nar/gkt1181
  • 52. Wang B, Mezlini AM, Demir F, Fiume M, Tu ZW, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–337. doi: 10.1038/nmeth.2810
  • 53. Zhang W, Liu XR, Chen YL, Wu WJ, Wang W, Li XH. Feature-derived graph regularized matrix factorization for predicting drug side effects. Neurocomputing. 2018;287:154–162.
  • 54. Rana B, Juneja A, Saxena M, Gudwani S, Kumaran SS, Behari M, et al. Graph Theory based Spectral Feature Selection for Computer Aided Diagnosis of Parkinson’s Disease Using T1-weighted MRI. International Journal of Imaging Systems and Technology. 2015;25(3):245–255.
  • 55. Wu Q, Wang Y, Gao Z, Ni J, Zheng C. MSCHLMDA: Multi-Similarity Based Combinative Hypergraph Learning for Predicting MiRNA-Disease Association. Front Genet. 2020;11:354. doi: 10.3389/fgene.2020.00354
  • 56. Jiang Y, Liu B, Yu L, Yan C, Bian H. Predict MiRNA-Disease Association with Collaborative Filtering. Neuroinformatics. 2018;16:363–372. doi: 10.1007/s12021-018-9386-9
  • 57. Shao B, Liu B, Yan C. SACMDA: MiRNA-Disease Association Prediction with Short Acyclic Connections in Heterogeneous Graph. Neuroinformatics. 2018;16:373–382. doi: 10.1007/s12021-018-9373-1
  • 58. Xiao Q, Luo J, Liang C, Cai J, Ding P. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2018;34(2):239–248. doi: 10.1093/bioinformatics/btx545
  • 59. Gao Z, Wang Y, Wu Q, Ni J, Zheng C. Graph regularized L2,1-nonnegative matrix factorization for miRNA-disease association prediction. BMC Bioinformatics. 2020;21:61. doi: 10.1186/s12859-020-3409-x
  • 60. Gao Y, Cui Z, Liu J, Wang J, Zheng C. NPCMF: Nearest Profile-based Collaborative Matrix Factorization method for predicting miRNA-disease associations. BMC Bioinformatics. 2019;20(1):353. doi: 10.1186/s12859-019-2956-5
  • 61. Chen X, Li S, Yin J, Wang C. Potential miRNA-disease association prediction based on kernelized Bayesian matrix factorization. Genomics. 2020;112(1):809–819. doi: 10.1016/j.ygeno.2019.05.021
  • 62. Torre LA, Bray F, Siegel RL, Tieulent JL, Jemal A. Global cancer statistics, 2012. CA Cancer J Clin. 2015;65(2):87–108. doi: 10.3322/caac.21262
  • 63. Hiroko OK, Masashi I, Daisuke K, Yoshitaka H, Yasuhide Y, Koh F, et al. Circulating Exosomal microRNAs as Biomarkers of Colon Cancer. PLoS One. 2014;9(4):e92921. doi: 10.1371/journal.pone.0092921
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009165.r001

Decision Letter 0

Ilya Ioshikhes, Quan Zou

29 Apr 2021

Dear Mr Li,

Thank you very much for submitting your manuscript "SCMFMDA: Predicting microRNA-disease Associations based on Similarity Constrained Matrix Factorization" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Please carefully revise your paper according to the reviews.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Quan Zou

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this study, Li et al. proposed a computational model to predict miRNA-disease associations based on similarity constrained matrix factorization. Their method, SCMFMDA, achieved better performance than existing ones and accurately predicted relations between miRNAs and diseases in the subsequent case studies. The model seems valid and the results shown are promising. However, the following concerns should be addressed before the manuscript is considered for publication.

- There are three major issues for model validation. First, the authors should provide more details about the dataset they used for validation. They only mentioned: “Based on the verified association of miRNA-disease in HMDD V2.0 database”, which is not enough for readers to fully understand the dataset. For example, how many verified associations are included in the dataset? How many unknown cases are in the dataset? Did they perform any filtering of the dataset? Second, the authors should summarize the existing methods they compare against and show, in a table, information including the publication time, the mathematical model used, the dataset used for training, etc. The authors should also provide details about how they compared these methods. To my understanding, it is better to compare different methods on an independent test set instead of on their own training set. Different methods can be trained on different datasets, so it is unclear how the authors compared these methods using cross validation. Third, the authors should provide the dataset they used as well as the source code to run their own and others’ models, so that others can reproduce the results and confirm their correctness.

- As for parameter optimization, I’m not sure why they first fix two parameters to optimize the third and then fix the third to optimize the other two. They also don’t mention the exact values of the two parameters used while optimizing the third. Were the values just picked at random? A more straightforward way is to test all 490 possible combinations of the three parameters and select the combination with the best performance. Is this strategy very time-consuming? Figure 5a is not very professional; the dots representing the exact values should be plotted on the figure with a smoother curve to show the trend. The first two values on the figure are odd: why is there no value for 0 but two values for 10%? A color bar should be added for the heatmap in Figure 5b. The grey line surrounding the heatmap can be removed.

- For the case studies, more detailed statistics of the performance could be shown in addition to the three tables. For example, for each disease, what percentage of miRNAs in the top 50 have evidence in HMDD v3.2, dbDEMC v2.0, or both? How many have no evidence in either database? Also, what are the differences between HMDD v3.2 and v2.0, which is used for model training? Are there any overlaps? If so, the authors should exclude the associations involved in model training.

- The organization of the “Results” part is not quite reasonable. Logically, the parameter optimization should be in front of model comparison. Model comparison can be separated to non-ML methods and ML methods, but I would suggest putting them together. The last part can be the case studies to further confirm the prediction value of the method. In this way, the reader can better follow the logic of the results.

Reviewer #2: This paper proposed a new approach for miRNA-disease associations prediction. Experimental results indicated that this approach can effectively and efficiently predict miRNA-disease associations. This manuscript is well-written, which makes it easy to understand, although there are some minor issues to be addressed. I have the following specific comments.

1. There are some typos and grammatical errors throughout the paper. The authors should recheck the whole paper carefully and correct these problems.

2. There is an error in Equation (26) and Equation (32) in the METHODS section (D. Optimization algorithm). A(:,j) should be changed to A(:,j)^T. The authors should correct this error.

3. The authors ranked the miRNAs associated with diseases in the RESULTS section (C. Case studies). In my opinion, the authors should explain specifically how the ranking of these miRNAs is obtained.

4. The authors compared this model with other MF-based models in the RESULTS section (D. Comparison with MF-based Models). Many matrix factorization-based models have recently been developed for miRNA-disease prediction. I think you should compare your model with them.

5. In your paper, “Gaussian interaction profile kernel similarity” is abbreviated as “GIP kernel similarity”, but “GIP similarity” is used in a few places. The authors should make the terminology consistent.

6. Many important computational models for miRNA-disease association prediction published in top journals such as Bioinformatics and IEEE/ACM Transactions on Computational Biology and Bioinformatics should be discussed and cited. This research field has made much progress in recent years. The authors should mention more recent computational studies, for example:

Y. Zhao, X. Chen, and J. Yin, “Adaptive boosting-based computational model for predicting potential miRNA-disease associations,” Bioinformatics, vol. 35, no. 22, pp. 4730-4738, Apr. 2019.

C. Ji, Y. Wang, Z. Gao and L. Li, “A Semi-Supervised Learning Method for MiRNA-Disease Association Prediction Based on Variational Autoencoder,” vol.1, no. 1, p.99, Mar. 2021.

Reviewer #3: In this paper, the authors use similarity network fusion and similarity constrained matrix factorization for miRNA-disease association prediction (SCMFMDA), which completes the missing association scores between miRNAs and diseases. The reported results demonstrate that SCMFMDA is effective for predicting possible miRNA-disease associations. However, some detailed problems remain.

1. The authors should revise the English writing carefully and eliminate small errors to make the paper easier to understand.

2. The authors should explain in detail the function and impact of the parameter ε in Equation (24).

3. The AUC values of global LOOCV of SCMFMDA, MSCHLMDA, GRL-NMF, IMCMDA, ICFMDA and SACMDA were introduced in the RESULTS section (A. Model validation), but the corresponding values in Fig. 3 look different.

4. There is a labeling error in Table III. The authors use ‘a’ and ‘b’ to denote ‘HMDD v3.2 database’ and ‘dbDEMC v2.0 database’ in Table III, respectively, but use ‘H’ and ‘d’ to denote them in the note under Table III.

5. The authors should pay attention to formatting errors, which should be further adjusted. For example, some fonts in Fig. 2 are too small.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No: I don't see the link to the source code and dataset they used to train and compare their model with others'

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009165.r003

Decision Letter 1

Ilya Ioshikhes, Quan Zou

1 Jun 2021

Dear Mr Li,

Thank you very much for submitting your manuscript "SCMFMDA: Predicting microRNA-disease Associations based on Similarity Constrained Matrix Factorization" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please revise quickly

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Quan Zou

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have made substantial efforts to address the reviewers' concerns, and the revised manuscript has improved in many respects. I have one more question: in the data availability section, the authors state "All relevant data are within the manuscript and its Supporting Information files", yet I could not find the code and datasets they used in their supporting information files, nor any link to a public website for data sharing in the manuscript. Did I miss it, or was it not provided?

Reviewer #2: All my concerns have been solved.

Reviewer #3: The question of the manuscript has been answered correctly. The paper is acceptable.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No: I haven't found the code and datasets they used in their supporting information files, or any link to a public website for data sharing in their manuscript.

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009165.r005

Decision Letter 2

Ilya Ioshikhes, Quan Zou

8 Jun 2021

Dear Mr Li,

We are pleased to inform you that your manuscript 'SCMFMDA: Predicting microRNA-disease Associations based on Similarity Constrained Matrix Factorization' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Quan Zou

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009165.r006

Acceptance letter

Ilya Ioshikhes, Quan Zou

5 Jul 2021

PCOMPBIOL-D-21-00629R2

SCMFMDA: Predicting microRNA-Disease Associations based on Similarity Constrained Matrix Factorization

Dear Dr Li,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Kata Acsay

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Known human miRNA-disease associations obtained from HMDD v2.0 database.

    (XLSX)

    S2 Table. Names of 383 diseases involved in known human miRNA-disease associations obtained from HMDD v2.0 database.

    (XLSX)

    S3 Table. Names of 495 miRNAs involved in known human miRNA-disease associations obtained from HMDD v2.0 database.

    (XLSX)

    S4 Table. The constructed disease functional similarity score matrix.

    (XLSX)

    S5 Table. The constructed disease semantic similarity score matrix.

    (XLSX)

    S6 Table. The constructed miRNA functional similarity score matrix.

    (XLSX)

    S7 Table. The constructed miRNA sequence similarity score matrix.

    (XLSX)

    Attachment

    Submitted filename: Summary of changes .docx

    Attachment

    Submitted filename: Summary of changes.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
