Abstract
Identifying new disease indications for existing drugs can facilitate drug development and reduce development costs. Previous drug–disease association prediction methods focused on integrating data about drugs and diseases from multiple sources; however, they did not deeply integrate the neighbor topological information of drug and disease nodes from various meta-path perspectives. We propose a prediction method called NAPred to encode and integrate meta-path-level neighbor topologies, multiple kinds of drug attributes, and drug-related and disease-related similarities and associations. The multiple kinds of similarities between drugs reflect the degree of similarity between two drugs from different perspectives; therefore, we constructed three drug–disease heterogeneous networks, one for each kind of drug similarity. A learning framework based on fully connected neural networks and a convolutional neural network with attention mechanisms is proposed to learn information about the neighbor nodes of a pair of drug and disease nodes. Multiple neighbor sets composed of different kinds of nodes were formed based on meta-paths with different semantics and different scales. We established attention mechanisms at the neighbor-scale level and at the neighbor-topology level to learn enhanced neighbor feature representations and enhanced neighbor topological representations. A convolutional-autoencoder-based module is proposed to encode the attributes of a drug–disease pair in the three heterogeneous networks. Extensive experimental results indicated that NAPred outperformed several state-of-the-art methods for drug–disease association prediction, and the improved recall rates demonstrated that NAPred was able to retrieve more actual drug–disease associations among the top-ranked candidates. Case studies on five drugs further demonstrated the ability of NAPred to identify potential drug-related disease candidates.
Keywords: drug–disease association prediction, neighbor topology learning based on meta-paths, pairwise node attribute encoding, multiple drug–disease heterogeneous networks, fully connected neural networks and autoencoder based on CNN
1. Introduction
The process of producing a new medicine is typically lengthy, expensive, and fraught with failure; it may require more than 10 years and cost between USD 0.8 billion and USD 1.5 billion on average [1,2,3,4,5]. Therefore, approaches that reduce the time and funding required to develop new medicines are needed. Because approved drugs have already passed clinical trials, they have favorable safety profiles. In contrast to developing a medicine from scratch, identifying new indications for existing drugs (drug repositioning) [6] can effectively reduce research and development costs and accelerate drug development [7,8,9].
Computational predictions of the relationships between approved drugs and diseases can be used to screen drug candidates for further wet laboratory validation [10,11]. The reported approaches for predicting drug-related diseases can be classified into two categories. The first category predicts the disease indications of drugs by integrating multiple kinds of information about drugs and diseases. Some methods integrate the known drug–disease associations, drug similarities, and disease similarities [12,13] and estimate the association possibilities between drugs and diseases using a logistic regression classifier or matrix decomposition with a similarity constraint. Wang et al. employed kernel functions to incorporate drug and disease similarities and applied a support vector machine to forecast drug–disease correlations [14]. Liang et al. applied sparse subspace learning with graph Laplacian regularization to combine multiple types of drug characteristics for predicting drug indications [15]. These strategies utilize or combine relevant data on drugs and diseases to infer drug–disease associations; however, they do not consider the topological information in a network that can indicate the potential indications of a specific drug.
The second category of methods performs prediction primarily based on network topology. For example, heterogeneous network models based on diseases, drugs, and targets have been used to infer drug candidates with iterative algorithms [16]. In several methods, random walk algorithms are employed to predict possible drug–disease associations; they have been applied to drug similarity networks, disease similarity networks, and integrated drug–disease heterogeneous networks [17,18,19,20,21]. However, because these methods do not consider the attribute information of the drug and disease nodes, they cannot learn deep feature representations of the nodes. Furthermore, these shallow-model-based approaches cannot extract the potentially complicated relationships between drug and disease nodes.
Deep learning technologies have been widely utilized for the prediction of miRNA–disease associations [22] and disease-related lncRNAs [23,24]. Owing to the development of deep learning, recent approaches identify the indications of drug candidates more accurately by integrating multiple sources of drug- and disease-relevant information. For the prediction of drug-related diseases, models employing graph convolutional and fully connected autoencoders with attention mechanisms have been used [25]. Xuan et al. [26] proposed a prediction model comprising a convolutional neural network (CNN) and a bi-directional long short-term memory (BiLSTM) network. Jiang et al. devised a model for forecasting drug–disease correlations by employing Gaussian interaction profile kernels and autoencoders [27]. Deep relationships between drugs and diseases can be extracted more easily using deep learning models. At the node pair level, however, the existing deep learning approaches cannot combine and incorporate the drug–disease neighbor topology and attribute information. In addition, when capturing the neighbor topology information in three heterogeneous networks, the sets of neighbor nodes obtained via multi-scale meta-paths provide important auxiliary information.
Herein, we propose and develop NAPred, a predictive model for capturing, encoding, and learning the neighbor topology and attribute representation of node pairs from diverse heterogeneous networks. The primary contributions of our proposed model are as follows:
Three drug–disease heterogeneous networks were constructed, each based on a different kind of drug similarity, to facilitate the acquisition of topological information about drug and disease nodes from different perspectives. To construct the sets of different types of neighbors of the nodes, multi-scale meta-path sets of the drug and disease nodes were established;
We present an approach based on fully connected and convolutional neural networks with attention mechanisms for learning topological information regarding the same type of neighbors for drug and disease nodes. Multiple-neighbor feature representations extracted from drug and disease nodes were adaptively combined via a neighbor-scale-level attention mechanism;
We developed a neighbor-topology-level attention mechanism to distinguish the contributions and then obtain the neighbor topological representations of the nodes; this is because different types of neighbor topological features contribute differently to drug–disease association prediction;
The attribute information of the node pairs was extracted from the three heterogeneous networks using the proposed embedding mechanism and encoded using a convolutional autoencoder (CAE). The premise of this embedding mechanism is that drug–disease pairs are more likely to be associated with each other if they exhibit similarities or associations with more typical drugs or diseases.
2. Experimental Results and Discussion
2.1. Evaluation Metrics
The performances of all prediction models were analyzed and compared using five-fold cross-validation. Positive samples were drug–disease pairs with known associations, and negative samples were pairs with unknown associations. In each fold of the cross-validation, 4/5 of the positive samples and 4/5 of the randomly selected negative samples formed the training set, and the remaining 1/5 of the positive samples, together with all the negative samples, were used for testing. The prediction scores of the test samples were generated and ranked; the higher the positive samples ranked, the better the prediction performance.
Several evaluation metrics were used in this study, i.e., the true positive rate (TPR), false positive rate (FPR), receiver operating characteristic (ROC) curve, area under the ROC curve (AUC) [28], precision–recall (PR) curve, area under the PR (AUPR) curve [29], and recall at various top-k. The performances of all models in the cross-validation were compared based on the average AUC and AUPR.
The AUC is a widely accepted metric for comparing algorithms and probabilistic estimates [30]. The TPR and FPR at various thresholds yield the ROC curve. A sample was regarded as positive if the predicted association score of its drug–disease pair exceeded a given threshold; otherwise, it was regarded as negative. The TPR (FPR) is the fraction of positive (negative) samples classified as positive among all the positive (negative) samples:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \tag{1}$$

where TP (FN) represents the number of positive samples correctly (incorrectly) classified as positive (negative), and TN (FP) indicates the number of negative samples correctly (incorrectly) categorized as negative (positive) [31,32].
Because the distribution of positive and negative drug–disease candidates is highly uneven, the PR curve and its area (AUPR) provide more information than the AUC for assessing the predictive performance [29]. Precision and recall were determined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

where precision indicates the proportion of true positives among the samples predicted to be positive, and recall expresses the proportion of positive samples correctly recognized among all the positive samples. The AUC and AUPR were calculated for each fold of the cross-validation [33], and the final score is the average of the five results.
Considering that biologists typically choose the top-ranked candidates and confirm them through wet laboratory trials, retrieving the actual drug–disease associations among the top-ranked candidates is critical. Therefore, the recall rates of the top-k candidate drug–disease pairs were evaluated for the predicted results. A higher top-k recall indicates a more reliable prediction performance.
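As an illustration of how these per-drug metrics can be computed, the following sketch uses scikit-learn to obtain the AUC, AUPR, and top-k recall for a single drug; the variable names (`scores`, `labels`) and the averaging over all drugs are assumptions for illustration, not the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def per_drug_metrics(scores, labels, top_ks=(30, 60, 90, 120, 150)):
    """Compute AUC, AUPR, and top-k recall for one drug.

    scores: predicted association scores for all candidate diseases of the drug
    labels: 1 for known drug-disease associations, 0 otherwise (numpy arrays)
    """
    auc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)   # area under the PR curve
    order = np.argsort(-scores)                      # rank candidates by descending score
    recalls = {k: labels[order[:k]].sum() / labels.sum() for k in top_ks}
    return auc, aupr, recalls

# Example: average the per-drug results, as done for the 763 drugs
# aucs = [per_drug_metrics(s, y)[0] for s, y in zip(all_scores, all_labels)]
# mean_auc = np.mean(aucs)
```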
2.2. Comparison with Other Methods
We compared NAPred with six state-of-the-art drug–disease association prediction models: GFPred [25], CBPred [26], SCMFDD [13], LRSSL [15], MBiRW [18], and HGBI [16]. In the cross-validation, the six other methods were trained and tested on the same dataset as NAPred, and each method was run with the optimal hyperparameter settings reported for it, under which it achieved its best performance.
For each of the 763 drugs, we calculated the AUC and AUPR at each fold before computing their five-fold means. The final results were averaged across all AUCs (or AUPRs) for the 763 drugs. As shown in Figure 1A, in the comparison over the 763 drugs, NAPred achieved the best mean AUC among all the methods investigated (AUC = 0.978), outperforming GFPred by 3.3%, CBPred by 5.2%, SCMFDD by 25.5%, LRSSL by 14.7%, MBiRW by 15%, and HGBI by 27.6%. The second-best model, GFPred, successfully learned multiple attribute representations of nodes and fully extracted topological information from multiple heterogeneous networks. This suggests that constructing heterogeneous networks based on multiple drug similarities and capturing their topological information improves the prediction accuracy. CBPred, LRSSL, and MBiRW extract topology information from heterogeneous networks for drug repositioning; CBPred considers the path information between drug and disease node pairs, whereas MBiRW disregards the attributes of the nodes. Hence, CBPred performed better than LRSSL, whereas MBiRW performed worse. SCMFDD is a matrix-decomposition-based model; its dimensionality reduction may cause the loss of low-frequency valid information, and it does not exploit the multiple similarities of the drugs. Therefore, SCMFDD performed worse than the above methods but better than HGBI. In conclusion, NAPred achieved the best results owing to its comprehensive learning of the neighbor topology as well as the attribute information of the drug–disease pairs.
Figure 1.
ROC and PR curves of all the methods for drug–disease association prediction.
As shown in Figure 1B, our method NAPred performed better than GFPred, CBPred, LRSSL, MBiRW, SCMFDD, and HGBI by 14.8%, 22.8%, 28.4%, 34.6%, 37.8%, and 37.9%, respectively, based on the AUPR curves of 763 drugs.
In addition, to validate the robustness of our model on multiple datasets, we used the CC dataset [34] to replace the drug-related data and ran another instance of our method on it. We utilized the A (chemistry), B (targets), and C (networks) data of the CC dataset to replace the original chemical substructure, protein structural domain, and gene ontology data of the drugs. As shown in Figure 1, the AUC and AUPR of this instance were still higher than those of the compared methods. The experimental results demonstrated the good robustness of our model.
To evaluate the impact of the number of cross-validation folds on the performance of NAPred, we also performed ten-fold cross-validation, in which the number of training samples was larger than in the five-fold cross-validation. As shown in Supplementary Table S1, the AUC and AUPR for the ten-fold cross-validation were 0.8% and 1.3% higher than those for the five-fold cross-validation, indicating that NAPred achieved better performance when the training data were increased.
A paired Wilcoxon test over the results for the 763 drugs was used to assess whether the improvements of NAPred were statistically significant. At a p-value threshold of 0.05, NAPred performed significantly better than the other approaches in terms of both the AUC and AUPR (Table 1).
Table 1.
The statistical results of the paired Wilcoxon test on the AUCs and AUPRs over all 763 drugs, comparing NAPred with the six other methods.
| | GFPred | CBPred | SCMFDD | LRSSL | MBiRW | HGBI |
|---|---|---|---|---|---|---|
| p-value of AUC | 5.27051 × 10 | 1.83480 × 10 | 5.49787 × 10 | 5.31080 × 10 | 2.89205 × 10 | 1.74747 × 10 |
| p-value of AUPR | 3.42304 × 10 | 4.72506 × 10 | 1.81013 × 10 | 8.63715 × 10 | 4.68094 × 10 | 4.85712 × 10 |
Figure 2 shows the average recall rates over all the drugs for various top-k values; a higher recall rate means that more actual drug–disease associations were successfully identified. The average recall rates of NAPred over the 763 drugs were 86.14%, 89.19%, 93.24%, 95.54%, and 97.33% for the top-30, -60, -90, -120, and -150 candidates, respectively. Among the top-30, -90, -150, and -210, GFPred showed the second-highest recall rates, with 81.03%, 90.20%, 94.64%, and 97.12%, respectively. CBPred obtained recall rates of 68.63%, 82.41%, 90.69%, and 94.17% in the top-30, -90, -150, and -210, respectively, performing slightly worse than GFPred. LRSSL demonstrated a higher recall than MBiRW for the top-30, -60, and -90: the former achieved recall rates of 66.12%, 70.73%, and 74.90%, whereas the latter obtained 57.65%, 65.30%, and 73.71%, respectively. The recall of SCMFDD was 32.97%, 51.18%, 59.75%, and 66.13% when k was 30, 90, 150, and 210, respectively. HGBI had a slightly lower recall rate than SCMFDD, i.e., 30.62%, 46.10%, 56.34%, and 63.98% for the top-30, -90, -150, and -210, respectively.
Figure 2.
The average recalls of all the drugs under different top-k.
2.3. Case Studies of Five Drugs
Case studies of ampicillin, ceftriaxone, doxorubicin, erythromycin, and itraconazole were conducted to further illustrate the efficacy of NAPred in drug–disease association prediction. The candidate diseases of each drug were ranked by their association prediction scores in descending order, and the top-ten candidates for each of the five drugs are listed in Table 2.
Table 2.
The top-10 candidate diseases of 5 drugs.
| Drug Name | Rank | Disease Name | Description | Rank | Disease Name | Description |
|---|---|---|---|---|---|---|
| Ampicillin | 1 | Staphylococcal Infections | CTD, PubChem | 6 | Staphylococcal Skin Infections | PubChem |
| | 2 | Pneumonia, Bacterial | ClinicalTrials | 7 | Streptococcal Infections | CTD, ClinicalTrials |
| | 3 | Urinary Tract Infections | CTD, DrugBank, PubChem | 8 | Osteomyelitis | PubChem, ClinicalTrials |
| | 4 | Wound Infection | PubChem, ClinicalTrials | 9 | Postoperative Complications | PubChem |
| | 5 | Proteus Infections | Inferred Candidate by 2 Literature Works | 10 | Bacterial Infections | CTD, DrugBank, ClinicalTrials |
| Ceftriaxone | 1 | Escherichia coli Infections | CTD, PubChem, ClinicalTrials | 6 | Salmonella Infections | DrugBank, PubChem, ClinicalTrials |
| | 2 | Urinary Tract Infections | DrugBank, PubChem, ClinicalTrials | 7 | Enterobacteriaceae Infections | PubChem, ClinicalTrials |
| | 3 | Haemophilus Infections | PubChem | 8 | Septicemia | DrugBank, PubChem, ClinicalTrials |
| | 4 | Gonorrhea | DrugBank, PubChem, ClinicalTrials | 9 | Endocarditis, Bacterial | DrugBank, ClinicalTrials |
| | 5 | Gram-Negative Bacterial Infections | Inferred Candidate by 1 Literature Work | 10 | Pseudomonas Infections | PubChem |
| Doxorubicin | 1 | Urinary Tract Infections | CTD, PubChem | 6 | Leukemia, Lymphoid | CTD, DrugBank, ClinicalTrials |
| | 2 | Leukemia, Myeloid, Acute | CTD, DrugBank, ClinicalTrials | 7 | Bronchitis | CTD |
| | 3 | Escherichia coli Infections | CTD | 8 | Sarcoma | CTD, DrugBank, ClinicalTrials |
| | 4 | Neoplasms | ClinicalTrials, PubChem | 9 | Gonorrhea | Unconfirmed |
| | 5 | Staphylococcal Infections | CTD, PubChem | 10 | Precursor Cell Lymphoblastic Leukemia-Lymphoma | CTD |
| Erythromycin | 1 | Gonorrhea | DrugBank, PubChem | 6 | Gram-Positive Bacterial Infections | PubChem |
| | 2 | Gram-Negative Bacterial Infections | PubChem | 7 | Staphylococcal Infections | CTD, DrugBank, PubChem |
| | 3 | Chancroid | DrugBank, PubChem | 8 | Pneumonia, Mycoplasma | Unconfirmed |
| | 4 | Bacterial Infections | DrugBank, PubChem | 9 | Neurosyphilis | PubChem |
| | 5 | Neisseriaceae Infections | DrugBank | 10 | Chlamydiaceae Infections | DrugBank, ClinicalTrials |
| Itraconazole | 1 | Candidiasis, Cutaneous | DrugBank, PubChem, ClinicalTrials | 6 | Tinea Capitis | DrugBank, PubChem |
| | 2 | Tinea Versicolor | DrugBank, PubChem, ClinicalTrials | 7 | Fungemia | DrugBank, PubChem, ClinicalTrials |
| | 3 | Tinea Pedis | DrugBank, PubChem | 8 | Skin Diseases, Infectious | PubChem, ClinicalTrials |
| | 4 | Leishmaniasis | CTD, PubChem, ClinicalTrials | 9 | AIDS-Related Opportunistic Infections | ClinicalTrials |
| | 5 | Chromoblastomycosis | DrugBank, PubChem | 10 | Candidiasis | CTD, DrugBank, PubChem |
The Comparative Toxicogenomics Database (CTD), which is painstakingly curated and validated from the literature, contains information regarding drugs and their effects on human health [35]. DrugBank is a database containing drug-related targets, mechanisms of action, interactions, and integrated molecular information [36]. A total of 16 candidate diseases are covered by CTD, and 23 candidates are recorded in DrugBank, indicating that these candidate diseases can indeed be treated by the corresponding drugs.
ClinicalTrials.gov, the world's largest searchable clinical trial database, contains data pertaining to clinical studies conducted worldwide and is maintained with contributions from the United States National Library of Medicine. As supporting evidence, we only used trial records with a "completed" status. PubChem is a public database sponsored by the National Institutes of Health that includes information regarding chemicals and their biological activity, safety, and toxicity [37]. There were 23 candidate diseases supported by ClinicalTrials.gov, whereas 33 candidates were supported by PubChem. These records indicate that clinical trials established an association between the candidate diseases and the relevant drugs.
Besides manually validated drug–disease associations, CTD additionally includes associations inferred from the literature that have not yet been verified. Two candidates are contained in the inferred section of CTD, which suggests a plausible correlation between these diseases and their corresponding drugs. Among all 50 candidates, only two were labeled as "unconfirmed".
In addition, we conducted case studies on another five drugs (betamethasone, acetaminophen, etoposide, flurbiprofen, and verapamil) and list their top-ten candidate diseases in Supplementary Table S2. There were 42 candidate diseases recorded by CTD, 29 and 42 candidates covered by DrugBank and PubChem, respectively, and 20 candidate diseases contained in ClinicalTrials, indicating that these candidates are more likely to be associated with the corresponding drugs. Only one candidate was labeled as "unconfirmed". All the above analyses indicate that NAPred is able to discover potential candidate drug–disease associations.
2.4. Prediction of Novel Drug-Related Diseases
Finally, we applied the trained NAPred to 763 drugs to predict candidate diseases. The top-30 drug-related candidate diseases selected by our model are listed in Supplementary Table S3. They can be used by biologists to facilitate further wet experiments for validation.
3. Materials and Methods
Figure 3 shows our proposed predictive model for drug-related disease candidates; the model comprises two branches. Three drug–disease heterogeneous networks were first established based on the drug similarities calculated from different perspectives, the disease similarities, and the known drug–disease associations. In the first branch, we obtained the sets of neighbor nodes of drugs and diseases based on meta-paths of different scales. Neighbor-scale-level and neighbor-topology-level attention mechanisms are proposed to capture the drug and disease neighbor information, and a convolutional neural network then encodes the pairwise neighbor topology representations. In the second branch, a CAE was utilized to learn the attribute representations of a drug–disease pair from the three drug–disease heterogeneous networks. The scores predicted by the two branches were weighted and summed to obtain the final association score; a higher score signifies a higher possibility of an association.
Figure 3.
Framework of the proposed NAPred model. (a) Construct multi-scale meta-path sets and the sets composed of the same-type neighbor nodes. (b) Encode the attribute vectors of neighbor nodes of a drug. (c) Encode the attribute vectors of neighbor nodes of a disease. (d) Learn the neighbor topology of a drug–disease node pair. (e) Learn the attributes of the node pair. (f) Integrate multiple representations.
3.1. Dataset
Based on a previous study [15], we obtained drug–disease association data, chemical substructure data of drugs, protein structural domain data of target proteins, and gene ontology information of target proteins. The drug–disease association data were initially obtained from the UMLS [38] and contain 763 drugs, 681 diseases, and 3051 known drug–disease associations. We extracted drug chemical substructure data from the PubChem database [39] and drug target protein structural domain data from the InterPro database [40]. The UniProt database was used to obtain gene ontology information regarding the target proteins of the drugs [41]. The numbers of drug chemical substructures, drug target protein structural domains, and drug target protein gene ontology terms in our dataset were 623, 1426, and 4447, respectively.
3.2. Establishing Drug–Disease Heterogeneous Networks
3.2.1. Matrix of Drug Properties
Let the matrix $X^{che} \in \{0,1\}^{N_r \times n_{che}}$ denote which chemical substructures each drug contains, where $N_r$ and $n_{che}$ indicate the number of drugs and the number of all relevant chemical substructures, respectively. $X^{che}_{ij} = 1$ implies that drug $r_i$ contains the chemical substructure $sub_j$, whereas $X^{che}_{ij} = 0$ implies otherwise. The chemical substructure attribute vector of $r_i$, denoted $x^{che}_i$, is the i-th row vector of $X^{che}$.

Let the matrix $X^{dom} \in \{0,1\}^{N_r \times n_{dom}}$ denote the protein structural domains found in the target proteins associated with each drug, where $n_{dom}$ is the number of protein structural domains of all drug target proteins. $X^{dom}_{ij}$ is 1 if a target protein related to drug $r_i$ contains the j-th protein structural domain, and 0 otherwise. The protein structural domain attribute vector $x^{dom}_i$ of $r_i$ is the i-th row of $X^{dom}$.

The matrix $X^{go} \in \{0,1\}^{N_r \times n_{go}}$ is used to indicate whether gene ontology terms are annotated to the target proteins associated with each drug. $X^{go}_{ij} = 1$ implies that a target protein associated with drug $r_i$ is annotated with gene ontology term $go_j$, whereas 0 implies otherwise. The target protein gene ontology attribute vector of $r_i$ is the i-th row vector $x^{go}_i$.
3.2.2. Establishment of the Drug Network
For two drugs $r_i$ and $r_j$, a higher number of identical chemical substructures signifies a higher level of similarity between them. The cosine similarity of their chemical substructure vectors can be calculated using the strategy previously described by Liang et al. [15], and we used it as the first similarity between $r_i$ and $r_j$.

Similarly, based on the protein structural domains or the gene ontology terms of the target proteins of two drugs, cosine similarity calculations can be applied to determine the second and third similarities between the two drugs.

We treated two drug nodes as connected by an edge when the calculated drug similarity exceeded 0, and the weight of the edge is the similarity between the two drugs (Figure 4). We used the matrices $R^1$, $R^2$, and $R^3$ to denote the drug networks obtained based on the three drug similarities. For instance, $R^1_{ij}$ represents the similarity between $r_i$ and $r_j$ based on their chemical substructures.
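A minimal sketch of this step, assuming the binary attribute matrices defined above; the cosine similarity here follows the standard definition rather than reproducing the exact implementation of Liang et al., and the function name is illustrative.

```python
import numpy as np

def cosine_drug_similarity(X):
    """Pairwise cosine similarity between the binary attribute vectors (rows) of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid division by zero for drugs with no attributes
    Xn = X / norms
    return Xn @ Xn.T                         # (N_r, N_r) drug similarity / weighted drug network

# R1, R2, R3: drug networks from chemical substructures, protein domains, and GO terms
# R1 = cosine_drug_similarity(X_che); R2 = cosine_drug_similarity(X_dom); R3 = cosine_drug_similarity(X_go)
```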
Figure 4.
Construction of three heterogeneous networks based on multiple kinds of drug similarities, drug–disease associations, and disease similarities.
3.2.3. Establishment of the Disease Network
The similarities between diseases were calculated to establish the disease network. Wang et al. [42] computed the similarity between diseases using their directed acyclic graphs (DAGs). A disease can be described by a DAG that includes all of its associated semantic terms, and a higher number of shared terms in the DAGs of two diseases implies a higher semantic similarity between them. An edge is added between any two diseases if their similarity exceeds 0, and the weight of the edge reflects the similarity between the two diseases. The matrix $D$ represents the disease network, with $D_{jk}$ denoting the semantic similarity between diseases $d_j$ and $d_k$. The attribute vector of $d_j$ is the j-th row of $D$.
3.2.4. Drug–Disease Heterogeneous Network
Connecting edges were added to link the nodes of the three drug networks and the disease network using the known drug–disease associations (Figure 4). Let the association matrix $A \in \{0,1\}^{N_r \times N_d}$ denote the associations between drugs and diseases: $A_{ij} = 1$ if an edge connects $r_i$ and $d_j$, and $A_{ij} = 0$ if no connection exists.

The matrix $H_1$, which combines the first drug similarity matrix $R^1$, the drug–disease association matrix $A$, and the disease semantic similarity matrix $D$, represents the first drug–disease heterogeneous network.

Similarly, the second and third drug–disease heterogeneous networks can be generated from the second and third drug similarities; they are represented by $H_2$ and $H_3$.

We denote these three drug–disease heterogeneous networks by $G_m$, where $m \in \{1, 2, 3\}$.
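The three heterogeneous networks can be assembled as block matrices; this layout (drug similarity, association, and disease similarity blocks) is a sketch consistent with Figure 4, not necessarily the authors' storage format.

```python
import numpy as np

def build_heterogeneous_network(R_m, A, D):
    """Adjacency matrix H_m of one drug-disease heterogeneous network.

    R_m: (N_r, N_r) drug similarity; A: (N_r, N_d) associations; D: (N_d, N_d) disease similarity.
    """
    top = np.hstack([R_m, A])
    bottom = np.hstack([A.T, D])
    return np.vstack([top, bottom])          # (N_r + N_d, N_r + N_d)

# H = [build_heterogeneous_network(R_m, A, D) for R_m in (R1, R2, R3)]
```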
3.3. Neighborhood Topology Encoding
3.3.1. Multi-Scale Meta-Path Sets
A meta-path [43] is a path of the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$ (abbreviated as $A_1 A_2 \cdots A_{l+1}$), which describes the composite relationship $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between node types $A_1$ and $A_{l+1}$. Two nodes can be connected via different meta-paths in a heterogeneous drug–disease network, and different meta-paths carry different semantics. For example, two drugs $r_1$ and $r_3$ can be connected by the meta-paths $r$–$r$–$r$ (rrr) and $r$–$d$–$r$ (rdr) (Figure 3a). In rrr, drugs $r_1$ and $r_3$ may be similar because both are similar to an intermediate drug $r_2$; in rdr, both drugs $r_1$ and $r_3$ are associated with the same disease, suggesting that $r_1$ may be similar to $r_3$.
Based on the structural information of $G_m$, we can obtain the first-order meta-paths of the drug nodes, $r \to r$ and $r \to d$, which form the set of first-order meta-paths of the drug nodes. Similarly, the second-order meta-paths of the drug nodes include $r \to r \to r$, $r \to r \to d$, $r \to d \to r$, and $r \to d \to d$, which form the set of second-order meta-paths of the drug nodes. Finally, we obtain the multi-scale meta-path set $\Phi_r$ ($\Phi_d$) of the drug (disease) nodes.
3.3.2. Neighbor Sets Based on Meta-Paths at Different Scales
For a node $r_i$ ($d_j$) and the meta-path set $\Phi_r$ ($\Phi_d$), we can capture the drug nodes and disease nodes connected to $r_i$ ($d_j$) via meta-paths of different scales. This yields the drug neighbor node sets and the disease neighbor node sets of $r_i$ ($d_j$) at different scales, where the first-order neighbors of a node include the node itself.

For the drug (disease)-type neighbors of $r_i$ ($d_j$), we retained the top-$\kappa$ neighbors that were most similar to $r_i$ ($d_j$) according to their similarities to all other drugs (diseases). For the disease (drug)-type neighbors of $r_i$ ($d_j$), the disease (drug) nodes connected to $r_i$ ($d_j$) were ranked based on their occurrence frequency along the corresponding meta-paths, and the top-$\kappa$ nodes were retained as neighbors of $r_i$ ($d_j$).

As shown in Figure 3, for $r_1$ and its meta-path set, assuming $\kappa$ = 3, we can obtain the first-order drug neighbors of $r_1$ in $G_1$ via the meta-path $r \to r$, retain the three top-ranked neighbors of $r_1$, and obtain its first-order drug neighbor set. Similarly, $r_1$ captures and retains the top-$\kappa$ disease neighbors via the meta-paths $r \to r \to d$ and $r \to d \to d$ in $G_1$, thereby forming its second-order disease neighbor set.
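A sketch of how such meta-path-based neighbor sets might be extracted for a drug node; the ranking rules follow the description above (similarity for same-type neighbors, occurrence frequency along the meta-paths for cross-type neighbors), while the value κ = 3 and the function names are illustrative assumptions.

```python
import numpy as np

def first_order_drug_neighbors(R_m, i, kappa=3):
    """First-order drug neighbors of drug i (meta-path r->r): itself plus the
    kappa-1 drugs most similar to it in the drug network R_m."""
    order = np.argsort(-R_m[i])
    return [i] + [int(j) for j in order if j != i][: kappa - 1]

def second_order_disease_neighbors(R_m, A, D, i, kappa=3):
    """Second-order disease neighbors of drug i via meta-paths r->r->d and r->d->d,
    ranked by how often each disease is reached along these paths."""
    via_rrd = (R_m[i] > 0).astype(float) @ A          # diseases reached through similar drugs
    via_rdd = A[i] @ (D > 0).astype(float)            # diseases reached through similar diseases
    freq = via_rrd + via_rdd
    return [int(j) for j in np.argsort(-freq)[:kappa]]
```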
3.3.3. Aggregation of Multi-Scale Neighbor Features
We propose a fully connected neural network with mean aggregation [44] to effectively combine the network topology in $G_m$ with the attributes of same-type nodes and learn low-dimensional features of the same-type neighbors at different scales. Because the learning frameworks of the drug and disease nodes are similar, we take $r_i$ and its drug (disease)-type neighbors as an example.

For the kth-order drug neighbor set $N^{k}_{drug}(r_i)$ of $r_i$, the attribute vector $x_v$ of each neighbor node $v$ can be obtained from the drug attribute matrix ($X^{che}$, $X^{dom}$, or $X^{go}$) corresponding to $G_m$. Because these attribute vectors are high-dimensional and sparse, we first perform mean aggregation over the attribute vectors of the kth-order drug neighbors of $r_i$; the aggregated vector is expressed as:

$$z^{k}_{drug} = \frac{1}{\left|N^{k}_{drug}(r_i)\right|} \sum_{v \in N^{k}_{drug}(r_i)} x_v \tag{3}$$

Subsequently, we project $z^{k}_{drug}$ into a low-dimensional feature space through a fully connected layer and obtain the low-dimensional kth-order drug neighbor feature vector as follows:

$$h^{k}_{drug} = \sigma\!\left(W_{drug}\, z^{k}_{drug} + b_{drug}\right) \tag{4}$$

where $\sigma$ denotes the ReLU activation function [45], $W_{drug}$ is the weight matrix used when the neighbor type is drug, and $b_{drug}$ is the bias vector. K denotes the total number of neighbor orders, and K = 2 in our model.
3.3.4. Same-Type Neighbor Topology Encoding Based on Neighbor-Scale-Level Attention
Because the drug (disease)-type neighbor information at different scales of $r_i$ contributes differently to the learning of the drug (disease) neighbor topological representation of $r_i$, we established a neighbor-scale-level attention mechanism to learn the attention weights of the order-1 to order-K neighbor feature vectors of the same type. For the kth-order drug neighbor feature $h^{k}_{drug}$ of $r_i$, the attention score $s_k$ is:

$$s_k = q^{T} \tanh\!\left(W_{s}\, h^{k}_{drug} + b_{s}\right) \tag{5}$$

where $q$ is the weight vector at the neighbor-scale level, and $W_s$ and $b_s$ are the weight matrix and bias vector, respectively. The normalized attention coefficient $\alpha_k$ is obtained using the softmax function, as follows:

$$\alpha_k = \frac{\exp(s_k)}{\sum_{k'=1}^{K} \exp(s_{k'})} \tag{6}$$

The drug neighbor topology representation $u_{drug}$ of $r_i$ obtained using the attention mechanism is:

$$u_{drug} = \sum_{k=1}^{K} \alpha_k\, h^{k}_{drug} \tag{7}$$
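A PyTorch sketch of the mean aggregation and the neighbor-scale-level attention described by Equations (3)–(7); the module name and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleLevelAttention(nn.Module):
    """Aggregate same-type neighbor attributes at K scales and fuse them with attention."""
    def __init__(self, attr_dim, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(attr_dim, hidden_dim)        # Eq. (4): projection to a low-dimensional space
        self.att = nn.Linear(hidden_dim, hidden_dim)
        self.q = nn.Linear(hidden_dim, 1, bias=False)    # neighbor-scale-level weight vector

    def forward(self, neighbor_attrs):
        # neighbor_attrs: list of K tensors, each (n_neighbors_k, attr_dim)
        h = torch.stack([F.relu(self.fc(x.mean(dim=0)))  # Eq. (3): mean aggregation; Eq. (4): FC + ReLU
                         for x in neighbor_attrs])        # (K, hidden_dim)
        scores = self.q(torch.tanh(self.att(h)))          # Eq. (5): attention scores
        alpha = torch.softmax(scores, dim=0)               # Eq. (6): normalized weights
        return (alpha * h).sum(dim=0)                      # Eq. (7): enhanced same-type neighbor feature
```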
3.3.5. Neighbor Topology Encoding Based on Attention Enhancement at the Neighbor Topology Level
$r_i$ has two types of neighbor nodes, drugs and diseases, whose neighbor topologies are represented as $u_{drug}$ and $u_{dis}$, respectively. However, the two types of neighbor nodes are of different importance for association prediction, so a neighbor-topology-level attention mechanism is proposed to enhance the neighbor topology representation of $r_i$. The attention score $e_t$ for the same-type neighbor topology representation $u_t$ ($t \in \{drug, dis\}$) of $r_i$ is:

$$e_t = q_{topo}^{T} \tanh\!\left(W_{topo}\, u_t + b_{topo}\right) \tag{8}$$

where $W_{topo}$ and $q_{topo}$ are the neighbor-topology-level weight matrix and weight vector, respectively, and $b_{topo}$ is a bias vector. The normalized attention weights are expressed as follows:

$$\beta_t = \frac{\exp(e_t)}{\sum_{t' \in \{drug, dis\}} \exp(e_{t'})} \tag{9}$$

Finally, the enhanced neighbor topology representation $z^{m}_{r_i}$ obtained using the attention mechanism is expressed as follows:

$$z^{m}_{r_i} = \sum_{t \in \{drug, dis\}} \beta_t\, u_t \tag{10}$$

Here, $z^{m}_{r_i}$ denotes the neighbor topological representation of $r_i$ in $G_m$, where $m \in \{1, 2, 3\}$.
Similarly, the neighbor topology representation $z^{m}_{d_j}$ of $d_j$ in $G_m$ can be obtained. These neighbor topological representations are stacked to form the feature matrix S of the $r_i$–$d_j$ node pair, as follows:

$$S = \left[ z^{1}_{r_i};\; z^{1}_{d_j};\; z^{2}_{r_i};\; z^{2}_{d_j};\; z^{3}_{r_i};\; z^{3}_{d_j} \right] \in \mathbb{R}^{6 \times n_z} \tag{11}$$

where $n_z$ denotes the dimension of the neighbor topology representations.
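A sketch of the neighbor-topology-level attention (Equations (8)–(10)) and of stacking the representations into the pairwise feature matrix S (Equation (11)); the exact stacking order of the rows of S is an assumption.

```python
import torch
import torch.nn as nn

class TopologyLevelAttention(nn.Module):
    """Fuse the drug-type and disease-type neighbor topology representations of a node."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim)
        self.q = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, u_drug, u_dis):
        u = torch.stack([u_drug, u_dis])                              # (2, hidden_dim)
        beta = torch.softmax(self.q(torch.tanh(self.W(u))), dim=0)    # Eqs. (8)-(9)
        return (beta * u).sum(dim=0)                                  # Eq. (10): enhanced topology representation

# Pairwise feature matrix S across the three networks (Eq. (11)), assuming z_r[m] and z_d[m]
# are the representations of r_i and d_j in network G_m:
# S = torch.stack([z_r[0], z_d[0], z_r[1], z_d[1], z_r[2], z_d[2]])   # (6, hidden_dim)
```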
3.3.6. CNN-Based Pairwise Neighbor Topology Encoding
The feature matrix S of the first branch is passed to the CNN, which learns the $r_i$–$d_j$ pairwise neighbor topology representation. We padded the periphery of S with zeros so that the edge features of S could be learned, and obtained the new matrix $S'$.
We established a CNN module with a convolutional layer and a max-pooling layer. The filter length and width of the convolutional layer are denoted by $w_l$ and $w_b$, respectively, and a total of $n_{conv}$ filters were used. After applying the convolution filters to $S'$, a feature map Z was generated. $E_{i,j}$ denotes the region of $S'$ covered when a filter slides to position $(i, j)$, and it is defined as:

$$E_{i,j} = S'\!\left[i : i + w_l,\; j : j + w_b\right] \tag{12}$$

where $i$ and $j$ index the positions over which the filters slide and $k \in [1, n_{conv}]$ indexes the filters. The element value produced by the k-th filter $W^{conv}_{k}$ sliding over $E_{i,j}$ is:

$$Z_{i,j,k} = \sigma\!\left(\sum \left(W^{conv}_{k} \odot E_{i,j}\right) + b_k\right) \tag{13}$$

where $\sigma$ is the ReLU function, $\odot$ denotes element-wise multiplication, and $b_k$ is the bias. $Z_{i,j,k}$ is the value at position $(i, j)$ of the k-th feature map.

The most significant features of Z were extracted using the max-pooling layer, whose filter length and width are $p_l$ and $p_b$, respectively. The k-th feature map output by the pooling layer is calculated as:

$$P^{pool}_{i,j,k} = \max_{0 \le a < p_l,\; 0 \le c < p_b} Z_{i+a,\, j+c,\, k} \tag{14}$$

where $i$ and $j$ index the positions in the pooled feature map and $k \in [1, n_{conv}]$.
In the CNN module, we set the number of filters in the convolutional layer to 16, the kernel size to 2 × 2, and the stride to 1. In the pooling layer, the kernel size was set to 2 × 2, and the stride and zero-padding were set to 1 and 0, respectively. After the convolution and max-pooling layers, the output vector $y_{topo}$ was obtained. Subsequently, $y_{topo}$ was fed into a fully connected layer followed by a softmax layer [46], which yielded the association probability distribution $c_{topo}$ for the first branch, as follows:

$$c_{topo} = \mathrm{softmax}\!\left(W_{fc1}\, y_{topo} + b_{fc1}\right) \tag{15}$$

where $W_{fc1}$ is the weight matrix of the fully connected layer in the first branch and $b_{fc1}$ is the corresponding bias vector. $c_{topo}$ contains the probabilities of the two classes, i.e., the likelihood that the drug and the disease are associated and the likelihood that they are not.
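The CNN branch can be sketched as follows, using the hyperparameters stated above (16 filters of size 2 × 2, stride 1, 2 × 2 max-pooling with stride 1); the input shape and the class count are assumptions.

```python
import torch
import torch.nn as nn

class PairTopologyCNN(nn.Module):
    """Encode the pairwise feature matrix S and output association probabilities (Eqs. (12)-(15))."""
    def __init__(self, n_rows=6, n_cols=128):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=2, stride=1, padding=1)  # zero-padding of S
        self.pool = nn.MaxPool2d(kernel_size=2, stride=1)
        # conv (padding=1) adds 1 to each spatial dim; 2x2 pooling with stride 1 removes it again
        out_dim = 16 * n_rows * n_cols
        self.fc = nn.Linear(out_dim, 2)                    # fully connected layer before softmax

    def forward(self, S):
        # S: (batch, 1, n_rows, n_cols)
        y = self.pool(torch.relu(self.conv(S)))
        y = y.flatten(start_dim=1)
        return torch.softmax(self.fc(y), dim=1)            # Eq. (15): association probability distribution
```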
3.4. Encoding Pairwise Node Attributes
3.4.1. Attribute Embedding Matrix for Drug–Disease Pairs
We introduce an embedding strategy to extract the node attributes of drug–disease pairs (Figure 5). The premise is that $r_i$ and $d_j$ are more likely to be associated if they exhibit similarities or associations with more typical drugs or diseases. Therefore, the attribute information of drugs and diseases must be learned at the pairwise node level.
Figure 5.
Illustration of constructing an attribute embedding matrix for a pair of drug and disease nodes.
For a heterogeneous drug–disease network $G_m$, the attribute vector of $r_i$ contains the m-th similarities of $r_i$ with all drugs and the associations of $r_i$ with all diseases, and the attribute vector of $d_j$ contains the associations of $d_j$ with all drugs and the similarities of $d_j$ with all diseases. We spliced these attribute vectors of $r_i$ and $d_j$ from the three networks to obtain the attribute embedding matrix P of $r_i$ and $d_j$. P is expressed as follows:

$$P = \begin{bmatrix} R^{1}_{i,:} & A_{i,:} \\ (A_{:,j})^{T} & D_{j,:} \\ R^{2}_{i,:} & A_{i,:} \\ (A_{:,j})^{T} & D_{j,:} \\ R^{3}_{i,:} & A_{i,:} \\ (A_{:,j})^{T} & D_{j,:} \end{bmatrix} \tag{16}$$

where P has a dimension of $6 \times (N_r + N_d)$.
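A sketch of constructing P for one drug–disease pair under the row layout assumed in Equation (16).

```python
import numpy as np

def attribute_embedding_matrix(R_list, A, D, i, j):
    """Stack the attribute vectors of drug i and disease j from the three
    heterogeneous networks into the pairwise embedding matrix P."""
    rows = []
    for R_m in R_list:                                   # one pair of rows per heterogeneous network
        rows.append(np.concatenate([R_m[i], A[i]]))      # drug i: similarities to drugs + associations with diseases
        rows.append(np.concatenate([A[:, j], D[j]]))     # disease j: associations with drugs + similarities to diseases
    return np.stack(rows)                                # shape (6, N_r + N_d)
```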
3.4.2. CAE-Based Pairwise Node Attribute Encoding
Because the node attribute matrix P obtained from the three heterogeneous networks is high-dimensional and sparse, meaningless and non-representative information may be present. Therefore, we performed encoding and decoding based on a CAE to comprehensively learn the attribute information of drug–disease pairs in the original data distribution, as shown in Figure 3.
Encoder: The encoder consists of two hidden layers, each comprising a convolutional layer and a max-pooling layer. The edge features of P are preserved and learned via zero-padding. The first hidden layer takes the zero-padded P as input and yields the feature map $F_1$, encoded as:

$$F_1 = pooling\!\left(\sigma\!\left(W^{en}_{1} * P + b^{en}_{1}\right)\right) \tag{17}$$

Subsequently, the feature map of the t-th layer is generated as follows:

$$F_t = pooling\!\left(\sigma\!\left(W^{en}_{t} * F_{t-1} + b^{en}_{t}\right)\right) \tag{18}$$

where $\sigma$ is the ReLU function, $W^{en}_{t}$ denotes the weight matrix of the encoder's t-th hidden layer, and $b^{en}_{t}$ is the corresponding bias vector; $t \in [1, T_{en}]$, where $T_{en}$ indicates the encoder's total number of layers. The convolution computation is indicated by "*", and $pooling(\cdot)$ denotes max-pooling, which captures the most critical features within every feature map by downsampling the latent representations produced by the convolutional layer.
Decoder: The decoder projects the code back to its initial space and reassembles it into the decoding matrix. The deviation between the decoding matrix and the initial matrix P is evaluated to obtain an optimal encoded feature map. The decoder consists of three hidden layers, each with a transposed convolutional layer. Taking the encoder output $F_{T_{en}}$ as the input of the first hidden layer of the decoder, the feature maps are obtained as follows:

$$\hat{F}_1 = \sigma\!\left(W^{de}_{1} \star F_{T_{en}} + b^{de}_{1}\right) \tag{19}$$

$$\hat{F}_t = \sigma\!\left(W^{de}_{t} \star \hat{F}_{t-1} + b^{de}_{t}\right) \tag{20}$$

where $W^{de}_{t}$ is the weight matrix of the decoder's t-th hidden layer and $b^{de}_{t}$ is the corresponding bias vector; $t \in [1, T_{de}]$, where $T_{de}$ is the total number of decoder layers. The operator "⋆" indicates the transposed convolution computation. The reconstructed matrix $\hat{P}$ is the output of the last layer of the decoder.
Optimization: Our optimization objective is to render the reconstruction $\hat{P}$ as consistent as possible with the input P. The loss function is expressed as:

$$L_{CAE} = \frac{1}{N_{tr}} \sum_{n=1}^{N_{tr}} \left\| P_n - \hat{P}_n \right\|_{F}^{2} \tag{21}$$

where $P_n$ is the embedding matrix of the n-th drug–disease pair in the training set (the input of the encoder), $\hat{P}_n$ is the corresponding output of the decoder, and $N_{tr}$ is the number of training samples. The Adam algorithm [47] was used to optimize $L_{CAE}$, and the back-propagation approach [48] was used to train the CAE and update its parameters. After training, the pairwise attribute encoding was taken as the output of the last encoder layer, denoted by $F_{attr}$.
To obtain the association probability distribution $c_{attr}$ of the second branch for the node pair $r_i$–$d_j$, the flattened $F_{attr}$ was processed by a fully connected layer followed by a softmax layer. $c_{attr}$ is expressed as:

$$c_{attr} = \mathrm{softmax}\!\left(W_{fc2}\, \mathrm{vec}(F_{attr}) + b_{fc2}\right) \tag{22}$$

where $W_{fc2}$ and $b_{fc2}$ are the weight matrix and bias vector of the fully connected layer in the second branch, respectively, and $\mathrm{vec}(\cdot)$ denotes flattening. $c_{attr}$ is the association probability distribution for the classification.
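A compact sketch of the convolutional autoencoder branch; the channel widths and kernel sizes are illustrative assumptions, and only the overall structure (two convolution + pooling encoder layers, three transposed-convolution decoder layers) follows the text.

```python
import torch
import torch.nn as nn

class PairAttributeCAE(nn.Module):
    """Encode the pairwise attribute matrix P and reconstruct it for training."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                       # Eqs. (17)-(18): two conv + max-pooling layers
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(                       # Eqs. (19)-(20): three transposed-conv layers
            nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(8, 4, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(4, 1, kernel_size=3, padding=1),
        )

    def forward(self, P):
        # P: (batch, 1, H, W), zero-padded so that both spatial dims are divisible by 4
        code = self.encoder(P)            # pairwise attribute encoding (output of the last encoder layer)
        recon = self.decoder(code)        # reconstruction, compared with P by the loss in Eq. (21)
        return code, recon

# loss = torch.mean((recon - P) ** 2)     # Frobenius-norm reconstruction loss (Eq. (21)), optimized with Adam
```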
3.5. Final Integration and Optimization
The loss function of the first branch is the cross-entropy between the true labels and the drug–disease association predictions $c_{topo}$ of the topology branch, as follows:

$$L_{topo} = - \sum_{(i,j) \in T} \left[ y_{ij} \log c_{topo}(i,j) + \left(1 - y_{ij}\right) \log\!\left(1 - c_{topo}(i,j)\right) \right] \tag{23}$$

where $T$ is the set of training samples and $c_{topo}(i,j)$ represents the predicted probability that drug $r_i$ and disease $d_j$ are associated. If an $r_i$–$d_j$ pair has a known association, then the label $y_{ij}$ is 1; otherwise, it is 0. In the second branch, the cross-entropy loss function is defined as:

$$L_{attr} = - \sum_{(i,j) \in T} \left[ y_{ij} \log c_{attr}(i,j) + \left(1 - y_{ij}\right) \log\!\left(1 - c_{attr}(i,j)\right) \right] \tag{24}$$

We trained the two branches with the loss functions $L_{topo}$ and $L_{attr}$ separately until their minimum values were attained. The final association prediction score is calculated as follows:

$$score = \lambda\, c_{topo} + (1 - \lambda)\, c_{attr} \tag{25}$$

where $\lambda \in [0, 1]$ is a hyperparameter that balances the contributions of the neighbor topologies and the pairwise node attributes to the association prediction score.
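The final integration step can be sketched as a simple weighted combination of the two branch outputs; λ (here `lam`) is the hyperparameter in Equation (25), and its default value below is only a placeholder.

```python
import torch

def final_association_score(c_topo, c_attr, lam=0.5):
    """Weighted combination of the two branch probabilities (Eq. (25)).

    c_topo, c_attr: (batch, 2) association probability distributions from the
    topology branch and the attribute branch; column 1 is the 'associated' class.
    """
    return lam * c_topo[:, 1] + (1.0 - lam) * c_attr[:, 1]
```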
4. Conclusions
We proposed the NAPred method to identify candidate diseases associated with drugs. The three constructed heterogeneous networks facilitated neighbor topology extraction based on multi-scale meta-paths and pairwise node attribute embedding. A framework comprising a convolutional neural network with attention mechanisms and a CAE was constructed to encode and integrate the neighbor topological representations and the pairwise attribute representations. Two attention mechanisms were proposed to assign greater weights to the more informative multi-scale neighbor features and neighbor topologies. The ability of NAPred to discover potentially relevant diseases for drugs was validated through cross-validation and case studies on five drugs. Extensive experimental results showed that NAPred outperformed existing methods. Our predictive model serves as a screening tool for recognizing potential drug–disease associations, thereby allowing biologists to conduct wet laboratory research to determine real drug–disease associations.
Supplementary Materials
The following are available at: https://www.mdpi.com/article/10.3390/ijms23073870/s1.
Author Contributions
P.X. designed the method and participated in manuscript writing; Z.L. developed the experiments and participated in manuscript writing; T.Z. participated in the method’s design; Y.L. participated in the experimental design; T.N. participated in the experimental design. All authors have read and agreed to the published version of the manuscript.
Funding
The work was supported by the Natural Science Foundation of China (62172143, 61972135); Heilongjiang Provincial Natural Science Foundation of China (LH2019F049, LH2019A029, LH2020F043); China Postdoctoral Science Foundation (2020M670939, 2019M650069); Heilongjiang Postdoctoral Scientific Research Starting Foundation (BHLQ18104).
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Chen H., Cheng F., Li J. iDrug: Integration of drug repositioning and drug-target prediction via cross-network embedding. PLoS Comput. Biol. 2020;16:e1008040. doi: 10.1371/journal.pcbi.1008040. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Ceddia G., Pinoli P., Ceri S., Masseroli M. Matrix Factorization-based Technique for Drug Repurposing Predictions. IEEE J. Biomed. Health Inform. 2020;24:3162–3172. doi: 10.1109/JBHI.2020.2991763. [DOI] [PubMed] [Google Scholar]
- 3.Luo H., Li M., Yang M., Wu F.X., Li Y., Wang J. Biomedical data and computational models for drug repositioning: A comprehensive review. Briefings Bioinform. 2021;22:1604–1619. doi: 10.1093/bib/bbz176. [DOI] [PubMed] [Google Scholar]
- 4.Pushpakom S., Iorio F., Eyers P.A., Escott K.J., Hopper S., Wells A., Andrew D., Tim G., Joanna L., Christine M., et al. Drug repurposing: Progress, challenges and recommendations. Nat. Rev. Drug Discov. 2019;18:41–58. doi: 10.1038/nrd.2018.168. [DOI] [PubMed] [Google Scholar]
- 5.Turanli B., Altay O., Borén J., Hasan T., Jens N., Mathias U., Yalcin A.K., Adil M. Systems biology based drug repositioning for development of cancer therapy. Semin. Cancer Biol. 2021;68:47–58. doi: 10.1016/j.semcancer.2019.09.020. [DOI] [PubMed] [Google Scholar]
- 6.Padhy B., Gupta Y. Drug repositioning: Re-investigating existing drugs for new therapeutic indications. J. Postgrad. Med. 2011;57:153–160. doi: 10.4103/0022-3859.81870. [DOI] [PubMed] [Google Scholar]
- 7.Pritchard J.-L.E., O’Mara T.A., Glubb D.M. Enhancing the Promise of Drug Repositioning through Genetics. Front. Pharmacol. 2017;8:896. doi: 10.3389/fphar.2017.00896. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Novac N. Challenges and opportunities of drug repositioning. Trends Pharmacol. Sci. 2013;34:267–272. doi: 10.1016/j.tips.2013.03.004. [DOI] [PubMed] [Google Scholar]
- 9.Alfedi G., Luffarelli R., Condò I., Pedini G., Mannucci L., Massaro D.S. Drug repositioning screening identifies etravirine as a potential therapeutic for friedreich’s ataxia. Mov. Disord. 2019;34:323–334. doi: 10.1002/mds.27604. [DOI] [PubMed] [Google Scholar]
- 10.Karaman B., Sippl W. Computational Drug Repurposing: Current Trends. Curr. Med. Chem. 2019;26:5389–5409. doi: 10.2174/0929867325666180530100332. [DOI] [PubMed] [Google Scholar]
- 11.Shameer K., Readhead B., Dudley J.T. Computational and experimental advances in drug repositioning for accelerated therapeutic stratification. Curr. Top. Med. Chem. 2015;15:5–20. doi: 10.2174/1568026615666150112103510. [DOI] [PubMed] [Google Scholar]
- 12.Gottlieb A., Stein G.Y., Ruppin E., Sharan R. PREDICT: A method for inferring novel drug indications with application to personalized medicine. Mol. Syst. Biol. 2011;7:496. doi: 10.1038/msb.2011.26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Zhang W., Yue X., Lin W., Wu W., Liu R., Huang F., Liu F. Predicting drug-disease associations by using similarity constrained matrix factorization. BMC Bioinform. 2018;19:233. doi: 10.1186/s12859-018-2220-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Wang Y., Chen S., Deng N., Wang Y. Drug Repositioning by Kernel-Based Integration of Molecular Structure, Molecular Activity, and Phenotype Data. PLoS ONE. 2013;8:e78518. doi: 10.1371/journal.pone.0078518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Liang X., Zhang P., Yan L., Fu Y., Peng F., Qu L. LRSSL: Predict and interpret drug–disease associations based on data integration using sparse subspace learning. Bioinformatics. 2017;33:1187–1196. doi: 10.1093/bioinformatics/btw770. [DOI] [PubMed] [Google Scholar]
- 16.Wang W., Yang S., Zhang X., Li J. Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics. 2014;30:2923–2930. doi: 10.1093/bioinformatics/btu403. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Liu H., Song Y., Guan J., Luo L., Zhuang Z. Inferring new indications for approved drugs via random walk on drug-disease heterogenous networks. BMC Bioinform. 2016;17:539. doi: 10.1186/s12859-016-1336-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Luo H., Wang J., Li M., Luo J., Peng X., Wu F.X., Pan Y. Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm. Bioinformatics. 2016;32:2664–2671. doi: 10.1093/bioinformatics/btw228. [DOI] [PubMed] [Google Scholar]
- 19.Yu L., Su R., Wang B., Zhang L., Zou Y., Zhang J., Gao L. Prediction of Novel Drugs for Hepatocellular Carcinoma Based on Multi-Source Random Walk. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017;14:966–977. doi: 10.1109/TCBB.2016.2550453. [DOI] [PubMed] [Google Scholar]
- 20.Huang Y.-F., Yeh H.-Y., Soo V.-W. Inferring drug-disease associations from integration of chemical, genomic and phenotype data using network propagation. BMC Med. Genom. 2013;6:S4. doi: 10.1186/1755-8794-6-S3-S4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Chen H., Zhang Z., Peng W. miRDDCR: A miRNA-based method to comprehensively infer drug-disease causal relationships. Sci. Rep. 2017;7:15921. doi: 10.1038/s41598-017-15716-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Xuan P., Zhang Y., Zhang T., Li L., Zhao L. Predicting MiRNA-Disease Associations by Incorporating Projections in Low-Dimensional Space and Local Topological Information. Genes. 2019;10:685. doi: 10.3390/genes10090685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Xuan P., Pan S., Zhang T., Liu Y., Sun H. Graph Convolutional Network and Convolutional Neural Network Based Method for Predicting LncRNA-Disease Associations. Cells. 2019;8:1012. doi: 10.3390/cells8091012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Xuan P., Sheng N., Zhang T., Liu Y., Guo Y. CNNDLP: A Method Based on Convolutional Autoencoder and Convolutional Neural Network with Adjacent Edge Attention for Predicting LncRNA–Disease Associations. Int. J. Mol. Sci. 2019;20:4260. doi: 10.3390/ijms20174260. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Xuan P., Gao L., Sheng N., Zhang T., Nakaguchi T. Graph Convolutional Autoencoder and Fully-Connected Autoencoder with Attention Mechanism Based Method for Predicting Drug-Disease Associations. IEEE J. Biomed. Health Inform. 2021;25:1793–1804. doi: 10.1109/JBHI.2020.3039502. [DOI] [PubMed] [Google Scholar]
- 26.Xuan P., Ye Y., Zhang T., Zhao L., Sun C. Convolutional Neural Network and Bidirectional Long Short-Term Memory-Based Method for Predicting Drug–Disease Associations. Cells. 2019;8:705. doi: 10.3390/cells8070705. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Jiang H.-J., Huang Y.-A., You Z.-H. Predicting Drug-Disease Associations via Using Gaussian Interaction Profile and Kernel-Based Autoencoder. Biomed Res. Int. 2019;2019:2426958. doi: 10.1155/2019/2426958. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Hajian-Tilaki K. Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation. Casp. J. Intern. Med. 2013;4:627–635. [PMC free article] [PubMed] [Google Scholar]
- 29.Saito T., Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE. 2015;10:e0118432. doi: 10.1371/journal.pone.0118432. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Ling C.X., Huang J., Zhang H. Conference of the Canadian Society for Computational Studies of Intelligence. Springer; Berlin/Heidelberg, Germany: 2003. AUC: A Better Measure than Accuracy in Comparing Learning Algorithms; pp. 329–341. [Google Scholar]
- 31.Bolboacă S.D., Jäntschi L. Predictivity Approach for Quantitative Structure-Property Models. Application for Blood-Brain Barrier Permeation of Diverse Drug-Like Compounds. Int. J. Mol. Sci. 2011;12:4348. doi: 10.3390/ijms12074348. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Bolboacă S.D., Jäntschi L. Sensitivity, Specificity, and Accuracy of Predictive Models on Phenols Toxicity. J. Comput. Sci. 2014;5:345–350. doi: 10.1016/j.jocs.2013.10.003. [DOI] [Google Scholar]
- 33.Pahikkala T., Airola A., Pietilä S., Shakyawar S., Szwajda A., Tang J., Aittokallio T. Toward more realistic drug–target interaction predictions. Briefings Bioinform. 2015;16:325–337. doi: 10.1093/bib/bbu010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Duran-Frigola M., Fernández-Torras A., Bertoni M., Aloy P. Formatting biological big data for modern machine learning in drug discovery. WIREs Comput. Mol. Sci. 2019;9:e1408. doi: 10.1002/wcms.1408. [DOI] [Google Scholar]
- 35.Davis A.P., Grondin C.J., Johnson R.J., Sciaky D., McMorran R., Wiegers J. The Comparative Toxicogenomics Database: Update 2019. Nucleic Acids Res. 2019;47:D948–D954. doi: 10.1093/nar/gky868. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Wishart D.S., Feunang Y.D., Guo A.C., Lo E.J., Marcu A., Grant J.R. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46:D1074–D1082. doi: 10.1093/nar/gkx1037. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Kim S., Thiessen P.A., Bolton E.E., Chen J., Fu G., Gindulyte A. PubChem Substance and Compound databases. Nucleic Acids Res. 2016;44:D1202–D1213. doi: 10.1093/nar/gkv951. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Wang F., Zhang P., Cao N., Hu J., Sorrentino R. Exploring the associations between drug side-effects and therapeutic indications. J. Biomed. Inform. 2014;51:15–23. doi: 10.1016/j.jbi.2014.03.014. [DOI] [PubMed] [Google Scholar]
- 39.Wang Y., Xiao J., Suzek T.O., Zhang J., Wang J., Bryant S.H. PubChem: A public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37:W623–W633. doi: 10.1093/nar/gkp456. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Mitchell A., Chang H.-Y., Daugherty L., Fraser M., Hunter S., Lopez R., Craig M., Conor M., Gift N., Sebastien P., et al. The InterPro protein families database: The classification resource after 15 years. Nucleic Acids Res. 2015;43:D213–D221. doi: 10.1093/nar/gku1243. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.The UniProt Consortium The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010;38:D142–D148. doi: 10.1093/nar/gkp846. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Wang D., Wang J., Lu M., Song F., Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26:1644–1650. doi: 10.1093/bioinformatics/btq241. [DOI] [PubMed] [Google Scholar]
- 43.Wang X., Ji H., Shi C., Wang B., Ye Y., Cui P., Yu P.S. Heterogeneous Graph Attention Network. arXiv. 2019. arXiv:1903.07293 [Google Scholar]
- 44.Cen Y., Zou X., Zhang J., Yang H., Zhou J., Tang J. Representation Learning for Attributed Multiplex Heterogeneous Network. arXiv. 2019. arXiv:1905.01669 [Google Scholar]
- 45.Nair V., Hinton G.E. Rectified Linear Units Improve Restricted Boltzmann Machines; Proceedings of the 27th International Conference on International Conference on Machine Learning; Haifa, Israel. 21–24 June 2010; pp. 807–814. [Google Scholar]
- 46.Bahdanau D., Cho K., Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv. 2014. arXiv:1409.0473 [Google Scholar]
- 47.Kingma D.P., Ba J. Adam: A Method for Stochastic Optimization. arXiv. 2014. arXiv:1412.6980 [Google Scholar]
- 48.Petrini M. Improvements to the Backpropagation Algorithm. Ann. Univ. Petrosani Econ. 2012;12:185–192. [Google Scholar]