Author manuscript; available in PMC: 2026 Jan 15.
Published in final edited form as: IEEE Trans Comput Biol Bioinform. 2025 Nov-Dec;22(6):3055–3064. doi: 10.1109/TCBBIO.2025.3609315

Link Prediction in Multipartite Graphs with Application to Drug Repositioning Studies*

Cheng Chen 1, Stephen K Grady 1, Levente Dojcsak 1, Sally R Ellingson 2, Michael A Langston 1
PMCID: PMC12800381  NIHMSID: NIHMS2129193  PMID: 40938726

Abstract

Developing new ethical drugs is exceedingly expensive in terms of both time and resources. A single drug can take up to a decade to bring to market, with costs soaring to over a billion dollars. Drug repositioning has thus become an attractive alternative to the development of new compounds, with growing interest in the use of in silico repositioning predictions. Bipartite graphs and efficient biclique enumeration algorithms can be used to study drug-protein or other crucial pairwise interactions. Extensions of this approach to datasets with three or more divergent data types have been hobbled, however, by a lack of effective analytics. To address this shortcoming, a highly innovative and efficient graph theoretical technique is introduced to impute potential edges (links) in an arbitrary multipartite graph. The utility of this method is demonstrated on five tripartite graphs, each comprising three partite sets, one each for diseases, drugs, and gene products of interest, and with interpartite edges denoting known interactions or associations. Evidence for the reliability of imputed edges is also reported.

Index Terms: Computational drug development and repositioning, multipartite graph theoretical algorithms, network-based link prediction

I. INTRODUCTION

Traditional drug discovery and development entail substantial costs, routinely requiring funding of over a billion USD and development times of over a decade [1, 2]. Moreover, the proportion of compounds that make it to market is low, further diminishing return on investment. In an effort to mitigate these costs, computer-aided drug design [4] has become an attractive alternative. A particularly promising framework for this is drug repositioning, which encompasses methods for determining new medical indications for existing FDA-approved pharmaceuticals, thereby circumventing the time and finances needed to develop novel compounds. Compared to the billions typically required to develop a new drug, bringing a repurposed drug to market is estimated to cost an average of $300 million [5]. One of the oldest and best-known examples of drug repositioning is aspirin. It was first named by the German pharmaceutical giant Bayer near the end of the 19th century and prescribed to treat fever and inflammation [6]. It was repositioned in the 1980s for antiplatelet aggregation [7], and suggested only recently as a colorectal cancer preventative [8]. Unfortunately, for afflictions such as Alzheimer’s disease, entire decades may pass with no new drug candidate approvals [7]. This makes repositioning an especially appealing strategy for timely and efficacious treatment of disease.

A common approach to drug repositioning is based on leveraging publicly available databases containing known chemical structures to predict possible drug-target interactions. Computer-assisted protein-ligand binding analysis is widely included in both disease-protein relationship studies and drug-protein studies [2]. This technique was used, for instance, to analyze possible binding sites between 200 antiviral phytocompounds and the SARS-CoV-2 main protease to determine potential inhibitors [9].

Graph theory too has found a natural role in drug repositioning. In this setting, biological entities are represented by vertices, and their relationships are denoted by edges (aka links). Thus, the process of determining these relationships is often framed as a link prediction task. Examples of this approach include the use of neighborhood similarity [10, 11], network diffusion [12], and matrix projection to generate drug-target interaction predictions [13]. Melding highly heterogeneous data can sometimes result in more accurate predictions, although this poses its own set of challenges [14] and prompts the use of multipartite graphs. Specifically, a k-partite graph is one whose vertices can be partitioned into k nonempty partite sets so that every edge has its endpoints in different sets [15]. Bipartite graphs, those for which k=2, have previously been used in drug repositioning studies [11, 12, 16, 17], but little consideration has thus far been seen for scaling to larger values of k.

In this paper, we address this issue with the presentation of a scalable graph theoretical algorithm that can be used to predict edges between an arbitrary number of partite sets. We test the utility of this method on a tripartite graph, whose partite sets respectively denote diseases, drugs, and gene products, and whose interpartite edges represent known interactions and associations. Although the primary focus of this paper is on emergent combinatorial tools for link prediction, we also apply well established bioinformatics methodology to estimate the potential validity of our results.

The remainder of this paper is organized as follows. In the next section, we explain how to construct tripartite graphs from heterogeneous data, work through a handful of well-chosen examples, and describe the algorithm we use for link prediction. In Sections III, IV, and V, respectively, we provide supplemental evidence to support the pertinence of the disease-drug, disease-protein, and drug-protein links that our method predicts. In the final section, we summarize our work, review its limitations, and suggest possible avenues for future study.

II. METHODOLOGY

A. Tripartite Graph Construction

All graphs used in this paper are finite, simple, unweighted, and undirected. A k-partite clique in a k-partite graph is a set of vertices that induces a complete k-partite subgraph.
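The definition of a k-partite clique can be made concrete with a short check. The following is a minimal sketch, not the enumeration algorithm of [15]; the function name, the dictionary-based graph representation, and the toy vertex labels are all assumptions made for illustration.

```python
from itertools import combinations

def is_kpartite_clique(vertices, part_of, adj):
    """Return True if `vertices` induces a complete k-partite subgraph.

    part_of: dict mapping vertex -> partite-set label
    adj:     dict mapping vertex -> set of neighbors (undirected graph)
    """
    for u, v in combinations(vertices, 2):
        if part_of[u] == part_of[v]:
            continue  # no edge is expected inside a partite set
        if v not in adj[u]:
            return False  # a required interpartite edge is missing
    return True

# Toy tripartite graph with hypothetical labels (not BioSNAP identifiers).
part_of = {"d1": "disease", "d2": "disease", "r1": "drug", "p1": "protein"}
adj = {
    "d1": {"r1", "p1"},
    "d2": {"r1"},
    "r1": {"d1", "d2", "p1"},
    "p1": {"d1", "r1"},
}
print(is_kpartite_clique({"d1", "r1", "p1"}, part_of, adj))
print(is_kpartite_clique({"d1", "d2", "r1", "p1"}, part_of, adj))
```

In the toy graph, d1, r1, and p1 form a triclique, whereas adding d2 breaks it because d2 and p1 are not adjacent.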

We began with three bipartite networks (undirected graphs) maintained by the Stanford Biomedical Network Dataset Collection (BioSNAP) [18]. These were ChG-Miner (drugs and gene products), DCh-Miner (drugs and diseases), and DG-Miner (diseases and gene products). When there is no confusion, we will merely refer to gene products from these sources as proteins. Data in ChG-Miner represent known interactions between proteins and drugs/nutraceuticals on the US market. Nutraceuticals are distinct from ethical, also known as prescription, drugs in that they are usually taken to supplement health rather than to treat disease. BioSNAP combines the two, however, into a single category that we will simply term drugs. Associations in this dataset were derived from both DrugBank [19] and the Gigascale multimodal biological network (MINER) [20]. DCh-Miner contains known associations between diseases and drugs. Prescriptions were extracted from the Comparative Toxicogenomics Database (CTD) [21] and MINER. The CTD contains both therapeutic disease-drug links and marker/mechanism disease-drug links. A marker/mechanism link is used to denote that a drug has been correlated with a disease in the literature, even though it has not necessarily been investigated for its therapeutic effects. Finally, DG-Miner contains known associations between diseases and proteins ascertained from CTD, MINER, and the Online Mendelian Inheritance in Man (OMIM) [22] databases. As one might expect, graph degree distributions were heavily skewed toward the most highly studied diseases, drugs, and proteins. Curated links were derived from the published literature and provide no consistent sort of scoring metric or confidence interval, thereby eliminating thresholding [23, 24] as a preprocessing option.

We constructed a tripartite graph by first taking the union [25] of the three aforenamed bipartite graphs. We then removed any vertex that was not adjacent to at least one vertex from each of the two partite sets in which it did not reside. From this, we extracted five tripartite subgraphs, each anchored by a family of disease vertices categorized by their Medical Subject Headings (MeSH) codes [26] along with their drug and protein neighbors. The MeSH categories we chose were eye diseases (C11), heart diseases (C14.280), hemic and lymphatic diseases (C15), infections (C01), and wounds and injuries (C26). Although no vertex chosen for its disease classification was contained in two or more of these five subgraphs, their neighbors sometimes were, which was both expected and acceptable because this simple form of graph decomposition does not require disjointness. This list of disease categories is of course arbitrary. It was chosen merely as a proof of concept and to represent a highly diverse set of tissues and afflictions. Vertex and edge counts for these five subgraphs are listed in Table 1. Densities ranged between two and seven percent.
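The union and pruning steps can be sketched as follows. This is an illustration under stated assumptions: vertices are represented as hypothetical (partite label, name) pairs, and pruning is applied in a single pass, since the text does not say whether removal was repeated until no further vertex qualified.

```python
from collections import defaultdict

PARTS = ("disease", "drug", "protein")

def build_union(edge_lists):
    """Union of bipartite edge lists; each edge is ((part, name), (part, name))."""
    adj = defaultdict(set)
    for edges in edge_lists:
        for a, b in edges:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def prune(adj):
    """Keep vertices with at least one neighbor in each of the other two partite sets."""
    def qualifies(v):
        neighbor_parts = {part for part, _ in adj[v]}
        return all(p in neighbor_parts for p in PARTS if p != v[0])
    keep = {v for v in adj if qualifies(v)}
    # Single pass (an assumption): edges to removed vertices are dropped.
    return {v: adj[v] & keep for v in keep}

# Toy stand-ins for ChG-Miner, DCh-Miner, and DG-Miner (hypothetical data).
chg = [(("drug", "r1"), ("protein", "p1")), (("drug", "r2"), ("protein", "p2"))]
dch = [(("disease", "d1"), ("drug", "r1"))]
dg = [(("disease", "d1"), ("protein", "p1"))]

pruned = prune(build_union([chg, dch, dg]))
print(sorted(pruned))
```

Here r2 and p2 are discarded because neither is adjacent to any disease vertex.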

TABLE 1.

Vertex and edge counts for the five tripartite subgraphs under study.

| Graph Metrics | Eye Disease | Heart Disease | Hemic & Lymphatic | Infections | Wounds & Injuries |
|---|---|---|---|---|---|
| Disease Vertices | 88 | 40 | 79 | 150 | 71 |
| Drug Vertices | 844 | 963 | 925 | 814 | 885 |
| Protein Vertices | 1987 | 1992 | 1991 | 1977 | 1986 |
| Disease-Drug Edges | 5539 | 6630 | 9272 | 23068 | 6090 |
| Disease-Protein Edges | 59821 | 33524 | 74530 | 115283 | 31225 |
| Drug-Protein Edges | 5402 | 5826 | 5673 | 6232 | 5656 |
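As a worked check on the density range quoted in the text, densities can be recomputed from the Table 1 counts. The sketch below assumes that density means the number of edges divided by the number of possible interpartite vertex pairs; the text does not state the definition, so this is an assumption. Under it, the five values fall at roughly 2% to 7%.

```python
# Counts copied from Table 1: (diseases, drugs, proteins,
# disease-drug edges, disease-protein edges, drug-protein edges).
table1 = {
    "Eye Disease": (88, 844, 1987, 5539, 59821, 5402),
    "Heart Disease": (40, 963, 1992, 6630, 33524, 5826),
    "Hemic & Lymphatic": (79, 925, 1991, 9272, 74530, 5673),
    "Infections": (150, 814, 1977, 23068, 115283, 6232),
    "Wounds & Injuries": (71, 885, 1986, 6090, 31225, 5656),
}

def tripartite_density(dis, drug, prot, e_dd, e_dp, e_rp):
    """Edges divided by possible interpartite pairs (assumed definition)."""
    possible = dis * drug + dis * prot + drug * prot
    return (e_dd + e_dp + e_rp) / possible

densities = {name: tripartite_density(*c) for name, c in table1.items()}
for name, d in densities.items():
    print(f"{name}: {100 * d:.2f}%")
```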

B. Link Prediction

We generated all maximal tricliques using the k-partite clique enumeration algorithm described in [15]. To ensure that disease, drug, and protein data were all at least marginally represented, we retained for analysis only those tricliques with at least three vertices in each partite set. Our reasoning was that a triclique with only one or two diseases restricts the opportunity to study pathological similarity, one with only one or two drugs limits the potential for pharmaceutical retargeting, and one with only one or two proteins hampers our ability to pinpoint molecular pathways. Due to imbalance, no maximum triclique withstood this test.

Let C denote a triclique satisfying this criterion. We next identified any non-triclique vertex, v, that was adjacent to all but a handful of C’s vertices not in v’s partite set. Note that v must be nonadjacent to at least one such vertex, since otherwise v would be an element of C. On the other hand, too many nonadjacencies weaken v’s potential utility in the context of link prediction. For this study, we allowed no more than two nonadjacencies based on extensive bootstrapping analysis, which we will detail in the following subsection. Edges (aka links) were then imputed between v and these formerly nonadjacent vertices of C. We shall henceforth term this procedure “triclique augmentation” and hasten to note that it does not consider non-clique adjacencies, nor is it designed to preserve density. Thus, triclique augmentation should not be confused with the superficially similar paraclique algorithm [27], which is intended mainly for clustering applications in which there is a need to ameliorate the effects of noise. Link prediction counts are listed in Table 2.
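Triclique augmentation around a single triclique can be sketched as shown here. This is a minimal illustration of the procedure as described in the text, not the authors' implementation; the function name, data layout, and toy vertices are assumptions. The inclusion criterion (at least three vertices per partite set) and the nonadjacency bound of two appear as default parameters.

```python
def augment(triclique, adj, part_of, bound=2, min_size=3):
    """Impute edges around one triclique.

    triclique: dict mapping partite label -> set of triclique vertices
    adj:       dict mapping vertex -> set of neighbors
    part_of:   dict mapping vertex -> partite label
    Returns a set of frozenset({v, u}) imputed edges.
    """
    if any(len(members) < min_size for members in triclique.values()):
        return set()  # inclusion criterion not met
    in_clique = set().union(*triclique.values())
    imputed = set()
    for v in adj:
        if v in in_clique:
            continue
        # Triclique vertices lying outside v's own partite set.
        others = {u for label, members in triclique.items()
                  if label != part_of[v] for u in members}
        missing = others - adj[v]
        if 1 <= len(missing) <= bound:
            imputed |= {frozenset((v, u)) for u in missing}
    return imputed

# Toy example (hypothetical vertices): a 3x3x3 triclique plus two outside drugs.
diseases, drugs, proteins = {"d1", "d2", "d3"}, {"r1", "r2", "r3"}, {"p1", "p2", "p3"}
triclique = {"disease": diseases, "drug": drugs, "protein": proteins}
part_of = {v: label for label, members in triclique.items() for v in members}
part_of.update({"r4": "drug", "r5": "drug"})
adj = {v: set() for v in part_of}

def add_edge(u, v):
    adj[u].add(v)
    adj[v].add(u)

for a in diseases:
    for b in drugs | proteins:
        add_edge(a, b)
for a in drugs:
    for b in proteins:
        add_edge(a, b)
for u in diseases | {"p1", "p2"}:
    add_edge("r4", u)   # r4 misses only p3
add_edge("r5", "d1")    # r5 misses five triclique vertices

imputed = augment(triclique, adj, part_of)
print(imputed)
```

Drug r4 has a single nonadjacency and therefore receives the imputed edge r4-p3, while r5 exceeds the bound and contributes nothing.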

TABLE 2.

Link predictions (edge imputations) by link type and disease classification.

| Graph Computations | Eye Disease | Heart Disease | Hemic & Lymphatic | Infections | Wounds & Injuries |
|---|---|---|---|---|---|
| Tricliques Meeting Inclusion Criterion | 16409 | 18071 | 108249 | 20750 | 10239 |
| Disease-Drug Link Predictions | 5958 | 3722 | 9218 | 4998 | 3404 |
| Disease-Protein Link Predictions | 418 | 242 | 812 | 566 | 540 |
| Drug-Protein Link Predictions | 11207 | 31886 | 24457 | 11195 | 26372 |

C. Nonadjacency Bound

We use the term “nonadjacency bound” to denote an upper limit on the number of triclique nonadjacencies permitted by the link prediction algorithm just described. To determine a best limit over these data, we first created a suite of test graphs by pseudorandomly deleting 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% of the edges from each of the five tripartite disease-specific graphs we had previously created. We then defined the “recovery rate” as the number of recovered edges divided by the number of removed edges. While this ensures a direct measure of link prediction’s success, it can only increase with the nonadjacency bound. Therefore, to accompany the recovery rate, we defined the “recovery percentage” as the number of recovered edges divided by the number of predicted edges. The recovery percentage was employed merely to provide an upper bound on the range of nonadjacency bounds under consideration. This ratio is of course a more indirect and uncertain measure of success, because little is likely to be known about a predicted edge unless it had also been removed. We next applied our link prediction algorithm to each of these five tripartite graphs, varying the nonadjacency bound to examine its effect. Observed recovery rates were quite uniform: the rate fared better when a smaller proportion of edges was removed for each subgraph and for each nonadjacency bound under study. Consider, as a representative example, the graph focused on eye diseases with 70% of its edges deleted. In Figure 1, we show how the nonadjacency bound affects both the recovery rate and the recovery percentage for this graph. As another example, we scrutinized the five graphs with 1% of their edges removed. For these, we found that a nonadjacency bound of one averaged a recovery rate of barely over 35%, while a bound of two yielded an average rate just shy of 96%. This of course left little room for improvement, and larger bounds generated rates less than 99%. 
Results such as these suggest that a nonadjacency bound of two is superior to a bound of one, and furthermore that when reaching for bounds of three or more we experience greatly diminished returns and are thus unlikely to produce markedly better link predictions.
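The two measures defined above can be sketched in a few lines. The edge representation, the seed, and the toy data are assumptions introduced for illustration; the prediction step itself is treated as a given set of predicted edges.

```python
import random

def delete_edges(edges, fraction, seed=0):
    """Pseudorandomly remove a fraction of edges; return (kept, removed)."""
    rng = random.Random(seed)
    ordered = sorted(edges)
    removed = set(rng.sample(ordered, round(fraction * len(ordered))))
    return set(ordered) - removed, removed

def recovery_metrics(removed, predicted):
    """Recovery rate = recovered / removed; recovery percentage = recovered / predicted."""
    recovered = removed & predicted
    rate = len(recovered) / len(removed) if removed else 0.0
    percentage = len(recovered) / len(predicted) if predicted else 0.0
    return rate, percentage

# Toy example with hypothetical edges.
removed = {("d1", "r1"), ("d2", "p1"), ("r1", "p2"), ("d3", "r2")}
predicted = {("d1", "r1"), ("d2", "p1"), ("d9", "r9")}
rate, percentage = recovery_metrics(removed, predicted)
print(rate, percentage)

kept, gone = delete_edges({(i, i + 100) for i in range(10)}, 0.3)
print(len(kept), len(gone))
```

The recovery rate plays the role of recall over the deleted edges, while the recovery percentage resembles precision over the predictions, which is why the latter can fall as the nonadjacency bound grows.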

Fig 1. Typical edge recovery experience.


We consider a tripartite graph whose partite sets represent diseases, drugs, and proteins. It was anchored by eye diseases, and 70% of its edges were pseudorandomly removed. (a) The recovery rate experienced a huge gain as the nonadjacency bound was raised from one to two, but then it developed a much more gradual trend as the bound was further increased. (b) Roughly in concordance, the recovery percentage reached its maximum value at a nonadjacency bound of two, and then it tapered off at larger nonadjacency values.

Drug repositioning is the main application of this effort, and so drug-protein link predictions are of course our foremost concern. Nevertheless, we shall now consider all three types of edge imputations.

III. EVALUATING DISEASE-DRUG LINK PREDICTIONS

Determining the quality of inferred relationships between diseases and drugs presents a special challenge because there is no underlying mechanistic foundation for such a relationship other than archival patterns of prescription. We therefore began by using the Tanimoto fingerprint similarity metric [28] to estimate the validity of disease-drug predictions. Our goal was to determine, given a disease-drug imputed edge, whether the drug was on average more structurally similar to known drugs prescribed for the same disease than it was to other drugs. We computed this metric using RDKit [29] to obtain a drug’s Molecular ACCess System (MACCS) fingerprint [30], a 166-bit structural key descriptor widely used in molecular comparisons [31]. We also used ccbmlib [32] to estimate p-values for the Tanimoto similarities. We did not conduct any sort of similarity comparison between diseases because each tripartite subgraph was constructed and analyzed separately based on diseases within the same major category. With this approach, we found support for the study of imputed edges in the form of modest but significant mean differential drug similarity.

For an imputed edge e between disease x and drug y, we created a set D1(e) of Tanimoto similarity scores between the fingerprint of y and the fingerprints of all drugs found in tricliques that produced this same edge during triclique augmentation. We did this to ensure that y and each of these drugs were adjacent to at least one common protein and one common disease. Next, we created a set D2(e) of Tanimoto similarity scores between y and drugs not adjacent to x via known or imputed edges. Our reasoning for these computations was that higher D1 scores relative to D2 scores would provide evidence that this imputed edge was valid. From there, we constructed U1 as the union of all D1(e) sets, and U2 as the union of all D2(e) sets, where e ranged over all imputed disease-drug edges. Finally, we compared the distributions of U1 and U2 using the non-parametric one-sided Mann-Whitney U test, after Shapiro-Wilk tests confirmed that U1 and U2 did not follow normal distributions. The Shapiro-Wilk tests yielded p-values of 2.24e−46 and 6.10e−73 for U1 and U2, respectively.
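The construction of D1(e) and D2(e) can be illustrated with fingerprints stored as sets of on-bit positions. This sketch deliberately avoids RDKit so that it stays self-contained; the fingerprints, drug names, and helper names below are hypothetical, and real MACCS keys would be substituted in practice.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def d1_d2_scores(y, fps, triclique_drugs, nonadjacent_drugs):
    """Scores of drug y against (1) drugs in the tricliques that produced the
    imputed edge and (2) drugs not adjacent to the disease."""
    d1 = [tanimoto(fps[y], fps[z]) for z in triclique_drugs if z != y]
    d2 = [tanimoto(fps[y], fps[z]) for z in nonadjacent_drugs if z != y]
    return d1, d2

# Hypothetical fingerprints (on-bit indices), not real MACCS keys.
fps = {
    "y": {1, 2, 3, 4},
    "a": {1, 2, 3},
    "b": {1, 2, 3, 4, 5},
    "c": {7, 8},
    "d": {4, 9},
}
d1, d2 = d1_d2_scores("y", fps, ["a", "b"], ["c", "d"])
print(d1, d2)
```

Pooling the d1 lists over all imputed edges gives U1, and pooling the d2 lists gives U2.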

We used the procedure just described to collect 32,310 available Tanimoto drug similarity scores in U1 and 434,302 scores in U2, with means of 0.332 and 0.303, respectively. Under the hypothesis that the scores of U1 would be higher than U2, we obtained a p-value of 5.00e−280 from the one-sided Mann-Whitney U test, much less than our significance cutoff of 0.05. The Cohen’s d effect size for this test was 0.219, which is typically considered a small effect [33]. This result supported our expectation that U1 would have higher similarity scores on average than U2, although the difference was modest. We also used ccbmlib [32] to estimate the p-values of the Tanimoto similarity scores themselves. We filtered both U1 and U2 to remove Tanimoto similarity scores with p-values not less than 0.05. After filtering, U1 retained 2,768 scores (8.57%) and U2 retained 23,369 scores (5.38%). We applied the one-sided Mann-Whitney U test to these two subsets, producing a p-value of 4.50e−9, and a Cohen’s d effect size of 0.152. These differences were again modest but statistically significant.
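For readers unfamiliar with the two statistics, the sketch below computes a Mann-Whitney U statistic and Cohen's d on toy data. It assumes the pooled-standard-deviation form of Cohen's d, which the text does not specify, and it omits the p-value; in practice a library routine such as SciPy's `mannwhitneyu` would be used for that. The sample values are invented.

```python
from math import sqrt
from statistics import mean, variance

def mann_whitney_u(x, y):
    """U statistic for x: pairs with x_i > y_j, counting ties as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled = sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))
    return (mean(x) - mean(y)) / pooled

# Invented toy samples standing in for U1 and U2.
u1 = [2, 4, 6]
u2 = [1, 3, 5]
print(mann_whitney_u(u1, u2), cohens_d(u1, u2))
```

The quadratic pairwise loop is adequate only for small samples; sets as large as those reported above call for a rank-based implementation.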

We next looked at all edges that were missing from our original data in order to gauge how they compared to those we imputed. Let disease x and drug y denote a pair of non-neighbor vertices in one of our original five subgraphs. We identified all drugs adjacent to x in the relevant tripartite subgraph and formed the set S1(x,y), comprising Tanimoto similarity scores between these drugs and y. We also identified all drugs (excluding y) in the relevant tripartite subgraph that were not adjacent to x and built the set S2(x,y), using Tanimoto similarity scores between these drugs and y. We then ranked the missing disease-drug edges in each tripartite subgraph based on the difference between the means of these two sets of similarity scores. After this, we chose a cutoff value, n, selected the n highest ranked edges from each tripartite subgraph, and formed their respective unions S1 and S2. As before, we excluded edges containing diseases adjacent to fewer than three drugs since so little data was available for them and to ensure an equitable comparison with triclique augmentation. We began with n set to 24, which is the smallest number of missing edges in any of the five tripartite subgraphs, and then varied n from 100 to 1000 in increments of 100. Across all these cutoff values, the proportion of highest ranked edges imputed by triclique augmentation consistently ranged from 3.4% to 5.0%. Thus, the overlap was low, providing additional support for the potential value added by the highly focused triclique approach over mere similarity ranking.
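The similarity-ranking baseline just described can be sketched as follows. Function names, the toy fingerprints, and the data layout are assumptions for illustration; the `min_known` parameter mirrors the exclusion of diseases adjacent to fewer than three drugs.

```python
from statistics import mean

def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def rank_missing_edges(missing, drug_adj, all_drugs, sim, min_known=3):
    """Rank missing (disease, drug) pairs by mean(S1) - mean(S2), highest first."""
    ranked = []
    for x, y in missing:
        known = drug_adj[x]                 # drugs adjacent to disease x
        others = all_drugs - known - {y}    # drugs not adjacent to x, excluding y
        if len(known) < min_known or not others:
            continue
        s1 = [sim(z, y) for z in known]
        s2 = [sim(z, y) for z in others]
        ranked.append((mean(s1) - mean(s2), x, y))
    ranked.sort(reverse=True)
    return [(x, y) for _, x, y in ranked]

def overlap_fraction(ranked, imputed, n):
    """Share of the n highest ranked edges that were also imputed."""
    top = ranked[:n]
    return sum(e in imputed for e in top) / len(top) if top else 0.0

# Hypothetical fingerprints and adjacencies.
fps = {"a": {1, 2, 3}, "b": {1, 2, 4}, "c": {1, 2, 5},
       "y1": {1, 2, 3, 4}, "y2": {8, 9}, "z": {8, 9, 10}}
drug_adj = {"x": {"a", "b", "c"}}
missing = [("x", "y2"), ("x", "y1")]
ranked = rank_missing_edges(missing, drug_adj, set(fps),
                            lambda p, q: tanimoto(fps[p], fps[q]))
print(ranked, overlap_fraction(ranked, {("x", "y1")}, 2))
```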

While these analyses are positive, our main objective remains to exploit the wealth of data provided by all three partite sets. And thus we ask: how do proteins figure into disease-drug imputations? Therefore, in a third study, we counted the number of proteins adjacent to both vertices in the highest ranked disease-drug edges, comparing edges also imputed by triclique augmentation to all others. As shown in Figure 2, vertices incident to imputed edges shared more than eight common adjacent proteins, while vertices incident to non-imputed edges had just slightly over two. These results further strengthen the case that triclique augmentation effectively imputes disease-drug relationships for further study, and that by leveraging protein data it provides a possible mechanistic explanation for this stance.

Fig 2. Mean number of shared adjacent proteins for diseases and drugs in highest ranked edges.


The x-axis denotes the cutoff value, n, which limits the number of missing edges considered. The y-axis indicates the mean number of adjacent proteins shared between a disease and a drug in such an edge. The green line displays this number for the highest ranked missing edges imputed by triclique augmentation, while the orange line reflects the number for the highest ranked missing edges not imputed.
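The quantity plotted on the y-axis amounts to a common-neighbor count restricted to the protein partite set. A minimal sketch follows, with hypothetical vertices and helper names that are not taken from the paper.

```python
from statistics import mean

def shared_neighbors(u, v, adj, part_of, label):
    """Common neighbors of u and v that lie in the partite set `label`."""
    return {w for w in adj[u] & adj[v] if part_of[w] == label}

def mean_shared(edges, adj, part_of, label):
    """Mean number of shared neighbors of type `label` over a list of edges."""
    return mean(len(shared_neighbors(u, v, adj, part_of, label)) for u, v in edges)

# Hypothetical graph fragment.
part_of = {"d1": "disease", "d2": "disease", "r1": "drug",
           "p1": "protein", "p2": "protein", "p3": "protein"}
adj = {
    "d1": {"p1", "p2"},
    "d2": {"p3"},
    "r1": {"p1", "p2", "p3"},
}
print(mean_shared([("d1", "r1"), ("d2", "r1")], adj, part_of, "protein"))
```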

We also assessed the plausibility of our imputed disease-drug associations using non-structural measures. To accomplish this, we considered the Katz Index and Personalized PageRank (PPR), a pair of classical metrics commonly used in drug repositioning studies [34]. Results here too were positive. Details of the analyses are rather technical, however, and are thus provided in Appendix A.
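For orientation, the two metrics are commonly defined as below. These are the standard textbook forms, stated under the assumption that A is the adjacency matrix of the graph and P its column-normalized transition matrix; the attenuation factor β and restart probability α are generic symbols, and the specific settings used in Appendix A are not reproduced here.

```latex
% Katz index: weighted count of walks of every length between vertices,
% convergent when \beta < 1/\lambda_{\max}(A).
K \;=\; \sum_{\ell=1}^{\infty} \beta^{\ell} A^{\ell} \;=\; (I - \beta A)^{-1} - I

% Personalized PageRank for a seed vertex s with indicator vector e_s:
\pi_s \;=\; \alpha\, e_s \;+\; (1 - \alpha)\, P\, \pi_s
```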

IV. EVALUATING DISEASE-PROTEIN LINK PREDICTIONS

Estimating the potential significance of imputed associations between diseases and proteins was rather direct, largely because the interpretation of biological mechanisms underpinning human disease has long been a highly active area of research. To take advantage of prior work, we therefore employed two somewhat orthogonal methods, namely, Gene Ontology (GO) and Basic Local Alignment Search Tool (BLAST).

For GO, we concentrated on the molecular function enrichment of proteins that shared imputed edges with a common disease and conducted a literature review to assess the biological relevance of the most highly enriched GO terms. Our triclique augmentation methodology imputed 2,578 disease-protein edges in the five graphs under study. These edges collectively covered 161 diseases and 199 proteins. From each tripartite subgraph, we chose the disease contained as an endpoint within the largest number of imputed disease-protein edges. The results of this selection were these: aortic valve insufficiency (61 edges), heat stroke (73 edges), macular degeneration (37 edges), mycoplasma infections (55 edges), and sarcoidosis (70 edges). For each of these five diseases, we next used ClueGO [35] to perform molecular function enrichment on the five respective sets of proteins adjacent to these diseases via imputed edges. It is important to note that ClueGO reports GO categories in groups, each group denoted by its most significant term. GO categories with Bonferroni-corrected p-values less than 0.01 were considered significantly enriched and are listed in Table 3.
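The Bonferroni filter applied to enrichment results can be illustrated in isolation. ClueGO performs this correction internally, so the sketch below is only a stand-in that shows the arithmetic; the term names and raw p-values are invented.

```python
def bonferroni_filter(raw_pvalues, alpha=0.01):
    """Multiply each p-value by the number of tests (capped at 1) and keep
    terms whose adjusted p-value is below alpha."""
    m = len(raw_pvalues)
    adjusted = {term: min(1.0, p * m) for term, p in raw_pvalues.items()}
    return {term: p for term, p in adjusted.items() if p < alpha}

# Invented raw p-values for three hypothetical GO terms.
raw = {"term_a": 1e-6, "term_b": 0.004, "term_c": 0.2}
significant = bonferroni_filter(raw)
print(significant)
```

With three tests, term_b is adjusted to 0.012 and no longer passes the 0.01 cutoff, which shows how the correction tightens the list.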

TABLE 3.

Molecular function enrichment by GO category and disease classification.

| Disease | Enriched GO Category | Adjusted P-Value | Gene Coverage |
|---|---|---|---|
| Aortic Valve Insufficiency | neurotransmitter receptor activity | 1.81E-28 | 39.47% |
| Aortic Valve Insufficiency | organic acid binding | 5.85E-15 | 25.00% |
| Aortic Valve Insufficiency | steroid hydroxylase activity | 4.98E-09 | 15.79% |
| Aortic Valve Insufficiency | voltage-gated cation channel activity | 2.30E-08 | 13.16% |
| Aortic Valve Insufficiency | G protein-coupled amine receptor activity | 4.66E-05 | 6.58% |
| Heat Stroke | neurotransmitter receptor activity | 1.01E-30 | 27.88% |
| Heat Stroke | postsynaptic neurotransmitter receptor activity | 3.44E-21 | 25.96% |
| Heat Stroke | steroid hydroxylase activity | 4.33E-16 | 18.27% |
| Heat Stroke | G protein-coupled serotonin receptor activity | 5.93E-11 | 11.54% |
| Heat Stroke | retinoic acid binding | 8.20E-06 | 8.65% |
| Heat Stroke | oxidoreductase activity, acting on paired donors | 3.45E-05 | 2.88% |
| Heat Stroke | nuclear receptor activity | 1.53E-04 | 4.81% |
| Macular Degeneration | neurotransmitter receptor activity | 4.17E-36 | 28.40% |
| Macular Degeneration | G protein-coupled amine receptor activity | 1.03E-28 | 18.52% |
| Macular Degeneration | postsynaptic neurotransmitter receptor activity | 3.27E-17 | 19.75% |
| Macular Degeneration | transmitter-gated ion channel activity | 3.06E-15 | 16.05% |
| Macular Degeneration | adrenergic receptor activity | 2.17E-11 | 6.17% |
| Macular Degeneration | Gq/11-coupled serotonin receptor activity | 5.37E-08 | 6.17% |
| Macular Degeneration | oxidoreductase activity, acting on paired donors | 1.13E-05 | 4.94% |
| Mycoplasma Infections | neurotransmitter receptor activity | 1.89E-26 | 33.87% |
| Mycoplasma Infections | postsynaptic neurotransmitter receptor activity | 1.49E-17 | 30.65% |
| Mycoplasma Infections | glucuronosyltransferase activity | 6.91E-10 | 14.52% |
| Mycoplasma Infections | regulation of lipase activity | 2.89E-06 | 9.68% |
| Mycoplasma Infections | monooxygenase activity | 3.97E-06 | 11.29% |
| Sarcoidosis | neurotransmitter receptor activity | 2.70E-34 | 26.72% |
| Sarcoidosis | G protein-coupled serotonin receptor activity | 2.31E-28 | 27.59% |
| Sarcoidosis | gated channel activity | 7.29E-16 | 18.97% |
| Sarcoidosis | postsynaptic neurotransmitter receptor activity | 5.54E-14 | 10.34% |
| Sarcoidosis | retinoic acid binding | 2.83E-08 | 9.48% |
| Sarcoidosis | monooxygenase activity | 1.59E-06 | 6.90% |

We next conducted a literature review for some of the most significantly enriched GO categories to determine whether the corresponding disease associations were novel or supported by previous research. We focused our efforts primarily on GO categories that were enriched for two or more diseases under review. The breadth of MeSH codes, the presence of both curated and marker/mechanism links, and the ubiquity of some proteins suggest that enrichments such as neurotransmitter receptors and G protein-coupled receptors should not be unexpected. Let us discuss a few of these findings.

  • G protein-coupled amine receptors were enriched in the analyses of aortic valve insufficiency and macular degeneration, while G protein-coupled serotonin receptors were enriched in analyses of heat stroke and sarcoidosis. Both receptor types belong to the G Protein-Coupled Receptor (GPCR) family, which forms the largest receptor family in the human genome, with endogenous ligands including odors, hormones, neurotransmitters, and chemokines [36]. Moreover, GPCRs, including serotonin receptors, are implicated in a wide range of diseases due to their key role in cellular signaling. Disruptions can contribute to the development or progression of various maladies, with notable examples such as heart disease and cancer. GPCRs are also important in therapeutic applications because they are well-studied and serve as cell surface targets for various drugs [37]. With this in mind, it is not surprising that they came up in our study. Moreover, newly predicted GPCR links such as these may provide novel insights into the treatment of disease.

  • Monooxygenase activity was enriched in the analyses of mycoplasma infections and sarcoidosis. Both diseases are associated with oxidative stress, with the former inducing it in host cells [38] and the latter characterized by increased levels [39]. Certain monooxygenases, such as cytochrome P450 enzymes, also contribute to reactive oxygen species generation and modulate inflammatory responses. Biological processes of this nature may help explain why these proteins and diseases were linked by imputed edges.

  • Neurotransmitter receptor activity was enriched in the analyses of all five diseases. Neurotransmitters are the brain’s chemical messengers, and their receptors are broadly related to disease because they play a crucial role in regulating function and behavior. Neurotransmitter disruptions can lead to a wide spectrum of physical and mental health issues. These receptors also play an important role in pain signal transmission and other signal transduction functions [40]. Notably, this category provided the largest gene coverage in four of the five analyses, the exception being sarcoidosis, where it ranked second. Postsynaptic neurotransmitter receptor activity in particular was enriched for four of the five diseases we considered, the exception being aortic valve insufficiency. Neurotransmitter receptors have previously been found to play a role in several relevant processes. Serotonin receptors, for example, are involved in heart valve tissue fibrosis [41], while ionotropic receptor expression may be enhanced by acute heat stress in aged rats [42]. It is also known that glutamate, an excitatory neurotransmitter, serves as the primary driver of light responses in the retina [43], possibly linking it to macular degeneration. Additionally, some cases of Mycoplasma pneumoniae infection have been linked to recurrent anti-NMDA receptor encephalitis [44].

  • Steroid hydroxylase activity was enriched in the analyses of aortic valve insufficiency and heat stroke. These enzymes play critical roles in the biosynthesis and metabolism of steroid hormones [45], such as testosterone, which can influence the heart’s structure and function [46]. Previous studies in cows have demonstrated that heat stress too can affect the production and utilization of these hormones [47].

For BLAST, we used an NCBI tool [48] to assess protein similarity. We constructed sets S1(x,y) and S2(x,y) following the steps outlined in Section III, using BLAST bit scores of protein sequences in the place of Tanimoto similarity scores of drug fingerprints. The BLAST bit score is a normalized measure of sequence similarity that represents the overall quality of an alignment between two sequences [49]. Additionally, for each BLAST alignment, we recorded the corresponding expected value (E-value). The E-value is a statistical measure that reflects the number of alignments of equal or better score expected by chance, with lower values indicating greater significance [50]. We followed the analysis steps outlined in Section III to study the number of shared adjacent drugs between diseases and proteins in imputed edges. As illustrated in Figure 3, vertices incident to imputed edges shared, on average, more than fourteen common adjacent drugs, whereas vertices incident to non-imputed edges shared, on average, fewer than one. These findings further support the effectiveness of triclique augmentation in utilizing information from all three partite sets for link prediction.
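For reference, the bit score and E-value are related through the standard Karlin-Altschul statistics. In the expressions below, S is the raw alignment score, λ and K are parameters of the scoring system, and m and n are the effective query and database lengths; these symbols are conventional and do not appear in the text above.

```latex
% Bit score as a normalized raw score
S' \;=\; \frac{\lambda S - \ln K}{\ln 2}

% Expected number of chance alignments with bit score at least S'
E \;=\; m\, n\, 2^{-S'}
```

Because E shrinks exponentially as S' grows, a bit-score threshold and an E-value threshold act as closely related filters for a fixed search space.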

Fig 3. Mean number of shared adjacent drugs for diseases and proteins in highest ranked edges.


The x-axis denotes the cutoff value, n, which limits the number of missing edges considered. The y-axis indicates the mean number of adjacent drugs shared between a disease and a protein in such an edge. The green line displays this number for the highest ranked missing edges imputed by triclique augmentation, while the orange line reflects the number for the highest ranked missing edges not imputed by triclique augmentation.

We also found support for the study of imputed edges in the form of significant mean differential protein similarity. We constructed new sets U1 and U2 following the steps outlined in Section III, using BLAST bit scores of protein sequences in the place of Tanimoto similarity scores of drug fingerprints. U1 comprised 2,444 BLAST bit scores, while U2 comprised 72,919 BLAST bit scores, with mean values of 59.06 and 14.54, respectively. Figure 4(a) shows the distributions of BLAST bit scores in U1 and U2. We conducted a one-sided Mann-Whitney U test to evaluate the hypothesis that U1 had higher scores than U2. The resulting p-value of 1.63e−88 was far below our cutoff of 0.05, supporting our hypothesis. Next, we filtered these sets based on a specified E-value threshold and a specified bit score threshold, respectively. While there is no standard recognized cut-off for determining BLAST alignment significance, some consider E-values less than 0.001 [51, 52] or bit scores exceeding 50 [53] to indicate statistically meaningful similarity. In U1, 438 (17.9%) BLAST results exhibited an E-value less than 0.001, while 7,906 (10.8%) results in U2 had E-values less than 0.001 (Figure 4(b)). We applied the one-sided Mann-Whitney U test exclusively to these subsets of U1 and U2, yielding a p-value of approximately 0. In U1, 342 (14.0%) values had bit scores greater than 50, while only 143 (0.2%) scores in U2 exceeded this threshold (Figure 4(c)). We applied the one-sided Mann-Whitney U test exclusively to these subsets of U1 and U2, yielding a p-value of 1.40e−5. Notably, all BLAST alignments in our analysis with bit scores greater than 50 had E-values below 0.001. Consequently, among the BLAST alignments in U1 with E-values less than 0.001, 78.1% also had bit scores greater than 50, in stark contrast to only 1.8% in U2.
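The percentages in this paragraph follow directly from the reported counts, as the short arithmetic check below shows. Only numbers stated in the text are used; the last two ratios rely on the stated fact that every alignment with a bit score above 50 also had an E-value below 0.001.

```python
# Counts reported in the text.
counts = {
    "U1": {"total": 2444, "evalue_lt_0.001": 438, "bits_gt_50": 342},
    "U2": {"total": 72919, "evalue_lt_0.001": 7906, "bits_gt_50": 143},
}

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

summary = {}
for name, c in counts.items():
    summary[name] = {
        "evalue_share": pct(c["evalue_lt_0.001"], c["total"]),
        "bits_share": pct(c["bits_gt_50"], c["total"]),
        # Share of E-value-significant alignments that also exceed 50 bits.
        "bits_given_evalue": pct(c["bits_gt_50"], c["evalue_lt_0.001"]),
    }
print(summary)
```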

Fig 4: Histograms of the BLAST bit score distributions.

The yellow histogram represents the BLAST bit scores from U1 and is measured using the left y-axis. The blue histogram shows the BLAST bit scores from U2 and is measured using the right y-axis. (a) Histogram of all BLAST bit scores. (b) Histogram of BLAST bit scores with an E-value below 0.001. (c) Histogram of BLAST bit scores greater than 50.

It is worth noting that running BLAST on large numbers of proteins can require considerable computational time and resources. This further underscores the relative merit of triclique augmentation over merely ranking missing disease-protein edges based on vertex similarity.

V. EVALUATING DRUG-PROTEIN LINK PREDICTIONS

We identified a total of 42,323 unique imputed drug-protein edges across the five tripartite subgraphs under study. Establishing the feasibility of these imputed drug-protein associations by way of physical interactions was straightforward thanks to advances in the molecular docking simulation software that we used to determine binding energies between drugs and proteins. Specifically, we calculated these energies using AutoDock Vina [54], Molecular Operating Environment (MOE) [55], and VinaMPI [56]. We downloaded drug compounds from DrugBank in concatenated 3D-SDF format, processed and converted the SDF files to Mol2 files using Open Babel [57], and transformed these into PDBQT (Protein Data Bank, Partial Charge and Atom Type) files using AutoDockTools (ADT) scripts [58]. We pulled the protein structures used in binding energy calculations from the RCSB Protein Data Bank (PDB) [59]. Note that a single protein can have multiple crystal structures in the PDB. Thus, to ensure optimal use of these structures, we gave preference to those with a higher number of amino acids and a higher resolution. We conducted a global search docking simulation using the entire protein without specifying a binding site. To accomplish this, we used VMD [60] to obtain the coordinates of a docking box that encapsulates the protein. Next, we converted the cleaned proteins into PDBQT files and used high-throughput virtual docking [56] to calculate the binding energies between drug ligands and protein receptors, retaining the binding energy of the best model for further analysis.
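The structure preference and the global docking box described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the box was obtained with VMD, whereas this sketch computes it directly from atom coordinates; the order in which the two structure criteria are applied, the 5 Å padding, the function names, and all input values are hypothetical. The option names center_x, size_x, and so on are those of AutoDock Vina's search-space settings.

```python
def pick_structure(candidates):
    """Prefer more amino acids, then finer (numerically smaller) resolution.

    The priority between the two criteria is an assumption; the text states
    both preferences but not how they were combined.
    """
    return min(candidates, key=lambda c: (-c["n_residues"], c["resolution"]))


def enclosing_box(coords, padding=5.0):
    """Axis-aligned box (center, size) covering every atom plus a margin in Å."""
    xs, ys, zs = zip(*coords)
    lo = (min(xs), min(ys), min(zs))
    hi = (max(xs), max(ys), max(zs))
    center = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
    size = tuple((b - a) + 2.0 * padding for a, b in zip(lo, hi))
    return center, size


def vina_box_lines(center, size):
    """Format the box as AutoDock Vina search-space settings."""
    axes = ("x", "y", "z")
    lines = [f"center_{a} = {c:.3f}" for a, c in zip(axes, center)]
    lines += [f"size_{a} = {s:.3f}" for a, s in zip(axes, size)]
    return lines


# Hypothetical structures and atom coordinates, for illustration only.
structures = [
    {"pdb_id": "AAAA", "n_residues": 300, "resolution": 2.5},
    {"pdb_id": "BBBB", "n_residues": 450, "resolution": 3.0},
    {"pdb_id": "CCCC", "n_residues": 450, "resolution": 2.1},
]
atoms = [(0.0, 0.0, 0.0), (10.0, 4.0, -2.0), (5.0, 2.0, 6.0)]

chosen = pick_structure(structures)
center, size = enclosing_box(atoms)
box_lines = vina_box_lines(center, size)
```

Docking against a box that spans the whole receptor trades speed for the ability to find a binding site without prior knowledge of its location.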

Finally, we validated our global search docking protocol by comparing docking results against crystal structures from the RCSB Protein Data Bank (PDB) in which a protein is complexed with one of the known binding drugs in our dataset. We drew these docking results from a preliminary study, carried out before comprehensive analyses such as nonadjacency bound selection, in which we applied triclique augmentation to 40 tricliques, all with the lowest possible nonadjacency bound of one. We selected these tricliques by ranking all tricliques in four ways: three rankings based on the size of each partite set and one based on triclique size. We then selected the ten largest tricliques from each ranking.
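The selection of preliminary tricliques can be sketched as below. The representation of a triclique as a triple of sets, the reading of "triclique size" as the total number of vertices, and the function name are assumptions made for illustration. The text does not say whether a triclique appearing in more than one ranking was counted once or several times, so the sketch simply returns the four lists.

```python
def select_tricliques(tricliques, per_ranking=10):
    """Return the largest tricliques under each of four size rankings.

    Each triclique is a (diseases, drugs, proteins) triple of sets.
    """
    rankings = [
        lambda t: len(t[0]),                         # disease partite set
        lambda t: len(t[1]),                         # drug partite set
        lambda t: len(t[2]),                         # protein partite set
        lambda t: len(t[0]) + len(t[1]) + len(t[2]),  # whole triclique
    ]
    return [
        sorted(tricliques, key=key, reverse=True)[:per_ranking]
        for key in rankings
    ]


# Hypothetical tricliques, for illustration only.
toy = [
    ({"a"}, {"d1", "d2", "d3"}, {"p1"}),
    ({"a", "b", "c"}, {"d1"}, {"p1"}),
    ({"a"}, {"d1"}, {"p1", "p2", "p3", "p4"}),
    ({"a", "b"}, {"d1", "d2"}, {"p1", "p2", "p3"}),
]
top_by_ranking = select_tricliques(toy, per_ranking=1)
```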

We identified 45 proteins in the imputed drug–protein edges in the preliminary study and used them to search for crystal structures of complexes with one of the known binding drugs in our dataset. Among these, we identified eight proteins with the required crystal structures available in the PDB. For five of the eight proteins (3TBG, 6KUW, 7C61, 7EJ8, and 7JVR), we successfully docked the corresponding drugs into the correct binding site, with a best-case root-mean-square deviation (RMSD) of 1.38 Å from the crystal structure (Figure 5). Each drug-protein docking result provided nine predicted binding conformations, which we then ordered by binding energy. For the five drug-protein pairs, we found that the best RMSD occurred in either the first or second conformation. Table 4 contains the best RMSD between the docked model and the co-crystallized compound for these five drug-protein pairs and indicates in which conformation the best RMSD occurred. For 7JVR, which had the largest RMSD between the docked and co-crystallized ligand, six of the nine models were in the binding site of the crystal structure, but the molecule was buried deeper in the crystal structure than docking predicted.
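The comparison between docked conformations and the co-crystallized ligand can be illustrated with a short sketch. It assumes that atoms are listed in the same order in every pose and computes the RMSD in place, without superposition or symmetry correction; the text does not specify how the reported RMSD values were computed, and the function names and coordinates below are hypothetical.

```python
import math


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses with matched atom order."""
    if not pose_a or len(pose_a) != len(pose_b):
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def best_conformation(docked_poses, crystal_pose):
    """Return (1-based conformation number, RMSD) of the closest docked pose.

    docked_poses is assumed to be ordered by binding energy, best first.
    """
    scored = [
        (rmsd(pose, crystal_pose), number)
        for number, pose in enumerate(docked_poses, start=1)
    ]
    best_rmsd, number = min(scored)
    return number, best_rmsd


# Hypothetical two-atom ligand, for illustration only.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
poses = [
    [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)],  # conformation 1: shifted by 3 Å
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],  # conformation 2: shifted by 1 Å
]
conformation, deviation = best_conformation(poses, crystal)
```

Because the poses are not superimposed first, a low value indicates that docking recovered both the binding site and the orientation of the ligand.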

Fig 5: Comparison of docked (blue) and co-crystallized (magenta) DB00679 from PDB ID 3TBG.

The protein backbone is gray and the protein surface within 3.5 Å of the ligand is displayed with lines. A 1.38 Å RMSD between the docked ligand and the co-crystallized ligand shows that a global search docking protocol (docking to the entire protein) correctly identified the binding site for thioridazine in human cytochrome P450 2D6. Figure generated in MOE [3].

TABLE 4.

Best model RMSD between the docked and co-crystal ligands.

PDB ID               3TBG     6KUW     7C61     7EJ8     7JVR
DrugBank ID          DB00679  DB04540  DB00696  DB00484  DB01200
RMSD (Å)             1.3787   1.5111   3.9803   2.3087   9.6884
Conformation Number  1        2        1        1        2

As case studies for evaluating the predicted drug-protein interactions, we selected the five proteins (3TBG, 6KUW, 7C61, 7EJ8, and 7JVR) that we used in docking protocol validation and that satisfied our docking criteria. For each of these proteins, we divided the drugs from the five tripartite subgraphs into three distinct sets: (1) drugs adjacent to the protein via imputed edges, (2) drugs adjacent to the protein via known edges, and (3) drugs not adjacent to the protein via either imputed or known edges. For protein p, we conducted molecular docking between p and the drugs in these three sets and recorded the corresponding binding energies as E1(p), E2(p), and E3(p), respectively. Subsequently, we conducted Mann-Whitney U tests between E1(p) and E2(p), as well as between E1(p) and E3(p). For all five proteins, we observed that E1(p) exhibited lower binding energy values than E3(p), with p-values less than 0.05 (Table 5). Additionally, for all five proteins, we found that the binding energy values in E1(p) showed no statistically significant difference from those in E2(p). These results, alongside previous findings that lower binding energy values indicate a higher probability of ligand-receptor interaction and complex formation [61], suggest that our algorithm did indeed select plausible drug-protein pairs.
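The three-way split of drugs for a given protein can be sketched as follows. Edges are represented here as (drug, protein) pairs and binding energies as a dictionary keyed by drug; these representations, the function names, and all identifiers and energies are hypothetical and serve only to illustrate how E1(p), E2(p), and E3(p) are assembled before testing.

```python
from statistics import mean


def partition_drugs(all_drugs, imputed_edges, known_edges, protein):
    """Split drugs into (imputed, known, neither) with respect to one protein."""
    known = {d for d, p in known_edges if p == protein}
    # An imputed edge is a predicted missing edge, so exclude known neighbors.
    imputed = {d for d, p in imputed_edges if p == protein} - known
    neither = set(all_drugs) - known - imputed
    return imputed, known, neither


def energies_for(drugs, energy):
    """Collect docking energies for a set of drugs, in a stable order."""
    return [energy[d] for d in sorted(drugs)]


# Hypothetical drugs, edges, and binding energies, for illustration only.
all_drugs = ["D1", "D2", "D3", "D4", "D5", "D6"]
known_edges = [("D1", "P"), ("D2", "P"), ("D3", "Q")]
imputed_edges = [("D4", "P"), ("D5", "Q")]
energy = {"D1": -7.5, "D2": -7.7, "D3": -6.0, "D4": -7.4, "D5": -6.4, "D6": -6.2}

s1, s2, s3 = partition_drugs(all_drugs, imputed_edges, known_edges, "P")
e1, e2, e3 = (energies_for(s, energy) for s in (s1, s2, s3))
means = (mean(e1), mean(e2), mean(e3))
```

The lists e1, e2, and e3 would then be compared pairwise with Mann-Whitney U tests, E1(p) against E2(p) and E1(p) against E3(p).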

TABLE 5.

Binding energy metrics and Mann-Whitney U test results.

PDB ID                    3TBG      6KUW      7C61      7EJ8      7JVR
E1(p) Mean                −7.59     −7.19     −6.76     −6.62     −6.74
E2(p) Mean                −7.64     −7.17     −7.14     −6.64     −7.05
E3(p) Mean                −6.99     −6.72     −6.31     −6.21     −6.33
E1(p)-E2(p) P-Value       0.39      0.68      0.56      0.78      0.09
E1(p)-E2(p) Effect Size   0.04      −0.02     0.31      0.02      0.25
E1(p)-E3(p) P-Value       9.00E-06  6.00E-06  6.37E-07  9.00E-06  7.30E-05
E1(p)-E3(p) Effect Size   −0.41     −0.33     −0.34     −0.32     −0.28

VI. DISCUSSION

In this paper we have presented a novel, scalable link prediction strategy for k-partite graphs, explicated it for k=3 using a disease-drug-protein testbed we created from BioSNAP, and evaluated its potential utility at predicting all three types of links. Among its key analytic features are mathematical abstraction, which is provided through an application of graph theoretical algorithms, and interpretability, which is often missing with machine learning and other black-box methods. Notably, the procedures we crafted need no training data or negative samples. Moreover, they can be used when handling heterogeneous data, thus making them well-suited for studies that include data from multiple experiments.

We hasten to mention a few methodological limitations. At most a modest set of proteins was scrutinized for validation. And only in silico models were employed, since wet-lab technologies fall well outside the scope of this effort. Furthermore, the approach we devised quite naturally relies on data quality and thus may be of limited utility in the face of cold start scenarios (problems in which a newly designed drug is introduced with no previously known associations). We also note that docking may not always correlate particularly well with experimental binding energy. Nevertheless, true binding compounds generally exhibit better energies than non-binding ones.

In closing, we point out that this work may open a few new research directions. For example, investigators might want to anchor augmentation on a poorly understood disease or other targeted vertex of special interest. They could also try to incorporate this methodology with other drug repositioning techniques, and in fact such an integrative approach seems to have become increasingly popular [62–65]. Finally, researchers might wish to add highly relevant partite sets denoting categories such as side effects, or they may seek to test the effectiveness of a k-clique augmentation strategy in the context of closely related problem domains such as studies of drug-drug and protein-protein interaction networks.

Supplementary Material

supp1-3609315

ACKNOWLEDGEMENTS

This research was supported in part by the National Institutes of Health under grant R01HD092653, by the Environmental Protection Agency under grant G17D112354237, and by the Cancer Research Informatics Shared Resource Facility of the University of Kentucky Markey Cancer Center (P30CA177558). It also employed resources of the National Energy Research Scientific Computing Center, a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory and operated under Contract No. DE-AC02-05CH11231.

Biographies

Cheng Chen is currently completing the Ph.D. in Computer Science at the University of Tennessee. Her doctoral work has focused on developing novel graph theoretical methods for life science applications. Her research interests include algorithmic analysis, artificial intelligence, computational biology, data science, and machine learning.

Stephen K. Grady received the Ph.D. in Genome Science and Technology at the University of Tennessee in 2022. He is currently employed by Cobalt42, where he serves as a bioinformatician. His research interests include analytics, bioinformatics, combinatorial optimization, graph theoretical algorithms, high performance computing, and machine learning for life science applications.

Levente Dojcsak is currently pursuing the Ph.D. in Computer Science at the University of Tennessee. His doctoral studies are aimed at integrating classic graph-theoretical methods with complementary tools such as Bayesian analysis and machine learning. His research interests include artificial intelligence, combinatorial optimization, big data analytics, data science, mathematical programming, modern statistical methods, Ramsey theory, and timely applications in public health.

Sally R. Ellingson received the Ph.D. in Genome Science and Technology at the University of Tennessee in 2014. She now serves as an Assistant Professor in the Division of Biomedical Informatics at the University of Kentucky and the Cancer Research Informatics Shared Resource at the Markey Cancer Center. Her current research interests include computational drug discovery, data science, health informatics, and modeling and simulations for biomedical research.

Michael A. Langston received the Ph.D. in Computer Science from Texas A&M University in 1981. He now serves as a Professor of Electrical Engineering and Computer Science at the University of Tennessee. His current research interests include big data analytics, combinatorial optimization, computer and data science, graph theoretical algorithms, high performance implementations, life science applications, machine learning, and statistical software.

Footnotes

* A preliminary version of a portion of this paper was presented at the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, which was held online in August 2021.

REFERENCES

  • [1]. Ou-Yang SS, Lu JY, Kong XQ, Liang ZJ, Luo C, and Jiang H, “Computational Drug Discovery,” Acta Pharmacologica Sinica, vol. 33, no. 9, pp. 1131–40, Sep, 2012.
  • [2]. Jingtian Z, Yang C, and Le Z, “Exploring the computational methods for protein-ligand binding site prediction,” Comput Struct Biotechnol J, vol. 18, pp. 417–426, 2020.
  • [3]. Vilar S, Cozza G, and Moro S, “Medicinal chemistry and the molecular operating environment (moe): Application of qsar and molecular docking to drug discovery,” Curr Top Med Chem, vol. 8, no. 18, pp. 1555–72, 2008.
  • [4]. Song CM, Lim SJ, and Tong JC, “Recent advances in computer-aided drug design,” Brief Bioinform, vol. 10, no. 5, pp. 579–91, Sep, 2009.
  • [5]. Pushpakom S, Iorio F, Eyers PA, Escott KJ, Hopper S, Wells A, Doig A, Guilliams T, Latimer J, McNamee C, Norris A, Sanseau P, Cavalla D, and Pirmohamed M, “Drug repurposing: Progress, challenges and recommendations,” Nat Rev Drug Discov, vol. 18, no. 1, pp. 41–58, Jan, 2019.
  • [6]. Rinsema TJ, “One hundred years of aspirin,” Med Hist, vol. 43, no. 4, pp. 502–7, Oct, 1999.
  • [7]. Jourdan J-P, Bureau R, Rochais C, and Dallemagne P, “Drug repositioning: A brief overview,” J Pharm Pharmacol, vol. 72, no. 9, pp. 1145–1151, Sep, 2020.
  • [8]. Rothwell PMP, Fowkes FGRP, Belch JFFP, Ogawa HMD, Warlow CPP, and Meade TWP, “Effect of daily aspirin on long-term risk of death due to cancer: analysis of individual patient data from randomised trials,” The Lancet (British edition), vol. 377, no. 9759, pp. 31–41, 2011.
  • [9]. Shivanika C, Kumar D, Ragunathan V, Tiwari P, Sumitha A, and B. D. P, “Molecular docking, validation, dynamics simulations, and pharmacokinetic prediction of natural compounds against the sars-cov-2 main-protease,” J Biomol Struct Dyn, vol. 40, no. 2, pp. 585–611, Feb, 2022.
  • [10]. Liu J, Zuo Z, and Wu G, “Link prediction only with interaction data and its application on drug repositioning,” IEEE Trans Nanobioscience, vol. 19, no. 3, pp. 547–555, Jul, 2020.
  • [11]. Gündoğan E, and Kaya B, “A recommendation method based on link prediction in drug-disease bipartite network.” pp. 125–128.
  • [12]. Wang W, Lv H, Zhao Y, Liu D, Wang Y, and Zhang Y, “Dls: A link prediction method based on network local structure for predicting drug-protein interactions,” Front Bioeng Biotechnol, vol. 8, pp. 330, 2020.
  • [13]. Luo Y, Zhao X, Zhou J, Yang J, Zhang Y, Kuang W, Peng J, Chen L, and Zeng J, “A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information,” Nat Commun, vol. 8, no. 1, pp. 573, Sep 18, 2017.
  • [14]. MacLean F, “Knowledge graphs and their applications in drug discovery,” Expert Opinion on Drug Discovery, vol. 16, no. 9, pp. 1057–1069, 2021/09/02, 2021.
  • [15]. Phillips CA, Wang K, Baker EJ, Bubier JA, Chesler EJ, and Langston MA, “On finding and enumerating maximal and maximum k-partite cliques in k-partite graphs,” Algorithms, vol. 12, no. 1, Jan, 2019.
  • [16]. Eslami Manoochehri H, and Nourani M, “Drug-target interaction prediction using semi-bipartite graph model and deep learning,” BMC Bioinformatics, vol. 21, no. 4, pp. 248, 2020/07/06, 2020.
  • [17]. Cheng F, Liu C, Jiang J, Lu W, Li W, Liu G, Zhou W, Huang J, and Tang Y, “Prediction of drug-target interactions and drug repositioning via network-based inference,” PLoS Comput Biol, vol. 8, no. 5, pp. e1002503, 2012.
  • [18]. Zitnik M, Sosič R, Maheshwari S, and Leskovec J, “Biosnap datasets: Stanford biomedical network dataset collection,” Aug, 2018.
  • [19]. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, and others, “Drugbank 5.0: A major update to the drugbank database for 2018,” Nucleic Acids Res, vol. 46, no. D1, pp. D1074–D1082, Jan 4, 2018.
  • [20]. S. S. Group, “MINER: Gigascale multimodal biological network,” 2017.
  • [21]. Davis AP, Grondin CJ, Johnson RJ, Sciaky D, King BL, McMorran R, Wiegers J, Wiegers TC, and Mattingly CJ, “The comparative toxicogenomics database: Update 2017,” Nucleic Acids Res, vol. 45, no. D1, pp. D972–D978, Jan 4, 2017.
  • [22]. Amberger JS, Bocchini CA, Scott AF, and Hamosh A, “Omim.Org: Leveraging knowledge across phenotype-gene relationships,” Nucleic Acids Res, vol. 47, no. D1, pp. D1038–D1043, Jan 8, 2019.
  • [23]. Jafari M, Mirzaie M, Bao J, Barneh F, Zheng S, Eriksson J, Heckman CA, and Tang J, “Bipartite Network Models to Design Combination Therapies in Acute Myeloid Leukaemia,” Nature Communications, vol. 13, 2022.
  • [24]. Bleker C, Grady SK, and Langston MA, “A Comparative Study of Gene Co-Expression Thresholding Algorithms,” Journal of Computational Biology, vol. 31, 2024.
  • [25]. Bondy JA, and Murty USR, Graph theory with applications: Macmillan London, 1976.
  • [26]. Nelson SJ, Schopen M, Savage AG, Schulman J-L, and Arluk N, “The mesh translation maintenance system: Structure, interface design, and implementation,” Stud Health Technol Inform, vol. 107, no. Pt 1, pp. 67–9, 2004.
  • [27]. Hagan RD, Langston MA, and Wang K, “Lower bounds on paraclique density,” Discrete Applied Mathematics, vol. 204, pp. 208–212, 2016/05/11/, 2016.
  • [28]. Bajusz D, Rácz A, and Héberger K, “Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?,” Journal of Cheminformatics, vol. 7, no. 1, pp. 20, 2015/05/20, 2015.
  • [29]. Landrum G. “Rdkit: Open-Source Cheminformatics Software,” http://www.rdkit.org/.
  • [30]. Durant JL, Leland BA, Henry DR, and Nourse JG, “Reoptimization of MDL Keys for Use in Drug Discovery,” Journal of Chemical Information and Computer Sciences, vol. 42, no. 6, pp. 1273–1280, 2002/11/01, 2002.
  • [31]. Kuwahara H, and Gao X, “Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach,” Journal of Cheminformatics, vol. 13, pp. 1–12, 2021.
  • [32]. Vogt M, and Bajorath J, “ccbmlib–a Python package for modeling Tanimoto similarity value distributions,” F1000Research, vol. 9, 2020.
  • [33]. Cohen J, Statistical power analysis for the behavioral sciences: routledge, 2013.
  • [34]. Park JH, and Cho YR, “Computational Drug Repositioning with Attention Walking,” Scientific Reports, vol. 14, 2024.
  • [35]. Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, Fridman W-H, Pagès F, Trajanoski Z, and Galon J, “Cluego: A cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks,” Bioinformatics, vol. 25, no. 8, pp. 1091–3, Apr 15, 2009.
  • [36]. Yang D, Zhou Q, Labroska V, Qin S, Darbalaei S, Wu Y, Yuliantie E, Xie L, Tao H, Cheng J, Qing L, Suwen Z, Wenqing S, Yi J, and Ming-Wei W, “G protein-coupled receptors: Structure- and function-based drug discovery,” Signal Transduct Target Ther, vol. 6, no. 1, pp. 7, Jan 8, 2021.
  • [37]. Sriram K, and Insel PA, “G protein-coupled receptors as targets for approved drugs: how many targets and how many drugs?,” Molecular pharmacology, vol. 93, no. 4, pp. 251–258, 2018.
  • [38]. Ji Y, Karbaschi M, and Cooke MS, “Mycoplasma infection of cultured cells induces oxidative stress and attenuates cellular base excision repair activity,” Mutation Research/Genetic Toxicology and Environmental Mutagenesis, vol. 845, pp. 403054, 2019.
  • [39]. Boots AW, Drent M, Swennen EL, Moonen HJ, Bast A, and Haenen GR, “Antioxidant status associated with inflammation in sarcoidosis: a potential role for antioxidants,” Respiratory medicine, vol. 103, no. 3, pp. 364–372, 2009.
  • [40]. Di Maio G, Villano I, Ilardi CR, Messina A, Monda V, Iodice AC, Porro C, Panaro MA, Chieffi S, Messina G, Monda M, and La Marra M, “Mechanisms of transmission and processing of pain: A narrative review,” Int J Environ Res Public Health, vol. 20, no. 4, Feb 9, 2023.
  • [41]. Hutcheson JD, Setola V, Roth BL, and Merryman WD, “Serotonin receptors and heart valve disease—it was meant 2B,” Pharmacology & therapeutics, vol. 132, no. 2, pp. 146–157, 2011.
  • [42]. Pawar HN, Balivada S, and Kenney MJ, “Does acute heat stress differentially-modulate expression of ionotropic neurotransmitter receptors in the RVLM of young and aged F344 rats?,” Neuroscience letters, vol. 687, pp. 223–233, 2018.
  • [43]. Majumdar S, “Role of glutamate in the development of visual pathways,” Frontiers in Ophthalmology, vol. 3, pp. 1147769, 2023.
  • [44]. Dickson KS, Rosenstein DL, and Sowa NA, “Recurrent Anti-NMDA Receptor Encephalitis After Mycoplasma Pneumonia Infection,” American Journal of Psychiatry, vol. 180, no. 12, pp. 880–883, 2023.
  • [45]. Gilep AA, Sushko TA, and Usanov SA, “At the crossroads of steroid hormone biosynthesis: the role, substrate specificity and evolutionary development of CYP17,” Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics, vol. 1814, no. 1, pp. 200–209, 2011.
  • [46]. Ayaz O, and Howlett SE, “Testosterone modulates cardiac contraction and calcium homeostasis: cellular and molecular mechanisms,” Biology of sex differences, vol. 6, pp. 1–15, 2015.
  • [47]. Li L, Wu J, Luo M, Sun Y, and Wang G, “The effect of heat stress on gene expression, synthesis of steroids, and apoptosis in bovine granulosa cells,” Cell Stress and Chaperones, vol. 21, no. 3, pp. 467–475, 2016.
  • [48]. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, and Madden TL, “Ncbi blast: A better web interface,” Nucleic Acids Res, vol. 36, no. Web Server issue, pp. W5–9, Jul 1, 2008.
  • [49]. Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ, “Basic local alignment search tool,” Journal of molecular biology, vol. 215, no. 3, pp. 403–410, 1990.
  • [50]. Kerfeld CA, and Scott KM, “Using BLAST to teach “E-value-tionary” concepts,” PLoS biology, vol. 9, no. 2, pp. e1001014, 2011.
  • [51]. Weisman CM, Murray AW, and Eddy SR, “Many, but not all, lineage-specific genes can be explained by homology detection failure,” PLoS biology, vol. 18, no. 11, pp. e3000862, 2020.
  • [52]. Friedrich A, Ripp R, Garnier N, Bettler E, Deléage G, Poch O, and Moulinier L, “Blast sampling for structural and functional analyses,” BMC bioinformatics, vol. 8, pp. 1–17, 2007.
  • [53]. Buchfink B, Xie C, and Huson DH, “Fast and sensitive protein alignment using diamond,” Nat Methods, vol. 12, no. 1, pp. 59–60, Jan, 2015.
  • [54]. Trott O, and Olson AJ, “Autodock vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading,” J Comput Chem, vol. 31, no. 2, pp. 455–61, Jan 30, 2010.
  • [55]. ULC CCG, “Molecular Operating Environment (MOE), 2019.01,” Chemical Computing Group ULC 1010 Sherbrooke St. West, Suite# 910, Montreal …, 2020.
  • [56]. Ellingson SR, Smith JC, and Baudry J, “Vinampi: Facilitating multiple receptor high-throughput virtual docking on high-performance computers,” J Comput Chem, vol. 34, no. 25, pp. 2212–21, Sep 30, 2013.
  • [57]. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, and Hutchison GR, “Open babel: An open chemical toolbox,” J Cheminform, vol. 3, pp. 33, Oct 7, 2011.
  • [58]. Morris GM, Huey R, Lindstrom W, Sanner MF, Belew RK, Goodsell DS, and Olson AJ, “Autodock4 and autodocktools4: Automated docking with selective receptor flexibility,” J Comput Chem, vol. 30, no. 16, pp. 2785–91, Dec, 2009.
  • [59]. Burley SK, Bhikadiya C, Bi C, Bittrich S, Chen L, Crichlow GV, Christie CH, Dalenberg K, Di Costanzo L, Duarte JM, Dutta S, Feng Z, Ganesan S, Goodsell DS, Ghosh S, Green RK, Guranovic V, Guzenko D, Hudson BP, Lawson CL, Liang Y, Lowe R, Namkoong H, Peisach E, Persikova I, Randle C, Rose A, Rose Y, Sali A, Segura J, Sekharan M, Shao C, Tao YP, Voigt M, Westbrook JD, Young JY, Zardecki C, and Zhuravleva M, “Rcsb protein data bank: Powerful new tools for exploring 3d structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences,” Nucleic Acids Res, vol. 49, no. D1, pp. D437–D451, Jan 8, 2021.
  • [60]. Humphrey W, Dalke A, and Schulten K, “Vmd: Visual molecular dynamics,” J Mol Graph, vol. 14, no. 1, pp. 33–8, 27–8, Feb, 1996.
  • [61]. Zothantluanga JH, and Chetia D, “A beginner’s guide to molecular docking,” Sciences of Phytochemistry, vol. 1, no. 2, pp. 37–40, 2022.
  • [62]. Abbas K, Abbasi A, Dong S, Niu L, Yu L, Chen B, Cai S-M, and Hasan Q, “Application of network link prediction in drug discovery,” BMC bioinformatics, vol. 22, pp. 1–21, 2021.
  • [63]. Gaudelet T, Day B, Jamasb AR, Soman J, Regep C, Liu G, Hayter JB, Vickers R, Roberts C, and Tang J, “Utilizing graph machine learning within drug discovery and development,” Briefings in bioinformatics, vol. 22, no. 6, pp. bbab159, 2021.
  • [64]. Zhang C, Zang T, and Zhao T, “KGE-UNIT: toward the unification of molecular interactions prediction based on knowledge graph and multi-task learning on drug discovery,” Briefings in Bioinformatics, vol. 25, no. 2, pp. bbae043, 2024.
  • [65]. Mukta FT, Rana MM, Meyer A, Ellingson S, and Nguyen DD, “The algebraic extended atom-type graph-based model for precise ligand–receptor binding affinity prediction,” Journal of Cheminformatics, vol. 17, no. 1, pp. 10, 2025.
