Empowering Graph Neural Network-Based Computational Drug Repositioning with Large Language Model-Inferred Knowledge Representation

Yaowen Gu; Zidu Xu; Carl Yang

doi:10.1007/s12539-024-00654-7

Author manuscript; available in PMC: 2026 Mar 1.

Published in final edited form as: Interdiscip Sci. 2024 Sep 26;17(3):698–715. doi: 10.1007/s12539-024-00654-7


PMCID: PMC12949635 NIHMSID: NIHMS2058368 PMID: 39325266

Abstract

Computational drug repositioning, through predicting drug-disease associations (DDA), offers significant potential for discovering new drug indications. Current methods apply graph neural networks (GNN) to drug-disease heterogeneous networks to predict DDAs, achieving notable performance compared to traditional machine learning and matrix factorization approaches. However, these methods depend heavily on network topology, are hampered by incomplete and noisy network data, and overlook the wealth of biomedical knowledge available. Meanwhile, large language models (LLMs) excel in graph search and relational reasoning, which could enhance the integration of comprehensive biomedical knowledge into drug and disease profiles. In this study, we first investigate the contribution of LLM-inferred knowledge representation to drug repositioning and DDA prediction. A zero-shot prompting template was designed for the LLM to extract high-quality knowledge descriptions for drug and disease entities, followed by embedding generation from language models to transform the discrete text into continuous numerical representations. Then, we proposed LLM-DDA with three different model architectures (LLM-DDA_{Node Feat}, LLM-DDA_{Dual GNN}, LLM-DDA_GNN-AE) to investigate the best fusion mode for LLM-based embeddings. Extensive experiments on four DDA benchmarks show that LLM-DDA_GNN-AE achieved the best performance compared to 11 baselines, with overall relative improvements in AUPR of 23.22%, F1-Score of 17.20%, and precision of 25.35%. Meanwhile, selected case studies involving Prednisone and Allergic Rhinitis highlighted the model’s capability to identify reliable DDAs and knowledge descriptions, supported by existing literature. This study showcases the utility of LLMs in drug repositioning, along with their generality and applicability to other biomedical relation prediction tasks.

Keywords: Drug repositioning, Drug-disease association prediction, Heterogeneous graph neural network, Large Language Model


1. Introduction

Drug development has always been costly and time consuming, taking on average 3 billion dollars and a 13-year cycle that usually ends with a low chance of success [1–5]. Computational drug repositioning has been recognized as a promising alternative to overcome this substantial challenge [6, 7], as it reduces the cost of drug safety examination and shortens the period of drug approval and launch [8–11]. Several successful examples demonstrate the effectiveness of computational drug repositioning in accelerating the drug discovery process [9–13], such as repurposing Metformin for various neoplasms [14] and Thalidomide for Erythema Nodosum Leprosum and Multiple Myeloma [15].

The core idea of computational drug repositioning is to identify new associations between known drugs and diseases. Currently, computational drug repositioning methods fall into three main types [1, 16]: machine learning methods, matrix factorization/completion methods, and deep learning methods. Conventional machine learning methods predict drug-disease associations (DDAs) from drug and disease features using classical classification models such as Support Vector Machines, Regularized Least Squares, and Random Forests. For instance, Gao et al. developed an approach called DDA-SKF, which enhances prediction by combining a Laplacian regularized least squares algorithm with a similarity kernel fusion method [17]. The main issue with these methods is their low performance, which is largely attributable to their reliance on high-quality features that depend heavily on specific domain knowledge and experience. Matrix factorization/completion methods decompose a DDA matrix into lower-dimensional matrices to uncover latent factors [18]. As an example, SCPMF identifies new drug-virus interactions by projecting a heterogeneous drug-virus interaction network into latent feature matrices for drugs and COVID-19 viruses while incorporating weighted similarity constraints [19]. Despite their competitive performance, these methods suffer from limited effectiveness in representing drugs and diseases, especially in sparse association networks.

Deep learning methods use neural networks to construct end-to-end frameworks for the representation learning of drugs and diseases in an integrated manner, enabling accurate DDA predictions without the need for extensive manual feature engineering. In particular, graph neural networks (GNNs) [20–22], which capture the structural information of graphs via message passing between nodes, are readily embedded in end-to-end architectures to perform specific tasks with graph data inputs. Owing to their applicability to graph and network data, GNN architectures have been widely used in drug discovery-related tasks, such as property prediction [23–25] and virtual screening [26–28]. For drug repositioning, most deep learning-based DDA prediction methods are built on GNN architectures [29–33]. For instance, PSGCN was proposed to transform DDA prediction into a graph classification problem by converting DDAs into partner-specific graphs and using the SortPool strategy to handle variable-sized graph data effectively [30]. REDDA integrated GNN, graph attention, and layer attention mechanisms to learn drug and disease representations, and was trained on a multifaceted network to enhance drug-protein-gene-pathway-disease relationships through sequential learning blocks [33]. Although GNNs have significantly advanced DDA prediction with well-structured models and input data, their performance is often limited by the richness of input features, which typically rely on drug and disease pairwise similarities. These similarities rely heavily on topology and neglect the abundant related biomedical knowledge, which is stored across multiple data sources but is hard to collect exhaustively.

One promising solution to this challenge could be large language models (LLMs) such as BERT (Bidirectional Encoder Representations from Transformers) [34] and GPT (Generative Pre-trained Transformer) [35]. They are well known for learning billions of parameters through large-scale training on multi-source data. Their neural network architecture, built on self-attention, also allows them to excel in deep contextual understanding and text generation [36, 37]. LLMs have shown great promise in biomedical data synthesis, knowledge retrieval, and reasoning, with applications in tasks like biomedical knowledge graph entity and relation extraction [38] and clinical trial matching [39]. The accessibility of these models through user-friendly interfaces like ChatGPT has made them even more popular, potentially revolutionizing biomedical knowledge enrichment and representation augmentation by providing contextually rich descriptions. A recent study fine-tuned BERT on biomedical literature data for similarity calculation in a drug-disease heterogeneous network and DDA prediction [40]. However, it neglects the use of LLMs for drug and disease representation augmentation, which is arguably more valuable than network similarity calculation.

To address the above challenge and to investigate the effectiveness of LLMs in improving DDA prediction, we proposed a comprehensive framework for LLM-based DDA prediction. First, a zero-shot prompting template using GPT-4 was designed to generate precise descriptions of drugs and diseases. These descriptions were then converted into entity description embeddings, utilizing both GPT-4 and BioBERT. Subsequently, we integrated these embeddings into a GNN-based DDA prediction model, termed LLM-DDA, exploring the optimal mode for such integration. Specifically, three model architectures were developed: LLM-DDA_{Node Feat}, LLM-DDA_{Dual GNN}, and LLM-DDA_GNN-AE. Comprehensive experiments conducted on four benchmark datasets demonstrated the superiority of LLM-DDA for DDA prediction compared to 11 competitive baseline methods. Meanwhile, the best mode for integrating LLM-based embeddings with the GNN-based model (LLM-DDA_GNN-AE) and the best LLM embedding generator were determined by architecture analysis and performance comparison. Furthermore, case studies emphasized the applicability of LLM-DDA to practical drug repositioning for discovering novel DDAs. Our investigational study provides computational evidence for the potential of LLM-inferred knowledge representation in computational drug repositioning and in more general biomedical network association prediction tasks.

2. Materials and Methods

The workflow for this study is illustrated in Fig. 1 and encompasses several key phases: DDA benchmark collection, drug-disease heterogeneous network construction, prompt engineering, LLM-based embedding generation, and LLM-DDA model construction. Detailed descriptions of each phase are provided in the subsequent sections.

2.1. Problem Formulation

The DDA prediction problem is formulated as a link prediction task within a heterogeneous network $G = \{V, E\}$ , where $V = \{V_{r}, V_{d}\}$ is the node set comprising $N$ drugs $(V_{r})$ and $M$ diseases $(V_{d})$ , and $E = \{E_{r - r}, E_{r - d}, E_{d - d}\}$ is the edge set, including drug-drug $(E_{r - r})$ , drug-disease $(E_{r - d})$ , and disease-disease $(E_{d - d})$ associations. The goal is to model a function $f_{DDA} (H_{i}, H_{j}, E \setminus \{v_{i} - v_{j}\})$ that estimates the association probability $p$ for a given drug-disease pair $v_{i} - v_{j}$ , where $v_{i} \in V_{r}$ and $v_{j} \in V_{d}$ , $E \setminus \{v_{i} - v_{j}\}$ is the known edge set with the query drug-disease pair removed, and $H_{i}$ and $H_{j}$ are the respective feature embeddings used for prediction.

2.2. Benchmark Dataset Preparation

Four drug-disease association benchmarks were adopted in our study for model performance comparisons, which include: B-dataset [18], C-dataset [41], F-dataset [42], and R-dataset [33]. These datasets have been extensively used in previous drug repositioning studies. Basic statistical descriptions of these datasets are provided in Table 1. The datasets exhibit variations in label imbalance and data sparsity, enabling a comprehensive performance evaluation across both general and data-scarce scenarios.

Table 1.

Summary of four benchmark datasets

Dataset	Drugs	Diseases	Drug-disease associations	Density	Pos-Neg Ratio
B-dataset	269	598	18,416	0.114	11.45%
C-dataset	663	409	2,532	0.009	1.57%
F-dataset	593	313	1,933	0.010	1.05%
R-dataset	894	454	2,704	0.007	0.67%


B-dataset: Comprises 269 drugs, 598 diseases, and 18,416 DDAs, sourced from the Comparative Toxicogenomics Database (CTD) [43]. Drug-drug and disease-disease similarities were assessed through multi-source interactions (such as substructures and target enzymes) and MeSH (Medical Subject Headings) semantic similarities.
C-dataset: Contains 663 drugs, 409 diseases, and 2532 DDAs, generated by integrating the Dndataset [44] with F-dataset [42], as described by Luo et al. [41]. Similarities between drugs were calculated using Anatomical Therapeutic Chemical (ATC) codes, and disease similarities were derived from Disease Ontology (DO) terms.
F-dataset: Includes 593 drugs, 313 diseases, and 1933 DDAs, originating from OMIM and processed using the MetaMap tool. This dataset’s similarities were calculated based on comprehensive similarity measurements [42].
R-dataset: Features 894 drugs, 454 diseases, and 2704 DDAs, reorganized as a combination of C-dataset, F-dataset, and additional data from the KEGG database. Similarities were measured using molecular fingerprint similarities and MeSH semantic similarities.

Details on the overlap of common drugs, diseases, and DDAs between these datasets are provided in the Supplementary Materials.

2.3. Drug-Disease Heterogeneous Network Construction

For a GNN-based model, a drug-disease association network is essential for effective DDA prediction. To construct this network, we first derived drug-drug and disease-disease associations from the pre-calculated similarity matrices $S_{r - r}$ and $S_{d - d}$ , respectively. We applied a Top15 filtering method to select the most significant associations, retaining only the strongest and most relevant interactions. Then, taking drug and disease entities as nodes, the drug-drug interactions, disease-disease interactions, and drug-disease associations $A_{r - d}$ were adopted as edges to construct the drug-disease heterogeneous network. Representing the network as a node feature matrix $H_{Sim}^{(0)} \in ℝ^{(N + M) \times (N + M)}$ and an adjacency matrix $A_{Sim} \in ℝ^{(N + M) \times (N + M)}$ , the drug-disease network can be formulated as

H_{Sim}^{(0)} = [\begin{matrix} S_{r - r} & 0 \\ 0 & S_{d - d} \end{matrix}]

(1)

A_{Sim} = [\begin{matrix} Top15 (S_{r - r}) & A_{r - d} \\ {(A_{r - d})}^{T} & Top15 (S_{d - d}) \end{matrix}]

(2)
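As a rough illustration of Eqs. (1) and (2), the following NumPy sketch assembles the block node feature matrix and the adjacency matrix from toy similarity matrices. The helper names `top_k_filter` and `build_network`, the use of k = 2 instead of 15 for the toy data, the exclusion of self-similarity from the neighbor ranking, and the choice to keep raw similarity values rather than binarize them are all assumptions made for illustration; the text does not specify these details.

```python
import numpy as np

def top_k_filter(sim, k):
    """Keep, for each row, the k largest off-diagonal similarities; zero the rest."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    k = min(k, n - 1)
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)         # assumption: an entity is not its own neighbor
    idx = np.argsort(-masked, axis=1)[:, :k]  # column indices of the k most similar entities
    rows = np.arange(n)[:, None]
    out = np.zeros_like(sim)
    out[rows, idx] = sim[rows, idx]
    return out

def build_network(s_rr, s_dd, a_rd, k=15):
    """Return (H_sim, A_sim) following the block layouts of Eqs. (1) and (2)."""
    n, m = a_rd.shape
    h_sim = np.block([[s_rr, np.zeros((n, m))],
                      [np.zeros((m, n)), s_dd]])
    a_sim = np.block([[top_k_filter(s_rr, k), a_rd],
                      [a_rd.T, top_k_filter(s_dd, k)]])
    return h_sim, a_sim

def toy_similarity(rng, size):
    """Random symmetric similarity matrix with unit diagonal (toy data only)."""
    s = rng.random((size, size))
    s = (s + s.T) / 2
    np.fill_diagonal(s, 1.0)
    return s

rng = np.random.default_rng(0)
n_drugs, n_diseases = 5, 4
s_rr = toy_similarity(rng, n_drugs)
s_dd = toy_similarity(rng, n_diseases)
a_rd = (rng.random((n_drugs, n_diseases)) > 0.7).astype(float)

h_sim, a_sim = build_network(s_rr, s_dd, a_rd, k=2)
print(h_sim.shape, a_sim.shape)  # (9, 9) (9, 9)
```

Because the filtering is applied row by row, the resulting drug-drug and disease-disease blocks are not guaranteed to be symmetric; symmetrizing them would be a further design choice.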

2.4. Prompt Design for Description Generation

We utilized GPT-4’s zero-shot prompting, whose principle lies in the model’s ability to understand and generate appropriate responses to tasks without explicit prior examples or task-specific fine-tuning. This technique harnesses the core capabilities of LLMs (comprehension, reasoning, and explanation), thereby ensuring efficient and effective description generation. As illustrated in Tables 2 and 3, GPT-4 (version: gpt-4-0125-preview) was configured to mimic the expertise typical of scientists in relevant fields, optimizing it for the generation of chemo-biomedical descriptions crucial for the DDA prediction task. By referring to specific resources (i.e., DrugBank and OMIM identifiers and SMILES notation) to provide domain-specific knowledge, the zero-shot prompts guide the model to generate coherent responses that encompass key information beneficial to subsequent link prediction tasks.

Table 2.

Description generation prompt

Prompt task	Disease description generation	Drug description generation
Prompt beginning	“Generate a single, cohesive, narrative paragraph for the disease ‘{disease_name}’ associated with OMIM ID ‘{omim_id}’.” The response should include 9 key information as follows	“Generate a single, comprehensive paragraph for the drug ‘{drug_name}’ associated with its DrugBank ID ‘{drug_id}’, and its SMILES (Simplified Molecular Input Line Entry System) notation ‘{SMILES_note}’.” The response should include 10 key information as follows
Prompt key information	1) Associated genes, proteins, or mutations (3 examples) 2) Associated signal pathway (key molecular/cellular components) 3) Associated drugs for treatment (3 examples with mechanisms of action) 4) Linked comorbidities and complications 5) Nature of the disease 6) Typical clinical symptoms and signs 7) Types of the disease 8) Inheritance patterns and genetic components (examples) 9) Diagnostic criteria and testing methods	1) Detailed description of its chemical structure 2) Chemical category 3) Chemical scaffold 4) Known similar drugs (examples) 5) Pharmacokinetics (absorption, distribution, metabolism, excretion) 6) Toxicity details (examples) 7) List of target proteins 8) Indications (diseases/symptoms examples) 9) Side effects (examples) 10) Clinical usage (examples)
Prompt end	“If no specific answer, just return not available. The information does not need to be current or from a live database. Ensure the final summary is precise, evidence-based, suitable for a professional medical audience, and condenses all the points above into a coherent narrative.”	“If no specific answer, just return not available. The information does not need to be current or from a live database. Ensure the final summary is precise, evidence-based, suitable for a professional medical audience, and condenses all the points above into a coherent narrative.”


Table 3.

Drug disease association prediction prompt

Prompt component	Content
Introduction	The Online Mendelian Inheritance in Man (OMIM) database serves as a comprehensive and authoritative repository of human genes and genetic phenotypes. Simultaneously, the DrugBank database merges detailed drug information with extensive drug target data. Our research focuses on identifying associations between drugs and diseases. In this network model, both diseases and drugs (including certain chemicals not traditionally used as human drugs) are represented as nodes, with edges depicting the relationships between them. This includes associations like the link between arsenic and diseases such as prostatic neoplasms and myocardial ischemia
Query	Considering the information provided, does the drug identified by the name “{drug_name}” and DrugBank ID “{drug_db_id}” have any known associations with the disease listed as “{disease_name}” with OMIM ID “{omim_id}”?


The text generation prompts focused on:

Domain-specific knowledge enrichment: Prompts were crafted to guide the LLM to produce responses enriched with domain-specific knowledge, including information about genes, signaling pathways, related diseases, and drugs. This was facilitated by a detailed template specifying the essential elements to be included in the responses.
Minimization of hallucinated information [45, 46]: To reduce the generation of inaccurate or fabricated information, specific constraints were incorporated into the prompts. Terms such as “precise” “with examples” and “evidence-based” were used to direct the model’s responses. Additionally, the model was programmed to respond with “not available” when encountering queries beyond its scope of knowledge. This approach was intended to enhance the confidence and relevance of the LLM’s outputs.
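To make the template structure concrete, the sketch below fills the Table 2 prompts for one disease and one drug. The wording of the beginning, key information, and end parts is taken from Table 2; the function names, the way the three parts are joined with line breaks, and the placeholder identifiers are assumptions for illustration, and no call to any LLM service is made.

```python
DISEASE_KEY_INFO = [
    "Associated genes, proteins, or mutations (3 examples)",
    "Associated signal pathway (key molecular/cellular components)",
    "Associated drugs for treatment (3 examples with mechanisms of action)",
    "Linked comorbidities and complications",
    "Nature of the disease",
    "Typical clinical symptoms and signs",
    "Types of the disease",
    "Inheritance patterns and genetic components (examples)",
    "Diagnostic criteria and testing methods",
]

DRUG_KEY_INFO = [
    "Detailed description of its chemical structure",
    "Chemical category",
    "Chemical scaffold",
    "Known similar drugs (examples)",
    "Pharmacokinetics (absorption, distribution, metabolism, excretion)",
    "Toxicity details (examples)",
    "List of target proteins",
    "Indications (diseases/symptoms examples)",
    "Side effects (examples)",
    "Clinical usage (examples)",
]

PROMPT_END = (
    "If no specific answer, just return not available. "
    "The information does not need to be current or from a live database. "
    "Ensure the final summary is precise, evidence-based, suitable for a "
    "professional medical audience, and condenses all the points above into "
    "a coherent narrative."
)

def _assemble(beginning, key_info):
    """Join beginning, numbered key information, and end (layout is an assumption)."""
    header = f"{beginning} The response should include {len(key_info)} key information as follows:"
    items = "\n".join(f"{i}) {item}" for i, item in enumerate(key_info, start=1))
    return f"{header}\n{items}\n{PROMPT_END}"

def build_disease_prompt(disease_name, omim_id):
    beginning = (
        "Generate a single, cohesive, narrative paragraph for the disease "
        f"'{disease_name}' associated with OMIM ID '{omim_id}'."
    )
    return _assemble(beginning, DISEASE_KEY_INFO)

def build_drug_prompt(drug_name, drug_id, smiles):
    beginning = (
        "Generate a single, comprehensive paragraph for the drug "
        f"'{drug_name}' associated with its DrugBank ID '{drug_id}', and its SMILES "
        f"(Simplified Molecular Input Line Entry System) notation '{smiles}'."
    )
    return _assemble(beginning, DRUG_KEY_INFO)

# Placeholder identifiers: real OMIM/DrugBank IDs and SMILES strings would be filled in.
disease_prompt = build_disease_prompt("Allergic Rhinitis", "<OMIM ID>")
drug_prompt = build_drug_prompt("Prednisone", "<DrugBank ID>", "<SMILES string>")
print(disease_prompt)
```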

For DDA prediction using GPT-4 directly, our study adopted a few-shot prompting technique, as outlined in Table 3, which includes: (1) Introduction section: provides GPT-4 with contextual background through examples of known drug-disease associations. (2) Query section: elicits direct predictions for specific drug-disease pairs. This approach primes GPT-4 with relevant examples before querying, enhancing the accuracy and relevance of its predictions. These predictions serve as the “DirectPred” baseline in our comparative analysis.

2.5. Entity Description-Based Embedding Generation

We transformed drug- and disease-related descriptions into LLM-based embeddings to facilitate the mapping from discrete semantic spaces to continuous hidden vector spaces. This process enables the incorporation of high-order semantic information into deep neural network architectures for drug-disease association (DDA) prediction. Specifically, we adopted two LLMs as the embedding generator: GPT-4 [47] (version: text-embedding-ada-002), a general-purpose LLM which achieved the most competitive performance among all LLMs, and BioBERT [48], a BERT-based LLM tailored for biomedical applications with a smaller parameter size. For embedding generation, each description, denoted as $D$ , is processed into an embedding vector with dimension $E$ . The resulting embeddings $H_{LLM}^{(0)} \in ℝ^{(N + M) \times E}$ can be obtained by

H_{LLM}^{(0)} = LLM (D), LLM \in \{GPT4,BioBERT\}

(3)

2.6. LLM-DDA Model Architectures

To explore the best method for integrating LLM-based embeddings into current GNN-based framework for computational drug repositioning, we designed three distinct model architectures, each differing in how LLM embeddings are incorporated: (1). LLM-DDA_{Node Feat}: This architecture incorporates LLM embeddings directly as node features within the graph. These enriched node features are designed to enhance the node’s representational learning directly through the GNN’s processing layers; (2). LLM-DDA_{Dual GNN}: LLM-based embeddings serve as inputs to a dual-channel GNN. This model leverages a novel drug-disease heterogeneous graph recalculated based on LLM embeddings, effectively creating a more informed network topology for the GNN to process; (3). LLM-DDA_GNN-AE: LLM-based embeddings are fed into an Autoencoder (AE) within a dual GNN-AE channel. The AE’s role is to refine and reconstruct the LLM embeddings, aiming to capture and utilize complex patterns more effectively. Each model variant is represented in Fig. 1, and detailed descriptions of their methodologies are provided in subsequent sections.

2.6.1. LLM-DDA_{Node Feat}

LLM-based embeddings, generated from descriptions of each drug and disease entity, are utilized as node features within a GNN-based framework for DDA prediction. We incorporate these embeddings into the existing node feature matrix to enhance the representational capacity of the nodes. For the GNN-based model design, we employed a heterogeneous graph convolutional network (HeteroGCN) equipped with a layer attention module. This configuration helps mitigate the “over-smoothing” issue often encountered in multilayer GNNs, a challenge noted in several prior studies [29, 32, 33]. The integration process merges the similarity-based node feature matrix $H_{Sim}^{(0)}$ with the LLM-based node embedding matrix $H_{LLM}^{(0)} \in ℝ^{(N + M) \times E}$ to form the input node features $H^{(0)} \in ℝ^{(N + M) \times (N + M + E)}$ :

H^{(0)} = Concat (H_{Sim}^{(0)}, H_{LLM}^{(0)})

(4)

Then, $L$ -layered HeteroGCNs were constructed to calculate updated node features by first aggregating over homogeneous neighbor node sets in the manner of a general Graph Convolutional Network (GCN), and then employing a summation process for heterogeneous aggregation. The updating process at the $l$ -th HeteroGCN layer is formulated as

H^{(l)} = HeteroGCN (H^{(l - 1)}, A_{Sim})

(5)

H^{(l)} = {\hat{A}}_{Sim} [\begin{matrix} {\tilde{H}}_{r}^{(l)} \\ {\tilde{H}}_{d}^{(l)} \end{matrix}] = {\hat{A}}_{Sim} [\begin{matrix} σ ({\hat{D}}_{r}^{- \frac{1}{2}} {\hat{A}}_{Sim (r - r)} {\hat{D}}_{r}^{- \frac{1}{2}} H_{r}^{(l - 1)} W_{r}^{(l)}) \\ σ ({\hat{D}}_{d}^{- \frac{1}{2}} {\hat{A}}_{Sim (d - d)} {\hat{D}}_{d}^{- \frac{1}{2}} H_{d}^{(l - 1)} W_{d}^{(l)}) \end{matrix}]

(6)

where ${\hat{A}}_{Sim} \in ℝ^{(N + M) \times (N + M)}$ represents the adjacency matrix augmented with the identity matrix, from which the drug-drug homogeneous adjacency matrix ${\hat{A}}_{Sim (r - r)} \in ℝ^{N \times N}$ and the disease-disease homogeneous adjacency matrix ${\hat{A}}_{Sim (d - d)} \in ℝ^{M \times M}$ are extracted. This setup allows graph convolutional operations to be applied separately to drug and disease entities, thus capturing intra-type interactions. Given the dimension of the hidden vector as $K$ , the intermediate node features $[\begin{matrix} {\tilde{H}}_{r}^{(l)} \\ {\tilde{H}}_{d}^{(l)} \end{matrix}] \in ℝ^{(N + M) \times K}$ represent the aggregated node features from the homogeneous graphs. For the homogeneous GCN, $σ$ is the ReLU activation function, and ${\hat{D}}_{r} \in ℝ^{N \times N}$ and ${\hat{D}}_{d} \in ℝ^{M \times M}$ are the degree matrices of the drug and disease homogeneous graphs. $W_{r}^{(l)}$ and $W_{d}^{(l)}$ are the trainable parameter matrices of the $l$ -th HeteroGCN layer.
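A single HeteroGCN layer in the sense of Eq. (6) can be sketched in NumPy as follows. This is a minimal forward pass on toy data, not the trained model: the function names are hypothetical, the weights are random, and the outer multiplication uses the self-loop-augmented adjacency matrix without further normalization, which is a literal reading of Eq. (6) and therefore an assumption.

```python
import numpy as np

def sym_norm(a_hat):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of a self-loop-augmented adjacency."""
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def hetero_gcn_layer(h, a_sim, n_drugs, w_r, w_d):
    """One layer: homogeneous GCN per node type, then heterogeneous summation."""
    a_hat = a_sim + np.eye(a_sim.shape[0])      # adjacency augmented with identity
    a_rr = a_hat[:n_drugs, :n_drugs]            # drug-drug block
    a_dd = a_hat[n_drugs:, n_drugs:]            # disease-disease block
    h_r = np.maximum(sym_norm(a_rr) @ h[:n_drugs] @ w_r, 0.0)  # ReLU
    h_d = np.maximum(sym_norm(a_dd) @ h[n_drugs:] @ w_d, 0.0)  # ReLU
    return a_hat @ np.vstack([h_r, h_d])        # aggregate over all neighbor types

rng = np.random.default_rng(0)
n_drugs, n_diseases, k_hidden = 3, 2, 4
n_total = n_drugs + n_diseases

# Toy symmetric binary adjacency without self-loops.
a_sim = (rng.random((n_total, n_total)) > 0.5).astype(float)
a_sim = np.maximum(a_sim, a_sim.T)
np.fill_diagonal(a_sim, 0.0)

h0 = rng.random((n_total, n_total))             # toy input node features
w_r = rng.normal(size=(n_total, k_hidden))      # random stand-ins for trained weights
w_d = rng.normal(size=(n_total, k_hidden))

h1 = hetero_gcn_layer(h0, a_sim, n_drugs, w_r, w_d)
print(h1.shape)  # (5, 4)
```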

Subsequently, a layer attention mechanism, proposed by Yu et al. [29], was introduced to dynamically aggregate the output node embeddings from different layers of the HeteroGCN, which helps alleviate the over-smoothing observed in deep GNNs. For the $l$ -th HeteroGCN layer, normalized attention coefficients are calculated from the node embedding $H^{(l)}$ to determine the significance of that layer’s output for both drug and disease nodes, formulated as

\begin{matrix} α_{r}^{(l)} = \frac{\exp (H_{r}^{(l)} W_{r} q_{r}^{T})}{\sum_{l \in L} \exp (H_{r}^{(l)} W_{r} q_{r}^{T})}, & α_{d}^{(l)} = \frac{\exp (H_{d}^{(l)} W_{d} q_{d}^{T})}{\sum_{l \in L} \exp (H_{d}^{(l)} W_{d} q_{d}^{T})} \end{matrix}

(7)

where $q_{r}, q_{d} \in ℝ^{1 \times K}$ and $W_{r}, W_{d} \in ℝ^{K \times K}$ are trainable parameter matrices. The output node embedding $H \in ℝ^{(N + M) \times K}$ can be represented as

H = [\begin{matrix} H_{r} \\ H_{d} \end{matrix}] = [\begin{matrix} \sum_{l \in L} α_{r}^{(l)} H_{r}^{(l)} \\ \sum_{l \in L} α_{d}^{(l)} H_{d}^{(l)} \end{matrix}]

(8)

Finally, a bilinear inner product decoder was introduced to reconstruct and predict the drug-disease association matrix $\hat{A}$ based on node embeddings:

\hat{A} = σ (H_{r} W H_{d}^{T})

(9)

where σ is the Sigmoid activation function and $W$ is a trainable parameter matrix.
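Equations (7)–(9) can be sketched together as a softmax over layers followed by a bilinear decoder. In this minimal NumPy version the attention coefficient is computed per node and per layer (one scalar for each node in each layer), which is one reading of Eq. (7) and should be treated as an assumption; the function names and the random parameters are likewise illustrative only.

```python
import numpy as np

def layer_attention(layer_outputs, w, q):
    """Softmax-weighted sum of per-layer embeddings (Eqs. 7 and 8)."""
    stack = np.stack(layer_outputs)                       # (L, n, K)
    scores = stack @ w @ q.T                              # (L, n, 1)
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=0, keepdims=True)      # normalize across layers
    return (alpha * stack).sum(axis=0)                    # (n, K)

def bilinear_decoder(h_r, h_d, w):
    """Association probabilities sigmoid(H_r W H_d^T) (Eq. 9)."""
    logits = h_r @ w @ h_d.T
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
n_drugs, n_diseases, k_hidden, n_layers = 3, 2, 4, 2

drug_layers = [rng.normal(size=(n_drugs, k_hidden)) for _ in range(n_layers)]
disease_layers = [rng.normal(size=(n_diseases, k_hidden)) for _ in range(n_layers)]

# Random stand-ins for the trainable parameters W_r, q_r, W_d, q_d, and W.
w_r, q_r = rng.normal(size=(k_hidden, k_hidden)), rng.normal(size=(1, k_hidden))
w_d, q_d = rng.normal(size=(k_hidden, k_hidden)), rng.normal(size=(1, k_hidden))
w_dec = rng.normal(size=(k_hidden, k_hidden))

h_r = layer_attention(drug_layers, w_r, q_r)
h_d = layer_attention(disease_layers, w_d, q_d)
a_hat = bilinear_decoder(h_r, h_d, w_dec)
print(a_hat.shape)  # (3, 2)
```

Since the coefficients sum to one across layers, feeding identical embeddings for every layer returns those embeddings unchanged, which is a convenient sanity check.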

2.6.2. LLM-DDA_{Dual GNN}

LLM-DDA_{Dual GNN} leverages LLM-based embeddings to encode biomedical knowledge into a high-order vector space for drugs and diseases. We developed a drug-disease heterogeneous graph that is knowledge-intensive and can enhance the GNN-based DDA prediction method by integrating network topology with embedded biomedical knowledge. LLM-DDA_{Dual GNN} is designed as a dual HeteroGCN model. Specifically, the first channel of LLM-DDA_{Dual GNN} utilizes initial similarity-based embeddings as inputs to generate topology-based representations $H_{Sim}^{(l)}$ based on $L$ -layered HeteroGCNs, which can be represented as Eqs. (4)–(6). Concurrently, the second channel utilizes LLM-based dense embeddings to compute drug-drug and disease-disease similarities by cosine similarity. Top15 filtering is then applied to refine these similarities, selecting the most significant associations to construct the adjacency matrix $A_{LLM}$ for the LLM-based drug-disease heterogeneous graph as

A_{LLM} = [\begin{matrix} Top15 ({\tilde{H}}_{LLM (r)}^{(0)} {({\tilde{H}}_{LLM (r)}^{(0)})}^{T}) & A_{r - d} \\ {(A_{r - d})}^{T} & Top15 ({\tilde{H}}_{LLM (d)}^{(0)} {({\tilde{H}}_{LLM (d)}^{(0)})}^{T}) \end{matrix}]

(10)

where ${\tilde{H}}_{LLM (r)}^{(0)} \in ℝ^{N \times E}$ and ${\tilde{H}}_{LLM (d)}^{(0)} \in ℝ^{M \times E}$ represent the normalized LLM-based embedding matrices for drugs and diseases, respectively. Subsequently, another set of $L$ -layered HeteroGCNs is used to generate the updated LLM-based embeddings $H_{LLM}^{(l)}$ from $H_{LLM}^{(0)}$ and $A_{LLM}$ , as represented in Eqs. (4)–(6). Then, the layer attention block combines these embeddings into a final integrated embedding $H$ for DDA prediction:

H = LayerAttn (H_{Sim}^{(0)}, \dots, H_{Sim}^{(L)}, H_{LLM}^{(0)}, \dots, H_{LLM}^{(L)})

(11)

Finally, the predicted drug-disease association matrix was reconstructed by a bilinear inner product decoder (Eq. (9)).
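The construction of $A_{LLM}$ in Eq. (10) relies on the fact that the Gram matrix of row-normalized embeddings is the cosine similarity matrix. The sketch below illustrates this on random stand-in embeddings; the function names, the toy dimensions, k = 2 in place of 15, the exclusion of self-similarity, and keeping similarity values as edge weights are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(h):
    """Row-normalize embeddings, then take the Gram matrix."""
    h_norm = h / np.linalg.norm(h, axis=1, keepdims=True)
    return h_norm @ h_norm.T

def top_k_rows(sim, k):
    """Keep the k largest off-diagonal entries per row (assumed filtering rule)."""
    n = sim.shape[0]
    k = min(k, n - 1)
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)
    idx = np.argsort(-masked, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    out = np.zeros_like(sim)
    out[rows, idx] = sim[rows, idx]
    return out

def build_llm_adjacency(h_llm_r, h_llm_d, a_rd, k=15):
    """Assemble A_LLM with the block layout of Eq. (10)."""
    s_r = cosine_similarity_matrix(h_llm_r)
    s_d = cosine_similarity_matrix(h_llm_d)
    return np.block([[top_k_rows(s_r, k), a_rd],
                     [a_rd.T, top_k_rows(s_d, k)]])

rng = np.random.default_rng(0)
n_drugs, n_diseases, emb_dim = 5, 4, 8
h_llm_r = rng.random((n_drugs, emb_dim))      # stand-ins for LLM description embeddings
h_llm_d = rng.random((n_diseases, emb_dim))
a_rd = (rng.random((n_drugs, n_diseases)) > 0.7).astype(float)

a_llm = build_llm_adjacency(h_llm_r, h_llm_d, a_rd, k=2)
print(a_llm.shape)  # (9, 9)
```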

2.6.3. LLM-DDA_GNN-AE

Inspired by previous DDA prediction studies that utilized the Auto-Encoder for feature deduction and demonstrated its contribution to multi-source drug and disease representation learning [49, 50], LLM-DDA_GNN-AE was proposed as a dual-channel model, with an AE channel generating deduced LLM-based embeddings and a GNN-based channel generating network topology-based embeddings. Specifically, similar to LLM-DDA_{Node Feat} and LLM-DDA_{Dual GNN}, $L$ -layered HeteroGCNs were constructed in the GNN-based channel to obtain $H_{Sim}$ based on Eqs. (4)–(6). Then, a two-layered dense neural network was employed as the AE, which takes the initial LLM-based embeddings as input and produces higher-order embeddings $H_{LLM}^{(1)} \in ℝ^{(N + M) \times E}$ and $H_{LLM}^{(2)} \in ℝ^{(N + M) \times E}$ :

H_{LLM}^{(1)} = [\begin{matrix} H_{LLM (r)}^{(1)} \\ H_{LLM (d)}^{(1)} \end{matrix}] = [\begin{matrix} H_{LLM (r)}^{(0)} W_{r}^{(1)} + b_{r}^{(1)} \\ H_{LLM (d)}^{(0)} W_{d}^{(1)} + b_{d}^{(1)} \end{matrix}]

(12)

H_{LLM}^{(2)} = [\begin{matrix} H_{LLM (r)}^{(1)} W_{r}^{(2)} + b_{r}^{(2)} \\ H_{LLM (d)}^{(1)} W_{d}^{(2)} + b_{d}^{(2)} \end{matrix}] = [\begin{matrix} (H_{LLM (r)}^{(0)} W_{r}^{(1)} + b_{r}^{(1)}) W_{r}^{(2)} + b_{r}^{(2)} \\ (H_{LLM (d)}^{(0)} W_{d}^{(1)} + b_{d}^{(1)}) W_{d}^{(2)} + b_{d}^{(2)} \end{matrix}]

(13)

where $W$ and $b$ are the trainable parameter matrices in each layer. Similarly, layer attention was adopted to aggregate the output embeddings from each HeteroGCN and AE layer based on Eq. (11). Finally, a bilinear inner product decoder predicted the final drug-disease association probability matrix based on Eqs. (8) and (9).
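Reading Eqs. (12) and (13) as two stacked affine layers with separate parameters for drugs and diseases, the AE channel can be sketched as below. No activation function is applied because none appears in the equations; the function name, toy sizes, and random parameters are assumptions for illustration.

```python
import numpy as np

def two_layer_encoder(h0, w1, b1, w2, b2):
    """Return (H^(1), H^(2)): first and second dense-layer outputs."""
    h1 = h0 @ w1 + b1
    h2 = h1 @ w2 + b2
    return h1, h2

rng = np.random.default_rng(0)
n_drugs, n_diseases, emb_dim = 3, 2, 6

h0_r = rng.normal(size=(n_drugs, emb_dim))      # stand-in drug description embeddings
h0_d = rng.normal(size=(n_diseases, emb_dim))   # stand-in disease description embeddings

def random_params():
    """Random stand-ins for one (W, b) pair mapping E -> E."""
    return rng.normal(size=(emb_dim, emb_dim)), rng.normal(size=(emb_dim,))

w1_r, b1_r = random_params()
w2_r, b2_r = random_params()
w1_d, b1_d = random_params()
w2_d, b2_d = random_params()

h1_r, h2_r = two_layer_encoder(h0_r, w1_r, b1_r, w2_r, b2_r)
h1_d, h2_d = two_layer_encoder(h0_d, w1_d, b1_d, w2_d, b2_d)

# Stack drug and disease rows to obtain the (N + M) x E matrices.
h_llm_1 = np.vstack([h1_r, h1_d])
h_llm_2 = np.vstack([h2_r, h2_d])
print(h_llm_1.shape, h_llm_2.shape)  # (5, 6) (5, 6)
```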

2.7. Optimization

The above three variants of LLM-DDA were optimized with a weighted cross-entropy loss function, which balances the two classes and places more weight on known drug-disease associations. The loss function is formulated as

L = - \frac{1}{N} (γ \sum_{(i, j) \in S^{+}} \log {\hat{A}}_{i j} + \sum_{(i, j) \in S^{-}} \log (1 - {\hat{A}}_{i j}))

(14)

where $γ = \frac{|S^{-}|}{|S^{+}|}$ is the balance weight, $|S^{+}|$ and $|S^{-}|$ are the numbers of known and unknown drug-disease associations in the training set, respectively, and ${\hat{A}}_{i j}$ is the predicted association probability between drug $i$ and disease $j$ .
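The weighted loss can be sketched in NumPy as follows, assuming the standard cross-entropy form in which negative pairs contribute $\log (1 - {\hat{A}}_{i j})$. Two further assumptions are made for illustration: the normalizer is taken to be the total number of scored pairs, and a small constant `eps` guards against taking the logarithm of zero. The function name is hypothetical.

```python
import numpy as np

def weighted_cross_entropy(a_hat, a_true, eps=1e-12):
    """Class-balanced cross-entropy over all drug-disease pairs."""
    pos = a_true == 1
    neg = ~pos
    gamma = neg.sum() / pos.sum()                       # |S-| / |S+|
    pos_term = gamma * np.log(a_hat[pos] + eps).sum()   # known associations, up-weighted
    neg_term = np.log(1.0 - a_hat[neg] + eps).sum()     # unknown associations
    return -(pos_term + neg_term) / a_true.size         # assumed normalizer: number of pairs

a_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
uniform = np.full(a_true.shape, 0.5)   # uninformative prediction
perfect = a_true.astype(float)         # ideal prediction

print(weighted_cross_entropy(uniform, a_true))
print(weighted_cross_entropy(perfect, a_true))
```

With $γ$ defined this way, the positive and negative classes carry equal total weight in the loss regardless of how sparse the association matrix is.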

The Adam optimizer was used for model optimization, and the trainable parameters in each layer were initialized with Xavier initialization [51]. Moreover, dropout and batch normalization layers were adopted to inhibit overfitting.

2.8. Experimental Settings

We employed fivefold cross-validation to evaluate the performance of LLM-DDA and to facilitate comparison with baseline methods. In this setup, known drug-disease associations (DDAs) were treated as positive samples, while all unknown DDAs were considered negative. Each fold was composed of 20% of the positive and 20% of the negative samples, with four folds used for training and one reserved for validation. This strategy ensured that every sample in the datasets was used for validation exactly once. To prevent data leakage, any DDAs present in the test set were excluded from the training graphs. Given the inherent label imbalance in DDA prediction benchmarks, we utilized several metrics for performance assessment: area under the receiver operating characteristic curve (AUC), area under the Precision-Recall curve (AUPR), F1-score, and Precision.
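The splitting and leakage-prevention steps described above might be implemented along the following lines. This is a sketch under stated assumptions: the function name, the random seed, and the toy association matrix are hypothetical, and held-out positives are hidden by zeroing their entries in the training association matrix.

```python
import numpy as np

def five_fold_splits(a_rd, n_splits=5, seed=0):
    """Yield (training association matrix, held-out pairs) for each fold."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.argwhere(a_rd == 1))   # known DDAs (positives)
    neg = rng.permutation(np.argwhere(a_rd == 0))   # unknown pairs (negatives)
    pos_folds = np.array_split(pos, n_splits)
    neg_folds = np.array_split(neg, n_splits)
    for k in range(n_splits):
        test_pairs = np.vstack([pos_folds[k], neg_folds[k]])
        train_a = a_rd.copy()
        # Remove held-out positives from the training graph to avoid leakage.
        train_a[pos_folds[k][:, 0], pos_folds[k][:, 1]] = 0
        yield train_a, test_pairs

rng = np.random.default_rng(1)
a_rd = (rng.random((6, 5)) > 0.6).astype(int)       # toy drug-disease matrix

for fold, (train_a, test_pairs) in enumerate(five_fold_splits(a_rd)):
    print(fold, int(train_a.sum()), len(test_pairs))
```

Each drug-disease pair lands in exactly one held-out fold, so the five folds together cover the whole matrix.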

For the LLM-DDA model, we configured the number of HeteroGCN layers to two, set the hidden vector dimensions at 128, applied a dropout rate of 0.4, and ran the models for 5000 epochs with a learning rate of 0.01. The hyperparameter settings for baseline methods were adopted directly from their respective original literature to ensure fairness in comparisons.

3. Results

This study conducted computational experiments to evaluate several aspects of integrating LLM embeddings into GNN-based models for drug-disease association (DDA) prediction. We specifically focused on and addressed the following questions: (Q1). Model Architecture: Which model architecture is most effective when incorporating LLM embeddings, such as those generated by GPT-4 or BioBERT, into traditional GNN-based models? (Q2). Performance and Stability: How do the performances and stabilities of the enhanced models (LLM-DDA) compare across four different datasets? (Q3). Comparative Analysis: Is the LLM-DDA approach competitive with existing DDA prediction baselines? (Q4). Embedding Impact: To what extent do LLM-based embeddings contribute to the accuracy of DDA predictions within the LLM-DDA framework? (Q5). Application Potential: Can LLM embeddings be effectively applied to the discovery of new indications and the extraction of knowledge for query drugs and diseases?

3.1. For Q1: Model Performance Comparison in LLM-DDA

Our fivefold cross-validation experiments evaluated the efficacy of three distinct model architectures integrated with two types of LLM-based embeddings for drug-disease association (DDA) prediction.

Architecture comparison: The performances of LLM-DDA_{Node Feat}, LLM-DDA_{Dual GNN}, and LLM-DDA_GNN-AE on the four benchmark datasets are presented in Fig. 2. LLM-DDA_GNN-AE achieved the best performance on all four datasets. LLM-DDA_{Node Feat} performed worst, which indicates that simply combining LLM-based embeddings with network similarity features does not increase model capacity; integrating LLM-based embeddings into general GNN-based DDA prediction methods therefore requires a more elaborate architecture design. LLM-DDA_{Dual GNN} performed moderately, with mild performance gaps relative to LLM-DDA_GNN-AE. One possible explanation is that LLM-based embeddings already encode high-order association information for drugs and diseases, through the overlap in knowledge descriptions between similar drugs or diseases. Updating such high-order features over a lower-order topology network could therefore degrade the drug and disease representations, so that the model fails to fully exploit the complex relationships and patterns inherent in the LLM features. For the best-performing LLM-DDA_GNN-AE, the results indicate that the autoencoder is more effective for updating and integrating LLM-based embeddings, as it can preserve the high-order association information they carry.

We assessed LLM-DDA_{Node Feat}, LLM-DDA_{Dual GNN}, and LLM-DDA_GNN-AE across four benchmark datasets, as shown in Fig. 2. The LLM-DDA_GNN-AE model demonstrated superior performance over the other architectures, effectively leveraging the high-order association information provided by LLM-based embeddings. In contrast, LLM-DDA_{Node Feat} exhibited the weakest performance, indicating that simply merging LLM embeddings with network similarity features does not sufficiently enhance model capacity. This underscores the need for more sophisticated integration techniques in GNN-based DDA prediction methods. LLM-DDA_{Dual GNN} displayed moderate performance, suggesting that the full potential of LLM embeddings might not be realized when updated through simpler network topologies. This degradation in feature representation could hinder the model’s ability to capitalize on the complex relationships encoded in the LLM features. Our findings highlight that an autoencoder framework, such as that used in LLM-DDA_GNN-AE, is more adept at maintaining and updating high-order embeddings effectively.

Embedding Generator Comparison: Using LLM-DDA_GNN-AE as the reference model, we compared the performance impacts of different LLM embedding generators—GPT-4 and BioBERT (Fig. 3). The performance differences between models using GPT-4 and BioBERT were minimal, indicating that both a general large-scale LLM like GPT-4 and a domain-specific LLM like BioBERT are effective at encoding biomedical knowledge into usable vectors for DDA prediction. This suggests that the choice between these embedding generators can be based on other factors such as computational resources or specific model requirements, rather than efficacy alone.

Fig. 3 — Performance comparison for two LLM embedding generators on four datasets in fivefold cross-validation

3.2. For Q2: Cross-Validation on Four Datasets

We utilized fivefold cross-validation results to evaluate the predictive performance and stability of LLM-DDA_GNN-AE across four benchmark datasets. Following the experimental settings outlined earlier, each dataset was divided into training and validation sets in an 8:2 ratio for each fold, ensuring no overlap among validation sets. We plotted the AUC and AUPR curves for each fold and for the aggregate of all validation sets (Overall), as shown in Fig. 4. The results demonstrate that LLM-DDA_GNN-AE consistently delivered strong AUC performance across all folds, with minimal deviation, indicating high stability of the model. In terms of AUPR, we observed larger fluctuations across different folds, which is attributable to the impact of dataset splitting on performance metrics in imbalanced datasets. Notably, the AUPR was more consistent across folds of the better-balanced B-dataset, showing smaller performance variations compared to other, more imbalanced datasets.
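The different sensitivity of AUC and AUPR to class imbalance can be seen in a small worked example. The sketch below is an illustration on toy scores, not the evaluation code used in this study: `roc_auc` uses the pairwise-ranking definition of AUC, and `average_precision` uses the step-wise approximation of the area under the precision-recall curve, which may differ slightly from a trapezoidal AUPR. Both function names and all data are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve: the mean of the
    precision values at the rank of each positive sample.
    Tied scores are ranked in input order."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy data (hypothetical): the same two positives, first with 2 negatives,
# then with each negative repeated five times (10 negatives in total).
balanced_labels = [1, 0, 1, 0]
balanced_scores = [0.9, 0.8, 0.7, 0.1]
skewed_labels = [1, 1] + [0] * 10
skewed_scores = [0.9, 0.7] + [0.8] * 5 + [0.1] * 5

for name, y, s in [("balanced", balanced_labels, balanced_scores),
                   ("skewed", skewed_labels, skewed_scores)]:
    print(name, round(roc_auc(y, s), 4), round(average_precision(y, s), 4))
```

In this toy case the AUC is 0.75 for both class ratios, whereas the average precision falls from about 0.83 to about 0.64 once negatives outnumber positives five to one. This is consistent with AUPR being the more split-sensitive metric on imbalanced datasets, although it does not by itself explain the fold-to-fold variation observed here.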

Fig. 4 — AUC and AUPR curves of LLM-DDA_GNN-AE on four datasets in fivefold cross-validation

These findings highlight not only the robustness and effectiveness of the LLM-DDA_GNN-AE model but also underscore its consistent performance across various dataset conditions, reinforcing the reliability of the LLM-DDA approach in handling diverse and imbalanced data.

3.3. For Q3: Model Performance Comparison Against Baseline Methods

To benchmark the LLM-DDA_GNN-AE model, we compared its performance with eleven baseline methods drawn from previous studies, including two baselines developed by omitting the LLM-based embeddings from LLM-DDA: the reproduced LAGCN model (layer attention graph convolutional network) [29] and DirectPred from GPT-4 turbo. The baselines span different categories of computational drug repositioning: machine learning-based (DDA-SKF [17]), matrix completion/factorization-based (SCPMF [19]), and deep learning-based (NIMCGCN [52], HAN [53], MHGNN [54], DRWBNCF [31], REDDA [33], PSGCN [30], LAGCN [29], HDGAT [55], and DirectPred). All hyperparameter settings for the baselines were taken from their original studies or the accompanying codebases. Brief introductions to these methods are provided in the Supplementary Materials.

Using fivefold cross-validation, we assessed four performance metrics (AUC, AUPR, F1-Score, and Precision) across the four datasets; the results are presented in Tables 4 to 8. LLM-DDA_GNN-AE demonstrated superior performance on most metrics. On the B-dataset, it was second only to SCPMF, which marginally outperformed it on all evaluated metrics. On the other datasets, LLM-DDA_GNN-AE achieved the best AUPR, F1-Score, and Precision. When results were averaged (Table 8), LLM-DDA_GNN-AE exhibited relative improvements of 23.22% in AUPR, 17.20% in F1-Score, and 25.35% in Precision over the second-best model, while maintaining comparable AUC.
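A relative improvement of this kind can be computed as in the sketch below. The formula (new − reference) / reference is an assumption, since the exact definition is not stated here, and the helper names are hypothetical. The values are the B-dataset results from Table 4, restricted to the three baselines with the highest AUPR; the averaged figures of Table 8 are not reproduced.

```python
METRICS = ("AUC", "AUPR", "F1-Score", "Precision")

# B-dataset values copied from Table 4 (three strongest baselines by AUPR).
baselines = {
    "SCPMF": (0.859, 0.511, 0.509, 0.468),
    "REDDA": (0.847, 0.490, 0.494, 0.444),
    "LAGCN": (0.811, 0.493, 0.438, 0.370),
}
proposed = (0.847, 0.499, 0.497, 0.462)  # LLM-DDA_GNN-AE

def relative_change(new, reference):
    """Signed change of `new` relative to `reference` (assumed definition)."""
    return (new - reference) / reference

def versus_best_baseline(proposed, baselines):
    """For each metric, compare the proposed model with the best baseline."""
    result = {}
    for k, metric in enumerate(METRICS):
        name, row = max(baselines.items(), key=lambda item: item[1][k])
        result[metric] = (name, relative_change(proposed[k], row[k]))
    return result

for metric, (name, change) in versus_best_baseline(proposed, baselines).items():
    print(f"{metric}: {change:+.2%} vs {name}")
```

On the B-dataset every change is slightly negative (between roughly -1% and -2.5%), with SCPMF as the reference for all four metrics, which matches the observation that SCPMF marginally outperformed LLM-DDA_GNN-AE on this dataset.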

Table 4.

The AUC, AUPR, F1-Score, and Precision results of LLM-DDA and baseline methods on B-dataset in fivefold cross validation

Model	AUC	AUPR	F1-Score	Precision
DDA-SKF	0.701	0.252	0.328	0.259
SCPMF	0.859	0.511	0.509	0.468
NIMCGCN	0.667	0.233	0.290	0.223
HAN	0.695	0.256	0.323	0.258
MHGNN	0.574	0.160	0.222	0.136
DRWBNCF	0.838	0.455	0.474	0.428
REDDA	0.847	0.490	0.494	0.444
PSGCN	0.814	0.392	0.432	0.365
LAGCN	0.811	0.493	0.438	0.370
HDGAT	0.828	0.461	0.461	0.415
DirectPred	0.510	0.171	0.205	0.114
LLM-DDA_GNN-AE	0.847	0.499	0.497	0.462
