Summary
Recent advancements in large language models (LLMs) have demonstrated their potential in scientific reasoning, but their ability to generate open-ended hypotheses in data-scarce domains remains underexplored. Here, we introduce Combinatorial Alzheimer's Disease Therapeutic Efficacy Decision (Coated-LLM), an AI-driven framework, inspired by scientific collaboration, that predicts efficacious combinatorial therapy when data-driven prediction is infeasible. Coated-LLM employs multiple specialized LLM agents (Researcher, Reviewers, and Moderator) to systematically generate and evaluate hypotheses through several in-context learning techniques. Using Alzheimer's disease (AD) as a test case, Coated-LLM outperformed traditional knowledge-based methods (accuracy: 0.74 vs. 0.52), with external validation achieving an accuracy of 0.82. In addition, a drug combination predicted by Coated-LLM was experimentally validated to significantly reduce amyloid aggregation in vitro. These findings highlight the potential of our framework to augment human reasoning in complex scientific tasks, offering a scalable approach for hypothesis generation in biomedical research.
Subject areas: Health sciences, Medicine, Drugs, Artificial intelligence
Graphical abstract

Highlights
• Multi-agent LLM enhances human hypothesis generation in open scientific domains
• Few-shot LLM framework offers a viable alternative when little data is available
• M266 + Gypenoside XVII validated to significantly reduce amyloid aggregation in vitro
Introduction
Recent advancements in large language models (LLMs) have demonstrated their disruptive potential in scientific discovery. These models have efficiently tackled combinatorial optimization problems, often surpassing traditional heuristics.1,2 LLMs have also shown success in various chemistry and materials science tasks, such as predicting molecular properties and chemical reaction yields,3 and autonomously searching for chemicals.4,5 Our research aligns with efforts to develop “autonomous scientists” inspired by scientific collaboration.
In scientific investigation, given a question, human researchers apply deductive and inductive reasoning to derive predictions and draw conclusions.6 However, such traditional scientific reasoning is often limited by human bias and cognitive capacity.6 Researchers may exhibit confirmation bias, favoring data that supports their preconceptions; struggle to process the vast amount of existing literature, leading to incomplete reviews and overlooked insights; and have difficulty managing multiple factors and identifying subtle patterns.
LLMs can mitigate these limitations by assisting human scientific reasoning, as evidenced in several prior studies.7,8,9 Recent reasoning-oriented LLMs, such as OpenAI o1 and DeepSeek V3,10 have achieved human-level performance on closed-form questions where the answer is already known, but they remain underexplored for open-ended biomedical hypothesis generation, where neither the reasoning path nor the answer is well defined. While data-driven machine learning models work well when abundant data exist, our focus is on data-scarce research areas, which are more common in real-world scientific discovery. Recent work by Qi et al. has demonstrated that LLMs can generate novel and validated biomedical hypotheses.11,12 However, their approach requires background knowledge extracted from existing literature as input to generate hypotheses, which may not be available in data-scarce domains. Building on this motivation, we chose a specific biomedical question that requires deep domain knowledge and critical reasoning without definite answers to test LLMs' capacity beyond memorization: identifying effective drug combinations for in vivo experiments in complex systemic diseases. We focus on Alzheimer's disease (AD), a complex neurodegenerative condition in which multiple disease etiologies are entangled; thus, comprehensive consideration of multiple underlying mechanisms is critical when developing therapeutics.
Drug combination therapy is the use of two or more therapeutic agents to treat a single disease, aiming for a more effective treatment outcome than what could be achieved with a single drug. This approach is particularly prevalent in the treatment of complex diseases such as diabetes and metabolic syndrome, cardiovascular disease, cancer, and others,13,14,15,16,17,18,19,20 but it has seen no success in AD due to the multifactorial nature of the disease. To date, only Donepezil + Memantine has been FDA-approved for AD. Developing effective combinatorial therapies faces significant challenges, particularly in selecting the right therapeutic agents (drugs) and relevant in vivo models. Researchers make specific predictions about the potential efficacy of various combinations from general principles and known mechanisms of action (deductive reasoning). The complexity arises from the combinatorial growth in the number of possible combinations (therapeutic agent 1 ∗ therapeutic agent 2 ∗ in vivo model), making manual evaluation impractical.
Only a few studies have explored the use of LLMs for such combinatorial search problems. A recent study9 introduced CancerGPT, an LLM-based model designed for predicting drug pair synergy in rare tissues with limited data, demonstrating the capability of LLMs to handle complex biological inference tasks. However, CancerGPT primarily focuses on fine-tuning LLMs using high-throughput in vitro experimental data, which is not available for most complex diseases. In contrast, we propose a versatile and generalizable LLM-based framework, named Combinatorial Alzheimer's disease Therapeutic Efficacy Decision (Coated-LLM), designed specifically to generate biomedical hypotheses even in data-scarce settings. Our approach, inspired by collective human scientific reasoning, employs multiple specialized LLM agents (Researcher, Reviewers, Moderator) to generate biomedical hypotheses. Critically, our framework identified a novel combination therapy, the m266 antibody with Gypenoside XVII, that experimentally demonstrated superior inhibition of amyloid beta aggregation compared with either treatment alone, highlighting a previously unexplored synergistic combination therapy in AD. Comprehensive computational evaluation alongside these in vitro validations demonstrates that Coated-LLM effectively identifies potent therapeutic combinations, significantly augmenting human capabilities for scientific discovery in complex diseases.
Results
Model summary
We create an AI model f that automates scientific reasoning to generate hypotheses on efficacious combinatorial therapy for in vivo experiments (Algorithm 1; Figure 1). Inspired by human scientific collaboration, our framework consists of multiple LLM agents playing different roles: Researcher, Reviewers, and Moderator. The Researcher generates a series of reasoning steps to propose a prediction on the efficacy of combinatorial therapeutic agents. Multiple Reviewers review and criticize the quality of the prediction generated by the Researcher and offer feedback. Finally, the Moderator integrates the Researcher's proposed prediction and the Reviewers' feedback to suggest a more valid prediction.
Algorithm 1. Framework for Drug Combination Efficacy in Alzheimer's Disease.
Input: Dataset D = {(t1, t2, m)}, external knowledge base
Output: Predictions A_Q* for the test set
Split D into training set D_train and test set D_test

# Phase I - Warm Up:
learning_examples = []
learning_question_embeddings = []
for each (t1, t2, m) in D_train:
    q = transform_into_question(t1, t2, m)
    B_q = search_external_knowledge(t1, t2)
    E_q = generate_learning_question_embedding(q)
    (C_q, A_q) = LLM_Researcher(instruction, q, B_q)
    if correct(A_q) then
        learning_examples.append((q, C_q, A_q))
        learning_question_embeddings.append(E_q)

# Phase II - Inference:
for each (t1, t2, m) in D_test:
    Q = transform_into_question(t1, t2, m)
    B_Q = search_external_knowledge(t1, t2)
    E_Q = generate_testing_question_embedding(Q)
    top_k_examples = select_top_k_similar_questions(E_Q, learning_question_embeddings, learning_examples, k=5)
    input = concatenate(top_k_examples, Q, B_Q)
    hypotheses = []
    repeat 5 times:
        (C_Q, A_Q) = LLM_Researcher(instruction, input)
        hypotheses.append((C_Q, A_Q))
    A_Q* = majority_vote([A_Q for _, A_Q in hypotheses])
    C_Q* = select_longest_CoT([C_Q for C_Q, A_Q in hypotheses if A_Q == A_Q*])

# Phase III - Revision:
for each (t1, t2, m) in D_test:
    Q = transform_into_question(t1, t2, m)
    F_Q = LLM_Reviewers(instruction, C_Q*, A_Q*)
    hypotheses = []
    repeat 5 times:
        (C_Q, A_Q) = LLM_Moderator(instruction, Q, F_Q, C_Q*, A_Q*)
        hypotheses.append((C_Q, A_Q))
    A_Q* = majority_vote([A_Q for _, A_Q in hypotheses])
Return A_Q*
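The Phase II example-selection step (select_top_k_similar_questions) can be sketched in runnable form. This is a minimal illustration with toy 2-dimensional vectors standing in for real question embeddings; the embedding model, the example format, and the variable names are our assumptions, not the study's implementation.

```python
# Hypothetical sketch of kNN-based dynamic few-shot example selection.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def select_top_k_similar_questions(test_embedding, learning_embeddings, learning_examples, k=5):
    """Rank learning examples by cosine similarity to the test question embedding."""
    scored = [
        (cosine_similarity(test_embedding, emb), ex)
        for emb, ex in zip(learning_embeddings, learning_examples)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]

# Toy usage: real embeddings would come from a text-embedding model.
examples = ["q1", "q2", "q3"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(select_top_k_similar_questions([1.0, 0.0], embeddings, examples, k=2))  # → ['q1', 'q3']
```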
Figure 1.
Study overview
(A) Traditional approach: Given the vast number of possible drug combinations, human experts rely on general principles and known mechanisms of action to manually select potential candidates. The top-scoring combinations are then subjected to in vitro experiments to evaluate their efficacy.
(B) Coated-LLM workflow: Coated-LLM is a structured framework, inspired by collective scientific discovery processes, that generates hypotheses on efficacious combinatorial therapy. For a target drug combination, the Researcher learns from the top five most relevant questions among the learning examples, generates a series of reasoning steps to propose a prediction on efficacy, and produces a consistent prediction. Multiple Reviewers then provide feedback, and the Moderator integrates the consistent prediction from the Researcher with the feedback from the Reviewers to generate the final consensus prediction.
(C) In vitro Experiments: Coated-LLM selects the most promising drug combinations from false-positive augmented data for in vitro efficacy testing.
Throughout all the communication among LLM agents, we prompt the agents to utilize various in-context learning techniques, such as integrating external biomedical knowledge via retrieval augmented generation (RAG),21 few-shot learning,22 self-generated chain-of-thought (CoT),23 tree of thoughts (ToT),24 and self-consistency.25
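The self-consistency step shared by the Researcher and Moderator (a majority vote over repeated runs, keeping the longest chain-of-thought among runs that agree with the majority answer) can be sketched as below. The hypothesis strings are hypothetical stand-ins for real LLM outputs.

```python
# Minimal sketch of the self-consistency (n = 5) aggregation step.
from collections import Counter

def self_consistent_answer(hypotheses):
    """hypotheses: list of (chain_of_thought, answer) pairs from repeated LLM runs."""
    answers = [a for _, a in hypotheses]
    majority = Counter(answers).most_common(1)[0][0]
    # Keep the most detailed reasoning among runs agreeing with the majority.
    best_cot = max((c for c, a in hypotheses if a == majority), key=len)
    return best_cot, majority

runs = [
    ("short reasoning", "Positive"),
    ("a much longer and more detailed reasoning trace", "Positive"),
    ("reasoning", "Non-positive"),
    ("other reasoning", "Positive"),
    ("brief", "Non-positive"),
]
cot, answer = self_consistent_answer(runs)
print(answer)  # → Positive (3 of 5 runs agree)
```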
Data collection and augmentation
Following literature mining, we identified 242 articles reporting 250 drug combinations with positive efficacy and 30 with negative efficacy (Figure 2A). In comparison, even the rare-tissue subsets in the CancerGPT study (e.g., 352 samples for soft tissue, 1,190 for stomach)9 are significantly larger, highlighting the relative scarcity of AD-related combination data. In addition, our literature collection showed severe positive bias, which we addressed through data augmentation. As a result, we obtained a total of 530 combinations: 250 with positive efficacy and 280 with non-positive efficacy (30 negative combinations from the literature plus 250 augmented combinations with noisy, non-positive efficacy). Figure 2B summarizes the top five most frequently mentioned terms across therapeutic agents, animal models, and pathways.
Figure 2.
Distribution of drug combinations and efficacy in the literature
(A) Data collection from literature. The process began with an initial pool of articles from the AlzPED, followed by additional searches conducted in PubMed. Articles were screened and excluded based on predefined criteria. The final selected literature included articles that reported drug combinations with positive or negative efficacy.
(B) Top 5 frequent terms in therapeutic agents, animal models, and pathways.
Model development
In the warmup phase, the Researcher correctly predicted 231 training combinations (134 with positive efficacy; 97 with non-positive efficacy), which became learning examples for the inference phase. Following prior work,1 we used k = 5 examples for few-shot learning, achieving a high average similarity score between test questions and the five selected examples (0.919 ± 0.017) while maintaining contextual diversity among the selected examples. Further, to demonstrate that the selected questions were more relevant to the test questions than the other learning examples, we calculated the mean cosine distance between the test question embeddings and the learning example embeddings. The mean top-five average cosine distance was 0.08 (variance: 0.0002), while the mean overall average cosine distance was 0.13 (variance: 0.0003). Figure S2 provides a visual representation of the similarities between the target combination and the learning examples. In the revision phase, among 156 testing combinations, 129 demonstrated more than 80% consistency across 5 rounds, with 83 of them achieving 100% consistency.
Coated-LLM achieved significant accuracy in predicting drug combination efficacy
The Coated-LLM framework significantly surpassed the traditional network-based approach (no data-driven machine learning models available) in predicting the efficacy of drug combinations. Specifically, on a test set with 156 drug combinations, Coated-LLM achieved an accuracy of 0.74, precision of 0.71, recall of 0.80, and an F1-score of 0.75, with an average confidence of 0.87 and ECE of 0.17. Table 1 presents the contingency table for Coated-LLM's predictions and examples of misclassifications. In comparison, the traditional network-based model yielded substantially lower performance metrics, with an accuracy of 0.52, precision of 0.46, recall of 0.16, and an F1-score of 0.24 (Table S1). Our model demonstrates superior flexibility by effectively accommodating therapeutic agents beyond conventional pharmaceuticals, such as membrane-free stem cell extracts, for which traditional gene-target data are often unavailable. This high accuracy, achieved without data-driven model training, underscores the ability of Coated-LLM to identify effective combinatorial therapies in AD in a scalable way.
Table 1.
Contingency table of prediction outcomes with examples using Coated-LLM
| Therapeutic agent 1 | Therapeutic agent 2 | Animal model | Predicted efficacy | Actual efficacy (reference) |
|---|---|---|---|---|
| True positive (n = 61) | ||||
| Lycopene | Vitamin E | Tau P301L | Positive | Positive26 |
| Donepezil | Fluoroethylnormemantine | Swiss OF-1 mice | Positive | Positive27 |
| False positive (n = 25) | ||||
| Cholesterol | Homocysteine | Sprague-Dawley rats | Positive | Non-positive28 |
| m266 | Gypenoside XVII | 3xTg-AD transgenic mice | Positive | Non-positive (Augmented data) |
| False negative (n = 15) | ||||
| Gallic Acid | Sodium Arsenite | Male rats | Non-positive | Positive29 |
| AMD3100 | L-Lactate | 3xTg | Non-positive | Positive30 |
| True negative (n = 55) | ||||
| Atorvastatin | Farnesol | C57BL/6 | Non-positive | Non-positive31 |
| Galantamine | Mecamylamine | ICR | Non-positive | Non-positive32 |
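The reported test-set metrics can be reproduced directly from the contingency counts in Table 1 (TP = 61, FP = 25, FN = 15, TN = 55), taking positive efficacy as the positive class:

```python
# Sanity check of the reported metrics from the Table 1 contingency counts.
tp, fp, fn, tn = 61, 25, 15, 55

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 116 / 156
precision = tp / (tp + fp)                   # 61 / 86
recall = tp / (tp + fn)                      # 61 / 76
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.74 0.71 0.8 0.75
```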
Retrospective in vitro validation
To evaluate the generalizability of Coated-LLM and address data leakage concerns, we conducted in vitro validation with an independent private dataset. This dataset comprises eleven drug combinations with in vitro efficacy measured in cell lines: nine non-positive and two positive, skewed toward negative efficacy. Although our model is developed for predicting in vivo efficacy, we assume that a model for in vivo efficacy can also capture in vitro efficacy. Despite the increased difficulty due to this realistic distribution skew, Coated-LLM achieved an accuracy of 0.82 (Table 2), whereas the baseline network-based model achieved 0.27 (Table S2). Table 2 presents the contingency table for Coated-LLM's predictions and examples of misclassifications in the external validation. The precision, recall, and F1-score for our model were each 0.50, reflecting the challenging nature of the task but still outperforming the baseline.
Table 2.
Contingency table of prediction outcomes for the external data with examples
| Therapeutic agent 1 | Therapeutic agent 2 | Model | Predicted efficacy | Actual efficacy |
|---|---|---|---|---|
| True positive (n = 1) | ||||
| Galantamine | Caffeine | HT22 Mouse Hippocampal Neuronal Cell Line | Positive | Positive |
| False positive (n = 1) | ||||
| Donepezil | Salicylic | HT22 Mouse Hippocampal Neuronal Cell Line | Positive | Non-positive |
| False negative (n = 1) | ||||
| Galantamine | Mifepristone | HT22 Mouse Hippocampal Neuronal Cell Line | Non-positive | Positive |
| True negative (n = 8) | ||||
| Galantamine | Diclofenac | HT22 Mouse Hippocampal Neuronal Cell Line | Non-positive | Non-positive |
| Rivastigmine | Lithium | HT22 Mouse Hippocampal Neuronal Cell Line | Non-positive | Non-positive |
Experimentally validated drug combinations demonstrate our model’s practical utility
To demonstrate practical utility, we prospectively tested the top-ranked drug combinations predicted by Coated-LLM through in vitro amyloid beta aggregation experiments. We selected the three most promising drug combinations from the false positives for experimental validation (Table 3). Across a total of 24 experimental conditions, we assessed individual drugs, their combinations, and controls using two independent assays. Among these, Gypenoside alone resulted in approximately 50% inhibition of Aβ42 aggregation (Figure 3). m266 alone, an anti-amyloid beta monoclonal antibody, did not show significant inhibition. Remarkably, Gypenoside combined with m266 exhibited even greater inhibition of Aβ42 aggregation. The synergy between Gypenoside and m266 not only validates the predictive capabilities of Coated-LLM but also highlights its potential to uncover novel and effective combinatorial therapies that could otherwise remain unexplored.
Table 3.
Selected drug combination candidates
| Therapeutic agent 1 | Therapeutic agent 2 | Researcher’s reasoning |
|---|---|---|
| M266, an anti-amyloid beta (Aβ) monoclonal antibody | Gypenoside XVII, a bioactive compound derived from Gynostemma pentaphyllum | m266 directly targets and clears β-amyloid plaques. Gypenoside XVII activates ERβ, which can reduce plaque production, enhance plaque clearance, and provide neuroprotection. Gypenoside XVII could potentiate m266’s plaque-clearing effect by enhancing clearance through ERβ activation. |
| Acamprosate, an NMDA receptor modulator | Melatonin, a biogenic amine | Acamprosate reduces glutamate-induced neurotoxicity. Melatonin has neuroprotective properties, including anti-amyloid, antioxidant, and anti-inflammatory effects. Melatonin increases GABA receptor expression, which could potentially enhance the GABAergic effects of Acamprosate. Both drugs have anti-inflammatory properties, which could provide an additive neuroprotective effect. |
| Memantine, an NMDA receptor antagonist | Atorvastatin, an HMG-CoA reductase inhibitor | Memantine blocks NMDA receptors, reducing excitotoxicity and slowing disease progression. Atorvastatin provides neuroprotection, reduces beta-amyloid peptide production, enhances cerebral blood flow, and modulates the immune response. Memantine’s reduction of excitotoxicity could be enhanced by Atorvastatin’s neuroprotective effects. Atorvastatin’s ability to reduce beta-amyloid peptide production could further slow disease progression alongside Memantine’s effects. |
Figure 3.
Inhibitory effects of therapeutic agents on amyloid beta aggregation
(A) The aggregation profiles of amyloid beta, both in the absence and presence of various combinations of compounds, are depicted. Error bars represent the standard error of the mean (SEM).
(B) The percentage of aggregation is presented to better illustrate the effect of the different compounds.
Ablation study highlights key contributing factors
Aiming to understand the contributions of each component within our model, we conducted an ablation study (Figure 4). The study begins with a zero-shot GPT-4 model, which serves as the baseline; in this setting, GPT-4 leverages its pre-learned knowledge to make predictions. Introducing dynamic few-shot learning1 results in a slight performance decrease, likely due to probable mislabeled augmented combinations used as negative few-shot examples. Despite this decrease, the augmentation strategy remains necessary to mitigate reporting bias: augmented data balances the number of positive and negative learning examples, preventing the dynamic few-shot examples from being overoptimistic due to reporting bias (nearly always positive efficacy), which would otherwise skew predictions toward the more frequently observed positive combinations. For comparison, we conducted the same ablation study under a non-augmentation setting (no negative augmented data), in which the data contain only 250 positive and 30 negative combinations (Figure S1), to assess the impact of data availability and class balance on model performance. With this unbalanced data, we observed a clear accuracy improvement from the dynamic few-shot strategy. However, the models were overly optimistic, and the predicted results were biased toward positive efficacy.
Figure 4.
Visual illustration of Coated-LLM components and additive contributions to the performance
Coated-LLM combines kNN-based five-shot dynamic learning example selection, external pathway knowledge, self-consistency (n = 5), Reviewers, and Moderator.
Subsequently, applying RAG, we integrated external biomedical knowledge on pathways to address knowledge gaps in GPT-4's pre-trained model, leading to significant improvements in accuracy (+17%), precision (+13%), recall (+40%), and F1-score (+23%). By implementing self-consistency via an ensemble strategy, we increased accuracy by 6%, precision by 4%, recall by 3%, and F1-score by 4%.
Finally, incorporating feedback from the Reviewers and Moderator further improved the model's predictions, correcting potential errors to achieve the highest accuracy (+5%) and precision (+9%). However, we observed a decrease in recall due to the Reviewers and Moderator, suggesting that the two LLM agents in the revision phase favor reducing false positives over false negatives, analogous to (human) reviewers being more skeptical than (human) researchers. Despite the decrease in recall, we kept the Reviewers and Moderator because, in real in vivo experiments, our model is used to retrieve a top-ranked list of probable positive combinations, so high precision is more important than high recall.
Discussion
In this study, we introduced Coated-LLM, an innovative AI-driven framework that leverages multiple specialized LLM agents to systematically generate, evaluate, and revise biomedical hypotheses for identifying efficacious combinatorial therapies for AD. Our framework demonstrated robust predictive accuracy through both internal cross-validation and a retrospective in vitro dataset, underscoring its potential for real-world application. In addition, experimental validation further reinforced the practical utility of our framework. Computational analysis showed that for a single testing drug combination, Coated-LLM required approximately 4 minutes and $0.95 in total. Specifically, the fully equipped Researcher (incorporating dynamic few-shots, external knowledge integration, and self-consistency) required approximately 15,000 tokens, 90 seconds of processing time, and $0.53 in computational costs. The Reviewers consumed approximately 1,600 tokens, 27 seconds, and $0.07, and Moderator, executing five self-consistent runs, utilized approximately 11,000 tokens, 100 seconds, and $0.35. This level of resource usage suggests that Coated-LLM is a viable solution for scalable application in academic and industrial environments, thereby enhancing the traditional drug discovery process by uncovering therapeutically valuable combinations that might remain unexplored due to the complexity, cost, and limitations of conventional experimental approaches.
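The per-combination resource figures quoted above can be checked with simple arithmetic; the token, time, and cost values are as reported, while the dict structure below is only our illustrative framing.

```python
# Back-of-the-envelope check of per-combination resource usage
# (costs in USD, times in seconds, token counts as reported in the text).
stages = {
    "Researcher": {"tokens": 15_000, "seconds": 90, "cost": 0.53},
    "Reviewers":  {"tokens": 1_600,  "seconds": 27, "cost": 0.07},
    "Moderator":  {"tokens": 11_000, "seconds": 100, "cost": 0.35},
}

total_cost = sum(s["cost"] for s in stages.values())
total_minutes = sum(s["seconds"] for s in stages.values()) / 60

print(f"${total_cost:.2f}, ~{total_minutes:.1f} min")  # → $0.95, ~3.6 min
```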
In addition, Coated-LLM provides a generalizable insight for developing an LLM framework. The following lessons were derived from our study.
• Interaction of Multi-Agent LLM: In our multi-agent LLM framework, Reviewers and Moderator operate in a manner similar to human peer reviewers and journal editors, prioritizing rigor and skepticism over inclusivity, avoiding false positives over false negatives. Given that our model is designed to discover the most probable positive drug combinations for in vivo experiments, where false positives carry high experimental costs, favoring precision over recall is a necessary trade-off.
• Implications of Dynamic Few-Shot Learning: Dynamic few-shot learning plays a crucial role in enhancing the predictive capabilities of LLMs by leveraging examples that are most similar to the target data. Our findings reveal that the inclusion of high-quality real learning examples significantly enhances the accuracy of predictions compared to the zero-shot strategy. In contrast, the use of augmented learning examples does not yield a similar increase in accuracy, showing the importance of high-quality examples.
• Importance of a balanced set of learning examples: Given that the initial dataset from literature mining contains 250 positive combinations (89.28%), the predominance of the positive class in the dynamic few-shot learning examples can introduce bias, potentially leading to skewed predictions toward more frequently observed positive combinations. Our data augmentation strategy balanced the distribution of positive and non-positive learning examples. Although this approach resulted in a slight performance decrease, it remains essential for reducing bias and preventing overly optimistic predictions.
In addition, we analyzed the failure modes of Coated-LLM’s reasoning when evaluated against experimental ground-truth. In the case of estradiol and continuous progesterone combination, our model incorrectly assumed that continuous progesterone would reduce amyloid pathology and improve cognition. However, experimental evidence demonstrated that while continuous progesterone reduced tau hyperphosphorylation, it failed to lower amyloid levels and in fact antagonized the beneficial cognitive effects of estradiol, which was distinct from the effects observed with cyclic progesterone.33 Similarly, for the combination of cholesterol and homocysteine, our model predicted therapeutic benefit by jointly targeting two pathological mechanisms. In contrast, empirical findings showed only the partial reduction of inflammation without recovery of cognitive function.28 In the case of BACE1 haploinsufficiency and neprilysin overexpression combination, our model emphasized dual targeting of amyloid production and clearance. However, experimental results revealed no additional benefit beyond neprilysin overexpression alone, which was sufficient to abolish amyloid deposition and rescue memory, thus leaving no room for further improvement.34 These failure cases showed common reasoning errors of Coated-LLM, such as the overgeneralization of prior knowledge and a lack of sensitivity to ceiling effects.
In all, Coated-LLM presents a transformative opportunity for drug discovery across AD and other multifactorial conditions by reducing reliance on extensive experimental screening. Its ability to effectively predict drug combination efficacy from limited or scarce data holds promise for accelerating therapeutic innovation, optimizing resource allocation, and substantially decreasing costs associated with drug development.
Limitations of the study
While we have obtained promising results, our study has several limitations. One of the primary limitations is the underrepresentation of negative combinations within the dataset obtained through literature mining, introducing unavoidable bias toward positive outcomes during the model's inference phase. While the data augmentation approach helped balance this bias, it is important to acknowledge that these augmented combinations could be false negatives. Furthermore, determining the true efficacy or non-efficacy of such combinations ultimately requires experimental validation, and expert review alone may not fully resolve such uncertainty. As a result, a certain degree of mislabeling may exist and could potentially lead to incorrect predictions. However, given that truly efficacious drug combinations are relatively rare in the real world, the likelihood of falsely labeling an efficacious pair as negative is relatively low. In addition, our experts manually reviewed a representative subset of augmented combinations and concluded that the theoretical label of non-positive remains justifiable without empirical testing of these combinations. Based on these considerations, we regarded the trade-off as acceptable in order to improve the class balance of the learning examples.
The use of LLMs in biomedical hypothesis generation may raise ethical concerns, particularly regarding potential hallucinated reasoning or misleading predictions that may influence human decisions.35,36 However, recent studies have shown that multi-agent collaborative frameworks can effectively reduce hallucinations in LLM outputs,37,38,39 which aligns with the design of our Coated-LLM that incorporates Reviewers and Moderator agents. While our framework is designed to generate hypotheses and predictions, we acknowledge that the outputs require rigorous reasoning checks and experimental validation before clinical consideration. In addition, false positive predictions could lead to costly experimental investigations. Our multi-agent system with Reviewers and Moderator specifically addresses this concern by prioritizing precision over recall to minimize false positives.
It is important to acknowledge that the retrospective in vitro validation is limited by the small, private dataset, whereas our model was trained on in vivo experimental outcomes. Despite this difference in modality, the in vitro dataset remained within the same disease context (Alzheimer’s Disease), allowing a within-domain but cross-modality assessment, since both in vitro and in vivo assays ultimately aim to evaluate whether drug combinations can slow or treat Alzheimer’s Disease. In addition, the inclusion of the cross-validation test set provides a complementary evaluation, supporting the generalizability of our findings. Given that the initial learning examples contain only 97 non-positive efficacy combinations out of 231 (41.9%), such bias presents a considerable challenge for our model in accurately predicting non-positive outcomes. Despite these challenges, our model achieved a significant accuracy rate, demonstrating its effectiveness in generating hypotheses for synergistic drug combinations with minimal historical data. While only three top-ranked combinations were selected for experimental validation, this selection reflects our prioritization of disease-modifying mechanisms over symptomatic targets or herbal compounds. We acknowledge that expanding validation to a broader range of predicted candidates would strengthen the robustness and generalizability of our findings. Nevertheless, we interpret these results as preliminary, and further validation on larger in vivo datasets, additional experimental assays, and applications to other disease domains is needed to substantiate broader claims of generalizability.
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Yejin Kim (yejin.kim@uth.tmc.edu).
Materials availability
This study did not generate new unique reagents.
Data and code availability
• The data generated from literature mining and drug hit AD genes for this study can be accessed via the following link: https://github.com/QidiXu96/Coated-LLM.40
• The code for Researcher, Reviewers, and Moderator can be found at https://github.com/QidiXu96/Coated-LLM.40
Acknowledgments
Y.K. is supported in part by the National Institutes of Health under award numbers R01AG082721, R01AG066749, and R01AG084637.
Author contributions
Concept and design: Q.X. and Y.K.; data access and analysis: Y.K.; model development: Q.X. and Y.K.; interpretation: C.S., M.S., Q.X., and Y.K.; draft article: C.S., M.S., Q.X., Y.K., and X.L.; in vitro experiment design: C.S., M.S., and X.J. All authors contributed to editing the article, approved the final article, and accepted the responsibility to submit it for publication.
Declaration of interests
The authors have no competing interests to declare.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Data | ||
| Literature mining and data augmentation | This paper | https://github.com/QidiXu96/Coated-LLM40 |
| Software and algorithms | ||
| OpenAI API | Python package | https://openai.com/api/ |
| Claude API | Python package | https://www.anthropic.com/api |
Method details
Problem formulation
Our objective is to predict whether a combination of therapeutic agents t1 and t2 has a positive efficacy y when tested in an in vivo model m. That is, we aim to develop a model f such that y = f(x), where x is a triplet (t1, t2, m). Here, t1, t2, and m are not only drawn from a finite set but can also be a new or investigational therapeutic agent (e.g., “Membrane-free stem cell extract”) or in vivo model (e.g., “Rats induced with AD using aluminum chloride”) that is not registered with a formal identifier and is thus best described in natural text. We convert the structured input (t1, t2, m) into a natural-text question Q following prior work.9 For example, we convert the combination (“Galantamine”, “Nicotine”, “ICR mice”) into ‘Decide if the combination of Galantamine and Nicotine is effective or not to treat ICR mice model in theory.’
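As a minimal sketch, the conversion from a structured triplet to the natural-text question Q can be written as a simple template (the function name is ours, not from the released code):

```python
def to_question(t1: str, t2: str, m: str) -> str:
    """Render a structured (t1, t2, m) triplet as the natural-text
    question Q posed to the Researcher agent."""
    return (f"Decide if the combination of {t1} and {t2} is effective "
            f"or not to treat {m} model in theory.")
```

Applied to the example triplet, `to_question("Galantamine", "Nicotine", "ICR mice")` reproduces the question quoted above.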
In some previous studies,9 the effectiveness of combinations of therapeutic agents was measured as synergy. However, synergy quantification requires dose-dependent inhibition profiling, which is not available for in vivo models. As most in vivo experiments report only efficacy (without formal calculation of synergy, toxicity, or dose–response relationships), our focus is also on efficacy rather than synergy. Rather than a specific efficacy measurement, we used a broad sentiment (positive or not) as the efficacy label. While this binary labeling simplifies the underlying pharmacological nuances, it makes the label more transferable across different studies.
Data collection
We collected scientific articles that report the efficacy of therapeutic agent combinations on AD in vivo models. We first utilized the Alzheimer’s Disease Preclinical Efficacy Database (AlzPED),41 a data resource dedicated to the preclinical efficacy studies of candidate therapeutics for Alzheimer's Disease. Among the 1,463 articles in AlzPED, we manually reviewed and selected 39 articles that experimented with multiple therapeutic agents.
We further searched for related articles based on the 39 selected articles. Specific search queries are available in Data S1. We extracted 376 additional articles matching the query from PubMed; 10 of the 39 (25.64%) AlzPED articles were retrievable with the query. We then extracted therapeutic agents, in vivo models, and their efficacy from the abstracts of the selected articles. We excluded articles in which drugs were used to induce AD or to suppress mechanisms for mechanistic study. Among the 376 articles, 199 reported positive efficacy, 3 reported mixed or partial effects, and 16 reported negative efficacy. The remainder were not relevant.
Data augmentation
Our initial dataset was severely imbalanced toward positive efficacy, as researchers tend to publish positive results more often than negative ones. In addition to combinations reporting non-positive efficacy in the literature, we created plausible samples with unknown efficacy (unlabeled data) and used them as noisy non-positive samples in both the warm-up and inference phases. The non-positive samples were created by randomly replacing either one of the drugs or the in vivo model in a positive efficacy combination, with the replacement therapeutic agents or models selected from those commonly used in AD research in the AlzPED dataset. This constraint ensures that the augmented combinations remain within the AD therapeutic domain, preventing obvious non-positive cases that would result from replacing AD drugs with treatments for unrelated conditions such as obesity. For example, given an efficacious combination (Acamprosate, Baclofen, mThy1-hAPP751 (TASD41)), we created a non-positive combination by replacing Baclofen, yielding (Acamprosate, Melatonin, mThy1-hAPP751 (TASD41)). This random replacement strategy generated combinations that were statistically unlikely to be efficacious, as the probability of two randomly selected therapeutic agents showing efficacy is extremely low, thereby not only balancing the dataset but also creating truly novel cases that could not have been memorized during LLMs’ pre-training. To further address potential data leakage concerns, we used an independent private dataset (see retrospective in vitro validation) that was never published, ensuring our model’s predictions were not simply regurgitations of previously seen information.
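The random replacement strategy can be sketched as follows; the function name and the candidate-pool arguments are illustrative (in the paper, the pools are drawn from agents and models commonly used in AD research per AlzPED), not the exact released implementation:

```python
import random

def augment_non_positive(positive, agent_pool, model_pool, seed=0):
    """For each positive combination (t1, t2, m), randomly replace one
    drug or the in vivo model with a candidate from the AD-domain pools,
    producing a plausible (noisy) non-positive sample."""
    rng = random.Random(seed)
    augmented = []
    for t1, t2, m in positive:
        slot = rng.choice(["t1", "t2", "m"])
        if slot == "t1":
            t1 = rng.choice([a for a in agent_pool if a not in (t1, t2)])
        elif slot == "t2":
            t2 = rng.choice([a for a in agent_pool if a not in (t1, t2)])
        else:
            m = rng.choice([x for x in model_pool if x != m])
        augmented.append((t1, t2, m))
    return augmented
```

Each augmented triplet differs from its source in exactly one slot, matching the Baclofen-to-Melatonin example in the text.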
Coated-LLM
Warm-up phase
Overview
In the warm-up phase, Researcher generates answers to training questions and compares them with ground-truth answers. The correctly generated answers are used as learning examples in the next inference phase. For this purpose, we split the data into 70% training and 30% testing sets, used to derive learning examples in the warm-up phase and for actual inference in the next phase, respectively. The training set is not used for actual training or fine-tuning of the LLMs but for deriving learning examples. We set aside a higher proportion for training to ensure Researcher is exposed to diverse learning examples.
Chain-of-thoughts (CoT)
To improve the reasoning ability of Researcher, we applied a chain-of-thought (CoT)23 prompting strategy by incorporating the instruction: “Take a deep breath and work on this problem step-by-step.”42 This approach encourages Researcher to decompose the complex drug-combination efficacy task into a series of intermediate steps, such as identifying drug targets and mechanisms of action, analyzing biological pathways, and evaluating multi-pathway targeting, before reaching a final conclusion (Prompt at Data S2 List S1.1).
Retrieval augmented generation (RAG)
To help Researcher answer a question q in the training set more intelligently, we allowed it to use external biomedical knowledge. LLMs generate responses based on patterns learned during training, which are inherently limited by the data they were pre-trained on; information on therapeutic agents, however, is vast and continuously expanding. To bridge this gap, we provided external biomedical knowledge to Researcher through retrieval-augmented generation (RAG),21 which complements static LLM parameters with up-to-date and dynamic information.21 We retrieved and provided specific external information Bq on the therapeutic agents t1, t2. We used the Comparative Toxicogenomics Database (CTDbase),43 a knowledge database encompassing 88,144,004 relationships among chemicals, genes, pathways, and diseases. We focused particularly on the pathway information that the therapeutic agents t1, t2 target. Only pathways with a corrected p-value below 0.01 were incorporated as external knowledge. Of the total combinations, 129 had pathway information for both drugs available from CTDbase, and 235 had information for only one of the two drugs. For example, for the therapeutic agent Galantamine, we provided molecular pathway information such as “Galantamine has several pathway information, such as cholinergic synapse, transmission across chemical synapses, highly calcium permeable postsynaptic nicotinic acetylcholine receptors, …, and peptide hormone metabolism.” Note that we also tried incorporating a list of target genes as external knowledge, but it did not produce high-quality answers due to its high sparsity. Also note that in vivo model information, such as that available in AlzForum,44 only marginally increased generation quality while consuming many tokens.
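The pathway-filtering step can be sketched as below, assuming CTDbase hits arrive as (pathway name, corrected p-value) pairs; the function name and the exact sentence template are our assumptions based on the Galantamine example:

```python
def build_pathway_knowledge(agent, pathway_hits, alpha=0.01):
    """Keep only pathways with corrected p-value below alpha and render
    them as the external-knowledge sentence Bq given to Researcher."""
    kept = [name for name, p_value in pathway_hits if p_value < alpha]
    if not kept:
        return ""
    if len(kept) == 1:
        return f"{agent} has pathway information on {kept[0]}."
    return (f"{agent} has several pathway information, such as "
            + ", ".join(kept[:-1]) + f", and {kept[-1]}.")
```

Pathways above the 0.01 threshold are silently dropped, mirroring the significance filter described in the text.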
Based on the targeting pathway information Bq, we prompt Researcher to generate a hypothesis. This hypothesis consists of a series of CoT reasoning (Cq) and a final binary answer Aq (output example at Data S2 List S1.2). We kept only the correct answers and their corresponding reasoning as learning examples, filtering out Cq whenever the answer Aq differed from the ground-truth efficacy label y. This simple filtering greatly decreased the number of low-quality chain-of-thought examples.1 We used GPT-4 for Researcher. To encourage Researcher to be skeptical, we added the statement ‘It is rare for combinations of two drugs to be efficacious and synergistic in real world’ to the prompt (Prompt at Data S2 List S1.1).
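The correctness filter is a one-liner in spirit; a minimal sketch (names are ours), where each generated hypothesis is a (question, CoT, answer) triple and labels maps questions to ground-truth efficacy:

```python
def filter_learning_examples(hypotheses, labels):
    """Keep (q, Cq, Aq) triples only when the generated answer Aq
    matches the ground-truth efficacy label y; mismatched chains of
    thought are discarded as low quality."""
    return [(q, cot, ans) for (q, cot, ans) in hypotheses
            if labels[q] == ans]
```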
Inference phase
Overview
Using the learning examples from the warm-up phase, Researcher generates hypotheses to the questions in the testing set. In the inference phase, Researcher leverages the learning examples (dynamic few-shot learning) and external biomedical knowledge (RAG), following the same methodology as in the warm-up phase.
Dynamic Few-shot
When asked scientific questions, human researchers look for similar questions that were answered previously and perform inductive reasoning. So does Researcher, by leveraging dynamic few-shot learning.1 Few-shot learning22 is one of the most effective in-context learning methods for guiding LLMs to learn patterns from a few demonstration examples and to generate similar outcomes. Here, it is critical to provide examples that are relevant to the question of interest.1 However, in our application to AD combinatorial therapy discovery, the therapeutic agents and their associated biological mechanisms are very diverse, making randomly selected examples insufficient for pattern learning. For example, the question Q from (‘Galantamine’, ‘Nicotine’, ‘ICR mice’) is more similar to one from (‘Galantamine’, ‘Memantine’, ‘ICR mice’) than to one from (‘Scyllo-inositol’, ‘neotrofin’, ‘TgCRND8’). Thus, we selected the most similar questions q in the learning examples and their associated reasoning Cq for inductive reasoning in the inference phase. We derived a textual embedding EQ of the question Q of interest and Eq of each question q in the learning examples using OpenAI’s text-embedding-ada-002.1 We then calculated the cosine similarity <Eq, EQ>/(‖Eq‖ · ‖EQ‖) to identify the top five questions q with the highest similarity. The prompt thus consisted of similar learning examples (Cq, Aq), the question of interest Q, and external biomedical knowledge BQ. Note that we guided LLMs to produce a series of CoT reasoning not simply by encouraging the LLM to “think step by step,” but by providing the exact CoT demonstrations Cq from dynamic few-shot learning. See the full prompt and output examples at Data S2 List S2.1 & S2.2.
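The retrieval step can be sketched with plain NumPy on precomputed embedding vectors (in the actual pipeline, embeddings come from the text-embedding-ada-002 API; the function name is ours):

```python
import numpy as np

def top_k_similar(query_emb, example_embs, k=5):
    """Rank learning examples by cosine similarity <Eq, EQ>/(|Eq||EQ|)
    against the query embedding EQ and return the indices of the top k
    most similar examples."""
    q = np.asarray(query_emb, dtype=float)
    E = np.asarray(example_embs, dtype=float)
    sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k].tolist()
```

The returned indices select the (Cq, Aq) demonstrations that are prepended to the prompt.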
Self-consistency via ensemble
To increase the reliability of LLM predictions, we generated the response (CQ, AQ) multiple times. We aggregated them by obtaining a consensus prediction via majority vote and selecting the most detailed (thus longest) chain of thought whose paired answer AQ matched the majority. This ensemble technique minimizes the risk of incorrect predictions by cross-verifying multiple outputs.25
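The aggregation rule (majority answer, then longest agreeing chain of thought) can be sketched as follows; the function name is ours:

```python
from collections import Counter

def aggregate_runs(runs):
    """Self-consistency ensemble over repeated generations.
    `runs` is a list of (chain_of_thought, answer) pairs; return the
    majority answer and the longest chain of thought among the runs
    that agree with it."""
    majority, _ = Counter(ans for _, ans in runs).most_common(1)[0]
    best_cot = max((cot for cot, ans in runs if ans == majority), key=len)
    return best_cot, majority
```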
Revision phase
Evaluate
Theoretical inductive reasoning inevitably carries uncertainty, so it is critical to independently evaluate the validity of hypotheses and revise them accordingly. After we obtain the hypothesis on efficacy and its reasoning in the inference phase, Reviewers critically evaluate whether the hypothesis is logical and reasonable. This review process should be independent; thus, we used another LLM with performance comparable to Researcher (GPT-4), Claude-3-opus, to enhance the independence of the reviewing process.
Tree-of-thoughts (ToT)
Reviewers should bring more diverse perspectives than Researcher to critically evaluate Researcher’s hypothesis and identify potential pitfalls that Researcher could not spot. Thus, we encouraged Reviewers to take multiple perspectives and discuss different branches of thought via tree-of-thoughts (ToT) reasoning.24 This approach explores different possibilities and then converges on the most optimal solution. We prompted Reviewers by instructing, “Imagine three different experts who are in therapy development for Alzheimer’s disease, are tasked with critically reviewing the reasoning…” (Prompt and output example at Data S2 List S3.1 & S3.2).
Revise
Once Reviewers finish the discussion and provide feedback FQ, Moderator aggregates the Reviewers’ feedback and Researcher’s hypothesis to reach the final decision. Moderator took as input Q, the hypothesis (CQ, AQ), and the feedback FQ and deduced the final revised reasoning and answer. See the full prompt and output example at Data S2 List S4.1 & S4.2.
Evaluation
We evaluate whether the prediction of Coated-LLM is accurate by comparing the binary prediction (i.e., positive vs. non-positive) with the ground-truth label. We reported accuracy, precision, recall, and F1. We also quantified our model confidence based on self-consistency, defined as the proportion of repeated Moderator runs that agreed with the majority prediction. To assess how well this confidence aligns with actual prediction correctness, we calculated the Expected Calibration Error (ECE), which is defined as the average absolute difference between the model's confidence and accuracy across binned confidence intervals. We first evaluated the accuracy via cross-validation using the test set and via retrospective in vitro validation using in-house private data. This in vitro dataset has 11 drug combinations, all of which were entirely unseen in our main dataset, including literature mining and data augmentation. Of these, 9 combinations are labeled as non-positive efficacy. Furthermore, during the evaluation of the in vitro data, we augmented the initial learning examples from the warm-up phase by incorporating combinations that were correctly predicted during the revision phase. After augmenting the learning examples, we had 347 combinations (195 combinations with positive efficacy; 152 showing non-positive efficacy) serving as learning examples for predicting efficacy on our private data.
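The ECE described above can be computed as in the following sketch (a standard binned formulation; the bin count of 10 is our assumption, not stated in the text):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted average of |mean confidence - accuracy|
    over equal-width confidence bins. `confidences` are per-prediction
    self-consistency scores in (0, 1]; `correct` are 0/1 indicators."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += (mask.sum() / n) * abs(confidences[mask].mean()
                                          - correct[mask].mean())
    return ece
```

A perfectly calibrated model (confidence equals empirical accuracy in every bin) scores 0; a fully confident but always-wrong model scores 1.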
We conducted an ablation study to understand the relative contributions of each component in our model. We iteratively introduced each component and measured the performance differences. Since these components are not statistically independent,1,2 we should consider the performance differences as the components' relative contributions.
Baseline
Due to the lack of sufficient data, data-driven machine learning models were not appropriate or available. Instead, we developed a rule-based baseline model to predict the efficacy of drug combinations. We utilized complementary exposure patterns,45 which state that a drug combination is therapeutically effective if the targets of the therapeutic agents hit the disease module without overlapping each other (Methods S1). The target genes of the therapeutic agents were collated from multiple sources, including Drug Target Commons, PubChem, and CTDbase,43,46,47 whereas AD-related genes were derived from Agora’s nominated gene list.48 Within the test set, target gene information was unobtainable for 76 therapeutic agents (34.3%). Consequently, 76 therapeutic agent combinations (48.72%) could not be evaluated using the baseline model due to the absence of the necessary target gene information.
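A minimal sketch of the complementary exposure rule, on gene sets (our function name; the cited work defines the pattern on network modules, which we reduce here to set overlap for illustration):

```python
def complementary_exposure(targets1, targets2, disease_module):
    """Rule-based baseline: predict positive efficacy when both drugs'
    target sets hit the disease module but do not overlap each other
    (complementary exposure pattern)."""
    t1, t2, d = set(targets1), set(targets2), set(disease_module)
    hits_both = bool(t1 & d) and bool(t2 & d)
    return hits_both and not (t1 & t2)
```

Combinations whose target sets overlap, or where either drug misses the disease module entirely, are predicted non-positive.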
Selecting drug combination for in vitro experiments
We had created augmented non-positive samples. Augmented samples that were predicted to be positive (i.e., false positives) may suggest intriguing hypotheses: although these combinations were unknown and thus marked as non-positive, we hypothesized that they may, in fact, be efficacious combinations that have not yet been empirically tested. Therefore, we selected the most promising drug combinations among them and tested their in vitro efficacy. Human experts carefully examined the 25 false positives. We selected a drug combination if it co-inhibits functionally complementary pathways according to Researcher’s reasoning. Recent studies in network pharmacology indicate that synergy occurs when drug targets are functionally complementary but do not share direct interactions, reducing the risk of redundant inhibition; examples include the combination of Lenalidomide and BET inhibitors (e.g., I-BET-762) in bortezomib-resistant mantle cell lymphoma,49 Geldanamycin plus Tofacitinib in myeloproliferative neoplasm cells,49 and co-inhibition of NOX4-derived ROS and NOS-derived NO in ischemic stroke.50 In addition, we excluded combinations that only target symptom relief (e.g., targeting neurotransmitters), as we are more interested in disease-modifying therapies, and we excluded drug combinations containing natural products (e.g., herbal medicines such as Huperzine A).
In-vitro amyloid beta 1–42 aggregation assay
Misfolding, aggregation, and the progressive accumulation of amyloid beta (Aβ) and Tau proteins in the brain are hallmark events in AD. To evaluate the efficacy of selected drug combinations, we performed an Aβ42 aggregation assay.
Aβ peptide (Aβ42) was synthesized at the W. M. Keck Facility at Yale University using solid-phase N-tert-butyloxycarbonyl chemistry and purified via reverse-phase high-performance liquid chromatography (HPLC). The purified Aβ42 was lyophilized and stored at -80°C until use. To ensure seed-free preparations, the lyophilized Aβ42 powder was dissolved in a high pH solution (10 mM NaOH) and filtered through a 30-kDa cutoff filter to eliminate residual aggregates.
For the aggregation assay, 200 μl of seed-free Aβ42 at a concentration of 2 μM was prepared in aggregation buffer (0.1 M Tris-Cl pH 7.4, 500 mM NaCl, and 5 μM Thioflavin T). The assay was conducted either with Aβ42 alone as a control or in the presence of individual therapeutic agents (Gypenoside, Acamprosate, Melatonin, Memantine, Atorvastatin), drug combinations (each at a final concentration of 100 μM), or an amyloid beta-specific antibody (m266, 5 μg/ml). The mixtures were placed in a 96-well plate and incubated at 25 °C for 150 hours, with intermittent shaking at 500 rpm.
Aggregation was monitored periodically by measuring Thioflavin T (ThT) fluorescence intensity using a Gemini-XS microplate spectrofluorometer (Molecular Devices, Sunnyvale, CA), with excitation and emission wavelengths set at 435 nm and 485 nm, respectively. Differences in aggregation were quantified by calculating percent aggregation, using the maximum fluorescence (Fmax) of the Aβ42-alone control as a reference.
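The percent-aggregation quantification can be sketched as a simple normalization against the Fmax of the Aβ42-alone control (the function name is ours):

```python
def percent_aggregation(fluorescence, f_max_control):
    """Express a ThT fluorescence reading as a percentage of the
    maximum fluorescence (Fmax) of the Abeta42-alone control."""
    return 100.0 * fluorescence / f_max_control
```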
Quantification and statistical analysis
Statistical analysis in our study was primarily descriptive and focused on evaluating prediction performance of large language models. All analyses were performed using Python (v3.11.1). LLM-based outputs were generated using OpenAI GPT-4 and Claude-3-Opus-20240229 APIs. Embeddings were produced using the OpenAI text-embedding-ada-002 API.
Evaluation metrics included accuracy, precision, recall, and F1 score for binary classification of drug combination efficacy. Cosine similarity was used for k-nearest neighbor (where k = 5) retrieval of learning examples, which were used in dynamic few-shot prompting. Performance metrics reflected model behavior across test instances, not biological replicates.
Published: November 10, 2025
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.isci.2025.113984.
References
- 1. Nori H., Lee Y.T., Zhang S., Carignan D., Edgar R., Fusi N., King N., Larson J., Li Y., Liu W., et al. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. Preprint at arXiv. 2023. doi: 10.48550/arXiv.2311.16452.
- 2. Romera-Paredes B., Barekatain M., Novikov A., Balog M., Kumar M.P., Dupont E., Ruiz F.J.R., Ellenberg J.S., Wang P., Fawzi O., et al. Mathematical discoveries from program search with large language models. Nature. 2024;625:468–475. doi: 10.1038/s41586-023-06924-6.
- 3. Jablonka K.M., Schwaller P., Ortega-Guerrero A., Smit B. Leveraging large language models for predictive chemistry. Nat. Mach. Intell. 2024;6:161–169.
- 4. Boiko D.A., MacKnight R., Kline B., Gomes G. Autonomous chemical research with large language models. Nature. 2023;624:570–578. doi: 10.1038/s41586-023-06792-0.
- 5. M Bran A., Cox S., Schilter O., Baldassari C., White A.D., Schwaller P.A. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 2024;6:525–535. doi: 10.1038/s42256-024-00832-8.
- 6. Holyoak K.J., Morrison R.G. The Cambridge Handbook of Thinking and Reasoning. Cambridge University Press; 2005.
- 7. Kalyanpur A., Saravanakumar K.K., Barres V., Chu-Carroll J., Melville D., Ferrucci D. LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic. Preprint at arXiv. 2024. doi: 10.48550/arXiv.2406.17663.
- 8. Haji F., Bethany M., Tabar M., Chiang J., Rios A., Najafirad P. Improving LLM reasoning with multi-agent Tree-of-Thought Validator agent. Preprint at arXiv. 2024. doi: 10.48550/arXiv.2409.11527.
- 9. Li T., Shetty S., Kamath A., Jaiswal A., Jiang X., Ding Y., Kim Y. CancerGPT for few shot drug pair synergy prediction using large pretrained language models. npj Digit. Med. 2024;7:40. doi: 10.1038/s41746-024-01024-9.
- 10. Liu A., Feng B., Xue B., Wang B., Wu B., Lu C., Zhao C., Deng C., Zhang C., Ruan C., et al. DeepSeek-V3 Technical Report. Preprint at arXiv. 2024. doi: 10.48550/arXiv.2412.19437.
- 11. Qi B., Zhang K., Li H., Tian K., Zeng S., Chen Z., Zhou B. Large language models as biomedical hypothesis generators: A comprehensive evaluation. Preprint at arXiv. 2024. doi: 10.48550/arXiv.2311.05965.
- 12. Qi B., Zhang K., Li H., Tian K., Zeng S., Chen Z., Zhou B. Large Language Models are zero shot hypothesis proposers. Preprint at arXiv. 2023. doi: 10.48550/arXiv.2311.05965.
- 13. Sun X., Vilar S., Tatonetti N.P. High-throughput methods for combinatorial drug discovery. Sci. Transl. Med. 2013;5. doi: 10.1126/scitranslmed.3006667.
- 14. Celebi R., Bear Don’t Walk O., 4th, Movva R., Alpsoy S., Dumontier M. In-silico Prediction of Synergistic Anti-Cancer Drug Combinations Using Multi-omics Data. Sci. Rep. 2019;9:8949. doi: 10.1038/s41598-019-45236-6.
- 15. Preuer K., Lewis R.P.I., Hochreiter S., Bender A., Bulusu K.C., Klambauer G. DeepSynergy: predicting anti-cancer drug synergy with Deep Learning. Bioinformatics. 2018;34:1538–1546. doi: 10.1093/bioinformatics/btx806.
- 16. Huang L., Li F., Sheng J., Xia X., Ma J., Zhan M., Wong S.T.C. DrugComboRanker: drug combination discovery based on target network analysis. Bioinformatics. 2014;30:i228–i236. doi: 10.1093/bioinformatics/btu278.
- 17. Bansal M., Yang J., Karan C., Menden M.P., Costello J.C., Tang H., Xiao G., Li Y., Allen J., Zhong R., et al. A community computational challenge to predict the activity of pairs of compounds. Nat. Biotechnol. 2014;32:1213–1222. doi: 10.1038/nbt.3052.
- 18. Zhao X.-M., Iskar M., Zeller G., Kuhn M., van Noort V., Bork P. Prediction of drug combinations by integrating molecular and pharmacological data. PLoS Comput. Biol. 2011;7. doi: 10.1371/journal.pcbi.1002323.
- 19. Chen G., Tsoi A., Xu H., Zheng W.J. Predict effective drug combination by deep belief network and ontology fingerprints. J. Biomed. Inform. 2018;85:149–154. doi: 10.1016/j.jbi.2018.07.024.
- 20. Tang J., Gautam P., Gupta A., He L., Timonen S., Akimov Y., Wang W., Szwajda A., Jaiswal A., Turei D., et al. Network pharmacology modeling identifies synergistic Aurora B and ZAK interaction in triple-negative breast cancer. NPJ Syst. Biol. Appl. 2019;5:20. doi: 10.1038/s41540-019-0098-z.
- 21. Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis M., Yih W., Rocktäschel T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Preprint at arXiv. 2020. doi: 10.48550/arXiv.2005.11401.
- 22. Brown T.B., Mann B., Ryder N., Subbiah M., Kaplan J., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A., et al. Language Models are Few-Shot Learners. Preprint at arXiv. 2020. doi: 10.48550/arXiv.2005.14165.
- 23. Wei J., Wang X., Schuurmans D., Bosma M., Ichter B., Xia F., Chi E., Le Q., Zhou D. Chain-of-thought prompting elicits reasoning in large language models. Preprint at arXiv. 2022. doi: 10.48550/arXiv.2201.11903.
- 24. Yao S., Yu D., Zhao J. Tree of thoughts: Deliberate problem solving with large language models. Preprint at arXiv. 2023. doi: 10.48550/arXiv.2305.10601.
- 25. Wang X. Self-consistency improves chain of thought reasoning in language models. Preprint at arXiv. 2022. doi: 10.48550/arXiv.2203.11171.
- 26. Yu L., Wang W., Pang W., Xiao Z., Jiang Y., Hong Y. Dietary lycopene supplementation improves cognitive performances in tau transgenic mice expressing P301L mutation via inhibiting oxidative stress and tau hyperphosphorylation. J. Alzheimers Dis. 2017;57:475–482. doi: 10.3233/JAD-161216.
- 27. Freyssin A., Carles A., Guehairia S., Rubinstenn G., Maurice T. Fluoroethylnormemantine (FENM) shows synergistic protection in combination with a sigma-1 receptor agonist in a mouse model of Alzheimer’s disease. Neuropharmacology. 2024;242. doi: 10.1016/j.neuropharm.2023.109733.
- 28. Pirchl M., Ullrich C., Sperner-Unterweger B., Humpel C. Homocysteine has anti-inflammatory properties in a hypercholesterolemic rat model in vivo. Mol. Cell. Neurosci. 2012;49:456–463. doi: 10.1016/j.mcn.2012.03.001.
- 29. Samad N., Jabeen S., Imran I., Zulfiqar I., Bilal K. Protective effect of gallic acid against arsenic-induced anxiety-/depression- like behaviors and memory impairment in male rats. Metab. Brain Dis. 2019;34:1091–1102. doi: 10.1007/s11011-019-00432-1.
- 30. Gavriel Y., Rabinovich-Nikitin I., Ezra A., Barbiro B., Solomon B. Subcutaneous administration of AMD3100 into mice models of Alzheimer’s disease ameliorated cognitive impairment, reduced neuroinflammation, and improved pathophysiological markers. J. Alzheimers Dis. 2020;78:653–671. doi: 10.3233/JAD-200506.
- 31. Zhao L., Chen T., Wang C., Li G., Zhi W., Yin J., Wan Q., Chen L. Atorvastatin in improvement of cognitive impairments caused by amyloid β in mice: involvement of inflammatory reaction. BMC Neurol. 2016;16:18. doi: 10.1186/s12883-016-0533-3.
- 32. Wang D., Noda Y., Zhou Y., Mouri A., Mizoguchi H., Nitta A., Chen W., Nabeshima T. The allosteric potentiation of nicotinic acetylcholine receptors by galantamine ameliorates the cognitive dysfunction in beta amyloid25-35 i.c.v.-injected mice: involvement of dopaminergic systems. Neuropsychopharmacology. 2007;32:1261–1271. doi: 10.1038/sj.npp.1301256.
- 33. Carroll J.C., Rosario E.R., Villamagna A., Pike C.J. Continuous and cyclic progesterone differentially interact with estradiol in the regulation of Alzheimer-like pathology in female 3xTransgenic-Alzheimer’s disease mice. Endocrinology. 2010;151:2713–2722. doi: 10.1210/en.2009-1487.
- 34. Devi L., Ohno M. A combination Alzheimer’s therapy targeting BACE1 and neprilysin in 5XFAD transgenic mice. Mol. Brain. 2015;8:19. doi: 10.1186/s13041-015-0110-5.
- 35. Deng Z., Ma W., Han Q.L., Zhou W., Zhu X., Wen S., Xiang Y. Exploring DeepSeek: A survey on advances, applications, challenges and future directions. IEEE/CAA J. Autom. Sin. 2025;12:872–893.
- 36. Zhou W. The security of using large language models: A survey with emphasis on ChatGPT. IEEE/CAA J. Autom. Sin. 2025;12:1–26.
- 37. Deng Z. AI agents under threat: A survey of key security challenges and future pathways. Preprint at arXiv. 2024. doi: 10.48550/arXiv.2406.02630.
- 38. Chen D., Wang H., Huo Y., Li Y., Zhang H. GameGPT: Multi-agent collaborative framework for game development. Preprint at arXiv. 2023. doi: 10.48550/arXiv.2310.08067.
- 39. Du Y., Li S., Torralba A., Tenenbaum J.B., Mordatch I. Improving factuality and reasoning in language models through multiagent debate. Preprint at arXiv. 2023. doi: 10.48550/arXiv.2305.14325.
- 40. Xu Q. Coated-LLM. GitHub. https://github.com/QidiXu96/Coated-LLM
- 41. AlzPED. https://alzped.nia.nih.gov/.
- 42. Yang C. Large Language Models as Optimizers. Preprint at arXiv. 2023. doi: 10.48550/arXiv.2309.03409.
- 43. Davis A.P., Wiegers T.C., Johnson R.J., Sciaky D., Wiegers J., Mattingly C.J. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2023;51:D1257–D1262. doi: 10.1093/nar/gkac833.
- 44. Research Models. https://www.alzforum.org/research-models.
- 45. Cheng F., Kovács I.A., Barabási A.-L. Network-based prediction of drug combinations. Nat. Commun. 2019;10:1197. doi: 10.1038/s41467-019-09186-x.
- 46. Tang J., Tanoli Z.U.R., Ravikumar B., Alam Z., Rebane A., Vähä-Koskela M., Peddinti G., van Adrichem A.J., Wakkinen J., Jaiswal A., et al. Drug Target Commons: A community effort to build a consensus knowledge base for drug-target interactions. Cell Chem. Biol. 2018;25:224–229.e2. doi: 10.1016/j.chembiol.2017.11.009.
- 47. PubMed. https://pubmed.ncbi.nlm.nih.gov/.
- 48. Agora. https://agora.adknowledgeportal.org/.
- 49. Unsal-Beyge S., Tuncbag N. Functional stratification of cancer drugs through integrated network similarity. NPJ Syst. Biol. Appl. 2022;8:11. doi: 10.1038/s41540-022-00219-8.
- 50. Casas A.I., Hassan A.A., Larsen S.J., Gomez-Rangel V., Elbatreek M., Kleikers P.W.M., Guney E., Egea J., López M.G., Baumbach J., Schmidt H.H.H.W. From single drug targets to synergistic network pharmacology in ischemic stroke. Proc. Natl. Acad. Sci. USA. 2019;116:7129–7136. doi: 10.1073/pnas.1820799116.