Abstract
We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, literature-grounded question answering via retrieval-augmented generation, molecular generation, multiproperty prediction, property-aware molecular refinement, and 3D protein–ligand structure generation. The agent autonomously retrieves relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answers mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generates chemically diverse seed molecules and predicted 75 properties, including ADMET-related and general physicochemical descriptors, which guids iterative molecular refinement. Across two refinement rounds, the number of molecules with QED >0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein–ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.
Introduction
The discovery of drug-like molecules is fundamental to therapeutic innovation and the advancement of public health. However, identifying viable candidates is a highly complex and resource-intensive process. It requires navigating an enormous chemical space, integrating heterogeneous biological and chemical data sources, and conducting iterative experimental validation. Empirically, drug development spans 10–15 years and can cost $1.5–2.5 billion per approved therapy. Much of this burden arises from attrition in Phase II and III clinical trials, where insufficient efficacy or safety issues often lead to late-stage failures. − A key strategy to reduce this attrition is to identify high-quality candidates as early as possible, thereby improving the likelihood of clinical success. At the same time, the chemical search space is astronomically large, with estimates of more than 1060 drug-like molecules, making it impossible to explore effectively with traditional trial-and-error approaches. ,
To mitigate these challenges, computational approaches have become essential in early stage drug development. Methods such as molecular docking, − quantitative structure–activity relationship (QSAR) modeling, − and molecular dynamics simulations − have significantly improved the speed and accuracy of compound screening and optimization. More recently, machine learning (ML) techniques have expanded this toolkit by enabling predictive modeling of pharmacokinetic and pharmacodynamic properties, de novo molecular generation, and multitask learning across diverse biomedical tasks. − AlphaFold has achieved high-accuracy protein structure prediction. − In parallel, generative models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models (e.g., JT-VAE, MolGAN, MolDiff) enable de novo or target-specific molecular design. − Also, Graph Neural Networks (GNNs) and transformer-based models improve property prediction. − Collectively, these methods increase the scalability and automation potential of modern drug discovery workflows.
Despite this progress, most computational methods remain task-specific and require manual orchestration by domain experts. Drug discovery is inherently a multistep, interdependent process that requires seamless integration across diverse tasks. − This fragmented implementation limits the scalability and efficiency of current pipelines. As the demand for faster and more cost-effective drug screening grows, there is an urgent need for unified, intelligent platforms capable of autonomously coordinating these tasks while supporting expert decision-making.
Large Language Models (LLMs) offer a compelling solution to this need. Trained on massive corpora of natural language data, LLMs exhibit strong reasoning capabilities and domain-agnostic knowledge. When augmented with external tools, such as domain-specific plugins, APIs, and software libraries, LLM-based agents can overcome the limitations of general-purpose language models and act as interpretable, flexible controllers of scientific workflows. ,
LLM agents have recently been applied to automate diverse aspects of scientific discovery, including experimental design, material synthesis planning, and data analysis. − For instance, the dZiner framework leverages LLM agents for molecular design through iterative reasoning and structure-based optimization. CACTUS demonstrated the use of an LLM to wrap cheminformatics utilities for evaluating properties such as Mwt, LogP, TPSA, and drug-likeness filters, providing a practical but narrowly scoped toolset. In contrast, SciToolAgent introduced a knowledge-graph–driven orchestration layer that allows an LLM to select from hundreds of scientific tools across domains ranging from biological to materials, illustrating scalability but lacking a dedicated workflow for therapeutic design. Building on these directions, DrugPilot focuses on orchestrating multistage discovery workflows using parametrized reasoning, primarily demonstrated on established model-zoo tasks and benchmark data sets rather than end-to-end molecular design. A more detailed overview of the comparative landscape of agentic frameworks is provided in Supporting Information Section 9.
We introduce AgentD, an LLM-powered agent framework designed to support and streamline the drug discovery pipeline. The agent performs a wide range of important tasks, including biomedical data retrieval from structured databases and unstructured web sources, answering domain-specific scientific queries, generating seed molecule libraries via SMILES-based generative models, predicting a broad spectrum of drug-relevant properties, refining molecular representations to improve drug-likeness, and generating 3D molecular structures for downstream analysis. Our results demonstrate that this agent-driven framework streamlines early phase drug discovery and serves as a flexible foundation for scalable, AI-assisted therapeutic development. Its modular architecture allows for continual improvement as more advanced tools and models become available.
Agentic Workflow
Task Modules
AgentD performs six essential tasks across the drug discovery pipeline, as illustrated in Figure a. From a code-design perspective, the agent integrates LLMs sourced from OpenAI, Anthropic, and DeepSeek. These models function both as the core reasoning module and as the natural-language interaction layer. In the present implementation, AgentD is primarily developed and executed using OpenAI’s GPT-4o as the main language model. By integrating domain-specific tools and databases, AgentD coordinates a wide range of activities - from data retrieval and molecular generation to property evaluation and structure prediction. The primary tool components supporting each task module are summarized in Table .
1.

AgentD overview. (a) Task modules supported by AgentD. In question-answering and molecular-refinement tasks, outputs are generated directly by the language model, while other tasks rely on external computational tools for final results. (b) Workflow illustrating how each task module contributes to the overall drug discovery pipeline.
1. Key External Tools Integrated into Each Task Module within the AgentD Drug Discovery Pipeline.
| Task Module | Primary Tool Components |
|---|---|
| Data Extraction | UniProt, ChEMBL, Serper, Semantic Scholar APIs |
| Question Answering | RAG, LLM internal knowledge and reasoning |
| SMILES Generation | REINVENT, Mol2Mol |
| Property Prediction | DeepPK API, BAPULM |
| Molecule Refinement | LLM internal knowledge and reasoning |
| Structure Generation | Boltz-2 |
Adapted from the dZiner framework.
Data Extraction
AgentD is capable of retrieving biomedical information from both structured and unstructured sources. Given a query, such as identifying known drugs associated with a specific protein–disease context, the agent retrieves the protein’s FASTA sequence from UniProt, searches for relevant drug names on the web, and extracts corresponding SMILES representations from ChEMBL. Additionally, AgentD autonomously constructs keyword-based queries to download relevant open-access scientific literature from Semantic Scholar, providing a broader context for downstream reasoning and decision-making.
Question Answering
In therapeutic applications, high accuracy and mechanistic specificity are critical. Generic responses that only appear plausible are inadequate for addressing domain-specific questions relevant to drug discovery and biomedical research. To handle domain-specific scientific queries, AgentD employs a retrieval-augmented generation (RAG) strategy. By grounding its responses in literature obtained during the data extraction phase, the agent provides context-aware and evidence-based answers.
SMILES Generation
Constructing a diverse and chemically relevant seed molecule pool is essential for effective early stage virtual screening. AgentD generates seed molecules in SMILES format using external generative models. In this study, we incorporate two such models: REINVENT, , which supports de novo molecule generation without requiring input SMILES, and Mol2Mol, which performs conditional generation to produce molecules structurally similar to a given input SMILES. This dual capability enables both exploration and exploitation in the molecular search space.
Property Prediction
For each candidate molecule, AgentD predicts key pharmacologically relevant properties, including ADMET (absorption, distribution, metabolism, excretion, and toxicity) profiles and binding affinity (e.g., pK d ). ADMET prediction is performed using the Deep-PK API, which accepts SMILES strings as input. For binding affinity estimation, the BAPULM model is used, which operates on both the SMILES and the protein’s FASTA sequence. These predictions help prioritize compounds based on both efficacy and safety.
Molecule Refinement
Based on the predicted properties, AgentD can identify molecular shortcomings such as toxicity or poor permeability, and propose targeted structural modifications to improve attributes like solubility and metabolic stability. This SMILES refinement is carried out solely through the LLM’s internal reasoning and built-in chemical knowledge, without additional model-based tools. This task is carried out on top of the property prediction stage.
Structure Generation
AgentD can generate 3D structures of protein–ligand complexes using Boltz-2 as an external tool. , This process produces candidate complex structures along with associated binding metrics such as IC50 values and inhibitor probability. Since these structures and affinity outputs are model-derived, downstream physical validation (e.g., MD-based simulations and free-energy calculations) is required before drawing the final conclusions about binding stability or potency.
Workflow
All six task modules in AgentD are designed to support key components of the drug discovery pipeline, as illustrated in Figure b. The process begins with a user-provided query, such as identifying drug molecules that target a specific disease-related protein. Through the data extraction module, AgentD retrieves the protein’s FASTA sequence from UniProt and identifies known drugs using sources such as ChEMBL and Google search APIs. In this study, we demonstrate the workflow using three representative therapeutic targets: BCL-2 for lymphocytic leukemia, JAK-2 for myelofibrosis, and thrombin for thrombosis. The BCL-2 case is discussed in detail in the main manuscript, while results for the other two targets are provided in Supporting Information Sections 5 and 6.
The AgentD framework checks whether known drug molecules exist for a given target or disease. If available, these molecules serve as starting points for subsequent molecular exploration. Because AgentD includes a de novo generation module, the workflow can still proceed even when no such molecules are found. The SMILES generation task builds a chemically diverse library of candidate molecules using generative models. These molecules are then evaluated using property prediction tools to estimate key pharmacological characteristics such as ADMET profiles, general physicochemical properties, and binding affinity. Based on these predicted properties, AgentD proposes SMILES-level modifications to improve attributes like solubility, toxicity, and metabolic stability. These refinements rely on the language model’s internal reasoning rather than external optimization tools and help enrich the candidate pool.
Following refinement, domain-specific criteria can be applied by the user to select promising compounds. For these, the structure generation task creates 3D protein–ligand complexes using the ligand SMILES and target protein sequence. This enables more detailed downstream analyses. The structure generation step also outputs auxiliary metrics like predicted IC50 and inhibitor probability as proxies for binding strength. Although these estimates from Boltz-2 provide useful initial guidance, downstream physical validation via MD simulations is required to assess binding stability and interaction fidelity. Additional uncertainty information, including Boltz-2–derived affinity estimates, is provided in Supporting Information Section 10.
Throughout the workflow, users may request clarification or validation of scientific concepts. The question answering module, implemented using RAG, supports this by leveraging literature collected during the data extraction phase. This functionality serves both to enhance the pipeline and to address user-specified scientific queries on demand.
Results
Data Extraction
AgentD integrates web search capabilities and database API access as tool modules for extracting protein and molecule data. When provided with a user query specifying a target protein and associated disease, the agent is instructed to (i) retrieve the protein’s FASTA sequence, (ii) identify existing drugs relevant to the specified target–disease context, and (iii) download related open-access scientific literature. The agent was tested on example queries including BCL-2 for lymphocytic leukemia, JAK2 for myelofibrosis, and thrombin for thrombosis. The results, summarized in Figure , show that the agent successfully retrieved key molecular data, such as identifying venetoclax as a BCL-2 inhibitor.
2.

Examples of data extraction. Given a user query specifying a target protein and associated disease, AgentD retrieves the protein’s FASTA sequence and identifies known drug molecules relevant to the query. (a) BCL-2 for lymphocytic leukemia, (b) JAK-2 for myelofibrosis, and (c) thrombin for thrombosis.
The extracted information is used in several downstream modules. The retrieved SMILES is used to initialize molecule generation models for building a seed library. Both the SMILES and protein sequence serve as inputs to the property prediction and 3D structure generation components. Additionally, the downloaded documents are embedded into a vector database, enabling context-aware retrieval during the agent’s question answering task.
Domain-Specific Question Answering
To address the limitations of generic, unsupported answers in biomedical contexts, AgentD employs RAG to ground its responses in domain-specific literature. During the data extraction phase, the agent formulates keyword-based queries from the target protein and disease, retrieves open-access papers via Semantic Scholar, and stores them in a vector database for later retrieval during question answering.
We evaluated this capability using a representative study by Weller et al., which describes how venetoclax activates the integrated stress response (ISR), leading to NOXA upregulation and MCL-1 inhibition. Based on this paper, we formulated domain-specific questions and compared the responses generated by AgentD with RAG to those from the vanilla GPT-4o model (Figure ).
3.
Comparison of question-answering performance between AgentD with RAG and the Vanilla GPT-4o model. Green highlights indicate grounded, context-relevant information supported by the reference paper, while gray highlights denote generic content not directly related to the question. (a) Mechanism of ABT-199/venetoclax synergy with proteasome inhibitors. (b) Effect of ABT-199–induced integrated stress response on NOXA expression.
AgentD consistently generates more detailed and mechanistically accurate responses. For example, when asked about the synergy mechanism between venetoclax and proteasome inhibitors, the RAG-augmented answer includes key elements such as ISR activation, ATF3/ATF4-mediated transcription of NOXA, and downstream effects involving BIM, BAX/BAK activation, and mitochondrial membrane permeabilization,closely aligning with the mechanistic explanation presented in the source paper (Figure a). In contrast, vanilla GPT-4o provides a more generic explanation, omitting critical components such as the ISR pathway and transcriptional regulators.
A similar pattern is observed when inquiring about how ABT-199 influences NOXA expression through ISR. AgentD correctly references eIF2α phosphorylation, TP53 independence, and the ATF3/ATF4 regulatory cascade. Although the GPT-4o response is fluent, it lacks these essential mechanistic details. In Figure , AgentD’s context-aware answers are highlighted in green, whereas GPT-4o’s generic responses appear in gray.
To quantitatively assess these observed differences, we measured the semantic similarity between responses generated with and without RAG and human-annotated reference answers derived from the literature using BERTScore. The reference answers were curated by the authors through careful reading of the relevant source papers. , BERTScore quantifies contextual similarity by comparing the embeddings of model-generated and reference texts. We employed SciBERT as the embedding encoder because it is pretrained on a large corpus of scientific publications from Semantic Scholar, capturing richer representations of biomedical and technical terminology than general-domain encoders such as RoBERTa.
As shown in Table , AgentD consistently outperforms the corresponding vanilla language models across both GPT-4o and DeepSeek-Chat backends. For GPT-4o, the use of RAG increases the mean F1-score from 0.63 ± 0.01 to 0.70 ± 0.05, primarily driven by a substantial gain in Precision (0.72 vs 0.61). This indicates that AgentD more effectively suppresses unsupported or generic statements during question answering. A similar trend is observed with DeepSeek-Chat: AgentD improves the F1-score from 0.67 ± 0.01 in the vanilla model to 0.70 ± 0.05, again with a noticeable increase in Precision (0.69 vs 0.65). These results show that integrating RAG and structured reasoning enables AgentD to produce more faithful and literature-grounded responses.
2. Comparison of BERTScores between AgentD (with RAG) and Vanilla Language Models (without RAG) .
| Method | Precision | Recall | F1-score |
|---|---|---|---|
| AgentD (GPT-4o, w/RAG) | 0.719 ± 0.048 | 0.686 ± 0.064 | 0.701 ± 0.046 |
| Vanilla GPT-4o (w/o RAG) | 0.609 ± 0.018 | 0.661 ± 0.039 | 0.633 ± 0.024 |
| AgentD (DeepSeek-Chat, w/RAG) | 0.685 ± 0.064 | 0.717 ± 0.042 | 0.700 ± 0.051 |
| Vanilla DeepSeek-Chat (w/o RAG) | 0.645 ± 0.021 | 0.698 ± 0.020 | 0.670 ± 0.012 |
Values are averaged across six question–answer pairs. Results for two representative pairs are shown in Figure , and additional examples are provided in Supplementary Information Section 1.
Seed Molecule Generation
The initial library of molecules serves as a critical foundation for exploring chemical space and identifying candidates for further optimization. To construct this seed molecule pool, AgentD leverages two complementary generation strategies. REINVENT enables de novo molecule generation, allowing for broad and unbiased exploration of chemical space without the need for an input structure. Mol2Mol performs conditional generation, producing analogs that are structurally similar to a specified input molecule. In our workflow, the existing drug identified during the data extraction phase, such as venetoclax, is used as input for Mol2Mol, enabling the agent to focus on chemically relevant and biologically meaningful regions of the search space.
After retrieving the SMILES of the known drug, AgentD automatically integrates it into a configuration file and executes both REINVENT and Mol2Mol to generate the initial seed molecules. As shown in Figure , REINVENT-generated molecules are widely distributed across the chemical space, reflecting its exploratory capabilities, whereas Mol2Mol-generated molecules are more tightly clustered around the input molecule, enabling targeted exploration near known active scaffolds. This seed library serves as the starting point for downstream tasks such as property prediction, refinement, and structure generation.
4.
t-SNE visualization of chemical space, based on Mordred-derived molecular descriptors, including seed molecules generated by AgentD (Mol2Mol, REINVENT), refined molecules after property-aware optimization, 5,000 randomly sampled compounds from ChEMBL, and the reference drug venetoclax.
Property-Aware Molecular Refinement
To enhance the quality of the seed molecule pool, AgentD performs property-driven refinement of SMILES structures by identifying and addressing limitations that compromise their drug-likeness. Using the Deep-PK model API, the agent predicts 67 ADMET properties, as well as 7 general physicochemical properties. A complete list of predicted ADMET properties is provided in Supplementary Table S2. In parallel, AgentD uses the BAPULM model to estimate binding affinity (pK d).
Based on these predictions, AgentD identifies unfavorable molecular properties relevant to drug development, such as low permeability or high toxicity, and proposes targeted structural edits to improve them. A comprehensive list of weakness-associated properties is provided in Supplementary Table S3. We demonstrate this refinement process over two iterations, beginning with an initial set of 100 molecules. These iterations produced 99 and 95 additional valid SMILES, respectively, which were added to the candidate pool. For instance, a molecule is initially flagged for poor permeability due to a low predicted log P app value (Figure a). In response, AgentD suggested replacing a hydroxyl group with a methyl group in the first round, followed by substituting a sulfonamide with an amine in the second - both changes contributed to improved predicted permeability. Figure d illustrates another case of successful refinement, where the targeted weak properties, such as toxicity and permeability, were improved at each step. If resources permit, further iterations can be used to improve weak properties and strengthen the molecule’s potential as a drug candidate.
5.
Property-aware molecular refinements by AgentD. (a) Iterative changes improve permeability. (b) An unintended modification leads to improved permeability. (c) An intended modification fails to improve permeability. (d) Iterative changes improve both toxicity and permeability.
Refinements are not always successful or implemented as intended. Two common failure modes are observed: (i) unintended SMILES modifications that diverge from the agent’s stated intent, and (ii) correctly executed modifications that fail to improve the target property. For case (i), in Figure b, the agent intended a nitro-to-hydrogen substitution and successfully applied it, but also unintentionally removed a carbonyl group; nonetheless, the predicted permeability improved. Additionally, in Figure c, an intended methoxy-to-ethoxy substitution was improperly implemented, resulting in the loss of an oxygen atom and decreased permeability. For case (ii), even when the modification is applied correctly, the desired outcome may not be achieved, as seen in the first-round refinement of Figure c, where the permeability worsened despite the intended structural change.
In the first refinement round, 57 out of the initial 100 molecules were identified as having low permeability. Among these, approximately 44% showed improved log P app values after refinement, 26% declined, and 30% remained unchanged (Figure a). In the second round, 52% improved, 17% declined, and 31% were unchanged. For the molecules that showed target property improvement, the median log P app increased from −5.33 to −5.03 (Figure b). Although the magnitude of change is modest, the upward trend is clearly visible within just two iterations. Toxicity improvements were relatively more limited: 24% and 20% of high-toxicity molecules showed improvement in the first and second rounds, respectively, with most remaining toxic despite modification (Figure c).
6.
Property-aware molecular refinement results. (a) Proportion of molecules with improved, declined, or unchanged permeability (logPapp). (b) Boxplot of logPapp for entries showing improvement. (c) Molecular refinement patterns for molecules with improved permeability. (d) QED distribution shifts across iterations (vanilla GPT-4o indicates refinement solely through direct queries to GPT-4o). (e) Counts of molecules satisfying empirical drug-likeness rules (definitions in Methods). (f) Fraction of molecules modified for permeability; “Analysis failed” indicates unresolved substructure differences by the substructure algorithm. (g) Functional-group frequencies in input vs refined molecules across iterations (input = input for refinement; refined = postrefinement outputs).
Despite occasional local failures, as illustrated in Figure b and c, the overall impact of refinement is positive across the molecule pool. This is supported by increased QED (Quantitative Estimate of Drug-likeness) scores, which reflect overall drug-likeness based on empirical physicochemical properties. As shown in Figure d, the distribution of high-QED molecules expanded with each refinement iteration: the number of molecules with QED > 0.6 increased from 34 in the original set to 49 after the first update, and to 55 after the second.
As shown in Figure e, LLM-based refinement increased the proportion of molecules meeting empirical drug-likeness and lead-likeness criteria, including Lipinski’s Rule of Five, Veber’s rule, Ghose’s filter, and Oprea’s lead-likeness filter. − These rules assess different aspects of molecular suitability: Lipinski’s and Veber’s rules emphasize oral bioavailability and permeability, Ghose’s filter captures physicochemical ranges observed in marketed drugs, and Oprea’s filter identifies compounds suitable as lead-like starting points for optimization. After refinement, more molecules satisfied Veber’s rule (78 to 86) and Oprea’s filter (60 to 67). The most notable improvement was observed for Ghose’s filter, with the number of passing molecules increasing from 32 to 55 (∼72% increase). These results indicate that refinement helped expand the chemical space toward drug-like and lead-like regions, particularly in terms of Ghose’s statistical descriptors. Compliance with Lipinski’s rule remained largely unchanged; this outcome is further discussed in the Discussion section.
AgentD tends to prioritize edits to polar or electron-withdrawing groups, such as hydroxyls, sulfonamides, and nitro groups, when attempting to improve permeability-related properties. Figure f shows the distribution of SMILES modifications for entries with permeability as the target weakness property. The most frequent edits involved hydroxyl groups (30 cases, 15.5%), followed by sulfonamide substitutions (29 cases, 15.0%) and nitro group changes (21 cases, 10.8%). Other commonly modified motifs included alkyl groups (8.3%) and halogens (6.2%), and together these five categories account for 55.7% of all refinements. Also, AgentD’s refinement behavior does not merely reflect functional-group prevalence. For instance, despite nitro groups being less common than alkyl, sulfonamide, or hydroxyl groups, nitro edits still represent a substantial portion of modifications (10.8%; Figure g). This targeted preference for polar and electron-withdrawing group edits contrasts sharply with the behavior of a vanilla GPT-4o model, which predominantly applies halogen and alkyl substitutions and does not exhibit polar functional-group enrichment in the absence of property-aware guidance (see Supplementary Figure S9).
3D Structure Generation
After refining the seed molecule pool through property-aware modifications, we applied empirical filtering criteria to select candidates for 3D structural evaluation. The purpose of this filtering scheme is not to identify a final drug candidate, but to illustrate the agent’s capability to perform key tasks in the drug discovery pipeline. In our workflow, molecules are shortlisted if they (i) satisfy Oprea’s lead-likeness filter, (ii) pass at least two of the three drug-likeness rules (Lipinski’s Rule of Five, Veber’s rule, and Ghose’s filter), (iii) achieve a QED score above 0.55, and (iv) have a predicted pK d value greater than 6.0.
The molecule shown in Figure satisfies the lead-likeness requirement, passes all three drug-likeness rules, has a QED score of 0.68, and achieves a predicted pK d of 6.18. This is simply a representative candidate for further analysis. It is important to note that these empirical rules may not always align, and molecules rarely satisfy all of them simultaneously. In fact, only one molecule in our set passed all three rules, which we selected as an illustrative example for 3D visualization. Additional examples are provided in Supplementary Figure S7.
7.
Protein–ligand structure generation and evaluation. (a) A representative 3D complex predicted by Boltz-2, with estimated IC50 and inhibitor probability. (b) Radar plot showing empirical rule-based evaluation. Green dotted lines indicate the thresholds for each rule, while the purple solid line represents the properties of the selected molecule. See Methods and Table S4 for criteria.
AgentD uses Boltz-2 as a tool to generate 3D protein–ligand complex structures using the ligand’s SMILES and the target protein’s FASTA sequence. These structures provide a foundation for more rigorous downstream evaluations, such as molecular docking, MD simulations, and free energy perturbation analyses, to assess binding stability, conformational flexibility, and interaction specificity under biologically relevant conditions. In addition to 3D structure, Boltz-2 returns estimated binding metrics, including logIC50 values and inhibitor probabilities. In Figure , the predicted IC50 (0.54 μM) and inhibitor probability indicate good submicromolar binding affinity, making the molecule a suitable candidate for further validation using molecular dynamics (MD) simulations.
Comparison Across Language Models
AgentD can operate with a variety of LLMs, including those from OpenAI, Anthropic, and DeepSeek. In practice, the framework is optimized for GPT-4o. We tested AgentD using GPT-4o, Claude-3.7-Sonnet, and DeepSeek-Chat. − DeepSeek-Chat, an open-source model, successfully executed all six tasks and demonstrated behavior broadly comparable to GPT-4o. In contrast, Claude-3.7-Sonnet exhibited failure modes in two tasks: domain-specific question answering and molecular refinement.
In the question-answering task, Claude-3.7-Sonnet encountered syntax errors when parsing PDF files. AgentD, powered by GPT-4o, achieved higher BERTScores than AgentD with DeepSeek-Chat (Table ). Among the vanilla language models, DeepSeek-Chat outperformed GPT-4o, highlighting that the performance gain provided by AgentD is more pronounced when using GPT-4o.
In the molecular refinement task, AgentD with Claude-3.7-Sonnet occasionally became trapped in repeated attempts to generate valid updated SMILES, despite explicit instructions to avoid infinite retries (see Supplementary Figure S4). For the quantitative comparison of property-aware molecular refinement shown in Figure , we report results from a verified successful run of Claude-3.7-Sonnet to ensure a fair comparison.
8.
Comparison of AgentD performance across different language models. (a) Distribution of QED values before and after two iterations of property-aware molecular refinement for AgentD powered by GPT-4o, Claude-3.7-Sonnet, and DeepSeek-Chat. (b) Histogram showing the number of molecules passing empirical lead- and drug-like rules. (c–e) Tanimoto similarity between the original and updated SMILES for the top five most frequently targeted weakness properties for AgentD powered by GPT-4o (c), Claude-3.7-Sonnet (d), and DeepSeek-Chat (e).
All three model cases improved the QED of seed molecules after two iterations of property-aware refinement (Figure a). Starting from an average QED of 0.47, AgentD, powered by GPT-4o, increased the mean value to 0.57, Claude-3.7-Sonnet increased it to 0.59, and DeepSeek-Chat increased it to 0.55. While all models demonstrated meaningful improvements, DeepSeek-Chat achieved a slightly smaller gain relative to the others. The seed set contained 23.0% of molecules with QED > 0.6, which increased to 45.7% with GPT-4o, 44.9% with Claude-3.7-Sonnet, and 40.0% with DeepSeek-Chat after the second refinement step.
The rule-based evaluation shows a similar trend across the three LLMs, with the most evident increase observed in the number of molecules passing the Ghose filter (Figure b). Claude-3.7-Sonnet and DeepSeek-Chat do not exhibit the slight deterioration in Lipinski’s Rule of Five observed with GPT-4o, although both models yield slightly fewer molecules passing Veber’s rule.
Each language model exhibits some variability in the selection of “weakness” properties during molecular refinement. Figures c–e report the Tanimoto similarity , between the original SMILES and the updated SMILES for the top five most frequently targeted weakness properties. For AgentD powered by GPT-4o, the top five properties were logPapp, AMES mutagenicity, liver injury I toxicity, logP, and hERG blocker toxicity, with median similarities of 0.71, 0.71, 0.67, 0.81, and 0.79, respectively. Despite model-specific differences, the selected properties largely arise from a consistent subset of the 75 property end points, indicating broadly aligned refinement priorities. AgentD with Claude-3.7-Sonnet prioritized liver injury I toxicity, hERG blocker toxicity, AMES mutagenicity, liver injury II toxicity, and logP, with median similarities of 0.76, 0.79, 0.73, 0.79, and 0.83. For AgentD with DeepSeek-Chat, the top five properties were liver injury II toxicity, hERG blocker toxicity, liver injury I toxicity, AMES mutagenicity, and logPapp, with median similarities of 0.80, 0.80, 0.77, 0.70, and 0.77. Although all three models generally target similar toxicity- and permeability-related weaknesses, the specific ranking of properties and the degree of structural change.
Discussion
Property Awareness Improves Drug-Likeness
Weakness-property awareness substantially improves drug-likeness, even though drug-likeness is not explicitly used as an optimization target. Refinement via a vanilla LLM, without property-aware guidance, yields little measurable improvement (Figure d–e). In contrast, incorporating weakness-property information leads to pronounced gains, indicating that providing explicit context about what is deficient in a molecule enables more targeted and pharmacologically meaningful edits. This functions similarly to chain-of-thought reasoning: giving the LLM stepwise situational cues (e.g., “this is the weakness, so modify accordingly”) helps steer its structural decisions in a more directed and interpretable manner.
AgentD’s refinement pattern also does not simply mirror the prevalence of functional groups in the data set (Figure f–g). Instead, the model displays distinct preferences in the types of chemical transformations it applies, suggesting that the observed patterns arise from its learned reasoning rather than from data set-driven frequencies.
The benefit of property-aware refinement is consistent across all LLMs used as the agent’s core engine (Figure a–b). Whether using GPT-4o, Claude-3.7-Sonnet, or DeepSeek-Chat, the refinement strategy reliably increases drug-likeness. This indicates that the performance gains stem from the refinement paradigm itself rather than from model-specific idiosyncrasies.
Challenges in Molecular Refinement
The property-aware molecular refinement module also shows limitations. Molecular optimization remains intrinsically complex due to interdependencies among properties. For example, in Figure a and b, predicted apparent permeability (logPapp) is influenced by lipophilicity (logP), consistent with Overton’s rule, whereas aqueous solubility (logS) does not follow a consistent trend and may either increase or decrease. These observations highlight the complex, nonlinear interplay among lipophilicity, solubility, and permeability. Similarly, the first-round refinement in Figure d mitigates predicted liver toxicity but worsens permeability. These trade-offs highlight the fundamental challenge of multiproperty optimization in drug design. These findings highlight the importance of incorporating multiobjective optimization in future iterations to better balance competing pharmacological goals. Despite these complexities, AgentD achieves a net improvement in overall drug-likeness across the molecule pool, as shown in Figure c and d.
Molecular optimization is also target-dependent, as multiple interacting physicochemical properties influence outcomes. Across the three therapeutic targets, BCL-2, JAK-2, and thrombin, drug-likeness improves overall, as reflected by QED distributions and rule-based filters, though the improvement patterns differ. For BCL-2 and thrombin, the largest gains occur under the Ghose filter, while Lipinski’s Rule of Five shows only minor improvement, likely because oral bioavailability was rarely selected as a weakness (Supplementary Table S3). By contrast, JAK-2 shows stronger gains under Lipinski’s criteria but minimal change under the Ghose filter. These differences highlight the intrinsic complexity of molecular refinement, where multiple interdependent properties are convoluted.
Effective Retrieval and Generation
The agent reliably retrieves protein and ligand data from both structured databases and web sources. While our current implementation focuses on targeted extraction, future extensions could incorporate large-scale literature mining and relevance scoring to support broader knowledge synthesis.
For domain-specific question answering, AgentD’s RAG-augmented responses consistently outperform standard LLM outputs in both contextual accuracy and mechanistic completeness (Table ). Because RAG performance depends strongly on the quality of retrieved literature, expanding retrieval beyond open-access sources may further enhance results. Our current implementation uses a deliberately minimal retrieval pipelinewithout filters on publication date, venue quality, or citation metricsyet still provides clear gains in factual grounding and mechanistic fidelity. These findings demonstrate that even a raw, loosely constrained RAG setup can substantially improve scientific question answering. We further evaluated alternative retrieval strategies, including citation-ranked and publication-year-filtered approaches (Supplementary Table S2), which also outperform the vanilla LLM baseline without RAG.
For molecule generation, AgentD integrates external models such as REINVENT and Mol2Mol using natural language-driven configuration updates, enabling seamless integration of new tools with minimal customization. By leveraging a language model to interpret and modify configuration files based on high-level instructions, the system can flexibly switch between different generative models or sampling strategies without manual code changes. This approach removes the need for model-specific hardcoded logic and allows for rapid adaptation to evolving architectures and file structures. The modular design of AgentD further ensures its long-term adaptability as generative modeling techniques continue to advance.
The structure generation task complements the pooling and refinement process by enabling 3D protein–ligand complex generation, which provides critical input for structure-based studies. Boltz-2 captures meaningful trends in ligand binding behavior, but its IC50 and inhibitor probability predictions should be interpreted as rough approximations that are useful for initial prioritization rather than substitutes for MD-based physical validation.
Limitations and Future Directions
Despite its modular design and encouraging end-to-end performance, AgentD has several limitations. The workflow depends on external predictive tools such as Deep-PK, BAPULM, and Boltz-2, each with its own error profile and inherent uncertainties. Errors in these modules can propagate downstream, affecting both refinement decisions and compound prioritization. For instance, an inaccurate ADMET property prediction can lead to incorrect weakness determination and suboptimal molecular refinements. To mitigate this, we analyze outcomes at the group level rather than on a per-run or per-molecule basis. Despite these uncertainties, the overall drug-likeness, as measured by QED and empirical rules, shows clear improvement after iterative refinement. AgentD’s current goal is to generate a promising compound pool for further evaluation, making this approach suitable for its purpose. To enhance individual success rates, future iterations of AgentD should integrate more rigorous predictive tools as they become available. This rationale also underpins the framework’s modular design, allowing updated tools to be incorporated easily. In particular, predicted protein–ligand complex structures and affinity should be interpreted as screening-level signals and require downstream MD/free-energy validation prior to mechanistic or potency claims.
Conclusion
In this study, we introduced AgentD, a modular, LLM-powered agent framework for automating and streamlining key tasks of the drug discovery pipeline. By integrating language model reasoning with domain-specific tools and databases, AgentD can: (i) retrieve relevant protein and compound data from structured databases and web sources; (ii) answer domain-specific scientific questions grounded in literature; (iii) generate diverse, context-aware molecules using both de novo and conditional models; (iv) predict pharmacologically relevant properties; (v) refine molecular representations through iterative, property-aware optimization; and (vi) construct protein–ligand complex structures for downstream simulations.
AgentD marks a step toward general-purpose, AI-driven scientific agents for therapeutic discovery. Its modular architecture supports seamless integration of new models and tools, ensuring continued adaptability as technologies evolve. As a necessary downstream validation step, MD simulations should be applied to physically validate AgentD-generated 3D complexes, ideally complemented by free-energy calculations when feasible. Future work will focus on automating this validation within the agent loop, along with multiobjective molecular optimization and conditional generation toward desired pharmacological profiles.
Methods
Large Language Model Agent
Our framework primarily employs OpenAI’s GPT-4o model as the main language model for all AgentD modules. GPT-4o is an optimized version of GPT-4, part of the Generative Pretrained Transformer family. In addition, we integrated Anthropic’s Claude-3–7-sonnet-20250219 and DeepSeek-Chat models. All models are accessed via their respective APIs and loaded dynamically using the LangChain library, which simplifies prompt management, response processing, and integration with external tools. All models are used with default settings provided by the API. For all API calls, no local GPU resources are required; however, for tool-specific tasks such as binding affinity prediction, SMILES generation, and protein–ligand structure generation, computations are performed on a single NVIDIA RTX 2080 Ti GPU.
Database
We utilize two established bioinformatics resources, UniProt and ChEMBL, to access protein sequence data and small molecule representations, respectively. UniProt provides high-quality annotated protein sequences between species, enabling access to functionally characterized targets of therapeutic relevance. For each therapeutically relevant protein target (e.g., EGFR or TP53), the agent queries UniProt’s REST API, restricted to the Homo sapiens taxonomy (organism ID: 9606) to retrieve the corresponding amino acid sequence in FASTA format. When existing drugs are available for the target, we subsequently query ChEMBL to obtain SMILES representations of these known compounds. This ChEMBL query step is only executed if existing drugs for the target are identified through prior web searches; otherwise, this step is skipped.
Retrieval-Augmented Generation
AgentD incorporates a lightweight Retrieval-Augmented Generation (RAG) workflow to provide literature-grounded responses during domain-specific question answering. In its current implementation, literature retrieval is intentionally kept minimalistic: the agent generates search keywords, and papers are retrieved solely through the Serper API using these keywords, without applying explicit filters on publication year, venue quality, or citation count. As a result, the retrieved documents may vary in quality and relevance.
Once retrieved, PDFs are processed with a CharacterTextSplitter (chunk size 1000 characters, 50-character overlap) to maintain contextual continuity across segments. Each chunk is embedded using EMBEDDING_MODEL and stored in a FAISS vector index to allow uniform and efficient retrieval. During the question-answering phase, the LLM performs similarity search over the FAISS index to identify the most relevant text segments, which are then supplied as grounding context for response generation.
ADMET Properties
ADMET properties encompass Absorption, Distribution, Metabolism, Excretion, and Toxicity characteristics that are critical for drug development. , These pharmacologically relevant descriptors determine how a compound is absorbed into the bloodstream, distributed across tissues, metabolized by enzymatic systems, eliminated from the body, and whether it poses potential toxic effects. Early prediction of ADMET properties is essential for prioritizing compounds with favorable biopharmaceutical and safety profiles, thereby reducing downstream attrition during drug development. ,
As part of this multistage assessment, we integrated Deep-PK, an ADMET prediction framework that operates on SMILES input and internally employs a Message Passing Neural Network (MPNN) to capture the atomic and topological features of each molecule. Candidate ligands, whether recovered from ChEMBL or generated, are submitted to the Deep-PK REST API via their SMILES strings, and the resulting ADMET profiles are parsed to prioritize compounds with favorable pharmacokinetic and safety attributes. The complete set of predicted properties is summarized in Supplementary Table S2.
A tolerance mechanism was implemented for Deep-PK API requests. After submitting SMILES, the system polls for job completion at fixed intervals (30 s) up to a maximum wait time (500 s). API errors or missing job IDs are logged and returned without interrupting the pipeline, and running jobs are retried until completion or timeout.
Binding Affinity
Binding affinity to the biological target is a fundamental determinant of therapeutic potential, serving as a key predictor of drug potency and selectivity. We employ two complementary strategies for affinity prediction. In the property prediction task, the sequence-based BAPULM model employs a dual encoder architecture to estimate the dissociation constant (K d) from protein amino acid sequences and ligand SMILES representations. It integrates two domain-specific pretrained language models: ProtT5-XL-U50 for proteins and MolFormer for small molecules. , Each encoder generates latent embeddings tailored to its respective input, which are then projected into a shared latent space using learnable feedforward projection heads. Subsequently, a predictive head processes these joint representations to estimate the binding affinity, reported as pK d = – log10(K d). Given its computational efficiency and reliance solely on sequence-level inputs, it is well-suited for early stage screening of large ligand libraries based on predicted binding affinity.
In the structure generation task, the structure-based model Boltz-2 begins with the same sequence and SMILES input but internally generates 3D protein–ligand complex structures. These conformations are then used to predict the half-maximum inhibitory concentration (IC50), reflecting the inhibitory efficacy of a compound in biochemical assays. In addition to regression-based affinity values, Boltz also outputs inhibitor probability scores, indicating the likelihood that a given ligand acts as an active binder. Although K d and IC50 originate from different experimental setups, they are related through the Cheng–Prusoff equation (eq ).
| 1 |
where K i approximates K d under certain biochemical conditions.
SMILES Generation
We utilize two molecular generators: REINVENT for de novo molecular design and Mol2Mol for molecular optimization. − Both generators utilize recurrent neural networks and transformer architectures and are embedded within machine learning optimization algorithms, including reinforcement learning and transfer learning. , Both models were implemented using the REINVENT4 package and configured to generate one molecule per input SMILES (‘num_smiles = 157’), while retaining only unique molecules (‘unique_molecules = true’) with canonicalized SMILES to eliminate duplicates.
REINVENT performs de novo molecular generation using sequence-based models that capture SMILES token probabilities in an autoregressive manner. The models are trained via teacher-forcing on large SMILES data sets to learn chemical syntax and generate valid molecules.
Mol2Mol performs conditional generation, accepting input SMILES strings and generating structurally similar molecules within user-defined similarity constraints. The transformer-based model was trained on over 200 billion molecular pairs from PubChem with Tanimoto similarity ≥ 0.50. Training employed ranking loss to directly link negative log-likelihood to molecular similarity. The model supports multinomial sampling with temperature control (set to 1.0 in this study).
Protein–Ligand Complex Structure
Boltz-2 is a structure-based deep learning model which can jointly predict 3D protein–ligand complex structures by integrating protein folding and ligand binding into a unified framework. The model takes as input a protein FASTA sequence and a ligand SMILES string, and simultaneously infers the full atomic conformation of the protein as well as the bound pose of the ligand within the predicted binding pocket. Unlike traditional docking pipelines that require experimentally resolved protein structures, Boltz-2 performs ab initio structure prediction, enabling end-to-end modeling from sequence alone.
Its architecture builds upon the Evoformer stack and SE(3)-equivariant transformer modules, incorporating interleaved attention mechanisms to capture long-range dependencies both within the protein sequence and between the protein and ligand. ,, This allows Boltz-2 to reason over flexible ligand conformations and protein-side chain rearrangements in a physics-aware manner. By directly predicting all-atom 3D coordinates, the model enables rapid generation of realistic protein–ligand complexes suitable for downstream scoring.
Evaluation Metrics
To evaluate the drug-likeness, lead-likeness, and pharmacokinetic relevance of the generated molecules, we applied four widely used rule-based filters: Lipinski’s Rule of Five, Veber’s Rule, the Ghose filter, and Oprea’s Lead-like Rule. The first three rules assess general drug-likeness, while Oprea’s rule specifically addresses lead-likeness as an early stage starting point for optimization. The detailed numerical thresholds for each rule are summarized in Table S4 of the Supporting Information. All descriptors were computed using RDKit.
Lipinski’s Rule of Five includes five criteria: molecular weight (MW) ≤ 500 Da, LogP ≤ 5, hydrogen bond donors (HBD) ≤ 5, hydrogen bond acceptors (HBA) ≤ 10, and rotatable bonds ≤ 10. A molecule was considered compliant if it satisfied at least four of the five conditions. Although the original formulation included only the first four parameters, the rotatable bond threshold has since been adopted to better capture oral absorption potential.
Veber’s rule assesses polarity and flexibility using two thresholds: topological polar surface area (TPSA) ≤ 140 Å2 and rotatable bonds ≤ 10. Both conditions must be met for compliance.
The Ghose filter identifies drug-like chemical space based on statistical distributions observed in marketed drugs. Its thresholds include MW between 160–480 Da, LogP between −0.4 and 5.6, molar refractivity (MR) between 40–130, and heavy atom count between 20–70. Molecules meeting these ranges are considered to occupy favorable physicochemical space for drug development.
Oprea’s Lead-like Rule defines chemical properties suitable for “lead” compoundsmolecules that may not yet meet full drug-likeness but serve as starting points for further optimization. Criteria include MW between 200–450 Da, LogP between −1 and 4.5, HBD ≤ 5, HBA ≤ 8, rotatable bonds ≤ 8, and aromatic rings ≤ 4, with allowance for one violation. Compared to drug-like rules, these thresholds are narrower, reflecting the preference for smaller, less lipophilic molecules that can be synthetically elaborated into drug candidates.
Technology Use Disclosure
We used ChatGPT and Claude to help with grammar and typographical corrections during the preparation of this preprint manuscript. The authors have carefully reviewed, verified, and approved all content to ensure accuracy and integrity.
Supplementary Material
Acknowledgments
The authors gratefully acknowledge support from the H. Robert Sharbaugh Presidential Fellowship.
The code that supports this study’s findings can be found in the following publicly available GitHub repository: https://github.com/hoon-ock/AgentD.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.5c02454.
Supporting Information is available and includes the following sections: Domain-specific Question Answering, Failure Cases with Claude-3.7-Sonnet, ADMET Properties, Identified Unfavorable Drug Properties, JAK-2 for Myelofibrosis, Thrombin for Thrombosis, Empirical Filters, Structure Generation Examples, LLM-based Molecule Optimization Approach, Uncertainty in Prediction Tools, Baseline Structural Modification, and Strategy–Structure Alignment Analysis. (PDF)
⊥.
J.O. and R.S.M. contributed equally to this work.
J.O., A.B.F. conceived the study, and J.O., R.S.M., A.C., N.S.A., and S.B. conducted the experiments, analyzed the data, and wrote the original draft. R.S.M., N.S.A., S.B., and A.C. contributed to experimental design and data analysis. A.B.F. supervised the project, provided funding acquisition, and reviewed the manuscript.
The authors declare no competing financial interest.
References
- Paul S. M., Mytelka D. S., Dunwiddie C. T., Persinger C. C., Munos B. H., Lindborg S. R., Schacht A. L.. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat. Rev. Drug Discovery. 2010;9:203–214. doi: 10.1038/nrd3078. [DOI] [PubMed] [Google Scholar]
- Roland A., Fox W., Baker A.. Efficiency, effectiveness and productivity in pharmaceutical R&D. Nat. Rev. Drug Discovery. 2024;23:656–657. doi: 10.1038/d41573-024-00068-6. [DOI] [PubMed] [Google Scholar]
- Subbiah V.. The next generation of evidence-based medicine. Nat. Med. 2023;29:49–58. doi: 10.1038/s41591-022-02160-z. [DOI] [PubMed] [Google Scholar]
- Bohacek R. S., McMartin C., Guida W. C.. The art and practice of structure-based drug design: A molecular modeling perspective. Medicinal Research Reviews. 1996;16:3–50. doi: 10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.0.CO;2-6. [DOI] [PubMed] [Google Scholar]
- Polishchuk P. G., Madzhidov T. I., Varnek A.. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design. 2013;27:675–679. doi: 10.1007/s10822-013-9672-4. [DOI] [PubMed] [Google Scholar]
- Fan J., Fu A., Zhang L.. Progress in molecular docking. Quantitative Biology. 2019;7:83–89. doi: 10.1007/s40484-019-0172-y. [DOI] [Google Scholar]
- Trott O., Olson A. J.. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010;31:455–461. doi: 10.1002/jcc.21334. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pagadala N. S., Syed K., Tuszynski J.. Software for molecular docking: a review. Biophysical Reviews. 2017;9:91–102. doi: 10.1007/s12551-016-0247-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Polishchuk P.. Interpretation of Quantitative Structure–Activity Relationship Models: Past, Present, and Future. J. Chem. Inf. Model. 2017;57:2618–2639. doi: 10.1021/acs.jcim.7b00274. [DOI] [PubMed] [Google Scholar]
- Perkins R., Fang H., Tong W., Welsh W. J.. Quantitative structure-activity relationship methods: Perspectives on drug discovery and toxicology. Environ. Toxicol. Chem. 2003;22:1666–1679. doi: 10.1897/01-171. [DOI] [PubMed] [Google Scholar]
- Tropsha A., Isayev O., Varnek A., Schneider G., Cherkasov A.. Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR. Nat. Rev. Drug Discovery. 2024;23:141–155. doi: 10.1038/s41573-023-00832-0. [DOI] [PubMed] [Google Scholar]
- Karplus M., Kuriyan J.. Molecular dynamics and protein function. Proc. Natl. Acad. Sci. U. S. A. 2005;102:6679–6685. doi: 10.1073/pnas.0408930102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Luan B., Wang H., Huynh T.. Enhanced binding of the N501Y-mutated SARS-CoV-2 spike protein to the human ACE2 receptor: insights from molecular dynamics simulations. FEBS Lett. 2021;595:1454–1461. doi: 10.1002/1873-3468.14076. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ock J., Mollaei P., Barati Farimani A.. GradNav: Accelerated Exploration of Potential Energy Surfaces with Gradient-Based Navigation. J. Chem. Theory Comput. 2024;20:4088–4098. doi: 10.1021/acs.jctc.4c00316. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Badrinarayanan, S. ; Magar, R. ; Antony, A. ; Meda, R. S. ; Farimani, A. B. . MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models. 2025; https://arxiv.org/abs/2506.00198. [DOI] [PMC free article] [PubMed]
- Guntuboina C., Das A., Mollaei P., Kim S., Barati Farimani A.. Peptidebert: A language model based on transformers for peptide property prediction. J. Phys. Chem. Lett. 2023;14:10427–10434. doi: 10.1021/acs.jpclett.3c02398. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mollaei P., Sadasivam D., Guntuboina C., Barati Farimani A.. IDP-Bert: Predicting Properties of Intrinsically Disordered Proteins Using Large Language Models. J. Phys. Chem. B. 2024;128:12030–12037. doi: 10.1021/acs.jpcb.4c02507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Badrinarayanan S., Guntuboina C., Mollaei P., Barati Farimani A.. Multi-Peptide: Multimodality Leveraged Language-Graph Learning of Peptide Properties. J. Chem. Inf. Model. 2025;65:83–91. doi: 10.1021/acs.jcim.4c01443. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jumper J.. et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang Z., Zeng X., Zhao Y., Chen R.. AlphaFold2 and its applications in the fields of biology and medicine. Signal Transduction and Targeted Therapy. 2023;8:115. doi: 10.1038/s41392-023-01381-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Abramson J.. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. doi: 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jin, W. ; Barzilay, R. ; Jaakkola, T. . Junction Tree Variational Autoencoder for Molecular Graph Generation. Proceedings of the 35th International Conference on Machine Learning. 2018; pp 2323–2332.
- Cao, N. D. ; Kipf, T. . MolGAN: An implicit generative model for small molecular graphs. 2022; https://arxiv.org/abs/1805.11973.
- Peng, X. ; Guan, J. ; Liu, Q. ; Ma, J. . MolDiff: Addressing the Atom-Bond Inconsistency Problem in 3D Molecule Diffusion Generation. 2023; https://arxiv.org/abs/2305.07508.
- Xiong G., Wu Z., Yi J., Fu L., Yang Z., Hsieh C., Yin M., Zeng X., Wu C., Lu A., Chen X., Hou T., Cao D.. ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res. 2021;49:W5–W14. doi: 10.1093/nar/gkab255. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu Z., Ramsundar B., Feinberg E., Gomes J., Geniesse C., Pappu A. S., Leswing K., Pande V.. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 2018;9:513–530. doi: 10.1039/C7SC02664A. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rollins Z. A., Cheng A. C., Metwally E.. MolPROP: Molecular Property prediction with multimodal language and graph fusion. Journal of Cheminformatics. 2024;16:56. doi: 10.1186/s13321-024-00846-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mayr, A. ; Klambauer, G. ; Unterthiner, T. ; Hochreiter, S. . DeepTox: Toxicity Prediction using Deep Learning Front. Environ. Sci. 2016, Volume 3. 10.3389/fenvs.2015.00080 [DOI] [Google Scholar]
- Berdigaliyev N., Aljofan M.. An Overview of Drug Discovery and Development. Future Medicinal Chemistry. 2020;12:939–947. doi: 10.4155/fmc-2019-0307. [DOI] [PubMed] [Google Scholar]
- Hughes J., Rees S., Kalindjian S., Philpott K.. Principles of early drug discovery. Br. J. Pharmacol. 2011;162:1239–1249. doi: 10.1111/j.1476-5381.2010.01127.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sinha, S. ; Vohora, D. In Pharmaceutical Medicine and Translational Clinical Research; Vohora, D. , Singh, G. , Eds.; Academic Press: Boston, 2018; pp 19–32. [Google Scholar]
- Lu, C. ; Lu, C. ; Lange, R. T. ; Foerster, J. ; Clune, J. ; Ha, D. . The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. 2024; https://arxiv.org/abs/2408.06292.
- Boiko D. A., MacKnight R., Kline B., Gomes G.. Autonomous chemical research with large language models. Nature. 2023;624:570–578. doi: 10.1038/s41586-023-06792-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bran M. A., Cox S., Schilter O., Baldassari C., White A. D., Schwaller P.. Augmenting large language models with chemistry tools. Nature Machine Intelligence. 2024;6:1–11. doi: 10.1038/s42256-024-00832-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ock, J. ; Vinchurkar, T. ; Jadhav, Y. ; Farimani, A. B. . Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent. 2024; https://arxiv.org/abs/2410.16658.
- Jadhav, Y. ; Farimani, A. B. . Large Language Model Agent as a Mechanical Designer. 2024; https://arxiv.org/abs/2404.17525.
- Jadhav, Y. ; Pak, P. ; Farimani, A. B. . LLM-3D Print: Large Language Models To Monitor and Control 3D Printing. 2024; https://arxiv.org/abs/2408.14307.
- Chaudhari, A. ; Ock, J. ; Barati Farimani, A. . Modular Large Language Model Agents for Multi-Task Computational Materials Science. 2025; 10.26434/chemrxiv-2025-zkn81-v2. [DOI]
- Ansari, M. ; Watchorn, J. ; Brown, C. E. ; Brown, J. S. . dZiner: Rational Inverse Design of Materials with AI Agents. 2024; https://arxiv.org/abs/2410.03963.
- McNaughton A. D., Sankar Ramalaxmi G. K., Kruel A., Knutson C. R., Varikoti R. A., Kumar N.. CACTUS: Chemistry Agent Connecting Tool Usage to Science. ACS Omega. 2024;9:46563–46573. doi: 10.1021/acsomega.4c08408. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ding K., Yu J., Huang J., Yang Y., Zhang Q., Chen H.. SciToolAgent: a knowledge-graph-driven scientific agent for multitool integration. Nat. Comput. Sci. 2025;5:962–972. doi: 10.1038/s43588-025-00849-y. [DOI] [PubMed] [Google Scholar]
- Li, K. ; Wu, Z. ; Wang, S. ; Wu, J. ; Pan, S. ; Hu, W. . DrugPilot: LLM-based parameterized reasoning agent for drug discovery. 2025.
- Consortium T. U.. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43:D204–D212. doi: 10.1093/nar/gku989. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gaulton A., Bellis L. J., Bento A. P., Chambers J., Davies M., Hersey A., Light Y., McGlinchey S., Michalovich D., Al-Lazikani B., Overington J. P.. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012;40:D1100–D1107. doi: 10.1093/nar/gkr777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blaschke T., Arús-Pous J., Chen H., Margreitter C., Tyrchan C., Engkvist O., Papadopoulos K., Patronov A.. REINVENT 2.0: An AI Tool for De Novo Drug Design. J. Chem. Inf. Model. 2020;60:5918–5922. doi: 10.1021/acs.jcim.0c00915. [DOI] [PubMed] [Google Scholar]
- Loeffler H. H., He J., Tibo A., Janet J. P., Voronov A., Mervin L. H., Engkvist O.. REINVENT 4: Modern AI–driven generative molecule design. Journal of Cheminformatics. 2024;16:20. doi: 10.1186/s13321-024-00812-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tibo A., He J., Janet J. P., Nittinger E., Engkvist O.. Exhaustive local chemical space exploration using a transformer model. Nat. Commun. 2024;15:7315. doi: 10.1038/s41467-024-51672-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Myung Y., deSá A. G. C., Ascher D. B.. Deep-PK: deep learning for small molecule pharmacokinetic and toxicity prediction. Nucleic Acids Res. 2024;52:W469–W475. doi: 10.1093/nar/gkae254. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Meda, R. S. ; Farimani, A. . BAPULM: Binding Affinity Prediction using Language Models. 2024; https://arxiv.org/abs/2411.04150.
- Wohlwend, J. ; Corso, G. ; Passaro, S. ; Reveiz, M. ; Leidal, K. ; Swiderski, W. ; Portnoi, T. ; Chinn, I. ; Silterra, J. ; Jaakkola, T. ; Barzilay, R. . Boltz-1 Democratizing Biomolecular Interaction Modeling. 2024; https://www.biorxiv.org/content/early/2024/11/20/2024.11.19.624167.
- Passaro, S. ; Corso, G. ; Wohlwend, J. ; Reveiz, M. ; Thaler, S. ; Somnath, V. R. ; Getz, N. ; Portnoi, T. ; Roy, J. ; Stark, H. ; Kwabi-Addo, D. ; Beaini, D. ; Jaakkola, T. ; Barzilay, R. . Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. 2025.
- Serper Serper: Google Search API. https://serper.dev, Accessed: 2025-06-12.
- Weller S., Toennießen A., Schaefer B., Beigl T., Muenchow A., Bopple K., Hofmann U., Gillissen B. F., Aulitzky W. E., Kopp H.-G., Essmann F.. et al. The BCL-2 inhibitor ABT-199/venetoclax synergizes with proteasome inhibition via transactivation of the MCL-1 antagonist NOXA. Cell death discovery. 2022;8:215. doi: 10.1038/s41420-022-01009-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang, T. ; Kishore, V. ; Wu, F. ; Weinberger, K. Q. ; Artzi, Y. . BERTScore: Evaluating Text Generation with BERT. 2019,.
- Leonetti A., Sharma S., Minari R., Perego P., Giovannetti E., Tiseo M.. Resistance mechanisms to osimertinib in EGFR-mutated non-small cell lung cancer. Br. J. Cancer. 2019;121:725–737. doi: 10.1038/s41416-019-0573-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beltagy, I. ; Lo, K. ; Cohan, A. . SciBERT: A pretrained language model for scientific text. 2019.
- Moriwaki H., Tian Y.-S., Kawashita N., Takagi T.. Mordred: a molecular descriptor calculator. Journal of Cheminformatics. 2018;10:4. doi: 10.1186/s13321-018-0258-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bickerton G. R., Paolini G. V., Besnard J., Muresan S., Hopkins A. L.. Quantifying the chemical beauty of drugs. Nat. Chem. 2012;4:90–98. doi: 10.1038/nchem.1243. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lipinski C. A., Lombardo F., Dominy B. W., Feeney P. J.. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev. 2001;46:3–26. doi: 10.1016/S0169-409X(00)00129-0. [DOI] [PubMed] [Google Scholar]
- Veber D. F., Johnson S. R., Cheng H.-Y., Smith B. R., Ward K. W., Kopple K. D.. Molecular properties that influence the oral bioavailability of drug candidates. J. Med. Chem. 2002;45:2615–2623. doi: 10.1021/jm020017n. [DOI] [PubMed] [Google Scholar]
- Ghose A. K., Viswanadhan V. N., Wendoloski J. J.. A knowledge-based approach in designing combinatorial or medicinal chemistry libraries for drug discovery. 1. A qualitative and quantitative characterization of known drug databases. J. Comb. Chem. 1999;1:55–68. doi: 10.1021/cc9800071. [DOI] [PubMed] [Google Scholar]
- Oprea T. I., Davis A. M., Teague S. J., Leeson P. D.. Is there a leadlike space? A critical assessment of lead-likeness criteria. J. Chem. Inf. Comput. Sci. 2001;41:1308–1320. doi: 10.1021/ci010366a. [DOI] [PubMed] [Google Scholar]
- OpenAI , GPT-4 Technical Report. 2023; https://cdn.openai.com/papers/gpt-4.pdf.
- Anthropic , Introducing Claude 3 Models. 2024; https://www.anthropic.com/news/claude-3-family.
- DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025; https://arxiv.org/abs/2501.12948.
- Lu, H. ; Liu, W. ; Zhang, B. ; Wang, B. ; Dong, K. ; Liu, B. ; Sun, J. ; Ren, T. ; Li, Z. ; Sun, Y. ; Deng, C. ; Xu, H. ; Xie, Z. ; Ruan, C. . DeepSeek-VL: Towards Real-World Vision-Language Understanding. 2024.
- Willett P., Barnard J. M., Downs G. M.. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998;38:983–996. doi: 10.1021/ci9800211. [DOI] [Google Scholar]
- Bajusz D., Rácz A., Héberger K.. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics. 2015;7:20. doi: 10.1186/s13321-015-0069-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Missner A., Pohl P.. 110 Years of the Meyer–Overton Rule: Predicting Membrane Permeability of Gases and Other Small Compounds. ChemPhysChem. 2009;10:1405–1414. doi: 10.1002/cphc.200900270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Radford, A. ; Narasimhan, K. . Improving Language Understanding by Generative Pre-Training. 2018; https://api.semanticscholar.org/CorpusID:49313245.
- Chase, H. LangChain. 2022; https://github.com/langchain-ai/langchain, Date released: 2022-10-17.
- Norinder U., Bergström C.. Prediction of ADMET Properties. ChemMedChem. 2006;1:920–937. doi: 10.1002/cmdc.200600155. [DOI] [PubMed] [Google Scholar]
- Ferreira L. L., Andricopulo A. D.. ADMET modeling approaches in drug discovery. Drug Discovery Today. 2019;24:1157–1165. doi: 10.1016/j.drudis.2019.03.015. [DOI] [PubMed] [Google Scholar]
- van de Waterbeemd H., Gifford E.. ADMET in silico modelling: towards prediction paradise? Nat. Rev. Drug Discovery. 2003;2:192–204. doi: 10.1038/nrd1032. [DOI] [PubMed] [Google Scholar]
- Leeson P. D., Springthorpe B.. The influence of drug-like concepts on decision-making in medicinal chemistry. Nat. Rev. Drug Discovery. 2007;6:881–890. doi: 10.1038/nrd2445. [DOI] [PubMed] [Google Scholar]
- Elnaggar A., Heinzinger M., Dallago C., Rehawi G., Yu W., Jones L., Gibbs T., Feher T., Angerer C., Steinegger M., Bhowmik D., Rost B.. ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021:1–1. doi: 10.48550/arXiv.2007.06225. [DOI] [PubMed] [Google Scholar]
- Ross J., Belgodere B., Chenthamarakshan V., Padhi I., Mroueh Y., Das P.. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence. 2022;4:1256–1264. doi: 10.1038/s42256-022-00580-7. [DOI] [Google Scholar]
- Wang B., Zhang R., Zhou H.. et al. Computing structure-based binding affinity of small molecules for drug discovery. Nature Reviews Chemistry. 2023;7:681–700. [Google Scholar]
- Yung-Chi C., Prusoff W. H.. Relationship between the inhibition constant (K_i) and the concentration of inhibitor which causes 50% inhibition (I_50) of an enzymatic reaction. Biochem. Pharmacol. 1973;22:3099–3108. doi: 10.1016/0006-2952(73)90196-2. [DOI] [PubMed] [Google Scholar]
- Elbadawi M., Gaisford S., Basit A. W.. Advanced machine-learning techniques in drug discovery. Drug Discovery Today. 2021;26:769–777. doi: 10.1016/j.drudis.2020.12.003. [DOI] [PubMed] [Google Scholar]
- Williams R. J., Zipser D.. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation. 1989;1:270–280. doi: 10.1162/neco.1989.1.2.270. [DOI] [Google Scholar]
- Vaswani, A. ; Shazeer, N. ; Parmar, N. ; Uszkoreit, J. ; Jones, L. ; Gomez, A. N. ; Kaiser, L. u. ; Polosukhin, I. . Attention is All you Need. Advances in Neural Information Processing Systems. 2017. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The code that supports this study’s findings can be found in the following publicly available GitHub repository: https://github.com/hoon-ock/AgentD.








