Abstract

1. Introduction
Small-molecule drug discovery plays a pivotal role in pharmaceutical research by designing and screening organic compounds with specific biological activities that modulate disease-related targets. Small-molecule drugs offer several advantages, including high oral bioavailability, strong tissue permeability, and low production costs, making them predominant in oncology, metabolic diseases, and neurological disorders. As of 2024, small-molecule drugs dominated the landscape of innovative therapeutics among new drugs approved by the Food and Drug Administration (FDA), comprising 64% of the total (32 new chemical entities). These drugs target a diverse range of biological targets, such as LACTB, PBP, and THR-β, surpassing monoclonal antibodies (26%) and other biologics. These data underscore the continued centrality of small-molecule drug discovery as a key driver of therapeutic advancement. Nevertheless, the conventional paradigm for small-molecule lead discovery is both costly and time-consuming, with the average expenditure per approved drug reaching approximately $2.5 billion and a development timeline exceeding one decade. This inefficiency has significantly impeded the progress of novel drug development, necessitating the adoption of revolutionary technologies to overcome the existing bottlenecks.
The domain of small-molecule drug discovery has significantly advanced through the synergistic integration of analytical chemistry, computational chemistry, and machine learning methodologies. Analytical chemistry has facilitated the generation of large-scale, standardized datasets about bioactivity and physicochemical properties via high-throughput mass spectrometry and nuclear magnetic resonance (NMR) techniques. Computational chemistry employs quantum-mechanical and molecular dynamics methods to accurately simulate intermolecular interactions and conformational stability, thereby establishing a theoretical framework for evaluating the structural rationality of small molecules. Machine learning then capitalizes on these experimental and simulation datasets to swiftly predict bioactivity, solubility, toxicity, and pharmacokinetic properties across extensive chemical spaces, thereby expediting the processes of molecular generation and optimization. Moreover, advanced algorithms, particularly deep learning, further augment these predictive capabilities, enhancing the efficiency of molecular generation and optimization. However, contemporary methodologies continue to encounter challenges, including a pronounced reliance on data quality, inadequate model interpretability, and restricted applicability to small datasets.
With the advancement of deep learning, artificial intelligence (AI) has been systematically integrated into drug discovery processes. AI facilitates the rapid analysis of extensive biomedical datasets through machine learning, deep learning, and natural language processing (NLP), thereby expediting critical stages such as target identification, lead compound optimization, and toxicity prediction. Notably, AI demonstrates the potential to surpass traditional computational methods in handling complex small-molecule structures and multisource heterogeneous data. Large Language Models (LLMs), as a key technology in the field of Generative AI, are gradually penetrating all aspects of small-molecule drug development. LLMs represented by Bidirectional Encoder Representations from Transformers (BERT) and Generative Pretrained Transformer (GPT) initially achieved breakthroughs in the field of NLP, such as efficient understanding, generation, and semantic correlation analysis of texts. Their applications have since expanded to fields such as biology, chemistry, and other natural sciences. These models demonstrate robust cross-domain transfer learning capabilities facilitated by the Transformer architecture and attention mechanisms. For instance, the simplified molecular input line entry system (SMILES) notation, which serves as a linear symbolic representation of chemical molecular structures, can be interpreted as the language of chemistry. This enables LLMs to analyze the fundamental principles of molecular structures much as they analyze human language. LLMs possess the capability to deduce molecular structures from mass spectrometry data, thereby facilitating isomer assignment and the interpretation of spectral information. When integrated with experimental methodologies, these models compensate for their inherent limitations in elucidating physical mechanisms, thereby enhancing the efficiency of the drug discovery process. This integration streamlines the progression from virtual screening and molecular design to experimental validation.
The core advantages of LLMs in the realm of small-molecule drug discovery are primarily evident in four areas. First, the pretraining and fine-tuning strategies effectively address the challenge of data scarcity in small-molecule drug discovery. Second, in the generation of small molecules, LLMs enhance the accuracy and efficacy of molecular design by integrating reinforcement learning and diffusion models. Third, multimodal integration empowers LLMs to amalgamate heterogeneous data sources. Fourth, domain-specific models endow LLMs with the scalability required to address complex problems.
This study aims to elucidate the core technologies underlying LLMs in the context of small-molecule drug discovery and their potential applications. It focuses on examining the specific applications of LLMs in several critical areas, including target discovery and validation for small-molecule drugs, virtual screening and exploration of chemical spaces, design and optimization of small molecules, prediction of toxicity, safety assessment of small-molecule drugs, knowledge extraction, literature analysis, and clinical applications. Furthermore, this study discusses current challenges and future developmental directions.
2. Molecular Design Strategies for Small-Molecule Drug Discovery
The evolution of molecular design strategies in small-molecule drug discovery has undergone significant advancements, transitioning from early empirical rule-based quantitative structure−activity relationship (QSAR) models to midterm physics-based molecular docking techniques, and more recently, to the emergence of deep learning generative models over the past decade. Current methodological innovations in this domain primarily progress along two parallel trajectories: one focusing on theoretical calculation strategies grounded in quantum chemistry and molecular force fields, and the other emphasizing machine learning technologies that leverage big data and algorithmic optimization. The technical background is illustrated in Figure 1.

Figure 1. An overview of the integrated computational framework for small-molecule drug discovery. The framework comprises three main components: (1) theoretical calculation strategies grounded in quantum chemistry and molecular force fields (top left), illustrating the use of quantum mechanics (QM) and molecular mechanics (MM) simulations to model conformational changes; (2) machine learning technologies leveraging big data and algorithmic optimization (top right), highlighting molecular fingerprints, support vector machines (SVMs), and generative adversarial networks (GANs) to distinguish real from fake molecular structures; and (3) large language models (bottom), depicting a Transformer-based backbone for processing multimodal small-molecule inputs, enabling multitask learning, and enhancing multimodal data interpretation to achieve final drug discovery outputs.
2.1. Computational Chemistry-Driven Molecular Design
This approach relies on quantum chemistry and molecular force fields to predict and optimize the molecular properties.
Quantum methods provide highly accurate electronic structures and energetics, which are essential for the mechanistic studies of drug-target interactions (DTI). For example, the MISATO dataset integrates quantum properties with language models to address predictions of biomolecule-ligand interactions.
Force fields, which are fundamental tools in molecular simulation, describe conformational changes and intermolecular interactions. Machine-learning force fields (MLFFs) surpass traditional limitations by learning an accurate mapping between local atomic environments and the potential energy surface, maintaining quantum-level accuracy while significantly accelerating computation.
2.2. Data-Driven Machine Learning/Deep Learning Strategies
Traditional machine learning approaches predominantly rely on shallow feature representations, including molecular fingerprints and manually crafted descriptors. The modeling process is heavily reliant on expert-driven feature engineering, often employing linear techniques such as support vector machines (SVMs) to facilitate effective modeling under conditions of limited sample size. In contrast, deep learning utilizes a multilayer neural network architecture to autonomously learn hierarchical feature representations directly from SMILES strings, molecular graph topologies, or three-dimensional spatial coordinates. This approach exhibits robust nonlinear expression capabilities and cross-task generalization performance, particularly when applied to extensive bioactive datasets.
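As a toy illustration of the shallow-feature route described above, the sketch below hashes character n-grams of a SMILES string into a fixed-length bit vector, a crude stand-in for a molecular fingerprint that a linear model such as an SVM could consume. The `ngram_fingerprint` helper, the n-gram scheme, and the bit width are illustrative assumptions, not a chemistry-aware encoding such as a circular (Morgan) fingerprint.

```python
import zlib

def ngram_fingerprint(smiles, n=2, n_bits=64):
    # Hash overlapping character n-grams of the SMILES string into a
    # fixed-length bit vector. Real pipelines use chemistry-aware
    # fingerprints (e.g., circular/Morgan fingerprints) instead.
    bits = [0] * n_bits
    for i in range(len(smiles) - n + 1):
        gram = smiles[i:i + n]
        bits[zlib.crc32(gram.encode()) % n_bits] = 1
    return bits

fp = ngram_fingerprint("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

The resulting binary vector is exactly the kind of expert-designed, fixed-size representation that shallow models depend on, in contrast to the learned hierarchical features described for deep networks.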
Deep learning techniques like Graph Neural Networks (GNNs) have been effectively used in protein structure prediction and drug design, overcoming the task-specific limitations of traditional manual feature engineering. In molecular generation, deep generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs) have been employed to create new compounds with specific biological activities from scratch. However, challenges remain in terms of small sample learning and the computational resources needed. Deep models tend to overfit in scenarios with limited data, and their large number of parameters demands significant GPU memory and training time.
3. Core Technologies and Models of LLMs in Small-Molecule Drug Discovery
A Transformer is a neural network architecture based on a self-attention mechanism designed to process sequential data. LLMs are models constructed on the Transformer architecture. Through training on large-scale text data, they enhance training accuracy, are capable of generating natural language text, and leverage their strengths to deliver innovative solutions across multiple stages of small-molecule drug development. In the field of small-molecule drug discovery, the core technical pillars of LLMs include pretraining/fine-tuning strategies, generative augmentation techniques, and multimodal model integration, as illustrated in Figure 2.
Figure 2. Technological pathways of LLMs in the context of small-molecule drug discovery. (a) Depicts the pretraining and fine-tuning application framework. The left segment represents the pretraining stage, which involves the utilization of extensive unlabeled datasets, such as biomedical literature and SMILES sequences, to pretrain the model. This stage is computationally intensive. The right segment illustrates the fine-tuning phase, which employs smaller labeled datasets, including drug targets and molecular structures, to refine the model. This phase is characterized by a lower computational cost. (b) Outline of the reinforcement learning application framework based on diffusion models. The upper section of the reinforcement learning loop demonstrates the interaction between the Agent and the small-molecule drug discovery Environment, encompassing the transfer of State, Action, and Reward. The lower section, concerning the diffusion model, describes the process of introducing noise to the initial data through forward diffusion and subsequently generating new data by removing noise via back diffusion following training. (c) The application framework for Multimodal Large Language Models (MLLMs) involves several key steps. Initially, multimodal data inputs, including SMILES molecular sequences, chemical text descriptions, and molecular structure images, are utilized. Subsequently, feature vectors for each data type are extracted using specialized feature extraction techniques. These feature vectors are then integrated to create a comprehensive feature representation. Based on this integrated representation, predictions are made, encompassing DTI prediction and drug attribute prediction, which facilitate the discovery of small-molecule drugs.
3.1. Pre-Training/Fine-Tuning
The pretraining/fine-tuning framework initially extracts universal representations from extensive unlabeled molecular datasets and subsequently adapts these representations to specific downstream tasks using a limited set of task-specific labels.
3.1.1. Pre-Training
During the pretraining phase, LLMs are trained on large unlabeled datasets. The objective of pretraining is to enable the model to learn statistical patterns, syntactic structures, and semantic relationships from extensive unsupervised text data, thereby allowing adaptation to specific downstream tasks, such as text classification, question answering, and translation.
Similarly, this paradigm can be applied to molecular data, provided that molecules are encoded into tokenizable sequences or graphs.
3.1.1.1. Molecular Encoding Strategies
1. Sequence route: Three-dimensional molecular structures are initially transformed into one-dimensional SMILES or SELFIES strings, effectively converting chemical graphs into ordered sequences of characters. Subword or character-level tokenizers, commonly utilized in NLP, are directly applied to segment these sequences. Subsequently, masked-language modeling (MLM) is employed as a self-supervised learning objective, whereby random tokens are masked, and the model is tasked with predicting the missing characters based on the surrounding context. This approach facilitates the acquisition of chemical syntax and molecular semantics without the need for labeled data. To enhance sample diversity, MTL-BERT generates multiple SMILES variants for the same molecule, thereby augmenting the corpus. MLM training is then conducted on these expanded sequences, further enhancing the model’s capacity to comprehend molecular grammar and substructural semantics through data augmentation.
2. Graph route: The graph structure serves as the most intuitive representation of molecules. This route employs Transformer variants designed for graph structures, such as Graphormer or MPNNs with self-attention, initializing nodes by atomic type, valence state, and hybridization, and constructing edge features from bond order, aromaticity, and ring information. During the self-supervised stage, the model reconstructs obscured atomic or bond properties through node/edge masking or contrastive learning, thereby capturing both local chemical environments and global molecular representations in unlabeled data. Following pretraining and fine-tuning within this paradigm, MolE surpassed the best publicly available results prior to September 2023 in 10 out of 22 ADMET tasks.
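The masking-and-reconstruction objective of the sequence route can be sketched in a few lines. This is a minimal illustration assuming a character-level tokenizer and a hypothetical `mask_tokens` helper; production tokenizers also group multi-character tokens such as `Cl` and `Br`, and the prediction step itself requires a trained Transformer, which is omitted here.

```python
import random

MASK = "[MASK]"

def tokenize(smiles):
    # Character-level tokenization; real chemical tokenizers also merge
    # multi-character tokens such as "Cl", "Br", and bracket atoms.
    return list(smiles)

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return the corrupted sequence
    and the original values at masked positions (the MLM targets)."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets[i] = tok  # the model must predict this from context
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
corrupted, targets = mask_tokens(tokens)
```

During pretraining, the loss is computed only on the masked positions, which is what forces the model to internalize chemical syntax from the surrounding context.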
3.1.1.2. Multitask Self-Supervised Learning
Single masked language modeling often encounters challenges with underfitting when applied to molecular data, primarily due to a lack of topological diversity. To address this issue, multitask self-supervised learning techniques can be employed to construct multiple distinct labeling tasks simultaneously during the pretraining phase. These tasks may include node-edge double masking, functional group classification, subgraph comparison, and local property reconstruction. This approach compels the model to learn synchronously across various granularities and perspectives, thereby explicitly capturing chemical priors such as functional groups, ring systems, and connection patterns. Consequently, this method facilitates the development of more general and transferable molecular representations. For instance, MTSSMol incorporates substructure knowledge through a multitask strategy involving node-edge double masking and functional group prediction. In contrast, TOML-BERT implements a two-level framework that combines general chemistry pretraining with task-specific domain-adaptive pretraining, achieving state-of-the-art performance across ten pharmaceutical datasets.
3.1.2. Fine-Tuning
Despite acquiring knowledge of linguistic statistical rules, fundamental syntax, and semantics during pretraining, LLMs often exhibit limitations in effectively handling specific tasks. During the fine-tuning phase, a small, task-specific, labeled dataset is employed, enabling the model to learn and subsequently optimize its performance for a particular task. This process enhances the accuracy and efficiency of the model in addressing specific challenges.
Parameter-Efficient Fine-Tuning (PEFT): PEFT is a technique that preserves the backbone weights of a pretrained model while incorporating lightweight, trainable modules, such as low-rank matrices, adapters, or prompt vectors, into its bypass structure. This approach minimizes the number of parameters requiring updates, thereby reducing computational resource consumption while maintaining model performance in downstream tasks. This methodology has been effectively employed in predicting protein-small-molecule interactions. For instance, the CLAPE-SMB model integrates a pretrained Protein Language Model (PLM) with a contrastive learning framework, achieving high-precision binding site prediction through the PEFT strategy and diminishing reliance on experimental structures.
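The low-rank-update idea behind PEFT methods such as LoRA can be sketched with plain Python lists: the pretrained weight matrix stays frozen, and only two small factor matrices are trained. The matrix sizes and values below are illustrative assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matadd(A, B):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(A, B)]

# Frozen pretrained weight (3 x 3); never updated during fine-tuning.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Trainable low-rank factors with rank r = 1: only 6 parameters are
# updated instead of the 9 in W (the saving grows with dimension).
A = [[0.1], [0.0], [0.0]]   # d_out x r
B = [[0.0, 0.2, 0.0]]       # r x d_in

delta = matmul(A, B)        # low-rank update A @ B
W_eff = matadd(W, delta)    # effective weight used at inference
```

For a d x d layer, the trainable parameter count drops from d^2 to 2rd, which is the source of the reduced computational cost the text describes.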
Domain-adaptive data strategy: The raw annotations undergo a deduplication and consistency-checking process to eliminate redundant or conflicting samples. A collaborative filtering approach, informed by gradient signals and model confidence, is employed to prioritize the labeling of high-loss, challenging-to-classify samples, thereby enhancing the information density of the dataset. Concurrently, positive samples are systematically augmented through SMILES-equivalent transformations, enumeration, and stereoisomer generation, which broadens the chemical space coverage without increasing the overall data volume. Ultimately, the resulting high-quality, multisource, and diverse fine-tuning dataset is integrated with subsequent parameter-efficient fine-tuning, facilitating rapid and accurate adaptation of large models to molecular tasks.
3.2. Generative Enhancement Techniques
In the realm of small-molecule drug discovery, reinforcement learning and diffusion models, as integral components of generative enhancement technology, demonstrate distinct advantages in decision optimization and the generation of high-quality molecules.
3.2.1. Reinforcement Learning
Reinforcement learning, a machine learning paradigm that derives optimal strategies through interactions with the environment, is typically grounded in the Markov Decision Process (MDP) framework. Its fundamental components include an agent, which functions as the decision-maker, acquires state observations, and executes actions through interactions with the environment. The environment assesses these actions based on immediate rewards. Reinforcement learning iteratively refines strategies via trial-and-error methods to maximize the long-term cumulative rewards.
3.2.1.1. Task-Oriented Molecular Generation
With “measurable chemical or biological endpoints” as the optimization objective, generative models are explicitly designed to produce molecular structures that meet specific criteria, such as maximizing target binding affinity, enhancing metabolic stability, reducing toxicity, or adhering to multiple medicinal chemistry constraints. LLMs are fine-tuned within a reinforcement learning framework that conceptualizes molecular generation as a sequential decision-making process. In this context, the generator, often an RNN or Transformer, functions as the policy network and refines its generation strategy through reward signals such as binding affinity or drug-likeness scores. For example, the Augmented Hill-Climb method enhances the efficiency of language-based molecular generation via reinforcement learning, thereby making computationally intensive scoring functions, like molecular docking, feasible within practical time constraints.
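The reward-driven loop can be illustrated as a toy hill-climb over a token-level policy. The four-token alphabet, the carbon-counting reward, and the frequency-based update below are illustrative assumptions standing in for a real generator and an expensive scoring function such as docking; this is not the Augmented Hill-Climb algorithm itself.

```python
import random

def reward(molecule):
    # Hypothetical stand-in for an expensive scorer such as docking:
    # the toy objective simply rewards carbon-rich strings.
    return molecule.count("C")

def sample(policy, rng, length=8):
    # Toy "policy": an independent categorical distribution per position.
    tokens = list(policy)
    return "".join(rng.choices(tokens, weights=[policy[t] for t in tokens], k=length))

def hill_climb_step(policy, rng, n_samples=64, top_k=8, lr=0.5):
    # One iteration: sample a batch, score it, and move the policy toward
    # the token frequencies of the top-k scoring sequences.
    batch = [sample(policy, rng) for _ in range(n_samples)]
    best = sorted(batch, key=reward, reverse=True)[:top_k]
    counts = {t: 1e-6 for t in policy}
    for seq in best:
        for tok in seq:
            counts[tok] += 1
    total = sum(counts.values())
    return {t: (1 - lr) * policy[t] + lr * counts[t] / total for t in policy}

rng = random.Random(0)
policy = {"C": 0.25, "N": 0.25, "O": 0.25, "c": 0.25}
for _ in range(10):
    policy = hill_climb_step(policy, rng)
```

After a few iterations the probability mass shifts toward the rewarded token, mirroring how reward signals steer an RNN or Transformer generator toward high-scoring molecules.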
3.2.1.2. Multi-Objective Optimization
Molecular optimization necessitates satisfying multiple criteria, including high solubility, low toxicity, favorable synthetic accessibility, and strong target affinity, which often present conflicting optimization objectives. Current research predominantly addresses this challenge through multiobjective optimization (MOO) techniques. Among these, the Pareto optimization-based approach employs nondominated sorting to achieve a balance among multiple objectives within the reward function. This approach applies to both reinforcement learning reward design and genetic-algorithm-based molecular selection. The CPRL method proposed by Wang et al. explicitly maintains the Pareto front in the design of reinforcement learning rewards by calculating the final reward through clustering nondominated sorting. This method successfully balances multitarget affinity with drug-like properties, resulting in generated molecules with a desirability score of 0.9551 and a validity score of 0.9923.
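Nondominated sorting rests on a simple dominance test; below is a minimal sketch assuming two maximized objectives with hypothetical (affinity, drug-likeness) scores.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    # A point is on the front if no other point dominates it.
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (affinity, drug-likeness) scores for candidate molecules.
scores = [(0.9, 0.2), (0.7, 0.7), (0.3, 0.9), (0.5, 0.5), (0.6, 0.6)]
front = pareto_front(scores)
```

Molecules on the front represent distinct trade-offs between the objectives; in a reward design such as CPRL, rewards are then derived from membership in (clusters of) these nondominated sets.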
3.2.2. Diffusion Models
The advantages of diffusion models in the generation of 3D molecules arise from their fine-grained, probability-based sampling methodology. This approach not only guarantees chemical plausibility but also enhances molecular diversity, while simultaneously maintaining physical symmetries through the implementation of an equivariant consistency model.
The technical core involves the concurrent management of discrete atom types and continuous three-dimensional coordinates. At each forward diffusion step, the system employs a predefined noise schedule to introduce progressive Gaussian perturbations to the continuous coordinates while executing uniform transitions on the discrete atom types. This process ensures that both classes of variables collectively degrade into isotropic Gaussian noise.
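For the continuous coordinates, the forward process has a closed form that can be sketched as below. The constant noise schedule and toy coordinates are illustrative assumptions; the uniform transitions on discrete atom types follow the same degradation principle and are omitted here.

```python
import math
import random

def forward_diffuse(x0, t, betas, rng):
    """Closed-form forward diffusion: x_t = sqrt(alpha_bar_t) * x0
    + sqrt(1 - alpha_bar_t) * noise, where alpha_bar_t is the cumulative
    product of (1 - beta) over the first t steps of the schedule."""
    alpha_bar = 1.0
    for beta in betas[:t]:
        alpha_bar *= 1.0 - beta
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0, 1)
             for x in x0]
    return noisy, alpha_bar

rng = random.Random(0)
betas = [0.02] * 500          # predefined (here constant) noise schedule
coords = [1.2, -0.7, 3.4]     # toy continuous atom coordinates
x_T, alpha_bar = forward_diffuse(coords, 500, betas, rng)
```

By the final step, alpha_bar is vanishingly small, so the coordinates are essentially isotropic Gaussian noise; the reverse (denoising) network is trained to invert this trajectory step by step.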
Reinforcement learning (RL) and diffusion models markedly enhance the efficiency of molecular generation through a synergistic mechanism. Diffusion models generate high-quality three-dimensional chemical fragments characterized by desirable structural properties, thereby providing RL with an optimized initial action space. Subsequently, RL learns optimal linkage strategies within the fragment-assembly space to facilitate multiobjective property optimization, ultimately resulting in the synthesis of molecules with enhanced pharmacological profiles.
3.3. Multimodal Large Language Models
MLLMs transcend the constraints of traditional text-based LLMs by effectively processing multimodal information, including images, audio, and video. The fundamental principle of MLLMs is their ability to achieve integrated modeling and reasoning across complex multimodal data by synthesizing various data modalities and leveraging their robust semantic comprehension capabilities.
3.3.1. Multi-Modal Information Fusion Mechanism
MLLMs utilize Transformer architecture to process SMILES sequences while employing equivariant GNNs to extract molecular topology or three-dimensional structural features. Additionally, visual encoders are incorporated to process the molecular images. Through bidirectional cross-attention, the three embeddings, comprising SMILES tokens, molecular-graph node features, and image patches, are mapped into a shared latent space, which is aligned using contrastive learning. A unified decoder subsequently outputs property predictions or generates molecular structures, facilitating end-to-end cross-modal joint modeling. For example, GIT-Mol effectively integrates structural and image information to address the limitations inherent in traditional single-modal language models, resulting in a 5−10% improvement in property-prediction accuracy and a 20.2% enhancement in generation validity compared to baseline models.
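The cross-attention step that lets one modality's tokens attend over another's can be sketched in a few lines. The two-dimensional toy embeddings are illustrative assumptions; real models use learned query/key/value projections and multiple heads.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query (e.g., a SMILES
    token embedding) attends over keys/values from another modality
    (e.g., graph-node or image-patch embeddings)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy embeddings: 2 SMILES-token queries attend over 3 graph-node features.
smiles_q = [[1.0, 0.0], [0.0, 1.0]]
graph_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
graph_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_attention(smiles_q, graph_k, graph_v)
```

Each fused vector is a weighted mixture of the other modality's features, which is the mechanism by which the three embedding streams are pulled into a shared latent space before contrastive alignment.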
3.3.2. Cross-Modal Representation Learning
Contrastive learning frameworks such as GICL integrate sequence features derived from LLMs, exemplified by the SMILES-Transformer, with molecular image features. Utilizing the InfoNCE loss function, the framework minimizes the distances between positive samples (representations of the same molecule across different modalities) while maximizing the distances between negative samples, which represent different molecules. This approach facilitates explicit semantic alignment between the SMILES and image modalities. Token-Mol encodes 3D coordinates into discrete tokens and conducts joint modeling with the prealigned SMILES-image features. This methodology enables the model to concurrently capture 2D-3D consistency within a unified latent space, improving molecular conformation generation accuracy by over 10% and 20% on two distinct datasets.
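The InfoNCE objective can be sketched over a precomputed SMILES-image similarity matrix, with same-molecule pairs on the diagonal. The similarity values and temperature are illustrative assumptions.

```python
import math

def info_nce(sim_matrix, temperature=0.1):
    """InfoNCE over a similarity matrix where entry [i][j] is the
    similarity between SMILES embedding i and image embedding j;
    diagonal entries are the positive (same-molecule) pairs."""
    n = len(sim_matrix)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim_matrix[i]]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)  # -log softmax of the positive
    return loss / n

# Well-aligned embeddings: positives clearly exceed negatives.
aligned = [[0.90, 0.10, 0.00],
           [0.10, 0.80, 0.20],
           [0.00, 0.20, 0.95]]
# Unaligned embeddings: positives indistinguishable from negatives.
unaligned = [[0.5, 0.5, 0.5]] * 3
```

Minimizing this loss pulls the two views of the same molecule together and pushes different molecules apart, which is exactly the semantic alignment the text describes.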
3.3.3. Multitask Joint Training
Within the framework of multitask joint training, multimodal LLMs employ a hard parameter-sharing mechanism, wherein their foundational Transformer encoders are capable of concurrently processing both molecular SMILES strings and image inputs. At the model’s uppermost layer, a multitask head facilitates parallel outputs, encompassing the generation of molecular descriptive text, the prediction of regression values for physicochemical properties, and the classification of target affinity. The losses associated with these diverse tasks are amalgamated through a weighted summation approach for backpropagation. By harnessing the complementarity between the extensive corpus of the description task and the sparse labels associated with the property tasks, the model achieves collaborative optimization and implicit data augmentation. This capability enables the model to directly produce text, property, and affinity predictions in a single forward pass, thereby fulfilling the MOO requirements essential to drug discovery.
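The weighted summation of per-task losses can be written directly; the task names and weights below are hypothetical.

```python
def multitask_loss(losses, weights):
    """Weighted sum of per-task losses (e.g., description generation,
    property regression, affinity classification) used for a single
    joint backpropagation pass."""
    assert set(losses) == set(weights)
    return sum(weights[t] * losses[t] for t in losses)

# Hypothetical per-task losses from one forward pass and their weights.
per_task = {"description": 2.1, "property": 0.4, "affinity": 0.7}
weights = {"description": 0.2, "property": 0.5, "affinity": 0.3}
total = multitask_loss(per_task, weights)
```

Because all task heads share the same encoder, the gradient of this single scalar updates the shared parameters, letting the data-rich description task implicitly regularize the sparsely labeled property tasks.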
3.4. Domain-Specific Models in Small-Molecule Drug Discovery
The technical foundation of domain-specific models is rooted in the use of LLMs to seamlessly integrate external knowledge bases and specialized tools. This approach systematically addresses the multidimensional complexities inherent in the field. This paradigm not only enhances the model’s applicability and computational efficiency but also offers researchers an expandable and highly flexible intelligent solution. Presently, domain-specific models in small-molecule drug discovery predominantly rely on the bidirectional semantic encoding capabilities of BERT and the generative pretraining framework of GPT. Through self-attention mechanisms and large-scale text representation learning, these foundational architectures have established a robust technical foundation for the extensive application of domain-specific models.
BERT’s core innovation lies in its ability to capture contextual semantics through the Bidirectional Transformer Encoder architecture. This capacity to concurrently capture contextual information across a sequence renders it particularly suitable for tasks necessitating a comprehensive understanding of biomolecular structures or protein sequences. GPT is founded on the architecture of an autoregressive Transformer decoder, employing a unidirectional attention mechanism to generate sequences sequentially from left to right. It excels in sequence generation tasks and is particularly suitable for designing novel molecular SMILES sequences using an autoregressive generation strategy.
Table 1 summarizes recent small-molecule drug discovery advances achieved with GPT, BERT, Llama, Claude, T5, and BART. The table illustrates that models based on GPT and BERT have been extensively employed in this field, primarily because of their earlier development and well-established application ecosystems. In contrast, the adoption of other models remains limited, largely because the Llama and Claude architectures were only introduced in 2023, and their applications are still in the nascent stages. BART and T5 exhibit strong performance in handling general NLP tasks; however, they require further tuning and optimization for specific applications in the domain of small-molecule drug discovery.
Table 1. Examples Utilizing GPT, BERT, BART, T5, Claude, and Llama Models.
| Architecture | Domain-Specific Model | Task Type | Input Modality | Training Strategy | Year | Ref. |
|---|---|---|---|---|---|---|
| BERT-based architecture | BERT-Att-Capsule | CPI extraction | biomedical text | BERT pretraining → CHEMPROT end-to-end fine-tuning | 2020 | |
| TP-DDI | DDI extraction | biomedical text | BioBERT pretraining → end-to-end fine-tuning | 2021 | ||
| Mol-BERT | molecular property prediction | SMILES sequence | 4M unlabeled SMILES pretraining, downstream fine-tuning | 2021 | ||
| MFBERT | molecular fingerprint generation, virtual screening | SMILES text | distributed large-scale pretraining → small-set fine-tuning | 2022 | ||
| K-BERT | molecular property prediction | SMILES | multitask pretraining → evaluation on 15 drug datasets | 2022 | ||
| MDL-CPI | CPI prediction | protein sequence, molecular graph | BERT-CNN and GNN hybrid → AE2 unified feature space | 2022 | ||
| Fingerprints-BERT | molecular property prediction | SMILES | BERT pretraining → CNN feature extraction | 2022 | ||
| Degpred | Degron prediction | protein sequence | BERT pretraining | 2022 | ||
| ChemBERTa and ProBERT | DTI prediction | SMILES, protein sequence | ChemBERTa and ProtBERT pretraining | 2022 | ||
| DTI-BERT | DTI prediction | protein sequence, molecular fingerprint | BERT pretraining, DWT, CNN | 2022 | ||
| GraphBERT | ADMET prediction | molecular graph, descriptors | Graph-BERT pretraining | 2023 | ||
| | BioBERT | Genotoxicity classification | Text | BioBERT, text mining → ensemble model | 2023 | |
| | FG-BERT | Molecular property prediction | Functional-group molecules | Self-supervised pretraining → 44-dataset fine-tuning | 2023 | |
| | MolRoPE-BERT | Molecular property prediction | SMILES | RoPE-BERT pretraining → fine-tuning on 4 datasets | 2023 | |
| | AMP-BERT | Antimicrobial peptide classification | Peptide sequence | BERT fine-tuning | 2023 | |
| | PorphyBERT | HOMO/LUMO energy-level prediction | SMILES | PBDD pretraining → MpPD fine-tuning | 2023 | |
| | MTL-BERT | ADMET prediction | SMILES, fragments | Mixed-token pretraining | 2024 | |
| | ChemBERTa | Anticancer activity prediction | SMILES | ChemBERTa fine-tuning | 2024 | |
| | GPCR-BERT | GPCR sequence analysis | Protein sequence | Prot-BERT fine-tuning | 2024 | |
| | FG-BERT | Antioxidant activity prediction | Functional-group molecules | Multitask self-supervised pretraining | 2024 | |
| | MREDTA | DTA prediction | Protein sequence, molecular graph | BERT pretraining | 2024 | |
| | TOX-BERT | Health/eco-toxicity screening | SMILES | Masked-atom-recovery pretraining, multitask learning | 2024 | |
| GPT-based architecture | cMolGPT | Target-specific molecular generation | SMILES | Conditional GPT pretraining | 2023 | |
| | GraphGPT | Conditional molecular generation | Molecular graph, SMILES | Dual-modal GPT pretraining | 2023 | |
| | PETrans | Target-specific ligand generation | SMILES, protein encoding | GPT pretraining → transfer-learning fine-tuning | 2023 | |
| | FragGPT | General molecular generation (fragment-level) | FU-SMILES | FragGPT pretraining → RL | 2024 | |
| | ProtChat | Protein analysis dialogue | Protein sequence | LLM, PLLM multiagent zero-shot | 2024 | |
| | MolReGPT | Molecular caption translation | Molecule SMILES, natural-language description | No domain-specific pretraining; in-context few-shot retrieval-driven ChatGPT | 2024 | |
| | RM-GPT | Conditional molecular generation | SMILES | LocalRNN, residual-attention GPT pretraining → conditional generation | 2024 | |
| | Adapt-cmolGPT | Target-specific molecular generation | SMILES | GPT optimization fine-tuning | 2024 | |
| | MTMol-GPT | Multitarget molecular generation | SMILES | GPT, IRL | 2024 | |
| | AVP-GPT | Antiviral peptide generation | Peptide sequence | RSV dataset pretraining → application and evaluation on other viruses | 2024 | |
| | 3DSMILES-GPT | 3D molecule generation | SMILES (2D, 3D) | Token-based GPT pretraining → RL optimization | 2025 | |
| | NPGPT | Natural-product-like compound generation | Natural-product SMILES | Chemical language model trained on the NP dataset | 2025 | |
| Llama-based architecture | Llamol | Multicondition molecular generation | SMILES | Llama2, SCL pretraining → conditional generation | 2024 | |
| | Llama-Gram | CPI prediction | Protein fold, molecular graph | ESMFold frozen, Graph Transformer, Gram layer | 2025 | |
| Claude-based architecture | Claude 3 Opus | Molecular generation | SMILES | Zero-shot natural-language prompts | 2024 | |
| T5-based architecture | T5MolGe | Mutant EGFR inhibitor generation | SMILES | T5 encoder-decoder, transfer learning | 2025 | |
| BART-based architecture | MolBART | Algorithm-selection strategy study | SMILES | Classical SVR/FSLC/MolBART comparative experiment | 2024 | |
| | MegaMolBART | Blood−brain barrier permeability prediction | SMILES strings, molecular fingerprints | Chemical semantic vectorization | 2024 | |
3.5. Core Differences between LLMs and Traditional Deep Learning in Small-Molecule Drug Discovery
LLMs and conventional deep learning models exhibit substantial differences in the context of small-molecule drug discovery. Traditional deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), typically engage in end-to-end learning processes that rely on molecular structural features, including SMILES strings and molecular graphs. In contrast, LLMs, by virtue of their pretraining paradigms and extensive text comprehension capabilities, can integrate multisource heterogeneous data, encompassing literature knowledge, molecular descriptive texts, and structural information. This enables LLMs to exhibit enhanced semantic relevance and transfer learning capabilities in tasks such as molecular generation, property prediction, and optimization. The fundamental distinctions between these two approaches are delineated in Table .
2. Comparison of Core Differences between LLMs and Traditional Deep Learning Models.
| Dimension | LLMs | Traditional Deep Learning Models | Data Basis |
|---|---|---|---|
| Data Requirements | Rely on large-scale unsupervised text/sequence pretraining; only few-shot fine-tuning is required for downstream tasks. | Rely on task-specific annotations. | ,,,, |
| Molecular Generation | Directly generate SMILES/SELFIES; however, additional constraints are needed to ensure chemical validity. | High effectiveness, yet limited by scale constraints. | ,,,, |
| Property Prediction | Hold an advantage in small-data scenarios. | Data-sensitive. | , |
| Cross-Task | Capable of multitask processing. | Designed for single-task scenarios. | , |
| Cross-Modal | Capable of multimodal integration. | Require manual integration of multisource/multimodal data. | , |
| Interpretability | Employ the attention mechanism, yet still less effective than explicit features. | Have explicit feature importance. | , |
In conclusion, traditional deep learning models remain valuable for their precision in executing specific tasks, whereas LLMs offer significant advantages in overcoming the fundamental challenges of small-molecule drug discovery through their universal architecture, integrated knowledge, and multitask collaboration capabilities. Rather than replacing one another, the two approaches form a synergistic framework in which LLMs spearhead innovative design while traditional models provide precise validation.
4. The Core Applications of LLMs in Small-Molecule Drug Discovery
The conventional process of small-molecule drug discovery is complex and characterized by labor-intensive experiments and extensive data, which contribute to extended timelines and substantial costs. Leveraging their advanced data processing and analytical capabilities, LLMs are well-equipped to address the challenges posed by vast amounts of biomedical data, thereby expediting the critical stages of drug discovery and presenting novel opportunities for the research and development of small-molecule drugs. The primary applications of LLMs include drug target identification and validation, virtual screening and exploration of chemical spaces, molecular design and optimization, toxicity prediction and safety assessment, knowledge extraction, and document analysis.
Figure illustrates the primary phases and tasks associated with LLMs in the context of small-molecule drug discovery. Table lists the LLMs employed in this domain and delineates their respective functions.
3. LLMs in the main stages and tasks of small-molecule drug discovery. (a) Target identification and validation: LLMs can identify biomolecules (such as proteins) closely associated with pathological processes, explore their intrinsic properties, functions, and interactions with small-molecule compounds, and assess their potential as drug targets. (b) Virtual screening: Combining LLMs with reinforcement learning improves virtual screening by efficiently navigating the extensive chemical space. (c) Molecular generation and optimization: LLMs utilize chemical language models to generate small molecules with targeted pharmacological properties and optimize key characteristics such as ADMET, thereby enhancing their potential as drug candidates. (d) Toxicity prediction and safety assessment: LLMs enable data-driven toxicity prediction by transforming key features from molecular graphs, SMILES sequences, and external databases into computable numerical formats, thereby supporting safety assessments that reduce the risks of adverse drug reactions. (e) Knowledge extraction and document analysis: LLMs can extract and analyze information from unstructured biomedical texts (e.g., literature, patents, and clinical records) while integrating knowledge graphs to facilitate drug repurposing and novel therapeutic discovery.
3. Functions of Specialized LLMs in Different Phases and Tasks of Small-Molecule Drug Discovery.
| Domain-Specific Model | Functions | Primary dataset | Performance | Year | Ref. |
|---|---|---|---|---|---|
| Small-molecule drug target identification and validation: PLMs and drug target characterization | |||||
| ActTRANS | Classify active transport proteins | UniProtKB | Accuracy: primary active transporters 85.44% secondary active transporters 88.74% other proteins 92.84% | 2021 | |
| BERT | Predicting SARS-CoV-2 inhibitors | Enamine REAL | SARS-CoV-2 Mpro and PLpro: Spearman ρ ≈ 0.88 Precisionmax = 0.60 | 2022 | |
| Degpred | Prediction of degrons and E3 ubiquitin ligase binding | ELM, E3-substrate interaction (ESI) | Degpred predicts 46,621 degrons from protein sequences with an AUC of 0.88, and constructs 39 E3 motifs achieving ≥40% ESI recall | 2022 | |
| IDP-LM | Prediction of protein intrinsic disorder and disorder functions | MobiDB, PDB, DisProt | Disordered region prediction: AUC = 0.833, disordered DNA binding function prediction: AUC = 0.897 | 2023 | |
| SSEmb | Predicting the impact of amino acid changes on proteins | CATH4.2, MAVE Validation Set ProteinGym | Downstream task: PR-AUCmax = 0.642 | 2024 | |
| ESM-2 | Identification of druggable proteins | DrugMiner, Pharo, SWISS-PROT | benchmark dataset: accuracy = 95.11% | 2024 | |
| GPCR-BERT | Interpreting the sequential design of G protein-coupled receptors | GPCRdb, UniRef100 | downstream task E/DRY: test accuracy = 100% NPxxY: test accuracy = 98.05% CWxP: test accuracy = 86.29% | 2024 | |
| ProtChat | Automated protein analysis | protein property prediction dataset, PDBbind, kinase, SKEMPI | - | 2024 | |
| TransConv | Prediction of protein secondary structure | ProteinNet, NetSurfP-2.0, NEW364 | Precisionmax = 0.63, Recallmax= 0.53, F1max = 0.56 | 2025 | |
| GPT-2 | Solving the protein scaffold filling problem | five sets of protein scaffold data (MabCampath, P5A, P15, P18, CAH2) | MabCampath protein scaffold: gap-filling Accuracy = 100% full sequence Accuracy = 100% | 2025 | |
| Small-molecule drug target identification and validation: protein-small-molecule ligand interaction prediction and dataset construction | |||||
| BERT | Predicting drug-targeting interactions | Human Dataset, DUD-E, ZINC | The model boosts AUC by 2.4% and recall by 9.4% on imbalanced data versus others | 2021 | |
| MGPLI | Predicting protein−ligand binding affinity | KIBA, Davis, BindingDB | KIBA: CI = 0.891, MSE = 0.159, Pearson R = 0.881 Davis: CI = 0.884, MSE = 0.218, Pearson R = 0.818 BindingDB: CI = 0.814, MSE = 0.815, Pearson R = 0.820 | 2022 | |
| MDL-CPI | Predicting compound-protein interactions | two CPI datasets (Human and C. elegans) | Human: AUC = 0.959, accuracy = 0.910 C. elegans: AUC = 0.975, accuracy = 0.933 | 2022 | |
| ChemBERTa-ProtBert | Predicting drug-targeting interactions | BIOSNAP, DAVIS, BindingDB | BIOSNAP: ROC-AUC = 0.914, PR-AUC = 0.900 DAVIS: ROC-AUC = 0.942, PR-AUC = 0.517 BindingDB: ROC-AUC = 0.926, PR-AUC = 0.63 | 2022 | |
| DTI-BERT | Predicting drug-targeting interactions | DTIs database | Accuracy: G Protein-coupled 90.1% Receptors 94.7% ion channels 94.9% enzymes 89% | 2022 | |
| Mutual-DTI | Prediction of drug-target protein interactions | two CPI datasets (Human and C. elegans), GPCR, Davis | Human: AUC = 0.984, Accuracy = 0.962 C. elegans: AUC = 0.987, Accuracy = 0.948 | 2023 | |
| MT-DTA | Enhancing drug-targeting interactions | Davis, KIBA | Davis: CI = 0.844, MSE = 0.384, MAE = 0.368, r2 m = 0.450 KIBA: CI = 0.751, MSE = 0.390, MAER93, r2 m = 0.406 | 2023 | |
| PETrans | De Novo drug design with protein-specific encoding | MOSES, PDB, ExCAPE-DB | EGFR, S1PR1, HTR1A: Docking Score(average) = −7.97 kcal/mol, Docking Score(optimal) = −9.8 kcal/mol, QED = 0.45, SA = 2.74, Novelty = 100% | 2023 | |
| CapBM-DTI | Prediction of drug-targeted interactions | KinaseSARfari, PubChem BioAssay, two DTI datasets | Accuracy = 89.3%, F1 = 90.1%, AUC = 0.946, AUPR = 0.970 | 2023 | |
| MREDTA | Predicting drug-targeting binding affinity | Davis, KIBA | KIBA: CI = 0.880, MSE = 0.146, r2 m = 0.720 Davis: CI = 0.903, MSE = 0.224, r2 m = 0.709 | 2024 | |
| BERT | Prediction of drug-targeted interactions | BindingDB, DAVIS, BIOSNAP | ROC-AUC: BIOSNAP 0.885, DAVIS 0.799, BindingDB 0.910, Sensitivitymax = 89.6% | 2024 | |
| Llama-Gram | Predicting protein−ligand interactions | ChEMBL23, Kinase, GPCR, DUD-E, DEKOIS 2.0 | ChEMBL23(SOT): AUC = 0.918, AUPRC = 0.915 | 2025 | |
| SP-DTI | Predicting DTIs | BIOSNAP, DAVIS | BIOSNAP: ROC-AUC = 0.931, PR-AUC = 0.930 DAVIS: ROC-AUC = 0.934, PR-AUC = 0.462 | 2025 | |
| Small-molecule drug target identification and validation: a multitask learning framework integrates target sequences with drug small-molecule compound data | |||||
| RT | Predicting the properties of small molecules, proteins, and chemical reactions | ChEMBL, ESOL, FreeSolv, Lipophilicity, Property optimization benchmark, UniProt, TAPE benchmark, USPTO, Buchwald−Hartwig aminations, Suzuki−Miyaura cross-coupling reactions | molecular property prediction (drug likeness): RMSE = 0.030, PCC = 0.991 potential protein interaction: Spearman’s ρ > 0.994; Chemical reaction modeling (aryl halides): Accuracy = 98.2% | 2023 | |
| GPCRSPACE | Accelerated identification and screening of potential GPCR-interactive compounds | GLASS, BindingDB, TargetMol, ZINC | AUC = 0.921, F1 = 0.850, Recall = 0.752, Precision = 0.979 | 2024 | |
| DrugReAlign | Improving drug repurposing | NR, GPCR, RCSB PDB, PLIP, BindingDB | NR: TCR = 0.776, T1RSR = 0.328 GPCR: TCR = 0.641, T1RSR = 0.279 Docking Score(average) = −7.35 kcal/mol | 2024 | |
| PCMol | Generating diverse potential active molecules targeting multiple targets | AlphaFold2, Papyrus v5.5 | PCMol generated 100 molecules on 19 unseen targets with a maximum Tanimoto similarity >0.5 to known active ligands | 2024 | |
| TransGEM | Generation of new molecules with desirable properties and disease-targeting affinities | LINCS1000, subLINCS | Validity = 100%, Novelty = 100%, Uniqueness = 84.9%, InDiv = 78.9% | 2024 | |
| Meta-GTNRP | Prediction of the binding activity of compounds to nuclear receptors | NURA, RDKit.Chem | Meta-GTNRP achieved an average ROC-AUC > 0.9 on 11 nuclear receptor binding tasks under both 5-shot and 10-shot conditions | 2024 | |
| BERT-RGCN | Antimalarial drug prediction using plasmodium potential targets | a plasmodium dataset coming from ChEMBL and PubChem | Accuracy = 99.72%, MCC = 99.43% | 2024 | |
| MMDG-DTI | Prediction of drug-targeting interactions by multimodal feature fusion and domain generalization | Human and C. elegans, BindingDB, DrugBank, Davis, KIBA | Optimal Results on All 6 Datasets: AUCROCmax = 0.996, AUC-PRmax = 0.996 | 2025 | |
| Virtual screening of small-molecule drugs and chemical space exploration: Efficient navigation in ultralarge-scale chemical space | |||||
| Transformer | Simplified exploration of the chemical space of druglike molecules | ChEMBL | The model can generate novel molecules with a Tanimoto similarity of 1 or close to 1 (>0.8) to known highly active molecules | 2023 | |
| Transformer | Reduce the molecular nearest-neighbor search space | a small dataset containing 10,000 compounds, a large dataset containing 500,000 compounds, ZINC | 5-order search space reduction (1.5B molecules) | 2023 | |
| CMGN | Generation of molecules with specific target and drug properties | ZINC, ChEMBL, GuacaMol | CMGN-DL based on druglike molecules: Validity = 99.83% CMGN-PKI based on PKIs: Validity = 99.38% CMGN-BTK based on BTK inhibitors: Validity = 98.92% | 2023 | |
| Transformer | Enabling efficient exploration of molecular neighbors | PubChem, TTD, ChEMBL | The model retrieves the majority of near-neighbors from PubChem with an average recovery rate of ∼98% | 2024 | |
| Graph-Transformer | Generating molecular analogues of fentanyl | PubChem, Zinc, PDB | generated 36,799 molecules: Validity = 84.7%, Novelty = 88.8%, Uniqueness = 0.31 | 2024 | |
| Claude 3 Opus | Highly efficient modification of molecules in low-dimensional latent spaces | ZINC | zero-shot setting: Validity = 97% | 2024 | |
| RM-GPT | Generation of drug-like molecules that satisfy specific conditions | MOSES, ZINC250K | unconditional generation: Validity = 99.5%, Uniqueness = 99.9% conditional generation TPSA, logp, SAS, and scaffolds: Validity = 78.2%, Uniqueness = 48.8% | 2024 | |
| Virtual screening of small-molecule drugs and chemical space exploration: Generation and optimization of small molecules based on reinforcement learning | |||||
| MCMG | Enhancing molecular diversity through reinforcement learning | ChEMBL51(DRD2, JNK3, and GSK3β), MOSES | Task1 (DRD2, QED, and SA) Success = 94.18%, Task2 (JNK3, GSK3β, QED, and SA) Success = 80.2% | 2021 | |
| DrugEx | Combining reinforcement learning to create effective, high-affinity drug molecules | ChEMBL | Validity = 100.0%, Accuracy = 99.2% Novelty = 68.9% Uniqueness = 82.9% | 2023 | |
| MTMol-GPT | Employing inverse reinforcement learning for generating molecules targeting multiple objectives | ChEMBL, ExCAPE-DB, MOSES, PDB | DRD2: Valid = 87.00%, Novel = 98.70% HTR1A: Valid = 80.89%, Novel = 98.87% | 2024 | |
| ChatChemTS | Chat interactions assist users in designing new molecules and automatically constructing reward functions | ChEMBL | Single-target optimization test-set R = 0.93 Multitarget optimization test-set R = 0.85 | 2025 | |
| TRACER | Prediction of reaction products using reinforcement learning | USPTO 1k TPL, ExCAPE, ChEMBL, DUD-E, ZINC | DRD2: Uniqueness = 89.2% AKT1: Uniqueness = 82.3% CXCR4: Uniqueness = 92.4% | 2025 | |
| Small-molecule design and optimization: Generation of small molecules based on chemical language | |||||
| Transformer-CNN | Improved QSAR/QSPR modeling using SMILES-embedding CHARNN architecture | ChEMBL, QSAR dataset | regression sets: r2 max = 0.98 classification sets: AUCmax = 0.93 | 2020 | |
| Chemformer | Combined with simplified SMILES for quick use in sequence-to-sequence and discriminant tasks | ZINC-15, USPTO-MIT, USPTO-50K, ChEMBL MMPs MoleculeNet, ExCAPE | retrosynthesis reactions prediction (Top-1): Accuracy = 54.3% | 2022 | |
| MoLFormer | Capturing molecular chemical and structural information can be used to predict various molecular properties | PubChem, ZINC, MoleculeNet | Ranked first in 3 of 6 classification tasks and led all 5 regression tasks | 2022 | |
| MTL-BERT | SMILES enumeration boosts multitask learning, enhancing molecular property prediction performance and generalization | ChEMBL, ADMETlab, MoleculeNet | MTL-BERT surpasses SOTA on 58 out of 60 tasks, achieving >10% improvement in CL, PPB, VD, and LC50DM | 2022 | |
| K-BERT | Extracting chemical information from SMILES through pretraining tasks to optimize molecular property prediction | CHEMBL, ADMETlab 2.0, Malaria dataset, CHIRAL1, Sim-Sub-Dataset, DrugBank | 15 pharmaceutical datasets: ROC-AUCaverage = 0.806 | 2022 | |
| SELF-EdiT | SELFIES-based models for structure-constrained molecular optimization | QED, DRD2 | QED: Success = 59.6% DRD2: Success = 82.2% | 2023 | |
| MolRoPE-BERT | Rotary position embedding-enhanced molecular representations for predicting molecular properties | ZINC, ChEMBL, MoleculeNet(BBBP, Tox21, SIDER, ClinTox) | ROC-AUC: BBBP 0.933, Tox21 0.876, SIDER 0.712, ClinTox 0.949 | 2023 | |
| FragGPT | The fragment-based model excels at generating molecules with improved properties and novel structures | PubChem, ChEMBL, Moses, ZINC-250K, PDBbind, CASF-2016 | Novelty: de novo 99.4%, linker 98.6%, R-group 98.2%, scaffold hopping 99.8%, side-chain 99.99% | 2024 | |
| MegaMolBART | Predicting blood−brain barrier permeability of molecules | B3DB, CMUH-NPRLLightBBB, DeePred-BBB, ZINC-15 | BBB-permeable: AUC = 0.88 | 2024 | |
| HBCVTr | Predicting small molecules’ inhibitory activity against Hepatitis B and C viruses using SMILES | ChEMBL, eMolecules, PDB | HBV: R-squared = 0.641 HCV: R-squared = 0.721 | 2024 | |
| 3DSMILES-GPT | With 3D molecular structure, the generated molecules surpass existing methods in binding affinity, drug similarity, and synthesis | PubChem, CrossDocked2020, PDBbind2020 | Docking Score = −7.72 kcal/mol, QED average = 0.76, SAS average = 3.07 | 2025 | |
| Text-to-Text Transfer Transformer (T5) | Conditional molecular-properties-guided SMILES sequence generation | GuacaMol, ChEMBL | generation tasks conditioned on molecular skeleton: Validity = 0.989 Uniqueness = 0.729, Novelty = 1.000, Similarity = 0.975 | 2025 | |
| BERT | By optimizing the position coding and position embedding, the prediction performance of the model on the SMILES string is improved | ZINC, PubChem, ChEMBL, MoleculeNet, new balanced datasets (Antimalarial Drugs, Cocrystals, COVID, COVID-19) | Tox21 dataset: Accuracy = 0.9394, F1 = 0.9688 | 2025 | |
| Small-molecule design and optimization: Goal-directed small-molecule optimization | |||||
| Constraints-Transformer | A web server for ADMET property prediction and molecular optimization | Therapeutics Data Commons (TDC), CHEMBL 24 | 97% of the target molecules generated by the model satisfied structural constraints, and 59% met both structural and property constraints | 2023 | |
| PharmaBench | The multiagent data mining system efficiently identifies experimental conditions for 14,401 bioassays, serving as a comprehensive benchmark set for ADMET properties | ChEMBL | - | 2024 | |
| Grapherformer | Enhanced ADME/T prediction by combining molecular fingerprints with physicochemical data | ChEMBL | ADME/Tox Prediction task: SMAPEaverage = 18.9%, PCCaverage = 0.86 | 2024 | |
| MTL-BERT | Hybrid fragment-SMILES tokenization for predicting ADMET | ChEMBL, MOSES, ZINC-250K, TDC, TDC ADMET Group Benchmark | TDC ADMET: 14/22 metrics surpass SMILES Bioavailability (AUC): 0.645 vs 0.609 Hepatocyte Clear (Spearman): 0.441 vs 0.431 | 2024 | |
| Transformer | Integrating Transformer with MOO in drug design | AlphaFold Protein Structure Database, ZINC-250K, ChEMBL, MOSES, TDC | mean test set: Accuracy = 99.81%, Reconstruction Loss = 0.00734, Total Loss = 0.0257 | 2024 | |
| Transformer | Identifying promising drugs while excluding those with poor pharmacokinetics or toxicity | ZINC, PubChem, ChEMBL, TOXRIC, AquaSolDB, MoleculeNet | Regression tasks: R2 > 0.96 Classification tasks: AUC > 0.96 | 2025 | |
| Small-molecule design and optimization: Chemical language model and generative design | |||||
| ChemCrow | LLMs’ chemical agents can autonomously plan and execute tasks like small-molecule drug discovery by using expertly designed tools | PubChem, ZINC20, Chem-Space | - | 2024 | |
| LCMs | Large-scale chemical models are applied to small-molecule drug design through reward modeling and proximal strategy optimization | BindingDB, PubChem | 99.2% of model-generated molecules show pIC50 > 7 against amyloid precursor protein; all are valid and novel | 2024 | |
| TamGen | Generating target-aware molecules through chemical language models | PubChem, CrossDocked2020, PDB, ChEMBL | CrossDocked2020: Top 2 in 5/6 benchmarks, 9 s per 100 molecules, 14 ClpP hits with IC50lowest = 1.9 μM | 2024 | |
| Toxicity prediction and safety assessment of small-molecule drugs: Sequence-based toxicity classification models | |||||
| MolBERT | Predicting toxicological endpoints | CATMoS | Balanced Accuracy = 0.871, Sensitivity = 0.863 Specificity = 0.878 | 2023 | |
| TOX-BERT | Toxicity assessment | CHEMBL32, IECSC, 19 healthy and ecological toxicity endpoints from literature | Accuracy >90%, MAE < 0.52 | 2024 | |
| Toxicity prediction and safety assessment of small-molecule drugs: Improving safety assessment by incorporating contextual information | |||||
| BERT | Using layer and content mapping alongside opportunity identification to discover potential hyperuricemia drugs | Medline, Derwent Innovations Index, Abstracts of Business Information | - | 2022 | |
| LLM-DDA | Predicting drug-disease associations via drug-disease heterogeneous networks | CTD, OMIM, Dndataset, KEGG | LLM-DDA VS 11 baselines: AUPR ↑23.22%, F1 ↑17.20%, Precision ↑25.35% | 2024 | |
| MolReGPT | Retrieve related molecules and their descriptions from a local database, and translate molecular captions using contextual learning | ChEBI-20 | Mol2Cap: Text2Mol = 0.585 Cap2Mol: Text2Mol = 0.593 | 2024 | |
| Llamol | Training with Stochastic Context Learning enables the flexible generation of organic molecules that meet a variety of conditions | OrganiX13(ZINC15, QM9, ZINC 250k, RedDB, OPV, PubchemQC 2017/2020, CEP subset, ChEMBL) | Unconditional: Novelty = 89.7, Validity = 99.5 Single-condition: MAD 0.041−0.39 Multicondition: MAD 0.04−0.25 | 2024 | |
| Knowledge extraction and document analysis: Automated literature mining | |||||
| BERT | Extracting chemical-protein interactions using Gaussian probability distributions and external biomedical knowledge | BLUE CHEMPROT, BLUE DDI Extraction 2013 | CHEMPROT: F1 = 76.56% overlapping instances: F1 = 75.89% DDI: F1 = 82.04% | 2020 | |
| BERT-Att-Capsule | Automated extraction of chemical-protein interactions from biomedical texts | CHEMPROT, ADE | CHEMPROT: F1 = 74.70%, Precision = 77.78%, Recall = 71.86% | 2020 | |
| PharmaCoNER | Identification of pharmacological entities from biomedical domain knowledge language texts | PharmaCoNER | F1 = 92.01% | 2021 | |
| BertSRC | Mining semantic relationships between biomedical entities in medical literature | PubMed | semantic relation classification task: F1 = 0.852 | 2022 | |
| BERT | Identified 600,000 articles containing DTIs that were not included in public databases | PubTator, DrugTargetCommons, ChEMBL, DisGeNET | identification quantitative drug-target profile: Accuracy >99% identifying assay formats: F1micro = 88.1% | 2022 | |
| SUSIE | Automatically extract entities and relationships from the unstructured text of drug documents | ICH (International Conference on Harmonization) documents, Eli Lilly | Accuracy = 96%, F1 = 88% | 2023 | |
| LBMFF | Drug-disease association prediction with literature-based multifeature fusion | CTD, DrugBank, SIDER, MeSH, PubMed, TL-HGBI | AUC = 0.8818, AUPR = 0.5916 | 2023 | |
| LLMs | Extracting chemical reaction data from patents, expanding reaction databases, and correcting errors | USPTO, Open Reaction Database (ORD) | Augment the dataset with 26% more novel reactions | 2024 | |
| Chemformer | A proprietary dataset of 18 million reactions from literature, patents, and electronic lab notebooks performs exceptionally well in retrosynthetic prediction | AstraZeneca, PaRoutes, ChEMBL | single-step retrosynthetic prediction Top-10 round-trip: Accuracy >0.97 | 2024 | |
| Nach0 | By using the pretraining of scientific literature, patents, and molecular strings, chemistry and language knowledge are integrated to solve multitasks such as biomedical Q&A, named entity recognition, molecular generation, synthesis and attribute prediction | PubMed, USPTO, ZINC | Retrosynthesis: Accuracy = 56.26, Forward reaction prediction: Accuracy = 89.94%, Molecule generation: Validity = 99.86% | 2024 | |
| Knowledge extraction and literature analysis: small-molecule drug repurposing and multimodal knowledge map construction | |||||
| LBD | Using literature and knowledge graphs to discover COVID-19 drug candidates | SemMedDB, LitCovid, CORD-19, PubMed | semantic predication: F1 = 0.854 | 2021 | |
| SKiM | Concept connections discovered in isolated literature domains through knowledge graphs | PubMed, three LBD systems(BITOLA, LION LBD, and Arrowsmith) | entity recognition: F1 = 0.848, Precision = 0.885, Recall = 0.814 relationship extraction: F1= 0.552, Precision = 0.777, Recall = 0.428 | 2023 | |
| DrugProt | A corpus containing 5000 PubMed abstracts was collected to generate a silver-standard knowledge graph, which supports relation extraction | PubMed | Main DrugProt task: Precisionmax = 0.8044, Recallmax = 0.8794, F1max = 0.7939 Large Scale DrugProt: Precisionmax = 0.8008, Recallmax = 0.8481 F1max = 0.7886 | 2023 | |
| FuseLinker | A top-performing biomedical knowledge graph framework for drug repurposing | KEGG50k, Hetionet, SuppKG, ADInt | KEGG50k: MRR = 0.969, AUC = 0.987 Hetionet: MRR = 0.548, AUC = 0.903 SuppKG: MRR = 0.739, AUC = 0.928 ADInt: MRR = 0.831, AUC = 0.890 | 2024 | |
| TransE | Mining the relationships of potential AD-related semantic triples in the Alzheimer’s disease knowledge graph | SemMedDB | MR = 10.53, Hits@10 = 0.58 | 2024 | |
4.1. Small-Molecule Drug Target Identification and Validation
Central to drug development is the identification and validation of small-molecule drug targets, which involves identifying key biomolecules, such as proteins and RNA, that are intricately linked to the pathological processes of diseases and assessing their viability as targets for therapeutic intervention. By leveraging their robust data analysis and pattern recognition capabilities, LLMs can efficiently process multiomics data to uncover the potential characteristics, functions, and interactions of disease-related targets, thereby facilitating the acceleration of target identification and validation.
4.1.1. PLMs and Small-Molecule Target Characterization
Proteins currently represent the primary targets for small-molecule drugs. ,
PLMs based on LLM architectures capture evolutionary, biomolecular structural, and functional information from extensive protein sequence datasets using unsupervised learning. , The embedded representations produced by these models can effectively replace traditional multiple sequence alignments as input features for downstream tasks, , thereby improving the efficiency of drug target research.
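The feature-extraction paradigm described above can be sketched in a few lines. The snippet below is a toy illustration only: the lookup-table "embedding" is a hypothetical stand-in for a real PLM such as ESM-2 (whose per-residue embeddings are context-dependent), and mean pooling is one common way to turn variable-length sequences into fixed-size downstream features.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed_residues(sequence, dim=8, seed=0):
    # Toy stand-in for a protein language model: each residue maps to a
    # deterministic pseudo-embedding vector. A real PLM would produce
    # context-dependent per-residue embeddings instead.
    rng = random.Random(seed)
    table = {aa: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}
    return [table[aa] for aa in sequence]

def mean_pool(per_residue):
    # Collapse per-residue embeddings into one fixed-size vector -- the
    # usual way PLM outputs feed a downstream target classifier in place
    # of multiple-sequence-alignment features.
    n, dim = len(per_residue), len(per_residue[0])
    return [sum(vec[d] for vec in per_residue) / n for d in range(dim)]

# Any sequence length yields the same feature dimensionality.
features = mean_pool(embed_residues("MKTAYIAKQR"))
```

In practice, the pooled vector would be fed to a classifier or regressor for tasks such as druggability or crystallization-propensity prediction.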
To characterize small-molecule targets, PLMs utilize residue-level semantic encoding capabilities to predict critical drug-binding features, such as protein secondary structures, active sites, and post-translational modification (PTM) sites that influence ligand binding. These predictions offer more precise target information and facilitate rational small-molecule drug design. Models such as ProtT5 and ESM-2 demonstrate outstanding performance in predicting protein crystallization propensities. Emerging architectures such as PTM-Mamba enhance representational dimensionality by incorporating PTM-specific information. These advancements underscore LLMs, particularly PLMs, as essential tools for interpreting the “language of life.” By integrating sequence, structure, , and function representations, they establish a transformative paradigm for innovating strategies in small-molecule drug target discovery.
4.1.2. Protein-Small-Molecule Ligand Interaction Prediction and Dataset Construction
LLMs have exhibited significant advantages in predicting the interactions between proteins and small-molecule ligands. The architecture of LLMs facilitates the direct mapping of relationships between amino acid sequences and molecular chemical structures, thereby offering innovative methodologies for predicting key parameters such as protein binding sites and ligand binding affinity. , For example, GPT-4 has been effectively utilized to predict kinase-inhibitor binding affinity, whereas the CAPLA model leverages cross-attention mechanisms to capture the mutual effects between protein-binding pockets and small-molecule ligands. These models surpass the limitations of traditional methods, which depend solely on static crystal structures, by incorporating information from dynamic binding processes. By integrating multimodal information fusion and cross-domain feature learning, LLMs can also improve protein−ligand interaction (PLI) prediction. For instance, the Llama-Gram model incorporates protein folding embeddings, graph-based molecular representations, and uncertainty estimation to more accurately capture the structural complexity of PLIs. The Prot2Drug model similarly integrates latent knowledge derived from PLMs with extensive PLI datasets to generate small-molecule compounds tailored to specific targets.
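The cross-attention idea behind such PLI models can be illustrated with a minimal, dependency-free sketch: ligand-atom query vectors attend over protein-residue keys and mix the residue values by the resulting weights. This shows the mechanism only; it is not the actual CAPLA or Llama-Gram implementation, and all vectors below are made up.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(ligand_queries, protein_keys, protein_values):
    # Scaled dot-product cross-attention: each ligand-atom query attends
    # over protein-residue keys, then mixes the residue values by weight.
    dim = len(protein_keys[0])
    mixed = []
    for q in ligand_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in protein_keys]
        weights = softmax(scores)  # one weight per residue, sums to 1
        mixed.append([sum(w * v[d] for w, v in zip(weights, protein_values))
                      for d in range(len(protein_values[0]))])
    return mixed

# Two ligand atoms attending over three pocket residues (invented vectors).
out = cross_attention(
    ligand_queries=[[1.0, 0.0], [0.0, 1.0]],
    protein_keys=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    protein_values=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
)
```

Each output row is a convex combination of residue values, i.e., a ligand-atom representation conditioned on the binding pocket.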
However, robust solutions that utilize LLMs remain limited in effectively addressing the challenges inherent in structure-based drug discovery, quantum chemistry, and structural biology. These fields require more precise datasets of biomolecule-ligand interactions to enhance the applicability of LLMs. To address this issue, the MISATO dataset integrates the quantum mechanical properties of small molecules with data from corresponding molecular dynamics simulations, modeling nearly 20,000 experimentally determined protein−ligand complexes. Widely utilized databases of protein−ligand complexes, such as PDBbind, and benchmark DTI datasets, such as Davis, collectively form a multimodal data foundation that is essential for training highly accurate predictive models. These high-quality datasets address critical challenges in current PLI predictions, including protein conformational changes and the chemical diversity of small-molecule ligands.
4.1.3. Multitask Learning Framework Integrates Target Sequences with Drug Small-Molecule Compound Data
The integration of LLMs with multitask learning (MTL) frameworks offers novel solutions for the synergistic analysis of target sequences and small-molecule compound data. By capitalizing on their advanced sequence modeling capabilities, LLMs can directly process target protein sequences along with molecular structural data. Concurrently, MTL facilitates the coordinated optimization of essential tasks during drug discovery. For example, the AiGPro multitask model concurrently predicts the agonist (EC50) and antagonist (IC50) activities of small molecules against targets such as G-protein-coupled receptors (GPCRs). Additionally, the DeepFusion model utilizes multiscale feature fusion techniques to jointly model contextual information derived from target sequences and the structural characteristics of small-molecule compounds, thereby demonstrating improved prediction accuracy even under conditions of data scarcity.
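The shared-encoder/per-task-head pattern underlying such multitask models can be sketched as follows. This toy linear version (hypothetical, untrained weights) only shows how one molecular feature vector yields both an EC50-style and an IC50-style prediction; it is not the AiGPro architecture.

```python
def linear(x, weights, bias):
    # One dense layer: `weights` holds one row per output unit.
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def multitask_predict(features):
    # Shared encoder: a single representation reused by every task head,
    # so the tasks regularize each other during joint training.
    shared = linear(features,
                    weights=[[0.5, -0.2, 0.1], [0.3, 0.4, -0.1]],
                    bias=[0.0, 0.1])
    # Task-specific heads: separate parameters per endpoint.
    ec50_head = linear(shared, weights=[[1.0, -0.5]], bias=[0.2])
    ic50_head = linear(shared, weights=[[-0.4, 0.8]], bias=[-0.1])
    return {"pEC50": ec50_head[0], "pIC50": ic50_head[0]}

pred = multitask_predict([0.7, 0.1, 0.9])
```

In a real system the encoder would be a pretrained LLM over SMILES and/or protein sequences, and the heads would be trained jointly on agonist and antagonist bioactivity labels.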
When integrated with knowledge graph embedding (KGE), MTL frameworks facilitate the simultaneous prediction of DTIs and drug−drug interactions (DDIs). In this context, LLMs extract diverse relational features from pharmacological knowledge graphs, thereby enhancing the structural representation of both small molecules and proteins.
4.2. Virtual Screening of Small-Molecule Drugs and Chemical Space Exploration
One of the main challenges in small-molecule drug discovery is efficiently identifying compounds associated with specific targets within a vast chemical space estimated to contain over 10⁶⁰ drug-like molecules. LLMs can efficiently navigate this chemical space and rapidly identify potentially active small molecules by learning from extensive chemical data. Reinforcement learning optimizes the generation of small molecules through reward mechanisms, ensuring that they meet specific pharmacological properties, thereby improving the efficiency and accuracy of virtual screening for small-molecule drugs.
4.2.1. Efficient Navigation in Ultra-Large-Scale Chemical Space
LLMs surpass the limitations inherent in traditional small-molecule designs by facilitating the efficient generation of structurally novel, yet chemically viable, candidate molecules, thereby broadening the exploration of multidimensional chemical spaces. Using GPCR targets as a case study, GPCR-specific LLMs trained on the chemical space features of GPCR-targeted compounds demonstrated accelerated identification and screening of potential GPCR-interacting molecules. The resulting GPCRSPACE database outperformed conventional chemical datasets in terms of synthetic accessibility, structural diversity, and GPCR functional similarity. A Transformer-based architecture trained on over 200 billion molecular pairs employs similarity kernel function regularization to establish direct relationships between target molecules and their source analogs, thereby enhancing the exploration capabilities of molecular neighborhoods.
For diversity assessment, LLMs enable a systematic analysis of structure−property/activity relationships by identifying molecular pairs that exhibit high structural similarity yet significant pharmacological divergence, known as “activity cliffs” in chemical space. Models such as GIT-Mol can process intricate structural information on small molecules through multimodal integration.
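The notion of an activity cliff can be made concrete with a small sketch: given precomputed fingerprints and activities, flag molecule pairs that are structurally similar yet pharmacologically divergent. The fingerprints here are hand-made bit sets and the pActivity values are invented; real pipelines would use RDKit fingerprints and measured activities.

```python
# Toy "activity cliff" search: flag molecule pairs whose fingerprint
# similarity (Tanimoto over substructure bit sets) exceeds a cutoff
# while their activities differ by more than a threshold.

from itertools import combinations

def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def activity_cliffs(mols, sim_cut=0.7, act_gap=2.0):
    """mols: dict name -> (fingerprint_set, pActivity)."""
    cliffs = []
    for (n1, (fp1, a1)), (n2, (fp2, a2)) in combinations(mols.items(), 2):
        if tanimoto(fp1, fp2) >= sim_cut and abs(a1 - a2) >= act_gap:
            cliffs.append((n1, n2))
    return cliffs

mols = {
    "parent":   ({1, 2, 3, 4, 5, 6, 7, 8}, 8.1),  # potent
    "analog_a": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.2),  # similar but weak -> cliff
    "analog_b": ({1, 2, 10, 11, 12, 13}, 7.9),    # potent but dissimilar
}
print(activity_cliffs(mols))  # -> [('parent', 'analog_a')]
```

Pairs like ('parent', 'analog_a') are exactly the structure−activity discontinuities that make purely similarity-based screening unreliable.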
Traditional virtual screening methods, such as fragment-based drug design, are often limited by the restricted chemical space encompassed by predefined fragment libraries, posing challenges in overcoming the structural constraints of existing repositories. In contrast, LLMs utilize generative architectures to create novel small-molecule structures dynamically, thereby obviating the need to rely on preexisting fragment libraries. For instance, LLMs can generate new molecules that maintain target-specific bioactivity by “translating” known active compounds against a specific protein target into structural analogs without being strictly dependent on the data within the training set. LLMs facilitate guided molecular generation through natural language prompts, enabling the precise reading, writing, and modification of molecular structures.
Furthermore, LLMs allow for the controlled generation of novel molecules tailored to specific requirements. For example, the RM-GPT model can accurately and consistently produce drug-like molecules with predefined criteria, such as desired physicochemical properties and molecular scaffolds.
4.2.2. Generation and Optimization of Small Molecules Based on Reinforcement Learning
In the context of drug discovery, reinforcement learning can be used to generate and optimize small molecules through reward-driven optimization frameworks that direct generative models to prioritize compounds exhibiting target-specific pharmacological properties. The integration of reinforcement learning with LLMs offers a novel approach for the generation and optimization of small molecules. For example, Transformer models integrated with reinforcement learning-based Monte Carlo tree search frameworks facilitate the iterative refinement of small-molecule drug candidates. In fragment optimization tasks, models such as DrugGen achieve simultaneous multiparameter optimization via reinforcement learning, producing molecules characterized by efficacy, diversity, and novelty. REINVENT 4 enhances and streamlines de novo design, R-group substitution, library design, linker design, scaffold hopping, and molecular optimization, with increased efficiency through the application of reinforcement learning. Furthermore, inverse reinforcement learning addresses the challenges associated with manual reward design by inferring implicit reward functions from the existing data.
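The reward-driven loop can be illustrated schematically. The property vector, reward weights, and greedy acceptance rule below are deliberate simplifications: real RL generators such as REINVENT propose SMILES strings and score them with learned property predictors, rather than perturbing an abstract property vector.

```python
# Sketch of reward-driven molecule optimization: a generator proposes
# variants, a multi-property reward scores them, and higher-reward
# candidates seed the next round. All numbers are illustrative.

import random

def reward(props, weights={"activity": 1.0, "toxicity": -1.5, "qed": 0.8}):
    # Weighted sum: reward rises with activity/drug-likeness,
    # falls with toxicity (negative weight).
    return sum(weights[k] * props[k] for k in weights)

def mutate(props, rng):
    # Perturb one property, mimicking the effect of a structural edit.
    child = dict(props)
    key = rng.choice(list(child))
    child[key] = min(1.0, max(0.0, child[key] + rng.uniform(-0.2, 0.2)))
    return child

def optimize(seed, steps=200, rng=None):
    rng = rng or random.Random(0)
    best = seed
    for _ in range(steps):
        cand = mutate(best, rng)
        if reward(cand) > reward(best):  # greedy acceptance
            best = cand
    return best

seed = {"activity": 0.3, "toxicity": 0.5, "qed": 0.4}
best = optimize(seed)
```

Because acceptance is greedy on the composite reward, the toxicity term can only ever decrease in this sketch, which is the intended effect of its negative weight.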
4.3. Small-Molecule Design and Optimization
In the domain of small-molecule design and optimization, chemistry-language-based molecular generation techniques such as SMILES and self-referencing embedded strings (SELFIES) have gained prominence as potent tools for navigating extensive chemical spaces. These methodologies utilize Chemistry Language Models (ChemLM) to generate small molecules that exhibit specific bioactivities, toxicities, and pharmacokinetic characteristics. The amalgamation of chemical language models with generative design frameworks not only augments the efficiency and precision of molecular generation but also forges innovative pathways for molecular design and optimization, thus expediting the drug discovery process.
4.3.1. Generation of Small Molecules Based on Chemical Language
LLMs facilitate the efficient generation and optimization of small molecules through the application of chemical language processing techniques, utilizing molecular string representations such as SMILES and SELFIES. SMILES, a widely adopted method, translates chemical structures into character sequences, enabling LLMs to employ NLP for sequence generation. Nonetheless, SMILES is prone to syntactic fragility, often resulting in the generation of invalid molecular structures, and exhibits limitations in accurately conveying chemical properties. In contrast, SELFIES, with its self-referencing mechanism, addresses these challenges by providing more precise representations of complex small-molecule structures.
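The syntactic fragility of SMILES is easy to demonstrate. The toy checker below tests only two of the many constraints a real parser (e.g., RDKit's) enforces, parenthesis balance and ring-closure pairing, yet single-character edits already produce invalid strings; SELFIES, by contrast, is designed so that any token string decodes to a valid molecule.

```python
# Minimal plausibility check for SMILES syntax: branches "()" must be
# balanced and ring-closure digits must occur in pairs. Real validity
# requires full chemical parsing; this only illustrates fragility.

def plausibly_valid_smiles(s):
    depth = 0
    ring_digits = {}
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():  # ring-bond labels must pair up
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in ring_digits.values())

assert plausibly_valid_smiles("c1ccccc1")        # benzene
assert not plausibly_valid_smiles("c1ccccc")     # dropped ring closure
assert not plausibly_valid_smiles("CC(=O)O)")    # unbalanced parenthesis
```

A generative model emitting SMILES character-by-character can violate either rule at any step, which is why SELFIES-based generators report near-100% validity.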
LLMs facilitate the generation of small molecules with specific pharmacological properties through chemical language modeling techniques, including scaffold-based structural optimization and cross-modal molecular editing. These models exhibit outstanding performance in predicting molecular properties and DDIs, with their embedding vectors effectively capturing the structural features of small molecules. The Transformer-CNN model produces high-quality and interpretable QSAR and quantitative structure−property relationship models using dynamic SMILES embeddings. The MTL-BERT model enhances the efficiency of pharmacological property prediction through SMILES enumeration-augmented multitask learning.
4.3.2. Goal-Directed Small-Molecule Optimization
Small-molecule optimization seeks to refine existing compounds by improving multiple critical properties in balance, such as activity, toxicity, and pharmacokinetic characteristics, thereby increasing their potential to become successful drug candidates. Traditional methodologies typically optimize these parameters sequentially, often leading to high attrition rates during late-stage development. By contrast, LLMs with billion-parameter architectures facilitate the simultaneous analysis of nonlinear interactions among multiple parameters through self-attention mechanisms. For example, whereas conventional small-molecule optimization requires iterative testing of chemical group substitutions, LLMs can concurrently generate diverse candidate structures and predict their bioactivities, thereby significantly reducing the number of iterative cycles required.
LLMs play a crucial role in drug discovery by identifying promising drug candidates and filtering small-molecule compounds with suboptimal pharmacokinetics or toxicity profiles. The constraint-Transformer model improves the Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties of lead compounds by integrating molecular graph features with descriptor characteristics. A previous study introduced two molecular optimization models: the HlCLM model, which offers user-friendly holistic optimization, and the SpCLM model, which was designed for single-point optimization. The synergistic integration of these models facilitates the simultaneous optimization of multiple pharmacokinetic properties.
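One generic way to filter candidates on several ADMET-style properties at once is Pareto dominance, sketched below with invented scores (all oriented so that higher is better, e.g., by negating toxicity). This is a standard multi-objective heuristic, not the specific method of the constraint-Transformer or HlCLM/SpCLM models.

```python
# Pareto filtering: discard a compound if some other candidate is at
# least as good on every property and strictly better on at least one.

def dominates(a, b):
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """candidates: dict name -> tuple of scores (higher = better)."""
    names = list(candidates)
    return [
        n for n in names
        if not any(dominates(candidates[m], candidates[n])
                   for m in names if m != n)
    ]

# Scores: (absorption, -toxicity, metabolic stability) -- invented.
cands = {
    "cpd_1": (0.9, -0.2, 0.7),
    "cpd_2": (0.8, -0.1, 0.9),
    "cpd_3": (0.7, -0.3, 0.6),  # dominated by cpd_1 on all three axes
}
print(sorted(pareto_front(cands)))  # -> ['cpd_1', 'cpd_2']
```

The surviving front captures the genuine trade-offs (here, absorption versus stability) that a multiparameter optimizer must arbitrate.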
4.3.3. ChemLMs and Generative Design
ChemLMs integrate NLP methodologies by conceptualizing molecular structures, such as SMILES strings, as a form of “chemical language.” This approach utilizes large-language-model architectures to acquire the latent representations of small-molecule structures. An encoding and tokenization scheme was developed that employs bidirectional Transformer models to predict novel bioactive small-molecule compounds through evolutionary analogs with multisite substitutions. This method broadens the chemical space for R-group combinations in bioactive compounds beyond the capabilities of traditional Recurrent Neural Network models, offering an innovative optimization strategy for small molecules.
In the realm of generative molecular design, the GPT-like chemical language model TamGen has demonstrated the ability to generate molecules that target specific proteins. This is exemplified by the successful creation of novel small-molecule inhibitors of the ClpP protease in Mycobacterium tuberculosis. Additionally, LLMs transform the task of drug design into a sequence generation problem through causal language modeling and accelerate the molecular optimization process by integrating reward models and supervised fine-tuning strategies. This methodology has been employed to design compounds that modulate amyloid precursor protein metabolism, potentially delaying or treating the progression of Alzheimer’s disease (AD). Furthermore, multimodal models such as ChemCrow, which incorporate 18 specialized chemical tools, including synthetic route planning, reaction condition optimization, and compound property prediction, significantly enhance the efficiency of chemists in designing and synthesizing small-molecule compounds.
4.4. Toxicity Prediction and Safety Assessment of Small-Molecule Drugs
Approximately 30% of drug candidates for pharmaceutical development are discontinued due to toxicity risks, resulting in significant financial burdens and potential safety hazards for patients. Traditional toxicity assessments rely predominantly on animal testing; however, this method is limited by high costs, extended timelines, and physiological differences between species, which undermine translational accuracy. LLMs offer novel analytical approaches for predicting the toxicity of small molecules by extracting and converting features from diverse data sources into quantifiable metrics for toxicity assessments.
4.4.1. Sequence-Based Toxicity Classification Models
Traditional small-molecule toxicity classification models predominantly utilize molecular fingerprints or GNNs for feature extraction. By contrast, LLMs derive more nuanced representations directly from molecular sequences by drawing analogies to natural language sentences. This approach involves segmenting and embedding “chemical words,” thereby capturing structural features and enhancing the reliability of data-driven toxicity models. These models autonomously identify structural characteristics within molecular sequences through self-supervised pretraining, producing features of substantial value for subsequent toxicity classification tasks, such as predicting drug-induced QT interval prolongation, drug-induced teratogenicity, and drug-induced rhabdomyolysis. Research indicates that BERT-based architectures enhance toxicity prediction performance by integrating multimodal small-molecule data and leveraging prior knowledge of LLMs. For the classification of drug-induced liver injury and cardiotoxicity, an LLM framework based on the FDA Label achieved F1 scores exceeding 0.9.
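The "chemical word" segmentation step can be sketched with a regex tokenizer of the kind commonly used for SMILES; the pattern below is a simplified variant (multi-character atoms such as Cl/Br and bracketed atoms are kept whole) and the example molecule is aspirin's anion.

```python
# Segmenting a SMILES string into "chemical words" before embedding,
# analogous to tokenizing a sentence into words.

import re

SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|\(|\)|\.|=|#|-|\+|/|\\|@|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: every character must land in some token.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)[O-]"))  # aspirin anion
# -> ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c',
#     'c', '1', 'C', '(', '=', 'O', ')', '[O-]']
```

The resulting token sequence, not raw characters, is what gets embedded and fed to the Transformer, so that `Cl` is one chemical unit rather than a carbon followed by a stray letter.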
4.4.2. Improving Safety Assessment by Incorporating Contextual Information
In contrast to traditional methodologies, LLMs demonstrate superior proficiency in processing unstructured clinical narratives, such as patient safety event reports, thereby facilitating the automated extraction of toxicity features through NLP techniques. These models further incorporate textual descriptions of chemical substances from databases, such as the Comparative Toxicogenomics Database.
GPT series models offer validated solutions for the safety assessment of small-molecule drugs, showcasing their ability to simulate expert dialogs for the evaluation of hepatic, cardiac, and renal toxicity. Additionally, LLMs can predict the risk of drug withdrawal owing to safety concerns and establish multiagent data mining systems that integrate 11 ADMET datasets with 52,482 bioassay entries.
The prediction of Adverse Drug Events (ADEs) is essential for the development of safer small-molecule therapeutics and the enhancement of patient outcomes. LLMs that integrate therapeutic and patient information demonstrate a 21−38% improvement in ADE prediction performance compared with models that rely solely on structural data. The LLM-BiLSTM model further exceeded traditional machine learning approaches in identifying causally related drug-adverse event pairs within unstructured clinical discharge summaries, achieving a 16.1% increase in the mean F1 score. These advancements underscore the superior capability of LLMs to comprehensively capture potential adverse events associated with small-molecule drugs, thereby offering more precise tools for drug safety assessment.
4.5. Knowledge Extraction and Literature Analysis
LLMs demonstrate proficiency in processing unstructured text, such as biomedical literature and patents, to accurately extract and construct structured knowledge. This capability facilitates the efficient screening of valuable information from extensive biomedical literature corpora, thereby expediting the research process for small-molecule drugs.
4.5.1. Automated Literature Mining
The discovery and optimization of small-molecule drugs are fundamentally dependent on a comprehensive analysis of biomedical knowledge. Through meticulous fine-tuning, LLMs process the intricate chemical lexicon and scientific terminology present in the biomedical literature to execute essential tasks, such as compound entity recognition and the extraction of drug-target relationships. For example, the lit-OTAR model employs named entity recognition to automatically identify entities, such as genes, proteins, diseases, organisms, and chemical compounds in scientific texts, thereby providing evidential support for drug target validation. Additionally, the nach0 model, which is pretrained on unlabeled scientific literature, patents, and molecular strings, integrates a wide array of chemical and linguistic knowledge to address complex biochemical tasks, including answering biomedical questions, named entity recognition, molecular generation, molecular synthesis, and property prediction. Recent advancements indicate that LLMs enhanced with retrieval-augmented generation technology achieve performance improvements of 41−50% compared to traditional string-matching methods in drug information mapping tasks.
In the context of relation extraction, LLMs capture intricate associations, including DDI and compound-protein relationships, within biomedical texts by employing multihead self-attention mechanisms. For drug repurposing research, these models uncover latent therapeutic relationships by mining disease-drug association networks present in the PubMed literature. These applications illustrate that LLMs enhance literature mining workflows through their end-to-end knowledge extraction capabilities, thereby providing intelligent solutions for the discovery of small-molecule drugs.
4.5.2. Small-Molecule Drug Repurposing and Multimodal Knowledge Map Construction
The integration of biomedical knowledge graphs (BKGs) with LLMs substantially enhances the semantic understanding in the prediction of drug-disease associations, thereby broadening the therapeutic applications of approved small-molecule drugs.
BKGs amalgamate heterogeneous data from multiple sources to encapsulate topological and semantic features critical for drug repurposing. LLMs extract semantic information from unstructured texts and augment BKG associations through contextual reasoning, facilitating a deeper understanding of small-molecule drug mechanisms. For example, an analysis of 5000 PubMed articles led to the construction of a drug-gene/protein association network within a knowledge graph comprising 53,993,602 nodes and 19,367,406 edges. Innovative models such as knowledge graph transformers integrate LLMs with BKGs and mitigate hallucinatory outputs through graph structural constraints to enhance the predictive reliability. These technologies not only expedite hypothesis generation for drug repurposing but also prioritize clinically translatable candidates by providing mechanistic explanations. This demonstrates their potential in the discovery of small-molecule therapeutics for conditions such as Alzheimer’s, angioma, and coronavirus disease.
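A knowledge-graph association heuristic can be sketched in miniature: score a drug-disease pair by the overlap between the drug's target genes and the disease's associated genes. The graph content below is entirely invented, and real systems replace this simple neighborhood overlap with learned KG embeddings (e.g., from knowledge graph transformers).

```python
# Toy drug-disease link prediction over a miniature knowledge graph.
# Edges are stored as (entity, relation) -> set of neighbor entities.

graph = {
    ("drug_A", "targets"): {"GENE1", "GENE2"},
    ("drug_B", "targets"): {"GENE3"},
    ("disease_X", "associated_with"): {"GENE1", "GENE2", "GENE4"},
    ("disease_Y", "associated_with"): {"GENE5"},
}

def shared_gene_score(drug, disease):
    """Jaccard overlap between the drug's targets and the disease's genes."""
    targets = graph.get((drug, "targets"), set())
    assoc = graph.get((disease, "associated_with"), set())
    union = targets | assoc
    return len(targets & assoc) / len(union) if union else 0.0

# drug_A hits 2 of disease_X's 3 genes -> plausible repurposing lead.
print(shared_gene_score("drug_A", "disease_X"))  # ~0.667
print(shared_gene_score("drug_B", "disease_X"))  # 0.0
```

Scores grounded in explicit shared-gene paths are also what make such predictions mechanistically explainable, in contrast to free-text generation.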
4.6. LLMs Empower Multidisease Small-Molecule Drug Discovery Studies
LLMs have been used extensively to facilitate the discovery of small-molecule drugs across a wide range of disease domains, including noncommunicable respiratory diseases, cancers, infectious diseases, metabolic disorders, neurological conditions, articular disorders, immune system pathologies, and parasitic infections. These models enhance the comprehension of disease mechanisms while improving precision and efficiency in drug development, thereby significantly reducing discovery timelines. For example, Transformer-based models have successfully identified the antifibrotic target TNIK, leading to the development of small-molecule inhibitors that have progressed to Phase I clinical trials. The entire process, from target discovery to the nomination of a preclinical candidate, was completed in approximately 18 months. Table 4 summarizes the disease-specific applications and development stages of LLM-assisted small-molecule drug discovery.
Table 4. Domain-Specific LLM-Assisted Small-Molecule Discovery in Various Diseases.
| Disease type | Disease | Drug discovery phase | Domain-specific model | Function | Ref. |
|---|---|---|---|---|---|
| Noncommunicable respiratory diseases | Idiopathic pulmonary fibrosis | Target identification | PandaOmics | TNIK was targeted for idiopathic pulmonary fibrosis, leading to the development of the small-molecule inhibitor INS018_055, which showed antifibrotic and anti-inflammatory effects in vitro and in vivo, with good safety and pharmacokinetics in Phase I trials. | |
| Oncological diseases | Sarcoma | Virtual screening | GPT-4 | Gossypol was identified as effective against osteosarcoma after evaluating 60 polyphenols. | |
| | Prostate cancer | Retargeting | GPT-4 | True negative data were identified in clinical trials, and 980 potential prostate cancer drugs were screened. | |
| | Breast cancer | Molecular generation | MTmol-GPT | High-quality breast cancer drug molecules were generated to target multiple targets. | |
| | Multiple cancers | Virtual screening | BRAFPred | BRAF inhibitor activity was accurately predicted. | |
| | Nonsmall-cell lung carcinoma | Drug repurposing | MREDTA | Molecular features were extracted, and their robustness was confirmed in drugs for nonsmall cell lung cancer. | |
| Infectious diseases | COVID-19, Nipah | Identification and validation of drug targets | ChatGPT | Automated analysis of biomedical literature for quick identification of drug targets for COVID-19 and Nipah virus, along with efficient drug candidate generation and screening. | |
| | SARS-CoV-2 | Molecular generation | BERT | Efficient generation and screening of drug candidates against SARS-CoV-2 targets. | |
| | Coronavirus | Drug repurposing | PubMedBERT | Identification of COVID-19 drug candidates from sources like PubMed, with mechanistic explanations. | |
| | Chronic liver disease | Virtual screening | HBCVTr | Prediction and screening of novel inhibitor candidates against HBV and HCV. | |
| | Malaria | Identification and validation of drug targets | BERT-RGCN | Prediction of antimalarial drug efficacy against Plasmodium falciparum. | |
| | Tuberculosis | Molecular generation | TamGen | 14 compounds were found to inhibit the M. tuberculosis ClpP protease. | |
| | Diseases caused by viral infections | Target identification | BERT | High-accuracy virus identification. | |
| Metabolic diseases | Hyperuricemia | Target identification | BERT | Identification of potential technological opportunities for hyperuricemia drugs. | |
| Neurological disorders | Regulation of fentanyl and its analogues | Molecular generation, virtual screening | Transformer | 36,799 potential fentanyl analogs were screened out. | |
| | Glioma | Molecular design | MegaMolBART | BBB permeability of small-molecule compounds was predicted. | |
| | Parkinson’s disease | Drug repurposing | FuseLinker | Text- and knowledge-embedding-enhanced link prediction for Parkinson’s disease drug repurposing. | |
| | Ischemic stroke | Identification and validation of drug targets | StrokeDTI | Cerdulatinib was identified as a potential antistroke drug. | |
| | Alzheimer’s disease | Target identification, drug repurposing | PubMedBERT | AD-related semantic triples were explored in the Alzheimer’s knowledge graph. | |
| Joint diseases | Rheumatoid arthritis (RA) and joint fibrosis (AF) | Drug target identification and validation, drug repositioning | GPT-3.5-Turbo | Molecular links between rheumatoid arthritis and joint fibrosis were investigated to find common drugs. | |
| Diseases of the immune system | Allergic rhinitis | Drug repurposing | LLM-DDA | The potential therapeutic effects of prednisone on allergic rhinitis were identified. | |
| | HIV | Assessment of key pharmacokinetic and toxicological properties | Transformer | Drug properties were predicted and HIV integrase-1 candidates were identified. | |
| Parasitic diseases | Filariasis | Drug repurposing | ChatGPT | Assessment of medication recommendations for filariasis treatment. | |
In the realm of tumor drug discovery, LLMs have demonstrated the capability to accurately predict and screen high-quality drug molecules, thereby offering an efficient and novel tool for cancer treatment. For example, the BERT-TransBlock model effectively extracts molecular features to improve the accuracy and generalizability of drug-target binding affinity predictions, thus aiding the development of therapeutics for nonsmall cell lung cancer (NSCLC). Additionally, the MTmol-GPT model can generate high-quality multitarget drug molecules specifically for breast cancer. Furthermore, models such as CancerGPT exhibit robust performance in predicting drug synergy in rare tissues, making them particularly suitable for research on cancer types with limited data availability.
In infectious disease drug development, LLMs demonstrate a superior capacity to analyze the extensive biomedical literature, thereby expediting the identification of potential drug targets and candidates beyond the capabilities of traditional manual review methods. For instance, ChatGPT facilitates the automation of biomedical literature reviews, enabling the rapid identification of drug targets for pathogens such as SARS-CoV-2 and Nipah virus. It also identifies candidate drugs for coronavirus-related diseases by analyzing the PubMed literature and generating mechanistic explanations. Similarly, BERT-based methodologies that integrate literature, patents, and commercial data on hyperuricemia drugs have facilitated the identification of potential targets for therapeutics addressing metabolic disorders. In the context of neurological diseases, including Alzheimer’s and Parkinson’s, LLMs enhance the efficiency of drug repurposing by constructing knowledge graphs that amalgamate structural data, textual data, and domain-specific knowledge embeddings. Furthermore, in the study of joint disorders, LLMs employ text mining and data analysis to uncover significant genetic and functional similarities between rheumatoid arthritis (RA) and articular fibrosis (AF), thereby advancing our understanding of pathogenesis and suggesting novel avenues for drug repurposing.
In immune-related diseases, LLMs utilize self-attention mechanisms to derive molecular embeddings directly from SMILES sequences. This capability facilitates the identification of promising candidate drugs for HIV integrase-1 while effectively filtering compounds with suboptimal pharmacokinetic properties and toxicity. The LLM-DDA model has successfully predicted the potential therapeutic effect of prednisone on allergic rhinitis through drug-disease association prediction.
Furthermore, LLMs show promise in drug repurposing for parasitic diseases. For ten distinct clinical scenarios of filariasis, ChatGPT offers precise recommendations for repurposing potential therapeutics, which are consistent with existing medical research and literature.
4.7. Clinical Application and Translation
LLMs have exhibited a sophisticated ability to analyze intricate textual data within the context of clinical trial design, particularly in areas such as protocol generation, patient recruitment, and outcome prediction. This capability significantly expedites the translation of small-molecule drug candidates into clinical practice. For instance, ChatGPT, when fine-tuned on medical corpora, can effectively extract critical information (concise summaries and eligibility criteria) from oncology trial descriptions, thereby enhancing the efficiency of clinical decision-making. Additionally, LLMs can analyze patients’ electronic health records (EHRs) to evaluate the appropriateness of specific drugs for designated cohorts and to automatically match eligible individuals with small-molecule drug trials, thus improving recruitment efficiency. Tools such as TrialGPT are capable of predicting patient-trial compatibility and reducing screening costs. Furthermore, by synthesizing information related to drugs, diseases, and trial protocols, LLMs can project the likelihood of clinical success. The MEREDITH system synergistically combines LLMs with medical evidence retrieval to systematically identify and prioritize oncology therapeutic trial protocols that exhibit the highest predicted probabilities of success.
Based on the currently available data, a relatively small number of small-molecule drug candidates designed by AI have successfully advanced to the stage of clinical validation. By 2023, AI biotechnology firms had initiated clinical evaluation for 67 therapeutics discovered through AI methodologies, with small molecules comprising the largest share (>30%). AI-derived small-molecule compounds targeting SHP2, WRN, MALT1, TYK2, and serotonin receptors have entered clinical trials. Of particular note, rentosertib, a TNIK inhibitor, stands as the first investigational agent for which both the target and molecular scaffold were conceptualized through generative AI. Phase I evaluation was concluded in 2024, and by 2025, the Phase 2a trial reached a significant milestone. A 12-week, double-blind, randomized, placebo-controlled study demonstrated a statistically significant improvement in lung function, as measured by the change from baseline in forced expiratory volume in 1 s (FEV1), in patients receiving 60 mg once daily. This study offers the first clinical evidence that an AI-generated small-molecule drug can achieve both efficacy and tolerability in human subjects.
4.8. LLM-Driven Analytical Chemistry Empowering Small-Molecule Drug Discovery
Analytical chemistry plays a crucial role throughout the research and development (R&D) process of small-molecule drugs. It is primarily responsible for essential tasks such as elucidating molecular structures, determining properties, and monitoring reactions. The accuracy and efficiency of these tasks directly influence the progress and success rate of R&D efforts. Presently, LLMs, with their advanced capabilities in cross-modal data processing and chemical language comprehension, are facilitating the transformation of analytical chemistry from an empirical science to an intelligent discipline. This transformation introduces a novel technical approach to overcoming R&D bottlenecks, including challenges in interpreting high-dimensional spectral data, inefficiencies in integrating multisource information, and the lack of intelligence in experimental workflows.
4.8.1. Analysis of High-Dimensional Spectral Data
Algorithms for cross-modal mapping between spectroscopy and molecular structure, which are based on LLMs, facilitate the precise translation of spectral data into molecular structural information. This is achieved through the application of multimodal alignment and joint representation techniques, thereby markedly enhancing the efficiency and accuracy of the data processing.
The CSU-MS2 model incorporates an external space attention aggregation (ESA) module to dynamically align tandem mass spectrometry (MS/MS) data with their corresponding structural features. When applied to retrieve 1047 spectra from a library containing one million compounds, the model achieves a Recall@1 of 75.45%. In contrast, the SpecRecFormer model utilizes a self-attention mechanism to align characteristic peaks across the entire spectral range, thereby establishing global dependencies. This methodology enhances the rapid identification of individual components within mixed Raman spectra, achieving an identification accuracy exceeding 89%, and demonstrates superior performance in resolving overlapping peaks compared to CNNs.
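The Recall@1 figure reported for CSU-MS2 is an instance of Recall@K, which is simple to compute once ranked retrieval lists are available; the ranked lists and compound IDs below are invented for illustration.

```python
# Recall@K for spectrum-to-structure retrieval: the fraction of query
# spectra whose true compound appears among the top-K library hits.

def recall_at_k(ranked_hits, truths, k=1):
    """ranked_hits: one ranked candidate-ID list per query;
    truths: the true compound ID for each query."""
    correct = sum(1 for hits, t in zip(ranked_hits, truths) if t in hits[:k])
    return correct / len(truths)

ranked = [
    ["cmpd_7", "cmpd_2", "cmpd_9"],  # truth ranked first
    ["cmpd_4", "cmpd_1", "cmpd_8"],  # truth ranked second
    ["cmpd_3", "cmpd_6", "cmpd_5"],  # truth absent from the top 3
]
truths = ["cmpd_7", "cmpd_1", "cmpd_0"]
print(recall_at_k(ranked, truths, k=1))  # -> 0.3333333333333333
print(recall_at_k(ranked, truths, k=2))  # -> 0.6666666666666666
```

Recall@1 is the strictest setting: the model's single top hit must be the correct structure, which is why a 75.45% figure against a million-compound library is notable.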
Moreover, LLMs leverage the Transformer architecture to learn complex spectral features, effectively addressing the limitations of restricted spectral library coverage that are characteristic of traditional approaches. For example, the TransExION model identifies pertinent fragments in MS/MS spectra through mass difference analysis, assesses spectral similarity based on these fragments, and consequently facilitates interpretable identification of small-molecule structural analogs. In the context of structural similarity prediction, this model attains a Pearson correlation coefficient of 0.811, surpassing heuristic similarity measurement techniques such as cosine and modified cosine, as well as the unsupervised Spec2Vec model.
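The cosine baseline that TransExION is compared against treats each spectrum as a vector of peak intensities; a minimal version over binned m/z dictionaries, with invented peaks, looks like this.

```python
# Cosine similarity between two peak-intensity spectra, the classic
# heuristic for spectral library matching. Spectra are dicts mapping
# (binned) m/z values to intensities; only shared bins contribute.

import math

def cosine_similarity(spec_a, spec_b):
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = {77.0: 0.4, 105.0: 1.0, 122.0: 0.3}
library = {77.0: 0.5, 105.0: 0.9, 150.0: 0.2}
score = cosine_similarity(query, library)
print(round(score, 3))  # -> 0.938
```

Because only exactly matching m/z bins contribute to the dot product, plain cosine misses peaks shifted by a modification mass, which is the gap the "modified cosine" variant and learned models such as TransExION address.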
In a recent study, the research team led by Ludovic Duponchel developed a ChatGPT 4.0 tool specifically designed for smartphone terminals. This innovative tool facilitates the comprehensive analysis of laser-induced breakdown spectroscopy (LIBS) high-dimensional data through voice interaction commands. It effectively manages complex tasks, including data parsing, principal component analysis, and K-means clustering. The analysis results produced by this tool are consistent with those obtained using traditional methods, yet it enhances efficiency by more than 10-fold compared to manual coding. By integrating multimodal collaborative processing capabilities, encompassing voice commands, code generation, spectral data, and element distribution maps, the tool supports the cross-literature reproduction of new algorithms and offers a portable, intelligent solution for data analysis.
4.8.2. Extraction and Analysis of Multisource Chemical Information
LLMs demonstrate significant capabilities in the integration of multisource cheminformatics for extraction and analysis tasks. They effectively align chemical terminologies, such as SMILES codes and reaction equations, with natural language, while also synthesizing textual, numerical, and structural image information to efficiently execute tasks including compound extraction, reaction role annotation, and parsing of experimental workflows. Fine-tuned versions of GPT-3.5-turbo exhibit performance that is comparable to or surpasses that of specialized domain-specific models in chemical literature mining tasks: an F1-score of 90% in compound extraction, an F1-score of 83.0% in reaction role annotation, and exact accuracy rates of 82.7% for single reactions and 68.8% for multiple reactions in metal−organic framework (MOF) synthesis information extraction. These fine-tuned models also demonstrate an accuracy exceeding 85% in NMR spectroscopy data extraction and achieve state-of-the-art (SOTA) performance with a sentence-level precision of 69.0% in the task of converting experimental paragraphs into action sequences. ChemDFM, a model tailored for the chemical domain and built upon the LLaMA-13B architecture, significantly broadens the functional scope of computational chemistry. It facilitates essential tasks such as molecular identification, property prediction (with an average AUC-ROC of 77.7%), reaction prediction (achieving an accuracy of 49%), retrosynthetic analysis, and molecular design (demonstrating an exact SMILES match rate of 45%), and additionally incorporates capabilities for dialogue interaction and experimental design assistance. These advancements underscore the high accuracy, minimal coding requirements, and robust generalization properties of LLMs within the field of analytical chemistry.
4.8.3. Intelligentization of Chemistry Experimental Workflows
LLMs, with their advanced natural language understanding and task planning capabilities, have become pivotal technologies in driving innovations within the automation of analytical chemistry experiments. For instance, the ChemAgents system employs a hierarchical multiagent architecture based on Llama-3.1-70B, facilitating the autonomous execution of the entire chemical research workflow, encompassing literature mining, experimental design, and robotic execution. In the context of photocatalytic dehalogenation reactions, this system achieves a conversion rate approaching 100% within 24 h. Similarly, the Coscientist system leverages a GPT-4-driven AI framework to enable fully autonomous generation and execution of experimental scripts on Opentrons OT-2 and Emerald Cloud Lab. This system exhibits exceptional performance in tasks such as compound synthesis planning, document retrieval, hardware control, and cross-coupling reactions. Moreover, the LLM-RDF system facilitates engagement with automated experimental platforms through natural language directives, encompassing a range of processes, such as literature retrieval, condition screening, and reaction optimization. Its efficacy has been demonstrated across diverse synthesis tasks, including nucleophilic aromatic substitution (SNAr) reactions, photoredox C−C cross-coupling reactions, and heterogeneous photoelectrochemical reactions.
5. Challenges and Limitations
LLMs have demonstrated significant potential in the domain of small-molecule drug discovery. However, they also encounter numerous challenges that require resolution. This section examines the primary issues associated with LLMs, focusing on data quality and reliability, model interpretability, integration of domain knowledge, and ethical and privacy considerations. In addition, the impact of LLMs on the drug discovery process and potential avenues for enhancement are explored.
5.1. Data Quality and Reliability
Small-molecule drug discovery faces the challenge of integrating multisource heterogeneous data. Research has indicated that data collected from the ChEMBL database necessitate meticulous organization and quality assurance because of potential inaccuracies or inconsistencies. The scarcity of high-quality small-molecule interaction datasets further limits the reliability of LLMs in structure-based drug design. In critical tasks such as molecular toxicity prediction, LLMs can generate molecular designs that are structurally unsound or physically implausible. Models such as GPT-4 exhibit atomic-type misjudgments and bonding errors, which compromise accuracy. Furthermore, the reliability of these models depends on data quality: outliers and noise in the data can significantly affect model outputs, leading to suboptimal predictive performance.
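A lightweight syntactic screen can catch the grossest of such errors in generated SMILES before they reach downstream tools. The token set below is a deliberately simplified assumption; real valence and bonding checks require a cheminformatics toolkit such as RDKit:

```python
import re

# Toy syntactic screen for generated SMILES: balanced branches and known atom tokens.
# This catches only gross errors (unknown symbols, unbalanced parentheses);
# it does NOT verify valence, aromaticity, or ring closure semantics.
ATOM = re.compile(r"Cl|Br|[BCNOSPFI]|\[[^\]]+\]|[cnosp]")
OTHER = re.compile(r"[=#\-+()/\\@\d%.]")

def looks_valid(smiles: str) -> bool:
    depth, i, seen_atom = 0, 0, False
    while i < len(smiles):
        m = ATOM.match(smiles, i)
        if m:
            seen_atom = True
            i = m.end()
            continue
        ch = smiles[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing branch with no open branch
        elif not OTHER.match(ch):
            return False      # unrecognized token
        i += 1
    return seen_atom and depth == 0

print(looks_valid("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True
print(looks_valid("CC(=O)Q"))                # unknown atom symbol -> False
```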
5.2. Model Interpretability
Current LLMs encounter significant challenges related to the “black box” problem in the realm of small-molecule drug discovery. This issue arises from the lack of transparency in the models’ decision-making processes, which undermines researchers’ confidence in outcomes related to molecular synthesis or target prediction. Shapley Additive Explanations (SHAP), a cooperative game-theoretic approach grounded in the Shapley value, offers a potential solution by quantifying the contribution of each feature to a model’s predictions, thereby enhancing interpretability. Nonetheless, existing SHAP methodologies struggle to elucidate the complex reasoning processes of LLMs when dealing with cross-modal data. In molecular optimization tasks, LLMs may inadvertently rely on superficial data correlations rather than authentic bioactivity relationships; for instance, during structural optimization, a model may mistakenly preserve functional groups that do not contribute to efficacy, an opaque decision that can mislead downstream design.
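The Shapley value underlying SHAP can be computed exactly for small feature sets by enumerating coalitions. The three-feature scoring function below is hypothetical, invented purely for illustration; it shows how an interaction term is split evenly between the participating features while an inert feature receives zero attribution:

```python
from itertools import combinations
from math import factorial

def model(present):
    # Hypothetical activity score from three binary molecular features:
    # f0 (e.g. an H-bond donor) adds 1.0, f1 (e.g. an aromatic ring) adds 2.0,
    # their co-occurrence adds a 0.5 synergy term, and f2 is inert.
    score = 0.0
    if 0 in present: score += 1.0
    if 1 in present: score += 2.0
    if 0 in present and 1 in present: score += 0.5
    return score

def shapley(n, f):
    # Exact Shapley values: weighted average of each feature's marginal
    # contribution over all coalitions of the other features.
    values = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                s = set(coalition)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                values[i] += weight * (f(s | {i}) - f(s))
    return values

vals = shapley(3, model)
print([round(v, 3) for v in vals])  # -> [1.25, 2.25, 0.0]; the 0.5 synergy is shared
```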
5.3. Insufficient Integration of Domain Knowledge
A notable knowledge gap exists between the general pretraining paradigms of LLMs and the specialized knowledge systems required for small-molecule drug discovery. When handling multimodal data, such as compound structures (e.g., SMILES) and documentary evidence, current models often encounter difficulties in accurately capturing the intrinsic relationships between different modalities, resulting in incomplete information integration. For instance, models may struggle to fully comprehend the semantic correspondence between SMILES sequences and chemical text descriptions, or may be unable to effectively associate molecular structure images with relevant bioactivity data. This inadequate integration constrains the ability of the model to achieve a comprehensive understanding and limits its predictive accuracy for tasks related to small-molecule drug discovery.
5.4. Ethical and Privacy Issues
Ethical and privacy challenges associated with LLMs are primarily evident in data integration and decision-making processes. For instance, the training of these models may inadvertently utilize patient omics data, potentially leading to unauthorized disclosure of patient genomic information, thereby contravening the General Data Protection Regulation (GDPR). Additionally, legal ambiguity exists in the realm of intellectual property, particularly concerning the molecular structures generated by LLMs, which complicates the delineation of rights among contributors to training data, model developers, and end users.
6. Future Research Directions
Although LLMs have significant potential for application in small-molecule drug discovery, they still face numerous challenges and limitations. To address these issues, future research should focus on specific areas to enhance the effective application of LLMs in small-molecule drug discovery.
6.1. Enhancing LLMs and Experimentation-Computing Collaboration Integration
Future research must enhance the cross-disciplinary integration of LLMs with experimental-computational technologies to overcome the current fragmentation between virtual computation and experimental validation. Presently, LLMs are predominantly utilized for computational tasks, such as molecular generation and activity prediction, but they exhibit limited collaboration with high-throughput experimental platforms. An optimal pathway for technical integration should encompass automated platforms for the rapid execution of synthesis and activity detection, along with real-time feedback of experimental data for model refinement. Such integration is anticipated to substantially shorten the optimization cycle of traditional lead compounds and enhance research and development efficiency.
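A minimal sketch of such a closed loop is shown below, with a simulated assay standing in for the robotic platform and a polynomial surrogate standing in for the learned model; both are illustrative assumptions rather than any published system:

```python
import numpy as np

rng = np.random.default_rng(1)

def assay(x):
    # Stand-in for a robotic activity measurement; peak activity near x = 0.7.
    return np.exp(-(x - 0.7) ** 2 / 0.02)

candidates = np.linspace(0, 1, 101)          # the searchable design space
measured_x = list(rng.uniform(0, 1, 3))      # a few seed experiments
measured_y = [assay(x) for x in measured_x]

for _ in range(10):
    # Refit the surrogate on ALL data gathered so far (the feedback step).
    deg = min(4, len(measured_x) - 1)
    coeffs = np.polyfit(measured_x, measured_y, deg=deg)
    preds = np.polyval(coeffs, candidates)
    pick = candidates[int(np.argmax(preds))]  # greedy choice of the next experiment
    measured_x.append(pick)
    measured_y.append(assay(pick))            # "run" the experiment; loop closes

best = measured_x[int(np.argmax(measured_y))]
print(round(float(best), 2))
```

The greedy selection here is the simplest possible acquisition rule; practical active-learning loops add an exploration term to avoid resampling the same candidate.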
6.2. Multimodal Representation Learning and Annotation Optimization
Small-molecule drug discovery necessitates the integration of heterogeneous data from multiple sources, a process that often encounters challenges such as data inaccuracies and inconsistencies. To address these issues, the development of more efficient and intelligent algorithms for data collection and quality inspection is imperative to enable the automatic identification and correction of such data problems. Furthermore, the presence of outliers and noise can adversely affect the predictive performance of models, highlighting the need for novel data-cleaning techniques to enhance model robustness with respect to data quality. Additionally, the establishment of a data quality assessment and monitoring system is crucial for ensuring data reliability.
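As one concrete ingredient of such a cleaning pipeline, an interquartile-range rule can flag gross outliers in an assay column before training. The IC50 values below are fabricated for illustration:

```python
import numpy as np

def iqr_mask(values, k=1.5):
    # Tukey's fences: keep values within [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# Fabricated IC50 measurements (uM) with one transcription error.
ic50 = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 48.0, 5.1])
mask = iqr_mask(ic50)
print(ic50[mask])  # the 48.0 entry is flagged and dropped
```

In practice such flags feed a review queue rather than a silent drop, since some extreme values are genuine rather than erroneous.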
Future research should focus on integrating multimodal data, including protein sequences, biomolecular structures, and biomedical texts, to develop more advanced cross-modal characterization models. For instance, multimodal models such as the Functional Annotation of Protein Multimodal (FAPM) can accurately annotate protein characteristics, including functional elements and catalytic activities. The application of self-supervised learning techniques can reduce the dependence on manual labeling, while the prediction accuracy of protein language models (PLMs) can be improved through the characterization of 3D conformational ensembles.
6.3. Development and Validation of Trustworthy Models
Currently, LLMs are encountering challenges related to their opaque nature in the context of small-molecule drug discovery. Future advancements could involve refining and optimizing methods, such as SHAP, to analyze the reasoning processes of LLMs more effectively. In addition, developing specialized interpretability frameworks informed by domain-specific knowledge can enhance researchers’ confidence in the outcomes of these models. It is imperative to establish rigorous evaluation frameworks to assess the reliability of models in drug discovery, such as designing pharmacology test sets tailored to specific domains, to verify the clinical applicability of LLMs. Furthermore, it is crucial to address the issues of model transparency, including the interpretability of visual representations in multimodal large models.
6.4. Domain Knowledge Augmentation Techniques
To enhance the multimodal data integration capabilities of LLMs in small-molecule drug discovery, it is necessary to integrate specialized knowledge systems with pretraining paradigms and to develop dedicated pretraining strategies and model architectures. Optimizing model architectures and algorithms can enable more effective integration of multimodal data and a deeper grasp of complex problems. Additionally, continuously refining LLM-driven automatic target analysis tools based on the biomedical literature will enable assessment of the potential benefits and risks of new targets in the early stages of drug discovery.
6.5. Resource Optimization and Ethical Compliance Design
LLMs encounter significant ethical and privacy challenges in the field of small-molecule drug discovery. Future research should focus on developing strategies to safeguard patient privacy, thereby preventing leakage of sensitive information. Additionally, there is a need to establish and enhance mechanisms for intellectual property protection and to clarify the ownership of rights. Strengthening research on ethical and privacy concerns is imperative, alongside the formulation of laws, regulations, and industry standards, to ensure ethical compliance and sustainable development in this field. It is also essential to develop privacy-preserving frameworks for healthcare multimodal systems, assess the regulatory compliance of AI-generated drugs, and establish ethical guidelines for interdisciplinary collaboration. Furthermore, research should investigate the synthesizability and toxicity prediction of molecules generated by chemical language models, ensuring adherence to drug development guidelines.
7. Conclusion
LLMs have demonstrated transformative potential for small-molecule drug discovery. By expediting the target identification process and enhancing the molecular design workflow, LLMs have significantly increased the research and development efficiency.
However, LLMs encounter technical challenges, including inadequate data quality, limitations in model generation, and ethical concerns regarding small-molecule drug discovery. Addressing these issues necessitates the establishment of an interdisciplinary collaborative framework involving chemists, clinicians, and AI specialists to collectively refine the model performance and ensure the ethical application of the technology.
In the future, small-molecule drug discovery is expected to advance through the integration of multimodal fusion technologies, culminating in the establishment of comprehensive automated systems. MLLMs are capable of concurrently processing textual descriptions, molecular structures, and omics data, whereas automated systems facilitate the entire AI-driven process, from target identification to preclinical research. Within the domain of small-molecule drug discovery, it is imperative to elevate LLMs from their current role as auxiliary tools to that of central engines of research and development. This transition should be accompanied by efforts to enhance the efficiency of drug development through interdisciplinary collaboration and to ensure that technological advancements adhere to ethical standards.
Acknowledgments
This work was supported by the key project of the National Natural Science Foundation of China (82030091), the key project of the Natural Science Foundation of Liaoning Science Foundation (2024JH2/102500016, 2023JYTMS20230135, 2022JH1/10400001), and the key project of the China Medical University Virtual Simulation Construction Project (YDXF2024012).
Biographies
Jin Ma is an associate professor at the School of Intelligent Medicine, China Medical University. Her research focuses on intelligent medicine and large-scale health-data analytics. She has coauthored multiple textbooks, including Medical Big Data Mining and Applications and Fundamentals of Big Data Applications.
Jia Liu is an associate professor at the School of Intelligent Medicine, China Medical University. She earned her master’s degree in Biomedical Engineering from Northeastern University and now works at the interface of computer science and medicine. Her research focuses on intelligent medicine and large-scale health-data analytics.
Dongyu Xu is an associate professor in the Department of Computer Science of China Medical University. He obtained a bachelor’s degree in computer software from Liaoning University in China and a master’s degree from Shenyang Aerospace University. Currently, his research focuses on the application of AI in biological sciences, the construction of medical knowledge graphs, and data mining in global medical databases.
Professor Xiaoyu Song, director of the Health Sciences Institute at China Medical University and research leader of the Key Laboratory of Medical Cell Biology of the Ministry of Education, has conducted a series of studies on the mechanisms of the occurrence and development of aging-related diseases. Focusing on genome stability maintenance and the occurrence and development of tumors, she has published multiple high-quality research papers as corresponding or cocorresponding author in journals such as Science Advances, EMBO Journal, Cell Reports, Oncogene, and Cell Death & Differentiation.
Zhichang Zhang is an Associate Professor and Vice Director of the Department of Computer Science, School of Intelligent Medicine, China Medical University, and concurrently serves as a Director at the China Medical Education Association (CMEA). His primary research focus is on NLP and Intelligent Medicine. He authored the textbook “Frontiers of Science Maps and Medical SCI Paper Writing”, which won the 2024 National University AI + Digital Economy Textbook Award; he also received the CMEA Science and Technology Progress Award in 2018, 2019, 2022, and 2024, and was named an Excellent Teacher by China Medical University in 2024.
J.M., J.L., and D.X. contributed equally.
The authors declare no competing financial interest.
References
- Ye G., Cai X., Lai H., Wang X., Huang J., Wang L., Liu W., Zeng X.. DrugAssist: a large language model for molecule optimization. Briefings Bioinf. 2024;26(1):bbae693. doi: 10.1093/bib/bbae693. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guan C., Fernandes F. C., Franco O. L., de la Fuente-Nunez C.. Leveraging large language models for peptide antibiotic design. Cell Rep. Phys. Sci. 2025;6(1):102359. doi: 10.1016/j.xcrp.2024.102359. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raouf Y. S.. Targeting histone deacetylases: Emerging applications beyond cancer. Drug Discovery Today. 2024;29(9):104094. doi: 10.1016/j.drudis.2024.104094. [DOI] [PubMed] [Google Scholar]
- Stuart D. D., Guzman-Perez A., Brooijmans N., Jackson E. L., Kryukov G. V., Friedman A. A., Hoos A.. Precision Oncology Comes of Age: Designing Best-in-Class Small Molecules by Integrating Two Decades of Advances in Chemistry, Target Biology, and Data Science. Cancer Discovery. 2023;13(10):2131–2149. doi: 10.1158/2159-8290.CD-23-0280. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang Y., Yang F., Wang B., Xie L., Chen W.. New FDA drug approvals for 2024: Synthesis and clinical application. Eur. J. Med. Chem. 2025;285:117241. doi: 10.1016/j.ejmech.2025.117241. [DOI] [PubMed] [Google Scholar]
- Chakraborty C., Bhattacharya M., Pal S., Chatterjee S., Das A., Lee S.-S.. AI-enabled language models (LMs) to large language models (LLMs) and multimodal large language models (MLLMs) in drug discovery and development. J. Adv. Res. 2025;78:377–389. doi: 10.1016/j.jare.2025.02.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Duo L., Liu Y., Ren J., Tang B., Hirst J. D.. Artificial intelligence for small molecule anticancer drug discovery. Expert Opin. Drug Discovery. 2024;19(8):933–948. doi: 10.1080/17460441.2024.2367014. [DOI] [PubMed] [Google Scholar]
- Hong Y., Ye Y., Tang H.. Machine Learning in Small-Molecule Mass Spectrometry. Annu. Rev. Anal. Chem. 2025;18(1):193–215. doi: 10.1146/annurev-anchem-071224-082157. [DOI] [PubMed] [Google Scholar]
- Keith J. A., Vassilev-Galindo V., Cheng B., Chmiela S., Gastegger M., Müller K.-R., Tkatchenko A.. Combining Machine Learning and Computational Chemistry for Predictive Insights Into Chemical Systems. Chem. Rev. 2021;121(16):9816–9872. doi: 10.1021/acs.chemrev.1c00107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- He D., Liu Q., Mi Y., Meng Q., Xu L., Hou C., Wang J., Li N., Liu Y., Chai H.. et al. De Novo Generation and Identification of Novel Compounds with Drug Efficacy Based on Machine Learning. Adv. Sci. 2024;11(11):e2307245. doi: 10.1002/advs.202307245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- An X., Chen X., Yi D., Li H., Guan Y.. Representation of molecules for drug response prediction. Briefings Bioinf. 2022;23(1):bbab393. doi: 10.1093/bib/bbab393. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Joshi P. B.. Navigating with chemometrics and machine learning in chemistry. Artif. Intell. Rev. 2023;56(9):9089–9114. doi: 10.1007/s10462-023-10391-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cuevas-Zuviría B., Pacios L. F.. Machine Learning of Analytical Electron Density in Large Molecules Through Message-Passing. J. Chem. Inf. Model. 2021;61(6):2658–2666. doi: 10.1021/acs.jcim.1c00227. [DOI] [PubMed] [Google Scholar]
- Shalit Peleg H., Milo A.. Small Data Can Play a Big Role in Chemical Discovery. Angew. Chem., Int. Ed. 2023;62(26):e202219070. doi: 10.1002/anie.202219070. [DOI] [PubMed] [Google Scholar]
- Chakraborty C., Bhattacharya M., Lee S. S., Wen Z. H., Lo Y. H.. The changing scenario of drug discovery using AI to deep learning: Recent advancement, success stories, collaborations, and challenges. Mol. Ther. Nucleic Acids. 2024;35(3):102295. doi: 10.1016/j.omtn.2024.102295. [DOI] [PMC free article] [PubMed] [Google Scholar]
- De Busser B., Roth L., De Loof H.. The role of large language models in self-care: a study and benchmark on medicines and supplement guidance accuracy. Int. J. Clin. Pharm. 2025;47(4):1001–1010. doi: 10.1007/s11096-024-01839-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang E. T. C., Yang J. S., Liao K. Y. K., Tseng W. C. W., Lee C. K., Gill M., Compas C., See S., Tsai F. J.. Predicting blood-brain barrier permeability of molecules with a large language model and machine learning. Sci. Rep. 2024;14(1):15844. doi: 10.1038/s41598-024-66897-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sun J., Wang H., Mi J., Wan J., Gao J.. MTAF-DTA: multi-type attention fusion network for drug-target affinity prediction. BMC Bioinf. 2024;25(1):375. doi: 10.1186/s12859-024-05984-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu X. H., Lu Z. H., Wang T., Liu F.. Large language models facilitating modern molecular biology and novel drug development. Front. Pharmacol. 2024;15:1458739. doi: 10.3389/fphar.2024.1458739. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ye G.. De novo drug design as GPT language modeling: large chemistry models with supervised and reinforcement learning. J. Comput.-Aided Mol. Des. 2024;38(1):20. doi: 10.1007/s10822-024-00559-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Perez R., Li X., Giannakoulias S., Petersson E. J.. AggBERT: Best in Class Prediction of Hexapeptide Amyloidogenesis with a Semi-Supervised ProtBERT Model. J. Chem. Inf. Model. 2023;63(18):5727–5733. doi: 10.1021/acs.jcim.3c00817. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang P., Kearney L., Bhowmik D., Fox Z., Naskar A. K., Gounley J.. Transferring a Molecular Foundation Model for Polymer Property Predictions. J. Chem. Inf. Model. 2023;63(24):7689–7698. doi: 10.1021/acs.jcim.3c01650. [DOI] [PubMed] [Google Scholar]
- Zhang Y., Mastouri M., Zhang Y.. Accelerating drug discovery, development, and clinical trials by artificial intelligence. Med. 2024;5(9):1050–1070. doi: 10.1016/j.medj.2024.07.026. [DOI] [PubMed] [Google Scholar]
- Liu P., Ren Y., Tao J., Ren Z.. GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text. Comput. Biol. Med. 2024;171:108073. doi: 10.1016/j.compbiomed.2024.108073. [DOI] [PubMed] [Google Scholar]
- Bran A. M., Cox S., Schilter O., Baldassari C., White A. D., Schwaller P.. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 2024;6(5):525–535. doi: 10.1038/s42256-024-00832-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Siebenmorgen T., Menezes F., Benassou S., Merdivan E., Didi K., Mourão A. S. D., Kitel R., Liò P., Kesselheim S., Piraud M.. et al. MISATO: machine learning dataset of protein−ligand complexes for structure-based drug discovery. Nat. Comput. Sci. 2024;4(5):367–378. doi: 10.1038/s43588-024-00627-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen M., Jiang X., Zhang L., Chen X., Wen Y., Gu Z., Li X., Zheng M.. The emergence of machine learning force fields in drug design. Med. Res. Rev. 2024;44(3):1147–1182. doi: 10.1002/med.22008. [DOI] [PubMed] [Google Scholar]
- Isert C., Atz K., Schneider G.. Structure-based drug design with geometric deep learning. Curr. Opin. Struct. Biol. 2023;79:102548. doi: 10.1016/j.sbi.2023.102548. [DOI] [PubMed] [Google Scholar]
- Korshunova M., Ginsburg B., Tropsha A., Isayev O.. OpenChem: A Deep Learning Toolkit for Computational Chemistry and Drug Design. J. Chem. Inf. Model. 2021;61(1):7–13. doi: 10.1021/acs.jcim.0c00971. [DOI] [PubMed] [Google Scholar]
- Li H., Sun X., Cui W., Xu M., Dong J., Ekundayo B. E., Ni D., Rao Z., Guo L., Stahlberg H.. et al. Computational drug development for membrane protein targets. Nat. Biotechnol. 2024;42(2):229–242. doi: 10.1038/s41587-023-01987-2. [DOI] [PubMed] [Google Scholar]
- Lin E., Lin C.-H., Lane H.-Y.. De Novo Peptide and Protein Design Using Generative Adversarial Networks: An Update. J. Chem. Inf. Model. 2022;62(4):761–774. doi: 10.1021/acs.jcim.1c01361. [DOI] [PubMed] [Google Scholar]
- van Tilborg D., Brinkmann H., Criscuolo E., Rossen L., Özçelik R., Grisoni F.. Deep learning for low-data drug discovery: Hurdles and opportunities. Curr. Opin. Struct. Biol. 2024;86:102818. doi: 10.1016/j.sbi.2024.102818. [DOI] [PubMed] [Google Scholar]
- Bluthgen C.. Technical foundations of large language models. Die Radiol. 2025;65(4):227–234. doi: 10.1007/s00117-025-01427-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Taylor N., Schofield D., Kormilitzin A., Joyce D. W., Nevado-Holgado A.. Developing healthcare language model embedding spaces. Artif. Intell. Med. 2024;158:103009. doi: 10.1016/j.artmed.2024.103009. [DOI] [PubMed] [Google Scholar]
- Mastrolorito F., Ciriaco F., Togo M. V., Gambacorta N., Trisciuzzi D., Altomare C. D., Amoroso N., Grisoni F., Nicolotti O.. fragSMILES as a chemical string notation for advanced fragment and chirality representation. Commun. Chem. 2025;8(1):26. doi: 10.1038/s42004-025-01423-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang X.-C., Wu C.-K., Yi J.-C., Zeng X.-X., Yang C.-Q., Lu A.-P., Hou T.-J., Cao D.-S.. Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration. Research. 2022;2022:0004. doi: 10.34133/research.0004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tang H., Shao L., Sebe N., Van Gool L.. Graph Transformer GANs With Graph Masked Modeling for Architectural Layout Generation. IEEE Trans. Pattern Anal. Mach. Intell. 2024;46(6):4298–4313. doi: 10.1109/TPAMI.2024.3355248. [DOI] [PubMed] [Google Scholar]
- Yuan Q., Chen S., Rao J., Zheng S., Zhao H., Yang Y.. AlphaFold2-aware protein−DNA binding site prediction using graph transformer. Briefings Bioinf. 2022;23(2):bbab564. doi: 10.1093/bib/bbab564. [DOI] [PubMed] [Google Scholar]
- Fallani A., Nugmanov R., Arjona-Medina J., Wegner J. K., Tkatchenko A., Chernichenko K.. Pretraining graph transformers with atom-in-a-molecule quantum properties for improved ADMET modeling. J. Cheminf. 2025;17(1):25. doi: 10.1186/s13321-025-00970-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Méndez-Lucio O., Nicolaou C. A., Earnshaw B.. MolE: a foundation model for molecular graphs using disentangled attention. Nat. Commun. 2024;15(1):9431. doi: 10.1038/s41467-024-53751-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang X., Wang Y., Lin Y., Zhang M., Liu O., Shuai J., Zhao Q.. A Multi-Task Self-Supervised Strategy for Predicting Molecular Properties and FGFR1 Inhibitors. Adv. Sci. 2025;12(13):e2412987. doi: 10.1002/advs.202412987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Duan Y., Yang X., Zeng X., Wang W., Deng Y., Cao D.. Enhancing Molecular Property Prediction through Task-Oriented Transfer Learning: Integrating Universal Structural Insights and Domain-Specific Knowledge. J. Med. Chem. 2024;67(11):9575–9586. doi: 10.1021/acs.jmedchem.4c00692. [DOI] [PubMed] [Google Scholar]
- Zhao R., Li W., Xu J., Chen L., Wei X., Kong X.. A CNN-based self-supervised learning framework for small-sample near-infrared spectroscopy classification. Anal. Methods. 2025;17(5):1090–1100. doi: 10.1039/D4AY01970A. [DOI] [PubMed] [Google Scholar]
- Wang J., Liu Y., Tian B.. Protein-small molecule binding site prediction based on a pre-trained protein language model with contrastive learning. J. Cheminf. 2024;16(1):125. doi: 10.1186/s13321-024-00920-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dey V., Ning X.. Enhancing molecular property prediction with auxiliary learning and task-specific adaptation. J. Cheminf. 2024;16(1):85. doi: 10.1186/s13321-024-00880-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Duan Y.-J., Fu L., Zhang X.-C., Long T.-Z., He Y.-H., Liu Z.-Q., Lu A.-P., Deng Y.-F., Hsieh C.-Y., Hou T.-J.. et al. Improved GNNs for Log D7.4 Prediction by Transferring Knowledge from Low-Fidelity Data. J. Chem. Inf. Model. 2023;63(8):2345–2359. doi: 10.1021/acs.jcim.2c01564. [DOI] [PubMed] [Google Scholar]
- Han X., Cai J., Bai C., Wu Z.. Triview Molecular Representation Learning Combined with Multitask Optimization for Enhanced Molecular Property Prediction. J. Chem. Inf. Model. 2025;65(10):5163–5175. doi: 10.1021/acs.jcim.5c00436. [DOI] [PubMed] [Google Scholar]
- Takahashi K., Fukai T., Sakai Y., Takekawa T.. Goal-oriented inference of environment from redundant observations. Neural Networks. 2024;174:106246. doi: 10.1016/j.neunet.2024.106246. [DOI] [PubMed] [Google Scholar]
- Al-Hamadani M. N. A., Fadhel M. A., Alzubaidi L., Harangi B.. Reinforcement Learning Algorithms and Applications in Healthcare and Robotics: A Comprehensive and Systematic Review. Sensors. 2024;24(8):2461. doi: 10.3390/s24082461. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thomas M., O’Boyle N. M., Bender A., de Graaf C.. Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. J. Cheminf. 2022;14(1):68. doi: 10.1186/s13321-022-00646-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Suzuki T., Ma D., Yasuo N., Sekijima M. M.. Multiobjective de novo Molecular Generation Using Monte Carlo Tree Search. J. Chem. Inf. Model. 2024;64(19):7291–7302. doi: 10.1021/acs.jcim.4c00759. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fromer J. C., Coley C. W.. Computer-aided multi-objective optimization in small molecule discovery. Patterns. 2023;4(2):100678. doi: 10.1016/j.patter.2023.100678. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang J., Zhu F.. Multi-objective molecular generation via clustered Pareto-based reinforcement learning. Neural Networks. 2024;179:106596. doi: 10.1016/j.neunet.2024.106596. [DOI] [PubMed] [Google Scholar]
- Fan Z., Yang Y., Xu M., Chen H.. EC-Conf: A ultra-fast diffusion model for molecular conformation generation with equivariant consistency. J. Cheminf. 2024;16(1):107. doi: 10.1186/s13321-024-00893-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alakhdar A., Poczos B., Washburn N.. Diffusion Models in De Novo Drug Design. J. Chem. Inf. Model. 2024;64(19):7238–7256. doi: 10.1021/acs.jcim.4c01107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin H., Huang Y., Zhang O., Ma S., Liu M., Li X., Wu L., Wang J., Hou T., Li S. Z.. DiffBP: generative diffusion of 3D molecules for target protein binding. Chem. Sci. 2025;16(3):1417–1431. doi: 10.1039/D4SC05894A. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yuan Y., Pan X., Li X., Zhang R., Su W.. A 3D generation framework using diffusion model and reinforcement learning to generate multi-target compounds with desired properties. J. Cheminf. 2025;17(1):93. doi: 10.1186/s13321-025-01035-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bolcato G., Heid E., Boström J.. On the Value of Using 3D Shape and Electrostatic Similarities in Deep Generative Methods. J. Chem. Inf. Model. 2022;62(6):1388–1398. doi: 10.1021/acs.jcim.1c01535. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chang J., Ye J. C.. Bidirectional generation of structure and properties through a single molecular foundation model. Nat. Commun. 2024;15(1):2323. doi: 10.1038/s41467-024-46440-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Boadu F., Cao H., Cheng J.. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Bioinformatics. 2023;39(Supplement_1):i318–i325. doi: 10.1093/bioinformatics/btad208. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li N., Qiao J., Gao F., Wang Y., Shi H., Zhang Z., Cui F., Zhang L., Wei L.. GICL: A Cross-Modal Drug Property Prediction Framework Based on Knowledge Enhancement of Large Language Models. J. Chem. Inf. Model. 2025;65(11):5518–5527. doi: 10.1021/acs.jcim.5c00895. [DOI] [PubMed] [Google Scholar]
- Wang J., Qin R., Wang M., Fang M., Zhang Y., Zhu Y., Su Q., Gou Q., Shen C., Zhang O.. et al. Token-Mol 1.0: tokenized drug design with large language models. Nat. Commun. 2025;16(1):4416. doi: 10.1038/s41467-025-59628-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jin C., Guo S., Zhou S., Guan J.. Effective and Explainable Molecular Property Prediction by Chain-of-Thought Enabled Large Language Models and Multi-Modal Molecular Information Fusion. J. Chem. Inf. Model. 2025;65(11):5438–5455. doi: 10.1021/acs.jcim.5c00577. [DOI] [PubMed] [Google Scholar]
- Wang X., Yan J., Jin B., Li W.. Distributed and Parallel ADMM for Structured Nonconvex Optimization Problem. IEEE Trans Cybern. 2021;51(9):4540–4552. doi: 10.1109/TCYB.2019.2950337. [DOI] [PubMed] [Google Scholar]
- Kang H., Goo S., Lee H., Chae J. W., Yun H. Y., Jung S.. Fine-tuning of BERT Model to Accurately Predict Drug-Target Interactions. Pharmaceutics. 2022;14(8):1710. doi: 10.3390/pharmaceutics14081710. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang H., Wang Z., Shi M., Cheng Z., Qian Y.. Enhancing Unconditional Molecule Generation via Online Knowledge Distillation of Scaffolds. Molecules. 2025;30(6):1262. doi: 10.3390/molecules30061262. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen Q., Hu Y., Peng X., Xie Q., Jin Q., Gilson A., Singer M. B., Ai X., Lai P. T., Wang Z.. et al. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat. Commun. 2025;16(1):3280. doi: 10.1038/s41467-025-56989-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhong F., Yue R., Chen J., Wang D., Ma S., Chen S.. Folding-Based End-To-End Chemical Drug Design with Uncertainty Estimation: Tackling Hallucination in the Post-GPT Era. J. Med. Chem. 2025;68(6):6804–6814. doi: 10.1021/acs.jmedchem.5c00271. [DOI] [PubMed] [Google Scholar]
- Sun C., Yang Z. H., Wang L., Zhang Y., Lin H. F., Wang J.. Attention guided capsule networks for chemical-protein interaction extraction. J. Biomed. Inf. 2020;103:103392. doi: 10.1016/j.jbi.2020.103392. [DOI] [PubMed] [Google Scholar]
- Zaikis D., Vlahavas I.. TP-DDI: Transformer-based pipeline for the extraction of Drug-Drug Interactions. Artif. Intell. Med. 2021;119:102153. doi: 10.1016/j.artmed.2021.102153. [DOI] [PubMed] [Google Scholar]
- Li J. C., Jiang X. F.. Mol-BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction. Wireless Commun. Mobile Comput. 2021;2021:7181815. doi: 10.1155/2021/7181815. [DOI] [Google Scholar]
- Abdel-Aty H., Gould I. R.. Large-Scale Distributed Training of Transformers for Chemical Fingerprinting. J. Chem. Inf. Model. 2022;62(20):4852–4862. doi: 10.1021/acs.jcim.2c00715. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu Z. X., Jiang D. J., Wang J. K., Zhang X. J., Du H. Y., Pan L. R., Hsieh C. Y., Cao D. S., Hou T. J.. Knowledge-based BERT: a method to extract molecular features such as computational chemists. Briefings Bioinf. 2022;23(3):bbac131. doi: 10.1093/bib/bbac131. [DOI] [PubMed] [Google Scholar]
- Wei L. S., Long W. T., Wei L. Y. M.-C.. Multi-view deep learning model for compound-protein interaction prediction. Methods. 2022;204:418–427. doi: 10.1016/j.ymeth.2022.01.008. [DOI] [PubMed] [Google Scholar]
- Wen N. F., Liu G. Q., Zhang J., Zhang R. B., Fu Y. T., Han X.. A fingerprints based molecular property prediction method using the BERT model. J. Cheminf. 2022;14(1):71. doi: 10.1186/s13321-022-00650-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hou C., Li Y. X., Wang M. Y., Wu H., Li T. T.. Systematic prediction of degrons and E3 ubiquitin ligase binding via deep learning. BMC Biol. 2022;20(1):162. doi: 10.1186/s12915-022-01364-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zheng J., Xiao X., Qiu W. R.. DTI-BERT: Identifying Drug-Target Interactions in Cellular Networking Based on BERT and Deep Learning Method. Front. Genet. 2022;13:859188. doi: 10.3389/fgene.2022.859188. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang L. J., Jin C., Yang G. H., Bing Z. T., Huang L., Niu Y. Z., Yang L.. Transformer-based deep learning method for optimizing ADMET properties of lead compounds. Phys. Chem. Chem. Phys. 2023;25(3):2377–2385. doi: 10.1039/D2CP05332B. [DOI] [PubMed] [Google Scholar]
- Khondkaryan L., Tevosyan A., Navasardyan H., Khachatrian H., Tadevosyan G., Apresyan L., Chilingaryan G., Navoyan Z., Stopper H., Babayan N.. Datasets Construction and Development of QSAR Models for Predicting Micronucleus In Vitro and In Vivo Assay Outcomes. Toxics. 2023;11(9):785. doi: 10.3390/toxics11090785. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li B. S., Lin M. J., Chen T. G., Wang L.. FG-BERT: a generalized and self-supervised functional group-based molecular representation learning framework for properties prediction. Briefings Bioinf. 2023;24(6):bbad398. doi: 10.1093/bib/bbad398. [DOI] [PubMed] [Google Scholar]
- Liu Y. W., Zhang R. S., Li T. F., Jiang J., Ma J., Wang P.. MolRoPE-BERT: An enhanced molecular representation with Rotary Position Embedding for molecular property prediction. J. Mol. Graphics Modell. 2023;118:108344. doi: 10.1016/j.jmgm.2022.108344. [DOI] [PubMed] [Google Scholar]
- Lee H., Lee S., Lee I., Nam H.. AMP-BERT: Prediction of antimicrobial peptide function based on a BERT model. Protein Sci. 2023;32(1):e4529. doi: 10.1002/pro.4529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Su A., Zhang X., Zhang C. W., Ding D. B., Yang Y. F., Wang K. K., She Y. B.. Deep transfer learning for predicting frontier orbital energies of organic materials using small data and its application to porphyrin photocatalysts. Phys. Chem. Chem. Phys. 2023;25(15):10536–10549. doi: 10.1039/D3CP00917C. [DOI] [PubMed] [Google Scholar]
- Aksamit N., Tchagang A., Li Y. F., Ombuki-Berman B.. Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery. BMC Bioinf. 2024;25(1):255. doi: 10.1186/s12859-024-05861-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Torabi M., Haririan I., Foroumadi A., Ghanbari H., Ghasemi F.. A deep learning model based on the BERT pre-trained model to predict the antiproliferative activity of anti-cancer chemical compounds. SAR QSAR Environ. Res. 2024;35(11):971–992. doi: 10.1080/1062936X.2024.2431486. [DOI] [PubMed] [Google Scholar]
- Kim S., Mollaei P., Antony A., Magar R., Farimani A. B.. GPCR-BERT: Interpreting Sequential Design of G Protein-Coupled Receptors Using Protein Language Models. J. Chem. Inf. Model. 2024;64(4):1134–1144. doi: 10.1021/acs.jcim.3c01706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao D. C., Zhang Y. H., Chen Y. H., Li B. S., Zhou W. G., Wang L.. Highly Accurate and Explainable Predictions of Small-Molecule Antioxidants for Eight In Vitro Assays Simultaneously through an Alternating Multitask Learning Strategy. J. Chem. Inf. Model. 2024;64(24):9098–9110. doi: 10.1021/acs.jcim.4c00748. [DOI] [PubMed] [Google Scholar]
- Sun X., Huang J. J., Fang Y. B., Jin Y. X., Wu J. G., Wang G. Q., Jia J. W.. MREDTA: A BERT and transformer-based molecular representation encoder for predicting drug-target binding affinity. FASEB J. 2024;38(19):e70083. doi: 10.1096/fj.202401254R. [DOI] [PubMed] [Google Scholar]
- Tan Z. C., Zhao Y. C., Lin K. S., Zhou T.. Multi-task pretrained language model with novel application domains enables more comprehensive health and ecological toxicity prediction. J. Hazard. Mater. 2024;477:135265. doi: 10.1016/j.jhazmat.2024.135265. [DOI] [PubMed] [Google Scholar]
- Wang Y., Zhao H. G., Sciabola S., Wang W. L.. cMolGPT: A Conditional Generative Pre-Trained Transformer for Target-Specific De Novo Molecular Generation. Molecules. 2023;28(11):4430. doi: 10.3390/molecules28114430. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu H., Wei Z. Q., Wang X. Z., Zhang K., Liu H.. GraphGPT: A Graph Enhanced Generative Pretrained Transformer for Conditioned Molecular Generation. Int. J. Mol. Sci. 2023;24(23):16761. doi: 10.3390/ijms242316761. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang X., Gao C., Han P., Li X., Chen W., Patón A. R., Wang S., Zheng P.. PETrans: De Novo Drug Design with Protein-Specific Encoding Based on Transfer Learning. Int. J. Mol. Sci. 2023;24(2):1146. doi: 10.3390/ijms24021146. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yue J., Peng B. X., Chen Y., Jin J. Y., Zhao X. D., Shen C., Ji X. Y., Hsieh C. Y., Song J. F., Hou T. J.. et al. Unlocking comprehensive molecular design across all scenarios with large language model and unordered chemical language. Chem. Sci. 2024;15(34):13727–13740. doi: 10.1039/D4SC03744H. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang H. Z., Shi X. G., Lei H. Y., Hu F., Cai Y. P.. ProtChat: An AI Multi-Agent for Automated Protein Analysis Leveraging GPT-4 and Protein Language Model. J. Chem. Inf. Model. 2025;65(1):62–70. doi: 10.1021/acs.jcim.4c01345. [DOI] [PubMed] [Google Scholar]
- Li J. T., Liu Y. Q., Fan W. Q., Wei X. Y., Liu H., Tang J. L., Li Q.. Empowering Molecule Discovery for Molecule-Caption Translation With Large Language Models: A ChatGPT Perspective. IEEE Trans. Knowl. Data Eng. 2024;36(11):6071–6083. doi: 10.1109/TKDE.2024.3393356. [DOI] [Google Scholar]
- Fan W. F., He Y., Zhu F.. RM-GPT: Enhance the comprehensive generative ability of molecular GPT model via LocalRNN and RealFormer. Artif. Intell. Med. 2024;150:102827. doi: 10.1016/j.artmed.2024.102827. [DOI] [PubMed] [Google Scholar]
- Yoo S., Kim J.. Adapt-cMolGPT: A Conditional Generative Pre-Trained Transformer with Adapter-Based Fine-Tuning for Target-Specific Molecular Generation. Int. J. Mol. Sci. 2024;25(12):6641. doi: 10.3390/ijms25126641. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ai C. W., Yang H. P., Liu X. Y., Dong R. H., Ding Y. J., Guo F.. MTMol-GPT: De novo multi-target molecular generation with transformer-based generative adversarial imitation learning. PLoS Comput. Biol. 2024;20(6):e1012229. doi: 10.1371/journal.pcbi.1012229. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao H. J., Song G.. Antiviral Peptide-Generative Pre-Trained Transformer (AVP-GPT): A Deep Learning-Powered Model for Antiviral Peptide Design with High-Throughput Discovery and Exceptional Potency. Viruses. 2024;16(11):1673. doi: 10.3390/v16111673. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang J. K., Luo H., Qin R., Wang M. Y., Wan X. Z., Fang M. J., Zhang O., Gou Q. L., Su Q., Shen C.. et al. 3DSMILES-GPT: 3D molecular pocket-based generation with token-only large language model. Chem. Sci. 2025;16(2):637–648. doi: 10.1039/D4SC06864E. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sakano K., Furui K., Ohue M.. NPGPT: natural product-like compound generation with GPT-based chemical language models. J. Supercomput. 2025;81(1):352. doi: 10.1007/s11227-024-06860-w. [DOI] [Google Scholar]
- Dobberstein N., Maass A., Hamaekers J.. Llamol: a dynamic multi-conditional generative transformer for de novo molecular design. J. Cheminf. 2024;16(1):73. doi: 10.1186/s13321-024-00863-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bhattacharya D., Cassady H. J., Hickner M. A., Reinhart W. F.. Large Language Models as Molecular Design Engines. J. Chem. Inf. Model. 2024;64(18):7086–7096. doi: 10.1021/acs.jcim.4c01396. [DOI] [PubMed] [Google Scholar]
- Wang Y. S., Guo M. Y., Chen X. M., Ai D. M.. Screening of multi deep learning-based de novo molecular generation models and their application for specific target molecular generation. Sci. Rep. 2025;15(1):4419. doi: 10.1038/s41598-025-86840-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Snyder S. H., Vignaux P. A., Ozalp M. K., Gerlach J., Puhl A. C., Lane T. R., Corbett J., Urbina F., Ekins S.. The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications. Commun. Chem. 2024;7(1):134. doi: 10.1038/s42004-024-01220-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang Z., Zhao L., Wang J., Wang C.. A Hierarchical Graph Neural Network Framework for Predicting Protein-Protein Interaction Modulators With Functional Group Information and Hypergraph Structure. IEEE J. Biomed. Health Inf. 2024;28(7):4295–4305. doi: 10.1109/JBHI.2024.3384238. [DOI] [PubMed] [Google Scholar]
- Soares E., Vital Brazil E., Shirasuna V., Zubarev D., Cerqueira R., Schmidt K.. An open-source family of large encoder-decoder foundation models for chemistry. Commun. Chem. 2025;8(1):193. doi: 10.1038/s42004-025-01585-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang Y., Vlachos D. G., Liu D., Fang H.. Rapid Adaptation of Chemical Named Entity Recognition Using Few-Shot Learning and LLM Distillation. J. Chem. Inf. Model. 2025;65(9):4334–4345. doi: 10.1021/acs.jcim.5c00248. [DOI] [PubMed] [Google Scholar]
- Atton Beckmann D., Werther M., Mackay E. B., Spyrakos E., Hunter P., Jones I. D.. Are more data always better? − Machine learning forecasting of algae based on long-term observations. J. Environ. Manage. 2025;373:123478. doi: 10.1016/j.jenvman.2024.123478. [DOI] [PubMed] [Google Scholar]
- Grisoni F.. Chemical language models for de novo drug design: Challenges and opportunities. Curr. Opin. Struct. Biol. 2023;79:102527. doi: 10.1016/j.sbi.2023.102527. [DOI] [PubMed] [Google Scholar]
- Özçelik R., de Ruiter S., Criscuolo E., Grisoni F.. Chemical language modeling with structured state space sequence models. Nat. Commun. 2024;15(1):6176. doi: 10.1038/s41467-024-50469-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Van Herck J., Gil M. V., Jablonka K. M., Abrudan A., Anker A. S., Asgari M., Blaiszik B., Buffo A., Choudhury L., Corminboeuf C.. et al. Assessment of fine-tuned large language models for real-world chemistry and material science applications. Chem. Sci. 2025;16(2):670–684. doi: 10.1039/D4SC04401K. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Flam-Shepherd D., Zhu K., Aspuru-Guzik A.. Language models can learn complex molecular distributions. Nat. Commun. 2022;13(1):3293. doi: 10.1038/s41467-022-30839-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xie L., Jin Y., Xu L., Chang S., Xu X.. Fusing Domain Knowledge with a Fine-Tuned Large Language Model for Enhanced Molecular Property Prediction. J. Chem. Theory Comput. 2025;21(14):6743–6758. doi: 10.1021/acs.jctc.5c00605. [DOI] [PubMed] [Google Scholar]
- Deng J., Yang Z., Wang H., Ojima I., Samaras D., Wang F.. A systematic study of key elements underlying molecular property prediction. Nat. Commun. 2023;14(1):6395. doi: 10.1038/s41467-023-41948-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Livne M., Miftahutdinov Z., Tutubalina E., Kuznetsov M., Polykovskiy D., Brundyn A., Jhunjhunwala A., Costa A., Aliper A., Aspuru-Guzik A.. et al. nach0: multimodal natural and chemical languages foundation model. Chem. Sci. 2024;15(22):8380–8389. doi: 10.1039/D4SC00966E. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li Z., Jiang M., Wang S., Zhang S.. Deep learning methods for molecular representation and property prediction. Drug Discovery Today. 2022;27(12):103373. doi: 10.1016/j.drudis.2022.103373. [DOI] [PubMed] [Google Scholar]
- Ramos M. C., Collison C. J., White A. D.. A review of large language models and autonomous agents in chemistry. Chem. Sci. 2025;16(6):2514–2572. doi: 10.1039/D4SC03921A. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Taju S. W., Shah S. M. A., Ou Y. Y.. ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations. Comput. Biol. Chem. 2021;93:107537. doi: 10.1016/j.compbiolchem.2021.107537. [DOI] [PubMed] [Google Scholar]
- Blanchard A. E., Gounley J., Bhowmik D., Shekar M. C., Lyngaas I., Gao S., Yin J., Tsaris A., Wang F., Glaser J.. Language models for the prediction of SARS-CoV-2 inhibitors. Int. J. High Perform. Comput. Appl. 2022;36(5–6):587–602. doi: 10.1177/10943420221121804. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pang Y. H., Liu B.. IDP-LM: Prediction of protein intrinsic disorder and disorder functions based on language models. PLoS Comput. Biol. 2023;19(11):e1011657. doi: 10.1371/journal.pcbi.1011657. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blaabjerg L. M., Jonsson N., Boomsma W., Stein A., Lindorff-Larsen K.. SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions. Nat. Commun. 2024;15(1):9646. doi: 10.1038/s41467-024-53982-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chu H. K., Liu T. G.. Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models. Int. J. Mol. Sci. 2024;25(8):4507. doi: 10.3390/ijms25084507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Das S., Ghosh S., Jana N. D.. TransConv: convolution-infused transformer for protein secondary structure prediction. J. Mol. Model. 2025;31(2):37. doi: 10.1007/s00894-024-06259-7. [DOI] [PubMed] [Google Scholar]
- Qingge L., Badal K., Annan R., Sturtz J., Liu X. W., Zhu B. H.. Generative AI Models for the Protein Scaffold Filling Problem. J. Comput. Biol. 2025;32(2):127–142. doi: 10.1089/cmb.2024.0510. [DOI] [PubMed] [Google Scholar]
- Chen W., Chen G. X., Zhao L., Chen C. Y. C.. Predicting Drug-Target Interactions with Deep-Embedding Learning of Graphs and Sequences. J. Phys. Chem. A. 2021;125(25):5633–5642. doi: 10.1021/acs.jpca.1c02419. [DOI] [PubMed] [Google Scholar]
- Wang J. J., Hu J., Sun H. T., Xu M. D., Yu Y., Liu Y., Cheng L.. MGPLI: exploring multigranular representations for protein-ligand interaction prediction. Bioinformatics. 2022;38(21):4859–4867. doi: 10.1093/bioinformatics/btac597. [DOI] [PubMed] [Google Scholar]
- Wen J. H., Gan H. T., Yang Z., Zhou R., Zhao J., Ye Z. W.. Mutual-DTI: A mutual interaction feature-based neural network for drug-target protein interaction prediction. Math. Biosci. Eng. 2023;20(6):10610–10625. doi: 10.3934/mbe.2023469. [DOI] [PubMed] [Google Scholar]
- Zhu Z. Q., Yao Z., Qi G. Q., Mazur N., Yang P., Cong B. S.. Associative learning mechanism for drug-target interaction prediction. CAAI Trans. Intell. Technol. 2023;8(4):1558–1577. doi: 10.1049/cit2.12194. [DOI] [Google Scholar]
- Huang Y. X., Huang H. Y., Chen Y. G., Lin Y. C. D., Yao L. T., Lin T. X., Leng J. L., Chang Y., Zhang Y. T., Zhu Z. H.. et al. A Robust Drug-Target Interaction Prediction Framework with Capsule Network and Transfer Learning. Int. J. Mol. Sci. 2023;24(18):14061. doi: 10.3390/ijms241814061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang Z. H., Liu J., Yang F., Zhang X. L., Zhang Q., Zhu X. K., Jiang P.. Advancing Drug-Target Interaction prediction with BERT and subsequence embedding. Comput. Biol. Chem. 2024;110:108058. doi: 10.1016/j.compbiolchem.2024.108058. [DOI] [PubMed] [Google Scholar]
- Liu S. Z., Liu Y. C., Xu H. F., Xia J., Li S. Z.. SP-DTI: subpocket-informed transformer for drug-target interaction prediction. Bioinformatics. 2025;41(3):btaf011. doi: 10.1093/bioinformatics/btaf011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Born J., Manica M.. Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. Nat. Mach. Intell. 2023;5(4):432–444. doi: 10.1038/s42256-023-00639-z. [DOI] [Google Scholar]
- Chen S. M., Zhong F. S.. GPCRSPACE: A New GPCR Real Expanded Library Based on Large Language Models Architecture and Positive Sample Machine Learning Strategies. J. Med. Chem. 2024;67(18):16912–16922. doi: 10.1021/acs.jmedchem.4c01983. [DOI] [PubMed] [Google Scholar]
- Wei J. H., Zhuo L., Fu X., Zeng X., Wang L., Zou Q., Cao D.. DrugReAlign: a multisource prompt framework for drug repurposing based on large language models. BMC Biology. 2024;22(1):266. doi: 10.1186/s12915-024-02028-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bernatavicius A., Sícho M., Janssen A. P. A., Hassen A. K., Preuss M., van Westen G. J. P.. AlphaFold Meets De Novo Drug Design: Leveraging Structural Protein Information in Multitarget Molecular Generative Models. J. Chem. Inf. Model. 2024;64(21):8113–8122. doi: 10.1021/acs.jcim.4c00309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu Y. G., Yu H. L., Duan X. Y., Zhang X. M., Cheng T., Jiang F., Tang H., Ruan Y., Zhang M., Zhang H. Y.. et al. TransGEM: a molecule generation model based on Transformer with gene expression data. Bioinformatics. 2024;40(5):btae189. doi: 10.1093/bioinformatics/btae189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Torres L. H. M., Arrais J. P., Ribeiro B.. Combining graph neural networks and transformers for few-shot nuclear receptor binding activity prediction. J. Cheminf. 2024;16(1):109. doi: 10.1186/s13321-024-00902-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mswahili M. E., Ndomba G. E., Jo K., Jeong Y. S.. Graph Neural Network and BERT Model for Antimalarial Drug Predictions Using Plasmodium Potential Targets. Appl. Sci. 2024;14(4):1472. doi: 10.3390/app14041472. [DOI] [Google Scholar]
- Hua Y., Feng Z. H., Song X. N., Wu X. J., Kittler J.. MMDG-DTI: Drug-target interaction prediction via multimodal feature fusion and domain generalization. Pattern Recognit. 2025;157:110887. doi: 10.1016/j.patcog.2024.110887. [DOI] [Google Scholar]
- Tysinger E. P., Rai B. K., Sinitskiy A. V.. Can We Quickly Learn to “Translate” Bioactive Molecules with Transformer Models? J. Chem. Inf. Model. 2023;63(6):1734–1744. doi: 10.1021/acs.jcim.2c01618. [DOI] [PubMed] [Google Scholar]
- Sellner M. S., Mahmoud A. H., Lill M. A.. Efficient virtual high-content screening using a distance-aware transformer model. J. Cheminf. 2023;15(1):18. doi: 10.1186/s13321-023-00686-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang M. J., Sun H. Y., Liu X., Xue X., Deng Y. F., Wang X. J.. CMGN: a conditional molecular generation net to design target-specific molecules with desired properties. Briefings Bioinf. 2023;24(4):bbad185. doi: 10.1093/bib/bbad185. [DOI] [PubMed] [Google Scholar]
- Tibo A., He J. Z., Janet J. P., Nittinger E., Engkvist O.. Exhaustive local chemical space exploration using a transformer model. Nat. Commun. 2024;15(1):7315. doi: 10.1038/s41467-024-51672-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang G. L., Zhang Y., Li L., Zhou J. Y., Chen H. L., Ji J. W., Li Y. R., Cao Y., Xu Z. H., Pian C.. Exploring Novel Fentanyl Analogues Using a Graph-Based Transformer Model. Interdiscip. Sci.: Comput. Life Sci. 2024;16(3):712–726. doi: 10.1007/s12539-024-00623-0. [DOI] [PubMed] [Google Scholar]
- Wang J. K., Hsieh C. Y., Wang M. Y., Wang X. R., Wu Z. X., Jiang D. J., Liao B. B., Zhang X. J., Yang B., He Q. J.. et al. Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nat. Mach. Intell. 2021;3(10):914–922. doi: 10.1038/s42256-021-00403-1. [DOI] [Google Scholar]
- Liu X. H., Ye K., van Vlijmen H. W. T., Ijzerman A., van Westen G. J. P.. DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. J. Cheminf. 2023;15(1):24. doi: 10.1186/s13321-023-00694-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ishida S., Sato T., Honma T., Terayama K.. Large language models open new way of AI-assisted molecule design for chemists. J. Cheminf. 2025;17(1):36. doi: 10.1186/s13321-025-00984-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nakamura S., Yasuo N., Sekijima M.. Molecular optimization using a conditional transformer for reaction-aware compound exploration with reinforcement learning. Commun. Chem. 2025;8(1):40. doi: 10.1038/s42004-025-01437-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Karpov P., Godin G., Tetko I. V.. Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J. Cheminf. 2020;12(1):17. doi: 10.1186/s13321-020-00423-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Irwin R., Dimitriadis S., He J. Z., Bjerrum E. J.. Chemformer: a pre-trained transformer for computational chemistry. Mach. Learn: Sci. Technol. 2022;3(1):015022. doi: 10.1088/2632-2153/ac3ffb. [DOI] [Google Scholar]
- Ross J., Belgodere B., Chenthamarakshan V., Padhi I., Mroueh Y., Das P.. Large-scale chemical language representations capture molecular structure and properties. Nat. Mach. Intell. 2022;4(12):1256–1264. doi: 10.1038/s42256-022-00580-7. [DOI] [Google Scholar]
- Piao S., Choi J., Seo S., Park S.. SELF-EdiT: Structure-constrained molecular optimization using SELFIES editing transformer. Appl. Intell. 2023;53(21):25868–25880. doi: 10.1007/s10489-023-04915-8. [DOI] [Google Scholar]
- Meewan I., Panmanee J., Petchyam N., Lertvilai P.. HBCVTr: an end-to-end transformer with a deep neural network hybrid model for anti-HBV and HCV activity predictor from SMILES. Sci. Rep. 2024;14(1):9262. doi: 10.1038/s41598-024-59933-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mswahili M. E., Hwang J., Rajapakse J. C., Jo K., Jeong Y. S.. Positional embeddings and zero-shot learning using BERT for molecular-property prediction. J. Cheminf. 2025;17(1):17. doi: 10.1186/s13321-025-00959-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Niu Z. M., Xiao X. L., Wu W. F., Cai Q. W., Jiang Y. H., Jin W. Z., Wang M. H., Yang G. J., Kong L. K., Jin X. R.. et al. PharmaBench: Enhancing ADMET benchmarks with large language models. Sci. Data. 2024;11(1):985. doi: 10.1038/s41597-024-03793-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shao C. H., Shao F. J., Huang S., Sun R. C., Zhang T.. An Evolved Transformer Model for ADME/Tox Prediction. Electronics. 2024;13(3):624. doi: 10.3390/electronics13030624. [DOI] [Google Scholar]
- Aksamit N., Hou J. Q., Li Y. F., Ombuki-Berman B.. Integrating transformers and many-objective optimization for drug design. BMC Bioinf. 2024;25(1):208. doi: 10.1186/s12859-024-05822-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Khambhawala A., Lee C. H., Pahari S., Kwon J. S. I.. Minimizing late-stage failure in drug development with transformer models: Enhancing drug screening and pharmacokinetic predictions. Chem. Eng. J. 2025;508:160423. doi: 10.1016/j.cej.2025.160423. [DOI] [Google Scholar]
- Wu K. H., Xia Y. C., Deng P., Liu R. H., Zhang Y., Guo H., Cui Y. M., Pei Q. Z., Wu L. J., Xie S. F.. et al. TamGen: drug design with target-aware molecule generation through a chemical language model. Nat. Commun. 2024;15(1):9360. doi: 10.1038/s41467-024-53632-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Norinder U.. Traditional Machine and Deep Learning for Predicting Toxicity Endpoints. Molecules. 2023;28(1):217. doi: 10.3390/molecules28010217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Feng L. J., Zhao W. Y., Wang J. F., Lin K. Y., Guo Y. A., Zhang L. Y.. Data-Driven Technology Roadmaps to Identify Potential Technology Opportunities for Hyperuricemia Drugs. Pharmaceuticals. 2022;15(11):1357. doi: 10.3390/ph15111357. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gu Y. W., Xu Z. D., Yang C. R.. Empowering Graph Neural Network-Based Computational Drug Repositioning with Large Language Model-Inferred Knowledge Representation. Interdiscip. Sci.: Comput. Life Sci. 2025;17(3):698–715. doi: 10.1007/s12539-024-00654-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sun C., Yang Z. H., Su L. L., Wang L., Zhang Y., Lin H. F., Wang J.. Chemical-protein interaction extraction via Gaussian probability distribution and external biomedical knowledge. Bioinformatics. 2020;36(15):4323–4330. doi: 10.1093/bioinformatics/btaa491. [DOI] [PubMed] [Google Scholar]
- Sun C., Yang Z., Wang L., Zhang Y., Lin H., Wang J.. Deep learning with language models improves named entity recognition for PharmaCoNER. BMC Bioinf. 2021;22(Suppl 1):602. doi: 10.1186/s12859-021-04260-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee Y., Son J., Song M.. BertSRC: transformer-based semantic relation classification. BMC Med. Inf. Decis. Making. 2022;22(1):234. doi: 10.1186/s12911-022-01977-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Aldahdooh J., Vähä-Koskela M., Tang J., Tanoli Z.. Using BERT to identify drug-target interactions from whole PubMed. BMC Bioinf. 2022;23(1):245. doi: 10.1186/s12859-022-04768-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mann V., Viswanath S., Vaidyaraman S., Balakrishnan J., Venkatasubramanian V. S.. Pharmaceutical CMC ontology-based information extraction for drug machine. Comput. Chem. Eng. 2023;179:108446. doi: 10.1016/j.compchemeng.2023.108446. [DOI] [Google Scholar]
- Kang H. Y., Hou L., Gu Y. W., Lu X., Li J., Li Q.. Drug-disease association prediction with literature based multi-feature fusion. Front. Pharmacol. 2023;14:1205144. doi: 10.3389/fphar.2023.1205144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vangala S. R., Krishnan S. R., Bung N., Nandagopal D., Ramasamy G., Kumar S., Sankaran S., Srinivasan R., Roy A.. Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature. J. Cheminf. 2024;16(1):131. doi: 10.1186/s13321-024-00928-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Westerlund A. M., Koki S. M., Kancharla S., Tibo A., Saigiridharan L., Kabeshov M., Mercado R., Genheden S.. Do Chemformers Dream of Organic Matter? Evaluating a Transformer Model for Multistep Retrosynthesis. J. Chem. Inf. Model. 2024;64(8):3021–3033. doi: 10.1021/acs.jcim.3c01685. [DOI] [PubMed] [Google Scholar]
- Zhang R., Hristovski D., Schutte D., Kastrin A., Fiszman M., Kilicoglu H.. Drug repurposing for COVID-19 via knowledge graph completion. J. Biomed. Inf. 2021;115:103696. doi: 10.1016/j.jbi.2021.103696. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Millikin R. J., Raja K., Steill J., Lock C., Tu X. C., Ross I., Tsoi L. C., Kuusisto F., Ni Z. J., Livny M.. et al. Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models. BMC Bioinf. 2023;24(1):412. doi: 10.1186/s12859-023-05539-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Miranda-Escalada A., Mehryary F., Luoma J., Estrada-Zavala D., Gasco L., Pyysalo S., Valencia A., Krallinger M.. Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations. Database. 2023;2023:baad080. doi: 10.1093/database/baad080. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xiao Y. K., Zhang S. N., Zhou H. X., Li M. C., Yang H., Zhang R. F.. Leveraging LLMs’ pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs. J. Biomed. Inf. 2024;158:104730. doi: 10.1016/j.jbi.2024.104730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nian Y., Hu X., Zhang R., Feng J., Du J., Li F., Bu L., Zhang Y., Chen Y., Tao C.. Mining on Alzheimer’s diseases related knowledge graph to identity potential AD-related semantic triples for drug repurposing. BMC Bioinf. 2022;23(Suppl 6):407. doi: 10.1186/s12859-022-04934-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang D., Pourmirzaei M., Abbas U. L., Zeng S., Manshour N., Esmaili F., Poudel B., Jiang Y., Shao Q., Chen J.. et al. S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure. Adv. Sci. 2025;12(5):e2404212. doi: 10.1002/advs.202404212. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Heinzinger M., Rost B.. Teaching AI to speak protein. Curr. Opin. Struct. Biol. 2025;91:102986. doi: 10.1016/j.sbi.2025.102986. [DOI] [PubMed] [Google Scholar]
- Erckert K., Rost B.. Assessing the role of evolutionary information for enhancing protein language model embeddings. Sci. Rep. 2024;14(1):20692. doi: 10.1038/s41598-024-71783-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Singh J., Paliwal K., Litfin T., Singh J., Zhou Y.. Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment. Sci. Rep. 2022;12(1):7607. doi: 10.1038/s41598-022-11684-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pantolini L., Studer G., Pereira J., Durairaj J., Tauriello G., Schwede T.. Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone. Bioinformatics. 2024;40(1):btad786. doi: 10.1093/bioinformatics/btad786. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sargsyan K., Lim C.. Using protein language models for protein interaction hot spot prediction with limited data. BMC Bioinf. 2024;25(1):115. doi: 10.1186/s12859-024-05737-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pratyush P., Bahmani S., Pokharel S., Ismail H. D., Kc D. B.. LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model. Bioinformatics. 2024;40(5):btae290. doi: 10.1093/bioinformatics/btae290. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mall R., Kaushik R., Martinez Z. A., Thomson M. W., Castiglione F.. Benchmarking protein language models for protein crystallization. Sci. Rep. 2025;15(1):2381. doi: 10.1038/s41598-025-86519-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peng F. Z., Wang C., Chen T., Schussheim B., Vincoff S., Chatterjee P.. PTM-Mamba: a PTM-aware protein language model with bidirectional gated Mamba blocks. Nat. Methods. 2025;22(5):945–949. doi: 10.1038/s41592-025-02656-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee J. S., Abdin O., Kim P. M. Language models for protein design. Curr. Opin. Struct. Biol. 2025;92:103027. doi: 10.1016/j.sbi.2025.103027.
- Valentini G., Malchiodi D., Gliozzo J., Mesiti M., Soto-Gomez M., Cabri A., Reese J., Casiraghi E., Robinson P. N. The promises of large language models for protein design and modeling. Front. Bioinform. 2023;3:1304099. doi: 10.3389/fbinf.2023.1304099.
- Liu K., Yu X., Cui H., Li W., Han W. GPT4Kinase: High-accuracy prediction of inhibitor-kinase binding affinity utilizing large language model. Int. J. Biol. Macromol. 2024;282(Pt 5):137069. doi: 10.1016/j.ijbiomac.2024.137069.
- Jin Z., Wu T., Chen T., Pan D., Wang X., Xie J., Quan L., Lyu Q. CAPLA: improved prediction of protein-ligand binding affinity by a deep learning approach based on a cross-attention mechanism. Bioinformatics. 2023;39(2):btad049. doi: 10.1093/bioinformatics/btad049.
- Creanza T. M., Alberga D., Patruno C., Mangiatordi G. F., Ancona N. Transformer Decoder Learns from a Pretrained Protein Language Model to Generate Ligands with High Affinity. J. Chem. Inf. Model. 2025;65(3):1258–1277. doi: 10.1021/acs.jcim.4c02019.
- Pal S., Pal A., Mohanty D. SG-ML-PLAP: A structure-guided machine learning-based scoring function for protein-ligand binding affinity prediction. Protein Sci. 2025;34(1):e5257. doi: 10.1002/pro.5257.
- Li Z., Ren P., Yang H., Zheng J., Bai F. TEFDTA: a transformer encoder and fingerprint representation combined prediction method for bonded and non-bonded drug-target affinities. Bioinformatics. 2024;40(1):btad778. doi: 10.1093/bioinformatics/btad778.
- Dhakal A., McKay C., Tanner J. J., Cheng J. Artificial intelligence in the prediction of protein-ligand interactions: recent advances and future directions. Briefings Bioinf. 2022;23(1):bbab476. doi: 10.1093/bib/bbab476.
- Brahma R., Moon S., Shin J. M., Cho K. H. AiGPro: a multi-tasks model for profiling of GPCRs for agonist and antagonist. J. Cheminf. 2025;17(1):12. doi: 10.1186/s13321-024-00945-7.
- Song T., Zhang X., Ding M., Rodriguez-Paton A., Wang S., Wang G. DeepFusion: A deep learning based multi-scale feature fusion method for predicting drug-target interactions. Methods. 2022;204:269–277. doi: 10.1016/j.ymeth.2022.02.007.
- Zhang C., Zang T., Zhao T. KGE-UNIT: toward the unification of molecular interactions prediction based on knowledge graph and multi-task learning on drug discovery. Briefings Bioinf. 2024;25(2):bbae043. doi: 10.1093/bib/bbae043.
- Chen H., Bajorath J. Designing highly potent compounds using a chemical language model. Sci. Rep. 2023;13(1):7412. doi: 10.1038/s41598-023-34683-x.
- Ang D., Rakovski C., Atamian H. S. De Novo Drug Design Using Transformer-Based Machine Translation and Reinforcement Learning of an Adaptive Monte Carlo Tree Search. Pharmaceuticals. 2024;17(2):161. doi: 10.3390/ph17020161.
- Sheikholeslami M., Mazrouei N., Gheisari Y., Fasihi A., Irajpour M., Motahharynia A. DrugGen enhances drug discovery with large language models and reinforcement learning. Sci. Rep. 2025;15(1):13445. doi: 10.1038/s41598-025-98629-1.
- Loeffler H. H., He J., Tibo A., Janet J. P., Voronov A., Mervin L. H., Engkvist O. Reinvent 4: Modern AI-driven generative molecule design. J. Cheminf. 2024;16(1):20. doi: 10.1186/s13321-024-00812-5.
- Ozcelik R., Grisoni F. A hitchhiker’s guide to deep chemical language processing for bioactivity prediction. Digital Discovery. 2025;4(2):316–325. doi: 10.1039/D4DD00311J.
- Bagal V., Aggarwal R., Vinod P. K., Priyakumar U. D. MolGPT: Molecular Generation Using a Transformer-Decoder Model. J. Chem. Inf. Model. 2022;62(9):2064–2076. doi: 10.1021/acs.jcim.1c00600.
- Ucak U. V., Ashyrmamatov I., Lee J. Reconstruction of lossless molecular representations from fingerprints. J. Cheminf. 2023;15(1):26. doi: 10.1186/s13321-023-00693-0.
- Kim H., Lee J., Ahn S., Lee J. R. A merged molecular representation learning for molecular properties prediction with a web-based service. Sci. Rep. 2021;11(1):11028. doi: 10.1038/s41598-021-90259-7.
- Alberga D., Lamanna G., Graziano G., Delre P., Lomuscio M. C., Corriero N., Ligresti A., Siliqi D., Saviano M., Contino M., et al. DeLA-DrugSelf: Empowering multi-objective de novo design through SELFIES molecular representation. Comput. Biol. Med. 2024;175:108486. doi: 10.1016/j.compbiomed.2024.108486.
- Lo A., Pollice R., Nigam A., White A. D., Krenn M., Aspuru-Guzik A. Recent advances in the self-referencing embedded strings (SELFIES) library. Digital Discovery. 2023;2(4):897–908. doi: 10.1039/D3DD00044C.
- Sadeghi S., Bui A., Forooghi A., Lu J., Ngom A. Can large language models understand molecules? BMC Bioinf. 2024;25(1):225. doi: 10.1186/s12859-024-05847-x.
- Bung N., Krishnan S. R., Roy A. An In Silico Explainable Multiparameter Optimization Approach for De Novo Drug Design against Proteins from the Central Nervous System. J. Chem. Inf. Model. 2022;62(11):2685–2695. doi: 10.1021/acs.jcim.2c00462.
- Sarumi O. A., Heider D. Large language models and their applications in bioinformatics. Comput. Struct. Biotechnol. J. 2024;23:3498–3505. doi: 10.1016/j.csbj.2024.09.031.
- Bajorath J. Chemical language models for molecular design. Mol. Inf. 2024;43(1):e202300288. doi: 10.1002/minf.202300288.
- Chen H. W., Yoshimori A., Bajorath J. Extension of multi-site analogue series with potent compounds using a bidirectional transformer-based chemical language model. RSC Med. Chem. 2024;15(7):2527–2537. doi: 10.1039/D4MD00423J.
- Setiya A., Jani V., Sonavane U., Joshi R. MolToxPred: small molecule toxicity prediction using machine learning approach. RSC Adv. 2024;14(6):4201–4220. doi: 10.1039/D3RA07322J.
- Monem S., Abdel-Hamid A. H., Hassanien A. E. Drug toxicity prediction model based on enhanced graph neural network. Comput. Biol. Med. 2025;185:109614. doi: 10.1016/j.compbiomed.2024.109614.
- Jeong J., Choi J. Artificial Intelligence-Based Toxicity Prediction of Environmental Chemicals: Future Directions for Chemical Management Applications. Environ. Sci. Technol. 2022;56(12):7532–7543. doi: 10.1021/acs.est.1c07413.
- Lin J., He Y., Ru C., Long W., Li M., Wen Z. Advancing Adverse Drug Reaction Prediction with Deep Chemical Language Model for Drug Safety Evaluation. Int. J. Mol. Sci. 2024;25(8):4516. doi: 10.3390/ijms25084516.
- Hu K., He Y., Wei J., Sun C., Geng J., Wei L., Su R. BFGTP: A BERT-Guided Two-Stage Molecular Representation Learning Framework for Toxicity Prediction. IEEE J. Biomed. Health Inf. 2025;29(10):6960–6970. doi: 10.1109/jbhi.2025.3556766.
- Wu L., Fang H., Qu Y., Xu J., Tong W. Leveraging FDA Labeling Documents and Large Language Model to Enhance Annotation, Profiling, and Classification of Drug Adverse Events with AskFDALabel. Drug Saf. 2025;48(6):655–665. doi: 10.1007/s40264-025-01520-1.
- Fong A., Adams K. T., Boxley C., Revoir J. A., Krevat S., Ratwani R. M. Does one size fit all? Developing an evaluation strategy to assess large language models for patient safety event report analysis. JAMIA Open. 2024;7(4):ooae128. doi: 10.1093/jamiaopen/ooae128.
- Bagherzadeh P., Sultanem K., Batist G., Abbasinejad Enger S. An automatic pipeline for temporal monitoring of radiotherapy-induced toxicities in head and neck cancer patients. NPJ Precis. Oncol. 2025;9(1):40. doi: 10.1038/s41698-025-00824-w.
- Weber L., Sanger M., Garda S., Barth F., Alt C., Leser U. Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models. Database. 2022;2022:baac098. doi: 10.1093/database/baac098.
- Connor S., Wu L., Roberts R. A., Tong W. Is ChatGPT Ready for Public Use in Organ-Specific Drug Toxicity Research? Drug Discovery Today. 2025;30(2):104297. doi: 10.1016/j.drudis.2025.104297.
- Mazuz E., Shtar G., Kutsky N., Rokach L., Shapira B. Pretrained transformer models for predicting the withdrawal of drugs from the market. Bioinformatics. 2023;39(8):btad519. doi: 10.1093/bioinformatics/btad519.
- Yazdani A., Bornet A., Khlebnikov P., Zhang B., Rouhizadeh H., Amini P., Teodoro D. An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results. Sci. Data. 2025;12(1):424. doi: 10.1038/s41597-025-04718-1.
- Koon Y. L., Lam Y. T., Tan H. X., Teo D. H. C., Neo J. W., Yap A. J. Y., Ang P. S., Loke C. P. W., Tham M. Y., Tan S. H., et al. Effectiveness of Transformer-Based Large Language Models in Identifying Adverse Drug Reaction Relations from Unstructured Discharge Summaries in Singapore. Drug Saf. 2025;48(6):667–677. doi: 10.1007/s40264-025-01525-w.
- Zhang W., Wang Q., Kong X., Xiong J., Ni S., Cao D., Niu B., Chen M., Li Y., Zhang R., et al. Fine-tuning large language models for chemical text mining. Chem. Sci. 2024;15(27):10600–10611. doi: 10.1039/D4SC00924J.
- Luo L., Lai P. T., Wei C. H., Arighi C. N., Lu Z. BioRED: a rich biomedical relation extraction dataset. Briefings Bioinf. 2022;23(5):bbac282. doi: 10.1093/bib/bbac282.
- Tirunagari S., Saha S., Venkatesan A., Suveges D., Carmona M., Buniello A., Ochoa D., McEntyre J., McDonagh E., Harrison M. Lit-OTAR framework for extracting biological evidences from literature. Bioinformatics. 2025;41(4):btaf113. doi: 10.1093/bioinformatics/btaf113.
- Kimura E., Kawakami Y., Inoue S., Okajima A. Mapping Drug Terms via Integration of a Retrieval-Augmented Generation Algorithm with a Large Language Model. Healthcare Inf. Res. 2024;30(4):355–363. doi: 10.4258/hir.2024.30.4.355.
- Huang L., Lin J., Li X., Song L., Zheng Z., Wong K. C. EGFI: drug-drug interaction extraction and generation with fusion of enriched entity and sentence information. Briefings Bioinf. 2022;23(1):bbab451. doi: 10.1093/bib/bbab451.
- Luo H., Yang H., Zhang G., Wang J., Luo J., Yan C. KGRDR: a deep learning model based on knowledge graph and graph regularized integration for drug repositioning. Front. Pharmacol. 2025;16:1525029. doi: 10.3389/fphar.2025.1525029.
- Zhao H., Li H., Liu Q., Dong G., Hou C., Li Y., Zhao Y. Using TransR to enhance drug repurposing knowledge graph for COVID-19 and its complications. Methods. 2024;221:82–90. doi: 10.1016/j.ymeth.2023.12.001.
- Remy F., Demuynck K., Demeester T. BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights. J. Am. Med. Inf. Assoc. 2024;31(9):1844–1855. doi: 10.1093/jamia/ocae029.
- Chang M., Ahn J., Kang B. G., Yoon S. Cross-modal embedding integrator for disease-gene/protein association prediction using a multi-head attention mechanism. Pharmacol. Res. Perspect. 2024;12(6):e70034. doi: 10.1002/prp2.70034.
- Feng Y., Zhou L., Ma C., Zheng Y., He R., Li Y. Knowledge graph-based thought: a knowledge graph-enhanced LLM framework for pan-cancer question answering. Gigascience. 2025;14:giae082. doi: 10.1093/gigascience/giae082.
- Ren F., Aliper A., Chen J., Zhao H., Rao S., Kuppe C., Ozerov I. V., Zhang M., Witte K., Kruse C., et al. A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models. Nat. Biotechnol. 2025;43(1):63–75. doi: 10.1038/s41587-024-02143-0.
- Li T., Shetty S., Kamath A., Jaiswal A., Jiang X., Ding Y., Kim Y. CancerGPT for few shot drug pair synergy prediction using large pretrained language models. NPJ Digital Med. 2024;7(1):40. doi: 10.1038/s41746-024-01024-9.
- Yang J. M., Walker K. C., Bekar-Cesaretli A. A., Hao B. R., Bhadelia N., Joseph-McCarthy D., Paschalidis I. C. Automating biomedical literature review for rapid drug discovery: Leveraging GPT-4 to expedite pandemic response. Int. J. Med. Inf. 2024;189:105500. doi: 10.1016/j.ijmedinf.2024.105500.
- Wei Z. Q., Chen X., Sun Y. S., Zhang Y. F., Dong R. F., Wang X. J., Chen S. T. Exploring the molecular mechanisms and shared potential drugs between rheumatoid arthritis and arthrofibrosis based on large language model and synovial microenvironment analysis. Sci. Rep. 2024;14(1):18939. doi: 10.1038/s41598-024-69080-5.
- Wiwanitkit S., Wiwanitkit V. Artificial Intelligence in the repurposing of potential herbs for filariasis therapy. J. Vector Borne Dis. 2024;61(2):289–294. doi: 10.4103/jvbd.jvbd_153_23.
- Fan Q. X., He Y. X., Liu J. L., Liu Q. L., Wu Y., Chen Y. X., Dou Q. Y., Shi J., Kong Q. Q., Ou Y. S., et al. Large Language Model-Assisted Genotoxic Metal-Phenolic Nanoplatform for Osteosarcoma Therapy. Small. 2025;21(5):e2403044. doi: 10.1002/smll.202403044.
- Picard M., Leclercq M., Bodein A., Scott-Boyer M. P., Perin O., Droit A. Improving drug repositioning with negative data labeling using large language models. J. Cheminf. 2025;17(1):16. doi: 10.1186/s13321-025-00962-0.
- Zhang M., Zhang C. M., Liu K. Y., Yang X. B., Liu X. J., Ge F. BRAFPred: A Novel Approach for Accurate Prediction of the B-Type Rapidly Accelerated Fibrosarcoma Inhibitor. ACS Omega. 2025;10(12):12170–12184. doi: 10.1021/acsomega.4c10367.
- Sadad T., Aurangzeb R. A., Safran M., Imran, Alfarhood S., Kim J. Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models. Biomedicines. 2023;11(5):1323. doi: 10.3390/biomedicines11051323.
- Peng J. J., Zhang Y. Y., Li R. F., Zhu W. J., Liu H. R., Li H. Y., Liu B., Cao D. S., Peng J., Luo X. J. Hybrid approach for drug-target interaction predictions in ischemic stroke models. Artif. Intell. Med. 2025;161:103067. doi: 10.1016/j.artmed.2025.103067.
- Alkhoury N., Shaik M., Wurmus R., Akalin A. Enhancing biomarker based oncology trial matching using large language models. NPJ Digital Med. 2025;8(1):250. doi: 10.1038/s41746-025-01673-4.
- Jin Q., Wang Z., Floudas C. S., Chen F., Gong C., Bracken-Clarke D., Xue E., Yang Y., Sun J., Lu Z. Matching patients to clinical trials with large language models. Nat. Commun. 2024;15(1):9074. doi: 10.1038/s41467-024-53081-z.
- Lammert J., Dreyer T., Mathes S., Kuligin L., Borm K. J., Schatz U. A., Kiechle M., Lorsch A. M., Jung J., Lange S., et al. Expert-Guided Large Language Models for Clinical Decision Support in Precision Oncology. JCO Precis. Oncol. 2024;8:e2400478. doi: 10.1200/PO-24-00478.
- Jayatunga M. K. P., Ayers M., Bruens L., Jayanth D., Meier C. How successful are AI-discovered drugs in clinical trials? A first analysis and emerging lessons. Drug Discovery Today. 2024;29(6):104009. doi: 10.1016/j.drudis.2024.104009.
- Jayatunga M. K. P., Xie W., Ruder L., Schulze U., Meier C. AI in small-molecule drug discovery: a coming wave? Nat. Rev. Drug Discovery. 2022;21(3):175–176. doi: 10.1038/d41573-022-00025-1.
- Xu Z., Ren F., Wang P., Cao J., Tan C., Ma D., Zhao L., Dai J., Ding Y., Fang H., et al. A generative AI-discovered TNIK inhibitor for idiopathic pulmonary fibrosis: a randomized phase 2a trial. Nat. Med. 2025;31(8):2602–2610. doi: 10.1038/s41591-025-03743-2.
- Xie T., Zhang H., Yang Q., Sun J., Wang Y., Long J., Zhang Z., Lu H. CSU-MS2: A Contrastive Learning Framework for Cross-Modal Compound Identification from MS/MS Spectra to Molecular Structures. Anal. Chem. 2025;97(25):13350–13360. doi: 10.1021/acs.analchem.5c01594.
- Yu X., Liu T., Kong L., Lan T., Sun Q., Qu F., Liu M., Chen J., Huang M. SpecRecFormer: Deep Learning-Driven Adaptive Component Identification of PAH Mixtures Based on Single-Component Raman Spectra. Anal. Chem. 2025;97(18):9876–9883. doi: 10.1021/acs.analchem.5c00461.
- Bui-Thi D., Liu Y., Lippens J. L., Laukens K., De Vijlder T. TransExION: a transformer based explainable similarity metric for comparing IONS in tandem mass spectrometry. J. Cheminf. 2024;16(1):61. doi: 10.1186/s13321-024-00858-5.
- Duponchel L., Rocha de Oliveira R., Motto-Ros V. Large Language Models (such as ChatGPT) as Tools for Machine Learning-Based Data Insights in Analytical Chemistry. Anal. Chem. 2025;97(13):6956–6961. doi: 10.1021/acs.analchem.4c05046.
- Chen X., Tang H. Designing a large language model for chemists. Patterns. 2025;6(5):101264. doi: 10.1016/j.patter.2025.101264.
- Zhao Z., Ma D., Chen L., Sun L., Li Z., Xia Y., Chen B., Xu H., Zhu Z., Zhu S., et al. Developing ChemDFM as a large language foundation model for chemistry. Cell Rep. Phys. Sci. 2025;6(4):102523. doi: 10.1016/j.xcrp.2025.102523.
- Song T., Luo M., Zhang X., Chen L., Huang Y., Cao J., Zhu Q., Liu D., Zhang B., Zou G., et al. A Multiagent-Driven Robotic AI Chemist Enabling Autonomous Chemical Research On Demand. J. Am. Chem. Soc. 2025;147(15):12534–12545. doi: 10.1021/jacs.4c17738.
- Boiko D. A., MacKnight R., Kline B., Gomes G. Autonomous chemical research with large language models. Nature. 2023;624(7992):570–578. doi: 10.1038/s41586-023-06792-0.
- Ruan Y., Lu C., Xu N., He Y., Chen Y., Zhang J., Xuan J., Pan J., Fang Q., Gao H., et al. An automatic end-to-end chemical synthesis development platform powered by large language models. Nat. Commun. 2024;15(1):10160. doi: 10.1038/s41467-024-54457-x.
- Esaki T., Yonezawa T., Ikeda K. A new workflow for the effective curation of membrane permeability data from open ADME information. J. Cheminf. 2024;16(1):30. doi: 10.1186/s13321-024-00826-z.
- Zhang J., Fang Y., Shao X., Chen H., Zhang N., Fan X. The Future of Molecular Studies through the Lens of Large Language Models. J. Chem. Inf. Model. 2024;64(3):563–566. doi: 10.1021/acs.jcim.3c01977.
- Liu Y., Zhu Y., Wang J., Hu R., Shen C., Qu W., Wang G., Su Q., Zhu Y., Kang Y., et al. A Multi-Objective Molecular Generation Method Based on Pareto Algorithm and Monte Carlo Tree Search. Adv. Sci. 2025;12:e2410640. doi: 10.1002/advs.202410640.
- Song H., Kim M., Park D., Shin Y., Lee J. G. Learning From Noisy Labels With Deep Neural Networks: A Survey. IEEE Trans. Neural Networks Learn. Syst. 2023;34(11):8135–8153. doi: 10.1109/TNNLS.2022.3152527.
- Kolobkov D., Mishra Sharma S., Medvedev A., Lebedev M., Kosaretskiy E., Vakhitov R. Efficacy of federated learning on genomic data: a study on the UK Biobank and the 1000 Genomes Project. Front. Big Data. 2024;7:1266031. doi: 10.3389/fdata.2024.1266031.
- Ong J. C. L., Chang S. Y.-H., William W., Butte A. J., Shah N. H., Chew L. S. T., Liu N., Doshi-Velez F., Lu W., Savulescu J., et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digital Health. 2024;6(6):e428–e432. doi: 10.1016/S2589-7500(24)00061-X.
- Goyon A., Masui C., Sirois L. E., Han C., Yehl P., Gosselin F., Zhang K. Achiral–Chiral Two-Dimensional Liquid Chromatography Platform to Support Automated High-Throughput Experimentation in the Field of Drug Development. Anal. Chem. 2020;92(22):15187–15193. doi: 10.1021/acs.analchem.0c03754.
- Shi L., Liu S., Li X., Huang X., Luo H., Bai Q., Li Z., Wang L., Du X., Jiang C., et al. Droplet microarray platforms for high-throughput drug screening. Microchim. Acta. 2023;190(7):260. doi: 10.1007/s00604-023-05833-9.
- Asgari E., Montaña-Brown N., Dubois M., Khalil S., Balloch J., Yeung J. A., Pimenta D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digital Med. 2025;8(1):274. doi: 10.1038/s41746-025-01670-7.
- Xiang W., Xiong Z., Chen H., Xiong J., Zhang W., Fu Z., Zheng M., Liu B., Shi Q. FAPM: functional annotation of proteins using multimodal models beyond structural modeling. Bioinformatics. 2024;40(12):btae680. doi: 10.1093/bioinformatics/btae680.
- Hu F., Zhang W., Huang H., Li W., Li Y., Yin P. A Transferability-Based Method for Evaluating the Protein Representation Learning. IEEE J. Biomed. Health Inf. 2024;28(5):3158–3166. doi: 10.1109/JBHI.2024.3370680.
- Rollins Z. A., Widatalla T., Waight A., Cheng A. C., Metwally E. AbLEF: antibody language ensemble fusion for thermodynamically empowered property predictions. Bioinformatics. 2024;40(5):btae268. doi: 10.1093/bioinformatics/btae268.
- Zhang Y., Ren S., Wang J., Lu J., Wu C., He M., Liu X., Wu R., Zhao J., Zhan C., et al. Aligning Large Language Models with Humans: A Comprehensive Survey of ChatGPT’s Aptitude in Pharmacology. Drugs. 2025;85(2):231–254. doi: 10.1007/s40265-024-02124-2.
- Zhang X., Zeng F., Gu C. Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation. Neural Networks. 2025;184:107059. doi: 10.1016/j.neunet.2024.107059.
- Venhorst J., Kalkman G. Drug target assessments: classifying target modulation and associated health effects using multi-level BERT-based classification models. Bioinf. Adv. 2025;5(1):vbaf043. doi: 10.1093/bioadv/vbaf043.
- Michalowski M., Topaz M., Peltonen L. M. An AI-Enabled Nursing Future With no Documentation Burden: A Vision for a New Reality. J. Adv. Nurs. 2025:16911. doi: 10.1111/jan.16911.