Briefings in Bioinformatics. 2025 Jan 3;26(1):bbae693. doi: 10.1093/bib/bbae693

DrugAssist: a large language model for molecule optimization

Geyan Ye 1,#, Xibao Cai 2,#, Houtim Lai 3, Xing Wang 4, Junhong Huang 5, Longyue Wang 6, Wei Liu 7, Xiangxiang Zeng 8
PMCID: PMC11697106  PMID: 39751647

Abstract

Recently, the impressive performance of large language models (LLMs) on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. These non-interactive approaches overlook the fact that the drug discovery process is actually one that requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model which performs optimization through human–machine dialogue by leveraging LLM’s strong interactivity and generalizability. DrugAssist has achieved leading results in both single and multiple property optimization, simultaneously showcasing immense potential in transferability and iterative optimization. In addition, we publicly release a large instruction-based dataset called ‘MolOpt-Instructions’ for fine-tuning language models on molecule optimization tasks. We have made our code and data publicly available at https://github.com/blazerye/DrugAssist, which we hope will pave the way for future research in LLMs’ application for drug discovery.

Keywords: large language model, molecule optimization, drug discovery

Introduction

Recently, generative artificial intelligence has made remarkable strides in the field of natural language processing (NLP), particularly with the advent of Large Language Models (LLMs) such as GPT (Generative Pre-trained Transformer) [1]. These models have demonstrated impressive capabilities in a wide range of tasks, extending far beyond everyday communication and question-answering scenarios. Researchers have increasingly recognized the potential of these models in addressing complex and diverse problems across various domains, prompting interest in their application within professional fields.

In recent years, there has been an increasing number of attempts to apply conversational LLMs in the field of drug discovery [2–7]. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Existing approaches can be broadly categorized into two main types. The first type represents molecules as sequences, commonly SMILES (Simplified Molecular Input Line Entry System) strings, and generates an optimized molecular sequence by learning from the input data. The second type represents molecules as graphs and formulates molecule optimization as a graph-to-graph translation problem [8]. One of the major issues with these approaches is the lack of interactivity. They focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of invaluable expert experience and feedback. Some studies have injected external knowledge into models through knowledge graphs [9, 10]. However, this approach is significantly limited in its ability to express external knowledge compared to natural language. In contrast, the drug discovery pipeline involves iterative refining processes that entail conversations with domain experts to incorporate their feedback, ultimately achieving the desired outcome [7]. The lack of real-time expert feedback significantly limits the effectiveness of these end-to-end optimization models in addressing urgent pandemics caused by novel viruses, such as COVID-19 [11], as well as in handling complex scenarios, such as anti-cancer drug development [12].

In light of the advancements in powerful LLMs, our work aims to leverage their strong interactivity and generalizability for molecule optimization. To the best of our knowledge, there are currently no molecule optimization models that focus on human–machine interaction. We summarize the main contributions of this work as follows:

  • To facilitate future research, we publicly release a large instruction-based dataset called ‘MolOpt-Instructions’ for fine-tuning language models on molecule optimization tasks. The dataset contains an adequate amount of data, ensuring both similarity constraints and a substantial difference in properties between molecules.

  • We propose DrugAssist, an interactive molecule optimization model fine-tuned from Llama2-7B-Chat, which performs optimization through human–machine dialogue. By enabling multi-turn conversations, domain experts can guide the model in further optimizing initially generated molecules with imperfections. Figure 1 illustrates the framework of our proposed DrugAssist model.

  • Compared to traditional molecular optimization approaches [13, 14] and LLM-based implementations [5, 7], DrugAssist has consistently achieved leading results in multi-property optimization, which is a less frequently addressed and more challenging task in molecule optimization. Moreover, our optimization objectives include maintaining optimized molecular property values within a given range. DrugAssist continues to demonstrate impressive performance in this category of tasks, which are more aligned with real-world requirements compared to most studies that solely focus on increasing or decreasing property values.

Figure 1.


The illustration of our proposed DrugAssist model framework, which focuses on optimizing molecules through human–machine dialogue.

Related work

Traditional approaches in molecule optimization

Based on the different representations of molecules, we can divide these models into two categories: sequence-based and graph-based.

Sequence-based

Most of these methods utilize SMILES (Simplified Molecular-Input Line-Entry System) strings as the molecular representation. They view molecule optimization as a machine translation problem in NLP, where text is translated from one language to another [13]. Similar to translation tasks in NLP, the conversion between molecules encoded in SMILES in molecular optimization tasks can also be seen as a transformation between ‘languages’. The main architectures for this category of models include recurrent neural networks [15–17], variational autoencoders (VAEs) [18–22], and Transformers [8, 13, 23]. Meanwhile, reinforcement learning [24, 25], adversarial training [26], and transfer learning [17] serve as typical optimization techniques. Considering the significant progress made in the field of LLMs in recent years, we believe that these sequence-based molecule optimization methods have great potential for further exploration.

Graph-based

These methods typically use graphs to represent molecules and directly generate molecular graphs. VAEs are also very popular in molecular graph generation. Jin et al. [18] decomposed a molecular graph into a junction tree of chemical substructures. Then, they employed a junction tree VAE (JT-VAE) to generate molecules with improved properties by applying gradient ascent over the learned latent space. Subsequent works have derived many variants of JT-VAE, such as VJTNN [19] and HierG2G [14]. In comparison to sequence-based methods, a notable distinction is that most graph-based approaches, such as JT-VAE, consistently generate valid molecules due to the validity checks performed at each step of the generation process.

Despite the achievements of the aforementioned methods in the field of molecule optimization, we believe that they still have some shortcomings that need to be addressed:

  • Most of the existing works focus on optimizing a single property of molecules, while there are few that simultaneously optimize multiple properties, which is a more common requirement in real life. Moreover, in most works, the optimization goal is to maximize the difference in properties between the optimized and original molecules while satisfying a certain similarity constraint. Alongside this, it’s worth noting that in real-life situations, there is often a need for the property value of the optimized molecule to fall within a specific range, an aspect that has received little attention in existing research.

  • Most methods suffer from catastrophic forgetting when the optimization task is changed. For example, a model that performs well in optimizing QED (Quantitative Estimate of Drug-likeness) values of molecules needs to be retrained on a dataset containing logP (logarithm of the partition coefficient) property before being used to optimize logP values. This approach not only incurs additional costs but also suffers from a lack of sufficient experimental data to facilitate training for some molecular properties, such as ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity).

  • To the best of our knowledge, no existing studies have focused on the interactivity of these molecular optimization models. Interactive models facilitate effective communication between domain experts and artificial intelligence models. Experts can conveniently provide feedback and suggestions to the model in the form of natural language, and the model can also obtain real-time access to expert experience related to specific problems. However, existing approaches struggle to efficiently utilize these valuable expert experiences and feedback.

LLMs in biomedical domain

In recent years, there has been an increasing number of attempts to apply LLMs in the field of biomedicine. The majority of research efforts are centered around QA (question-answering) tasks, such as ChatDoctor [2], MedAlpaca [3], and PMC-LLaMA [4], which focus on medical QA, and BioMedGPT [5] on molecule/protein QA. Relatively little work has focused on addressing practical tasks within the drug discovery domain. ChatMol [6] is employed for conversational molecular design; specifically, it can accomplish two tasks, molecule understanding and molecule generation, which involve bidirectional conversion between molecular descriptions and SMILES strings of molecules. ChatDrug [7] is a framework to facilitate drug editing using LLMs. Specifically, following the ChatDrug workflow, users can obtain carefully crafted prompts that assist in obtaining suggestions on drug editing tasks from general LLMs, such as ChatGPT.

Methods

Our methodology incorporates two primary components: the construction of our MolOpt-Instructions dataset and the subsequent instruction tuning of the Llama2-7B-Chat model.

Construction of MolOpt-Instructions dataset

Most datasets currently used for molecule optimization take the form of ‘molecule-molecule pairs’, which cannot be directly used to train language models like Llama. Although Fang et al. [29] introduce a comprehensive instruction dataset specifically designed for training large language models in the biomedical domain, it does not cover tasks associated with molecule optimization. Additionally, in some popular benchmark datasets [19], the molecule pairs are relatively few in number, satisfy only similarity constraints, and the difference in properties between the molecules within the same pair is not sufficiently large. To tackle these issues, we construct an instruction-based dataset called ‘MolOpt-Instructions’ for fine-tuning language models on molecule optimization tasks. It contains an adequate amount of data, ensuring both similarity constraints and a substantial difference in properties between molecules.

Overview and statistics

MolOpt-Instructions consists of over one million molecule pairs. Currently, it includes six types of molecular properties, namely Solubility, BBBP (Blood-Brain Barrier Penetration), hERG (Human Ether-a-go-go-Related Gene) inhibition, QED, and the numbers of hydrogen bond donors and acceptors, with detailed information provided in Table 1.

Table 1.

Statistics of our proposed MolOpt-Instructions dataset. It contains an adequate amount of data, ensuring both similarity constraints and a substantial difference in properties between molecules

Unique pairs	Unique molecules	Similarity	LogP difference
1 029 949	1 595 839	≥ 0.65	≥ 2.5

Data construction

The workflow of the data construction is shown in Fig. 2. To begin with, we randomly selected one million molecules from the ZINC database [30]. Then, we used mmpdb [27] to construct a database from these molecules and generate similar pairs. Mmpdb, an open-source Matched Molecular Pair (MMP) platform, generates MMPs through Matched Molecular Pair Analysis. In essence, an MMP consists of two molecules that differ by a defined structural transformation, resulting in highly similar molecular structures within the pairs generated by mmpdb. Following this, we selected the molecular pairs that met our requirements from these candidates. Our selection criteria are as follows: the similarity between each pair of molecules should be no less than 0.65, and the difference in logP should be no less than 2.5. Once we identified the suitable molecular pairs, we proceeded to calculate their property values using iDrug, an AI-driven drug discovery platform developed by Tencent [28]. Users can query various ADMET property values of molecules for free through the website https://drug.ai.tencent.com. To make the data more balanced, we maintain a roughly 1:1 ratio of increased to decreased property values for target molecules relative to source molecules by swapping the source and target molecules of some pairs. The rationale behind choosing the difference in logP as a screening criterion lies in its close relation to various aspects of a molecule’s biological activity and pharmacokinetics. After obtaining these pairs and their corresponding property values, we asked ChatGPT to suggest a variety of instructions and manually refined them for the molecule optimization tasks. To facilitate possible future exploration in molecule property prediction, some instructions include requirements for the model to output optimized molecule property values.
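For concreteness, the pair-filtering step can be sketched with RDKit as below. This is a minimal illustration, not the authors’ code: the Morgan fingerprint, the Tanimoto metric, and Crippen logP are our assumptions about how similarity and logP are computed, and the thresholds follow the selection criteria described above.

```python
# Sketch of filtering mmpdb-generated candidate pairs by similarity and logP difference.
from rdkit import Chem
from rdkit.Chem import AllChem, Crippen, DataStructs

def keep_pair(smiles_a, smiles_b, min_sim=0.65, min_logp_diff=2.5):
    mol_a, mol_b = Chem.MolFromSmiles(smiles_a), Chem.MolFromSmiles(smiles_b)
    if mol_a is None or mol_b is None:
        return False
    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, radius=2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, radius=2, nBits=2048)
    similarity = DataStructs.TanimotoSimilarity(fp_a, fp_b)
    logp_diff = abs(Crippen.MolLogP(mol_a) - Crippen.MolLogP(mol_b))
    # Keep pairs that are structurally similar but differ substantially in logP.
    return similarity >= min_sim and logp_diff >= min_logp_diff
```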

Figure 2.


The workflow of data construction of MolOpt-Instructions. First, we randomly picked one million molecules from the ZINC dataset. Then, we used mmpdb [27] to generate similar pairs based on these molecules and selected the molecular pairs that met our requirements from these candidates. Once we identified the suitable molecular pairs, we proceeded to calculate their property values using iDrug [28]. After obtaining these pairs and their corresponding property values, we asked ChatGPT to suggest a variety of instructions and manually refine them for the molecule optimization tasks.

We designed three types of optimization tasks: the first category requires only an increase or decrease in the given property value; the second category adds a threshold requirement for the increase or decrease; and the third category requires the optimized property value to be within a given range. In Table 2, we show an example of an instruction for each of these three categories. All instructions in the dataset can be found at https://github.com/blazerye/DrugAssist.

Table 2.

Examples of prompts for optimization tasks with three different goals—loose, strict, and range. ‘[SMILES]’ represents the SMILES string for the molecule

Task category Example prompt
loose I have a molecule with the SMILES string [SMILES]. Suggest modifications to increase its [property] value while maintaining its core structure.
strict I have a molecule with the SMILES string [SMILES]. Suggest modifications to increase its [property] value by at least [threshold] compared to the pre-optimized value while maintaining its core structure.
range Here is a molecule represented by the SMILES string [SMILES]. Provide me with an optimized version that has a [property] value between [lower bound] and [upper bound]. The output molecule should be similar to the input molecule.

Different from several widely used molecule optimization datasets, our optimization tasks do not just vaguely ask to ‘optimize the given molecule’ but also impose range requirements, making them more closely aligned with real-world scenarios.
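The templates in Table 2 can be instantiated programmatically. The sketch below simply fills the placeholders for the three task categories; the example SMILES, property, and bounds are illustrative values, not entries from the dataset.

```python
# Minimal prompt templating for the three task categories (loose, strict, range).
TEMPLATES = {
    "loose": ("I have a molecule with the SMILES string {smiles}. Suggest modifications "
              "to increase its {prop} value while maintaining its core structure."),
    "strict": ("I have a molecule with the SMILES string {smiles}. Suggest modifications "
               "to increase its {prop} value by at least {threshold} compared to the "
               "pre-optimized value while maintaining its core structure."),
    "range": ("Here is a molecule represented by the SMILES string {smiles}. Provide me "
              "with an optimized version that has a {prop} value between {lower} and "
              "{upper}. The output molecule should be similar to the input molecule."),
}

# Example usage with placeholder values.
prompt = TEMPLATES["range"].format(smiles="CCO", prop="solubility", lower=-3.0, upper=-2.0)
```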

Analysis and discussion

To ensure the diversity of molecules in our dataset, we employ Murcko scaffold analysis to evaluate the chemical diversity of the source molecules randomly selected from the ZINC database. The average number of molecules per scaffold is 2.95, and more than 93.7% of the scaffolds contain no more than five molecules. The scaffold analysis indicates a high degree of structural diversity among the source molecules. Therefore, models developed using this dataset are expected to demonstrate robust prediction coverage for a broad spectrum of structurally diverse compounds.
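A scaffold-diversity check of this kind can be reproduced with RDKit’s Murcko scaffold utilities; the sketch below is illustrative and may differ from the authors’ exact procedure.

```python
# Sketch of Murcko scaffold statistics over a list of source SMILES.
from collections import Counter
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_stats(smiles_list):
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        scaffold = Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))
        counts[scaffold] += 1
    avg_per_scaffold = sum(counts.values()) / len(counts)
    frac_small = sum(1 for c in counts.values() if c <= 5) / len(counts)
    # Average molecules per scaffold, and fraction of scaffolds with <= 5 molecules.
    return avg_per_scaffold, frac_small
```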

Furthermore, we also plot distribution graphs for molecular structural and ADMET-related properties, as shown in Figs 3 and 4, respectively. For molecular structural properties, we focus on Bertz complexity, molecular weight, atom count, and ring count. Bertz complexity is a key parameter for assessing the structural complexity of a molecule, providing insights into its potential reactivity and stability. Molecular weight, a measure of a molecule’s size, influences various physical and chemical properties, including solubility, volatility, and reaction kinetics. Atom count, indicative of the molecule’s size and complexity, impacts its stability and potential intermolecular interactions. Ring count, a measure of cyclic structures within a molecule, informs about its structural rigidity, conformational flexibility, and possible biological activity. These graphs provide a more intuitive visualization of the diversity in the physical structure and biochemical properties of molecules in MolOpt-Instructions.
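The four structural properties plotted in Fig. 3 can be computed with standard RDKit descriptors, as sketched below. Treating atom count as heavy-atom count is our assumption; the authors do not specify which convention they use.

```python
# Sketch of computing the structural properties shown in Fig. 3 for one molecule.
from rdkit import Chem
from rdkit.Chem import Descriptors

def structural_properties(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return {
        "bertz_complexity": Descriptors.BertzCT(mol),
        "molecular_weight": Descriptors.MolWt(mol),
        "atom_count": mol.GetNumAtoms(),   # heavy atoms only by default (assumption)
        "ring_count": Descriptors.RingCount(mol),
    }
```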

Figure 3.


Distribution of structural properties of molecules within MolOpt-Instructions, illustrating the structural diversity of the molecules.

Figure 4.


Distribution of ADMET-related properties of molecules within MolOpt-Instructions. Currently, MolOpt-Instructions covers six properties, namely Solubility, BBBP, hERG inhibition, QED, and the numbers of hydrogen bond donors and acceptors. The distribution graph demonstrates the diversity of biochemical properties of molecules in our dataset.

Instruction tuning

For LLMs to follow natural language instructions and complete real-world tasks, instruction tuning has been widely used for alignment [31]. In this process, the LLM is fine-tuned on a collection of tasks, which are defined through a set of instructions.

Our work follows a similar approach, performing instruction tuning on Llama2-7B-Chat using our MolOpt-Instructions dataset. Formally, we define the text input as a sequence of tokens $x = (x_1, \ldots, x_N)$, where each $x_i$ is a text token and $N$ is the total sequence length. At the instruction fine-tuning stage, the sequence $x$ is further split into two parts, an instruction $x_{\mathrm{inst}}$ and a response $x_{\mathrm{resp}}$. The training objective is to minimize the negative log-likelihood over the response $x_{\mathrm{resp}}$ with respect to the trainable parameters $\theta$: $\mathcal{L}(\theta) = -\sum_{i:\, x_i \in x_{\mathrm{resp}}} \log p_\theta(x_i \mid x_{<i})$.
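In code, this response-only objective amounts to masking instruction tokens out of the loss. The following PyTorch sketch illustrates the computation under that assumption; it is not the authors’ implementation.

```python
# Sketch of a response-masked negative log-likelihood for causal LM instruction tuning.
import torch.nn.functional as F

def response_nll(logits, input_ids, response_mask):
    """logits: (B, N, V) model outputs; input_ids: (B, N) token ids;
    response_mask: (B, N) bool, True where a token belongs to the response."""
    # Shift so that position t predicts token t+1, as in causal language models.
    logits = logits[:, :-1, :]
    targets = input_ids[:, 1:]
    mask = response_mask[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Average the loss over response tokens only; instruction tokens are ignored.
    return (token_nll * mask).sum() / mask.sum()
```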

Multi-task learning Catastrophic forgetting is a common phenomenon when instruction tuning pre-trained language models [32, 33]. Ensuring that the model maintains high interactivity while optimizing molecules is one of our important objectives. To achieve this, we employ multi-task learning as our instruction tuning strategy. Specifically, our text input consists of two parts: (1) general knowledge, such as everyday conversational question-answering data; and (2) domain-specific knowledge, which for our model pertains to molecule optimization. We mix these two types of data at a certain ratio, replicating the data from the less abundant category to reach it. Figure 5 illustrates our instruction tuning strategy.

Figure 5.


The illustration of the multi-task learning strategy. We apply instruction tuning by directly combining different data sources (general knowledge and molecule optimization), effectively mitigating catastrophic forgetting during the fine-tuning stage.

Experiments

We provide a comprehensive view of DrugAssist’s performance on traditional molecule optimization tasks, as well as its capabilities in dialogue and interaction. In addition, because the MolOpt-Instructions dataset used for training contains molecular property values, DrugAssist not only has the ability to optimize molecules but also possesses a certain capability for property prediction, which we demonstrate in Appendix A.

Experimental setup

Models

DrugAssist is a model fine-tuned from Meta’s Llama-2-7B-Chat model on over one million instruction-response demonstrations. We conduct a systematic comparison with the following sequence-based models:

  • He et al. [13] utilized state-of-the-art machine translation models, Seq2Seq with attention and the Transformer, for molecule optimization tasks.

  • ChatDrug [7] is a framework to facilitate the systematic investigation of drug editing using LLMs. For the molecule optimization tasks, they obtained results from ChatGPT (GPT-3.5-turbo) using carefully crafted prompts.

  • Llama2-7B-Chat [34] is a fine-tuned generative text model with 7 billion parameters, developed and publicly released by Meta. It outperforms open-source chat models on most benchmarks, and is on par with some popular closed-source models like ChatGPT and PaLM [35].

  • BioMedGPT-LM-7B [5] is the first large generative language model based on Llama2 in the biomedical domain.

Training details

At the instruction tuning stage, we train the model for 10 epochs with a batch size of 512. We use the AdamW optimizer with $(\beta_1, \beta_2) = (0.9, 0.999)$ and a learning rate of 1e-4, without weight decay. Warm-up is executed over 3% of the total training steps, followed by a cosine schedule for learning rate decay. The model is trained on 8 NVIDIA Tesla A100-SXM4-40GB GPUs.
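Expressed with the Hugging Face Trainer, the reported hyperparameters could look roughly as follows. The output directory, per-device batch size, and gradient accumulation steps are our assumptions, chosen only to reach the stated global batch size of 512 on 8 GPUs.

```python
# Sketch of the reported fine-tuning hyperparameters (not the authors' exact configuration).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="drugassist-7b",        # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=8,     # 8 GPUs x 8 samples x 8 accumulation steps = 512
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    adam_beta1=0.9,
    adam_beta2=0.999,
    weight_decay=0.0,
    warmup_ratio=0.03,                 # warm-up over 3% of total training steps
    lr_scheduler_type="cosine",
)
```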

LoRA (Low-Rank Adaptation) [36] is a technique designed to make fine-tuning large language models more efficient. Instead of updating all the model’s weights, LoRA keeps the pre-trained model weights frozen and introduces small, trainable low-rank matrices into each layer of the Transformer architecture. This approach significantly reduces the number of parameters that need to be trained, leading to lower memory and computational costs without sacrificing performance. By approximating weight updates with these low-rank matrices, LoRA enables efficient adaptation to new tasks while maintaining the quality of the original model. In our implementation, we use a LoRA rank of 64 (the dimension of the low-rank matrices) and a LoRA alpha of 128 (a scaling factor applied to the updates). When training with LoRA, the process involves creating a set of trainable low-rank weight matrices while keeping the original model weights frozen. After training, the LoRA weights are merged with the original pre-trained weights to produce the final fine-tuned model. This merging step integrates the low-rank updates into the base model, resulting in a fully functional, fine-tuned model ready for deployment.
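With the peft library, such a setup could be written roughly as below. The target modules are an assumption (a common choice for Llama-style models), and the final call shows how the LoRA updates are folded back into the base weights after training.

```python
# Sketch of a LoRA configuration with rank 64 and alpha 128, plus weight merging.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
lora_config = LoraConfig(
    r=64,                                  # rank of the low-rank update matrices
    lora_alpha=128,                        # scaling factor applied to the updates
    target_modules=["q_proj", "v_proj"],   # assumed attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)  # base weights stay frozen during training

# ... fine-tune `model` on the instruction data ...

merged = model.merge_and_unload()          # fold LoRA updates into the base weights
```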

Dataset for instruction tuning

To ensure our model maintains high interactivity while optimizing molecules, we employ the multi-task learning strategy introduced in the Methods section to construct training data. We utilize instruction data from two sources:

  • MolOpt-Instructions We have provided a detailed introduction to this dataset in the Methods section.

  • Stanford Alpaca To preserve the model’s natural language dialogue capabilities and counteract the forgetting effect during the supervised fine-tuning phase, we utilized the dataset employed for fine-tuning a 7B Llama model, which comprises 52k instruction-following examples provided by Stanford.

Considering that the MolOpt-Instructions dataset contains significantly more data than the Stanford Alpaca dataset, we created the final dataset by replicating the Stanford Alpaca dataset five times and then mixing it with the MolOpt-Instructions dataset. We divide the mixed data into training, validation, and test sets at a ratio of 0.9:0.05:0.05.
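A minimal sketch of this mixing and splitting scheme is shown below; the file names and the random seed are placeholders, not paths from the released code.

```python
# Sketch of replicating the Alpaca data five times, mixing, and splitting 0.9/0.05/0.05.
import json
import random

molopt = json.load(open("molopt_instructions.json"))   # hypothetical path
alpaca = json.load(open("alpaca_data.json"))           # hypothetical path

mixed = molopt + alpaca * 5        # replicate the smaller source five times
random.seed(0)
random.shuffle(mixed)

n = len(mixed)
train = mixed[: int(0.9 * n)]
val = mixed[int(0.9 * n): int(0.95 * n)]
test = mixed[int(0.95 * n):]
```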

Evaluation methods

Comparisons with traditional approaches

We compared DrugAssist with two molecule optimization models proposed by He et al. [13]. One of them employs a Seq2Seq with attention architecture (which we refer to as Mol-Seq2Seq), and the other uses a Transformer architecture (which we refer to as Mol-Transformer). We randomly selected 500 molecules from the MolOpt-Instructions dataset’s test set to serve as the test set for this experiment. Specifically, we compared the performance of these models in optimizing two properties: BBBP and Solubility. We calculated the success rates, validity, and average similarity between molecules before and after optimization, with the detailed definition of ‘success’ summarized as follows:

  • Solubility: We consider the optimization to be successful if the Solubility of the generated molecule falls within the given range. Specifically, we have divided the Solubility values into 10 intervals, each with a size of 1.

  • BBBP: We consider the optimization to be successful if the generated molecule’s BBBP property type is correct. Specifically, we have categorized BBBP values into three groups: low, medium, and high, corresponding to the value ranges of 0–0.3, 0.3–0.7, and 0.7–1, respectively.

The prompt we used is ‘Here is a molecule represented by the SMILES string [SMILES]. Provide me with an optimized version that has a molecular solubility value between [lower bound] and [upper bound] (unit: logarithm of mol/L), and change the blood-brain barrier penetration (BBBP) from [source category] to [target category]’. We use this prompt to obtain results from DrugAssist in a single-turn dialogue manner.
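These success definitions translate directly into simple checks, sketched below. How values lying exactly on the 0.3 and 0.7 BBBP boundaries are assigned is our assumption; the paper does not specify tie-breaking.

```python
# Sketch of the 'success' checks used in the comparison with traditional approaches.
def solubility_success(value, lower, upper):
    # Solubility succeeds if the value lands in the requested unit-width interval.
    return lower <= value <= upper

def bbbp_category(value):
    # Categories: low [0, 0.3), medium [0.3, 0.7), high [0.7, 1] (boundary handling assumed).
    if value < 0.3:
        return "low"
    if value < 0.7:
        return "medium"
    return "high"

def bbbp_success(value, target_category):
    return bbbp_category(value) == target_category
```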

Comparisons with LLMs

We compared DrugAssist with ChatDrug, Llama2-7B-Chat, and BioMedGPT-LM-7B. We randomly selected an additional 500 molecules from the ZINC dataset to serve as the test set for this experiment. Specifically, we compared the performance of these models on 16 tasks. Following the approach of Liu et al. [7], we employ multi-turn dialogues to enable LLMs to optimize molecules. We first propose optimization requirements, and if the model’s output does not meet our requirements, we search the database for a molecule that meets the requirements and is most similar to the model’s output, using it as a hint for the model to make modifications, until the requirements are met or the pre-set number of iterations is reached. Figure 6 illustrates the optimization process.
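The loop can be summarized in the Python sketch below. Here `llm`, `meets_requirement`, and `retrieve_hint` are stand-ins for the chat model, the property evaluator, and the database retrieval step, and the hint wording is illustrative rather than the exact prompt used in our experiments.

```python
# High-level sketch of the multi-turn optimization loop following Liu et al. [7].
def optimize(llm, initial_prompt, meets_requirement, retrieve_hint, max_rounds=5):
    prompt = initial_prompt
    for _ in range(max_rounds):
        candidate = llm(prompt)          # optimized SMILES proposed by the model
        if meets_requirement(candidate):
            return candidate             # success: the requirements are met
        # Retrieve a database molecule that satisfies the requirement and is most
        # similar to the failed output, and feed it back as a hint.
        hint = retrieve_hint(candidate)
        prompt = (f"Your provided molecule {candidate} does not meet the requirement, "
                  f"but {hint} does. Use it as a hint and try again.")
    return None                          # give up after the preset number of rounds
```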

Figure 6.


Multi-round optimization process proposed by Liu et al. [7]. The model is initially provided with a source molecule and specific optimization criteria. Upon generating the optimized molecule, its property values are assessed. If it meets the requirements, the process terminates. If not, a molecule most similar to the source molecule and fulfilling the criteria is retrieved from the database to guide further optimization. The process continues until the requirements are met or the predefined maximum number of iterations is reached.

We computed the success rate and validity for each task. We adopted two sets of criteria for defining ‘successful optimization’: loose and strict. For the loose criteria, if the optimized molecular property is higher or lower than the pre-optimization property as required, we consider the optimization to be successful. For the strict criteria, except for Solubility, if the optimized molecular property is higher or lower than the pre-optimization property by a specified threshold, we consider the optimization to be successful. For Solubility, if the optimized molecular property value falls within the required range, we consider the optimization to be successful. Our range requirements are set as follows: given a test molecule with Solubility value x, in the task of increasing the Solubility value, the required range for the optimized value is [x+0.5, x+1.5]; in the task of decreasing the Solubility value, the required range is [x-1.5, x-0.5]. The threshold settings for different properties in our experiments are shown in Table 3; a minimal check implementing these criteria is sketched after Table 3. Detailed prompt settings for each task can be found in Table S2 in Appendix B. BBBP, Solubility, and hERG inhibition are predicted with iDrug, while the rest can be calculated deterministically with RDKit.

Table 3.

The threshold settings for different properties. For the strict criteria, except for Solubility, we consider the optimization to be successful only if the optimized molecular property is higher or lower than the pre-optimization property by the threshold shown in the table

Property Threshold
QED 0.1
hydrogen bond acceptor 1
hydrogen bond donor 1
BBBP 0.1
hERG inhibition 0.1
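Combining Table 3 with the criteria above, a single-property success check can be sketched as follows; Solubility under the strict criterion is handled by the range check described earlier and is deliberately not covered by this function.

```python
# Sketch of the loose/strict success check for single-property tasks using Table 3 thresholds.
THRESHOLDS = {
    "QED": 0.1,
    "hydrogen bond acceptor": 1,
    "hydrogen bond donor": 1,
    "BBBP": 0.1,
    "hERG inhibition": 0.1,
}

def is_success(prop, old_value, new_value, direction, criterion):
    """direction is '+' for increase tasks and '-' for decrease tasks."""
    delta = new_value - old_value if direction == "+" else old_value - new_value
    if criterion == "loose":
        return delta > 0                     # any change in the required direction
    return delta >= THRESHOLDS[prop]         # strict: change must reach the threshold
```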

Main results

In this section, we conduct a comprehensive and systematic comparison with aforementioned models. In addition, we also investigate the impact of the number of training data points during fine-tuning and the number of dialogue rounds on DrugAssist’s performance. Detailed results can be found in Appendix C and Appendix D, while here we present a brief summary of the conclusions.

Comparisons with traditional approaches Here we compared our model with He et al. [13]. The results are shown in Table 4. Our model has achieved the highest success rates in both single-property and multi-property optimization while maintaining high validity and high similarity to the molecules to be optimized. More comparisons with Mol-Transformer across additional test sets and properties can be found in Appendix E.

Table 4.

Comparisons with traditional approaches on optimizing molecules’ Solubility and BBBP value. We choose success rate, valid rate, and similarity as evaluation metrics. The Solubility and BBBP columns display the success rates of the model optimizing these two individual properties, respectively, while the ‘All’ column shows the success rate of the model simultaneously optimizing both properties. Our model has achieved the highest success rates in both single-property and multi-property optimization while maintaining high validity and high similarity to the molecules to be optimized. Furthermore, we can also observe that the Transformer architecture performs much better than the Seq2Seq with attention architecture on this task

Model Solubility BBBP All Valid rate Similarity
Mol-Seq2Seq 0.46 0.55 0.35 0.76 0.61
Mol-Transformer 0.70 0.78 0.59 0.96 0.70
Ours 0.74 0.80 0.62 0.98 0.69

Comparisons with LLMs The results are shown in Table 5. Our model significantly outperforms other LLMs in terms of both success rate and valid rate across all tasks. In Appendix F, we provide a detailed explanation of why GPT-3.5-turbo’s performance in our tests is significantly lower than what has been reported in the work of Liu et al. [7].

Table 5.

Comparisons with LLMs. We evaluated the performance of LLMs on 16 tasks, covering all three optimization objectives introduced in the Methods section: loose, strict, and range. We calculated the valid ratio (number of valid SMILES generated/total number in the test set) and success rate (number of molecules meeting optimization objectives/total number in the test set) of the generated molecules. In the task naming, ‘+’ represents the goal of increasing the property value, while ‘-’ represents the goal of decreasing it. The ‘&’ symbol represents the simultaneous optimization of two given properties. ‘sol’ stands for ‘Solubility’, and ‘acc’ stands for the number of hydrogen bond acceptors. ‘loose’ and ‘strict’ are the two criteria for defining successful optimization, as detailed in the Experimental setup section

Task Model Valid ratio (loose/strict) Correct ratio (loose/strict)
qed+ Llama2-7B-Chat 0.69 / 0.55 0.17 / 0.16
GPT-3.5-turbo 0.97 / 0.96 0.15 / 0.15
BioMedGPT-LM 0.34 / 0.32 0.15 / 0.09
Ours 0.99 / 0.97 0.76 / 0.63
acceptor+ Llama2-7B-Chat 0.45 / 0.43 0.08 / 0.08
GPT-3.5-turbo 0.98 / 0.96 0.04 / 0.06
BioMedGPT-LM 0.45 / 0.39 0.18 / 0.13
Ours 0.97 / 0.96 0.71 / 0.67
donor+ Llama2-7B-Chat 0.45 / 0.48 0.15 / 0.08
GPT-3.5-turbo 0.98 / 0.95 0.10 / 0.04
BioMedGPT-LM 0.46 / 0.46 0.17 / 0.09
Ours 0.98 / 0.95 0.72 / 0.76
solubility+ Llama2-7B-Chat 0.56 / 0.56 0.36 / 0.20
GPT-3.5-turbo 0.94 / 0.95 0.16 / 0.05
BioMedGPT-LM 0.27 / 0.35 0.18 / 0.09
Ours 0.98 / 0.98 0.80 / 0.41
bbbp+ Llama2-7B-Chat 0.56 / 0.57 0.19 / 0.14
GPT-3.5-turbo 0.97 / 0.95 0.10 / 0.10
BioMedGPT-LM 0.26 / 0.22 0.16 / 0.07
Ours 0.99 / 0.98 0.82 / 0.61
herg- Llama2-7B-Chat 0.59 / 0.55 0.39 / 0.31
GPT-3.5-turbo 0.98 / 0.97 0.13 / 0.15
BioMedGPT-LM 0.20 / 0.18 0.13 / 0.12
Ours 0.99 / 0.98 0.71 / 0.67
sol+ & acc+ Llama2-7B-Chat 0.55 / 0.52 0.15 / 0.04
GPT-3.5-turbo 0.92 / 0.91 0.09 / 0.02
BioMedGPT-LM 0.29 / 0.32 0.10 / 0.07
Ours 0.95 / 0.95 0.50 / 0.27
qed+ & bbbp+ Llama2-7B-Chat 0.52 / 0.56 0.14 / 0.09
GPT-3.5-turbo 0.96 / 0.95 0.09 / 0.06
BioMedGPT-LM 0.35 / 0.36 0.16 / 0.11
Ours 0.99 / 0.98 0.65 / 0.41

From the perspective of the valid ratio of generated molecules, BioMedGPT-LM performs poorly. The main reason is that it has difficulty understanding the optimization requirements, often generating content such as guiding users to websites for molecule optimization rather than outputting the optimized molecule. Although GPT-3.5-turbo appears to have a high valid ratio, it often generates molecules that are identical to the given molecule to be optimized, thus failing to serve the purpose of molecule optimization. Our model, on the other hand, demonstrates a significant advantage in generating valid molecules, with virtually no instances of misunderstanding requirements or generating molecules identical to the ones to be optimized.

From the perspective of accuracy, we find that Llama itself exhibits some molecular optimization capabilities, likely due to the knowledge acquired during its pre-training on large datasets. Our fine-tuning process further enhances and stimulates these related capabilities. Traditional models, on the other hand, generally cannot take advantage of such pre-trained knowledge. Additionally, compared to other pre-trained large models, DrugAssist’s fine-tuning on our specifically designed MolOpt-Instructions dataset significantly enhances its performance in molecular optimization tasks. Even when we use multi-turn dialogues to prompt the baseline LLMs for comparison, they still struggle to complete the optimization tasks, with low success rates even on relatively simple tasks that only require increasing a single property value.

Our model exhibits good molecule optimization capabilities and strong adaptability to different properties and optimization objectives. Even though our model has only been exposed to data with individual properties during training, it achieves competitive results in multi-property optimization tasks. We believe that this adaptability may be attributed to the inherent generalization and emergent capabilities of pre-trained LLMs, which potentially allow them to extend beyond their original training setup.

We also observed that the model’s performance varies across tasks, primarily caused by the following factors:

  • The difficulty of optimizing different properties of molecules has inherent differences. For instance, properties such as hERG inhibition are inherently more challenging to predict. This difficulty arises from the complexity of interactions with molecular structures, diverse binding mechanisms, and the limited availability of high-quality data. Consequently, optimizing hERG inhibition is more difficult compared to some basic physicochemical properties, such as logP.

  • The differences in the distribution of molecular properties within the training data. For instance, as illustrated in Fig. 4, the distribution of the QED values in the training molecules tends towards lower values, whereas the BBBP values are skewed towards higher values. This disparity in distribution can lead to a performance advantage for the BBBP task during testing that requires the increase of property values, as the model has been exposed to a greater number of high-BBBP molecules during the training phase.

  • The criteria for success vary across different optimization tasks. Overall, the model’s performance is better under ‘loose’ criteria compared to ‘strict’ criteria, and single-property optimization outperforms multi-property optimization. We observed a significant decline in model performance in the ‘esol+ strict’ task, indicating that optimization tasks requiring the molecular property values to fall within a specific range are particularly challenging. There has been little research addressing such optimization tasks before. Additionally, the success rate for multi-property optimization declines significantly compared to single-property optimization. Although providing multi-property optimization datasets during training can mitigate this issue, efficiently achieving arbitrary combinations of properties during the optimization process remains a topic worthy of further investigation.

Impact of the number of training data points The experimental results show that a training dataset size of 600 000 (60% of the original dataset size) can provide similar performance to the original dataset, but a dataset size of 200 000 cannot. This suggests that, considering the training cost and potential generalization issues, a moderate amount of data is a better choice. Detailed results and analysis can be found in Appendix C.

Impact of the number of dialogue rounds The experimental results show that simply increasing the number of dialog rounds does not effectively improve the model’s ability to optimize molecules. Detailed results and analysis can be found in Appendix D.

Case study

In this section, we showcase the exceptional capabilities of our model in molecule optimization tasks through several specific examples, beyond its high success rate. More cases can be found in Appendix G.

Transferability Fig. 7 demonstrates the good transferability of our model under the zero-shot setting. We randomly selected two properties, BBBP and QED, and asked DrugAssist to increase their values by at least 0.1 simultaneously. Our model achieved this and the resulting molecule is structurally similar to the original one. Although the model has only been exposed to data with individual properties during training, users can still freely combine these properties when using the model to optimize them simultaneously. Traditional models, however, often require retraining on new datasets with multiple properties in order to achieve this.

Figure 7.


Good transferability of DrugAssist under the zero-shot setting. Users can freely combine individual properties in training data to request DrugAssist to optimize them simultaneously.

Figure 8 demonstrates the good transferability of DrugAssist under the few-shot setting. We asked DrugAssist to increase the logP value of a given molecule by at least 0.1, even though this property is not included in the training data. By providing, in the prompt, a few examples of similar molecules whose logP values were successfully increased by at least 0.1, our model was able to achieve this. Our model can optimize properties not encountered during training through few-shot prompting, which is difficult for traditional models (e.g. JT-VAE) to achieve.
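A few-shot prompt of this kind can be assembled from example pairs roughly as follows; the wording is illustrative and not the exact prompt used in the case shown in Fig. 8.

```python
# Sketch of building a few-shot prompt for an unseen property (logP) from example pairs.
def few_shot_prompt(examples, query_smiles):
    lines = ["Here are examples of molecules optimized to increase their logP value "
             "by at least 0.1:"]
    for source_smiles, target_smiles in examples:
        lines.append(f"Input: {source_smiles} -> Output: {target_smiles}")
    lines.append(f"Now suggest modifications to increase the logP value of {query_smiles} "
                 f"by at least 0.1 while maintaining its core structure.")
    return "\n".join(lines)
```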

Figure 8.


Good transferability of DrugAssist under the few-shot setting. By providing examples of successful optimizations for molecules similar to the one to be optimized, DrugAssist can optimize properties not encountered during training.

Iterative optimization Figure 9 illustrates the iterative optimization capability of our model. We asked DrugAssist to increase the water solubility of the given molecule, but it failed. Then, we suggested that it could add some polar functional groups. DrugAssist took our advice and performed further optimization based on the failed molecule (by adding amino groups), successfully increasing the water solubility of the given molecule. We can conclude that when the model provides a molecule that does not fully meet the requirements, it can correct the error and generate a new, compliant molecule based on human-provided suggestions and feedback. This ability highlights the potential for DrugAssist to assist researchers in continually adjusting and optimizing molecules in real-world scenarios.

Figure 9.


Iterative optimization capability of DrugAssist. When the model provides a molecule that does not fully meet the requirements, it can correct the error and generate a new, compliant molecule based on human suggestions.

Conclusion and future work

In this paper, we present DrugAssist, an interactive molecule optimization model. Unlike previous methods, DrugAssist can interact with humans in real-time using natural language. It can provide optimized results based on the instructions given by users and continue to adjust according to their feedback. It demonstrates excellent performance in both single-property and multi-property optimization, including more challenging tasks, such as optimizing within specified property value ranges. Additionally, it shows great potential in transferability and iterative optimization capabilities during the interaction process. Furthermore, we publicly release MolOpt-Instructions, an instruction-based dataset to facilitate future work on fine-tuning LLMs in the molecule optimization domain.

In the future, we aim to improve the model’s ability to handle multimodal data and tasks [37, 38] to reduce hallucination problems [39–42]. Additionally, we are endeavoring to further enhance DrugAssist’s interactive capabilities to better understand users’ needs and feedback.

Key Points

  • We propose DrugAssist, an interactive molecule optimization model which performs optimization through human–machine dialogue by leveraging LLM’s strong interactivity and generalizability.

  • DrugAssist has achieved leading results in both single and multiple property optimization, simultaneously showcasing immense potential in transferability and iterative optimization.

  • We publicly release a large instruction-based dataset called ‘MolOpt-Instructions’ for fine-tuning language models on molecule optimization tasks.

Supplementary Material

DrugAssist_supplementary_file_revised_bbae693

Contributor Information

Geyan Ye, Tencent AI Lab, Tencent, Shenzhen 518057, China.

Xibao Cai, Department of Computer Science, Hunan University, Changsha 410008, China.

Houtim Lai, Tencent AI Lab, Tencent, Shenzhen 518057, China.

Xing Wang, Tencent AI Lab, Tencent, Shenzhen 518057, China.

Junhong Huang, Tencent AI Lab, Tencent, Shenzhen 518057, China.

Longyue Wang, Tencent AI Lab, Tencent, Shenzhen 518057, China.

Wei Liu, Tencent AI Lab, Tencent, Shenzhen 518057, China.

Xiangxiang Zeng, Department of Computer Science, Hunan University, Changsha 410008, China.

Funding

This work was supported by the National Science and Technology Major Project (2023ZD0120902 to X.Z.), the National Natural Science Foundation of China (U22A2037, 62425204, 62122025, 62450002, and 62432011 to X.Z.), and the Beijing Natural Science Foundation (L248013 to X.Z.).

Code, data, and model availability

Our code is publicly available at https://github.com/blazerye/DrugAssist. We have published the MolOpt-Instructions dataset at https://huggingface.co/datasets/blazerye/MolOpt-Instructions. The original weights and the quantized weights of DrugAssist are available at https://huggingface.co/blazerye/DrugAssist-7B. By using the quantized weights, DrugAssist can be deployed on personal laptops without GPUs. For specific instructions, please refer to our GitHub homepage.

References

  • 1. Radford A, Jeffrey W, Child R. et al.. Language models are unsupervised multitask learners. OpenAI blog 2019;1:9. [Google Scholar]
  • 2. Yunxiang L, Zihan L, Kai Z. et al.. ChatDoctor: a medical chat model fine-tuned on LLaMA model using medical domain knowledge. Cureus 2023;15:e40895. 10.7759/cureus.40895. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Han T, Adams LC, Papaioannou J-M. et al.. MedAlpaca–an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247. 2023.
  • 4. Wu C, Lin W, Zhang X. et al.. PMC-LLaMA: towards building open-source language models for medicine. Journal of the American Medical Informatics Association. 2024;31:1833–43. 10.1093/jamia/ocae045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Luo Y, Zhang J, Fan S. et al.. BioMedGPT: open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442. 2023.
  • 6. Zeng Z, Yin B, Wang S. et al.. Interactive molecular discovery with natural language. arXiv preprint arXiv:2306.11976. 2023. [DOI] [PMC free article] [PubMed]
  • 7. Liu S, Wang J, Yang Y. et al.. ChatGPT-powered conversational drug editing using retrieval and domain feedback. In: The Twelfth International Conference on Learning Representations. Vienna, Austria, 2024.
  • 8. He J, Nittinger E, Tyrchan C. et al.. Transformer-based molecular optimization beyond matched molecular pairs. J Chem 2022;14:18. 10.1186/s13321-022-00599-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Lin X, Quan Z, Wang Z-J. et al.. KGNN: knowledge graph neural network for drug-drug interaction prediction. In: IJCAI, vol. 380, pp. 2739–45. Yokohama, Japan, 2020. [Google Scholar]
  • 10. Chen S, Semenov I, Zhang F. et al.. An effective framework for predicting drug–drug interactions based on molecular substructures and knowledge graph neural network. Comput Biol Med 2024;169:107900. 10.1016/j.compbiomed.2023.107900. [DOI] [PubMed] [Google Scholar]
  • 11. Suwen H, Jiang S, Qi X. et al.. Races of small molecule clinical trials for the treatment of Covid-19: an up-to-date comprehensive review. Drug Dev Res 2022;83:16–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Zhu J, Jiang X, Luo X. et al. Combination of chemotherapy and gaseous signaling molecular therapy: novel β-elemene nitric oxide donor derivatives against leukemia. Drug Dev Res 2023;84:718–35. 10.1002/ddr.22051. [DOI] [PubMed] [Google Scholar]
  • 13. He J, You H, Sandström E. et al.. Molecular optimization by capturing chemist’s intuition using deep neural networks. J Chem 2021;13:1–17. 10.1186/s13321-021-00497-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Jin W, Barzilay R, Jaakkola T. Hierarchical generation of molecular graphs using structural motifs. In: International conference on machine learning, pp. 4839–48. China: PMLR Xi'an, 2020. [Google Scholar]
  • 15. Gupta A, Müller AT, Huisman BJH. et al.. Generative recurrent networks for de novo drug design. Mol Inform 2018;37:1700111. 10.1002/minf.201880141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Bjerrum EJ, Threlfall R. Molecular generation with recurrent neural networks (RNNs). arXiv preprint arXiv:1705.04612. 2017.
  • 17. Segler MHS, Kogej T, Tyrchan C. et al.. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent Sci 2018;4:120–31. 10.1021/acscentsci.7b00512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Jin W, Barzilay R, Jaakkola T. Junction tree variational autoencoder for molecular graph generation. In: International conference on machine learning, pp. 2323–32. Stockholm, Sweden: PMLR, 2018. [Google Scholar]
  • 19. Jin W, Yang K, Barzilay R. et al.. Learning multimodal graph-to-graph translation for molecular optimization. In: International Conference on Learning Representations. New Orleans, Louisiana, USA, 2019.
  • 20. Dai H, Tian Y, Dai B. et al.. Syntax-directed variational autoencoder for molecule generation. In: Proceedings of the International Conference on Learning Representations. Vancouver, Canada, 2018.
  • 21. Liu Q, Allamanis M, Brockschmidt M. et al.. Constrained graph variational autoencoders for molecule design. Adv Neural Inf Process Syst 2018;31. [Google Scholar]
  • 22. Simonovsky M, Komodakis N. Graphvae: Towards generation of small graphs using variational autoencoders. In: Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part I 27, pp. 412–22. Cham, Switzerland: Springer, 2018. [Google Scholar]
  • 23. Yang Y, Junhong Huang H, He JH. et al.. Accelerated discovery of macrocyclic CDK2 inhibitor QR-6401 by generative models and structure-based drug design. ACS Med Chem Lett 2023;14:297–304. 10.1021/acsmedchemlett.2c00515. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Olivecrona M, Blaschke T, Engkvist O. et al.. Molecular de-novo design through deep reinforcement learning. J Chem 2017;9:1–14. 10.1186/s13321-017-0235-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Putin E, Asadulaev A, Ivanenkov Y. et al.. Reinforced adversarial neural computer for de novo molecular design. J Chem Inf Model 2018;58:1194–204. 10.1021/acs.jcim.7b00690. [DOI] [PubMed] [Google Scholar]
  • 26. Kadurin A, Nikolenko S, Khrabrov K. et al.. druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Mol Pharm 2017;14:3098–104. 10.1021/acs.molpharmaceut.7b00346. [DOI] [PubMed] [Google Scholar]
  • 27. Dalke A, Hert J, Kramer C. mmpdb: an open-source matched molecular pair platform for large multiproperty data sets. J Chem Inf Model 2018;58:902–10. 10.1021/acs.jcim.8b00173. [DOI] [PubMed] [Google Scholar]
  • 28. iDrug, 2020.
  • 29. Fang Y, Liang X, Zhang N. et al.. Mol-Instructions: a large-scale biomolecular instruction dataset for large language models. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria, 2024.
  • 30. Irwin JJ, Shoichet BK. ZINC- a free database of commercially available compounds for virtual screening. J Chem Inf Model 2005;45:177–82. 10.1021/ci049714+. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Ouyang L, Jeffrey W, Jiang X. et al.. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 2022;35:27730–44. [Google Scholar]
  • 32. De Lange M, Aljundi R, Masana M. et al.. A continual learning survey: defying forgetting in classification tasks. IEEE Trans Pattern Anal Mach Intell 2021;44:3366–85. [DOI] [PubMed] [Google Scholar]
  • 33. Dong G, Yuan H, Lu K. et al.. How abilities in large language models are affected by supervised fine-tuning data composition. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 177–98. Bangkok, Thailand: Association for Computational Linguistics, 2024.
  • 34. Touvron H, Martin L, Stone K. et al.. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023.
  • 35. Narang S, Chowdhery A. PaLM: Scaling language modeling with pathways. J Mach Learn Res 2024;24:Article 240, 113. [Google Scholar]
  • 36. Edward JH, Shen Y, Wallis P. et al.. LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations, 2022.
  • 37. Lyu C, Wu M, Wang L. et al.. Macaw-LLM: multi-modal language modeling with image, audio, video, and text integration. arXiv preprint arXiv:2306.09093. 2023.
  • 38. Li Y, Liu Y, Wang Z. et al.. A comprehensive study of GPT-4V’s multimodal capabilities in medical imaging medRxiv. 2023;2023–11.
  • 39. Zhang Y, Li Y, Cui L. et al.. Siren’s song in the AI ocean: a survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans Inf Syst. 2024. Just Accepted.
  • 40. Liu B, Lyu C, Min Z. et al.. Retrieval-augmented multi-modal chain-of-thoughts reasoning for large language models. arXiv preprint arXiv:2312.01714. 2023.
  • 41. Li Y, Wang L, Hu B. et al.. A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering. arXiv preprint arXiv:2311.07536. 2023.
  • 42. Cai X, Lai H, Wang X. et al.. Comprehensive evaluation of molecule property prediction with ChatGPT. Methods 2023;222:133–41. 10.1016/j.ymeth.2024.01.004. [DOI] [PubMed] [Google Scholar]

