Gargoyles: An Open Source Graph-Based Molecular Optimization Method Based on Deep Reinforcement Learning

Daiki Erikawa; Nobuaki Yasuo; Takamasa Suzuki; Shogo Nakamura; Masakazu Sekijima

doi:10.1021/acsomega.3c05430

ACS Omega. 2023 Sep 28;8(40):37431–37441. doi: 10.1021/acsomega.3c05430

PMCID: PMC10568706 PMID: 37841174

Abstract

Automatic optimization methods for compounds in the vast compound space are important for drug discovery and material design. Several machine learning-based molecular generative models for drug discovery have been proposed, but most of these methods generate compounds from scratch and are not suitable for exploring and optimizing user-defined compounds. In this study, we developed a compound optimization method based on molecular graphs using deep reinforcement learning. This method searches for compounds on a fragment-by-fragment basis and at high density by generating the fragments to be added atom by atom. Experimental results confirmed that the quantitative estimate of drug-likeness (QED), the optimization target set in this study, was enhanced by searching around the starting compound. As a use case, we successfully enhanced the activity of a compound by targeting dopamine receptor D2 (DRD2). The generated compounds showed increased activity while remaining structurally similar to the starting compounds, indicating that this method is suitable for optimizing molecules from a given compound. The source code is available at https://github.com/sekijima-lab/GARGOYLES.

Introduction

Drug discovery is the process of identifying potential new therapeutic compounds, peptides, or antibodies for the treatment of diseases. It involves a series of steps beginning with the identification of biological targets, followed by the development of hit compounds, optimization of lead compounds, and finally, preclinical and clinical testing of drug candidates.¹

Hit compounds are compounds that show initial activity against target proteins involved in diseases, such as 3CL protease,² spermidine synthase,^3,4 dephospho-CoA kinase,⁵ and nicotinamide adenine dinucleotide (NAD)-dependent deacetylase Sirtuin 1,⁶ and have potential to be developed as drugs. Hit compounds are found through various screening methods, including high-throughput screening, which searches for targets from a large compound library.

Hit-to-lead is the process of optimizing a hit compound into a lead compound, which is a more promising and optimized drug candidate. In this phase, the properties of the hit compound are optimized to improve its efficacy and safety as a drug candidate. The lead compound then undergoes further testing and development to determine its suitability as a drug candidate.⁷

Typically, 200,000 to 1 million compounds are screened first. Then, more than 100 compounds are screened in hit-to-lead and lead optimization to narrow the molecules down to one or two candidates.¹ Even after this narrowing, it has been shown that only approximately 1 in 10 (10.4%, n = 5820) of the leads that entered Phase 1, across all indications, were approved by the FDA.⁸ Thus, drug discovery projects often fail.^9,10 If even a portion of these processes could be assisted by in silico methods, it would save a tremendous amount of time and expense.¹¹ With recent developments in computers and algorithms, the application of computer science technology to drug discovery has been explored, and the efficiency and quality of the drug discovery processes have been improved.¹²⁻¹⁷

Molecular generative models are computer-based methods related to hit compound discovery and lead optimization.¹⁸ Molecular generative models have the advantage of being able to efficiently explore a huge chemical space and generate novel compounds with desirable properties using machine learning. They also have the advantage of avoiding explicitly dealing with complex chemical knowledge by using large compound data sets to train machine learning models. Several methods have been proposed to optimize molecules according to evaluation functions.¹⁹ The variational autoencoder can be used to generate molecules by modeling in the latent space.²⁰⁻²⁴ Then, optimization can be performed with the gradient method, leveraging the fact that the latent variables are continuous with the model that predicts the evaluation value from the latent variables. Such approximate models of the evaluation function are generally used in molecular optimization. String representation of molecules²⁵ (SMILES) can be generated by long short-term memories (LSTMs) of recurrent neural networks²⁶⁻²⁹ (RNNs). In this case, optimization is performed by retraining the generative model using the data set in which molecules without the target property are removed from the generated molecules by the approximate model of the evaluation function. Optimization using generative adversarial networks (GANs) is performed in the same manner.³⁰ Some optimization methods are also based on reinforcement learning. One popular approach is to represent molecular generation as a Markov decision process in which the state is a molecule, and the action is the addition of atoms (or fragments) to approximate the policy function with a machine learning model.^31,32 This approach has the advantage of directly optimizing the evaluation function compared to an approximate model. 
In addition to the policy gradient method, there are other optimization methods such as Q-learning and Monte Carlo Tree Search (MCTS).³³⁻³⁵

While these approaches are suitable for generating compound libraries used to search for hit compounds, they are not suitable for cases such as lead optimization, where the candidates have already been narrowed down to a specific compound and optimization is performed on that compound. Several methods that start generation from arbitrary molecules have been proposed.^29,36 Mol-CycleGAN is one such method. It is trained using sets of molecules before and after optimization based on the CycleGAN generation scheme.³⁷ Optimization is performed via latent variables, but according to experimental results, its performance is inferior to that of methods using reinforcement learning. MERMAID is the method most relevant to this study; it starts generation from an arbitrary molecule by editing SMILES with MCTS and LSTM.³⁸ However, the generated molecules often deviate significantly from the initial molecules as the optimization progresses.

In this study, we developed a molecular optimization method based on molecular graphs that uses MCTS to explore the space around an arbitrary starting molecule. The use of molecular graphs allows for high similarity to the starting compound, while MCTS allows for efficient generation without prior learning of a specific evaluation function. In addition, a graph neural network model trained on the compound data set is used to enhance the efficiency of the search. The search is conducted per fragment, but the fragments to be added are generated atom by atom. This allows more appropriate fragments to be added to the current molecule while avoiding the lack of diversity caused by using a fixed fragment vocabulary.

Results

Unconstrained Optimization

Table 4 shows that the quantitative estimate of drug-likeness (QED) value exceeded 0.92 for the top three molecules, indicating that the optimization was largely sufficient. Furthermore, MERMAID,³⁸ which is the baseline for comparison in this study, had evaluation function values that were approximately 0.04 to 0.14 lower. This is also visualized in Figure 1. The difference between the two methods is the representation of molecules: MERMAID uses SMILES, and the proposed method uses molecular graphs. Taking the ZINC data set as an example of data size, the average number of nodes in a molecular graph was approximately 24, while the average number of tokens in SMILES was approximately 40. Since both methods edit only one node or token per step, the proposed method covers a larger part of the molecule in the same number of steps and is therefore considered to search more. Additionally, while SMILES editing can generate invalid strings, the molecular graph is always valid, which is another reason why this method was able to search more. Note that, for a single step, SMILES with RNNs generally runs faster than molecular graphs with graph convolutional networks (GCNs).

Table 4. QED Values of Unconstrained Optimization.

method	first	second	third	50th	avg top 50
seed	0.653 ± 0.028	−	−	−	−
MERMAID	0.890 ± 0.041	0.875 ± 0.045	0.865 ± 0.047	0.750 ± 0.048	0.794 ± 0.047
proposed method	0.928 ± 0.032	0.925 ± 0.033	0.923 ± 0.036	0.891 ± 0.051	0.905 ± 0.045

Figure 1. Distribution of QED for molecules optimized by the proposed method and MERMAID. The distribution of the starting molecules is represented as ‘initial.’

In terms of validity, this method, which employs molecular graphs, always produced valid molecules (i.e., validity = 1), while MERMAID, which uses SMILES, produced some strings that do not satisfy the SMILES grammar (i.e., validity = 0.7). Novelty and uniqueness were all 1 or nearly 1, confirming that there were no problems from these perspectives. The similarity was clearly higher for this method (Figure 2). Given that the average similarity over all pairs of molecules in the ZINC data set of approximately 250,000 molecules was 0.144, the molecules generated by this method can be considered similar to the starting molecules. The SA score was also clearly better for this method. This may be because the molecular graphs used in this method allow fragments to be added and deleted only at single bonds, making it difficult to create complex structures.

Figure 2. Similarity distributions from the starting molecule to the molecules optimized with the proposed method and MERMAID.

Examples of molecules generated by this method are shown in Figure 3. The addition and deletion of a few fragments from the starting molecule resulted in molecules with improved QED values while maintaining a high degree of similarity. One reason for the higher similarity obtained with this method compared with MERMAID is that the addition and deletion of fragments are limited to single bonds, thereby avoiding the direct editing of ring structures that occurs in MERMAID. However, this restriction of actions has a disadvantage in terms of diversity because complex rings cannot be generated in the fragment-wise search. Examples of molecules for which AiZynthFinder found a synthetic pathway are shown in Figure 4.

Figure 3. Molecular pairs before (left) and after (right) optimization. The blue areas represent substructures that were removed from the initial structure, while the red areas represent substructures that were added during the optimization process.

Figure 4. Examples of synthetic pathways produced by AiZynthFinder for optimized molecules. Compounds in the green frame are registered as commercial reagents in AiZynthFinder.

Constrained Optimization

Table 5 shows that the improvement in penalized log P (P log P) was slightly less than that of GraphAF but was still considerably better than that of the other three methods. The success rate was close to 100%, indicating that P log P can be optimized regardless of the starting molecule. Note that the comparison of the methods here is not completely fair because they have different generation schemes. For example, GraphAF and GCPN generate molecules after fully optimizing the policy model through reinforcement learning, whereas MERMAID and the proposed method are at a disadvantage in that they do not train the network for a specific evaluation function. On the other hand, the similarity constraint makes this a favorable task for MERMAID and the proposed method in terms of the generation scheme.

Table 5. Results of Constrained Optimization Experiments^a.

methods	improvement	similarity	success (%)
GCPN	0.79 ± 0.63	0.68 ± 0.08	100
Mol-CycleGAN	1.22 ± 1.48	0.69 ± 0.07	19.3
GraphAF	4.98 ± 6.49	0.66 ± 0.05	96.88
MERMAID	1.99 ± 1.74	0.62 ± 0.02	85.3
proposed method	4.18 ± 5.84	0.62 ± 0.06	99.3

Improvement represents the increase from the original P log P, and success represents the ratio of successfully optimized molecules where improvement is a positive value. In addition to the mean, standard deviations are also shown for improvement and similarity.
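The two metrics defined in this footnote can be sketched in a few lines of Python. This is an illustration only: the function name `summarize` is hypothetical, the sketch averages the improvement over all molecules (the text does not say whether failed cases are excluded), and it uses the sample standard deviation as an assumption.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize(pairs: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (mean improvement, std of improvement, success rate in %).

    Each pair is (score of the starting molecule, score of the optimized one).
    Averaging over all pairs is an assumption made for this sketch.
    """
    gains = [after - before for before, after in pairs]
    # A molecule counts as a success when its improvement is positive.
    success = 100.0 * sum(g > 0 for g in gains) / len(gains)
    return mean(gains), stdev(gains), success
```

For example, four molecules with improvements of 2, 3, -1, and 2 give a mean improvement of 1.5 and a success rate of 75%.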

Activity Optimization

Table 7 shows that seed molecules with a low average activity prediction score of 0.122 were optimized to molecules with an activity prediction score of 0.782, which is close to that of molecules with K_i < 100 nM. The percentage of molecules with an activity prediction score of 0.5 or higher among the generated molecules is also shown. The results of other methods are also shown but should only be used as a reference since the seed molecules are different. The optimization can also be seen in Figure 5, which shows the distribution of activity prediction scores for seed and optimized molecules. The execution time for optimization depends on both the number of steps and the evaluation function. For example, this compound optimization experiment for dopamine receptor D2 (DRD2) took an average of 175 s for each iteration of MCTS. The number of atoms of the generated fragments was 6.43 ± 2.65, and the molecular weight was 97 ± 39. Thus, it was found that a wide variety of fragments can be generated. This is because fragment generation is performed in a Monte Carlo tree search; therefore, different fragments are always generated at different steps, and the number of fragments generated does not differ significantly. It should be noted that these results are subject to the errors of the activity prediction model.

Table 7. Results of DRD2 Activity Optimization Experiments^a.

metrics	proposed method	Mol-CycleGAN
predicted activity score	0.782 (seed: 0.122)	0.362 (seed: 0.179)

The predicted activity score is calculated by the activity prediction model. The results of Mol-CycleGAN are obtained from each original paper. Note that the seed compounds of Mol-CycleGAN were different.

Figure 5. Distribution of predicted activity score for DRD2 for molecules optimized by the proposed method. The distribution of the starting molecules is represented as ‘initial.’

Discussion

Apart from the value of the evaluation function, we compared the initial and generated molecules in terms of similarity, given by the Tanimoto coefficient based on ECFP4 fingerprints. However, in practice, lead optimization is expected to maintain not only the similarity but also the potency and selectivity to the target. Although the similarity may be correlated with these properties, it is more appropriate to evaluate potency and selectivity directly. When the initial and generated molecules are compared by such a metric for drug efficacy, this method is not expected to retain a high percentage of the original drug efficacy. The search space of this method is narrowed down to molecule-like graphs by the GCN model and biased by MCTS toward regions with higher values of the evaluation function. Thus, if specific properties such as drug effects are to be considered, then they must be explicitly handled in the evaluation function. Masking the substructures that are important for the property of interest, so that they are preserved, is also effective.
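For readers unfamiliar with the similarity measure, the Tanimoto coefficient itself is simple to state in code. The sketch below assumes each fingerprint is already available as a set of on-bit indices (computing ECFP4 bits requires a cheminformatics toolkit and is not shown); returning 1.0 for two empty fingerprints is a convention chosen for this sketch.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    # Two empty fingerprints are treated as identical (sketch convention).
    return len(bits_a & bits_b) / union if union else 1.0
```

Two fingerprints that share two of four distinct on-bits, such as `{1, 2, 3}` and `{2, 3, 4}`, have a similarity of 0.5.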

Conclusions

Most molecular generative models do not consider starting from a given molecule and are not appropriate for situations such as lead optimization. Additionally, existing methods that start generation from a given molecule have problems such as low similarity to the starting molecule. Therefore, in this study, we developed a molecular optimization method based on molecular graphs that starts generation from arbitrary molecules. Optimization is performed in a fragment-wise tree search according to a given evaluation function. Fragments are generated individually by MCTS in an atom-by-atom manner. The GCN model trained on the fragment data set is used to improve the efficiency of fragment generation. In an experiment optimizing QED as an example of the evaluation function, the generated compounds not only improved the value of the evaluation function sufficiently but also had a high similarity to the starting compound. In addition, it was confirmed that this method searches closer to the starting compound than existing methods. For the synthetic pathway prediction, the results also show the advantages of a graph-based approach. Experiments demonstrated that this method allows for optimization not only of QED but also of aspects such as activity toward the target protein. Thus, this method is considered to be suitable for processes such as lead optimization, where compound candidates have already been obtained. Furthermore, this method can mask important structures identified beforehand so that they remain unchanged. This method can also be applied to other tasks, such as providing more promising compounds to compound libraries for virtual screening.

Methods

The molecular generation method developed in this study is a molecular-graph-based optimization method for a given arbitrary molecule based on an evaluation function. The method performs a tree search with the molecule to be optimized as the root by adding and deleting fragments based on the upper confidence bound (UCB) score. Fragments to be added are generated atom by atom with MCTS, and the efficiency is enhanced by using a GCN model. Generating fragments atom by atom allows the molecule to be further optimized locally (with only partial conformational changes).

Monte Carlo Tree Search

MCTS³⁹ is a model-based reinforcement learning algorithm and has been used as an effective policy improvement operator in deep reinforcement learning methods such as AlphaGo.⁴⁰ MCTS searches for the optimal sequence of states and actions by incrementally constructing a search tree with the initial state as the root. Each node has a state value and a number of visits, each initialized to 0. The following four steps are repeated as one cycle until a given convergence condition is satisfied.

1. Selection: select one leaf node from the current search tree according to a criterion known as the Tree Policy.
2. Expansion: add a child node to the selected node.
3. Simulation: starting from the newly added node, expand states down to a terminal state according to the Default Policy, without adding them to the search tree; this is called a rollout.
4. Backpropagation: update the evaluation values of the nodes on the path from the root node to the selected node using the obtained evaluation value, and increment their numbers of visits by one.

The Tree Policy must balance exploration and exploitation. The UCB1 score⁴¹ is used for this purpose and is expressed in the following manner:

$$\mathrm{UCB1} = \bar{x} + \sqrt{\frac{2 \ln n_p}{n}}$$

where x̅ is the average reward of the node, and n and n_p are the numbers of visits to the node and to its parent node, respectively. MCTS is guaranteed to converge to the optimal solution given a sufficient number of steps even when the Default Policy is random selection, but this is not feasible in practice. Machine learning models have often been used in recent years to search efficiently.
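As an illustration of the selection and backpropagation steps, the following Python sketch uses a UCB1-style score in the standard form of Auer et al. (mean reward plus an exploration bonus of sqrt(2 ln n_p / n)). The `Node` class, the function names, and the exploration constant `c` are assumptions introduced for this example and are not taken from the authors' code.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One search-tree node: accumulated reward, visit count, and children."""
    total_reward: float = 0.0
    visits: int = 0
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def ucb1(node: Node, c: float = 1.0) -> float:
    """Mean reward plus exploration bonus; unvisited nodes score +inf."""
    if node.visits == 0:
        return math.inf
    parent_visits = node.parent.visits if node.parent else node.visits
    mean = node.total_reward / node.visits
    # c is a hypothetical exploration constant; c = 1 gives plain UCB1.
    return mean + c * math.sqrt(2.0 * math.log(parent_visits) / node.visits)


def select_leaf(root: Node) -> Node:
    """Selection step: descend by maximum UCB1 until a leaf is reached."""
    node = root
    while node.children:
        node = max(node.children, key=ucb1)
    return node


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Backpropagation step: update every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

Scoring unvisited nodes as infinity forces every child to be tried once before the mean reward starts to influence the choice, which is a common convention rather than something stated in the text.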

GCN

GCN⁴² is a neural network comprising layers that perform convolution operations defined over graph data. Unlike images or sequences, graphs do not have a fixed relationship between neighboring elements, so convolution cannot be applied to them directly. There are two types of convolutions on graphs: one dealing with signals over graphs and the other based on the spatial structure of graphs. Here, we describe the spatial graph convolution used in this study. The output of layer l + 1 for node i depends only on its neighbors and is defined as follows:⁴³

$$h_i^{l+1} = \sigma\left(b^{l} + \sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ji}} h_j^{l} W^{l}\right)$$

where h^l and h^l+1 are the outputs of layers l and l + 1, b^l is the bias of layer l, W^l is the weight parameter, N(i) is the set of neighbors of node i, c_ji is the product of the square roots of the degrees of nodes j and i, and σ is the activation function.
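A minimal, dependency-free sketch of one such layer is given below. It follows the definitions in the preceding paragraph (neighbor sum, normalization by the square roots of the node degrees, linear map, bias, activation) and assumes ReLU as the activation function; the function name `graph_conv` and the dictionary-based graph representation are choices made for this example only.

```python
import math
from typing import Dict, List

Vector = List[float]


def graph_conv(
    h: Dict[int, Vector],
    neighbors: Dict[int, List[int]],
    weight: List[Vector],
    bias: Vector,
) -> Dict[int, Vector]:
    """One spatial graph-convolution layer with ReLU activation.

    h[i] is the feature vector of node i; weight has shape (in_dim, out_dim).
    """
    in_dim, out_dim = len(weight), len(bias)
    out: Dict[int, Vector] = {}
    for i, nbrs in neighbors.items():
        agg = [0.0] * in_dim
        for j in nbrs:
            # c_ji = sqrt(deg(j)) * sqrt(deg(i))
            c_ji = math.sqrt(len(neighbors[j])) * math.sqrt(len(nbrs))
            for k, value in enumerate(h[j]):
                agg[k] += value / c_ji
        # Linear map, bias, then ReLU.
        out[i] = [
            max(0.0, bias[o] + sum(agg[k] * weight[k][o] for k in range(in_dim)))
            for o in range(out_dim)
        ]
    return out
```

On a three-node path graph with scalar features 1, 2, and 3 and an identity weight, the middle node receives (1 + 3) / sqrt(2), since both of its neighbors have degree 1 and it has degree 2.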

Fragment-Wise Search

The fragment-wise search (Figure 6) is the core process of the method. Given an arbitrary molecule and an evaluation function as input, optimization is performed by editing the molecular graph fragment by fragment. The state of each node in the search tree corresponds to a molecule, and each node has a state value and a number of visits. The actions in the tree search are the removal and addition of fragments. The tree search, rooted at the starting molecule, repeats the following process as one cycle. The molecules corresponding to newly added nodes are treated as the optimized molecules.

1. Selection: one leaf node is selected from the current search tree based on the UCB1 score.
2. Expansion: the molecules obtained by adding fragments to or removing fragments from the molecule corresponding to the selected node are added as child nodes.
- Removing fragments (Figure 7): each single bond in the molecule corresponding to the selected node is broken in turn. Only when breaking the bond splits the molecule into two parts is the part with the larger number of atoms added as a new child node, representing the fragment-removed molecule.
- Adding fragments (Figure 8): a molecule with an added fragment is generated by the fragment generation module and is treated as the molecule corresponding to a child node added to the search tree. The details are described in the atom-wise search.
The number of child nodes for removal is finite and small, e.g., an average of approximately 9 in the ZINC data set,⁴⁴ but the number of child nodes for addition is very large because of the number of molecules generated by the fragment generation module. Therefore, a subset of the molecules must be selected according to some criterion. In this study, random selection and selection by high values of the evaluation function were used, in the manner of ϵ-greedy.⁴⁵
3. Evaluation: the evaluation value of the molecule of each newly added child node is used directly as the reward, without a rollout. One reason is that, unlike board games, where the terminal state can be easily determined, it is difficult to perform a rollout in molecular generation because the terminal state is ambiguous. For example, a benzene ring is a nonterminal state when naphthalene is being generated but a terminal state when benzene itself is being generated. Existing methods use machine learning models to determine whether a state is terminal, or they fix the number of steps. However, when the search starts from an arbitrary molecule and proceeds by both adding and removing atoms, as in this method, determining whether the search has terminated is difficult because the initial molecule is already complete. Another reason is that deep nodes are likely to deviate significantly from the initial molecule, which is contrary to the purpose of this study.
4. Backpropagation: for all nodes on the path from the root node to the selected node, the state value and number of visits are updated using the maximum of the rewards calculated in the evaluation step. The following transformations are applied to keep the reward value in the range of −1 to 1:

Figure 6. Optimization is performed by tree search with addition and removal of fragments.

Figure 7. Removal of fragments in a fragment-wise tree search.

Figure 8. Addition of fragments in a fragment-wise tree search.
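The fragment removal rule of the expansion step can be sketched on a plain heavy-atom graph, without a cheminformatics toolkit. In this hedged example, atoms are integer indices, bonds map an atom pair to a bond order, and the function names are hypothetical; how ties between equally sized parts are resolved is not stated in the text, so the sketch simply keeps the side containing the first atom of the bond.

```python
from typing import Dict, List, Set, Tuple

Bond = Tuple[int, int]


def component(start: int, atoms: Set[int], bonds: Set[Bond]) -> Set[int]:
    """Atoms reachable from `start` over the given bonds (depth-first search)."""
    adjacency: Dict[int, List[int]] = {a: [] for a in atoms}
    for u, v in bonds:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def removal_candidates(atoms: Set[int], bond_orders: Dict[Bond, int]) -> List[Set[int]]:
    """For every single bond whose cleavage splits a connected molecule in two,
    return the atom set of the larger part (ties keep the first side)."""
    candidates = []
    for bond, order in bond_orders.items():
        if order != 1:
            continue  # only single bonds may be broken
        remaining = set(bond_orders) - {bond}
        side_a = component(bond[0], atoms, remaining)
        if bond[1] in side_a:
            continue  # ring bond: the molecule stays connected
        side_b = atoms - side_a
        candidates.append(side_a if len(side_a) >= len(side_b) else side_b)
    return candidates
```

For a toluene-like graph (a six-membered Kekulé ring with one substituent atom), the only candidate is the bare ring, because the ring single bonds do not split the molecule when broken.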

Atom-Wise Search

Atom-wise search generates the fragments that are added to molecules in the expansion step of MCTS in the fragment-wise search (Figure 9). Each node is assigned one atom, so that one path corresponds to one fragment. In one step, a new atom and the bonds associated with that atom are predicted and added to the fragment in its intermediate state. In the expansion step, atoms that cannot form further bonds under the valence rule are excluded. Atom-wise MCTS repeats the following process as one cycle.

1. Selection: as with fragment-wise MCTS, one leaf node is selected from the current search tree based on the UCB1 score.
2. Expansion: the molecular graph corresponding to the selected node is used as input to predict the next atom and its bonds with the GCN model. The new molecules obtained by adding the predicted atoms and bonds are added to the search tree as child nodes of the selected node.
3. Simulation: a rollout is performed using the same GCN model as in the expansion step to evaluate the added child nodes. Unlike in the fragment-wise search, the fragments are generated from scratch; therefore, the terminal state can be defined in the same way as in existing methods. A state is treated as terminal if the GCN model predicts an empty atom or predicts no bonds. The generated fragment is added to the molecule selected in the selection step of MCTS in the fragment-wise search to obtain a new molecule, and the value of the evaluation function for that molecule is used as the reward.
4. Backpropagation: similar to the fragment-wise search, the value and number of visits are updated for all nodes on the path from the root node to the selected node.

Figure 9. Fragment generation in an atom-wise manner.
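The terminal condition used in the simulation step can be illustrated with a small rollout loop. In this sketch the GCN model is replaced by an arbitrary `predict` callable, a fragment is a list of (atom symbol, bonds) steps, and `max_atoms` is a safety cap; all three are assumptions made for illustration. The sketch also lets the very first atom of an empty fragment carry no intra-fragment bonds, which the text does not specify.

```python
from typing import Callable, List, Optional, Tuple

# One step adds an atom plus its bonds to earlier atoms: (index, bond order).
Bonds = List[Tuple[int, int]]
Step = Tuple[str, Bonds]
Predictor = Callable[[List[Step]], Tuple[Optional[str], Bonds]]


def rollout(fragment: List[Step], predict: Predictor, max_atoms: int = 20) -> List[Step]:
    """Extend a partial fragment atom by atom until a terminal state.

    A state is terminal when the predictor returns no atom (the "empty atom"
    label) or returns an atom without any bonds.
    """
    fragment = list(fragment)  # do not modify the caller's list
    while len(fragment) < max_atoms:
        atom, bonds = predict(fragment)
        if atom is None or (fragment and not bonds):
            break
        fragment.append((atom, bonds))
    return fragment


def toy_predict(fragment: List[Step]) -> Tuple[Optional[str], Bonds]:
    """Stand-in for the model: grow a three-carbon chain, then terminate."""
    if len(fragment) >= 3:
        return None, []
    return "C", ([(len(fragment) - 1, 1)] if fragment else [])
```

Calling `rollout([], toy_predict)` grows a chain of three carbon atoms joined by single bonds and then stops on the empty-atom prediction.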

Fragment Generation by the GNN Model

This section details the GCN model used for the expansion and rollout of nodes in MCTS in the atom-wise search. An overview of the process is shown in Figure 10.

Flow of Prediction by the GCN Model

The GCN model comprises three modules that perform feature extraction, atom prediction, and bond prediction.

1. Feature extraction: the feature extraction module takes a molecular graph x as input and outputs the hidden states of the nodes h_n and of the graph h_g through a GCN. The graph hidden state is computed from the node hidden states using the aggregation function as follows:

$$h_n = \mathrm{GCN}(x), \qquad h_g = \mathrm{AGG}(h_n)$$

2. Atom prediction: the atom prediction module takes the hidden state of the graph as input and outputs the type of atom as a probability by passing it through the fully connected layer (FC_a(·)). The dimension of the output is (number of atom types) + 1, where the additional label indicates termination.
3. Bond prediction: the bond prediction module takes the predicted atom and the hidden states of the nodes as input and predicts the bonds through the RNN layer. The initial state vector s of the RNN is the concatenation of the vector obtained by transforming the predicted atom with the embedding layer (Emb(·)) and the hidden representation of the graph h_g. The input of the RNN is the sequence of node hidden state vectors h_n, arranged according to the BFS order of the nodes in the input graph. The output is the type of bond as a probability, whose dimension is (number of bond types) + 1, where the additional label indicates that there is no bond.

The node features of the molecular graph are one-hot vectors of atom types. Aromatic bonds are not represented explicitly; instead, aromatic rings are represented as Kekulé structures, in which single and double bonds alternate.

Training

The GCN model used for fragment generation needs to be trained on a fragment data set rather than on whole molecules. Fragment data sets are created from existing molecule data sets by the following procedure.

1. For each molecule, a set of fragments is obtained by breaking all bonds that connect rings and nonrings (Figure 11).

The other processes are the same as those for general training of models.

Figure 11. Examples of fragments used to train the GCN model. In this example, cutting all of the bonds between rings and nonrings yields five fragments.
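The ring/nonring cutting rule can be sketched on a plain heavy-atom graph. This is not the authors' implementation: it assumes a connected molecule given as integer atom indices and a set of bonds, identifies ring atoms as the endpoints of bonds whose removal does not disconnect the graph, and uses a simple quadratic-time check that is adequate only for small molecules. All function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Bond = Tuple[int, int]


def components(atoms: Set[int], bonds: Set[Bond]) -> List[Set[int]]:
    """Connected components of the graph, ordered by smallest atom index."""
    adjacency: Dict[int, List[int]] = {a: [] for a in atoms}
    for u, v in bonds:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen: Set[int] = set()
    result = []
    for start in sorted(atoms):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            for nxt in adjacency[stack.pop()]:
                if nxt not in comp:
                    comp.add(nxt)
                    stack.append(nxt)
        seen |= comp
        result.append(comp)
    return result


def ring_atoms(atoms: Set[int], bonds: Set[Bond]) -> Set[int]:
    """Atoms on at least one ring: endpoints of bonds that are not bridges."""
    in_ring: Set[int] = set()
    n_before = len(components(atoms, bonds))
    for bond in bonds:
        if len(components(atoms, bonds - {bond})) == n_before:
            in_ring.update(bond)  # removing this bond disconnected nothing
    return in_ring


def split_into_fragments(atoms: Set[int], bonds: Set[Bond]) -> List[Set[int]]:
    """Cut every bond that joins a ring atom to a nonring atom."""
    rings = ring_atoms(atoms, bonds)
    kept = {b for b in bonds if (b[0] in rings) == (b[1] in rings)}
    return components(atoms, kept)
```

For a six-membered ring carrying a two-atom chain and a one-atom substituent, the sketch returns three fragments: the ring, the chain, and the single atom.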

Experimental Section

Training of the GCN Model

This section describes the architecture of the GCN model used in the atom-wise search and its training. The GCN model is used in fragment generation and aims to capture the features of molecules without maximizing the efficiency of a specific evaluation function. Therefore, the same model was used for all experiments in this study.

Architecture of the GCN Model

As described in the Methods section, the model comprises three modules: feature extraction, atom prediction, and bond prediction. The details of each are described below, and the architecture of the model is shown in Figure 12.

Feature extraction: the input molecular graph has nine-dimensional node features and three-dimensional edge features. The transformation with a 6-layer MPNN outputs 128-dimensional node hidden state vectors h_n (N_n, 128), where N_n is the number of nodes in the input graph. Sum pooling is then applied to obtain the graph hidden state vector h_g (1, 128).
Atom prediction: the atom prediction module comprises two fully connected layers. The hidden layer has 64 dimensions and uses the ReLU function as the activation function. The output y_a is ten-dimensional, comprising nine types of atoms and an empty atom that indicates termination. A softmax function is applied to y_a to obtain the probability of each atom.
Bond prediction: the bond prediction module comprises a two-layer GRU⁴⁶ followed by two fully connected layers. The initial state vector of the GRU is obtained by concatenating the graph hidden state vector h_g with the 64-dimensional embedded representation of the predicted atom, Emb(y_a), and converting the result to 256 dimensions with a single fully connected layer. The output of the GRU is transformed to y_b by the fully connected layers, whose hidden layer has 64 dimensions and uses the ReLU function as the activation function. The output is four-dimensional, including a label indicating no bond. The probability of each bond type is obtained by applying a softmax function.

Figure 12. Architecture of the GCN prediction model.
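The data flow and tensor sizes described in this section can be summarized compactly. The summary below is derived only from the layer sizes stated in the text; the names FC_s and FC_b are shorthand introduced here, and the assumption that one four-way bond prediction is produced per existing node is an interpretation of the BFS-ordered RNN input rather than a confirmed detail.

```latex
\begin{aligned}
h_n &= \mathrm{MPNN}(x) \in \mathbb{R}^{N_n \times 128}, \\
h_g &= \sum_{i=1}^{N_n} h_{n,i} \in \mathbb{R}^{1 \times 128}, \\
y_a &= \mathrm{FC}_a(h_g) \in \mathbb{R}^{10}, \qquad p_a = \mathrm{softmax}(y_a), \\
s &= \mathrm{FC}_s\big([\,h_g \,;\, \mathrm{Emb}(y_a)\,]\big) \in \mathbb{R}^{256},
  \qquad [\,h_g \,;\, \mathrm{Emb}(y_a)\,] \in \mathbb{R}^{128 + 64}, \\
y_b &= \mathrm{FC}_b\big(\mathrm{GRU}(h_n;\, s)\big) \in \mathbb{R}^{N_n \times 4},
  \qquad p_b = \mathrm{softmax}(y_b).
\end{aligned}
```

The ten atom outputs correspond to the nine atom types of the node features plus the termination label, and the four bond outputs to the three bond types of the edge features plus the no-bond label.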

Training Data

The ZINC database⁴⁴ used for training is a database for virtual screening and contains over 750 million molecules. Approximately 250,000 molecules, which are commonly used in molecular generative models, were used for training. This data set is the same as that used in ChemTS³⁴ and is publicly available at https://github.com/tsudalab/ChemTS. A total of 22,234 fragments, obtained by applying the procedure described in the Methods section to each molecule, were used for training. Properties of the fragment data set used for training the GCN model are shown in Table 1. An example of the fragments used for training is shown in Figure 11.

Table 1. Properties of the Fragment Data Set Used for Training the GCN Model.

metrics	value
molecular weight	158 ± 45
number of atoms	10.6 ± 3.2

Training Process

The labels are the atom type and the bond type. Each loss was calculated with the cross-entropy loss function, and the sum of the two losses was used as the overall loss. The ratio of training to test data was 4:1, and the parameters were updated with the Adam optimizer. The model was trained in a teacher-forcing⁴⁷ manner, in which label data are used as inputs wherever one prediction depends on another. Because the prediction of the bond depends on the prediction of the atom, the input for the bond prediction was the label of the atom rather than the output of the atom prediction module. The other training settings were as follows.

Learning rate: 0.0001
Batch size: 128
Epoch: 50

In the experiments, the parameters obtained after 20 epochs of training were used.
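The loss described above can be written out for a single generation step. The sketch below computes softmax cross-entropy directly from raw scores in pure Python; summing the bond losses over the nodes (rather than averaging them) is an assumption, and the function names are hypothetical. Under teacher forcing, the bond scores passed in would be those produced when the true atom label, not the predicted atom, is fed to the bond prediction module.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one example, computed in a stable way."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def step_loss(
    atom_logits: Sequence[float],
    atom_label: int,
    bond_logits: List[Sequence[float]],
    bond_labels: List[int],
) -> float:
    """Overall loss for one step: atom loss plus the summed bond losses."""
    atom_loss = cross_entropy(atom_logits, atom_label)
    bond_loss = sum(cross_entropy(z, y) for z, y in zip(bond_logits, bond_labels))
    return atom_loss + bond_loss
```

With uninformative (all-zero) scores, the atom loss equals ln 10 for the ten atom classes and each bond loss equals ln 4 for the four bond classes, which gives a quick sanity check for an untrained model.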