J Chem Inf Model. 2024 Sep 10;64(18):6969–6978. doi: 10.1021/acs.jcim.4c00842

IEV2Mol: Molecular Generative Model Considering Protein–Ligand Interaction Energy Vectors

Mami Ozawa, Shogo Nakamura, Nobuaki Yasuo§, Masakazu Sekijima†,*
PMCID: PMC11423338  PMID: 39254942

Abstract

Generating drug candidates with desired protein–ligand interactions is a significant challenge in structure-based drug design. In this study, a new generative model, IEV2Mol, is proposed that incorporates interaction energy vectors (IEVs) between proteins and ligands obtained from docking simulations, which quantitatively capture the strength of each interaction type, such as hydrogen bonds, electrostatic interactions, and van der Waals forces. By integrating this IEV into an end-to-end variational autoencoder (VAE) framework that learns the chemical space from SMILES and minimizes the reconstruction error of the SMILES, the model can more accurately generate compounds with the desired interactions. To evaluate the effectiveness of IEV2Mol, we performed benchmark comparisons with randomly selected compounds, an unconstrained VAE model (JT-VAE), and compounds generated by an RNN model based on interaction fingerprints (IFP-RNN). The results show that the compounds generated by IEV2Mol retain the binding mode of the query structure at a significantly higher rate than those of the other methods. Furthermore, IEV2Mol was able to generate compounds with interactions similar to those of the input compounds, regardless of structural similarity. The source code and trained models for IEV2Mol, JT-VAE, and IFP-RNN designed for generating compounds active against DRD2, AA2AR, and AKT1, as well as the data sets (DM-QP-1M, active compounds for each protein, and ChEMBL33) utilized in this study, are released under the MIT License and available at https://github.com/sekijima-lab/IEV2Mol.

Introduction

Structure-based drug discovery (SBDD) plays a critical role in drug development.1–4 SBDD is a method that uses the three-dimensional structure of target proteins to optimize their interactions with ligands and to rationally design new drug candidates to simultaneously meet a wide range of optimization goals, such as activity, selectivity, and physical properties.5,6

SBDD has been successfully applied in many drug discovery projects.7 For example, in the development of celecoxib, a selective COX-2 inhibitor,8 structure–activity relationship studies of 1,5-diarylpyrazole derivatives focused on optimizing physical properties while maintaining COX-2 inhibitory activity and selectivity, ultimately leading to the discovery of a compound with high inhibitory activity and selectivity.9

In the development of therapeutics targeting the 3CL protease of SARS-CoV-2 in response to the global COVID-19 pandemic,10,11 docking-based virtual and biological screening using the SBDD approach identified several hits with IC50 values below 10 μM.12 One of the hits, compound 1, was selected for clinical development. Compound 1 was structurally optimized by SBDD using X-ray cocrystal structures, resulting in a more than 600-fold increase in activity, ultimately yielding S-217622, a nonpeptidic, noncovalently bound oral 3CLpro inhibitor.

Docking simulations using existing libraries and the resulting in vitro and in vivo assays have successfully identified promising hits in many studies.13–16 However, the best compounds are not always included in the library. To achieve the ultimate goal of drug discovery, i.e., to discover novel drug candidates with desirable pharmacological properties and few side effects, innovative methods that can efficiently explore the vast space of 10^60 chemical compounds17 and generate highly relevant compounds are needed.

In recent years, with the rapid development of artificial intelligence (AI) technology, its application to molecular design has attracted considerable attention. In particular, deep learning models such as recurrent neural networks (RNNs),18–20 variational autoencoders (VAEs),21,22 generative adversarial networks (GANs),23–25 and graph neural networks (GNNs)26 have successfully generated novel compounds using molecular graphs and simplified molecular input line entry system (SMILES)27 representations and are expected to be powerful tools for molecular design.

These models can learn a distribution of chemical structures from a large compound database and generate new structures. A ligand generation method that combines transfer learning and docking score optimization has been proposed to design new ligands by using deep learning to exploit the pocket information on target proteins, even in the absence of known ligand information for the target protein, by using the ligand information on proteins belonging to the same family.28 On the other hand, there are many cases where the interaction of a specific residue with a ligand should be considered.

A conditional recurrent neural network (cRNN) model utilizing ligand/protein interaction fingerprinting (IFP)29–33 has been proposed to generate compounds that interact with target proteins.34 IFP is a binary vector based on the docking pose that is automatically constructed to indicate whether a ligand interacts with a protein, and incorporating this vector into the cRNN model enables the generation of novel ligand structures with the desired binding mode for a specific target. However, the IFP only considers the presence or absence of interactions and does not reflect the strength of the interaction. Virtual screening has already shown that considering the strength of interactions is more accurate than considering the IFP alone.35

In this study, we introduce the interaction energy vector (IEV) as a descriptor that quantitatively measures the strength of the interaction between a protein and a ligand. The IEV is obtained from docking simulations. It is calculated for each type of interaction, including hydrogen bonding, electrostatic interactions, and van der Waals forces.35 This vector provides a comprehensive representation of the protein–ligand interaction landscape, consisting of elements corresponding to each interaction type. This IEV is used as input to a variational autoencoder (VAE) model, which is trained end-to-end with another VAE that learns the chemical space from SMILES representations. By minimizing the reconstruction error of the SMILES in the joint latent space during training, the model learns to generate compounds with the desired interactions more accurately.

To evaluate the effectiveness of the proposed method, we performed benchmark comparisons with randomly selected compounds, the unconstrained VAE model (JT-VAE), and compounds generated by the IFP-RNN. The results confirmed that the compounds generated by the proposed method have a significantly greater rate of maintaining the binding mode of the query structure. These results indicate that IEV2Mol has the potential to be a useful tool for the generation of novel compounds with desirable protein–ligand interactions, and it is expected to contribute to the efficiency of the drug discovery process.

Materials and Methods

Docking Simulation

This research used docking simulations in Glide SP mode to calculate the interactions between the target protein and the generated compounds.36 The protein preparation wizard was used for hydrogen addition and structural minimization. LigPrep was used to generate ligand tautomeric and ionization states, ring conformations, and stereoisomers at pH 7.4. Further analysis was conducted using the best-scored pose.

Construction of the Interaction Energy Vector

The interaction fingerprint is used to quantify the similarity of protein–ligand interactions. It is a bit-string representation based on the presence or absence of interactions. The interaction energy vector (IEV) is a descriptor inspired by the interaction fingerprint and was developed by Yasuo and Sekijima.35 Whereas the interaction fingerprint is a sequence of 0/1 bits that records only the presence of each interaction, the IEV is a real-valued vector of protein–ligand interaction energies. The IEV is calculated using the following procedure. First, for each amino acid residue within 12 Å of the center of the docking grid, van der Waals, Coulomb, and hydrogen bond energy values are calculated for all atomic pairs between the docked compound and the residue. The three interaction energy values for each amino acid residue are then obtained by summing the energy terms over the atoms within that residue. The van der Waals, Coulomb, and hydrogen bond terms are then arranged in order of the residue number assigned to each amino acid residue in the target protein’s PDB file. In other words, the length of the IEV is three times the number of amino acid residues within 12 Å of the center of the docking grid; because this number varies among target proteins, the IEV is a representation specific to each protein. These calculations can be performed in docking simulations using Glide.36,37
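
The layout described above can be illustrated with a minimal sketch that flattens per-residue energy terms into a single vector. The function name `build_iev`, the input dictionary, the example energies, and the choice to group the three terms residue by residue are assumptions made for illustration; the text states only that the terms are ordered by residue number, and in practice the per-residue energies come from the Glide output.

```python
from typing import Dict, List, Tuple


def build_iev(residue_energies: Dict[int, Tuple[float, float, float]]) -> List[float]:
    """Flatten per-residue (vdW, Coulomb, H-bond) energies into one vector.

    Residues are ordered by their PDB residue number. Grouping the three
    terms residue by residue is an assumption of this sketch.
    """
    iev: List[float] = []
    for residue_number in sorted(residue_energies):
        vdw, coulomb, hbond = residue_energies[residue_number]
        iev.extend([vdw, coulomb, hbond])
    return iev


# Hypothetical energies for three residues near the docking grid center.
example = {
    114: (-2.1, -0.4, 0.0),
    86: (-0.8, 0.1, -1.2),
    389: (-1.5, -3.0, -0.6),
}
iev = build_iev(example)
```

The resulting vector always has three entries per residue, so its length depends only on how many residues fall within the cutoff for a given protein.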

Data Sets

In the present study, the DM-QP-1M data set was used so that the model learns a broad chemical space and does not overfit to the chemical space of active compounds.

In addition, dopamine receptor D2 (DRD2), adenosine A2A receptor (AA2AR), and AKT serine/threonine kinase 1 (AKT1) were selected as the targets for evaluating the method, and the active compound data set, a data set of active compounds for each protein, was created. The DM-QP-1M data set and active compound data set are available on GitHub.

The DM-QP-1M data set consists of 981,139 compounds obtained by the following procedure. First, 1,000,000 compounds were randomly selected from the DM-QP data set, a data set of drug-like compounds created by Lee et al.38 Second, if a SMILES notation contained multiple compounds, such as solvents, only the compound with the highest molecular weight was kept. Finally, 18,861 duplicate compounds were removed.

The active compound data set was created in this study and is a data set of compounds with activity against each protein. It consists of 8,350, 6,640, and 3,576 compounds for DRD2, AA2AR, and AKT1, respectively, obtained by the following procedure.

First, SMILES of compounds with available Ki or IC50 values, which are indicators of protein binding activity, were obtained from the ChEMBL database.39 The active compounds for DRD2 were obtained as of September 25, 2023, while those for AA2AR and AKT1 were obtained as of June 21, 2024. Next, if a SMILES notation contained multiple compounds, such as solvents, only the compound with the highest molecular weight was retained. Duplicate compounds were then removed. The 3D structures were then generated using LigPrep, and docking simulations with DRD2 (PDB ID: 6CM4), AA2AR (PDB ID: 3EML), and AKT1 (PDB ID: 3CQW) were performed in Glide HTVS mode to calculate the IEV. When the docking simulation generated multiple poses for a single compound, the pose with the most negative docking score was selected. The IEVs resulting from docking to DRD2, AA2AR, and AKT1 had 189, 177, and 207 dimensions, respectively. Finally, for the ligands of each protein, the active compounds were divided into 100 clusters in the 1024-bit ECFP4 space using k-means clustering; the compound closest to the centroid of each cluster was selected to create the test data set (100 compounds in total), and the remaining compounds formed the training data set.
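
Two of the bookkeeping steps above, choosing the pose with the most negative docking score and relating the IEV length to the number of residues, can be sketched as follows. The `Pose` container, the function names, and the example scores are hypothetical; the three-terms-per-residue relationship follows from the IEV construction described earlier.

```python
from typing import Dict, List, NamedTuple


class Pose(NamedTuple):
    docking_score: float  # more negative means a better score
    iev: List[float]


def best_pose(poses: List[Pose]) -> Pose:
    """Return the pose with the most negative docking score."""
    return min(poses, key=lambda pose: pose.docking_score)


def residue_count(iev_length: int) -> int:
    """Number of residues implied by an IEV with three terms per residue."""
    if iev_length % 3 != 0:
        raise ValueError("IEV length must be a multiple of 3")
    return iev_length // 3


# Hypothetical poses for one compound docked to DRD2 (189-dimensional IEV).
poses = [Pose(-6.2, [0.0] * 189), Pose(-7.9, [-0.1] * 189), Pose(-5.4, [0.2] * 189)]
selected = best_pose(poses)

# IEV lengths reported in the text for each target.
iev_lengths: Dict[str, int] = {"DRD2": 189, "AA2AR": 177, "AKT1": 207}
residues = {target: residue_count(n) for target, n in iev_lengths.items()}
```

Under this reading, the reported IEV lengths correspond to 63, 59, and 69 residues within the cutoff for DRD2, AA2AR, and AKT1, respectively.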

Model Architecture

The IEV2Mol architecture consists of SMILES-VAE, IEV-VAE, and Z-DNN (Figure 1).

Figure 1. Schematic representation of the IEV2Mol framework, which integrates three key modules (SMILES-VAE, IEV-VAE, and the Z-DNN) for the generation of novel molecular compounds with desired protein–ligand interaction energy values. SMILES-VAE learns the chemical space distribution from SMILES representations, while IEV-VAE captures the interaction energy vector (IEV) distribution of active compounds. The Z-DNN combines the latent representations from both VAEs to generate new compounds with targeted interaction profiles.

SMILES-VAE

SMILES-VAE is a VAE module that takes SMILES notations as the input and output; the VAE model of the MOSES40 benchmark was used. The encoder of the SMILES-VAE is composed of bidirectional GRU layers and fully connected layers, and the decoder is composed of three GRU layers with dropout. The latent space has 128 dimensions. SMILES-VAE is used to learn the chemical space distribution and to generate SMILES from it.

IEV-VAE

IEV-VAE is a VAE module that uses the IEV as the input and output. The encoder and decoder are implemented using a combination of convolution layers (Conv) and fully connected layers (FC), along with dropout, batch normalization (BatchNorm), and activation layers, such as ReLU and SELU.

The Conv layer here is a one-dimensional convolution layer. Eq 1 defines the transformed feature at the $i$th position, $h'_i$, as the dot product between the kernel $W_k$ and the input feature vector $h_i$, which can be expressed as $W_k \cdot h_i$.

$$h'_i = W_k \cdot h_i \qquad (1)$$

The BatchNorm layer performs batch normalization, where each $i$th feature value $h_i$ of the data in a batch is standardized by the operation of eq 2 so that the mean is 0 and the variance is 1 within the batch. Here, $\mathrm{E}[h_i]$ and $\mathrm{Var}[h_i]$ are the mean and variance of $h_i$ over the batch, and $\epsilon$ = 1e-5.

$$\hat{h}_i = \frac{h_i - \mathrm{E}[h_i]}{\sqrt{\mathrm{Var}[h_i] + \epsilon}} \qquad (2)$$

In the dropout layer, neurons are randomly dropped at a certain rate during training.

The FC layer in this study refers to the fully connected layer. Here, the input vector $h$ is linearly transformed using the learned weight matrix $W$ and the bias vector $b$ as in eq 3.

$$h' = W h + b \qquad (3)$$

In the ReLU layer, the input vector $h$ is activated using the ReLU activation function, as shown in eq 4.

$$\mathrm{ReLU}(h) = \max(0, h) \qquad (4)$$

In the SELU layer, the input vector $h$ is activated using the SELU activation function, as shown in eq 5. In this case, α = 1.6732632423543772848170429916717 and λ = 1.0507009873554804934193349852946 in eq 5, which are the same values proposed in the original article.41

$$\mathrm{SELU}(h) = \lambda \begin{cases} h & \text{if } h > 0 \\ \alpha\,(e^{h} - 1) & \text{if } h \le 0 \end{cases} \qquad (5)$$

The latent space has 56 dimensions. IEV-VAE is employed to learn and use the IEV distribution of active compounds.
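
For readers who prefer code to formulas, the element-wise operations in eqs 2, 4, and 5 can be sketched in plain Python. This is only an illustration of the formulas, not the PyTorch layers used in the model; the function names are hypothetical, the batch normalization shown omits any learnable scale and shift, and the SELU constants are the published values truncated to double precision.

```python
import math
from typing import List

# SELU constants from the original SELU article, truncated to float precision.
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805


def batch_norm(feature: List[float], eps: float = 1e-5) -> List[float]:
    """Standardize one feature across a batch (eq 2), without affine terms."""
    mean = sum(feature) / len(feature)
    var = sum((v - mean) ** 2 for v in feature) / len(feature)
    return [(v - mean) / math.sqrt(var + eps) for v in feature]


def relu(h: List[float]) -> List[float]:
    """Element-wise max(0, h) (eq 4)."""
    return [max(0.0, v) for v in h]


def selu(h: List[float]) -> List[float]:
    """Element-wise SELU (eq 5)."""
    return [
        SELU_LAMBDA * (v if v > 0 else SELU_ALPHA * (math.exp(v) - 1.0))
        for v in h
    ]


normalized = batch_norm([1.0, 2.0, 3.0])
activated = selu([-1.0, 0.0, 1.0])
```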

Z-DNN

The Z-DNN (Z-deep neural network) consists of three fully connected layers (FCs) and is designed to learn a mapping from the concatenated latent space of SMILES-VAE and IEV-VAE to the SMILES-VAE latent space. This mapping enables the generation of new compounds with the desired interaction energy profile, as represented by the IEV. Since the output of the Z-DNN has the same dimensionality as the SMILES-VAE latent space, it can be directly fed into the SMILES-VAE decoder to generate new compounds.

Model Construction

SMILES-VAE Pretraining

The DM-QP-1M data set was used for pretraining so that the model learns diverse representations of the chemical space; pretraining on the DM-QP-1M data set prevents the model from overfitting to the chemical space of the active compounds of the target protein and allows it to learn a wide chemical space. The SMILES-VAE was pretrained using the DM-QP-1M data set with a batch size of 512 and the Adam optimizer for 100 epochs.

As expressed in eq 6, the loss function $L_{\text{SMILES-VAE}}$ is a weighted sum of the reconstruction error $L_{\text{SMILES}}$, which is the cross-entropy between the correct SMILES and the predicted probability of each token in the output SMILES string (eq 7), and the Kullback–Leibler (KL) divergence $L_{\text{KL}}$ (eq 8).

$$L_{\text{SMILES-VAE}} = L_{\text{SMILES}} + \alpha L_{\text{KL}} \qquad (6)$$
$$L_{\text{SMILES}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i}\sum_{v=1}^{V} y_{i,t,v}\,\log p_{i,t,v} \qquad (7)$$
$$L_{\text{KL}} = -\frac{1}{2}\sum_{d=1}^{D}\left(1 + \log \Sigma_d - \mu_d^2 - \Sigma_d\right) \qquad (8)$$

where $N$ is the batch size, $T_i$ is the length of the SMILES string for the $i$th sample, $y$ is the input SMILES string (one-hot encoded over tokens), $p$ is the predicted probability of each token of the output SMILES string, $V$ is the size of the SMILES vocabulary, $D$ is the number of dimensions of the VAE latent space, and $\mu$ and $\Sigma$ are the mean and variance of the $D$-dimensional normal distribution output by the VAE, respectively. The KL divergence weight $\alpha$ was initially set to 0 and increased by 5e-3 per epoch. Additionally, the learning rate was fixed at 3e-4.
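
The KL weighting can be made concrete with a small sketch. It assumes a zero-based epoch index, a linear schedule with no upper cap, and a diagonal Gaussian posterior compared against a standard normal prior; none of these details is stated explicitly in the text, and the function names are hypothetical.

```python
import math
from typing import List


def kl_weight(epoch: int, start: float = 0.0, step: float = 5e-3) -> float:
    """KL weight at a zero-based epoch index under a linear schedule."""
    return start + step * epoch


def kl_divergence(mu: List[float], var: List[float]) -> float:
    """KL divergence from N(mu, diag(var)) to the standard normal."""
    return -0.5 * sum(1.0 + math.log(v) - m * m - v for m, v in zip(mu, var))


def smiles_vae_loss(reconstruction: float, mu: List[float],
                    var: List[float], epoch: int) -> float:
    """Weighted sum of the reconstruction error and the KL term."""
    return reconstruction + kl_weight(epoch) * kl_divergence(mu, var)


# With these assumptions the weight reaches 0.495 in the last of 100 epochs.
final_weight = kl_weight(99)
example_loss = smiles_vae_loss(2.0, [1.0], [1.0], epoch=10)
```

Starting the weight at zero lets the model first learn to reconstruct SMILES before the latent space is pulled toward the prior.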

IEV-VAE Pretraining

The IEV-VAE was pretrained using the active compound training data set with a batch size of 128 and the Adam optimizer for 100 epochs in this experiment.

As expressed in eq 9, the loss function $L_{\text{IEV-VAE}}$ is the sum of $L_{\text{IEV-L1}}$, the L1 loss between the correct IEV and the output IEV (eq 10), and the KL divergence $L_{\text{KL}}$ (eq 8).

$$L_{\text{IEV-VAE}} = L_{\text{IEV-L1}} + L_{\text{KL}} \qquad (9)$$
$$L_{\text{IEV-L1}} = \sum_{l} \left| x_l - \hat{x}_l \right| \qquad (10)$$

where $x_l$ is the $l$th value of the input IEV and $\hat{x}_l$ is the $l$th value of the output IEV. Additionally, the learning rate was initially 1e-3 and was multiplied by 0.9 every 2000 batches.
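
A minimal sketch of the L1 reconstruction term and the step-wise learning rate decay is given below. Summing (rather than averaging) the absolute differences and applying the decay at exact multiples of 2000 batches are assumptions of this sketch, and the function names are hypothetical.

```python
from typing import List


def l1_loss(target_iev: List[float], output_iev: List[float]) -> float:
    """Sum of absolute differences between the input and output IEV."""
    return sum(abs(x - x_hat) for x, x_hat in zip(target_iev, output_iev))


def learning_rate(batches_processed: int, initial: float = 1e-3,
                  factor: float = 0.9, interval: int = 2000) -> float:
    """Learning rate after a given number of processed batches."""
    return initial * factor ** (batches_processed // interval)


reconstruction_error = l1_loss([1.0, -2.0], [0.5, -1.0])
lr_after_4000 = learning_rate(4000)
```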

End-to-End Model Training

After pretraining of the SMILES-VAE and IEV-VAE was completed, the integrated model, consisting of the pretrained SMILES-VAE, IEV-VAE, and the Z-DNN, was trained using the active compound training data set with a batch size of 128 and the Adam optimizer for 100 epochs.

During training, the SMILES and IEV pairs of the compounds in the active compound data set were used as input to the pretrained SMILES-VAE and IEV-VAE encoders, respectively. Then, the latent representations output by the two encoders were concatenated and used as input to the Z-DNN. Finally, the variables output by the Z-DNN were decoded into SMILES representations by the SMILES-VAE decoder. As expressed in eq 11, the loss function $L_{\text{Model}}$ is defined by the reconstruction error of SMILES, $L_{\text{SMILES}}$, expressed in eq 7.

$$L_{\text{Model}} = L_{\text{SMILES}} \qquad (11)$$

Note that the weights of the pretrained SMILES-VAE encoder and the IEV-VAE are frozen, and only the weights of the SMILES-VAE decoder and the Z-DNN are trained. The initial learning rate was 1e-4 and was multiplied by 0.8 every 20 epochs.

Compound Generation

After training, IEV2Mol takes an IEV as input and generates compounds expected to have similar IEVs via the following procedure.

  • (i) The target IEV is used as input to the IEV-VAE encoder to obtain a latent representation.

  • (ii) Variables are randomly sampled from a standard normal distribution with the same number of dimensions as the SMILES-VAE latent space.

  • (iii) The variables obtained in steps (i) and (ii) are concatenated and used as input to the Z-DNN.

  • (iv) The output of the Z-DNN is taken as input to the SMILES-VAE decoder and decoded into a SMILES representation.
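
The four steps above can be sketched as a single function. The three model components are passed in as plain callables, and the `toy_*` stand-ins below exist only so that the sketch runs without trained models; they, the function name `generate`, and the concatenation order (SMILES latent first, IEV latent second) are assumptions and do not reflect the released implementation. The latent sizes of 128 and 56 are those stated in the text.

```python
import random
from typing import Callable, List, Sequence

SMILES_LATENT_DIM = 128  # SMILES-VAE latent size
IEV_LATENT_DIM = 56      # IEV-VAE latent size


def generate(
    target_iev: Sequence[float],
    iev_encoder: Callable[[Sequence[float]], List[float]],
    z_dnn: Callable[[List[float]], List[float]],
    smiles_decoder: Callable[[List[float]], str],
    rng: random.Random,
) -> str:
    z_iev = iev_encoder(target_iev)                                     # step (i)
    z_smiles = [rng.gauss(0.0, 1.0) for _ in range(SMILES_LATENT_DIM)]  # step (ii)
    z = z_dnn(z_smiles + list(z_iev))                                   # step (iii)
    if len(z) != SMILES_LATENT_DIM:
        raise ValueError("Z-DNN output must match the SMILES-VAE latent size")
    return smiles_decoder(z)                                            # step (iv)


# Stand-in components (hypothetical) so the sketch is runnable.
def toy_iev_encoder(iev: Sequence[float]) -> List[float]:
    return [sum(iev) / len(iev)] * IEV_LATENT_DIM


def toy_z_dnn(z: List[float]) -> List[float]:
    return z[:SMILES_LATENT_DIM]  # maps 184 inputs to 128 outputs


def toy_decoder(z: List[float]) -> str:
    return "CCO"  # a real decoder would emit a SMILES string token by token


rng = random.Random(0)
seed_iev = [-1.2, 0.3, -0.5] * 63  # hypothetical 189-dimensional IEV
candidates = [
    generate(seed_iev, toy_iev_encoder, toy_z_dnn, toy_decoder, rng)
    for _ in range(100)
]
```

Because step (ii) draws a fresh sample on every call, repeated calls with the same target IEV can yield different compounds.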

Technical Details

IEV2Mol was implemented using PyTorch42 along with the RDKit.43 The docking simulations used to calculate the IEVs for the active compound data set and for model evaluation were run on SUSE Linux Enterprise Server 12 SP2 with an Intel Xeon E5-2680 v4 CPU and four NVIDIA Tesla P100 GPUs (16 GB memory). Model training and compound generation were run on Ubuntu 22.04 with an Intel(R) Xeon(R) Silver 4110 CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU (with 24 GB of memory).

Experiments and Evaluation

Experimental Details

To evaluate IEV2Mol, we performed compound generation experiments using IEVs from the active compound test data set as input. In this experiment, 100 compounds were generated per test compound. Note that for each of these 100 generations, the IEV-VAE latent representation and the SMILES-VAE latent representation were newly sampled.

JT-VAE

For comparison with the proposed IEV2Mol model, a graph-based de novo generative model, JT-VAE,44 was trained and evaluated using the active compound training data set with a batch size of 2 and the Adam optimizer for 20 epochs. Note that we used the JT-VAE model implemented in Python3 by Bibhash Mitra, which is available from the GitHub repository https://github.com/Bibyutatsu/FastJTNNpy3.45

Additionally, the learning rate was initially 1e-3 and was multiplied by 0.9 every 2000 batches. The other settings were left at their default values. The model was saved every 10 epochs during training, and the checkpoint with the lowest validation loss was selected.

Similar to IEV2Mol, JT-VAE generated 100 compounds per test compound. Because JT-VAE does not accept IEVs as input, the SMILES of the test compounds were used as input for the generation.

IFP-RNN

For further comparison, the interaction fingerprint (IFP)-based cRNN generation model, IFP-RNN,46 was also trained and evaluated using the active compound training data set with a batch size of 500 and the Adam optimizer for 500 epochs. Note that we used the IFP-RNN model implemented in Python3 by Jie Zhang, which is available from the GitHub repository (https://github.com/jeah-z/IFP-RNN).47

For training, we used residue-specific IFPs calculated from the active compound data set docked to each protein in Glide HTVS mode. The learning rate was set to the default setting (the initial value was 1e-3, and after the 200th epoch, the initial value decayed through multiplication by Inline graphic).

Then, similar to IEV2Mol, we generated 100 compounds per test compound in the IFP-RNN. The IFP of the test compound was used as input for the generation process.

Random ChEMBL

As a baseline, 100 compounds randomly selected from the ChEMBL33 database,48 which contains approximately 2.27 million compounds, were evaluated for each test data point. The 2.27 million compounds were obtained by preprocessing all SMILES in ChEMBL33 in the same way as when the data sets were created: solvents were removed, and then duplicates were removed.

Metrics

The following 6 metrics were used to evaluate IEV2Mol. For comparison, we used the average values of these metrics for the compounds generated by each model using each test data set.

  • Validity (number of valid compounds)

  • Uniqueness (uniqueness of valid compounds)

  • Diversity (diversity of valid compounds)

  • Number of compounds that were able to dock to the target protein in Glide HTVS mode (number of compounds for which the IEV could be calculated)

  • Number of compounds for which the cosine similarity between the IEV of the input compound and its own IEV is greater than 0.7

  • Number of compounds for which the cosine similarity between the IEV of the input compound and its own IEV is greater than 0.7 and the Tanimoto coefficient with the input compound is less than 0.5

Validity, uniqueness, and diversity are metrics also used in the MOSES40 benchmark. Diversity was calculated by eq 12 with p = 1.

$$\mathrm{Diversity}_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2}\sum_{m_1, m_2 \in G} T(m_1, m_2)^p} \qquad (12)$$
$$T(m_1, m_2) = \frac{t}{S - f} \qquad (13)$$

where $G$ is the set of valid compounds generated, $T$ is the Tanimoto coefficient given by eq 13, $t$ is the total number of bits that are both 1 when comparing the ECFP of compound $m_1$ with that of $m_2$, $f$ is the total number of bits that are both 0, and $S$ is the number of ECFP bits. In this study, we used ECFP4 with S = 2048.
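
As a worked illustration of eqs 12 and 13, the sketch below represents each fingerprint as the set of its on-bit positions. The function names and the toy fingerprints are hypothetical, returning 0 for two empty fingerprints is a choice of this sketch, and a real pipeline would compute ECFP4 fingerprints with a cheminformatics toolkit rather than by hand.

```python
from typing import List, Set


def tanimoto(bits1: Set[int], bits2: Set[int], n_bits: int = 2048) -> float:
    """Tanimoto coefficient t / (S - f) from sets of on-bit positions."""
    t = len(bits1 & bits2)            # bits that are 1 in both fingerprints
    f = n_bits - len(bits1 | bits2)   # bits that are 0 in both fingerprints
    denominator = n_bits - f
    return t / denominator if denominator else 0.0


def diversity(fingerprints: List[Set[int]], p: float = 1.0,
              n_bits: int = 2048) -> float:
    """One minus the p-mean Tanimoto similarity over all ordered pairs."""
    n = len(fingerprints)
    total = sum(
        tanimoto(a, b, n_bits) ** p for a in fingerprints for b in fingerprints
    )
    return 1.0 - (total / (n * n)) ** (1.0 / p)


fp_a = {1, 2, 3}
fp_b = {2, 3, 4}
pair_similarity = tanimoto(fp_a, fp_b)
set_diversity = diversity([fp_a, fp_b])
```

In this toy case the two fingerprints share two of four on-bits, so their similarity is 0.5; the sum in eq 12 runs over all ordered pairs, including each compound paired with itself, which gives a diversity of 0.25.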

Among these 6 metrics, validity, uniqueness, and diversity were used to evaluate the performance of IEV2Mol as a generative model. The remaining 3 metrics were used to evaluate the ability of IEV2Mol to generate compounds that have similar interactions with the input compounds but have diverse structures, which is the objective of this study.
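
The two interaction-based counts can be sketched as follows, using the thresholds given in the metric list (cosine similarity greater than 0.7, Tanimoto coefficient less than 0.5). The function names and the example vectors are hypothetical, and each generated compound is assumed to arrive with its IEV and its precomputed Tanimoto coefficient to the input compound.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two IEVs of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def count_hits(
    seed_iev: Sequence[float],
    generated: List[Tuple[Sequence[float], float]],
    iev_threshold: float = 0.7,
    tanimoto_threshold: float = 0.5,
) -> Tuple[int, int]:
    """Count compounds with similar IEVs, and those that are also dissimilar in structure.

    Each entry of `generated` is (IEV, Tanimoto coefficient to the seed).
    """
    similar = [
        (iev, t) for iev, t in generated
        if cosine_similarity(seed_iev, iev) > iev_threshold
    ]
    dissimilar_structure = [pair for pair in similar if pair[1] < tanimoto_threshold]
    return len(similar), len(dissimilar_structure)


seed = [-1.0, -2.0, 0.0]
docked = [
    ([-1.0, -2.0, 0.0], 0.9),   # same interactions, similar structure
    ([-2.0, -4.1, 0.0], 0.3),   # same interactions, different structure
    ([1.0, 0.0, -3.0], 0.2),    # different interactions
]
n_similar, n_similar_and_novel = count_hits(seed, docked)
```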

In addition, we plotted the distributions of the Tanimoto coefficient (ECFP4) and of the cosine similarity to the IEV of the seed compound for all compounds generated using all test data as input, along with the distribution of the chemical space of the data set used for training and the positions of the generated compounds. For the plots of the chemical space and the locations of the generated compounds, the ECFP4 fingerprints with S = 2048 were reduced to two dimensions by principal component analysis (PCA).

Finally, we evaluated the docking poses in Glide HTVS mode for the top four compounds generated by IEV2Mol with the highest IEV cosine similarity among those with Tanimoto coefficients of 0.5 or less.

Results and Discussion

Table 1 shows the results of evaluating the validity, uniqueness, and diversity of IEV2Mol, JT-VAE, IFP-RNN, and Random ChEMBL.

Table 1. Average Values of Validity, Uniqueness, and Diversity for the Compounds Generated by Each Model (IEV2Mol, JT-VAE, IFP-RNN, and Random ChEMBL) for Each Protein (a)

Target  Model          Validity  Uniqueness  Diversity
DRD2    IEV2Mol        97.5      0.987       0.835
DRD2    JT-VAE         100.0     0.236       0.303
DRD2    IFP-RNN        85.1      1.000       0.712
DRD2    Random ChEMBL  100.0     1.000       0.880
AA2AR   IEV2Mol        96.3      0.979       0.855
AA2AR   JT-VAE         100.0     0.228       0.316
AA2AR   IFP-RNN        74.6      1.000       0.755
AA2AR   Random ChEMBL  100.0     1.000       0.880
AKT1    IEV2Mol        94.6      0.971       0.851
AKT1    JT-VAE         100.0     0.262       0.369
AKT1    IFP-RNN        44.7      1.000       0.722
AKT1    Random ChEMBL  100.0     1.000       0.880

(a) The bolded values are the best values, and the underlined values are the second-best values.

Table 1 shows that IEV2Mol has a higher validity than the IFP-RNN, which also uses SMILES-based generation. We attribute this to the fact that IEV2Mol learned SMILES grammar from a larger data set, DM-QP-1M, whereas the IFP-RNN learned it only from the active compound data set.

The IFP-RNN architecture requires the computation of IFPs for all training data, which makes it difficult to expand the training data. In contrast, IEV2Mol requires only SMILES for SMILES-VAE pretraining, making it easily scalable and allowing us to train the grammar on the larger DM-QP-1M data set.

Additionally, IEV2Mol was comparable to Random ChEMBL in the evaluation of uniqueness and diversity. JT-VAE yielded particularly poor results for uniqueness and diversity, which may be because JT-VAE generates compounds by sampling based on structural similarity to the input compounds. IEV2Mol, on the other hand, seems to have performed better by generating compounds through sampling based on IEV similarity.

Table 2 shows the results of the evaluation of the number of compounds that were able to dock to the target protein in Glide HTVS mode, the number of compounds for which the cosine similarity between the IEV of the input compound and its own IEV was greater than 0.7, and the number of compounds for which the cosine similarity between the IEV of the input compound and its own IEV was greater than 0.7 and the Tanimoto coefficient with the input compound was less than 0.5.

Table 2. Average Numbers of Generated Compounds for Each Model and Protein That Were Dockable, That Had IEV Cosine Similarity to the Input Compound Greater than 0.7, and That Had IEV Cosine Similarity to the Input Compound Greater than 0.7 and Tanimoto Similarity to the Input Compound Less than 0.5 (a)

Target  Model          Dockable  IEV Cos ≥0.7  IEV Cos ≥0.7 and Tanimoto ≤0.5
DRD2    IEV2Mol        91.5      27.6          27.0
DRD2    JT-VAE         94.5      50.8          5.1
DRD2    IFP-RNN        81.8      24.3          22.2
DRD2    Random ChEMBL  78.7      13.7          13.7
AA2AR   IEV2Mol        94.5      67.9          66.7
AA2AR   JT-VAE         98.8      77.3          16.1
AA2AR   IFP-RNN        72.2      49.4          47.2
AA2AR   Random ChEMBL  79.8      38.1          38.1
AKT1    IEV2Mol        91.1      24.8          22.7
AKT1    JT-VAE         96.6      39.8          5.2
AKT1    IFP-RNN        39.9      7.8           6.3
AKT1    Random ChEMBL  82.3      10.8          10.8

(a) The bolded values are the best values, and the underlined values are the second-best values.

IEV2Mol outperformed every method except JT-VAE in both the number of dockable compounds generated and the number of generated compounds with high IEV cosine similarity. However, as discussed for Table 1, we believe that JT-VAE scores well on these metrics because it produces compounds with high structural similarity to the input compounds. Therefore, for our purposes, the JT-VAE results cannot simply be regarded as good. This interpretation is reinforced by the fact that JT-VAE performed markedly worse when low structural similarity to the input compound was required in addition to high IEV cosine similarity. In contrast, the results of IEV2Mol declined only slightly under this additional condition. In other words, IEV2Mol succeeds in generating compounds that have similar interactions without depending on structural similarity to the input compound. These model-specific trends are consistent across DRD2, AA2AR, and AKT1, showing that the ability of IEV2Mol to generate compounds with high IEV similarity and low structural similarity to seed compounds is robust across different target proteins.

Figure 2 shows a comparative analysis of the distributions of (a) Tanimoto coefficients (calculated using ECFP4) and (b) cosine similarities of IEVs between the seed compounds and the generated compounds using kernel density estimation (KDE). The proposed IEV2Mol framework (referred to as “our model”) is compared to the JT-VAE and IFP-RNN methods, as well as to a random sample of the ChEMBL database (Random ChEMBL). All test data were used as input for each approach. The KDE plot of the Tanimoto coefficients in Figure 2a shows that IEV2Mol generates compounds with a greater degree of structural diversity than do JT-VAE and IFP-RNN, as evidenced by the greater density of compounds with lower Tanimoto coefficients. This indicates that IEV2Mol is capable of generating a diverse range of molecules that lack structural similarity to the seed compound. In terms of the KDE plot of IEV cosine similarities, shown in Figure 2b, the compounds generated by IEV2Mol show greater similarity to the seed compounds than those of IFP-RNN. This observation suggests that IEV2Mol successfully reproduces IEV even when generating compounds with no structural similarity to the seed compound. Notably, the IEV cosine similarity distribution for JT-VAE has a prominent peak at 1.0, indicating that JT-VAE tends to generate compounds with high structural similarity to the seed compound. This can be attributed to the fact that JT-VAE relies heavily on the structural information on the seed compound during the generation process.

Figure 2. Comparative analysis of the distributions of (a) Tanimoto coefficients (calculated using ECFP4) and (b) cosine similarities of interaction energy vectors (IEVs) between the seed compounds and the generated compounds for each protein. The proposed IEV2Mol framework (referred to as “our model”) is compared to the JT-VAE and IFP-RNN methods, as well as to a random sample of the ChEMBL database (Random ChEMBL), with all test data used as input for each approach.

Figure 3 shows the QED–LogP plots of the generated compounds for each of DRD2, AA2AR, and AKT1. These properties were calculated by RDKit. For each protein, IEV2Mol generated molecules with property distributions similar to those of active compounds, indicating that IEV2Mol learned the features of the known ligands.

Figure 3. QED–LogP plots for compounds generated by IEV2Mol targeting DRD2, AA2AR, and AKT1. Points represent generated compounds, while the red density map shows the distribution of the active compounds.

Considering the objective of this study, which is to generate compounds with diverse structures while maintaining similar interaction energy profiles to the seed compound, the lower IEV cosine similarity of IEV2Mol compared to that of JT-VAE is not a problem. Indeed, it highlights the ability of IEV2Mol to generate structurally diverse compounds with interaction energy profiles similar to those of the seed compound. The results show that IEV2Mol is a promising approach for the generation of novel compounds with desired protein–ligand interactions, as it achieves a balance between structural diversity and the reproduction of interaction energy values.

Figure 4 shows that IEV2Mol generates a wide variety of compounds that cover the chemical space of the active compounds, regardless of where the seed compound lies in that chemical space. Figures for all seed compounds are shown in Figures S1–S15. This result, as well as the results shown in Figure 2, suggests that IEV2Mol can generate compounds with IEVs similar to those of the seed compound, even when the generated compounds have no structural similarity to the seed compound. The ability of IEV2Mol to explore different regions of chemical space while maintaining the targeted protein–ligand interaction profile is a major advantage in the context of drug discovery.

Figure 4.


Distribution of compounds generated using IEV2Mol for each protein. The chemical space is visualized using principal component analysis (PCA) to reduce the dimensionality of the ECFP4 fingerprints (2048 bits) to two dimensions. The red crosses represent the seed compounds; the blue density map shows the distribution of 10,000 randomly selected compounds from the DM-QP-1M data set, while the red density map shows the distribution of the active compound data set. The blue dots correspond to the compounds generated based on each seed compound, demonstrating the ability of the model to generate molecules that cover the chemical space of active compounds.

Figure 5 shows the docking poses of the seed compound (top) and the four compounds generated by IEV2Mol (bottom four). These generated compounds were selected from among compounds with Tanimoto coefficients less than 0.5 and with high IEV cosine similarity to the seed compound. The docking poses, obtained using the Glide HTVS mode, indicate the potential for IEV2Mol to generate structurally diverse compounds that may interact with the target protein, DRD2.

Figure 5.


Docking poses for DRD2, obtained using the Glide HTVS mode, of the four IEV2Mol-generated compounds with the highest IEV cosine similarity among those with Tanimoto coefficients less than 0.5. The seed compound is shown at the top, with the four generated compounds below it.
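The selection rule used for Figure 5 (keep compounds with a Tanimoto coefficient below 0.5, then take the four with the highest IEV cosine similarity) can be sketched as follows. This is only an illustrative reading of that rule, not the authors' script: the function name, the dictionary keys, and all candidate values are hypothetical, and both similarity scores are assumed to have been computed beforehand.

```python
def select_diverse_hits(candidates, max_tanimoto=0.5, top_k=4):
    """Keep candidates structurally dissimilar to the seed, ranked by IEV cosine similarity.

    Each candidate is a dict with hypothetical keys "id", "tanimoto", and "iev_cosine".
    """
    # Strict inequality mirrors "Tanimoto coefficients less than 0.5".
    dissimilar = [c for c in candidates if c["tanimoto"] < max_tanimoto]
    # Highest interaction similarity first, then truncate to the requested count.
    ranked = sorted(dissimilar, key=lambda c: c["iev_cosine"], reverse=True)
    return ranked[:top_k]


# Invented scores for demonstration only.
candidates = [
    {"id": "gen_a", "tanimoto": 0.62, "iev_cosine": 0.99},
    {"id": "gen_b", "tanimoto": 0.31, "iev_cosine": 0.95},
    {"id": "gen_c", "tanimoto": 0.18, "iev_cosine": 0.97},
    {"id": "gen_d", "tanimoto": 0.45, "iev_cosine": 0.80},
    {"id": "gen_e", "tanimoto": 0.50, "iev_cosine": 0.98},
    {"id": "gen_f", "tanimoto": 0.22, "iev_cosine": 0.91},
    {"id": "gen_g", "tanimoto": 0.40, "iev_cosine": 0.60},
]

selected = select_diverse_hits(candidates)
print([c["id"] for c in selected])
```

Here `gen_a` and `gen_e` are excluded despite their high IEV cosine similarity because they are not below the structural threshold, which reflects the intent of showing structurally distinct compounds with seed-like interaction profiles.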

Conclusion

In this study, we proposed a new VAE model, IEV2Mol, which generates compounds using information about the interaction between a compound and a protein, called the interaction energy vector (IEV). By using the IEVs calculated between the compound and the target protein as input, IEV2Mol generates compounds with similar IEVs for that target protein. This is accomplished by combining latent representations derived from IEVs between the seed compound and the target protein with latent representations randomly obtained from the vast chemical space and decoding them.

A series of experiments showed that, compared to those of other generative models such as JT-VAE and IFP-RNN, compounds generated with IEV2Mol tend to have IEVs similar to those of the seed compound, although they are structurally less similar to the seed compound. Although IEV2Mol has the limitation that it requires the tertiary structure of the target protein and known ligand data, it has the potential to be a useful tool in the hit discovery process because it can generate compounds with similar interactions, regardless of their structural similarity to the seed compound. For future work, to streamline compound generation targeting various proteins, it is conceivable to condition not only on vectors related to the interaction between proteins and small molecule compounds but also on fingerprints learned from the entire protein structure.49–51

Acknowledgments

This research was partially supported by the Research Support Project for Life Science and Drug Discovery (Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)) from AMED under Grant Number JP24ama121026 and the Japan Society for the Promotion of Science (JSPS) under KAKENHI Grant Numbers JP20H00620 and JP24H01760.

Data Availability Statement

The source code and trained models for IEV2Mol, JT-VAE, and IFP-RNN designed for generating compounds active against DRD2, AA2AR, and AKT1, as well as the data sets (DM-QP-1M, DRD2 Active, and ChEMBL33) utilized in this study, are released under the MIT License and available at https://github.com/sekijima-lab/IEV2Mol. This repository also includes scripts for plotting figures and calculating metrics to evaluate the performance of the different methods.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.4c00842.

  • Chemical space visualization of seed compounds, active compounds, and generated compounds by IEV2Mol for each protein (PDF)

Author Contributions

M.S. designed the study and supervised the computational study. M.O. and S.N. developed the software, and N.Y. guided the experiments. M.O., S.N., and M.S. wrote the manuscript. All authors critically reviewed and revised the manuscript draft and approved the final version for submission.

The authors declare no competing financial interest.

Supplementary Material

ci4c00842_si_001.pdf (39.6MB, pdf)

References

  1. Gohlke H.; Klebe G. Approaches to the Description and Prediction of the Binding Affinity of Small-Molecule Ligands to Macromolecular Receptors. Angew. Chem., Int. Ed. 2002, 41, 2644–2676.
  2. Kitchen D. B.; Decornez H.; Furr J. R.; Bajorath J. Docking and scoring in virtual screening for drug discovery: Methods and applications. Nat. Rev. Drug Discovery 2004, 3, 935–949. DOI: 10.1038/nrd1549.
  3. Chen L.; Morrow J. K.; Tran H. T.; Phatak S. S.; Du-Cuny L.; Zhang S. From Laptop to Benchtop to Bedside: Structure-based Drug Design on Protein Targets. Curr. Pharm. Des. 2012, 18, 1217–1239. DOI: 10.2174/138161212799436386.
  4. Ferreira L.; dos Santos R.; Oliva G.; Andricopulo A. Molecular Docking and Structure-Based Drug Design Strategies. Molecules 2015, 20, 13384–13421. DOI: 10.3390/molecules200713384.
  5. Kuntz I. D. Structure-Based Strategies for Drug Design and Discovery. Science 1992, 257, 1078–1082. DOI: 10.1126/science.257.5073.1078.
  6. Jorgensen W. L. The Many Roles of Computation in Drug Discovery. Science 2004, 303, 1813–1818. DOI: 10.1126/science.1096361.
  7. Lydon N. B.; Druker B. J. Lessons learned from the development of imatinib. Leuk. Res. 2004, 28, 29–38. DOI: 10.1016/j.leukres.2003.10.002.
  8. Flower R. J. The development of COX2 inhibitors. Nat. Rev. Drug Discovery 2003, 2, 179–191. DOI: 10.1038/nrd1034.
  9. Penning T. D.; et al. Synthesis and Biological Evaluation of the 1,5-Diarylpyrazole Class of Cyclooxygenase-2 Inhibitors: Identification of 4-[5-(4-Methylphenyl)-3-(trifluoromethyl)-1H-pyrazol-1-yl]benzenesulfonamide (SC-58635, Celecoxib). J. Med. Chem. 1997, 40, 1347–1365. DOI: 10.1021/jm960803q.
  10. Zhu N.; et al. A Novel Coronavirus from Patients with Pneumonia in China, 2019. N. Engl. J. Med. 2020, 382, 727–733. DOI: 10.1056/NEJMoa2001017.
  11. Lu H.; Stratton C. W.; Tang Y. Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle. J. Med. Virol. 2020, 92, 401–402. DOI: 10.1002/jmv.25678.
  12. Unoh Y.; et al. Discovery of S-217622, a Noncovalent Oral SARS-CoV-2 3CL Protease Inhibitor Clinical Candidate for Treating COVID-19. J. Med. Chem. 2022, 65, 6499–6512. DOI: 10.1021/acs.jmedchem.2c00117.
  13. Yoshino R.; Yasuo N.; Hagiwara Y.; Ishida T.; Inaoka D. K.; Amano Y.; Tateishi Y.; Ohno K.; Namatame I.; Niimi T.; et al. In silico, in vitro, X-ray crystallography, and integrated strategies for discovering spermidine synthase inhibitors for Chagas disease. Sci. Rep. 2017, 7, 6666. DOI: 10.1038/s41598-017-06411-9.
  14. Chiba S.; Ishida T.; Ikeda K.; Mochizuki M.; Teramoto R.; Taguchi Y.-H.; Iwadate M.; Umeyama H.; Ramakrishnan C.; Thangakani A. M.; et al. An iterative compound screening contest method for identifying target protein inhibitors using the tyrosine-protein kinase Yes. Sci. Rep. 2017, 7, 12038. DOI: 10.1038/s41598-017-10275-4.
  15. Yamamoto K. Z.; Yasuo N.; Sekijima M. Screening for Inhibitors of Main Protease in SARS-CoV-2: In Silico and In Vitro Approach Avoiding Peptidyl Secondary Amides. J. Chem. Inf. Model. 2022, 62, 350–358. DOI: 10.1021/acs.jcim.1c01087.
  16. Yoshino R.; Yasuo N.; Hagiwara Y.; Ishida T.; Inaoka D. K.; Amano Y.; Tateishi Y.; Ohno K.; Namatame I.; Niimi T.; Orita M.; Kita K.; Akiyama Y.; Sekijima M. Discovery of a Hidden Trypanosoma cruzi Spermidine Synthase Binding Site and Inhibitors through In Silico, In Vitro, and X-ray Crystallography. ACS Omega 2023, 8, 25850–25860. DOI: 10.1021/acsomega.3c01314.
  17. Bohacek R. S.; McMartin C.; Guida W. C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 1996, 16, 3–50.
  18. Yang X.; Zhang J.; Yoshizoe K.; Terayama K.; Tsuda K. ChemTS: An efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 2017, 18, 972–976. DOI: 10.1080/14686996.2017.1401424.
  19. Erikawa D.; Yasuo N.; Sekijima M. MERMAID: An open source automated hit-to-lead method based on deep reinforcement learning. J. Cheminf. 2021, 13, 94. DOI: 10.1186/s13321-021-00572-6.
  20. Ishida S.; Aasawat T.; Sumita M.; Katouda M.; Yoshizawa T.; Yoshizoe K.; Tsuda K.; Terayama K. ChemTSv2: Functional molecular design using de novo molecule generator. WIREs Comput. Mol. Sci. 2023, 13, e1680. DOI: 10.1002/wcms.1680.
  21. Gómez-Bombarelli R.; Wei J. N.; Duvenaud D.; Hernández-Lobato J. M.; Sánchez-Lengeling B.; Sheberla D.; Aguilera-Iparraguirre J.; Hirzel T. D.; Adams R. P.; Aspuru-Guzik A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018, 4, 268–276. DOI: 10.1021/acscentsci.7b00572.
  22. Lim J.; Ryu S.; Kim J. W.; Kim W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J. Cheminf. 2018, 10, 31. DOI: 10.1186/s13321-018-0286-7.
  23. Sanchez-Lengeling B.; Aspuru-Guzik A. Inverse molecular design using machine learning: Generative models for matter engineering. Science 2018, 361, 360–365. DOI: 10.1126/science.aat2663.
  24. Prykhodko O.; Johansson S. V.; Kotsias P.-C.; Arús-Pous J.; Bjerrum E. J.; Engkvist O.; Chen H. A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminf. 2019, 11, 74. DOI: 10.1186/s13321-019-0397-9.
  25. Lee Y. J.; Kahng H.; Kim S. B. Generative Adversarial Networks for De Novo Molecular Design. Mol. Inf. 2021, 40, 2100045. DOI: 10.1002/minf.202100045.
  26. Liu Q.; Allamanis M.; Brockschmidt M.; Gaunt A. L. Constrained Graph Variational Autoencoders For Molecule Design, 2018. https://arxiv.org/abs/1805.09076.
  27. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. DOI: 10.1021/ci00057a005.
  28. Krishnan S. R.; Bung N.; Bulusu G.; Roy A. Accelerating De Novo Drug Design against Novel Proteins Using Deep Learning. J. Chem. Inf. Model. 2021, 61, 621–630. DOI: 10.1021/acs.jcim.0c01060.
  29. Deng Z.; Chuaqui C.; Singh J. Structural Interaction Fingerprint (SIFt): A Novel Method for Analyzing Three-Dimensional Protein–Ligand Binding Interactions. J. Med. Chem. 2004, 47, 337–344. DOI: 10.1021/jm030331x.
  30. Kelly M. D.; Mancera R. L. Expanded Interaction Fingerprint Method for Analyzing Ligand Binding Modes in Docking and Structure-Based Drug Design. J. Chem. Inf. Comput. Sci. 2004, 44, 1942–1951. DOI: 10.1021/ci049870g.
  31. Marcou G.; Rognan D. Optimizing Fragment and Scaffold Docking by Use of Molecular Interaction Fingerprints. J. Chem. Inf. Model. 2007, 47, 195–207. DOI: 10.1021/ci600342e.
  32. Pérez-Nueno V. I.; Rabal O.; Borrell J. I.; Teixidó J. APIF: A New Interaction Fingerprint Based on Atom Pairs and Its Application to Virtual Screening. J. Chem. Inf. Model. 2009, 49, 1245–1260. DOI: 10.1021/ci900043r.
  33. Jasper J. B.; Humbeck L.; Brinkjost T.; Koch O. A novel interaction fingerprint derived from per atom score contributions: Exhaustive evaluation of interaction fingerprint performance in docking based virtual screening. J. Cheminf. 2018, 10, 15. DOI: 10.1186/s13321-018-0264-0.
  34. Zhang J.; Chen H. De Novo Molecule Design Using Molecular Generative Models Constrained by Ligand–Protein Interactions. J. Chem. Inf. Model. 2022, 62, 3291–3306. DOI: 10.1021/acs.jcim.2c00177.
  35. Yasuo N.; Sekijima M. Improved Method of Structure-Based Virtual Screening via Interaction-Energy-Based Learning. J. Chem. Inf. Model. 2019, 59, 1050–1061. DOI: 10.1021/acs.jcim.8b00673.
  36. Friesner R. A.; Banks J. L.; Murphy R. B.; Halgren T. A.; Klicic J. J.; Mainz D. T.; Repasky M. P.; Knoll E. H.; Shelley M.; Perry J. K.; Shaw D. E.; Francis P.; Shenkin P. S. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47, 1739–1749. DOI: 10.1021/jm0306430.
  37. Halgren T. A.; Murphy R. B.; Friesner R. A.; Beard H. S.; Frye L. L.; Pollard W. T.; Banks J. L. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 2. Enrichment Factors in Database Screening. J. Med. Chem. 2004, 47, 1750–1759. DOI: 10.1021/jm030644s.
  38. Lee J.; Myeong I.-S.; Kim Y. The Drug-Like Molecule Pre-Training Strategy for Drug Discovery. IEEE Access 2023, 11, 61680–61687. DOI: 10.1109/ACCESS.2023.3285811.
  39. Davies M.; Nowotka M.; Papadatos G.; Dedman N.; Gaulton A.; Atkinson F.; Bellis L.; Overington J. P. ChEMBL web services: Streamlining access to drug discovery data and utilities. Nucleic Acids Res. 2015, 43, W612–W620. DOI: 10.1093/nar/gkv352.
  40. Polykovskiy D.; et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front. Pharmacol. 2020, 11, 565644. DOI: 10.3389/fphar.2020.565644.
  41. Klambauer G.; Unterthiner T.; Mayr A.; Hochreiter S. Self-Normalizing Neural Networks, 2017. https://arxiv.org/abs/1706.02515.
  42. Paszke A.; Gross S.; Massa F.; Lerer A.; Chintala S. Proceedings of the 33rd International Conference on Neural Information Processing Systems; Curran Associates Inc., 2019.
  43. Landrum G. RDKit: Open-Source Cheminformatics Software. 2016; https://github.com/rdkit/rdkit/releases/tag/Release_2016_09_4.
  44. Jin W.; Barzilay R.; Jaakkola T. Junction Tree Variational Autoencoder For Molecular Graph Generation, 2018. https://arxiv.org/abs/1802.04364.
  45. Mitra B. C. FastJtnnpy3: Junction Tree Variational Autoencoder For Molecular Graph Generation, 2020. https://github.com/Bibyutatsu/FastJTNNpy3.
  46. Zhang J.; Chen H. De Novo Molecule Design Using Molecular Generative Models Constrained by Ligand–Protein Interactions. J. Chem. Inf. Model. 2022, 62, 3291–3306. DOI: 10.1021/acs.jcim.2c00177.
  47. Zhang J. IFP-RNN, 2021. https://github.com/jeah-z/IFP-RNN.
  48. Zdrazil B.; et al. The ChEMBL Database in 2023: A drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 2024, 52, D1180–D1192. DOI: 10.1093/nar/gkad1004.
  49. Gainza P.; Sverrisson F.; Monti F.; Rodolà E.; Boscaini D.; Bronstein M. M.; Correia B. E. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 2020, 17, 184–192. DOI: 10.1038/s41592-019-0666-6.
  50. Gainza P.; et al. De novo design of protein interactions with learned surface fingerprints. Nature 2023, 617, 176–184. DOI: 10.1038/s41586-023-05993-x.
  51. Isert C.; Atz K.; Schneider G. Structure-based drug design with geometric deep learning. Curr. Opin. Struct. Biol. 2023, 79, 102548. DOI: 10.1016/j.sbi.2023.102548.



Articles from Journal of Chemical Information and Modeling are provided here courtesy of American Chemical Society
