2024 Jul 1;64(14):5557–5569. doi: 10.1021/acs.jcim.4c00342

Scaffold-Hopped Compound Identification by Ligand-Based Approaches with a Prospective Affinity Test

Itsuki Maeda, Shunsuke Tamura, Yoshihiro Ogura, Takayuki Serizawa, Takashi Shimada, Ryo Kunimoto, Tomoyuki Miyao
PMCID: PMC11267578  PMID: 38950192

Abstract

Scaffold-hopped (SH) compounds are bioactive compounds structurally different from known active compounds. Identifying SH compounds with ligand-based approaches has been a central issue in medicinal chemistry, and various molecular representations for scaffold hopping have been proposed. However, which representations are appropriate for SH compound identification remains unclear. Herein, the ability of several representations to identify SH compounds was fairly evaluated based on retrospective validation and prospective demonstration. In the retrospective validation, the combinations of two screening algorithms and four two- and three-dimensional molecular representations were compared using controlled data sets for the early identification of SH compounds. We found that the support vector machine (SVM) combined with the extended connectivity fingerprint with bond diameter 4 (SVM-ECFP4) and the SVM combined with the rapid overlay of chemical structures (SVM-ROCS) showed relatively high performance. The compounds that were highly ranked by SVM-ROCS did not share substructures with the active training compounds, while those ranked by SVM-ECFP4 were mostly recombinant. In the prospective demonstration, 93 SH compounds were prepared by screening the Namiki database using SVM-ROCS, targeting ABL1 inhibitors. The primary screening using surface plasmon resonance suggested five active compounds; however, in the competitive binding assays with adenosine triphosphate, no hits were found.

Introduction

Scaffold-hopped (SH) compounds are bioactive small molecules against a target macromolecule that are structurally different from already known ligands. “Scaffold-hopping”, a term coined by Schneider et al.,1 has been a central issue in medicinal chemistry due to the potential benefits of SH compounds (e.g., avoiding patent infringement and exhibiting desirable properties for absorption, distribution, metabolism, excretion, and toxicity optimization). Recent advances in scaffold-hopping for medicinal chemistry are reported elsewhere.2,3 SH compounds are similar to known bioactive compounds in terms of molecular interaction with the target macromolecule, implying that these compounds could be identified by measuring their similarity to the bioactive ones in three-dimensional (3D) space, such as by pharmacophore-based filtering4 and/or direct structure-based approaches including docking.5

In ligand-based virtual screening (VS), the choice of molecular representations is important. In general, molecular representations can be classified into two-dimensional (2D) and 3D representations, corresponding to structural formula-based and conformer-based information, respectively. For the identification of SH compounds, a 2D representation including potential pharmacophoric points was proposed.6,7 The potential pharmacophoric points are hypotheses of molecular interaction with a target macromolecule, for example, hydrogen-bond-forming atoms. Chemically advanced template search (CATS)1,8 represents the normalized frequencies of specific topological distances between pairs of potential pharmacophoric points. Several prospective applications (experimental tests) with a de novo design tool utilizing topological pharmacophore like CATS were reported to identify SH compounds.9,10 In the category of 3D representations, weighted holistic atom localization and entity shape (WHALES)11,12 was developed with an application of selecting SH compounds from a set of synthetic molecules against a natural product template molecule. WHALES utilizes information on the molecular shape and partial charge of the molecule’s atoms. Another representation for SH compound identification is rapid overlay of chemical structure (ROCS).13,14 ROCS calculates the similarity to a query active compound in terms of molecular shape and chemical properties (potential pharmacophoric points). By using scores obtained by ROCS for a set of reference compounds in combination with a machine learning (ML) model, early identification of SH compounds in retrospective validation was reported and compared favorably with a conventional 2D molecular representation of extended connectivity fingerprints (ECFP).15,16

Several studies have attempted to compare the ability to identify SH compounds. Vogt et al. compared five popular 2D fingerprints of structural fragment, pharmacophore, and topological distances.17 They found that overall ECFP4 showed the highest SH potential. Meanwhile, Renner and Schneider compared screening ability among four representations. They tested topological, 3D, and molecular-surface-based pharmacophore pair descriptors (CATS), in addition to a 2D substructure-based fingerprint of MACCS keys,18 and found that the performance using pharmacophore-based CATS descriptors was slightly better than the use of MACCS keys, as a general trend for diverse target classes.19 Moreover, Zhang and Muegge investigated topological and 3D pharmacophore-based fingerprints. They found that 3D representations performed better than 2D descriptors when there was no topological-based similarity bias based on the similarity distribution within active and between active and inactive controls.20 However, no comparison has been made among the structural fingerprints, including the highest SH potential (i.e., ECFP4) and various 2D and 3D pharmacophore-based descriptors, so it is unclear which molecular representation is suitable for identifying SH compounds.

Herein, we aimed to fairly compare the ability of SH compound identification among various representations for practical VS applications. Our contributions are summarized as two points: comparison with diverse 2D and 3D molecular representations using controlled data sets for the early identification of SH compounds and prospective demonstration of a VS trial for SH compounds using the best-performing method derived from a retrospective validation. Specifically, in the retrospective validation, only a small number of structurally similar compounds were compiled as a training data set for VS to control the uncertainty of binding site assumption and complexity in interpretation when diverse active compounds were included for training. The data sets for the four biological targets were compiled from publicly accessible databases. Test compounds were SH compounds against only the active compounds in the training set. As methods for SH compound identification, similarity search and activity prediction by the support vector machine (SVM)21 were employed. As representations, ECFP, CATS, WHALES, and ROCS were used. Furthermore, the chemical structures of top-ranked compounds were compared since different representations might prioritize compounds with different scaffolds. In the prospective demonstration, for tyrosine-protein kinase ABL1 as a target, the best-performing method in the retrospective validation was also tested for SH compound identification, starting from sets of analogous active compounds. A commercially available compound database was screened using the models. The selected SH compounds were experimentally evaluated for affinity against ABL1 by surface plasmon resonance (SPR). Furthermore, a docking study was also conducted, and the score distribution was compared with that from decoy compounds to reveal the characteristics of the selected SH compounds.

Materials and Methods

Definition of Scaffold-Hopped Compounds

A pair of active compounds has the SH relation when the number of atoms in the common substructure (connected graph) is less than or equal to a certain threshold against the number of atoms in one of the paired compounds, that is, the query compound, represented as the common atom ratio:

$$\text{common atom ratio} = \frac{N_{\text{atoms}}(\text{common substructure})}{N_{\text{atoms}}(\text{query compound})}$$

In this study, a threshold value of 0.4 was used. A screened molecule is recognized as an SH compound when the ratio of the test compound to all active training compounds is below this threshold.
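As a minimal sketch of this definition, the check could be written as follows. The function names are hypothetical, the atom counts are assumed to be precomputed (finding the common substructure itself requires a cheminformatics toolkit and is outside the sketch), and each active training compound is assumed to play the role of the query. The strict "below" comparison follows the sentence above.

```python
def common_atom_ratio(n_common_atoms: int, n_query_atoms: int) -> float:
    """Fraction of the query compound's atoms covered by the common substructure."""
    if n_query_atoms <= 0:
        raise ValueError("query compound must contain at least one atom")
    return n_common_atoms / n_query_atoms


def is_scaffold_hopped(common_atom_counts, query_atom_counts, threshold=0.4):
    """True if the ratio against every active training compound is below the threshold.

    common_atom_counts[i]: atoms in the common substructure with training compound i.
    query_atom_counts[i]:  atoms in training compound i (assumed to be the query).
    """
    return all(
        common_atom_ratio(c, q) < threshold
        for c, q in zip(common_atom_counts, query_atom_counts)
    )
```

For example, a test compound sharing 6 of 20 and 7 of 25 atoms with two training compounds would count as an SH compound, whereas one sharing 10 of 25 atoms with any training compound would not.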

Data Sets

Target Selection and Compound Data Preparation

Beta-2 adrenergic receptor (ADRB2), lysosomal alpha-glucosidase (GAA), MAP kinase ERK2 (ERK2), and tyrosine-protein kinase ABL (ABL) were chosen as target macromolecules based on there being sufficient numbers of compounds that are active or inactive against them in publicly accessible databases. For ABL, active compounds were collected from the Protein Data Bank (PDB)22 and ChEMBL (version 25)23 databases, while inactive ones were from the ChEMBL and PubChem24 databases. For the other three targets, active compounds were collected from ChEMBL (version 26), while inactive compounds were from PubChem. The data set profiles for the four targets are listed in Table 1 and all of the data sets containing curated compounds as simplified molecular input line entry system (SMILES) are provided as text files in the Supporting Information.

Table 1. Data Set Profiles.
| target (abbreviation) | UniProtKB ID | PubChem AID | # CPDs from PDB (active) | # CPDs from ChEMBL (active) | # CPDs from ChEMBL (inactive) | # CPDs from PubChem (inactive) | # curated active CPDs | # curated inactive CPDs | # SARMs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| beta-2 adrenergic receptor (ADRB2) | P07550 | 492947 | 0 | 2,451 | 0 | 329,716 | 1,021 | 164,831 | 69 |
| lysosomal alpha-glucosidase (GAA) | P10253 | 2100 | 0 | 35,088 | 0 | 293,744 | 8,784 | 216,080 | 47 |
| MAP kinase ERK2 (ERK2) | P28482 | 1454 | 0 | 17,480 | 0 | 126,845 | 2,201 | 58,263 | 188 |
| tyrosine-protein kinase ABL (ABL) | P00519 | 588664 | 59 | 594 | 255 | 357,748 | 444 | 196,744 | 37 |

For each target, target name (abbreviation), ChEMBL UniProtKB accession number, PubChem assay ID (AID), numbers of original active and inactive compounds (CPDs) extracted from Protein Data Bank (PDB), ChEMBL, and PubChem databases, numbers of active and inactive CPDs after curation, and number of structure–activity relationship matrices (SARMs) are listed.

Compound data were preprocessed to obtain valid and drug-like compounds as well as to remove active compounds when there were too many of them. For ABL, the compounds taken from the PDB were checked for their presence in the PDBbind (version 2019)25 database to enhance the likelihood of them being active. All compounds were standardized to neutral forms using in-house Python scripts with the help of the OEChem toolkit.26 To remove compounds that might not be suitable as drugs (e.g., compounds with extremely large molecular weights), a modified version of OpenEye’s BlockBuster filter from the MolProp toolkit27 was applied. The modification was intended to lower the stringency of the filter to allow a larger number of compounds. For example, the molecular weight range was changed to (100, 800) from (130, 781). To adjust the ratio of the active compounds in the data sets, further filters were applied. For ADRB2 and ERK2, compounds with an activity type of Potency were excluded from those with types of IC50, EC50, Ki, Kd, and Potency. For GAA, only compounds with a confidence score of 9, which is the maximum confidence score provided by the ChEMBL database, were selected. Then, compounds with contradictory annotations regarding their activity (active and inactive) and duplicated compounds in the nonisomeric canonical SMILES format were removed. Finally, for each target, compounds with a number of heavy atoms and molecular weights within the 10th–90th percentile range of the active compound distribution were selected. Thus, compounds in the data are relatively similar in terms of molecular weight. The numbers of active and inactive compounds after preprocessing are given in Table 1.
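The final size filter can be illustrated with a short sketch. The dictionary keys, function names, and the choice of the "inclusive" percentile interpolation are assumptions for illustration; the source does not state how the percentiles were computed.

```python
from statistics import quantiles


def percentile_window(values, low_decile=1, high_decile=9):
    """Return the (10th, 90th) percentile cut points of a list of values."""
    cuts = quantiles(values, n=10, method="inclusive")  # 9 cut points: 10%, ..., 90%
    return cuts[low_decile - 1], cuts[high_decile - 1]


def filter_by_active_distribution(compounds, actives, keys=("heavy_atoms", "mol_weight")):
    """Keep compounds whose properties all lie within the actives' 10th-90th percentile range."""
    windows = {k: percentile_window([a[k] for a in actives]) for k in keys}
    return [
        c for c in compounds
        if all(windows[k][0] <= c[k] <= windows[k][1] for k in keys)
    ]
```

Deriving the window from the active compounds only and then applying it to both classes is what makes the retained compounds similar in size, as described above.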

Data Sets for Benchmarking Virtual Screening

Assuming a screening campaign in the early hit-to-lead phase, active compounds in a structure–activity relationship matrix (SARM)28,29 were used as a set of training compounds. A SARM is a set of analogous molecules arranged in the form of a matrix, where each row represents a matching molecular series (MMS)30 and each column represents a substituent. In a SARM, all of the core substructures of the MMSs have matched molecular pair (MMP)31 relations. Each row of a SARM contains compounds that share the same core, and each column contains compounds that share the same substituent. MMPs were generated using the efficient algorithm implemented by Wawer and Bajorath.30 For each target, SARMs were generated from the curated compounds. When the same active compounds appeared in more than one SARM, only the SARM containing the most inactive compounds was retained. The SARMs containing 10–50 active compounds were extracted from the generated SARMs. In Table 1, the numbers of extracted SARMs are listed. Figure 1a illustrates the procedure for SARM generation.

Figure 1.

Data set preparation. (a) Generation of structure–activity relationship matrices (SARMs). Compounds extracted from three databases, Protein Data Bank (PDB), ChEMBL, and PubChem, were standardized. SARMs were generated by collecting compounds with common core substructures from the curated compounds. (b) Training and test data preparation. For active compounds (CPDs), the compounds contained in a SARM were used as training data. The other compounds that were scaffold-hopped (SH) equivalents of the active training compounds were used as test data. For inactive compounds, excluding the compounds in the SARMs, they were randomly split in a 1:10 ratio. For the tyrosine-protein kinase ABL, 10,000 and 100,000 compounds were randomly selected. Overall, 1/11 (or 10,000) compounds and compounds in SARMs were used as potential training compounds. Of the remaining compounds, SH compounds were used as test data.

After the SARMs were generated, the curated compounds were split into training and test data sets (Figure 1b). Active compounds in each SARM became training compounds, and the SH compounds were extracted from the rest of the active compounds as a test set. For inactive compounds, excluding the compounds in the SARMs, all inactive compounds were randomly split into training and test sets in a ratio of 1:10. For ABL, 10,000 and 100,000 compounds were randomly selected. Overall, 1/11 (or 10,000) compounds and compounds in SARMs were used as potential training compounds. Of the remaining (or 100,000 randomly selected) compounds, SH compounds were used as test inactive compounds using the same criterion as for active SH compounds.
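The 1:10 random split of inactive compounds can be sketched as below. The function name, the fixed seed, and the rounding rule for the training share are assumptions; the source specifies only the ratio.

```python
import random


def split_inactives(inactive_ids, train_parts=1, test_parts=10, seed=0):
    """Randomly split inactive compounds into training and test pools in a 1:10 ratio."""
    ids = list(inactive_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle so the split is reproducible
    n_train = len(ids) * train_parts // (train_parts + test_parts)
    return ids[:n_train], ids[n_train:]
```

With 110 inactive compounds, this yields 10 potential training compounds (1/11 of the total) and 100 candidates for the test set, from which only SH compounds would then be kept.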

Molecular Descriptors

Extended Connectivity Fingerprint

ECFP15 is an atom-centered circular fingerprint, capturing atomic environments within a predetermined bond diameter in a molecule. It is frequently applied to ligand-based VS studies. In this study, ECFP with a bond diameter of 4 (ECFP4) was used. The ECFP features were folded into a 2048-bit vector by modulo operation. ECFP4 was generated using in-house Python scripts with the help of the OEChem toolkit.26
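The folding step can be sketched independently of any toolkit. Here the integer feature identifiers are assumed to have been produced already by a fingerprint generator; the function names are hypothetical.

```python
def fold_features(feature_ids, n_bits=2048):
    """Fold integer feature identifiers into a set of 'on' bit positions via modulo."""
    return {fid % n_bits for fid in feature_ids}


def to_bit_vector(on_bits, n_bits=2048):
    """Expand the set of 'on' positions into a dense 0/1 list of length n_bits."""
    vec = [0] * n_bits
    for b in on_bits:
        vec[b] = 1
    return vec
```

Because distinct identifiers can map to the same position (e.g., 1 and 2049 both map to bit 1), folding trades some resolution for a fixed-length vector.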

Chemically Advanced Template Search

CATS1,8 is a topological pharmacophore descriptor. The CATS descriptor is a collection of the scaled occurrences of pharmacophore feature pairs at various topological distances. Several prospective applications for de novo design have been reported using topological pharmacophores like this descriptor.9,10 In this study, the CATS descriptor was generated by a modified version of the RDKit-based program implemented by Guha using OEChem.32 The occurrence of a pair of pharmacophoric features was scaled by the total number of occurrences of the features.

Weighted Holistic Atom Localization and Entity Shape

The WHALES descriptor11,12 was designed to identify synthetically accessible SH molecules starting from complex bioactive natural products. The WHALES descriptor simultaneously captures the partial charge distribution and 3D molecular shape. The original program provided by the authors11 was used in this study with the force-field-based energy-minimized conformation of a molecule. When stereocenters were unspecified, up to 4096 stereoisomers were generated, and the average of the scores calculated over all stereoisomers was used as the score for the molecule.

Rapid Overlay of Chemical Structures

ROCS13,14 is a tool for aligning and scoring molecules based on similarity to the reference molecule in 3D space. For each molecule in a database, multiple conformers were generated and a set of generated conformers was overlaid on the reference compound. In this study, for the curated compounds, up to 30 conformers per stereoisomer were generated by the Omega toolkit.33 As parameters of Omega, the root-mean-square Cartesian distance (RMSD) threshold was set to 1.0, and a strict definition of atom typing was used, where a molecule is rejected if any atom has an invalid atom type. Up to 4096 stereoisomers were generated when stereocenters were unspecified. For reference compounds, the energy-minimized conformers were selected from the generated conformers as per the MMFF94s force field.34 For each pair of a reference compound and a database compound, among all conformers generated for the stereoisomers of the database compound, the one with the largest TanimotoCombo similarity score to the reference compound was retained. The TanimotoCombo score is the summation of the ShapeTanimoto and ColorTanimoto scores. ShapeTanimoto is the overlap between molecules in terms of shape, while ColorTanimoto is the overlap in terms of chemical properties representing potential pharmacophoric points.

One way of making a molecular descriptor based on ROCS is to vectorize ROCS-derived scores against a set of reference compounds.35 Each score is calculated on a similarity metric against a reference compound. When the SVM was used as the ML model, this approach was reported to enable the early identification of SH compounds in retrospective validations.16 In this study, three types of reference compounds were prepared: active, inactive, and similar. Active reference compounds were active training compounds. Inactive reference compounds were randomly selected from the potential training compounds. Testing was performed with various numbers of inactive reference compounds, namely, 0, 25, 50, 100, and 200. Similar reference compounds were top-ranked compounds from the results of similarity searching (SS) by ROCS. Testing was performed with various numbers of similar reference compounds, namely, 0, 25, 50, and 100.
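The vectorization idea can be sketched as follows. The `similarity` callable is a placeholder standing in for a ROCS score such as ColorTanimoto or TanimotoCombo, which this sketch does not compute; the function names are hypothetical, and the inactive and similar lists are assumed to be already shuffled or ranked, respectively. The default counts of 100 and 25 are only illustrative values taken from the ranges tested above.

```python
def build_reference_set(actives, inactives, similars, n_inactive=100, n_similar=25):
    """Concatenate the three reference types in a fixed order so vector positions are comparable."""
    return list(actives) + list(inactives)[:n_inactive] + list(similars)[:n_similar]


def score_vector(compound, references, similarity):
    """Describe a compound by its similarity scores against each reference compound."""
    return [similarity(compound, ref) for ref in references]
```

Each compound is thus represented by a vector whose length equals the number of reference compounds, which can then be passed to an ML model.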

Machine Learning Models

Similarity Search

In SS, a test compound was compared to all active training compounds, and the largest similarity value was selected as the outcome. As metrics, Tanimoto similarity was employed for ECFP4, 1 minus the Euclidean distance for CATS and WHALES, and the ColorTanimoto and TanimotoCombo scores for ROCS.
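A minimal sketch of this maximum-similarity scoring for fingerprints is given below. Fingerprints are assumed to be represented as sets of "on" bit positions, and the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bit positions."""
    if not bits_a and not bits_b:
        return 0.0
    common = len(bits_a & bits_b)
    return common / (len(bits_a) + len(bits_b) - common)


def similarity_search_score(test_fp, active_train_fps, similarity=tanimoto):
    """Score a test compound by its largest similarity to any active training compound."""
    return max(similarity(test_fp, fp) for fp in active_train_fps)


def rank_compounds(test_fps, active_train_fps):
    """Return test-compound indices sorted from highest to lowest score."""
    scores = [similarity_search_score(fp, active_train_fps) for fp in test_fps]
    return sorted(range(len(test_fps)), key=lambda i: scores[i], reverse=True)
```

Swapping `similarity` for another metric, such as one based on Euclidean distance, would cover the other descriptors in the same way.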

Support Vector Machine

An SVM is a supervised learning algorithm that identifies the hyperplane that separates two classes of objects (compounds) while maximizing the margin from the hyperplane.21 This method was originally proposed as a linear classification algorithm, but can be applied to nonlinear classification tasks using the kernel functions.36 In this study, SVM models were constructed with the help of the scikit-learn library (version 0.23.2).37 The Tanimoto kernel38 for ECFP4 and the radial basis function kernel for other descriptors were used. The class imbalance was adjusted by introducing the class weights, which are inversely proportional to the class frequencies in the input data. SVM hyperparameters were optimized using the results of 5-fold cross-validation. One hundred inactive compounds randomly selected from the potential training compounds were used as inactive training compounds. Test compounds were ranked by calculating the signed distance from the hyperplane in the direction of active to inactive.
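Two ingredients of this setup, the Tanimoto kernel and the class weights, can be sketched without any ML library. The function names are hypothetical, fingerprints are assumed to be sets of "on" bits, and the weight formula n_samples / (n_classes × class count) is one common way of making weights inversely proportional to class frequencies; the source does not spell out the exact formula.

```python
from collections import Counter


def tanimoto_kernel(fps_a, fps_b):
    """Gram matrix K[i][j] = Tanimoto(fps_a[i], fps_b[j]) for fingerprints given as sets of on-bits."""
    def tc(a, b):
        common = len(a & b)
        union = len(a) + len(b) - common
        return common / union if union else 0.0
    return [[tc(a, b) for b in fps_b] for a in fps_a]


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequencies: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}
```

A Gram matrix of this kind could be supplied to an SVM implementation that accepts precomputed kernels. With 10 active and 100 inactive training compounds, the active class would receive a weight ten times that of the inactive class.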

Evaluation Metrics

To evaluate screening performance, the area under the precision–recall curve (AUC-PRC) and the rank of the top-ranked active compound were employed. The AUC-PRC can measure VS performance focusing on the minority class, as well as the early enrichment of active compounds.39 This metric ranges from 0 to 1, with values close to 1 indicating a high screening performance. The top rank of active compounds is reported as the common logarithm of the rank of the highest-ranked active test compound based on screening scores.
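Both metrics can be sketched from a list of 0/1 activity labels sorted by descending screening score. The step-wise average precision used here is one common estimator of the area under the precision–recall curve and is an assumption; the source does not state which integration rule was used. Function names are hypothetical.

```python
import math


def average_precision(labels_ranked):
    """Step-wise area under the precision-recall curve for labels sorted by descending score."""
    n_active = sum(labels_ranked)
    if n_active == 0:
        raise ValueError("no active compounds in the ranked list")
    hits, total = 0, 0.0
    for rank, label in enumerate(labels_ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each newly recalled active
    return total / n_active


def log_top_rank(labels_ranked):
    """Common logarithm of the rank of the best-ranked active compound."""
    return math.log10(labels_ranked.index(1) + 1)
```

For instance, if the first active compound appears at rank 100, the log top rank is 2; lower values indicate earlier identification.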

Data Preparation for Scaffold-Hopped Compounds for ABL

The Namiki high-throughput screening database (ver. 201910)40 was screened to identify SH compounds starting from active compounds in individual SARMs for ABL. The number of IDs in the Namiki database was 6,934,154. The standardization procedure, involving the removal of duplicated compounds, was applied to compounds in the database, followed by the removal of compounds with a number of heavy atoms in excess of the 99th percentile, after which 6,804,945 compounds remained. For each stereoisomer of every compound, up to 30 conformers were generated by Omega using a strict atom-type representation with an RMSD threshold of 1.0. The same filter on molecular size as for the ABL retrospective study was then applied, and 4,082,935 compounds remained.

Chemical Space Networks of Active Compounds

A chemical space network (CSN) visualizes relations among compounds in a data set. A node in a CSN represents a compound, and an edge represents a relation between a pair of compounds. In this study, CSNs were utilized to analyze the distributions of targetwise active compounds on the basis of structural similarity. Because we clustered active compounds as SARMs, a maximum common substructure (MCS)-based Tanimoto coefficient (TcMCS) was employed as a similarity metric, which is defined as follows:

$$\mathrm{Tc}_{\mathrm{MCS}}(A, B) = \frac{|\mathrm{MCS}(A, B)|_b}{|A|_b + |B|_b - |\mathrm{MCS}(A, B)|_b}$$

Given substructure S, |S|b represents the number of bonds in S. TcMCS is the ratio of the number of bonds in the MCS of both compounds relative to the number of bonds present in one compound and the other. As such, it represents a hybrid of a substructure-based and continuous similarity measure.41,42 In CSNs, an edge was created when the TcMCS similarity value was greater than or equal to 0.4. MCSs for compound pairs were determined using the OEChem toolkit.26 The CSNs were visualized with Cytoscape.43
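The edge rule can be sketched from precomputed bond counts. This assumes the usual Tanimoto form for TcMCS, namely MCS bonds divided by the bonds of both compounds minus the MCS bonds; the function names and the dictionary layout are hypothetical, and the MCS computation itself is left to a toolkit.

```python
from itertools import combinations


def tc_mcs(n_bonds_a, n_bonds_b, n_bonds_mcs):
    """MCS-based Tanimoto coefficient computed from bond counts."""
    return n_bonds_mcs / (n_bonds_a + n_bonds_b - n_bonds_mcs)


def csn_edges(bond_counts, mcs_bonds, threshold=0.4):
    """Connect compound pairs whose TcMCS is greater than or equal to the threshold.

    bond_counts: {compound_id: number of bonds}
    mcs_bonds:   {(id_a, id_b): bonds in their MCS}, keyed with id_a < id_b
    """
    edges = []
    for a, b in combinations(sorted(bond_counts), 2):
        sim = tc_mcs(bond_counts[a], bond_counts[b], mcs_bonds.get((a, b), 0))
        if sim >= threshold:
            edges.append((a, b, sim))
    return edges
```

The resulting edge list can be exported to a network viewer; compounds without any edge remain isolated nodes.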

Binding Study by Surface Plasmon Resonance

SPR experiments were performed at 25 °C on a Biacore T200 instrument (Cytiva). Protein was immobilized by treating a streptavidin sensor chip successively with 1 M NaCl in 50 mM NaOH for 1 min (flow rate = 10 μL/min) three times and with 50 μg/mL ABL1(2-1130) biotinylated at the N-terminus (purchased from CarnaBioscience, catalog#: 08-401-20N) for 7 min (flow rate = 10 μL/min). The immobilization level was around 10,000–11,000 response units (RU). HBS-P+ [10 mM HEPES-NaOH (pH 7.4), 150 mM NaCl, and 0.05% Tween 20] containing 1 mM TCEP was used as the continuous flow buffer for immobilization. As for interaction analysis, flow buffer was composed of 50 mM Tris-HCl (pH 7.5), 150 mM NaCl, 5 mM MgCl2, 5 mM DTT, 0.05% Tween 20, and 2% DMSO. Measurements were performed using multicycle kinetics mode. Each of the compounds was injected as follows: flow rate of 30 μL/min, contact time of 60 s, and dissociation time of 180 s. Here, 1 mM ADP was injected periodically as a positive control to check the stability of the immobilized protein molecules. The binding level was corrected with a standard solvent correction protocol (1% to 3% DMSO in six steps). All analyses and corrections were performed with Biacore Evaluation Software version 2.0 (Cytiva).

Molecular Docking

Molecular docking evaluated the SH compounds for ABL in terms of protein–ligand binding. From the PDB, a high-resolution X-ray cocrystallized structure was obtained (PDBID: 2HZI). After the ligand molecule was extracted, the protein structure was prepared using the Protein Preparation Wizard application in the Schrödinger suite (ver. 2022 04),44 with the default settings. AutoDock Vina (ver. 1.2.5)45 was used for molecular docking. For each docked compound that has no ambiguity regarding stereoinformation, a maximum of 10 conformers were generated by Omega.45 As a negative control, 10,750 decoy molecules obtained from DUD-E46 were also docked. The PDBQT files as inputs for AutoDock Vina were created from the ligand SDFiles with Meeko47 and the receptor PDB file using the AutoDock tools. A grid box (25 Å × 25 Å × 25 Å) was centered at (19.251, 12.334, 14.588) Å, which was the center of mass of the ligand, with an exhaustiveness of 16 for the search. The binding energy of a ligand was defined as the lowest among all conformers.
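The per-ligand score definition, together with one simple way of comparing a ligand against the decoy distribution, can be sketched as follows. The second function is a hypothetical illustration and not a procedure described in the source; scores are assumed to be docking energies where lower is more favorable.

```python
def binding_energy(conformer_scores):
    """Binding energy of a ligand: the lowest (most favorable) docking score over its conformers."""
    return min(conformer_scores)


def fraction_of_decoys_outscored(ligand_energy, decoy_energies):
    """Share of decoys with a higher (less favorable) energy than the ligand."""
    return sum(e > ligand_energy for e in decoy_energies) / len(decoy_energies)
```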

Results and Discussion

Study Design

Various VS approaches to identifying SH compounds were tested, assuming that only a limited number of analogous active compounds were available. In the retrospective validation, VS trials for SH compounds were conducted for four biological targets. Each training data set consisted of 10–50 analogous active compounds in a SARM and 100 randomly selected inactive compounds. The corresponding test set consisted of only SH compounds. In the prospective demonstration, ABL was selected as a target, and the Namiki database was screened. For this purpose, only SVM-ROCS was selected, with two types of reference compounds, owing to its high performance against this target in the retrospective validation. The selected compounds, which were distinct from the active training compounds, were experimentally tested by SPR. For these compounds, a docking study was conducted, and the score distribution was compared with that of the decoy compounds,46 revealing a characteristic of the selected compounds.

In the retrospective validation, the VS performance was evaluated based on the AUC-PRC and the top rank of active compounds. Tested strategies were combinations of two screening algorithms (SS or SVM) and four descriptors (ECFP4, CATS, WHALES, or ROCS). To reveal a connection between the distribution of active compounds and VS performance, compound networks were created by connecting compounds with a common scaffold. Furthermore, for SVM-ECFP4 and SVM-ROCS, which showed high performances, the chemical structures of their top active compounds were manually checked to understand the two representations. For SVM-ROCS, reference compounds were investigated by including compounds similar to active training compounds.

In the prospective demonstration, three SARMs for ABL were selected based on scaffold types. For each SARM, two SVM-ROCS models with different reference compounds, namely, (active, similar) and (active, inactive), were trained. Each SVM model ranked the compounds in the database, and the top 1000 compounds were selected from a pool of SH compounds. Then, the compounds were classified by scaffolds, and the clustered compounds were visually filtered for druggability by medicinal chemists. Final assay candidate compounds were determined based on ease of purchase, including cost. Selected compounds were purchased and tested for affinity against ABL by SPR assays. Focusing on the experimentally tested compounds, a docking study using X-ray cocrystallized complexes was conducted to understand these compounds from a molecular simulation perspective.

Known Active Compound Distributions

The MCS-based active compound networks for the four targets are reported in Figure 2. Nodes are active compounds, and edges represent TcMCS values greater than or equal to 0.4. In the ABL CSN, three independent clusters were formed, corresponding to three scaffolds. Thus, identifying active compounds residing in one cluster using a model built on compounds in another cluster was intrinsically difficult. Since active training compounds were from a SARM, SH compound identification was likely to correspond to this difficult scenario. Meanwhile, in the GAA CSN, one large cluster was formed. All active compounds could be related to one another through repeatedly applied chemical structure modifications that kept part of the molecule. Therefore, for this target, higher predictive accuracy than for the other targets was expected owing to the structural connection.

Figure 2.

Maximum common substructure networks. Nodes represent active compounds, and edges connect between nodes when the maximum common substructure ratio in terms of the number of heavy atoms is 0.4 or more.

Comparison of the Virtual Screening Methods

For the four biological targets, the AUC-PRC values for each combination of screening algorithm and descriptor are reported in Figure 3. The reference set of SVM-ROCS consisted of all of the active training compounds and 100 inactive compounds randomly selected from the potential training compounds. The reference set for SVM-ROCS-sim consisted of all of the active training compounds, 100 randomly selected inactive compounds, and the 25 compounds most similar to the active training compounds. The numbers of inactive and similar compounds among the reference compounds were determined on the basis of the predictive ability of models using different numbers of inactive and similar compounds, as shown in Figures S1–S3 in the Supporting Information. In general, the more inactive compounds that were used, the higher the AUC-PRC values became (Figure S1). However, for ADRB2 without similar compounds and for ABL, the use of 200 inactive compounds decreased the AUC-PRC values. Thus, 100 inactive compounds were used. Similarly, 25 similar compounds showed the same or slightly better performance than 100 similar compounds (Figures S2 and S3). The paired t test with a significance level of 0.05 was performed for each combination of methods in Figure 3. The details of the t test and all of the results are shown in Tables S1–S5. Overall, SVM-ECFP4 and SVM-ROCS showed relatively high performance (Tables S1 and S2), and introducing ML models for VS gave higher performance than SS (Table S3). In SVM-ROCS, the use of similar compounds as reference compounds increased the AUC-PRC values only for ABL (Table S4). In terms of scoring metrics, ColorTanimoto was better than TanimotoCombo in combination with all methods for ADRB2 and with SVM-ROCS for GAA and ABL (Table S5). Distributions of the top rank of the active SH compounds are shown in Figure 4. The Wilcoxon signed-rank test at the significance level of 0.05 was performed for these results. In these distributions, SVM-ECFP4 performed best for targets other than ABL (Table S6). For ABL, ECFP4 did not work at all: the average rank of the top active compounds was 1550 when using SVM-ECFP4. Meanwhile, SVM-ROCS was the best for the early identification of active compounds for ADRB2 and ABL (Table S7), and the average rank of the top active compounds for ABL was 158. The difference in the VS performance between target macromolecules can be explained by the distribution of active compounds in the data sets.

Figure 3.

Virtual screening performance. For each target and method, the area under the precision recall curve (AUC-PRC) values are reported as a boxplot. Methods are represented as combinations of screening algorithms [similarity search (SS) and support vector machine (SVM)] and descriptors [extended connectivity fingerprints of bond diameter 4 (ECFP4), chemically advanced template search (CATS), weighted holistic atom localization and entity shape (WHALES), and rapid overlay of chemical structure (ROCS)]. The similarity metrics are provided in parentheses. Color and Combo are ColorTanimoto and TanimotoCombo in the ROCS calculation, respectively. Reference compounds for SVM-ROCS were all active and 100 inactive compounds, while all active, 100 inactive, and 25 similar compounds were used for SVM-ROCS-sim.

Figure 4.

Virtual screening performance. For each method, the common logarithms of the ranks of the active compounds with the top screening ranking are reported as a boxplot. The representation of the methods is the same as that in Figure 3.

Comparison of Top-Ranked Compound Structures

For SVM-ECFP4 and SVM-ROCS, which exhibited high performances for different target macromolecules, the chemical structures of their top active compounds for ERK2 and ABL were visually compared. For ERK2, SVM-ECFP4 worked best among the tested methods. Figure 5a shows the top three active compounds per method for a representative training SARM for ERK2. The representative SARM was selected as the one for which the top three active compounds from each method were ranked within the top 100 and had the maximum median difference in rank between SVM-ECFP4 and SVM-ROCS (Color). ECFP4 performed well for targets whose test data sets contained many compounds in which the order of the substructures was changed. These compounds met the definition of SH compounds but were recombinant compounds of known active compounds. In contrast, for ABL (Figure 5b), the top three compounds were ranked from 553 to 676 using SVM-ECFP4 but from 25 to 28 using SVM-ROCS; the average AUC-PRC value was 0.0173 for SVM-ROCS and 0.0043 for SVM-ECFP4. The compounds that were highly ranked using SVM-ROCS had structures different from those of the active training compounds, which is consistent with ROCS focusing on the spatial relation between pharmacophoric points.

Figure 5.

Comparison of the top active compounds in different screening methods. For ERK2 (a) and ABL (b), the top three active compounds are shown for a representative test SARM: for the SARM, the top three compounds using SVM-ECFP4 and SVM-ROCS (Color) were ranked within the top 100, while those using SVM-ECFP4 for ABL were ranked within the top 1000, and the maximum median difference in rank between SVM-ECFP4 and SVM-ROCS (Color) was achieved. In the first row, active training compounds are shown. In the second and third rows, the top three active compounds using SVM-ECFP4 and SVM-ROCS (Color) are shown, respectively. The numbers in parentheses below each test compound represent the ranks using SVM-ECFP4 and SVM-ROCS (Color), respectively.

To understand the characteristics of the descriptors designed for SH, chemical structures among the top-ranked compounds were evaluated in terms of scaffold diversity and molecular complexity. The scaffold diversity was measured with both the pairwise Tanimoto similarity in ECFP4 and the number of unique Bemis-Murcko scaffolds48 in the top-ranked 50 compounds. The molecular complexity was measured with the synthetic accessibility score.49 The averages of the metric values for all SARMs using ECFP4, CATS, WHALES, and ROCS (Color) in combination with SVM are reported in Table 2. Although there was no clear difference among representations in the molecular complexity, the scaffold diversity was highest for WHALES, followed by CATS and ROCS (Color).

Table 2. Scaffold Diversity and Molecular Complexity of the Top-Ranked Compounds.

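| method and representation | Tanimoto (ADRB2) | Tanimoto (GAA) | Tanimoto (ERK2) | Tanimoto (ABL) | BM ratio (ADRB2) | BM ratio (GAA) | BM ratio (ERK2) | BM ratio (ABL) | SA (ADRB2) | SA (GAA) | SA (ERK2) | SA (ABL) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM-ECFP4 | 0.26 | 0.24 | 0.26 | 0.21 | 0.68 | 0.68 | 0.68 | 0.79 | 2.7 | 2.4 | 2.8 | 2.4 |
| SVM-CATS | 0.20 | 0.20 | 0.16 | 0.15 | 0.70 | 0.58 | 0.71 | 0.76 | 2.6 | 2.5 | 2.9 | 2.6 |
| SVM-WHALES | 0.16 | 0.15 | 0.15 | 0.14 | 0.87 | 0.85 | 0.89 | 0.91 | 2.6 | 2.4 | 2.5 | 2.5 |
| SVM-ROCS (Color) | 0.20 | 0.19 | 0.20 | 0.19 | 0.70 | 0.56 | 0.71 | 0.71 | 2.8 | 2.5 | 2.4 | 2.4 |

Tanimoto, pairwise Tanimoto similarity; BM ratio, unique BM scaffold ratio; SA, SA score.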

For each target and method, the averaged pairwise Tanimoto similarity, the average unique Bemis-Murcko (BM) scaffold ratio, and the average synthetic accessibility (SA) score are listed. Pairwise Tanimoto similarity was the averaged pairwise Tanimoto similarity among the top-ranked compounds. The unique BM scaffold ratio is the number of unique BM scaffolds in the top-ranked compounds divided by the number of top-ranked compounds. The averages of the top 50 compounds of all of the SARMs are listed.
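The two scaffold diversity metrics defined in the table footnote can be sketched as follows. This is a minimal illustration that assumes fingerprints are already available as sets of on-bit indices and that Bemis-Murcko scaffolds are already available as canonical strings; the function names and toy inputs are hypothetical, and generating ECFP4 fingerprints or scaffolds requires a cheminformatics toolkit that is not shown here.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto similarity over all unordered compound pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def unique_scaffold_ratio(scaffolds):
    """Number of unique scaffolds divided by the number of compounds."""
    if not scaffolds:
        raise ValueError("need at least one scaffold")
    return len(set(scaffolds)) / len(scaffolds)


# Hypothetical top-ranked list of three compounds.
toy_fps = [{1, 2, 3}, {2, 3, 4}, {5}]
toy_scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1"]
toy_similarity = mean_pairwise_tanimoto(toy_fps)
toy_ratio = unique_scaffold_ratio(toy_scaffolds)
print(toy_similarity, toy_ratio)
```

A lower mean pairwise similarity and a higher unique scaffold ratio both indicate a structurally more diverse top-ranked list.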

Exemplary top-ranked compounds using SVM-ECFP4, SVM-CATS, SVM-WHALES, and SVM-ROCS (Color) are reported in Figure S4 for the ABL SARM with the highest AUC-PRC value when SVM-ROCS (Color) is used. CATS and WHALES did not yield scaffolds more similar than those obtained with ECFP4. These descriptors assigned high ranks to compounds that did not share a common substructure with the active training compounds or with other top-ranked compounds. This result is consistent with reports that CATS and WHALES have a high scaffold-hopping ability.8,12 The top-ranked compounds for other targets are shown in Figure S5.

Effects of Using Compounds Similar to the Reference Compounds

For ABL, the VS performance for SVM-ROCS was further improved by extending the reference compound set via the introduction of compounds that resemble the active training compounds in terms of the chemical property-based metric (ColorTanimoto). Focusing on this target, the effect of the number of similar compounds in reference sets was investigated. Figure 6 reports the SARM-wise AUC-PRC values against the number of similar compounds in SVM-ROCS using the ColorTanimoto score. Increasing the number of similar compounds improved the performance for some SARMs, while for other SARMs, the performance was impaired.
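The composition of a reference set (all active compounds, a random sample of inactive compounds, and the compounds most similar to the actives) can be sketched as below. This is only an illustration under stated assumptions: the function name, the pool from which similar compounds are drawn, and the use of one precomputed similarity value per candidate (for example, a ColorTanimoto score to the active training compounds) are hypothetical and not taken from the original protocol.

```python
import random


def build_reference_set(actives, inactive_pool, similarity_to_actives,
                        n_inactive=100, n_similar=25, seed=0):
    """Assemble reference compound IDs for a similarity-based model.

    actives: IDs of active training compounds (all are kept).
    inactive_pool: IDs of inactive candidates to sample from.
    similarity_to_actives: dict mapping candidate ID -> precomputed
        similarity to the active training compounds (assumed input).
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    n_draw = min(n_inactive, len(inactive_pool))
    inactive_refs = rng.sample(list(inactive_pool), n_draw)

    # Rank the remaining candidates by similarity, most similar first.
    already_used = set(actives) | set(inactive_refs)
    candidates = [c for c in similarity_to_actives if c not in already_used]
    candidates.sort(key=lambda c: similarity_to_actives[c], reverse=True)
    similar_refs = candidates[:n_similar]

    return list(actives) + inactive_refs + similar_refs


# Hypothetical toy inputs.
toy_actives = ["a1", "a2"]
toy_inactives = ["i%d" % k for k in range(10)]
toy_similarity_scores = {"s%d" % k: k / 10 for k in range(5)}
toy_refs = build_reference_set(toy_actives, toy_inactives,
                               toy_similarity_scores,
                               n_inactive=3, n_similar=2)
print(toy_refs)
```

Setting `n_similar` to 0, 25, 50, or 100 in such a function would correspond to the settings compared in this subsection.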

Figure 6.

Effects of reference compounds in SVM-ROCS (Color) for the tyrosine-protein kinase ABL. The AUC-PRC values are plotted against the number of similar compounds in the reference sets. The reference sets consist of all active, 100 inactive, and 0, 25, 50, or 100 similar compounds. Each line represents a SARM for training. SARMs are sorted in ascending order based on the number of active test compounds. The colors of lines may be regarded as SARM scaffolds because a similar number of active test compounds resulted from the use of similar active training compounds.

For ABL, all of the SARMs were classified into three clusters with different scaffolds (Figure 2). We observed that only one of the six combinations of the cluster containing the training SARM and the cluster containing test SARMs improved the performance.

Figure 7 shows the top-ranked active compounds when using the different numbers of similar compounds as references for SARM ID 67005_series_1 in Figure 6. For this SARM, the introduction of similar compounds improved the screening performance (AUC-PRC: 0.0163 without similar compounds, 0.0638 with 100 similar compounds). As the number of similar compounds increased, compounds with structurally similar scaffolds dominated the top ranks. Indeed, the use of similar active compounds appears to bias the screening results toward a certain type of scaffold, in a manner similar to variable augmentation. For SARMs in which increasing the number of similar compounds did not improve the performance, the top-ranked compounds were already similar to one another without the use of similar compounds, and adding similar compounds did not affect the top-ranked compounds.

Figure 7.

Top-ranked compounds with different numbers of reference compounds. Comparison of the top-ranked active compounds was based on different numbers of similar compounds. An exemplary SARM for which performance was improved by increasing the number of similar compounds was selected. The top five active compounds in the test data set are shown. The reference sets consist of all active, 100 inactive, and 0, 25, 50, or 100 similar compounds. The bottom row shows randomly selected active training compounds.

Virtual Screening of the Namiki Database

For ABL, a VS trial for the identification of SH compounds among active compounds in the three SARMs was conducted. These SARMs consisted of three distinct scaffolds (Figure 2), and example compounds in the SARMs as well as the number of active compounds are given in Figure 8. The modeling method was SVM-ROCS, and two reference compound sets were employed because of their high performance in our retrospective study. The first was a set of all active training compounds and the top 25 similar compounds, while the second was all active training compounds and 100 inactive compounds. For each SARM, the two models were used for screening the 4,082,935 compounds in the Namiki database. The top 1000 SH compounds were selected for each of the six models (3 SARMs × 2 methods). There was a large overlap of selected compounds between the two methods. For the SARM ID 0_series_1, 803 compounds out of 1000 were in common, while for 188777_series_1 and 194338_series_1, the numbers were 776 and 846, respectively. These candidate compounds were pooled (3571 unique compounds) and further scrutinized based on the following characteristics: dissimilarity to known compounds based on a 2D fingerprint, scaffold diversity, regulation, and price. This process led to the selection of 100 compounds, of which the 93 that were available for purchase were used for in vitro experiments.
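The overlap analysis between the selections of two models can be sketched with simple set operations. The function names and toy scores below are hypothetical, and the sketch omits the SH filtering step that precedes the top-1000 selection; only the overlap counts 803, 776, and 846 are taken from the text.

```python
def top_k_ids(scores, k):
    """Return the IDs of the k best-scored compounds (dict: ID -> score)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [compound_id for compound_id, _ in ranked[:k]]


def selection_overlap(scores_a, scores_b, k):
    """Return (number in common, number of unique compounds) for two models."""
    top_a = set(top_k_ids(scores_a, k))
    top_b = set(top_k_ids(scores_b, k))
    return len(top_a & top_b), len(top_a | top_b)


# Hypothetical scores from two models for the same four compounds.
model_a = {"c1": 0.9, "c2": 0.8, "c3": 0.1, "c4": 0.5}
model_b = {"c1": 0.7, "c2": 0.2, "c3": 0.9, "c4": 0.6}
n_common, n_unique = selection_overlap(model_a, model_b, k=2)

# Per-SARM unique counts implied by the overlaps reported in the text
# (two lists of 1000 compounds each, minus the compounds in common).
reported_common = [803, 776, 846]
per_sarm_unique = [2000 - c for c in reported_common]
print(n_common, n_unique, per_sarm_unique, sum(per_sarm_unique))
```

Under this reading, the per-SARM unique counts sum to 3575, so the reported 3571 unique compounds would imply that a few compounds were selected for more than one SARM.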

Figure 8.

Target SARMs for prospective validation. For each target SARM, an example compound and the numbers of active and inactive compounds in the SARM are shown.

The first screening interaction analysis yielded five compounds (Figure 9) with a response comparable to that of the positive control compound. However, the second screening, the coinjection assay with ADP, yielded no hit compounds. Based on these results, we could not find any compounds that specifically act on the ATP binding site from the compounds assayed in this study.

Figure 9.

Five hit compounds identified by the SPR response at 20 μM. The structures and response values of the compounds are shown.

The prospective demonstration found no competitive inhibitors for ABL that had an SH relation to known active compounds. We speculate that there are several explanations for this failure. First, the top compounds selected by SVM-ROCS were not always tested for prospective validation because of a filtering process that included scaffold diversity, regulation, and price in addition to the scaffold hopping criterion (an MCS threshold of 0.4). Some of the compounds selected for the demonstration were ranked below 2000 by the SVM-ROCS method. Second, the training “active” compounds used for validation might not all be competitive inhibitors, and their activity annotations might include experimental errors. Third, a limited diversity of screening compounds in the Namiki database might make the identification of SH compounds difficult. The number of compounds screened in this study was around 4 million, which is one or two orders of magnitude smaller than large compound libraries, such as ZINC.50

To overcome these potential pitfalls, we propose to use a larger compound library and expand the search space of compounds. Furthermore, the 93 compounds that were found to be inactive should be incorporated into the training data set for the next round of screening. These inactive compounds were originally predicted as SH candidates; thus, employing them as inactive would help models learn to identify a different region of the chemical space for the active compound identification.

Docking Study

Although our prospective validation based on SPR identified no SH compounds as competitive inhibitors for ABL, a docking study was conducted to evaluate, from a structure-based perspective, both the experimentally tested compounds and the compounds selected by SVM-ROCS. Six data sets were prepared for the docking simulation: known active compounds against ABL (number of compounds: 597 extracted from the ChEMBL database); compounds selected by SVM-ROCS as SH for the three SARMs, namely, 0_series_1, 18777_series_1, and 194338_series_1 (number of compounds: 1000 for each SARM); experimentally tested compounds (n = 93); and decoy compounds from DUD-E (n = 10,750). From these data sets, 5 active compounds, 19 decoy compounds, 5 compounds for 18777_series_1, and 2 compounds for 194338_series_1 were excluded due to failures during the docking study process. The root-mean-square deviation between the crystallized ligand and the redocked one was 0.2 Å, and the area under the receiver operating characteristic curve was 0.86 using actives and decoys, supporting the validity of the docking procedure.
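The two validation figures quoted above (redocking deviation and area under the receiver operating characteristic curve) can be illustrated with a minimal sketch. The function names and toy values are hypothetical; the sketch assumes that atoms of the two poses are already matched one-to-one and that a lower docking energy indicates a better score. It is not the procedure of any particular docking package.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in size")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def roc_auc_from_energies(active_energies, decoy_energies):
    """Probability that a random active scores better than a random decoy.

    Lower energy is treated as better; ties count as half a win.
    """
    if not active_energies or not decoy_energies:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for active in active_energies:
        for decoy in decoy_energies:
            if active < decoy:
                wins += 1.0
            elif active == decoy:
                wins += 0.5
    return wins / (len(active_energies) * len(decoy_energies))


# Hypothetical toy data (energies in kcal/mol, coordinates in angstroms).
toy_auc = roc_auc_from_energies([-10.0, -9.0], [-8.0, -9.5, -7.0])
toy_rmsd = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                [(0.0, 0.0, 0.2), (1.0, 0.0, 0.2)])
print(toy_auc, toy_rmsd)
```

The pairwise formulation is equivalent to the area under the ROC curve and remains fast enough for a few hundred actives against roughly ten thousand decoys.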

The distributions of binding energies for these data sets are reported as box plots in Figure 10. The median binding energy for the known active compounds was −10.68 kcal/mol, while that for the decoy compounds was −7.40 kcal/mol. The binding energies of the known active compounds were significantly lower, i.e., more stable, than those of the decoy compounds according to Welch’s t test with a significance level of 0.05 (Table S8). The median binding energies for the SH compounds selected by SVM-ROCS were −9.49, −9.59, −9.16, and −9.42 kcal/mol for 0_series_1, 18777_series_1, 194338_series_1, and experimentally tested compounds, respectively (median binding energy was −9.38 kcal/mol for the combined three series). The four data sets of SH compounds exhibited significantly lower binding energies than the decoy compounds (Table S8). These median energies for the selected SH compounds for the three SARMs were similar to those for the corresponding training compounds: −8.40 kcal/mol for 0_series_1, −9.41 kcal/mol for 18777_series_1, and −9.69 kcal/mol for 194338_series_1.
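Welch’s t test, used above to compare binding energy distributions with unequal variances and sample sizes, can be sketched as follows. The sketch computes only the statistic and the Welch-Satterthwaite degrees of freedom; the function name and the toy samples are hypothetical and are not docking results from the study.

```python
import math
from statistics import mean, median, variance


def welch_t(sample_x, sample_y):
    """Return (t, degrees of freedom) for Welch's unequal-variance t test."""
    if len(sample_x) < 2 or len(sample_y) < 2:
        raise ValueError("each sample needs at least two values")
    # Squared standard errors of the two sample means.
    se2_x = variance(sample_x) / len(sample_x)
    se2_y = variance(sample_y) / len(sample_y)
    if se2_x + se2_y == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_x) - mean(sample_y)) / math.sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_x + se2_y) ** 2 / (
        se2_x ** 2 / (len(sample_x) - 1) + se2_y ** 2 / (len(sample_y) - 1)
    )
    return t, dof


# Hypothetical toy samples.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0, 10.0]
toy_t, toy_dof = welch_t(toy_x, toy_y)
print(toy_t, toy_dof, median(toy_x), median(toy_y))
```

A negative statistic means that the first sample has the lower mean; with binding energies, this corresponds to more favorable docking scores for the first group.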

Figure 10.

Distributions of binding energy (kcal/mol) by docking simulation for the selected compounds for ABL. Binding energies are reported as boxplots for respective data sets. The tested data sets were Active, namely, known active compounds against ABL; SARMs (0_series_1, 18777_series_1, and 194338_series_1), namely, SH compounds selected by SVM-ROCS; Validated, namely, compounds selected for experimental validation; and Decoy, namely, decoy compounds. Each boxplot represents the interquartile range of the data, with the median value indicated by the horizontal line within the box.

Conclusions

Retrospective and prospective validations of the ability to identify SH compounds using ligand-based approaches were conducted. From the results of methodological comparison, the combination of SVM and ECFP4 (SVM-ECFP4) or SVM and ROCS (SVM-ROCS) showed a relatively high performance for identifying SH compounds, although which of these was superior depended on the target macromolecule. The compounds that were highly ranked using SVM-ROCS were compounds with completely different structures from the active training compounds, while the compounds that were highly ranked using SVM-ECFP4 were merely recombinant but still different in terms of the molecular scaffold. Furthermore, one important parameter in SVM-ROCS, namely, the reference compound set, was found to prioritize structurally similar compounds as the number of similar training compounds increased. In the prospective demonstration targeting ABL1 inhibitors, 93 compounds were prepared by screening the Namiki database. The docking study revealed that the docking score distribution of the selected SH compounds was lower than that of the decoy compounds. The primary screening using SPR suggested five active compounds; however, in the competitive binding assays with ATP, no hits were found.

Acknowledgments

We thank Jürgen Bajorath and Martin Vogt, University of Bonn, Germany, for providing us with codes for MMP fragmentation. We are grateful to the Forli lab at the Center for Computational Structural Biology at Scripps Research for providing Meeko, the code for the preparation of small molecules for AutoDock.

Data Availability Statement

The data used in the retrospective validation are available in the Supporting Information for this paper. The data used in the prospective validation were provided by NAMIKI SHOJI Co., Ltd. For the preparation of compound data and descriptors, the OpenEye Toolkits and ROCS were used under an academic license. For docking, the Schrödinger suite was used under a commercial license.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.4c00342.

  • Figure S1. Effects of inactive reference compounds in SVM-ROCS (Color); Figure S2. Investigation of reference compounds in SVM-ROCS; Figure S3. Investigation of reference compounds; Figure S4. Top-ranked compounds for different descriptors for ABL; Figure S5. Top-ranked compounds for different descriptors for ADRB2, GAA, and ERK2; Table S1. SVM-ECFP4 vs others in the AUC-PRC; Table S2. SVM-ROCS vs others in the AUC-PRC; Table S3. SVM vs SS in the AUC-PRC; Table S4. SVM-ROCS-sim vs SVM-ROCS in the AUC-PRC; Table S5. ColorTanimoto vs TanimotoCombo in the AUC-PRC; Table S6. SVM-ECFP4 vs others in the top rank of active; Table S7. SVM-ROCS vs others in the top rank of active; Table S8. Known active and SH compounds vs decoy compounds (PDF)

  • Four data sets containing curated compounds as SMILES for retrospective validation (ZIP)

The authors declare no competing financial interest.

Supplementary Material

ci4c00342_si_001.pdf (1.2MB, pdf)
ci4c00342_si_002.zip (5.1MB, zip)

References

  1. Schneider G.; Neidhart W.; Giller T.; Schmid G. “Scaffold-Hopping” by Topological Pharmacophore Search: A Contribution to Virtual Screening. Angewandte Chemie - International Edition 1999, 38 (19), 2894–2896.
  2. Callis T. B.; Garrett T. R.; Montgomery A. P.; Danon J. J.; Kassiou M. Recent Scaffold Hopping Applications in Central Nervous System Drug Discovery. J. Med. Chem. 2022, 65 (20), 13483–13504. 10.1021/acs.jmedchem.2c00969.
  3. Hu Y.; Stumpfe D.; Bajorath J. Recent Advances in Scaffold Hopping. J. Med. Chem. 2017, 60 (4), 1238–1246. 10.1021/acs.jmedchem.6b01437.
  4. Wolber G.; Langer T. LigandScout: 3-D Pharmacophores Derived from Protein-Bound Ligands and Their Use as Virtual Screening Filters. J. Chem. Inf. Model. 2005, 45 (1), 160–169. 10.1021/ci049885e.
  5. Olla S.; Manetti F.; Crespan E.; Maga G.; Angelucci A.; Schenone S.; Bologna M.; Botta M. Indolyl-Pyrrolone as a New Scaffold for Pim1 Inhibitors. Bioorg. Med. Chem. Lett. 2009, 19 (5), 1512–1516. 10.1016/j.bmcl.2009.01.005.
  6. Stiefl N.; Watson I. A.; Baumann K.; Zaliani A. ErG: 2D Pharmacophore Descriptions for Scaffold Hopping. J. Chem. Inf. Model. 2006, 46 (1), 208–220. 10.1021/ci050457y.
  7. Nakano H.; Miyao T.; Funatsu K. Exploring Topological Pharmacophore Graphs for Scaffold Hopping. J. Chem. Inf. Model. 2020, 60 (4), 2073–2081. 10.1021/acs.jcim.0c00098.
  8. Reutlinger M.; Koch C. P.; Reker D.; Todoroff N.; Schneider P.; Rodrigues T.; Schneider G. Chemically Advanced Template Search (CATS) for Scaffold-Hopping and Prospective Target Prediction for “Orphan” Molecules. Mol. Inform. 2013, 32 (2), 133–138. 10.1002/minf.201200141.
  9. Rodrigues T.; Roudnicky F.; Koch C. P.; Kudoh T.; Reker D.; Detmar M.; Schneider G. De Novo Design and Optimization of Aurora A Kinase Inhibitors. Chem. Sci. 2013, 4 (3), 1229–1233. 10.1039/c2sc21842a.
  10. Hartenfeller M.; Zettl H.; Walter M.; Rupp M.; Reisen F.; Proschak E.; Weggen S.; Stark H.; Schneider G. DOGS: Reaction-Driven de Novo Design of Bioactive Compounds. PLoS Comput. Biol. 2012, 8 (2), e1002380. 10.1371/journal.pcbi.1002380.
  11. Grisoni F.; Merk D.; Consonni V.; Hiss J. A.; Tagliabue S. G.; Todeschini R.; Schneider G. Scaffold Hopping from Natural Products to Synthetic Mimetics by Holistic Molecular Similarity. Communications Chemistry 2018, 1 (1), 1–9. 10.1038/s42004-018-0043-x.
  12. Grisoni F.; Merk D.; Byrne R.; Schneider G. Scaffold-Hopping from Synthetic Drugs by Holistic Molecular Representation. Sci. Rep. 2018, 8 (1), 16469. 10.1038/s41598-018-34677-0.
  13. ROCS Version 3.4.0. OpenEye Scientific Software: Santa Fe, NM, 2020. https://www.eyesopen.com/rocs (accessed 2024–02–19).
  14. Hawkins P. C. D.; Skillman A. G.; Nicholls A. Comparison of Shape-Matching and Docking as Virtual Screening Tools. J. Med. Chem. 2007, 50 (1), 74–82. 10.1021/jm0603365.
  15. Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50 (5), 742–754. 10.1021/ci100050t.
  16. Miyao T.; Jasial S.; Bajorath J.; Funatsu K. Evaluation of Different Virtual Screening Strategies on the Basis of Compound Sets with Characteristic Core Distributions and Dissimilarity Relationships. J. Comput. Aided Mol. Des. 2019, 33 (8), 729–743. 10.1007/s10822-019-00218-8.
  17. Vogt M.; Stumpfe D.; Geppert H.; Bajorath J. Scaffold Hopping Using Two-Dimensional Fingerprints: True Potential, Black Magic, or a Hopeless Endeavor? Guidelines for Virtual Screening. J. Med. Chem. 2010, 53 (15), 5707–5715. 10.1021/jm100492z.
  18. Durant J. L.; Leland B. A.; Henry D. R.; Nourse J. G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42 (6), 1273–1280. 10.1021/ci010132r.
  19. Renner S.; Schneider G. Scaffold-Hopping Potential of Ligand-Based Similarity Concepts. ChemMedChem. 2006, 1 (2), 181–185. 10.1002/cmdc.200500005.
  20. Zhang Q.; Muegge I. Scaffold Hopping through Virtual Screening Using 2D and 3D Similarity Descriptors: Ranking, Voting, and Consensus Scoring. J. Med. Chem. 2006, 49 (5), 1536–1548. 10.1021/jm050468i.
  21. Vapnik V. N. The Nature of Statistical Learning Theory, 2nd ed.; Springer: Berlin, 2000. 10.1007/978-1-4757-3264-1.
  22. Berman H. M.; Westbrook J.; Feng Z.; Gilliland G.; Bhat T. N.; Weissig H.; Shindyalov I. N.; Bourne P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1), 235–242. 10.1093/nar/28.1.235.
  23. Mendez D.; Gaulton A.; Bento A. P.; Chambers J.; De Veij M.; Félix E.; Magariños M. P.; Mosquera J. F.; Mutowo P.; Nowotka M.; Gordillo-Marañón M.; Hunter F.; Junco L.; Mugumbate G.; Rodriguez-Lopez M.; Atkinson F.; Bosc N.; Radoux C. J.; Segura-Cabrera A.; Hersey A.; Leach A. R. ChEMBL: Towards Direct Deposition of Bioassay Data. Nucleic Acids Res. 2019, 47 (D1), D930–D940. 10.1093/nar/gky1075.
  24. Kim S.; Chen J.; Cheng T.; Gindulyte A.; He J.; He S.; Li Q.; Shoemaker B. A.; Thiessen P. A.; Yu B.; Zaslavsky L.; Zhang J.; Bolton E. E. PubChem in 2021: New Data Content and Improved Web Interfaces. Nucleic Acids Res. 2021, 49 (D1), D1388–D1395. 10.1093/nar/gkaa971.
  25. Liu Z.; Li Y.; Han L.; Li J.; Liu J.; Zhao Z.; Nie W.; Liu Y.; Wang R. PDB-Wide Collection of Binding Data: Current Status of the PDBbind Database. Bioinformatics 2015, 31 (3), 405–412. 10.1093/bioinformatics/btu626.
  26. OEChem TK Version 3.0.0; OpenEye Scientific Software Inc: Santa Fe, NM, 2020. https://www.eyesopen.com/oechem-tk (accessed 2024–02–19).
  27. MolProp TK Version 2.5.4; OpenEye Scientific Software Inc: Santa Fe, NM, 2020. https://www.eyesopen.com/molprop-tk (accessed 2024–02–19).
  28. Wassermann A. M.; Haebel P.; Weskamp N.; Bajorath J. SAR Matrices: Automated Extraction of Information-Rich SAR Tables from Large Compound Data Sets. J. Chem. Inf. Model. 2012, 52 (7), 1769–1776. 10.1021/ci300206e.
  29. Bajorath J.; Gupta-Ostermann D. The “SAR Matrix” Method and Its Extensions for Applications in Medicinal Chemistry and Chemogenomics. F1000Res. 2014, 3, 113. 10.12688/f1000research.4185.2.
  30. Wawer M.; Bajorath J. Local Structural Changes, Global Data Views: Graphical Substructure-Activity Relationship Trailing. J. Med. Chem. 2011, 54 (8), 2944–2951. 10.1021/jm200026b.
  31. Kenny P. W.; Sadowski J. Structure Modification in Chemical Databases. In Chemoinformatics in Drug Discovery; Oprea T. I., Ed.; Methods and Principles in Medicinal Chemistry; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, 2005; Vol. 23, pp 271–285. 10.1002/3527603743.ch11.
  32. iwatobipen. CATS2D: Rdkit_cats2d. https://github.com/iwatobipen/CATS2D (accessed 2022–05–18).
  33. Omega TK Version 4.0.0; OpenEye Scientific Software Inc: Santa Fe, NM, 2020. https://www.eyesopen.com/omega-tk (accessed 2024–02–19).
  34. Halgren T. A. MMFF94s Option for Energy Minimization Studies. J. Comput. Chem. 1999, 20 (7), 720–729.
  35. Sato T.; Yuki H.; Takaya D.; Sasaki S.; Tanaka A.; Honma T. Application of Support Vector Machine to Three-Dimensional Shape-Based Virtual Screening Using Comprehensive Three-Dimensional Molecular Shape Overlay with Known Inhibitors. J. Chem. Inf. Model. 2012, 52 (4), 1015–1026. 10.1021/ci200562p.
  36. Boser B. E.; Guyon I. M.; Vapnik V. N. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory; ACM, 1992; pp 144–152. 10.1145/130385.130401.
  37. Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; Vanderplas J.; Passos A.; Cournapeau D.; Brucher M.; Perrot M.; Duchesnay É. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12 (85), 2825–2830.
  38. Ralaivola L.; Swamidass S. J.; Saigo H.; Baldi P. Graph Kernels for Chemical Informatics. Neural Netw. 2005, 18 (8), 1093–1110. 10.1016/j.neunet.2005.07.009.
  39. Saito T.; Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS One 2015, 10 (3), e0118432. 10.1371/journal.pone.0118432.
  40. NAMIKI SHOJI Co., Ltd. ChemCupid. https://www.namiki-s.co.jp/compound/chemcupid/ (accessed 2022–05–18).
  41. Maggiora G.; Vogt M.; Stumpfe D.; Bajorath J. Molecular Similarity in Medicinal Chemistry. J. Med. Chem. 2014, 57 (8), 3186–3204. 10.1021/jm401411z.
  42. Zhang B.; Vogt M.; Maggiora G. M.; Bajorath J. Design of Chemical Space Networks Using a Tanimoto Similarity Variant Based upon Maximum Common Substructures. J. Comput. Aided Mol. Des. 2015, 29 (10), 937–950. 10.1007/s10822-015-9872-1.
  43. Shannon P.; Markiel A.; Ozier O.; Baliga N. S.; Wang J. T.; Ramage D.; Amin N.; Schwikowski B.; Ideker T. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 2003, 13 (11), 2498–2504. 10.1101/gr.1239303.
  44. Madhavi Sastry G.; Adzhigirey M.; Day T.; Annabhimoju R.; Sherman W. Protein and Ligand Preparation: Parameters, Protocols, and Influence on Virtual Screening Enrichments. J. Comput. Aided Mol. Des. 2013, 27 (3), 221–234. 10.1007/s10822-013-9644-8.
  45. Eberhardt J.; Santos-Martins D.; Tillack A. F.; Forli S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model. 2021, 61 (8), 3891–3898. 10.1021/acs.jcim.1c00203.
  46. Mysinger M. M.; Carchia M.; Irwin J. J.; Shoichet B. K. Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 2012, 55 (14), 6582–6594. 10.1021/jm300687e.
  47. Forli-lab. Meeko: Interfacing RDKit and AutoDock. https://github.com/forlilab/Meeko (accessed 2023–12–26).
  48. Bemis G. W.; Murcko M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39 (15), 2887–2893. 10.1021/jm9602928.
  49. Ertl P.; Schuffenhauer A. Estimation of Synthetic Accessibility Score of Drug-like Molecules Based on Molecular Complexity and Fragment Contributions. J. Cheminform. 2009, 1 (1), 8. 10.1186/1758-2946-1-8.
  50. Irwin J. J.; Tang K. G.; Young J.; Dandarchuluun C.; Wong B. R.; Khurelbaatar M.; Moroz Y. S.; Mayfield J.; Sayle R. A. ZINC20-A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J. Chem. Inf. Model. 2020, 60 (12), 6065–6073. 10.1021/acs.jcim.0c00675.


Articles from Journal of Chemical Information and Modeling are provided here courtesy of American Chemical Society
