Abstract
In the present work, two reaction-based generative models for molecular design are presented: growing optimizer and linking optimizer. These models are designed to emulate real-life chemical synthesis by sequentially selecting building blocks and simulating the reactions between them to form new compounds. By focusing on the feasibility of the generated molecules, growing optimizer and linking optimizer offer several advantages, including the ability to restrict chemistry to specific building blocks, reaction types, and synthesis pathways, a crucial requirement in drug design. Unlike text-based models, which construct molecules by iteratively forming a textual representation of the molecular structure, and graph-based models, which assemble molecules atom by atom or fragment by fragment, our approach incorporates a more comprehensive understanding of chemical knowledge, making it relevant for drug discovery projects. Comparative analysis with REINVENT 4, a state-of-the-art molecular generative model, shows that growing optimizer and linking optimizer are more likely to produce synthetically accessible molecules while reaching molecules of interest with the desired properties.
Keywords: drug design, generative AI, hit discovery, lead optimization, reinforcement fine tuning, deep learning
Introduction
The discovery of new molecules with specific biochemical or physical properties is essential for progress in fields such as drug discovery, materials science, and sustainable chemistry. However, traditional molecular design approaches are often slow and resource-intensive [1]. In recent years, deep learning has provided powerful new methods for molecular generation, allowing the exploration of vast chemical spaces and the optimization of targeted properties [2]. Despite these advances, a central challenge remains: ensuring that the generated molecules are synthetically accessible, as molecules without well-defined synthetic pathways are often impractical or extremely costly to produce [3].
In this work, growing optimizer (GO) and linking optimizer (LO) are introduced, two models designed to generate molecular structures by assembling user-provided fragments and commercially available building blocks through chemical reactions. Through this generative process, these models can produce diverse molecules optimized according to scoring functions that model properties such as biological activity, selectivity, and absorption [4], while maintaining good synthetic accessibility. We evaluated these two models through molecular rediscovery tasks, demonstrating their ability to design structurally diverse molecules. Their performance in key phases of the drug discovery pipeline, such as hit discovery and lead optimization, was also assessed [5]. Additionally, we demonstrated how our generative models enable users to set constraints that reflect their chemistry preferences, such as specifying reaction types, customizing starting materials, or tailoring synthesis conditions to achieve their desired chemistry. The results presented in this paper show that both GO and LO achieve superior performance to REINVENT 4 [6] in optimizing molecular properties while ensuring that the resulting molecules are diverse and practical for real-world synthesis.
A common approach to molecular generation employs a textual representation of molecules, namely the simplified molecular input line entry system (SMILES) notation [7], where molecules are encoded as strings. This representation allows sequence-based models to generate molecules using techniques such as variational autoencoders (VAEs) [8], recurrent neural networks (RNNs [9]) [6, 10], or transformer architectures [11]. In contrast, atom-by-atom generative methods represent molecules as undirected graphs, where atoms serve as nodes and bonds as edges, using graph neural networks (GNNs) [12, 13]. Another prominent generative approach is fragment-based generation, which constructs molecules by assembling predefined molecular fragments using RNN [14] and GNN architectures [15].
Synthetic accessibility indicates the ease with which a compound can be synthesized in the laboratory. The algorithms presented above often generate a significant number of molecules that are not synthetically accessible [16]. To overcome this limitation, some approaches have incorporated synthetic accessibility scores [17, 18] into the generation process, facilitating the design of molecules with enhanced accessibility [3, 19].
To further improve synthetic accessibility, reaction-based generation strategies have been developed to integrate the synthesis process directly into molecular generation. Methods based on synthetic pathways construct molecules step by step through a series of chemical reactions, starting from commercially available building blocks (CABB), using RNNs [20] or GFlowNets [21–23]. By ensuring that each generated molecule has a plausible synthetic route, these approaches improve synthetic accessibility.
GO and LO draw inspiration from Bradshaw et al. [20] and extend their methodology in the following ways:
An optimization approach that encourages molecular diversity while optimizing user-defined objectives (see Optimization of the generative models).
Scaling the number of building blocks to 1 million through architectural adjustments, surpassing the tens of thousands initially proposed (see Commercially available building blocks dataset).
Using a template (SMARTS transformation [24]) based reaction model, instead of the originally proposed template-free model [25], which offers greater control over the generated chemistry by enabling the inclusion or exclusion of specific name reactions or by filtering the predicted templates to match a desired reactive functional group (from now on referred to as exit vector in the context of molecular generation) (see Template predictor).
In the case of fragment growing, allowing users to define an initial fragment, optionally specifying one or more exit vectors (see Architecture of the generative models).
In the case of fragment linking, allowing users to define two initial fragments, optionally specifying an exit vector for each fragment (see Architecture of the generative models).
Implementing a customized strategy that allows GO to address the specific case of macrocyclization (see Architecture of the generative models).
Unlike SynFlowNet [21], RGFN [22], and RxnFlow [23], which do not support macrocyclization, fragment growing, or fragment linking with user-defined input fragments, GO and LO are specifically designed to handle these scenarios. Additionally, rather than relying on the GFlowNet framework used in these methods, we adopt an optimization algorithm based on maximum likelihood estimation, which is conceptually simpler and easier to train.
Regarding dataset scale, GO explores a broader chemical space than SynFlowNet and RGFN by leveraging a dataset of 1 million commercial compounds, while RxnFlow operates on a similar order of magnitude. Like SynFlowNet and RxnFlow, GO and LO can process single-reactant reactions, whereas RGFN does not support this capability. However, unlike these methods, which integrate template prediction directly within the model’s action space, GO and LO rely on a pre-trained template-based reaction predictor with fixed weights.
Materials and methods
Commercially available building blocks dataset
GO and LO draw their building blocks from a curated dataset of commercially available compounds. This dataset was obtained by aggregating the catalogs of the providers 1PlusChem, Ambeed, BLDpharm, ChemBridge, Chemspace, eMolecules, Key Organics, Life Chemicals, and TCI Chemicals. Only compounds explicitly designated as building blocks were included, while virtual compounds and screening compounds were excluded. The resulting dataset contains over 5 million building blocks. These were standardized by neutralizing functional groups and removing salts, solvents, radicals, and inorganic entries, and eliminating duplicates. Building blocks with molecular weights above 350 were excluded. Following this data curation, the 1 million most readily available building blocks, prioritized based on delivery time, were selected to be used in the GO and LO generations.
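The final curation steps described above (deduplication, the molecular weight cutoff, and prioritization by delivery time) can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the `BuildingBlock` record, its field names, and the example entries are hypothetical, and the SMILES strings are assumed to be already standardized so that identical strings denote duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildingBlock:
    smiles: str         # assumed to be standardized and canonical already
    mol_weight: float   # molecular weight
    delivery_days: int  # hypothetical proxy for availability


def curate(blocks, max_weight=350.0, keep=1_000_000):
    """Drop heavy blocks, merge duplicates, keep the fastest-delivered ones."""
    best = {}
    for bb in blocks:
        if bb.mol_weight > max_weight:
            continue  # molecular weights above the cutoff are excluded
        current = best.get(bb.smiles)
        # For duplicates, keep the offer with the shortest delivery time.
        if current is None or bb.delivery_days < current.delivery_days:
            best[bb.smiles] = bb
    ranked = sorted(best.values(), key=lambda bb: (bb.delivery_days, bb.smiles))
    return ranked[:keep]


# Illustrative catalog entries (values are examples, not taken from the dataset).
catalog = [
    BuildingBlock("CCO", 46.07, 3),
    BuildingBlock("CCO", 46.07, 1),  # duplicate with faster delivery
    BuildingBlock("c1ccccc1B(O)O", 121.93, 2),
    BuildingBlock("CC(C)(C)OC(=O)N1CCC(CC1)C(=O)O", 229.27, 5),
    BuildingBlock("C" * 30, 422.8, 1),  # above the weight cutoff
]
selected = curate(catalog, keep=2)
```

Sorting on the pair (delivery time, SMILES) makes the selection deterministic when several building blocks share the same delivery time.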
Molecular generation strategy
In the present study, three distinct molecular generation strategies are explored: unconstrained design, fragment growing, and fragment linking (Fig. 1). In the unconstrained design strategy, the model generates molecules freely, without imposing structural constraints. In fragment growing, the model expands a given molecular fragment by incorporating new substructures. In fragment linking, the model generates a compound that connects two predefined molecular fragments.
Figure 1.
Molecular design strategies. Unconstrained design involves the generation of a molecule without any structural input. Fragment growing generates a molecule from an input fragment that remains part of the final compound. Fragment linking generates a molecule by binding two input fragments with a linker. GO handles unconstrained design and fragment growing, including macrocyclization, while LO handles the fragment linking strategy.
To address these use cases, GO handles unconstrained design and fragment growing, while LO is tailored for fragment linking.
GO is a generative model that designs molecules by iteratively constructing virtual synthetic pathways, referred to as molecular trees or simply trees, as illustrated in Fig. 2a. The process combines compounds from the CABB dataset (see Commercially available building blocks dataset) using chemical transformations (reaction templates) encoded as reaction SMARTS. An RNN sequentially encodes the current state and passes it to dedicated networks responsible for managing three possible actions: (i) adding (or not) a new reaction step to extend the molecular tree, (ii) selecting a reaction type from a predefined set (uni-reactant, bi-reactant, or macrocyclization reactions, see Supplementary Fig. SI-7), and (iii) choosing a building block from the CABB dataset. This iterative process continues until either the model stops adding reactions or the molecular tree reaches a predefined maximum length. For a maximum of four reactions, the chemical space covered by GO is estimated to exceed
molecules, which is orders of magnitude higher than virtual screening spaces such as Enamine REAL Space:
[26], as detailed in Supplementary information (Section A).
Figure 2.
Example of a molecular tree generated by GO. The process starts with an initial fragment (a), provided by the user, featuring two exit vectors. GO chooses to add a reaction to the tree and that this reaction will be an
reaction. GO chooses building block (b) among the CABB dataset. The reaction predictor applies reaction 1, respecting the exit vector constraints and the intermediate molecule (c) is obtained. GO selects another
reaction, chooses building block (d), and applies reaction 2 using the template predictor (see Template predictor). This results in the intermediate product (e). Then GO adds an
reaction to the intermediate molecule (e), producing the intermediate (f). Finally, GO decides not to add another reaction to the tree, and (f) becomes the final molecule of the molecular tree.
Similar to GO, LO generates molecules by constructing molecular trees (Fig. 3a). It selects a linker from the CABB dataset to connect two fragments defined by the user. Its available actions are: (i) selecting a building block to serve as a linker and (ii) optionally applying intermediate reactions to modify the linker before it reacts with the input fragments. The chemical space covered by LO is estimated at
molecules (Supplementary information, Section A).
Figure 3.

Example of a molecular tree generated by LO. The process starts with two initial fragments, (a) and (e), along with their respective exit vectors provided by the user. LO selects building block (b) from the CABB dataset to serve as the linker between the two fragments. The SRRN determines that a
reaction is not necessary to transform the linker before its reaction with fragment (a). The template predictor applies reaction 1, respecting the exit vector constraints, resulting in the intermediate molecule (c). The SRRN then decides to apply an
reaction to transform the remaining portion of the linker in (c), and the reaction predictor carries out the deprotection 2. Finally, the reaction predictor applies reaction 3, also adhering to the exit vector constraints, to link the second fragment (e) to the deprotected linker (d), resulting in the final molecule (f).
Molecular trees generated by GO and LO do not represent retrosynthesis pathways. Their purpose is to generate molecules with a high probability of synthetic accessibility by design. Regio- and chemo-selectivity are deliberately omitted during the generation to accelerate the exploration of synthetically promising molecules. These chemical selectivity considerations can be more appropriately addressed with dedicated retrosynthesis tools such as Spaya [27], which are specifically designed for that task.
Generators
Architecture of the generative models
Growing optimizer architecture GO employs a modular architecture consisting of specialized neural network components, each addressing a specific task of the reaction-based generative process (Fig. 2b):
RNN (Recurrent neural network): An autoregressive gated recurrent unit (GRU) [28] network that accumulates information about past reactions and the molecular tree’s current state, embedding this information for the following network heads.
RCNN (Reaction continuation neural network): A linear projection binary classifier that decides whether to stop the generative process or continue reacting with the current intermediate molecule.
RTNN (Reaction type neural network): A linear projection that selects the reaction type: uni-reactant, bi-reactant, or macrocyclization.
BBNN (Building block neural network): A linear projection that predicts the likelihood of each building block to be selected to react with the current intermediate molecule. Specifically, the BBNN generates an embedding, and the likelihood of building blocks is determined by computing the scalar product between this embedding and the Morgan fingerprints of the CABB dataset [20, 21].
In the context of fragment growing, GO takes as input the Morgan fingerprint of the initial fragment. In unconstrained design, the input is represented by a zero tensor and the first fragment is selected from the CABB dataset.
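The building block scoring performed by the BBNN can be illustrated with a short sketch: scalar products between one embedding and each building block fingerprint are turned into a probability distribution. The softmax normalization, the four-bit fingerprints, and the embedding values below are assumptions made for illustration; the actual fingerprint size and normalization are not specified here.

```python
import math


def building_block_probabilities(embedding, fingerprints):
    """Softmax over scalar products between an embedding and each fingerprint."""
    logits = [sum(e * f for e, f in zip(embedding, fp)) for fp in fingerprints]
    shift = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - shift) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Toy fingerprints for three hypothetical building blocks.
fingerprints = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
embedding = [2.0, -1.0, 0.5, 0.0]  # hypothetical BBNN output
probs = building_block_probabilities(embedding, fingerprints)
```

Because the score is a plain scalar product against precomputed fingerprints, the same operation can be applied to a very large building block library as a single matrix-vector product.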
Linking optimizer architecture LO is built upon an architecture designed to facilitate the efficient linking of two input molecular fragments through specialized neural network components (Fig. 3b):
BBNN (Building block neural network): A multilayer perceptron (MLP) [29] that predicts the likelihood for each building block to be selected to link the input fragments.
SRRN (Single reactant reaction network): An MLP that determines whether a uni-reactant reaction should be applied to transform the linker before its reaction with the input fragments.
Unlike GO, LO selects a single building block to link the input fragments, constraining the number of reactions to a fixed range between 2 and 4: two mandatory bi-reactant reactions to link the building block to the input fragments and up to two optional uni-reactant reactions on the linker and the intermediate molecule, determined by SRRN. To allow for a simpler model architecture and in light of the more constrained and less variable reaction sequences in LO compared to GO, an RNN architecture was not employed for LO. In GO, the number of reactions is defined by the user within a range, and the model decides when to stop, making the sequential aspect more important.
Both GO and LO rely on a template-based forward reaction predictive model to output reaction products (see Template predictor).
Template predictor
A template-based approach is used to predict the outcomes of reactions. Specifically, reaction templates are represented as SMARTS patterns (Supplementary Fig. SI-7), which capture the structural changes of a reaction by defining the atoms and bonds involved along with their modifications. Similar to Segler and Waller [30], but for forward synthesis, the template prediction task is framed as a classification problem, where each template corresponds to a distinct class. To improve performance, two different models are trained: one specialized for predicting uni-reactant templates and another one for bi-reactant templates (Supplementary Fig. SI-7).
The templates are extracted from the Pistachio dataset [31] version 2024Q3. After curation and processing of the raw reactions (Supplementary information, Section C.2), 2 523 991 bi-reactant reactions were obtained, resulting in 48 681 bi-reactant templates. Similarly, 1 832 757 uni-reactant reactions led to 53 277 uni-reactant templates, among which 2470 were identified as macrocyclization templates. Following this processing, reaction templates were associated with name reactions (known chemical transformations described and exemplified in the organic chemistry literature) on the basis of the name reactions available in Pistachio.
The template predictor allows users to specify the exact reaction site on the input fragment by selecting exit vectors. This constraint is integrated into the template prediction process during the inference phase. Thus, templates that do not align with the user-defined exit vectors are filtered out (Supplementary Fig. SI-8), ensuring that only reactions conforming to the designated reactive sites are considered. This additional flexibility enhances the control over reaction outcomes and allows for a more efficient generation of targeted molecules that satisfy specific synthetic requirements.
Additionally, the template predictor allows for further refinement by restricting the scope of reactions to include or exclude particular name reactions. Users can specify a list of name reactions, such as Suzuki coupling and N-alkylation, which are mapped to the corresponding sets of templates. The template prediction process then filters out templates based on these user-defined inclusion or exclusion criteria, enabling the execution of targeted synthetic strategies.
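The two filtering mechanisms described above (exit vectors and name reactions) can be sketched as a simple post-processing step on the predicted templates. The dictionary layout, the `reactive_groups` field, the group labels, and the "Amide coupling" entry are hypothetical and serve only to illustrate the logic; in practice the exit vector check would be carried out on the SMARTS patterns themselves.

```python
def filter_templates(templates, exit_vector=None, include=None, exclude=None):
    """Keep templates compatible with an exit vector and name-reaction rules."""
    kept = []
    for template in templates:
        # Discard templates that do not react at the requested exit vector.
        if exit_vector is not None and exit_vector not in template["reactive_groups"]:
            continue
        # Inclusion list: only the listed name reactions are allowed.
        if include is not None and template["name_reaction"] not in include:
            continue
        # Exclusion list: the listed name reactions are forbidden.
        if exclude is not None and template["name_reaction"] in exclude:
            continue
        kept.append(template)
    return kept


# Hypothetical templates annotated with a name reaction and reactive groups.
templates = [
    {"id": "t1", "name_reaction": "Suzuki coupling",
     "reactive_groups": {"aryl halide", "boronic acid"}},
    {"id": "t2", "name_reaction": "N-alkylation",
     "reactive_groups": {"amine", "alkyl halide"}},
    {"id": "t3", "name_reaction": "Amide coupling",
     "reactive_groups": {"amine", "carboxylic acid"}},
]
amine_only = filter_templates(templates, exit_vector="amine", exclude={"N-alkylation"})
```

Applying the filter at inference time, after the classifier has ranked the templates, leaves the trained template predictor unchanged while still restricting the chemistry.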
Training of the generative models
Each molecular tree $T$ can be represented as a sequence of subtrees $(T_1, \dots, T_N)$, where $N$ is the total number of reactions. Each subtree $T_i$ represents a series of actions that leads to an intermediate molecule or terminates the tree generation process. GO and LO are trained using a maximum likelihood estimation (MLE) loss, defined as:
![]() |
(1) |
![]() |
(2) |
where $\mathbb{1}_{\theta}(i)$ denotes an indicator function that takes the value 1 if the neural network sub-model parametrized by $\theta$ (RNN, BBNN, RTNN, SRRN in section Architecture of the generative models) is applied at step $i$ and 0 otherwise. Similar to Bradshaw et al. [20], the weights of the template predictor models are frozen during training.
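One form of the loss that is consistent with this description is given below as a sketch rather than as the exact published expression. The assumptions are that a tree is written as a sequence of subtrees $T_1, \dots, T_N$, that the likelihood factorizes autoregressively over subtrees, and that $a_i^{\theta}$ (a symbol introduced here for illustration) denotes the action taken by the sub-model $\theta$ at step $i$.

```latex
\mathcal{L}_{\mathrm{MLE}}
  = -\sum_{i=1}^{N} \log p\left(T_i \mid T_1, \dots, T_{i-1}\right),
\qquad
\log p\left(T_i \mid T_1, \dots, T_{i-1}\right)
  = \sum_{\theta} \mathbb{1}_{\theta}(i)\,
    \log p_{\theta}\left(a_i^{\theta} \mid T_1, \dots, T_{i-1}\right)
```

Under this reading, the indicator simply switches off the log-likelihood terms of sub-models that take no decision at a given step, for example the building block network during a uni-reactant reaction.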
GO is pre-trained on the ChEMBL [32] dataset. To obtain a training dataset of molecular trees, we used Spaya [27], a tool for high-throughput retrosynthetic analysis, to map each molecule to its retrosynthesis route. LO, on the other hand, having a simpler network architecture, is not pre-trained before optimization tasks.
Optimization of the generative models
The optimization of GO and LO employs an iterative refinement process based on a modified version of the Hillclimb-MLE algorithm [33], enhanced with parameters that improve exploration. This modified algorithm is referred to as Hillclimb-MLE-Explore (Algorithm 1). The optimization method is divided into three distinct phases. In the first phase, the model generates a set of candidate molecules. In the second phase, these molecules are evaluated using an external, user-defined score, referred to as the reward $R$, which quantifies the desired characteristics of the target molecules. In the final phase, a set of top-scoring molecules is selected and the model’s parameters are updated through a likelihood maximization step.
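The three phases can be illustrated on a toy problem in which the "model" is a categorical distribution over a handful of candidates. This is a simplified sketch under strong assumptions: plain top-k selection stands in for the selection step, the likelihood maximization is replaced by a damped move toward the empirical frequencies of the selected batch, and all names and hyperparameters are hypothetical.

```python
import random
from collections import Counter


def hillclimb_mle(candidates, reward, n_iter=20, n_samples=200, top_k=20,
                  step=0.5, seed=0):
    """Toy hill-climbing loop: sample, score, select, refit by likelihood."""
    rng = random.Random(seed)
    probs = {c: 1.0 / len(candidates) for c in candidates}
    names = list(probs)
    for _ in range(n_iter):
        # Phase 1: the current model generates a set of candidates.
        sampled = rng.choices(names, weights=[probs[n] for n in names], k=n_samples)
        # Phase 2: candidates are ranked by the external reward.
        ranked = sorted(sampled, key=reward, reverse=True)
        # Phase 3: the top-scoring batch is used for a likelihood step. For a
        # categorical model the MLE is the empirical frequency of the batch;
        # `step` damps the update so that some exploration is retained.
        counts = Counter(ranked[:top_k])
        for n in names:
            probs[n] = (1.0 - step) * probs[n] + step * counts[n] / top_k
    return probs


toy_rewards = {"A": 0.1, "B": 0.9, "C": 0.4, "D": 0.2}
final_probs = hillclimb_mle(list(toy_rewards), toy_rewards.get)
```

In this toy setting the distribution concentrates on the highest-reward candidate, which also illustrates why additional exploration mechanisms are useful: without them the sampled batches quickly lose diversity.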
Two techniques are used to enhance the capacity of the generative models to explore a vast chemical space: entropy regularization of maximum likelihood training [34] and a selection strategy based on a determinantal point process (DPP) [35].
Entropy terms, associated with distributions over reaction continuation, building block selection, and reaction type, are added to the MLE loss function (Equation 3) to encourage convergence toward a more stochastic policy for better exploration.
![]() |
(3) |
![]() |
(4) |
where $\lambda$ is the entropy hyperparameter and $H(\theta)$ is the entropy of the model parameterized by $\theta$.
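The effect of the entropy term can be sketched for a single categorical decision, such as the choice of a reaction type. The example assumes the common form of the regularized objective, in which a weighted entropy is subtracted from the negative log-likelihood; the weight `lam` and the toy distributions are illustrative and are not the values used in this work.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(probs, target_index, lam=0.1):
    """Negative log-likelihood of the target minus a weighted entropy bonus."""
    nll = -math.log(probs[target_index])
    # Assumed sign convention: subtracting the entropy rewards more
    # stochastic (higher-entropy) policies when the loss is minimized.
    return nll - lam * entropy(probs)


peaked = [0.98, 0.01, 0.01]  # nearly deterministic policy
spread = [0.60, 0.20, 0.20]  # more stochastic policy
loss_peaked = regularized_loss(peaked, target_index=0)
loss_spread = regularized_loss(spread, target_index=0)
```

Raising `lam` shifts the balance from fitting the selected molecules toward keeping the sampling distribution broad.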
A DPP-based selection strategy is employed within the Hillclimb-MLE-Explore framework to select the training batch used to fine-tune the model. This approach balances the selection of molecules with high reward values $R$ (exploitation) and the pursuit of greater structural diversity (exploration).
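The trade-off between reward and diversity can be illustrated with a greedy DPP-style selection. This sketch makes two assumptions that are not stated in the text: the kernel uses the standard quality-diversity decomposition (reward of molecule i times pairwise similarity times reward of molecule j), and the batch is built greedily by maximizing the determinant of the selected sub-kernel, which is an approximation rather than exact DPP sampling.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def greedy_dpp(rewards, similarity, k):
    """Greedily pick k items maximizing det of the quality-diversity kernel."""
    n = len(rewards)
    kernel = [[rewards[i] * similarity[i][j] * rewards[j] for j in range(n)]
              for i in range(n)]
    selected = []
    for _ in range(min(k, n)):
        best_item, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            d = determinant(sub)
            if d > best_det:
                best_item, best_det = i, d
        if best_item is None:
            break  # no remaining item adds volume to the selection
        selected.append(best_item)
    return selected


# Molecules 0 and 1 are near-duplicates with high reward; molecule 2 is
# structurally different with a lower reward (toy values).
rewards = [0.90, 0.88, 0.60]
similarity = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
batch = greedy_dpp(rewards, similarity, k=2)
```

With these toy values, a pure top-k selection would keep the two near-duplicates, whereas the determinant criterion prefers the pair formed by the best molecule and the structurally distinct one.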
Reward
Reward shaping To normalize the various scores contributing to the reward function, scaling modifiers are applied to map the scores to the [0,1] interval. The choice of these modifiers and their parameters follows established approaches from the literature [36]. Specifically, we use MaxGaussian to maximize scores above a given threshold and MinGaussian to minimize scores below a given threshold.
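The two modifiers can be written as half-Gaussian functions in the spirit of the GuacaMol score modifiers [36]. The sketch below is a plain-Python illustration; the threshold `mu` and width `sigma` are free parameters, and the example values are hypothetical rather than those listed in the Supplementary information.

```python
import math


def max_gaussian(x, mu, sigma):
    """Return 1 at or above the threshold mu, with Gaussian decay below it."""
    if x >= mu:
        return 1.0
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def min_gaussian(x, mu, sigma):
    """Return 1 at or below the threshold mu, with Gaussian decay above it."""
    if x <= mu:
        return 1.0
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Example: reward predicted pIC50 values of at least 7 (property to maximize)
# and penalize a property that should stay below 0.5 (property to minimize).
activity_score = max_gaussian(6.0, mu=7.0, sigma=1.0)
penalty_score = min_gaussian(0.8, mu=0.5, sigma=0.1)
```

Both functions return values in the (0, 1] interval, so the shaped scores can be combined directly in an aggregated reward.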
Reward aggregation The reshaped scores $s_1, \dots, s_n$ are aggregated using the geometric mean. The global reward $R$ is computed as:

$$R = \left(\prod_{i=1}^{n} s_i\right)^{1/n} \qquad (5)$$
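A minimal sketch of this aggregation is given below, assuming that every reshaped score lies in the [0, 1] interval. The function name and the example values are illustrative.

```python
import math


def geometric_mean(scores):
    """Geometric mean of scores in [0, 1], computed in log space."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0.0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0.0 for s in scores):
        return 0.0  # a single zero score sets the whole reward to zero
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


reward = geometric_mean([0.25, 1.0])
```

Unlike an arithmetic mean, the geometric mean cannot be raised by excelling on one objective while failing another, which suits multi-parametric optimization where every criterion must be met.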
In this work, we treat the rewards employed in our benchmarks as perfect oracles, focusing our evaluation on the performance of the generative model and the optimization method rather than on the validation of the reward functions themselves. Although QSAR models and docking scores may not fully capture the complexity of real-world drug discovery projects, they serve as effective reward functions for evaluating the ability of generative models to design molecules across diverse chemical or biological contexts.
Assessing synthetic accessibility
Synthetic accessibility of the molecules generated in each experiment is evaluated using Spaya [27]. This evaluation is performed via the Spaya API, which assigns a retrosynthesis score (RScore) [19] to each molecule. RScore is an estimation of the likelihood of the predicted retrosynthesis routes that ranges from 0 to 1, with 0 indicating no route was found by Spaya within the allocated time frame, and 1 representing a one-step retrosynthesis route exactly matching a reaction described in the literature. For this analysis and following the expert opinion of chemists at Iktos as described by Parrot et al. [19], molecules with an RScore of 0.5 or higher were considered synthetically accessible. We complement this analysis with synthetic accessibility results obtained using the alternative retrosynthesis tool AiZynthFinder [37], as detailed in Section G of the Supplementary information.
Results
GO and LO are evaluated against the open-source tool, REINVENT 4 [6], using metrics that assess the optimality, diversity, and synthetic accessibility of the best-generated molecules.
REINVENT 4 serves as a strong open-source baseline [38, 39] for generative models in de novo drug design, addressing all scenarios (unconstrained design, fragment growing, and fragment linking) in our experiments. It employs an RNN-based architecture to generate SMILES strings and is optimized using a policy gradient method.
In all experiments, RScore is used to evaluate the synthetic accessibility of the generated molecules (see Assessing synthetic accessibility). Assessing the synthetic accessibility is critical since the generation of high-scoring molecules that are impractical to synthesize in the laboratory has limited value.
The benchmark experiments encompass the following four distinct use cases:
Molecular rediscovery: A similarity-driven generation aimed at reproducing a given reference compound.
Lead optimization design: A generation based on multi-parametric optimization that takes a homogeneous chemical series as input, aiming to optimize quantitative structure-activity relationship (QSAR) model scores and other chemical properties defined in the target product profile (TPP).
Structure-based design: A generation that uses a protein target as input to design compounds with high docking affinity for its binding site.
Chemistry-constrained design: A generation of compounds taking into account chemistry constraints, such as desired substructures or limitations on the number and types of reactions required for synthesis.
Additional experiments in Supplementary information Section F.1 compare samples generated by GO and REINVENT 4 in terms of validity, uniqueness, novelty, and other distributional properties using the GuacaMol benchmark suite [36]. Additionally, other common de novo design benchmarks (QED, JNK3, and GSK3β activity scores) are reported, along with a comparison to the reaction-based algorithm RxnFlow [23], in Supplementary information Sections F.2 and F.3.
Each generator produces the same number of molecules for every experimental setting, ensuring a fair comparison. Depending on the setting, the total number of molecules generated ranges from 10 000 to 40 000 (Supplementary information, Section E).
Molecular rediscovery
Set up
In molecular rediscovery experiments, the goal is to evaluate the generator’s ability to reproduce specific, predefined target molecules. The eight target molecules shown in Fig. 4a were selected from the literature based on structural features that reflect the complexity encountered in real-world drug discovery projects. Selection criteria included the presence of challenging elements such as macrocyclic architectures, quaternary stereocenters, complex heterocyclic frameworks, and a high sp³ carbon fraction. These structural motifs are representative of modern drug design and are known to pose significant difficulties for generative models. By focusing on such complex and realistic cases, we intentionally move beyond the use of overly simplified or trivial molecules often found in prior studies [36]. Only Seltorexant is present in the training sets of both REINVENT 4 and GO; the other target molecules are not.
Figure 4.
Target molecules in (a) and maximum Tanimoto similarity achieved by a molecule with an RScore above 0.5 in unconstrained design (b), fragment growing (c), and fragment linking (d) strategies. Across these use cases, GO and LO generate molecules with higher rewards (i.e., higher similarity scores to the target molecules) than REINVENT 4.
For these experiments, the reward function $R$ is defined as the Tanimoto similarity of the Morgan fingerprint (radius = 2) [40] to the target molecule, with no further reward shaping. Model parameters and fragment inputs, along with their exit vectors, are provided in Supplementary information, Section E.
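The Tanimoto similarity used as the reward can be sketched on fingerprints represented as sets of "on" bit indices. The bit sets below are toy values; in practice the Morgan fingerprints would be computed with a cheminformatics toolkit such as RDKit, which is not used here in order to keep the example self-contained.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


target_bits = {1, 2, 3}     # hypothetical fingerprint of the target molecule
candidate_bits = {2, 3, 4}  # hypothetical fingerprint of a generated molecule
similarity_to_target = tanimoto(candidate_bits, target_bits)
```

A reward of 1 is reached only when the two fingerprints are identical, which is the rediscovery criterion used in the results.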
Results
To assess the ability of the generators to rediscover specific target molecules, we evaluate the reward of the highest-ranking (most similar) molecule with an RScore above 0.5 produced in each experiment. The results shown in Fig. 4 demonstrate that in all use cases (fragment growing, fragment linking, and unconstrained design) GO and LO reconstruct (maximum similarity = 1) a greater number of target molecules than REINVENT 4. Even when the exact target molecule is not rediscovered (maximum similarity < 1), our generative models output molecules with overall higher similarity to the target. In Supplementary Fig. SI-6, for each target, we present the most similar building block from the CABB dataset. This analysis shows that fragments of several target molecules are present within the CABB dataset, which likely contributes to the observed high rediscovery rates. The best molecules generated in each experiment are shown in Supplementary Fig. SI-24.
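The rediscovery metric can be sketched as a filter followed by a maximum. The record layout and the example values are hypothetical, and the threshold is applied inclusively here, following the definition of synthetic accessibility given in the methods (an RScore of 0.5 or higher).

```python
def best_accessible(molecules, rscore_threshold=0.5):
    """Return the most similar molecule among the synthetically accessible ones."""
    accessible = [m for m in molecules if m["rscore"] >= rscore_threshold]
    if not accessible:
        return None  # no generated molecule passed the accessibility filter
    return max(accessible, key=lambda m: m["similarity"])


# Hypothetical generated molecules with their similarity to the target and RScore.
generated = [
    {"id": "m1", "similarity": 1.00, "rscore": 0.30},  # rediscovered but not accessible
    {"id": "m2", "similarity": 0.82, "rscore": 0.70},
    {"id": "m3", "similarity": 0.64, "rscore": 0.90},
]
best = best_accessible(generated)
```

In this toy example the exact match is discarded because of its low RScore, so the reported maximum similarity is that of the best accessible molecule.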
Lead optimization design
Set up
The performance of generative models in lead optimization scenarios is evaluated on their ability to generate active molecules for two biological targets: phosphoinositide 3-kinase (PI3K) and the mechanistic target of rapamycin (mTOR). Both biological targets are involved in the PI3K/AKT/mTOR signaling pathway, which is critical in regulating the cell cycle. This pathway is often overactive in certain types of breast, ovarian, or prostate cancer cells, leading to reduced apoptosis and thus to uncontrolled proliferation. While some selective inhibitors of PI3K or mTOR have shown efficacy, a synergistic effect has been observed for compounds showing a dual PI3K/mTOR inhibition profile. This study is based on a dataset of 463 structurally related molecules with measured pIC50 values for PI3K and mTOR, as reported by Parrot et al. [19]. In this dataset, referred to as the PI3K-mTOR lead optimization (PMLO) dataset, only molecules active against either PI3K or mTOR, but not both, are included. This design ensures that dual-active molecules remain hidden from both QSAR models and generative models, preventing them from trivially learning their patterns.
Generation constraints include the ECFP4 Tanimoto maximum similarity to the PMLO dataset, quantitative estimate of drug-likeness (QED) scores [41], and predicted pIC50 values for PI3K and mTOR, derived from QSAR models, as detailed by Parrot et al. [19]. Additionally, a specific substructure, defined by a SMARTS pattern (Supplementary Fig. SI-12), is enforced to guide the generated molecules toward the desired chemical space. The reward function $R$ optimized in this study is the geometric mean of key molecular property scores, as detailed in Supplementary Table SI-1.
REINVENT 4 and GO are evaluated in both unconstrained design and fragment growing setups for lead optimization. To focus generation and optimization on the chemical space of molecules active on PI3K or mTOR, the generators are pre-trained on the PMLO dataset for a few epochs.
A structure-based rescoring step has been introduced as a post-processing procedure, independent of the generator’s optimization process, to assist in candidate selection and to eliminate energetically unfavorable molecules. To this end, the best generated molecules were docked onto the 3D structures of PI3K and mTOR using the docking protocol described in Supplementary information, Section D.3. This was followed by a binding free energy calculation using the molecular mechanics/generalized Born surface area (MMGBSA) method, also detailed in Supplementary information, Section D.4. Note that lower MMGBSA scores indicate stronger binding affinities, with more negative values corresponding to more favorable ligand–target interactions.
Results
The present analysis focuses on the 500 generated molecules with the highest scores, reflecting the approach a chemist might take when reviewing results from de novo drug design approaches. The evaluation criteria include synthetic accessibility and the number of molecules meeting the PI3K-mTOR TPP in Supplementary Table SI-4. Additionally, diversity is assessed on the top-scoring, synthetically accessible molecules using three metrics: (i) the pairwise Tanimoto similarity among the molecules, (ii) the number of unique scaffolds present in them, and (iii) the pairwise Tanimoto similarity among these scaffolds. REINVENT 4 and GO successfully identify new predicted active molecules for the PI3K and mTOR targets while also meeting the remaining TPP criteria, such as similarity to dataset, QED, and presence of substructure.
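The three diversity metrics can be sketched as follows, with fingerprints represented as sets of "on" bit indices. The record layout, the scaffold identifiers, and the toy fingerprints are hypothetical; in practice the scaffolds and fingerprints would be computed with a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def diversity_report(molecules):
    """Compute the three diversity metrics for a list of molecule records."""
    # One fingerprint per unique scaffold, keyed by the scaffold identifier.
    scaffolds = {m["scaffold"]: m["scaffold_fp"] for m in molecules}
    return {
        "molecule_similarity": mean_pairwise_similarity([m["fp"] for m in molecules]),
        "n_scaffolds": len(scaffolds),
        "scaffold_similarity": mean_pairwise_similarity(list(scaffolds.values())),
    }


# Toy records: two molecules share scaffold "s1", a third has scaffold "s2".
top_molecules = [
    {"fp": {1, 2, 3}, "scaffold": "s1", "scaffold_fp": {1, 2}},
    {"fp": {2, 3, 4}, "scaffold": "s1", "scaffold_fp": {1, 2}},
    {"fp": {7, 8}, "scaffold": "s2", "scaffold_fp": {2, 5}},
]
report = diversity_report(top_molecules)
```

Lower mean pairwise similarities and a larger number of unique scaffolds indicate a more diverse set of molecules.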
Figure 5 illustrates that among the top 500 molecules generated in each experiment, GO produces molecules with better synthetic accessibility. Specifically, when using REINVENT 4 with an initial fragment, less than half of the top molecules achieve an RScore above 0.5, while that percentage rises to 85% with GO. Although REINVENT 4 generates more synthetically accessible molecules meeting the TPP criteria in fragment growing, Table 1 demonstrates that GO identifies active molecules with greater structural diversity, as indicated by a broader range of scaffolds and a lower mean pairwise Tanimoto similarity among molecules and scaffolds.
Figure 5.

Analysis of RScore and the number of molecules within the TPP for the top 500 molecules generated. (a) In unconstrained design, GO produced molecules with higher synthetic accessibility compared to REINVENT 4. (b) In fragment growing, REINVENT 4 generated a substantial number of molecules within the TPP, though only half exhibited an RScore above 0.5.
Table 1.
Diversity comparison of the top 500 high-scoring molecules with an RScore above 0.5 in the lead optimization use case. Similarity is evaluated as the average pairwise Tanimoto similarity. The best score is highlighted in bold
| Generator | Similarity between molecules (↓) | Number of scaffolds (↑) | Similarity between scaffolds (↓) |
|---|---|---|---|
| Unconstrained design | | | |
| REINVENT 4 | 0.57 | 54 | 0.73 |
| GO | 0.53 | 77 | 0.62 |
| Fragment growing | | | |
| REINVENT 4 | 0.58 | 29 | 0.72 |
| GO | 0.50 | 81 | 0.63 |
A comparison of the MMGBSA scores of the top-ranked molecules from each generation is provided in Supplementary Fig. SI-10. In both the unconstrained design (A) and fragment growing (B) scenarios, GO and REINVENT 4 yield comparable MMGBSA score distributions for the PI3K and mTOR targets.
Structure-based drug design
Set up
The following experiments aim to evaluate the performance of the generators when optimizing 3D scoring metrics that assess protein-ligand interactions. Specifically, the docking scoring function as implemented in AutoDock Vina (Supplementary information, Section D.2) and the Iktos contact score (Supplementary information, Section D.3) are used to estimate, respectively, the binding energy and ligand-protein contact quality. The biological targets included in this study are: Proviral Integration site for Moloney murine leukemia virus-1 (PIM1), Extracellular signal-Regulated Kinase 2 (ERK2), and TransMembrane protein with Ring finger Domain (TRMD).
PIM1: this kinase plays a key role in critical biological processes, including the regulation of cell growth, survival, and differentiation, as well as cellular senescence and apoptosis. The involvement of PIM1 in multiple signaling pathways and its interactions with various proteins have established this kinase as a promising candidate for targeted cancer therapy, given its crucial role in tumor cell regulation.
ERK2: this kinase is a critical component of the mitogen-activated protein kinase (MAPK) pathway, which regulates key cellular processes such as cellular proliferation, differentiation, and survival. ERK2 has emerged as a promising therapeutic target, with inhibitors being developed to block its activity and disrupt aberrant signaling in MAPK-driven malignancies.
TRMD: this target is a multifunctional protein involved in cellular signaling, ubiquitination, and protein trafficking. TRMD is involved in various biological processes, including cell cycle control, immune responses, and intracellular signaling pathways.
REINVENT 4 and GO are evaluated in two different setups: unconstrained design and fragment growing. The optimized reward is computed as the geometric mean of the structure-based scores listed in Table SI-2. Additionally, similar to the lead optimization design (see Lead optimization design), an MMGBSA rescoring post-processing step is applied to the top-ranked generated molecules.
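For illustration, the geometric-mean aggregation can be sketched as follows. The sketch assumes that each component score has already been transformed to a non-negative value, typically in the interval [0, 1]; this normalization, like the function name, is an assumption and is not specified in this section.

```python
import math


def geometric_mean_reward(scores):
    """Geometric mean of component scores, each assumed to be non-negative."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0  # a single zero component zeroes the whole reward
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```

Unlike an arithmetic mean, the geometric mean penalizes molecules that perform poorly on any single objective: a molecule with component scores of 0.25 and 1.0 receives a reward of 0.5 rather than 0.625.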
Results
As in the lead optimization experiment, we focus on the 500 generated molecules with the highest scores and evaluate their synthetic accessibility, associated rewards, and structural diversity. We observe the following:
Unconstrained design: GO demonstrates overall superior performance compared to REINVENT 4 in terms of score, synthetic accessibility (Fig. 6a), and diversity (Table 2) in the PIM1, ERK2, and TRMD case studies.
Fragment growing: As shown in Fig. 6b and Table 2, GO generates higher-scoring and more diverse molecules for the PIM1 target. For the ERK2 target, among the 500 best molecules generated by REINVENT 4, only 99 are synthetically accessible, compared to 442 for GO, with the best molecules generated by GO achieving higher scores. For the TRMD target, although the highest-scoring molecules are produced by REINVENT 4, REINVENT 4 appears to have converged to a local optimum, as indicated by the lower molecular diversity observed in Table 2 and further supported by the molecules in Supplementary information, Fig. SI-27.
Figure 6.
Synthetic accessibility and score distribution for hit discovery experiments. Synthetic accessibility is measured by the number of molecules with an RScore above 0.5 among the top 500 generated molecules. The score plots represent the score distribution for synthetically accessible molecules within each generation. In (a), GO consistently generates molecules with superior scores and higher synthetic accessibility compared to REINVENT 4. In (b), while REINVENT 4 achieves high reward scores for ERK2 and TRMD, it struggles to generate synthetically accessible molecules, in contrast to GO, which generates synthetically accessible molecules with high rewards in all three use cases.
Table 2.
Diversity comparison of the top 500 high-scoring molecules with an RScore above 0.5 in the hit discovery use case. Similarity is evaluated as the average pairwise Tanimoto similarity. The best score is highlighted in bold
| Target | Generator | Similarity between molecules (↓) | Number of scaffolds (↑) | Similarity between scaffolds (↓) |
|---|---|---|---|---|
| Unconstrained design | | | | |
| PIM1 | REINVENT 4 | 0.15 | 270 | 0.35 |
| | GO | 0.24 | 270 | 0.40 |
| ERK2 | REINVENT 4 | 0.15 | 300 | 0.32 |
| | GO | 0.14 | 331 | 0.30 |
| TRMD | REINVENT 4 | 0.14 | 288 | 0.35 |
| | GO | 0.13 | 297 | 0.32 |
| Fragment growing | | | | |
| PIM1 | REINVENT 4 | 0.50 | 70 | 0.61 |
| | GO | 0.40 | 87 | 0.57 |
| ERK2 | REINVENT 4 | 0.40 | 77 | 0.58 |
| | GO | 0.22 | 214 | 0.35 |
| TRMD | REINVENT 4 | 0.49 | 184 | 0.57 |
| | GO | 0.38 | 268 | 0.53 |
The results from the MMGBSA calculations suggest that GO tends to yield lower (more favorable) scores than REINVENT 4, although this trend is not consistent across all experiments. For further details, see Supplementary Fig. SI-11. MMGBSA scores may serve as an independent rescoring method to help prioritize molecules from each generation. For instance, Supplementary Fig. SI-11 highlights the presence of molecules with scores close to 0, which are indicative of weak binding and should therefore be excluded from selection.
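Such a rescoring filter could be sketched as follows. The cutoff is a hypothetical, user-chosen parameter, since no numerical threshold is specified here, and the function name and input layout are likewise illustrative.

```python
def prioritize_by_mmgbsa(candidates, cutoff):
    """Drop weak binders and rank the rest by MMGBSA score, most negative first.

    candidates: dict mapping a molecule identifier to its MMGBSA score.
    cutoff: hypothetical threshold; scores at or above it are treated as
        weak binding and the corresponding molecules are discarded.
    """
    kept = {mol: score for mol, score in candidates.items() if score < cutoff}
    # Lower (more negative) scores are more favorable, so sort ascending.
    return sorted(kept, key=kept.get)
```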
Chemistry constrained use cases
Beyond standard benchmarks, the present use case highlights how GO and LO can be utilized in practical chemistry applications. Chemists often have a clear understanding of the molecular fragments they want to retain, the chemical transformations they aim to perform, and the structural motifs they wish to incorporate.
GO and LO provide precise control over these aspects by allowing users to define specific chemical constraints, such as the number of reaction steps, reaction types, or the integration of a custom dataset of building blocks. This flexibility facilitates the generation of molecules with desired properties while ensuring compatibility with practical constraints, including wet-lab feasibility and building block availability.
Constraints on name reactions
Set up We consider a fragment growing use case aimed at generating molecules that are synthetically accessible in a single step while optimizing both docking and contact scores in the PIM1 case study. The reward function is defined as the geometric mean of the scores listed in Supplementary Table SI-2. To reflect practical constraints, only Suzuki-type reactions are permitted for synthesis. Example SMARTS templates of Suzuki reactions are available in Supplementary Fig. SI-16. This experiment simulates an early-stage hit discovery project, where users have identified an initial fragment interacting with a target contact point within the protein’s binding site.
REINVENT 4 cannot incorporate these constraints directly into the generation process, so they are applied during post-processing once generation is complete. In contrast, GO imposes chemistry-based constraints during generation, allowing users to specify the initial fragment and restrict the synthesis to a single Suzuki reaction step.
Results To assess the synthetic accessibility of the generated molecules, we used the advanced options of the Spaya API to search exclusively for routes involving a single step, which must be a Suzuki-type reaction and must include the PIM1 initial fragment (Supplementary Fig. SI-15). Among the 500 highest-scoring molecules generated by each method, REINVENT 4 produces only two molecules meeting the specified constraints, whereas GO produces 314. The top-scoring molecule is shown in Fig. 7.
Figure 7.
(a) Molecular tree of the top-scoring molecule generated by GO, adhering to the generation constraints (1 Suzuki reaction). (b) Suzuki reaction template used in reaction 1 of the molecular tree. (c) Top-scoring molecule generated by REINVENT 4, adhering to the generation constraints.
Figure 8.
(a–c) The three best molecules generated by REINVENT 4 in the hit discovery use case for the PIM1 target with a spiro compound constraint. (d) Molecular tree of the best molecule generated by GO, where the spiro constraint is satisfied by sampling a spiro compound (a) from the CABB dataset. No constraint is applied to the CABB dataset for sampling building block (d).
Constraints on building blocks
Set up We explore a fragment growing use case aimed at generating synthetically accessible spiro derivatives, composed of two or more rings sharing a single atom, within one or two reaction steps while optimizing the docking and contact scores for the PIM1 target. The reward function is defined as the geometric mean of the scores listed in Supplementary Table SI-3.
GO efficiently satisfies the spiro constraint in its generation process by imposing the selection of a spiro building block at the initial reaction step of the generated molecular trees. In contrast, REINVENT 4 relies solely on the boolean filter that encodes the spiro constraint in the reward function, which provides limited guidance in the optimization process.
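The limited guidance provided by a boolean filter can be illustrated with a minimal sketch. It assumes that the filter gates the reward multiplicatively, so that any molecule without a spiro atom receives a reward of zero; how the filter is actually combined with the other scores is not detailed here, and the names below are hypothetical.

```python
import math


def gated_reward(scores, has_spiro_atom):
    """Geometric-mean reward that is zero unless the boolean constraint holds."""
    if not has_spiro_atom:
        return 0.0  # the filter fails: no signal, whatever the other scores are
    if any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def nonzero_reward_count(batch):
    """Number of molecules in a batch that yield a nonzero reward.

    batch: iterable of (scores, has_spiro_atom) pairs.
    """
    return sum(1 for scores, flag in batch if gated_reward(scores, flag) > 0)
```

Under this assumption, a generator that rarely samples spiro compounds sees mostly zero rewards and receives little information about how to improve the docking and contact scores.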
Results GO inherently accounts for the imposed substructure constraint in its design, achieving a 99% success rate in generating molecules with a spiro atom and a top-scoring compound with a reward of 0.48 (Fig. 8). In contrast, REINVENT 4 struggles to meet this constraint, as it relies on learning the substructure requirement from the reward function. As a result, only 2% of its generated compounds contain a spiro atom. This scarcity limits its ability to optimize the reward function, yielding just six compounds with a nonzero reward and a highest score of 0.11.
Conclusion
This work introduced GO and LO, two reaction-based molecular generative models, and demonstrated their ability to effectively optimize molecules across a wide range of reward functions. This enables their use in various stages of the drug discovery process, including hit discovery and lead optimization. These generative models also support diverse molecule design strategies, such as unconstrained molecule generation, fragment growing, and fragment linking. GO is also capable of generating macrocycles, further expanding its range of capabilities.
We compared the performance of our models exclusively with REINVENT 4, a widely adopted open-source tool for generating SMILES that supports common molecular design strategies like fragment growing and fragment linking. We demonstrated that GO and LO outperform REINVENT 4 in nearly all use cases in terms of scores, diversity, and synthetic accessibility.
Furthermore, GO and LO support a wider range of applications than REINVENT 4, including macrocyclization and the ability to impose constraints on structural motifs and chemical transformations. Beyond these capabilities, GO and LO offer flexibility in integrating further constraints, such as customizing the dataset of commercially available building blocks based on price or availability. Users can also refine synthesis constraints by specifying reaction types, limiting the number of reaction steps, or incorporating the synthesis tree structure into the optimization process.
While the molecular trees generated by GO and LO help design molecules with high synthetic feasibility by construction, they are not full retrosynthetic pathways. Detailed synthesis routes, accounting for chemo- and regioselectivity, should be generated with dedicated tools such as Spaya [27]. Current limitations include the restriction to linear tree structures without converging branches, and the lack of support for multi-component reactions involving more than two reactants. Enabling more flexible tree topologies would significantly expand the accessible chemical space.
GO and LO could be highly beneficial for lab automation, where managing diverse constraints is essential. In robotic synthesis setups, for instance, molecules must often be designed within strict limits on reaction steps, types, and available building blocks. By accommodating these constraints, GO and LO can be seamlessly integrated into iterative design-make-test-analyze cycles, enabling efficient molecule generation tailored to specific drug discovery needs [42].
Key Points
Chemistry-driven design: Growing optimizer (GO) and linking optimizer (LO) models simulate chemical synthesis by selecting building blocks and modeling reactions step by step.
Synthetic accessibility: GO and LO are designed to generate molecules with good synthetic accessibility, making them applicable for real-world drug discovery projects.
Customizable constraints: The approach allows control over reaction types, building blocks, and synthesis pathways, aligning with specific medicinal chemistry strategies. It is especially suited for real-world drug design projects.
Improved performance: Compared with REINVENT 4, a leading molecular generative model, GO and LO show a higher likelihood of producing synthesizable compounds while achieving desired molecular properties.
Supplementary Material
Acknowledgements
We would like to express our sincere gratitude to the Iktos team for their exceptional contributions to this project. Our thanks go to Brice Hoffman, Anna Kriukova, Maud Jusot, Ennys Gheyouche, Stéphanie Labouille, Philippe Pinel, Arthur Carre, Jean-Baptiste Chéron, and Guillaume Plum for their invaluable work on the contact score and the setup of the docking pipeline. We also acknowledge Florian Kasper for his contributions in preparing the data building block dataset, and Paul Join-Lambert and Maxim Shevelev for their development of the pipeline to obtain the reaction dataset. Their efforts were crucial in providing the necessary tools and resources for this research. We are deeply grateful to the engineering team for granting us access to the computing tools and servers that enabled the execution of our models and experiments. Additionally, our thanks go to Massina Abderrahmane and Maud Parrot for their valuable work on the generators, and to everyone at Iktos who contributed to this project in any capacity. Their expertise and support have been indispensable. Lastly, we would like to extend our sincere thanks to Stefani Gamboa, Victoire Cachoux, and Maxime Laugeois for their time and effort in proofreading this manuscript. Their insightful comments and suggestions have greatly improved the clarity and quality of our work.
Contributor Information
Clarisse Descamps, Iktos, 65 Rue de Prony, 75017 Paris, Île-de-France, France.
Vincent Bouttier, Iktos, 65 Rue de Prony, 75017 Paris, Île-de-France, France.
Juan Sanz García, Iktos, 65 Rue de Prony, 75017 Paris, Île-de-France, France.
Maoussi Lhuillier-Akakpo, Iktos, 65 Rue de Prony, 75017 Paris, Île-de-France, France.
Quentin Perron, Iktos, 65 Rue de Prony, 75017 Paris, Île-de-France, France.
Hamza Tajmouati, Iktos, 65 Rue de Prony, 75017 Paris, Île-de-France, France.
Author contributions
Clarisse Descamps: Conceptualization, Methodology, Software, Investigation, Writing—Original Draft; Vincent Bouttier: Methodology, Software, Investigation, Writing—Original Draft; Juan Sanz García: Investigation, Writing—Original Draft; Maoussi Lhuillier-Akakpo: Project administration, Writing—Review & Editing; Quentin Perron: Conceptualization, Methodology; Hamza Tajmouati: Conceptualization, Methodology, Writing—Original Draft; All authors: Writing—Review & Editing, Approval of final manuscript.
Conflict of interest: The authors declare the following competing interests. All authors are employees at Iktos.
Funding
The study was funded by Iktos.
Data availability
The Pistachio dataset, obtained from NextMove Software, was used to train the reaction template predictor models [31]. The molecular data used to train GO was sourced from ChEMBL [32]. The building blocks in the CABB dataset were obtained through supplier agreements established by Iktos with 1PlusChem, Ambeed, BLDpharm, ChemBridge, Chemspace, eMolecules, Key Organics, Life Chemicals, and TCI Chemicals. The code supporting this study is available at https://github.com/iktos/growing_linking_optimizers. Components corresponding to GO, LO, and structure-based analysis are proprietary and available under a commercial license from Iktos; they are not included in the public distribution.
References
- 1. Paul SM, Mytelka DS, Dunwiddie CT et al. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat Rev Drug Discov 2010;9:203–14. 10.1038/nrd3078
- 2. Sumathi S, Suganya K, Swathi K et al. A review on deep learning-driven drug discovery: strategies, tools and applications. Curr Pharm Des 2023;29:1013–25. 10.2174/1381612829666230412084137
- 3. Stanley M, Segler M. Fake it until you make it? Generative de novo design and virtual screening of synthesizable molecules. Curr Opin Struct Biol 2023;82:102658. issn: 0959-440X. 10.1016/j.sbi.2023.102658. url: https://www.sciencedirect.com/science/article/pii/S0959440X2300132X
- 4. Nicolaou CA, Brown N. Multi-objective optimization methods in drug design. Drug Discov Today Technol 2013;10:e427–35. issn: 1740-6749. 10.1016/j.ddtec.2013.02.001. https://www.sciencedirect.com/science/article/pii/S1740674913000085
- 5. Hughes JP, Rees S, Kalindjian SB et al. Principles of early drug discovery. Br J Pharmacol 2011;162:1239–49. 10.1111/j.1476-5381.2010.01127.x
- 6. Loeffler HH, He J, Tibo A et al. Reinvent 4: modern AI–driven generative molecule design. J Cheminform 2024;16:20. 10.1186/s13321-024-00812-5
- 7. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6. https://api.semanticscholar.org/CorpusID:5445756. 10.1021/ci00057a005
- 8. Gómez-Bombarelli R, Wei JN, Duvenaud D et al. Automatic chemical design using a data-driven continuous representation of molecules. In: CoRR, 2016. arXiv: 1610.02415.
- 9. Graves A. Generating sequences with recurrent neural networks. In: CoRR, 2013. http://arxiv.org/abs/1308.0850.
- 10. Segler MHS, Kogej T, Tyrchan C et al. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Sci 2018;4:120–31. issn: 23747951. 10.1021/acscentsci.7b00512
- 11. Gao Z, Hu Y, Tan C, Li SZ. PrefixMol: Target- and Chemistry-Aware Molecule Design Via Prefix Embedding. 2023. arXiv: 2302.07120[cs.AI]. https://arxiv.org/abs/2302.07120.
- 12. Li Y, Vinyals O, Dyer C et al. Learning deep generative models of graphs. In: CoRR, 2018. arXiv: 1803.03324. http://arxiv.org/abs/1803.03324.
- 13. Li Y, Zhang L, Liu Z. Multi-Objective De Novo Drug Design with Conditional Graph Generative Model. 2018. arXiv: 1801.07299[q-bio.QM]. https://arxiv.org/abs/1801.07299.
- 14. Podda M, Bacciu D, Micheli A. A Deep Generative Model for Fragment-Based Molecule Generation. 2020. arXiv: 2002.12826[stat.ML]. https://arxiv.org/abs/2002.12826.
- 15. Yang S, Hwang D, Lee S et al. Hit and Lead Discovery with Explorative RL and Fragment-Based Molecule Generation. 2021. arXiv: 2110.01219[q-bio.QM]. https://arxiv.org/abs/2110.01219.
- 16. Gao W, Coley CW. The synthesizability of molecules proposed by generative models. J Chem Inf Model 2020;60:5714–23. 10.1021/acs.jcim.0c00174
- 17. Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform 2009;1:8. 10.1186/1758-2946-1-8
- 18. Coley CW, Rogers L, Green WH et al. SCScore: synthetic complexity learned from a reaction corpus. J Chem Inf Model 2018;58:252–61. 10.1021/acs.jcim.7b00622
- 19. Parrot M, Tajmouati H, da Silva VBR et al. Integrating synthetic accessibility with AI-based generative drug design. J Cheminform 2023;15:83. 10.1186/s13321-023-00742-8
- 20. Bradshaw J, Paige B, Kusner MJ et al. Barking up the right tree: an approach to search over molecule synthesis DAGs. arXiv NeurIPS 2020. issn: 23318422
- 21. Cretu M, Harris C, Igashov I et al. SynFlowNet: Design of Diverse and Novel Molecules with Synthesis Constraints. 2024. https://arxiv.org/abs/2405.01155.
- 22. Koziarski M, Harris C, Igashov I et al. RGFN: Synthesizable Molecular Generation Using GFlowNets. 2024. arXiv: 2406.08506[physics.chem-ph]. https://arxiv.org/abs/2406.08506.
- 23. Seo S, Kim M, Shen T et al. Generative Flows on Synthetic Pathway for Drug Design. 2024. https://arxiv.org/abs/2410.04542.
- 24. Daylight Chemical Information Systems, Inc. SMARTS. http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (4 February 2025, date last accessed).
- 25. Schwaller P, Laino T, Gaudin T et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Central Sci 2019;5:1572–83. 10.1021/acscentsci.9b00576
- 26. Enamine Ltd. REAL Space Navigator. https://enamine.net/compound-collections/real-compounds/real-space-navigator (April 2025, date last accessed).
- 27. Iktos. Spaya retrosynthesis tool. https://www.spaya.ai (2024, date last accessed).
- 28. Cho K, van Merrienboer B, Bahdanau D et al. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In: CoRR abs/1409.1259, 2014. arXiv: 1409.1259. http://arxiv.org/abs/1409.1259.
- 29. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986;323:533–6. https://api.semanticscholar.org/CorpusID:205001834. 10.1038/323533a0
- 30. Segler MHS, Waller MP. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry (Weinheim an der Bergstrasse, Germany) 2017;23:5966–71.
- 31. NextMove Software. Pistachio. https://www.nextmovesoftware.com/pistachio.html (2024, date last accessed).
- 32. ChEMBL. ChEMBL. https://www.ebi.ac.uk/chembl/ (2024, date last accessed).
- 33. Neil D, Le Roux N, Norouzi M et al. Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design. 2018. https://openreview.net/forum?id=HkcTe-bR- (2024, date last accessed).
- 34. Ahmed Z, Le Roux N, Norouzi M et al. Understanding the Impact of Entropy on Policy Optimization. 2019. https://arxiv.org/abs/1811.11214.
- 35. Fu T, Gao W, Xiao C et al. Differentiable scaffolding tree for molecule optimization. In: International Conference on Learning Representations, 2022. https://openreview.net/forum?id=w_drCosT76.
- 36. Brown N, Fiscato M, Segler MHS et al. GuacaMol: benchmarking models for de novo molecular design. J Chem Inf Model 2019;59:1096–108. issn: 15205142. 10.1021/acs.jcim.8b00839
- 37. Genheden S, Thakkar A, Chadimová V et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform 2020;12:70. 10.1186/s13321-020-00472-1
- 38. Gao W, Fu T, Sun J et al. Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization. 2022. arXiv: 2206.12411[cs.CE]. https://arxiv.org/abs/2206.12411.
- 39. Ciepliński T, Danel T, Podlewska S et al. Generative models should at least be able to design molecules that dock well: a new benchmark. J Chem Inf Model 2023;63:3238–47. 10.1021/acs.jcim.2c01355
- 40. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model 2010;50:742–54. 10.1021/ci100050t
- 41. Bickerton GR, Paolini GV, Besnard J et al. Quantifying the chemical beauty of drugs. Nat Chem 2012;4:90–8. 10.1038/nchem.1243
- 42. Medcalf M, Jain V, Gamboa S et al. Overcoming DMTA cycle challenges: a unified AI-driven system for efficient drug design. ChemRxiv. 2025; Version 2. Preprint. 10.26434/chemrxiv-2024-0z7g6-v2.