ACS Polymers Au. 2023 Mar 29;3(4):318–330. doi: 10.1021/acspolymersau.3c00003

Open Macromolecular Genome: Generative Design of Synthetically Accessible Polymers

Seonghwan Kim, Charles M Schroeder †,‡,§, Nicholas E Jackson †,§,*
PMCID: PMC10416319  PMID: 37576712

Abstract

A grand challenge in polymer science lies in the predictive design of new polymeric materials with targeted functionality. However, de novo design of functional polymers is challenging due to the vast chemical space and an incomplete understanding of structure–property relations. Recent advances in deep generative modeling have facilitated the efficient exploration of molecular design space, but data sparsity in polymer science is a major obstacle hindering progress. In this work, we introduce a vast polymer database known as the Open Macromolecular Genome (OMG), which contains synthesizable polymer chemistries compatible with known polymerization reactions and commercially available reactants selected for synthetic feasibility. The OMG is used in concert with a synthetically aware generative model known as Molecule Chef to identify property-optimized constitutional repeating units, constituent reactants, and reaction pathways of polymers, thereby advancing polymer design into the realm of synthetic relevance. As a proof-of-principle demonstration, we show that polymers with targeted octanol–water solubilities are readily generated together with monomer reactant building blocks and associated polymerization reactions. Suggested reactants are further integrated with Reaxys polymerization data to provide hypothetical reaction conditions (e.g., temperature, catalysts, and solvents). Broadly, the OMG is a polymer design approach capable of enabling data-intensive generative models for synthetic polymer design. Overall, this work represents a significant advance, enabling the property targeted design of synthetic polymers subject to practical synthetic constraints.

Keywords: polymer design, polymer database, generative machine learning, synthetic feasibility, reaction condition

Introduction

Polymers are complex macromolecules spanning an enormous chemical space. A key challenge in polymer science lies in the predictive design and discovery of new polymeric materials using a combination of experiments and computational modeling. Polymers have an intrinsic multiscale nature consisting of atomistic monomer structure, intra- and intermolecular interactions, macromolecular conformations, and self-assembly phenomena at mesoscopic scales.1–3 A comprehensive understanding of phenomena across disparate length scales is required to design polymers with desired functionality due to the interplay of interactions and forces spanning multiple scales.4–8 In recent years, computational multiscale modeling and scale-bridging strategies of polymeric materials have been developed to accurately capture experimental data.1 Despite recent progress, predictive design and discovery of new polymers are in the early stages due to challenges in navigating a vast chemical and morphological design space.2,9

Data-driven techniques have emerged as a powerful paradigm for navigating molecular and polymeric design spaces.10–15 Generative machine learning (ML) employing deep neural networks can effectively learn the intrinsic distributions of a chemical dataset and navigate these distributions to design molecules with desired properties for a broad array of applications.16–25 Variational autoencoders have been used to search chemical space using latent variables obtained from the neural network encoding and decoding of a large amount of molecular data. Despite recent advances in using generative models for designing small organic molecules, such success has not yet been broadly achieved in polymer science. A few prior studies have reported the use of generative modeling to suggest potential chemistries for linear homopolymers26 or large molecules.27,28 However, key challenges remain in adapting generative modeling to polymer science, including the proper choice of polymer featurization29 capturing chemical, sequence, and microstructural heterogeneities inherent to polymer chemistries. Nevertheless, generative modeling holds strong potential to enable the predictive design of polymeric materials with targeted functionality.

Progress in generative polymer models critically requires large data sets of polymer chemistries that delineate the nature of the relevant chemical space. Expansive datasets such as QM930 and ZINC31 have enabled the generative design of small organic and druglike molecules, with most generative models employing only a handful of datasets to learn chemical space. However, the relevance of small molecule datasets is not clear for generative polymer design, in part because most molecules in small-molecule datasets are biased toward pharmaceutical compounds and may not contain the functional groups required for polymerization. To obtain high-quality polymer data for generative polymer design, several polymer databases have been constructed including PI1M,32 PolyInfo,33 Polymer Genome,34 and CRIPT.35 Notably, the PI1M polymer constitutional repeating unit (CRU) dataset32 is an open-source dataset generated using a recurrent neural network model trained on experimentally synthesized PolyInfo polymers.33 However, the PI1M database neither guarantees synthetic access nor provides synthetic pathways necessary for polymer synthesis. On the other hand, Polymer Genome34 is a polymer dataset that provides synthetic pathways to generated polymers,36 but the Polymer Genome dataset is proprietary and can only be accessed (but not exhaustively searched) via a web interface. The development of the CRIPT framework35 is a crucial step toward the development of an open-source dataset but is not yet sufficiently developed to possess enough diverse chemistries to enable generative ML models. Despite recent progress, the limited quantity, synthesizability, and/or accessibility of these data sources limit researchers from developing generative models for polymeric materials.

A common feature limiting the development of generative ML models is the lack of synthetic viability of suggested structures. In recent years, deep generative models have primarily focused on the functionality, chemical validity (e.g., obeying the octet rule), or diversity of generated molecular structures without consideration of synthetic feasibility from common or commercially available reactants. A few prior studies17,18,22,25 considered the synthetic feasibilities of generated small organic molecules by incorporating synthetic accessibility scores37 into optimization objectives. Ideally, explicit synthetic routes to generated molecules would be incorporated in the predictive design process to enable experimental realizations. Most retrosynthetic tools38–43 have been developed for small organic molecules, with only a few prior examples of incorporating retrosynthetic pathways into polymer design.36,44 A generative modeling scheme incorporating synthetic feasibility of both reactants and polymerization pathways remains a key challenge for achieving data-driven polymer design.

In this work, we report a generative framework for the design of synthetically accessible polymers with targeted functionalities. We introduce the Open Macromolecular Genome (OMG), a synthetically aware database of synthetic polymer chemistries, that facilitates the training of generative polymer design models (Figure 1). The OMG represents a chemical space of linear homopolymers obtained from monomer reactants with the necessary functional groups required for polymerization. We demonstrate the utility of generative modeling using OMG by adapting the Molecule Chef,45 a generative ML model, to specify synthetic routes to generated polymers (Figure 1). The OMG database is augmented with polymer Log P(46) values (log10 of partition coefficient P between 1-octanol and water—a crucial measure of aqueous solubility and toxicity in organic molecules47,48), which is used to train the Molecule Chef generative model to construct a polymer design space represented by latent variables. Targeted polymers and their associated monomer reactants are generated by exploring the polymer design space guided by the learned polymer Log P prediction function. We conclude with future directions for improving the OMG framework to accommodate additional complexities intrinsic to polymeric materials.

Figure 1.

Generative polymer design framework using the OMG introduced in this work. The OMG leverages a vast database of purchasable chemistries and known reaction templates to seed powerful generative ML models that identify property-optimized targets and their synthetic components. To identify CRUs of a targeted polymer property (e.g., Log P), the OMG (spanning step growth, chain growth, ring opening, and metathesis) can be utilized to search a polymer design space. The OMG is used in concert with a synthetically aware generative model known as Molecule Chef to identify property-optimized CRUs, constituent reactants, and reaction pathways of polymers.

Methods

OMG

To promote the open-source, data-driven discovery of synthetically accessible polymers, the OMG was developed with viable retrosynthetic pathways and commonly available reactants. The base level of the OMG consists of a large database of commercially available reactants with functional groups compatible with at least one of 17 canonical polymerization reactions (Figure 2) to generate CRUs.49 The 17 polymerization reactions used in the OMG span a variety of step growth, chain growth, ring opening, and metathesis polymerization approaches commonly employed in industry.50–53 The OMG is designed to be an open and evolving database and consequently is not limited to these 17 reactions in future iterations. Although the 17 polymerization reactions encode synthetic accessibility by selecting reactants with functional group compatibility, there is no guarantee that existing small-molecule databases commonly employed in generative design efforts (e.g., QM930 and ZINC31) populate a chemical space of industrial relevance to synthetic polymer design. To address this challenge, additional synthetic viability in the OMG is enforced by only incorporating functional group-compatible reactants available in the eMolecules database. eMolecules is an online platform providing links to commercially available molecules54 and has been previously used in the design of organic molecules.17,55 By downselecting OMG reactants from the broadest available set of purchasable reactant chemistries, the OMG reactant set avoids bias toward druglike (ZINC) or artificial (QM9) chemistries that would result from existing chemical databases.

Figure 2.

Schematic showing the 17 polymerization reactions in the OMG that cover a variety of step growth (red), chain growth addition (green), ring opening (blue), and metathesis polymerization (purple) reactions.

To develop the OMG database, approximately 24 million potential reactant molecules were downloaded from eMolecules (data release 2022.08.01) and downselected to maximize synthetic realizability. Each molecule was transformed into a canonical SMILES56 string using RDKit57 (release 2021.09.4), with stereochemistry being ignored. This set was further downselected to 3.1 million molecules to identify only those molecules containing the functional groups compatible with at least one of the 17 OMG polymerization reactions by utilizing SMARTS representations to locate certain functional groups. These filtered reactants, however, may not be suitable reactants due to potentially undesirable side reactions, and thus the synthetic complexity score (SCScore)58 was further applied to the functional group filtered reactants according to the frequency of appearance in the Reaxys database.59 As all potential reactant molecules are commercially available at eMolecules, the SCScore was chosen to estimate how often a reactant molecule has been utilized in published literature as a proxy for its synthetic feasibility; we found the SCScore measure to be more relevant to synthetic reality than an alternative computational metric37 when quantifying synthetic feasibility. By only selecting monomer reactants with a lower SCScore than the mean SCScore of PolyInfo reactants, the reactant dataset was reduced to 77,281 reactants upon which the 17 OMG polymerizations could be applied (Figures S1 and S2 in the Supporting Information). Based on the selected 77,281 reactants in the OMG, nearly 12 million chemically distinct CRUs corresponding to these reactants were generated by applying the 17 template-based polymerization algorithms implemented with RDKit (Figure S3 in the Supporting Information). A similar template-based reaction prediction approach was previously used for polymer retrosynthesis planning.36 This dataset constitutes the OMG CRU database used throughout this work. 
Detailed implementation of the OMG CRU generation algorithms can be found at https://github.com/TheJacksonLab/OpenMacromolecularGenome.
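The final downselection step described above (keeping only reactants whose SCScore is lower than the mean SCScore of a reference reactant set) can be illustrated with a minimal sketch. The function name, the SMILES strings, and all score values below are hypothetical placeholders chosen for illustration; real SCScores would come from the SCScore model, and this is not the code from the OMG repository.

```python
from statistics import mean


def select_below_reference_mean(candidate_scores, reference_scores):
    """Keep candidates whose score is strictly below the mean reference score.

    candidate_scores: dict mapping a reactant SMILES string to its score.
    reference_scores: list of scores for a reference reactant set
                      (the role played by PolyInfo reactants in the text).
    """
    threshold = mean(reference_scores)
    kept = {smi: s for smi, s in candidate_scores.items() if s < threshold}
    return kept, threshold


# Hypothetical scores, for illustration only (not real SCScore values).
reference_scores = [1.5, 2.0, 2.5]
candidate_scores = {
    "OCCO": 1.2,                  # ethylene glycol
    "OC(=O)CCC(=O)O": 1.9,        # succinic acid
    "C1CC1C(N)C(=O)OCCBr": 3.4,   # made-up, more complex reactant
}

selected, threshold = select_below_reference_mean(candidate_scores, reference_scores)
print(threshold, sorted(selected))
```

A strict inequality is used here to mirror the phrase "lower than the mean"; whether ties are kept is an assumption that the text does not settle.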

Given the enormous number of desired structural and property descriptors of polymeric materials,35 the initial manifestation of the OMG serves primarily as a data-rich repository of synthetically relevant, linear homopolymer chemistries to enable generative ML models. Although the vastness and synthetic relevance of the OMG is unprecedented, it is important to note that the design space in the initial manifestation has been purposefully limited to mitigate an exponentially growing design space. CRUs in the OMG are linear (branched or copolymer structures may be incorporated in future efforts60), agnostic to stereochemical information, and do not detail regiochemical specificity as the latter is often a delicate function of the reaction kinetics. These simplifications have been successfully adopted by others in the field to effectively address the curse of dimensionality in the polymer design space.26,61,62 Moreover, the OMG is not an exhaustive experimental property database in the same style as other databases such as PolyInfo. To demonstrate the power and practicality of generative models that leverage the OMG, the initial database has been augmented with Log P estimations as a physical property target of interest to polymer design, with clear applications to the solubility and toxicity of polymers.47,48 In future efforts, we aim to aggregate broader and more diverse sets of experimental property data in conjunction with experimental scientists to provide an open database of polymer properties to facilitate generative ML models.

Generative Models

We adapt the Molecule Chef architecture45 to identify property-optimized CRUs from the OMG and their associated reactants. Molecule Chef is a generative ML model endowed with retrosynthetic capabilities, which provides a significant advantage over traditional generative models focused only on designing polymer CRUs that are blind to synthetic feasibility. Molecule Chef is based on a variational autoencoder (VAE) that uses Bayesian statistics to extract essential features of input data by compressing and decompressing data points.63 Feature extraction via a VAE can be thought of as analogous to a normal mode analysis of molecular vibrations in a complex molecular structure. As a normal mode analysis provides insight into simple collective variables that describe seemingly complicated molecular vibrations,64 low-dimensional hidden features extracted by the encoding and decoding processes of a VAE can be utilized to construct simple collective variables (embeddings) of arbitrarily complex datasets. Following feature extraction, a continuous low-dimensional latent space representing hidden features is obtained that provides a reduced description of the essential features of the data. When trained on CRUs, VAEs provide a continuous latent space from which physical polymer properties can be predicted by regression methods, resulting in organization of the latent space according to the physical property of interest. Locations within the latent space with desirable physical properties can then be identified via gradient-based search and decoded into the associated CRU. Figure 3 illustrates a representative example of the generation process for a CRU using OMG-enabled Molecule Chef derived from a condensation reaction of dicarboxylic acid and diol chemistries, in which a novel reactant combination that is not present in the training data is discovered.

Figure 3.

Overview of training the Molecule Chef generative model on OMG reactant combinations to design polymers with high Log P values. (a) Training the Molecule Chef generative model using the OMG reactants and (b) generating new polymers with high Log P values through latent space optimization.

As a proof-of-principle demonstration, we utilized the octanol–water solubility (Log P) as a target property to be optimized using the Molecule Chef generative model trained on OMG chemistries. The Log P of a molecule can be estimated by summing the predefined solubility value of each atomic contribution.46 This computational Log P is easily accessible from RDKit and has been utilized as a target physical property in prior work.17–19,21,22,24 To demonstrate the generative performance of the Molecule Chef together with OMG, we constructed the training datasets consisting of OMG monomer reactant combinations and Log P values of the corresponding OMG CRUs. Approximately 150,000 polymers were randomly selected among the total of nearly 12 million OMG CRUs and split into train, validation, and test data (8:1:1). To compensate for the length bias of CRUs, the Log P values obtained from RDKit were normalized by the number of backbone atoms in CRUs. Because the computational Log P obtained from RDKit is the sum of the Log P contributions of each atom in a given molecular structure, this normalization effectively compares polymers at a fixed backbone length. The hyperparameters of Molecule Chef were optimized with Optuna65 parallelized over multiple GPUs using Dask.66 Detailed descriptions of Log P normalization and hyperparameters of generative models are provided in the Supporting Information (Figures S4 and S5). The OMG database and generative models are freely available at https://github.com/TheJacksonLab/OpenMacromolecularGenome and https://doi.org/10.5281/zenodo.7556942. As a benchmark for this new application of Molecule Chef, we also trained a SELFIES VAE on OMG CRUs. The SELFIES representation possesses the beneficial feature of achieving 100% chemical validity of generated strings as output from VAE-based models67 but is capable only of producing CRUs, without any retrosynthetic capabilities.
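The two data-preparation steps described above, normalizing Log P by the number of backbone atoms and splitting the sampled polymers 8:1:1, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the backbone atom count is taken as a given input (its precise definition is in the Supporting Information), and the records are synthetic placeholders.

```python
import random


def normalize_log_p(log_p, n_backbone_atoms):
    """Divide a summed, atom-additive Log P by the CRU backbone atom count."""
    if n_backbone_atoms <= 0:
        raise ValueError("n_backbone_atoms must be positive")
    return log_p / n_backbone_atoms


def split_8_1_1(items, seed=0):
    """Shuffle and split items into train, validation, and test sets (8:1:1)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Synthetic placeholder records: (polymer id, Log P, backbone atom count).
records = [(i, 0.5 * (i % 7), 2 + (i % 5)) for i in range(1000)]
dataset = [(pid, normalize_log_p(lp, nb)) for pid, lp, nb in records]
train, val, test = split_8_1_1(dataset, seed=0)
print(len(train), len(val), len(test))
```

Fixing the shuffle seed keeps the split reproducible, which matters when two models (here, Molecule Chef and the SELFIES VAE) are to be compared on the same polymers.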

To demonstrate the effectiveness of the OMG combined with state-of-the-art generative models, the Molecule Chef and SELFIES VAE generative models were trained in three canonical cases of polymer design. In the first case, Molecule Chef was trained on the condensation reaction between dicarboxylic acid and diol reactants to assess the generative performance of the Molecule Chef when optimized over a single polymerization reaction. In the second case, Molecule Chef was trained on six step-growth polymerization reactions in the OMG to assess the generative performance of Molecule Chef under multiple related polymerization reactions between two different reactants. In the third case, to extend the generative scope of polymerization reactions, Molecule Chef was trained on all 17 OMG polymerization reactions consisting of polymerization between two different reactant types (reactions 1–6, Figure 2) and polymerization between the same reactant type (reactions 7–17, Figure 2), where the latter case focused on alternating copolymers. For simplicity, we assumed that alternating copolymers are always constructed regardless of the reactivity of each monomer. Molecule Chef was then trained on OMG monomer reactant combinations across all 17 OMG polymerization reactions. The SELFIES VAE was trained on the CRUs of the same OMG polymers used in Molecule Chef training. Training data compositions are provided in the Supporting Information (Figure S6).

Results

Chemical Structure of the OMG Dataset

We began by investigating the intrinsic chemical structure of the OMG dataset (Figure 4). To understand the difference between OMG reactants and existing chemical datasets, we compared the chemical space of OMG reactants to those of QM9 and ZINC. Principal component analysis (PCA) was applied to numerical featurizations (Morgan fingerprints;68 nBits = 1024, radius = 2) of OMG reactants, PolyInfo reactants, QM9, and ZINC to obtain simplified representations for visualization and interpretation. For consistency, we utilized the two principal axes with the largest eigenvalues obtained from reactants in the PolyInfo database, operating under the assumption that the PolyInfo reactants occupy the most realistic chemical space for experimental polymerization. The explained variance of PolyInfo reactants from PCA is provided in the Supporting Information (Figure S7a). Figure 4a shows the PCA of PolyInfo and OMG reactants, indicating that the OMG reactants cover a significant portion of the PolyInfo reactant chemical space. Figure 4b shows the PCA of the OMG reactants, QM9, and ZINC using the principal axes of the PolyInfo reactants. The PCA implies that the OMG reactants occupy a larger portion of the polymerizable reactant space than QM9, which likely reflects the fact that QM9 consists of artificially generated small molecules of up to nine heavy atoms (CNOF) with no assumptions on polymerization. In addition, the OMG reactants and ZINC molecules have a similar coverage of the principal component space, implying that the chemical space of the OMG reactants has a substantial overlap with that of ZINC. This partial overlap is anticipated because the OMG reactants were obtained by downselecting commercially available chemistries in eMolecules that exhibit some overlap with previously synthesized chemistries in ZINC. 
However, it is important to note that the OMG reactants impose multiple synthesizability constraints, as previously described, which significantly increases the relevance of the OMG as a baseline for generative polymer modeling compared to ZINC.

Figure 4.

PCA of OMG reactants and CRUs. (a) PCA of the OMG and PolyInfo reactants. (b) PCA of the OMG reactants, QM9, and ZINC molecules. (c) PCA of the OMG and PolyInfo CRUs. Randomly sampled sets of 250,000 ZINC and 2 million OMG CRUs were used in the PCA. The density of each dataset was normalized independently and plotted on the marginal axis (top and right-hand side of plots).

We further utilized PCA to understand chemical differences between the experimentally synthesized PolyInfo polymers and the OMG-generated CRUs. Again, the principal axes of the PolyInfo polymers were obtained as the reference axes by applying PCA to the Morgan fingerprints (nBits = 1024, radius = 2) of the PolyInfo CRUs. The explained variance of PolyInfo CRUs from PCA is provided in the Supporting Information (Figure S7b). The Morgan fingerprints of two million randomly selected OMG CRUs were then projected on the two-dimensional plane using the two largest principal components of the PolyInfo polymers. Figure 4c shows that the OMG CRU chemical space occupies a subset of the PolyInfo CRU chemical space, suggesting that the CRUs constructed via the OMG reactant database and the 17 OMG polymerization reactions yield a significant chemical overlap with the largest known assembled database of synthesized polymers. Because the OMG dataset consists of only 17 common polymerization reactions, the chemical space of OMG CRUs can be further expanded to cover a larger space of synthesizable polymers in future iterations. Detailed compositions of OMG CRUs are provided in the Supporting Information (Figure S3).

Generative Model

The Molecule Chef and SELFIES VAE generative models were applied to OMG datasets to search for polymers with targeted solubilities as represented by Log P. To achieve this, high-dimensional latent spaces were constructed by balancing the reconstruction loss and divergence loss of the VAE with Log P regression performance from the latent space, resulting in a structuring of the latent space according to predicted Log P values (Figure 5). Figure 5 shows the PCA of the latent spaces obtained from Molecule Chef and SELFIES VAE trained on a single step-growth polymerization reaction (dicarboxylic acid and diol condensation), six step-growth polymerization reactions, and all 17 OMG polymerization reactions, respectively. In all cases, the latent spaces of the generative models became highly structured according to Log P along the principal axes of the latent space projection, consistent with previous work using VAE for small-molecule based molecular design.17,69
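The balance between reconstruction, divergence, and property regression described above can be written generically as a weighted sum. The form below is only a sketch of a common property-regularized VAE objective, not the exact loss used in this work; the weights $\beta$ and $\lambda$, the squared-error regression term, and the symbols $q_\phi$, $p_\theta$, and $f_\psi$ are assumptions introduced for illustration.

```latex
\mathcal{L}(\theta, \phi, \psi)
  = \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction loss}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{divergence loss}}
  + \lambda \, \underbrace{\left(f_\psi(z) - y\right)^2}_{\text{Log } P \text{ regression loss}}
```

Here $x$ is the input (a reactant combination or CRU), $z$ its latent encoding, $p(z)$ the Gaussian prior, $y$ the normalized Log P, and $f_\psi$ a regressor acting on the latent space. Increasing $\lambda$ would be expected to organize the latent space more strongly by Log P, at some cost to reconstruction.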

Figure 5.

PCA of the latent spaces of Molecule Chef and SELFIES VAE models trained on the OMG polymers. (a) Latent space of the Molecule Chef and (b) latent space of the SELFIES VAE model trained on the condensation reactions between dicarboxylic acid and diol. (c) Latent space of Molecule Chef and (d) latent space of the SELFIES VAE model trained on the six step-growth polymerization reactions. (e) Latent space of Molecule Chef and (f) latent space of the SELFIES VAE model trained on all of the OMG polymerization reactions.

For all training cases incorporating different sets of polymerization reactions, Molecule Chef and SELFIES VAE achieved an R² score of Log P prediction > 0.90 for the test sets of OMG CRUs. For the training cases involving a single polymerization reaction and six step-growth polymerization reactions, SELFIES VAE showed a test set reconstruction accuracy > 80%, whereas Molecule Chef achieved a test set reconstruction accuracy > 90%. Despite the intrinsic difficulty of simultaneously learning many reactions, both generative models obtained test set reconstruction accuracies > 70% for the training on all 17 OMG polymerization reactions. A higher test set reconstruction accuracy may be achievable if generative models are trained on larger datasets of OMG polymers. Decreasing reconstruction performance correlates with the inclusion of a larger variety of polymerization reactions, implying that focused training on a subset of essential reactions oriented toward targeted polymer properties might be desirable rather than training on all 17 reactions simultaneously. Latent space organization with high polymer reconstruction rates and accurate Log P predictions implies that both generative models learned meaningful polymer structure–property relationships. The detailed training results are provided in the Supporting Information (Figure S8).

Following construction of the latent space according to Log P predictions, the chemical properties of the latent space representations were investigated by decoding several latent space points into their associated CRUs (Figure 6). Figure 6a shows the decoding of a linear interpolation between a high Log P latent point and a low Log P latent point of Molecule Chef trained on six step-growth polymerization reactions. The obtained OMG reactant combinations exhibit (on average) a gradual change from hydrophilic to hydrophobic character by adding aromatic groups to monomer reactants as the principal component 1 value increases, though deviations from this trend are observed corresponding to variations along less important principal components. Importantly, the generated OMG reactant combinations correspond to specific synthetic routes associated with one of the OMG polymerization reactions, which is a key benefit of our approach. However, the latent space decoding of Molecule Chef does not always generate a valid OMG reactant combination following the OMG polymerization reactions. Even though the Molecule Chef was trained on the OMG reactant combinations satisfying one of the OMG polymerization reactions, the latent space search for novel OMG combinations may yield OMG reactant combinations not satisfying the OMG polymerization reactions. The latent space search of the Molecule Chef embeddings can result in a reactant possibly outside of the training regime; this means that there is no guarantee that Molecule Chef will always decode novel reactant combinations that possess proper functional groups that correspond to the specified OMG polymerization reactions. To quantitatively analyze the latent space quality, we generated 30,000 OMG monomer reactant combinations following the Gaussian prior distribution on the latent space of Molecule Chef trained on six step-growth polymerization reactions. 
We then obtained the percentage of valid OMG monomer reactant combinations belonging to one OMG polymerization reaction (83.44%), unique OMG monomer reactant combinations not overlapping with another generated OMG monomer reactant combination by the Gaussian prior (97.56%), and novel OMG monomer reactant combinations not duplicated with training OMG monomer reactant combinations (95.95%), using a similar method as in prior work.45
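The three generation metrics reported above (validity, uniqueness, and novelty) can be sketched for reactant combinations represented as sets of SMILES strings. This is a minimal illustration, not the evaluation code of this work: the choice of denominators (uniqueness among valid samples, novelty among unique valid samples) is an assumption, and `is_valid_pair` is a hypothetical stand-in for a real check that a combination matches one OMG polymerization reaction.

```python
def generation_metrics(generated, training, is_valid):
    """Compute validity, uniqueness, and novelty of generated combinations.

    generated: list of hashable reactant combinations (e.g., frozensets of SMILES).
    training:  iterable of reactant combinations seen during training.
    is_valid:  callable returning True if a combination is considered valid.
    """
    valid = [g for g in generated if is_valid(g)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def is_valid_pair(combination):
    """Hypothetical validity rule: exactly two distinct reactants."""
    return len(combination) == 2


diacid = "OC(=O)CCC(=O)O"
combo_a = frozenset({"OCCO", diacid})
combo_b = frozenset({"OCCCO", diacid})
combo_c = frozenset({"OCCO"})  # invalid under the hypothetical rule

generated = [combo_a, combo_a, combo_b, combo_c]
training = {combo_b}
metrics = generation_metrics(generated, training, is_valid_pair)
print(metrics)
```

Using `frozenset` makes a combination independent of reactant order, so the same diol and diacid pair is counted once regardless of how it was decoded.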

Figure 6.

Latent space analysis of generative models. Linear interpolation on the (a) latent space of the Molecule Chef and (b) latent space of the SELFIES VAE model trained on six step-growth polymerization reactions. Points in the latent space correspond to the mean value of a posterior Gaussian distribution on the encoding of the training data. Functional groups participating in the step-growth reaction are highlighted in red.

For direct comparison with the latent space structure from Molecule Chef, the Log P-structured latent space of SELFIES VAE was similarly decoded following a linearly interpolated path between high and low Log P values. Figure 6b shows that SELFIES VAE-generated CRUs similarly exhibit a gradual change from hydrophilic to hydrophobic character by removal of hydrophilic amine groups and addition of long alkyl chains as principal component 1 and principal component 2 decrease. Unlike results from Molecule Chef, SELFIES VAE generates novel CRUs by adding new atoms or bonds without synthetic consideration. This represents a major advantage of the Molecule Chef generative model introduced in concert with the OMG: our work allows for facile incorporation of synthetically accessible reactant combinations into the structure of generative models via known polymerization paths, providing improved synthesizability of CRUs compared to prior generative models. Moreover, SELFIES VAE does not always generate chemically valid CRUs even though the SELFIES representation was originally designed to always generate 100% chemically valid strings. This arises due to necessary modifications of the SELFIES representation for compatibility with the asterisk symbol in the OMG CRUs.70 Here, the requirement that a generated SELFIES representation contain exactly two asterisk symbols to specify a polymer repeating unit was often not satisfied. This modification resulted in a decrease in the chemical validity of generated CRUs (chemical validity of 45.07% from the Gaussian prior generation), but we anticipate that this decreased validity will be improved in future iterations by incorporating an asterisk symbol grammar into SELFIES. The quantitative analysis of the generative validity of Molecule Chef and SELFIES VAE for all training cases is provided in the Supporting Information (Figure S9).
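The two-asterisk requirement discussed above can be checked with a very small filter. The sketch below is illustrative only: the function names and the example strings are hypothetical, the strings are SMILES-like CRUs in which `*` marks a connection point, and the check covers only the asterisk count. A full chemical validity test would additionally require parsing each string with a cheminformatics toolkit.

```python
def has_two_connection_points(cru):
    """Return True if the CRU string contains exactly two '*' symbols."""
    return cru.count("*") == 2


def connection_point_validity(crus):
    """Fraction of CRU strings that satisfy the two-asterisk requirement."""
    if not crus:
        return 0.0
    return sum(has_two_connection_points(c) for c in crus) / len(crus)


# Hypothetical decoded strings, for illustration only.
decoded = [
    "*CC(*)c1ccccc1",        # two connection points
    "*CCO",                  # one connection point: rejected
    "*OCCOC(=O)CCC(=O)*",    # two connection points
    "*C(*)C*",               # three connection points: rejected
]

rate = connection_point_validity(decoded)
print(rate)
```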

Training of the Molecule Chef architecture on a broad class of polymerization reactions provides insight into the intrinsic functional potential of different polymerization schemes and their compatible reactant chemistries. Figure 7a,b shows the polymerization reaction distributions on the Log P-structured latent space of Molecule Chef trained on six step-growth polymerization reactions and on all of the OMG polymerization reactions (corresponding to Figure 5c,e), respectively. Points in latent space are color coded according to their specific polymerization reactions. The latent space of Molecule Chef trained on the six step-growth polymerization reactions indicates that polymers formed with ester linkages occupy a hydrophobic chemical space with high Log P values, whereas polymers containing hydrogen bonds in amide linkages occupy a hydrophilic chemical space with low Log P values (Figure 7a). The latent space of Molecule Chef trained on all OMG polymerization reactions shows that chain-growth polymers, such as those formed by vinyl or acetylene addition, occupy a chemical space of high Log P values (Figure 7b). These results suggest an explanation: chain-growth polymers have a relatively small number of backbone atoms in their CRUs and, because CRU Log P values are normalized by the number of backbone atoms, they occupy a chemical space of high Log P values. Step-growth polymers generally have low Log P values, likely arising from hydrogen bonds in amide linkages. Although these findings are not proposed as general conclusions across the entire functional potential of all polymer chemistries, our results provide useful insight into whether a broad set of polymerization reactions is required to access desired functional potentials in specific applications. Notably, if a single polymerization reaction possesses a chemical space coverage sufficient for a targeted polymer functionality, then the general retrosynthesis problem of target polymers can be dramatically simplified to the single polymerization reaction. These results are consistent with prior work on building-block synthetic approaches.71
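The effect of backbone normalization on the chain-growth versus step-growth comparison can be made concrete with a small arithmetic sketch. It assumes that the normalization is a simple division of the CRU Log P by the number of backbone atoms; the function name and all numerical values below are hypothetical and chosen only for illustration.

```python
def normalized_log_p(cru_log_p: float, n_backbone_atoms: int) -> float:
    """Divide a CRU Log P by its backbone atom count (assumed normalization)."""
    if n_backbone_atoms <= 0:
        raise ValueError("n_backbone_atoms must be positive")
    return cru_log_p / n_backbone_atoms


# Hypothetical CRUs with the same unnormalized Log P of 3.0.
vinyl_like = normalized_log_p(3.0, 2)         # two backbone atoms per CRU
step_growth_like = normalized_log_p(3.0, 12)  # twelve backbone atoms per CRU
print(vinyl_like, step_growth_like)  # 1.5 0.25
```

Under this assumption, a hydrophobic side group contributes far more to the normalized value of a short-backbone CRU than to that of a long-backbone CRU.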

Figure 7.

Distribution of polymerization reaction types mapped to the latent space of Molecule Chef trained on (a) six step-growth polymerization reactions and (b) all OMG polymerization reactions. Representative examples of the chemistries from different regions of latent space are shown.

We next assess the ability of our generative models trained on the OMG to access novel polymers with high Log P values by following gradient trajectories of the learned Log P values in latent space (Figure 8). Here, we generated polymers with high Log P values by exploring the latent space that describes relationships between the OMG reactant combinations and the Log P values of the corresponding CRUs. Similar to prior work,45 we explored the latent space in two different ways: (1) a gradient search that follows a trajectory of the learned Log P prediction function to identify latent space points with high polymer Log P values and (2) a random search that takes random walks on the latent space. We began with 500 random latent space points among the training OMG reactant combinations and performed 100 steps for the gradient search and 100 steps for the random search. Following the search process, novel polymers that did not overlap with the training polymers were identified and sorted by their predicted Log P values. The Log P values of the selected polymers were determined using RDKit, and the top 20 polymers with high Log P values were plotted. Figure 8a shows the kernel density estimates of the novel polymer Log P distributions from the gradient search and the random search. The gradient search identified more polymers with high Log P values than the random search, suggesting that Molecule Chef established meaningful polymer structure–property relationships. Overall, these results suggest that combining the synthesizability constraints imposed by the OMG with Molecule Chef is a powerful approach for the generative design of polymeric materials with retrosynthetic constraints and targeted property optimization.
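The difference between the two search strategies can be sketched with a toy example. The sketch below assumes a hypothetical four-dimensional latent space and replaces the learned Log P predictor with a simple quadratic function whose gradient is known analytically; the step size, noise scale, and all names are illustrative assumptions and do not reproduce the Molecule Chef implementation. Only the 500 starting points and the 100 steps per search follow the text.

```python
import random
from statistics import mean

DIM = 4                # hypothetical latent dimensionality
TARGET = [1.0] * DIM   # hypothetical location of the property maximum


def predict(z):
    """Toy stand-in for a learned property predictor (higher is better)."""
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, TARGET))


def gradient(z):
    """Analytic gradient of the toy predictor with respect to z."""
    return [-2.0 * (zi - ti) for zi, ti in zip(z, TARGET)]


def gradient_search(z, steps=100, lr=0.05):
    """Follow the predictor gradient uphill for a fixed number of steps."""
    for _ in range(steps):
        z = [zi + lr * gi for zi, gi in zip(z, gradient(z))]
    return z


def random_search(z, rng, steps=100, scale=0.05):
    """Take a Gaussian random walk for a fixed number of steps."""
    for _ in range(steps):
        z = [zi + rng.gauss(0.0, scale) for zi in z]
    return z


rng = random.Random(0)
# 500 starting points; here drawn from a standard normal as a placeholder
# for latent encodings of training reactant combinations.
starts = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(500)]

grad_scores = [predict(gradient_search(z)) for z in starts]
rand_scores = [predict(random_search(z, rng)) for z in starts]
print(mean(grad_scores), mean(rand_scores))
```

In the actual workflow, the end points of each trajectory would be decoded into reactant combinations, filtered for novelty against the training set, and rescored; those steps are omitted here.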

Figure 8.

Generating novel polymers using the Molecule Chef trained on six step-growth polymerization reactions. (a) Kernel density estimated Log P distributions of generated novel polymers toward high Log P. (b) Examples of generated OMG monomer combinations from the gradient search on latent space. (c) Reaxys reaction condition recommendations (“n.d.” denotes “not described”).

To further enhance the synthetic feasibility of polymer chemistries suggested using the integrated OMG and Molecule Chef approach, we utilized the Reaxys database to recommend polymerization reaction conditions for the generated OMG polymers from the gradient search. Reaxys59 is a reaction database extracted from published literature that provides detailed chemical reaction conditions such as solvent, catalyst, and temperature. Figure 8b shows novel generated OMG monomer reactant combinations from the gradient search, and Figure 8c shows the Reaxys polymerization reactions that are similar to the generated OMG polymerizations. The similarity between an OMG polymerization and a Reaxys polymerization was determined using the Tanimoto similarity between the OMG monomer reactants and the Reaxys monomer reactants, similar to prior work.36 We utilized Morgan fingerprints (nBits = 1024, radius = 2) of each reactant to determine the Tanimoto similarity and computed the harmonic mean of the reactant Tanimoto similarity scores to obtain a polymer similarity.36 Moving forward, increasingly accurate recommendations of polymerization reaction conditions will be feasible as the number of polymerization reactions in Reaxys increases.
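The similarity calculation can be sketched as follows. To keep the example self-contained, each fingerprint is represented as a set of on-bit indices rather than computed from a structure; in practice these sets would come from Morgan fingerprints (nBits = 1024, radius = 2) as stated above. Pairing reactants by position is an assumption of this sketch, and the function names and bit sets are hypothetical.

```python
from statistics import harmonic_mean


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return len(bits_a & bits_b) / union


def polymer_similarity(omg_reactants, reaxys_reactants) -> float:
    """Harmonic mean of reactant-wise Tanimoto similarities.

    Reactants are paired by position, which is an assumption of this sketch.
    """
    scores = [tanimoto(a, b) for a, b in zip(omg_reactants, reaxys_reactants)]
    return harmonic_mean(scores)


# Hypothetical on-bit sets for two reactants per polymerization.
omg = [{1, 2, 3, 4}, {10, 11, 12}]
reaxys = [{1, 2, 3, 5}, {10, 11, 12}]
print(polymer_similarity(omg, reaxys))  # approximately 0.75 (scores 0.6 and 1.0)
```

The harmonic mean penalizes a poor match on either reactant more strongly than an arithmetic mean would, so a high polymer similarity requires that both reactants resemble their Reaxys counterparts.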

Discussion

This work reports the first demonstration of incorporating a robust set of synthetically aware criteria into generative polymer modeling, directly addressing the key bottlenecks for enabling experimental and synthetic viability of generative polymer design. Importantly, rather than suggesting an arbitrary CRU chemistry with potentially low synthetic accessibility, our framework provides (i) commercially available monomer reactants, (ii) known synthetic pathways, and (iii) hypothetical polymerization reaction conditions via Reaxys necessary to begin a synthetic campaign. In principle, the framework introduced here can be employed with any set of curated polymer data for which targeted property design is required, and we envision the OMG being integrated into design efforts employing small datasets of experimentally labeled property data via transfer learning paradigms that have been successful in other applications of generative modeling.69,72

The OMG represents a significant step in the improvement of the synthetic realizability of generative ML models for polymer design. At present, synthetic realizability is induced by having OMG chemistries (i) downselected from common and commercially available reactants, thereby excluding exotic chemistries with low synthetic potential, (ii) further downselected using the SCScore that assesses the practicality in known reaction templates, (iii) integrated with known polymerization pathways directly into the database and generative model architecture, and (iv) connected with Reaxys suggested reaction conditions. However, this set of criteria does not guarantee synthesizability. In particular, the OMG assumes that polymerization reactions occur when monomer reactants have the requisite functional groups satisfying one of the OMG polymerization reactions. This strong assumption does not consider any potential side reactions (e.g., the possibly nucleophilic hydroxyl in the top left of Figure 6a), solubilities, steric hindrance of reactants, or lack of sufficient nucleophilic or electrophilic character that may hinder polymerization. Similarly, the initial implementation of the OMG does not accept monomer reactants with multiple functional groups in order to simplify polymerization reaction preferences; this constraint can be relaxed in future iterations examining multifunctional reactants. Although these constraints naturally limit the accessible chemical space, they represent a significant step forward in the development of practically useful generative models compatible with synthetic intuition.

Given the broad chemical diversity spanned by small sets of reaction types (Figure 7), we envision future practical material discovery campaigns selecting one or two OMG polymerization reactions and coupling the generative polymer modeling with high-throughput polymerization and characterization strategies to experimentally validate the generated polymers. Such experimental efforts have been recently demonstrated by IBM’s high-throughput continuous-flow polymerizations73–76 and experimental data management platforms.77 Experimental validation will generate a database of polymer properties that can be included in the OMG database, which enables further practical generative polymer design. Moreover, our approach allows for the potential to incorporate OMG chemistries in a data structure compatible with the strong heterogeneity of polymer data, for example, CRIPT. The integration of these emerging classes of tools for polymer databasing and design has the potential to push the field of polymer discovery into a new era of progress defined by openness and unprecedented chemical diversity.

Despite the advantages of the OMG framework, the initial implementation also possesses a few limitations that naturally arise from the strong heterogeneity and diversity of polymeric materials. First, the OMG only considers linear homopolymers and is agnostic to stereo- and regiochemistry. Although stereochemistry and regiochemistry are essential for the description of polymer physical properties, the current effort of the OMG development focuses on the design of a polymer’s chemical space, and future efforts will extend the OMG to include stereochemical and regiochemical awareness in its database and generative models. Second, the OMG does not include information regarding various microstructures. The ability to predict the microstructure from the CRU represents a grand challenge in polymer science and consequently is a direction of future research that will benefit from strong interactions with experimental and computational polymer property characterizations. From a database perspective, the distinct challenge of representing the broad diversity of microstructures is expected to be overcome through the efforts of the community to represent various microstructures.60 However, ignoring microstructural information is a necessary simplification of the possible design space in the first iteration of the OMG. Third, the OMG can trivially benefit from expansion to broader classes of polymerization algorithms and known reactant chemistries that go beyond eMolecules and the initial 17 polymerization reactions utilized in this work. Moving forward, these deficiencies will be overcome through continuous OMG database modification. To encourage the experimentally relevant adaptation of the OMG to the needs of experimental polymer chemists, we note that the entirety of the OMG, the scripts for its construction, and the generative models used in this work are available at https://github.com/TheJacksonLab/OpenMacromolecularGenome and https://doi.org/10.5281/zenodo.7556942.

Conclusions

In this work, we report a new polymer database known as the OMG that is constructed via synthetically feasible polymerization reactions. We further demonstrate polymer generative modeling using the OMG database. The OMG database represents a synthesizable polymer design space and specifies polymerization pathways with purchasable monomer reactants downselected from eMolecules to maximize synthetic feasibility. PCA indicates that the OMG CRUs obtained by applying 17 OMG polymerization reactions to OMG monomer reactants have a significant chemical overlap with the largest known experimentally synthesized PolyInfo polymers. Overall, these results suggest that the OMG database can be applied to polymer generative modeling with Molecule Chef, optimizing polymer functionality under guaranteed synthetic pathways. As a proof-of-principle demonstration, the Molecule Chef trained on the OMG database achieved an R² score of Log P prediction > 0.90 for the suggested OMG polymers. Using this approach, polymers with high Log P values were generated with associated monomer reactants and corresponding polymerization reactions by following the gradient trajectories of the learned Log P predictions on the Molecule Chef latent space. In addition to specified polymerization reactions, it was demonstrated that detailed polymerization reaction conditions such as temperature, catalysts, and solvents can be recommended using Reaxys polymerization data. Overall, the OMG database will advance generative polymer modeling by allowing for polymer design space exploration using synthetically accessible polymerization reactions.

Acknowledgments

This work was supported by the IBM-Illinois Discovery Accelerator Institute. N.E.J. thanks the 3M Nontenured Faculty Award for support of this research. We thank Jed Pitera and Jeff Moore for critical readings of the manuscript and Prof. Tengfei Luo for assistance with the PI1M dataset.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acspolymersau.3c00003.

  • OMG reactant composition, schematic diagram for the OMG reactant downselection, OMG CRU composition, linear dependence of CRU Log P, hyperparameters of generative models, training data compositions, explained variance of the PolyInfo reactants and the PolyInfo CRUs, training results of generative models, and Gaussian prior generation of generative models (PDF)

The authors declare no competing financial interest.

Notes

Copyright 2022 Elsevier Limited except certain content provided by third parties. Reaxys is a trademark of Elsevier Limited.

Supplementary Material

lg3c00003_si_001.pdf (1.1MB, pdf)

References

  1. Schmid F. Understanding and Modeling Polymers: The Challenge of Multiple Scales. ACS Polym. Au 2023, 3, 28–58. 10.1021/acspolymersau.2c00049.
  2. Peerless J. S.; Milliken N. J. B.; Oweida T. J.; Manning M. D.; Yingling Y. G. Soft Matter Informatics: Current Progress and Challenges. Adv. Theory Simul. 2019, 2, 1800129. 10.1002/adts.201800129.
  3. Gartner T. E. I.; Jayaraman A. Modeling and Simulations of Polymers: A Roadmap. Macromolecules 2019, 52, 755–786. 10.1021/acs.macromol.8b01836.
  4. Behbahani A. F.; Schneider L.; Rissanou A.; Chazirakis A.; Bačová P.; Jana P. K.; Li W.; Doxastakis M.; Polińska P.; Burkhart C.; Müller M.; Harmandaris V. A. Dynamics and Rheology of Polymer Melts via Hierarchical Atomistic, Coarse-Grained, and Slip-Spring Simulations. Macromolecules 2021, 54, 2740–2762. 10.1021/acs.macromol.0c02583.
  5. Zhai S.; Zhang P.; Xian Y.; Zeng J.; Shi B. Effective Thermal Conductivity of Polymer Composites: Theoretical Models and Simulation Models. Int. J. Heat Mass Transfer 2018, 117, 358–374. 10.1016/j.ijheatmasstransfer.2017.09.067.
  6. Gemünden P.; Poelking C.; Kremer K.; Daoulas K.; Andrienko D. Effect of Mesoscale Ordering on the Density of States of Polymeric Semiconductors. Macromol. Rapid Commun. 2015, 36, 1047–1053. 10.1002/marc.201400725.
  7. Dhamankar S.; Webb M. A. Chemically Specific Coarse-Graining of Polymers: Methods and Prospects. J. Polym. Sci. 2021, 59, 2613–2643. 10.1002/pol.20210555.
  8. Jackson N. E. Coarse-Graining Organic Semiconductors: The Path to Multiscale Design. J. Phys. Chem. B 2021, 125, 485–496. 10.1021/acs.jpcb.0c09749.
  9. Gormley A. J.; Webb M. A. Machine Learning in Combinatorial Polymer Chemistry. Nat. Rev. Mater. 2021, 6, 642–644. 10.1038/s41578-021-00282-3.
  10. Sattari K.; Xie Y.; Lin J. Data-Driven Algorithms for Inverse Design of Polymers. Soft Matter 2021, 17, 7607–7622. 10.1039/D1SM00725D.
  11. Webb M. A.; Jackson N. E.; Gil P. S.; de Pablo J. J. Targeted Sequence Design within the Coarse-Grained Polymer Genome. Sci. Adv. 2020, 6, eabc6216. 10.1126/sciadv.abc6216.
  12. Patra T. K. Data-Driven Methods for Accelerating Polymer Design. ACS Polym. Au 2022, 2, 8–26. 10.1021/acspolymersau.1c00035.
  13. Cencer M. M.; Moore J. S.; Assary R. S. Machine Learning for Polymeric Materials: An Introduction. Polym. Int. 2022, 71, 537–542. 10.1002/pi.6345.
  14. Schneider L.; Schwarting M.; Mysona J.; Liang H.; Han M.; Rauscher P. M.; Ting J. M.; Venkatram S.; Ross R. B.; Schmidt K. J.; Blaiszik B.; Foster I.; de Pablo J. J. In silico active learning for small molecule properties. Mol. Syst. Des. Eng. 2022, 7, 1611–1621. 10.1039/D2ME00137C.
  15. Ferguson A. L.; Brown K. A. Data-Driven Design and Autonomous Experimentation in Soft and Biological Materials Engineering. Annu. Rev. Chem. Biomol. Eng. 2022, 13, 25–44. 10.1146/annurev-chembioeng-092120-020803.
  16. Sanchez-Lengeling B.; Aspuru-Guzik A. Inverse Molecular Design Using Machine Learning: Generative Models for Matter Engineering. Science 2018, 361, 360–365. 10.1126/science.aat2663.
  17. Gómez-Bombarelli R.; Wei J. N.; Duvenaud D.; Hernández-Lobato J. M.; Sánchez-Lengeling B.; Sheberla D.; Aguilera-Iparraguirre J.; Hirzel T. D.; Adams R. P.; Aspuru-Guzik A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018, 4, 268–276. 10.1021/acscentsci.7b00572.
  18. Griffiths R.-R.; Hernández-Lobato J. M. Constrained Bayesian Optimization for Automatic Chemical Design Using Variational Autoencoders. Chem. Sci. 2020, 11, 577–586. 10.1039/C9SC04026A.
  19. Lim J.; Ryu S.; Kim J. W.; Kim W. Y. Molecular Generative Model Based on Conditional Variational Autoencoder for de Novo Molecular Design. J. Cheminf. 2018, 10, 31. 10.1186/s13321-018-0286-7.
  20. Sanchez-Lengeling B.; Outeiral C.; Guimaraes G. L.; Aspuru-Guzik A. Optimizing Distributions over Molecular Space. An Objective-Reinforced Generative Adversarial Network for Inverse-Design Chemistry (ORGANIC). ChemRxiv 2017, 10.26434/chemrxiv.5309668.v3.
  21. Popova M.; Isayev O.; Tropsha A. Deep Reinforcement Learning for de Novo Drug Design. Sci. Adv. 2018, 4, eaap7885. 10.1126/sciadv.aap7885.
  22. Zhou Z.; Kearnes S.; Li L.; Zare R. N.; Riley P. Optimization of Molecules via Deep Reinforcement Learning. Sci. Rep. 2019, 9, 10752. 10.1038/s41598-019-47148-x.
  23. Kadurin A.; Nikolenko S.; Khrabrov K.; Aliper A.; Zhavoronkov A. DruGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico. Mol. Pharmaceutics 2017, 14, 3098–3104. 10.1021/acs.molpharmaceut.7b00346.
  24. Polykovskiy D.; Zhebrak A.; Vetrov D.; Ivanenkov Y.; Aladinskiy V.; Mamoshina P.; Bozdaganyan M.; Aliper A.; Zhavoronkov A.; Kadurin A. Entangled Conditional Adversarial Autoencoder for de Novo Drug Discovery. Mol. Pharmaceutics 2018, 15, 4398–4405. 10.1021/acs.molpharmaceut.8b00839.
  25. Hong S. H.; Ryu S.; Lim J.; Kim W. Y. Molecular Generative Model Based on an Adversarially Regularized Autoencoder. J. Chem. Inf. Model. 2020, 60, 29–36. 10.1021/acs.jcim.9b00694.
  26. Batra R.; Dai H.; Huan T. D.; Chen L.; Kim C.; Gutekunst W. R.; Song L.; Ramprasad R. Polymers for Extreme Conditions Designed Using Syntax-Directed Variational Autoencoders. Chem. Mater. 2020, 32, 10489–10500. 10.1021/acs.chemmater.0c03332.
  27. Jørgensen P. B.; Mesta M.; Shil S.; García Lastra J. M.; Jacobsen K. W.; Thygesen K. S.; Schmidt M. N. Machine Learning-Based Screening of Complex Molecules for Polymer Solar Cells. J. Chem. Phys. 2018, 148, 241735. 10.1063/1.5023563.
  28. Jin W.; Barzilay R.; Jaakkola T. Hierarchical Generation of Molecular Graphs Using Structural Motifs. 2020, arXiv:2002.03230 April 18.
  29. Patel R. A.; Borca C. H.; Webb M. A. Featurization Strategies for Polymer Sequence or Composition Design by Machine Learning. Mol. Syst. Des. Eng. 2022, 7, 661–676. 10.1039/D1ME00160D.
  30. Ramakrishnan R.; Dral P. O.; Rupp M.; von Lilienfeld O. A. Quantum Chemistry Structures and Properties of 134 Kilo Molecules. Sci. Data 2014, 1, 140022. 10.1038/sdata.2014.22.
  31. Sterling T.; Irwin J. J. ZINC 15 – Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324–2337. 10.1021/acs.jcim.5b00559.
  32. Ma R.; Luo T. PI1M: A Benchmark Database for Polymer Informatics. J. Chem. Inf. Model. 2020, 60, 4684–4690. 10.1021/acs.jcim.0c00726.
  33. Otsuka S.; Kuwajima I.; Hosoya J.; Xu Y.; Yamazaki M. PoLyInfo: Polymer Database for Polymeric Materials Design. In 2011 International Conference on Emerging Intelligent Data and Web Technologies, 2011; pp 22–29.
  34. Kim C.; Chandrasekaran A.; Huan T. D.; Das D.; Ramprasad R. Polymer Genome: A Data-Powered Polymer Informatics Platform for Property Predictions. J. Phys. Chem. C 2018, 122, 17575–17585. 10.1021/acs.jpcc.8b02913.
  35. Walsh D. J.; Zou W.; Schneider L.; Mello R.; Deagen M. E.; Mysona J.; Lin T.-S.; de Pablo J. J.; Jensen K. F.; Audus D. J.; Olsen B. D. Community Resource for Innovation in Polymer Technology (CRIPT): A Scalable Polymer Material Data Structure. ACS Cent. Sci. 2023, 9, 330–338. 10.1021/acscentsci.3c00011.
  36. Chen L.; Kern J.; Lightstone J. P.; Ramprasad R. Data-Assisted Polymer Retrosynthesis Planning. Appl. Phys. Rev. 2021, 8, 031405. 10.1063/5.0052962.
  37. Ertl P.; Schuffenhauer A. Estimation of Synthetic Accessibility Score of Drug-like Molecules Based on Molecular Complexity and Fragment Contributions. J. Cheminf. 2009, 1, 8. 10.1186/1758-2946-1-8.
  38. Coley C. W.; Green W. H.; Jensen K. F. Machine Learning in Computer-Aided Synthesis Planning. Acc. Chem. Res. 2018, 51, 1281–1289. 10.1021/acs.accounts.8b00087.
  39. Gao W.; Coley C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020, 60, 5714–5723. 10.1021/acs.jcim.0c00174.
  40. Coley C. W.; Rogers L.; Green W. H.; Jensen K. F. Computer-Assisted Retrosynthesis Based on Molecular Similarity. ACS Cent. Sci. 2017, 3, 1237–1245. 10.1021/acscentsci.7b00355.
  41. Segler M. H. S.; Preuss M.; Waller M. P. Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018, 555, 604–610. 10.1038/nature25978.
  42. Segler M. H. S.; Waller M. P. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chem.—Eur. J. 2017, 23, 5966–5971. 10.1002/chem.201605499.
  43. Schwaller P.; Petraglia R.; Zullo V.; Nair V. H.; Haeuselmann R. A.; Pisoni R.; Bekas C.; Iuliano A.; Laino T. Predicting Retrosynthetic Pathways Using Transformer-Based Models and a Hyper-Graph Exploration Strategy. Chem. Sci. 2020, 11, 3316–3325. 10.1039/C9SC05704H.
  44. MALDI Recipes. https://maldi.nist.gov/ (accessed Feb 12, 2022).
  45. Bradshaw J.; Paige B.; Kusner M. J.; Segler M. H. S.; Hernández-Lobato J. M. A Model to Search for Synthesizable Molecules. 2019, arXiv:1906.05221.
  46. Wildman S. A.; Crippen G. M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci. 1999, 39, 868–873. 10.1021/ci990307l.
  47. Lipinski C. A.; Lombardo F.; Dominy B. W.; Feeney P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Delivery Rev. 1997, 23, 3–25. 10.1016/S0169-409X(96)00423-1.
  48. Chen M.; Borlak J.; Tong W. High Lipophilicity and High Daily Dose of Oral Medications Are Associated with Significant Risk for Drug-Induced Liver Injury. Hepatology 2013, 58, 388–396. 10.1002/hep.26208.
  49. Jenkins A. D.; Kratochvíl P.; Stepto R. F. T.; Suter U. W. Glossary of basic terms in polymer science (IUPAC Recommendations 1996). Pure Appl. Chem. 1996, 68, 2287–2311. 10.1351/pac199668122287.
  50. Hiemenz P. C.; Lodge T. P. Polymer Chemistry, 2nd ed.; CRC Press: Boca Raton, 2007.
  51. Nuyken O.; Pask S. D. Ring-Opening Polymerization—An Introductory Review. Polymers 2013, 5, 361–403. 10.3390/polym5020361.
  52. Mol J. C. Industrial Applications of Olefin Metathesis. J. Mol. Catal. A: Chem. 2004, 213, 39–45. 10.1016/j.molcata.2003.10.049.
  53. Seyferth D. The Grignard Reagents. Organometallics 2009, 28, 1598–1605. 10.1021/om900088z.
  54. eMolecules. https://www.emolecules.com/ (accessed Feb 12, 2022).
  55. Martin R. L.; Simon C. M.; Smit B.; Haranczyk M. In silico Design of Porous Polymer Networks: High-Throughput Screening for Methane Storage Materials. J. Am. Chem. Soc. 2014, 136, 5006–5022. 10.1021/ja4123939.
  56. Weininger D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. 10.1021/ci00057a005.
  57. RDKit. https://www.rdkit.org/docs/GettingStartedInPython.html (accessed Feb 12, 2022).
  58. Coley C. W.; Rogers L.; Green W. H.; Jensen K. F. SCScore: Synthetic Complexity Learned from a Reaction Corpus. J. Chem. Inf. Model. 2018, 58, 252–261. 10.1021/acs.jcim.7b00622.
  59. Reaxys. https://www.reaxys.com/#/search/quick (accessed Feb 12, 2022); the reaction ID for reaction 1 in Figure 8 is 5068939, and that for reaction 2 in Figure 8 is 11002507.
  60. Lin T.-S.; Coley C. W.; Mochigase H.; Beech H. K.; Wang W.; Wang Z.; Woods E.; Craig S. L.; Johnson J. A.; Kalow J. A.; Jensen K. F.; Olsen B. D. BigSMILES: A Structurally-Based Line Notation for Describing Macromolecules. ACS Cent. Sci. 2019, 5, 1523–1531. 10.1021/acscentsci.9b00476.
  61. Gurnani R.; Kamal D.; Tran H.; Sahu H.; Scharm K.; Ashraf U.; Ramprasad R. PolyG2G: A Novel Machine Learning Algorithm Applied to the Generative Design of Polymer Dielectrics. Chem. Mater. 2021, 33, 7008–7016. 10.1021/acs.chemmater.1c02061.
  62. Mannodi-Kanakkithodi A.; Pilania G.; Huan T. D.; Lookman T.; Ramprasad R. Machine Learning Strategy for Accelerated Design of Polymer Dielectrics. Sci. Rep. 2016, 6, 20952. 10.1038/srep20952.
  63. Kingma D. P.; Welling M. Auto-Encoding Variational Bayes. 2022, arXiv:1312.6114 December 10.
  64. Hayward S.; de Groot B. L. Normal Modes and Essential Dynamics. In Molecular Modeling of Proteins; Kukol A., Ed.; Methods in Molecular Biology; Humana Press: Totowa, NJ, 2008; pp 89–106.
  65. Akiba T.; Sano S.; Yanase T.; Ohta T.; Koyama M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; KDD ’19; Association for Computing Machinery: New York, NY, USA, 2019; pp 2623–2631.
  66. Dask. https://www.dask.org/ (accessed Feb 12, 2022).
  67. Krenn M.; Häse F.; Nigam A.; Friederich P.; Aspuru-Guzik A. Self-Referencing Embedded Strings (SELFIES): A 100% Robust Molecular String Representation. Mach. Learn.: Sci. Technol. 2020, 1, 045024. 10.1088/2632-2153/aba947.
  68. Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t.
  69. Iovanac N. C.; Savoie B. M. Simpler Is Better: How Linear Prediction Tasks Improve Transfer Learning in Chemical Autoencoders. J. Phys. Chem. A 2020, 124, 3679–3685. 10.1021/acs.jpca.0c00042.
  70. SELFIES Github. https://github.com/aspuru-guzik-group/selfies (accessed Feb 12, 2022).
  71. Angello N. H.; Rathore V.; Beker W.; Wołos A.; Jira E. R.; Roszak R.; Wu T. C.; Schroeder C. M.; Aspuru-Guzik A.; Grzybowski B. A.; Burke M. D. Closed-Loop Optimization of General Reaction Conditions for Heteroaryl Suzuki-Miyaura Coupling. Science 2022, 378, 399–405. 10.1126/science.adc8743.
  72. Iovanac N. C.; Savoie B. M. Improved Chemical Prediction from Scarce Data Sets via Latent Space Enrichment. J. Phys. Chem. A 2019, 123, 4295–4302. 10.1021/acs.jpca.9b01398.
  73. Lin B.; Hedrick J. L.; Park N. H.; Waymouth R. M. Programmable High-Throughput Platform for the Rapid and Scalable Synthesis of Polyester and Polycarbonate Libraries. J. Am. Chem. Soc. 2019, 141, 8921–8927. 10.1021/jacs.9b02450.
  74. Lin B.; Jadrich C. N.; Pane V. E.; Arrechea P. L.; Erdmann T.; Dausse C.; Hedrick J. L.; Park N. H.; Waymouth R. M. Ultrafast and Controlled Ring-Opening Polymerization with Sterically Hindered Strong Bases. Macromolecules 2020, 53, 9000–9007. 10.1021/acs.macromol.0c01571.
  75. Jadrich C. N.; Pane V. E.; Lin B.; Jones G. O.; Hedrick J. L.; Park N. H.; Waymouth R. M. A Cation-Dependent Dual Activation Motif for Anionic Ring-Opening Polymerization of Cyclic Esters. J. Am. Chem. Soc. 2022, 144, 8439–8443. 10.1021/jacs.2c01436.
  76. Lopez de Pariza X.; Erdmann T.; Arrechea P. L.; Perez L.; Dausse C.; Park N. H.; Hedrick J. L.; Sardon H. Synthesis of Tailored Segmented Polyurethanes Utilizing Continuous-Flow Reactors and Real-Time Process Monitoring. Chem. Mater. 2021, 33, 7986–7993. 10.1021/acs.chemmater.1c01919.
  77. Park N.; Manica M.; Born J.; Hedrick J.; Erdmann T.; Zubarev D.; Mill N.; Arrechea P. An Extensible Platform for Enabling Artificial Intelligence Guided Design of Catalysts and Materials. ChemRxiv 2023, 10.26434/chemrxiv-2022-811rl-v2.

Articles from ACS Polymers Au are provided here courtesy of American Chemical Society
