Abstract

Innovative approaches to design molecules with tailored properties are required in various research areas. Deep learning methods can accelerate the discovery of new materials by leveraging molecular structure–property relationships. In this study, we successfully developed a generative deep learning (Gen-DL) model that was trained on a large experimental database (DBexp) including 71,424 molecule/solvent pairs and was able to design molecules with target properties in various solvents. The Gen-DL model can generate molecules with specified optical properties, such as electronic absorption/emission peak position and bandwidth, extinction coefficient, photoluminescence (PL) quantum yield, and PL lifetime. The Gen-DL model was shown to leverage the essential design principles of conjugation effects, Stokes shifts, and solvent effects when it generated molecules with target optical properties. Additionally, the Gen-DL model was demonstrated to generate practically useful molecules developed for real-world applications. Accordingly, the Gen-DL model can be a promising tool for the discovery and design of novel molecules with tailored properties in various research areas, such as organic photovoltaics (OPVs), organic light-emitting diodes (OLEDs), organic photodiodes (OPDs), bioimaging dyes, and so on.
Short abstract
A generative deep learning model (DeepMoleculeGen) was developed to generate optimal molecules from three inputs: the target properties, an initial backbone, and the solvent.
I. Introduction
The development of new molecules with tailored properties in chemistry and materials science has relied largely on expert knowledge and trial-and-error methods.1,2 However, this approach is limited by the complex nature of molecules and the vast size of chemical space.3 Consequently, deep learning (DL) methods have emerged as a promising tool to explore chemical space effectively and to develop molecules with desired properties in an efficient and target-oriented manner.4−9 One approach to finding molecules with desired properties is to virtually generate a large number of molecules using various scaffolds, then use predictive DL models to predict their properties and select the optimal molecule based on the predicted properties.3,4 However, this approach still relies on human expertise in scaffold selection and may not fully exploit chemical space to develop molecules with optimal properties.
In contrast, generative DL models can overcome the limitations of human expertise and design molecules with target properties based on the underlying molecular structure–property relationship.10−14 Various generative DL models have been developed based on variational autoencoders (VAEs),15−17 generative adversarial networks (GANs),18,19 and recurrent neural networks (RNNs).20−24 Unlike predictive DL models, generative DL models allow new molecules to be designed without relying on prior knowledge, greatly expanding the scope of potential materials and opening up new avenues of exploration in various research areas.25−27 In addition, generative DL models have proven successful in designing new molecules with certain optimal properties such as drug-likeness, solubility, and synthetic accessibility.5,28,29 To date, generative DL models have been developed mostly for properties related to drug discovery, such as molecular weight, solubility, and quantitative estimate of drug-likeness (QED). There is also a need for generative DL models that can design fluorophores and chromophores used in organic light-emitting diodes (OLEDs), organic photovoltaics (OPVs), organic photodiodes (OPDs), and bioimaging. However, such models are challenging to develop because the experimental databases available for training are limited in both the number and the diversity of molecular structures.
In this study, we developed a generative DL (Gen-DL) model based on a large experimental database of optical properties of organic molecules. The experimental database contains a variety of organic molecules (with absorption and emission spectra ranging from UV to near-IR) and their optical properties in solutions: first absorption peak position (λabs) and bandwidth (σabs), extinction coefficient (ε), emission peak position (λemi) and bandwidth (σemi), photoluminescence quantum yield (PLQY, Φ), and photoluminescence lifetime (τ). Accordingly, the Gen-DL model can efficiently generate optimal organic molecules when the target optical properties and the solvent are given as input. The solvent is given as an additional input because the optical properties of molecules are substantially influenced by the surrounding solvent molecules. The Gen-DL model was shown to generate molecules with target optical properties by leveraging essential design principles such as conjugation effects, Stokes shifts, and solvent effects. In addition, we demonstrated that the Gen-DL model generated practically useful molecules that have been used as fluorophores for OLEDs,30 near-IR imaging dyes,31 fluorescent dyes,32 and photovoltaic materials.33
II. Results and Discussion
II-A. Experimental Database for DL Models
The electronic absorption properties (first absorption peak position, λabs; absorption bandwidth, σabs; extinction coefficient, ε) and emission properties (emission peak position, λemi; emission bandwidth, σemi; PLQY, Φ; PL lifetime, τ) can be readily characterized from experimental data (UV–visible absorption and PL spectra, and time-resolved fluorescence signal). We constructed an experimental database (DBexp) by collecting the aforementioned seven optical properties of organic molecules from research articles, comprising 71,424 molecule/solvent pairs, expanded from our previous study.7,8 To train and validate both Pred-DL and Gen-DL models, three different data sets (DBPred-DL, DBGen-DL, and DBTest) were used as shown in Figure S1. The experimental database was split into a training data set (DBPred-DL) and a test data set (DBTest) in a 9:1 ratio based on molecular structures. DBPred-DL consisted of 62,629 molecule/solvent pairs with 23,997 unique organic molecules and was used to retrain the Pred-DL model.7 DBGen-DL included only molecules in solutions and host films by excluding the molecules in solid states and gas phases from DBPred-DL. Additionally, the missing data points in DBGen-DL were filled using the values predicted by the Pred-DL model. Accordingly, DBGen-DL consisted of 56,579 molecule/solvent pairs containing 22,819 unique organic molecules in various solvents. DBTest was used as the test data set for the Gen-DL model.
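A split "based on molecular structures" means that every solvent entry of a given molecule falls on the same side of the partition. The following is a minimal sketch of such a split, assuming that each record is a (SMILES, solvent) tuple with already-canonical SMILES; the function name, the seed, and the toy records are illustrative and are not taken from the authors' code.

```python
import random
from collections import defaultdict


def split_by_molecule(pairs, test_fraction=0.1, seed=0):
    """Split (smiles, solvent) records so that no molecule spans both sets."""
    by_molecule = defaultdict(list)
    for smiles, solvent in pairs:
        by_molecule[smiles].append((smiles, solvent))

    molecules = sorted(by_molecule)           # fixed order before shuffling
    random.Random(seed).shuffle(molecules)

    n_test = max(1, round(len(molecules) * test_fraction))
    test_molecules = set(molecules[:n_test])

    train, test = [], []
    for smiles in molecules:
        target = test if smiles in test_molecules else train
        target.extend(by_molecule[smiles])
    return train, test


# Toy data: 20 placeholder molecules, each measured in two solvents.
toy_pairs = [(f"MOL{i}", s) for i in range(20) for s in ("toluene", "water")]
train_pairs, test_pairs = split_by_molecule(toy_pairs)
```

Because the split is made over molecules rather than over molecule/solvent pairs, the pair counts of the two sets need not follow the 9:1 ratio exactly.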
II-B. Generative DL Model
In this study, we developed a Gen-DL model to generate organic molecules with target optical properties in a given solvent (see Figures S5 and S6 for more detailed information). The Gen-DL model takes the target optical properties (λabs, σabs, log ε, λemi, σemi, Φ, τ) and the solvent as input, and then generates molecules step by step by stochastically selecting actions (i.e., addition, connection, and termination), finally producing a molecular structure with the target optical properties. The three actions are (i) addition of an atom, represented by a vector (atomic number, formal charge, and number of explicit hydrogen atoms), through an appropriate bond (single, double, triple, or aromatic), (ii) connection of two existing atoms by a new bond, and (iii) termination of the generation process.
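The stochastic selection of an action can be illustrated with a short sketch. It assumes that the model has already produced one probability per action category; the probabilities below are hypothetical, and the actual model presumably scores many more candidates (one per atom type, bond type, and attachment site).

```python
import random

ACTIONS = ("addition", "connection", "termination")


def sample_action(probabilities, rng):
    """Draw one action index according to the given probabilities."""
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


rng = random.Random(42)
probs = [0.7, 0.2, 0.1]    # hypothetical model output for one generation step
counts = {name: 0 for name in ACTIONS}
for _ in range(10_000):
    counts[ACTIONS[sample_action(probs, rng)]] += 1
```

Sampling from the distribution, rather than always taking the most probable action, is what lets less likely actions be chosen from time to time.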
To train the Gen-DL model, the molecular generation sequences of the input molecules in DBGen-DL are first generated from scratch by a stochastic depth-first search algorithm.34 In the molecular generation process, hydrogen atoms are implicitly included in each molecule. Second, the model is trained on every step in each sequence by calculating the probabilities of the next possible actions (addition, connection, and termination). As an example, Figure 1 shows how the Gen-DL model is trained using nitrobenzene as an input molecule. The stochastic depth-first search algorithm generates the molecular generation sequence of nitrobenzene from scratch, showing how nitrobenzene can be generated step by step (Figure 1a). Figure 1b shows that the Gen-DL model learns how to generate the (k + 1)th molecular structure from the kth molecular structure in the sequence by calculating the probability of the next action (the addition of a nitrogen cation at the 6 position) under the given condition of solvent (CH2Cl2) and optical properties. Therefore, during the training process, the Gen-DL model learns how to generate appropriate molecular structures with target properties in a given solvent (see the Supporting Information (SI) for details).
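One plausible reading of this decomposition is a depth-first traversal of the molecular graph with randomly ordered neighbors, in which a newly reached atom yields an addition action and a bond back to an already placed atom (a ring closure) yields a connection action. The sketch below follows that reading for the heavy atoms of nitrobenzene. The tuple layout of the actions and the atom labels are assumptions made for illustration, and the algorithm of ref 34 may differ in detail.

```python
import random


def generation_sequence(atoms, bonds, rng):
    """Return build actions that reconstruct a molecular graph.

    atoms: dict mapping atom index -> atom label
    bonds: dict mapping frozenset({i, j}) -> bond type
    Actions: ("add", index, label, attached_to, bond),
             ("connect", i, j, bond), and ("terminate",).
    """
    neighbors = {i: [] for i in atoms}
    for pair in bonds:
        i, j = sorted(pair)
        neighbors[i].append(j)
        neighbors[j].append(i)

    start = rng.choice(sorted(atoms))
    actions = [("add", start, atoms[start], None, None)]
    placed = {start}
    made = set()                       # bonds already emitted

    def visit(current):
        order = neighbors[current][:]
        rng.shuffle(order)             # stochastic neighbor order
        for nxt in order:
            pair = frozenset((current, nxt))
            if pair in made:
                continue
            made.add(pair)
            if nxt in placed:          # ring closure between existing atoms
                actions.append(("connect", current, nxt, bonds[pair]))
            else:                      # new atom attached to the current one
                placed.add(nxt)
                actions.append(("add", nxt, atoms[nxt], current, bonds[pair]))
                visit(nxt)

    visit(start)
    actions.append(("terminate",))
    return actions


# Heavy atoms of nitrobenzene: ring carbons 0-5, N+ at 6, O at 7, O- at 8.
ATOMS = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "N+", 7: "O", 8: "O-"}
BONDS = {frozenset((i, (i + 1) % 6)): "aromatic" for i in range(6)}
BONDS.update({
    frozenset((0, 6)): "single",
    frozenset((6, 7)): "double",
    frozenset((6, 8)): "single",
})

sequence = generation_sequence(ATOMS, BONDS, random.Random(1))
```

Whatever traversal order is drawn, nitrobenzene yields nine addition actions, one connection action for the ring closure, and a final termination; different seeds only change the order in which the atoms appear.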
Figure 1.
(a) Molecular generation sequence of an input molecule (nitrobenzene) is generated by the stochastic depth-first search algorithm. (b) Schematic illustration of training process of the Gen-DL model. The kth molecular structure, solvent (CH2Cl2), and seven optical properties are used as inputs. The Gen-DL model learns how to generate the (k + 1)th molecular structure from the kth molecular structure in the sequence by calculating the probability of the next action (the addition of nitrogen cation at the 6 position).
After the Gen-DL model is trained, it can generate molecular structures that meet the target optical properties in a given solvent, starting from scratch or from an initially given scaffold. The initial scaffold can be any atoms or any initial molecular backbone structures. During the molecular generation process, the Gen-DL model generates molecular structures step by step by calculating the probabilities of the next possible actions (addition, connection, and termination), finally producing the molecular structure with the target optical properties (Figure 1). Since the Gen-DL model stochastically selects the next possible action, it can select less likely actions, ensuring the diversity of molecular structures generated by the Gen-DL model. Therefore, each time molecular generation is performed, different molecular structures with the same target optical properties can be generated. Figure 2 shows the molecular generation process performed by the Gen-DL model and the changes in the optical properties of the molecular structures generated by the Gen-DL model during the molecular generation process, starting from the initial scaffold (benzene) to the final molecule (p-nitroaniline).
Figure 2.
(a) Molecular generation process by the Gen-DL model, given three inputs: initial scaffold (benzene), solvent (H2O), and target properties (λabs = 380 nm, σabs = 4700 cm–1, log ε = 3.6, λemi = 550 nm, σemi = 3800 cm–1, Φ = 0.01, and τ = 2.0 ns). The Gen-DL model generates molecular structures step by step, from the initial scaffold (benzene) to the final molecular structure (p-nitroaniline), by calculating the probabilities of the next possible actions. The probabilities for the selected and nonselected next actions are p* and p, respectively. (b–d) Changes in the optical properties of the molecular structures generated by the Gen-DL model during the molecular generation process (indicated by the arrows). The target properties are marked with red stars.
II-C. Efficient Generation of Molecules with Tailored Properties by the Gen-DL Model
The performance of our Gen-DL model was examined by generating molecules with different sets of seven optical properties in different solvents. Note that some of the seven optical properties are correlated with one another, as shown in Figure S3. For example, λabs increases with λemi, but λabs is usually smaller than λemi for a given molecule because of the Stokes shift. In addition, σabs correlates with σemi, whereas the PLQY (Φ) does not appear to be correlated with any other optical property. Based on these correlations, target sets of seven optical properties were selected for generating new molecules. As shown in Table S2, we examined 21 sets of seven optical properties in five different solvents: toluene, water, acetonitrile (ACN), tetrahydrofuran (THF), and dichloromethane (DCM), ultimately generating 105 sets of molecules (21 property sets in each of 5 solvents).
It should be noted that the optical properties of the molecules generated by the Gen-DL model were evaluated by our Pred-DL model. As described elsewhere in detail,8 we developed a Pred-DL model termed Deep Learning Optical Spectroscopy (DLOS) to predict seven optical properties of organic molecules in solutions, solid states, and gas phases. Since we first reported our initial Pred-DL model, the experimental database has significantly expanded to include a greater diversity of molecular structures and the Pred-DL model has been retrained, thereby improving the Pred-DL model in terms of accuracy and generalization. The performance of the current Pred-DL model is summarized in Figure S2 and Table S1.
For a given target set of seven optical properties in a solvent, 10,000 molecules were generated by the Gen-DL model. The optical properties of the molecules were estimated by the Pred-DL model (see the additional SI files). Generated molecules whose estimated optical properties fell within the error range of the Pred-DL model around the target values were considered to satisfy the target optical properties in the given solvent (these molecules were designated as molecules with target optical properties, MTOP). The Gen-DL model was found to exhibit excellent performance in generating MTOP in a given solvent (%MTOP = ∼4.4%), as shown in Figure S9–S113. In addition, our Gen-DL model was found to generate molecules with 88.9% validity, 47.1% uniqueness, and 44.4% novelty (see the SI for details). Figure 3a shows the distributions of seven optical properties of the molecules generated by the Gen-DL model in dichloromethane (DCM). Compared with the distributions of the optical properties of all molecules in DBGen-DL, the seven optical properties of the generated molecules show narrow distributions close to the target optical properties.
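The bookkeeping behind these figures can be sketched as follows. The exact definitions are given in the SI; here it is assumed that validity, uniqueness, and novelty are all expressed as fractions of the total number of generated molecules, that validity is decided by a caller-supplied check, and that a molecule counts as MTOP when every predicted property lies within a per-property tolerance of its target. The tolerance values and the toy data are hypothetical.

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty as fractions of all generated items."""
    total = len(generated)
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / total,
        "uniqueness": len(unique) / total,
        "novelty": len(novel) / total,
    }


def meets_targets(predicted, targets, tolerances):
    """True when every predicted property is within its tolerance of the target."""
    return all(abs(predicted[k] - targets[k]) <= tolerances[k] for k in targets)


# Toy run: None stands for a structure that failed the validity check.
generated = ["A", "A", "B", None, "C"]
metrics = generation_metrics(generated, {"A"}, lambda m: m is not None)

targets = {"lambda_emi": 650.0, "phi": 0.28}
tolerances = {"lambda_emi": 25.0, "phi": 0.2}     # hypothetical error ranges
hit = meets_targets({"lambda_emi": 662.0, "phi": 0.35}, targets, tolerances)
miss = meets_targets({"lambda_emi": 700.0, "phi": 0.35}, targets, tolerances)
```

In practice the validity check would be delegated to a cheminformatics toolkit and the tolerances to the reported errors of the property predictor.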
Figure 3.
Property-oriented molecular generation by the Gen-DL model. (a) Distribution of the optical properties of generated molecules for given target optical properties (λabs = 557 nm, σabs = 3496 cm–1, log ε = 4.49, λemi = 650 nm, σemi = 2351 cm–1, Φ = 0.28, and τ = 1.51 ns) in DCM. The target properties are indicated by dashed vertical lines. The distributions of the optical properties of molecules in DBGen-DL (DCM) are shown in gray. The distributions of the optical properties of the generated molecules (in DCM) are shown in red. The distributions of the optical properties (b) λemi, (c) σemi, (d) log ε, (e) τ, and (f) Φ of generated molecules. The target optical properties are indicated by dashed vertical lines. The distribution of the optical properties of molecules in DBGen-DL (DCM) is shown in gray. See Figure S9–S113 for the distributions of the optical properties of generated molecules for different target sets of seven optical properties.
II-C-1. Absorption and Emission Peak Positions
The absorption and emission peak positions of organic molecules are responsible for the daylight color and the emission color, respectively. Figure 3b shows the remarkable sensitivity of our Gen-DL model in generating molecules with a target λemi value. As the target λemi value changes from ultraviolet (350 nm) to near-infrared (750 nm), the distribution of λemi for the generated molecules changes accordingly. Interestingly, when the target λemi value changes from UV to near-infrared (NIR), the Gen-DL model was found to use the molecular backbone structures associated with carbazole (λemi = ∼350 nm), coumarin (λemi = ∼450 nm), BODIPY (λemi = ∼550 nm), squaraine (λemi = ∼650 nm), and aza-BODIPY (λemi = ∼750 nm). These molecular backbone structures have been effectively used to develop fluorophores having a given range of λemi. Therefore, the Gen-DL model is shown to find the structure–property relationship (SPR) between the molecular structure and λemi in DBGen-DL, and using the SPR, the Gen-DL model can effectively generate molecules with target λemi values.
II-C-2. Absorption and Emission Bandwidths
Absorption and emission bandwidths are crucial to the design of chromophores and fluorophores in various research areas such as OLEDs, organic photovoltaics (OPVs), organic photodiodes (OPDs), and bioimaging. Narrow emission bandwidths are desired for fluorophores used in OLEDs, OPDs, and bioimaging, whereas wide absorption bandwidths are preferred for light-harvesting dyes in OPVs.35−40 Although some design principles have been used empirically,41 controlling the bandwidth of organic molecules remains extremely challenging. As shown in Figure 3c, the Gen-DL model can generate molecules with tuned emission bandwidths, allowing a wide range of bandwidths from narrow (1000 cm–1) to broad (4000 cm–1) to be readily achieved. In fact, the bandwidth regions of approximately 1,000 and 4,000 cm–1 with relatively few data points posed additional challenges for training the Gen-DL model. However, the Gen-DL model was found to generalize the design principles to be able to generate molecules in these bandwidth regions.
II-C-3. PLQY and PL Lifetime
PLQY (Φ) and PL lifetime (τ) are molecular properties that are significantly influenced by complicated intra- and intermolecular interactions in solutions and host matrices, and thus they are challenging to predict theoretically in practice. Accordingly, a reliable estimation of the PLQY and PL lifetime of organic molecules is crucial for practical application in OLEDs, OPVs, and OPDs. Our Pred-DL model has proven to be a reliable tool for predicting the PLQY and PL lifetime of organic molecules, and it surpassed the limitations of conventional theoretical calculations.8,9 Additionally, as shown in Figure 3e and 3f, the Gen-DL model exhibits the ability to generate molecules with specified Φ and τ values to some extent. Note that the PLQY is measured relatively imprecisely by currently available experimental methods when compared with UV–visible absorption and emission spectra, and the PLQY database is relatively noisy.42,43 Therefore, the SPR between molecular structure and PLQY appears to be learned relatively inaccurately by the Gen-DL model.
II-D. Solvent Effects and Structure–Property Relationships Embedded in the Gen-DL Model
II-D-1. Solvent-Dependent Molecular Generation
The optical properties of molecules are significantly influenced by the surrounding solvent molecules in solutions. Our Gen-DL model is built to reflect solvent effects when generating molecules with target optical properties in a specified solvent. To examine how solvent effects are reflected by the Gen-DL model, we used the partition coefficients (P) of molecules generated in different solvents. The partition coefficient (P) is defined as the ratio of molecular concentrations in a mixture of water and n-octanol, and the log P value is used as a measure of lipophilicity (or hydrophobicity). In this study, the log P values of molecules were estimated using the method proposed by Crippen et al.,44 which was implemented in RDKit.45 A larger log P value indicates that the molecule is more hydrophobic. By properly reflecting solvent effects, our Gen-DL model is expected to generate more hydrophilic molecules in water than in organic solvents. In Figure 4a, the distribution of the log P values of the molecules generated in water shows a tendency toward smaller log P values compared with the molecules generated in toluene. This indicates that the Gen-DL model reflects solvent effects when generating molecules in different solvents.
Figure 4.
(a) Distribution of log P values of generated molecules in toluene and in water. (b) Distribution of the Stokes shift values of the generated molecules in toluene and in ACN. Inset: the molecule with an intramolecular charge transfer (ICT) upon electronic excitation, showing a Stokes shift of 80 nm in toluene and 187 nm in ACN. (c) Degree of conjugation of the generated molecules having different target λemi values. (d) t-distributed stochastic neighbor embedding (t-SNE) plot of molecules. Gray: Molecules in DBGen-DL. Blue: Molecules with target optical properties (λabs = 452 nm, σabs = 3942 cm–1, log ε = 4.38, λemi = 550 nm, σemi = 2820 cm–1, Φ = 0.34, τ = 2.17 ns) in DBGen-DL. Red: Generated molecules with the target properties.
The solvent effect on the properties of the generated molecules was further investigated by analysis of the Stokes shift, which is quantified by the difference between the λabs and λemi values. A large Stokes shift results from a large change in the charge distribution between the electronic ground and excited states, which often involves significant intramolecular charge transfer (ICT) in the molecule upon electronic excitation. Consequently, molecules that undergo ICT upon electronic excitation exhibit a larger Stokes shift in highly polar solvents than in nonpolar solvents. Figure 4b shows the distributions of the Stokes shifts (λemi minus λabs) for the same molecules in toluene (dielectric constant = 2.38, weakly polar) and acetonitrile (ACN, dielectric constant = 37.5, highly polar). The molecules generated in ACN exhibit relatively larger Stokes shifts than those in toluene, indicating that the solvent effects are appropriately reflected by the Gen-DL model.
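As a small worked example, the Stokes shift used here is simply the difference of the two peak wavelengths. The sketch below also gives the shift on an energy-proportional wavenumber scale, which is a standard conversion and not a quantity reported in this study. The input values are those quoted in this article for the fluorophore of Figure 5b (λabs = 571 nm, λemi = 651 nm); the function names are illustrative.

```python
def stokes_shift_nm(lambda_abs_nm, lambda_emi_nm):
    """Stokes shift as used in this article: emission minus absorption peak (nm)."""
    return lambda_emi_nm - lambda_abs_nm


def stokes_shift_wavenumber(lambda_abs_nm, lambda_emi_nm):
    """Same shift in cm^-1, using wavenumber = 1e7 / wavelength (nm)."""
    return 1e7 / lambda_abs_nm - 1e7 / lambda_emi_nm


shift_nm = stokes_shift_nm(571, 651)            # 80 nm
shift_cm1 = stokes_shift_wavenumber(571, 651)   # roughly 2150 cm^-1
```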
II-D-2. Degree of Conjugation of Organic Molecules
One of the key structural features associated with λabs and λemi is the conjugation length, which refers to the extent of the connected sp- and sp2-hybridized atoms in a molecule. To quantify the conjugation length in a given molecule, we used the degree of conjugation (DOC), which is defined in this study as the number of bonds that connect the farthest atoms in the conjugated backbone of a molecule. To calculate the DOC of molecules, we developed a Python function using the RDKit library,45 which provides a powerful toolkit for cheminformatics (see the SI for details). To investigate the effect of the DOC on λemi, we analyzed the DOC for ∼10,000 molecules generated at different λemi values in toluene. Figure 4c illustrates the trend in the DOC observed for the molecules at three different λemi values. As the λemi value increases, the DOC of the generated molecules tends to increase in toluene. Even though λemi is influenced by the solvent effect, it is important for the molecule to possess sufficient DOC to achieve target λemi values. This finding once again highlights the ability of our Gen-DL model to understand and leverage crucial design principles.
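The DOC definition can be read as a graph diameter: the largest shortest-path length, counted in bonds, between any two atoms of the conjugated backbone. The sketch below implements that reading on a plain bond list and is not the authors' RDKit function (described in the SI). It assumes that the caller has already decided which atoms are sp or sp2 hybridized and that a bond between two such atoms belongs to the conjugated backbone.

```python
from collections import deque


def degree_of_conjugation(bonds, conjugated_atoms):
    """Largest shortest-path length (in bonds) between two conjugated atoms."""
    conjugated = set(conjugated_atoms)
    graph = {a: set() for a in conjugated}
    for i, j in bonds:
        if i in conjugated and j in conjugated:
            graph[i].add(j)
            graph[j].add(i)

    def eccentricity(start):
        # Breadth-first search gives shortest bond counts from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return max(dist.values())

    return max((eccentricity(a) for a in graph), default=0)


ring_a = [(i, (i + 1) % 6) for i in range(6)]
ring_b = [(6 + i, 6 + (i + 1) % 6) for i in range(6)]

doc_benzene = degree_of_conjugation(ring_a, range(6))
doc_hexatriene = degree_of_conjugation([(i, i + 1) for i in range(5)], range(6))
doc_biphenyl = degree_of_conjugation(ring_a + ring_b + [(0, 6)], range(12))
```

Under this reading, benzene has a DOC of 3, linear hexatriene 5, and biphenyl 7 (para carbon to para carbon), so extending the conjugated backbone raises the value as expected.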
II-D-3. Exploring Structural Diversity
To investigate the structural diversity of molecules generated by the Gen-DL model, we performed t-SNE analysis using the Morgan fingerprints of the molecules, as illustrated in Figure 4d. The t-SNE analysis using the Morgan fingerprint provides a two-dimensional visualization of the structural diversity. In the t-SNE plot, structurally similar molecules are positioned closely, while structurally distinct molecules are placed farther apart. Figure 4d shows the t-SNE analysis of all molecules in DBGen-DL (in gray, MDB), the molecules with target optical properties in DBGen-DL (in blue, MDBTOP), and the generated molecules (in red, MTOP) with target optical properties (λabs = 452 nm, σabs = 3942 cm–1, log ε = 4.38, λemi = 550 nm, σemi = 2820 cm–1, Φ = 0.34, τ = 2.17 ns). The blue dots (MDBTOP) are scattered over the distribution of gray dots (MDB), indicating that the molecules with the same target optical properties in DBGen-DL have diverse molecular structures. Interestingly, some of the red dots (MTOP) are located close to the blue dots (MDBTOP), showing that the Gen-DL model generates molecules having molecular structures similar to those in DBGen-DL. More interestingly, other red dots are found far away from the blue dots in the t-SNE plot. This indicates that the Gen-DL model can generate molecules having different backbone structures with the same target optical properties. The t-SNE analysis thus shows that our Gen-DL model can generate molecules that are not only structurally similar to molecules in DBGen-DL, but also structurally different from those in DBGen-DL. This demonstrates that our Gen-DL model has great potential to generate molecules with new backbone structures that are not included in the training data set (DBGen-DL).
II-E. Validating Practical Applications for Designing Novel and Useful Molecules
One of the ultimate goals of chemistry and materials science is to develop new molecules with desired properties for specific purposes in many research areas. Our Gen-DL model provides an efficient way to achieve this goal by generating molecules that satisfy specific properties in a given solvent. Rafael and co-workers previously demonstrated molecular discovery through the virtual screening and synthesis of new molecular structures.46 However, examining the performance of our Gen-DL model via direct synthesis of generated molecules is time-consuming. Therefore, in this study, we used a test data set (DBTest), which is mutually exclusive with the training data set (DBGen-DL), to determine whether the molecules generated by the Gen-DL model exhibit the target optical properties. In other words, the molecules generated by the Gen-DL model were searched for in DBTest, and the experimental optical properties of the molecules found in DBTest were compared with the target optical properties. This approach allowed us to examine the performance of the Gen-DL model without the need to synthesize molecules directly and measure their optical properties. This validation step supports the efficiency and applicability of our Gen-DL model in guiding the discovery of newly designed molecules.
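This validation amounts to a lookup followed by a comparison. A minimal sketch is given below, assuming that generated and test-set molecules share a common structure key (for example a canonical SMILES string; the keys below are placeholders) and that the test set maps each key to its experimental properties. The numbers are those quoted in this article for the NIR dye of Figure 5a (target λemi = 760 nm and log ε = 5; experimental λemi = 766 nm and log ε = 5.34).

```python
def match_in_test_set(generated_keys, test_db, targets):
    """Report experimental-minus-target deviations for generated structures
    that also occur in the held-out test set."""
    report = {}
    for key in set(generated_keys):
        experimental = test_db.get(key)
        if experimental is None:
            continue                  # structure not present in the test set
        report[key] = {
            prop: experimental[prop] - target
            for prop, target in targets.items()
            if prop in experimental
        }
    return report


test_db = {"dye_5a": {"lambda_emi": 766, "log_eps": 5.34}}    # placeholder key
generated_keys = ["dye_5a", "unseen_1", "unseen_2", "dye_5a"]
targets = {"lambda_emi": 760, "log_eps": 5}

report = match_in_test_set(generated_keys, test_db, targets)
```

Structures that are generated but absent from the test set are simply skipped, so the report says nothing about their quality.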
To directly examine the performance of our Gen-DL model, the molecules generated by the Gen-DL model were identified in DBTest and their experimental properties were compared with the target properties, as shown in Figure 5.
Figure 5.
Examples of molecules generated by the Gen-DL model and their practical use. Experimental values (green dots), predicted values (red line), and target values (blue line) are compared in the radar plot. The range of optical properties considered to meet the target optical properties is indicated in shaded blue based on the RMSE of the Pred-DL model.
Figure 5 shows several intriguing molecules that were generated by our Gen-DL model and were also found in DBTest. As a first example, given suitable target properties (e.g., λemi = 760 nm and log ε = 5), NIR fluorophores can be generated by the Gen-DL model. In fact, as shown in Figure 5a, the Gen-DL model generated a NIR imaging dye (λemi = 766 nm) with a large extinction coefficient (log ε = 5.34) that was developed by Sletten and co-workers for applications in high-resolution fluorescence microscopy.32 As a second example, imaging dyes with large Stokes shifts and large extinction coefficients are useful for in vivo imaging. For this purpose, the target properties (e.g., λabs = 570 nm, λemi = 660 nm, Stokes shift = 90 nm, log ε = 5) can be given. Our Gen-DL model was found to generate the fluorophore in Figure 5b, which was originally developed by Ren and co-workers for fluorescence microscopy and in vivo imaging.31 Ren and co-workers showed that the fluorophore in Figure 5b exhibited deep tissue penetration, small autofluorescence interference, and large absorption due to the optimal optical properties (λabs = 571 nm and λemi = 651 nm) for biological tissues, the large Stokes shift (80 nm), and the large extinction coefficient (log ε = 5). As a third example, narrowband emitters in OLEDs can be generated by the Gen-DL model with the target properties (e.g., λemi = 520 nm, σemi = 1500 cm–1, Φ = 0.9). The emitter in Figure 5c, generated by our Gen-DL model, was originally developed by Zhang and co-workers.30 They showed that the emitter exhibited green emission (λemi = 500 nm) with a narrow bandwidth (σemi = 25 nm at 500 nm) and a high PLQY (Φ = 0.887). As the last example, organic photovoltaic materials require broad absorption spectra with large extinction coefficients to achieve high power conversion efficiency. To reflect these characteristics, the target properties (e.g., λabs = 500 nm, σabs = 4200 cm–1, and log ε = 5) were used as input for the Gen-DL model. As a result, the Gen-DL model generated the molecule in Figure 5d, originally developed by Liang and co-workers for small-molecule photovoltaic applications.33 They demonstrated that this molecule has a wide absorption spectrum in the visible range (λabs = 569 nm and σabs = 4031 cm–1 = ∼120 nm at 570 nm), along with a high extinction coefficient (log ε = 4.74). In short, our Gen-DL model is shown to have great potential in generating novel and practically useful molecules that are not included in the training data set (DBGen-DL).
III. Concluding Remarks
In this study, we successfully developed a generative DL model that was trained on a large experimental database (DBexp) including 71,424 molecule/solvent pairs and was able to design molecules with target optical properties in various solvents. Our Gen-DL model can generate molecules with specific optical properties, such as specific absorption and emission peak positions, bandwidths, extinction coefficients, PL lifetimes, and PLQY. Notably, the Gen-DL model was found to generate target molecules even in the ranges of the optical properties having insufficient training data. Furthermore, our Gen-DL model effectively reflected the solvent effects in generating molecules with target optical properties, which was verified by investigating log P values and Stokes shifts of generated molecules in different solvents. Additionally, using the DOC descriptor, we have demonstrated that our Gen-DL model understands and exploits the essential design principles of conjugation effects on absorption and emission wavelengths. The t-SNE analysis showed that the Gen-DL model can generate molecules that are not only structurally similar to molecules in DBGen-DL, but also structurally different from those in DBGen-DL. Lastly, we demonstrated the performance of our Gen-DL model by identifying the molecules generated by the Gen-DL model in the test data set (DBTest) and comparing their experimental properties with the target properties, which confirms that our Gen-DL model has great potential to generate new and practically useful molecules that are not included in the training data set (DBGen-DL). Our Gen-DL model is a promising tool for discovering and designing novel molecules that have tailored properties, and it may pave the way for more efficient development of target materials in various research areas, such as OLEDs, OPVs, OPDs, bioimaging fluorophores, and so on. Our Gen-DL model is currently available as a web-based application (http://deep4chem.korea.ac.kr/DeepMoleculeGen).
When it is combined with the previously reported Pred-DL model (DLOS),7 the Gen-DL model can be effectively utilized to develop optimal molecules with specified properties in many research areas. The Pred-DL model can be used for efficient virtual screening of large numbers of predesigned molecules to select molecules with desired properties, whereas the Gen-DL model can directly generate molecules having target properties. More importantly, the Gen-DL model may offer the potential to discover new molecular backbones that have not been designed so far. Both Pred-DL and Gen-DL models can be used to build a library of molecules with target properties by screening predesigned molecules based on molecular property predictions and generating molecules with target properties. Synthesizable molecules can be selected from the library and synthesized directly. Finally, after confirming their properties by experiments, they can be used for practical applications in many research areas.
Acknowledgments
This work was supported by grants from the National Research Foundation of Korea (NRF 2019R1A6A1A11044070 and NRF 2022R1A2C1003627).
Data Availability Statement
Our Gen-DL model is publicly available as a web-based application (http://deep4chem.korea.ac.kr/DeepMoleculeGen). The code and data for implementing our Gen-DL model are available at https://github.com/spark8ku/DeepMoleculeGen. The experimental database can be obtained from https://www.nature.com/articles/s41597-020-00634-8 or from the corresponding author upon request, for academic purposes only.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acscentsci.4c00656.
Author Present Address
‡ Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
Author Contributions
† M.H. and J.F.J. contributed equally to this work. S.P. conceived and supervised the project. M.H. and J.F.J. developed DL models and constructed the experimental database. M.J. constructed the experimental database. D.H.C. supervised the construction of the experimental database. M.H., J.F.J., and S.P. analyzed the results and wrote the manuscript. All authors have given approval to the final version of the manuscript.
The authors declare no competing financial interest.
References
- Henson Z. B.; Müllen K.; Bazan G. C. Design strategies for organic semiconductors beyond the molecular formula. Nat. Chem. 2012, 4, 699–704. 10.1038/nchem.1422.
- Lookman T.; Balachandran P. V.; Xue D.; Yuan R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput. Mater. 2019, 5, 21. 10.1038/s41524-019-0153-8.
- Gorse A.-D. Diversity in medicinal chemistry space. Curr. Top. Med. Chem. 2006, 6, 3–18. 10.2174/156802606775193310.
- Benhenda M. ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity? arXiv preprint 2017, arXiv:1708.08227. 10.48550/arXiv.1708.08227.
- Polykovskiy D.; Zhebrak A.; Vetrov D.; Ivanenkov Y.; Aladinskiy V.; Mamoshina P.; Bozdaganyan M.; Aliper A.; Zhavoronkov A.; Kadurin A. Entangled conditional adversarial autoencoder for de novo drug discovery. Mol. Pharmaceutics 2018, 15, 4398–4405. 10.1021/acs.molpharmaceut.8b00839.
- Jeong M.; Joung J. F.; Hwang J.; Han M.; Koh C. W.; Choi D. H.; Park S. Deep learning for development of organic optoelectronic devices: efficient prescreening of hosts and emitters in deep-blue fluorescent OLEDs. npj Comput. Mater. 2022, 8, 147. 10.1038/s41524-022-00834-3.
- Joung J. F.; Han M.; Jeong M.; Park S. Experimental database of optical properties of organic compounds. Sci. Data 2020, 7, 295. 10.1038/s41597-020-00634-8.
- Joung J. F.; Han M.; Hwang J.; Jeong M.; Choi D. H.; Park S. Deep Learning Optical Spectroscopy Based on Experimental Database: Potential Applications to Molecular Design. JACS Au 2021, 1, 427–438. 10.1021/jacsau.1c00035.
- Joung J. F.; Han M.; Jeong M.; Park S. Beyond Woodward–Fieser Rules: Design Principles of Property-Oriented Chromophores Based on Explainable Deep Learning Optical Spectroscopy. J. Chem. Inf. Model. 2022, 62, 2933–2942. 10.1021/acs.jcim.2c00173.
- Kearnes S.; McCloskey K.; Berndl M.; Pande V.; Riley P. Molecular graph convolutions: moving beyond fingerprints. J. Comput.-Aided Mol. Des. 2016, 30, 595–608. 10.1007/s10822-016-9938-8.
- Wang J.; Hsieh C.-Y.; Wang M.; Wang X.; Wu Z.; Jiang D.; Liao B.; Zhang X.; Yang B.; He Q.; et al. Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nat. Mach. Intell. 2021, 3, 914–922. 10.1038/s42256-021-00403-1.
- You J.; Liu B.; Ying Z.; Pande V.; Leskovec J. Graph convolutional policy network for goal-directed molecular graph generation. arXiv preprint 2018, arXiv:1806.02473. 10.48550/arXiv.1806.02473.
- Lim J.; Hwang S. Y.; Moon S.; Kim S.; Kim W. Y. Scaffold-based molecular design with a graph generative model. Chem. Sci. 2020, 11, 1153–1164. 10.1039/C9SC04503A.
- Zhou Z.; Kearnes S.; Li L.; Zare R. N.; Riley P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 2019, 9, 10752. 10.1038/s41598-019-47148-x.
- Gómez-Bombarelli R.; Wei J. N.; Duvenaud D.; Hernández-Lobato J. M.; Sánchez-Lengeling B.; Sheberla D.; Aguilera-Iparraguirre J.; Hirzel T. D.; Adams R. P.; Aspuru-Guzik A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4, 268–276. 10.1021/acscentsci.7b00572.
- Liu Q.; Allamanis M.; Brockschmidt M.; Gaunt A. Constrained graph variational autoencoders for molecule design. arXiv preprint 2018, arXiv:1805.09076. 10.48550/arXiv.1805.09076.
- Jin W.; Barzilay R.; Jaakkola T. Junction tree variational autoencoder for molecular graph generation. arXiv preprint 2018, arXiv:1802.04364. 10.48550/arXiv.1802.04364.
- Kadurin A.; Aliper A.; Kazennov A.; Mamoshina P.; Vanhaelen Q.; Khrabrov K.; Zhavoronkov A. The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology. Oncotarget 2017, 8, 10883. 10.18632/oncotarget.14073.
- De Cao N.; Kipf T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint 2018, arXiv:1805.11973. 10.48550/arXiv.1805.11973.
- Segler M. H.; Kogej T.; Tyrchan C.; Waller M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2018, 4, 120–131. 10.1021/acscentsci.7b00512.
- Olivecrona M.; Blaschke T.; Engkvist O.; Chen H. Molecular de-novo design through deep reinforcement learning. J. Cheminformatics 2017, 9, 1–14. 10.1186/s13321-017-0235-x.
- Yuan Q.; Santana-Bonilla A.; Zwijnenburg M. A.; Jelfs K. E. Molecular generation targeting desired electronic properties via deep generative models. Nanoscale 2020, 12, 6744–6758. 10.1039/C9NR10687A.
- Kotsias P.-C.; Arús-Pous J.; Chen H.; Engkvist O.; Tyrchan C.; Bjerrum E. J. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat. Mach. Intell. 2020, 2, 254–265. 10.1038/s42256-020-0174-5.
- Li Y.; Zhang L.; Liu Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminformatics 2018, 10, 33. 10.1186/s13321-018-0287-6.
- Coley C. W.; Jin W.; Rogers L.; Jamison T. F.; Jaakkola T. S.; Green W. H.; Barzilay R.; Jensen K. F. A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 2019, 10, 370–377. 10.1039/C8SC04228D.
- Li Y.; Vinyals O.; Dyer C.; Pascanu R.; Battaglia P. Learning deep generative models of graphs. arXiv preprint 2018, arXiv:1803.03324. 10.48550/arXiv.1803.03324.
- Gebauer N. W. A.; Gastegger M.; Hessmann S. S. P.; Müller K. R.; Schütt K. T. Inverse design of 3d molecular structures with conditional generative neural networks. Nat. Commun. 2022, 13, 973. 10.1038/s41467-022-28526-y.
- Zang C.; Wang F. Moflow: an invertible flow model for generating molecular graphs. arXiv preprint 2020, arXiv:2006.10137. 10.48550/arXiv.2006.10137.
- Nguyen T.; Le H.; Quinn T. P.; Nguyen T.; Le T. D.; Venkatesh S. GraphDTA: Predicting drug–target binding affinity with graph neural networks. Bioinformatics 2021, 37, 1140–1147. 10.1093/bioinformatics/btaa921.
- Zhang Y.; Zhang D.; Wei J.; Liu Z.; Lu Y.; Duan L. Multi-resonance induced thermally activated delayed fluorophores for narrowband green OLEDs. Angew. Chem.-Int. Ed. 2019, 58, 16912–16917. 10.1002/anie.201911266.
- Ren T.-B.; Xu W.; Zhang W.; Zhang X.-X.; Wang Z.-Y.; Xiang Z.; Yuan L.; Zhang X.-B. A general method to increase Stokes shift by introducing alternating vibronic structures. J. Am. Chem. Soc. 2018, 140, 7716–7722. 10.1021/jacs.8b04404.
- Cosco E. D.; Caram J. R.; Bruns O. T.; Franke D.; Day R. A.; Farr E. P.; Bawendi M. G.; Sletten E. M. Flavylium polymethine fluorophores for near- and shortwave infrared imaging. Angew. Chem.-Int. Ed. 2017, 56, 13126–13129. 10.1002/anie.201706974.
- Liang L.; Wang J.-T.; Xiang X.; Ling J.; Zhao F.-G.; Li W.-S. Influence of moiety sequence on the performance of small molecular photovoltaic materials. J. Mater. Chem. A 2014, 2, 15396–15405. 10.1039/C4TA03125C.
- Tarjan R. Depth-first search and linear graph algorithms. SIAM J. Comput. 1972, 1, 146–160. 10.1137/0201010.
- Qiu Y.; Xia H.; Miao J.; Huang Z.; Li N.; Cao X.; Han J.; Zhou C.; Zhong C.; Yang C. Narrowing the electroluminescence spectra of multiresonance emitters for high-performance blue OLEDs by a peripheral decoration strategy. ACS Appl. Mater. Interfaces 2021, 13, 59035–59042. 10.1021/acsami.1c18704.
- Zhu L.; Zhang H.; Peng X.; Zhang M.; Zhou F.; Chen S.; Song J.; Qu J.; Wong W.-Y. Blue OLEDs with narrow bandwidth using CF3 substituted bis ((carbazol-9-yl) phenyl) amines as emitters: Structural regulation of linker between donor and acceptor in chromophores. Dyes Pigments 2021, 194, 109627. 10.1016/j.dyepig.2021.109627.
- Tang Y.; Xie G.; Liang X.; Zheng Y.-X.; Yang C. Organic and quantum-dot hybrid white LEDs using a narrow bandwidth blue TADF emitter. J. Mater. Chem. C 2020, 8, 10831–10836. 10.1039/D0TC01942A.
- Zhao B.; Ma H.; Zheng M.; Xu K.; Zou C.; Qu S.; Tan Z. a. Narrow-bandwidth emissive carbon dots: A rising star in the fluorescent material family. Carbon Energy 2022, 4, 88–114. 10.1002/cey2.175.
- Shi P.; Song Y.; Tang J.; Nie Z.; Chang J.; Chen Q.; He Y.; Guo T.; Zhang J.; Wang H. Ultra-narrow bandwidth red-emission carbon quantum dots and their bio-imaging. Physica E 2022, 142, 115197. 10.1016/j.physe.2022.115197.
- Tang J.; Xiao Z.; Xu K. Ultra-thin metamaterial absorber with extremely bandwidth for solar cell and sensing applications in visible region. Opt. Mater. 2016, 60, 142–147. 10.1016/j.optmat.2016.07.023.
- Ha J. M.; Shin H. B.; Joung J. F.; Chung W. J.; Jeong J.-E.; Kim S.; Hur S. H.; Bae S.-Y.; Kim J.-Y.; Lee J. Y. Rational molecular design of azaacene-based narrowband green-emitting fluorophores: Modulation of spectral bandwidth and vibronic transitions. ACS Appl. Mater. Interfaces 2021, 13, 26227–26236. 10.1021/acsami.1c04981.
- Fries F.; Reineke S. Statistical treatment of photoluminescence quantum yield measurements. Sci. Rep. 2019, 9, 15638. 10.1038/s41598-019-51718-4.
- Brouwer A. M. Standards for photoluminescence quantum yield measurements in solution (IUPAC Technical Report). Pure Appl. Chem. 2011, 83, 2213–2228. 10.1351/PAC-REP-10-09-31.
- Wildman S. A.; Crippen G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 1999, 39, 868–873. 10.1021/ci990307l.
- RDKit: Open-source cheminformatics. 2010. http://www.rdkit.org.
- Gómez-Bombarelli R.; Aguilera-Iparraguirre J.; Hirzel T. D.; Duvenaud D.; Maclaurin D.; Blood-Forsythe M. A.; Chae H. S.; Einzinger M.; Ha D. G.; Wu T.; et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 2016, 15, 1120–1127. 10.1038/nmat4717.