MOLGENGO: Finding Novel Molecules with Desired Electronic Properties by Capitalizing on Their Global Optimization

Beomchang Kang; Chaok Seok; Juyong Lee

doi:10.1021/acsomega.1c04347

2021 Oct 5;6(41):27454–27465. doi: 10.1021/acsomega.1c04347

PMCID: PMC8529683 PMID: 34693166

Abstract

The discovery of novel and favorable fluorophores is critical for many chemical and biological studies. High-resolution biological imaging requires fluorophores with diverse colors and high quantum yields. The maximum oscillator strength of a molecule and its corresponding absorption wavelength are closely related to the quantum yield and the emission spectrum of a fluorophore, respectively. Thus, the core step in designing favorable fluorophore molecules is to optimize the desired electronic transition properties of molecules. Here, we present MOLGENGO, a new molecular property optimization algorithm, to discover novel and favorable fluorophores with machine learning and global optimization. This study reports novel molecules from MOLGENGO with high oscillator strength and absorption wavelengths close to 200, 400, and 600 nm. The molecules found in MOLGENGO simulations have the potential to be candidates for new fluorophore frameworks.

1. Introduction

Fluorophores play crucial roles in various disciplines such as medicine, biochemistry, spectroscopy, and analytical chemistry.¹⁻⁴ They are widely used to screen toxic compounds at the molecular level and observe protein–protein interactions.^5,6 The discovery of novel fluorophores will open new possibilities in biology and biochemistry because only a small number of fluorophores are commonly used at present.⁷ To design bright fluorophores with various colors, it is essential to optimize their maximum oscillator strength (f_max) and the corresponding absorption wavelength (λ_max) to the desired values.⁸

Conventionally, most new fluorophores have been discovered by following established rules, guidelines, and strategies.⁹⁻¹¹ Finding a novel scaffold by a conventional experimental molecular discovery approach without such rules, guidelines, and strategies demands astronomical amounts of resources and time to synthesize candidate molecules and experimentally verify their properties.⁷ Despite decades of endeavors by chemists, only a few fluorophores are commonly used, such as fluorescein,¹² BODIPY,¹³ cyanine,¹⁴ bisbenzimide,¹⁵ coumarin,¹⁶ rhodamine,¹⁷ and others.¹⁸ In this study, we aimed to develop a computational approach to find novel scaffolds that are distinct from known ones without using explicit rules, guidelines, and strategies.

Various computational approaches have been suggested for efficient optimization of the desired electronic transition properties.¹⁹⁻²¹ Sumita and co-workers developed ChemTS, which utilized Monte Carlo tree search with a recurrent neural network as a molecular generator and density functional theory (DFT) calculation as the evaluator of the desired electronic transition.¹⁹ They synthesized and validated five fluorophores. Henault and co-workers applied a graph-based genetic algorithm (GB-GA) and the tight-binding-based simplified Tamm–Dancoff approximation (sTDA-xTB) as a molecular generator and a desired electronic transition evaluator.²⁰ They reported nine molecules, which are expected to have favorable electronic properties. Leguy and co-workers combined a graph-based evolutionary algorithm (EvoMol) to generate new molecules with density functional theory calculations.²¹ They reported 15 molecules with low E_LUMO and 15 molecules with high E_HOMO.

Favorable fluorophores must have a high f_max and a λ_max close to the target λ_max. In a practical sense, toxicity, biocompatibility, chemical stability, and photostability are also important properties for fluorophores. However, considering all properties simultaneously is a highly challenging task and out of the scope of this study. This study focused on optimizing f_max and λ_max as the first step to discovering new fluorophores. Many previous approaches did not optimize f_max: Sumita and co-workers optimized only λ_max,¹⁹ and Leguy and co-workers optimized only E_HOMO and E_LUMO,²¹ so neither optimized fluorescence strength.

An extensive search of the chemical space is essential to discover various favorable fluorophores from unexplored areas of chemical space. For extensive searches, molecular evaluators should be fast, and populations should remain diverse and not be trapped in local minima at the early stage of optimization. To the best of our knowledge, all current methods use DFT calculations, which require extensive computational resources and time, as evaluators.¹⁹⁻²³ Also, they did not consider the diversity of generated molecules during optimization.¹⁹⁻²¹

Here, we present the MOLecular generator using Light Gradient boosting machine, Grammatical Evolution aNd Global Optimization (MOLGENGO) approach to find novel molecules that have a targeted absorption wavelength, λ_max, and high oscillator strength, f_max. Our method optimizes both f_max and λ_max simultaneously, unlike previous approaches (Figure 1).^19,21 The light gradient boosting machine (LGBM)²⁴ method, one of the tree-based machine learning (ML) methods, was used to predict f_max and λ_max. Our ML-based predictions require far fewer computational resources than quantum mechanical (QM) calculations without sacrificing accurate characterization of electronic excitation properties.^22,24 Furthermore, we implemented the conformational space annealing (CSA) algorithm as a global optimization method that searches for global minimum solutions while considering the diversity of molecules.²⁵⁻²⁷ As a result, we observed that the faster evaluation of f_max and λ_max and the consideration of diversity enabled an extensive search for fluorophores with broad chemical space coverage.

Figure 1. Flow chart of MOLGENGO in discovering novel fluorophores.

This paper is organized as follows. First, the details of the molecular descriptors and LGBM models to predict f_max and λ_max are described. Second, the components of the CSA algorithm, genetic operators for the molecular generator, and diversity control scheme are described. Third, the detailed molecular optimization results of MOLGENGO are discussed. Through a benchmark test, we show that our method successfully optimized f_max and λ_max and maintained the diversity of the pool of generated molecules better than the existing genetic algorithm. Finally, this article suggests novel molecules with optimized f_max and λ_max verified by time-dependent (TD)-DFT.

2. Results and Discussion

2.1. Prediction of Maximum Oscillator Strength and the Corresponding Excitation Energy

LGBM models predicted the desired electronic transition properties with similar accuracy and correlation but faster training speed than the previous random forest (RF)-based models (Table 1). The mean prediction times of the LGBM models were less than 1/10 of those of the RF models. These results demonstrate that our LGBM models improve global optimization efficiency through their faster prediction. Furthermore, the LGBM models showed higher Pearson correlation coefficients for the predictions of f_max and the corresponding excitation energy (E_max). The root mean square error (RMSE) of E_max from LGBM was 0.02 eV lower than that of RF, and the RMSE of f_max from LGBM was identical to that of RF.

Table 1. Comparison of Accuracy and Efficiency of LGBM and RF Models.

model	quantum property	RMSE^a	Pearson R^a	mean prediction time (s)^b
LGBM	excitation energy (eV)	0.43	0.89	0.006
LGBM	f_max	0.08	0.85	0.004
RF⁷	excitation energy (eV)	0.45	0.88	0.065
RF⁷	f_max	0.08	0.83	0.068

^a The test set includes 50 000 molecules. The number of heavy atoms is up to 38.

^b Intel Xeon CPU E5-2620 v4 2.10 GHz 1 core, 1 processor, 128 GB memory.

2.2. Finding Novel Fluorescent Molecules via Global Optimization

2.2.1. Benchmarking Optimization Performance

MOLGENGO successfully generated molecules with high f_max,pred and desired λ_max,pred (Figure 2). MOLGENGO was executed from the identical first bank to optimize f_max,pred and λ_max,pred with three λ_max,target values: 200, 400, and 600 nm. The distribution of the predicted λ_max of the first bank had its peak near 250 nm. After optimization, all predicted λ_max values in the final banks deviated from λ_max,target by less than 50 nm for all simulations (Figure 2a). The λ_max,pred distributions became narrower from the first banks to the final banks for all λ_max,target, which indicated that our method successfully generated molecules with desired properties. The λ_max,pred distribution for λ_max,target = 600 nm was broader than that for λ_max,target = 200 nm. This may be due to the sparse population of molecules whose λ_max is close to 600 nm in PubChemQC. In PubChemQC, the number of molecules whose λ_max is in the range of 600 ± 10 nm is about 10², which is 1/100 of the number of molecules with λ_max = 400 ± 10 nm and 1/10 000 of the number with λ_max = 200 ± 10 nm^7,28 (Figure S1).

Figure 2. Distribution change of predicted (a) λ_max and (b) f_max shown as violin plots. Cyan and orange correspond to the first bank and final bank, respectively. The first, second, and third plots correspond to target λ_max values of 200, 400, and 600 nm, respectively.

For all λ_max,target values, many molecules whose predicted f_max surpassed 1.5 were found (Figure 2b). All predicted f_max values of the first banks were less than 1.0 for all λ_max,target. When λ_max,target = 200 nm, we even discovered molecules with f_max over 3.0. When λ_max,target = 400 and 600 nm, two peaks that exceeded 1.0 were found.

For a fair comparison between the two methods tested here, we performed global optimization using both methods for 7 days with a single CPU. MOLGENGO requires a longer computation time to converge but finds more optimized molecules than ChemGE²⁹ (Figure 3). ChemGE simulations converged within 3 h for all λ_max,target. However, MOLGENGO kept finding more optimized molecules until 168 h for all λ_max,target.

Figure 3. Average of the best S(m), λ_max,pred, and f_max,pred from 10 simulations by ChemGE and CSA. The change of S(m) with target λ_max (a) 200 nm, (b) 400 nm, and (c) 600 nm. λ_max,pred with target λ_max (d) 200 nm, (e) 400 nm, and (f) 600 nm. f_max,pred with target λ_max (g) 200 nm, (h) 400 nm, and (i) 600 nm during the search. The X-axis represents running time (hours). Red and blue correspond to CSA and ChemGE, respectively.

The objective function of our global optimization was designed to maximize the highest oscillator strength (f_max) and make the corresponding absorption wavelength (λ_max) close to the target wavelength (λ_max,target). The objective function of a molecule m used in this study is defined as follows

where f_max,pred(m) and λ_max,pred(m) are the predicted values obtained from the LGBM regressors²⁴ and w is the weight of the λ_max deviation term.
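The objective described above rewards a high predicted oscillator strength and penalizes deviation from the target wavelength. The sketch below shows one form consistent with that verbal description. The exact expression is an assumption here (in particular, whether the deviation is scaled or expressed in other units), and the function name `objective` is hypothetical.

```python
def objective(f_max_pred, lambda_max_pred, lambda_target, w=1.0):
    """Assumed score S(m): predicted oscillator strength minus a weighted
    absolute deviation of the predicted wavelength (nm) from the target.
    Higher values are treated as better in this sketch."""
    return f_max_pred - w * abs(lambda_max_pred - lambda_target)


# A molecule exactly on target is scored by its oscillator strength alone.
score_on_target = objective(1.5, 400.0, 400.0, w=10.0)
# A 2 nm miss with w = 1.0 cancels an oscillator strength of 2.0.
score_off_target = objective(2.0, 402.0, 400.0, w=1.0)
```

Under this assumed form, a larger w forces tighter convergence to the target wavelength at the expense of oscillator strength, which matches the trade-off discussed for the three tested weights.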

In terms of finding molecules whose λ_max,pred is close to λ_max,target, MOLGENGO outperformed the ChemGE method significantly except in the case of λ_max,target = 400 nm. When λ_max,target = 200 nm, the best result of MOLGENGO approached 200 nm, but that of ChemGE departed from 200 nm as the simulation proceeded (Figure 3d). When λ_max,target = 600 nm, MOLGENGO results converged to λ_max = 602 nm and ChemGE converged to 604 nm (Figure 3f). When λ_max,target = 400 nm, the difference between the results of the two methods was not significant, under 1.0 nm (Figure 3e).

In terms of finding molecules with high f_max, MOLGENGO outperformed the ChemGE method significantly for all λ_max,target. When λ_max,target = 200 nm, f_max,pred of the best molecule obtained with MOLGENGO was 2.8, almost 3 times that of the ChemGE result, 1.0 (Figure 3g). When λ_max,target = 400 nm, the highest f_max,pred obtained with MOLGENGO was 2.0 while that of the ChemGE result was 1.1 (Figure 3h). When λ_max,target = 600 nm, the best f_max,pred of the MOLGENGO simulation was 1.2, slightly higher than that of the ChemGE result, 1.1 (Figure 3i).

With a weight factor of 1.0, MOLGENGO showed slightly worse results in terms of λ_max than ChemGE (Figure 3e). However, MOLGENGO found molecules with much higher f_max values than ChemGE, which compensates for the slightly worse λ_max results (Figure 3h). Because we optimized an objective function defined as a linear combination of λ_max and f_max terms, each individual component may not show consistent improvement over ChemGE. However, the overall objective values obtained with MOLGENGO are consistently better than those of ChemGE. These results show that our approach using CSA explores chemical space more extensively than ChemGE, which is based on the conventional genetic algorithm, resulting in better molecules.

2.2.2. Diversity of Generated Molecules

MOLGENGO sampled more diverse molecules than ChemGE²⁹ for all target λ_max (Figures 4 and 5). The pairwise distance distribution of MOLGENGO shows that the diversities of the pools of molecules were well-maintained in the final banks, which are the results of global optimization (Figure 4). However, the optimization by ChemGE led to lower distances between optimized molecules, indicating that they are highly similar to each other. For all λ_max,target, the pairwise distance distributions of the final banks obtained with ChemGE had their peaks near 0.0. Furthermore, the highest peaks of pairwise distance distribution of the final banks from ChemGE were located near 0.0 when λ_max,target = 400 and 600 nm. However, the MOLGENGO results formed their peaks around 0.6 for all λ_max,target values.

Figure 4. Pairwise distance (1 – J_c) distribution of MOLGENGO and ChemGE. Cyan and orange colors correspond to the ChemGE and MOLGENGO results, respectively.

Figure 5. Change of average similarities of generated molecules measured by the Jaccard coefficient. Red and blue lines correspond to MOLGENGO and ChemGE, respectively.

The average similarities of banks increased and converged at an early stage, within 7 h, with ChemGE (Figure 5). In contrast, the average similarity of the MOLGENGO banks rose gradually until the end of the simulations and saturated near 0.4 (Figure 5). When λ_max,target = 400 and 600 nm, the average similarities of the final banks of ChemGE simulations were almost twice those of MOLGENGO. Preservation of diversity with MOLGENGO explains its slower convergence and broader search in the chemical space. In addition, because MOLGENGO covered a broader space than ChemGE, it may have had more chances to discover compounds whose λ_max,pred was closer to λ_max,target and whose f_max was higher.

2.3. Chemical Space Coverage

t-Distributed stochastic neighbor embedding (t-SNE) visualization was utilized to show how widely MOLGENGO searches chemical space (Figure 6). t-SNE is a statistical method for visualizing high-dimensional data by giving each data point a coordinate in a two- or three-dimensional map.³⁰ A 4096-dimensional ECFP4 vector was projected onto two-dimensional space to represent structural diversity. The final bank of each run formed clusters, which extended from the first banks (Figure 6). Most molecules in the final banks resided outside of ZINC-250k. This shows that MOLGENGO could discover novel molecules that are not present in the initial DB.³¹

Figure 6. t-SNE visualization of molecules in first and final banks with target λ_max (a) 200 nm, (b) 400 nm, and (c) 600 nm. Magenta and blue digits represent molecules included in the first and final banks, respectively. Numbers indicate the indices of runs. A 4096-dimensional ECFP4 vector was projected onto a two-dimensional space.

2.4. Optimization of Objective Function

As the first trial for the weight parameter of the objective function, we tried three values: 0.1, 1.0, and 10.0.

Desirable optimization results must satisfy two conditions: high f_max,pred and convergence to target λ_max. Excitation energy optimization results with w = 0.1 did not converge to λ_max,target = 200 nm (Figure 7a) and 600 nm (Figure 7c). Optimization results with both w = 1.0 and w = 10.0 converged close to the target λ_max for all target λ_max (Figure 7a–c). However, f_max,pred optimization results with w = 1.0 were higher than those with w = 10.0 for all target λ_max. When λ_max,target = 200 nm, f_max,pred with w = 1.0 was twice f_max,pred with w = 10.0 (Figure 7d). Similarly, when λ_max,target = 400 nm, f_max,pred with w = 1.0 was 0.7 higher than that with w = 10.0 (Figure 7e). When λ_max,target = 600 nm, f_max,pred with w = 1.0 was also higher than that with w = 10.0 (Figure 7f). Thus, we determined the best weight of eq 1 as 1.0 among the three tested values. This hyperparameter optimization is not extensive, and more rigorous and systematic parameter tuning is required for more accurate results.

Figure 7. Electronic property optimization results with different weight coefficients. Red, blue, and green represent the results obtained with w = 0.1, 1.0, and 10.0, respectively. λ_max,pred for target λ_max (a) 200 nm, (b) 400 nm, and (c) 600 nm during searches. f_max,pred for target λ_max (d) 200 nm, (e) 400 nm, and (f) 600 nm during searches.

2.5. Optimization of Highest Occupied Molecular Orbital (HOMO)–Lowest Unoccupied Molecular Orbital (LUMO) Gap and Its Oscillator Strength

In solution, the wavelength corresponding to the HOMO–LUMO gap (λ_HOMO–LUMO) and its oscillator strength (f_HOMO–LUMO) play important roles in fluorescence.³² We optimized λ_HOMO–LUMO and f_HOMO–LUMO in the same manner as λ_max and f_max (Figure 8). For all λ_target, λ_HOMO–LUMO entered the stationary stage after 50 h. The convergence values were the same for all calculations with λ_target = 200, 400, and 600 nm. Except for λ_target = 200 nm, f_HOMO–LUMO values exceeded 1.0. This also supports the possibility of discovering favorable fluorophores using MOLGENGO.

Figure 8. Optimization of (a) λ_HOMO–LUMO and (b) f_HOMO–LUMO for λ_target = 200 (magenta), 400 (navy), and 600 (yellow) nm.

Validation of optimized molecules is described in the Supporting Information.

2.6. Validation of Optimization Results Using TD-DFT Calculations

We executed TD-DFT calculations of molecules generated by MOLGENGO simulations and obtained their maximum oscillator strength (f_max,TD-DFT) and its corresponding wavelength (λ_max,TD-DFT) to verify the properties of novel fluorophores discovered by MOLGENGO. Quantum calculation results from PubChemQC are based on the optimized ground-state geometry. Molecules in the final banks whose f_max,TD-DFT values exceeded 0.1 were divided into nine clusters using the k-means clustering algorithm³³ on the ECFP4 fingerprints of the molecules folded into 4096 bits. From each cluster, the molecule with the lowest |λ_max,TD-DFT – λ_max,pred| was selected. In summary, 27 molecules, 9 for each of λ_max,target = 200, 400, and 600 nm, are displayed (Figures 9–11). All 27 molecules are novel molecules, which are not in PubChem.
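The selection step described above (keep molecules with f_max,TD-DFT above 0.1, then take from each cluster the molecule whose TD-DFT wavelength best matches the prediction) can be sketched as follows. This is a minimal illustration, not the authors' script: cluster labels are assumed to be precomputed by k-means on the filtered set, and the dictionary keys and function name are hypothetical.

```python
def select_representatives(molecules, f_cutoff=0.1):
    """Return {cluster: molecule} with the smallest |lambda_tddft - lambda_pred|.

    molecules: iterable of dicts with the hypothetical keys
    'cluster', 'f_tddft', 'lambda_tddft', and 'lambda_pred'.
    Molecules at or below the oscillator-strength cutoff are skipped as a guard.
    """
    best = {}
    for mol in molecules:
        if mol["f_tddft"] <= f_cutoff:
            continue
        error = abs(mol["lambda_tddft"] - mol["lambda_pred"])
        current = best.get(mol["cluster"])
        if current is None or error < current[0]:
            best[mol["cluster"]] = (error, mol)
    return {cluster: mol for cluster, (error, mol) in best.items()}


# Made-up records for illustration only.
example = [
    {"name": "a", "cluster": 0, "f_tddft": 0.5, "lambda_tddft": 410.0, "lambda_pred": 400.0},
    {"name": "b", "cluster": 0, "f_tddft": 0.9, "lambda_tddft": 402.0, "lambda_pred": 400.0},
    {"name": "c", "cluster": 1, "f_tddft": 0.05, "lambda_tddft": 600.0, "lambda_pred": 600.0},
    {"name": "d", "cluster": 1, "f_tddft": 0.3, "lambda_tddft": 580.0, "lambda_pred": 600.0},
]
picked = select_representatives(example)
```

With nine clusters per target wavelength and three targets, this procedure yields the 27 displayed molecules.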

Figure 9. Novel molecules found by MOLGENGO with λ_max,target = 200 nm. TD-DFT results of λ_max and f_max are shown below the molecular structures.

Figure 11. Novel molecules found by MOLGENGO with λ_max,target = 600 nm. TD-DFT results of λ_max and f_max are shown below the molecular structures.

When λ_max,target = 200 nm, all absolute deviations between λ_max,TD-DFT and λ_max,target, |λ_max,TD-DFT – λ_max,target|, were less than 8 nm and all f_max values exceeded 0.3 (Figure 9). The deviations |λ_max,TD-DFT – λ_max,target| of molecules for λ_max,target = 200 nm were smaller than those of compounds for λ_max,target = 400 and 600 nm. The number of molecules that satisfy |λ_max,TD-DFT – λ_max,target| < 10 nm was 9, 6, and 2 for 200, 400, and 600 nm, respectively. The λ_max values of molecules in PubChemQC are densely populated around 200 nm²⁸ (Figure S1). This may have led to the better prediction accuracy of λ_max near 200 nm, and it may also be related to the better TD-DFT results of λ_max for this target compared to λ_max,target = 400 and 600 nm.

The structures of novel molecules found with λ_max,target = 200 nm (Figure 9) are simpler and smaller than compounds with λ_max,target = 400 and 600 nm (Figures 10 and 11). Molecules with λ_max near 200 nm do not have extensive π-conjugations.³⁴

Figure 10. Novel molecules found by MOLGENGO with λ_max,target = 400 nm. TD-DFT results of λ_max and f_max are shown below the molecular structures.

When λ_max,target = 400 nm, all |λ_max – λ_max,target| values were under 50 nm and all f_max values exceeded 0.3 (Figure 10). We found five molecules whose f_max was over 1.0 and two molecules with f_max exceeding 3.0 (Figure 10d,g). The molecule in Figure 10g may have a high quantum yield because it has an f_max over 3.0 and a rigid structure that prevents excited-state molecular twisting.^35,36

When λ_max,target = 600 nm, all |λ_max – λ_max,target| values were under 50 nm and all f_max values were over 0.1 (Figure 11). We found one molecule whose f_max was over 1.0. The small number of PubChemQC molecules whose λ_max is in the range of 600 ± 10 nm^7,28 (Figure S1) appears to be a reason for the relatively worse results in discovering novel molecules with λ_max ≈ 600 nm.

2.7. Limitation of the Current Study

First, we identified that some of the generated molecules appear to be hard to synthesize. To overcome this problem, two approaches will be tried in future studies. One is to incorporate scores that measure the synthesizability of molecules, such as SA-score³⁷ or RA-score,³⁸ directly into the objective function. The other is to screen the generated molecules afterward based on synthesizability and hand-crafted rules drawn from synthetic chemists’ expertise.

Second, the accuracy of MOLGENGO is tightly coupled with the PubChemQC database. Therefore, MOLGENGO may not accurately explore the near-infrared (NIR)-I/II region, which has been drawing much attention recently.³⁹ Molecules whose excitation energy and oscillator strength lie in the NIR-I/II region account for less than 0.1% of our training set. This severely limits efficient and accurate prediction in the corresponding region of chemical space. Once more data on molecules in the NIR-I/II region are accumulated, we will be able to explore the NIR-I/II region accurately.

3. Conclusions

In this work, we developed a new molecular discovery approach by combining the global optimization method and LGBM predictors to find molecules with high oscillator strength and targeted excitation energy. Unlike previous approaches,¹⁹⁻²¹ which used quantum calculation, we used machine learning to characterize the desired electronic transition properties. We also performed global optimization of properties on chemical space to consider the diversity. LGBM models predicted f_max and λ_max efficiently without deteriorating the accuracy of predictions. MOLGENGO successfully found novel molecules with λ_max,pred = 200, 400, and 600 nm and f_max,pred over 3.0. We identified that MOLGENGO covers a wide range of chemical space outside the existing databases’ coverage. Many novel molecules with high f_max and desired λ_max were found, and they were verified via TD-DFT calculations. We expect that MOLGENGO is an efficient tool for discovering novel molecules, which can be candidates for favorable fluorophores.

From the experimental results, the limitations of the current version of MOLGENGO are identified. First, the results of MOLGENGO targeting the longer absorption wavelength, 600 nm, were worse than those targeting the shorter absorption wavelengths, 200 or 400 nm. We believe that this is due to the bias of the training set, PubChemQC. The majority of molecules in PubChemQC have their absorption wavelength in a relatively short-wavelength region, i.e., shorter than 600 nm. This bias of input data appears to be the reason for relatively worse results for molecules targeting absorption wavelength in the IR region. To overcome this limitation, more information on molecules with longer absorption wavelengths is necessary. Second, many newly discovered molecules appear to be hard to synthesize. MOLGENGO performs global optimization of molecular properties in a combinatorial fashion. Thus, unlike generative models, it does not require any training or assumption on chemical structure. Instead, a global optimization approach heavily relies on an objective function and assumes that the objective function quantifies the quality of a molecule accurately. Currently, synthetic accessibility is not considered in the objective functions used in this study. Therefore, incorporating synthetic accessibility scores^37,38 will help MOLGENGO generate more synthesizable molecules.

4. Method

4.1. Overview of Workflow

In this study, we applied the CSA global optimization algorithm to discover novel fluorescent molecules. CSA is a powerful global optimization approach and includes components of GA.^25,40−42 The flow chart of MOLGENGO is illustrated in Figure 1.

The inputs of the algorithm were given in the simplified molecular input line entry system (SMILES) format,⁴³ which represents molecules as strings.⁴³ However, strings are not efficient to handle with genetic operators because the grammatical rules of SMILES are so complex that SMILES-based genetic operators easily generate many invalid molecules.⁴⁴ Thus, we converted SMILES strings to integer arrays using simple grammatical rules used in the context-free grammar method^29,45 (Figure 1). An example of converting a SMILES string to an integer array using context-free grammar is given in the Supporting Information. Integer-based genetic operations allow larger changes of molecules compared to string-based genetic operations.²⁹
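To illustrate how an integer array can encode a string through a context-free grammar, the sketch below decodes integers into a short atom chain. The two-rule grammar is a toy written for this example and is not the SMILES grammar used by ChemGE or MOLGENGO. Selecting a production rule by taking each integer modulo the number of available rules is the usual grammatical-evolution convention and is assumed here.

```python
# Toy grammar (hypothetical): a chain is one atom, or an atom followed by a chain.
TOY_GRAMMAR = {
    "<chain>": [["<atom>"], ["<atom>", "<chain>"]],
    "<atom>": [["C"], ["N"], ["O"]],
}


def decode(genes, grammar=TOY_GRAMMAR, start="<chain>"):
    """Expand the leftmost nonterminal with rule number (gene mod rule count).

    Returns the decoded string, or None when the genes run out before all
    nonterminals are expanded. Unused trailing genes are ignored.
    """
    pending = [start]          # symbols still to process, leftmost first
    output = []
    gene_iter = iter(genes)
    while pending:
        symbol = pending.pop(0)
        if symbol not in grammar:
            output.append(symbol)      # terminal symbol: emit as-is
            continue
        gene = next(gene_iter, None)
        if gene is None:
            return None                # incomplete derivation
        rules = grammar[symbol]
        pending = list(rules[gene % len(rules)]) + pending
    return "".join(output)


decoded = decode([1, 0, 0, 2])   # chain -> atom chain -> C chain -> C atom -> C O
```

Because every integer maps to some production rule, changing a single integer always yields a grammatical derivation step, which is why integer-level operators produce fewer syntactically broken strings than direct character edits.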

New gene populations, molecules, were generated at each generation from the initial bank to the final bank using genetic operations. A bank includes a fixed number of genes. Two types of genetic operators were used to create diverse genes. One is mutation type and the other is crossover type (Figure 12).

Figure 12. Genetic operators used in MOLGENGO: mutation and crossover.

One of the crucial features of MOLGENGO is that it keeps the diversity of its population by (1) defining a distance measure and setting the criterion of the similarity between two genes as D_cut and (2) using D_cut to control the diversity of the bank while D_cut is slowly decreased from the first value to the final value.²⁵ Details of the CSA method in MOLGENGO will be described in the CSA Algorithm section.

Our method covered only neutral molecules composed of H, B, C, N, O, F, and Cl atoms.⁷

4.2. Data Set

The PubChemQC database was used to train LGBM regressors.²⁸ We randomly sampled 0.5 million molecules from PubChemQC. The data set was split in a 9:1 ratio to generate the training and test sets.⁷ The first banks were selected from the ZINC-250k set, which was first compiled by Kusner and co-workers and consists of 0.25 million molecules randomly selected from ZINC.^31,46 ZINC-250k was also used to sample starting molecules in other studies, namely ChemGE and GB-GA.^20,29,47

4.3. Descriptors and Prediction Models for Objective Function

We applied the LGBM²⁴ algorithm to predict f_max and its corresponding λ_max. Our previous research developed random forest (RF) machines to predict a given molecule’s maximum oscillator strength and the corresponding excitation energy.⁷ However, a more efficient prediction method with comparable accuracy was necessary to perform an extensive search with CSA. To satisfy this requirement, we trained the prediction models using LGBM.²⁴

To convert a molecule into a feature vector, we utilized three descriptors: the extended connectivity fingerprint with a diameter of 4 (ECFP4),⁴⁸ MACCS keys,⁴⁹ and RDKit molecular properties.⁵⁰ ECFP is a circular fingerprint for molecular characterization that efficiently accounts for the relationships between molecular substructures. The MACCS keys are binary fingerprints that indicate the presence of 166 crucial molecular substructures. The RDKit molecular descriptors contain real-valued molecular features such as molecular weight, charge, and many more. The list of RDKit molecular descriptors used is given in the Supporting Information (RDkit Molecular Descriptors used for LGBM training section). In summary, 4301-dimensional vectors were used as input features: 4096 bits of ECFP4, 166 MACCS keys, and 39 RDKit molecular descriptors. In our LGBM model, the number of estimators was 1000, the number of data points per leaf node was 50, and the feature fraction was 1/3.
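The feature layout and model settings stated above can be summarized in a small configuration sketch. The parameter names follow LightGBM's scikit-learn interface (`n_estimators`, `min_child_samples`, `colsample_bytree`); mapping the paper's "data points per leaf node" and "feature fraction" onto these names is an assumption, and `concat_features` is a hypothetical helper that only checks and joins precomputed descriptor blocks.

```python
# Descriptor block sizes as stated in the text.
FEATURE_BLOCKS = {"ecfp4_bits": 4096, "maccs_keys": 166, "rdkit_descriptors": 39}
N_FEATURES = sum(FEATURE_BLOCKS.values())   # total input dimension

# Assumed mapping of the reported settings onto LightGBM parameter names.
LGBM_PARAMS = {
    "n_estimators": 1000,        # number of estimators
    "min_child_samples": 50,     # data points per leaf node (assumed mapping)
    "colsample_bytree": 1 / 3,   # feature fraction (assumed mapping)
}


def concat_features(ecfp4, maccs, rdkit_desc):
    """Join precomputed descriptor blocks into one feature vector."""
    blocks = (ecfp4, maccs, rdkit_desc)
    for block, (name, size) in zip(blocks, FEATURE_BLOCKS.items()):
        if len(block) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(block)}")
    return list(ecfp4) + list(maccs) + list(rdkit_desc)


vector = concat_features([0] * 4096, [0] * 166, [0.0] * 39)
```

Such a dictionary could be passed as keyword arguments to a LightGBM regressor, with one model trained for f_max and another for the excitation energy.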

4.4. CSA Algorithm

The bank size of CSA was set to 100. SMILES representations were converted to 300-dimensional integer vectors using grammatical evolution (GE).²⁹ Thus, each bank contains 100 integer vectors. The number of vectors in a bank was maintained identical throughout the sampling.

Here, we aimed to optimize f_max and its corresponding λ_max. Ten MOLGENGO runs were performed for three target λ_max values: 200, 400, and 600 nm. Simulations were performed for 7 days to guarantee their convergence to target wavelength using a machine with Intel Xeon CPU E5-2650 v4 (2.20 GHz 1 core, 1 processor, 128 GB memory). We also performed 10 ChemGE²⁹ simulations to compare performance with MOLGENGO.

4.4.1. New Gene Generation

To generate new genes, we randomly selected 50 seed genes from the current bank. Then, we applied three operators, one mutation and two crossovers, to create new genes.²⁵ As a result, 30 chemically valid child genes were generated from each seed gene: 10 by mutation, 10 by crossover1, and 10 by crossover2 (Figure 12).

The mutation operator mutated up to three randomly selected variables of each seed. The crossover1 operator performed a crossover between a seed gene and a randomly selected gene from the current bank. The size of the crossover did not exceed half of the total number of variables. The crossover2 operator performed a crossover between a seed and a randomly selected gene from the first bank. The size of the crossover did not exceed 20% of the total number of variables.
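The three operators can be sketched on plain integer lists as below. Several details are assumptions made for illustration: the gene value range (`GENE_MAX`), the use of one contiguous segment copied at the same positions for crossover, and the function names. The chemical validity check applied to each child in the actual workflow is omitted.

```python
import random

GENE_LENGTH = 300   # integer vector length stated in the text
GENE_MAX = 255      # hypothetical upper bound for a gene value


def mutate(seed, rng, max_sites=3):
    """Replace up to `max_sites` randomly chosen variables with random values."""
    child = list(seed)
    n_sites = rng.randint(1, max_sites)
    for idx in rng.sample(range(len(child)), n_sites):
        child[idx] = rng.randint(0, GENE_MAX)
    return child


def crossover(seed, partner, rng, max_fraction):
    """Copy one contiguous segment of `partner` into a copy of `seed`.

    The segment length never exceeds max_fraction of the gene length:
    0.5 for crossover1 (partner from the current bank) and
    0.2 for crossover2 (partner from the first bank).
    """
    child = list(seed)
    max_len = max(1, int(len(child) * max_fraction))
    seg_len = rng.randint(1, max_len)
    start = rng.randint(0, len(child) - seg_len)
    child[start:start + seg_len] = partner[start:start + seg_len]
    return child


rng = random.Random(0)
seed_gene = [0] * GENE_LENGTH
partner_gene = [1] * GENE_LENGTH
mutant = mutate(seed_gene, rng)
child1 = crossover(seed_gene, partner_gene, rng, 0.5)   # crossover1
child2 = crossover(seed_gene, partner_gene, rng, 0.2)   # crossover2
```

Drawing the crossover2 partner from the first bank reintroduces material from the starting population, which is one way the search can keep its diversity late in a run.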

4.4.2. Bank Update and D_cut Control

All child genes were used to update the bank one at a time as follows. For each child gene G_child, its distances to all of the bank solutions were measured to identify its closest neighbor gene G_closest. If the distance between them, d(G_child,G_closest), was less than or equal to D_cut and the objective of G_child, S(G_child), was better than S(G_closest), G_child replaced G_closest.

The distance between two genes was defined as 1 – J_c, where J_c is the Jaccard coefficient, also known as Tanimoto similarity, between the two genes. The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets. If d(G_child,G_closest) > D_cut and S(G_child) was better than that of the worst gene (G_worst) in the current bank, G_child replaced G_worst. Otherwise, G_child was abandoned.
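The bank update rule can be sketched as follows. Each bank member is represented here as a set of features (for example, the on-bit indices of a fingerprint), which is an assumption about how the Jaccard coefficient is evaluated; treating a higher score as better and the function names are also illustrative choices.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two feature sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def update_bank(bank, scores, child, child_score, d_cut):
    """Try to place `child` in the bank; return True if it replaced a member.

    If the child is within d_cut of its closest neighbor, it competes with that
    neighbor; otherwise it competes with the worst member of the bank.
    """
    dists = [jaccard_distance(child, member) for member in bank]
    closest = min(range(len(bank)), key=dists.__getitem__)
    if dists[closest] <= d_cut:
        target = closest
    else:
        target = min(range(len(bank)), key=scores.__getitem__)
    if child_score > scores[target]:
        bank[target] = child
        scores[target] = child_score
        return True
    return False


bank = [{1, 2, 3}, {7, 8, 9}]
scores = [1.0, 0.5]
near_better = update_bank(bank, scores, {1, 2, 4}, 2.0, d_cut=0.6)   # replaces closest
far_better = update_bank(bank, scores, {20, 21}, 0.7, d_cut=0.6)     # replaces worst
near_worse = update_bank(bank, scores, {1, 2, 5}, 0.1, d_cut=0.6)    # abandoned
```

Because a child that is similar to an existing member can only displace that member, the bank cannot fill up with near-duplicates of one good solution while D_cut is large.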

At each generation, D_cut was reduced by a constant factor R_D. At the first bank, D_cut started at D_cut,init = D_mean,init/2, where D_mean,init is the mean distance among the first-bank genes. After each CSA iteration, D_cut was multiplied by R_D = 0.999995945357139, so that D_cut reached D_mean,init/3 after 100 000 generations.

In summary, eq 2 represents the D_cut of the nth generation

D_cut(n) = D_cut,init × (R_D)^n = (D_mean,init/2) × (R_D)^n    (2)

The value of R_D controls the annealing schedule of CSA. D_cut played the role of the temperature in conventional simulated annealing.
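The annealing schedule can be checked numerically. The sketch below applies the stated reduction factor n times to the initial cutoff D_mean,init/2 and confirms that the cutoff reaches about D_mean,init/3 after 100 000 generations. The function name `d_cut` and the example mean distance of 0.9 are illustrative, and how the cutoff behaves beyond 100 000 generations is not specified in the text.

```python
R_D = 0.999995945357139   # per-generation reduction factor stated in the text


def d_cut(n, d_mean_init, r_d=R_D):
    """Cutoff distance after n generations, starting from d_mean_init / 2."""
    return (d_mean_init / 2.0) * r_d ** n


d_mean_init = 0.9                      # made-up mean distance of a first bank
start = d_cut(0, d_mean_init)          # 0.45, i.e. d_mean_init / 2
end = d_cut(100_000, d_mean_init)      # approximately 0.30, i.e. d_mean_init / 3
ratio = R_D ** 100_000                 # approximately 2/3
```

The factor is therefore the 100 000th root of 2/3, chosen so that the cutoff shrinks from one-half to one-third of the initial mean distance over the run.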

4.5. TD-DFT Calculations of the Generated Molecules

Quantum mechanical (QM) calculations were executed to verify the desired electronic transition properties of the designed molecules. We utilized the TeraChem program, whose computational efficiency is accelerated by GPU.^51,52 Because our LGBM models were trained with PubChemQC, QM calculations followed the procedure of the PubChemQC paper.²⁸ Density functional theory (DFT) calculations with the B3LYP functional and the 6-31G* basis set were used to optimize the geometries of molecules in the ground state.⁵³ Next, we applied time-dependent density functional theory (TD-DFT) calculations with B3LYP/6-31+G* to predict up to 10 excitation levels and identified f_max and the corresponding λ_max.⁵⁴ We used the VWN5 correlation for B3LYP to be compatible with GAMESS, which was used in PubChemQC.^28,51,55
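Identifying f_max and the corresponding λ_max from a list of computed excitations amounts to picking the transition with the largest oscillator strength and converting its energy to a wavelength with λ = hc/E. The sketch below assumes the excitations are available as (energy in eV, oscillator strength) pairs; the function name and the example values are hypothetical.

```python
HC_EV_NM = 1239.841984   # Planck constant times speed of light, in eV·nm


def strongest_transition(excitations):
    """Return (lambda_max in nm, f_max) for the highest-oscillator-strength state.

    excitations: non-empty list of (energy_eV, oscillator_strength) pairs,
    e.g. the excited states reported by a TD-DFT calculation.
    """
    energy_ev, f_max = max(excitations, key=lambda state: state[1])
    return HC_EV_NM / energy_ev, f_max


# Made-up excitations: the second state is the brightest.
lambda_max, f_max = strongest_transition([(3.1, 0.2), (6.2, 1.5), (6.9, 0.4)])
```

The same conversion relates the excitation energies predicted by the regressors to the target wavelengths of 200, 400, and 600 nm (about 6.2, 3.1, and 2.07 eV).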

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2016M3C4A7952630, NRF-2019M3E5D4066898 and NRF-2020M3A9G7103933). B.K. was supported by the BK21Plus Program funded by the Ministry of Education, Republic of Korea (21A20131312240). This work was supported by the Korea Environment Industry & Technology Institute (KEITI) through the Technology Development Project for Safety Management of Household Chemical Products, funded by the Korea Ministry of Environment (MOE) (KEITI:2020002960002 and NTIS:1485017120).

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.1c04347.

Example of converting a SMILES string to an integer array using a context-free grammar, RDKit molecular descriptors used for LGBM training, excitation energy distribution of PubChemQC, and validation of HOMO–LUMO gap optimization results using TD-DFT calculations (PDF)

The authors declare no competing financial interest.

Notes

The MOLGENGO program and the zinc-250k data file are available free of charge at our homepage (http://galaxy.seoklab.org/suppl/molgengo.html).

Supplementary Material

ao1c04347_si_001.pdf^{(190.9KB, pdf)}

References

(1) Murphy K. R.; Stedmon C. A.; Wenig P.; Bro R. OpenFluor – an online spectral library of auto-fluorescence by organic compounds in the environment. Anal. Methods 2014, 6, 658–661. 10.1039/C3AY41935E.
(2) Cartlidge E. The light fantastic. Science 2018, 359, 382–385. 10.1126/science.359.6374.382.
(3) Zinchuk V.; Grossenbacher-Zinchuk O. Recent advances in quantitative colocalization analysis: Focus on neuroscience. Prog. Histochem. Cytochem. 2009, 44, 125–172. 10.1016/j.proghi.2009.03.001.
(4) Evanko D. A ’flaky’ but useful fluorophore. Nat. Methods 2005, 2, 160–161. 10.1038/nmeth0305-160b.
(5) Martin S. F.; Tatham M. H.; Hay R. T.; Samuel I. D. Quantitative analysis of multi-protein interactions using FRET: Application to the SUMO pathway. Protein Sci. 2008, 17, 777–784. 10.1110/ps.073369608.
(6) Moczko E.; Mirkes E. M.; Cáceres C.; Gorban A. N.; Piletsky S. Fluorescence-based assay as a new screening tool for toxic chemicals. Sci. Rep. 2016, 6, 33922. 10.1038/srep33922.
(7) Kang B.; Seok C.; Lee J. Prediction of Molecular Electronic Transitions Using Random Forests. J. Chem. Inf. Model. 2020, 60, 5984–5994. 10.1021/acs.jcim.0c00698.
(8) Kim E.; Lee Y.; Lee S.; Park S. B. Discovery, Understanding, and Bioapplication of Organic Fluorophore: A Case Study with an Indolizine-Based Novel Fluorophore, Seoul-Fluor. Acc. Chem. Res. 2015, 48, 538–547. 10.1021/ar500370v.
(9) Fahrni C. J. Biological applications of X-ray fluorescence microscopy: exploring the subcellular topography and speciation of transition metals. Curr. Opin. Chem. Biol. 2007, 11, 121–127. 10.1016/j.cbpa.2007.02.039.
(10) Fahrni C. J. Fluorescent Probes and Labels for Cellular Imaging. CHIMIA Int. J. Chem. 2009, 63, 714–720. 10.2533/chimia.2009.714.
(11) Morgan M. T.; McCallum A. M.; Fahrni C. J. Rational design of a water-soluble, lipid-compatible fluorescent probe for Cu(i) with sub-part-per-trillion sensitivity. Chem. Sci. 2016, 7, 1468–1473. 10.1039/C5SC03643G.
(12) Kobayashi H.; Ogawa M.; Alford R.; Choyke P. L.; Urano Y. New Strategies for Fluorescent Probe Design in Medical Diagnostic Imaging. Chem. Rev. 2010, 110, 2620–2640. 10.1021/cr900263j.
(13) Loudet A.; Burgess K. BODIPY Dyes and Their Derivatives: Syntheses and Spectroscopic Properties. Chem. Rev. 2007, 107, 4891–4932. 10.1021/cr078381n.
(14) Mishra A.; Behera R. K.; Behera P. K.; Mishra B. K.; Behera G. B. Cyanines during the 1990s: A Review. Chem. Rev. 2000, 100, 1973–2012. 10.1021/cr990402t.
(15) Swanson L.; Kuypers H. A direct projection from the ventromedial nucleus and retrochiasmatic area of the hypothalamus to the medulla and spinal cord of the rat. Neurosci. Lett. 1980, 17, 307–312. 10.1016/0304-3940(80)90041-5.
(16) Stefanachi A.; Leonetti F.; Pisani L.; Catto M.; Carotti A. Coumarin: A Natural, Privileged and Versatile Scaffold for Bioactive Compounds. Molecules 2018, 23, 250. 10.3390/molecules23020250.
(17) Kubin R.; Fletcher A. Fluorescence quantum yields of some rhodamine dyes. J. Lumin. 1982, 27, 455–462. 10.1016/0022-2313(82)90045-X.
(18) Song H.-O.; Lee B.; Bhusal R. P.; Park B.; Yu K.; Chong C.-K.; Cho P.; Kim S. Y.; Kim H. S.; Park H. Development of a Novel Fluorophore for Real-Time Biomonitoring System. PLoS One 2012, 7, e48459. 10.1371/journal.pone.0048459.
(19) Sumita M.; Yang X.; Ishihara S.; Tamura R.; Tsuda K. Hunting for Organic Molecules with Artificial Intelligence: Molecules Optimized for Desired Excitation Energies. ACS Cent. Sci. 2018, 4, 1126–1133. 10.1021/acscentsci.8b00213.
(20) Henault E. S.; Rasmussen M. H.; Jensen J. H. Chemical space exploration: how genetic algorithms find the needle in the Haystack. PeerJ Phys. Chem. 2020, 2, e11. 10.7717/peerj-pchem.11.
(21) Leguy J.; Cauchy T.; Glavatskikh M.; Duval B.; Mota B. D. EvoMol: a flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation. J. Cheminf. 2020, 12, 55. 10.1186/s13321-020-00458-z.
(22) Dral P. O.; Barbatti M. Molecular excited states through a machine learning lens. Nat. Rev. Chem. 2021, 388. 10.1038/s41570-021-00278-1.
(23) Grimme S.; Bannwarth C. Ultra-fast computation of electronic spectra for large systems by tight-binding based simplified Tamm-Dancoff approximation (sTDA-xTB). J. Chem. Phys. 2016, 145, 054103. 10.1063/1.4959605.
(24) Ke G.; Meng Q.; Finley T.; Wang T.; Chen W.; Ma W.; Ye Q.; Liu T.-Y. In LightGBM: A Highly Efficient Gradient Boosting Decision Tree, Advances in Neural Information Processing Systems, 2017.
(25) Joung I.; Kim J. Y.; Gross S. P.; Joo K.; Lee J. Conformational Space Annealing explained: A general optimization algorithm, with diverse applications. Comput. Phys. Commun. 2018, 223, 28–33. 10.1016/j.cpc.2017.09.028.
(26) Shin W.-H.; Kim J.-K.; Kim D.-S.; Seok C. GalaxyDock2: Protein-ligand docking using beta-complex and global optimization. J. Comput. Chem. 2013, 34, 2647–2656. 10.1002/jcc.23438.
(27) Floudas C. A.; Gounaris C. E. A review of recent advances in global optimization. J. Global Optim. 2009, 45, 3–38. 10.1007/s10898-008-9332-8.
(28) Nakata M.; Shimazaki T. PubChemQC Project: A Large-Scale First-Principles Electronic Structure Database for Data-Driven Chemistry. J. Chem. Inf. Model. 2017, 57, 1300–1308. 10.1021/acs.jcim.7b00083.
(29) Yoshikawa N.; Terayama K.; Sumita M.; Homma T.; Oono K.; Tsuda K. Population-based De Novo Molecule Generation, Using Grammatical Evolution. Chem. Lett. 2018, 47, 1431–1434. 10.1246/cl.180665.
(30) Pezzotti N.; Thijssen J.; Mordvintsev A.; Hollt T.; Lew B. V.; Lelieveldt B. P.; Eisemann E.; Vilanova A. GPGPU Linear Complexity t-SNE Optimization. IEEE Trans. Visualization Comput. Graphics 2020, 26, 1172–1181. 10.1109/TVCG.2019.2934307.
(31) Kusner M. J.; Paige B.; Hernández-Lobato J. M. In Grammar Variational Autoencoder, Proceedings of the 34th International Conference on Machine Learning, 2017; pp 1945–1954.
(32) Valeur B.; Berberan-Santos M. Molecular Fluorescence: Principles and Applications; Wiley, 2012.
(33) Pelleg D.; Moore A. In Accelerating Exact k-Means Algorithms with Geometric Reasoning, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’99, 1999.
(34) Yin N.; Wang L.; Lin Y.; Yi J.; Yan L.; Dou J.; Yang H.-B.; Zhao X.; Ma C.-Q. Effect of the π-conjugation length on the properties and photovoltaic performance of A−π–D−π–A type oligothiophenes with a 4, 8-bis (thienyl) benzo [1, 2-b: 4, 5-b] dithiophene core. Beilstein J. Org. Chem. 2016, 12, 1788–1797. 10.3762/bjoc.12.169.
(35) Sun F.; Jin R. DFT and TD-DFT study on the optical and electronic properties of derivatives of 1,4-bis(2-substituted-1,3,4-oxadiazole)benzene. Arabian J. Chem. 2017, 10, S2988–S2993. 10.1016/j.arabjc.2013.11.037.
(36) Suhina T.; Amirjalayer S.; Mennucci B.; Woutersen S.; Hilbers M.; Bonn D.; Brouwer A. M. Excited-State Decay Pathways of Molecular Rotors: Twisted Intermediate or Conical Intersection? J. Phys. Chem. Lett. 2016, 7, 4285–4290. 10.1021/acs.jpclett.6b02277.
(37) Ertl P.; Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminf. 2009, 1, 8. 10.1186/1758-2946-1-8.
(38) Thakkar A.; Chadimová V.; Bjerrum E. J.; Engkvist O.; Reymond J.-L. Retrosynthetic accessibility score (RAscore) – rapid machine learned synthesizability classification from AI driven retrosynthetic planning. Chem. Sci. 2021, 12, 3339–3349. 10.1039/D0SC05401A.
(39) Schnermann M. J. Organic dyes for deep bioimaging. Nature 2017, 551, 176–177. 10.1038/nature24755.
(40) Lee J.; Scheraga H. A.; Rackovsky S. New optimization method for conformational energy calculations on polypeptides: Conformational space annealing. J. Comput. Chem. 1997, 18, 1222–1232.
(41) Lee J.; Lee I.-H.; Joung I.; Lee J.; Brooks B. R. Finding multiple reaction pathways via global optimization of action. Nat. Commun. 2017, 8, 15443. 10.1038/ncomms15443.
(42) Lee J.; Zhang Z.-Y.; Lee J.; Brooks B. R.; Ahn Y.-Y. Inverse Resolution Limit of Partition Density and Detecting Overlapping Communities by Link-Surprise. Sci. Rep. 2017, 7, 12399. 10.1038/s41598-017-12432-1.
(43) Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 1988, 28, 31–36. 10.1021/ci00057a005.
(44) Brown N.; Fiscato M.; Segler M. H.; Vaucher A. C. GuacaMol: Benchmarking Models for de Novo Molecular Design. J. Chem. Inf. Model. 2019, 59, 1096–1108. 10.1021/acs.jcim.8b00839.
(45) Dempsey I.; O’Neill M.; Brabazon A. Foundations in Grammatical Evolution for Dynamic Environments; Springer, 2009; Vol. 194.
(46) Maziarka Ł.; Pocha A.; Kaczmarczyk J.; Rataj K.; Danel T.; Warchoł M. Mol-CycleGAN: a generative model for molecular optimization. J. Cheminf. 2020, 12, 2. 10.1186/s13321-019-0404-1.
(47) Sterling T.; Irwin J. J. ZINC 15 – Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324–2337. 10.1021/acs.jcim.5b00559.
(48) Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t.
(49) Durant J. L.; Leland B. A.; Henry D. R.; Nourse J. G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280. 10.1021/ci010132r.
(50) Landrum G. RDKit: Open-Source Cheminformatics Software, 2016.
(51) Ufimtsev I. S.; Martinez T. J. Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics. J. Chem. Theory Comput. 2009, 5, 2619–2628. 10.1021/ct9003004.
(52) Titov A. V.; Ufimtsev I. S.; Luehr N.; Martinez T. J. Generating Efficient Quantum Chemistry Codes for Novel Architectures. J. Chem. Theory Comput. 2013, 9, 213–221. 10.1021/ct300321a.
(53) Kästner J.; Carr J. M.; Keal T. W.; Thiel W.; Wander A.; Sherwood P. DL-FIND: An Open-Source Geometry Optimizer for Atomistic Simulations. J. Phys. Chem. A 2009, 113, 11856–11865. 10.1021/jp9028968.
(54) Isborn C. M.; Luehr N.; Ufimtsev I. S.; Martínez T. J. Excited-State Electronic Structure with Configuration Interaction Singles and Tamm–Dancoff Time-Dependent Density Functional Theory on Graphical Processing Units. J. Chem. Theory Comput. 2011, 7, 1814–1823. 10.1021/ct200030k.
(55) Barca G. M. J.; Bertoni C.; Carrington L.; Datta D.; De Silva N.; Deustua J. E.; Fedorov D. G.; Gour J. R.; Gunina A. O.; Guidez E.; Harville T.; Irle S.; Ivanic J.; Kowalski K.; Leang S. S.; Li H.; Li W.; Lutz J. J.; Magoulas I.; Mato J.; Mironov V.; Nakata H.; Pham B. Q.; Piecuch P.; Poole D.; Pruitt S. R.; Rendell A. P.; Roskop L. B.; Ruedenberg K.; Sattasathuchana T.; Schmidt M. W.; Shen J.; Slipchenko L.; Sosonkina M.; Sundriyal V.; Tiwari A.; Galvez Vallejo J. L.; Westheimer B.; Wloch M.; Xu P.; Zahariev F.; Gordon M. S. Recent developments in the general atomic and molecular electronic structure system. J. Chem. Phys. 2020, 152, 154102. 10.1063/5.0005188.

