Abstract
The versatility of organic molecules generates a rich design space for organic semiconductors (OSCs) considered for electronics applications. Offering unparalleled promise for materials discovery, the vastness of this design space also dictates efficient search strategies. Here, we present an active machine learning (AML) approach that explores an unlimited search space through consecutive application of molecular morphing operations. Evaluating the suitability of OSC candidates on the basis of charge injection and mobility descriptors, the approach successively queries predictive-quality first-principles calculations to build a refining surrogate model. The AML approach is optimized in a truncated test space, providing deep methodological insight by visualizing it as a chemical space network. Significantly outperforming a conventional computational funnel, the optimized AML approach rapidly identifies well-known and hitherto unknown molecular OSC candidates with superior charge conduction properties. Most importantly, it constantly finds further candidates with highest efficiency while continuing its exploration of the endless design space.
Subject terms: Computational chemistry, Organic molecules in materials science
Existing methods for organic semiconductor computational screening are limited by the computational demand of the process, leading to the identification of non-optimal material candidates. Here, the authors report machine learning method to guide the discovery of organic semiconductors.
Introduction
The sheer vastness of chemical spaces1 has long motivated prior-to-synthesis virtual discovery. In corresponding work, promising candidate molecules or materials for refined study are often searched and identified on the basis of a small number of quantities that are deemed representative for the targeted application2–4. Prevalent for first-principles computational screening approaches is to calculate such descriptors at predictive quality through electronic structure theory for every candidate in a somehow enumerated chemical space or otherwise given database. Initially performed for small focused libraries, the screening is now extended to search spaces of ever increasing size and—since discovery is limited to the explicitly considered molecules or materials—to ever more systematic and exhaustive enumerations within these spaces.
Unfortunately, the combinatorial explosion characteristic for chemical versatility quickly leads to intractable numbers of candidates for such exhaustive first-principles screenings, even if based on computationally comparably undemanding descriptors. A common strategy to tackle this problem is a computational funnel5. Here, the exhaustive screening is only performed for computationally least-demanding descriptors or even less demanding estimates thereof. Subsequently, the large candidate set is narrowed in staged filtering and the calculation of other descriptors is only performed for smaller and smaller subsets which appear promising in terms of the previously calculated descriptors. Unfortunately, chemical diversity suggests the multi-objective (descriptor) landscape spanned over the search space to be quite rugged6, with molecular or materials sub-classes likely constituting separate funnels and related analogs leading to multiple local minima. This raises concerns whether the true optimum candidates can reliably be identified through such computational funneling.
An ever more appealing alternative is therefore to completely abandon the original idea to exhaustively screen a once defined chemical space or database. Instead, the explicit first-principles computation of the descriptors is restricted to candidates emerging in an iteratively refining search7–9. In the context of data science, this is afforded by several learning concepts, which additionally allow to even avoid predefining or a priori enumerating the search space itself. Examples include (semi-)supervised learning, meta-, transfer-, or few-shot learning and generative models10,11. For drug-discovery tasks12,13, such concepts have already been successfully employed to further accelerate molecular de novo design14 and drive autonomous discovery15. For materials discovery based on first-principles descriptors, in particular active machine learning (AML)16 has been explored as a most data-efficient method17–22.
In AML, the acquired knowledge in form of explicitly calculated descriptors is used to successively establish a surrogate model of larger and larger regions of the rugged descriptor landscape. In an iterative procedure, the predictive-quality calculations for new candidates can then also be balanced between exploitation and exploration. In exploitation, the global insight provided by the current surrogate model is used for a targeted identification of new promising candidates. In exploration, descriptors for new candidates are specifically calculated to refine and extend the surrogate model. For this, we here employ Gaussian Process Regression (GPR) and use high values of its inherent Bayesian uncertainty estimate to flag candidates (or regions in chemical space) for which an explicit descriptor calculation will maximally contribute new information.
We pursue this concept for the efficient virtual discovery of organic semiconductors (OSCs) for electronic applications. Used in organic field effect transistors (OFETs),23 photovoltaics (OPVs),24 or light emitting diodes (OLEDs),25 OSCs offer great versatility and novel materials’ properties, paired with a low ecologic and economic footprint. Typical OSC-constituting molecules are, however, of considerable size (e.g., 22 or 42 non-hydrogen atoms in the classic examples pentacene or rubrene, respectively) and the spanned electronic property landscapes are known to be highly sensitive even to small molecular substitutions.26–28 A vast number of ~1033 similar-sized molecules is estimated to be synthesizable1, raising the suspicion that presently known well-performing OSC molecular materials are not even the tip of the iceberg. This has motivated a number of preceding exhaustive screening or virtual discovery studies in more or less restricted closed subspaces.3,5,29–34.
In this work we first analyze a diverse set of OSC molecules to derive clear molecular-construction rules that allow to generate an in principle unlimited OSC chemical space. This space is then successively explored by the AML discovery strategy, rapidly identifying molecular candidates that are superior to well-known OSC materials in terms of their molecular electronic descriptors assessing efficient charge injection and charge mobility. Deep methodological insight is gained by analyzing and visualizing the AML exploration inside a chemical space network (CSN) containing only a subset of the design space, limited to allow its full enumeration. Even inside this truncated chemical space the AML-discovery clearly outperforms a conventional funnel approach.
Results
Morphing based generation of an unlimited OSC search space
The basis for our efficient AML exploration of an a priori unlimited molecular search space is the development of a concise set of molecular construction rules that allow to generate this space by iterative application. To establish a diverse, but problem-specific chemical space, we resort to existing domain knowledge and analyze the building blocks and motives contained in molecules constituting a number of well-performing crystalline OSC molecular materials. For this analysis, we exploit the fact that most functionalized organic molecules can be unambiguously fragmented into a molecular backbone (of one or more cores), linkers (that connect cores) and side groups (attached to cores) as illustrated in Fig. 1. Without loss of generality, we correspondingly fragment 30 prominent π-conjugated molecules that belong to a variety of important molecular families23 (Acenes, Thienoacenes, TTF-derivatives, Carbazoles, Triphenylamines, Diimides, Quinacridones and Azaacenes) and consist of the most common organic elements C, H, N, O and S. Figure 1 highlights some of these peer molecules and the full set is given in the SI in Supplementary Fig. 1. Intriguingly, the richness of chemical building blocks identified in this way can be exhaustively generated by a set of only 22 simple molecular morphing operations starting from the smallest aromatic building block benzene. As illustrated in Fig. 1 these morphing operations each act on a molecule’s individual atomic sites or fragments, each time adding, modifying or removing fragments. These morphing operations should be seen as alchemical transformations to navigate between molecules, while applying organic synthesis steps could be a viable alternative.35 Even though at a first glance rather unintuitive for the generation of successively larger or complex molecules, we also note that the inclusion of every morphing operation in a backwards step, i.e., resubstituting a fragment substructure, is crucial to increase the interconnectivity of the forming chemical space, see Supplementary Fig. 3.
The generic nature of the morphing operations identified through the fragmentation ansatz is not only a stepping stone for the efficient AML exploration. It also provides a blueprint for future variations of the present search space or the generation of different search spaces for other applications. Additional morphing operations will lead to more general search spaces and could be automatically extracted from a diverse chemical database36, while deliberate suppression of morphing operations can be used to focus on molecular sub-classes. Ring-annelation type morphing operations as well as biphenylic addition are for example essential for the iterative construction of core Acene fragments, such as in Pyrene or DPA. To build structures like Thienoacenes, Azaacenes or Carbazoles, ring contractions that lead to 5-membered rings are included as intermediates for heteroaromatic ring construction. This, though, comes at the cost of potentially yielding pericyclically reactive molecules, as discussed further below. Similarly, two types of linker operations are included to access the family of Triphenylamines. Further examples together with a detailed description of every morphing operation are provided in Supplementary Note 1. Considering their known OSC tuning potential,28,37,38 we note that in particular the augmentation of the present backbone-oriented set of construction rules by specific morphing operations for side groups or additional functional groups is expected to lead to an important extension of the here showcased search space.
The construction rules may also be modified to incorporate further prior knowledge about the OSC design problem. Here, we notably include constraints on molecular symmetry. Molecular symmetry may be beneficial for synthetic accessibility. Furthermore, it can mitigate mobility reducing charge localization27 and in particular in monomolecular crystals often favors charge percolation pathways3,39,40 (albeit its role can be intricate41). We correspondingly prune the construction rules for the present OSC context to enforce 2D graph symmetries expected to provide a prosymmetry for the 3D case. Specifically, generated molecules are only considered for further morphing, if they fall into three types of symmetry classes as explained in Fig. 1d, e: They (1) exhibit a full graph-symmetry, with all atomic environments appearing at least twice. (2) An asymmetric part in the molecule made of one or more fragments is symmetrically substituted by an even number of similar fragments, or (3) a molecule is prosymmetric such that it has atomic sites on which a single substitution operation could lead to a molecule of class (1) or (2). Further details on symmetry detection are provided in Supplementary Note 2. As always, incorporation of any such domain-specific heuristics like symmetry is thereby a double-edged sword, possibly generating more meaningful search spaces as much as introducing a limiting bias. AML is particularly appealing in this respect. Any such rules can readily be added or dropped without incurring excessive computational costs as in exhaustive screenings of predefined search spaces.
Charge-conduction based fitness
In the spanned search space, we assess the suitability of candidate molecules for OSC applications by two descriptors known to probe two important and complementary aspects related to the conduction of charge. One concerns the efficient injection of charge from a contacting electrode into the OSC material. The other assesses the required high charge mobility inside the OSC bulk. For predominantly p-type OSC materials23 a detrimentally high barrier for a corresponding hole injection from a standard gold electrode is readily probed by a level-alignment descriptor ϵalign = ∣ϵHOMO − ΦAu∣,42 which evaluates the energetic mismatch between the Au work function ΦAu = − 5.1 eV43 and the energetic position of the highest occupied molecular orbital (HOMO) ϵHOMO as a common approximation of the material’s ionization potential.44,45 Adapting this descriptor to other electrode materials or to n-type OSC materials (then involving the energetic position of the lowest unoccupied molecular orbital, LUMO) is straightforward. As an equally established descriptor for the bulk charge mobility we employ the intra-molecular (hole) reorganization energy λh, which measures the cost of accommodating a new charge state after the carrier has moved to the next molecular site.46,47 As molecular properties, both ϵHOMO and λh can be determined by efficient first-principles calculations as detailed in Supplementary Note 2, where the density-functional theory (DFT) B3LYP48–50 level of theory constitutes a well established accuracy standard27,31,39,40,51, matching experimental data44,52. We emphasize though that using the lowest-energy gas-phase conformer for the descriptor calculation disregards packing-effects in the molecular crystal53–55 and we further discuss the influence of conformers on descriptor values in Supplementary Note 3.
To evaluate molecular fitness and prioritize candidates during AML discovery, both objectives are combined in a scalarized fitness function
1 |
which an ideal candidate molecule will maximize.56 Here, the weight vector w = (1.0, 0.7)⊤ accommodates the generally different absolute scales of the two descriptors, with the value of 0.7 chosen to yield an essentially Ohmic alignment with the electrode of ∣ϵalign∣ < 0.3 eV if λh falls into the range of commonly known OSCs. We note, though, that the exact choice of weights is rather unimportant for the performance of the AML search, as it only linearly biases F towards either of the descriptors, as further detailed below. With the currently chosen weight and at the DFT-B3LYP level of theory, pentacene and rubrene – materials that have been contacted by gold electrodes before57,58 – will feature F values of −0.16 and −0.2, respectively. A threshold F ≥ −0.2 will therefore later on be used to measure discovery success of the AML.
AML: design and search strategy
By successively querying the explicit first-principles calculation of the descriptors for identified candidate molecules, the AML algorithm establishes an ever improving surrogate model of the fitness function F over the search space. Out of a manifold of in principle possible surrogate models, we found GPR to already achieve outstanding performance at very moderate amounts of data. In brief, the employed model uses circular Morgan fingerprints59 to compare the structural similarity of not yet explicitly calculated molecules with the hitherto acquired ones. Specifically, counts of substructures that can be extracted by moving up to two bonds away from each central atom are generated. The similarity between two molecules is then measured with a substructure count kernel. A full account of the GPR learning through log-marginal likelihood maximization is provided in Supplementary Note 2. A central advantage of GPR for the AML context is that it not only provides a prediction for the targeted fitness function F, but also the corresponding predictive uncertainty σ from the Gaussian variance. Balancing between exploitation and exploration, the AML algorithm can thus query new candidate molecules either because they are highly promising in terms of a maximum predicted fitness F or because they exhibit a high uncertainty σ such that their explicit calculation will maximally improve the surrogate model. Practically, molecules are thereby chosen according to an upper confidence bound acquisition function
2 |
This represents a simple, well-tested strategy in Bayesian optimization60–62 or active-search63,64 with GPRs, which contains only one hyperparameter κ to balance exploration and exploitation.
Multiple possibilities arise how to actually execute the iterative AML process. After initializing the surrogate model by training on a defined number Ninitial of molecules, central questions concern the acquisition of new data before the surrogate model is retrained. Compatible with super-computing resources that encourage a parallel first-principles evaluation of the descriptors for multiple molecules, we opt for a batch-based learning where Nbatch molecules with maximum Facq are queried and the model is then retrained on the basis of the accumulated new descriptor data. Future improvements could include an additional enforcement of diversity in the prioritized batch.18,21,65,66 In an in principle infinite chemical space, another central AML design choice regards the extent over which new molecules are practically assessed with the established, conceptually global surrogate model. Aiming for high-performance OSC molecules of tractable size and complexity, we here opt for a single tree expansion that limits the candidates to those in the vicinity of already sampled ones67.
In a most straightforward realization and if all molecules for which first-principles descriptors have already been computed define the current population at step n of the AML search, then the Nbatch molecules for the next step n + 1 are identified in the search space formed by all molecules that can be generated by one-time application of any of the morphing operations to every molecule in the current population. While this nicely exploits the evolutionary pressure contained in the current population of size Npop = Ninitial + n × Nbatch, the search space for step n + 1 could also be systematically increased by exhaustive multiple-time application of the morphing operations. As illustrated below by comparing a corresponding search depth of one- or two-time application, this may help to overcome local funnels and navigate more efficiently through chemical space. On the other hand and regardless of the actual search depth dsearch, the continuously growing population size will at later learning steps n inevitably lead to a combinatorial explosion of new candidates for any such exhaustive enumeration. Eventually, this requires to decrease the resolution in the ever increasing search space. Note that precisely this combinatorial explosion also precludes popular supervised machine learning approaches that exhaustively learn molecular properties in a closed chemical space, possibly followed by some form of data mining3.
A decreasing resolution in the AML search space can for instance be achieved by imposing additional heuristic selection criteria, e.g., selectively suppressing certain morphing operations for increasing search depths, or other more sophisticated tree-search policies68 also employed in reinforcement learning35,69. Here, we realize deeper partial expansions of the search tree up to a search depth dsearch by applying the molecular morphing operations only to a fixed number of Ndeep molecules selected first from the current population and then subsequently from those molecules that were created by the previous morphing operations. By each time selecting the Ndeep molecules through fitness-rank based roulette-wheel selection, i.e., by assigning higher selection probabilities to molecules with high Facq values, the search tree is thus preferentially expanded into regions of the OSC space that the surrogate model anticipates to be rewarding (either in terms of exploitation or exploration).
Hyperparameter optimization
The thus defined AML approach contains a number of hyperparameters that may critically affect its performance. Most notably, these are κ that balances exploration and exploitation in the acquisition function, Nbatch the size of the prioritized batch in each learning step, as well as dsearch the depth of the search space in terms of the number of applied morphing operations. The decreased resolution strategy additionally requires the specification of the fixed subset size of Ndeep molecules to which morphing operations are applied. Less decisive is the initial number of molecules Ninitial used for the first training of the surrogate model, which defines only an insignifiant part of the total executed first-principles calculations and which should only be large enough to somehow kick-start the AML process. Here, we suitably set Ninitial to the 179 unique molecules that result in the first two generations when applying all morphing operations up to two times starting from the simplest building block benzene, cf. Fig. 1.
In order to explore the effect of the other hyperparameters and optimize them for first-principles OSC discovery, we consider the finite subspace formed of all molecules up to a maximum size of 4 rings, 4 heteroatoms and 2 linkers that are generated by exhaustive application of all morphing operations up to 14 times, see Supplementary Note 2. With 65.552 unique molecules this subspace is already representative for the design problem and contains many and diversely structured high-performing molecules as illustrated in Fig. 2. At the same time, the still tractable size of the finite test space allows for the exhaustive calculation of all molecular descriptors with van der Waals (vdW) corrected density functional tight-binding (DFTB).70 While this semi-empirical level of theory is not fully quantitative, it provides a sufficiently realistic account of the descriptor landscape for the intended method testing as analyzed in detail in Supplementary Fig. 4. Further details on molecular test space generation and descriptor calculation are provided in the Supplementary Note 2.
The finite test space contains a total of 2438 top-performing molecules with a high fitness F ≥ −0.2. As a quantitative benchmark, we thus measure the discovery success S(N) as the fraction of these molecules that are identified after the descriptors of N molecules have been queried. With 179 queries used for the initialization, see above, the final measure S(5179) thus evaluates the discovery success after n = 50 learning steps when using Nbatch = 100. Supplementary Fig. 6 compiles the corresponding success curves S(N), when systematically combining Nbatch = 50, 100, or 200 with κ values in half-integer steps between 0 and 5, as well as for a search depth of one- or two-time exhaustive application of all morphing operations. Fortunately, we find the AML search to be highly robust with respect to the choice of Nbatch and κ. Only a small variation of 0.71 < S(N = 5179) < 0.80 is obtained over all tested combinations for a search depth of one, meaning that 70–80% of the top-performing molecules are consistently found after descriptors for less than 8% of the entire test space have actually been computed. For a search depth of two, this success rate becomes slightly higher, reaching up to 85% as compiled in Supplementary Fig. 7. Generally, larger batch sizes seem to implicitly increase the explorative behavior, such that an almost indistinguishably optimum performance is obtained for larger Nbatch in combination with successively smaller exploration weights κ in the acquisition function, cf. Eq. (2). For too small κ, the success curves become stepped though, indicating that temporarily the mainly exploitative algorithm then only meanders through identified sub-pockets of the test space. Too large κ, on the other hand, diminish the initial success of a then too explorative algorithm in the first learning steps. Overall, an intermediate value pair (Nbatch, κ) = (100, 2.5) thus provides a robust setting and is henceforth employed in all AML runs. For these values of (Nbatch, κ), we also performed a sensitivity analysis with regard to the employed weight vector w in Eq. (1) and the bond radius in the Morgan fingerprints used to assess molecular similarity. The results are summarized in Supplementary Figs. 8 and 9, respectively, and again demonstrate a high robustness with respect to these parameters.
The higher success rate for dsearch = 2 indicates that it is generally advantageous to further expand the search space away from the known topologies of the current population. Assessing the dependence of the decreased resolution AML algorithm on its two additional hyperparameters, Supplementary Table 1 summarizes the corresponding discovery successes when systematically combining a varying subset size Ndeep = 100, 250, 500 and 1000 with search depths dsearch = 1, 2, 3, 4, 5 and 10. Again, we find the algorithm to be quite robust, with higher dsearch compensating smaller Ndeep. Within the finite test space, many combinations thus saturate at success rates around 82–83%. This is essentially as good as the best performance of the previous exhaustive enumerations, but comes at the advantage of a controlled growth of the search space at later learning steps. For the first-principles AML discovery in the virtually unlimited OSC space below we correspondingly employ this decreased resolution search strategy with a top-performing hyperparameter combination (dsearch, Ndeep) = (3, 500).
Visualizing AML at work
The finite test space can also be viewed as a chemical space network (CSN), in which the morphing operations establish a total of 315.451 directed connections between the constituting molecules. This allows us to visualize the space in form of a 2D graph structure, in which the molecules are mutually repelling nodes, while morphing relationships between them lead to attractive edges71, see Supplementary Note 1 for details. In such a representation each molecule is thus spatially surrounded by morphing-related analogs. Figure 2 shows the resulting graph, in which the individual nodes are colored according to their DFTB calculated fitness. As expected, the target group for discovery in form of the 2438 top-performing molecules is widely scattered over disjoint parts of chemical space, with ensembles of related molecules often clustered in sub-pockets.
Apart from providing a bird’s eye view of the design problem, the CSN representation also affords a direct visual access to the AML process. Plotting the evolving population N over subsequent learning steps n reveals how much a chosen AML strategy is able to focus its exploration onto the interesting regions of chemical space and how efficiently it prioritizes OSC molecules with desired properties. Figure 3 illustrates this for the determined optimum hyperparameters and contrasts the learning for exhaustive searches with depths of one or two, with the decreased resolution strategy where the searches partially expand subsets of Ndeep = 500 molecules at search depth three. For the exhaustive search with dsearch = 1, the discovery is centered to more morphing-related top-performing molecules all more or less located in the core region of the CSN. In contrast, for the deeper exhaustive search, the algorithm also successfully identifies top-performing molecules in the periphery of the network that are topologically quite disconnected from the initial population. The downside is a rapidly increasing size of the search space that in the present case is only bounded by the finiteness of the considered test space. This is largely mitigated by the decreased resolution search, which nevertheless equally successfully identifies top-performing molecules at the CSN periphery.
To put this performance of the AML searches into perspective, we also contrast them in Fig. 3 with the result of a conventional computational funnel. For the latter we pretend that the calculation of ϵHOMO has a negligible computational cost and the value of this descriptor is known for every molecule in the test space. This allows to identify a subset of 13.755 promising molecules for which ∣ϵalign∣ < 0.3 eV and which contains all previously considered 2438 top-performing molecules. The computational funnel approach would then focus the explicit calculation of the more demanding λh descriptor to molecules in this subset. To enable a direct comparison with the preceding AML assessment, a random selection of 5179 molecules out of this subset would then lead to a success rate of S(5179) ≈ 0.4. Even in this finite test space, where the AML algorithm can not even unfold its real strength, less than half of the top-performing molecules are thus found by this prevalent computational screening strategy after spending the same amount of CPU time (assuming that the exhaustive calculation of 65.552 ϵHOMO descriptors for the entire test space would constitute an insignificant computational effort).
First-principles AML discovery in a virtually unlimited OSC chemical space
Based on the gathered methodological understanding and optimized algorithmic settings (Nbatch = 100, κ = 2.5, dsearch = 3, Ndeep = 500) we now proceed to first-principles AML discovery at the vdW-corrected DFT-B3LYP level of theory. This is a truly challenging endeavor, considering the vastness of the OSC design space. While the space of molecules that can be generated through the morphing operations is in principle unbounded, we here restrict it to the realm of “small molecules” containing a maximum of 100 atoms (including H atoms). This realm appears as a first, more practical target for synthesis and crystallization, also considering that essentially all known top-performing OSC molecules to date fall into this size range. Estimated to surpass a size of 1030 molecules, see Supplementary Note 2, the corresponding chemical space is nevertheless virtually unlimited for all practical purposes and would defy any conventional exhaustive computational screening. While an iterative search as with AML is thus the only tractable means to explore this space at predictive quality, an additional technical aspect emerges that did not yet play a role in the analysis of the finite test space at the semi-empirical level before. It concerns the typically massively parallel processing on the required high-performance computing (HPC) infrastructure. As a result of queuing or down-times, as well as convergence behavior of the first-principles calculations, the results for the Nbatch descriptor calculations can become available at quite different times (or in rare cases of failed convergence or system instabilities may not become available at all). A practical way to avoid long waiting times before the last calculations are ready is to initially select a larger batch size for descriptor calculation and then continue with the forthcoming learning steps whenever the desired number of Nbatch molecules has been processed (successfully or unsuccessfully). We found this strategy to afford an efficient and continuous HPC workflow, here initially submitting the 200 molecules with highest Facq values for descriptor calculations. These are continuously processed on the HPC system by 40–100 parallel worker processes, to reach the targeted batch size Nbatch = 100, while for a retraining of the surrogate model only successfully processed cases are included. In this respect, the above determined robustness of the AML performance with regard to the exact batch size also constitutes an important asset for such HPC operation.
Figure 4 summarizes the results of the AML discovery run over its first 15 learning steps. Gratifyingly, the algorithm quickly stabilizes into a highly efficient mode of operation while simultaneously meandering deep into unknown chemical space. Already after five learning steps even the median fitness of the entire prioritized batch exceeds the threshold value F ≥ −0.2 for the first time, reflecting top-performing molecules. However, as clearly seen from the violin plots of the F distribution over the batches in Fig. 4b, this high efficiency does not simply result from the algorithm just exploiting its established knowledge. Even at later learning steps, the algorithm steadily queries quite unfavorable molecules with a fitness worse than F < −0.3. While such exploratory queries can either be based on high model uncertainty or induced by model prediction errors, they serve to continuously improve the surrogate model also outside the already considered search space. As a result, at each later learning step, the algorithm keeps on identifying top-performing molecules at a stable, high rate.
After 15 learning steps and a corresponding calculation of first-principles descriptors for 1680 molecules (and only 35 unsuccessfully terminated calculations), a total of 900 molecules with molecular fitness F ≥ −0.2 have been found. A relative success rate of 54%, i.e., essentially every second first-principles calculation yields a promising molecule and this without any a priori knowledge of the vast OSC space. A second AML discovery run described in Supplementary Note 4 confirms the robustness of this high performance. Notably, due to the random nature in our search strategy, significantly different, but equally favorable molecules are identified in this run. This performance becomes even more impressive from the viewpoint that these molecules are true discoveries, as essentially none of them are contained in existing focused libraries assembled in previous screening studies3,31–34. With typically ~105 − 106 entries, these data sets reflect the wealth of our existing knowledge and synthesis efforts, but simply do not even scratch the surface of the true OSC design possibilities. To this end, the negligible overlap with the top-performing molecules identified in these previous studies also has to do with molecular size. Within the first learning steps, the average size in the prioritized batch quickly rises to around 90 atoms, which is at the edge of the limit currently imposed on our search and in a size regime that could barely be addressed by the previous exhaustive enumeration studies. At the same time, even archetypical and acclaimed molecular OSC materials like DNTT (C22H12S) or rubrene (C42H28) approach this size regime, with many other experimentally tested candidates falling right into it23. The preferred prioritization of such larger molecules is thereby to some extent likely simply a result of the combinatorially exploding phase space. On the other hand, another physical factor could be that the AML algorithm learns and exploits the tendency of λh to decrease with increasing molecular size3 as a consequence of a larger hole delocalization (which even at the hybrid DFT-B3LYP level of theory may be slightly overestimated72). The inclusion of molecular coupling-sensitive descriptors into the fitness function is therefore certainly a promising topic for future studies.
The discovered molecules exhibit a diverse set of structures, incorporating distinct core fragments and the full set of allowed heteroatoms and linkers. Figure 4c illustrates this with the best-performing molecules identified at selected learning steps, and an extended list being compiled in Supplementary Fig. 11. This diversity indicates that the AML algorithm successfully explored topologically widely differing areas of the OSC space and did not get stuck in one or a few subpockets. Nevertheless, some commonalities can be spotted, like the recurrent presence of phenylamine linker motifs (marked in orange in the best-performing molecule of learning step 1 in Fig. 4c). Similarly, more complex ring systems emerged at later learning stages (marked in blue and green in the most favorable molecule of step 3 and 9, respectively) and are from thereon quite pronounced among well-performing molecules. While a diverse molecular space is searched, the AML discovery thus automatically identifies and prioritizes privileged design motifs. After harvesting a larger number of molecules in further learning steps, an exciting prospect for future studies is therefore to mine the accumulating data set and systematically extract this implicit knowledge for rational design. To this end, the trained surrogate model can also be used to quickly assess the suitability of such manually constructed molecules or of deliberate modifications of the here identified ones. The latter could be particularly appealing in view of long-term device-stability or synthetic accessibility. We note that certainly not all identified molecules are suitable in this regard. For instance, the 5-membered unsaturated rings of the displayed compound of learning step 1 (marked in red) in Fig. 4c could be problematic as they might undergo Diels-Alder type reactions, and we attribute the appearance of such ring motives as the algorithm’s intent to provide intermediates on the way to the later explored, more stable 5-membered heterocycles. Nonetheless, multiple of the favorable molecules are symmetric and composed of standard building blocks that should be easily accessible through short and reliable synthesis routes, with the surrogate model furthermore available to gauge the effect of stabilizing modifications.
Discussion
In our view, active machine learning based on first-principles descriptors constitutes a most promising route to prior-to-synthesis virtual discovery. Its iterative refinement allows to most efficiently focus the data-generating calculations and meaningfully explore the vastness of chemical spaces at predictive quality and without a priori specifications, enumeration or reliance on empirical descriptors with limited validity range. In this work we have established such an AML discovery approach for molecular OSC materials through versatile molecular morphing operations and based on charge injection and conduction querying descriptors. Fortunately and with a view on explainable ML models, our systematic assessment within a finite test space suggests the approach to be quite robust with respect to the algorithmic hyperparameters. Most promising to further increase its already high efficiency and prevent an over-exploitation of particular structural motifs, is likely to additionally enforce structural diversity among the Nbatch molecules selected at each learning step, instead of the present purely fitness-ranked roulette-wheel selection.
Central to assess this performance and enable an unbiased and systematic comparability of different AML approaches will be the establishment of well-designed, balanced and freely available benchmark platforms for unlimited search spaces. As clear from the present work, already within the here pursued single-tree expansion there are multiple design strategies and concomitant algorithmic parameters. While we have explored these in a truncated test space, AML only unfolds its full potential in the exploration of unlimited spaces. Representative and standardized benchmark platforms as already available for drug-design tasks13 will therefore be pivotal to truly compare various learning concepts that work without a priori enumeration or pre-definition of the search problem.
Further challenges and advancements in the physico-chemical domain comprise the adaption and extension of the molecular morphing operations to tailor the OSC search space. The present set derived from literature domain knowledge spans a design space geared towards flexible, π-conjugated molecules. Ultimately, a generic, but chemically-valid creation of morphing operations could drive discovery of many novel structural motives. Heavier requirements on the surrogate GPR-model in such cases could then be tackled with improved covariance functions for 2D molecular graphs73 or conformer-specific 3D coordinates74, while alleviating the limited scaling by sparse approximations75, or application of alternative models76–79.
Another major area for development concerns the first-principles descriptors entering the employed multi-objective fitness function. Devising such suitable descriptors has evolved into an important research area of its own80–83, independent of the present AML and OSC context. With the presently employed level-alignment descriptor ϵalign and the hole reorganization energy λh our search readily identified a diverse range of hitherto unknown molecular candidates. Just as in conventional computational screening, there are numerous possibilities to refine the underlying candidate evaluation through additional (or alternative) descriptors. In the exemplified OSC context, obvious avenues could be to explicitly consider synthetic accessibility84, electronic coupling and charge-transport networks in the molecular solid46,51,85,86 or electron-phonon coupling87. In view of the high data efficiency of the AML approach, one may also drop the present focus on computationally least-demanding descriptors, originally dictated by the excessive queries in conventional exhaustive screening work. More elaborate descriptors like structural interfacing with electrode materials88 could therefore routinely (or at least occasionally) be requested. Eventually, one could even think of incorporating experimental feedback from self-driving laboratories89. The prospects are thus as manifold as exciting. Regardless of the specific road chosen, it is conceptually clear that autonomously operating workflows like the present AML approach offer an unparalleled means to accelerate the discovery and design of viable future materials like the high-mobility organic semiconductors featured in this work.
Supplementary information
Acknowledgements
C.K. and J.T.M are grateful for support by Deutsche Forschungsgemeinschaft (DFG) through TUM International Graduate School of Science and Engineering (IGSSE), GSC 81. C.K., H.O. and K.R. gratefully acknowledge support from the Solar Technologies Go Hybrid initiative of the State of Bavaria. K.C. acknowledges funding from the China Scholarship Council. We also thankfully acknowledge computational resources provided by the Leibniz Supercomputing Centre. J.T.M. would like to acknowledge illuminating discussions on synthesizability and chemical stability with Tobias Schaub.
Source data
Author contributions
C.K, H.O., J.T.M and K.R. conceived the idea. C.K. implemented the algorithms in code and carried out the calculations. Methodological details were thereby worked out by C.K., K.C. and J.T.M. C.K, H.O., J.T.M and K.R. wrote the manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Data availability
The source data necessary to reproduce the main figures of the manuscript is provided in the supplementary materials of this article. Source data are provided with this paper.
Code availability
The code used to run AML discovery is available at 10.5281/zenodo.4554331
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature Communications thanks Graeme Day and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-021-22611-4.
References
- 1.Polishchuk PG, Madzhidov TI, Varnek A. Estimation of the size of drug-like chemical space based on gdb-17 data. J. Comput. Aided Mol. Des. 2013;27:675–679. doi: 10.1007/s10822-013-9672-4. [DOI] [PubMed] [Google Scholar]
- 2.Jorgensen WL. The many roles of computation in drug discovery. Science. 2004;303:1813–1818. doi: 10.1126/science.1096361. [DOI] [PubMed] [Google Scholar]
- 3.Schober C, Reuter K, Oberhofer H. Virtual screening for high carrier mobility in organic semiconductors. J. Phys. Chem. Lett. 2016;7:3973–3977. doi: 10.1021/acs.jpclett.6b01657. [DOI] [PubMed] [Google Scholar]
- 4.Pulido A, et al. Functional materials discovery using energy–structure–function maps. Nature. 2017;543:657–664. doi: 10.1038/nature21419. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Pyzer-Knapp EO, Suh C, Gómez-Bombarelli R, Aguilera-Iparraguirre J, Aspuru-Guzik A. What is high-throughput virtual screening? a perspective from organic materials discovery. Annu. Rev. Mater. Res. 2015;45:195–216. doi: 10.1146/annurev-matsci-070214-020823. [DOI] [Google Scholar]
- 6.Gaspar HA, Baskin II, Marcou G, Horvath D, Varnek A. Chemical data visualization and analysis with incremental generative topographic mapping: big data challenge. J. Chem. Inf. Model. 2015;55:84–94. doi: 10.1021/ci500575y. [DOI] [PubMed] [Google Scholar]
- 7.Reymond J-L, van Deursen R, Blum LC, Ruddigkeit L. Chemical space as a source for new drugs. Med. Chem. Commun. 2010;1:30–38. doi: 10.1039/c0md00020e. [DOI] [Google Scholar]
- 8.Devi RV, Sathya SS, Coumar MS. Evolutionary algorithms for de novo drug design – a survey. Appl. Soft Comput. 2015;27:543–552. doi: 10.1016/j.asoc.2014.09.042. [DOI] [Google Scholar]
- 9.Le TC, Winkler DA. Discovery and optimization of materials using evolutionary approaches. Chem. Rev. 2016;116:6107–6132. doi: 10.1021/acs.chemrev.5b00691. [DOI] [PubMed] [Google Scholar]
- 10.Sanchez-Lengeling B, Aspuru-Guzik A. Inverse molecular design using machine learning: generative models for matter engineering. Science. 2018;361:360–365. doi: 10.1126/science.aat2663. [DOI] [PubMed] [Google Scholar]
- 11.Elton DC, Boukouvalas Z, Fuge MD, Chung PW. Deep learning for molecular design – a review of the state of the art. Mol. Syst. Des. Eng. 2019;4:828–849. doi: 10.1039/C9ME00039A. [DOI] [Google Scholar]
- 12.Reker D, Schneider G. Active-learning strategies in computer-assisted drug discovery. Drug Discov. Today. 2015;20:458–465. doi: 10.1016/j.drudis.2014.12.004. [DOI] [PubMed] [Google Scholar]
- 13.Brown N, Fiscato M, Segler MH, Vaucher AC. Guacamol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 2019;59:1096–1108. doi: 10.1021/acs.jcim.8b00839. [DOI] [PubMed] [Google Scholar]
- 14.Schneider, G. De novo Molecular Design (Wiley, 2013) https://books.google.de/books?id=ZxlrmwEACAAJ.
- 15.Jensen KF, Coley CW, Eyke NS. Autonomous discovery in the chemical sciences part i: progress. Angew. Chem. Int. Ed. 2019;59:22858–22893. doi: 10.1002/anie.201909987. [DOI] [PubMed] [Google Scholar]
- 16.Settles, B. Active learning literature survey (University of Wisconsin, Madison, 2010).
- 17.Lookman T, Balachandran PV, Xue D, Yuan R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. Npj Comput. Mater. 2019;5:21. doi: 10.1038/s41524-019-0153-8. [DOI] [Google Scholar]
- 18.Smith JS, Nebgen B, Lubbers N, Isayev O, Roitberg AE. Less is more: sampling chemical space with active learning. J. Chem. Phys. 2018;148:241733. doi: 10.1063/1.5023802. [DOI] [PubMed] [Google Scholar]
- 19.Häse F, Roch LM, Kreisbeck C, Aspuru-Guzik A. Phoenics: a bayesian optimizer for chemistry. ACS Cent. Sci. 2018;4:1134–1145. doi: 10.1021/acscentsci.8b00307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Vandermause J, et al. On-the-fly active learning of interpretable bayesian force fields for atomistic rare events. Npj Comput. Mater. 2020;6:20. doi: 10.1038/s41524-020-0283-z. [DOI] [Google Scholar]
- 21.Janet JP, Ramesh S, Duan C, Kulik HJ. Accurate multiobjective design in a space of millions of transition metal complexes with neural-network-driven efficient global optimization. ACS Cent. Sci. 2020;6:513–524. doi: 10.1021/acscentsci.0c00026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Bisbo MK, Hammer B. Efficient global structure optimization with a machine-learned surrogate model. Phys. Rev. Lett. 2020;124:086102. doi: 10.1103/PhysRevLett.124.086102. [DOI] [PubMed] [Google Scholar]
- 23.Wang C, Dong H, Hu W, Liu Y, Zhu D. Semiconducting π-conjugated systems in field-effect transistors: a material odyssey of organic electronics. Chem. Rev. 2012;112:2208–2267. doi: 10.1021/cr100380z. [DOI] [PubMed] [Google Scholar]
- 24.Lin Y, Li Y, Zhan X. Small molecule semiconductors for high-efficiency organic photovoltaics. Chem. Soc. Rev. 2012;41:4245–4272. doi: 10.1039/c2cs15313k. [DOI] [PubMed] [Google Scholar]
- 25.Xu R-P, Li Y-Q, Tang J-X. Recent advances in flexible organic light-emitting diodes. J. Mater. Chem. C. 2016;4:9116–9142. doi: 10.1039/C6TC03230C. [DOI] [Google Scholar]
- 26.Geng H, et al. Theoretical study of substitution effects on molecular reorganization energy in organic semiconductors. J. Chem. Phys. 2011;135:104703. doi: 10.1063/1.3632105. [DOI] [PubMed] [Google Scholar]
- 27.Uejima M, Sato T, Tanaka K, Kaji H. Vibronic coupling density analysis for the chain-length dependence of reorganization energies in oligofluorenes: a comparative study with oligothiophenes. Phys. Chem. Chem. Phys. 2013;15:14006–14016. doi: 10.1039/c3cp51592c. [DOI] [PubMed] [Google Scholar]
- 28.Wilbraham L, Smajli D, Heath-Apostolopoulos I, Zwijnenburg MA. Mapping the optoelectronic property space of small aromatic molecules. Commun. Chem. 2020;3:14. doi: 10.1038/s42004-020-0256-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Gryn’ova G, Lin K-H, Corminboeuf C. Read between the molecules: computational insights into organic semiconductors. J. Am. Chem. Soc. 2018;140:16370–16386. doi: 10.1021/jacs.8b07985. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Saeki A, Kranthiraja K. A high throughput molecular screening for organic electronics via machine learning: present status and perspective. Jpn. J. Appl. Phys. 2019;59:SD0801. doi: 10.7567/1347-4065/ab4f39. [DOI] [Google Scholar]
- 31.Matsuzawa NN, et al. Massive theoretical screen of hole conducting organic materials in the heteroacene family by using a cloud-computing environment. J. Phys. Chem. A. 2020;124:1981–1992. doi: 10.1021/acs.jpca.9b10998. [DOI] [PubMed] [Google Scholar]
- 32.Nematiaram, T., Padula, D., Landi, A. and Troisi, A. On the largest possible mobility of molecular semiconductors and how to achieve it. Adv. Funct. Mater. 30, 2001906 (2020).
- 33.Atahan-Evrenk S, Atalay FB. Prediction of intramolecular reorganization energy using machine learning. J. Phys. Chem. A. 2019;123:7855–7863. doi: 10.1021/acs.jpca.9b02733. [DOI] [PubMed] [Google Scholar]
- 34.Cheng CY, Campbell JE, Day GM. Evolutionary chemical space exploration for functional materials: computational organic semiconductor discovery. Chem. Sci. 2020;11:4922–4933. doi: 10.1039/D0SC00554A. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Segler MHS, Preuss M, Waller MP. Planning chemical syntheses with deep neural networks and symbolic AI. Nature. 2018;555:604–610. doi: 10.1038/nature25978. [DOI] [PubMed] [Google Scholar]
- 36.Segler MHS, Waller MP. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J. 2017;23:5966–5971. doi: 10.1002/chem.201605499. [DOI] [PubMed] [Google Scholar]
- 37.Yao Z-F, Wang J-Y, Pei J. Control of π–π stacking via crystal engineering in organic conjugated small molecule crystals. Cryst. Growth Des. 2018;18:7–15. doi: 10.1021/acs.cgd.7b01385. [DOI] [Google Scholar]
- 38.Kunkel C, Schober C, Margraf JT, Reuter K, Oberhofer H. Finding the right bricks for molecular legos: a data mining approach to organic semiconductor design. Chem. Mater. 2019;31:969–978. doi: 10.1021/acs.chemmater.8b04436. [DOI] [Google Scholar]
- 39.Stehr V, Pfister J, Fink RF, Engels B, Deibel C. First-principles calculations of anisotropic charge-carrier mobilities in organic semiconductor crystals. Phys. Rev. B. 2011;83:155208. doi: 10.1103/PhysRevB.83.155208. [DOI] [Google Scholar]
- 40.Li P, Cui Y, Song C, Zhang H. Electronic and charge transport properties of dimers of dithienothiophenes: effect of structural symmetry and linking mode. RSC Adv. 2015;5:50212–50222. doi: 10.1039/C5RA07946B. [DOI] [Google Scholar]
- 41.Ren L, et al. Critical role of molecular symmetry for charge transport properties: a paradigm learned from quinoidal bithieno[3,4-b]thiophenes. Chem. Mater. 2017;29:4999–5008. doi: 10.1021/acs.chemmater.7b01551. [DOI] [Google Scholar]
- 42.Ishii H, Sugiyama K, Ito E, Seki K. Energy level alignment and interfacial electronic structures at organic/metal and organic/organic interfaces. Adv. Mater. 1999;11:605–625. doi: 10.1002/(SICI)1521-4095(199906)11:8<605::AID-ADMA605>3.0.CO;2-Q. [DOI] [Google Scholar]
- 43.Michaelson HB. The work function of the elements and its periodicity. J. Appl. Phys. 1977;48:4729–4733. doi: 10.1063/1.323539. [DOI] [Google Scholar]
- 44.Schwenn P, Burn P, Powell B. Calculation of solid state molecular ionisation energies and electron affinities for organic semiconductors. Org Electron. 2011;12:394 – 403. doi: 10.1016/j.orgel.2010.11.025. [DOI] [Google Scholar]
- 45.Bhandari S, Cheung MS, Geva E, Kronik L, Dunietz BD. Fundamental gaps of condensed-phase organic semiconductors from single-molecule calculations using polarization-consistent optimally tuned screened range-separated hybrid functionals. J. Chem. Theory Comput. 2018;14:6287–6294. doi: 10.1021/acs.jctc.8b00876. [DOI] [PubMed] [Google Scholar]
- 46.Oberhofer H, Reuter K, Blumberger J. Charge transport in molecular materials: an assessment of computational methods. Chem. Rev. 2017;117:10319–10357. doi: 10.1021/acs.chemrev.7b00086. [DOI] [PubMed] [Google Scholar]
- 47.Nelsen SF, Blackstock SC, Kim Y. Estimation of inner shell marcus terms for amino nitrogen compounds by molecular orbital calculations. J. Am. Chem. Soc. 1987;109:677–682. doi: 10.1021/ja00237a007. [DOI] [Google Scholar]
- 48.Becke AD. Density-functional exchange-energy approximation with correct asymptotic behavior. Phys. Rev. A. 1988;38:3098–3100. doi: 10.1103/PhysRevA.38.3098. [DOI] [PubMed] [Google Scholar]
- 49.Lee C, Yang W, Parr RG. Development of the colle-salvetti correlation-energy formula into a functional of the electron density. Phys. Rev. B. 1988;37:785–789. doi: 10.1103/PhysRevB.37.785. [DOI] [PubMed] [Google Scholar]
- 50.Stephens PJ, Devlin FJ, Chabalowski CF, Frisch MJ. Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. J. Phys. Chem. 1994;98:11623–11627. doi: 10.1021/j100096a001. [DOI] [Google Scholar]
- 51.Moral M, Garzón-Ruiz A, Castro M, Canales-Vázquez J, Sancho-García JC. Virtual design in organic electronics: screening of a large set of 1,4-bis(phenylethynyl)benzene derivatives as molecular semiconductors. J. Phys. Chem. C. 2017;121:28249–28261. doi: 10.1021/acs.jpcc.7b07284. [DOI] [Google Scholar]
- 52.Kera S, et al. Experimental reorganization energies of pentacene and perfluoropentacene: effects of perfluorination. J. Phys. Chem. C. 2013;117:22428–22437. doi: 10.1021/jp4032089. [DOI] [Google Scholar]
- 53.Stuke A, et al. Atomic structures and orbital energies of 61,489 crystal-forming organic molecules. Sci. Data. 2020;7:58. doi: 10.1038/s41597-020-0385-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Mitzel, N. W. & Rankin, D. W. H. Saracen – molecular structures from theory and experiment: the best of both worlds. Dalton Trans. 3650–3662 (2003).
- 55.Blomeyer S, et al. Intramolecular π–π interactions in flexibly linked partially fluorinated bisarenes in the gas phase. Angew. Chem. Int. Ed. 2017;56:13259–13263. doi: 10.1002/anie.201707716. [DOI] [PubMed] [Google Scholar]
- 56.Besnard J, et al. Automated design of ligands to polypharmacological profiles. Nature. 2012;492:215–220. doi: 10.1038/nature11691. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Takeya J, et al. Very high-mobility organic single-crystal transistors with in-crystal conduction channels. Appl. Phys. Lett. 2007;90:102120. doi: 10.1063/1.2711393. [DOI] [Google Scholar]
- 58.Jurchescu OD, Baas J, Palstra TTM. Effect of impurities on the mobility of single crystal pentacene. Appl. Phys. Lett. 2004;84:3061–3063. doi: 10.1063/1.1704874. [DOI] [Google Scholar]
- 59.Rogers D, Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010;50:742–754. doi: 10.1021/ci100050t. [DOI] [PubMed] [Google Scholar]
- 60.Srinivas N, Krause A, Kakade SM, Seeger MW. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory. 2012;58:3250–3265. doi: 10.1109/TIT.2011.2182033. [DOI] [Google Scholar]
- 61.Auer P. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 2002;3:397–422. [Google Scholar]
- 62.Srinivas, N., Krause, A., Kakade, S. & Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, 1015–1022 (Omnipress, Madison, WI, USA, 2010).
- 63.Vanchinathan, H. P., Marfurt, A., Robelin, C.-A., Kossmann, D. & Krause, A. Discovering valuable items from massive data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, 1195–1204 (Association for Computing Machinery, New York, NY, USA, 2015) 10.1145/2783258.2783360.
- 64.Ma, Y., Huang, T.-K. & Schneider, J. Active search and bandits on graphs using sigma-optimality. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, UAI 2015, 542–551 (2015).
- 65.Pinsler, R., Gordon, J., Nalisnick, E. & Hernández-Lobato, J. M. Bayesian batch active learning as sparse subset approximation. In Advances in Neural Information Processing Systems 32, (eds Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. & Garnett, R.) 6359–6370 (Curran Associates, Inc., 2019).
- 66.Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J. & Agarwal, A. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations (2020).
- 67.Madhawa K, Murata T. A multi-armed bandit approach for exploring partially observed networks. Appl. Netw. Sci. 2019;4:26. doi: 10.1007/s41109-019-0145-0. [DOI] [Google Scholar]
- 68.Browne C, et al. A survey of monte carlo tree search methods. IEEE Trans. Comput. Intell. AI Games. 2012;4:1–43. doi: 10.1109/TCIAIG.2012.2186810. [DOI] [Google Scholar]
- 69.Zhou Z, Kearnes S, Li L, Zare RN, Riley P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 2019;9:10752. doi: 10.1038/s41598-019-47148-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Grimme S, Bannwarth C, Shushkov P. A robust and accurate tight-binding quantum chemical method for structures, vibrational frequencies, and noncovalent interactions of large molecular systems parametrized for all spd-block elements (z = 1–86) J. Chem. Theory Comput. 2017;13:1989–2009. doi: 10.1021/acs.jctc.7b00118. [DOI] [PubMed] [Google Scholar]
- 71.Kunkel C, Schober C, Oberhofer H, Reuter K. Knowledge discovery through chemical space networks: the case of organic electronics. J. Mol. Model. 2019;25:87. doi: 10.1007/s00894-019-3950-6. [DOI] [PubMed] [Google Scholar]
- 72.Brückner C, Engels B. A theoretical description of charge reorganization energies in molecular organic p-type semiconductors. J. Comput. Chem. 2016;37:1335–1344. doi: 10.1002/jcc.24325. [DOI] [PubMed] [Google Scholar]
- 73.Ralaivola L, Swamidass SJ, Saigo H, Baldi P. Graph kernels for chemical informatics. Neural Netw. 2005;18:1093 – 1110. doi: 10.1016/j.neunet.2005.07.009. [DOI] [PubMed] [Google Scholar]
- 74.Himanen L, et al. Dscribe: Library of descriptors for machine learning in materials science. Comput. Phys. Commun. 2020;247:106949. doi: 10.1016/j.cpc.2019.106949. [DOI] [Google Scholar]
- 75.Quinonero-Candela J. A unifying view of sparse approximate gaussian process regression. J. Mach. Learn. Res. 2005;6:1939–1959. [Google Scholar]
- 76.Beluch, W. H., Genewein, T., Nürnberger, A. & Köhler, J. M. The power of ensembles for active learning in image classification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018) pp. 9368–9377.
- 77.Kendall, A. & Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems (2017).
- 78.Zhang Y, Lee AA. Bayesian semi-supervised learning for uncertainty-calibrated prediction of molecular properties and active learning. Chem. Sci. 2019;10:8154–8163. doi: 10.1039/C9SC00616H. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Wenzel, F. et al. How good is the bayes posterior in deep neural networks really? In International Conference on Machine Learning (2020).
- 80.Curtarolo S, Hart GLW, Nardelli MB, Mingo N, Sanvito S, Levy O. The high-throughput highway to computational materials design. Nat. Mater. 2013;12:191–201. doi: 10.1038/nmat3568. [DOI] [PubMed] [Google Scholar]
- 81.Pracht P, Bauer CA, Grimme S. Automated and efficient quantum chemical determination and energetic ranking of molecular protonation sites. J. Comput. Chem. 2017;38:2618–2631. doi: 10.1002/jcc.24922. [DOI] [PubMed] [Google Scholar]
- 82.Andersen M, Levchenko SV, Scheffler M, Reuter K. Beyond scaling relations for the description of catalytic materials. ACS Catal. 2019;9:2752–2759. doi: 10.1021/acscatal.8b04478. [DOI] [Google Scholar]
- 83.Cubuk ED, Sendek AD, Reed EJ. Screening billions of candidates for solid lithium-ion conductors: a transfer learning approach for small data. J. Chem. Phys. 2019;150:214701. doi: 10.1063/1.5093220. [DOI] [PubMed] [Google Scholar]
- 84.Coley CW, Rogers L, Green WH, Jensen KF. Scscore: synthetic complexity learned from a reaction corpus. J. Chem. Inf. Model. 2018;58:252–261. doi: 10.1021/acs.jcim.7b00622. [DOI] [PubMed] [Google Scholar]
- 85.Ishii H, et al. Charge mobility calculation of organic semiconductors without use of experimental single-crystal data. Sci. Rep. 2020;10:2524. doi: 10.1038/s41598-020-59238-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Friederich P, et al. Molecular origin of the charge carrier mobility in small molecule organic semiconductors. Adv. Funct. Mater. 2016;26:5757–5763. doi: 10.1002/adfm.201601807. [DOI] [Google Scholar]
- 87.Landi A, Troisi A. Rapid evaluation of dynamic electronic disorder in molecular semiconductors. J. Phys. Chem. C. 2018;122:18336–18345. doi: 10.1021/acs.jpcc.8b05511. [DOI] [Google Scholar]
- 88.Egger AT, et al. Charge transfer into organic thin films: a deeper insight through machine-learning-assisted structure search. Adv. Sci. 2020;7:2000992. doi: 10.1002/advs.202000992. [DOI] [Google Scholar]
- 89.MacLeod, B. P. et al. Self-driving laboratory for accelerated discovery of thin-film materials. Sci. Adv. 6, eaaz8867 (2020). [DOI] [PMC free article] [PubMed]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The source data necessary to reproduce the main figures of the manuscript is provided in the supplementary materials of this article. Source data are provided with this paper.
The code used to run AML discovery is available at 10.5281/zenodo.4554331