ACS Central Science. 2024 Apr 8;10(4):899–906. doi: 10.1021/acscentsci.3c01638

Standardizing Substrate Selection: A Strategy toward Unbiased Evaluation of Reaction Generality

Debanjan Rana 1, Philipp M Pflüger 1, Niklas P Hölter 1, Guangying Tan 1, Frank Glorius 1,*
PMCID: PMC11046462  PMID: 38680564

Abstract

With over 10,000 new reaction protocols arising every year, only a handful of these procedures transition from academia to application. A major reason for this gap stems from the lack of comprehensive knowledge about a reaction’s scope, i.e., to which substrates the protocol can or cannot be applied. Even though chemists invest substantial effort to assess the scope of new protocols, the resulting scope tables involve significant biases, reducing their expressiveness. Herein we report a standardized substrate selection strategy designed to mitigate these biases and evaluate the applicability, as well as the limits, of any chemical reaction. Unsupervised learning is utilized to map the chemical space of industrially relevant molecules. Subsequently, potential substrate candidates are projected onto this universal map, enabling the selection of a structurally diverse set of substrates with optimal relevance and coverage. By testing our methodology on different chemical reactions, we were able to demonstrate its effectiveness in finding general reactivity trends by using a few highly representative examples. The developed methodology empowers chemists to showcase the unbiased applicability of novel methodologies, facilitating their practical applications. We hope that this work will trigger interdisciplinary discussions about biases in synthetic chemistry, leading to improved data quality.

Short abstract

We introduce an objective substrate scope selection method for assessing the generality of chemical reactions.

Introduction

The synthesis of new and active compounds with ever-increasing complexity continues to be the bottleneck in pharmaceutical research.1,2 Therefore, developing smart synthetic methodologies and protocols that enable novel ways of making molecules efficiently drives progress in this crucial industry.3,4 In response to this need, thousands of reports on synthetic methodology are published each year, and this number is only expected to rise.5 However, it is alarming to note that the vast majority of these reactions never find their way into industrial application.6,7 This is particularly evident from the fact that the 10 most used reactions in medicinal chemistry all originate from the last century.7 So what is preventing new methods from being used in industry? Reasons could lie in chemical limitations of the reactions themselves, such as scalability or functional group tolerance; however, even with these restrictions, reactions should be applicable in specific cases.2,3 Another explanation could be the lack of comprehensive understanding of the reaction, which limits the chemist’s confidence in its synthetic utility.8 Consequently, integrating a new reaction into the synthetic chemist’s toolbox not only demands wide tolerance to a range of functional groups but also requires knowledge about its applicability and especially its limitations.2,9

To demonstrate this tolerance and reaction generality, chemists usually test a variety of different substrates for a transformation and include the results in their publication. Reports typically showcase between 20 and more than 100 entries in their scope tables. This conventional substrate scope presents entries with varying electronic and steric properties as well as substitution patterns. However, due to the combinatorial nature of chemistry, even the best scope will always be incomplete.10 To counteract this complexity, chemists have begun to test more and more compounds, leading to a dramatic increase in the average number of reported substrate scope entries (Figure 1A).9 Although this recent trend aims to emphasize the robustness of the developed protocol, current enlarged scope tables are often redundant. The underlying reason is that they are subject to substantial biases,8,11 namely, selection12 and reporting bias (Figure 1C).13 The former, selection bias, arises because chemists prioritize substrates that are expected to give higher yields or are easily accessible. The latter, reporting bias, arises because most publications do not report unsuccessful experiments or low-yielding results, which constitute negative data.14 It should be mentioned that these biases are known to exist in all sciences.15 However, while measures to minimize them have been taken in several disciplines,16 those in synthetic chemistry are still in their nascent stage.17

Figure 1. Overview of current substrate scope statistics and challenges. (A) Increase in the number of scope entries over the last 13 years (SI section 2.4). (B) Missing transition of new protocols to industry. (C) Biases involved in reported substrate scope studies. Nested pie charts depict the number of successful examples (outer circle) and average yield (inner circle). (D) Comparison of contemporary strategies for the substrate scope.

Therefore, new approaches to standardize substrate selection and comparably benchmark reaction generality are highly desirable (Figure 1D).18,19 On this note, our group previously developed the robustness screen, which provides a simple way to assess the functional group tolerance of a reaction.19 This screen measures the impact of standardized additives on the reaction outcome and therefore allows an approximation of the applicability and limits of any given report. Although the protocol is readily applicable, it fails to capture intramolecular effects such as electronics and sterics as the additives consist of small molecules, representing synthetic needs only partially. Researchers from Merck introduced the informer library, comprising a set of structurally complex substrates specifically chosen to maximize coverage of the physicochemical drug space.20 While this library facilitates screening conditions for cross-coupling reactions, its applicability to other substrate classes has been limited. In addition, the utilized principal component analysis method cannot identify complex global and local structural patterns. Recently, Doyle and co-workers reported an efficient method for the scope selection of aryl bromides based on their coverage of the chemical space.21 Their method selects the scope broadly to cover electronic and steric effects of substituents around the reaction center based on calculated quantum chemical descriptors. While the method impressively reduces selection and reporting bias, it cannot capture the diverse reactivity and interactions of functional groups in complex pharmaceutically relevant substrates. Moreover, applying this methodology to different transformations requires significant adaptations, including defining a suitable set of quantum chemical descriptors followed by time-consuming calculations for a large dataset of substrate candidates.

Given the rapid pace at which new scientific literature is published, it is imperative to develop new approaches to meaningfully evaluate synthetic protocols based on their generality. For such a method to provide meaningful conclusions, it must meet the following requirements: I) it should be low in selection bias and free of reporting bias; II) it should be readily applicable to any chemical transformation; III) it should provide broad knowledge with a minimal number of substrates; and IV) the insights gained should be applicable to complex scaffolds such as those found in drugs. One solution to fulfill these requirements could be to map available substrates onto the currently known chemical space of drugs followed by an unbiased selection of a diverse set of compounds. This selected set of molecules could be tested in addition to the conventional scope examples, thereby providing a comparable benchmark of the protocol’s applicability. Based on this hypothesis, we herein report a standardized approach for substrate selection that can be easily applied by researchers to test the unbiased applicability of their developed methods.

Results and Discussion

The envisioned substrate selection workflow operates in three steps: first, a machine learning algorithm is utilized to identify common structural patterns inherent to the given molecular dataset such as a drug library. Thereby, it maps molecules sharing similar scaffolds closer together while placing structurally dissimilar structures further apart from each other. This map can be divided into clusters and is then utilized in the second step: the trained machine learning model analyzes potential reaction substrates based on their structural proximity to previously given drug scaffolds and projects them onto the original map. These overlaid maps are then used in a third step to finally select candidate molecules for experimental reaction evaluation (Figure 2).

Figure 2. Schematic overview of steps involved in the standardized substrate selection workflow. (A) Analyzing reaction compatibility and requirements from the initial classical scope for filtering the potential list of substrates. (B) Mapped chemical space of drugs obtained after UMAP dimensionality reduction and hierarchical clustering; cluster centers are labeled from A to O.

To begin the development of this workflow, we opted for the Drugbank database as a representative dataset encompassing the structural diversity of drug molecules.22 It is worth noting that this approach could also be applied to datasets for other applications such as crop protectants or fragrances.23 For the sake of clarity, it has to be stated that while our aim is to minimize the human bias in substrate selection, the choice of Drugbank here introduces a dataset bias. However, this choice was motivated by the significant focus within the field of synthetic methodologies on streamlining the synthesis of pharmaceutical compounds. Next, we featurized the drug molecules utilizing extended connectivity fingerprints (ECFP).24 While quantum chemical descriptors have been shown to be effective in mapping molecular structures,21 they feature major disadvantages for the given application. Typically, they are highly problem- and structure-specific,25 lacking a general set of descriptors to describe the high structural diversity of the drug chemical space. In contrast, molecular fingerprints can natively encode substructures,24 providing a robust structural representation designed for broader applicability. (See SI section 5 for fingerprint comparison.)

With a general molecular featurization in hand, we turned our focus toward the mapping of the drug chemical space,26,27 leveraging unsupervised learning, which has already demonstrated promising results in identifying inherent patterns in molecular datasets.26,28 To obtain a meaningful map of the drug chemical space, structural relationships and similarities between drug molecules need to be identified.29 To achieve this, we employed UMAP (Uniform Manifold Approximation and Projection), a nonlinear dimensionality reduction algorithm utilized for embedding chemical datasets.30 The amount of global information (such as recurring motifs, e.g. terpenes,31 or sugar-like patterns32) versus the local information (such as functional groups or substitutional variations) captured by UMAP is dependent on two key parameters: the minimum distance allowed between data points (Md) and the number of nearest neighbors (Nb). To optimize these parameters, two metrics were utilized: the correlation of the Jaccard distance between fingerprint pairs and their distance in the final projection (D), as well as the silhouette score33 (S) to assess whether significant clustering can be achieved in the projected space (SI section 4). It was found that by using a number of nearest neighbors Nb = 30 and a minimum distance of Md = 0.1, a mapping can be accomplished which effectively preserves global similarity while still capturing distinct local characteristics of specific compound classes (Figure 2B).
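The distance-preservation metric D described above can be illustrated with a minimal sketch. It assumes fingerprints are stored as sets of on-bit indices and that a Pearson coefficient is used; the source does not state which correlation coefficient was applied, and the function names and toy data below are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Jaccard distance between two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient (assumed here; the source does not specify)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def projection_distance_correlation(fingerprints, embedding) -> float:
    """Correlate pairwise fingerprint distances with pairwise distances in the 2D map."""
    pairs = list(combinations(range(len(fingerprints)), 2))
    d_fp = [jaccard_distance(fingerprints[i], fingerprints[j]) for i, j in pairs]
    d_2d = [dist(embedding[i], embedding[j]) for i, j in pairs]
    return pearson(d_fp, d_2d)


# Hypothetical toy data: two pairs of similar fingerprints and a made-up 2D embedding.
fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({7, 8, 10}),
]
coords = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
D = projection_distance_correlation(fps, coords)
print(f"D = {D:.3f}")
```

A value of D close to 1 indicates that molecules with similar fingerprints also lie close together in the projection, which is the property the parameter search for Nb and Md aims to preserve.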

Subsequently, we performed clustering to compartmentalize the embedded drug chemical space. Evaluating different clustering algorithms revealed hierarchical agglomerative clustering34 as the superior method (SI section 6). The algorithm conserves visually separable clusters while segmenting larger regions. However, determining the number of clusters poses a significant challenge, as this number determines the size of the later scope. Although any example could in principle give new information, we argue that a scope of more than 25 examples would be impractical. This is especially true if complex substrates are tested. In contrast, a scope of fewer than 10 examples would neglect relevant structural motifs. To select an appropriate number, we computed the silhouette scores for cluster numbers ranging from 10 to 25, revealing no significant trend. Ultimately, we chose 15 clusters for practical reasons, allowing us to adapt the methodology efficiently. It has to be stated, however, that a greater number could also be selected.
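As an illustration of the silhouette score used above to compare cluster assignments, the following is a minimal pure-Python sketch assuming Euclidean distances in the 2D embedding. It is not the authors' implementation; scikit-learn's `silhouette_score` would normally be used in practice, and the points and labels below are hypothetical.

```python
from math import dist


def silhouette_score(points, labels) -> float:
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s = 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy embedding: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the visible groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(f"good = {good:.2f}, bad = {bad:.2f}")
```

Scores near 1 indicate compact, well-separated clusters, whereas scores near or below 0 indicate overlapping assignments; repeating the calculation for each candidate cluster number gives the comparison described in the text.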

The UMAP embedded drug chemical space forms the basis of the envisioned substrate selection workflow. In the subsequent step, the trained UMAP model can be utilized to project any given class of substrate molecules onto the drug map and finally select candidate molecules for the substrate scope. Because different substrate classes can be projected onto the same universal drug map, our approach is generally applicable to various substrate categories. Depending on the reaction, a broad list of molecules for a specific substrate class should first be collected from a molecular database or supplier catalogue and filtered based on previous knowledge of reactivity. The filtering process allows known limitations to be stated explicitly and ultimately enhances the information gained about the reactivity. These filters can typically be derived from the conventional scope, which reveals a reaction’s incompatibility with specific functional groups or steric restrictions (Figure 2A). Complementary unbiased data-driven approaches such as the quantum chemically modeled substrate selection21 could also be utilized for identifying electronic and steric reactivity trends and defining filters.

The filtered list is then fed into the trained UMAP model, which projects these possible starting materials onto the drug map. Within this process, the previously learned interdependencies are utilized by the model to project substrates based on their similarity to known drugs. To select a final list of diverse scope entries from this projection, the previously derived clusters are utilized. The substrates that fall closest to each drug cluster center are chosen and subjected to the reaction conditions. In cases where the centermost candidate is difficult to access (e.g., due to price or availability concerns), the top-n closest structures should be considered. The entire standardized substrate selection workflow has been automated through a web interface,35 making it readily accessible to synthetic chemists. Users only need to upload a list of potential substrate molecules to obtain a standardized set of substrates. (See SI section 3.1 for implementation guidelines.)
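The selection step described above can be sketched as a nearest-to-center ranking. This is a minimal illustration, assuming each cluster center and each projected substrate is available as a 2D coordinate and that Euclidean distance is used; how the cluster center is defined is not specified in this section, and all names and coordinates below are hypothetical.

```python
from math import dist


def closest_candidates(centers, candidates, top_n=3):
    """Rank candidate substrates by distance to each cluster center.

    centers:    {cluster_label: (x, y)} in the embedded drug map
    candidates: {substrate_id: (x, y)} projected by the same trained model
    Returns {cluster_label: [substrate_id, ...]} with the closest candidate first.
    """
    ranked = {}
    for label, center in centers.items():
        by_distance = sorted(candidates, key=lambda sid: dist(candidates[sid], center))
        ranked[label] = by_distance[:top_n]
    return ranked


def pick_scope(ranked, unavailable=frozenset()):
    """Take the centermost candidate per cluster, falling back along the top-n list."""
    return {
        label: next((sid for sid in ids if sid not in unavailable), None)
        for label, ids in ranked.items()
    }


# Hypothetical coordinates for two cluster centers and four projected substrates.
centers = {"A": (0.0, 0.0), "B": (5.0, 5.0)}
candidates = {
    "alkene-1": (0.2, 0.1),
    "alkene-2": (0.5, 0.5),
    "alkene-3": (4.8, 5.1),
    "alkene-4": (9.0, 9.0),
}
ranked = closest_candidates(centers, candidates, top_n=2)
scope = pick_scope(ranked)
print(scope)
```

Passing `unavailable={"alkene-1"}` to `pick_scope` mimics the case where the centermost candidate cannot be purchased, so the next-closest structure is taken instead. In this sketch the same substrate could in principle be ranked first for two clusters; a real workflow would need a rule for that case.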

To test our standardized substrate selection strategy, we chose two reactions. The first is the photochemical iminocarboxylation of alkenes, recently reported by our laboratory.36 This energy transfer enabled difunctionalization approach converts alkenes to biologically important β-amino acid derivatives in a single step. Previously, the reaction was tested on 100 different alkene substrates and was able to tolerate a broad variety of functional groups. However, as is true for most reaction procedures published today, the presented scope was subject to reporting and selection bias, making this reaction ideal for our method. Second, to compare the results against the same set of substrates, we also chose the osmium-catalyzed dihydroxylation of alkenes.37,38 This reaction is a widely known transformation and is routinely used within academic and industrial chemical laboratories all over the world due to its versatility. To define the broadest possible substrate space for the alkenes, the Reaxys database was queried for all commercially available olefins. To keep costs acceptable and the approach applicable, the results were filtered based on price (<100 €/g) and molecular weight (<700 Da). In addition, tetra-substituted alkenes and substrates bearing free amine groups were filtered out because they were known to be incompatible with the photochemical iminocarboxylation reaction. Overall, a dataset of 3811 olefins was obtained and projected onto the drug map by the trained UMAP model (Figure 3A).
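The filtering criteria stated above (price, molecular weight, and known incompatibilities) can be expressed as a simple predicate. The sketch below is illustrative only: the `Candidate` record, its field names, and the example entries are hypothetical placeholders, and in practice the structural flags would come from a substructure search rather than hand-set booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one commercially available olefin."""
    name: str
    price_eur_per_g: float
    mol_weight_da: float
    tetrasubstituted_alkene: bool
    free_amine: bool


def passes_filters(c: Candidate, max_price: float = 100.0, max_mw: float = 700.0) -> bool:
    """Apply the price (<100 EUR/g), weight (<700 Da), and reactivity filters."""
    return (
        c.price_eur_per_g < max_price
        and c.mol_weight_da < max_mw
        and not c.tetrasubstituted_alkene
        and not c.free_amine
    )


# Placeholder entries with made-up values, one per filter outcome.
library = [
    Candidate("olefin-a", 12.0, 180.2, False, False),   # passes every filter
    Candidate("olefin-b", 250.0, 310.4, False, False),  # too expensive
    Candidate("olefin-c", 40.0, 820.9, False, False),   # too heavy
    Candidate("olefin-d", 8.5, 140.2, True, False),     # tetra-substituted alkene
    Candidate("olefin-e", 30.0, 220.3, False, True),    # free amine
]
kept = [c.name for c in library if passes_filters(c)]
print(kept)
```

Keeping the thresholds as arguments makes the known limitations explicit and easy to report alongside the resulting scope.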

Figure 3. Projections of olefins (dark blue) over the drug chemical space (light blue). (A) Commercially available olefins were filtered from Reaxys. Olefins lying closest to the center of each drug cluster are selected for the standardized substrate scope. (B) Olefins shown in the original photocatalytic iminocarboxylation report.

Alkenes typically do not share the same scaffold complexity as drugs, and as a result, certain regions (e.g., linear chains) have a higher density compared to others. This nonuniform distribution reemphasizes the need for a systematic approach, as random or human selection would result in substrates being picked primarily from the denser regions, thus giving less information about a protocol’s applicability (Figure 3B). Therefore, following our selection strategy, alkenes lying closest to the center of each drug cluster were chosen, subject to availability, yielding a representative set of alkenes. As expected, the selected set of 15 candidate molecules represents a wide variety of alkenes, including terminal, 1,1-disubstituted, 1,2-disubstituted, and trisubstituted alkenes. Alkenes with different electronics are present, ranging from electron-deficient α,β-unsaturated ketones through unactivated terminal alkenes to electron-rich conjugated ethers. Various functional groups such as alcohols, esters, amides, ethers, silyl ethers, halides, thiols, tertiary amines, and trifluoroethyl groups are also included. Notably, in comparison to conventional scopes, a major fraction of the selected molecules are polyfunctionalized. To demonstrate that the developed standardized scope selection workflow is highly transferable and applicable to different classes of compounds, it was also tested with (hetero)aryl bromides, again showing uneven but broad drug space coverage. Following the described selection process, it was possible to obtain a diverse set of (hetero)aryl bromides (SI section 8).

While the substrate selection process can be applied to most reactions, we focused our experimental evaluation on the two aforementioned reactions, with all 15 selected olefins being subjected to both reaction conditions (Figure 4). Our criterion for a reaction to be successful was obtaining the expected product in greater than 10% yield (isolable). This threshold is arguably low, but it provides enough product for analytical as well as primary pharmaceutical studies. Following this criterion, eight (53%) substrates successfully underwent dihydroxylation and seven (47%) substrates underwent iminocarboxylation. Unactivated alkenes (L1 and N1) were successfully converted in both reactions. The nortriptyline derivative (E1) was also converted in both reactions, albeit in lower yields. 1-Vinyltriazole (A1), vinyl cyclopropane derivative (B1), and electron-deficient alkenes M1 and trifluoroethyl acrylate (K1) were unreactive toward dihydroxylation but yielded the corresponding β-amino acid derivatives. Despite participating in the desired iminocarboxylation reaction, alkene-M1 underwent dechlorination and subsequent β-hydrogen elimination, and alkene-B1 afforded the ring-opening product under the employed photochemical conditions. Three out of the 15 chosen alkenes (G1, I1, and J1) were unsuccessful in both reactions. However, a tetrahydropyridine derivative (F1), a β-lactam-bearing alkene (D1), (−)-quinuclidine (H1), triacetyl-d-glucal (O1), and alkene-C1 were successfully converted to their corresponding diols, while the expected products were not detected in the case of iminocarboxylation (Figure 4B).
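The success counts quoted above can be cross-checked from the substrate labels named in the text. The short tally below encodes only the outcomes stated in this paragraph (which substrates gave the expected product above the 10% threshold in each reaction); the variable names are our own.

```python
# Substrate labels A1 to O1, one per drug cluster.
ALL_SUBSTRATES = {f"{letter}1" for letter in "ABCDEFGHIJKLMNO"}

# Outcomes as stated in the text (expected product obtained in >10% yield).
dihydroxylation = {"L1", "N1", "E1", "F1", "D1", "H1", "O1", "C1"}
iminocarboxylation = {"L1", "N1", "E1", "A1", "B1", "M1", "K1"}


def success_rate(successes: set) -> int:
    """Percentage of the 15 substrates that reacted successfully, rounded."""
    return round(100 * len(successes) / len(ALL_SUBSTRATES))


both = dihydroxylation & iminocarboxylation
neither = ALL_SUBSTRATES - (dihydroxylation | iminocarboxylation)

print(len(dihydroxylation), success_rate(dihydroxylation))        # 8 53
print(len(iminocarboxylation), success_rate(iminocarboxylation))  # 7 47
print(sorted(both))     # ['E1', 'L1', 'N1']
print(sorted(neither))  # ['G1', 'I1', 'J1']
```

The tally reproduces the reported 8/15 (53%) and 7/15 (47%) success rates and the three substrates that failed in both reactions.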

Figure 4. (A) Experimental reaction evaluation with the 15 alkenes selected by the standardized substrate selection workflow. (B) The alkenes are labeled according to the cluster labels (A1 to O1). All reactions were performed at 0.2 mmol scale, and isolated yields are reported. Full experimental procedure can be found in the Supporting Information. ᵃRing-opening product was obtained. ᵇDechlorinated and β-hydrogen eliminated product was obtained. (C) Performance metrics of the benchmarked reactions. Nested pie charts: number of working substrates (outer ring) and average yield of successful reactions (inner ring). (D) Guidelines to improve the reaction evaluation and applicability of protocols.

Despite the relatively small number of examples presented in this standardized scope, the information obtained from this set of structurally diverse substrates is manifold. For example, in the iminocarboxylation reaction, the positive scope outcomes align with the originally published scope, focusing on styrenes, unactivated alkenes, and electron-deficient alkenes. However, the iminocarboxylation shows limited applicability when it comes to structurally complex substrates bearing multiple reactive functionalities, which contrasts with the originally published scope that exemplified complex substrates where the olefin was located farther away from the reactive functionalities.36 In contrast, the osmium-catalyzed dihydroxylation reaction, being a long-standing valuable tool in synthetic chemistry, exhibits a broader scope. The dihydroxylation reaction successfully converted 8 out of the 15 substrates, including structurally complex alkenes. Substrates that did not work include electron-deficient and heteroatom-bound alkenes, which are well-known limitations for this reaction.37 Overall, the osmium-catalyzed dihydroxylation demonstrated greater productivity with a 47% average yield as compared to 30% for the iminocarboxylation reaction (Figure 4C). The structural, electronic, and functional diversity and the substrate complexity underline the potential of the developed approach to serve as a benchmark for reaction applicability.

While a conventional substrate scope approach remains crucial for assessing electronic and steric demands, we advocate against enlarging it with human-selected examples for showcasing reaction generality. Instead, we propose a two-step approach: first, performing a short classical scope to identify reactivity trends and second, combining it with examples selected by the standardized substrate selection workflow. This combined approach will enrich the information in substrate scope tables and facilitate a more unbiased evaluation of reaction generality (Figure 4D). It is also important to note that the standardized scope selection approach presents a considerable fraction of negative (i.e., low/no-yielding) examples. Although low-yielding or unsuccessful scope entries are occasionally included in present-day publications, a large number of unsuccessful scope examples still remains unpublished.9,14 This originates from researchers being hesitant to share unsuccessful examples, which might negatively impact their chances of publication. We must state clearly that negative results are equally important in delineating the limits of applicability of a newly reported reaction and identifying potential room for further methodological improvements. Therefore, we encourage the scientific community to share, accept, and even highlight negative results.

Conclusions

We have developed a standardized substrate selection strategy allowing for a straightforward, unbiased, and comparable assessment of chemical reactions. At the core of our workflow, a UMAP model generates a map of a chemical space (e.g., the drug space) by learning underlying structural relationships and projects reaction substrates onto this chemical space based on structural similarity. Selecting substrates from the different clusters in the drug map enables an unbiased selection of a diverse set of representative examples. Our strategy also complements previous unsupervised-learning-based substrate selection approaches, which focused on quantum chemical modeling of the reactive center.21 We have demonstrated the applicability of the workflow on different substrate classes, namely, alkenes and (hetero)aryl bromides, and utilized it to assess the scope of different chemical reactions.

Since 15 substrates comprise only a small fraction of the full chemical space of a substrate class, the picture of reaction generality across the entire chemical space remains limited, especially considering that a minute structural variation can have major consequences for the reaction outcome. As variations in UMAP or clustering parameters can lead to a set of 15 similar yet different substrates, the chosen centermost substrate may not always explain the complete reactivity of the whole cluster. However, due to the underlying complexity of chemical reactions, we anticipate that no individual method sampling only a few molecules will ever be able to capture all reactivity trends. With that said, the current trend toward ever more scope entries, which are often biased and redundant, could be enriched in information by a standardized approach to diverse substrate selection. Therefore, we anticipate our workflow to be used as a tool toward an unbiased scope evaluation of new chemical transformations.

To make this approach more user-friendly, a web platform35 was built for generating representative substrate scopes and visualizing the given substrate space using only a list of potential substrate SMILES. Alternatively, it can also be used to retrospectively evaluate already existing substrate scope tables. To demonstrate the broad applicability of a reaction and enable quick adoption in academia and industry, a scope does not necessarily need to be enlarged but rather broadened in diversity and cleansed of biases. Implementing our proposed substrate selection strategy in addition to the conventional scope examples would reveal a more realistic picture of a new transformation’s true utility.

Acknowledgments

We thank F. Katzenburg, L. Schlosser, F. Zhang, J. Tyler, R. Laskar, F. M. Boser, S. Dutta, J. Spies, and C. Chintawar (all WWU Muenster) for helpful discussions.

Data Availability Statement

The web interface can be accessed through the link https://pharmascope.uni-muenster.de/. The developed code and all associated datasets can be found at https://zivgitlab.uni-muenster.de/ag-glorius/published-paper/standardizing-substrate-selection/.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acscentsci.3c01638.

  • Details of python scripts and implementation guidelines for the developed workflow, supporting experiments, details of synthetic procedures, and data of isolated compounds (PDF)

Author Contributions

F.G., D.R., and P.M.P. supervised the research and wrote the manuscript with contributions from all authors.

This work was generously supported by the Deutsche Forschungsgemeinschaft: SPP 2363 (Utilization and Development of Machine Learning for Molecular Applications — Molecular Machine Learning).

The authors declare no competing financial interest.

Originally published ASAP April 8, 2024; References 21 and 24 updated April 24, 2024.

References

  1. Campos K. R.; Coleman P. J.; Alvarez J. C.; Dreher S. D.; Garbaccio R. M.; Terrett N. K.; Tillyer R. D.; Truppo M. D.; Parmee E. R. The importance of synthetic chemistry in the pharmaceutical industry. Science 2019, 363, eaat0805. 10.1126/science.aat0805. [DOI] [PubMed] [Google Scholar]
  2. Boström J.; Brown D. G.; Young R. J.; Keserü G. M. Expanding the medicinal chemistry synthetic toolbox. Nat. Rev. Drug Discovery 2018, 17, 709–727. 10.1038/nrd.2018.116. [DOI] [PubMed] [Google Scholar]
  3. Cernak T.; Dykstra K. D.; Tyagarajan S.; Vachal P.; Krska S. W. The medicinal chemist’s toolbox for late stage functionalization of drug-like molecules. Chem. Soc. Rev. 2016, 45, 546–576. 10.1039/C5CS00628G. [DOI] [PubMed] [Google Scholar]
  4. Krska S. W.; DiRocco D. A.; Dreher S. D.; Shevlin M. The Evolution of Chemical High-Throughput Experimentation To Address Challenging Problems in Pharmaceutical Synthesis. Acc. Chem. Res. 2017, 50, 2976–2985. 10.1021/acs.accounts.7b00428. [DOI] [PubMed] [Google Scholar]
  5. Yue K.; Zhou Q.; Bird R.; Zhu L.; Di Zhang; Li D.; Zou L.; Yang J.; Fu X.; Georges G. P. Trends and Opportunities in Organic Synthesis: Global State of Research Metrics and Advances in Precision, Efficiency, and Green Chemistry. J. Org. Chem. 2023, 88, 4031–4035. 10.1021/acs.joc.2c03057. [DOI] [PubMed] [Google Scholar]
  6. Roughley S. D.; Jordan A. M. The medicinal chemist’s toolbox: an analysis of reactions used in the pursuit of drug candidates. J. Med. Chem. 2011, 54, 3451–3479. 10.1021/jm200187y. [DOI] [PubMed] [Google Scholar]
  7. Brown D. G.; Boström J. Analysis of Past and Present Synthetic Methodologies on Medicinal Chemistry: Where Have All the New Reactions Gone?. J. Med. Chem. 2016, 59, 4443–4458. 10.1021/acs.jmedchem.5b01409. [DOI] [PubMed] [Google Scholar]
  8. Baker M. 1,500 scientists lift the lid on reproducibility. Nature 2016, 533, 452–454. 10.1038/533452a. [DOI] [PubMed] [Google Scholar]
  9. Kozlowski M. C. On the Topic of Substrate Scope. Org. Lett. 2022, 24, 7247–7249. 10.1021/acs.orglett.2c03246. [DOI] [PubMed] [Google Scholar]
  10. Kirkpatrick P.; Ellis C. Chemical space. Nature 2004, 432, 823. 10.1038/432823a. [DOI] [Google Scholar]
  11. Fanelli D.; Costas R.; Ioannidis J. P. A. Meta-assessment of bias in science. Proc. Natl. Acad. Sci. U.S.A. 2017, 114, 3714–3719. 10.1073/pnas.1618569114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Heckman J. Varieties of Selection Bias. Am. Econ. Rev. 1990, 80, 313–318. [Google Scholar]
  13. McGauran N.; Wieseler B.; Kreis J.; Schüler Y.-B.; Kölsch H.; Kaiser T. Reporting bias in medical research - a narrative review. Trials 2010, 11, 37. 10.1186/1745-6215-11-37. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Strieth-Kalthoff F.; Sandfort F.; Kühnemund M.; Schäfer F. R.; Kuchen H.; Glorius F. Machine Learning for Chemical Reactivity: The Importance of Failed Experiments. Angew. Chem., Int. Ed. 2022, 61, e202204647. 10.1002/anie.202204647. [DOI] [PubMed] [Google Scholar]
  15. a Miyakawa T. No raw data, no science: another possible source of the reproducibility crisis. Mol. Brain 2020, 13, 24. 10.1186/s13041-020-0552-2. [DOI] [PMC free article] [PubMed] [Google Scholar]; b Voelkl B.; Würbel H. Reproducibility Crisis: Are We Ignoring Reaction Norms?. Trends Pharmacol. Sci. 2016, 37, 509–510. 10.1016/j.tips.2016.05.003. [DOI] [PubMed] [Google Scholar]
  16. a Valentine K. D.; Buchanan E. M.; Cunningham A.; Hopke T.; Wikowsky A.; Wilson H. Have psychologists increased reporting of outliers in response to the reproducibility crisis?. Soc. Personal. Psychol. Compass 2021, 15, e12591. 10.1111/spc3.12591. [DOI] [Google Scholar]; b Hardwicke T. E.; Wagenmakers E.-J. Reducing bias, increasing transparency and calibrating confidence with preregistration. Nat. Hum. Behav. 2023, 7, 15–26. 10.1038/s41562-022-01497-2. [DOI] [PubMed] [Google Scholar]; c Roselli D.; Matthews J.; Talagala N.. Managing Bias in AI. In Companion Proceedings of The 2019 World Wide Web Conference; Liu L., White R., Eds.; ACM: New York, 2019; pp 539–544.; d Höller Y.; Uhl A.; Bathke A.; Thomschewski A.; Butz K.; Nardone R.; Fell J.; Trinka E. Reliability of EEG Measures of Interaction: A Paradigm Shift Is Needed to Fight the Reproducibility Crisis. Front. Hum. Neurosci. 2017, 11, 441. 10.3389/fnhum.2017.00441. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. a Maloney M. P.; Coley C. W.; Genheden S.; Carson N.; Helquist P.; Norrby P.-O.; Wiest O. Negative Data in Data Sets for Machine Learning Training. Org. Lett. 2023, 25, 2945–2947. 10.1021/acs.orglett.3c01282.; b Mercado R.; Kearnes S. M.; Coley C. W. Data Sharing in Chemistry: Lessons Learned and a Case for Mandating Structured Reaction Data. J. Chem. Inf. Model. 2023, 63, 4253–4265. 10.1021/acs.jcim.3c00607.; c Raghavan P.; Haas B. C.; Ruos M. E.; Schleinitz J.; Doyle A. G.; Reisman S. E.; Sigman M. S.; Coley C. W. Dataset Design for Building Models of Chemical Reactivity. ACS Cent. Sci. 2023, 9, 2196–2204. 10.1021/acscentsci.3c01163.
  18. a Gensch T.; Glorius F. The straight dope on the scope of chemical reactions. Science 2016, 352, 294–295. 10.1126/science.aaf3539.; b Bess E. N.; Bischoff A. J.; Sigman M. S. Designer substrate library for quantitative, predictive modeling of reaction performance. Proc. Natl. Acad. Sci. U.S.A. 2014, 111, 14698–14703. 10.1073/pnas.1409522111.; c Prieto Kullmer C. N.; Kautzky J. A.; Krska S. W.; Nowak T.; Dreher S. D.; MacMillan D. W. C. Accelerating reaction generality and mechanistic insight through additive mapping. Science 2022, 376, 532–539. 10.1126/science.abn1885.; d Wagen C. C.; McMinn S. E.; Kwan E. E.; Jacobsen E. N. Screening for generality in asymmetric catalysis. Nature 2022, 610, 680–686. 10.1038/s41586-022-05263-2.; e Pitzer L.; Schäfers F.; Glorius F. Rapid Assessment of the Reaction-Condition-Based Sensitivity of Chemical Transformations. Angew. Chem., Int. Ed. 2019, 58, 8572–8576. 10.1002/anie.201901935.; f Gensch T.; Teders M.; Glorius F. Approach to Comparing the Functional Group Tolerance of Reactions. J. Org. Chem. 2017, 82, 9154–9159. 10.1021/acs.joc.7b01139.; g Dreher S. D.; Krska S. W. Chemistry Informer Libraries: Conception, Early Experience, and Role in the Future of Cheminformatics. Acc. Chem. Res. 2021, 54, 1586–1596. 10.1021/acs.accounts.0c00760.
  19. a Collins K. D.; Glorius F. A robustness screen for the rapid assessment of chemical reactions. Nat. Chem. 2013, 5, 597–601. 10.1038/nchem.1669.; b Collins K. D.; Glorius F. Intermolecular reaction screening as a tool for reaction evaluation. Acc. Chem. Res. 2015, 48, 619–627. 10.1021/ar500434f.
  20. Kutchukian P. S.; Dropinski J. F.; Dykstra K. D.; Li B.; DiRocco D. A.; Streckfuss E. C.; Campeau L.-C.; Cernak T.; Vachal P.; Davies I. W.; Krska S. W.; Dreher S. D. Chemistry informer libraries: a chemoinformatics enabled approach to evaluate and advance synthetic methods. Chem. Sci. 2016, 7, 2604–2613. 10.1039/C5SC04751J.
  21. a Kariofillis S. K.; Jiang S.; Żurański A. M.; Gandhi S. S.; Martinez Alvarado J. I.; Doyle A. G. Using Data Science To Guide Aryl Bromide Substrate Scope Analysis in a Ni/Photoredox-Catalyzed Cross-Coupling with Acetals as Alcohol-Derived Radical Sources. J. Am. Chem. Soc. 2022, 144, 1045–1055. 10.1021/jacs.1c12203.; For related works on using quantum chemical descriptors and data science tools for substrate and catalyst selection, see: b Haas B. C.; Goetz A. E.; Bahamonde A.; McWilliams J. C.; Sigman M. S. Predicting relative efficiency of amide bond formation using multivariate linear regression. Proc. Natl. Acad. Sci. U.S.A. 2022, 119, e2118451119. 10.1073/pnas.2118451119.; c Gensch T.; Smith S. R.; Colacot T. J.; Timsina Y. N.; Xu G.; Glasspoole B. W.; Sigman M. S. Design and Application of a Screening Set for Monophosphine Ligands in Cross-Coupling. ACS Catal. 2022, 12, 7773–7780. 10.1021/acscatal.2c01970.; d Ruos M. E.; Kinney R. G.; Ring O. T.; Doyle A. G. A General Photocatalytic Strategy for Nucleophilic Amination of Primary and Secondary Benzylic C-H Bonds. J. Am. Chem. Soc. 2023, 145, 18487–18496. 10.1021/jacs.3c04912.; e Zahrt A. F.; Henle J. J.; Rose B. T.; Wang Y.; Darrow W. T.; Denmark S. E. Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning. Science 2019, 363, 247. 10.1126/science.aau5631.; f van Dijk L.; Haas B. C.; Lim N.-K.; Clagg K.; Dotson J. J.; Treacy S. M.; Piechowicz K. A.; Roytman V. A.; Zhang H.; Toste F. D.; Miller S. J.; Gosselin F.; Sigman M. S. Data Science-Enabled Palladium-Catalyzed Enantioselective Aryl-Carbonylation of Sulfonimidamides. J. Am. Chem. Soc. 2023, 145, 20959–20967. 10.1021/jacs.3c06674.; g Tang T.; Hazra A.; Min D. S.; Williams W. L.; Jones E.; Doyle A. G.; Sigman M. S. Interrogating the Mechanistic Features of Ni(I)-Mediated Aryl Iodide Oxidative Addition Using Electroanalytical and Statistical Modeling Techniques. J. Am. Chem. Soc. 2023, 145, 8689–8699. 10.1021/jacs.3c01726.; h Olen C. L.; Zahrt A. F.; Reilly S. W.; Schultz D.; Emerson K.; Candito D.; Wang X.; Strotman N. A.; Denmark S. E. Chemoinformatic Catalyst Selection Methods for the Optimization of Copper–Bis(oxazoline)-Mediated, Asymmetric, Vinylogous Mukaiyama Aldol Reactions. ACS Catal. 2024, 14, 2642–2655. 10.1021/acscatal.3c05903.
  22. Wishart D. S.; Feunang Y. D.; Guo A. C.; Lo E. J.; Marcu A.; Grant J. R.; Sajed T.; Johnson D.; Li C.; Sayeeda Z.; Assempour N.; Iynkkaran I.; Liu Y.; Maciejewski A.; Gale N.; Wilson A.; Chin L.; Cummings R.; Le D.; Pon A.; Knox C.; Wilson M. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018, 46, D1074–D1082. 10.1093/nar/gkx1037.
  23. Dunkel M.; Schmidt U.; Struck S.; Berger L.; Gruening B.; Hossbach J.; Jaeger I. S.; Effmert U.; Piechulla B.; Eriksson R.; Knudsen J.; Preissner R. SuperScent-a database of flavors and scents. Nucleic Acids Res. 2009, 37, D291–D294. 10.1093/nar/gkn695.
  24. a Rogers D.; Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t.; For related works on mapping substrates using Mordred descriptors, see: b Rein J.; Rozema S. D.; Langner O. C.; Zacate S. B.; Hardy M. A.; Siu J. C.; Mercado B. Q.; Sigman M. S.; Miller S. J.; Lin S. Generality-oriented optimization of enantioselective aminoxyl radical catalysis. Science 2023, 380, 706–712. 10.1126/science.adf6177.; c Leibler I. N.-M.; Gandhi S. S.; Tekle-Smith M. A.; Doyle A. G. Strategies for Nucleophilic C(sp3)-(Radio)Fluorination. J. Am. Chem. Soc. 2023, 145, 9928–9950. 10.1021/jacs.3c01824.
  25. Ahneman D. T.; Estrada J. G.; Lin S.; Dreher S. D.; Doyle A. G. Predicting reaction performance in C-N cross-coupling using machine learning. Science 2018, 360, 186–190. 10.1126/science.aar5169.
  26. Probst D.; Reymond J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminform. 2020, 12, 12. 10.1186/s13321-020-0416-x.
  27. a Awale M.; Reymond J.-L. Web-based 3D-visualization of the DrugBank chemical space. J. Cheminform. 2016, 8, 25. 10.1186/s13321-016-0138-2.; b Öztürk H.; Özgür A.; Schwaller P.; Laino T.; Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discovery Today 2020, 25, 689–705. 10.1016/j.drudis.2020.01.020.
  28. a Reymond J.-L. The chemical space project. Acc. Chem. Res. 2015, 48, 722–730. 10.1021/ar500432k.; b Strieth-Kalthoff F.; Sandfort F.; Segler M. H. S.; Glorius F. Machine learning the ropes: principles, applications and directions in synthetic chemistry. Chem. Soc. Rev. 2020, 49, 6154–6168. 10.1039/C9CS00786E.; c Schwaller P.; Vaucher A. C.; Laplaza R.; Bunne C.; Krause A.; Corminboeuf C.; Laino T. Machine intelligence for chemical reaction space. WIREs Comput. Mol. Sci. 2022, 12, e1604. 10.1002/wcms.1604.; d Jaeger S.; Fulle S.; Turk S. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J. Chem. Inf. Model. 2018, 58, 27–35. 10.1021/acs.jcim.7b00616.; e Maser M. R.; Cui A. Y.; Ryou S.; DeLano T. J.; Yue Y.; Reisman S. E. Multilabel Classification Models for the Prediction of Cross-Coupling Reaction Conditions. J. Chem. Inf. Model. 2021, 61, 156–166. 10.1021/acs.jcim.0c01234.; f Schwaller P.; Hoover B.; Reymond J.-L.; Strobelt H.; Laino T. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Sci. Adv. 2021, 7, eabe4166. 10.1126/sciadv.abe4166.; g Schwaller P.; Probst D.; Vaucher A. C.; Nair V. H.; Kreutter D.; Laino T.; Reymond J.-L. Mapping the space of chemical reactions using attention-based neural networks. Nat. Mach. Intell. 2021, 3, 144–152. 10.1038/s42256-020-00284-w.
  29. a Eckert H.; Bajorath J. Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug Discovery Today 2007, 12, 225–233. 10.1016/j.drudis.2007.01.011.; b Maggiora G.; Vogt M.; Stumpfe D.; Bajorath J. Molecular similarity in medicinal chemistry. J. Med. Chem. 2014, 57, 3186–3204. 10.1021/jm401411z.
  30. McInnes L.; Healy J.; Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, https://arxiv.org/abs/1802.03426.
  31. Jansen D. J.; Shenvi R. A. Synthesis of medicinally relevant terpenes: reducing the cost and time of drug discovery. Future Med. Chem. 2014, 6, 1127–1148. 10.4155/fmc.14.71.
  32. Meutermans W.; Le G. T.; Becker B. Carbohydrates as scaffolds in drug discovery. ChemMedChem 2006, 1, 1164–1194. 10.1002/cmdc.200600150.
  33. Rousseeuw P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. 10.1016/0377-0427(87)90125-7.
  34. Müllner D. Modern hierarchical, agglomerative clustering algorithms. arXiv 2011, https://arxiv.org/abs/1109.2378.
  35. Provided web-interface for the standardized substrate selection workflow: https://pharmascope.uni-muenster.de/ (accessed 14-03-2024).
  36. Tan G.; Das M.; Keum H.; Bellotti P.; Daniliuc C.; Glorius F. Photochemical single-step synthesis of β-amino acid derivatives from alkenes and (hetero)arenes. Nat. Chem. 2022, 14, 1174–1184. 10.1038/s41557-022-01008-w.
  37. Kolb H. C.; VanNieuwenhze M. S.; Sharpless K. B. Catalytic Asymmetric Dihydroxylation. Chem. Rev. 1994, 94, 2483–2547. 10.1021/cr00032a009.
  38. a VanRheenen V.; Kelly R. C.; Cha D. Y. An improved catalytic OsO4 oxidation of olefins to cis-1,2-glycols using tertiary amine oxides as the oxidant. Tetrahedron Lett. 1976, 17, 1973–1976. 10.1016/S0040-4039(00)78093-2.; b Schroeder M. Osmium tetraoxide cis hydroxylation of unsaturated substrates. Chem. Rev. 1980, 80, 187–213. 10.1021/cr60324a003.; c Jacobsen E. N.; Marko I.; Mungall W. S.; Schroeder G.; Sharpless K. B. Asymmetric dihydroxylation via ligand-accelerated catalysis. J. Am. Chem. Soc. 1988, 110, 1968–1970. 10.1021/ja00214a053.

Associated Data


Supplementary Materials

oc3c01638_si_001.pdf (5.6MB, pdf)

Data Availability Statement

The web interface can be accessed through the link https://pharmascope.uni-muenster.de/. The developed code and all associated datasets can be found at https://zivgitlab.uni-muenster.de/ag-glorius/published-paper/standardizing-substrate-selection/.


Articles from ACS Central Science are provided here courtesy of American Chemical Society
