Abstract

Introduction
Recent advances in machine learning (ML) offer new opportunities for homogeneous catalysis, from accelerating discovery to optimizing performance and enabling sustainable design. However, the development of new catalysts in this domain remains predominantly empirical, with limited integration of data-driven methodologies. As a result, progress is still largely guided by intuition and trial-and-error approaches that differ little from those used decades ago.
One of the main barriers preventing the widespread adoption of ML is educational, as most chemists lack formal training in data science and ML techniques can seem out of their reach. In addition, the supporting ecosystem, including curated datasets, user-friendly tools, and clear benchmarking practices, is still underdeveloped or scattered across domains. This situation limits the use of ML to a small number of specialized groups, while its integration into routine experimental workflows is still uncommon.
This Viewpoint offers a set of guidelines and considerations for implementing ML tools in homogeneous catalysis in a way that is both rigorous and accessible. The manuscript is particularly suited for experimental and computational chemists with little to no experience in ML who intend to apply this technology in their research. The protocols discussed are particularly useful for projects involving dozens to hundreds of experiments or calculations, a common scenario in the field.
Herein, we outline how to represent molecules in a way that is computer readable, structure datasets, and apply ML models that balance predictive power with interpretability. Through selected case studies, we highlight applications commonly encountered in homogeneous catalysis where chemists can benefit from data-driven strategies, particularly in substrate and catalyst sampling or discovery.
Our aim is to help chemists engage more critically and effectively with ML, not just as tool users but as informed practitioners aware of its possibilities and limitations. Ultimately, we argue that the thoughtful integration of ML can transform catalysis, not by replacing chemical intuition but by enhancing it with data-driven insight. While this work focuses on homogeneous catalysis, it is simply an illustration of how coupling ML with chemistry can drive significant breakthroughs across the broader chemistry and materials domains.
FIRST STEP: DIGITALIZATION OF MOLECULES
One of the first steps in enabling ML predictions is to represent molecules in a way that is computer readable. When chemists examine a molecule, they intuitively recognize functional groups, electronic effects, and steric environments (Figure 1). For example, a nitro group (−NO2) is electron-withdrawing, a tert-butyl group (−tBu) introduces steric hindrance, and a metal center modulates reactivity through its ligand environment. These concepts, while intuitive to humans, are meaningless to an algorithm unless translated into numerical values.
Figure 1. Examples of transforming concepts from chemical intuition (left) into machine-readable descriptors (right).
This translation process, known as featurization, represents the first critical step in applying ML to catalysis, as it is the conceptual bridge between chemical intuition and data-driven modeling. For instance, if chemists want an ML model to quantify the steric hindrance of a target reactive site, they can use numerical descriptors such as buried volume (V_bur). This descriptor was originally developed to describe the steric demand of ligands around metal centers in organometallic complexes, but it can be adapted to evaluate steric environments around any atom (Figure 1, top right). Similarly, for electronic effects, chemists need to calculate features such as electrostatic potential (ESP) or atomic charges. These values help capture whether a ring is electron-rich (e.g., with a −NMe2 group) or electron-poor (e.g., with a −NO2 group), transforming chemical intuition into machine-readable data that a model can interpret (Figure 1, bottom right).
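To make the idea of a steric descriptor concrete, the following minimal sketch estimates a percent buried volume on a grid: it counts which fraction of a sphere centered on the atom of interest falls inside the van der Waals spheres of neighboring atoms. The sphere radius (3.5 Å), the Bondi radii scaled by 1.17, the grid spacing, and the coordinates are all assumptions chosen for illustration; dedicated buried-volume programs use more refined integration and should be preferred for production work.

```python
import math

# Bondi van der Waals radii in angstroms for a few elements (assumed subset).
BONDI_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "P": 1.80, "Br": 1.85}


def percent_buried_volume(center, atoms, sphere_radius=3.5, radii_scale=1.17, step=0.25):
    """Percentage of a sphere around `center` that is occupied by neighboring atoms.

    center: (x, y, z) of the atom of interest, in angstroms.
    atoms:  list of (element, (x, y, z)) for the neighbors; the center atom is excluded.
    """
    n_steps = int(round(sphere_radius / step))
    scaled = [(BONDI_RADII[el] * radii_scale, xyz) for el, xyz in atoms]
    in_sphere = 0
    buried = 0
    for i in range(-n_steps, n_steps + 1):
        for j in range(-n_steps, n_steps + 1):
            for k in range(-n_steps, n_steps + 1):
                dx, dy, dz = i * step, j * step, k * step
                if dx * dx + dy * dy + dz * dz > sphere_radius ** 2:
                    continue  # grid point lies outside the probe sphere
                in_sphere += 1
                point = (center[0] + dx, center[1] + dy, center[2] + dz)
                if any(math.dist(point, xyz) <= r for r, xyz in scaled):
                    buried += 1
    return 100.0 * buried / in_sphere


# Hypothetical coordinates (angstroms) around an atom placed at the origin.
open_site = [("C", (1.9, 0.0, 0.0))]
hindered_site = open_site + [("C", (-0.95, 1.65, 0.0)), ("C", (-0.95, -1.65, 0.0))]

vbur_open = percent_buried_volume((0.0, 0.0, 0.0), open_site)
vbur_hindered = percent_buried_volume((0.0, 0.0, 0.0), hindered_site)
print(f"open: {vbur_open:.1f}%  hindered: {vbur_hindered:.1f}%")
```

The more crowded environment returns the larger value, which is the kind of number a model can use in place of the qualitative statement "this site is hindered".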
Descriptors can capture the properties of an entire molecule (molecular descriptors) or specific atoms within the molecule (atomic descriptors). In ML models for homogeneous catalysis, molecular descriptors are often complemented by atomic descriptors, as catalytic activity and selectivity frequently depend on the specific local environment around reactive sites. Importantly, chemical intuition plays a central role in selecting which descriptors to generate, since the performance of ML models is strongly influenced by the relevance of the input features. For example, a chemist experienced in Michael additions understands that the reaction outcome largely depends on the electrophilicity and steric accessibility of the carbon atom undergoing nucleophilic attack. Failing to include descriptors for this atom or introducing irrelevant features will likely reduce the quality of the resulting model, even if performance metrics appear high.
To capture these local effects in an ML framework, atomic descriptors for that carbon should contain electronic properties (e.g., partial charges and Fukui indices) and steric parameters such as V_bur or solvent-accessible surface area (SASA). Similarly, in metal-catalyzed reactions, the metal center often dominates reactivity, and descriptors that capture the electronic and steric properties of that atom are essential for building accurate predictive models. In such cases, the properties of the coordinating atoms of the ligands are often used instead of those of the metal itself. For example, the phosphorus atom in phosphine ligands or the nitrogen atoms in pyridines may be used to obtain representative descriptors. This approach enables simpler featurization protocols while still capturing the key characteristics of the catalytic center.
For nonspecialists, the simplest entry point into descriptor generation might be the use of online descriptor databases with quantum-mechanical (QM) features for large collections of ligands and catalysts. These tools offer researchers easy access to rich electronic and steric information without requiring custom calculations or coding skills. However, such databases typically cover only a limited range of structures and may be of limited use when the goal is to design new ligands. In practice, especially in catalysis, descriptors are often generated on-demand by using QM methods. Density functional theory (DFT) and semiempirical approaches such as GFN2-xTB are increasingly popular for calculating descriptors relevant to catalytic performance, including atomic charges, HOMO–LUMO energies (where HOMO denotes highest occupied molecular orbitals and LUMO denotes lowest unoccupied molecular orbitals), and Fukui indices, while steric parameters can be derived from the optimized QM structures.
This QM strategy is particularly valuable when experimental data are scarce, as is often the case in catalysis, where only a few dozen experiments may be available. However, when working with large-scale datasets, such as those generated in digital high-throughput screenings, QM calculations become computationally prohibitive. In these scenarios, alternative strategies such as molecular fingerprints from libraries like RDKit or feature-learning methods based on graph neural networks (GNNs) allow the rapid featurization of millions of molecules within reasonable computational times. These scalable methods are powerful when data are abundant, but they come with tradeoffs: they often require greater programming expertise and generally lack the interpretability of physics-based descriptors. Consequently, while fingerprints and GNNs are highly effective for large, high-volume datasets, they are rarely the first choice for the smaller, carefully curated datasets that are typical in catalysis research.
More information, implementation details, guidelines, and considerations about descriptor generation are provided in Tables 1 and 2.
Table 1. Guidelines for Generating Descriptors
Table 2. Warnings and Considerations for Descriptor Generation
SAMPLING SUBSTRATES AND CATALYSTS WITH DATA-DRIVEN CLUSTERING
One of the most common pitfalls of human intuition arises in selecting substrate scopes, a standard practice used to evaluate the generality of a catalytic method. In most articles, researchers tend to follow well-established patterns of chemical intuition to study how applicable the method is. They often explore electronic effects with a one-variable-at-a-time strategy, such as substituting aromatics with a para electron-withdrawing group (−NO2, −CF3, −CN), hydrogen, or electron-donating group (−OMe, −NMe2). Similarly, steric effects are usually probed by replacing substituents such as −H, −Me, −iPr, and −tBu at a single position close to the reactive center. While this approach provides a general sense of how structural modifications influence catalytic activity, it represents an inefficient way of exploring chemical space and yields results that cover only a small fraction of the available diversity.
As an illustrative example, we downloaded and curated a set of 25 048 commercially available bromoaryl (Ar–Br) substrates from the chemical supplier Enamine, specifically from a collection designed for Pd-catalyzed cross-couplings. We then mapped the chemical space of substrates available from this vendor using two key descriptors: an electronic property (the partial charge on the leaving Br atom) and a steric parameter (the V_bur of Br). The chemical space is represented with blue points in the graphs of Figure 2. For comparison, we examined two reported metal-catalyzed cross-couplings of aromatic halides, case A and case B, which involved 19 and 12 substrates, respectively.
Figure 2. Chemical space exploration through human-guided (top), random (middle), and data-driven (bottom) sampling.
In both cases, the chosen molecules occupy only a small region of the vast and diverse chemical space (see the orange area in the top portion of Figure 2), underscoring the limited scope and generality of the catalytic methods. Although these two cases serve purely as illustrative examples, similar trends likely extend across much of the cross-coupling literature, suggesting that relying solely on reported examples may provide an incomplete view of catalytic generality and limitations.
Based on this result, one might argue that, once such a dataset of substrates is created, even random selection strategies could often explore a more diverse region of chemical space. In the example shown in Figure 2, middle, we picked substrates at random and observed a considerable increase in the coverage of the explored region, although it still represented only a limited portion of the total space. A key drawback of this approach, however, is that random selections are inherently dependent on the chosen random seed, and whether the sampled substrates cover a wide chemical space is essentially a matter of luck.
To overcome the inefficiencies of traditional substrate selection, unsupervised ML methods such as clustering are gaining traction in the chemistry community, with notable applications from the Sigman, Doyle, and Glorius groups, among others. Clustering algorithms require only descriptors, without any pre-existing activity data, to partition the chemical space into groups of compounds with similar properties. The resulting clusters reveal natural groupings without prior bias, and representative candidates can then be selected for targeted experimental validation.
An illustrative example using 38 Ar–Br structures and two properties (V_bur and the partial charge of Br) is shown in Figure 3. In this example, the molecules are separated into three clusters. Cluster A consists of phenyl rings with electron-donating groups in the para position relative to the Br atom and no proximal bulky substituents, corresponding to a low partial charge and low V_bur at the Br atom. Cluster B contains rings where the Br atom is flanked by bulky and electron-donating groups, appearing in the upper left region of the plot. Cluster C includes aromatics with para electron-withdrawing groups relative to the Br atom and no nearby steric repulsions, located in the lower right region of the graph.
Figure 3. Example of a clustering result with three clusters (left) and the molecular properties associated with each cluster (right).
Once the clusters are defined, the point closest to the centroid of each cluster can be selected as a representative molecule for sampling (Figure 3, points with black borders). Using the larger-scale example from Figure 2, we generated 19 clusters using a k-means clustering algorithm and then selected the corresponding centroid molecules for comparison with the original 19 substrates from case A. Unlike intuition-driven selection, which is often guided by prior experience and limited to low-dimensional changes, clustering approaches objectively capture high-dimensional descriptor relationships and enable more efficient chemical exploration (Figure 2, bottom). By selecting a small set of substrates from distinct clusters, chemists can efficiently probe a more diverse and representative chemical space, ultimately gaining deeper insight into the generality and limitations of their catalytic methods.
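The selection procedure described above can be sketched in a few lines: standardize the descriptors, partition them with k-means, and keep the real molecule closest to each centroid. The sketch below is a plain-Python illustration, not the code used for the figures. The substrate names and descriptor values are hypothetical, and the deterministic farthest-first seeding is an assumption made so that the result does not depend on a random seed; library implementations such as scikit-learn's KMeans use a randomized k-means++ seeding by default.

```python
import math


def standardize(rows):
    """Scale every descriptor column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def farthest_first_init(points, k):
    """Deterministic seeding: start at the first point, then add the farthest ones."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representatives(points, labels, centroids):
    """Index of the data point closest to each (non-empty) cluster centroid."""
    reps = []
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps.append(min(members, key=lambda i: math.dist(points[i], centroid)))
    return reps


# Hypothetical descriptors for nine aryl bromides: (partial charge on Br, V_bur of Br in %).
names = [f"ArBr-{i}" for i in range(1, 10)]
descriptors = [
    (-0.12, 30.0), (-0.11, 31.0), (-0.13, 29.5),  # electron-rich, unhindered
    (-0.12, 45.0), (-0.10, 46.5), (-0.11, 44.0),  # electron-rich, hindered
    (-0.02, 30.5), (-0.01, 31.5), (-0.03, 29.0),  # electron-poor, unhindered
]

scaled = standardize(descriptors)
labels, centroids = kmeans(scaled, k=3)
selected = [names[i] for i in representatives(scaled, labels, centroids)]
print(selected)
```

Standardizing first matters here because the two descriptors live on very different numerical scales; without it, V_bur would dominate every distance and the partial charge would be effectively ignored.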
In addition to guiding substrate selection, clustering can also be applied to explore catalyst diversity, a common strategy in homogeneous catalysis for optimizing the reactivity and selectivity. Different groups have adopted this data-driven methodology, with notable examples from the Pérez-Ramírez, Jorner, Ackermann, and Denmark groups, among others.
Clustering has also been employed to discover catalysts by incorporating experimental results into generated clusters. For example, Schoenebeck and co-workers constructed a chemical space of ligands for the synthesis of palladium(I) dimers, which are challenging to stabilize. Using a k-means algorithm, they divided ligand space into groups with related properties and subsequently incorporated stability data from five ligands. This analysis revealed that certain clusters were enriched in active ligands, whereas others contained inactive ones. This relatively simple strategy enabled the digital exploration of new phosphine ligands within the chemical space and the prioritization of candidates from clusters containing stable representatives. Experimental validation ultimately led to the discovery of eight previously unexplored Pd dimers under conditions of very limited available data.
As a final remark for this section, it is important to note that more than three descriptors are typically used to featurize molecules and, consequently, to define chemical spaces. In such cases, the principal components obtained through principal component analysis (PCA) can be employed to represent the chemical space in two- or three-dimensional plots. Plotting the sampling selection in these graphs helps to verify that there is sufficient coverage of the chemical space. When using this approach, researchers should ensure that the selected principal components together capture a substantial portion of the dataset’s variance (typically 60%–70% or more). Alternatively, more advanced nonlinear dimensionality reduction techniques, such as UMAP and t-SNE, are often preferred over PCA for visualizing chemical space and guiding sample selection.
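The variance check mentioned above can be illustrated with a short sketch that computes the fraction of variance carried by the first two principal components. It works on z-scored descriptors and obtains the leading eigenvalues by power iteration with deflation, a choice made here only to keep the example free of dependencies; in practice a library routine (for instance the `explained_variance_ratio_` attribute of scikit-learn's PCA) would normally be used. The descriptor table is hypothetical, and two of its columns are deliberately redundant so that two components capture essentially all the variance.

```python
import math
import random


def correlation_matrix(rows):
    """Covariance matrix of z-scored descriptors (assumes no constant column)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(d)] for a in range(d)]


def explained_variance_ratio(matrix, n_components, n_iter=500, seed=0):
    """Fraction of the total variance carried by each leading principal component."""
    d = len(matrix)
    total = sum(matrix[i][i] for i in range(d))  # trace = total variance
    m = [row[:] for row in matrix]
    rng = random.Random(seed)
    ratios = []
    for _component in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(n_iter):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # the remaining variance is numerically zero
            v = [x / norm for x in w]
        mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
        ratios.append(eigenvalue / total)
        # Deflation: remove the component just found before searching for the next one.
        for i in range(d):
            for j in range(d):
                m[i][j] -= eigenvalue * v[i] * v[j]
    return ratios


# Hypothetical table: six molecules, four descriptors.
# Column 2 is a multiple of column 1 and column 4 mirrors column 3 (redundant on purpose).
table = [
    (1.0, 2.0, 2.0, -2.0),
    (2.0, 4.0, 1.0, -1.0),
    (3.0, 6.0, 4.0, -4.0),
    (4.0, 8.0, 3.0, -3.0),
    (5.0, 10.0, 6.0, -6.0),
    (6.0, 12.0, 5.0, -5.0),
]

ratios = explained_variance_ratio(correlation_matrix(table), n_components=2)
cumulative = sum(ratios)
print(f"PC1 + PC2 capture {100 * cumulative:.1f}% of the variance")
print("sufficient for a 2D map" if cumulative >= 0.60 else "consider more components")
```

The 0.60 cutoff in the last line simply mirrors the lower end of the 60%–70% guideline given in the text; it is not a universal threshold.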
Guidelines and considerations relevant to this section are summarized in Tables 3 and 4 and should be carefully reviewed before attempting any clustering-based sampling.
Table 3. Guidelines for Sampling with Clustering and for Supervised ML
Table 4. Warnings and Considerations for Clustering
CATALYST DISCOVERY WITH SUPERVISED ML
Supervised learning refers to the development of predictive models from datasets that contain descriptors as the X matrix (e.g., electronic properties and steric features) and one or more outcomes as the y matrix (e.g., reactivity or selectivity). Broadly, the X matrix is processed by ML algorithms such as linear regression, random forests, and neural networks, which learn from the descriptor data to predict the y-values. In catalysis, this learning of patterns across datasets makes it possible to map complex relationships between reaction parameters and performance metrics, enabling smarter experimental design and accelerating the discovery of optimal catalysts and conditions.
The target catalytic properties (y-values) may be obtained from either experimental measurements or computational calculations. Experimentally, catalytic activity is often quantified through yields, turnover numbers, and turnover frequencies, while selectivity is typically expressed using metrics such as enantiomeric excess and diastereoselectivity or regioselectivity ratios. Computationally, the most common approach involves DFT calculations, which provide energy barriers for competing pathways. These data allow direct computation of parameters such as ΔG⧧ or ΔE⧧ for reaction rates, and ΔΔG⧧ or ΔΔE⧧ for selectivity.
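As a brief worked illustration of how computed barriers translate into the experimental quantities used as y-values, the sketch below applies two textbook relationships: the Eyring equation, k = (k_B T / h) exp(−ΔG⧧ / RT), and the selectivity ratio under kinetic control, er = exp(ΔΔG⧧ / RT), from which ee = (er − 1) / (er + 1). It assumes a transmission coefficient of 1, two competing pathways that start from a common intermediate, energies in kcal/mol, and a default temperature of 298.15 K; the function names are illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal mol^-1 K^-1
K_B = 1.380649e-23    # Boltzmann constant, J K^-1
H = 6.62607015e-34    # Planck constant, J s


def eyring_rate_constant(dg_act, temperature=298.15):
    """First-order rate constant (s^-1) from an activation free energy in kcal/mol."""
    return (K_B * temperature / H) * math.exp(-dg_act / (R_KCAL * temperature))


def enantiomeric_excess(ddg_act, temperature=298.15):
    """ee (as a fraction) from the free energy gap between two competing transition states."""
    er = math.exp(ddg_act / (R_KCAL * temperature))
    return (er - 1.0) / (er + 1.0)


k_20 = eyring_rate_constant(20.0)
ee_1 = enantiomeric_excess(1.0)
print(f"barrier of 20 kcal/mol -> k = {k_20:.2e} s^-1")
print(f"gap of 1 kcal/mol     -> ee = {100 * ee_1:.0f}%")
```

Because both relationships are exponential, an error of 1 kcal/mol in a computed barrier changes the predicted rate or selectivity substantially, which is worth keeping in mind when DFT-derived values are used as training targets.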
Supervised learning problems in catalysis can generally be divided into regression and classification tasks. In regression, the objective is to predict numerical values within a continuous range, such as yields, rate constants, or enantioselectivity. In contrast, classification problems aim to predict discrete outcomes, such as whether a catalyst is active or inactive, or to categorize performance levels (e.g., high vs low selectivity). Notably, even small datasets containing as few as 18 reactions can deliver reliable predictions, provided that the chosen descriptors effectively capture the key chemical information.
In both regression and classification problems, two main data-driven strategies are typically used for catalyst discovery: rational catalyst design enabled by explainable ML, and ML-based candidate prediction (Figure 4). These strategies are often integrated into iterative active learning cycles. In the first case, Strategy A employs a predictive model combined with SHapley Additive exPlanations (SHAP) feature analysis, enabling chemists to gain mechanistic insights and prioritize the most informative candidates for testing through rational design. In the example from the figure, the SHAP analysis reveals that reduced steric hindrance and lower charge at the Pd centers (Pd V_bur and Pd charge) correlate with higher yields. This insight can help chemists to propose improved catalysts by incorporating groups that minimize these properties. A landmark example by Sigman and co-workers used multivariate linear regression to correlate steric and electronic descriptors of BINOL-derived phosphoric acids with experimental enantioselectivities, enabling accurate out-of-sample predictions (beyond the training set) and guiding ligand selection across diverse substrates.
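A minimal sketch of the interpretable-model idea is given below using multivariate linear regression on standardized descriptors, where the sign and size of each coefficient indicate how a descriptor relates to the outcome. This is a simpler stand-in for a SHAP analysis, not an implementation of SHAP. The six catalysts, their Pd charge and Pd V_bur values, and the yields are hypothetical and were generated from an exact linear relationship, so the fit is perfect by construction.

```python
import math


def zscore_columns(X):
    """Standardize descriptors so that regression coefficients are comparable."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def fit_linear(X, y):
    """Ordinary least squares with intercept, solved through the normal equations."""
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    # Augmented matrix of the normal equations (A^T A) b = A^T y.
    M = [[sum(r[i] * r[j] for r in A) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(A, y))] for i in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for c in range(col, p + 1):
                M[r][c] -= factor * M[col][c]
    coef = [0.0] * p
    for i in range(p - 1, -1, -1):  # back substitution
        coef[i] = (M[i][p] - sum(M[i][j] * coef[j] for j in range(i + 1, p))) / M[i][i]
    return coef  # [intercept, coefficient 1, coefficient 2, ...]


def predict(coef, row):
    return coef[0] + sum(b * v for b, v in zip(coef[1:], row))


# Hypothetical catalysts: (Pd charge, Pd V_bur in %) and yields in %.
X_raw = [(0.20, 30.0), (0.25, 38.0), (0.30, 33.0), (0.35, 40.0), (0.40, 36.0), (0.30, 28.0)]
yields = [90.0, 64.0, 64.0, 40.0, 38.0, 74.0]

X = zscore_columns(X_raw)
coef = fit_linear(X, yields)
for name, b in zip(["Pd charge", "Pd V_bur"], coef[1:]):
    print(f"{name}: {b:+.2f} yield points per standard deviation")
```

Both coefficients come out negative, so in this toy dataset lower charge and lower steric bulk at Pd are associated with higher yields. With real, noisy data the same reading is only justified after the model has passed the validation checks discussed later in this section.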
Figure 4. Comparison of different data-driven catalyst discovery strategies: rational design versus ML-based suggestions.
Alternatively, Strategy B focuses on discovering catalysts with minimal experimentation by relying solely on ML-guided suggestions. This strategy often employs Bayesian optimization (BO) to prioritize experiments, balancing options expected to work best (exploitation) with less certain choices that can provide new insights (exploration). In catalysis, this approach maximizes information gained per experiment while minimizing redundant trials. This discovery strategy is particularly popular for optimizing reaction conditions using descriptor matrices that include variables such as the temperature, catalyst loading, and solvent. As an example, Doyle and co-workers demonstrated that BO identified high-yielding reaction conditions for a Pd-catalyzed arylation reaction faster than human chemists, highlighting its efficiency in navigating complex reaction landscapes.
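The balance between exploitation and exploration can be made concrete with an acquisition function. The sketch below uses expected improvement, one common choice in BO, and assumes that a surrogate model (not shown) has already provided a predicted mean yield and an uncertainty for each untested condition. The candidate names, the predicted values, and the small `xi` margin are hypothetical.

```python
import math


def expected_improvement(mean, std, best_so_far, xi=0.01):
    """Expected improvement over the best observed value (maximization)."""
    gain = mean - best_so_far - xi
    if std <= 0.0:
        return max(gain, 0.0)  # no uncertainty: improvement is either certain or absent
    z = gain / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + std * pdf


best_so_far = 75.0  # best yield (%) measured so far

# Hypothetical surrogate predictions: (predicted mean yield, predicted standard deviation).
candidates = {
    "exploit": (78.0, 2.0),   # slightly better than the current best, well understood
    "explore": (70.0, 15.0),  # worse on average, but highly uncertain
    "poor": (60.0, 3.0),      # confidently worse
}

scores = {name: expected_improvement(m, s, best_so_far) for name, (m, s) in candidates.items()}
next_experiment = max(scores, key=scores.get)
print(scores)
print("next experiment:", next_experiment)
```

In this example the uncertain candidate is ranked first even though its predicted mean lies below the current best, because its large uncertainty leaves room for a substantial improvement. After the experiment is run, the surrogate model is refitted and the ranking is recomputed.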
While this section highlights the advantages of using supervised ML and encourages its adoption, it is important for readers to understand that great care must be taken when developing and applying this technology. A double-edged situation is that many tools now allow chemists to build ML workflows and obtain predictions within minutes, regardless of their programming or data science expertise. Although this accessibility is crucial for the broader adoption of ML, the combination of limited expertise and the ease of model generation can be problematic, as it becomes relatively easy to overestimate the quality or predictive power of our tools. For example, a researcher might train a model with a coefficient of determination of R² = 0.95. This result could misleadingly suggest strong performance when, in fact, the model is overfitted and the problem went undetected because proper overfitting tests were skipped.
To avoid these pitfalls, researchers should evaluate models using multiple metrics (e.g., R², RMSE, MAE, accuracy, F1 score, and Matthews correlation coefficient) rather than relying on a single measure of performance. In addition, held-out test sets and k-fold cross-validation (preferably 5-fold CV) should be used to assess consistency and detect overfitting. Leave-one-out CV (LOOCV) can be applied to very small datasets but should be used with caution, as it is less reliable than 5-fold CV for detecting overfitting. One example of an evaluation technique that combines these analyses with additional statistical tests (i.e., y-shuffle, y-mean, extrapolation) is the ROBERT score. This metric evaluates models on a 10-point scale, considering three key aspects: predictive ability and overfitting, prediction uncertainty, and detection of spurious predictions.
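The following sketch shows how several of these checks fit together: out-of-fold predictions from 5-fold CV, three regression metrics, and a y-shuffle test. It is a simplified illustration and does not reproduce the ROBERT score. A one-descriptor least-squares line stands in for the model, and the data are hypothetical.

```python
import math
import random


def r2_score(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(x, y):
    """Least-squares line y = slope * x + intercept (stand-in for any model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def kfold_predictions(x, y, k=5, seed=0):
    """Out-of-fold predictions: every point is predicted by a model that never saw it."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    predictions = [0.0] * len(x)
    for fold in (order[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            predictions[i] = slope * x[i] + intercept
    return predictions


# Hypothetical data: one descriptor and a response with a small alternating error.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]

cv_pred = kfold_predictions(x, y)
print(f"5-fold CV: R2={r2_score(y, cv_pred):.3f} "
      f"RMSE={rmse(y, cv_pred):.2f} MAE={mae(y, cv_pred):.2f}")

# y-shuffle test: scrambling the responses should destroy the predictive power.
y_shuffled = y[:]
random.Random(1).shuffle(y_shuffled)
shuffled_pred = kfold_predictions(x, y_shuffled)
print(f"y-shuffle: R2={r2_score(y_shuffled, shuffled_pred):.3f}")
```

A model that scores almost as well on shuffled responses as on the real ones is fitting noise, regardless of how high its training R² appears.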
Another essential consideration is the predictive scope of ML models, as they are often applied to predict outcomes for molecules beyond the range of their training data, where reliability drops sharply. For instance, a researcher might develop an excellent predictor for the yields of substitution reactions in pyridines, but the same model may fail for pyrazines, even if intuition suggests otherwise. In general, predictors should be treated with caution when applied to extrapolated regions outside their training sets, and experimental validation is strongly recommended in these cases. Examples of extrapolation include predicting yields higher than those observed in the training set (Figure 5A) and predicting outcomes for molecules substantially different from those used for training (Figure 5B). When ML predictors fail to extrapolate reliably, focusing on interpolation within the dataset used to train the model can still yield useful insights. For example, such models can be used to discover reactivity trends and identify molecular features that are relevant to a particular set of catalytic results.
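A simple way to flag both kinds of extrapolation is to compare each new case with the ranges covered by the training set. The sketch below is a deliberately crude range check with hypothetical descriptor names and values. A candidate can sit inside every individual range and still be far from all training molecules, so passing this check does not guarantee a reliable prediction.

```python
def training_ranges(training_set):
    """Minimum and maximum of every descriptor in the training set."""
    names = list(training_set[0])
    return {n: (min(row[n] for row in training_set), max(row[n] for row in training_set))
            for n in names}


def extrapolated_descriptors(candidate, ranges, margin=0.0):
    """Names of the descriptors for which the candidate lies outside the training range.

    margin widens each range by a fraction of its span (0.0 = strict check).
    """
    flagged = []
    for name, (low, high) in ranges.items():
        slack = margin * (high - low)
        if not (low - slack <= candidate[name] <= high + slack):
            flagged.append(name)
    return flagged


def outside_response_range(prediction, y_train):
    """True when a predicted value falls outside the responses seen during training."""
    return prediction < min(y_train) or prediction > max(y_train)


# Hypothetical training data: descriptors of the training molecules and their yields (%).
training_set = [
    {"charge": -0.13, "vbur": 29.0},
    {"charge": -0.10, "vbur": 46.5},
    {"charge": -0.05, "vbur": 35.0},
    {"charge": -0.01, "vbur": 31.5},
]
y_train = [12.0, 35.0, 61.0, 74.0]

ranges = training_ranges(training_set)
similar = {"charge": -0.06, "vbur": 33.0}
different = {"charge": 0.10, "vbur": 52.0}

print(extrapolated_descriptors(similar, ranges))
print(extrapolated_descriptors(different, ranges))
print(outside_response_range(92.0, y_train))
```

Predictions flagged by either function are the ones for which experimental validation is most valuable.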
Figure 5. Different types of extrapolation in supervised ML with respect to (A) the prediction range and (B) the chemical space.
Lastly, it is important to recognize that the model accuracy depends largely on the quality of the training data. Studies on yield prediction consistently show a performance gap between models trained on high-throughput experimentation (HTE) datasets and those trained on electronic laboratory notebook (ELN) records. HTE datasets typically follow strict, standardized protocols for reaction setup, analysis, and data logging, resulting in clean, balanced datasets with minimal missing information. By contrast, ELN-derived data are often accumulated over years by different researchers and tend to suffer from heterogeneous formats, incomplete metadata, and inconsistent experimental practices. For these reasons, another key consideration is that datasets compiled from different manuscripts and patents often contain noisy yield values, which can severely undermine the accuracy of ML models.
In this context, chemists should, whenever possible, report crude yields (before purification) for comparability and maintain consistent conditions such as temperature, solvent, and sampling time when building or merging datasets. Reactions should also be performed at least in duplicate to ensure reproducibility and avoid spurious results. While selectivity might be less sensitive to some reaction parameters, caution is still recommended (e.g., diastereomeric ratios may change after column chromatography purification). Moreover, yield, conversion, and sometimes selectivity are strongly time-dependent (Figure 6A). It is therefore essential to select meaningful reaction times or compare results across multiple time points.
Figure 6. (A) Yield evolution over time in two reactions with different kinetics. Optimal distribution of target values for (B) regression and (C) classification.
For optimal performance, datasets should be balanced to minimize bias and improve model robustness. In regression, a uniform distribution of target values is desired (Figure 6B) so the model learns across low, medium, and high regions. In classification, similar numbers of samples per class are needed (Figure 6C) to avoid bias and loss of generalization. When data are highly skewed or concentrated around a single value, targeted data acquisition may be necessary to restore balance before modeling. In a similar context, poor results or “negative data” (e.g., low yields or low enantioselectivities) should be reported and included in models to broaden the scope of the algorithms and make them more general and robust.
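Dataset balance is easy to inspect before any model is trained. The sketch below counts regression targets per equal-width bin and samples per class, and reports the ratio between the most and least populated groups. The yields, the four bins, and the 50% activity threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def bin_counts(values, n_bins, low, high):
    """Number of values in each equal-width bin between low and high (inclusive)."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - low) / width), n_bins - 1)  # the top edge joins the last bin
        counts[index] += 1
    return counts


def imbalance_ratio(counts):
    """Largest group divided by the smallest one; infinite if a group is empty."""
    smallest = min(counts)
    return float("inf") if smallest == 0 else max(counts) / smallest


# Hypothetical yields (%) from a campaign that mostly reported successful reactions.
yields = [5.0, 12.0, 88.0, 91.0, 93.0, 95.0, 97.0, 90.0, 85.0, 99.0]

regression_bins = bin_counts(yields, n_bins=4, low=0.0, high=100.0)
print("samples per yield quartile:", regression_bins)
print("regression imbalance:", imbalance_ratio(regression_bins))

# Classification view of the same data, using an assumed 50% activity threshold.
classes = Counter("active" if v >= 50.0 else "inactive" for v in yields)
print("samples per class:", dict(classes))
print("class imbalance:", imbalance_ratio(list(classes.values())))
```

Here the two middle yield ranges are empty, so a regression model trained on these data would have no information about moderately performing reactions. This is the situation in which targeted acquisition of additional data, including negative results, is most useful.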
Further guidelines and considerations are summarized in Tables 3 and 5.
Table 5. Warnings and Considerations for Supervised ML
ADVANCED ML APPLICATIONS IN CATALYSIS, FUTURE DIRECTIONS, AND FURTHER READING
Despite the significant potential of ML to accelerate catalytic studies, its adoption within the chemistry community remains limited and, at times, misapplied, particularly when predictive models are used without an adequate understanding of their limitations. A common reason for these limitations is that implementing ML often requires time and effort to gain expertise in fields outside of chemistry (e.g., data science and programming). Closing this gap will require both education and rigor, including initiatives to embed hands-on digital chemistry modules into university curricula, offer workshops on ML applications, develop more user-friendly software, and adopt the Findable, Accessible, Interoperable, and Reusable (FAIR) principles. In other cases, ML remains underutilized because researchers are skeptical. However, as with other disruptive technologies throughout history, this mistrust will likely fade as further advances emerge.
In the context of ML research, one of the most promising directions is the development of automated robotic platforms, which pave the way for the popularization of self-driving laboratories. Even though this is still a young field, several representative examples have already demonstrated success in catalytic reaction discovery, including work from the groups of Cooper, Aspuru-Guzik, and Noël, among others.
Another emerging technology with a growing influence in ML-driven catalysis is the use of large language models (LLMs). The creation of chatbot assistants holds great promise, as they can help chemists generate code, extract descriptors from the published literature, and guide digital catalyst discovery through natural language conversations. Promising results have already been reported by the teams of Gomes, White, Laino, and Schwaller, among others.
For descriptor generation, a particularly exciting development is the emergence of machine learning potentials (MLPs), which enable simulations of catalytic systems with near-DFT accuracy at a fraction of computational cost. While they are still at an early stage of adoption in catalysis, MLPs are poised to transform QM calculations and facilitate the routine exploration of complex reaction landscapes. Notable contributions in this area have been made by the groups of Isayev, Dral, Duarte, and Wood and Zitnick, among others.
An additional promising direction is ML-based inverse design, which aims to generate in silico catalysts by suggesting structural modifications through algorithms. These methods are often combined with filtering strategies to eliminate candidates that are too expensive or synthetically unfeasible. Active groups in this area include those of Balcells, Jensen, and Bhowmik, among others.
As a final remark, it is worth mentioning that the intention of this viewpoint is to introduce readers to some useful concepts and capabilities that ML can offer to those working in homogeneous catalysis. Its format is intentionally short and concise, aiming to encourage the adoption of ML and to popularize this technology within the broader catalysis community. For this reason, further reading is encouraged for those seeking a deeper understanding of ML before applying it in research intended for publication, including reviews and studies from the Illas, Schuurman, Fey, Norrby, Maseras, Shimizu, and Wiest groups, among others.
Acknowledgments
J.V.A.-R. acknowledges the computing resources at the Galicia Supercomputing Center, CESGA, including access to the FinisTerrae supercomputer and the Drago cluster facility of SGAI-CSIC. The authors also thank Chris Collison (Rochester Institute of Technology) for previewing the manuscript.
Raw data, instructions, and code used for descriptor generation and clustering are freely available on Zenodo (https://zenodo.org/records/17084325).
Authors David Dalmau and Susana García-Abellán contributed equally to this work. The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.
J.V.A.-R. acknowledges Gobierno de Aragón-Fondo Social Europeo (Research Groups E07_23R), the State Research Agency of Spain (MCIN/AEI/10.13039/501100011033/FEDER, UE) for financial support (PID2022–140159NA-I00) and the European Union’s Recovery and Resilience Facility-Next Generation in the framework of the General Invitation of the Spanish Government’s public business entity Red.es to participate in talent attraction and retention programmes within Investment 4 of Component 19 of the Recovery, Transformation and Resilience Plan (MOMENTUM, MMT24-ISQCH-01).
The authors declare no competing financial interest.
Data Availability Statement
Raw data, instructions, and code used for descriptor generation and clustering are freely available on Zenodo (https://zenodo.org/records/17084325).