This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Feb 23:2025.02.18.638937. [Version 1] doi: 10.1101/2025.02.18.638937

Growth vs. Diversity: A Time-Evolution Analysis of the Chemical Space

Kenneth Lopez Perez 1, Edgar López-López 2,3, Flavie Soulage 1, Eloy Felix 4, José L Medina-Franco 2, Ramon Alain Miranda-Quintana 1
PMCID: PMC11870478  PMID: 40027807

Abstract

Chemical space is a core theoretical concept in cheminformatics with practical applications in drug discovery and other research areas. It is frequently associated with the number of molecules in the universe (e.g., the chemical universe). It is well known that the number of compounds, both synthesized and theoretical, is rapidly increasing, so it would seem obvious to affirm that the chemical space is expanding if growth is taken as a proxy. But is the chemical diversity of compound libraries growing? In this study, we tackle this question by quantitatively assessing the time evolution of chemical libraries in terms of chemical diversity as measured with molecular fingerprints. To this end, we employed innovative cheminformatics methods to assess the progress over time of the chemical diversity of compound libraries available in the public domain. Using the iSIM framework and the BitBIRCH clustering algorithm, we conclude that, based on the fingerprints used to represent the chemical structures, an increasing number of molecules alone cannot be directly translated into diversity for the analyzed libraries. With these tools, we have identified which releases contributed to the diversity of each library and in which regions of chemical space they did so.

Keywords: cheminformatics, chemical databases, clustering, diversity, drug discovery, iSIM, large libraries, molecular fingerprint, representation

1. INTRODUCTION

Chemical space is a key concept in cheminformatics and molecular design.1 It serves as a systematic tool to analyze, study, and visualize the chemical diversity of all kinds of compounds contained in the “chemical universe”, which includes all compounds that can or could exist.2,3 The concept of chemical space has been defined and reviewed from different perspectives.4,5 For instance, Arús-Pous et al. describe it as “a concept to organize molecular diversity by postulating that different molecules occupy different regions of a mathematical space where the position of each molecule is defined by its properties”.6 This definition has improved the development of new cheminformatic approaches with direct applicability in chemical diversity analysis, chemical structure classification, database design, virtual screening, and structure-property / structure-multiple property relationships.7,8

Arguably, there is no unified definition of chemical space or chemical universe.1 The definition can be restricted to certain regions of the “whole” space depending on the compounds included in it. For example, it has been estimated that the chemical space of small organic molecules exceeds $10^{60}$ compounds.2,3 It will also depend on the representations of the molecules, which can be chemical data9 (e.g., fingerprints10, physicochemical11 or quantum properties12), biological13,14 (e.g., bioactivity, bioavailability), or clinical15 (e.g., side effects) descriptors. This fact has brought up the idea of a consensus chemical space built by combining multiple representations.16

Since the chemical space is too large to evaluate exhaustively17, the creation of field-specific libraries is important. Public repositories like ChEMBL18 contribute to the curation, standardization, organization, and filtration of chemical and biological information, and guarantee data accessibility to the public (i.e., academia and independent researchers)19, which is essential in the current context of increasingly accessible artificial intelligence, such as machine and deep learning methods.20

In parallel, the accelerated increase in the number of compounds available in public repositories, and the creation of large ($10^7$ compounds)21 and ultra-large ($10^9$ compounds)3,22 libraries, open new perspectives on the use and study of the chemical universe. New strategies are required to navigate, cluster, and assess the diversity of these fast-growing libraries, which calls for more efficient models.17,23–25

It is widely known that the cardinality of the explored chemical space is growing.24,25 However, it is also important to establish a rational quantification of its chemical diversity, both to have a clearer picture of its actual expansion and to provide information that can guide the addition of new compounds in future releases. Recently, Liu et al. reported a time-dependent comparison of different structural features of natural products obtained in a time series of the Dictionary of Natural Products, as compared to synthetic compounds26, which underscores the importance of establishing systematic comparisons of the evolution of chemical diversity in chemical datasets.

In this contribution, we present several similarity-based tools to assess the time evolution of chemical libraries. The key component is the iSIM framework, which can efficiently quantify the intrinsic similarity (diversity) of each release thanks to its O(N) complexity. Moreover, the related notion of complementary similarity facilitates the study of how different zones of a library’s chemical space change over time. We also study if the diversity’s time evolution is dependent on the chosen fingerprint representation. Finally, the BitBIRCH clustering algorithm is used to dissect the evolving chemical spaces in a more “granular” way, looking at the formation of new clusters of compounds. The combination of these tools gives an unprecedented view on the formation of new chemical spaces, with varied resolutions, which can be broadly applicable to the design of novel compound libraries with specific desired functions. The adaptability and time efficiency make the proposed tools an attractive option for studying the chemical space of large molecular libraries.

2. METHODS

2.1. iSIM tools for dissecting chemical space

The similarity analysis of the different libraries was performed using iSIM as the basic tool. Traditional similarity indices are all based on the comparison between pairs of molecules, so they unavoidably scale as $O(N^2)$ when comparing $N$ molecules. This steep computational cost is particularly pressing when dealing with millions of compounds, so given the expansive nature of commonly used libraries, we need alternatives. iSIM bypasses the quadratic scaling problem by comparing all the molecules at the same time. The recipe is very simple: first, we arrange all the fingerprints in a matrix and add the elements of each column, resulting in a vector $K = [k_1, k_2, \ldots, k_M]$. Clearly, $k_i$ represents the number of “ones” in the $i$th column, which is all that we need to calculate the coincidences of “on” bits, $k_i(k_i-1)/2$, the coincidences of “off” bits, $(N-k_i)(N-k_i-1)/2$, and the number of “on”-“off” clashes, $k_i(N-k_i)$, in each column. This information is enough to calculate the average value of multiple similarity indices. Here we focus on the Tanimoto similarity, $T$, as this well-known similarity index has been shown to rank compounds consistently in structure-activity studies.27,28

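$$iT = \frac{2}{N(N-1)} \sum_{p,\, q>p} T(p,q) = \frac{\sum_{i=1}^{M} \frac{k_i(k_i-1)}{2}}{\sum_{i=1}^{M} \left\{ \frac{k_i(k_i-1)}{2} + k_i(N-k_i) \right\}} \qquad (1)$$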
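The column-sum recipe can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`column_sums`, `isim_tanimoto`, `pair_counts`) and the 6-bit toy fingerprints are invented here. The quadratic `pair_counts` helper is included only to show that the coincidence and clash totals obtained from the column sums match those accumulated over every pair.

```python
from itertools import combinations


def column_sums(fingerprints):
    """Return K = [k_1, ..., k_M], the number of 'on' bits in each column."""
    return [sum(column) for column in zip(*fingerprints)]


def isim_tanimoto(k, n):
    """Evaluate the right-hand side of Eq. 1 from column sums k of n fingerprints."""
    on_on = sum(ki * (ki - 1) / 2 for ki in k)  # coincidences of 'on' bits
    clashes = sum(ki * (n - ki) for ki in k)    # 'on'-'off' clashes
    total = on_on + clashes
    return on_on / total if total else 0.0      # convention chosen for all-zero input


def pair_counts(fingerprints):
    """O(N^2) reference: shared 'on' bits and clashes summed over all pairs."""
    shared = clashes = 0
    for a, b in combinations(fingerprints, 2):
        shared += sum(x & y for x, y in zip(a, b))
        clashes += sum(x ^ y for x, y in zip(a, b))
    return shared, clashes


# Toy 6-bit fingerprints (illustrative only, not real molecules).
fps = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
]
k = column_sums(fps)
it = isim_tanimoto(k, len(fps))
print(k, it, pair_counts(fps))
```

Only one pass over the fingerprints is needed to build `k`, which is where the linear scaling comes from.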

As indicated above, the iSIM Tanimoto (iT) value corresponds to the average of all the distinct pairwise Tanimoto comparisons, with the key advantage of being calculated in O(N). The iT thus quantifies the internal diversity of the set (lower iT values indicate a more diverse collection of compounds). This is a global indicator of the diversity of the library, but it would be desirable to have local metrics for the evolution of chemical space. A simple way to obtain them is the concept of complementary similarity. Since we can easily calculate the iT of a set, we can also easily calculate the set’s iT after a molecule has been removed from it. This is known as the complementary similarity of the removed molecule. Low complementary similarity values correspond to molecules that are central to the library (they are medoid-like), and high complementary similarity values are indicative of outlier molecules, in the periphery of the set. After identifying the central and outlier regions of the set, we can then calculate their corresponding iSIM values and see how their internal diversity changes over time. Additionally, we compared the medoids and outliers between releases of the set by calculating the iSIM of the merged sets. To shed more light on the time evolution of different sectors of chemical space, we analyzed how different iterations of a given library are related to each other. For this, we used the set Jaccard similarity index, J:

$$J(L_p, L_q) = \frac{|L_p \cap L_q|}{|L_p \cup L_q|} \qquad (2)$$

where $L_p$ and $L_q$ represent the same sector (medoids or outliers) of a library in the releases corresponding to years $p$ and $q$, respectively. For a given set $X$, $|X|$ indicates the cardinality (size) of the set, so Eq. 2 is the cardinality of the intersection divided by the cardinality of the union of the two sectors. We define the medoids as the molecules in the lowest 5% of complementary similarity values, and the outliers as those in the highest 5%.
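A minimal sketch of these two steps follows, assuming binary fingerprints stored as Python lists. The function names, the toy fingerprints, the molecule identifiers, and the rounding rule used to turn the 5% fraction into a count are all assumptions made for illustration. The complementary similarity of each molecule is obtained by subtracting its bits from the column sums, so the whole pass remains linear in the number of molecules.

```python
def isim_tanimoto(k, n):
    """iSIM Tanimoto (Eq. 1) from column sums k of n binary fingerprints."""
    on_on = sum(ki * (ki - 1) / 2 for ki in k)
    clashes = sum(ki * (n - ki) for ki in k)
    total = on_on + clashes
    return on_on / total if total else 0.0


def complementary_similarities(fingerprints):
    """iSIM of the set after removing each molecule in turn."""
    n = len(fingerprints)
    k = [sum(column) for column in zip(*fingerprints)]
    result = []
    for fp in fingerprints:
        k_without = [ki - bit for ki, bit in zip(k, fp)]
        result.append(isim_tanimoto(k_without, n - 1))
    return result


def sectors(ids, comp, fraction=0.05):
    """Return (medoids, outliers): lowest and highest fraction by complementary similarity."""
    size = max(1, int(len(ids) * fraction))  # assumed rounding rule
    order = sorted(range(len(ids)), key=lambda j: comp[j])
    medoids = {ids[j] for j in order[:size]}
    outliers = {ids[j] for j in order[-size:]}
    return medoids, outliers


def jaccard(a, b):
    """Set Jaccard index (Eq. 2); 0.0 is returned for two empty sets by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data (illustrative only); fraction=0.25 keeps one molecule per sector.
ids = ["A", "B", "C", "D"]
fps = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
comp = complementary_similarities(fps)
medoids, outliers = sectors(ids, comp, fraction=0.25)

# Hypothetical medoid sets from two releases of the same library.
j = jaccard({"m1", "m2", "m3"}, {"m2", "m3", "m4"})
print(medoids, outliers, j)
```

In the toy set, removing "D" leaves the three mutually similar molecules, so its complementary similarity is the highest and it is flagged as the outlier.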

Finally, to gain even more insight into the inner structure of the chemical space, we resort to clustering. Once again, the $O(N^2)$ scaling of standard clustering techniques, like Taylor-Butina29 and Jarvis-Patrick30, severely limits the scope of the chemical space that could be explored. For this reason, we decided to use the recently proposed BitBIRCH31 algorithm. This method draws inspiration from the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)32 algorithm by using a tree structure to reduce the number of comparisons required to group all the data. The key difference is that the original BIRCH algorithm could only be used to cluster objects represented with continuous vectors, using the Euclidean distance. BitBIRCH, on the other hand, relies on iSIM to process binary vectors and to use the Tanimoto similarity.31
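The idea of summarizing a cluster by its size and column sums, and using iSIM to decide whether a new fingerprint fits, can be illustrated with the deliberately simplified sketch below. It is not BitBIRCH: there is no tree, every existing cluster is scanned for each molecule, and the merge rule (accept when the iSIM of the enlarged cluster stays at or above the threshold) is an assumption made for this example. All names and the toy fingerprints are hypothetical.

```python
def merged_isim(k, n, fp):
    """iSIM Tanimoto of a cluster (column sums k, size n) after adding fp."""
    k_new = [ki + bit for ki, bit in zip(k, fp)]
    n_new = n + 1
    on_on = sum(ki * (ki - 1) / 2 for ki in k_new)
    clashes = sum(ki * (n_new - ki) for ki in k_new)
    total = on_on + clashes
    return on_on / total if total else 0.0


def flat_threshold_clustering(fingerprints, threshold=0.65):
    """Assign each fingerprint to the best-fitting cluster or start a new one."""
    clusters = []  # each entry is [size, column_sums]; no molecules are stored
    labels = []
    for fp in fingerprints:
        best, best_sim = None, threshold
        for idx, (n, k) in enumerate(clusters):
            sim = merged_isim(k, n, fp)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            clusters.append([1, list(fp)])
            best = len(clusters) - 1
        else:
            clusters[best][0] += 1
            clusters[best][1] = [ki + bit for ki, bit in zip(clusters[best][1], fp)]
        labels.append(best)
    return labels


# Toy 6-bit fingerprints (illustrative only).
fps = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
labels = flat_threshold_clustering(fps, threshold=0.65)
print(labels)
```

Because a cluster is represented only by its size and column sums, testing a candidate merge never requires revisiting the molecules already assigned to it.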

2.2. Chemical libraries

-. ChEMBL

ChEMBL33 is a large-scale, high-quality, manually curated, open Global Core Biodata Resource focused on drug-like bioactive compounds. The data are manually extracted from primary scientific literature, updated regularly, and rigorously curated and standardized. In addition to scientific literature data, ChEMBL incorporates deposited datasets, including those from neglected disease screening programs, crop protection research, drug metabolism and disposition studies, bioactivity data extracted from patents, and chemical probe screening programs. Currently, ChEMBL contains over 20 million bioactivity measurements for more than 2.4 million compounds and 15,500 protein targets. We analyzed the complete versions of ChEMBL from releases 1 through 33. In addition, we applied our framework only to the natural products of each of the releases.

-. DrugBank

DrugBank34,35 is an online database containing information on drugs and drug targets, making it highly used in cheminformatics and bioinformatics. For targets, it includes sequences, structures, and pathways. It combines chemical, pharmacological, and pharmaceutical data for drugs. We analyzed the releases of drugs included in the database from 2005 to 2022.

-. PubChem

PubChem36 is a widely used open chemistry database made available by the National Institutes of Health (NIH). Launched in 2004, it has since added information on millions of small compounds and larger molecules. PubChem compiles information from several sources, including government agencies, publications, and patents. For each compound, PubChem contains multiple identifiers, physical and chemical properties, toxicological information, related compounds, etc. For this publication, we analyzed the releases from 2004 to 2023.

2.3. Molecular representations

It is well known that molecular similarity measurements depend on the representation37; thus, we decided to carry out this study with three different types of 2048-bit fingerprints: RDKit38, ECFP439,40 (radius = 2), and ECFP639,40 (radius = 3). All fingerprints were calculated using the FPSim241 Python module from SMILES strings obtained from the above-mentioned libraries.

3. RESULTS AND DISCUSSION

3.1. Absolute Size & Diversity

We will illustrate the key points of our analysis using the ChEMBL library; however, the same conclusions can be drawn for the other libraries presented in the Supplementary Information. Overall, a simple counting strategy is not enough to evaluate the potential expansion of chemical space. As shown in Fig. 1 a), ChEMBL (as virtually every library) always increases its number of compounds over time, but this is not necessarily a direct proxy for the chemical diversity of the set, nor does it have an obvious connection to new chemistries being explored. As stated in the Introduction, in this work we did not consider the chemical space expansion as a direct synonym of the increasing number of compounds in chemical libraries. Figure 1 b) shows how the average Tanimoto similarity of the whole library (which we will refer to as iSIM for the rest of the text) does not fluctuate much as the library’s size increases. We want to remark that this quantity is extremely easy to track over time, as we only need to add the column-wise sum of the new incoming molecules from release to release to calculate the new iSIM value.
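The release-to-release update mentioned above can be sketched as follows. This assumes that each release only adds molecules; compounds that are withdrawn or modified between releases would have to be subtracted from the running sums, which is not shown. The function names and toy fingerprints are hypothetical.

```python
def isim_tanimoto(k, n):
    """iSIM Tanimoto (Eq. 1) from column sums k of n binary fingerprints."""
    on_on = sum(ki * (ki - 1) / 2 for ki in k)
    clashes = sum(ki * (n - ki) for ki in k)
    total = on_on + clashes
    return on_on / total if total else 0.0


def isim_per_release(releases, n_bits):
    """Return (library size, iSIM) after each release, reusing running column sums."""
    n = 0
    k = [0] * n_bits
    history = []
    for new_fps in releases:  # only the molecules that are new in this release
        for fp in new_fps:
            n += 1
            k = [ki + bit for ki, bit in zip(k, fp)]
        history.append((n, isim_tanimoto(k, n)))
    return history


# Two toy "releases" of 6-bit fingerprints (illustrative only).
releases = [
    [[1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]],
    [[1, 1, 1, 0, 0, 0], [0, 1, 0, 0, 1, 1]],
]
history = isim_per_release(releases, n_bits=6)
print(history)
```

Older molecules are never reprocessed: the cost of updating the iSIM value depends only on the number of incoming fingerprints.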

Figure 1:

(a) Size evolution of the ChEMBL database across releases (1–33) over time and (b) the variation of iSIM-Tanimoto with database size for the same ChEMBL releases.

The same trend in Fig. 1 b) can also be seen in Fig. 2: the iSIM of the whole library, as well as that of the central and outlier regions, has remained essentially constant since 2011, despite more than 1 million molecules being added since then. As expected, the iSIM value of the whole set lies between that of the medoids (which are notably less diverse) and that of the outliers (which are markedly more diverse). It seems that, after an initial period of small fluctuations before 2011, the average similarity of the whole set and of the medoids essentially reaches an “equilibrium”. In the case of the outliers, the most notable changes in the iSIM value occur before 2015. The same steady trends in the iSIM value over the ChEMBL releases are observed with other similarity indices such as Russell-Rao and Sokal-Michener (SI: Fig. S1). When focusing on ChEMBL’s natural products, there is a noticeable decrease in diversity from the first to the second release (SI: Fig S10). In the case of the natural products’ outliers, it is also clear that the 2011 and 2015 additions contributed to increasing the diversity of the set (SI: Fig S9). PubChem shows a different trend in the iSIM value over time: during the first 10 years, the diversity increases (similarity decreases) overall, but after the 2016 addition the behavior changed, and the iSIM values have been slowly increasing since (SI: Fig 26).

Figure 2:

Variation of iSIM over time for the (A) entire ChEMBL database, (B) medoids, and (C) outliers. Medoids and outliers are considered the 5% of the set with the lowest/highest complementary similarity, respectively.

3.2. iSIM Speed

An interesting exercise is to check the actual rate of change of the iSIM values over time, which we termed “iSIM speed”. In Fig. 3 we show that the rate of change over time is relatively similar for the whole set and for the medoids and outliers. However, the type of fingerprint used to represent the molecules has a great impact, even on the qualitative direction of the change. For example, in the whole set and the medoid region, the RDKit fingerprints reflect a decrease in iSIM, while the ECFP fingerprints show the opposite change. That is, the RDKit fingerprints suggest an increase in diversity, while the ECFP representation shows that the new molecules actually decreased the overall separation between the compounds in the library. It is also notable that, with passing years, the ECFP fingerprints tend to show a more stable landscape, while the RDKit fingerprints show comparatively bigger fluctuations. As for the outliers, the changes are more consistent between the three molecular representations. Overall, the greatest change in iSIM occurred in 2011, as seen in the three panels of Fig. 3; this date matches one of the biggest increases in the size of the ChEMBL library (Fig. 1a). These changes are also seen in the analogous iSIM speed with respect to size (d[iSIM]/d[size]) shown in the SI (Fig. S2).
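The text does not state how the rate of change was evaluated numerically. One simple possibility, assumed here purely for illustration, is a forward finite difference between consecutive releases, taken either with respect to the release year or with respect to the library size. The years, sizes, and iSIM values below are made-up placeholders, not results from this study.

```python
def finite_differences(values, positions):
    """Forward differences d(values)/d(positions) between consecutive releases."""
    return [
        (values[i + 1] - values[i]) / (positions[i + 1] - positions[i])
        for i in range(len(values) - 1)
    ]


# Placeholder numbers for three hypothetical releases (not real data).
years = [2010, 2011, 2013]
sizes = [1000, 3000, 4000]
isim = [0.5, 0.4, 0.6]

speed_vs_time = finite_differences(isim, years)  # d[iSIM]/d[t], per year
speed_vs_size = finite_differences(isim, sizes)  # d[iSIM]/d[size], per molecule
print(speed_vs_time, speed_vs_size)
```

A negative value indicates that the release made the library more diverse (lower iSIM), and a positive value indicates the opposite.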

Figure 3:

iSIM speed with respect to time for (A) the entire ChEMBL library, (B) its medoids, and (C) its outliers.

3.3. Jaccard set similarity for medoids and outliers

Fig. 4 shows the Jaccard set similarity between releases for medoids and outliers in the ChEMBL library. It suggests that in virtually every iteration there is a strong overlap between the core of the library and, perhaps more surprisingly, even between the outliers. Notice that the behavior of the medoids and outliers is strikingly similar, both showing the same overall trends, even while corresponding to drastically different regions of chemical space. It is interesting that in this representation it is easier to see how there are some “pivotal” updates in the datasets that are preserved throughout multiple years. The main observed change is in 2011, which matches with the observations from the previous section. It is also apparent how the latest updates seem to be highly correlated with each other, in particular since 2019–2020. These trends are observed also with the ECFP4 and ECFP6 fingerprints (SI: Fig. 34).

Figure 4:

Jaccard set similarity values of the medoid (A) and outlier (B) regions of the ChEMBL library represented with RDKit fingerprints.

To complement the results shown in Fig. 4, we calculated the iSIM value of the merged zones of chemical space between releases. The same observations were obtained (Fig. S5): the main change in the iSIM value happened in 2011 for both medoids and outliers. The structures of the outlier (molecule with the largest complementary similarity) and the medoid (molecule with the lowest complementary similarity) from each year are included in the SI.

3.4. Clustering

Finally, we studied how the BitBIRCH clustering of ChEMBL has evolved through time. The clustering was performed with a threshold of 0.65, as this value is in the relevant range for drug design and has shown stability in previous works. We define dense clusters as the ones with more than 10 elements, and outliers as the ones with fewer. First, notice how the average similarity of the top 10 most populated clusters shows an overall decreasing tendency (Fig. 5A), while their average population is increasing (Fig. 5B). This is expected of these sphere exclusion-like clustering algorithms, which tend to produce slightly more diffuse clusters as the number of members increases. However, perhaps more telling regarding the actual library expansion, both the number of dense clusters and the number of outlier clusters (Fig. 5C, D) show a decisively upward tendency, indicating that the molecules incoming with each iteration are not only going to over-explored regions of chemical space but are also covering new regions in the library’s scope. With extended connectivity fingerprints (ECFP4 and ECFP6), the populations and the number of dense/outlier clusters have the same increasing tendency as with the RDKit fingerprints. However, the average iSIM of the most populated clusters fluctuates more and does not have a decreasing tendency (SI: Fig. S6–S7).
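The per-release quantities tracked in this section can be computed from any clustering output with a short helper such as the one below. It is a sketch under assumptions: clusters are given as non-empty lists of binary fingerprints, clusters with exactly 10 members are counted as outliers (the text leaves that boundary case open), and the names and toy clusters are invented for the example.

```python
def cluster_isim(fingerprints):
    """iSIM Tanimoto (Eq. 1) of one cluster; 0.0 for singletons by convention."""
    n = len(fingerprints)
    k = [sum(column) for column in zip(*fingerprints)]
    on_on = sum(ki * (ki - 1) / 2 for ki in k)
    clashes = sum(ki * (n - ki) for ki in k)
    total = on_on + clashes
    return on_on / total if total else 0.0


def summarize(clusters, dense_cutoff=10, top=10):
    """Dense/outlier counts plus average size and iSIM of the most populated clusters."""
    largest = sorted(clusters, key=len, reverse=True)[:top]
    n_dense = sum(1 for c in clusters if len(c) > dense_cutoff)
    return {
        "n_dense": n_dense,
        "n_outliers": len(clusters) - n_dense,  # assumed: size <= dense_cutoff
        "avg_population_top": sum(len(c) for c in largest) / len(largest),
        "avg_isim_top": sum(cluster_isim(c) for c in largest) / len(largest),
    }


# Toy clusters of 3-bit fingerprints; small cutoffs are used so the example stays short.
toy_clusters = [
    [[1, 1, 0], [1, 1, 0], [1, 0, 0]],
    [[0, 1, 1], [0, 1, 1]],
    [[1, 0, 1]],
]
stats = summarize(toy_clusters, dense_cutoff=2, top=2)
print(stats)
```

Running the same summary on the clustering of each release gives the four time series of the kind plotted in Fig. 5.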

Figure 5:

(A) Average iSIM of the top 10 most populated clusters (B) Average population of the top 10 most populated clusters (C) Number of dense clusters (D) Number of outliers for the BitBIRCH clustering of the ChEMBL releases over time represented with RDKit fingerprints.

4. CONCLUSIONS AND OUTLOOK

In this work, we have shown that increasing the number of molecules in a molecular library cannot be directly translated into more diversity or “expansion” of the chemical space. The average similarity of the libraries, iSIM Tanimoto, tends to be a rather stable quantity that does not reflect the impact of the expansion in chemical space (considering the compounds deposited in the public databases studied in this work, which are heavily focused on drug discovery projects). Different molecular representations can lead to different conclusions about whether chemical diversity increases during abrupt changes in the chemical space. However, the overall tendencies over time do not vary much across the types of fingerprints studied, and they agree on the years in which there are significant changes in diversity. Across the multiple releases of the ChEMBL database, the medoid and outlier regions tend to be conserved over multiple years; according to their Jaccard similarity, the only significant observed change was in 2011.

The clustering analysis provides a more nuanced view of the expansion of chemical space and the clearest indication of an actual expansion, as reflected in the number of newly generated dense clusters and outliers.

It remains to be explored quantitatively if the continued enumerated chemical libraries (large and ultra-large libraries) are increasing the chemical diversity of the chemical space or just increasing the number of molecules (as it happened in the 1990s with the “traditional” combinatorial libraries). Due to the flexibility and adaptability of the proposed framework, it can be used to evaluate the chemical diversity of the libraries based on other types of molecular representations such as chemical scaffolds, and continuous properties (drug-like, ADMETox, constitutional descriptors, etc.) in future work. It remains to evaluate the chemical diversity of the specialized libraries based on other types of molecules such as small molecules, peptides, macrocycles, metallodrugs, etc.

Supplementary Material

Supplement 1
media-1.pdf (1,007.1KB, pdf)

ACKNOWLEDGEMENTS

KLP, FL, and RAMQ thank the National Institute of General Medical Sciences of the National Institutes of Health for support under award number R35GM150620. E.L.-L. is grateful to Consejo Nacional de Humanidades, Ciencia y Tecnología (CONAHCyT), Mexico, for the Ph.D. scholarship number 894234. E.L.-L. also thanks DrugBank for providing academic access to their platform, which was used to explore the data of the approved drugs presented in this work. We also thank DGAPA, UNAM, Programa de Apoyo a Proyectos de Investigación e Innovación Tecnológica (PAPIIT), for funding under grant No. IG200124.

DATA AND AVAILABILITY

-. Datasets

The SMILES strings were obtained from the publicly available websites:

-. Code and analysis

iSIM functionalities, BIRCH clustering algorithm, and examples of how to use them for the diversity vs growth analysis of ChEMBL’s natural products are included in the linked repository. https://github.com/mqcomplab/ChemUniverse

REFERENCES

  • (1) Reymond J.-L. Chemical Space as a Unifying Theme for Chemistry. J Cheminform 2025, 17 (1), 6. doi: 10.1186/s13321-025-00954-0.
  • (2) Kirkpatrick P.; Ellis C. Chemical Space. Nature 2004, 432 (7019), 823–823. doi: 10.1038/432823a.
  • (3) Reymond J.-L. The Chemical Space Project. Acc Chem Res 2015, 48 (3), 722–730. doi: 10.1021/ar500432k.
  • (4) Virshup A. M.; Contreras-García J.; Wipf P.; Yang W.; Beratan D. N. Stochastic Voyages into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-Like Compounds. J Am Chem Soc 2013, 135 (19), 7296–7303. doi: 10.1021/ja401184g.
  • (5) Osolodkin D. I.; Radchenko E. V; Orlov A. A.; Voronkov A. E.; Palyulin V. A.; Zefirov N. S. Progress in Visual Representations of Chemical Space. Expert Opin Drug Discov 2015, 10 (9), 959–973. doi: 10.1517/17460441.2015.1060216.
  • (6) Arús-Pous J.; Awale M.; Probst D.; Reymond J.-L. Exploring Chemical Space with Machine Learning. Chimia (Aarau) 2019, 73 (12), 1018. doi: 10.2533/chimia.2019.1018.
  • (7) López-López E.; Medina-Franco J. L. Toward Structure–Multiple Activity Relationships (SMARts) Using Computational Approaches: A Polypharmacological Perspective. Drug Discov Today 2024, 29 (7), 104046. doi: 10.1016/j.drudis.2024.104046.
  • (8) Medina-Franco J. L.; Sánchez-Cruz N.; López-López E.; Díaz-Eufracio B. I. Progress on Open Chemoinformatic Tools for Expanding and Exploring the Chemical Space. J Comput Aided Mol Des 2022, 36 (5), 341–354. doi: 10.1007/s10822-021-00399-1.
  • (9) Oprea T. I.; Gottfries J. Chemography: The Art of Navigating in Chemical Space. J Comb Chem 2001, 3 (2), 157–166. doi: 10.1021/cc0000388.
  • (10) Boldini D.; Ballabio D.; Consonni V.; Todeschini R.; Grisoni F.; Sieber S. A. Effectiveness of Molecular Fingerprints for Exploring the Chemical Space of Natural Products. J Cheminform 2024, 16 (1), 35. doi: 10.1186/s13321-024-00830-3.
  • (11) Coley C. W. Defining and Exploring Chemical Spaces. Trends Chem 2021, 3 (2), 133–145. doi: 10.1016/j.trechm.2020.11.004.
  • (12) Fallani A.; Medrano Sandonas L.; Tkatchenko A. Inverse Mapping of Quantum Properties to Structures for Chemical Space of Small Organic Molecules. Nat Commun 2024, 15 (1), 6061. doi: 10.1038/s41467-024-50401-1.
  • (13) Lipinski C.; Hopkins A. Navigating Chemical Space for Biology and Medicine. Nature 2004, 432 (7019), 855–861. doi: 10.1038/nature03193.
  • (14) Dobson C. M. Chemical Space and Biology. Nature 2004, 432 (7019), 824–828. doi: 10.1038/nature03192.
  • (15) Samanipour S.; Barron L. P.; van Herwerden D.; Praetorius A.; Thomas K. V.; O’Brien J. W. Exploring the Chemical Space of the Exposome: How Far Have We Gone? JACS Au 2024, 4 (7), 2412–2425. doi: 10.1021/jacsau.4c00220.
  • (16) Medina-Franco J. L.; Chávez-Hernández A. L.; López-López E.; Saldívar-González F. I. Chemical Multiverse: An Expanded View of Chemical Space. Mol Inform 2022, 41 (11), 2200116. doi: 10.1002/minf.202200116.
  • (17) Coley C. W. Defining and Exploring Chemical Spaces. Trends Chem 2021, 3 (2), 133–145. doi: 10.1016/j.trechm.2020.11.004.
  • (18) Zdrazil B.; Felix E.; Hunter F.; Manners E. J.; Blackshaw J.; Corbett S.; de Veij M.; Ioannidis H.; Lopez D. M.; Mosquera J. F.; Magarinos M. P.; Bosc N.; Arcila R.; Kizilören T.; Gaulton A.; Bento A. P.; Adasme M. F.; Monecke P.; Landrum G. A.; Leach A. R. The ChEMBL Database in 2023: A Drug Discovery Platform Spanning Multiple Bioactivity Data Types and Time Periods. Nucleic Acids Res 2024, 52 (D1), D1180–D1192. doi: 10.1093/nar/gkad1004.
  • (19) Bender A. Compound Bioactivities Go Public. Nat Chem Biol 2010, 6 (5), 309–309. doi: 10.1038/nchembio.354.
  • (20) Gasteiger J. Chemistry in Times of Artificial Intelligence. ChemPhysChem 2020, 21 (20), 2233–2242. doi: 10.1002/cphc.202000518.
  • (21) Wang Y.; Bryant S. H.; Cheng T.; Wang J.; Gindulyte A.; Shoemaker B. A.; Thiessen P. A.; He S.; Zhang J. PubChem BioAssay: 2017 Update. Nucleic Acids Res 2017, 45 (D1), D955–D963. doi: 10.1093/nar/gkw1118.
  • (22) Tingle B. I.; Tang K. G.; Castanon M.; Gutierrez J. J.; Khurelbaatar M.; Dandarchuluun C.; Moroz Y. S.; Irwin J. J. ZINC-22 – A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. J Chem Inf Model 2023, 63 (4), 1166–1176. doi: 10.1021/acs.jcim.2c01253.
  • (23) Cavasotto C. N.; Di Filippo J. I. The Impact of Supervised Learning Methods in Ultralarge High-Throughput Docking. J Chem Inf Model 2023, 63 (8), 2267–2280. doi: 10.1021/acs.jcim.2c01471.
  • (24) Lyu J.; Irwin J. J.; Shoichet B. K. Modeling the Expansion of Virtual Screening Libraries. Nat Chem Biol 2023, 19 (6), 712–718. doi: 10.1038/s41589-022-01234-w.
  • (25) Walters W. P. Virtual Chemical Libraries. J Med Chem 2019, 62 (3), 1116–1124. doi: 10.1021/acs.jmedchem.8b01048.
  • (26) Liu Y.; Cai M.; Zhao Y.; Hu Z.; Wu P.; Kong D.-X. Time-Dependent Comparison of the Structural Variations of Natural Products and Synthetic Compounds. Int J Mol Sci 2024, 25 (21), 11475. doi: 10.3390/ijms252111475.
  • (27) Dunn T. B.; López-López E.; Kim T. D.; Medina-Franco J. L.; Miranda-Quintana R. A. Exploring Activity Landscapes with Extended Similarity: Is Tanimoto Enough? Mol Inform 2023, 42 (7). doi: 10.1002/minf.202300056.
  • (28) Bajusz D.; Rácz A.; Héberger K. Why Is Tanimoto Index an Appropriate Choice for Fingerprint-Based Similarity Calculations? J Cheminform 2015, 7 (1), 20. doi: 10.1186/s13321-015-0069-3.
  • (29) Butina D. Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J Chem Inf Comput Sci 1999, 39 (4), 747–750. doi: 10.1021/ci9803381.
  • (30) Malhat M. G.; Mousa H. M.; El-Sisi A. B. Improving Jarvis-Patrick Algorithm for Drug Discovery. In 2014 9th International Conference on Informatics and Systems; IEEE, 2014; p DEKM-61-DEKM-66. doi: 10.1109/INFOS.2014.7036710.
  • (31) Pérez K. L.; Jung V.; Chen L.; Huddleston K.; Miranda-Quintana R. A. Efficient Clustering of Large Molecular Libraries. August 10, 2024. doi: 10.1101/2024.08.10.607459.
  • (32) Zhang T.; Ramakrishnan R.; Livny M. BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Record 1996, 25 (2), 103–114. doi: 10.1145/235968.233324.
  • (33) Zdrazil B.; Felix E.; Hunter F.; Manners E. J.; Blackshaw J.; Corbett S.; de Veij M.; Ioannidis H.; Lopez D. M.; Mosquera J. F.; Magarinos M. P.; Bosc N.; Arcila R.; Kizilören T.; Gaulton A.; Bento A. P.; Adasme M. F.; Monecke P.; Landrum G. A.; Leach A. R. The ChEMBL Database in 2023: A Drug Discovery Platform Spanning Multiple Bioactivity Data Types and Time Periods. Nucleic Acids Res 2024, 52 (D1), D1180–D1192. doi: 10.1093/nar/gkad1004.
  • (34) Wishart D. S.; Feunang Y. D.; Guo A. C.; Lo E. J.; Marcu A.; Grant J. R.; Sajed T.; Johnson D.; Li C.; Sayeeda Z.; Assempour N.; Iynkkaran I.; Liu Y.; Maciejewski A.; Gale N.; Wilson A.; Chin L.; Cummings R.; Le D.; Pon A.; Knox C.; Wilson M. DrugBank 5.0: A Major Update to the DrugBank Database for 2018. Nucleic Acids Res 2018, 46 (D1), D1074–D1082. doi: 10.1093/nar/gkx1037.
  • (35) Wishart D. S.; Knox C.; Guo A. C.; Shrivastava S.; Hassanali M.; Stothard P.; Chang Z.; Woolsey J. DrugBank: A Comprehensive Resource for in Silico Drug Discovery and Exploration. Nucleic Acids Res 2006, 34 (Database issue), D668–D672. doi: 10.1093/NAR/GKJ067.
  • (36) Kim S.; Chen J.; Cheng T.; Gindulyte A.; He J.; He S.; Li Q.; Shoemaker B. A.; Thiessen P. A.; Yu B.; Zaslavsky L.; Zhang J.; Bolton E. E. PubChem 2023 Update. Nucleic Acids Res 2023, 51 (D1), D1373–D1380. doi: 10.1093/nar/gkac956.
  • (37) Dimova D.; Stumpfe D.; Bajorath J. Quantifying the Fingerprint Descriptor Dependence of Structure–Activity Relationship Information on a Large Scale. J Chem Inf Model 2013, 53 (9), 2275–2281. doi: 10.1021/ci4004078.
  • (38) Landrum G.; Penzotti J. RDKit. 2018. http://www.rdkit.org/ (accessed 2025-01-17).
  • (39) Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J Chem Inf Model 2010, 50 (5), 742–754. doi: 10.1021/ci100050t.
  • (40) Glem R. C.; Bender A.; Arnby C. H.; Carlsson L.; Boyer S.; Smith J. Circular Fingerprints: Flexible Molecular Descriptors with Applications from Physical Chemistry to ADME. IDrugs 2006, 9 (3), 199–204.
  • (41) Félix-Manzanares E. FPSim2. ChEMBL; 2025.

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
