J Chem Inf Model. 2022 Dec 19;63(2):484–492. doi: 10.1021/acs.jcim.2c01107

Molecular Framework Analysis of the Generated Database GDB-13s

Ye Buehler 1, Jean-Louis Reymond 1,*
PMCID: PMC9875802  PMID: 36533982

Abstract

The generated databases (GDBs) list billions of possible molecules from systematic enumeration following simple rules of chemical stability and synthetic feasibility. To assess the originality of GDB molecules, we compared their Bemis and Murcko molecular frameworks (MFs) with those in public databases. MFs result from molecules by converting all atoms to carbons, all bonds to single bonds, and removing terminal atoms iteratively until none remain. We compared GDB-13s (99,394,177 molecules up to 13 atoms containing simplified functional groups, 22,130 MFs) with ZINC (885,905,524 screening compounds, 1,016,597 MFs), PubChem50 (100,852,694 molecules up to 50 atoms, 1,530,189 MFs), and COCONUT (401,624 natural products, 42,734 MFs). While MFs in public databases mostly contained linker bonds and six-membered rings, GDB-13s MFs had diverse ring sizes and ring systems without linker bonds. Most GDB-13s MFs were exclusive to this database, and many were relatively simple, representing attractive targets for synthetic chemistry aiming at innovative molecules.

Introduction

To delineate the chemical space of interest for drug discovery,1–3 we have reported several generated databases (GDBs) enumerating all possible small molecules up to a given number of non-hydrogen atoms following simple rules of chemical stability and synthetic feasibility.4,5 These databases contain billions of possible molecules, which are almost all novel because only a few million molecules are known in the size range of the GDBs (up to 17 non-hydrogen atoms).6 However, defining novelty as non-identity in the context of drug discovery is partly misleading because many similar molecules have comparable properties.7

Here, we analyze GDB molecules in terms of molecular frameworks (MFs) as proposed by Bemis and Murcko.8 MFs are the molecular graphs obtained from the structural formula by converting all atoms into carbons, all bonds to single bonds, and removing terminal atoms iteratively until none remain. For our analysis, we considered GDB-13s, a new subset of 99 million possible molecules up to 13 atoms of C, N, O, S, and Cl derived from the full GDB-13 (977 million molecules)9 by restricting allowed functional groups. Although much smaller than GDB-13, GDB-13s contains the complete set of MFs present in GDB-13. We compared MFs of GDB-13s molecules with those derived from 885 million commercially available screening compounds from the ZINC database,10 from 100 million molecules up to 50 non-hydrogen atoms (heavy atom count HAC ≤ 50) in the public database PubChem11 and from 400 thousand natural products and natural product-like molecules in the COCONUT database.12

MFs define molecular series by their constitutive ring systems and linker bonds, which are compatible with variations in substituents, heteroatoms, and ring aromaticity to form various scaffolds13 and lead to a more demanding definition of novelty.14,15 For example, our recently reported triquinazine scaffold 1, inspired by the ring system database GDB4c,16 represents a new heteroatom variation of the MF of the known angular triquinane 2, but is not a new MF per se; however, the derived Janus kinase inhibitor 3 features an unprecedented MF 4 occurring in only seven molecules recorded in PubChem, which correspond to 3 and related synthetic intermediates from the original publication (Figure 1).17 Drugs often derive from highly populated MFs, such as molnupiravir (5) derived from MF 6, which is found in most pyrimidine nucleosides and analogues and corresponds to millions of different molecules, including 24 marketed drugs. On the other hand, recently approved drugs, such as the orexin antagonist daridorexant (7), may feature more complex and far less common MFs such as 8, reflecting the general tendency toward larger and more complex drug structures observed in recent medicinal chemistry trends.14,15

Figure 1.

Examples of molecules (left) and their constitutive MF (right). The number of occurrences of the MFs in each database is indicated on the right.

As detailed below, we find that, because of the small size of molecules in GDB-13s and the exhaustive enumeration approach taken to create the database, GDB-13s features relatively few MFs for its size compared with ZINC, PubChem, and COCONUT. Nevertheless, these MFs are mostly exclusive MFs (eMFs) occurring only in GDB-13s and in none of the other three analyzed databases, attesting to a vast MF novelty potential in this database. Most remarkably, many eMFs are tricyclic frameworks that should be readily accessible by synthesis. A typical example is the tricyclic MF 10, for which only a single, non-referenced molecule example is found in SciFinder in the form of epoxide 9.18

Results and Discussion

Database Selection and MF Analysis

We chose GDB-13 for this analysis because of its manageable size of 977 million molecules. To further restrict the database to molecules resembling those in public databases, we removed molecules containing functional groups that occur frequently in GDB-13 but are rarely found in medicinal chemistry, such as acetals and carbonates, non-aromatic carbon–nitrogen and carbon–carbon double bonds, aziridines, and non-aromatic N–N and N–O bonds. This selection, here named GDB-13s, is approximately 90% smaller than the full GDB-13, further facilitating analysis. Similar to GDB-13, GDB-13s showed an exponential increase in the number of molecules as a function of molecule size (Figure 2a).

Figure 2.

Count of molecules (Cpd), MFs, eMFs, and MFs up to three rings (MF-3R) in GDB-13s (a), ZINC (b), PubChem50 (c), and COCONUT (d) as a function of HAC.

To compare MFs of GDB-13s with those of known molecules, we downloaded several publicly accessible data sets. We included molecules larger than HAC = 13 for our comparison because MFs result from pruning atoms from molecules, implying that many molecules with HAC > 13 provide smaller MFs falling within the size range of GDB-13s. We downloaded ZINC, which features 885 million screening compounds available from various providers.10 Molecules from ZINC are larger than GDB molecules and peak at HAC = 26, which is a typical drug size, with only very few molecules larger than HAC = 36 (Figure 2b). Furthermore, we collected 100 million molecules up to HAC = 50 from PubChem,11 here named PubChem50. This collection peaked at HAC = 21 but extended more evenly than ZINC up to HAC = 50 (Figure 2c). Finally, we considered the recently reported COCONUT, which features 400 thousand natural products or natural product-like molecules.12 This database is highly populated at HAC = 25–35 and contains a few molecules above HAC = 100, which are mostly glycolipids such as saponins, peptides, and polyphenols (Figure 2d).

Despite its large size, GDB-13s only contained 22,130 MFs. By contrast, ZINC and PubChem50 both contained over one million MFs, and COCONUT contained 42,734 MFs for only 400 thousand molecules. However, ZINC, PubChem50 and COCONUT contained fewer MFs up to 13 atoms (MF13) than GDB-13s because these databases mostly contain molecules built on MFs larger than 13 atoms (Table 1a and Figure 2a–d, green lines). The much larger number of molecules per MF in GDB-13s compared to databases of known molecules reflects the exhaustive enumeration approach taken to create the GDB, in contrast to the other three databases composed of known examples. This is also evidenced by the fact that GDB-13s does not contain any MF with only a single molecule example, while 14% of ZINC MFs and 47% of PubChem50 and COCONUT MFs are singletons.
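The counts discussed above (number of MFs, singletons, and molecules per MF) can be derived from a list holding one MF identifier per molecule. The following is a minimal sketch, assuming that each MF is represented by a canonical string such as a canonical SMILES; the function name `framework_statistics` and the toy identifiers are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def framework_statistics(mf_per_molecule):
    """Summarize a database given one MF identifier per molecule."""
    counts = Counter(mf_per_molecule)          # molecules per MF
    n_molecules = sum(counts.values())
    n_frameworks = len(counts)
    # A singleton is an MF represented by exactly one molecule.
    singletons = sum(1 for n in counts.values() if n == 1)
    return {
        "molecules": n_molecules,
        "frameworks": n_frameworks,
        "singletons": singletons,
        "percent_singletons": 100.0 * singletons / n_frameworks if n_frameworks else 0.0,
        "molecules_per_framework": n_molecules / n_frameworks if n_frameworks else 0.0,
    }


# Toy input: four molecules sharing three (hypothetical) MF identifiers.
stats = framework_statistics(["mf_a", "mf_a", "mf_b", "mf_c"])
print(stats)
```

Applied to the totals in Table 1a, the same ratio gives roughly 4,500 molecules per MF for GDB-13s (99,394,177 / 22,130) against roughly 870 for ZINC (885,905,524 / 1,016,597).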

Table 1. Molecular Framework Analysis of GDB-13s, ZINC, PubChem50, and COCONUT.

                       GDB-13s       ZINC          PubChem50     COCONUT
(a) Database Size and Cpd/MF
Cpds                   99,394,177    885,905,524   100,852,694   401,624
MF^a                   22,130        1,016,597     1,530,189     42,734
MF13^b                 22,130        1448          13,422        679
singletons^c           0             141,510       717,917       20,211
% singletons           0             13.9%         46.9%         47.3%
MF90^d                 872           13,800        24,830        14,000
% MF90                 3.9%          1.4%          1.6%          32.8%
Cpd/MF90               102,586       57,776        3656          26
(b) MF Types
MF-ring systems^e      17,816        3841          86,379        6181
% MF-ring systems      80.5%         0.4%          0.6%          14.5%
Cpd-ring systems       96,554,175    73,511,304    21,642,803    136,056
% Cpd-ring systems     97.1%         8.3%          21.5%         33.9%
MF-5/6^f               3610          298,901       812,006       24,038
% MF-5/6               16.3%         29.4%         53.1%         56.3%
Cpd-5/6                34,214,845    656,214,620   84,927,019    285,823
% Cpd-5/6              34.4%         74.1%         84.1%         71.2%
(c) Exclusive MFs
eMF^g                  16,936        691,045       1,192,517     16,503
% eMF                  76.5%         68.0%         77.9%         38.6%
Cpd-eMF                4,975,340     45,755,635    5,771,217     44,040
% Cpd-eMF              5.0%          5.2%          5.7%          11.0%
eMF13^h                16,936        47            7997          5
% eMF13                100%          0.01%         0.67%         0.03%
(d) MFs up to Three Rings
MF-3R^i                2215          25,143        40,577        3670
% MF-3R                10.0%         2.5%          2.7%          8.6%
Cpd-3R                 83,472,674    642,704,648   69,406,919    169,647
% Cpd-3R               84.0%         72.5%         68.8%         42.2%
eMF-3R                 225           7794          21,481        317
% eMF-3R               1.0%          0.8%          1.4%          0.7%
Cpd-eMF-3R             209,011       1,648,939     139,368       841
% Cpd-eMF-3R           0.2%          0.2%          0.1%          0.2%

^a MF = molecular framework.
^b MF13 = MF up to 13 atoms.
^c Singletons = MF with only a single molecule example.
^d MF90 = no. of MF covering 90% of the database.
^e MF-ring systems = MF without acyclic bonds.
^f MF-5/6 = MF containing only five- or six-membered rings.
^g eMF = exclusive MF, does not occur in the other three databases.
^h eMF13 = exclusive MF up to 13 atoms.
^i MF-3R = MF up to three rings.

The frequency of molecules per MF followed a typical power law distribution in all four databases (Figure 3). This distribution was steepest in ZINC and PubChem50, where approximately 1.5% of all MFs were sufficient to cover 90% of the database, defined here as MF90 (Table 1a). GDB-13s required 3.9% of its MFs to cover 90% of the database; however, in this case, the number of molecules per MF was higher than in ZINC or PubChem50 due to the lower total number of MFs in GDB-13s. A similar coverage of 90% in COCONUT required 32.8% of its MFs, reflecting the large MF diversity of this natural product collection with an average of only 26 compounds per MF90.
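MF90 as defined above can be obtained by sorting MFs by population and accumulating molecules until 90% of the database is reached. The sketch below assumes one canonical MF identifier per molecule as input; the function name `mf90` and its `coverage` parameter are hypothetical.

```python
from collections import Counter


def mf90(mf_per_molecule, coverage=0.90):
    """Smallest number of most-populated MFs whose molecules reach `coverage`."""
    # Molecules per MF, most populated first.
    counts = sorted(Counter(mf_per_molecule).values(), reverse=True)
    target = coverage * sum(counts)
    covered = 0
    for rank, n_molecules in enumerate(counts, start=1):
        covered += n_molecules
        if covered >= target:
            return rank
    return 0  # empty input


# Toy database of 100 molecules distributed over three MFs.
toy = ["mf_a"] * 80 + ["mf_b"] * 15 + ["mf_c"] * 5
print(mf90(toy))
```

The Cpd/MF90 row of Table 1a is consistent with dividing 90% of the compounds by MF90, for example 0.9 × 99,394,177 / 872 ≈ 102,586 for GDB-13s.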

Figure 3.

Frequency distribution of MFs in GDB-13s (a), ZINC (b), PubChem50 (c), and COCONUT (d).

In terms of structural types, the majority of the MFs in GDB-13s (80.5%) were ring systems, which are MFs without any linker bonds, and these ring systems made up almost the entire database (97.1% of all molecules, Table 1b). In sharp contrast, the other databases were dominated by MFs containing linker bonds, such that ring systems only composed a small fraction of MFs and molecules in ZINC (0.4% MFs, 8.3% molecules), PubChem50 (0.6% MFs, 21.5% molecules), and COCONUT (14.5% MFs, 33.9% molecules). Furthermore, MFs and molecules containing only five- or six-membered rings were a minority in GDB-13s (16.3% MFs, 34.4% molecules), but made up a much larger fraction of ZINC (29.4% MFs, 74.1% molecules) and dominated in PubChem50 (53.1% MFs, 84.1% molecules) and COCONUT (56.3% MFs, 71.2% molecules), probably reflecting the fact that five- and six-membered rings are easily formed and synthesized. Frequency histograms as a function of the largest ring size in fact showed that five-membered rings were most prevalent in GDB-13s molecules, while six-membered rings dominated in GDB-13s MFs as well as in both molecules and MFs for ZINC, PubChem50, and COCONUT (Figure 4).
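Because all terminal atoms have already been pruned from an MF, an MF is a ring system exactly when none of its bonds is acyclic, that is, when deleting any single bond still leaves its two atoms connected. The following sketch tests this on a plain bond list; the representation (pairs of atom indices, each bond listed once) and the function name `is_ring_system` are assumptions made for illustration, and the authors' implementation relies on RDKit instead.

```python
def is_ring_system(bonds):
    """True if no bond of an already pruned framework is an acyclic (linker) bond."""
    bonds = [tuple(b) for b in bonds]
    if not bonds:
        return False  # an empty framework (acyclic molecule) is not a ring system
    for skipped in range(len(bonds)):
        # Build the graph without one bond, then check whether the two atoms
        # of that bond can still reach each other through the rest of the graph.
        neighbors = {}
        for k, (i, j) in enumerate(bonds):
            neighbors.setdefault(i, set())
            neighbors.setdefault(j, set())
            if k != skipped:
                neighbors[i].add(j)
                neighbors[j].add(i)
        start, goal = bonds[skipped]
        seen, stack = {start}, [start]
        while stack:
            for nbr in neighbors[stack.pop()]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        if goal not in seen:
            return False  # this bond lies in no ring, so it is a linker bond
    return True


# Spiro[2.2]pentane-like framework versus two three-membered rings joined by a linker bond.
spiro = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)]
linked = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (0, 3)]
print(is_ring_system(spiro), is_ring_system(linked))
```

The quadratic bond-by-bond check is chosen for readability; it is adequate for frameworks of up to 13 atoms but a linear-time bridge search would be preferable for large graphs.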

Figure 4.

Largest ring size histogram of molecules (Cpd), MFs, eMFs, and MFs up to three rings (MF-3R) in GDB-13s (a), ZINC (b), PubChem50 (c), and COCONUT (d).

The relative importance of ring sizes in each database was further illustrated by analyzing the 10 most populated MFs in each database (11–35, Figure 5). The most populated MF in GDB-13s was cyclopentane (11) with 7.3 million molecules, followed by cyclobutane (12), cyclopropane (13), and bicyclic fused ring systems (14–17), with cyclohexane (18) appearing in position 8 with 2.3 million molecules. Furthermore, six of the top-10 MFs in GDB-13s (12, 13, 15, 16, 19, and 20) contained a small (three- or four-membered) ring. By contrast, ZINC, PubChem50, and COCONUT all featured cyclohexane (18) as the most populated MF, followed by cyclopentane (11) for ZINC and PubChem50 and decalin (31) for COCONUT. All top-10 MFs in these databases only contained five- and six-membered rings and were very comparable to the top-10 MFs in the comprehensive medicinal chemistry (CMC) data set as reported by Bemis and Murcko8 and in the Chemical Abstracts Service (CAS) Registry Organic Subset as reported by Lipkus.19 A similar pattern appeared when considering the top-30 MFs in each set (Figure S1).

Figure 5.

Top-10 most populated MFs in various databases. MFs are numbered by order of appearance in the frequency-sorted list across the four databases. The top-30 most populated MFs in these databases are shown in Figure S1. A color code has been added to the numbering of MFs appearing several times to facilitate comparison across different databases.

Exclusive Molecular Frameworks

To appreciate the uniqueness of each database, we next analyzed which MFs were found only in one of the four databases, here named eMFs (Table 1c). A Venn diagram analysis showed that a substantial fraction of MFs in each of the four databases were eMFs (Figure 6a). For instance, 76.5% of MFs in GDB-13s were eMFs, which was not surprising considering the much larger number of MF13 (MFs up to 13 atoms) in GDB-13s compared to the other databases. Nevertheless, a comparable percentage of eMFs were present in ZINC (68.0%) and PubChem50 (77.9%), while only 38.6% of MFs were eMFs in COCONUT. Despite the exhaustive nature of GDB-13, eMFs up to 13 atoms also occurred in ZINC (47), PubChem50 (7,997), and COCONUT (5). These eMF13 were exemplified on average by only 12 molecules per MF and contained fused three- and four-membered rings such as tetrahedrane and prismane, which by design are excluded from the GDB-13 generation procedure due to their high ring strain.9
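The eMF definition amounts to a set difference between the MFs of one database and the union of the MFs of all others. A minimal sketch is given below, assuming that MFs are stored as canonical strings so that identical frameworks compare equal; the function name and the toy identifiers are hypothetical.

```python
def exclusive_frameworks(frameworks_by_database):
    """Map each database name to the MFs that occur in no other database."""
    exclusive = {}
    for name, frameworks in frameworks_by_database.items():
        # Union of the MFs of every other database.
        others = set().union(
            *(mfs for other, mfs in frameworks_by_database.items() if other != name)
        )
        exclusive[name] = set(frameworks) - others
    return exclusive


# Toy MF sets; the identifiers stand in for canonical MF strings.
toy = {
    "GDB-13s": {"mf_a", "mf_b", "mf_c"},
    "ZINC": {"mf_a", "mf_d"},
    "PubChem50": {"mf_a", "mf_b", "mf_e"},
    "COCONUT": {"mf_a"},
}
print(exclusive_frameworks(toy))
```

Note that exclusivity in this sense depends entirely on which databases are compared, which is why it is checked against SciFinder separately later in the paper.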

Figure 6.

eMFs. (a) Venn diagram of MF in the different databases. (b) The top-2 most populated eMFs in the different databases. The top-10 eMFs in the different databases are shown in Figure S2.

Note that eMFs were generally less populated than MFs. In all four databases, the most populated eMFs only comprised thousands of molecules, as opposed to up to millions for MFs. The corresponding molecules only made up approximately 5–6% of the database for GDB-13s, ZINC, and PubChem50, and 11% for COCONUT (Table 1c). Furthermore, eMFs were generally more complex than MFs, featuring polycyclic systems with mostly four or more rings, as illustrated by the two most populated eMFs in each of the four databases analyzed (36–43, Figure 6b). A similar pattern was visible when surveying the top-10 most populated eMFs (Figure S2).

MFs up to Three Rings

Because the most populated MFs from each of the four databases featured at most three rings, we investigated what percentage of each database in fact stemmed from MFs with up to three rings, here named MF-3R, considering all MFs as well as eMFs for each database (Table 1d).
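One way to decide whether an MF belongs to MF-3R is to count its independent rings as the cyclomatic number of the framework graph, bonds − atoms + connected components, which equals the size of the smallest set of smallest rings. The sketch below works under that assumption; the text does not state how rings were counted, and the bond-list representation and function names are hypothetical.

```python
def ring_count(bonds):
    """Number of independent rings (cyclomatic number) of a framework graph."""
    unique_bonds = {frozenset(b) for b in bonds}
    atoms = set().union(*unique_bonds) if unique_bonds else set()

    # Union-find over atoms to count connected components.
    parent = {atom: atom for atom in atoms}

    def find(atom):
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]  # path compression
            atom = parent[atom]
        return atom

    for bond in unique_bonds:
        i, j = tuple(bond)
        parent[find(i)] = find(j)

    components = len({find(atom) for atom in atoms})
    return len(unique_bonds) - len(atoms) + components


def is_mf_3r(bonds):
    """True for a framework with one to three rings."""
    return 1 <= ring_count(bonds) <= 3


# Decalin framework: two fused six-membered rings sharing the 0-5 bond.
decalin = [(i, (i + 1) % 6) for i in range(6)] + [(5, 6), (6, 7), (7, 8), (8, 9), (9, 0)]
print(ring_count(decalin), is_mf_3r(decalin))
```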

In line with the most populated MFs, 84% of the molecules in GDB-13s stemmed from MF-3R, although these only made up 10.0% of all MFs. A similar and even more extreme situation in terms of MFs occurred in ZINC (2.5% MF-3R result in 72.5% molecules), PubChem50 (2.7% MF-3R result in 68.8% molecules), and COCONUT (8.6% MF-3R result in 42.2% molecules). The frequency of molecules with only few rings most likely results from their easier synthesis compared to molecules derived from more complex MFs.

A Venn diagram analysis showed that only very few MF-3R were exclusive to each database (Figure 7a). Overall, only approximately 1% of all MFs were eMF-3R, and only 0.1–0.2% of all molecules stemmed from eMF-3R in each database (Table 1d). Due to database sizes, however, this still left a good number of molecules from eMF-3R in each database (>200,000 for GDB-13s and >1,000,000 for ZINC).

Figure 7.

MFs up to three rings (MF-3R). (a) Venn diagram of MF-3R in the different databases. (b) Tree map (TMAP) visualization of the 48,947 MF-3Rs in GDB-13s, ZINC, PubChem50, and COCONUT, color-coded by the size of the largest ring. (c) TMAP color-coded by MF type. (d) TMAP of the 13,769 molecules derived from the most frequent eMF-3R in GDB-13s. An interactive version of the TMAPs with additional color codes is accessible at https://tm.gdb.tools/map4 (MAP4_4databases_MF3R; MAP4_GDB-13s_eMF3R_Cpd).

In terms of identifying original yet simple molecules, those built from eMF-3R should be the most interesting. To gain an overview of such frameworks, we built a tree-map (TMAP)20 of MF-3R to visualize their diversity across the different databases. Color-coding the TMAP by the size of the largest ring showed that macrocycles made up a significant fraction of MF-3R (≥12-membered ring, 18.9%), while MF with only small rings (three- or four-membered) only accounted for 3.0% of MF-3R (Figure 7b). Furthermore, color-coding by MF-type (ring system or MF with linker bonds) showed that 14.2% of MF-3R were ring systems (Figure 7c).

Although any particular MF groups molecules sharing a common set of ring structures, the diversity accessible from a single MF can be quite substantial, as illustrated by the TMAP displaying the 13,769 possible GDB-13s molecules sharing MF 10, which is the most populated eMF-3R in GDB-13s (Figure 7d). These molecules span the full range of functional groups allowed in GDB-13s and the Tanimoto similarities (Tan) to the parent MF, calculated using the standard ECFP4 fingerprint,21 range from almost identical (Tan ∼ 1) to almost entirely dissimilar (Tan ∼ 0). Among these molecules, one can readily identify possible analogues of well-known 3D-shaped molecules, such as memantine (44, 45), camphor (46, 47), patchoulol (48, 49), tropinone (50, 51), triquinazine 1 (52, 53), and DABCO (54, 55).
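The Tanimoto similarity used above is the ratio of shared to total features of two fingerprints. As an illustration only, the sketch below computes it for fingerprints given as sets of on-bit indices; generating the ECFP4 bits themselves is not shown, and the function name and the convention of returning 0.0 for two empty fingerprints are assumptions.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| of two sets of on-bit indices."""
    a, b = set(features_a), set(features_b)
    union = len(a | b)
    # Two empty fingerprints have no defined similarity; 0.0 is an assumed convention.
    return len(a & b) / union if union else 0.0


# Toy fingerprints sharing two of four distinct on-bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

A value near 1 therefore means that a molecule and its parent MF set nearly the same fingerprint bits, while a value near 0 means that heteroatoms and functional groups have changed almost every atom environment.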

Identifying Novel Synthetic Targets in GDB-13s

Although our definition of eMFs is limited to the comparison of the four databases considered, most eMFs in GDB-13s are indeed novel when checked against SciFinder.18 For example, MF 10 has only a single entry in SciFinder in the form of epoxide 9, which has no literature reference. This epoxide can probably be synthesized from the parent ketone 56, which is listed in PubChem.

Some eMFs in GDB-13s are not truly exclusive because they can be obtained by removing linker bonds from larger MFs in molecules from PubChem, ZINC, or COCONUT. Indeed, 312 of the 16,936 eMFs in GDB-13s are ring systems that can be obtained by removing linker bonds from larger MFs in these databases, which still leaves 16,624 eMFs in GDB-13s. Among the many associated molecules, one might select targets that are both easily accessible and readily functionalizable for medicinal chemistry purposes. For example, bi- or tricyclic diamines from eMFs related to triquinazine 1,17 such as 52 and 53, might be accessible by amine cyclization procedures from simple precursors and provide valuable novel core structures for drugs. Furthermore, carbocyclic molecules from original MFs in GDB-13s containing three- and four-membered rings might be accessible from suitable alkene precursors by cyclopropanation or [2 + 2] cycloaddition, respectively.

Conclusions

The analysis above shows that, despite the large size of GDB-13s, the absolute number of different MFs in GDB-13s is quite low compared to collections such as ZINC, PubChem, or COCONUT. In contrast to these collections, which contain mostly MFs with five- and six-membered rings and linker bonds, most MFs in GDB-13s feature a broader variety of ring sizes and are ring systems without any acyclic bonds. Most interestingly, many MFs occur only in GDB-13s (eMFs) and feature unprecedented ring combinations. Such eMFs might be the most relevant targets for synthetic chemistry aiming at innovative molecules.

Methods

GDB-13s Generation

The entire GDB-13 data set (including all C/N/O/Cl/S molecules) was downloaded from our group website (https://gdb.unibe.ch/downloads). The 977,468,301 entries of the GDB-13 database were filtered using Python. Functional groups and substructures were identified using the Daylight SMARTS language.22 Alog P (atomic log P) values were calculated with the Ghose/Crippen method23 using RDKit.24 The following five rules were applied to the entire GDB-13 database, in this order:

  • (1) C=O filtration: in non-aromatic structures, C=O is the only double bond allowed, which phases out molecules with non-aromatic C=C and C=N bonds. In aromatic rings, all types of double bonds are allowed. Molecules are not required to contain a C=O double bond;

  • (2) Alog P filtration: Alog P is a refinement of log P that is suited to smaller molecules. If the Alog P value of a drug is too low, the drug molecule will hardly pass through the cell membrane. We therefore removed all molecules with an Alog P value below 0;

  • (3) Non-aromatic N–O and N–N filtration: exclude all non-aromatic N–O and N–N bonds (such bonds are kept only when both atoms are inside an aromatic ring);

  • (4) O–C–O filtration: filter out the molecules containing O–C–O structures;

  • (5) N in three-membered ring filtration: eliminate the compounds containing three-membered rings with any nitrogen atom.
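The five rules above can be sketched with RDKit as follows. This is not the authors' code: the SMARTS patterns, the rule labels, and the function names are assumptions chosen to match the wording of the rules, and rule 1 is read literally, so that any non-aromatic double bond other than C=O (including S=O) is rejected. The rules are combined here without regard to order, which gives the same final selection as applying them one after another.

```python
from rdkit import Chem
from rdkit.Chem import Crippen

# Assumed SMARTS queries for rules 3-5; the exact patterns are not given in the text.
STRUCTURE_FILTERS = {
    "rule 3: non-aromatic N-N or N-O bond": Chem.MolFromSmarts("[#7]!:[#7,#8]"),
    "rule 4: O-C-O": Chem.MolFromSmarts("[#8]-[#6]-[#8]"),
    "rule 5: N in three-membered ring": Chem.MolFromSmarts("[#7;r3]"),
}


def has_forbidden_double_bond(mol):
    """Rule 1 (assumed reading): every non-aromatic double bond must be C=O."""
    for bond in mol.GetBonds():
        # After aromaticity perception, bonds inside aromatic rings have type
        # AROMATIC, so DOUBLE only covers non-aromatic double bonds.
        if bond.GetBondType() == Chem.BondType.DOUBLE:
            elements = {bond.GetBeginAtom().GetSymbol(), bond.GetEndAtom().GetSymbol()}
            if elements != {"C", "O"}:
                return True
    return False


def structural_violations(mol):
    """Labels of the structural rules (1, 3, 4, 5) that a molecule violates."""
    violations = []
    if has_forbidden_double_bond(mol):
        violations.append("rule 1: non-aromatic double bond other than C=O")
    for label, query in STRUCTURE_FILTERS.items():
        if mol.HasSubstructMatch(query):
            violations.append(label)
    return violations


def passes_filters(smiles):
    """True if a SMILES passes all five rules, including Alog P >= 0 (rule 2)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False  # unparsable input is rejected
    if structural_violations(mol):
        return False
    return Crippen.MolLogP(mol) >= 0


for smiles in ["C1CCCCC1", "O=C1CCCCC1", "C1CC=CCC1", "C1CN1", "COC(=O)OC"]:
    print(smiles, passes_filters(smiles))
```

With the `[#8]-[#6]-[#8]` pattern, acetals and carbonates are rejected while esters are kept, because an ester carbon carries only one singly bonded oxygen; this agrees with the functional groups named in the Results section but remains an interpretation of rule 4.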

GDB-13s can be downloaded from: https://gdb.unibe.ch/downloads.

Data Collection

The ZINC data used in this study are the February 2022 version (https://zinc.docking.org). The PubChem data (October 2021 version) were first downloaded from the FTP server of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) (https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full). Compounds with HAC ≤ 50 were then extracted to build the PubChem50 database. The COCONUT data used in this study are the February 2021 version (https://github.com/reymond-group/Coconut-TMAP-SVM). SMILES strings served as inputs, and RDKit was the main tool used to calculate HAC and other properties.

MF Model

The MF model is written in Python 3 and depends on several freely available Python packages such as Pandas,25 Numpy,26 and RDKit. A brief outline of the model is provided here: an input molecule is first simplified by converting all its bonds into single bonds and all its atoms into carbon atoms. Then, all terminal atoms of this molecule are removed iteratively. The outcome is an MF as defined by Bemis and Murcko.8
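The pruning step outlined above can be illustrated on a plain molecular graph. The sketch below is not the authors' RDKit-based model: it assumes the molecule is given as a list of bonds between heavy-atom indices, so that dropping element types and bond orders plays the role of converting atoms to carbon and bonds to single bonds. The function name `molecular_framework` is hypothetical.

```python
from collections import defaultdict


def molecular_framework(bonds):
    """Return the bonds left after iteratively removing terminal atoms.

    `bonds` is an iterable of (atom_i, atom_j) pairs between heavy atoms.
    Element types and bond orders are not stored, which corresponds to
    an all-carbon, all-single-bond graph.
    """
    neighbors = defaultdict(set)
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)

    # Atoms with at most one neighbor are terminal. Removing one may expose
    # a new terminal atom, so candidates are processed from a work list.
    terminal = [atom for atom, nbrs in neighbors.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in neighbors:
            continue  # already removed
        for nbr in neighbors.pop(atom):
            neighbors[nbr].discard(atom)
            if len(neighbors[nbr]) <= 1:
                terminal.append(nbr)

    return {frozenset((a, b)) for a, nbrs in neighbors.items() for b in nbrs}


# Six-membered ring (atoms 0-5) carrying a two-atom side chain (atoms 6 and 7).
ring = [(i, (i + 1) % 6) for i in range(6)]
print(len(molecular_framework(ring + [(0, 6), (6, 7)])))
```

The result keeps rings and the linker atoms between them, and is empty for acyclic molecules. Comparing MFs across databases additionally requires a canonical representation, which this sketch does not provide; in RDKit, the functions `GetScaffoldForMol` and `MakeScaffoldGeneric` of the `rdkit.Chem.Scaffolds.MurckoScaffold` module cover closely related operations.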

Venn Diagrams and TMAPs

Venn diagrams were computed using the freely available Python package Venn.27 TMAPs were generated with standard parameters,20 and all used the MAP4 fingerprint (MinHashed atom-pair fingerprint up to a diameter of four bonds),28 a fingerprint that we recently developed for use across many classes of molecules and that is particularly well suited to natural products. MAP4 fingerprints were computed with a dimension of 256.

Data and Software Availability

The generated data set GDB-13s is hosted on the open-access repository Zenodo. All the molecules are stored in dearomatized, canonical SMILES format and compressed as a GNU zip archive. The GDB-13s data set (GDB-13s.smi.gz) can be downloaded free of charge at https://doi.org/10.5281/zenodo.7041051. The MF model is freely available under the MIT license and is distributed in a GitHub repository upon publication of this paper: https://github.com/Ye-Buehler/Molecular_Framework_Model.

Acknowledgments

We thank Dr. Todd Wills for the suggestion to analyze MFs in the GDBs, and Dr. Sacha Javor for critical reading of the paper with helpful suggestions. We also thank UBELIX (http://www.id.unibe.ch/hpc), the HPC cluster at the University of Bern, for providing free computing service.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.2c01107.

  • Top-30 most populated MFs in GDB-13s, ZINC, PubChem50, COCONUT, CMC, and CAS Registry-Organic Subset and the top-10 most frequent eMFs in GDB-13s, ZINC, PubChem50, and COCONUT (PDF)

Author Contributions

Y.B. designed and realized the study and wrote the paper. J.-L.R. co-designed and supervised the study and wrote the paper.

This work was funded by the Swiss National Science Foundation, grant number 200020_207976.

The authors declare no competing financial interest.

Supplementary Material

ci2c01107_si_001.pdf (1.1MB, pdf)

References

  1. Mullard A. The Drug-Maker’s Guide to the Galaxy. Nature 2017, 549, 445–447. 10.1038/549445a.
  2. Hoffmann T.; Gastreich M. The next Level in Chemical Space Navigation: Going Far beyond Enumerable Compound Libraries. Drug Discovery Today 2019, 24, 1148–1156. 10.1016/j.drudis.2019.02.013.
  3. Warr W. A.; Nicklaus M. C.; Nicolaou C. A.; Rarey M. Exploration of Ultralarge Compound Collections for Drug Discovery. J. Chem. Inf. Model. 2022, 62, 2021–2034. 10.1021/acs.jcim.2c00224.
  4. Reymond J.-L.; Ruddigkeit L.; Blum L.; van Deursen R. The Enumeration of Chemical Space. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 717–733. 10.1002/wcms.1104.
  5. Meier K.; Bühlmann S.; Arús-Pous J.; Reymond J.-L. The Generated Databases (GDBs) as a Source of 3D-Shaped Building Blocks for Use in Medicinal Chemistry and Drug Discovery. Chimia 2020, 74, 241–246. 10.2533/chimia.2020.241.
  6. Ruddigkeit L.; van Deursen R.; Blum L. C.; Reymond J. L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875. 10.1021/ci300415d.
  7. Wermuth C. G. Similarity in Drugs: Reflections on Analogue Design. Drug Discovery Today 2006, 11, 348–354. 10.1016/j.drudis.2006.02.006.
  8. Bemis G. W.; Murcko M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887–2893. 10.1021/jm9602928.
  9. Blum L. C.; Reymond J. L. 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. J. Am. Chem. Soc. 2009, 131, 8732–8733. 10.1021/ja902302h.
  10. Irwin J. J.; Tang K. G.; Young J.; Dandarchuluun C.; Wong B. R.; Khurelbaatar M.; Moroz Y. S.; Mayfield J.; Sayle R. A. ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J. Chem. Inf. Model. 2020, 60, 6065–6073. 10.1021/acs.jcim.0c00675.
  11. Kim S.; Chen J.; Cheng T.; Gindulyte A.; He J.; He S.; Li Q.; Shoemaker B. A.; Thiessen P. A.; Yu B.; Zaslavsky L.; Zhang J.; Bolton E. E. PubChem 2019 Update: Improved Access to Chemical Data. Nucleic Acids Res. 2019, 47, D1102–D1109. 10.1093/nar/gky1033.
  12. Sorokina M.; Merseburger P.; Rajan K.; Yirik M. A.; Steinbeck C. COCONUT Online: Collection of Open Natural Products Database. J. Cheminf. 2021, 13, 2. 10.1186/s13321-020-00478-9.
  13. Kruger F.; Stiefl N.; Landrum G. A. RdScaffoldNetwork: The Scaffold Network Implementation in RDKit. J. Chem. Inf. Model. 2020, 60, 3331–3335. 10.1021/acs.jcim.0c00296.
  14. Lipkus A. H.; Watkins S. P.; Gengras K.; McBride M. J.; Wills T. J. Recent Changes in the Scaffold Diversity of Organic Chemistry As Seen in the CAS Registry. J. Org. Chem. 2019, 84, 13948–13956. 10.1021/acs.joc.9b02111.
  15. Wills T. J.; Lipkus A. H. Structural Approach to Assessing the Innovativeness of New Drugs Finds Accelerating Rate of Innovation. ACS Med. Chem. Lett. 2020, 11, 2114–2119. 10.1021/acsmedchemlett.0c00319.
  16. Visini R.; Arús-Pous J.; Awale M.; Reymond J. L. Virtual Exploration of the Ring Systems Chemical Universe. J. Chem. Inf. Model. 2017, 57, 2707–2718. 10.1021/acs.jcim.7b00457.
  17. Meier K.; Arús-Pous J.; Reymond J.-L. A Potent and Selective Janus Kinase Inhibitor with a Chiral 3D-Shaped Triquinazine Ring System from Chemical Space. Angew. Chem., Int. Ed. Engl. 2021, 60, 2074–2077. 10.1002/anie.202012049.
  18. SciFinder: Substances Search Service. https://scifinder.cas.org (accessed Aug 25, 2022).
  19. Lipkus A. H.; Yuan Q.; Lucas K. A.; Funk S. A.; Bartelt W. F.; Schenck R. J.; Trippe A. J. Structural Diversity of Organic Chemistry. A Scaffold Analysis of the CAS Registry. J. Org. Chem. 2008, 73, 4443–4451. 10.1021/jo8001276.
  20. Probst D.; Reymond J.-L. Visualization of Very Large High-Dimensional Data Sets as Minimum Spanning Trees. J. Cheminf. 2020, 12, 12. 10.1186/s13321-020-0416-x.
  21. Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t.
  22. Daylight Theory: SMARTS—A Language for Describing Molecular Patterns. https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed July 25, 2022).
  23. Souza E. S.; Zaramello L.; Kuhnen C. A.; Junkes B. d. S.; Yunes R. A.; Heinzen V. E. F. Estimating the Octanol/Water Partition Coefficient for Aliphatic Organic Compounds Using Semi-Empirical Electrotopological Index. Int. J. Mol. Sci. 2011, 12, 7250–7264. 10.3390/ijms12107250.
  24. RDKit: Open-source cheminformatics. http://www.rdkit.org (accessed July 25, 2022).
  25. McKinney W. Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, Austin, TX, 2010; pp 56–61.
  26. Harris C. R.; Millman K. J.; van der Walt S. J.; Gommers R.; Virtanen P.; Cournapeau D.; Wieser E.; Taylor J.; Berg S.; Smith N. J.; Kern R.; Picus M.; Hoyer S.; van Kerkwijk M. H.; Brett M.; Haldane A.; del Río J. F.; Wiebe M.; Peterson P.; Gérard-Marchant P.; Sheppard K.; Reddy T.; Weckesser W.; Abbasi H.; Gohlke C.; Oliphant T. E. Array Programming with NumPy. Nature 2020, 585, 357–362. 10.1038/s41586-020-2649-2.
  27. Pyvenn: Venn diagrams for 2, 3, 4, 5, 6 sets. https://pypi.org/project/venn (accessed July 20, 2022).
  28. Capecchi A.; Probst D.; Reymond J.-L. One Molecular Fingerprint to Rule Them All: Drugs, Biomolecules, and the Metabolome. J. Cheminf. 2020, 12, 43. 10.1186/s13321-020-00445-4.


Articles from Journal of Chemical Information and Modeling are provided here courtesy of American Chemical Society
