2024 Sep 13;87(9):2216–2229. doi: 10.1021/acs.jnatprod.4c00530

Chemoinformatic Characterization of NAPROC-13: A Database for Natural Product 13C NMR Dereplication

Juan F Avellaneda-Tamayo, Naicolette A Agudo-Muñoz ‡, Javier E Sánchez-Galán §, José L López-Pérez ⊥,#,*, José L Medina-Franco †,*
PMCID: PMC11443490  PMID: 39269718

Abstract


Natural products (NPs) are secondary metabolites of natural origin with broad applications across various human activities, particularly the discovery of bioactive compounds. Structural elucidation of new NPs entails significant cost and effort. On the other hand, the dereplication of known compounds is crucial for the early exclusion of irrelevant compounds in contemporary pharmaceutical research. NAPROC-13 stands out as a publicly accessible database, providing structural and 13C NMR spectroscopic information for over 25 000 compounds, rendering it a pivotal resource in natural product (NP) research, favoring open science. This study seeks to quantitatively analyze the chemical content, structural diversity, and chemical space coverage of NPs within NAPROC-13, compared to FDA-approved drugs and a very diverse subset of NPs, UNPD-A. Findings indicated that NPs in NAPROC-13 exhibit properties comparable to those in UNPD-A, albeit showcasing a notably diverse array of structural content, scaffolds, ring systems of pharmaceutical interest, and molecular fragments. NAPROC-13 covers a specific region of the chemical multiverse (a generalization of the chemical space from different chemical representations) regarding physicochemical properties and a region as broad as UNPD-A in terms of the structural features represented by fingerprints.


Natural products (NPs) are chemical compounds produced by living organisms as part of their secondary metabolism. NPs and their derivatives are a well-known source of bioactive compounds and are widely applied in many other areas of human life.1 The importance of discovering, characterizing, and developing NPs lies in their broad spectrum of molecular complexity and structural diversity, including the so-called privileged scaffolds, which have been optimized through evolution over millions of years.2–4 Notably, more than 66% of current drugs approved for clinical use are NPs or NP derivatives.5 The currently explored chemical space of NPs comprises more than 1 × 10⁶ compounds, both isolated and predicted.6 Different estimates indicate that between 2.5 × 10⁵ and 4 × 10⁵ of them are available in public compound databases.7,8 The number of newly reported molecules keeps increasing.9,10 For these reasons, NPs continue to interest chemists and specialists of all disciplines because of their potentially privileged properties in terms of material resistance,11 biological activity in drug discovery,1,10,12 the fertilizer industry,13 and pesticides.14 As bioactive candidates, NPs present an advantage over synthetic compounds because they are synthesized by living species and have traditionally been exploited by different cultures at different stages of history.15

As part of NP-based drug discovery protocols, it is crucial to establish whether a bioactive compound from natural sources is a new finding. For this purpose, chromatograms and spectra of hit mixtures are compared with reference data using multiple powerful data analysis tools to identify potential bioactive compounds. This approach is the basis of natural product (NP) dereplication and takes advantage of metabolomics and other omics technologies.16

Structural elucidation and dereplication of NPs are rapidly advancing fields that benefit greatly from data analysis and machine learning. These technologies enable the efficient processing of large volumes of data generated from various experimental methods, which are becoming increasingly sensitive and high-resolution. As a result, researchers face more complex challenges in analyzing diverse chemical systems and handling vast amounts of data. Machine learning helps overcome these challenges by identifying patterns and relationships in large data sets, ultimately improving the speed and accuracy of NP identification and characterization.17

Data related to NPs, and specifically to NP dereplication, are stored in databases. Sorokina and Steinbeck recently reviewed the NP databases available in the public domain up to 2020.18 Previously, in 2015, Johnson and Lange reviewed open-access metabolomics databases for NP research.19 Databases focused on NP dereplication relate structural information to spectral and/or chromatographic information on compounds and to taxonomic knowledge on NP sources.17 These databases can be classified according to the spectrometric or spectroscopic information they support, i.e., MS or NMR data.

NAPROC-13 is a freely accessible, searchable, web-based database (https://c13.materia-medica.net/) that collects structural and 13C NMR spectral information on NPs, a considerable number of whose structures have been reviewed.20–23 NAPROC-13 contains information on 24 722 compounds (21 250 after applying a chemoinformatics-based curation protocol), mostly from plant sources and, to a lesser extent, from marine and microbial organisms. By geographical distribution, systematic introductions of NPs are limited to those from Panama and El Salvador; most of the remaining entries in NAPROC-13 come from various unspecified countries worldwide. Information on the chemical structures is not freely downloadable. NAPROC-13 has gained relevance each year, as evidenced by the 106 citations, 619 downloads, and 2916 views of the original publication (as of May 1, 2024)24 and by its more than 800 registered users from many countries around the world. In addition, a recent publication reporting on the usefulness of NAPROC-13 as a new strategy for finding erroneously established NPs has received over 3200 views.25

The main goal of this study is to quantitatively analyze the chemical content, diversity, and chemical space coverage of NPs in the latest release of NAPROC-13. Chemoinformatic and statistical tools were applied for this purpose. The database was compared to drugs approved by the United States Food and Drug Administration (FDA) and to a diverse subset of NPs (known as UNPD-A)2 regarding physicochemical and constitutional descriptors. To this end, we computed molecular descriptors of pharmaceutical interest, among others, and discussed the structural complexity as quantified by standard metrics. We also contrasted the structural diversity of compounds in NAPROC-13 according to different types of representations: (1) molecular fingerprints of different designs, (2) scaffold content according to Bemis and Murcko’s definition,26 (3) ring systems, and (4) fragments computed using the RECAP fragmentation algorithm.27 The ring systems of the studied data sets were compared against a virtual database of ring systems with reported bioactivity, Magic Rings, computed and released by Ertl.28

We also analyzed the chemical multiverse of NAPROC-13 using different types of molecular representations, including physicochemical, constitutional, and fingerprint-based descriptors. The chemical multiverse is a set of multiple chemical spaces that describe the same set of molecules through different sets of descriptors, each one describing the compounds differently but in a complementary manner.29 We used the COlleCtion of Open Natural ProdUcTs (COCONUT) database to approximate the known chemical space of NPs. The results of this investigation quantitatively showed the large structural diversity of NPs. The results also confirmed the vast chemical diversity of the NAPROC-13 database, which is well suited for NP dereplication and virtual screening on different target families, and which includes more than 4000 new NPs. Some examples of these newly included NPs are shown in Figure 1. NAPROC-13 showed a high structural diversity as well as the presence of interesting substructures (molecular scaffolds, Magic Rings, and molecular fragments) suitable for the discovery and design of drug candidates, including the development of pseudo-NPs.30

Figure 1.


Examples of nine new molecules (from more than 4000) reported in NAPROC-13 that are not present in the current version of COCONUT, with their configuration. NAPROC-13 ID, NP type, and group are underneath each structure (19047 and 19052 correspond to the revised structures from the original publications).

Results and Discussion

Data Sets

Table 1 summarizes relevant data from the analysis of NAPROC-13 and the reference data sets studied in the present work. The number of exclusive entries refers to the number of compounds in a data set that are not shared with the other databases. The scaled Shannon entropy (SSE) measures how the compounds are distributed among the different scaffolds present in the particular data set. The molecular similarity is a measure of the diversity of compounds within each database, in this case according to their ECFP4 representation. The fraction of sp3-hybridized carbon atoms (CSP3) and the number of chiral centers are different estimations of molecular complexity.31,32 The natural product-likeness (NPL) score was computed as described in the methods section.

Table 1. Structural Constitution, Complexity, and NPL Score of NAPROC-13 and Reference Data Sets.

| data set | size (initial) | size (curated) | exclusive entries (%)b | Murcko scaffolds (SSE) | similarity ECFP4 1024 bits, mean (median) | CSP3, mean (median) | chiral centers, mean (median) | NPL score, mean (median) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NAPROC-13 | 24722 | 21250 | 19992 | 0.97 | 0.144 (0.135) | 0.668 (0.724) | 6.586 (6.0) | 2.437 (2.575) |
| FDA | 2587 | 2324 | 2215 | 0.63 | 0.096 (0.094) | 0.454 (0.429) | 2.305 (1.0) | 1.513 (1.505) |
| UNPD-Aa | 14994 | 14994 | 13676 | 0.67 | 0.099 (0.091) | 0.519 (0.522) | 3.806 (2.0) | 0.019 (−0.095) |

a Data set curated from the source.2

b Computed disregarding chirality of compounds.

Finally, 4327 NPs in NAPROC-13 were unique when compared against the current version of COCONUT.8 This finding reveals the valuable work of correcting elucidated NP structures. Some of their structures and classifications are represented in Figure 1.

Distribution of Physicochemical Properties, Constitutional Descriptors, and Structural Complexity

Figure 2 shows the distribution of molecular descriptors computed, both physicochemical properties and constitutional descriptors of interest in NPs and pharmaceutical research. Histograms represent the fraction of compounds sharing the referenced property within the database, while probability distributions depict the probability density functions for the variable. The importance of measuring properties such as those described by Lipinski33 and Veber34 lies in empirical rules developed in observing trends in most of the small molecules approved for clinical use that are administered orally. NPs in NAPROC-13 had on average 6.07 HBA (median: 5.00; standard deviation: 4.2), FDA-approved drugs had 5.29 (median: 4.00; standard deviation: 4.6), and NPs in UNPD-A had 5.58 (median: 4.00; standard deviation: 5.0). For HBD, NPs in NAPROC-13 had 2.27 on average (median: 2.00; standard deviation: 2.4), FDA-approved drugs 2.45 (median: 2.00; standard deviation: 3.7), and NPs in UNPD-A 2.51 (median: 2.00; standard deviation: 3.2). In this sense, the number of heteroatoms per molecule follows a similar trend. NPs in NAPROC-13 presented 6.25 on average (median: 5.00; standard deviation: 4.2), FDA-approved drugs 7.50 (median: 6.00; standard deviation: 7.0), and NPs in UNPD-A 6.02 (median: 5.00; standard deviation: 5.1).

Figure 2.


Distribution of physicochemical properties and constitutional descriptors among NAPROC-13 (green), FDA-approved drugs (red), and UNPD-A (blue): (a) HBA, (b) HBD, (c) logP, (d) TPSA, (e) MW, (f) CSP3, (g) number of aromatic rings, (h) number of nitrogen atoms, and (i) fraction of RB. Dotted lines are used for ease of visualization.

The content of atoms distinct from carbon and hydrogen, which is directly correlated with the polarity of molecules, further influences properties like LogP and TPSA. Regarding LogP, NPs in NAPROC-13 presented an average of 3.54 (median: 3.48; standard deviation: 2.4), whereas FDA-approved drugs presented an average of 2.27 (median: 2.55; standard deviation: 2.9) and NPs in UNPD-A 2.94 (median: 2.87; standard deviation: 3.0). For TPSA, NPs in NAPROC-13 presented an average of 96.13 Å² (median: 80.92 Å²; standard deviation: 64.1 Å²), FDA-approved drugs 95.72 Å² (median: 74.60 Å²; standard deviation: 106.3 Å²), and NPs in UNPD-A 90.78 Å² (median: 69.67 Å²; standard deviation: 82.7 Å²).

The molecular size was approximated by calculating the MW. NPs in NAPROC-13 presented an average of 430.38 g/mol (median: 404.46 g/mol; standard deviation: 163.6 g/mol), FDA-approved drugs had 387.38 g/mol (median: 337.37 g/mol; standard deviation: 272.0 g/mol), and NPs in UNPD-A 371.94 g/mol (median: 330.29 g/mol; standard deviation: 196.4 g/mol). The trend toward larger molecules can be explained by the large triterpenoids, which are the most frequent molecular class, and their substituents, sometimes sugars. Nitrogen atoms are scarce in the NPs of NAPROC-13, which have an average of 0.06 per molecule (median: 0.00; standard deviation: 0.3), while FDA-approved drugs have 2.54 per molecule (median: 2.00; standard deviation: 3.3), and NPs in UNPD-A have 0.49 per molecule (median: 0.00; standard deviation: 1.2).

In general, the values of the descriptors discussed so far highlight that about 75% of the NPs in NAPROC-13 are within the values of the classical empirical drug-likeness rules, except for MW, with a Q3 of 502.52 (exceeding the rule by less than 1%).

Molecular Complexity

Molecular complexity is a useful property in screening compounds against biological targets and indicates the selectivity in the interaction.35 Herein we discuss the complexity of studied databases according to topological indexes such as CSP3 and the number of chiral centers, and by the substructure-based approaches of ring quantification.35,36 NPs in NAPROC-13 presented an average CSP3 of 0.67 (median: 0.724; standard deviation: 0.2), FDA-approved drugs of 0.45 (median: 0.43; standard deviation: 0.3), and NPs in UNPD-A of 0.52 (median: 0.52; standard deviation: 0.3). According to chiral centers, NPs in NAPROC-13 had an average of 6.59 (median: 6.00; standard deviation: 5.1), FDA-approved drugs 2.31 (median: 1.00; standard deviation: 3.8), and NPs in UNPD-A 3.81 (median: 2.00; standard deviation: 5.1). In terms of ring content, molecules of NAPROC-13 had an average of 3.92 (median: 4.00; standard deviation: 1.72), FDA-approved drugs 2.78 (median: 3.00; standard deviation: 2.0), and NPs in UNPD-A 3.10 (median: 3.00; standard deviation: 2.2). These values reinforce the tendency of NPs to have greater complexity according to the metrics implemented. Also, a higher complexity of NPs in NAPROC-13 versus those in UNPD-A is observed. Previous findings show that a higher molecular complexity affords a better performance in clinical trials,35 in conjunction with better bioavailability-related physicochemical properties.37 These observations suggest a high potential in NAPROC-13 as a database suitable for virtual screening and lead development and optimization. A comprehensive visualization of molecular descriptors computed in this study and their descriptive statistical summary can be found in the Supporting Information, in Figure S2 and Table S1, respectively.

Solubility-Descriptors Correlation

NAPROC-13 includes information about the deuterated solvents used to dissolve each compound for the 13C NMR experiments. In this study, we propose leveraging these data as an indicator of NP solubility. Deuterated chloroform was the most frequently reported solvent, used for 13 481 compounds, followed by pyridine-d5 (2496 compounds), methanol-d4 (2060 compounds), dimethyl sulfoxide-d6 (DMSO-d6, 1415 compounds), acetone-d6 (822 compounds), and benzene-d6 (309 compounds). These solvent data can serve as a valuable component of a classification model, as is or with the inclusion of broader data. Relevant potential applications of this database are the prediction of 13C NMR signals and of the impact of the analysis solvent, as well as the putative development of drugs of natural origin with specific solubility profiles.

While the development of a thoroughly validated machine learning model itself falls beyond the scope of this study, the exploratory data analysis presented herein showed the potential of the data annotated in NAPROC-13 for predicting the solubility of NPs. A key methodology employed to assess the impact of various descriptors on solubility involves visualizing data on the compounds under study. This statistical approach enhances our understanding of the relationships between different properties and solubility categories, thereby facilitating the identification of interdependent variables and refining parameters that are crucial for training machine learning models. Moreover, this data set is potentially useful to train a predictive machine learning model in the field of the NMR signal analysis of NPs.

Figure 3 shows the distribution of selected molecular descriptors among the different solubility categories for NPs in NAPROC-13. Of note, Figure 3 focuses on NAPROC-13 compounds, while Figure 2 compares the descriptor profiles of NAPROC-13 with those of the reference data sets. A complete visualization of molecular descriptors and a descriptive statistical summary are shown in Figure S3 and Table S2 in the Supporting Information, respectively. Figure 3 depicts how the properties of NAPROC-13 compounds are distributed differently among the categories of analysis solvent (used here as a proxy for solubility). Solubility is a macroscopic property of a substance in the presence of another substance (the solvent) and, like the melting and boiling points, is commonly associated with physicochemical and constitutional molecular properties. Some of the striking differences in this respect are related to the polarity of the molecules, their ability to interact through hydrogen bonds, and the polar/nonpolar character of the solvent. Examples of these relationships are the differential distributions of HBD and HBA, the number of heteroatoms, and CSP3.

Figure 3.


Distribution of physicochemical properties and constitutional descriptors among NAPROC-13 solubility categories. (a) HBD, (b) HBA, (c) number of heteroatoms, (d) CSP3, (e) TPSA, (f) number of aromatic rings, and (g) number of ring structures. Dotted lines are used for ease of visualization.

Structural Diversity

Fingerprint-Based Structural Diversity

Structural fingerprints of different designs served as molecular representations across the studied data sets, with the Tanimoto similarity coefficient used to gauge pairwise similarity between molecules. To assess diversity within each database, pairwise molecular similarities were computed, sorted in ascending order, and depicted as cumulative curves. A curve that rises faster at low similarity values (that is, one that lies closer to the upper-left corner of the plot) signifies greater database diversity, as illustrated in Figure 4. Independent of the molecular representation used, whether MACCS keys (166 bits), ECFP4 or ECFP6 (1024 bits), or the recently developed MAP4 (2048 bits), database diversity can be ranked in ascending order as follows: NAPROC-13 exhibits a lower diversity than UNPD-A, which, in turn, is less diverse than FDA-approved drugs. Statistical similarity values also indicate a lower diversity within compounds in NAPROC-13. However, it is important to highlight that, even though it was not designed for this purpose, NAPROC-13 demonstrates a diversity close to that of UNPD-A, despite the latter being designed to be highly diverse.2 NAPROC-13 closely aligns with the statistical similarity values of UNPD-A, particularly when employing molecularly independent fingerprints such as ECFP and MAP4. These descriptors offer higher resolution in molecular representation, detailing specific connectivity features along each molecule, thereby contributing to their effectiveness in assessing molecular diversity.38,39
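The pairwise-similarity procedure described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' pipeline: fingerprints are assumed to be given as sets of on-bit indices, and the toy bit sets and the function names (`tanimoto`, `pairwise_similarities`, `cumulative_fraction`) are hypothetical.

```python
from itertools import combinations

def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def pairwise_similarities(fingerprints):
    """All unique pairwise Tanimoto values, sorted in ascending order."""
    return sorted(tanimoto(a, b) for a, b in combinations(fingerprints, 2))

def cumulative_fraction(sorted_sims, threshold):
    """One point of the cumulative curve: fraction of pairs with similarity <= threshold."""
    if not sorted_sims:
        return 0.0
    return sum(s <= threshold for s in sorted_sims) / len(sorted_sims)

# Toy fingerprints (hypothetical on-bit indices, not real ECFP4 output)
fps = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5, 6}), frozenset({7, 8, 9})]
sims = pairwise_similarities(fps)
print(sims, cumulative_fraction(sims, 0.2))
```

A data set whose cumulative fraction is already high at a low threshold contains mostly dissimilar pairs, which is the sense in which its curve indicates greater diversity.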

Figure 4.


Cumulative distribution functions for the pairwise Tanimoto similarity using (a) MACCS-keys (166 bits), (b) ECFP4, (c) ECFP6, and (d) MAP4 fingerprints as molecular representations. NAPROC-13 (green), FDA-approved drugs (red), and the most diverse subset of NPs (UNPD-A, blue). Dotted lines are used for ease of visualization.

Scaffold-Based Substructural Diversity

Analysis of the scaffold content of the compound data sets revealed that compounds in NAPROC-13 present a high content of cyclic compounds, with a small fraction (0.78%) of acyclic structures. In contrast, NPs in UNPD-A, as well as FDA-approved drugs, have a similar fraction of acyclic compounds, nearly 11.5% of the database. However, it is remarkable that most of the cyclic molecular cores in NAPROC-13 are uniquely present in NAPROC-13 (Figure 5 and Figure S1 in the Supporting Information). Moreover, most of them are absent in the compounds approved for clinical use.

Figure 5.


Fifteen of the most frequent scaffolds from NAPROC-13 and their abundances in the three data sets. Additionally, 165 (0.78%) compounds of NAPROC-13, 265 (11.40%) of the approved drugs, and 1743 (11.62%) of UNPD-A are acyclic.

The large abundance of fused aromatic and saturated rings in almost all of the 15 most frequent scaffolds in NAPROC-13 is noticeable. An exception to this trend is the high content of the benzene ring, with 208 (0.98%) occurrences, which tends to be the most frequent scaffold in small-molecule databases,40 specifically in NPs, and even food chemicals.4,26,39,41

NAPROC-13 presented 8169 molecular scaffolds, 77 common to all three databases, 1770 shared with UNPD-A (the highest overlapping in scaffold analysis), and 84 shared with FDA-approved drugs. UNPD-A had 7059 scaffolds, 5175 were unique, and 191 were shared with FDA-approved drugs. Finally, FDA-approved drugs had 1291 scaffolds, and 1093 of them were unique. These results are in accordance with the high molecular diversity in NPs,42 demonstrated by pairwise similarity relationships (see above), and with previous analyses of the databases analyzed in this study.2,39

Two quantitative approaches to compare the scaffold diversity of compound data sets are the scaled Shannon entropy (SSE) and the cyclic system retrieval (CSR) curve. The SSE is an index that measures how uniformly the information, in this case the compounds, is distributed among the different scaffolds. The SSE ranges from zero (no diversity, since all compounds share the same scaffold) to one (maximum diversity, with a uniform distribution of compounds among the scaffolds). For the three databases compared in the present study, NAPROC-13 presented an SSE of 0.97 for its 15 most frequent scaffolds (chemical structures shown in Figure 5), FDA-approved drugs 0.63, and UNPD-A 0.67. These SSE values indicate that there is no evident predominance among the first 15 scaffolds in NAPROC-13 NPs. In contrast, for UNPD-A NPs as well as FDA-approved drugs, saturated and unsaturated six-membered rings are overrepresented, which is reflected in their lower SSE values. These results are in agreement with previous studies.39
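As a minimal sketch of the index, assuming the usual definition in which the Shannon entropy of the scaffold populations is divided by its maximum value, log2(n), for n distinct scaffolds, the SSE can be computed as follows. The function name and the toy scaffold labels are hypothetical; in the analysis above, n would be the 15 most frequent scaffolds.

```python
import math
from collections import Counter

def scaled_shannon_entropy(scaffolds):
    """SSE = -sum(p_i * log2(p_i)) / log2(n) over the n distinct scaffolds.

    `scaffolds` holds one scaffold label per compound.
    Returns 0.0 when fewer than two distinct scaffolds are present.
    """
    counts = Counter(scaffolds)
    n = len(counts)
    if n < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(n)

# Toy data: an even distribution versus one dominated by a single scaffold
even = ["s1", "s2", "s3", "s4"] * 5
skewed = ["s1"] * 17 + ["s2", "s3", "s4"]
print(scaled_shannon_entropy(even), scaled_shannon_entropy(skewed))
```

The evenly populated toy set reaches the maximum of one, while the set dominated by a single scaffold scores well below it, mirroring the contrast between 0.97 and the lower values reported for the reference sets.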

The CSR curves shown in Figure 6 are a graphic representation of how compounds, represented as the fraction of each database, are distributed along the set of scaffolds, represented as the fraction of the total amount of scaffolds. As more compounds are accumulated by a small fraction of the database’s scaffolds, the less diverse the database is. The CSR curves in Figure 6 indicate that the FDA-approved drugs set has the overall largest scaffold diversity, followed by UNPD-A and finally by NAPROC-13. These measures support the previously discussed high diversity of the NPs.
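A CSR curve of this kind can be built by ranking scaffolds from most to least populated and accumulating the fraction of compounds they account for. The sketch below is an assumption-laden illustration rather than the authors' code; the function names and toy labels are hypothetical, and the area under the curve is added only as one common way to summarize the curve in a single number.

```python
from collections import Counter

def csr_curve(scaffolds):
    """Points (fraction of scaffolds, fraction of compounds), most populated scaffold first."""
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), sum(counts)
    points, covered = [(0.0, 0.0)], 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        points.append((rank / n_scaffolds, covered / n_compounds))
    return points

def area_under_curve(points):
    """Trapezoidal area: 0.5 for an even spread, approaching 1.0 as diversity drops."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data: one compound per scaffold versus one dominant scaffold
even = ["s1", "s2", "s3", "s4"]
skewed = ["s1"] * 8 + ["s2", "s3"]
print(area_under_curve(csr_curve(even)), area_under_curve(csr_curve(skewed)))
```

In the toy example the evenly spread set follows the diagonal, whereas the skewed set rises steeply because a single scaffold accounts for most of the compounds.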

Figure 6.


Cyclic system retrieval (CSR) curves for NAPROC-13 (green), FDA-approved drugs (red), and the most diverse subset of NPs (UNPD-A, blue).

To further compare the substructures in the different data sets, we analyzed the types and frequencies of disconnected ring systems, which reveals how minimal cyclic substructures overlap among the three databases. The results of this analysis are reported in the next section.

Ring System-Based Substructural Diversity

A ring system is defined as the set of atoms contained within a cycle, including exocyclic double bonds. Unlike Bemis and Murcko scaffolds, ring systems are obtained by disconnecting the nonring parts attached through single bonds, so that each ring system is considered separately. This approach to molecular exploration has been widely implemented to describe the structural composition of bioactive molecules43–47 and has been used in the study of NPs.48 Possible combinations of heteroatom-containing ring systems have also been predicted, along with a probable activity for a biological target, under the categories of active, inactive, or undefined. Herein, we used the results reported by Ertl to extract and compare potential biologically active ring systems (so-called “Magic Rings”) and to search for similarities among NAPROC-13 and the reference databases.28 Ertl’s categorization assigns a ring system to a preferred target family if the number of reports for that family is at least twice as large as that for the next target class; otherwise, the ring system is assigned to the multitarget category. Regarding bioactivity, a ring system is reported as active if it has at least ten times more reports as active than as inactive. The same condition applies, in reverse, to the inactive classification. Otherwise, no classification is assigned.28 Herein, we use the “intermediate” bioactivity class for such ring systems.
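The two categorization rules described above can be written as simple threshold checks. The following sketch only restates those rules under stated assumptions: the report counts, the function names, and the handling of edge cases (no reports at all, or a single reported target class) are hypothetical and are not taken from Ertl's implementation.

```python
def bioactivity_class(n_active: int, n_inactive: int, ratio: int = 10) -> str:
    """'active' or 'inactive' if one count is at least `ratio` times the other."""
    if n_active > 0 and n_active >= ratio * n_inactive:
        return "active"
    if n_inactive > 0 and n_inactive >= ratio * n_active:
        return "inactive"
    return "intermediate"  # label used in this study for unclassified cases

def target_class(reports: dict, factor: int = 2) -> str:
    """Preferred target family if its count is >= `factor` times the runner-up's."""
    if not reports:
        raise ValueError("at least one target class is required")  # assumption
    ranked = sorted(reports.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] >= factor * ranked[1][1]:
        return ranked[0][0]
    return "multitarget"

# Hypothetical report counts for one ring system
print(bioactivity_class(50, 5), target_class({"GPCR": 40, "kinase": 10}))
```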

In the current analysis, we identified 8902 ring systems; 3853 were only present in NAPROC-13 NPs, 2973 were only present in UNPD-A, and 438 were only present in FDA-approved drugs. The major overlapping set was between NAPROC-13 and UNPD-A, as expected, with 1504 ring systems. Between UNPD-A and FDA-approved drugs, there is an overlap of 229 ring systems and between NAPROC-13 and FDA-approved drugs there is an overlap of 109 ring systems. Finally, 102 ring systems are shared among the three databases (see Figure S1.C in the Supporting Information).
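The counts above can be cross-checked with a short inclusion–exclusion calculation. The sketch assumes that each reported pairwise overlap also includes the 102 ring systems shared by all three databases; under that reading the pieces add up to the reported total of 8902. The function name is hypothetical.

```python
def union_size(only_a, only_b, only_c, ab, ac, bc, abc):
    """Size of A ∪ B ∪ C from exclusive counts and overlaps.

    ab, ac, and bc are pairwise overlaps that INCLUDE the triple overlap abc
    (an assumption about how the counts are reported).
    """
    return only_a + only_b + only_c + (ab - abc) + (ac - abc) + (bc - abc) + abc

# A = NAPROC-13, B = UNPD-A, C = FDA-approved drugs (ring-system counts from the text)
ring_total = union_size(only_a=3853, only_b=2973, only_c=438,
                        ab=1504, ac=109, bc=229, abc=102)
print(ring_total)
```

If the pairwise overlaps were instead exclusive of the triple overlap, the same inputs would sum to 9208, which does not match the reported total.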

The main classes of bioactive ring systems identified in the three databases according to Ertl’s categorization, with classification either active or intermediate active, are related to multitarget reports (not-known: NAPROC-13: 728, FDA-approved drugs: 33, UNPD-A: 570; multiple targets: NAPROC-13: 153, FDA-approved drugs: 147, UNPD-A: 241; other enzymes: NAPROC-13: 143, FDA-approved drugs: 73, UNPD-A: 186). However, GPCR (NAPROC-13: 54, FDA-approved drugs: 76, and UNPD-A: 99) and nuclear receptors (NAPROC-13: 21, FDA-approved drugs: 10, UNPD-A: 32) are also well represented by natural product ring systems, both in NAPROC-13 and UNPD-A (see Figure S4 in the Supporting Information).

Figure 7 shows the most frequent active ring systems (“Magic Rings”) in NAPROC-13 NPs. Figures S5 and S6 in the Supporting Information show the most frequent active ring systems in FDA-approved drugs and UNPD-A NPs, respectively.

Figure 7.


Most frequent potentially bioactive ring systems in NAPROC-13 (“Magic Rings”).28

The predominant ring system in the NP databases under study is the chromone (1-benzopyran-4-one) core, which serves as the structural foundation for flavonoids. Flavonoids, abundant in plants, play pivotal roles in essential metabolic pathways and have been linked to various biological activities, notably their antioxidant properties.49,50 Remarkably, this ring system is also prevalent among FDA-approved drug structures. Additionally, both NP databases exhibit a significant presence of diverse steroid-type cores, which are well-documented for their wide-ranging bioactivities.51

In their most frequent ring systems, UNPD-A exhibits nitrogen-containing ring systems in limited quantities, while NAPROC-13 lacks such structures entirely. Contrastingly, both databases include oxygen-containing ring systems. These characteristics align with the typical patterns observed in natural products and contrast with synthetic compounds like FDA-approved drugs.10,52,53

Conversely, UNPD-A displays a marked abundance of quinone-type cores, a feature notably absent in NAPROC-13. Quinones are intriguing NPs known for their dual nature, possessing both beneficial and potentially toxic effects in humans. This characteristic positions quinones as a promising area for ongoing and future investigations into their therapeutic potential and safety profile.54

Fragment-Based Substructural Diversity

Molecular fragments are relevant entities for a comprehensive characterization of chemical libraries, specifically for drug design. Fragment-based drug design (FBDD) is founded on the search for better atomic efficiency in binding interactions.55 Fragments can be used to build de novo design,56 reaction-based,57 or transformation rule-based58 fragment libraries. In the field of fragment screening, more than 20 years ago Congreve et al. proposed a set of empirical rules known as the rule of three (RO3: MW < 300, cLogP ≤ 3, HBD ≤ 3, HBA ≤ 3).59 Although the RO3 has limitations, the concept has provided relevant resources in FBDD, especially by limiting the molecular complexity of fragment libraries, enabling an adequate exploration of diverse regions of the molecular space and avoiding “molecular obesity”.60–62
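The RO3 thresholds quoted above translate directly into a filter. The sketch below assumes the four descriptors have already been computed with a cheminformatics toolkit; the `Fragment` class, the function name, and the descriptor values of the toy entries are hypothetical and serve only to illustrate the rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    """A fragment with precomputed descriptors (illustrative values only)."""
    name: str
    mw: float     # molecular weight, g/mol
    clogp: float  # calculated logP
    hbd: int      # hydrogen-bond donors
    hba: int      # hydrogen-bond acceptors

def passes_ro3(f: Fragment) -> bool:
    """Rule of three as stated in the text: MW < 300, cLogP <= 3, HBD <= 3, HBA <= 3."""
    return f.mw < 300 and f.clogp <= 3 and f.hbd <= 3 and f.hba <= 3

# Hypothetical fragments; the descriptor values are approximate placeholders
library = [
    Fragment("small_phenol_like", mw=94.1, clogp=1.4, hbd=1, hba=1),
    Fragment("glycoside_like", mw=342.3, clogp=-3.0, hbd=8, hba=11),
    Fragment("lipophilic_core", mw=250.0, clogp=4.2, hbd=0, hba=1),
]
ro3_compliant = [f.name for f in library if passes_ro3(f)]
print(ro3_compliant)
```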

Currently, there are multiple commercial or freely accessible fragment libraries, many of them based on small, synthetically accessible molecules or focused on particular biological activities, among others. However, there is still a lack of NP-based fragment libraries.2,63 For this reason, we consider that a highly valuable contribution of the present study is to make the fragment libraries we have built freely available in the Supporting Information.

We identified 748 533 different fragments, 405 265 were only present in NAPROC-13 NPs, 320 143 were unique in UNPD-A, and 14 141 were only present in FDA-approved drugs. The highest overlap was between both natural source databases with 8438 fragments in common. FDA-approved drugs and NAPROC-13 shared 345 fragments, UNPD-A and FDA-approved drugs 745, and 272 fragments were present in the three databases (see Figure S1.D in the Supporting Information).

For each data set of molecular fragments generated, we computed the properties included in the RO3. Figure 8 shows the most frequent molecular fragments in NAPROC-13 NPs fulfilling the RO3 conditions and their frequencies in FDA-approved drugs and NPs in UNPD-A.

Figure 8.


Most frequent fragments complying with the RO3, computed for NPs in NAPROC-13, and their frequencies in FDA-approved drugs and UNPD-A. The “A” symbol marks the point of disconnection.

The numbers of distinct molecular fragments fulfilling the RO3 conditions were 5196 for NAPROC-13 NPs, 2911 for FDA-approved drugs, and 7596 for UNPD-A NPs, highlighting the highest diversity of the UNPD-A subset. The computed fragments for the NP source databases presented a higher content of oxygen atoms, in accordance with previous studies. The lack of heteroatoms other than oxygen is a characteristic mentioned and discussed above, as is the contrast with FDA-approved drugs (see sections 3.2 and 3.3.3).

Chemical Multiverse Visualization

Figure 9 shows a visual representation of the chemical multiverse of NAPROC-13, embedded in the chemical multiverse of the known NPs (using COCONUT as an approximation of this set) and compared with the chemical multiverse of UNPD-A, a set of NPs designed to be diverse, and of the FDA-approved drugs. According to physicochemical properties of pharmacological interest, NAPROC-13 covers a smaller region of the chemical space than UNPD-A, and both of them cover a limited region of the NP chemical space. FDA-approved drugs overlap a considerable zone of the NP chemical space, as has been demonstrated in previous analyses (see Figure 9A).42

Figure 9.

Chemical multiverse visualization of NPs in NAPROC-13 compared to those of UNPD-A, FDA-approved drugs, and COCONUT as the chemical space of NPs. (a) PCA of molecular descriptors of pharmacological interest (HBA, HBD, MW, TPSA, RB, LogP) and (b) t-SNE of structural fingerprint ECFP4.

In terms of structural motifs, the coverage of the NP chemical space by NAPROC-13 was assessed using the molecular fingerprint ECFP4, which has been broadly used to represent NPs in previous studies and is well established as an adaptable representation for small and synthetic molecules as well as for larger molecules of natural origin.42,64 The structural diversity of the NPs in NAPROC-13 determined with this methodology is highly remarkable, bearing in mind that UNPD-A is, as mentioned above, a diversity-focused library (see Figure 9B).2,42

The t-SNE dimensionality reduction (perplexity = 30, learning rate = default, number of iterations = 5000) performed well in representing the chemical space covered in the present study, in the sense that small and synthetic molecules, such as FDA-approved drugs, clustered closely in the middle of the plot, in contrast to natural products, which covered a broader region of structural features. This trend appears to be reproduced in terms of druglike properties, where UNPD-A covers a wider region of NP chemical space than NAPROC-13.

Natural Product Likeness Score

Figure 10 shows the distribution of NPL scores of the compounds in the different databases analyzed.65 The NPL score is an approximate measure that summarizes the characteristics of NPs that differentiate them from synthetic molecules, and it can be computed with different approaches and modeling methodologies. The approach used here is the most classical one, a Bayesian statistical model based on fragments that commonly occur in NPs. Positive values of the NPL score mean that the chemical structure of a compound resembles an NP structure (given the data set used to train the metric). In contrast, negative values of the score mean that the compound is more closely associated with fragments typical of synthetic organic compounds. In this study, profiling compound databases in terms of NPL scores aims to prioritize databases in terms of their coverage of the chemical space of NPs. Results in Figure 10 indicate that compounds in NAPROC-13 have large variability in natural product likeness, but most of them, as expected, have a strong NP character: their NPL scores were strongly shifted toward positive values (average: 2.44; median: 2.58; standard deviation: 0.8). Results also indicated that compounds in NAPROC-13 have stronger NP features than the NPs in UNPD-A (average: 1.51; median: 1.51; standard deviation: 1.1). As expected, both NP collections had higher values than compounds in the set of FDA-approved drugs (average: 0.02; median: −0.10; standard deviation: 1.1), which was used as a reference in the comparison. The distribution of NPL scores in Figure 10 agreed with the values calculated previously for other NPs and approved drugs.39,66

Figure 10.

Distribution of the probability density of the Natural Product Likeness score of NPs in NAPROC-13, UNPD-A, and FDA-approved drugs.

Conclusions

NAPROC-13 is a freely accessible compound database containing over 24 000 NPs, with detailed information on their chemical structures and 13C NMR data. Notably, more than 4000 compounds in NAPROC-13 have yet to be included in large NP databases, such as COCONUT, highlighting its unique contribution to the field. Our analysis shows that the chemical structures of NPs in NAPROC-13 exhibit similar physicochemical and constitutional properties to those in a broad and diverse set of NPs. However, they also contain distinct molecular fragments, scaffolds, and ring systems of pharmaceutical relevance, distinguishing them from other databases.

The high structural complexity of NAPROC-13 NPs, compared to other screening libraries, including those of generally diverse NPs, underscores its potential for use in virtual screening for hit identification in drug discovery projects. While the chemical space of NAPROC-13, as described by drug-type properties, is somewhat limited to areas associated with approved drugs, its extensive distribution across the NP chemical space characterized by structural features makes it a valuable resource for identifying potential candidates in drug discovery efforts, especially for exploring under-represented areas of chemical diversity. This is further supported by our findings, where NAPROC-13 achieved higher NPL scores than those of other NP sets and approved drugs, demonstrating its significant potential in the field.

Moving forward, one of the key perspectives of this work is to enhance the NAPROC-13 web site by incorporating interactive features that display computed descriptors, graphics, scaffolds, and substructure searching tools. These updates will provide researchers with an invaluable resource for NP dereplication, drug design, and structure-based investigations. By continuously updating and expanding the database with new tools and data, we aim to further support NP research and facilitate advancements in spectroscopy, machine learning, and drug discovery.

Experimental Section

Data Sets and Standardization of Chemical Structures

At the time of writing (May 2024), NAPROC-13 had 24 722 compounds. While most entries in this web-based application originate from plant sources, it also includes some marine and microbiological NPs, reflecting recent revisions in their structures. It is noteworthy that the priority for entries in NAPROC-13 currently lies with substances for which a considerable number of structures have been reviewed.21–23 By geographical distribution, systematic introductions of NPs are limited to those from Panama and El Salvador. Most of the entries in NAPROC-13 come from various unspecified countries worldwide. For comparisons, we used the following as reference compound databases: the FDA set (updated until January 4, 2023)67,68 with 2324 unique compounds, and the Universal Natural Product Database - Subset A (UNPD-A), which includes the 14 994 most diverse compounds from NPs reported in the UNPD, selected using the MaxMin algorithm.2,69 In addition, a set of the solvents reported for the analysis of NAPROC-13 compounds was created from the original and curated database as an approximation to the solubility of the chemical compounds in deuterated solvents. The COCONUT database was included as an approximation to the entirety of reported compounds of natural origin, to estimate the region of the chemical space of NPs that is covered by NAPROC-13.8
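The MaxMin selection mentioned above for UNPD-A can be illustrated with a minimal pure-Python sketch. It is not the code used to build UNPD-A: fingerprints are represented here as sets of on-bit indices, and the function names (`tanimoto_distance`, `maxmin_pick`), the random starting compound, and the toy data are assumptions made for illustration.

```python
import random


def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """1 - Tanimoto similarity between two fingerprints given as on-bit sets."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def maxmin_pick(fps, n_pick, seed=0):
    """Greedy MaxMin: repeatedly pick the compound whose minimum distance
    to the already picked compounds is the largest."""
    if not fps:
        return []
    rng = random.Random(seed)
    first = rng.randrange(len(fps))  # assumption: random starting compound
    picked = [first]
    # min_dist[i] = distance from compound i to its nearest picked compound;
    # picked compounds are flagged with -1 so they are never selected again.
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    min_dist[first] = -1.0
    while len(picked) < min(n_pick, len(fps)):
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        picked.append(nxt)
        for i, fp in enumerate(fps):
            if min_dist[i] >= 0.0:
                min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
        min_dist[nxt] = -1.0
    return picked


# Toy fingerprints: the first two are near-duplicates, the third is distinct.
toy_fps = [frozenset({1, 2, 3}), frozenset({1, 2, 3, 4}), frozenset({7, 8, 9})]
selection = maxmin_pick(toy_fps, n_pick=2)
```

Whichever compound is drawn first, the two-compound selection contains the structurally distinct third fingerprint rather than both near-duplicates, which is the behavior that makes MaxMin suitable for building diversity-focused subsets.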

Compounds in NAPROC-13, UNPD-A, and FDA-approved drugs encoded as Simplified Molecular Input Line Entry System (SMILES)70 were standardized using the open-source cheminformatics toolkit RDKit71 (version 2023.09.4) and MolVS.72 According to a standardized protocol,73 the functions Standardizer, LargestFragmentChooser, Uncharger, Reionizer, and TautomerCanonicalizer implemented in MolVS were used. Compounds with valence errors or any chemical element different from H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I were removed. Stereochemistry information was kept except for the computation of unique compounds, molecular scaffolds, and ring systems. Compounds with multiple components, if present, were split, and the largest component was retained. The remaining compounds were neutralized and reionized to generate the corresponding canonical tautomer.
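The standardization steps described above can be sketched as follows. This is not the authors' script: it swaps MolVS for RDKit's built-in `rdMolStandardize` module (a port of MolVS), and the function name `standardize_smiles`, the order of the element filter relative to fragment selection, and the example inputs are assumptions.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Elements allowed by the protocol described in the text.
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}


def standardize_smiles(smiles):
    """Return a standardized canonical SMILES, or None if the compound is rejected."""
    mol = Chem.MolFromSmiles(smiles)  # None for unparsable input or valence errors
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)  # general normalization
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)  # keep largest component
    # Assumption: the element filter is applied to the retained component.
    if any(atom.GetSymbol() not in ALLOWED_ELEMENTS for atom in mol.GetAtoms()):
        return None
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize
    mol = rdMolStandardize.Reionizer().reionize(mol)  # consistent ionization state
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # canonical tautomer
    return Chem.MolToSmiles(mol)


acetic_acid = standardize_smiles("CC(=O)[O-].[Na+]")  # salt: counterion is dropped
rejected_element = standardize_smiles("C[As](C)C")    # As is not in the allowed list
rejected_valence = standardize_smiles("CN(C)(C)(C)C")  # pentavalent neutral nitrogen
```

In this sketch, stereochemistry is left untouched; stripping it for the computation of unique compounds, scaffolds, and ring systems would be a separate step.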

Molecular Descriptors

For each molecule, physicochemical properties of pharmaceutical interest, constitutional descriptors, and molecular fingerprints were calculated in Python using the RDKit toolkit71 (version 2023.09.4) and Molecular Operating Environment (MOE)74 (version 2022.02).

Descriptors computed with the RDKit toolkit were hydrogen bond acceptors (HBA), hydrogen bond donors (HBD), partition coefficient octanol/water (LogP), topological polar surface area (TPSA), molecular weight (MW), fraction of sp3 carbon atoms (CSP3), number of heavy atoms, number of ring systems, number of heteroatoms, number of rotatable bonds (RB), number of alicyclic rings formed by carbon atoms, number of alicyclic rings that include heteroatoms, number of aromatic rings formed by carbon atoms, number of aromatic rings that include heteroatoms, and the total number of aromatic rings. The number of acid atoms, aromatic atoms, basic atoms, nitrogen, oxygen, and halogen atoms, the fraction of rotatable bonds, and the number of chiral centers were computed using MOE.
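A minimal sketch of how some of the RDKit descriptors listed above might be computed is given below. The specific RDKit functions chosen (for example, `CalcNumHBA` rather than the Lipinski variants) and the helper name `basic_descriptors` are assumptions; the text does not state which calls were used.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors


def basic_descriptors(smiles):
    """Compute a subset of the descriptors named in the text for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        "HBA": rdMolDescriptors.CalcNumHBA(mol),
        "HBD": rdMolDescriptors.CalcNumHBD(mol),
        "LogP": Crippen.MolLogP(mol),            # Wildman-Crippen estimate
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
        "MW": Descriptors.MolWt(mol),
        "CSP3": rdMolDescriptors.CalcFractionCSP3(mol),
        "heavy_atoms": mol.GetNumHeavyAtoms(),
        "RB": rdMolDescriptors.CalcNumRotatableBonds(mol),
        "aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
    }


ethanol = basic_descriptors("CCO")
```

Applying such a function row by row to a table of standardized SMILES yields the descriptor matrix used for property distributions and PCA.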

Three types of molecular fingerprints with different designs were calculated: Molecular ACCess System (MACCS) keys (166-bits),75 extended connectivity fingerprint (ECFP)76 of 1024-bits with diameter 4 (ECFP4) and diameter 6 (ECFP6), and MinHashed atom pair fingerprint up to a diameter of four bonds (MAP4).77,78
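The MACCS and ECFP fingerprints can be generated with RDKit roughly as sketched below. The helper name `compute_fingerprints` is hypothetical, and MAP4 is omitted because it is distributed as a separate package. Note that ECFP diameters 4 and 6 correspond to Morgan radii 2 and 3, and that RDKit's MACCS bit vector has 167 positions because bit 0 is unused.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys, rdFingerprintGenerator

# Morgan generators: radius = diameter / 2, folded to 1024 bits.
ECFP4_GEN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)
ECFP6_GEN = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=1024)


def compute_fingerprints(smiles):
    """Return MACCS, ECFP4, and ECFP6 bit vectors for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        "MACCS": MACCSkeys.GenMACCSKeys(mol),
        "ECFP4": ECFP4_GEN.GetFingerprint(mol),
        "ECFP6": ECFP6_GEN.GetFingerprint(mol),
    }


caffeine = compute_fingerprints("Cn1cnc2c1c(=O)n(C)c(=O)n2C")
self_similarity = DataStructs.TanimotoSimilarity(caffeine["ECFP4"], caffeine["ECFP4"])
```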

Structural Content and Diversity

The structural diversity of NPs reported in NAPROC-13 was analyzed by comparison with FDA-approved drugs and the most diverse set of NPs (UNPD-A), in terms of their distribution of similarity values, computed with the Tanimoto coefficient using four molecular fingerprints: MACCS Keys (166 bits), ECFP4, ECFP6, and MAP4. For NPs in NAPROC-13 and UNPD-A, five random samples of 1000 compounds each were extracted to reduce the computational cost. It has been demonstrated that multiple sampling of 1000 compounds is a valid approach to quantify the entire database pairwise fingerprint-based diversity.79
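The sampling scheme described above can be sketched in pure Python. Fingerprints are represented as sets of on-bit indices, and the function names and the choice of the median as the summary statistic are assumptions for illustration; the Tanimoto coefficient itself is the standard ratio of shared on bits to total on bits.

```python
import random
from itertools import combinations
from statistics import median


def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sampled_pairwise_medians(fps, n_samples=5, sample_size=1000, seed=0):
    """Median pairwise similarity for several random samples of a database."""
    rng = random.Random(seed)
    medians = []
    for _ in range(n_samples):
        sample = rng.sample(fps, min(sample_size, len(fps)))
        sims = [tanimoto(a, b) for a, b in combinations(sample, 2)]
        medians.append(median(sims))
    return medians


toy_fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}),
           frozenset({1, 2, 3, 4}), frozenset({9})]
toy_medians = sampled_pairwise_medians(toy_fps, n_samples=5, sample_size=3)
```

A sample of 1000 compounds produces 499 500 pairwise comparisons, compared with roughly 3 × 10^8 for the full NAPROC-13 set, which is the saving in computational cost that motivates the sampling.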

Among the multiple methods to perform the molecular scaffold analysis of a set of compounds,80 we used the definition proposed by Bemis and Murcko, which consists of removing all side chains in molecules and preserving the ring systems and their corresponding linkers.26 Along with this calculation, scaffold diversity was estimated by computing the scaled Shannon’s entropy (SSE) of the distribution of molecules over the scaffolds present, taking into account the first 15 scaffolds.81 The Cyclic System Retrieval (CSR) curve was generated as a visual guide to compare the relative scaffold diversity of the databases.
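For illustration, the two scaffold diversity measures can be sketched in pure Python, taking a list with one scaffold identifier (for example, a Bemis-Murcko scaffold SMILES) per compound. The function names are hypothetical, and interpreting "the first 15 scaffolds" as the 15 most populated ones is an assumption.

```python
import math
from collections import Counter


def scaled_shannon_entropy(scaffolds, n_top=15):
    """SSE over the n_top most populated scaffolds.
    0 means all compounds share one scaffold; 1 means an even distribution."""
    counts = [c for _, c in Counter(scaffolds).most_common(n_top)]
    n = len(counts)
    if n < 2:
        return 0.0
    total = sum(counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts)
    return entropy / math.log2(n)  # scale by the maximum possible entropy


def csr_curve(scaffolds):
    """Points (fraction of scaffolds, cumulative fraction of compounds),
    with scaffolds ordered from most to least populated."""
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), sum(counts)
    points, cumulative = [], 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        points.append((rank / n_scaffolds, cumulative / n_compounds))
    return points


even_sse = scaled_shannon_entropy(["a", "b", "c", "d"])
single_sse = scaled_shannon_entropy(["a", "a", "a"])
toy_curve = csr_curve(["a", "a", "a", "b"])
```

A CSR curve that rises steeply means that a few scaffolds account for most compounds (low diversity), whereas a curve close to the diagonal indicates high scaffold diversity.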

Also, we analyzed ring systems to determine the presence of “Magic Rings” in NAPROC-13 and the reference databases. A ring system consists of a cyclic system without any bridge connecting to another cyclic ring, preserving exocyclic double bonds.82 This approach has proven to be highly efficient for the analysis of chemical structures.83 Ring systems were computed using the implementation of RingSystemFinder in the library useful_rdkit_utils.84 This algorithm identifies and protects exocyclic double bonds connected to rings, cleaves single bonds with the FragmentOnBonds RDKit function, and returns the cyclic fragments after the addition of hydrogen atoms to cleavage sites.

The REtrosynthetic Combinatorial Analysis Procedure (RECAP) algorithm,27 implemented in the RDKit package, was used to generate the molecular fragments in the studied databases. The RECAP algorithm fragments molecular structures around bonds formed by common chemical reactions. Recognized cleavable bonds for the RECAP algorithm are amide, ester, amine, urea, ether, olefin, quaternary nitrogen, aromatic carbon, and sulfonamide. These rules are applied to acyclic bonds, while rings are maintained intact. In the present work, we compared the content of fragments of NPs lighter than 1350 g/mol in NAPROC-13 (21 226 compounds) with those generated from NPs lighter than 1000 g/mol in UNPD-A (14 733 compounds) and FDA-approved drugs lighter than 2000 g/mol (2317 compounds). MW filtering was applied to reduce the computational cost, and the cutoffs were set according to the computation time, which depends on molecular complexity. The newly generated fragment library is freely available at https://github.com/DIFACQUIM/naproc13_characterization. Finally, molecular descriptors related to the rule of three (RO3) were computed using the Datamol Python library (version 0.12.3).60,85
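The fragmentation and RO3 filtering steps can be sketched with RDKit as shown below. The RO3 check uses RDKit descriptors directly instead of Datamol, and applies the four core criteria of Congreve et al. (MW ≤ 300, HBD ≤ 3, HBA ≤ 3, LogP ≤ 3); whether the rotatable bond and polar surface area criteria were also applied is not stated in the text, so they are left optional. The function names are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Recap, rdMolDescriptors


def recap_fragments(smiles):
    """Return the set of terminal RECAP fragments (leaves) as SMILES.
    Dummy atoms ('*') mark the points of disconnection."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return set()
    return set(Recap.RecapDecompose(mol).GetLeaves().keys())


def passes_ro3(smiles, extended=False):
    """Rule-of-three check; 'extended' adds the RB <= 3 and TPSA <= 60 criteria."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    ok = (
        Descriptors.MolWt(mol) <= 300
        and rdMolDescriptors.CalcNumHBD(mol) <= 3
        and rdMolDescriptors.CalcNumHBA(mol) <= 3
        and Crippen.MolLogP(mol) <= 3
    )
    if extended:
        ok = ok and (
            rdMolDescriptors.CalcNumRotatableBonds(mol) <= 3
            and rdMolDescriptors.CalcTPSA(mol) <= 60
        )
    return ok


# Example molecule containing an aromatic ether and an ester.
leaves = recap_fragments("c1ccccc1OCCOC(=O)CC")
ro3_leaves = {frag for frag in leaves if passes_ro3(frag)}
```

Counting how often each leaf appears across a database, and then keeping only the RO3-compliant ones, gives fragment frequency tables of the kind summarized in Figure 8.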

Chemical Multiverse Visualization

The chemical multiverses of the three databases, NAPROC-13, UNPD-A, and FDA-approved drugs, were compared against the chemical space of NPs, using COCONUT as an approximation to this space. A chemical multiverse is a group of alternative or “parallel” chemical spaces of a set of compounds, each defined by a distinct set of molecular descriptors.29 Each chemical space is an M-dimensional Cartesian space, and each dimension represents one of the descriptors or features encoding a molecule. The length of the descriptor set defines the number of dimensions of each chemical space. Dimensionality reduction for chemical space visualization was achieved using t-distributed stochastic neighbor embedding (t-SNE) applied to the bit-based fingerprints previously computed (vide supra). t-SNE is a nonlinear method that models pairwise similarities in the low-dimensional map with a Student t-distribution, in contrast to the linear projection used by principal component analysis (PCA). This approach allows t-SNE to display a wider distribution of points along the graph.86 A PCA using the most significant physicochemical and constitutional descriptors was also computed. PCA and t-SNE variable reductions were done using the library Scikit-Learn (version 1.4.1).87
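A minimal sketch of the two projections with Scikit-Learn is shown below, using random placeholder arrays in place of the real descriptor and fingerprint matrices. Scaling the descriptors before PCA and fixing the random seed are assumptions, as the text does not specify them. The iteration argument of `TSNE` is named `n_iter` in Scikit-Learn 1.4 and `max_iter` in later releases, so the sketch selects whichever the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder data: 100 compounds x 6 descriptors (HBA, HBD, MW, TPSA, RB, LogP)
descriptors = rng.normal(size=(100, 6))
# Placeholder data: 100 compounds x 1024 fingerprint bits (stand-in for ECFP4)
fingerprint_bits = rng.integers(0, 2, size=(100, 1024)).astype(float)

# PCA on standardized descriptors (standardization is an assumption).
scaled = StandardScaler().fit_transform(descriptors)
pca_coords = PCA(n_components=2).fit_transform(scaled)

# t-SNE with the parameters reported in the text; learning rate left at default.
tsne_params = inspect.signature(TSNE.__init__).parameters
iter_arg = "max_iter" if "max_iter" in tsne_params else "n_iter"
tsne = TSNE(n_components=2, perplexity=30, random_state=0, **{iter_arg: 5000})
tsne_coords = tsne.fit_transform(fingerprint_bits)
```

Each row of `pca_coords` and `tsne_coords` gives the two-dimensional coordinates of one compound, which can then be colored by database of origin to produce plots like those in Figure 9.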

Natural Product Likeness Score

The Natural Product Likeness (NPL) score65 is a machine-learning-based score that quantifies how similar a compound is to the structural chemical space covered by NPs. This approach efficiently distinguishes NPs from synthetic molecules. It has been broadly used to classify compounds of natural origin2,42 and food components39 and to validate the NPL of machine learning-based generative models.88 The score ranges between −5 (for compounds of probable synthetic origin) and 5 (for compounds similar to an NP). In the present work, we computed the NPL score for the compounds in NAPROC-13, as well as UNPD-A and FDA-approved drugs.
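The idea behind a Bayesian fragment-based score of this kind can be illustrated with a toy sketch. Each fragment contributes the logarithm of how much more often it occurs in a reference set of NPs than in a reference set of synthetic molecules, and the contributions are summed and normalized by molecule size. This is not the reference implementation or its trained model: the fragment names, the counts, the pseudocount, and the function names are all invented for illustration.

```python
import math


def fragment_contribution(np_count, sm_count, np_total, sm_total, pseudo=1.0):
    """log of the relative frequency of a fragment in NPs versus synthetic
    molecules. The pseudocount (an assumption) avoids division by zero."""
    ratio = (np_count + pseudo) / (sm_count + pseudo)
    return math.log(ratio * (sm_total / np_total))


def npl_like_score(fragments, table, np_total, sm_total, n_atoms):
    """Sum of fragment contributions, normalized by the number of atoms."""
    total = 0.0
    for frag in fragments:
        np_count, sm_count = table.get(frag, (0, 0))  # unseen fragments: zero counts
        total += fragment_contribution(np_count, sm_count, np_total, sm_total)
    return total / n_atoms


# Toy training table: fragment -> (count in NP set, count in synthetic set).
toy_table = {"lactone": (900, 100), "sulfonamide": (10, 800)}
np_like = npl_like_score(["lactone"], toy_table, 1000, 1000, n_atoms=20)
synthetic_like = npl_like_score(["sulfonamide"], toy_table, 1000, 1000, n_atoms=20)
```

In this toy setting, a molecule made of fragments enriched in the NP set receives a positive score and one made of fragments enriched in the synthetic set receives a negative score, mirroring the interpretation of the sign given in the text.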

Acknowledgments

J.F.A.-T. is thankful to Consejo Nacional de Humanidades, Ciencias y Tecnologías (CONAHCyT), Mexico, for the Postgraduate scholarship with number 1270553. N.A.A.-M. is thankful to Universidad Tecnológica de Panamá (UTP) for the Economic Subsidy contract no. FCYT-MCIM-001-2021 under the Educational Collaboration Agreement no. 006-2021, with Secretaría Nacional de Ciencias, Tecnología e Innovación (SENACYT), Panamá. J.E.S-G. is part of the Sistema Nacional de Investigación (SNI) of the Secretaría Nacional de Ciencia, Tecnología e Innovación de Panamá (SENACYT), Panama. Helpful discussions with Raziel Cedillo-González, Diana L. Prado-Romero, Fernanda I. Saldívar-González, Ana L. Chávez-Hernández, Felipe Victoria-Muñoz, Samuel Homberg, and Dionisio Olmedo are greatly acknowledged. We acknowledge the support of DGAPA, UNAM, Programa de Apoyo a Proyectos de Investigación e Innovación Tecnológica (PAPIIT), grant no. IG200124, to cover the cost of MOE’s academic license. We also are thankful for the computational resources involving the Miztli supercomputer at UNAM under the project LANCAD-UNAM-DGTIC-335.

Data Availability Statement

Data sets and codes are freely available at https://github.com/DIFACQUIM/naproc13_characterization.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jnatprod.4c00530.

  • Details of the comprehensive chemoinformatic characterization as well as machine learning approach analysis of NAPROC-13 versus FDA-approved drugs and UNPD-A, and data visualization (PDF)

The authors declare no competing financial interest.

References

  1. Saldívar-González F. I.; Aldas-Bulos V. D.; Medina-Franco J. L.; Plisson F. Natural Product Drug Discovery in the Artificial Intelligence Era. Chem. Sci. 2022, 13 (6), 1526–1546. 10.1039/D1SC04471K. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Chávez-Hernández A. L.; Medina-Franco J. L. Natural Products Subsets: Generation and Characterization. Artif. Intell. Life Sci. 2023, 3, 100066. 10.1016/j.ailsci.2023.100066. [DOI] [Google Scholar]
  3. Rodrigues T.; Reker D.; Schneider P.; Schneider G. Counting on Natural Products for Drug Design. Nat. Chem. 2016, 8 (6), 531–541. 10.1038/nchem.2479. [DOI] [PubMed] [Google Scholar]
  4. Saldívar-González F. I.; Valli M.; Andricopulo A. D.; da Silva Bolzani V.; Medina-Franco J. L. Chemical Space and Diversity of the NuBBE Database: A Chemoinformatic Characterization. J. Chem. Inf. Model. 2019, 59 (1), 74–85. 10.1021/acs.jcim.8b00619. [DOI] [PubMed] [Google Scholar]
  5. Newman D. J.; Cragg G. M. Natural Products as Sources of New Drugs over the Nearly Four Decades from 01/1981 to 09/2019. J. Nat. Prod. 2020, 83 (3), 770–803. 10.1021/acs.jnatprod.9b01285. [DOI] [PubMed] [Google Scholar]
  6. Afendi F. M.; Okada T.; Yamazaki M.; Hirai-Morita A.; Nakamura Y.; Nakamura K.; Ikeda S.; Takahashi H.; et al. Altaf-Ul-Amin; Darusman, L. K.; Saito, K.; Kanaya, S. KNApSAcK Family Databases: Integrated Metabolite–Plant Species Databases for Multifaceted Plant Research. Plant Cell Physiol. 2012, 53 (2), e1. 10.1093/pcp/pcr165. [DOI] [PubMed] [Google Scholar]
  7. Chen Y.; de Bruyn Kops C.; Kirchmair J. Data Resources for the Computer-Guided Discovery of Bioactive Natural Products. J. Chem. Inf. Model. 2017, 57 (9), 2099–2111. 10.1021/acs.jcim.7b00341. [DOI] [PubMed] [Google Scholar]
  8. Sorokina M.; Merseburger P.; Rajan K.; Yirik M. A.; Steinbeck C. COCONUT Online: Collection of Open Natural Products Database. J. Cheminform. 2021, 13 (1), 2. 10.1186/s13321-020-00478-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Katz L.; Baltz R. H. Natural Product Discovery: Past, Present, and Future. J. Ind. Microbiol. Biotechnol. 2016, 43 (2–3), 155–176. 10.1007/s10295-015-1723-5. [DOI] [PubMed] [Google Scholar]
  10. Atanasov A. G.; Zotchev S. B.; Dirsch V. M.; Supuran C. T. Natural Products in Drug Discovery: Advances and Opportunities. Nat. Rev. Drug Discovery 2021, 20 (3), 200–216. 10.1038/s41573-020-00114-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Marzorati S.; Verotta L.; Trasatti S. P. Green Corrosion Inhibitors from Natural Sources and Biomass Wastes. Molecules 2019, 24 (1), 48. 10.3390/molecules24010048. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Li F.; Wang Y.; Li D.; Chen Y.; Dou Q. P. Are We Seeing a Resurgence in the Use of Natural Products for New Drug Discovery?. Expert Opin. Drug Discovery 2019, 14 (5), 417–420. 10.1080/17460441.2019.1582639. [DOI] [PubMed] [Google Scholar]
  13. Basu A.; Prasad P.; Das S. N.; Kalam S.; Sayyed R. Z.; Reddy M. S.; El Enshasy H. Plant Growth Promoting Rhizobacteria (PGPR) as Green Bioinoculants: Recent Developments, Constraints, and Prospects. Sustainability 2021, 13 (3), 1140. 10.3390/su13031140. [DOI] [Google Scholar]
  14. Kumar J.; Ramlal A.; Mallick D.; Mishra V. An Overview of Some Biopesticides and Their Importance in Plant Protection for Commercial Acceptance. Plants 2021, 10 (6), 1185. 10.3390/plants10061185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Dias D. A.; Urban S.; Roessner U. A Historical Overview of Natural Products in Drug Discovery. Metabolites 2012, 2 (2), 303–336. 10.3390/metabo2020303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Gaudêncio S. P.; Pereira F. Dereplication: Racing to Speed up the Natural Products Discovery Process. Nat. Prod. Rep. 2015, 32 (6), 779–810. 10.1039/C4NP00134F. [DOI] [PubMed] [Google Scholar]
  17. Gaudêncio S. P.; Bayram E.; Lukić Bilela L.; Cueto M.; Díaz-Marrero A. R.; Haznedaroglu B. Z.; Jimenez C.; Mandalakis M.; Pereira F.; Reyes F.; Tasdemir D. Advanced Methods for Natural Products Discovery: Bioactivity Screening, Dereplication, Metabolomics Profiling, Genomic Sequencing, Databases and Informatic Tools, and Structure Elucidation. Mar. Drugs 2023, 21 (5), 308. 10.3390/md21050308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Sorokina M.; Steinbeck C. Review on Natural Products Databases: Where to Find Data in 2020. J. Cheminform. 2020, 12 (1), 20. 10.1186/s13321-020-00424-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Johnson S. R.; Lange B. M. Open-Access Metabolomics Databases for Natural Product Research: Present Capabilities and Future Potential. Front Bioeng Biotechnol 2015, 3, 22. 10.3389/fbioe.2015.00022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. López-Pérez J. L.; Therón R.; del Olmo E.; Díaz D. NAPROC-13: A Database for the Dereplication of Natural Product Mixtures in Bioassay-Guided Protocols. Bioinformatics 2007, 23 (23), 3256–3257. 10.1093/bioinformatics/btm516. [DOI] [PubMed] [Google Scholar]
  21. Guerrero De León E.; Sánchez-Martínez H.; Morán-Pinzón J. A.; Del Olmo Fernández E.; López-Pérez J. L. Computational Structural Revision of Elaeophorbate and Other Triterpenoids with the Help of NAPROC-13. A New Strategy for Structural Revision of Natural Products. J. Nat. Prod. 2023, 86 (4), 897–908. 10.1021/acs.jnatprod.2c01135. [DOI] [PubMed] [Google Scholar]
  22. Ren F.-C.; Wang L.-X.; Lv Y.-F.; Hu J.-M.; Zhou J. Structure Revision of Four Classes of Prenylated Aromatic Natural Products Based on a Rule for Diagnostic 13C NMR Chemical Shifts. J. Org. Chem. 2021, 86 (16), 10982–10990. 10.1021/acs.joc.0c02409. [DOI] [PubMed] [Google Scholar]
  23. Shen S.-M.; Appendino G.; Guo Y.-W. Pitfalls in the Structural Elucidation of Small Molecules. A Critical Analysis of a Decade of Structural Misassignments of Marine Natural Products. Nat. Prod. Rep. 2022, 39 (9), 1803–1832. 10.1039/D2NP00023G. [DOI] [PubMed] [Google Scholar]
  24. Dimensions Badges. https://badge.dimensions.ai/details/id/pub.1034075226/citations (accessed 2024-01-29).
  25. Sánchez-Martínez H. A.; Morán-Pinzón J. A.; Del Olmo Fernández E.; Eguiluz D. L.; Adserias Vistué J. F.; López-Pérez J. L.; De León E. G. Synergistic Combination of NAPROC-13 and NMR 13C DFT Calculations: A Powerful Approach for Revising the Structure of Natural Products. J. Nat. Prod. 2023, 86 (10), 2294–2303. 10.1021/acs.jnatprod.3c00437. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Bemis G. W.; Murcko M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39 (15), 2887–2893. 10.1021/jm9602928. [DOI] [PubMed] [Google Scholar]
  27. Lewell X. Q.; Judd D. B.; Watson S. P.; Hann M. M. RECAP-Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry. J. Chem. Inf. Comput. Sci. 1998, 38 (3), 511–522. 10.1021/ci970429i. [DOI] [PubMed] [Google Scholar]
  28. Ertl P. Magic Rings: Navigation in the Ring Chemical Space Guided by the Bioactive Rings. J. Chem. Inf. Model. 2022, 62 (9), 2164–2170. 10.1021/acs.jcim.1c00761. [DOI] [PubMed] [Google Scholar]
  29. Medina-Franco J. L.; Chávez-Hernández A. L.; López-López E.; Saldívar-González F. I. Chemical Multiverse: An Expanded View of Chemical Space. Mol. Inform. 2022, 41 (11), 2200116. 10.1002/minf.202200116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Karageorgis G.; Foley D. J.; Laraia L.; Waldmann H. Principle and Design of Pseudo-Natural Products. Nat. Chem. 2020, 12 (3), 227–235. 10.1038/s41557-019-0411-x. [DOI] [PubMed] [Google Scholar]
  31. Méndez-Lucio O.; Medina-Franco J. L. The Many Roles of Molecular Complexity in Drug Discovery. Drug Discovery Today 2017, 22 (1), 120–126. 10.1016/j.drudis.2016.08.009. [DOI] [PubMed] [Google Scholar]
  32. Clemons P. A.; Bodycombe N. E.; Carrinski H. A.; Wilson J. A.; Shamji A. F.; Wagner B. K.; Koehler A. N.; Schreiber S. L. Small Molecules of Different Origins Have Distinct Distributions of Structural Complexity That Correlate with Protein-Binding Profiles. Proc. Natl. Acad. Sci. U. S. A. 2010, 107 (44), 18787–18792. 10.1073/pnas.1012741107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Lipinski C. A.; Lombardo F.; Dominy B. W.; Feeney P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Delivery Rev. 1997, 23 (1–3), 3–25. 10.1016/S0169-409X(96)00423-1. [DOI] [PubMed] [Google Scholar]
  34. Veber D. F.; Johnson S. R.; Cheng H.-Y.; Smith B. R.; Ward K. W.; Kopple K. D. Molecular Properties That Influence the Oral Bioavailability of Drug Candidates. J. Med. Chem. 2002, 45 (12), 2615–2623. 10.1021/jm020017n. [DOI] [PubMed] [Google Scholar]
  35. Lovering F.; Bikker J.; Humblet C. Escape from Flatland: Increasing Saturation as an Approach to Improving Clinical Success. J. Med. Chem. 2009, 52 (21), 6752–6756. 10.1021/jm901241e. [DOI] [PubMed] [Google Scholar]
  36. Barone R.; Chanon M. A New and Simple Approach to Chemical Complexity. Application to the Synthesis of Natural Products. J. Chem. Inf. Comput. Sci. 2001, 41 (2), 269–272. 10.1021/ci000145p. [DOI] [PubMed] [Google Scholar]
  37. Meanwell N. A. Improving Drug Design: An Update on Recent Applications of Efficiency Metrics, Strategies for Replacing Problematic Elements, and Compounds in Nontraditional Drug Space. Chem. Res. Toxicol. 2016, 29 (4), 564–616. 10.1021/acs.chemrestox.6b00043. [DOI] [PubMed] [Google Scholar]
  38. Singh N.; Guha R.; Giulianotti M. A.; Pinilla C.; Houghten R. A.; Medina-Franco J. L. Chemoinformatic Analysis of Combinatorial Libraries, Drugs, Natural Products, and Molecular Libraries Small Molecule Repository. J. Chem. Inf. Model. 2009, 49 (4), 1010–1024. 10.1021/ci800426u. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Avellaneda-Tamayo J. F.; Chávez-Hernández A. L.; Prado-Romero D. L.; Medina-Franco J. L. Chemical Multiverse and Diversity of Food Chemicals. J. Chem. Inf. Model. 2024, 64 (4), 1229–1244. 10.1021/acs.jcim.3c01617. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Cedillo-González R.; Medina-Franco J. L. Diversity and Chemical Space Characterization of Inhibitors of the Epigenetic Target G9a: A Chemoinformatics Approach. ACS Omega 2023, 8 (33), 30694–30704. 10.1021/acsomega.3c04566. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Yongye A. B.; Waddell J.; Medina-Franco J. L. Molecular Scaffold Analysis of Natural Products Databases in the Public Domain. Chem. Biol. Drug Des. 2012, 80 (5), 717–724. 10.1111/cbdd.12011. [DOI] [PubMed] [Google Scholar]
  42. Gómez-García A.; Acuña Jiménez D. A.; Zamora W. J.; Barazorda-Ccahuana H. L.; Chávez-Fumagalli M. Á.; Valli M.; Andricopulo A. D.; da Silva Bolzani V.; Olmedo D. A.; Solís P. N.; Núñez M. J.; Rodríguez Pérez J. R.; Valencia Sánchez H. A.; Cortés Hernández H. F.; Medina-Franco J. L. Navigating the Chemical Space and Chemical Multiverse of a Unified Latin American Natural Product Database: LANaPDB. Pharmaceuticals 2023, 16 (10), 1388. 10.3390/ph16101388. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Ertl P.; Jelfs S.; Mühlbacher J.; Schuffenhauer A.; Selzer P. Quest for the Rings. In Silico Exploration of Ring Universe to Identify Novel Bioactive Heteroaromatic Scaffolds. J. Med. Chem. 2006, 49 (15), 4568–4573. 10.1021/jm060217p. [DOI] [PubMed] [Google Scholar]
  44. Ertl P. Database of Bioactive Ring Systems with Calculated Properties and Its Use in Bioisosteric Design and Scaffold Hopping. Bioorg. Med. Chem. 2012, 20 (18), 5436–5442. 10.1016/j.bmc.2012.02.058. [DOI] [PubMed] [Google Scholar]
  45. Taylor R. D.; MacCoss M.; Lawson A. D. G. Rings in Drugs. J. Med. Chem. 2014, 57 (14), 5845–5859. 10.1021/jm4017625. [DOI] [PubMed] [Google Scholar]
  46. Shearer J.; Castro J. L.; Lawson A. D. G.; MacCoss M.; Taylor R. D. Rings in Clinical Trials and Drugs: Present and Future. J. Med. Chem. 2022, 65 (13), 8699–8712. 10.1021/acs.jmedchem.2c00473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Aldeghi M.; Malhotra S.; Selwood D. L.; Chan A. W. E. Two- and Three-Dimensional Rings in Drugs. Chem. Biol. Drug Des. 2014, 83 (4), 450–461. 10.1111/cbdd.12260. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Chen Y.; Rosenkranz C.; Hirte S.; Kirchmair J. Ring Systems in Natural Products: Structural Diversity, Physicochemical Properties, and Coverage by Synthetic Compounds. Nat. Prod. Rep. 2022, 39 (8), 1544–1556. 10.1039/D2NP00001F. [DOI] [PubMed] [Google Scholar]
  49. Dias M. C.; Pinto D. C. G. A.; Silva A. M. S. Plant Flavonoids: Chemical Characteristics and Biological Activity. Molecules 2021, 26 (17), 5377. 10.3390/molecules26175377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Martínez C. A.; Mosquera O. M.; Niño J. Apigenin Glycoside: An Antioxidant Isolated from Alchornea Coelophylla Pax & K. Hoffm. (euphorbiaceae) Leaf Extract. Universitas Scientiarum 2016, 21 (3), 245–257. 10.11144/Javeriana.SC21-3.agaa. [DOI] [Google Scholar]
  51. Dembitsky V. M. Biological Activity and Structural Diversity of Steroids Containing Aromatic Rings, Phosphate Groups, or Halogen Atoms. Molecules 2023, 28 (14), 5549. 10.3390/molecules28145549. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Simoben C. V.; Babiaka S. B.; Moumbock A. F. A.; Namba-Nzanguim C. T.; Eni D. B.; Medina-Franco J. L.; Günther S.; Ntie-Kang F.; Sippl W. Challenges in Natural Product-Based Drug Discovery Assisted with in Silico-Based Methods. RSC Adv. 2023, 13 (45), 31578–31594. 10.1039/D3RA06831E. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Chen Y.; Garcia de Lomana M.; Friedrich N.-O.; Kirchmair J. Characterization of the Chemical Space of Known and Readily Obtainable Natural Products. J. Chem. Inf. Model. 2018, 58 (8), 1518–1532. 10.1021/acs.jcim.8b00302. [DOI] [PubMed] [Google Scholar]
  54. Bolton J. L.; Dunlap T. Formation and Biological Targets of Quinones: Cytotoxic versus Cytoprotective Effects. Chem. Res. Toxicol. 2017, 30 (1), 13–37. 10.1021/acs.chemrestox.6b00256. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Hopkins A. L.; Groom C. R.; Alex A. Ligand Efficiency: A Useful Metric for Lead Selection. Drug Discovery Today 2004, 9 (10), 430–431. 10.1016/S1359-6446(04)03069-7. [DOI] [PubMed] [Google Scholar]
  56. Prado-Romero D. L.; Gómez-García A.; Cedillo-González R.; Villegas-Quintero H.; Avellaneda-Tamayo J. F.; López-López E.; Saldívar-González F. I.; Chávez-Hernández A. L.; Medina-Franco J. L. Consensus Docking Aid to Model the Activity of an Inhibitor of DNA Methyltransferase 1 Inspired by de Novo Design. Front. Drug Discovery 2023, 3, 1261094. 10.3389/fddsv.2023.1261094. [DOI] [Google Scholar]
  57. Saldívar-González F. I.; Huerta-García C. S.; Medina-Franco J. L. Chemoinformatics-Based Enumeration of Chemical Libraries: A Tutorial. J. Cheminform. 2020, 12 (1), 64. 10.1186/s13321-020-00466-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Saldívar-González F. I.; Navarrete-Vázquez G.; Medina-Franco J. L. Design of a Multi-Target Focused Library for Antidiabetic Targets Using a Comprehensive Set of Chemical Transformation Rules. Front. Pharmacol. 2023, 14, 1276444. 10.3389/fphar.2023.1276444. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Congreve M.; Carr R.; Murray C.; Jhoti H. A “Rule of Three” for Fragment-Based Lead Discovery?. Drug Discovery Today 2003, 8 (19), 876–877. 10.1016/S1359-6446(03)02831-9. [DOI] [PubMed] [Google Scholar]
  60. Jhoti H.; Williams G.; Rees D. C.; Murray C. W. The “Rule of Three” for Fragment-Based Drug Discovery: Where Are We Now?. Nat. Rev. Drug Discovery 2013, 12 (8), 644–645. 10.1038/nrd3926-c1. [DOI] [PubMed] [Google Scholar]
  61. Hann M. M. Molecular Obesity, Potency and Other Addictions in Drug Discovery. Med. Chem. Commun. 2011, 2 (5), 349–355. 10.1039/C1MD00017A.
  62. Bon M.; Bilsland A.; Bower J.; McAulay K. Fragment-Based Drug Discovery-the Importance of High-Quality Molecule Libraries. Mol. Oncol. 2022, 16 (21), 3761–3777. 10.1002/1878-0261.13277.
  63. Chávez-Hernández A. L.; Sánchez-Cruz N.; Medina-Franco J. L. A Fragment Library of Natural Products and Its Comparative Chemoinformatic Characterization. Mol. Inform. 2020, 39 (11), 2000050. 10.1002/minf.202000050.
  64. López-Pérez K.; Avellaneda-Tamayo J. F.; Chen L.; López-López E.; Juárez-Mercado K. E.; Medina-Franco J. L.; Miranda-Quintana R. Molecular Similarity: Theory, Applications, and Perspectives. Artificial Intelligence Chemistry 2024, 100077. 10.1016/j.aichem.2024.100077.
  65. Ertl P.; Roggo S.; Schuffenhauer A. Natural Product-Likeness Score and Its Application for Prioritization of Compound Libraries. J. Chem. Inf. Model. 2008, 48 (1), 68–74. 10.1021/ci700286x.
  66. Gómez-García A.; Prinz A.-K.; Acuña Jiménez D. A.; Zamora W. J.; Barazorda-Ccahuana H. L.; Chávez-Fumagalli M. Á.; Valli M.; Andricopulo A. D.; da Silva Bolzani V.; Olmedo D. A.; Solís P. N.; Núñez M. J.; Rodríguez Pérez J. R.; Valencia Sánchez H. A.; Cortés Hernández H. F.; Mosquera Martinez O. M.; Koch O.; Medina-Franco J. L. Updating and Profiling the Natural Product-Likeness of Latin American Compound Libraries. Mol. Inform. 2024, 43 (7), e202400052. 10.1002/minf.202400052.
  67. DrugBank online. https://go.drugbank.com (accessed 2024-01-23).
  68. Wishart D. S.; Feunang Y. D.; Guo A. C.; Lo E. J.; Marcu A.; Grant J. R.; Sajed T.; Johnson D.; Li C.; Sayeeda Z.; Assempour N.; Iynkkaran I.; Liu Y.; Maciejewski A.; Gale N.; Wilson A.; Chin L.; Cummings R.; Le D.; Pon A.; Knox C.; Wilson M. DrugBank 5.0: A Major Update to the DrugBank Database for 2018. Nucleic Acids Res. 2018, 46 (D1), D1074–D1082. 10.1093/nar/gkx1037.
  69. Gu J.; Gui Y.; Chen L.; Yuan G.; Lu H.-Z.; Xu X. Use of Natural Products as Chemical Library for Drug Discovery and Network Pharmacology. PLoS One 2013, 8 (4), e62839. 10.1371/journal.pone.0062839.
  70. Weininger D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28 (1), 31–36. 10.1021/ci00057a005.
  71. Landrum G. RDKit. https://www.rdkit.org/ (accessed 2024-01-23).
  72. MolVS: Molecule Validation and Standardization — MolVS 0.1.1 documentation. https://molvs.readthedocs.io/en/latest/ (accessed 2024-01-23).
  73. Sánchez-Cruz N.; Pilón-Jiménez B. A.; Medina-Franco J. L. Functional Group and Diversity Analysis of BIOFACQUIM: A Mexican Natural Product Database. F1000Res. 2019, 8, 2071. 10.12688/f1000research.21540.1.
  74. Molecular Operating Environment (MOE), 2022.02; Chemical Computing Group ULC: 910-1010 Sherbrooke St. W., Montreal, QC H3A 2R7, Canada, 2023.
  75. Durant J. L.; Leland B. A.; Henry D. R.; Nourse J. G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42 (6), 1273–1280. 10.1021/ci010132r.
  76. Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50 (5), 742–754. 10.1021/ci100050t.
  77. Capecchi A.; Probst D.; Reymond J.-L. One Molecular Fingerprint to Rule Them All: Drugs, Biomolecules, and the Metabolome. J. Cheminform. 2020, 12 (1), 43. 10.1186/s13321-020-00445-4.
  78. Orsi M.; Reymond J.-L. One Chiral Fingerprint to Find Them All. J. Cheminform. 2024, 16, 53. 10.1186/s13321-024-00849-6.
  79. Agrafiotis D. K. A Constant Time Algorithm for Estimating the Diversity of Large Chemical Libraries. J. Chem. Inf. Comput. Sci. 2001, 41 (1), 159–167. 10.1021/ci000091j.
  80. Langdon S. R.; Brown N.; Blagg J. Scaffold Diversity of Exemplified Medicinal Chemistry Space. J. Chem. Inf. Model. 2011, 51 (9), 2174–2185. 10.1021/ci2001428.
  81. Medina-Franco J.; Martínez-Mayorga K.; Bender A.; Scior T. Scaffold Diversity Analysis of Compound Data Sets Using an Entropy-Based Measure. QSAR Comb. Sci. 2009, 28 (11–12), 1551–1560. 10.1002/qsar.200960069.
  82. Xu Y.-J.; Johnson M. Using Molecular Equivalence Numbers to Visually Explore Structural Features that Distinguish Chemical Libraries. J. Chem. Inf. Comput. Sci. 2002, 42 (4), 912–926. 10.1021/ci025535l.
  83. Read R. C.; Corneil D. G. The Graph Isomorphism Disease. J. Graph Theory 1977, 1 (4), 339–363. 10.1002/jgt.3190010410.
  84. Walters P. Mining ring systems in molecules for fun and profit. https://practicalcheminformatics.blogspot.com/2022/12/identifying-ring-systems-in-molecules.html (accessed 2024-03-12).
  85. Datamol: Molecular Processing Made Easy; GitHub. https://github.com/datamol-io/datamol (accessed 2024-04-15).
  86. Van der Maaten L.; Hinton G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9 (86), 2579–2605.
  87. Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; Vanderplas J.; Passos A.; Cournapeau D.; Brucher M.; Perrot M.; Duchesnay É. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  88. Tay D. W. P.; Yeo N. Z. X.; Adaikkappan K.; Lim Y. H.; Ang S. J. 67 Million Natural Product-like Compound Database Generated via Molecular Language Processing. Sci. Data 2023, 10 (1), 296. 10.1038/s41597-023-02207-x.

Associated Data


Supplementary Materials

np4c00530_si_001.pdf (1.7MB, pdf)

Data Availability Statement

Data sets and code are freely available at https://github.com/DIFACQUIM/naproc13_characterization.


Articles from Journal of Natural Products are provided here courtesy of American Chemical Society
