Abstract
Natural products (NPs) or their derivatives represent a large proportion of drugs that successfully progress through clinical trials to approval. This study explores the presence of NPs in both early- and late-stage drug discovery to determine their success rate, and the factors or features of natural products that contribute to such success. As a proxy for early drug development stages, we analyzed patent applications over several decades, finding a consistent proportion of NP, NP-derived, and synthetic-compound-based patent documents, with the latter group outnumbering NP and NP-derived ones (approximately 77% vs 23%). We next assessed clinical trial data, where we observed a steady increase in NP and NP-derived compounds from clinical trial phases I to III (from approximately 35% in phase I to 45% in phase III), with an inverse trend observed in synthetics (from approximately 65% in phase I to 55% in phase III). Finally, in vitro and in silico toxicity studies revealed that NPs and their derivatives were less toxic alternatives to their synthetic counterparts. These discoveries offer valuable insights for successful NP-based drug development, highlighting the potential benefits of prioritizing NPs and their derivatives as starting points.
Natural products (NPs) are chemical compounds naturally synthesized by organisms, such as plants, bacteria, animals, and fungi. Historically, traditional extracts from NPs have been a cornerstone of medical practices, serving as remedies for various ailments and disorders for thousands of years. In the contemporary era, the scope has expanded to include isolated bioactive compounds (i.e., compounds with potential health effects versus general biological activity), marking a significant evolution in medicinal approaches. While recent years have seen a surge in interest in NP-based drug discovery driven by technological advancements in high-throughput screening (HTS), modern tandem mass spectrometry, machine learning, and other technologies,1−3 their study has not been without challenges. Specifically, it can be difficult to identify specific compounds responsible for bioactivity, rapidly prioritize tractable chemical leads, and ensure supply of required starting material for drug manufacture. The latter can be especially problematic when considering that the concentrations of bioactive compounds can vary widely within an organism depending on environmental conditions, such as changing seasons, which can complicate the detection of these compounds.4 Similarly, the purification, isolation, and identification of minute quantities of bioactive compounds from complex mixtures of NPs with structurally similar molecules is technically complicated.5,6
The broader drug R&D process has its own challenges, with the path to approval of a new drug marked by unsustainably low success rates. To pass clinical trials, a drug must demonstrate both safety and efficacy, and the vast majority of clinical candidates fail, primarily due to a lack of clinical efficacy in the context of manageable clinical toxicity, likely precipitated by poor drug-like properties and/or target selection, among other reasons.7 Nonetheless, NPs have played a pivotal role as a primary source of new drugs, with nearly half of approved drugs between 1981 and 2019 tracing their origins back to unaltered NPs, derivatives of NPs, or compounds with pharmacophores resembling NPs.8 By contrast, of the approved drugs during that time, only a quarter were purely synthetic small-molecule compounds, with the remaining consisting of vaccine adjuvants and biologics, such as antibody-drug conjugates (ADCs) and peptides.
Several factors could plausibly contribute to the case for natural products. Drugs based on NP structures exhibit a higher degree of chemical diversity in both molecular structures and physicochemical properties, occupying a broader chemical space than their synthetic counterparts.9−12 Furthermore, through evolutionary pressure, NPs are intrinsically endowed with ligand–protein binding motifs which serve a broad range of biological functions. As a result, NPs can be inherently validated for biological relevance, often featuring molecular scaffolds and pharmacophore patterns that are especially selective to cellular targets.13−15 Additionally, a large proportion of NPs have drug-like properties, including desirable Absorption, Distribution, Metabolism, and Excretion (ADME) properties.16 Despite their impressive potential, only a fraction of the vast chemical landscape of NPs has been tapped for drug discovery purposes.17
Since previous work has shown that a large proportion of approved drugs are NP or NP-like compounds,8 we investigated whether such compounds have greater success rates in the clinical development process. A greater “survival rate” of NP or NP-like compounds would provide one explanation for this apparent divergence. Therefore, in this work, we evaluated whether NP or NP-like small molecules have a higher likelihood of success through the stages of clinical trials. Given that information on preclinical data is not readily accessible, we used patent applications as a proxy for compounds in the initial stages of drug discovery to evaluate the proportion of NP, NP-like, and synthetic compounds. Finally, we compare NPs to synthetic compounds through the lens of clinical toxicity data to determine whether NPs have better clinical outcomes due to decreased toxicity. We find that NP and NP-like compounds have a greater likelihood of success in the clinic, making them potentially superior starting points for drug discovery.
Results
Natural Products Constitute a Minority of All Patent Applications
Patenting is the first step toward preserving the rights to an innovation, and it is the most common strategy used by pharmaceutical companies to safeguard the rights to a drug candidate prior to initiating clinical trials. As preclinical and early stage drug discovery data are not readily available, we use patent applications as a proxy to investigate which classes of compounds represent potential starting material for clinical trials. To do so, we looked at the proportions of NPs, hybrids (see Experimental Section for definition), and synthetics over several decades (i.e., 1976–2022) on a data set of over 1 million patent applications. These patent compounds were extracted by SureChEMBL18 through image and text mining workflows and, where they exist in other resources, cross-referenced with public databases like ChEMBL and PubChem to ensure comprehensive mapping and integration.
It is critical to note that numerous natural products cannot be patented because various patent laws, including the laws in the U.S. and Europe, stipulate that products of nature, laws of nature, and natural phenomena are not patentable. This includes cases like the relationship between metabolite concentrations in blood and medical predictions (Mayo v. Prometheus) or a naturally occurring DNA segment in the form of the BRCA2 gene (AMP v. Myriad Genetics). However, it is possible to patent NPs that are “markedly different” from their naturally occurring counterparts (i.e., NP-derivatives), or if the claim includes “significantly more” than what is found in nature. For example, a non-naturally occurring analogue of a NP small molecule is patent eligible. Furthermore, cDNA is also patent eligible as it is not naturally occurring. Also, methods of treatment, which involve human intervention and are not directed to a natural relationship itself, are patentable in many jurisdictions, including the U.S. Thus, because the pharmaceutical industry will only advance molecules into clinical development that can be patented through such modifications or processes, we believe that this measure still captures a majority of successful high-quality clinical candidates.
From this data set, we found that NPs account for a small proportion of the compounds found in patent applications until 2022 (approximately 8%, ∼1.5 million patent compounds), compared to synthetic (approximately 77%, ∼15.3 million patent compounds) and hybrid compounds (approximately 15%, ∼2.8 million patent compounds) (Figure 1A). Additionally, we looked at the total number of patent applications filed per year and found that the proportion of patent applications has remained relatively consistent over time for each of the three classes (i.e., NP, hybrid, and synthetic compounds) (Figure 1B). As expected, synthetic and hybrid compounds consistently had a higher annual patent filing rate compared to NPs. Notably, synthetic compounds consistently exhibited between 6 and 8 times more patent filings than NP or hybrid compounds throughout the entire time period. Apart from patent eligibility, this might reflect the greater usage of synthetic compounds and their associated technologies (such as the erstwhile Combi-Chem or modern DNA-encoded Library methods) to generate novel chemical substrates.
Figure 1.
(A) Percentage of patent compounds per class (Hybrid, NP, and synthetic). (B) Percentage of patent applications of each class by year.
Proportion of Natural Products or Natural Product-Like Compounds Increases through Clinical Trial Phases
NP and NP-like compounds constitute a significant proportion of FDA-approved drugs. Based on our classification of NP-likeness (see the Experimental Section), approximately 25% (1149/4749) of approved drugs are NPs, while hybrids make up 20% (895/4749) (see Figure S1 of the Supporting Information for details on specific data sets of approved drugs and Figure S2 for results from a proprietary data set). In this section, we investigated whether the proportion of compounds that are NPs, hybrids, or synthetics changes across the phases of clinical trials (i.e., phases I, II, and III).
We find that the proportion of compounds from NPs and hybrids significantly increases as compounds progress through clinical investigation, contrasting with a steady decline in the proportion of synthetic compounds (Figure 2). For instance, 65% (3085/4749) of all compounds entering phase I were synthetics, while in phase III, this number decreased to 55.5% (1863/3356) (Figure 2B). By contrast, the proportion of NPs increased from ∼20% (940/4749) in phase I to ∼26% in phase III (860/3356), and the proportion of hybrids from ∼15% (724/4749) to ∼19% (632/3356). NPs and hybrids constituted around 45% of compounds in phase III, aligning with proportions observed in approved drugs. Similarly, when exclusively looking at NP-likeness across the three phases (without considering the categories), we found a significant shift in the distribution of NP-likeness scores as compounds moved toward late clinical stages (Figure S3). Lastly, we evaluated whether different chemical representation approaches, such as using the 14 characters of the InChIKeys or canonical SMILES, or not implementing our deduplication strategy, which relies upon high Tanimoto similarity, influenced our findings. In all cases, we found that there was a consistently slight positive trend for NPs or NP-like compounds through the different clinical trial phases (Figure S4). We were further able to corroborate these findings in an independent, proprietary data set, underscoring the validity of these findings.
Figure 2.
(A) Proportion of NP, synthetic, and hybrid compounds in all clinical trials compared to FDA-approved drugs. (B) Proportion of NP, synthetic, and hybrid compounds across the phases of clinical trials (data from www.ClinicalTrials.gov). Compounds are stratified by their latest phase (i.e., phase I/II are categorized as phase II and phase II/III are categorized as phase III).
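The phase I and phase III percentages quoted above can be recomputed directly from the reported counts. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper name `class_percentages` are illustrative choices and not part of the published workflow.

```python
# Per-class compound counts as reported in the text for phases I and III.
counts = {
    "phase I": {"synthetic": 3085, "NP": 940, "hybrid": 724},
    "phase III": {"synthetic": 1863, "NP": 860, "hybrid": 632},
}


def class_percentages(class_counts, total):
    """Return each class as a percentage of `total`, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in class_counts.items()}


# Denominators are the totals quoted in the text (4749 and 3356).
phase_1 = class_percentages(counts["phase I"], total=4749)
phase_3 = class_percentages(counts["phase III"], total=3356)

for label, result in (("phase I", phase_1), ("phase III", phase_3)):
    print(label, result)
```

Summing the NP and hybrid shares in phase III gives roughly 44%, which is the "around 45%" figure cited in the text.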
Next, we investigated which NP structural classes were enriched between phase I and approved drugs. We found that terpenoids exhibited a notable 20% relative increase, while fatty acids and alkaloids demonstrated increases of 7% and 6%, respectively (Table S1). In contrast, carbohydrates were found to have a relative decrease in abundance by 8%, while amino acids decreased by 22% when comparing compounds in phase I with approved drugs. In examining NP superclasses,19 we found that β-lactams and peptide alkaloids were among the most enriched ones (Table S2), suggesting that drugs belonging to these NP superclasses had a lower failure rate.
In Vitro and in Silico Approaches Indicate That Natural Products Tend To Be Less Toxic Than Synthetics
Major reasons for attrition of drug candidates in clinical trials include compound toxicity, poor PK/PD, and economic conditions.7 To determine whether the dropout rate for potential drug candidates differs among NP, hybrid, and synthetic compounds, we assessed experimental data for three different classes of tissue and target-specific toxicity. Additionally, since we could only analyze three toxicity data sets with a small sample size (see Experimental Section), we also leveraged an accurate in silico approach20 to predict 32 different types of toxicities for hundreds of thousands of NPs and synthetic compounds.
When looking at the experimental data, we found that NPs were more likely to be nontoxic than synthetics (Figure 3). Specifically, we compared NP and synthetic compounds across CYP450 toxicity, hepatotoxicity, and carcinogenicity. Inhibition of the Cytochrome (CYP) P450 superfamily may lead to decreased elimination and/or changes in the metabolic pathways of their substrates, which is one of the main causes of adverse drug–drug interactions.21 Our data suggests that NPs exhibit lower CYP450 toxicity (p-value <0.001). Similarly, hepatocytes are known to play a role in drug metabolism, and the toxicity of drugs to hepatocytes is the most common reason for their attrition or withdrawal.22 Here again, we found that NPs were significantly less likely to demonstrate hepatotoxicity than synthetic compounds (p-value <0.001). Although NPs were associated with lower CYP450 toxicity and hepatotoxicity, no significant differences were found in terms of carcinogenicity between NP and synthetic compounds.
Figure 3.
Experimental tissue and target-specific toxicity overview across NP, synthetic, and hybrid compounds. For each of the three toxicity classes (i.e., CYP450 toxicity, hepatotoxicity, and carcinogenicity), the percentage of toxic compounds for each group (i.e., hybrids, NPs, or synthetics) is shown.
To complement our results, we leveraged two comprehensive data sets containing hundreds of thousands of NPs and synthetic compounds to represent the chemical space covered by the two and predicted 32 different toxicity end points using a set of top-performing machine learning models called FP-ADMET.20 Similarly, we found higher predicted toxicity in synthetic compounds than NPs (25/32 end points exhibited a significantly lower toxicity for NPs (p-value <0.001)) (Figure 4 and Figure S5). Furthermore, we observed similar results using instead a representative set of compounds clustered by chemical properties (Figure S6). Lastly, we also verified that both COCONUT and Enamine Discovery Diversity Set (DDS) cover a similar chemical space with the training data used by FP-ADMET (Figure S7). Overall, our data suggests that the success of NPs in clinical trials, when compared to synthetics, could partly be due to their lower toxicity profile.
Figure 4.
Proportion of toxic compounds for each of the 32 toxicity end points in FP-ADMET. The predictions are made on COCONUT and Enamine DDS for NPs and synthetics, respectively. As the figure shows, DDS (synthetic) bars (orange) are generally higher than those for COCONUT (NPs).
Discussion and Conclusions
Of the compounds that make it to clinical trials, over 90% do not become commercial drugs.7 Such a high attrition rate at the clinical stage incurs considerable costs in both time and finances. About half of compounds that successfully progress through clinical trials and are subsequently approved as drugs are NPs or hybrids. These high proportions also remain for small molecules approved over the last 2 decades, despite a widely acknowledged decrease in screening of NP extracts or collections for hit finding exercises. This warrants further examination into whether NP and hybrid compounds advance through clinical trials at greater rates compared to synthetic compounds. In this work, we have endeavored to shed light on this question, and whether certain chemical properties or features such as NP-like or hybrid compounds position them as superior starting candidates for drug development.
We divide the results into three different sections based on various aspects involved in progressing a drug candidate through the drug discovery pipeline, specifically the patenting stage, early clinical stage, and all clinical trial stages. First, we examined patent applications and found that the number of patent documents for synthetic compounds was nearly 8 times higher than that for NPs or hybrids. As previously mentioned, NPs are more difficult to patent due to legal challenges in obtaining intellectual property rights for unmodified NPs. Nonetheless, our findings highlighted a discrepancy in the proportion of NPs and synthetics between patent applications and clinical trials, with synthetics exhibiting a much higher prevalence in patent applications, and the two groups exhibiting a far narrower gap in clinical trials, thus suggesting that NPs are under-represented in early stages of drug discovery.
Second, to assess early clinical stages of drug discovery, we investigated the in vitro and in silico toxicity of NP, synthetic, and hybrid compounds. In two out of the three in vitro toxicity studies (hepatotoxicity and CYP450), NPs appear to be less toxic compared to synthetics. One possible explanation for this observation could be the presence of ethnobotanical priors (i.e., plants with known uses in treating certain indications by specific cultures) with documented hepatoprotective properties among the set of investigated NPs.22,23 As an example, ethnomedicinal data from Egypt cites a compound from Silybum marianum known for its effectiveness in treating liver pathologies.24,25 Similarly, traditional Chinese medicine demonstrates the hepatoprotective activity of puerarin, a compound derived from the root of Pueraria lobata.26 Such historical uses of medicinal plants suggest a preselection for safety and efficacy, possibly contributing to the lower toxicity profiles observed in NP and hybrid compounds in our study. Consistent with these in vitro results, our in silico approaches also predicted higher toxicity in synthetics compared to NPs.
Finally, from the clinical trial perspective, we studied the relative proportions of compounds in the three classes across phases I to III as well as approved drugs. Here, we found a steady increase of NP and hybrid compounds from phases I to III when compared to synthetics. Additionally, we validated this finding with a commercial database (Figure S2). The finding that NP and hybrid compounds perform better in clinical trials than synthetic ones provides valuable insights for future drug development efforts.
As our analysis relies on various data sets that represent the different stages of drug development, we aimed to leverage the most complete publicly available data sets. For instance, we relied on raw data from www.ClinicalTrials.gov as a proxy for clinical data. Unfortunately, the current lack of an established nomenclature in clinical trials hampers the mapping of intervention names to chemical structures. To enhance efficiency and accuracy, we employed an automated workflow to map clinical interventions to structures, avoiding the manual mapping of hundreds of thousands of interventions, which would be time-intensive and error-prone. Despite our efforts to manually review a substantial number of data points in the data set to minimize errors, achieving full curation quickly becomes impractical. Consequently, despite the quality control and manual curation, the data set may still contain some entries with inaccuracies. To address this potential shortcoming, we validated our findings with an external commercial data set. For approved drugs, we combined multiple resources to compensate for the lack of a complete and updated list. In the absence of comprehensive preclinical data, we used patent applications as a proxy, though this may not fully capture the proportion of NPs. Due to limited experimental toxicity data sets, such as ClinTox27 which is limited to slightly more than 100 compounds, we used machine learning models to validate our results by generating in silico data for thousands of molecules. Lastly, we would like to note that clinical trials may be discontinued for a variety of reasons, including economic considerations, rather than solely due to toxicity or a lack of clinical efficacy.
In conclusion, this study highlights the advantageous progression of NPs and their derivatives through the preclinical and clinical stages of drug development. Despite a lower initial number of NPs entering clinical trials compared to synthetics, the proportion of NPs that achieve regulatory approval is substantial, highlighting their significant contribution to the set of FDA-approved molecules. This trend could be explained by lower observed and predicted toxicity compared to synthetics, ultimately supporting the idea that prioritizing NP-based compounds could contribute to increased rates of clinical trial success in drug development.
Experimental Section
Classifying Small Molecules Based on NP-Likeness
The growing interest in NPs has spurred researchers to develop a scoring system known as NP-likeness. This scoring system, introduced by Ertl et al.,28 aims to identify compounds that exhibit similarities to those found in the NP space. The NP-likeness score spans a range from −5 to 5, where a higher score indicates a greater likelihood of a molecule being a NP. For more details about the methodology of the NP-likeness scorer, we refer to the original publication.28 Based on NP-likeness scores, we classify compounds in our analysis into three classes: synthetics, NPs, and hybrids. To assign compounds to each class, we used as a reference the distribution of NP-likeness scores reported by Sorokina and Steinbeck.29 This study compared the distribution of NP-likeness across a variety of NP databases and synthetic libraries containing hundreds of thousands of NP and synthetic molecules. Based on the overlap of this distribution of NP-likeness, and how it can separate the two groups based on their provenance, we defined synthetic compounds as those with NP-likeness scores below 0, NPs as those with scores exceeding 0.6, and hybrids as those with scores falling between 0 and 0.6 (where both distributions overlap). The rationale behind this cutoff is that a negative NP-likeness score (i.e., using NP-likeness = 0) can be used to accurately distinguish between NPs and synthetics.28,29 However, the distributions of NPs and synthetics still partially overlap between 0 and 0.6, as some synthetics are derived from NPs. Thus, we defined this overlapping space as hybrids.
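The three-way assignment described above reduces to two cutoffs on the NP-likeness score. A minimal sketch follows, assuming the score has already been computed; the function name is hypothetical, and placing scores of exactly 0 and 0.6 in the hybrid class is our reading of "below 0" and "exceeding 0.6".

```python
def classify_np_likeness(score: float) -> str:
    """Assign a compound class from an NP-likeness score (range -5 to 5).

    Cutoffs follow the text: synthetic below 0, NP above 0.6,
    hybrid in the overlapping region between 0 and 0.6.
    """
    if score < 0:
        return "synthetic"
    if score > 0.6:
        return "NP"
    # Assumption: boundary values 0 and 0.6 are treated as hybrids.
    return "hybrid"


for example_score in (-1.2, 0.0, 0.3, 0.6, 2.3):
    print(example_score, classify_np_likeness(example_score))
```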
Mining of Clinical Trial Data
We obtained data from www.ClinicalTrials.gov through the AACT database (https://aact.ctti-clinicaltrials.org; accessed on October 31, 2023). For each clinical study, we extracted its National Clinical Trial (NCT) Identifier Number, the phase of the study (e.g., clinical phases I, II, or III), the intervention type (e.g., biological, device, drug, etc.), and the intervention name. First, we identified studies in which the intervention type was classified as either “drug” or “other”, as both categories encompass all the studies that tested small molecules (Table S3). From this subset, we chose 168,634 studies that were at least in phase I, encompassing a total of 110,305 unique intervention names.
Following this, we proceeded to identify the studies using a small molecule. Given the absence of structural information in www.ClinicalTrials.gov and the impracticality of manually curating all corresponding names, we employed PubChem’s API to automatically map intervention names to their respective PubChem entries. This process successfully mapped 35,599 out of the 110,305 interventions (Figure 5).
Figure 5.
Annotation pipeline for compounds from clinical trials. Created with BioRender.com.
Of the remaining 74,706 interventions that were not mapped, we identified radiation therapies, biologics (e.g., antibodies), dietary supplements, and other types of therapies that were misclassified within their categories. Additionally, we also observed over 7500 interventions corresponding to the placebo arm of the study (i.e., intervention names containing the word “placebo”) (Figure 5).
To maximize the interventions used in our study, we ran BioBERT30 (v1.2) and MetaboListem31 on the interventions that were not mapped to PubChem compounds in order to annotate potential compounds (Figure 5). Similarly, we ran prompts on ChatGPT (model gpt-4-1106-preview) to identify small molecules and clean up their names. We then meticulously inspected the output of the model to ensure that no errors were introduced, affirming the accuracy of the curated chemical names (Table S4 contains examples of the process). Next, we again used PubChem’s API to automatically map the digitally curated names to their respective PubChem entries and successfully mapped 20,860 additional interventions. Summing these mapped interventions with the previous ones, we were able to map a total of 56,459 interventions to 6950 unique small molecules (Figure 5).
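The name-to-structure lookup used in both mapping rounds can be illustrated with PubChem's PUG REST name endpoint. The sketch below is not the workflow from our repository: the helper names are hypothetical, only the standard library is used, and it assumes that an unmatched name yields an HTTP error, which is treated as "not mapped".

```python
import json
import urllib.error
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name: str) -> str:
    """Build the PUG REST URL that resolves a compound name to PubChem CIDs."""
    encoded = urllib.parse.quote(name.strip(), safe="")
    return f"{PUG_REST}/compound/name/{encoded}/cids/JSON"


def parse_cids(payload: str) -> list:
    """Extract the CID list from a PUG REST JSON response (empty if absent)."""
    data = json.loads(payload)
    return data.get("IdentifierList", {}).get("CID", [])


def lookup_cids(name: str, timeout: float = 10.0) -> list:
    """Query PubChem for one intervention name; return [] when it is not found."""
    try:
        with urllib.request.urlopen(name_to_cids_url(name), timeout=timeout) as resp:
            return parse_cids(resp.read().decode("utf-8"))
    except urllib.error.HTTPError:
        # Assumption: an HTTP error here means the name could not be mapped.
        return []
```

In practice, a call such as `lookup_cids("aspirin")` would be wrapped in rate limiting and caching before being applied to tens of thousands of intervention names.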
Chemical Harmonization
We processed structures using RDKit32 (v2023.03.1). To ensure that the presented analysis exclusively focuses on small molecules and is not influenced by chemical entities used as supplements (e.g., magnesium), we filtered molecules containing three or fewer atoms from all data sets. Furthermore, in the case of salts, we considered the largest molecule present as the final structure. Additionally, as PubChem contains several entries for a given compound and we combined several data sets for approved drugs, we applied the following procedure to minimize the risk of duplicate structures influencing our results. For each pair of molecules, we calculated the Tanimoto coefficient using Morgan fingerprints (radius = 2, number of bits = 2048). If the Tanimoto coefficient was higher than 0.95, we randomly removed one of the molecules. To verify that the selection of one of these molecules over the other would not lead to significantly different NP-likeness scores, we confirmed that the difference in NP-likeness between the two molecules was marginal.
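The deduplication step can be sketched without any cheminformatics dependency by representing each fingerprint as the set of its on-bit indices. This is an illustration only: the actual analysis used RDKit Morgan fingerprints, and the sketch keeps the first molecule of each near-duplicate pair rather than removing one at random, to keep the example deterministic.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def deduplicate(fingerprints: dict, threshold: float = 0.95) -> list:
    """Return the names kept after dropping near-duplicates.

    A molecule is dropped when its similarity to an already kept molecule
    is higher than `threshold`. Pairwise comparison is O(n^2).
    """
    kept = []
    for name, fp in fingerprints.items():
        if all(tanimoto(fp, fingerprints[other]) <= threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical toy fingerprints (bit indices), not real molecules.
toy = {
    "mol_a": frozenset(range(100)),
    "mol_b": frozenset(range(1, 100)),    # shares 99 of 100 bits with mol_a
    "mol_c": frozenset(range(50, 150)),   # shares 50 of 150 bits with mol_a
}
kept_names = deduplicate(toy)
print(kept_names)
```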
Stratifying by Clinical Phase
To annotate each compound with corresponding clinical phases, we mapped each intervention name to its corresponding clinical study or studies (e.g., phases I and II if a compound has exclusively been tested in these two phases). To disambiguate for a minority of studies that comprised multiple phases (i.e., phase I/II studies and phase II/III studies), we treated phase I/II as phase II and phase II/III as phase III. Additionally, we confirmed that reassigning these phases to phases I and II, respectively, did not impact the results.
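The phase assignment rule can be written as a small lookup. The sketch below assumes phase labels have already been normalized to the strings shown, which are hypothetical and not the raw values stored in AACT; labels outside phases I to III are ignored.

```python
# Combined phases are promoted to the later phase, as described in the text.
PHASE_RANK = {"I": 1, "I/II": 2, "II": 2, "II/III": 3, "III": 3}
RANK_LABEL = {1: "I", 2: "II", 3: "III"}


def latest_phase(study_phases):
    """Return the most advanced phase (I, II, or III) reached by a compound.

    `study_phases` lists the phase label of every study testing the compound.
    Returns None when no label falls within phases I to III.
    """
    ranks = [PHASE_RANK[p] for p in study_phases if p in PHASE_RANK]
    return RANK_LABEL[max(ranks)] if ranks else None


print(latest_phase(["I", "I/II"]))    # phase I/II is counted as phase II
print(latest_phase(["I", "II/III"]))  # phase II/III is counted as phase III
```

The sensitivity check mentioned above (reassigning combined phases to the earlier phase) would correspond to changing the two combined entries in `PHASE_RANK` to 1 and 2.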
FDA-Approved Drugs
We employed three independent data sets listing FDA-approved drugs8,33 and FDA Approved Drug Products (Orange Book). Figure S8 displays the number of approved drugs in each data set and their overlap. For details about how the files were processed, we refer to our previous work.34
Toxicity Data
To compare the toxic effects of NPs versus synthetic compounds, we used a curated toxicological database called TOXicology Resources for Intelligent Computation (TOXRIC).35 This database contains 113,372 compounds distributed across 13 different toxicity categories from over five different data resources, including existing databases and publications. From these 13 categories, we used hepatotoxicity, CYP450, and carcinogenicity (see Table 1) to analyze the toxic effects of NPs on specific cell tissues, metabolism, and their long-term effects in humans, respectively. The remaining 10 categories either represented experimental validation on nonhuman species, contained a small sample size (i.e., fewer than 100 compounds), or captured environmental effects outside the scope of this research. Additionally, we performed a chi-squared test of the null hypothesis that NPs and synthetics are equally toxic. Following this, we performed an odds-ratio analysis to identify the directionality of the effect (i.e., whether compounds classified as NPs, based on the NP-likeness score, are less toxic).
Table 1. Toxicity Data Set Distribution Based on Our Criteria Defined by NP-Likeness.
toxicity class | synthetics | NPs | hybrids |
---|---|---|---|
hepatotoxic | 41.64% (1183) | 44.60% (1267) | 13.76% (391) |
CYP450 | 77.34% (12,555) | 12.96% (2103) | 9.70% (1575) |
carcinogenic | 62.20% (622) | 18.60% (186) | 19.20% (192) |
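The two statistical steps described above (a chi-squared test followed by an odds ratio) can be sketched for a single 2 × 2 table of toxic and nontoxic counts. The counts below are hypothetical placeholders, not values from TOXRIC, and the sketch uses the uncorrected Pearson statistic, which may differ from library defaults that apply a continuity correction.

```python
import math


def chi2_and_odds_ratio(np_toxic, np_nontoxic, syn_toxic, syn_nontoxic):
    """Pearson chi-squared test (1 degree of freedom) and odds ratio for a 2x2 table.

    An odds ratio below 1 means NPs have lower odds of being toxic
    than synthetics.
    """
    a, b, c, d = np_toxic, np_nontoxic, syn_toxic, syn_nontoxic
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Hypothetical counts for illustration only.
chi2, p_value, odds_ratio = chi2_and_odds_ratio(10, 20, 30, 40)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, OR={odds_ratio:.3f}")
```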
In addition to the available experimental toxicity data, we leveraged 407,270 known NPs from COCONUT36 and 50,240 synthetic compounds from the Enamine Discovery Diversity Set (DDS). We ran their SMILES through 32 state-of-the-art toxicity predictors,20 covering the categories listed in Table S5. In order to solely evaluate our hypothesis on accurate predictions, while avoiding discarding large amounts of data, we used the credibility score provided by the models. As detailed by the authors of FP-ADMET, the credibility measure (equal to the highest p-value of any one of the possible classifications being the true label) provides an indication of how good the training set is for classifying the given example. We chose a 0.5 threshold (i.e., only predictions with a credibility score higher than 0.5 were considered) (Figures S9 and S10), as it is a relatively low threshold that the authors of the paper reported not to affect the predictive performance of the model.20 Additionally, we confirmed our findings using a representative subset of each data set selected according to 210 chemical properties, to avoid biases caused by over-represented chemical classes (Supporting Information Text 1).
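The credibility filter described above amounts to discarding low-credibility predictions before computing the proportion of toxic compounds per end point. The following sketch assumes each prediction is a record with hypothetical `toxic` and `credibility` fields; the actual FP-ADMET output format may differ.

```python
def toxic_fraction(predictions, min_credibility=0.5):
    """Fraction of compounds predicted toxic among sufficiently credible predictions.

    Only predictions with credibility strictly higher than `min_credibility`
    are retained. Returns None if nothing passes the filter.
    """
    kept = [p for p in predictions if p["credibility"] > min_credibility]
    if not kept:
        return None
    return sum(1 for p in kept if p["toxic"]) / len(kept)


# Hypothetical predictions for one toxicity end point.
example_predictions = [
    {"toxic": True, "credibility": 0.9},
    {"toxic": False, "credibility": 0.7},
    {"toxic": True, "credibility": 0.4},   # dropped: below threshold
    {"toxic": False, "credibility": 0.5},  # dropped: not higher than 0.5
]
fraction = toxic_fraction(example_predictions)
print(fraction)
```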
Patent Data
To extract information on chemicals linked to patent applications, we made use of SureChEMBL,18 one of the largest patent literature resources in the life sciences domain. At its back end, this resource leverages various proprietary and open-source machine learning models to identify and link the chemical names or structures reported in patent applications. Here, we used the entire SureChEMBL (accessed on November 12, 2023) from 1976 to 2022, resulting in 19,752,953 unique compounds from hundreds of thousands of patent applications. Notably, SureChEMBL does not provide metadata for patent applications published before 2014, and therefore, data on publication dates for patent applications prior to that year are not available. The patent applications collected in this database include both filed and granted patents.
Implementation and Data Availability
We calculated natural product-likeness (NP-likeness)28 using the RDKit implementation32 (v2023.03.1). We predicted the chemical classes of each compound using NPClassifier.19 We ran the analyses using Jupyter notebooks available at https://github.com/enveda/np-clinical-trials using the corresponding versions of these and other libraries specified in this GitHub repository. Lastly, raw data and intermediate files, and the curation of clinical trial data necessary to reproduce the analyses, can be found at https://zenodo.org/records/10404954.
Acknowledgments
We would like to thank L. Richardson for her valuable suggestions and feedback. Furthermore, we thank the reviewers for their feedback and suggestions.
Data Availability Statement
The source code, raw and processed data, and Jupyter notebooks to reproduce the analyses are available at https://github.com/enveda/np-clinical-trials and https://zenodo.org/records/10404954.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jnatprod.4c00581.
Percentage of all NP pathways across clinical phases and approved drugs; percentage of the top 10 NP superclasses based on their relative increase between phase 1 and approved drugs; number of clinical trials based on the intervention type; examples of the input (intervention name) and output (cleaned name) after running ChatGPT/BioBERT/MetaboListem; toxicity end points (32) predicted by FP-ADMET; proportion of NP, synthetic, and hybrid small molecules among FDA-approved data sets; proportion of NP, synthetic, and hybrid small molecules among the proprietary clinical data set from Citeline (https://clinicalintelligence.citeline.com); distribution of NP-likeness scores across clinical trial phases; equivalent figure to Figure 2 in the main text without applying the deduplication strategy based on high Tanimoto similarity; predictions for each toxicity end point in FP-ADMET using COCONUT and Enamine DDS for NPs and synthetics, respectively; proportion of toxic compounds for each of the 32 toxicity end points in FP-ADMET, using a subset of compounds selected via clustering leveraging chemical properties (predictions are made on COCONUT and Enamine DDS for NPs and synthetics, respectively); chemical space overlap of the OCHEM selected data set (used to train the models) (gray), the COCONUT data set (green), and the DDS data set (orange); overlap between the DrugBank, FDA Orange Book, and Newman and Cragg data sets; distribution of the credibility scores from FP-ADMET for COCONUT and the DDS data set; distribution of the credibility scores from FP-ADMET for COCONUT and the DDS data set stratified by toxicity end point; text discussing the generation of a representative subset of COCONUT/DDS (PDF)
The authors declare the following competing financial interest(s): All authors were employees of Enveda Biosciences Inc. during the course of this work and have a real or potential ownership interest in Enveda Biosciences Inc.
Special Issue
Published as part of Journal of Natural Products virtual special issue “Natural Products Driven Medicinal Chemistry”.
References
- Atanasov A. G.; Zotchev S. B.; Dirsch V. M.; Supuran C. T. Natural products in drug discovery: Advances and opportunities. Nat. Rev. Drug Discovery 2021, 20 (3), 200–216. 10.1038/s41573-020-00114-z.
- Pal R.; Dai M.; Seleem M. N. High-throughput screening identifies a novel natural product-inspired scaffold capable of inhibiting Clostridioides difficile in vitro. Sci. Rep. 2021, 11 (1), 10913. 10.1038/s41598-021-90314-3.
- Shinde P.; Banerjee P.; Mandhare A. Marine natural products as source of new drugs: A patent review (2015–2018). Expert opinion on therapeutic patents 2019, 29 (4), 283–309. 10.1080/13543776.2019.1598972.
- Li J. W. H.; Vederas J. C. Drug discovery and natural products: end of an era or an endless frontier? Science 2009, 325 (5937), 161–165. 10.1126/science.1168243.
- Harvey A. L.; Edrada-Ebel R.; Quinn R. J. The re-emergence of natural products for drug discovery in the genomics era. Nat. Rev. Drug Discovery 2015, 14 (2), 111–129. 10.1038/nrd4510.
- Zhang Q. W.; Lin L. G.; Ye W. C. Techniques for extraction and isolation of natural products: A comprehensive review. Chinese medicine 2018, 13, 1–26. 10.1186/s13020-018-0177-x.
- Sun D.; Gao W.; Hu H.; Zhou S. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B 2022, 12 (7), 3049–3062. 10.1016/j.apsb.2022.02.002.
- Newman D. J.; Cragg G. M. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J. Nat. Prod. 2020, 83 (3), 770–803. 10.1021/acs.jnatprod.9b01285.
- Chen Y.; Garcia de Lomana M.; Friedrich N. O.; Kirchmair J. Characterization of the chemical space of known and readily obtainable natural products. J. Chem. Inf. Model. 2018, 58 (8), 1518–1532. 10.1021/acs.jcim.8b00302.
- Chen Y.; Kirchmair J. Cheminformatics in natural product-based drug discovery. Molecular Informatics 2020, 39 (12), 2000171. 10.1002/minf.202000171.
- Stratton C. F.; Newman D. J.; Tan D. S. Cheminformatic comparison of approved drugs from natural product versus synthetic origins. Bioorganic & medicinal chemistry letters 2015, 25 (21), 4802–4807. 10.1016/j.bmcl.2015.07.014.
- Tzvetkov N. T.; Kirilov K.; Matin M.; Atanasov A. G. Natural product drug discovery and drug design: two approaches shaping new pharmaceutical development. Nephrology Dialysis Transplantation 2024, 39, 375. 10.1093/ndt/gfad208.
- Grigalunas M.; Brakmann S.; Waldmann H. Chemical evolution of natural product structure. J. Am. Chem. Soc. 2022, 144 (8), 3314–3329. 10.1021/jacs.1c11270.
- Davison E. K.; Brimble M. A. Natural product derived privileged scaffolds in drug discovery. Curr. Opin. Chem. Biol. 2019, 52, 1–8. 10.1016/j.cbpa.2018.12.007.
- Rodrigues T.; Reker D.; Schneider P.; Schneider G. Counting on natural products for drug design. Nature Chem. 2016, 8 (6), 531–541. 10.1038/nchem.2479.
- Gu J.; Gui Y.; Chen L.; Yuan G.; Lu H. Z.; Xu X. Use of natural products as chemical library for drug discovery and network pharmacology. PloS one 2013, 8 (4), e62839. 10.1371/journal.pone.0062839.
- Atanasov A. G.; Waltenberger B.; Pferschy-Wenzig E. M.; Linder T.; Wawrosch C.; Uhrin P.; et al. Discovery and resupply of pharmacologically active plant-derived natural products: A review. Biotechnology advances 2015, 33 (8), 1582–1614. 10.1016/j.biotechadv.2015.08.001.
- Papadatos G.; Davies M.; Dedman N.; Chambers J.; Gaulton A.; Siddle J.; et al. SureChEMBL: a large-scale, chemically annotated patent document database. Nucleic acids research 2016, 44 (D1), D1220–D1228. 10.1093/nar/gkv1253.
- Kim H. W.; Wang M.; Leber C. A.; Nothias L. F.; Reher R.; Kang K. B.; et al. NPClassifier: A deep neural network-based structural classification tool for natural products. J. Nat. Prod. 2021, 84 (11), 2795–2807. 10.1021/acs.jnatprod.1c00399.
- Venkatraman V. FP-ADMET: a compendium of fingerprint-based ADMET prediction models. Journal of cheminformatics 2021, 13 (1), 1–12. 10.1186/s13321-021-00557-5.
- Guengerich F. P. A history of the roles of cytochrome P450 enzymes in the toxicity of drugs. Toxicological research 2021, 37 (1), 1–23. 10.1007/s43188-020-00056-z.
- Singh D.; Cho W. C.; Upadhyay G. Drug-induced liver toxicity and prevention by herbal antioxidants: an overview. Frontiers in physiology 2016, 6, 363. 10.3389/fphys.2015.00363.
- Domingo-Fernandez D.; Gadiya Y.; Mubeen S.; Bollerman T. J.; Healy M. D.; Chanana S.; Sadovsky R. G.; Healey D.; Colluru V. Modern drug discovery using ethnobotany: A large-scale cross-cultural analysis of traditional medicine reveals common therapeutic uses. iScience 2023, 26, 107729. 10.1016/j.isci.2023.107729.
- Federico A.; Dallio M.; Loguercio C. Silymarin/silybin and chronic liver disease: a marriage of many years. Molecules 2017, 22 (2), 191. 10.3390/molecules22020191.
- Shaker E.; Mahmoud H.; Mnaa S. Silymarin, the antioxidant component and Silybum marianum extracts prevent liver damage. Food Chem. Toxicol. 2010, 48 (3), 803–806. 10.1016/j.fct.2009.12.011.
- Zhou Y. X.; Zhang H.; Peng C. Puerarin: a review of pharmacological effects. Phytotherapy Research 2014, 28 (7), 961–975. 10.1002/ptr.5083.
- Gayvert K. M.; Madhukar N. S.; Elemento O. A data-driven approach to predicting successes and failures of clinical trials. Cell chemical biology 2016, 23 (10), 1294–1301. 10.1016/j.chembiol.2016.07.023.
- Ertl P.; Roggo S.; Schuffenhauer A. Natural product-likeness score and its application for prioritization of compound libraries. J. Chem. Inf. Model. 2008, 48 (1), 68–74. 10.1021/ci700286x.
- Sorokina M.; Steinbeck C. NaPLeS: a natural products likeness scorer—web application and database. Journal of Cheminformatics 2019, 11 (1), 55. 10.1186/s13321-019-0378-z.
- Lee J.; Yoon W.; Kim S.; Kim D.; Kim S.; So C. H.; Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36 (4), 1234–1240. 10.1093/bioinformatics/btz682.
- Yeung C. S.; Beck T.; Posma J. M. MetaboListem and TABoLiSTM: two deep learning algorithms for metabolite named entity recognition. Metabolites 2022, 12 (4), 276. 10.3390/metabo12040276.
- Landrum G. RDKit: open-source cheminformatics. http://www.rdkit.org/ (2016). 10.5281/zenodo.7415128.
- Wishart D. S.; Feunang Y. D.; Guo A. C.; Lo E. J.; Marcu A.; Grant J. R.; et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic acids research 2018, 46 (D1), D1074–D1082. 10.1093/nar/gkx1037.
- Domingo-Fernández D.; Gadiya Y.; Mubeen S.; Healey D.; Norman B. H.; Colluru V. Exploring the known chemical space of the plant kingdom: insights into taxonomic patterns, knowledge gaps, and bioactive regions. Journal of Cheminformatics 2023, 15 (1), 107. 10.1186/s13321-023-00778-w.
- Wu L.; Yan B.; Han J.; Li R.; Xiao J.; He S.; Bo X. TOXRIC: a comprehensive database of toxicological data and benchmarks. Nucleic Acids Res. 2023, 51 (D1), D1432–D1445. 10.1093/nar/gkac1074.
- Sorokina M.; Merseburger P.; Rajan K.; Yirik M. A.; Steinbeck C. COCONUT online: collection of open natural products database. Journal of Cheminformatics 2021, 13 (1), 1–13. 10.1186/s13321-020-00478-9.