Abstract
Natural products (NPs) exhibit diverse chemical structures and biological activities that make them valuable sources for drug discovery. With advancements in computational technology, computation-enabled natural drug discovery is gaining increasing significance, with NP databases playing a pivotal role. In light of this, we first summarize the key features of NP databases, including structural data, property annotations, biological sources, biosynthetic pathways, and web interfaces. Subsequently, the wide applications of these databases in drug discovery, such as virtual screening, knowledge graph construction, and molecular generation, are reviewed. We further discuss the challenges of database development, focusing on data quality and updating. Finally, we emphasize the pivotal role of team collaboration and toolkit innovation in harnessing the immense potential of NP-related databases to accelerate bioactivity mining, structure modification, and manufacturing. This review aims to elucidate the key features and applications of NP databases, with the goal of aiding researchers in developing and maintaining high-quality NP databases for drug discovery.
Keywords: Natural products, Database, Drug discovery, Cheminformatics, Computer-aided drug design
Abbreviations
- ADMET
Absorption, Distribution, Metabolism, Elimination/Excretion, and Toxicity
- CSD
Cambridge Structural Database
- CYP
Cytochrome P450
- GNN
Graph Neural Network
- InChI
International Chemical Identifier
- MS
Mass Spectrometry
- ND
Natural product Derived drug
- NI
Natural product Inspired drug
- NMR
Nuclear Magnetic Resonance
- NP
Natural Product
- NPA
Active Natural Product
- NPD
Natural Product Drug
- NPO
Natural Products with biological source annotations
- NPS
Natural Products with biosynthetic pathway annotations
- OCSR
Optical Chemical Structure Recognition
- PDB
Protein Data Bank
- PLK1
Polo-Like Kinase 1
- RNN
Recurrent Neural Network
- S
Synthetic drug
- M-SYFSF
Modified Shen-Yan-Fang-Shuai formula
- SMILES
Simplified Molecular Input Line Entry System
- TCM
Traditional Chinese Medicine
- U.S. FDA
United States Food and Drug Administration
- VAE
Variational Autoencoders
1. Introduction
Natural products (NPs) are substances generated by living organisms and typically exclude large molecules such as proteins and nucleic acids. NPs have long been a significant source for drug discovery because of their diverse chemical structures and biological activities. Over the past four decades (1981–2019), more than half of the small-molecule drugs worldwide have originated directly or indirectly from NPs [1,2]. Penicillin, an antibiotic derived from microbial sources, has been used to synthesize a wide range of antibiotics. Artemisinin is an antimalarial drug extracted from Artemisia annua [3], and paclitaxel is an anticancer drug derived from Taxus brevifolia [4]. In the last decade (2014–2023), the U.S. FDA has approved 28 small-molecule drugs related to NPs, with nine directly sourced from NPs and 19 derived from the semi-synthesis of NPs [5]. Examples include tauroursodeoxycholic acid [6], which is derived from bear bile for anti-apoptotic purposes, and rezafungin [7], an antifungal drug semi-synthesized from echinocandin isolated from Aspergillus delacroxii.
Although natural-product-based drug discovery has immense potential, challenges still exist. First, more powerful tools are required for active compound mining. The COCONUT NP database contains over 400,000 molecules [8], whereas the NPASS database, which catalogs active NPs, includes fewer than 100,000 molecules [9], representing only about a quarter of the total number of NPs (Fig. 1A). As not all existing NPs exhibit biological activity, identifying molecules with specific activities or targets and predicting the activities of newly discovered molecules from the vast chemical space remain formidable challenges in the field. Second, even when NPs give positive results in activity tests, drug-likeness and off-target effects must still be considered. NPs typically have high molecular weight and poor solubility, which hinder their absorption and transport in the human body [10]. Moreover, they often possess numerous functional groups and multitarget binding capabilities, necessitating subsequent structural optimization to improve drug-likeness and specificity [11]. This may explain the relatively low proportion of drugs directly sourced from native NPs (Fig. 1B), as most drugs require modification and optimization (NI and ND in Fig. 1B) based on their original NP structures. The third challenge is the low abundance of most NPs in nature and the difficulty in synthesizing structurally complex products. NPs often exhibit complex ring systems and multiple chiral centers, making total synthesis lengthy and inefficient. Biosynthesis, in turn, is constrained by unknown pathways and the corresponding enzymes. As shown in Fig. 1C, only 30% of the molecules in COCONUT are annotated with known biological sources, and less than 7% are annotated with complete biosynthetic pathways according to structure matching with metabolic databases, such as KEGG [12] and MetaCyc [13].
Fig. 1.
The distribution of natural products under different classification criteria. (A) Distribution of active NPs (NPA) and NP drugs (NPD). (B) Distribution of categories of approved small-molecule drugs according to the data from Newman's work [1]. S, synthetic drugs; NI, natural-product-inspired drugs (corresponding to S∗ and NM of the original work); ND, natural-product-derived drugs; NP, natural products (corresponding to N and NB of the original work). (C) Distribution of biological sources of natural products. NPO, natural products with biological source annotations; NPS, natural products with biosynthetic pathway annotations.
With the advancement of big data and information technology, computation-enabled natural drug discovery has become increasingly significant, and NP databases play a key role in this regard. Both large-scale databases (such as COCONUT [8] and SuperNatural [14]) and specialized databases (such as activity-focused NPASS [9], terpenoid-focused TeroMOL [15], and the marine NP database CMNPD [16]) continue to evolve and improve. Several reviews have categorized existing NP databases [17–22]; others have provided a systematic analysis of the chemical space of multiple databases [23–25]. New databases are continually being built, yet a considerable number of existing ones are no longer updated or are even inaccessible [17]. Therefore, rather than redundantly listing existing databases, this paper summarizes the typical features (Fig. 2 and Table 1) and representative applications of NP databases and highlights their importance in accelerating drug discovery (Fig. 3). Additionally, we discuss the potential challenges and future prospects of NP databases with the primary aim of facilitating the development and maintenance of these databases, thereby better empowering natural drug discovery.
Fig. 2.
Typical features of natural products databases.
Table 1.
Feature summary of NP databases mentioned in this work.^a
| Database | Structures | Properties | Source | Biosynthesis |
|---|---|---|---|---|
| COCONUT (https://coconut.naturalproducts.net/) | Topological | Physicochemical, Biological | Yes | No |
| SuperNatural 3.0 (https://bioinf-applied.charite.de/supernatural_3/) | Topological | Physicochemical | No | No |
| GNPS (https://gnps.ucsd.edu/) | Topological, Spectral | No | No | No |
| NP-MRD (https://np-mrd.org/) | Topological, Spectral | No | Yes | No |
| NMRdata (http://www.nmrdata.com/) | Topological, Spectral | No | Yes | No |
| NPASS (https://bidd.group/NPASS/) | Topological | Biological | Yes | No |
| NPACT (http://crdd.osdd.net/raghava/npact/) | Topological | Biological | Yes | No |
| TeroKit (http://terokit.qmclab.com/) | Topological | Physicochemical, Biological | Yes | Yes |
| CMNPD (https://cmnpd.org/) | Topological | Physicochemical, Biological | Yes | No |
| NuBBE (https://nubbe.iq.unesp.br/) | Topological, Spectral | Physicochemical, Biological | Yes | No |
| NPcVar (http://npcvar.idrblab.net/) | Topological | Biological | Yes | No |
| TCMBank (https://tcmbank.cn/) | Topological | Physicochemical, Biological | Yes | No |
| MetaCyc (https://metacyc.org/) | Topological | Physicochemical | Yes | Yes |
| KEGG (https://www.genome.jp/kegg/) | Topological | No | Yes | Yes |
| DNP (https://dnp.chemnetbase.com/) | Topological | No | Yes | No |
^a Yes/No indicates whether the database contains the corresponding annotations.
Fig. 3.
Representative applications of natural products databases for drug discovery.
2. Key features of NP databases
As the significance of NPs in drug discovery continues to grow, the design and functionality of NP databases are evolving. These databases are required to not only provide comprehensive structural information but also integrate multidimensional data, such as the properties, biological sources, and biosynthesis pathways of NPs. This integration aims to facilitate in-depth pharmaceutical research. Most databases use web platforms to provide users with information. This section focuses on the key features of NP databases that collectively constitute the core value of these repositories.
2.1. Structure data
One of the most crucial pieces of information in a molecular database is the molecular structure, which is paramount for drug screening purposes. Almost all databases store molecular structure information in the form of simplified molecular input line entry system (SMILES) [26], commonly accompanied by an international chemical identifier (InChI) [27]. These representations condense two-dimensional topological structure information into one-dimensional strings, which explains their wide use in databases. Their string-like form also makes them suitable for natural language processing, which has recently been widely used in property prediction, molecular generation, and reaction prediction [28]. Moreover, text files such as SDF and MOL files are frequently used for sharing libraries of three-dimensional structure data; in such files, a series of molecules can be joined together and annotated with additional information. These files can be directly used as molecular libraries for virtual screening and chemical similarity analyses [29]. These structures can also be used as inputs for physics-based modeling such as quantum chemistry calculations and molecular dynamics simulations [30]. The string representations and text files can be mutually converted through common cheminformatics tools such as RDKit [31] and Open Babel [32]. InChIKey [33] serves as a hash representation of InChI, allowing for a one-way conversion from InChI to InChIKey without the possibility of reversal. Although InChIKey does not convey topological structure information, its fixed length of 27 characters makes it conducive to the swift matching and retrieval of structural information within databases.
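The role of InChIKey as a fixed-length lookup key can be illustrated with a minimal sketch. It assumes records are plain dictionaries with a hypothetical `inchikey` field; the function names are illustrative and do not come from any database's API. A standard InChIKey consists of a 14-letter block, a 10-letter block, and a single final letter, joined by hyphens.

```python
import re

# Layout of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_valid_inchikey(key: str) -> bool:
    """Return True if `key` has the layout of a standard InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(key) is not None


def build_index(records: list[dict]) -> dict[str, dict]:
    """Index records by InChIKey for constant-time retrieval, skipping malformed keys."""
    return {r["inchikey"]: r for r in records if is_valid_inchikey(r["inchikey"])}


# Hypothetical records; the aspirin key serves only as a familiar example.
records = [
    {"name": "aspirin", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"name": "malformed entry", "inchikey": "NOT-A-KEY"},
]
index = build_index(records)
print(index["BSYNRYMUTXBXSQ-UHFFFAOYSA-N"]["name"])
```

Because the key is a hash, the lookup confirms identity but cannot recover the structure; the SMILES or InChI string must be stored alongside it.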
In addition to textual representations, many online databases have adopted skeletal formula representations to enhance the visualization of molecular structures. For instance, the COCONUT database generates image formats of two-dimensional structures for each molecule, whereas SuperNatural 3.0 [14] employs the ChemDoodle [34] plugin, embedded in web pages for structure rendering. Spectral data, including nuclear magnetic resonance (NMR) and mass spectrometry (MS) data, offer crucial insights for structural identification and comparison, particularly for complex NPs. However, in contrast to the diversity of NP structural databases, only a few databases, including specialized databases such as GNPS [35], NP-MRD [36] and NMRdata [37] and the general-purpose database PubChem [38], store spectral data.
2.2. Property annotations
The structural features and properties of compounds play pivotal roles in determining their pharmacological efficacy during drug development. Consequently, many databases have annotated compound-related features and properties. With advancements in machine learning, these pieces of information are commonly termed descriptors. Common descriptors include parameters associated with drug-likeness rules such as Lipinski's rule of five [39], for example the number of hydrogen bond donors/acceptors, rotatable bonds, and lipophilicity, as well as those related to ADMET properties such as blood-brain barrier permeability and hepatotoxicity [40]. Cheminformatics tools such as RDKit can be used to calculate some of these descriptors, for example, those underlying Lipinski's rule of five. The prediction of ADMET properties is more challenging [41], and several computational tools with web interfaces have been developed for this purpose [42–44]. With the help of increasing data, computational approaches can achieve acceptable accuracy in predicting certain properties (such as hERG inhibitory activities) using 2D descriptors; however, not all endpoints are predicted accurately [45]. For example, the prediction of time-dependent CYP inhibition, which relies on information from 3D fragments, is difficult because the available data lag behind those for other endpoints. Moreover, despite the rapid development of computational models, we are still far from being able to replace experimental validation.
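As a minimal sketch of how such descriptors feed a drug-likeness filter, the following checks Lipinski's rule of five on precomputed values. It assumes the descriptors have already been calculated elsewhere (for instance with a cheminformatics toolkit); the `Descriptors` class, the function names, and the tolerance of one violation are illustrative choices, and the example values are approximate.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float       # molecular weight in g/mol
    logp: float             # octanol-water partition coefficient (lipophilicity)
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> list[str]:
    """List which of Lipinski's four criteria a compound violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500")
    if d.logp > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Common convention (an assumption here): tolerate at most one violation."""
    return len(rule_of_five_violations(d)) <= max_violations


# Approximate, illustrative descriptor values.
small_molecule = Descriptors(mol_weight=180.16, logp=1.2, h_bond_donors=1, h_bond_acceptors=4)
large_np = Descriptors(mol_weight=853.9, logp=3.0, h_bond_donors=4, h_bond_acceptors=14)

for name, d in [("small molecule", small_molecule), ("large NP", large_np)]:
    print(name, passes_rule_of_five(d), rule_of_five_violations(d))
```

Many NPs fall outside these thresholds yet remain valuable leads, so such a filter is better treated as a flag for follow-up than as a hard exclusion criterion.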
In addition to the aforementioned descriptors, another critical aspect directly relevant to drug discovery is the biological activity of the molecules. Original literature often elucidates the inhibitory, activating, or other relevant activities of compounds against specific targets, cells, or organisms. These activities are often associated with broader indications or functions, such as anticancer or antimicrobial effects, promoting the construction of knowledge graphs and contributing to activity prediction and drug repurposing. Databases collect activity data as annotations directly from literature sources. For instance, NPASS [9] currently encompasses over 950,000 activity records of approximately 100,000 compounds. NPACT [46] encompasses anticancer active molecules from literature sources, along with their corresponding cancer cell lines and activity values. Some databases align their structures with general-purpose activity databases to extract and annotate the corresponding biological activities. For example, TeroKit [47] and CMNPD [16] integrate target names, types, and other information extracted from the ChEMBL database [48].
2.3. Biological sources
In NP databases, annotations of biological sources of compounds serve as crucial metadata and provide valuable contextual information for researchers. These annotations typically include details about the organisms from which the compounds were isolated, such as plant species, microbial strains, or marine organisms. In addition, they may encompass information regarding the geographic location of the source organism, ecological habitats, and collection methods employed for isolation. For example, the NuBBE [49] database contains a variety of NPs isolated from Brazil and provides information on the species and their locations. NPcVar [50] is a database that describes the content variation of NPs, where the NP content data are associated with the species, parts of the organism, experimental details, and other factors. These annotations not only aid in understanding the biodiversity and ecological significance of NPs but also facilitate the discovery of novel bioactive molecules from untapped natural sources. Overall, the inclusion of comprehensive biological source annotations enhances the utility and interpretability of NP databases, thereby fostering advancements in NP research and drug discovery.
As a special category of NP databases, biological source information in traditional Chinese medicine (TCM) databases is more important and needs to be more detailed. These annotations encompass the medicinal herbs or herbal formulations from which the compounds are derived. They often include the botanical and traditional Chinese names of herbs as well as their corresponding English translations. Furthermore, annotations may provide insights into the historical and cultural significance of medicinal plants, including their traditional uses, preparation methods, and therapeutic indications in traditional Chinese medicine practice. Annotations related to TCM also extend to the pharmacological properties and therapeutic effects associated with specific herbs or herbal formulations. These include traditional indications for various health conditions, reported pharmacological activities, and potential mechanisms of action based on traditional knowledge and empirical evidence. These annotations serve as valuable resources for researchers and practitioners, facilitating the exploration of the pharmacological potential of TCM ingredients and aiding the development of evidence-based therapeutic interventions. Additionally, they contribute to the integration of traditional knowledge with modern drug discovery approaches, promoting the discovery of novel bioactive compounds and development of effective herbal medicines. The TCM Database@Taiwan [51] is the world's largest non-commercial TCM database available for download, making it an excellent molecular library for in silico drug screening. The newly released TCMBank is an extension of the TCM Database@Taiwan that integrates additional learning-based tools for data extraction and activity prediction [52]. 
Comprehensive annotations regarding traditional Chinese herbal medicine enrich the content of NP databases, bridging the gap between traditional wisdom and contemporary scientific inquiries in the pursuit of novel therapeutic agents.
2.4. Biosynthetic pathways
Genomic annotations of NP hosts can be used to elucidate the biosynthetic pathways of specific metabolites. In particular, resources for medicinal plants, such as the IMP platform [53], not only curate high-quality genomic and transcriptomic data for many plants but also integrate practical modules for gene annotation and analysis, facilitating the exploration of natural sources of valuable chemical constituents for applications such as drug discovery and drug production. Annotations of biosynthetic and metabolic pathways within NP databases offer direct insights into the biotransformation processes of bioactive compounds. These annotations encompass details regarding the biosynthetic origins of NPs, including the enzymes and precursors involved in their production in living organisms. This is important not only for NP biosynthesis but also for chemical synthesis because the chemical logic in enzymatic synthesis can be an inspiration for total chemical synthesis [54,55]. A typical database of this type is MetaCyc [13], where nearly 20,000 metabolites are annotated with a total of 19,000 reactions and 3000 pathways (accessed in March 2024). The enzymes and their corresponding organisms are also collected and linked to other databases such as UniProt [56]. These reactions provide valuable training data for retrobiosynthesis modeling [57], especially for NPs that are difficult to synthesize. Additionally, these annotations may provide information on the metabolic pathways responsible for the degradation or modification of these compounds. For example, KEGG [12] contains the metabolic pathways of various drugs, which contributes to a deeper understanding of compound metabolism, pharmacokinetics, and toxicity profiles and facilitates the optimization of compound properties to enhance metabolic stability and pharmacological efficacy.
2.5. Web interface
As computational tools, databases provided solely in specific file formats are no longer sufficient to meet the diverse needs of researchers. Currently, most databases rely on the corresponding web platforms to facilitate information access for users with varying backgrounds. Browsing and retrieving information are fundamental functionalities of database platforms, allowing users to sequentially peruse compounds according to various sorting criteria, and customize searches using basic structural and annotation information. Another advantage of online platforms lies in their interactivity with users, affording them the freedom to download the desired data selectively rather than the entire database, which can be burdensome, particularly for large database source files. Moreover, the web interface provides users with an opportunity to contribute to the database. Considering the challenge of data updating for database developers, the provision of interfaces for data uploads on platforms significantly alleviates the burden of database maintenance while also aiding in standardizing the requirements for data disclosure in academic research publications. Successful cases include the Cambridge Structural Database (CSD) [58] for small-molecule crystal structure data and the Protein Data Bank (PDB) [59] for biomolecular structure data. Furthermore, some databases integrate practical modules tailored to their unique characteristics, including data analysis, spectrum prediction, and target prediction within the platform. For example, NPAtlas [60] provides a web page for database analysis, including the year distribution of the number of isolated compounds and the number of compounds found in different genera. TeroKit [47] cross-links terpenoid-related molecules (TeroMOL) [15] and enzymes (TeroENZ) [61] according to metabolic reactions and integrates skeleton analysis and activity profiling modules into an online platform.
3. Applications in drug discovery
Drug discovery is a multifaceted, multistage endeavor. Compound databases, particularly those with multidimensional annotations, are pivotal in this process. This section discusses the uses of NP databases in drug discovery, including virtual screening, construction of knowledge graphs, molecular generation, and other pertinent applications.
3.1. Virtual screening
NP databases serve as repositories of the chemical structures, biological activities, and pharmacological properties of NPs, facilitating the identification of potential lead compounds for drug development. Using virtual screening techniques such as molecular docking [62], pharmacophore modeling [63], and similarity searching [64], researchers can efficiently explore the chemical space represented by NPs and predict their interactions with biological targets [65]. Zhou et al. [66] collected the SMILES of polo-like kinase 1 (PLK1) inhibitors and converted them into an SDF format to construct a pharmacophore model. NP-based data from CMNPD [16] were then obtained and screened to identify three small compounds from marine sources with the potential to inhibit PLK1. Juárez-Mercado et al. [67] identified SARS-CoV-2 protease inhibitors using ligand-based virtual screening based on similarity searching of many small-molecule databases, including the NPs database COCONUT [8].
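Ligand-based similarity searching of the kind mentioned above can be sketched with the Tanimoto coefficient on binary fingerprints. The sketch assumes each molecule has already been converted to a fingerprint, represented here as a set of "on" bit indices; the library entries, the identifiers, and the 0.5 threshold are hypothetical and chosen only for illustration.

```python
def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(
    query: frozenset[int],
    library: dict[str, frozenset[int]],
    threshold: float = 0.5,
) -> list[tuple[str, float]]:
    """Return library entries at or above the threshold, most similar first."""
    scored = ((name, tanimoto(query, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints (hypothetical bit indices) standing in for an NP library.
query = frozenset({1, 2, 3, 4})
library = {
    "np_001": frozenset({1, 2, 3, 4}),
    "np_002": frozenset({1, 2, 3, 9}),
    "np_003": frozenset({7, 8}),
}
print(similarity_search(query, library))
```

In practice the fingerprints would be generated by a cheminformatics toolkit, and the threshold is tuned to the fingerprint type and the screening goal.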
Recently, artificial intelligence has been widely used to predict compound-target interactions [68–70], thus simplifying and accelerating the progress of NP drug discovery [71,72]. A typical workflow encodes the information of a small molecule (sometimes together with protein information), feeds it into a machine learning model, and predicts whether the molecule is active against a specific target (classification task) or how strong its activity is (regression task). Liu et al. [73] trained a deep learning model using molecules from an anti-osteoclastogenesis dataset. The model was then used to predict anti-osteoclastogenic activity in an NP library, and five bioactive NPs with diverse structures were identified. Zhang et al. [74] established a database [75] of natural anti-inflammatory products and trained machine learning models to predict the anti-inflammatory activity of new NPs and the compound-target relationship of existing items. Overall, the integration of NP databases into virtual screening workflows enhances the efficiency and effectiveness of drug discovery by harnessing the structural diversity and bioactivity of natural compounds.
3.2. Knowledge graph construction
Whereas NP databases in virtual screening serve as screening libraries for a specific target, some databases with activity annotations can be used to build knowledge graphs, enabling the mining of active compounds and drug repurposing [76]. Target fishing is the process of identifying the targets of small molecules. Cockroft et al. [77] proposed STarFish by collecting NPs from available databases and identifying protein targets from ChEMBL [48], and then trained multilabel classification models to predict targets for a given compound. It was deployed via an application programming interface to facilitate the target identification of NPs. Qiang et al. [78] generated a compound-target dataset by integrating COCONUT [8] and ChEMBL and trained an NP target prediction model with transfer learning. The model successfully predicted the high-frequency targets of a certain number of approved drugs whose structures were derived from NPs. Through traditional methods such as network pharmacology [79], researchers can also elucidate the pharmacological properties and biological mechanisms underlying the actions of natural compounds. Gu et al. [80] constructed a compound-target network for NPs and analyzed its properties. After extracting target-related diseases, two compounds from the network were identified as potential drugs for bacterial infections and several cancers. TCM research involves complex diseases, syndromes, herbal formulae, and active molecules, which has resulted in a new generation of studies featuring networks [81]. Synergy between TCM and network pharmacology holds great promise for drug discovery and development. By leveraging the holistic perspective of TCM and the systematic analytical approach of network pharmacology, researchers can gain deeper insights into the multifaceted nature of herbal remedies and their effects on biological systems [82]. Yu et al. [83] constructed a compound-target network for a traditional medicine formula, the modified Shen-Yan-Fang-Shuai formula (M-SYFSF), with the help of the TCMSP [84] and SymMap [85] databases and deciphered the mechanism of M-SYFSF in treating diabetic nephropathy. These predictive approaches enable the identification of potential molecular targets, pathways, or biological processes modulated by natural compounds, thereby facilitating the rational design and optimization of drug candidates. Dai et al. [86] discovered 32 NPs from multiple TCM databases that ameliorated Alzheimer's disease pathology by targeting endoplasmic reticulum stress.
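The compound-target networks described above are, at their core, bipartite graphs built from activity annotations. A minimal sketch follows, assuming the annotations are available as (compound, target) pairs; all identifiers are hypothetical placeholders, and the functions are illustrative rather than taken from any of the cited studies.

```python
from collections import defaultdict


def build_network(pairs: list[tuple[str, str]]):
    """Build both sides of a bipartite compound-target network."""
    compound_to_targets: dict[str, set[str]] = defaultdict(set)
    target_to_compounds: dict[str, set[str]] = defaultdict(set)
    for compound, target in pairs:
        compound_to_targets[compound].add(target)
        target_to_compounds[target].add(compound)
    return compound_to_targets, target_to_compounds


def rank_targets(target_to_compounds: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank targets by degree (number of linked compounds), ties broken by name."""
    ranked = sorted(target_to_compounds.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(target, len(compounds)) for target, compounds in ranked]


# Hypothetical activity annotations.
pairs = [
    ("compound_A", "target_1"),
    ("compound_A", "target_2"),
    ("compound_B", "target_1"),
    ("compound_C", "target_1"),
    ("compound_C", "target_3"),
]
compound_to_targets, target_to_compounds = build_network(pairs)
print(rank_targets(target_to_compounds))
# Targets shared by two compounds hint at a common mechanism worth examining.
print(compound_to_targets["compound_A"] & compound_to_targets["compound_C"])
```

Linking targets further to diseases or pathways extends the same structure into a knowledge graph, on which degree and shared-neighbor analyses like these are the simplest starting points.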
3.3. Molecular generation
As mentioned above, NP structures exhibit remarkable diversity, making NPs valuable resources for drug discovery. However, the number of known NP molecules is significantly lower than that of synthetic small molecules and far lower than the theoretical chemical space of 10^60 small molecules [87]. Consequently, the acquisition of additional NPs or their analogs to expand the chemical space for virtual screening has remained a focus of drug design research. Traditional methodologies involve the use of structural fragmentation algorithms to dissect database structures into fragments that are then recombined using combinatorial chemistry principles to generate novel molecules [88]. Recently, generative deep learning algorithms have enhanced the utility of structures in large NP databases. Models such as recurrent neural networks (RNNs) [89], variational autoencoders (VAEs) [90], generative adversarial networks (GANs) [91], and transformers [92,93] can generate numerous molecules for virtual screening. For instance, Ochiai et al. [90] trained a VAE model using approved drugs and NPs from an original dataset and generated structures with higher docking scores based on Gefitinib. Similarly, Ma et al. [94] used SMILES of NPs from the COCONUT database to train various models and generated a plethora of molecular structures. These molecules were then docked with target proteins against COVID-19, demonstrating that increased diversity in the molecular library improved the docking scores, thereby facilitating the discovery of ligands with higher affinities. Inspired by the natural biogenesis of terpenoids, Zeng et al. [95] combined physics-based metadynamics simulations, learning-based transformers, and RNNs to generate diverse terpenoids, greatly expanding the chemical space with the evaluation of synthetic accessibility.
3.4. Other applications
In addition to the direct applications in drug discovery discussed above, NP databases also offer assistance in drug discovery through various avenues. Physicochemical properties, especially the drug-likeness properties, of chemical entities are important for the discovery of structures with interesting biological activities [96]. Cheminformatics tools and methods can identify biologically relevant parts of the chemical space of NPs through the analysis and comparison of different databases based on properties such as Lipinski's rule of five [97]. Analysis of drugs, NPs, and molecules from combinatorial chemistry using several databases has shown that NPs typically have a higher molecular mass, more oxygen atoms but fewer nitrogen and halogen atoms, higher numbers of H-bond acceptors and donors, higher hydrophilicity, and greater molecular rigidity [25,98,99]. Thus, cheminformatics filters are required to screen structures from NP databases to obtain compounds with desirable pharmacokinetic/pharmacodynamic properties and low toxicity [100]. Analysis of specific databases provides insights into chemical diversity. For example, the difference between terrestrial and marine NPs was revealed through a thorough cheminformatics analysis of the Dictionary of NPs [101] and the Dictionary of Marine NPs [102]. Their physicochemical properties and structural features show that marine NPs are more drug-like than terrestrial products. Saldívar-González et al. [103] compared the chemical space between the NuBBE [49] database and other databases and revealed the diversity and complexity of the structures and the potential for drug discovery. Zeng et al. [104] explored the chemical and biological spaces of terpenoids using a series of cheminformatics and bioinformatics approaches, revealing a special region of terpenoids in NPs.
Inspired by cheminformatics, Ertl et al. [105] employed machine learning to construct a regression model for assessing a molecule's NP-likeness based on structures from the CRC Dictionary of NPs (DNP) [101], serving as an evaluation metric for generating or synthesizing more diverse compound structures. Similarly, based on deep learning models, NP classifiers [106] can hierarchically classify NP structures to aid in understanding the molecular basis and biosynthetic pathways of complex NPs. Metabolic data from NP databases can be used to predict the biosynthesis of molecules; the prediction methods can be classified into traditional similarity-based [107], rule-based [108], and deep-learning-based [109] approaches, which are promising tools for the production of bioactive NPs.
4. Challenges and prospects
Despite their utility, NP databases face challenges such as data quality issues, especially redundant or conflicting information, which add to the data preprocessing work in drug discovery, such as library construction in virtual screening. On the one hand, there is a large amount of data duplication between different NP databases [17] because developers draw on the same data sources, such as literature and patents. On the other hand, early limited structural determination techniques and mistakes in data collection inevitably result in incomplete data, such as missing stereochemical information, which also leads to redundant storage of the same substance at inconsistent levels of structural detail. Therefore, it is sensible for users to confirm the accuracy of the extracted data. Ambure et al. [110] explored methods for cleaning activity data, structural data, and other related information. They also discussed strategies for handling significant discrepancies in the experimental data. It can sometimes be unclear whether NPs originate from a natural source when encountering heterologous gene expression or the artificial activation of silenced genes. Considering the rapid development of genomics and synthetic biology, the community must make efforts to define and attribute NPs.
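One simple way to surface the redundancy caused by inconsistent stereochemical detail is to group records by the first block of the standard InChIKey, which hashes the connectivity of the molecule, while stereochemistry is hashed into the second block. The sketch below assumes records carry hypothetical `name` and `inchikey` fields; the alanine keys are given for illustration, and the grouping only flags candidates for manual review rather than deciding which record is correct.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first InChIKey block, which hashes molecular connectivity."""
    return inchikey.split("-")[0]


def group_by_skeleton(records: list[dict]) -> dict[str, list[str]]:
    """Group record names that share connectivity, ignoring stereochemistry."""
    groups: dict[str, list[str]] = defaultdict(list)
    for record in records:
        groups[connectivity_block(record["inchikey"])].append(record["name"])
    return dict(groups)


def possible_duplicates(records: list[dict]) -> dict[str, list[str]]:
    """Keep only groups with more than one record, as candidates for review."""
    return {k: v for k, v in group_by_skeleton(records).items() if len(v) > 1}


# Illustrative records: the same skeleton stored with and without stereochemistry.
records = [
    {"name": "L-alanine", "inchikey": "QNAYBMKLOCPYGJ-REOHZTLXSA-N"},
    {"name": "D-alanine", "inchikey": "QNAYBMKLOCPYGJ-UWTATZPHSA-N"},
    {"name": "alanine (no stereo)", "inchikey": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"},
    {"name": "aspirin", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
]
print(possible_duplicates(records))
```

Distinct stereoisomers are legitimately different compounds, so records in one group should be merged only after checking the original literature.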
Another challenge is data updating and website maintenance. New NPs are continually reported, and it is tedious to collect information about their structures, sources, and activities from documents. Recently, text and image recognition tools have been used to extract chemical information from document-level sources. Optical chemical structure recognition (OCSR) is an essential tool for automatic structural information extraction. Deep-learning-based tools such as ABC-Net [111], ChemDataExtractor [112], and the StoneMIND Collector [113] can be used for database construction and updating. Nevertheless, owing to the structural complexity of NPs and the widespread use of generic derivative structures (“R groups”) in documents, the accuracy of NP structure recognition is not yet high enough for fully automated information extraction, and manual correction remains essential. Additionally, even if a structure is correctly recognized, automatically assigning the information in the text (name, source, activities, etc.) to the correct structure remains a challenge. Owing to the manual cost of data collection, many freely accessible databases developed by individual research groups have long been outdated or have become unavailable only a few years after release (for example, ChemBank [114]). Other databases, such as MetaCyc [13], have moved from free access to subscription-based access after funding support ended. While data extraction algorithms are being improved, it may be beneficial to establish an NP data bank (NPDB) to which researchers around the world can upload formatted information for review and validation by the whole community. A small step has already been taken: many region-specific NP databases, such as AfroDb [115] (Africa) and LANaPD [116] (Latin America), have been established and are expected to make significant contributions to NP research and NP-based drug discovery.
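The need for manual correction can be made explicit in an update pipeline by routing uncertain recognition results to curators instead of writing them directly into the database. The sketch below is one hypothetical way to do this and does not use the interface of any of the tools cited above: it assumes the recognition step yields a SMILES-like string and a confidence value per structure, flags strings that contain wildcard atoms or bracketed R-group labels, and uses an arbitrary confidence threshold of 0.9.

```python
import re

# Wildcard atoms ("*") and bracketed labels such as "[R]" or "[R1]".
# Element symbols such as "[Ru]" or "[Rh]" are deliberately not matched.
R_GROUP_PATTERN = re.compile(r"\*|\[R\d*\]")


def needs_manual_review(smiles, confidence, min_confidence=0.9):
    """Return the reasons a recognized structure should go to a curator."""
    reasons = []
    if not smiles:
        reasons.append("empty structure")
    elif R_GROUP_PATTERN.search(smiles):
        reasons.append("contains R-group placeholder")
    if confidence < min_confidence:
        reasons.append("low recognition confidence")
    return reasons


def triage(results, min_confidence=0.9):
    """Split recognition results into auto-accepted and manual-review lists."""
    accepted, review = [], []
    for item in results:
        reasons = needs_manual_review(
            item["smiles"], item["confidence"], min_confidence
        )
        if reasons:
            review.append({**item, "reasons": reasons})
        else:
            accepted.append(item)
    return accepted, review


# Hypothetical recognition output; identifiers and scores are invented.
RESULTS = [
    {"figure": "fig1_cpd1", "smiles": "CC(=O)OC1CCCCC1", "confidence": 0.97},
    {"figure": "fig1_cpd2", "smiles": "[R1]OC1CCCCC1", "confidence": 0.95},
    {"figure": "fig2_cpd3", "smiles": "OC1CCC(O)CC1", "confidence": 0.55},
]

ACCEPTED, REVIEW = triage(RESULTS)
```

In this example only the first structure is accepted automatically; the second is held back because of its R-group label and the third because of its low confidence. Recording the reasons alongside each item lets curators prioritize their corrections.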
The isolation and discovery of NPs appear to have peaked, and the number of new structures has started to decline in recent years (Fig. 4A). Of course, we cannot rule out the impact of the COVID-19 pandemic on the scientific community, but the “low-hanging fruit” is indeed becoming scarcer. Nevertheless, NP-based drugs have continued to be approved in recent years (Fig. 4B). This reminds us that: 1) comprehensive information, including biosynthetic gene clusters and organism-specific metabolic networks, is required to supplement the NP databases. By integrating such data, more powerful tools such as genome mining [117,118] and combinatorial biosynthesis [119,120] can be used to delve deeper into the realm of NPs and uncover the “dark substance”; 2) enhanced activity prediction and structure optimization strategies are highly desired to fully harness the existing treasure trove of NPs and develop them into drugs [71,121,122].
Fig. 4.
(A) Natural products (data collected from the DNP database) and drugs (data collected from Newman's work [1]) by year. (B) Small-molecule natural product drugs approved by the U.S. Food and Drug Administration from 2014 to 2023 (data from the review by Feng et al. [5]).
Existing NP databases play vital roles in these two aspects by providing researchers with access to comprehensive collections of chemical and biological data. In the era of big data, both traditional physics-based computational methods and data-driven artificial intelligence models require user-friendly, high-quality databases for knowledge mining. Despite the challenges and limitations discussed above, NP databases continue to serve as valuable resources for exploring chemical spaces, identifying potential drug candidates, and elucidating biosynthetic pathways. Team collaboration and toolkit development for integrating diverse chemical and biological databases are essential for overcoming these challenges, and we believe that these efforts will expedite drug discovery, bioactivity mining, structure modification, and the synthesis of NPs.
Data availability
Not applicable.
Ethics approval
Not applicable.
Funding information
This work was supported by the National Key Research and Development Program of China (2023YFC3404900) and the Key Area Research and Development Program of Guangdong Province, China (2022B1111080005). We also thank the Top-Notch Young Talents Program of China for its support.
CRediT authorship contribution statement
Tao Zeng: Writing – original draft, Resources, Methodology. Jiahao Li: Resources. Ruibo Wu: Writing – review & editing, Supervision, Project administration, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
Not applicable.
References
- 1.Newman D.J., Cragg G.M. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J. Nat. Prod. 2020;83(3):770–803. doi: 10.1021/acs.jnatprod.9b01285.
- 2.Luo Z., Yin F., Wang X., Kong L. Progress in approved drugs from natural product resources. Chin. J. Nat. Med. 2024;22(3):195–211. doi: 10.1016/S1875-5364(24)60582-0.
- 3.White N.J. Qinghaosu (artemisinin): the price of success. Science. 2008;320(5874):330–334. doi: 10.1126/science.1155165.
- 4.Wani M.C., Taylor H.L., Wall M.E., Coggon P., McPhail A.T. Plant antitumor agents. Vi. The isolation and structure of taxol, a novel antileukemic and antitumor agent from taxus brevifolia. J. Am. Chem. Soc. 1971;93(9):2325–2327. doi: 10.1021/ja00738a045.
- 5.Jin F., Haixue P., Gongli T. Research advances in biosynthesis of natural product drugs in the past decade. Synth. Biol J. 2024:1–39. doi: 10.12211/2096-8280.2023-092.
- 6.Khalaf K., Tornese P., Cocco A., Albanese A. Tauroursodeoxycholic acid: a potential therapeutic tool in neurodegenerative diseases. Transl. Neurodegener. 2022;11(1):33. doi: 10.1186/s40035-022-00307-z.
- 7.Garcia-Effron G. Rezafungin—mechanisms of action, susceptibility and resistance: similarities and differences with the other echinocandins. J. Fungi. 2020;6(4):262. doi: 10.3390/jof6040262.
- 8.Sorokina M., Merseburger P., Rajan K., Yirik M.A., Steinbeck C. COCONUT online: collection of open natural products database. J. Cheminf. 2021;13(1):2. doi: 10.1186/s13321-020-00478-9.
- 9.Zhao H., Yang Y., Wang S., Yang X., Zhou K., Xu C., Zhang X., Fan J., Hou D., Li X., Lin H., Tan Y., Wang S., Chu X.-Y., Zhuoma D., Zhang F., Ju D., Zeng X., Chen Y.Z. NPASS database update 2023: quantitative natural product activity and species source database for biomedical research. Nucleic Acids Res. 2023;51(D1):D621–D628. doi: 10.1093/nar/gkac1069.
- 10.Atanasov A.G., Zotchev S.B., Dirsch V.M., et al. Natural products in drug discovery: advances and opportunities. Nat. Rev. Drug Discov. 2021;20(3):200–216. doi: 10.1038/s41573-020-00114-z.
- 11.Rodrigues T., Reker D., Schneider P., Schneider G. Counting on natural products for drug design. Nat. Chem. 2016;8(6):531–541. doi: 10.1038/nchem.2479.
- 12.Kanehisa M., Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27.
- 13.Caspi R., Billington R., Keseler I.M., Kothari A., Krummenacker M., Midford P.E., Ong W.K., Paley S., Subhraveti P., Karp P.D. The MetaCyc database of metabolic pathways and enzymes - a 2019 update. Nucleic Acids Res. 2020;48(D1):D445–D453. doi: 10.1093/nar/gkz862.
- 14.Gallo K., Kemmler E., Goede A., Becker F., Dunkel M., Preissner R., Banerjee P. SuperNatural 3.0—a database of natural products and natural product-based derivatives. Nucleic Acids Res. 2023;51(D1):D654–D659. doi: 10.1093/nar/gkac1008.
- 15.Zeng T., Chen Y., Jian Y., Zhang F., Wu R. Chemotaxonomic investigation of plant terpenoids with an established database (TeroMOL) New Phytol. 2022;235(2):662–673. doi: 10.1111/nph.18133.
- 16.Lyu C., Chen T., Qiang B., Liu N., Wang H., Zhang L., Liu Z. CMNPD: a comprehensive marine natural products database towards facilitating drug discovery from the ocean. Nucleic Acids Res. 2021;49(D1):D509–D515. doi: 10.1093/nar/gkaa763.
- 17.Sorokina M., Steinbeck C. Review on natural products databases: where to find data in 2020. J. Cheminf. 2020;12(1):20. doi: 10.1186/s13321-020-00424-9.
- 18.Thomford N.E., Senthebane D.A., Rowe A., Munro D., Seele P., Maroyi A., Dzobo K. Natural products for drug discovery in the 21st century: innovations for novel drug discovery. Int. J. Mol. Sci. 2018:1578. doi: 10.3390/ijms19061578.
- 19.Chen Y., de Bruyn Kops C., Kirchmair J. In: Progress in the Chemistry of Organic Natural Products 110: Cheminformatics in Natural Product Research. Kinghorn A.D., Falk H., Gibbons S., Kobayashi J.i., Asakawa Y., Liu J.-K., editors. Springer International Publishing; Cham: 2019. Resources for chemical, biological, and structural data on natural products; pp. 37–71.
- 20.Ntie-Kang F., Svozil D. An enumeration of natural products from microbial, marine and terrestrial sources. Phys. Sci. Rev. 2020;5(8) doi: 10.1515/psr-2018-0121.
- 21.van Santen J.A., Kautsar S.A., Medema M.H., Linington R.G. Microbial natural product databases: moving forward in the multi-omics era. Nat. Prod. Rep. 2021;38(1):264–278. doi: 10.1039/D0NP00053A.
- 22.Medema M.H. The year 2020 in natural product bioinformatics: an overview of the latest tools and databases. Nat. Prod. Rep. 2021;38(2):301–306. doi: 10.1039/D0NP00090F.
- 23.Prieto-Martínez F.D., Norinder U., Medina-Franco J.L. In: Progress in the Chemistry of Organic Natural Products 110: Cheminformatics in Natural Product Research. Kinghorn A.D., Falk H., Gibbons S., Kobayashi J.i., Asakawa Y., Liu J.-K., editors. Springer International Publishing; Cham: 2019. Cheminformatics explorations of natural products; pp. 1–35.
- 24.Chen Y., Kirchmair J. Cheminformatics in natural product-based drug discovery. Mol. Inf. 2020;39(12) doi: 10.1002/minf.202000171.
- 25.Chen Y., Garcia de Lomana M., Friedrich N.O., Kirchmair J. Characterization of the chemical space of known and readily obtainable natural products. J. Chem. Inf. Model. 2018;58(8):1518–1532. doi: 10.1021/acs.jcim.8b00302.
- 26.Daylight theory: SMILES. https://daylight.com/dayhtml/doc/theory/theory.smiles.html
- 27.Stein S.E., Heller S.R., Tchekhovskoi D.V. International Chemical Information Conference, Nimes. FR. 2003. An open standard for chemical structure representation: the IUPAC chemical identifier; pp. 131–143.
- 28.Ozturk H., Ozgur A., Schwaller P., Laino T., Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov. Today. 2020;25(4):689–705. doi: 10.1016/j.drudis.2020.01.020.
- 29.Lo Y.C., Rensi S.E., Torng W., Altman R.B. Machine learning in chemoinformatics and drug discovery. Drug Discov. Today. 2018;23(8):1538–1546. doi: 10.1016/j.drudis.2018.05.010.
- 30.Cavasotto C.N., Aucar M.G., Adler N.S. Computational chemistry in drug lead discovery and design. Int. J. Quant. Chem. 2019;119(2) doi: 10.1002/qua.25678.
- 31.Landrum G. Rdkit: open-source cheminformatics software. http://www.rdkit.org
- 32.O'Boyle N.M., Banck M., James C.A., Morley C., Vandermeersch T., Hutchison G.R. Open babel: an open chemical toolbox. J. Cheminf. 2011;3(1):33. doi: 10.1186/1758-2946-3-33.
- 33.Pletnev I., Erin A., McNaught A., Blinov K., Tchekhovskoi D., Heller S. Inchikey collision resistance: an experimental testing. J. Cheminf. 2012;4(1):39. doi: 10.1186/1758-2946-4-39.
- 34.Burger M.C. Chemdoodle web components: html 5 toolkit for chemical graphics, interfaces, and informatics. J. Cheminf. 2015;7(1):35. doi: 10.1186/s13321-015-0085-3.
- 35.Wang M., Carver J.J., Phelan V.V., et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 2016;34(8):828–837. doi: 10.1038/nbt.3597.
- 36.Wishart D.S., Sayeeda Z., Budinski Z., et al. NP-MRD: the natural products magnetic resonance database. Nucleic Acids Res. 2022;50(D1):D665–D677. doi: 10.1093/nar/gkab1052.
- 37.NMRdata. http://www.nmrdata.com/
- 38.Kim S., Chen J., Cheng T., Gindulyte A., He J., He S., Li Q., Shoemaker B.A., Thiessen P.A., Yu B., Zaslavsky L., Zhang J., Bolton E.E. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 2021;49(D1):D1388–D1395. doi: 10.1093/nar/gkaa971.
- 39.Lipinski C.A., Lombardo F., Dominy B.W., Feeney P.J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997;23(1–3):3–25. doi: 10.1016/s0169-409x(96)00423-1.
- 40.Lipinski C.A. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov. Today Technol. 2004;1(4):337–341. doi: 10.1016/j.ddtec.2004.11.007.
- 41.van de Waterbeemd H., Gifford E. ADMET in silico modelling: towards prediction paradise? Nat. Rev. Drug Discov. 2003;2(3):192–204. doi: 10.1038/nrd1032.
- 42.Xiong G., Wu Z., Yi J., Fu L., Yang Z., Hsieh C., Yin M., Zeng X., Wu C., Lu A., Chen X., Hou T., Cao D. Admetlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res. 2021;49(W1):W5–W14. doi: 10.1093/nar/gkab255.
- 43.Yang H., Lou C., Sun L., Li J., Cai Y., Wang Z., Li W., Liu G., Tang Y. admetSAR 2.0: web-service for prediction and optimization of chemical ADMET properties. Bioinformatics. 2019;35(6):1067–1069. doi: 10.1093/bioinformatics/bty707.
- 44.Daina A., Michielin O., Zoete V. SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci. Rep. 2017;7(1) doi: 10.1038/srep42717.
- 45.Bhhatarai B., Walters W.P., Hop C.E.C.A., Lanza G., Ekins S. Opportunities and challenges using artificial intelligence in adme/tox. Nat. Mater. 2019;18(5):418–422. doi: 10.1038/s41563-019-0332-5.
- 46.Mangal M., Sagar P., Singh H., Raghava G.P.S., Agarwal S.M. NPACT: naturally occurring plant-based anti-cancer compound-activity-target database. Nucleic Acids Res. 2013;41(D1):D1124–D1129. doi: 10.1093/nar/gks1047.
- 47.Zeng T., Liu Z., Zhuang J., Jiang Y., He W., Diao H., Lv N., Jian Y., Liang D., Qiu Y., Zhang R., Zhang F., Tang X., Wu R. TeroKit: a database-driven web server for terpenome research. J. Chem. Inf. Model. 2020;60(4):2082–2090. doi: 10.1021/acs.jcim.0c00141.
- 48.Zdrazil B., Felix E., Hunter F., et al. The ChEMBL database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 2024;52(D1):D1180–D1192. doi: 10.1093/nar/gkad1004.
- 49.Valli M., dos Santos R.N., Figueira L.D., Nakajima C.H., Castro-Gamboa I., Andricopulo A.D., Bolzani V.S. Development of a natural products database from the biodiversity of Brazil. J. Nat. Prod. 2013;76(3):439–444. doi: 10.1021/np3006875.
- 50.Xu H., Zhang W., Zhou Y., Yue Z., Yan T., Zhang Y., Liu Y., Hong Y., Liu S., Zhu F., Tao L. Systematic description of the content variation of natural products (nps): to prompt the yield of high-value nps and the discovery of new therapeutics. J. Chem. Inf. Model. 2023;63(5):1615–1625. doi: 10.1021/acs.jcim.2c01459.
- 51.Chen C.Y.-C. TCM database@taiwan: the world's largest traditional Chinese medicine database for drug screening in silico. PLoS One. 2011;6(1) doi: 10.1371/journal.pone.0015939.
- 52.Lv Q., Chen G., He H., Yang Z., Zhao L., Chen H.-Y., Chen C.Y.-C. TCMBank: bridges between the largest herbal medicines, chemical ingredients, target proteins, and associated diseases with intelligence text mining. Chem. Sci. 2023;14(39):10684–10701. doi: 10.1039/D3SC02139D.
- 53.Chen T., Yang M., Cui G., Tang J., Shen Y., Liu J., Yuan Y., Guo J., Huang L. IMP: bridging the gap for medicinal plant genomics. Nucleic Acids Res. 2024;52(D1):D1347–D1354. doi: 10.1093/nar/gkad898.
- 54.Bao R., Zhang H., Tang Y. Biomimetic synthesis of natural products: a journey to learn, to mimic, and to be better. Acc. Chem. Res. 2021;54(19):3720–3733. doi: 10.1021/acs.accounts.1c00459.
- 55.Shakour N., Mohadeszadeh M., Iranshahi M. Biomimetic synthesis of biologically active natural products: an updated review. Mini-Rev. Med. Chem. 2024;24(1):3–25. doi: 10.2174/1389557523666230417083143.
- 56.UniProt C. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506–D515. doi: 10.1093/nar/gky1049.
- 57.Yu T., Boob A.G., Volk M.J., Liu X., Cui H., Zhao H. Machine learning-enabled retrobiosynthesis of molecules. Nat. Catal. 2023;6(2):137–151. doi: 10.1038/s41929-022-00909-w.
- 58.Allen F. The cambridge structural database: a quarter of a million crystal structures and rising. Acta Crystallogr. B. 2002;58(3 Part 1):380–388. doi: 10.1107/S0108768102003890.
- 59.Burley S.K., Bhikadiya C., Bi C., et al. RCSB protein data bank (RCSB.Org): delivery of experimentally-determined pdb structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 2023;51(D1):D488–D508. doi: 10.1093/nar/gkac1077.
- 60.van Santen J.A., Poynton E.F., Iskakova D., McMann E., Alsup Tyler A., Clark T.N., Fergusson C.H., Fewer D.P., Hughes A.H., McCadden C.A., Parra J., Soldatou S., Rudolf J.D., Janssen E.M.L., Duncan K.R., Linington R.G. The natural products atlas 2.0: a database of microbially-derived natural products. Nucleic Acids Res. 2022;50(D1):D1317–D1323. doi: 10.1093/nar/gkab941.
- 61.Chen N., Zhang R., Zeng T., Zhang X., Wu R. Developing TeroENZ and TeroMAP modules for the terpenome research platform TeroKit. Database. 2023;2023:baad020. doi: 10.1093/database/baad020.
- 62.Fan J., Fu A., Zhang L. Progress in molecular docking. Quant. Biol. 2019;7(2):83–89. doi: 10.1007/s40484-019-0172-y.
- 63.Schaller D., Šribar D., Noonan T., Deng L., Nguyen T.N., Pach S., Machalz D., Bermudez M., Wolber G. Next generation 3D pharmacophore modeling. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2020;10(4):e1468. doi: 10.1002/wcms.1468.
- 64.Muegge I., Mukherjee P. An overview of molecular fingerprint similarity search in virtual screening. Expet Opin. Drug Discov. 2016;11(2):137–148. doi: 10.1517/17460441.2016.1117070.
- 65.de Sousa Luis J.A., Barros R.P.C., de Sousa N.F., Muratov E., Scotti L., Scotti M.T. Virtual screening of natural products database. Mini-Rev. Med. Chem. 2021;21(18):2657–2730. doi: 10.2174/1389557520666200730161549.
- 66.Zhou N., Zheng C., Tan H., Luo L. Identification of PLK1-PBD inhibitors from the library of marine natural products: 3D QSAR pharmacophore, ADMET, scaffold hopping, molecular docking, and molecular dynamics study. Mar. Drugs. 2024:83. doi: 10.3390/md22020083.
- 67.Juárez-Mercado K.E., Gómez-Hernández M.A., Salinas-Trujano J., Córdova-Bahena L., Espitia C., Pérez-Tapia S.M., Medina-Franco J.L., Velasco-Velázquez M.A. Identification of sars-cov-2 main protease inhibitors using chemical similarity analysis combined with machine learning. Pharmaceuticals. 2024:240. doi: 10.3390/ph17020240.
- 68.Arul Murugan N., Ruba Priya G., Narahari Sastry G., Markidis S. Artificial intelligence in virtual screening: models versus experiments. Drug Discov. Today. 2022;27(7):1913–1923. doi: 10.1016/j.drudis.2022.05.013.
- 69.Tropsha A., Isayev O., Varnek A., Schneider G., Cherkasov A. Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR. Nat. Rev. Drug Discov. 2024;23(2):141–155. doi: 10.1038/s41573-023-00832-0.
- 70.Yang X., Wang Y., Byrne R., Schneider G., Yang S. Concepts of artificial intelligence for computer-assisted drug discovery. Chem. Rev. 2019;119(18):10520–10594. doi: 10.1021/acs.chemrev.8b00728.
- 71.Zhang R., Li X., Zhang X., Qin H., Xiao W. Machine learning approaches for elucidating the biological effects of natural products. Nat. Prod. Rep. 2021;38(2):346–361. doi: 10.1039/D0NP00043D.
- 72.Mullowney M.W., Duncan K.R., Elsayed S.S., et al. Artificial intelligence for natural product drug discovery. Nat. Rev. Drug Discov. 2023;22(11):895–916. doi: 10.1038/s41573-023-00774-7.
- 73.Liu Z., Huang D., Zheng S., Song Y., Liu B., Sun J., Niu Z., Gu Q., Xu J., Xie L. Deep learning enables discovery of highly potent anti-osteoporosis natural products. Eur. J. Med. Chem. 2021;210 doi: 10.1016/j.ejmech.2020.112982.
- 74.Zhang R., Ren S., Dai Q., Shen T., Li X., Li J., Xiao W. InflamNat: web-based database and predictor of anti-inflammatory natural products. J. Cheminf. 2022;14(1):30. doi: 10.1186/s13321-022-00608-5.
- 75.Zhang R., Lin J., Zou Y., Zhang X.J., Xiao W.L. Chemical space and biological target network of anti-inflammatory natural products. J. Chem. Inf. Model. 2019;59(1):66–73. doi: 10.1021/acs.jcim.8b00560.
- 76.Kibble M., Saarinen N., Tang J., Wennerberg K., Mäkelä S., Aittokallio T. Network pharmacology applications to map the unexplored target space and therapeutic potential of natural products. Nat. Prod. Rep. 2015;32(8):1249–1266. doi: 10.1039/C5NP00005J.
- 77.Cockroft N.T., Cheng X., Fuchs J.R. STarFish: a stacked ensemble target fishing approach and its application to natural products. J. Chem. Inf. Model. 2019;59(11):4906–4920. doi: 10.1021/acs.jcim.9b00489.
- 78.Qiang B., Lai J., Jin H., Zhang L., Liu Z. Target prediction model for natural products using transfer learning. Int. J. Mol. Sci. 2021:4632. doi: 10.3390/ijms22094632.
- 79.Hopkins A.L. Network pharmacology: the next paradigm in drug discovery. Nat. Chem. Biol. 2008;4(11):682–690. doi: 10.1038/nchembio.118.
- 80.Gu J., Gui Y., Chen L., Yuan G., Lu H.Z., Xu X. Use of natural products as chemical library for drug discovery and network pharmacology. PLoS One. 2013;8(4) doi: 10.1371/journal.pone.0062839.
- 81.Wang X., Wang Z.-Y., Zheng J.-H., Li S. TCM network pharmacology: a new trend towards combining computational, experimental and clinical approaches. Chin. J. Nat. Med. 2021;19(1):1–11. doi: 10.1016/S1875-5364(21)60001-8.
- 82.Ouyang Y., Rong Y., Wang Y., Guo Y., Shan L., Yu X., Li L., Si J., Li X., Ma K. A systematic study of the mechanism of acacetin against sepsis based on network pharmacology and experimental validation. Front. Pharmacol. 2021;12 doi: 10.3389/fphar.2021.683645.
- 83.Yu B., Zhou M., Dong Z., Zheng H., Zhao Y., Zhou J., Zhang C., Wei F., Yu G., Liu W.J., Liu H., Wang Y. Integrating network pharmacology and experimental validation to decipher the mechanism of the Chinese herbal prescription modified shen-yan-fang-shuai formula in treating diabetic nephropathy. Pharm. Biol. 2023;61(1):1222–1233. doi: 10.1080/13880209.2023.2241521.
- 84.Ru J., Li P., Wang J., Zhou W., Li B., Huang C., Li P., Guo Z., Tao W., Yang Y., Xu X., Li Y., Wang Y., Yang L. TCMSP: a database of systems pharmacology for drug discovery from herbal medicines. J. Cheminf. 2014;6(1):13. doi: 10.1186/1758-2946-6-13.
- 85.Wu Y., Zhang F., Yang K., Fang S., Bu D., Li H., Sun L., Hu H., Gao K., Wang W., Zhou X., Zhao Y., Chen J. SymMap: an integrative database of traditional Chinese medicine enhanced by symptom mapping. Nucleic Acids Res. 2019;47(D1):D1110–D1117. doi: 10.1093/nar/gky1021.
- 86.Dai Z., Hu T., Wei J., Wang X., Cai C., Gu Y., Hu Y., Wang W., Wu Q., Fang J. Network-based identification and mechanism exploration of active ingredients against alzheimer's disease via targeting endoplasmic reticulum stress from traditional Chinese medicine. Comput. Struct. Biotechnol. J. 2024;23:506–519. doi: 10.1016/j.csbj.2023.12.017.
- 87.Mullard A. The drug-maker's guide to the galaxy. Nature. 2017;549(7673):445–447. doi: 10.1038/549445a.
- 88.Yu M.J. Natural product-like virtual libraries: recursive atom-based enumeration. J. Chem. Inf. Model. 2011;51(3):541–557. doi: 10.1021/ci1002087.
- 89.Segler M.H.S., Kogej T., Tyrchan C., Waller M.P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2018;4(1):120–131. doi: 10.1021/acscentsci.7b00512.
- 90.Ochiai T., Inukai T., Akiyama M., Furui K., Ohue M., Matsumori N., Inuki S., Uesugi M., Sunazuka T., Kikuchi K., Kakeya H., Sakakibara Y. Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity. Commun. Chem. 2023;6(1):249. doi: 10.1038/s42004-023-01054-6.
- 91.De Cao N., Kipf T. ICML18 Workshop on Theoretical Foundations and Applications of Deep Generative Models. 2018. Molgan: an implicit generative model for small molecular graphs.
- 92.Bagal V., Aggarwal R., Vinod P.K., Priyakumar U.D. MolGPT: molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 2022;62(9):2064–2076. doi: 10.1021/acs.jcim.1c00600.
- 93.Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc.; Long Beach, California, USA: 2017. Attention is all you need; pp. 6000–6010.
- 94.Ma M., Zhang X., Zhou L., Han Z., Shi Y., Li J., Wu L., Xu Z., Zhu W. D3Rings: a fast and accurate method for ring system identification and deep generation of drug-like cyclic compounds. J. Chem. Inf. Model. 2024;64(3):724–736. doi: 10.1021/acs.jcim.3c01657.
- 95.Zeng T., Hess B.A., Jr., Zhang F., Wu R. Bio-inspired chemical space exploration of terpenoids. Briefings Bioinf. 2022;23(5):bbac197. doi: 10.1093/bib/bbac197.
- 96.Mignani S., Rodrigues J., Tomas H., Jalal R., Singh P.P., Majoral J.-P., Vishwakarma R.A. Present drug-likeness filters in medicinal chemistry during the hit and lead optimization process: how far can they be simplified? Drug Discov. Today. 2018;23(3):605–615. doi: 10.1016/j.drudis.2018.01.010.
- 97.Lachance H., Wetzel S., Kumar K., Waldmann H. Charting, navigating, and populating natural product chemical space for drug discovery. J. Med. Chem. 2012;55(13):5989–6001. doi: 10.1021/jm300288g.
- 98.Feher M., Schmidt J.M. Property distributions: differences between drugs, natural products, and molecules from combinatorial chemistry. J. Chem. Inf. Comput. Sci. 2003;43(1):218–227. doi: 10.1021/ci0200467.
- 99.Ertl P., Schuffenhauer A. Cheminformatics analysis of natural products: lessons from nature inspiring the design of new drugs. Prog. Drug Res. 2008;66:217–235. doi: 10.1007/978-3-7643-8595-8_4.
- 100.Huggins D.J., Venkitaraman A.R., Spring D.R. Rational methods for the selection of diverse screening compounds. ACS Chem. Biol. 2011;6(3):208–217. doi: 10.1021/cb100420r.
- 101.Dictionary of natural products. https://dnp.chemnetbase.com
- 102.Dictionary of marine natural products. https://dmnp.chemnetbase.com
- 103.Saldivar-Gonzalez F.I., Valli M., Andricopulo A.D., da Silva Bolzani V., Medina-Franco J.L. Chemical space and diversity of the nubbe database: a chemoinformatic characterization. J. Chem. Inf. Model. 2019;59(1):74–85. doi: 10.1021/acs.jcim.8b00619. [DOI] [PubMed] [Google Scholar]
- 104.Zeng T., Liu Z., Liu H., He W., Tang X., Xie L., Wu R. Exploring chemical and biological space of terpenoids. J. Chem. Inf. Model. 2019;59(9):3667–3678. doi: 10.1021/acs.jcim.9b00443. [DOI] [PubMed] [Google Scholar]
- 105.Ertl P., Roggo S., Schuffenhauer A. Natural product-likeness score and its application for prioritization of compound libraries. J. Chem. Inf. Model. 2008;48(1):68–74. doi: 10.1021/ci700286x. [DOI] [PubMed] [Google Scholar]
- 106.Kim H.W., Wang M., Leber C.A., Nothias L.F., Reher R., Kang K.B., van der Hooft J.J.J., Dorrestein P.C., Gerwick W.H., Cottrell G.W. Npclassifier: a deep neural network-based structural classification tool for natural products. J. Nat. Prod. 2021;84(11):2795–2807. doi: 10.1021/acs.jnatprod.1c00399. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107.Yuan L., Tian Y., Ding S., Liu Y., Chen F., Zhang T., Tu W., Chen J., Hu Q.-N. PrecursorFinder: a customized biosynthetic precursor explorer. Bioinformatics. 2019;35(9):1603–1604. doi: 10.1093/bioinformatics/bty838. [DOI] [PubMed] [Google Scholar]
- 108.Koch M., Duigou T., Faulon J.L. Reinforcement learning for bioretrosynthesis. ACS Synth. Biol. 2020;9(1):157–168. doi: 10.1021/acssynbio.9b00447. [DOI] [PubMed] [Google Scholar]
- 109.Zheng S., Zeng T., Li C., Chen B., Coley C.W., Yang Y., Wu R. Deep learning driven biosynthetic pathways navigation for natural products with bionavi-np. Nat. Commun. 2022;13(1):3342. doi: 10.1038/s41467-022-30970-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Ambure P., Cordeiro M.N.D.S. In: Ecotoxicological QSARs. Roy K., editor. Springer US; New York, NY: 2020. Importance of data curation in QSAR studies especially while modeling large-size datasets; pp. 97–109.
- 111.Zhang X.-C., Yi J.-C., Yang G.-P., Wu C.-K., Hou T.-J., Cao D.-S. ABC-Net: a divide-and-conquer based deep learning architecture for SMILES recognition from molecular images. Briefings Bioinf. 2022;23(2):bbac033. doi: 10.1093/bib/bbac033.
- 112.Mavračić J., Court C.J., Isazawa T., Elliott S.R., Cole J.M. ChemDataExtractor 2.0: autopopulated ontologies for materials science. J. Chem. Inf. Model. 2021;61(9):4280–4289. doi: 10.1021/acs.jcim.1c00446.
- 113.StoneMIND collector. https://stonewise.cn/mol_product
- 114.Petri Seiler K., Kuehn H., Pat Happ M., DeCaprio D., Clemons P.A. Using ChemBank to probe chemical biology. Curr. Protoc. Bioinf. 2008;22(1):14.5.1–14.5.26. doi: 10.1002/0471250953.bi1405s22.
- 115.Ntie-Kang F., Zofou D., Babiaka S.B., Meudom R., Scharfe M., Lifongo L.L., Mbah J.A., Mbaze L.M.a., Sippl W., Efange S.M.N. AfroDb: a select highly potent and diverse natural product library from African medicinal plants. PLoS One. 2013;8(10) doi: 10.1371/journal.pone.0078085.
- 116.Medina-Franco J.L. Towards a unified Latin American natural products database: LANaPD. Future Sci. OA. 2020;6(8) doi: 10.2144/fsoa-2020-0068.
- 117.Yang Q., Cheng B., Tang Z., Liu W. Applications and prospects of genome mining in the discovery of natural products. Synth. Biol. J. 2023;2(5):697–715. doi: 10.12211/2096-8280.2021-012.
- 118.Xi M.Y., Hu Y.L., Gu Y.C., Ge H.M. Genome mining-directed discovery for natural medicinal products. Synth. Biol. J. 2024:1–27. doi: 10.12211/2096-8280.2023-086.
- 119.Hautbergue T., Jamin E.L., Debrauwer L., Puel O., Oswald I.P. From genomics to metabolomics, moving toward an integrated strategy for the discovery of fungal secondary metabolites. Nat. Prod. Rep. 2018;35(2):147–173. doi: 10.1039/C7NP00032D.
- 120.Yan P., Wang G., Huang M., Liu Z., Dai C., Hu B., Gu M., Deng Z., Liu R., Wang X., Liu T. Combinatorial biosynthesis creates a novel aglycone polyether with high potency and low side effects against bladder cancer. Adv. Sci. 2024 doi: 10.1002/advs.202404668.
- 121.Kurita K.L., Glassey E., Linington R.G. Integration of high-content screening and untargeted metabolomics for comprehensive functional annotation of natural product libraries. Proc. Natl. Acad. Sci. U.S.A. 2015;112(39):11999–12004. doi: 10.1073/pnas.1507743112.
- 122.Yuan Y., Shi C., Zhao H. Machine learning-enabled genome mining and bioactivity prediction of natural products. ACS Synth. Biol. 2023;12(9):2650–2662. doi: 10.1021/acssynbio.3c00234.
Data Availability Statement
Not applicable.




