Abstract
Natural products are essential in drug discovery, chemical biology, and medicinal chemistry. Despite their widespread use, NP data remains fragmented across various databases, limiting their utility for whole person health research, which requires comprehensive, interoperable resources. This study explores and compares three major NP databases: COCONUT, NP-MRD, and GSRS, assessing their scope, structural representation, metadata completeness, and accessibility. COCONUT provides extensive chemical diversity, NP-MRD emphasizes spectral and physical property data, and GSRS focuses on regulatory classification. Despite their strengths, overlap between databases is moderate to small, and significant gaps remain in integrating medical and pharmaceutical information. Improved interoperability and harmonization are needed to support advanced computational models for whole person health. Our findings highlight critical gaps and opportunities to enhance NP database integration, laying the groundwork for developing comprehensive resources that better support data-driven investigations of natural products.
Introduction
Natural Products (NPs) have played a fundamental role in drug discovery, chemical biology, and medicinal chemistry. Their unique chemical scaffolds and extensive structural diversity enable interactions with biological targets that synthetic compounds often cannot achieve1. Over 66% of approved therapeutic agents are directly or indirectly derived from NPs2, and recent pharmaceutical surveys reveal that roughly half of all newly introduced small-molecule drugs are NP-based or closely related2. Even today, NPs remain well-represented in the drug development pipeline – for example, in 2019 more than half of new therapeutic candidates were complex modalities such as proteins, nucleic acids, polymers, and structurally diverse NPs2. Beyond therapeutics, NPs serve as indispensable chemical probes in chemical biology, illuminating biochemical pathways and disease mechanisms through their unique bioactivities1. Their chemical complexity, high structural diversity, and bioactivity make them valuable resources for developing new therapeutic agents and functional materials.
In the digital era, robust databases have become crucial for managing the vast and growing body of scientific knowledge. High-quality, well-structured, and accessible databases are fundamental for organizing diverse data types, making them Findable, Accessible, Interoperable, and Reusable (FAIR). The rise of computational techniques, including bioinformatics, cheminformatics, machine learning, and knowledge graphs (KGs), has further increased the reliance on high-quality databases3–7. For instance, modern drug discovery and metabolomics increasingly employ in silico methods, which require large, well-annotated datasets for training and validation8. In NP research, AI-driven approaches have begun mining genomic, spectral, and chemical data to predict new structures or activities, thereby accelerating drug discovery, metabolomics, and related fields9. However, NP data remain scattered across disparate sources, often with inconsistent formats, incomplete metadata, and varied standards for identifiers and classification. A recent perspective noted that NP data are often “multimodal, unbalanced, unstandardized, and scattered across many data repositories,” posing a major barrier to integration and AI applications10.
These limitations are particularly relevant to the emerging paradigm of whole person health (WPH), which emphasizes a comprehensive view of health by integrating biological, behavioral, social and environmental data11. NPs are uniquely positioned to contribute to WPH research. Derived from plants, foods, and traditional medicine, they influence not only molecular and physiological processes but also reflect cultural practices, dietary habits, and environmental exposures. Properly structured and integrated NP data can serve as a critical link between molecular insights and broader determinants of health, enabling more holistic approaches to biomedical discovery and public health.
Despite the growing availability of numerous NP databases, the landscape remains highly fragmented due to significant heterogeneity in data representation, coverage, metadata completeness, and accessibility. Databases vary greatly in their structure, content depth, and data standards, reflecting diverse scopes, priorities, and curation strategies. This fragmentation often results in difficulties when researchers attempt to cross-reference compounds, validate structural data, or integrate datasets into computational workflows. It also prevents algorithms from recognizing overarching patterns and limits the impact of AI in NP science. For example, inconsistencies in chemical identifiers, incomplete metadata, or inaccessible spectral data can severely hamper computational drug discovery efforts and regulatory assessments, ultimately reducing the practical utility of available NP resources. High-quality databases that enforce common standards and rigorous curation thus play a pivotal role in enabling next-generation computational workflows in NP discovery.
Addressing these challenges necessitates a thorough understanding of existing resources and their complementary strengths and limitations. In this context, this study systematically compares three prominent NP databases: COlleCtion of Open Natural prodUcTs (COCONUT)12,13, Natural Products Magnetic Resonance Database (NP-MRD)14, and Global Substance Registration System (GSRS)15. These databases were selected to represent diverse and complementary facets of NP information that are critical for data harmonization. COCONUT represents a large-scale aggregation of chemical structural information from diverse open-access sources, designed specifically for broad computational analyses and dereplication studies. NP-MRD specializes in nuclear magnetic resonance (NMR) spectroscopy data, which is crucial for the detailed structural elucidation and validation of NPs. GSRS emphasizes standardized regulatory identifiers for substances, enabling data harmonization essential for regulatory compliance and international data interoperability. By evaluating these databases in terms of data coverage, structural representation, metadata completeness, interoperability, and accessibility, we aim to highlight critical gaps and integration opportunities to guide future improvement in NP database design and harmonization, with a particular emphasis on enabling their use in WPH research.
Materials and Methods
Natural Products database selection
For this study, we selected representative resources for comparing existing NP databases. These databases were chosen based on their complementary scopes, data coverage, and relevance to NP research. COCONUT is one of the largest open-access NP databases, an aggregation of over 120 openly available NP datasets. It offers extensive chemical diversity with structural annotations, making it ideal for large-scale computational analysis and dereplication studies. NP-MRD is a specialized resource focusing on nuclear magnetic resonance (NMR) spectral data of NPs, providing high-quality structural and analytical insights crucial for compound validation and elucidation. GSRS, developed for regulatory purposes, offers standardized substance identification, ensuring data harmonization and interoperability across global pharmaceutical and research institutions. By including these three databases, our study captures a broad spectrum of NP information, from chemical and spectral data to regulatory classification, allowing a comprehensive evaluation of their strengths and limitations for different research applications.
Database review
A systematic analysis of the three selected databases was performed to assess their characteristics and usability. The appraisal focused on several key aspects, including the general description of each database, its stated purpose, key features, content coverage, data accessibility, and format. Additionally, a real-time search test was conducted in each database using a common NP as a representative example, allowing for the evaluation of search functionality, retrieval efficiency, and consistency of the provided information. This comparative assessment aimed to highlight the strengths and limitations of each database in terms of structural representation, metadata completeness, and user accessibility, ensuring a comprehensive understanding of their applicability in NP research.
Data element model generation
To facilitate a systematic comparison of the selected NP databases, we first developed a standardized data element model, a structured set of attributes that describe key characteristics of NPs. In this context, data elements refer to individual fields or metadata types used to represent information about an NP, such as its name, structure, bioactivity, or regulatory status. These elements were extracted from each data source and separated into two levels: (1) attributes (e.g. “Molecular Weight”), and (2) the corresponding attribute values, which are the actual data entries (e.g. “302.24 g/mol”). Both the attributes and the attribute values were harmonized across data sources. The attributes were then organized into thematic categories, including Identification (e.g. Database Identifier, Name, Synonyms, Canonical Simplified Molecular Input Line Entry System (SMILES), Standard International Chemical Identifier (InChI), InChI Key, International Union of Pure and Applied Chemistry (IUPAC) Name, Chemical Abstracts Service (CAS) Registration, Chemical Formula), Classification and Taxonomy (e.g. Chemical Class, Source Organism), Physical and Chemical Properties (e.g., Solubility, Molecular Weight, Aromatic Ring Count, pKa), Bioactivity and Pharmacological Information (e.g., Rule of Five, QED Drug Likeliness), Regulatory and Usage Information (e.g. Regulatory Status), Substance Relationships, External Database References, and Record Tracking Information. This standardized framework allowed for an objective assessment of the strengths and limitations of each database in representing NP information, supporting the identification of gaps and potential areas for improvement.
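The two-level model described above (attributes and their values, grouped into thematic categories) can be illustrated with a minimal Python sketch. This is not the implementation used in the study: the source labels, the raw field names in `FIELD_MAP`, and the `DataElement` and `harmonize` names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical mapping from source-specific field names to harmonized
# attribute names; the raw field names on the left are illustrative only.
FIELD_MAP = {
    "coconut": {"molecular_weight": "Molecular Weight", "canonical_smiles": "Canonical SMILES"},
    "npmrd": {"mol_weight": "Molecular Weight", "smiles": "Canonical SMILES"},
}

# Harmonized attribute -> thematic category (a small subset of the model).
CATEGORY = {
    "Molecular Weight": "Physical and Chemical Properties",
    "Canonical SMILES": "Identification",
}


@dataclass(frozen=True)
class DataElement:
    category: str   # thematic category, e.g. "Identification"
    attribute: str  # harmonized attribute name (level 1)
    value: str      # attribute value as recorded in the source (level 2)
    source: str     # database the value came from


def harmonize(source: str, record: dict) -> list[DataElement]:
    """Translate one raw record into harmonized data elements."""
    elements = []
    for raw_name, value in record.items():
        attribute = FIELD_MAP[source].get(raw_name)
        if attribute is None or value in (None, ""):
            continue  # unmapped or empty fields are skipped
        elements.append(DataElement(CATEGORY[attribute], attribute, str(value), source))
    return elements


# Toy record: one mapped value, one empty value, one unmapped field.
elements = harmonize("npmrd", {"mol_weight": 302.24, "smiles": "", "internal_flag": 1})
```

Keeping the source label on every element makes it possible to trace a harmonized value back to the database it came from when sources disagree.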
Data element coverage comparison
To assess the extent, or robustness, of data representation in each selected database, we systematically evaluated each database to determine its coverage of the harmonized set of attributes. Each database was examined for the presence or absence of each attribute in the harmonized set, as well as presence or absence of each category of attributes.
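A presence/absence assessment of this kind can be sketched as follows. The attribute lists and the two placeholder databases (`DB_A`, `DB_B`) are illustrative assumptions, not the actual checklist or results of the study.

```python
# Harmonized attributes grouped by category (a small subset for illustration).
MODEL = {
    "Identification": ["Name", "Canonical SMILES", "InChI Key"],
    "Regulatory and Usage Information": ["Record UNII", "Regulatory Status"],
}

# Attributes each database exposes; these sets are illustrative placeholders.
AVAILABLE = {
    "DB_A": {"Name", "Canonical SMILES", "InChI Key"},
    "DB_B": {"Name", "Record UNII", "Regulatory Status"},
}


def coverage(model, available):
    """Return {database: {category: (attributes present, attributes expected)}}."""
    table = {}
    for db, attrs in available.items():
        table[db] = {
            category: (sum(a in attrs for a in wanted), len(wanted))
            for category, wanted in model.items()
        }
    return table


result = coverage(MODEL, AVAILABLE)
```

A category counts as present for a database when the first number in its tuple is greater than zero, which gives the category-level view alongside the attribute-level one.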
Overlap analysis
To assess the overlap of NP compounds among the selected databases, we used the InChI Key as a unique molecular identifier to standardize compound representation across datasets. Unlike SMILES strings, which can differ between tautomeric forms of the same compound, the standard InChI Key normalizes many mobile-hydrogen tautomers to a single identifier. InChI Keys were extracted from each database, and duplicate keys, including those arising from tautomeric variants, were removed before comparison. A Venn diagram was then generated to visualize the number of distinct compounds in each database and their shared entries.
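The set logic behind a three-way Venn diagram can be sketched in a few lines of Python. The function names are hypothetical, and apart from the quercetin key (taken from Table 3) the InChI Keys below are made-up placeholders, so the counts say nothing about the real databases.

```python
def normalize(keys):
    """Strip whitespace, upper-case, and deduplicate InChI Keys."""
    return {k.strip().upper() for k in keys if k and k.strip()}


def venn_regions(a: set, b: set, c: set) -> dict:
    """Sizes of the seven disjoint regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }


QUERCETIN = "REFJWTPEDVJJIY-UHFFFAOYSA-N"  # from Table 3
# Placeholder keys below are invented for illustration only.
coconut = normalize([QUERCETIN, " refjwtpedvjjiy-uhfffaoysa-n ", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"])
npmrd = normalize([QUERCETIN, "BBBBBBBBBBBBBB-UHFFFAOYSA-N"])
gsrs = normalize([QUERCETIN, "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"])

regions = venn_regions(coconut, npmrd, gsrs)
```

Because the seven regions are disjoint, their sizes sum to the size of the union, which is a convenient sanity check before plotting.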
Case example selection and search procedure
To illustrate cross-database comparison and data integration, quercetin was selected as a representative example. Quercetin is a well-known flavonoid commonly found in various fruits, vegetables, and grains, and is recognized for its antioxidant, anti-inflammatory, and potential anticancer properties. It serves as an ideal case study for this comparison because it is widely studied, represented across multiple NP databases, and associated with diverse types of metadata. This makes it particularly useful for illustrating how information about a single compound can vary across sources, highlighting both the complementarity and the gaps among the databases. For each database, we performed a targeted search using the compound name or InChI Key and extracted relevant data elements based on our standardized model, including identifiers, structural information, classifications, bioactivity, and regulatory data.
Results
Overview of each database
COCONUT is one of the largest publicly available NP repositories, containing 695,133 curated NP entries aggregated from multiple open-access sources. It is designed for browsing, searching, and efficiently downloading data through a user-friendly web interface. The database offers multiple download formats, including SDF and CSV. The latest dataset (October 2024 release) is available under a Creative Commons CC0 license, permitting unrestricted use, modification, and distribution without attribution.
A key strength of COCONUT is its comprehensive molecular representation. Each NP entry is assigned a unique COCONUT ID that maps to a canonical SMILES and is accompanied by extensive molecular details, including IUPAC name, InChI, InChIKey, molecular weight, atom counts, aromatic ring count, hydrogen bond acceptors, and Lipinski’s Rule of Five (RO5) violations. The database also provides chemical and NP classification, enhancing its utility for compound categorization and comparative analysis.
In addition to chemical characterization, COCONUT integrates biological taxonomy, listing a comprehensive set of source organisms for each NP. Each organism entry includes a direct link to the COCONUT Organisms page and an external reference to Ontobee (NCBI organismal classification). As an aggregated dataset, COCONUT provides extensive cross-references to hundreds of external resources, such as PubChem Substance, KNApSaCK, DrugBank, Phenol Explorer, FooDB, and ChEBI, enhancing data interoperability and facilitating multi-database integration.
While COCONUT excels in molecular diversity and computational accessibility, it lacks physical properties, detailed experimental spectral data and direct pharmacological activity annotations, which may limit its direct applicability for experimental NP validation and drug development workflows. However, its vast chemical coverage and open-access nature make it an essential resource.
NP-MRD is a specialized NP database that focuses on extensive chemical, spectral, and physical property data, making it a valuable resource for structure elucidation, dereplication, and experimental validation. The database contains 289,609 distinct NP-MRD entries, corresponding to 288,789 unique SMILES, and provides a user-friendly web interface for browsing and searching.
In addition to standard chemical properties (e.g., molecular formula, SMILES, InChI), NP-MRD offers a comprehensive set of physical properties, including experimental data such as water solubility, melting point, and boiling point, as well as predicted properties such as pKa (strongest acidic and basic), molar refractivity, and polarizability. This makes NP-MRD particularly useful for computational modeling, solubility predictions, and pharmacokinetic assessments. A key feature of NP-MRD is its spectral data coverage, which spans diverse spectral data formats. Particularly valuable is the growing repository of raw NMR spectral data for NPs, which users can download for subsequent local interrogation16. Users can also access and download metadata in XML and JSON, as well as structure data in SDF and SMILES formats. Support for these formats facilitates integration into analytical chemistry workflows. In terms of biological taxonomy, NP-MRD provides a full list of species of origin, including species name, source, and references, allowing researchers to easily trace NPs back to their biological sources. Additionally, external links are provided to related resources, supporting data interoperability and further exploration.
While NP-MRD excels in spectral data, physical properties, and taxonomic details, it does not emphasize regulatory classification or broad cheminformatics applications like some other databases. However, its high-quality spectral and experimental data make it a very valuable resource for researchers focused on NP characterization, analytical chemistry, and structure verification.
GSRS is a comprehensive database developed through a collaboration between the U.S. Food and Drug Administration (FDA) and the National Center for Advancing Translational Sciences (NCATS). It provides detailed scientific descriptions and unique ingredient identifiers (UNIIs) for substances relevant to regulated products. The latest public data release (v2.5.1–20200707) consists of 162,731 substance definitions and covers six types of substances referenced in the ISO 11238 standard, including chemicals (116,384), nucleic acids (597), proteins (7,214), polymers (2,527), structurally diverse (27,610), and concepts (5,269). The substance types differ significantly in their structural and functional characteristics. Chemicals are well-defined small molecules, including both NP compounds and synthetic compounds. Nucleic acids and proteins refer to large biomolecules essential for regulatory and therapeutic tracking but are not typically considered NPs. Polymers encompass complex macromolecules, including both synthetic polymers and natural biopolymers, with structural characteristics that differ from small molecules. Structurally diverse substances include mixtures or complex materials of biological origin that cannot be fully characterized by a single chemical structure. Concepts represent abstract entries that serve as categories, classes, or generalized descriptors rather than specific chemical entities. Among these categories, the chemicals type is most relevant for comparison with the other two NP databases: COCONUT and NP-MRD, as it provides compound-level information such as chemical structure, molecular formula, and classification. However, unlike the other two databases, GSRS includes both NPs and synthetic compounds, making it a more comprehensive but less specialized resource for NP research. 
Furthermore, GSRS focuses heavily on regulatory aspects that facilitate data harmonization and compliance tracking, rather than comprehensive spectral or taxonomic information. A key strength of GSRS is its ability to capture relationships among substances to provide additional biological and manufacturing context that helps regulatory authorities identify and track product ingredients in various applications and information sources.
The public GSRS data are freely available at https://gsrs.ncats.nih.gov, but the public site does not support user registration. For more personalized features and functionalities, users can use PrecisionFDA, which holds the same extensive substance data but offers a user-friendly registration system. GSRS is also integrated with the FDA’s DailyMed platform to support regulatory decision-making and medical health research. Through PrecisionFDA, qualified users can request specialized access to data and use additional functions that are not available on the public GSRS platform. The UNIIs are searchable in PrecisionFDA’s dedicated UNII search tool at https://precision.fda.gov/uniisearch. The data can be downloaded in various formats, including TXT, CSV, XLSX, JSON, and SDF. GSRS includes extensive references to other substance information sources, including Common Chemistry, Inxight Drugs, DailyMed Regulated Products, GSRS Full Record, NCI Thesaurus, PubChem, and CompTox Chemicals Dashboard. This comprehensive cross-referencing enhances the integration of multiple databases.
While GSRS provides extensive substance definitions, it does not provide the detailed physical and chemical properties or taxonomic information found in the other databases. Nor does it provide comprehensive pharmacological or toxicological profiles, which limits its utility for drug discovery and safety assessments of regulated products.
Comparison of Database Adequacy and Completeness
A comprehensive list of data elements was used as a standard checklist. The presence or absence of each data element in these databases was systematically assessed (Table 1).
Table 1.
Comparison of data elements across existing NP databases.
| Section | Facet | COCONUT | NP-MRD | GSRS (Compound) | GSRS (Substance) |
|---|---|---|---|---|---|
| 1. Identification | Database Identifier | Yes | Yes | Yes | Yes |
| | Name & Synonyms | Yes | Yes | Yes | Yes |
| | Canonical SMILES | Yes | Yes | Yes | No |
| | Standard InChI/InChI Key | Yes | Yes | Yes | No |
| | IUPAC Name | Yes | Yes | Yes | No |
| | CAS Registration | Yes | Yes | Yes | Yes |
| | Chemical Formula | Yes | Yes | Yes | No |
| 2. Classification & Taxonomy | Chemical Classification (Kingdom, Chemical Class/Subclass/Superclass) | Yes | Yes | No | No |
| | Structural Classification (Direct/Alternative Parent, Substituents, Molecular/Murcko Framework, Stereochemistry, Defined Stereocenters, E/Z Centers) | Yes | Yes | Yes | No |
| | Biological Classification (Species of Origin, Species Where Detected, Organisms, Geolocations) | Yes: a list of organisms with links to Ontobee | Yes: a list of species | No | Yes: includes Source Materials |
| | NP Classification (NP Pathway/Class/Superclass/Is Glycoside) | Yes | No | No | No |
| 3. Physical and Chemical Properties | Phase and Thermodynamic properties (State, Melting Point, Boiling Point) | No | Yes | No | No |
| | Solubility (Water Solubility, LogP/ALogP/ClogP/Experimental LogP) | Yes | Yes | No | No |
| | pKa (Strongest Acidic/Basic) | No | Yes | No | No |
| | Molecular Weight/Exact Mass | Yes | Yes | Yes | No |
| | Total/Heavy Atom Count | Yes | No | No | No |
| | Van der Waals Volume | Yes | No | No | No |
| | Fraction of Csp3 | Yes | No | No | No |
| | Ring Structure (Aromatic Rings Count, Number of Rings, Minimal Number of Rings) | Yes | Yes | No | No |
| | Contains Ring/Linear Sugars | Yes | No | No | No |
| | Rotatable Bond Count | Yes | Yes | No | No |
| | Formal/Physiological Charge | Yes | Yes | Yes | No |
| | Hydrogen Acceptor/Donor Count | Yes | Yes | No | No |
| | Polar Surface Area (Topological/Predicted) | Yes | Yes | No | No |
| | Optical Activity | No | Yes | Yes | No |
| | Molar Refractivity/Polarizability | No | Yes | Yes | No |
| 4. Bioactivity and Pharmacological Information | Bioavailability | No | Yes | No | No |
| | Rule of Five/Lipinski’s Rule/Violations | Yes | Yes | No | No |
| | Ghose Filter | No | Yes | No | No |
| | Veber’s Rule | No | Yes | No | No |
| | MDDR-like Rule | No | Yes | No | No |
| | QED Drug Likeliness | Yes | No | No | No |
| 5. Regulatory & Usage Classification | Record UNII | No | No | Yes | Yes |
| | Record Protection Status | No | No | Yes | Yes |
| | Regulatory Status | No | No | Yes | Yes |
| | Usage Classification (Orphan Drug, Dietary Supplement, LiverTox, …) | No | No | Yes | Yes |
| 6. Substance Relationships | Substance Composition (Active Moiety, Metabolites, Impurities, Constituents) | No | No | Yes | Yes |
| | General Relationships | No | No | Yes | Yes |
| 7. References to External Database | Metabolomics & Biochemical Databases | Phenol Explorer Compound, FooDB, KNApSAcK, ReSpect, HIM, Watermelon Database | HMDB, Phenol Explorer Compound, FooDB, KNApSAcK, METLIN | HMDB, Phenol Explorer Compound, FooDB, KNApSAcK, Metabolomics Workbench | |
| | Drug & Pharmacology Databases | DrugBank | DrugBank | DrugBank, DAILYMED, RXCUI, PharmGKB, Pharos Ligand, ChEMBL, DRUG CENTRAL | |
| | NP & Chemical Compound Databases | ChemSpider, PubChem NPs Substance, ChEBI, TIPdb, p-ANAPL, ANPDB, AfroCancer, UNPD, AfroDB, AnalytiCon Discovery NPs, GNPS, NPCARE, CMAUP, InPACdb, Australian NPs, EMNPD, SANCDB, TCMDB@Taiwan, TCMID, StreptomeDB, HIT, NANPDB, BitterDB, NPASS, VietHerB, Phyto4Health, InterBioScreen, Mitishamba, ETM-DB, BIOFACQUIM, UEFS, NPACT, NPEdia, Specs NPs, LANaPDB, ConMedNP, NuBBEDB, Super Natural II | ChemSpider, PubChem Compound, ChEBI | ChemSpider, PubChem Compound, ChEBI | |
| | Biological Pathways & Systems Biology Databases | | KEGG Compound, BioCyc, BiGG | KEGG Compound, BioCyc | |
| | Toxicology & Environmental Databases | TPPT, Exposome-Explorer | | EPA_CompTox, DSSTox, NSC, HSDB | |
| | Structural & 3D Databases | | PDB, Good Scents | | |
| | Regulatory & Identifier Databases | UN | Wikipedia | Enzyme Commission, NCIT, Wikipedia, Wikidata, Nikkaji, MESH, MERCK INDEX, EVMPD, GRAS Notification, ECHA (EC/EINECS), FDA UNII | |
| 8. Record Tracking Information | Created At/By | Yes | Yes | Yes | Yes |
| | Last Edited At/By | Yes | Yes | Yes | Yes |
| | Version | No | Yes | Yes | Yes |
Across the databases, identification-related attributes are well-represented. Classification and taxonomy elements vary significantly, with COCONUT providing chemical, structural, biological, and NP classifications; NP-MRD providing chemical, structural, and biological classifications; and GSRS providing source materials only at the substance level. The coverage of physical and chemical properties is inconsistent, with COCONUT and NP-MRD offering extensive information, whereas GSRS lacks key physicochemical attributes. Bioactivity and pharmacological properties, including bioavailability and drug-likeness assessments, are present in COCONUT and NP-MRD but largely absent from GSRS. Regulatory information and substance relationships are available only in GSRS. Additionally, all three databases integrate numerous external references across metabolomics, NP, and chemical compound databases. COCONUT focuses more on links with other NP databases; NP-MRD focuses more on biological pathways, systems biology databases, and structural databases; and GSRS focuses more on links with drug/pharmacology databases and regulatory databases. Finally, record tracking information is included in all databases. This comparative evaluation indicates that there are significant differences in scope and emphasis across databases.
Each database was evaluated for the presence or absence of key data categories to assess its overall completeness (Table 2). Data elements within the same category were grouped, and the number and percentage in the table indicate how many items within a database contain at least one data element from each category. It should be noted that a high completeness percentage does not necessarily reflect comprehensive coverage of all relevant data elements within a category. For example, GSRS appears to have high completeness for physical and chemical properties; this is due to the consistent availability of basic information, such as molecular weight, whereas all other physicochemical attributes are absent.
Table 2.
Completeness of selected data elements within existing databases
| Section | COCONUT | NP-MRD | GSRS |
|---|---|---|---|
| Identification | 695133 (100%) | 289609 (100%) | 116008 (100%) |
| Classification and Taxonomy | 694336 (99.9%) | 250116 (86.4%) | ? |
| Physical and Chemical Properties | 695133 (100%) | 289509 (99.97%) | ? |
| Bioactivity and Pharmacological Information | 695133 (100%) | 270139 (93.3%) | - |
| Regulatory and Usage Information | - | - | 115412 (99.5%) |
| Substance Relationships | - | - | ? |
| External Database References | 695133 (100%) | 241956 (83.5%) | 112429 (96.91%) |
| Record Tracking Information | 695133 (100%) | 289609 (100%) | ? |
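The counting rule behind Table 2 (a record is counted for a category if it carries at least one non-empty data element from that category) can be sketched as follows. The function name, field names, and toy records are hypothetical and do not reproduce the study's data or code.

```python
def completeness(records, categories):
    """Count records with at least one non-empty attribute per category.

    Returns {category: (record count, percentage of all records)}.
    """
    total = len(records)
    summary = {}
    for category, attributes in categories.items():
        hits = sum(
            any(rec.get(attr) not in (None, "", []) for attr in attributes)
            for rec in records
        )
        pct = 100.0 * hits / total if total else 0.0
        summary[category] = (hits, round(pct, 1))
    return summary


# Hypothetical field names and toy records for illustration only.
categories = {
    "Identification": ["name", "inchikey"],
    "Physical and Chemical Properties": ["molecular_weight", "melting_point"],
}
records = [
    {"name": "quercetin", "molecular_weight": 302.24},
    {"inchikey": "REFJWTPEDVJJIY-UHFFFAOYSA-N", "molecular_weight": None, "melting_point": ""},
    {"name": "compound x", "melting_point": "316 - 318 °C"},
    {"name": "compound y"},
]

summary = completeness(records, categories)
```

The first toy record shows the caveat noted in the text: a record that carries only a molecular weight still counts as complete for the physical and chemical properties category.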
Overlap between databases
A Venn diagram based on InChI Key identifiers was used to assess the overlap and uniqueness of NP compounds across COCONUT, NP-MRD, and GSRS (Figure 1). While each database contains many unique compounds, the overlap across all three is relatively small, indicating that these resources capture distinct subsets of the NP chemical space. It is important to note that GSRS does not distinguish between NP compounds and synthetic compounds; the overlap analysis based on InChI Key identifiers may therefore introduce bias by overestimating the uniqueness of GSRS in this NP-focused study. The results highlight COCONUT’s broad coverage, NP-MRD’s focus on structurally validated compounds, and GSRS’s regulatory emphasis, underscoring the need for better cross-referencing across NP databases.
Figure 1.

Overlap of NP compounds among databases based on InChI Key.
Illustration with a Search Example of Quercetin
Quercetin was used as a representative example to demonstrate the process of searching across the three selected NP databases and to illustrate how NPs can be comprehensively represented by integrating information from multiple resources (Table 3). Only the most relevant data elements across the databases are presented. This example demonstrates the necessity of accessing multiple databases to obtain a comprehensive view of an NP, particularly when aiming to explore its taxonomic, chemical, biological, pharmaceutical, and regulatory information. Integrating these resources would significantly enhance their applicability for computational analysis, experimental studies, and advancing WPH research.
Table 3.
Selected data elements to represent Quercetin using information in existing resources
| Section | Data element | Representative example |
|---|---|---|
| 1. Identification | Name | Quercetin |
| | Canonical SMILES | O=C1C(O)=C(C2=CC=C(O)C(O)=C2)OC2=CC(O)=CC(O)=C12 |
| | IUPAC Name | 2-(3,4-dihydroxyphenyl)-3,5,7-trihydroxy-chromen-4-one |
| | Chemical Formula | C15H10O7 |
| | InChI Key | REFJWTPEDVJJIY-UHFFFAOYSA-N |
| 2. Classification & Taxonomy | Chemical Class | Flavonoids |
| | Chemical Subclass | Flavones |
| | Chemical Superclass | Phenylpropanoids and Polyketides |
| | Direct Parent Classification | Flavonols |
| 3. Physical and Chemical Properties | State | Solid |
| | Experimental Water Solubility | 0.06 mg/mL at 16 °C |
| | Experimental Melting Point | 316 - 318 °C |
| | Predicted Water Solubility | 0.26 g/L |
| | Molecular Weight | 302.24 |
| | Exact Molecular Weight | 302.04265 |
| | Van Der Waals Volume | 239.07 |
| 4. Bioactivity and Pharmacological Information | Bioavailability | Yes |
| | Rule of Five/Lipinski’s Rule/Violations | Yes |
| | Veber’s Rule | No |
| | MDDR-like Rule | No |
| 5. Regulatory & Usage Classification | Record Protection Status | Public record |
| | Record Status | Validated (UNII) |
| | APPROVAL_ID | 9IKM0I5T1E |
| 6. Substance Relationships | Active Moiety | QUERCETIN -> Quercetin (ACTIVE MOIETY) |
| | Metabolites | QC-12 (PRODRUG) -> Quercetin (METABOLITE ACTIVE) |
| | Impurities | RUTIN (PARENT) -> Quercetin (IMPURITY ACTIVE) |
| | Constituents | FENUGREEK SEED (PARENT) -> Quercetin (CONSTITUENT ALWAYS PRESENT) |
| 7. References to External Database | DrugBank ID | DB04216 |
| | Phenol Explorer Compound ID | 291 |
| | FooDB ID | FDB011904 |
| | KNApSAcK ID | C00004631 |
| | ChemSpider ID | 4444051 |
| | PubChem Compound ID | 5280343 |
| 8. Record Tracking Information | Version | 50 |
Discussion
Significance
Despite the prevalent use of NPs for nutrition, health promotion, and medicinal purposes, many of these products remain insufficiently studied, primarily due to the challenges of conducting rigorous research on chemically complex materials. Addressing this gap is particularly crucial within the NCCIH’s Whole Person Health Initiative, which emphasizes studying multicomponent interventions across interconnected biological systems rather than focusing on single disease models17. The reciprocal interactions between NPs, the gut microbiome, and systemic health further underscore the importance of a systems-level approach to evaluating their effects. By leveraging computational and integrative methodologies, researchers can uncover novel mechanisms and interactions between NPs and biological systems, contributing to a more holistic understanding of their role in health and resilience.
Given the fragmentation and heterogeneity of available NP data, the lack of standardized and interoperable sources makes it difficult to systematically analyze and compare NPs and their effects on WPH. Previous initiatives, such as the Consortium Advancing Research on Botanicals and Other Natural Products (CARBON), have focused on developing methodologies for NP characterization and bioactivity assessment, yet the need for integrated, standardized data repositories remains. To effectively harmonize NP data from diverse databases, a multi-layered standardization strategy could be adopted, centered on the use of universally recognized chemical identifiers. The InChI Key may serve as the primary unique identifier for NP compounds, ensuring consistency and enabling precise matching across disparate sources. To reinforce the accuracy of these linkages, SMILES could be used as a supplementary validation mechanism. This dual-identifier approach can help reduce redundancy and misalignment. For entries lacking structured chemical identifiers, such as those identified only by names, tools like QuickUMLS may assist in mapping these entities to Concept Unique Identifiers (CUIs) in the Unified Medical Language System (UMLS), enabling semantic normalization and alignment of nomenclature across datasets. Collectively, these strategies provide a scalable framework for harmonizing NP databases and improving interoperability, thereby facilitating downstream applications in cheminformatics, biomedical research, and AI-driven discovery.
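A minimal sketch of the proposed dual-identifier strategy is given below, under several stated assumptions. Records are plain dictionaries with hypothetical keys (`inchikey`, `smiles`, `name`, `source`); the SMILES check is a plain string comparison, which is only meaningful if all sources were canonicalized with the same toolkit; and `normalize_name` is a crude stand-in for semantic normalization, not QuickUMLS or a UMLS lookup. Apart from the quercetin identifiers from Table 3, the records are invented.

```python
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Crude stand-in for semantic normalization (not QuickUMLS)."""
    return " ".join(name.lower().split())


def link_records(records):
    """Group records by InChI Key, falling back to a normalized name.

    Returns (groups, conflicts); conflicts lists the InChI Key groups whose
    members disagree on SMILES and may need manual review.
    """
    groups = defaultdict(list)
    for rec in records:
        if rec.get("inchikey"):
            key = ("inchikey", rec["inchikey"].strip().upper())
        elif rec.get("name"):
            key = ("name", normalize_name(rec["name"]))
        else:
            continue  # nothing to match on
        groups[key].append(rec)

    conflicts = []
    for key, members in groups.items():
        smiles = {m["smiles"] for m in members if m.get("smiles")}
        if key[0] == "inchikey" and len(smiles) > 1:
            conflicts.append(key)  # same key, different SMILES strings
    return dict(groups), conflicts


Q_KEY = "REFJWTPEDVJJIY-UHFFFAOYSA-N"
Q_SMILES = "O=C1C(O)=C(C2=CC=C(O)C(O)=C2)OC2=CC(O)=CC(O)=C12"
records = [
    {"source": "db_a", "inchikey": Q_KEY, "smiles": Q_SMILES},
    {"source": "db_b", "inchikey": Q_KEY.lower(), "smiles": Q_SMILES},
    {"source": "db_c", "name": "  Quercetin "},  # name-only entry
    # Placeholder key with two made-up SMILES strings to trigger a conflict.
    {"source": "db_a", "inchikey": "BBBBBBBBBBBBBB-UHFFFAOYSA-N", "smiles": "CCO"},
    {"source": "db_b", "inchikey": "BBBBBBBBBBBBBB-UHFFFAOYSA-N", "smiles": "CCN"},
]

groups, conflicts = link_records(records)
```

In this sketch name-only entries stay in their own groups rather than being merged into an InChI Key group; merging them would require a curated name-to-structure mapping, which is the step the text assigns to UMLS-based tools.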
Despite the growing interest in WPH, no existing NP database has been specifically structured to support this research paradigm. Traditional NP databases, such as COCONUT, NP-MRD, and GSRS, primarily focus on chemical structure, spectral data, and regulatory classification, respectively. However, they lack integration with clinical, multi-omics, behavioral, and environmental data sources—key components necessary to study how NPs influence whole-body systems. Additionally, there is limited effort to link computational tools with experimental research to uncover complex interactions between NPs and multiple biological systems. Without an integrated data framework, the full potential of AI and multi-scale modeling to predict synergistic effects of NP mixtures remains largely untapped.
Our study addresses this gap by systematically evaluating existing NP databases in the context of advancing WPH research. By reviewing, categorizing, and comparing COCONUT, NP-MRD, and GSRS, we identified strengths, limitations, and missing elements critical for constructing an integrated NP informatics framework. Our analysis highlights the need and potential directions for a harmonized data structure. Such a framework would support future integration of NP data with complementary sources, such as multi-omics profiles, Electronic Health Records (EHRs), and public health datasets, which is essential for building computational models capable of capturing the complex, multiscale effects of NPs on WPH. For example, linking NP chemical and bioactivity data with metabolomics profiles could help elucidate metabolic pathways influenced by specific compounds. Integration with EHRs may enable retrospective analyses of NP-related exposures or supplement safety based on real-world evidence. Similarly, aligning NP databases with nutrition, microbiome, or exposome datasets could support population-level studies on diet-derived compounds and environmental modifiers of health. These directions illustrate how harmonized NP data could play a central role in translational pipelines that span molecular mechanisms to clinical and public health outcomes.
Our analysis of COCONUT, NP-MRD, and GSRS reveals that while these databases serve different but complementary purposes, they each have limitations in the context of WPH research. COCONUT provides the broadest coverage of NPs, focusing on chemical structures, molecular descriptors, and taxonomic origins, but lacks bioactivity and regulatory insights. NP-MRD, with its emphasis on spectral data and experimental validation, offers valuable structural and physical property information, yet it does not integrate multi-omics or health-related data. GSRS, designed for regulatory tracking, captures detailed substance classifications, pharmaceutical formulations, and compliance records, but it does not systematically link NPs to biological effects or multisystem interactions.
Despite their individual strengths, these databases exhibit only moderate to small overlap, highlighting the fragmentation of NP information across cheminformatics, experimental, and regulatory domains. This lack of integration presents a significant challenge for WPH research, which requires multiscale data linking NPs to biological, environmental, and physiological systems. Our findings underscore the need for enhanced interoperability, improved data harmonization, and cross-referencing between cheminformatics, clinical, and multi-omics resources to enable comprehensive, data-driven investigations into the complex effects of NPs on whole-body health.
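To make the notion of cross-database overlap concrete, the following Python sketch counts shared identifiers between every pair of databases and reports a Jaccard index. It is a hedged illustration only: the function name `pairwise_overlap`, the choice of the Jaccard index as a summary statistic, and the toy key sets are assumptions and do not reproduce the actual contents of COCONUT, NP-MRD, or GSRS or the exact procedure used in this study.

```python
from itertools import combinations


def pairwise_overlap(db_keys):
    """Return shared-key counts and Jaccard indices for every pair of databases."""
    results = {}
    for a, b in combinations(sorted(db_keys), 2):
        shared = db_keys[a] & db_keys[b]
        union = db_keys[a] | db_keys[b]
        results[(a, b)] = {
            "shared": len(shared),
            # Guard against two empty sets to avoid division by zero.
            "jaccard": len(shared) / len(union) if union else 0.0,
        }
    return results


# Toy key sets standing in for InChIKeys; these are not real database contents.
db_keys = {
    "COCONUT": {"K1", "K2", "K3", "K4"},
    "NP-MRD": {"K3", "K4", "K5"},
    "GSRS": {"K4", "K6"},
}

overlap = pairwise_overlap(db_keys)
for pair, stats in overlap.items():
    print(pair, stats)
```

Reporting the raw shared count alongside a normalized index is a deliberate choice in this sketch, because databases of very different sizes can share many compounds in absolute terms while still showing a small relative overlap.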
Limitations
While our study provides a comprehensive comparison of three major NP databases (COCONUT, NP-MRD, and GSRS), several limitations should be acknowledged. One potential gap is the lack of medical and pharmaceutical information on NPs, such as dosing guidelines, contraindications, adverse reactions, and toxicology. Resources like the Natural Product Information (Professional) from Drugs.com offer valuable information, but were not included in our analysis due to several constraints: (i) the platform is limited to web browsing, with no option for downloading, scraping, or otherwise extracting content; (ii) it primarily relies on common names as identifiers and lacks standardized chemical identifiers such as InChIKey or SMILES, which complicates integration efforts; and (iii) it has relatively limited coverage. Future integration efforts could explore strategies such as mapping name-based entries to standardized identifiers using NLP-based entity linking. Additionally, collaboration with clinical or regulatory data providers could enable access to structured metadata under appropriate licensing terms. These steps could help bridge the gap between chemical-level NP data and clinically relevant information, further enhancing the utility of NP databases for translational and WPH research.
Additionally, our study is inherently limited by the subjective nature of data element categorization. While we aimed for a logical and standardized grouping of data elements, some distinctions remain ambiguous. Furthermore, while our analysis identifies moderate to small overlap between databases, this does not capture potential semantic inconsistencies or nomenclature variations that could further impact integration. These limitations highlight the need for future efforts to incorporate pharmaceutical and clinical data sources, establish standardized metadata frameworks, and refine NP data classification methods to improve interoperability and support WPH research.
Another limitation of our overlap analysis is the potential bias introduced by the inclusion of synthetic compounds in GSRS. Since GSRS does not explicitly distinguish NPs from synthetic substances, the use of InChIKey identifiers may overestimate the uniqueness of GSRS entries relative to the NP-focused COCONUT and NP-MRD databases. Addressing this bias will require clearer annotation of compound origin in GSRS and similar databases, highlighting the need for better classification of natural versus synthetic substances to enable more accurate cross-database comparisons in future research.
Perspectives
The findings from this study highlight the need for improved integration and interoperability of NP databases to enhance their relevance for WPH research applications. As the field continues to advance, developing a unified framework capable of linking chemical, spectral, taxonomic, pharmacological, and regulatory information will be critical for leveraging computational approaches, including machine learning, cheminformatics, and knowledge graphs. Integrating existing NP resources with multi-omics, clinical, and environmental data will allow researchers to explore complex interactions between NPs and interconnected biological systems more effectively.
Future efforts should focus on standardizing data formats, harmonizing metadata frameworks, and improving cross-references across NP databases. However, these goals are often challenged by technical barriers such as inconsistent chemical representations, heterogeneous metadata standards, and incomplete or ambiguous entries. For instance, variations in naming conventions, stereochemical details, and missing structural identifiers can hinder accurate integration, even with tools like InChIKey and SMILES.
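One way stereochemical differences can affect identifier-based matching is sketched below in Python. In a standard InChIKey, the first 14-character block is derived from the connectivity layers, while stereochemistry and isotopic information feed into the second block, so two entries may agree on the first block yet differ overall. The function names `connectivity_block` and `match_level` and the example keys are hypothetical placeholders introduced for illustration; this is a sketch of a tiered comparison, not a method used in this study.

```python
def connectivity_block(inchikey):
    """Return the first hyphen-separated block of an InChIKey."""
    return inchikey.strip().upper().split("-")[0]


def match_level(key_a, key_b):
    """Classify two InChIKeys as 'exact', 'connectivity_only', or 'none'."""
    a, b = key_a.strip().upper(), key_b.strip().upper()
    if a == b:
        return "exact"
    if connectivity_block(a) == connectivity_block(b):
        # Same skeleton hash, different second block (e.g. stereo variants).
        return "connectivity_only"
    return "none"


# Made-up placeholder keys in InChIKey layout; not real compounds.
key_1 = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
key_2 = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"
key_3 = "DDDDDDDDDDDDDD-BBBBBBBBSA-N"
print(match_level(key_1, key_1), match_level(key_1, key_2), match_level(key_1, key_3))
```

Treating connectivity-only matches as a separate tier, rather than silently merging or discarding them, leaves room for manual review of entries whose stereochemistry is undefined in one source and fully specified in another.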
Furthermore, semantic inconsistencies—such as synonymy and varied ontologies—complicate the mapping of biomedical concepts, requiring both advanced NLP tools and alignment with established ontologies. Incorporating domain-specific resources such as ChEBI for chemical entities, MeSH for pharmacological categories, UMLS for clinical terminology, and NCBI Taxonomy for organisms can help standardize terminology and improve cross-database interoperability. Aligning NP data with these ontologies would improve semantic consistency, facilitate automated reasoning and search, and support interoperability with clinical and biomedical data systems.
Additionally, mechanisms to incorporate medical and pharmaceutical information, such as dosage, interactions, and toxicity, are essential for advancing data-driven investigations of NPs. Bridging these gaps will enable more robust computational models that can predict NP effects on health resilience and complex, whole-body systems, ultimately contributing to the broader goals of WPH research.
Conclusion
This study systematically compared three major NP databases (COCONUT, NP-MRD, and GSRS), highlighting their complementary strengths and limitations in supporting NP research for WPH. COCONUT provides extensive chemical diversity, NP-MRD offers valuable spectral and physical property data, and GSRS focuses on standardized regulatory classification. However, only moderate to small overlap exists between these resources, revealing fragmentation in NP information. Additionally, critical gaps remain in integrating medical and pharmaceutical information relevant to holistic health research. Future efforts should focus on enhancing data interoperability, developing comprehensive data models, and integrating clinical and multi-omics data to support advanced computational methodologies aimed at understanding NPs’ systemic effects on health.
References
- 1. Hong J. Role of natural product diversity in chemical biology. Curr Opin Chem Biol. 2011;15:350–354. doi: 10.1016/j.cbpa.2011.03.004.
- 2. Newman D. J., Cragg G. M. Natural Products as Sources of New Drugs over the Nearly Four Decades from 01/1981 to 09/2019. J. Nat. Prod. 2020;83:770–803. doi: 10.1021/acs.jnatprod.9b01285.
- 3. Hou Y., Zhang R. Enhancing Dietary Supplement Question Answer via Retrieval-Augmented Generation (RAG) with LLM. 2024.09.11.24313513. 2024. Preprint at .
- 4. Su C., et al. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience. 2023;26:106460. doi: 10.1016/j.isci.2023.106460.
- 5. Xiao Y., et al. Repurposing non-pharmacological interventions for Alzheimer’s disease through link prediction on biomedical literature. Sci Rep. 2024;14:8693. doi: 10.1038/s41598-024-58604-8.
- 6. Su C., Hou Y., Levin M., Zhang R., Wang F. Protocol to implement a computational pipeline for biomedical discovery based on a biomedical knowledge graph. STAR Protoc. 2023;4:102666. doi: 10.1016/j.xpro.2023.102666.
- 7. Su C., Hou Y., Wang F. GNN-based Biomedical Knowledge Graph Mining in Drug Development. In: Wu L., Cui P., Pei J., Zhao L., editors. Graph Neural Networks: Foundations, Frontiers, and Applications. Singapore: Springer Nature; 2022. pp. 517–540. doi: 10.1007/978-981-16-6054-2_24.
- 8. Dzobo K. The Role of Natural Products as Sources of Therapeutic Agents for Innovative Drug Discovery. Comprehensive Pharmacology. 2022:408–422. doi: 10.1016/B978-0-12-820472-6.00041-4.
- 9. van Santen J. A., Kautsar S. A., Medema M. H., Linington R. G. Microbial natural product databases: moving forward in the multi-omics era. Nat Prod Rep. 2021;38:264–278. doi: 10.1039/d0np00053a.
- 10. Meijer D., et al. Empowering natural product science with AI: leveraging multimodal data and knowledge graphs. Nat Prod Rep. 2025;42:654–662. doi: 10.1039/d4np00008k.
- 11. Whole Person Health. What It Is and Why It’s Important. NCCIH. https://www.nccih.nih.gov/health/whole-person-health-what-it-is-and-why-its-important
- 12. Sorokina M., Merseburger P., Rajan K., Yirik M. A., Steinbeck C. COCONUT online: Collection of Open Natural Products database. J Cheminform. 2021;13:2. doi: 10.1186/s13321-020-00478-9.
- 13. Chandrasekhar V., et al. COCONUT 2.0: a comprehensive overhaul and curation of the collection of open natural products database. Nucleic Acids Res. 2024;53:D634–D643.
- 14. Wishart D. S., et al. NP-MRD: the Natural Products Magnetic Resonance Database. Nucleic Acids Res. 2021;50:D665–D677.
- 15. Peryea T., et al. Global Substance Registration System: consistent scientific descriptions for substances related to health. Nucleic Acids Res. 2020;49:D1179–D1185.
- 16. Wishart D. S., et al. The Natural Products Magnetic Resonance Database (NP-MRD) for 2025. Nucleic Acids Res. 2025;53:D700–D708. doi: 10.1093/nar/gkae1067.
- 17. Concept: Whole Person Research Initiative. NCCIH. https://www.nccih.nih.gov/grants/concept-whole-person-research-initiative
