Exploring and Comparing Existing Natural Product Databases Towards Whole Person Health Research

Xiaoyi Chen; Meijia Song; Yu Hou; Rubina F Rizvi; Jeffrey R Bishop; Piper A Ranallo; Thomas R Hoye; Rui Zhang

2026 Feb 14;2025:267–276.

PMCID: PMC12919503 PMID: 41726513

Abstract

Natural products (NPs) are essential in drug discovery, chemical biology, and medicinal chemistry. Despite their widespread use, NP data remain fragmented across various databases, limiting their utility for whole person health research, which requires comprehensive, interoperable resources. This study explores and compares three major NP databases, COCONUT, NP-MRD, and GSRS, assessing their scope, structural representation, metadata completeness, and accessibility. COCONUT provides extensive chemical diversity, NP-MRD emphasizes spectral and physical property data, and GSRS focuses on regulatory classification. Despite their strengths, overlap between databases is moderate to small, and significant gaps remain in integrating medical and pharmaceutical information. Improved interoperability and harmonization are needed to support advanced computational models for whole person health. Our findings highlight critical gaps and opportunities to enhance NP database integration, laying the groundwork for developing comprehensive resources that better support data-driven investigations of natural products.

Introduction

Natural products (NPs) have played a fundamental role in drug discovery, chemical biology, and medicinal chemistry. Their unique chemical scaffolds and extensive structural diversity enable interactions with biological targets that synthetic compounds often cannot achieve¹. Over 66% of approved therapeutic agents are directly or indirectly derived from NPs², and recent pharmaceutical surveys reveal that roughly half of all newly introduced small-molecule drugs are NP-based or closely related². Even today, NPs remain well-represented in the drug development pipeline; for example, in 2019 more than half of new therapeutic candidates were complex modalities such as proteins, nucleic acids, polymers, and structurally diverse NPs². Beyond therapeutics, NPs serve as indispensable chemical probes in chemical biology, illuminating biochemical pathways and disease mechanisms through their unique bioactivities¹. Their chemical complexity, high structural diversity, and bioactivity make them valuable resources for developing new therapeutic agents and functional materials.

In the digital era, robust databases have become crucial for managing the vast and growing body of scientific knowledge. High-quality, well-structured, and accessible databases are fundamental for organizing diverse data types, making them Findable, Accessible, Interoperable, and Reusable (FAIR). The rise of computational techniques, including bioinformatics, cheminformatics, machine learning, and knowledge graphs (KGs), has further increased the reliance on high-quality databases³⁻⁷. For instance, modern drug discovery and metabolomics increasingly employ in silico methods, which require large, well-annotated datasets for training and validation⁸. In NP research, AI-driven approaches have begun mining genomic, spectral, and chemical data to predict new structures or activities, thereby accelerating drug discovery, metabolomics, and related fields⁹. However, NP data remain scattered across disparate sources, often with inconsistent formats, incomplete metadata, and varied standards for identifiers and classification. A recent perspective noted that NP data are often “multimodal, unbalanced, unstandardized, and scattered across many data repositories,” posing a major barrier to integration and AI applications¹⁰.

These limitations are particularly relevant to the emerging paradigm of whole person health (WPH), which emphasizes a comprehensive view of health by integrating biological, behavioral, social and environmental data¹¹. NPs are uniquely positioned to contribute to WPH research. Derived from plants, foods, and traditional medicine, they influence not only molecular and physiological processes but also reflect cultural practices, dietary habits, and environmental exposures. Properly structured and integrated NP data can serve as a critical link between molecular insights and broader determinants of health, enabling more holistic approaches to biomedical discovery and public health.

Despite the growing availability of numerous NP databases, the landscape remains highly fragmented due to significant heterogeneity in data representation, coverage, metadata completeness, and accessibility. Databases vary greatly in their structure, content depth, and data standards, reflecting diverse scopes, priorities, and curation strategies. This fragmentation makes it difficult for researchers to cross-reference compounds, validate structural data, or integrate datasets into computational workflows. It also prevents algorithms from recognizing overarching patterns and limits the impact of AI in NP science. For example, inconsistencies in chemical identifiers, incomplete metadata, or inaccessible spectral data can severely hamper computational drug discovery efforts and regulatory assessments, ultimately reducing the practical utility of available NP resources. High-quality databases that enforce common standards and rigorous curation thus play a pivotal role in enabling next-generation computational workflows in NP discovery.

Addressing these challenges necessitates a thorough understanding of existing resources and their complementary strengths and limitations. In this context, this study systematically compares three prominent NP databases: COlleCtion of Open Natural prodUcTs (COCONUT)¹²,¹³, Natural Products Magnetic Resonance Database (NP-MRD)¹⁴, and Global Substance Registration System (GSRS)¹⁵. These databases were selected to represent diverse and complementary facets of NP information that are critical for data harmonization. COCONUT represents a large-scale aggregation of chemical structural information from diverse open-access sources, designed specifically for broad computational analyses and dereplication studies. NP-MRD specializes in nuclear magnetic resonance (NMR) spectroscopy data, which is crucial for the detailed structural elucidation and validation of NPs. GSRS emphasizes standardized regulatory identifiers for substances, enabling data harmonization essential for regulatory compliance and international data interoperability. By evaluating these databases in terms of data coverage, structural representation, metadata completeness, interoperability, and accessibility, we aim to highlight critical gaps and integration opportunities to guide future improvement in NP database design and harmonization, with a particular emphasis on enabling their use in WPH research.

Materials and Methods

Natural Products database selection

For this study, we selected three representative resources for comparing existing NP databases. These databases were chosen based on their complementary scopes, data coverage, and relevance to NP research. COCONUT is one of the largest open-access NP databases, an aggregation of over 120 openly available NP datasets. It offers extensive chemical diversity with structural annotations, making it well suited to large-scale computational analysis and dereplication studies. NP-MRD is a specialized resource focusing on nuclear magnetic resonance (NMR) spectral data of NPs, providing high-quality structural and analytical insights crucial for compound validation and elucidation. GSRS, developed for regulatory purposes, offers standardized substance identification, ensuring data harmonization and interoperability across global pharmaceutical and research institutions. By including these three databases, our study captures a broad spectrum of NP information, from chemical and spectral data to regulatory classification, allowing a comprehensive evaluation of their strengths and limitations for different research applications.

Database review

A systematic analysis of the three selected databases was performed to assess their characteristics and usability. The appraisal focused on several key aspects, including the general description of each database, its stated purpose, key features, content coverage, data accessibility, and format. Additionally, a real-time search test was conducted in each database using a common NP as a representative example, allowing for the evaluation of search functionality, retrieval efficiency, and consistency of the provided information. This comparative assessment aimed to highlight the strengths and limitations of each database in terms of structural representation, metadata completeness, and user accessibility, ensuring a comprehensive understanding of their applicability in NP research.

Data element model generation

To facilitate a systematic comparison of the selected NP databases, we first developed a standardized data element model, a structured set of attributes that describe key characteristics of NPs. In this context, data elements refer to individual fields or metadata types used to represent information about an NP, such as its name, structure, bioactivity, or regulatory status. These elements were extracted from each data source and separated into two levels: (1) attributes (e.g., “Molecular Weight”), and (2) the corresponding attribute values, which are the actual data entries (e.g., “302.24 g/mol”). Both the attributes and the attribute values were harmonized across data sources. The attributes were then organized into thematic categories, including Identification (e.g., Database Identifier, Name, Synonyms, Canonical Simplified Molecular Input Line Entry System (SMILES), Standard International Chemical Identifier (InChI), InChI Key, International Union of Pure and Applied Chemistry (IUPAC) Name, Chemical Abstracts Service (CAS) Registration, Chemical Formula), Classification and Taxonomy (e.g., Chemical Class, Source Organism), Physical and Chemical Properties (e.g., Solubility, Molecular Weight, Aromatic Ring Count, pKa), Bioactivity and Pharmacological Information (e.g., Rule of Five, QED Drug Likeliness), Regulatory and Usage Information (e.g., Regulatory Status), Substance Relationships, External Database References, and Record Tracking Information. This standardized framework allowed for an objective assessment of the strengths and limitations of each database in representing NP information, supporting the identification of gaps and potential areas for improvement.
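The two-level model described above (harmonized attributes grouped into categories, plus their values) can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in the study: the source-specific field names in `FIELD_MAP` (for example `molecular_weight` and `average_mass`) are hypothetical placeholders, and only the category and attribute names come from the text.

```python
from dataclasses import dataclass

# Thematic categories named in the text.
CATEGORIES = (
    "Identification",
    "Classification and Taxonomy",
    "Physical and Chemical Properties",
    "Bioactivity and Pharmacological Information",
    "Regulatory and Usage Information",
    "Substance Relationships",
    "External Database References",
    "Record Tracking Information",
)


@dataclass(frozen=True)
class DataElement:
    """A harmonized attribute and the category it belongs to."""
    attribute: str  # e.g. "Molecular Weight"
    category: str   # one of CATEGORIES

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


MOL_WEIGHT = DataElement("Molecular Weight", "Physical and Chemical Properties")
INCHI_KEY = DataElement("InChI Key", "Identification")

# HYPOTHETICAL mapping from (source, source-specific field) to a harmonized
# attribute. The field names are assumptions made for illustration.
FIELD_MAP = {
    ("COCONUT", "molecular_weight"): MOL_WEIGHT,
    ("NP-MRD", "average_mass"): MOL_WEIGHT,
    ("COCONUT", "inchikey"): INCHI_KEY,
    ("NP-MRD", "inchikey"): INCHI_KEY,
}


def harmonize(source, record):
    """Return {DataElement: value} for the fields of `record` that map to the model.

    Unmapped fields and empty values are skipped.
    """
    harmonized = {}
    for field_name, value in record.items():
        element = FIELD_MAP.get((source, field_name))
        if element is not None and value not in (None, ""):
            harmonized[element] = value
    return harmonized
```

Keeping attributes (the `DataElement` keys) separate from attribute values (the dictionary values) mirrors the two levels distinguished in the text, so that attribute names and values can be harmonized independently.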

Data element coverage comparison

To assess the extent, or robustness, of data representation in each selected database, we systematically evaluated each database to determine its coverage of the harmonized set of attributes. Each database was examined for the presence or absence of each attribute in the harmonized set, as well as presence or absence of each category of attributes.
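The presence/absence check can be expressed as a simple lookup, sketched below under stated assumptions. The function names and the small example dictionaries are illustrative; the attribute sets shown cover only a few entries and are not the full harmonized set.

```python
def coverage_matrix(harmonized_attributes, database_attributes):
    """Map each harmonized attribute to {database: "Yes" | "No"}."""
    return {
        attr: {
            db: ("Yes" if attr in attrs else "No")
            for db, attrs in database_attributes.items()
        }
        for attr in harmonized_attributes
    }


def category_coverage(category_to_attributes, attrs_in_database):
    """A category counts as present if at least one of its attributes is present."""
    return {
        category: any(a in attrs_in_database for a in attrs)
        for category, attrs in category_to_attributes.items()
    }


# Small illustrative subset of attributes per database (not exhaustive).
example_databases = {
    "COCONUT": {"Canonical SMILES", "QED Drug Likeliness"},
    "NP-MRD": {"Canonical SMILES", "pKa"},
    "GSRS": {"Canonical SMILES", "Record UNII"},
}
example_categories = {
    "Identification": ["Canonical SMILES"],
    "Regulatory and Usage Information": ["Record UNII", "Regulatory Status"],
}

matrix = coverage_matrix(["Canonical SMILES", "pKa", "Record UNII"], example_databases)
gsrs_categories = category_coverage(example_categories, example_databases["GSRS"])
```

The same two functions apply unchanged to the full attribute list once each database's schema has been mapped to the harmonized names.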

Overlap analysis

To assess the overlap of NP compounds among the selected databases, we used the InChI Key as a unique molecular identifier to standardize compound representation across datasets. Notably, standard InChI Keys, in contrast to SMILES strings, are largely insensitive to mobile-hydrogen tautomerism. InChI Keys were extracted from each database, and duplicates arising from tautomeric variation were removed before comparison. A Venn diagram was then generated to visualize the number of distinct compounds in each database and their shared entries.
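The overlap computation reduces to set operations on cleaned InChI Keys. The sketch below is one possible way to derive the seven Venn regions and is not the code used in the study. It assumes the keys have already been exported from each database as plain strings, and it relies only on the standard InChI Key layout of 14, 10, and 1 uppercase letters separated by hyphens.

```python
import re

# Standard InChI Key layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def clean_keys(raw_keys):
    """Normalize case and whitespace, drop malformed keys, and deduplicate."""
    cleaned = set()
    for key in raw_keys:
        key = (key or "").strip().upper()
        if INCHIKEY_RE.match(key):
            cleaned.add(key)
    return cleaned


def venn_regions(a, b, c):
    """Count the compounds in each of the seven regions of a three-set Venn diagram."""
    return {
        "only_a": len(a - b - c),
        "only_b": len(b - a - c),
        "only_c": len(c - a - b),
        "a_and_b_only": len((a & b) - c),
        "a_and_c_only": len((a & c) - b),
        "b_and_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }
```

The seven counts sum to the size of the union of the three sets, which provides a quick consistency check before plotting.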

Case example selection and search procedure

To illustrate cross-database comparison and data integration, quercetin was selected as a representative example. Quercetin is a well-known flavonoid commonly found in various fruits, vegetables, and grains, and is recognized for its antioxidant, anti-inflammatory, and potential anticancer properties. It serves as an ideal case study for this comparison because it is widely studied, represented across multiple NP databases, and associated with diverse types of metadata. This makes it particularly useful for illustrating how information about a single compound can vary across sources, highlighting both the complementarity and the gaps among the databases. For each database, we performed a targeted search using the compound name or InChI Key and extracted relevant data elements based on our standardized model, including identifiers, structural information, classifications, bioactivity, and regulatory data.

Results

Overview of each database

COCONUT is one of the largest publicly available NP repositories, containing 695,133 curated NP entries aggregated from multiple open-access sources. It is designed for browsing, searching, and efficiently downloading data through a user-friendly web interface. The database offers multiple download formats, including SDF and CSV. The latest dataset (October 2024 release) is available under a Creative Commons CC0 license, permitting unrestricted use, modification, and distribution without attribution.

A key strength of COCONUT is its comprehensive molecular representation. Each NP entry is assigned a unique COCONUT ID, mapping to a canonical SMILES, and includes extensive molecular details, including IUPAC name, InChI, InChIKey, molecular weight, atom counts, aromatic ring count, hydrogen bond acceptors, and Lipinski’s Rule of Five (RO5) violations. The database also provides chemical and NP classification, enhancing its utility for compound categorization and comparative analysis.

In addition to chemical characterization, COCONUT integrates biological taxonomy, listing a comprehensive set of source organisms for each NP. Each organism entry includes a direct link to the COCONUT Organisms page and an external reference to Ontobee (NCBI organismal classification). As an aggregated dataset, COCONUT provides extensive cross-references to hundreds of external resources, such as PubChem Substance, KNApSAcK, DrugBank, Phenol Explorer, FooDB, and ChEBI, enhancing data interoperability and facilitating multi-database integration.

While COCONUT excels in molecular diversity and computational accessibility, it lacks physical properties, detailed experimental spectral data and direct pharmacological activity annotations, which may limit its direct applicability for experimental NP validation and drug development workflows. However, its vast chemical coverage and open-access nature make it an essential resource.

NP-MRD is a specialized NP database that focuses on extensive chemical, spectral, and physical property data, making it a valuable resource for structure elucidation, dereplication, and experimental validation. The database contains 289,609 distinct NP-MRD entries, corresponding to 288,789 unique SMILES, and provides a user-friendly web interface for browsing and searching.

In addition to standard chemical properties (e.g., molecular formula, SMILES, InChI), NP-MRD offers a comprehensive set of physical properties, including experimental data such as water solubility, melting point, and boiling point, as well as predicted properties such as pKa (strongest acidic and basic), molar refractivity, and polarizability. This makes NP-MRD particularly useful for computational modeling, solubility predictions, and pharmacokinetic assessments. A key feature of NP-MRD is its spectral data coverage, which spans diverse spectral data formats and facilitates integration into analytical chemistry workflows. A particularly valuable feature is the growing repository of raw NMR spectral data for NPs, which users can download for subsequent local interrogation¹⁶. Users can also access and download metadata in XML and JSON, as well as structure data in SDF and SMILES formats. In terms of biological taxonomy, NP-MRD provides a full list of species of origin, including species name, source, and references, allowing researchers to easily trace NPs back to their biological sources. Additionally, external links are provided to related resources, supporting data interoperability and further exploration.

While NP-MRD excels in spectral data, physical properties, and taxonomic details, it does not emphasize regulatory classification or broad cheminformatics applications like some other databases. However, its high-quality spectral and experimental data make it a very valuable resource for researchers focused on NP characterization, analytical chemistry, and structure verification.

GSRS is a comprehensive database developed through a collaboration between the U.S. Food and Drug Administration (FDA) and the National Center for Advancing Translational Sciences (NCATS). It provides detailed scientific descriptions and unique ingredient identifiers (UNIIs) for substances relevant to regulated products. The latest public data release (v2.5.1–20200707) consists of 162,731 substance definitions and covers six types of substances referenced in the ISO 11238 standard, including chemicals (116,384), nucleic acids (597), proteins (7,214), polymers (2,527), structurally diverse (27,610), and concepts (5,269). The substance types differ significantly in their structural and functional characteristics. Chemicals are well-defined small molecules, including both NP compounds and synthetic compounds. Nucleic acids and proteins refer to large biomolecules essential for regulatory and therapeutic tracking but are not typically considered NPs. Polymers encompass complex macromolecules, including both synthetic polymers and natural biopolymers, with structural characteristics that differ from small molecules. Structurally diverse substances include mixtures or complex materials of biological origin that cannot be fully characterized by a single chemical structure. Concepts represent abstract entries that serve as categories, classes, or generalized descriptors rather than specific chemical entities. Among these categories, the chemicals type is most relevant for comparison with the other two NP databases, COCONUT and NP-MRD, as it provides compound-level information such as chemical structure, molecular formula, and classification. However, unlike the other two databases, GSRS includes both NPs and synthetic compounds, making it a more comprehensive but less specialized resource for NP research. Furthermore, GSRS focuses heavily on regulatory aspects that facilitate data harmonization and compliance tracking, rather than comprehensive spectral or taxonomic information. A key strength of GSRS is its ability to capture relationships among substances, providing additional biological and manufacturing context that helps regulatory authorities identify and track product ingredients in various applications and information sources.

The public GSRS data are freely available at https://gsrs.ncats.nih.gov, but the public site does not support user registration. For more personalized features and functionalities, users can use PrecisionFDA, which holds the same extensive substance data but offers a user-friendly registration system. GSRS is also integrated with the FDA’s DailyMed platform to support regulatory decision-making and medical health research. Through PrecisionFDA, qualified users can request specialized access to data and use additional functions that are not available on the public GSRS platform. The UNIIs are searchable in PrecisionFDA’s dedicated UNII search tool at https://precision.fda.gov/uniisearch. The data can be downloaded in various formats, including TXT, CSV, XLSX, JSON, and SDF. GSRS includes extensive references to other substance information sources, including Common Chemistry, Inxight Drugs, DailyMed Regulated Products, GSRS Full Record, NCI Thesaurus, PubChem, and CompTox Chemicals Dashboard. This comprehensive cross-referencing enhances the integration of multiple databases.

While GSRS provides extensive substance definitions, it does not provide the detailed physical and chemical properties or taxonomic information found in the other two databases. Nor does it provide comprehensive pharmacological or toxicological profiles, which limits its usefulness for drug discovery and safety assessments of regulated products.

Comparison of Databases Adequacy and Completeness

The harmonized list of data elements was used as a standard checklist. The presence or absence of each data element in these databases was systematically assessed (Table 1).

Table 1.

Comparison of data elements across existing NP databases.

Section	Facet	COCONUT	NP-MRD	GSRS (Compound)	GSRS (Substance)
1. Identification	Database Identifier	Yes	Yes	Yes	Yes
	Name & Synonyms	Yes	Yes	Yes	Yes
	Canonical SMILES	Yes	Yes	Yes	No
	Standard InChI/InChI Key	Yes	Yes	Yes	No
	IUPAC Name	Yes	Yes	Yes	No
	CAS Registration	Yes	Yes	Yes	Yes
	Chemical Formula	Yes	Yes	Yes	No
2. Classification & Taxonomy	Chemical Classification (Kingdom, Chemical Class/Subclass/Superclass)	Yes	Yes	No	No
	Structural Classification (Direct/Alternative Parent, Substituents, Molecular/Murcko Framework, Stereochemistry, Defined Stereocenters, E/Z Centers)	Yes	Yes	Yes	No
	Biological Classification (Species of Origin, Species Where Detected, Organisms, Geolocations)	Yes: a list of organisms with links to Ontobee	Yes: a list of species	No	Yes: include Source Materials
	NP Classification (NP Pathway/Class/Superclass/Is Glycoside)	Yes	No	No	No
3. Physical and Chemical Properties	Phase and Thermodynamic properties (State, Melting Point, Boiling Point)	No	Yes	No	No
	Solubility (Water Solubility, LogP/ALogP/ClogP/Experimental LogP)	Yes	Yes	No	No
	pKa (Strongest Acidic/Basic)	No	Yes	No	No
	Molecular Weight/Exact Mass	Yes	Yes	Yes	No
	Total/Heavy Atom Count	Yes	No	No	No
	Van der Waals Volume	Yes	No	No	No
	Fraction of Csp3	Yes	No	No	No
	Ring Structure (Aromatic Rings Count, Number of Rings, Minimal Number of Rings)	Yes	Yes	No	No
	Contains Ring/Linear Sugars	Yes	No	No	No
	Rotatable Bond Count	Yes	Yes	No	No
	Formal/Physiological Charge	Yes	Yes	Yes	No
	Hydrogen Acceptor/Donor Count	Yes	Yes	No	No
	Polar Surface Area (Topological/Predicted)	Yes	Yes	No	No
	Optical Activity	No	Yes	Yes	No
	Molar Refractivity/Polarizability	No	Yes	Yes	No
4. Bioactivity and Pharmacological Information	Bioavailability	No	Yes	No	No
	Rule of Five/Lipinski’s Rule/Violations	Yes	Yes	No	No
	Ghose Filter	No	Yes	No	No
	Veber’s Rule	No	Yes	No	No
	MDDR-like Rule	No	Yes	No	No
	QED Drug Likeliness	Yes	No	No	No
5. Regulatory & Usage Classification	Record UNII	No	No	Yes	Yes
	Record Protection Status	No	No	Yes	Yes
	Regulatory Status	No	No	Yes	Yes
	Usage Classification (Orphan Drug, Dietary Supplement, LiverTox, …)	No	No	Yes	Yes
6. Substance Relationships	Substance Composition (Active Moiety, Metabolites, Impurities, Constituents)	No	No	Yes	Yes
	General Relationships	No	No	Yes	Yes
7. References to External Database	Metabolomics & Biochemical Databases	Phenol Explorer Compound, FooDB, KNApSAcK, ReSpect, HIM, Watermelon Database	HMDB, Phenol Explorer Compound, FooDB, KNApSAcK, METLIN	HMDB, Phenol Explorer Compound, FooDB, KNApSAcK, Metabolomics Workbench
	Drug & Pharmacology Databases	DrugBank	DrugBank	DrugBank, DAILYMED, RXCUI, PharmGKB, Pharos Ligand, ChEMBL, DRUG CENTRAL
	NP & Chemical Compound Databases	ChemSpider, PubChem NPs Substance, ChEBI, TIPdb, p-ANAPL, ANPDB, AfroCancer, UNPD, AfroDB, AnalytiCon Discovery NPs, GNPS, NPCARE, CMAUP, InPACdb, Australian NPs, EMNPD, SANCDB, TCMDB@Taiwan, TCMID, StreptomeDB, HIT, NANPDB, BitterDB, NPASS, VietHerB, Phyto4Health, InterBioScreen, Mitishamba, ETM-DB, BIOFACQUIM, UEFS, NPACT, NPEdia, Specs NPs, LANaPDB, ConMedNP, NuBBEDB, Super Natural II	ChemSpider, PubChem Compound, ChEBI	ChemSpider, PubChem Compound, ChEBI
	Biological Pathways & Systems Biology Databases		KEGG Compound, BioCyc, BiGG	KEGG Compound, BioCyc
	Toxicology & Environmental Databases	TPPT, Exposome-Explorer		EPA_CompTox, DSSTox, NSC, HSDB
	Structural & 3D Databases		PDB, Good Scents
	Regulatory & Identifier Databases	UN	Wikipedia	Enzyme Commission, NCIT, Wikipedia, Wikidata, Nikkaji, MESH, MERCK INDEX, EVMPD, GRAS Notification, ECHA (EC/EINECS), FDA UNII
8. Record Tracking Information	Created At/By	Yes	Yes	Yes	Yes
	Last Edited At/By	Yes	Yes	Yes	Yes
	Version	No	Yes	Yes	Yes

Across the databases, identification-related attributes are well-represented. Classification and taxonomy elements vary significantly, with COCONUT providing chemical, structural, biological, and NP classifications; NP-MRD providing chemical, structural, and biological classifications; and GSRS providing structural classification at the compound level and source materials only at the substance level. The coverage of physical and chemical properties is inconsistent, with COCONUT and NP-MRD offering extensive information, whereas GSRS lacks key physicochemical attributes. Bioactivity and pharmacological properties, including bioavailability and drug-likeness assessments, are present in COCONUT and NP-MRD but absent from GSRS. Regulatory information and substance relationships are available only in GSRS. Additionally, all three databases integrate numerous external references across metabolomics, NP, and chemical compound databases. COCONUT focuses more on links with other NP databases; NP-MRD focuses more on biological pathways, systems biology databases, and structural databases; and GSRS focuses more on links with drug/pharmacology databases and regulatory databases. Finally, record tracking information is included in all databases. This comparative evaluation indicates that there are significant differences in scope and emphasis across databases.

Each database was evaluated for the presence or absence of key data categories to assess its overall completeness (Table 2). Data elements within the same category were grouped, and the number and percentage in the table indicate how many items within a database contain at least one data element from each category. It should be noted that a high completeness percentage does not necessarily reflect comprehensive coverage of all relevant data elements within a category. For example, GSRS appears to have high completeness for physical and chemical properties; this is due to the consistent availability of basic information, such as molecular weight, whereas all other physicochemical attributes are absent.

Table 2.

Completeness of selected data elements within existing databases

Section	COCONUT	NP-MRD	GSRS
Identification	695133 (100%)	289609 (100%)	116008 (100%)
Classification and Taxonomy	694336 (99.9%)	250116 (86.4%)	?
Physical and Chemical Properties	695133 (100%)	289509 (99.97%)	?
Bioactivity and Pharmacological Information	695133 (100%)	270139 (93.3%)	-
Regulatory and Usage Information	-	-	115412 (99.5%)
Substance Relationships	-	-	?
External Database References	695133 (100%)	241956 (83.5%)	112429 (96.91%)
Record Tracking Information	695133 (100%)	289609 (100%)	?
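
The counting rule behind Table 2 (a record is counted for a category if it contains at least one non-empty data element from that category) can be sketched as follows. This is a minimal illustration rather than the code used in the study; the record layout (one dictionary per entry) and the field names in the usage note are assumptions.

```python
def category_completeness(records, category_fields):
    """Return {category: (count, percent)}.

    A record is counted for a category if at least one of that category's
    fields holds a non-empty value.
    """
    empty_values = (None, "", [])
    total = len(records)
    result = {}
    for category, fields in category_fields.items():
        count = sum(
            1
            for record in records
            if any(record.get(f) not in empty_values for f in fields)
        )
        percent = 100.0 * count / total if total else 0.0
        result[category] = (count, percent)
    return result
```

For example, with three records of which two carry either a hypothetical `chemical_class` or `organisms` field, the Classification and Taxonomy entry would be `(2, 66.66...)`. As the text cautions, this measure says nothing about how many elements within the category are filled.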

Overlap between databases

A Venn diagram based on InChI Key identifiers was used to assess the overlap and uniqueness of NP compounds across COCONUT, NP-MRD, and GSRS (Figure 1). While each database contains many unique compounds, the overlap across all three is relatively small, indicating that these resources capture distinct subsets of the NP chemical space. It is important to note that GSRS does not distinguish between NP compounds and synthetic compounds; the overlap analysis based on InChI Key identifiers may therefore introduce bias by overestimating the uniqueness of GSRS in this NP-focused study. The results highlight COCONUT’s broad coverage, NP-MRD’s focus on structurally validated compounds, and GSRS’s regulatory emphasis, and they underscore the need for better cross-referencing across NP databases.

Figure 1. Overlap of NP compounds among databases based on InChI Key.

Illustration with a Search Example of Quercetin

Quercetin was used as a representative example to demonstrate the process of searching across the three selected NP databases and to illustrate how NPs can be comprehensively represented by integrating information from multiple resources (Table 3). Only the most relevant data elements across the databases are presented. This example demonstrates the necessity of accessing multiple databases to obtain a comprehensive view of an NP, particularly when aiming to explore its taxonomic, chemical, biological, pharmaceutical, and regulatory information. Integrating these resources would significantly enhance their applicability for computational analysis, experimental studies, and advancing WPH research.

Table 3.

Selected data elements to represent Quercetin using information in existing resources

Data elements		Representative example
1. Identification	Name	Quercetin
	Canonical SMILES	O=C1C(O)=C(C2=CC=C(O)C(O)=C2)OC2=CC(O)=CC(O)=C12
	IUPAC Name	2-(3,4-dihydroxyphenyl)-3,5,7-trihydroxy-chromen-4-one
	Chemical Formula	C15H10O7
	InChI Key	REFJWTPEDVJJIY-UHFFFAOYSA-N
2. Classification & Taxonomy	Chemical Class	Flavonoids
	Chemical Subclass	Flavones
	Chemical Superclass	Phenylpropanoids and Polyketides
	Direct Parent Classification	Flavonols
3. Physical and Chemical Properties	State	Solid
	Experimental Water Solubility	0.06 mg/mL at 16°C
	Experimental Melting Point	316 - 318 °C
	Predicted Water Solubility	0.26 g/L
	Molecular Weight	302.24
	Exact Molecular Weight	302.04265
	Van Der Waals Volume	239.07
4. Bioactivity and Pharmacological Information	Bioavailability	Yes
	Rule of Five/Lipinski’s Rule/Violations	Yes
	Veber’s Rule	No
	MDDR-like Rule	No
5. Regulatory & Usage Classification	Record Protection Status	Public record
	Record Status	Validated (UNII)
	APPROVAL_ID	9IKM0I5T1E
6. Substance Relationships	Active Moiety	QUERCETIN ->Quercetin (ACTIVE MOIETY)
	Metabolites	QC-12 (PRODRUG)->Quercetin (METABOLITE ACTIVE)
	Impurities	RUTIN (PARENT)->Quercetin (IMPURITY ACTIVE)
	Constituents	FENUGREEK SEED (PARENT)->Quercetin (CONSTITUENT ALWAYS PRESENT)
7. References to External Database	DrugBank ID	DB04216
	Phenol Explorer Compound ID	291
	FooDB ID	FDB011904
	KNApSAcK ID	C00004631
	Chemspider	4444051
	PubChem Compound ID	5280343
8. Record Tracking Information	Version	50

Discussion

Significance

Despite the prevalent use of NPs for nutrition, health promotion, and medicinal purposes, many of these products remain insufficiently studied, primarily due to the challenges of conducting rigorous research on chemically complex materials. Addressing this gap is particularly crucial within the NCCIH’s Whole Person Health Initiative, which emphasizes studying multicomponent interventions across interconnected biological systems rather than focusing on single disease models¹⁷. The reciprocal interactions between NPs, the gut microbiome, and systemic health further underscore the importance of a systems-level approach to evaluating their effects. By leveraging computational and integrative methodologies, researchers can uncover novel mechanisms and interactions between NPs and biological systems, contributing to a more holistic understanding of their role in health and resilience.

Given the fragmentation and heterogeneity of available NP data, the lack of standardized and interoperable sources makes it difficult to systematically analyze and compare NPs and their effects on WPH. Previous initiatives, such as the Consortium Advancing Research on Botanicals and Other Natural Products (CARBON), have focused on developing methodologies for NP characterization and bioactivity assessment, yet the need for integrated, standardized data repositories remains. To effectively harmonize NP data from diverse databases, a multi-layered standardization strategy could be adopted, centered on the use of universally recognized chemical identifiers. The InChI Key may serve as the primary unique identifier for NP compounds, ensuring consistency and enabling precise matching across disparate sources. To reinforce the accuracy of these linkages, SMILES could be used as a supplementary validation mechanism. This dual-identifier approach can help reduce redundancy and misalignment. For entries lacking structured chemical identifiers, such as those identified only by names, tools like QuickUMLS may assist in mapping these entities to Concept Unique Identifiers (CUIs) in the Unified Medical Language System (UMLS), enabling semantic normalization and alignment of nomenclature across datasets. Collectively, these strategies provide a scalable framework for harmonizing NP databases and improving interoperability, thereby facilitating downstream applications in cheminformatics, biomedical research, and AI-driven discovery.
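The dual-identifier idea (InChI Key as the primary join key, SMILES as a secondary check) might look like the sketch below. It is a proposal-level illustration with several assumptions: the record field names `inchikey` and `smiles` are hypothetical, and the `canonicalize` argument defaults to the identity function. In practice a cheminformatics toolkit would be needed to canonicalize SMILES before comparison, because different valid SMILES strings can describe the same structure.

```python
from collections import defaultdict


def merge_by_inchikey(sources, canonicalize=lambda smiles: smiles):
    """Group records from several databases by InChI Key.

    `sources` maps a database name to a list of record dictionaries.
    Records without an InChI Key are skipped; they would need name-based
    mapping instead. Each merged entry is flagged when the (canonicalized)
    SMILES strings reported by different sources disagree.
    """
    merged = defaultdict(lambda: {"sources": {}, "smiles": set()})
    for source_name, records in sources.items():
        for record in records:
            key = record.get("inchikey")
            if not key:
                continue
            entry = merged[key]
            entry["sources"][source_name] = record
            if record.get("smiles"):
                entry["smiles"].add(canonicalize(record["smiles"]))
    for entry in merged.values():
        entry["smiles_conflict"] = len(entry["smiles"]) > 1
    return dict(merged)
```

A flagged conflict is a prompt for manual review rather than proof of a mismatch, especially when no real canonicalizer is supplied.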

Despite the growing interest in WPH, no existing NP database has been specifically structured to support this research paradigm. Traditional NP databases, such as COCONUT, NP-MRD, and GSRS, primarily focus on chemical structure, spectral data, and regulatory classification, respectively. However, they lack integration with clinical, multi-omics, behavioral, and environmental data sources—key components necessary to study how NPs influence whole-body systems. Additionally, there is limited effort to link computational tools with experimental research to uncover complex interactions between NPs and multiple biological systems. Without an integrated data framework, the full potential of AI and multi-scale modeling to predict synergistic effects of NP mixtures remains largely untapped.

Our study addresses this gap by systematically evaluating existing NP databases in the context of advancing WPH research. By reviewing, categorizing, and comparing COCONUT, NP-MRD, and GSRS, we identified strengths, limitations, and missing elements critical for constructing an integrated NP informatics framework. Our analysis highlights the need and potential directions for a harmonized data structure. Such a framework would support future integration of NP data with complementary sources, such as multi-omics profiles, Electronic Health Records (EHRs), and public health datasets, which is essential for building computational models capable of capturing the complex, multiscale effects of NPs on WPH. For example, linking NP chemical and bioactivity data with metabolomics profiles could help elucidate metabolic pathways influenced by specific compounds. Integration with EHRs may enable retrospective analyses of NP-related exposures or supplement safety based on real-world evidence. Similarly, aligning NP databases with nutrition, microbiome, or exposome datasets could support population-level studies on diet-derived compounds and environmental modifiers of health. These directions illustrate how harmonized NP data could play a central role in translational pipelines that span molecular mechanisms to clinical and public health outcomes.

Our analysis of COCONUT, NP-MRD, and GSRS reveals that while these databases serve different but complementary purposes, they each have limitations in the context of WPH research. COCONUT provides the broadest coverage of NPs, focusing on chemical structures, molecular descriptors, and taxonomic origins, but lacks bioactivity and regulatory insights. NP-MRD, with its emphasis on spectral data and experimental validation, offers valuable structural and physical property information, yet it does not integrate multi-omics or health-related data. GSRS, designed for regulatory tracking, captures detailed substance classifications, pharmaceutical formulations, and compliance records, but it does not systematically link NPs to biological effects or multisystem interactions.

Despite their individual strengths, these databases exhibit only moderate to small overlap, highlighting the fragmentation of NP information across cheminformatics, experimental, and regulatory domains. This lack of integration presents a significant challenge for WPH research, which requires multiscale data linking NPs to biological, environmental, and physiological systems. Our findings underscore the need for enhanced interoperability, improved data harmonization, and cross-referencing between cheminformatics, clinical, and multi-omics resources to enable comprehensive, data-driven investigations into the complex effects of NPs on whole-body health.

Limitations

While our study provides a comprehensive comparison of three major NP databases (COCONUT, NP-MRD, and GSRS), several limitations should be acknowledged. One potential gap is the lack of medical and pharmaceutical information on NPs, such as dosing guidelines, contraindications, adverse reactions, and toxicology. Resources like the Natural Product Information (Professional) from Drugs.com offer valuable information, but were not included in our analysis due to several constraints: (i) the platform is limited to web browsing, with no option for downloading, scraping, or extracting content; (ii) it primarily relies on common names as identifiers, lacking standardized chemical identifiers such as InChIKey or SMILES, which complicates integration efforts; and (iii) it has relatively limited coverage. Future integration efforts could explore strategies such as mapping name-based entries to standardized identifiers using NLP-based entity linking. Additionally, collaboration with clinical or regulatory data providers could enable access to structured metadata under appropriate licensing terms. These steps could help bridge the gap between chemical-level NP data and clinically relevant information, further enhancing the utility of NP databases for translational and WPH research.

Additionally, our study is inherently limited by the subjective nature of data element categorization. While we aimed for a logical and standardized grouping of data elements, some distinctions remain ambiguous. Furthermore, while our analysis identifies moderate to small overlap between databases, this does not capture potential semantic inconsistencies or nomenclature variations that could further impact integration. These limitations highlight the need for future efforts to incorporate pharmaceutical and clinical data sources, establish standardized metadata frameworks, and refine NP data classification methods to improve interoperability and support WPH research.

Another limitation of our overlap analysis is the potential bias introduced by the inclusion of synthetic compounds in GSRS. Since GSRS does not explicitly distinguish NPs from synthetic substances, the use of InChIKey identifiers may overestimate the uniqueness of GSRS entries relative to the NP-focused COCONUT and NP-MRD databases. Addressing this bias will require clearer annotation of compound origin in GSRS and similar databases, highlighting the need for better classification of natural versus synthetic substances to enable more accurate cross-database comparisons in future research.
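The direction of this bias can be shown with a small Python sketch over sets of InChIKeys. The key sets, the `synthetic_in_gsrs` annotation, and both function names are hypothetical; GSRS does not currently provide such an origin flag, which is precisely the limitation noted above.

```python
from itertools import combinations


def pairwise_overlap(key_sets):
    """Return {(a, b): number of shared keys} for every pair of databases."""
    return {
        (a, b): len(key_sets[a] & key_sets[b])
        for a, b in combinations(sorted(key_sets), 2)
    }


def unique_fraction(key_sets, name):
    """Fraction of `name`'s keys that appear in no other database."""
    others = set().union(*(s for n, s in key_sets.items() if n != name))
    own = key_sets[name]
    return len(own - others) / len(own) if own else 0.0


# Toy InChIKey sets; K1..K8 stand in for real identifiers.
key_sets = {
    "COCONUT": {"K1", "K2", "K3", "K4"},
    "NP-MRD": {"K2", "K3", "K5"},
    "GSRS": {"K3", "K6", "K7", "K8"},
}

# Hypothetical origin annotation: GSRS entries known to be synthetic.
synthetic_in_gsrs = {"K7", "K8"}

before = unique_fraction(key_sets, "GSRS")

# Repeat the calculation after dropping the synthetic entries.
filtered = dict(key_sets, GSRS=key_sets["GSRS"] - synthetic_in_gsrs)
after = unique_fraction(filtered, "GSRS")

print(pairwise_overlap(key_sets))
print(f"GSRS unique fraction: {before:.2f} unfiltered, {after:.2f} NP-only")
```

With these toy sets, the apparent uniqueness of GSRS falls from 0.75 to 0.50 once the synthetic entries are excluded, while the pairwise overlap counts are unchanged because none of the removed keys were shared.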

Perspectives

The findings from this study highlight the need for improved integration and interoperability of NP databases to enhance their relevance for WPH research applications. As the field continues to advance, developing a unified framework capable of linking chemical, spectral, taxonomic, pharmacological, and regulatory information will be critical for leveraging computational approaches, including machine learning, cheminformatics, and knowledge graphs. Integrating existing NP resources with multi-omics, clinical, and environmental data will allow researchers to explore complex interactions between NPs and interconnected biological systems more effectively.

Future efforts should focus on standardizing data formats, harmonizing metadata frameworks, and improving cross-references across NP databases. However, these goals are often challenged by technical barriers such as inconsistent chemical representations, heterogeneous metadata standards, and incomplete or ambiguous entries. For instance, variations in naming conventions, stereochemical details, and missing structural identifiers can hinder accurate integration, even with tools like InChIKey and SMILES.
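One way to surface stereochemical mismatches is to compare only the first block of the InChIKey. A standard InChIKey has a 14-10-1 character layout, and its first 14 characters hash the connectivity of the molecule, while stereochemistry is encoded in the second block. The Python sketch below groups keys that share a first block; the function names and placeholder keys are assumptions made for illustration, and such matches should be treated as candidates for review, since stereoisomers can differ in bioactivity.

```python
import re
from collections import defaultdict

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def connectivity_block(inchikey):
    """Return the first (connectivity) block of a well-formed InChIKey."""
    if not INCHIKEY_RE.fullmatch(inchikey):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return inchikey.split("-")[0]


def group_by_skeleton(inchikeys):
    """Map each connectivity block to the distinct full keys sharing it.

    Only blocks with more than one full key are returned, i.e. the
    cases where an exact-match join would miss a possible link.
    """
    groups = defaultdict(set)
    for key in inchikeys:
        groups[connectivity_block(key)].add(key)
    return {blk: sorted(keys) for blk, keys in groups.items() if len(keys) > 1}


# Placeholder keys: the first two share a skeleton but differ in the
# second block, as two stereoisomers would.
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    "AAAAAAAAAAAAAA-CCCCCCCCCC-N",
    "DDDDDDDDDDDDDD-EEEEEEEEEE-N",
]
print(group_by_skeleton(keys))
```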

Furthermore, semantic inconsistencies—such as synonymy and varied ontologies—complicate the mapping of biomedical concepts, requiring both advanced NLP tools and alignment with established ontologies. Incorporating domain-specific resources such as ChEBI for chemical entities, MeSH for pharmacological categories, UMLS for clinical terminology, and NCBI Taxonomy for organisms can help standardize terminology and improve cross-database interoperability. Aligning NP data with these ontologies would improve semantic consistency, facilitate automated reasoning and search, and support interoperability with clinical and biomedical data systems.
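As a rough illustration of such alignment, a harmonized NP record could carry a slot for each ontology cross-reference, so that gaps become easy to audit. The Python sketch below is one possible layout, not a proposed standard: the class name, field names, ontology labels, and identifier values are all placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Ontologies named in the text; the short labels are our own choice.
ONTOLOGIES = ("ChEBI", "MeSH", "UMLS", "NCBITaxon")


@dataclass
class HarmonizedNP:
    """One NP entry keyed by InChIKey with optional ontology cross-references."""

    inchikey: str
    name: str
    xrefs: Dict[str, str] = field(default_factory=dict)

    def missing_xrefs(self) -> List[str]:
        """List the ontologies for which no cross-reference is recorded."""
        return [o for o in ONTOLOGIES if o not in self.xrefs]


# Placeholder values only; no real identifiers are implied.
record = HarmonizedNP(
    inchikey="AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    name="example compound",
    xrefs={"ChEBI": "CHEBI:PLACEHOLDER"},
)
print(record.missing_xrefs())
```

Listing the missing cross-references per record gives a simple completeness measure that could be tracked as databases are aligned.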

Additionally, mechanisms to incorporate medical and pharmaceutical information, such as dosage, interactions, and toxicity, are essential for advancing data-driven investigations of NPs. Bridging these gaps will enable more robust computational models that can predict NP effects on health resilience and complex, whole-body systems, ultimately contributing to the broader goals of WPH research.

Conclusion

This study systematically compared three major NP databases (COCONUT, NP-MRD, and GSRS), highlighting their complementary strengths and limitations in supporting NP research for WPH. COCONUT provides extensive chemical diversity, NP-MRD offers valuable spectral and physical property data, and GSRS focuses on standardized regulatory classification. However, only moderate to small overlap exists between these resources, revealing fragmentation in NP information. Additionally, critical gaps remain in integrating medical and pharmaceutical information relevant to holistic health research. Future efforts should focus on enhancing data interoperability, developing comprehensive data models, and integrating clinical and multi-omics data to support advanced computational methodologies aimed at understanding NPs’ systemic effects on health.

References

1. Hong J. Role of natural product diversity in chemical biology. Curr Opin Chem Biol. 2011;15:350–354. doi: 10.1016/j.cbpa.2011.03.004.
2. Newman D. J., Cragg G. M. Natural Products as Sources of New Drugs over the Nearly Four Decades from 01/1981 to 09/2019. J Nat Prod. 2020;83:770–803. doi: 10.1021/acs.jnatprod.9b01285.
3. Hou Y., Zhang R. Enhancing Dietary Supplement Question Answer via Retrieval-Augmented Generation (RAG) with LLM. 2024.09.11.24313513. 2024. Preprint.
4. Su C., et al. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience. 2023;26:106460. doi: 10.1016/j.isci.2023.106460.
5. Xiao Y., et al. Repurposing non-pharmacological interventions for Alzheimer’s disease through link prediction on biomedical literature. Sci Rep. 2024;14:8693. doi: 10.1038/s41598-024-58604-8.
6. Su C., Hou Y., Levin M., Zhang R., Wang F. Protocol to implement a computational pipeline for biomedical discovery based on a biomedical knowledge graph. STAR Protoc. 2023;4:102666. doi: 10.1016/j.xpro.2023.102666.
7. Su C., Hou Y., Wang F. GNN-based Biomedical Knowledge Graph Mining in Drug Development. In: Wu L., Cui P., Pei J., Zhao L., editors. Graph Neural Networks: Foundations, Frontiers, and Applications. Singapore: Springer Nature; 2022. pp. 517–540. doi: 10.1007/978-981-16-6054-2_24.
8. Dzobo K. The Role of Natural Products as Sources of Therapeutic Agents for Innovative Drug Discovery. Comprehensive Pharmacology. 2022:408–422. doi: 10.1016/B978-0-12-820472-6.00041-4.
9. van Santen J. A., Kautsar S. A., Medema M. H., Linington R. G. Microbial natural product databases: moving forward in the multi-omics era. Nat Prod Rep. 2021;38:264–278. doi: 10.1039/d0np00053a.
10. Meijer D., et al. Empowering natural product science with AI: leveraging multimodal data and knowledge graphs. Nat Prod Rep. 2025;42:654–662. doi: 10.1039/d4np00008k.
11. Whole Person Health. What It Is and Why It’s Important. NCCIH. https://www.nccih.nih.gov/health/whole-person-health-what-it-is-and-why-its-important
12. Sorokina M., Merseburger P., Rajan K., Yirik M. A., Steinbeck C. COCONUT online: Collection of Open Natural Products database. J Cheminform. 2021;13:2. doi: 10.1186/s13321-020-00478-9.
13. Chandrasekhar V., et al. COCONUT 2.0: a comprehensive overhaul and curation of the collection of open natural products database. Nucleic Acids Res. 2024;53:D634–D643.
14. Wishart D. S., et al. NP-MRD: the Natural Products Magnetic Resonance Database. Nucleic Acids Res. 2021;50:D665–D677.
15. Peryea T., et al. Global Substance Registration System: consistent scientific descriptions for substances related to health. Nucleic Acids Res. 2020;49:D1179–D1185.
16. Wishart D. S., et al. The Natural Products Magnetic Resonance Database (NP-MRD) for 2025. Nucleic Acids Res. 2025;53:D700–D708. doi: 10.1093/nar/gkae1067.
17. Concept: Whole Person Research Initiative. NCCIH. https://www.nccih.nih.gov/grants/concept-whole-person-research-initiative
