Database resources of the National Center for Biotechnology Information in 2026

Eric W Sayers; Evan E Bolton; Anna M Fine; Christopher Kelly; Sunghwan Kim; Melissa Landrum; Stacy Lathrop; Adriana Malheiro; Terence D Murphy; Lon Phan; Shashikant Pujar; Barton W Trawick; Valerie A Schneider; Kim D Pruitt

2025 Dec 12;54(D1):D20–D27. doi: 10.1093/nar/gkaf1060

All authors: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States

✉ To whom correspondence should be addressed: Eric W Sayers. Email: sayers@ncbi.nlm.nih.gov

Roles

Eric W Sayers: Conceptualization, Writing - original draft, Writing - review & editing

Evan E Bolton: Data curation, Software, Writing - original draft

Anna M Fine: Data curation, Software, Writing - original draft

Christopher Kelly: Data curation, Writing - original draft

Sunghwan Kim: Data curation, Software, Writing - original draft

Melissa Landrum: Data curation, Software, Writing - original draft

Stacy Lathrop: Data curation, Writing - original draft

Adriana Malheiro: Data curation, Software, Writing - original draft

Terence D Murphy: Conceptualization, Data curation, Software, Writing - original draft

Lon Phan: Data curation, Software, Writing - original draft

Shashikant Pujar: Data curation, Software, Writing - original draft

Barton W Trawick: Data curation, Software, Writing - original draft

Valerie A Schneider: Conceptualization, Funding acquisition, Supervision, Writing - review & editing

Kim D Pruitt: Conceptualization, Funding acquisition, Supervision, Writing - review & editing

PMCID: PMC12807769 PMID: 41385079

Abstract

The National Center for Biotechnology Information (NCBI) provides biomedical data resources including PubMed®, a repository of citations and abstracts published in life science journals, and ClinicalTrials.gov, a repository of clinical research summaries. NCBI also hosts the NIH Comparative Genomics Resource (CGR) that aims to maximize the impact of eukaryotic genome datasets. NCBI provides search and retrieval operations for most of these data from 40 distinct repositories, knowledgebases, and services. The E-utilities serve as the programming interface for most of these. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, CGR, ClinicalTrials.gov, ClinVar, dbSNP, GTR, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

Introduction

The National Center for Biotechnology Information (NCBI), a center within the National Library of Medicine (NLM) at the National Institutes of Health (NIH), was created in 1988 to develop information systems for molecular biology [1]. In this article, we provide a brief overview of the NCBI collection of databases, followed by a summary of resources that we significantly updated in the past year. In a slight change from previous years, we discuss updates to DNA and protein sequence resources separately in a companion paper.

NCBI maintains a set of 40 biomedical data resources that collectively contain 5.2 billion records (Table 1), most of which are accessible through the Entrez search and retrieval system [2]. An Entrez search bar appears near the top of the NCBI home page (https://www.ncbi.nlm.nih.gov) and of the pages of these various resources. Each Entrez resource supports simple text queries as well as more complex queries containing Boolean operators (“AND,” “OR,” and “NOT”) and fielded term searches that users can explore in the “Advanced” search linked in the search bar. Each resource also provides multiple data formats appropriate for its data type and offers various downloading functions to retrieve data. In many resources, each record functions much like a node in a knowledge graph, as it is linked to records in the same and other Entrez resources based on relationships asserted by submitters, curators, or computational analysis. Details about these formats and links are available through the home pages of each resource.
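As an illustration of a fielded Boolean query issued programmatically, the following minimal sketch builds an E-utilities ESearch URL without sending a request. The endpoint URL, the `retmode` and `retmax` parameters, the PubMed field tags, and the helper name `build_esearch_url` are assumptions not stated in this article and should be checked against the current E-utilities documentation.

```python
from urllib.parse import urlencode

# Assumed public ESearch endpoint; the article names the E-utilities but gives no URL.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for an Entrez query; no network request is made."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Fielded terms combined with Boolean operators, as in the "Advanced" search.
query = 'asthma[Title] AND "2025"[Date - Publication] NOT review[Publication Type]'
url = build_esearch_url("pubmed", query)
print(url)
```

Building the URL separately from sending it keeps the query logic easy to inspect and test before any request reaches NCBI servers.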

Table 1.

NCBI Data Resources (as of 3 September 2025)

Database	Records	Annual growth	Description
Literature
PubMed	39 334 316	5%	Scientific and medical abstracts/citations
PubMed Central	11 230 676	10%	Full-text journal articles
NLM Catalog	1 653 376	0.2%	Index of NLM collections
Bookshelf	1 121 186	6%	Books and reports
MeSH	355 572	0.1%	Ontology used for PubMed indexing
DNA/RNA
Nucleotide	673 713 256	6%	DNA and RNA sequences from GenBank and RefSeq
BioSample	47 538 647	18%	Descriptions of biological source materials
SRA	40 333 544	16%	High-throughput DNA/RNA sequence read archive
Taxonomy	2 828 854	4%	Taxonomic classification and nomenclature catalog
BioProject	926 825	14%	Biological projects providing data to NCBI
BioCollections	8 497	0%	Museum, herbaria, and biorepository collections
Genes
GEO Profiles	128 414 055	0%	Gene expression and molecular abundance profiles
Gene	63 123 617	15%	Collected information about gene loci
GEO Datasets	8 296 887	9%	Functional genomics studies
Proteins
Protein	1 500 527 183	12%	Protein sequences from GenBank and RefSeq
Identical Protein Groups	994 033 355	22%	Protein sequences grouped by identity
Structure	240 990	8%	Experimentally determined biomolecular structures
Protein Family Models	177 908	11%	Conserved domain architectures, HMMs, and BlastRules
Conserved Domains	67 160	0%	Conserved protein domains
Chemicals
PubChem Substance	337 779 072	6%	Deposited substance and chemical information
PubChem Compound	122 265 315	3%	Chemical information with structures, information, and links
PubChem BioAssay	1 768 720	6%	Bioactivity screening studies
PubChem Pathways	250 942	4%	Molecular pathways with links to genes, proteins, and chemicals
Clinical Genetics
dbSNP	1 197 210 835	7%	Short genetic variations
dbVar	8 669 169	6%	Genome structural variation studies
ClinVar	3 761 822	23%	Human variations of clinical significance
ClinicalTrials.gov	551 551	9%	Registry of clinical studies
MedGen	231 394	3%	Medical genetics literature and links
GTR	68 280	−3%	Genetic testing registry
dbGaP	1406	0%	Genotype–phenotype interaction studies
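The record counts in Table 1 can be checked against the 5.2 billion total quoted in the text. The short sketch below transcribes the counts by category and sums them; it is only an arithmetic check on the listed rows (Table 1 lists 30 of the 40 resources), not an official tally.

```python
# Record counts transcribed from Table 1 (as of 3 September 2025), by category.
TABLE_1 = {
    "Literature": [39_334_316, 11_230_676, 1_653_376, 1_121_186, 355_572],
    "DNA/RNA": [673_713_256, 47_538_647, 40_333_544, 2_828_854, 926_825, 8_497],
    "Genes": [128_414_055, 63_123_617, 8_296_887],
    "Proteins": [1_500_527_183, 994_033_355, 240_990, 177_908, 67_160],
    "Chemicals": [337_779_072, 122_265_315, 1_768_720, 250_942],
    "Clinical Genetics": [
        1_197_210_835, 8_669_169, 3_761_822, 551_551, 231_394, 68_280, 1_406,
    ],
}

# Per-category subtotals, the grand total, and the number of listed resources.
subtotals = {category: sum(counts) for category, counts in TABLE_1.items()}
total = sum(subtotals.values())
n_resources = sum(len(counts) for counts in TABLE_1.values())

for category, subtotal in subtotals.items():
    print(f"{category:<18}{subtotal:>15,}")
print(f"{'Total':<18}{total:>15,}")
print(f"{total / 1e9:.1f} billion records across {n_resources} listed resources")
```

The Protein, Identical Protein Groups, and dbSNP rows dominate the total, so small resources have little effect on the rounded figure.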

All NCBI resources are committed to following and periodically evaluating their alignment with emerging principles [3] for reliable and competent data management such as FAIR (Findable, Accessible, Interoperable, and Reusable) [4] and TRUST (Transparency, Responsibility, User focus, Sustainability, and Technology) [5]. FAIR principles accelerate discovery by focusing on the quality of data objects and data sharing by promoting standardized data formats easily read by machines and stable identifiers that allow data to be retrieved consistently over time. The more recent TRUST Principles extend this by promoting reliable and sustainable repositories that communities can trust to preserve data through periods of changing technology and/or community requirements. Users can find details of specific efforts to align with these practices on resource web sites.

Literature resources

PubMed

PubMed (https://pubmed.ncbi.nlm.nih.gov) provides free online access to citations and abstracts for biomedical literature and facilitates searching across the MEDLINE, PubMed Central and Bookshelf literature resources. In the past year, PubMed added ∼1.7 million citations, growing the database to >39 million citations in 2025. We recently updated the filters interface on the PubMed search results page to provide a more intuitive, user-friendly experience (https://www.nlm.nih.gov/pubs/techbull/so24/so24_pubmed_filters_improvements.html). We designed these updates based on user feedback, web analytics, interviews, and hands-on usability testing with PubMed users from different backgrounds, such as medical librarians, clinicians, and scientists among others. Additionally, the PubMed homepage now includes information about recent development updates and other PubMed-related highlights (https://www.nlm.nih.gov/pubs/techbull/jf25/jf25_pubmed_news.html).

PubMed Central

PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences literature (https://pmc.ncbi.nlm.nih.gov). In 2025, PMC added >900 000 full-text articles, bringing the total size of the archive to >11 million articles. These include articles from peer-reviewed journals, author manuscripts funded by NIH and other research funders, and preprints collected under the NIH Preprint Pilot. In 2025, we updated the PMC full-text search (https://pmc.ncbi.nlm.nih.gov/search/), the next step in the ongoing modernization of the PMC product and services (Fig. 1). The update transitions PMC search to the same platform used by PubMed and provides more robust search functionality and more accurate results. We also continued updating the technology behind several public PMC APIs and utilities that are now available as cloud services (Table 2).

Figure 1. View of the updated search interface in PMC, including new controls and filtering options.

Table 2.

Updated PMC API services

API name	API information
EFetch	https://pmc.ncbi.nlm.nih.gov/about/new-in-pmc/#2025-03-05
PMC ID Converter	https://pmc.ncbi.nlm.nih.gov/tools/id-converter-api/
PMC XML Style Checker	https://pmc.ncbi.nlm.nih.gov/tools/stylechecker/
OAI-PMH	https://pmc.ncbi.nlm.nih.gov/tools/oai/
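For readers harvesting metadata through the OAI-PMH service listed in Table 2, the sketch below assembles a `ListRecords` request URL following the generic OAI-PMH protocol. The base URL is deliberately left as a caller-supplied argument because this article does not state it (the value used in the usage lines is a placeholder), and the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def build_oai_list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
    set_spec: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Return an OAI-PMH ListRecords URL; no network request is made."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments are sent.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL; take the real one from the OAI-PMH page in Table 2.
first_page = build_oai_list_records_url("https://example.org/oai", from_date="2025-01-01")
next_page = build_oai_list_records_url("https://example.org/oai", resumption_token="abc")
print(first_page)
print(next_page)
```

A harvester would typically loop, passing each response's resumption token back into the function until the service returns none.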

Bookshelf

The NCBI Bookshelf provides free online access to full-text books and documents in life sciences, healthcare, and medicine. Bookshelf contains over a dozen formats collected by the NLM including monographs, reviews, reference works, government publications, standards and guidelines, technical reports, and textbooks. In the past year, Bookshelf added over 1550 books, growing the repository to over 14 600 total books from 185 content providers. Significant peer-reviewed collections added and updated in 2025 were in the subjects of chronic health, toxicology, and epidemiology. Bookshelf continues to support public access to nonjournal article documents such as systematic reviews, technical reports, and data briefs voluntarily submitted by several federal agencies, primarily in the Department of Health and Human Services, making it easier for the public to discover and cite these materials.

SciENcv

SciENcv (Science Experts Network Curriculum Vitae) is a valuable tool for researchers applying for federal funding from agencies such as the NIH, NSF, USDA, and the US Department of Energy. Available at https://www.ncbi.nlm.nih.gov/sciencv, the platform enables users to create and maintain biosketches that meet agency-specific requirements. By linking a SciENcv account to ORCID, researchers benefit from enhanced functionality, including the ability to auto-populate fields with ORCID data, incorporate citations directly from their ORCID profile, and include a persistent identifier on application documents, which several agencies have begun to require as part of the grant application process to support researcher identification. SciENcv continues to evolve in response to user needs and federal agency requirements. Recent enhancements include expanded partnerships with other US federal agencies, the launch of an XML upload feature for Current and Pending (Other) Support documents, and new capabilities for delegates assisting principal investigators with drafting application materials. SciENcv will continue to evolve to support changing federal requirements, particularly as agencies adopt more standardized forms and seek more detailed applicant information.

NIH Comparative Genomics Resource

The NIH Comparative Genomics Resource (CGR) (https://www.ncbi.nlm.nih.gov/cgr/) maximizes the impact of eukaryotic research organisms and their genomic data on biomedical research [3]. CGR includes work to expand and improve genome-related data, to develop new and improved tools for accessing and analyzing data as part of an NCBI toolkit, and to collaborate with communities and other resources to better interconnect data throughout the global biodata ecosystem and integrate them into user workflows. Over its initial five-year focused development period, CGR has produced new public tools to improve data quality, including NCBI’s Foreign Contamination Screening (FCS) tool suite [6], which identifies and removes contaminating sequences from newly sequenced genomes, and NCBI’s publicly released Eukaryotic Genome Annotation Pipeline (EGAPx, https://github.com/ncbi/egapx), which generates high-quality annotation of genes and proteins in metazoan and plant genomes. EGAPx now produces output optimized for submission to GenBank, and FCS is also used to screen new eukaryotic and prokaryotic genome submissions; together, we expect these tools to increase the number of high-quality annotated eukaryotic genomes available for future comparative genomics studies.

CGR has also expanded the number of tools available for data access and analysis to accelerate scientific discovery. NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) supports FAIR access to genome, gene, and ortholog information with user-friendly web interfaces, command-line tools, and documented APIs [7]. The Comparative Genome Viewer (https://ncbi.nlm.nih.gov/genome/cgv/, Fig. 2) [8] and Multiple Comparative Genome Viewer (https://www.ncbi.nlm.nih.gov/mcgv/) visualize pairwise and multiple genome alignments, respectively, allowing users to examine sequence and structural differences between genomes and how those differences may affect annotated genes. Improvements in BLAST such as ClusteredNR [9] help users better explore protein sequence diversity across the tree of life. The tools developed as part of CGR support a wide range of organisms and are intended to scale with the continuing exponential increase in genomic data.

Figure 2. Graphical comparison of the *Homo sapiens* and *Pan troglodytes* genomes created by the NCBI CGV.

CGR has greatly increased connectivity and collaboration with other resources, including integration of more data from UniProt [10], Ensembl [11], UCSC [12], the Alliance of Genome Resources [13], and others. CGR tools are being used to power new analysis tools like BRC Analytics (https://brc-analytics.org/). With genomes now publicly available for over 20 000 eukaryotic species, the potential of applying comparative genomic data in diverse research applications has never been greater. We continue to encourage feedback on our efforts through feedback buttons on most web pages or e-mails to cgr@nlm.nih.gov.

Clinical resources

ClinicalTrials.gov

Clinical trial registries and results databases are designed to make summaries of clinical research publicly accessible and available in a centralized repository for patients, caregivers, researchers, and the general public. ClinicalTrials.gov (https://clinicaltrials.gov) is the world’s largest publicly available clinical trial registry and results database containing over 540 000 studies with nearly 4 million website visitors each month. We now provide “Fast Forward,” a series of short videos to educate users on the modernized website. They address common user questions on how to accomplish tasks on the website, such as how to search for and download studies of interest, and are available at https://www.nlm.nih.gov/oet/ed/ct/demo_videos.html. The modern website also provides an updated API (https://clinicaltrials.gov/data-api/api) that is consistent with other publicly accessible APIs and delivers standardized data.
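To illustrate how a client might query the updated API, the following sketch builds a study-search URL without sending it. The `/api/v2/studies` path and the parameter names `query.cond`, `filter.overallStatus`, `pageSize`, and `pageToken` are assumptions drawn from our reading of the API documentation linked above, not from this article, and the helper name is hypothetical; readers should confirm them before use.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed version 2 studies endpoint; verify against the API documentation.
STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_studies_url(
    condition: str,
    status: Optional[str] = None,
    page_size: int = 10,
    page_token: Optional[str] = None,
) -> str:
    """Return a study-search URL; no network request is made."""
    params = {"query.cond": condition, "pageSize": page_size}
    if status:
        params["filter.overallStatus"] = status  # e.g. "RECRUITING" (assumed value)
    if page_token:
        params["pageToken"] = page_token  # token copied from a previous response
    return f"{STUDIES_URL}?{urlencode(params)}"


url = build_studies_url("asthma", status="RECRUITING", page_size=5)
print(url)
```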

ClinVar

ClinVar [14] archives human genetic variants classified for diseases and drug responses and contains both germline and somatic variants and functional assertions. Over the past year, ClinVar added 625 000 new variants processed from >1 million submitted records. We also updated ClinVar to better represent functional data that are critical to resolve variants of uncertain significance and variants with conflicting classifications. A given laboratory may generate and submit functional data as part of the evidence for a classification submitted to ClinVar. Research and diagnostic laboratories may also submit functional data without a classification; this includes data from MAVEs (Multiplexed Assays of Variant Effect) that generate high-throughput functional data for variants even before they have been observed in a patient. We now require several fields for submissions of functional data, including the functional consequence of the variant, the assay type, the molecular phenotype measured, a short description of the assay, and the result of the assay for each variant. Optional fields include the disease context for the assay, a citation for the experimental method, the cell line or tissue type used for the assay, the number of replicates or controls for the assay, and a longer description of the assay’s result. We updated ClinVar’s XML files to better represent functional data as an observation of the variant, including a new attribute on the “ObservedIn” element to explicitly tag the observation as functional data. We also updated ClinVar variant pages to reflect these new data types, making it easier for the user to know when functional data are available to assist in classifying variants.
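A submitter could check a functional-data record for the required fields before upload with a sketch like the one below. The field keys, the example values, and the function name are hypothetical labels chosen to mirror the prose above; they are not the column names of the ClinVar submission template.

```python
# Hypothetical keys mirroring the required and optional fields described above.
REQUIRED_FIELDS = (
    "functional_consequence",
    "assay_type",
    "molecular_phenotype",
    "assay_description",
    "assay_result",
)
OPTIONAL_FIELDS = (
    "disease_context",
    "method_citation",
    "cell_line_or_tissue",
    "replicates_or_controls",
    "result_description",
)


def missing_required(record: dict) -> list:
    """List the required fields that are absent or blank in a record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Illustrative record with made-up values; optional fields may be omitted.
example = {
    "functional_consequence": "loss of function",
    "assay_type": "MAVE",
    "molecular_phenotype": "protein abundance",
    "assay_description": "Pooled saturation mutagenesis screen",
    "assay_result": "0.12",
    "cell_line_or_tissue": "HEK293",
}
print(missing_required(example))
print(missing_required({"assay_type": "MAVE"}))
```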

MANE

The Matched Annotation from NCBI and EMBL-EBI (MANE) dataset provides a representative transcript called MANE Select for human genes to support clinical reporting and other applications [15]. MANE version 1.4, released in October 2024, incorporated the first set of noncoding genes. The next iteration of MANE (v1.5) was made available on MANE FTP (https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/) in the fall of 2025 and includes MANE Select transcripts for additional noncoding genes, revisions to MANE Select transcripts for six protein-coding genes, and additional MANE transcripts (MANE Plus Clinical) to represent significant alternate isoforms requested by clinical groups. MANE data from the new release are accessible in NCBI resources following an update to the RefSeq annotation of the human reference genome. The RefSeq team at NCBI welcomes feedback on MANE data as well as requests for MANE Plus Clinical transcripts sent to MANE-help@ncbi.nlm.nih.gov.

dbSNP and ALFA

In 2025, we released dbSNP Build 157 and ALFA Release 4 (R4) [16], significantly advancing these resources for genomic research. Build 157 contains about 1.2 billion RefSNP (rs) records (Table 1), integrating data from major sources including gnomAD v4 [17] and ALFA R4. This build offers crucial annotations, with over 930 million variants having allele frequencies and ∼1.3 million linked to clinical significance from ClinVar. ALFA R4 represents a major milestone, nearly doubling the cohort size to ∼409 000 subjects. This expansion enhances clinical utility by providing frequency data for over 959 000 ClinVar variants, a 74% increase from the previous release. Aggregating data from ∼898 million total variants, ALFA R4 provides allele frequencies for 12 major populations, making it a critical tool for interpreting both rare and common variations. Together, these updates provide a resource for understanding human genetic diversity and its impact on health, supporting improvements in personalized medicine and disease genetics. More details and information about access are available for dbSNP at https://www.ncbi.nlm.nih.gov/snp/ and for ALFA at https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/.

GTR

The NIH Genetic Testing Registry (GTR®) [18] is an international database that currently provides information on 68 273 clinical and 185 research tests from 314 laboratories. Each test record has a unique identifier (GTR ID) and is versioned. GTR helps clinicians choose appropriate tests for patient care, supports laboratories in identifying gaps to expand testing options, aids researchers in exploring the genetic basis of diseases, provides standardized test descriptions for payers’ billing and reimbursement reviews, supports professional societies in advocating for laboratory practice standardization and creating guidelines, and enables public health professionals to assess the genetic testing market, compare quality metrics, analyze testing technologies, identify trends, and evaluate the clinical impact and utilization of genetic tests. GTR made software upgrades and implemented new features based on user feedback, analytics, survey results, and market research activities. The upgrades increased the flexibility and depth of GTR’s search tools, increasing its value to the user community. Specific improvements include a simplified search box that enables complex queries in which users can select disease names, genes, and laboratory names from an autocomplete dictionary and search with a text string. The search logic now returns a list of tests that match the search query, and we optimized search filters by adding the ability to filter by genes, diseases, laboratory names, number of genes, laboratory certifications, and services. We implemented a pop-up that lists all laboratories in the search results, including laboratories with no tests registered in GTR but with a laboratory description that includes services available to the community. When the search query results in only one gene, disease, or laboratory, a summary box appears and provides relevant information and access to pages with more detailed information. We enhanced the advanced search by enabling searches for “All fields,” searches by American Medical Association Current Procedural Terminology (AMA CPT®) codes (https://www.ama-assn.org/practice-management/cpt), and an option to change Boolean logic. Users can also select a set of tests or the full set of results and download descriptions of the retrieved tests.

Pathogen detection

The NCBI Pathogen Detection Project (https://www.ncbi.nlm.nih.gov/pathogens/) helps public health scientists investigate disease outbreaks by integrating pathogen genomic sequences obtained from cultured bacterial isolates and quickly clustering and identifying related sequences [1]. It has been used successfully to help uncover international foodborne outbreaks and has reduced the burden of disease in the United States for foodborne pathogens [19, 20] along with other success stories (https://www.ncbi.nlm.nih.gov/pathogens/success_stories). As of August 2025, over 2 474 299 pathogen isolates covering 100 bacterial taxa and one emerging fungal pathogen, Candidozyma auris (renamed from Candida auris [21]), are available for analysis. The Isolates Browser (https://www.ncbi.nlm.nih.gov/pathogens/isolates) displays these analysis results on a daily basis, and they are also available on Google Cloud (https://www.ncbi.nlm.nih.gov/pathogens/docs/gcp). These data remain central to many bacterial outbreak detection efforts in the United States and internationally. The FDA, through the GenomeTrakr project, has used NCBI Pathogen Detection to initiate 1332 actions intended to protect consumers from foodborne illness (https://www.fda.gov/food/whole-genome-sequencing-wgs-program/genometrakr-network).

Antimicrobial resistance

The Pathogen Detection team has continued to improve and release updated resources for antimicrobial resistance (AMR) (https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/) [22]. The team has curated 9353 total proteins (8119 AMR proteins, 257 stress response proteins, and 977 virulence proteins) as well as 1500 point mutations and 5503 publication references for proteins and point mutations in the July 2025 release. We analyze all bacterial isolates in the Pathogen Detection Isolates Browser with AMRFinderPlus [22], and the three categories of genes (AMR, stress response, and virulence) are available in the Isolates Browser. Currently, over 2 356 946 isolates have at least one identified AMR gene, over 1 953 938 have at least one identified stress response gene, and over 1 631 099 have at least one identified virulence gene. For the subset of isolates with assemblies in GenBank, MicroBIGG-E (the Microbial Browser for Identification of Genetic and Genomic Elements, https://www.ncbi.nlm.nih.gov/pathogens/microbigge) provides detailed information and sequences for over 42 829 000 genes and point mutations identified by AMRFinderPlus in over 1 866 000 assemblies. These data are available on Google Cloud, including the sequence of those contigs with elements identified by AMRFinderPlus. Researchers are using the AMRFinderPlus results to examine the distribution of AMR genotypes at the scale of tens of thousands of pathogen isolates [23–26]. To provide geographic context for the MicroBIGG-E data, AMR element data for those isolates in MicroBIGG-E with location data in the “geo_loc_name” field in BioSample are included in the MicroBIGG-E Map (https://www.ncbi.nlm.nih.gov/pathogens/microbigge_map/). The Antibiotic Susceptibility Test (AST) Browser (https://www.ncbi.nlm.nih.gov/pathogens/ast/) allows users to search submitter-provided AST data for over 33 200 isolates (https://www.ncbi.nlm.nih.gov/pathogens/submit-data/#ast). It also includes additional data, such as measurement values and testing methods, that are not found in the Isolates Browser display. As a result, users can examine the relationship between measurement values and genomic features.
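The counts in this section can be put in proportion with a few lines of arithmetic. The sketch below sums the curated protein categories and expresses the isolate counts as shares of the isolate total quoted in the Pathogen Detection section. Because the text gives these figures as lower bounds ("over") taken at slightly different dates, the percentages are rough estimates under that assumption, not reported statistics.

```python
# Curated reference proteins by category (July 2025 release).
curated = {"AMR": 8119, "stress response": 257, "virulence": 977}
total_curated = sum(curated.values())  # matches the 9353 total quoted in the text

# Isolate total from the Pathogen Detection section (as of August 2025).
isolates_total = 2_474_299

# Isolates with at least one identified gene in each category.
isolates_with_gene = {
    "AMR": 2_356_946,
    "stress response": 1_953_938,
    "virulence": 1_631_099,
}

# Approximate share of all isolates carrying at least one gene per category.
shares = {
    category: 100 * count / isolates_total
    for category, count in isolates_with_gene.items()
}

print(f"Curated proteins: {total_curated}")
for category, share in shares.items():
    print(f"{category:<16}{share:5.1f}% of isolates")
```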

Chemical resources

PubChem, the largest public repository of information about chemicals [27], added over 60 new data sources in the past year and now provides chemical information for 122 million compounds. Notably, we added regulatory information from the US Food and Drug Administration (FDA) regarding color additives (https://www.hfpappexternal.fda.gov/scripts/fdcc/index.cfm?set=ColorAdditives) and food contact substances (https://www.fda.gov/food/food-ingredients-packaging/packaging-food-contact-substances-fcs) used in manufacturing, packing, packaging, transporting, or holding food. Data integration with the FDA Generally Recognized As Safe (GRAS) Notice Inventory (https://www.fda.gov/food/food-ingredients-packaging/generally-recognized-safe-gras) now allows users to readily find the GRAS notices for substances used in food and review the basis for a substance’s GRAS designation under its intended conditions of use in food. In addition, we incorporated chemical toxicity information from the Risk Assessment Information System (RAIS) (https://rais.ornl.gov/) and the National Toxicology Program (NTP) Technical Reports (https://ntp.niehs.nih.gov/data/tr) into PubChem. We also integrated information from the California Safe Cosmetics Program (CSCP) Product Database (https://cscpsearch.cdph.ca.gov/search/publicsearch) on cosmetic ingredients known or suspected to cause harm to human health. Finally, we loaded PubChemRDF data [machine-readable PubChem data in the Resource Description Framework (RDF) format] [28, 29] into two RDF databases (Virtuoso [30] and QLever [31]) available in Docker containers (https://pubchem.ncbi.nlm.nih.gov/docs/rdf-cloud). This allows users to readily deploy an RDF database containing PubChem data on a local machine or a virtual machine on a cloud computing platform (e.g. Google Cloud Platform) and explore the data using SPARQL queries.
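Once such a container is running, a SPARQL query can be sent over HTTP. The sketch below prepares, but does not send, a request that lists a few statements about one compound. The endpoint address (a common default for a local Virtuoso instance), the compound URI pattern, the choice of CID 2244 (aspirin), and the helper name are assumptions for illustration; the actual endpoint depends on how the container is configured.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Prepare a SPARQL-protocol GET request; nothing is sent here."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for the standard JSON serialization of SPARQL SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


CID = 2244  # aspirin, used only as a familiar example compound

# List up to ten predicate/object pairs attached to the compound (assumed URI pattern).
QUERY = f"""
SELECT ?predicate ?object
WHERE {{
  <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID{CID}> ?predicate ?object .
}}
LIMIT 10
"""

# Assumed local endpoint; adjust host, port, and path to the deployed container.
request = build_sparql_request("http://localhost:8890/sparql", QUERY)
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would execute the query against a running endpoint and return the JSON result bindings.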

For further information

The resources described here include documentation, other explanatory materials, and references to collaborators and data sources on their respective web sites. A variety of video tutorials are available on the NLM YouTube channel that can be accessed through links in the standard NCBI page footer. User-support staff are available to answer questions at info@ncbi.nlm.nih.gov, and users can view support articles at https://support.nlm.nih.gov. Updates on NCBI resources and database enhancements are described on the NCBI Insights blog (https://ncbiinsights.ncbi.nlm.nih.gov/) and on resource web pages.

Acknowledgements

The authors would like to thank all the NCBI staff who through their dedicated efforts continue to enable NCBI to provide our full collection of services to the community.

Author contributions: Eric Whitney Sayers (Conceptualization [lead], Writing—original draft [lead], Writing—review & editing [lead]), Evan Bolton (Data curation [lead], Software [lead], Writing—original draft [equal]), Anna M Fine (Data curation [lead], Software [lead], Writing—original draft [equal]), Chris Kelly (Data curation [equal], Writing—original draft [supporting]), Sunghwan Kim (Data curation [equal], Software [equal], Writing—original draft [supporting]), Melissa J. Landrum (Data curation [lead], Software [equal], Writing—original draft [equal]), Stacy Lathrop (Data curation [equal], Writing—original draft [equal]), Adriana Malheiro (Data curation [lead], Software [lead], Writing—original draft [equal]), Terence D. Murphy (Conceptualization [equal], Data curation [lead], Software [equal], Writing—original draft [equal]), Lon Phan (Data curation [lead], Software [lead], Writing—original draft [equal]), Shashikant Pujar (Data curation [equal], Software [equal], Writing—original draft [supporting]), Bart Trawick (Data curation [lead], Software [lead], Writing—original draft [equal]), Valerie Anne Schneider (Conceptualization [equal], Funding acquisition [equal], Supervision [equal], Writing—review & editing [equal]), and Kim D. Pruitt (Conceptualization [equal], Funding acquisition [lead], Supervision [lead], Writing—review & editing [lead])

Contributor Information

Eric W Sayers, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Evan E Bolton, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Anna M Fine, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Christopher Kelly, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Sunghwan Kim, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Melissa Landrum, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Stacy Lathrop, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Adriana Malheiro, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Terence D Murphy, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Lon Phan, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Shashikant Pujar, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Barton W Trawick, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Valerie A Schneider, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Kim D Pruitt, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, United States.

Conflict of interest

None declared.

Funding

This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health (NIH). The contributions of the NIH authors are considered Works of the United States Government. The findings and conclusions presented in this paper are those of the authors and do not necessarily reflect the views of the NIH or the US Department of Health and Human Services. Funding to pay the Open Access publication charges for this article was provided by the National Library of Medicine, National Institutes of Health.

References

1. Sayers EW, Beck J, Bolton EE et al. Database resources of the National Center for Biotechnology Information in 2025. Nucleic Acids Res. 2025;53:D20–9. 10.1093/nar/gkae979.
2. Schuler GD, Epstein JA, Ohkawa H et al. Entrez: molecular biology database and retrieval system. Methods Enzymol. 1996;266:141–62.
3. Lin D, McAuliffe M, Pruitt KD et al. Biomedical Data Repository Concepts and Management Principles. Sci Data. 2024;11:622. 10.1038/s41597-024-03449-z.
4. Wilkinson MD, Dumontier M, Aalbersberg IJ et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. 10.1038/sdata.2016.18.
5. Lin D, Crabtree J, Dillo I et al. The TRUST Principles for digital repositories. Sci Data. 2020;7:144. 10.1038/s41597-020-0486-7.
6. Astashyn A, Tvedte ES, Sweeney D et al. Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 2024;25:60. 10.1186/s13059-024-03198-7.
7. O’Leary NA, Cox E, Holmes JB et al. Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets. Sci Data. 2024;11:732. 10.1038/s41597-024-03571-y.
8. Rangwala SH, Rudnev DV, Ananiev VV et al. The NCBI Comparative Genome Viewer (CGV) is an interactive visualization tool for the analysis of whole-genome eukaryotic alignments. PLoS Biol. 2024;22:e3002405. 10.1371/journal.pbio.3002405.
9. Sayers EW, Bolton EE, Brister JR et al. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Res. 2023;51:D29–38. 10.1093/nar/gkac1032.
10. UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Res. 2025;53:D609–17.
11. Dyer SC, Austine-Orimoloye O, Azov AG et al. Ensembl 2025. Nucleic Acids Res. 2025;53:D948–57. 10.1093/nar/gkae1071.
12. Perez G, Barber GP, Benet-Pages A et al. The UCSC Genome Browser database: 2025 update. Nucleic Acids Res. 2025;53:D1243–9. 10.1093/nar/gkae974.
13. Bult CJ, Sternberg PW. The alliance of genome resources: transforming comparative genomics. Mamm Genome. 2023;34:531–44. 10.1007/s00335-023-10015-2.
14. Landrum MJ, Chitipiralla S, Kaur K et al. ClinVar: updates to support classifications of both germline and somatic variants. Nucleic Acids Res. 2025;53:D1313–21. 10.1093/nar/gkae1090.
15. Morales J, Pujar S, Loveland JE et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature. 2022;604:310–5. 10.1038/s41586-022-04558-8.
16. Phan L, Zhang H, Wang Q et al. The evolution of dbSNP: 25 years of impact in genomic research. Nucleic Acids Res. 2025;53:D925–31. 10.1093/nar/gkae977.
17. Chen S, Francioli LC, Goodrich JK et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature. 2024;625:92–100. 10.1038/s41586-023-06045-0.
18. Rubinstein WS, Maglott DR, Lee JM et al. The NIH genetic testing registry: a new, centralized database of genetic tests to enable access to comprehensive information and improve transparency. Nucleic Acids Res. 2013;41:D925–35. 10.1093/nar/gks1173.
19. Pereira E, Conrad A, Tesfai A et al. Multinational Outbreak of Listeria monocytogenes Infections Linked to Enoki Mushrooms Imported from the Republic of Korea 2016-2020. J Food Protect. 2023;86:100101. 10.1016/j.jfp.2023.100101.
20. Brown B, Allard M, Bazaco MC et al. An economic evaluation of the Whole Genome Sequencing source tracking program in the U.S. PLoS One. 2021;16:e0258262. 10.1371/journal.pone.0258262.
21. Liu F, Hu ZD, Zhao XM et al. Phylogenomic analysis of the Candida auris-Candida haemuli clade and related taxa in the Metschnikowiaceae, and proposal of thirteen new genera, fifty-five new combinations and nine new species. Persoonia. 2024;52:22–43. 10.3767/persoonia.2024.52.02.
22. Feldgarden M, Brover V, Fedorov B et al. Curation of the AMRFinderPlus databases: applications, functionality and impact. Microb Genom. 2022;8:000832.
23. Shawrob KSM, Dhariwal A, Salvadori G et al. Large-scale global molecular epidemiology of antibiotic resistance determinants in Streptococcus pneumoniae. Microb Genom. 2025;11:001444. 10.1099/mgen.0.001444.
24. Li X, Zhuang Y, Yu Y et al. Interplay of multiple carbapenemases and tigecycline resistance in Acinetobacter species: a serious combined threat. Clinical Microbiology and Infection. 2025;31:128–33. 10.1016/j.cmi.2024.08.027.
25. Mack AR, Hujer AM, Mojica MF et al. beta-Lactamase diversity in Pseudomonas aeruginosa. Antimicrob Agents Chemother. 2025;69:e0078524. 10.1128/aac.00785-24.
26. Mack AR, Hujer AM, Mojica MF et al. beta-Lactamase diversity in Acinetobacter baumannii. Antimicrob Agents Chemother. 2025;69:e0078424. 10.1128/aac.00784-24.
27. Kim S, Chen J, Cheng T et al. PubChem 2025 update. Nucleic Acids Res. 2025;53:D1516–25. 10.1093/nar/gkae1059.
28. Fu G, Batchelor C, Dumontier M et al. PubChemRDF: towards the semantic annotation of PubChem compound and substance databases. J Cheminform. 2015;7:34. 10.1186/s13321-015-0084-4.
29. Li Q, Kim S, Zaslavsky L et al. A resource description framework (RDF) model of named entity co-occurrences in biomedical literature and its integration with PubChemRDF. J Cheminform. 2025;17:79. 10.1186/s13321-025-01017-0.
30. Erling O, Mikhailov I (ed.), de Virgilio R., Giunchiglia F., Tanca L. (ed.), Semantic Web Information Management: A Model-Based Perspective. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp.501–19. 10.1007/978-3-642-04329-1.
31. Bast H, Buchhold B. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. Association for Computing Machinery, Singapore, Singapore, 2017, pp.647–56. 10.1145/3132847.3132921.

