Abstract
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence repository and the PubMed® repository of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 31 distinct repositories and knowledgebases. The E-utilities serve as the programming interface for most of these. Resources receiving significant updates in the past year include PubMed, PubMed Central, Bookshelf, the NIH Comparative Genomics Resource, BLAST, Sequence Read Archive, Taxonomy, iCn3D, Conserved Domain Database, Pathogen Detection, antimicrobial resistance resources and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Graphical Abstract
Introduction
NCBI overview
The National Center for Biotechnology Information (NCBI), a center within the National Library of Medicine (NLM) at the National Institutes of Health (NIH), was created in 1988 to develop information systems for molecular biology (1). In this article we provide a brief overview of the NCBI collection of databases, followed by a summary of resources that we significantly updated in the past year.
NCBI maintains a diverse set of 31 repositories and knowledgebases that together contain 4.6 billion records (Table 1), most of which are available through the Entrez retrieval system (2) at https://www.ncbi.nlm.nih.gov/search/. Each Entrez resource supports text searching using simple Boolean queries, downloading of data in various formats and linking records between databases based on asserted relationships. Records retrieved in Entrez can be displayed in many formats and downloaded singly or in batches. An application programming interface for Entrez functions (the E-utilities) is available, and detailed documentation is provided at https://eutils.ncbi.nlm.nih.gov/.
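As a rough illustration of how a text search maps onto the E-utilities, the following Python sketch builds an ESearch request URL. The `esearch.fcgi` path and the `db`, `term`, `retmax` and `retmode` parameters reflect our reading of the public E-utilities documentation; the helper name `esearch_url` and the example query are illustrative assumptions. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Base path for the E-utilities (assumed from the public documentation).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that asks for matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# A simple Boolean query against PubMed (example terms are hypothetical).
url = esearch_url("pubmed", "antimicrobial resistance AND genomics")
print(url)
```

The same helper can target another Entrez database by changing `db`; usage policies such as rate limits and API keys are not covered by this sketch.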
Table 1.
NCBI repositories and knowledgebases (as of 21 August 2024)
| Database | Records | Annual growth | Description |
|---|---|---|---|
| Literature | | | |
| PubMed | 37 619 497 | 4% | Scientific and medical abstracts/citations |
| PubMed Central | 10 162 706 | 10% | Full-text journal articles |
| NLM Catalog | 1 650 874 | 1% | Index of NLM collections |
| Bookshelf | 1 056 250 | 7% | Books and reports |
| MeSH | 355 218 | 0.4% | Ontology used for PubMed indexing |
| DNA/RNA | | | |
| Nucleotide | 635 869 265 | 5% | DNA and RNA sequences from GenBank and RefSeq |
| BioSample | 40 386 493 | 16% | Descriptions of biological source materials |
| SRA | 34 747 933 | 20% | High-throughput DNA/RNA sequence read archive |
| Taxonomy | 2 728 258 | 3% | Taxonomic classification and nomenclature catalog |
| BioProject | 811 517 | 14% | Biological projects providing data to NCBI |
| BioCollections | 8497 | 0% | Museum, herbaria and biorepository collections |
| Genes | | | |
| GEO Profiles | 128 414 055 | 0% | Gene expression and molecular abundance profiles |
| Gene | 54 753 643 | 16% | Collected information about gene loci |
| PopSet | 8 151 530 | 4% | Sequence sets from phylogenetic/population studies |
| GEO Datasets | 7 617 524 | 11% | Functional genomics studies |
| Proteins | | | |
| Protein | 1 338 769 287 | 12% | Protein sequences from GenBank and RefSeq |
| Identical Protein Groups | 813 686 460 | 29% | Protein sequences grouped by identity |
| Structure | 223 775 | 7% | Experimentally determined biomolecular structures |
| Protein Family Models | 159 572 | −4% | Conserved domain architectures, HMMs and BlastRules |
| Conserved Domains | 67 160 | 5% | Conserved protein domains |
| Chemicals | | | |
| PubChem Substance | 319 894 321 | 4% | Deposited substance and chemical information |
| PubChem Compound | 118 564 722 | 2% | Chemical information with structures, information and links |
| PubChem BioAssay | 1 671 279 | 3% | Bioactivity screening studies |
| PubChem Pathways | 241 163 | 0.2% | Molecular pathways with links to genes, proteins and chemicals |
| Clinical genetics | | | |
| dbSNP | 1 121 739 543 | 0% | Short genetic variations |
| dbVar | 8 151 530 | 5% | Genome structural variation studies |
| ClinVar | 3 068 120 | 31% | Human variations of clinical significance |
| ClinicalTrials.gov | 505 852 | 9% | Registry of clinical studies |
| MedGen | 225 517 | 4% | Medical genetics literature and links |
| GTR | 70 264 | −13% | Genetic testing registry |
| dbGaP | 1406 | 0% | Genotype/phenotype interaction studies |
NCBI collects data from four sources: direct submissions from researchers, national and international collaborations or agreements with data providers and research consortia, public health surveillance efforts and internal curation. Details about direct submission processes are available from the NCBI Submit page (https://www.ncbi.nlm.nih.gov/home/submit.shtml) and from the resource home pages (e.g. the GenBank page, https://www.ncbi.nlm.nih.gov/genbank/). More information about the various collaborations, agreements and curation efforts is also available through the home pages of the individual resources.
Recent developments
Literature updates
PubMed
PubMed provides free online access to citations and abstracts for biomedical literature and facilitates searching across the MEDLINE, PubMed Central and Bookshelf literature resources. In the past year, PubMed added over 1.6 million citations, growing the database to >37 million total citations in 2024. PubMed search syntax now supports expanded use of the asterisk (*) wildcard character (https://www.nlm.nih.gov/pubs/techbull/mj24/mj24_pubmed_wildcard.html). With this update, wildcards can be used in the middle of a term or phrase, and multiple wildcards can be used in the same term or phrase. Related article links now appear in the search results summary display, increasing awareness and facilitating navigation to updated versions of citations and retraction notices (https://www.nlm.nih.gov/pubs/techbull/mj24/mj24_pubmed_related_citations.html). For example, if a preprint citation is linked to a published journal article, a link now appears in the search results for that preprint citation that provides access to the published article’s PubMed citation.
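To make the expanded wildcard behavior concrete, the sketch below translates a pattern containing one or more asterisks into a regular expression and tests candidate words against it. This is only an illustration of mid-term and multiple wildcards, not PubMed's implementation: it assumes that each `*` stands for zero or more word characters, and it does not model any PubMed-specific restrictions on wildcard use. The function name and example pattern are hypothetical.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a '*' wildcard pattern into a case-insensitive regex.

    Assumption: each '*' matches zero or more word characters; every other
    character in the pattern must match literally.
    """
    literal_parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(r"\w*".join(literal_parts), re.IGNORECASE)


# One wildcard in the middle of the term and one at the end.
rx = wildcard_to_regex("organi*ation*")
for word in ["organization", "organisations", "organ"]:
    print(word, bool(rx.fullmatch(word)))
```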
PubMed Central
PubMed Central (PMC) is NLM’s free full-text archive of biomedical and life sciences literature. In 2024, PMC added >900 000 full-text articles to the archive, surpassing 10 million publicly available full-text articles. These include articles from peer-reviewed journals, as well as author manuscripts funded by NIH and other research funders, and preprints collected under the NIH Preprint Pilot (https://ncbiinsights.ncbi.nlm.nih.gov/2023/01/09/next-phase-preprint-pilot). In 2024, PMC also began updating the technology behind its public website. By using cloud services, PMC aims to make its website more sustainable and reliable. To facilitate this transition, NCBI/NLM announced a public preview of upcoming changes to the PMC website in March for user testing and feedback (https://www.nlm.nih.gov/pubs/techbull/ma24/ma24_pmc_updates.html). The accompanying changes included an updated look and feel for articles focused on the accessibility and readability of the content, improved article navigation, and a new journal list display. The preview version of the website became the default for PMC in the fall of 2024 (https://pmc.ncbi.nlm.nih.gov/).
We also recognized that improving the accessibility of the scientific literature is as much a data issue as a display issue and engaged with publishers and data providers to improve the accessibility of content submitted to PMC throughout 2024. To date, we have released updates to the specifications for the banners (branding) displayed at the top of each article in PMC (https://www.ncbi.nlm.nih.gov/pmc/pub/filespec-banner/) and to the PMC Tagging Guidelines (https://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html), engaged in strategic efforts to improve the machine-readability of equations and table data, and taken steps to make plain language summaries more discoverable. Improvements to the overall accessibility and usability of the NIH Manuscript Submission system process were also released in mid-2024.
Bookshelf
The NCBI Bookshelf provides free online access to full-text books and documents in the life sciences, healthcare and medicine. Bookshelf contains over a dozen formats collected by the NLM, including monographs, reviews, reference works, government publications, standards and guidelines, technical reports, and textbooks. In the past year, Bookshelf added over 1550 books, growing the repository to over 13 150 total books from 175 content providers. Significant new peer-reviewed collections added in 2024 were in the subjects of public health, health disparities and diabetes.
In 2024, we added new short video tutorials on how to search and filter Bookshelf (https://www.nlm.nih.gov/pubs/techbull/ja24/ja24_ncbi_bookshelf_tutorials.html). We also added discovery guides that provide predefined queries and other tips to help users explore within subject areas collected by the NLM, frequently used format collections including clinical guidelines and systematic reviews, and other prominent Bookshelf collections such as toxicology reviews, assessments and technical reports published by health agencies. In addition, to support the World Health Organization’s (WHO) efforts to develop SMART guidelines (https://www.who.int/teams/digital-health-and-innovation/smart-guidelines), Bookshelf now provides a finding aid file that we update monthly. This file includes nearly 4500 clinical guidelines and systematic reviews from WHO and other agencies such as the Agency for Healthcare Research and Quality. These documents are available for programmatic access and reuse, including transformation into computable guidelines that conform to HL7 standards for implementation into electronic health records. To inform this effort, we collaborated with clinical guideline developers and informaticists to develop standard operating procedures for developing narrative clinical guidelines (3).
Lastly, to streamline submission workflows, we made a tool that allows submitters and collaborators to upload their Word and XML source files and generate a preview of their content. The tool reports conversion errors in real time so that submitters can address them before approving the preview. This tool has reduced our manual effort by >50% and has allowed us to publish important information more rapidly, increasing accesses and citability.
Genome updates
NIH Comparative Genomics Resource
The NIH Comparative Genomics Resource (CGR) (https://www.ncbi.nlm.nih.gov/datasets/cgr/) maximizes the impact of eukaryotic research organisms and their genomic data on biomedical research (4). CGR includes work to expand and improve genome-related data, to develop new and improved tools for accessing and analyzing data as part of an NCBI toolkit, and to collaborate with communities and other resources to better interconnect data throughout the global biodata ecosystem and integrate into users’ workflows.
To help improve data quality, all new eukaryotic and prokaryotic genomes submitted to GenBank are screened with NCBI’s Foreign Contamination Screening (FCS) tool suite that detects adaptors and cross-species contamination (5). FCS is available for public use (https://github.com/ncbi/fcs), including on the Galaxy platform (6). We now provide detailed information on contaminant sequences found in all eukaryote and prokaryote genome assemblies in the form of aggregate summary and detail reports (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/) and individual reports provided in each assembly directory on the NCBI genomes FTP site. Users can leverage these data to select for better assemblies at thresholds of their choosing, or they can mask contaminant sequences so they do not adversely affect analyses. Contaminant sequences and those from genome assemblies with unverified organism source information are also filtered out of the BLAST (7) nucleotide (nt and core_nt) and protein (nr) databases to remove misleading results.
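The masking option mentioned above can be sketched as follows: given a sequence and a list of contaminant intervals, replace the affected bases with `N` so that downstream analyses ignore them. This is a minimal illustration rather than part of the FCS tool suite. The function name is hypothetical, and the 0-based, half-open interval convention is an assumption; coordinates taken from an actual contamination report would need to be converted to this convention first.

```python
def mask_intervals(sequence: str, intervals: list[tuple[int, int]]) -> str:
    """Replace bases inside 0-based, half-open [start, end) intervals with 'N'."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp each interval to the sequence bounds before masking.
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Toy contig with two hypothetical contaminant spans.
contig = "ACGTACGTACGT"
print(mask_intervals(contig, [(2, 5), (9, 12)]))  # prints ACNNNCGTANNN
```

Hard-masking with `N` keeps sequence coordinates unchanged, which is convenient when annotations refer to positions in the original assembly.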
To help provide more information about genes and proteins on new eukaryotic genomes, we annotated over 230 new animal and plant genomes in the last year using the Eukaryotic Genome Annotation Pipeline (EGAP), another CGR toolkit component, and these data are available in the RefSeq collection and the gene resource (https://www.ncbi.nlm.nih.gov/gene/). RefSeq now includes genomes from over 2000 species of eukaryotes, 19 000 species of prokaryotes and 6500 species of viruses (8). We are in the process of redesigning EGAP so that users can access it on cloud platforms or locally available computing infrastructure. A preliminary version called EGAPx is available on GitHub (https://github.com/ncbi/egapx) for testing, and we welcome feedback. EGAPx is designed to produce high-quality annotations on metazoan and plant genomes, with output suitable for submission to GenBank. The capabilities and organism scope of EGAPx will be further expanded over the next year.
NCBI Datasets
NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) provides modern and streamlined access to genome, taxonomy and gene information with user-friendly web and programmatic interfaces that support scalable access. In the past year, NCBI Datasets has replaced the legacy Assembly and Genome web resources (9). NCBI Orthologs, accessed through NCBI Gene, has replaced the prior HomoloGene web resource, and now includes access to 1:1 orthologs in over 1000 species of vertebrates and arthropods. The Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/) allows users to visually inspect two genomes based on their alignments to one another, and now includes over 1000 alignments from over 400 species (10). Additional alignments are added to CGV based on user requests. Together, the new tools and data improvements made possible through CGR aim to enable researchers to leverage the growing wealth of eukaryotic genomic data to advance biomedical research, and we encourage user feedback to help inform new development (cgr@nlm.nih.gov).
In June 2024, NCBI streamlined access to genome-related data by merging the Entrez Genome and Assembly websites into the new NCBI Datasets resource (9). This integration offers more comprehensive access to the rapidly expanding and increasingly complex repository of assembled genomes. Over the past 15 years, the number of submitted genomes has surged, accompanied by more intricate data, including genome, transcript, protein FASTA and annotation files. The associated metadata, spread across the Assembly, Genome, BioSample, BioProject and Taxonomy databases, has also grown in complexity. NCBI Datasets enhances the user experience with an intuitive interface that allows easy searches by taxonomic name or NCBI accession and provides efficient browsing and filtering. The redesigned genome assembly webpages provide easy browsing of detailed assembly information and seamless navigation to tools such as Genome Data Viewer, annotation tables and genome-specific BLAST pages (Figure 1). The new genome table simplifies browsing and filtering of available genomes and offers a convenient download option directly from the table. The platform also supports programmatic access through a new command-line tool, designed to be user-friendly for both beginners and advanced users. Although the legacy websites have been retired, programmatic access to the Assembly database through the E-utilities API remains unchanged, ensuring continuity for existing workflows.
Figure 1.
NCBI Datasets assembly page for Equus caballus assembly EquCab3.0.
BLAST
We recently released a new BLAST database, core_nt, to address the effects that the exponential growth of the NCBI nt database is having on users and our internal data management. The new database is easier to maintain and facilitates more efficient searches. Core_nt contains the same content as nt except for eukaryotic chromosomal sequences. To search such chromosomal sequences, we recommend using the RefSeq Reference Genome database available on the BLAST search pages or searching against a genome assembly from NCBI Datasets. Core_nt offers a faster search experience yet contains the most searched content of the nt database. While the original nt database will continue to be available on the BLAST website, core_nt is now the default database for blastn searches. Another enhancement allows users of Primer-BLAST to save their results. Users can download primer pair information in three different formats: text, CSV and tabular. This feature will be helpful for downstream tasks such as primer ordering.
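For scripted searches, the choice between core_nt and nt comes down to a single database parameter. The sketch below encodes the form fields for a blastn submission in the style of the BLAST URL API (`CMD=Put`, `PROGRAM`, `DATABASE`, `QUERY`). That `core_nt` is accepted as a `DATABASE` value through this interface is an assumption to verify against the BLAST documentation. The helper name and query sequence are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

# Endpoint of the BLAST URL API (assumed from the public BLAST documentation).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blastn_put_params(query: str, database: str = "core_nt") -> str:
    """Encode the form fields for submitting a blastn search (CMD=Put)."""
    fields = {
        "CMD": "Put",
        "PROGRAM": "blastn",
        "DATABASE": database,  # assumption: "core_nt" or "nt"
        "QUERY": query,
    }
    return urlencode(fields)


# Default database is core_nt; pass database="nt" to search the full nt set.
body = blastn_put_params("ACGTACGTACGTACGTACGT")
print(BLAST_URL, body)
```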
Sequence Read Archive
NCBI continues to provide SRA (11) data in the cloud (12) to support a variety of use cases, including those requiring high throughput analyses. Users can obtain the >31 000 000 public SRA files, over 47 petabytes in a single copy, from the Amazon and Google commercial cloud platforms as well as the Amazon Web Services Open Data program (https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/). These data are provided in both Normalized and Lite formats with associated SRA, BioSample, and BioProject metadata and Sequence Taxonomic Analysis Tool (STAT) analyses (13). We updated our Google Cloud Platform storage model to incorporate NEARLINE, single region storage, and SRA Lite is now available from the us-east-1 location (https://cloud.google.com/storage/docs/storage-classes). Data are available from this location without retrieval fees using the SRA toolkit, but egress transfer fees may apply depending on the destination (https://cloud.google.com/storage/pricing#network-buckets).
We extended support for SRA submissions to FASTQ and BAM files generated from new Element BioSciences and Ultima technologies, allowing data from these emerging sequencing technologies to be added to the interoperable SRA corpus. To prevent the unintentional release of human sequence data, SRA provides an opt-in service based on STAT that masks human sequence reads present in all submitted FASTQ and BAM files. More information about how to request this service can be found at https://www.ncbi.nlm.nih.gov/sra/docs/submit/#human-data. To increase the transparency of the SRA data lifecycle, we formally defined SRA data statuses and documented status transitions for SRA records (https://www.ncbi.nlm.nih.gov/sra/docs/sequence-data-processing/). Finally, we added support for ARM64 processors to the SRA toolkit, enhancing our support of Mac users.
Taxonomy
Average Nucleotide Identity (ANI) is a valuable tool for determining the taxonomic identity of genome assemblies. GenBank uses ANI (14) to generate a taxonomy check status (‘ok’, ‘failed’ or ‘inconclusive’) that aids users in evaluating the reliability of taxonomic assignments for public genome assemblies. Each taxonomy check status is further categorized with an ANI best match status. Recently, we introduced two new ANI best match statuses to address assemblies with informal species names classified at taxonomic ranks higher than genus. Both statuses have a taxonomy check status of ‘inconclusive’. The first status, ‘lineage-match’, indicates that an assembly shares the same lineage as the best matching assembly from a type strain. The second status, ‘below-threshold-lineage-match’, indicates that an assembly matches one from a type strain from the same lineage, but the ANI is below the species ANI threshold. Currently, 8871 and 14 664 prokaryotic assemblies have ‘lineage-match’ and ‘below-threshold-lineage-match’ statuses, respectively (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/ANI_report_prokaryotes.txt) (14). Additionally, we added 5989 new assemblies from type materials in 2023 and the first half of 2024. We now have a total of 27 276 type strain assemblies across 21 025 ranked taxa (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/prokaryote_type_strain_report.txt) (15).
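Counts such as those above can in principle be reproduced by tallying one column of a tab-delimited report. The sketch below does this for a tiny in-memory sample. The column names `best-match-status` and `taxonomy-check-status`, the header layout and the sample rows are assumptions made for illustration; the header of the actual ANI report should be inspected before adapting this code.

```python
import csv
import io
from collections import Counter


def count_column(report_text: str, column: str) -> Counter:
    """Count how often each value occurs in one named column of a TSV report."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up sample rows; accessions are placeholders, not real assemblies.
sample = (
    "accession\tbest-match-status\ttaxonomy-check-status\n"
    "ASM_0001\tlineage-match\tinconclusive\n"
    "ASM_0002\tbelow-threshold-lineage-match\tinconclusive\n"
    "ASM_0003\tlineage-match\tinconclusive\n"
)

counts = count_column(sample, "best-match-status")
print(counts["lineage-match"], counts["below-threshold-lineage-match"])
```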
Reference genomes
We are now using the single term ‘reference’ for the genome assembly deemed best among all genomes available for a given species. Previously, ‘reference’ was reserved for a small number of hand-chosen genomes recognized by the scientific community as the anchor or gold standard for a species, while the term ‘representative’ was used to label genomes chosen automatically. We have deprecated ‘representative’ in favor of ‘reference’. We select these reference genomes for a species based on the assembly and its available annotation metrics, and in a small number of cases, by curatorial review (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/policies-annotation/genome-processing/reference-selection/). The set of eukaryotic reference assemblies is updated continuously as new assemblies are submitted to GenBank, while the set of prokaryotic references is recalculated three times a year. As of 12 September 2024, there were 38 227 reference assemblies in GenBank (17 823 eukaryotic, 19 700 bacterial and 704 archaeal).
Protein updates
iCn3D
In the previous year we updated iCn3D, our 3D molecular structure viewer (16,17), so that users can augment protein 3D/1D views with annotations depicting protein isoforms and the location and boundaries of exons (Figure 2). Furthermore, ligand–protein interactions can now be studied in greater atomic detail by displaying 2D ‘chemical drawing’ representations of the ligands and a representation of interacting protein residues (Figure 3).
Figure 2.
iCn3D view of gene exons superimposed on the 3D AlphaFold structure A4D1S0. Also shown are the sequence alignments of this structure with RefSeq proteins in which the exons are represented by the same color gradient.
Figure 3.
iCn3D views of the protein–ligand interactions between the drug Gleevec and human ABL2.
iCn3D now automatically detects immunoglobulin (Ig)-like domains by matching the 3D topology of arbitrary structures to a library of Ig templates. If matches are found, ‘IgStrand’ reference numbers are mapped to the structures (18) using a new reference number scheme distinct from earlier proposals by Kabat (19) and the IMGT (20) reference numbers. The reference numbering covers all residues, facilitating comparative analysis of Ig-like topologies and binding interfaces between different Ig domains. Instructions on generating the reference numbering can be found at https://www.ncbi.nlm.nih.gov/Structure/icn3d/icn3d.html#igrefnum.
CDD
The Conserved Domain Database (CDD, https://www.ncbi.nlm.nih.gov/cdd/), v3.21, contains 62 456 protein and protein domain models obtained from a variety of sources: Pfam v35 (21), SMART (22), COGs (23), TIGRFAMS (24), the NCBI Protein Clusters collection (25), NCBIfam (26) and finally CDD internal curation efforts (27) that make up ∼40% of the collection. The upcoming version (v3.22) will include Pfam v37 and ∼1200 new or updated models from CDD curation efforts. CDD provides a CD-search service that can accept single protein queries (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) or batches of up to 1000 queries (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi).
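Because the batch service accepts up to 1000 queries at a time, larger query sets need to be split before submission. A minimal sketch of that splitting step is shown below. The function name and placeholder identifiers are hypothetical, and submitting the batches is outside the scope of the example.

```python
from typing import Iterator


def batches(queries: list[str], size: int = 1000) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` queries, preserving order."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(queries), size):
        yield queries[start:start + size]


# 2500 placeholder identifiers split into chunks of 1000, 1000 and 500.
protein_ids = [f"query_{i}" for i in range(2500)]
chunk_sizes = [len(chunk) for chunk in batches(protein_ids)]
print(chunk_sizes)  # prints [1000, 1000, 500]
```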
Clinical updates
ClinicalTrials.gov
Clinical trial registries and results databases are designed to make summary clinical research information and results publicly accessible and available in a centralized repository for patients, caregivers, researchers and the general public. ClinicalTrials.gov (https://clinicaltrials.gov) is the world’s largest publicly available clinical trial registry and results database containing over 500 000 studies and visited by nearly 4 million website users each month. To support this growth, we began an effort to modernize ClinicalTrials.gov in 2019 with the aim to deliver an improved user experience on an updated platform that will accommodate growth and enhance efficiency. As of June 2024, the modernized ClinicalTrials.gov website became the unified web experience with improved navigation and searchability. The new design is easier to use, more functional, more streamlined and optimized for mobile devices. It also provides plain language guidance and support materials. In addition, the new website provides a modernized API that aligns with other publicly accessible APIs and standardized data.
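As a hedged illustration of programmatic access through the modernized API, the sketch below builds a study search URL. The `/api/v2/studies` path and the `query.cond` and `pageSize` parameter names reflect our understanding of the public API documentation and should be checked there before use. The helper name and the example condition are hypothetical, and no request is made.

```python
from urllib.parse import urlencode

# Study search endpoint (assumed from the public API documentation).
CTGOV_STUDIES = "https://clinicaltrials.gov/api/v2/studies"


def studies_url(condition: str, page_size: int = 10) -> str:
    """Build a URL that searches registered studies by condition."""
    params = {"query.cond": condition, "pageSize": page_size}
    return f"{CTGOV_STUDIES}?{urlencode(params)}"


ct_url = studies_url("type 2 diabetes")
print(ct_url)
```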
NLM also continues to work on modernizing the Protocol Registration and Results System (PRS), the clinical trial information submission and management portal for ClinicalTrials.gov. An enhanced results information submission process is coming soon, and the modern PRS became the primary website for protocol registration and management of study records as of August 2024. Users can seamlessly transition to the Classic PRS to manage some of their records and to enter study results. More enhancements to the modern PRS, such as uploading a record and records with results, delayed results and study documents, are to come in 2025.
Pathogen Detection
The NCBI Pathogen Detection Project (https://www.ncbi.nlm.nih.gov/pathogens/) helps public health scientists investigate disease outbreaks by integrating pathogen genomic sequences obtained from cultured bacterial isolates and quickly clustering and identifying related sequences (28). It has been used successfully to help uncover an international outbreak due to contaminated mushrooms (29) and has been shown to contribute significantly to reducing illness and the burden of disease in the US for foodborne pathogens (30). As of 5 August 2024, over 1 966 530 pathogen isolates covering 84 bacterial taxa and one emerging fungal pathogen, Candida auris, are available for analysis. The analysis results are available in the Isolates Browser on a daily basis (https://www.ncbi.nlm.nih.gov/pathogens/isolates), and are also available on Google Cloud (https://www.ncbi.nlm.nih.gov/pathogens/docs/gcp). This near-real-time update of comprehensive public data is now central to many bacterial outbreak detection and analysis efforts in the US and internationally. The FDA, through the GenomeTrakr project, has used NCBI Pathogen Detection to initiate 1229 actions intended to protect consumers from foodborne illness (https://www.fda.gov/food/whole-genome-sequencing-wgs-program/genometrakr-network). Researchers have also used the resource to investigate hospital outbreaks (31). For more examples showing how NCBI Pathogen Detection resources contribute to public health and research see https://www.ncbi.nlm.nih.gov/pathogens/success_stories.
Antimicrobial resistance resources
The Pathogen Detection team has continued to improve and release updated resources for antimicrobial resistance (AMR) (https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/) (32). The team has curated 8355 total proteins (7138 AMR proteins, 257 stress response proteins and 960 virulence proteins) as well as 1351 point mutations and 5049 publication references for proteins and point mutations in the July 2024 release.
All bacterial isolates in the Pathogen Detection Isolates Browser are analyzed with AMRFinderPlus, and the three categories of genes (AMR, stress response and virulence) are available in the Isolates Browser. Currently over 1 879 000 isolates have at least one identified AMR gene, over 1 574 000 have at least one identified stress response gene and over 1 352 000 have at least one identified virulence gene. For the subset of isolates with assemblies in GenBank, MicroBIGG-E (the Microbial Browser for Identification of Genetic and Genomic Elements, https://www.ncbi.nlm.nih.gov/pathogens/microbigge) provides detailed information and sequences for over 31 500 000 genes and point mutations identified by AMRFinderPlus in over 1 482 000 assemblies. These data are available on Google Cloud, including the sequence of those contigs with elements identified by AMRFinderPlus.
To provide geographic context for the MicroBIGG-E data, AMR element data for those isolates in MicroBIGG-E with location data in the ‘geo_loc_name’ field in BioSample are now displayed in a new interface, the MicroBIGG-E Map (https://www.ncbi.nlm.nih.gov/pathogens/microbigge_map/). The MicroBIGG-E Map displays the number of instances these elements occur in MicroBIGG-E, the number of isolates that contain one or more copies of these elements, and the proportion of total isolates that possess one or more copies of these elements. For example, if users choose ‘blaKPC’, the KPC family beta-lactamases, they can then examine the global distribution of carbapenemase and cephalosporinase enzymes in the map (Figure 4). The MicroBIGG-E Map also allows users to limit the data used by selecting one or more countries in a GUI map display. This tool can also display the proportion and counts of selected AMR genes by year of collection or addition to the Pathogen Detection system. Data from selected countries can be viewed in the Isolates Browser or MicroBIGG-E using cross-browser selection. The MicroBIGG-E Map only contains those isolates with geographic metadata, a detail that users should consider when interpreting the displays (https://www.ncbi.nlm.nih.gov/pathogens/microbigge_map_details/).
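The three quantities shown in the map (instances of an element, isolates carrying it, and the proportion of isolates carrying it) can be made concrete with a small calculation. The sketch below is not the MicroBIGG-E implementation. The data structure, isolate names and the `geneX` label are hypothetical, and real data would first be restricted to isolates with geographic metadata.

```python
def element_summary(isolates: dict[str, list[str]], element: str) -> tuple[int, int, float]:
    """Return (instances, isolates_with_element, proportion_of_isolates)."""
    # Total copies of the element across all isolates.
    instances = sum(elements.count(element) for elements in isolates.values())
    # Isolates that carry one or more copies.
    carriers = sum(1 for elements in isolates.values() if element in elements)
    proportion = carriers / len(isolates) if isolates else 0.0
    return instances, carriers, proportion


# Hypothetical isolates and the elements detected in each.
detected = {
    "isolate_1": ["blaKPC", "blaKPC"],
    "isolate_2": ["blaKPC"],
    "isolate_3": [],
    "isolate_4": ["geneX"],
}
print(element_summary(detected, "blaKPC"))  # prints (3, 2, 0.5)
```

Keeping instances and carrier counts separate matters because a single isolate can hold several copies of the same element.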
Figure 4.
The distribution of cephalosporinases and carbapenemases in Escherichia coli found in China and the U.S. (A) Using a drop-down menu, users can investigate either individual taxonomic groups or all isolates combined; shown here are data for E. coli. (B) Users can filter the data using a variety of data fields. Here we display the selection of all acquired carbapenemases and cephalosporinases. (C) To limit the data to certain countries, individual countries can be selected on the map. In this case, both the USA and China are selected. In addition, the shading of the map indicates the frequency (as a percentage) of the selected genetic elements in each country. (D) Bar chart of the counts of cephalosporinase and carbapenemase genes in China and the USA combined.
The new Antibiotic Susceptibility Test (AST) Browser (https://www.ncbi.nlm.nih.gov/pathogens/ast/) allows users to search submitter-provided AST data for over 28 800 isolates (https://www.ncbi.nlm.nih.gov/pathogens/submit-data/#ast). It also includes additional data, such as measurement values and testing methods, that are not found in the Isolates Browser display. Subsets of the AST data can be downloaded, and isolates with the selected phenotypes can be viewed in the Isolates Browser or MicroBIGG-E using cross-browser selection, so users can examine the relationship between measurement values and genomic features. For example, users could compare the genomic elements involved in resistance to a given antibiotic between those isolates at the resistance breakpoint and those with much higher levels of resistance.
Chemical updates
PubChem (33,34), a public chemical database at NCBI, has significantly expanded its content scope in the past year by integrating data from >70 new sources. Thanks to this integration, PubChem now provides chemical information for 119 million compounds collected from >1000 data sources. The newly added data include annotations on the safety, health and environmental effects of chemicals. Examples are the data from the Integrated Risk Information System and Provisional Peer-Reviewed Toxicity Values at the U.S. Environmental Protection Agency as well as regulatory information from the Australian Industrial Chemicals Introduction Scheme and the New Zealand Environmental Protection Authority. Also notable is information on chemicals used in cosmetics and fragrances, integrated from the Cosmetic Ingredient Review and the International Fragrance Association, respectively.
Additionally, we developed the patent knowledge panel that shows chemicals, genes and diseases frequently mentioned with a given chemical or gene in patent documents. The data underlying the patent knowledge panel were derived from the analysis of the co-occurrence of named entities in a large corpus of patent documents published by patent offices in the USA, European Union, Japan and Korea. The patent knowledge panel helps users discover important entity relationships that might not be found through the literature knowledge panel (35), whose co-occurrence data are derived from PubMed abstracts. Moreover, the co-occurrence data from the literature knowledge panel are also made available in PubChemRDF (36), which is machine-readable PubChem data formatted using the Resource Description Framework.
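The notion of entity co-occurrence can be illustrated with a simple count of how many documents mention each pair of entities together. The sketch below is a generic illustration, not the analysis used to build the knowledge panels. It assumes that named entities have already been extracted from each document, and the entity labels are placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents: list[set[str]]) -> Counter:
    """Count, for each pair of entities, the documents mentioning both."""
    counts: Counter = Counter()
    for entities in documents:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts


# Entities per document (placeholder labels, assumed already extracted).
docs = [
    {"chemical_A", "gene_B", "disease_C"},
    {"chemical_A", "gene_B"},
    {"chemical_A", "disease_C"},
]
pair_counts = cooccurrence_counts(docs)
print(pair_counts.most_common(2))
```

Raw counts favor frequently mentioned entities, so a practical ranking would likely normalize them in some way; that step is omitted here.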
For further information
The resources described here include documentation, other explanatory materials and references to collaborators and data sources on their respective web sites. An outreach events page (https://ncbiinsights.ncbi.nlm.nih.gov/ncbi-outreach-events/) provides links to webinars, courses and upcoming conference exhibits. A variety of video tutorials are available on the NLM YouTube channel, which can be accessed through links in the standard NCBI page footer. User-support staff are available to answer questions at info@ncbi.nlm.nih.gov, and users can view support articles at https://support.nlm.nih.gov. Updates on NCBI resources and database enhancements are described on the NCBI Insights blog (https://ncbiinsights.ncbi.nlm.nih.gov/), on NCBI social media sites (Facebook, X and LinkedIn) and on several mailing lists and RSS feeds that provide updates on services and databases. Links to these resources are in the NCBI page footer and on NCBI Insights.
Acknowledgements
The authors would like to thank all the NCBI staff who through their dedicated efforts continue to enable NCBI to provide our full collection of services to the community.
Contributor Information
Eric W Sayers, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Jeffrey Beck, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Evan E Bolton, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
J Rodney Brister, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Jessica Chan, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Ryan Connor, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Michael Feldgarden, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Anna M Fine, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Kathryn Funk, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Jinna Hoffman, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Sivakumar Kannan, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Christopher Kelly, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
William Klimke, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Sunghwan Kim, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Stacy Lathrop, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Aron Marchler-Bauer, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Terence D Murphy, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Chris O’Sullivan, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Erin Schmieder, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Yuriy Skripchenko, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Adam Stine, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Francoise Thibaud-Nissen, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Jiyao Wang, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Jian Ye, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Erin Zellers, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Valerie A Schneider, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Kim D Pruitt, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Data availability
The resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Funding
This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health.
Conflict of interest statement. None declared.
References
- 1. Sayers E.W., Beck J., Bolton E.E., Brister J.R., Chan J., Comeau D.C., Connor R., DiCuccio M., Farrell C.M., Feldgarden M. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2024; 52:D33–D43.
- 2. Schuler G.D., Epstein J.A., Ohkawa H., Kans J.A. Entrez: molecular biology database and retrieval system. Methods Enzymol. 1996; 266:141–162.
- 3. Chehab C., Lathrop S., Sennabaum C. GIN McMaster Guideline Development Checklist Extension for Computable Guidelines. techRxiv doi: 10.22541/au.170073120.04197258/v1, 23 November 2023, preprint: not peer reviewed.
- 4. Bornstein K., Gryan G., Chang E.S., Marchler-Bauer A., Schneider V.A. The NIH Comparative Genomics Resource: addressing the promises and challenges of comparative genomics on human health. BMC Genomics. 2023; 24:575.
- 5. Astashyn A., Tvedte E.S., Sweeney D., Sapojnikov V., Bouk N., Joukov V., Mozes E., Strope P.K., Sylla P.M., Wagner L. et al. Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 2024; 25:60.
- 6. The Galaxy Community. The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update. Nucleic Acids Res. 2024; 52:W83–W94.
- 7. Boratyn G.M., Camacho C., Cooper P.S., Coulouris G., Fong A., Ma N., Madden T.L., Matten W.T., McGinnis S.D., Merezhuk Y. et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013; 41:W29–W33.
- 8. Goldfarb T., Kodali V.K., Pujar S., Brover V., Robbertse B., Oh D.H., Astashyn A., Ermolaeva O., Farrell C.M., Haddad D. et al. NCBI RefSeq: reference sequence standards through 25 years of curation and annotation. Nucleic Acids Res. 2024; 10.1093/nar/gkae1038.
- 9. O’Leary N.A., Cox E., Holmes J.B., Anderson W.R., Falk R., Hem V., Tsuchiya M.T.N., Schuler G.D., Zhang X., Torcivia J. et al. Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets. Sci. Data. 2024; 11:732.
- 10. Rangwala S.H., Rudnev D.V., Ananiev V.V., Oh D.H., Asztalos A., Benica B., Borodin E.A., Bouk N., Evgeniev V.I., Kodali V.K. et al. The NCBI Comparative Genome Viewer (CGV) is an interactive visualization tool for the analysis of whole-genome eukaryotic alignments. PLoS Biol. 2024; 22:e3002405.
- 11. Katz K., Shutov O., Lapoint R., Kimelman M., Brister J.R., O’Sullivan C. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 2022; 50:D387–D390.
- 12. Sayers E.W., Beck J., Bolton E.E., Bourexis D., Brister J.R., Canese K., Comeau D.C., Funk K., Kim S., Klimke W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2021; 49:D10–D17.
- 13. Katz K.S., Shutov O., Lapoint R., Kimelman M., Brister J.R., O’Sullivan C. STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions. Genome Biol. 2021; 22:270.
- 14. Ciufo S., Kannan S., Sharma S., Badretdin A., Clark K., Turner S., Brover S., Schoch C.L., Kimchi A., DiCuccio M. Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI. Int. J. Syst. Evol. Microbiol. 2018; 68:2386–2392.
- 15. Kannan S., Sharma S., Ciufo S., Clark K., Turner S., Kitts P.A., Schoch C.L., DiCuccio M., Kimchi A. Collection and curation of prokaryotic genome assemblies from type strains at NCBI. Int. J. Syst. Evol. Microbiol. 2023; 73:005707.
- 16. Wang J., Youkharibache P., Zhang D., Lanczycki C.J., Geer R.C., Madej T., Phan L., Ward M., Lu S., Marchler G.H. et al. iCn3D, a web-based 3D viewer for sharing 1D/2D/3D representations of biomolecular structures. Bioinformatics. 2020; 36:131–135.
- 17. Wang J., Youkharibache P., Marchler-Bauer A., Lanczycki C., Zhang D., Lu S., Madej T., Marchler G.H., Cheng T., Chong L.C. et al. iCn3D: from web-based 3D viewer to structural analysis tool in batch mode. Front. Mol. Biosci. 2022; 9:831740.
- 18. Tawfeeq C., Wang J., Khaniya U., Madej T., Song J., Abrol R., Youkharibache P. A universal residue numbering scheme for the immunoglobulin-fold (Ig-fold) to study Ig-Proteomes and Ig-Interactomes. bioRxiv doi: 10.1101/2024.06.10.598201, 11 June 2024, preprint: not peer reviewed.
- 19. Johnson G., Wu T.T. Kabat database and its applications: 30 years after the first variability plot. Nucleic Acids Res. 2000; 28:214–218.
- 20. Lefranc M.P., Pommie C., Kaas Q., Duprat E., Bosc N., Guiraudou D., Jean C., Ruiz M., Da Piedade I., Rouard M. et al. IMGT unique numbering for immunoglobulin and T cell receptor constant domains and Ig superfamily C-like domains. Dev. Comp. Immunol. 2005; 29:185–203.
- 21. Mistry J., Chuguransky S., Williams L., Qureshi M., Salazar G.A., Sonnhammer E.L.L., Tosatto S.C.E., Paladin L., Raj S., Richardson L.J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021; 49:D412–D419.
- 22. Letunic I., Bork P. 20 years of the SMART protein domain annotation resource. Nucleic Acids Res. 2018; 46:D493–D496.
- 23. Galperin M.Y., Wolf Y.I., Makarova K.S., Vera Alvarez R., Landsman D., Koonin E.V. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 2021; 49:D274–D281.
- 24. Haft D.H., Selengut J.D., Richter R.A., Harkins D., Basu M.K., Beck E. TIGRFAMs and Genome Properties in 2013. Nucleic Acids Res. 2013; 41:D387–D395.
- 25. Klimke W., Agarwala R., Badretdin A., Chetvernin S., Ciufo S., Fedorov B., Kiryutin B., O’Neill K., Resch W., Resenchuk S. et al. The National Center for Biotechnology Information’s Protein Clusters Database. Nucleic Acids Res. 2009; 37:D216–D223.
- 26. Li W., O’Neill K.R., Haft D.H., DiCuccio M., Chetvernin V., Badretdin A., Coulouris G., Chitsaz F., Derbyshire M.K., Durkin A.S. et al. RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res. 2021; 49:D1020–D1028.
- 27. Wang J., Chitsaz F., Derbyshire M.K., Gonzales N.R., Gwadz M., Lu S., Marchler G.H., Song J.S., Thanki N., Yamashita R.A. et al. The conserved domain database in 2023. Nucleic Acids Res. 2023; 51:D384–D388.
- 28. Sayers E.W., Bolton E.E., Brister J.R., Canese K., Chan J., Comeau D.C., Farrell C.M., Feldgarden M., Fine A.M., Funk K. et al. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Res. 2023; 51:D29–D38.
- 29. Pereira E., Conrad A., Tesfai A., Palacios A., Kandar R., Kearney A., Locas A., Jamieson F., Elliot E., Otto M. et al. Multinational outbreak of Listeria monocytogenes infections linked to Enoki mushrooms imported from the Republic of Korea 2016–2020. J. Food Prot. 2023; 86:100101.
- 30. Brown B., Allard M., Bazaco M.C., Blankenship J., Minor T. An economic evaluation of the whole genome sequencing source tracking program in the U.S. PLoS One. 2021; 16:e0258262.
- 31. Worley J.N., Crothers J.W., Wolfgang W.J., Venkata S.L.G., Hoffmann M., Jayeola V., Klompas M., Allard M., Bry L. Prospective genomic surveillance reveals cryptic MRSA outbreaks with local to international origins among NICU patients. J. Clin. Microbiol. 2023; 61:e0001423.
- 32. Feldgarden M., Brover V., Fedorov B., Haft D.H., Prasad A.B., Klimke W. Curation of the AMRFinderPlus databases: applications, functionality and impact. Microb. Genom. 2022; 8:000832.
- 33. Kim S., Chen J., Cheng T., Gindulyte A., He J., He S., Li Q., Shoemaker B.A., Thiessen P.A., Yu B. et al. PubChem 2023 update. Nucleic Acids Res. 2023; 51:D1373–D1380.
- 34. Kim S. Exploring chemical information in PubChem. Curr. Protoc. 2021; 1:e217.
- 35. Zaslavsky L., Cheng T., Gindulyte A., He S., Kim S., Li Q., Thiessen P., Yu B., Bolton E.E. Discovering and summarizing relationships between chemicals, genes, proteins, and diseases in PubChem. Front. Res. Metr. Anal. 2021; 6:689059.
- 36. Fu G., Batchelor C., Dumontier M., Hastings J., Willighagen E., Bolton E. PubChemRDF: towards the semantic annotation of PubChem compound and substance databases. J. Cheminform. 2015; 7:34.