Nucleic Acids Research. 2024 Oct 29;53(D1):D1711–D1715. doi: 10.1093/nar/gkae967

NCBI Taxonomy: enhanced access via NCBI Datasets

Eric Cox, Mirian T N Tsuchiya, Stacy Ciufo, John Torcivia, Robert Falk, W Ray Anderson, J Bradley Holmes, Vichet Hem, Laurie Breen, Emily Davis, Anne Ketter, Peifen Zhang, Vladimir Soussov, Conrad L Schoch, Nuala A O’Leary
PMCID: PMC11701650  PMID: 39470745

Abstract

The NCBI Taxonomy resource (https://www.ncbi.nlm.nih.gov/taxonomy) has long been a trusted, curated hub for organism names, classifications, and links to related data for all taxonomic nodes. NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) is an improved way to leverage the rich data available at NCBI so users can effectively browse, search, and download information. While taxonomy data has been a cornerstone of NCBI Datasets since its inception, we recently extended the taxonomy information available via NCBI Datasets by updating the existing NCBI Datasets taxonomy page, implementing a new taxonomy name details page, expanding programmatic access to taxonomic information via command-line tools and APIs and improving the way we handle taxonomic queries to connect users to gene and genome data. This paper highlights these improvements and provides examples to help users effectively harness these new features.

Introduction

The NCBI Taxonomy team manages and curates a comprehensive collection of nomenclature, classifications, publications, and type material spanning all domains of life. This curated data is organized into a hierarchical structure in the NCBI Taxonomy resource (1), providing the biological research community with a systematic framework to access detailed species data in the context of their evolutionary relationships. It provides the central classification for members of the International Nucleotide Sequence Database Collaboration (INSDC) (2). This database maintains precise linkage to the extensive sequence data at NCBI and supports ongoing research and advancements in biology, proving invaluable for comparative biology studies and genetic analysis across species (3).

Compared to other taxonomy resources, NCBI Taxonomy is noteworthy in its broad taxonomic scope. NCBI Taxonomy curators verify and validate taxonomic information on a daily basis by relying on authoritative sources, including numerous taxonomic and nomenclatural databases, as well as the primary literature (1). The represented diversity is comparable to the much larger Catalogue of Life and associated resources, which aggregates taxonomic information for all domains of life supported by international collaborators with periodic updates (4). This, in turn, also informs NCBI Taxonomy.

As the taxonomic hierarchy and associated data become increasingly complex, there is a growing need for simpler and more accessible ways to access NCBI Taxonomy data and the associated sequence data. Historically, researchers accessed this information through NCBI’s Entrez web pages (e.g. https://www.ncbi.nlm.nih.gov/taxonomy) or by downloading and parsing extensive datasets via FTP—methods that are becoming less practical as the volume and complexity of the data grow. In response, we developed new and expanded features in NCBI Datasets to provide options to download selected nodes or lineages in formats that are more usable for both web users and those using programmatic applications (5).

NCBI Datasets was created to make data easily discoverable and usable, especially to enhance comparative genomics research and improve human health. Following FAIR (Findable, Accessible, Interoperable and Reusable) principles (6), it provides straightforward access to a wide range of biological sequences, annotations, and metadata through user-friendly web interfaces, command-line tools, and well-documented APIs. Data is provided in compressed zip archives containing related data and metadata files to streamline data retrieval, sharing, and use.

NCBI Taxonomy data is foundational to NCBI Datasets, which is designed to be organism-focused. All taxonomy data is now available for exploration and download through both web and programmatic interfaces provided by NCBI Datasets. The web pages not only deliver general taxonomic information but also serve as portals to all sequence data for specific taxa, including reference genes and genome collections and links to other NCBI resources. The NCBI Datasets command-line tools facilitate programmatic access to taxonomic data, allowing complex queries for taxonomic identifiers and data retrieval in tabular, JSON, and JSON lines formats, essential for integrating taxonomy data into research workflows. This paper describes the new interfaces, how to use them to access taxonomy data, recent updates to the taxonomy data, and plans for future improvements.

NCBI Datasets taxonomy web interfaces

NCBI Datasets taxonomy pages address the user's need to search and browse data from an organism-centered perspective using taxonomic terms. These pages display general and detailed taxonomic information and serve as a portal to related gene and genome data. There are three pages providing access to taxonomy information (Figure 1): the main taxonomy page, the taxonomy name details page and the taxonomy browser page. In this section, we will cover the names and the main pages; the taxonomy browser will be discussed under Future Directions.

Figure 1. Screenshots of NCBI Datasets taxonomy pages and features. (A) Search box from the NCBI Datasets homepage showing the autosuggest feature for the taxon Candida tropicalis. (B) Main taxonomy page, where users can find taxonomy information, images, and links to genomic resources and other databases. (C) Detail of the top of the main taxonomy page. The red arrow points to the ‘type material’ label, which indicates that a genome assembly derived from type material is available for that species. (D) Detail of the bottom of the main taxonomy page showing links to other databases. The red arrow points to the type material section. (E) Names page for Candida tropicalis, showing synonyms, type material information, current scientific name, basionym, and name authority. (F) Group name, which corresponds to the NCBI BLAST name in the legacy interface. (G) Detail showing the expanded lineage for Candida tropicalis. (H) Taxonomy Browser showing hierarchical taxonomy information for C. tropicalis, with both parent and child nodes, and the count of assembled genomes available for each taxon.

Main taxonomy page

When users search for an organism name at any taxonomic rank from the NCBI Datasets homepage (https://www.ncbi.nlm.nih.gov/datasets/), the autocomplete feature (Figure 1A) helps them navigate to the main taxonomy page (Figure 1B). Starting from the top of the page (Figure 1C) and moving down, users can find taxonomic information (including NCBI Taxonomy ID or taxid, rank and current scientific name), links to the name details page, the taxonomy browser and genomic resources, including a download button for the reference genome (at species level), and additional database links at the bottom of the page (Figure 1D). New features on the main taxonomy page include information about type material, nomenclature authority approval, taxonomic lineage, and images. Specifically for viruses, if a scientific name has been approved by the virus nomenclature authority (International Committee on Taxonomy of Viruses, ICTV) (7), then we include a label below the current scientific name that reads ‘ICTV ACCEPTED.’

Type material

Sequences from type material represent high-value references with unambiguous associations with taxonomic names (8). On the main taxonomy page, there are now two ways to determine if type material is available for a particular species. When an assembled genome from type material is available for a species, we now show a blue label that reads ‘Type Material’ under the current scientific name, near the top of the page (Figure 1C, red arrow). At the bottom of the page, we have added new database links that show counts for genome, sequence and BioSample records derived from type material (Figure 1D, red arrow). Clicking on a count allows users to navigate to the relevant records.

Taxonomic lineage

We have also improved the illustration and navigation of taxonomic lineages on both the main taxonomy page and the names page. By default, we only show taxonomic names for the standard ranks: superkingdom, kingdom, phylum, class, order, family, genus and species (Figure 1B, blue box on the right). Note that the rank of superkingdom will be replaced by the more commonly used ‘domain’ by the end of 2024 (NCBI Insights, Upcoming Changes to NCBI Taxonomy Classifications, available at https://ncbiinsights.ncbi.nlm.nih.gov/2024/06/04/changes-ncbi-taxonomy-classifications/). To view all ranks, click ‘View full lineage’ to see a compact representation of the taxonomic names at every rank (Figure 1G). For example, for the species Candida tropicalis, there are five taxonomic nodes below the superkingdom Eukaryota and above the genus Candida that are not shown in the default lineage display. The full lineage reveals these additional nodes, which include formal and informal group names: the clade Opisthokonta, the subkingdom Dikarya, the clade Saccharomyceta, the subphylum Saccharomycotina and the clade Candida/Lodderomyces. Each taxonomic name links to the taxonomy page for that taxon, providing a convenient way to navigate to each node.
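
The difference between the default and full lineage displays can be pictured as a simple filter on rank. The following is a minimal Python sketch, not NCBI's implementation: the lineage is typed in by hand from the names mentioned above plus the well-known kingdom Fungi and phylum Ascomycota, the class and order nodes are deliberately left out, and the function names are hypothetical.

```python
# Ranks shown in the default lineage display, as listed in the text.
STANDARD_RANKS = {
    "superkingdom", "kingdom", "phylum", "class",
    "order", "family", "genus", "species",
}

# Abbreviated, hand-typed lineage for Candida tropicalis (illustrative only;
# the class and order nodes are omitted from this toy list).
FULL_LINEAGE = [
    ("superkingdom", "Eukaryota"),
    ("clade", "Opisthokonta"),
    ("kingdom", "Fungi"),
    ("subkingdom", "Dikarya"),
    ("phylum", "Ascomycota"),
    ("clade", "Saccharomyceta"),
    ("subphylum", "Saccharomycotina"),
    ("family", "Debaryomycetaceae"),
    ("clade", "Candida/Lodderomyces"),
    ("genus", "Candida"),
    ("species", "Candida tropicalis"),
]


def default_lineage(lineage):
    """Keep only the nodes whose rank is one of the standard ranks."""
    return [(rank, name) for rank, name in lineage if rank in STANDARD_RANKS]


def hidden_nodes(lineage):
    """Return the nodes that appear only in the full lineage view."""
    return [(rank, name) for rank, name in lineage if rank not in STANDARD_RANKS]


if __name__ == "__main__":
    print("Default view:")
    for rank, name in default_lineage(FULL_LINEAGE):
        print(f"  {rank}: {name}")
    print("Shown only in the full lineage:")
    for rank, name in hidden_nodes(FULL_LINEAGE):
        print(f"  {rank}: {name}")
```

In this toy lineage the hidden nodes are exactly the five intermediate groups named in the paragraph above.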

Images

While taxonomic images were previously introduced to the main taxonomy page to enhance understanding of the taxon shown, these images are now selected using a new curation process. The main requirements for inclusion on the page are that the image be released under either a public domain or Creative Commons license and that its source and attributions be public. The original source, attribution and licensing are displayed together with the image on the taxonomy page. Information about these images is also available from the ‘image.dmp’ file within the FTP taxdump files at https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/. More details on requirements will be available on the NCBI Datasets documentation pages.

Taxonomy names page

We recently implemented a new taxonomy name details page providing detailed information about nomenclature, type material, publications, and nomenclature authorities (Figure 1E). Tracking changes to taxonomic nodes can be challenging due to their complex history, which often includes name changes, merges, and the use of common names in scientific literature. To address this need, we developed a dedicated taxonomy name details page offering comprehensive information on nomenclature and type material and providing biologists and researchers with the resources to navigate the complexities of taxonomic identifiers.

Nomenclature

Compared to the previous taxonomy web interface, the display of nomenclature has been simplified, resulting in a page that is easier to read and understand. Nomenclature information used by NCBI Taxonomy has been consolidated into a smaller number of groups, and some NCBI field names have been renamed to use more intuitive language. For example, the ‘GenBank common name’ in the old web interface is now displayed first in the list of common names on the new name details page. The ‘GenBank synonym’, selected from existing synonyms for more prominent display, has been renamed to ‘curator synonym’. The ‘NCBI BLAST name’ in the old web interface has been renamed to ‘group name’ on the new page (Figure 1F).

Data package download

The new names page also provides direct access to taxonomy data downloads to facilitate comparative biology research (Figure 1E, yellow arrow). Previously, this information was only available through a time-consuming and difficult process of downloading and parsing data from FTP. Now, users can download either a table of basic taxonomic and lineage information or a taxonomy data package with more detailed taxonomic information. As an example, consider the family Debaryomycetaceae, which includes yeast species isolated from diverse environmental and biological sources (9). From the name details page, users can download data for Debaryomycetaceae only; for Debaryomycetaceae and all parent nodes, including the family, order, class, kingdom and superkingdom nodes; or for Debaryomycetaceae and all child nodes, which include more than 900 species. The taxonomy_summary.tsv file provides nomenclature and lineage information for each species (Table 1).

Table 1. An excerpt of the taxonomy_summary.tsv file, showing a reduced number of columns for clarity

Family name       | Family taxid | Genus name   | Genus taxid | Species name              | Species taxid
Debaryomycetaceae | 766764       |              |             | [Candida] argentea        | 1016838
Debaryomycetaceae | 766764       | Lodderomyces | 36913       | Lodderomyces sp. Y-1      | 1027459
Debaryomycetaceae | 766764       | Diutina      | 1910789     | Diutina pseudorugosa      | 1035112
Debaryomycetaceae | 766764       | Debaryomyces | 4958        | Debaryomyces sp. DVM      | 1036186
Debaryomycetaceae | 766764       | Yamadazyma   | 766765      | [Candida] khao-thaluensis | 1041610
Debaryomycetaceae | 766764       | Yamadazyma   | 766765      | [Candida] tallmaniae      | 1041611
Debaryomycetaceae | 766764       | Yamadazyma   | 766765      | [Candida] vaughaniae      | 1041612
Debaryomycetaceae | 766764       | Yamadazyma   | 766765      | Yamadazyma akitaensis     | 1041613
Debaryomycetaceae | 766764       | Debaryomyces | 4958        | Debaryomyces sp. CFH1.A   | 1049477
Debaryomycetaceae | 766764       | Spathaspora  | 412764      | Spathaspora sp. HMD1.1    | 1054078

The 10 species listed in the table represent a range of names with different properties, all currently categorized in the family Debaryomycetaceae: formal binomial names (e.g. Diutina pseudorugosa), unspecified names (e.g. Lodderomyces sp. Y-1) and formal names that could not be classified in the correct genus (e.g. [Candida] argentea).
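
The three kinds of names can be told apart with a few string checks once the table is loaded. The following Python sketch is illustrative only: it rebuilds four rows of Table 1 as tab-separated text rather than reading a real download, it assumes the column headers match the reduced headers of Table 1 (the actual file may name them differently), and the classification rules and function names are hypothetical heuristics.

```python
import csv
import io
from collections import Counter

# Column headers and rows copied from the Table 1 excerpt. The headers in a
# real taxonomy_summary.tsv may differ; treat these as assumptions.
HEADER = ["Family name", "Family taxid", "Genus name", "Genus taxid",
          "Species name", "Species taxid"]
ROWS = [
    ["Debaryomycetaceae", "766764", "", "", "[Candida] argentea", "1016838"],
    ["Debaryomycetaceae", "766764", "Lodderomyces", "36913", "Lodderomyces sp. Y-1", "1027459"],
    ["Debaryomycetaceae", "766764", "Diutina", "1910789", "Diutina pseudorugosa", "1035112"],
    ["Debaryomycetaceae", "766764", "Yamadazyma", "766765", "Yamadazyma akitaensis", "1041613"],
]
TSV_TEXT = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)


def classify_name(species_name):
    """Heuristic classification of a species name (illustrative rules only)."""
    words = species_name.split()
    if species_name.startswith("["):
        # Bracketed genus: the species is not classified in that genus.
        return "genus placement unresolved"
    if len(words) > 1 and words[1] == "sp.":
        return "unspecified"
    return "formal binomial"


def summarize(tsv_text):
    """Count how many species names fall into each category."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(classify_name(row["Species name"]) for row in reader)


if __name__ == "__main__":
    for category, count in summarize(TSV_TEXT).items():
        print(f"{category}: {count}")
```

To apply the same idea to a downloaded file, the text passed to `summarize` would come from the file itself, after checking the actual column names in its header line.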

This page represents a significant improvement compared to the previous NCBI Taxonomy web interface, and we will continue improving this initial implementation through iterative, user-driven development.

Command-line tools

To make it easier to use taxonomy data in bioinformatics workflows, we’ve improved how this data can be accessed through our command-line tools (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/). The improvements focus on two main areas: better support for retrieving data by taxonomic name and new options to download or view taxonomic data.

Since its launch, NCBI Datasets has enabled the programmatic retrieval of data by taxonomic name. Challenges arise from duplicate nomenclature, where the intended taxon is ambiguous. For instance, the name Drosophila might refer to a genus (taxid: 7215) in the fly family Drosophilidae or a genus (taxid: 2081351) in the mushroom family Psathyrellaceae. NCBI Datasets addresses ambiguous query names such as Drosophila by issuing a warning and offering alternative suggestions, including the scientific name, common name, taxid and rank. This allows users to make an informed choice. Ultimately, we encourage the use of taxids for data retrieval because a taxid identifies the organism(s) of interest unambiguously, so researchers get precise and reliable data.
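
The reason taxids are safer than names can be shown with a toy lookup. The following Python sketch is not how NCBI Datasets resolves queries; it uses a two-entry in-memory index built from the Drosophila example above, and the class, exception and function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxon:
    taxid: int
    scientific_name: str
    rank: str
    family: str


# Toy index holding the two genera named Drosophila mentioned in the text.
TOY_INDEX = [
    Taxon(7215, "Drosophila", "genus", "Drosophilidae"),
    Taxon(2081351, "Drosophila", "genus", "Psathyrellaceae"),
]


class AmbiguousNameError(LookupError):
    """Raised when a name matches more than one taxon."""


def resolve(query, index=TOY_INDEX):
    """Resolve a taxid or scientific name to exactly one taxon."""
    text = str(query).strip()
    if text.isdigit():
        matches = [t for t in index if t.taxid == int(text)]
    else:
        matches = [t for t in index if t.scientific_name.lower() == text.lower()]
    if not matches:
        raise KeyError(f"no taxon found for {query!r}")
    if len(matches) > 1:
        # Mirror the idea of a warning with suggestions: list every candidate.
        options = "; ".join(
            f"{t.scientific_name} (taxid {t.taxid}, {t.rank}, family {t.family})"
            for t in matches
        )
        raise AmbiguousNameError(f"{query!r} is ambiguous. Candidates: {options}")
    return matches[0]


if __name__ == "__main__":
    print(resolve(7215))
    try:
        resolve("Drosophila")
    except AmbiguousNameError as err:
        print(err)
```

A name query can fail with several candidates, while each taxid query returns a single taxon, which is the behavior that makes taxids the more reliable identifier in scripts.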

The datasets command-line tool features two major top-level commands: ‘download’ and ‘summary’, which can each be followed by the new ‘taxonomy’ subcommand to query for taxonomic data (Figure 2). The download command returns an NCBI Datasets Taxonomy data package as a zip archive, while the summary command prints taxonomy data to the terminal screen. To construct a complete datasets command to request taxonomic data (Figure 2, red boxes), type ‘datasets’, then either the command ‘download’ or ‘summary’, followed by the subcommand ‘taxonomy’, then ‘taxon’, which specifies the type of identifier that will be used, and finally the taxonomic identifier itself, which can be either a taxonomic name or a taxid. For example, ‘datasets summary taxonomy taxon 'homo sapiens'’ prints taxonomy data describing Homo sapiens to the terminal screen. To get more information about how to obtain taxonomy data for a particular taxon, the command ‘datasets summary taxonomy taxon --help’ returns a description of the command, some examples and a list of available flags, including options to return all child or parent nodes, or all taxa of a particular rank.
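
The command structure described above can be assembled programmatically. The following Python sketch only builds and prints the command line; it does not run the datasets tool, and the helper name `taxonomy_command` is hypothetical. It uses only the words given in the text (‘datasets’, ‘download’ or ‘summary’, ‘taxonomy’, ‘taxon’, and an identifier).

```python
import shlex


def taxonomy_command(action, taxon):
    """Build the argument list for a datasets taxonomy request.

    action: "download" or "summary" (the two top-level commands in the text).
    taxon:  a taxonomic name or a taxid.
    """
    if action not in ("download", "summary"):
        raise ValueError("action must be 'download' or 'summary'")
    return ["datasets", action, "taxonomy", "taxon", str(taxon)]


if __name__ == "__main__":
    # shlex.join quotes multi-word names so the line can be pasted into a shell.
    print(shlex.join(taxonomy_command("summary", "homo sapiens")))
    print(shlex.join(taxonomy_command("download", 7215)))
```

Keeping the command as a list of arguments, rather than a single string, avoids quoting mistakes when a taxonomic name contains spaces.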

Figure 2. Schematic image showing the command hierarchy and syntax of the datasets command-line tool. The newly implemented taxonomy commands are highlighted in red. The existing endpoints that benefit from better handling of ambiguous names are shown in blue.

In its initial implementation, the datasets companion tool dataformat can be used to print the tabular file to the screen (taxonomy_summary.tsv, shown in Table 1). For example, ‘datasets summary taxonomy taxon debaryomycetaceae --as-json-lines | dataformat tsv taxonomy --template tax_summary’ will print the taxonomy_summary.tsv file describing the family Debaryomycetaceae to the terminal window. In a future release, we plan to support generation of custom tables using any field from either the taxonomy or names report files.
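
The same two-step pipeline can be driven from a script. The following Python sketch assumes that the datasets and dataformat executables are installed and on the PATH, and it returns None otherwise. The flags and the template name are copied from the example in the text and may differ between releases, so it is worth checking them against the ‘--help’ output of the installed version. The function names are hypothetical.

```python
import shutil
import subprocess


def summary_command(taxon):
    """First stage: print taxonomy data for one taxon as JSON Lines."""
    return ["datasets", "summary", "taxonomy", "taxon", str(taxon), "--as-json-lines"]


def format_command():
    """Second stage: convert JSON Lines to a table (template name as in the text)."""
    return ["dataformat", "tsv", "taxonomy", "--template", "tax_summary"]


def taxonomy_summary_tsv(taxon):
    """Run both stages as a pipe and return the table text, or None if the
    command-line tools are not installed."""
    if shutil.which("datasets") is None or shutil.which("dataformat") is None:
        return None
    producer = subprocess.Popen(summary_command(taxon), stdout=subprocess.PIPE)
    try:
        result = subprocess.run(
            format_command(),
            stdin=producer.stdout,  # connect the two stages like a shell pipe
            capture_output=True,
            text=True,
            check=True,
        )
    finally:
        producer.stdout.close()
        producer.wait()
    return result.stdout


if __name__ == "__main__":
    table = taxonomy_summary_tsv("debaryomycetaceae")
    print(table if table is not None else "datasets/dataformat not found on PATH")
```

Running the two stages as separate processes mirrors the shell pipe in the example, without passing the taxon name through a shell.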

Command examples are well-documented in multiple places, including NCBI Datasets How-to guides (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/taxonomy/) and command-line reference (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/summary/taxonomy/datasets_summary_taxonomy_taxon/). Metadata report schemas are also available, with field names and example values. While taxonomy data will continue to be available by FTP for the foreseeable future, NCBI Datasets command-line tools offer a more convenient and powerful option for most users.

Future directions

We will continue to improve and modernize the accessibility and functionality of taxonomy data at NCBI through curation and improvements to NCBI Datasets. For example, we plan to enrich the taxonomic hierarchy by integrating names that lack sequence data to fill existing gaps. A commitment to transparency is central to our approach; we aim to provide users with the ability to view detailed change logs that document name changes with specific dates and modifications on a taxon-by-taxon basis, moving away from bulk updates and enabling faster access to the most up-to-date taxonomic data.

We are also improving our tools to facilitate the integration of taxonomy data with other data reports produced by NCBI Datasets. This integration is aimed at enhancing interoperability, allowing for easy connections between distinct data types. Our next steps include a significant focus on refining the taxonomy browser. Initially designed to browse for assembled genomes at NCBI, we plan to expand its capabilities, enhancing the hierarchical view, improving search, adding download options, and expanding connections to other types of NCBI data.

As we develop, we are guided by U.S. government standards, continually striving to make our pages Section 508 compliant (10). This commitment ensures we meet rigorous access standards, providing an inclusive user experience. Our customers who consume genomic information from NCBI expect their digital interactions with the government, including NCBI websites, to be on par with their favorite consumer websites and mobile apps. The NCBI Datasets resource aims to deliver a user experience that is as good as or better than other popular genomics web resources. To achieve this goal, we use an iterative and user-centered development approach to ensure our resources align with the needs and expectations of the community.

Acknowledgements

The NCBI Datasets and Taxonomy teams would like to express their gratitude to the many people at NCBI who helped build and review the work that went into this paper. Additionally, we are deeply appreciative of the NCBI customer community, whose participation in user interviews and feedback has been essential in shaping and enhancing our development.

Author contributions: Eric Cox: Investigation, Writing – original draft, Visualization. Mirian T.N. Tsuchiya: Investigation, Writing – review & editing, Visualization. Stacy Ciufo: Investigation, Writing – original draft. John Torcivia: Software. Robert Falk: Software. W. Ray Anderson: Software. J. Bradley Holmes: Software, Supervision. Vichet Hem: Software. Laurie Breen: Investigation, Software. Emily Davis: Investigation, Project administration. Anne Ketter: Writing – review & editing. Peifen Zhang: Investigation, Project administration. Vladimir Soussov: Software. Conrad L. Schoch: Conceptualization, Investigation, Writing – original draft, Supervision. Nuala A. O’Leary: Conceptualization, Investigation, Writing – original draft, Supervision.

Contributor Information

Eric Cox, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Mirian T N Tsuchiya, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Stacy Ciufo, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

John Torcivia, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Robert Falk, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

W Ray Anderson, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

J Bradley Holmes, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Vichet Hem, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Laurie Breen, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Emily Davis, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Anne Ketter, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Peifen Zhang, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Vladimir Soussov, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Conrad L Schoch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Nuala A O’Leary, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Data availability

The data underlying this article are from the NCBI Taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy), which is best accessed using NCBI Datasets at https://www.ncbi.nlm.nih.gov/datasets/.

Funding

National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM); National Institutes of Health (NIH). Funding for open access charge: National Center for Biotechnology Information of the National Library of Medicine, National Institutes of Health.

Conflict of interest statement. None declared.

References

1. Schoch C.L., Ciufo S., Domrachev M., Hotton C.L., Kannan S., Khovanskaya R., Leipe D., McVeigh R., O’Neill K., Robbertse B. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020; 2020:baaa062.
2. Arita M., Karsch-Mizrachi I., Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res. 2021; 49:D121–D124.
3. Bornstein K., Gryan G., Chang E.S., Marchler-Bauer A., Schneider V.A. The NIH Comparative Genomics Resource: addressing the promises and challenges of comparative genomics on human health. BMC Genomics. 2023; 24:575.
4. Hobern D., Barik S.K., Christidis L., Garnett S.T., Kirk P., Orrell T.M., Pape T., Pyle R.L., Thiele K.R., Zachos F.E. et al. Towards a global list of accepted species VI: the Catalogue of Life checklist. Organ. Divers. Evol. 2021; 21:677–690.
5. O’Leary N.A., Cox E., Holmes J.B., Anderson W.R., Falk R., Hem V., Tsuchiya M.T.N., Schuler G.D., Zhang X., Torcivia J. et al. Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets. Sci. Data. 2024; 11:732.
6. Wilkinson M.D., Dumontier M., Aalbersberg I.J., Appleton G., Axton M., Baak A., Blomberg N., Boiten J.W., da Silva Santos L.B., Bourne P.E. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016; 3:160018.
7. Lefkowitz E.J., Dempsey D.M., Hendrickson R.C., Orton R.J., Siddell S.G., Smith D.B. Virus taxonomy: the database of the International Committee on Taxonomy of Viruses (ICTV). Nucleic Acids Res. 2017; 46:D708–D717.
8. Renner S.S., Scherz M.D., Schoch C.L., Gottschling M., Vences M. Improving the gold standard in NCBI GenBank and related databases: DNA sequences from type specimens and type strains. Syst. Biol. 2024; 73:486–494.
9. Kurtzman C.P., Suzuki M. Phylogenetic analysis of ascomycete yeasts that form coenzyme Q-9 and the proposal of the new genera Babjeviella, Meyerozyma, Millerozyma, Priceomyces, and Scheffersomyces. Mycoscience. 2010; 51:2–14.
10. United States Congress. Public Law No. 115-336: 21st Century Integrated Digital Experience Act. 2018.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
